
Economics Teaching Materials
Lecture Notes in Linear Algebra
Chris D. Orme
September 2009
Economics
School of Social Sciences, The University of Manchester, Manchester M13 9PL

Lecture Notes in Linear Algebra
Chris D. Orme
Economics, School of Social Sciences, University of Manchester
chris.orme@manchester.ac.uk
September 2009
© The University of Manchester
These notes are published with the permission of the University of Manchester and can be downloaded by any individual for their own private study. Permission must be obtained to copy and distribute for teaching purposes, outside of The University of Manchester.


Contents
Preface

1 Vectors
  1.1 Introductory remarks, notation, definitions and concepts
  1.2 Manipulating vectors
    1.2.1 Scalar multiplication of vectors
    1.2.2 Addition of vectors
    1.2.3 Linear combinations of vectors
    1.2.4 Scalar (dot or inner) product
  1.3 Angle & orthogonality
    1.3.1 Orthogonality
  1.4 Exercises

2 Vector Spaces
  2.1 Introductory remarks
  2.2 Spaces and subspaces
  2.3 Linear dependence & independence
  2.4 Some results
  2.5 Exercises

3 Matrices & Rank
  3.1 Introductory remarks
    3.1.1 Some special matrices
  3.2 Basic matrix operations & definitions
  3.3 Multivariate random variables
  3.4 Matrix multiplication
    3.4.1 Some useful results
  3.5 Linear independence and rank
    3.5.1 Some results concerning rank
  3.6 Exercises

4 Determinants and the Inverse Matrix
  4.1 Introductory remarks
  4.2 Determinants
    4.2.1 Properties of determinants
    4.2.2 Expansion by Cofactors
  4.3 The inverse matrix
  4.4 Some further results concerning rank, determinants and inverses
  4.5 Exercises

5 Linear Transformations and Simultaneous Equations
  5.1 Introductory remarks
  5.2 Non-singular transformations
    5.2.1 Special cases: elementary transformations
  5.3 Simultaneous Linear Equations
    5.3.1 Existence and Uniqueness
    5.3.2 Homogeneous equations
    5.3.3 Some Examples
  5.4 Exercises

6 The Characteristic Value Problem
  6.1 Introductory remarks
  6.2 The characteristic polynomial and equation
  6.3 Symmetric Matrices
  6.4 The Diagonalisation of a Symmetric Matrix
  6.5 Exercises

7 Quadratic Forms
  7.1 Introductory remarks
  7.2 Definite and indefinite quadratic forms
  7.3 Properties of quadratic forms
  7.4 Projection matrices
    7.4.1 Orthogonal projections and least squares
  7.5 Applications in statistics
    7.5.1 Linear combinations of normal random variables
    7.5.2 Independent normal random vectors
    7.5.3 The distribution of some quadratic forms in normal random variables
  7.6 Exercises

8 Matrix Differential Calculus and Optimisation
  8.1 Introductory remarks and notation
  8.2 The Gradient vector and Jacobian matrix
    8.2.1 Real-valued function of a vector
    8.2.2 Vector function of a vector
  8.3 Using differentials to identify the derivative
    8.3.1 Identifying the derivative
  8.4 Unconstrained optimisation
    8.4.1 Global and local optima
    8.4.2 Stationary Points
    8.4.3 Hessian matrix
    8.4.4 Characterising stationary points
  8.5 Concavity and Convexity
    8.5.1 Concavity conditions
    8.5.2 Convexity conditions
  8.6 First and second order mean value expansions
    8.6.1 Real-valued functions I
    8.6.2 Real-valued functions II
    8.6.3 Vector functions
  8.7 Exercises

A Some useful results from statistical theory

B A swift revision of matrix algebra

C The classical linear regression model
  C.1 Ordinary Least Squares (OLS) Estimation


Preface
These lecture notes present a number of topics in linear algebra that, I believe, are necessary for the understanding, and subsequent analysis, of various problems in statistics/econometrics. Chapters 1-7 were originally prepared as a one term sequence of 18 lectures and 9 problem classes for specialist second year undergraduates, studying at the University of York over the period 1990-1995. These students had a good statistical background, from their first year of study, including some exposure to simple bivariate regression, probability and hypothesis testing. Concurrently, in their second year, they took a course in statistical theory (probability, distribution theory and inference) and a course in univariate and bivariate calculus. They were about to embark (in their third year) on an advanced econometric theory course. In recent years these notes (supplemented by a little material on matrix differential calculus) have been recommended to postgraduate (taught Masters) students at the University of Manchester as reference material underpinning a one semester course in econometrics. In the current edition, I have substantially extended the material on multivariate optimisation and the result is Chapter 8.
It is, therefore, intended that these lecture notes should provide an appropriate grounding in linear algebra for those wishing to study econometric theory. On reflection, the undergraduates at York were fortunate since, nowadays, most econometrics courses do without such preliminary training in linear algebra. Whilst most undergraduate (or graduate) texts in econometrics provide "appendices", or preliminary chapters, itemising important results, they do not necessarily offer much in the way of a deeper investigation of the technology which delivers these results. In my view this is unsatisfactory, from both the students' and lecturer's perspectives. Of course, putting on another course and requiring students to invest in yet another text can be viewed as being costly (in the short run); even at Manchester! Hence, and in the absence of a self-contained lecture course in linear algebra that would, undoubtedly, make the students better prepared, these notes are made freely available to students studying econometrics at the University of Manchester. Although presented in book form, the content does not yet amount to a text book. The reader should bear in mind that these are, as the title states, lecture notes and the writing style reflects this.
In preparing, and updating, these notes, my aim was to give an intuitive feel for the subject at the conceptual level, whilst also adopting a reasonable amount of rigour, given the assumed background of the students. Since they are also written with a follow-on course in econometrics in mind, the notation employed is consistent with that of modern econometrics. However, the language of linear algebra, as developed here, will (I hope) be of use in many other areas of advanced economics courses in that it will enable students to express quite sophisticated concepts succinctly.
Each Chapter concludes with a set of problems listed as exercises. Many of these were tried out on students during my time at York and I thank them for their efforts (and, sometimes, ingenuity) in solving the problems set. I must also thank various colleagues who have helped develop this material, although they may not know it! Especially, Don Poskitt, Les Godfrey, Peter Lambert, Peter Smith and Mel Weeks. I inherited Don Poskitt's lecture notes when teaching the course for the first time, at York, and I am happy to acknowledge his influence which must surely feature heavily in what follows. However, the notes have been redrafted a number of times and Don cannot be held responsible for what follows.
Finally, I am eternally grateful to Rashmi Sarmah, a PhD student at Manchester, for carefully transcribing the original from ChiWriter to ScientificWord.
Chris Orme
Manchester, September 2009

Chapter 1 Vectors
1.1 Introductory remarks, notation, definitions and concepts
We use vectors as a compact notation to represent concepts/phenomena in applied mathematics/social sciences that, in some sense, cannot be represented by a single number. For example, the n prices, (p_1, ..., p_n), and quantity demands, (q_1, ..., q_n), can be represented in the following way:
\[
p = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix}, \qquad
q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix},
\]
where p_i denotes the price of good i, and q_i the demand for good i; p_i and q_i are referred to as the elements of the vectors p and q, respectively, and we often denote this by writing p = {p_i}, for example, where p_i is often referred to as a typical element (or co-ordinate) of p.
In the above, both p and q would be referred to as (n × 1) column vectors, or simply (n × 1) vectors, by which we imply that a vector is a column of numbers. In most texts it is usual to denote a vector in lower case bold face type (and I shall try to keep this convention in these notes wherever possible). Thus, when we see a, b, x or u we imagine a column of numbers.
In some cases it is convenient to think of a row of numbers, rather than a column, in which case we consider the transpose of a vector. In econometrics it is usual to define the transpose of the vector x as x' (although many authors employ x^T, instead of x', in order to avoid confusion with differentiation). Thus for the vector of prices, p, we have p' = [p_1, ..., p_n]. Note also I employ square brackets, [...], or "round" brackets, (...), to enclose the elements of a vector.
Having introduced the idea of a transpose we now define how we can succinctly express total expenditure, m, on n goods:
\[
m = \sum_{i=1}^{n} p_i q_i \equiv p'q,
\]
the operation p'q being known as the scalar or inner or dot product. This is because the result of p'q is a scalar, m. We shall say more about the scalar product operation shortly.
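As a quick numerical check of this idea, the following sketch (a Python/NumPy illustration added to these notes, with made-up prices and quantities) computes total expenditure element-by-element and as the inner product p'q:

```python
import numpy as np

# Hypothetical prices and quantities for n = 4 goods (illustrative numbers only).
p = np.array([2.0, 1.5, 4.0, 0.5])
q = np.array([3.0, 10.0, 2.0, 8.0])

# Total expenditure computed element-by-element: m = sum_i p_i * q_i.
m_elementwise = sum(p_i * q_i for p_i, q_i in zip(p, q))

# The same quantity as the scalar (inner, dot) product p'q.
m_inner = p @ q          # equivalently np.dot(p, q)

print(m_elementwise, m_inner)   # both give 33.0
```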
Historically, vectors arose naturally in mechanics as being two numbers (magnitude and direction) defining a force, and this is depicted by a directed line segment, an arrow, indicating magnitude and direction. It is, however, convenient to have all vectors 'starting' at the same point. In this sense, a vector is often thought of as a directed line segment emanating from the origin of a rectangular co-ordinate system. The simple case of such a (2 × 1) vector, x, is illustrated in Figure 1.1. The vector emanates from the origin, O, at an angle of θ to the horizontal co-ordinate axis and travels to the tip of the arrow head.
[Figure 1.1: Directed Line Segment: Vector]
In three dimensions we have, for example, x' = (x_1, x_2, x_3), where the three numbers x_1, x_2 and x_3 are the co-ordinates of a point (called the elements of the vector x), which together with the origin enables us to draw and define the vector, although from now on I shall "terminate" the vector with a dot (and not an arrow head) since all vectors emanate from the origin, so that there is no ambiguity about direction; see Figure 1.2.
[Figure 1.2: A vector in 3 dimensions]
Notice that the same vector is equivalently defined by the three quantities θ, φ and OX, where θ and φ are angles giving direction (the first with respect to the x_1 axis and the second giving elevation with respect to the x_1-x_2 plane) and OX is the distance (from the origin) giving magnitude. Such diagrams get more difficult to draw as the number of elements in a vector increases.
1.2 Manipulating vectors
We have already introduced the transpose of an arbitrary (n × 1) vector x as x' = [x_1, ..., x_n]. This can be regarded simply as a notational device, as x and x' represent essentially the same information. We also write x = {x_i}, meaning "let the vector x have typical element x_i". The following concepts are important:
1. NULL VECTOR: 0 = {0}; i.e., every element is zero.
2. EQUALITY: x = y iff x_i = y_i for all i; both x and y must be (n × 1).
3. INEQUALITY: x ≥ y iff x_i ≥ y_i for all i; both x and y must be (n × 1).
4. MAGNITUDE or EUCLIDEAN NORM: In the 2-dimensional case we can define the magnitude of a vector quite readily by using 'Pythagoras', and this extends quite naturally to 3 dimensions. For example, if x' = [x_1, x_2], then the length or magnitude of x is called the Euclidean Norm and is defined as:
\[
\|x\| = \sqrt{x_1^2 + x_2^2},
\]
where, by default, square roots are positive. We now simply generalise this for (n × 1) vectors, x' = {x_i} = [x_1, x_2, \dots, x_n]:
\[
\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}.
\]
Note that this also defines the magnitude of a scalar; i.e., a (1 × 1) vector.
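A short numerical check (again a NumPy illustration added to these notes, with an invented vector): the norm obtained from the definition above agrees with the library routine.

```python
import numpy as np

x = np.array([3.0, 4.0, 12.0])                 # an illustrative vector in R^3

norm_from_definition = np.sqrt(np.sum(x**2))   # ||x|| = sqrt(sum_i x_i^2)
norm_from_library = np.linalg.norm(x)          # NumPy's Euclidean norm

print(norm_from_definition, norm_from_library)  # both give 13.0
```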
We can change the magnitude of a vector by scalar multiplication.
1.2.1 Scalar multiplication of vectors
If x = {x_i}, then y = λx = (λx_1, ..., λx_n)' = {λx_i}. In which case,
\[
\|y\| = \|\lambda x\| = \sqrt{\sum_{i=1}^{n} \lambda^2 x_i^2} = \sqrt{\lambda^2 \sum_{i=1}^{n} x_i^2} = |\lambda|\,\|x\| .
\]
Note that if λ > 0 then |λ| = λ and the direction of x remains unchanged;
if λ < 0, then x is rotated through 180° (π radians); see Figure 1.3, where we have the vector x, of length OX, and z = 2x of length OZ (twice that of x).
[Figure 1.3: Scalar multiplication]

1.2.2 Addition of vectors

The addition of vectors can be illustrated using The Parallelogram Law; see Figure 1.4.
[Figure 1.4: The Parallelogram Law for the addition of vectors]
Let x = {x_i} and y = {y_i}, both (n × 1). Then z = x + y is defined and is a (n × 1) vector such that z = {z_i} = {x_i + y_i}, for i = 1, ..., n. The vector z is the directed line segment OZ. (Vectors can only be added together if they have the same number of elements.)
Note:
x + y = y + x
x + y + z = (x + y) + z = x + (y + z).
Subtraction is defined as a combination of scalar multiplication (by −1) and addition. Provided x and y are both (n × 1), then x − y = x + (−1)y and has typical element {x_i − y_i}. The Euclidean Norm, ‖x − y‖, measures the distance between x and y; see Figure 1.5, where x − y is depicted, for two given vectors x and y, and OX (the length of x − y) measures the distance between x and y.
[Figure 1.5: The Distance between 2 vectors: XY]
The concepts of addition and scalar multiplication introduce the useful idea of the resolution of a vector into its components. Let x = {x_i} be a (n × 1) vector, and e_i, i = 1, ..., n, a sequence of n, (n × 1), vectors where e_i has 1 as its ith element and 0 elsewhere. For example, if n = 3, then e_1 = (1, 0, 0)', e_2 = (0, 1, 0)' and e_3 = (0, 0, 1)'. The e_i are called unit vectors (they each have unit length) and one can think of them as 'defining' the axes in the n-dimensional co-ordinate system or n-dimensional Euclidean Space (see Chapter 2 for a further discussion of vector spaces). From the definition of the e_i we have
\[
x = \sum_{i=1}^{n} x_i e_i = x_1 e_1 + x_2 e_2 + \dots + x_n e_n ;
\]
x is said to be a linear combination of the n unit vectors. This is simply illustrated for the n = 2 case, with two vectors x and y, in Figure 1.6. For example, if x = [3, 4]', then x = 3e_1 + 4e_2, by the Parallelogram Law.
[Figure 1.6: Resolution of a vector]
We shall explore the very important concept of linear combinations of vectors in Chapter 2. For the present we simply note the following (general) definition of linear combinations.

1.2.3 Linear combinations of vectors

Let {x_1, ..., x_m} be a set of m, (n × 1), vectors and {λ_1, ..., λ_m} a set of m scalars. Then the following (n × 1) vector, y, is called a linear combination of these m vectors {x_1, ..., x_m}:
\[
y = \sum_{j=1}^{m} \lambda_j x_j = \lambda_1 x_1 + \dots + \lambda_m x_m .
\]

1.2.4 Scalar (dot or inner) product

We have already introduced this concept. In general, let x = {x_i} and y = {y_i} be two (n × 1) vectors. The scalar product between x and y is written x'y and defined as
\[
x'y = \sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} y_i x_i = y'x .
\]
Note: x'y is NOT the same as xy', which is something entirely different. The latter is called an OUTER PRODUCT and will not be discussed until we talk about matrices in Chapter 3. Using the scalar product, we find that x'x = ‖x‖², the squared "length" (or squared Euclidean Norm) of x.

1.3 Angle & orthogonality

The scalar product between two vectors is sometimes defined in terms of the angle between those two vectors. We consider two (2 × 1) vectors, x and y, as depicted in Figure 1.7.
[Figure 1.7: The Angle between Vectors]
In the above diagram, θ is the angle between the two vectors, x and y. The lengths of the vectors are ‖x‖ = OX and ‖y‖ = OY. The dashed line, XA, is perpendicular (that is, orthogonal) to OY. Elementary trigonometry tells us that
\[
\cos\theta = \frac{OA}{OX} = \frac{OA}{\|x\|},
\]
so that ‖x‖ cos θ = OA = OY − AY = ‖y‖ − AY. Thus, squaring, we get
\[
\|x\|^2 \cos^2\theta = (\|y\| - AY)^2 = \|y\|^2 - 2\|y\|\,AY + (AY)^2 . \tag{1.1}
\]
But we can also write
\[
(AY)^2 = (XY)^2 - (AX)^2 = \|y - x\|^2 - \|x\|^2 \sin^2\theta
\]
because ‖y − x‖ = XY and AX = ‖x‖ sin θ. Substituting this into (1.1), and also remembering that AY = ‖y‖ − ‖x‖ cos θ, we get
\[
\|x\|^2 \cos^2\theta = \|y\|^2 - 2\|y\|\,(\|y\| - \|x\|\cos\theta) + \|y - x\|^2 - \|x\|^2 \sin^2\theta .
\]
Since sin²θ + cos²θ = 1, we get (by taking ‖x‖² sin²θ to the left)
\[
\|x\|^2 = -\|y\|^2 + 2\|y\|\,\|x\|\cos\theta + \|y - x\|^2 . \tag{1.2}
\]
Now, y − x = (y_1 − x_1, y_2 − x_2)', so that
\[
\|y - x\|^2 = (y_1 - x_1)^2 + (y_2 - x_2)^2 = y_1^2 + y_2^2 + x_1^2 + x_2^2 - 2y_1x_1 - 2y_2x_2 = \|y\|^2 + \|x\|^2 - 2y'x .
\]
Then substituting this into (1.2) and re-arranging the terms yields
\[
\cos\theta = \frac{y'x}{\|y\|\,\|x\|} = \frac{x'y}{\|x\|\,\|y\|} .
\]
Thus, cos θ = x'y/(‖x‖‖y‖) and this defines θ, the angle between x and y. In higher than three dimensions the angle between two vectors is difficult to imagine, but this expression for cos θ still obtains.
Let x and y be (n × 1) vectors; then
\[
\cos\theta = \frac{x'y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
\]
and note that the Cauchy-Schwartz Inequality (see Exercises at the end of this Chapter) ensures that with this definition of angle, −1 ≤ cos θ ≤ 1, as would be required.

1.3.1 Orthogonality

If n = 2, two vectors are said to be orthogonal (to one another) if they are at right-angles (or perpendicular) to one another. Clearly, in such cases cos(θ) = 0. In general, we define orthogonality in exactly the same way for (n × 1) vectors, x and y, and the definition of cos(θ) above implies that x and y will be orthogonal iff x'y = 0 (since both ‖x‖ > 0 and ‖y‖ > 0).
Note that the unit vectors are mutually orthogonal, so that e_i'e_j = δ_ij, where δ_ij is the Kronecker Delta, defined as δ_ij = 1 if i = j and zero otherwise.
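The angle formula and the orthogonality of the unit vectors are easy to verify numerically; the sketch below is a NumPy illustration added to these notes (the particular vectors are invented for the example).

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

# cos(theta) = x'y / (||x|| ||y||); here theta should be 45 degrees.
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # approximately 45.0

# The unit vectors are mutually orthogonal: e_i'e_j equals the Kronecker delta.
e = np.eye(3)                             # columns are e_1, e_2, e_3
print(e.T @ e)                            # the (3 x 3) identity matrix
```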
1.4 Exercises
1. Given the vectors a' = (2, 1), b' = (1, 3), plot the vectors 2a, a + b and illustrate the use of the parallelogram rule to find a + b and 2a + b.
2. If a' = (3, 2, 1) and b' = (1, 5, 6), solve the following vector equation for x:
3a + 2x = 5b
Illustrate geometrically.
3. Let x and y be any two (n × 1) vectors.
(a) Prove that (x'y)² ≤ ‖x‖²‖y‖².
(b) Thus deduce that |x'y| ≤ ‖x‖‖y‖, the Cauchy-Schwartz Inequality.
(c) Using this result prove the triangle inequality:
‖a − b‖ + ‖b − c‖ ≥ ‖a − c‖
[Hint: write ‖a − c‖² as ‖(a − b) + (b − c)‖² and show that this is equal to ‖a − b‖² + ‖b − c‖² + 2(a − b)'(b − c). Then write x = a − b and y = b − c.]

Verify, by direct computation, that this inequality holds for the special case a' = (3, 7, 1), b' = (9, 1, 4) and c' = (3, 0, 2). Illustrate geometrically.
4. Let (X_i, Y_i) denote data, i = 1, ..., n, and define the (n × 1) vectors
x' = (X_1 − X̄, ..., X_n − X̄),  y' = (Y_1 − Ȳ, ..., Y_n − Ȳ),
where X̄ and Ȳ are the respective sample means.
(a) Give a statistical interpretation to
i. the scalar product between x and y
ii. the scalar product of x with itself; and,
iii. the cosine between x and y.
(b) Suppose the Y_i are independently and identically distributed normal random variables, each having mean μ and variance σ²; i.e., the Y_i are iid N(μ, σ²). What is the exact sampling distribution of y'y/σ²?
(c) If the X_i are also iid, but with common normal distribution, independent of the Y_i, then what is the exact sampling distribution of y'y/x'x, under the assumption that the two variances are equal?
5. Consider the following simple bivariate regression model
Y_i = α + β X_i + u_i,  i = 1, ..., n.
(a) Express the formula for the estimated ordinary least squares slope coefficient, β̂, as the ratio of two inner (or scalar) products.
(b) By showing that
\[
\hat{\beta} = \beta + \frac{x'u}{x'x} = \beta + \sum_{i=1}^{n} c_i u_i ,
\]
where u = (u_1, ..., u_n)', x is defined as in Question 4, and the c_i are (X_i − X̄)/x'x, i = 1, ..., n, prove the following result: "If the disturbance terms in the regression model are iid N(0, σ²), then (assuming that the X_i are prespecified constants) β̂ has a sampling distribution which is normal with mean β and variance σ²/x'x."

Chapter 2 Vector Spaces
2.1 Introductory remarks
The totality of all (n × 1) vectors is called the n-dimensional VECTOR SPACE, also sometimes called the n-dimensional EUCLIDEAN SPACE, and denoted thus:
\[
\mathbb{R}^n = \{ x : x' = (x_1, \dots, x_n), \; x_i \in \mathbb{R}, \; i = 1, \dots, n \} .
\]
In order to be able to define explicitly any one of these vectors, in this space, we only need to think of the n unit vectors e_i, i = 1, ..., n, as introduced in the previous chapter. This is because any x ∈ R^n can be written as a simple linear combination of these unit vectors (recall the discussion pertaining to the resolution of a vector into its component parts):
\[
x = x_1 e_1 + \dots + x_n e_n = \sum_{i=1}^{n} x_i e_i .
\]
Observe that:
(i) any vector x ∈ R^n can be constructed as a linear combination of e_1, ..., e_n;
(ii) the e_1, ..., e_n represent a subset of the totality of all (n × 1) vectors, R^n, but none of the e_i can be expressed as a linear combination of the remaining e_j, j ≠ i.
"(i)" says that {e_1, ..., e_n} SPANS the whole space; and "(ii)" says the set of vectors {e_1, ..., e_n} is LINEARLY INDEPENDENT. "(i)" and "(ii)" together say that {e_1, ..., e_n} forms a BASIS for the n-dimensional vector space.
It is because the basis set contains a maximum of n vectors, from which any other (n × 1) vector can be constructed, that R^n is called n-dimensional.

It is not so-called because all vectors in the space have n elements (or co-ordinates). Also note that a basis set of vectors is NOT unique: any set which is linearly independent and spans can serve as a basis.
To illustrate this last point consider R^3, which has a basis of e_1 = (1, 0, 0)', e_2 = (0, 1, 0)' and e_3 = (0, 0, 1)'. Any vector x ∈ R^3 can be written as x = x_1e_1 + x_2e_2 + x_3e_3. Now suppose that for a particular vector a ∈ R^3, a_3 ≠ 0. This means we can write e_3 = (a − a_1e_1 − a_2e_2)/a_3, which is a linear combination of {e_1, e_2, a}. Thus any vector which can be expressed as a linear combination of {e_1, e_2, e_3} can equivalently be expressed as a linear combination of {e_1, e_2, a}:
\[
x = x_1 e_1 + x_2 e_2 + x_3 e_3
  = x_1 e_1 + x_2 e_2 + x_3 (a - a_1 e_1 - a_2 e_2)/a_3
  = \left( x_1 - \frac{x_3 a_1}{a_3} \right) e_1 + \left( x_2 - \frac{x_3 a_2}{a_3} \right) e_2 + \frac{x_3}{a_3}\, a .
\]
Therefore, the set {e_1, e_2, a} SPANS R^3. The set is also LINEARLY INDEPENDENT since, as a_3 ≠ 0 and both e_1 and e_2 have their third elements equal to zero, it is impossible to express any one of the vectors in {e_1, e_2, a} as a linear combination of the remaining two vectors. Thus {e_1, e_2, a} is also a BASIS for R^3.
2.2 Spaces and subspaces
Definition 1 (Vector Space) A vector space, V, is a non-empty set (or collection) of vectors which is closed under the operations of addition and scalar multiplication. "Closed" means that if a and b ∈ V then λa ∈ V, μb ∈ V and a + b ∈ V, for any scalars λ ∈ R and μ ∈ R.
As noted, R^n is sometimes referred to as the n-dimensional vector space. We immediately have the following result:
Proposition 1 A vector space must contain the null vector 0 = (0, ..., 0)'.
Proof. Left as an exercise.
Definition 2 (Subspace) A subspace, of R^n, is a non-empty subset of R^n which is also a vector space.
Example 1 Consider R^3, as illustrated in Figure 1.2, with a single vector. All vectors lying in the x_1-x_2 plane form a subspace of R^3; this subspace of (3 × 1) vectors is clearly two-dimensional, since all the vectors in this vector (sub)space reside in the 2-dimensional x_1-x_2 plane. Algebraically we only require two linearly independent vectors, e_1 = (1, 0, 0)' and e_2 = (0, 1, 0)', to generate all the (3 × 1) vectors in this subspace. A ray through the origin is a one-dimensional subspace; if the ray is at 45° to all axes then a basis is simply (1, 1, 1)'. Another example is depicted in this FLASH demo (http://media.humanities.manchester.ac.uk/humanities/screencasts/soss/economics/linalg/3D Vector Plane.exe). Here we have two vectors in R^3: one red vector and one blue vector. On these two vectors we have rested a plane (which must pass through the origin). This plane is (of course) two-dimensional, but all points lying on the plane correspond to a vector in R^3. The plane is a two-dimensional subspace of R^3. It is a subspace because it contains the origin of R^3, but not all vectors in R^3 lie on it. It is two-dimensional because only two vectors in the subspace (e.g., the red vector and the blue vector) are required to generate all vectors in the subspace; i.e., any vector in the subspace can (by the parallelogram rule) be expressed as a linear combination of the red and blue vector. The red and blue vector form a basis for this two-dimensional subspace.
Definition 3 (Dimension) The dimension of a vector (sub)space is the number of vectors in its basis.
DeÖnition 4 (Basis) A basis, of a vector space, is a set of vectors which is linearly independent and which spans the whole space.
Although a basis is not unique, the number of vectors contained in a basis for a given subspace is constant and is called the dimension of that subspace; see Proposition 8 below.
At this point we need to discuss in more detail the concepts of linear independence and spanning sets.
2.3 Linear dependence & independence
Linear independence/dependence is a concept that pertains to a SET of vectors (not individual vectors). A SET of m, (n × 1), vectors {x_1, ..., x_m} is said to be linearly independent if none of the vectors in the set can be expressed as a linear combination of the rest. If at least one of the x_j (j = 1, ..., m) can be expressed as a linear combination of the rest then the SET is said to be linearly dependent.
Formally,
Definition 5 (Linear independence/dependence) If there exist scalars λ_j, j = 1, ..., m, such that
\[
\sum_{j=1}^{m} \lambda_j x_j = \lambda_1 x_1 + \dots + \lambda_m x_m = 0, \quad \text{with } \lambda_j \neq 0 \text{ for at least one } j,
\]
then the set is said to be linearly dependent. If the ONLY set of λ_j for which \(\sum_{j=1}^{m} \lambda_j x_j = 0\) is λ_j = 0 for all j, then the set is said to be linearly independent.
Proposition 2 A set of vectors which contains the null vector must be linearly dependent.
Proof. Trivial and left as an exercise.
Definition 6 (Spanning Set) A set of m, (n × 1), vectors {x_1, ..., x_m} is said to span a vector (sub)space if every vector in the space can be written as a linear combination of {x_1, ..., x_m}.
Note that a spanning set need not be linearly independent. For example, (1, 0)', (0, 1)' and (0, 2)' span R^2 but do not constitute a linearly independent set of vectors. This points to the essential difference between a spanning set and a basis: a basis is a set which contains the maximum number of linearly independent vectors in the space. (Strictly speaking, the basis set contains the maximum number of vectors that can exist within any linearly independent set of vectors in the space.) (A spanning set must contain a basis set.)
The idea of linear independence/dependence and spanning sets is crucially bound up with the problem of finding solutions to a system of m linear equations in n unknowns, as we shall see later in these notes.
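The spanning-but-dependent example just given can be seen numerically: the matrix whose columns are (1,0)', (0,1)' and (0,2)' has rank 2, so its columns span R^2, but three vectors of rank 2 cannot be independent. The NumPy sketch below is an added illustration.

```python
import numpy as np

# Columns are (1,0)', (0,1)' and (0,2)': a spanning set for R^2 that is not a basis.
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 2.0]])

print(np.linalg.matrix_rank(S))   # 2: the columns span R^2 ...
print(S.shape[1])                 # ... but there are 3 of them, so they are dependent

# An explicit dependence: 0*(1,0)' + 2*(0,1)' - 1*(0,2)' = 0.
lam = np.array([0.0, 2.0, -1.0])
print(S @ lam)                    # the null vector
```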
2.4 Some results

Proposition 3
1. If a set of vectors is linearly independent then any subset of these vectors is also linearly independent.
2. If a set of vectors is linearly dependent then any larger set must also be linearly dependent.
Proof.
1. (The proof is by contradiction.) Let {x_1, ..., x_m} be a linearly independent set. Suppose, however, that {x_1, ..., x_k} is linearly dependent, k < m (perhaps after some re-ordering of the vectors); then there must be some non-zero λ_i, i = 1, ..., k, such that \(\sum_{i=1}^{k} \lambda_i x_i = 0\). Then, letting μ_j = λ_j, j = 1, ..., k, and μ_j = 0 for j = k + 1, ..., m, we can write \(\sum_{j=1}^{m} \mu_j x_j = 0\), where not all the μ_j are zero. This is a contradiction, since {x_1, ..., x_m} is, by assumption, linearly independent. Therefore, the set {x_1, ..., x_k} cannot be linearly dependent for any k < m.

2. Proof is left as an exercise.

Proposition 4 Suppose k < m is the maximum number of linearly independent vectors in a set of m, (n × 1), vectors. Then given any linearly independent subset of k vectors from this set, every other vector in the set can be expressed as a linear combination of these k vectors. (k is the maximum number of linearly independent vectors. This means that any set of k + 1 vectors must be linearly dependent, and there exists at least one set of k vectors which is linearly independent.)

Proof. Proof is left as an exercise.

Proposition 5 Consider a vector (sub)space V. The representation of any vector a ∈ V in terms of a basis is unique.

Proof. Let {x_1, ..., x_m} be a basis for V with a = \(\sum_{i=1}^{m} \alpha_i x_i\), for some a ∈ V. Suppose, however, that we can also write a = \(\sum_{i=1}^{m} \beta_i x_i\). Upon subtraction we find that a − a = 0 = \(\sum_{i=1}^{m} (\alpha_i - \beta_i) x_i = \sum_{i=1}^{m} \gamma_i x_i\), say. But due to linear independence of the set {x_1, ..., x_m} (it is a basis) it must be that γ_i = 0 for all i, so that α_i = β_i for all i.

Proposition 6 The representation of any vector in terms of a linearly dependent set of vectors is not unique.

Proof. Proof is left as an exercise.

Proposition 7 A basis is not unique.

Proof. Let A = {x_1, ..., x_m} be a basis for a vector (sub)space V. Consider any other vector y ∈ V, y ∉ A. From Proposition 5, the unique representation of y is y = \(\sum_{i=1}^{m} \alpha_i x_i\), in which it can be assumed (after some re-ordering if necessary) that α_1 ≠ 0. To prove the result it is merely sufficient to show that B = {y, x_2, ..., x_m} is also a basis for V. We must therefore show that B spans V and is linearly independent.

1. B spans V. From the expression for y we have x_1 = (y − \(\sum_{i=2}^{m} \alpha_i x_i\))/α_1, α_1 ≠ 0. Thus any vector in V which can be expressed as a linear combination of x_1 and x_2, ..., x_m is also expressible as a linear combination of {y, x_2, ..., x_m}. So B spans V.

2. B is linearly independent. Consider the following equation: λ_1 y + \(\sum_{i=2}^{m} \lambda_i x_i = 0\). If the only solution is λ_i = 0 for all i = 1, ..., m, then B must be linearly independent. Substituting for y we obtain λ_1 \(\sum_{i=1}^{m} \alpha_i x_i\) + \(\sum_{i=2}^{m} \lambda_i x_i\) = 0; i.e., γ_1 x_1 + \(\sum_{i=2}^{m} \gamma_i x_i = 0\), where γ_1 = λ_1 α_1 and γ_i = λ_1 α_i + λ_i, i = 2, ..., m. But A = {x_1, ..., x_m} is linearly independent, thus it must be that γ_1 = γ_2 = ... = γ_m = 0. However, by assumption α_1 ≠ 0, so that λ_1 = 0. Since λ_1 must be zero, λ_i = γ_i = 0 for i = 2, ..., m. Therefore, λ_i = 0 for all i and so B is linearly independent.

Although a basis is not unique, the following Proposition shows that all bases (for a particular vector (sub)space) must contain the same number of vectors:

Proposition 8 Consider two alternative bases for a vector (sub)space V: A = {y_1, ..., y_r}, y_i ≠ 0, i = 1, ..., r, and B = {x_1, ..., x_m}, x_j ≠ 0, j = 1, ..., m. Then r = m and this number is called the dimension of the (sub)space and is the maximum number of vectors that can exist in any linearly independent set of vectors in V.

Proof.
Suppose, without loss of generality, that A and B are mutually exclusive. Consider A, which is linearly independent and spans V. Each x_j ∈ B can be expressed in terms of A; assume that x_1 = \(\sum_{i=1}^{r} \lambda_i y_i\), with λ_1 ≠ 0, so that y_1 = (x_1 − \(\sum_{i=2}^{r} \lambda_i y_i\))/λ_1. From the proof of Proposition 7, we know that A_1 = {x_1, y_2, ..., y_r} is a basis for V. Assume the representation of x_2 in terms of A_1 is x_2 = μ_1 x_1 + \(\sum_{i=2}^{r} \mu_i y_i\), in which some of the μ_i (i = 2, ..., r) must be non-zero. (If this were not so then we would have x_2 = μ_1 x_1, which is a contradiction since x_1 and x_2 come from B, a linearly independent set of vectors.) Therefore we can assume that μ_2 ≠ 0 and consequently replace y_2 with x_2 in A_1 to form a new basis A_2 = {x_1, x_2, y_3, ..., y_r}. With this basis we can write x_3 as x_3 = ν_1 x_1 + ν_2 x_2 + \(\sum_{i=3}^{r} \nu_i y_i\), and similar arguments to those above show that ν_i ≠ 0 for some i ≥ 3. We can thus assume that ν_3 ≠ 0 and replace y_3 by x_3 to form a new basis A_3 = {x_1, x_2, x_3, y_4, ..., y_r}. This process can continue until all the remaining x_j in B have been substituted into the basis. For each substitution, one of the y_i is replaced. In order to complete the process there must be at least as many vectors in A as there are in B; i.e., r ≥ m. If r < m then there will come a point (before x_m can be substituted in) when a contradiction concerning the linear independence of B arises. So we have r ≥ m.
Now reverse the process; starting with B as a basis and, one-by-one, substitute the vectors from A into the basis. The process is perfectly symmetric to the one described above and from it we, therefore, find m ≥ r.
Thus we have r ≥ m and m ≥ r. The reconciliation of these findings is r = m.
Corollary 1 Any set of m linearly independent vectors forms a basis for an m-dimensional vector (sub)space.
Proposition 9 Let V be an m-dimensional vector (sub)space. Any set of m mutually orthogonal (non-zero) vectors, {x_1, ..., x_m}, must form a basis for V. (The x_1, ..., x_m are mutually orthogonal if and only if x_i'x_j = 0 for all i ≠ j, and ‖x_i‖² = x_i'x_i > 0.)
Proof. In the light of Corollary 1, all we have to do is prove that {x_1, ..., x_m} is linearly independent. To do so we consider the possibilities for λ_i which solve \(\sum_{i=1}^{m} \lambda_i x_i = 0\). Pre-multiplying by x_j' yields \(\sum_{i=1}^{m} \lambda_i (x_j'x_i) = \lambda_j x_j'x_j = 0\), since x_j'x_i = 0 for i ≠ j. But x_j'x_j = ‖x_j‖² > 0. Therefore it must be that λ_j = 0 for all j = 1, ..., m, which is the only possible solution to λ_j ‖x_j‖² = 0. This implies that the set {x_1, ..., x_m} is linearly independent and thus a basis for an m-dimensional vector space V.
Definition 7 (Orthonormal Basis) Any set of m mutually orthogonal vectors, {x_1, ..., x_m}, satisfying ‖x_i‖ = 1 for all i = 1, ..., m, is called an ORTHONORMAL basis for an m-dimensional (sub)space.
Proposition 10 Any set of m linearly independent vectors from an m-dimensional vector (sub)space can be converted into an orthonormal basis using The Gram-Schmidt Orthogonalisation Process.
Proof. Read up!
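For readers who want to "read up" with something concrete, the following is a minimal Python sketch of the classical Gram-Schmidt process, added for illustration; it is not the treatment the notes have in mind and ignores numerical-stability refinements.

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn a list of linearly independent vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = v.astype(float)
        # Subtract the projection of v onto each (already orthonormal) basis vector.
        for u in basis:
            w = w - (u @ v) * u
        basis.append(w / np.linalg.norm(w))   # normalise to unit length
    return basis

# Illustrative use: three linearly independent vectors in R^3.
vs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
Q = np.column_stack(gram_schmidt(vs))
print(np.round(Q.T @ Q, 10))      # identity matrix: the new basis is orthonormal
```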
Proposition 11 Let V be a vector (sub)space with dimension equal to m, and let {x_1, ..., x_m} be a basis with each x_j ∈ R^n. Then n ≥ m.
Proof. Since {x_1, ..., x_m} is a basis it is also linearly independent. Suppose n < m; then by Proposition 3, {x_1, ..., x_n} must also be linearly independent. However, by Corollary 1 it must be that {x_1, ..., x_n} is a basis for R^n. This implies that for j = n + 1, ..., m, x_j can be expressed as a linear combination of {x_1, ..., x_n}. This is a contradiction as {x_1, ..., x_m} is a basis and thus a linearly independent set of vectors. Therefore, n ≥ m.

2.5 Exercises

1. Are the following sets of vectors linearly dependent? If so, find the appropriate linear combination which equals the null vector.
(a) (1,0,0)', (0,1,0)', (0,0,1)';
(b) (1,0,0)', (0,1,0)', (0,0,1)', (123,17,0)';
(c) (1,2,3)', (2,3,4)';
(d) (1,2,3)', (2,3,4)', (3,5,7)';
(e) (4,1,5)', (2,3,1)', (4,7,3)';
(f) (2,3,1,1)', (2,3,1,2)', (4,6,2,3)'.

2. A non-empty set of vectors in R^n is called a subspace of R^n if it is closed under addition and scalar multiplication. That is, if both x and y are in the set then so are x + y and λx, for any scalar λ. Therefore, if the vectors x_1, x_2, ..., x_m are in the set, all linear combinations \(\sum_{i=1}^{m} \lambda_i x_i\) must also be in the set for it to be a subspace.
(a) Show that a subspace of R^n must always contain the null vector, 0 = (0, 0, ..., 0)'.
(b) If x = (x_1, x_2, x_3, x_4)' is an arbitrary vector from R^4, determine which of the following sets are subspaces of R^4. All vectors with:
i. x_1 = x_2 = x_3 = x_4;
ii. x_1 = x_2, x_3 = x_4;
iii. x_4 = 0;
iv. x_1 = 1;
v. x_1, x_2, x_3, x_4 all integers.

3. If x and y are vectors in R^n then the projection of y onto x is defined as the vector
\[
x \, \frac{x'y}{\|x\|^2} .
\]
(a) Provide a geometrical interpretation of this for n = 2.
(b) Show that if this vector is subtracted from y then the resulting vector is orthogonal to x.

4. Show that if a set of vectors is linearly dependent, then any larger set of vectors is also linearly dependent.

5. Consider a set of m vectors, W = {w_1, ..., w_m}, where k < m is the maximum number of vectors that can exist in any linearly independent subset drawn from W. Let A = {a_1, ..., a_k} ⊂ W denote any linearly independent set of k vectors. Show that any vector x ∈ W, but x ∉ A, can be expressed as a linear combination of the vectors in A.

6. Find the dimension of each of the three subspaces of R^n generated by the following sets of vectors:
(a) {(1,1,2,3)', (3,2,1,0)', (4,1,3,3)', (5,0,5,6)'}
(b) {(1,0,3,1)', (1,0,2,2)', (2,0,3,3)', (3,0,2,2)'}
(c) {(1,2,1,2)', (1,3,2,1)', (2,3,1,0)', (1,2,3,0)'}
Also give a basis for each subspace.

7. Show that if the (n × 1) vector u is orthogonal to the (n × 1) vector v, then every scalar multiple of u is orthogonal to v. Find a vector of unit length which is orthogonal to both v_1 = (1, 1, 2)' and v_2 = (0, 1, 3)' in R^3.

Chapter 3 Matrices & Rank

3.1 Introductory remarks

In Chapter 2 we saw that a linear combination of a set of vectors {x_1, ..., x_m} is an important consideration. In order to pursue related theoretical concepts, as well as introducing an extremely important and useful definition, it is convenient to arrange a set of (column) vectors to form what is known as a matrix.
For example, if we have n observations on m variables, where both n and m are positive integers, then we can introduce the (n × 1) vector x_j to denote all n observations on variable j. In such a situation we might write x_j = {x_ij}, i = 1, ..., n, where x_ij is the ith observation on variable j. Collecting all variables together we could write X = (x_1, ..., x_m), which would constitute a data matrix and is a rectangular array having n rows and m columns; we say that X is (n × m), (rows × columns). Of course, a matrix can be used to represent other things, apart from data. In general, then, a (n × m) matrix X is a rectangular array of numbers:
\[
X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}
\]
and we write X = {x_ij}, meaning "X has typical element x_ij", where x_ij is the element in the ith row and jth column of X, for i = 1, ..., n, and j = 1, ..., m; that is, x_ij is the (i,j)th element of X. Note that where x, a, z, etc., are used to denote vectors (which could be regarded as (n × 1) matrices), in general we use X, A, Z, etc., to denote matrices.

Example 2 Consider the (3 × 4) matrix
\[
A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 \end{bmatrix};
\]
a_23 = 4 = a_14 = a_32, whilst a_21 = a_12 = 2 and a_34 = 6.

A matrix has NO NUMERICAL VALUE itself. However, matrices are very useful and, apart from anything else, they afford the statistician and econometrician a convenient and compact notation.

3.1.1 Some special matrices

1. A square matrix, A, is any (n × n) matrix, and the elements a_11, a_22, ..., a_nn are termed the leading diagonal.

2. The identity matrix:
\[
I_n = (e_1, e_2, \dots, e_n) = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}, \quad (n \times n).
\]

3. A diagonal matrix:
\[
D = \mathrm{diag}(\lambda_i) = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \lambda_n \end{bmatrix}, \quad (n \times n),
\]
with λ_i ≠ 0 for at least one i.

4. The null matrix: any matrix containing only zeros.

3.2 Basic matrix operations & definitions

We can add, subtract and multiply matrices, under appropriate conditions (if we are careful), but direct DIVISION is NOT DEFINED: you should NEVER write "A/B" meaning "A divided by B"; it is simply not done! In the following let A = {a_ij} and B = {b_ij}, for i = 1, ..., n, j = 1, ..., m, so that both A and B are (n × m).

1. EQUALITY: A = B iff a_ij = b_ij for all i and j.

2. SCALAR MULTIPLICATION: B = λA = {λa_ij} = {a_ij λ} = Aλ, for any scalar λ ∈ R. Consider a diagonal matrix D = diag(λ), (n × n); then we have D = λI_n.

3. ADDITION: C = A + B = {a_ij + b_ij} = {b_ij + a_ij} = B + A. Trivially we have: A + B + C = A + (B + C) = (A + B) + C, etc.

4. SUBTRACTION: C = A − B = A + (−1)B = {a_ij − b_ij}.

5. TRANSPOSE: the transpose of A, denoted A', is the (m × n) matrix whose m rows are the m columns of A; i.e., if A = {a_ij}, i = 1, ..., n, j = 1, ..., m, then A' = {b_ji}, j = 1, ..., m, i = 1, ..., n, where b_ji = a_ij.

Example 3 If
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad \text{then} \quad A' = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.
\]

6. TRACE (square matrices only): If A is (n × n), the trace of A, denoted tr(A), is the sum of the elements in the leading diagonal:
\[
\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} .
\]

From the definition of transpose, we define the following:

Definition 8 (Symmetry) A is symmetric iff A' = A, and is skew-symmetric iff A = −A'. (In both cases A must be (n × n), i.e., square.)
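The basic operations above can be experimented with directly; the sketch below is a NumPy illustration added to these notes (the matrix B is invented) checking the transpose of Example 3, the trace, and a symmetry test.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A.T)                      # transpose: a (3 x 2) matrix, rows become columns

B = np.array([[2, 7],
              [7, 1]])
print(np.trace(B))              # trace: 2 + 1 = 3 (square matrices only)
print(np.array_equal(B, B.T))   # True: B is symmetric since B' = B
```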
3.3 Multivariate random variables

In statistics it is necessary to draw the distinction between a random variable X, say, which has some distribution; a possible realisation from that distribution, denoted x; and moments/parameters of that distribution, e.g., E[X] = μ. When considering multivariate random variables it is convenient to arrange all random variables under consideration into a vector. However, adopting the notion of upper case letters to denote random variables would lead us to write something like "X = (X_1, ..., X_n)'" for a vector of n random variables (a random vector). This would be at odds with the notation developed so far. To get over this we shall adopt the rule of NOT distinguishing between a random variable and a realisation of that variable (a practice adopted by most econometricians). We thus write a multivariate random variable as x = (x_1, ..., x_n)', a (n × 1) random vector where {x_i}, i = 1, ..., n, are univariate random variables. If E[x_i] = μ_i, we say that x has mean vector E[x] = μ = {μ_i} and note that E[x − μ] = 0. In general, when taking expectations of random vectors (or random matrices) we simply apply the expectation operator to each element in the vector (or matrix) and thus obtain the corresponding vector (matrix) of expectations.

From the definition of trace, introduced above, it is easy to see that, when dealing with a square matrix of random variables, expectation and trace are interchangeable. Let X be a square matrix of random variables, x_ij, such that E[X] = M = {E[x_ij]}. Then E[tr(X)] = tr(M). (The proof is trivial!)

In the case of a (n × 1) random vector, x, we define the following (n × n) variance-covariance matrix (sometimes called simply the covariance or variance matrix): var(x) = V = {cov(x_i, x_j)}, i, j = 1, ..., n. This matrix is square and provides a way of arranging the variances of, and covariances between, the elements in x. Observe that, writing V = {v_ij}, we have v_ii = var(x_i) and v_ij = v_ji. The leading diagonal, being the elements v_11, v_22, ..., v_nn (top left to bottom right), gives the variances and the off-diagonal elements, v_ij, i ≠ j, are the covariance terms. Thus V is symmetric. (A variance matrix has other important properties, as we shall see in due course.)

3.4 Matrix multiplication

We now consider the possibility of multiplying or taking the product of two (or more) matrices. Such a product, when defined, is written C = AB and is fundamental to many of the ideas in statistics and econometrics. In the product AB, A is called the pre-multiplier and B the post-multiplier (A is post-multiplied by B). There are other matrix products which are commonly used, such as the Kronecker product A ⊗ B and the Hadamard product A ⊙ B, but these will not be discussed here. However, matrix multiplication is NOT always applicable; in order for C = AB to be defined the number of columns in A must equal the number of rows in B. Thus A = {a_ik} must be (n × r) and B = {b_kj} must be (r × m); notice that n and m are arbitrary and not necessarily equal (although they may be). In such a situation the matrix product AB can be taken and the result is the (n × m) matrix C = {c_ij} where
\[
c_{ij} = \sum_{k=1}^{r} a_{ik} b_{kj}, \quad i = 1, \dots, n; \; j = 1, \dots, m.
\]
Observe that, since the matrix product is not always defined, matrix multiplication is NOT commutative; i.e., AB ≠ BA, in general (BA may not be defined even though AB is).
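A small numerical illustration of the product rule c_ij = Σ_k a_ik b_kj and of non-commutativity (a NumPy sketch added to the notes; the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [3.0, 1.0, 4.0]])      # (2 x 3)
B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 5.0]])           # (3 x 2)

C = A @ B                            # defined: columns of A = rows of B; C is (2 x 2)

# c_ij is the scalar product of the ith row of A with the jth column of B.
print(C[0, 1], A[0, :] @ B[:, 1])    # same number, computed two ways

print((A @ B).shape, (B @ A).shape)  # (2, 2) versus (3, 3): AB and BA need not even
                                     # have the same dimensions, so AB != BA in general
```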
To see how matrix multiplication works, write the matrix product in the following way:
\[
\begin{bmatrix} c_{11} & \cdots & c_{1j} & \cdots & c_{1m} \\ \vdots & & \vdots & & \vdots \\ c_{i1} & \cdots & c_{ij} & \cdots & c_{im} \\ \vdots & & \vdots & & \vdots \\ c_{n1} & \cdots & c_{nj} & \cdots & c_{nm} \end{bmatrix}
=
\begin{bmatrix} a_{11} & \cdots & a_{1r} \\ \vdots & & \vdots \\ a_{i1} & \cdots & a_{ir} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nr} \end{bmatrix}
\begin{bmatrix} b_{11} & \cdots & b_{1j} & \cdots & b_{1m} \\ \vdots & & \vdots & & \vdots \\ b_{r1} & \cdots & b_{rj} & \cdots & b_{rm} \end{bmatrix}
\]
from which we see that the typical element c_ij is, in fact, the scalar product between the ith row of A and the jth column of B. Thus, writing A as a stacked set of n, (1 × r), row vectors
\[
A = \begin{bmatrix} a_1' \\ \vdots \\ a_n' \end{bmatrix}, \quad a_i' = (a_{i1}, \dots, a_{ir}), \; (1 \times r), \; i = 1, \dots, n,
\]
with B = [b_1, ..., b_m], b_j = (b_1j, ..., b_rj)', (r × 1), j = 1, ..., m, the (i,j)th element in the product AB can be written c_ij = a_i'b_j.

Provided that the appropriate operations are defined, the following results hold:
1. ASSOCIATIVE LAW: (AB)C = A(BC) = ABC
2. DISTRIBUTIVE LAW: A(B + C) = AB + AC; (B + C)D = BD + CD

At this point, let us introduce the outer product between two vectors u and v. This is a particular example of matrix multiplication as applied to two vectors:

Definition 9 (Outer product) The outer product between u, (n × 1), and v, (m × 1), is the (n × m) matrix uv' = {u_i v_j}, i = 1, ..., n, j = 1, ..., m.

It is the operation of outer product which is used to construct a variance-covariance matrix:

Definition 10 (Variance-covariance matrix) Let x be a (n × 1) random vector with mean vector E[x] = μ. The variance-covariance matrix of x, when it exists, is constructed as:
\[
\mathrm{var}(x) = E\left[ (x - \mu)(x - \mu)' \right] = \left\{ E\left[ (x_i - \mu_i)(x_j - \mu_j) \right] \right\} = \{ \mathrm{cov}(x_i, x_j) \} .
\]

3.4.1 Some useful results

1. If A is any (n × m) matrix then:
(a) I_n A = A I_m = A;
(b) 0_{p,n} A = 0_{p,m} and A 0_{m,q} = 0_{n,q}, where 0_{p,n} is a (p × n) matrix of zeros (the null matrix).

2. (AB)' = B'A', provided the product AB is defined. (The proof is left as an exercise.)

3. tr(AB) = tr(BA), provided AB is defined and square. (The proof is left as an exercise.) Note that, where the appropriate operations are defined, we have that: tr(ABC) = tr(BCA) = tr(CAB).

4. For two (n × 1) vectors, u and v, tr(uv') = v'u = u'v. (The proof is trivial using 2, and noting that tr(·) is a scalar.)

We use matrix multiplication and trace to define a natural generalisation of the Euclidean Norm for a matrix:

Definition 11 (Matrix Euclidean Norm)
\[
\|A\| = \sqrt{ \sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2 }
\]
is the Euclidean norm of an arbitrary (n × m) matrix, A, and is equal to ‖A‖ = √tr(A'A) = √tr(AA'). (You will be asked to show in the Exercises at the end of this Chapter that ‖A‖² = tr(A'A) = tr(AA').)
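The outer product, the variance-covariance construction and the trace form of the matrix norm can all be checked numerically; the sketch below simulates a small random sample purely for illustration (none of this code, nor the particular numbers, comes from the original notes).

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0, 5.0])
print(np.outer(u, v))                     # (2 x 3) outer product uv' = {u_i v_j}

# Sample version of var(x) = E[(x - mu)(x - mu)']: average of outer products.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 draws of a (3 x 1) random vector
xbar = X.mean(axis=0)
V = sum(np.outer(x - xbar, x - xbar) for x in X) / len(X)
print(np.allclose(V, np.cov(X, rowvar=False, bias=True)))   # True

# Matrix Euclidean norm: ||A||^2 = tr(A'A) = tr(AA').
A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.sum(A**2), np.trace(A.T @ A), np.trace(A @ A.T))   # all 30.0
```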
If r(A) = k < m, then there is at least one subset 3.5. LINEAR INDEPENDENCE AND RANK 27 of k column vectors, in A, which is linearly independent and all subsets of k + 1 vectors are linearly dependent. You should be able to verify for yourself that the rank of a diagonal matrix is simply the number of non-zero diagonal elements. 3.5.1 Some results concerning rank Let A be (nm): 1. Clearly r(A)  m: 2. Suppose r(A) = k < m: Then any linearly independent set of k vec- tors, taken from the columns of A; must form a basis for some k- dimensional subspace of Rn: It thus follows, from Proposition 11 in Chapter 2, that r(A) = k  n: 3. 1 and 2 together imply that r(A)  min(n; m): For example, a (4  3) matrix must have rank  3: 4. The above discussion describes rank in terms of the columns of A. But clearly a similar concept pertains to the maximum number of linearly independent rows of A. It is, however, unnecessary to distinguish between row rank and column rank: it turns out that row rank equals column rank; i.e., the maximum number of vectors in any linearly independent subset of rows drawn from A is exactly the same as the maximum number of vectors in any linearly independent subset of columns drawn from A. Thus we can refer to a single number as being the rank of any matrix. 5. From 4, r(A0) = r(A): As noted above, if A is (nm) and r(A) = k  m; then it is possible to Önd x 6= 0 such that Ax = 0; since the columns of A form a linearly dependent set. But from Proposition 6, in Chapter 2, we know that such a vector x will not be unique in this case: there will be many vectors, x; which satisfy Ax = 0. Let us consider this set, W , of solution vectors. If x 2 W thensois xforanyscalar: Ax=0)A( x)=0:Ifbothx2W and y2W then(x+y)2W;becauseAx=0andAy=0)A(x+y)=0: Thus W is, in fact, a subspace of Rm (since all vectors in W are (m  1)): This subspace is called the null-space or kernel of A and its dimension is called the nullity of A. It can be shown that m = r(A) + nullity, where m is the number of columns in A. If, on the other hand, A is (nm) with r(A) = k  m; then for all x 2 Rm; y = Ax 2 C where C is a k-dimensional subspace of Rn; :sometimes referred to as the column space of A: 28 CHAPTER 3. MATRICES & RANK 3.6 Exercises 1. Consider the simple linear regression model with no intercept yi= xi+ui; i=1;:::;n; where the ui are a sequence of iid random variables with mean zero and constant variance and xi is a scalar regressor. (a) Show from Örst principles that the ordinary least squares estima- tor of can be written Pn xi yi i=1 b=Pn : x2i i=1 (b) Show that b can be written b = x0y=(x0x); for suitably deÖned (n  1) vectors x and y. DeÖnethefollowingvectorsy^=(y^1;:::;y^n)0 ande=(y1;:::y^1;yny^n)0; where y^i = bxi; (c) x0e=0: (d) y0y = y^0y^ + e0e: (e) y^0e=0: 21 0 23 2. IfX=42 4 55andY=43 25;computetheproductmatrix i = 1; :::; n: Show that 21 13 3 2 1 2 0 Z = XY and determine: (a) all the elements in the third row and all the elements the Örst column; (b) the single element occupying the position in the second row and second column. 3. Let A and B be two (n  n) matrices and let k be a scalar constant. Show that (a) tr(A+B)=tr(A)+tr(B): (b) tr(kA)=ktr(A): 4. If the matrix products XY and YX are both deÖned and square, show that tr(XY) = tr(YX): 3.6. EXERCISES 29 5. If the matrix X is (n  n) with n2 random variables xij as elements we deÖne the expected value of X; E(X); to be the (n  n) matrix with elements E (xij ) : Show that E ftr (X)g = tr fE (X)g : 6. 
3.6 Exercises

1. Consider the simple linear regression model with no intercept
\[
y_i = \beta x_i + u_i, \quad i = 1, \dots, n,
\]
where the u_i are a sequence of iid random variables with mean zero and constant variance, and x_i is a scalar regressor.
(a) Show from first principles that the ordinary least squares estimator of β can be written
\[
\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2} .
\]
(b) Show that β̂ can be written β̂ = x'y/(x'x), for suitably defined (n × 1) vectors x and y.
Define the following vectors: ŷ = (ŷ_1, ..., ŷ_n)' and e = (y_1 − ŷ_1, ..., y_n − ŷ_n)', where ŷ_i = β̂ x_i, i = 1, ..., n. Show that
(c) x'e = 0;
(d) y'y = ŷ'ŷ + e'e;
(e) ŷ'e = 0.

2. If X = [1 0 2; 2 4 5; . . .] and Y = [1 1; 3 2; . . .], compute the product matrix Z = XY and determine:
(a) all the elements in the third row and all the elements in the first column;
(b) the single element occupying the position in the second row and second column.

3. Let A and B be two (n × n) matrices and let k be a scalar constant. Show that
(a) tr(A + B) = tr(A) + tr(B);
(b) tr(kA) = k tr(A).

4. If the matrix products XY and YX are both defined and square, show that tr(XY) = tr(YX).

5. If the matrix X is (n × n) with n² random variables x_ij as elements, we define the expected value of X, E(X), to be the (n × n) matrix with elements E(x_ij). Show that E{tr(X)} = tr{E(X)}.

6. Let A be (n × m) and B (m × p). Using the fact that the (i,j)th element of the product AB is the scalar product of the ith row of A with the jth column of B, answer the following:
(a) Show that AA' is defined and square, having typical element \(\sum_{s=1}^{m} a_{qs} a_{rs}\) in its (q,r)th position, where q, r = 1, ..., n, and A = {a_qs}, s = 1, ..., m.
(b) Show that A'A has typical element \(\sum_{q=1}^{n} a_{qs} a_{qt}\) in its (s,t)th position.
(c) From (a) and (b), deduce that ‖A‖² = tr(A'A) = tr(AA').
(d) Show that ‖AB‖² ≤ ‖A‖²‖B‖². (This result generalises the familiar Cauchy-Schwartz Inequality.)
(e) Explain why (AB)' = B'A'. Hence, show that for any matrix X, say, X'X is symmetric.

7. Let u be a (n × 1) random vector and consider the outer product matrix uu', which has typical element {u_i u_j}, i = 1, ..., n, j = 1, ..., n.
(a) Show that E{tr(uu')} = E{u'u}.
(b) Under what conditions will tr{uu'} have a chi-squared distribution with n degrees of freedom?

8. Explain carefully how the following system of n equations
\[
y_i = \beta_1 + \sum_{j=2}^{k} x_{ij} \beta_j + u_i, \quad i = 1, \dots, n,
\]
can be expressed in vector-matrix notation as y = Xβ + u, where y and u are (n × 1) vectors, β is a (k × 1) vector and X is a (n × k) matrix.
(a) What is distinctive about the first column of the X matrix?
(b) If the u_i are a sequence of independently and identically distributed (iid) random variables, each with zero mean and constant variance σ², show that E(uu') = σ²I_n, where u is a (n × 1) random vector with typical element u_i, i = 1, ..., n, and I_n is the (n × n) identity matrix.
(c) In addition to the assumptions about u, outlined above, suppose that the elements of X are known constants (that is, they are not random variables). Explain why E(y) = Xβ and var(y) = σ²I_n.

Chapter 4 Determinants and the Inverse Matrix

4.1 Introductory remarks

As made clear in Chapter 3, strictly speaking matrix division is not defined, but we are able to talk about inverse matrices in terms of matrix multiplication. In this Chapter we shall investigate the following question in our pursuit of the, so-called, "inverse matrix": For any given matrix, X, does there exist another matrix, Y say, such that
XY = YX = I ?
where I is an identity matrix. If such a matrix, Y, does exist then it is called the inverse of X (and conversely X will be the inverse of Y). If the inverse of X exists then it is written X⁻¹ (it is NOT written 'I/X'). Notice immediately that the existence of an inverse for all matrices is not guaranteed. For example, if X is (n × m) with n ≠ m, then it is impossible to have XY = YX. Thus, we can immediately conclude that only square matrices can have inverses. Note that a square (n × n) matrix is also referred to as an nth order matrix. It is not true, however, that all square matrices possess an inverse. It is the purpose of this Chapter to investigate the circumstances under which a square matrix is 'invertible' and these deliberations will lead directly to the construction of an inverse matrix, when one exists. The question of existence requires a discussion of determinants, to which we now turn.

4.2 Determinants

The determinant of a matrix is a real number, and is ONLY defined for square matrices as follows:

Definition 12 (Determinant) The determinant of an nth order, (n × n), matrix X is defined to be that real number computed from the following sum involving the n² elements of X:
\[
\det(X) = \sum_{i,j,\dots,r} (\pm)\, x_{1i} x_{2j} x_{3k} \cdots x_{nr},
\]
where the sum is taken over all permutations of the second subscript.
A plus or minus sign (±) is attached to each term in this sum according to whether the permutation of {i, j, k, ..., r} is even (+) or odd (−). (det(X) is sometimes, also, denoted |X|.)

The first thing to note is that the number of terms in this sum is equal to the number of different permutations of the n integers {i, j, k, ..., r} and this is, of course, equal to n! = n(n − 1)(n − 2)⋯2·1. The calculation is probably best illustrated by means of an example.

Example 4 Consider the (3 × 3) matrix
\[
A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}.
\]
A typical term in the sum defining det(A) involves the product of the three elements a_1i a_2j a_3k, i, j, k = 1, 2, 3 (notice that the first subscripts always appear in the order 1, 2, 3). The number of terms like this in the sum is equal to the number of different permutations of {1, 2, 3} and this is equal to 6 = 3!. To write down the expression for det(A) we need all of these 6 permutations (and their signs). Permutations are signed according to the following simple conventions:

• {1, 2, 3} is an even permutation; i.e., {1, 2, ..., n − 1, n} is even;
• simply transposing ('swapping') two integers changes an even permutation into an odd permutation, and an odd into an even.

We thus have the following table:

start  {1, 2, 3}            even (+)
2nd    {2, 1, 3}  (1 ↔ 2)   odd  (−)
3rd    {2, 3, 1}  (1 ↔ 3)   even (+)
4th    {3, 2, 1}  (2 ↔ 3)   odd  (−)
5th    {3, 1, 2}  (2 ↔ 1)   even (+)
6th    {1, 3, 2}  (3 ↔ 1)   odd  (−)

from which we can write out det(A) as
\[
\det(A) = a_{11}a_{22}a_{33} - a_{12}a_{21}a_{33} + a_{12}a_{23}a_{31} - a_{13}a_{22}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} .
\]

It is not difficult to see that such an expression can get quite complicated very quickly, even for moderate values of n (e.g., for n = 4 there will be 24 terms to consider!). However, this IS the definition of a determinant and it does reveal an important feature. In any one term, in the sum defining det(X), no first subscripts are identical and no second subscripts are identical. The first subscripts are simply {1, 2, ..., n} and the second set of subscripts is simply a permutation of these n distinct integers. This means that whenever an element, x_it, of the matrix appears in a term like x_1i x_2j x_3k ⋯ x_nr, then no other element from the same row, i, OR column, t, can appear as well.

4.2.1 Properties of determinants

Let X be (n × n).

1. The interchange of any two columns of X will change the sign of the determinant. This is so because two of the second subscripts (corresponding to the columns that are interchanged) will be swapped in all terms of the sum defining det(X). Thus all terms change sign.

2. det(X) = det(X'). (Think about it!)

3. If X has two rows or columns which are identical then det(X) = 0. Let Y denote the matrix obtained from X by interchanging two columns. Then det(X) = −det(Y). But if the interchanged columns are identical then Y = X, so that we have det(X) = −det(X) and the only resolution of this is det(X) = 0.

4. det(λX) = λⁿ det(X), for any λ ∈ R. Note that λX = {λx_ij} and the result follows immediately from the definition of determinant.

5. It can be shown that det(AB) = det(A) · det(B), provided A and B are square and of the same order.

6. If X is triangular then det(X) = x_11 x_22 x_33 ⋯ x_nn. (X is said to be triangular if x_ij = 0 for j > i, or if x_ij = 0 for i > j.)
4.2.2 Expansion by Cofactors
We now introduce an alternative approach to evaluating det(X). The method, known as expansion by cofactors, leads directly to the construction of the inverse matrix.

Consider again the arbitrary (3×3) matrix A. We note that

det(A) = a11(a22 a33 − a23 a32) − a12(a21 a33 − a23 a31) + a13(a21 a32 − a22 a31)
       = a11 C11 + a12 C12 + a13 C13, say.

Now, define A(ij) to be that (2×2) square matrix obtained from A by eliminating the ith row and jth column so that, for example,

A(12) = [a21 a23; a31 a33];

then we can see that C11 = a22 a33 − a23 a32 = det(A(11)), C12 = −(a21 a33 − a23 a31) = −det(A(12)) and C13 = a21 a32 − a22 a31 = det(A(13)).
In general, for an arbitrary (n×n) matrix X, the det(X(ij)) are called MINORS and the C_ij = (−1)^{i+j} det(X(ij)) are called SIGNED MINORS, i,j = 1,...,n, or COFACTORS.

Definition 13 (Cofactor) The cofactor of a square (n×n) matrix, X, is C_ij = (−1)^{i+j} det(X(ij)), where X(ij) is the ((n−1)×(n−1)) matrix obtained from X by eliminating row i and column j.

Definition 14 (Leading Principal Minors) The kth order leading principal minor of a square (n×n) matrix X, k ≤ n, is the determinant of the (k×k) matrix consisting of the first k rows and columns of X.
In the example given above, the expression for det(A) is called the expansion by cofactors along the first row. However, we can evaluate det(A) using an expansion by cofactors along any row or column. For example, for A (3×3), alternative expressions for det(A) include:

1. det(A) = a11 C11 + a12 C12 + a13 C13 : expansion along row 1
2. det(A) = a21 C21 + a22 C22 + a23 C23 : expansion along row 2
3. det(A) = a31 C31 + a32 C32 + a33 C33 : expansion along row 3

which you can easily verify for yourselves.
In general, if X is (n×n) the following are valid expressions for det(X):

- Expansion along row i:     det(X) = Σ_{j=1}^n x_ij C_ij
- Expansion down column j:   det(X) = Σ_{i=1}^n x_ij C_ij

The fact that we can express a determinant in terms of an expansion by cofactors leads directly to the construction of an inverse matrix and also the condition for the existence of the inverse.
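The expansion along row 1 lends itself to a simple recursive computation. A minimal sketch (assuming Python with numpy; not part of the original notes) is:

```python
import numpy as np

def det_by_cofactors(X):
    # Expansion along row 1: det(X) = sum_j x_{1j} C_{1j},
    # where C_{1j} = (-1)^{1+j} det(X(1j)) and X(1j) deletes row 1 and column j.
    n = X.shape[0]
    if n == 1:
        return X[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(X, 0, axis=0), j, axis=1)   # X(1j)
        cofactor = (-1) ** j * det_by_cofactors(minor)           # (-1)^(1+j), 0-based j
        total += X[0, j] * cofactor
    return total

X = np.array([[3.0, 1.0, 0.0],
              [4.0, 0.0, 1.0],
              [5.0, 0.0, 1.0]])
print(det_by_cofactors(X), np.linalg.det(X))   # both give 1 (up to rounding)
```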

4.3 The inverse matrix
Let us consider an arbitrary (n×n) matrix X. Since X(ij) is obtained by eliminating row i and column j from X, it follows that the element x_ij does not appear in X(ij) and hence does not appear in C_ij = (−1)^{i+j} det(X(ij)). With this in mind, consider the following expansion

Σ_{j=1}^n x_ij C_kj,   for k ≠ i,

which is termed an expansion by alien cofactors, as it is an expansion along row i of X but using cofactors from row k ≠ i. Observe that no elements from row k of X appear in C_kj and (obviously) no elements from row k appear in the ith row of X; i.e., both C_kj and x_ij, j = 1,...,n, are invariant with respect to the elements in the kth row of X. Thus the elements {x_k1,...,x_kn} could be anything and the value of the expansion given above will remain unaltered. In particular, it doesn't matter if we make the elements in the kth row identical to the elements in the ith row: the value of this expansion is unaffected. Let us do this and see what we find: set x_kj = x_ij. Substituting this into the expansion above we get Σ_{j=1}^n x_kj C_kj = det(X*); i.e., we get the determinant of X*, a matrix which has row k and row i identical, by construction. However, previous results show that this determinant is zero. Thus zero is the value of Σ_{j=1}^n x_ij C_kj, k ≠ i, for one particular choice of x_kj. But, as argued above, the value is unaltered by considering other choices for x_kj, whatever the kth row of X is. Thus we have that

Σ_{j=1}^n x_ij C_kj = 0,   for k ≠ i,

whilst

Σ_{j=1}^n x_ij C_ij = det(X).

We now define, and form, the ADJOINT of X:

Definition 15 (Adjoint) The adjoint of a square (n×n) matrix X, denoted X+, is the transposed matrix of cofactors. Thus X+ = {C_ji}, or

X+ = [C11 C21 ... Cn1;
      C12 C22 ... Cn2;
      ...
      C1n C2n ... Cnn].

We can now see, from simple matrix multiplication, that a typical element in XX+ is (in row i and column j)

{XX+}_{ij} = Σ_{k=1}^n x_ik C_jk,

which equals det(X) when i = j, but is zero otherwise. That is, XX+ has all diagonal elements equal to det(X) and zeros everywhere else; or, more succinctly,

XX+ = det(X) · I_n.

And, indeed, similar considerations show that X+X = det(X) · I_n, also. Thus, provided det(X) ≠ 0, we can define Y = [det(X)]^{-1} X+ and the above analysis shows that XY = YX = I_n. Therefore, by construction, Y is the inverse matrix that we have been looking for. It exists only for square matrices with non-zero determinants and, if it exists, it MUST be unique (see later).

Definition 16 (Inverse matrix) Provided X is (n×n), then X has an inverse matrix if and only if det(X) ≠ 0. This inverse matrix is defined by

X^{-1} = (1/det(X)) X+,

where X+ is the adjoint matrix of X.
Remark 1 If X^{-1} exists, we say that X is non-singular. If det(X) = 0, then X is said to be singular and its inverse does not exist.
Example 5 Consider the (2×2) matrix X = {x_ij}, i,j = 1,2. In this case det(X) = x11 x22 − x21 x12 and, provided this is non-zero,

X^{-1} = (1/(x11 x22 − x21 x12)) [x22 −x12; −x21 x11],

as is probably well known.
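The adjoint construction can be checked numerically for any small non-singular matrix. A minimal sketch (assuming Python with numpy; the example matrix is arbitrary):

```python
import numpy as np

def cofactor_matrix(X):
    # C_{ij} = (-1)^{i+j} det(X(ij)): the signed minors of X.
    n = X.shape[0]
    C = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(X, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

def inverse_via_adjoint(X):
    d = np.linalg.det(X)
    if np.isclose(d, 0.0):
        raise ValueError("X is singular: no inverse exists")
    adj = cofactor_matrix(X).T      # adjoint X+ = transposed matrix of cofactors
    return adj / d                  # X^{-1} = X+ / det(X)

X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(inverse_via_adjoint(X))
print(np.linalg.inv(X))              # the two answers should agree
print(X @ inverse_via_adjoint(X))    # approximately the identity matrix
```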
4.4 Some further results concerning rank, determinants and inverses

1. For an arbitrary (n×m) matrix, A, r(A) is the size of the largest non-vanishing determinant contained in A (obtained by eliminating rows and columns from A).

Proof: omitted - quite tiresome.

Example 6

X = [1 3 1 0; 2 4 0 1; 3 5 0 1]  ⇒  r(X) = 3, since det[3 1 0; 4 0 1; 5 0 1] = 0 − 1(4 − 5) + 0 = 1 ≠ 0.

X = [1 2 1; 2 4 0; 3 6 0]  ⇒  r(X) = 2, since det[2 1; 4 0] = −4 ≠ 0, even though det[1 2; 2 4] = 0.
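Result 1 above can be illustrated directly, at least for small matrices. The sketch below (assuming Python with numpy and itertools; not part of the original notes) searches over square submatrices for the largest non-vanishing determinant and compares the answer with a library rank routine.

```python
import itertools
import numpy as np

def rank_via_minors(A, tol=1e-10):
    # r(A) = size of the largest square submatrix of A with a non-zero determinant.
    n, m = A.shape
    for k in range(min(n, m), 0, -1):
        for rows in itertools.combinations(range(n), k):
            for cols in itertools.combinations(range(m), k):
                if abs(np.linalg.det(A[np.ix_(rows, cols)])) > tol:
                    return k
    return 0

X = np.array([[1.0, 3.0, 1.0, 0.0],
              [2.0, 4.0, 0.0, 1.0],
              [3.0, 5.0, 0.0, 1.0]])
print(rank_via_minors(X), np.linalg.matrix_rank(X))   # both report 3 for this X
```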
In the following let X be (n×n).

2. If X has an inverse then that inverse is unique. (Proof is left as an exercise.)

3. If X and Y are non-singular having inverses X^{-1} and Y^{-1}, respectively, then XY is non-singular and has inverse Y^{-1}X^{-1}. (Proof is left as an exercise.)

4. Provided X is non-singular, det(X^{-1}) = [det(X)]^{-1}.
Proof: Using previous results we have that det(I_n) = 1 and det(X^{-1}X) = det(X) · det(X^{-1}). Since X^{-1}X = I_n, the result follows.

5. If the columns (or rows) in X form a linearly dependent set of vectors, then det(X) = 0.
Proof follows from 1 above; an alternative derivation is left as an exercise.

6. For an (n×n) matrix, X: [r(X) = n] ⇔ [det(X) ≠ 0] ⇔ [X is non-singular]; i.e., X is non-singular iff it has full rank.
Proof follows from 1: r(X) = n ⇒ the largest non-vanishing determinant is det(X) ⇒ det(X) ≠ 0;
det(X) ≠ 0 ⇒ the largest non-vanishing determinant is det(X) ⇒ r(X) = n.

7. If X = diag(x_i) then, provided x_i ≠ 0 for all i, X^{-1} = diag(x_i^{-1}). (Proof is left as an exercise.)

8. Provided det(X) ≠ 0, (X^{-1})' = (X')^{-1}. (Proof is left as an exercise.)
4.5 Exercises

1. (a) If A = [1 2 3; 0 1 1; 4 2 2] and B = [1 2 3; 0 1 1; 4 2 2], show by direct evaluation that det(A) = det(B).

(b) If A = [1 2 3; 4 1 2; 1 1 1] and B = [1 2 3; 1 1 1; 4 1 2], show by direct evaluation that det(A) = −det(B).

(c) Show that det[1 3 1; 2 6 2; 4 1 2] is zero.

2. If the (n×n) matrix A = {a_ij} is lower triangular, so that a_ij = 0 for all j > i, show that det(A) = a11 a22 ... a_nn.

3. Explain why the determinant of a square matrix:
(a) will be zero if it has two columns (or rows) which are identical;
(b) will change sign if two columns are interchanged;
(c) will be multiplied by λ if one of its columns has all its entries multiplied by λ.

4. Let A be a square matrix whose rows form a linearly dependent set of (row) vectors. Thus the rth row of A can be expressed as some linear combination of the remaining rows in A. By writing down det(A) in terms of an expansion by cofactors along the rth row of A, show that det(A) = 0.
(Hint: if you write the algebra correctly, you will have to recognise that an expansion by alien cofactors is identically zero.)

5. The concept of a block-partitioned or simply partitioned matrix is often very useful. Indeed, we have introduced the concept previously by regarding a matrix as a collection of column vectors (side by side) or as a collection of row vectors (stacked on top of one another). In general, it may be convenient to regard an (n×m) matrix A in block partitioned form as

A = [A11 A12; A21 A22],

where A11 is (n1×m1), A12 is (n1×m2), A21 is (n2×m1), A22 is (n2×m2), and n1 + n2 = n and m1 + m2 = m. If the matrix product AB is defined, for some matrix B, then obtain the appropriate partition of B such that

AB = [A11 B11 + A12 B21   A11 B12 + A12 B22;
      A21 B11 + A22 B21   A21 B12 + A22 B22].

6. Explain the partitions used if Ax = b is re-expressed as either:
(a) A1 x1 + A2 x2 = b, or
(b) A1 x = b1 and A2 x = b2,
where A is (n×n), x is (m×1), b is (n×1) and a subscript on a matrix/vector denotes a partition of that matrix/vector.
[Note that the partition of A in (a) is different to the partition of A in (b), so that A1 in (a) is NOT the same as A1 in (b).]

7. If possible, calculate the inverse of the following matrices:

A = [2 1; 6 2],  B = [1 3; 3 9],  C = [1 2 6; 0 1 5],
D = [2 1; 6 9; 7 4],  E = [1 1 7; 0 1 2; 2 4 17],  F = [1 2 4; 2 4 7; 0 1 3].

If the inverse does not exist, explain why and also calculate the rank of these singular matrices.

8. (a) Show that if an nth order matrix is non-singular, then its inverse is unique.
(b) Consider two square, non-zero matrices A and B which satisfy AB = 0. Show that BOTH A and B must be singular matrices.
(c) Prove that (XY)^{-1} = Y^{-1}X^{-1}, when the inverses are defined.
(d) Provided det(X) ≠ 0, show that (X')^{-1} = (X^{-1})'.

9. In general, an nth order matrix, X say, whose inverse is its own transpose, X', is said to be orthogonal. An orthogonal matrix thus satisfies X'X = XX' = I_n.
(a) Verify that the n columns of an orthogonal matrix form an orthonormal basis for the n-dimensional Euclidean Space (i.e., show that these column vectors are mutually orthogonal and have unit length).
(b) Let X be an orthogonal matrix and y and z two (n×1) vectors, such that y = Xz. Show that ||y|| = ||z||.

10. Let x = {x_i} denote a (n×1) vector of observations on some random variable of interest. Define the sum vector, ι_n = (1,1,...,1)', to be a (n×1) vector of ones, and the (n×n) matrix M = I_n − (1/n) ι_n ι_n'. (A numerical sketch of these objects follows the exercises.)
(a) Verify that ι_n'x/n computes the sample mean of the observations {x_i}.
(b) Verify that M has elements m_ii = 1 − 1/n and m_ij = −1/n, i ≠ j.
(c) Show that M' = M and that MM = M.
(d) Finally, show that Mx = {x_i − x̄}, where x̄ is the sample mean of the observations.
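The objects in exercise 10 are easy to experiment with. A minimal sketch (assuming Python with numpy; the data vector is arbitrary):

```python
import numpy as np

n = 5
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])       # arbitrary observations
iota = np.ones(n)                               # the sum vector of ones
M = np.eye(n) - np.outer(iota, iota) / n        # M = I_n - (1/n) iota iota'

print(iota @ x / n, x.mean())                   # (a) iota'x / n is the sample mean
print(np.allclose(M, M.T))                      # (c) M is symmetric ...
print(np.allclose(M @ M, M))                    #     ... and idempotent: MM = M
print(M @ x, x - x.mean())                      # (d) Mx gives deviations from the mean
```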

Chapter 5
Linear Transformations and Simultaneous Equations
5.1 Introductory remarks
For our purposes we think of a linear transformation as being a simple linear combination of a set of m, (n×1), vectors, where the set of vectors concerned forms the columns of some (n×m) matrix, A = [a1,...,am], say. A single linear combination of the m vectors {a1,...,am}, each (n×1), forms a single, new, (n×1) vector. A sequence of p linear combinations of these vectors forms a corresponding sequence of p, (n×1), vectors. Collecting all these p vectors together forms a new (n×p) matrix C, say, which can be regarded as a linear transformation of A. This linear transformation can be represented by matrix multiplication, as follows.

Let A be the (n×m) matrix described above, and let b_r, r = 1,...,p, denote a sequence of p, (m×1), vectors. Then Ab_r = c_r defines a vector c_r which is a simple linear combination of the columns of A. The scalar coefficients in this linear combination are the elements in b_r:
Ab_r = Σ_{j=1}^m a_j b_jr,   r = 1,...,p,

where b_r = (b_1r,...,b_mr)'. The sequence of p vectors, b_r, defines the (m×p) matrix B = [b1, b2,...,bp] and similarly we can write C = [c1, c2,...,cp], (n×p), where Ab_r = c_r, r = 1,...,p. Now, applying the rules for matrix multiplication it is fairly easy to show that:

AB = A[b1, b2,...,bp] = [Ab1, Ab2,...,Abp] = [c1, c2,...,cp] = C.

That is, by taking p (possibly different) linear combinations of the columns of A, (n×m), we form a new matrix C, (n×p), and the operation that achieves this linear transformation is the post-multiplication of the matrix A

by an appropriate (m  p) matrix B. Thus we think of AB as representing a linear transformation of the columns of A:
Alternatively, we could think of the matrix product AB as representing a linear transformation of the rows of B, where the matrix A achieves the transformation. Either way we regard the operation of matrix product, AB, as providing a linear transformation:
 post-multiplication, by B; takes linear combinations of the columns of the pre-multiplier
 pre-multiplication, by A; takes linear combinations of the rows of the post-multiplier
Example 7

1. Suppose D = diag(d_i), i = 1,...,n, A is (m×n) and B is (n×p). Then

AD = [d1 a1, d2 a2, ..., dn an]

provides a new matrix whose columns are scalar multiples of the columns of A. Further,

DB = [d1 b1'; d2 b2'; ... ; dn bn']

provides a new matrix whose rows, d_i b_i', are scalar multiples of the rows of B.

2. Suppose the sequence of random variables, u_i, are independently and identically distributed (iid), each having mean zero and variance equal to σ², i = 1,...,n. Writing u = {u_i} we have that E[u] = 0 and var[u] = E[uu'] = {E[u_i u_j]} = σ²I_n. Consider now the new vector random variable, z = Au, whose elements, z_s, are simple linear combinations of the elements in u (i.e., the random vector z is a linear transformation of the random vector u):

z_s = Σ_{i=1}^n a_si u_i,   s = 1,...,r,

and A = {a_si} is a matrix of constants. We wish to find E[z] and var[z].

Firstly, and noting that E[·] is a linear operator, we have

E[z] = {E[z_s]} = {E[Σ_{i=1}^n a_si u_i]} = {Σ_{i=1}^n a_si E[u_i]} = {0}

since E[u_i] = 0. Thus we may write: E[z] = AE[u] = 0.

Finding var[z] is slightly more complicated but proceeds along similar lines. Since z_s has zero mean for all s we have

var[z] = {E(z_s z_t)} = {E[(Σ_{i=1}^n a_si u_i)(Σ_{j=1}^n a_tj u_j)]} = {Σ_i Σ_j a_si a_tj E[u_i u_j]} = {σ² Σ_{i=1}^n a_si a_ti} = σ² AA',

since E[u_i u_j] = σ² when i = j and zero otherwise.
5.3 Simultaneous Linear Equations

Consider a system of n linear equations in m unknowns, written Ax = b. Then:

- if n > m there are more equations than unknowns,
- if n < m there are more unknowns than equations, and
- if n = m there are the same number of unknowns as equations.

Notice that in general A^{-1} may not exist (even if n = m), so that the unique solution of the form x = A^{-1}b is NOT guaranteed to be available. Moreover, it may be that no solution to Ax = b exists in any form; for example, in the simple two-equation, two-unknown case given above, had A = [2 3; 4 6] then there would have been no solution, as the two equations would have represented two distinct parallel straight lines which do not intersect. On the other hand, had A = [2 3; 4 6] and b = (4, 8)', there would have been an infinite number of solutions, as both equations would represent the same straight line. Thus, in the case of two unknowns each equation (possibly more than two equations, say n) represents a straight line in ℝ² and the nature of the solution depends on whether there is:

1. a common point of intersection for all these lines (a unique solution)
2. no common point of intersection (no solution)

or whether

3. every equation is in fact the same straight line (an infinite number of solutions).

For n equations in three unknowns the situation is slightly more complicated. Here each equation represents a (two-dimensional) plane in ℝ³. If the n planes intersect at a common point then a unique solution exists, corresponding to that point of (common) intersection. The planes may also intersect along a straight line (for example, in the special case of only two equations, if the two planes do intersect then it can only be along a straight line). Alternatively, the equations may represent the same plane. In both these latter cases there are an infinite number of solutions, representing the infinite number of possible points of common intersection. If the planes do not intersect in some "common fashion" then there will be no solution.

The above discussion gives some geometric insights regarding the question of existence and uniqueness of a solution to a system of n linear equations in m unknowns. We now formalise these ideas algebraically.

5.3.1 Existence and Uniqueness

Let A be (n×m), x (m×1) and b (n×1); A and b are known (given) whilst x is the unknown vector. Observe that if there exists an x = (x1,...,xm)' which satisfies Ax = b, for given A and b, then we can write

Σ_{j=1}^m a_j x_j = b,

where the a_j are the m columns of A = [a1,...,am]. Thus if a solution exists we are saying that it is possible to express b as a linear combination of the columns of A. Equivalently, b must lie in the vector subspace spanned by the columns of A and we say that b lies in the column space of A. Now the rank of A, r(A), gives the maximum number of linearly independent column vectors in A, which is also the dimension of the column space of A. Consider the augmented matrix Ab = [A, b], (n×(m+1)), which has
columns [a1, a2,...,am, b]. If a solution exists, then the (m+1) columns of Ab span exactly the same subspace as the columns of A. Thus the dimension of the space spanned by the columns of Ab, which is r(Ab), must be the same as r(A). On the other hand, if no solution exists then it is not possible to write b as a linear combination of the columns of A, so augmenting A with b must increase the maximum number of linearly independent vectors by 1, yielding r(Ab) = r(A) + 1. Since, for Ab, the only two possible states of the world are r(Ab) = r(A) or r(Ab) = r(A) + 1, we have proved the following result:

Theorem 1 (Existence) A system of n linear equations in m unknowns, Ax = b, has a solution if and only if r(Ab) = r(A), where Ab = [A, b].

Furthermore,

Theorem 2 (Uniqueness) If a system of n linear equations in m unknowns, Ax = b, is such that r(Ab) = r(A) = r, say, then:

1. if r = m, the solution is unique;
2. if r < m, the solution is not unique.

Proof. Firstly, r(A) = r(Ab) = r, so that b lies in the column space of A (a solution exists).

1. If r = m (≤ n), then A has full column rank, implying that the m columns of A form a basis for V, an m-dimensional subspace of ℝⁿ. Since b ∈ V, and [a1,...,am] is a basis for V, the representation of b in terms of the a_j must be unique. Thus if r = m there must be a unique solution vector, x.

2. If r < m then A contains r < m linearly independent columns (at most). Thus [a1,...,am] form a linearly dependent set of vectors and thus the representation of b in terms of [a1,...,am] is not unique.

Corollary 2 If m > n then there is either no solution or an infinite number of solutions. (For example, m = 3, n = 2 requires a consideration of whether or not two planes in ℝ³ intersect. However, if m = 2 and n = 1 there can only be an infinite number of solutions.)
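Theorems 1 and 2 translate directly into a rank comparison that is easy to carry out numerically. A minimal sketch (assuming Python with numpy; the right-hand sides in the first two calls are illustrative choices):

```python
import numpy as np

def classify_system(A, b, tol=1e-10):
    # Compare r(A) with r([A, b]) and with the number of unknowns m (Theorems 1 and 2).
    m = A.shape[1]
    rA = np.linalg.matrix_rank(A, tol)
    rAb = np.linalg.matrix_rank(np.column_stack([A, b]), tol)
    if rAb > rA:
        return "no solution: b is not in the column space of A"
    if rA == m:
        return "unique solution"
    return "infinitely many solutions: A does not have full column rank"

A = np.array([[2.0, 3.0],
              [4.0, 6.0]])                                 # proportional rows
print(classify_system(A, np.array([4.0, 9.0])))            # parallel lines: no solution
print(classify_system(A, np.array([4.0, 8.0])))            # same line twice: infinitely many
print(classify_system(np.eye(2), np.array([1.0, 2.0])))    # full column rank: unique
```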
Theorem 3 A system of n equations in n unknowns has a unique solution if and only if A is non-singular. If A is singular then a (non-unique) solution may or may not exist according to Theorem 1.

Proof. If A is non-singular then r(A) = n, which necessarily implies that r(Ab) = n also. Therefore a solution exists. Further, by Theorem 2 with m = n, it will be unique, as A has full column rank. Now suppose a unique solution exists in a system of n equations in n unknowns. With m = n, Theorem 2 says that in this case A must have full column rank. But since A is square this necessarily implies that A is non-singular.

Observe that the unique solution to a system of n equations in n unknowns is given by x = A^{-1}b, provided that A is non-singular. This solution is also given by:
Proposition 13 (Cramer's rule) Provided A is non-singular, the unique solution to Ax = b is given by

x_j = (1/det(A)) · det [ a11 ... a1,j−1  b1  a1,j+1 ... a1n ;
                          ...                        ... ;
                          an1 ... an,j−1  bn  an,j+1 ... ann ].
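Cramer's rule is simple to implement directly. A minimal sketch (assuming Python with numpy), applied to the first example system of Section 5.3.3:

```python
import numpy as np

def cramer_solve(A, b):
    # x_j = det(A_j) / det(A), where A_j is A with its jth column replaced by b.
    d = np.linalg.det(A)
    if np.isclose(d, 0.0):
        raise ValueError("A is singular: Cramer's rule does not apply")
    x = np.empty(A.shape[1])
    for j in range(A.shape[1]):
        Aj = A.copy()
        Aj[:, j] = b                       # replace the jth column by b
        x[j] = np.linalg.det(Aj) / d
    return x

A = np.array([[3.0, 2.0, 4.0],
              [1.0, 0.0, 3.0],
              [2.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])
print(cramer_solve(A, b))                  # (3, -2, -1)
print(np.linalg.solve(A, b))               # the same answer from a standard solver
```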
5.3.2 Homogeneous equations

A system of equations in which b = 0 is called homogeneous. The system is Ax = 0 and is called homogeneous since if x solves the system then so does λx, for any scalar λ.
Theorem 4 (Homogeneous system) A solution always exists to a system of homogeneous equations; for example, x = 0. A non-trivial solution exists if and only if r(A) < m.

Proof. Suppose a non-trivial solution to Ax = 0 exists. Then it is possible to equate a non-trivial linear combination of the columns of A to the null vector. Thus the columns of A must be a linearly dependent set of vectors. Hence r(A) < m. Now assume r(A) < m; then the columns of A form a linearly dependent set of vectors; therefore it must be possible to equate a non-trivial linear combination of the columns of A to the null vector. Thus a non-trivial solution exists.

5.3.3 Some Examples

We have seen that the question of the existence of a solution vector, x, is essentially the answer to the question: "Given A and b, is it possible to express b as some linear combination of the columns of the matrix A?" If the answer to this question is "yes" then a solution vector exists, but it may not be unique. Consider the following examples:

(1) 3x1 + 2x2 + 4x3 = 1        (2) 2x1 + 3x2 = 4
    x1 + 3x3 = 0                   x1 − 2x2 = −1
    2x1 + x2 + 2x3 = 2             3x1 + 5x2 = 2

(3) x1 + 2x2 + 4x3 = 1         (4) 2x1 + 3x2 = 8
    2x1 + 3x2 + 2x3 = 2            x1 + x2 = 3
                                   2x1 + x2 = 4

Solutions:

1. Writing Ax = b, we have A = [3 2 4; 1 0 3; 2 1 2], b = (1, 0, 2)'. Rank(A) = 3 (since det(A) ≠ 0) which, clearly, must be the same as the rank of Ab = [A, b]. Therefore there is a unique solution for x' = (x1, x2, x3). Since A is square and non-singular, this can be obtained as x = A^{-1}b. Alternatively, solving in the time-honoured fashion, we obtain the solution x' = (3, −2, −1).

2. Here, A is (3×2) and clearly has rank 2. However, the rank of Ab = [A, b] is 3 (the matrix Ab = [A, b] is square and has a non-zero determinant). Hence there is no solution.

3. A is (2×3) and has rank 2. This necessarily implies that the matrix Ab = [A, b] also has rank 2. Therefore a solution exists, but it will not be unique since A does not have full column rank. A particular solution is x1 = 1 + 8x3, x2 = −6x3, for arbitrary x3.

4. A is (3×2) and is easily seen to have rank 2. The determinant of Ab = [A, b] is zero, so Ab = [A, b] must also have rank 2. Thus a solution exists and it is unique, since A has full column rank. Solving by, say, looking at just the first two equations gives x' = (1, 2).

5.4 Exercises

1. Let x be a (n×1) random vector with mean vector μ and variance-covariance matrix Σ. What is the variance-covariance matrix of u = x − μ? Consider the linear transformation from the random vector x to the random vector y, defined by y = Ax, for some (k×n) matrix of constants A. Show that y has mean vector Aμ and variance-covariance matrix AΣA'.

2. Consider the system of n linear equations in m unknowns, Ax = b, such that r(A) = r[A, b] = r, say, so that a solution exists because b lies in the column space of A.

(a) If r = m and the first m rows of the matrix A form a linearly independent set of row vectors, show that the unique solution can be expressed as x = A1^{-1}b1 and that A2 A1^{-1}b1 = b2, where: A1 is a matrix made up from the first m rows of A; A2 consists of the last (n−m) rows of A; b1 consists of the first m elements of b; and b2 the last (n−m) elements of b.

(b) Now suppose r = n

x'x > 0 or y'y > 0, or both; thus β = 0, implying that the eigenvalue is real, and real eigenvalues generate real eigenvectors.
Theorem 6 For a symmetric matrix, A, eigenvectors corresponding to dis-
tinct eigenvalues are pairwise orthogonal.
Proof. Let x_i denote the eigenvector corresponding to eigenvalue λ_i and consider two distinct eigenvalues, λ_i and λ_j, i ≠ j. Then Ax_i = λ_i x_i and Ax_j = λ_j x_j. Pre-multiplying by x_j' and x_i' yields, respectively, x_j'Ax_i = λ_i x_j'x_i and x_i'Ax_j = λ_j x_i'x_j. But x_i'x_j = x_j'x_i and A' = A, so that x_j'Ax_i = x_i'A'x_j = x_i'Ax_j. Thus (λ_i − λ_j) x_i'x_j = 0. However, λ_i and λ_j are distinct; therefore

x_i'x_j = 0.
Theorem 7 If an eigenvalue of a symmetric matrix has multiplicity k, then there will be k orthogonal eigenvectors corresponding to this root.
Proof. omitted.
Theorems 5, 6 and 7 enable us to state the following important result:
Theorem 8 An nth order symmetric matrix, A, has real eigenvalues, λ1,...,λn (possibly not all distinct), and correspondingly a set of n mutually orthogonal eigenvectors x1,...,xn (such that x_i'x_j = 0, for all i ≠ j, and ||x_i|| > 0).
Proof. Follows immediately from Theorems 5, 6 and 7.
Note that if x1 is an eigenvector of A then so is αx1, for any scalar α ≠ 0. This arbitrariness can be removed by imposing the normalisation ||x|| = 1 for all eigenvectors of A. Then the CVP becomes

Ax = λx,   ||x|| = 1.

Thus eigenvectors of a symmetric matrix are orthonormal, having the property that x_i'x_j = δ_ij, where δ_ij is the Kronecker delta.

Definition 18 (Kronecker delta) The Kronecker delta, δ_ij, satisfies δ_ii = 1 and δ_ij = 0, i ≠ j.

Example 10
1. Consider a (2×2) matrix, A = {a_ij}, i,j = 1,2. The characteristic polynomial is given by

f(λ) = det [a11 − λ   a12;  a21   a22 − λ]
     = (a11 − λ)(a22 − λ) − a12 a21
     = (−λ)² + (a11 + a22)(−λ) + a11 a22 − a12 a21
     = (−λ)² + tr(A)(−λ) + det(A).

Thus f(λ) is quadratic with roots λ = ½ [ tr(A) ± √( tr(A)² − 4 det(A) ) ].
2. Let A = [−3  2√2;  2√2  −1]. Then det(A − λI2) = (λ + 5)(λ − 1), so that the eigenvalues are λ1 = −5 and λ2 = 1. To get the eigenvectors we need to solve the following two systems of homogeneous equations:

[2   2√2;  2√2   4] [x1; x2] = 0,

[−4  2√2;  2√2  −2] [x1; x2] = 0.

The first system implies that x1 = −√2 x2, and the second that √2 x1 = x2. Both systems (of course) yield non-unique solution vectors. In both cases we normalise to make the solution vector of unit length and thus obtain two orthonormal eigenvectors (observe that the form of the two solution vectors already implies that they are orthogonal). Thus, for the first (2×1) eigenvector we have that ||x1|| = 1 and x1 = −√2 x2, which gives the eigenvector (associated with λ1 = −5) as x1 = (−√2/√3, 1/√3)'. For the second eigenvector, associated with λ2 = 1, we have that ||x2|| = 1 and √2 x1 = x2, which yields x2 = (1/√3, √2/√3)'.
3. Suppose A = λI_n = diag(λ); then det(A − μI_n) = det[(λ − μ)I_n] = (λ − μ)ⁿ. This implies that the n eigenvalues of A are all given by μ_i = λ, i = 1,...,n. We thus need to find a set of n orthonormal eigenvectors which solve Ax = λx, i.e., λI_n x = λx. Clearly x_i = e_i, the ith unit vector, will suffice.
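Part 2 of Example 10 can be verified numerically. A minimal sketch (assuming Python with numpy, and using the symmetric matrix as reconstructed above):

```python
import numpy as np

A = np.array([[-3.0, 2.0 * np.sqrt(2.0)],
              [2.0 * np.sqrt(2.0), -1.0]])     # the symmetric matrix of Example 10.2

lam, X = np.linalg.eigh(A)         # eigenvalues/eigenvectors of a symmetric matrix
print(lam)                         # approximately [-5, 1]
print(X.T @ X)                     # the eigenvectors are orthonormal: X'X = I
print(np.allclose(A @ X, X @ np.diag(lam)))    # A X = X Lambda
```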

6.4 The Diagonalisation of a Symmetric Matrix

Let the real (n×n) symmetric matrix A have eigenvalues λ1,...,λn with corresponding orthonormal eigenvectors x1,...,xn. Thus the pair (λ_i, x_i) satisfies Ax_i = λ_i x_i, with x_i'x_j = δ_ij, i,j = 1,...,n. Define the following (n×n) matrix of eigenvectors, X = [x1,...,xn], so that the columns of X are the n orthonormal eigenvectors of A, and the diagonal matrix of eigenvalues,

Λ = diag(λ_i) = [λ1 0 ... 0;  0 λ2 ... 0;  ... ;  0 0 ... λn].

Then, we can write

AX = A[x1,...,xn] = [Ax1,...,Axn] = [λ1 x1,...,λn xn] = XΛ.

Thus the eigenvalues and eigenvectors satisfy the following:

AX = XΛ.   (6.2)
The columns of X are orthonormal and thus form a linearly independent set of n, (n×1), vectors. Thus X is non-singular and its inverse is simply X', since X'X = {x_i'x_j} = {δ_ij} = I_n. From the properties of inverse matrices, it must be that XX' = I_n also, and X is said to be orthogonal (not orthonormal, as one might expect).

Definition 19 (Orthogonal matrix) An orthogonal matrix is such that A'A = AA' = I.

Now, pre-multiplying AX by X' in (6.2) gives

X'AX = X'XΛ = I_nΛ = Λ = diag(λ_i).

On the other hand, post-multiplying by X' gives

A = XΛX'.

This latter result is called the spectral decomposition of A, whilst the former is known as the diagonalisation of A (we say that X diagonalises A).

What is important here is the result concerning the diagonalisation of A, rather than the practicalities of carrying it out. Often we only need to know of the existence of a matrix X which diagonalises A, without actually having to write it down! For example, diagonalisation immediately gives us the following result:

Proposition 17 Let A be (n×n); then r(A) is equal to the number of non-zero eigenvalues of A.

Proof. Although the result is true for any matrix, the proof is particularly straightforward for symmetric matrices. We have AX = XΛ, where X is non-singular. Therefore X'A and X'AX represent non-singular transformations of A and X'A respectively. But we know that

r(A) = r(X'A) = r(X'AX) = r(Λ)

from Proposition 12, Chapter 5. But Λ = diag(λ_i), so r(Λ) is trivially the number of non-zero eigenvalues, as required.

Finally, note that we can easily verify that tr(A) = Σ_{i=1}^n λ_i, when A is symmetric, as follows:

tr(A) = tr(XΛX') = tr(ΛX'X) = tr(Λ) = Σ_{i=1}^n λ_i.
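The diagonalisation, the spectral decomposition, the rank result and the trace result can all be checked at once on a small symmetric matrix. A minimal sketch (assuming Python with numpy; the matrix is an arbitrary randomly generated symmetric example):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                                   # an arbitrary symmetric matrix

lam, X = np.linalg.eigh(A)                    # A X = X Lambda, with X orthogonal
Lam = np.diag(lam)

print(np.allclose(X.T @ A @ X, Lam))          # diagonalisation: X'AX = Lambda
print(np.allclose(A, X @ Lam @ X.T))          # spectral decomposition: A = X Lambda X'
print(np.isclose(np.trace(A), lam.sum()))     # tr(A) equals the sum of the eigenvalues
print(np.linalg.matrix_rank(A),
      np.sum(~np.isclose(lam, 0.0)))          # rank = number of non-zero eigenvalues
```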
6.5 Exercises
1. Determine the eigenvalues and eigenvectors for the following matrices:

(a) [5 4; 1 2]
(b) [2 1; 1 4]

Are the obtained eigenvectors orthogonal?
2. Show that the matrix [1 1; −2 −1] has no real eigenvalues.
3. Verify that any (2×2) matrix, A = [a b; c d], has no real eigenvalues if (a − d)² + 4bc < 0.

4. Consider the following polynomial in λ:

f(λ) = Π_{i=1}^3 (λ_i − λ),

where the λ_i, i = 1,2,3, are assumed known. By expanding the right-hand side:

(a) show that f(λ) can be expressed as f(λ) = (−λ)³ + b2(−λ)² + b1(−λ) + b0; and
(b) define the coefficients, b0, b1 and b2, in terms of λ1, λ2 and λ3.

5. If the matrix A has eigenvalues λ1,...,λn, prove that A² = AA has eigenvalues λ1²,...,λn² and, in general, that A^m, m = 1,2,3,..., has eigenvalues λ1^m,...,λn^m. What are the eigenvectors of A^m?

6. Let A be a non-singular matrix and x an eigenvector of A with associated eigenvalue λ. Show that x is also an eigenvector of A^{-1}, but has associated eigenvalue equal to 1/λ.

7. Let A be an nth order square, symmetric, matrix having n eigenvalues which are all non-negative. These eigenvalues define the diagonal matrix Λ = diag(λ_i). Define the diagonal matrix Λ^{1/2} = diag(λ_i^{1/2}), where positive square roots are implied. Using the diagonalisation of A, show that we can always find a square symmetric matrix T, say, such that TT = T² = A, and give an expression for such a matrix, T. (We often write T = A^{1/2}.)

8. Let A be an arbitrary (2×2) matrix, and x an arbitrary (2×1) vector. Show directly that the scalar x'Ax turns out to be x'Ax = a11 x1² + a22 x2² + a12 x1x2 + a21 x2x1.

9. Let A = {a_ij} be an nth order square matrix (not necessarily symmetric), and let x be an arbitrary (n×1) vector. By noting that the (n×1) vector z = Ax has typical element z_i = Σ_{j=1}^n a_ij x_j, show that the scalar x'Ax = x'z can be expressed (rather tediously) as

Σ_{i=1}^n x_i z_i = Σ_{i=1}^n Σ_{j=1}^n x_i a_ij x_j = Σ_{i=1}^n a_ii x_i² + Σ_{i=1}^n Σ_{j≠i} a_ij x_i x_j.

Chapter 7

Quadratic Forms

7.1 Introductory remarks

We now investigate how the techniques of linear algebra can be used in the context of expressions which are essentially quadratic. Specifically, a quadratic form is a quadratic function of the elements in an (n×1) vector and, needless to say, such expressions often arise in statistics/econometrics. A quadratic form is a generalisation of (x1 + x2)² = x1² + x1x2 + x2x1 + x2², but where coefficients are assigned to each term on the right-hand side. Thus a quadratic form in two variables (x1, x2) is written:

F(x1, x2) = a11 x1² + a12 x1x2 + a21 x2x1 + a22 x2² = Σ_{i=1}^2 Σ_{j=1}^2 a_ij x_i x_j.

This is generalised in an obvious manner to give a QUADRATIC FORM in n variables, x = (x1,...,xn)', (n×1):

F(x) = Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j
     = a11 x1² + a12 x1x2 + ... + a1n x1xn
       + a21 x2x1 + a22 x2² + ... + a2n x2xn
       + ...
       + an1 xnx1 + an2 xnx2 + ... + ann xn².

The above scalar-valued function determines a unique value of F for any given x. Linear algebra affords a more compact notation for F(x) as follows. Let A = {a_ij}, i,j = 1,...,n, and define c = Ax, so that c_i = Σ_{j=1}^n a_ij x_j. Then we can express F(x) as

F(x) = Σ_{i=1}^n x_i c_i = x'c = x'Ax.

Quadratic forms, and the matrices that define them, are classified as follows:

1. x'Ax is positive definite (pd), and A is pd, if and only if x'Ax > 0 for all x ≠ 0.
2. x'Ax is positive semi-definite (psd), and A is psd, if and only if x'Ax ≥ 0 for all x ≠ 0.

3. Negative definite (nd) and negative semi-definite (nsd) are defined by reversing the inequality signs in (1) and (2), respectively.

4. x'Ax is indefinite otherwise (i.e., the quadratic form x'Ax can be strictly negative for some x ≠ 0 and strictly positive for other x ≠ 0, or perhaps it could also be zero).
In simple cases it can be easy to sign the quadratic form:
Example 12 Consider:

1. x'[1 0; 0 2]x
2. x'[2 0; 0 0]x
3. x'[2 0; 0 −1]x.

The quadratic form in 1 is clearly pd, 2 is psd and 3 is indefinite.
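Evaluating the forms at a few points makes the pattern concrete. A minimal sketch (assuming Python with numpy, and using the three diagonal matrices of Example 12 as reconstructed above):

```python
import numpy as np

def quadratic_form(A, x):
    # F(x) = x'Ax = sum_i sum_j a_ij x_i x_j
    return float(x @ A @ x)

A1 = np.diag([1.0, 2.0])       # positive definite
A2 = np.diag([2.0, 0.0])       # positive semi-definite
A3 = np.diag([2.0, -1.0])      # indefinite

for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    print(x, quadratic_form(A1, x), quadratic_form(A2, x), quadratic_form(A3, x))
# A1 gives strictly positive values, A2 gives non-negative values (zero at x = (0, 1)),
# and A3 changes sign: exactly the pd / psd / indefinite pattern claimed above.
```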
Example 13 In the (2×2) case, there are the following equivalent statements. A and x'Ax are said to be ...

- POSITIVE DEFINITE iff x'Ax > 0 for all x ≠ 0
  - iff a11 > 0 AND det(A) > 0
- POSITIVE SEMI-DEFINITE iff x'Ax ≥ 0 for all x ≠ 0
  - iff a11 ≥ 0 AND det(A) ≥ 0
- NEGATIVE DEFINITE iff x'Ax < 0 for all x ≠ 0
  - iff a11 < 0 AND det(A) > 0
- NEGATIVE SEMI-DEFINITE iff x'Ax ≤ 0 for all x ≠ 0
  - iff a11 ≤ 0 AND det(A) ≥ 0
- INDEFINITE, otherwise
  - iff det(A) < 0 (i.e., none of the above)

For more general matrices, A, it can be difficult to sign the associated quadratic form from simple inspection. However, other properties/characteristics of A can be of some use in this respect, and the next sub-section investigates this.

7.3 Properties of quadratic forms

In the following results, unless stated otherwise, A is a real, symmetric (n×n) matrix.

Theorem 9 A matrix A is positive definite if and only if all its eigenvalues are (strictly) positive.

Proof. To prove this we shall use the diagonalisation of A, X'AX = Λ, as defined in Chapter 6. Consider the quadratic form, w'Aw, for any w ≠ 0. The n columns of X form a basis for ℝⁿ and, since w ∈ ℝⁿ, it must be possible to express w as a linear combination of the columns of X. Therefore there exists a y ≠ 0 such that w = Xy. Making this substitution we have w'Aw = y'X'AXy = y'Λy, by the diagonalisation of A. Thus the quadratic form can be expressed as w'Aw = Σ_{j=1}^n λ_j y_j², where y = {y_j}, for some y_j ≠ 0.

1. (Sufficiency.) Assume first that λ_j > 0 for all j (positive eigenvalues) ⇒ Σ_{j=1}^n λ_j y_j² > 0 ⇒ w'Aw > 0 for all w ≠ 0 ⟺ A is positive definite.

2. (Necessity.) Now assume that w'Aw > 0 for all w ≠ 0 ⟺ A is positive definite. In particular, this will be true for the choice w = x_i, where x_i ≠ 0 is an (orthonormal) eigenvector of A with associated eigenvalue λ_i. Thus, x_i'Ax_i = x_i'(λ_i x_i) = λ_i x_i'x_i > 0. But x_i'x_i = 1 ⟹ λ_i > 0, which will be true for all eigenvalues.

The relationship between eigenvalues and psd, nd, nsd and indefinite matrices follows accordingly. Thus A is psd iff all its eigenvalues are non-negative, with at least one being equal to zero; A is nd iff all its eigenvalues are strictly negative; A is nsd iff all its eigenvalues are non-positive, with at least one being equal to zero; A is indefinite iff it has both negative and positive eigenvalues (at least one of each).
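These eigenvalue characterisations translate directly into a small classifier. A minimal sketch (assuming Python with numpy; the test matrices are those of Example 12):

```python
import numpy as np

def classify(A, tol=1e-10):
    # Sign the quadratic form of a symmetric A from its eigenvalues (Theorem 9
    # and the remarks that follow it).
    lam = np.linalg.eigvalsh(A)
    if np.all(lam > tol):
        return "positive definite"
    if np.all(lam < -tol):
        return "negative definite"
    if np.all(lam >= -tol):
        return "positive semi-definite"
    if np.all(lam <= tol):
        return "negative semi-definite"
    return "indefinite"

print(classify(np.diag([1.0, 2.0])))      # positive definite
print(classify(np.diag([2.0, 0.0])))      # positive semi-definite
print(classify(np.diag([2.0, -1.0])))     # indefinite
```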
Corollary 3 If a real, symmetric, matrix, A, is non-singular then it is either positive deÖnite or negative deÖnite or, perhaps, indeÖnite (e.g., if A is positive deÖnite then it must be non-singular, but not necessarily vice- versa).
Theorem 10 Let B be a (n×m) matrix, m ≤ n. Then

1. B'B is pd if and only if r(B) = m;
2. BB' is psd.

Proof.

1. Consider the sign of x'B'Bx for all (m×1) x ≠ 0. We can write x'B'Bx = y'y, where y = Bx is a (non-trivial) linear combination of the columns of B, and these m columns form a linearly independent set of vectors (equivalently, the columns of B form a basis for an m-dimensional sub-space of ℝⁿ). Therefore y ≠ 0 for all x ≠ 0, and so x'B'Bx = y'y > 0 for all x ≠ 0.

2. Consider x'BB'x for all (n×1) x ≠ 0. We can write x'BB'x = z'z, where z = B'x is a linear combination of the columns of B', and when m < n these n columns form a linearly dependent set of vectors. Thus there may exist an x ≠ 0 such that z = B'x = 0, in which case x'BB'x = z'z = 0. In general, though, ||z||² = z'z ≥ 0. Thus, for all x ≠ 0, x'BB'x = z'z ≥ 0.

We have the following Corollary:

Corollary 4 Suppose A is a pd matrix and B is (n×m) with m ≤ n. Then B'AB is pd if and only if r(B) = m.

Proof. Apply Theorem 12, so that B'AB = B'P'PB = C'C, where C = PB and P is non-singular. From a previous result concerning ranks (Proposition 12, Chapter 5), since P is non-singular, r(C) = r(PB) = r(B) = m ≤ n, where C is (n×m). Now simply apply Theorem 10 to C'C.
Theorem 11 If B is (n×m), r(B) = k

(λ_i > 0 for all i) iff its square root matrix is non-singular.
Proposition 18 A is positive definite if and only if B = −A is negative definite.

Proof. Trivial, left as an exercise.

Proposition 19 If A is positive definite then a_ii > 0 for all i.

Proof. Straightforward: consider e_i'Ae_i, where e_i is the ith unit vector. Since A is pd, e_i'Ae_i = a_ii > 0.

Note that the converse result is NOT necessarily true!

Proposition 20 A is positive definite iff all leading principal minors of A are positive; it is negative definite iff the leading principal minors alternate in sign, from negative to positive.

Proof. Omitted.
7.4 Projection matrices
Projection matrices lie at the heart of much econometric analysis.
DeÖnition 21 (Projection matrix) A Projection Matrix, P; is one that is BOTH idempotent, satisfying PP = P; and symmetric.
Thus, a projection matrix satisfies: P = PP = P'P = PP' = P'.
Theorem 14 P is a projection matrix if and only if its eigenvalues are zero or one.

Proof. Let P be symmetric with typical eigenvalue and eigenvector pair (λ, x). Then P = P² ⟺ PX = P²X, where X is the orthonormal (non-singular) matrix of eigenvectors ⟺ Px = P²x, where x is any column of X ⟺ λx = λ²x, since λ² is an eigenvalue of P² ⟺ λ(1 − λ)x = 0 ⟺ λ = 0 or 1.
We immediately obtain the following Corollary:
Corollary 5 Let P be a (n×n) projection matrix. Then

1. P must be positive semi-definite. Indeed, it could be positive definite; e.g., I_n is a projection matrix.
2. The rank of P is equal to its trace.

Proof.

1. Is immediate since the eigenvalues are 0 or 1.
2. Rank is equal to the number of non-zero eigenvalues. Since the eigenvalues are either 0 or 1, r(P) = Σ_{i=1}^n λ_i. But we know that Σ_{i=1}^n λ_i = tr(P), in general, so that r(P) is the same as tr(P) in this case.
7.4.1 Orthogonal projections and least squares

Consider a (3×2) matrix, X = (x1, x2), where the x_j are (3×1) column vectors, j = 1,2, and r(X) = 2, so that {x1, x2} is a linearly independent set (of vectors). Therefore {x1, x2} forms a basis for a 2-dimensional subspace of ℝ³ and this subspace is called the column space of X, which we shall denote C_X. Thus any vector in the column space of X can be constructed as Xγ, for some (2×1) vector γ. Now introduce the (3×1) vector, y, which does not lie in C_X. That is, there is no γ ∈ ℝ² such that y = Xγ. The situation is depicted in Figure 7.1 below: the red vector, x1, and the blue vector, x2, form a basis for C_X (the shaded grey plane), and the vector y sits "above" this column space.

Now consider the problem of constructing the vector p ∈ C_X such that ||y − p||, the distance between y and p, is minimised. Since p ∈ C_X, so that p = Xγ for some γ, this problem is equivalent to min_γ ||y − Xγ||², known as the least squares problem, to which we shall return in the next Chapter.

For the moment, it seems intuitively obvious that the solution for p is obtained when the dotted line, which "connects" y to p, is perpendicular (or orthogonal) to C_X (the shaded grey plane). The vector p, so defined, is given by

p = X(X'X)^{-1}X'y

and is called the "orthogonal projection of y onto the column space of X". Note that the matrix P = X(X'X)^{-1}X' is a projection matrix (unsurprisingly), and satisfies PX = X.
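The claims about P and p can be verified numerically for any full-column-rank X. A minimal sketch (assuming Python with numpy; X and y are arbitrary randomly generated examples):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2))                    # a (3 x 2) matrix with r(X) = 2
y = rng.normal(size=3)                         # a vector generally not in C_X

P = X @ np.linalg.inv(X.T @ X) @ X.T           # P = X (X'X)^{-1} X'
p = P @ y                                      # orthogonal projection of y onto C_X

print(np.allclose(P, P.T), np.allclose(P @ P, P))    # P is symmetric and idempotent
print(np.allclose(P @ X, X))                         # PX = X
print(np.allclose(X.T @ (y - p), np.zeros(2)))       # the residual y - p is orthogonal to C_X

# The same p minimises ||y - X gamma||^2: compare with a least squares solver.
gamma, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(p, X @ gamma))
```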
The idea of orthogonal projection, in this manner, generalises to the case of X, (n×m), with r(X) = m < n, and y, (n×1), such that r[X, y] = m + 1.

Figure 7.1: Orthogonal Projection

7.5 Applications in statistics

In a sense, all of the material developed in the previous sections has been leading up to a discussion of the following results.

Proposition 21 Let x be a (n×1) random vector whose elements do not exhibit any linear dependencies. Then the (n×n) matrix var(x) must be symmetric and pd. A linear dependency arises if and only if there exists at least one relationship of the form a'x = k, for some constants, k ∈ ℝ and a ≠ 0, and in that case var(x) will be psd.

Proof. Consider the following arbitrary combination of the elements of x: y = b'x, a scalar random variable, with b ≠ 0. If V = var(x), we know from previous work that var(y) = b'Vb ≥ 0. If there are no linear dependencies in x, then y is a non-degenerate random variable (meaning var(y) > 0) for all choices of b ≠ 0. In this case, and for all b ≠ 0, b'Vb > 0, implying V = var(x) is pd. However, if there is at least one linear dependency of the form a'x = k, then a'x is degenerate and var(a'x) = a'Va = 0, in which case V = var(x) is psd.
7.5.1 Linear combinations of normal random variables
Let us consider the distribution of linear combinations of random variables. It has already been shown that if x is a random vector with mean vector μ and variance-covariance matrix Σ, then y = Ax has mean vector Aμ and variance-covariance matrix AΣA', where A is a matrix of constants. If y = Ax + c, where c is a vector of constants, then E[y] = Aμ + c and var(y) = AΣA' (why?).
We shall now demonstrate the well-known result that linear combinations of normal random variables are themselves normally distributed.

Definition 22 (Multivariate normal distribution) Let the (m×1) random vector x be multivariate normal, x ~ N(μ, Σ); then it has a multivariate density and joint moment generating function (mgf) defined by, respectively,

f(x) = (2π)^{−m/2} {det(Σ)}^{−1/2} exp{ −½ (x − μ)'Σ^{-1}(x − μ) },

m_x(t) = E[exp(t'x)] = exp( t'μ + ½ t'Σt ),

where t is a vector of (dummy) constants.
Theorem 15 (Linear combination of normal random variables) Let the (m×1) vector x ~ N(μ, Σ), and let A be a (n×m) matrix of constants and c a (n×1) vector of constants. Then y = Ax + c ~ N(Aμ + c, AΣA').

Proof. We shall demonstrate the result by considering the joint mgf of y = Ax + c (although strictly speaking it is the characteristic function which uniquely identifies a distribution). The joint mgf is given by

m_y(s) = E[exp(s'y)] = E[exp(s'(Ax + c))]
       = E[exp(s'Ax + s'c)]
       = exp(s'c) · E[exp(s'Ax)]
       = exp(s'c) · m_x(A's)
       = exp(s'c) · exp( s'Aμ + ½ s'AΣA's ),

so that m_y(s) = exp( s'(Aμ + c) + ½ s'(AΣA')s ), which is the joint mgf of a normal random vector having mean Aμ + c and variance-covariance matrix AΣA'. Thus y ~ N(Aμ + c, AΣA').
Corollary 6 Suppose that b is a (m×1) vector of constants such that b'Σb = 1; then z = b'(x − μ) ~ N(0, 1).

Proof. x − μ ~ N(0, Σ), and z = b'(x − μ) is a scalar random variable. Using the above result, z ~ N(0, b'Σb). But on the assumption that b'Σb = 1, then z ~ N(0, 1).

7.5.2 Independent normal random vectors
Let x ~ N(μ, Σ), x being (m×1), where Σ is pd. Partition x, μ and Σ conformably as x = (x1', x2')', μ = (μ1', μ2')' and Σ = [Σ11 Σ12; Σ21 Σ22], where the (m_j×1) vectors x_j ~ N(μ_j, Σ_jj), and cov(x_i, x_j) = E[(x_i − μ_i)(x_j − μ_j)'] = Σ_ij = Σ_ji', i,j = 1,2. Since Σ is pd, both Σ11 and Σ22 must also be pd.

Assume that all the elements in x1 are uncorrelated with all the elements in x2; this implies that Σ12 = 0 and Σ21 = Σ12' = 0, and that the determinant and inverse of Σ are, respectively,

det(Σ) = det(Σ11) det(Σ22),   Σ^{-1} = [Σ11^{-1} 0; 0 Σ22^{-1}].
Theorem 16 If the (m_j×1) vectors x_j ~ N(μ_j, Σ_jj), j = 1,2, and cov(x1, x2) = 0, then x1 and x2 are also independently distributed. (That is, in the special case of normally distributed random variables, zero covariance also implies independence.)

Proof. Consider the joint density of x = (x1', x2')', (m×1), m = m1 + m2, which can be expressed as (using the notation and definitions introduced above)

f(x) = (2π)^{−m/2} {det(Σ)}^{−1/2} exp{ −½ (x − μ)'Σ^{-1}(x − μ) }
     = (2π)^{−m1/2} {det(Σ11)}^{−1/2} exp{ −½ (x1 − μ1)'Σ11^{-1}(x1 − μ1) }
       × (2π)^{−m2/2} {det(Σ22)}^{−1/2} exp{ −½ (x2 − μ2)'Σ22^{-1}(x2 − μ2) }
     = f1(x1) · f2(x2),

where we have explicitly used the fact that Σ21 = 0 and Σ12 = 0, and the last line indicates the product of the two (marginal) normal densities, N(μ_j, Σ_jj), j = 1,2. Thus, since the joint density of x1 and x2 is the product of the marginal (joint) densities of x1 and x2, respectively, x1 and x2 are said to be independently distributed. So when x1 and x2 are jointly normally distributed, cov(x1, x2) = 0 implies that x1 and x2 are also independent.
7.5.3 The distribution of some quadratic forms in normal random variables
Proposition 22 Let x ~ N(μ, Σ), with Σ (m×m) and pd. Then (x − μ)'Σ^{-1}(x − μ) ~ χ²_m.

Proof. Left as an exercise.
[Hint: use Theorem 12 to show that z = (P')^{-1}(x − μ) ~ N(0, I_m), where P is a non-singular matrix such that Σ = P'P. Then note that z'z = (x − μ)'Σ^{-1}(x − μ).]

Proposition 23 If y is a (q×1) random vector distributed N(0, Σ1), independently of w ~ N(0, Σ2), (m×1), then the ratio

( w'Σ2^{-1}w / m ) / ( y'Σ1^{-1}y / q )

is distributed as an F(m, q) random variable.

Proof. Follows immediately from the previous result.
Proposition 24 (Sampling distribution of the sample mean) Let x_i be iid with E[x_i] = μ and var[x_i] = σ², i = 1,...,n. Define x̄ = (1/n) Σ_{i=1}^n x_i and s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)². Then

1. E[x̄] = μ, var(x̄) = σ²/n and E[s²] = σ².
2. Further, if x_i ~ N(μ, σ²), then x̄ ~ N(μ, σ²/n) independently of (n−1)s²/σ² ~ χ²_{n−1}.

Proof. Let ι_n = (1,1,...,1)', the (n×1) vector of ones, and x = (x1,...,xn)'. Then we can write E[x] = μι_n and var[x] = σ²I_n.

1. Firstly, x̄ = ι_n'x/n and

E[x̄] = ι_n'E[x]/n = ι_n'(μι_n)/n = μ ι_n'ι_n/n = μ,

since ι_n'ι_n = n. Second,

var(x̄) = var(ι_n'x/n) = n^{-2} var(ι_n'x) = n^{-2} ι_n' σ²I_n ι_n = n^{-2} σ² ι_n'ι_n = σ²/n.

Third, define the matrix M = I_n − ι_nι_n'/n. Previous work has shown that M is a projection matrix and that Mx = {x_i − x̄}, i = 1,...,n. Moreover, Mι_n = 0, so Mx = M(x − μι_n) and, since M'M = M, we can write

(n−1)s² = Σ_{i=1}^n (x_i − x̄)² = x'M'Mx = (x − μι_n)'M(x − μι_n).

Therefore

E[(n−1)s²] = E[(x − μι_n)'M(x − μι_n)] = E[tr((x − μι_n)'M(x − μι_n))] = E[tr(M(x − μι_n)(x − μι_n)')]
           = tr(M E[(x − μι_n)(x − μι_n)']) = tr(M · var[x]) = σ² tr(M),

but

tr(M) = tr(I_n − ι_nι_n'/n) = tr(I_n) − n^{-1} tr(ι_nι_n') = n − 1.

Therefore E[(n−1)s²] = σ²(n−1), so that s² is unbiased for σ².

2. From previous work, as x̄ is a linear combination of normal random variables it must also be normally distributed, with mean μ and variance σ²/n. The distribution of v = Mx must also be normal, with mean M(μι_n) = μMι_n = 0 and variance σ²M. Thus v and x̄ must be jointly normally distributed with cov(x̄, v) = n^{-1}E[ι_n'xx'M]. Now E[xx'] = var(x) + E[x]E[x]' = σ²I_n + μ²ι_nι_n', so that

E[ι_n'xx'M] = ι_n'(σ²I_n + μ²ι_nι_n')M = σ²ι_n'M + μ²(ι_n'ι_n)ι_n'M = 0',

since ι_n'M = 0'. Therefore cov(x̄, v) is a vector of zeros, which implies that x̄ and v, or x̄ and any function of v, are independently distributed. Thus, in particular, x̄ and v'v = (n−1)s² are independent. All that remains is to establish the distribution of s².

We note that (n−1)s² = u'Mu, where u = x − μι_n ~ N(0, σ²I_n). Further, from the spectral decomposition of M we have M = XΛX', where X is the orthogonal matrix of eigenvectors of M and Λ is the diagonal matrix of eigenvalues. From Theorem 14, the eigenvalues of M are either zero or one and, since we have found that tr(M) = n − 1, there must be n − 1 eigenvalues equal to 1 and one equal to 0. Let us order these so that the first n − 1 elements in the leading diagonal of Λ are 1 and the last is 0. Now we can write

(n−1)s² = u'Mu = u'XΛX'u = z'Λz = Σ_{i=1}^{n−1} z_i²,

where z = X'u is a linear transformation of u and so is normally distributed, with mean vector E[z] = X'E[u] = 0 and covariance matrix var(z) = σ²X'X = σ²I_n, since X'X = I_n. Thus the elements of z are iid N(0, σ²) and (n−1)s²/σ² = Σ_{i=1}^{n−1} (z_i/σ)² is the sum of squares of n − 1 independent standard normal random variables. Thus (n−1)s²/σ² ~ χ²_{n−1}.
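A small Monte Carlo experiment can make Proposition 24 tangible. The sketch below (assuming Python with numpy; μ, σ, n and the number of replications are arbitrary choices) checks the stated moments, the near-zero correlation between x̄ and s², and the mean and variance implied by the χ²_{n−1} distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))      # each row is one sample of size n
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                     # ddof=1 gives the (n-1) divisor

print(xbar.mean(), mu)                         # E[xbar] = mu
print(xbar.var(), sigma**2 / n)                # var(xbar) = sigma^2 / n
print(s2.mean(), sigma**2)                     # E[s^2] = sigma^2 (unbiasedness)
print(np.corrcoef(xbar, s2)[0, 1])             # approximately zero (independence)
q = (n - 1) * s2 / sigma**2
print(q.mean(), n - 1)                         # chi-square_{n-1} has mean n-1 ...
print(q.var(), 2 * (n - 1))                    # ... and variance 2(n-1)
```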

We now generalise some of these results. Let the (n×1) random vector u be multivariate normal, u ~ N(0, σ²I_n), so that the random elements of u are independently and identically distributed (iid) as N(0, σ²) random variables. Let P and Q be two idempotent and symmetric matrices with rank(P) = trace(P) = r and rank(Q) = tr(Q) = s.
Proposition 25 u'Pu/σ² ~ χ²_r and u'Qu/σ² ~ χ²_s.

Proof. Left as an exercise.

Proposition 26 If PQ = 0, then

1. Pu is normally distributed N(0, σ²P), independently of Qu ~ N(0, σ²Q);
2. u'Pu/σ² ~ χ²_r, independently of u'Qu/σ² ~ χ²_s;
3. (u'Pu/r) / (u'Qu/s) ~ F(r, s).

Proof. Both Pu and Qu are linear combinations of normal random variables and so are themselves normally distributed, and cov(Pu, Qu) is given by E[Pu(Qu)'] = E[Puu'Q] = σ²PQ. Thus if PQ = 0, Pu and Qu are independently distributed, as are any functions of Pu and Qu; e.g., u'P'Pu/σ² = u'Pu/σ² and u'Q'Qu/σ² = u'Qu/σ², which are χ²_r and χ²_s, respectively, by Proposition 25. The ratio of two independent χ² random variables, each divided by its degrees of freedom, has an F distribution.
Proposition 27 Let b be a vector of constants such that b'b = 1 and Pb = 0. Then the random variable

T_r = b'u / √( u'Pu / r )

has a Student's t distribution on r degrees of freedom.

Proof. b'u/σ ~ N(0, 1), since b'b = 1, independently of Pu/σ since Pb = 0. Therefore, b'u/σ is independent of u'Pu/σ², which is χ²_r by Proposition 25 above. The result then follows from the standard definition of a Student-t random variable.
7.6 Exercises
1. Express the following quadratic forms in matrix-vector notation; i.e., express them as x'Ax with A a symmetric matrix.

(a) 4x1² + 8x1x2 + 6x2²
(b) 3x1² + x2² + 5x3² + 7x1x2 + 6x1x3 + 9x2x3
(c) 6x1² + 9x2².

Determine whether the above quadratic forms are positive or negative definite, positive or negative semi-definite, or indefinite.

2. Prove the following:
(a) If A is positive definite THEN there exists a non-singular matrix P such that A = P'P.
(b) If A = P'P, with P non-singular, THEN A is positive definite.

3. Let Σ be a symmetric positive definite variance-covariance matrix for the (n×1) multivariate normal random vector u, where E(u) = 0. Let X be the (n×n) matrix of orthonormal eigenvectors of Σ and Λ = diag(λ_i) the diagonal matrix of eigenvalues.
(a) Show that Σ = W² = WW, where W = XΛ^{1/2}X' and Λ^{1/2} = diag(√λ_i), and verify by matrix multiplication that W^{-1} = XΛ^{-1/2}X' and W^{-1}ΣW^{-1} = I_n.
(b) Show that z = W^{-1}u has a multivariate normal distribution, N(0, I_n).
(c) Hence show that u'Σ^{-1}u has a chi-squared distribution with n degrees of freedom.

4. Let H be an orthogonal matrix of constants and let u be a (n×1) random vector with zero mean vector and variance-covariance matrix equal to σ²I_n. Consider the random vector w = Hu, with typical element w_i = Σ_{j=1}^n h_ij u_j, i = 1,...,n.
(a) Determine E[w] and var[w]; i.e., the mean vector for w and its variance-covariance matrix.
(b) Suppose u has a multivariate normal distribution. Let P be any (n×n) idempotent and symmetric matrix with rank r(P) = r ≤ n. Show that u'Pu/σ² has a χ² distribution with r degrees of freedom.
[Things to note: P has r eigenvalues equal to one and n − r equal to zero. Since P is symmetric it can be decomposed as P = XΛX', where X is the orthogonal matrix of eigenvectors of P and Λ is the diagonal matrix of eigenvalues. Construct z = X'u/σ and consider its distribution; then show that u'Pu/σ² can be expressed as Σ_{i=1}^r z_i².]

The following is a question on determinants of block-partitioned matrices, and does not relate directly to the lecture material on Quadratic Forms. You may find the following results useful:

Let V = [V11 0; 0 V22], (n×n), where V11 and V22 are both square but not necessarily of the same order. V can be expressed as the following matrix product

V = [V11 0; 0 I] [I 0; 0 V22] = AB, say,

where the two identity matrices may not be of the same order. By considering, for example, det(B) as an expansion by cofactors along its first row, we must conclude that det(B) = det(V22). Similarly, det(A) = det(V11). Since we know that det(AB) = det(A) · det(B), it must be that det(V) = det(V11) · det(V22).

5. Let the (n×n) matrix A be partitioned as A = [A11 A12; 0 A22], where A11 is (m×m), A22 is (p×p), A12 is (m×p) and 0 is a matrix of zeros, with m + p = n.
(a) Show that A = [I_m A12; 0 A22] [A11 0; 0 I_p] = BC, say.
(b) Explain why det(B) = det(A22).


Chapter 8
Matrix Di§erential Calculus and Optimisation
8.1 Introductory remarks and notation
Often in econometrics, especially when considering methods of estimation (e.g., ordinary least squares, the generalised method of moments or maximum likelihood), we shall have to consider not only functions of scalars, but also functions of vectors and possibly functions of matrices. Moreover, such functions can return scalars, vectors or matrices. The estimation of an econometric model involves the search for a suitable value of a parameter vector, given the data. The criterion for choosing such a value usually involves optimising some scalar function of this parameter vector; e.g., minimising a residual sum of squares or maximising a log-likelihood over the set of all possible parameter values (the parameter space).
Consider a variable that we wish to model, y say, for which we can gather a sample of n observations {y1,...,yn}. Often it is assumed that the y_i are iid; but suppose the mean of each of the y_i depends on the values taken by another set of k variables. Such a specification might be

E[y_i] = x_i'β = Σ_{j=1}^k x_ij β_j,

where x_i' = (x_i1,...,x_ik) denotes the ith observation on a set of (so-called) k regressors, whose values are assumed to influence the mean of y_i, and β = (β1,...,βk)' is an unknown (k×1) parameter vector. For example, y_i might denote the earnings of individual i sampled randomly from some population and x_i might include information concerning that individual's characteristics (age, educational background, working experience, etc.). Since y_i is random, it will deviate from x_i'β, and a commonly used method to estimate β is to minimise the sum of squared deviations,

S(β) = Σ_{i=1}^n (y_i − x_i'β)²,

over all possible values of β. This is called ordinary least squares and involves obtaining all the k partial derivatives of S(β)

with respect to the elements in β. The adoption of simple rules makes such a procedure less daunting, and more elegant, than it might otherwise be.
Although, as statisticians or econometricians, we shall wish to maximise or minimise some function of a parameter vector, we shall develop the ideas by keeping the discussion general and adopting the following notation:
Notation 1 (Real-valued function of a scalar) y = f(x), x ∈ ℝ, y ∈ ℝ;
e.g., y = x² or y = exp(x²).

Notation 2 (Real-valued function of a vector) y = f(x1,...,xn), written as f(x), x ∈ ℝⁿ;
e.g., y = a'x, y = x'x or y = x'Ax, where a is a (n×1) vector of constants and A is a (n×n) matrix of constants. Note that we prefer f(x) = f(x1,...,xn), rather than f(x').

Notation 3 (Vector function of a scalar) y = f(x), x ∈ ℝ, y ∈ ℝᵐ;
e.g., y = xa, where a is a (m×1) vector of constants.

Notation 4 (Vector function of a vector) y = f(x), x ∈ ℝⁿ, y ∈ ℝᵐ;
e.g., y = Ax, A a (m×n) matrix of constants.

Remark 2 In this last case we have y = (y1,...,ym)' = (f1(x),...,fm(x))' = f(x), so that each element in y is a real-valued function of x; that is, y_i = f_i(x) = a_i'x, where a_i' is the ith row of A. The f_i are sometimes referred to as the component functions of f.

There are other cases we could consider. For example, f(X) = det(X), or f(X) = tr(X), for a square matrix X; f(X) = λ, where λ is the vector of eigenvalues of X; F(x) = xx', for a vector x; or F(X) = X^{-1}, for non-singular X.
We shall focus on scalar and vector functions of x, (n×1), for which we wish to construct vectors (or matrices) which contain the partial derivatives of the function with respect to the n elements x' = (x1,...,xn).

8.2 The Gradient vector and Jacobian matrix

We shall consider two common situations. First let us provide a definition of a function.

Definition 23 (Function) Let S and T be two sets. If with each element x ∈ S there is associated exactly one element y ∈ T, denoted f(x), then f is said to be a function from S to T. We express this as f : S → T, meaning that f is defined on S, with values in T. Then S is called the domain of f. The range of f is the set of all values of f, which is given by {y : y = f(x), x ∈ S} and which is a subset of T.

8.2.1 Real-valued function of a vector
In this case we have y = f(x), for x being (n×1). Specifically, this is understood to be a real-valued function defined as f : ℝⁿ → ℝ or, if the domain of f is constrained so that we only consider x ∈ S ⊆ ℝⁿ, then f : S → ℝ.

In this case, all n partial derivatives are collected together in a derivative vector:

Definition 24 (Derivative vector) Let f : S → ℝ be a differentiable real-valued function defined on a set S in ℝⁿ. Then the derivative vector is defined as

∂f/∂x' = Df(x) = ( ∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn ),

which is a (1×n) row vector.

Note that the derivative vector is a row of partial derivatives. The gradient vector is defined as ∇f(x) = (Df(x))', where ∇ is pronounced 'del' and is the column vector of partial derivatives; often this column vector will be written g(x) = ∇f(x), indicating a column vector of partial derivatives, and 'g' is for gradient.
Example 14

1. f = a'x = Σ_{i=1}^n ai xi. In this case, ∂f/∂xj = aj, so that
   Df(x) = ∂f/∂x' = ( ∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn ) = (a1, ..., an) = a'.

2. f = x'x = Σ_{i=1}^n xi². In this case, ∂f/∂xj = 2xj, so that Df(x) = 2x'.

3. f = ln(x'x), x ≠ 0. Using the Chain Rule, we get ∂f/∂xj = (1/Σ_{i=1}^n xi²)·2xj, so that Df(x) = 2x'/‖x‖².

Having differentiated once to get g(x), we may wish to differentiate again. We now discuss, in general, differentiating a column vector (of real-valued functions).
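As a quick numerical sanity check of Example 14 (a sketch, not part of the original derivation; Python with numpy is assumed, and the particular vector x is illustrative), the derivative vector can be approximated by central finite differences and compared with the analytic result Df(x) = 2x' for f(x) = x'x.

    import numpy as np

    def numerical_Df(f, x, h=1e-6):
        # approximate the (1 x n) derivative vector by central differences
        Df = np.zeros(x.size)
        for j in range(x.size):
            e = np.zeros(x.size)
            e[j] = h
            Df[j] = (f(x + e) - f(x - e)) / (2 * h)
        return Df

    x = np.array([1.0, -2.0, 3.0])
    f = lambda v: v @ v                      # f(x) = x'x
    print(numerical_Df(f, x))                # approximately [ 2. -4.  6.]
    print(2 * x)                             # analytic Df(x) = 2x'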
8.2.2 Vector function of a vector

In this case we have y = f(x), for x being (n × 1) and y being (m × 1). Specifically, this is understood to be a vector function defined as f : R^n → R^m or, if the domain of f is constrained so that we only consider x ∈ S ⊆ R^n, then f : S → R^m. Note that each element of y, denoted yi, is a real-valued function of x; i.e., yi = fi(x), where f = (f1, ..., fm).

Here we arrange all the partial derivatives in a (m × n) matrix. This matrix is obtained by taking each of the m elements of f = (f1, ..., fm) in turn, obtaining the (1 × n) derivative vector, and then stacking these m row vectors on top of one another. By arranging the mn partial derivatives in this way we obtain the Jacobian matrix:
Definition 25 (Jacobian Matrix) Let f : S → R^m be a differentiable vector function defined on a set S in R^n. Then the Jacobian matrix is defined as

    ∂f/∂x' = Df(x) = [ ∂f1/∂x' ]   [ Df1(x) ]
                     [    ...   ] = [   ...   ]
                     [ ∂fm/∂x' ]   [ Dfm(x) ]

which is a (m × n) matrix with typical element ∂fi/∂xj, i = 1,...,m, j = 1,...,n.

Definition 26 (Jacobian) In the above, if m = n, then det(Df(x)) is called simply the Jacobian.

The gradient, in this case, is a matrix of partial derivatives: ∇f(x) = (Df(x))', which is (n × m).
Example 15 f = Ax, so that fi = ai'x = Σ_{k=1}^n aik xk, where ai' is the ith row of A = {aik}, i = 1,...,m, k = 1,...,n. Here ∂fi/∂xj = aij and ∂fi/∂x' = ai', so that Df(x) = A.
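A similar finite-difference sketch (again assuming numpy; the matrix A and point x below are illustrative) confirms that for the vector function f(x) = Ax the Jacobian Df(x) equals A, built row by row as in Definition 25.

    import numpy as np

    def numerical_jacobian(f, x, h=1e-6):
        # stack the m derivative vectors (one per component function) into an (m x n) matrix
        m, n = f(x).size, x.size
        J = np.zeros((m, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = h
            J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
        return J

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, -1.0, 4.0]])         # (2 x 3) matrix of constants
    f = lambda v: A @ v                       # f : R^3 -> R^2
    x = np.array([0.5, 1.5, -2.0])
    print(np.allclose(numerical_jacobian(f, x), A))    # True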
8.3 Using differentials to identify the derivative

In the previous section, we gave some examples of the derivative vector, or Jacobian matrix, for some specific functions. For other functions it might be quite difficult to unravel the matrix/vector expression and then partially differentiate. However, if we use the notion of vector differentials then matrix differential calculus becomes much easier.
In general, suppose we have y = f(x), where f : S → R^m is a vector function defined on a set S in R^n. We consider small perturbations in the vector x, denoted dx, which result in small perturbations in f, denoted df. The vector dx signifies that each element of x is subject to a different small perturbation, so that dx is NOT proportional to x. There is a similar understanding for df.
These perturbation vectors are called differentials, or differential operators, which can easily be applied and follow the familiar rules of univariate (real-valued) differential calculus, i.e., the product and chain rules. (Note that if f : S → R is a real-valued function defined on a set S in R^n, then we write df, indicating a real-valued differential.)
Proposition 28 (Properties of differentials) Differentials are linear operators. For any "constants" α ∈ R, a ∈ R^n and A ∈ R^{m×n}, a matrix, and variables x and y, differentials obey the following rules:

1. d(αx) = α dx
2. d(xa) = a dx
3. d(a'x) = a'dx
4. d(Ax) = A dx
5. dα = 0, da = 0
6. dx' = (dx)'
7. d(x + y) = dx + dy

Remark 3 Notice the change from the vector differential d(xa) on the left hand side of 2 to the scalar differential dx on the right hand side, and the change from the scalar differential d(a'x) to the vector differential dx in 3.
In our quest for the derivative vector, or Jacobian matrix, we apply differentials using the above rules. The result leads directly to the identification of the derivative, as follows.
8.3.1 Identifying the derivative
The following table will help us to identify the derivative of various functions, based on the results that are obtained after differentials are applied.
    Function                          Differential      Derivative
    1. y = f(x),  f : R   → R         df = w dx         df/dx  = w
    2. y = f(x),  f : R^n → R         df = w'dx         ∂f/∂x' = w'
    3. y = f(x),  f : R   → R^m       df = w dx         ∂f/∂x  = w
    4. y = f(x),  f : R^n → R^m       df = W dx         ∂f/∂x' = W
in which the scalar w, the vector w and the matrix W are possibly functions of x (or of the scalar x). For example, if we have a real-valued function of x, f(x) say, such that when differentials are applied we find that df is of the form df = w'dx, then we can conclude that the derivative vector must be ∂f/∂x' = w', which itself may be a (vector) function of x.
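The identification step can also be illustrated numerically (a sketch, assuming numpy; the function and the perturbation are illustrative): once the differential of f(x) = x'x has been identified as df = 2x'dx, the change f(x + dx) - f(x) should be well approximated by 2x'dx for a small perturbation dx.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, -1.0])
    dx = 1e-5 * rng.standard_normal(3)        # a small perturbation, not proportional to x

    f = lambda v: v @ v                        # f(x) = x'x
    print(f(x + dx) - f(x))                    # exact change in f
    print(2 * x @ dx)                          # w'dx with w' = 2x'; agrees to high accuracy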
The product and chain rule can also be applied in an intuitively obvious
manner as the following examples illustrate.
Definition 27 (Product rule) Suppose f : R^n → R^m and g : R^n → R^m are both differentiable and define y = f'g, y : R^n → R. Then dy = (df)'g + f'dg = g'df + f'dg.

Definition 28 (Chain rule) Suppose f : R^p → R^m and z : R^n → R^p are both differentiable and define y = f(z), where z = z(x), x being (n × 1). Then dy = (∂f/∂z') dz = Df(z) dz.
Example 16

1. Consider the real-valued function f(x) = x². Writing this as f(x) = x · x, we can use the product rule and obtain differentials as df = x·dx + dx·x = 2x dx. From the above identification table, result 1, we infer that df/dx = 2x, as we already know! If f(x) = ax², df = a d(x²) = a(2x dx), so that df/dx = 2ax.

2. f(x) = a'x: df = d(a'x) = a'dx  ⇒  ∂f/∂x' = a' and ∇f(x) = a.

3. f(x) = x'x: the product rule gives df = x'dx + (dx)'x = 2x'dx  ⇒  ∂f/∂x' = 2x' and ∇f(x) = 2x.

4. f(x) = Ax: df = A dx  ⇒  ∂f/∂x' = A and ∇f(x) = A'.

5. f(x) = x'Ax: df = x'A dx + (dx)'Ax = x'(A + A')dx  ⇒  ∂f/∂x' = x'(A + A'). If A = A', then ∂f/∂x' = 2x'A and ∇f(x) = 2Ax.

6. f(x) = ln(x'x), x ≠ 0: applying the chain rule, df = (x'x)⁻¹ d(x'x) = ‖x‖⁻² 2x'dx  ⇒  ∂f/∂x' = 2x'/‖x‖² and ∇f(x) = 2x/‖x‖².

7. f(x) = exp(x'Ax), A = A': df = exp(x'Ax) · d(x'Ax) = f(x) · 2x'A dx  ⇒  ∂f/∂x' = 2f(x) · x'A and ∇f(x) = 2f(x) Ax.

8. f(x) = xa, x ∈ R, a (m × 1): df = a dx  ⇒  ∂f/∂x = a, (m × 1), and ∇f(x) = a'.
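Two of the chain-rule results in Example 16 can be checked numerically (a sketch, assuming numpy; the symmetric matrix A and the point x are illustrative): the gradients of ln(x'x) and exp(x'Ax) obtained by finite differences should match 2x/‖x‖² and 2 exp(x'Ax)Ax respectively.

    import numpy as np

    def num_grad(f, x, h=1e-6):
        # central-difference approximation to the gradient (column) vector
        g = np.zeros_like(x)
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = h
            g[j] = (f(x + e) - f(x - e)) / (2 * h)
        return g

    x = np.array([0.5, -1.0, 2.0])
    A = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])           # symmetric by construction

    f6 = lambda v: np.log(v @ v)               # Example 16, item 6
    f7 = lambda v: np.exp(v @ A @ v)           # Example 16, item 7

    print(np.allclose(num_grad(f6, x), 2 * x / (x @ x)))                  # True
    print(np.allclose(num_grad(f7, x), 2 * np.exp(x @ A @ x) * (A @ x)))  # True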
8.4 Unconstrained optimisation
We shall be concerned with optimising a real-valued function of a vector, f(x), x' = (x1,...,xn), f : R^n → R. The (n × 1) gradient vector, defined above, is g(x) = ∇f(x) = (Df(x))' with typical element gi(x) = ∂f/∂xi, i = 1,...,n.

We shall assume that f(x) is a suitably smooth real-valued function of x, with continuous first and second order partial derivatives, and that x can be any point in R^n. We consider first the standard definitions of global and local optima.
8.4.1 Global and local optima
Definition 29 (Global maximum) A global maximum of f(x) is a point (x*, f(x*)) which satisfies

    f(x) ≤ f(x*)

for all x ≠ x*, and x* is called a global maximiser. If the inequality is STRICT, f(x*) > f(x), then we have a strict (or unique) global maximum and x* is the strict (or unique) global maximiser.

The definition of a global minimiser follows accordingly, by reversing the inequality.

A local maximum of f(x) is a point (x*, f(x*)) which satisfies f(x) ≤ f(x*) for all x ≠ x* in a small neighbourhood of x*. The point x* is called a local maximiser. If the inequality is STRICT, f(x*) > f(x), then we have a strict (or unique) local maximum and x* is the strict (or unique) local maximiser.

Note that the above inequality is only required to hold "close" or "local" to x*, that is, in a small neighbourhood of x*. We now make the notion of a small neighbourhood more precise.
Definition 30 (Neighbourhood) Let c ∈ R^n and ε ∈ R, with ε > 0. The set of all points x ∈ R^n whose distance from c is less than ε is called an n-ball of radius ε and centre c, denoted B(c, ε), and this set of points defines a neighbourhood of c:

    B(c, ε) = { x : x ∈ R^n  &  ‖x - c‖ < ε }.

For example, in R² a neighbourhood of the point c = (c1, c2)' would be all the points lying inside a circle of radius ε centred on c. In R³, B(c, ε) would define all points lying in a sphere (ball) of radius ε centred at c. Notice that B(0, ε) defines the set of vectors x such that ‖x‖ < ε. We now give the formal definition of a local maximum (a local minimum obtains by reversing the inequality):

Definition 31 (Local maximum) A local maximum of f(x) is a point (x*, f(x*)) for which there exists an ε > 0 (no matter how small) such that

    f(x) ≤ f(x*)

for all x ∈ B(x*, ε). The point x* is called a local maximiser. If the inequality is STRICT, f(x*) > f(x), then we have a strict (or unique) local maximum and x* is the strict (or unique) local maximiser.
Remark 4 A local maximum (minimum) may also be a global maximum (minimum) but not necessarily.
Because we are assuming that f(x) is suitably smooth we can search for these optima using stationary points.
8.4.2 Stationary Points
A stationary point for a real-valued function, f, of a vector is where all partial derivatives of f vanish:

Definition 32 (Stationary point) Let f : R^n → R be a differentiable real-valued function. A stationary point is defined as (x*, f(x*)), where x* satisfies g(x*) = ∇f(x*) = 0.

In fact, it is useful to see how this definition of a stationary point can be constructed, as follows.
Let f : R^n → R be a differentiable real-valued function, let x* be any point in R^n and define the real-valued function φ(t) = f(x* + tz), for some z ∈ R^n, z ≠ 0. Then φ(0) ≡ f(x*). When n = 2, x* + tz traces out a straight line path, in the x1-x2 plane, through the point x*' = (x1*, x2*) with slope z2/z1, where z' = (z1, z2). This path, x* + tz, is traced out by allowing t to vary over (-∞, ∞); i.e., points along the path are given by (x1, x2) = (x1* + tz1, x2* + tz2), for t ∈ (-∞, ∞), with x1 = x1(t) and x2 = x2(t), so that the locus of points (x1, x2) satisfies x2 - x2* = (z2/z1)(x1 - x1*). By fixing the vector z, we fix the particular path along which φ(t) is constructed. By changing z, we generate a number of paths along which φ(t) can be constructed.
For example, consider straight line paths through the point x*' = (1, 4): when z1 = 0, the path is given by x1 = 1; when z2 = 0, the path is given by x2 = 4; when z' = (2, 3), the path is given by x2 - 4 = (3/2)(x1 - 1). The concept of these paths generalises quite naturally when n > 2.
Now, from basic univariate calculus we know how to characterise a stationary point of φ(t): (t*, φ(t*)) is a stationary point of φ(t) if and only if

    dφ(t*)/dt = dφ(t)/dt |_{t=t*} = 0.

In our case φ(t) = f(x* + tz) and φ(0) = f(x*) for all z ≠ 0. That is, φ(0) = f(x*) along all possible straight line paths which pass through x*. If (0, φ(0)) is also a stationary point of φ(t) for all z ≠ 0, then (correspondingly) (x*, f(x*)) is a stationary point of f(x). The converse is also true. To see what this means we can use the chain rule to differentiate φ(t) = f(x* + tz). By writing x = x* + tz, we have

    dφ = (∂f/∂x') dx = (∂f/∂x') z dt,

so that

    dφ(0)/dt = ∇f(x*)'z;

but dφ(0)/dt = 0 for all z ≠ 0 if and only if ∇f(x*)'z = 0 for all z ≠ 0, which occurs if and only if g(x*) = ∇f(x*) = 0. This provides the definition of a stationary point of f(x) given above.
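The identity dφ(0)/dt = ∇f(x*)'z can be checked numerically at any point, not just at a stationary one (a sketch, assuming numpy; the function used is the Figure 8.1 example as reconstructed later in these notes, and the point and direction are those of the path example above).

    import numpy as np

    f = lambda v: v[0]**3 - v[0]**2 - v[1]**2 + 10.0       # the Figure 8.1 function (see below)
    grad_f = lambda v: np.array([3*v[0]**2 - 2*v[0], -2*v[1]])

    x0 = np.array([1.0, 4.0])                 # the point used in the path example
    z = np.array([2.0, 3.0])                  # one particular direction

    phi = lambda t: f(x0 + t * z)
    h = 1e-6
    dphi0 = (phi(h) - phi(-h)) / (2 * h)       # numerical d phi(0)/dt
    print(dphi0, grad_f(x0) @ z)               # both approximately -22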
As we shall see, and as in the univariate case, a stationary point can give a maximum or minimum (global/local), as defined above, or a saddle point, where f(x) can be both bigger and smaller than f(x*) in any neighbourhood of an x* satisfying g(x*) = ∇f(x*) = 0. Formally, we have:

Definition 33 (Saddle point) Let g(x*) = ∇f(x*) = 0. If, no matter how small δ > 0 is taken, the neighbourhood B(x*, δ) contains points x with f(x) > f(x*) and points x with f(x) < f(x*), then (x*, f(x*)) is a saddle point.

Remark 5 A saddle point is a stationary point which is neither a local maximum nor a local minimum.

The characterisation of stationary points into maxima, minima (global/local) or saddle points is achieved through an examination of the stationary points of φ(t). This requires an investigation of d²φ(t)/dt², which necessitates the introduction of the Hessian matrix of f(x).

8.4.3 Hessian matrix

Definition 34 (Hessian matrix) Let f : R^n → R be a twice differentiable real-valued function. The Hessian matrix, denoted H(x), is defined as the Jacobian (derivative) matrix of g(x) = ∇f(x):

                                  [ ∂g1/∂x' ]   [ ∂g1/∂x1  ∂g1/∂x2  ...  ∂g1/∂xn ]
    H(x) = Dg(x) = ∂g(x)/∂x'  =   [    ...   ] = [ ∂g2/∂x1  ∂g2/∂x2  ...  ∂g2/∂xn ]
                                  [ ∂gn/∂x' ]   [    ...       ...            ...  ]
                                                [ ∂gn/∂x1  ∂gn/∂x2  ...  ∂gn/∂xn ]

Since ∂gi/∂xj = ∂(∂f/∂xi)/∂xj = ∂²f/∂xj∂xi, the Hessian matrix arranges these second order partial derivatives in the following way:

           [ ∂²f/∂x1²     ∂²f/∂x2∂x1   ...  ∂²f/∂xn∂x1 ]
    H(x) = [ ∂²f/∂x1∂x2   ∂²f/∂x2²     ...  ∂²f/∂xn∂x2 ]
           [     ...           ...                 ...   ]
           [ ∂²f/∂x1∂xn   ∂²f/∂x2∂xn   ...  ∂²f/∂xn²    ]

Remark 6 Note that the Hessian is defined so that H(x) = {hij} = {∂²f/∂xj∂xi}, rather than hij = ∂²f/∂xi∂xj, which might seem more natural. Firstly, defining {hij} = {∂²f/∂xj∂xi} emphasises that we first differentiate f with respect to xi and then with respect to xj, but this doesn't really matter since (apart from pathological cases) second order partial derivatives are symmetric. Second, the Hessian is defined by constructing the Jacobian (derivative) matrix of g(x) = ∇f(x), and this gives a unifying feature for the way we arrange second partial derivatives in more general cases (e.g., for the case f : R^n → R^m). However, somewhat confusingly, for the f : R^n → R case we also write H(x) = ∂²f(x)/∂x∂x'.

8.4.4 Characterising stationary points

In this section, we generate necessary and then sufficient conditions which characterise the nature of stationary points. They are not the same!

Necessary conditions for local maxima and minima

Here we ask the question: "If (x*, f(x*)) is a local maximum of f(x), what can we conclude?" That is, we consider what necessarily must be true if (x*, f(x*)) is a local maximum. Firstly, if f(x) is differentiable at x*, then it is immediate that ∇f(x*) = 0. This can be seen by defining the univariate function φ(t), as previously, so that (0, φ(0)) is a local maximum of φ(t), for all z ≠ 0. From univariate calculus, it must therefore be that dφ(ε)/dt ≤ 0 and dφ(-ε)/dt ≥ 0, for all sufficiently small ε > 0 and all z ≠ 0. We conclude, by smoothness of φ(t), that dφ(0)/dt = 0, for all z ≠ 0, implying ∇f(x*) = 0. Moreover, if (x*, f(x*)) is a local maximum then, as you move away from (x*, f(x*)) in any direction, f(x) can never increase; this follows from the fact that dφ(ε)/dt ≤ 0 and dφ(-ε)/dt ≥ 0, for all sufficiently small ε > 0 and all z ≠ 0. Thus, if f(x) is twice differentiable at x*, then so is φ(t) at t = 0, and d²φ(0)/dt² ≤ 0, implying that H(x*) is nsd.
Under the same differentiability conditions, if (x*, f(x*)) is a local minimum, then ∇f(x*) = 0 and H(x*) must be psd. We have demonstrated the following proposition:

Proposition 29 Let f : R^n → R have either a local maximum or a local minimum at x*, with f(x) being twice differentiable at x*.

1. If (x*, f(x*)) is a local maximum, then ∇f(x*) = 0 and H(x*) must be nsd.

2. If (x*, f(x*)) is a local minimum, then ∇f(x*) = 0 and H(x*) must be psd.

Remark 7 Note that this result tells us something about the Hessian at a local maximum (or minimum). It DOES NOT tell us that a stationary point is a local minimum (maximum) when the Hessian, at this point, is psd (nsd), as the example of f(x) = x1³ + x2², in the next section, illustrates.
In order to identify whether a stationary point is a local maximum, minimum (or neither) we look to sufficient conditions.

Sufficient conditions for local optima

Consider again the real-valued function φ(t) = f(x* + tz), as discussed previously, but where now all we assume is that (x*, f(x*)) is a stationary point of f(x), so that ∇f(x*) = 0, and that f(x) is twice differentiable at x*. The question now is: "How can we test whether (x*, f(x*)) is a local maximum, local minimum or a saddle point?" Later, we shall also address the issue of whether (x*, f(x*)) can be identified as a global maximum or minimum, but we first deal with identifying local optima.
As noted above, (0, φ(0)) is a stationary point of φ(t) for all z ≠ 0, with dφ(0)/dt = 0, and we do have sufficient conditions which characterise stationary points of a univariate function such as φ(t). For example, consider a particular path, x* + tz, along which φ(t) can be defined: this means fixing z and letting t vary for this given z. If d²φ(0)/dt² < 0 for this particular path, then dφ(t)/dt is (strictly) falling (from being strictly positive, to zero, to strictly negative) as t passes through 0 (from negative to positive), which means that (0, φ(0)) is a (strict) local maximum along this particular path. That is, φ(t) falls/decreases, along this path, as soon as t moves from zero. Now, thinking about how φ(t) is constructed (imagine the n = 2 case), it must be that (0, φ(0)) = (x*, f(x*)) would also correspond to a strict local maximum of f(x) if φ(t) were to strictly decrease, as soon as t moves from zero, along all possible paths passing through x* (i.e., for all possible choices of z). That is to say, IF d²φ(0)/dt² < 0 for all z ≠ 0, THEN (x*, f(x*)), which satisfies ∇f(x*) = 0, is a strict local maximum. Now, since (by the chain rule)

    dφ(t)/dt = ∇f(x* + tz)'z = z'g(x* + tz),

another application of the chain rule gives

    d²φ(t)/dt² = z' (∂g(x* + tz)/∂x') z,

since dg(x* + tz) = (∂g/∂x') z dt. Therefore, for arbitrary z,

    d²φ(0)/dt² = z'H(x*)z,

from the definition of the Hessian matrix. We can thus see that d²φ(0)/dt² < 0 for all z ≠ 0 if and only if H(x*) is negative definite. Thus, IF H(x*) is negative definite, THEN (x*, f(x*)), which satisfies ∇f(x*) = 0, is a strict local maximum. Reversing the inequalities in the above arguments, we get a sufficient condition for a strict local minimum: IF H(x*) is positive definite, THEN (x*, f(x*)), which satisfies ∇f(x*) = 0, is a strict local minimum.

A further scenario needs attention. As above, we assume that (x*, f(x*)) is a stationary point of f(x), so that (0, φ(0)) is a stationary point of φ(t) for all z ≠ 0. Suppose, however, that φ(t) strictly falls/decreases along certain paths (i.e., for some choices of z) but strictly increases along other paths. That is to say, for some choices of z, d²φ(0)/dt² < 0, whilst for others d²φ(0)/dt² > 0. Again, thinking about this in the n = 2 case, we see that we have a saddle point; that is, a stationary point which is neither a local maximum nor a local minimum. In terms of the Hessian matrix, z'H(x*)z is (strictly) positive for some z ≠ 0 and (strictly) negative for others, so that H(x*) is indefinite (with det H(x*) ≠ 0).

In all three of the above cases det H(x*) ≠ 0: either all the eigenvalues of H(x*) are non-zero and strictly negative (x* is a strict local maximiser), or all the eigenvalues of H(x*) are non-zero and strictly positive (x* is a strict local minimiser), or all the eigenvalues of H(x*) are non-zero but not all of the same sign (x* corresponds to a saddle point).

Finally, we must acknowledge an area of uncertainty which arises when det H(x*) = 0. This occurs if, and only if, at least one eigenvalue of H(x*) is zero, so that H(x*) is singular. In this case, for some paths (i.e., choices of z ≠ 0) the second derivative is d²φ(0)/dt² = z'H(x*)z = 0. But when d²φ(0)/dt² = 0, we know from univariate calculus that t = 0 could locate a maximum (e.g., for φ(t) = -t⁴), a minimum (e.g., for φ(t) = t⁴) or a saddle (e.g., for φ(t) = t³), for φ(t). Thus along such paths x* could be a local maximum, local minimum or saddle point.
We thus have the following proposition:
Proposition 30 (Characterisation of a stationary point) Let f : R^n → R be twice differentiable at x*, such that ∇f(x*) = 0, implying that (x*, f(x*)) is a stationary point. Then:

1. if H(x*) is nd (resp. pd), THEN (x*, f(x*)) is a STRICT LOCAL MAXIMUM (resp. MINIMUM) point;

2. if det H(x*) ≠ 0 and H(x*) is INDEFINITE, THEN x* is a SADDLE point;

3. if det H(x*) = 0, THEN (x*, f(x*)) COULD BE a LOCAL MAXIMUM, MINIMUM or SADDLE point.

Remark 8 For both cases 1 and 2, H(x*) is non-singular. For case 2 this is explicit; for case 1 it is implied.

Remark 9 Although H(x*) being nsd is a necessary characteristic of a local maximum, (x*, f(x*)), it is not a sufficient condition for (x*, f(x*)), satisfying ∇f(x*) = 0, to be a local maximum.

Remark 10 It is useful to emphasise that a general definition of a saddle point is a stationary point which is neither a local minimum nor a local maximum.
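Proposition 30 can be applied mechanically by inspecting the eigenvalues of the Hessian at a stationary point. The sketch below (assuming numpy) uses the analytic Hessians of the functions discussed in the Further Examples section, as reconstructed in these notes.

    import numpy as np

    def classify(H, tol=1e-10):
        # classify a stationary point from the (symmetric) Hessian evaluated there
        eig = np.linalg.eigvalsh(H)
        if np.any(np.abs(eig) < tol):
            return "det H = 0: could be a local maximum, minimum or saddle"
        if np.all(eig < 0):
            return "strict local maximum"
        if np.all(eig > 0):
            return "strict local minimum"
        return "saddle point"

    print(classify(np.array([[-2.0, 0.0], [0.0, -2.0]])))   # Figure 8.1 function at (0, 0)
    print(classify(np.array([[ 2.0, 0.0], [0.0, -2.0]])))   # Figure 8.1 function at (2/3, 0)
    print(classify(np.array([[ 0.0, 0.0], [0.0,  2.0]])))   # Figure 8.3 function at (0, 0)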
Sufficient conditions for global optima

Now suppose that H(x) is negative definite for all x, not just at the x* where (x*, f(x*)) is a stationary point of f(x). Again, define φ(t) = f(x* + tz), z ≠ 0, so that (0, φ(0)) is a stationary point of φ(t) along this path. Then, for all z ≠ 0, d²φ(t)/dt² = z'H(x* + tz)z must be strictly negative for all t. From univariate optimisation results, this implies that (0, φ(0)) is the (unique) strict global maximum of φ(t). Consequently, (x*, f(x*)) must be the strict global maximum of f(x). Reversing the signs, if H(x) is positive definite for all x, then any stationary point must be the (unique) strict global minimum of f(x).
If H(x) is globally nsd (resp psd) then any stationary point is a global maximum (resp minimum), but it may not be unique.
Putting all of these thoughts together, we can articulate the following sufficient conditions for identifying the nature of a stationary point:

Proposition 31 (Global Optima) Let f : R^n → R be twice differentiable for all x. Then:

1. if H(x) is nsd (resp. psd) for all x, ANY (x*, f(x*)) satisfying ∇f(x*) = 0 is a GLOBAL MAXIMUM (resp. MINIMUM) point;

2. if H(x) is nd (resp. pd) for all x, THEN the x* satisfying ∇f(x*) = 0 is the UNIQUE stationary point and (x*, f(x*)) is the STRICT GLOBAL MAXIMUM (resp. MINIMUM) point.

Remark 11 In case 1 above, the global maximiser (minimiser) x* might not be unique. In case 2, x* is unique.
Example 17 Let f(x) = x'Ax + 2b'x + c, with A positive definite and symmetric. Taking differentials,

    df = (dx)'Ax + x'A dx + 2b'dx = (2x'A + 2b')dx,

so that g(x) = ∇f(x) = 2Ax + 2b. Stationary points satisfy g(x*) = 0 ⇒ x* = -A⁻¹b, which is unique. Checking the Hessian, we have dg = 2A dx, so that

    H(x) = ∂g(x)/∂x' = 2A,

which is positive definite for all x, by assumption. Thus, x* = -A⁻¹b is the unique and strict global minimiser of f(x).
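A short numerical illustration of Example 17 (a sketch, assuming numpy; A, b and c are illustrative): the stationary point x* = -A⁻¹b has a vanishing gradient and gives a value of f no larger than the values at nearby randomly chosen points.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                # symmetric and positive definite
    b = np.array([1.0, -1.0])
    c = 5.0

    f = lambda v: v @ A @ v + 2 * b @ v + c
    x_star = -np.linalg.solve(A, b)            # x* = -A^{-1} b

    print(np.allclose(2 * A @ x_star + 2 * b, 0))          # gradient vanishes at x*
    rng = np.random.default_rng(1)
    trials = x_star + rng.standard_normal((1000, 2))
    print(all(f(x_star) <= f(v) for v in trials))          # True: no trial point does better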
Remark 12 Conclusions about local (or global) optima cannot always be based on the behaviour of univariate functions constructed from straight line paths passing through a stationary point. In particular, if you remove the word "strict" from the discussion preceding Proposition 30, the conclusions drawn are incorrect. This is because, without the word strict, there exists at least one straight line path along which the function is constant as it passes through the stationary point. This allows the possibility that there may be non-linear paths along which the function rises or falls (as it passes through the stationary point), even though it never falls (say) along any straight line path. This is illustrated in the Further Examples section below.
Figure 8.1: f(x1, x2) = x1³ - x1² - x2² + 10

Further Examples

Figure 8.1, f(x) = x1³ - x1² - x2² + 10, illustrates a case, f : R² → R, of two stationary points, one a strict local maximum, the other a saddle. There is no global maximum or minimum, as can be verified by considering, for example, the path along x2 = 0, since the univariate function φ(t) = t³ - t² has no global maximum or minimum.
In the following example, Figure 8.2, f(x) = -x1² + 2x1x2 - x2² + 10 has infinitely many stationary points satisfying x1 = x2, for which det H(x) = 0; in fact, H(x) is nsd for all x, so that the path x1 = x2 corresponds to a ridge of global maxima (Proposition 31). In the next example, Figure 8.3, f(x) = x1³ + x2² has a unique stationary point at (0, 0), at which the Hessian is psd and singular. Interestingly, however, this corresponds to a saddle point, as it is neither a local minimum nor a local maximum (Proposition 30, case 3). For example, setting x1 = 0, f(0, x2) = x2² does have a unique minimum at x2 = 0. But setting x2 = 0, f(x1, 0) = x1³ is strictly increasing through x1 = 0.
Figure 8.2: f(x1, x2) = -x1² + 2x1x2 - x2² + 10
Figure 8.3: f(x1, x2) = x1³ + x2²
Figure 8.4: f(x1, x2) = (x2 - x1²)(x2 - 2x1²)

Now consider f(x) = (x2 - x1²)(x2 - 2x1²), f : R² → R, illustrated in Figure 8.4. Here f has a saddle point at x* = (0, 0)', with det H(x*) = 0 and H(x*) psd. To verify the saddle point, observe that the only solution to ∇f(x) = 0 is x* = (0, 0)' and that f(0) = 0. However, f = 0 along two distinct non-linear paths: x2 = x1² and x2 = 2x1². Now suppose you start to move away from (0, 0) along the path x2 = εx1², for 0 < ε < 1; then f takes the value f = (1 - ε)(2 - ε)x1⁴ > 0, for any x1 ≠ 0, no matter how small the movement in x1. But if you move away from (0, 0) along the path x2 = δx1², for 1 < δ < 2, then f = (1 - δ)(2 - δ)x1⁴ < 0 for all x1 ≠ 0, no matter how small the movement. Thus x* = (0, 0)' must be a saddle. However, every straight line path through x* = (0, 0)' generates a univariate function which has a local minimum at t = 0. To verify this, we can write the value of f along any such straight line path as φ(t) = f(tz1, tz2) = t²(z2² - 3tz1²z2 + 2t²z1⁴), for all z1, z2 not both equal to 0. Viewing q(t) = z2² - 3tz1²z2 + 2t²z1⁴ as a quadratic in t, note that q(0) = z2² > 0 whenever z2 ≠ 0, so that q(t) > 0, and hence φ(t) = t²q(t) > 0 = φ(0), for all t ≠ 0 sufficiently close to 0; while if z2 = 0 then φ(t) = 2t⁴z1⁴ ≥ 0 = φ(0) for all t. Therefore, for all z1 and z2 not both equal to 0, φ always has a local minimum at t = 0.

Figure 8.5, which depicts the function f(x1, x2) = x2(1 - 10x1 - x2) exp(-x1² - x2²) + 10, has multiple optima. The function is constant along the path x2 = 0: for x1 > 0 this path is a ridge lying between two local (one global) minima, and for x1 < 0 it becomes a valley lying between two local (one global) maxima.

8.5 Concavity and Convexity

The "definiteness" of H(x) also affords the following sufficient conditions for concavity (convexity) of f(x), f : R^n → R. In the following statements, x0 is an arbitrary point in R^n.

Figure 8.5: f(x1, x2) = x2(1 - 10x1 - x2) exp(-x1² - x2²) + 10

8.5.1 Concavity conditions

Definition 35 (Strict local concavity) If H(x0) is nd then f(x) is strictly concave at x0.

Remark 13 The plane tangential to f(x) at x0 lies "above" f(x).

Definition 36 (Strict global concavity) If H(x) is nd for all x, then f(x) is strictly GLOBALLY CONCAVE.

Remark 14 The converse is not necessarily true. However, when we relax the "strict" requirement we have the following result:

Definition 37 (Global concavity) f(x) is GLOBALLY CONCAVE if and only if H(x) is nsd for all x.

8.5.2 Convexity conditions

Definition 38 (Strict local convexity) If H(x0) is pd then f(x) is strictly convex at x0.

Remark 15 The plane tangential to f(x) at x0 lies "below" f(x).

Definition 39 (Strict global convexity) If H(x) is pd for all x then f(x) is strictly GLOBALLY CONVEX.

Definition 40 (Global convexity) f(x) is GLOBALLY CONVEX if and only if H(x) is psd for all x.

Figure 8.6: First order mean value expansion

8.6 First and second order mean value expansions

In this section, the useful tool of "mean value expansions" (MVE for short) is introduced. This begins with a review of the univariate case, which should be familiar. We then extend to real-valued functions of a vector, and then to vector-valued functions.

8.6.1 Real-valued functions I

Consider, first, the function f(x), f : R → R, which is continuously twice differentiable over the relevant domain. Figure 8.6 is illustrative of how a first order mean value expansion is constructed.
First, the slope of the red line is (f(4) - f(2))/(4 - 2). Second, there is a value of x ∈ (2, 4), denoted x̄ (defining the "mean" value), such that df(x̄)/dx is the same as the slope of the red line. That is, there is a λ ∈ (0, 1) such that x̄ = 2(1 - λ) + 4λ and

    (f(4) - f(2))/(4 - 2) = df(x̄)/dx,   or   f(4) = f(2) + (df(x̄)/dx)(4 - 2).

More generally, let x0 play the part of 2, and let x be arbitrary. Then the first order mean value expansion says that

    f(x) = f(x0) + (df(x̄)/dx)(x - x0)                                   (8.1)

for x̄ = (1 - λ)x0 + λx, λ ∈ (0, 1), or x̄ ∈ (x0, x) if x > x0. A useful way to think about this is that it defines a straight line, l(x), which passes through the points (x0, f(x0)) and (x, f(x)).
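The mean value x̄ in (8.1) can be located numerically for a concrete case (a sketch in plain Python; the function f(x) = x³ on [2, 4] is illustrative): bisection finds the point where f'(x̄) equals the slope of the chord.

    f = lambda x: x**3
    df = lambda x: 3 * x**2

    slope = (f(4.0) - f(2.0)) / (4.0 - 2.0)    # slope of the chord over [2, 4], equals 28

    # bisection on df(x) - slope, which changes sign on (2, 4)
    lo, hi = 2.0, 4.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (df(lo) - slope) * (df(mid) - slope) <= 0:
            hi = mid
        else:
            lo = mid
    x_bar = 0.5 * (lo + hi)
    print(x_bar, (28.0 / 3.0) ** 0.5)          # both approximately 3.055, and x_bar lies in (2, 4)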
A second order mean value expansion is

    f(x) = f(x0) + (df(x0)/dx)(x - x0) + (1/2)(d²f(x̄)/dx²)(x - x0)²      (8.2)

for x̄ = λx0 + (1 - λ)x, λ ∈ (0, 1). Note that the first order derivative term is evaluated at x0, and it is the second derivative which is now evaluated at the mean value, x̄. A useful way to think about this is that it defines a quadratic, q(x), which passes through the points (x0, f(x0)) and (x, f(x)) and such that dq(x0)/dx ≡ df(x0)/dx and dq(x)/dx ≡ df(x)/dx. That is, we need to find constants a, b and c such that

    ax² + bx + c = f(x),        ax0² + bx0 + c = f(x0),
    2ax + b = df(x)/dx,         2ax0 + b = df(x0)/dx,

the last of which can be written 2ax0(x - x0) + b(x - x0) = (df(x0)/dx)(x - x0). Thus

    f(x) - f(x0) = a(x² - x0²) + b(x - x0)
                 = a(x - x0)² + b(x - x0) + 2axx0 - 2ax0²
                 = a(x - x0)² + b(x - x0) + 2ax0(x - x0)

and

    df(x)/dx - df(x0)/dx = 2a(x - x0).

But since 2ax0(x - x0) + b(x - x0) = (df(x0)/dx)(x - x0), f(x) - f(x0) can be expressed as

    f(x) - f(x0) = a(x - x0)² + (df(x0)/dx)(x - x0).

Now, from (8.1), we can also infer that df(x)/dx - df(x0)/dx = (d²f(x̄)/dx²)(x - x0), by a first order MVE on df(x)/dx, so that (d²f(x̄)/dx²)(x - x0) = 2a(x - x0), or a(x - x0)² = (1/2)(d²f(x̄)/dx²)(x - x0)². Thus

    f(x) - f(x0) = (df(x0)/dx)(x - x0) + (1/2)(d²f(x̄)/dx²)(x - x0)².

8.6.2 Real-valued functions II
We now consider f(x), f : R^n → R, which is continuously twice differentiable over the relevant domain, with x a vector, (n × 1). We can generalise (8.1) and (8.2) in the following plausible ways. A first order MVE is written

    f(x) = f(x0) + (∂f(x̄)/∂x')(x - x0)
         = f(x0) + g(x̄)'(x - x0),                                       (8.3)

where x̄ = (1 - λ)x0 + λx for some (scalar) λ ∈ (0, 1).

A second order MVE is written

    f(x) = f(x0) + (∂f(x0)/∂x')(x - x0) + (1/2)(x - x0)'(∂²f(x̄)/∂x∂x')(x - x0)
         = f(x0) + g(x0)'(x - x0) + (1/2)(x - x0)'H(x̄)(x - x0),         (8.4)

where x̄ = (1 - λ)x0 + λx for some (scalar) λ ∈ (0, 1), but different to the λ used in the preceding first order MVE. Note that the first order derivative term, g(·), is again evaluated at x0.
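The existence of the scalar λ in the second order MVE (8.4) can be illustrated numerically (a sketch, assuming numpy; the smooth function, x0 and x below are illustrative): the gap between the exact value f(x) and the expansion with Hessian evaluated at x̄ = (1 - λ)x0 + λx changes sign on (0, 1), so some λ makes the expansion exact.

    import numpy as np

    f = lambda v: np.exp(v[0]) + v[0] * v[1] + v[1]**2
    grad = lambda v: np.array([np.exp(v[0]) + v[1], v[0] + 2 * v[1]])
    hess = lambda v: np.array([[np.exp(v[0]), 1.0], [1.0, 2.0]])

    x0 = np.array([0.0, 0.0])
    x = np.array([1.0, 0.5])
    d = x - x0

    def gap(lam):
        # second order expansion about x0, Hessian at x-bar, minus the exact value f(x)
        x_bar = (1 - lam) * x0 + lam * x
        return f(x0) + grad(x0) @ d + 0.5 * d @ hess(x_bar) @ d - f(x)

    lo, hi = 1e-8, 1 - 1e-8                    # gap changes sign on (0, 1)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    print(lam, abs(gap(lam)) < 1e-10)          # lambda lies in (0, 1) and the expansion is exact there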
To obtain the above results, first define φ : [0, 1] → R such that dφ(t)/dt is continuous on [0, 1] and d²φ(t)/dt² exists in (0, 1). Then from (8.1) we can write

    φ(t) = φ(0) + (dφ(t̄)/dt) t,     t̄ ∈ (0, t),                          (8.5)

and, in particular, setting t = 1 yields

    φ(1) = φ(0) + dφ(λ)/dt,     λ ∈ (0, 1).                               (8.6)

Now define φ(t) = f(x0 + tz), so that φ(0) = f(x0), φ(1) = f(x0 + z) and, by the chain rule,

    dφ(t)/dt = (∂f(x0 + tz)/∂x') z.

Substituting into (8.6) gives

    f(x0 + z) = f(x0) + (∂f(x0 + λz)/∂x') z,

whereupon writing z = x - x0 yields (8.3).

Similarly, by (8.2), we can write

    φ(1) = φ(0) + dφ(0)/dt + (1/2) d²φ(λ)/dt²,     λ ∈ (0, 1),            (8.7)

where, since φ(t) = f(x0 + tz),

    d²φ(t)/dt² = z' (∂²f/∂x∂x') z,

and φ(0) = f(x0), φ(1) = f(x0 + z), with dφ(0)/dt = (∂f(x0)/∂x') z. Substituting these into (8.7), and writing z = x - x0, gives

    f(x) = f(x0) + (∂f(x0)/∂x')(x - x0) + (1/2)(x - x0)'(∂²f(x̄)/∂x∂x')(x - x0).

8.6.3 Vector functions
Consider f(x), f : R^n → R^m, which is continuously differentiable over the relevant domain. Since f(x) = {fi(x)}, i = 1,...,m, we can take a separate first order MVE of each fi(x), fi : R^n → R, using the preceding results. This yields

    fi(x) = fi(x0) + (∂fi(x̄i)/∂x')(x - x0),                              (8.8)

where x̄i = (1 - λi)x0 + λi x for some (scalar) λi ∈ (0, 1), which depends on i since we are dealing with the real-valued function fi(x).

Now, if we stack (8.8) into a column vector we get

    f(x) = f(x0) + (∂f(x̄)/∂x')(x - x0),

where

    ∂f(x̄)/∂x' = Df(x̄) means [ ∂f1(x̄1)/∂x' ]
                              [      ...      ]   ,   (m × n),
                              [ ∂fm(x̄m)/∂x' ]

so that the "usual" mean value, x̄, is different for each row of ∂f(x̄)/∂x', each row being ∂fi(x̄i)/∂x', i = 1,...,m.

Second order mean value expansions can be defined, but the above should be sufficient for our purposes.
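A tiny scalar example (a sketch, assuming numpy) shows why the mean value x̄i in (8.8) must be allowed to differ across rows: with f(x) = (x², x³)', x0 = 0 and x = 1, the first component requires 2x̄1 = 1 while the second requires 3x̄2² = 1, so the two mean values cannot coincide.

    import numpy as np
    # mean values for the two component functions on [0, 1]
    print(0.5, 1 / np.sqrt(3))                 # 0.5 and about 0.577: different, as expected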
8.7 Exercises
1. Find ∇f(x) and H(x) for the following real-valued bivariate functions, f : R² → R:

   (a) f(x) = 4x1² + 6x1 - 5x2
   (b) f(x) = x1³ - x1² - x2² + 10
   (c) f(x) = x1²x2
   (d) f(x) = x1⁴ + x2⁴
   (e) f(x) = x1⁴ - x2⁴
   (f) f(x) = x1³ + x2³
2. For (a)-(f) in Question 1, find and characterise any stationary points.

3. Let x be (n × 1) and f(x) a real-valued function, f : R^n → R. In what follows, A, W and Z are constant matrices and b is a vector of constants. In each of the following cases, find the (1 × n) derivative vector Df(x) = ∂f(x)/∂x' and the (n × n) Hessian matrix, H(x).
   (a) f(x) = tr(xx')
   (b) f(x) = (b - Ax)'(b - Ax)
   (c) f(x) = tr(Axx')
   (d) f(x) = (b - Ax)'W(b - Ax)
   (e) f(x) = (b - Ax)'ZWZ'(b - Ax)
   (f) f(x) = exp(x'x)
   (g) f(x) = ln(exp(tr(xx')))

4. Let x be a (n × 1) multivariate normal random variable with density

    φ(x; β) = (2π)^{-n/2} exp{ -(1/2)(x - Aβ)'(x - Aβ) },

where A is a given (n × k) matrix of constants, with r(A) = k < n, and β is a (k × 1) unknown parameter vector. Based on any observed sample, x, define the log-likelihood for β as L(β; x) = ln φ(x; β).

   (a) Find s(β; x) = (∂L(β; x)/∂β')'.
   (b) Find H(β; x) = ∂²L(β; x)/∂β∂β'.
   (c) What is the maximum likelihood estimator of β?

5. Consider the following system of n equations in m unknowns, Ax = b (see Chapter 5).

   (a) If r[A, b] = r(A) = m < n, show that the unique solution is given by x = A1⁻¹b1, where A1 is the (m × m) matrix containing any linearly independent set of m rows from A and b1 contains the corresponding elements of b. Furthermore, show that this solution is identical to (A'A)⁻¹A'b.
(b) V is positive deÖnite, provided there is no real number c and (n  1) vector of constants a such that x0a = c; i.e., provided there is no random variable in x which can be expressed as a linear function of the remaining random variables. (c) If (at least) one of the random variables in x can be expressed as a linear combination of the rest (e.g., if x2 = 1 x1 + 4x3; say) then V is positive semi-deÖnite. 7. Let w=Ax+b; where A is a (mn) matrix of constants and b is a(m1)vectorofconstants,withE[x]=andvar[x]=V. Then (a) E[w]=A+b: (b) var[w]=AVA0: 8. Multivariate Normality (a) If x  N (;V); multivariate normal, then w = Ax + b  N A  + b ; AVA 0  : P (b) If z  N (0; In) ; multivariate standard normal, then z0z = ni=1 zi2   2n : (c) IfzN(0;V);VpositivedeÖnite(nn),thenz0V1z2n: (d) Let z  N(0;In); and P be an idempotent and symmetric matrix (P0 =PandPP=P)withr(P)=tr(P)=q;say. Then z0Pz  2q: 103 104APPENDIX A. SOME USEFUL RESULTS FROM STATISTICAL THEORY Appendix B A swift revision of matrix algebra 1. Supposexisa(n1)vectorwithtypicalelementfxig;i=1;:::;n; then we write x = fxig ; implying a column of n numbers; i.e., 26 x 1 37 x=6x2 7: 4 . 5 xn (a) We deÖne 0 as the vector of zeros (every element of 0 is equal to zero). This is sometimes called the null vector. (b) We write x 6= 0 if x has at least one non-zero element. (c) The transpose of x is denoted x0 and is the corresponding row of x2y2 +:::+xnyn; which is a scalar. numbers: x0 =(x1;x2;:::;xn); (1n): (d) Let y = fyig; (n1); then xy denotes the inner or scalar 0P product of x and y and is deÖned as x0y= ni=1xiyi = x1y1 + (e) The vectors x and y; both (n  1); are orthogonal if x0y = 0: (f) For(n1)vectors,xandy;wewritex>yifandonlyifxi >yi for all i = 1;:::;n:
(g) The magnitude (length) or Euclidesan Norm of x 6= 0; (n  1) ; is deÖned as
p+ 0 + Pn 2 jjxjj= xx= xi >0:
i=1
If y is also (n1); the distasnce between x and y is
+Pn 2 jjxyjj= (xi yi) 0:
i=1
105

106 APPENDIX B. A SWIFT REVISION OF MATRIX ALGEBRA 2. Let A be a (n  m) matrix with typical element faij g in row i; column
j: We write A = faij g :
(a) It is often convenient to think of A partitioned in terms of its
columns, thus
where the aj; j = 1;:::;m; are the m columns of A; each column
A=a1 a2  am; having n elements.
(b) The transpose of A is denoted A0. The rows which make up A0 are the columns of A; thus
2 a01 3
A 0 = 6 6 6 a 02 7 7 7 ; ( m  n ) :
cij =
4 . 5 a0m
(c) A is square, of order n; if n = m; and the elements aii; i = 1; : : : ; n; form the leading diagonal.
(d) A square matrix, A; is symmetric if A = A0; i.e., aij = aji:
(e) Let k 2 R be any scalar (real number) and A any matrix, then
kA is a matrix with typical element kaij ; i.e., kA = fkaij g :
(f) Letz=fzjg;j=1;:::;m,a(m1)vector,andxa(n1)vec- tor, then xz0 denotes the outer product of x and z and is a (n  m) matrixwithtypicalelementfxizjg;also,zx0 =fzjxig; (mn):
(g) The (n  n) identity matrix is denoted In; having 1ís in its lead- ing diagonal and 0ís everywhere else.
3. If A is (n × m), then A + B is only defined if B is also (n × m), in which case A + B = {aij + bij}, (n × m); i.e., A + B has typical element aij + bij. By definition, it follows that A + B = B + A. If A + B is defined, then A and B are said to be conformable.
(a) A is called the pre-multiplier and B the post-multiplier.
(b) AB 6= BA; in general. (Note that BA is only deÖned if p = n:)
(c) Formally, if C = AB then C is a (n  p) matrix with typical element,
Xm
aisbsj; i = 1;:::;n; j = 1;:::;p; = ai1b1j +ai2b2j ++aimbmj:
s=1

   (d) If C = AB and ai', (1 × m), i = 1,...,n, denotes the ith row of A, and bj, (m × 1), j = 1,...,p, denotes the jth column of B, then cij = ai'bj.
   (e) (AB)' = B'A', (p × n).
6. If A(B+C) is deÖned for matrices A;B and C; then A(B+C) =
AB + AC: Similarly, if (B + C) D is deÖned, then (B + C) D = BD + CD:
7. IfthematrixproductABCisdeÖned,thenABC=A(BC)=(AB)C and, from 4(e) above, (ABC)0 = C0B0A0:
8. If A is square, of order n; then tr (A) = Pni=1 aii; denotes the trace of A; it is only deÖned for square matrices.
(a) If the matrix products AB and BA are deÖned and square then tr (AB) = tr (BA) :
(b) If the matrix products ABC; BCA; CAB are all deÖned and square, then
tr (ABC) = tr (BCA) = tr (CAB) :
(c) For any matrix A; its Euclidean Norm is deÖned as
jjAjj = p+ tr(A0A) = q+ tr(AA0) = r+ PPa2ij: ij
Note that, even though tr(A0A) = tr(AA0); it is not true in general that A0A = AA0:
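A quick numerical confirmation of result 8 (a sketch, assuming numpy; the random matrices are illustrative): tr(AB) = tr(BA) and ‖A‖ = √tr(A'A), even though A'A and AA' are matrices of different order.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 4))
    B = rng.standard_normal((4, 3))

    print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                  # True
    print(np.isclose(np.linalg.norm(A), np.sqrt(np.trace(A.T @ A))))     # True (Frobenius norm)
    print((A.T @ A).shape, (A @ A.T).shape)                              # (4, 4) and (3, 3)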
9. Only a square matrix can have an inverse, and such an inverse can only exist if the square matrix has full (column) rank, in which case the matrix is said to be non-singular.

   (a) A matrix has full column rank if no column can be expressed as a linear combination of the remaining columns; i.e., the matrix A has full column rank iff the only solution to the homogeneous set of equations Ax = 0 is x = 0; i.e., Σj aj xj = 0 if and only if (iff) xj = 0 for all j.
   (b) If A is square, of order n, with r(A) = n (the rank of A equal to n), then there exists a unique (n × n) matrix, called the inverse of A and denoted A⁻¹, which satisfies AA⁻¹ = A⁻¹A = In.
   (c) (A⁻¹)' = (A')⁻¹, if A is non-singular.
   (d) If both A and B are non-singular and of the same order, then (AB)⁻¹ = B⁻¹A⁻¹.
but not necessarily of the same order. V can be expressed as the following matrix product.
V=V11 0I 0=AB; say, 0 I 0 V22
where the two identity matrices may not be of the same order. By considering, for example, det(B) as an expansion by cofactors along its Örst row we must conclude that det(B) = det(V22): Similarly, det(A) = det(V11 ): Since we know that det (AB) = det (A)  det (B); it must be that det (V) = det (V11)  det (V22):
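Result 10 is easily confirmed numerically (a sketch, assuming numpy; the blocks are illustrative):

    import numpy as np

    V11 = np.array([[2.0, 1.0],
                    [1.0, 3.0]])
    V22 = np.array([[4.0]])
    V = np.block([[V11, np.zeros((2, 1))],
                  [np.zeros((1, 2)), V22]])   # block-diagonal V

    print(np.isclose(np.linalg.det(V),
                     np.linalg.det(V11) * np.linalg.det(V22)))           # True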
11. An idempotent matrix is a square matrix satisfying PP = P, with r(P) = tr(P).

12. A quadratic form is defined as the scalar

        f(x) = x'Ax = Σ_{i=1}^n Σ_{j=1}^n aij xi xj,

    where x = {xi}, (n × 1), A = {aij}, (n × n), and A is symmetric (A' = A).

   (a) f(x) and A are said to be positive definite (pd) iff f(x) > 0 for all x ≠ 0.
   (b) f(x) and A are said to be positive semidefinite (psd) iff f(x) ≥ 0 for all x ≠ 0.
   (c) f(x) and A are said to be negative definite (nd) iff f(x) < 0 for all x ≠ 0.
   (d) f(x) and A are said to be negative semidefinite (nsd) iff f(x) ≤ 0 for all x ≠ 0.
   (e) f(x) and A are said to be indefinite iff none of the above is satisfied for all x ≠ 0.
   (f) If A (resp. f(x)) is psd then -A (resp. -f(x)) is nsd.

13. If A is pd (or nd) then A⁻¹ exists (but not necessarily vice-versa).

14. Let A be (n × n). There exists a non-singular matrix, denoted A^{1/2}, such that A = A^{1/2}A^{1/2}, A⁻¹ = A^{-1/2}A^{-1/2} and A^{-1/2}AA^{-1/2} = In, if and only if A is pd.

15. For any (n × m) matrix B, the (m × m) matrix B'B and the (n × n) matrix BB' must both be psd. Furthermore,

   (a) B'B is pd if and only if r(B) = m;
   (b) if A is pd (n × n), then B'AB is pd if and only if r(B) = m.

16. Vector derivatives

   (a) Let f(x) be a scalar function of a (m × 1) vector x (e.g., f(x) = x'x); then

        ∂f/∂x' = ( ∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xm )

       denotes the (1 × m) row vector of partial derivatives, and ∂f/∂x = (∂f/∂x')' provides the same information but in a (m × 1) vector.

   (b) Let f(x) be a column (n × 1) vector function of the (m × 1) vector x, where f(x) = {fi(x)} and the fi(x) are scalar functions of x (e.g., f(x) = Ax, so that fi(x) = ai'x, where ai' is the ith row of A). Then

        ∂f/∂x' = { ∂fi/∂xj },   i = 1,...,n;  j = 1,...,m,

       is the appropriate (n × m) matrix of partial derivatives, called the Jacobian matrix. For example, if f(x) = Ax, with A = {aij}, then fi = ai'x = Σ_{j=1}^m aij xj and ∂fi/∂xj = aij, so that ∂f/∂x' = A.

   (c) If f(x) is a scalar function of the (m × 1) vector x, then the (m × m) matrix of second order partial derivatives is called the Hessian and is of the form

        ∂²f/∂x∂x' = { ∂²f/∂xj∂xi },   i = 1,...,m;  j = 1,...,m.

       Note that ∂²f/∂x∂x' = (∂/∂x')(∂f/∂x). The Hessian matrix can also be defined for vector functions of a vector, but we do not need to dwell on that here.

Some Results

   (a) If f(x) = x'x, then ∂f/∂x' = 2x'.
   (b) If f(x) = x'Ax, A = A', then ∂f/∂x' = 2x'A and ∂²f/∂x∂x' = 2A.
   (c) Chain Rule: if f(x) = y'Ay, where y = y(x) with y (m × 1), x (n × 1) and A = A', (m × m), then ∂f/∂x' = 2y'A(∂y/∂x').
   (d) If f(x) = Ax, then ∂f/∂x' = A.

Appendix C The classical linear regression model

Here we review the algebra of the single equation multiple regression model

    yt = Σ_{j=1}^k xtj βj + ut,    t = 1,...,n,                           (C.1)

in which:

- yt is the tth observation on the dependent/endogenous variable;
- xtj is the tth observation on the jth regressor/predetermined/exogenous regressor, j = 1,...,k;
- ut is a random disturbance term, or error, which is unobserved;
- βj is the jth unknown parameter/coefficient, with βj = ∂E[yt]/∂xtj, j = 1,...,k;
- generally, we think of Σ_{j=1}^k xtj βj as being derived from theory (e.g., a demand equation) and ut is the deviation from theory.

In matrix-vector notation we can write (C.1) as (putting all the observations together)

    y = Xβ + u                                                             (C.2)

where

    y = (y1,...,yn)' = {yt}, t = 1,...,n, (n × 1),
    u = (u1,...,un)' = {ut}, t = 1,...,n, (n × 1),
    β = (β1,...,βk)' = {βj}, j = 1,...,k, (k × 1),

and

        [ x11  x12  ...  x1k ]
    X = [ x21  x22  ...  x2k ]  = {xtj},   n rows, k columns, (n × k).
        [ ...  ...       ... ]
        [ xn1  xn2  ...  xnk ]

It is often the case that the "regression function" Σ_{j=1}^k xtj βj contains an intercept term (usually β1, with β2,...,βk being slopes), so that we can write yt = β1 + Σ_{j=2}^k xtj βj + ut. In this case the first column of X contains just ones; i.e., xt1 = 1 for all t.

Here's some advice: when writing down matrix-vector expressions, it's always a good idea to add the dimensions of the quantities underneath so that you are sure the operation is defined.
For example,

      y    =    X      β    +    u
    (n×1)     (n×k)  (k×1)     (n×1)

In particular, Xβ is defined (the number of columns in X equals the number of rows in β) and gives a (n × 1) vector.

Write X in terms of its columns:

    X = [x1, x2, ..., xk],   xj (n × 1),

a partitioned version of X. Then

    Xβ = β1 x1 + β2 x2 + ... + βk xk = Σ_{j=1}^k βj xj,

which is a LINEAR COMBINATION of the columns of X. When β1 denotes an intercept term we have x1 = ι, where ι' = (1, 1, ..., 1) is the (1 × n) row of ones, so that

    Xβ = β1 ι + β2 x2 + ... + βk xk = β1 ι + Σ_{j=2}^k βj xj.

To this specification of the CLRM we will initially assume the following about the elements of X and u (but these assumptions will be relaxed in due course):

- The elements xtj are fixed in repeated sampling. They are all NON-RANDOM (or NON-STOCHASTIC) and are fixed prior to sampling/observing yt.

- The elements of u have zero mean, constant variance and are uncorrelated:

    E[ut] = 0 for all t;   E[ut²] = σ² for all t;   E[ut us] = 0 for all t ≠ s.

Further assumptions about the regressor matrix are made as follows:

- The columns of X form a linearly independent set of vectors. This means that no one vector in the set can be written as a linear combination of the rest. For a given X, the only solution to the equations Xβ = 0 is β = 0. Formally, this is true if and only if r(X) = k.
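Finally, the CLRM set-up above can be sketched in code (assuming numpy; the sample size, coefficient values and random numbers are purely illustrative and not part of the notes): X is built with a first column of ones for the intercept, the rank condition r(X) = k is checked, and Xβ is confirmed to be a linear combination of the columns of X.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])   # first column: intercept
    beta = np.array([1.0, 0.5, -2.0])                                     # (k x 1) coefficient vector
    u = rng.standard_normal(n)                                            # zero-mean disturbances
    y = X @ beta + u                                                       # y = X beta + u, (n x 1)

    print(np.linalg.matrix_rank(X) == k)       # the columns of X are linearly independent
    print(np.allclose(X @ beta,
                      sum(beta[j] * X[:, j] for j in range(k))))           # X beta = sum_j beta_j x_j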