Introduction to Numerical Analysis
Hector D. Ceniceros
⃝c Draft date December 7, 2018
Contents
Contents i
Preface 1
1 Introduction 3
1.1 WhatisNumericalAnalysis? ……………… 3
1.2 AnIllustrativeExample ………………… 3 1.2.1 AnApproximationPrinciple…………… 4 1.2.2 DivideandConquer ………………. 6 1.2.3 Convergence and Rate of Convergence . . . . . . . . . 7 1.2.4 ErrorCorrection ………………… 8 1.2.5 RichardsonExtrapolation ……………. 11
1.3 Super-algebraicConvergence………………. 13
2 Function Approximation 17
2.1 Norms…………………………. 17
2.2 UniformPolynomialApproximation. . . . . . . . . . . . . . . 19 2.2.1 Bernstein Polynomials and B ́ezier Curves . . . . . . . . 19 2.2.2 Weierstrass Approximation Theorem . . . . . . . . . . 23
2.3 BestApproximation ………………….. 25 2.3.1 Best Uniform Polynomial Approximation . . . . . . . . 27
2.4 ChebyshevPolynomials…………………. 31
3 Interpolation 37
3.1 PolynomialInterpolation………………… 37 3.1.1 EquispacedandChebyshevNodes . . . . . . . . . . . . 40
3.2 Connection to Best Uniform Approximation . . . . . . . . . . 41
3.3 BarycentricFormula ………………….. 43
i
ii
CONTENTS
3.3.1 Barycentric Weights for Chebyshev Nodes . . . . . . . 44
3.3.2 Barycentric Weights for Equispaced Nodes . . . . . . . 45
3.3.3 Barycentric Weights for General Sets of Nodes . . . . . 45
3.4 Newton’sFormandDividedDifferences. . . . . . . . . . . . . 46
3.5 Cauchy’sRemainder ………………….. 49
3.6 HermiteInterpolation………………….. 52
3.7 Convergence of Polynomial Interpolation . . . . . . . . . . . . 53
3.8 Piece-wiseLinearInterpolation …………….. 55
3.9 CubicSplines ……………………… 56
3.9.1 SolvingtheTridiagonalSystem . . . . . . . . . . . . . 60 3.9.2 CompleteSplines ………………… 62 3.9.3 ParametricCurves ……………….. 63
4 Trigonometric Approximation 65
4.1 ApproximatingaPeriodicFunction …………… 65 4.2 InterpolatingFourierPolynomial ……………. 70 4.3 TheFastFourierTransform ………………. 75
5 Least Squares Approximation 79
5.1 Continuous Least Squares Approximation . . . . . . . . . . . . 79
5.2 Linear Independence and Gram-Schmidt Orthogonalization . . 85
5.3 OrthogonalPolynomials ………………… 86
5.3.1 ChebyshevPolynomials……………… 89
5.4 DiscreteLeastSquaresApproximation . . . . . . . . . . . . . 90
5.5 High-dimensionalDataFitting……………… 95
6 Computer Arithmetic 99
6.1 FloatingPointNumbers ………………… 99
6.2 RoundingandMachinePrecision …………….100
6.3 CorrectlyRoundedArithmetic………………101
6.4 Propagation of Errors and Cancellation of Digits . . . . . . . . 102
7 Numerical Differentiation 105
7.1 FiniteDifferences…………………….105 7.2 TheEffectofRound-offErrors………………108 7.3 Richardson’sExtrapolation………………..109
CONTENTS iii
8 Numerical Integration 111
8.1 ElementarySimpsonQuadrature …………….111
8.2 InterpolatoryQuadratures ………………..114
8.3 GaussianQuadratures ………………….116
8.3.1 Convergence of Gaussian Quadratures . . . . . . . . . 119
8.4 ComputingtheGaussianNodesandWeights . . . . . . . . . . 121
8.5 Clenshaw-CurtisQuadrature……………….122
8.6 CompositeQuadratures …………………124
8.7 ModifiedTrapezoidalRule………………..125
8.8 TheEuler-MaclaurinFormula ………………127
8.9 RombergIntegration …………………..130
9 Linear Algebra 133
9.1 TheThreeMainProblems ………………..133
9.2 Notation…………………………135
9.3 SomeImportantTypesofMatrices . . . . . . . . . . . . . . .136
9.4 SchurTheorem ……………………..139
9.5 Norms………………………….140
9.6 ConditionNumberofaMatrix ……………..146
9.6.1 WhattoDoWhenAisIll-conditioned?. . . . . . . . . 148
10 Linear Systems of Equations I 151
10.1EasytoSolveSystems ………………….152 10.2 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . 154 10.2.1 TheCostofGaussianElimination. . . . . . . . . . . .161 10.3LUandCholeskiFactorizations ……………..162 10.4TridiagonalLinearSystems ……………….166 10.5 A1DBVP:DeformationofanElasticBeam . . . . . . . . . . 168 10.6 A 2D BVP: Dirichlet Problem for the Poisson’s Equation . . . 170 10.7 LinearIterativeMethodsforAx=b. . . . . . . . . . . . . . . 173 10.8 Jacobi,Gauss-Seidel,andS.O.R. . . . . . . . . . . . . . . . .174 10.9 ConvergenceofLinearIterativeMethods . . . . . . . . . . . . 176
11 Linear Systems of Equations II 181
11.1 Positive Definite Linear Systems as an Optimization Problem . 181 11.2LineSearchMethods …………………..183 11.2.1 SteepestDescent …………………184 11.3TheConjugateGradientMethod …………….184
iv
11.4 11.5
CONTENTS
11.3.1 Generating the Conjugate Search Directions . . . . . . 187 Krylov Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . 190 Convergence Rate of the Conjugate Gradient Method . . . . . 192
12 Non-Linear Equations 193
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
12.2 Bisection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 12.2.1 Convergence of the Bisection Method . . . . . . . . . . 194
12.3RateofConvergence …………………..195
12.4 Interpolation-Based Methods . . . . . . . . . . . . . . . . . . . 196
12.5 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.6 The Secant Method . . . . . . . . . . . . . . . . . . . . . . . . 198
12.7 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . 200
12.8 Systems of Nonlinear Equations . . . . . . . . . . . . . . . . . 202 12.8.1 Newton’sMethod…………………203
List of Figures
2.1 The Bernstein weights bk,n(x) for x = 0.25 (◦)and x = 0.75 (•),n=50andk=1…n………………… 21
2.2 QuadraticB ́eziercurve………………….. 21
2.3 If the error function en does not equioscillate at least twice we couldlower∥en∥∞byanamountc>0………….. 28
4.1 S8(x)forf(x)=sinxecosxon[0,2π]. ………….. 74
5.1 The function f(x) = ex on [0,1] and its Least Squares Ap- proximationp1(x)=4e−10+(18−6e)x. . . . . . . . . . . . 81
5.2 Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximat- inglinearsubspaceW. …………………. 97
v
vi LIST OF FIGURES
List of Tables
1.1 Composite Trapezoidal Rule for f(x) = ex in [0,1]. . . . . . . 8 1.2 CompositeTrapezoidalRuleforf(x)=1/(2+sinx)in[0,2π]. 13
vii
viii LIST OF TABLES
Preface
These notes were prepared by the author for use in the upper division under- graduate course of Numerical Analysis (Math 104 ABC) at the University of California at Santa Barbara. They were written with the intent to emphasize the foundations of Numerical Analysis rather than to present a long list of numerical methods for different mathematical problems.
We begin with an introduction to Approximation Theory and then use the different ideas of function approximation in the derivation and analysis of many numerical methods.
These notes are intended for undergraduate students with a strong math- ematics background. The prerequisites are Advanced Calculus, Linear Alge- bra, and introductory courses in Analysis, Differential Equations, and Com- plex Variables. The ability to write computer code to implement the nu- merical methods is also a necessary and essential part of learning Numerical Analysis.
These notes are not in finalized form and may contain errors, misprints, and other inaccuracies. They cannot be used or distributed without written consent from the author.
1
2 LIST OF TABLES
Chapter 1 Introduction
1.1 What is Numerical Analysis?
This is an introductory course of Numerical Analysis, which comprises the design, analysis, and implementation of constructive methods and algorithms for the solution of mathematical problems.
Numerical Analysis has vast applications both in Mathematics and in modern Science and Technology. In the areas of the Physical and Life Sci- ences, Numerical Analysis plays the role of a virtual laboratory by providing accurate solutions to the mathematical models representing a given physical or biological system in which the system’s parameters can be varied at will, in a controlled way. The applications of Numerical Analysis also extend to more modern areas such as data analysis, web search engines, social networks, and basically anything where computation is involved.
1.2 An Illustrative Example: Approximating a Definite Integral
The main principles and objectives of Numerical Analysis are better illus- trated with concrete examples and this is the purpose of this chapter.
Consider the problem of calculating a definite integral
b a
3
(1.1) I[f] =
f(x)dx.
4 CHAPTER 1. INTRODUCTION
In most cases we cannot find an exact value of I[f] and very often we only know the integrand f at finite number of points in [a,b]. The problem is then to produce an approximation to I[f] as accurate as we need and at a reasonable computational cost.
1.2.1 An Approximation Principle
One of the central ideas in Numerical Analysis is to approximate a given function or data by simpler functions which we can analytically evaluate, integrate, differentiate, etc. For example, we can approximate the integrand f in [a, b] by the segment of the straight line, a linear polynomial p1(x), that passes through (a,f(a)) and (b,f(b)). That is
(1.2) and
(1.3)
That is
(1.4)
f(x) ≈ p1(x) = f(a) + f(b) − f(a)(x − a). b−a
bb1
f(x)dx ≈ aa
b a
p1(x)dx = f(a)(b − a) + 2[f(b) − f(a)](b − a) = 1[f(a) + f(b)](b − a).
f(x)dx ≈
2
b−a
2 [f(a) + f(b)].
The right hand side is known as the simple Trapezoidal Rule Quadrature. A
quadrature is a method to approximate an integral. How accurate is this approximation? Clearly, if f is a linear polynomial or a constant then the Trapezoidal Rule would give us the exact value of the integral, i.e. it would be exact. The underlying question is: how well does a linear polynomial p1, satisfying
(1.5) p1(a) = f(a), (1.6) p1(b) = f(b),
approximate f on the interval [a, b]? We can almost guess the answer. The approximation is exact at x = a and x = b because of (1.5)-(1.6) and it is
1.2. AN ILLUSTRATIVE EXAMPLE 5
exact for all polynomials of degree ≤ 1. This suggests that f(x) − p1(x) = Cf′′(ξ)(x − a)(x − b), where C is a constant. But where is f′′ evaluated at? it cannot be at x for if it did f would be the solution of a second order ODE and f is an arbitrary (but sufficiently smooth, C2[a,b] ) function so it has to be at some undetermined point ξ(x) in (a,b). Now, if we take the particular case f(x) = x2 on [0, 1] then p1(x) = x, f(x) − p1(x) = x(x − 1), and f′′(x) = 2, which implies that C would have to be 1/2. So our conjecture is
(1.7) f(x) − p1(x) = 1f′′(ξ(x))(x − a)(x − b). 2
There is a beautiful 19th Century proof of this result by A. Cauchy. It goes as follows. If x = a or x = b (1.7) holds trivially. So let us take x in (a,b) and define the following function of a new variable t as
(1.8) φ(t) = f(t) − p1(t) − [f(x) − p1(x)] (t − a)(t − b) . (x−a)(x−b)
Then φ, as a function of t, is C2[a,b] and φ(a) = φ(b) = φ(x) = 0. Since φ(a) = φ(x) = 0 by Rolle’s theorem there is ξ1 ∈ (a, x) such that φ′(ξ1) = 0 and similarly there is ξ2 ∈ (x, b) such that φ′(ξ2) = 0. Because φ is C2[a, b] we can apply Rolle’s theorem one more time, observing that φ′(ξ1) = φ′(ξ2) = 0, to get that there is a point ξ(x) between ξ1 and ξ2 such that φ′′(ξ(x)) = 0. Consequently,
(1.9) 0 = φ′′(ξ(x)) = f′′(ξ(x)) − [f(x) − p1(x)] 2 (x−a)(x−b)
and so
(1.10) f(x)−p1(x)= 1f′′(ξ(x))(x−a)(x−b), 2
ξ(x)∈(a,b).
We can use (1.10) to find the accuracy of the simple Trapezoidal Rule. As-
suming the integrand f is C2[a, b]
b b 1b
(1.11) f(x)dx = p1(x)dx + 2 f′′(ξ(x))(x − a)(x − b)dx. aaa
Now, (x − a)(x − b) does not change sign in [a, b] and f ′′ is continuous so by the Weighted Mean Value Theorem for Integrals we have that there is
6 CHAPTER 1. INTRODUCTION η ∈ (a, b) such that
b b
(1.12) f′′(ξ(x))(x − a)(x − b)dx = f′′(η) (x − a)(x − b)dx.
aa
The last integral can be easily evaluated if we shift to the midpoint, i.e., changing variables to x = y + 1 (a + b) then
2
b b−a
b−a2
a −b−a 2 6
(x−a)(x−b)dx= Collecting (1.11) and (1.13) we get
b b−a 1′′ 3
(1.13)
2
y2 −
1
dy=− (b−a)3.
2
f(x)dx = 2 [f(a) + f(b)] − 12f (η)(b − a) ,
(1.14)
where η is some point in (a, b). So in the approximation
a
we make the error
b
a
b−a
f(x)dx ≈ 2 [f(a) + f(b)].
(1.15) E[f] = − 1 f′′(η)(b − a)3. 12
1.2.2 Divide and Conquer
The error (1.15) of the simple Trapezoidal Rule grows cubically with the length of the interval of integration so it is natural to divide [a, b] into smaller subintervals, apply the Trapezoidal Rule on each of them, and sum up the result.
Let us divide [a, b] in N subintervals of equal length h = 1 (b − a), deter- N
minedbythepointsx0 =a,x1 =x0+h,x2 =x0+2h,…,xN =x0+Nh=b, then
b x1 f (x)dx =
x2 xN f (x)dx + . . . +
f (x)dx
f (x)dx +
a x0 x1 xN−1
(1.16) N−1 x
j+1
xj
=
f (x)dx.
j=0
1.2. AN ILLUSTRATIVE EXAMPLE 7 But we know
xj+1 1 1 ′′ 3
xj
f(x)dx = 2[f(xj) + f(xj+1)]h − 12f (ξj)h
(1.17)
for some ξj ∈ (xj,xj+1). Therefore, we get
b 1 1 1N−1 f(x)dx=h 2f(x0)+f(x1)+…+f(xN−1)+2f(xN) −12h3f′′(ξj).
a j=0
The first term on the right hand side is called the Composite Trapezoidal
Rule Quadrature (CTR):
1 1
1N−1 f′′(ξj) ,
(1.18) Th[f]:=h 2f(x0)+f(x1)+…+f(xN−1)+2f(xN) . The error term is
1 N−1 1
(1.19) Eh[f]=− h3 f′′(ξj)=− (b−a)h2
12 j=0 12
Nj=0
where we have used that h = (b − a)/N. The term in brackets is a mean value of f′′ (it is easy to prove that it lies between the maximum and the minimum of f′′). Since f′′ is assumed continuous (f ∈ C2[a,b]) then by the Intermediate Value Theorem, there is a point ξ ∈ (a,b) such that f′′(ξ) is equal to the quantity in the brackets so we obtain that
(1.20) Eh[f] = − 1 (b − a)h2f′′(ξ), 12
for some ξ ∈ (a, b).
1.2.3 Convergence and Rate of Convergence
We do not not know what the point ξ is in (1.20). If we knew, the error could be evaluated and we would know the integral exactly, at least in principle, because
(1.21) I[f] = Th[f] + Eh[f].
8 CHAPTER 1. INTRODUCTION
But (1.20) gives us two important properties of the approximation method in question. First, (1.20) tell us that Eh[f] → 0 as h → 0. That is, the quadrature rule Th[f] converges to the exact value of the integral as h → 0 1. Recall h = (b − a)/N, so as we increase N our approximation to the integral gets better and better. Second, (1.20) tells us how fast the approx- imation converges, namely quadratically in h. This is the approximation’s rate of convergence. If we double N (or equivalently halve h) then the error decreases by a factor of 4. We also say that the error is order h2 and write Eh[f] = O(h2). The Big ‘O’ notation is used frequently in Numerical Analysis.
Definition 1. We say that g(h) is order hα, and write g(h) = O(hα), if there is a constant C and h0 such that |g(h)|≤Chα for 0≤h≤h0, i.e. for sufficiently small h.
Example 1. Let’s check the Trapezoidal Rule approximation for an integral we can compute exactly. Take f(x) = ex in [0,1]. The exact value of the integral is e − 1. Observe how the error |I [ex ] − T1/N [ex ]| decreases by a
Table 1.1: Composite Trapezoidal Rule for f (x) = ex in [0, 1].
N T1/N [ex]
16 1.718841128579994
32 1.718421660316327 1.398318572816137 × 10−4 0.250012206406039 64 1.718316786850094 3.495839104861176 × 10−5 0.250003051723810
128 1.718290568083478 8.739624432374526 × 10−6 0.250000762913303 factor of (approximately) 1/4 as N is doubled, in accordance to (1.20).
1.2.4 Error Correction
We can get an upper bound for the error using (1.20) and that f′′ is bounded in [a,b], i.e. |f′′(x)| ≤ M2 for all x ∈ [a,b] for some constant M2. Then
(1.22) |Eh[f]|≤ 1(b−a)h2M2. 12
1Neglecting round-off errors introduced by finite precision number representation and computer arithmetic.
|I[ex] − T1/N [ex]| 5.593001209489579 × 10−4
Decrease factor
1.2. AN ILLUSTRATIVE EXAMPLE 9
However, this bound does not in general provide an accurate estimate of the error. It could grossly overestimate it. This can be seen from (1.19). As N → ∞ the term in brackets converges to a mean value of f′′, i.e.
1N−1
(1.23) f′′(ξj) −→
1 b b−a a
f′′(x)dx =
1
b−a
[f′(b) − f′(a)],
N j=0
as N → ∞, which could be significantly smaller than the maximum of |f′′|. Take for example f(x) = e100x on [0,1]. Then max|f′′| = 10000e100, whereas the mean value (1.23) is equal to 100(e100 − 1) so the error bound (1.22) overestimates the actual error by two orders of magnitude. Thus, (1.22) is of little practical use.
Equation (1.19) and (1.23) suggest that asymptotically, that is for suffi- ciently small h,
(1.24) Eh[f] = C2h2 + R(h), where
(1.25) C2 = − 1 [f′(b) − f′(a)] 12
and R(h) goes to zero faster than h2 as h → 0, i.e. (1.26) lim R(h) = 0.
h→0 h2 We say that R(h) = o(h2) (little ‘o’ h2).
Definition 2. A function g(h) is little ‘o’ hα if lim g(h) = 0
h→0 hα
and we write g(h) = o(hα). We then have
(1.27) I[f] = Th[f] + C2h2 + R(h).
and, for sufficiently small h, C2h2 is an approximation of the error. If it is possible and computationally efficient to evaluate the first derivative of
10 CHAPTER 1. INTRODUCTION
f at the end points of the interval then we can compute directly C2h2 and use this leading order approximation of the error to obtain the improved approximation
(1.28) Th[f] = Th[f] − 1 [f′(b) − f′(a)]h2. 12
This is called the (composite) Modified Trapezoidal Rule. It then follows from (1.27) that error of this “corrected approximation” is R(h), which goes to zero faster than h2. In fact, we will prove later that the error of the Modified Trapezoidal Rule is O(h4).
Often, we only have access to values of f and/or it is difficult to evaluate f′(a) and f′(b). Fortunately, we can compute a sufficiently good approxi- mation of the leading order term of the error, C2h2, so that we can use the same error correction idea that we did for the Modified Trapezoidal Rule. Roughly speaking, the error can be estimated by comparing two approxima- tions obtained with different h.
Consider (1.27). If we halve h we get
(1.29) I[f] = Th/2[f] + 1C2h2 + R(h/2).
4 Subtracting (1.29) from (1.27) we get
(1.30) C2h2 =4Th/2[f]−Th[f]+4(R(h/2)−R(h)). 33
The last term on the right hand side is o(h2). Hence, for h sufficiently small, we have
(1.31) C2h2 ≈ 4 Th/2[f] − Th[f] 3
and this could provide a good, computable estimate for the error, i.e.
(1.32) Eh[f]≈ 4Th/2[f]−Th[f]. 3
The key here is that h has to be sufficiently small to make the asymptotic approximation (1.31) valid. We can check this by working backwards. If h is sufficiently small, then evaluating (1.31) at h/2 we get
h2 4 (1.33) C2 2 ≈ 3 Th/4[f] − Th/2[f]
1.2. AN ILLUSTRATIVE EXAMPLE 11 and consequently the ratio
(1.34) q(h) = Th/2[f] − Th[f] Th/4[f] − Th/2[f]
should be approximately 4. Thus, q(h) offers a reliable, computable indicator of whether or not h is sufficiently small for (1.32) to be an accurate estimate of the error.
We can now use (1.31) and the idea of error correction to improve the accuracy of Th[f] with the following approximation 2
(1.35) Sh [f ] := Th [f ] + 4 Th/2 [f ] − Th [f ] . 3
1.2.5 Richardson Extrapolation
We can view the error correction procedure as a way to eliminate the leading order (in h) contribution to the error. Multiplying (1.29) by 4 and substracting (1.27) to the result we get
(1.36) I[f] = 4Th/2[f] − Th[f] + 4R(h/2) − R(h) 33
Note that Sh[f] is exactly the first term in the right hand side of (1.36) and that the last term converges to zero faster than h2. This very useful and general procedure in which the leading order component of the asymptotic form of error is eliminated by a combination of two computations performed with two different values of h is called Richardson’s Extrapolation.
Example 2. Consider again f (x) = ex in [0, 1]. With h = 1/16 we get 1 T1/32[ex] − T1/16[ex]
(1.37) q 16 = T1/64[ex] − T1/32[ex] ≈ 3.9998 and the improved approximation is
(1.38) S1/16[ex] = T1/16[ex] + 4 T1/32[ex] − T1/16[ex] = 1.718281837561771 3
which gives us nearly 8 digits of accuracy (error ≈ 9.1 × 10−9). S1/32 gives us an error ≈ 5.7 × 10−10. It decreased by approximately a factor of 1/16. This would correspond to fourth order rate of convergence. We will see in Chapter 8 that indeed this is the case.
2The symbol := means equal by definition.
12 CHAPTER 1. INTRODUCTION
It appears that Sh[f] gives us superior accuracy to that of Th[f] but at roughly twice the computational cost. If we group together the common terms in Th[f] and Th/2[f] we can compute Sh[f] at about the same compu- tational cost as that of Th/2[f]:
h1 2N−1 1 4Th/2[f] − Th[f] = 42 2f(a) + f(a + jh/2) + 2f(b)
j=1
1N−1 1
−h 2f(a)+f(a+jh)+2f(b) j=1
h N−1 N−1 = 2 f(a)+f(b)+2f(a+kh)+4f(a+kh/2) .
k=1 k=1
Therefore
(1.39) Sh[f] = 6 f(a)+2f(a+kh)+4f(a+kh/2)+f(b) .
hN−1 N−1 k=1 k=1
The resulting quadrature formula Sh[f] is known as the Composite Simpson’s Rule and, as we will see in Chapter 8, can be derived by approximating the integrand by quadratic polynomials. Thus, based on cost and accuracy, the Composite Simpson’s Rule would be preferable to the Composite Trapezoidal Rule, with one important exception: periodic smooth integrands integrated over their period.
Example 3. Consider the integral
(1.40) I [1/(2 + sin x)] =
Using Complex Variables techniques (Residues) the exact integral can be com-
puted and I[1/(2 + sin x)] = 2π/ 3. Note that the integrand is smooth (has an infinite number of continuous derivatives) and periodic in [0,2π]. If we use the Composite Trapezoidal Rule to find approximations to this integral we obtain the results show in Table 1.2.
The approximations converge amazingly fast. With N = 32, we already reached machine precision (with double precision we get about 16 digits).
√
2π dx
2 + sin x .
0
1.3. SUPER-ALGEBRAIC CONVERGENCE 13 Table1.2: CompositeTrapezoidalRuleforf(x)=1/(2+sinx)in[0,2π].
N
8 16 32
T2π/N [1/(2 + sin x)] 3.627791516645356 3.627598733591013 3.627598728468435
|I[1/(2 + sin x)] − T2π/N [1/(2 + sin x)]| 1.927881769203665 × 10−4 5.122577029226250 × 10−9 4.440892098500626 × 10−16
1.3
Super-Algebraic Convergence of the CTR for Smooth Periodic Integrands
Integrals of periodic integrands appear in many applications, most notably, in Fourier Analysis.
Consider the definite integral
2π
0
where the integrand f is periodic in [0, 2π] and has m > 1 continuous deriva- tives, i.e. f ∈ Cm[0, 2π] and f(x + 2π) = f(x) for all x. Due to periodicity we can work in any interval of length 2π and if the function has a different period, with a simple change of variables, we can reduce the problem to one in [0, 2π].
Consider the equally spaced points in [0,2π], xj = jh for j = 0,1,…,N and h = 2π/N. Because f is periodic f(x0 = 0) = f(xN = 2π) and the CTR becomes
f(x ) f(x ) N−1
(1.41) Th [f ] = h 0 + f (x1 ) + . . . + f (xN −1 ) + N = h f (xj ).
f(x)= 2 +
(akcoskx+bksinkx)
I[f] =
f(x)dx,
2 2j=0
Being f smooth and periodic in [0, 2π], it has a uniformly convergent Fourier
Series: (1.42)
where (1.43)
(1.44)
a0 ∞
1 2π
ak = π
0
1 2π
bk = π
f(x)coskxdx, f(x)sinkxdx,
k = 0,1,… k = 1,2,…
k=1
0
14
Using the Euler formula3. (1.45)
we can write
(1.46)
(1.47)
CHAPTER 1.
eix =cosx+isinx eix + e−ix
cos x = 2 , sin x = eix − e−ix
INTRODUCTION
2i
of functions eikx for k = 0, ±1, ±2, . . . so that (1.42) becomes
and the Fourier series can be conveniently expressed in complex form in terms
(1.48)
where
(1.49)
∞
f(x) = ckeikx, k=−∞
1 2π
ck = 2π
f(x)e−ikxdx.
0
We are assuming that f is real-valued so the complex Fourier coefficients
satisfy c ̄k = c−k, where c ̄k is the complex conjugate of ck. We have the relation 2c0 = a0 and 2ck = ak −ibk for k = ±1, ±2, . . ., between the complex and real Fourier coefficients.
.
Justified by the uniform convergence of the series we can exchange the finite and the infinite sums to get
Using (1.48) in (1.41) we get
(1.50) Th[f] = h ckeikxj
∞ N−1
(1.51) T [f ] = 2π c eik 2π j .
N−1 ∞ j=0 k=−∞
hNkN k=−∞ j=0
3i2 =−1andifc=a+ib,witha,b∈R,thenitscomplexconjugatec ̄=a−ib.
1.3. SUPER-ALGEBRAIC CONVERGENCE 15 But
(1.52)
Note that eik 2π
l ∈ Z and if so (1.53)
j=0 Otherwise, if k ̸= lN, then
N−1 N−1
ik2πj ik2πj
eN = eN . j=0 j=0
N
= 1 precisely when k is an integer multiple of N, i.e. k = lN,
N−1
ik 2π j
e N =N fork=lN.
ik 2π N N−1 1−eN
ik2π j
(1.54) e N = ik2π =0 fork̸=lN
j=0 1− e N Using (1.53) and (1.54) we thus get that
(1.55)
On the other hand (1.56)
∞
Th[f]=2π clN. l=−∞
1 2π
1 f(x)dx = 2πI[f].
c0 = 2π
Th[f]=I[f]+2π[cN +c−N +c2N +c−2N +…],
|Th [f ] − I [f ]| ≤ 2π [|cN | + |c−N | + |c2N | + |c−2N | + . . .] ,
Therefore (1.57) that is (1.58)
0
So now, the relevant question is how fast the Fourier coefficients clN of f decay with N. The answer is tied to the smoothness of f. Doing integration by parts in the formula (4.11) for the Fourier coefficients of f we have
1 1 2π 2π
(1.59) ck = 2π ik
0
f′(x)e−ikxdx − f(x)e−ikx0 k ̸= 0
16 CHAPTER 1. INTRODUCTION and the last term vanishes due to the periodicity of f(x)e−ikx. Hence,
1 1 2π
(1.60) ck = 2π ik Integrating by parts m times we obtain
1 1 m 2π
periodic
(1.62) |ck| ≤ Am ,
0
f′(x)e−ikxdx
k ̸= 0.
k ̸= 0,
f(m)(x)e−ikxdx
(1.61) ck = 2π ik
where f(m) is the m-th derivative of f. Therefore, for f ∈ Cm[0,2π] and
0
|k|m
where Am is a constant (depending only on m). Using this in (1.58) we get
222 |Th[f]−I[f]|≤2πAm Nm +(2N)m +(3N)m +…
(1.63) 4πAm 1 1 = Nm 1+2m +3m +… ,
and so for m > 1 we can conclude that
(1.64) |Th[f] − I[f]| ≤ Cm .
Nm
Thus, in this particular case, the rate of convergence of the CTR at equally spaced points is not fixed (to 2). It depends on the number of derivatives of f and we say that the accuracy and convergence of the approximation is spectral. Note that if f is smooth, i.e. f ∈ C∞[0, 2π] and periodic, the CTR converges to the exact integral at a rate faster than any power of 1/N (or h)! This is called super-algebraic convergence.
Chapter 2
Function Approximation
We saw in the introductory chapter that one key step in the construction of a numerical method to approximate a definite integral is the approximation of the integrand by a simpler function, which we can integrate exactly.
The problem of function approximation is central to many numerical methods: given a continuous function f in an interval [a, b], we would like to find a good approximation to it by simpler functions, such as polynomials, trigonometric polynomials, wavelets, rational functions, etc. We are going to measure the accuracy of an approximation using norms and ask whether or not there is a best approximation out of functions from a given family of simpler functions. These are the main topics of this introductory chapter to Approximation Theory.
2.1 Norms
AnormonavectorspaceV overafieldK=R(orC)isamapping ∥·∥:V →[0,∞),
which satisfy the following properties:
(i) ∥x∥≥0∀x∈V and∥x∥=0iff x=0.
(ii) ∥x+y∥≤∥x∥+∥y∥∀x,y∈V. (iii) ∥λx∥=|λ|∥x∥∀x∈V,λ∈K.
17
18 CHAPTER 2. FUNCTION APPROXIMATION
If we relax (i) to just ∥x∥ ≥ 0, we obtain a semi-norm.
We recall first some of the most important examples of norms in the finite
dimensional case V = Rn (or V = Cn):
∥x∥1 =|x1|+…+|xn|, ∥x∥2 = |x1|2 + . . . + |xn|2,
∥x∥∞ =max{|x1|,…,|xn|}. These are all special cases of the lp norm:
(2.4) ∥x∥p =(|x1|p +…+|xn|p)1/p, 1≤p≤∞.
If we have weights wi > 0 for i = 1,…,n we can also define a weighted p
norm by
(2.5) ∥x∥w,p =(w1|x1|p +…+wn|xn|p)1/p, 1≤p≤∞.
All norms in a finite dimensional space V are equivalent, in the sense that there are two constants c and C such that
(2.6) ∥x∥α ≤ C∥x∥β,
(2.7) ∥x∥β ≤ c∥x∥α,
forallx∈V andforanytwonorms∥·∥α and∥·∥β definedinV.
If V is a space of functions defined on a interval [a, b], for example C [a, b],
the corresponding norms to (2.1)-(2.4) are given by
(2.1)
(2.2) (2.3)
(2.8)
(2.9) (2.10)
(2.11)
|u(x)|dx,
b 1/2
(2.12) ∥u∥p =
, 1 ≤ p ≤ ∞,
∥u∥1 = ∥u∥2 =
b a
|u(x)|2dx ∥u∥∞ = sup |u(x)|,
,
, 1≤p≤∞
∥u∥p =
a x∈[a,b]
b a
|u(x)|pdx
1/p
and are called the L1, L2, L∞, and Lp norms, respectively. Similarly to (2.5) we can defined a weighted Lp norm by
b a
w(x)|u(x)|pdx
1/p
2.2. UNIFORM POLYNOMIAL APPROXIMATION 19 where w is a given positive weight function defined in [a, b]. If w(x) ≥ 0, we
get a semi-norm.
Lemma 1. Let ∥·∥ be a norm on a vector space V then (2.13) | ∥x∥−∥y∥ |≤ ∥x−y∥.
This lemma implies that a norm is a continuous function (on V to R). Proof. ∥x∥=∥x−y+y∥≤∥x−y∥+∥y∥whichgivesthat
(2.14) ∥x∥ − ∥y∥ ≤ ∥x − y∥.
By reversing the roles of x and y we also get
(2.15) ∥y∥ − ∥x∥ ≤ ∥x − y∥.
2.2 Uniform Polynomial Approximation
There is a fundamental result in approximation theory, which states that any continuous function can be approximated uniformly, i.e. using the norm ∥·∥∞, with arbitrary accuracy by a polynomial. This is the celebrated Weier- strass Approximation Theorem. We are going to present a constructive proof due to Sergei Bernstein, which uses a class of polynomials that have found widespread applications in computer graphics and animation. Historically, the use of these so-called Bernstein polynomials in computer assisted design (CAD) was introduced by two engineers working in the French car industry: Pierre B ́ezier at Renault and Paul de Casteljau at Citro ̈en.
2.2.1 Bernstein Polynomials and B ́ezier Curves
Given a function f on [0,1], the Bernstein polynomial of degree n ≥ 1 is defined by
(2.16) Bnf(x)=
n knk n−k
k=0
f n k x (1−x) ,
20 CHAPTER 2. FUNCTION APPROXIMATION where
(2.17) k =(n−k)!k!, k=0,…,n
n n!
are the binomial coefficients. Note that Bnf(0) = f(0) and Bnf(1) = f(1)
for all n. The terms
(2.18) bk,n(x)= k x (1−x) , k=0,…,n
n k n−k
which are all nonnegative, are called the Bernstein basis polynomials and can
be viewed as x-dependent weights that sum up to one:
n n n
(2.19) bk,n(x)= k=0 k=0
Thus, for each x ∈ [0,1], Bnf(x) represents a weighted average of the values of f at 0, 1/n, 2/n, . . . , 1. Moreover, as n increases the weights bk,n(x) con- centrate more and more around the points k/n close to x as Fig. 2.1 indicates for bk,n(0.25) and bk,n(0.75).
For n = 1, the Bernstein polynomial is just the straight line connecting f(0) and f(1), B1f(x) = (1−x)f(0)+xf(1). Given two points P0 = (x0,y0) and P1 = (x1,y1), the segment of the straight line connecting them can be written in parametric form as
(2.20) B1(t)=(1−t)P0 +tP1, t∈[0,1].
With three points, P0, P1, P2, we can employ the quadratic Bernstein basis
polynomials to get a more useful parametric curve
(2.21) B2(t)=(1−t)2P0 +2t(1−t)P1 +t2P2, t∈[0,1].
This curve connects again P0 and P2 but P1 can be used to control how the curve bends. More precisely, the tangents at the end points are B2′ (0) = 2(P1 − P0) and B2′ (1) = 2(P2 − P1), which intersect at P1, as Fig. 2.2 illus- trates. These parametric curves formed with the Bernstein basis polynomials are called B ́ezier curves and have been widely employed in computer graph- ics, specially in the design of vector fonts, and in computer animation. To allow the representation of complex shapes, quadratic or cubic B ́ezier curves
k
xk(1−x)n−k =[x+(1−x)]n =1.
2.2. UNIFORM POLYNOMIAL APPROXIMATION 21
0.14 0.12 0.1 0.08 0.06 0.04 0.02 0
bk,n(0.25)
bk,n(0.75)
0 10 20 30 40 50
k
Figure 2.1: The Bernstein weights bk,n(x) for x = 0.25 (◦)and x = 0.75 (•), n = 50 and k = 1 . . . n.
P1
P0
P2
Figure 2.2: Quadratic B ́ezier curve.
22 CHAPTER 2. FUNCTION APPROXIMATION
are pieced together to form composite B ́ezier curves. To have some degree of smoothness (C1), the common point for two pieces of a composite B ́ezier curve has to lie on the line connecting the two adjacent control points on ei- ther side. For example, the TrueType font used in most computers today is generated with composite, quadratic B ́ezier curves while the Metafont used in these pages, via LATEX, employs composite, cubic B ́ezier curves. For each character, many pieces of B ́ezier are stitched together.
Let us now do some algebra to prove some useful identities of the Bern- stein polynomials. First, for f(x) = x we have,
(2.22)
Now for f(x) = x2, we get
nknk n−kn kn!k n−k
n k x (1−x) k=0
= n(n−k)!k!x (1−x) k=1
n n − 1 k − 1
=x k−1x(1−x)
k=1
n−1 n − 1
n k2n k
(2.23) n k x(1−x) = n k−1 x(1−x)
= x k=0
xk (1 − x)n−1−k =x[x+(1−x)]n−1 =x.
n−k n kn−1 k n−k k=0 k=1
k
n − k
and writing
(2.24) k=k−1+1=n−1k−1+1, nnnnn−1n
2.2. UNIFORM POLYNOMIAL APPROXIMATION 23 we have
n k2n k n−k
n−1n k−1n−1 k n−k = n n − 1 k − 1 x (1 − x)
n − k
k=0
k=2
1 n n − 1 k
n
k
x (1 − x)
k−1 x(1−x) n−1nn−2k
+n = n
k=1
n−k x k−2x(1−x) +n
k=2
n−1 n−2n−2 x
= x2 xk(1−x)n−2−k+ . nkn
k=0
Thus, (2.25)
2.2.2 Weierstrass Approximation Theorem
Theorem 1. (Weierstrass Approximation Theorem) Let f be a continuous
function in [a, b]. Given ε > 0 there is a polynomial p such that max |f(x) − p(x)| < ε.
a≤x≤b
Proof. We are going to work on the interval [0,1]. For a general interval [a, b], we consider the simple change of variables x = a + (b − a)t for t ∈ [0, 1] so that F (t) = f (a + (b − a)t) is continuous in [0, 1].
nk2nk n−k n−12 x
n kx(1−x)=nx+n.
Now, expanding k − x2 and using (2.19), (2.22), and (2.25) it follows that
k=0
n
n k 2 n k n − k 1
(2.26) n − x k x (1 − x) = nx(1 − x). k=0
Using (2.19), we have (2.27) f(x)−Bnf(x)=
n k n k n − k
f(x)−f n k x (1−x) .
k=0
24 CHAPTER 2. FUNCTION APPROXIMATION Since f is continuous in [0, 1], it is also uniformly continuous. Thus, given
ε > 0 there is δ(ε) > 0, independent of x, such that (2.28) |f(x) − f(k/n)| < ε if |x − k/n| < δ.
2
Moreover,
(2.29) |f (x) − f (k/n)| ≤ 2∥f ∥∞
for all x ∈ [0, 1], k = 0, 1, . . . , n.
We now split the sum in (2.27) in two sums, one over the points such that
|k/n − x| < δ and the other over the points such that |k/n − x| ≥ δ:
knk n−k
f(x)−f n k x (1−x) k n
|k/n−x|≥δ
Using (2.28) and (2.19) it follows immediately that the first sum is bounded
by ε/2. For the second sum we have
knk n−k
f(x)−Bnf(x)=
+ f(x)−f n k xk(1−x)n−k.
(2.30)
|k/n−x|<δ
f(x)−f n k x(1−x)
|k/n−x|≥δ ≤2∥f∥∞
2∥f∥∞ ≤δ2
nk n−k
k x(1−x)
k 2n k n−k
|k/n−x|≥δ
(2.31)
Therefore, there is N such that for all n ≥ N the second sum in (2.30) is bounded by ε/2 and this completes the proof.
n−x kx(1−x) n k 2 n
|k/n−x|≥δ
≤ ∞ −x xk(1−x)n−k
2∥f∥
δ2 n k
k=0
= 2∥f∥∞x(1−x) ≤ ∥f∥∞.
nδ2 2nδ2
2.3. BEST APPROXIMATION 25 2.3 Best Approximation
We just saw that any continuous function f on a closed interval can be approximated uniformly with arbitrary accuracy by a polynomial. Ideally we would like to find the closest polynomial, say of degree at most n, to the function f when the distance is measured in the supremum (infinity) norm, or in any other norm we choose. There are three important elements in this general problem: the space of functions we want to approximate, the norm, and the family of approximating functions. The following definition makes this more precise.
Definition 3. Given a normed linear space V and a subspace W of V , p∗ ∈ W is called the best approximation of f ∈ V by elements in W if
(2.32) ∥f−p∗∥≤∥f−p∥, forallp∈W.
For example, the normed linear space V could be C[a,b] with the supre- mum norm (2.10) and W could be the set of all polynomials of degree at most n, which henceforth we will denote by Pn.
Theorem 2. Let W be a finite-dimensional subspace of a normed linear space V . Then, for every f ∈ V , there is at least one best approximation to f by elements in W.
Proof. Since W is a subspace 0 ∈ W and for any candidate p ∈ W for best approximation to f we must have
(2.33) ∥f − p∥ ≤ ∥f − 0∥ = ∥f∥. Therefore we can restrict our search to the set
(2.34) F ={p∈W :∥f−p∥≤∥f∥}.
F is closed and bounded and because W is finite-dimensional it follows that F is compact. Now, the function p → ∥f − p∥ is continuous on this compact set and hence it attains its minimum in F .
If we remove the finite-dimensionality of W then we cannot guarantee that there is a best approximation as the following example shows.
26 CHAPTER 2. FUNCTION APPROXIMATION
Example 4. Let V = C[0, 1/2] and W be the space of all polynomials (clearly ofsubspaceofV). Takef(x)=1/(1−x). Then,givenε>0thereisNsuch that
2 N −(1+x+x +…+x )<ε.
0, which implies
(2.36) p∗(x)= 1 ,
1−x
which is impossible.
Theorem 2 does not guarantee uniqueness of best approximation. Strict convexity of the norm gives us a sufficient condition.
Definition 4. A norm ∥ · ∥ on a vector space V is strictly convex if for all f ̸=g in V with ∥f∥=∥g∥=1 then
∥θf+(1−θ)g∥<1, forall0<θ<1.
In other words, a norm is strictly convex if its unit ball is strictly convex.
The p-norm is strictly convex for 1 < p < ∞ but not for p = 1 or p = ∞.
Theorem 3. Let V be a vector space with a strictly convex norm, W a subspace of V, and f ∈ V. If p∗ and q∗ are best approximations of f in W then p∗ = q∗.
Proof. Let M = ∥f −p∗∥ = ∥f −q∗∥, if p∗ ̸= q∗ by the strict convexity of the norm
(2.37) ∥θ(f−p∗)+(1−θ)(f−q∗)∥
(2.41) c= 1(m+∥en∥∞)>0, 2
28
CHAPTER 2. FUNCTION APPROXIMATION
en(x) en(x)−c
0
Figure 2.3: If the error function en does not equioscillate at least twice we could lower ∥en∥∞ by an amount c > 0.
and subtract it to en, as shown in Fig. 2.3, we have (2.42) −∥en∥∞ +c≤en(x)−c≤∥en∥∞ −c.
Therefore, ∥en −c∥∞ = ∥f −(p∗n +c)∥∞ = ∥en∥∞ −c < ∥en∥∞ but p∗n +c ∈ Pn so this is impossible since p∗n is a best approximation. A similar argument can used when en(x1) = −∥en∥∞.
Before proceeding to the general case, let us look at the n = 1 situation. Suppose there only two alternating extrema x1 and x2 for e1 as described in (2.40). We are going to construct a linear polynomial that has the same sign as e1 at x1 and x2 and which can be used to decrease ∥e1∥∞. Suppose e1(x1) = ∥e1∥∞ and e1(x2) = −∥e1∥∞. Since e1 is continuous, we can find small closed intervals I1 and I2, containing x1 and x2, respectively, and such that
(2.43) e1(x) > ∥e1∥∞ for all x ∈ I1, 2
(2.44) e1(x) < −∥e1∥∞ for all x ∈ I2. 2
Clearly I1 and I2 are disjoint sets so we can choose a point x0 between the two intervals. Then, it is possible to find a linear polynomial q that passes through x0 and that is positive in I1 and negative in I2. We are now going
2.3. BEST APPROXIMATION 29
to find a suitable constant α > 0 such that ∥f − p∗1 − αq∥∞ < ∥e1∥∞. Since p∗1 + αq ∈ P1 this would be a contradiction to the fact that p∗1 is a best approximation.
Let R = [a, b] \ (I1 ∪ I2) and d = maxx∈R |e1(x)|. Clearly d < ∥e1∥∞. Choose α such that
(2.45)
On I1, we have
(2.46) 0<αq(x)<
0<α< 1 (∥e1∥∞ −d). 2∥q∥∞
1 (∥e1∥∞ −d)q(x)≤ 1(∥e1∥∞ −d)
x∈[−1,1] 2n−1
The zeros and extrema of Tn are easy to find. Because Tn(x) = cosnθ
and 0 ≤ θ ≤ π, the zeros occur when θ is an odd multiple of π/2. Therefore, (2j + 1) π
for j = 0,1,…,n, that is
jπ
(2.75) xj = cos n , j = 0,1,…,n.
These points are called Chebyshev or Gauss-Lobatto points and are ex-
tremely useful in applications. Note that xj for j = 1, . . . , n − 1 are local extrema. Therefore
(2.76) Tn′(xj)=0, forj=1,…,n−1.
In other words, the Chebyshev points (2.75) are the n − 1 zeros of Tn′ plus theendpointsx0 =1andxn =−1.
Using the Chain Rule we can differentiate Tn with respect to x we get
(2.77) Tn′(x) = −nsinnθdθ = nsinnθ, (x = cosθ). dx sinθ
(2.74) x ̄j =cos n 2 j=0,…,n−1.
The extrema of Tn (the points x where Tn(x) = ±1) correspond to nθ = jπ
2.4. CHEBYSHEV POLYNOMIALS 35
Therefore
T′ (x) T′ (x) 1 n+1 n−1 sinθ
(2.78) n+1 − n−1 = [sin(n + 1)θ − sin(n − 1)θ]
and since sin(n + 1)θ − sin(n − 1)θ = 2 sin θ cos nθ, we get that
T′ (x) T′ (x) (2.79) n+1 − n−1
The polynomial
(2.80) Un(x)= n+1 = , (x=cosθ)
n+1 n−1
= 2Tn(x).
T ′ (x) sin(n + 1)θ n+1 sinθ
of degree n is called the Chebyshev polynomial of second kind. Thus, the Chebyshev nodes (2.75) are the zeros of the polynomial
(2.81) qn+1(x) = (1 − x2)Un−1(x).
36 CHAPTER 2. FUNCTION APPROXIMATION
Chapter 3 Interpolation
3.1 Polynomial Interpolation
One of the basic tools for approximating a function or a given data set is interpolation. In this chapter we focus on polynomial and piece-wise poly- nomial interpolation.
The polynomial interpolation problem can be stated as follows: Given n+1 datapoints,(x0,f0),(x1,f1)…,(xn,fn),wherex0,x1,…,xn aredistinct,find a polynomial pn ∈ Pn, which satisfies the interpolation property:
pn(x0) = f0, pn(x1) = f1, .
pn(xn) = fn.
The points x0, x1, . . . , xn are called interpolation nodes and the values f0, f1, . . . , fn are data supplied to us or can come from a function f we are trying to ap- proximate, in which case fj = f(xj) for j = 0,1,…,n.
Let us represent such polynomial as pn(x) = a0 + a1x + · · · + anxn. Then, the interpolation property implies
a0 +a1x0 +···+anxn0 =f0, a0 +a1x1 +···+anxn1 =f1,
. 37
38 CHAPTER 3. INTERPOLATION a0 +a1xn +···+anxn =fn.
This is a linear system of n + 1 equations in n + 1 unknowns (the polynomial coefficients a0, a1, . . . , an). In matrix form:
1x x2···xna f 00000
1 x1 x21···xn1a1 f1 (3.1) . . = .
. . . 1xnx2n···xn an fn.
Does this linear system have a solution? Is this solution unique? The answer is yes to both. Here is a simple proof. Take fj = 0 for j = 0,1,…,n. Then pn(xj) = 0, for j = 0,1,…,n but pn is a polynomial of degree at most n, it cannot have n+1 zeros unless pn ≡ 0, which implies a0 = a1 = ··· = an = 0. That is, the homogenous problem associated with (3.1) has only the trivial solution. Therefore, (3.1) has a unique solution.
Example 5. As an illustration let us consider interpolation by a linear poly- nomial, p1. Suppose we are given (x0,f0) and (x1,f1). We have written p1 explicitly in the Introduction, we write it now in a different form:
(3.2) p1(x)= x−x1 f0 + x−x0 f1. x0 − x1 x1 − x0
Clearly, this polynomial has degree at most 1 and satisfies the interpolation property:
(3.3) p1(x0) = f0,
(3.4) p1(x1) = f1.
Example 6. Given (x0, f0), (x1, f1), and (x2, f2) let us construct p2 ∈ P2 that interpolates these points. The way we have written p1 in (3.2) is suggestive of how to explicitly write p2:
p2(x) = (x−x1)(x−x2) f0 + (x−x0)(x−x2) f1 + (x−x0)(x−x1) f2. (x0 −x1)(x0 −x2) (x1 −x0)(x1 −x2) (x2 −x0)(x2 −x1)
3.1. POLYNOMIAL INTERPOLATION 39 If we define
(3.5) (3.6) (3.7)
then we simply have
(3.8)
l0(x) = l1(x) = l2(x) =
(x − x1)(x − x2) , (x0 − x1)(x0 − x2)
(x − x0)(x − x2) , (x1 − x0)(x1 − x2)
(x − x0)(x − x1) , (x2 − x0)(x2 − x1)
p2(x) = l0(x)f0 + l1(x)f1 + l2(x)f2.
Note that each of the polynomials (3.5), (3.6), and (3.7) are exactly of degree 2 and they satisfy lj(xk) = δjk 1. Therefore, it follows that p2 given by (3.8) satisfies the interpolation property
p2(x0) = f0, (3.9) p2(x1) = f1, p2(x2) = f2.
We can now write down the polynomial of degree at most n that interpo- lates n + 1 given values, (x0, f0), . . . , (xn, fn), where the interpolation nodes x0, . . . , xn are assumed distinct. Define
lj(x)= (x−x0)···(x−xj−1)(x−xj+1)···(x−xn) (xj −x0)···(xj −xj−1)(xj −xj+1)···(xj −xn)
(3.10) n (x − xk)
= (xj −xk), forj=0,1,…,n.
k=0 k̸=j
These are called the elementary Lagrange polynomials of degree n. For simplicity, we are omitting their dependence on n in the notation. Since lj(xk) = δjk, we have that
n
(3.11) pn(x)=l0(x)f0 +l1(x)f1 +···+ln(x)fn =lj(x)fj
j=0 1δjk is the Kronecker delta, i.e. δjk = 0 if k ̸= j and 1 if k = j.
40 CHAPTER 3. INTERPOLATION
interpolates the given data, i.e., it satisfies the interpolation property pn(xj) = fj for j = 0,1,2,…,n. Relation (3.11) is called the Lagrange form of the interpolating polynomial. The following result summarizes our discussion.
Theorem 7. Given the n + 1 values (x0,f0),…,(xn,fn), for x0,x1,…,xn distinct. There is a unique polynomial pn of degree at most n such that pn(xj) = fj for j = 0,1,…,n.
Proof. pn in (3.11) is of degree at most n and interpolates the data. Unique- ness follows from the Fundamental Theorem of Algebra, as noted earlier. Suppose there is another polynomial qn of degree at most n such that qn(xj) = fj for j = 0,1,…,n. Consider r = pn − qn. This is a polynomial of degree at most n and r(xj) = pn(xj)−qn(xj) = fj −fj = 0 for j = 0,1,2,…,n, which is impossible unless r ≡ 0. This implies qn = pn.
3.1.1 Equispaced and Chebyshev Nodes
There are special two sets of nodes that are particularly important in ap- plications. For convenience we are going to take the interval [−1, 1]. For a general interval [a, b], we can do the simple change of variables
(3.12) x=1(a+b)+1(b−a)t, t∈[−1,1]. 22
The uniform or equispaced nodes are given by
(3.13) xj =−1+jh, j=0,1,…,n andh=2/n.
These nodes yield very accurate and efficient trigonometric polynomial in- terpolation but are not good for (algebraic) polynomial interpolation as we will see later.
One of the preferred set of nodes for high order, accurate, and computa- tionally efficient polynomial interpolation is the Chebyshev or Gauss-Lobatto set
jπ
(3.14) xj =cos n , j=0,…,n,
which as discussed in Section 2.4 are the extrema of the Chebyshev poly- nomial (2.64) of degree n . Note that these nodes are obtained from the
3.2. CONNECTION TO BEST UNIFORM APPROXIMATION 41
equispaced points θj = j(π/n), j = 0, 1, . . . , n in [0, π] by the one-to-one re- lation x = cosθ, for θ ∈ [0,π]. As defined in (3.14), the nodes go from 1 to -1 so often the alternative definition xj = − cos(jπ/n) is used. The Chebyshev nodes are not equally spaced and tend to cluster toward the end points of the interval.
3.2 Connection to Best Uniform Approxima- tion
Given a continuous function f in [a, b], its best uniform approximation p∗n in Pn is characterized by an error, en = f − p∗n, which equioscillates, as defined in (2.40), at least n + 2 times. Therefore en has a minimum of n + 1 zeros andconsequently,thereexistsx0,…,xn suchthat
(3.15)
p∗n(x0) = f(x0), p∗n(x1) = f(x1),
.
p∗n(xn) = f(xn),
In other words, p∗n is the polynomial of degree at most n that interpolates the function f at n + 1 zeros of en. Of course, we do not construct p∗n by finding these particular n + 1 interpolation nodes. A more practical question is: given (x0,f(x0)),(x1,f(x1)),…,(xn,f(xn)), where x0,…,xn are distinct interpolation nodes in [a,b], how close is pn, the interpolating polynomial of degree at most n of f at the given nodes, to the best uniform approximation p∗n of f in Pn?
To obtain a bound for ∥pn −p∗n∥∞ we note that pn −p∗n is a polynomial of degree at most n which interpolates f − p∗n. Therefore, we can use Lagrange formula to represent it
n
(3.16) pn (x) − p∗n (x) = lj (x)(f (xj ) − p∗n (xj )).
j=0
It then follows that
(3.17) ∥pn − p∗n∥∞ ≤ Λn∥f − p∗n∥∞,
42 where (3.18)
CHAPTER 3.
n
Λn = max |lj(x)|
a≤x≤b
j=0
INTERPOLATION
is called the Lebesgue Constant and depends only on the interpolation nodes, not on f. On the other hand, we have that
(3.19) ∥f−pn∥∞ =∥f−p∗n−pn+p∗n∥∞ ≤∥f−p∗n∥∞+∥pn−p∗n∥∞. Using (3.17) we obtain
(3.20) ∥f − pn∥∞ ≤ (1 + Λn)∥f − p∗n∥∞.
This inequality connects the interpolation error ∥f − pn∥∞ with the best approximation error ∥f −p∗n∥∞. What happens to these errors as we increase n? To make it more concrete, suppose we have a triangular array of nodes as follows:
x(0) 0
x(1) x(1) 01
x(2) x(2) x(2) 012
(3.21) .
x(n) x(n) … x(n) 01n
.
wherea≤x(n)
3.3. BARYCENTRIC FORMULA 43
and hence the Lebesgue constant is not bounded in n. Therefore, we cannot conclude from (3.20) and (3.22) that ∥f − pn∥∞ as n → ∞, i.e. that the interpolating polynomial, as we add more and more nodes, converges uni- formly to f. That depends on the regularity of f and on the distribution of the nodes. In fact, if we are given the triangular array of interpolation nodes (3.21) in advance, it is possible to construct a continuous function f such that pn will not converge uniformly to f as n → ∞.
3.3 Barycentric Formula
The Lagrange form of the interpolating polynomial is not convenient for com- putations. If we want to increase the degree of the polynomial we cannot reuse the work done in getting and evaluating a lower degree one. How- ever, we can obtain a very efficient formula by rewriting the interpolating polynomial in the following way. Let
(3.24) ω(x) = (x − x0)(x − x1) · · · (x − xn).
Then, differentiating this polynomial of degree n+1 and evaluating at x = xj
we get
(3.25) ω′(xj)=(xj −xk), forj=0,1,…,n,
k=0 k̸=j
Therefore, each of the elementary Lagrange polynomials may be written as ω(x)
(3.26) lj(x)= x−xj = ω(x) , forj=0,1,…,n, ω′(xj ) (x − xj )ω′(xj )
for x ̸= xj and lj (xj ) = 1 follows from L’Hˆopital rule. Defining (3.27) λj = 1 , for j = 0,1,…,n,
n
ω′(xj)
we can write Lagrange formula for the interpolating polynomial of f at
x0,x1,…,xn as
(3.28) pn(x) = ω(x)n λj j=0 x−xj
fj.
44 CHAPTER 3. INTERPOLATION
Now, note that from (3.11) with f(x) ≡ 1 it follows that nnλ
(3.29) 1 = lj(x) = ω(x) j . j=0 j=0 x−xj
Dividing (3.28) by (3.29), we get the so-called Barycentric Formula for in- terpolation:
(3.30) pn(x) =
n λj fj j=0 x−xj
n λ , j
j=0 x−xj
for x ̸= xj,
j = 0,1,…,n.
For x = xj, j = 0,1,…,n, the interpolation property, pn(xj) = fj, should be used.
The numbers λj depend only on the nodes x0, x1, …, xn and not on given values f0,f1,…,fn. We can obtain them explicitly for both the Chebyshev nodes (3.14) and for the equally spaced nodes (3.13) and can be precomputed efficiently for a general set of nodes.
3.3.1 Barycentric Weights for Chebyshev Nodes
The Chebyshev nodes are the zeros of qn+1(x) = (1 − x2)Un−1(x), where Un−1(x) = sin nθ/ sin θ, x = cos θ is the Chebyshev polynomial of the second kind of degree n − 1, with leading order coefficient 2n−1 [see Section 2.4]. Since the λj’s can be defined up to a multiplicative constant (which would cancel out in the barycentric formula) we can take λj to be proportional to
1/q′ n+1
(3.31)
differentiating we get
(3.32) Thus,
(3.33)
qn+1(x) = sin θ sin nθ,
q′ (x) = −n cos nθ − sin nθ cot θ.
−2n,forj=0,
q′ (x)= −(−1)jn, forj=1,…,n−1,
−2n (−1)n for j = n.
(x ). Since j
n+1
n+1 j
3.3. BARYCENTRIC FORMULA 45 We can factor out −n in (3.33) to obtain the barycentric weights for the
Chebyshev points
(3.34) λj =
Note that for a general interval [a, b], the term (a + b)/2 in the change of
variables (3.12) cancels out in (3.25) but we gain an extra factor of [(b−a)/2]n. However, this factor can be omitted as it does not alter the barycentric formula. Therefore, the same barycentric weights (3.34) can also be used for the Chebyshev nodes in an interval [a, b].
3.3.2 Barycentric Weights for Equispaced Nodes
Forequispacedpoints,xj =x0+jh,j=0,1,…,nwehave
1
(xj −x0)···(xj −xj−1)(xj −xj+1)···(xj −xn)
1
(jh)[(j − 1)h] · · · (h)(−h)(−2h) · · · (j − n)h
1
(−1)n−j hn[j(j − 1) · · · 1][1 · 2 · · · (n − j)]
(−1)j −n n! hnn! j!(n − j)!
(−1)j−n n hnn! j
λj = = = = =
2,forj=0,
(−1)j, forj=1,…,n−1, 1 (−1)n forj=n.
2
1
(−1)n
hn! j
We can omit the factor (−1)n/(hnn!) because it cancels out in the barycentric formula. Thus, for equispaced nodes we can use
n = n (−1)j .
j n (3.35) λj = (−1) j
, j = 0,1,…n.
3.3.3 Barycentric Weights for General Sets of Nodes
For general arrays of nodes we can precompute the barycentric weights effi- ciently as follows.
46 CHAPTER 3. λ(0) = 1;
for m = 1 : n
for j = 0 : m − 1 (m) λ(m−1)
λ=j; j xj−xm
end
λ(m)= 1 ;
m m−1
(xm −xk)
k=0
INTERPOLATION
0
end
If we want to add one more point (xn+1, fn+1) we just extend the m-loop
to n + 1 to generate λ(n+1), λ(n+1), · · · , λ(n+1). 0 1 n+1
3.4 Newton’s Form and Divided Differences
There is another representation of the interpolating polynomial which is both very efficient computationally and very convenient in the derivation of nu- merical methods based on interpolation. The idea of this representation, due to Newton, is to use successively lower order polynomials for constructing pn.
Suppose we have gotten pn−1 ∈ Pn−1, the interpolating polynomial of (x0, f0), (x1, f1), . . . , (xn−1, fn−1) and we would like to obtain pn ∈ Pn, the in- terpolating polynomial of (x0, f0), (x1, f1), . . . , (xn, fn) by reusing pn−1. The difference between these polynomials, r = pn −pn−1, is a polynomial of degree atmostn. Moreover,forj=0,…,n−1
(3.36) r(xj) = pn(xj) − pn−1(xj) = fj − fj = 0.
Therefore, r can be factored as
(3.37) r(x) = cn(x − x0)(x − x1) · · · (x − xn−1).
The constant cn is called the n-th divided difference of f with respect to x0, x1, …, xn, and is usually denoted as f[x0, . . . , xn]. Thus, we have
(3.38) pn (x) = pn−1 (x) + f [x0 , . . . , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 ). By the same argument, we have
(3.39) pn−1(x)=pn−2(x)+f[x0,…,xn−1](x−x0)(x−x1)···(x−xn−2),
3.4. NEWTON’S FORM AND DIVIDED DIFFERENCES 47 etc. So we arrive at Newton’s Form of pn:
(3.40)
pn (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + . . . + f [x0 , . . . , xn ](x − x0 ) · · · (x − xn−1 ).
Note that for n = 1
Therefore (3.41) (3.42)
and (3.43)
p1 (x) = f [x0 ] + f [x0 , x1 ](x − x0 ), p1(x0) = f[x0] = f0, p1(x1)=f[x0]+f[x0,x1](x1 −x0)=f1.
f[x0] = f0, f[x0,x1]= f1 −f0,
p1(x) = f0 + f1 − f0 (x − x0). x1 − x0
x1 − x0
Define f [xj ] = fj for j = 0, 1, …n. The following identity will allow us to compute all the required divided differences.
Theorem 8.
(3.44) f[x0,x1,…,xk]= f[x1,x2,…,xk]−f[x0,x1,…,xk−1]. xk − x0
Proof. Let pk−1 be the interpolating polynomial of degree at most k − 1 of (x1, f1), . . . , (xk, fk) and qk−1 the interpolating polynomial of degree at most k−1 of (x0,f0),…,(xk−1,fk−1). Then
(3.45) p(x) = pk−1(x) + x − xk [pk−1(x) − qk−1(x)]. xk − x0
is a polynomial of degree at most k and for j = 1, 2, ….k − 1
p(xj)=fj + xj −xk[fj −fj]=fj. xk − x0
48 CHAPTER 3. INTERPOLATION
Moreover, p(x0) = qk−1(x0) = f0 and p(xk) = pk−1(xk) = fk. There- fore, p is the interpolation polynomial of degree at most k of the points (x0, f0), (x1, f1), . . . , (xk, fk). The leading order coefficient of pk is f[x0, …, xk] and equating this with the leading order coefficient of p
f[x1,…,xk]−f[x0,x1,…xk−1], xk − x0
gives (3.44).
To obtain the divided difference we construct a table using (3.44), column by column as illustrated below for n = 3.
xj 0th order x0 f0
x1 f1
x2 f2
1th order f[x0,x1] f[x1, x2] f[x2,x3]
2th order
f[x0,x1,x2] f[x1,x2,x3]
3th order
f[x0, x1, x2, x3]
x3 f3
Example 7. Let f(x) = 1+x2, xj = j, and fj = f(xj) for j = 0,…,3.
Then
xj 0th order 01
1 2
2 5
3 10
1th order
2−1 = 1 1−0
5−2 =3 2−1
10−5 =5 3−2
2th order
3−1 = 1 2−0
5−3 =1 3−1
3th order
0
so
p3(x) = 1 + 1(x − 0) + 1(x − 0)(x − 1) + 0(x − 0)(x − 1)(x − 2) = 1 + x2.
After computing the divided differences, we need to evaluate pn at a given point x. This can be done efficiently by suitably factoring it. For example,
3.5. CAUCHY’S REMAINDER 49 for n = 3 we have
p3(x)=c0 +c1(x−x0)+c2(x−x0)(x−x1)+c3(x−x0)(x−x1)(x−x2) =c0 +(x−x0){c1 +(x−x1)[c2 +(x−x2)c3]}
For general n we can use the following Horner-like scheme to get p = pn(x): p = cn;
for k = n − 1 : 0
p=ck +(x−xk)∗p; end
3.5 Cauchy’s Remainder
We now assume that the data fj = f(xj), j = 0,1,…,n come from a sufficiently smooth function f, which we are trying to approximate with an interpolating polynomial pn, and we focus on the error f − pn of such approximation.
In the Introduction we proved that if x0, x1, and x are in [a,b] and f ∈ C2[a,b] then
f(x) − p1(x) = 1f′′(ξ(x))(x − x0)(x − x1), 2
where p1 is the polynomial of degree at most 1 that interpolates (x0,f(x0)), (x1, f(x1)) and ξ(x) ∈ (a, b). The general result about the interpolation error is the following theorem:
Theorem 9. Let f ∈ Cn+1[a,b], x0,x1,…,xn,x be contained in [a,b], and pn(x) be the interpolation polynomial of degree at most n of f at x0,…,xn then
(3.46) f(x) − pn(x) = 1 f(n+1)(ξ(x))(x − x0)(x − x1) · · · (x − xn), (n+1)!
where min{x0,…,xn,x} < ξ(x) < max{x0,...,xn,x}.
Proof. The right hand side of (3.46) is known as the Cauchy Remainder and the following proof is due to Cauchy.
50 CHAPTER 3. INTERPOLATION For x equal to one of the nodes xj the result is trivially true. Take x fixed
not equal to any of the nodes and define
(3.47) φ(t)=f(t)−pn(t)−[f(x)−pn(x)] (t−x0)(t−x1)···(t−xn) .
Clearly, φ ∈ Cn+1[a, b] and vanishes at t = x0, x1, ..., xn, x. That is, φ has at least n + 2 zeros. Applying Rolle’s Theorem n + 1 times we conclude that there exists a point ξ(x) ∈ (a, b) such that φ(n+1)(ξ(x)) = 0. Therefore,
0 = φ(n+1)(ξ(x)) = f(n+1)(ξ(x)) − [f(x) − pn(x)] (n + 1)! (x−x0)(x−x1)···(x−xn)
from which (3.46) follows. Note that the repeated application of Rolle’s theo- rem implies that ξ(x) is between min{x0, x1, ..., xn, x} and max{x0, x1, ..., xn, x}.
We are now going to find a beautiful connection between Chebyshev polynomials and the interpolation error as given by the Cauchy remainder (3.46). Let us consider the interval [−1, 1]. We have no control on the term f(n+1)(ξ(x)). However, we can choose the interpolation nodes x0,...,xn so that the factor
(3.48) w(x) = (x − x0)(x − x1) · · · (x − xn)
is smallest as possible in the infinity norm. The function w is a monic poly-
(x−x0)(x−x1)···(x−xn)
nomial of degree n + 1 and we have proved in Section 2.4 that the Chebyshev
polynomial Tn+1, defined in (2.72), is the monic polynomial of degree n + 1 with smallest infinity norm. Hence, if the interpolation nodes are taken to
be the zeros of Tn+1, namely
(3.49)
(2j + 1) π
xj =cos n+1 2 , j=0,1,...n.
∥w∥∞ is minimized and ∥w∥∞ = 2−n. The following theorem summarizes this observation.
Theorem 10. Let pTn (x) be the interpolating polynomial of degree at most n of f ∈ Cn+1[−1, 1] with respect to the nodes (3.49) then
(3.50) ∥f−pTn∥∞ ≤ 1 ∥fn+1∥∞. 2n(n + 1)!
3.5. CAUCHY’S REMAINDER 51 The Gauss-Lobatto Chebyshev points,
jπ
(3.51) xj = cos n , j = 0,1,...,n,
which are the extrema and not the zeros of the corresponding Chebyshev polynomial, do not minimize maxx∈[−1,1] |w(x)|. However, they are nearly optimal. More precisely, since the Gauss-Lobatto nodes are the zeros of the (monic) polynomial [see (2.81) and (3.31) ]
(3.52) 1 (x2 −1)Un−1(x) = 1 sinθsinnθ, x = cosθ. 2n−1 2n−1
We have that
(3.53) ∥w∥∞ = max (1 − x )Un−1(x) ≤ .
12 1 x∈[−1,1] 2n−1 2n−1
So, the Gauss-Lobatto nodes yield a ∥w∥∞ of no more than a factor of two from the optimal value.
We now relate divided differences to the derivatives of f using the Cauchy remainder. Take an arbitrary point t distinct from x0,...,xn. Let pn+1 be the interpolating polynomial of f at x0,...,xn,t and pn that at x0,...,xn. Then, Newton’s formula (3.38) implies
(3.54) pn+1 (x) = pn (x) + f [x0 , . . . , xn , t](x − x0 )(x − x1 ) · · · (x − xn ). Noting that pn+1(t) = f(t) we get
(3.55) f (t) = pn (t) + f [x0 , . . . , xn , t](t − x0 )(t − x1 ) · · · (t − xn ). Since t was arbitrary we can set t = x and obtain
(3.56) f (x) = pn (x) + f [x0 , . . . , xn , x](x − x0 )(x − x1 ) · · · (x − xn ), and upon comparing with the Cauchy remainder we get
(3.57) f[x_0, ..., x_n, x] = f^{(n+1)}(ξ(x)) / (n+1)!.
If we set x = x_{n+1} and relabel n + 1 by k we have
(3.58) f[x_0, ..., x_k] = (1/k!) f^{(k)}(ξ),
where min{x_0,...,x_k} < ξ < max{x_0,...,x_k}. Suppose that we now let x_1, ..., x_k → x_0. Then ξ → x_0 and
(3.59) lim_{x_1,...,x_k → x_0} f[x_0,...,x_k] = (1/k!) f^{(k)}(x_0).
We can use this relation to define a divided difference where there are coincident nodes. For example f[x0, x1] when x0 = x1 by f[x0, x0] = f′(x0), etc. This is going to be very useful for the following interpolation problem.
3.6 Hermite Interpolation
The Hermite interpolation problem is: given values of f and some of its derivatives at the nodes x0,x1,...,xn, find the interpolating polynomial of smallest degree interpolating those values. This polynomial is called the Hermite Interpolation Polynomial and can be obtained with a minor modifi- cation to the Newton’s form representation.
For example: Suppose we look for a polynomial p of lowest degree which satisfies the interpolation conditions:
p(x0) = f(x0), p′(x0) = f′(x0), p(x1) = f(x1), p′(x1) = f′(x1).
We can view this problem as a limiting case of polynomial interpolation of f at two pairs of coincident nodes, x0,x0,x1,x1 and we can use Newton’s Interpolation form to obtain p. The table of divided differences, in view of (3.59), is
(3.60)
x_0   f(x_0)
x_0   f(x_0)   f'(x_0)
x_1   f(x_1)   f[x_0, x_1]   f[x_0, x_0, x_1]
x_1   f(x_1)   f'(x_1)       f[x_0, x_1, x_1]   f[x_0, x_0, x_1, x_1]

and
(3.61) p(x) = f(x_0) + f'(x_0)(x − x_0) + f[x_0, x_0, x_1](x − x_0)² + f[x_0, x_0, x_1, x_1](x − x_0)²(x − x_1).
Example 8. Let f(0) = 1, f′(0) = 0 and f(1) = √2. Find the Hermite Interpolation Polynomial.
We construct the table of divided differences as follows:
(3.62)
0   1
0   1    0
1   √2   √2 − 1   √2 − 1

and therefore
(3.63) p(x) = 1 + 0·(x − 0) + (√2 − 1)(x − 0)² = 1 + (√2 − 1)x².
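The repeated-node divided difference table can also be assembled programmatically. The following Python sketch is only illustrative (function and variable names are not from the text) and assumes each node appears exactly twice with its first derivative given, as in the table (3.60); divided differences over a pair of coincident nodes are filled with the derivative, following (3.59).

def hermite_newton_coeffs(x, f, fp):
    # x: distinct nodes, f: values, fp: first derivatives at the nodes.
    # Returns the doubled node list and the Newton coefficients f[z_0,...,z_j].
    z = [xi for xi in x for _ in (0, 1)]              # each node duplicated
    n = len(z)
    dd = [[0.0] * n for _ in range(n)]                # dd[i][j] = f[z_i, ..., z_{i+j}]
    for i in range(n):
        dd[i][0] = f[i // 2]
    for j in range(1, n):
        for i in range(n - j):
            if z[i + j] == z[i]:                      # coincident nodes: use (3.59)
                dd[i][j] = fp[i // 2]
            else:
                dd[i][j] = (dd[i + 1][j - 1] - dd[i][j - 1]) / (z[i + j] - z[i])
    return z, [dd[0][j] for j in range(n)]

# Check with f(x) = x^3 on the nodes 0, 1 (values 0, 1 and derivatives 0, 3):
# the coefficients should be [0, 0, 1, 1], i.e. p(x) = x^2 + x^2(x-1) = x^3.
print(hermite_newton_coeffs([0.0, 1.0], [0.0, 1.0], [0.0, 3.0])[1])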
3.7 Convergence of Polynomial Interpolation
From the Cauchy Remainder formula
(3.64) f(x) − p_n(x) = (1/(n+1)!) f^{(n+1)}(ξ(x)) (x − x_0)(x − x_1)···(x − x_n)
it is clear that the accuracy and convergence of the interpolation polynomial pn of f depends on both the smoothness of f and the distribution of nodes x0,x1,...,xn.
In the Runge example, the function
f(x) = 1/(1 + 25x²), x ∈ [−1, 1],
is very smooth. It has an infinite number of continuous derivatives, i.e. f ∈ C^∞[−1,1] (in fact f is real analytic on the whole real line, i.e. its Taylor series converges to f(x) for every x ∈ R). Nevertheless, for the equispaced nodes (3.13), p_n does not converge uniformly to f as n → ∞. In fact it diverges quite dramatically toward the end points of the interval. On the other hand, there is fast and uniform convergence of p_n to f when the Chebyshev nodes (3.14) are employed.
It is then natural to ask: given any f ∈ C[a, b], can we guarantee that ∥f − p_n∥_∞ → 0 if we choose the Chebyshev nodes? The answer is no. Bernstein and Faber proved in 1914 that, given any distribution of points, organized in
a triangular array (3.21), it is possible to construct a continuous function f for which its interpolating polynomial pn (corresponding to the nodes on the n-th row of (3.21)) will not converge uniformly to f as n → ∞. However, if f is slightly smoother, for example f ∈ C1[a,b], then for the Chebyshev array of nodes ∥f − pn∥∞ → 0.
If f(x) = e^{−x²}, there is convergence of p_n even with the equidistributed nodes. What is so special about this function? The function
(3.65) f(z) = e^{−z²}, z = x + iy,
is analytic in the entire complex plane. Using complex variable analysis it can be shown that if f is analytic in a sufficiently large region of the complex plane containing [a, b], then ∥f − p_n∥_∞ → 0. Just how large does the region of analyticity need to be? It depends on the asymptotic distribution of the nodes as n → ∞.
In the limit as n → ∞, we can think of the nodes as a continuum with a density ρ, so that for sufficiently large n,
(3.66) (n + 1) ∫_a^x ρ(t) dt
is the total number of nodes in [a, x]. Take for example [−1, 1]. For equispaced nodes ρ(x) = 1/2 and for the Chebyshev nodes ρ(x) = 1/(π√(1 − x²)).
It turns out that the relevant domain of analyticity is given in terms of the function
(3.67) φ(z) = − ∫_a^b ρ(t) ln |z − t| dt.
Let Γ_c be the level curve consisting of all the points z ∈ C such that φ(z) = c, for c constant. For very large and negative c, Γ_c approximates a large circle. As c is increased, Γ_c shrinks. We take the "smallest" level curve, Γ_{c_0}, which contains [a, b]. The relevant domain of analyticity is
(3.68) R_{c_0} = {z ∈ C : φ(z) ≥ c_0}.
Then, if f is analytic in R_{c_0}, ∥f − p_n∥_∞ → 0, not only in [a, b] but for every point in R_{c_0}. Moreover,
(3.69) |f(x) − p_n(x)| ≤ C e^{−N(φ(x) − c_0)},
for some constant C. That is, p_n converges exponentially fast to f. For the Chebyshev nodes R_{c_0} approximates [a, b], so if f is analytic in any region containing [a, b], however thin this region might be, p_n will converge uniformly to f. For equidistributed nodes, R_{c_0} looks like a football, with [a, b] as its longest axis. In the Runge example, the function is singular at z = ±i/5, which happens to be inside this football-like domain, and this explains the observed lack of convergence for this particular function.
The moral of the story is that polynomial interpolation using Chebyshev nodes converges very rapidly for smooth functions and thus yields very ac- curate approximations.
3.8 Piece-wise Linear Interpolation
One way to reduce the error in linear interpolation is to divide [a, b] into small subintervals [x_0, x_1], ..., [x_{n−1}, x_n]. In each of the subintervals [x_j, x_{j+1}] we approximate f by
(3.70) p(x) = f(x_j) + [(f(x_{j+1}) − f(x_j))/(x_{j+1} − x_j)] (x − x_j), x ∈ [x_j, x_{j+1}].
We know that
(3.71) f(x) − p(x) = (1/2) f''(ξ)(x − x_j)(x − x_{j+1}),
where ξ is some point between x_j and x_{j+1}. Suppose that |f''(x)| ≤ M_2 for all x ∈ [a, b]. Then
(3.72) |f(x) − p(x)| ≤ (1/2) M_2 max_{x_j ≤ x ≤ x_{j+1}} |(x − x_j)(x − x_{j+1})|.
Now the max on the right hand side is attained at the midpoint (x_j + x_{j+1})/2 and
(3.73) max_{x_j ≤ x ≤ x_{j+1}} |(x − x_j)(x − x_{j+1})| = ((x_{j+1} − x_j)/2)² = (1/4) h_j²,
where h_j = x_{j+1} − x_j. Therefore
(3.74) max_{x_j ≤ x ≤ x_{j+1}} |f(x) − p(x)| ≤ (1/8) M_2 h_j².
If we want this error to be smaller than a prescribed tolerance δ we can take sufficiently small subintervals. Namely, we can pick h_j such that (1/8) M_2 h_j² ≤ δ, which implies that
(3.75) h_j ≤ √(8δ/M_2).
3.9 Cubic Splines
Several applications require a smoother curve than that provided by a piece- wise linear approximation. Continuity of the first and second derivatives provide that required smoothness.
One of the most frequently used such approximations is the cubic spline, a piecewise cubic function s(x) which interpolates a set of points (x_0, f_0), (x_1, f_1), ..., (x_n, f_n) and has two continuous derivatives. In each subinterval [x_j, x_{j+1}], s(x) is a cubic polynomial, which we may represent as
(3.76) s_j(x) = A_j(x − x_j)³ + B_j(x − x_j)² + C_j(x − x_j) + D_j.
Let
(3.77) h_j = x_{j+1} − x_j.
The spline s(x) interpolates the given data:
(3.78) s_j(x_j) = f_j = D_j,
(3.79) s_j(x_{j+1}) = A_j h_j³ + B_j h_j² + C_j h_j + D_j = f_{j+1}.
Now s'_j(x) = 3A_j(x − x_j)² + 2B_j(x − x_j) + C_j and s''_j(x) = 6A_j(x − x_j) + 2B_j. Therefore
(3.80) s'_j(x_j) = C_j,
(3.81) s'_j(x_{j+1}) = 3A_j h_j² + 2B_j h_j + C_j,
(3.82) s''_j(x_j) = 2B_j,
(3.83) s''_j(x_{j+1}) = 6A_j h_j + 2B_j.
We are going to write the spline coefficients A_j, B_j, C_j, and D_j in terms of f_j, f_{j+1} and the unknown values z_j = s''_j(x_j) and z_{j+1} = s''_j(x_{j+1}). We have
D_j = f_j,
B_j = (1/2) z_j,
6A_j h_j + 2B_j = z_{j+1}  ⇒  A_j = (z_{j+1} − z_j)/(6h_j),
and substituting these values in (3.79) we get
C_j = (f_{j+1} − f_j)/h_j − (1/6) h_j (z_{j+1} + 2z_j).
Let us collect all our formulas for the spline coefficients:
(3.84) A_j = (z_{j+1} − z_j)/(6h_j),
(3.85) B_j = (1/2) z_j,
(3.86) C_j = (f_{j+1} − f_j)/h_j − (1/6) h_j (z_{j+1} + 2z_j),
(3.87) D_j = f_j.
Note that the second derivative of s is continuous, s''_j(x_{j+1}) = z_{j+1} = s''_{j+1}(x_{j+1}), and by construction s interpolates the given data. We are now going to use the condition of continuity of the first derivative of s to determine equations for the unknown values z_j, j = 1, 2, ..., n − 1:
s'_j(x_{j+1}) = 3A_j h_j² + 2B_j h_j + C_j
             = 3 [(z_{j+1} − z_j)/(6h_j)] h_j² + 2 (1/2) z_j h_j + (f_{j+1} − f_j)/h_j − (1/6) h_j (z_{j+1} + 2z_j)
             = (f_{j+1} − f_j)/h_j + (1/6) h_j (2z_{j+1} + z_j).
Decreasing the index by 1 we get
(3.88) s'_{j−1}(x_j) = (f_j − f_{j−1})/h_{j−1} + (1/6) h_{j−1} (2z_j + z_{j−1}).
Continuity of the first derivative at an interior node means s'_{j−1}(x_j) = s'_j(x_j)
for j = 1, 2, ..., n − 1. Therefore
(f_j − f_{j−1})/h_{j−1} + (1/6) h_{j−1}(2z_j + z_{j−1}) = C_j = (f_{j+1} − f_j)/h_j − (1/6) h_j (z_{j+1} + 2z_j),
which can be written as
(3.89) h_{j−1} z_{j−1} + 2(h_{j−1} + h_j) z_j + h_j z_{j+1} = −(6/h_{j−1})(f_j − f_{j−1}) + (6/h_j)(f_{j+1} − f_j), j = 1, ..., n − 1.
This is a linear system of n − 1 equations for the n − 1 unknowns z_1, z_2, ..., z_{n−1}. In matrix form
(3.90)
\begin{bmatrix}
2(h_0+h_1) & h_1 & & \cdots & 0 \\
h_1 & 2(h_1+h_2) & h_2 & & \vdots \\
 & \ddots & \ddots & \ddots & \\
0 & \cdots & & h_{n-2} & 2(h_{n-2}+h_{n-1})
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1} \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n-1} \end{bmatrix},
where
(3.91)
d_1 = −(6/h_0)(f_1 − f_0) + (6/h_1)(f_2 − f_1) − h_0 z_0,
d_2 = −(6/h_1)(f_2 − f_1) + (6/h_2)(f_3 − f_2),
⋮
d_{n−2} = −(6/h_{n−3})(f_{n−2} − f_{n−3}) + (6/h_{n−2})(f_{n−1} − f_{n−2}),
d_{n−1} = −(6/h_{n−2})(f_{n−1} − f_{n−2}) + (6/h_{n−1})(f_n − f_{n−1}) − h_{n−1} z_n.
Note that z_0 and z_n are unspecified. The values z_0 = z_n = 0 define what is called a natural spline. The matrix of the linear system (3.90) is strictly diagonally dominant, a concept we make precise in the definition below. A consequence of this property, as we will see shortly, is that the matrix is nonsingular, and therefore there is a unique solution for z_1, z_2, ..., z_{n−1}, corresponding to the second derivative values of the spline at the interior nodes.
Definition 6. An n × n matrix A with entries a_{ij}, i, j = 1, ..., n, is strictly diagonally dominant if
(3.92) |a_{ii}| > Σ_{j=1, j≠i}^{n} |a_{ij}|, for i = 1, ..., n.
Theorem 11. Let A be a strictly diagonally dominant matrix. Then A is
nonsingular.
Proof. Suppose the contrary, that is, there is x ≠ 0 such that Ax = 0. Let k be an index such that |x_k| = ∥x∥_∞. Then the k-th equation in Ax = 0 gives
(3.93) a_{kk} x_k + Σ_{j=1, j≠k}^{n} a_{kj} x_j = 0
and consequently
(3.94) |a_{kk}| |x_k| ≤ Σ_{j=1, j≠k}^{n} |a_{kj}| |x_j|.
Dividing by |x_k|, which by assumption is nonzero, and using that |x_j|/|x_k| ≤ 1 for all j = 1, ..., n, we get
(3.95) |a_{kk}| ≤ Σ_{j=1, j≠k}^{n} |a_{kj}|,
which contradicts the fact that A is strictly diagonally dominant.
If the nodes are equidistributed, x_j = x_0 + jh for j = 0, 1, ..., n, then the linear system (3.90), after dividing by h, simplifies to
(3.96)
\begin{bmatrix}
4 & 1 & & \cdots & 0 \\
1 & 4 & 1 & & \vdots \\
 & \ddots & \ddots & \ddots & \\
0 & \cdots & & 1 & 4
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1} \end{bmatrix}
=
\begin{bmatrix}
(6/h^2)(f_0 - 2f_1 + f_2) - z_0 \\
(6/h^2)(f_1 - 2f_2 + f_3) \\
\vdots \\
(6/h^2)(f_{n-2} - 2f_{n-1} + f_n) - z_n
\end{bmatrix}.
Once the z1,z2,…,zn−1 are found the spline coefficients can be computed from (3.84)-(3.87).
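The whole construction can be summarized in a short program. The Python sketch below uses illustrative names and a dense solver for brevity (the dedicated tridiagonal algorithm of Section 3.9.1 is what one would use in practice); it assembles the natural-spline system (3.90) and returns the coefficients (3.84)-(3.87).

import numpy as np

def natural_spline_coeffs(x, f):
    # x, f: nodes and data of length n+1.  Returns per-interval coefficients
    # A_j, B_j, C_j, D_j of (3.76) for the natural spline (z_0 = z_n = 0).
    x, f = np.asarray(x, float), np.asarray(f, float)
    n = len(x) - 1
    h = np.diff(x)
    z = np.zeros(n + 1)
    if n > 1:
        # Tridiagonal system (3.90) for the interior second derivatives z_1,...,z_{n-1}.
        T = np.diag(2*(h[:-1] + h[1:])) + np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1)
        d = 6*(f[2:] - f[1:-1])/h[1:] - 6*(f[1:-1] - f[:-2])/h[:-1]
        z[1:-1] = np.linalg.solve(T, d)
    A = (z[1:] - z[:-1]) / (6*h)
    B = z[:-1] / 2
    C = (f[1:] - f[:-1])/h - h*(z[1:] + 2*z[:-1])/6
    D = f[:-1]
    return A, B, C, D

# The data of Example 9 below give A_0 = 7/2, B_0 = 0, C_0 = -5/2, D_0 = 0, etc.
print(natural_spline_coeffs([0, 1, 2], [0, 1, 16]))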
Example 9. Find the natural cubic spline that interpolates (0, 0), (1, 1), (2, 16). We know z_0 = 0 and z_2 = 0, so we only need to find z_1 (there is only one interior node). The system (3.89) degenerates to just one equation and h = 1, thus
z_0 + 4z_1 + z_2 = 6[f_0 − 2f_1 + f_2]  ⇒  z_1 = 21.
In [0, 1] we have
A_0 = (1/(6h))(z_1 − z_0) = (1/6)·21 = 7/2,
B_0 = (1/2) z_0 = 0,
C_0 = (1/h)(f_1 − f_0) − (1/6) h (z_1 + 2z_0) = 1 − 21/6 = −5/2,
D_0 = f_0 = 0.
Thus, s_0(x) = A_0(x − 0)³ + B_0(x − 0)² + C_0(x − 0) + D_0 = (7/2)x³ − (5/2)x. Now in [1, 2]
A_1 = (1/(6h))(z_2 − z_1) = (1/6)(−21) = −7/2,
B_1 = (1/2) z_1 = 21/2,
C_1 = (1/h)(f_2 − f_1) − (1/6) h (z_2 + 2z_1) = 16 − 1 − (1/6)(2·21) = 8,
D_1 = f_1 = 1,
and s_1(x) = −(7/2)(x − 1)³ + (21/2)(x − 1)² + 8(x − 1) + 1. Therefore the spline is given by
s(x) = (7/2)x³ − (5/2)x for x ∈ [0, 1], and s(x) = −(7/2)(x − 1)³ + (21/2)(x − 1)² + 8(x − 1) + 1 for x ∈ [1, 2].

3.9.1 Solving the Tridiagonal System
The matrix of coefficients of the linear system (3.90) has the tridiagonal form
(3.97)
A = \begin{bmatrix}
a_1 & b_1 & & & \\
c_1 & a_2 & b_2 & & \\
 & c_2 & \ddots & \ddots & \\
 & & \ddots & \ddots & b_{N-1} \\
 & & & c_{N-1} & a_N
\end{bmatrix},
where for the natural-spline (n − 1) × (n − 1) system (3.90) the nonzero tridiagonal entries are
(3.98) a_j = 2(h_{j−1} + h_j), j = 1, 2, ..., n − 1,
(3.99) b_j = h_j, j = 1, 2, ..., n − 2,
(3.100) c_j = h_j, j = 1, 2, ..., n − 2.
We can solve the corresponding linear system of equations Ax = d by factoring the matrix A into the product of a lower triangular matrix L and an upper triangular matrix U. To illustrate the idea let us take N = 5. A 5 × 5 tridiagonal linear system has the form
(3.101)
\begin{bmatrix}
a_1 & b_1 & 0 & 0 & 0 \\
c_1 & a_2 & b_2 & 0 & 0 \\
0 & c_2 & a_3 & b_3 & 0 \\
0 & 0 & c_3 & a_4 & b_4 \\
0 & 0 & 0 & c_4 & a_5
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{bmatrix}
and we seek a factorization of the form
\begin{bmatrix}
a_1 & b_1 & 0 & 0 & 0 \\
c_1 & a_2 & b_2 & 0 & 0 \\
0 & c_2 & a_3 & b_3 & 0 \\
0 & 0 & c_3 & a_4 & b_4 \\
0 & 0 & 0 & c_4 & a_5
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
l_1 & 1 & 0 & 0 & 0 \\
0 & l_2 & 1 & 0 & 0 \\
0 & 0 & l_3 & 1 & 0 \\
0 & 0 & 0 & l_4 & 1
\end{bmatrix}
\begin{bmatrix}
m_1 & u_1 & 0 & 0 & 0 \\
0 & m_2 & u_2 & 0 & 0 \\
0 & 0 & m_3 & u_3 & 0 \\
0 & 0 & 0 & m_4 & u_4 \\
0 & 0 & 0 & 0 & m_5
\end{bmatrix}.
Note that the first matrix on the right hand side is lower triangular and the second one is upper triangular. Performing the product of the matrices and comparing with the corresponding entries of the left hand side matrix we have
1st row: a_1 = m_1, b_1 = u_1,
2nd row: c_1 = m_1 l_1, a_2 = l_1 u_1 + m_2, b_2 = u_2,
3rd row: c_2 = m_2 l_2, a_3 = l_2 u_2 + m_3, b_3 = u_3,
4th row: c_3 = m_3 l_3, a_4 = l_3 u_3 + m_4, b_4 = u_4,
5th row: c_4 = m_4 l_4, a_5 = l_4 u_4 + m_5.
So we can determine the unknowns in the following order: m_1; u_1, l_1, m_2; u_2, l_2, m_3; ...; u_4, l_4, m_5.
Since u_j = b_j for all j, we can write down the algorithm for general N as

% Determine the factorization coefficients
m_1 = a_1
for j = 1 : N − 1
    l_j = c_j / m_j
    m_{j+1} = a_{j+1} − l_j ∗ b_j
end
% Forward substitution on Ly = d
y_1 = d_1
for j = 2 : N
    y_j = d_j − l_{j−1} ∗ y_{j−1}
end
% Backward substitution to solve Ux = y
x_N = y_N / m_N
for j = N − 1 : 1
    x_j = (y_j − b_j ∗ x_{j+1}) / m_j
end
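In Python the same algorithm might look as follows; this is a direct transcription of the pseudocode above (no pivoting, which is safe here because the spline matrix is strictly diagonally dominant), with a, b, c denoting the diagonal, superdiagonal and subdiagonal. The function name is an illustrative choice.

def solve_tridiagonal(a, b, c, d):
    # Solve A x = d, where A has diagonal a[0..N-1], superdiagonal b[0..N-2]
    # and subdiagonal c[0..N-2], using the LU factorization described above.
    N = len(a)
    m = [0.0]*N          # diagonal of U
    l = [0.0]*(N - 1)    # subdiagonal of L
    m[0] = a[0]
    for j in range(N - 1):                 # factorization
        l[j] = c[j] / m[j]
        m[j + 1] = a[j + 1] - l[j]*b[j]
    y = [0.0]*N
    y[0] = d[0]
    for j in range(1, N):                  # forward substitution, L y = d
        y[j] = d[j] - l[j - 1]*y[j - 1]
    x = [0.0]*N
    x[N - 1] = y[N - 1] / m[N - 1]
    for j in range(N - 2, -1, -1):         # backward substitution, U x = y
        x[j] = (y[j] - b[j]*x[j + 1]) / m[j]
    return x

# Small check: [[4, 1], [1, 4]] x = [6, 6] has the solution x = [1.2, 1.2].
print(solve_tridiagonal([4, 4], [1], [1], [6, 6]))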
3.9.2 Complete Splines
Sometimes it is more appropriate to specify the first derivative at the end points instead of the second derivative. This is called a complete or clamped spline. In this case z_0 and z_n become unknowns together with z_1, z_2, ..., z_{n−1}.
We need to add two more equations to have a complete system for all the n + 1 unknown values z_0, z_1, ..., z_n of a complete spline. Recall that
s_j(x) = A_j(x − x_j)³ + B_j(x − x_j)² + C_j(x − x_j) + D_j
and so
s'_j(x) = 3A_j(x − x_j)² + 2B_j(x − x_j) + C_j.
Therefore
(3.102) s'_0(x_0) = C_0 = f'_0,
(3.103) s'_{n−1}(x_n) = 3A_{n−1} h²_{n−1} + 2B_{n−1} h_{n−1} + C_{n−1} = f'_n.
Substituting C_0, A_{n−1}, B_{n−1}, and C_{n−1} from (3.84)-(3.86) we get
(3.104) 2h_0 z_0 + h_0 z_1 = (6/h_0)(f_1 − f_0) − 6f'_0,
(3.105) h_{n−1} z_{n−1} + 2h_{n−1} z_n = −(6/h_{n−1})(f_n − f_{n−1}) + 6f'_n.
These two equations together with (3.89) uniquely determine the second derivative values at all the nodes. The resulting (n + 1) × (n + 1) linear system is also tridiagonal and diagonally dominant (hence nonsingular). Once the values z_0, z_1, ..., z_n are found, the spline coefficients are obtained from (3.84)-(3.87).
3.9.3 Parametric Curves
In computer graphics and animation it is often required to construct smooth curves that are not necessarily the graph of a function but that have a para- metric representation x = x(t) and y = y(t) for t ∈ [a, b]. Hence one needs to determine two splines interpolating (tj,xj) and (tj,yj) (j = 0,1,…n), respectively.
The arc length of the curve is a natural choice for the parameter t. However, it is not known a priori, and instead the nodes t_j are usually chosen from the distances between consecutive, judiciously chosen points:
(3.106) t_0 = 0, t_j = t_{j−1} + √((x_j − x_{j−1})² + (y_j − y_{j−1})²), j = 1, 2, ..., n.
Chapter 4
Trigonometric Approximation
We will study approximations employing truncated Fourier series. This type of approximation finds multiple applications in digital signal and image pro- cessing, and in the construction of highly accurate approximations to the solution of some partial differential equations. We will look at the problem of best approximation in the convenient L2 norm and then consider the more practical approximation using interpolation. The latter will put us in the discrete framework to introduce the leading star of this chapter, the Discrete Fourier Transform, and one of the top ten algorithms of all time, the Fast Fourier Transform, to compute it.
4.1 Approximating a Periodic Function
We begin with the problem of approximating a periodic function f at the continuum level. Without loss of generality, we can assume that f has period 2π (if it has period p, then the function F(y) = f(py/(2π)) has period 2π). If f is a smooth periodic function we can approximate it with the first n terms of its Fourier series:
(4.1) f(x) ≈ (1/2) a_0 + Σ_{k=1}^{n} (a_k cos kx + b_k sin kx),
where
(4.2) a_k = (1/π) ∫_0^{2π} f(x) cos kx dx, k = 0, 1, ..., n,
(4.3) b_k = (1/π) ∫_0^{2π} f(x) sin kx dx, k = 1, 2, ..., n.
We will show that this is the best approximation to f, in the L2 norm, by a trigonometric polynomial of degree n (the right hand side of (4.1)). For convenience, we write a trigonometric polynomial S_n (of degree n) in complex form (see (1.45)-(1.48)) as
(4.4) S_n(x) = Σ_{k=−n}^{n} c_k e^{ikx}.
Consider the square of the error
(4.5) J_n = ∥f − S_n∥²_2 = ∫_0^{2π} [f(x) − S_n(x)]² dx.
Let us try to find the coefficients c_k (k = 0, ±1, ..., ±n) in (4.4) that minimize J_n. We have
(4.6) J_n = ∫_0^{2π} [f(x) − Σ_{k=−n}^{n} c_k e^{ikx}]² dx
          = ∫_0^{2π} [f(x)]² dx − 2 Σ_{k=−n}^{n} c_k ∫_0^{2π} f(x) e^{ikx} dx + Σ_{k=−n}^{n} Σ_{l=−n}^{n} c_k c_l ∫_0^{2π} e^{ikx} e^{ilx} dx.
This problem simplifies if we use the orthogonality of the set {1, e^{ix}, e^{−ix}, ..., e^{inx}, e^{−inx}}: for k ≠ −l
(4.7) ∫_0^{2π} e^{ikx} e^{ilx} dx = ∫_0^{2π} e^{i(k+l)x} dx = [ (1/(i(k+l))) e^{i(k+l)x} ]_0^{2π} = 0,
and for k = −l
(4.8) ∫_0^{2π} e^{ikx} e^{ilx} dx = ∫_0^{2π} dx = 2π.
Thus, we get
(4.9) J_n = ∫_0^{2π} [f(x)]² dx − 2 Σ_{k=−n}^{n} c_k ∫_0^{2π} f(x) e^{ikx} dx + 2π Σ_{k=−n}^{n} c_k c_{−k}.
J_n is a quadratic function of the coefficients c_k, so to find its minimum we determine the critical point of J_n as a function of the c_k's:
(4.10) ∂J_n/∂c_m = −2 ∫_0^{2π} f(x) e^{imx} dx + 2(2π) c_{−m} = 0, m = 0, ±1, ..., ±n.
Therefore, relabeling the coefficients with k again, we get
(4.11) c_k = (1/2π) ∫_0^{2π} f(x) e^{−ikx} dx, k = 0, ±1, ..., ±n,
which are the complex Fourier coefficients of f. The real Fourier coefficients (4.2)-(4.3) follow from Euler's formula e^{ikx} = cos kx + i sin kx, which produces the relation
(4.12) c_0 = (1/2) a_0, c_k = (1/2)(a_k − i b_k), c_{−k} = (1/2)(a_k + i b_k), k = 1, ..., n.
This concludes the proof of the claim that the best L2 approximation to f by trigonometric polynomials of degree n is furnished by the Fourier series of f truncated at wave number k = n.
Now, if we substitute the Fourier coefficients (4.11) in (4.9) we get
0 ≤ J_n = ∫_0^{2π} [f(x)]² dx − 2π Σ_{k=−n}^{n} |c_k|²,
that is,
(4.13) Σ_{k=−n}^{n} |c_k|² ≤ (1/2π) ∫_0^{2π} [f(x)]² dx.
This is known as Bessel's inequality. In terms of the real Fourier coefficients, Bessel's inequality becomes
(4.14) (1/2) a_0² + Σ_{k=1}^{n} (a_k² + b_k²) ≤ (1/π) ∫_0^{2π} [f(x)]² dx.
If ∫_0^{2π} [f(x)]² dx is finite, then the series
(1/2) a_0² + Σ_{k=1}^{∞} (a_k² + b_k²)
converges and consequently lim_{k→∞} a_k = lim_{k→∞} b_k = 0.
The convergence of a Fourier series is a delicate question. For a continuous function f, it does not follow that its Fourier series converges point-wise to it, only that it does so in the mean:
(4.15) lim_{n→∞} ∫_0^{2π} [f(x) − S_n(x)]² dx = 0.
This convergence in the mean for a continuous periodic function implies that Bessel's inequality becomes the equality
(4.16) (1/2) a_0² + Σ_{k=1}^{∞} (a_k² + b_k²) = (1/π) ∫_0^{2π} [f(x)]² dx,
which is known as Parseval's identity. We state now a convergence result without proof.
Theorem 12. Suppose that f is piecewise continuous and periodic in [0, 2π] with a piecewise continuous first derivative. Then
(1/2) a_0 + Σ_{k=1}^{∞} (a_k cos kx + b_k sin kx) = (1/2)[f^+(x) + f^−(x)]
for each x ∈ [0, 2π], where a_k and b_k are the Fourier coefficients of f. Here
(4.17) lim_{h→0^+} f(x + h) = f^+(x),
(4.18) lim_{h→0^+} f(x − h) = f^−(x).
In particular, if x is a point of continuity of f,
(1/2) a_0 + Σ_{k=1}^{∞} (a_k cos kx + b_k sin kx) = f(x).
We have been working on the interval [0, 2π] but we can choose any other interval of length 2π. This is so because if g is 2π-periodic then we have the following result.
Lemma 2.
(4.19) ∫_0^{2π} g(x) dx = ∫_t^{t+2π} g(x) dx
for any real t.
Proof. Define
(4.20) G(t) = ∫_t^{t+2π} g(x) dx.
Then
(4.21) G(t) = ∫_t^0 g(x) dx + ∫_0^{t+2π} g(x) dx = ∫_0^{t+2π} g(x) dx − ∫_0^t g(x) dx.
By the Fundamental Theorem of Calculus,
G'(t) = g(t + 2π) − g(t) = 0
since g is 2π-periodic. Thus, G is independent of t.

Example 10. f(x) = |x| on [−π, π].
a_k = (1/π) ∫_{−π}^{π} f(x) cos kx dx = (1/π) ∫_{−π}^{π} |x| cos kx dx = (2/π) ∫_0^{π} x cos kx dx.
Integrating by parts with u = x, dv = cos kx dx (so du = dx, v = (1/k) sin kx),
a_k = (2/π) [ (x/k) sin kx ]_0^{π} − (2/(πk)) ∫_0^{π} sin kx dx = (2/(πk²)) [cos kx]_0^{π} = (2/(πk²)) [(−1)^k − 1], k = 1, 2, ...,
and a_0 = π, b_k = 0 for all k as the function is even. Therefore
S(x) = lim_{n→∞} S_n(x) = (1/2)π + Σ_{k=1}^{∞} (2/(πk²)) [(−1)^k − 1] cos kx
     = (1/2)π − (4/π) [ cos x / 1² + cos 3x / 3² + cos 5x / 5² + ··· ].
How do we find accurate numerical approximations to the Fourier coefficients? We know that for periodic smooth integrands, integrated over one (or multiple) period(s), the Composite Trapezoidal Rule with equidistributed nodes gives spectral accuracy. Let x_j = jh, j = 0, 1, ..., N, h = 2π/N. Then we can approximate
c_k ≈ h (1/2π) [ (1/2) f(x_0) e^{−ikx_0} + Σ_{j=1}^{N−1} f(x_j) e^{−ikx_j} + (1/2) f(x_N) e^{−ikx_N} ].
But due to periodicity f(x_0) e^{−ikx_0} = f(x_N) e^{−ikx_N}, so we have
(4.22) c_k ≈ (1/N) Σ_{j=1}^{N} f(x_j) e^{−ikx_j} = (1/N) Σ_{j=0}^{N−1} f(x_j) e^{−ikx_j},
and for the real Fourier coefficients we have
(4.23) a_k ≈ (2/N) Σ_{j=1}^{N} f(x_j) cos kx_j = (2/N) Σ_{j=0}^{N−1} f(x_j) cos kx_j,
(4.24) b_k ≈ (2/N) Σ_{j=1}^{N} f(x_j) sin kx_j = (2/N) Σ_{j=0}^{N−1} f(x_j) sin kx_j.

4.2 Interpolating Fourier Polynomial
Let f be a 2π-periodic function and let x_j = j(2π/N), j = 0, 1, ..., N, be equidistributed nodes in [0, 2π]. The interpolation problem is to find a trigonometric polynomial S_n of lowest order such that S_n(x_j) = f(x_j) for j = 0, 1, ..., N. Because of periodicity, f(x_0) = f(x_N), so we only have N independent values.
If we take N = 2n then S_n has 2n + 1 = N + 1 coefficients, so we need one more condition. At the equidistributed nodes, the sine term of highest frequency vanishes, since sin((N/2) x_j) = sin(jπ) = 0, so the coefficient b_{N/2} is irrelevant for interpolation. We thus look for an interpolating trigonometric polynomial of the form
(4.25) P_N(x) = (1/2) a_0 + Σ_{k=1}^{N/2−1} (a_k cos kx + b_k sin kx) + (1/2) a_{N/2} cos((N/2) x).
The convenience of the 1/2 factor in the last term will be seen in the formulas for the coefficients below. It is conceptually and computationally simpler to work with the corresponding polynomial in complex form
(4.26) P_N(x) = Σ''_{k=−N/2}^{N/2} c_k e^{ikx},
where the double prime in the sum means that the first and last terms (for k = −N/2 and k = N/2) carry a factor of 1/2. It is also understood that c_{−N/2} = c_{N/2}, which is equivalent to the b_{N/2} = 0 condition in (4.25).
Theorem 13. The trigonometric polynomial
P_N(x) = Σ''_{k=−N/2}^{N/2} c_k e^{ikx}
with
c_k = (1/N) Σ_{j=0}^{N−1} f(x_j) e^{−ikx_j}, k = −N/2, ..., N/2,
interpolates f at the equidistributed points x_j = j(2π/N), j = 0, 1, ..., N.

Proof. We have
P_N(x) = Σ''_{k=−N/2}^{N/2} c_k e^{ikx} = Σ_{j=0}^{N−1} f(x_j) · (1/N) Σ''_{k=−N/2}^{N/2} e^{ik(x − x_j)}.
Defining
(4.27) l_j(x) = (1/N) Σ''_{k=−N/2}^{N/2} e^{ik(x − x_j)},
we have
(4.28) P_N(x) = Σ_{j=0}^{N−1} f(x_j) l_j(x).
Thus, we only need to prove that for j and m in the range 0, ..., N − 1
(4.29) l_j(x_m) = 1 for m = j, and l_j(x_m) = 0 for m ≠ j.
We have
l_j(x_m) = (1/N) Σ''_{k=−N/2}^{N/2} e^{ik(m−j)2π/N}.
But e^{i(±N/2)(m−j)2π/N} = e^{±i(m−j)π} = (−1)^{m−j}, so we can combine the first and the last terms and remove the prime from the sum:
l_j(x_m) = (1/N) Σ_{k=−N/2}^{N/2−1} e^{ik(m−j)2π/N}
         = (1/N) Σ_{k=−N/2}^{N/2−1} e^{i(k+N/2)(m−j)2π/N} e^{−i(N/2)(m−j)2π/N}
         = e^{−i(m−j)π} (1/N) Σ_{k=0}^{N−1} e^{ik(m−j)2π/N},
and, as we proved in the introduction,
(4.30) (1/N) Σ_{k=0}^{N−1} e^{−ik(j−m)2π/N} = 0 if j − m is not divisible by N, and 1 otherwise.
Then (4.29) follows and
P_N(x_m) = f(x_m), m = 0, 1, ..., N − 1.
Using the relations (4.12) between the c_k and the a_k, b_k coefficients we find that
P_N(x) = (1/2) a_0 + Σ_{k=1}^{N/2−1} (a_k cos kx + b_k sin kx) + (1/2) a_{N/2} cos((N/2) x)
interpolates f at the equidistributed nodes x_j = j(2π/N), j = 0, 1, ..., N, if and only if
(4.31) a_k = (2/N) Σ_{j=1}^{N} f(x_j) cos kx_j,
(4.32) b_k = (2/N) Σ_{j=1}^{N} f(x_j) sin kx_j.
A smooth periodic function f can be approximated very accurately by its Fourier interpolant P_N. Note that derivatives of P_N can be easily computed:
(4.33) P_N^{(p)}(x) = Σ''_{k=−N/2}^{N/2} (ik)^p c_k e^{ikx}.
The discrete Fourier coefficients of the p-th derivative of P_N are (ik)^p c_k. Thus, once these Fourier coefficients have been computed, a very accurate approximation of the derivatives of f is obtained, f^{(p)}(x) ≈ P_N^{(p)}(x). Figure 4.1 shows the approximation of f(x) = sin(x) e^{cos x} on [0, 2π] by P_8. The graph of f and of the Fourier interpolant are almost indistinguishable.
Let us go back to the complex Fourier interpolant (4.26). Its coefficients c_k are periodic with period N,
(4.34) c_{k+N} = (1/N) Σ_{j=0}^{N−1} f_j e^{−i(k+N)x_j} = (1/N) Σ_{j=0}^{N−1} f_j e^{−ikx_j} e^{−ij2π} = c_k,
and in particular c_{−N/2} = c_{N/2}. Using the interpolation property and setting f_j = f(x_j), we have
(4.35) f_j = Σ''_{k=−N/2}^{N/2} c_k e^{ikx_j} = Σ_{k=−N/2}^{N/2−1} c_k e^{ikx_j}
[Figure 4.1: S_8(x) for f(x) = sin(x) e^{cos x} on [0, 2π].]

and
(4.36) Σ_{k=−N/2}^{N/2−1} c_k e^{ikx_j} = Σ_{k=−N/2}^{−1} c_k e^{ikx_j} + Σ_{k=0}^{N/2−1} c_k e^{ikx_j}
                                       = Σ_{k=N/2}^{N−1} c_k e^{ikx_j} + Σ_{k=0}^{N/2−1} c_k e^{ikx_j} = Σ_{k=0}^{N−1} c_k e^{ikx_j},
where we have used that c_{k+N} = c_k. Combining this with the formula for the c_k's we get the Discrete Fourier Transform (DFT) pair
(4.37) c_k = (1/N) Σ_{j=0}^{N−1} f_j e^{−ikx_j}, k = 0, ..., N − 1,
(4.38) f_j = Σ_{k=0}^{N−1} c_k e^{ikx_j}, j = 0, ..., N − 1.
The set of discrete coefficients (4.37) is known as the DFT of the periodic array f_0, f_1, ..., f_{N−1}, and (4.38) is referred to as the Inverse DFT.
The direct evaluation of the DFT is computationally expensive; it requires order N² operations. However, there is a remarkable algorithm which achieves this in merely order N log N operations. This is known as the Fast Fourier Transform.
4.3 The Fast Fourier Transform
The DFT was defined as
(4.39) c_k = (1/N) Σ_{j=0}^{N−1} f_j e^{−ikx_j}, k = 0, ..., N − 1,
(4.40) f_j = Σ_{k=0}^{N−1} c_k e^{ikx_j}, j = 0, ..., N − 1.
The direct computation of either (4.39) or (4.40) requires order N² operations. As N increases the cost quickly becomes prohibitive. In many applications N could easily be on the order of thousands or millions.
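A direct implementation of (4.39) makes the O(N²) cost evident. The Python sketch below is for illustration only; note that numpy's fft routine uses the unnormalized sum, so the coefficients (4.39) correspond to numpy.fft.fft(f)/N.

import numpy as np

def dft_direct(f):
    # Direct evaluation of (4.39): c_k = (1/N) sum_j f_j exp(-i k x_j), x_j = 2*pi*j/N.
    # This costs O(N^2) operations.
    f = np.asarray(f, dtype=complex)
    N = len(f)
    j = np.arange(N)
    c = np.array([np.sum(f * np.exp(-1j * k * 2*np.pi*j/N)) for k in range(N)]) / N
    return c

f = np.random.rand(64)
print(np.allclose(dft_direct(f), np.fft.fft(f) / len(f)))   # True, up to round-off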
One of the top algorithms of all time is the Fast Fourier Transform (FFT). It is usually attributed to Cooley and Tukey (1965), but its origin can be traced back to C. F. Gauss (1777-1855). We now look at the main ideas of this famous and widely used algorithm.
Let us define d_k = N c_k for k = 0, 1, ..., N − 1. Then we can rewrite (4.39) as
(4.41) d_k = Σ_{j=0}^{N−1} f_j ω_N^{kj},
where ω_N = e^{−i2π/N}. In matrix form
(4.42)
\begin{bmatrix} d_0 \\ d_1 \\ \vdots \\ d_{N-1} \end{bmatrix}
=
\begin{bmatrix}
\omega_N^0 & \omega_N^0 & \cdots & \omega_N^0 \\
\omega_N^0 & \omega_N^1 & \cdots & \omega_N^{N-1} \\
\vdots & \vdots & & \vdots \\
\omega_N^0 & \omega_N^{N-1} & \cdots & \omega_N^{(N-1)^2}
\end{bmatrix}
\begin{bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{N-1} \end{bmatrix}.
Let us call the matrix on the right F_N. Then F_N is N times the matrix of the DFT, and the matrix of the Inverse DFT is simply the complex conjugate of F_N. This follows from the identities:
1 + ω_N + ω_N² + ··· + ω_N^{N−1} = 0,
1 + ω_N² + ω_N⁴ + ··· + ω_N^{2(N−1)} = 0,
⋮
1 + ω_N^{N−1} + ω_N^{2(N−1)} + ··· + ω_N^{(N−1)²} = 0,
1 + ω_N^N + ω_N^{2N} + ··· + ω_N^{N(N−1)} = N.
We already proved the first of these identities: the geometric sum equals (1 − ω_N^N)/(1 − ω_N), which is zero because ω_N^N = 1. The others are proved similarly. We can summarize these identities as
(4.43) Σ_{k=0}^{N−1} ω_N^{jk} = 0 if j ≠ 0 (mod N), and = N if j = 0 (mod N),
where j = k (mod N) means j − k is an integer multiple of N. Then
(4.44) (1/N)(F_N F̄_N)_{jl} = (1/N) Σ_{k=0}^{N−1} ω_N^{jk} ω_N^{−lk} = (1/N) Σ_{k=0}^{N−1} ω_N^{(j−l)k} = 0 if j ≠ l (mod N), and = 1 if j = l (mod N),
which shows that F̄_N is the inverse of (1/N) F_N.
Let N = 2n. Going back to (4.41), if we split the even-numbered and the odd-numbered points we have
(4.45) d_k = Σ_{j=0}^{n−1} f_{2j} ω_N^{2jk} + Σ_{j=0}^{n−1} f_{2j+1} ω_N^{(2j+1)k}.
But
(4.46) ω_N^{2jk} = e^{−i2jk(2π/N)} = e^{−ijk(2π/n)} = ω_n^{jk},
(4.47) ω_N^{(2j+1)k} = e^{−i(2j+1)k(2π/N)} = e^{−ik(2π/N)} e^{−i2jk(2π/N)} = ω_N^k ω_n^{jk}.
So denoting f_j^e = f_{2j} and f_j^o = f_{2j+1}, we get
(4.48) d_k = Σ_{j=0}^{n−1} f_j^e ω_n^{jk} + ω_N^k Σ_{j=0}^{n−1} f_j^o ω_n^{jk}.
We have reduced the problem to two DFTs of size n = N/2 plus N multiplications (and N sums). The numbers ω_N^k, k = 0, 1, ..., N − 1, only depend on N, so they can be precomputed once and stored for other DFTs of the same size N.
If N = 2^p, for a positive integer p, we can repeat the process to reduce each of the DFTs of size n to a pair of DFTs of size n/2 plus n multiplications (and n additions), etc. We can do this p times so that we end up with 1-point DFTs, which require no multiplications!
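A bare-bones recursive sketch of this even/odd splitting, assuming N is a power of two and making no attempt at optimization, might look as follows in Python; it returns the unnormalized transform d_k = N c_k, matching numpy's convention.

import numpy as np

def fft_radix2(f):
    # Compute d_k = sum_j f_j * omega_N^{jk}, omega_N = exp(-2*pi*i/N), for N = 2^p,
    # using the even/odd splitting (4.48).
    f = np.asarray(f, dtype=complex)
    N = len(f)
    if N == 1:
        return f
    d_even = fft_radix2(f[0::2])          # size-N/2 DFT of the even-indexed data
    d_odd  = fft_radix2(f[1::2])          # size-N/2 DFT of the odd-indexed data
    k = np.arange(N // 2)
    w = np.exp(-2j*np.pi*k/N)             # the twiddle factors omega_N^k
    return np.concatenate([d_even + w*d_odd, d_even - w*d_odd])

f = np.random.rand(16)
print(np.allclose(fft_radix2(f), np.fft.fft(f)))   # True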
Let us count the number of operations in the FFT algorithm. For simplicity, let us count only the number of multiplications (the number of additions is of the same order). Let m_N be the number of multiplications needed to compute the DFT of a periodic array of size N, and assume that N = 2^p. Then
m_N = 2 m_{N/2} + N
    = 2 m_{2^{p−1}} + 2^p
    = 2 (2 m_{2^{p−2}} + 2^{p−1}) + 2^p
    = 2² m_{2^{p−2}} + 2·2^p
    = ···
    = 2^p m_{2^0} + p·2^p = p·2^p = N log₂ N,
where we have used that m_{2^0} = m_1 = 0 (no multiplication is needed for the DFT of 1 point). To illustrate the savings, if N = 2^{20}, with the FFT we can obtain the DFT (or the Inverse DFT) in order 20 × 2^{20} operations, whereas the direct method requires order 2^{40}, i.e. a factor of (1/20)·2^{20} ≈ 52429 more operations.
The FFT can also be implemented efficiently when N is the product of small primes. A very efficient implementation of the FFT is the FFTW ("the Fastest Fourier Transform in the West"), which employs a variety of code generation and runtime optimization techniques and is free software.
Chapter 5
Least Squares Approximation
5.1 Continuous Least Squares Approximation
Let f be a continuous function on [a, b]. We would like to find the best approximation to f by a polynomial of degree at most n in the L2 norm. We have already studied this problem for the approximation of periodic functions by trigonometric (complex exponential) polynomials. The problem is to find a polynomial p_n of degree ≤ n such that
(5.1) ∫_a^b [f(x) − p_n(x)]² dx = min.
Such a polynomial is also called the Least Squares approximation to f. As an illustration let us consider n = 1. We look for p_1(x) = a_0 + a_1 x, x ∈ [a, b], which minimizes
(5.2) J(a_0, a_1) = ∫_a^b [f(x) − p_1(x)]² dx = ∫_a^b f²(x) dx − 2 ∫_a^b f(x)(a_0 + a_1 x) dx + ∫_a^b (a_0 + a_1 x)² dx.
J(a_0, a_1) is a quadratic function of a_0 and a_1 and thus a necessary condition for the minimum is that it is a critical point:
∂J(a_0, a_1)/∂a_0 = −2 ∫_a^b f(x) dx + 2 ∫_a^b (a_0 + a_1 x) dx = 0,
∂J(a_0, a_1)/∂a_1 = −2 ∫_a^b x f(x) dx + 2 ∫_a^b (a_0 + a_1 x) x dx = 0,
which yields the following linear 2 × 2 system for a_0 and a_1:
(5.3) (∫_a^b 1 dx) a_0 + (∫_a^b x dx) a_1 = ∫_a^b f(x) dx,
(5.4) (∫_a^b x dx) a_0 + (∫_a^b x² dx) a_1 = ∫_a^b x f(x) dx.
These two equations are known as the Normal Equations for n = 1.

Example 11. Let f(x) = e^x for x ∈ [0, 1]. Then
(5.5) ∫_0^1 e^x dx = e − 1,
(5.6) ∫_0^1 x e^x dx = 1,
and the normal equations are
(5.7) a_0 + (1/2) a_1 = e − 1,
(5.8) (1/2) a_0 + (1/3) a_1 = 1,
whose solution is a_0 = 4e − 10, a_1 = −6e + 18. Therefore the least squares approximation to f(x) = e^x by a linear polynomial is
p_1(x) = 4e − 10 + (18 − 6e)x.
p_1 and f are plotted in Fig. 5.1.
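A quick numerical check of this example, solving the 2 × 2 normal equations (5.7)-(5.8) with numpy, is sketched below.

import numpy as np

# Normal equations (5.7)-(5.8) for f(x) = e^x on [0, 1]:
#   a0 + a1/2 = e - 1,   a0/2 + a1/3 = 1.
H = np.array([[1.0, 1/2], [1/2, 1/3]])
rhs = np.array([np.e - 1.0, 1.0])
a0, a1 = np.linalg.solve(H, rhs)
print(a0, 4*np.e - 10)    # both approximately 0.8731
print(a1, 18 - 6*np.e)    # both approximately 1.6903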
The Least Squares Approximation to a function f on an interval [a, b] by a polynomial of degree at most n is the best approximation of f in the L2 norm by that class of polynomials. It is the polynomial
p_n(x) = a_0 + a_1 x + ··· + a_n x^n
such that
(5.9) ∫_a^b [f(x) − p_n(x)]² dx = min.
[Figure 5.1: The function f(x) = e^x on [0, 1] and its Least Squares Approximation p_1(x) = 4e − 10 + (18 − 6e)x.]
Defining this squared L2 error as J(a_0, a_1, ..., a_n) we have
J(a_0, a_1, ..., a_n) = ∫_a^b [f(x) − (a_0 + a_1 x + ··· + a_n x^n)]² dx
= ∫_a^b f²(x) dx − 2 Σ_{k=0}^{n} a_k ∫_a^b x^k f(x) dx + Σ_{k=0}^{n} Σ_{l=0}^{n} a_k a_l ∫_a^b x^{k+l} dx.
J(a_0, a_1, ..., a_n) is a quadratic function of the parameters a_0, ..., a_n, and a necessary condition for the set a_0, ..., a_n that minimizes J is that it be a critical point. That is,
0 = ∂J(a_0, a_1, ..., a_n)/∂a_m = −2 ∫_a^b x^m f(x) dx + Σ_{k=0}^{n} a_k ∫_a^b x^{k+m} dx + Σ_{l=0}^{n} a_l ∫_a^b x^{l+m} dx
  = −2 ∫_a^b x^m f(x) dx + 2 Σ_{k=0}^{n} a_k ∫_a^b x^{k+m} dx, m = 0, 1, ..., n,
and we get the Normal Equations
(5.10) Σ_{k=0}^{n} a_k ∫_a^b x^{k+m} dx = ∫_a^b x^m f(x) dx, m = 0, 1, ..., n.
We can write the Normal Equations in matrix form as
(5.11)
\begin{bmatrix}
\int_a^b 1\,dx & \int_a^b x\,dx & \cdots & \int_a^b x^n\,dx \\
\int_a^b x\,dx & \int_a^b x^2\,dx & \cdots & \int_a^b x^{n+1}\,dx \\
\vdots & \vdots & & \vdots \\
\int_a^b x^n\,dx & \int_a^b x^{n+1}\,dx & \cdots & \int_a^b x^{2n}\,dx
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix}
=
\begin{bmatrix}
\int_a^b f(x)\,dx \\ \int_a^b x f(x)\,dx \\ \vdots \\ \int_a^b x^n f(x)\,dx
\end{bmatrix}.
The matrix in this system is clearly symmetric. Denoting this matrix by H and letting a = [a_0 a_1 ··· a_n]^T be any (n+1)-vector, we have
a^T H a = Σ_{i=0}^{n} Σ_{j=0}^{n} a_i a_j H_{ij} = Σ_{i=0}^{n} Σ_{j=0}^{n} a_i a_j ∫_a^b x^{i+j} dx = ∫_a^b Σ_{i=0}^{n} Σ_{j=0}^{n} a_i x^i a_j x^j dx = ∫_a^b ( Σ_{j=0}^{n} a_j x^j )² dx ≥ 0.
Moreover, a^T H a = 0 if and only if a = 0, i.e. H is positive definite. This implies that H is nonsingular and hence there is a unique solution to (5.11); for if there were a ≠ 0 such that Ha = 0, then a^T H a = 0, contradicting the fact that H is positive definite. Furthermore, the Hessian ∂²J/∂a_i∂a_j is equal to 2H, so it is also positive definite and therefore the critical point is indeed a minimum.
In the interval [0, 1],
(5.12)
H = \begin{bmatrix}
1 & 1/2 & \cdots & 1/(n+1) \\
1/2 & 1/3 & \cdots & 1/(n+2) \\
\vdots & \vdots & & \vdots \\
1/(n+1) & 1/(n+2) & \cdots & 1/(2n+1)
\end{bmatrix},
which is known as the (n + 1) × (n + 1) Hilbert matrix.
In principle, the direct process to obtain the Least Squares Approximation
pn to f is to solve the normal equations (5.10) for the coefficients a0, a1, . . . , an and set pn(x) = a0 + a1x + . . . anxn. There are however two problems with this approach:
1. It is difficult to solve this linear system numerically for even moderate n because the matrix H is very sensitive to small perturbations and this sensitivity increases rapidly with n. For example, numerical solutions in double precision (about 16 digits of accuracy) of a linear system with the Hilbert matrix (5.12) will lose all accuracy for n ≥ 11.
2. If we want to increase the degree of the approximating polynomial we need to start over again and solve a larger set of normal equations. That is, we cannot use the a0, a1, . . . , an we already found.
It is more efficient and easier to solve the Least Squares Approximation problem using orthogonality, as we did with approximation by trigonometric polynomials. Suppose that we have a set of polynomials defined on an interval [a, b],
{φ0, φ1, …, φn},
such that φk is a polynomial of degree k. Then, we can write any polyno- mial of degree at most n as a linear combination of these polynomials. In particular, the Least Square Approximating polynomial pn can be written as
p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x) = Σ_{k=0}^{n} a_k φ_k(x),
for some coefficients a_0, ..., a_n to be determined. Then
(5.13) J(a_0, ..., a_n) = ∫_a^b [f(x) − Σ_{k=0}^{n} a_k φ_k(x)]² dx
= ∫_a^b f²(x) dx − 2 Σ_{k=0}^{n} a_k ∫_a^b φ_k(x) f(x) dx + Σ_{k=0}^{n} Σ_{l=0}^{n} a_k a_l ∫_a^b φ_k(x) φ_l(x) dx,
and
0 = ∂J/∂a_m = −2 ∫_a^b φ_m(x) f(x) dx + 2 Σ_{k=0}^{n} a_k ∫_a^b φ_k(x) φ_m(x) dx
for m = 0, 1, ..., n, which gives the normal equations
(5.14) Σ_{k=0}^{n} a_k ∫_a^b φ_k(x) φ_m(x) dx = ∫_a^b φ_m(x) f(x) dx, m = 0, 1, ..., n.
Now, if the set of approximating functions {φ_0, ..., φ_n} is orthogonal, i.e.
(5.15) ∫_a^b φ_k(x) φ_m(x) dx = 0 if k ≠ m,
then the coefficients of the least squares approximation are explicitly given by
(5.16) a_m = (1/α_m) ∫_a^b φ_m(x) f(x) dx, α_m = ∫_a^b φ_m²(x) dx, m = 0, 1, ..., n,
and
p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x).
Note that if the set {φ_0, ..., φ_n} is orthogonal, (5.16) and (5.13) imply the Bessel inequality
(5.17) Σ_{k=0}^{n} α_k a_k² ≤ ∫_a^b f²(x) dx.
This inequality shows that if f is square integrable, i.e. if
∫_a^b f²(x) dx < ∞,
then the series Σ_{k=0}^{∞} α_k a_k² converges.
We can consider the Least Squares approximation for a class of linear combinations of orthogonal functions {φ_0, ..., φ_n} not necessarily polynomials. We saw an example of this with Fourier approximations¹. It is convenient to define a weighted L2 norm associated with the Least Squares problem,
(5.18) ∥f∥_{w,2} = ( ∫_a^b f²(x) w(x) dx )^{1/2},
where w(x) ≥ 0 for all x ∈ (a, b)². A corresponding inner product is defined by
(5.19) ⟨f, g⟩ = ∫_a^b f(x) g(x) w(x) dx.

Definition 7. A set of functions {φ_0, ..., φ_n} is orthogonal, with respect to the weighted inner product (5.19), if ⟨φ_k, φ_l⟩ = 0 for k ≠ l.

5.2 Linear Independence and Gram-Schmidt Orthogonalization

Definition 8. A set of functions {φ_0(x), ..., φ_n(x)} defined on an interval [a, b] is said to be linearly independent if
(5.20) a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x) = 0 for all x ∈ [a, b]
implies a_0 = a_1 = ··· = a_n = 0. Otherwise, it is said to be linearly dependent.

¹For complex-valued functions orthogonality means ∫_a^b φ_k(x) φ̄_l(x) dx = 0 if k ≠ l, where the bar denotes the complex conjugate.
²More precisely, we will assume w ≥ 0, ∫_a^b w(x) dx > 0, and ∫_a^b x^k w(x) dx < +∞ for k = 0, 1, .... We call such a w an admissible weight function.
Example 12. The set of functions {φ0(x),...,φn(x)}, where φk(x) is a poly- nomial of degree k for k = 0, 1, . . . , n is linearly independent on any interval [a, b]. For a0φ0(x) + a1φ1(x) + . . . anφn(x) is a polynomial of degree at most n and hence a0φ0(x) + a1φ1(x) + . . . anφn(x) = 0 for all x in a given interval [a,b] implies a0 =a1 =...=an =0.
Given a set of linearly independent functions {φ_0(x), ..., φ_n(x)} we can produce an orthogonal set {ψ_0(x), ..., ψ_n(x)} by the Gram-Schmidt procedure:
ψ_0(x) = φ_0(x),
ψ_1(x) = φ_1(x) − c_0 ψ_0(x),   with ⟨ψ_1, ψ_0⟩ = 0 ⇒ c_0 = ⟨φ_1, ψ_0⟩ / ⟨ψ_0, ψ_0⟩,
ψ_2(x) = φ_2(x) − c_0 ψ_0(x) − c_1 ψ_1(x),   with ⟨ψ_2, ψ_0⟩ = 0 ⇒ c_0 = ⟨φ_2, ψ_0⟩ / ⟨ψ_0, ψ_0⟩ and ⟨ψ_2, ψ_1⟩ = 0 ⇒ c_1 = ⟨φ_2, ψ_1⟩ / ⟨ψ_1, ψ_1⟩,
⋮
We can write this procedure recursively as
ψ_0(x) = φ_0(x),
(5.21) ψ_k(x) = φ_k(x) − Σ_{j=0}^{k−1} c_j ψ_j(x),   c_j = ⟨φ_k, ψ_j⟩ / ⟨ψ_j, ψ_j⟩.

5.3 Orthogonal Polynomials

Let us take the set {1, x, ..., x^n} on an interval [a, b]. We can use the Gram-Schmidt process to obtain an orthogonal set {ψ_0(x), ..., ψ_n(x)} of polynomials with respect to the inner product (5.19). Each ψ_k is a polynomial of degree k, determined up to a multiplicative constant (which does not affect orthogonality). Suppose we select the ψ_k(x), k = 0, 1, ..., n, to be monic, i.e. the coefficient of x^k is 1. Then ψ_{k+1}(x) − xψ_k(x) = r_k(x), where r_k(x) is a polynomial of degree at most k. So we can write
(5.22) ψ_{k+1}(x) − xψ_k(x) = −α_k ψ_k(x) − β_k ψ_{k−1}(x) + Σ_{j=0}^{k−2} c_j ψ_j(x).
Then, taking the inner product of this expression with ψ_k and using orthogonality, we get
−⟨xψ_k, ψ_k⟩ = −α_k ⟨ψ_k, ψ_k⟩
and
α_k = ⟨xψ_k, ψ_k⟩ / ⟨ψ_k, ψ_k⟩.
Similarly, taking the inner product with ψ_{k−1}, we obtain
−⟨xψ_k, ψ_{k−1}⟩ = −β_k ⟨ψ_{k−1}, ψ_{k−1}⟩,
but ⟨xψ_k, ψ_{k−1}⟩ = ⟨ψ_k, xψ_{k−1}⟩ and xψ_{k−1}(x) = ψ_k(x) + q_{k−1}(x), where q_{k−1}(x) is a polynomial of degree at most k − 1. Then
⟨ψ_k, xψ_{k−1}⟩ = ⟨ψ_k, ψ_k⟩ + ⟨ψ_k, q_{k−1}⟩ = ⟨ψ_k, ψ_k⟩,
where we have used orthogonality in the last equality. Therefore
β_k = ⟨ψ_k, ψ_k⟩ / ⟨ψ_{k−1}, ψ_{k−1}⟩.
Finally, taking the inner product of (5.22) with ψ_m for m = 0, ..., k − 2, we get
−⟨ψ_k, xψ_m⟩ = c_m ⟨ψ_m, ψ_m⟩, m = 0, ..., k − 2,
but the left hand side is zero because xψ_m(x) is a polynomial of degree at most k − 1 and hence it is orthogonal to ψ_k(x). Collecting the results we obtain the three-term recursion formula
(5.23) ψ_0(x) = 1,
(5.24) ψ_1(x) = x − α_0,   α_0 = ⟨xψ_0, ψ_0⟩ / ⟨ψ_0, ψ_0⟩,
and for k = 1, ..., n
(5.25) ψ_{k+1}(x) = (x − α_k) ψ_k(x) − β_k ψ_{k−1}(x),
(5.26) α_k = ⟨xψ_k, ψ_k⟩ / ⟨ψ_k, ψ_k⟩,
(5.27) β_k = ⟨ψ_k, ψ_k⟩ / ⟨ψ_{k−1}, ψ_{k−1}⟩.
Example 13. Let [a, b] = [−1, 1] and w(x) ≡ 1. The corresponding orthogonal polynomials are known as the Legendre polynomials and are widely used in a variety of numerical methods. Because xψ_k²(x)w(x) is an odd function, it follows that α_k = 0 for all k. We have ψ_0(x) = 1 and ψ_1(x) = x. We can now use the three-term recursion (5.25) to obtain
β_1 = ∫_{−1}^{1} x² dx / ∫_{−1}^{1} dx = 1/3
and ψ_2(x) = x² − 1/3. Now for k = 2 we get
β_2 = ∫_{−1}^{1} (x² − 1/3)² dx / ∫_{−1}^{1} x² dx = 4/15
and ψ_3(x) = x(x² − 1/3) − (4/15)x = x³ − (3/5)x. We now collect the Legendre polynomials we found:
ψ_0(x) = 1,
ψ_1(x) = x,
ψ_2(x) = x² − 1/3,
ψ_3(x) = x³ − (3/5)x.
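The recursion (5.23)-(5.27) is straightforward to run numerically. The following Python sketch uses numpy's Polynomial class, whose integ method gives the exact antiderivative of a polynomial, so the inner products on [−1, 1] with w ≡ 1 are computed exactly up to round-off; the function names are illustrative choices.

from numpy.polynomial import Polynomial as P

def inner(p, q, a=-1.0, b=1.0):
    # <p, q> = integral_a^b p(x) q(x) dx  (weight w = 1); exact for polynomials.
    r = (p * q).integ()
    return r(b) - r(a)

def monic_orthogonal(n, a=-1.0, b=1.0):
    # Three-term recursion (5.23)-(5.27) for monic orthogonal polynomials.
    x = P([0.0, 1.0])
    psi = [P([1.0])]
    alpha0 = inner(x*psi[0], psi[0], a, b) / inner(psi[0], psi[0], a, b)
    psi.append(x - alpha0)
    for k in range(1, n):
        ak = inner(x*psi[k], psi[k], a, b) / inner(psi[k], psi[k], a, b)
        bk = inner(psi[k], psi[k], a, b) / inner(psi[k-1], psi[k-1], a, b)
        psi.append((x - ak)*psi[k] - bk*psi[k-1])
    return psi

for p in monic_orthogonal(3):
    print(p.coef)   # coefficients in ascending order: 1;  x;  x^2 - 1/3;  x^3 - (3/5)x

The printed coefficient arrays reproduce the monic Legendre polynomials ψ_0, ..., ψ_3 listed above.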
Theorem 14. The zeros of orthogonal polynomials are real, simple, and they all lie in (a, b).

Proof. Indeed, ψ_k(x) is orthogonal to ψ_0(x) = 1 for each k ≥ 1, thus
(5.28) ∫_a^b ψ_k(x) w(x) dx = 0,
i.e. ψ_k has to change sign in (a, b), so it has a zero, say x_1 ∈ (a, b). Suppose x_1 is not a simple root. Then q(x) = ψ_k(x)/(x − x_1)² is a polynomial of degree k − 2, and so
0 = ⟨ψ_k, q⟩ = ∫_a^b [ψ_k²(x)/(x − x_1)²] w(x) dx > 0,
which is of course impossible. Assume that ψ_k(x) has only l < k zeros in (a, b), x_1, ..., x_l. Then ψ_k(x)(x − x_1)···(x − x_l) = q_{k−l}(x)(x − x_1)²···(x − x_l)², where q_{k−l}(x) is a polynomial of degree k − l which does not change sign in [a, b]. Then
⟨ψ_k, (x − x_1)···(x − x_l)⟩ = ∫_a^b q_{k−l}(x)(x − x_1)²···(x − x_l)² w(x) dx ≠ 0,
but ⟨ψ_k, (x − x_1)···(x − x_l)⟩ = 0 for l < k, a contradiction. Hence ψ_k has k simple zeros, all in (a, b).

5.3.1 Chebyshev Polynomials

We introduced in Section 2.4 the Chebyshev polynomials, which as we have seen possess remarkable properties. We now add one more important property of this outstanding class of polynomials, namely orthogonality.
The Chebyshev polynomials are orthogonal with respect to the weight function
(5.29) w(x) = 1/√(1 − x²).
Indeed, recall that T_n(x) = cos nθ, with x = cos θ. Then, for n ≠ m,
⟨T_n, T_m⟩ = ∫_{−1}^{1} T_n(x) T_m(x) (1/√(1 − x²)) dx = ∫_0^{π} cos nθ cos mθ dθ = 0.
5.4 Discrete Least Squares Approximation
Suppose that we are given the data (x_1, f_1), (x_2, f_2), ..., (x_N, f_N) obtained from an experiment. Can we find a simple function that appropriately fits these data? Suppose that empirically we determine that there is an approximate linear behavior between f_j and x_j, j = 1, ..., N. What is the straight line y = a_0 + a_1 x that best fits these data? The answer depends on how we measure the error, i.e. the deviations f_j − (a_0 + a_1 x_j), that is, which norm we use for the error. The most convenient measure is the square of the 2-norm (Euclidean norm), because we will end up with a linear system of equations for a_0 and a_1; other norms would yield a nonlinear system for the unknown parameters. So the problem is: find a_0, a_1 which minimize
(5.32) J(a_0, a_1) = Σ_{j=1}^{N} [f_j − (a_0 + a_1 x_j)]².
We can repeat all the Least Squares Approximation theory that we have seen at the continuum level, except that integrals are replaced by sums. The conditions for the minimum,
(5.33) ∂J(a_0, a_1)/∂a_0 = 2 Σ_{j=1}^{N} [f_j − (a_0 + a_1 x_j)](−1) = 0,
(5.34) ∂J(a_0, a_1)/∂a_1 = 2 Σ_{j=1}^{N} [f_j − (a_0 + a_1 x_j)](−x_j) = 0,
produce the Normal Equations:
(5.35) a_0 Σ_{j=1}^{N} 1 + a_1 Σ_{j=1}^{N} x_j = Σ_{j=1}^{N} f_j,
(5.36) a_0 Σ_{j=1}^{N} x_j + a_1 Σ_{j=1}^{N} x_j² = Σ_{j=1}^{N} x_j f_j.
For approximation by a higher order polynomial, p_n(x) = a_0 + a_1 x + ··· + a_n x^n, the discrete Least Squares Approximation problem by a polynomial of degree ≤ n can be written as
(5.40) p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x)
and the square of the error is given by
(5.41) J(a_0, ..., a_n) = Σ_{j=1}^{N} [f_j − (a_0 φ_0(x_j) + ··· + a_n φ_n(x_j))]² ω_j,
where the ω_j > 0 are given weights and ⟨u, v⟩_N = Σ_{j=1}^{N} u(x_j) v(x_j) ω_j denotes the corresponding discrete inner product. Consequently, the normal equations are
(5.42) Σ_{k=0}^{n} a_k ⟨φ_k, φ_l⟩_N = ⟨φ_l, f⟩_N, l = 0, 1, ..., n.
If {φ_0, ..., φ_n} are orthogonal with respect to the inner product ⟨·,·⟩_N, i.e. if ⟨φ_k, φ_l⟩_N = 0 for k ≠ l, then the coefficients of the Least Squares Approximation are given by
(5.43) a_k = ⟨φ_k, f⟩_N / ⟨φ_k, φ_k⟩_N, k = 0, 1, ..., n,
and p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x).
If the {φ_0, ..., φ_n} are not orthogonal we can produce an orthogonal set {ψ_0, ..., ψ_n} using the three-term recursion formula adapted to the discrete inner product. We have
(5.44) ψ_0(x) ≡ 1,
(5.45) ψ_1(x) = x − α_0,   α_0 = ⟨xψ_0, ψ_0⟩_N / ⟨ψ_0, ψ_0⟩_N,
and for k = 1, ..., n
(5.46) ψ_{k+1}(x) = (x − α_k) ψ_k(x) − β_k ψ_{k−1}(x),
(5.47) α_k = ⟨xψ_k, ψ_k⟩_N / ⟨ψ_k, ψ_k⟩_N,   β_k = ⟨ψ_k, ψ_k⟩_N / ⟨ψ_{k−1}, ψ_{k−1}⟩_N.
Then,
(5.48) a_j = ⟨ψ_j, f⟩_N / ⟨ψ_j, ψ_j⟩_N, j = 0, ..., n,
and the Least Squares Approximation is p_n(x) = a_0 ψ_0(x) + a_1 ψ_1(x) + ··· + a_n ψ_n(x).
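A compact Python sketch of this discrete construction is given below (illustrative names, weights ω_j ≡ 1). It evaluates the ψ_k only at the data sites, which is all that is needed for the coefficients (5.48); evaluating p_n at other points would require storing the recursion coefficients α_k, β_k as well. The chosen nodes and data reproduce Example 16 below.

import numpy as np

def discrete_ls_coeffs(xj, fj, n):
    # Orthogonal polynomials psi_k evaluated at the data sites via (5.44)-(5.47),
    # and the least squares coefficients (5.48), with weights omega_j = 1.
    xj, fj = np.asarray(xj, float), np.asarray(fj, float)
    dot = lambda u, v: np.sum(u * v)
    psi = [np.ones_like(xj)]
    alpha = dot(xj*psi[0], psi[0]) / dot(psi[0], psi[0])
    psi.append(xj - alpha)
    for k in range(1, n):
        ak = dot(xj*psi[k], psi[k]) / dot(psi[k], psi[k])
        bk = dot(psi[k], psi[k]) / dot(psi[k-1], psi[k-1])
        psi.append((xj - ak)*psi[k] - bk*psi[k-1])
    return [dot(p, fj) / dot(p, p) for p in psi]

# Data of Example 16 below: x_j = j/10, j = 1,...,10, and f_j = x_j^2 + 2 x_j + 3.
xj = np.arange(1, 11) / 10.0
print(discrete_ls_coeffs(xj, xj**2 + 2*xj + 3, 2))   # approximately [4.485, 3.1, 1.0]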
5.4. DISCRETE LEAST SQUARES APPROXIMATION 93 Example 14. Suppose we are given the data: xj : 0, 1, 2, 3, fj = 1.1, 3.2, 5.1, 6.9
and we would like to fit to a line. The normal equations are
444
a 1+a x =f 01jj
j=1 j=1 j=1
and performing the sums we have
(5.51) 4a0 + 6a1 = 16.3, (5.52) 6a0 + 14a1 = 34.1.
Solving this 2 × 2 linear system we get a0 = 1.18 and a1 = 1.93. Thus, the Least Squares Approximation is
p1(x) = 1.18 + 1.93x and the square of the error is
4
J (1.18, 193) = [fj − (1.18 + 1.93xj )]2 = 0.023.
j=1
Example 15. Fitting to an exponential y = beaxk . Defining
N
J(a, b) = [fj − beaxj ]2
j=1 we get the conditions for a and b
(5.53) (5.54)
(5.49) (5.50)
444
a x + a x2 = x f
∂a = 2
j=1
0j1jjj j=1 j=1 j=1
∂J N
[fj − beaxj ](−bxjeaxj ) = 0, [fj − beaxj ](−eaxj ) = 0.
∂J N
∂b = 2
j=1
which is a nonlinear system of equations. However, if we take the natural log of y = b e^{ax} we have ln y = ln b + a x. Defining B = ln b, the problem becomes linear in B and a. Tabulating (x_j, ln f_j) we can obtain the normal equations
(5.55) B Σ_{j=1}^{N} 1 + a Σ_{j=1}^{N} x_j = Σ_{j=1}^{N} ln f_j,
(5.56) B Σ_{j=1}^{N} x_j + a Σ_{j=1}^{N} x_j² = Σ_{j=1}^{N} x_j ln f_j,
and solve this linear system for B and a. Then b = e^B.
If a is given and we only need to determine b, then the problem is linear. From (5.54) we have
b Σ_{j=1}^{N} e^{2a x_j} = Σ_{j=1}^{N} f_j e^{a x_j}  ⇒  b = ( Σ_{j=1}^{N} f_j e^{a x_j} ) / ( Σ_{j=1}^{N} e^{2a x_j} ).
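A minimal Python sketch of the linearized fit (5.55)-(5.56); the synthetic data and the parameter values 0.7 and 2.0 are purely for illustration.

import numpy as np

# Fit y = b*exp(a*x) by least squares on ln y = ln b + a x, as in (5.55)-(5.56).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 2.0 * np.exp(0.7 * x) * (1 + 0.01*np.random.randn(x.size))   # noisy synthetic data

N = x.size
S1, Sx, Sxx = N, x.sum(), (x**2).sum()
Sy, Sxy = np.log(y).sum(), (x*np.log(y)).sum()
B, a = np.linalg.solve([[S1, Sx], [Sx, Sxx]], [Sy, Sxy])
b = np.exp(B)
print(a, b)    # close to 0.7 and 2.0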
Example 16. Discrete orthogonal polynomials. Let us construct the first few orthogonal polynomials with respect to the discrete inner product with ω ≡ 1 and x_j = j/N, j = 1, ..., N. Here N = 10 (the points are equidistributed in [0, 1]). We have ψ_0(x) = 1 and ψ_1(x) = x − α_0, where
(5.57) α_0 = ⟨xψ_0, ψ_0⟩_N / ⟨ψ_0, ψ_0⟩_N = ( Σ_{j=1}^{N} x_j ) / ( Σ_{j=1}^{N} 1 ) = 0.55,
and hence ψ_1(x) = x − 0.55. Now
ψ_2(x) = (x − α_1) ψ_1(x) − β_1 ψ_0(x),
(5.58) α_1 = ⟨xψ_1, ψ_1⟩_N / ⟨ψ_1, ψ_1⟩_N = Σ_{j=1}^{N} x_j (x_j − 0.55)² / Σ_{j=1}^{N} (x_j − 0.55)² = 0.55,
(5.59) β_1 = ⟨ψ_1, ψ_1⟩_N / ⟨ψ_0, ψ_0⟩_N = 0.0825.
Therefore ψ_2(x) = (x − 0.55)² − 0.0825. We can now use these orthogonal polynomials to find the Least Squares Approximation by a polynomial of degree at most two of a given set of data. Let us take f_j = x_j² + 2x_j + 3. Clearly, the Least Squares Approximation should be p_2(x) = x² + 2x + 3. Let us confirm this by using the orthogonal polynomials ψ_0, ψ_1 and ψ_2. The Least Squares Approximation coefficients are given by
(5.60) a_0 = ⟨f, ψ_0⟩_N / ⟨ψ_0, ψ_0⟩_N = 4.485,
(5.61) a_1 = ⟨f, ψ_1⟩_N / ⟨ψ_1, ψ_1⟩_N = 3.1,
(5.62) a_2 = ⟨f, ψ_2⟩_N / ⟨ψ_2, ψ_2⟩_N = 1,
which gives p_2(x) = (x − 0.55)² − 0.0825 + 3.1(x − 0.55) + 4.485 = x² + 2x + 3.

5.5 High-dimensional Data Fitting
In many applications each data point contains many variables. For example, a value for each pixel in an image, or clinical measurements of a patient, etc. We can put all these variables in a vector x ∈ Rd for d ≥ 1. Associated with x there is a scalar quantity f that can be measured or computed so that our data set consists of the points (xj,fj), where xj ∈ Rd and fj ∈ R, for j = 1,…,N.
A central problem is that of predicting f from a given large, high-dimensional data set. The simplest and most commonly used approach is to postulate a linear relation
(5.63) f(x) = a0 + aT x
and determine the so-called bias coefficient a_0 and the vector a = [a_1, ..., a_d]^T as a least squares solution, i.e. such that they minimize the deviations from the data:
Σ_{j=1}^{N} [f_j − (a_0 + a^T x_j)]².
We have studied in detail the case d = 1. Here we are interested in the case d ≫ 1.
If we add an extra component, equal to 1, to each data vector x_j so that now x_j = [1, x_{j1}, ..., x_{jd}]^T, for j = 1, ..., N, then we can write (5.63) as
(5.64) f(x) = a^T x
and the dimension d is increased by one. Then we are seeking a vector a ∈ R^d that minimizes
(5.65) J(a) = Σ_{j=1}^{N} [f_j − a^T x_j]².
Putting the data x_j as the rows of an N × d (N ≥ d) matrix X and the f_j as the components of a (column) vector f, i.e.
(5.66) X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} and f = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{bmatrix},
we can write (5.65) as
(5.67) J(a) = (f − Xa)^T (f − Xa) = ∥f − Xa∥².
The normal equations are given by the condition ∇_a J(a) = 0. Since ∇_a J(a) = −2X^T f + 2X^T X a, we get the linear system of equations
(5.68) X^T X a = X^T f.
Every solution of the least square problem is necessarily a solution of the normal equations. We will prove that the converse is also true and that the solutions have a geometric characterization.
Let W be the linear space spanned by the columns of X. Clearly, W ⊆ RN . Then, the least square problem is equivalent to minimizing ∥f − w∥2 among all vectors w in W. There is always at least one solution, which can be obtained by projecting f onto W, as Fig. 5.2 illustrates. First, note that if a ∈ Rd is a solution of the normal equations (5.68) then the residual f − Xa is orthogonal to W because
(5.69) XT (f − Xa) = XT f − XT Xa = 0
and a vector r ∈ R^N is orthogonal to W if it is orthogonal to each column of X, i.e. X^T r = 0. Let a* be a solution of the normal equations, let r = f − Xa*, and for arbitrary a ∈ R^d, let s = Xa − Xa*. Then we have
(5.70) ∥f − Xa∥² = ∥f − Xa* − (Xa − Xa*)∥² = ∥r − s∥².

[Figure 5.2: Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.]
But r and s are orthogonal. Therefore,
(5.71) ∥r − s∥2 = ∥r∥2 + ∥s∥2 ≥ ∥r∥2 and so we have proved that
(5.72) ∥f − Xa∥2 ≥ ∥f − Xa∗∥2
for arbitrary a ∈ Rd, i.e. a∗ minimizes ∥f − Xa∥2.
If the columns of X are linearly independent, i.e. if for every a ̸= 0 we
have that Xa ≠ 0, then the d × d matrix X^T X is positive definite and hence nonsingular. Therefore, in this case, there is a unique solution to the least squares problem min_a ∥f − Xa∥², given by
(5.73) a∗ = (XT X)−1XT f. The d × N matrix
(5.74) X† = (XT X)−1XT
is called the pseudoinverse of the N × d matrix X. Note that if X were square and nonsingular X† would coincide with the inverse, X−1.
As we have done in the other least squares problems we have seen so far, rather than working with the normal equations, whose matrix X^T X may be very sensitive to perturbations in the data, we use an orthogonal basis for the approximating subspace (W in this case) to find a solution. While in principle this can be done by applying the Gram-Schmidt process to the columns of
X, this is a numerically unstable procedure; when two columns are nearly linearly dependent, errors introduced by the finite precision representation of computer numbers can be greatly amplified during the Gram-Schmidt process and produce vectors which are not orthogonal. A re-orthogonalization step can be introduced in the Gram-Schmidt algorithm to remedy this problem, at the price of doubling the computational cost. A more efficient method, using a sequence of orthogonal transformations known as Householder reflections, is usually preferred. Once this orthonormalization process is completed we get a QR factorization of X,
(5.75) X = QR,
where Q is an N × N orthogonal matrix, i.e. Q^T Q = I, and R is an N × d upper triangular matrix
(5.76) R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}.
Here R_1 is a d × d upper triangular matrix and the zero stands for an (N − d) × d zero matrix.
Using X = QR we have
(5.77) ∥f − Xa∥² = ∥f − QRa∥² = ∥Q^T f − Ra∥².
Therefore the least squares solution is obtained by solving the system Ra = Q^T f. Writing R in blocks we have
(5.78) \begin{bmatrix} R_1 a \\ 0 \end{bmatrix} = \begin{bmatrix} (Q^T f)_1 \\ (Q^T f)_2 \end{bmatrix},
so that the solution is found by solving the upper triangular system R_1 a = (Q^T f)_1 (R_1 is nonsingular if the columns of X are linearly independent). The last N − d equations, (Q^T f)_2 = 0, may or may not be satisfied, depending on f, but we have no control over them.
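In practice this is a few lines with a library QR routine. The sketch below uses numpy (numpy.linalg.qr in reduced mode returns Q of size N × d and the d × d triangular factor R_1 directly); a dedicated triangular solver could be used in place of the generic solve, and the test data are purely illustrative.

import numpy as np

def ls_via_qr(X, f):
    # Solve min_a ||f - X a||_2 using X = Q R_1 and R_1 a = Q^T f.
    Q, R = np.linalg.qr(X, mode='reduced')   # Q: N x d, R: d x d upper triangular
    return np.linalg.solve(R, Q.T @ f)       # R is triangular; generic solve suffices here

# Small test: recover f = 1 + 2*x1 - 3*x2 from noisy samples.
rng = np.random.default_rng(0)
pts = rng.random((50, 2))
X = np.column_stack([np.ones(50), pts])          # prepend the column of ones (bias)
f = 1 + 2*pts[:, 0] - 3*pts[:, 1] + 0.01*rng.standard_normal(50)
print(ls_via_qr(X, f))                           # approximately [1, 2, -3]
print(np.linalg.lstsq(X, f, rcond=None)[0])      # numpy's own least squares agrees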
Chapter 6
Computer Arithmetic
6.1 Floating Point Numbers
Floating point numbers are based on scientific notation in binary (base 2). For example
(1.0101)_2 × 2² = (1·2⁰ + 0·2^{−1} + 1·2^{−2} + 0·2^{−3} + 1·2^{−4}) × 2² = (1 + 1/4 + 1/16) × 4 = 5.25_{10}.
We can write any non-zero real number x in normalized, binary, scientific notation as
(6.1) x = ±S × 2^E, 1 ≤ S < 2,
where S is called the significand (or mantissa) and E is the exponent. In general S is an infinite expansion of the form
(6.2) S = (1.b_1 b_2 ···)_2.
In a computer a real number is represented in scientific notation but using a finite number of digits (bits). We call these floating point numbers. In single precision (SP), floating point numbers are stored in 32-bit words, whereas in double precision (DP), used in most scientific computing applications, a 64-bit word is employed: 1 bit is used for the sign, 52 bits for S, and 11 bits for E. These memory limits produce a large but finite set of floating point numbers which can be represented in a computer. Moreover, the floating point numbers are not uniformly distributed!
The maximum exponent possible in DP would be 2^{11} = 2048, but this range is shifted to allow representation of both small and large numbers, so that we actually have E_min = −1022 and E_max = 1023. Consequently, the smallest and largest (in absolute value) floating point numbers which can be represented in DP are
(6.3) N_min = min_{x∈DP} |x| = 2^{−1022} ≈ 2.2 × 10^{−308},
(6.4) N_max = max_{x∈DP} |x| = (1.1...1)_2 · 2^{1023} = (2 − 2^{−52}) · 2^{1023} ≈ 1.8 × 10^{308}.
6.2 Rounding and Machine Precision
To represent a real number x as a floating point number, rounding has to be performed to retain only the number of binary bits allowed in the significand. Let x ∈ R and let its binary expansion be x = ±(1.b_1 b_2 ···)_2 × 2^E.
One way to approximate x by a floating point number with d bits in the significand is to truncate or chop, discarding all the bits after b_d, i.e.
(6.5) x* = chop(x) = ±(1.b_1 b_2 ··· b_d)_2 × 2^E.
In double precision d = 52.
A better way to approximate x by a floating point number is to round up or down (to the nearest floating point number), just as we do when we round in base 10. In binary, rounding is simpler because b_{d+1} can only be 0 (we round down) or 1 (we round up). We can write this type of rounding in terms of the chopping described above as
(6.6) x* = round(x) = chop(x + 2^{−(d+1)} × 2^E).
Definition 9. Given an approximation x* to x, the absolute error is defined by |x − x*| and the relative error by |x − x*|/|x|, x ≠ 0.
The relative error is generally more meaningful than the absolute error to measure a given approximation.
The relative error in chopping and in rounding (the latter is called a round-off error) is, respectively,
(6.7) |x − chop(x)|/|x| ≤ (2^{−d} · 2^E) / ((1.b_1 b_2 ···)_2 · 2^E) ≤ 2^{−d},
(6.8) |x − round(x)|/|x| ≤ (1/2) 2^{−d}.
The number 2^{−d} is called the machine precision or epsilon (eps). In double precision eps = 2^{−52} ≈ 2.22 × 10^{−16}. The smallest double precision number greater than 1 is 1 + eps. As we will see below, it is more convenient to write (6.8) as
(6.9) round(x) = x(1 + δ), |δ| ≤ eps.

6.3 Correctly Rounded Arithmetic
Computers today follow the IEEE standard for floating point representation and arithmetic. This standard requires a consistent floating point represen- tation of numbers across computers and correctly rounded arithmetic.
In correctly rounded arithmetic, the computer operations of addition, sub- traction, multiplication, and division are the correctly rounded value of the exact result.
If x and y are floating point numbers and ⊕ is the machine addition, then
(6.10) x ⊕ y = round(x + y) = (x + y)(1 + δ_+), |δ_+| ≤ eps,
and similarly for ⊖, ⊗, ⊘.
One important interpretation of (6.10) is the following. Assuming x + y ≠ 0, write
δ_+ = (1/(x + y)) [δ_x + δ_y].
Then
(6.11) x ⊕ y = (x + y) (1 + (δ_x + δ_y)/(x + y)) = (x + δ_x) + (y + δ_y).
The computer ⊕ gives the exact result but for slightly perturbed data. This interpretation is the basis for Backward Error Analysis, which is used to study how round-off errors propagate in a numerical algorithm.
6.4 Propagation of Errors and Cancellation of Digits
Let fl(x) and fl(y) denote the floating point approximations of x and y, respectively, and assume that their product is computed exactly, i.e.
fl(x) · fl(y) = x(1 + δ_x) · y(1 + δ_y) = x·y (1 + δ_x + δ_y + δ_x δ_y) ≈ x·y (1 + δ_x + δ_y),
where |δ_x|, |δ_y| ≤ eps. Therefore, for the relative error we get
(6.12) |x·y − fl(x)·fl(y)| / |x·y| ≈ |δ_x + δ_y|,
which is acceptable.
Let us now consider addition (or subtraction):
fl(x) + fl(y) = x(1 + δ_x) + y(1 + δ_y) = x + y + xδ_x + yδ_y = (x + y) (1 + (x/(x+y)) δ_x + (y/(x+y)) δ_y).
The relative error is
(6.13) |x + y − (fl(x) + fl(y))| / |x + y| = |(x/(x+y)) δ_x + (y/(x+y)) δ_y|.
If x and y have the same sign, then x/(x+y) and y/(x+y) are both positive and bounded by 1. Therefore the relative error is less than |δ_x| + |δ_y|, which is fine. But if x and y have different signs and are close in magnitude, the error could be greatly amplified because |x/(x+y)| and |y/(x+y)| can be very large.
Example 17. Suppose we have 10 bits of precision and
x = (1.01011100∗∗)_2 × 2^E,
y = (1.01011000∗∗)_2 × 2^E,
where the ∗ stands for inaccurate bits (i.e. garbage) that, say, were generated in previous floating point computations. Then, in this 10-bit precision arithmetic,
(6.14) z = x − y = (1.00∗∗∗∗∗∗∗∗)_2 × 2^{E−6}.
We end up with only 2 bits of accuracy in z. Any further computations using z will result in an accuracy of 2 bits or lower!
Example 18. Sometimes we can rewrite the difference of two very close numbers to avoid digit cancellation. For example, suppose we would like to compute

y = √(1 + x) − 1

for x > 0 and very small. Clearly, we will have loss of digits if we proceed directly. However, if we rewrite y as

y = (√(1 + x) − 1) · (√(1 + x) + 1)/(√(1 + x) + 1) = x/(√(1 + x) + 1),

then the computation can be performed at nearly machine precision level.
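The effect described in Example 18 is easy to observe numerically. The following is a minimal sketch (not from the text) in Python comparing the direct evaluation with the rewritten form for a small, arbitrarily chosen x:

import math

x = 1e-12
direct = math.sqrt(1.0 + x) - 1.0          # suffers from cancellation of digits
stable = x / (math.sqrt(1.0 + x) + 1.0)    # algebraically equivalent, no cancellation

# For x this small the true value is x/2 - x^2/8 + ... ~ 5e-13.
print(direct)   # only a few correct digits
print(stable)   # accurate to nearly machine precision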
Chapter 7
Numerical Differentiation
7.1 Finite Differences
Suppose f is a differentiable function and we would like to approximate f′(x0) given the values of f at x0 and at neighboring points x1, x2, ..., xn. We could approximate f by its interpolating polynomial Pn(x) at those points and use f′(x0) ≈ Pn′(x0). There are several other possibilities. For example, we can approximate f′(x0) by the derivative of the cubic spline of f evaluated at x0, or by the derivative of the Least Squares Chebyshev expansion of f:

f′(x0) ≈ Σ_{j=1}^{n} aj Tj′(x0),
etc.
If we use polynomial interpolation then

(7.1)  f(x) = Pn(x) + (1/(n + 1)!) f^{(n+1)}(ξ(x)) ωn(x),

where

(7.2)  ωn(x) = (x − x0)(x − x1) · · · (x − xn).

Thus,

f′(x0) = Pn′(x0) + (1/(n + 1)!) [ (d/dx)( f^{(n+1)}(ξ(x)) ) ωn(x) + f^{(n+1)}(ξ(x)) ωn′(x) ]_{x=x0}.

But ωn(x0) = 0 and ωn′(x0) = (x0 − x1) · · · (x0 − xn), thus

(7.3)  f′(x0) = Pn′(x0) + (1/(n + 1)!) f^{(n+1)}(ξ(x0)) (x0 − x1) · · · (x0 − xn).
Example 19. Take n = 1 and x1 = x0 + h (h > 0). In Newton's form,

(7.4)  P1(x) = f(x0) + [ (f(x0 + h) − f(x0))/h ] (x − x0),

and P1′(x0) = (1/h)[f(x0 + h) − f(x0)]. We obtain the so-called Forward Difference Formula for approximating f′(x0):

(7.5)  Dh+ f(x0) := (f(x0 + h) − f(x0))/h.

From (7.3) the error in this approximation is

(7.6)  f′(x0) − Dh+ f(x0) = (1/2!) f′′(ξ)(x0 − x1) = −(1/2) f′′(ξ) h.
Example 20. Take again n = 1 but now x1 = x0 − h. Then P1′(x0) = (1/h)[f(x0) − f(x0 − h)] and we get the so-called Backward Difference Formula for approximating f′(x0):

(7.7)  Dh− f(x0) := (f(x0) − f(x0 − h))/h.

Its error is

(7.8)  f′(x0) − Dh− f(x0) = (1/2) f′′(ξ) h.
Example 21. Let n = 2 and x1 = x0 − h, x2 = x0 + h. P2 in Newton's form is

P2(x) = f[x1] + f[x1, x0](x − x1) + f[x1, x0, x2](x − x1)(x − x0).

Let us obtain the divided difference table:

x0 − h   f(x0 − h)
                       [f(x0) − f(x0 − h)]/h
x0       f(x0)                                   [f(x0 + h) − 2f(x0) + f(x0 − h)]/(2h²)
                       [f(x0 + h) − f(x0)]/h
x0 + h   f(x0 + h)

Therefore,

P2′(x0) = [f(x0) − f(x0 − h)]/h + ([f(x0 + h) − 2f(x0) + f(x0 − h)]/(2h²)) · h,

and thus

(7.9)  P2′(x0) = (f(x0 + h) − f(x0 − h))/(2h).

This defines the Centered Difference Formula to approximate f′(x0):

(7.10)  Dh0 f(x0) := (f(x0 + h) − f(x0 − h))/(2h).

Its error is

(7.11)  f′(x0) − Dh0 f(x0) = (1/3!) f′′′(ξ)(x0 − x1)(x0 − x2) = −(1/6) f′′′(ξ) h².
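A minimal numerical illustration (not from the text) of formulas (7.5), (7.7), and (7.10): for f(x) = sin x at x0 = 1 the forward and backward differences converge at rate O(h), the centered difference at O(h²).

import math

def Dp(f, x0, h):  # forward difference (7.5)
    return (f(x0 + h) - f(x0)) / h

def Dm(f, x0, h):  # backward difference (7.7)
    return (f(x0) - f(x0 - h)) / h

def D0(f, x0, h):  # centered difference (7.10)
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

f, x0, exact = math.sin, 1.0, math.cos(1.0)
for h in [1e-1, 1e-2, 1e-3]:
    print(h, abs(Dp(f, x0, h) - exact),   # decreases like h
             abs(Dm(f, x0, h) - exact),   # decreases like h
             abs(D0(f, x0, h) - exact))   # decreases like h^2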
Example 22. Let n = 2 and x1 = x0 + h, x2 = x0 + 2h. The divided difference table is

x0        f(x0)
                        [f(x0 + h) − f(x0)]/h
x0 + h    f(x0 + h)                                 [f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)
                        [f(x0 + 2h) − f(x0 + h)]/h
x0 + 2h   f(x0 + 2h)

and

P2′(x0) = [f(x0 + h) − f(x0)]/h − ([f(x0 + 2h) − 2f(x0 + h) + f(x0)]/(2h²)) · h,

thus

(7.12)  P2′(x0) = (−f(x0 + 2h) + 4f(x0 + h) − 3f(x0))/(2h).

If we use this sided difference to approximate f′(x0) the error is

(7.13)  f′(x0) − P2′(x0) = (1/3!) f′′′(ξ)(x0 − x1)(x0 − x2) = (1/3) h² f′′′(ξ),

which is twice as large as that of the Centered Difference Formula.
7.2 The Effect of Round-off Errors
In numerical differentiation we take differences of values which could be very close to each other for small h. As we know, this leads to loss of accuracy because of the finite precision floating point arithmetic. Consider for example the centered difference formula. For simplicity, let us suppose that h has an exact floating point representation and that we make no rounding error when doing the division by h. That is, suppose that the only source of round-off error is in the computation of the difference f(x0 + h) − f(x0 − h). Then f(x0 + h) and f(x0 − h) are replaced by f(x0 + h)(1 + δ+) and f(x0 − h)(1 + δ−), respectively, with |δ+| ≤ ε and |δ−| ≤ ε. Then
(f(x0 + h)(1 + δ+) − f(x0 − h)(1 + δ−))/(2h) = (f(x0 + h) − f(x0 − h))/(2h) + rh,

where

rh = (f(x0 + h)δ+ − f(x0 − h)δ−)/(2h).

Clearly, |rh| ≤ (|f(x0 + h)| + |f(x0 − h)|) ε/(2h) ≈ |f(x0)| ε/h. The approximation error or truncation error for the centered finite difference approximation is −(1/6) f′′′(ξ) h². Thus, the total error E(h) can be approximately bounded by (1/6) h² M3 + |f(x0)| ε/h, where M3 bounds |f′′′| near x0. The minimum error occurs at h0 such that E′(h0) = 0, i.e.

(7.14)  h0 = ( 3ε|f(x0)|/M3 )^{1/3} ≈ c ε^{1/3},

and E(h0) = O(ε^{2/3}). We do not get machine precision!
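The trade-off in (7.14) can be seen numerically. A minimal sketch (not from the text), for f(x) = sin x at x0 = 1: the error of the centered difference first decreases like h² and then grows like ε/h, with a minimum near h ≈ (3ε)^{1/3} ≈ 10⁻⁵.

import math

eps = 2.0**-52
f, x0, exact = math.sin, 1.0, math.cos(1.0)

for k in range(1, 13):
    h = 10.0**(-k)
    err = abs((f(x0 + h) - f(x0 - h)) / (2 * h) - exact)
    print(f"h = 1e-{k:02d}   error = {err:.3e}")

# The error is smallest near h ~ (3*eps)**(1/3) ~ 1e-5 and grows for smaller h
# because of cancellation, as predicted by (7.14).
print("predicted optimal h ~", (3 * eps) ** (1.0 / 3.0))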
Higher order finite differences exacerbate the problem of digit cancellation. When f can be extended to an analytic function in the complex plane, the Cauchy Integral Theorem can be used to evaluate the derivative:

(7.15)  f′(z0) = (1/(2πi)) ∮_C f(z)/(z − z0)² dz,

where C is a simple closed contour around z0 and f is analytic on and inside C. Parametrizing C as a circle of radius r centered at z0 we get

(7.16)  f′(z0) = (1/(2πr)) ∫_0^{2π} f(z0 + r e^{it}) e^{−it} dt.

The integrand is periodic and smooth so it can be approximated with spectral accuracy by the composite trapezoidal rule.
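A minimal sketch (not from the text) of (7.16) combined with the composite trapezoidal rule; the choices f = exp, z0 = 0, r = 0.5, and N = 32 are arbitrary, for illustration only.

import cmath

def cauchy_derivative(f, z0, r=0.5, N=32):
    # Trapezoidal rule applied to the periodic integrand of (7.16):
    # f'(z0) ~ (1/(N r)) * sum_k f(z0 + r e^{i t_k}) e^{-i t_k},  t_k = 2 pi k / N
    s = 0.0
    for k in range(N):
        t = 2 * cmath.pi * k / N
        s += f(z0 + r * cmath.exp(1j * t)) * cmath.exp(-1j * t)
    return s / (N * r)

print(cauchy_derivative(cmath.exp, 0.0))   # ~ 1.0 = exp'(0), accurate to near machine precision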
Another approach to obtain finite difference formulas to approximate derivatives is through Taylor expansions. For example,

(7.17)  f(x0 + h) = f(x0) + f′(x0)h + (1/2)f′′(x0)h² + (1/3!)f′′′(x0)h³ + (1/4!)f^{(4)}(ξ+)h⁴,

(7.18)  f(x0 − h) = f(x0) − f′(x0)h + (1/2)f′′(x0)h² − (1/3!)f′′′(x0)h³ + (1/4!)f^{(4)}(ξ−)h⁴,

where x0 < ξ+ < x0 + h and x0 − h < ξ− < x0. Then, subtracting (7.18) from (7.17), we have f(x0 + h) − f(x0 − h) = 2f′(x0)h + (2/3!)f′′′(x0)h³ + · · · and therefore

(7.19)  (f(x0 + h) − f(x0 − h))/(2h) = f′(x0) + c2h² + c4h⁴ + · · · .

Similarly, if we add (7.17) and (7.18) we obtain f(x0 + h) + f(x0 − h) = 2f(x0) + f′′(x0)h² + ch⁴ + · · · and consequently

(7.20)  f′′(x0) = (f(x0 + h) − 2f(x0) + f(x0 − h))/h² + c̃h² + · · · .

The finite difference

(7.21)  Dh² f(x0) = (f(x0 + h) − 2f(x0) + f(x0 − h))/h²

is thus a second order approximation to f′′(x0), i.e., f′′(x0) − Dh² f(x0) = O(h²).
7.3 Richardson’s Extrapolation
From (7.19) we know that, asymptotically,

(7.22)  Dh0 f(x0) = f′(x0) + c2h² + c4h⁴ + · · · .

We can apply Richardson extrapolation once to obtain a fourth order approximation. Evaluating (7.22) at h/2 we get

(7.23)  D0_{h/2} f(x0) = f′(x0) + (1/4)c2h² + (1/16)c4h⁴ + · · · ,

and multiplying this equation by 4, subtracting (7.22) from the result, and dividing by 3 we get

(7.24)  Dhext f(x0) := (4 D0_{h/2} f(x0) − Dh0 f(x0))/3 = f′(x0) + c̃4 h⁴ + · · · .

The method Dhext f(x0) has order of convergence 4 for about twice the amount of work of Dh0 f(x0). Round-off errors are still O(ε/h) and the minimum total error occurs when O(h⁴) is O(ε/h), i.e. when h = O(ε^{1/5}). The minimum error is thus O(ε^{4/5}) for Dhext f(x0), about 10^{−14} in double precision with h = O(10^{−3}).
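A minimal sketch (not from the text) of the extrapolated difference (7.24), for f = sin at x0 = 1:

import math

def D0(f, x0, h):
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

def Dext(f, x0, h):
    # Richardson extrapolation (7.24): fourth order in h
    return (4 * D0(f, x0, h / 2) - D0(f, x0, h)) / 3

f, x0, exact = math.sin, 1.0, math.cos(1.0)
for h in [1e-1, 1e-2, 1e-3]:
    print(h, abs(D0(f, x0, h) - exact),    # O(h^2)
             abs(Dext(f, x0, h) - exact))  # O(h^4), until round-off takes over near h ~ 1e-3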
Chapter 8
Numerical Integration
We now revisit the problem of numerical integration that we used as an introductory example to some of the principles of Numerical Analysis. The problem in question is to find accurate and efficient approximations to

∫_a^b f(x) dx.
Numerical formulas to approximate a definite integral are called Quadra- tures and, as we saw in Chapter 1, they can be elementary (simple) or com- posite.
We shall assume, unless otherwise noted, that the integrand is sufficiently smooth.
8.1 Elementary Simpson Quadrature
The elementary Trapezoidal Rule quadrature was derived by replacing the integrand f by its linear interpolating polynomial P1 at a and b, that is,

(8.1)  f(x) = P1(x) + (1/2) f′′(ξ(x))(x − a)(x − b),

for some ξ(x) between a and b, and thus

(8.2)  ∫_a^b f(x)dx = ∫_a^b P1(x)dx + (1/2) ∫_a^b f′′(ξ(x))(x − a)(x − b)dx
                   = (1/2)(b − a)[f(a) + f(b)] − (1/12) f′′(η)(b − a)³.

Thus, the approximation

(8.3)  ∫_a^b f(x)dx ≈ (1/2)(b − a)[f(a) + f(b)]

has an error given by −(1/12) f′′(η)(b − a)³.
We can add an intermediate point, say xm = (a + b)/2, and replace f by its quadratic interpolating polynomial P2 with respect to the nodes a, xm and b. For simplicity let us take [a, b] = [−1, 1]. With the simple change of variables

(8.4)  x = (1/2)(a + b) + (1/2)(b − a)t,   t ∈ [−1, 1],

we can obtain a quadrature formula for a general interval [a, b].
Let P2 be the interpolating polynomial of f at −1, 0, 1. The corresponding divided difference table is:

−1   f(−1)
              f(0) − f(−1)
 0   f(0)                      [f(1) − 2f(0) + f(−1)]/2
              f(1) − f(0)
 1   f(1)

Thus,

(8.5)  P2(x) = f(−1) + [f(0) − f(−1)](x + 1) + ([f(1) − 2f(0) + f(−1)]/2)(x + 1)x.
Now, using the interpolation formula with the remainder expressed in terms of a divided difference (3.56), we have

(8.6)  f(x) = P2(x) + f[−1, 0, 1, x](x + 1)x(x − 1) = P2(x) + f[−1, 0, 1, x]x(x² − 1).

Therefore,

∫_{−1}^{1} f(x)dx = ∫_{−1}^{1} P2(x)dx + ∫_{−1}^{1} f[−1, 0, 1, x]x(x² − 1)dx
               = 2f(−1) + 2[f(0) − f(−1)] + (1/3)[f(1) − 2f(0) + f(−1)] + E[f]
               = (1/3)[f(−1) + 4f(0) + f(1)] + E[f],
where E[f] is the error

(8.7)  E[f] = ∫_{−1}^{1} f[−1, 0, 1, x] x(x² − 1) dx.

Note that x(x² − 1) changes sign in [−1, 1] so we cannot use the Mean Value Theorem for integrals. However, if we add another node, x4, we can relate f[−1, 0, 1, x] to the fourth order divided difference f[−1, 0, 1, x4, x], which will make the integral in (8.7) easier to evaluate:

(8.8)  f[−1, 0, 1, x] = f[−1, 0, 1, x4] + f[−1, 0, 1, x4, x](x − x4).

This identity is just an application of Theorem 8. Using (8.8),

E[f] = f[−1, 0, 1, x4] ∫_{−1}^{1} x(x² − 1)dx + ∫_{−1}^{1} f[−1, 0, 1, x4, x] x(x² − 1)(x − x4)dx.

The first integral is zero, because the integrand is odd. Now we choose x4 symmetrically, x4 = 0, so that x(x² − 1)(x − x4) does not change sign in [−1, 1] and

(8.9)  E[f] = ∫_{−1}^{1} f[−1, 0, 1, 0, x] x²(x² − 1)dx = ∫_{−1}^{1} f[−1, 0, 0, 1, x] x²(x² − 1)dx.
Now, using (3.58), there is ξ(x) ∈ (−1, 1) such that

(8.10)  f[−1, 0, 0, 1, x] = f^{(4)}(ξ(x))/4!,

and assuming f ∈ C⁴[−1, 1], by the Mean Value Theorem for integrals, there is η ∈ (−1, 1) such that

(8.11)  E[f] = (f^{(4)}(η)/4!) ∫_{−1}^{1} x²(x² − 1)dx = −(4/15)(f^{(4)}(η)/4!) = −(1/90) f^{(4)}(η).

Summarizing, Simpson's elementary quadrature for the interval [−1, 1] is

(8.12)  ∫_{−1}^{1} f(x)dx = (1/3)[f(−1) + 4f(0) + f(1)] − (1/90) f^{(4)}(η).
Note that Simpson's elementary quadrature, (1/3)[f(−1) + 4f(0) + f(1)], gives the exact value of the integral when f is a polynomial of degree 3 or less (the error is proportional to the fourth derivative), even though we used a second order polynomial to approximate the integrand. We gain extra precision because of the symmetry of the quadrature around 0. In fact, we could have derived Simpson's quadrature by using the Hermite (third order) interpolating polynomial of f at −1, 0, 0, 1.
To obtain the corresponding formula for a general interval [a, b] we use the change of variables (8.4):

∫_a^b f(x)dx = (1/2)(b − a) ∫_{−1}^{1} F(t)dt,

where

(8.13)  F(t) = f( (1/2)(a + b) + (1/2)(b − a)t ),

and noting that F^{(k)}(t) = ((b − a)/2)^k f^{(k)}(x) we obtain Simpson's elementary rule on the interval [a, b]:

(8.14)  ∫_a^b f(x)dx = (1/6)(b − a)[ f(a) + 4f((a + b)/2) + f(b) ] − (1/90) f^{(4)}(η) ((b − a)/2)⁵.
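A minimal sketch (not from the text) of the elementary rule (8.14); the test integrands and intervals are arbitrary.

import math

def simpson_elementary(f, a, b):
    # (8.14): (b-a)/6 * [f(a) + 4 f((a+b)/2) + f(b)]
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))

# Exact for cubics: the integral of x^3 on [0, 1] is 1/4.
print(simpson_elementary(lambda x: x**3, 0.0, 1.0))     # 0.25 exactly
# For non-polynomial integrands the error is proportional to f'''' as in (8.14).
print(simpson_elementary(math.sin, 0.0, math.pi))       # ~ 2.0944, exact value is 2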
8.2 Interpolatory Quadratures
The elementary Trapezoidal and Simpson rules are examples of interpolatory quadratures. This class of quadratures is obtained by selecting a set of nodes x0, x1, ..., xn in the interval of integration and by approximating the integral by that of the interpolating polynomial Pn of the integrand at these nodes. By construction, such an interpolatory quadrature is exact for polynomials of degree up to n, at least. We just saw that the Simpson rule is exact for polynomials up to degree 3, even though we used P2 in its construction. The "degree gain" was due to the symmetric choice of the interpolation nodes. This leads us to two important questions:
1. For a given n, how do we choose the nodes x0,x1,...,xn so that the corresponding interpolation quadrature is exact for polynomials of the highest degree k possible?
2. What is that k?
Because orthogonal polynomials (Section 5.3) play a central role in the answer to these questions, we will consider the problem of approximating

(8.15)  I[f] = ∫_a^b f(x)w(x)dx,

where w is an admissible weight function (w ≥ 0, ∫_a^b w(x)dx > 0, and ∫_a^b x^k w(x)dx < +∞ for k = 0, 1, ...), w ≡ 1 being a particular case. The interval of integration [a, b] can be either finite or infinite (e.g. [0, +∞), (−∞, +∞)).

Definition 10. We say that a quadrature Q[f] to approximate I[f] has degree of precision k if it is exact, i.e. I[P] = Q[P], for all polynomials P of degree up to k but not exact for polynomials of degree k + 1. Equivalently, a quadrature Q[f] has degree of precision k if I[x^m] = Q[x^m] for m = 0, 1, ..., k but I[x^{k+1}] ≠ Q[x^{k+1}].

Example 23. The Trapezoidal Rule quadrature has degree of precision 1 while the Simpson quadrature has degree of precision 3.

For a given set of nodes x0, x1, ..., xn in [a, b], let Pn be the interpolating polynomial of f at these nodes. In Lagrange form we can write Pn as (see Section 3.1)

(8.16)  Pn(x) = Σ_{j=0}^{n} f(xj) lj(x),

where

(8.17)  lj(x) = Π_{k=0, k≠j}^{n} (x − xk)/(xj − xk),   for j = 0, 1, ..., n,

are the elementary Lagrange polynomials. The corresponding interpolatory quadrature Qn[f] to approximate I[f] is then given by

(8.18)  Qn[f] = Σ_{j=0}^{n} Aj f(xj),   Aj = ∫_a^b lj(x)w(x)dx, for j = 0, 1, ..., n.
Theorem 15. The degree of precision of the interpolatory quadrature (8.18) is less than 2n + 2.

Proof. Suppose the degree of precision k of (8.18) is greater than or equal to 2n + 2. Take f(x) = (x − x0)²(x − x1)² · · · (x − xn)². This is a polynomial of degree exactly 2n + 2. Then

(8.19)  ∫_a^b f(x)w(x)dx = Σ_{j=0}^{n} Aj f(xj) = 0,

and on the other hand

(8.20)  ∫_a^b f(x)w(x)dx = ∫_a^b (x − x0)² · · · (x − xn)² w(x)dx > 0,

which is a contradiction. Therefore k < 2n + 2.
8.3 Gaussian Quadratures
We will now show that there is a choice of nodes x0, x1, ..., xn which yields the optimal degree of precision 2n + 1 for an interpolatory quadrature. The corresponding quadratures are called Gaussian quadratures. To define them we recall that ψk is the k-th orthogonal polynomial with respect to the inner product

(8.21)  ⟨f, g⟩ = ∫_a^b f(x)g(x)w(x)dx,

i.e. ⟨ψk, Q⟩ = 0 for all polynomials Q of degree less than k. Recall also that the zeros of the orthogonal polynomials are real, simple, and contained in [a, b] (see Theorem 14).

Definition 11. Let ψ_{n+1}(x) be the (n + 1)st orthogonal polynomial and let x0, x1, ..., xn be its n + 1 zeros. Then the interpolatory quadrature (8.18) with the nodes so chosen is called a Gaussian quadrature.

Theorem 16. The interpolatory quadrature (8.18) has degree of precision k = 2n + 1 if and only if it is a Gaussian quadrature.
Proof. Let f be a polynomial of degree ≤ 2n + 1. Then, we can write

(8.22)  f(x) = Q(x)ψ_{n+1}(x) + R(x),

where Q and R are polynomials of degree ≤ n. Now

(8.23)  ∫_a^b f(x)w(x)dx = ∫_a^b Q(x)ψ_{n+1}(x)w(x)dx + ∫_a^b R(x)w(x)dx.

The first integral on the right hand side is zero because of orthogonality. For the second integral the quadrature is exact (it is interpolatory and R has degree ≤ n). Therefore

(8.24)  ∫_a^b f(x)w(x)dx = Σ_{j=0}^{n} Aj R(xj).

Moreover, R(xj) = f(xj) − Q(xj)ψ_{n+1}(xj) = f(xj) for all j = 0, 1, ..., n. Thus,

(8.25)  ∫_a^b f(x)w(x)dx = Σ_{j=0}^{n} Aj f(xj).

This proves that the Gaussian quadrature has degree of precision k = 2n + 1. Now suppose that the interpolatory quadrature (8.18) has maximal degree of precision 2n + 1. Take f(x) = P(x)(x − x0)(x − x1) · · · (x − xn), where P is a polynomial of degree ≤ n. Then f is a polynomial of degree ≤ 2n + 1 and

∫_a^b f(x)w(x)dx = ∫_a^b P(x)(x − x0) · · · (x − xn)w(x)dx = Σ_{j=0}^{n} Aj f(xj) = 0.

Therefore, the polynomial (x − x0)(x − x1) · · · (x − xn) of degree n + 1 is orthogonal to all polynomials of degree ≤ n. Thus, it is a multiple of ψ_{n+1}.
Example 24. Consider the interval [−1, 1] and the weight function w ≡ 1. The corresponding orthogonal polynomials are the (monic) Legendre polynomials 1, x, x² − 1/3, x³ − (3/5)x, · · · . Take n = 1. The roots of ψ2 are x0 = −1/√3 and x1 = 1/√3. Therefore, the corresponding Gaussian quadrature is

(8.26)  ∫_{−1}^{1} f(x)dx ≈ A0 f(−1/√3) + A1 f(1/√3),

where

(8.27)  A0 = ∫_{−1}^{1} l0(x)dx,
(8.28)  A1 = ∫_{−1}^{1} l1(x)dx.

We can evaluate these integrals directly or use the method of undetermined coefficients to find A0 and A1. The latter is generally easier and we illustrate it now. Using that the quadrature has to be exact for 1 and x we have

(8.29)  2 = ∫_{−1}^{1} 1 dx = A0 + A1,
(8.30)  0 = ∫_{−1}^{1} x dx = −A0/√3 + A1/√3.

Solving this 2 × 2 linear system we get A0 = A1 = 1. So the Gaussian quadrature for n = 1 in [−1, 1] is

(8.31)  Q1[f] = f(−1/√3) + f(1/√3).

Let us compare this quadrature to the elementary Trapezoidal Rule. Take f(x) = x². The Trapezoidal Rule, T[f], gives

(8.32)  T[x²] = (2/2)[f(−1) + f(1)] = 2,

whereas the Gaussian quadrature Q1[f] yields the exact result:

(8.33)  Q1[x²] = (−1/√3)² + (1/√3)² = 2/3.
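A quick numerical check (not from the text) of (8.31)-(8.33):

import math

def gauss2(f):
    # Two-point Gaussian quadrature (8.31) on [-1, 1], w = 1
    x = 1.0 / math.sqrt(3.0)
    return f(-x) + f(x)

def trapezoid(f):
    # Elementary Trapezoidal Rule on [-1, 1]
    return f(-1.0) + f(1.0)

f = lambda x: x**2
print(trapezoid(f))               # 2.0        (the exact value is 2/3)
print(gauss2(f))                  # 0.666...   exact, since x^2 has degree <= 2n+1 = 3
print(gauss2(lambda x: x**4))     # 2/9 ~ 0.222, not exact (degree 4 > 3); the exact value is 2/5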
Example 25. Let us take again the interval [−1, 1] but now w(x) = 1/√(1 − x²). As we know (see Section 2.4), ψ_{n+1}(x) = T_{n+1}(x), i.e. the (n + 1)st Chebyshev polynomial. Its zeros are xj = cos( (2j + 1)π / (2(n + 1)) ) for j = 0, ..., n. For n = 1 we have

(8.34)  x0 = cos(π/4) = 1/√2,
(8.35)  x1 = cos(3π/4) = −1/√2.

We can use again the method of undetermined coefficients to find A0 and A1:

(8.36)  π = ∫_{−1}^{1} 1/√(1 − x²) dx = A0 + A1,
(8.37)  0 = ∫_{−1}^{1} x/√(1 − x²) dx = −A0/√2 + A1/√2,

which give A0 = A1 = π/2. Thus, the corresponding Gaussian quadrature to approximate ∫_{−1}^{1} f(x)/√(1 − x²) dx is

(8.38)  Q1[f] = (π/2)[ f(−1/√2) + f(1/√2) ].
8.3.1 Convergence of Gaussian Quadratures
Let f ∈ C[a, b] and consider the interpolatory quadrature (8.18). Can we guarantee that the error converges to zero as n → ∞, i.e.,

(8.39)  ∫_a^b f(x)w(x)dx − Σ_{j=0}^{n} Aj f(xj) → 0,  as n → ∞ ?

The answer is no. As we know, convergence of the interpolating polynomial to f depends on the smoothness of f and the distribution of the nodes. However, if the interpolatory quadrature is a Gaussian one the answer is yes. This follows from the following special properties of the quadrature weights A0, A1, ..., An in the Gaussian quadrature.

Theorem 17. For a Gaussian quadrature all the quadrature weights are positive and sum up to ∥w∥1, i.e.,

(a) Aj > 0 for all j = 0, 1, ..., n.

(b) Σ_{j=0}^{n} Aj = ∫_a^b w(x)dx.

Proof. (a) Let Pk(x) = lk²(x) for k = 0, 1, ..., n. These are polynomials of degree exactly equal to 2n and Pk(xj) = δkj. Thus, since the Gaussian quadrature is exact for polynomials of degree up to 2n + 1,

0 < ∫_a^b lk²(x)w(x)dx = Σ_{j=0}^{n} Aj lk²(xj) = Ak,

for k = 0, 1, ..., n.

(b) Take f(x) ≡ 1. Then

(8.40)  ∫_a^b w(x)dx = Σ_{j=0}^{n} Aj,

as the quadrature is exact for polynomials of degree zero.
We can now use these special properties of the Gaussian quadrature to prove its convergence for all f ∈ C[a, b].

Theorem 18. Let

(8.41)  Qn[f] = Σ_{j=0}^{n} Aj f(xj)

be the Gaussian quadrature. Then

(8.42)  En[f] := ∫_a^b f(x)w(x)dx − Qn[f] → 0,  as n → ∞.

Proof. Let P∗_{2n+1}(x) be the best uniform approximation to f (in the max norm, ∥f∥∞ = max_{x∈[a,b]} |f(x)|) by polynomials of degree ≤ 2n + 1. Since the quadrature is exact for P∗_{2n+1},

(8.43)  En[f − P∗_{2n+1}] = En[f] − En[P∗_{2n+1}] = En[f],

and therefore

En[f] = En[f − P∗_{2n+1}] = ∫_a^b [f(x) − P∗_{2n+1}(x)]w(x)dx − Σ_{j=0}^{n} Aj [f(xj) − P∗_{2n+1}(xj)].

Taking the absolute value, using the triangle inequality, and the fact that the quadrature weights are positive, we obtain

|En[f]| ≤ ∫_a^b |f(x) − P∗_{2n+1}(x)| w(x)dx + Σ_{j=0}^{n} Aj |f(xj) − P∗_{2n+1}(xj)|
       ≤ ∥f − P∗_{2n+1}∥∞ ∫_a^b w(x)dx + ∥f − P∗_{2n+1}∥∞ Σ_{j=0}^{n} Aj
       = 2∥w∥1 ∥f − P∗_{2n+1}∥∞.

From the Weierstrass approximation theorem it follows that En[f] → 0 as n → ∞.
Moreover, one can prove (using one of the Jackson Theorems) that if f ∈ Cᵐ[a, b] then

(8.44)  |En[f]| ≤ C(2n)^{−m} ∥f^{(m)}∥∞.

That is, for smooth integrands the rate of convergence is super-algebraic.
8.4 Computing the Gaussian Nodes and Weights
Orthogonal polynomials satisfy a three-term recurrence relation:

(8.45)  ψ_{k+1}(x) = (x − αk)ψk(x) − βk ψ_{k−1}(x),   for k = 0, 1, ..., n,

where β0 is defined as ∫_a^b w(x)dx, ψ0(x) = 1, and ψ_{−1}(x) = 0. Equivalently,

(8.46)  x ψk(x) = βk ψ_{k−1}(x) + αk ψk(x) + ψ_{k+1}(x),   for k = 0, 1, ..., n.

If we use the normalized orthogonal polynomials

(8.47)  ψ̃k(x) = ψk(x)/√⟨ψk, ψk⟩,

and recall that

βk = ⟨ψk, ψk⟩ / ⟨ψ_{k−1}, ψ_{k−1}⟩,

then (8.46) can be written as

(8.48)  x ψ̃k(x) = √βk ψ̃_{k−1}(x) + αk ψ̃k(x) + √β_{k+1} ψ̃_{k+1}(x),   for k = 0, 1, ..., n.

Now, evaluating this at a root xj of ψ_{n+1} we get the eigenvalue problem

(8.49)  xj vj = Jn vj,

where

(8.50)  Jn is the (n + 1) × (n + 1) symmetric tridiagonal Jacobi matrix with diagonal entries α0, α1, ..., αn and off-diagonal entries √β1, √β2, ..., √βn, and vj = [ψ̃0(xj), ψ̃1(xj), ..., ψ̃n(xj)]ᵀ.

That is, the Gaussian nodes xj, j = 0, 1, ..., n, are the eigenvalues of the Jacobi matrix Jn. One can show that the Gaussian weights Aj are given in terms of the first component v_{j,0} of the (normalized) eigenvector vj (vjᵀvj = 1):

(8.51)  Aj = β0 v²_{j,0}.

There are efficient numerical methods (e.g. the QR method) to solve the eigenvalue problem for a symmetric tridiagonal matrix, and this is one of the most popular approaches to compute the Gaussian nodes.
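The following sketch (not from the text) implements this eigenvalue approach for the Legendre case w ≡ 1 on [−1, 1], for which αk = 0 and βk = k²/(4k² − 1) for k ≥ 1, with β0 = ∫ w dx = 2 (these recurrence coefficients are standard facts, assumed here rather than taken from the text). numpy.linalg.eigh is used for the symmetric tridiagonal eigenvalue problem.

import numpy as np

def gauss_legendre(n):
    # Nodes and weights of the (n+1)-point Gaussian quadrature on [-1, 1], w = 1,
    # via the Jacobi matrix (8.50) and the weight formula (8.51).
    k = np.arange(1, n + 1)
    beta = k**2 / (4.0 * k**2 - 1.0)        # beta_k for the monic Legendre recurrence
    J = np.diag(np.sqrt(beta), 1) + np.diag(np.sqrt(beta), -1)   # alpha_k = 0
    nodes, V = np.linalg.eigh(J)            # eigenvalues = Gaussian nodes
    weights = 2.0 * V[0, :]**2              # A_j = beta_0 * v_{j,0}^2, beta_0 = 2
    return nodes, weights

x, w = gauss_legendre(4)                    # 5-point rule, exact up to degree 9
print(np.dot(w, x**8), 2.0 / 9.0)           # integral of x^8 on [-1,1] is 2/9
print(np.dot(w, np.exp(x)), np.exp(1) - np.exp(-1))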
8.5 Clenshaw-Curtis Quadrature
Gaussian quadratures are optimal in terms of the degree of precision and offer superalgebraic convergence for smooth integrands. However, the computation of Gaussian weights and nodes carries a significant cost for large n. There is an ingenious interpolatory quadrature which can be a competitor to the Gaussian quadrature due to its efficiency and fast rate of convergence. This is the Clenshaw-Curtis quadrature.
Suppose f is a smooth function in [−1, 1] and we are interested in an accurate approximation of the integral

∫_{−1}^{1} f(x)dx.
The idea is to use the extrema of the nth Chebyshev polynomial Tn, xj = cos(jπ/n), j = 0, 1, ..., n, as the nodes of the corresponding interpolatory quadrature. The degree of precision is only n (not 2n + 1!). However, as we know, for smooth functions the approximation by polynomial interpolation using the Chebyshev nodes converges very rapidly. Hence, for smooth integrands this particular interpolatory quadrature can be expected to converge fast to the exact value of the integral.
Let Pn be the interpolating polynomial of f at xj = cos(jπ/n), j = 0, 1, ..., n. We can write Pn as

(8.52)  Pn(x) = a0/2 + Σ_{k=1}^{n−1} ak Tk(x) + (an/2) Tn(x).

Under the change of variable x = cos θ, for θ ∈ [0, π], we get

(8.53)  Pn(cos θ) = a0/2 + Σ_{k=1}^{n−1} ak cos kθ + (an/2) cos nθ.
Let Πn(θ) = Pn(cos θ) and F(θ) = f(cos θ). By extending F evenly over [−π, 0] (or over [π, 2π]) and using Theorem 13, we conclude that Πn(θ) interpolates F(θ) = f(cos θ) at the equally spaced points θj = jπ/n, j = 0, 1, ..., n, if and only if

(8.54)  ak = (2/n) Σ''_{j=0}^{n} F(θj) cos kθj,   k = 0, 1, ..., n,

where the double prime indicates that the first and last terms of the sum carry a 1/2 factor. These are the (Type I) Discrete Cosine Transform (DCT) coefficients of F and we can compute them efficiently in O(n log2 n) operations with the FFT.

Now, using the change of variable x = cos θ, we have

(8.55)  ∫_{−1}^{1} f(x)dx = ∫_0^π f(cos θ) sin θ dθ = ∫_0^π F(θ) sin θ dθ,
and approximating F(θ) by its interpolant Πn(θ) = Pn(cos θ), we obtain the corresponding quadrature

(8.56)  ∫_{−1}^{1} f(x)dx ≈ ∫_0^π Πn(θ) sin θ dθ.

Substituting (8.53) we have

(8.57)  ∫_0^π Πn(θ) sin θ dθ = (a0/2) ∫_0^π sin θ dθ + Σ_{k=1}^{n−1} ak ∫_0^π cos kθ sin θ dθ + (an/2) ∫_0^π cos nθ sin θ dθ.

Assuming n even and using cos kθ sin θ = (1/2)[sin(1 + k)θ + sin(1 − k)θ], we obtain the Clenshaw-Curtis quadrature

(8.58)  ∫_{−1}^{1} f(x)dx ≈ a0 + Σ_{k=2, k even}^{n−2} 2ak/(1 − k²) + an/(1 − n²).
For a general interval [a, b] we simply use the change of variables

(8.59)  x = (a + b)/2 + ((b − a)/2) cos θ,

for θ ∈ [0, π], and thus

(8.60)  ∫_a^b f(x)dx = ((b − a)/2) ∫_0^π F(θ) sin θ dθ,

where F(θ) = f( (a + b)/2 + ((b − a)/2) cos θ ), so the formula (8.58) gets an extra factor of (b − a)/2.
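A minimal sketch (not from the text) of (8.54) and (8.58) on [−1, 1] with n even; the coefficients ak are computed here by the plain O(n²) sum rather than the FFT, just to keep the example short.

import numpy as np

def clenshaw_curtis(f, n):
    # n even; the nodes are the Chebyshev extrema x_j = cos(j pi / n).
    theta = np.pi * np.arange(n + 1) / n
    F = f(np.cos(theta))
    F[0] *= 0.5; F[-1] *= 0.5                      # the halved first/last terms of (8.54)
    a = np.array([2.0 / n * np.dot(F, np.cos(k * theta)) for k in range(n + 1)])
    # Quadrature (8.58): a_0 + sum over even k of 2 a_k/(1-k^2) + a_n/(1-n^2)
    s = a[0] + a[n] / (1.0 - n**2)
    for k in range(2, n - 1, 2):
        s += 2.0 * a[k] / (1.0 - k**2)
    return s

print(clenshaw_curtis(lambda x: x**4, 4), 2.0 / 5.0)         # exact: 0.4
print(clenshaw_curtis(np.exp, 16), np.exp(1) - np.exp(-1))   # converges very fast for smooth f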
8.6 Composite Quadratures
We saw in Section 1.2.2 that one strategy to improve the accuracy of a quadrature formula is to divide the interval of integration [a,b] into small subintervals, use the elementary quadrature in each of them, and sum up all the contributions.
For simplicity, let us divide uniformly [a, b] into N subintervals of equal length h = (b − a)/N, [xj,xj+1], where xj = a + jh for j = 0,1,…,N − 1. If we use the elementary Trapezoidal Rule in each subinterval (as done in Section 1.2.2) we arrive at the Composite Trapezoidal Rule:
(8.61)  ∫_a^b f(x)dx = h[ (1/2)f(a) + Σ_{j=1}^{N−1} f(xj) + (1/2)f(b) ] − (1/12)(b − a)h² f′′(η),

where η is some point in (a, b).

To derive a corresponding composite Simpson quadrature we take N even and apply the elementary Simpson quadrature in each of the N/2 intervals [xj, xj+2], j = 0, 2, ..., N − 2. That is,

(8.62)  ∫_a^b f(x)dx = ∫_{x0}^{x2} f(x)dx + ∫_{x2}^{x4} f(x)dx + · · · + ∫_{x_{N−2}}^{xN} f(x)dx,

and since the elementary Simpson quadrature applied to [xj, xj+2] reads

(8.63)  ∫_{xj}^{xj+2} f(x)dx = (h/3)[f(xj) + 4f(xj+1) + f(xj+2)] − (1/90) f^{(4)}(ηj)h⁵,

for some ηj ∈ (xj, xj+2), summing up all the N/2 contributions we get the composite Simpson quadrature:

∫_a^b f(x)dx = (h/3)[ f(a) + 2 Σ_{j=1}^{N/2−1} f(x_{2j}) + 4 Σ_{j=1}^{N/2} f(x_{2j−1}) + f(b) ] − (1/180)(b − a)h⁴ f^{(4)}(η),

for some η ∈ (a, b).
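A minimal sketch (not from the text) of the two composite rules, tested on ∫₀^π sin x dx = 2:

import numpy as np

def composite_trapezoid(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    y = f(x)
    h = (b - a) / N
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

def composite_simpson(f, a, b, N):
    # N must be even
    x = np.linspace(a, b, N + 1)
    y = f(x)
    h = (b - a) / N
    return h / 3.0 * (y[0] + 4.0 * y[1:-1:2].sum() + 2.0 * y[2:-1:2].sum() + y[-1])

for N in [8, 16, 32]:
    print(N, abs(composite_trapezoid(np.sin, 0, np.pi, N) - 2),   # error ~ O(h^2)
             abs(composite_simpson(np.sin, 0, np.pi, N) - 2))     # error ~ O(h^4)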
8.7 Modified Trapezoidal Rule
We are going to consider here a modification to the trapezoidal rule that will yield a quadrature with an error of the same order as Simpson's rule. Moreover, this modified quadrature will give us some insight into the asymptotic form of the trapezoidal rule error.
To simplify the derivation let us consider the interval [0, 1] and let P3 be the polynomial interpolating f(0), f′(0), f(1), f′(1). Newton’s divided differ- ences representation of P3 is
(8.64)  P3(x) = f(0) + f[0, 0]x + f[0, 0, 1]x² + f[0, 0, 1, 1]x²(x − 1),

and thus

(8.65)  ∫_0^1 P3(x)dx = f(0) + (1/2)f′(0) + (1/3)f[0, 0, 1] − (1/12)f[0, 0, 1, 1].

The divided differences are obtained from the tableau:

0   f(0)
            f′(0)
0   f(0)                       f(1) − f(0) − f′(0)
            f(1) − f(0)                                  f′(1) + f′(0) + 2(f(0) − f(1))
1   f(1)                       f′(1) − f(1) + f(0)
            f′(1)
1   f(1)

Thus,

∫_0^1 P3(x)dx = f(0) + (1/2)f′(0) + (1/3)[f(1) − f(0) − f′(0)] − (1/12)[f′(0) + f′(1) + 2(f(0) − f(1))],

and simplifying the right hand side we get

(8.66)  ∫_0^1 P3(x)dx = (1/2)[f(0) + f(1)] + (1/12)[f′(0) − f′(1)],
which is the simple Trapezoidal rule plus a correction involving the derivative of the integrand at the end points.
We can obtain an expression for the error of this quadrature formula by recalling that the Cauchy remainder in the interpolation is

(8.67)  f(x) − P3(x) = (1/4!) f^{(4)}(ξ(x)) x²(x − 1)²,

and since x²(x − 1)² does not change sign in [0, 1] we can use the Mean Value Theorem for integrals to get

(8.68)  E[f] = ∫_0^1 [f(x) − P3(x)]dx = (1/4!) f^{(4)}(η) ∫_0^1 x²(x − 1)²dx = (1/720) f^{(4)}(η),

for some η ∈ (0, 1).
To obtain the quadrature in a general finite interval [a, b] we use the change of variables x = a + (b − a)t, t ∈ [0, 1]:

(8.69)  ∫_a^b f(x)dx = (b − a) ∫_0^1 F(t)dt,

where F(t) = f(a + (b − a)t). Thus,

(8.70)  ∫_a^b f(x)dx = ((b − a)/2)[f(a) + f(b)] + ((b − a)²/12)[f′(a) − f′(b)] + (1/720) f^{(4)}(η)(b − a)⁵,

for some η ∈ (a, b).

We can get a Composite Modified Trapezoidal rule by subdividing [a, b] into N subintervals of equal length h = (b − a)/N, applying the simple rule in each subinterval and adding up all the contributions:

(8.71)  ∫_a^b f(x)dx = h[ (1/2)f(x0) + Σ_{j=1}^{N−1} f(xj) + (1/2)f(xN) ] − (h²/12)[f′(b) − f′(a)] + (1/720)(b − a)h⁴ f^{(4)}(η).
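A minimal sketch (not from the text) of (8.71); the endpoint derivative values are passed in explicitly, and the test integrand is arbitrary.

import numpy as np

def modified_trapezoid(f, fprime, a, b, N):
    # Composite Trapezoidal rule plus the endpoint-derivative correction of (8.71)
    x = np.linspace(a, b, N + 1)
    y = f(x)
    h = (b - a) / N
    T = h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])
    return T - h**2 / 12.0 * (fprime(b) - fprime(a))

# Integral of e^x on [0, 1] is e - 1; the error now decreases like h^4.
for N in [8, 16, 32]:
    print(N, abs(modified_trapezoid(np.exp, np.exp, 0.0, 1.0, N) - (np.e - 1.0)))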
8.8 The Euler-Maclaurin Formula
We are now going to obtain a more general formula for the asymptotic form of the error in the trapezoidal rule quadrature. The idea is to use integration by parts with the aid of suitable polynomials. Let us consider again the interval [0, 1] and define B0(x) = 1, B1(x) = x − 1/2. Then

(8.72)  ∫_0^1 f(x)dx = ∫_0^1 f(x)B0(x)dx = ∫_0^1 f(x)B1′(x)dx
                   = f(x)B1(x)|_0^1 − ∫_0^1 f′(x)B1(x)dx
                   = (1/2)[f(0) + f(1)] − ∫_0^1 f′(x)B1(x)dx.

We can continue the integration by parts using the Bernoulli polynomials, which satisfy

(8.73)  B′_{k+1}(x) = (k + 1)Bk(x),   k = 1, 2, ...

Since we start with B1(x) = x − 1/2, it is clear that Bk(x) is a polynomial of degree exactly k with leading order coefficient 1, i.e. monic. These polynomials are determined by the recurrence relation (8.73) up to a constant. The constant is fixed by requiring that

(8.74)  Bk(0) = Bk(1) = 0,   k = 3, 5, 7, ...

Indeed,

(8.75)  B′′_{k+1}(x) = (k + 1)B′k(x) = (k + 1)k B_{k−1}(x),

and B_{k−1}(x) has the form

(8.76)  B_{k−1}(x) = x^{k−1} + a_{k−2}x^{k−2} + · · · + a1x + a0.

Integrating (8.75) twice we get

(8.77)  B_{k+1}(x) = k(k + 1)[ x^{k+1}/(k(k + 1)) + a_{k−2}x^k/((k − 1)k) + · · · + a0x²/2 ] + bx + c.
For k + 1 odd, the two constants of integration b and c are determined by the condition (8.74). The Bk(x) for k even are then given by Bk(x) = B′_{k+1}(x)/(k + 1).
We are going to need a few properties of the Bernoulli polynomials. By construction, Bk(x) is an even (odd) polynomial in x − 1/2 if k is even (odd). Equivalently, they satisfy the identity

(8.78)  (−1)^k Bk(1 − x) = Bk(x).

This follows because the polynomials Ak(x) = (−1)^k Bk(1 − x) satisfy the same conditions that define the Bernoulli polynomials, i.e. A′_{k+1}(x) = (k + 1)Ak(x) and Ak(0) = Ak(1) = 0 for k = 3, 5, 7, ..., and since A1(x) = B1(x) they are the same. From (8.78) and (8.74) we get that

(8.79)  Bk(0) = Bk(1),   k = 2, 3, ...

We define the Bernoulli numbers as Bk = Bk(0) = Bk(1). This, together with the recurrence relation (8.73), implies that

(8.80)  ∫_0^1 Bk(x)dx = (1/(k + 1)) ∫_0^1 B′_{k+1}(x)dx = (1/(k + 1))[B_{k+1}(1) − B_{k+1}(0)] = 0,

for k = 1, 2, ...

Lemma 3. The polynomials C_{2m}(x) = B_{2m}(x) − B_{2m}, m = 1, 2, ..., do not change sign in [0, 1].

Proof. We will prove it by contradiction. Let us suppose that C_{2m}(x) changes sign. Since C_{2m}(0) = C_{2m}(1) = 0, it then has at least 3 zeros and, by Rolle's theorem, C′_{2m}(x) = B′_{2m}(x) has at least 2 zeros in (0, 1). This implies that B_{2m−1}(x) has 2 zeros in (0, 1). Since B_{2m−1}(0) = B_{2m−1}(1) = 0, again by Rolle's theorem, B′_{2m−1}(x) has 3 zeros in (0, 1), which implies that B_{2m−2}(x) has 3 zeros, etc. Descending in this way, we conclude that B_{2l−1}(x) has 2 zeros in (0, 1) plus the two at the end points, B_{2l−1}(0) = B_{2l−1}(1) = 0, for all l = 1, 2, ..., which is a contradiction (for l = 1, 2 these low degree polynomials cannot have that many zeros).
Here are the first few Bernoulli polynomials:

(8.81)  B0(x) = 1,
(8.82)  B1(x) = x − 1/2,
(8.83)  B2(x) = (x − 1/2)² − 1/12 = x² − x + 1/6,
(8.84)  B3(x) = (x − 1/2)³ − (1/4)(x − 1/2) = x³ − (3/2)x² + (1/2)x,
(8.85)  B4(x) = (x − 1/2)⁴ − (1/2)(x − 1/2)² + 7/(5·48) = x⁴ − 2x³ + x² − 1/30.
Let us retake the idea of integration by parts that we started in (8.72):

(8.86)  −∫_0^1 f′(x)B1(x)dx = −(1/2) ∫_0^1 f′(x)B2′(x)dx
                        = (1/2)B2[f′(0) − f′(1)] + (1/2) ∫_0^1 f′′(x)B2(x)dx,

and

(8.87)  (1/2) ∫_0^1 f′′(x)B2(x)dx = (1/(2·3)) ∫_0^1 f′′(x)B3′(x)dx
            = (1/(2·3)) f′′(x)B3(x)|_0^1 − (1/(2·3)) ∫_0^1 f′′′(x)B3(x)dx
            = −(1/(2·3)) ∫_0^1 f′′′(x)B3(x)dx = −(1/(2·3·4)) ∫_0^1 f′′′(x)B4′(x)dx
            = (B4/4!)[f′′′(0) − f′′′(1)] + (1/4!) ∫_0^1 f^{(4)}(x)B4(x)dx.

Continuing this way we arrive at the Euler-Maclaurin formula for the simple Trapezoidal rule in [0, 1]:
Theorem 19.

(8.88)  ∫_0^1 f(x)dx = (1/2)[f(0) + f(1)] + Σ_{k=1}^{m} (B_{2k}/(2k)!)[f^{(2k−1)}(0) − f^{(2k−1)}(1)] + Rm,

where

(8.89)  Rm = (1/(2m + 2)!) ∫_0^1 f^{(2m+2)}(x)[B_{2m+2}(x) − B_{2m+2}]dx,

and, using (8.80), the Mean Value Theorem for integrals, and Lemma 3,

(8.90)  Rm = (1/(2m + 2)!) f^{(2m+2)}(η) ∫_0^1 [B_{2m+2}(x) − B_{2m+2}]dx = −(B_{2m+2}/(2m + 2)!) f^{(2m+2)}(η),

for some η ∈ (0, 1).

It is now straightforward to obtain the Euler-Maclaurin formula for the composite Trapezoidal rule with equally spaced points:

Theorem 20. (The Euler-Maclaurin Summation Formula) Let m be a positive integer and f ∈ C^{(2m+2)}[a, b], h = (b − a)/N. Then

(8.91)  ∫_a^b f(x)dx = h[ (1/2)f(a) + (1/2)f(b) + Σ_{j=1}^{N−1} f(a + jh) ]
                  + Σ_{k=1}^{m} (B_{2k}/(2k)!) h^{2k}[f^{(2k−1)}(a) − f^{(2k−1)}(b)]
                  − (B_{2m+2}/(2m + 2)!)(b − a)h^{2m+2} f^{(2m+2)}(η),

for some η ∈ (a, b).
Remarks: The error is in even powers of h. The formula gives m corrections to the composite Trapezoidal rule. For a smooth periodic function and if b−a is a multiple of its period, then the error of the composite Trapezoidal rule, with equally spaced points, decreases faster than any power of h as h → 0.
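The last remark is easy to verify numerically. A minimal sketch (not from the text): for the periodic integrand e^{cos x} over one period [0, 2π], whose exact integral is 2πI₀(1) (I₀ the modified Bessel function, available as numpy.i0), the composite Trapezoidal rule with equally spaced points converges faster than any power of h.

import numpy as np

def composite_trapezoid(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    y = f(x)
    h = (b - a) / N
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

exact = 2.0 * np.pi * np.i0(1.0)   # integral of e^{cos x} over one period
for N in [4, 8, 16, 32]:
    err = abs(composite_trapezoid(lambda x: np.exp(np.cos(x)), 0.0, 2.0 * np.pi, N) - exact)
    print(N, err)   # the error decays faster than any power of h, as the remark states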
8.9 Romberg Integration
We are now going to apply successively Richardson's Extrapolation to the Trapezoidal rule. Again, we consider equally spaced nodes, xj = a + jh, j = 0, 1, ..., N, h = (b − a)/N, and assume N is even:

(8.92)  Th[f] = h[ (1/2)f(a) + (1/2)f(b) + Σ_{j=1}^{N−1} f(a + jh) ] := h Σ''_{j=0}^{N} f(a + jh),

where ″ means that the first and last terms have a 1/2 factor.
We know from the Euler-Maclaurin formula that for a smooth integrand

(8.93)  ∫_a^b f(x)dx = Th[f] + c2h² + c4h⁴ + · · · ,

for some constants c2, c4, etc. We can do Richardson extrapolation to obtain a quadrature with a leading order error O(h⁴). If we have computed T_{2h}[f] we can combine it with Th[f] to achieve this by noting that

(8.94)  ∫_a^b f(x)dx = T_{2h}[f] + c2(2h)² + c4(2h)⁴ + · · · ,

so we have

(8.95)  ∫_a^b f(x)dx = (4Th[f] − T_{2h}[f])/3 + c̃4h⁴ + c̃6h⁶ + · · · .

We can continue the Richardson extrapolation process, but we can do this more efficiently if we reuse the work we have done to compute T_{2h}[f] to evaluate Th[f]. To this end, we note that

Th[f] − (1/2)T_{2h}[f] = h Σ''_{j=0}^{N} f(a + jh) − h Σ''_{j=0}^{N/2} f(a + 2jh) = h Σ_{j=1}^{N/2} f(a + (2j − 1)h).
If we let h_l = (b − a)/2^l, then

(8.96)  T_{h_l}[f] = (1/2) T_{h_{l−1}}[f] + h_l Σ_{j=1}^{2^{l−1}} f(a + (2j − 1)h_l).

Beginning with the simple Trapezoidal rule (two points) we can successively double the number of points in the quadrature by using (8.96) and immediately do extrapolation.

Let

(8.97)  R(0, 0) = T_{h_0}[f] = ((b − a)/2)[f(a) + f(b)],

and for l = 1, 2, ..., M define

(8.98)  R(l, 0) = (1/2) R(l − 1, 0) + h_l Σ_{j=1}^{2^{l−1}} f(a + (2j − 1)h_l).
From R(0, 0) and R(1, 0) we can extrapolate to obtain

(8.99)  R(1, 1) = R(1, 0) + (1/(4 − 1))[R(1, 0) − R(0, 0)].

We can generate a tableau of approximations like the following, for M = 4:

R(0, 0)
R(1, 0)  R(1, 1)
R(2, 0)  R(2, 1)  R(2, 2)
R(3, 0)  R(3, 1)  R(3, 2)  R(3, 3)
R(4, 0)  R(4, 1)  R(4, 2)  R(4, 3)  R(4, 4)

Each of the R(l, m) is obtained by extrapolation,

(8.100)  R(l, m) = R(l, m − 1) + (1/(4^m − 1))[R(l, m − 1) − R(l − 1, m − 1)],

and R(4, 4) would be the most accurate approximation (neglecting round-off errors). This is the Romberg algorithm and can be written as:

h = b − a;
R(0, 0) = (1/2)(b − a)[f(a) + f(b)];
for l = 1 : M
    h = h/2;
    R(l, 0) = (1/2) R(l − 1, 0) + h * Σ_{j=1}^{2^{l−1}} f(a + (2j − 1)h);
    for m = 1 : l
        R(l, m) = R(l, m − 1) + [R(l, m − 1) − R(l − 1, m − 1)]/(4^m − 1);
    end
end
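A direct Python transcription of the algorithm above (a sketch, not from the text), tested on ∫₀¹ eˣ dx = e − 1:

import numpy as np

def romberg(f, a, b, M):
    R = np.zeros((M + 1, M + 1))
    h = b - a
    R[0, 0] = 0.5 * (b - a) * (f(a) + f(b))
    for l in range(1, M + 1):
        h = h / 2.0
        # (8.98): reuse R(l-1, 0) and add the new midpoints
        R[l, 0] = 0.5 * R[l - 1, 0] + h * sum(f(a + (2 * j - 1) * h) for j in range(1, 2**(l - 1) + 1))
        for m in range(1, l + 1):
            # (8.100): Richardson extrapolation
            R[l, m] = R[l, m - 1] + (R[l, m - 1] - R[l - 1, m - 1]) / (4**m - 1)
    return R

R = romberg(np.exp, 0.0, 1.0, 4)
print(R[4, 4], np.e - 1.0)   # R(4,4) is the most accurate entry, using only 17 function evaluations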
Chapter 9

Linear Algebra
9.1 The Three Main Problems
There are three main problems in Numerical Linear Algebra:
1. Solving large linear systems of equations.
2. Finding eigenvalues and eigenvectors.
3. Computing the Singular Value Decomposition (SVD) of a large matrix.
The first problem appears in a wide variety of applications and is an indispensable tool in Scientific Computing.
Given a nonsingular n×n matrix A and a vector b ∈ Rn, where n could be on the order of millions or billions, we would like to find the unique solution x, satisfying
(9.1) Ax = b
or an accurate approximation x ̃ to x. Henceforth we will assume, unless otherwise stated, that the matrix A is real.
We will study Direct Methods (for example Gaussian Elimination), which compute the solution (up to roundoff errors) in a finite number of steps and Iterative Methods, which starting from an initial approximation of the solution x(0) produce subsequent approximations x(1),x(2),… from a given recipe
(9.2)  x^{(k+1)} = G(x^{(k)}, A, b),   k = 0, 1, ...
where G is a continuous function of the first variable. Consequently, if the iterations converge to the solution x of the linear system Ax = b, i.e. x^{(k)} → x as k → ∞, then
(9.3) x = G(x, A, b).
That is, x is a fixed point of G.
One of the main strategies in the design of efficient numerical methods
for linear systems is to transform the problem to one which is much easier to solve. Both direct and iterative methods use this strategy.
The eigenvalue problem for an n × n matrix A consists of finding each or some of the scalars (the eigenvalues) λ and the corresponding eigenvectors v ̸= 0 such that
(9.4) Av = λv.
Equivalently, (A − λI)v = 0 and so the eigenvalues are the roots of the
characteristic polynomial of A
(9.5) p(λ) = det(A − λI).
Clearly, we cannot solve this problem with a finite number of elementary operations (for n ≥ 5 it would be a contradiction to Abel's theorem) so iterative methods have to be employed. Also, λ and v could be complex even if A is real. The maximum of the absolute values of the eigenvalues of a matrix is a useful concept in numerical linear algebra.
Definition 12. Let A be an n × n matrix. The spectral radius ρ of A is defined as
(9.6) ρ(A) = max{|λ1|, . . . , |λn|},
where λi, i = 1, . . . , n are the eigenvalues (not necessarily distinct) of A.
Large eigenvalue (or more appropriately eigenvector) problems arise in the study of the steady state behavior of time-discrete Markov processes which are often used in a wide range of applications, such as finance, popu- lation dynamics, and data mining. The original Google’s PageRank search algorithm is a prominent example of the latter. The problem is to find an eigenvector v associated with the eigenvalue 1, i.e. v = Av. Such v is a
probability vector so all its entries are positive, add up to 1, and represent the probabilities of the system described by the Markov process to be in a given state in the limit as time goes to infinity. This eigenvector v is in effect a fixed point of the linear transformation represented by the Markov matrix A.
The third problem is related to the second one and finds applications in image compression, model reduction techniques, data analysis, and many other fields. Given an m × n matrix A, the idea is to consider the eigenvalues and eigenvectors of the square, n × n matrix AT A (or A∗A, where A∗ is the conjugate transpose of A as defined below, if A is complex). As we will see, the eigenvalues are all real and nonnegative and AT A has a complete set of orthogonal eigenvectors. The singular values of a matrix A are the positive square roots of the eigenvalues of AT A. Using this, it follows that any real m × n matrix A has the singular value decomposition (SVD)
(9.7) UTAV =Σ,
where U is an orthogonal m × m matrix, V is an orthogonal n × n matrix,
and Σ is a "diagonal" matrix of the form

(9.8)  Σ = [ D  0
             0  0 ],   D = diag(σ1, σ2, ..., σr),

where σ1 ≥ σ2 ≥ · · · ≥ σr > 0 are the nonzero singular values of A.
9.2 Notation
A matrix A with elements aij will be denoted A = (aij); this could be a square n × n matrix or an m × n matrix. Aᵀ denotes the transpose of A, i.e. Aᵀ = (aji).
A vector x ∈ Rⁿ will be represented as the n-tuple

(9.9)  x = (x1, x2, ..., xn)ᵀ.
The canonical vectors, corresponding to the standard basis in Rⁿ, will be denoted by e1, e2, ..., en, where ek is the n-vector with all entries equal to zero except the k-th one, which is equal to one.
The inner product of two real vectors x and y in Rⁿ is

(9.10)  ⟨x, y⟩ = Σ_{i=1}^{n} xi yi = xᵀy.
If the vectors are complex, i.e. x and y in Cⁿ, we define their inner product as

(9.11)  ⟨x, y⟩ = Σ_{i=1}^{n} x̄i yi,

where x̄i denotes the complex conjugate of xi.
With the inner product (9.10) in the real case or (9.11) in the complex case, we can define the Euclidean norm

(9.12)  ∥x∥2 = √⟨x, x⟩.

Note that if A is an n × n real matrix and x, y ∈ Rⁿ then

(9.13)  ⟨x, Ay⟩ = Σ_{i=1}^{n} xi Σ_{k=1}^{n} aik yk = Σ_{i=1}^{n} Σ_{k=1}^{n} aik xi yk = Σ_{k=1}^{n} ( Σ_{i=1}^{n} aᵀ_{ki} xi ) yk,

that is,

(9.14)  ⟨x, Ay⟩ = ⟨Aᵀx, y⟩.

Similarly, in the complex case we have
Similarly in the complex case we have
(9.15) ⟨x, Ay⟩ = ⟨A∗x, y⟩,
where A∗ is the conjugate transpose of A, i.e. A∗ = (aji).
9.3 Some Important Types of Matrices
One useful type of linear transformations consists of those that preserve the Euclidean norm. That is, if y = Ax, then ∥y∥2 = ∥x∥2 but this implies
(9.16)  ⟨Ax, Ax⟩ = ⟨AᵀAx, x⟩ = ⟨x, x⟩,

and consequently AᵀA = I.
Definition 13. An n × n real (complex) matrix A is called orthogonal (unitary) if AᵀA = I (A∗A = I).
Two of the most important types of matrices in applications are symmetric (Hermitian) and positive definite matrices.
Definition 14. An n × n real matrix A is called symmetric if AT = A. If
the matrix A is complex it is called Hermitian if A∗ = A.
Symmetric (Hermitian) matrices have real eigenvalues, for if v is an eigenvector associated to an eigenvalue λ of A, we can assume it has been normalized so that ⟨v, v⟩ = 1, and
(9.17) ⟨v,Av⟩ = ⟨v,λv⟩ = λ⟨v,v⟩ = λ.
But if Aᵀ = A then

(9.18)  λ = ⟨v, Av⟩ = ⟨Av, v⟩ = ⟨λv, v⟩ = λ̄⟨v, v⟩ = λ̄,

and λ = λ̄ if and only if λ ∈ R.
Definition 15. An n×n matrix A is called positive definite if it is symmetric (Hermitian) and ⟨x,Ax⟩ > 0 for all x ∈ Rn,x ̸= 0.
By the preceding argument the eigenvalues of a positive definite matrix A are real because AT = A. Moreover, if Av = λv with ∥v∥2 = 1 then 0 < ⟨v,Av⟩ = λ. Therefore, positive definite matrices have real, positive eigenvalues. Conversely, if all the eigenvalues of a symmetric matrix A are positive then A is positive definite. This follows from the fact that symmetric matrices are diagonalizable by an orthogonal matrix S, i.e. A = SDST, where D is a diagonal matrix with the eigenvalues λ1, . . . , λn (not necessarily distinct) of A. Then
(9.19)  ⟨x, Ax⟩ = Σ_{i=1}^{n} λi yi²,
where y = ST x. Thus a symmetric (Hermitian) matrix A is positive definite if and only if all its eigenvalues are positive. Moreover, since the determinant is the product of the eigenvalues, positive definite matrices have a positive determinant.
We now review another useful consequence of positive definiteness.
Definition 16. Let A = (aij) be an n × n matrix. Its leading principal submatrices are the square matrices

(9.20)  Ak = [ a11  · · ·  a1k
                ⋮           ⋮
               ak1  · · ·  akk ],   k = 1, ..., n.
Theorem 21. All the leading principal submatrices of a positive definite matrix are positive definite.
Proof. Suppose A is an n × n positive definite matrix. Then all its leading principal submatrices are symmetric (Hermitian). Moreover, if we take a vector x ∈ Rⁿ of the form x = [y1, ..., yk, 0, ..., 0]ᵀ, where y = [y1, ..., yk]ᵀ ∈ Rᵏ is an arbitrary nonzero vector, then

0 < ⟨x, Ax⟩ = ⟨y, Ak y⟩,

which shows that Ak for k = 1, ..., n is positive definite.
The converse of Theorem 21 is also true but the proof is much more technical: A is positive definite if and only if det(Ak) > 0 for k = 1,…,n.
Note also that if A is positive definite then all its diagonal elements are positive because 0 < ⟨ej,Aej⟩ = ajj, for j = 1,...,n.
9.4 Schur Theorem
Theorem 22. (Schur) Let A be an n × n matrix. Then there exists a unitary matrix T (T∗T = I) such that

(9.22)  T∗AT = [ λ1  b12  b13  · · ·  b1n
                      λ2   b23  · · ·  b2n
                             ⋱          ⋮
                                   b_{n−1,n}
                                       λn ],

where λ1, ..., λn are the eigenvalues of A and all the elements below the diagonal are zero.
Proof. We will do a proof by induction. Let A be a 2 × 2 matrix with eigenvalues λ1 and λ2. Let u be a normalized eigenvector (u∗u = 1) corresponding to λ1. Then we can take T as the matrix whose first column is u and whose second column is a unit vector v orthogonal to u (u∗v = 0). We have

(9.23)  T∗AT = [ u∗ ; v∗ ] A [ u  v ] = [ λ1  u∗Av
                                           0  v∗Av ].

The scalar v∗Av has to be equal to λ2, as similar matrices have the same eigenvalues. We now assume the result is true for all k × k (k ≥ 2) matrices and will show that it is also true for all (k + 1) × (k + 1) matrices. Let A be a (k + 1) × (k + 1) matrix and let u1 be a normalized eigenvector associated with eigenvalue λ1. Choose k unit vectors t1, ..., tk so that the matrix T1 = [u1 t1 · · · tk] is unitary. Then,

(9.24)  T1∗AT1 = [ λ1  c
                    0  Ak ],

where c is a row vector with k entries and Ak is a k × k matrix. Now, the eigenvalues of the matrix on the right hand side of (9.24) are the roots of (λ1 − λ) det(Ak − λI) and, since this matrix is similar to A, it follows that the eigenvalues of Ak are the remaining eigenvalues of A, λ2, ..., λ_{k+1}. By the induction hypothesis there is a unitary matrix Tk such that Tk∗AkTk is upper triangular with the eigenvalues λ2, ..., λ_{k+1} sitting on the diagonal. We can now use Tk to construct the (k + 1) × (k + 1) unitary matrix

(9.25)  T_{k+1} = [ 1  0
                    0  Tk ],

and define T = T1T_{k+1}. Then

(9.26)  T∗AT = T∗_{k+1} T1∗ A T1 T_{k+1} = T∗_{k+1} (T1∗AT1) T_{k+1},

and using (9.24) and (9.25) we get

T∗AT = [ 1  0      ] [ λ1  c  ] [ 1  0  ]   =   [ λ1   cTk
         0  Tk∗    ] [  0  Ak ] [ 0  Tk ]         0  Tk∗AkTk ],

which is upper triangular with λ1, λ2, ..., λ_{k+1} on the diagonal.
9.5 Norms
A norm on a vector space V (for example Rⁿ or Cⁿ) over K = R (or C) is a mapping ∥ · ∥ : V → [0, ∞) which satisfies the following properties:

(i) ∥x∥ ≥ 0 for all x ∈ V, and ∥x∥ = 0 iff x = 0.

(ii) ∥x + y∥ ≤ ∥x∥ + ∥y∥ for all x, y ∈ V.

(iii) ∥λx∥ = |λ| ∥x∥ for all x ∈ V, λ ∈ K.

Example 26.
(9.27)  ∥x∥1 = |x1| + · · · + |xn|,
(9.28)  ∥x∥2 = √⟨x, x⟩ = √(|x1|² + · · · + |xn|²),
(9.29)  ∥x∥∞ = max{|x1|, ..., |xn|}.
Lemma 4. Let ∥·∥ be a norm on a vector space V then
(9.30) | ∥x∥−∥y∥ |≤ ∥x−y∥.
This lemma implies that a norm is a continuous function (on V to R).
Proof. ∥x∥ = ∥x − y + y∥ ≤ ∥x − y∥ + ∥y∥, which gives that
By reversing the roles of x and y we also get
(9.32) ∥y∥ − ∥x∥ ≤ ∥x − y∥.
We will also need norms defined on matrices. Let A be an n × n matrix. We can view A as a vector in R^{n×n} and define its corresponding Euclidean norm

(9.33)  ∥A∥ = √( Σ_{i=1}^{n} Σ_{j=1}^{n} |aij|² ).

This is called the Frobenius norm for matrices. A different matrix norm can be obtained by using a given vector norm and matrix-vector multiplication. Given a vector norm ∥ · ∥ in Rⁿ (or in Cⁿ), it is easy to show that

(9.34)  ∥A∥ = max_{x≠0} ∥Ax∥/∥x∥

satisfies the properties (i), (ii), (iii) of a norm for all n × n matrices A. That is, the vector norm induces a matrix norm.
Definition 17. The matrix norm defined by (9.34) is called the subordinate or natural norm induced by the vector norm ∥ · ∥.

Example 27.
(9.35)  ∥A∥1 = max_{x≠0} ∥Ax∥1/∥x∥1,
(9.36)  ∥A∥∞ = max_{x≠0} ∥Ax∥∞/∥x∥∞,
(9.37)  ∥A∥2 = max_{x≠0} ∥Ax∥2/∥x∥2.
Theorem 23. Let ∥ · ∥ be an induced matrix norm. Then

(a) ∥Ax∥ ≤ ∥A∥ ∥x∥,

(b) ∥AB∥ ≤ ∥A∥ ∥B∥.

Proof. (a) If x = 0 the result holds trivially. Take x ≠ 0; then the definition (9.34) implies

(9.38)  ∥Ax∥/∥x∥ ≤ ∥A∥,

that is, ∥Ax∥ ≤ ∥A∥ ∥x∥.

(b) Take x ≠ 0. By (a), ∥ABx∥ ≤ ∥A∥ ∥Bx∥ ≤ ∥A∥ ∥B∥ ∥x∥ and thus

(9.39)  ∥ABx∥/∥x∥ ≤ ∥A∥ ∥B∥.

Taking the max over x ≠ 0 we get that ∥AB∥ ≤ ∥A∥ ∥B∥.
The following theorem offers a more concrete way to compute the matrix norms (9.35)-(9.37).

Theorem 24. Let A = (aij) be an n × n matrix. Then

(a) ∥A∥1 = max_j Σ_{i=1}^{n} |aij|,

(b) ∥A∥∞ = max_i Σ_{j=1}^{n} |aij|,

(c) ∥A∥2 = √ρ(AᵀA),

where ρ(AᵀA) is the spectral radius of AᵀA, as defined in (9.6).
Proof. (a)

∥Ax∥1 = Σ_{i=1}^{n} | Σ_{j=1}^{n} aij xj | ≤ Σ_{j=1}^{n} |xj| Σ_{i=1}^{n} |aij| ≤ ( max_j Σ_{i=1}^{n} |aij| ) ∥x∥1.

Thus, ∥A∥1 ≤ max_j Σ_{i=1}^{n} |aij|. We just need to show there is a vector x for which the equality holds. Let j∗ be the index such that

(9.40)  Σ_{i=1}^{n} |aij∗| = max_j Σ_{i=1}^{n} |aij|,

and take x to be given by xi = 0 for i ≠ j∗ and xj∗ = 1. Then ∥x∥1 = 1 and

(9.41)  ∥Ax∥1 = Σ_{i=1}^{n} | Σ_{j=1}^{n} aij xj | = Σ_{i=1}^{n} |aij∗| = max_j Σ_{i=1}^{n} |aij|.

(b) Analogously to (a) we have

(9.42)  ∥Ax∥∞ = max_i | Σ_{j=1}^{n} aij xj | ≤ ( max_i Σ_{j=1}^{n} |aij| ) ∥x∥∞.

Let i∗ be the index such that

(9.43)  Σ_{j=1}^{n} |ai∗j| = max_i Σ_{j=1}^{n} |aij|,

and take x given by

(9.44)  xj = ai∗j/|ai∗j| if ai∗j ≠ 0,   xj = 1 if ai∗j = 0.

Then |xj| = 1 for all j and ∥x∥∞ = 1. Hence

(9.45)  ∥Ax∥∞ = max_i | Σ_{j=1}^{n} aij xj | = Σ_{j=1}^{n} |ai∗j| = max_i Σ_{j=1}^{n} |aij|.

(c) By definition

(9.46)  ∥A∥2² = max_{x≠0} ∥Ax∥2²/∥x∥2² = max_{x≠0} (xᵀAᵀAx)/(xᵀx).

Note that the matrix AᵀA is symmetric and all its eigenvalues are nonnegative. Let us label them in increasing order, 0 ≤ λ1 ≤ λ2 ≤ · · · ≤ λn. Then λn = ρ(AᵀA). Now, since AᵀA is symmetric, there is an orthogonal matrix Q such that QᵀAᵀAQ = D = diag(λ1, ..., λn). Therefore, changing variables, x = Qy, we have

(9.47)  (xᵀAᵀAx)/(xᵀx) = (yᵀDy)/(yᵀy) = (λ1y1² + · · · + λnyn²)/(y1² + · · · + yn²) ≤ λn.

Now take the vector y such that yj = 0 for j ≠ n and yn = 1, and the equality holds. Thus,

(9.48)  ∥A∥2 = max_{x≠0} ∥Ax∥2/∥x∥2 = √λn = √ρ(AᵀA).

Note that if Aᵀ = A then

(9.49)  ∥A∥2 = √ρ(AᵀA) = √ρ(A²) = ρ(A).

Let λ be an eigenvalue of the matrix A with eigenvector x, normalized so that ∥x∥ = 1. Then,

(9.50)  |λ| = |λ|∥x∥ = ∥λx∥ = ∥Ax∥ ≤ ∥A∥∥x∥ = ∥A∥,

for any matrix norm with the property ∥Ax∥ ≤ ∥A∥∥x∥. Thus,

(9.51)  ρ(A) ≤ ∥A∥

for any induced norm. However, given an n × n matrix A and ε > 0 there is at least one induced matrix norm such that ∥A∥ is within ε of the spectral radius of A.
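A quick numerical check of Theorem 24 (a sketch, not from the text); the test matrix is arbitrary, and numpy.linalg.norm and numpy.linalg.eigvals are used for verification.

import numpy as np

A = np.array([[1.0, 2.0, -1.0],
              [3.0, 0.0,  4.0],
              [-2.0, 5.0, 1.0]])

# (a) the 1-norm equals the maximum absolute column sum
print(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max())
# (b) the infinity-norm equals the maximum absolute row sum
print(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max())
# (c) the 2-norm equals the square root of the spectral radius of A^T A
print(np.linalg.norm(A, 2), np.sqrt(np.max(np.abs(np.linalg.eigvals(A.T @ A)))))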
Theorem 25. Let A be an n × n matrix. Given ε > 0 there is at least one induced matrix norm ∥ · ∥ such that

(9.52)  ρ(A) ≤ ∥A∥ ≤ ρ(A) + ε.
Proof. By Schur's Theorem, there is a unitary matrix T such that

(9.53)  T∗AT = U,

where U is upper triangular with the eigenvalues λj, j = 1, ..., n, of A on its diagonal and entries bij above the diagonal. Take 0 < δ < 1 and define the diagonal matrix Dδ = diag(δ, δ², ..., δⁿ). Then

(9.54)  Dδ⁻¹UDδ

is again upper triangular, with the same diagonal λ1, ..., λn and with (i, j) entry δ^{j−i} bij for j > i. Given ε > 0, we can find δ sufficiently small so that Dδ⁻¹UDδ is "within ε" of a diagonal matrix, in the sense that the sum of the absolute values of the off-diagonal entries is less than ε for each row:

(9.55)  Σ_{j=i+1}^{n} |δ^{j−i} bij| ≤ ε,   for i = 1, ..., n.

Now,

(9.56)  Dδ⁻¹UDδ = Dδ⁻¹T∗ATDδ = (TDδ)⁻¹A(TDδ).

Given a nonsingular matrix S and a matrix norm ∥ · ∥,

(9.57)  ∥A∥′ = ∥S⁻¹AS∥

is also a norm. Taking S = TDδ and using the infinity norm we get

∥A∥′ = ∥(TDδ)⁻¹A(TDδ)∥∞ ≤ ∥diag(λ1, ..., λn)∥∞ + ε ≤ ρ(A) + ε.

9.6 Condition Number of a Matrix
Consider the 5 × 5 Hilbert matrix

(9.58)  H5 = [ 1    1/2  1/3  1/4  1/5
               1/2  1/3  1/4  1/5  1/6
               1/3  1/4  1/5  1/6  1/7
               1/4  1/5  1/6  1/7  1/8
               1/5  1/6  1/7  1/8  1/9 ]

and the linear system H5x = b, where

(9.59)  b = [137/60, 87/60, 153/140, 743/840, 1879/2520]ᵀ.
The exact solution of this linear system is x = [1, 1, 1, 1, 1]ᵀ. Note that b ≈ [2.28, 1.45, 1.09, 0.88, 0.74]ᵀ. Let us perturb b slightly (about 1%):

(9.60)  b + δb = [2.28, 1.46, 1.10, 0.89, 0.75]ᵀ.

The solution of the perturbed system (up to rounding at 12 digits of accuracy) is

(9.61)  x + δx = [0.5, 7.2, −21.0, 30.8, −12.6]ᵀ.
A relative perturbation of ∥δb∥2/∥b∥2 = 0.0046 in the data produces a change in the solution equal to ∥δx∥2 ≈ 40. The perturbation gets amplified nearly four orders of magnitude!
This high sensitivity of the solution to small perturbations is inherent to the matrix of the linear system, H5 in this example.
Consider the linear system Ax = b and the perturbed one A(x + δx) = b + δb. Then Ax + Aδx = b + δb implies δx = A⁻¹δb and so

(9.62)  ∥δx∥ ≤ ∥A⁻¹∥ ∥δb∥

for any induced norm. But also ∥b∥ = ∥Ax∥ ≤ ∥A∥ ∥x∥, or

(9.63)  1/∥x∥ ≤ ∥A∥/∥b∥.

Combining (9.62) and (9.63) we obtain

(9.64)  ∥δx∥/∥x∥ ≤ ∥A∥ ∥A⁻¹∥ ∥δb∥/∥b∥.

The right hand side of this inequality is actually a least upper bound; there are b and δb for which the equality holds.
Definition 18. Given a matrix norm ∥ · ∥, the condition number of a matrix A, denoted by κ(A), is defined by

(9.65)  κ(A) = ∥A∥ ∥A⁻¹∥.
Example 28. The condition number of the 5 × 5 Hilbert matrix H5, (9.58), in the 2 norm is approximately 4.7661 × 105. For the particular b and δb we chose we actually got a variation in the solution of O(104) times the relative perturbation but now we know that the amplification factor could be as bad as κ(A).
Similarly, if we perturbed the entries of a matrix A for a linear system Ax = b so that we have (A + δA)(x + δx) = b we get
(9.66)  Ax + Aδx + δA(x + δx) = b,

that is, Aδx = −δA(x + δx), which implies that

(9.67)  ∥δx∥ ≤ ∥A⁻¹∥ ∥δA∥ ∥x + δx∥

for any induced matrix norm, and consequently

(9.68)  ∥δx∥/∥x + δx∥ ≤ ∥A⁻¹∥ ∥A∥ ∥δA∥/∥A∥ = κ(A) ∥δA∥/∥A∥.
Because, for any induced norm, 1 = ∥I∥ = ∥A⁻¹A∥ ≤ ∥A⁻¹∥ ∥A∥, we get that κ(A) ≥ 1. We say that A is ill-conditioned if κ(A) is very large.
Example 29. The Hilbert matrix is ill-conditioned. We already saw that in the 2 norm κ(H5) = 4.7661 × 105. The condition number increases very rapidly as the size of the Hilbert matrix increases, for example κ(H6) = 1.4951 × 107, κ(H10) = 1.6025 × 1013.
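These condition numbers are easy to reproduce (a sketch, not from the text); scipy.linalg.hilbert builds the Hilbert matrix and numpy.linalg.cond computes κ in the 2-norm.

import numpy as np
from scipy.linalg import hilbert

for n in [5, 6, 10]:
    print(n, np.linalg.cond(hilbert(n)))   # ~4.77e5, ~1.50e7, ~1.60e13 (2-norm)

# Reproducing the perturbation example: H5 x = b with exact solution x = [1,1,1,1,1]^T
H5 = hilbert(5)
b_pert = np.array([2.28, 1.46, 1.10, 0.89, 0.75])   # the perturbed right hand side (9.60)
print(np.linalg.solve(H5, b_pert))                   # approximately (9.61)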
9.6.1 What to Do When A is Ill-conditioned?
There are two ways to deal with a linear system with an ill-conditioned matrix A. One approach is to work with extended precision (using as many digits as required to obtain the solution up to a given accuracy). Unfortunately, computations using extended precision can be computationally expensive, several times the cost of regular double precision operations.
A more practical approach is often to replace the ill-conditioned linear system Ax = b by an equivalent linear system with a much smaller condition number. This can be done, for example, by premultiplying by a matrix P⁻¹ so that we have P⁻¹Ax = P⁻¹b. Obviously, taking P = A gives us the smallest possible condition number, but this choice is not practical, so a compromise is made between how well P approximates A and how cheaply linear systems with the matrix P can be solved. This very useful technique, also employed to accelerate the convergence of some iterative methods, is called preconditioning.
Chapter 10
Linear Systems of Equations I
In this chapter we focus on a problem which is central to many applications: find the solution to a large linear system of n linear equations in n unknowns x1,x2,…,xn
(10.1)
a11x1 +a12x2 +…+a1nxn =b1, a21x1 +a22x2 +…+a2nxn =b2,
.
an1x1 +an2x2 +…+annxn =bn,
or written in matrix form
(10.2)  Ax = b,

where A is the n × n matrix of coefficients

(10.3)  A = [ a11  a12  · · ·  a1n
              a21  a22  · · ·  a2n
               ⋮    ⋮    ⋱     ⋮
              an1  an2  · · ·  ann ],
x is a column vector whose components are the unknowns, and b is the given right hand side of the linear system:

(10.4)  x = [x1, x2, ..., xn]ᵀ,   b = [b1, b2, ..., bn]ᵀ.
We will assume, unless stated otherwise, that A is a nonsingular, real matrix. That is, the linear system (10.2) has a unique solution for each b. Equiva- lently, the determinant of A, det(A), is non-zero and A has an inverse.
While mathematically we can write the solution as x = A−1b, this is not computationally efficient. Finding A−1 is several (about four) times more costly than solving Ax = b for a given b.
In many applications n can be on the order of millions or much larger.

10.1 Easy to Solve Systems
When A is diagonal, i.e.

(10.5)  A = diag(a11, a22, ..., ann)

(all the entries outside the diagonal are zero and, since A is assumed nonsingular, aii ≠ 0 for all i), then each equation can be solved with just one division:

(10.6)  xi = bi/aii,   for i = 1, 2, ..., n.

If A is lower triangular and nonsingular,

(10.7)  A = [ a11
              a21  a22
               ⋮    ⋮    ⋱
              an1  an2  · · ·  ann ],
the solution can also be obtained easily by the process of forward substitution:
(10.8)
x1 = b1/a11,
x2 = (b2 − a21x1)/a22,
x3 = (b3 − [a31x1 + a32x2])/a33,
    ⋮
xn = (bn − [an1x1 + an2x2 + · · · + a_{n,n−1}x_{n−1}])/ann,
or in pseudo-code:
Algorithm 1 Forward Substitution
1: for i = 1, ..., n do
2:     xi ← (bi − Σ_{j=1}^{i−1} aij xj)/aii
3: end for
Note that the assumption that A is nonsingular implies that aii ̸= 0 for all i = 1,2,…,n since det(A) = a11a22 ···ann. Also observe that (10.8) shows that xi is a linear combination of bi, bi−1, . . . , b1 and since x = A−1b it follows that A−1 is also lower triangular.
To compute xi we perform i − 1 multiplications, i − 1 additions/subtractions, and one division, so the total amount of computational work W(n) to do forward substitution is

(10.9)  W(n) = Σ_{i=1}^{n} [2(i − 1) + 1] = n²,

where we have used that

(10.10)  Σ_{i=1}^{n} (i − 1) = n(n − 1)/2.

That is, W(n) = O(n²) to solve a lower triangular linear system.
If A is nonsingular and upper triangular
(10.11)  A = [ a11  a12  · · ·  a1n
                0   a22  · · ·  a2n
                ⋮    ⋮    ⋱     ⋮
                0    0   · · ·  ann ],
we solve the linear system Ax = b starting from xn, then we solve for xn−1,
etc. This is called backward substitution:

(10.12)
xn = bn/ann,
x_{n−1} = (b_{n−1} − a_{n−1,n}xn)/a_{n−1,n−1},
x_{n−2} = (b_{n−2} − [a_{n−2,n−1}x_{n−1} + a_{n−2,n}xn])/a_{n−2,n−2},
    ⋮
x1 = (b1 − [a12x2 + a13x3 + · · · + a1nxn])/a11.
From this we deduce that xi is a linear combination of bi, b_{i+1}, ..., bn and so A⁻¹ is an upper triangular matrix. In pseudo-code, we have:

Algorithm 2 Backward Substitution
1: for i = n, n − 1, ..., 1 do
2:     xi ← (bi − Σ_{j=i+1}^{n} aij xj)/aii
3: end for
The operation count is the same as for forward substitution, W(n) =
O(n2).
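A direct Python version of Algorithms 1 and 2 (a sketch, not from the text); the test matrix is arbitrary.

import numpy as np

def forward_substitution(A, b):
    # Solve Ax = b for lower triangular, nonsingular A (Algorithm 1)
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

def backward_substitution(A, b):
    # Solve Ax = b for upper triangular, nonsingular A (Algorithm 2)
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

L = np.array([[2.0, 0.0, 0.0], [1.0, 3.0, 0.0], [4.0, -1.0, 5.0]])
print(forward_substitution(L, L @ np.array([1.0, 2.0, 3.0])))       # recovers [1, 2, 3]
print(backward_substitution(L.T, L.T @ np.array([1.0, 2.0, 3.0])))  # recovers [1, 2, 3]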
10.2 Gaussian Elimination
The central idea of Gaussian elimination is to reduce the linear system Ax = b to an equivalent upper triangular system, which has the same solution and can readily be solved with backward substitution. Such reduction is done with an elimination process employing linear combinations of rows. We illustrate first the method with a concrete example:
(10.13)
x1 +2×2 −x3 +x4 = 0, 2×1 +4×2 −x4 =−3, 3×1 +x2 −x3 +x4 = 3,
x1 −x2 +2×3 +x4 = 3.
To do the elimination we form an augmented matrix Ab by appending one more column to the matrix of coefficients A, consisting of the right hand side b:

(10.14)  Ab = [ 1   2  −1   1 |  0
                2   4   0  −1 | −3
                3   1  −1   1 |  3
                1  −1   2   1 |  3 ].
The first step is to eliminate the first unknown in the second to last equations, i.e. to produce a zero in the first column of Ab for rows 2, 3, and 4:

(10.15)  applying R2 ← R2 − 2R1, R3 ← R3 − 3R1, R4 ← R4 − 1·R1,

[ 1   2  −1   1 |  0             [ 1   2  −1   1 |  0
  2   4   0  −1 | −3    −−−→       0   0   2  −3 | −3
  3   1  −1   1 |  3               0  −5   2  −2 |  3
  1  −1   2   1 |  3 ]             0  −3   3   0 |  3 ],

where R2 ← R2 − 2R1 means that the second row has been replaced by the second row minus two times the first row, etc. Since the coefficient of x1 in the first equation is 1, it is easy to figure out the numbers we need to multiply rows 2, 3, and 4 by to achieve the elimination of the first variable in each row, namely 2, 3, and 1. These numbers are called multipliers. In general, to obtain the multipliers we divide the coefficients of x1 in the rows below the first one by the nonzero coefficient a11 (2/1 = 2, 3/1 = 3, 1/1 = 1). The coefficient we need to divide by to obtain the multipliers is called a pivot (1 in this case).
Note that the (2, 2) element of the last matrix in (10.15) is 0 so we cannot use it as a pivot for the second round of elimination. Instead, we proceed by exchanging the second and the third rows:

(10.16)  applying R2 ↔ R3,

[ 1   2  −1   1 |  0             [ 1   2  −1   1 |  0
  0   0   2  −3 | −3    −−−→       0  −5   2  −2 |  3
  0  −5   2  −2 |  3               0   0   2  −3 | −3
  0  −3   3   0 |  3 ]             0  −3   3   0 |  3 ].

We can now use −5 as a pivot and do the second round of elimination:
0 −5 2 0 −3 3
23
1 2 −1 (10.17) 0 −5 2
1 0 1 2 −1 1 0
0 0 2 0−3 3
−2 3 −−−−−−→ 0 −5 2 −2 − 3 − 3 R ← R − 0 R 0 0 2 − 3
5
3. − 3
332
0 3R4←R4−3R2 0 0 9 6
6 555
156 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I
Clearly, the elimination step R3 ← R3 − 0R2 is unnecessary as the coefficient to be eliminated is already zero but we include it to illustrate the general procedure. The last round of the elimination is
(10.18) \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & 0 & \frac{9}{5} & \frac{6}{5} & \frac{6}{5} \end{bmatrix} \xrightarrow{R_4 \leftarrow R_4 - \frac{9}{10}R_3} \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & 0 & 0 & \frac{39}{10} & \frac{39}{10} \end{bmatrix}.

The last matrix, let us call it U_b, corresponds to the upper triangular system

(10.19) \begin{bmatrix} 1 & 2 & -1 & 1 \\ 0 & -5 & 2 & -2 \\ 0 & 0 & 2 & -3 \\ 0 & 0 & 0 & \frac{39}{10} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ -3 \\ \frac{39}{10} \end{bmatrix},

which we can solve with backward substitution to obtain the solution

(10.20) x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ 0 \\ 1 \end{bmatrix}.
Each of the steps in the Gaussian elimination process is a linear transformation and hence we can represent these transformations with matrices. Note, however, that these matrices are not constructed in practice; we only implement their effect (row exchange or elimination). The first round of elimination (10.15) is equivalent to multiplying (from the left) A_b by the lower triangular matrix
(10.21) E_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -2 & 1 & 0 & 0 \\ -3 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix},

that is,

(10.22) E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & -3 & 3 & 0 & 3 \end{bmatrix}.
The matrix E_1 is formed by taking the 4 x 4 identity matrix and replacing the elements in the first column below the 1 by the negatives of the multipliers, i.e. -2, -3, -1. We can exchange rows 2 and 3 with a permutation matrix
(10.23) P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},

which is obtained by exchanging the second and third rows of the 4 x 4 identity matrix, and we get

(10.24) P E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & -3 & 3 & 0 & 3 \end{bmatrix}.

To construct the matrix associated with the second round of elimination we take the 4 x 4 identity matrix and replace the elements in the second column below the diagonal by the negatives of the multipliers we got with the pivot equal to -5:

(10.25) E_2 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & -\frac{3}{5} & 0 & 1 \end{bmatrix},

and we get

(10.26) E_2 P E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & 0 & \frac{9}{5} & \frac{6}{5} & \frac{6}{5} \end{bmatrix}.

Finally, for the last elimination we have

(10.27) E_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -\frac{9}{10} & 1 \end{bmatrix},

and E_3 E_2 P E_1 A_b = U_b.
Observe that P E_1 A_b = E_1' P A_b, where

(10.28) E_1' = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -3 & 1 & 0 & 0 \\ -2 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix},
i.e., we exchange rows in advance and then reorder the multipliers accordingly. If we focus on the matrix A, the first four columns of A_b, we have the matrix factorization

(10.29) E_3 E_2 E_1' P A = U,

where U is the upper triangular matrix

(10.30) U = \begin{bmatrix} 1 & 2 & -1 & 1 \\ 0 & -5 & 2 & -2 \\ 0 & 0 & 2 & -3 \\ 0 & 0 & 0 & \frac{39}{10} \end{bmatrix}.
Moreover, the product of lower triangular matrices is also a lower triangular matrix, and so is the inverse of a nonsingular lower triangular matrix. Hence, we obtain the so-called LU factorization
(10.31) P A = LU,
where L = (E_3 E_2 E_1')^{-1} = E_1'^{-1} E_2^{-1} E_3^{-1} is a lower triangular matrix. Now recall that the matrices E_1', E_2, E_3 perform the transformation of subtracting the multiplier times the pivot row from the rows below. Therefore, the inverse operation is to add the subtracted row back, i.e. we simply remove the negative sign in front of the multipliers:

E_1'^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}, \quad E_2^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \frac{3}{5} & 0 & 1 \end{bmatrix}, \quad E_3^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \frac{9}{10} & 1 \end{bmatrix}.
It then follows that

(10.32) L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & \frac{3}{5} & \frac{9}{10} & 1 \end{bmatrix}.
Note that L has all the multipliers below the diagonal and U has all the pivots on the diagonal. We will see that a factorization P A = LU is always possible for any nonsingular n × n matrix A and can be very useful.
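The 4 x 4 example can be verified numerically; this is only a sanity check of the matrices P, L, and U written out above (a sketch, not part of the text's development):

import numpy as np

A = np.array([[1., 2., -1., 1.],
              [2., 4., 0., -1.],
              [3., 1., -1., 1.],
              [1., -1., 2., 1.]])
P = np.array([[1., 0., 0., 0.],
              [0., 0., 1., 0.],
              [0., 1., 0., 0.],
              [0., 0., 0., 1.]])
L = np.array([[1., 0., 0., 0.],
              [3., 1., 0., 0.],
              [2., 0., 1., 0.],
              [1., 3/5, 9/10, 1.]])
U = np.array([[1., 2., -1., 1.],
              [0., -5., 2., -2.],
              [0., 0., 2., -3.],
              [0., 0., 0., 39/10]])
print(np.allclose(P @ A, L @ U))   # should print True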
We now consider the general linear system (10.1). The matrix of coefficients and the right hand side are

(10.33) A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix},

respectively. We form the augmented matrix A_b by appending b to A as the last column:

(10.34) A_b = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} & b_1 \\ a_{21} & a_{22} & \cdots & a_{2n} & b_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} & b_n \end{bmatrix}.
In principle, if a_{11} \ne 0 we can start the elimination. However, if |a_{11}| is too small, dividing by it to compute the multipliers might lead to inaccurate results in the computer, i.e. using finite precision arithmetic. It is generally better to look for the coefficient of largest absolute value in the first column, to exchange rows, and then do the elimination. This is called partial pivoting. It is also possible to search for the element of largest absolute value in the whole remaining matrix and to exchange both rows and columns accordingly. This is called complete pivoting and works well provided the matrix is properly scaled. Henceforth, we will consider Gaussian elimination only with partial pivoting, which is less costly to apply.
To perform the first round of Gaussian elimination we do three steps:

1. Find \max_i |a_{i1}|; let us say this maximum is attained at the m-th row, i.e. |a_{m1}| = \max_i |a_{i1}|. If |a_{m1}| = 0, the matrix is singular. Stop.

2. Exchange rows 1 and m.

3. Compute the multipliers and perform the elimination.
After these three steps, we have transformed A_b into

(10.35) A_b^{(1)} = \begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} & b_1^{(1)} \\ 0 & a_{22}^{(1)} & \cdots & a_{2n}^{(1)} & b_2^{(1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & a_{n2}^{(1)} & \cdots & a_{nn}^{(1)} & b_n^{(1)} \end{bmatrix}.

This corresponds to A_b^{(1)} = E_1 P_1 A_b, where P_1 is the permutation matrix that exchanges rows 1 and m (P_1 = I if no exchange is made) and E_1 is the matrix that produces the elimination of the entries below the first element in the first column. The same three steps above can now be applied to the smaller (n-1) x n matrix

(10.36) \tilde{A}_b^{(1)} = \begin{bmatrix} a_{22}^{(1)} & \cdots & a_{2n}^{(1)} & b_2^{(1)} \\ \vdots & \ddots & \vdots & \vdots \\ a_{n2}^{(1)} & \cdots & a_{nn}^{(1)} & b_n^{(1)} \end{bmatrix},

and so on. Doing this process (n-1) times, we obtain the reduced, upper triangular system, which can be solved with backward substitution.
In matrix terms, the linear transformations in the Gaussian elimination process correspond to A_b^{(k)} = E_k P_k A_b^{(k-1)}, for k = 1, 2, ..., n-1 (with A_b^{(0)} = A_b), where the P_k and E_k are permutation and elimination matrices, respectively. P_k = I if no row exchange is made prior to the k-th elimination round (but recall that we do not construct the matrices E_k and P_k in practice). Hence, the Gaussian elimination process for a nonsingular linear system produces the matrix factorization

(10.37) U_b \equiv A_b^{(n-1)} = E_{n-1} P_{n-1} E_{n-2} P_{n-2} \cdots E_1 P_1 A_b.

Arguing as in the introductory example, we can rearrange the rows of A_b in advance with the permutation matrix P = P_{n-1} \cdots P_1 and adjust the multipliers accordingly, as if we knew in advance the row exchanges that would be needed, to get

(10.38) U_b \equiv A_b^{(n-1)} = E_{n-1}' E_{n-2}' \cdots E_1' P A_b.
Since the inverse of E_{n-1}' E_{n-2}' \cdots E_1' is the lower triangular matrix

(10.39) L = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ l_{31} & l_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{bmatrix},

where the l_{ij}, j = 1, ..., n-1, i = j+1, ..., n, are the multipliers (computed after all the rows have been rearranged), we arrive at the anticipated factorization PA = LU. Incidentally, up to sign, Gaussian elimination also produces the determinant of A because

(10.40) \det(PA) = \pm\det(A) = \det(LU) = \det(U) = a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)},

and so \det(A) is plus or minus the product of all the pivots in the elimination process.
In the implementation of Gaussian elimination the array storing the augmented matrix A_b is overwritten to save memory. The pseudo-code with partial pivoting (assuming a_{i,n+1} = b_i, i = 1, ..., n) is presented in Algorithm 3.

Algorithm 3 Gaussian Elimination with Partial Pivoting
1: for j = 1, ..., n-1 do
2:   Find m such that |a_{mj}| = \max_{j \le i \le n} |a_{ij}|
3:   if |a_{mj}| = 0 then    ◃ Matrix is singular
4:     stop
5:   end if
6:   a_{jk} \leftrightarrow a_{mk}, k = j, ..., n+1    ◃ Exchange rows
7:   for i = j+1, ..., n do
8:     m \leftarrow a_{ij}/a_{jj}    ◃ Compute multiplier
9:     a_{ik} \leftarrow a_{ik} - m * a_{jk}, k = j+1, ..., n+1    ◃ Elimination
10:    a_{ij} \leftarrow m    ◃ Store multiplier
11:   end for
12: end for
13: for i = n, n-1, ..., 1 do    ◃ Backward substitution
14:   x_i \leftarrow \Bigl(a_{i,n+1} - \sum_{j=i+1}^{n} a_{ij}x_j\Bigr)/a_{ii}
15: end for

10.2.1 The Cost of Gaussian Elimination

We now do an operation count of Gaussian elimination to solve an n x n linear system Ax = b. We focus on the elimination, as we already know that the work for the backward substitution step is O(n^2). For each round of elimination, j = 1, ..., n-1, we need one division to compute each of the n-j multipliers, and (n-j)(n-j+1) multiplications and (n-j)(n-j+1) additions (subtractions) to perform the eliminations. Thus, the total number of operations is

(10.41) W(n) = \sum_{j=1}^{n-1}\bigl[2(n-j)(n-j+1) + (n-j)\bigr] = \sum_{j=1}^{n-1}\bigl[2(n-j)^2 + 3(n-j)\bigr],

and using (10.10) and

(10.42) \sum_{i=1}^{m} i^2 = \frac{m(m+1)(2m+1)}{6},

we get

(10.43) W(n) = \frac{2}{3}n^3 + O(n^2).

Thus, Gaussian elimination is computationally rather expensive for large systems of equations.
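A compact Python version of Algorithm 3 might look as follows; this is a sketch under the same assumptions (the name gauss_solve is illustrative, and no attempt is made to detect ill-conditioning):

import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting,
    working on a local copy of the augmented matrix, then backward substitution."""
    n = len(b)
    Ab = np.column_stack((A.astype(float), b.astype(float)))   # augmented matrix
    for j in range(n - 1):
        m = j + np.argmax(np.abs(Ab[j:, j]))        # row of the pivot
        if Ab[m, j] == 0.0:
            raise ValueError("matrix is singular")
        Ab[[j, m]] = Ab[[m, j]]                     # exchange rows j and m
        for i in range(j + 1, n):
            mult = Ab[i, j] / Ab[j, j]              # multiplier
            Ab[i, j+1:] -= mult * Ab[j, j+1:]       # elimination (includes the RHS column)
            Ab[i, j] = mult                         # store the multiplier
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                  # backward substitution
        x[i] = (Ab[i, n] - Ab[i, i+1:n] @ x[i+1:]) / Ab[i, i]
    return x

A = np.array([[1., 2., -1., 1.],
              [2., 4., 0., -1.],
              [3., 1., -1., 1.],
              [1., -1., 2., 1.]])
b = np.array([0., -3., 3., 3.])
print(gauss_solve(A, b))    # approximately [ 1. -1.  0.  1.], as in (10.20)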
10.3 LU and Choleski Factorizations
If Gaussian elimination can be performed without row interchanges, then we obtain an LU factorization of A, i.e. A = LU. This factorization can be advantageous when solving many linear systems with the same n × n matrix A but different right hand sides because we can turn the problem Ax = b into two triangular linear systems, which can be solved much more economically in O(n2) operations. Indeed, from LUx = b and setting y = Ux we have
(10.44) Ly = b, (10.45) Ux = y.
10.3. LU AND CHOLESKI FACTORIZATIONS 163
Given b, we can solve the first system for y with forward substitution and then we solve the second system for x with backward substitution. Thus, while the LU factorization of A has an O(n3) cost, subsequent solutions to the linear system with the same matrix A but different right hand sides can be done in O(n2) operations.
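A small sketch of this reuse with SciPy's LU routines; note that lu_factor computes a pivoted factorization PA = LU, but the reuse pattern is the same, and the right hand sides below are illustrative:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1., 2., -1., 1.],
              [2., 4., 0., -1.],
              [3., 1., -1., 1.],
              [1., -1., 2., 1.]])
lu, piv = lu_factor(A)              # O(n^3), done once
b1 = np.array([0., -3., 3., 3.])
b2 = np.array([1., 0., 0., 0.])
x1 = lu_solve((lu, piv), b1)        # O(n^2) per right hand side
x2 = lu_solve((lu, piv), b2)
print(x1)                           # approximately [ 1. -1.  0.  1.]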
When can we obtain the factorization A = LU? The following result provides a useful sufficient condition.
Theorem 26. Let A be an n × n matrix whose leading principal submatrices A1, . . . , An are all nonsingular. Then, there exists an n × n lower triangular matrix L, with ones on its diagonal, and an n × n upper triangular matrix U such that A = LU and this factorization is unique.
Proof. Since A_1 is nonsingular, a_{11} \ne 0 and P_1 = I. Suppose now that we do not need to exchange rows in steps 2, ..., k-1, so that A^{(k-1)} = E_{k-1} \cdots E_2 E_1 A, that is,

A = (E_{k-1} \cdots E_2 E_1)^{-1} A^{(k-1)},

where the first factor is a unit lower triangular matrix containing the multipliers m_{ij} below its diagonal, and the leading k x k block of A^{(k-1)} is upper triangular with diagonal entries a_{11}, a_{22}^{(2)}, ..., a_{kk}^{(k-1)}. The determinant of the k x k leading principal submatrix of A on the left is equal to the determinant of the product of the corresponding k x k leading blocks on the right hand side. Since the determinant of the first such block is one (it is a lower triangular matrix with ones on the diagonal), it follows that

(10.46) a_{11} a_{22}^{(2)} \cdots a_{kk}^{(k-1)} = \det(A_k) \ne 0,

which implies that a_{kk}^{(k-1)} \ne 0, so P_k = I. We conclude that U = E_{n-1} \cdots E_1 A and therefore A = LU.

Let us now show that this decomposition is unique. Suppose A = L_1 U_1 = L_2 U_2. Then

(10.47) L_2^{-1} L_1 = U_2 U_1^{-1}.

But the matrix on the left hand side is lower triangular (with ones on its diagonal) whereas the one on the right hand side is upper triangular. Therefore L_2^{-1} L_1 = I = U_2 U_1^{-1}, which implies that L_1 = L_2 and U_1 = U_2.
An immediate consequence of this result is that Gaussian elimination can be performed without row interchange for a SDD matrix, as each of its leading principal submatrices is itself SDD and hence nonsingular, and for a positive definite matrix.
Corollary 1. Let A be an n×n matrix. Then A = LU, where L is an n×n lower triangular matrix , with ones on its diagonal, and U is an n × n upper triangular matrix if either
(a) A is SDD or
(b) A is symmetric positive definite.
In the case of a positive definite matrix the number of operations can be cut down approximately in half by exploiting symmetry to obtain a symmetric factorization A = BB^T, where B is a lower triangular matrix with positive entries on its diagonal. This representation is called the Choleski factorization of the symmetric positive definite matrix A.
Theorem 27. Let A be a symmetric positive definite matrix. Then, there is a unique lower triangular matrix B with positive entries in its diagonal such that A = BBT .
Proof. By Corollary 1, A has an LU factorization. Moreover, from (10.46) it follows that all the pivots are positive and thus u_{ii} > 0 for all i = 1, ..., n. We can split the pivots evenly between L and U by letting D = diag(\sqrt{u_{11}}, ..., \sqrt{u_{nn}}) and writing A = LU = LDD^{-1}U = (LD)(D^{-1}U). Let B = LD and C = D^{-1}U. Both matrices have diagonal elements \sqrt{u_{11}}, ..., \sqrt{u_{nn}}, but B is lower triangular while C is upper triangular. Moreover, A = BC, and because A^T = A we have C^T B^T = BC, which implies

(10.48) B^{-1}C^T = C(B^T)^{-1}.

The matrix on the left hand side is lower triangular with ones on its diagonal while the matrix on the right hand side is upper triangular, also with ones on its diagonal. Therefore B^{-1}C^T = I = C(B^T)^{-1}, and thus C = B^T and A = BB^T. To prove that this Choleski factorization is unique we go back to the LU factorization, which we know is unique (if we choose L to have ones on its diagonal). Given A = BB^T, where B is lower triangular with positive diagonal elements b_{11}, ..., b_{nn}, we can write A = BD_B^{-1}D_B B^T, where D_B = diag(b_{11}, ..., b_{nn}). Then L = BD_B^{-1} and U = D_B B^T yield the unique LU factorization of A. Now suppose there is another Choleski factorization A = CC^T. Then by the uniqueness of the LU factorization, we have

(10.49) L = BD_B^{-1} = CD_C^{-1},
(10.50) U = D_B B^T = D_C C^T,

where D_C = diag(c_{11}, ..., c_{nn}). Equation (10.50) implies that b_{ii}^2 = c_{ii}^2 for i = 1, ..., n, and since b_{ii} > 0 and c_{ii} > 0 for all i, then D_C = D_B and consequently C = B.
The Choleski factorization is usually written as A = LL^T and is obtained by exploiting the lower triangular structure of L and symmetry as follows. First, since L = (l_{ij}) is lower triangular, l_{ij} = 0 for 1 \le i < j \le n.
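A minimal NumPy sketch of the resulting column-by-column computation (this follows the standard Choleski recurrence; the function name cholesky_lower is illustrative, and NumPy's np.linalg.cholesky could be used instead):

import numpy as np

def cholesky_lower(A):
    """Return lower triangular L with A = L L^T for symmetric positive definite A."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: l_jj = sqrt(a_jj - sum_{k<j} l_jk^2)
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            # below the diagonal: l_ij = (a_ij - sum_{k<j} l_ik l_jk) / l_jj
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

A = np.array([[4., 2., 2.],
              [2., 5., 3.],
              [2., 3., 6.]])
L = cholesky_lower(A)
print(np.allclose(L @ L.T, A), np.allclose(L, np.linalg.cholesky(A)))   # True True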
10.6 A 2D BVP: Dirichlet Problem for the Poisson's Equation

f. Denoting by Ω and ∂Ω the unit square [0, 1] x [0, 1] and its boundary,
respectively, the BVP is to find u such that
(10.70) −∆u(x, y) = f (x, y), for (x, y) ∈ Ω
and
(10.71) u(x, y) = 0, for (x, y) ∈ ∂Ω.
In (10.70), ∆u is the Laplacian of u, also denoted as ∇2u, and is given by
(10.72) \Delta u = \nabla^2 u = u_{xx} + u_{yy} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}.
Equation (10.70) is Poisson’s equation (in 2D) and together with (10.71) specify a (homogeneous) Dirichlet problem because the value of u is given at the boundary.
To construct a numerical approximation to (10.70)-(10.71), we proceed as in the previous 1D BVP example by discretizing the domain. For simplicity, we will use uniformly spaced grid points. We choose a positive integer N and define the grid points of our domain Ω = [0, 1] × [0, 1] as
(10.73) (x_i, y_j) = (ih, jh), for i, j = 0, ..., N+1,
where h = 1/(N + 1). The interior nodes correspond to 1 ≤ i, j ≤ N and the boundary nodes are those corresponding to the remaining values of indices i and j (i or j equal 0 and i or j equal N + 1).
At each of the interior nodes we replace the Laplacian by its second order finite difference approximation, called the five-point discrete Laplacian,

(10.74) \nabla^2 u(x_i, y_j) = \frac{u(x_{i-1}, y_j) + u(x_{i+1}, y_j) + u(x_i, y_{j-1}) + u(x_i, y_{j+1}) - 4u(x_i, y_j)}{h^2} + O(h^2).

Neglecting the O(h^2) discretization error and denoting by v_{ij} the approximation to u(x_i, y_j), we get

(10.75) -\frac{v_{i-1,j} + v_{i+1,j} + v_{i,j-1} + v_{i,j+1} - 4v_{ij}}{h^2} = f_{ij}, \quad 1 \le i, j \le N.
This is a linear system of N^2 equations for the N^2 unknowns v_{ij}, 1 \le i, j \le N. We have freedom to order or label the unknowns any way we wish, and that will affect the structure of the matrix of coefficients of the linear system, but remarkably the matrix will be symmetric positive definite regardless of the ordering of the unknowns!
The most common labeling is the so-called lexicographical order, which proceeds from the bottom row to top one, left to right, v11, v12, . . . , v1N , v21, . . ., etc. Denoting by v1 = [v11,v12,…,v1N]T, v2 = [v21,v22,…,v2N]T, etc., and similarly for the right hand side f, the linear system (10.75) can be written in matrix form as
(10.76) \begin{bmatrix} T & -I & & & 0 \\ -I & T & -I & & \\ & \ddots & \ddots & \ddots & \\ & & -I & T & -I \\ 0 & & & -I & T \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ \vdots \\ v_N \end{bmatrix} = h^2 \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ \vdots \\ f_N \end{bmatrix}.

Here, I is the N x N identity matrix and T is the N x N tridiagonal matrix

(10.77) T = \begin{bmatrix} 4 & -1 & & & 0 \\ -1 & 4 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 4 & -1 \\ 0 & & & -1 & 4 \end{bmatrix}.

Thus, the matrix of coefficients in (10.76) is sparse, i.e. the vast majority of its entries are zeros. For example, for N = 3 this matrix is

\begin{bmatrix} 4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\ -1 & 4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\ 0 & -1 & 4 & 0 & 0 & -1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 4 & -1 & 0 & -1 & 0 & 0 \\ 0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0 \\ 0 & 0 & -1 & 0 & -1 & 4 & 0 & 0 & -1 \\ 0 & 0 & 0 & -1 & 0 & 0 & 4 & -1 & 0 \\ 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4 & -1 \\ 0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4 \end{bmatrix}.
Gaussian elimination is hugely inefficient for a large system (n > 100) with a sparse matrix, as in this example. This is because the intermediate matrices in the elimination would generally be dense due to fill-in introduced by the elimination process. To illustrate the high cost of Gaussian elimination, if we merely use N = 100 (this corresponds to a modest discretization error of O(10^{-4})), we end up with n = N^2 = 10^4 unknowns and the cost of Gaussian elimination would be O(10^{12}) operations.
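The N^2 x N^2 matrix in (10.76) is conveniently assembled with Kronecker products; a small sketch (dense NumPy arrays for clarity, although in practice one would use a sparse format; the helper name poisson_2d_matrix is illustrative):

import numpy as np

def poisson_2d_matrix(N):
    """Assemble the N^2 x N^2 five-point Laplacian matrix of (10.76),
    lexicographic ordering, using 1D second-difference matrices."""
    I = np.eye(N)
    T1 = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)   # 1D matrix tridiag(-1, 2, -1)
    return np.kron(I, T1) + np.kron(T1, I)                     # diagonal blocks T, off-diagonal blocks -I

A = poisson_2d_matrix(3)
print(A[:3, :6].astype(int))   # first rows reproduce the 4 / -1 pattern of the N = 3 example

For N = 100 this matrix already has 10^4 rows, which is why the iterative methods of the following sections, which only require matrix-vector products, are preferred over Gaussian elimination here.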
10.7 Linear Iterative Methods for Ax = b
As we have seen, Gaussian elimination is an expensive procedure for large linear systems of equations. An alternative is to seek not an exact (up to roundoff error) solution in a finite number of steps but an approximation to the solution that can be obtained from an iterative procedure applied to an initial guess x(0).
We are going to consider first a class of iterative methods where the central idea is to write the matrix A as the sum of a non-singular matrix M, whose corresponding system is easy to solve, and a remainder −N = A − M , so that the system Ax = b is transformed into the equivalent system
(10.78) Mx = Nx + b.
Starting with an initial guess x^{(0)}, (10.78) defines a sequence of approximations generated by

(10.79) M x^{(k+1)} = N x^{(k)} + b, \quad k = 0, 1, ...

The main questions are
1. When does this iteration converge?
2. What determines its rate of convergence?

3. What is the computational cost?
But first we look at three concrete iterative methods of the form (10.79). Unless otherwise stated A is assumed to be a non-singular n × n matrix and b a given n-column vector.
10.8 Jacobi, Gauss-Seidel, and S.O.R.
If all the diagonal elements of A are nonzero, we can take M = diag(A) and then at each iteration (i.e. for each k) the linear system (10.79) can be easily solved to obtain the next iterate x^{(k+1)}. Note that we do not need to compute M^{-1}, nor do we need to do the matrix product M^{-1}N (and due to its cost it should be avoided). We just need to solve the linear system with the matrix M, which in this case is trivial to do. We just solve the first equation for the first unknown, the second equation for the second unknown, etc., and we obtain the so-called Jacobi iterative method:
(10.80) x_i^{(k+1)} = \frac{-\sum_{j=1, j \ne i}^{n} a_{ij}x_j^{(k)} + b_i}{a_{ii}}, \quad i = 1, 2, ..., n, \ k = 0, 1, ...

The iteration could be stopped when

(10.81) \frac{\|x^{(k+1)} - x^{(k)}\|_\infty}{\|x^{(k+1)}\|_\infty} \le \text{Tolerance}.
Example 30. Consider the 4 × 4 linear system
(10.82)
10×1 −x2 +2×3 =6, −x1 +11×2 −x3 +3×4 = 25,
2×1 −x2 +10×3 −x4 =−11, 3×2 −x3 +8×4 =15.
It has the unique solution (1, 2, -1, 1). Jacobi's iteration for this system is

(10.83)
x_1^{(k+1)} = \tfrac{1}{10}x_2^{(k)} - \tfrac{1}{5}x_3^{(k)} + \tfrac{3}{5},
x_2^{(k+1)} = \tfrac{1}{11}x_1^{(k)} + \tfrac{1}{11}x_3^{(k)} - \tfrac{3}{11}x_4^{(k)} + \tfrac{25}{11},
x_3^{(k+1)} = -\tfrac{1}{5}x_1^{(k)} + \tfrac{1}{10}x_2^{(k)} + \tfrac{1}{10}x_4^{(k)} - \tfrac{11}{10},
x_4^{(k+1)} = -\tfrac{3}{8}x_2^{(k)} + \tfrac{1}{8}x_3^{(k)} + \tfrac{15}{8}.

Starting with x^{(0)} = [0, 0, 0, 0]^T we obtain

(10.84)
x^{(1)} = [0.60000000, 2.27272727, -1.10000000, 1.87500000]^T,
x^{(2)} = [1.04727273, 1.71590909, -0.80522727, 0.88522727]^T,
x^{(3)} = [0.93263636, 2.05330579, -1.04934091, 1.13088068]^T.
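The iterates in (10.84) are easy to reproduce; a minimal sketch of the Jacobi iteration (10.80) applied to (10.82) follows (the function name jacobi is illustrative):

import numpy as np

def jacobi(A, b, x0, num_iter):
    """Perform num_iter Jacobi iterations with M = diag(A)."""
    D = np.diag(A)               # diagonal of A
    R = A - np.diag(D)           # off-diagonal part of A (equal to -N in the splitting)
    x = x0.copy()
    for _ in range(num_iter):
        x = (b - R @ x) / D      # solve diag(A) x^{(k+1)} = b - R x^{(k)}
        print(x)
    return x

A = np.array([[10., -1., 2., 0.],
              [-1., 11., -1., 3.],
              [2., -1., 10., -1.],
              [0., 3., -1., 8.]])
b = np.array([6., 25., -11., 15.])
jacobi(A, b, np.zeros(4), 3)     # prints x^(1), x^(2), x^(3) of (10.84)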
In the Jacobi iteration, when we evaluate x_2^{(k+1)} we already have x_1^{(k+1)} available. When we evaluate x_3^{(k+1)} we already have x_1^{(k+1)} and x_2^{(k+1)} available, and so on. If we update the Jacobi iteration with the already computed components of x^{(k+1)} we obtain the Gauss-Seidel iteration:

(10.85) x_i^{(k+1)} = \frac{-\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}x_j^{(k)} + b_i}{a_{ii}}, \quad i = 1, 2, ..., n, \ k = 0, 1, ...

The Gauss-Seidel iteration is equivalent to the iteration obtained by taking M as the lower triangular part of the matrix A, including its diagonal.

Example 31. For the system (10.82), starting again with the initial guess [0, 0, 0, 0]^T, Gauss-Seidel produces the following approximations:

(10.86)
x^{(1)} = [0.60000000, 2.32727273, -0.98727273, 0.87886364]^T,
x^{(2)} = [1.03018182, 2.03693802, -1.0144562, 0.98434122]^T,
x^{(3)} = [1.00658504, 2.00355502, -1.00252738, 0.99835095]^T.
In an attempt to accelerate convergence of the Gauss-Seidel iteration, one could also put some weight on the diagonal part of A and split it between the matrices M and N of the iterative method (10.79). Specifically, we can write

(10.87) \mathrm{diag}(A) = \frac{1}{\omega}\,\mathrm{diag}(A) - \frac{1-\omega}{\omega}\,\mathrm{diag}(A),

where the first term on the right hand side goes into M and the last into N. The Gauss-Seidel method then becomes

(10.88) x_i^{(k+1)} = \frac{a_{ii}x_i^{(k)} - \omega\Bigl[\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} + \sum_{j=i}^{n} a_{ij}x_j^{(k)} - b_i\Bigr]}{a_{ii}}, \quad i = 1, 2, ..., n, \ k = 0, 1, ...

Note that ω = 1 corresponds to Gauss-Seidel. This iteration is generically called S.O.R. (successive over-relaxation), even though we refer to over-relaxation only when ω > 1 and under-relaxation when ω < 1. It can be proved that a necessary condition for convergence is that 0 < ω < 2.
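A sketch of one S.O.R. sweep as in (10.88), written component-wise (the name sor_sweep is illustrative; with ω = 1 the sweep reproduces Gauss-Seidel and hence the iterates (10.86)):

import numpy as np

def sor_sweep(A, b, x, omega):
    """One in-place S.O.R. sweep: x is updated component by component."""
    n = len(b)
    for i in range(n):
        # Gauss-Seidel value using the freshest available components
        gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        # relaxed update, equivalent to (10.88); omega = 1 gives plain Gauss-Seidel
        x[i] = (1.0 - omega) * x[i] + omega * gs
    return x

A = np.array([[10., -1., 2., 0.],
              [-1., 11., -1., 3.],
              [2., -1., 10., -1.],
              [0., 3., -1., 8.]])
b = np.array([6., 25., -11., 15.])
x = np.zeros(4)
for k in range(3):
    sor_sweep(A, b, x, omega=1.0)
    print(x)                       # with omega = 1 this prints the iterates in (10.86)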
10.9 Convergence of Linear Iterative Meth- ods
To study the convergence of iterative methods of the form M x(k+1) = N x(k) + b, for k = 0, 1, . . . we use the equivalent iteration
(10.89) x(k+1) =Tx(k) +c, k=0,1,... where
(10.90) T = M−1N = I − M−1A
is called the iteration matrix and c = M−1b.
The issue of convergence is that of existence of a fixed point for the map
F(x) = Tx+c defined for all x ∈ Rn. That is, whether or not there is an x ∈ Rn such that F (x) = x. For if the sequence defined in (10.89) converges to a vector x then, by continuity of F, we would have x = Tx+c = F(x). For any x, y ∈ Rn and for any inducedinduced matrix norm we have
(10.91) ∥F(x) − F(y)∥ = ∥Tx − Ty∥ ≤ ∥T∥ ∥x − y∥.
If for some induced norm ∥T∥ < 1, F is a contracting map or contraction and we will show that this guarantees the existence of a unique fixed point. We will also show that the rate of convergence of the sequence generated by iterative methods of the form (10.89) is given by the spectral radius ρ(T) of the iteration matrix T. These conclusions will follow from the following result.
Theorem 29. Let T be an n × n matrix. Then the following statements are equivalent:
(a) \lim_{k \to \infty} T^k = 0.

(b) \lim_{k \to \infty} T^k x = 0 for all x ∈ R^n.

(c) ρ(T) < 1.
(d) ∥T ∥ < 1 for at least one induced norm.
Proof. (a) ⇒ (b): For any induced norm we have that
(10.92) ∥Tkx∥ ≤ ∥Tk∥ ∥x∥
and so if T^k → 0 as k → ∞ then \|T^k x\| → 0, that is, T^k x → 0 for all x ∈ R^n.

(b) ⇒ (c): Let us suppose that \lim_{k \to \infty} T^k x = 0 for all x ∈ R^n but that ρ(T) ≥ 1. Then, there is an eigenvector v such that Tv = λv with |λ| ≥ 1, and the sequence T^k v = λ^k v does not converge, which is a contradiction.
(c) ⇒ (d): By Theorem 25, for each ε > 0, there is at least one induced norm ∥ · ∥ such that ∥T ∥ ≤ ρ(T ) + ε from which the statement follows.
(d) ⇒ (a): This follows immediately from ∥Tk∥ ≤ ∥T∥k.
Theorem 30. The iterative method (10.89) is convergent for any initial guess x(0) if and only if ρ(T) < 1 or equivalently if and only if ∥T∥ < 1 for at least one induced norm.
Proof. Let x be the exact solution of Ax = b. Then
(10.93) x−x(1) =Tx+c−(Tx(0) +c)=T(x−x(0)),
from which it follows that the error of the k-th iterate, e_k = x^{(k)} - x, satisfies

(10.94) e_k = T^k e_0,

for k = 1, 2, ..., where e_0 = x^{(0)} - x is the error of the initial guess. The conclusion now follows immediately from Theorem 29.
The spectral radius ρ(T) of the iteration matrix T measures the rate of convergence of the method. For if T is normal, then ∥T∥2 = ρ(T) and from (10.94) we get
(10.95) ∥ek∥2 ≤ ρ(T)k∥e0∥2.
But for each k we can find a vector e_0 for which equality holds, so ρ(T)^k \|e_0\|_2 is a least upper bound for the error \|e_k\|_2. If T is not normal, the following result shows that, asymptotically, \|T^k\| ≈ ρ(T)^k for any matrix norm.
Theorem 31. Let T be any n x n matrix. Then, for any matrix norm ∥ · ∥,

(10.96) \lim_{k \to \infty} \|T^k\|^{1/k} = \rho(T).
Proof. We know that ρ(T k ) = ρ(T )k and that ρ(T ) ≤ ∥T ∥. Therefore
(10.97) ρ(T) ≤ ∥Tk∥1/k
Now, for any given ε > 0, construct the matrix T_ε = T/(ρ(T)+ε). Then \lim_{k \to \infty} T_\varepsilon^k = 0 since ρ(T_ε) < 1. Therefore, there is an integer K_ε such that

(10.98) \|T_\varepsilon^k\| = \frac{\|T^k\|}{(\rho(T)+\varepsilon)^k} \le 1, \quad \text{for all } k \ge K_\varepsilon.

Thus, for all k ≥ K_ε we have

(10.99) \rho(T) \le \|T^k\|^{1/k} \le \rho(T) + \varepsilon,

from which the result follows.
Theorem 32. Let A be an n x n strictly diagonally dominant matrix. Then, for any initial guess x^{(0)} ∈ R^n,
(a) The Jacobi iteration converges to the exact solution of Ax = b.
(b) The Gauss-Seidel iteration converges to the exact solution of Ax = b.
Proof. (a) The Jacobi iteration matrix T has entries T_{ii} = 0 and T_{ij} = -a_{ij}/a_{ii} for i ≠ j. Therefore,

(10.100) \|T\|_\infty = \max_{1 \le i \le n} \sum_{j=1, j \ne i}^{n} \left|\frac{a_{ij}}{a_{ii}}\right| = \max_{1 \le i \le n} \frac{1}{|a_{ii}|} \sum_{j=1, j \ne i}^{n} |a_{ij}| < 1.

(b) We will prove that ρ(T) < 1 for the Gauss-Seidel iteration. Let x be an eigenvector of T with eigenvalue λ, normalized to have \|x\|_\infty = 1. Recall that T = I - M^{-1}A. Then Tx = λx implies Mx - Ax = λMx, from which, component-wise, we get

(10.101) -\sum_{j=i+1}^{n} a_{ij}x_j = \lambda \sum_{j=1}^{i} a_{ij}x_j = \lambda a_{ii}x_i + \lambda \sum_{j=1}^{i-1} a_{ij}x_j.

Now choose i such that \|x\|_\infty = |x_i| = 1. Then

|\lambda||a_{ii}| \le |\lambda| \sum_{j=1}^{i-1} |a_{ij}| + \sum_{j=i+1}^{n} |a_{ij}|,

so that

|\lambda| \le \frac{\sum_{j=i+1}^{n} |a_{ij}|}{|a_{ii}| - \sum_{j=1}^{i-1} |a_{ij}|} < \frac{\sum_{j=i+1}^{n} |a_{ij}|}{\sum_{j=i+1}^{n} |a_{ij}|} = 1,

where the last inequality was obtained by using that A is SDD. Thus, |λ| < 1 and so ρ(T) < 1.
Theorem 33. A necessary condition for the convergence of the S.O.R. iteration is 0 < ω < 2.
Proof. We will show that \det(T) = (1-\omega)^n. Because \det(T) is equal to the product of the eigenvalues of T, we have |\det(T)| \le \rho^n(T), and this implies that

(10.102) \rho(T) \ge |1 - \omega|.
Since ρ(T) < 1 is required for convergence, the conclusion follows. From the definition of the S.O.R. iteration (10.88) we get that

(10.103) a_{ii}x_i^{(k+1)} + \omega\sum_{j=1}^{i-1} a_{ij}x_j^{(k+1)} = a_{ii}x_i^{(k)} - \omega\sum_{j=i}^{n} a_{ij}x_j^{(k)} + \omega b_i.

That is, the iteration has the form \tilde{M}x^{(k+1)} = \tilde{N}x^{(k)} + \omega b, where \tilde{M} = \omega M is lower triangular with a diagonal equal to that of A and \tilde{N} = \omega N = \tilde{M} - \omega A is upper triangular with diagonal (1-\omega)\,\mathrm{diag}(A). Since T = M^{-1}N = \tilde{M}^{-1}\tilde{N}, we obtain

(10.104) \det(T) = \det(\tilde{M}^{-1})\det(\tilde{N}) = \det\bigl(\mathrm{diag}(A)^{-1}(1-\omega)\,\mathrm{diag}(A)\bigr) = \det((1-\omega)I) = (1-\omega)^n.
If A is positive definite S.O.R. converges for any initial guess. However, as we will see, there are more efficient iterative methods for positive definite linear systems.
Chapter 11
Linear Systems of Equations II
In this chapter we focus on some numerical methods for the solution of large linear systems Ax = b where A is a sparse, symmetric positive definite matrix. We also look briefly at the non-symmetric case.
11.1 Positive Definite Linear Systems as an Optimization Problem
Suppose that A is an n x n symmetric, positive definite matrix and we are interested in solving Ax = b. Let x̄ be the unique, exact solution of Ax = b. Since A is positive definite, we can define the norm

(11.1) \|x\|_A = \sqrt{x^T A x}.

Henceforth we are going to denote the inner product of two vectors x, y in R^n by ⟨x, y⟩, i.e.

(11.2) \langle x, y \rangle = x^T y = \sum_{i=1}^{n} x_i y_i.

Consider now the quadratic function of x ∈ R^n defined by

(11.3) J(x) = \frac{1}{2}\|x - \bar{x}\|_A^2.

Note that J(x) ≥ 0 and J(x) = 0 if and only if x = x̄, because A is positive definite. Therefore, x minimizes J if and only if x = x̄. In optimization, the function to be minimized (maximized), J in our case, is called the objective function.
For several optimization methods it is useful to consider the one-dimensional problem of minimizing J along a fixed direction. For given x, v ∈ Rn we con- sider the so called line minimization problem consisting in minimizing J along the line that passes through x and is in the direction of v, i.e
(11.4) \min_{t \in \mathbb{R}} J(x + tv).

Denoting g(t) = J(x + tv) and using the definition (11.3) of J we get

(11.5)
g(t) = \frac{1}{2}\langle x - \bar{x} + tv, A(x - \bar{x} + tv)\rangle
     = J(x) + \langle x - \bar{x}, Av \rangle\, t + \frac{1}{2}\langle v, Av \rangle\, t^2
     = J(x) + \langle Ax - b, v \rangle\, t + \frac{1}{2}\langle v, Av \rangle\, t^2.
This is a parabola opening upward because ⟨v, Av⟩ > 0 for all v ̸= 0. Thus, its minimum is given by the critical point
(11.6) 0 = g′(t∗) = −⟨v, b − Ax⟩ + t∗⟨v, Av⟩,
that is
(11.7) t^* = \frac{\langle v, b - Ax \rangle}{\langle v, Av \rangle},

and the minimum of J along the line x + tv, t ∈ R, is

(11.8) g(t^*) = J(x) - \frac{1}{2}\,\frac{\langle v, b - Ax \rangle^2}{\langle v, Av \rangle}.
Finally, using the definition of ∥ · ∥A and Ax ̄ = b, we have
(11.9) \frac{1}{2}\|x - \bar{x}\|_A^2 = \frac{1}{2}\|x\|_A^2 - \langle b, x \rangle + \frac{1}{2}\|\bar{x}\|_A^2,
and so it follows that
(11.10) ∇J(x) = Ax − b.
11.2 Line Search Methods
We just saw in the previous section that the problem of solving Ax = b, when A is a symmetric positive definite matrix is equivalent to a convex, minimization problem of a quadratic objective function J(x) = ∥x − x ̄∥2A. An important class of methods for this type of optimization problems is called line search methods.
Line search methods produce a sequence of approximations to the mini- mizer, in the form
(11.11) x(k+1) =x(k) +tkv(k), k=0,1,…,
where the vector v(k) and the scalar tk are called the search direction and the step length at the k-th iteration, respectively. The question then is how to select the search directions and the step lengths to converge to the minimizer. Most line search methods are of descent type because they required that the value of J is decreased with each iteration. Going back to (11.5) this means that descent line search methods have the condition ⟨v(k),∇J(x(k))⟩ < 0.
Starting with an initial guess x(0), line search methods generate
(11.12) x(1) = x(0) + t0v(0)
(11.13) x^{(2)} = x^{(1)} + t_1 v^{(1)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)},
etc., so that the k-th element of the sequence is x(0) plus a linear combination
of v(0),v(1),...,v(k−1) :
(11.14) x^{(k)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)} + \cdots + t_{k-1} v^{(k-1)}.
That is,
(11.15) x(k) − x(0) ∈ span{v(0), v(1), . . . , v(k−1)}.
Unless otherwise noted, we will take the step length tk to be given by the one-dimensional minimizer (11.7) evaluated at the k-step, i.e.
(11.16) t_k = \frac{\langle v^{(k)}, r^{(k)} \rangle}{\langle v^{(k)}, A v^{(k)} \rangle},
where
(11.17) r(k) = b − Ax(k)
is the residual of the linear equation Ax = b associated with the approxima- tion x(k).
11.2.1 Steepest Descent
One way to guarantee a decrease of J(x) = ∥x − x ̄∥2A at every step of a line search method is to choose v(k) = −∇J(x(k)), which is locally the fastest rate of decrease of J. Recalling that ∇J(x(k)) = −r(k), we take v(k) = r(k). The optimal step length is selected according to (11.16) so that we choose the line minimizer (in the direction of −∇J(x(k))) of J. The resulting method is called steepest descent and, starting from an initial guess x(0), is given by
(11.18) t_k = \frac{\langle r^{(k)}, r^{(k)} \rangle}{\langle r^{(k)}, A r^{(k)} \rangle},
(11.19) x^{(k+1)} = x^{(k)} + t_k r^{(k)},
(11.20) r^{(k+1)} = r^{(k)} - t_k A r^{(k)},

for k = 0, 1, .... Formula (11.20), which comes from subtracting A times (11.19) from b, is preferable to using the definition of the residual, i.e. r^{(k+1)} = b - Ax^{(k+1)}, due to round-off errors.

If A is a multiple of the identity, the level sets of J are circles (spheres) and the steepest descent method finds the minimum in a single step. For a general positive definite matrix A, the level sets are ellipses (ellipsoids); convergence of the steepest descent sequence to the minimizer of J, and hence to the solution of Ax = b, is guaranteed, but the minimizer need not be reached in a finite number of steps.
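A minimal sketch of the steepest descent iteration (11.18)-(11.20); steepest_descent is an illustrative name and the small test matrix is not from the text:

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Steepest descent for a symmetric positive definite matrix A."""
    x = x0.copy()
    r = b - A @ x                      # initial residual
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ar = A @ r
        t = (r @ r) / (r @ Ar)         # optimal step length (11.18)
        x = x + t * r                  # (11.19)
        r = r - t * Ar                 # (11.20): cheaper than recomputing b - Ax
    return x

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
print(steepest_descent(A, b, np.zeros(2)))   # close to np.linalg.solve(A, b)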
11.3 The Conjugate Gradient Method
The steepest descent method uses an optimal search direction locally but not globally, and as a result it converges, in general, very slowly to the minimizer. A key strategy to accelerate convergence in line search methods is to widen our search space by considering the previous search directions, not just the current one. Obviously, we would like the v^{(k)}'s to be linearly independent. Recall that x^{(k)} - x^{(0)} ∈ span{v^{(0)}, v^{(1)}, ..., v^{(k-1)}}. We are going to denote

(11.21) V_k = span\{v^{(0)}, v^{(1)}, ..., v^{(k-1)}\}

and write x ∈ x^{(0)} + V_k to mean that x = x^{(0)} + v with v ∈ V_k.
The idea is to select v^{(0)}, v^{(1)}, ..., v^{(k-1)} such that
(11.22) x^{(k)} = \arg\min_{x \in x^{(0)} + V_k} \|x - \bar{x}\|_A^2.
If the search directions are linearly independent, as k increases our search space grows so the minimizer would be found in at most n steps, when Vn = Rn.
Let us derive a condition for the minimizer of J over x^{(0)} + V_k. Suppose x ∈ x^{(0)} + V_k. Then, there are scalars c_0, c_1, ..., c_{k-1} such that

(11.23) x = x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + \cdots + c_{k-1} v^{(k-1)}.

For fixed v^{(0)}, v^{(1)}, ..., v^{(k-1)}, define the following function of c_0, c_1, ..., c_{k-1}:

(11.24) G(c_0, c_1, ..., c_{k-1}) := J\bigl(x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + \cdots + c_{k-1} v^{(k-1)}\bigr).

Because J is a quadratic function, the minimizer of G is the critical point c_0^*, c_1^*, ..., c_{k-1}^*:

(11.25) \frac{\partial G}{\partial c_j}(c_0^*, c_1^*, ..., c_{k-1}^*) = 0, \quad j = 0, ..., k-1.

But by the Chain Rule, at the corresponding minimizer x^{(k)},

(11.26) 0 = \frac{\partial G}{\partial c_j} = \nabla J(x^{(k)}) \cdot v^{(j)} = -\langle r^{(k)}, v^{(j)} \rangle, \quad j = 0, 1, ..., k-1.
We have proved the following theorem.
Theorem 34. The vector x^{(k)} ∈ x^{(0)} + V_k minimizes \|x - \bar{x}\|_A^2 over x^{(0)} + V_k, for k = 0, 1, ..., if and only if

(11.27) \langle r^{(k)}, v^{(j)} \rangle = 0, \quad j = 0, 1, ..., k-1.
That is, the residual r(k) = b−Ax(k) is orthogonal to all the search directions v(0), . . . , v(k−1).
Let us go back to one step of a line search method, x(k+1) = x(k) + tkv(k), where tk is given by the one-dimensional minimizer (11.16). As we have done in the Steepest Descent method, we find that the corresponding residual
satisfies r^{(k+1)} = r^{(k)} - t_k A v^{(k)}. Starting with an initial guess x^{(0)}, we compute r^{(0)} = b - Ax^{(0)} and take v^{(0)} = r^{(0)}. Then,

(11.28) x^{(1)} = x^{(0)} + t_0 v^{(0)},
(11.29) r^{(1)} = r^{(0)} - t_0 A v^{(0)},

and

(11.30) \langle r^{(1)}, v^{(0)} \rangle = \langle r^{(0)}, v^{(0)} \rangle - t_0 \langle v^{(0)}, A v^{(0)} \rangle = 0,

where the last equality follows from the definition (11.16) of t_0. Now,

(11.31) r^{(2)} = r^{(1)} - t_1 A v^{(1)}

and consequently

(11.32) \langle r^{(2)}, v^{(0)} \rangle = \langle r^{(1)}, v^{(0)} \rangle - t_1 \langle v^{(0)}, A v^{(1)} \rangle = -t_1 \langle v^{(0)}, A v^{(1)} \rangle.

Thus if

(11.33) \langle v^{(0)}, A v^{(1)} \rangle = 0

then \langle r^{(2)}, v^{(0)} \rangle = 0. Moreover, r^{(2)} = r^{(1)} - t_1 A v^{(1)}, from which it follows that

(11.34) \langle r^{(2)}, v^{(1)} \rangle = \langle r^{(1)}, v^{(1)} \rangle - t_1 \langle v^{(1)}, A v^{(1)} \rangle = 0,
where in the last equality we have used the definition of t1, (11.16). Thus, if condition (11.33) holds we can guarantee that ⟨r(1), v(0)⟩ = 0 and ⟨r(2), v(j)⟩ = 0, j = 0, 1, i.e. we satisfy the conditions of Theorem 34 for k = 1, 2.
Definition 19. Let A be an n × n matrix. We say that two vectors x, y ∈ Rn are conjugate with respect to A if
(11.35) ⟨x, Ay⟩ = 0.
We can now proceed by induction to prove the following theorem.
Theorem 35. Suppose v^{(0)}, ..., v^{(k-1)} are conjugate with respect to A. Then, for k = 1, 2, ...,

\langle r^{(k)}, v^{(j)} \rangle = 0, \quad j = 0, 1, ..., k-1.
Proof. Let us do induction. We know the statement is true for k = 1.
Suppose
(11.36) ⟨r(k−1), v(j)⟩ = 0, j = 0, 1, ...., k − 2.
Recall that r(k) = r(k−1) − tk−1Av(k−1) and so
(11.37) ⟨r(k), v(k−1)⟩ = ⟨r(k−1), v(k−1)⟩ − tk−1⟨v(k−1), Av(k−1)⟩ = 0
because of the choice (11.16) of t_{k-1}. Now, for j = 0, 1, ..., k-2,
(11.38) ⟨r(k), v(j)⟩ = ⟨r(k−1), v(j)⟩ − tk−1⟨v(j), Av(k−1)⟩ = 0,
where the first term is zero because of the induction hypothesis and the second term is zero because the search directions are conjugate.
Combining Theorems 34 and 35 we get the following important conclu- sion.
Theorem 36. If the search directions, v(0), v(1), . . . , v(k−1) are conjugate (with respect to A) then x(k) = x(k−1) + tk−1v(k−1) is the minimizer of ∥x − x ̄∥2A over x(0) + Vk.
11.3.1 Generating the Conjugate Search Directions
The conjugate gradient method, due to Hestenes and Stiefel, is an ingenious approach to generating efficiently the set of conjugate search directions. The idea is to modify the negative gradient direction, r(k), by adding information about the previous search direction, v(k−1). Specifically, we start with
(11.39) v(k) = r(k) + skv(k−1),
where the scalar sk is chosen so that v(k) is conjugate to v(k−1) with respect
to A, i.e.
(11.40) 0 = \langle v^{(k)}, A v^{(k-1)} \rangle = \langle r^{(k)}, A v^{(k-1)} \rangle + s_k \langle v^{(k-1)}, A v^{(k-1)} \rangle,

which gives

(11.41) s_k = -\frac{\langle r^{(k)}, A v^{(k-1)} \rangle}{\langle v^{(k-1)}, A v^{(k-1)} \rangle}.
Magically this simple construction renders all the search directions conjugate and the residuals orthogonal!
Theorem 37.

(a) \langle r^{(i)}, r^{(j)} \rangle = 0, i ≠ j.

(b) \langle v^{(i)}, A v^{(j)} \rangle = 0, i ≠ j.
Proof. By the choice of t_k and s_k it follows that

(11.42) \langle r^{(k+1)}, r^{(k)} \rangle = 0,
(11.43) \langle v^{(k+1)}, A v^{(k)} \rangle = 0,

for k = 0, 1, .... Let us now proceed by induction. We know \langle r^{(1)}, r^{(0)} \rangle = 0 and \langle v^{(1)}, A v^{(0)} \rangle = 0. Suppose \langle r^{(i)}, r^{(j)} \rangle = 0 and \langle v^{(i)}, A v^{(j)} \rangle = 0 hold for 0 \le j < i \le k.
Algorithm (Conjugate Gradient Method)
1: Given x^{(0)}, compute r^{(0)} \leftarrow b - Ax^{(0)}, set v^{(0)} \leftarrow r^{(0)} and k \leftarrow 0
2: while \|r^{(k)}\| > TOL do
3:   t_k \leftarrow \frac{\langle r^{(k)}, r^{(k)} \rangle}{\langle v^{(k)}, A v^{(k)} \rangle}
4:   x^{(k+1)} \leftarrow x^{(k)} + t_k v^{(k)}
5:   r^{(k+1)} \leftarrow r^{(k)} - t_k A v^{(k)}
6:   s_{k+1} \leftarrow \frac{\langle r^{(k+1)}, r^{(k+1)} \rangle}{\langle r^{(k)}, r^{(k)} \rangle}
7:   v^{(k+1)} \leftarrow r^{(k+1)} + s_{k+1} v^{(k)}
8:   k \leftarrow k+1
9: end while
Theorem 38. Let A be an n × n symmetric positive definite matrix, then the conjugate gradient method converges to the exact solution (assuming no round-off errors) of Ax = b in at most n steps .
Proof. By Theorem 37, the residuals are orthogonal hence linearly indepen- dent. After n steps, r(n) is orthogonal to r(0),r(1),…,r(n−1). Since the di- mension of the space is n, r(n) has to be the zero vector.
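A direct transcription of the pseudo-code above into NumPy (a sketch; in practice one would also guard against division by zero and use a relative tolerance):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient method for a symmetric positive definite matrix A."""
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = x0.copy()
    r = b - A @ x
    v = r.copy()                         # first search direction v^(0) = r^(0)
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol:
            break
        Av = A @ v
        t = rr / (v @ Av)                # step length t_k
        x = x + t * v
        r = r - t * Av
        rr_new = r @ r
        s = rr_new / rr                  # s_{k+1}
        v = r + s * v                    # next conjugate search direction
        rr = rr_new
    return x

A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
b = np.array([1., 2., 3.])
print(np.allclose(conjugate_gradient(A, b, np.zeros(3)), np.linalg.solve(A, b)))   # True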
11.4 Krylov Subspaces
In the conjugate gradient method we start with an initial guess x^{(0)}, compute the residual r^{(0)} = b - Ax^{(0)}, and set v^{(0)} = r^{(0)}. We then get x^{(1)} = x^{(0)} + t_0 r^{(0)} and evaluate the residual r^{(1)}, etc. If we use the definition of the residual we have

(11.52) r^{(1)} = b - Ax^{(1)} = b - Ax^{(0)} - t_0 A r^{(0)} = r^{(0)} - t_0 A r^{(0)},

so that r^{(1)} is a linear combination of r^{(0)} and Ar^{(0)}. Similarly,

(11.53) x^{(2)} = x^{(1)} + t_1 v^{(1)} = x^{(0)} + t_0 r^{(0)} + t_1 r^{(1)} + t_1 s_1 r^{(0)} = x^{(0)} + (t_0 + t_1 s_1) r^{(0)} + t_1 r^{(1)},

so that r^{(2)} = b - Ax^{(2)} is a linear combination of r^{(0)}, Ar^{(0)}, and A^2 r^{(0)}, and so on.
Definition 20. The set Kk(r(0), A) = span{r(0), Ar(0), …, Ak−1r(0)} is called
the Krylov subspace of degree k for r(0).
Krylov subspaces are central to an important class of numerical methods that rely on getting approximations through matrix-vector multiplication like the conjugate gradient method.
The following theorem provides a reinterpretation of the conjugate gra- dient method. The approximation x(k) is the minimizer of ∥x − x ̄∥2A over Kk(r(0),A).
Theorem 39. Kk(r(0), A) = span{r(0), …, r(k−1)} = span{v(0), …, v(k−1)}.
Proof. We will prove it by induction. The case k = 1 holds by construction. Let us now assume that the statement holds for k and prove that it also holds for k+1.

By the induction hypothesis r^{(k)}, v^{(k-1)} ∈ K_k(r^{(0)}, A), hence

A v^{(k-1)} ∈ span\{Ar^{(0)}, ..., A^k r^{(0)}\},

but r^{(k)} = r^{(k-1)} - t_{k-1} A v^{(k-1)} and so

r^{(k)} ∈ K_{k+1}(r^{(0)}, A).

Consequently,

span\{r^{(0)}, ..., r^{(k)}\} ⊆ K_{k+1}(r^{(0)}, A).

We now prove the reverse inclusion,

span\{r^{(0)}, ..., r^{(k)}\} ⊇ K_{k+1}(r^{(0)}, A).

Note that A^k r^{(0)} = A(A^{k-1} r^{(0)}). But by the induction hypothesis

span\{r^{(0)}, Ar^{(0)}, ..., A^{k-1} r^{(0)}\} = span\{v^{(0)}, ..., v^{(k-1)}\}.
Given that

A^k r^{(0)} = A(A^{k-1} r^{(0)}) ∈ span\{Av^{(0)}, ..., Av^{(k-1)}\},

and since

A v^{(j)} = \frac{1}{t_j}\bigl(r^{(j)} - r^{(j+1)}\bigr),

it follows that

A^k r^{(0)} ∈ span\{r^{(0)}, r^{(1)}, ..., r^{(k)}\}.

Thus,

span\{r^{(0)}, ..., r^{(k)}\} = K_{k+1}(r^{(0)}, A).
For the last equality we observe that

span\{v^{(0)}, ..., v^{(k)}\} = span\{v^{(0)}, ..., v^{(k-1)}, r^{(k)}\},

because v^{(k)} = r^{(k)} + s_k v^{(k-1)}, and by the induction hypothesis

(11.54) span\{v^{(0)}, ..., v^{(k-1)}, r^{(k)}\} = span\{r^{(0)}, Ar^{(0)}, ..., A^{k-1} r^{(0)}, r^{(k)}\} = span\{r^{(0)}, r^{(1)}, ..., r^{(k)}\} = K_{k+1}(r^{(0)}, A).
11.5 Convergence Rate of the Conjugate Gradient Method
Let us define the initial error as e(0) = x(0) − x ̄. Then Ae(0) = Ax(0) − Ax ̄ implies that
(11.55) r(0) = −Ae(0).
For the conjugate gradient method x(k) ∈ x(0) + Kk(r(0), A) and in view of
(11.55) we have that
(11.56) x(k) − x ̄ = e(0) + c1Ae(0) + c2A2e(0) + · · · + ckAke(0),
for some real constants c1, . . . , ck. In fact,
∥x(k) − x ̄∥A = min{∥p(A)e(0)∥A : p polynomial of degree ≤ k and p(0) = 1}.
Chapter 12 Non-Linear Equations
12.1 Introduction
In this chapter we consider the problem of finding zeros of a continuous function f, i.e. solving f(x) = 0, for example e^{-x} - x = 0, or a system of nonlinear equations:
(12.1)
f_1(x_1, x_2, ..., x_n) = 0,
f_2(x_1, x_2, ..., x_n) = 0,
\vdots
f_n(x_1, x_2, ..., x_n) = 0.
We are going to write this generic system in vector form as (12.2) f(x) = 0,
where f : U ⊆ Rn → Rn. Unless otherwise noted the function f is assumed to be smooth in its domain U.
We are going to start with the scalar case, n = 1, and look at a very simple but robust method that relies only on the continuity of the function and the existence of a zero.
12.2 Bisection
Suppose we are interested in solving a nonlinear equation in one unknown (12.3) f(x) = 0,
where f is a continuous function on an interval [a,b] and has at least one zero there.
Suppose that f has values of different sign at the end points of the interval, i.e.
(12.4) f(a)f(b) < 0.
By the Intermediate Value Theorem, f has at least one zero in (a,b). To locate a zero we bisect the interval and check on which subinterval f changes sign. We repeat the process until we bracket a zero within a desired accuracy. The Bisection algorithm to find a zero x∗ is shown below.
Algorithm 7 The Bisection Method
Given f, a and b (a < b) with f(a)f(b) < 0, a tolerance TOL, and a maximum number of iterations Nmax:
1: k ← 1
2: while (b - a)/2 > TOL and k ≤ Nmax do
3:   c ← (a + b)/2
4:   if f(c) == 0 then    ◃ This is the solution
5:     x^* ← c
6:     stop
7:   end if
8:   if sign(f(c)) == sign(f(a)) then
9:     a ← c
10:  else
11:    b ← c
12:  end if
13:  k ← k + 1
14: end while
15: x^* ← (a + b)/2

12.2.1 Convergence of the Bisection Method

With the bisection method we generate a sequence

(12.5) c_k = \frac{a_k + b_k}{2}, \quad k = 1, 2, ...,

where a_k and b_k are the endpoints of the subinterval we select at each bisection step (because f changes sign there). Since

(12.6) b_k - a_k = \frac{b - a}{2^{k-1}}, \quad k = 1, 2, ...,

and c_k = (a_k + b_k)/2 is the midpoint of the interval, then

(12.7) |c_k - x^*| \le \frac{1}{2}(b_k - a_k) = \frac{b - a}{2^k},

and consequently c_k → x^*, a zero of f in [a, b].
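A minimal Python sketch of Algorithm 7 (the function name bisection and the test equation are illustrative):

import numpy as np

def bisection(f, a, b, tol=1e-10, nmax=100):
    """Bisection method: f continuous on [a, b] with f(a) f(b) < 0."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(nmax):
        c = 0.5 * (a + b)
        if f(c) == 0 or 0.5 * (b - a) <= tol:
            return c
        if np.sign(f(c)) == np.sign(f(a)):
            a = c            # the zero lies in [c, b]
        else:
            b = c            # the zero lies in [a, c]
    return 0.5 * (a + b)

# zero of f(x) = x - e^{-x} in [0, 1]
print(bisection(lambda x: x - np.exp(-x), 0.0, 1.0))   # approximately 0.567143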
12.3 Rate of Convergence
We now define in precise terms the rate of convergence of a sequence of approximations to a value x∗.
Definition 21. Suppose a sequence \{x_n\}_{n=1}^{\infty} converges to x^* as n → ∞. We say that x_n → x^* with order p (p ≥ 1) if there is a positive integer N and a constant C such that

(12.8) |x_{n+1} - x^*| \le C|x_n - x^*|^p, \quad \text{for all } n \ge N,

or equivalently

(12.9) \lim_{n \to \infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^p} = C.
Example 32. The sequence generated by the bisection method converges linearly to x^* because

|c_{n+1} - x^*| \le \frac{b-a}{2^{n+1}} = \frac{1}{2}\,\frac{b-a}{2^n},

that is, the error bound |c_n - x^*| \le (b-a)/2^n is reduced by the factor C = 1/2 at each step.

Let us examine the significance of the rate of convergence. Consider first p = 1, linear convergence. Suppose

(12.10) |x_{n+1} - x^*| \approx C|x_n - x^*|, \quad n \ge N.

Then

|x_{N+1} - x^*| \approx C|x_N - x^*|,
|x_{N+2} - x^*| \approx C|x_{N+1} - x^*| \approx C(C|x_N - x^*|) = C^2|x_N - x^*|.

Continuing this way we get

(12.11) |x_{N+k} - x^*| \approx C^k|x_N - x^*|, \quad k = 0, 1, ...,

and this is the reason for the requirement C < 1 when p = 1. If the error at the N-th step, |x_N - x^*|, is small enough, it will be reduced by a factor of C^k after k more steps. Setting C^k = 10^{-d_k}, the error |x_N - x^*| will be reduced by approximately

(12.12) d_k = \Bigl\lfloor k \log_{10}\frac{1}{C} \Bigr\rfloor

digits.
Let us now do a similar analysis for p = 2, quadratic convergence. We have

|x_{N+1} - x^*| \approx C|x_N - x^*|^2,
|x_{N+2} - x^*| \approx C|x_{N+1} - x^*|^2 \approx C(C|x_N - x^*|^2)^2 = C^3|x_N - x^*|^4,
|x_{N+3} - x^*| \approx C|x_{N+2} - x^*|^2 \approx C(C^3|x_N - x^*|^4)^2 = C^7|x_N - x^*|^8.

It is easy to prove by induction that

(12.13) |x_{N+k} - x^*| \approx C^{2^k - 1}|x_N - x^*|^{2^k}, \quad k = 0, 1, ...

To see how many digits of accuracy we gain in k steps beginning from x_N, we write C^{2^k - 1}|x_N - x^*|^{2^k} = 10^{-d_k}|x_N - x^*|, and solving for d_k we get

(12.14) d_k = \Bigl\lfloor \log_{10}\frac{1}{C} + \log_{10}\frac{1}{|x_N - x^*|} \Bigr\rfloor (2^k - 1).

It is not difficult to prove that for general p > 1 and as k → ∞ we get d_k = \alpha p^k, where \alpha = \frac{1}{p-1}\log_{10}\frac{1}{C} + \log_{10}\frac{1}{|x_N - x^*|}.
12.4 Interpolation-Based Methods
Assuming again that f is a continuous function on [a, b] with f(a)f(b) < 0, we can proceed as in the bisection method but, instead of using the midpoint c = (a+b)/2 to subdivide the interval in question, we could use the root of the linear polynomial interpolating (a, f(a)) and (b, f(b)). This is called the method of false position. Unfortunately, this method only converges linearly and under stronger assumptions than the bisection method.
An alternative way to use interpolation to obtain numerical methods for f(x) = 0 is to proceed as follows: given m+1 approximations to the zero of f, x_0, ..., x_m, construct the interpolating polynomial of f, p_m, at those points, and set the root of p_m closest to x_m as the new approximation to the zero of f. In practice, only m = 1, 2 are used. The method for m = 1 is called the secant method and we will look at it in some detail later. The method for m = 2 is called Muller's method.
12.5 Newton’s Method
If the function f is smooth, say at least C2[a,b], and we have already a good approximation x0 to a zero x∗ of f then the tangent line of f at x0, y = f(x0) + f′(x0)(x − x0) provides a good approximation to f in a small neighborhood of x0, i.e.
(12.15) f(x) ≈ f(x0) + f′(x0)(x − x0).
Then we can define the next approximation as the zero of that tangent line,
i.e.
(12.16) x_1 = x_0 - \frac{f(x_0)}{f'(x_0)},

etc. At the k-th iteration we get the new approximation x_{k+1} according to

(12.17) x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \quad k = 0, 1, ...
This iteration is called Newton's method or the Newton-Raphson method. There are some conditions for this method to work and converge but, as we will show, when it does converge it does so at least quadratically.
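A sketch of the iteration (12.17); the function name newton is illustrative and the derivative is passed explicitly:

import math

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for f(x) = 0 with a user-supplied derivative."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            break
        x = x - fx / fprime(x)     # Newton update (12.17)
    return x

# same test equation x - e^{-x} = 0 as in the bisection example
print(newton(lambda x: x - math.exp(-x), lambda x: 1 + math.exp(-x), x0=0.5))   # approximately 0.567143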
Theorem 40. Let x^* be a simple zero of f (i.e. f(x^*) = 0 and f'(x^*) ≠ 0) and suppose f ∈ C^2. Then there is a neighborhood I_ε of x^* such that Newton's method converges to x^* for any initial guess in I_ε.
Proof. Since f' is continuous and f'(x^*) ≠ 0, we can choose ε > 0 sufficiently small so that f'(x) ≠ 0 for all x with |x - x^*| ≤ ε (this interval is I_ε) and that εM(ε) < 1, where

M(\varepsilon) = \frac{1}{2}\,\frac{\max_{x \in I_\varepsilon}|f''(x)|}{\min_{x \in I_\varepsilon}|f'(x)|}

(this is possible because M(ε) → \frac{1}{2}\frac{|f''(x^*)|}{|f'(x^*)|} < +\infty as ε → 0). Taylor expansion around x^* gives, for x ∈ I_ε,

f(x) = f(x^*) + f'(x^*)(x - x^*) + \frac{1}{2}f''(\xi)(x - x^*)^2 = f'(x^*)(x - x^*)\Bigl(1 + \frac{1}{2}\frac{f''(\xi)}{f'(x^*)}(x - x^*)\Bigr),

for some ξ between x and x^*, and

\Bigl|\frac{1}{2}\frac{f''(\xi)}{f'(x^*)}(x - x^*)\Bigr| \le \varepsilon M(\varepsilon) < 1.

Thus x^* is the only zero of f in I_ε.

We need to show that the iterates x_k remain in I_ε. Taylor expansion of f around x_k, evaluated at x^*, gives 0 = f(x_k) + f'(x_k)(x^* - x_k) + \frac{1}{2}f''(\xi_k)(x^* - x_k)^2, and dividing by f'(x_k) and using the definition (12.17) of x_{k+1} we obtain

x_{k+1} - x^* = \frac{1}{2}\frac{f''(\xi_k)}{f'(x_k)}(x_k - x^*)^2.

Assume x_0 ∈ I_ε, i.e. |x_0 - x^*| ≤ ε. Then

|x_1 - x^*| = |x_0 - x^*|^2\,\frac{1}{2}\frac{|f''(\xi_0)|}{|f'(x_0)|} \le |x_0 - x^*|\,|x_0 - x^*|\,M(\varepsilon) \le |x_0 - x^*|\,\varepsilon M(\varepsilon) < |x_0 - x^*| \le \varepsilon.

Now assume that x_k ∈ I_ε. Then

|x_{k+1} - x^*| = |x_k - x^*|^2\,\frac{1}{2}\frac{|f''(\xi_k)|}{|f'(x_k)|} \le \varepsilon^2 M(\varepsilon) < \varepsilon,

so x_{k+1} ∈ I_ε. Consequently,

|x_{k+1} - x^*| \le |x_k - x^*|^2 M(\varepsilon) \le |x_k - x^*|\,\varepsilon M(\varepsilon) \le \cdots \le (\varepsilon M(\varepsilon))^{k+1}|x_0 - x^*| \to 0 \text{ as } k \to \infty,

that is, x_k → x^* as k → ∞, and the convergence is at least quadratic.
The need for a good initial guess x0 for Newton’s method should be emphasized. In practice, this is obtained with another method, like bisection.
12.6 The Secant Method
Sometimes it could be computationally expensive or not possible to evaluate the derivative of f. The following method, known as the secant method,
replaces the derivative by the secant slope:

(12.18) x_{k+1} = x_k - \frac{f(x_k)}{\dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}, \quad k = 1, 2, ...

Note that since f(x^*) = 0,

x_{k+1} - x^* = x_k - x^* - \frac{f(x_k) - f(x^*)}{\dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}
= x_k - x^* - \frac{f(x_k) - f(x^*)}{f[x_k, x_{k-1}]}
= (x_k - x^*)\Bigl(1 - \frac{f[x_k, x^*]}{f[x_k, x_{k-1}]}\Bigr)
= (x_k - x^*)\,\frac{f[x_k, x_{k-1}] - f[x_k, x^*]}{f[x_k, x_{k-1}]}
= (x_k - x^*)(x_{k-1} - x^*)\,\frac{f[x_{k-1}, x_k, x^*]}{f[x_k, x_{k-1}]}.

If x_k → x^*, then

\frac{f[x_{k-1}, x_k, x^*]}{f[x_k, x_{k-1}]} \to \frac{\tfrac{1}{2}f''(x^*)}{f'(x^*)}

and \lim_{k \to \infty} \frac{x_{k+1} - x^*}{x_k - x^*} = 0, i.e. the sequence generated by the secant method converges faster than linearly. Defining e_k = |x_k - x^*|, the calculation above suggests

(12.19) e_{k+1} \approx c\, e_k e_{k-1}.

Let us try to determine the rate of convergence of the secant method. Starting with the ansatz e_{k+1} \approx A e_k^p, or equivalently e_{k-1} \approx (e_k/A)^{1/p}, we have

e_{k+1} \approx c\, e_k e_{k-1} \approx c\, e_k \Bigl(\frac{e_k}{A}\Bigr)^{1/p},

which implies

(12.20) \frac{A^{1+\frac{1}{p}}}{c} \approx e_k^{\,1 - p + \frac{1}{p}}.

Since the left hand side is a constant we must have 1 - p + \frac{1}{p} = 0, which gives p = \frac{1 \pm \sqrt{5}}{2}; thus

(12.21) p = \frac{1 + \sqrt{5}}{2} \approx 1.61803

gives the rate of convergence of the secant method. It is better than linear, but worse than quadratic. Sufficient conditions for local convergence are as in Newton's method.
12.7 Fixed Point Iteration
Newton's method is a particular example of a functional iteration of the form

x_{k+1} = g(x_k), \quad k = 0, 1, ...,

with the particular choice of g(x) = x - \frac{f(x)}{f'(x)}. Clearly, if x^* is a zero of f then
x∗ is a fixed point of g, i.e. g(x∗) = x∗. We will look at fixed point iterations as a tool for solving f(x) = 0.
Example 33. Suppose we want to solve x - e^{-x} = 0 in [0, 1]. Then if we take g(x) = e^{-x}, a fixed point of g corresponds to a zero of f.
Definition 22. Let g be defined on an interval [a, b]. We say that g is a contraction or a contractive map if there is a constant L with 0 ≤ L < 1 such that

(12.22) |g(x) - g(y)| \le L|x - y|, \quad \text{for all } x, y \in [a, b].

If x^* is a fixed point of g in [a, b], then
|x_k - x^*| = |g(x_{k-1}) - g(x^*)| \le L|x_{k-1} - x^*| \le L^2|x_{k-2} - x^*| \le \cdots \le L^k|x_0 - x^*| \to 0, \quad \text{as } k \to \infty.
Theorem 41. If g is a contraction on [a, b] and maps [a, b] into [a, b], then g has a unique fixed point x^* in [a, b] and the fixed point iteration converges to it for any initial guess x_0 ∈ [a, b]. Moreover,

(a) |x_k - x^*| \le \frac{L^k}{1 - L}|x_1 - x_0|,

(b) |x_k - x^*| \le L^k|x_0 - x^*|.
Proof. We proved (b) already. Since g: [a, b] → [a, b], the fixed point iteration x_{k+1} = g(x_k), k = 0, 1, ..., is well-defined and

|x_{k+1} - x_k| = |g(x_k) - g(x_{k-1})| \le L|x_k - x_{k-1}| \le \cdots \le L^k|x_1 - x_0|.

Therefore, writing

x_{k+n} - x_n = (x_{k+n} - x_{k+n-1}) + (x_{k+n-1} - x_{k+n-2}) + \cdots + (x_{n+1} - x_n),

we get

|x_{k+n} - x_n| \le |x_{k+n} - x_{k+n-1}| + |x_{k+n-1} - x_{k+n-2}| + \cdots + |x_{n+1} - x_n|
\le L^{k+n-1}|x_1 - x_0| + L^{k+n-2}|x_1 - x_0| + \cdots + L^n|x_1 - x_0|
= L^n|x_1 - x_0| \sum_{j=1}^{k} L^{j-1} \le L^n|x_1 - x_0| \sum_{j=1}^{\infty} L^{j-1} = \frac{L^n}{1 - L}|x_1 - x_0|.

Hence |x_m - x_n| → 0 as m, n → ∞, i.e. \{x_k\} is a Cauchy sequence in [a, b] and it converges to a point x^* in [a, b]. But |x_k - g(x^*)| = |g(x_{k-1}) - g(x^*)| \le L|x_{k-1} - x^*| → 0 as k → ∞, so x^* = g(x^*), i.e. x^* is a fixed point of g. Letting k → ∞ in the bound |x_{k+n} - x_n| \le \frac{L^n}{1-L}|x_1 - x_0| gives (a).

Suppose now that there are two fixed points x_1^*, x_2^* ∈ [a, b]. Then |x_1^* - x_2^*| = |g(x_1^*) - g(x_2^*)| \le L|x_1^* - x_2^*|, so (1 - L)|x_1^* - x_2^*| \le 0, and since 0 ≤ L < 1 we get |x_1^* - x_2^*| = 0, i.e. x_1^* = x_2^*, so the fixed point is unique.
If g is differentiable in (a, b), then by the mean value theorem g(x) - g(y) = g'(ξ)(x - y) for some ξ between x and y,
and if the derivative is bounded by a constant L less than 1, i.e. |g′(x)| ≤ L for all x ∈ (a,b), then |g(x)−g(y)| ≤ L|x−y| with 0 ≤ L < 1, i.e. g is contractive in [a, b].
Example 34. Let g(x) = \frac{1}{4}(x^2 + 3) for x ∈ [0, 1]. Then 0 ≤ g(x) ≤ 1 and |g'(x)| ≤ \frac{1}{2} for all x ∈ [0, 1]. So g is contractive on [0, 1] and the fixed point iteration will converge to the unique fixed point of g in [0, 1]. Note that
x_{k+1} - x^* = g(x_k) - g(x^*) = g'(\xi_k)(x_k - x^*), \quad \text{for some } \xi_k \text{ between } x_k \text{ and } x^*.

Thus,

(12.23) \frac{x_{k+1} - x^*}{x_k - x^*} = g'(\xi_k),

and unless g'(x^*) = 0, the fixed point iteration converges only linearly, when it does converge.
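A sketch of the fixed point iteration applied to Examples 33 and 34 (the function name fixed_point is illustrative):

import math

def fixed_point(g, x0, tol=1e-12, max_iter=200):
    """Iterate x_{k+1} = g(x_k) until successive iterates are within tol."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) <= tol:
            return x_new
        x = x_new
    return x

print(fixed_point(lambda x: math.exp(-x), 0.5))         # fixed point of g(x) = e^{-x}, approximately 0.567143
print(fixed_point(lambda x: 0.25 * (x**2 + 3), 0.5))    # Example 34: fixed point x* = 1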
12.8 Systems of Nonlinear Equations
We now look at the problem of finding numerical approximation to the solu- tion(s) of a nonlinear system of equations f(x) = 0, where f : U ⊆ Rn → Rn.
The main approach to solve a nonlinear system is fixed point iteration
(12.24) xk+1 = G(xk), k = 0,1,...
where we assume that G is defined on a closed set B ⊆ R^n and G: B → B. The map G is a contraction (with respect to some norm ∥ · ∥) if there is a constant L with 0 ≤ L < 1 such that

(12.25) \|G(x) - G(y)\| \le L\|x - y\|, \quad \text{for all } x, y \in B.
Then, as we know, by the contraction map principle, G has a unique fixed point and the sequence generated by the fixed point iteration (12.24) con- verges to it.
Suppose that G is C1 on some convex set B ⊆ Rn, for example a ball. Consider the linear segment x + t(y − x) for t ∈ [0,1] with x,y fixed in B. Define the one-variable function
(12.26) h(t) = G(x + t(y − x)).
Then, by the Chain Rule, h'(t) = DG(x + t(y - x))(y - x), where DG stands for the derivative matrix of G. Using the definition of h and the Fundamental Theorem of Calculus we have

(12.27) G(y) - G(x) = h(1) - h(0) = \int_0^1 h'(t)\,dt = \Bigl(\int_0^1 DG(x + t(y - x))\,dt\Bigr)(y - x).

Thus, if there is 0 ≤ L < 1 such that

(12.28) \|DG(x)\| \le L, \quad \text{for all } x \in B,

for some subordinate norm ∥ · ∥, then

(12.29) \|G(y) - G(x)\| \le L\|y - x\|

and G is a contraction (in that norm). The spectral radius ρ(DG) of DG determines the rate of convergence of the corresponding fixed point iteration.
12.8.1 Newton’s Method
By Taylor theorem
(12.30) f(x) ≈ f(x0) + Df(x0)(x − x0)
so if we take x_1 as the zero of the right hand side of (12.30) we get

(12.31) x_1 = x_0 - [Df(x_0)]^{-1} f(x_0).
Continuing this way, Newton's method for the system of equations f(x) = 0
can be written as
(12.32) xk+1 = xk − [Df(xk)]−1f(xk).
In the implementation of Newton’s method for a system of equations we solve the linear system Df(xk)w = −f(xk) at each iteration and update xk+1 = xk + w.
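A sketch of (12.32) implemented as described, i.e. solving Df(x_k)w = -f(x_k) at each step; the names newton_system, F, and DF and the 2 x 2 test system are illustrative:

import numpy as np

def newton_system(F, DF, x0, tol=1e-12, max_iter=50):
    """Newton's method for a system F(x) = 0 with Jacobian DF."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:
            break
        w = np.linalg.solve(DF(x), -Fx)    # solve Df(x_k) w = -f(x_k)
        x = x + w                          # update x_{k+1} = x_k + w
    return x

# illustrative 2 x 2 system: x^2 + y^2 = 1, y = x^3
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[1] - v[0]**3])
DF = lambda v: np.array([[2*v[0], 2*v[1]], [-3*v[0]**2, 1.0]])
print(newton_system(F, DF, np.array([1.0, 1.0])))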