Formulas from Algebra
$1 + r + r^2 + \cdots + r^{n-1} = \dfrac{r^n - 1}{r - 1}$
$1 + 2 + 3 + \cdots + n = \tfrac{1}{2} n(n + 1)$
$1^2 + 2^2 + 3^2 + \cdots + n^2 = \tfrac{1}{6} n(n + 1)(2n + 1)$
$\log_a x = (\log_a b)(\log_b x)$
$|x| - |y| \le |x \pm y| \le |x| + |y|$
Cauchy-Schwarz Inequality
$\left( \sum_{i=1}^{n} x_i y_i \right)^2 \le \left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i^2 \right)$
Formulas from Geometry
Area of circle: $A = \pi r^2$ ($r$ = radius)
Circumference of circle: $C = 2\pi r$
Area of trapezoid: $A = \tfrac{1}{2} h(a + b)$ ($h$ = height; $a$ and $b$ are parallel bases)
Area of triangle: $A = \tfrac{1}{2} bh$ ($b$ = base, $h$ = height)
Formulas from Trigonometry
$\sin^2 x + \cos^2 x = 1$
$1 + \tan^2 x = \sec^2 x$
$\sin x = 1/\csc x$
$\cos x = 1/\sec x$
$\tan x = 1/\cot x$
$\tan x = \sin x / \cos x$
$\sin x = -\sin(-x)$
$\cos x = \cos(-x)$
$\sin\left(\tfrac{\pi}{2} - x\right) = \cos x$
$\cos\left(\tfrac{\pi}{2} - x\right) = \sin x$
$\sin(x + y) = \sin x \cos y + \cos x \sin y$
$\cos(x + y) = \cos x \cos y - \sin x \sin y$
$\sin x + \sin y = 2 \sin \tfrac{1}{2}(x + y) \cos \tfrac{1}{2}(x - y)$
$\cos x + \cos y = 2 \cos \tfrac{1}{2}(x + y) \cos \tfrac{1}{2}(x - y)$
$\sinh x = \tfrac{1}{2}(e^x - e^{-x})$
$\cosh x = \tfrac{1}{2}(e^x + e^{-x})$
Graphs
[Figure: graphs of sin x, cos x, and tan x, and of arcsin x, arccos x, and arctan x]
Formulas from Analytic Geometry
Slope of line: $m = \dfrac{y_2 - y_1}{x_2 - x_1}$ (two points $(x_1, y_1)$ and $(x_2, y_2)$)
Equation of line: $y - y_1 = m(x - x_1)$
Distance formula: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Circle: $(x - x_0)^2 + (y - y_0)^2 = r^2$ ($r$ = radius, $(x_0, y_0)$ center)
Ellipse: $\dfrac{(x - x_0)^2}{a^2} + \dfrac{(y - y_0)^2}{b^2} = 1$ ($a$ and $b$ semiaxes)
Definitions from Calculus
The limit statement $\lim_{x \to a} f(x) = L$ means that for any $\varepsilon > 0$, there is a $\delta > 0$ such that $|f(x) - L| < \varepsilon$ whenever $0 < |x - a| < \delta$.
A function $f$ is continuous at $x$ if $\lim_{h \to 0} f(x + h) = f(x)$.
If $\lim_{h \to 0} \frac{1}{h}[f(x + h) - f(x)]$ exists, it is denoted by $f'(x)$ or $\frac{d}{dx} f(x)$ and is termed the derivative of $f$ at $x$.
Formulas from Differential Calculus
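$(f \pm g)' = f' \pm g'$
$(fg)' = fg' + f'g$
$\left( \dfrac{f}{g} \right)' = \dfrac{f'g - fg'}{g^2}$
$(f \circ g)' = (f' \circ g)\, g'$
$\frac{d}{dx} x^a = a x^{a-1}$
$\frac{d}{dx} e^x = e^x$
$\frac{d}{dx} e^{ax} = a e^{ax}$
$\frac{d}{dx} a^x = a^x \ln a$
$\frac{d}{dx} x^x = x^x (1 + \ln x)$
$\frac{d}{dx} \ln x = x^{-1}$
$\frac{d}{dx} \log_a x = x^{-1} \log_a e$
$\frac{d}{dx} \sin x = \cos x$
$\frac{d}{dx} \cos x = -\sin x$
$\frac{d}{dx} \tan x = \sec^2 x$
$\frac{d}{dx} \cot x = -\csc^2 x$
$\frac{d}{dx} \sec x = \tan x \sec x$
$\frac{d}{dx} \csc x = -\cot x \csc x$
$\frac{d}{dx} \arcsin x = \dfrac{1}{\sqrt{1 - x^2}}$
$\frac{d}{dx} \arccos x = \dfrac{-1}{\sqrt{1 - x^2}}$
$\frac{d}{dx} \arctan x = \dfrac{1}{1 + x^2}$
$\frac{d}{dx} \operatorname{arccot} x = \dfrac{-1}{1 + x^2}$
$\frac{d}{dx} \operatorname{arcsec} x = \dfrac{1}{x \sqrt{x^2 - 1}}$
$\frac{d}{dx} \operatorname{arccsc} x = \dfrac{-1}{x \sqrt{x^2 - 1}}$
$\frac{d}{dx} \sinh x = \cosh x$
$\frac{d}{dx} \cosh x = \sinh x$
$\frac{d}{dx} \tanh x = \operatorname{sech}^2 x$
$\frac{d}{dx} \coth x = -\operatorname{csch}^2 x$
$\frac{d}{dx} \operatorname{sech} x = -\operatorname{sech} x \tanh x$
$\frac{d}{dx} \operatorname{csch} x = -\operatorname{csch} x \coth x$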
SIXTH EDITION
NUMERICAL MATHEMATICS AND COMPUTING
Ward Cheney
The University of Texas at Austin
David Kincaid
The University of Texas at Austin
Australia • Brazil • Canada • Mexico • Singapore • Spain • United Kingdom • United States
Publisher: Bob Pirtle
Development Editor: Stacy Green
Editorial Assistant: Elizabeth Rodio
Technology Project Manager: Sam Subity
Marketing Manager: Amanda Jellerichs
Marketing Assistant: Ashley Pickering
Marketing Communications Manager: Darlene Amidon-Brent
Project Manager, Editorial Production: Cheryll Linthicum
Creative Director: Rob Hugel
Art Director: Vernon T. Boes
© 2008, 2004 Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.
ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, web distribution, information storage and retrieval systems, or in any other manner—without the written permission of the publisher.
Printed in the United States of America 1 2 3 4 5 6 7 11 10 09 08 07
Print Buyer: Doreen Suruki
Permissions Editor: Bob Kauser
Production Service: Matrix Productions
Text Designer: Roy Neuhaus
Photo Researcher: Terri Wright
Copy Editor: Barbara Willette
Illustrator: ICC Macmillan Inc.
Cover Designer: Denise Davidson
Cover Image: Glowimages/Getty Images
Cover Printer: R.R. Donnelley/Crawfordsville
Compositor: ICC Macmillan Inc.
Printer: R.R. Donnelley/Crawfordsville
Numerical Mathematics and Computing, Sixth edition Ward Cheney, David Kincaid
Dedicated to David M. Young
For more information about our products, contact us at:
Thomson Learning Academic Resource Center 1-800-423-0563
For permission to use material from this text or product, submit a request online at http://www.thomsonrights.com. Any additional questions about permissions can be submitted by e-mail to thomsonrights@thomson.com.
Thomson Higher Education 10 Davis Drive
Belmont, CA 94002-3098 USA
Library of Congress Control Number: 2007922553
Student Edition:
ISBN-13: 978-0-495-11475-8
ISBN-10: 0-495-11475-8
Preface
In preparing the sixth edition of this book, we have adhered to the basic objective of the previous editions, namely, to acquaint students of science and engineering with the potentialities of the modern computer for solving numerical problems that may arise in their professions. A secondary objective is to give students an opportunity to hone their skills in programming and problem solving. A final objective is to help students arrive at an understanding of the important subject of errors that inevitably accompany scientific computing, and to arm them with methods for detecting, predicting, and controlling these errors.
Much of science today involves complex computations built upon mathematical software systems. The users may have little knowledge of the underlying numerical algorithms used in these problem-solving environments. By studying numerical methods one can become a more informed user and be better prepared to evaluate and judge the accuracy of the results. What this implies is that students should study algorithms to learn not only how they work but also how they can fail. Critical thinking and constant skepticism are attitudes we want students to acquire. Any extensive numerical calculation, even when carried out by state-of-the-art software, should be subjected to independent verification, if possible.
Since this book is to be accessible to students who are not necessarily advanced in their formal study of mathematics and computer sciences, we have tried to achieve an elementary style of presentation. Toward this end, we have provided numerous examples and figures for illustrative purposes and fragments of pseudocode, which are informal descriptions of computer algorithms.
Believing that most students at this level need a survey of the subject of numerical mathematics and computing, we have presented a wide diversity of topics, including some rather advanced ones that play an important role in current scientific computing. We recommend that the reader have at least a one-year study of calculus as a prerequisite for our text. Some knowledge of matrices, vectors, and differential equations is helpful.
Features in the Sixth Edition
Following suggestions and comments by a dozen reviewers, we have revised all sections of the book to some degree, and a number of major new features have been added as follows:
• We have moved some items (especially computer codes) from the text to the website so that they are easily accessible without tedious typing. This endeavor includes all of the Matlab, Mathematica, and Maple computer codes as well as the Appendix on Overview of Mathematical Software available on the World Wide Web.
• We have added more figures and numerical examples throughout, believing that concrete codes and visual aids are helpful to every reader.
• New sections and material have been added to many topics, such as the modified false position method, the conjugate gradient method, Simpson’s method, and some others.
• More exercises involving applications are presented throughout.
• There are additional citations to recent references and some older references have been replaced.
• We have reorganized the appendices, adding some new ones and omitting some older ones.
Suggestions for Use
Numerical Mathematics and Computing, Sixth Edition, can be used in a variety of ways, depending on the emphasis the instructor prefers and the inevitable time constraints. Problems have been supplied in abundance to enhance the book’s versatility. They are divided into two categories: Problems and Computer Problems. In the first category, there are more than 800 exercises in analysis that require pencil, paper, and possibly a calculator. In the second category, there are approximately 500 problems that involve writing a program and testing it on a computer. Students can be asked to solve some problems using advanced software systems such as Matlab, Mathematica, or Maple. Alternatively, students can be asked to write their own code. Readers can often follow a model or example in the text to assist them in working out exercises, but in other cases they must proceed on their own from a mathematical description given in the text or in the problems.
In some of the computer problems, there is something to be learned beyond simply writing code—a moral, if you like. This can happen if the problem being solved and the code provided to do so are somehow mismatched. Some computing problems are designed to give experience in using either mathematical software systems, precoded programs, or black-box library codes.
A Student’s Solution Manual is sold as a separate publication. Also, teachers who adopt the book can obtain from the publisher the Instructor’s Solution Manual. Sample programs based on the pseudocode displayed in this text have been coded in several programming languages. These codes and additional material are available on the textbook websites:
www.thomsonedu.com/math/cheney www.ma.utexas.edu/CNA/NMC6/
The arrangement of chapters reflects our own view of how the material might best unfold for a student new to the subject. However, there is very little mutual dependence among the chapters, and the instructor can order the sequence of presentation in various ways. Most courses will certainly have to omit some sections and chapters for want of time.
Our own recommendations for courses based on this text are as follows:
• A one-term course carefully covering Chapters 1 through 11 (possibly omitting Chapters 5 and 8 and Sections 4.2, 9.3, 10.3, and 11.3, for example), followed by a selection of material from the remaining chapters as time permits.
• A one-term survey rapidly skimming over most of the chapters in the text and omitting some of the more difficult sections.
• A two-term course carefully covering all chapters.
Student Research Projects
Throughout the book there are some computer problems designated as Student Research Projects. These suggest opportunities for students to explore topics beyond the scope of the textbook. Many of these involve application areas for numerical methods. The projects should include programming and numerical experiments. A favorable aspect of these assignments is to allow students to choose a topic of interest to them, possibly something that may arise in their future profession or their major study area. For example, any topic suggested by the chapters and sections in the book may be delved into more deeply by consulting other texts and references on that topic. In preparing such a project, the students have to learn about the topic, locate the significant references (books and research papers), do the computing, and write a report that explains all this in a coherent way. Students can avail themselves of mathematical software systems such as Matlab, Maple, or Mathematica, or do their own programming in whatever language they prefer.
Acknowledgments
In preparing the sixth edition, we have been able to profit from advice and suggestions kindly offered by a large number of colleagues, students, and users of the previous edition. We wish to acknowledge the reviewers who have provided detailed critiques for this new edition: Krishan Agrawal, Thomas Boger, Charles Collins, Gentil A. Estévez, Terry Feagin, Mahadevan Ganesh, William Gearhart, Juan Gil, Xiaofan Li, Vania Mascioni, Bernard Maxum, Amar Raheja, Daniel Reynolds, Asok Sen, Ching-Kuang Shene, William Slough, Thiab Taha, Jin Wang, Quiang Ye, Tjalling Ypma, and Shangyou Zhan. In particular, Jose Flores was most helpful in checking over the manuscript.
Reviewers from previous editions were Neil Berger, Jose E. Castillo, Charles Cullen, Elias Y. Deeba, F. Emad, Terry Feagin, Leslie Foster, Bob Funderlic, John Gregory, Bruce P. Hillam, Patrick Lang, Ren Chi Li, Wu Li, Edward Neuman, Roy Nicolaides, J. N. Reddy, Ralph Smart, Stephen Wirkus, and Marcus Wright.
We thank those who have helped in various capacities. Many individuals took the trouble to write us with suggestions and criticisms of previous editions of this book: A. Aawwal, Nabeel S. Abo-Ghander, Krishan Agrawal, Roger Alexander, Husain Ali Al-Mohssen, Kistone Anand, Keven Anderson, Vladimir Andrijevik, Jon Ashland, Hassan Basir, Steve Batterson, Neil Berger, Adarsh Beohar, Bernard Bialecki, Jason Brazile, Keith M. Briggs, Carl de Boor, Jose E. Castillo, Ellen Chen, Edmond Chow, John Cook, Roger Crawfis, Charles Cullen, Antonella Cupillari, Jonathan Dautrich, James Arthur Davis, Tim Davis, Elias Y. Deeba, Suhrit Dey, Alan Donoho, Jason Durheim, Wayne Dymacek, Fawzi P. Emad, Paul Enigenbury, Terry Feagin, Leslie Foster, Peter Fraser, Richard Gardner, John Gregory, Katherine Hua Guo, Scott Hagerup, Kent Harris, Bruce P. Hillam, Tom Hogan, Jackie Hohnson, Christopher M. Hoss, Kwang-il In, Victoria Interrante, Sadegh Jokar, Erni Jusuf, Jason Karns, Grant Keady, Jacek Kierzenka, S. A. (Seppo) Korpela, Andrew Knyazev, Gary Krenz, Jihoon Kwak, Kim Kyungjin, Minghorng Lai, Patrick Lang, Wu Li, Grace Liu, Wenguo Liu, Mark C. Malburg, P. W. Manual, Juan Meza, F. Milianazzo, Milan Miklavcic, Sue Minkoff, George Minty, Baharen Momken, Justin Montgomery, Ramon E. Moore, Aaron Naiman, Asha Nallana, Edward Neuman, Durene Ngo, Roy Nicolaides, Jeff Nunemacher, Valia Guerra Ones, Tony Praseuth, Rolfe G. Petschek, Mihaela Quirk, Helia Niroomand Rad, Jeremy Rahe, Frank Roberts, Frank Rogers, Simen Rokaas, Robert
S. Raposo, Chris C. Seib, Granville Sewell, Keh-Ming Shyue, Daniel Somerville, Nathan Smith, Mandayam Srinivas, Alexander Stromberger, Xingping Sun, Thiab Taha, Hidajaty Thajeb, Joseph Traub, Phuoc Truong, Vincent Tsao, Bi Roubolo Vona, David Wallace, Charles Walters, Kegnag Wang, Layne T. Watson, Andre Weideman, Perry Wong, Yuan Xu, and Rick Zaccone.
Valuable comments and suggestions were made by our colleagues and friends. In particular, David Young was very generous with suggestions for improving the accuracy and clarity of the exposition in previous editions. Some parts of previous editions were typed with great care and attention to detail by Katy Burrell, Kata Carbone, and Belinda Trevino. Aaron Naiman at Jerusalem College of Technology was particularly helpful in preparing view-graphs for a course based on this book.
It is our pleasure to thank those who helped with the task of preparing the new edition. The staff of Brooks/Cole and associated individuals have been most understanding and patient in bringing this book to fruition. In particular, we thank Bob Pirtle, Stacy Green, Elizabeth Rodio, and Cheryll Linthicum for their efforts on behalf of this project. Some of those who were involved with previous editions were Seema Atwal, Craig Barth, Carol Benedict, Gary Ostedt, Jeremy Hayhurst, Janet Hill, Ragu Raghavan, Anne Seitz, Marlene Thom, and Elizabeth Rammel. We also thank Merrill Peterson and Sara Planck at Matrix Productions Inc. for providing the LaTeX macros and for help in putting the book into final form.
We would appreciate any comments, questions, criticisms, or corrections that readers may communicate to us. For this, e-mail is especially efficient.
Ward Cheney
Department of Mathematics
cheney@math.utexas.edu
David Kincaid
Department of Computer Sciences
kincaid@cs.utexas.edu
Contents
1 Introduction
  1.1 Preliminary Remarks
    Significant Digits of Precision: Examples
    Errors: Absolute and Relative
    Accuracy and Precision
    Rounding and Chopping
    Nested Multiplication
    Pairs of Easy/Hard Problems
    First Programming Experiment
    Mathematical Software
    Summary
    Additional References
    Problems 1.1
    Computer Problems 1.1
  1.2 Review of Taylor Series
    Taylor Series
    Complete Horner’s Algorithm
    Taylor’s Theorem in Terms of (x − c)
    Mean-Value Theorem
    Taylor’s Theorem in Terms of h
    Alternating Series
    Summary
    Additional References
    Problems 1.2
    Computer Problems 1.2
2 Floating-Point Representation and Errors
  2.1 Floating-Point Representation
    Normalized Floating-Point Representation
    Floating-Point Representation
    Single-Precision Floating-Point Form
    Double-Precision Floating-Point Form
    Computer Errors in Representing Numbers
    Notation fl(x) and Backward Error Analysis
    Historical Notes
    Summary
    Problems 2.1
    Computer Problems 2.1
  2.2 Loss of Significance
    Significant Digits
    Computer-Caused Loss of Significance
    Theorem on Loss of Precision
    Avoiding Loss of Significance in Subtraction
    Range Reduction
    Summary
    Additional References
    Problems 2.2
    Computer Problems 2.2
3 Locating Roots of Equations
  3.1 Bisection Method
    Introduction
    Bisection Algorithm and Pseudocode
    Examples
    Convergence Analysis
    False Position (Regula Falsi) Method and Modifications
    Summary
    Problems 3.1
    Computer Problems 3.1
  3.2 Newton’s Method
    Interpretations of Newton’s Method
    Pseudocode
    Illustration
    Convergence Analysis
    Systems of Nonlinear Equations
    Fractal Basins of Attraction
    Summary
    Additional References
    Problems 3.2
    Computer Problems 3.2
  3.3 Secant Method
    Secant Algorithm
    Convergence Analysis
    Comparison of Methods
    Hybrid Schemes
    Fixed-Point Iteration
    Summary
    Additional References
    Problems 3.3
    Computer Problems 3.3
4 Interpolation and Numerical Differentiation
  4.1 Polynomial Interpolation
    Preliminary Remarks
    Polynomial Interpolation
    Interpolating Polynomial: Lagrange Form
    Existence of Interpolating Polynomial
    Interpolating Polynomial: Newton Form
    Nested Form
    Calculating Coefficients a_i Using Divided Differences
    Algorithms and Pseudocode
    Vandermonde Matrix
    Inverse Interpolation
    Polynomial Interpolation by Neville’s Algorithm
    Interpolation of Bivariate Functions
    Summary
    Problems 4.1
    Computer Problems 4.1
  4.2 Errors in Polynomial Interpolation
    Dirichlet Function
    Runge Function
    Theorems on Interpolation Errors
    Summary
    Problems 4.2
    Computer Problems 4.2
  4.3 Estimating Derivatives and Richardson Extrapolation
    First-Derivative Formulas via Taylor Series
    Richardson Extrapolation
    First-Derivative Formulas via Interpolation Polynomials
    Second-Derivative Formulas via Taylor Series
    Noise in Computation
    Summary
    Additional References for Chapter 4
    Problems 4.3
    Computer Problems 4.3
5 Numerical Integration
  5.1 Lower and Upper Sums
    Definite and Indefinite Integrals
    Lower and Upper Sums
    Riemann-Integrable Functions
    Examples and Pseudocode
    Summary
    Problems 5.1
    Computer Problems 5.1
  5.2 Trapezoid Rule
    Uniform Spacing
    Error Analysis
    Applying the Error Formula
    Recursive Trapezoid Formula for Equal Subintervals
    Multidimensional Integration
    Summary
    Problems 5.2
    Computer Problems 5.2
  5.3 Romberg Algorithm
    Description
    Pseudocode
    Euler-Maclaurin Formula
    General Extrapolation
    Summary
    Additional References
    Problems 5.3
    Computer Problems 5.3
6 Additional Topics on Numerical Integration
  6.1 Simpson’s Rule and Adaptive Simpson’s Rule
    Basic Simpson’s Rule
    Simpson’s Rule
    Composite Simpson’s Rule
    An Adaptive Simpson’s Scheme
    Example Using Adaptive Simpson Procedure
    Newton-Cotes Rules
    Summary
    Problems 6.1
    Computer Problems 6.1
  6.2 Gaussian Quadrature Formulas
    Description
    Change of Intervals
    Gaussian Nodes and Weights
    Legendre Polynomials
    Integrals with Singularities
    Summary
    Additional References
    Problems 6.2
    Computer Problems 6.2
7 Systems of Linear Equations
  7.1 Naive Gaussian Elimination
    A Larger Numerical Example
    Algorithm
    Pseudocode
    Testing the Pseudocode
    Residual and Error Vectors
    Summary
    Problems 7.1
    Computer Problems 7.1
  7.2 Gaussian Elimination with Scaled Partial Pivoting
    Naive Gaussian Elimination Can Fail
    Partial Pivoting and Complete Partial Pivoting
    Gaussian Elimination with Scaled Partial Pivoting
    A Larger Numerical Example
    Pseudocode
    Long Operation Count
    Numerical Stability
    Scaling
    Summary
    Problems 7.2
    Computer Problems 7.2
  7.3 Tridiagonal and Banded Systems
    Tridiagonal Systems
    Strictly Diagonal Dominance
    Pentadiagonal Systems
    Block Pentadiagonal Systems
    Summary
    Additional References
    Problems 7.3
    Computer Problems 7.3
8 Additional Topics Concerning Systems of Linear Equations
  8.1 Matrix Factorizations
    Numerical Example
    Formal Derivation
    Pseudocode
    Solving Linear Systems Using LU Factorization
    LDL^T Factorization
    Cholesky Factorization
    Multiple Right-Hand Sides
    Computing A^{-1}
    Example Using Software Packages
    Summary
    Problems 8.1
    Computer Problems 8.1
  8.2 Iterative Solutions of Linear Systems
    Vector and Matrix Norms
    Condition Number and Ill-Conditioning
    Basic Iterative Methods
    Pseudocode
    Convergence Theorems
    Matrix Formulation
    Another View of Overrelaxation
    Conjugate Gradient Method
    Summary
    Problems 8.2
    Computer Problems 8.2
  8.3 Eigenvalues and Eigenvectors
    Calculating Eigenvalues and Eigenvectors
    Mathematical Software
    Properties of Eigenvalues
    Gershgorin’s Theorem
    Singular Value Decomposition
    Numerical Examples of Singular Value Decomposition
    Application: Linear Differential Equations
    Application: A Vibration Problem
    Summary
    Problems 8.3
    Computer Problems 8.3
  8.4 Power Method
    Power Method Algorithms
    Aitken Acceleration
    Inverse Power Method
    Software Examples: Inverse Power Method
    Shifted (Inverse) Power Method
    Example: Shifted Inverse Power Method
    Summary
    Additional References
    Problems 8.4
    Computer Problems 8.4
9 Approximation by Spline Functions
  9.1 First-Degree and Second-Degree Splines
    First-Degree Spline
    Modulus of Continuity
    Second-Degree Splines
    Interpolating Quadratic Spline Q(x)
    Subbotin Quadratic Spline
    Summary
    Problems 9.1
    Computer Problems 9.1
  9.2 Natural Cubic Splines
    Introduction
    Natural Cubic Spline
    Algorithm for Natural Cubic Spline
    Pseudocode for Natural Cubic Splines
    Using Pseudocode for Interpolating and Curve Fitting
    Space Curves
    Smoothness Property
    Summary
    Problems 9.2
    Computer Problems 9.2
  9.3 B Splines: Interpolation and Approximation
    Interpolation and Approximation by B Splines
    Pseudocode and a Curve-Fitting Example
    Schoenberg’s Process
    Pseudocode
    Bézier Curves
    Summary
    Additional References
    Problems 9.3
    Computer Problems 9.3
10 Ordinary Differential Equations
  10.1 Taylor Series Methods
    Initial-Value Problem: Analytical versus Numerical Solution
    An Example of a Practical Problem
    Solving Differential Equations and Integration
    Vector Fields
    Taylor Series Methods
    Euler’s Method Pseudocode
    Taylor Series Method of Higher Order
    Types of Errors
    Taylor Series Method Using Symbolic Computations
    Summary
    Problems 10.1
    Computer Problems 10.1
  10.2 Runge-Kutta Methods
    Taylor Series for f(x, y)
    Runge-Kutta Method of Order 2
    Runge-Kutta Method of Order 4
    Pseudocode
    Summary
    Problems 10.2
    Computer Problems 10.2
  10.3 Stability and Adaptive Runge-Kutta and Multistep Methods
    An Adaptive Runge-Kutta-Fehlberg Method
    An Industrial Example
    Adams-Bashforth-Moulton Formulas
    Stability Analysis
    Summary
    Additional References
    Problems 10.3
    Computer Problems 10.3
11 Systems of Ordinary Differential Equations
  11.1 Methods for First-Order Systems
    Uncoupled and Coupled Systems
    Taylor Series Method
    Vector Notation
    Systems of ODEs
    Taylor Series Method: Vector Notation
    Runge-Kutta Method
    Autonomous ODE
    Summary
    Problems 11.1
    Computer Problems 11.1
  11.2 Higher-Order Equations and Systems
    Higher-Order Differential Equations
    Systems of Higher-Order Differential Equations
    Autonomous ODE Systems
    Summary
    Problems 11.2
    Computer Problems 11.2
  11.3 Adams-Bashforth-Moulton Methods
    A Predictor-Corrector Scheme
    Pseudocode
    An Adaptive Scheme
    An Engineering Example
    Some Remarks about Stiff Equations
    Summary
    Additional References
    Problems 11.3
    Computer Problems 11.3
12 Smoothing of Data and the Method of Least Squares
  12.1 Method of Least Squares
    Linear Least Squares
    Linear Example
    Nonpolynomial Example
    Basis Functions {g_0, g_1, ..., g_n}
    Summary
    Problems 12.1
    Computer Problems 12.1
  12.2 Orthogonal Systems and Chebyshev Polynomials
    Orthonormal Basis Functions {g_0, g_1, ..., g_n}
    Outline of Algorithm
    Smoothing Data: Polynomial Regression
    Summary
    Problems 12.2
    Computer Problems 12.2
  12.3 Other Examples of the Least-Squares Principle
    Use of a Weight Function w(x)
    Nonlinear Example
    Linear and Nonlinear Example
    Additional Details on SVD
    Using the Singular Value Decomposition
    Summary
    Additional References
    Problems 12.3
    Computer Problems 12.3
13 Monte Carlo Methods and Simulation
  13.1 Random Numbers
    Random-Number Algorithms and Generators
    Examples
    Uses of Pseudocode Random
    Summary
    Problems 13.1
    Computer Problems 13.1
  13.2 Estimation of Areas and Volumes by Monte Carlo Techniques
    Numerical Integration
    Example and Pseudocode
    Computing Volumes
    Ice Cream Cone Example
    Summary
    Problems 13.2
    Computer Problems 13.2
  13.3 Simulation
    Loaded Die Problem
    Birthday Problem
    Buffon’s Needle Problem
    Two Dice Problem
    Neutron Shielding
    Summary
    Additional References
    Computer Problems 13.3
14 Boundary-Value Problems for Ordinary Differential Equations
  14.1 Shooting Method
    Shooting Method Algorithm
    Modifications and Refinements
    Summary
    Problems 14.1
    Computer Problems 14.1
  14.2 A Discretization Method
    Finite-Difference Approximations
    The Linear Case
    Pseudocode and Numerical Example
    Shooting Method in the Linear Case
    Pseudocode and Numerical Example
    Summary
    Additional References
    Problems 14.2
    Computer Problems 14.2
15 Partial Differential Equations
  15.1 Parabolic Problems
    Some Partial Differential Equations from Applied Problems
    Heat Equation Model Problem
    Finite-Difference Method
    Pseudocode for Explicit Method
    Crank-Nicolson Method
    Pseudocode for the Crank-Nicolson Method
    Alternative Version of the Crank-Nicolson Method
    Stability
    Summary
    Problems 15.1
    Computer Problems 15.1
  15.2 Hyperbolic Problems
    Wave Equation Model Problem
    Analytic Solution
    Numerical Solution
    Pseudocode
    Advection Equation
    Lax Method
    Upwind Method
    Lax-Wendroff Method
    Summary
    Problems 15.2
    Computer Problems 15.2
  15.3 Elliptic Problems
    Helmholtz Equation Model Problem
    Finite-Difference Method
    Gauss-Seidel Iterative Method
    Numerical Example and Pseudocode
    Finite-Element Methods
    More on Finite Elements
    Summary
    Additional References
    Problems 15.3
    Computer Problems 15.3
16 Minimization of Functions
  16.1 One-Variable Case
    Unconstrained and Constrained Minimization Problems
    One-Variable Case
    Unimodal Functions F
    Fibonacci Search Algorithm
    Golden Section Search Algorithm
    Quadratic Interpolation Algorithm
    Summary
    Problems 16.1
    Computer Problems 16.1
  16.2 Multivariate Case
    Taylor Series for F: Gradient Vector and Hessian Matrix
    Alternative Form of Taylor Series
    Steepest Descent Procedure
    Contour Diagrams
    More Advanced Algorithms
    Minimum, Maximum, and Saddle Points
    Positive Definite Matrix
    Quasi-Newton Methods
    Nelder-Mead Algorithm
    Method of Simulated Annealing
    Summary
    Additional References
    Problems 16.2
    Computer Problems 16.2
17 Linear Programming
  17.1 Standard Forms and Duality
    First Primal Form
    Numerical Example
    Transforming Problems into First Primal Form
    Dual Problem
    Second Primal Form
    Summary
    Problems 17.1
    Computer Problems 17.1
  17.2 Simplex Method
    Vertices in K and Linearly Independent Columns of A
    Simplex Method
    Summary
    Problems 17.2
    Computer Problems 17.2
  17.3 Approximate Solution of Inconsistent Linear Systems
    ℓ_1 Problem
    ℓ_∞ Problem
    Summary
    Additional References
    Problems 17.3
    Computer Problems 17.3
Appendix A Advice on Good Programming Practices
  A.1 Programming Suggestions
    Case Studies
    On Developing Mathematical Software
Appendix B Representation of Numbers in Different Bases
  B.1 Representation of Numbers in Different Bases
    Base β Numbers
    Conversion of Integer Parts
    Conversion of Fractional Parts
    Base Conversion 10 ↔ 8 ↔ 2
    Base 16
    More Examples
    Summary
    Problems B.1
    Computer Problems B.1
Appendix C Additional Details on IEEE Floating-Point Arithmetic
  C.1 More on IEEE Standard Floating-Point Arithmetic
Appendix D Linear Algebra Concepts and Notation
  D.1 Elementary Concepts
    Vectors
    Matrices
    Matrix-Vector Product
    Matrix Product
    Other Concepts
    Cramer’s Rule
  D.2 Abstract Vector Spaces
    Subspaces
    Linear Independence
    Bases
    Linear Transformations
    Eigenvalues and Eigenvectors
    Change of Basis and Similarity
    Orthogonal Matrices and Spectral Theorem
    Norms
    Gram-Schmidt Process
Answers for Selected Problems
Bibliography
Index
1 Introduction
The Taylor series for the natural logarithm $\ln(1 + x)$, evaluated at $x = 1$, gives
$\ln 2 = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \frac{1}{7} - \frac{1}{8} + \cdots$
Adding together the eight terms shown, we obtain $\ln 2 \approx 0.63452$∗, which is a poor approximation to $\ln 2 = 0.69315\ldots$. On the other hand, the Taylor series for $\ln[(1 + x)/(1 - x)]$ with $x = \frac{1}{3}$ gives us
$\ln 2 = 2 \left( 3^{-1} + \frac{3^{-3}}{3} + \frac{3^{-5}}{5} + \frac{3^{-7}}{7} + \cdots \right)$
By adding the four terms shown between the parentheses and multiplying by 2, we obtain $\ln 2 \approx 0.69313$. This illustrates the fact that rapid convergence of a Taylor series can be expected near the point of expansion but not at remote points. Evaluating the series $\ln[(1 + x)/(1 - x)]$ at $x = \frac{1}{3}$ is a mechanism for evaluating $\ln 2$ near the point of expansion. It also gives an example in which the properties of a function can be exploited to obtain a more rapidly convergent series. Examples like this will become clearer after the reader has studied Section 1.2. Taylor series and Taylor’s Theorem are two of the principal topics we discuss in this chapter. They are ubiquitous features in much of numerical analysis.
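The two partial sums quoted above can be checked with a few lines of code. The following is a minimal Python sketch, not taken from the text (which uses language-neutral pseudocode); the function names `ln2_alternating` and `ln2_ratio_series` are hypothetical and chosen only for this illustration.

```python
import math


def ln2_alternating(terms):
    """Partial sum of ln(1 + x) = x - x^2/2 + x^3/3 - ... evaluated at x = 1."""
    return sum((-1) ** (k + 1) / k for k in range(1, terms + 1))


def ln2_ratio_series(terms, x=1.0 / 3.0):
    """Partial sum of ln[(1 + x)/(1 - x)] = 2(x + x^3/3 + x^5/5 + ...)."""
    return 2.0 * sum(x ** (2 * k + 1) / (2 * k + 1) for k in range(terms))


slow = ln2_alternating(8)   # eight terms, far from the point of expansion
fast = ln2_ratio_series(4)  # four terms, close to the point of expansion

print(f"8 terms of ln(1 + x) at x = 1:           {slow:.5f}")
print(f"4 terms of ln[(1+x)/(1-x)] at x = 1/3:   {fast:.5f}")
print(f"math.log(2):                             {math.log(2):.5f}")
```

With half as many terms, the second series is accurate to about four decimal places, whereas the first is correct only in its leading digit.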
1.1 Preliminary Remarks
The objective of this text is to help the reader in understanding some of the many methods for solving scientific problems on a modern computer. We intentionally limit ourselves to the typical problems that arise in science, engineering, and technology. Thus, we do not touch on problems of accounting, modeling in the social sciences, information retrieval, artificial intelligence, and so on.
∗The symbol ≈ means “approximately equal to.”
Usually, our treatment of problems will not begin at the source, for that would take us far afield into such areas as physics, engineering, and chemistry. Instead, we consider problems after they have been cast into certain standard mathematical forms. The reader is therefore asked to accept on faith the assertion that the chosen topics are indeed important ones in scientific computing.
To survey many topics, we must treat some in a superficial way. But it is hoped that the reader will acquire a good bird’s-eye view of the subject and therefore will be better prepared for a further, deeper study of numerical analysis.
For each principal topic, we list good current sources for more information. In any realistic computing situation, considerable thought should be given to the choice of method to be employed. Although most procedures presented here are useful and important, they may not be the optimum ones for a particular problem. In choosing among available methods for solving a problem, the analyst or programmer should consult recent references.
Becoming familiar with basic numerical methods without realizing their limitations would be foolhardy. Numerical computations are almost invariably contaminated by errors, and it is important to understand the source, propagation, magnitude, and rate of growth of these errors. Numerical methods that provide approximations and error estimates are more valuable than those that provide only approximate answers. While we cannot help but be impressed by the speed and accuracy of the modern computer, we should temper our admiration with generous measures of skepticism. As the eminent numerical analyst Carl-Erik Fröberg once remarked:
Never in the history of mankind has it been possible to produce so many wrong answers so quickly!
Thus, one of our goals is to help the reader arrive at this state of skepticism, armed with methods for detecting, estimating, and controlling errors.
The reader is expected to be familiar with the rudiments of programming. Algorithms are presented as pseudocode, and no particular programming language is adopted.
Some of the primary issues related to numerical methods are the nature of numerical errors, the propagation of errors, and the efficiency of the computations involved, as well as the number of operations and their possible reduction.
Many students have graphing calculators and access to mathematical software systems that can produce solutions to complicated numerical problems with minimal difficulty. The purpose of a numerical mathematics course is to examine the underlying algorithmic techniques so that students learn how the software or calculator found the answer. In this way, they would have a better understanding of the inherent limits on the accuracy that must be anticipated in working with such systems.
One of the fundamental strategies behind many numerical methods is the replacement of a difficult problem with a string of simpler ones. By carrying out an iterative process, the solutions of the simpler problems can be put together to obtain the solution of the original, difficult problem. This strategy succeeds in finding zeros of functions (Chapter 3), interpolation (Chapter 4), numerical integration (Chapters 5–6), and solving linear systems (Chapters 7–8).
Students majoring in computer science and mathematics as well as those majoring in engineering and other sciences are usually well aware that numerical methods are needed to solve problems that they frequently encounter. It may not be as well recognized that
scientific computing is quite important for solving problems that come from fields other than engineering and science, such as economics. For example, finding zeros of functions may arise in problems using the formulas for loans, interest, and payment schedules. Also, problems in areas such as those involving the stock market may require least-squares solutions (Chapter 12). In fact, the field of computational finance requires solving quite complex mathematical problems utilizing a great deal of computing power. Economic models routinely require the analysis of linear systems of equations with thousands of unknowns.
Significant Digits of Precision: Examples
Significant digits are digits beginning with the leftmost nonzero digit and ending with the rightmost correct digit, including final zeros that are exact.
EXAMPLE 1
In a machine shop, a technician cuts a 2-meter by 3-meter rectangular sheet of metal into two equal triangular pieces. What is the diagonal measurement of each triangle? Can these pieces be slightly modified so the diagonals are exactly 3.6 meters?
Solution
Since the piece is rectangular, the Pythagorean Theorem can be invoked. Thus, to compute the diagonal, we write $2^2 + 3^2 = d^2$, where d is the diagonal. It follows that
\[ d = \sqrt{4 + 9} = \sqrt{13} = 3.605551275 \]
This last number is obtained by using a hand-held calculator. The accuracy of d as given can be verified by computing (3.605551275) × (3.605551275) = 13. Is this value for the diagonal, d, to be taken seriously? Certainly not. To begin with, the given dimensions of the rectangle cannot be expected to be precisely 2 and 3. If the dimensions are accurate to one millimeter, the dimensions may be as large as 2.001 and 3.001. Using the Pythagorean Theorem again, one finds that the diagonal may be as large as
\[ d = \sqrt{2.001^2 + 3.001^2} = \sqrt{4.004001 + 9.006001} = \sqrt{13.010002} \approx 3.6069 \]
Similar reasoning indicates that d may be as small as 3.6042. These are both worst cases. We can conclude that
\[ 3.6042 \le d \le 3.6069 \]
No greater accuracy can be claimed for the diagonal, d.
If we want the diagonal to be exactly 3.6, we require
\[ (3 - c)^2 + (2 - c)^2 = 3.6^2 \]
For simplicity, we reduce each side by the same amount c. This leads to
\[ c^2 - 5c + 0.02 = 0 \]
Using the quadratic formula, we obtain the smaller root
\[ c = 2.5 - \sqrt{6.23} \approx 0.00400 \]
By cutting off 4 millimeters from the two perpendicular sides, we have triangular pieces of sizes 1.996 by 2.996 meters. Checking, we obtain $(1.996)^2 + (2.996)^2 \approx 3.6^2$. ■
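The arithmetic in this example is easy to reproduce. The following Python sketch is an illustration only; the helper name diagonal and the variable names are assumptions, not part of the text. It recomputes the worst-case bounds on d and the amount c to trim.

```python
import math


def diagonal(a: float, b: float) -> float:
    """Length of the diagonal of an a-by-b rectangle."""
    return math.hypot(a, b)


tol = 0.001  # assumed measurement accuracy: one millimeter

d_nominal = diagonal(2.0, 3.0)
d_low = diagonal(2.0 - tol, 3.0 - tol)
d_high = diagonal(2.0 + tol, 3.0 + tol)

# Smaller root of c^2 - 5c + 0.02 = 0 from the quadratic formula.
c = 2.5 - math.sqrt(6.23)
d_trimmed = diagonal(2.0 - c, 3.0 - c)

if __name__ == "__main__":
    print(f"nominal d = {d_nominal:.9f}")
    print(f"{d_low:.4f} <= d <= {d_high:.4f}")
    print(f"c = {c:.5f}, trimmed diagonal = {d_trimmed:.6f}")
```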
To show the effect of the number of significant digits used in a calculation, we consider the problem of solving a linear system of equations.
EXAMPLE 2
Let us concentrate on solving for the variable y in this linear system of equations in two variables:
\[ \begin{aligned} 0.1036\,x + 0.2122\,y &= 0.7381 \\ 0.2081\,x + 0.4247\,y &= 0.9327 \end{aligned} \tag{1} \]
First, carry only three significant digits of precision in the calculations. Second, repeat with four significant digits throughout. Finally, use ten significant digits.
Solution
In the first task, we round all numbers in the original problem to three digits and round all the calculations, keeping only three significant digits. We take a multiple α of the first equation and subtract it from the second equation to eliminate the x-term in the second equation. The multiplier is α = 0.208/0.104 ≈ 2.00. Thus, in the second equation, the new coefficient of the x-term is 0.208 − (2.00)(0.104) ≈ 0.208 − 0.208 = 0 and the new y-term coefficient is 0.425 − (2.00)(0.212) ≈ 0.425 − 0.424 = 0.001. The right-hand side is 0.933 − (2.00)(0.738) = 0.933 − 1.48 = −0.547. Hence, we find that y = −0.547/(0.001) ≈ −547.
We decide to keep four significant digits throughout and repeat the calculations. Now the multiplier is α = 0.2081/0.1036 ≈ 2.009. In the second equation, the new coefficient of the x-term is 0.2081 − (2.009)(0.1036) ≈ 0.2081 − 0.2081 = 0, the new coefficient of the y-term is 0.4247 − (2.009)(0.2122) ≈ 0.4247 − 0.4263 = −0.001600, and the new right-hand side is 0.9327 − (2.009)(0.7381) ≈ 0.9327 − 1.483 ≈ −0.5503. Hence, we find y = −0.5503/(−0.001600) ≈ 343.9. We are shocked to find that the answer has changed from −547 to 343.9, which is a huge difference!
In fact, if we repeat this process and carry ten significant decimal digits, we find that even 343.9 is not accurate, since we obtain 356.2907199. The lesson learned in this example is that data thought to be accurate should be carried with full precision and not be rounded off prior to each of the calculations. ■
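The effect of the working precision can be reproduced with decimal arithmetic. The sketch below uses Python's standard decimal module and assumes that rounding every input and every intermediate result to the context precision (round-half-even, the module's default) mirrors the hand calculation; the function name solve_for_y is illustrative.

```python
from decimal import Decimal, localcontext


def solve_for_y(digits: int) -> Decimal:
    """Eliminate x from system (1) carrying `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        # Unary plus rounds each input to the current context precision.
        a11, a12, b1 = +Decimal("0.1036"), +Decimal("0.2122"), +Decimal("0.7381")
        a21, a22, b2 = +Decimal("0.2081"), +Decimal("0.4247"), +Decimal("0.9327")
        alpha = a21 / a11            # multiplier
        new_a22 = a22 - alpha * a12  # new y coefficient in the second equation
        new_b2 = b2 - alpha * b1     # new right-hand side
        return new_b2 / new_a22


if __name__ == "__main__":
    for digits in (3, 4, 10):
        print(digits, solve_for_y(digits))
```

With three and four digits this reproduces the values −547 and 343.9 found above, and with ten digits the result is close to the ten-digit value quoted in the text.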
In most computers, the arithmetic operations are carried out in a double-length accumulator that has twice the precision of the stored quantities. However, even this may not avoid a loss of accuracy! Loss of accuracy can happen in various ways such as from roundoff errors and subtracting nearly equal numbers. We shall discuss loss of precision in Chapter 2, and the solving of linear systems in Chapter 7.
Figure 1.1 shows a geometric illustration of what can happen in solving two equations in two unknowns. The point of intersection of the two lines is the exact solution. As is shown by the dotted lines, there may be a degree of uncertainty from errors in the measurements or roundoff errors. So instead of a sharply defined point, there may be a small trapezoidal area containing many possible solutions. However, if the two lines are nearly parallel, then
this area of possible solutions can increase dramatically! This is related to well-conditioned and ill-conditioned systems of linear equations, which are discussed more in Chapter 8.
FIGURE 1.1 In 2D, well-conditioned and ill-conditioned linear systems
Errors: Absolute and Relative
Suppose that α and β are two numbers, of which one is regarded as an approximation to the other. The error of β as an approximation to α is α − β; that is, the error equals the exact value minus the approximate value. The absolute error of β as an approximation to α is |α − β|. The relative error of β as an approximation to α is |α − β|/|α|. Notice that in computing the absolute error, the roles of α and β are the same, whereas in computing the relative error, it is essential to distinguish one of the two numbers as correct. (Observe that the relative error is undefined in the case α = 0.) For practical reasons, the relative error is usually more meaningful than the absolute error. For example, if $\alpha_1 = 1.333$, $\beta_1 = 1.334$, and $\alpha_2 = 0.001$, $\beta_2 = 0.002$, then the absolute error of $\beta_i$ as an approximation to $\alpha_i$ is the same in both cases, namely, $10^{-3}$. However, the relative errors are $\frac{3}{4} \times 10^{-3}$ and 1, respectively. The relative error clearly indicates that $\beta_1$ is a good approximation to $\alpha_1$ but that $\beta_2$ is a poor approximation to $\alpha_2$. In summary, we have
\[ \text{absolute error} = |\text{exact value} - \text{approximate value}| \]
\[ \text{relative error} = \frac{|\text{exact value} - \text{approximate value}|}{|\text{exact value}|} \]
Here the exact value is the true value. A useful way to express the absolute error and relative error is to drop the absolute values and write
\[ (\text{relative error})(\text{exact value}) = \text{exact value} - \text{approximate value} \]
\[ \text{approximate value} = (\text{exact value})[1 - (\text{relative error})] \]
So the relative error is related to the approximate value rather than to the exact value because the true value may not be known.
EXAMPLE 3
Consider x = 0.00347 rounded to $\tilde{x} = 0.0035$ and y = 30.158 rounded to $\tilde{y} = 30.16$. In each case, what are the number of significant digits, absolute errors, and relative errors? Interpret the results.
Solution
Case 1. $\tilde{x} = 0.35 \times 10^{-2}$ has two significant digits, absolute error $0.3 \times 10^{-4}$, and relative error $0.865 \times 10^{-2}$. Case 2. $\tilde{y} = 0.3016 \times 10^{2}$ has four significant digits, absolute error $0.2 \times 10^{-2}$, and relative error $0.66 \times 10^{-4}$. Clearly, the relative error is a better indication of the number of significant digits than the absolute error. ■
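The two error measures are simple to compute. The following Python sketch (the function names are illustrative, not from the text) evaluates them for the numbers used in this subsection.

```python
def absolute_error(exact: float, approx: float) -> float:
    """|exact value - approximate value|"""
    return abs(exact - approx)


def relative_error(exact: float, approx: float) -> float:
    """|exact value - approximate value| / |exact value|; undefined for exact = 0."""
    if exact == 0:
        raise ValueError("relative error is undefined when the exact value is 0")
    return abs(exact - approx) / abs(exact)


if __name__ == "__main__":
    pairs = [(1.333, 1.334), (0.001, 0.002), (0.00347, 0.0035), (30.158, 30.16)]
    for exact, approx in pairs:
        print(f"exact={exact}, approx={approx}: "
              f"abs={absolute_error(exact, approx):.3e}, "
              f"rel={relative_error(exact, approx):.3e}")
```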
Accuracy and Precision
Accurate to n decimal places means that you can trust n digits to the right of the decimal place. Accurate to n significant digits means that you can trust a total of n digits as being meaningful beginning with the leftmost nonzero digit.
Suppose you use a ruler graduated in millimeters to measure lengths. The measurements will be accurate to one millimeter, or 0.001 m, which is three decimal places written in meters. A measurement such as 12.345 m would be accurate to three decimal places. A measurement such as 12.3456789 m would be meaningless, since the ruler produces only three decimal places, and it should be 12.345 m or 12.346 m. If the measurement 12.345 m has five dependable digits, then it is accurate to five significant figures. On the other hand, a measurement such as 0.076 m has only two significant figures.
When using a calculator or computer in a laboratory experiment, one may get a false sense of having higher precision than is warranted by the data. For example, the result
(1.2) + (3.45) = 4.65
actually has only two significant digits of accuracy because the second digit in 1.2 may be the effect of rounding 1.24 down or rounding 1.16 up to two significant figures. Then the left-hand side could be as large as
(1.249) + (3.454) = (4.703)
or as small as
(1.16) + (3.449) = (4.609)
There are really only two significant decimal places in the answer! In adding and subtracting numbers, the result is accurate only to the smallest number of significant digits used in any step of the calculation. In the above example, the term 1.2 has two significant digits; therefore, the final calculation has an uncertainty in the third digit.
In multiplication and division of numbers, the results may be even more misleading. For instance, perform these computations on a calculator: (1.23)(4.5) = 5.535 and (1.23)/(4.5) = 0.273333333. You think that there are four and nine significant digits in the results, but there are really only two! As a rule of thumb, one should keep as many significant digits in a sequence of calculations as there are in the least accurate number involved in the computations.
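One way to see why only two digits survive is to propagate the rounding uncertainty of the inputs. The Python sketch below assumes each operand is a positive value rounded to the decimals shown, so its true value lies within half a unit of the last place; the helper names are hypothetical.

```python
def rounding_interval(value: float, decimals: int):
    """Interval of true values that round to `value` at the given decimals."""
    half = 0.5 * 10 ** (-decimals)
    return value - half, value + half


def product_range(x_iv, y_iv):
    """Range of x*y for positive intervals x_iv and y_iv."""
    return x_iv[0] * y_iv[0], x_iv[1] * y_iv[1]


def quotient_range(x_iv, y_iv):
    """Range of x/y for positive intervals x_iv and y_iv."""
    return x_iv[0] / y_iv[1], x_iv[1] / y_iv[0]


x_iv = rounding_interval(1.23, 2)  # 1.225 .. 1.235
y_iv = rounding_interval(4.5, 1)   # 4.45 .. 4.55

if __name__ == "__main__":
    print("product lies in ", product_range(x_iv, y_iv))
    print("quotient lies in", quotient_range(x_iv, y_iv))
```

The product can lie anywhere between roughly 5.45 and 5.62, and the quotient between roughly 0.269 and 0.278, so digits beyond the second carry no information.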
Rounding and Chopping
Rounding reduces the number of significant digits in a number. The result of rounding is a number similar in magnitude that is a shorter number having fewer nonzero digits. There are several slightly different rules for rounding. The round-to-even method is also known as statistician’s rounding or bankers’ rounding. It will be discussed below. Over a large set of data, the round-to-even rule tends to reduce the total rounding error with (on average) an equal portion of numbers rounding up as well as rounding down.
We say that a number x is chopped to n digits or figures when all digits that follow the nth digit are discarded and none of the remaining n digits are changed. Conversely, x is rounded to n digits or figures when x is replaced by an n-digit number that approximates x with minimum error. The question of whether to round up or down an (n + 1)-digit decimal number that ends with a 5 is best handled by always selecting the rounded n-digit number with an even nth digit. This may seem strange at first, but remarkably, this is essentially what computers do in rounding decimal calculations when using the standard floating-point arithmetic! (This is a topic discussed in Chapter 2.)
For example, the results of rounding some three-decimal numbers to two digits are 0.217 ≈ 0.22, 0.365 ≈ 0.36, 0.475 ≈ 0.48, and 0.592 ≈ 0.59, while chopping them gives 0.217 ≈ 0.21, 0.365 ≈ 0.36, 0.475 ≈ 0.47, and 0.592 ≈ 0.59. On the computer, the user sometimes has the option to have all arithmetic operations done with either chopping or rounding. The latter is usually preferable, of course.
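The rounding and chopping rules above can be tried out with decimal arithmetic. This Python sketch uses the standard decimal module; the inputs are passed as strings because binary floating-point numbers cannot represent values such as 0.365 exactly. The function names round_to_even and chop are illustrative.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def round_to_even(text: str, places: str = "0.01") -> Decimal:
    """Round a decimal string to the given place, ties going to the even digit."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_HALF_EVEN)


def chop(text: str, places: str = "0.01") -> Decimal:
    """Discard all digits beyond the given place."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_DOWN)


if __name__ == "__main__":
    for s in ("0.217", "0.365", "0.475", "0.592"):
        print(s, "rounded:", round_to_even(s), "chopped:", chop(s))
```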
Nested Multiplication
We now turn to the efficient evaluation of a polynomial. To evaluate the polynomial
\[ p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} + a_n x^n \tag{2} \]
we group the terms in a nested multiplication:
\[ p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x(a_n)) \cdots )) \]
The pseudocode‡ that evaluates p(x) starts with the innermost parentheses and works outward. It can be written as
    integer i, n; real p, x; real array (a_i)_{0:n}
    p ← a_n
    for i = n − 1 to 0 do
        p ← a_i + x p
    end for
Here we assume that numerical values have been assigned to the integer variable n, the real variable x, as well as the coefficients $a_0, a_1, \ldots, a_n$, which are stored in a real linear array. (Throughout, we use semicolons between these declarative statements to save space.) The left-pointing arrow (←) means that the value on the right is stored in the location named on the left (i.e., "overwrites" from right to left). The for-loop index i runs backward, taking values n − 1, n − 2, ..., 0. The final value of p is the value of the polynomial at x. This nested multiplication procedure is also known as Horner's algorithm or synthetic division.
‡ A pseudocode is a compact and informal description of an algorithm that uses the conventions of a programming language but omits the detailed syntax. When convenient, it may be augmented with natural language.
In the pseudocode above, there is exactly one addition and one multiplication each time the loop is traversed. Consequently, Horner's algorithm can evaluate a polynomial with only n additions and n multiplications. This is the minimum number of operations possible. A naive method of evaluating a polynomial would require many more operations. For example, $p(x) = 5 + 3x - 7x^2 + 2x^3$ should be computed as $p(x) = 5 + x(3 + x(-7 + x(2)))$ for a given value of x. We have avoided all the exponentiation operations by using nested multiplication!
The polynomial in Equation (2) can be written in an alternative form by utilizing the mathematical symbols for sum $\sum$ and product $\prod$, namely,
\[ p(x) = \sum_{i=0}^{n} a_i x^i = \sum_{i=0}^{n} \left( a_i \prod_{j=1}^{i} x \right) \]
Recall that if n ≤ m, we write
\[ \sum_{k=n}^{m} x_k = x_n + x_{n+1} + \cdots + x_m \quad\text{and}\quad \prod_{k=n}^{m} x_k = x_n x_{n+1} \cdots x_m \]
By convention, whenever m < n, we define
\[ \sum_{k=n}^{m} x_k = 0 \quad\text{and}\quad \prod_{k=n}^{m} x_k = 1 \]
Horner's algorithm can be used in the deflation of a polynomial. This is the process of removing a linear factor from a polynomial. If r is a root of the polynomial p, then x − r is a factor of p. The remaining roots of p are the n − 1 roots of a polynomial q of degree 1 less than the degree of p such that
\[ p(x) = (x - r)\,q(x) + p(r) \tag{3} \]
where
\[ q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1} \tag{4} \]
The pseudocode for Horner's algorithm can be written as follows:
    integer i, n; real p, r; real array (a_i)_{0:n}, (b_i)_{−1:n−1}
    b_{n−1} ← a_n
    for i = n − 1 to 0 do
        b_{i−1} ← a_i + r b_i
    end for
Notice that $b_{-1} = p(r)$ in this pseudocode. If r is an exact root, then $b_{-1} = p(r) = 0$. If the calculation in Horner's algorithm is to be carried out with pencil and paper, the following arrangement is often used:
          a_n       a_{n−1}     a_{n−2}    ...   a_1     a_0
    r)              r b_{n−1}   r b_{n−2}  ...   r b_1   r b_0
         ------------------------------------------------------
          b_{n−1}   b_{n−2}     b_{n−3}    ...   b_0     b_{−1}
EXAMPLE 4
Use Horner's algorithm to evaluate p(3), where p is the polynomial
\[ p(x) = x^4 - 4x^3 + 7x^2 - 5x - 2 \]
Solution
We arrange the calculation as suggested above:
          1    −4     7    −5    −2
    3)          3    −3    12    21
         ----------------------------
          1    −1     4     7    19
Thus, we obtain p(3) = 19, and we can write
\[ p(x) = (x - 3)(x^3 - x^2 + 4x + 7) + 19 \]
■
In the deflation process, if r is a zero of the polynomial p, then x − r is a factor of p, and conversely. The remaining zeros of p are the n − 1 zeros of q(x).
EXAMPLE 5
Deflate the polynomial p of the preceding example, using the fact that 2 is one of its zeros.
Solution
We use the same arrangement of computations as explained previously:
          1    −4     7    −5    −2
    2)          2    −4     6     2
         ----------------------------
          1    −2     3     1     0
Thus, we have p(2) = 0, and
\[ x^4 - 4x^3 + 7x^2 - 5x - 2 = (x - 2)(x^3 - 2x^2 + 3x + 1) \]
■
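Nested multiplication and deflation can be combined in one short routine. The following Python sketch is an illustration under the assumption that coefficients are supplied with the highest degree first, which matches the pencil-and-paper arrangement; the function name horner is a hypothetical choice.

```python
def horner(coeffs, r):
    """Synthetic division of p by (x - r).

    coeffs lists a_n, a_{n-1}, ..., a_0 (highest degree first).
    Returns (quotient, remainder): the quotient coefficients
    b_{n-1}, ..., b_0 and the remainder b_{-1} = p(r).
    """
    b = [coeffs[0]]
    for a in coeffs[1:]:
        # One multiplication and one addition per coefficient.
        b.append(a + r * b[-1])
    return b[:-1], b[-1]


if __name__ == "__main__":
    p = [1, -4, 7, -5, -2]  # x^4 - 4x^3 + 7x^2 - 5x - 2
    print(horner(p, 3))     # quotient and p(3), as in Example 4
    print(horner(p, 2))     # deflation by the zero 2, as in Example 5
```

Returning the quotient together with the remainder means that the same call serves both to evaluate p(r) and to deflate p when r is a zero.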
Pairs of Easy/Hard Problems
In scientific computing, we often encounter a pair of problems, one of which is easy and the other hard and they are inverses of each other. This is the main idea in cryptology, in which multiplying two numbers together is trivial but the reverse problem (factoring a huge number) verges on the impossible.
The same phenomenon arises with polynomials. Given the roots, we can easily find the power form of the polynomial as in Equation (2). Given the power form, it may be a difficult problem to compute the roots (and it may be an ill-conditioned problem). Computer Problem 1.1.24 calls for the writing of code to compute the coefficients in the power form of a polynomial from its roots. It is a do-loop with simple formulas. One adjoins one factor (x −r) at a time. This theme arises again in linear algebra, in which computing b = Ax is trivial but finding x from A and b (the inverse problem) is hard. (See Section 7.1.)
Easy/hard problems come up again in two-point boundary value problems. Finding D f and f (0) and f (1) when f is given and D is a differential operator is easy, but finding f from knowledge of D f , f (0) and f (1) is hard. (See Section 14.1.)
Likewise, computing the eigenvalues of a matrix is a hard problem. Given the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and corresponding eigenvectors $v_1, v_2, \ldots, v_n$ of an n × n matrix A, we can get A by putting the eigenvalues on the diagonal of a diagonal matrix D and the eigenvectors as columns in a matrix V. Then AV = VD, and we can get A from this by solving the equation for A; when V is invertible, $A = V D V^{-1}$. But finding $\lambda_i$ and $v_i$ from A itself is difficult. (See Section 8.3.)
The reader may think of other examples.
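The easy direction for polynomials, going from roots to the power form by adjoining one factor (x − r) at a time, can be sketched as follows. This Python fragment is one possible illustration, not the solution intended by the computer problem; the function name poly_from_roots and the lowest-degree-first coefficient order are assumptions.

```python
def poly_from_roots(roots):
    """Coefficients [a_0, a_1, ..., a_n] of the product of (x - r) over all roots."""
    coeffs = [1.0]  # the constant polynomial 1
    for r in roots:
        # Multiply the current polynomial by (x - r).
        new = [0.0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            new[k + 1] += c   # contribution of c * x^k * x
            new[k] -= r * c   # contribution of c * x^k * (-r)
        coeffs = new
    return coeffs


if __name__ == "__main__":
    # (x - 1)(x - 2) = 2 - 3x + x^2
    print(poly_from_roots([1, 2]))
```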
First Programming Experiment
We conclude this section with a short programming experiment involving numerical computations. Here we consider, from the computational point of view, a familiar operation in calculus, namely, taking the derivative of a function. Recall that the derivative of a function f at a point x is defined by the equation
\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]
A computer has the capacity of imitating the limit operation by using a sequence of numbers h such as
\[ h = 4^{-1}, 4^{-2}, 4^{-3}, \ldots, 4^{-n}, \ldots \]
for they certainly approach zero rapidly. Of course, many other simple sequences are possible, such as $1/n$, $1/n^2$, and $1/10^n$. The sequence $1/4^n$ consists of machine numbers in a binary computer and, for this experiment on a 32-bit computer, will be sufficiently close to zero when n is 10.
The following is pseudocode to compute f ′(x) at the point x = 0.5, with f (x) = sin x:
    program First
    integer i, imax, n ← 30
    real error, y, x ← 0.5, h ← 1, emax ← 0
    for i = 1 to n do
        h ← 0.25h
        y ← [sin(x + h) − sin(x)]/h
        error ← |cos(x) − y|; output i, h, y, error
        if error > emax then emax ← error; imax ← i end if
    end for
    output imax, emax
    end program First
We have neither explained the purpose of the experiment nor shown the output from this pseudocode. We invite the reader to discover this by coding and running it (or one like it) on a computer. (See Computer Problems 1.1.1 through 1.1.3.)
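For readers who want a starting point, here is one possible Python transcription of program First. It is a sketch rather than the book's own code: Python floats are double precision, so the behavior will differ in detail from a 32-bit single-precision run, and the function name first and its return values are illustrative choices. The output is deliberately not shown.

```python
import math


def first(n: int = 30, x: float = 0.5):
    """Forward-difference estimates of the derivative of sin at x for h = 4^-1, ..., 4^-n."""
    h = 1.0
    emax, imax = 0.0, 0
    rows = []
    for i in range(1, n + 1):
        h = 0.25 * h
        y = (math.sin(x + h) - math.sin(x)) / h
        error = abs(math.cos(x) - y)
        rows.append((i, h, y, error))
        if error > emax:
            emax, imax = error, i
    return rows, imax, emax


if __name__ == "__main__":
    rows, imax, emax = first()
    for i, h, y, error in rows:
        print(f"{i:3d}  {h:.6e}  {y:.15f}  {error:.6e}")
    print(imax, emax)
```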
Mathematical Software
The algorithms and programming problems in this book have been coded and tested in a variety of ways, and they are available on the website for this book as given in the Preface. Some are best done by using a scientific programming language such as C, C++, Fortran, or any other that allows for calculations with adequate precision. Sometimes it is instructive to utilize mathematical software systems such as Matlab, Maple, Mathematica, or Octave, since they contain built-in problem-solving procedures. Alternatively, one could use a mathematical program library such as IMSL, NAG, or others when locally available. Some numerical libraries have been specifically optimized for particular processors, such as those from Intel and AMD. Software systems are particularly useful for obtaining graphical results as well as for experimenting with various numerical methods for solving a difficult problem. Mathematical software packages containing symbolic-manipulation capabilities, such as in Maple, Mathematica, and Macsyma, are particularly useful for obtaining exact as well as numerical solutions. In solving the computer problems, students should focus on gaining insights and better understandings of the numerical methods involved. Appendix A offers advice on computer programming for scientific computations. The suggestions are independent of the particular language being used.
With the development of the World Wide Web and the Internet, good mathematical software has become easy to locate and to transfer from one computer to another. Browsers, search engines, and URL addresses may be used to find software that is applicable to a particular area of interest. Collections of mathematical software exist, ranging from large comprehensive libraries to smaller versions of these libraries for PCs; some of these are interactive. Also, references to computer programs and collections of routines can be found in books and technical reports. The URL of the website for this book, as given in the Preface, contains an overview of available mathematical software as well as other supporting material.
Summary
(1) Use nested multiplication to evaluate a polynomial efficiently:
\[ \begin{aligned} p(x) &= a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} + a_n x^n \\ &= a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x(a_n)) \cdots )) \end{aligned} \]
A segment of pseudocode for doing this is
    p ← a_n
    for k = 1 to n do
        p ← x p + a_{n−k}
    end for
(2) Deflation of the polynomial p(x) is removing a linear factor:
\[ p(x) = (x - r)\,q(x) + p(r) \]
where
\[ q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1} \]
The pseudocode for Horner's algorithm for deflation of a polynomial is
    b_{n−1} ← a_n
    for i = n − 1 to 0 do
        b_{i−1} ← a_i + r b_i
    end for
Here $b_{-1} = p(r)$.
Additional References
Two interesting papers containing numerous examples of why numerical methods are critically important are Forsythe [1970] and McCartin [1998]. See Briggs [2004] and Friedman and Littman [1994] for many industrial and real-world problems.
Problems 1.1*
1. In high school, some students have been misled to believe that 22/7 is either the actual value of π or an acceptable approximation to π. Show that 355/113 is a better approximation in terms of both absolute and relative errors. Find some other simple rational fractions n/m that approximate π. For example, ones for which $|\pi - n/m| < 10^{-9}$. Hint: See Problem 1.1.4.
a2. A real number x is represented approximately by 0.6032, and we are told that the relative error is at most 0.1%. What is x?
a3. What is the relative error involved in rounding 4.9997 to 5.000?
a4. The value of π can be generated by the computer to nearly full machine precision by the assignment statement
    pi ← 4.0 arctan(1.0)
Suggest at least four other ways to compute π using basic functions on your computer system.
5. A given doubly subscripted array $(a_{ij})_{n \times n}$ can be added in any order. Write the pseudocode segments for each of the following parts. Which is best?
aa. $\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}$
b. $\sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij}$
c. $\sum_{i=1}^{n} \left[ \sum_{j=1}^{i} a_{ij} + \sum_{j=1}^{i-1} a_{ji} \right]$
ad. $\sum_{k=0}^{n-1} \sum_{|i-j|=k} a_{ij}$
e. $\sum_{k=2}^{2n} \sum_{i+j=k}^{n} a_{ij}$
a6. Count the number of operations involved in evaluating a polynomial using nested multiplication. Do not count subscript calculations.
7. For small x, show that $(1 + x)^2$ can sometimes be more accurately computed from (x + 2)x + 1. Explain. What other expressions can be used to compute it?
8. Show how these polynomials can be efficiently evaluated:
aa. $p(x) = x^{32}$
b. $p(x) = 3(x - 1)^5 + 7(x - 1)^9$
ac. $p(x) = 6(x + 2)^3 + 9(x + 2)^7 + 3(x + 2)^{15} - (x + 2)^{31}$
d. $p(x) = x^{127} - 5x^{37} + 10x^{17} - 3x^{7}$
9. Using the exponential function exp(x), write an efficient pseudocode segment for the statement $y = 5e^{3x} + 7e^{2x} + 9e^{x} + 11$.
a10. Write a pseudocode segment to evaluate the expression
\[ z = \sum_{i=1}^{n} b_i^{-1} \prod_{j=1}^{i} a_j \]
where $(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$ are linear arrays containing given values.
∗Problems marked with a have answers in the back of the book.
11. Write segments of pseudocode to evaluate the following expressions efficiently:
a. $p(x) = \sum_{k=0}^{n-1} k x^k$
ab. $z = \sum_{i=1}^{n} \prod_{j=1}^{i} x_{n-j+1}$
c. $z = \prod_{i=1}^{n} \sum_{j=1}^{i} x_j$
d. $p(t) = \sum_{i=1}^{n} a_i \prod_{j=1}^{i-1} (t - x_j)$
12. Using summation and product notation, write mathematical expressions for the following pseudocode segments:
a.  integer i, n; real v, x; real array (a_i)_{0:n}
    v ← a_0
    for i = 1 to n do
        v ← v + x a_i
    end for
ab. integer i, n; real v, x; real array (a_i)_{0:n}
    v ← a_n
    for i = 1 to n do
        v ← v x + a_{n−i}
    end for
c.  integer i, n; real v, x; real array (a_i)_{0:n}
    v ← a_0
    for i = 1 to n do
        v ← v x + a_i
    end for
d.  integer i, n; real v, x, z; real array (a_i)_{0:n}
    v ← a_0
    z ← x
    for i = 1 to n do
        v ← v + z a_i
        z ← x z
    end for
ae. integer i, n; real v, x; real array (a_i)_{0:n}
    v ← a_n
    for i = 1 to n do
        v ← (v + a_{n−i}) x
    end for
a13. Express in mathematical notation without parentheses the final value of z in the following pseudocode segment:
    integer k, n; real z; real array (b_i)_{0:n}
    z ← b_n + 1
    for k = 1 to n − 2 do
        z ← z b_{n−k} + 1
    end for
a14. How many multiplications occur in executing the following pseudocode segment?
    integer i, j, n; real x; real array (a_ij)_{0:n×0:n}, (b_ij)_{0:n×0:n}
    x ← 0.0
    for j = 1 to n do
        for i = 1 to j do
            x ← x + a_ij b_ij
        end for
    end for
15. Criticize the following pseudocode segments and write improved versions:
a.  integer i, n; real x, z; real array (a_i)_{0:n}
    for i = 1 to n do
        x ← z^2 + 5.7
        a_i ← x/i
    end for
ab. integer i, j, n; real array (a_ij)_{0:n×0:n}
    for i = 1 to n do
        for j = 1 to n do
            a_ij ← 1/(i + j − 1)
        end for
    end for
c.  integer i, j, n; real array (a_ij)_{0:n×0:n}
    for j = 1 to n do
        for i = 1 to n do
            a_ij ← 1/(i + j − 1)
        end for
    end for
16. The augmented matrix
\[ \left[ \begin{array}{cc|c} 3.5713 & 2.1426 & 7.2158 \\ 10.714 & 6.4280 & 1.3379 \end{array} \right] \]
is for a system of two equations and two unknowns x and y. Repeat Example 2 for this system. Can small changes in the data lead to massive change in the solution?
17. A base 60 approximation circa 1750 B.C. is
\[ \sqrt{2} \approx 1 + \frac{24}{60} + \frac{51}{60^2} + \frac{10}{60^3} \]
Determine how accurate it is. See Sauer [2006] for additional details.
Computer Problems 1.1
1. Write and run a computer program that corresponds to the pseudocode program First described in the text (p. 10) and interpret the results.
2. (Continuation) Select a function f and a point x and carry out a computer experiment like the one given in the text. Interpret the results. Do not select too simple a function. For example, you might consider $1/x$, $\log x$, $e^x$, $\tan x$, $\cosh x$, or $x^3 - 23x$.
3. As we saw in the first computer experiment, the accuracy of a formula for numerical differentiation may deteriorate as the step-size h decreases. Study the following central difference formula:
\[ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} \]
as h → 0. We will learn in Chapter 4 that the truncation error for this formula is $-\frac{1}{6} h^2 f'''(\xi)$ for some ξ in the interval (x − h, x + h). Modify and run the code for the experiment First so that approximate values for the rounding error and truncation error are computed. On the same graph, plot the rounding error, the truncation error, and the total error (sum of these two errors) using a log-scale; that is, the axes in the plot should be $-\log_{10} |\text{error}|$ versus $\log_{10} h$. Analyze these results.
a4. The limit $e = \lim_{n \to \infty} (1 + 1/n)^n$ defines the number e in calculus. Estimate e by taking the value of this expression for $n = 8, 8^2, 8^3, \ldots, 8^{10}$. Compare with e obtained from e ← exp(1.0). Interpret the results.
5. It is not difficult to see that the numbers $p_n = \int_0^1 x^n e^x \, dx$ satisfy the inequalities $p_1 > p_2 > p_3 > \cdots > 0$. Establish this fact. Next, use integration by parts to show that $p_{n+1} = e - (n + 1) p_n$ and that $p_1 = 1$. In the computer, use the recurrence relation to generate the first 20 values of $p_n$ and explain why the inequalities above are violated. Do not use subscripted variables. (See Dorn and McCracken [1972], pp. 120–129.)
6. (Continuation) Let $p_{20} = \frac{1}{8}$ and use the formula in the preceding computer problem to compute $p_{19}, p_{18}, \ldots, p_2$, and $p_1$. Do the numbers generated obey the inequalities $1 = p_1 > p_2 > p_3 > \cdots > 0$? Explain the difference in the two procedures. Repeat with $p_{20} = 20$ or $p_{20} = 100$. Explain what happens.
7. Write an efficient routine that accepts as input a list of real numbers $a_1, a_2, \ldots, a_n$ and then computes the following:
Arithmetic mean: $m = \frac{1}{n} \sum_{k=1}^{n} a_k$
Variance: $v = \frac{1}{n - 1} \sum_{k=1}^{n} (a_k - m)^2$
Standard deviation: $\sigma = \sqrt{v}$
Test the routine on a set of data of your choice.
8. (Continuation) Show that another formula is
Variance: $v = \frac{1}{n - 1} \left[ \sum_{k=1}^{n} a_k^2 - n m^2 \right]$
Of the two given formulas for v, which is more accurate in the computer? Verify on the computer with a data set. Hint: Use a large set of real numbers that vary in magnitude from very small to very large.
ᵃ9. Let $a_1$ be given. Write a program to compute for 1 ≤ n ≤ 1000 the numbers $b_n = n a_{n-1}$ and $a_n = b_n/n$. Print the numbers $a_{100}, a_{200}, \ldots, a_{1000}$. Do not use subscripted variables. What should $a_n$ be? Account for the deviation of fact from theory. Determine four values for $a_1$ so that the computation does deviate from theory on your computer. Hint: Consider extremely small and large numbers and print to full machine precision.
ᵃ10. In a computer, it can happen that a + x = a when x ≠ 0. Explain why. Describe the set of n for which $1 + 2^{-n} = 1$ in your computer. Write and run appropriate programs to illustrate the phenomenon.
11. Write a program to test the programming suggestion concerning the roundoff error in the computation of t ← t + h versus t ← t0 + ih. For example, use $h = \frac{1}{10}$ and compute t ← t + h in double precision for the correct single-precision value of t; print the absolute values of the differences between this calculation and the values of the two procedures. What is the result of the test when h is a machine number, such as $h = \frac{1}{128}$, on a binary computer (with more than seven bits per word)?
ᵃ12. The Russian mathematician P. L. Chebyshev (1821–1894) spelled his name Чебышев. Many transliterations from the Cyrillic to the Latin alphabet are possible. Cheb can alternatively be rendered as Ceb, Tscheb, or Tcheb. The y can be rendered as i. Shev can also be rendered as schef, cev, cheff, or scheff. Taking all combinations of these variants, program a computer to print all possible spellings.
13. Compute n! using logarithms, integer arithmetic, and double-precision floating-point arithmetic. For each part, print a table of values for 0 ≤ n ≤ 30, and determine the largest correct value.
14. Given two arrays, a real array $v = (v_1, v_2, \ldots, v_n)$ and an integer permutation array $p = (p_1, p_2, \ldots, p_n)$ of integers 1, 2, …, n, can we form a new permuted array $v = (v_{p_1}, v_{p_2}, \ldots, v_{p_n})$ by overwriting v and not involving another array in memory? If so, write and test the code for doing it. If not, use an additional array and test.
Case 1. v = (6.3, 4.2, 9.3, 6.7, 7.8, 2.4, 3.8, 9.7), p = (2, 3, 8, 7, 1, 4, 6, 5)
Case 2. v = (0.7, 0.6, 0.1, 0.3, 0.2, 0.5, 0.4), p = (3, 5, 4, 7, 6, 2, 1)
15. Using a computer algebra system (e.g., Maple, Derive, Mathematica), print 200 decimal digits of $\sqrt{10}$.
16. a. Repeat the example (1) on loss of significant digits of accuracy but perform the calculations with twice the precision before rounding them. Does this help?
b. Use Maple or some other mathematical software system in which you can set the number of digits of precision. Hint: In Maple, use Digits.
17. In 1706, Machin used the formula
\[ \pi = 16\arctan\frac{1}{5} - 4\arctan\frac{1}{239} \]
to compute 100 digits of π. Derive this formula. Reproduce Machin’s calculations by using suitable software. Hint: Let $\tan\theta = \frac{1}{5}$, and use standard trigonometric identities.
18. Using a symbol-manipulating program such as Maple, Mathematica, or Macsyma, carry out the following tasks. Record your work in some manner, for example, by using a diary or script command.
a. Find the Taylor series, up to and including the term $x^{10}$, for the function $(\tan x)^2$, using 0 as the point $x_0$.
b. Find the indefinite integral of $(\cos x)^{-4}$.
c. Find the definite integral $\int_0^1 \log|\log x|\,dx$.
d. Find the first prime number greater than 27448.
1 3
e. Obtain the numerical value of 0 1 + sin x dx.
f. Find the solution of the differential equation $y' + y = (1 + e^x)^{-1}$.
g. Define the function $f(x, y) = 9x^4 - y^4 + 2y^2 - 1$. You want to know the value of f(40545, 70226). Compute this in the straightforward way by direct substitution of x = 40545 and y = 70226 in the definition of f(x, y), using first six-decimal accuracy, then seven, eight, and so on up to 24-decimal digits of accuracy. Next, prove by means of elementary algebra that
\[ f(x, y) = (3x^2 - y^2 + 1)(3x^2 + y^2 - 1) \]
Use this formula to compute the same value of f(x, y), again using different precisions, from six-decimal to 24-decimal. Describe what you have learned. To force the program to do floating-point operations instead of integer arithmetic, write your numbers in the form 9.0, 40545.0, and so forth.
19. Consider the following pseudocode segments:
a.
    integer i; real x, y, z
    for i = 1 to 20 do
        x ← 2 + 1.0/8^i
        y ← arctan(x) − arctan(2)
        z ← 8^i y
        output x, y, z
    end for
b.
    real epsi ← 1
    while 1 < 1 + epsi do
        epsi ← epsi/2
        output epsi
    end while
What is the purpose of each program? Is it achieved? Explain. Code and run each one to verify your conclusions.
20. Consider some oversights involving assignment statements.
ᵃa. What is the difference between the following two assignment statements? Write a code that contains them and illustrate with specific examples to show that sometimes x = y and sometimes x ≠ y.
    integer m, n; real x, y
    x ← real(m/n)
    y ← real(m)/real(n)
    output x, y
b. What value will n receive?
    integer n; real x, y
    x ← 7.4
    y ← 3.8
    n ← x + y
    output n
What happens when the last statement is replaced with the following?
    n ← integer(x) + integer(y)
21. Write a computer code that contains the following assignment statements exactly as shown. Analyze the results.
a. Print these values first using the default format and then with an extremely large format field:
    real p, q, u, v, w, x, y, z
    x ← 0.1
    y ← 0.01
    z ← x − y
    p ← 1.0/3.0
    q ← 3.0p
    u ← 7.6
    v ← 2.9
    w ← u − v
    output x, y, z, p, q, u, v, w
b. What values would be computed for x, y, and z if this code is used?
    integer n; real x, y, z
    for n = 1 to 10 do
        x ← (n − 1)/2
        y ← n^2/3.0
        z ← 1.0 + 1/n
        output x, y, z
    end for
c. What values would the following assignment statements produce?
    integer i, j; real c, f, x, half
    x ← 10/3
    i ← integer(x + 1/2)
    half ← 1/2
    j ← integer(half)
    c ← (5/9)(f − 32)
    f ← 9/5c + 32
    output x, i, half, j, c, f
d. Discuss what is wrong with the following pseudocode segment:
    real area, circum, radius
    radius ← 1
    area ← (22/7)(radius)^2
    circum ← 2(3.1416)radius
    output area, circum
22. Criticize the following pseudocode for evaluating $\lim_{x \to 0} \arctan(|x|)/x$. Code and run it to see what happens.
    integer i; real x, y
    x ← 1
    for i = 1 to 24 do
        x ← x/2.0
        y ← arctan(|x|)/x
        output x, y
    end for
23. Carry out some computer experiments to illustrate or test the programming suggestions in Appendix A. Specific topics to include are these: (a) when to avoid arrays, (b) when to limit iterations, (c) checking for floating-point equality, (d) ways for taking equal floating-point steps, and (e) various ways to evaluate functions. Hint: Comparing single and double precision results may be helpful.
24. (Easy/Hard Problem Pairs) Write a computer program to obtain the power form of a polynomial from its roots. Let the roots be $r_1, r_2, \ldots, r_n$. Then (except for a scalar factor) the polynomial is the product
\[ p(x) = (x - r_1)(x - r_2)\cdots(x - r_n) \]
Find the coefficients in the expression $p(x) = \sum_{j=0}^{n} a_j x^j$. Test your code on the Wilkinson polynomials in Computer Problems 3.1.10 and 3.3.9. Explain why this task of getting the power form of the polynomial is trivial, whereas the inverse problem of finding the roots from the power form is quite difficult.
25. A prime number is a positive integer that has no integer factors other than itself and 1. How many prime numbers are there in each of these open intervals: (1, 40), (1, 80), (1, 160), and (1, 2000)? Make a guess as to the percentage of prime numbers among all numbers.
26. Mathematical software systems such as Maple and Mathematica do both numerical calculations and symbolic manipulations. Verify symbolically that a nested multiplication is correct for a general polynomial of degree ten.
1.2 Review of Taylor Series
Most students will have encountered infinite series (particularly Taylor series) in their study of calculus without necessarily having acquired a good understanding of this topic. Consequently, this section is particularly important for numerical analysis, and deserves careful study.
Once students are well grounded with a basic understanding of Taylor series, the Mean-Value Theorem, and alternating series (all topics in this section) as well as computer number representation (Section 2.2), they can proceed to study the fundamentals of numerical methods with better comprehension.
Taylor Series
Familiar (and useful) examples of Taylor series are the following:
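\[ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!} \qquad (|x| < \infty) \tag{1} \]
\[ \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k+1}}{(2k+1)!} \qquad (|x| < \infty) \tag{2} \]
\[ \cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k}}{(2k)!} \qquad (|x| < \infty) \tag{3} \]
\[ \frac{1}{1-x} = 1 + x + x^2 + x^3 + \cdots = \sum_{k=0}^{\infty} x^k \qquad (|x| < 1) \tag{4} \]
\[ \ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{x^k}{k} \qquad (-1 < x \le 1) \tag{5} \]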
For each case, the series represents the given function and converges in the interval specified. Series (1)–(5) are Taylor series expanded about c = 0. A Taylor series expanded about c = 1 is
\[ \ln(x) = (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3} - \cdots = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{(x-1)^k}{k} \]
where 0 < x ≤ 2. The reader should recall the factorial notation
\[ n! = 1 \cdot 2 \cdot 3 \cdot 4 \cdots n \]
for n ≥ 1 and the special definition of 0! = 1.
Series of this type are often used to compute good approximate values of complicated functions at specific points.
EXAMPLE 1 Use five terms in Series (5) to approximate ln(1.1).
Solution Taking x = 0.1 in the first five terms of the series for ln(1 + x) gives us
\[ \ln(1.1) \approx 0.1 - \frac{0.01}{2} + \frac{0.001}{3} - \frac{0.0001}{4} + \frac{0.00001}{5} = 0.09531\ 03333\ldots \]
where ≈ means “approximately equal.” This value is correct to six decimal places of accuracy. ■
On the other hand, such good results are not always obtained in using series.
EXAMPLE 2 Try to compute $e^8$ by using Series (1).
Solution The result is
\[ e^8 = 1 + 8 + \frac{64}{2} + \frac{512}{6} + \frac{4096}{24} + \frac{32768}{120} + \cdots \]
It is apparent that many terms will be needed to compute $e^8$ with reasonable precision. By repeated squaring, we find $e^2 = 7.389056$, $e^4 = 54.5981500$, and $e^8 = 2980.957987$. The first six terms given above yield 570.06666 5. ■
These examples illustrate a general rule:
A Taylor series converges rapidly near the point of expansion and slowly (or not at all) at more remote points.
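The rule can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `ln1p_partial` and `exp_partial` are illustrative choices. It sums five terms of Series (5) at x = 0.1 and six terms of Series (1) at x = 8 and prints each next to the library value.

```python
import math

def ln1p_partial(x, n_terms):
    """Sum the first n_terms of Series (5): x - x^2/2 + x^3/3 - ..."""
    return sum((-1) ** (k - 1) * x ** k / k for k in range(1, n_terms + 1))

def exp_partial(x, n_terms):
    """Sum the first n_terms of Series (1): 1 + x + x^2/2! + ..."""
    return sum(x ** k / math.factorial(k) for k in range(n_terms))

# Close to the point of expansion: five terms already agree with ln(1.1)
near = ln1p_partial(0.1, 5)
print(near, math.log(1.1))

# Far from the point of expansion: six terms are nowhere near e^8
far = exp_partial(8.0, 6)
print(far, math.exp(8.0))
```

The first pair of numbers agrees to about six decimal places, while the second pair differs by more than a factor of five.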
A graphical depiction of the phenomenon can be obtained by graphing a few partial sums of a Taylor series. In Figure 1.2, we show the function
\[ y = \sin x \]
and the partial-sum functions
\[ S_1 = x, \qquad S_3 = x - \frac{x^3}{6}, \qquad S_5 = x - \frac{x^3}{6} + \frac{x^5}{120} \]
which come from Series (2). While $S_1$ may be an acceptable approximation to sin x when x ≈ 0, the graphs for $S_3$ and $S_5$ match that of sin x on larger intervals about the origin.
FIGURE 1.2 Approximations to sin x (curves $S_1$, $S_3$, $S_5$, and sin x)
All of the series illustrated above are examples of the following general series:
■ THEOREM 1 FORMAL TAYLOR SERIES FOR f ABOUT c
\[ f(x) \sim f(c) + f'(c)(x-c) + \frac{f''(c)}{2!}(x-c)^2 + \frac{f'''(c)}{3!}(x-c)^3 + \cdots \]
\[ f(x) \sim \sum_{k=0}^{\infty} \frac{f^{(k)}(c)}{k!}(x-c)^k \tag{6} \]
Here, rather than using =, we have written ∼ to indicate that we are not allowed to assume that f(x) equals the series on the right. All we have at the moment is a formal series that can be written down provided that the successive derivatives $f', f'', f''', \ldots$ exist at the point c. Series (6) is called the “Taylor series of f at the point c.”
In the special case c = 0, Series (6) is also called a Maclaurin series:
\[ f(x) \sim f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots \]
\[ f(x) \sim \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}x^k \tag{7} \]
The first term is f(0) when k = 0.
EXAMPLE 3 What is the Taylor series of the function
\[ f(x) = 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5 \]
at the point c = 2?
Solution To compute the coefficients in the series, we need the numerical values of $f^{(k)}(2)$ for k ≥ 0. Here are the details of the computation:
\[
\begin{aligned}
f(x) &= 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5 & f(2) &= 207 \\
f'(x) &= 15x^4 - 8x^3 + 45x^2 + 26x - 12 & f'(2) &= 396 \\
f''(x) &= 60x^3 - 24x^2 + 90x + 26 & f''(2) &= 590 \\
f'''(x) &= 180x^2 - 48x + 90 & f'''(2) &= 714 \\
f^{(4)}(x) &= 360x - 48 & f^{(4)}(2) &= 672 \\
f^{(5)}(x) &= 360 & f^{(5)}(2) &= 360 \\
f^{(k)}(x) &= 0 & f^{(k)}(2) &= 0
\end{aligned}
\]
for k ≥ 6. Therefore, we have
\[ f(x) \sim 207 + 396(x-2) + 295(x-2)^2 + 119(x-2)^3 + 28(x-2)^4 + 3(x-2)^5 \]
In this example, it is not difficult to see that ∼ may be replaced by =. Simply expand all the terms in the Taylor series and collect them to get the original form for f. Taylor’s Theorem, discussed soon, will allow us to draw this conclusion without doing any work! ■
Complete Horner’s Algorithm
An application of Horner’s algorithm is that of finding the Taylor expansion of a polynomial about any point. Let p(x) be a given polynomial of degree n with coefficients $a_k$ as in Equation (2) in Section 1.1, and suppose that we desire the coefficients $c_k$ in the equation
\[ p(x) = a_n x^n + a_{n-1}x^{n-1} + \cdots + a_0 = c_n(x-r)^n + c_{n-1}(x-r)^{n-1} + \cdots + c_1(x-r) + c_0 \]
Of course, Taylor’s Theorem asserts that $c_k = p^{(k)}(r)/k!$, but we seek a more efficient algorithm. Notice that $p(r) = c_0$, so this coefficient is obtained by applying Horner’s algorithm to the polynomial p with the point r. The algorithm also yields the polynomial
\[ q(x) = \frac{p(x) - p(r)}{x - r} = c_n(x-r)^{n-1} + c_{n-1}(x-r)^{n-2} + \cdots + c_1 \]
This shows that the second coefficient, $c_1$, can be obtained by applying Horner’s algorithm to the polynomial q with point r, because $c_1 = q(r)$. (Notice that the first application of Horner’s algorithm does not yield q in the form shown but rather as a sum of powers of x. See Equations (3)–(4) in Section 1.1.) This process is repeated until all coefficients $c_k$ are found.
We call the algorithm just described the complete Horner’s algorithm. The pseudocode for executing it is arranged so that the coefficients $c_k$ overwrite the input coefficients $a_k$.
    integer n, k, j; real r; real array (a_i)_{0:n}
    for k = 0 to n − 1 do
        for j = n − 1 to k step −1 do
            a_j ← a_j + r a_{j+1}
        end for
    end for
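The pseudocode above can be sketched in Python as follows. This is an illustrative translation rather than the book's code: the function name `complete_horner` and the list layout (entry k holds the coefficient of $x^k$) are assumptions, and the sketch works on a copy instead of overwriting its input. It is checked against the polynomial of Example 3.

```python
def complete_horner(a, r):
    """Return c with p(x) = sum(c[k] * (x - r)**k), given p(x) = sum(a[k] * x**k)."""
    c = list(a)          # work on a copy; the pseudocode overwrites a in place
    n = len(c) - 1       # degree of the polynomial
    for k in range(n):                      # k = 0, ..., n-1
        for j in range(n - 1, k - 1, -1):   # j = n-1 down to k
            c[j] += r * c[j + 1]
    return c

# Example 3: f(x) = 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5 expanded about 2
coeffs = complete_horner([-5, -12, 13, 15, -2, 3], 2)
print(coeffs)
```

The printed list is [207, 396, 295, 119, 28, 3], the coefficients found in Example 3 by repeated differentiation.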
This procedure can be used in carrying out Newton’s method for finding roots of a polynomial, which we discuss in Chapter 3. Moreover, it can be done in complex arithmetic to handle polynomials with complex roots or coefficients.
EXAMPLE 4 Using the complete Horner’s algorithm, find the Taylor expansion of the polynomial
\[ p(x) = x^4 - 4x^3 + 7x^2 - 5x + 2 \]
about the point r = 3.
Solution The work can be arranged as follows:
          1   −4    7   −5    2
    3)         3   −3   12   21
          1   −1    4    7   23
               3    6   30
          1    2   10   37
               3   15
          1    5   25
               3
          1    8
The calculation shows that
\[ p(x) = (x-3)^4 + 8(x-3)^3 + 25(x-3)^2 + 37(x-3) + 23 \]
■
Taylor’s Theorem in Terms of (x − c)
In practical computations with Taylor series, it is usually necessary to truncate the series because it is not possible to carry out an infinite number of additions. A series is said to be truncated if we ignore all terms after a certain point. Thus, if we truncate the exponential Series (1) after seven terms, the result is
\[ e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \frac{x^6}{6!} \]
This no longer represents $e^x$ except when x = 0. But the truncated series should approximate $e^x$. Here is where we need Taylor’s Theorem. With its help, we can assess the difference between a function f and its truncated Taylor series.
■ THEOREM 2 TAYLOR’S THEOREM FOR f(x)
If the function f possesses continuous derivatives of orders 0, 1, 2, . . . , (n + 1) in a closed interval I = [a, b], then for any c and x in I,
\[ f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!}(x-c)^k + E_{n+1} \tag{8} \]
where the error term $E_{n+1}$ can be given in the form
\[ E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-c)^{n+1} \]
Here ξ is a point that lies between c and x and depends on both.
The explicit assumption in this theorem is that $f(x), f'(x), f''(x), \ldots, f^{(n+1)}(x)$ are all continuous functions in the interval I = [a, b]. The final term $E_{n+1}$ in Equation (8) is the remainder or error term. The given formula for $E_{n+1}$ is valid when we assume only that $f^{(n+1)}$ exists at each point of the open interval (a, b). The error term is similar to the terms preceding it, but notice that $f^{(n+1)}$ must be evaluated at a point other than c. This point ξ depends on x and is in the open interval (c, x) or (x, c). Other forms of the remainder are possible; the one given here is Lagrange’s form. (We do not prove Taylor’s Theorem here.)
EXAMPLE 5 Derive the Taylor series for $e^x$ at c = 0, and prove that it converges to $e^x$ by using Taylor’s Theorem.
Solution If $f(x) = e^x$, then $f^{(k)}(x) = e^x$ for k ≥ 0. Therefore, $f^{(k)}(c) = f^{(k)}(0) = e^0 = 1$ for all k. From Equation (8), we have
\[ e^x = \sum_{k=0}^{n} \frac{x^k}{k!} + \frac{e^{\xi}}{(n+1)!}x^{n+1} \tag{9} \]
Now let us consider all the values of x in some symmetric interval around the origin, for example, −s ≤ x ≤ s. Then |x| ≤ s, |ξ| ≤ s, and $e^{\xi} \le e^s$. Hence, the remainder term satisfies this inequality:
\[ \lim_{n\to\infty} \left| \frac{e^{\xi}}{(n+1)!}x^{n+1} \right| \le \lim_{n\to\infty} \frac{e^s}{(n+1)!}s^{n+1} = 0 \]
Thus, if we take the limit as n → ∞ on both sides of Equation (9), we obtain
\[ e^x = \lim_{n\to\infty} \sum_{k=0}^{n} \frac{x^k}{k!} = \sum_{k=0}^{\infty} \frac{x^k}{k!} \]
■
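The remainder bound used in this argument can be observed numerically. The sketch below is an illustration only; the names `exp_taylor` and `remainder_bound` and the choice s = 2 are assumptions, not part of the text. It compares the actual truncation error of the exponential series with the bound $e^s s^{n+1}/(n+1)!$.

```python
import math

def exp_taylor(x, n):
    """Partial sum of Series (1) through the x**n term."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

def remainder_bound(s, n):
    """Bound e**s * s**(n+1) / (n+1)! on |E_{n+1}| for |x| <= s."""
    return math.exp(s) * s ** (n + 1) / math.factorial(n + 1)

s = 2.0
for n in (5, 10, 20):
    err = abs(math.exp(s) - exp_taylor(s, n))
    # the observed error should sit below the bound, and both shrink with n
    print(n, err, remainder_bound(s, n))
```

Both columns shrink rapidly as n grows, which is the convergence established in the example.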
This example illustrates how we can establish, in specific cases, that a formal Taylor Series (6) actually represents the function. Let’s examine another example to see how the formal series can fail to represent the function.
EXAMPLE 6 Derive the formal Taylor series for f(x) = ln(1 + x) at c = 0, and determine the range of positive x for which the series represents the function.
Solution We need $f^{(k)}(x)$ and $f^{(k)}(0)$ for k ≥ 1. Here is the work:
\[
\begin{aligned}
f(x) &= \ln(1+x) & f(0) &= 0 \\
f'(x) &= (1+x)^{-1} & f'(0) &= 1 \\
f''(x) &= -(1+x)^{-2} & f''(0) &= -1 \\
f'''(x) &= 2(1+x)^{-3} & f'''(0) &= 2 \\
f^{(4)}(x) &= -6(1+x)^{-4} & f^{(4)}(0) &= -6 \\
&\ \,\vdots & &\ \,\vdots \\
f^{(k)}(x) &= (-1)^{k-1}(k-1)!\,(1+x)^{-k} & f^{(k)}(0) &= (-1)^{k-1}(k-1)!
\end{aligned}
\]
Hence by Taylor’s Theorem, we obtain
\[
\begin{aligned}
\ln(1+x) &= \sum_{k=1}^{n} (-1)^{k-1}\frac{(k-1)!}{k!}x^k + \frac{(-1)^n n!\,(1+\xi)^{-n-1}}{(n+1)!}x^{n+1} \\
&= \sum_{k=1}^{n} (-1)^{k-1}\frac{x^k}{k} + \frac{(-1)^n}{n+1}(1+\xi)^{-n-1}x^{n+1}
\end{aligned}
\tag{10}
\]
For the infinite series to represent ln(1 + x), it is necessary and sufficient that the error term converge to zero as n → ∞. Assume that 0 ≤ x ≤ 1. Then 0 ≤ ξ ≤ x (because zero is the point of expansion); thus, 0 ≤ x/(1 + ξ) ≤ 1. Hence, the error term converges to zero in this case. If x > 1, the terms in the series do not approach zero, and the series does not converge. Hence, the series represents ln(1 + x) if 0 ≤ x ≤ 1 but not if x > 1. (The series also represents ln(1 + x) for −1
In Equation (12), let $h = 10^{-5}$. Then
\[ \sqrt{1.00001} \approx 1 + 0.5 \times 10^{-5} - 0.125 \times 10^{-10} = 1.00000\ 49999\ 87500 \]
By substituting −h for h in the series, we obtain
\[ \sqrt{1-h} = 1 - \frac{1}{2}h - \frac{1}{8}h^2 - \frac{1}{16}h^3\xi^{-5/2} \]
Hence, we have
\[ \sqrt{0.99999} \approx 0.99999\ 49999\ 87500 \]
Since 1 < ξ < 1 + h, the absolute error does not exceed
\[ \frac{1}{16}h^3\xi^{-5/2} < \frac{1}{16}10^{-15} = 0.00000\ 00000\ 00000\ 0625 \]
and both numerical values are correct to all 15 decimal places shown. ■
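These two values can be confirmed with extended-precision arithmetic. The following sketch uses Python's standard `decimal` module; the working precision of 40 digits and the variable names are arbitrary choices made for this check, not anything prescribed by the text.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # working precision (an arbitrary choice for this check)

h = Decimal("1e-5")
approx_plus = 1 + h / 2 - h * h / 8    # truncated series for sqrt(1 + h)
approx_minus = 1 - h / 2 - h * h / 8   # truncated series for sqrt(1 - h)

exact_plus = (1 + h).sqrt()            # square roots computed to 40 digits
exact_minus = (1 - h).sqrt()

places = Decimal("1e-15")              # round the references to 15 decimals
print(approx_plus, exact_plus.quantize(places))
print(approx_minus, exact_minus.quantize(places))
```

After rounding to 15 decimal places, each high-precision square root coincides with the corresponding three-term approximation.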
Alternating Series
Another theorem from calculus is often useful in establishing the convergence of a series and in estimating the error involved in truncation. From it, we have the following important principle for alternating series:
If the magnitudes of the terms in an alternating series converge monotonically to zero, then the error in truncating the series is no larger than the magnitude of the first omitted term.
This theorem applies only to alternating series—that is, series in which the successive terms are alternately positive and negative.
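■ THEOREM 4 ALTERNATING SERIES THEOREM
If $a_1 \ge a_2 \ge \cdots \ge a_n \ge \cdots \ge 0$ for all n and $\lim_{n\to\infty} a_n = 0$, then the alternating series
\[ a_1 - a_2 + a_3 - a_4 + \cdots \]
converges; that is,
\[ \sum_{k=1}^{\infty} (-1)^{k-1}a_k = \lim_{n\to\infty} \sum_{k=1}^{n} (-1)^{k-1}a_k = \lim_{n\to\infty} S_n = S \]
where S is its sum and $S_n$ is the nth partial sum. Moreover, for all n,
\[ |S - S_n| \le a_{n+1} \]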
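The error bound in the theorem is easy to observe numerically. The sketch below is illustrative only: the generator name `alternating_partial_sums` is an assumption, and the series chosen is the alternating harmonic series $a_k = 1/k$, whose sum is ln 2 by Series (5) with x = 1.

```python
import math

def alternating_partial_sums(term, n_max):
    """Yield (n, S_n) for S_n = a_1 - a_2 + ... +/- a_n, where term(k) returns a_k."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += (-1) ** (n - 1) * term(n)
        yield n, total

# a_k = 1/k gives ln 2; compare |S - S_n| with the first omitted term a_{n+1}
for n, s_n in alternating_partial_sums(lambda k: 1.0 / k, 8):
    print(n, abs(math.log(2.0) - s_n), 1.0 / (n + 1))
```

In every row the middle column (the actual error) stays below the last column (the first omitted term), as the theorem asserts.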
EXAMPLE 8 If the sine series is to be used in computing sin 1 with an error less than $\frac{1}{2} \times 10^{-6}$, how many terms are needed?
Solution From Series (2), we have
\[ \sin 1 = 1 - \frac{1}{3!} + \frac{1}{5!} - \frac{1}{7!} + \cdots \]
If we stop at 1/(2n − 1)!, the error does not exceed the first neglected term, which is 1/(2n + 1)!. Thus, we should select n so that
\[ \frac{1}{(2n+1)!} < \frac{1}{2} \times 10^{-6} \]
Using logarithms to base 10, we obtain log(2n + 1)! > log 2 + 6 = 6.3. With a calculator, we compute a table of values for log n! and find that log 10! ≈ 6.6. Hence, if n ≥ 5, the error will be acceptable. ■
EXAMPLE 9 If the logarithmic Series (5) is to be used for computing ln 2 with an error of less than $\frac{1}{2} \times 10^{-6}$, how many terms will be required?
Solution To compute ln 2, we take x = 1 in the series, and using ≈ to mean approximate equality, we have
\[ S = \ln 2 \approx 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots + \frac{(-1)^{n-1}}{n} = S_n \]
By the Alternating Series Theorem, the error involved when the series is truncated with n terms is
\[ |S - S_n| \le \frac{1}{n+1} \]
We select n so that
\[ \frac{1}{n+1} < \frac{1}{2} \times 10^{-6} \]
Hence, more than two million terms would be needed! We conclude that this method of computing ln 2 is not practical. (See Problems 1.2.10 through 1.2.12 for several good alternatives.) ■
A word of caution is needed about this technique of calculating the number of terms to be used in a series by just making the (n + 1)st term less than some tolerance. This procedure is valid only for alternating series in which the terms decrease in magnitude to zero, although it is occasionally used to get rough estimates in other cases. For example, it can be used to identify a nonalternating series as one that converges slowly. When this technique cannot be used, a bound on the remaining terms of the series has to be established. Determining such a bound may be somewhat difficult.
EXAMPLE 10 It is known that
\[ \frac{\pi^4}{90} = 1^{-4} + 2^{-4} + 3^{-4} + \cdots \]
How many terms should we take to compute $\pi^4/90$ with an error of at most $\frac{1}{2} \times 10^{-6}$?
Solution A naive approach is to take
\[ 1^{-4} + 2^{-4} + 3^{-4} + \cdots + n^{-4} \]
where n is chosen so that the next term, $(n+1)^{-4}$, is less than $\frac{1}{2} \times 10^{-6}$. This value of n is 37, but this is an erroneous answer because the partial sum
\[ S_{37} = \sum_{k=1}^{37} k^{-4} \]
differs from $\pi^4/90$ by approximately $6 \times 10^{-6}$. What we should do, of course, is to select n so that all the omitted terms add up to less than $\frac{1}{2} \times 10^{-6}$; that is,
\[ \sum_{k=n+1}^{\infty} k^{-4} < \frac{1}{2} \times 10^{-6} \]
By a technique familiar from calculus (see Figure 1.3), we have
\[ \sum_{k=n+1}^{\infty} k^{-4} < \int_{n}^{\infty} x^{-4}\,dx = \left.\frac{x^{-3}}{-3}\right|_{n}^{\infty} = \frac{1}{3n^3} \]
FIGURE 1.3 Illustrating Example 10
Thus, it suffices to select n so that $(3n^3)^{-1} < \frac{1}{2} \times 10^{-6}$, or n ≥ 88. (A more sophisticated analysis will improve this considerably.) ■
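The two choices of n in Example 10 can be compared directly. This is a minimal sketch; the helper name `partial_sum` and the decision to add the smallest terms first are illustrative choices, not part of the text.

```python
import math

target = math.pi ** 4 / 90

def partial_sum(n):
    """S_n = sum of k**-4 for k = 1..n, added smallest terms first to limit roundoff."""
    return sum(k ** -4.0 for k in range(n, 0, -1))

for n in (37, 88):
    err = target - partial_sum(n)      # sum of all omitted terms
    # compare the true tail with the integral bound 1/(3 n^3)
    print(n, err, 1.0 / (3 * n ** 3))
```

With n = 37 the omitted tail is several times larger than the tolerance even though the next term alone is below it; with n = 88 the tail falls under the integral bound and under $\frac{1}{2} \times 10^{-6}$.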
Summary
(1) The Taylor series expansion about c for f(x) is
\[ f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!}(x-c)^k + E_{n+1} \]
with error term
\[ E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-c)^{n+1} \]
A more useful form for us is the Taylor series expansion for f(x + h), which is
\[ f(x+h) = \sum_{k=0}^{n} \frac{f^{(k)}(x)}{k!}h^k + E_{n+1} \]
with error term
\[ E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}h^{n+1} = O(h^{n+1}) \]
(2) An alternating series
\[ S = \sum_{k=1}^{\infty} (-1)^{k-1}a_k \]
converges when the terms $a_k$ converge downward to zero. Furthermore, the partial sums $S_n$ differ from S by an amount that is bounded by
\[ |S - S_n| \le a_{n+1} \]
Additional References
For additional study, see the following references found in the Bibliography: Atkinson [1988, 1993], Burden and Faires [2001], Conte and de Boor [1980], Dahlquist and Björck [1974], Forsythe, Malcolm, and Moler [1977], Fröberg [1969], Gautschi [1997], Gerald and Wheatley [1999], Golub and Ortega [1993], Golub and Van Loan [1996], Hämmerlin and Hoffmann [1991], Heath [2002], Higham and Higham [2006], Hildebrand [1974], Isaacson and Keller [1966], Kahaner, Moler, and Nash [1989], Kincaid and Cheney [2002], Maron [1991], Moler [2004], Nievergelt, Farra, and Reingold [1974], Oliveira and Stewart [2006], Ortega [1990a], Phillips and Taylor [1973], Ralston [1965], Ralston and Rabinowitz [2001], Rice [1983], Scheid [1968], Skeel and Keiper [1992], Van Loan [1997, 2000], Wood [1999], and Young and Gregory [1988].
Some other numerical methods books with an emphasis on a particular mathematical software system or computer language are Chapman [2000], Devitt [1993], Ellis and Lodi [1991], Ellis, Johnson, Lodi, and Schwalbe [1997], Garvan [2002], Knight [2000], Lindfield and Penny [2000], Press, Teukolsky, Vetterling, and Flannery [2002], Recktenwald [2000], Schilling and Harris [2000], and Szabo [2002].
Problems 1.2
1. The Maclaurin series for $(1 + x)^n$ is also known as the binomial series. It states that
\[ (1+x)^n = 1 + nx + \frac{n(n-1)}{2!}x^2 + \frac{n(n-1)(n-2)}{3!}x^3 + \cdots \qquad (x^2 < 1) \]
Derive this series. Then give its particular forms in summation notation by letting n = 2, n = 3, and $n = \frac{1}{2}$. Next use the last form to compute $\sqrt{1.0001}$ correct to 15 decimal places (rounded).
2. (Continuation) Use the series in the preceding problem to obtain Series (4). How could this series be used on a computing machine to produce x/y if only addition and multiplication are built-in operations?
3. (Continuation) Use the previous problem to obtain a series for $(1 + x^2)^{-1}$.
4. Why do the following functions not possess Taylor series expansions at x = 0?
ᵃa. $f(x) = \sqrt{x}$  ᵃb. $f(x) = |x|$  c. $f(x) = \arcsin(x - 1)$
d. $f(x) = \cot x$  ᵃe. $f(x) = \log x$  f. $f(x) = x^{\pi}$
ᵃ5. Determine the Taylor series for cosh x about zero. Evaluate cosh(0.7) by summing four terms. Compare with the actual value.
6. Determine the first two nonzero terms of the series expansion about zero for the following:
ᵃa. $e^{\cos x}$  ᵃb. $\sin(\cos x)$  c. $(\cos x)^2(\sin x)$
ᵃ7. Find the smallest nonnegative integer m such that the Taylor series about m for $(x - 1)^{1/2}$ exists. Determine the coefficients in the series.
ᵃ8. Determine how many terms are needed to compute e correctly to 15 decimal places (rounded) using Series (1) for $e^x$.
ᵃ9. (Continuation) If x < 0 in the preceding problem, what are the signs of the terms in the series? Loss of significant digits can be a serious problem in using the series. Will the formula $e^{-x} = 1/e^x$ be helpful in reducing the error? Explain. (See Section 2.3 for further discussion.) Try high-precision computer arithmetic to see how bad the floating-point errors can be.
10. Show how the simple equation ln 2 = ln[e(2/e)] can be used to speed up the calculation of ln 2 in Series (10).
ᵃ11. What is the series for ln(1 − x)? What is the series for ln[(1 + x)/(1 − x)]?
ᵃ12. (Continuation) In the series for ln[(1 + x)/(1 − x)], determine what value of x to use if we wish to compute ln 2. Estimate the number of terms needed for ten digits (rounded) of accuracy. Is this method practical?
13. Use the Alternating Series Theorem to determine the number of terms in Series (5) needed for computing ln 1.1 with error less than $\frac{1}{2} \times 10^{-8}$.
14. Write the Taylor series for the function $f(x) = x^3 - 2x^2 + 4x - 1$, using x = 2 as the point of expansion; that is, write a formula for f(2 + h).
15. Determine the first four nonzero terms in the series expansion about zero for
ᵃa. f(x) = (sin x) + (cos x) and find an approximate value for f(0.001)
ᵃb. g(x) = (sin x)(cos x) and find an approximate value for g(0.0006).
Compare the accuracy of these approximations to those obtained from tables or via a calculator.
ᵃ16. Verify this Taylor series and prove that it converges on the interval −e
cj−lbl (j = 0,1,…,m)
Maclaurin series for f. Moreover, the Padé approximations may contain singularities.
a. Determine the rational functions $R_{1,1}(x)$ and $R_{2,2}(x)$. Produce and compare computer plots for $f(x) = e^x$, $R_{1,1}$, and $R_{2,2}$. Do these low-order rational functions approximate the exponential function $e^x$ satisfactorily within [−1, 1]? How do they compare to the truncated Maclaurin polynomials of the preceding problem?
b. Repeat using $R_{2,2}(x)$ and $R_{3,1}(x)$ for the function g(x) = ln(1 + x).
Information on the life and work of the French mathematician Henri Eugène Padé (1863–1953) can be found in Wood [1999]. This reference also has examples and exercises similar to these. Further examples of Padé approximation can be seen.
23. (Continuation) Repeat for the Bessel function $J_0(2x)$, whose Maclaurin series is
\[ 1 - x^2 + \frac{x^4}{4} - \frac{x^6}{36} + \cdots = \sum_{i=0}^{\infty} (-1)^i \left(\frac{x^i}{i!}\right)^2 \]
Then determine $R_{2,2}(x)$, $R_{4,3}(x)$, and $R_{2,4}(x)$ as well as comparing plots.
24. Carry out the details in the introductory example to this chapter by first deriving the Taylor series for ln(1 + x ) and computing ln 2 ≈ 0.63452 using the first eight terms. Then establish the series ln[(1 + x)/(1 − x)] and calculate ln2 ≈ 0.69313 using the terms shown. Determine the absolute error and relative errors for each of these answers.
25. Reproduce Figure 1.2 using your computer as well as adding the curve for S4.
26. Use a mathematical software system that does symbolic manipulations such as Maple or Mathematica to carry out
a. Example 3 b. Example 6
27. Can you obtain the following numerical results?
\[ \sqrt{1.00001} = 1.00000\ 49999\ 87500\ 06249\ 96093\ 77734\ 37500\ 0000 \]
\[ \sqrt{0.99999} = 0.99999\ 49999\ 87499\ 93749\ 96093\ 72265\ 62500\ 00000 \]
Are these answers accurate to all digits shown?
2
Floating-Point Representation and Errors
Computers usually do not use base-10 arithmetic for storage or computation. Numbers that have a finite expression in one number system may have an infinite expression in another system. This phenomenon is illustrated when the familiar decimal number 1/10 is converted into the binary system:
(0.1)10 = (0.0 0011 0011 0011 0011 0011 0011 0011 0011 . . .)2
In this chapter, we explain the floating-point number system and develop basic facts about roundoff errors. Another topic is loss of significance, which occurs when nearly equal numbers are subtracted. It is studied and shown to be avoidable by various programming techniques.
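The binary expansion of 1/10 shown above can be generated by repeated doubling. The sketch below is an illustration only; the helper name `binary_digits` is an assumption, and exact rational arithmetic from Python's `fractions` module is used so that the digits are not contaminated by the very rounding being demonstrated.

```python
from fractions import Fraction

def binary_digits(x, count):
    """Return the first `count` binary digits after the point of a fraction 0 <= x < 1."""
    digits = []
    for _ in range(count):
        x *= 2
        bit = int(x)        # integer part is the next binary digit
        digits.append(str(bit))
        x -= bit            # keep only the fractional part
    return "".join(digits)

print(binary_digits(Fraction(1, 10), 24))

# The stored double nearest to 0.1 is not exactly 1/10, which is why
# simple decimal identities can fail in floating-point arithmetic.
print(Fraction(0.1) == Fraction(1, 10))
print(0.1 + 0.2 == 0.3)
```

The digit string repeats the block 0011 indefinitely, so no finite binary word can hold 1/10 exactly; both comparisons print False.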
2.1 Floating-Point Representation
The standard way to represent a nonnegative real number in decimal form is with an integer part, a fractional part, and a decimal point between them (for example, 37.21829, 0.002271828, and 30 00527.11059). Another standard form, often called normalized scientific notation, is obtained by shifting the decimal point and supplying appropriate powers of 10. Thus, the preceding numbers have alternative representations as
\[
\begin{aligned}
37.21829 &= 0.37218\ 29 \times 10^{2} \\
0.00227\ 1828 &= 0.22718\ 28 \times 10^{-2} \\
30\ 00527.11059 &= 0.30005\ 27110\ 59 \times 10^{7}
\end{aligned}
\]
In normalized scientific notation, the number is represented by a fraction multiplied by $10^n$, and the leading digit in the fraction is not zero (except when the number involved is zero). Thus, we write 79325 as $0.79325 \times 10^5$, not $0.07932\ 5 \times 10^6$ or $7.9325 \times 10^4$ or some other way.
Normalized Floating-Point Representation
In the context of computer science, normalized scientific notation is also called normalized floating-point representation. In the decimal system, any real number x (other than zero) can be represented in normalized floating-point form as
\[ x = \pm 0.d_1d_2d_3\ldots \times 10^n \]
where $d_1 \ne 0$ and n is an integer (positive, negative, or zero). The numbers $d_1, d_2, \ldots$ are the decimal digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
Stated another way, the real number x, if different from zero, can be represented in normalized floating-point decimal form as
\[ x = \pm r \times 10^n \qquad \left(\tfrac{1}{10} \le r < 1\right) \]
This representation consists of three parts: a sign that is either + or −, a number r in the interval $[\frac{1}{10}, 1)$, and an integer power of 10. The number r is called the normalized mantissa and n the exponent.
The floating-point representation in the binary system is similar to that in the decimal system in several ways. If x ≠ 0, it can be written as
\[ x = \pm q \times 2^m \qquad \left(\tfrac{1}{2} \le q < 1\right) \]
The mantissa q would be expressed as a sequence of zeros or ones in the form $q = (0.b_1b_2b_3\ldots)_2$, where $b_1 \ne 0$. Hence, $b_1 = 1$ and then necessarily $q \ge \frac{1}{2}$.
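The binary normalization just described can be inspected in Python, whose standard function `math.frexp` splits a float into a mantissa and exponent under the same convention, 1/2 ≤ |q| < 1. The sample values below are illustrative choices.

```python
import math

for x in (37.21829, 0.002271828, 1.0, 0.1):
    q, m = math.frexp(x)   # x == q * 2**m with 0.5 <= q < 1 for x > 0
    print(x, q, m)
```

For instance, 1.0 is reported as mantissa 0.5 with exponent 1, since $1 = \frac{1}{2} \times 2^1$ in this normalization.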
A floating-point number system within a computer is similar to what we have just
described, with one important difference: Every computer has only a finite word length and a finite total capacity, so only numbers with a finite number of digits can be represented. A number is allotted only one word of storage in the single-precision mode (two or more words in double or extended precision). In either case, the degree of precision is strictly limited. Clearly, irrational numbers cannot be represented, nor can those rational numbers that do not fit the finite format imposed by the computer. Furthermore, numbers may be either too large or too small to be representable. The real numbers that are representable in a computer are called its machine numbers.
Since any number used in calculations with a computer system must conform to the format of numbers in that system, it must have a finite expansion. Numbers that have a nonterminating expansion cannot be accommodated precisely. Moreover, a number that has a terminating expansion in one base may have a nonterminating expansion in another. A good example of this is the following simple fraction as given in the introductory example to this chapter:
\[ \frac{1}{10} = (0.1)_{10} = (0.0\ 6314\ 6314\ 6314\ 6314\ldots)_8 = (0.0\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ldots)_2 \]
The important point here is that most real numbers cannot be represented exactly in a computer. (See Appendix B for a discussion of representation of numbers in different bases.)
The effective number system for a computer is not a continuum but a rather peculiar discrete set. To illustrate, let us take an extreme example, in which the floating-point numbers must be of the form $x = \pm(0.b_1b_2b_3)_2 \times 2^{\pm k}$, where $b_1$, $b_2$, $b_3$, and k are allowed to have only the value 0 or 1.
EXAMPLE 1
Solution
2.1 Floating-Point Representation 45 List all the floating-point numbers that can be expressed in the form
x = ±(0.b1b2b3)2 × 2±k (k, bi ∈ {0, 1})
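There are two choices for the ±, two choices for $b_1$, two choices for $b_2$, two choices for $b_3$, and three choices for the exponent. Thus, at first, one would expect 2 × 2 × 2 × 2 × 3 = 48 different numbers. However, there is some duplication. For example, the nonnegative numbers in this system are as follows:
$$\begin{array}{lll}
0.000 \times 2^{0} = 0 & 0.000 \times 2^{1} = 0 & 0.000 \times 2^{-1} = 0 \\
0.001 \times 2^{0} = \frac{1}{8} & 0.001 \times 2^{1} = \frac{1}{4} & 0.001 \times 2^{-1} = \frac{1}{16} \\
0.010 \times 2^{0} = \frac{2}{8} & 0.010 \times 2^{1} = \frac{2}{4} & 0.010 \times 2^{-1} = \frac{2}{16} \\
0.011 \times 2^{0} = \frac{3}{8} & 0.011 \times 2^{1} = \frac{3}{4} & 0.011 \times 2^{-1} = \frac{3}{16} \\
0.100 \times 2^{0} = \frac{4}{8} & 0.100 \times 2^{1} = \frac{4}{4} & 0.100 \times 2^{-1} = \frac{4}{16} \\
0.101 \times 2^{0} = \frac{5}{8} & 0.101 \times 2^{1} = \frac{5}{4} & 0.101 \times 2^{-1} = \frac{5}{16} \\
0.110 \times 2^{0} = \frac{6}{8} & 0.110 \times 2^{1} = \frac{6}{4} & 0.110 \times 2^{-1} = \frac{6}{16} \\
0.111 \times 2^{0} = \frac{7}{8} & 0.111 \times 2^{1} = \frac{7}{4} & 0.111 \times 2^{-1} = \frac{7}{16}
\end{array}$$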
Altogether there are 31 distinct numbers in the system. The positive numbers obtained are shown on a line in Figure 2.1. Observe that the numbers are symmetrically but unevenly distributed about zero. ■
FIGURE 2.1 Positive machine numbers in Example 1 (number line with marks at 0, 1/16, 1/8, 3/16, 1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4)
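The count in Example 1 can be checked by brute force. The following is a minimal sketch, not part of the original text: the function name `toy_machine_numbers` is hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that duplicates are detected without any rounding.

```python
from fractions import Fraction
from itertools import product


def toy_machine_numbers():
    """Return every distinct value of +/-(0.b1b2b3)_2 * 2**k with k in {-1, 0, 1}."""
    values = set()
    for sign, b1, b2, b3, k in product((1, -1), (0, 1), (0, 1), (0, 1), (-1, 0, 1)):
        mantissa = Fraction(b1, 2) + Fraction(b2, 4) + Fraction(b3, 8)
        values.add(sign * mantissa * Fraction(2) ** k)  # the set discards duplicates
    return sorted(values)


numbers = toy_machine_numbers()
positives = [v for v in numbers if v > 0]
print(len(numbers))                            # number of distinct machine numbers
print(", ".join(str(v) for v in positives))    # the positive ones, in increasing order
```

Of the 48 sign, mantissa, and exponent combinations, only the distinct values survive in the set, which is why the count drops to the figure stated in the example.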
If, in the course of a computation, a number x is produced of the form $\pm q \times 2^m$, where m is outside the computer's permissible range, then we say that an overflow or an underflow has occurred or that x is outside the range of the computer. Generally, an overflow results in a fatal error (or exception), and the normal execution of the program stops. An underflow, however, is usually treated automatically by setting x to zero without any interruption of the program but with a warning message in most computers.
In a computer whose floating-point numbers are restricted to the form in Example 1, any number closer to zero than $\frac{1}{16}$ would underflow to zero, and any number outside the range −1.75 to +1.75 would overflow to machine infinity.
If, in Example 1, we allow only normalized floating-point numbers, then all our numbers (with the exception of zero) have the form
$$x = \pm(0.1b_2b_3)_2 \times 2^{\pm k}$$
This creates a phenomenon known as the hole at zero. Our nonnegative machine numbers are now distributed as in Figure 2.2. There is a relatively wide gap between zero and the smallest positive machine number, which is $(0.100)_2 \times 2^{-1} = \frac{1}{4}$.
46
Chapter 2 Floating-Point Representation and Errors
FIGURE 2.2 Normalized machine numbers in Example 1 (number line with marks at 0, 1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4)
Floating-Point Representation
A computer that operates in floating-point mode represents numbers as described earlier except for the limitations imposed by the finite word length. Many binary computers have a word length of 32 bits (binary digits). We shall describe a machine of this type whose features mimic many workstations and personal computers in widespread use. The internal representation of numbers and their storage is standard floating-point form, which is used in almost all computers. For simplicity, we have left out a discussion of some of the details and features. Fortunately, one need not know all the details of the floating-point arithmetic system used in a computer to use it intelligently. Nevertheless, it is generally helpful in debugging a program to have a basic understanding of the representation of numbers in your computer.
By single-precision floating-point numbers, we mean all acceptable numbers in a computer using the standard single-precision floating-point arithmetic format. (In this discussion, we are assuming that such a computer stores these numbers in 32-bit words.) This set is a finite subset of the real numbers. It consists of ±0, ±∞, normal and subnormal single-precision floating-point numbers, but not Not-a-Number (NaN) values. (More detail on these subjects is in Appendix B and in the references.) Recall that most real numbers cannot be represented exactly as floating-point numbers, since they have infinite decimal or binary expansions (all irrational numbers and some rational numbers); for example, $\pi$, $e$, $\frac{1}{3}$, 0.1, and so on.
Because of the 32-bit word length, as much as possible of the normalized floating-point number
$$\pm q \times 2^m$$
must be contained in those 32 bits. One way of allocating the 32 bits is as follows:
sign of q: 1 bit
integer |m|: 8 bits
number q: 23 bits
Information on the sign of m is contained in the eight bits allocated for the integer |m|. In such a scheme, we can represent real numbers with |m| as large as $2^7 - 1 = 127$. The exponent represents numbers from −127 through 128.
Single-Precision Floating-Point Form
We now describe a machine number of the following form in standard single-precision floating-point representation:
$$(-1)^s \times 2^{c-127} \times (1.f)_2$$
The leftmost bit is used for the sign of the mantissa, where s = 0 corresponds to + and s = 1 corresponds to −. The next eight bits are used to represent the number c in the exponent of $2^{c-127}$, which is interpreted as an excess-127 code. Finally, the last 23 bits represent f from the fractional part of the mantissa in the 1-plus form: $(1.f)_2$. Each floating-point single-precision word is partitioned as in Figure 2.3.
FIGURE 2.3 Partitioned floating-point single-precision computer word: the sign of the mantissa s and the biased exponent c occupy the first 9 bits, and f from the one-plus mantissa $(1.f)_2$ occupies the remaining 23 bits, with the radix point understood immediately before f.
In the normalized representation of a nonzero floating-point number, the first bit in the mantissa is always 1 so that this bit does not have to be stored. This can be accomplished by shifting the binary point to a "1-plus" form $(1.f)_2$. The mantissa is the rightmost 23 bits and contains f with an understood binary point as in Figure 2.3. So the mantissa (significand) actually corresponds to 24 binary digits since there is a hidden bit. (An important exception is the number ±0.)
We now outline the procedure for determining the representation of a real number x. If x is zero, it is represented by a full word of zero bits with the possible exception of the sign bit. For a nonzero x, first assign the sign bit for x and consider |x|. Then convert both the integer and fractional parts of |x| from decimal to binary. Next one-plus normalize $(|x|)_2$ by shifting the binary point so that the first bit to the left of the binary point is a 1 and all bits to the left of this 1 are 0. To compensate for this shift of the binary point, adjust the exponent of 2; that is, multiply by the appropriate power of 2. The 24-bit one-plus-normalized mantissa in binary is thus found. Now the current exponent of 2 should be set equal to c − 127 to determine c, which is then converted from decimal to binary. The sign bit of the mantissa is combined with $(c)_2$ and $(f)_2$. Finally, write the 32-bit representation of x as eight hexadecimal digits.
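The procedure just outlined can be sketched in a few lines. This is an illustration under stated assumptions, not the text's own algorithm: the function name `single_precision_word` is hypothetical, the input is assumed to be a nonzero value that is exactly a normalized single-precision machine number (no rounding step is performed), and the result is cross-checked against Python's `struct` module, which packs floats in the standard 32-bit format.

```python
import struct
from fractions import Fraction


def single_precision_word(x):
    """Return the 32-bit word (as an int) for a nonzero, exactly representable x."""
    value = Fraction(x)
    sign = 1 if value < 0 else 0          # step 1: sign bit
    value = abs(value)
    exponent = 0
    while value >= 2:                     # step 2: one-plus normalize, 1 <= value < 2
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    c = exponent + 127                    # step 3: biased exponent, exponent = c - 127
    f = (value - 1) * 2**23               # step 4: the 23 stored fraction bits
    if f.denominator != 1 or not 0 < c < 255:
        raise ValueError("not a normalized single-precision machine number")
    return (sign << 31) | (c << 23) | int(f)


word = single_precision_word(0.15625)     # 0.15625 = (1.01)_2 x 2^-3
library_word = int.from_bytes(struct.pack(">f", 0.15625), "big")
print(format(word, "08X"), format(library_word, "08X"))
```

Working with `Fraction` keeps every shift of the binary point exact, so the sketch rejects inputs that would need rounding instead of silently altering them.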
The value of c in the representation of a floating-point number in single precision is restricted by the inequality
$$0 < c < (11111111)_2 = 255$$
The values 0 and 255 are reserved for special cases, including ±0 and ±∞, respectively.
Hence, the actual exponent of the number is restricted by the inequality
$$-126 \le c - 127 \le 127$$
Likewise, we find that the mantissa of each nonzero number is restricted by the inequality
$$1 \le (1.f)_2 \le (1.11111111111111111111111)_2 = 2 - 2^{-23}$$
The largest number representable is therefore $(2 - 2^{-23})\,2^{127} \approx 2^{128} \approx 3.4 \times 10^{38}$. The smallest positive number is $2^{-126} \approx 1.2 \times 10^{-38}$.
The binary machine floating-point number $\varepsilon = 2^{-23}$ is called the machine epsilon when using single precision. It is the smallest positive machine number ε such that 1 + ε ≠ 1. Because $2^{-23} \approx 1.2 \times 10^{-7}$, we infer that in a simple computation, approximately six significant decimal digits of accuracy may be obtained in single precision. Recall that 23 bits are allocated for the mantissa.
Double-Precision Floating-Point Form
When more precision is needed, double precision can be used, in which case each double-precision floating-point number is stored in two computer words in memory. In double precision, there are 52 bits allocated for the mantissa. The double-precision machine epsilon is $2^{-52} \approx 2.2 \times 10^{-16}$, so approximately 15 significant decimal digits of precision are available. There are 11 bits allowed for the exponent, which is biased by 1023. The exponent represents numbers from −1022 through 1023. A machine number in standard double-precision floating-point form corresponds to
$$(-1)^s \times 2^{c-1023} \times (1.f)_2$$
The leftmost bit is used for the sign of the mantissa with s = 0 for + and s = 1 for −. The next eleven bits are used to represent the exponent c corresponding to $2^{c-1023}$. Finally, 52 bits represent f from the fractional part of the mantissa in the one-plus form: $(1.f)_2$.
The value of c in the representation of a floating-point number in double precision is restricted by the inequality
$$0 < c < (11111111111)_2 = 2047$$
As in single precision, the values at the ends of this interval are reserved for special cases.
Hence, the actual exponent of the number is restricted by the inequality
$$-1022 \le c - 1023 \le 1023$$
We find that the mantissa of each nonzero number is restricted by the inequality
$$1 \le (1.f)_2 \le (1.111111111\cdots1111111111)_2 = 2 - 2^{-52}$$
Because $2^{-52} \approx 2.2 \times 10^{-16}$, we infer that in a simple computation approximately 15 significant decimal digits of accuracy may be obtained in double precision. Recall that 52 bits are allocated for the mantissa. The largest double-precision machine number is $(2 - 2^{-52})\,2^{1023} \approx 2^{1024} \approx 1.8 \times 10^{308}$. The smallest double-precision positive machine number is $2^{-1022} \approx 2.2 \times 10^{-308}$.
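These double-precision constants can be inspected directly on most systems. The sketch below assumes that the Python `float` type is the standard 64-bit format described above, which holds on common platforms but is not guaranteed by the language; `sys.float_info` is the standard-library record of the platform's floating-point parameters.

```python
import sys

info = sys.float_info

# Machine epsilon, largest number, and smallest positive normalized number.
print(info.epsilon == 2.0**-52)
print(info.max == (2.0 - 2.0**-52) * 2.0**1023)
print(info.min == 2.0**-1022)

# 53 significant bits (52 stored plus the hidden bit), about 15 decimal digits.
print(info.mant_dig, info.dig)
```

Each comparison is exact because the quantities involved are themselves double-precision machine numbers.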
Single precision on a 64-bit computer is comparable to double precision on a 32-bit computer, whereas double precision on a 64-bit computer gives four times the precision available on a 32-bit computer.
In single precision, 31 bits are available for an integer because only 1 bit is needed for the sign. Consequently, the range for integers is from $-(2^{31} - 1)$ to $(2^{31} - 1) = 2147483647$. In double precision, 63 bits are used for integers giving integers in the range $-(2^{63} - 1)$ to $(2^{63} - 1)$. In using integer arithmetic, accurate calculations can result in only approximately nine digits in single precision and 18 digits in double precision! For high accuracy, most computations should be done by using double-precision floating-point arithmetic.
EXAMPLE 2
Determine the machine representation of the decimal number −52.234375 in both single precision and double precision.
Solution
Converting the integer part to binary, we have $(52.)_{10} = (64.)_8 = (110\,100.)_2$. Next, converting the fractional part, we have $(.234375)_{10} = (.17)_8 = (.001\,111)_2$. Now
$$(52.234375)_{10} = (110\,100.001\,111)_2 = (1.101\,000\,011\,110)_2 \times 2^5$$
is the corresponding one-plus form in base 2, and $(.101\,000\,011\,110)_2$ is the stored mantissa. Next the exponent is $(5)_{10}$, and since c − 127 = 5, we immediately see that $(132)_{10} = (204)_8 = (10\,000\,100)_2$ is the stored exponent. Thus, the single-precision machine representation of −52.234375 is
$$[1\ 10000100\ 10100001111000000000000]_2 = [1100\ 0010\ 0101\ 0000\ 1111\ 0000\ 0000\ 0000]_2 = [\mathrm{C250F000}]_{16}$$
In double precision, for the exponent $(5)_{10}$, we let c − 1023 = 5, and we have $(1028)_{10} = (2004)_8 = (10\,000\,000\,100)_2$, which is the stored exponent. Thus, the double-precision machine representation of −52.234375 is
$$[1\ 10000000100\ 101000011110000\cdots00]_2 = [1100\ 0000\ 0100\ 1010\ 0001\ 1110\ 0000\cdots0000]_2 = [\mathrm{C04A1E0000000000}]_{16}$$
Here $[\cdots]_k$ is the bit pattern of the machine word(s) that represents floating-point numbers, which is displayed in base k. ■
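The two hexadecimal words obtained for −52.234375 can be confirmed with the standard-library `struct` module. This is a quick check rather than a derivation; the format codes `">f"` and `">d"` request big-endian 32-bit and 64-bit floating-point encodings, so the bytes print in the same left-to-right order as the words in the text.

```python
import struct

x = -52.234375

single_word = struct.pack(">f", x).hex().upper()   # 32-bit encoding as 8 hex digits
double_word = struct.pack(">d", x).hex().upper()   # 64-bit encoding as 16 hex digits

print(single_word)
print(double_word)
```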
EXAMPLE 3
Determine the decimal numbers that correspond to these machine words:
$$[\mathrm{45DE4000}]_{16} \qquad [\mathrm{BA390000}]_{16}$$
Solution
The first number in binary is
$$[0100\ 0101\ 1101\ 1110\ 0100\ 0000\ 0000\ 0000]_2$$
The stored exponent is $(10\,001\,011)_2 = (213)_8 = (139)_{10}$, so 139 − 127 = 12. The mantissa is positive and represents the number
$$\begin{aligned}
(1.101\,111\,001)_2 \times 2^{12} &= (1\,101\,111\,001\,000.)_2 = (15710.)_8 \\
&= 0 \times 1 + 1 \times 8 + 7 \times 8^2 + 5 \times 8^3 + 1 \times 8^4 \\
&= 8(1 + 8(7 + 8(5 + 8(1)))) \\
&= 7112
\end{aligned}$$
Similarly, the second word in binary is
$$[1011\ 1010\ 0011\ 1001\ 0000\ 0000\ 0000\ 0000]_2$$
The exponential part of the word is $(01\,110\,100)_2 = (164)_8 = 116$, so the exponent is 116 − 127 = −11. The mantissa is negative and corresponds to the following floating-point number:
$$\begin{aligned}
-(1.011\,100\,100)_2 \times 2^{-11} &= -(0.000\,000\,000\,010\,111\,001)_2 = -(0.000271)_8 \\
&= -2 \times 8^{-4} - 7 \times 8^{-5} - 1 \times 8^{-6} \\
&= -8^{-6}(1 + 8(7 + 8(2))) \\
&= -\frac{185}{262144} \approx -7.0571899 \times 10^{-4}
\end{aligned}$$
■
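The decoding in this example follows the formula $(-1)^s \times 2^{c-127} \times (1.f)_2$ field by field, which can be sketched as below. The function name `decode_single` is hypothetical, only normalized words (0 < c < 255) are handled, and exact fractions are used so that the second answer appears as 185/262144 rather than as a rounded decimal.

```python
import struct
from fractions import Fraction


def decode_single(word):
    """Decode a normalized 32-bit single-precision word into an exact Fraction."""
    s = word >> 31                 # sign bit
    c = (word >> 23) & 0xFF        # 8-bit biased exponent
    f = word & 0x7FFFFF            # 23-bit stored fraction
    if not 0 < c < 255:
        raise ValueError("c = 0 and c = 255 are reserved for special cases")
    return (-1) ** s * Fraction(2) ** (c - 127) * (1 + Fraction(f, 2**23))


first = decode_single(0x45DE4000)
second = decode_single(0xBA390000)
print(first, second, float(second))

# Cross-check against the library's own interpretation of the same bytes.
library_first = struct.unpack(">f", bytes.fromhex("45DE4000"))[0]
library_second = struct.unpack(">f", bytes.fromhex("BA390000"))[0]
```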
Computer Errors in Representing Numbers
We turn now to the errors that can occur when we attempt to represent a given real number x in the computer. We use a model computer with a 32-bit word length. Suppose first that we let $x = 2^{5321697}$ or $x = 2^{-32591}$. The exponents of these numbers far exceed the limitations of the machine (as described above). These numbers would overflow and underflow, respectively, and the relative error in replacing x by the closest machine number will be very large. Such numbers are outside the range of a 32-bit word-length computer.
Consider next a positive real number x in normalized floating-point form
$$x = q \times 2^m \qquad \left(\tfrac{1}{2} \le q < 1,\ -126 \le m \le 127\right)$$
The process of replacing x by its nearest machine number is called correct rounding, and the error involved is called roundoff error. We want to know how large it can be. We suppose that q is expressed in normalized binary notation, so
$$x = (0.1b_2b_3b_4 \ldots b_{24}b_{25}b_{26} \ldots)_2 \times 2^m$$
One nearby machine number can be obtained by rounding down or by simply dropping the excess bits $b_{25}b_{26}\ldots$, since only 23 bits have been allocated to the stored mantissa. This machine number is
$$x_- = (0.1b_2b_3b_4 \ldots b_{24})_2 \times 2^m$$
It lies to the left of x on the real-number axis. Another machine number, $x_+$, is just to the right of x on the real axis and is obtained by rounding up. It is found by adding one unit to $b_{24}$ in the expression for $x_-$. Thus,
$$x_+ = \left[(0.1b_2b_3b_4 \ldots b_{24})_2 + 2^{-24}\right] \times 2^m$$
The closer of these machine numbers is the one chosen to represent x.
The two situations are illustrated by the simple diagrams in Figure 2.4. If x lies closer to $x_-$ than to $x_+$, then
$$|x - x_-| \le \tfrac{1}{2}|x_+ - x_-| = 2^{-25+m}$$
In this case, the relative error is bounded as follows:
$$\left|\frac{x - x_-}{x}\right| \le \frac{2^{-25+m}}{(0.1b_2b_3b_4 \ldots)_2 \times 2^m} \le \frac{2^{-25}}{\frac{1}{2}} = 2^{-24} = u$$
where $u = 2^{-24}$ is the unit roundoff error for a 32-bit binary computer with standard floating-point arithmetic. Recall that the machine epsilon is $\varepsilon = 2^{-23}$, so $u = \frac{1}{2}\varepsilon$. Moreover, $u = 2^{-k}$, where k is the number of binary digits used in the mantissa, including the hidden bit (k = 24 in single precision and k = 53 in double precision). On the other hand, if x lies closer to $x_+$ than to $x_-$, then
$$|x - x_+| \le \tfrac{1}{2}|x_+ - x_-|$$
and the same analysis shows that the relative error is no greater than $2^{-24} = u$. So in the case of rounding to the nearest machine number, the relative error is bounded by u.
FIGURE 2.4 A possible relationship between $x_-$, $x_+$, and x.
We note in passing that when all excess digits or bits are discarded, the process is called chopping. If a 32-bit word-length computer has been designed to chop numbers, the relative error bound would be twice as large as above, or $2u = 2^{-23} = \varepsilon$.
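The two bounds, u for correct rounding and 2u for chopping, can be checked empirically. The following is a sketch under assumptions: Python floats (double precision) stand in for the "real" numbers x, the helper `fl32` rounds to single precision by a round trip through `struct`, and `chop32` is a hypothetical helper that discards everything beyond a 24-bit mantissa. Errors are measured exactly with `Fraction`.

```python
import math
import random
import struct
from fractions import Fraction

U = Fraction(1, 2**24)  # unit roundoff u = 2^-24


def fl32(x):
    """Round a Python float to the nearest single-precision machine number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def chop32(x):
    """Discard all bits beyond a 24-bit mantissa (assumes x > 0)."""
    q, m = math.frexp(x)                       # x = q * 2**m with 1/2 <= q < 1
    return math.ldexp(math.floor(q * 2**24), m - 24)


def relative_error(x, approx):
    return abs(Fraction(x) - Fraction(approx)) / abs(Fraction(x))


random.seed(0)
samples = [random.uniform(1e-3, 1e3) for _ in range(10_000)]
worst_round = max(relative_error(x, fl32(x)) for x in samples)
worst_chop = max(relative_error(x, chop32(x)) for x in samples)
print(float(worst_round), float(worst_chop))
```

On a sample of this size the observed worst cases typically come close to the bounds without exceeding them, with the chopping error roughly twice the rounding error.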
Notation fl(x) and Backward Error Analysis
Next let us turn to the errors that are produced in the course of elementary arithmetic operations. To illustrate the principles, suppose that we are working with a five-place decimal machine and wish to add numbers. Two typical machine numbers in normalized floating-point form are
$$x = 0.37218 \times 10^{4} \qquad y = 0.71422 \times 10^{-1}$$
Many computers perform arithmetic operations in a double-length work area, so let us assume that our computer will have a ten-place accumulator. First, the exponent of the smaller number is adjusted so that both exponents are the same. Then the numbers are added in the accumulator, and the rounded result is placed in a computer word:
$$\begin{aligned}
x &= 0.37218\,00000 \times 10^{4} \\
y &= 0.00000\,71422 \times 10^{4} \\
x + y &= 0.37218\,71422 \times 10^{4}
\end{aligned}$$
The nearest machine number is $z = 0.37219 \times 10^{4}$, and the relative error involved in this machine addition is
$$\frac{|x + y - z|}{|x + y|} = \frac{0.00000\,28578 \times 10^{4}}{0.37218\,71422 \times 10^{4}} \approx 0.77 \times 10^{-5}$$
This relative error would be regarded as acceptable on a machine of such low precision.
To facilitate the analysis of such errors, it is convenient to introduce the notation fl(x) to denote the floating-point machine number that corresponds to the real number x. Of course, the function fl depends on the particular computer involved. The hypothetical five-decimal-digit machine used above would give
$$\mathrm{fl}(0.37218\,71422 \times 10^{4}) = 0.37219 \times 10^{4}$$
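The five-place decimal machine can be imitated with Python's `decimal` module. This is a sketch under assumptions: a `Context` with five significant digits plays the role of the machine word, the default 28-digit context plays the role of the accumulator, and round-half-even is chosen as the rounding rule (the text only says "rounded", so the tie-breaking rule is an assumption that does not affect this example).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

five_digit = Context(prec=5, rounding=ROUND_HALF_EVEN)   # the model machine word

x = Decimal("0.37218E4")
y = Decimal("0.71422E-1")

exact_sum = x + y                      # exact here: fits in the default 28 digits
z = five_digit.add(x, y)               # fl(x + y) on the five-digit machine
relative_error = abs(exact_sum - z) / abs(exact_sum)

print(exact_sum, z, relative_error)
```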
For a 32-bit word-length computer, we established previously that if x is any real number within the range of the computer, then
$$\frac{|x - \mathrm{fl}(x)|}{|x|} \le u \qquad (u = 2^{-24}) \tag{1}$$
Here and throughout, we assume that correct rounding is used. This inequality can also be expressed in the more useful form
$$\mathrm{fl}(x) = x(1 + \delta) \qquad (|\delta| \le 2^{-24})$$
To see that these two inequalities are equivalent, simply let $\delta = [\mathrm{fl}(x) - x]/x$. Then, by Inequality (1), we have $|\delta| \le 2^{-24}$, and solving for fl(x) yields $\mathrm{fl}(x) = x(1 + \delta)$.
By considering the details in the addition 1 + ε, we see that if $\varepsilon \ge 2^{-23}$, then fl(1 + ε) > 1, while if $\varepsilon < 2^{-23}$, then fl(1 + ε) = 1. Consequently, if machine epsilon is the smallest positive machine number ε such that
$$\mathrm{fl}(1 + \varepsilon) > 1$$
then $\varepsilon = 2^{-23}$. Sometimes it is necessary to furnish the machine epsilon to a program. Since it is a machine-dependent constant, it can be found by either calling a system routine or by writing a simple program that finds the smallest positive number $x = 2^m$ such that 1 + x > 1 in the machine.
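The "simple program" mentioned here might look like the following sketch. It searches only over powers of 2, as the text describes. The helper `fl32` (an assumption, not a system routine) rounds a Python float to single precision through `struct`, so the same loop can be run for both single and double precision; `sys.float_info.epsilon` is the standard-library value used for comparison.

```python
import struct
import sys


def fl32(x):
    """Round a Python float to the nearest single-precision machine number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def machine_epsilon(fl=float):
    """Return the smallest power of 2, eps, for which fl(1 + eps) > 1."""
    eps = 1.0
    while fl(1.0 + eps / 2) > 1.0:   # keep halving while the sum is still above 1
        eps /= 2
    return eps


single_eps = machine_epsilon(fl32)
double_eps = machine_epsilon()
print(single_eps == 2.0**-23, double_eps == sys.float_info.epsilon)
```

The loop stops at the first power of 2 whose half no longer changes the stored value of 1, which is why the candidate is tested at `eps / 2` rather than at `eps`.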
Now let the symbol ⊙ denote any one of the arithmetic operations +, −, ×, or ÷. Suppose a 32-bit word-length computer has been designed so that whenever two machine numbers x and y are to be combined arithmetically, the computer will produce fl(x ⊙ y) instead of x ⊙ y. We can imagine that x ⊙ y is first correctly formed, then normalized, and finally rounded to become a machine number. Under this assumption, the relative error will not exceed $2^{-24}$ by the previous analysis:
$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \delta) \qquad (|\delta| \le 2^{-24})$$
Special cases of this are, of course,
$$\mathrm{fl}(x \pm y) = (x \pm y)(1 + \delta) \qquad \mathrm{fl}(xy) = xy(1 + \delta) \qquad \mathrm{fl}\!\left(\frac{x}{y}\right) = \frac{x}{y}(1 + \delta)$$
In these equations, δ is variable but satisfies $-2^{-24} \le \delta \le 2^{-24}$. The assumptions that we have made about a model 32-bit word-length computer are not quite true for a real computer. For example, it is possible for x and y to be machine numbers and for x ⊙ y to overflow or underflow. Nevertheless, the assumptions should be realistic for most computing machines.
The equations given above can be written in a variety of ways, some of which suggest alternative interpretations of roundoff. For example,
fl(x + y) = x(1 + δ) + y(1 + δ)
This says that the result of adding machine numbers x and y is not in general x + y but is the true sum of x(1 + δ) and y(1 + δ). We can think of x(1 + δ) as the result of slightly perturbing x. Thus, the machine version of x + y, which is fl(x + y), is the exact sum of a slightly perturbed x and a slightly perturbed y. The reader can supply similar interpretations in the examples given in the problems.
This interpretation is an example of backward error analysis. It attempts to determine what perturbation of the original data would cause the computer results to be the exact results for a perturbed problem. In contrast, a direct error analysis attempts to determine how computed answers differ from exact answers based on the same data. In this aspect of scientific computing, computers have stimulated a new way of looking at computational errors.
EXAMPLE 4
If x, y, and z are machine numbers in a 32-bit word-length computer, what upper bound can be given for the relative roundoff error in computing z(x + y)?
Solution
In the computer, the calculation of x + y will be done first. This arithmetic operation produces the machine number fl(x + y), which differs from x + y because of roundoff. By the principles established above, there is a $\delta_1$ such that
$$\mathrm{fl}(x + y) = (x + y)(1 + \delta_1) \qquad (|\delta_1| \le 2^{-24})$$
Now z is already a machine number. When it multiplies the machine number fl(x + y), the result is the machine number fl[z fl(x + y)]. This, too, differs from its exact counterpart, and we have, for some $\delta_2$,
$$\mathrm{fl}[z\,\mathrm{fl}(x + y)] = z\,\mathrm{fl}(x + y)(1 + \delta_2) \qquad (|\delta_2| \le 2^{-24})$$
Putting both of our equations together, we have
$$\begin{aligned}
\mathrm{fl}[z\,\mathrm{fl}(x + y)] &= z(x + y)(1 + \delta_1)(1 + \delta_2) \\
&= z(x + y)(1 + \delta_1 + \delta_2 + \delta_1\delta_2) \\
&\approx z(x + y)(1 + \delta_1 + \delta_2) \\
&= z(x + y)(1 + \delta) \qquad (|\delta| \le 2^{-23})
\end{aligned}$$
In this calculation, $|\delta_1\delta_2| \le 2^{-48}$, and so we ignore it. Also, we put $\delta = \delta_1 + \delta_2$ and then reason that $|\delta| = |\delta_1 + \delta_2| \le |\delta_1| + |\delta_2| \le 2^{-24} + 2^{-24} = 2^{-23}$. ■
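The bound in this example can be tested numerically. The sketch below makes several assumptions: single-precision arithmetic is simulated by computing in double precision and rounding each result with the helper `fl32` (for operands in [1, 2) the double-precision sum and product are exact, so each simulated operation is correctly rounded), and the comparison uses $2^{-23} + 2^{-48}$, which keeps the $\delta_1\delta_2$ term that the text drops.

```python
import random
import struct
from fractions import Fraction


def fl32(x):
    """Round a Python float to the nearest single-precision machine number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_error_of_product(x, y, z):
    """Exact relative error of fl[z * fl(x + y)] for machine numbers x, y, z."""
    computed = fl32(z * fl32(x + y))
    exact = Fraction(z) * (Fraction(x) + Fraction(y))
    return abs(Fraction(computed) - exact) / abs(exact)


random.seed(1)
bound = Fraction(1, 2**23) + Fraction(1, 2**48)
triples = [tuple(fl32(random.uniform(1.0, 2.0)) for _ in range(3))
           for _ in range(10_000)]
worst = max(relative_error_of_product(x, y, z) for x, y, z in triples)
print(float(worst), float(bound))
```

Random data rarely attains the worst case, so the observed maximum is usually noticeably below the bound; the bound is a guarantee, not a typical value.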
EXAMPLE 5
Critique the following attempt to estimate the relative roundoff error in computing the sum of two real numbers, x and y. In a 32-bit word-length computer, the calculation yields
$$\begin{aligned}
z &= \mathrm{fl}[\mathrm{fl}(x) + \mathrm{fl}(y)] \\
&= [x(1 + \delta) + y(1 + \delta)](1 + \delta) \\
&= (x + y)(1 + \delta)^2 \\
&\approx (x + y)(1 + 2\delta)
\end{aligned}$$
Therefore, the relative error is bounded as follows:
$$\left|\frac{(x + y) - z}{(x + y)}\right| = \left|\frac{2\delta(x + y)}{(x + y)}\right| = |2\delta| \le 2^{-23}$$
Why is this calculation not correct?
Solution
The quantities δ that occur in such calculations are not, in general, equal to each other. The correct calculation is
$$\begin{aligned}
z &= \mathrm{fl}[\mathrm{fl}(x) + \mathrm{fl}(y)] \\
&= [x(1 + \delta_1) + y(1 + \delta_2)](1 + \delta_3) \\
&= [(x + y) + \delta_1 x + \delta_2 y](1 + \delta_3) \\
&= (x + y) + \delta_1 x + \delta_2 y + \delta_3 x + \delta_3 y + \delta_1\delta_3 x + \delta_2\delta_3 y \\
&\approx (x + y) + x(\delta_1 + \delta_3) + y(\delta_2 + \delta_3)
\end{aligned}$$
Therefore, the relative roundoff error is
$$\left|\frac{(x + y) - z}{(x + y)}\right| = \left|\frac{x(\delta_1 + \delta_3) + y(\delta_2 + \delta_3)}{(x + y)}\right| = \left|\frac{(x + y)\delta_3 + x\delta_1 + y\delta_2}{(x + y)}\right| = \left|\delta_3 + \frac{x\delta_1 + y\delta_2}{(x + y)}\right|$$
This cannot be bounded, because the second term has a denominator that can be zero or close to zero. Notice that if x and y are machine numbers, then $\delta_1$ and $\delta_2$ are zero, and a useful bound results, namely, $\delta_3$. But we do not need this calculation to know that! It has been assumed that when machine numbers are combined with any of the four arithmetic operations, the relative roundoff error will not exceed $2^{-24}$ in magnitude (on a 32-bit word-length computer). ■
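A concrete pair of inputs shows how badly the relative error can behave when x + y is close to zero. The numbers below are illustrative choices, not taken from the text; Python floats stand in for the real numbers x and y, and `fl32` is the same assumed helper that rounds to single precision through `struct`.

```python
import struct
from fractions import Fraction

U = Fraction(1, 2**24)


def fl32(x):
    """Round a Python float to the nearest single-precision machine number."""
    return struct.unpack("f", struct.pack("f", x))[0]


def relative_error_of_sum(x, y):
    """Exact relative error of z = fl[fl(x) + fl(y)] compared with x + y."""
    z = fl32(fl32(x) + fl32(y))
    exact = Fraction(x) + Fraction(y)
    return abs(exact - Fraction(z)) / abs(exact)


# x and y nearly cancel and are not single-precision machine numbers.
x, y = 1.0000001, -1.00000005
error_real_inputs = relative_error_of_sum(x, y)

# The same computation starting from machine numbers stays within u.
error_machine_inputs = relative_error_of_sum(fl32(x), fl32(y))

print(float(error_real_inputs), float(error_machine_inputs))
```

Here the errors made in rounding x and y are individually tiny, but they are comparable in size to the true sum, so the relative error exceeds 100 percent.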
Historical Notes
In the 1991 Gulf War, a failure of the Patriot missile defense system was the result of a software conversion error. The system clock measured time in tenths of a second, but it was stored as a 24-bit floating-point number, resulting in rounding errors. Field data had shown that the system would fail to track and intercept an incoming missile after being on for 20 consecutive hours and would need to be rebooted. After it had been on for 100 hours, a system failure resulted in the death of 28 American soldiers in a barracks in Dhahran, Saudi Arabia, because it failed to intercept an incoming Iraqi Scud missile. Since the number 0.1 has an infinite binary expansion, the value in the 24-bit register was in error by $(1.1001100\ldots)_2 \times 2^{-24} \approx 0.95 \times 10^{-7}$. The resulting time error was approximately thirty-four hundredths of a second after running for 100 hours.
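The arithmetic behind these two figures can be reproduced in exact rational arithmetic. The sketch rests on an assumption that is not stated in the text: the stored value of 0.1 is taken to be its binary expansion chopped after 23 bits to the right of the binary point, a choice made here only because it reproduces the quoted error of $(1.1001100\ldots)_2 \times 2^{-24}$.

```python
from fractions import Fraction

FRACTION_BITS = 23                      # assumed number of bits kept after the binary point

tenth = Fraction(1, 10)
stored = Fraction(int(tenth * 2**FRACTION_BITS), 2**FRACTION_BITS)  # chopped value of 0.1
error_per_tick = tenth - stored         # error in each tenth-of-a-second tick

ticks = 100 * 60 * 60 * 10              # tenths of a second in 100 hours
drift_seconds = error_per_tick * ticks

print(float(error_per_tick))            # about 0.95e-7
print(float(drift_seconds))             # about 0.34 seconds
```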
In 1996, the Ariane 5 rocket launched by the European Space Agency exploded 40 seconds after lift-off from Kourou, French Guiana. An investigation determined that the horizontal velocity required the conversion of a 64-bit floating-point number to a 16-bit signed integer. It failed because the number was larger than 32,767, which was the largest integer of this type that could be stored in memory. The rocket and its cargo were valued at $500 million.
Additional details about these disasters can be found by searching the World Wide Web. There are other interesting accounts of calamities that could have been averted by more careful computer programming, especially in using floating-point arithmetic.
Summary
(1) A single-precision floating-point number in a 32-bit word-length computer with standard floating-point representation is stored in a single word with the bit pattern
$$b_1\,b_2b_3 \cdots b_9\,b_{10}b_{11} \cdots b_{32}$$
which is interpreted as the real number
$$(-1)^{b_1} \times 2^{(b_2b_3 \ldots b_9)_2} \times 2^{-127} \times (1.b_{10}b_{11} \ldots b_{32})_2$$
(2) A double-precision floating-point number in a 32-bit word-length computer with standard floating-point representation is stored in two words with the bit pattern
$$b_1\,b_2b_3 \cdots b_{12}\,b_{13} \cdots b_{32}\;b_{33}b_{34}b_{35} \cdots b_{64}$$
which is interpreted as the real number
$$(-1)^{b_1} \times 2^{(b_2b_3 \ldots b_{12})_2} \times 2^{-1023} \times (1.b_{13}b_{14} \ldots b_{64})_2$$
(3) The relationship between a real number x and the floating-point machine number fl(x) can be written as
$$\mathrm{fl}(x) = x(1 + \delta) \qquad (|\delta| \le 2^{-24})$$
If ⊙ denotes any one of the arithmetic operations, then we write
$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \delta)$$
In these equations, δ depends on x and y.
Problems 2.1
1. Determine the machine representation in single precision on a 32-bit word-length computer for the following decimal numbers.
a. $2^{-30}$ b. 64.015625 ᵃc. $-8 \times 2^{-24}$
2. Determine the single-precision and double-precision machine representation in a 32-bit word-length computer of the following decimal numbers:
a. 0.5, −0.5 b. 0.125, −0.125 c. 0.0625, −0.0625 ᵃd. 0.03125, −0.03125
3. Which of these are machine numbers?
a. $10^{403}$ b. $1 + 2^{-32}$ c. $\frac{1}{5}$ d. $\frac{1}{10}$ e. $\frac{1}{256}$
4. Determine the single-precision and double-precision machine representation of the following decimal numbers:
a. 1.0, −1.0 b. +0.0, −0.0 c. −9876.54321 ᵃd. 0.234375
ᵃe. 492.78125 f. 64.37109375 g. −285.75 h. $10^{-2}$
5. Identify the floating-point numbers corresponding to the following bit strings:
a. 0 00000000 00000000000000000000000
b. 1 00000000 00000000000000000000000
c. 0 11111111 00000000000000000000000
ᵃd. 1 11111111 00000000000000000000000
e. 0 00000001 00000000000000000000000
f. 0 10000001 01100000000000000000000
g. 0 01111111 00000000000000000000000
h. 0 01111011 10011001100110011001100
6. What are the bit-string machine representations for the following subnormal numbers?
a. $2^{-127} + 2^{-128}$ b. $2^{-127} + 2^{-150}$ c. $2^{-127} + 2^{-130}$ d. $\sum_{k=127}^{150} 2^{-k}$
7. Determine the decimal numbers that have the following machine representations:
a. $[\mathrm{3F27E520}]_{16}$ b. $[\mathrm{3BCDCA00}]_{16}$ c. $[\mathrm{BF4F9680}]_{16}$ d. $[\mathrm{CB187ABC}]_{16}$
8. Determine the decimal numbers that have the following machine representations:
ᵃa. $[\mathrm{CA3F2900}]_{16}$ b. $[\mathrm{C705A700}]_{16}$ c. $[\mathrm{494F96A0}]_{16}$ ᵃd. $[\mathrm{4B187ABC}]_{16}$
e. $[\mathrm{45223000}]_{16}$ f. $[\mathrm{45607000}]_{16}$ ᵃg. $[\mathrm{C553E000}]_{16}$ h. $[\mathrm{437F0000}]_{16}$
9. Are these machine representations? Why or why not?
a. $[\mathrm{4BAB2BEB}]_{16}$ b. $[\mathrm{1A1AIA1A}]_{16}$ c. $[\mathrm{FADEDEAD}]_{16}$ d. $[\mathrm{CABE6G94}]_{16}$
10. The computer word associated with a variable appears as $[\mathrm{7F7FFFFF}]_{16}$, which is the largest representable floating-point single-precision number. What is the decimal value of this variable? The variable ε appears as $[\mathrm{00800000}]_{16}$, which is the smallest positive number. What is the decimal value of ε?
11. Enumerate the set of numbers in the floating-point number system that have binary representations of the form $\pm(0.b_1b_2) \times 2^k$, where
a. $k \in \{-1, 0\}$ b. $k \in \{-1, 1\}$ ᵃc. $k \in \{-1, 0, 1\}$
12. What are the machine numbers immediately to the right and left of $2^m$? How far is each from $2^m$?
13. Generally, when a list of floating-point numbers is added, less roundoff error will occur if the numbers are added in order of increasing magnitude. Give some examples to illustrate this principle.
14. (Continuation) The principle of the preceding problem is not universally valid. Consider a decimal machine with two decimal digits allocated to the mantissa. Show that the four numbers 0.25, 0.0034, 0.00051, and 0.061 can be added with less roundoff error if not added in ascending order.
ᵃ15. In the case of machine underflow, what is the relative error involved in replacing a number x by zero?
16. Consider a computer that operates in base β and carries n digits in the mantissa of its floating-point numbers. Show that the rounding of a real number x to the nearest machine number $\tilde{x}$ involves a relative error of at most $\frac{1}{2}\beta^{1-n}$. Hint: Imitate the argument in the text.
ᵃ17. Consider a decimal machine in which five decimal digits are allocated to the mantissa. Give an example, avoiding overflow or underflow, of a real number x whose closest machine number $\tilde{x}$ involves the greatest possible relative error.
ᵃ18. In a five-decimal machine that correctly rounds numbers to the nearest machine number, what real numbers x will have the property fl(1.0 + x) = 1.0?
ᵃ19. Consider a computer operating in base β. Suppose that it chops numbers instead of correctly rounding them. If its floating-point numbers have a mantissa of n digits, how large is the relative error in storing a real number in machine format?
20. What is the roundoff error when we represent $2^{-1} + 2^{-25}$ by a machine number? Note: This refers to absolute error, not relative error.
ᵃ21. (Continuation) What is the relative roundoff error when we round off $2^{-1} + 2^{-26}$ to get the closest machine number?
22. If x is a real number within the range of a 32-bit word-length computer that is rounded and stored, what can happen when $x^2$ is computed? Explain the difference between fl[fl(x)fl(x)] and fl(xx).
23. A binary machine that carries 30 bits in the fractional part of each floating-point number is designed to round a number up or down correctly to get the closest floating-point number. What simple upper bound can be given for the relative error in this rounding process?
24. A decimal machine that carries 15 decimal places in its floating-point numbers is designed to chop numbers. If x is a real number in the range of this machine and $\tilde{x}$ is its machine representation, what upper bound can be given for $|x - \tilde{x}|/|x|$?
ᵃ25. If x and y are real numbers within the range of a 32-bit word-length computer and if xy is also within the range, what relative error can there be in the machine computation of xy? Hint: The machine produces fl[fl(x)fl(y)].
ᵃ26. Let x and y be positive real numbers that are not machine numbers but are within the exponent range of a 32-bit word-length computer. What is the largest possible relative error in the machine representation of $x + y^2$? Include errors made to get the numbers in the machine as well as errors in the arithmetic.
27. Show that if x and y are positive real numbers that have the same first n digits in their decimal representations, then y approximates x with relative error less than $10^{1-n}$. Is the converse true?
28. Show that a rough bound on the relative roundoff error when n machine numbers are multiplied in a 32-bit word-length computer is $(n - 1)2^{-24}$.
29. Show that fl(x + y) = y on a 32-bit word-length computer if x and y are positive machine numbers and $x < y \times 2^{-25}$.
ᵃ30. If 1000 nonzero machine numbers are added in a 32-bit word-length computer, what upper bound can be given for the relative roundoff error in the result? How many decimal digits in the answer can be trusted?
31. Suppose that $x = \sum_{i=1}^{n} a_i 2^{-i}$, where $a_i \in \{-1, 0, 1\}$, is a positive number. Show that x can also be written in the form $\sum_{i=1}^{n} b_i 2^{-i}$, where $b_i \in \{0, 1\}$.
32. If x and y are machine numbers in a 32-bit word-length computer and if fl(x/y) = x/[y(1 + δ)], what upper bound can be placed on |δ|?
33. How big is the hole at zero in a 32-bit word-length computer?
34. How many machine numbers are there in a 32-bit word-length computer? (Consider only normalized floating-point numbers.)
35. How many normalized floating-point numbers are available in a binary machine if n bits are allocated to the mantissa and m bits are allocated to the exponent? Assume that two additional bits are used for signs, as in a 32-bit length computer.
36. Show by an example that in computer arithmetic a+(b+c) may differ from (a + b) + c.
a37. Consideradecimalmachineinwhichfloating-pointnumbershave13decimalplaces. Suppose that numbers are correctly rounded up or down to the nearest machine number. Give the best bound for the roundoff error, assuming no underflow or overflow. Use relative error, of course. What if the numbers are always chopped?
a38. Consider a computer that uses five-decimal-digit numbers. Let fl(x) denote the floating-point machine number closest to x. Show that if x=0.5321487513 and y = 0.53213 04421, then the operation fl(x) − fl(y) involves a large relative error. Compute it.
a39. Twonumbersxandythatarenotmachinenumbersarereadintoa32-bitword-length computer. The machine computes xy2. What sort of relative error can be expected? Assume no underflow or overflow.
40. Let x, y, and z be three machine numbers in a 32-bit word-length computer. By ana- lyzing the relative error in the worst case, determine how much roundoff error should be expected in forming (x y)z.
41. Let x and y be machine numbers in a 32-bit word-length computer. What relative roundoff error should be expected in the computation of x + y ? If x is around 30 and y is around 250, what absolute error should be expected in the computation of x + y ?
a42. Every machine number in a 32-bit word-length computer can be interpreted as the correct machine representation of an entire interval of real numbers. Describe this interval for the machine number q × 2m .
43. Is every machine number on a 32-bit word-length computer the average of two other machine numbers? If not, describe those that are not averages.
44. Let x and y be machine numbers in a 32-bit word-length computer. Let u and v be real numbers in the range of a 32-bit word-length computer but not machine numbers. Find a realistic upper bound on the relative roundoff error when u and v are read into the computer and then used to compute $(x + y)/(uv)$. As usual, ignore products of two or more numbers having magnitudes as small as $2^{-24}$. Assume that no overflow or underflow occurs in this calculation.
45. Interpret the following:
a. $\mathrm{fl}(x) = x(1 - \delta)$
b. $\mathrm{fl}(xy) = [x(1 + \delta)]y$
c. $\mathrm{fl}(xy) = x[y(1 + \delta)]$
d. $\mathrm{fl}(xy) = [x\sqrt{1 + \delta}][y\sqrt{1 + \delta}]$
e. $\mathrm{fl}(x/y) = x(1 + \delta)/y$
f. $\mathrm{fl}(x/y) = (x\sqrt{1 + \delta})/(y/\sqrt{1 + \delta})$
g. $\mathrm{fl}(x/y) \approx x/[y(1 - \delta)]$
46. Let x and y be real numbers that are not machine numbers for a 32-bit word-length computer and have to be rounded to get them into the machine. Assume that there is no overflow or underflow in getting their (rounded) values into the machine. (Thus, the numbers are within the range of a 32-bit word-length computer, although they are not machine numbers.) Find a rough upper bound on the relative error in computing $x^2 y^3$. Hint: We say rough upper bound because you may use $(1 + \delta_1)(1 + \delta_2) \approx 1 + \delta_1 + \delta_2$ and similar approximations. Be sure to include errors involved in getting the numbers into the machine as well as errors that arise from the arithmetic operations.
47. (Student Research Project) Write a research paper on the standard floating-point number system providing additional details on
a. types of rounding
b. subnormal floating-point numbers
c. extended precision
d. handling exceptional situations
1. Print several numbers, both integers and reals, in octal format and try to explain the machine representation used in your computer. For example, examine $(0.1)_{10}$ and compare to the results given at the beginning of this chapter.
2. Use your computer to construct a table of three functions f , g, and h defined as follows. For each integer n in the range 1 to 50, let f (n) = 1/n. Then g(n) is computed by adding f (n) to itself n − 1 times. Finally, set h(n) = n f (n). We want to see the effects of roundoff error in these computations. Use the function real(n) to convert an integer variable n to its real (floating-point) form. Print the table with all the precision of which your computer is capable (in single-precision mode).
3. Predict and then show what value your computer will print for $\sqrt{2}$ computed in single precision. Repeat for double or extended precision. Explain.
4. Write a program to determine the machine epsilon ε within a factor of 2 for single, double, and extended precision.
5. Let A denote the set of positive integers whose decimal representation does not contain the digit 0. The sum of the reciprocals of the elements in A is known to be 23.10345. Can you verify this numerically?
6. Write a computer code
integer function nDigit(n, x)
which returns the nth nonzero digit in the decimal expression for the real number x.
7. The harmonic series $1 + \tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4} + \cdots$ is known to diverge to +∞. The nth partial sum approaches +∞ at the same rate as ln(n). Euler's constant is defined to be
$\gamma = \lim_{n \to \infty} \left[ \sum_{k=1}^{n} \frac{1}{k} - \ln(n) \right] \approx 0.57721$
Computer Problems 2.1
If your computer ran a program for a week based on the pseudocode
    real s, x
    x ← 1.0; s ← 1.0
    repeat
        x ← x + 1.0; s ← s + 1.0/x
    end repeat
what is the largest value of s it would obtain? Write and test a program that uses a loop of 5000 steps to estimate Euler’s constant. Print intermediate answers at every 100 steps.
8. (Continuation) Prove that Euler's constant, γ, can also be represented by
$\gamma = \lim_{m \to \infty} \left[ \sum_{k=1}^{m} \frac{1}{k} - \ln\left(m + \tfrac{1}{2}\right) \right]$
Write and test a program that uses m = 1, 2, 3, . . . , 5000 to compute γ by this formula. The convergence should be more rapid than that in the preceding computer problem. (See the article by De Temple [1993].)
9. Determine the binary form of $\tfrac{1}{3}$. What is the correctly rounded machine representation in single precision on a 32-bit word-length computer? Check your answer on an actual machine with the instructions
x ← 1.0/3.0; output x
using a long format of 16 digits for the output statement.
10. Owing to its gravitational pull, the earth gains weight and volume slowly over time from space dust, meteorites, and comets. Suppose the earth is a sphere. Let the radius be $r_a = 7000$ kilometers at the beginning of the year 1900, and let $r_b$ be its radius at the end of the year 2000. Assume that $r_b = r_a + 0.000001$, an increase of 1 millimeter. Using a computer, calculate how much the earth's volume and surface area have increased during the last century by the following three procedures (exactly as given):
a. $V_a = \tfrac{4}{3}\pi r_a^3$, $V_b = \tfrac{4}{3}\pi r_b^3$, $\delta_1 = V_b - V_a$ (difference in spherical volume)
b. $\delta_2 = \tfrac{4}{3}\pi (r_b - r_a)(r_b^2 + r_b r_a + r_a^2)$ (difference in spherical volume)
c. $h = r_b - r_a$, $\delta_3 = 4\pi r_a^2 h$ (difference in spherical surface area)
First use single precision and then double precision. Compare and analyze your results.
(This problem was suggested by an anonymous reviewer.)
11. (Student Research Project) Explore recent developments in floating-point arithmetic. In particular, learn about extended precision for both real numbers and integers as well as for complex numbers.
12. What is the largest integer your computer can handle?
2.2 Loss of Significance
In this section, we show how loss of significance in subtraction can often be reduced or eliminated by various techniques, such as the use of rationalization, Taylor series, trigonometric identities, logarithmic properties, double precision, and/or range reduction. These are some of the techniques that can be used when one wants to guard against the degradation of precision in a calculation. Of course, we cannot always know when a loss of significance has occurred in a long computation, but we should be alert to the possibility and take steps to avoid it, if possible.
Significant Digits
We first address the elusive concept of significant digits in a number. Suppose that x is a real number expressed in normalized scientific notation in the decimal system
$x = \pm r \times 10^n \qquad \left(\tfrac{1}{10} \le r < 1\right)$
For example, x might be
$x = 0.37214\,98 \times 10^{-5}$
The digits 3, 7, 2, 1, 4, 9, 8 used to express r do not all have the same significance because they represent different powers of 10. Thus, we say that 3 is the most significant digit, and the significance of the digits diminishes from left to right. In this example, 8 is the least significant digit.
If x is a mathematically exact real number, then its approximate decimal form can be given with as many significant digits as we wish. Thus, we may write
$\dfrac{\pi}{10} \approx 0.31415\,92653\,58979$
and all the digits given are correct. If x is a measured quantity, however, the situation is quite different. Every measured quantity involves an error whose magnitude depends on the nature of the measuring device. Thus, if a meter stick is used, it is not reasonable to measure any length with precision better than 1 millimeter. Therefore, the result of measuring, say, a plate glass window with a meter stick should not be reported as 2.73594 meters. That would be misleading. Only digits that are believed to be correct or in error by at most a few units should be reported. It is a scientific convention that the least significant digit given in a measured quantity should be in error by at most five units; that is, the result is rounded correctly.
Similar remarks pertain to quantities computed from measured quantities. For example, if the side of a square is reported to be s = 0.736 meter, then one can assume that the error does not exceed a few units in the third decimal place. The diagonal of that square is then
$s\sqrt{2} \approx 0.10408\,61182 \times 10^1$
but should be reported as $0.1041 \times 10^1$ or (more conservatively) $0.104 \times 10^1$. The infinite precision available in $\sqrt{2}$,
$\sqrt{2} = 1.41421\,35623\,73095\ldots$
does not convey any more precision to $s\sqrt{2}$ than was already present in s.
Computer-Caused Loss of Significance
Perhaps it is surprising that a loss of significance can occur within the computer. It is essential to understand this process so that blind trust will not be placed in numerical output from a computer. One of the most common causes for a deterioration in precision is the subtraction of one quantity from another nearly equal quantity. This effect is potentially quite serious and can be catastrophic. The closer these two numbers are to each other, the more pronounced is the effect.
To illustrate this phenomenon, let us consider the assignment statement
    y ← x − sin(x)
and suppose that at some point in a computer program this statement is executed with an x value of $\tfrac{1}{15}$. Assume further that our computer works with floating-point numbers that have ten decimal digits. Then
    x ← $0.66666\,66667 \times 10^{-1}$
    sin(x) ← $0.66617\,29492 \times 10^{-1}$
    x − sin(x) ← $0.00049\,37175 \times 10^{-1}$
    x − sin(x) ← $0.49371\,75000 \times 10^{-4}$
In the last step, the result has been shifted to normalized floating-point form. Three zeros have then been supplied by the computer in the three least significant decimal places. We refer to these as spurious zeros; they are not significant digits. In fact, the ten-decimal-digit correct value is
$\tfrac{1}{15} - \sin\left(\tfrac{1}{15}\right) \approx 0.49371\,74327 \times 10^{-4}$
Another way of interpreting this is to note that the final digit in x − sin(x) is derived from the tenth digits in x and sin(x). When the eleventh digit in either x or sin(x) is 5, 6, 7, 8, or 9, the numerical values are rounded up to ten digits so that their tenth digits may be altered by plus one unit. Since these tenth digits may be in error, the final digit in x − sin(x) may also be in error, which it is!
EXAMPLE 1
If x = 0.37214 48693 and y = 0.37202 14371, what is the relative error in the computation of x − y in a computer that has five decimal digits of accuracy?
Solution
The numbers would first be rounded to $\tilde{x} = 0.37214$ and $\tilde{y} = 0.37202$. Then we have $\tilde{x} - \tilde{y} = 0.00012$, while the correct answer is x − y = 0.00012 34322. The relative error involved is
$\dfrac{|(x - y) - (\tilde{x} - \tilde{y})|}{|x - y|} = \dfrac{0.00000\,34322}{0.00012\,34322} \approx 3 \times 10^{-2}$
This magnitude of relative error must be judged quite large when compared with the relative errors of $\tilde{x}$ and $\tilde{y}$. (They cannot exceed $\tfrac{1}{2} \times 10^{-4}$ by the coarsest estimates, and in this example, they are, in fact, approximately $1.3 \times 10^{-5}$.) ■
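The arithmetic of Example 1 can be replayed on an ordinary machine. The following is only a sketch: it uses Python's standard `decimal` module to imitate a five-decimal-digit computer, and the helper name `fl` is an assumption introduced here for illustration, not notation from a library.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def fl(value, digits=5):
    # Hypothetical helper: round to `digits` significant decimal digits,
    # imitating storage in a machine with a `digits`-digit mantissa.
    return Context(prec=digits, rounding=ROUND_HALF_EVEN).plus(value)

x = Decimal("0.3721448693")
y = Decimal("0.3720214371")

exact = x - y                    # exact difference of the unrounded inputs
computed = fl(fl(x) - fl(y))     # operands are rounded before subtracting
rel_err = abs(exact - computed) / abs(exact)

print("computed :", computed)
print("exact    :", exact)
print("rel. err :", rel_err)
```

The relative error printed is several thousand times larger than the relative errors committed when x and y were rounded, which is the point of the example.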
It should be emphasized that this discussion pertains not to the operation
    fl(x − y) ← x − y
but rather to the operation
    fl[fl(x) − fl(y)] ← x − y
Roundoff error in the former case is governed by the equation
$\mathrm{fl}(x - y) = (x - y)(1 + \delta)$
where $|\delta| \le 2^{-24}$ on a 32-bit word-length computer, and on a five-decimal-digit computer in the example above $|\delta| \le \tfrac{1}{2} \times 10^{-4}$.
In Example 1, we observe that the computed difference of 0.00012 has only two significant figures of accuracy, whereas in general, one expects the numbers and calculations in this computer to have five significant figures of accuracy.
The remedy for this difficulty is first to anticipate that it may occur and then to reprogram. The simplest technique may be to carry out part of a computation in double- or extended-precision arithmetic (that means roughly twice as many significant digits), but often a slight change in the formulas is required. Several illustrations of this will be given, and the reader will find additional ones among the problems.
Consider Example 1, but imagine that the calculations to obtain x, y, and x − y are being done in double precision. Suppose that single-precision arithmetic is used thereafter. In the computer, all ten digits of x, y, and x − y will be retained, but at the end, x − y will be rounded to its five-digit form, which is $0.12343 \times 10^{-3}$. This answer has five significant digits of accuracy, as we would like. Of course, the programmer or analyst must know in advance where the double-precision arithmetic will be necessary in the computation. Programming everything in double precision is very wasteful if it is not needed. This approach has another drawback: There may be such serious cancellation of significant digits that even double precision might not help.
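The double-precision remedy just described can be sketched the same way. The code below assumes a decimal machine with five digits in single precision and ten in double precision, modeled with Python's `decimal` module; `fl` is again a hypothetical helper name.

```python
from decimal import Decimal, Context

def fl(value, digits):
    # Hypothetical helper: round to `digits` significant decimal digits.
    return Context(prec=digits).plus(value)

x = Decimal("0.3721448693")
y = Decimal("0.3720214371")

# Five digits throughout: the subtraction cancels the leading digits.
single = fl(fl(x, 5) - fl(y, 5), 5)

# Ten digits for the operands and the subtraction, five digits afterward.
mixed = fl(fl(x, 10) - fl(y, 10), 5)

print("single precision only :", single)
print("double, then rounded  :", mixed)
```

Only the subtraction is carried at the higher precision here, which mirrors the advice that the analyst should know in advance where the extra digits are needed.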
Theorem on Loss of Precision
Before considering other techniques for avoiding this problem, we ask the following question: Exactly how many significant binary digits are lost in the subtraction x − y when x is close to y? The closeness of x and y is conveniently measured by |1 − (y/x)|. Here is the result:
THEOREM 1
LOSS OF PRECISION THEOREM
Let x and y be normalized floating-point machine numbers, where x > y > 0. If $2^{-p} \le 1 - (y/x) \le 2^{-q}$ for some positive integers p and q, then at most p and at least q significant binary bits are lost in the subtraction x − y.
Proof
We prove the second part of the theorem and leave the first as an exercise. To this end, let $x = r \times 2^n$ and $y = s \times 2^m$, where $\tfrac{1}{2} \le r, s < 1$. (This is the normalized binary floating-point form.) Since y < x, the computer may have to shift y before carrying out the subtraction. In any case, y must first be expressed with the same exponent as x. Hence, $y = (s 2^{m-n}) \times 2^n$ and
$x - y = (r - s 2^{m-n}) \times 2^n$
The mantissa of this number satisfies the equations and inequality
$r - s 2^{m-n} = r\left(1 - \dfrac{s 2^m}{r 2^n}\right) = r\left(1 - \dfrac{y}{x}\right) < 2^{-q}$
Hence, to normalize the representation of x − y, a shift of at least q bits to the left is necessary. Then at least q (spurious) zeros are supplied on the right-hand end of the mantissa. This means that at least q bits of precision have been lost. ■
EXAMPLE 2
In the subtraction 37.593621 − 37.584216, how many bits of significance will be lost?
Solution
Let x denote the first number and y the second. Then
$1 - \dfrac{y}{x} = 0.00025\,01754$
This lies between $2^{-12}$ and $2^{-11}$. These two numbers are 0.000244 and 0.000488. Hence, at least 11 but not more than 12 bits are lost. ■
Here is an example in decimal form: let x = .6353 and y = .6311. These are close, and $1 - y/x = .00661 < 10^{-2}$. In the subtraction, we have x − y = .0042. There are two significant figures in the answer, although there were four significant figures in x and y.
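The bounds in the Theorem on Loss of Precision are easy to evaluate mechanically. The sketch below is one possible reading of the theorem: since $2^{-p} \le 1 - y/x \le 2^{-q}$ is equivalent to $q \le -\log_2(1 - y/x) \le p$, the tightest integers are the floor and the ceiling of that logarithm. The function name `bits_lost_bounds` is an assumption made for illustration.

```python
import math

def bits_lost_bounds(x, y):
    """Return (q, p): at least q and at most p bits are lost in x - y.

    Assumes x > y > 0, as in the theorem. The quantity 1 - y/x is itself
    computed in floating point, so this is an estimate, not a proof.
    """
    if not (x > y > 0):
        raise ValueError("requires x > y > 0")
    t = -math.log2(1.0 - y / x)      # q <= t <= p
    return math.floor(t), math.ceil(t)

# The numbers from Example 2.
print(bits_lost_bounds(37.593621, 37.584216))
```

A returned q of 0 simply means that the theorem guarantees no loss for that pair.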
Avoiding Loss of Significance in Subtraction
Now we take up various techniques that can be used to avoid the loss of significance that may occur in subtraction. Consider the function
$f(x) = \sqrt{x^2 + 1} - 1$        (1)
whose values may be required for x near zero. Since $\sqrt{x^2 + 1} \approx 1$ when x ≈ 0, we see that there is a potential loss of significance in the subtraction. However, the function can be rewritten in the form
$f(x) = \left(\sqrt{x^2 + 1} - 1\right) \dfrac{\sqrt{x^2 + 1} + 1}{\sqrt{x^2 + 1} + 1} = \dfrac{x^2}{\sqrt{x^2 + 1} + 1}$        (2)
by rationalizing the numerator; that is, removing the radical in the numerator. This procedure allows terms to be canceled and thereby removes the subtraction. For example, if we use five-decimal-digit arithmetic and if $x = 10^{-3}$, then f(x) will be computed incorrectly as zero by the first formula but as $\tfrac{1}{2} \times 10^{-6}$ by the second. If we use the first formula together with double precision, the difficulty is ameliorated but not circumvented altogether. For example, in double precision, we have the same problem when $x = 10^{-6}$.
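The difference between Formulas (1) and (2) is easy to observe. The following sketch assumes IEEE binary64 arithmetic (Python's `float`) rather than the decimal machine of the text, so the value of x at which Formula (1) collapses to zero is smaller; the function names are illustrative only.

```python
import math

def f_naive(x):
    # Formula (1): subtracts two nearly equal numbers when x is near zero.
    return math.sqrt(x * x + 1.0) - 1.0

def f_rationalized(x):
    # Formula (2): the subtraction has been removed algebraically.
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)

for x in (1e-3, 1e-6, 1e-9):
    print(f"x = {x:g}  naive = {f_naive(x):.17g}  rationalized = {f_rationalized(x):.17g}")
```

In binary64, $x^2 + 1$ rounds to exactly 1 once $x^2$ falls below about $10^{-16}$, so the naive form returns zero for $x = 10^{-9}$ while the rationalized form still returns a value close to $x^2/2$.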
As another example, suppose that the values of
$f(x) = x - \sin x$        (3)
are required near x = 0. A careless programmer might code this function just as indicated in Equation (3), not realizing that serious loss of accuracy will occur. Recall from calculus that
$\lim_{x \to 0} \dfrac{\sin x}{x} = 1$
to see that sin x ≈ x when x ≈ 0. One cure for this problem is to use the Taylor series for sin x:
$\sin x = x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} - \dfrac{x^7}{7!} + \cdots$
This series is known to represent sin x for all real values of x. For x near zero, it converges quite rapidly. Using this series, we can write the function f as
$f(x) = x - \left(x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} - \dfrac{x^7}{7!} + \cdots\right) = \dfrac{x^3}{3!} - \dfrac{x^5}{5!} + \dfrac{x^7}{7!} - \cdots$        (4)
We see in this equation where the original difficulty arose; namely, for small values of x, the term x in the sine series is much larger than $x^3/3!$ and thus more important. But when f(x) is formed, this dominant x term disappears, leaving only the lesser terms. The series that starts with $x^3/3!$ is very effective for calculating f(x) when x is small.
In this example, further analysis is needed to determine the range in which Series (4) should be used and the range in which Formula (3) can be used. Using the Theorem on Loss of Precision, we see that the loss of bits in the subtraction of Formula (3) can be limited to at most one bit by restricting x so that $\tfrac{1}{2} \le 1 - \sin x / x$. (Here we are considering only the case when sin x > 0.) With a calculator, it is easy to see that x must be at least 1.9. Thus, for |x| < 1.9, we use the first few terms in the Series (4), and for |x| ≥ 1.9, we use f(x) = x − sin x. One can verify that for the worst case (x = 1.9), ten terms in the series give f(x) with an error of at most $10^{-16}$. (That is good enough for double precision on a 32-bit word-length computer.)
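The crossover value quoted above can be checked numerically instead of with a calculator. The sketch below uses plain bisection on the condition $1 - \sin x / x = \tfrac{1}{2}$; the bracketing interval [1, 3] and the iteration count are choices made here for illustration.

```python
import math

def g(x):
    # g(x) >= 0 exactly when 1/2 <= 1 - sin(x)/x, i.e. at most one bit is lost.
    return 0.5 - math.sin(x) / x

lo, hi = 1.0, 3.0            # g(1.0) < 0 < g(3.0); sin(x)/x decreases on (0, pi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if g(mid) < 0.0:
        lo = mid
    else:
        hi = mid

threshold = 0.5 * (lo + hi)
print("at most one bit is lost for x >=", threshold)
```

The root lies slightly below 1.9, so rounding it up to 1.9 as the text does is a safe choice.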
To construct a function procedure for f(x), notice that the terms in the series can be obtained inductively by the algorithm
$t_1 = \dfrac{x^3}{6}$
$t_{n+1} = \dfrac{-t_n x^2}{(2n + 2)(2n + 3)} \qquad (n \ge 1)$
Then the partial sums can be obtained inductively by
$s_1 = t_1$
$s_{n+1} = s_n + t_{n+1} \qquad (n \ge 1)$
so that
$s_n = \sum_{k=1}^{n} t_k = \sum_{k=1}^{n} (-1)^{k+1} \dfrac{x^{2k+1}}{(2k + 1)!}$
Suitable pseudocode for a function is given here:
    real function f(x)
    integer i, n ← 10; real s, t, x
    if |x| ≥ 1.9 then
        s ← x − sin x
    else
        t ← x³/6
        s ← t
        for i = 2 to n do
            t ← −t x²/[(2i)(2i + 1)]
            s ← s + t
        end for
    end if
    f ← s
    end function f
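For readers who want to run the procedure, here is a sketch of the same idea in Python. It follows the recurrence $t_{n+1} = -t_n x^2/[(2n+2)(2n+3)]$ stated earlier, with the loop index chosen so that this formula can be used verbatim; the function name `x_minus_sin` and the parameter `n_terms` are assumptions introduced here.

```python
import math

def x_minus_sin(x, n_terms=10):
    """Evaluate f(x) = x - sin(x), using the Taylor series when |x| < 1.9."""
    if abs(x) >= 1.9:
        # At most about one bit is lost here, so subtract directly.
        return x - math.sin(x)
    t = x**3 / 6.0                   # t_1 = x^3 / 3!
    s = t                            # s_1 = t_1
    for n in range(1, n_terms):      # produces t_2, ..., t_{n_terms}
        t = -t * x * x / ((2 * n + 2) * (2 * n + 3))
        s += t
    return s

x = 1.0 / 15.0
print("series :", x_minus_sin(x))
print("direct :", x - math.sin(x))
```

For x = 1/15 the two printed values agree to many more digits than the ten-digit decimal machine of the text could deliver, because binary64 carries roughly sixteen decimal digits; the series form is still the more accurate of the two.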
EXAMPLE 3
How can accurate values of the function
$f(x) = e^x - e^{-2x}$
be computed in the vicinity of x = 0?
Solution
Since $e^x$ and $e^{-2x}$ are both equal to 1 when x = 0, there will be a loss of significance because of subtraction when x is close to zero. Inserting the appropriate Taylor series, we obtain
$f(x) = \left(1 + x + \dfrac{x^2}{2!} + \dfrac{x^3}{3!} + \cdots\right) - \left(1 - 2x + \dfrac{4x^2}{2!} - \dfrac{8x^3}{3!} + \cdots\right)$
$\phantom{f(x)} = 3x - \dfrac{3}{2}x^2 + \dfrac{3}{2}x^3 - \cdots$
An alternative is to write
$f(x) = e^{-2x}\left(e^{3x} - 1\right) = e^{-2x}\left(3x + \dfrac{9}{2!}x^2 + \dfrac{27}{3!}x^3 + \cdots\right)$
By using the Theorem on Loss of Precision, we find that at most one bit is lost in the subtraction $e^x - e^{-2x}$ when x > 0 and
$\dfrac{1}{2} \le 1 - \dfrac{e^{-2x}}{e^x}$
This inequality is valid when $x \ge \tfrac{1}{3}\ln 2 = 0.23105$. Similar reasoning when x < 0 shows that for x ≤ −0.23105, at most one bit is lost. Hence, the series should be used for |x| < 0.23105. ■
EXAMPLE 4
Criticize the assignment statement
    y ← cos²(x) − sin²(x)
Solution
When cos²(x) − sin²(x) is computed, there will be a loss of significance at x = π/4 (and other points). The simple trigonometric identity
$\cos 2\theta = \cos^2\theta - \sin^2\theta$
should be used. Thus, the assignment statement should be replaced by
    y ← cos(2x) ■
EXAMPLE 5
Criticize the assignment statement
    y ← ln(x) − 1
Solution
If the expression ln x − 1 is used for x near e, there will be a cancellation of digits and a loss of accuracy. One can use elementary facts about logarithms to overcome the difficulty. Thus, we have y = ln x − 1 = ln x − ln e = ln(x/e). Here is a suitable assignment statement:
    y ← ln(x/e) ■
Range Reduction
Another cause of loss of significant figures is the evaluation of various library functions with very large arguments. This problem is more subtle than the ones previously discussed. We illustrate with the sine function.
A basic property of the function sin x is its periodicity:
$\sin x = \sin(x + 2n\pi)$
for all real values of x and for all integer values of n. Because of this relationship, one needs to know only the values of sin x in some fixed interval of length 2π to compute sin x for arbitrary x. This property is used in the computer evaluation of sin x and is called range reduction.
Suppose now that we want to evaluate sin(12532.14). By subtracting integer multiples of 2π, we find that this equals sin(3.47) if we retain only two decimal digits of accuracy. From sin(12532.14) = sin(12532.14 − 2kπ), we want 12532 ≈ 2kπ, so that 2k ≈ 12532/π ≈ 3989 and k = 1994. Consequently, we obtain 12532.14 − 2(1994)π ≈ 3.47 and sin(12532.14) ≈ sin(3.47). Thus, although our original argument 12532.14 had seven significant figures, the reduced argument has only three. The remaining digits disappeared in the subtraction of 3988π. Since 3.47 has only three significant figures, our computed value of sin(12532.14) will have no more than three significant figures. This decrease in precision is unavoidable if there is no way of increasing the precision of the original argument. If the original argument (12532.14 in this example) can be obtained with more significant figures, these additional figures will be present in the reduced argument (3.47 in this example). In some cases, double- or extended-precision programming will help.
EXAMPLE 6
For sin x, how many binary bits of significance are lost in range reduction to the interval [0, 2π)?
Solution
Given an argument x > 2π, we determine an integer n that satisfies the inequality 0 ≤ x − 2nπ < 2π. Then in evaluating elementary trigonometric functions, we use f(x) = f(x − 2nπ). In the subtraction x − 2nπ, there will be a loss of significance. By the Theorem on Loss of Precision, at least q bits are lost if
$1 - \dfrac{2n\pi}{x} \le 2^{-q}$
Since
$1 - \dfrac{2n\pi}{x} = \dfrac{x - 2n\pi}{x} < \dfrac{2\pi}{x}$
we conclude that at least q bits are lost if $2\pi/x \le 2^{-q}$. Stated otherwise, at least q bits are lost if $2^q \le x/2\pi$. ■
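The bound from Example 6 can be evaluated for the argument used earlier in this subsection. The sketch below is illustrative only: the function name `bits_lost_in_reduction` is an assumption, and the reduction itself is done in binary64, which carries far more digits than the seven-figure argument.

```python
import math

def bits_lost_in_reduction(x):
    """Largest q with 2**q <= x / (2*pi); at least q bits are lost (x > 2*pi)."""
    if x <= 2.0 * math.pi:
        raise ValueError("requires x > 2*pi")
    return math.floor(math.log2(x / (2.0 * math.pi)))

x = 12532.14
n = math.floor(x / (2.0 * math.pi))     # whole periods to remove
reduced = x - 2.0 * n * math.pi         # argument reduced to [0, 2*pi)

print("periods removed :", n)
print("reduced argument:", reduced)
print("bits lost (>=)  :", bits_lost_in_reduction(x))
```

Ten binary bits correspond to roughly three decimal digits, which is in line with the drop from seven significant figures to about three described in the text.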
Summary
(1) To avoid loss of significance in subtraction, one may be able to reformulate the expression using rationalizing, series expansions, or mathematical identities.
(2) If x and y are positive normalized floating-point machine numbers with
$2^{-p} \le 1 - \dfrac{y}{x} \le 2^{-q}$
then at most p and at least q significant binary bits are lost in computing x − y. Note that it is permissible to leave out the hypothesis x > y here.
Additional References
For supplemental study and reading of material related to this chapter, see Appendix B as well as the following references: Acton [1996], Bornemann, Laurie, Wagon, and Waldvogel [2004], Goldberg [1991], Higham [2002], Hodges [1983], Kincaid and Cheney [2002], Overton [2001], Salamin [1976], Wilkinson [1963], and others listed in the Bibliography.
1. How can values of the function $f(x) = \sqrt{x + 4} - 2$ be computed accurately when x is small?
2. Calculate $f(10^{-2})$ for the function $f(x) = e^x - x - 1$. The answer should have five significant figures and can easily be obtained with pencil and paper. Contrast it with the straightforward evaluation of $f(10^{-2})$ using $e^{0.01} \approx 1.0101$.
3. What is a good way to compute values of the function $f(x) = e^x - e$ if full machine precision is needed? Note: There is some difficulty when x = 1.
a4. What difficulty could the following assignment cause?
    y ← 1 − sin x
Circumvent it without resorting to a Taylor series if possible.
Problems 2.2
5. The hyperbolic sine function is defined by $\sinh x = \tfrac{1}{2}(e^x - e^{-x})$. What drawback could there be in using this formula to obtain values of the function? How can values of sinh x be computed to full machine precision when $|x| \le \tfrac{1}{2}$?
a6. Determine the first two nonzero terms in the expansion about zero for the function
$f(x) = \dfrac{\tan x - \sin x}{x - \sqrt{1 + x^2}}$
Give an approximate value for f(0.0125).
7. Find a method for computing
    y ← (1/x)(sinh x − tanh x)
that avoids loss of significance when x is small. Find appropriate identities to solve this problem without using Taylor series.
a8. Find a way to calculate accurate values for
$f(x) = \dfrac{\sqrt{1 + x^2} - 1}{x^2} - \dfrac{x^2 \sin x}{x - \tan x}$
Determine $\lim_{x \to 0} f(x)$.
9. For some values of x, the assignment statement y ← 1 − cos x involves a difficulty. What is it, what values of x are involved, and what remedy do you propose?
a10. For some values of x, the function $f(x) = \sqrt{x^2 + 1} - x$ cannot be accurately computed by using this formula. Explain and find a way around the difficulty.
a11. The inverse hyperbolic sine is given by $f(x) = \ln\left(x + \sqrt{x^2 + 1}\right)$. Show how to avoid loss of significance in computing f(x) when x is negative. Hint: Find and exploit the relationship between f(x) and f(−x).
12. On most computers, a highly accurate routine for cos x is provided. It is proposed to base a routine for sin x on the formula $\sin x = \pm\sqrt{1 - \cos^2 x}$. From the standpoint of precision (not efficiency), what problems do you foresee and how can they be avoided if we insist on using the routine for cos x?
a13. Criticize and recode the assignment statement $z \leftarrow \sqrt{x^4 + 4} - 2$ assuming that z will sometimes be needed for an x close to zero.
14. How can values of the function $f(x) = \sqrt{x + 2} - \sqrt{x}$ be computed accurately when x is large?
15. Write a function that computes accurate values of $f(x) = \sqrt[4]{x + 4} - \sqrt[4]{x}$ for positive x.
a16. Find a way to calculate $f(x) = (\cos x - e^{-x})/\sin x$ correctly. Determine f(0.008) correctly to ten decimal places (rounded).
17. Without using series, how could the function
$f(x) = \dfrac{\sin x}{x - \sqrt{x^2 - 1}}$
be computed to avoid loss of significance?
18. Write a function procedure that returns accurate values of the hyperbolic tangent function
$\tanh x = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$
for all values of x. Notice the difficulty when $|x| < \tfrac{1}{2}$.
19. Find a good way to compute sin x + cos x − 1 for x near zero.
a20. Find a good way to compute arctan x − x for x near zero.
21. Find a good bound for |sin x − x| using Taylor series and assuming that $|x| < \tfrac{1}{10}$.
a22. How would you compute $(e^{2x} - 1)/(2x)$ to avoid loss of significance near zero?
23. For any $x_0 > -1$, the sequence defined recursively by
$x_{n+1} = 2^{n+1}\left[\sqrt{1 + 2^{-n} x_n} - 1\right] \qquad (n \ge 0)$
converges to $\ln(x_0 + 1)$. Arrange this formula in a way that avoids loss of significance.
24. Indicate how the following formulas may be useful for arranging computations to avoid loss of significant digits.
aa. $\sin x - \sin y = 2 \sin\tfrac{1}{2}(x - y) \cos\tfrac{1}{2}(x + y)$
b. $\log x - \log y = \log(x/y)$
c. $e^{x - y} = e^x / e^y$
d. $1 - \cos x = 2 \sin^2(x/2)$
e. $\arctan x - \arctan y = \arctan\left(\dfrac{x - y}{1 + xy}\right)$
25. What is a good way to compute tan x − x when x is near zero?
26. Find ways to compute these functions without serious loss of significant figures:
a. $e^x - \sin x - \cos x$
ab. $\ln(x) - 1$
c. $\log x - \log(1/x)$
ad. $x^{-2}(\sin x - e^x + 1)$
e. $x - \operatorname{arctanh} x$
27. Let
$a(x) = \dfrac{1 - \cos x}{\sin x} \qquad b(x) = \dfrac{\sin x}{1 + \cos x} \qquad c(x) = \dfrac{x}{2} + \dfrac{x^3}{24}$
Show that b(x) is identical to a(x) and that c(x) approximates a(x) in a neighborhood of zero.
a28. On your computer determine the range of x for which (sin x)/x ≈ 1 with full machine precision. Hint: Use Taylor series.
a29. Use of the familiar quadratic formula
$x = \dfrac{1}{2a}\left(-b \pm \sqrt{b^2 - 4ac}\right)$
will cause a problem when the quadratic equation $x^2 - 10^5 x + 1 = 0$ is solved with a machine that carries only eight decimal digits. Investigate the example, observe the difficulty, and propose a remedy. Hint: An example in the text is similar.
a30. When accurate values for the roots of a quadratic equation are desired, some loss of significance may occur if b2 ≈ 4ac. What (if anything) can be done to overcome this when writing a computer routine?
31. Refer to the discussion of the function f (x) = x − sin x given in the text. Show that when 0 < x < 1.9, there will be no undue loss of significance from subtraction in Equation (3).
32. Discuss the problem of computing $\tan(10^{100})$. (See Gleick [1992], p. 178.)
33. Let x and y be two normalized binary floating-point machine numbers. Assume that $x = q \times 2^n$, $y = r \times 2^{n-1}$, $\tfrac{1}{2} \le r, q < 1$, and $2q - 1 \ge r$. How much loss of significance occurs in subtracting x − y? Answer the same question when 2q − 1 < r. Observe that the Theorem on Loss of Precision is not strong enough to solve this problem precisely.
34. Prove the first part of the Theorem on Loss of Precision.
35. Show that if x is a machine number on a 32-bit computer that satisfies the inequality $x > \pi 2^{25}$, then sin x will be computed with no significant digits.
36. Let x and y be two positive normalized floating-point machine numbers in a 32-bit computer. Let $x = q \times 2^m$ and $y = r \times 2^n$ with $\tfrac{1}{2} \le r, q < 1$. Show that if n = m, then at least one bit of significance is lost in the subtraction x − y.
37. (Student Research Project) Read about and discuss the difference between cancellation error, a bad algorithm, and an ill-conditioned problem. Suggestion: One example involves the quadratic equation. Read Stewart [1996].
38. On a three-significant-digit computer, calculate $\sqrt{9.01} - 3.00$, with as much accuracy as possible.
a1. Write a routine for computing the two roots $x_1$ and $x_2$ of the quadratic equation $f(x) = ax^2 + bx + c = 0$ with real constants a, b, and c and for evaluating $f(x_1)$ and $f(x_2)$. Use formulas that reduce roundoff errors and write efficient code. Test your routine on the following (a, b, c) values: (0, 0, 1); (0, 1, 0); (1, 0, 0); (0, 0, 0); (1, 1, 0); (2, 10, 1); (1, −4, 3.99999); (1, −8.01, 16.004); $(2 \times 10^{17}, 10^{18}, 10^{17})$; and $(10^{-17}, -10^{17}, 10^{17})$.
2. (Continuation) Write and test a routine for solving a quadratic equation that may have complex roots.
3. Alter and test the pseudocode in the text for computing x − sin x by using nested multiplication to evaluate the series.
4. Write a routine for the function $f(x) = e^x - e^{-2x}$ using the examples in the text for guidance.
5. Write code using double or extended precision to evaluate $f(x) = \cos(10^4 x)$ on the interval [0, 1]. Determine how many significant figures the values of f(x) will have.
Computer Problems 2.2
6. Write a procedure to compute f(x) = sin x − 1 + cos x. The routine should produce nearly full machine precision for all x in the interval [0, π/4]. Hint: The trigonometric identity $\sin^2\theta = \tfrac{1}{2}(1 - \cos 2\theta)$ may be useful.
7. Write a procedure to compute $f(x, y) = \int_1^x t^y \, dt$ for arbitrary x and y. Note: Notice the exceptional case y = −1 and the numerical problem near the exceptional case.
8. Suppose that we wish to evaluate the function $f(x) = (x - \sin x)/x^3$ for values of x close to zero.
a. Write a routine for this function. Evaluate f(x) sixteen times. Initially, let x ← 1, and then let x ← (1/10)x fifteen times. Explain the results. Note: L'Hôpital's rule indicates that f(x) should tend to $\tfrac{1}{6}$. Test this code.
b. Write a function procedure that produces more accurate values of f(x) for all values of x. Test this code.
9. Write a program to print a table of the function $f(x) = 5 - \sqrt{25 + x^2}$ for x = 0 to 1 with steps of 0.01. Be sure that your program yields full machine precision, but do not program the problem in double precision. Explain the results.
a10. Write a routine that computes $e^x$ by summing n terms of the Taylor series until the (n + 1)st term t is such that $|t| < \varepsilon = 10^{-6}$. Use the reciprocal of $e^x$ for negative values of x. Test on the following data: 0, +1, −1, 0.5, −0.123, −25.5, −1776, 3.14159. Compute the relative error, the absolute error, and n for each case, using the exponential function on your computer system for the exact value. Sum no more than 25 terms.
11. (Continuation) The computation of $e^x$ can be reduced to computing $e^u$ for $|u| < (\ln 2)/2$ only. This algorithm removes powers of 2 and computes $e^u$ in a range where the series converges very rapidly. It is given by
$e^x = 2^m e^u$
where m and u are computed by the steps
    z ← x/ln 2; m ← integer(z ± 1/2)
    w ← z − m; u ← w ln 2
Here the minus sign is used if x < 0 because z < 0. Incorporate this range reduction technique into the code.
12. (Continuation) Write a routine that uses range reduction $e^x = 2^m e^u$ and computes $e^u$ from the even part of the Gaussian continued fraction; that is,
$e^u = \dfrac{s + u}{s - u} \qquad \text{where} \qquad s = 2 + u^2 \, \dfrac{2520 + 28u^2}{15120 + 420u^2 + u^4}$
Test on the data given in Computer Problem 2.2.10. Note: Some of the computer problems in this section contain rather complicated algorithms for computing various intrinsic functions that correspond to those actually used on a large mainframe computer system. Descriptions of these and other similar library functions are frequently found in the supporting documentation of your computer system.
13. Quite important in many numerical calculations is the accurate computation of the absolute value |z| of a complex number z = a + bi. Design and carry out a computer experiment to compare the following three schemes:
a. $|z| = (a^2 + b^2)^{1/2}$
b. $|z| = v\left[1 + \left(\dfrac{w}{v}\right)^2\right]^{1/2}$
c. $|z| = 2v\left[\dfrac{1}{4} + \left(\dfrac{w}{2v}\right)^2\right]^{1/2}$
where v = max{|a|, |b|} and w = min{|a|, |b|}. Use very small and large numbers for the experiment.
a14. For what range of x is the approximation $(e^x - 1)/2x \approx 0.5$ correct to 15 decimal digits of accuracy? Using this information, write a function procedure for $(e^x - 1)/2x$, producing 15 decimals of accuracy throughout the interval [−10, 10].
a15. In the theory of Fourier series, some numbers known as Lebesgue constants play a role. A formula for them is
$\rho_n = \dfrac{1}{2n + 1} + \dfrac{2}{\pi} \sum_{k=1}^{n} \dfrac{1}{k} \tan\dfrac{\pi k}{2n + 1}$
Write and run a program to compute $\rho_1, \rho_2, \ldots, \rho_{100}$ with eight decimal digits of accuracy. Then test the validity of the inequality
$0 \le \dfrac{4}{\pi^2} \ln(2n + 1) + 1 - \rho_n \le 0.0106$
16. Compute in double or extended precision the following number:
$x = \left[\dfrac{1}{\pi} \ln\left(640320^3 + 744\right)\right]^2$
What is the point of this problem? (See Good [1972].)
17. Write a routine to compute sin x for x in radians as follows. First, using properties of the sine function, reduce the range so that −π/2 ≤ x ≤ π/2. Then if $|x| < 10^{-8}$, set sin x ≈ x; if |x| > π/6, set u = x/3, compute sin u by the formula below, and then set $\sin x \approx [3 - 4\sin^2 u] \sin u$; if |x| ≤ π/6, set u = x and compute sin u as follows:
$\sin u \approx u \left[ \dfrac{1 - \dfrac{29593}{207636} u^2 + \dfrac{34911}{7613320} u^4 - \dfrac{479249}{11511339840} u^6}{1 + \dfrac{1671}{69212} u^2 + \dfrac{97}{351384} u^4 + \dfrac{2623}{1644477120} u^6} \right]$
Try to determine whether the sine function on your computer system uses this algorithm. Note: This is the Padé rational approximation for sine.
18. Write a routine to compute the natural logarithm by the algorithm outlined here based on telescoped rational and Gaussian continued fractions for ln x and test for several values of x. First check whether x = 1 and return zero if so. Reduce the range of x by determining n and r such that $x = r \times 2^n$ with $\tfrac{1}{2} \le r < 1$. Next, set $u = (r - \sqrt{2}/2)/(r + \sqrt{2}/2)$, and compute ln[(1 + u)/(1 − u)] by the approximation
$\ln\dfrac{1 + u}{1 - u} \approx u \, \dfrac{20790 - 21545.27u^2 + 4223.9187u^4}{10395 - 14237.635u^2 + 4778.8377u^4 - 230.41913u^6}$
which is valid for $|u| < 3 - 2\sqrt{2}$. Finally, set
$\ln x \approx \left(n - \tfrac{1}{2}\right) \ln 2 + \ln\dfrac{1 + u}{1 - u}$
19. Write a routine to compute the tangent of x in radians, using the algorithm below. Test the resulting routine over a range of values of x. First, the argument x is reduced to |x| ≤ π/2 by adding or subtracting multiples of π. If we have $0 \le |x| \le 1.7 \times 10^{-9}$, set tan x ≈ x. If |x| > π/4, set u = π/2 − x; otherwise, set u = x. Now compute the approximation
$$\tan u \approx u\left[\frac{135135 - 17336.106u^2 + 379.23564u^4 - 1.0118625u^6}{135135 - 62381.106u^2 + 3154.9377u^4 - 28.17694u^6}\right]$$
Finally, if |x| > π/4, set tan x ≈ 1/tan u; if |x| ≤ π/4, set tan x ≈ tan u. Note: This algorithm is obtained from the telescoped rational and Gaussian continued fraction for the tangent function.
20. Write a routine to compute arcsin x based on the following algorithm, using telescoped polynomials for the arcsine. If $|x| < 10^{-8}$, set arcsin x ≈ x. Otherwise, if $0 \le x \le \frac{1}{2}$, set u = x, a = 0, and b = 1; if $\frac{1}{2} < x \le \frac{1}{2}\sqrt{3}$, set $u = 2x^2 - 1$, a = π/4, and $b = \frac{1}{2}$; if $\frac{1}{2}\sqrt{3} < x \le \frac{1}{2}\sqrt{2 + \sqrt{3}}$, set $u = 8x^4 - 8x^2 + 1$, a = 3π/8, and $b = \frac{1}{4}$; if $\frac{1}{2}\sqrt{2 + \sqrt{3}} < x \le 1$, set $u = \sqrt{\frac{1}{2}(1 - x)}$, a = π/2, and b = −2. Now compute the approximation
$$\begin{aligned}\arcsin u \approx u\Big[&1.0 + \tfrac{1}{6}u^2 + 0.075u^4 + 0.04464286u^6 + 0.03038182u^8 \\ &+ 0.022375u^{10} + 0.01731276u^{12} + 0.01433124u^{14} \\ &+ 0.009342806u^{16} + 0.01835667u^{18} - 0.01186224u^{20} + 0.03162712u^{22}\Big]\end{aligned}$$
Finally, set arcsin x ≈ a + b arcsin u. Test this routine for various values of x.
21. Write and test a routine to compute arctan x for x in radians as follows. If $0 \le x \le 1.7 \times 10^{-9}$, set arctan x ≈ x. If $1.7 \times 10^{-9} < x \le 2 \times 10^{-2}$, use the series approximation
$$\arctan x \approx x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7}$$
Otherwise, set y = x, a = 0, and b = 1 if 0 ≤ x ≤ 1; set y = 1/x, a = π/2, and b = −1
√
d=tancif 2−1
in [a, c], but since wv < 0, f must have a root in [c, b]. In this case, we store the value of
c in a and w in u. In either case, the situation at the end of this step is just like that at the
beginning except that the final interval is half as large as the initial interval. This step can
now be repeated until the interval is satisfactorily small, say $|b - a| < \frac{1}{2} \times 10^{-6}$. At the
end, the best estimate of the root would be (a + b)/2, where [a, b] is the last interval in the procedure.
Now let us construct pseudocode to carry out this procedure. We shall not try to create a piece of high-quality software with many “bells and whistles,” but we will write the pseudocode in the form of a procedure for general use. This will afford the reader an opportunity to review how a main program and one or more procedures can be connected.
As a general rule, in programming routines to locate the roots of arbitrary functions, unnecessary evaluations of the function should be avoided because a given function may be costly to evaluate in terms of computer time. Thus, any value of the function that may be needed later should be stored rather than recomputed. A careless programming of the bisection method might violate this principle.
∗ A formal statement of the Intermediate-Value Theorem is as follows: If the function f is continuous on the closed interval [a, b], and if f(a) ≤ y ≤ f(b) or f(b) ≤ y ≤ f(a), then there exists a point c such that a ≤ c ≤ b and f(c) = y.
The procedure to be constructed will operate on an arbitrary function f. An interval [a, b] is also specified, and the number of steps to be taken, nmax, is given. Pseudocode to perform nmax steps in the bisection algorithm follows:
3.1 Bisection Method 79
procedure Bisection(f, a, b, nmax, ε)
integer n, nmax; real a, b, c, fa, fb, fc, error
fa ← f(a)
fb ← f(b)
if sign(fa) = sign(fb) then
    output a, b, fa, fb
    output “function has same signs at a and b”
    return
end if
error ← b − a
for n = 0 to nmax do
    error ← error/2
    c ← a + error
    fc ← f(c)
    output n, c, fc, error
    if |error| < ε then
        output “convergence”
        return
    end if
    if sign(fa) ≠ sign(fc) then
        b ← c
        fb ← fc
    else
        a ← c
        fa ← fc
    end if
end for
end procedure Bisection
Many modifications are incorporated to enhance the pseudocode. For example, we use fa, fb, fc as mnemonics for u, v, w, respectively. Also, we illustrate some techniques of structured programming and some other alternatives, such as a test for convergence. For example, if u, v, or w is close to zero, then uv or wu may underflow. Similarly, an overflow situation may arise. A test involving the intrinsic function sign could be used to avoid these difficulties, such as a test that determines whether sign(u) ≠ sign(v). Here, the iterations terminate if they exceed nmax or if the error bound (discussed later in this section) is less than ε. The reader should trace the steps in the routine to see that it does what is claimed.
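The Bisection pseudocode translates almost line for line into Python. The following is a minimal sketch rather than a definitive implementation: the function name `bisection`, the use of `math.copysign` as a stand-in for the intrinsic `sign`, and the returned tuple are assumptions made for illustration, and the output statements are omitted.

```python
import math

def bisection(f, a, b, nmax, eps):
    """Bisection on [a, b]; returns (c, n, error) for the last midpoint c.

    Raises ValueError when f(a) and f(b) have the same sign.
    """
    fa, fb = f(a), f(b)
    if math.copysign(1.0, fa) == math.copysign(1.0, fb):
        raise ValueError("function has same signs at a and b")
    error = b - a
    c = a
    for n in range(nmax + 1):          # n = 0, 1, ..., nmax
        error /= 2
        c = a + error                  # midpoint of the current interval
        fc = f(c)                      # the only new function evaluation
        if abs(error) < eps:
            return c, n, error         # convergence
        if math.copysign(1.0, fa) != math.copysign(1.0, fc):
            b, fb = c, fc              # root lies in [a, c]
        else:
            a, fa = c, fc              # root lies in [c, b]
    return c, nmax, error              # nmax steps used without meeting eps

if __name__ == "__main__":
    f = lambda x: x**3 - 3.0 * x + 1.0
    g = lambda x: x**3 - 2.0 * math.sin(x)
    print(bisection(f, 0.0, 1.0, 20, 0.5e-6))
    print(bisection(g, 0.5, 2.0, 20, 0.5e-6))
```

Comparing signs instead of testing a product such as fa·fc avoids the underflow and overflow issues mentioned above, and each pass through the loop costs exactly one evaluation of f.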
Examples
Now we want to illustrate how the bisection pseudocode can be used. Suppose that we have two functions, and for each, we seek a zero in a specified interval:
$f(x) = x^3 - 3x + 1$ on [0, 1]
$g(x) = x^3 - 2\sin x$ on [0.5, 2]
80 Chapter 3
Locating Roots of Equations
First, we write two procedure functions to compute f (x) and g(x). Then we input the initial intervals and the number of steps to be performed in a main program. Since this is a rather simple example, this information could be assigned directly in the main program or by way of statements in the subprograms rather than being read into the program. Also, depending on the computer language being used, an external or interface statement is needed to tell the compiler that the parameter f in the bisection procedure is not an ordinary variable with numerical values but the name of a function procedure defined externally to the main program. In this example, there would be two of these function procedures and two calls to the bisection procedure.
A call program or main program that calls the second bisection routine might be written as follows:
program Test Bisection
integer n, nmax ← 20
real a, b, ε ← (1/2) × 10^−6
external function f, g
a ← 0.0
b ← 1.0
call Bisection(f, a, b, nmax, ε)
a ← 0.5
b ← 2.0
call Bisection(g, a, b, nmax, ε)
end program Test Bisection

real function f(x)
real x
f ← x^3 − 3x + 1
end function f

real function g(x)
real x
g ← x^3 − 2 sin x
end function g
The computer results for the iterative steps of the bisection method for f (x):
n     c_n           f(c_n)           error
0     0.5           −0.375           0.5
1     0.25          0.266            0.25
2     0.375         −7.23 × 10^−2    0.125
3     0.3125        9.30 × 10^−2     6.25 × 10^−2
4     0.34375       9.37 × 10^−3     3.125 × 10^−2
.
19    0.34729 67    −9.54 × 10^−7    9.54 × 10^−7
20    0.34729 62    3.58 × 10^−7     4.77 × 10^−7
Also, the results for g(x) are as follows:
n     c_n           g(c_n)           error
0     1.25          5.52 × 10^−2     0.75
1     0.875         −0.865           0.375
2     1.0625        −0.548           0.188
3     1.15625       −0.285           9.38 × 10^−2
4     1.20312 5     −0.125           4.69 × 10^−2
.
19    1.23618 27    −4.88 × 10^−6    1.43 × 10^−6
20    1.23618 34    −2.15 × 10^−6    7.15 × 10^−7
To verify these results, we use built-in procedures in mathematical software such as Matlab, Mathematica, or Maple to find the desired roots of f and g to be 0.34729 63553 and 1.23618 3928, respectively. Since f is a polynomial, we can use a routine for finding numerical approximations to all the zeros of a polynomial function. However, when more complicated nonpolynomial functions are involved, there is generally no systematic procedure for finding all zeros. In this case, a routine can be used to search for zeros (one at a time), but we have to specify a point at which to start the search, and different starting points may result in the same or different zeros. It may be particularly troublesome to find all the zeros of a function whose behavior is unknown.
Convergence Analysis
Now let us investigate the accuracy with which the bisection method determines a root of a function. Suppose that f is a continuous function that takes values of opposite sign at the ends of an interval [a0, b0]. Then there is a root r in [a0, b0], and if we use the midpoint c0 = (a0 + b0)/2 as our estimate of r, we have
$$|r - c_0| \le \frac{b_0 - a_0}{2}$$
as illustrated in Figure 3.1. If the bisection algorithm is now applied and if the computed quantities are denoted by a0, b0, c0, a1, b1, c1 and so on, then by the same reasoning,
$$|r - c_n| \le \frac{b_n - a_n}{2} \qquad (n \ge 0)$$
Since the widths of the intervals are divided by 2 in each step, we conclude that
$$|r - c_n| \le \frac{b_0 - a_0}{2^{n+1}} \qquad (1)$$
FIGURE 3.1
Bisection method: Illustrating error upper bound
To summarize:
■ THEOREM 1
BISECTION METHOD THEOREM
If the bisection algorithm is applied to a continuous function f on an interval [a, b], where f(a) f(b) < 0, then, after n steps, an approximate root will have been computed with error at most $(b - a)/2^{n+1}$.
If an error tolerance has been prescribed in advance, it is possible to determine the number of steps required in the bisection method. Suppose that we want $|r - c_n| < \varepsilon$. Then it is necessary to solve the following inequality for n:
$$\frac{b - a}{2^{n+1}} < \varepsilon$$
By taking logarithms (with any convenient base), we obtain
$$n > \frac{\log(b - a) - \log(2\varepsilon)}{\log 2} \qquad (2)$$
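Inequality (2) is easy to evaluate numerically. The short Python sketch below (the function name `bisection_steps` is hypothetical) returns the smallest integer n that satisfies it; the two guard loops are an added precaution against rounding in the logarithms and check the original inequality $(b - a)/2^{n+1} < \varepsilon$ directly.

```python
import math

def bisection_steps(a, b, eps):
    """Smallest integer n >= 0 with (b - a) / 2**(n + 1) < eps."""
    bound = (math.log(b - a) - math.log(2.0 * eps)) / math.log(2.0)
    n = max(0, math.floor(bound) + 1)      # smallest integer greater than bound
    # Guard against rounding error in the logarithms.
    while n > 0 and (b - a) / 2.0 ** n < eps:
        n -= 1                             # n - 1 steps already suffice
    while (b - a) / 2.0 ** (n + 1) >= eps:
        n += 1                             # n steps are not yet enough
    return n

if __name__ == "__main__":
    print(bisection_steps(0.0, 1.0, 0.5e-6))
    print(bisection_steps(16.0, 17.0, 2.0 ** -20))
```

For the interval [0, 1] with ε = ½ × 10⁻⁶, this gives n = 20, which agrees with the table for f(x) shown earlier, where the error first drops below ε at step 20.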
EXAMPLE 1
How many steps of the bisection algorithm are needed to compute a root of f to full machine single precision on a 32-bit word-length computer if a = 16 and b = 17?
Solution
The root is between the two binary numbers $a = (10\,000.0)_2$ and $b = (10\,001.0)_2$. Thus, we already know five of the binary digits in the answer. Since we can use only 24 bits altogether, that leaves 19 bits to determine. We want the last one to be correct, so we want the error to be less than $2^{-19}$ or $2^{-20}$ (being conservative). Since a 32-bit word-length computer has a 24-bit mantissa, we can expect the answer to have an accuracy of only $2^{-20}$. From the equation above, we want $(b - a)/2^{n+1} < \varepsilon$. Since b − a = 1 and $\varepsilon = 2^{-20}$, we have $1/2^{n+1} < 2^{-20}$. Taking reciprocals gives $2^{n+1} > 2^{20}$, or n ≥ 20. Alternatively, we can use Equation (2), which in this case is
$$n > \frac{\log 1 - \log 2^{-19}}{\log 2}$$
Using a basic property of logarithms ($\log x^y = y \log x$), we find that n ≥ 20. In this example, each step of the algorithm determines the root with one additional binary digit of precision. ■
A sequence {x_n} exhibits linear convergence to a limit x if there is a constant C in the interval [0, 1) such that
$$|x_{n+1} - x| \le C\,|x_n - x| \qquad (n \ge 1) \qquad (3)$$
If this inequality is true for all n, then
$$|x_{n+1} - x| \le C\,|x_n - x| \le C^2\,|x_{n-1} - x| \le \cdots \le C^n\,|x_1 - x|$$
Thus, it is a consequence of linear convergence that
$$|x_{n+1} - x| \le A C^n \qquad (0 \le C < 1) \qquad (4)$$
The sequence produced by the bisection method obeys Inequality (4), as we see from Equation (1). However, the sequence need not obey Inequality (3).
The bisection method is the simplest way to solve a nonlinear equation f (x) = 0. It arrives at the root by constraining the interval in which a root lies, and it eventually makes the interval quite small. Because the bisection method halves the width of the interval at each step, one can predict exactly how long it will take to find the root within any desired degree of accuracy. In the bisection method, not every guess is closer to the root than the previous guess because the bisection method does not use the nature of the function itself. Often the bisection method is used to get close to the root before switching to a faster method.
False Position (Regula Falsi) Method and Modifications
The false position method retains the main feature of the bisection method: that a root is trapped in a sequence of intervals of decreasing size. Rather than selecting the midpoint of each interval, this method uses the point where the secant lines intersect the x-axis.
FIGURE 3.2
False position method
In Figure 3.2, the secant line over the interval [a, b] is the chord between (a, f(a)) and (b, f(b)). The two right triangles in the figure are similar, which means that
$$\frac{b - c}{f(b)} = \frac{c - a}{-f(a)}$$
It is easy to show that
$$c = b - f(b)\left[\frac{a - b}{f(a) - f(b)}\right] = a - f(a)\left[\frac{b - a}{f(b) - f(a)}\right] = \frac{a f(b) - b f(a)}{f(b) - f(a)}$$
We then compute f(c) and proceed to the next step with the interval [a, c] if f(a) f(c) < 0 or to the interval [c, b] if f(c) f(b) < 0.
In the general case, the false position method starts with the interval [a0, b0] containing a root: f(a0) and f(b0) are of opposite signs. The false position method uses intervals [a_k, b_k] that contain roots in almost the same way that the bisection method does. However, instead of finding the midpoint of the interval, it finds where the secant line joining (a_k, f(a_k)) and (b_k, f(b_k)) crosses the x-axis and then selects it to be the new endpoint.
At the kth step, it computes
$$c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}$$
If f (ak) and f (ck) have the same sign, then set ak+1 = ck and bk+1 = bk; otherwise, set ak+1 = ak and bk+1 = ck . The process is repeated until the root is approximated sufficiently well.
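The false position step can be sketched in Python as follows. This is an illustration only: the function name `false_position`, the default values, and the stopping test on |f(c)| are assumptions, since the text says only that the process is repeated until the root is approximated sufficiently well.

```python
def false_position(f, a, b, nmax=100, tol=1e-12):
    """False position on [a, b]; returns (c, k) for the last iterate c."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = a
    for k in range(nmax):
        # Point where the secant line through (a, fa) and (b, fb) crosses the x-axis.
        c = (a * fb - b * fa) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:          # assumed stopping criterion
            return c, k
        if fa * fc > 0:            # same sign at a and c: root is in [c, b]
            a, fa = c, fc
        else:                      # otherwise the root is in [a, c]
            b, fb = c, fc
    return c, nmax

if __name__ == "__main__":
    f = lambda x: x**3 - 3.0 * x + 1.0
    print(false_position(f, 0.0, 1.0))
```

On this example the endpoint a = 0 is retained at every step, because the function is convex on [0, 1], which is the one-sided behavior discussed next.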
For some functions, the false position method may repeatedly select the same endpoint, and the process may degrade to linear convergence. There are various approaches to rectify this. For example, when the same endpoint is to be retained twice, the modified false position method uses
$$c_k^{(m)} = \begin{cases} \dfrac{a_k f(b_k) - 2 b_k f(a_k)}{f(b_k) - 2 f(a_k)}, & \text{if } f(a_k) f(b_k) < 0 \\[2ex] \dfrac{2 a_k f(b_k) - b_k f(a_k)}{2 f(b_k) - f(a_k)}, & \text{if } f(a_k) f(b_k) > 0 \end{cases}$$
So rather than selecting points on the same side of the root as the regular false position method does, the modified false position method changes the slope of the straight line so that it is closer to the root. See Figure 3.3.
FIGURE 3.3
Modified false position method
The bisection method uses only the fact that f (a) f (b) < 0 for each new interval [a, b], but the false position method uses the values of f (a) and f (b). This is an example showing how one can include additional information in an algorithm to build a better one. In the next section, Newton’s method uses not only the function but also its first derivative.
Some variants of the modified false position procedure have superlinear convergence, which we discuss in Section 3.3. See, for example, Ford [1995]. Another modified false position method replaces the secant lines by straight lines with ever-smaller slope until the iterate falls to the opposite side of the root. (See Conte and de Boor [1980].) Early versions of the false position method date back to a Chinese mathematical text (200 B.C.E. to 100 C.E.) and an Indian mathematical text (3 B.C.E.).
Summary
(1) For finding a zero r of a given continuous function f in an interval [a, b], n steps of the
bisection method produce a sequence of intervals [a, b] = [a0, b0], [a1, b1], [a2, b2], . . . ,
[an , bn ] each containing the desired root of the function. The midpoints of these intervals
c0, c1, c2, . . . , cn form a sequence of approximations to the root, namely, $c_i = \frac{1}{2}(a_i + b_i)$.
On each interval [a_i, b_i], the error $e_i = r - c_i$ obeys the inequality
$$|e_i| \le \tfrac{1}{2}(b_i - a_i)$$
and after n steps we have
$$|e_n| \le \frac{1}{2^{n+1}}(b_0 - a_0)$$
(2) For an error tolerance ε such that $|e_n| < \varepsilon$, n steps are needed, where n satisfies the inequality
$$n > \frac{\log(b - a) - \log 2\varepsilon}{\log 2}$$
(3) For the kth step of the false position method over the interval [a_k, b_k], let
$$c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}$$
If $f(a_k) f(c_k) > 0$, set $a_{k+1} = c_k$ and $b_{k+1} = b_k$; otherwise, set $a_{k+1} = a_k$ and $b_{k+1} = c_k$.
Problems 3.1
a1. Find where the graphs of y = 3x and $y = e^x$ intersect by finding roots of $e^x - 3x = 0$ correct to four decimal digits.
2. Give a graphical demonstration that the equation tan x = x has infinitely many roots. Determine one root precisely and another approximately by using a graph. Hint: Use the approach of the preceding problem.
3. Demonstrate graphically that the equation 50π + sin x = 100 arctan x has infinitely many solutions.
a4. By graphical methods, locate approximations to all roots of the nonlinear equation ln(x + 1) + tan(2x) = 0.
5. Give an example of a function for which the bisection method does not converge linearly.
6. Draw a graph of a function that is discontinuous yet the bisection method converges. Repeat, getting a function for which it diverges.
7. Prove Inequality (1).
8. If a = 0.1 and b = 1.0, how many steps of the bisection method are needed to determine the root with an error of at most $\frac{1}{2} \times 10^{-8}$?
a9. Find all the roots of f(x) = cos x − cos 3x. Use two different methods.
a10. (Continuation) Find the root or roots of $\ln[(1 + x)/(1 - x^2)] = 0$.
11. If f has an inverse, then the equation f(x) = 0 can be solved by simply writing $x = f^{-1}(0)$. Does this remark eliminate the problem of finding roots of equations? Illustrate with sin x = 1/π.
a12. How many binary digits of precision are gained in each step of the bisection method? How many steps are required for each decimal digit of precision?
13. Try to devise a stopping criterion for the bisection method to guarantee that the root is determined with relative error at most ε.
14. Denote the successive intervals that arise in the bisection method by [a0, b0], [a1, b1], [a2, b2], and so on.
a. Show that $a_0 \le a_1 \le a_2 \le \cdots$ and that $b_0 \ge b_1 \ge b_2 \ge \cdots$.
b. Show that $b_n - a_n = 2^{-n}(b_0 - a_0)$.
c. Show that, for all n, $a_n b_n + a_{n-1} b_{n-1} = a_{n-1} b_n + a_n b_{n-1}$.
15. (Continuation) Can it happen that $a_0 = a_1 = a_2 = \cdots$?
16. (Continuation) Let $c_n = (a_n + b_n)/2$. Show that
$$\lim_{n\to\infty} c_n = \lim_{n\to\infty} a_n = \lim_{n\to\infty} b_n$$
a17. (Continuation) Consider the bisection method with the initial interval [a0, b0]. Show that after ten steps with this method,
$$\left|\tfrac{1}{2}(a_{10} + b_{10}) - \tfrac{1}{2}(a_9 + b_9)\right| = 2^{-11}(b_0 - a_0)$$
Also, determine how many steps are required to guarantee an approximation of a root to six decimal places (rounded).
18. (True–False) If the bisection method generates intervals [a0, b0], [a1, b1], and so on, which of these inequalities are true for the root r that is being calculated? Give proofs or counterexamples in each case.
a. $|r - a_n| \le 2|r - b_n|$
ab. $|r - a_n| \le 2^{-n-1}(b_0 - a_0)$
c. $|r - \frac{1}{2}(a_n + b_n)| \le 2^{-n-2}(b_0 - a_0)$
ad. $0 \le r - a_n \le 2^{-n}(b_0 - a_0)$
e. $|r - b_n| \le 2^{-n-1}(b_0 - a_0)$
19. (True–False) Using the notation of the text, determine which of these assertions are true and which are generally false:
aa. $|r - c_n| < |r - c_{n-1}|$
b. $a_n \le r \le c_n$
c. $c_n \le r \le b_n$
d. $|r - a_n| \le 2^{-n}$
ae. $|r - b_n| \le 2^{-n}(b_0 - a_0)$
20. Prove that $|c_n - c_{n+1}| = 2^{-n-2}(b_0 - a_0)$.
a21. If the bisection method is applied with starting interval [a, a + 1] and $a \ge 2^m$, where m ≥ 0, what is the correct number of steps to compute the root with full machine precision on a 32-bit word-length computer?
22. If the bisection method is applied with starting interval $[2^m, 2^{m+1}]$, where m is a positive or negative integer, how many steps should be taken to compute the root to full machine precision on a 32-bit word-length computer?
a23. Every polynomial of degree n has n zeros (counting multiplicities) in the complex plane. Does every real polynomial have n real zeros? Does every polynomial of infinite degree $f(x) = \sum_{n=0}^{\infty} a_n x^n$ have infinitely many zeros?
Computer Problems 3.1
1. Using the bisection method, determine the point of intersection of the curves given by $y = x^3 - 2x + 1$ and $y = x^2$.
2. Find a root of the following equation in the interval [0, 1] by using the bisection method: $9x^4 + 18x^3 + 38x^2 - 57x + 14 = 0$.
3. Find a root of the equation tan x = x on the interval [4, 5] by using the bisection method. What happens on the interval [1, 2]?
4. Find a root of the equation $6(e^x - x) = 6 + 3x^2 + 2x^3$ between −1 and +1 using the bisection method.
5. Use the bisection method to find a zero of the equation λ cosh(50/λ) = λ + 10 that begins this chapter.
6. Program the bisection method as a recursive procedure and test it on one or two of the examples in the text.
7. Use the bisection method to determine roots of these functions on the intervals indicated. Process all three functions in one computer run.
$f(x) = x^3 + 3x - 1$ on [0, 1]
$g(x) = x^3 - 2\sin x$ on [0.5, 2]
$h(x) = x + 10 - x\cosh(50/x)$ on [120, 130]
Find each root to full machine precision. Use the correct number of steps, at least approximately. Repeat using the false position method.
8. Test the three bisection routines on $f(x) = x^3 + 2x^2 + 10x - 20$, with a = 1 and b = 2. The zero is 1.36880 8108. In programming this polynomial function, use nested multiplication. Repeat using the modified false position method.
9. Write a program to find a zero of a function f in the following way: In each step, an interval [a, b] is given and f(a) f(b) < 0. Then c is computed as the root of the linear function that agrees with f at a and b. We retain either [a, c] or [c, b], depending on whether f(a) f(c) < 0 or f(c) f(b) < 0. Test your program on several functions.
a10. Select a routine from your program library to solve polynomial equations and use it to find the roots of the equation
$$x^8 - 36x^7 + 546x^6 - 4536x^5 + 22449x^4 - 67284x^3 + 118124x^2 - 109584x + 40320 = 0$$
The correct roots are the integers 1, 2, . . . , 8. Next, solve the same equation when the coefficient of $x^7$ is changed to −37. Observe how a minor perturbation in the coefficients can cause massive changes in the roots. Thus, the roots are unstable functions of the coefficients. (Be sure to program the problem to allow for complex roots.) Cultural Note: This is a simplified version of Wilkinson’s polynomial, which is found in Computer Problem 3.3.9.
a11. A circular metal shaft is being used to transmit power. It is known that at a certain critical angular velocity ω, any jarring of the shaft during rotation will cause the shaft to deform or buckle. This is a dangerous situation because the shaft might shatter under the increased centrifugal force. To find this critical velocity ω, we must first compute a number x that satisfies the equation
tan x + tanh x = 0
This number is then used in a formula to obtain ω. Solve for x (x > 0).
12. Using built-in routines in mathematical software systems such as Matlab, Mathematica, or Maple, find the roots for $f(x) = x^3 - 3x + 1$ on [0, 1] and $g(x) = x^3 - \sin x$ on [0.5, 2] to more digits of accuracy than shown in the text.
13. (Engineering problem) Nonlinear equations occur in almost all fields of engineering. For example, suppose a given task is expressed in the form f(x) = 0 and the objective is to find values of x that satisfy this condition. It is often difficult to find an explicit solution and an approximate solution is sought with the aid of mathematical software. Find a solution of
$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-(1/2)x^2} + \frac{1}{10}\sin(\pi x)$$
Plot the curve in the range [−3.5, 3.5] for x values and [−0.5, 0.5] for y = f(x) values.
14. (Circuit problem) A simple circuit with resistance R, capacitance C in series with a battery of voltage V is given by Q = CV[1 − e−T/(RC)], where Q is the charge of the capacitor and T is the time needed to obtain the charge. We wish to solve for the unknown C. For example, solve this problem
$$f(x) = 10x\left[1 - e^{-0.004/(2000x)}\right] - 0.00001$$
Plot the curve. Hint: You may wish to magnify the vertical scale by using $y = 10^5 f(x)$.
15. (Engineering polynomials) Equations such as $A + Bx^2 e^{Cx} = 0$ and $A + Bx + Cx^2 + Dx^3 + Ex^4 = 0$ occur in engineering problems. Using mathematical software, find one or more solutions to the following equations and plot their curves:
a. $2 - x^2 e^{-0.385x} = 0$
b. $1 - 32x + 160x^2 - 256x^3 + 128x^4 = 0$
16. (Reinforced concrete) In the design of reinforced concrete with regard to stress, one
needs to solve numerically a quadratic equation such as
$$2414707.2x\,[450 - 0.822x(225)] - 265{,}000{,}000 = 0$$
Find approximate values of the roots.
17. (Board in hall problem) In a building, two intersecting halls with widths w1 = 9 feet and w2 = 7 feet meet at an angle α = 125°, as shown:
[Figure: two intersecting halls of widths w1 and w2]
Assuming a two-dimensional situation, what is the longest board that can negotiate the turn? Ignore the thickness of the board. The relationship between the angles θ and the length of the board l = l1 + l2 is l1 = w1 csc(β), l2 = w2 csc(γ ), β = π − α − γ and l = w1 csc(π − α − γ) + w2 csc(γ). The maximum length of the board that can make the turn is found by minimizing l as a function of γ . Taking the derivative and setting dl/dγ = 0, we obtain
w1 cot(π − α − γ ) csc(π − α − γ ) − w2 cot(γ ) csc(γ ) = 0
Substitute in the known values and numerically solve the nonlinear equation. This
problem is similar to an example in Gerald and Wheatley [1999].
18. Find the rectangle of maximum area if its vertices are at (0, 0), (x, 0), (x, cos x), (0, cos x). Assume that 0 ≤ x ≤ π/2.
19. Program the false position algorithm and test it on some examples such as some of the nonlinear problems in the text or in the computer problems. Compare your results with those given for the bisection method.
20. Program the modified false position method, test it, and compare it to the false position method when using some sample functions.
3.2 Newton’s Method
The procedure known as Newton’s method is also called the Newton-Raphson iteration. It has a more general form than the one seen here, and the more general form can be used to find roots of systems of equations. Indeed, it is one of the more important procedures in numerical analysis, and its applicability extends to differential equations and integral equations. Here it is being applied to a single equation of the form f(x) = 0. As before, we seek one or more points at which the value of the function f is zero.
Interpretations of Newton’s Method
In Newton’s method, it is assumed at once that the function f is differentiable. This implies that the graph of f has a definite slope at each point and hence a unique tangent line. Now let us pursue the following simple idea. At a certain point (x0, f (x0)) on the graph of f , there is a tangent, which is a rather good approximation to the curve in the vicinity of that point. Analytically, it means that the linear function
$$l(x) = f'(x_0)(x - x_0) + f(x_0)$$
is close to the given function f near x0. At x0, the two functions l and f agree. We take the zero of l as an approximation to the zero of f. The zero of l is easily found:
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$$
Thus, starting with point x0 (which we may interpret as an approximation to the root sought), we pass to a new point x1 obtained from the preceding formula. Naturally, the process can be repeated (iterated) to produce a sequence of points:
$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}, \qquad x_3 = x_2 - \frac{f(x_2)}{f'(x_2)}, \qquad \text{etc.}$$
Under favorable conditions, the sequence of points will approach a zero of f.
The geometry of Newton’s method is shown in Figure 3.4. The line y = l(x) is tangent to the curve y = f(x). It intersects the x-axis at a point x1. The slope of l(x) is f′(x0).
FIGURE 3.4
Newton’s method
There are other ways of interpreting Newton’s method. Suppose again that x0 is an initial approximation to a root of f. We ask: What correction h should be added to x0 to obtain the root precisely? Obviously, we want
$$f(x_0 + h) = 0$$
If f is a sufficiently well-behaved function, it will have a Taylor series at x0 [see Equation (11) in Section 1.2]. Thus, we could write
$$f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \cdots = 0$$
Determining h from this equation is, of course, not easy. Therefore, we give up the expectation of arriving at the true root in one step and seek only an approximation to h. This can be obtained by ignoring all but the first two terms in the series:
$$f(x_0) + h f'(x_0) = 0$$
The h that solves this is not the h that solves f(x0 + h) = 0, but it is the easily computed number
$$h = -\frac{f(x_0)}{f'(x_0)}$$
Our new approximation is then
$$x_1 = x_0 + h = x_0 - \frac{f(x_0)}{f'(x_0)}$$
and the process can be repeated. In retrospect, we see that the Taylor series was not needed after all because we used only the first two terms. In the analysis to be given later, it is assumed that f″ is continuous in a neighborhood of the root. This assumption enables us to estimate the errors in the process.
If Newton’s method is described in terms of a sequence x0, x1, . . . , then the following recursive or inductive definition applies:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
Naturally, the interesting question is whether
$$\lim_{n\to\infty} x_n = r$$
where r is the desired root.
EXAMPLE 1
If $f(x) = x^3 - x + 1$ and x0 = 1, what are x1 and x2 in the Newton iteration?
Solution
From the basic formula, $x_1 = x_0 - f(x_0)/f'(x_0)$. Now $f'(x) = 3x^2 - 1$, and so f′(1) = 2. Also, we find f(1) = 1. Hence, we have $x_1 = 1 - \frac{1}{2} = \frac{1}{2}$. Similarly, we obtain $f\left(\frac{1}{2}\right) = \frac{5}{8}$, $f'\left(\frac{1}{2}\right) = -\frac{1}{4}$, and x2 = 3. ■
Pseudocode
A pseudocode for Newton’s method can be written as follows:
procedure Newton(f, f′, x, nmax, ε, δ)
integer n, nmax; real x, fx, fp, ε, δ
external function f, f′
fx ← f(x)
output 0, x, fx
for n = 1 to nmax do
    fp ← f′(x)
    if |fp| < δ then
        output “small derivative”
        return
    end if
    d ← fx/fp
    x ← x − d
    fx ← f(x)
    output n, x, fx
    if |d| < ε then
        output “convergence”
        return
    end if
end for
end procedure Newton
Using the initial value of x as the starting point, we carry out a maximum of nmax iterations of Newton’s method. Procedures must be supplied for the external functions f(x) and f′(x). The parameters ε and δ are used to control the convergence and are related to the accuracy desired or to the machine precision available.
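The Newton pseudocode carries over directly to Python. The sketch below is illustrative only: the function name `newton`, the returned status strings, and the `history` list (which replaces the output statements) are assumptions rather than part of the text.

```python
def newton(f, fprime, x, nmax, eps, delta):
    """Newton's method; returns (x, history, status).

    history holds the tuples (n, x_n, f(x_n)) that the pseudocode outputs.
    """
    fx = f(x)
    history = [(0, x, fx)]
    for n in range(1, nmax + 1):
        fp = fprime(x)
        if abs(fp) < delta:
            return x, history, "small derivative"
        d = fx / fp                 # Newton correction
        x = x - d
        fx = f(x)
        history.append((n, x, fx))
        if abs(d) < eps:
            return x, history, "convergence"
    return x, history, "nmax reached"

if __name__ == "__main__":
    # Sample cubic in nested form, f(x) = x^3 - 2x^2 + x - 3, and its derivative.
    f = lambda x: ((x - 2.0) * x + 1.0) * x - 3.0
    fp = lambda x: (3.0 * x - 4.0) * x + 1.0
    root, history, status = newton(f, fp, 3.0, 20, 1e-12, 1e-12)
    for row in history:
        print(row)
    print(status, root)
```

The test on |fp| guards against division by a derivative that is nearly zero, and the test on |d| stops the iteration once the correction is smaller than the requested tolerance.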
Illustration
Now we illustrate Newton’s method by locating a root of $x^3 + x = 2x^2 + 3$. We apply the method to the function $f(x) = x^3 - 2x^2 + x - 3$, starting with x0 = 3. Of course, $f'(x) = 3x^2 - 4x + 1$, and these two functions should be arranged in nested form for efficiency:
$$f(x) = ((x - 2)x + 1)x - 3 \qquad f'(x) = (3x - 4)x + 1$$
To see in greater detail the rapid convergence of Newton’s method, we use arithmetic with double the normal precision in the program and obtain the following results:
n    x_n                        f(x_n)
0    3.0                        9.0
1    2.4375                     2.04
2    2.21303 27224 73144 5      0.256
3    2.17555 49386 14368 4      6.46 × 10^−3
4    2.17456 01006 55071 4      4.48 × 10^−6
5    2.17455 94102 93284 1      1.97 × 10^−12
FIGURE 3.5
Three steps of Newton’s method, $f(x) = x^3 - 2x^2 + x - 3$
Notice the doubling of the accuracy in f (x) (and also in x) until the maximum precision of the computer is encountered. Figure 3.5 shows a computer plot of three iterations of Newton’s method for this sample problem.
Using mathematical software that allows for complex roots such as in Matlab, Maple, or Mathematica, we find that the polynomial has a single real root, 2.17456, and a pair of complex conjugate roots, −0.0872797 ± 1.17131i .
Convergence Analysis
Anyone who has experimented with Newton’s method—for instance, by working some of the problems in this section—will have observed the remarkable rapidity in the convergence of the sequence to the root. This phenomenon is also noticeable in the example just given. Indeed, the number of correct figures in the answer is nearly doubled at each successive step. Thus in the example above, we have first 0 and then 1, 2, 3, 6, 12, 24, . . . accurate digits from each Newton iteration. Five or six steps of Newton’s method often suffice to yield full machine precision in the determination of a root. There is a theoretical basis for this dramatic performance, as we shall now see.
Let the function f, whose zero we seek, possess two continuous derivatives f′ and f′′, and let r be a zero of f. Assume further that r is a simple zero; that is, f′(r) ≠ 0. Then Newton’s method, if started sufficiently close to r, converges quadratically to r. This means that the errors in successive steps obey an inequality of the form
\[ |r - x_{n+1}| \le c\,|r - x_n|^2 \]
We shall establish this fact presently, but first, an informal interpretation of the inequality may be helpful.
Suppose, for simplicity, that c = 1. Suppose also that xn is an estimate of the root r that differs from it by at most one unit in the kth decimal place. This means that
\[ |r - x_n| \le 10^{-k} \]
The two inequalities above imply that
\[ |r - x_{n+1}| \le 10^{-2k} \]
In other words, xn+1 differs from r by at most one unit in the (2k)th decimal place. So xn+1 has approximately twice as many correct digits as xn! This is the doubling of significant digits alluded to previously.
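The doubling of correct digits can be observed directly by carrying more digits than double precision offers. The sketch below uses Python's standard `decimal` module with an assumed working precision of 80 digits and takes a long-run iterate as the reference root; both choices are illustrative.

```python
from decimal import Decimal, getcontext

getcontext().prec = 80                     # work with 80 significant digits

def f(x):
    return ((x - 2) * x + 1) * x - 3       # x^3 - 2x^2 + x - 3

def fprime(x):
    return (3 * x - 4) * x + 1

def newton_iterates(x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - f(xs[-1]) / fprime(xs[-1]))
    return xs

xs = newton_iterates(Decimal(3), 12)
r = xs[-1]                                 # reference root, converged to working precision
errors = [abs(r - x) for x in xs[:8]]
for n, e in enumerate(errors):
    print(n, f"{e:.3e}")                   # the exponent roughly doubles from step to step
```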
■ THEOREM 1 (NEWTON’S METHOD THEOREM)
If f, f′, and f′′ are continuous in a neighborhood of a root r of f and if f′(r) ≠ 0, then there is a positive δ with the following property: If the initial point in Newton’s method satisfies |r − x0| ≤ δ, then all subsequent points xn satisfy the same inequality, converge to r, and do so quadratically; that is,
\[ |r - x_{n+1}| \le c(\delta)\,|r - x_n|^2 \]
where c(δ) is given by Equation (2) below.
Proof
To establish the quadratic convergence of Newton’s method, let en = r − xn . The formula that defines the sequence {xn} then gives
\[ e_{n+1} = r - x_{n+1} = r - x_n + \frac{f(x_n)}{f'(x_n)} = e_n + \frac{f(x_n)}{f'(x_n)} = \frac{e_n f'(x_n) + f(x_n)}{f'(x_n)} \]
By Taylor’s Theorem (see Section 1.2), there exists a point ξn situated between xn and r for which
\[ 0 = f(r) = f(x_n + e_n) = f(x_n) + e_n f'(x_n) + \tfrac{1}{2} e_n^2 f''(\xi_n) \]
(The subscript on ξn emphasizes the dependence on xn.) This last equation can be rearranged to read
\[ e_n f'(x_n) + f(x_n) = -\tfrac{1}{2} e_n^2 f''(\xi_n) \]
and if this is used in the previous equation for $e_{n+1}$, the result is
\[ e_{n+1} = -\frac{1}{2}\,\frac{f''(\xi_n)}{f'(x_n)}\,e_n^2 \tag{1} \]
This is, at least qualitatively, the sort of equation we want. Continuing the analysis, we define a function
\[ c(\delta) = \frac{1}{2}\,\frac{\max_{|x-r|\le\delta} |f''(x)|}{\min_{|x-r|\le\delta} |f'(x)|} \qquad (\delta > 0) \tag{2} \]
By virtue of this definition, we can assert that, for any two points x and ξ within distance δ of the root r, the inequality $\tfrac{1}{2}|f''(\xi)/f'(x)| \le c(\delta)$ is true. Now select δ so small that $\delta c(\delta) < 1$. This is possible because as δ approaches 0, c(δ) converges to $\tfrac{1}{2}|f''(r)/f'(r)|$, and so δc(δ) converges to 0. Recall that we assumed that f′(r) ≠ 0. Let ρ = δc(δ). In the remainder of this argument, we hold δ, c(δ), and ρ fixed with ρ < 1.
Suppose now that some iterate xn lies within distance δ from the root r. We have
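\[ |e_n| = |r - x_n| \le \delta \quad\text{and}\quad |\xi_n - r| \le \delta \]
By the definition of c(δ), it follows that $\tfrac{1}{2}|f''(\xi_n)|/|f'(x_n)| \le c(\delta)$. From Equation (1), we now have
\[ |e_{n+1}| = \frac{1}{2}\left|\frac{f''(\xi_n)}{f'(x_n)}\right| e_n^2 \le c(\delta)\,e_n^2 \le \delta c(\delta)\,|e_n| = \rho\,|e_n| \]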
Consequently, xn+1 is also within distance δ of r because
\[ |r - x_{n+1}| = |e_{n+1}| \le \rho\,|e_n| \le |e_n| \le \delta \]
If the initial point x0 is chosen within distance δ of r, then
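\[ |e_n| \le \rho\,|e_{n-1}| \le \rho^2 |e_{n-2}| \le \cdots \le \rho^n |e_0| \]
Since 0 < ρ < 1, $\lim_{n\to\infty} \rho^n = 0$ and $\lim_{n\to\infty} e_n = 0$. In other words, we obtain
\[ \lim_{n\to\infty} x_n = r \]
In this process, we have $|e_{n+1}| \le c(\delta)\,e_n^2$. ■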
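Equation (1) suggests a simple numerical check: the ratio $e_{n+1}/e_n^2$ should approach $-\tfrac{1}{2} f''(r)/f'(r)$ as the iterates approach r. The sketch below tests this on the earlier example $f(x) = x^3 - 2x^2 + x - 3$ in double precision; the reference root is taken from a long Newton run, which is an assumption of convenience rather than part of the proof.

```python
def f(x):
    return ((x - 2.0) * x + 1.0) * x - 3.0

def fprime(x):
    return (3.0 * x - 4.0) * x + 1.0

def fsecond(x):
    return 6.0 * x - 4.0

# Reference root r obtained by iterating until the iterates stop changing.
r = 3.0
for _ in range(50):
    r = r - f(r) / fprime(r)

x = 3.0
ratios = []
for n in range(4):
    x_next = x - f(x) / fprime(x)
    e_n, e_next = r - x, r - x_next        # errors e_n = r - x_n as in the proof
    ratios.append(e_next / e_n**2)
    x = x_next

limit = -0.5 * fsecond(r) / fprime(r)
print(ratios, limit)
```

Only the first few ratios are meaningful in double precision; once the error reaches rounding level, the quotient is dominated by roundoff.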
In the use of Newton’s method, consideration must be given to the proper choice of a starting point. Usually, one must have some insight into the shape of the graph of the function. Sometimes a coarse graph is adequate, but in other cases, a step-by-step evaluation of the function at various points may be necessary to find a point near the root. Often several steps of the bisection method are used initially to obtain a suitable starting point, and Newton’s method is then used to improve the precision.
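One way to combine the two methods is sketched below: a few bisection steps shrink a coarse bracket, and the midpoint of the final bracket starts Newton's method. The bracket [0, 4], the number of bisection steps, and the helper name `bisect_steps` are assumptions chosen for the earlier example.

```python
def f(x):
    return ((x - 2.0) * x + 1.0) * x - 3.0

def fprime(x):
    return (3.0 * x - 4.0) * x + 1.0

def bisect_steps(f, a, b, steps):
    """Halve the bracket [a, b] a fixed number of times; f(a), f(b) must differ in sign."""
    fa = f(a)
    for _ in range(steps):
        c = 0.5 * (a + b)
        fc = f(c)
        if fa * fc <= 0.0:
            b = c                         # sign change lies in [a, c]
        else:
            a, fa = c, fc                 # sign change lies in [c, b]
    return a, b

a, b = bisect_steps(f, 0.0, 4.0, 4)       # coarse bracket of width 4 / 2**4 = 0.25
x = 0.5 * (a + b)                         # starting point for Newton's method
for _ in range(6):
    x = x - f(x) / fprime(x)
print((a, b), x)
```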
Although Newton’s method is truly a marvelous invention, its convergence depends upon hypotheses that are difficult to verify a priori. Some graphical examples will show what can happen. In Figure 3.6(a), the tangent to the graph of the function f at x0 intersects the x-axis at a point remote from the root r, and successive points in Newton’s iteration recede from r instead of converging to r. The difficulty can be ascribed to a poor choice of the initial point x0; it is not sufficiently close to r. In Figure 3.6(b), the tangent to the curve is parallel to the x-axis and x1 = ±∞, or it is assigned the value of machine infinity in a computer. In Figure 3.6(c), the iteration values cycle because x2 = x0. In a computer, roundoff errors or limited precision may eventually cause this situation to become unbalanced such that the iterates either spiral inward and converge or spiral outward and diverge.
FIGURE 3.6 Failure of Newton’s method due to bad starting points: (a) Runaway, (b) Flat spot, (c) Cycle
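The three failure modes can be reproduced numerically. The functions below are illustrative choices, not the ones plotted in the figure: arctan x with x0 = 2 for the runaway, $x^2 - 1$ with x0 = 0 for the flat spot, and $x^3 - 2x + 2$ with x0 = 0 for the cycle.

```python
import math

def newton_steps(f, fprime, x0, steps):
    """Return the iterates x0, x1, ..., stopping early if the derivative vanishes."""
    xs = [x0]
    for _ in range(steps):
        fp = fprime(xs[-1])
        if fp == 0.0:                      # flat spot: tangent is parallel to the x-axis
            break
        xs.append(xs[-1] - f(xs[-1]) / fp)
    return xs

# (a) Runaway: arctan x started too far from the root r = 0.
runaway = newton_steps(math.atan, lambda x: 1.0 / (1.0 + x * x), 2.0, 4)

# (b) Flat spot: x^2 - 1 started at x0 = 0, where the derivative is zero.
flat = newton_steps(lambda x: x * x - 1.0, lambda x: 2.0 * x, 0.0, 4)

# (c) Cycle: x^3 - 2x + 2 started at x0 = 0 alternates between 0 and 1.
cycle = newton_steps(lambda x: x**3 - 2.0 * x + 2.0,
                     lambda x: 3.0 * x * x - 2.0, 0.0, 4)

print(runaway, flat, cycle, sep="\n")
```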
The analysis that establishes the quadratic convergence discloses another troublesome hypothesis; namely, f′(r) ≠ 0. If f′(r) = 0, then r is a zero of f and f′. Such a zero is termed a multiple zero of f —in this case, at least a double zero. Newton’s iteration for a multiple zero converges only linearly! Ordinarily, one would not know in advance that the zero sought was a multiple zero. If one knew that the multiplicity was m, however, Newton’s method could be accelerated by modifying the equation to read
\[ x_{n+1} = x_n - m\,\frac{f(x_n)}{f'(x_n)} \]
in which m is the multiplicity of the zero in question. The multiplicity of the zero r is the least m such that $f^{(k)}(r) = 0$ for $0 \le k < m$, but $f^{(m)}(r) \ne 0$. (See Problem 3.2.35.)
As is shown in Figure 3.7, the equation $p_2(x) = x^2 - 2x + 1 = 0$ has a root at 1 of multiplicity 2, and the equation $p_3(x) = x^3 - 3x^2 + 3x - 1 = 0$ has a root at 1 of multiplicity 3. It is instructive to plot these curves. Both curves are rather flat at the roots, which slows down the convergence of the regular Newton’s method. Also, the figures illustrate the curves of two nonlinear functions with multiplicities as well as their regions of uncertainty about the curves. So the computed solutions could be anywhere within the indicated intervals on the x-axis. This is an indication of the difficulty in obtaining precise solutions of nonlinear functions with multiplicities.
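The effect of multiplicity can be seen on $p_2$ itself. The sketch below compares ordinary Newton steps (m = 1) with the modified iteration using m = 2; the starting value x0 = 2 and the helper name `iterate` are illustrative assumptions.

```python
def p2(x):
    return (x - 2.0) * x + 1.0        # x^2 - 2x + 1, double zero at x = 1

def p2prime(x):
    return 2.0 * x - 2.0

def iterate(m, x, steps):
    """Apply x <- x - m * p2(x)/p2'(x); m = 1 is ordinary Newton."""
    errors = [abs(x - 1.0)]
    for _ in range(steps):
        if p2prime(x) == 0.0:          # landed exactly on the zero
            break
        x = x - m * p2(x) / p2prime(x)
        errors.append(abs(x - 1.0))
    return errors

plain = iterate(1, 2.0, 10)      # error is halved at every step (linear convergence)
modified = iterate(2, 2.0, 10)   # m = 2 matches the multiplicity
print(plain)
print(modified)
```

For this quadratic the modified step lands on the root at once, which is a special feature of the example; in general the modified iteration is only expected to restore fast local convergence.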
FIGURE 3.7 Curves p2 and p3 with multiplicity 2 and 3: (a) p2(x) = x^2 − 2x + 1, (b) p3(x) = x^3 − 3x^2 + 3x − 1
Systems of Nonlinear Equations
Some physical problems involve the solution of systems of N nonlinear equations in N unknowns. One approach is to linearize and solve, repeatedly. This is the same strategy used by Newton’s method in solving a single nonlinear equation. Not surprisingly, a natural extension of Newton’s method for nonlinear systems can be found. The topic of systems of nonlinear equations requires some familiarity with matrices and their inverses. (See Appendix D.)
In the general case, a system of N nonlinear equations in N unknowns $x_i$ can be displayed in the form
\[ \begin{cases} f_1(x_1, x_2, \ldots, x_N) = 0 \\ f_2(x_1, x_2, \ldots, x_N) = 0 \\ \qquad\vdots \\ f_N(x_1, x_2, \ldots, x_N) = 0 \end{cases} \]
Using vector notation, we can write this system in a more elegant form:
\[ F(X) = 0 \]
by defining column vectors as
\[ F = [f_1, f_2, \ldots, f_N]^T, \qquad X = [x_1, x_2, \ldots, x_N]^T \]
The extension of Newton’s method for nonlinear systems is
\[ X^{(k+1)} = X^{(k)} - \left[F'(X^{(k)})\right]^{-1} F(X^{(k)}) \]
where $F'(X^{(k)})$ is the Jacobian matrix, which will be defined presently. It comprises partial derivatives of F evaluated at $X^{(k)} = [x_1^{(k)}, x_2^{(k)}, \ldots, x_N^{(k)}]^T$. This formula is similar to the previously seen version of Newton’s method except that the derivative expression is not in the denominator but in the numerator as the inverse of a matrix. In the computational form of the formula, $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, \ldots, x_N^{(0)}]^T$ is an initial approximation vector, taken to be close to the solution of the nonlinear system, and the inverse of the Jacobian matrix is not computed but rather a related system of equations is solved.
We illustrate the development of this procedure using three nonlinear equations
\[ \begin{cases} f_1(x_1, x_2, x_3) = 0 \\ f_2(x_1, x_2, x_3) = 0 \\ f_3(x_1, x_2, x_3) = 0 \end{cases} \tag{3} \]
Recall the Taylor expansion in three variables for i = 1, 2, 3:
\[ f_i(x_1 + h_1, x_2 + h_2, x_3 + h_3) = f_i(x_1, x_2, x_3) + h_1 \frac{\partial f_i}{\partial x_1} + h_2 \frac{\partial f_i}{\partial x_2} + h_3 \frac{\partial f_i}{\partial x_3} + \cdots \tag{4} \]
where the partial derivatives are evaluated at the point $(x_1, x_2, x_3)$. Here only the linear terms in step sizes $h_i$ are shown. Suppose that the vector $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, x_3^{(0)}]^T$ is an approximate solution to (3). Let $H = [h_1, h_2, h_3]^T$ be a computed correction to the initial guess so that $X^{(0)} + H = [x_1^{(0)} + h_1, x_2^{(0)} + h_2, x_3^{(0)} + h_3]^T$ is a better approximate solution. Discarding the higher-order terms in the Taylor expansion (4), we have in vector notation
\[ 0 \approx F(X^{(0)} + H) \approx F(X^{(0)}) + F'(X^{(0)})\,H \tag{5} \]
where the Jacobian matrix is defined by
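\[ F'(X^{(0)}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dfrac{\partial f_1}{\partial x_3} \\[2ex] \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dfrac{\partial f_2}{\partial x_3} \\[2ex] \dfrac{\partial f_3}{\partial x_1} & \dfrac{\partial f_3}{\partial x_2} & \dfrac{\partial f_3}{\partial x_3} \end{bmatrix} \]
Here all of the partial derivatives are evaluated at $X^{(0)}$; namely,
\[ \frac{\partial f_i}{\partial x_j} = \frac{\partial f_i(X^{(0)})}{\partial x_j} \]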
Also, we assume that the Jacobian matrix $F'(X^{(0)})$ is nonsingular, so its inverse exists. Solving for H in (5), we have
\[ H \approx -\left[F'(X^{(0)})\right]^{-1} F(X^{(0)}) \]
Let $X^{(1)} = X^{(0)} + H$ be the better approximation after the correction; we then arrive at the first iteration of Newton’s method for nonlinear systems
\[ X^{(1)} = X^{(0)} - \left[F'(X^{(0)})\right]^{-1} F(X^{(0)}) \]
In general, Newton’s method uses this iteration:
\[ X^{(k+1)} = X^{(k)} - \left[F'(X^{(k)})\right]^{-1} F(X^{(k)}) \]
In practice, the computational form of Newton’s method does not involve inverting the Jacobian matrix but rather solves the Jacobian linear systems
\[ \left[F'(X^{(k)})\right] H^{(k)} = -F(X^{(k)}) \tag{6} \]
The next iteration of Newton’s method is then
\[ X^{(k+1)} = X^{(k)} + H^{(k)} \tag{7} \]
This is Newton’s method for nonlinear systems. The linear system (6) can be solved by procedures Gauss and Solve as discussed in Chapter 7. Small systems of order 2 can be solved easily. (See Problem 3.2.39.)
EXAMPLE 2
As an illustration, we can write a pseudocode to solve the following nonlinear system of equations using a variant of Newton’s method given by (6) and (7):
\[ \begin{cases} x + y + z = 3 \\ x^2 + y^2 + z^2 = 5 \\ e^x + xy - xz = 1 \end{cases} \tag{8} \]
Solution
With a sharp eye, the reader immediately sees that the solution of this system is x = 0, y = 1, z = 2. But in most realistic problems, the solution is not so obvious. We wish to develop a numerical procedure for finding such a solution. Here is a pseudocode:
    X = [0.1, 1.2, 2.5]^T
    for k = 1 to 10 do
        F = [ x1 + x2 + x3 − 3
              x1^2 + x2^2 + x3^2 − 5
              e^{x1} + x1 x2 − x1 x3 − 1 ]
        J = [ 1                   1      1
              2 x1                2 x2   2 x3
              e^{x1} + x2 − x3    x1     −x1 ]
        solve J H = F
        X = X − H
    end for
When programmed and executed on a computer, we found that it converges to x = (0, 1, 2), but when we change to a different starting vector, (1, 0, 1), it converges to another root, (1.2244, −0.0931, 1.8687). (Why?) ■
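The pseudocode translates almost line for line into Python. The sketch below is self-contained and uses a small Gaussian elimination routine with partial pivoting in place of the procedures Gauss and Solve; the helper names `solve_linear` and `newton_system` are assumptions made for illustration.

```python
import math

def solve_linear(A, b):
    """Solve A h = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            factor = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= factor * M[col][j]
    h = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        s = sum(M[i][j] * h[j] for j in range(i + 1, n))
        h[i] = (M[i][n] - s) / M[i][i]
    return h

def F(X):
    x1, x2, x3 = X
    return [x1 + x2 + x3 - 3.0,
            x1**2 + x2**2 + x3**2 - 5.0,
            math.exp(x1) + x1 * x2 - x1 * x3 - 1.0]

def J(X):
    x1, x2, x3 = X
    return [[1.0, 1.0, 1.0],
            [2.0 * x1, 2.0 * x2, 2.0 * x3],
            [math.exp(x1) + x2 - x3, x1, -x1]]

def newton_system(X, steps=10):
    for _ in range(steps):
        H = solve_linear(J(X), F(X))       # solve J H = F
        X = [x - h for x, h in zip(X, H)]  # X <- X - H
    return X

sol1 = newton_system([0.1, 1.2, 2.5])
sol2 = newton_system([1.0, 0.0, 1.0])
print(sol1)
print(sol2)
```

Note that the third row of J vanishes at (0, 1, 2), so the Jacobian is singular at that solution and fast quadratic convergence should not be expected from the first starting vector; ten steps give only a few correct digits there.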
We can use mathematical software such as Matlab, Maple, or Mathematica and their built-in procedures for solving the system of nonlinear equations (8). Solving systems of nonlinear equations is an important application area, and it is used in Chapter 16 on minimization of functions.
Fractal Basins of Attraction
The applicability of Newton’s method for finding complex roots is one of its outstanding strengths. One need only program Newton’s method using complex arithmetic.
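As a small sketch of this idea, the earlier polynomial $x^3 - 2x^2 + x - 3$ can be handed a complex starting value, and Python's built-in complex type does the rest. The starting value −0.1 + i is an assumed guess near one of the complex roots quoted earlier.

```python
def p(z):
    return ((z - 2) * z + 1) * z - 3       # z^3 - 2z^2 + z - 3 in nested form

def pprime(z):
    return (3 * z - 4) * z + 1

z = complex(-0.1, 1.0)                      # complex starting value (an assumed guess)
for _ in range(20):
    z = z - p(z) / pprime(z)                # same Newton step, complex arithmetic
print(z, abs(p(z)))
```

With a real starting value the iterates of this real polynomial stay real, so a complex start is needed to reach the conjugate pair.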
The frontiers of numerical analysis and nonlinear dynamics overlap in some intriguing ways. Computer-generated displays with fractal patterns, such as in Figure 3.8, can easily be created with the help of the Newton iteration. The resulting pictures show intricately interwoven sets in the plane that are quite beautiful if displayed on a color computer monitor. One begins with a polynomial in the complex variable z. For example, $p(z) = z^4 - 1$ is suitable. This polynomial has four zeros, which are the fourth roots of unity. Each of these zeros has a basin of attraction, that is, the set of all points z0 such that Newton’s iteration, started at z0, will converge to that zero. These four basins of attraction are disjoint from each other, because if the Newton iteration starting at z0 converges to one zero, then it cannot also converge to another zero. One would naturally expect each basin to be a simple set surrounding the zero in the complex plane. But they turn out to be far from simple. To see what they are, we can systematically determine, for a large number of points, which zero of p the Newton iteration converges to if started at z0. Points in each basin can be assigned different colors. The (rare) points for which the Newton iteration does not converge can be left uncolored. Computer Problem 3.2.27 suggests how to do this.
FIGURE 3.8 Basins of attraction
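A crude picture of the basins for $p(z) = z^4 - 1$ can be produced without any graphics library by printing one character per mesh point. In the sketch below, the mesh size, the square [−2, 2] × [−2, 2], the tolerance, and the iteration cap are all illustrative assumptions.

```python
ROOTS = [1, 1j, -1, -1j]                    # the fourth roots of unity

def basin_index(z, nmax=50, eps=1e-8):
    """Return the index in ROOTS that Newton's iteration reaches from z, or None."""
    for _ in range(nmax):
        z2 = z * z
        dp = 4 * z2 * z                     # p'(z) for p(z) = z^4 - 1
        if dp == 0:
            return None                     # Newton step undefined
        z = z - (z2 * z2 - 1) / dp
        for k, root in enumerate(ROOTS):
            if abs(z - root) < eps:
                return k
    return None                             # no convergence within nmax steps

# Crude character picture on a 41 x 21 mesh over the square [-2, 2] x [-2, 2].
symbols = "ABCD"
for row in range(21):
    y = 2.0 - 4.0 * row / 20
    line = ""
    for col in range(41):
        x = -2.0 + 4.0 * col / 40
        k = basin_index(complex(x, y))
        line += "." if k is None else symbols[k]
    print(line)
```

Replacing the characters by colors and refining the mesh gives pictures of the kind described in the text.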
Summary
(1) For finding a zero of a continuous and differentiable function f, Newton’s method is given by
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad (n \ge 0) \]
It requires a given initial value x0 and two function evaluations (for f and f′) per step.
(2) The errors are related by
\[ e_{n+1} = -\frac{1}{2}\,\frac{f''(\xi_n)}{f'(x_n)}\,e_n^2 \]
which leads to the inequality
\[ |e_{n+1}| \le c\,|e_n|^2 \]
This means that Newton’s method has quadratic convergence behavior for x0 sufficiently close to the root r.
(3) For an N × N system of nonlinear equations F(X) = 0, Newton’s method is written as
\[ X^{(k+1)} = X^{(k)} - \left[F'(X^{(k)})\right]^{-1} F(X^{(k)}) \qquad (k \ge 0) \]
which involves the Jacobian matrix $F'(X^{(k)}) = J = \left[\partial f_i(X^{(k)})/\partial x_j\right]_{N \times N}$. In practice, one solves the Jacobian linear system
\[ \left[F'(X^{(k)})\right] H^{(k)} = -F(X^{(k)}) \]
using Gaussian elimination and then finds the next iterate from the equation
\[ X^{(k+1)} = X^{(k)} + H^{(k)} \]
Additional References
For additional details and sample plots, see Kincaid and Cheney [2002] or Epureanu and Greenside [1998]. For other references on fractals, see Crilly, Earnshaw, and Jones [1991], Feder [1998], Hastings and Sugihara [1993], and Novak [1998].
Moreover, an expository paper by Ypma [1995] traces the historical development of Newton’s method through notes, letters, and publications by Isaac Newton, Joseph Raphson, and Thomas Simpson.
Problems 3.2
1. Verify that when Newton’s method is used to compute $\sqrt{R}$ (by solving the equation $x^2 = R$), the sequence of iterates is defined by
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{R}{x_n}\right) \]
2. (Continuation) Show that if the sequence {xn} is defined as in the preceding problem, then
\[ x_{n+1}^2 - R = \left[\frac{x_n^2 - R}{2x_n}\right]^2 \]
Interpret this equation in terms of quadratic convergence.
a3. Write Newton’s method in simplified form for determining the reciprocal of the square root of a positive number. Perform two iterations to approximate $1/\pm\sqrt{5}$, starting with $x_0 = 1$ and $x_0 = -1$.
a4. Two of the four zeros of $x^4 + 2x^3 - 7x^2 + 3$ are positive. Find them by Newton’s method, correct to two significant figures.
5. The equation $x - Rx^{-1} = 0$ has $x = \pm R^{1/2}$ for its solution. Establish Newton’s iterative scheme, in simplified form, for this situation. Carry out five steps for R = 25 and $x_0 = 1$.
6. Using a calculator, observe the sluggishness with which Newton’s method converges in the case of $f(x) = (x - 1)^m$ with m = 8 or 12. Reconcile this with the theory. Use $x_0 = 1.1$.
a7. What linear function y = ax + b approximates f(x) = sin x best in the vicinity of x = π/4? How does this problem relate to Newton’s method?
8. In Problems 1.2.11 and 1.2.12, several methods are suggested for computing ln 2. Compare them with the use of Newton’s method applied to the equation $e^x = 2$.
a9. Define a sequence $x_{n+1} = x_n - \tan x_n$ with $x_0 = 3$. What is $\lim_{n\to\infty} x_n$?
10. The iteration formula $x_{n+1} = x_n - (\cos x_n)(\sin x_n) + R\cos^2 x_n$, where R is a positive constant, was obtained by applying Newton’s method to some function f(x). What was f(x)? What can this formula be used for?
a11. Establish Newton’s iterative scheme in simplified form, not involving the reciprocal of x, for the function $f(x) = xR - x^{-1}$. Carry out three steps of this procedure using R = 4 and $x_0 = -1$.
12. Consider the following procedures:
aa. $x_{n+1} = \frac{1}{3}\left(2x_n - \frac{r}{x_n^2}\right)$
b. $x_{n+1} = \frac{1}{2}x_n + \frac{1}{x_n}$
Do they converge for any nonzero initial point? If so, to what values?
13. Each of the following functions has $\sqrt[3]{R}$ as a zero for any positive real number R. Determine the formulas for Newton’s method for each and any necessary restrictions on the choice for x0.
aa. $a(x) = x^3 - R$
b. $b(x) = 1/x^3 - 1/R$
ac. $c(x) = x^2 - R/x$
d. $d(x) = x - R/x^2$
ae. $e(x) = 1 - R/x^3$
f. $f(x) = 1/x - x^2/R$
ag. $g(x) = 1/x^2 - x/R$
h. $h(x) = 1 - x^3/R$
14. Determine the formulas for Newton’s method for finding a root of the function $f(x) = x - e/x$. What is the behavior of the iterates?
a15. If Newton’s method is used on $f(x) = x^3 - x + 1$ starting with $x_0 = 1$, what will $x_1$ be?
16. Locate the root of $f(x) = e^{-x} - \cos x$ that is nearest π/2.
a17. If Newton’s method is used on $f(x) = x^5 - x^3 + 3$ and if $x_n = 1$, what is $x_{n+1}$?
18. Determine Newton’s iteration formula for computing the cube root of N/M for nonzero integers N and M.
a19. For what starting values will Newton’s method converge if the function f is $f(x) = x^2/(1 + x^2)$?
20. Starting at x = 3, x < 3, or x > 3, analyze what happens when Newton’s method is applied to the function $f(x) = 2x^3 - 9x^2 + 12x + 15$.
a21. (Continuation) Repeat for $f(x) = \sqrt{|x|}$, starting with x < 0 or x > 0.
a22. To determine $x = \sqrt[3]{R}$, we can solve the equation $x^3 = R$ by Newton’s method. Write the loop that carries out this process, starting from the initial approximation $x_0 = R$.
23. The reciprocal of a number R can be computed without division by the iterative formula
\[ x_{n+1} = x_n(2 - x_n R) \]
Establish this relation by applying Newton’s method to some f(x). Beginning with $x_0 = 0.2$, compute the reciprocal of 4 correct to six decimal digits or more by this rule. Tabulate the error at each step and observe the quadratic convergence.
24. On a certain modern computer, floating-point numbers have a 48-bit mantissa. Moreover, floating-point hardware can perform addition, subtraction, multiplication, and reciprocation, but not division. Unfortunately, the reciprocation hardware produces a result accurate to less than full precision, whereas the other operations produce results accurate to full floating-point precision.
a. Show that Newton’s method can be used to find a zero of the function f(x) = 1 − 1/(ax). This will provide an approximation to 1/a that is accurate to full floating-point precision. How many iterations are required?
b. Show how to obtain an approximation to b/a that is accurate to full floating-point precision.
25. Newton’s method for finding $\sqrt{R}$ is
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{R}{x_n}\right) \]
Perform three iterations of this scheme for computing $\sqrt{2}$, starting with $x_0 = 1$, and of the bisection method for $\sqrt{2}$, starting with interval [1, 2]. How many iterations are needed for each method in order to obtain $10^{-6}$ accuracy?
26. (Continuation) Newton’s method for finding $\sqrt{R}$, where R = AB, gives this approximation:
\[ \sqrt{AB} \approx \frac{A + B}{4} + \frac{AB}{A + B} \]
Show that if $x_0 = A$ or B, then two iterations of Newton’s method are needed to obtain this approximation, whereas if $x_0 = \frac{1}{2}(A + B)$, then only one iteration is needed.
a27. Show that Newton’s method applied to $x^m - R$ and to $1 - (R/x^m)$ for determining $\sqrt[m]{R}$ results in two similar yet different iterative formulas. Here R > 0, m ≥ 2. Which formula is better and why?
28. Using a handheld calculator, carry out three iterations of Newton’s method using $x_0 = 1$ and $f(x) = 3x^3 + x^2 - 15x + 3$.
a29. What happens if the Newton iteration is applied to f(x) = arctan x with $x_0 = 2$? For what starting values will Newton’s method converge? (See Computer Problem 3.2.7.)
30. Newton’s method can be interpreted as follows: Suppose that f(x + h) = 0. Then f′(x) ≈ [f(x + h) − f(x)]/h = −f(x)/h. Continue this argument.
a31. Derive a formula for Newton’s method for the function F(x) = f(x)/f′(x), where f(x) is a function with simple zeros that is three times continuously differentiable. Show that the convergence of the resulting method to any zero r of f(x) is at least quadratic. Hint: Apply the result in the text to F, making sure that F has the required properties.
a32. The Taylor series for a function f looks like this:
\[ f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots \]
Suppose that f(x), f′(x), and f′′(x) are easily computed. Derive an algorithm like Newton’s method that uses three terms in the Taylor series. The algorithm should take as input an approximation to the root and produce as output a better approximation to the root. Show that the method is cubically convergent.
33. To avoid computing the derivative at each step in Newton’s method, it has been proposed to replace f′(xn) by f′(x0). Derive the rate of convergence for this method.
34. Refer to the discussion of Newton’s method and establish that
\[ \lim_{n\to\infty} e_{n+1} e_n^{-2} = -\frac{1}{2}\,\frac{f''(r)}{f'(r)} \]
How can this be used in a practical case to test whether the convergence is quadratic? Devise an example in which r, f′(r), and f′′(r) are all known, and test numerically the convergence of $e_{n+1} e_n^{-2}$.
a35. Show that in the case of a zero of multiplicity m, the modified Newton’s method
\[ x_{n+1} = x_n - m\,\frac{f(x_n)}{f'(x_n)} \]
is quadratically convergent. Hint: Use Taylor series for each of $f(r + e_n)$ and $f'(r + e_n)$.
a36. The Steffensen method for solving the equation f(x) = 0 uses the formula
\[ x_{n+1} = x_n - \frac{f(x_n)}{g(x_n)} \]
in which g(x) = {f[x + f(x)] − f(x)}/f(x). It is quadratically convergent, like Newton’s method. How many function evaluations are necessary per step? Using Taylor series, show that g(x) ≈ f′(x) if f(x) is small and thus relate Steffensen’s iteration to Newton’s. What advantage does Steffensen’s have? Establish the quadratic convergence.
a37. A proposed generalization of Newton’s method is
\[ x_{n+1} = x_n - \omega\,\frac{f(x_n)}{f'(x_n)} \]
where the constant ω is an acceleration factor chosen to increase the rate of convergence. For what range of values of ω is a simple root r of f(x) a point of attraction; that is, |g′(r)| < 1, where g(x) = x − ωf(x)/f′(x)? This method is quadratically convergent only if ω = 1 because g′(r) ≠ 0 when ω ≠ 1.
38. Suppose that r is a double root of f(x) = 0; that is, f(r) = f′(r) = 0 but f′′(r) ≠ 0, and suppose that f and all derivatives up to and including the second are continuous in some neighborhood of r. Show that $e_{n+1} \approx \frac{1}{2} e_n$ for Newton’s method and thereby conclude that the rate of convergence is linear near a double root. (If the root has multiplicity m, then $e_{n+1} \approx [(m - 1)/m]\,e_n$.)
39. (Simultaneous nonlinear equations) Using the Taylor series in two variables (x, y) of the form
\[ f(x + h, y + k) = f(x, y) + h f_x(x, y) + k f_y(x, y) + \cdots \]
where $f_x = \partial f/\partial x$ and $f_y = \partial f/\partial y$, establish that Newton’s method for solving the two simultaneous nonlinear equations
\[ f(x, y) = 0, \qquad g(x, y) = 0 \]
can be described with the formulas
\[ x_{n+1} = x_n - \frac{f g_y - g f_y}{f_x g_y - g_x f_y}, \qquad y_{n+1} = y_n - \frac{f_x g - g_x f}{f_x g_y - g_x f_y} \]
Here the functions f, $f_x$, and so on are evaluated at $(x_n, y_n)$.
40. Newton’s method can be defined for the equation f(z) = g(x, y) + ih(x, y), where f(z) is an analytic function of the complex variable z = x + iy (x and y real) and g(x, y) and h(x, y) are real functions for all x and y. The derivative f′(z) is given by $f'(z) = g_x + i h_x = h_y - i g_y$ because the Cauchy-Riemann equations $g_x = h_y$ and $h_x = -g_y$ hold. Here the partial derivatives are defined as $g_x = \partial g/\partial x$, $g_y = \partial g/\partial y$, and so on. Show that Newton’s method
\[ z_{n+1} = z_n - \frac{f(z_n)}{f'(z_n)} \]
can be written in the form
\[ x_{n+1} = x_n - \frac{g h_y - h g_y}{g_x h_y - g_y h_x}, \qquad y_{n+1} = y_n - \frac{h g_x - g h_x}{g_x h_y - g_y h_x} \]
Here all functions are evaluated at $z_n = x_n + i y_n$.
a41. Consider the algorithm of which one step consists of two steps of Newton’s method. What is its order of convergence?
42. (Continuation) Using the idea of the preceding problem, show how we can easily create methods of arbitrarily high order for solving f(x) = 0. Why is the order of a method not the only criterion that should be considered in assessing its merits?
43. If we want to solve the equation $2 - x = e^x$ using Newton’s iteration, what are the equations and functions that must be coded? Give a pseudocode for doing this problem. Include a suitable starting point and a suitable stopping criterion.
44. Suppose that we want to compute $\sqrt{2}$ by using Newton’s method on the equation $x^2 = 2$ (in the obvious, straightforward way). If the starting point is $x_0 = \frac{7}{5}$, what is the numerical value of the correction that must be added to $x_0$ to get $x_1$? Hint: The arithmetic is quite easy if you do it using ratios of integers.
45. Apply Newton’s method to the equation f (x) = 0 with f (x) as given below. Find out what happens and why.
a. $f(x) = e^x$
b. $f(x) = e^x + x^2$
46. Consider Newton’s method xn+1 = xn − f (xn )/ f ′(xn ). If the sequence converges then
the limit point is a solution. Explain why or why not.
Computer Problems 3.2
1. Using the procedure Newton and a single computer run, test your code on these examples: $f(t) = \tan t - t$ with $x_0 = 7$ and $g(t) = e^t - \sqrt{t + 9}$ with $x_0 = 2$. Print each iterate and its accompanying function value.
2. Write a simple, self-contained program to apply Newton’s method to the equation $x^3 + 2x^2 + 10x = 20$, starting with $x_0 = 2$. Evaluate the appropriate f(x) and f′(x), using nested multiplication. Stop the computation when two successive points differ by $\frac{1}{2} \times 10^{-5}$ or some other convenient tolerance close to your machine’s capability. Print all intermediate points and function values. Put an upper limit of ten on the number of steps.
3. (Continuation) Repeat using double precision and more steps.
a4. Find the root of the equation
\[ 2x(1 - x^2 + x)\ln x = x^2 - 1 \]
in the interval [0, 1] by Newton’s method using double precision. Make a table that shows the number of correct digits in each step.
a5. In 1685, John Wallis published a book called Algebra, in which he described a method devised by Newton for solving equations. In slightly modified form, this method was also published by Joseph Raphson in 1690. This form is the one now commonly called Newton’s method or the Newton-Raphson method. Newton himself discussed the method in 1669 and illustrated it with the equation $x^3 - 2x - 5 = 0$. Wallis used the same example. Find a root of this equation in double precision, thus continuing the tradition that every numerical analysis student should solve this venerable equation.
6. In celestial mechanics, Kepler’s equation is important. It reads $x = y - \varepsilon \sin y$, in which x is a planet’s mean anomaly, y its eccentric anomaly, and ε the eccentricity of its orbit. Taking ε = 0.9, construct a table of y for 30 equally spaced values of x in the interval $0 \le x \le \pi$. Use Newton’s method to obtain each value of y. The y corresponding to an x can be used as the starting point for the iteration when x is changed slightly.
7. In Newton’s method, we progress in each step from a given point x to a new point x − h, where h = f(x)/f′(x). A refinement that is easily programmed is this: If |f(x − h)| is not smaller than |f(x)|, then reject this value of h and use h/2 instead. Test this refinement.
a8. Write a brief program to compute a root of the equation $x^3 = x^2 + x + 1$, using Newton’s method. Be careful to select a suitable starting value.
a9. Find the root of the equation $5(3x^4 - 6x^2 + 1) = 2(3x^5 - 5x^3)$ that lies in the interval [0, 1] by using Newton’s method and a short program.
10. For each equation, write a brief program to compute and print eight steps of Newton’s method for finding a positive root.
aa. $x = 2\sin x$
ab. $x^3 = \sin x + 7$
ac. $\sin x = 1 - x$
ad. $x^5 + x^2 = 1 + 7x^3$ for $x \ge 2$
11. Write and test a recursive procedure for Newton’s method.
12. Rewrite and test the Newton procedure so that it is a character function and returns key words such as iterating, success, near-zero, max-iteration. Then a case statement can be used to print the results.
13. Would you like to see the number 0.55887 766 come out of a calculation? Take three steps in Newton’s method on $10 + x^3 - 12\cos x = 0$ starting with $x_0 = 1$.
a14. Write a short program to solve for a root of the equation $e^{-x^2} = \cos x + 1$ on [0, 4]. What happens in Newton’s method if we start with $x_0 = 0$ or with $x_0 = 1$?
15. Find the root of the equation $\frac{1}{2}x^2 + x + 1 - e^x = 0$ by Newton’s method, starting with $x_0 = 1$, and account for the slow convergence.
16. Using $f(x) = x^5 - 9x^4 - x^3 + 17x^2 - 8x - 8$ and $x_0 = 0$, study and explain the behavior of Newton’s method. Hint: The iterates are initially cyclic.
17. Find the zero of the function f (x) = x − tan x that is closest to 99 (radians) by both the bisection method and Newton’s method. Hint: Extremely accurate starting values are needed for this function. Use the computer to construct a table of values of f (x) around 99 to determine the nature of this function.
18. Using the bisection method, find the positive root of $2x(1 + x^2)^{-1} = \arctan x$. Using the root as $x_0$, apply Newton’s method to the function arctan x. Interpret the results.
19. If the root of f(x) = 0 is a double root, then Newton’s method can be accelerated by using
\[ x_{n+1} = x_n - 2\,\frac{f(x_n)}{f'(x_n)} \]
Numerically compare the convergence of this scheme with Newton’s method on a function with a known double root.
20. Program and test Steffensen’s method, as described in Problem 3.2.36.
21. Consider the nonlinear system
\[ f(x, y) = x^2 + y^2 - 25 = 0, \qquad g(x, y) = x^2 - y - 2 = 0 \]
Using a software package that has 2D plotting capabilities, illustrate what is going on in solving such a system by plotting f (x, y), g(x, y), and show their intersection with the (x, y)-plane. Determine approximate roots of these equations from the graphical results.
22. Solve this pair of simultaneous nonlinear equations by first eliminating y and then solving the resulting equation in x by Newton’s method. Start with the initial value $x_0 = 1.0$.
\[ x^3 - 2xy + y^7 - 4x^3 y = 5, \qquad y\sin x + 3x^2 y + \tan x = 4 \]
23. Using Equations (7) and (8), code Newton’s methods for nonlinear systems. Test your program by solving one or more of the following systems:
a. System in Computer Problem 3.2.21.
b. System in Computer Problem 3.2.22.
c. System (3) using starting values (0, 0, 0).
d. Using starting values $\left(\frac{3}{4}, \frac{1}{2}, -\frac{1}{2}\right)$, solve
\[ \begin{cases} x + y + z = 0 \\ x^2 + y^2 + z^2 = 2 \\ x(y + z) = -1 \end{cases} \]
e. Using starting values (−0.01, −0.01), solve
\[ \begin{cases} 4y^2 + 4y + 52x - 19 = 0 \\ 169x^2 + 3y^2 + 111x - 10y - 10 = 0 \end{cases} \]
f. Select starting values, and solve
\[ \begin{cases} \sin(x + y) = e^{x - y} \\ \cos(x + 6) = x^2 y^2 \end{cases} \]
24. Investigate the behavior of Newton’s method for finding complex roots of polynomials with real coefficients. For example, the polynomial $p(x) = x^2 + 1$ has the complex conjugate pair of roots ±i and Newton’s method is $x_{n+1} = \frac{1}{2}(x_n - 1/x_n)$. First, program this method using real arithmetic and real numbers as starting values. Second, modify the program using complex arithmetic but still using only real starting values. Finally, use complex numbers as starting values. Observe the behavior of the iterates in each case.
25. Using Problem 3.2.40, find a complex root of each of the following:
a. $z^3 - z - 1 = 0$
b. $z^4 - 2z^3 - 2iz^2 + 4iz = 0$
c. $2z^3 - 6(1 + i)z^2 - 6(1 - i) = 0$
d. $z = e^z$
Hint: For the last part, use Euler’s relation $e^{iy} = \cos y + i\sin y$.
26. In the Newton method for finding a root r of f(x) = 0, we start with $x_0$ and compute the sequence $x_1, x_2, \ldots$ using the formula $x_{n+1} = x_n - f(x_n)/f'(x_n)$. To avoid computing the derivative at each step, it has been proposed to replace f′(xn) with f′(x0) in all steps. It has also been suggested that the derivative in Newton’s formula be computed only every other step. This method is given by
\[ \begin{cases} x_{2n+1} = x_{2n} - \dfrac{f(x_{2n})}{f'(x_{2n})} \\[2ex] x_{2n+2} = x_{2n+1} - \dfrac{f(x_{2n+1})}{f'(x_{2n})} \end{cases} \]
Numerically compare both proposed methods to Newton’s method for several simple functions that have known roots. Print the error of each method on every iteration to monitor the convergence. How well do the proposed methods work?
27. (Basin of attraction) Consider the complex polynomial $z^3 - 1$, whose zeros are the three cube roots of unity. Generate a picture showing three basins of attraction in the complex plane in the square region defined by $-1 \le \mathrm{Real}(z) \le 1$ and $-1 \le \mathrm{Imaginary}(z) \le 1$. To do this, use a mesh of 1000 × 1000 pixels inside the square. The center point of each pixel is used to start the iteration of Newton’s method. Assign a particular basin color to each pixel if convergence to a root is obtained with nmax = 10 iterations. The large number of iterations suggested can be avoided by doing some analysis with the aid of Theorem 1, since the iterates get within a certain neighborhood of the root and the iteration can be stopped. The criterion for convergence is to check both $|z_{n+1} - z_n| < \varepsilon$ and $|z_{n+1}^3 - 1| < \varepsilon$ with a small value such as $\varepsilon = 10^{-4}$ as well as a maximum number of iterations. Hint: It is best to debug your program and get a crude picture with only a small number of pixels such as 10 × 10.
28. (Continuation) Repeat for the polynomial $z^4 - 1 = 0$.
29. Write real function Sqrt(x) to compute the square root of a real argument x by the following algorithm: First, reduce the range of x by finding a real number r and an integer m such that $x = 2^{2m} r$ with $\frac{1}{4} \le r < 1$. Next, compute $x_2$ by using three iterations of Newton’s method given by
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{r}{x_n}\right) \]
with the special initial approximation
\[ x_0 = 1.27235\,367 + 0.24269\,3281\,r - \frac{1.02966\,039}{1 + r} \]
Then set $\sqrt{x} \approx 2^m x_2$. Test this algorithm on various values of x. Obtain a listing of the code for the square-root function on your computer system. By reading the comments, try to determine what algorithm it uses.
30. The following method has third-order convergence for computing $\sqrt{R}$:
$$x_{n+1} = \frac{x_n\left(x_n^2 + 3R\right)}{3x_n^2 + R}$$
Carry out some numerical experiments using this method and the method of the preceding problem to see whether you observe a difference in the rate of convergence. Use the same starting procedures of range reduction and initial approximation.
31. Write real function CubeRoot(x) to compute the cube root of a real argument x by the following procedure: First, determine a real number r and an integer m such that $x = r\,2^{3m}$ with $\frac{1}{8} \le r < 1$. Compute $x_4$ using four iterations of Newton's method:
$$x_{n+1} = \frac{2}{3}\left(x_n + \frac{r}{2x_n^2}\right)$$
with the special starting value
$$x_0 = 2.50292\,6 - \frac{8.04512\,5\,(r + 0.38775\,52)}{(r + 4.61224\,4)(r + 0.38775\,52) - 0.35984\,96}$$
Then set $\sqrt[3]{x} \approx 2^m x_4$. Test this algorithm on a variety of x values.
32. Use mathematical software such as in Maple or Mathematica to compute ten iterates of Newton's method starting with $x_0 = 0$ for $f(x) = x^3 - 2x^2 + x - 3$. With 100 decimal places of accuracy and after nine iterations, show that the value of x is
2.17455 94102 92980 07420 23189 88695 65392 56759 48725 33708 24983 36733 92030 23647 64792 75760 66115 28969 38832 0640
Show that the values of the function at each iteration are 9.0, 2.0, 0.26, 0.0065, $0.45 \times 10^{-5}$, $0.22 \times 10^{-11}$, $0.50 \times 10^{-24}$, $0.27 \times 10^{-49}$, $0.1 \times 10^{-98}$, and $0.1 \times 10^{-98}$. Again notice that the number of digits of accuracy in Newton's method doubles (approximately) with each iteration once the iterates are sufficiently close to the root. (Also, see Bornemann, Wagon, and Waldvogel [2004] for a 100-Digit Challenge, which is a study in high-accuracy numerical computing.)
33. (Continuation) Use Maple or Mathematica to discover that this root is exactly
$$\sqrt[3]{\frac{79}{54} + \frac{1}{6}\sqrt{77}} + \frac{1}{9\sqrt[3]{\frac{79}{54} + \frac{1}{6}\sqrt{77}}} + \frac{2}{3}$$
Clearly, the decimal results are of more interest to us in our study of numerical methods.
34. (Continuation) Find all the roots including complex roots.
35. Numerically, find all the roots of the following systems of nonlinear equations. Then plot the curves to verify your results:
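a. $y = 2x^2 + 3x - 4$, $y = x^2 + 2x + 3$
b. $y + x + 3 = 0$, $x^2 + y^2 = 17$
c. $y = \frac{1}{2}x - 5$, $y = x^2 + 2x - 15$
d. $xy = 1$, $x + y = 2$
e. $y = x^2$, $x^2 + (y - 2)^2 = 4$
f. $3x^2 + 2y^2 = 35$, $4x^2 - 3y^2 = 24$
g. $x^2 - xy + y^2 = 21$, $x^2 + 2xy - 8y^2 = 0$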
36. Apply Newton's method on these test problems:
a. $f(x) = x^2$. Hint: The first derivative is zero at the root and convergence may not be quadratic.
b. $f(x) = x + x^{4/3}$. Hint: There is no second derivative at the root and convergence may fail to be quadratic.
c. $f(x) = x + x^2 \sin(2/x)$ for $x \ne 0$ and $f(0) = 0$, with $f'(x) = 1 + 2x\sin(2/x) - 2\cos(2/x)$ for $x \ne 0$ and $f'(0) = 1$. Hint: The derivative of this function is not continuous at the root and convergence may fail.
37. Let
$$F(X) = \begin{bmatrix} x_1^2 - x_2 + c \\ x_2^2 - x_1 + c \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
Each component equation $f_1(x) = 0$ and $f_2(x) = 0$ describes a parabola. Any point $(x^*, y^*)$ where these two parabolas intersect is a solution to the nonlinear system of equations. Using Newton's method for systems of nonlinear equations, find the solutions for each of these values of the parameter $c = \frac{1}{2}, \frac{1}{4}, -\frac{1}{2}, -1$. Give the Jacobian matrix for each. Also for each of these values, plot the resulting curves showing the points of intersection. (Heath 2000, p. 218)
38. Let
$$F(X) = \begin{bmatrix} x_1^2 + 2x_2 - 2 \\ x_1 + 4x_2 - 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
Solve this nonlinear system starting with $X^{(0)} = (1, 2)$. Give the Jacobian matrix. Also plot the resulting curves showing the point(s) of intersection.
39. Using Newton's method, find the zeros of $f(z) = z^3 - z$ with these starting values: $z^{(0)} = 1 + 1.5i$, $1 + 1.1i$, $1 + 1.2i$, $1 + 1.3i$.
40. Use Halley's method to produce a plot of the basins of attraction for $p(z) = z^6 - 1$. Compare to Figure 3.8.
41. (Global positioning system project) Each time a GPS is used, a system of nonlinear equations of the form
$$(x - a_1)^2 + (y - b_1)^2 + (z - c_1)^2 = [C(t_1 - D)]^2$$
$$(x - a_2)^2 + (y - b_2)^2 + (z - c_2)^2 = [C(t_2 - D)]^2$$
$$(x - a_3)^2 + (y - b_3)^2 + (z - c_3)^2 = [C(t_3 - D)]^2$$
$$(x - a_4)^2 + (y - b_4)^2 + (z - c_4)^2 = [C(t_4 - D)]^2$$
is solved for the (x, y, z) coordinates of the receiver. For each satellite i, the locations are $(a_i, b_i, c_i)$, and $t_i$ is the synchronized transmission time from the satellite. Further, C is the speed of light, and D is the difference between the synchronized time of the satellite clocks and the earth-bound receiver clock. While there are only two points on the intersection of three spheres (one of which can be determined to be the desired location), a fourth sphere (satellite) must be used to resolve the inaccuracy in the clock contained in the low-cost receiver on earth. Explore various ways for solving such a nonlinear system. See Hofmann-Wellenhof, Lichtenegger, and Collins [2001], Sauer [2006], and Strang and Borre [1997].
42. Use mathematical software such as in Matlab, Maple, or Mathematica and their built-in procedures to solve the system of nonlinear equations (8) in Example 2. Also, plot the given surfaces and the solution obtained. Hint: You may need to use a slightly perturbed starting point (0.5, 1.5, 0.5) to avoid a singularity in the Jacobian matrix.
3.3 Secant Method
We now consider a general-purpose procedure that converges almost as fast as Newton’s method. This method mimics Newton’s method but avoids the calculation of derivatives. Recall that Newton’s iteration defines xn+1 in terms of xn via the formula
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad (1)$$
In the secant method, we replace $f'(x_n)$ in Formula (1) by an approximation that is easily computed. Since the derivative is defined by
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
we can say that for small h,
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
(In Section 4.3, we revisit this subject and learn that this is a finite difference approximation to the first derivative.) In particular, if $x = x_n$ and $h = x_{n-1} - x_n$, we have
$$f'(x_n) \approx \frac{f(x_{n-1}) - f(x_n)}{x_{n-1} - x_n} \qquad (2)$$
When this is used in Equation (1), the result defines the secant method:
$$x_{n+1} = x_n - \left[\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\right] f(x_n) \qquad (3)$$
The secant method (like Newton's) can be used to solve systems of equations as well.
The name of the method is taken from the fact that the right member of Equation (2) is the slope of a secant line to the graph of f (see Figure 3.9). Of course, the left member is the slope of a tangent line to the graph of f. (Similarly, Newton's method could be called the "tangent method.")
[FIGURE 3.9: Secant method. Graph of y = f(x) showing the root r, the iterates $x_{n-1}$, $x_n$, and $x_{n+1}$ on the x-axis, and the secant line.]
A few remarks about Equation (3) are in order. Clearly, $x_{n+1}$ depends on two previous elements of the sequence. So to start, two points ($x_0$ and $x_1$) must be provided. Equation (3) can then generate $x_2, x_3, \ldots$. In programming the secant method, we could calculate and test the quantity $f(x_n) - f(x_{n-1})$. If it is nearly zero, an overflow can occur in Equation (3). Of course, if the method is succeeding, the points $x_n$ will be approaching a zero of f, so $f(x_n)$ will be converging to zero. (We are assuming that f is continuous.) Also, $f(x_{n-1})$ will be converging to zero, and, a fortiori, $f(x_n) - f(x_{n-1})$ will approach zero. If the terms $f(x_n)$ and $f(x_{n-1})$ have the same sign, additional significant digits are canceled in the subtraction. So we could perhaps halt the iteration when $|f(x_n) - f(x_{n-1})| \le \delta |f(x_n)|$ for some specified tolerance $\delta$, such as $\frac{1}{2} \times 10^{-6}$. (See Computer Problem 3.3.18.)
Secant Algorithm
A pseudocode for nmax steps of the secant method applied to the function f starting with the interval [a, b] = [x0, x1] can be written as follows:
procedure Secant(f, a, b, nmax, ε)
integer n, nmax; real a, b, fa, fb, ε, d
external function f
fa ← f(a)
fb ← f(b)
if |fa| > |fb| then
    a ←→ b
    fa ←→ fb
end if
output 0, a, fa
output 1, b, fb
for n = 2 to nmax do
    if |fa| > |fb| then
        a ←→ b
        fa ←→ fb
    end if
    d ← (b − a)/(fb − fa)
    b ← a
    fb ← fa
    d ← d · fa
    if |d| < ε then
        output "convergence"
        return
    end if
    a ← a − d
    fa ← f(a)
    output n, a, fa
end for
end procedure Secant
Here ←→ means interchange values. The endpoints [a, b] are interchanged, if necessary, to keep $|f(a)| \le |f(b)|$. Consequently, the absolute values of the function are nonincreasing; thus, we have $|f(x_n)| \ge |f(x_{n+1})|$ for $n \ge 1$.
EXAMPLE 1
If the secant method is used on $p(x) = x^5 + x^3 + 3$ with $x_0 = -1$ and $x_1 = 1$, what is $x_8$?
Solution
The output from the computer program corresponding to the pseudocode for the secant method is as follows. (We used a 32-bit word-length computer.)
n    xn          p(xn)
0    −1.0        1.0
1    1.0         5.0
2    −1.5        −7.97
3    −1.05575    0.512
4    −1.11416    −9.991 × 10^−2
5    −1.10462    7.593 × 10^−3
6    −1.10529    1.011 × 10^−4
7    −1.10530    2.990 × 10^−7
8    −1.10530    2.990 × 10^−7
We can use mathematical software to find the single real root, −1.1053, and the two pairs of complex roots, −0.319201 ± 1.35008i and 0.871851 ± 0.806311i. ■
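The secant pseudocode can be sketched in Python as follows. This is a minimal translation under stated assumptions, not the authors' program: the function name `secant`, the returned `history` list, and the guard against a zero denominator are additions made here for illustration. Because it runs in double precision, the later function values will differ from the 32-bit table in Example 1.

```python
def secant(f, a, b, nmax, eps):
    """Secant iteration mirroring the pseudocode; returns a list of (n, x, f(x))."""
    fa, fb = f(a), f(b)
    if abs(fa) > abs(fb):
        a, b = b, a          # interchange so that |f(a)| <= |f(b)|
        fa, fb = fb, fa
    history = [(0, a, fa), (1, b, fb)]
    for n in range(2, nmax + 1):
        if abs(fa) > abs(fb):
            a, b = b, a
            fa, fb = fb, fa
        if fb == fa:
            break            # assumed guard (not in the pseudocode): avoid dividing by zero
        d = (b - a) / (fb - fa)
        b, fb = a, fa
        d = d * fa
        if abs(d) < eps:
            break            # "convergence": the correction is below the tolerance
        a = a - d
        fa = f(a)
        history.append((n, a, fa))
    return history


def p(x):
    return x**5 + x**3 + 3


history = secant(p, -1.0, 1.0, 8, 1e-12)
for n, x, px in history:
    print(n, x, px)
```

The first few iterates agree with the table in Example 1 (for instance $x_2 = -1.5$ and $x_3 \approx -1.05575$).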
Convergence Analysis
The advantages of the secant method are that (after the first step) only one function evaluation is required per step (in contrast to Newton’s iteration, which requires two) and that it is almost as rapidly convergent. It can be shown that the basic secant method defined by Equation (3) obeys an equation of the form
$$e_{n+1} = -\frac{1}{2}\,\frac{f''(\xi_n)}{f'(\zeta_n)}\, e_n e_{n-1} \approx -\frac{1}{2}\,\frac{f''(r)}{f'(r)}\, e_n e_{n-1} \qquad (4)$$
where $\xi_n$ and $\zeta_n$ are in the smallest interval that contains r, $x_n$, and $x_{n-1}$. Thus, the ratio $e_{n+1}(e_n e_{n-1})^{-1}$ converges to $-\frac{1}{2} f''(r)/f'(r)$. The rapidity of convergence of this method is, in general, between those for bisection and for Newton's method.
To prove the second part of Equation (4), we begin with the definition of the secant method in Equation (3) and the error
$$e_{n+1} = r - x_{n+1}$$
$$= r - \frac{f(x_n)\,x_{n-1} - f(x_{n-1})\,x_n}{f(x_n) - f(x_{n-1})}$$
$$= \frac{f(x_n)\,e_{n-1} - f(x_{n-1})\,e_n}{f(x_n) - f(x_{n-1})}$$
$$= \left[\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\right] \left[\frac{\dfrac{f(x_n)}{e_n} - \dfrac{f(x_{n-1})}{e_{n-1}}}{x_n - x_{n-1}}\right] e_n e_{n-1} \qquad (5)$$
By Taylor's Theorem, we establish
$$f(x_n) = f(r - e_n) = f(r) - e_n f'(r) + \tfrac{1}{2} e_n^2 f''(r) + O(e_n^3)$$
Since f(r) = 0, this gives us
$$\frac{f(x_n)}{e_n} = -f'(r) + \tfrac{1}{2} e_n f''(r) + O(e_n^2)$$
Changing the index to n − 1 yields
$$\frac{f(x_{n-1})}{e_{n-1}} = -f'(r) + \tfrac{1}{2} e_{n-1} f''(r) + O(e_{n-1}^2)$$
By subtraction between these equations, we arrive at
$$\frac{f(x_n)}{e_n} - \frac{f(x_{n-1})}{e_{n-1}} = \tfrac{1}{2}(e_n - e_{n-1}) f''(r) + O(e_{n-1}^2)$$
Since $x_n - x_{n-1} = e_{n-1} - e_n$, we reach the equation
$$\frac{\dfrac{f(x_n)}{e_n} - \dfrac{f(x_{n-1})}{e_{n-1}}}{x_n - x_{n-1}} \approx -\tfrac{1}{2} f''(r)$$
The first bracketed expression in Equation (5) can be written as
$$\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})} \approx \frac{1}{f'(r)}$$
Hence, we have shown the second part of Equation (4).
We leave the establishment of the first part of Equation (4) as a problem because it
depends on some material to be covered in Chapter 4. (See Problem 3.3.18.)
From Equation (4), the order of convergence for the secant method can be expressed in terms of the inequality
$$|e_{n+1}| \le C |e_n|^{\alpha} \qquad (6)$$
where $\alpha = \frac{1}{2}(1 + \sqrt{5}) \approx 1.62$ is the golden ratio. Since $\alpha > 1$, we say that the convergence is superlinear. Assuming that Inequality (6) is true, we can show that the secant method converges under certain conditions.
Let $c = c(\delta)$ be defined as in Equation (2) of Section 3.2. If $|r - x_n| \le \delta$ and $|r - x_{n-1}| \le \delta$, for some root r, then Equation (4) yields
$$|e_{n+1}| \le c\,|e_n|\,|e_{n-1}| \qquad (7)$$
Suppose that the initial points $x_0$ and $x_1$ are sufficiently close to r that $c|e_0| \le D$ and $c|e_1| \le D$ for some $D < 1$. Then
$$c|e_1| \le D, \qquad c|e_0| \le D$$
$$c|e_2| \le c|e_1|\, c|e_0| \le D^2$$
$$c|e_3| \le c|e_2|\, c|e_1| \le D^3$$
$$c|e_4| \le c|e_3|\, c|e_2| \le D^5$$
$$c|e_5| \le c|e_4|\, c|e_3| \le D^8$$
etc. In general, we have
$$|e_n| \le c^{-1} D^{\lambda_{n+1}} \qquad (8)$$
where inductively,
$$\lambda_1 = 1, \quad \lambda_2 = 1, \quad \lambda_n = \lambda_{n-1} + \lambda_{n-2} \quad (n \ge 3) \qquad (9)$$
This is the recurrence relation for generating the famous Fibonacci sequence, 1, 1, 2, 3, 5, 8, . . . . It can be shown to have the surprising explicit form
$$\lambda_n = \frac{1}{\sqrt{5}}\left(\alpha^n - \beta^n\right) \qquad (10)$$
where $\alpha = \frac{1}{2}(1 + \sqrt{5})$ and $\beta = \frac{1}{2}(1 - \sqrt{5})$. Since $D < 1$ and $\lambda_n \to \infty$, we conclude from Inequality (8) that $e_n \to 0$. Hence, $x_n \to r$ as $n \to \infty$, and the secant method converges to the root r if $x_0$ and $x_1$ are sufficiently close to it.
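As a quick numerical check of the Fibonacci recurrence (9) against the explicit form (10), the following Python sketch compares the two and also tabulates $\lambda_{n+1} - \alpha\lambda_n$, which the later argument needs to tend to zero. The helper names `lam_recurrence` and `lam_closed` are chosen here for illustration only.

```python
import math

alpha = (1 + math.sqrt(5)) / 2
beta = (1 - math.sqrt(5)) / 2


def lam_recurrence(n):
    """lambda_n from lambda_1 = lambda_2 = 1 and lambda_n = lambda_{n-1} + lambda_{n-2}."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def lam_closed(n):
    """lambda_n from the explicit form (alpha^n - beta^n) / sqrt(5)."""
    return (alpha**n - beta**n) / math.sqrt(5)


rows = [(n, lam_recurrence(n), lam_closed(n)) for n in range(1, 11)]
gaps = [lam_recurrence(n + 1) - alpha * lam_recurrence(n) for n in range(1, 11)]

for (n, rec, closed), gap in zip(rows, gaps):
    print(n, rec, round(closed, 6), round(gap, 6))
```

The printed gaps alternate in sign and shrink geometrically, which is consistent with $\lambda_{n+1} - \alpha\lambda_n = \beta^n$.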
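Next, we give a plausibility argument (not a proof) that Inequality (6) is in fact reasonable. From Inequality (7), we now have
$$|e_{n+1}| \le c\,|e_n|\,|e_{n-1}|$$
$$= c\,|e_n|^{\alpha}\,|e_n|^{1-\alpha}\,|e_{n-1}|$$
$$\approx c\,|e_n|^{\alpha}\left(c^{-1} D^{\lambda_{n+1}}\right)^{1-\alpha} c^{-1} D^{\lambda_n}$$
$$= |e_n|^{\alpha}\, c^{\alpha-1} D^{\lambda_{n+1}(1-\alpha) + \lambda_n}$$
$$= |e_n|^{\alpha}\, c^{\alpha-1} D^{\lambda_{n+2} - \alpha\lambda_{n+1}}$$
by using an approximation to Inequality (8). In the last line, we used the recurrence relation (9). Now $\lambda_{n+2} - \alpha\lambda_{n+1}$ converges to zero. (See Problem 3.3.6.) Hence, $c^{\alpha-1} D^{\lambda_{n+2} - \alpha\lambda_{n+1}}$ is bounded, say, by C, as a function of n. Thus, we have
$$|e_{n+1}| \approx C |e_n|^{\alpha}$$
which is a reasonable approximation to Inequality (6).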
Another derivation (with a bit of hand waving) for the order of convergence of the secant method can be given by using a general recurrence relation. Equation (4) gives us
$$e_{n+1} \approx K e_n e_{n-1}$$
where $K = -\frac{1}{2} f''(r)/f'(r)$. We can write this as
$$|K e_{n+1}| \approx |K e_n|\,|K e_{n-1}|$$
Let $z_i = \log |K e_i|$. Then we want to solve the recurrence equation
$$z_{n+1} = z_n + z_{n-1}$$
where $z_0$ and $z_1$ are arbitrary. This is a linear recurrence relation with constant coefficients similar to the one for the Fibonacci numbers (9) except that the first two values $z_0$ and $z_1$ are unknown. The solution is of the form
$$z_n = A\alpha^n + B\beta^n \qquad (11)$$
where $\alpha = \frac{1}{2}(1 + \sqrt{5})$ and $\beta = \frac{1}{2}(1 - \sqrt{5})$. These are the roots of the quadratic equation $\lambda^2 - \lambda - 1 = 0$. Since $|\alpha| > |\beta|$, the term $A\alpha^n$ dominates, and we can say that
$$z_n \approx A\alpha^n$$
for large n and for some constant A. Consequently, we have
$$|K e_n| \approx 10^{A\alpha^n}$$
Then it follows that
$$|K e_{n+1}| \approx 10^{A\alpha^{n+1}} = \left(10^{A\alpha^n}\right)^{\alpha} = |K e_n|^{\alpha}$$
Hence, we have
$$|e_{n+1}| \approx C |e_n|^{\alpha} \qquad (12)$$
for large n and for some constant C. Again, Inequality (6) is essentially established! A rigorous proof of Inequality (6) is tedious and quite long.
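The superlinear order $\alpha \approx 1.62$ can be observed numerically. The sketch below is an illustration under assumptions made here: the test function $f(x) = x^2 - 2$, the starting values 1 and 2, and the use of Python's `decimal` module with 300 digits (so that the errors can be followed well below double precision). It estimates the order from three consecutive errors via $\log(|e_{n+1}|/|e_n|) / \log(|e_n|/|e_{n-1}|)$.

```python
from decimal import Decimal, getcontext

getcontext().prec = 300  # enough digits to track errors far below 1e-16


def f(x):
    return x * x - 2


root = Decimal(2).sqrt()
xs = [Decimal(1), Decimal(2)]
for _ in range(11):  # computes x_2, ..., x_12 with the secant formula (3)
    x_prev, x_cur = xs[-2], xs[-1]
    x_next = x_cur - f(x_cur) * (x_cur - x_prev) / (f(x_cur) - f(x_prev))
    xs.append(x_next)

errors = [abs(root - x) for x in xs]

# Order estimate from three consecutive errors.
order_estimates = [
    float((errors[n + 1] / errors[n]).ln() / (errors[n] / errors[n - 1]).ln())
    for n in range(1, len(errors) - 1)
]

for n, est in enumerate(order_estimates, start=1):
    print(n, est)
```

The early estimates are erratic, but the later ones settle near the golden ratio, in line with approximation (12).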
Comparison of Methods
In this chapter, three primary methods for solving an equation f (x) = 0 have been presented. The bisection method is reliable but slow. Newton's method is fast but often only near the root and requires f ′. The secant method is nearly as fast as Newton's method and does not require knowledge of the derivative f ′, which may not be available or may be too expensive to compute. The user of the bisection method must provide two points at which the signs of f (x) differ, and the function f need only be continuous. In using Newton's method, one must specify a starting point near the root, and f must be differentiable. The secant method requires two good starting points. Newton's procedure can be interpreted as the repetition of a two-step procedure summarized by the prescription linearize and solve. This strategy is applicable in many other numerical problems, and its importance cannot be overemphasized. Both Newton's method and the secant method fail to bracket a root. The modified false position method can retain the advantages of both methods.
The secant method is often faster at approximating roots of nonlinear functions in comparison to bisection and false position. Unlike these two methods, the intervals [ak , bk ] do not have to be on opposite sides of the root and have a change of sign. Moreover, the slope of the secant line can become quite small, and a step can move far from the current point. The secant method can fail to find a root of a nonlinear function that has a small slope near the root because the secant line can jump a large amount.
For nice functions and guesses relatively close to the root, most of these methods require relatively few iterations before coming close to the root. However, there are pathological functions that can cause troubles for any of those methods. When selecting a method for solving a given nonlinear problem, one must consider many issues such as what you know about the behavior of the function, an interval [a, b] satisfying f (a) f (b) < 0, the first derivative of the function, a good initial guess to the desired root, and so on.
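To make the comparison concrete, the following Python sketch counts how many steps each of the three methods needs to bring an iterate within $10^{-10}$ of $\sqrt{2}$ for $f(x) = x^2 - 2$. The test function, the starting values, the tolerance, and the function names are all choices made here for illustration; other problems can rank the methods differently.

```python
import math

ROOT = math.sqrt(2.0)
TOL = 1e-10


def f(x):
    return x * x - 2.0


def df(x):
    return 2.0 * x


def bisection_steps(a, b, max_steps=200):
    """Number of midpoints computed until one is within TOL of the root."""
    for step in range(1, max_steps + 1):
        c = 0.5 * (a + b)
        if abs(c - ROOT) < TOL:
            return step
        if f(a) * f(c) < 0:
            b = c
        else:
            a = c
    return max_steps


def newton_steps(x, max_steps=200):
    for step in range(1, max_steps + 1):
        x = x - f(x) / df(x)
        if abs(x - ROOT) < TOL:
            return step
    return max_steps


def secant_steps(x0, x1, max_steps=200):
    for step in range(1, max_steps + 1):
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        if abs(x1 - ROOT) < TOL:
            return step
    return max_steps


n_bisect = bisection_steps(1.0, 2.0)
n_newton = newton_steps(2.0)
n_secant = secant_steps(1.0, 2.0)
print("bisection:", n_bisect, "Newton:", n_newton, "secant:", n_secant)
```

Note that each Newton step costs an evaluation of both f and f ′, whereas each secant step after the first costs one new evaluation of f.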
Hybrid Schemes
In an effort to find the best algorithm for finding a zero of a given function, various hybrid methods have been developed. Some of these procedures combine the bisection method (used during the early iterations) with either the secant method or the Newton method. Also, adaptive schemes are used for monitoring the iterations and for carrying out stopping rules. More information on some hybrid secant-bisection methods and hybrid Newton-bisection methods with adaptive stopping rules can be found in Bus and Dekker [1975], Dekker [1969], Kahaner, Moler, and Nash [1989], and Novak, Ritter, and Woźniakowski [1995].
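One simple way such a hybrid could be organized is sketched below in Python. It is not the algorithm of Dekker or of any of the cited references; it is an illustrative safeguard, assumed here, that keeps a bracketing interval, tries a secant step from the two most recent iterates, and falls back to bisection whenever the secant step is undefined or leaves the bracket. The function name and tolerances are hypothetical.

```python
def hybrid_secant_bisection(f, a, b, tol=1e-12, max_iter=100):
    """Safeguarded secant iteration on a bracket [a, b] with a < b and f(a) f(b) < 0."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x_prev, f_prev = a, fa
    x_cur, f_cur = b, fb
    for _ in range(max_iter):
        denom = f_cur - f_prev
        if denom != 0:
            x_new = x_cur - f_cur * (x_cur - x_prev) / denom  # secant step
        else:
            x_new = None
        if x_new is None or not (a < x_new < b):
            x_new = 0.5 * (a + b)  # fall back to bisection
        f_new = f(x_new)
        # Shrink the bracket so that it still contains a sign change.
        if fa * f_new <= 0:
            b, fb = x_new, f_new
        else:
            a, fa = x_new, f_new
        x_prev, f_prev = x_cur, f_cur
        x_cur, f_cur = x_new, f_new
        if abs(f_new) < tol or abs(b - a) < tol:
            break
    return x_cur


root = hybrid_secant_bisection(lambda x: x**3 - 2 * x - 5, 2.0, 3.0)
print(root)
```

The bracket guarantees that the iterate never escapes the interval where a root is known to lie, while the secant steps supply the faster convergence once the iterates are close.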
Fixed-Point Iteration
For a nonlinear equation f(x) = 0, we seek a point where the curve f intersects the x-axis (y = 0). An alternative approach is to recast the problem as a fixed-point problem x = g(x) for a related nonlinear function g. For the fixed-point problem, we seek a point where the curve g intersects the diagonal line y = x. A value of x such that x = g(x) is a fixed point of g because x is unchanged when g is applied to it. Many iterative algorithms for solving a nonlinear equation f(x) = 0 are based on a fixed-point iterative method $x^{(n+1)} = g(x^{(n)})$ where g has fixed points that are solutions of f(x) = 0. An initial starting value $x^{(0)}$ is selected, and the iterative method is applied repeatedly until it converges sufficiently well.
EXAMPLE 2
Apply the fixed-point procedure, where g(x) = 1 + 2/x, starting with $x^{(0)} = 1$, to compute a zero of the nonlinear function $f(x) = x^2 - x - 2$. Graphically, trace the convergence process.
Solution
The fixed-point method is
$$x^{(n+1)} = 1 + \frac{2}{x^{(n)}}$$
Eight steps of the iterative algorithm are $x^{(0)} = 1$, $x^{(1)} = 3$, $x^{(2)} = 5/3$, $x^{(3)} = 11/5$, $x^{(4)} = 21/11$, $x^{(5)} = 43/21$, $x^{(6)} = 85/43$, $x^{(7)} = 171/85$, and $x^{(8)} = 341/171 \approx 1.99415$. In Figure 3.10, we see that these steps spiral into the fixed point 2.
[FIGURE 3.10: Fixed point iterations for $f(x) = x^2 - x - 2$, showing the curve y = 1 + 2/x and the line y = x.]
■
For a given nonlinear equation f(x) = 0, there may be many equivalent fixed-point problems x = g(x) with different functions g, some better than others. A simple way to characterize the behavior of an iterative method $x^{(n+1)} = g(x^{(n)})$ is the following: it is locally convergent for $x^*$ if $x^* = g(x^*)$ and $|g'(x^*)| < 1$. By locally convergent, we mean that there is an interval containing $x^*$ such that the fixed-point method converges for any starting value $x^{(0)}$ within that interval. If $|g'(x^*)| > 1$, then the fixed-point method diverges for any starting point $x^{(0)}$ other than $x^*$. Fixed-point iterative methods are used in standard practice for solving many science and engineering problems. In fact, the fixed-point theory can simplify the proof of the convergence of Newton's method.
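The iterates of Example 2 can be reproduced exactly with rational arithmetic. The Python sketch below uses the standard `fractions` module; the variable names are chosen here for illustration. It also evaluates $g'(x) = -2/x^2$ at the fixed point 2, where $|g'(2)| = 1/2 < 1$, consistent with the local convergence criterion.

```python
from fractions import Fraction


def g(x):
    return 1 + Fraction(2) / x


iterates = [Fraction(1)]
for _ in range(8):
    iterates.append(g(iterates[-1]))

for n, x in enumerate(iterates):
    print(n, x, float(x))

# Derivative of g at the fixed point x* = 2: g'(x) = -2 / x^2.
g_prime_at_2 = Fraction(-2, 2**2)
print("g'(2) =", g_prime_at_2)
```

The error roughly halves in magnitude and changes sign at each step, which matches $g'(2) = -1/2$ and explains the spiral pattern in Figure 3.10.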
Summary
(1) The secant method for finding a zero r of a function f(x) is written as
$$x_{n+1} = x_n - \left[\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\right] f(x_n)$$
for $n \ge 1$, which requires two initial values $x_0$ and $x_1$. After the first step, only one new function evaluation per step is needed.
(2) After n + 1 steps of the secant method, the error iterates $e_i = r - x_i$ obey the equation
$$e_{n+1} = -\frac{1}{2}\,\frac{f''(\xi_n)}{f'(\zeta_n)}\, e_n e_{n-1}$$
which leads to the approximation
$$|e_{n+1}| \approx C |e_n|^{(1+\sqrt{5})/2} \approx C |e_n|^{1.62}$$
Therefore, the secant method has superlinear convergence behavior.
Additional References
For supplemental reading and study, see Barnsley [2006], Bus and Dekker [1975], Dekker [1969], Dennis and Schnabel [1983], Epureanu and Greenside [1998], Fauvel, Flood, Shortland, and Wilson [1988], Feder [1988], Ford [1995], Householder [1970], Kelley [1995], Lozier and Olver [1994], Nerinckx and Haegemans [1976], Novak, Ritter, and Woźniakowski [1995], Ortega and Rheinboldt [1970], Ostrowski [1966], Rabinowitz [1970], Traub [1964], Westfall [1995], and Ypma [1995].
Problems 3.3
a1. Calculate an approximate value for $4^{3/4}$ using one step of the secant method with $x_0 = 3$ and $x_1 = 2$.
2. If we use the secant method on $f(x) = x^3 - 2x + 2$ starting with $x_0 = 0$ and $x_1 = 1$, what is $x_2$?
a3. If the secant method is used on $f(x) = x^5 + x^3 + 3$ and if $x_{n-2} = 0$ and $x_{n-1} = 1$, what is $x_n$?
a4. If $x_{n+1} = x_n + (2 - e^{x_n})(x_n - x_{n-1})/(e^{x_n} - e^{x_{n-1}})$ with $x_0 = 0$ and $x_1 = 1$, what is $\lim_{n\to\infty} x_n$?
5. Using the bisection method, Newton's method, and the secant method, find the largest positive root correct to three decimal places of $x^3 - 5x + 3 = 0$. (All roots are in [−3, +3].)
6. Prove that in the first analysis of the secant method, $\lambda_{n+1} - \alpha\lambda_n$ converges to zero as $n \to \infty$.
7. Establish Equation (10).
8. Write out the derivation of the order of convergence of the secant method that uses recurrence relations; that is, find the constants A and B in Equation (11), and fill in the details in arriving at Equation (12).
a9. What is the appropriate formula for finding square roots using the secant method? (Refer to Problem 3.2.1.)
10. The formula for the secant method can also be written as
$$x_{n+1} = \frac{x_{n-1} f(x_n) - x_n f(x_{n-1})}{f(x_n) - f(x_{n-1})}$$
Establish this, and explain why it is inferior to Equation (3) in a computer program.
11. Show that if the iterates in Newton’s method converge to a point r for which f ′(r) ≠ 0, then f (r ) = 0. Establish the same assertion for the secant method. Hint: In the latter, the Mean-Value Theorem of Differential Calculus is useful. This is the case n = 0 in Taylor’s Theorem.
a12. A method of finding a zero of a given function f proceeds as follows. Two initial approximations x0 and x1 to the zero are chosen, the value of x0 is fixed, and successive iterations are given by
$$x_{n+1} = x_n - \left[\frac{x_n - x_0}{f(x_n) - f(x_0)}\right] f(x_n)$$
This process will converge to a zero of f under certain conditions. Show that the rate
of convergence to a simple zero is linear under some conditions.
13. Test the following sequences for different types of convergence (i.e., linear, superlinear, or quadratic), where n = 1, 2, 3, ….
aa. $x_n = n^{-2}$
b. $x_n = 2^{-n}$
ac. $x_n = 2^{-2^n}$
d. $x_n = 2^{-a_n}$ with $a_0 = a_1 = 1$ and $a_{n+1} = a_n + a_{n-1}$ for $n \ge 2$
14. This problem and the next three deal with the method of functional iteration. The method of functional iteration is as follows: Starting with any x0, we define xn+1 = f(xn), where n = 0,1,2,…. Show that if f is continuous and if the sequence {xn}
converges, then its limit is a fixed point of f .
a15. (Continuation) Show that if f is a function defined on the whole real line whose derivative satisfies $|f'(x)| \le c$ with a constant c less than 1, then the method of functional iteration produces a fixed point of f. Hint: In establishing this, the Mean-Value Theorem from Section 1.2 is helpful.
a16. (Continuation) With a calculator, try the method of functional iteration with f(x) = x/2 + 1/x, taking x0 = 1. What is the limit of the resulting sequence?
a17. (Continuation) Using functional iteration, show that the equation $10 - 2x + \sin x = 0$ has a root. Locate the root approximately by drawing a graph. Starting with your approximate root, use functional iteration to obtain the root accurately by using a calculator. Hint: Write the equation in the form $x = 5 + \frac{1}{2}\sin x$.
18. Establish the first part of Equation (4) using Equation (5). Hint: Use the relationship between divided differences and derivatives from Section 4.2.
Computer Problems 3.3
a1. Use the secant method to find the zero near −0.5 of $f(x) = e^x - 3x^2$. This function also has a zero near 4. Find this positive zero by Newton's method.
2. Write
procedure Secant(f, x1, x2, epsi, delta, maxf, x, ierr)
which uses the secant method to solve f(x) = 0. The input parameters are as follows: f is the name of the given function; x1 and x2 are the initial estimates of the solution; epsi is a positive tolerance such that the iteration stops if the difference between two consecutive iterates is smaller than this value; delta is a positive tolerance such that the iteration stops if a function value is smaller in magnitude than this value; and maxf is a positive integer bounding the number of evaluations of the function allowed. The output parameters are as follows: x is the final estimate of the solution, and ierr is an integer error flag that indicates whether a tolerance test was violated. Test this routine using the function of Computer Problem 3.3.1. Print the final estimate of the solution and the value of the function at this point.
3. Find a zero of one of the functions given in the introduction of this chapter using one of the methods introduced in this chapter.
4. Write and test a recursive procedure for the secant method.
5. Rerun the example in this section with $x_0 = 0$ and $x_1 = 1$. Explain any unusual results.
6. Write a simple program to compare the secant method with Newton's method for finding a root of each function.
aa. $x^3 - 3x + 1$ with $x_0 = 2$
b. $x^3 - 2\sin x$ with $x_0 = \frac{1}{2}$
Use the $x_1$ value from Newton's method as the second starting point for the secant method. Print out each iteration for both methods.
a7. Write a simple program to find the root of $f(x) = x^3 + 2x^2 + 10x - 20$ using the secant method with starting values $x_0 = 2$ and $x_1 = 1$. Let it run at most 20 steps, and include a stopping test as well. Compare the number of steps needed here to the number needed in Newton's method. Is the convergence quadratic?
8. Test the secant method on the set of functions $f_k(x) = 2e^{-k}x + 1 - 3e^{-kx}$ for k = 1, 2, 3, …, 10. Use the starting points 0 and 1 in each case.
a9. An example by Wilkinson [1963] shows that minute alterations in the coefficients of a polynomial may have massive effects on the roots. Let
$$f(x) = (x - 1)(x - 2) \cdots (x - 20)$$
which has become known as the Wilkinson polynomial. The zeros of f are, of course, the integers 1, 2, . . . , 20. Try to determine what happens to the zero r = 20 when the function is altered to $f(x) - 10^{-8} x^{19}$. Hint: The secant method in double precision will locate a zero in the interval [20, 21].
10. Test the secant method on an example in which r, $f'(r)$, and $f''(r)$ are known in advance. Monitor the ratios $e_{n+1}/(e_n e_{n-1})$ to see whether they converge to $-\frac{1}{2} f''(r)/f'(r)$. The function f(x) = arctan x is suitable for this experiment.
11. Using a function of your choice, verify numerically that the iterative method
$$x_{n+1} = x_n - \frac{f(x_n)}{\sqrt{[f'(x_n)]^2 - f(x_n) f''(x_n)}}$$
is cubically convergent at a simple root but only linearly convergent at a multiple root.
12. Test numerically whether Olver's method, given by
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} - \frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)} \left[\frac{f(x_n)}{f'(x_n)}\right]^2$$
is cubically convergent to a root of f. Try to establish that it is.
13. (Continuation) Repeat for Halley's method
$$x_{n+1} = x_n - \frac{1}{a_n} \quad \text{with} \quad a_n = \frac{f'(x_n)}{f(x_n)} - \frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)}$$
14. (Moler-Morrison algorithm) Computing an approximation for $\sqrt{x^2 + y^2}$ does not require square roots. It can be done as follows:
real function f(x, y)
integer n; real a, b, c, x, y
f ← max{|x|, |y|}
a ← min{|x|, |y|}
for n = 1 to 3 do
    b ← (a/f)^2
    c ← b/(4 + b)
    f ← f + 2cf
    a ← ca
end for
end function f
Test the algorithm on some simple cases such as (x, y) = (3, 4), (−5, 12), and (7, −24). Then write a routine that uses the function f(x, y) for approximating the Euclidean norm of a vector $x = (x_1, x_2, \ldots, x_n)$; that is, the nonnegative number $\|x\| = \left(x_1^2 + x_2^2 + \cdots + x_n^2\right)^{1/2}$.
15. Study the following functions by starting with any initial value of x0 in the domain [0,2] and iterating xn+1 = F(xn). First use a calculator and then a computer. Explain the results.
a. Use the tent function
$$F(x) = \begin{cases} 2x & \text{if } 2x < 1 \\ 2x - 1 & \text{if } 2x \ge 1 \end{cases}$$
b. Repeat using the function
$$F(x) = 10x \pmod 1$$
Hint: Don’t be surprised by chaotic behavior. The interested reader can learn more about the dynamics of one-dimensional maps by reading papers such as the one by Bassien [1998].
16. Show how the secant method can be used to solve systems of equations such as those in Computer Problems 3.2.21–3.2.23.
17. (Student research project) Muller's method is an algorithm for computing solutions of an equation f(x) = 0. It is similar to the secant method in that it replaces f locally by a simple function, and finds a root of it. Naturally, this step is repeated. The simple function chosen in Muller's method is a quadratic polynomial, p, that interpolates f at the three most recent points. After p has been determined, its roots are computed, and one of them is chosen as the next point in the sequence. Since this quadratic function may have complex roots, the algorithm should be programmed with this in mind. Suppose that points $x_{n-2}$, $x_{n-1}$, and $x_n$ have been computed. Set
p(x) = a(x − xn)(x − xn−1) + b(x − xn) + c
where a, b, and c are determined so that p interpolates f at the three points mentioned previously. Then find the roots of p and take xn+1 to be the root of p closest to xn . At the beginning, three points must be furnished by the user. Program the method, allowing for complex numbers throughout. Test your program on the example
p(x) = x3 + x2 − 10x − 10
If the first three points are 1, 2, 3, then you should find that the polynomial is p(x) = 7(x −3)(x −2)+14(x −3)−4 and x4 = 3.17971086. Next, test your code on a polynomial having real coefficients but some complex roots.
18. Program and test the code for the secant algorithm after incorporating the stopping criterion described in the text.
19. Using mathematical software such as Matlab, Mathematica, and Maple, find the real zero of the polynomial p(x ) = x 5 + x 3 + 3. Attain more digits of accuracy than shown in the solution to Example 1 in the text.
20. (Continuation) Using mathematical software that allows for complex roots, find all zeros of the polynomial.
21. Program a hybrid method for solving several of the nonlinear problems given as examples in the text, and compare your results with those given.
22. Find the fixed points for each of the following functions:
a. $e^x + 1$ b. $e^{-x} - x$ c. $x^2 - 4\sin x$ d. $x^3 + 6x^2 + 11x - 6$ e. $\sin x$
23. For the nonlinear equation $f(x) = x^2 - x - 2 = 0$ with roots −1 and 2, write four fixed-point problems x = g(x) that are equivalent. Plot all of these, and show that they all intersect the line x = y. Also, plot the convergence steps of each of these fixed-point iterations for different starting values $x^{(0)}$. Show that the behavior of these fixed-point schemes can vary wildly: slow convergence, fast convergence, and divergence.
4
Interpolation and Numerical Differentiation
The viscosity of water has been experimentally determined at different temperatures, as indicated in the following table:
Temperature   0°      5°      10°     15°
Viscosity     1.792   1.519   1.308   1.140
From this table, how can we estimate a reasonable value for the viscosity at temperature 8°?
The method of polynomial interpolation, described in Section 4.1, can be used to create a polynomial of degree 3 that assumes the values in the table. This polynomial should provide acceptable intermediate values for temperatures not tabulated. The value of that polynomial at the point 8° turns out to be 1.386.
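The quoted value can be checked with a short Python sketch. It evaluates the cubic interpolating polynomial through the four tabulated points using the Lagrange form; this form and the function name `lagrange_interpolate` are choices made here for illustration, and any correct construction of the interpolating polynomial gives the same value.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the polynomial of degree < len(xs) through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)  # cardinal polynomial factor
        total += term
    return total


temperatures = [0.0, 5.0, 10.0, 15.0]
viscosities = [1.792, 1.519, 1.308, 1.140]

estimate = lagrange_interpolate(temperatures, viscosities, 8.0)
print(round(estimate, 3))
```

Rounded to three decimals, the estimate agrees with the value 1.386 stated above.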
4.1 Polynomial Interpolation
Preliminary Remarks
We pose three problems concerning the representation of functions to give an indication of the subject matter in this chapter, in Chapter 9 (on splines), and in Chapter 12 (on least squares).
First, suppose that we have a table of numerical values of a function:
x   x0   x1   ···   xn
y   y0   y1   ···   yn
Is it possible to find a simple and convenient formula that reproduces the given points exactly?
The second problem is similar, but it is assumed that the given table of numerical values is contaminated by errors, as might occur if the values came from a physical experiment. Now we ask for a formula that represents the data (approximately) and, if possible, filters out the errors.
As a third problem, a function f is given, perhaps in the form of a computer procedure, but it is an expensive function to evaluate. In this case, we ask for another function g that is simpler to evaluate and produces a reasonable approximation to f . Sometimes in this problem, we want g to approximate f with full machine precision.
In all of these problems, a simple function p can be obtained that represents or approximates the given table or function f . The representation p can always be taken to be a polynomial, although many other types of simple functions can also be used. Once a simple function p has been obtained, it can be used in place of f in many situations. For example, the integral of f could be estimated by the integral of p, and the latter should generally be easier to evaluate.
In many situations, a polynomial solution to the problems outlined above will be unsatisfactory from a practical point of view, and other classes of functions must be considered. In this book, one other class of versatile functions is discussed: the spline functions (see Chapter 9). The present chapter concerns polynomials exclusively, and Chapter 12 discusses general linear families of functions, of which splines and polynomials are important examples.
The obvious way in which a polynomial can fail as a practical solution to one of the preceding problems is that its degree may be unreasonably high. For instance, if the table considered contains 1,000 entries, a polynomial of degree 999 may be required to represent it. Polynomials also may have the surprising defect of being highly oscillatory. If the table is precisely represented by a polynomial p, then p(xi) = yi for 0 ≤ i ≤ n. For points other than the given xi, however, p(x) may be a very poor representation of the function from which the table arose. The example in Section 4.2 involving the Runge function illustrates this phenomenon.
Polynomial Interpolation
We begin again with a table of values:
x x0 x1 ··· xn
y y0 y1 ··· yn
and assume that the xi's form a set of n + 1 distinct points. The table represents n + 1 points in the Cartesian plane, and we want to find a polynomial curve that passes through all points. Thus, we seek to determine a polynomial that is defined for all x, and takes on the corresponding values of yi for each of the n + 1 distinct xi's in this table. A polynomial p for which p(xi) = yi when 0 ≤ i ≤ n is said to interpolate the table. The points xi are called nodes.
Consider the first and simplest case, n = 0. Here, a constant function solves the problem. In other words, the polynomial p of degree 0 defined by the equation p(x) = y0 reproduces the one-node table.
The next simplest case occurs when n = 1. Since a straight line can be passed through two points, a linear function is capable of solving the problem. Explicitly, the polynomial p defined by
$$p(x) = \left(\frac{x - x_1}{x_0 - x_1}\right) y_0 + \left(\frac{x - x_0}{x_1 - x_0}\right) y_1 = y_0 + \left(\frac{y_1 - y_0}{x_1 - x_0}\right)(x - x_0)$$
is of first degree (at most) and reproduces the table. That means (in this case) that p(x0) = y0 and p(x1) = y1, as is easily verified. This p is used for linear interpolation.
EXAMPLE 1
Find the polynomial of least degree that interpolates this table:
x   1.4   1.25
y   3.7   3.9
Solution
By the equation above, the polynomial that is sought is
$$p(x) = \left(\frac{x - 1.25}{1.4 - 1.25}\right) 3.7 + \left(\frac{x - 1.4}{1.25 - 1.4}\right) 3.9 = 3.7 + \left(\frac{3.9 - 3.7}{1.25 - 1.4}\right)(x - 1.4) = 3.7 - \frac{4}{3}(x - 1.4)$$
■
As we can see, an interpolating polynomial can be written in a variety of forms; among these are those known as the Newton form and the Lagrange form. The Newton form is probably the most convenient and efficient; however, conceptually, the Lagrange form has several advantages. We begin with the Lagrange form, since it may be easier to understand.
Interpolating Polynomial: Lagrange Form
Suppose that we wish to interpolate arbitrary functions at a set of fixed nodes x0, x1, ..., xn. We first define a system of n + 1 special polynomials of degree n known as cardinal polynomials in interpolation theory. These are denoted by l0, l1, ..., ln and have the property
$$l_i(x_j) = \delta_{ij} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$
Once these are available, we can interpolate any function f by the Lagrange form of the interpolation polynomial:
$$p_n(x) = \sum_{i=0}^{n} l_i(x) f(x_i) \qquad (1)$$
This function pn, being a linear combination of the polynomials li, is itself a polynomial of degree at most n. Furthermore, when we evaluate pn at xj, we get f(xj):
$$p_n(x_j) = \sum_{i=0}^{n} l_i(x_j) f(x_i) = l_j(x_j) f(x_j) = f(x_j)$$
Thus, pn is the interpolating polynomial for the function f at nodes x0, x1, ..., xn. It remains now only to write the formula for the cardinal polynomial li, which is
$$l_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \left(\frac{x - x_j}{x_i - x_j}\right) \qquad (0 \le i \le n) \qquad (2)$$
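This formula indicates that li(x) is the product of n linear factors:
$$l_i(x) = \left(\frac{x - x_0}{x_i - x_0}\right)\left(\frac{x - x_1}{x_i - x_1}\right) \cdots \left(\frac{x - x_{i-1}}{x_i - x_{i-1}}\right)\left(\frac{x - x_{i+1}}{x_i - x_{i+1}}\right) \cdots \left(\frac{x - x_n}{x_i - x_n}\right)$$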
(The denominators are just numbers; the variable x occurs only in the numerators.) Thus, li is a polynomial of degree n. Notice that when li (x) is evaluated at x = xi , each factor in the preceding equation becomes 1. Hence, li (xi ) = 1. But when li (x ) is evaluated at any other node, say, xj, one of the factors in the above equation will be 0, and li(xj) = 0, for i ≠ j .
Figure 4.1 shows the first few Lagrange cardinal polynomials: l0(x), l1(x), l2(x), l3(x), l4(x), and l5(x).
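FIGURE 4.1
First few Lagrange cardinal polynomials
EXAMPLE 2
Write out the cardinal polynomials appropriate to the problem of interpolating the following table, and give the Lagrange form of the interpolating polynomial:
x      1/3   1/4   1
f(x)   2     −1    7
Solution
Using Equation (2), we have
$$l_0(x) = \frac{\left(x - \frac{1}{4}\right)(x - 1)}{\left(\frac{1}{3} - \frac{1}{4}\right)\left(\frac{1}{3} - 1\right)} = -18\left(x - \frac{1}{4}\right)(x - 1)$$
$$l_1(x) = \frac{\left(x - \frac{1}{3}\right)(x - 1)}{\left(\frac{1}{4} - \frac{1}{3}\right)\left(\frac{1}{4} - 1\right)} = 16\left(x - \frac{1}{3}\right)(x - 1)$$
$$l_2(x) = \frac{\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)}{\left(1 - \frac{1}{3}\right)\left(1 - \frac{1}{4}\right)} = 2\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)$$
Therefore, the interpolating polynomial in Lagrange's form is
$$p_2(x) = -36\left(x - \frac{1}{4}\right)(x - 1) - 16\left(x - \frac{1}{3}\right)(x - 1) + 14\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)$$
■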
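The cardinal polynomials of Equation (2) can be checked numerically. The following is a minimal sketch, assuming exact rational arithmetic via Python's `fractions` module; the function names `cardinal` and `lagrange` are illustrative and do not appear in the text. It uses the table of Example 2 to confirm the property li(xj) = δij and the interpolation conditions.

```python
from fractions import Fraction

def cardinal(nodes, i, x):
    """Value at x of the i-th cardinal polynomial l_i, following Equation (2)."""
    value = Fraction(1)
    for j, xj in enumerate(nodes):
        if j != i:
            value *= Fraction(x - xj) / Fraction(nodes[i] - xj)
    return value

def lagrange(nodes, values, x):
    """Lagrange form (1): sum of l_i(x) * f(x_i)."""
    return sum(cardinal(nodes, i, x) * values[i] for i in range(len(nodes)))

# Table of Example 2.
nodes = [Fraction(1, 3), Fraction(1, 4), Fraction(1)]
values = [2, -1, 7]

# l_i(x_j) should be 1 when i == j and 0 otherwise.
delta = [[cardinal(nodes, i, xj) for xj in nodes] for i in range(3)]
print(delta == [[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # True
print(lagrange(nodes, values, 0))                  # -79/6, the constant term of p2
```

Exact fractions are used here only so that the Kronecker-delta check is free of roundoff; ordinary floating-point numbers would work in the same way.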
Existence of Interpolating Polynomial
The Lagrange interpolation formula proves the existence of an interpolating polynomial for any table of values. There is another constructive way of proving this fact, and it leads to a different formula.
Suppose that we have succeeded in finding a polynomial p that reproduces part of the table. Assume, say, that p(xi) = yi for 0 ≤ i ≤ k. We shall attempt to add to p another term that will enable the new polynomial to reproduce one more entry in the table. We consider
p(x) + c(x − x0)(x − x1)···(x − xk)
where c is a constant to be determined. This is surely a polynomial. It also reproduces the first k + 1 points in the table because p itself does so, and the added portion takes the value 0 at each of the points x0, x1, ..., xk. (Its form is chosen for precisely this reason.) Now we adjust the parameter c so that the new polynomial takes the value yk+1 at xk+1. Imposing this condition, we obtain
p(xk+1) + c(xk+1 − x0)(xk+1 − x1)···(xk+1 − xk) = yk+1
The proper value of c can be obtained from this equation because none of the factors xk+1 − xi, for 0 ≤ i ≤ k, can be zero. Remember our original assumption that the xi's are all distinct.
This analysis is an example of inductive reasoning. We have shown that the process can be started and that it can be continued. Hence, the following formal statement has been partially justified:
■ THEOREM 1
THEOREM ON EXISTENCE OF POLYNOMIAL INTERPOLATION
If points x0, x1, ..., xn are distinct, then for arbitrary real values y0, y1, ..., yn, there is a unique polynomial p of degree at most n such that p(xi) = yi for 0 ≤ i ≤ n.
Two parts of this formal statement must still be established. First, the degree of the polynomial increases by at most 1 in each step of the inductive argument. At the beginning, the degree was at most 0, so at the end, the degree is at most n.
Second, we establish the uniqueness of the polynomial p. Suppose that another polynomial q claims to accomplish what p does; that is, q is also of degree at most n and satisfies q(xi) = yi for 0 ≤ i ≤ n. Then the polynomial p − q is of degree at most n and takes the value 0 at x0, x1, ..., xn. Recall, however, that a nonzero polynomial of degree n can have at most n roots. Since p − q has n + 1 distinct roots, it must be the zero polynomial. We conclude that p = q, which establishes the uniqueness of p.
Interpolating Polynomial: Newton Form
In Example 2, we found the Lagrange form of the interpolating polynomial:
$$p_2(x) = -36\left(x - \frac{1}{4}\right)(x - 1) - 16\left(x - \frac{1}{3}\right)(x - 1) + 14\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)$$
It can be simplified to
$$p_2(x) = -\frac{79}{6} + \frac{349}{6}x - 38x^2$$
We will now learn that this polynomial can be written in another form called the nested Newton form:
$$p_2(x) = 2 + \left(x - \frac{1}{3}\right)\left[36 + \left(x - \frac{1}{4}\right)(-38)\right]$$
It involves the fewest arithmetic operations and is recommended for evaluating p2(x). It cannot be overemphasized that the Newton and Lagrange forms are just two different derivations for precisely the same polynomial. The Newton form has the advantage of easy extensibility to accommodate additional data points.
The preceding discussion provides a method for constructing an interpolating polynomial. The method is known as the Newton algorithm, and the resulting polynomial is the Newton form of the interpolating polynomial.
EXAMPLE 3
Using the Newton algorithm, find the interpolating polynomial of least degree for this table:
x   0    1    −1    2    −2
y   −5   −3   −15   39   −9
Solution
In the construction, five successive polynomials will appear; these are labeled p0, p1, p2, p3, and p4. The polynomial p0 is defined to be
p0(x) = −5
The polynomial p1 has the form
p1(x) = p0(x) + c(x − x0) = −5 + c(x − 0)
The interpolation condition placed on p1 is that p1(1) = −3. Therefore, we have −5 + c(1 − 0) = −3. Hence, c = 2, and p1 is
p1(x) = −5 + 2x
The polynomial p2 has the form
p2(x)= p1(x)+c(x−x0)(x−x1)=−5+2x+cx(x−1)
The interpolation condition placed on p2 is that p2(−1) = −15. Hence, we have −5 + 2(−1) + c(−1)(−1 − 1) = −15. This yields c = −4, so
p2(x) = −5 + 2x − 4x(x − 1)
The remaining steps are similar, and the final result is the Newton form of the interpolating polynomial:
p4(x) = −5 + 2x − 4x(x − 1) + 8x(x − 1)(x + 1) + 3x(x − 1)(x + 1)(x − 2) ■
Later, we will develop a better algorithm for constructing the Newton interpolating polynomial. Nevertheless, the method just explained is a systematic one and involves very little computation. An important feature to notice is that each new polynomial in the algorithm is obtained from its predecessor by adding a new term. Thus, at the end, the final polynomial exhibits all the previous polynomials as constituents.
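The step-by-step construction used in Example 3 can be sketched as follows. This is an illustrative implementation under the assumption that the nodes are distinct; the function name `newton_algorithm` is hypothetical. At each step it evaluates the current polynomial at the next node and solves for the new constant c, exactly as in the existence argument.

```python
def newton_algorithm(xs, ys):
    """Build Newton-form coefficients one node at a time.

    Step k solves p(x_k) + c * (x_k - x_0)...(x_k - x_{k-1}) = y_k for c.
    """
    coeffs = [ys[0]]
    for k in range(1, len(xs)):
        p = 0      # value of the current polynomial at xs[k]
        prod = 1   # running product (x_k - x_0)...(x_k - x_{i-1})
        for i, c in enumerate(coeffs):
            p += c * prod
            prod *= xs[k] - xs[i]
        # prod now holds the full product over all earlier nodes.
        coeffs.append((ys[k] - p) / prod)
    return coeffs

# Table of Example 3.
xs = [0, 1, -1, 2, -2]
ys = [-5, -3, -15, 39, -9]
coeffs = newton_algorithm(xs, ys)
print(coeffs)  # [-5, 2.0, -4.0, 8.0, 3.0]
```

The printed coefficients match the terms −5, 2, −4, 8, and 3 in the final polynomial p4 of Example 3.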
Nested Form
Before continuing, let us rewrite the Newton form of the interpolating polynomial for efficient evaluation.
EXAMPLE 4
Write the polynomial p4 of Example 3 in nested form and use it to evaluate p4(3).
Solution
We write p4 as
p4(x) = −5 + x(2 + (x − 1)(−4 + (x + 1)(8 + (x − 2)3)))
Therefore,
p4(3) = −5 + 3(2 + 2(−4 + 4(8 + 3))) = 241
Another solution, also in nested form, is
p4(x) = −5 + x(4 + x(−7 + x(2 + 3x)))
from which we obtain
p4(3) = −5 + 3(4 + 3(−7 + 3(2 + 3 · 3))) = 241
This form is obtained by expanding and systematic factoring of the original polynomial. It is also known as a nested form and its evaluation is by nested multiplication. ■
To describe nested multiplication in a formal way (so that it can be translated into code), consider a general polynomial in the Newton form. It might be
$$p(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots + a_n(x - x_0)(x - x_1)\cdots(x - x_{n-1})$$
The nested form of p(x) is
$$\begin{aligned} p(x) &= a_0 + (x - x_0)\bigl(a_1 + (x - x_1)\bigl(a_2 + \cdots + (x - x_{n-1})a_n\bigr)\cdots\bigr) \\ &= \bigl(\cdots\bigl(\bigl(a_n(x - x_{n-1}) + a_{n-1}\bigr)(x - x_{n-2}) + a_{n-2}\bigr)\cdots\bigr)(x - x_0) + a_0 \end{aligned}$$
The Newton interpolation polynomial can be written succinctly as
$$p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j) \qquad (3)$$
Here $\prod_{j=0}^{-1} (x - x_j)$ is interpreted to be 1. Also, we can write it as
$$p_n(x) = \sum_{i=0}^{n} a_i \pi_i(x)$$
where
$$\pi_i(x) = \prod_{j=0}^{i-1} (x - x_j) \qquad (4)$$
Figure 4.2 shows the first few Newton polynomials: π0(x), π1(x), π2(x), π3(x), π4(x), and π5(x).
FIGURE 4.2
First few Newton polynomials
In evaluating p(t) for a given numerical value of t, we naturally start with the innermost parentheses, forming successively the following quantities:
$$\begin{aligned} v_0 &= a_n \\ v_1 &= v_0(t - x_{n-1}) + a_{n-1} \\ v_2 &= v_1(t - x_{n-2}) + a_{n-2} \\ &\;\;\vdots \\ v_n &= v_{n-1}(t - x_0) + a_0 \end{aligned}$$
The quantity vn is now p(t). In the following pseudocode, a subscripted variable is not needed for vi. Instead, we can write
    integer i, n; real t, v; real array (ai)0:n, (xi)0:n
    v ← an
    for i = n − 1 to 0 step −1 do
        v ← v(t − xi) + ai
    end for
Here, the array (ai)0:n contains the n + 1 coefficients of the Newton form of the interpolating polynomial (3) of degree at most n, and the array (xi)0:n contains the n + 1 nodes xi.
Calculating Coefficients ai Using Divided Differences
We turn now to the problem of determining the coefficients a0, a1, ..., an efficiently. Again we start with a table of values of a function f:
x      x0      x1      x2      ···   xn
f(x)   f(x0)   f(x1)   f(x2)   ···   f(xn)
The points x0, x1, ..., xn are assumed to be distinct, but no assumption is made about their positions on the real line.
Previously, we established that for each n = 0, 1, . . . , there exists a unique polynomial pn such that
• The degree of pn is at most n.
• pn(xi) = f(xi) for i = 0, 1, ..., n.
It was shown that pn can be expressed in the Newton form
$$p_n(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots + a_n(x - x_0)\cdots(x - x_{n-1})$$
A crucial observation about pn is that the coefficients a0, a1, . . . do not depend on n. In other words, pn is obtained from pn−1 by adding one more term, without altering the coefficients already present in pn−1 itself. This is because we began with the hope that pn could be expressed in the form
pn(x)= pn−1(x)+an(x−x0)···(x−xn−1)
and discovered that it was indeed possible.
A way of systematically determining the unknown coefficients a0, a1, ..., an is to set x equal in turn to x0, x1, ..., xn in the Newton form (3) and to write down the resulting equations:
$$\begin{cases} f(x_0) = a_0 \\ f(x_1) = a_0 + a_1(x_1 - x_0) \\ f(x_2) = a_0 + a_1(x_2 - x_0) + a_2(x_2 - x_0)(x_2 - x_1) \\ \text{etc.} \end{cases} \qquad (5)$$
The compact form of Equations (5) is
$$f(x_k) = \sum_{i=0}^{k} a_i \prod_{j=0}^{i-1} (x_k - x_j) \qquad (0 \le k \le n) \qquad (6)$$
Equations (5) can be solved for the ai's in turn, starting with a0. Then we see that a0 depends on f(x0), that a1 depends on f(x0) and f(x1), and so on. In general, ak depends on f(x0), f(x1), ..., f(xk). In other words, ak depends on the values of f at the nodes x0, x1, ..., xk.
The traditional notation is
$$a_k = f[x_0, x_1, \ldots, x_k] \qquad (7)$$
This equation defines f[x0, x1, ..., xk]. The quantity f[x0, x1, ..., xk] is called the divided difference of order k for f. Notice also that the coefficients a0, a1, ..., ak are uniquely determined by System (6). Indeed, there is no possible choice for a0 other than a0 = f(x0). Similarly, there is now no choice for a1 other than [f(x1) − a0]/(x1 − x0) and so on. Using Equations (5), we see that the first few divided differences can be written as
$$a_0 = f(x_0)$$
$$a_1 = \frac{f(x_1) - a_0}{x_1 - x_0} = \frac{f(x_1) - f(x_0)}{x_1 - x_0}$$
$$a_2 = \frac{f(x_2) - a_0 - a_1(x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)} = \frac{\dfrac{f(x_2) - f(x_1)}{x_2 - x_1} - \dfrac{f(x_1) - f(x_0)}{x_1 - x_0}}{x_2 - x_0}$$
EXAMPLE 5
For the table
x      1   −4   0
f(x)   3   13   −23
determine the quantities f[x0], f[x0, x1], and f[x0, x1, x2].
Solution
We write out the system of Equations (5) for this concrete case:
$$\begin{cases} 3 = a_0 \\ 13 = a_0 + a_1(-5) \\ -23 = a_0 + a_1(-1) + a_2(-1)(4) \end{cases}$$
The solution is a0 = 3, a1 = −2, and a2 = 7. Hence, for this function, f[1] = 3, f[1, −4] = −2, and f[1, −4, 0] = 7. ■
With this new notation, the Newton form of the interpolating polynomial takes the form
$$p_n(x) = \sum_{i=0}^{n} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j) \qquad (8)$$
with the usual convention that $\prod_{j=0}^{-1} (x - x_j) = 1$. Notice that the coefficient of $x^n$ in $p_n$ is f[x0, x1, ..., xn] because the term $x^n$ occurs only in $\prod_{j=0}^{n-1} (x - x_j)$. It follows that if f is a polynomial of degree at most n − 1, then f[x0, x1, ..., xn] = 0.
We return to the question of how to compute the required divided differences f[x0, x1, ..., xk]. From System (5) or (6), it is evident that this computation can be performed recursively. We simply solve Equation (6) for ak as follows:
$$f(x_k) = a_k \prod_{j=0}^{k-1} (x_k - x_j) + \sum_{i=0}^{k-1} a_i \prod_{j=0}^{i-1} (x_k - x_j)$$
and
$$a_k = \frac{f(x_k) - \displaystyle\sum_{i=0}^{k-1} a_i \prod_{j=0}^{i-1} (x_k - x_j)}{\displaystyle\prod_{j=0}^{k-1} (x_k - x_j)}$$
Using Equation (7), we have
$$f[x_0, x_1, \ldots, x_k] = \frac{f(x_k) - \displaystyle\sum_{i=0}^{k-1} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x_k - x_j)}{\displaystyle\prod_{j=0}^{k-1} (x_k - x_j)} \qquad (9)$$
■ ALGORITHM 1
An Algorithm for Computing the Divided Differences of f
• Set f[x0] = f(x0).
• For k = 1, 2, ..., n, compute f[x0, x1, ..., xk] by Equation (9).     (10)
EXAMPLE 6
Using Algorithm (10), write out the formulas for f[x0], f[x0, x1], f[x0, x1, x2], and f[x0, x1, x2, x3].
Solution
$$f[x_0] = f(x_0)$$
$$f[x_0, x_1] = \frac{f(x_1) - f[x_0]}{x_1 - x_0}$$
$$f[x_0, x_1, x_2] = \frac{f(x_2) - f[x_0] - f[x_0, x_1](x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)}$$
$$f[x_0, x_1, x_2, x_3] = \frac{f(x_3) - f[x_0] - f[x_0, x_1](x_3 - x_0) - f[x_0, x_1, x_2](x_3 - x_0)(x_3 - x_1)}{(x_3 - x_0)(x_3 - x_1)(x_3 - x_2)}$$
■
Algorithm (10) is easily programmed and is capable of computing the divided differences f[x0], f[x0, x1], ..., f[x0, x1, ..., xn] at the cost of ½n(3n + 1) additions, (n − 1)(n − 2) multiplications, and n divisions, excluding arithmetic operations on the indices. A more refined method will now be presented for which the pseudocode requires only three statements (!) and costs only ½n(n + 1) divisions and n(n + 1) additions.
At the heart of the new method is the following remarkable theorem:
■ THEOREM 2
RECURSIVE PROPERTY OF DIVIDED DIFFERENCES
The divided differences obey the formula
$$f[x_0, x_1, \ldots, x_k] = \frac{f[x_1, x_2, \ldots, x_k] - f[x_0, x_1, \ldots, x_{k-1}]}{x_k - x_0} \qquad (11)$$
Proof
Since f[x0, x1, ..., xk] was defined to be equal to the coefficient ak in the Newton form of the interpolating polynomial pk of Equation (3), we can say that f[x0, x1, ..., xk] is the coefficient of $x^k$ in the polynomial pk of degree at most k, which interpolates f at x0, x1, ..., xk. Similarly, f[x1, x2, ..., xk] is the coefficient of $x^{k-1}$ in the polynomial q of degree at most k − 1, which interpolates f at x1, x2, ..., xk. Likewise, f[x0, x1, ..., xk−1] is the coefficient of $x^{k-1}$ in the polynomial pk−1 of degree at most k − 1, which interpolates f at x0, x1, ..., xk−1. The three polynomials pk, q, and pk−1 are intimately related. In fact,
$$p_k(x) = q(x) + \frac{x - x_k}{x_k - x_0}\,[q(x) - p_{k-1}(x)] \qquad (12)$$
To establish Equation (12), observe that the right side is a polynomial of degree at most k. Evaluating it at xi, for 1 ≤ i ≤ k − 1, results in f(xi):
$$q(x_i) + \frac{x_i - x_k}{x_k - x_0}\,[q(x_i) - p_{k-1}(x_i)] = f(x_i) + \frac{x_i - x_k}{x_k - x_0}\,[f(x_i) - f(x_i)] = f(x_i)$$
Similarly, evaluating it at x0 and xk gives f(x0) and f(xk), respectively. By the uniqueness of interpolating polynomials, the right side of Equation (12) must be pk(x), and Equation (12) is established.
Completing the argument to justify Equation (11), we take the coefficient of $x^k$ on both sides of Equation (12). The result is Equation (11). Indeed, we see that f[x1, x2, ..., xk] is the coefficient of $x^{k-1}$ in q, and f[x0, x1, ..., xk−1] is the coefficient of $x^{k-1}$ in pk−1. ■
Notice that f[x0, x1, ..., xk] is not changed if the nodes x0, x1, ..., xk are permuted: thus, for example, f[x0, x1, x2] = f[x1, x2, x0]. The reason is that f[x0, x1, x2] is the coefficient of $x^2$ in the quadratic polynomial interpolating f at x0, x1, x2, whereas f[x1, x2, x0] is the coefficient of $x^2$ in the quadratic polynomial interpolating f at x1, x2, x0. These two polynomials are, of course, the same. A formal statement in mathematical language is as follows:
■ THEOREM 3
INVARIANCE THEOREM
The divided difference f[x0, x1, ..., xk] is invariant under all permutations of the arguments x0, x1, ..., xk.
Since the variables x0, x1, ..., xk and k are arbitrary, the recursive Formula (11) can also be written as
$$f[x_i, x_{i+1}, \ldots, x_{j-1}, x_j] = \frac{f[x_{i+1}, x_{i+2}, \ldots, x_j] - f[x_i, x_{i+1}, \ldots, x_{j-1}]}{x_j - x_i} \qquad (13)$$
The first three divided differences are thus
$$f[x_i] = f(x_i)$$
$$f[x_i, x_{i+1}] = \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i}$$
$$f[x_i, x_{i+1}, x_{i+2}] = \frac{f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]}{x_{i+2} - x_i}$$
Using Formula (13), we can construct a divided-difference table for a function f. It is customary to arrange it as follows (here n = 3):
x     f[ ]      f[ , ]       f[ , , ]        f[ , , , ]
x0    f[x0]
                f[x0,x1]
x1    f[x1]                  f[x0,x1,x2]
                f[x1,x2]                     f[x0,x1,x2,x3]
x2    f[x2]                  f[x1,x2,x3]
                f[x2,x3]
x3    f[x3]
In the table, the coefficients along the top diagonal are the ones needed to form the Newton form of the interpolating polynomial (3).
EXAMPLE 7
Construct a divided-difference diagram for the function f given in the following table, and write out the Newton form of the interpolating polynomial.
x      1   3/2    0   2
f(x)   3   13/4   3   5/3
Solution
The first entry is f[x0, x1] = (13/4 − 3)/(3/2 − 1) = 1/2. After completion of column 3, the first entry in column 4 is
$$f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{\frac{1}{6} - \frac{1}{2}}{0 - 1} = \frac{1}{3}$$
The complete diagram is
x      f[ ]    f[ , ]   f[ , , ]   f[ , , , ]
1      3
               1/2
3/2    13/4             1/3
               1/6                 −2
0      3                −5/3
               −2/3
2      5/3
Thus, we obtain
$$p_3(x) = 3 + \frac{1}{2}(x - 1) + \frac{1}{3}(x - 1)\left(x - \frac{3}{2}\right) - 2(x - 1)\left(x - \frac{3}{2}\right)x$$
■
Algorithms and Pseudocode
Turning next to algorithms, we suppose that a table for f is given at points x0, x1, ..., xn and that all the divided differences aij ≡ f[xi, xi+1, ..., xi+j] are to be computed. The following pseudocode accomplishes this:
    integer i, j, n; real array (aij)0:n×0:n, (xi)0:n
    for i = 0 to n do
        ai0 ← f(xi)
    end for
    for j = 1 to n do
        for i = 0 to n − j do
            aij ← (ai+1,j−1 − ai,j−1)/(xi+j − xi)
        end for
    end for
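The two-dimensional pseudocode translates almost line for line into Python. The sketch below is one possible rendering, assuming the function values are supplied as a list rather than computed from f; the name `divided_difference_table` is illustrative. Entry `a[i][j]` holds the divided difference of order j that starts at node xi, and exact fractions are used so that the output can be compared with the diagram of Example 7.

```python
from fractions import Fraction

def divided_difference_table(xs, fs):
    """Return a with a[i][j] = f[x_i, ..., x_{i+j}] for 0 <= i <= n - j."""
    n = len(xs) - 1
    a = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        a[i][0] = fs[i]
    for j in range(1, n + 1):
        for i in range(n - j + 1):
            a[i][j] = (a[i + 1][j - 1] - a[i][j - 1]) / (xs[i + j] - xs[i])
    return a

# Table of Example 7.
xs = [Fraction(1), Fraction(3, 2), Fraction(0), Fraction(2)]
fs = [Fraction(3), Fraction(13, 4), Fraction(3), Fraction(5, 3)]
table = divided_difference_table(xs, fs)

# Row 0 holds the Newton coefficients f[x0], f[x0,x1], f[x0,x1,x2], f[x0,...,x3].
print([str(c) for c in table[0]])  # ['3', '1/2', '1/3', '-2']
```

Entries below the anti-diagonal are never assigned and remain `None`, mirroring the unused part of the array in the pseudocode.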
Observe that the coefficients of the interpolating polynomial (3) are stored in the first row of the array (ai j )0:n×0:n .
If the divided differences are being computed for use only in constructing the Newton form of the interpolation polynomial
$$p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j)$$
where ai = f[x0, x1, ..., xi], there is no need to store all of them. Only f[x0], f[x0, x1], ..., f[x0, x1, ..., xn] need to be stored.
When a one-dimensional array (ai)0:n is used, the divided differences can be overwritten each time from the last storage location backward so that, finally, only the desired coefficients remain. In this case, the amount of computing is the same as in the preceding case, but the storage requirements are less. (Why?) Here is a pseudocode to do this:
    integer i, j, n; real array (ai)0:n, (xi)0:n
    for i = 0 to n do
        ai ← f(xi)
    end for
    for j = 1 to n do
        for i = n to j step −1 do
            ai ← (ai − ai−1)/(xi − xi−j)
        end for
    end for
This algorithm is more intricate, and the reader is invited to verify it—say, in the case n = 3. For the numerical experiments suggested in the computer problems, the following two procedures should be satisfactory. The first is called Coef. It requires as input the number n and tabular values in the arrays (xi) and (yi). Remember that the number of points in the table is n + 1. The procedure then computes the coefficients required in the Newton
interpolating polynomial, storing them in the array (ai ).
    procedure Coef(n, (xi), (yi), (ai))
    integer i, j, n; real array (xi)0:n, (yi)0:n, (ai)0:n
    for i = 0 to n do
        ai ← yi
    end for
    for j = 1 to n do
        for i = n to j step −1 do
            ai ← (ai − ai−1)/(xi − xi−j)
        end for
    end for
    end procedure Coef
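A possible Python counterpart of procedure Coef is sketched below. It assumes the table is passed as two lists of equal length and returns a new list instead of filling an output argument; the lowercase name `coef` is an illustrative choice. The inner loop runs backward so that each entry is overwritten only after it is no longer needed.

```python
def coef(xs, ys):
    """Newton coefficients a_i = f[x_0, ..., x_i], computed in a single array."""
    a = list(ys)              # a[i] starts as f[x_i]
    n = len(xs) - 1
    for j in range(1, n + 1):
        # Work from the last entry backward: a[i-1] must still hold the
        # order-(j-1) difference when a[i] is updated.
        for i in range(n, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

# Table of Example 5: the result should be f[1], f[1,-4], f[1,-4,0].
print(coef([1, -4, 0], [3, 13, -23]))  # [3, -2.0, 7.0]
```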
The second is function Eval. It requires as input the array (xi ) from the original table and the array (ai ), which is output from Coef. The array (ai ) contains the coefficients for the Newton form of the interpolation polynomial. Finally, as input, a single real value for t is given. The function then returns the value of the interpolating polynomial at t.
    real function Eval(n, (xi), (ai), t)
    integer i, n; real t, temp; real array (xi)0:n, (ai)0:n
    temp ← an
    for i = n − 1 to 0 step −1 do
        temp ← (temp)(t − xi) + ai
    end for
    Eval ← temp
    end function Eval
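Function Eval is nested multiplication applied to the Newton form. A minimal Python sketch, with the illustrative name `eval_newton`, is shown here using the coefficients found in Example 3; it reproduces the value p4(3) = 241 obtained in Example 4.

```python
def eval_newton(xs, a, t):
    """Evaluate the Newton form with nodes xs and coefficients a at the point t."""
    temp = a[-1]
    for i in range(len(a) - 2, -1, -1):
        temp = temp * (t - xs[i]) + a[i]
    return temp

xs = [0, 1, -1, 2, -2]       # nodes of Example 3
a = [-5, 2, -4, 8, 3]        # Newton coefficients from Example 3
print(eval_newton(xs, a, 3))  # 241
```

Only the first n nodes are used; the last node xn never appears in the Newton form, although it influenced the coefficients.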
Since the coefficients of the interpolating polynomial need be computed only once, we call Coef first, and then all subsequent calls for evaluating this polynomial are accomplished with Eval. Notice that only the t argument should be changed between successive calls to function Eval.
EXAMPLE 8
Write pseudocode for the Newton form of the interpolating polynomial p for sin x at ten equidistant points in the interval [0, 1.6875]. The code finds the maximum value of |sin x − p(x)| over a finer set of equally spaced points in the same interval.
Solution
If we take ten points, including the ends of the interval, then we create nine subintervals, each of length h = 0.1875. The points are then xi = ih for i = 0, 1, ..., 9. After obtaining the polynomial, we divide each subinterval into four panels, and we evaluate |sin x − p(x)| at 37 points (called t in the pseudocode). These are tj = jh/4 for j = 0, 1, ..., 36. Here is a suitable main program in pseudocode that calls the procedures Coef and Eval previously given:
    program Test_Coef_Eval
    integer j, k, n, jmax; real e, h, p, t, emax, pmax, tmax
    real array (xi)0:n, (yi)0:n, (ai)0:n
    n ← 9
    h ← 1.6875/n
    for k = 0 to n do
        xk ← kh
        yk ← sin(xk)
    end for
    call Coef(n, (xi), (yi), (ai))
    output (ai); emax ← 0
    for j = 0 to 4n do
        t ← jh/4
        p ← Eval(n, (xi), (ai), t)
        e ← |sin(t) − p|
        output j, t, p, e
        if e > emax then
            jmax ← j; tmax ← t; pmax ← p; emax ← e
        end if
    end for
    output jmax, tmax, pmax, emax
    end program Test_Coef_Eval
The first coefficient in the Newton form of the interpolating polynomial is 0 (why?), and the others range in magnitude from approximately 0.99 to 0.18 × 10^−5. The deviation between sin x and p(x) is practically zero at each interpolation node. (Because of roundoff errors, they are not precisely zero.) From the computer output, the largest error is at jmax = 35, where sin(1.640625) ≈ 0.9975631 with an error of 1.19 × 10^−7. ■
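The experiment of Example 8 can be repeated in Python along the following lines. This is a sketch, not the book's program: the helper names `coef` and `eval_newton` are illustrative and are repeated here so that the block runs on its own. Python floats are double precision, so the observed maximum error is likely to be far smaller than the figure quoted above, which is close to single-precision roundoff; the location of the maximum may differ as well.

```python
import math

def coef(xs, ys):
    """Newton coefficients by divided differences in a single array."""
    a = list(ys)
    n = len(xs) - 1
    for j in range(1, n + 1):
        for i in range(n, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

def eval_newton(xs, a, t):
    """Nested multiplication for the Newton form."""
    temp = a[-1]
    for i in range(len(a) - 2, -1, -1):
        temp = temp * (t - xs[i]) + a[i]
    return temp

n = 9
h = 1.6875 / n
xs = [k * h for k in range(n + 1)]
ys = [math.sin(x) for x in xs]
a = coef(xs, ys)

emax, tmax, jmax = 0.0, 0.0, 0
for j in range(4 * n + 1):
    t = j * h / 4
    e = abs(math.sin(t) - eval_newton(xs, a, t))
    if e > emax:
        jmax, tmax, emax = j, t, e

print("first coefficient:", a[0])
print("largest error found at j =", jmax, "t =", tmax, "error =", emax)
```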
Vandermonde Matrix
Another view of interpolation is that for a given set of n + 1 data points (x0, y0), (x1, y1), …, (xn, yn), we want to express an interpolating function f(x) as a linear combination of a set of basis functions φ0, φ1, φ2, …, φn so that
f(x) ≈ c0φ0(x) + c1φ1(x) + c2φ2(x) + ··· + cnφn(x)
Here the coefficients c0, c1, c2, ..., cn are to be determined. We want the function f to interpolate the data (xi, yi). This means that we have linear equations of the form
f(xi) = c0φ0(xi) + c1φ1(xi) + c2φ2(xi) + ··· + cnφn(xi) = yi
for each i = 0, 1, 2, …, n. This is a system of linear equations
Ac = y
Here, the entries in the coefficient matrix A are given by aij = φj(xi), which is the value of the jth basis function evaluated at the ith data point. The right-hand side vector y contains the known data values yi, and the components of the vector c are the unknown coefficients ci. Systems of linear equations are discussed in Chapters 7 and 8.
Polynomials are the simplest and most common basis functions. The natural basis for Pn consists of the monomials
$$\varphi_0(x) = 1, \quad \varphi_1(x) = x, \quad \varphi_2(x) = x^2, \quad \ldots, \quad \varphi_n(x) = x^n$$
Figure 4.3 shows the first few monomials: 1, x, x^2, x^3, x^4, and x^5.
FIGURE 4.3
First few monomials
Consequently, a given polynomial pn has the form
$$p_n(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n$$
The corresponding linear system Ac = y has the form
$$\begin{bmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ 1 & x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$
The coefficient matrix is called a Vandermonde matrix. It can be shown that this matrix is nonsingular provided that the points x0, x1, x2, …, xn are distinct. So we can, in theory, solve the system for the polynomial interpolant. Although the Vandermonde matrix is nonsingular, it becomes increasingly ill-conditioned as n increases. For large n, the monomials are less distinguishable from one another, as shown in Figure 4.3. Moreover, the columns of the Vandermonde matrix become nearly linearly dependent in this case. High-degree polynomials often oscillate wildly and are highly sensitive to small changes in the data.
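For a small table, the Vandermonde system can be formed and solved directly. The sketch below is an illustration only: the helper names `vandermonde` and `solve` are hypothetical, the solver is plain Gaussian elimination written out by hand, and exact rational arithmetic is assumed so that ill-conditioning plays no role. With the data of Example 3 it recovers the monomial coefficients of p4(x) = −5 + 4x − 7x^2 + 2x^3 + 3x^4 from Example 4.

```python
from fractions import Fraction

def vandermonde(xs):
    """Matrix with entries a_ij = x_i ** j (monomial basis)."""
    return [[Fraction(x) ** j for j in range(len(xs))] for x in xs]

def solve(A, b):
    """Solve A c = b by Gaussian elimination with exact fractions."""
    n = len(b)
    M = [row[:] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Pick any row with a nonzero pivot; one exists if A is nonsingular.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    c = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):   # back substitution
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

xs = [0, 1, -1, 2, -2]
ys = [-5, -3, -15, 39, -9]
c = solve(vandermonde(xs), ys)
print([int(ci) for ci in c])  # [-5, 4, -7, 2, 3]
```

In floating-point arithmetic with many nodes, the same computation would be exposed to the ill-conditioning described above, which is one reason the Newton form is usually preferred.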
FIGURE 4.4
First few Chebyshev polynomials T0, T1, T2, T3, T4, and T5
As Figures 4.1, 4.2, and 4.3 show, we have discussed three choices for the basis functions: the Lagrange cardinal polynomials li(x), the Newton polynomials πi(x), and the monomials. It turns out that there are better choices for the basis functions; namely, the Chebyshev polynomials have more desirable features.
The Chebyshev polynomials play an important role in mathematics because they have several special properties such as the recursive relation
$$T_0(x) = 1, \qquad T_1(x) = x$$
$$T_i(x) = 2x\,T_{i-1}(x) - T_{i-2}(x)$$
for i = 2, 3, 4, and so on. Thus, the first six Chebyshev polynomials are
$$T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1, \quad T_3(x) = 4x^3 - 3x$$
$$T_4(x) = 8x^4 - 8x^2 + 1, \quad T_5(x) = 16x^5 - 20x^3 + 5x$$
The curves for these polynomials, as shown in Figure 4.4, are quite different from one another. The Chebyshev polynomials are usually employed on the interval [−1, 1]. With changes of variable, they can be used on any interval, but the results will be more complicated.
One of the important properties of the Chebyshev polynomials is the equal oscillation property. Notice in Figure 4.4 that successive extreme points of the Chebyshev polynomials are equal in magnitude and alternate in sign. This property tends to distribute the error uniformly when the Chebyshev polynomials are used as the basis functions. In polynomial interpolation for continuous functions, it is particularly advantageous to select as the interpolation points the roots or the extreme points of a Chebyshev polynomial. This causes the maximum error over the interval of interpolation to be minimized. An example of this is given in Section 4.2. In Section 12.2, we discuss Chebyshev polynomials in more detail.
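The recursive relation gives a simple way to evaluate Chebyshev polynomials. The sketch below uses an illustrative function name, `chebyshev`, and checks the equal oscillation property for T5. It relies on the standard identity T_n(cos θ) = cos(nθ), which is not stated in this section, to locate the extreme points x_k = cos(kπ/n) on [−1, 1].

```python
import math

def chebyshev(i, x):
    """Return T_i(x) from T_0 = 1, T_1 = x, T_i = 2x T_{i-1} - T_{i-2}."""
    t_prev, t_curr = 1.0, x
    if i == 0:
        return t_prev
    for _ in range(2, i + 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

n = 5
# Extreme points of T_n on [-1, 1], assuming T_n(cos(theta)) = cos(n * theta).
extreme_points = [math.cos(k * math.pi / n) for k in range(n + 1)]
extreme_values = [chebyshev(n, x) for x in extreme_points]

# The values alternate between +1 and -1 up to roundoff.
print([round(v, 12) for v in extreme_values])
```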
Inverse Interpolation
A process called inverse interpolation is often used to approximate an inverse function. Suppose that values yi = f(xi) have been computed at x0, x1, …, xn. Using the table
y   y0   y1   ···   yn
x   x0   x1   ···   xn
we form the interpolation polynomial
$$p(y) = \sum_{i=0}^{n} c_i \prod_{j=0}^{i-1} (y - y_j)$$
The original relationship, y = f (x), has an inverse, under certain conditions. This inverse is being approximated by x = p(y). Procedures Coef and Eval can be used to carry out the inverse interpolation by reversing the arguments x and y in the calling sequence for Coef.
Inverse interpolation can be used to find where a given function f has a root or zero. This means inverting the equation f (x) = 0. We propose to do this by creating a table of values ( f (xi ), xi ) and interpolating with a polynomial, p. Thus, p(yi ) = xi . The points xi should be chosen near the unknown root, r . The approximate root is then given by r ≈ p(0). See Figure 4.5 for an example of function y = f (x) and its inverse function x = g(y) with the root r = g(0).
FIGURE 4.5
Function y = f(x) and inverse function x = g(y)
EXAMPLE 9
For a concrete case, let the table of known values be
y   −0.5789200   −0.3626370   −0.1849160   −0.0340642   0.0969858
x   1.0          2.0          3.0          4.0          5.0
Find the inverse interpolation polynomial.
Solution
The nodes in this problem are the points in the row of the table headed y, and the function values being interpolated are in the x row. The resulting polynomial is
$$p(y) = 0.25y^4 + 1.2y^3 + 3.69y^2 + 7.39y + 4.247470086$$
and p(0) = 4.247470086. Only the last coefficient is shown with all the digits carried in the calculation, as it is the only one needed for the problem at hand. ■
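Example 9 can be reproduced by calling the coefficient and evaluation routines with the roles of x and y exchanged. The following sketch repeats illustrative Python versions of Coef and Eval (named `coef` and `eval_newton` here) so that it is self-contained. Note that the list `c` holds Newton-form coefficients, not the monomial coefficients displayed above, although both describe the same polynomial.

```python
def coef(xs, ys):
    """Newton coefficients by divided differences in a single array."""
    a = list(ys)
    n = len(xs) - 1
    for j in range(1, n + 1):
        for i in range(n, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

def eval_newton(xs, a, t):
    """Nested multiplication for the Newton form."""
    temp = a[-1]
    for i in range(len(a) - 2, -1, -1):
        temp = temp * (t - xs[i]) + a[i]
    return temp

# Table of Example 9: the y values act as nodes, the x values as data.
y_nodes = [-0.5789200, -0.3626370, -0.1849160, -0.0340642, 0.0969858]
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]

c = coef(y_nodes, x_values)
root_estimate = eval_newton(y_nodes, c, 0.0)
print(root_estimate)  # approximately 4.24747
```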
Polynomial Interpolation by Neville’s Algorithm
Another method of obtaining a polynomial interpolant from a given table of values
x   x0   x1   ···   xn
y   y0   y1   ···   yn
was given by Neville. It builds the polynomial in steps, just as the Newton algorithm does. The constituent polynomials have interpolating properties of their own.
Let $P_{a,b,\ldots,s}(x)$ be the polynomial interpolating the given data at a sequence of nodes xa, xb, …, xs. We start with constant polynomials Pi(x) = f(xi). Selecting two nodes xi and xj with i > j, we define recursively
$$P_{u,\ldots,v}(x) = \left(\frac{x - x_j}{x_i - x_j}\right) P_{u,\ldots,j-1,j+1,\ldots,v}(x) + \left(\frac{x_i - x}{x_i - x_j}\right) P_{u,\ldots,i-1,i+1,\ldots,v}(x)$$
Using this formula repeatedly, we can create an array of polynomials:
x0   P0(x)
x1   P1(x)   P0,1(x)
x2   P2(x)   P1,2(x)   P0,1,2(x)
x3   P3(x)   P2,3(x)   P1,2,3(x)   P0,1,2,3(x)
x4   P4(x)   P3,4(x)   P2,3,4(x)   P1,2,3,4(x)   P0,1,2,3,4(x)
Here, each successive polynomial can be determined from two adjacent polynomials in the previous column.
We can simplify the notation by letting
$$S_{ij}(x) = P_{i-j,\,i-j+1,\ldots,\,i-1,\,i}(x)$$
where $S_{ij}(x)$ for i ≥ j denotes the interpolating polynomial of degree j on the j + 1 nodes $x_{i-j}, x_{i-j+1}, \ldots, x_{i-1}, x_i$. Next we can rewrite the recurrence relation above as
$$S_{ij}(x) = \left(\frac{x - x_{i-j}}{x_i - x_{i-j}}\right) S_{i,j-1}(x) + \left(\frac{x_i - x}{x_i - x_{i-j}}\right) S_{i-1,j-1}(x)$$
So the displayed array becomes
x0   S00(x)
x1   S10(x)   S11(x)
x2   S20(x)   S21(x)   S22(x)
x3   S30(x)   S31(x)   S32(x)   S33(x)
x4   S40(x)   S41(x)   S42(x)   S43(x)   S44(x)
To prove some theoretical results, we change the notation by making the superscript the degree of the polynomial. At the beginning, we define constant polynomials (i.e., polynomials of degree 0) as $P_i^0(x) = y_i$ for 0 ≤ i ≤ n. Then we define
$$P_i^j(x) = \left(\frac{x - x_{i-j}}{x_i - x_{i-j}}\right) P_i^{j-1}(x) + \left(\frac{x_i - x}{x_i - x_{i-j}}\right) P_{i-1}^{j-1}(x) \qquad (14)$$
In this equation, the superscripts are simply indices, not exponents. The range of j is 1 ≤ j ≤ n, while that of i is j ≤ i ≤ n. Formula (14) will be seen again, in slightly different form, in the theory of B splines in Section 9.3.
The interpolation properties of these polynomials are given in the next result.
■ THEOREM 4 (INTERPOLATION PROPERTIES)
The polynomials $P_i^j$ defined above interpolate as follows:
$$P_i^j(x_k) = y_k \qquad (0 \le i - j \le k \le i \le n) \qquad (15)$$
Proof
We use induction on $j$. When $j = 0$, the assertion in Equation (15) reads
$$P_i^0(x_k) = y_k \qquad (0 \le i \le k \le i \le n)$$
In other words, $P_i^0(x_i) = y_i$, which is true by the definition of $P_i^0$.
Now assume, as an induction hypothesis, that for some $j \ge 1$,
$$P_i^{j-1}(x_k) = y_k \qquad (0 \le i - j + 1 \le k \le i \le n)$$
To prove the next case in Equation (15), we begin by verifying the two extreme cases for $k$, namely, $k = i - j$ and $k = i$. We have, by Equation (14),
$$P_i^j(x_{i-j}) = \frac{x_i - x_{i-j}}{x_i - x_{i-j}}\, P_{i-1}^{j-1}(x_{i-j}) = P_{i-1}^{j-1}(x_{i-j}) = y_{i-j}$$
The last equality is justified by the induction hypothesis. It is necessary to observe that $0 \le i - 1 - j + 1 \le i - j \le i - 1 \le n$. In the same way, we compute
$$P_i^j(x_i) = \frac{x_i - x_{i-j}}{x_i - x_{i-j}}\, P_i^{j-1}(x_i) = P_i^{j-1}(x_i) = y_i$$
Here, in using the induction hypothesis, observe that $0 \le i - j + 1 \le i \le i \le n$.
Now let $i - j < k < i$. Then
$$P_i^j(x_k) = \frac{x_k - x_{i-j}}{x_i - x_{i-j}}\, P_i^{j-1}(x_k) + \frac{x_i - x_k}{x_i - x_{i-j}}\, P_{i-1}^{j-1}(x_k)$$
In this equation, $P_i^{j-1}(x_k) = y_k$ by the induction hypothesis, because $0 \le i - j + 1 \le k \le i \le n$. Likewise, $P_{i-1}^{j-1}(x_k) = y_k$ because $0 \le i - 1 - j + 1 \le k \le i - 1 \le n$. Thus, we have
$$P_i^j(x_k) = \frac{x_k - x_{i-j}}{x_i - x_{i-j}}\, y_k + \frac{x_i - x_k}{x_i - x_{i-j}}\, y_k = y_k \qquad ■$$
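An algorithm follows in pseudocode to evaluate at a point t the interpolating polynomial on all n + 1 nodes, which in the notation above is $P_n^n(t) = S_{nn}(t)$, when a table of values is given:
integer i, j, n; real array (x_i)_{0:n}, (y_i)_{0:n}, (S_{ij})_{0:n×0:n}
for i = 0 to n
    S_{i,0} ← y_i
end for
for j = 1 to n
    for i = j to n
        S_{i,j} ← [(t − x_{i−j}) S_{i,j−1} + (x_i − t) S_{i−1,j−1}] / (x_i − x_{i−j})
    end for
end for
return S_{n,n}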
We begin the algorithm by finding the node nearest the point t at which the evaluation is to be made. In general, interpolation is more accurate when this is done.
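As an illustration of the recurrence for $S_{ij}$, here is a minimal Python sketch of Neville's algorithm. The function name `neville` and the list-of-lists storage are choices made for this sketch, not part of the text; it returns the entry $S_{nn}$, the value at t of the interpolant on all n + 1 nodes.

```python
def neville(x, y, t):
    """Evaluate at t the polynomial interpolating the points (x[i], y[i]).

    S[i][j] holds the value at t of the degree-j interpolant on the
    nodes x[i-j], ..., x[i], following the recurrence in the text.
    """
    n = len(x) - 1
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        S[i][0] = y[i]                      # degree-0 polynomials
    for j in range(1, n + 1):
        for i in range(j, n + 1):
            S[i][j] = ((t - x[i - j]) * S[i][j - 1]
                       + (x[i] - t) * S[i - 1][j - 1]) / (x[i] - x[i - j])
    return S[n][n]                          # interpolant on all n + 1 nodes


# Sample data: these four points lie on x^3 - 2x + 7.
xs = [0.0, 2.0, 3.0, 4.0]
ys = [7.0, 11.0, 28.0, 63.0]
print(neville(xs, ys, 1.0))
```

The full array is stored here only to mirror the pseudocode; a single working column would suffice if just the final value is needed.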
Interpolation of Bivariate Functions
The methods we have discussed for interpolating functions of one variable by polynomials extend to some cases of functions of two or more variables. An important case occurs when a function $(x, y) \mapsto f(x, y)$ is to be approximated on a rectangle. This leads to what is known as tensor-product interpolation. Suppose the rectangle is the Cartesian product of two intervals: $[a, b] \times [\alpha, \beta]$. That is, the variables $x$ and $y$ run over the intervals $[a, b]$ and $[\alpha, \beta]$, respectively. Select $n$ nodes $x_i$ in $[a, b]$, and define the Lagrangian polynomials
$$l_i(x) = \prod_{\substack{j=1 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j} \qquad (1 \le i \le n)$$
Similarly, we select $m$ nodes $y_i$ in $[\alpha, \beta]$ and define
$$\bar{l}_i(y) = \prod_{\substack{j=1 \\ j \ne i}}^{m} \frac{y - y_j}{y_i - y_j} \qquad (1 \le i \le m)$$
Then the function
$$P(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, l_i(x)\, \bar{l}_j(y)$$
is a polynomial in two variables that interpolates $f$ at the grid points $(x_i, y_j)$. There are $nm$ such points of interpolation. The proof of the interpolation property is quite simple because $l_i(x_q) = \delta_{iq}$ and $\bar{l}_j(y_p) = \delta_{jp}$. Consequently,
$$P(x_q, y_p) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, l_i(x_q)\, \bar{l}_j(y_p) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, \delta_{iq}\, \delta_{jp} = f(x_q, y_p)$$
The same procedure can be used with spline interpolants (or indeed any other type of function).
Summary
(1) The Lagrange form of the interpolation polynomial is
$$p_n(x) = \sum_{i=0}^{n} l_i(x)\, f(x_i)$$
with cardinal polynomials
$$l_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j} \qquad (0 \le i \le n)$$
that obey the Kronecker delta equation
$$l_i(x_j) = \delta_{ij} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$
(2) The Newton form of the interpolation polynomial is
$$p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j)$$
with divided differences
$$a_i = f[x_0, x_1, \ldots, x_i] = \frac{f[x_1, x_2, \ldots, x_i] - f[x_0, x_1, \ldots, x_{i-1}]}{x_i - x_0}$$
These are two different forms of the unique polynomial $p$ of degree at most $n$ that interpolates a table of $n + 1$ pairs of points $(x_i, f(x_i))$ for $0 \le i \le n$.
(3) We can illustrate this with a small table for $n = 2$:
x      x0      x1      x2
f(x)   f(x0)   f(x1)   f(x2)
The Lagrange interpolating polynomial is
$$p_2(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\, f(x_0) + \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}\, f(x_1) + \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}\, f(x_2)$$
Clearly, $p_2(x_0) = f(x_0)$, $p_2(x_1) = f(x_1)$, and $p_2(x_2) = f(x_2)$. Next, we form the divided-difference table:
$$\begin{array}{llll}
x_0 & f(x_0) & & \\
 & & f[x_0, x_1] & \\
x_1 & f(x_1) & & f[x_0, x_1, x_2] \\
 & & f[x_1, x_2] & \\
x_2 & f(x_2) & &
\end{array}$$
Using the divided-difference entries from the top diagonal, we have
$$p_2(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)$$
Again, it can be easily shown that $p_2(x_0) = f(x_0)$, $p_2(x_1) = f(x_1)$, and $p_2(x_2) = f(x_2)$.
(4) We can use inverse polynomial interpolation to find an approximate value of a root $r$ of the equation $f(x) = 0$ from a table of values $(x_i, y_i)$ for $1 \le i \le n$. Here we are assuming that the table values are in the vicinity of this zero of the function $f$. Flipping the table values, we use the reversed table values $(y_i, x_i)$ to determine the interpolating polynomial called $p_n(y)$. Now evaluating it at 0, we find a value that approximates the desired zero, namely, $r \approx p_n(0)$ and $f(p_n(0)) \approx f(r) = 0$.
(5) Other advanced polynomial interpolation methods discussed are Neville’s algorithm and bivariate function interpolation.
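To make the bivariate (tensor-product) interpolation mentioned in item (5) concrete, the following Python sketch evaluates the double sum of grid values times cardinal polynomials in each variable. The names `lagrange_basis` and `tensor_interpolate`, and the sample function, are assumptions made for this illustration only.

```python
def lagrange_basis(nodes, i, t):
    """Value at t of the cardinal polynomial that is 1 at nodes[i], 0 at the others."""
    value = 1.0
    for j, node in enumerate(nodes):
        if j != i:
            value *= (t - node) / (nodes[i] - node)
    return value


def tensor_interpolate(f, xs, ys, x, y):
    """Evaluate P(x, y) = sum_i sum_j f(xs[i], ys[j]) * l_i(x) * lbar_j(y)."""
    total = 0.0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            total += f(xi, yj) * lagrange_basis(xs, i, x) * lagrange_basis(ys, j, y)
    return total


# Hypothetical sample function: quadratic in x and linear in y, so three
# x-nodes and two y-nodes reproduce it exactly.
def sample(x, y):
    return x * x * y + 1.0

print(tensor_interpolate(sample, [0.0, 1.0, 2.0], [0.0, 3.0], 0.5, 1.0))
```

Because the interpolant is built from one-dimensional cardinal polynomials, it agrees with the function at every grid point, and it is exact whenever the function is a polynomial of low enough degree in each variable separately.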
Problems 4.1
a1. Use the Lagrange interpolation process to obtain a polynomial of least degree that assumes these values:
x   0   2    3    4
y   7   11   28   63
2. (Continuation) Rearrange the points in the table of the preceding problem and find the Newton form of the interpolating polynomial. Show that the polynomials obtained are identical, although their forms may differ.
a3. For the four interpolation nodes −1, 1, 3, 4, what are the $l_i$ functions (2) required in the Lagrange interpolation procedure? Draw the graphs of these four functions to show their essential properties.
4. Verify that the polynomials
$$p(x) = 5x^3 - 27x^2 + 45x - 21, \qquad q(x) = x^4 - 5x^3 + 8x^2 - 5x + 3$$
interpolate the data
x   1   2   3   4
y   2   1   6   47
and explain why this does not violate the uniqueness part of the theorem on existence of polynomial interpolation.
5. Verify that the polynomials
$$p(x) = 3 + 2(x - 1) + 4(x - 1)(x + 2), \qquad q(x) = 4x^2 + 6x - 7$$
are both interpolating polynomials for the following table, and explain why this does not violate the uniqueness part of the existence theorem for polynomial interpolation.
x   1   −2   0
y   3   −3   −7
6. Find the polynomial $p$ of least degree that takes these values: $p(0) = 2$, $p(2) = 4$, $p(3) = -4$, $p(5) = 82$. Use divided differences to get the correct polynomial. It is not necessary to write the polynomial in the standard form $a_0 + a_1 x + a_2 x^2 + \cdots$.
7. Complete the following divided-difference tables, and use them to obtain polynomials of degree 3 that interpolate the function values indicated:
aa.
x     f[ ]   f[ , ]   f[ , , ]   f[ , , , ]
−1    2
1     −4              2
3     6
             2
5     10
b.
x     f[ ]   f[ , ]   f[ , , ]   f[ , , , ]
−1    2
1     −4
3     46
             53.5
4     99.5
Write the final polynomials in a form most efficient for computing.
a8. Find an interpolating polynomial for this table:
x   1    2      2.5    3     4
y   −1   −1/3   3/32   4/3   25
9. Given the data
x      0   1   2    4    6
f(x)   1   9   23   93   259
do the following.
aa. Construct the divided-difference table.
ab. Using Newton's interpolation polynomial, find an approximation to $f(4.2)$. Hint: Use polynomials starting with 9 and involving factors $(x - 1)$.
10. a. Construct Newton's interpolation polynomial for the data shown.
x   0   2    3    4
y   7   11   28   63
b. Without simplifying it, write the polynomial obtained in nested form for easy
evaluation.
11. From census data, the approximate population of the United States was 150.7 million in 1950, 179.3 million in 1960, 203.3 million in 1970, 226.5 million in 1980, and 249.6 million in 1990. Using Newton’s interpolation polynomial for these data, find an approximate value for the population in 2000. Then use the polynomial to estimate the population in 1920 based on these data. What conclusion should be drawn?
a12. The polynomial $p(x) = x^4 - x^3 + x^2 - x + 1$ has the following values:
x      −2   −1   0   1   2    3
p(x)   31   5    1   1   11   61
Find a polynomial $q$ that takes these values:
x      −2   −1   0   1   2    3
q(x)   31   5    1   1   11   30
Hint: This can be done with little work.
13. Use the divided-difference method to obtain a polynomial of least degree that fits the values shown.
aa.
x   0    1    2    −1   3
y   −1   −1   −1   −7   5
b.
x   1   3   −2   4    5
y   2   6   −1   −4   2
a14. Find the interpolating polynomial for these data:
x      1.0    2.0    2.5   3.0   4.0
f(x)   −1.5   −0.5   0.0   0.5   1.5
15. It is suspected that the table
x   −2   −1   0    1    2    3
y   1    4    11   16   13   −4
comes from a cubic polynomial. How can this be tested? Explain.
a16. There exists a unique polynomial $p(x)$ of degree 2 or less such that $p(0) = 0$, $p(1) = 1$, and $p'(\alpha) = 2$ for any value of $\alpha$ between 0 and 1 (inclusive) except one value of $\alpha$, say, $\alpha_0$. Determine $\alpha_0$, and give this polynomial for $\alpha \ne \alpha_0$.
17. Determine by two methods the polynomial of degree 2 or less whose graph passes through the points (0, 1.1), (1, 2), and (2, 4.2). Verify that they are the same.
a18. Develop the divided-difference table from the given data. Write down the interpolating polynomial, and rearrange it for fast computation without simplifying.
x      0   1   3   2   5
f(x)   2   1   5   6   −183
Checkpoint: $f[1, 3, 2, 5] = -7$.
a19. Let $f(x) = x^3 + 2x^2 + x + 1$. Find the polynomial of degree 4 that interpolates the values of $f$ at $x = -2, -1, 0, 1, 2$. Find the polynomial of degree 2 that interpolates the values of $f$ at $x = -1, 0, 1$.
20. Without using a divided-difference table, derive and simplify the polynomial of least degree that assumes these values:
x   −2   −1   0   1   2
y   2    14   4   2   2
21. (Continuation) Find a polynomial that takes the values shown in the preceding problem and has at $x = 3$ the value 10. Hint: Add a suitable polynomial to the $p(x)$ of the previous problem.
a22. Find a polynomial of least degree that takes these values:
x   1.73   1.82   2.61   5.22   8.26
y   0      0      7.8    0      0
Hint: Rearrange the table so that the nonzero value of $y$ is the last entry, or think of some better way.
23. Form a divided-difference table for the following and explain what happened.
x   1   2   3   1
y   3   5   5   7
24. Simple polynomial interpolation in two dimensions is not always possible. For example, suppose that the following data are to be represented by a polynomial of first degree in $x$ and $y$, $p(t) = a + bx + cy$, where $t = (x, y)$:
t      (1,1)   (3,2)   (5,3)
f(t)   3       2       6
Show that it is not possible.
a25. Consider a function $f(x)$ such that $f(2) = 1.5713$, $f(3) = 1.5719$, $f(5) = 1.5738$, and $f(6) = 1.5751$. Estimate $f(4)$ using a second-degree interpolating polynomial and a third-degree polynomial. Round the final results off to four decimal places. Is there any advantage here in using a third-degree polynomial?
26. Use inverse interpolation to find an approximate value of $x$ such that $f(x) = 0$ given the following table of values for $f$. Look into what happens and draw a conclusion.
x      −2    −1   1   2    3
f(x)   −31   5    1   11   61
a27. Find a polynomial $p(x)$ of degree at most 3 such that $p(0) = 1$, $p(1) = 0$, $p'(0) = 0$, and $p'(-1) = -1$.
a28. From a table of logarithms, we obtain the following values of $\log x$ at the indicated tabular points:
x       1   1.5       2         3         3.5       4
log x   0   0.17609   0.30103   0.47712   0.54407   0.60206
Form a divided-difference table based on these values. Interpolate for log 2.4 and log 1.2 using third-degree interpolation polynomials in Newton form.
29. Show that the divided differences are linear maps; that is,
(αf +βg)[x0,x1,...,xn]=αf[x0,x1,...,xn]+βg[x0,x1,...,xn]
Hint: Use induction.
30. Show that another form for the polynomial $p_n$ of degree at most $n$ that takes values $y_0, y_1, \ldots, y_n$ at abscissas $x_0, x_1, \ldots, x_n$ is
$$\sum_{i=0}^{n} f[x_n, x_{n-1}, \ldots, x_{n-i}] \prod_{j=0}^{i-1} (x - x_{n-j})$$
31. Use the uniqueness of the interpolating polynomial to verify that
$$\sum_{i=0}^{n} f(x_i)\, l_i(x) = \sum_{i=0}^{n} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j)$$
32. (Continuation) Show that the following explicit formula is valid for divided differences:
$$f[x_0, x_1, \ldots, x_n] = \sum_{i=0}^{n} f(x_i) \prod_{\substack{j=0 \\ j \ne i}}^{n} (x_i - x_j)^{-1}$$
Hint: If two polynomials are equal, the coefficients of $x^n$ in each are equal.
33. Verify directly that
$$\sum_{i=0}^{n} l_i(x) = 1$$
for the case $n = 1$. Then establish the result for arbitrary values of $n$.
34. Write the Lagrange form (1) of the interpolating polynomial of degree at most 2 that interpolates f (x) at x0, x1, and x2, where x0 < x1 < x2.
35. (Continuation) Write the Newton form of the interpolating polynomial $p_2(x)$, and show that it is equivalent to the Lagrange form.
36. (Continuation) Show directly that
$$p_2''(x) = 2 f[x_0, x_1, x_2]$$
37. (Continuation) Show directly for uniform spacing $h = x_1 - x_0 = x_2 - x_1$ that
$$f[x_0, x_1] = \frac{\Delta f_0}{h} \qquad \text{and} \qquad f[x_0, x_1, x_2] = \frac{\Delta^2 f_0}{2h^2}$$
where $\Delta f_i = f_{i+1} - f_i$, $\Delta^2 f_i = \Delta f_{i+1} - \Delta f_i$, and $f_i = f(x_i)$.
38. (Continuation) Establish Newton's forward-difference form of the interpolating polynomial with uniform spacing
$$p_2(x) = f_0 + \binom{s}{1} \Delta f_0 + \binom{s}{2} \Delta^2 f_0$$
where $x = x_0 + sh$. Here, $\binom{s}{m}$ is the binomial coefficient $[s!]/[(s - m)!\, m!]$, and $s!/(s - m)! = s(s - 1)(s - 2) \cdots (s - m + 1)$ because $s$ can be any real number and $m!$ has the usual definition because $m$ is an integer.
a39. (Continuation) From the following table of values of $\ln x$, interpolate to obtain ln 2.352 and ln 2.387 using the Newton forward-difference form of the interpolating polynomial:
x      f(x)      Δf        Δ²f
2.35   0.85442
                 0.00424
2.36   0.85866             −0.00001
                 0.00423
2.37   0.86289             −0.00002
                 0.00421
2.38   0.86710             −0.00002
                 0.00419
2.39   0.87129
Using the correctly rounded values ln 2.352 ≈ 0.85527 and ln 2.387 ≈ 0.87004, show that the forward-difference formula is more accurate near the top of the table than it is near the bottom.
a40. Count the number of multiplications, divisions, and additions/subtractions in the generation of the divided-difference table that has $n + 1$ points.
41. Verify directly that for any three distinct points $x_0$, $x_1$, and $x_2$,
$$f[x_0, x_1, x_2] = f[x_2, x_0, x_1] = f[x_1, x_2, x_0]$$
Compare this argument to the one in the text.
a42. Let $p$ be a polynomial of degree $n$. What is $p[x_0, x_1, \ldots, x_{n+1}]$?
43. Show that if $f$ is continuously differentiable on the interval $[x_0, x_1]$, then $f[x_0, x_1] = f'(c)$ for some $c$ in $(x_0, x_1)$.
44. If $f$ is a polynomial of degree $n$, show that in a divided-difference table for $f$, the $n$th column has a single constant value, that is, a column containing entries $f[x_i, x_{i+1}, \ldots, x_{i+n}]$.
a45. Determine whether the following assertion is true or false. If $x_0, x_1, \ldots, x_n$ are distinct, then for arbitrary real values $y_0, y_1, \ldots, y_n$, there is a unique polynomial $p_{n+1}$ of degree $n + 1$ such that $p_{n+1}(x_i) = y_i$ for all $i = 0, 1, \ldots, n$.
46. Show that if a function $g$ interpolates the function $f$ at $x_0, x_1, \ldots, x_{n-1}$ and $h$ interpolates $f$ at $x_1, x_2, \ldots, x_n$, then
$$g(x) + \frac{x_0 - x}{x_n - x_0}\, [g(x) - h(x)]$$
interpolates $f$ at $x_0, x_1, \ldots, x_n$.
47. (Vandermonde determinant) Using $f_i = f(x_i)$, show the following:
a. $$f[x_0, x_1] = \begin{vmatrix} 1 & f_0 \\ 1 & f_1 \end{vmatrix} \Bigg/ \begin{vmatrix} 1 & x_0 \\ 1 & x_1 \end{vmatrix}$$
b. $$f[x_0, x_1, x_2] = \begin{vmatrix} 1 & x_0 & f_0 \\ 1 & x_1 & f_1 \\ 1 & x_2 & f_2 \end{vmatrix} \Bigg/ \begin{vmatrix} 1 & x_0 & x_0^2 \\ 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \end{vmatrix}$$
Computer Problems 4.1
a1. Test the procedure given in the text for determining the Newton form of the interpolating polynomial. For example, consider this table:
x   1   2    3     −4     5
y   2   48   272   1182   2262
Find the interpolating polynomial and verify that $p(-1) = 12$.
2. Find the polynomial of degree 10 that interpolates the function arctan x at 11 equally spaced points in the interval [1, 6]. Print the coefficients in the Newton form of the polynomial. Compute and print the difference between the polynomial and the function at 33 equally spaced points in the interval [0, 8]. What conclusion can be drawn?
3. Write a simple program using procedure Coef that interpolates $e^x$ by a polynomial of degree 10 on [0, 2] and then compares the polynomial to exp at 100 points.
4. Use as input data to procedure Coef the annual rainfall in your town for each of the last 5 years. Using function Eval, predict the rainfall for this year. Is the answer reasonable?
5. A table of values of a function $f$ is given at the points $x_i = i/10$ for $0 \le i \le 100$. In order to obtain a graph of $f$ with the aid of an automatic plotter, the values of $f$ are required at the points $z_i = i/20$ for $0 \le i \le 200$. Write a procedure to do this, using a cubic interpolating polynomial with nodes $x_i$, $x_{i+1}$, $x_{i+2}$, and $x_{i+3}$ to compute $f$ at $\frac{1}{2}(x_{i+1} + x_{i+2})$. For $z_1$ and $z_{199}$, use the cubic polynomial associated with $z_3$ and $z_{197}$, respectively. Compare this routine to Coef for a given function.
6. Write routines analogous to Coef and Eval using the Lagrange form of the interpolation polynomial. Test on the example given in this section at 20 points with h/2. Does the Lagrange form have any advantage over the Newton form?
7. (Continuation) Design and carry out a numerical experiment to compare the accuracy of the Newton and Lagrange forms of the interpolation polynomials at values throughout the interval $[x_0, x_n]$.
8. Rewrite and test routines Coef and Eval so that the array $(a_i)$ is not used. Hint: When the elements in the array $(y_i)$ are no longer needed, store the divided differences in their places.
9. Write a procedure for carrying out inverse interpolation to solve equations of the form $f(x) = 0$. Test it on the introductory example at the beginning of this chapter.
10. For Example 8, compare the results from your code with that in the text. Redo using linear interpolation based on the ten equidistant points. How do the errors compare at intermediate points? Plot curves to visualize the difference between linear interpolation and a higher-degree polynomial interpolation.
11. Use mathematical software such as Matlab, Maple, or Mathematica to find an interpolation polynomial for the points (0, 0), (1, 1), (2, 2.001), (3, 3), (4, 4), (5, 5). Evaluate the polynomial at the point x = 14 or x = 20 to show that slight roundoff errors in the data can lead to suspicious results in extrapolation.
12. Use symbolic mathematical software such as Matlab, Maple, or Mathematica to generate the interpolation polynomial for the data points in Example 3. Plot the polynomial and the data points.
13. (Continuation.) Repeat these instructions using Example 7.
14. Carry out the details in Example 8 by writing a computer program that plots the data points and the curve for the interpolation polynomial.
15. (Continuation.) Repeat the instructions for Problem 14 on Example 9.
16. Using mathematical software, carry out the details and verify the results in the introductory example to this chapter.
17. (Padé interpolation) Find a rational function of the form
$$g(x) = \frac{a + bx}{1 + cx}$$
that interpolates the function $f(x) = \arctan(x)$ at the points $x_0 = 1$, $x_1 = 2$, and $x_2 = 3$. On the same axes, plot the graphs of $f$ and $g$, using dashed and dotted lines, respectively.
4.2 Errors in Polynomial Interpolation
When a function $f$ is approximated on an interval $[a, b]$ by means of an interpolating polynomial $p$, the discrepancy between $f$ and $p$ will (theoretically) be zero at each node of interpolation. A natural expectation is that the function $f$ will be well approximated at all intermediate points and that as the number of nodes increases, this agreement will become better and better.
In the history of numerical mathematics, a severe shock occurred when it was realized that this expectation was ill-founded. Of course, if the function being approximated is not required to be continuous, then there may be no agreement at all between $p(x)$ and $f(x)$ except at the nodes.
EXAMPLE 1
Consider these five data points: (0, 8), (1, 12), (3, 2), (4, 6), (8, 0). Construct and plot the interpolation polynomial using the two outermost points. Repeat this process by adding one additional point at a time until all the points are included. What conclusions can you draw?
FIGURE 4.6 Interpolant polynomials over data points (curves p1, p2, p3, p4; horizontal axis x from 0 to 8, vertical axis y)
Solution
The first interpolation polynomial is the line between the outermost points (0, 8) and (8, 0). Then we added the points (3, 2), (4, 6), and (1, 12) in that order and plotted a curve for each additional point. All of these polynomials are shown in Figure 4.6. We were hoping for a smooth curve going through these points without wide fluctuations, but this did not happen. (Why?) It may seem counterintuitive, but as we added more points, the situation became worse instead of better! The reason for this comes from the nature of high-degree polynomials. A polynomial of degree n has n zeros. If all of these zero points are real, then the curve crosses the x-axis n times. The resulting curve must make many turns for this to happen, resulting in wild oscillations. In Chapter 9, we discuss fitting the data points with spline curves. ■
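The experiment in this example can be reproduced with a short Python sketch. It uses the five data points as listed in the example statement, adds them in the order described in the solution, and reports the range of each interpolant on [0, 8]. The helper names `newton_coefficients` and `newton_eval` are stand-ins written for this sketch; they are not the text's Coef and Eval routines, although they follow the same Newton-form idea.

```python
def newton_coefficients(xs, ys):
    """Divided differences f[x0], f[x0,x1], ..., computed in place."""
    coef = list(ys)
    n = len(xs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, t):
    """Evaluate the Newton form at t by nested multiplication."""
    value = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        value = value * (t - xs[i]) + coef[i]
    return value


# Points in the order they are added: outermost pair first.
points = [(0.0, 8.0), (8.0, 0.0), (3.0, 2.0), (4.0, 6.0), (1.0, 12.0)]

for k in range(2, len(points) + 1):
    xs = [p[0] for p in points[:k]]
    ys = [p[1] for p in points[:k]]
    coef = newton_coefficients(xs, ys)
    samples = [newton_eval(xs, coef, 0.1 * m) for m in range(81)]
    print(f"p{k - 1}: min = {min(samples):8.3f}, max = {max(samples):8.3f}")
```

Comparing the printed ranges with the range of the data values (0 to 12) gives a quick numerical impression of how much each added point makes the curve swing.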
Dirichlet Function
As a pathological example, consider the so-called Dirichlet function f , defined to be 1 at each irrational point and 0 at each rational point. If we choose nodes that are rational numbers, then p(x) ≡ 0 and f (x) − p(x) = 0 for all rational values of x, but
f (x) − p(x) = 1 for all irrational values of x.
However, if the function f is well-behaved, can we not assume that the differences
| f (x) − p(x)| will be small when the number of interpolating nodes is large? The answer is still no, even for functions that possess continuous derivatives of all orders on the interval!
Runge Function
A specific example of this remarkable phenomenon is provided by the Runge function:
$$f(x) = \left(1 + x^2\right)^{-1} \qquad (1)$$
on the interval [−5, 5]. Let $p_n$ be the polynomial that interpolates this function at $n + 1$ equally spaced points on the interval [−5, 5], including the endpoints. Then
$$\lim_{n \to \infty}\ \max_{-5 \le x \le 5} |f(x) - p_n(x)| = \infty$$
Thus, the effect of requiring the agreement of f and pn at more and more points is to increase the error at nonnodal points, and the error actually increases beyond all bounds!
The moral of this example, then, is that polynomial interpolation of high degree with many nodes is a risky operation; the resulting polynomials may be very unsatisfactory as representations of functions unless the set of nodes is chosen with great care.
The reader can easily observe the phenomenon just described by using the pseudocodes already developed in this chapter. See Computer Problem 4.2.1 for a suggested numerical experiment. In a more advanced study of this topic, it would be shown that the divergence of the polynomials can often be ascribed to the fact that the nodes are equally spaced. Again, contrary to intuition, equally distributed nodes are usually a very poor choice in interpolation. A much better choice for n + 1 nodes in [−1, 1] is the set of Chebyshev nodes:
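$$x_i = \cos\left[\left(\frac{2i + 1}{2n + 2}\right)\pi\right] \qquad (0 \le i \le n)$$
The corresponding set of nodes on an arbitrary interval $[a, b]$ would be derived from a linear mapping to obtain
$$x_i = \frac{1}{2}(a + b) + \frac{1}{2}(b - a)\cos\left[\left(\frac{2i + 1}{2n + 2}\right)\pi\right] \qquad (0 \le i \le n)$$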
Notice that these nodes are numbered from right to left. Since the theory does not depend on any particular ordering of the nodes, this is not troublesome.
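A minimal Python sketch of the node formula may help make the numbering concrete. The function name `chebyshev_nodes` and its default interval are choices made for this illustration.

```python
import math


def chebyshev_nodes(n, a=-1.0, b=1.0):
    """Return the n + 1 Chebyshev nodes on [a, b], for i = 0, 1, ..., n."""
    return [0.5 * (a + b)
            + 0.5 * (b - a) * math.cos((2 * i + 1) * math.pi / (2 * n + 2))
            for i in range(n + 1)]


nodes = chebyshev_nodes(8, -5.0, 5.0)
for i, node in enumerate(nodes):
    print(i, round(node, 4))
```

Printing the list shows the right-to-left numbering mentioned above: the first node is the one closest to b and the last is the one closest to a, with the nodes crowded toward both endpoints.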
A simple graph illustrates this phenomenon best. Again, consider Equation (1) on the interval [−5, 5]. First, we select nine equally spaced nodes and use routines Coef and Eval with an automatic plotter to graph $p_8$. As shown in Figure 4.7, the resulting curve assumes negative values, which, of course, $f(x)$ does not have! Adding more equally spaced nodes, and thereby obtaining a higher-degree polynomial, only makes matters worse with wilder oscillations. In Figure 4.8, nine Chebyshev nodes are used, and the resulting polynomial curve is smoother. However, cubic splines (discussed in Chapter 9) produce an even better curve fit.
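The comparison described here can also be checked numerically without a plotter. The following Python sketch interpolates the Runge function at nine equally spaced nodes and at nine Chebyshev nodes on [−5, 5] and estimates the maximum error of each interpolant on a fine grid. The helper names are assumptions made for this sketch (they stand in for the text's Coef and Eval), and the grid size of 1001 points is arbitrary.

```python
import math


def newton_coefficients(xs, ys):
    """Divided differences for the Newton form, computed in place."""
    coef = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, t):
    """Nested evaluation of the Newton form at t."""
    value = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        value = value * (t - xs[i]) + coef[i]
    return value


def runge(x):
    return 1.0 / (1.0 + x * x)


def max_error(nodes, samples=1001):
    """Largest sampled |f(t) - p(t)| on [-5, 5] for the interpolant on `nodes`."""
    coef = newton_coefficients(nodes, [runge(x) for x in nodes])
    grid = (-5.0 + 10.0 * k / (samples - 1) for k in range(samples))
    return max(abs(runge(t) - newton_eval(nodes, coef, t)) for t in grid)


n = 8
equal_nodes = [-5.0 + 10.0 * i / n for i in range(n + 1)]
cheb_nodes = [5.0 * math.cos((2 * i + 1) * math.pi / (2 * n + 2))
              for i in range(n + 1)]

err_equal = max_error(equal_nodes)
err_cheb = max_error(cheb_nodes)
print(f"equally spaced nodes: max error = {err_equal:.4f}")
print(f"Chebyshev nodes:      max error = {err_cheb:.4f}")
```

Increasing `n` in this sketch is a simple way to watch the equally spaced error grow while the Chebyshev error shrinks.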
FIGURE 4.7 Polynomial interpolant with nine equally spaced nodes
FIGURE 4.8 Polynomial interpolant with nine Chebyshev nodes
FIGURE 4.9 Interpolation with Chebyshev points
The Chebyshev nodes are obtained by taking equally-spaced points on a semicircle and
projecting them down onto the horizontal axis, as in Figure 4.9.
Theorems on Interpolation Errors
It is possible to assess the errors of interpolation by means of a formula that involves the (n + 1)st derivative of the function being interpolated. Here is the formal statement:
■ THEOREM 1 (INTERPOLATION ERRORS I)
If $p$ is the polynomial of degree at most $n$ that interpolates $f$ at the $n + 1$ distinct nodes $x_0, x_1, \ldots, x_n$ belonging to an interval $[a, b]$ and if $f^{(n+1)}$ is continuous, then for each $x$ in $[a, b]$, there is a $\xi$ in $(a, b)$ for which
$$f(x) - p(x) = \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i) \qquad (2)$$
Proof
Observe first that Equation (2) is obviously valid if $x$ is one of the nodes $x_i$ because then both sides of the equation reduce to zero. If $x$ is not a node, let it be fixed in the remainder of the discussion, and define
$$\begin{aligned}
w(t) &= \prod_{i=0}^{n} (t - x_i) && \text{(polynomial in the variable } t\text{)} \\
c &= \frac{f(x) - p(x)}{w(x)} && \text{(constant)} \\
\varphi(t) &= f(t) - p(t) - c\,w(t) && \text{(function in the variable } t\text{)}
\end{aligned} \qquad (3)$$
Observe that $c$ is well defined because $w(x) \ne 0$ ($x$ is not a node). Note also that $\varphi$ takes the value 0 at the $n + 2$ points $x_0, x_1, \ldots, x_n$, and $x$. Now invoke Rolle's Theorem,∗ which states that between any two roots of $\varphi$, there must occur a root of $\varphi'$. Thus, $\varphi'$ has at least $n + 1$ roots. By similar reasoning, $\varphi''$ has at least $n$ roots, $\varphi'''$ has at least $n - 1$ roots, and so on. Finally, it can be inferred that $\varphi^{(n+1)}$ must have at least one root. Let $\xi$ be a root of $\varphi^{(n+1)}$. All the roots being counted in this argument are in $(a, b)$. Thus,
$$0 = \varphi^{(n+1)}(\xi) = f^{(n+1)}(\xi) - p^{(n+1)}(\xi) - c\,w^{(n+1)}(\xi)$$
In this equation, $p^{(n+1)}(\xi) = 0$ because $p$ is a polynomial of degree at most $n$. Also, $w^{(n+1)}(\xi) = (n + 1)!$ because $w(t) = t^{n+1} +$ (lower-order terms in $t$). Thus, we have
$$0 = f^{(n+1)}(\xi) - c\,(n + 1)! = f^{(n+1)}(\xi) - (n + 1)!\, \frac{f(x) - p(x)}{w(x)}$$
This equation is a rearrangement of Equation (2). ■
∗ Rolle's Theorem: Let $f$ be a function that is continuous on $[a, b]$ and differentiable on $(a, b)$. If $f(a) = f(b) = 0$, then $f'(c) = 0$ for some point $c$ in $(a, b)$.
A special case that often arises is the one in which the interpolation nodes are equally spaced.
■ LEMMA 1 (UPPER BOUND LEMMA)
Suppose that $x_i = a + ih$ for $i = 0, 1, \ldots, n$ and that $h = (b - a)/n$. Then for any $x \in [a, b]$
$$\prod_{i=0}^{n} |x - x_i| \le \frac{1}{4}\, h^{n+1}\, n! \qquad (4)$$
Proof
To establish this inequality, fix $x$ and select $j$ so that $x_j \le x \le x_{j+1}$. It is an exercise in calculus (Problem 4.2.2) to show that
$$|x - x_j|\,|x - x_{j+1}| \le \frac{h^2}{4} \qquad (5)$$
Using Equation (5), we have
$$\prod_{i=0}^{n} |x - x_i| \le \frac{h^2}{4} \prod_{i=0}^{j-1} (x - x_i) \prod_{i=j+2}^{n} (x_i - x)$$
The sketch in Figure 4.10, showing a typical case of equally spaced nodes, may be helpful. Since $x_j \le x \le x_{j+1}$, we have further
$$\prod_{i=0}^{n} |x - x_i| \le \frac{h^2}{4} \prod_{i=0}^{j-1} (x_{j+1} - x_i) \prod_{i=j+2}^{n} (x_i - x_j)$$
FIGURE 4.10 Typical location of x in equally spaced nodes (a = x0, x1, x2, x3, ..., x(j−1), xj ≤ x ≤ x(j+1), x(j+2), ..., x(n−1), xn = b)
Now use the fact that $x_i = a + ih$. Then we have $x_{j+1} - x_i = (j - i + 1)h$ and $x_i - x_j = (i - j)h$. Therefore,
$$\prod_{i=0}^{n} |x - x_i| \le \frac{h^2}{4}\, h^{j}\, h^{n-(j+2)+1} \prod_{i=0}^{j-1} (j - i + 1) \prod_{i=j+2}^{n} (i - j) \le \frac{1}{4}\, h^{n+1} (j + 1)!\,(n - j)! \le \frac{1}{4}\, h^{n+1}\, n!$$
In the last step, we use the fact that if $0 \le j \le n - 1$, then $(j + 1)!\,(n - j)! \le n!$. This, too, is left as an exercise (Problem 4.2.3). Hence, Inequality (4) is established. ■
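Inequality (4) is easy to spot-check numerically. The Python sketch below samples the product of distances from x to the equally spaced nodes and compares it with the bound of the lemma; the interval [0, 1], the choice n = 5, and the function names are arbitrary assumptions made for this illustration.

```python
import math


def node_product(x, a, b, n):
    """Product of |x - x_i| over the equally spaced nodes x_i = a + i*h."""
    h = (b - a) / n
    product = 1.0
    for i in range(n + 1):
        product *= abs(x - (a + i * h))
    return product


def lemma_bound(a, b, n):
    """Right-hand side of Inequality (4): (1/4) * h^(n+1) * n!."""
    h = (b - a) / n
    return 0.25 * h ** (n + 1) * math.factorial(n)


a, b, n = 0.0, 1.0, 5
largest = max(node_product(a + k * (b - a) / 1000, a, b, n) for k in range(1001))
print(f"largest sampled product = {largest:.6e}")
print(f"bound from the lemma    = {lemma_bound(a, b, n):.6e}")
```

A sampled check of this kind is not a proof, but it shows how much slack the bound leaves for a given n.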
We can now find a bound on the interpolation error.
■ THEOREM 2 (INTERPOLATION ERRORS II)
Let $f$ be a function such that $f^{(n+1)}$ is continuous on $[a, b]$ and satisfies $|f^{(n+1)}(x)| \le M$. Let $p$ be the polynomial of degree at most $n$ that interpolates $f$ at $n + 1$ equally spaced nodes in $[a, b]$, including the endpoints. Then on $[a, b]$,
$$|f(x) - p(x)| \le \frac{1}{4(n + 1)}\, M h^{n+1} \qquad (6)$$
where $h = (b - a)/n$ is the spacing between nodes.
Proof
Use Theorem 1 on interpolation errors and Inequality (4) in Lemma 1. ■
This theorem gives loose upper bounds on the interpolation error for different values of $n$. By other means, one can find tighter upper bounds for small values of $n$. (Cf. Problem 4.2.5.) If the nodes are not uniformly spaced then a better bound can be found by use of the Chebyshev nodes.
EXAMPLE 2
Assess the error if sin x is replaced by an interpolation polynomial that has ten equally spaced nodes in [0, 1.6875]. (See the related Example 8 in Section 4.1.)
Solution
We use Theorem 2 on interpolation errors, taking $f(x) = \sin x$, $n = 9$, $a = 0$, and $b = 1.6875$. Since $f^{(10)}(x) = -\sin x$, $|f^{(10)}(x)| \le 1$. Hence, in Inequality (6), we can let $M = 1$. The result is
$$|\sin x - p(x)| \le 1.34 \times 10^{-9}$$
Thus, $p(x)$ represents sin x on this interval with an error of at most two units in the ninth decimal place. Therefore, the interpolation polynomial that has ten equally spaced nodes on the interval [0, 1.6875] approximates sin x to at least eight decimal digits of accuracy. In fact, a careful check on a computer would reveal that the polynomial is accurate to even more decimal places. (Why?) ■
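The arithmetic of this example, and the "careful check on a computer" it mentions, can be sketched in Python as follows. The bound comes from Inequality (6) with M = 1, h = 1.6875/9, and n = 9; the measured error is sampled on a grid of 1001 points. The helper names and the grid size are assumptions made for this sketch.

```python
import math


def newton_coefficients(xs, ys):
    """Divided differences for the Newton form, computed in place."""
    coef = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef


def newton_eval(xs, coef, t):
    """Nested evaluation of the Newton form at t."""
    value = coef[-1]
    for i in range(len(coef) - 2, -1, -1):
        value = value * (t - xs[i]) + coef[i]
    return value


def equal_spacing_bound(M, a, b, n):
    """Right-hand side of Inequality (6): M * h^(n+1) / (4 * (n + 1))."""
    h = (b - a) / n
    return M * h ** (n + 1) / (4 * (n + 1))


a, b, n = 0.0, 1.6875, 9
nodes = [a + i * (b - a) / n for i in range(n + 1)]
coef = newton_coefficients(nodes, [math.sin(x) for x in nodes])

bound = equal_spacing_bound(1.0, a, b, n)
actual = max(abs(math.sin(t) - newton_eval(nodes, coef, t))
             for t in (a + k * (b - a) / 1000 for k in range(1001)))

print(f"bound from Theorem 2 = {bound:.3e}")   # about 1.34e-09
print(f"sampled max error    = {actual:.3e}")
```

The gap between the two printed numbers reflects the slack in Lemma 1: the product of node distances is usually well below its worst-case bound.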
The error expression in polynomial interpolation can also be given in terms of divided differences:
■ THEOREM 3 (INTERPOLATION ERRORS III)
If $p$ is the polynomial of degree $n$ that interpolates the function $f$ at nodes $x_0, x_1, \ldots, x_n$, then for any $x$ that is not a node,
$$f(x) - p(x) = f[x_0, x_1, \ldots, x_n, x] \prod_{i=0}^{n} (x - x_i)$$
Proof
Let $t$ be any point, other than a node, where $f(t)$ is defined. Let $q$ be the polynomial of degree $n + 1$ that interpolates $f$ at $x_0, x_1, \ldots, x_n, t$. By the Newton form of the interpolation formula [Equation (8) in Section 4.1], we have
$$q(x) = p(x) + f[x_0, x_1, \ldots, x_n, t] \prod_{i=0}^{n} (x - x_i)$$
Since $q(t) = f(t)$, this yields at once
$$f(t) = p(t) + f[x_0, x_1, \ldots, x_n, t] \prod_{i=0}^{n} (t - x_i) \qquad ■$$
The following theorem shows that there is a relationship between divided differences and derivatives.
■ THEOREM 4 (DIVIDED DIFFERENCES AND DERIVATIVES)
If $f^{(n)}$ is continuous on $[a, b]$ and if $x_0, x_1, \ldots, x_n$ are any $n + 1$ distinct points in $[a, b]$, then for some $\xi$ in $(a, b)$,
$$f[x_0, x_1, \ldots, x_n] = \frac{1}{n!}\, f^{(n)}(\xi)$$
Proof
Let $p$ be the polynomial of degree $n - 1$ that interpolates $f$ at $x_0, x_1, \ldots, x_{n-1}$. By Theorem 1 on interpolation errors, there is a point $\xi$ such that
$$f(x_n) - p(x_n) = \frac{1}{n!}\, f^{(n)}(\xi) \prod_{i=0}^{n-1} (x_n - x_i)$$
By Theorem 3 on interpolation errors, we obtain
$$f(x_n) - p(x_n) = f[x_0, x_1, \ldots, x_{n-1}, x_n] \prod_{i=0}^{n-1} (x_n - x_i)$$
Equating the two right-hand sides and dividing by the product, which is nonzero because the nodes are distinct, gives the result. ■
As an immediate consequence of this theorem, we observe that all high-order divided differences are zero for a polynomial.
■ COROLLARY 1 (DIVIDED DIFFERENCES COROLLARY)
If $f$ is a polynomial of degree $n$, then all of the divided differences $f[x_0, x_1, \ldots, x_i]$ are zero for $i \ge n + 1$.
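The corollary can be observed directly by building a divided-difference table for a polynomial. In the Python sketch below, the cubic and the node values are hypothetical choices made for illustration, and `divided_difference_table` is a name introduced here; column j of the table holds the order-j divided differences.

```python
def divided_difference_table(xs, ys):
    """Return a list of columns; column j holds f[x_i, ..., x_{i+j}] for each i."""
    n = len(xs)
    table = [[float(y) for y in ys]]
    for j in range(1, n):
        prev = table[-1]
        table.append([(prev[i + 1] - prev[i]) / (xs[i + j] - xs[i])
                      for i in range(n - j)])
    return table


def cubic(x):
    # Hypothetical test polynomial of degree 3 with leading coefficient 2.
    return 2 * x ** 3 - x + 1


xs = [-2, 0, 1, 3, 4, 6]
table = divided_difference_table(xs, [cubic(x) for x in xs])
for order, column in enumerate(table):
    print(order, column)
```

For this cubic, the third-order column is constant and equal to the leading coefficient, which agrees with Theorem 4 since the third derivative divided by 3! is 2 everywhere, and every column of order four or higher is zero, as the corollary states.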
EXAMPLE 3
Is there a cubic polynomial that takes these values?
x   1    −2    0    3   −1    7
y   −2   −56   −2   4   −16   376
Solution
If such a polynomial exists, its fourth-order divided differences f[ , , , , ] would all be zero. We form a divided-difference table to check this possibility:
x     f[ ]    f[ , ]   f[ , , ]   f[ , , , ]   f[ , , , , ]
1     −2
              18
−2    −56              −9
              27                  2
0     −2               −5                      0
              2                   2
3     4                −3                      0
              5                   2
−1    −16              11
              49
7     376
The data can be represented by a cubic polynomial because the fourth-order divided differences f[ , , , , ] are zero. From the Newton form of the interpolation formula, this polynomial is
$$p_3(x) = -2 + 18(x - 1) - 9(x - 1)(x + 2) + 2(x - 1)(x + 2)x \qquad ■$$
Summary
(1) The Runge function $f(x) = 1/(1 + x^2)$ on the interval [−5, 5] shows that high-degree polynomial interpolation and uniform spacing of nodes may not be satisfactory. The Chebyshev nodes for the interval $[a, b]$ are given by
$$x_i = \frac{1}{2}(a + b) + \frac{1}{2}(b - a)\cos\left[\left(\frac{2i + 1}{2n + 2}\right)\pi\right]$$
(2) There is a relationship between divided differences and derivatives:
$$f[x_0, x_1, \ldots, x_n] = \frac{1}{n!}\, f^{(n)}(\xi)$$
(3) Expressions for errors in polynomial interpolation are
$$f(x) - p(x) = \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i)$$
$$f(x) - p(x) = f[x_0, x_1, \ldots, x_n, x] \prod_{i=0}^{n} (x - x_i)$$
(4) For $n + 1$ equally spaced nodes, an upper bound on the error is given by
$$|f(x) - p(x)| \le \frac{M}{4(n + 1)} \left(\frac{b - a}{n}\right)^{n+1}$$
Here $M$ is an upper bound on $|f^{(n+1)}(x)|$ when $a \le x \le b$.
(5) If $f$ is a polynomial of degree $n$, then all of the divided differences $f[x_0, x_1, \ldots, x_i]$ are zero for $i \ge n + 1$.
Problems 4.2
a1. Use a divided-difference table to show that the following data can be represented by a polynomial of degree 3:
x   −2   −1   0    1    2    3
y   1    4    11   16   13   −4
2. Fill in a detail in the proof of Inequality (4) by proving Inequality (5).
3. (Continuation) Fill in another detail in the proof of Inequality (4) by showing that $(j + 1)!\,(n - j)! \le n!$ if $0 \le j \le n - 1$. Induction and a symmetry argument can be used.
4. For nonuniformly distributed nodes $a = x_0 < x_1 < \cdots < x_n = b$, where $h = \max_{1 \le i \le n} \{(x_i - x_{i-1})\}$, show that Inequality (4) is true.
5. Using Theorem 1, show directly that the maximum interpolation error is bounded by the following expressions and compare them to the bounds given by Theorem 2:
a. $\frac{1}{8} h^2 M$ for linear interpolation, where $h = x_1 - x_0$ and $M = \max_{x_0 \le x \le x_1} |f''(x)|$.
b. $\frac{1}{9\sqrt{3}} h^3 M$ for quadratic interpolation, where $h = x_1 - x_0 = x_2 - x_1$ and $M = \max_{x_0 \le x \le x_2} |f'''(x)|$.
c. $\frac{3}{128} h^4 M$ for cubic interpolation, where $h = x_1 - x_0 = x_2 - x_1 = x_3 - x_2$ and $M = \max_{x_0 \le x \le x_3} |f^{(4)}(x)|$.
a6. How accurately can we determine sin x by linear interpolation, given a table of sin x to ten decimal places, for x in [0, 2] with h = 0.01?
a7. (Continuation) Given the data
x      sin x           cos x
0.70   0.64421 76872   0.76484 21873
0.71   0.65183 37710   0.75836 18760
find approximate values of sin 0.705 and cos 0.702 by linear interpolation. What is the error?
162 Chapter 4
Interpolation and Numerical Differentiation
a8.
a9.
a 10.
11. 12.
a13.
a 14.
15.
a16.
17.
Linear interpolation in a table of function values means the following: If y0 = f (x0) and y1 = f (x1) are tabulated values, and if x0 < x < x1, then an interpolated value of f(x)is y0 +[(y1 −y0)/(x1 −x0)](x−x0),asexplainedatthebeginningofSection4.1. A table of values of cos x is required so that the linear interpolation will yield five- decimal-place accuracy for any value of x in [0, π ]. Assume that the tabular values are
equally spaced, and determine the minimum number of entries needed in this table.
Aninterpolatingpolynomialofdegree20istobeusedtoapproximatee−x ontheinterval [0, 2]. How accurate will it be? (Use 21 uniform nodes, including the endpoints of the interval. Compare results, using Theorems 1 and 2.)
Let the function f (x ) = ln x be approximated by an interpolation polynomial of degree 9 with ten nodes uniformly distributed in the interval [1, 2]. What bound can be placed on the error?
In the first theorem on interpolation errors, show that if x0 < x1 < ··· < xn and x0
1 − sin2 θ if x is in the interval
Computer Problems 4.2
7. Let f(x) = max{0, 1 − x}. Sketch the function f. Then find interpolating polynomials p of degrees 2, 4, 8, 16, and 32 to f on the interval [−4, 4], using equally spaced nodes. Print out the discrepancy f(x) − p(x) at 128 equally spaced points. Then redo the problem using Chebyshev nodes.
8. Using Coef and Eval and an automatic plotter, fit a polynomial through the following data:
x    0.0    0.60   1.50   1.70   1.90   2.1    2.30   2.60   2.8    3.00
y   −0.8   −0.34   0.59   0.59   0.23   0.1    0.28   1.03   1.5    1.44
Does the resulting curve look like a good fit? Explain.
9. Find the polynomial p of degree 10 that interpolates |x| on [−1, 1] at 11 equally spaced points. Print the difference |x| − p(x) at 41 equally spaced points. Then do the same with Chebyshev nodes. Compare.
10. Why are the Chebyshev nodes generally better than equally spaced nodes in polynomial interpolation? The answer lies in the term $\prod_{i=0}^{n} (x - x_i)$ that occurs in the error formula. If $x_i = \cos[(2i + 1)\pi/(2n + 2)]$, then
$$\left|\prod_{i=0}^{n} (x - x_i)\right| \le 2^{-n}$$
for all x in [−1, 1]. Carry out a numerical experiment to test the given inequality for n = 3, 7, 15.
11. (Student research project) Explore the topic of interpolation of multivariate scattered data, such as arise in geophysics and other areas.
12. Use mathematical software such as found in Matlab, Maple, or Mathematica to reproduce Figures 4.7 and 4.8.
13. Use symbolic mathematical software such as Maple or Mathematica to generate the interpolation polynomial for the data points in Example 2. Plot the polynomial and the data points.
14. Use graphical software to plot four or five points that happen to generate an interpolating polynomial that exhibits a great deal of oscillations. This piece of software should let you use your computer mouse to click on three or four points that visually appear to be part of a smooth curve. Next it uses Newton’s interpolating polynomial to sketch the curve through these points. Then add another point that is somewhat remote from the curve and refit all the points. Repeat, adding other points. After a few points have been added in this way, you should have evidence that polynomials can oscillate wildly.
4.3 Estimating Derivatives and Richardson Extrapolation
A numerical experiment outlined in Chapter 1 (at the end of Section 1.1, p. 10) showed that determining the derivative of a function f at a point x is not a trivial numerical problem. Specifically, if f(x) can be computed with only n digits of precision, it is difficult to calculate f′(x) numerically with n digits of precision. This difficulty can be traced to the subtraction between quantities that are nearly equal. In this section, several alternatives are offered for the numerical computation of f′(x) and f′′(x).
First-Derivative Formulas via Taylor Series
First, consider again the obvious method based on the definition of f ′(x). It consists of selecting one or more small values of h and writing
$$f'(x) \approx \frac{1}{h}\,[f(x + h) - f(x)] \tag{1}$$
What error is involved in this formula? To find out, use Taylor’s Theorem from Section 1.2:
$$f(x + h) = f(x) + h f'(x) + \tfrac{1}{2} h^2 f''(\xi)$$
Rearranging this equation gives
$$f'(x) = \frac{1}{h}\,[f(x + h) - f(x)] - \tfrac{1}{2} h f''(\xi) \tag{2}$$
Hence, we see that approximation (1) has error term $-\frac{1}{2} h f''(\xi) = O(h)$, where ξ is in the interval having endpoints x and x + h.
Equation (2) shows that in general, as h → 0, the difference between f′(x) and the estimate $h^{-1}[f(x + h) - f(x)]$ approaches zero at the same rate that h does, that is, O(h). Of course, if f′′(x) = 0, then the error term will be $-\frac{1}{6} h^2 f'''(\gamma)$, which converges to zero somewhat faster at $O(h^2)$. But usually, f′′(x) is not zero.
Equation (2) gives the truncation error for this numerical procedure, namely, $-\frac{1}{2} h f''(\xi)$. This error is present even if the calculations are performed with infinite precision; it is due to our imitating the mathematical limit process by means of an approximation formula. Additional (and worse) errors must be expected when calculations are performed on a computer with finite word length.
EXAMPLE 1
In Section 1.1, the program named First used the one-sided rule (1) to approximate the first derivative of the function f(x) = sin x at x = 0.5. Explain what happens when a large number of iterations are performed, say n = 50.
Solution
There is a total loss of all significant digits! When we examine the computer output closely, we find that, in fact, a good approximation f′(0.5) ≈ 0.87758 was found, but it deteriorated as the process continued. This was caused by the subtraction of two nearly equal quantities f(x + h) and f(x), resulting in a loss of significant digits as well as a magnification of this effect from dividing by a small value of h. We need to stop the iterations sooner! When to stop an iterative process is a common question in numerical algorithms. In this case, one can monitor the iterations to determine when they settle down, namely, when two successive ones are within a prescribed tolerance. Alternatively, we can use the truncation error term. If we want six significant digits of accuracy in the results, we set
$$\left|-\tfrac{1}{2} h f''(\xi)\right| \le \tfrac{1}{2}\, 4^{-n} < \tfrac{1}{2}\, 10^{-6}$$
since |f′′(x)| < 1 and $h = 1/4^n$. We find n > 6/log 4 ≈ 9.97. So we should stop after about ten steps in the process. (The least error of $3.1 \times 10^{-9}$ was found at iteration 14.) ■
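The behavior described in Example 1 is easy to reproduce. The following Python sketch is not the program First from Section 1.1; it is a minimal stand-in, assuming double-precision arithmetic and the step sizes $h = 4^{-n}$ used in the example, with the hypothetical helper name `forward_difference`.

```python
import math


def forward_difference(f, x, h):
    """One-sided rule (1): [f(x + h) - f(x)] / h."""
    return (f(x + h) - f(x)) / h


x = 0.5
exact = math.cos(x)            # true derivative of sin at x
errors = []                    # errors[n - 1] is the error for h = 4**(-n)
for n in range(1, 31):
    h = 4.0 ** (-n)
    approx = forward_difference(math.sin, x, h)
    errors.append(abs(approx - exact))

# Iteration at which the error is smallest.
best_n = min(range(1, 31), key=lambda n: errors[n - 1])
print("smallest error", errors[best_n - 1], "at n =", best_n)
print("error at n = 10:", errors[9])
print("error at n = 30:", errors[29])
```

For the largest values of n, the step h is so small that x + h rounds to x, the computed difference is exactly zero, and every significant digit is lost, which is the deterioration the example describes.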
As we saw in Newton’s method (Chapter 3) and will see in the Romberg method (Chapter 5), it is advantageous to have the convergence of numerical processes occur with higher powers of some quantity approaching zero. In the present situation, we want an approximation to f ′(x) in which the error behaves like O(h2). One such method is easily obtained with the aid of the following two Taylor series:
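$$\begin{cases} f(x + h) = f(x) + h f'(x) + \dfrac{1}{2!} h^2 f''(x) + \dfrac{1}{3!} h^3 f'''(x) + \dfrac{1}{4!} h^4 f^{(4)}(x) + \cdots \\[1ex] f(x - h) = f(x) - h f'(x) + \dfrac{1}{2!} h^2 f''(x) - \dfrac{1}{3!} h^3 f'''(x) + \dfrac{1}{4!} h^4 f^{(4)}(x) - \cdots \end{cases} \tag{3}$$
By subtraction, we obtain
$$f(x + h) - f(x - h) = 2h f'(x) + \frac{2}{3!} h^3 f'''(x) + \frac{2}{5!} h^5 f^{(5)}(x) + \cdots$$
This leads to a very important formula for approximating f′(x):
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{h^2}{3!} f'''(x) - \frac{h^4}{5!} f^{(5)}(x) - \cdots \tag{4}$$
Expressed otherwise,
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{5}$$
with an error whose leading term is $-\frac{1}{6} h^2 f'''(x)$, which makes it $O(h^2)$.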
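The different orders of the one-sided rule (1) and the central difference formula (5) can be observed by halving h and comparing errors. The sketch below is an illustration only (the helper names are hypothetical); it uses f(x) = sin x at x = 0.5, for which the exact derivative is cos 0.5.

```python
import math


def forward_difference(f, x, h):
    """One-sided rule (1), error O(h)."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    """Central difference formula (5), error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 0.5
exact = math.cos(x)


def err(rule, h):
    """Absolute error of a difference rule applied to sin at x."""
    return abs(rule(math.sin, x, h) - exact)


# Halving h should roughly halve an O(h) error and quarter an O(h^2) error.
forward_ratio = err(forward_difference, 0.1) / err(forward_difference, 0.05)
central_ratio = err(central_difference, 0.1) / err(central_difference, 0.05)
print("forward ratio:", forward_ratio)
print("central ratio:", central_ratio)
```

A ratio near 2 for the one-sided rule and near 4 for the central rule is what the leading error terms predict, provided h is not so small that rounding error dominates.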
By using Taylor’s Theorem with its error term, we could have obtained the following two expressions:
$$f(x + h) = f(x) + h f'(x) + \tfrac{1}{2} h^2 f''(x) + \tfrac{1}{6} h^3 f'''(\xi_1)$$
$$f(x - h) = f(x) - h f'(x) + \tfrac{1}{2} h^2 f''(x) - \tfrac{1}{6} h^3 f'''(\xi_2)$$
Then the subtraction would lead to
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \tfrac{1}{6} h^2 \left[\frac{f'''(\xi_1) + f'''(\xi_2)}{2}\right]$$
The error term here can be simplified by the following reasoning: The expression $\frac{1}{2}[f'''(\xi_1) + f'''(\xi_2)]$ is the average of two values of f′′′ on the interval [x − h, x + h]. It therefore lies between the least and greatest values of f′′′ on this interval. If f′′′ is continuous on this interval, then this average value is assumed at some point ξ. Hence, the formula with its error term can be written as
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \tfrac{1}{6} h^2 f'''(\xi)$$
This is based on the sole assumption that f′′′ is continuous on [x − h, x + h]. This formula for numerical differentiation turns out to be very useful in the numerical solution of certain differential equations, as we shall see in Chapter 14 (on boundary value problems) and Chapter 15 (on partial differential equations).
EXAMPLE 2
Modify program First in Section 1.1 so that it uses the central difference formula (5) to approximate the first derivative of the function f(x) = sin x at x = 0.5.
Solution
Using the truncation error term for the central difference formula (5), we set
$$\left|-\tfrac{1}{6} h^2 f'''(\xi)\right| \le \tfrac{1}{6}\, 4^{-2n} < \tfrac{1}{2}\, 10^{-6}$$
or n > (6 − log 3)/ log 16 ≈ 4.59. We obtain a good approximation after about five iterations
with this higher-order formula. (The least error of $3.6 \times 10^{-12}$ was at step 9.) ■
Richardson Extrapolation
Returning now to Equation (4), we write it in a simpler form:
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] + a_2 h^2 + a_4 h^4 + a_6 h^6 + \cdots \tag{6}$$
in which the constants a2,a4,… depend on f and x. When such information is available about a numerical process, it is possible to use a powerful technique known as Richardson extrapolation to wring more accuracy out of the method. This procedure is now explained, using Equation (6) as our model.
Holding f and x fixed, we define a function of h by the formula
$$\varphi(h) = \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{7}$$
From Equation (6), we see that φ(h) is an approximation to f′(x) with error of order $O(h^2)$. Our objective is to compute $\lim_{h \to 0} \varphi(h)$ because this is the quantity f′(x) that we wanted in the first place. If we select a function f and plot φ(h) for $h = 1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots$, then we get a graph (Computer Problem 4.3.5). Near zero, where we cannot actually calculate the value of φ from Equation (7), φ is approximately a quadratic function of h, since the higher-order terms from Equation (6) are negligible. Richardson extrapolation seeks to estimate the limiting value at 0 from some computed values of φ(h) near 0. Obviously, we can take any convenient sequence $h_n$ that converges to zero, calculate $\varphi(h_n)$ from Equation (7), and use these as approximations to f′(x).
But something much more clever can be done. Suppose we compute φ(h) for some h and then compute φ(h/2). By Equation (6), we have
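$$\varphi(h) = f'(x) - a_2 h^2 - a_4 h^4 - a_6 h^6 - \cdots$$
$$\varphi\left(\frac{h}{2}\right) = f'(x) - a_2 \left(\frac{h}{2}\right)^2 - a_4 \left(\frac{h}{2}\right)^4 - a_6 \left(\frac{h}{2}\right)^6 - \cdots$$
We can eliminate the dominant term in the error series by simple algebra. To do so, multiply the second equation by 4 and subtract it from the first equation. The result is
$$\varphi(h) - 4\varphi\left(\frac{h}{2}\right) = -3 f'(x) - \tfrac{3}{4} a_4 h^4 - \tfrac{15}{16} a_6 h^6 - \cdots$$
We divide by −3 and rearrange this to get
$$\varphi\left(\frac{h}{2}\right) + \tfrac{1}{3}\left[\varphi\left(\frac{h}{2}\right) - \varphi(h)\right] = f'(x) + \tfrac{1}{4} a_4 h^4 + \tfrac{5}{16} a_6 h^6 + \cdots$$
This is a marvelous discovery. Simply by adding $\frac{1}{3}[\varphi(h/2) - \varphi(h)]$ to $\varphi(h/2)$, we have apparently improved the precision to $O(h^4)$ because the error series that accompanies this new combination begins with $\frac{1}{4} a_4 h^4$. Since h will be small, this is a dramatic improvement.
We can repeat this process by letting
$$\Phi(h) = \tfrac{4}{3}\, \varphi\left(\frac{h}{2}\right) - \tfrac{1}{3}\, \varphi(h)$$
Then we have from the previous derivation that
$$\Phi(h) = f'(x) + b_4 h^4 + b_6 h^6 + \cdots$$
$$\Phi\left(\frac{h}{2}\right) = f'(x) + b_4 \left(\frac{h}{2}\right)^4 + b_6 \left(\frac{h}{2}\right)^6 + \cdots$$
We can combine these equations to eliminate the first term in the error series
$$\Phi(h) - 16\,\Phi\left(\frac{h}{2}\right) = -15 f'(x) + \tfrac{3}{4} b_6 h^6 + \cdots$$
Hence, we have
$$\Phi\left(\frac{h}{2}\right) + \tfrac{1}{15}\left[\Phi\left(\frac{h}{2}\right) - \Phi(h)\right] = f'(x) - \tfrac{1}{20} b_6 h^6 + \cdots$$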
This is yet another apparent improvement in the precision to O(h6). And now, to top it off, note that the same procedure can be repeated over and over again to kill higher and higher terms in the error. This is Richardson extrapolation.
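The two extrapolation steps just derived can be tried directly. The following Python sketch is an illustration under the assumption that f(x) = sin x, x = 0.5, and h = 0.1; the function names `extrapolate_once` and `extrapolate_twice` are hypothetical labels for the $O(h^4)$ and $O(h^6)$ combinations.

```python
import math


def phi(f, x, h):
    """Central difference quotient of Equation (7)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def extrapolate_once(f, x, h):
    """phi(h/2) + [phi(h/2) - phi(h)] / 3, with error O(h^4)."""
    half = phi(f, x, h / 2.0)
    return half + (half - phi(f, x, h)) / 3.0


def extrapolate_twice(f, x, h):
    """Apply the same idea to the once-extrapolated values, dividing the
    correction by 15; the error becomes O(h^6)."""
    half = extrapolate_once(f, x, h / 2.0)
    return half + (half - extrapolate_once(f, x, h)) / 15.0


x, h = 0.5, 0.1
exact = math.cos(x)
e0 = abs(phi(math.sin, x, h / 2.0) - exact)        # plain phi(h/2)
e1 = abs(extrapolate_once(math.sin, x, h) - exact)
e2 = abs(extrapolate_twice(math.sin, x, h) - exact)
print("phi(h/2) error:         ", e0)
print("one extrapolation error:", e1)
print("two extrapolations error:", e2)
```

Each extrapolation reuses function values at h, h/2, and h/4 only, so the extra accuracy costs a few additional evaluations of f rather than a much smaller step size.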
Essentially the same situation arises in the derivation of Romberg’s algorithm in Chapter 5. Therefore, it is desirable to have a general discussion of the procedure here. We start with an equation that includes both situations. Let φ be a function such that
$$\varphi(h) = L - \sum_{k=1}^{\infty} a_{2k} h^{2k} \tag{8}$$
where the coefficients $a_{2k}$ are not known. Equation (8) is not interpreted as the definition of φ but rather as a property that φ possesses. It is assumed that φ(h) can be computed for any h > 0 and that our objective is to approximate L accurately using φ.
Select a convenient h, and compute the numbers
$$D(n, 0) = \varphi\left(\frac{h}{2^n}\right) \qquad (n \ge 0) \tag{9}$$
Because of Equation (8), we have
$$D(n, 0) = L + \sum_{k=1}^{\infty} A(k, 0) \left(\frac{h}{2^n}\right)^{2k}$$
where $A(k, 0) = -a_{2k}$. These quantities D(n, 0) give a crude estimate of the unknown number $L = \lim_{x \to 0} \varphi(x)$. More accurate estimates are obtained via Richardson extrapolation. The extrapolation formula is
$$D(n, m) = \frac{4^m}{4^m - 1}\, D(n, m - 1) - \frac{1}{4^m - 1}\, D(n - 1, m - 1) \qquad (1 \le m \le n) \tag{10}$$
■ THEOREM 1
RICHARDSON EXTRAPOLATION THEOREM
The quantities D(n, m) defined in the Richardson extrapolation process (10) obey the equation
$$D(n, m) = L + \sum_{k=m+1}^{\infty} A(k, m) \left(\frac{h}{2^n}\right)^{2k} \qquad (0 \le m \le n) \tag{11}$$
Proof
Equation (11) is true by hypothesis if m = 0. For the purpose of an inductive proof, we assume that Equation (11) is valid for an arbitrary value of m − 1, and we prove that Equation (11) is then valid for m. Now from Equations (10) and (11) for a fixed value m, we have
$$D(n, m) = \frac{4^m}{4^m - 1}\left[L + \sum_{k=m}^{\infty} A(k, m - 1)\left(\frac{h}{2^n}\right)^{2k}\right] - \frac{1}{4^m - 1}\left[L + \sum_{k=m}^{\infty} A(k, m - 1)\left(\frac{h}{2^{n-1}}\right)^{2k}\right]$$
After simplification, this becomes
$$D(n, m) = L + \sum_{k=m}^{\infty} A(k, m - 1)\left(\frac{4^m - 4^k}{4^m - 1}\right)\left(\frac{h}{2^n}\right)^{2k} \tag{12}$$
Thus, we are led to define
$$A(k, m) = A(k, m - 1)\left(\frac{4^m - 4^k}{4^m - 1}\right)$$
At the same time, we notice that A(m, m) = 0. Hence, Equation (12) can be written as
$$D(n, m) = L + \sum_{k=m+1}^{\infty} A(k, m)\left(\frac{h}{2^n}\right)^{2k}$$
Equation (11) is true for m, and the induction is complete. ■
The significance of Equation (11) is that the summation begins with the term $(h/2^n)^{2m+2}$. Since $h/2^n$ is small, this indicates that the numbers D(n, m) are approaching L very rapidly, namely,
$$D(n, m) = L + O\left(\frac{h^{2(m+1)}}{2^{2n(m+1)}}\right)$$
In practice, one can arrange the quantities in a two-dimensional triangular array as follows:
D(0, 0)
D(1, 0)   D(1, 1)
D(2, 0)   D(2, 1)   D(2, 2)                          (13)
   ⋮         ⋮         ⋮        ⋱
D(N, 0)   D(N, 1)   D(N, 2)   ···   D(N, N)
The main tasks to generate such an array are as follows:
■ ALGORITHM 2
Richardson Extrapolation
1. Write a function for φ.
2. Decide on suitable values for N and h.
3. For i = 0, 1, …, N, compute $D(i, 0) = \varphi(h/2^i)$.
4. For $1 \le j \le i \le N$, compute
$$D(i, j) = D(i, j - 1) + (4^j - 1)^{-1}\,[D(i, j - 1) - D(i - 1, j - 1)]$$
Notice that in this algorithm, the computation of D(i, j) follows Equation (10) but has been rearranged slightly to improve its numerical properties.
EXAMPLE 3
Write a procedure to compute the derivative of a function at a point by using Equation (5) and Richardson extrapolation.
Solution
The input to the procedure will be a function f, a specific point x, a value of h, and a number n signifying how many rows in the array (13) are to be computed. The output will be the array (13). Here is a suitable pseudocode:
    procedure Derivative(f, x, n, h, (d_ij))
    integer i, j, n; real h, x; real array (d_ij)_{0:n×0:n}
    external function f
    for i = 0 to n do
        d_{i0} ← [f(x + h) − f(x − h)]/(2h)
        for j = 1 to i do
            d_{i,j} ← d_{i,j−1} + (d_{i,j−1} − d_{i−1,j−1})/(4^j − 1)
        end for
        h ← h/2
    end for
    end procedure Derivative
To test the procedure, choose f(x) = sin x, where $x_0$ = 1.23095 94154 and h = 1. Then f′(x) = cos x and $f'(x_0) = \frac{1}{3}$. A pseudocode is written as follows:
    program Test Derivative
    real array (d_ij)_{0:n×0:n}; external function f
    integer n ← 10; real h ← 1; x ← 1.23095 94154
    call Derivative(f, x, n, h, (d_ij))
    output (d_ij)
    end program Test Derivative

    real function f(x)
    real x
    f ← sin(x)
    end function f
We invite the reader to program the pseudocode and execute it on a computer. The computer output is the triangular array $(d_{ij})$ with indices $0 \le j \le i \le 10$. The most accurate value is $(d_{4,1})$ = 0.3333333433. The values $d_{i0}$, which are obtained solely by Equations (7) and (9) without any extrapolation, are not as accurate, having no more than four correct digits. ■
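One possible rendering of the pseudocode in Python is sketched below. It is a direct transcription rather than a definitive implementation: the triangular array is stored as a list of rows (an assumption of this sketch), and Python floats are double precision, so the digits printed will not necessarily match the output quoted above.

```python
import math


def derivative(f, x, n, h):
    """Return the triangular array d[i][j], 0 <= j <= i <= n, built from
    the central difference (5) and Richardson extrapolation."""
    d = [[0.0] * (i + 1) for i in range(n + 1)]
    for i in range(n + 1):
        # Column 0: central difference with the current step size.
        d[i][0] = (f(x + h) - f(x - h)) / (2.0 * h)
        # Extrapolate along row i using the previous row.
        for j in range(1, i + 1):
            d[i][j] = d[i][j - 1] + (d[i][j - 1] - d[i - 1][j - 1]) / (4.0 ** j - 1.0)
        h /= 2.0
    return d


x0 = 1.2309594154
table = derivative(math.sin, x0, 10, 1.0)
for row in table[:5]:
    print(" ".join(f"{value:.12f}" for value in row))
```

Comparing the diagonal entries with the first column shows how much of the accuracy comes from extrapolation rather than from shrinking h.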
Mathematical software is now available with algebraic manipulation capabilities. Using
them, we could write a computer program to find derivatives symbolically for a rather
large class of functions—probably all those you would encounter in a calculus course.
For example, we could verify the numerical results above by first finding the derivative
exactly and then evaluating the numerical answer cos(1.23095 94154) ≈ 0.33333 33355 since $\arccos\frac{1}{3}$ ≈ 1.23095 941543. Of course, the procedures discussed in this section are for approximating derivatives that cannot be determined exactly.
First-Derivative Formulas via Interpolation Polynomials
An important general stratagem can be used to approximate derivatives (as well as integrals and other quantities). The function f is first approximated by a polynomial p so that f ≈ p. Then we simply proceed to the approximation f′(x) ≈ p′(x) as a consequence. Of course, this strategy should be used very cautiously because the behavior of the interpolating polynomial can be oscillatory.
In practice, the approximating polynomial p is often determined by interpolation at a few points. For example, suppose that p is the polynomial of degree at most 1 that interpolates f at two nodes, $x_0$ and $x_1$. Then from Equation (8) in Section 4.1 with n = 1, we have
$$p_1(x) = f(x_0) + f[x_0, x_1](x - x_0)$$
Consequently,
$$f'(x) \approx p_1'(x) = f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \tag{14}$$
If $x_0 = x$ and $x_1 = x + h$ (see Figure 4.11), this formula is one previously considered, namely, Equation (1):
$$f'(x) \approx \frac{1}{h}\,[f(x + h) - f(x)] \tag{15}$$
FIGURE 4.11 Forward difference: two nodes ($x_0 = x$, $x_1 = x + h$)
If $x_0 = x - h$ and $x_1 = x + h$ (see Figure 4.12), the resulting formula is Equation (5):
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{16}$$
FIGURE 4.12 Central difference: two nodes ($x_0 = x - h$, $x_1 = x + h$)
Now consider interpolation with three nodes, $x_0$, $x_1$, and $x_2$. The interpolating polynomial is obtained from Equation (8) in Section 4.1:
$$p_2(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)$$
and its derivative is
$$p_2'(x) = f[x_0, x_1] + f[x_0, x_1, x_2](2x - x_0 - x_1) \tag{17}$$
Here the right-hand side consists of two terms. The first is the previous estimate in Equation (14), and the second is a refinement or correction term.
If Equation (17) is used to evaluate f′(x) when $x = \frac{1}{2}(x_0 + x_1)$, as in Equation (16), then the correction term in Equation (17) is zero. Thus, the first term in this case must be more accurate than those in other cases because the correction term adds nothing. This is why Equation (16) is more accurate than (15).
An analysis of the errors in this general procedure goes as follows: Suppose that $p_n$ is the polynomial of least degree that interpolates f at the nodes $x_0, x_1, \ldots, x_n$. Then according to the first theorem on interpolating errors in Section 4.2,
$$f(x) - p_n(x) = \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi)\, w(x)$$
where ξ is dependent on x, and $w(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$. Differentiating gives
$$f'(x) - p_n'(x) = \frac{1}{(n + 1)!}\, w(x)\, \frac{d}{dx} f^{(n+1)}(\xi) + \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi)\, w'(x) \tag{18}$$
Here, we had to assume that $f^{(n+1)}(\xi)$ is differentiable as a function of x, a fact that is known if $f^{(n+2)}$ exists and is continuous.
The first observation to make about the error formula in Equation (18) is that w(x) vanishes at each node, so if the evaluation is at a node xi , the resulting equation is simpler:
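$$f'(x_i) = p_n'(x_i) + \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi)\, w'(x_i)$$
For example, taking just two points $x_0$ and $x_1$, we obtain with n = 1 and i = 0,
$$f'(x_0) = f[x_0, x_1] + \tfrac{1}{2} f''(\xi)\, \frac{d}{dx}\big[(x - x_0)(x - x_1)\big]\Big|_{x = x_0} = f[x_0, x_1] + \tfrac{1}{2} f''(\xi)(x_0 - x_1)$$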
This is Equation (2) in disguise when x0 = x and x1 = x + h. Similar results follow with n = 1 and i = 1.
The second observation to make about Equation (18) is that it becomes simpler if x is chosen as a point where w′(x) = 0. For instance, if n = 1, then w is a quadratic function that vanishes at the two nodes x0 and x1. Because a parabola is symmetric about its axis, w′[(x0 + x1)/2] = 0. The resulting formula is
$$f'\left(\frac{x_0 + x_1}{2}\right) = f[x_0, x_1] - \tfrac{1}{8}(x_1 - x_0)^2\, \frac{d}{dx} f''(\xi)$$
As a final example, consider four interpolation points: $x_0$, $x_1$, $x_2$, and $x_3$. The interpolating polynomial from Equation (8) in Section 4.1 with n = 3 is
$$p_3(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + f[x_0, x_1, x_2, x_3](x - x_0)(x - x_1)(x - x_2)$$
Its derivative is
$$p_3'(x) = f[x_0, x_1] + f[x_0, x_1, x_2](2x - x_0 - x_1) + f[x_0, x_1, x_2, x_3]\big((x - x_1)(x - x_2) + (x - x_0)(x - x_2) + (x - x_0)(x - x_1)\big)$$
A useful special case occurs if $x_0 = x - h$, $x_1 = x + h$, $x_2 = x - 2h$, and $x_3 = x + 2h$ (see Figure 4.13). The resulting formula is
$$f'(x) \approx \frac{2}{3h}\,[f(x + h) - f(x - h)] - \frac{1}{12h}\,[f(x + 2h) - f(x - 2h)]$$
FIGURE 4.13 Central difference: four nodes ($x_2 = x - 2h$, $x_0 = x - h$, $x_1 = x + h$, $x_3 = x + 2h$)
This can be arranged in a form in which it probably should be computed with a principal term plus a correction or refining term:
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{1}{12h}\,\{f(x + 2h) - 2[f(x + h) - f(x - h)] - f(x - 2h)\} \tag{19}$$
The error term is $-\frac{1}{30} h^4 f^{(v)}(\xi) = O(h^4)$.
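The fourth-order behavior of Formula (19) can be checked in the same way as the second-order behavior of Formula (5). The sketch below is an illustration with hypothetical helper names, again using f(x) = sin x at x = 0.5; it computes (19) as a principal term minus a correction term, as suggested in the text.

```python
import math


def central_difference(f, x, h):
    """Formula (5)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def four_node_difference(f, x, h):
    """Formula (19): principal term minus a correction term."""
    principal = (f(x + h) - f(x - h)) / (2.0 * h)
    correction = (f(x + 2.0 * h) - 2.0 * (f(x + h) - f(x - h)) - f(x - 2.0 * h)) / (12.0 * h)
    return principal - correction


x = 0.5
exact = math.cos(x)
e_coarse = abs(four_node_difference(math.sin, x, 0.1) - exact)
e_fine = abs(four_node_difference(math.sin, x, 0.05) - exact)
ratio = e_coarse / e_fine          # close to 2**4 for an O(h^4) rule
print("four-node errors:", e_coarse, e_fine, "ratio:", ratio)
print("two-node error at h = 0.05:", abs(central_difference(math.sin, x, 0.05) - exact))
```

Halving h reduces the error of (19) by a factor of about 16, compared with about 4 for (5).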
Second-Derivative Formulas via Taylor Series
In the numerical solution of differential equations, it is often necessary to approximate
second derivatives. We shall derive the most important formula for accomplishing this.
Simply add the two Taylor series (3) for f(x + h) and f(x − h). The result is
$$f(x + h) + f(x - h) = 2f(x) + h^2 f''(x) + 2\left[\frac{1}{4!} h^4 f^{(4)}(x) + \cdots\right]$$
When this is rearranged, we get
$$f''(x) = \frac{1}{h^2}\,[f(x + h) - 2f(x) + f(x - h)] + E$$
where the error series is
$$E = -2\left[\frac{1}{4!} h^2 f^{(4)}(x) + \frac{1}{6!} h^4 f^{(6)}(x) + \cdots\right]$$
By carrying out the same process using Taylor’s formula with a remainder, one can show that E is also given by
$$E = -\tfrac{1}{12} h^2 f^{(4)}(\xi)$$
for some ξ in the interval (x − h, x + h). Hence, we have the approximation
$$f''(x) \approx \frac{1}{h^2}\,[f(x + h) - 2f(x) + f(x - h)] \tag{20}$$
with error $O(h^2)$.
EXAMPLE 4
Repeat Example 2, using the central difference formula (20) to approximate the second derivative of the function f(x) = sin x at the given point x = 0.5.
Solution
Using the truncation error term, we set
$$\left|-\tfrac{1}{12} h^2 f^{(4)}(\xi)\right| \le \tfrac{1}{12}\, 4^{-2n} < \tfrac{1}{2}\, 10^{-6}$$
and we obtain n > (6 − log 6)/ log 16 ≈ 4.34. Hence, the modified program First finds a good approximation of f′′(0.5) ≈ −0.47942 after about four iterations. (The least error of $3.1 \times 10^{-9}$ was obtained at iteration 6.) ■
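Example 4 can be explored with a few lines of Python. This is a sketch rather than the modified program First: it assumes double-precision arithmetic and step sizes $h = 4^{-n}$, and the helper name `second_central_difference` is hypothetical.

```python
import math


def second_central_difference(f, x, h):
    """Formula (20): [f(x + h) - 2 f(x) + f(x - h)] / h^2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x = 0.5
exact = -math.sin(x)           # second derivative of sin at x
errors = []                    # errors[n - 1] is the error for h = 4**(-n)
for n in range(1, 16):
    h = 4.0 ** (-n)
    errors.append(abs(second_central_difference(math.sin, x, h) - exact))

for n, e in enumerate(errors, start=1):
    print(n, e)
```

Because the numerator is divided by $h^2$ rather than h, rounding errors are magnified even more strongly than for first derivatives, and the useful range of n is correspondingly short.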
Approximate derivative formulas of high order can be obtained by using unequally spaced points such as at Chebyshev nodes. Recently, software packages have been developed for automatic differentiation of functions that are expressible by a computer program. They produce true derivatives with only rounding errors and no discretization errors.
Noise in Computation
An interesting question is how noise in the evaluation of f (x) affects the computation of derivatives when using the standard formulas.
The formulas for derivatives are derived with the expectation that evaluation of the function at any point is possible, with complete precision. Then the approximate derivative produced by the formula differs from the actual derivative by a quantity called the error term, which involves the spacing of the sample points and some higher derivative of the function.
If there are errors in the values of the function (noise), they can vitiate the whole process! Those errors could overwhelm the error inherent in the formulas. The inherent error arises from the fact that in deriving the formulas a Taylor series was truncated after only a few terms. It is called the truncation error. It is present even if the evaluation of the function at the required sample points is absolutely correct.
For example, consider the formula
$$f'(x) = \frac{f(x + h) - f(x - h)}{2h} - \frac{h^2}{6} f'''(\xi)$$
The term with $h^2$ is the error term. The point ξ is a nearby point (unknown). If f(x + h) and f(x − h) are in error by at most d, then one can see that the formula will produce a value for f′(x) that is in error by d/h, which is large when h is small. Noise completely spoils the process if d is large.
For a specific numerical case, suppose that $h = 10^{-2}$ and $|f'''(s)| \le 6$. Then the truncation error, E, satisfies $|E| \le 10^{-4}$. The derivative computed from the formula with complete precision is within $10^{-4}$ of the actual derivative. Suppose, however, that there is noise in the evaluation of f(x ± h) of magnitude d = h. The correct value of [f(x + h) − f(x − h)]/(2h) may differ from the noisy value by (2d)/(2h) = 1.
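The numerical case above can be imitated with a worst-case perturbation. In the following sketch the choice f(x) = sin x at x = 0.5 is an assumption made for illustration (it satisfies |f′′′| ≤ 6), and the noise is modeled deterministically as +d at x + h and −d at x − h.

```python
import math

x = 0.5
h = 1.0e-2
d = h                          # noise magnitude equal to the step size

# Noise-free central difference and its truncation error.
clean = (math.sin(x + h) - math.sin(x - h)) / (2.0 * h)
truncation_error = abs(clean - math.cos(x))

# Worst case: the two function values are perturbed in opposite directions.
noisy_plus = math.sin(x + h) + d
noisy_minus = math.sin(x - h) - d
noisy = (noisy_plus - noisy_minus) / (2.0 * h)

noise_effect = noisy - clean   # equals (2d)/(2h) = d/h up to rounding
print("truncation error:", truncation_error)
print("effect of noise: ", noise_effect)
```

The truncation error stays below $10^{-4}$, while the perturbation alone shifts the computed derivative by about 1, so the noise dominates completely.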
Summary
(1) We have derived formulas for approximating first and second derivatives. For f′(x), a one-sided formula is
$$f'(x) \approx \frac{1}{h}\,[f(x + h) - f(x)]$$
with error term $-\frac{1}{2} h f''(\xi)$. A central difference formula is
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)]$$
with error $-\frac{1}{6} h^2 f'''(\xi) = O(h^2)$. A central difference formula with a correction term is
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{1}{12h}\,[f(x + 2h) - 2f(x + h) + 2f(x - h) - f(x - 2h)]$$
with error term $-\frac{1}{30} h^4 f^{(v)}(\xi) = O(h^4)$.
(2) For f′′(x), a central difference formula is
$$f''(x) \approx \frac{1}{h^2}\,[f(x + h) - 2f(x) + f(x - h)]$$
with error term $-\frac{1}{12} h^2 f^{(4)}(\xi)$.
(3) If φ(h) is one of these formulas with error series $a_2 h^2 + a_4 h^4 + a_6 h^6 + \cdots$, then we can apply Richardson extrapolation as follows:
$$D(n, 0) = \varphi(h/2^n)$$
$$D(n, m) = D(n, m - 1) + [D(n, m - 1) - D(n - 1, m - 1)]/(4^m - 1)$$
with error terms
$$D(n, m) = L + O\left(\frac{h^{2(m+1)}}{2^{2n(m+1)}}\right)$$
Additional References for Chapter 4
For additional study, see Gautschi [1990], Goldstine [1977], Griewank [2000], Groetsch [1998], Rivlin [1990], and Whittaker and Robinson [1944].
Problems 4.3
a1. Determine the error term for the formula
$$f'(x) \approx \frac{1}{4h}\,[f(x + 3h) - f(x - h)]$$
a2. Using Taylor series, establish the error term for the formula
$$f'(0) \approx \frac{1}{2h}\,[f(2h) - f(0)]$$
3. Derive the approximation formula
$$f'(x) \approx \frac{1}{2h}\,[4f(x + h) - 3f(x) - f(x + 2h)]$$
and show that its error term is of the form $\frac{1}{3} h^2 f'''(\xi)$.
a4. Can you find an approximation formula for f′(x) that has error term $O(h^3)$ and involves only two evaluations of the function f? Prove or disprove.
5. Averaging the forward-difference formula f′(x) ≈ [f(x + h) − f(x)]/h and the backward-difference formula f′(x) ≈ [f(x) − f(x − h)]/h, each with error term O(h), results in the central-difference formula f′(x) ≈ [f(x + h) − f(x − h)]/(2h) with error $O(h^2)$. Show why. Hint: Determine at least the first term in the error series for each formula.
a6. Criticize the following analysis. By Taylor’s formula, we have
$$f(x + h) - f(x) = h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi)$$
$$f(x - h) - f(x) = -h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi)$$
So by adding, we obtain an exact expression for f′′(x):
$$f(x + h) + f(x - h) - 2f(x) = h^2 f''(x)$$
7. Criticize the following analysis. By Taylor’s formula, we have
$$f(x + h) - f(x) = h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi_1)$$
$$f(x - h) - f(x) = -h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi_2)$$
Therefore,
$$\frac{1}{h^2}\,[f(x + h) - 2f(x) + f(x - h)] = f''(x) + \frac{h}{6}\,[f'''(\xi_1) - f'''(\xi_2)]$$
The error in the approximation formula for f′′ is thus O(h).
8. Derive the two formulas
aa. $f'(x) \approx \frac{1}{4h}\,[f(x + 2h) - f(x - 2h)]$
b. $f''(x) \approx \frac{1}{4h^2}\,[f(x + 2h) - 2f(x) + f(x - 2h)]$
and establish formulas for the errors in using them.
9. Derive the following rules for estimating derivatives:
aa. $f'''(x) \approx \frac{1}{2h^3}\,[f(x + 2h) - 2f(x + h) + 2f(x - h) - f(x - 2h)]$
ab. $f^{(4)}(x) \approx \frac{1}{h^4}\,[f(x + 2h) - 4f(x + h) + 6f(x) - 4f(x - h) + f(x - 2h)]$
and their error terms. Which is more accurate? Hint: Consider the Taylor series for D(h) ≡ f(x + h) − f(x − h) and S(h) ≡ f(x + h) + f(x − h).
10. Establish the formula
$$f''(x) \approx \frac{2}{h^2}\left[\frac{f(x_0)}{1 + \alpha} - \frac{f(x_1)}{\alpha} + \frac{f(x_2)}{\alpha(\alpha + 1)}\right]$$
in the following two ways, using the unevenly spaced points $x_0 < x_1 < x_2$, where $x_1 - x_0 = h$ and $x_2 - x_1 = \alpha h$. Notice that this formula reduces to the standard central-difference formula (20) when α = 1.
a. Approximate f(x) by the Newton form of the interpolating polynomial of degree 2.
b. Calculate the undetermined coefficients A, B, and C in the expression $f''(x) \approx A f(x_0) + B f(x_1) + C f(x_2)$ by making it exact for the three polynomials 1, $x - x_1$, and $(x - x_1)^2$ and thus exact for all polynomials of degree 2.
a11. (Continuation) Using Taylor series, show that
$$f'(x_1) = \frac{f(x_2) - f(x_0)}{x_2 - x_0} - (\alpha - 1)\frac{h}{2} f''(x_1) + O(h^2)$$
Establish that the error for approximating $f'(x_1)$ by $[f(x_2) - f(x_0)]/(x_2 - x_0)$ is $O(h^2)$ when $x_1$ is midway between $x_0$ and $x_2$ but only O(h) otherwise.
a12. A certain calculation requires an approximation formula for f′(x) + f′′(x). How well does the expression
$$\left(\frac{2 + h}{2h^2}\right) f(x + h) - \left(\frac{2}{h^2}\right) f(x) + \left(\frac{2 - h}{2h^2}\right) f(x - h)$$
serve? Derive this approximation and its error term.
a13. The values of a function f are given at three points $x_0$, $x_1$, and $x_2$. If a quadratic interpolating polynomial is used to estimate f′(x) at $x = \frac{1}{2}(x_0 + x_1)$, what formula will result?
14. Consider Equation (19).
a. Fill in the details in its derivation.
b. Using Taylor series, derive its error term.
15. Show how Richardson extrapolation would work on Formula (20).
a16. If $\varphi(h) = L - c_1 h - c_2 h^2 - c_3 h^3 - \cdots$, then what combination of φ(h) and φ(h/2) should give an accurate estimate of L?
17. (Continuation) State and prove a theorem analogous to the theorem on Richardson extrapolation for the situation of the preceding problem.
18. If $\varphi(h) = L - c_1 h^{1/2} - c_2 h^{2/2} - c_3 h^{3/2} - \cdots$, then what combination of φ(h) and φ(h/2) should give an accurate estimate of L?
19. Show that Richardson extrapolation can be carried out for any two values of h. Thus, if $\varphi(h) = L - O(h^p)$, then from $\varphi(h_1)$ and $\varphi(h_2)$, a more accurate estimate of L is given by
$$\varphi(h_2) + \frac{h_2^p}{h_1^p - h_2^p}\,[\varphi(h_2) - \varphi(h_1)]$$
a20. Consider a function φ such that $\lim_{h \to 0} \varphi(h) = L$ and $L - \varphi(h) \approx c e^{-1/h}$ for some constant c. By combining φ(h), φ(h/2), and φ(h/3), find an accurate estimate of L.
21. Consider the approximate formula
$$f'(x) \approx \frac{3}{2h^3} \int_{-h}^{h} t\, f(x + t)\, dt$$
Determine its error term. Does the function f have to be differentiable for the formula to be meaningful? Hint: This is a novel method of doing numerical differentiation. The interested reader can read more about Lanczos’ generalized derivative in Groetsch [1998].
22. Derive the error terms for D(3, 0), D(3, 1), D(3, 2), and D(3, 3).
23. Differentiation and integration are mutual inverse processes. Differentiation is an inherently sensitive problem in which small changes in the data can cause large changes in the results. Integration is a smoothing process and is inherently stable. Display two functions that have very different derivatives but equal definite integrals and vice versa.
24. Establish the error terms for these rules:
a. $f'''(x) \approx \frac{1}{2h^3}\,[3f(x + h) - 10f(x) + 12f(x - h) - 6f(x - 2h) + f(x - 3h)]$
b. $f'(x) + \frac{h}{2} f'' \approx \frac{1}{h}\,[f(x + h) - f(x)]$
c. $f^{(iv)}(x) \approx \frac{1}{h^4}\left[\frac{4}{3} f(x + 3h) - 6f(x + 2h) + 12f(x + h)\right]$ if f(x) = f′(x) = 0.
Computer Problems 4.3
1. Test procedure Derivative on the following functions at the points indicated in a single computer run. Interpret the results.
a. f(x) = cos x at x = 0
b. f(x) = arctan x at x = 1
c. f(x) = |x| at x = 0
2. (Continuation) Write and test a procedure similar to Derivative that computes f′′(x) with repeated Richardson extrapolation.
a3. Find f′(0.25) as accurately as possible, using only the function corresponding to the pseudocode below and a method for numerical differentiation:
    real function f(x)
    integer i; real a, b, c, x
    a ← 1; b ← cos(x)
    for i = 1 to 5 do
        c ← b
        b ← √(ab)
        a ← (a + c)/2
    end for
    f ← 2 arctan(1)/a
    end function f
4. Carry out a numerical experiment to compare the accuracy of Formulas (5) and (19) on a function f whose derivative can be computed precisely. Take a sequence of values for h, such as $4^{-n}$ with $0 \le n \le 12$.
5. Using the discussion of the geometric interpretation of Richardson extrapolation, produce a graph to show that φ(h) looks like a quadratic curve in h.
6. Use symbolic mathematical software such as Maple or Mathematica to establish the first term in the error series for Equation (19).
7. Use mathematical software such as found in Matlab, Maple, or Mathematica to redo Example 1.
5
Numerical Integration
In electrical field theory, it is proved that the magnetic field induced by a current flowing in a circular loop of wire has intensity
$$H(x) = \frac{4Ir}{r^2 - x^2} \int_0^{\pi/2} \left[1 - \left(\frac{x}{r}\right)^2 \sin^2\theta\right]^{1/2} d\theta$$
where I is the current, r is the radius of the loop, and x is the distance from the center to the point where the magnetic intensity is being computed (0 ≤ x ≤ r). If I, r, and x are given, we have a formidable integral to evaluate. It is an elliptic integral and not expressible in terms of familiar functions. But H can be computed precisely by the methods of this chapter. For example, if I = 15.3, r = 120, and x = 84, we find H = 1.355661135 accurate to nine decimals.
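As a rough check of the value quoted above, the following sketch evaluates H numerically. It assumes the integral has the form H(x) = 4Ir/(r² − x²) ∫ from 0 to π/2 of [1 − (x/r)² sin²θ]^{1/2} dθ and uses a composite Simpson's rule; the function name `magnetic_intensity` and the choice of 200 subintervals are illustrative assumptions, not part of the text.

```python
import math

def magnetic_intensity(current, radius, x, n=200):
    """Approximate H(x) by composite Simpson's rule on [0, pi/2]; n must be even."""
    k2 = (x / radius) ** 2

    def integrand(theta):
        return math.sqrt(1.0 - k2 * math.sin(theta) ** 2)

    a, b = 0.0, math.pi / 2
    h = (b - a) / n
    total = integrand(a) + integrand(b)
    for i in range(1, n):
        # Odd-indexed nodes get weight 4, even-indexed interior nodes weight 2.
        total += (4 if i % 2 else 2) * integrand(a + i * h)
    integral = h * total / 3.0
    return 4.0 * current * radius / (radius ** 2 - x ** 2) * integral

H = magnetic_intensity(15.3, 120.0, 84.0)
print(f"H = {H:.9f}")
```

With I = 15.3, r = 120, and x = 84, the factor 4Ir/(r² − x²) equals 1, so the printed value is just the elliptic integral itself and can be compared with the figure given in the text.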
5.1 Lower and Upper Sums
Elementary calculus focuses largely on two important processes of mathematics: differen- tiation and integration. In Section 1.1, numerical differentiation was considered briefly; it was taken up again in Section 4.3. In this chapter, the process of integration is examined from the standpoint of numerical mathematics.
Definite and Indefinite Integrals
It is customary to distinguish two types of integrals: the definite and the indefinite integral. The indefinite integral of a function is another function or a class of functions, whereas the definite integral of a function over a fixed interval is a number. For example,
Indefinite integral: $\int x^2\,dx = \tfrac{1}{3}x^3 + C$
Definite integral: $\int_0^2 x^2\,dx = \tfrac{8}{3}$
Actually, a function has not just one but many indefinite integrals. These differ from each other by constants. Thus, in the preceding example, any constant value may be assigned
to C, and the result is still an indefinite integral. In elementary calculus, the concept of an indefinite integral is identical with the concept of an antiderivative. An antiderivative of a function f is any function F having the property that F ′ = f .
The definite and indefinite integrals are related by the Fundamental Theorem of Calculus,∗ which states that $\int_a^b f(x)\,dx$ can be computed by first finding an antiderivative F of f and then evaluating F(b) − F(a). Thus, using traditional notation, we have
$$\int_1^3 (x^2 - 2)\,dx = \left[\frac{x^3}{3} - 2x\right]_1^3 = \left(\frac{27}{3} - 6\right) - \left(\frac{1}{3} - 2\right) = \frac{14}{3}$$
As another example of the Fundamental Theorem of Calculus, we can write
$$\int_a^b F'(x)\,dx = F(b) - F(a)$$
$$\int_a^x F'(t)\,dt = F(x) - F(a)$$
If this second equation is differentiated with respect to x, the result is (and here we have put f = F′)
$$\frac{d}{dx} \int_a^x f(t)\,dt = f(x)$$
This last equation shows that $\int_a^x f(t)\,dt$ must be an antiderivative (indefinite integral) of f.
The foregoing technique for computing definite integrals is virtually the only one emphasized in elementary calculus. The definite integral of a function, however, has an interpretation as the area under a curve, and so the existence of a numerical value for $\int_a^b f(x)\,dx$ should not depend logically on our limited ability to find antiderivatives. Thus, for instance,
$$\int_0^1 e^{x^2}\,dx$$
has a precise numerical value despite the fact that there is no elementary function F such that $F'(x) = e^{x^2}$. By the preceding remarks, $e^{x^2}$ does have antiderivatives, one of which is
$$F(x) = \int_0^x e^{t^2}\,dt$$
However, this form of the function F is of no help in determining the numerical value sought.
Lower and Upper Sums
The existence of the definite integral of a nonnegative function f on a closed interval [a, b] is based on an interpretation of that integral as the area under the graph of f . The definite integral is defined by means of two concepts, the lower sums of f and the upper sums of
f ; these are approximations to the area under the graph.
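To make the idea of lower and upper sums concrete, here is a minimal sketch for the integrand e^{x²} on [0, 1], the example mentioned earlier. It assumes equal subintervals and an increasing integrand, so that the minimum on each subinterval is attained at its left endpoint and the maximum at its right endpoint; the helper name `lower_upper_sums` is illustrative.

```python
import math

def lower_upper_sums(f, a, b, n):
    """Lower and upper sums on n equal subintervals, assuming f is increasing on [a, b]."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    lower = h * sum(f(x) for x in xs[:-1])   # left endpoints give the minima
    upper = h * sum(f(x) for x in xs[1:])    # right endpoints give the maxima
    return lower, upper

def integrand(x):
    return math.exp(x * x)

lower, upper = lower_upper_sums(integrand, 0.0, 1.0, 1000)
print(lower, upper, upper - lower)
```

For an increasing integrand the gap between the two sums is h[f(b) − f(a)], here (e − 1)/1000, so the integral is bracketed to about three decimal places with 1000 subintervals.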
∗Fundamental Theorem of Calculus: If f is continuous on the interval [a, b] and F is an antiderivative of f, then $\int_a^b f(x)\,dx = F(b) - F(a)$.
Let P be a partition of the interval [a, b] given by
P={a=x0
2
induces us to take 92 subintervals. ■
Recursive Trapezoid Formula for Equal Subintervals
In the next section, we require a formula for the composite trapezoid rule when the interval [a, b] is subdivided into $2^n$ equal parts. By Formula (1), we have
$$T(f; P) = h \sum_{i=1}^{n-1} f(x_i) + \frac{h}{2}[f(x_0) + f(x_n)] = h \sum_{i=1}^{n-1} f(a + ih) + \frac{h}{2}[f(a) + f(b)]$$
If we now replace n by $2^n$ and use $h = (b - a)/2^n$, the preceding formula becomes
$$R(n, 0) = h \sum_{i=1}^{2^n - 1} f(a + ih) + \frac{h}{2}[f(a) + f(b)] \tag{7}$$
Here, we have introduced the notation that will be used in Section 5.3 on the Romberg algorithm, namely, R(n, 0). It denotes the result of applying the composite trapezoid rule with $2^n$ equal subintervals.
In the Romberg algorithm, it will also be necessary to have a means of computing R(n, 0) from R(n − 1, 0) without involving unneeded evaluations of f. For example, the computation of R(2, 0) utilizes the values of f at the five points a, a + (b − a)/4, a + 2(b − a)/4, a + 3(b − a)/4, and b. In computing R(3, 0), we need values of f at these five points, as well as at four new points: a + (b − a)/8, a + 3(b − a)/8, a + 5(b − a)/8, and a + 7(b − a)/8 (see Figure 5.4). The computation should take advantage of the previously computed result. The manner of doing so is now explained.
If R(n − 1, 0) has been computed and R(n, 0) is to be computed, we use the identity
$$R(n, 0) = \tfrac{1}{2} R(n-1, 0) + \left[ R(n, 0) - \tfrac{1}{2} R(n-1, 0) \right]$$
FIGURE 5.4 $2^n$ equal subintervals: the array entries R(0, 0), R(1, 0), R(2, 0), R(3, 0) use $2^0$, $2^1$, $2^2$, $2^3$ subintervals of [a, b]
It is desirable to compute the bracketed expression with as little additional work as possible. Fixing $h = (b - a)/2^n$ for the analysis and putting
$$C = \frac{h}{2}[f(a) + f(b)]$$
we have, from Equation (7),
$$R(n, 0) = h \sum_{i=1}^{2^n - 1} f(a + ih) + C \tag{8}$$
$$R(n-1, 0) = 2h \sum_{j=1}^{2^{n-1} - 1} f(a + 2jh) + 2C \tag{9}$$
Notice that the subintervals for R(n − 1, 0) are twice the size of those for R(n, 0). Now from Equations (8) and (9), we have
$$R(n, 0) - \tfrac{1}{2} R(n-1, 0) = h \sum_{i=1}^{2^n - 1} f(a + ih) - h \sum_{j=1}^{2^{n-1} - 1} f(a + 2jh) = h \sum_{k=1}^{2^{n-1}} f[a + (2k-1)h]$$
Here, we have taken account of the fact that each term in the first sum that corresponds to an even value of i is canceled by a term in the second sum. This leaves only terms that correspond to odd values of i.
To summarize:
■ THEOREM 2
RECURSIVE TRAPEZOID FORMULA
If R(n − 1, 0) is available, then R(n, 0) can be computed by the formula
$$R(n, 0) = \tfrac{1}{2} R(n-1, 0) + h \sum_{k=1}^{2^{n-1}} f[a + (2k-1)h] \qquad (n \geq 1) \tag{10}$$
using $h = (b - a)/2^n$. Here, $R(0, 0) = \tfrac{1}{2}(b - a)[f(a) + f(b)]$.
This formula allows us to compute a sequence of approximations to a definite integral using the trapezoid rule without reevaluating the integrand at points where it has already been evaluated.
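The recursive trapezoid formula can be sketched in a few lines. The following is an illustrative Python version, not the book's procedure; the names `recursive_trapezoid` and `composite_trapezoid` are assumptions, and the second function exists only to confirm that the recursion reproduces the direct composite rule with 2^n subintervals.

```python
import math

def recursive_trapezoid(f, a, b, n_max):
    """Return [R(0,0), R(1,0), ..., R(n_max,0)] using the recursive trapezoid formula."""
    h = b - a
    r = [0.5 * h * (f(a) + f(b))]            # R(0, 0)
    for n in range(1, n_max + 1):
        h /= 2.0                              # now h = (b - a) / 2**n
        # Only the new (odd-indexed) points are evaluated at this level.
        new_points = sum(f(a + (2 * k - 1) * h) for k in range(1, 2 ** (n - 1) + 1))
        r.append(0.5 * r[-1] + h * new_points)
    return r

def composite_trapezoid(f, a, b, m):
    """Direct composite trapezoid rule with m equal subintervals, for comparison."""
    h = (b - a) / m
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, m)))

column = recursive_trapezoid(math.sin, 0.0, math.pi, 6)
for n, value in enumerate(column):
    print(n, value, composite_trapezoid(math.sin, 0.0, math.pi, 2 ** n))
```

Each level costs only 2^{n−1} new function evaluations, whereas recomputing the composite rule from scratch would cost 2^n + 1.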
Multidimensional Integration
Here, we give a brief account of multidimensional numerical integration. For simplicity, we illustrate with the trapezoid rule for the interval [0, 1], using n + 1 equally spaced points. The step size is therefore h = 1/n. The composite trapezoid rule is then
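$$\int_0^1 f(x)\,dx \approx \frac{h}{2}\left[f(0) + 2\sum_{i=1}^{n-1} f\!\left(\frac{i}{n}\right) + f(1)\right]$$
We write this in the form
$$\int_0^1 f(x)\,dx \approx \sum_{i=0}^{n} C_i\, f\!\left(\frac{i}{n}\right)$$
where, since h = 1/n, the weights are
$$C_i = \begin{cases} h/2, & i = 0 \\ h, & 0 < i < n \\ h/2, & i = n \end{cases}$$
The error is $O(h^2) = O(n^{-2})$ for functions having a continuous second derivative.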
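If one is faced with a two-dimensional integration over the unit square, then the trapezoid rule can be applied twice:
$$\int_0^1\!\!\int_0^1 f(x, y)\,dx\,dy \approx \int_0^1 \sum_{\alpha_1=0}^{n} C_{\alpha_1} f\!\left(\frac{\alpha_1}{n}, y\right) dy = \sum_{\alpha_1=0}^{n} C_{\alpha_1} \int_0^1 f\!\left(\frac{\alpha_1}{n}, y\right) dy$$
$$\approx \sum_{\alpha_1=0}^{n} C_{\alpha_1} \sum_{\alpha_2=0}^{n} C_{\alpha_2} f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}\right) = \sum_{\alpha_1=0}^{n} \sum_{\alpha_2=0}^{n} C_{\alpha_1} C_{\alpha_2} f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}\right)$$
The error here is again $O(h^2)$, because each of the two applications of the trapezoid rule entails an error of $O(h^2)$.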
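In the same way, we can integrate a function of k variables. Suitable notation is the vector $x = (x_1, x_2, \ldots, x_k)^T$ for the independent variable. The region now is taken to be the k-dimensional cube $[0, 1]^k \equiv [0, 1] \times [0, 1] \times \cdots \times [0, 1]$. Then we obtain a multidimensional numerical integration rule
$$\int_{[0,1]^k} f(x)\,dx \approx \sum_{\alpha_1=0}^{n} \sum_{\alpha_2=0}^{n} \cdots \sum_{\alpha_k=0}^{n} C_{\alpha_1} C_{\alpha_2} \cdots C_{\alpha_k}\, f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}, \ldots, \frac{\alpha_k}{n}\right)$$
The error is still $O(h^2) = O(n^{-2})$, provided that f has continuous partial derivatives $\partial^2 f/\partial x_i^2$.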
Besides the error involved, one must consider the effort, or work, required to attain a desired level of accuracy. The work in the one-variable case is O(n). In the two-variable case, it is $O(n^2)$, and it is $O(n^k)$ for k variables. The error, now expressed as a function of the number of nodes $N = n^k$, is
$$O(h^2) = O(n^{-2}) = O\!\left((n^k)^{-2/k}\right) = O(N^{-2/k})$$
Thus, the quality of the numerical approximation of the integral declines very quickly as the number of variables, k, increases. Expressed in other terms, if a constant order of accuracy is to be retained while the number of variables, k, goes up, the number of nodes must go up like $n^k$. These remarks indicate why the Monte Carlo method for numerical integration becomes more attractive for high-dimensional integration. (This subject is discussed in Chapter 13.)
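The product rule and its cost can be illustrated with a short sketch. It assumes the trapezoid weights h/2, h, ..., h, h/2 with h = 1/n on each axis and simply loops over all (n + 1)^k grid points, which is exactly the exponential growth in work discussed above. The names `trapezoid_weights` and `product_trapezoid` are illustrative.

```python
import itertools

def trapezoid_weights(n):
    """Composite trapezoid weights on [0, 1] with n equal subintervals."""
    h = 1.0 / n
    return [h / 2 if i in (0, n) else h for i in range(n + 1)]

def product_trapezoid(f, k, n):
    """Product trapezoid rule on the cube [0, 1]^k; visits (n + 1)**k nodes."""
    c = trapezoid_weights(n)
    total = 0.0
    for alpha in itertools.product(range(n + 1), repeat=k):
        weight = 1.0
        for a_i in alpha:
            weight *= c[a_i]
        total += weight * f(*(a_i / n for a_i in alpha))
    return total

approx = product_trapezoid(lambda x, y: x * x + y * y, k=2, n=10)
print(approx, 2.0 / 3.0)
```

For f(x, y) = x² + y² the exact integral is 2/3, and the one-dimensional trapezoid error h²/6 for x² appears once for each variable, so the computed value exceeds 2/3 by h²/3.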
Summary
(1) To estimate $\int_a^b f(x)\,dx$, divide the interval [a, b] into subintervals according to the partition P = {a = x0
Extrapolation of the same type can be used in still more general situations, as is illustrated next (and in the problems).
If φ is a function with the property
$$\varphi(x) = L + a_1 x^{-1} + a_2 x^{-2} + a_3 x^{-3} + \cdots$$
how can L be estimated using Richardson extrapolation?
Obviously, $L = \lim_{x\to\infty} \varphi(x)$; thus, L can be estimated by evaluating φ(x) for a succession of ever-larger values of x. To use extrapolation, we write
$$\varphi(x) = L + a_1 x^{-1} + a_2 x^{-2} + a_3 x^{-3} + \cdots$$
$$\varphi(2x) = L + 2^{-1} a_1 x^{-1} + 2^{-2} a_2 x^{-2} + 2^{-3} a_3 x^{-3} + \cdots$$
$$2\varphi(2x) = 2L + a_1 x^{-1} + 2^{-1} a_2 x^{-2} + 2^{-2} a_3 x^{-3} + \cdots$$
$$2\varphi(2x) - \varphi(x) = L - 2^{-1} a_2 x^{-2} - 3 \cdot 2^{-2} a_3 x^{-3} - \cdots$$
Thus, having computed φ(x) and φ(2x), we can compute a new function ψ(x) = 2φ(2x) − φ(x). It should be a better approximation to L because its error series begins with $x^{-2}$ and is $O(x^{-2})$ as x → ∞. This process can be repeated, as in the Romberg algorithm. ■
Here is a concrete illustration of the preceding example. We want to estimate limx→∞ φ(x) from the following table of numerical values:
x 1 2 4 8 16 32 64 128
φ (x ) 21.1100 16.4425 14.3394 13.3455 12.8629 12.6253 12.5073 12.4486
A tentative hypothesis is that φ has the form in the preceding example. When we compute the values of the function ψ(x) = 2φ(2x) − φ(x), we get a new table of values:
x 1 2 4 8 16 32 64
ψ (x ) 11.7750 12.2363 12.3516 12.3803 12.3877 12.3893 12.3899
It therefore seems reasonable to believe that the value of limx→∞ φ(x) is approximately 12.3899. If we do another extrapolation, we should compute θ(x) = [4ψ(2x) − ψ(x)]/3;
values for this table are
x 1 2 4 8 16 32
θ (x ) 12.3901 12.3900 12.3899 12.3902 12.3898 12.3901
For the precision of the given data, we conclude that limx→∞ φ(x) = 12.3900 to within
roundoff error.
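The two extrapolation tables above can be reproduced directly from the tabulated values of φ. The sketch below is only an illustration; the dictionary names `phi`, `psi`, and `theta` are assumptions chosen to match the symbols in the text.

```python
# Tabulated values of phi(x) taken from the table in the text.
phi = {1: 21.1100, 2: 16.4425, 4: 14.3394, 8: 13.3455,
       16: 12.8629, 32: 12.6253, 64: 12.5073, 128: 12.4486}

# First extrapolation removes the x**-1 term: psi(x) = 2*phi(2x) - phi(x).
psi = {x: 2 * phi[2 * x] - phi[x] for x in phi if 2 * x in phi}

# Second extrapolation removes the x**-2 term: theta(x) = [4*psi(2x) - psi(x)] / 3.
theta = {x: (4 * psi[2 * x] - psi[x]) / 3 for x in psi if 2 * x in psi}

for x in sorted(theta):
    print(f"{x:4d}  {psi[x]:8.4f}  {theta[x]:8.4f}")
```

The θ values scatter around 12.3900 in the fourth decimal, which is what one expects once the extrapolated error has dropped below the four-decimal precision of the data.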
Summary
(1) By using the Recursive Trapezoid Rule, we find that the first column of the Romberg algorithm is
$$R(n, 0) = \tfrac{1}{2} R(n-1, 0) + h \sum_{k=1}^{2^{n-1}} f[a + (2k-1)h]$$
where $h = (b - a)/2^n$ and n ≥ 1. The second and successive columns in the Romberg array are generated by the Richardson extrapolation formula and are
$$R(n, m) = R(n, m-1) + \frac{1}{4^m - 1}\,[R(n, m-1) - R(n-1, m-1)]$$
with n ≥ 1 and m ≥ 1. The error is $O(h^2)$ for the first column, $O(h^4)$ for the second column, $O(h^6)$ for the third column, and so on. Check the ratios
$$\frac{R(n, m) - R(n-1, m)}{R(n+1, m) - R(n, m)} \approx 4^{m+1}$$
to test whether the algorithm is working.
(2) If the expression L is approximated by φ(h) and if these entities are related by the error series
$$L = \varphi(h) + a h^{\alpha} + b h^{\beta} + c h^{\gamma} + \cdots$$
then a more accurate approximation is
$$L \approx \varphi\!\left(\frac{h}{2}\right) + \frac{1}{2^{\alpha} - 1}\left[\varphi\!\left(\frac{h}{2}\right) - \varphi(h)\right]$$
with error $O(h^{\beta})$.
Additional References
For additional study, see Abramowitz and Stegun [1964], Clenshaw and Curtis [1960], Davis and Rabinowitz [1984], de Boor [1971], Dixon [1974], Fraser and Wilson [1966], Gentleman [1972], Ghizetti and Ossiccini [1970], Havie [1969], Kahaner [1971], Krylov [1962], O’Hara and Smith [1968], Stroud [1974], and Stroud and Secrest [1966].
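The two formulas in item (1) of the summary, the recursive trapezoid rule for the first column and Richardson extrapolation for the later columns, can be combined into a compact sketch of the Romberg array. This is an illustrative Python version rather than the book's procedure; the name `romberg` and the test integrand 4/(1 + x²) on [0, 1], whose integral is π, are assumptions chosen for the demonstration.

```python
import math

def romberg(f, a, b, n_rows):
    """Return the lower-triangular Romberg array R[n][m], 0 <= m <= n < n_rows."""
    r = [[0.0] * (n + 1) for n in range(n_rows)]
    h = b - a
    r[0][0] = 0.5 * h * (f(a) + f(b))
    for n in range(1, n_rows):
        h /= 2.0                                   # h = (b - a) / 2**n
        odd_sum = sum(f(a + (2 * k - 1) * h) for k in range(1, 2 ** (n - 1) + 1))
        r[n][0] = 0.5 * r[n - 1][0] + h * odd_sum  # recursive trapezoid rule
        for m in range(1, n + 1):                  # Richardson extrapolation
            r[n][m] = r[n][m - 1] + (r[n][m - 1] - r[n - 1][m - 1]) / (4 ** m - 1)
    return r

table = romberg(lambda x: 4.0 / (1.0 + x * x), 0.0, 1.0, 6)
estimate = table[5][5]
for row in table:
    print("  ".join(f"{value:.10f}" for value in row))
```

Printing the array row by row makes it easy to watch the diagonal entries settle and to apply the ratio check from the summary to any column.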
Problems 5.3
a1. What is R(5, 3) if R(5, 2) = 12 and R(4, 2) = −51, in the Romberg algorithm?
2. If R(3, 2) = −54 and R(4, 2) = 72, what is R(4, 3)?
a3. Compute R(5, 2) from R(3, 0) = R(4, 0) = 8 and R(5, 0) = −4.
4. Let $f(x) = 2^x$. Approximate $\int_0^4 f(x)\,dx$ by the trapezoid rule using partition points 0, 2, and 4. Repeat by using partition points 0, 1, 2, 3, and 4. Now apply Romberg extrapolation to obtain a better approximation.
a5. By the Romberg algorithm, approximate $\int_0^2 4\,dx/(1 + x^2)$ by evaluating R(1, 1).
6. Using the Romberg scheme, establish a numerical value for the approximation
$$\int_0^1 e^{-(10x)^2}\,dx \approx R(1, 1)$$
Compute the approximation to only three decimal places of accuracy.
a7. We are going to use the Romberg method to estimate $\int_0^1 \sqrt{x}\,\cos x\,dx$. Will the method work? Will it work well? Explain.
a8. By combining R(0, 0) and R(1, 0) for the partition P = {−h < 0 < h}, determine R(1, 1).
9. In calculus, a technique of integration by substitution is developed. For example, if the substitution $x = z^2$ is made in the integral $\int_0^1 (e^x/\sqrt{x})\,dx$, the result is $2\int_0^1 e^{z^2}\,dz$. Verify this and discuss the numerical aspects of this example. Which form is likely to produce a more accurate answer by the Romberg method?
a10. How many evaluations of the function (integrand) are needed if the Romberg array with n rows and n columns is to be constructed?
11. Using Equation (2), fill in the circles in the following diagram with coefficients used in the Romberg algorithm:
R(0, 0) R(1, 0) R(2, 0) R(3, 0) R(4, 0)
R(1, 1) R(2, 1) R(3, 1) R(4, 1)
R(2, 2) R(3, 2) R(4, 2)
R(3, 3) R(4, 3)
R(4, 4)
12. Derive the quadrature rule for R(1, 1) in terms of the function f evaluated at partition points a, a + h, and a + 2h, where h = (b − a)/2. Do the same for R(n, 1) with $h = (b - a)/2^n$.
a13. (Continuation) Derive the quadrature rule R(2, 2) in terms of the function f evaluated at a, a + h, a + 2h, a + 3h, and b, where h = (b − a)/4.
a14. We want to compute $X = \lim_{n\to\infty} S_n$, and we have already computed the two numbers $u = S_{10}$ and $v = S_{30}$. It is known that $X = S_n + C n^{-3}$. What is X in terms of u and v?
a15. Suppose that we want to estimate $Z = \lim_{h\to 0} f(h)$ and that we calculate $f(1), f(2^{-1}), f(2^{-2}), f(2^{-3}), \ldots, f(2^{-10})$. Then suppose also that it is known that $Z = f(h) + ah^2 + bh^4 + ch^6$. Show how to obtain an improved estimate of Z from the 11 numbers already computed. Show how Z can be determined exactly from any 4 of the 11 computed numbers.
16. Show how Richardson extrapolation works on a sequence $x_1, x_2, x_3, \ldots$ that converges to L as n → ∞ in such a way that $L - x_n = a_2 n^{-2} + a_3 n^{-3} + a_4 n^{-4} + \cdots$.
a17. Let $x_n$ be a sequence that converges to L as n → ∞. If $L - x_n$ is known to be of the form $a_3 n^{-3} + a_4 n^{-4} + \cdots$ (in which the coefficients are unknown), how can the convergence of the sequence be accelerated by taking combinations of $x_n$ and $x_{n+1}$?
a18. If the Romberg algorithm is operating on a function that possesses continuous derivatives of all orders on the interval of integration, then what is a bound on the quantity $\left|\int_a^b f(x)\,dx - R(n, m)\right|$ in terms of h?
19. Show that the precise form of Equation (5) is
$$\int_a^b f(x)\,dx = R(n, 1) - \sum_{j=1}^{\infty} \left(\frac{4^j - 1}{3 \times 4^j}\right) a_{2j+2}\, h^{2j+2}$$
20. Derive Equation (6), and show that its precise form is
$$\int_a^b f(x)\,dx = R(n, 2) + \sum_{j=2}^{\infty} \left(\frac{4^j - 1}{3 \times 4^j}\right) \left(\frac{4^{j-1} - 1}{15 \times 4^{j-1}}\right) a_{2j+2}\, h^{2j+2}$$
21. Use the fact that the coefficients in Equation (3) have the form
$$a_k = c_k\,[f^{(k-1)}(b) - f^{(k-1)}(a)]$$
to prove that $\int_a^b f(x)\,dx = R(n, m)$ if f is a polynomial of degree ≤ 2m − 2.
a22. In the Romberg algorithm, R(n, 0) denotes an estimate of $\int_a^b f(x)\,dx$ with subintervals of size $h = (b - a)/2^n$. If it were known that
$$\int_a^b f(x)\,dx = R(n, 0) + a_3 h^3 + a_6 h^6 + \cdots$$
how would we have to modify the Romberg algorithm?
a23. Show that if f′′ is continuous, then the first column in the Romberg array converges to the integral in such a way that the error at the nth step is bounded in magnitude by a constant times $4^{-n}$.
a24. Assuming that the first column of the Romberg array converges to $\int_a^b f(x)\,dx$, show that the second column does also.
25. (Continuation) In the preceding problem, we established the elementary property that if $\lim_{n\to\infty} R(n, 0) = \int_a^b f(x)\,dx$, then $\lim_{n\to\infty} R(n, 1) = \int_a^b f(x)\,dx$. Show that
$$\lim_{n\to\infty} R(n, 2) = \lim_{n\to\infty} R(n, 3) = \cdots = \lim_{n\to\infty} R(n, n) = \int_a^b f(x)\,dx$$
26. a. Using Formula (7), prove Euler-Maclaurin coefficients can be generated recursively:
$$A_0 = 1, \qquad A_k = -\sum_{j=1}^{k} \frac{A_{k-j}}{(j+1)!}$$
b. Determine $A_k$ for 1 ≤ k ≤ 6.
a27. Evaluate E in the theorem on the Euler-Maclaurin formula for this special case: a = 0, b = 2π, f(x) = 1 + cos 4x, n = 4, and m arbitrary.
Computer Problems 5.3
a1. Compute eight rows and columns in the Romberg array for $\int_{1.3}^{2.19} x^{-1} \sin x\,dx$.
2. Design and carry out an experiment using the Romberg algorithm. Suggestions: For a function that possesses many continuous derivatives on the interval, the method should work well. Try such a function first. If you choose one whose integral you can compute by other means, you will acquire a better understanding of the accuracy in the Romberg algorithm. For example, try definite integrals for
$$\int (1 + x)^{-1}\,dx = \ln(1 + x) \qquad \int e^x\,dx = e^x \qquad \int (1 + x^2)^{-1}\,dx = \arctan x$$
3. Test the Romberg algorithm on a bad function, such as $\sqrt{x}$ on [0, 1]. Why is it bad?
4. The transcendental number π is the area of a circle whose radius is 1. Show that
$$8 \int_0^{1/\sqrt{2}} \left(\sqrt{1 - x^2} - x\right) dx = \pi$$
with the help of a diagram, and use this integral to approximate π by the Romberg method.
a5. Apply the Romberg method to estimate $\int_0^{\pi} (2 + \sin 2x)^{-1}\,dx$. Observe the high precision obtained in the first column of the array, that is, by the simple trapezoidal estimates.
a6. Compute $\int_0^{\pi} x \cos 3x\,dx$ by the Romberg algorithm using n = 6. What is the correct answer?
a7. An integral of the form $\int_0^{\infty} f(x)\,dx$ can be transformed into an integral on a finite interval by making a change of variable. Verify, for instance, that the substitution x = −ln y changes the integral $\int_0^{\infty} f(x)\,dx$ into $\int_0^1 y^{-1} f(-\ln y)\,dy$. Use this idea to compute $\int_0^{\infty} [e^{-x}/(1 + x^2)]\,dx$ by means of the Romberg algorithm, using 128 evaluations of the transformed function.
8. By the Romberg algorithm, calculate
$$\int_0^{\infty} e^{-x} \sqrt{1 - \sin x}\,dx$$
9. Calculate
$$\int_0^1 \frac{\sin x}{\sqrt{x}}\,dx$$
by the Romberg algorithm. Hint: Consider making a change of variable.
10. Compute log 2 by using the Romberg algorithm on a suitable integral.
a11. The Bessel function of order 0 is defined by the equation
$$J_0(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta)\,d\theta$$
Calculate $J_0(1)$ by applying the Romberg algorithm to the integral.
12. Recode the Romberg procedure so that all the trapezoid rule results are computed first and stored in the first column. Then in a separate procedure,
procedure Extrapolate(n, (r_i))
carry out Richardson extrapolation, and store the results in the lower triangular part of the (r_i) array. What are the advantages and disadvantages of this procedure over the routine given in the text? Test on the two integrals $\int_0^4 dx/(1 + x)$ and $\int_{-1}^1 e^x\,dx$ using only one computer run.
13. (Student research project) Study the Clenshaw-Curtis method for numerical quadrature. If possible, read the original paper by Clenshaw and Curtis [1960] and then program the method. If programmed well, it should be superior to the Romberg method in many cases. For further information on it, consult papers by Dixon [1974], Fraser and Wilson [1966], Gentleman [1972], Havie [1969], Kahaner [1971], and O’Hara and Smith [1968].
14. (Student research project) Numerical integration is an ideal problem for use on a parallel computer, since the interval of integration can be subdivided into subintervals on each of which the integral can be approximated simultaneously and independently of each other. Investigate how numerical integration can be done in parallel. If you have access to a parallel computer or can simulate a parallel computer on a collection of PCs, write a parallel program to approximate π by using the standard example
$$\int_0^1 (1 + x^2)^{-1}\,dx$$
with a basic rule such as the midpoint rule. Vary the number of processors used and the number of subintervals. You can read about parallel computing in books such as Pacheco [1997], Quinn [1994], and others or at any of the numerous sites on the Internet.
15. Use a mathematical software system with symbolic capabilities such as Mathematica to verify the relationship between Ak and the Bernoulli numbers for k = 6.
6
Additional Topics on Numerical Integration
Some interesting test integrals (for which numerical values are known) are
$$\int_0^1 \frac{dx}{\sqrt{\sin x}} \qquad \int_0^{\infty} e^{-x^3}\,dx \qquad \int_0^1 x \sin(1/x)\,dx$$
An important feature that is desirable in a numerical integration scheme is the capability of dealing with functions that have peculiarities, such as becoming infinite at some point or being highly oscillatory on certain subintervals. Another special case arises when the interval of integration is infinite. In this chapter, additional methods for numerical integration are introduced: the Gaussian quadrature formulas and an adaptive scheme based on Simpson’s Rule. Gaussian formulas can often be used when the integrand has a singularity at an endpoint of the interval. The adaptive Simpson code is robust in the sense that it can concentrate the calculations on troublesome parts of the interval, where the integrand may have some unexpected behavior. Robust quadrature procedures automatically detect singularities or rapid fluctuations in the integrand and deal with them appropriately.
6.1 Simpson’s Rule and Adaptive Simpson’s Rule
Basic Simpson’s Rule
The basic trapezoid rule for approximating $\int_a^b f(x)\,dx$ is based on an estimation of the area beneath the curve over the interval [a, b] using a trapezoid. The function of integration f(x) is taken to be a straight line between f(a) and f(b). The numerical integration formula is of the form
$$\int_a^b f(x)\,dx \approx A f(a) + B f(b)$$
where the values of A and B are selected so that the resulting approximate formula will correctly integrate any linear function. It suffices to integrate exactly the two functions 1 and x because a polynomial of degree at most one is a linear combination of these two monomials. To simplify the calculations, let a = 0 and b = 1 and find a formula of the following type:
$$\int_0^1 f(x)\,dx \approx A f(0) + B f(1)$$
Thus, these equations should be fulfilled:
$$f(x) = 1: \quad \int_0^1 dx = 1 = A + B$$
$$f(x) = x: \quad \int_0^1 x\,dx = \tfrac{1}{2} = B$$
The solution is $A = B = \tfrac{1}{2}$, and the integration formula is
$$\int_0^1 f(x)\,dx \approx \tfrac{1}{2}[f(0) + f(1)]$$
By a linear mapping y = (b − a)x + a from [0, 1] to [a, b], the basic Trapezoid Rule for the interval [a, b] is obtained:
$$\int_a^b f(x)\,dx \approx \tfrac{1}{2}(b - a)[f(a) + f(b)]$$
See Figure 6.1 for a graphical illustration.
FIGURE 6.1 Basic Trapezoid Rule (the line $p_1(x)$ through f(a) and f(b) approximating f(x) on [a, b])
The next obvious generalization is to take two subintervals $\left[a, \frac{a+b}{2}\right]$ and $\left[\frac{a+b}{2}, b\right]$ and to approximate $\int_a^b f(x)\,dx$ by taking the function of integration f(x) to be a quadratic polynomial passing through the three points f(a), $f\!\left(\frac{a+b}{2}\right)$, and f(b). Let us seek a numerical integration formula of the following type:
$$\int_a^b f(x)\,dx \approx A f(a) + B f\!\left(\frac{a+b}{2}\right) + C f(b)$$
The function f is assumed to be continuous on the interval [a, b]. The coefficients A, B, and C will be chosen such that the formula above will give correct values for the integral whenever f is a quadratic polynomial. It suffices to integrate correctly the three functions 1, x, and $x^2$ because a polynomial of degree at most 2 is a linear combination of those 3 monomials. To simplify the calculations, let a = −1 and b = 1 and consider the equation
$$\int_{-1}^1 f(x)\,dx \approx A f(-1) + B f(0) + C f(1)$$
Thus, these equations should be fulfilled:
$$f(x) = 1: \quad \int_{-1}^1 dx = 2 = A + B + C$$
$$f(x) = x: \quad \int_{-1}^1 x\,dx = 0 = -A + C$$
$$f(x) = x^2: \quad \int_{-1}^1 x^2\,dx = \tfrac{2}{3} = A + C$$
The solution is $A = \tfrac{1}{3}$, $C = \tfrac{1}{3}$, and $B = \tfrac{4}{3}$. The resulting formula is
$$\int_{-1}^1 f(x)\,dx \approx \tfrac{1}{3}[f(-1) + 4f(0) + f(1)]$$
Using a linear mapping $y = \tfrac{1}{2}(b - a)x + \tfrac{1}{2}(a + b)$ from [−1, 1] to [a, b], we obtain the basic Simpson’s Rule over the interval [a, b]:
$$\int_a^b f(x)\,dx \approx \tfrac{1}{6}(b - a)\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right]$$
See Figure 6.2 for an illustration.
FIGURE 6.2 Basic Simpson’s Rule (the quadratic $p_2(x)$ through f(a), $f\!\left(\frac{a+b}{2}\right)$, and f(b) approximating f(x) on [a, b])
Figure 6.3 shows graphically the difference between the Trapezoid Rule and the Simpson’s Rule.
FIGURE 6.3 Example of Trapezoid Rule vs. Simpson’s Rule (f with the Trapezoid line $p_1(x)$ and the Simpson quadratic $p_2(x)$ on [a, b])
EXAMPLE 1
Find approximate values for the integral
$$\int_0^1 e^{-x^2}\,dx$$
using the basic Trapezoid Rule and the basic Simpson’s Rule. Carry five significant digits.
Solution
Let a = 0 and b = 1. For the basic Trapezoid Rule (1), we obtain
$$\int_0^1 e^{-x^2}\,dx \approx \tfrac{1}{2}\left[e^0 + e^{-1}\right] \approx 0.5[1 + 0.36788] = 0.68394$$
which is correct to only one significant decimal place (rounded). For the basic Simpson’s Rule (2), we find
$$\int_0^1 e^{-x^2}\,dx \approx \tfrac{1}{6}\left[e^0 + 4e^{-0.25} + e^{-1}\right] \approx 0.16667[1 + 4(0.77880) + 0.36788] = 0.7472$$
which is correct to three significant decimal places (rounded). Recall that $\int_0^1 e^{-x^2}\,dx = \tfrac{1}{2}\sqrt{\pi}\,\operatorname{erf}(1) \approx 0.74682$. ■
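The arithmetic in Example 1 is easy to check by machine. The sketch below is illustrative only; the function names `basic_trapezoid` and `basic_simpson` are assumptions, and the reference value uses the error function from Python's standard `math` module.

```python
import math

def basic_trapezoid(f, a, b):
    """Basic Trapezoid Rule: (b - a)/2 * [f(a) + f(b)]."""
    return 0.5 * (b - a) * (f(a) + f(b))

def basic_simpson(f, a, b):
    """Basic Simpson's Rule: (b - a)/6 * [f(a) + 4 f((a+b)/2) + f(b)]."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))

def integrand(x):
    return math.exp(-x * x)

trap = basic_trapezoid(integrand, 0.0, 1.0)
simp = basic_simpson(integrand, 0.0, 1.0)
exact = 0.5 * math.sqrt(math.pi) * math.erf(1.0)
print(f"trapezoid {trap:.5f}  simpson {simp:.5f}  reference {exact:.5f}")
```

The same two functions can be applied to 1, x, and x² on [−1, 1] to confirm that Simpson's Rule integrates each of them exactly, which is how its coefficients were chosen.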
Simpson’s Rule
A numerical integration rule over two equal subintervals with partition points a, a + h, and a + 2h = b is the widely used basic Simpson’s Rule:
$$\int_a^{a+2h} f(x)\,dx \approx \frac{h}{3}[f(a) + 4f(a + h) + f(a + 2h)] \tag{1}$$
Simpson’s Rule computes exactly the integral of an interpolating quadratic polynomial over an interval of length 2h using three points; namely, the two endpoints and the middle point. It can be derived by integrating over the interval [0, 2h] the Lagrange quadratic polynomial p through the points (0, f(0)), (h, f(h)), and (2h, f(2h)):
$$\int_0^{2h} f(x)\,dx \approx \int_0^{2h} p(x)\,dx = \frac{h}{3}[f(0) + 4f(h) + f(2h)]$$
where
$$p(x) = \frac{1}{2h^2}(x - h)(x - 2h) f(0) - \frac{1}{h^2} x(x - 2h) f(h) + \frac{1}{2h^2} x(x - h) f(2h)$$
The error term in Simpson’s rule can be established by using the Taylor series from Section 1.2:
$$f(a + h) = f + h f' + \frac{1}{2!} h^2 f'' + \frac{1}{3!} h^3 f''' + \frac{1}{4!} h^4 f^{(4)} + \cdots$$
where the functions f, f′, f′′, . . . on the right-hand side are evaluated at a. Now replacing h by 2h, we have
$$f(a + 2h) = f + 2h f' + 2h^2 f'' + \frac{4}{3} h^3 f''' + \frac{2^4}{4!} h^4 f^{(4)} + \cdots$$
Using these two series, we obtain
$$f(a) + 4f(a + h) + f(a + 2h) = 6f + 6h f' + 4h^2 f'' + 2h^3 f''' + \frac{20}{4!} h^4 f^{(4)} + \cdots$$
and, thereby, we have
$$\frac{h}{3}[f(a) + 4f(a + h) + f(a + 2h)] = 2h f + 2h^2 f' + \frac{4}{3} h^3 f'' + \frac{2}{3} h^4 f''' + \frac{20}{3 \cdot 4!} h^5 f^{(4)} + \cdots \tag{2}$$
Hence, we have a series for the right-hand side of Equation (1). Now let’s find one for the left-hand side. The Taylor series for F(a + 2h) is
$$F(a + 2h) = F(a) + 2h F'(a) + 2h^2 F''(a) + \frac{4}{3} h^3 F'''(a) + \frac{2}{3} h^4 F^{(4)}(a) + \frac{2^5}{5!} h^5 F^{(5)}(a) + \cdots$$
Let
$$F(x) = \int_a^x f(t)\,dt$$
By the Fundamental Theorem of Calculus, F′ = f. We observe that F(a) = 0 and F(a + 2h) is the integral on the left-hand side of Equation (1). Since F′′ = f′, F′′′ = f′′, and so on, we have
$$\int_a^{a+2h} f(x)\,dx = 2h f + 2h^2 f' + \frac{4}{3} h^3 f'' + \frac{2}{3} h^4 f''' + \frac{2^5}{5 \cdot 4!} h^5 f^{(4)} + \cdots \tag{3}$$
Subtracting Equation (2) from Equation (3), we obtain
$$\int_a^{a+2h} f(x)\,dx = \frac{h}{3}[f(a) + 4f(a + h) + f(a + 2h)] - \frac{h^5}{90} f^{(4)} - \cdots$$
A more detailed analysis will show that the error term for the basic Simpson’s Rule (1) is $-(h^5/90) f^{(4)}(\xi) = O(h^5)$ as h → 0, for some ξ between a and a + 2h. We can rewrite the basic Simpson’s Rule over the interval [a, b] as
$$\int_a^b f(x)\,dx \approx \frac{b - a}{6}\left[f(a) + 4f\!\left(\frac{a + b}{2}\right) + f(b)\right]$$
with error term
$$-\frac{1}{90}\left(\frac{b - a}{2}\right)^5 f^{(4)}(\xi)$$
for some ξ in (a, b).
Composite Simpson’s Rule
Suppose that the interval [a, b] is subdivided into an even number of subintervals, say n, each of width h = (b − a)/n. Then the partition points are $x_i = a + ih$ for 0 ≤ i ≤ n, where n is divisible by 2. Now from basic calculus, we have
$$\int_a^b f(x)\,dx = \sum_{i=1}^{n/2} \int_{a+2(i-1)h}^{a+2ih} f(x)\,dx$$
Using the basic Simpson’s Rule, we have, for the right-hand side,
$$\approx \sum_{i=1}^{n/2} \frac{h}{3}\{f(a + 2(i-1)h) + 4f(a + (2i-1)h) + f(a + 2ih)\}$$
$$= \frac{h}{3}\left\{f(a) + \sum_{i=1}^{(n/2)-1} f(a + 2ih) + 4\sum_{i=1}^{n/2} f(a + (2i-1)h) + \sum_{i=1}^{(n/2)-1} f(a + 2ih) + f(b)\right\}$$
Thus, we obtain
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left\{[f(a) + f(b)] + 4\sum_{i=1}^{n/2} f[a + (2i-1)h] + 2\sum_{i=1}^{(n-2)/2} f(a + 2ih)\right\}$$
where h = (b − a)/n. The error term is
$$-\frac{1}{180}(b - a) h^4 f^{(4)}(\xi)$$
Many formulas for numerical integration have error estimates that involve derivatives of the function being integrated. An important point that is frequently overlooked is that such error estimates depend on the function having derivatives. So if a piecewise function is being integrated, the numerical integration should be broken up over the region to coincide with the regions of smoothness of the function. Another important point is that no polynomial ever becomes infinite in the finite plane, so any integration technique that uses polynomials to approximate the integrand will fail to give good results without extra work at integrable singularities.
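The composite Simpson's Rule derived above translates almost line for line into code. The following sketch is illustrative; the name `composite_simpson` is an assumption, and the final lines merely compare the errors for n = 8 and n = 16 on the integrand e^{−x²}, where the h⁴ error term predicts a reduction by a factor of about 16.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's Rule with n equal subintervals (n even), h = (b - a)/n."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    odd = sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    even = sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return h / 3.0 * (f(a) + f(b) + 4.0 * odd + 2.0 * even)

def integrand(x):
    return math.exp(-x * x)

reference = 0.5 * math.sqrt(math.pi) * math.erf(1.0)
err8 = abs(composite_simpson(integrand, 0.0, 1.0, 8) - reference)
err16 = abs(composite_simpson(integrand, 0.0, 1.0, 16) - reference)
print(err8, err16, err8 / err16)
```

Because the error estimate involves the fourth derivative, the factor-of-16 behavior should be expected only for integrands that are smooth on the whole interval, in line with the caution in the preceding paragraph.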
An Adaptive Simpson’s Scheme
Now we develop an adaptive scheme based on Simpson’s Rule for obtaining a numerical approximation to the integral
b
f(x)dx
a
In this adaptive algorithm, the partitioning of the interval [a, b] is not selected beforehand
but is automatically determined. The partition is generated adaptively so that more and smaller subintervals are used in some parts of the interval and fewer and larger subintervals are used in other parts.
In the adaptive process, we divide the interval [a,b] into two subintervals and then decide whether each of them is to be divided into more subintervals. This procedure is continued until some specified accuracy is obtained throughout the entire interval [a, b]. Since the integrand f may vary in its behavior on the interval [a, b], we do not expect the final partitioning to be uniform but to vary in the density of the partition points.
It is necessary to develop the test for deciding whether subintervals should continue to be divided. One application of Simpson’s Rule over the interval [a, b] can be written as
$$I \equiv \int_a^b f(x)\,dx = S(a, b) + E(a, b)$$
where
$$S(a, b) = \frac{b - a}{6}\left[f(a) + 4f\!\left(\frac{a + b}{2}\right) + f(b)\right]$$
and
$$E(a, b) = -\frac{1}{90}\left(\frac{b - a}{2}\right)^5 f^{(4)}(a) + \cdots$$
Letting h = b − a, we have
$$I = S^{(1)} + E^{(1)} \tag{4}$$
where $S^{(1)} = S(a, b)$ and
$$E^{(1)} = -\frac{1}{90}\left(\frac{h}{2}\right)^5 f^{(4)}(a) + \cdots = -\frac{1}{90}\left(\frac{h}{2}\right)^5 C$$
Here we assume that $f^{(4)}$ remains a constant value C throughout the interval [a, b]. Now two applications of Simpson’s Rule over the interval [a, b] give
$$I = S^{(2)} + E^{(2)} \tag{5}$$
where
$$S^{(2)} = S(a, c) + S(c, b)$$
where c = (a + b)/2, as in Figure 6.4, and
$$E^{(2)} = -\frac{1}{90}\left(\frac{h/2}{2}\right)^5 f^{(4)}(a) + \cdots - \frac{1}{90}\left(\frac{h/2}{2}\right)^5 f^{(4)}(c) + \cdots$$
$$= -\frac{1}{90}\left(\frac{h/2}{2}\right)^5 \left[f^{(4)}(a) + f^{(4)}(c)\right] + \cdots = -\frac{1}{90}\,\frac{1}{2^5}\left(\frac{h}{2}\right)^5 (2C) = \frac{1}{16}\left[-\frac{1}{90}\left(\frac{h}{2}\right)^5 C\right]$$
FIGURE 6.4 Simpson’s rule (One Simpson’s Rule over [a, b] of width h; Two Simpson’s Rules over [a, c] and [c, b], each of width h/2, with c = (a + b)/2)
Again, we use the assumption that $f^{(4)}$ remains a constant value C throughout the interval [a, b]. We find that
$$16E^{(2)} = E^{(1)}$$
Subtracting Equation (5) from (4), we have
$$S^{(2)} - S^{(1)} = E^{(1)} - E^{(2)} = 15E^{(2)}$$
From this equation and Equation (5), we have
$$I = S^{(2)} + E^{(2)} = S^{(2)} + \frac{1}{15}\left[S^{(2)} - S^{(1)}\right]$$
This value of I is the best we have at this step, and we use the inequality
$$\frac{1}{15}\left|S^{(2)} - S^{(1)}\right| < \varepsilon \tag{6}$$
to guide the adaptive process.
If Test (6) is not satisfied, the interval [a, b] is split into two subintervals, [a, c] and
[c, b], where c is the midpoint c = (a + b)/2. On each of these subintervals, we again use Test (6) with ε replaced by ε/2 so that the resulting tolerance will be ε over the entire interval [a, b]. A recursive procedure handles this quite nicely.
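To see why we take ε/2 on each subinterval, recall that
$$I = \int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx = I_{\text{left}} + I_{\text{right}}$$
If S is the sum of approximations $S^{(2)}_{\text{left}}$ over [a, c] and $S^{(2)}_{\text{right}}$ over [c, b], we have
$$|I - S| = \left|I_{\text{left}} + I_{\text{right}} - S^{(2)}_{\text{left}} - S^{(2)}_{\text{right}}\right| \leq \left|I_{\text{left}} - S^{(2)}_{\text{left}}\right| + \left|I_{\text{right}} - S^{(2)}_{\text{right}}\right|$$
$$= \frac{1}{15}\left|S^{(2)}_{\text{left}} - S^{(1)}_{\text{left}}\right| + \frac{1}{15}\left|S^{(2)}_{\text{right}} - S^{(1)}_{\text{right}}\right|$$
using Equation (6). Hence, if we require
$$\frac{1}{15}\left|S^{(2)}_{\text{left}} - S^{(1)}_{\text{left}}\right| \leq \frac{\varepsilon}{2} \quad \text{and} \quad \frac{1}{15}\left|S^{(2)}_{\text{right}} - S^{(1)}_{\text{right}}\right| \leq \frac{\varepsilon}{2}$$
then |I − S| ≤ ε over the entire interval [a, b].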
We now describe an adaptive Simpson recursive procedure. The interval [a, b] is partitioned into four subintervals of width (b − a)/4. Two Simpson approximations are computed by using two double-width subintervals and four single-width subintervals; that is,
$$\text{one\_simpson} \leftarrow \frac{h}{6}\left[f(a) + 4f\!\left(\frac{a + b}{2}\right) + f(b)\right]$$
$$\text{two\_simpson} \leftarrow \frac{h}{12}\left[f(a) + 4f\!\left(\frac{a + c}{2}\right) + 2f(c) + 4f\!\left(\frac{c + b}{2}\right) + f(b)\right]$$
where h = b − a and c = (a + b)/2.
According to Inequality (6), if one_simpson and two_simpson agree to within 15ε, then the interval [a, b] does not need to be subdivided further to obtain an accurate approximation to the integral $\int_a^b f(x)\,dx$. In this case, the value of [16(two_simpson) − (one_simpson)]/15 is used as the approximate value of the integral over the interval [a, b]. If the desired accuracy for the integral has not been obtained, then the interval [a, b] is divided in half. The subintervals [a, c] and [c, b], where c = (a + b)/2, are used in a recursive call to the adaptive Simpson procedure with tolerance ε/2 on each. This procedure terminates whenever all subintervals satisfy Inequality (6). Alternatively, a maximum number of allowable levels of subdividing intervals is used as well to terminate the procedure prematurely. The recursive procedure provides an elegant and simple way to keep track of which subintervals satisfy the tolerance test and which need to be divided further.
Example Using Adaptive Simpson Procedure
The main program for calling the adaptive Simpson procedure can best be presented in terms of a concrete example. An approximate value for the integral
$$\int_0^{5\pi/4} \frac{\cos(2x)}{e^x}\,dx \qquad (7)$$
is desired with accuracy $\frac{1}{2} \times 10^{-3}$.
FIGURE 6.5 Adaptive integration of $\int_0^{5\pi/4} \cos(2x)/e^x\,dx$
The graph of the integrand function is shown in Figure 6.5. We see that this function has many turns and twists, so accurately determining the area under the curve may be difficult. A function procedure f is written for the integrand. Its name is the first argument in the procedure, and necessary interface statements are needed here and in the main program. Other arguments are the values of the lower and upper limits a and b of the integral, the desired accuracy ε, the level of the current subinterval, and the maximum level depth. Here is the pseudocode:
recursive real function Simpson(f, a, b, ε, level, level_max) result(simpson_result)
integer level, level_max; external function f
real a, b, c, d, e, h
level ← level + 1
h ← b − a
c ← (a + b)/2
one_simpson ← h[f(a) + 4f(c) + f(b)]/6
d ← (a + c)/2
e ← (c + b)/2
two_simpson ← h[f(a) + 4f(d) + 2f(c) + 4f(e) + f(b)]/12
if level ≥ level_max then
    simpson_result ← two_simpson
    output “maximum level reached”
else
    if |two_simpson − one_simpson| < 15ε then
        simpson_result ← two_simpson + (two_simpson − one_simpson)/15
    else
        left_simpson ← Simpson(f, a, c, ε/2, level, level_max)
        right_simpson ← Simpson(f, c, b, ε/2, level, level_max)
        simpson_result ← left_simpson + right_simpson
    end if
end if
end function Simpson
By writing a driver computer program for this pseudocode and executing it on a computer, we obtain an approximate value of 0.208 for the integral (7). The adaptive Simpson procedure uses a different number of panels for different parts of the curve as shown in Figure 6.5.
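As an illustration, the recursive procedure can be sketched in Python. This is a minimal sketch, not the book's driver program: the names `adaptive_simpson` and `integrand`, the default depth limit of 30, and the choice to return the finer estimate silently at the depth limit are assumptions made here.

```python
import math

def adaptive_simpson(f, a, b, eps, level=0, level_max=30):
    """Recursive adaptive Simpson rule on [a, b] with tolerance eps."""
    level += 1
    h = b - a
    c = (a + b) / 2
    one_simpson = h * (f(a) + 4 * f(c) + f(b)) / 6
    d = (a + c) / 2
    e = (c + b) / 2
    two_simpson = h * (f(a) + 4 * f(d) + 2 * f(c) + 4 * f(e) + f(b)) / 12
    if level >= level_max:
        # Depth limit reached: accept the finer estimate without the test.
        return two_simpson
    if abs(two_simpson - one_simpson) < 15 * eps:
        # Test (6) passed: return the extrapolated value.
        return two_simpson + (two_simpson - one_simpson) / 15
    # Test (6) failed: halve the interval and the tolerance.
    left = adaptive_simpson(f, a, c, eps / 2, level, level_max)
    right = adaptive_simpson(f, c, b, eps / 2, level, level_max)
    return left + right

def integrand(x):
    """Integrand of integral (7)."""
    return math.cos(2 * x) / math.exp(x)

approx = adaptive_simpson(integrand, 0.0, 5 * math.pi / 4, 0.5e-3)
print(approx)
```

For comparison, integral (7) has the closed form $\frac{1}{5} + \frac{2}{5}e^{-5\pi/4}$, which follows from the antiderivative $e^{-x}(2\sin 2x - \cos 2x)/5$.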
Newton-Cotes Rules
Newton-Cotes quadrature formulas for approximating $\int_a^b f(x)\,dx$ are obtained by approximating the function of integration f(x) by interpolating polynomials. The rules are closed when they involve function values at the ends of the interval of integration. Otherwise, they are said to be open.
Some closed Newton-Cotes rules with error terms are as follows. Here, $a = x_0$, $b = x_n$, $h = (b - a)/n$, $x_i = x_0 + ih$ for $i = 0, 1, \ldots, n$, $f_i = f(x_i)$, and $a = x_0 < \xi$
            j ← i
        end if
    end for
    l_j ↔ l_k
    for i = k + 1 to n do
        xmult ← a_{l_i,k}/a_{l_k,k}
        a_{l_i,k} ← xmult
        for j = k + 1 to n do
            a_{l_i,j} ← a_{l_i,j} − (xmult)a_{l_k,j}
        end for
    end for
end for
deallocate array (s_i)
end procedure Gauss
A detailed explanation of the above procedure is now presented. In the first loop, the initial form of the index array is being established, namely, li = i. Then the scale array (si) is computed.
The statement for k = 1 to n − 1 do initiates the principal outer loop. The index k is the subscript of the variable whose coefficients will be made 0 in the array (ai j ); that is, k is the index of the column in which new 0’s are to be created. Remember that the 0’s in the array (ai j ) do not actually appear because those storage locations are used for the multipliers. This fact can be seen in the line of the procedure where xmult is stored in the array (ai j ). (See Section 8.1 on the LU factorization of A for why this is done.)
Once k has been set, the first task is to select the correct pivot row, which is done by computing $|a_{l_i,k}|/s_{l_i}$ for i = k, k + 1, …, n. The next set of lines in the pseudocode is calculating this greatest ratio, called rmax in the routine, and the index j where it occurs. Next, $l_k$ and $l_j$ are interchanged in the array $(l_i)$.
268 Chapter 7
Systems of Linear Equations
The arithmetic modifications in the array $(a_{ij})$ due to subtracting multiples of row $l_k$ from rows $l_{k+1}, l_{k+2}, \ldots, l_n$ all occur in the final lines. First the multiplier is computed and stored; then the subtraction occurs in a loop.
Caution: Values in array (ai j ) that result as output from procedure Gauss are not the same as those in array (ai j ) at input. If the original array must be retained, one should store a duplicate of it in another array.
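The forward elimination described above can be sketched in Python as follows. This is a hedged sketch rather than the book's code: it assumes the matrix is a list of row lists, uses 0-based indices, and returns the index array instead of passing it as a parameter. The function name `gauss` and the sample matrix are chosen here for illustration.

```python
def gauss(a):
    """Forward elimination with scaled partial pivoting, done in place.

    a is a square matrix stored as a list of row lists. On return, the
    eliminated positions of a hold the multipliers, and the returned
    list l gives the order in which rows were used as pivot rows.
    """
    n = len(a)
    l = list(range(n))                            # index array, l[i] = i
    s = [max(abs(v) for v in row) for row in a]   # scale array
    for k in range(n - 1):
        # Find the first row index with the largest ratio |a[l_i][k]| / s[l_i].
        rmax = 0.0
        j = k
        for i in range(k, n):
            r = abs(a[l[i]][k]) / s[l[i]]
            if r > rmax:
                rmax = r
                j = i
        l[j], l[k] = l[k], l[j]                   # swap indices, not rows
        for i in range(k + 1, n):
            xmult = a[l[i]][k] / a[l[k]][k]
            a[l[i]][k] = xmult                    # store the multiplier
            for col in range(k + 1, n):
                a[l[i]][col] -= xmult * a[l[k]][col]
    return l

A = [[2.0, 1.0, 1.0],
     [4.0, 1.0, 0.0],
     [-2.0, 2.0, 1.0]]
index = gauss(A)
print(index)
print(A)
```

As the caution above notes, `A` is overwritten; a caller that still needs the original matrix should pass a copy.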
In the procedure Naive Gauss for naive Gaussian elimination from Section 7.1, the right-hand side b was modified during the forward elimination phase; however, this was not done in the procedure Gauss. Therefore, we need to update b before considering the back substitution phase. For simplicity, we discuss updating b for the naive forward elimination first. Stripping out the pseudocode from Naive Gauss that involves the (bi ) array in the forward elimination phase, we obtain
for k = 1 to n − 1 do
    for i = k + 1 to n do
        b_i = b_i − a_{ik}b_k
    end for
end for
This updates the $(b_i)$ array based on the stored multipliers from the $(a_{ij})$ array. When scaled partial pivoting is done in the forward elimination phase, such as in procedure Gauss, the multipliers for each step are not one below another in the $(a_{ij})$ array but are jumbled around. To unravel this situation, all we have to do is introduce the index array $(l_i)$ into the above pseudocode:
for k = 1 to n − 1 do
    for i = k + 1 to n do
        b_{l_i} = b_{l_i} − a_{l_i,k}b_{l_k}
    end for
end for
After the array b has been processed in the forward elimination, the back substitution process is carried out. It begins by solving the equation
$$a_{l_n,n}x_n = b_{l_n} \qquad (6)$$
whence
$$x_n = \frac{b_{l_n}}{a_{l_n,n}}$$
Then the equation
$$a_{l_{n-1},n-1}x_{n-1} + a_{l_{n-1},n}x_n = b_{l_{n-1}}$$
is solved for $x_{n-1}$:
$$x_{n-1} = \frac{1}{a_{l_{n-1},n-1}}\left[b_{l_{n-1}} - a_{l_{n-1},n}x_n\right]$$
After $x_n, x_{n-1}, \ldots, x_{i+1}$ have been determined, $x_i$ is found from the equation
$$a_{l_i,i}x_i + a_{l_i,i+1}x_{i+1} + \cdots + a_{l_i,n}x_n = b_{l_i}$$
whose solution is
$$x_i = \frac{1}{a_{l_i,i}}\left[b_{l_i} - \sum_{j=i+1}^{n} a_{l_i,j}x_j\right] \qquad (7)$$
Except for the presence of the index array $l_i$, this is similar to the back substitution formula (7) in Section 7.1 obtained for naive Gaussian elimination.
The procedure for processing the array b and performing the back substitution phase is given next:
procedure Solve(n, (a_{ij}), (l_i), (b_i), (x_i))
integer i, j, k, n; real sum
real array (a_{ij})_{1:n×1:n}, (b_i)_{1:n}, (x_i)_{1:n}
integer array (l_i)_{1:n}
for k = 1 to n − 1 do
    for i = k + 1 to n do
        b_{l_i} ← b_{l_i} − a_{l_i,k}b_{l_k}
    end for
end for
x_n ← b_{l_n}/a_{l_n,n}
for i = n − 1 to 1 step −1 do
    sum ← b_{l_i}
    for j = i + 1 to n do
        sum ← sum − a_{l_i,j}x_j
    end for
    x_i ← sum/a_{l_i,i}
end for
end procedure Solve
Here, the first loop carries out the forward elimination process on array (bi ), using arrays (ai j ) and (li ) that result from procedure Gauss. The next line carries out the solution of Equation (6). The final part carries out Equation (7). The variable sum is a temporary variable for accumulating the terms in parentheses.
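A Python rendering of procedure Solve might look like the sketch below. It assumes the array and index list have already been produced by a forward elimination with scaled partial pivoting. The sample input is hand-packed from the hypothetical matrix [[1, 2], [3, 4]], whose pivot order is row 2 then row 1 with multiplier 1/3. Unlike the pseudocode, the sketch copies b rather than overwriting it; that is a choice made here.

```python
def solve(a, l, b):
    """Update b with the stored multipliers, then back substitute.

    a holds the eliminated matrix with multipliers stored below the
    pivots, l is the 0-based index array from the elimination, and b is
    the right-hand side. Returns the solution x as a new list.
    """
    n = len(b)
    b = list(b)                       # work on a copy of the right-hand side
    for k in range(n - 1):            # forward elimination applied to b
        for i in range(k + 1, n):
            b[l[i]] -= a[l[i]][k] * b[l[k]]
    x = [0.0] * n
    x[n - 1] = b[l[n - 1]] / a[l[n - 1]][n - 1]       # Equation (6)
    for i in range(n - 2, -1, -1):                    # Equation (7)
        total = b[l[i]]
        for j in range(i + 1, n):
            total -= a[l[i]][j] * x[j]
        x[i] = total / a[l[i]][i]
    return x

# Packed form of [[1, 2], [3, 4]]: row 1 is the pivot row, and row 0
# stores the multiplier 1/3 followed by the updated entry 2 - (1/3)*4.
a_packed = [[1.0 / 3.0, 2.0 / 3.0],
            [3.0, 4.0]]
l = [1, 0]
x = solve(a_packed, l, [3.0, 7.0])
print(x)
```

The right-hand side [3, 7] is the row sums of the original matrix, so the computed solution should be close to [1, 1].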
As with most pseudocode in this book, those in this chapter contain only the basic ingredients for good mathematical software. They are not suitable as production code for various reasons. For example, procedures for optimizing code are ignored. Furthermore, the procedures do not give warnings for difficulties that may be encountered, such as division by zero! General-purpose software should be robust; that is, it should anticipate every possible situation and deal with each in a prescribed way. (See Computer Problem 7.2.11.)
Long Operation Count
Solving large systems of linear equations can be expensive in computer time. To understand why, let us perform an operation count on the two algorithms whose codes have been given. We count only multiplications and divisions (long operations) because they are more time
consuming than addition. Furthermore, we lump multiplications and divisions together even though division is slower than multiplication. In modern computers, all floating-point operations are done in hardware, so long operations may not be as significant, but this still gives an indication of the operational cost of Gaussian elimination.
Consider first procedure Gauss. In step 1, the choice of a pivot element requires the calculation of n ratios, that is, n divisions. Then for rows $l_2, l_3, \ldots, l_n$, we first compute a multiplier and then subtract from row $l_i$ that multiplier times row $l_1$. The zero that is being created in this process is not computed. So the elimination requires n − 1 multiplications per row. If we include the calculation of the multiplier, there are n long operations (divisions or multiplications) per row. There are n − 1 rows to be processed for a total of n(n − 1) operations. If we add the cost of computing the ratios, a total of $n^2$ operations is needed for step 1.
The next step is like step 1 except that row $l_1$ is not affected, nor is the column of multipliers created and stored in step 1. So step 2 will require $(n - 1)^2$ multiplications or divisions because it operates on a system without row $l_1$ and without column 1. Continuing this reasoning, we conclude that the total number of long operations for procedure Gauss is
$$n^2 + (n-1)^2 + (n-2)^2 + \cdots + 4^2 + 3^2 + 2^2 = \frac{n}{6}(n+1)(2n+1) - 1 \approx \frac{n^3}{3}$$
(The derivation of this formula is outlined in Problem 7.2.16.) Note that the number of long operations in this procedure grows like n3/3, the dominant term.
Now consider procedure Solve. The forward processing of the array (bi ) involves n − 1 steps. The first step contains n − 1 multiplications, the second contains n − 2 multiplications, and so on. The total of the forward processing of array (bi ) is thus
$$(n-1) + (n-2) + \cdots + 3 + 2 + 1 = \frac{n}{2}(n-1)$$
(See Problem 7.2.15.) In the back substitution procedure, one long operation is involved in the first step, two in the second step, and so on. The total is
$$1 + 2 + 3 + \cdots + n = \frac{n}{2}(n+1)$$
Thus, procedure Solve involves altogether $n^2$ long operations. To summarize:
■ THEOREM 1
THEOREM ON LONG OPERATIONS
The forward elimination phase of the Gaussian elimination algorithm with scaled partial pivoting, if applied only to the n × n coefficient array, involves approximately $n^3/3$ long operations (multiplications or divisions). Solving for x requires an additional $n^2$ long operations.
An intuitive way to think of this result is that the Gaussian elimination algorithm involves a triply nested for-loop. So an O(n3) algorithmic structure is driving the elimination process, and the work is heavily influenced by the cube of the number of equations and unknowns.
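The counts in the theorem can be checked by tallying long operations loop by loop, as in this sketch. The function names are illustrative; the loops mirror the structure of procedures Gauss and Solve but perform no arithmetic on a matrix.

```python
def gauss_long_ops(n):
    """Count multiplications and divisions in the forward elimination."""
    ops = 0
    for k in range(1, n):                 # k = 1, ..., n-1
        ops += n - k + 1                  # ratios used to pick the pivot
        for _ in range(k + 1, n + 1):     # rows below the pivot row
            ops += 1                      # division for the multiplier
            ops += n - k                  # multiplications along the row
    return ops

def solve_long_ops(n):
    """Count long operations in updating b and in back substitution."""
    forward = sum(n - k for k in range(1, n))
    # Each x_i costs (n - i) multiplications plus one division.
    back = sum(n - i + 1 for i in range(1, n + 1))
    return forward + back

for n in (10, 100, 1000):
    print(n, gauss_long_ops(n), n ** 3 / 3, solve_long_ops(n))
```

The printed ratio of the exact count to $n^3/3$ approaches 1 as n grows, which is the sense in which $n^3/3$ is the dominant term.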
Numerical Stability
The numerical stability of a numerical algorithm is related to the accuracy of the procedure. An algorithm can have different levels of numerical stability because many computations can be achieved in various ways that are algebraically equivalent but may produce different results. A robust numerical algorithm with a high level of numerical stability is desirable. Gaussian elimination is numerically stable for strictly diagonally dominant matrices or symmetric positive definite matrices. (These are properties we will present in Sections 7.3 and 8.1, respectively.) For matrices with a general dense structure, Gaussian elimination with partial pivoting is usually numerically stable in practice. Nevertheless, there exist unstable pathological examples in which it may fail. For additional details, see Golub and Van Loan [1996] and Higham [1996].
An early version of Gaussian elimination can be found in a Chinese mathematics text dating from 150 B.C.
Scaling
Readers should not confuse scaling in Gaussian elimination (which is not recommended) with our discussion of scaled partial pivoting in Gaussian elimination.
The word scaling has more than one meaning. It could mean actually dividing each row by its maximum element in absolute value. We certainly do not advocate that. In other words, we do not recommend scaling of the matrix at all. However, we do compute a scale array and use it in selecting the pivot element in Gaussian elimination with scaled partial pivoting. We do not actually scale the rows; we just keep a vector of the “row infinity norms,” that is, the maximum element in absolute value for each row. This and the need for a vector of indices to keep track of the pivot rows make the algorithm somewhat complicated, but that is the price to be paid for some degree of robustness in the procedure.
The simple 2 × 2 example in Equation (4) shows that scaling does not help in choosing a good pivot row. In this example, scaling is of no use. Scaling of the rows is contemplated in Problem 7.2.23 and Computer Problem 7.2.17. Notice that this procedure requires at least $n^2$ arithmetic operations. Again, we are not recommending it for a general-purpose code.
Some codes actually move the rows around in storage. Because that should not be done in practice, our code does not do it, so as not to mislead. Also, to avoid misleading the casual reader, we called our initial algorithm (in the preceding section) naive, hoping that nobody would mistake it for a reliable code.
Summary
(1) In performing Gaussian elimination, partial pivoting is highly recommended to avoid zero pivots and small pivots. In Gaussian elimination with scaled partial pivoting, we use a scale vector $s = [s_1, s_2, \ldots, s_n]^T$ in which
$$s_i = \max_{1 \le j \le n} |a_{ij}| \qquad (1 \le i \le n)$$
and an index vector $l = [l_1, l_2, \ldots, l_n]^T$, initially set as $l = [1, 2, \ldots, n]^T$. The scale vector or array is set once at the beginning of the algorithm. The elements in the index vector or array are interchanged rather than the rows of the matrix A, which reduces the amount of data movement considerably. The key step in the pivoting procedure is to select j to be the first index associated with the largest ratio in the set
$$\left\{ \frac{|a_{l_i,k}|}{s_{l_i}} : k \le i \le n \right\}$$
and interchange $l_j$ with $l_k$ in the index array l. Then use multipliers
$$\frac{a_{l_i,k}}{a_{l_k,k}}$$
times row $l_k$ and subtract from equations $l_i$ for $k + 1 \le i \le n$. The forward elimination from equation $l_i$ for $l_{k+1} \le l_i \le l_n$ is
$$\begin{cases}
a_{l_i,j} \leftarrow a_{l_i,j} - (a_{l_i,k}/a_{l_k,k})\,a_{l_k,j} & (l_k \le l_j \le l_n) \\
b_{l_i} \leftarrow b_{l_i} - (a_{l_i,k}/a_{l_k,k})\,b_{l_k}
\end{cases}$$
The steps involving the vector b are usually done separately just before the back substitution phase, which we call updating the right-hand side. The back substitution is
$$x_i = \frac{1}{a_{l_i,i}}\left[b_{l_i} - \sum_{j=i+1}^{n} a_{l_i,j}x_j\right] \qquad (i = n, n-1, n-2, \ldots, 1)$$
(2) For an n × n system of linear equations Ax = b, the forward elimination phase of the Gaussian elimination with scaled partial pivoting involves approximately $n^3/3$ long operations (multiplications or divisions), whereas the back substitution requires only $n^2$ long operations.
Problems 7.2
a1. Show how Gaussian elimination with scaled partial pivoting works on the following matrix A:
$$\begin{bmatrix} 2 & 3 & -4 & 1 \\ 1 & -1 & 0 & -2 \\ 3 & 3 & 4 & 3 \\ 4 & 1 & 0 & 4 \end{bmatrix}$$
a2. Solve the following system using Gaussian elimination with scaled partial pivoting:
$$\begin{bmatrix} 1 & -1 & 2 \\ -2 & 1 & -1 \\ 4 & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -2 \\ 2 \\ -1 \end{bmatrix}$$
Show intermediate matrices at each step.
a3. Carry out Gaussian elimination with scaled partial pivoting on the matrix
$$\begin{bmatrix} 1 & 0 & 3 & 0 \\ 0 & 1 & 3 & -1 \\ 3 & -3 & 0 & 6 \\ 0 & 2 & 4 & -6 \end{bmatrix}$$
Show intermediate matrices.
4. Consider the matrix
$$\begin{bmatrix} -0.0013 & 56.4972 & 123.4567 & 987.6543 \\ 0.0000 & -0.0145 & 8.8990 & 833.3333 \\ 0.0000 & 102.7513 & -7.6543 & 69.6869 \\ 0.0000 & -1.3131 & -9876.5432 & 100.0001 \end{bmatrix}$$
Identify the entry that will be used as the next pivot element of naive Gaussian elimination, of Gaussian elimination with partial pivoting (the scale vector is [1, 1, 1, 1]), and of Gaussian elimination with scaled partial pivoting (the scale vector is [987.6543, 46.79, 256.29, 1.096]).
a5. Without using the computer, determine the final contents of the array $(a_{ij})$ after procedure Gauss has processed the following array. Indicate the multipliers by underlining them.
$$\begin{bmatrix} 1 & 3 & 2 & 1 \\ 4 & 2 & 1 & 2 \\ 2 & 1 & 2 & 3 \\ 1 & 2 & 4 & 1 \end{bmatrix}$$
a6. If the Gaussian elimination algorithm with scaled partial pivoting is used on the matrix shown, what is the scale vector? What is the second pivot row?
$$\begin{bmatrix} 4 & 7 & 3 \\ 1 & 3 & 2 \\ 2 & -4 & -1 \end{bmatrix}$$
7. If the Gaussian elimination algorithm with scaled partial pivoting is used on the example shown, which row will be selected as the third pivot row?
$$\begin{bmatrix} 8 & -1 & 4 & 9 & 2 \\ 1 & 0 & 3 & 9 & 7 \\ -5 & 0 & 1 & 3 & 5 \\ 4 & 3 & 2 & 2 & 7 \\ 3 & 0 & 0 & 0 & 9 \end{bmatrix}$$
a8. Solve the system
$$\begin{cases} 2x_1 + 4x_2 - 2x_3 = 6 \\ x_1 + 3x_2 + 4x_3 = -1 \\ 5x_1 + 2x_2 = 2 \end{cases}$$
using Gaussian elimination with scaled partial pivoting. Show intermediate results at each step; in particular, display the scale and index vectors.
9. Consider the linear system
$$\begin{cases} 2x_1 + 3x_2 = 8 \\ -x_1 + 2x_2 - x_3 = 0 \\ 3x_1 + 2x_3 = 9 \end{cases}$$
Solve for $x_1$, $x_2$, and $x_3$ using Gaussian elimination with scaled partial pivoting. Show intermediate matrices and vectors.
a10. Consider the linear system of equations
$$\begin{cases} -x_1 + x_2 - 3x_4 = 4 \\ x_1 + 3x_3 + x_4 = 0 \\ x_2 - x_3 - x_4 = 3 \\ 3x_1 + x_3 + 2x_4 = 1 \end{cases}$$
Solve this system using Gaussian elimination with scaled partial pivoting. Show all intermediate steps, and write down the index vector at each step.
11. Consider Gaussian elimination with scaled partial pivoting applied to the coefficient matrix
$$\begin{bmatrix} \# & \# & \# & \# & 0 \\ \# & \# & \# & 0 & \# \\ 0 & \# & \# & \# & 0 \\ 0 & \# & 0 & \# & 0 \\ \# & 0 & 0 & \# & \# \end{bmatrix}$$
where each # denotes a different nonzero element. Circle the locations of elements in which multipliers will be stored and mark with an f those where fill-in will occur. The final index vector is l = [2, 3, 1, 5, 4].
12. Repeat Problem 7.1.6a using Gaussian elimination with scaled partial pivoting.
13. Solve each of the following systems using Gaussian elimination with scaled partial pivoting. Carry four significant figures. What are the contents of the index array at each step?
a.
$$\begin{cases} 3x_1 + 4x_2 + 3x_3 = 10 \\ x_1 + 5x_2 - x_3 = 7 \\ 6x_1 + 3x_2 + 7x_3 = 15 \end{cases}$$
ab.
$$\begin{cases} 3x_1 + 2x_2 - 5x_3 = 0 \\ 2x_1 - 3x_2 + x_3 = 0 \\ x_1 + 4x_2 - x_3 = 4 \end{cases}$$
c.
$$\begin{bmatrix} 1 & -1 & 2 & 1 \\ 3 & 2 & 1 & 4 \\ 5 & 8 & 6 & 3 \\ 4 & 2 & 5 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \end{bmatrix}$$
d.
$$\begin{cases} 3x_1 + 2x_2 - x_3 = 7 \\ 5x_1 + 3x_2 + 2x_3 = 4 \\ -x_1 + x_2 - 3x_3 = -1 \end{cases}$$
e.
$$\begin{cases} x_1 + 3x_2 + 2x_3 + x_4 = -2 \\ 4x_1 + 2x_2 + x_3 + 2x_4 = 2 \\ 2x_1 + x_2 + 2x_3 + 3x_4 = 1 \\ x_1 + 2x_2 + 4x_3 + x_4 = -1 \end{cases}$$
14. Using scaled partial pivoting, show how the computer would solve the following system of equations. Show the scale array, tell how the pivot rows are selected, and carry out the computations. Include the index array for each step. There are no fractions in the correct solution, except for certain ratios that must be looked at to select pivots. You should follow exactly the scaled-partial-pivoting code, except that you can include the right-hand side of the system in your calculations as you go along.
$$\begin{cases} 2x_1 - x_2 + 3x_3 + 7x_4 = 15 \\ 4x_1 + 4x_2 + 7x_4 = 11 \\ 2x_1 + x_2 + x_3 + 3x_4 = 7 \\ 6x_1 + 5x_2 + 4x_3 + 17x_4 = 31 \end{cases}$$
15. Derive the formula
$$\sum_{k=1}^{n} k = \frac{n}{2}(n+1)$$
Hint: Set $S = \sum_{k=1}^{n} k$; also observe that
$$2S = (1 + 2 + \cdots + n) + [n + (n-1) + \cdots + 2 + 1] = (n+1) + (n+1) + \cdots$$
or use induction.
16. Derive the formula
$$\sum_{k=1}^{n} k^2 = \frac{n}{6}(n+1)(2n+1)$$
Hint: Induction is probably easiest.
a17. Count the number of operations in the following pseudocode:
real array (a_{ij})_{1:n×1:n}, (x_{ij})_{1:n×1:n}
real z; integer i, j, n
for i = 1 to n do
    for j = 1 to i do
        z = z + a_{ij}x_{ij}
    end for
end for
a18. Count the number of divisions in procedure Gauss. Count the number of multiplications. Count the number of additions or subtractions. Using execution times in microseconds (multiplication 1, division 2.9, addition 0.4, subtraction 0.4), write a function of n that represents the time used in these arithmetic operations.
a19. Considering long operations only and assuming 1-microsecond execution time for all long operations, give the approximate execution times and costs for procedure Gauss when $n = 10, 10^2, 10^3, 10^4$. Use only the dominant term in the operation count. Estimate costs at $500 per hour.
20. (Continuation) How much time would be used on the computer to solve 2000 equations using Gaussian elimination with scaled partial pivoting? How much would it cost? Give a rough estimate based on operation times.
a21. After processing a matrix A by procedure Gauss, how can the results be used to solve a system of equations of form $A^T x = b$?
22. What modifications would make procedure Gauss more efficient if division were much slower than multiplication?
23. The matrix $A = (a_{ij})_{n \times n}$ is row-equilibrated if it is scaled so that
$$\max_{1 \le j \le n} |a_{ij}| = 1 \qquad (1 \le i \le n)$$
In solving a system of equations Ax = b, we can produce an equivalent system in which the matrix is row-equilibrated by dividing the ith equation by $\max_{1 \le j \le n} |a_{ij}|$.
aa. Solve the system of equations
$$\begin{bmatrix} 1 & 1 & 2 \times 10^9 \\ 2 & -1 & 10^9 \\ 1 & 2 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$
by Gaussian elimination with scaled partial pivoting.
b. Solve by using row-equilibrated naive Gaussian elimination. Are the answers the same? Why or why not?
24. Solve each system using partial pivoting and scaled partial pivoting carrying four significant digits. Also find the true solutions.
a.
$$\begin{cases} 0.004000x + 69.13y = 69.17 \\ 4.281x - 5.230y = 41.91 \end{cases}$$
b.
$$\begin{cases} 40.00x + 691300y = 691700 \\ 4.281x - 5.230y = 41.91 \end{cases}$$
c.
$$\begin{cases} 0.003000x + 59.14y = 59.17 \\ 5.291x - 6.130y = 46.78 \end{cases}$$
d.
$$\begin{cases} 30.00x + 591400y = 591700 \\ 5.291x - 6.130y = 46.78 \end{cases}$$
e.
$$\begin{cases} 0.7000x + 1725y = 1739 \\ 0.4352x - 5.433y = 5.278 \end{cases}$$
f.
$$\begin{cases} 0.8000x + 1825y = 2040 \\ 0.4321x - 5.432y = 7.531 \end{cases}$$
Computer Problems 7.2
1. Test the numerical example in the text using the naive Gaussian algorithm and the Gaussian algorithm with scaled partial pivoting.
a2. Consider the system
$$\begin{bmatrix} 0.4096 & 0.1234 & 0.3678 & 0.2943 \\ 0.2246 & 0.3872 & 0.4015 & 0.1129 \\ 0.3645 & 0.1920 & 0.3781 & 0.0643 \\ 0.1784 & 0.4002 & 0.2786 & 0.3927 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0.4043 \\ 0.1550 \\ 0.4240 \\ 0.2557 \end{bmatrix}$$
Solve it by Gaussian elimination with scaled partial pivoting using procedures Gauss and Solve.
a3. (Continuation) Assume that an error was made when the coefficient matrix in Computer Problem 7.2.2 was typed and that a single digit was mistyped, namely, 0.3645 became 0.3345. Solve this system, and notice the effect of this small change. Explain.
a4. The Hilbert matrix of order n is defined by $a_{ij} = (i + j - 1)^{-1}$ for $1 \le i, j \le n$. It is often used for test purposes because of its ill-conditioned nature. Define $b_i = \sum_{j=1}^{n} a_{ij}$. Then the solution of the system of equations $\sum_{j=1}^{n} a_{ij}x_j = b_i$ for $1 \le i \le n$ is $x = [1, 1, \ldots, 1]^T$. Verify this. Select some values of n in the range $2 \le n \le 15$, solve the system of equations for x using procedures Gauss and Solve, and see whether the result is as predicted. Do the case n = 2 by hand to see what difficulties occur in the computer.
a5. Define the n × n array $(a_{ij})$ by $a_{ij} = -1 + 2\max\{i, j\}$. Set up array $(b_i)$ in such a way that the solution of the system Ax = b is $x_i = 1$ for $1 \le i \le n$. Test procedures Gauss and Solve on this system for a moderate value of n, say, n = 30.
a6. Select a modest value of n, say, $5 \le n \le 20$, and let $a_{ij} = (i - 1)^{j-1}$ and $b_i = i - 1$. Solve the system Ax = b on the computer. By looking at the output, guess what the correct solution is. Establish algebraically that your guess is correct. Account for the errors in the computed solution.
7. For a fixed value of n from 2 to 4, let
$$a_{ij} = (i + j)^2 \qquad b_i = ni(i + n + 1) + \tfrac{1}{6}n(1 + n(2n + 3))$$
Show that the vector $x = [1, 1, \ldots, 1]^T$ solves the system Ax = b. Test whether procedures Gauss and Solve can compute x correctly for n = 2, 3, 4. Explain what happens.
8. Using each value of n from 2 to 9, solve the n × n system Ax = b, where A and b are defined by
$$a_{ij} = (i + j - 1)^7 \qquad b_i = p(n + i - 1) - p(i - 1)$$
where
$$p(x) = \frac{x^2}{24}\left(2 + x^2\left(-7 + x^2(14 + x(12 + 3x))\right)\right)$$
Explain what happens.
9. Solve the following system using procedures Gauss and Solve and then using procedure Naive Gauss. Compare the results and explain.
$$\begin{bmatrix} 0.0001 & -5.0300 & 5.8090 & 7.8320 \\ 2.2660 & 1.9950 & 1.2120 & 8.0080 \\ 8.8500 & 5.6810 & 4.5520 & 1.3020 \\ 6.7750 & -2.2530 & 2.9080 & 3.9700 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 9.5740 \\ 7.2190 \\ 5.7300 \\ 6.2910 \end{bmatrix}$$
10. Without changing the parameter list, rewrite and test procedure Gauss so that it does both forward elimination and back substitution. Increase the size of array $(a_{ij})$, and store the right-hand side array $(b_i)$ in the n + 1st column of $(a_{ij})$. Also, return the solution in this column.
11. Modify procedures Gauss and Solve so that they are more robust. Two suggested changes are as follows: (i) Skip elimination if $a_{l_i,k} = 0$ and (ii) add an error parameter ierr to the parameter list and perform error checking (e.g., on division by zero or a row of zeros). Test the modified code on linear systems of varying sizes.
12. Rewrite procedures Gauss and Solve so that they are column oriented, that is, so that all inner loops vary the first index of $(a_{ij})$. On some computer systems, this implementation may avoid paging or swapping between high-speed and secondary memory and be more efficient for large matrices.
13. Computer memory can be minimized by using a different storage mode when the coefficient matrix is symmetric. An n × n symmetric matrix $A = (a_{ij})$ has the property that $a_{ij} = a_{ji}$, so only the elements on and below the main diagonal need be stored in a vector of length n(n + 1)/2. The elements of the matrix A are placed in a vector $v = (v_k)$ in this order: $a_{11}, a_{21}, a_{22}, a_{31}, a_{32}, a_{33}, \ldots, a_{n,n}$. Storing a matrix in this way is known as symmetric storage mode and effects a savings of n(n − 1)/2 memory locations. Here, $a_{ij} = v_k$, where $k = \frac{1}{2}i(i - 1) + j$ for $i \ge j$. Verify these statements.
Write and test procedures Gauss Sym(n, $(v_i)$, $(l_i)$) and Solve Sym(n, $(v_i)$, $(l_i)$, $(b_i)$), which are analogous to procedures Gauss and Solve except that the coefficient matrix is stored in symmetric storage mode in a one-dimensional array $(v_i)$ and the solution is returned in array $(b_i)$.
14. The determinant of a square matrix can be easily computed with the help of procedure Gauss. We require three facts about determinants. First, the determinant of a triangular matrix is the product of the elements on its diagonal. Second, if a multiple of one row is added to another row, the determinant of the matrix does not change. Third, if two rows in a matrix are interchanged, the determinant changes sign. Procedure Gauss can be interpreted as a procedure for reducing a matrix to upper triangular form by interchanging rows and adding multiples of one row to another. Write a function det(n, $(a_{ij})$) that computes the determinant of an n × n matrix. It will call procedure Gauss and utilize the arrays $(a_{ij})$ and $(l_i)$ that result from that call. Numerically verify function det by using the following test matrices with several values of n:
a. $a_{ij} = |i - j|$, $\det(A) = (-1)^{n-1}(n - 1)2^{n-2}$
1 ji −j j
|ai j | (1 i n)
In the case of the tridiagonal system of Equation (1), strict diagonal dominance means simply that (with $a_0 = a_n = 0$)
$$|d_i| > |a_{i-1}| + |c_i| \qquad (1 \le i \le n)$$
Let us verify that the forward elimination phase in procedure Tri preserves strict diagonal dominance. The new coefficient matrix produced by Gaussian elimination has 0 elements where the $a_i$'s originally stood, and new diagonal elements are determined recursively by
$$\begin{cases} \hat{d}_1 = d_1 \\ \hat{d}_i = d_i - \dfrac{a_{i-1}}{\hat{d}_{i-1}}\,c_{i-1} & (2 \le i \le n) \end{cases}$$
where $\hat{d}_i$ denotes a new diagonal element. The $c_i$ elements are unaltered. Now we assume that $|d_i| > |a_{i-1}| + |c_i|$, and we want to be sure that $|\hat{d}_i| > |c_i|$. Obviously, this is true for i = 1 because $\hat{d}_1 = d_1$. If it is true for index i − 1 (that is, $|\hat{d}_{i-1}| > |c_{i-1}|$), then it is true for index i because
$$\begin{aligned}
|\hat{d}_i| &= \left|d_i - \frac{a_{i-1}}{\hat{d}_{i-1}}\,c_{i-1}\right| \\
&\ge |d_i| - |a_{i-1}|\,\frac{|c_{i-1}|}{|\hat{d}_{i-1}|} \\
&> |a_{i-1}| + |c_i| - |a_{i-1}| = |c_i|
\end{aligned}$$
While the number of long operations in Gaussian elimination on full matrices is $O(n^3)$, it is only O(n) for tridiagonal matrices. Also, the scaled pivoting strategy is not needed on strictly diagonally dominant tridiagonal systems.
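The O(n) cost is easy to see in code. The sketch below applies the recursion for the new diagonal elements given above, with the matching update of the right-hand side, and then back substitutes. It is an illustrative Python function, not the text's procedure Tri: the name `tri`, the 0-based indexing, and the decision to copy d and b instead of overwriting them are assumptions made here. No pivoting is done, so it is intended for cases such as strictly diagonally dominant systems.

```python
def tri(a, d, c, b):
    """Solve a tridiagonal system without pivoting.

    a: subdiagonal (length n-1), d: diagonal (length n),
    c: superdiagonal (length n-1), b: right-hand side (length n).
    """
    n = len(d)
    d = list(d)                          # copies, so the inputs survive
    b = list(b)
    for i in range(1, n):                # forward elimination, one pass
        xmult = a[i - 1] / d[i - 1]
        d[i] -= xmult * c[i - 1]
        b[i] -= xmult * b[i - 1]
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):       # back substitution, one pass
        x[i] = (b[i] - c[i] * x[i + 1]) / d[i]
    return x

# Strictly diagonally dominant test system whose solution is all ones.
sub = [1.0] * 4
diag = [4.0] * 5
sup = [1.0] * 4
rhs = [5.0, 6.0, 6.0, 6.0, 5.0]
x_tri = tri(sub, diag, sup, rhs)
print(x_tri)
```

Each loop runs once over the unknowns with a fixed number of long operations per step, which is where the O(n) count comes from.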
Pentadiagonal Systems
The principles illustrated by procedure Tri can be applied to matrices that have wider bands of nonzero elements. A procedure called Penta is given here to solve the five-diagonal system:
$$\begin{bmatrix}
d_1 & c_1 & f_1 & & & & \\
a_1 & d_2 & c_2 & f_2 & & & \\
e_1 & a_2 & d_3 & c_3 & f_3 & & \\
 & e_2 & a_3 & d_4 & c_4 & f_4 & \\
 & & \ddots & \ddots & \ddots & \ddots & \ddots \\
 & & e_{i-2} & a_{i-1} & d_i & c_i & f_i \\
 & & & \ddots & \ddots & \ddots & \ddots \\
 & & e_{n-4} & a_{n-3} & d_{n-2} & c_{n-2} & f_{n-2} \\
 & & & e_{n-3} & a_{n-2} & d_{n-1} & c_{n-1} \\
 & & & & e_{n-2} & a_{n-1} & d_n
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_i \\ \vdots \\ x_{n-2} \\ x_{n-1} \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ \vdots \\ b_i \\ \vdots \\ b_{n-2} \\ b_{n-1} \\ b_n \end{bmatrix}$$
In the pseudocode, the solution vector is placed in array $(x_i)$. Also, one should not use this routine if n ≤ 4. (Why?)
procedure Penta(n, (e_i), (a_i), (d_i), (c_i), (f_i), (b_i), (x_i))
integer i, n; real r, s, t, xmult
real array (e_i)_{1:n}, (a_i)_{1:n}, (d_i)_{1:n}, (c_i)_{1:n}, (f_i)_{1:n}, (b_i)_{1:n}, (x_i)_{1:n}
r ← a_1
s ← a_2
t ← e_1
for i = 2 to n − 1 do
    xmult ← r/d_{i−1}
    d_i ← d_i − (xmult)c_{i−1}
    c_i ← c_i − (xmult)f_{i−1}
    b_i ← b_i − (xmult)b_{i−1}
    xmult ← t/d_{i−1}
    r ← s − (xmult)c_{i−1}
    d_{i+1} ← d_{i+1} − (xmult)f_{i−1}
    b_{i+1} ← b_{i+1} − (xmult)b_{i−1}
    s ← a_{i+1}
    t ← e_i
end for
xmult ← r/d_{n−1}
d_n ← d_n − (xmult)c_{n−1}
x_n ← (b_n − (xmult)b_{n−1})/d_n
x_{n−1} ← (b_{n−1} − c_{n−1}x_n)/d_{n−1}
for i = n − 2 to 1 step −1 do
    x_i ← (b_i − f_i x_{i+2} − c_i x_{i+1})/d_i
end for
end procedure Penta
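For readers who want to run procedure Penta, here is a hedged Python transcription. It assumes 0-based lists that are all padded with zeros to length n, and it copies d, c, and b so the caller's data are not overwritten, which differs from the pseudocode. The name `penta` and the test system are illustrative.

```python
def penta(e, a, d, c, f, b):
    """Solve a pentadiagonal system without pivoting (requires n >= 5).

    e, a: second and first subdiagonals; d: diagonal; c, f: first and
    second superdiagonals; b: right-hand side. All lists have length n,
    with unused trailing entries set to zero.
    """
    n = len(d)
    d, c, b = list(d), list(c), list(b)
    r, s, t = a[0], a[1], e[0]
    for i in range(1, n - 1):
        xmult = r / d[i - 1]             # eliminate the entry left of d[i]
        d[i] -= xmult * c[i - 1]
        c[i] -= xmult * f[i - 1]
        b[i] -= xmult * b[i - 1]
        xmult = t / d[i - 1]             # eliminate the entry two rows down
        r = s - xmult * c[i - 1]
        d[i + 1] -= xmult * f[i - 1]
        b[i + 1] -= xmult * b[i - 1]
        s = a[i + 1]
        t = e[i]
    xmult = r / d[n - 2]
    d[n - 1] -= xmult * c[n - 2]
    x = [0.0] * n
    x[n - 1] = (b[n - 1] - xmult * b[n - 2]) / d[n - 1]
    x[n - 2] = (b[n - 2] - c[n - 2] * x[n - 1]) / d[n - 2]
    for i in range(n - 3, -1, -1):
        x[i] = (b[i] - f[i] * x[i + 2] - c[i] * x[i + 1]) / d[i]
    return x

# Symmetric, strictly diagonally dominant test system; b holds row sums,
# so the exact solution is all ones.
n = 6
d6 = [6.0] * n
c6 = [1.0] * (n - 1) + [0.0]
f6 = [1.0] * (n - 2) + [0.0, 0.0]
b6 = [d6[i] + c6[i] + f6[i]
      + (c6[i - 1] if i >= 1 else 0.0)
      + (f6[i - 2] if i >= 2 else 0.0) for i in range(n)]
x_penta = penta(f6, c6, d6, c6, f6, b6)
print(x_penta)
```

The call passes the superdiagonal lists in the subdiagonal positions, which is how the symmetric case is handled.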
To be able to solve symmetric pentadiagonal systems with the same code and with a minimum of storage, we have used variables r, s, and t to store temporarily some information rather than overwriting into arrays. This allows us to solve a symmetric pentadiagonal system with a procedure call of the form
call Penta(n, (f_i), (c_i), (d_i), (c_i), (f_i), (b_i), (b_i))
which reduces the number of linear arrays from seven to four. Of course, the original data in some of these arrays will be corrupted. The computed solution will be stored in the $(b_i)$ array. Here, we assume that all linear arrays are padded with zeros to length n in order not to exceed the array dimensions in the pseudocode.
Block Pentadiagonal Systems
Many mathematical problems involve matrices with block structures. In many cases, there are advantages in exploiting the block structure in the numerical solution. This is particularly true in solving partial differential equations numerically.
We can consider a pentadiagonal system as a block tridiagonal system
$$\begin{bmatrix}
D_1 & C_1 & & & & \\
A_1 & D_2 & C_2 & & & \\
 & A_2 & D_3 & C_3 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & A_{i-1} & D_i & C_i & \\
 & & & \ddots & \ddots & \ddots \\
 & & & A_{m-2} & D_{m-1} & C_{m-1} \\
 & & & & A_{m-1} & D_m
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_i \\ \vdots \\ X_{m-1} \\ X_m \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_i \\ \vdots \\ B_{m-1} \\ B_m \end{bmatrix}$$
where
$$D_i = \begin{bmatrix} d_{2i-1} & c_{2i-1} \\ a_{2i-1} & d_{2i} \end{bmatrix}, \qquad
A_i = \begin{bmatrix} e_{2i-1} & a_{2i} \\ 0 & e_{2i} \end{bmatrix}, \qquad
C_i = \begin{bmatrix} f_{2i-1} & 0 \\ c_{2i} & f_{2i} \end{bmatrix}$$
Here, we assume that n is even, say n = 2m, so that there are m block rows. If n is not even, then the system can be padded with an extra equation $x_{n+1} = 1$ so that the number of rows is even.
The algorithm for this block tridiagonal system is similar to the one for tridiagonal systems. Hence, we have the forward elimination phase
$$\begin{cases}
D_i \leftarrow D_i - A_{i-1}D_{i-1}^{-1}C_{i-1} \\
B_i \leftarrow B_i - A_{i-1}D_{i-1}^{-1}B_{i-1}
\end{cases} \qquad (2 \le i \le m)$$
and the back substitution phase
$$\begin{cases}
X_m \leftarrow D_m^{-1}B_m \\
X_i \leftarrow D_i^{-1}(B_i - C_iX_{i+1}) & (m - 1 \ge i \ge 1)
\end{cases}$$
Here,
$$D_i^{-1} = \frac{1}{\Delta}\begin{bmatrix} d_{2i} & -c_{2i-1} \\ -a_{2i-1} & d_{2i-1} \end{bmatrix}$$
where $\Delta = d_{2i}d_{2i-1} - a_{2i-1}c_{2i-1}$.
Code for solving a pentadiagonal system using this block procedure is left as an exercise (Computer Problem 7.3.21). The results from the block pentadiagonal code are the same as those from the procedure Penta, except for roundoff error. Also, this procedure can be used for symmetric pentadiagonal systems (in which the subdiagonals are the same as the superdiagonals).
In Chapter 16, we discuss two-dimensional elliptic partial differential equations. For example, the Laplace equation is defined on the unit square. A 3 × 3 mesh of points is placed over the unit square region, and the points are ordered in the natural ordering (left-to-right and up) as shown in Figure 7.2.
FIGURE 7.2 Mesh points in natural order
7 8 9
4 5 6
1 2 3
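In the Laplace equation, second partial derivatives are approximated by second-order centered finite difference formulas. This results in a 9 × 9 system of linear equations having a sparse coefficient matrix with this nonzero pattern:
$$A = \begin{bmatrix}
\times & \times & & \times & & & & & \\
\times & \times & \times & & \times & & & & \\
 & \times & \times & & & \times & & & \\
\times & & & \times & \times & & \times & & \\
 & \times & & \times & \times & \times & & \times & \\
 & & \times & & \times & \times & & & \times \\
 & & & \times & & & \times & \times & \\
 & & & & \times & & \times & \times & \times \\
 & & & & & \times & & \times & \times
\end{bmatrix}$$
Here, nonzero entries in the matrix are indicated by the × symbol, and zero entries are a blank. This matrix is block tridiagonal, and each nonzero block is either tridiagonal or diagonal. Other orderings of the mesh points result in sparse matrices with different patterns.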
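The pattern can be generated mechanically, which makes the block structure easy to inspect for other mesh sizes. The sketch below assumes the usual five-point stencil, in which each mesh point is coupled to itself and to its left, right, lower, and upper neighbors. The function name `laplacian_pattern` and the 0-based numbering of mesh points are choices made here.

```python
def laplacian_pattern(m):
    """Nonzero pattern of the five-point matrix on an m-by-m mesh.

    Mesh points are numbered in the natural order, left to right and
    then upward, starting from 0. Returns an n-by-n list of booleans
    with n = m * m.
    """
    n = m * m
    pattern = [[False] * n for _ in range(n)]
    for row in range(m):
        for col in range(m):
            p = row * m + col
            pattern[p][p] = True                 # the point itself
            if col > 0:
                pattern[p][p - 1] = True         # left neighbor
            if col < m - 1:
                pattern[p][p + 1] = True         # right neighbor
            if row > 0:
                pattern[p][p - m] = True         # neighbor below
            if row < m - 1:
                pattern[p][p + m] = True         # neighbor above
    return pattern

grid = laplacian_pattern(3)
for line in grid:
    print(" ".join("x" if nonzero else "." for nonzero in line))
```

The diagonal 3 × 3 blocks come out tridiagonal and the off-diagonal blocks diagonal, matching the description above.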
Summary
(1) For banded systems, such as tridiagonal, pentadiagonal, and others, it is usual to develop special algorithms for implementing Gaussian elimination, since partial pivoting is not needed in many applications. The forward elimination procedure for a tridiagonal linear system A = tridiagonal[(ai), (di), (ci)] is
$$
\begin{cases}
d_i \leftarrow d_i - \dfrac{a_{i-1}}{d_{i-1}}\, c_{i-1} \\[2ex]
b_i \leftarrow b_i - \dfrac{a_{i-1}}{d_{i-1}}\, b_{i-1}
\end{cases}
\qquad (2 \le i \le n)
$$
The back substitution procedure is
$$
x_i \leftarrow \frac{1}{d_i}\left(b_i - c_i x_{i+1}\right) \qquad (i = n-1, n-2, \ldots, 1)
$$
(2) A strictly diagonally dominant matrix A = (aij)n×n is one in which the magnitude of the diagonal entry is larger than the sum of the magnitudes of the off-diagonal entries in the same row and this is true for all rows, namely,
$$
|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \qquad (1 \le i \le n)
$$
For strictly diagonally dominant tridiagonal coefficient matrices, partial pivoting is not necessary because zero divisors will not be encountered.
(3) The forward elimination and back substitution procedures for a pentadiagonal linear system A = pentadiagonal[(ei), (ai), (di), (ci), (fi)] are similar to those for a tridiagonal system.
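The tridiagonal procedure summarized in item (1) can be sketched in a few lines. This is an illustrative version only, not the book's procedure Tri: it assumes 0-based lists, with `a` holding the n − 1 subdiagonal entries, `d` the n diagonal entries, and `c` the n − 1 superdiagonal entries, and it starts back substitution from x_n = b_n / d_n. No pivoting is done, so it is intended for systems (such as strictly diagonally dominant ones) where zero divisors do not occur.

```python
def tri(a, d, c, b):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    a: subdiagonal (length n-1), d: diagonal (length n),
    c: superdiagonal (length n-1), b: right-hand side (length n).
    The inputs are left unchanged; the solution is returned as a new list.
    """
    n = len(d)
    d = [float(v) for v in d]     # work on copies
    b = [float(v) for v in b]

    # Forward elimination: remove the subdiagonal entry of each row.
    for i in range(1, n):
        xmult = a[i - 1] / d[i - 1]
        d[i] -= xmult * c[i - 1]
        b[i] -= xmult * b[i - 1]

    # Back substitution, starting from the last unknown.
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - c[i] * x[i + 1]) / d[i]
    return x
```

For example, `tri([1, 1], [2, 2, 2], [1, 1], [3, 4, 3])` returns a vector close to `[1.0, 1.0, 1.0]`, up to roundoff.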
Additional References
For additional study of linear systems, see Coleman and Van Loan [1988], Dekker and Hoffmann [1989], Dekker, Hoffmann, and Potma [1997], Dongarra, Duff, Sorensen, and van der Vorst [1990], Forsythe and Moler [1967], Gallivan et al. [1990], Golub and Van Loan [1996], Hoffmann [1989], Jennings [1977], Meyer [2000], Noble and Daniel [1988], Stewart [1973, 1996, 1998a, 1998b, 2001], and Watkins [1991].
Problems 7.3
1. What happens to the tridiagonal System (1) if Gaussian elimination with partial pivoting is used to solve it? In general, what happens to a banded system?
2. Count the long arithmetic operations involved in procedures: aa. Tri b. Penta
a3. How many storage locations are needed for a system of n linear equations if the coefficient matrix has banded structure in which aij = 0 for |i − j| ≥ k + 1?
4. Give an example of a system of linear equations in tridiagonal form that cannot be solved without pivoting.
5. What is the appearance of a matrix A if its elements satisfy ai j = 0 when: a. ji+1
a6. Consider a strictly diagonally dominant matrix A whose elements satisfy aij = 0 when i > j + 1. Does Gaussian elimination without pivoting preserve the strictly diagonal dominance? Why or why not?
a7. Let A be a matrix of form (1) such that ai ci > 0 for 1 ≤ i ≤ n − 1. Find the general form of the diagonal matrix D = diag(αi) with αi ≠ 0 such that D−1AD is symmetric. What is the general form of D−1AD?
Computer Problems 7.3
1. Rewrite procedure Tri using only four arrays, (ai), (di), (ci), and (bi), and storing the solution in the (bi) array. Test the code with both a nonsymmetric and a symmetric tridiagonal system.
2. Repeat the previous computer problem for procedure Penta with six arrays (ei), (ai), (di), (ci), (fi), and (bi). Use the example that begins this chapter as one of the test cases.
a3. Write and test a special procedure to solve the tridiagonal system in which ai = ci = 1 for all i.
a4. Use procedure Tri to solve the following system of 100 equations. Compare the numerical solution to the obvious exact solution.
$$
\begin{cases}
x_1 + 0.5x_2 = 1.5 \\
0.5x_{i-1} + x_i + 0.5x_{i+1} = 2.0 & (2 \le i \le 99) \\
0.5x_{99} + x_{100} = 1.5
\end{cases}
$$
5. Solve the system
$$
\begin{cases}
4x_1 - x_2 = -20 \\
x_{j-1} - 4x_j + x_{j+1} = 40 & (2 \le j \le n-1) \\
-x_{n-1} + 4x_n = -20
\end{cases}
$$
using procedure Tri with n = 100.
6. Let A be the 50 × 50 tridiagonal matrix
$$
A = \begin{bmatrix}
5 & -1 \\
-1 & 5 & -1 \\
 & -1 & 5 & -1 \\
 & & \ddots & \ddots & \ddots \\
 & & & -1 & 5 & -1 \\
 & & & & -1 & 5
\end{bmatrix}
$$
Consider the problem Ax = b for 50 different vectors b of the form
[1, 2, …, 49, 50]T, [2, 3, …, 50, 1]T, [3, 4, …, 50, 1, 2]T, …
Write and test an efficient code for solving this problem. Hint: Rewrite procedure Tri.
7. Rewrite and test procedure Tri so that it performs Gaussian elimination with scaled partial pivoting. Hint: Additional temporary storage arrays may be needed.
8. Rewrite and test Penta so that it does Gaussian elimination with scaled partial pivoting. Is this worthwhile?
9. Using the ideas illustrated in Penta, write a procedure for solving seven-diagonal systems. Test it on several such systems.
10. Consider the system of equations (n = 7)
$$
\begin{bmatrix}
d_1 & & & & & & a_7 \\
 & d_2 & & & & a_6 & \\
 & & d_3 & & a_5 & & \\
 & & & d_4 & & & \\
 & & a_3 & & d_5 & & \\
 & a_2 & & & & d_6 & \\
a_1 & & & & & & d_7
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
$$
For n odd, write and test
procedure X Gauss(n, (ai), (di), (bi))
that does the forward elimination phase of Gaussian elimination (without scaled partial pivoting) and
procedure X Solve(n, (ai), (di), (bi), (xi))
that does the back substitution for cross-systems of this form.
11. Consider the n × n lower-triangular system Ax = b, where A = (aij) and aij = 0 for i < j.
aa. Write an algorithm (in mathematical terms) for solving for x by forward substitution.
b. Write
procedure Forward Sub(n, (ai), (bi), (xi))
which uses this algorithm.
c. Determine the number of divisions, multiplications, and additions (or subtractions) in using this algorithm to solve for x.
d. Should Gaussian elimination with partial pivoting be used to solve such a system?
a12. (Normalized tridiagonal algorithm) Construct an algorithm for handling tridiagonal systems in which the normalized Gaussian elimination procedure without pivoting is used. In this process, each pivot row is divided by the diagonal element before a multiple of the row is subtracted from the successive rows. Write the equations involved in the forward elimination phase and store the upper diagonal entries back in array (ci) and the right-hand side entries back in array (bi). Write the equations for the back substitution phase, storing the solution in array (bi). Code and test this procedure. What are its advantages and disadvantages?
13. For a (2n) × (2n) tridiagonal system, write and test a procedure that proceeds as follows: In the forward elimination phase, the routine simultaneously eliminates the elements in the subdiagonal from the top to the middle and in the superdiagonal from the bottom to the middle. In the back substitution phase, the unknowns are determined two at a time from the middle outward.
14. (Continuation) Rewrite and test the procedure in the preceding computer problem for a general n × n tridiagonal matrix.
15. Suppose
procedure Tri Normal(n, (ai), (di), (ci), (bi), (xi))
performs the normalized Gaussian elimination algorithm of Computer Problem 7.3.12 and
procedure Tri 2n(n, (ai), (di), (ci), (bi), (xi))
performs the algorithm outlined in Computer Problem 7.3.13. Using a timing routine on your computer, compare Tri, Tri Normal, and Tri 2n to determine which of them is fastest for the tridiagonal system
ai =i(n−i+1), ci =(i+1)(n−i−1), di =(2i+1)n−i−2i, bi =i
with a large even value of n. Note: Mathematical algorithms may behave differently on parallel and vector computers. Generally speaking, parallel computations completely alter our conventional notions about what's best or most efficient.
16. Consider a special bidiagonal linear system of the following form (illustrated with n = 7) with nonzero diagonal elements:
$$
\begin{bmatrix}
d_1 \\
a_1 & d_2 \\
 & a_2 & d_3 \\
 & & a_3 & d_4 & a_4 \\
 & & & & d_5 & a_5 \\
 & & & & & d_6 & a_6 \\
 & & & & & & d_7
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
$$
Write and test
procedure Bi Diagonal(n, (ai), (di), (bi))
to solve the general system of order n (odd). Store the solution in array b, and assume that all arrays are of length n. Do not use forward elimination because the system can be solved quite easily without it.
17. Write and test
procedure Backward Tri(n, (ai), (di), (ci), (bi), (xi))
for solving a backward tridiagonal system of linear equations of the form
$$
\begin{bmatrix}
 & & & & a_1 & d_1 \\
 & & & a_2 & d_2 & c_1 \\
 & & a_3 & d_3 & c_2 & \\
 & ⋰ & ⋰ & ⋰ & & \\
a_{n-1} & d_{n-1} & c_{n-2} & & & \\
d_n & c_{n-1} & & & &
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-1} \\ b_n \end{bmatrix}
$$
using Gaussian elimination without pivoting.
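18. An upper Hessenberg matrix is of the form
$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
 & a_{32} & a_{33} & \cdots & a_{3n} \\
 & & \ddots & \ddots & \vdots \\
 & & & a_{n,n-1} & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix}
$$
Write a procedure for solving such a system, and test it on a system having 10 or more equations.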
19. An n × n banded coefficient matrix with l subdiagonals and m superdiagonals can be stored in banded storage mode in an n × (l + m + 1) array. The matrix is stored with the row and diagonal structure preserved with almost all 0 elements unstored. If the original n × n banded matrix had the form shown in the figure, then the n × (l + m + 1) array in banded storage mode would be as shown. The main diagonal would be the (l + 1)st column of the new array. Write and test a procedure for solving a linear system with the coefficient matrix stored in banded storage mode.
20. An n × n symmetric banded coefficient matrix with m subdiagonals and m superdiagonals can be stored in symmetric banded storage mode in an n × (m + 1) array. Only the main diagonal and subdiagonals are stored so that the main diagonal is the last column in the new array, shown in the figure. Write and test a procedure for solving a linear system with the coefficient matrix stored in symmetric banded storage mode.
21. Write code for solving block pentadiagonal systems and test it on the systems with block submatrices. Compare the code to Penta using symmetric and nonsymmetric systems.
22. (Nonperiodic spline filter) The filter equation for the nonperiodic spline filter is given by the n × n system
$$
\left(I + \alpha^4 Q\right) w = z
$$
where the matrix is
$$
Q = \begin{bmatrix}
1 & -2 & 1 \\
-2 & 5 & -4 & 1 \\
1 & -4 & 6 & -4 & 1 \\
 & \ddots & \ddots & \ddots & \ddots & \ddots \\
 & & 1 & -4 & 6 & -4 & 1 \\
 & & & 1 & -4 & 5 & -2 \\
 & & & & 1 & -2 & 1
\end{bmatrix}
$$
Here the parameter α = 1/[2 sin(πΔx/λc)] involves measurement values of the profile, dimensions, and wavelength over a sampling interval. The solution w gives the profile values for the long wave components and z − w are those for the short wave components. Use this system to test the Penta code using various values of α. Hint: For test systems, select a simple solution vector such as w = [1, −1, 1, −1, …, 1]T with a modest value for n, and then compute the right-hand side by matrix-vector multiplication z = (I + α⁴Q)w.
23. (Continuation, periodic spline filter) The filter equation for the periodic spline filter is given by the n × n system
$$
\left(I + \alpha^4 Q\right) w = z
$$
where the matrix is
$$
Q = \begin{bmatrix}
6 & -4 & 1 & & & 1 & -4 \\
-4 & 6 & -4 & 1 & & & 1 \\
1 & -4 & 6 & -4 & 1 \\
 & \ddots & \ddots & \ddots & \ddots & \ddots \\
 & & 1 & -4 & 6 & -4 & 1 \\
1 & & & 1 & -4 & 6 & -4 \\
-4 & 1 & & & 1 & -4 & 6
\end{bmatrix}
$$
Periodic spline filters are used in cases of filtering closed profiles. Making use of the symmetry, modify the Penta pseudocode to handle this system and then code and test it.
24. Use mathematical software such as found in Matlab, Maple, or Mathematica to generate a tridiagonal system and solve it. For example, use the 5 × 5 tridiagonal system A = Band Matrix(−1, 2, 1) with right-hand side b = [1, 4, 9, 16, 25]T.
8
Additional Topics Concerning Systems of Linear Equations
In applications that involve partial differential equations, large linear systems arise with sparse coefficient matrices such as
$$
A = \begin{bmatrix}
 4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 &  0 \\
-1 &  4 & -1 &  0 & -1 &  0 &  0 &  0 &  0 \\
 0 & -1 &  4 &  0 &  0 & -1 &  0 &  0 &  0 \\
-1 &  0 &  0 &  4 & -1 &  0 & -1 &  0 &  0 \\
 0 & -1 &  0 & -1 &  4 & -1 &  0 & -1 &  0 \\
 0 &  0 & -1 &  0 & -1 &  4 &  0 &  0 & -1 \\
 0 &  0 &  0 & -1 &  0 &  0 &  4 & -1 &  0 \\
 0 &  0 &  0 &  0 & -1 &  0 & -1 &  4 & -1 \\
 0 &  0 &  0 &  0 &  0 & -1 &  0 & -1 &  4
\end{bmatrix}
$$
Gaussian elimination may cause fill-in of the zero entries by nonzero values. On the other hand, iterative methods preserve its sparse structure.
8.1 Matrix Factorizations
An n × n system of linear equations can be written in matrix form
$$
Ax = b \tag{1}
$$
where the coefficient matrix A has the form
$$
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{bmatrix}
$$
Our main objective is to show that the naive Gaussian algorithm applied to A yields a factorization of A into a product of two simple matrices, one unit lower triangular:
$$
L = \begin{bmatrix}
1 \\
l_{21} & 1 \\
l_{31} & l_{32} & 1 \\
\vdots & \vdots & \vdots & \ddots \\
l_{n1} & l_{n2} & l_{n3} & \cdots & 1
\end{bmatrix}
$$
and the other upper triangular:
$$
U = \begin{bmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
 & u_{22} & u_{23} & \cdots & u_{2n} \\
 & & u_{33} & \cdots & u_{3n} \\
 & & & \ddots & \vdots \\
 & & & & u_{nn}
\end{bmatrix}
$$
In short, we refer to this as an LU factorization of A; that is, A = LU.
Numerical Example
The system of Equations (2) of Section 7.1 can be written succinctly in matrix form:
$$
\begin{bmatrix}
6 & -2 & 2 & 4 \\
12 & -8 & 6 & 10 \\
3 & -13 & 9 & 3 \\
-6 & 4 & 1 & -18
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ 26 \\ -19 \\ -34 \end{bmatrix}
\tag{2}
$$
Furthermore, the operations that led from this system to Equation (5) of Section 7.1, that is, the system
$$
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -9 \\ -3 \end{bmatrix}
\tag{3}
$$
could be effected by an appropriate matrix multiplication. The forward elimination phase can be interpreted as starting from (1) and proceeding to
$$
MAx = Mb \tag{4}
$$
where M is a matrix chosen so that MA is the coefficient matrix for System (3). Hence, we have
$$
MA = \begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix} \equiv U
$$
which is an upper triangular matrix.
The first step of naive Gaussian elimination results in Equation (3) of Section 7.1 or the system
$$
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & -12 & 8 & 1 \\
0 & 2 & 3 & -14
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -27 \\ -18 \end{bmatrix}
$$
This step can be accomplished by multiplying (1) by a lower triangular matrix M1:
$$
M_1 A x = M_1 b
$$
where
$$
M_1 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
-2 & 1 & 0 & 0 \\
-\frac{1}{2} & 0 & 1 & 0 \\
1 & 0 & 0 & 1
\end{bmatrix}
$$
Notice the special form of M1. The diagonal elements are all 1’s, and the only other nonzero elements are in the first column. These numbers are the negatives of the multipliers located in the positions where they created 0’s as coefficients in step 1 of the forward elimination
phase. To continue, step 2 resulted in Equation (4) of Section 7.1 or the system
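$$
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 4 & -13
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -9 \\ -21 \end{bmatrix}
$$
which is equivalent to
$$
M_2 M_1 A x = M_2 M_1 b
$$
where
$$
M_2 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & -3 & 1 & 0 \\
0 & \frac{1}{2} & 0 & 1
\end{bmatrix}
$$
Again, M2 differs from an identity matrix by the presence of the negatives of the multipliers in the second column from the diagonal down. Finally, step 3 gives System (3), which is equivalent to
$$
M_3 M_2 M_1 A x = M_3 M_2 M_1 b
$$
where
$$
M_3 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & -2 & 1
\end{bmatrix}
$$
Now the forward elimination phase is complete, and with
$$
M = M_3 M_2 M_1 \tag{5}
$$
we have the upper triangular coefficient System (3).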
Using Equations (4) and (5), we can give a different interpretation of the forward elimination phase of naive Gaussian elimination. Now we see that
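$$
A = M^{-1}U = M_1^{-1} M_2^{-1} M_3^{-1} U = LU
$$
Since each Mk has such a special form, its inverse is obtained by simply changing the signs of the negative multiplier entries! Hence, we have
$$
L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\frac{1}{2} & 0 & 1 & 0 \\
-1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 3 & 1 & 0 \\
0 & -\frac{1}{2} & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 2 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\frac{1}{2} & 3 & 1 & 0 \\
-1 & -\frac{1}{2} & 2 & 1
\end{bmatrix}
$$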
It is somewhat amazing that L is a unit lower triangular matrix composed of the multipliers. Notice that in forming L, we did not determine M first and then compute M−1 = L. (Why?)
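It is easy to verify that
$$
LU = \begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\frac{1}{2} & 3 & 1 & 0 \\
-1 & -\frac{1}{2} & 2 & 1
\end{bmatrix}
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix}
=
\begin{bmatrix}
6 & -2 & 2 & 4 \\
12 & -8 & 6 & 10 \\
3 & -13 & 9 & 3 \\
-6 & 4 & 1 & -18
\end{bmatrix}
= A
$$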
We see that A is factored or decomposed into a unit lower triangular matrix L and an upper triangular matrix U. The matrix L consists of the multipliers located in the positions of the elements they annihilated from A, of unit diagonal elements, and of 0 upper triangular elements. In fact, we now know the general form of L and can just write it down directly using the multipliers without forming the $M_k$'s and the $M_k^{-1}$'s. The matrix U is upper triangular (not generally having unit diagonal) and is the final coefficient matrix after the forward elimination phase is completed.
It should be noted that the pseudocode Naive Gauss of Section 7.1 replaces the original coefficient matrix with its LU factorization. The elements of U are in the upper triangular part of the (ai j ) array including the diagonal. The entries below the main diagonal in L (that is, the multipliers) are found below the main diagonal in the (ai j ) array. Since it is known that L has a unit diagonal, nothing is lost by not storing the 1’s. [In fact, we have run out of room in the (ai j ) array anyway!]
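The compact storage just described can be illustrated with a short sketch. This is not the book's Naive Gauss pseudocode; it is a minimal version, assuming a square list-of-lists matrix and nonzero pivots, that overwrites the array with U on and above the diagonal and the multipliers (the subdiagonal part of L) below it. The function name `naive_lu_in_place` is illustrative.

```python
def naive_lu_in_place(a):
    """Overwrite the square array a with its compact LU factorization.

    After the call, the upper triangle (including the diagonal) holds U and
    the strict lower triangle holds the multipliers, i.e. the entries of L.
    No pivoting is performed, so every pivot a[k][k] must be nonzero.
    """
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            multiplier = a[i][k] / a[k][k]
            a[i][k] = multiplier              # store the multiplier in place of the 0
            for j in range(k + 1, n):
                a[i][j] -= multiplier * a[k][j]
    return a
```

Applied to the coefficient matrix of the numerical example, the array ends up holding the entries of U from System (3) in its upper triangle and the multipliers 2, 1/2, −1, 3, −1/2, 2 below the diagonal, which are exactly the subdiagonal entries of L.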
Formal Derivation
To see formally how the Gaussian elimination (in naive form) leads to an LU factorization, it is necessary to show that each row operation used in the algorithm can be effected by
multiplying A on the left by an elementary matrix. Specifically, if we wish to subtract λ times row p from row q, we first apply this operation to the n × n identity matrix to create an elementary matrix Mqp. Then we form the matrix product Mqp A.
Before proceeding, let us verify that Mqp A is obtained by subtracting λ times row p from row q in matrix A. Assume that p < q (for in the naive algorithm, this is always true). Then the elements of Mqp = (mij) are
$$
m_{ij} = \begin{cases}
1 & \text{if } i = j \\
-\lambda & \text{if } i = q \text{ and } j = p \\
0 & \text{in all other cases}
\end{cases}
$$
Therefore, the elements of Mqp A are given by
$$
(M_{qp}A)_{ij} = \sum_{s=1}^{n} m_{is} a_{sj} = \begin{cases}
a_{ij} & \text{if } i \ne q \\
a_{qj} - \lambda a_{pj} & \text{if } i = q
\end{cases}
$$
The qth row of Mqp A is the sum of the qth row of A and −λ times the pth row of A, as was to be proved.
The kth step of Gaussian elimination corresponds to the matrix Mk, which is the product of n − k elementary matrices:
$$
M_k = M_{nk} M_{n-1,k} \cdots M_{k+1,k}
$$
Notice that each elementary matrix Mik here is lower triangular because i > k, and therefore, Mk is also lower triangular. If we carry out the Gaussian forward elimination process on A, the result will be an upper triangular matrix U. On the other hand, the result is obtained by applying a succession of factors such as Mk to the left of A. Hence, the entire process is summarized by writing
$$
M_{n-1} \cdots M_2 M_1 A = U
$$
Since each Mk is invertible, we have
$$
A = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} U
$$
Each Mk is lower triangular having 1's on its main diagonal (unit lower triangular). Each inverse $M_k^{-1}$ has the same property, and the same is true of their product. Hence, the matrix
$$
L = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} \tag{6}
$$
is unit lower triangular, and we have
$$
A = LU
$$
This is the so-called LU factorization of A. Our construction of it depends upon not encountering any 0 divisors in the algorithm. It is easy to give examples of matrices that have no LU factorization; one of the simplest is
$$
A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
$$
(See Problem 8.1.4.)
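The derivation can be checked numerically by forming the matrices Mk explicitly. The sketch below is only an illustration of Equation (6), not an efficient algorithm: it uses exact rational arithmetic from Python's `fractions` module, builds each Mk from the current multipliers, and accumulates L as the product of the sign-flipped inverses. All function names are illustrative assumptions.

```python
from fractions import Fraction


def identity(n):
    """n x n identity matrix with exact rational entries."""
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Product of two square matrices of the same order."""
    n = len(x)
    return [[sum(x[i][s] * y[s][j] for s in range(n)) for j in range(n)]
            for i in range(n)]


def step_matrix(a, k):
    """M_k: the identity with the negated step-k multipliers in column k."""
    m = identity(len(a))
    for i in range(k + 1, len(a)):
        m[i][k] = -a[i][k] / a[k][k]
    return m


def inverse_step_matrix(m, k):
    """Inverse of M_k, obtained by flipping the signs below the diagonal."""
    inv = [row[:] for row in m]
    for i in range(k + 1, len(m)):
        inv[i][k] = -inv[i][k]
    return inv


def lu_from_step_matrices(a):
    """Return (L, U) with U = M_{n-1}...M_1 A and L = M_1^{-1}...M_{n-1}^{-1}."""
    n = len(a)
    u = [[Fraction(v) for v in row] for row in a]
    l = identity(n)
    for k in range(n - 1):
        m_k = step_matrix(u, k)      # fails with ZeroDivisionError on a zero pivot
        u = matmul(m_k, u)
        l = matmul(l, inverse_step_matrix(m_k, k))
    return l, u
```

For the 4 × 4 matrix of the numerical example, this reproduces the L and U found earlier. For the 2 × 2 matrix above, `step_matrix` raises `ZeroDivisionError` at the first pivot, which is the failure the text describes.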
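■ THEOREM 1
LU FACTORIZATION THEOREM
Let A = (aij) be an n × n matrix. Assume that the forward elimination phase of the naive Gaussian algorithm is applied to A without encountering any 0 divisors. Let the resulting matrix be denoted by $\tilde{A} = (\tilde{a}_{ij})$. If
$$
L = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
\tilde{a}_{21} & 1 & 0 & \cdots & 0 \\
\tilde{a}_{31} & \tilde{a}_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
\tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & \tilde{a}_{n,n-1} & 1
\end{bmatrix}
$$
and
$$
U = \begin{bmatrix}
\tilde{a}_{11} & \tilde{a}_{12} & \tilde{a}_{13} & \cdots & \tilde{a}_{1n} \\
0 & \tilde{a}_{22} & \tilde{a}_{23} & \cdots & \tilde{a}_{2n} \\
0 & 0 & \tilde{a}_{33} & \cdots & \tilde{a}_{3n} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & \tilde{a}_{nn}
\end{bmatrix}
$$
then A = LU.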
Proof
We define the Gaussian algorithm formally as follows. Let A(1) = A. Then we compute A(2), A(3), …, A(n) recursively by the naive Gaussian algorithm, following these equations:
$$
a_{ij}^{(k+1)} = a_{ij}^{(k)} \qquad (\text{if } i \le k \text{ or } j < k) \tag{7}
$$
$$
a_{ij}^{(k+1)} = \frac{a_{ij}^{(k)}}{a_{kk}^{(k)}} \qquad (\text{if } i > k \text{ and } j = k) \tag{8}
$$
$$
a_{ij}^{(k+1)} = a_{ij}^{(k)} - \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} \qquad (\text{if } i > k \text{ and } j > k) \tag{9}
$$
These equations describe in a precise form the forward elimination phase of the naive Gaussian elimination algorithm. For example, Equation (7) states that in proceeding from A(k) to A(k+1), we do not alter rows 1, 2, …, k or columns 1, 2, …, k − 1. Equation (8) shows how the multipliers are computed and stored in passing from A(k) to A(k+1). Finally, Equation (9) shows how multiples of row k are subtracted from rows k + 1, k + 2, …, n to produce A(k+1) from A(k).
Notice that A(n) is the final result of the process. (It was referred to as $\tilde{A}$ in the statement of the theorem.) The formal definitions of L = (lik) and U = (ukj) are therefore
$$
l_{ik} = 1 \qquad (i = k) \tag{10}
$$
$$
l_{ik} = a_{ik}^{(n)} \qquad (k < i) \tag{11}
$$
$$
l_{ik} = 0 \qquad (k > i) \tag{12}
$$
$$
u_{kj} = a_{kj}^{(n)} \qquad (j \ge k) \tag{13}
$$
$$
u_{kj} = 0 \qquad (j < k) \tag{14}
$$
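Now we draw some consequences of these equations. First, it follows immediately from Equation (7) that
$$
a_{ij}^{(i)} = a_{ij}^{(i+1)} = \cdots = a_{ij}^{(n)} \tag{15}
$$
Likewise, we have, from Equation (7),
$$
a_{ij}^{(j+1)} = a_{ij}^{(j+2)} = \cdots = a_{ij}^{(n)} \qquad (j < n) \tag{16}
$$
From Equations (16) and (8), we now have
$$
a_{ij}^{(n)} = a_{ij}^{(j+1)} = \frac{a_{ij}^{(j)}}{a_{jj}^{(j)}} \qquad (j < i) \tag{17}
$$
From Equations (17) and (11), it follows that
$$
l_{ik} = a_{ik}^{(n)} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} \qquad (k < i) \tag{18}
$$
From Equations (13) and (15), we have
$$
u_{kj} = a_{kj}^{(n)} = a_{kj}^{(k)} \qquad (k \le j) \tag{19}
$$
With the aid of these equations, we can now prove that LU = A. First, consider the case i ≤ j. Then
$$
\begin{aligned}
(LU)_{ij} &= \sum_{k=1}^{n} l_{ik} u_{kj} && \text{[definition of multiplication]} \\
&= \sum_{k=1}^{i} l_{ik} u_{kj} && \text{[by Equation (12)]} \\
&= \sum_{k=1}^{i-1} l_{ik} u_{kj} + u_{ij} && \text{[by Equation (10)]} \\
&= \sum_{k=1}^{i-1} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} + a_{ij}^{(i)} && \text{[by Equations (18) and (19)]} \\
&= \sum_{k=1}^{i-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(i)} && \text{[by Equation (9)]} \\
&= a_{ij}^{(1)} = a_{ij}
\end{aligned}
$$
In the remaining case, i > j, we have
$$
\begin{aligned}
(LU)_{ij} &= \sum_{k=1}^{n} l_{ik} u_{kj} && \text{[definition of multiplication]} \\
&= \sum_{k=1}^{j} l_{ik} u_{kj} && \text{[by Equation (14)]} \\
&= \sum_{k=1}^{j} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} && \text{[by Equations (18) and (19)]} \\
&= \sum_{k=1}^{j-1} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} + a_{ij}^{(j)} \\
&= \sum_{k=1}^{j-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(j)} && \text{[by Equation (9)]} \\
&= a_{ij}^{(1)} = a_{ij}
\end{aligned}
$$
■
Pseudocode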
The following is the pseudocode for carrying out the LU factorization, which is sometimes called the Doolittle factorization:
integer i, j, k, n, s; real array (aij)1:n×1:n, (lij)1:n×1:n, (uij)1:n×1:n
for k = 1 to n do
    lkk ← 1
    for j = k to n do
        ukj ← akj − Σ(s = 1 to k−1) lks usj
    end do
    for i = k + 1 to n do
        lik ← ( aik − Σ(s = 1 to k−1) lis usk ) / ukk
    end do
end do
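The Doolittle pseudocode translates almost line for line into Python. The version below is a sketch under simple assumptions: 0-based indexing, a square list-of-lists input, separate output arrays for L and U, and no pivoting (a zero pivot surfaces as a `ZeroDivisionError`). The function name `doolittle` is illustrative.

```python
def doolittle(a):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        l[k][k] = 1.0
        # Row k of U: subtract the part of the product already accounted for.
        for j in range(k, n):
            u[k][j] = a[k][j] - sum(l[k][s] * u[s][j] for s in range(k))
        # Column k of L: the same residual, divided by the pivot u[k][k].
        for i in range(k + 1, n):
            l[i][k] = (a[i][k] - sum(l[i][s] * u[s][k] for s in range(k))) / u[k][k]
    return l, u
```

On the matrix of the numerical example, the returned factors agree with the L and U obtained there from the multipliers.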
Solving Linear Systems Using LU Factorization
Once the LU factorization of A is available, we can solve the system
$$
Ax = b
$$
by writing
$$
LUx = b
$$
Then we solve two triangular systems:
$$
Lz = b \tag{20}
$$
for z and
$$
Ux = z \tag{21}
$$
for x. This is particularly useful for problems that involve the same coefficient matrix A and many different right-hand vectors b.
Since L is unit lower triangular, z is obtained by the pseudocode
integer i, j, n; real array (bi)1:n, (lij)1:n×1:n, (zi)1:n
z1 ← b1
for i = 2 to n do
    zi ← bi − Σ(j = 1 to i−1) lij zj
end for
Likewise, x is obtained by the pseudocode
integer i, j, n; real array (uij)1:n×1:n, (xi)1:n, (zi)1:n
xn ← zn/unn
for i = n − 1 to 1 step −1 do
    xi ← ( zi − Σ(j = i+1 to n) uij xj ) / uii
end for
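The two triangular solves can be sketched as follows. These are illustrative Python counterparts of the two pseudocodes, assuming 0-based indexing, a unit lower triangular `l`, and an upper triangular `u` with nonzero diagonal; the function names are not from the text.

```python
def forward_substitution_unit(l, b):
    """Solve L z = b where L is unit lower triangular."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(l[i][j] * z[j] for j in range(i))
    return z


def back_substitution(u, z):
    """Solve U x = z where U is upper triangular with nonzero diagonal."""
    n = len(z)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x


def lu_solve(l, u, b):
    """Solve A x = b given the factors of A = L U."""
    return back_substitution(u, forward_substitution_unit(l, b))
```

With the L and U of the numerical example and b = (16, 26, −19, −34), the intermediate vector z is the right-hand side of System (3), namely (16, −6, −9, −3), and the solution is x = (3, 1, −2, 1). Factoring once and calling `lu_solve` repeatedly is what makes many right-hand sides cheap.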
The first of these two algorithms applies the forward phase of Gaussian elimination to the right-hand-side vector b. [Recall that the lij's are the multipliers that have been stored in the array (aij).] The easiest way to verify this assertion is to use Equation (6) and to rewrite the equation
$$
Lz = b
$$
in the form
$$
M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} z = b
$$
From this, we get immediately
$$
z = M_{n-1} \cdots M_2 M_1 b
$$
Thus, the same operations used to reduce A to U are to be used on b to produce z.
Another way to solve Equation (20) is to note that what must be done is to form
$$
M_{n-1} M_{n-2} \cdots M_2 M_1 b
$$
This can be accomplished by using only the array (bi) by putting the results back into b; that is,
$$
b \leftarrow M_k b
$$
We know what Mk looks like because it is made up of negative multipliers that have been saved in the array (aij). Consequently, we have
$$
M_k b = \begin{bmatrix}
1 \\
 & \ddots \\
 & & 1 \\
 & & -a_{k+1,k} & 1 \\
 & & \vdots & & \ddots \\
 & & -a_{ik} & & & 1 \\
 & & \vdots & & & & \ddots \\
 & & -a_{nk} & & & & & 1
\end{bmatrix}
\begin{bmatrix} b_1 \\ \vdots \\ b_k \\ b_{k+1} \\ \vdots \\ b_i \\ \vdots \\ b_n \end{bmatrix}
$$
The entries b1 to bk are not changed by this multiplication, while bi (for i ≥ k + 1) is replaced by −aik bk + bi. Hence, the following pseudocode updates the array (bi) based on the stored multipliers in the array a:
integer i, k, n; real array (aij)1:n×1:n, (bi)1:n
for k = 1 to n − 1 do
    for i = k + 1 to n do
        bi ← bi − aik bk
    end for
end for
This pseudocode should be familiar. It is the process for updating b from Section 7.2.
The algorithm for solving Equation (21) is the back substitution phase of the naive
Gaussian elimination process.
LDLᵀ Factorization
In the LDLᵀ factorization, L is unit lower triangular, and D is a diagonal matrix. This factorization can be carried out if A is symmetric and has an ordinary LU factorization, with L unit lower triangular. To see this, we start with
$$
LU = A = A^T = (LU)^T = U^T L^T
$$
Since L is unit lower triangular, it is invertible, and we can write $U = L^{-1} U^T L^T$. Then $U(L^T)^{-1} = L^{-1} U^T$. Since the right side of this equation is lower triangular and the left side is upper triangular, both sides are diagonal, say, D. From the equation $U(L^T)^{-1} = D$, we have $U = DL^T$ and $A = LU = LDL^T$.
We now derive the pseudocode for obtaining the LDLᵀ factorization of a symmetric matrix A in which L is unit lower triangular and D is diagonal. In our analysis, we write aij as generic elements of A and lij as generic elements of L. The diagonal of D has elements dii, or di. From the equation A = LDLᵀ, we have
$$
\begin{aligned}
a_{ij} &= \sum_{\nu=1}^{n} \sum_{\mu=1}^{n} l_{i\nu}\, d_{\nu\mu}\, l^{T}_{\mu j} \\
&= \sum_{\nu=1}^{n} \sum_{\mu=1}^{n} l_{i\nu}\, d_{\nu}\, \delta_{\nu\mu}\, l_{j\mu} \\
&= \sum_{\nu=1}^{n} l_{i\nu}\, d_{\nu}\, l_{j\nu} \qquad (1 \le i, j \le n)
\end{aligned}
$$
Use the fact that lij = 0 when j > i and lii = 1 to continue the argument
$$
a_{ij} = \sum_{\nu=1}^{\min(i,j)} l_{i\nu}\, d_{\nu}\, l_{j\nu} \qquad (1 \le i, j \le n)
$$
Assume now that j ≤ i. Then
$$
\begin{aligned}
a_{ij} &= \sum_{\nu=1}^{j} l_{i\nu}\, d_{\nu}\, l_{j\nu} \\
&= \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij}\, d_j\, l_{jj} \\
&= \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij}\, d_j \qquad (1 \le j \le i \le n)
\end{aligned}
$$
In particular, let j = i. We get
$$
a_{ii} = \sum_{\nu=1}^{i-1} l_{i\nu}\, d_{\nu}\, l_{i\nu} + d_i \qquad (1 \le i \le n)
$$
Equivalently, we have
$$
d_i = a_{ii} - \sum_{\nu=1}^{i-1} d_{\nu}\, l_{i\nu}^2 \qquad (1 \le i \le n)
$$
Particular cases of this are
$$
\begin{aligned}
d_1 &= a_{11} \\
d_2 &= a_{22} - d_1 l_{21}^2 \\
d_3 &= a_{33} - d_1 l_{31}^2 - d_2 l_{32}^2
\end{aligned}
$$
etc.
Now we can limit our attention to the cases 1 ≤ j < i ≤ n, where we have
$$
a_{ij} = \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij}\, d_j \qquad (1 \le j < i \le n)
$$
…
… > 0 for every nonzero vector x. It follows at once that A is nonsingular because A obviously cannot map any nonzero vector into 0. Moreover, by considering special vectors of the form x = (x1, x2, …, xk, 0, 0, …, 0)ᵀ, we see that the leading principal minors of A are also positive definite. Theorem 1 implies that A has an LU decomposition. By the symmetry of A, we then have, from the previous discussion, A = LDLᵀ. It can be shown that D is positive definite, and thus its elements dii are positive. Denoting by $D^{1/2}$ the diagonal matrix whose diagonal elements are $\sqrt{d_{ii}}$, we have $A = \tilde{L}\tilde{L}^T$ where $\tilde{L} \equiv LD^{1/2}$, which is the Cholesky factorization. We leave the proof of uniqueness to the reader.
The algorithm for the Cholesky factorization is a special case of the general LU factorization algorithm. If A is real, symmetric, and positive definite, then by Theorem 2, it has a unique factorization of the form A = L L T , in which L is lower triangular and has positive diagonal. Thus, in the equation A = LU, U = LT . In the kth step of the general algorithm, the diagonal entry is computed by
$$
l_{kk} = \left( a_{kk} - \sum_{s=1}^{k-1} l_{ks}^2 \right)^{1/2} \tag{22}
$$
The algorithm for the Cholesky factorization will then be as follows:
integer i, k, n, s; real array (aij)1:n×1:n, (lij)1:n×1:n
for k = 1 to n do
    lkk ← ( akk − Σ(s = 1 to k−1) lks² )^(1/2)
    for i = k + 1 to n do
        lik ← ( aik − Σ(s = 1 to k−1) lis lks ) / lkk
    end do
end do
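A direct Python rendering of the Cholesky pseudocode is sketched below. It assumes a real symmetric input given as a list of lists and reads only the lower triangle; the check that raises `ValueError` when the quantity under the square root is not positive is an addition for safety, not part of the pseudocode. The function name `cholesky` is illustrative.

```python
import math


def cholesky(a):
    """Return the lower triangular L with positive diagonal such that A = L L^T."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Equation (22): diagonal entry from the remaining part of a[k][k].
        radicand = a[k][k] - sum(l[k][s] ** 2 for s in range(k))
        if radicand <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[k][k] = math.sqrt(radicand)
        # Entries below the diagonal in column k.
        for i in range(k + 1, n):
            l[i][k] = (a[i][k] - sum(l[i][s] * l[k][s] for s in range(k))) / l[k][k]
    return l
```

As a check, the symmetric matrix with rows (4, 3, 2, 1), (3, 3, 2, 1), (2, 2, 2, 1), (1, 1, 1, 1) is used here as a test case. It is reconstructed by multiplying out the factor printed in Example 2 of this section, so treat it as an assumption rather than a quotation. For that matrix the function returns, to four decimals, the factor shown in Example 2: first column 2, 1.5, 1, 0.5 and diagonal 2, 0.8660, 0.8165, 0.7071.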
Theorem 2 guarantees that lkk > 0. Observe that Equation (22) gives us the following bound:
$$
a_{kk} = \sum_{s=1}^{k} l_{ks}^2 \ge l_{kj}^2 \qquad (j \le k)
$$
from which we conclude that
$$
|l_{kj}| \le \sqrt{a_{kk}} \qquad (1 \le j \le k)
$$
Hence, any element of L is bounded by the square root of a corresponding diagonal element in A. This implies that the elements of L do not become large relative to A even without any pivoting. In the Cholesky algorithm (and the Doolittle algorithms), the dot products of vectors should be computed in double precision to avoid a buildup of roundoff errors.
EXAMPLE 2
Determine the Cholesky factorization of the matrix in Example 1.
Solution
Using the results from Example 1, we write
$$
A = LDL^T = (LD^{1/2})(D^{1/2}L^T) = \tilde{L}\tilde{L}^T
$$
where
$$
\begin{aligned}
\tilde{L} = LD^{1/2}
&= \begin{bmatrix}
1 & 0 & 0 & 0 \\
\frac{3}{4} & 1 & 0 & 0 \\
\frac{1}{2} & \frac{2}{3} & 1 & 0 \\
\frac{1}{4} & \frac{1}{3} & \frac{1}{2} & 1
\end{bmatrix}
\begin{bmatrix}
2 & 0 & 0 & 0 \\
0 & \frac{1}{2}\sqrt{3} & 0 & 0 \\
0 & 0 & \sqrt{\frac{2}{3}} & 0 \\
0 & 0 & 0 & \frac{1}{2}\sqrt{2}
\end{bmatrix} \\
&= \begin{bmatrix}
2 & 0 & 0 & 0 \\
\frac{3}{2} & \frac{1}{2}\sqrt{3} & 0 & 0 \\
1 & \frac{1}{3}\sqrt{3} & \sqrt{\frac{2}{3}} & 0 \\
\frac{1}{2} & \frac{1}{6}\sqrt{3} & \frac{1}{2}\sqrt{\frac{2}{3}} & \frac{1}{2}\sqrt{2}
\end{bmatrix}
= \begin{bmatrix}
2.0000 & 0 & 0 & 0 \\
1.5000 & 0.8660 & 0 & 0 \\
1.0000 & 0.5774 & 0.8165 & 0 \\
0.5000 & 0.2887 & 0.4082 & 0.7071
\end{bmatrix}
\end{aligned}
$$
Clearly, $\tilde{L}$ is the lower triangular matrix in the Cholesky factorization $A = \tilde{L}\tilde{L}^T$. ■
Multiple Right-Hand Sides
Many software packages for solving linear systems allow the input of multiple right-hand sides. Suppose an n × m matrix B is
$$
B = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
$$
in which each column corresponds to a right-hand side of the m linear systems
$$
Ax^{(j)} = b^{(j)}
$$
for 1 ≤ j ≤ m. Thus, we can write
$$
A[x^{(1)}, x^{(2)}, \ldots, x^{(m)}] = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
$$
or
$$
AX = B
$$
For example, procedure Gauss can be used once to produce a factorization of A, and procedure Solve can be used m times with right-hand side vectors b(j) to find the m solution vectors x(j) for 1 ≤ j ≤ m. Since the factorization phase can be done in $\frac{1}{3}n^3$ long operations while each of the back substitution phases requires $n^2$ long operations, this entire process can be done in $\frac{1}{3}n^3 + mn^2$ long operations. This is much less than $m\left(\frac{1}{3}n^3 + n^2\right)$, which is what it would take if each of the m linear systems were solved separately.
Computing A−1
In some applications, such as in statistics, it may be necessary to compute the inverse of a matrix A and explicitly display it as A−1. This can be done by using procedures Gauss and Solve. If an n × n matrix A has an inverse, it is an n × n matrix X with the property that
$$
AX = I \tag{23}
$$
where I is the identity matrix. If x(j) denotes the jth column of X and I(j) denotes the jth column of I, then matrix Equation (23) can be written as
$$
A[x^{(1)}, x^{(2)}, \ldots, x^{(n)}] = [I^{(1)}, I^{(2)}, \ldots, I^{(n)}]
$$
This can be written as n linear systems of equations of the form
$$
Ax^{(j)} = I^{(j)} \qquad (1 \le j \le n)
$$
Now use procedure Gauss once to produce a factorization of A, and use procedure Solve n times with the right-hand side vectors I(j) for 1 ≤ j ≤ n. This is equivalent to solving, one at a time, for the columns of A−1, which are x(j). Hence,
$$
A^{-1} = [x^{(1)}, x^{(2)}, \ldots, x^{(n)}]
$$
A word of caution on computing the inverse of a matrix: In solving a linear system Ax = b, it is not advisable to determine A−1 and then compute the matrix-vector product x = A−1b because this requires many unnecessary calculations, compared to directly solving Ax = b for x.
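The column-by-column construction of the inverse can be sketched as follows. The functions below are stand-ins for the book's procedures Gauss and Solve, not reproductions of them: they use naive elimination without pivoting (so a zero pivot raises `ZeroDivisionError`) and exact rational arithmetic from `fractions` so that the result can be checked without roundoff. All names are illustrative.

```python
from fractions import Fraction


def lu_factor(a):
    """Factor once: return the compact LU array (multipliers below the diagonal)."""
    n = len(a)
    lu = [[Fraction(v) for v in row] for row in a]
    for k in range(n - 1):
        if lu[k][k] == 0:
            raise ZeroDivisionError("zero pivot: naive elimination fails")
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


def lu_solve(lu, b):
    """Solve A x = b from the compact factorization: L z = b, then U x = z."""
    n = len(b)
    z = [Fraction(v) for v in b]
    for i in range(n):
        z[i] -= sum(lu[i][j] * z[j] for j in range(i))
    x = z[:]
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


def inverse(a):
    """Compute the inverse by solving A x^(j) = I^(j) for each column j."""
    n = len(a)
    lu = lu_factor(a)                      # one factorization ...
    columns = [lu_solve(lu, [int(i == j) for i in range(n)])
               for j in range(n)]          # ... reused for n right-hand sides
    return [[columns[j][i] for j in range(n)] for i in range(n)]
```

The same pattern (one call to `lu_factor`, many calls to `lu_solve`) covers the general multiple right-hand-side problem AX = B; the inverse is simply the case B = I.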
Example Using Software Packages
A permutation matrix is an n×n matrix P that arises from the identity matrix by permuting its rows. It then turns out that permuting the rows of any n × n matrix A can be accomplished by multiplying A on the left by P . Every permutation matrix is nonsingular, since the rows still form a basis for Rn . When Gaussian elimination with row pivoting is performed on a matrix A, the result is expressible as
P A = LU
where L is lower triangular and U is upper triangular. The matrix P A is A with its rows rearranged. If we have the LU factorization of P A, how do we solve the system Ax = b?
First, write it as
$$
PAx = Pb
$$
then LUx = Pb. Let y = Ux, so that our problem is now
$$
Ly = Pb, \qquad Ux = y
$$
The first equation is easily solved for y, and then the second equation is easily solved for x. Mathematical software systems such as Matlab, Maple, and Mathematica produce factorizations of the form PA = LU upon command.
EXAMPLE 3
Use mathematical software systems such as Matlab, Maple, and Mathematica to find the LU factorization of this matrix:
$$
A = \begin{bmatrix}
6 & -2 & 2 & 4 \\
12 & -8 & 6 & 10 \\
3 & -13 & 9 & 3 \\
-6 & 4 & 1 & -18
\end{bmatrix} \tag{24}
$$
Solution
First, we use Maple and find this factorization:
$$
A = LU = \begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\frac{1}{2} & 3 & 1 & 0 \\
-1 & -\frac{1}{2} & 2 & 1
\end{bmatrix}
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix}
$$
Next, we use Matlab and find a different factorization:
$$
PA = LU
$$
$$
L = \begin{bmatrix}
1.0000 & 0 & 0 & 0 \\
0.2500 & 1.0000 & 0 & 0 \\
-0.5000 & 0 & 1.0000 & 0 \\
0.5000 & -0.1818 & 0.0909 & 1.0000
\end{bmatrix}
$$
$$
U = \begin{bmatrix}
12.0000 & -8.0000 & 6.0000 & 10.0000 \\
0 & -11.0000 & 7.5000 & 0.5000 \\
0 & 0 & 4.0000 & -13.0000 \\
0 & 0 & 0 & 0.2727
\end{bmatrix}
$$
$$
P = \begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{bmatrix}
$$
where P is a permutation matrix corresponding to the pivoting strategy used. Finally, we use Mathematica to create this LU decomposition:
$$
\begin{bmatrix}
3 & -13 & 9 & 3 \\
-2 & -22 & 19 & -12 \\
2 & -\frac{12}{11} & \frac{52}{11} & -\frac{166}{11} \\
4 & -2 & \frac{22}{13} & -\frac{6}{13}
\end{bmatrix}
$$
The output is in a compact storage scheme that contains both the lower triangular matrix and the upper triangular matrix in a single matrix. However, the storage arrangement may be complicated because the rows are usually permuted during the factorization in an effort to make the solution process numerically stable. Verify that this factorization corresponds to the permutation of rows of matrix A in the order 3, 4, 1, 2. ■
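The factorization PA = LU with row pivoting, and the two-step solve Ly = Pb, Ux = y, can be sketched in plain Python. This is an illustration only and does not reproduce any of the packages named above: it assumes ordinary partial pivoting (largest magnitude in the current column, first occurrence on ties) and represents P by a list `perm`, where row i of PA is row `perm[i]` of A. The function names are hypothetical.

```python
def lu_partial_pivoting(a):
    """Return (perm, L, U) with P A = L U, using partial pivoting."""
    n = len(a)
    u = [[float(v) for v in row] for row in a]
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n - 1):
        # Pivot row: largest magnitude entry in column k, rows k..n-1.
        p = max(range(k, n), key=lambda r: abs(u[r][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            u[k], u[p] = u[p], u[k]
            l[k], l[p] = l[p], l[k]        # carry along stored multipliers
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    for i in range(n):
        l[i][i] = 1.0
    return perm, l, u


def solve_with_pivoting(perm, l, u, b):
    """Solve A x = b from P A = L U: first L y = P b, then U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(l[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(u[i][j] * x[j] for j in range(i + 1, n))) / u[i][i]
    return x
```

For the matrix in Equation (24), this pivoting strategy selects the rows of A in the order 2, 3, 4, 1, which matches the permutation matrix P shown for the Matlab result, and the computed L and U agree with the printed values to four decimals.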
Summary
(1) If A = (a_ij) is an n × n matrix such that the forward elimination phase of the naive Gaussian algorithm can be applied to A without encountering any zero divisors, then the resulting matrix can be denoted by $\tilde{A} = (\tilde{a}_{ij})$, where
$$
L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \tilde{a}_{21} & 1 & 0 & \cdots & 0 \\ \tilde{a}_{31} & \tilde{a}_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & \tilde{a}_{n,n-1} & 1 \end{bmatrix}
$$
and
$$
U = \begin{bmatrix} \tilde{a}_{11} & \tilde{a}_{12} & \tilde{a}_{13} & \cdots & \tilde{a}_{1n} \\ 0 & \tilde{a}_{22} & \tilde{a}_{23} & \cdots & \tilde{a}_{2n} \\ 0 & 0 & \tilde{a}_{33} & \cdots & \tilde{a}_{3n} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \tilde{a}_{nn} \end{bmatrix}
$$
This is the LU factorization of A, so A = LU, where L is unit lower triangular and U is upper triangular. When we carry out the Gaussian forward elimination process on A, the result is the upper triangular matrix U. The matrix L is the unit lower triangular matrix whose entries are negatives of the multipliers in the locations of the elements they zero out.
(2) We can also give a formal description as follows. The matrix U can be obtained by applying a succession of matrices M_k to the left of A. The kth step of Gaussian elimination corresponds to a unit lower triangular matrix M_k, which is the product of n − k elementary matrices
$$
M_k = M_{nk} M_{n-1,k} \cdots M_{k+1,k}
$$
where each elementary matrix M_ik is unit lower triangular. If M_qp A is obtained by subtracting λ times row p from row q in matrix A with p < q, then the elements of M_qp = (m_ij) are
$$
m_{ij} = \begin{cases} 1 & \text{if } i = j \\ -\lambda & \text{if } i = q \text{ and } j = p \\ 0 & \text{in all other cases} \end{cases}
$$
The entire Gaussian elimination process is summarized by writing
$$
M_{n-1} \cdots M_2 M_1 A = U
$$
Since each M_k is invertible, we have
$$
A = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} U
$$
Each M_k is a unit lower triangular matrix, and the same is true of each inverse $M_k^{-1}$, as well as their products. Hence, the matrix
$$
L = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}
$$
is unit lower triangular.
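As an illustration of item (2), the following sketch builds the matrices M_k for the 4 × 4 matrix A of Equation (24) and forms L as the product of their inverses. It is a sketch in Python with NumPy under the assumption that no zero pivot occurs; the function name `elimination_matrices` is hypothetical and not taken from the text.

```python
import numpy as np

def elimination_matrices(A):
    """Return ([M_1, ..., M_{n-1}], U) for naive Gaussian elimination.

    Assumes every pivot encountered is nonzero (no row interchanges).
    """
    U = np.array(A, dtype=float)
    n = U.shape[0]
    Ms = []
    for k in range(n - 1):
        M = np.eye(n)
        # Column k of M_k holds the negated multipliers below the diagonal
        M[k + 1:, k] = -U[k + 1:, k] / U[k, k]
        U = M @ U
        Ms.append(M)
    return Ms, U

A = np.array([[6, -2, 2, 4],
              [12, -8, 6, 10],
              [3, -13, 9, 3],
              [-6, 4, 1, -18]], dtype=float)

Ms, U = elimination_matrices(A)

# L = M_1^{-1} M_2^{-1} ... M_{n-1}^{-1}; the inverse of each M_k is obtained
# by flipping the sign of its off-diagonal entries, that is, 2I - M_k
L = np.eye(4)
for M in Ms:
    L = L @ (2 * np.eye(4) - M)

print(L)
print(U)
```

Forming the matrices M_k explicitly is done here only to mirror the formal description; a practical code stores the multipliers directly in the lower triangle.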
(3) For symmetric matrices, we have the LDL^T factorization, and for symmetric positive definite matrices, we have the LL^T factorization, which is also known as Cholesky factorization.
(4) If the LU factorization of A is available, we can solve the system Ax = b by solving two triangular systems:
$$
Ly = b \ \text{ for } y, \qquad Ux = y \ \text{ for } x
$$
This is useful for problems that involve the same coefficient matrix A and many different right-hand vectors b. For example, let B be an n × m matrix of the form
$$
B = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
$$
where each column corresponds to a right-hand side of the m linear systems
$$
A x^{(j)} = b^{(j)} \qquad (1 \le j \le m)
$$
Thus, we can write
$$
A[x^{(1)}, x^{(2)}, \ldots, x^{(m)}] = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
$$
or
$$
AX = B
$$
A special case of this is to compute the inverse of an n × n invertible matrix A. We write
AX = I
where I is the identity matrix. If x(j) denotes the jth column of X and I(j) denotes the jth column of I, this can be written as
$$
A[x^{(1)}, x^{(2)}, \ldots, x^{(n)}] = [I^{(1)}, I^{(2)}, \ldots, I^{(n)}]
$$
or as n linear systems of equations of the form
$$
A x^{(j)} = I^{(j)} \qquad (1 \le j \le n)
$$
We can use LU factorization to solve these n systems efficiently, obtaining $A^{-1} = [x^{(1)}, x^{(2)}, \ldots, x^{(n)}]$.
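The column-by-column solve described in item (4) can be sketched as follows. This is a minimal Python/NumPy illustration, not the procedures Gauss and Solve of the text; the helper names `forward_sub`, `back_sub`, and `lu_solve_columns` are hypothetical, and the factors used are those found for the matrix of Equation (24).

```python
import numpy as np

def forward_sub(L, b):
    """Solve Ly = b for lower triangular L."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve Ux = y for upper triangular U."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def lu_solve_columns(L, U, B):
    """Solve AX = B one column at a time, reusing the factors A = LU."""
    cols = [back_sub(U, forward_sub(L, B[:, j])) for j in range(B.shape[1])]
    return np.column_stack(cols)

L = np.array([[1, 0, 0, 0], [2, 1, 0, 0], [0.5, 3, 1, 0], [-1, -0.5, 2, 1]])
U = np.array([[6, -2, 2, 4], [0, -4, 2, 2], [0, 0, 2, -5], [0, 0, 0, -3]], dtype=float)
A = L @ U

# Taking B = I gives the inverse of A
A_inv = lu_solve_columns(L, U, np.eye(4))
print(np.allclose(A @ A_inv, np.eye(4)))
```

The factorization is computed once and each additional right-hand side costs only two triangular solves, which is the point of the remark about many right-hand vectors.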
(5) When Gaussian elimination with row pivoting is performed on a matrix A, the result is expressible as
$$
PA = LU
$$
where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. Here, the matrix PA is A with its rows interchanged. We can solve the system Ax = b by solving
$$
Ly = Pb \ \text{ for } y, \qquad Ux = y \ \text{ for } x
$$

Problems 8.1

1. Using naive Gaussian elimination, factor the following matrices in the form A = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix.
ᵃa. $A = \begin{bmatrix} 3 & 0 & 3 \\ 0 & -1 & 3 \\ 1 & 3 & 0 \end{bmatrix}$
b. $A = \begin{bmatrix} 1 & 0 & \tfrac{1}{3} & 0 \\ 0 & 1 & 3 & -1 \\ 3 & -3 & 0 & 6 \\ 0 & 2 & 4 & -6 \end{bmatrix}$
c. $A = \begin{bmatrix} -20 & -15 & -10 & -5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$
2. Consider the matrix
$$
A = \begin{bmatrix} 1 & 0 & 0 & 2 \\ 0 & 3 & 0 & 0 \\ 0 & 9 & 4 & 0 \\ 5 & 0 & 8 & 10 \end{bmatrix}
$$
ᵃa. Determine a unit lower triangular matrix M and an upper triangular matrix U such that MA = U.
b. Determine a unit lower triangular matrix L and an upper triangular matrix U such that A = LU. Show that ML = I so that L = M^{-1}.
3. Consider the matrix
$$
A = \begin{bmatrix} 25 & 0 & 0 & 0 & 1 \\ 0 & 27 & 4 & 3 & 2 \\ 0 & 54 & 58 & 0 & 0 \\ 0 & 108 & 116 & 0 & 0 \\ 100 & 0 & 0 & 0 & 24 \end{bmatrix}
$$
ᵃa. Determine the unit lower triangular matrix M and the upper triangular matrix U such that MA = U.
b. Determine M^{-1} = L such that A = LU.
4. Consider the matrix
$$
A = \begin{bmatrix} 2 & 2 & 1 \\ 1 & 1 & 1 \\ 3 & 2 & 1 \end{bmatrix}
$$
a. Show that A cannot be factored into the product of a unit lower triangular matrix and an upper triangular matrix.
ᵃb. Interchange the rows of A so that this can be done.
5. Consider the matrix
$$
A = \begin{bmatrix} a & 0 & 0 & z \\ 0 & b & 0 & 0 \\ 0 & x & c & 0 \\ w & 0 & y & d \end{bmatrix}
$$
ᵃa. Determine a unit lower triangular matrix M and an upper triangular matrix U such that MA = U.
ᵃb. Determine a lower triangular matrix L′ and a unit upper triangular matrix U′ such that A = L′U′.
6. Consider the matrix
$$
A = \begin{bmatrix} 4 & -1 & -1 & 0 \\ -1 & 4 & 0 & -1 \\ -1 & 0 & 4 & -1 \\ 0 & -1 & -1 & 4 \end{bmatrix}
$$
Factor A in the following ways:
ᵃa. A = LU, where L is unit lower triangular and U is upper triangular.
ᵃb. A = LDU′, where L is unit lower triangular, D is diagonal, and U′ is unit upper triangular.
ᵃc. A = L′U′, where L′ is lower triangular and U′ is unit upper triangular.
ᵃd. A = (L″)(L″)^T, where L″ is lower triangular.
ᵃe. Evaluate the determinant of A. Hint: det(A) = det(L) det(D) det(U′) = det(D).
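7. Consider the 3 × 3 Hilbert matrix
$$
A = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{4} \\ \tfrac{1}{3} & \tfrac{1}{4} & \tfrac{1}{5} \end{bmatrix}
$$
Repeat the preceding problem using this matrix.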
ᵃ8. Find the LU decomposition, where L is unit lower triangular, for
$$
A = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & -1 \\ -1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \end{bmatrix}
$$
9. Consider
$$
A = \begin{bmatrix} 2 & -1 & 2 \\ 2 & -3 & 3 \\ 6 & -1 & 8 \end{bmatrix}
$$
ᵃa. Find the matrix factorization A = LDU′, where L is unit lower triangular, D is diagonal, and U′ is unit upper triangular.
ᵃb. Use this decomposition of A to solve Ax = b, where b = [−2, −5, 0]^T.
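ᵃ10. Repeat the preceding problem for
$$
A = \begin{bmatrix} -2 & 1 & -2 \\ -4 & 3 & -3 \\ 2 & 2 & 4 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix}
$$
11. Consider the system of equations
$$
\begin{cases} 6x_1 = 12 \\ 6x_2 + 3x_1 = -12 \\ 7x_3 - 2x_2 + 4x_1 = 14 \\ 21x_4 + 9x_3 - 3x_2 + 5x_1 = -2 \end{cases}
$$
a. Solve for x1, x2, x3, and x4 (in order) by forward substitution.
b. Write this system in matrix notation Ax = b, where x = [x1, x2, x3, x4]^T. Determine the LU factorization A = LU, where L is unit lower triangular and U is upper triangular.
ᵃ12. Given
$$
A = \begin{bmatrix} 3 & 2 & -1 \\ 5 & 3 & 2 \\ -1 & 1 & -3 \end{bmatrix}, \qquad
L^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{5}{3} & 1 & 0 \\ -8 & 5 & 1 \end{bmatrix}, \qquad
U = \begin{bmatrix} 3 & 2 & -1 \\ 0 & -\tfrac{1}{3} & \tfrac{11}{3} \\ 0 & 0 & 15 \end{bmatrix}
$$
obtain the inverse of A by solving $U X^{(j)} = L^{-1} I^{(j)}$ for j = 1, 2, 3.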
13. Using the system of Equations (2), form M = M3 M2 M1 and determine M−1. Verify
that M−1 = L. Why is this, in general, not a good idea?
14. Consider the matrix A = tridiagonal (ai,i−1, aii , ai,i+1), where aii ≠ 0.
ᵃa. Establish the algorithm

    integer i
    real array (aij)1:n×1:n, (lij)1:n×1:n, (uij)1:n×1:n
    l11 ← a11
    for i = 2 to 4 do
        li,i−1 ← ai,i−1
        ui−1,i ← ai−1,i / li−1,i−1
        li,i ← ai,i − li,i−1 ui−1,i
    end for

for determining the elements of a lower tridiagonal matrix L = (lij) and a unit upper tridiagonal matrix U = (uij) such that A = LU.
b. Establish the algorithm

    integer i; real array (aij)1:n×1:n, (lij)1:n×1:n, (uij)1:n×1:n
    u11 ← a11
    for i = 2 to 4 do
        ui−1,i ← ai−1,i
        li,i−1 ← ai,i−1 / ui−1,i−1
        ui,i ← ai,i − li,i−1 ui−1,i
    end for

for determining the elements of a unit lower triangular matrix L = (lij) and an upper tridiagonal matrix U = (uij) such that A = LU.
By extending the loops, we can generalize these algorithms to n×n tridiagonal matrices.
15. Show that the equation AX = B can be solved by Gaussian elimination with scaled partial pivoting in (n³/3) + mn² + O(n²) multiplications and divisions, where A, X, and B are matrices of order n × n, n × m, and n × m, respectively. Thus, if B is n × n, then the n × n solution matrix X can be found by Gaussian elimination with scaled partial pivoting in (4/3)n³ + O(n²) multiplications and divisions. Hint: If X^{(j)} and B^{(j)} are the jth columns of X and B, respectively, then AX^{(j)} = B^{(j)}.
16. Let X be a square matrix that has the form
$$
X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}
$$
where A and D are square matrices and A^{-1} exists. It is known that X^{-1} exists if and only if (D − CA^{-1}B)^{-1} exists. Verify that X^{-1} is given by
$$
X^{-1} = \begin{bmatrix} I & -A^{-1}B \\ 0 & I \end{bmatrix}
\begin{bmatrix} A^{-1} & 0 \\ 0 & (D - CA^{-1}B)^{-1} \end{bmatrix}
\begin{bmatrix} I & 0 \\ -CA^{-1} & I \end{bmatrix}
$$
As an application, compute the inverse of the following:
ᵃa. $X = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 2 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
ᵃb. $X = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}$
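ᵃ17. Let A be an n × n complex matrix such that A^{-1} exists. Verify that
$$
\begin{bmatrix} A & \overline{A} \\ -Ai & \overline{A}i \end{bmatrix}^{-1}
= \frac{1}{2} \begin{bmatrix} A^{-1} & A^{-1}i \\ \overline{A}^{-1} & -\overline{A}^{-1}i \end{bmatrix}
$$
where $\overline{A}$ denotes the complex conjugate of A; if A = (a_ij), then $\overline{A} = (\overline{a}_{ij})$. Recall that for a complex number z = a + bi, where a and b are real, $\overline{z} = a - bi$.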
18. Find the LU factorization of this matrix:
$$
A = \begin{bmatrix} 2 & 2 & 1 \\ 4 & 7 & 2 \\ 2 & 11 & 5 \end{bmatrix}
$$
19. a. Prove that the product of two lower triangular matrices is lower triangular.
b. Prove that the product of two unit lower triangular matrices is unit lower triangular.
c. Prove that the inverse of a unit lower triangular matrix is unit lower triangular.
d. By using the transpose operation, prove that all of the preceding results are true for upper triangular matrices.
20. Let L be lower triangular, U be upper triangular, and D be diagonal.
a. If L and U are both unit triangular and L DU is diagonal, does it follow that L and
U are diagonal?
b. If L DU is nonsingular and diagonal, does it follow that L and U are diagonal?
c. If L and U are both unit triangular and if L DU is diagonal, does it follow that L = U = I?
21. Determine the LDL^T factorization for the following matrix:
$$
A = \begin{bmatrix} 1 & 2 & -1 & 1 \\ 2 & 3 & -4 & 3 \\ -1 & -4 & -1 & 3 \\ 1 & 3 & 3 & 0 \end{bmatrix}
$$
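22. Find the Cholesky factorization of
$$
A = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 25 & 19 \\ 10 & 19 & 62 \end{bmatrix}
$$
23. Consider the system
$$
\begin{bmatrix} A & 0 \\ B & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} b \\ d \end{bmatrix}
$$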
Show how to solve the system more cheaply using the submatrices rather than the overall system. Give an estimate of the computational cost of both the new and old approaches. This problem illustrates solving a block linear system with a special structure.
24. Determine the LDL^T factorization of the matrix
$$
A = \begin{bmatrix} 5 & 35 & -20 & 65 \\ 35 & 244 & -143 & 461 \\ -20 & -143 & 73 & -232 \\ 65 & 461 & -232 & 856 \end{bmatrix}
$$
Can you find the Cholesky factorization?
25. (Sparse factorizations) Consider the following sparse symmetric matrices with the nonzero pattern shown where nonzero entries in the matrix are indicated by the × symbol and zero entries are a blank. Show the nonzero pattern in the matrix L for the Cholesky factorization by using the symbol + for the fill-in of a zero entry by a nonzero entry.
[Nonzero patterns of the sparse symmetric matrices in parts a, b, and c: the × layouts are not recoverable from the extracted text.]
Computer Problems 8.1

1. Write and test a procedure for implementing the algorithms of Problem 8.1.14.
2. The n × n factorization A = LU, where L = (l_ij) is lower triangular and U = (u_ij) is upper triangular, can be computed directly by the following algorithm (provided zero divisions are not encountered): Specify either l_11 or u_11 and compute the other such that l_11 u_11 = a_11. Compute the first column in L by
$$
l_{i1} = \frac{a_{i1}}{u_{11}} \qquad (1 \le i \le n)
$$
and compute the first row in U by
$$
u_{1j} = \frac{a_{1j}}{l_{11}} \qquad (1 \le j \le n)
$$
Now suppose that columns 1, 2, ..., k − 1 have been computed in L and that rows 1, 2, ..., k − 1 have been computed in U. At the kth step, specify either l_kk or u_kk, and compute the other such that
$$
l_{kk} u_{kk} = a_{kk} - \sum_{m=1}^{k-1} l_{km} u_{mk}
$$
Compute the kth column in L by
$$
l_{ik} = \frac{1}{u_{kk}} \left( a_{ik} - \sum_{m=1}^{k-1} l_{im} u_{mk} \right) \qquad (k \le i \le n)
$$
and compute the kth row in U by
$$
u_{kj} = \frac{1}{l_{kk}} \left( a_{kj} - \sum_{m=1}^{k-1} l_{km} u_{mj} \right) \qquad (k \le j \le n)
$$
This algorithm is continued until all elements of U and L are completely determined. When l_ii = 1 (1 ≤ i ≤ n), this procedure is called the Doolittle factorization, and when u_jj = 1 (1 ≤ j ≤ n), it is known as the Crout factorization.
Define the test matrix
$$
A = \begin{bmatrix} 5 & 7 & 6 & 5 \\ 7 & 10 & 8 & 7 \\ 6 & 8 & 10 & 9 \\ 5 & 7 & 9 & 10 \end{bmatrix}
$$
Using the algorithm above, compute and print factorizations so that the diagonal entries of L and U are of the following forms:

    diag(L)          diag(U)
    [1, 1, 1, 1]     [?, ?, ?, ?]     Doolittle
    [?, ?, ?, ?]     [1, 1, 1, 1]     Crout
    [1, ?, 1, ?]     [?, 1, ?, 1]
    [?, 1, ?, 1]     [1, ?, 1, ?]
    [?, ?, 7, 9]     [3, 5, ?, ?]

Here the question mark means that the entry is to be computed. Write code to check the results by multiplying L and U together.
3. Write
procedure Poly(n, (aij), (ci), k, (yij))
for computing the n × n matrix p_k(A) stored in array (yij):
$$
(y_{ij}) = p_k(A) = c_0 I + c_1 A + c_2 A^2 + \cdots + c_k A^k
$$
where A is an n × n matrix and p_k is a kth-degree polynomial. Here (c_i) are real constants for 0 ≤ i ≤ k. Use nested multiplication and write efficient code. Test procedure Poly on the following data:
Case 1. $A = I_5$, $p_3(x) = 1 - 5x + 10x^3$
Case 2. $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, $p_2(x) = 1 - 2x + x^2$
Case 3. $A = \begin{bmatrix} 0 & 2 & 4 \\ 0 & 0 & 8 \\ 0 & 0 & 0 \end{bmatrix}$, $p_3(x) = 1 + 3x - 3x^2 + x^3$
ᵃCase 4. $A = \begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{bmatrix}$, $p_5(x) = 10 + x - 2x^2 + 3x^3 - 4x^4 + 5x^5$
Case 5. $A = \begin{bmatrix} -20 & -15 & -10 & -5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$, $p_4(x) = 5 + 10x + 15x^2 + 20x^3 + x^4$
Case 6. $A = \begin{bmatrix} 5 & 7 & 6 & 5 \\ 7 & 10 & 8 & 7 \\ 6 & 8 & 10 & 9 \\ 5 & 7 & 9 & 10 \end{bmatrix}$, $p_4(x) = 1 - 100x + 146x^2 - 35x^3 + x^4$
4. Write and test a procedure for determining A−1 for a given square matrix A of order n. Your procedure should use procedures Gauss and Solve.
5. Write and test a procedure to solve the system AX = B in which A, X, and B are matrices of order n×n, n×m, and n×m, respectively. Verify that the procedure works on several test cases, one of which has B = I so that the solution X is the inverse of
A. Hint: See Problem 8.1.15.
6. Write and test a procedure for directly computing the inverse of a tridiagonal matrix.
Assume that pivoting is not necessary.
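7. (Continuation) Test the procedure of the preceding computer problem on the symmetric tridiagonal matrix A of order 10:
$$
A = \begin{bmatrix} -2 & 1 & & & & \\ 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \\ & & & & 1 & -2 \end{bmatrix}
$$
The inverse of this matrix is known to be
$$
(A^{-1})_{ij} = (A^{-1})_{ji} = \frac{-i(n + 1 - j)}{n + 1} \qquad (i \le j)
$$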
8. Investigate the numerical difficulties in inverting the following matrix:
$$
A = \begin{bmatrix} -0.0001 & 5.096 & 5.101 & 1.853 \\ 0. & 3.737 & 3.740 & 3.392 \\ 0. & 0. & 0.006 & 5.254 \\ 0. & 0. & 0. & 4.567 \end{bmatrix}
$$
9. Consider the following two test matrices:
$$
A = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 25 & 19 \\ 10 & 19 & 62 \end{bmatrix}, \qquad
B = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 13 & 19 \\ 10 & 19 & 62 \end{bmatrix}
$$
Show that the first Cholesky factorization has all integers in the solution, while the second one is all integers until the last step, where there is a square root.
a. Program the Cholesky algorithm.
b. Use Matlab, Maple, or Mathematica to find the Cholesky factorizations.
10. Let A be real, symmetric, and positive definite. Is the same true for the matrix obtained by removing the first row and column of A?
11. Devise a code for inverting a unit lower triangular matrix. Test it on the following matrix:
$$
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 5 & 2 & 1 & 0 \\ 7 & 4 & -3 & 1 \end{bmatrix}
$$
12. Verify Example 1 using Matlab, Maple, or Mathematica.
13. In Example 3, verify the factorizations of matrix A using Matlab, Maple, and Mathematica.
14. Find the PA = LU factorization of this matrix:
$$
A = \begin{bmatrix} -0.05811 & -0.11696 & 0.51004 & -0.31330 \\ -0.04291 & 0.56850 & 0.07041 & 0.68747 \\ -0.01652 & 0.38953 & 0.01203 & -0.52927 \\ -0.06140 & 0.32179 & -0.22094 & 0.42448 \end{bmatrix}
$$
which was studied by Wilkinson [1965, p. 640].

8.2 Iterative Solutions of Linear Systems
In this section, a completely different strategy for solving a nonsingular linear system
Ax = b (1)
is explored. This alternative approach is often used on enormous problems that arise in solving partial differential equations numerically. In that subject, systems having hundreds of thousands of equations arise routinely.
Vector and Matrix Norms
We first present a brief overview of vector and matrix norms because they are useful in the discussion of errors and in the stopping criteria for iterative methods. Norms can be defined on any vector space, but we usually use Rn or Cn. A vector norm ||x|| can be thought of as
the length or magnitude of a vector x ∈ R^n. A vector norm is any mapping from R^n to R that obeys these three properties:
$$
\|x\| > 0 \ \text{ if } x \ne 0
$$
$$
\|\alpha x\| = |\alpha| \, \|x\|
$$
$$
\|x + y\| \le \|x\| + \|y\| \qquad \text{(triangle inequality)}
$$
for vectors x, y ∈ R^n and scalars α ∈ R. Examples of vector norms for the vector x = (x_1, x_2, ..., x_n)^T ∈ R^n are
$$
\|x\|_1 = \sum_{i=1}^{n} |x_i| \qquad \ell_1\text{-vector norm}
$$
$$
\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \qquad \text{Euclidean}/\ell_2\text{-vector norm}
$$
$$
\|x\|_\infty = \max_{1 \le i \le n} |x_i| \qquad \ell_\infty\text{-vector norm}
$$
For n × n matrices, we can also have matrix norms, subject to the same requirements:
$$
\|A\| > 0 \ \text{ if } A \ne 0
$$
$$
\|\alpha A\| = |\alpha| \, \|A\|
$$
$$
\|A + B\| \le \|A\| + \|B\| \qquad \text{(triangle inequality)}
$$
for matrices A, B and scalars α.
We usually prefer matrix norms that are related to a vector norm. For a vector norm || · ||, the subordinate matrix norm is defined by
$$
\|A\| \equiv \sup\{ \|Ax\| : x \in R^n \text{ and } \|x\| = 1 \}
$$
Here, A is an n × n matrix. For a subordinate matrix norm, some additional properties are
$$
\|I\| = 1
$$
$$
\|Ax\| \le \|A\| \, \|x\|
$$
$$
\|AB\| \le \|A\| \, \|B\|
$$
There are two meanings associated with the notation || · ||_p, one for vectors and another for matrices. The context will determine which one is intended. Examples of subordinate matrix norms for an n × n matrix A are
$$
\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \qquad \ell_1\text{-matrix norm}
$$
$$
\|A\|_2 = \sqrt{\sigma_{\max}} \qquad \text{spectral}/\ell_2\text{-matrix norm}
$$
$$
\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \qquad \ell_\infty\text{-matrix norm}
$$
Here, σ_i are the eigenvalues of A^T A; their nonnegative square roots are called the singular values of A. The largest of them, σ_max, is the spectral radius of A^T A, so the spectral norm of A is its largest singular value. (See Section 8.3 for a discussion of singular values.)
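The norm formulas above translate directly into a few lines of code. The sketch below uses Python with NumPy (an assumption of this illustration, not a tool used by the text) and evaluates each norm from its definition; the vector x and the variable names are chosen only for demonstration, and the matrix is the 3 × 3 matrix used in the examples of this section.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])  # illustrative vector
norm_1 = np.sum(np.abs(x))          # l1-vector norm
norm_2 = np.sqrt(np.sum(x ** 2))    # Euclidean / l2-vector norm
norm_inf = np.max(np.abs(x))        # l-infinity vector norm

A = np.array([[2, -1, 0],
              [-1, 3, -1],
              [0, -1, 2]], dtype=float)

A_1 = np.max(np.sum(np.abs(A), axis=0))    # largest absolute column sum
A_inf = np.max(np.sum(np.abs(A), axis=1))  # largest absolute row sum
# Spectral norm: square root of the largest eigenvalue of A^T A
A_2 = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))

print(norm_1, norm_2, norm_inf)
print(A_1, A_2, A_inf)
```

The same values are returned by `np.linalg.norm` with `ord` set to 1, 2, or `np.inf`, which offers a convenient cross-check.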
Condition Number and Ill-Conditioning
An important quantity that has some influence in the numerical solution of a linear system Ax = b is the condition number, which is defined as
$$
\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2
$$
It turns out that it is not necessary to compute the inverse of A to obtain an estimate of the condition number. Also, it can be shown that the condition number κ(A) gauges the transfer of error from the matrix A and the vector b to the solution x. The rule of thumb is that if κ(A) = 10^k, then one can expect to lose at least k digits of precision in solving the system Ax = b. If the linear system is sensitive to perturbations in the elements of A, or to perturbations of the components of b, then this fact is reflected in A having a large condition number. In such a case, the matrix A is said to be ill-conditioned. Briefly, the larger the condition number, the more ill-conditioned the system.
Suppose we want to solve an invertible linear system of equations Ax = b for a given coefficient matrix A and right-hand side b but there may have been perturbations of the data owing to uncertainty in the measurements and roundoff errors in the calculations. Suppose that the right-hand side is perturbed by an amount assigned the symbol δb and the corresponding solution is perturbed an amount denoted by the symbol δx. Then we have
$$
A(x + \delta x) = Ax + A\,\delta x = b + \delta b
$$
where
$$
A\,\delta x = \delta b
$$
From the original linear system Ax = b and norms, we have
$$
\|b\| = \|Ax\| \le \|A\| \, \|x\|
$$
which gives us
$$
\frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|}
$$
From the perturbed linear system A δx = δb, we obtain δx = A^{-1} δb and
$$
\|\delta x\| \le \|A^{-1}\| \, \|\delta b\|
$$
Combining the two inequalities above, we obtain
$$
\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}
$$
which contains the condition number of the original matrix A.
As an example of an ill-conditioned matrix consider the Hilbert matrix
$$
H_3 = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{4} \\ \tfrac{1}{3} & \tfrac{1}{4} & \tfrac{1}{5} \end{bmatrix}
$$
We can use the Matlab commands to generate the matrix and then to compute both the condition number using the 2-norm and the determinant of the matrix. We find the condition number to be 524.0568 and the determinant to be 4.6296 × 10^{-4}. In solving linear systems,
the condition number of the coefficient matrix measures the sensitivity of the system to errors in the data. When the condition number is large, the computed solution of the system may be dangerously in error! Further checks should be made before accepting the solution as being accurate. Values of the condition number near 1 indicate a well-conditioned matrix whereas large values indicate an ill-conditioned matrix. Using the determinant to check for singularity is appropriate only for matrices of modest size. Using mathematical software, one can compute the condition number to check for singular or near-singular matrices.
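The Hilbert-matrix computation can be reproduced outside Matlab. The following sketch uses Python with NumPy as a stand-in (an assumption of this illustration); it also checks the perturbation bound ||δx||/||x|| ≤ κ(A) ||δb||/||b|| for one arbitrarily chosen perturbation δb, whose values are hypothetical.

```python
import numpy as np

n = 3
# Hilbert matrix: entry (i, j) is 1/(i + j - 1) with one-based indices
H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

kappa = np.linalg.cond(H, 2)   # condition number in the 2-norm
det_H = np.linalg.det(H)
print(kappa, det_H)            # approximately 524.0568 and 4.6296e-04

# Perturb the right-hand side and compare both sides of the bound
x = np.ones(n)
b = H @ x
db = 1e-6 * np.array([1.0, -1.0, 1.0])   # illustrative perturbation
dx = np.linalg.solve(H, b + db) - x

lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = kappa * np.linalg.norm(db) / np.linalg.norm(b)
print(lhs <= rhs)
```

The bound is a worst-case estimate, so the observed relative error is usually well below the right-hand side.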
A goal in the study of numerical methods is to acquire an awareness of whether a numerical result can be trusted or whether it may be suspect (and therefore in need of further analysis). The condition number provides some evidence regarding this question. With the advent of sophisticated mathematical software systems such as Matlab and others, an estimate of the condition number is often available, along with an approximate solution so that one can judge the trustworthiness of the results. In fact, some solution procedures involve advanced features that depend on an estimated condition number and may switch solution techniques based on it. For example, this criterion may result in a switch of the solution technique from a variant of Gaussian elimination to a least-squares solution for an ill-conditioned system. Unsuspecting users may not realize that this has happened unless they look at all of the results, including the estimate of the condition number. (Condition numbers can also be associated with other numerical problems, such as locating roots of equations.)
Basic Iterative Methods
The iterative-method strategy produces a sequence of approximate solution vectors x(0), x(1), x(2), . . . for system Ax = b. The numerical procedure is designed so that, in principle, the sequence of vectors converges to the actual solution. The process can be stopped when sufficient precision has been attained. This stands in contrast to the Gaussian elimination algorithm, which has no provision for stopping midway and offering up an approximate solution. A general iterative algorithm for solving System (1) goes as follows: Select a nonsingular matrix Q, and having chosen an arbitrary starting vector x(0), generate vectors x(1), x(2), . . . recursively from the equation
$$
Q x^{(k)} = (Q - A) x^{(k-1)} + b \qquad (k = 1, 2, \ldots) \tag{2}
$$
To see that this is sensible, suppose that the sequence x^{(k)} does converge, to a vector x*, say. Then by taking the limit as k → ∞ in System (2), we get
$$
Q x^* = (Q - A) x^* + b
$$
This leads to Ax* = b. Thus, if the sequence converges, its limit is a solution to the original System (1). For example, the Richardson iteration uses Q = I.
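A minimal sketch of iteration (2) in Python with NumPy follows. The function name `basic_iteration`, the infinity-norm stopping test, and the default tolerances are assumptions made for this illustration. The call to `np.linalg.solve` stands in for whatever inexpensive solver suits the structure of Q; the inverse of Q is never formed. The data are the 3 × 3 system used in the examples of this section, with Q taken as the diagonal of A.

```python
import numpy as np

def basic_iteration(A, b, Q, x0, eps=0.5e-4, kmax=100):
    """Iterate Q x_new = (Q - A) x_old + b until successive iterates agree."""
    x = np.array(x0, dtype=float)
    for k in range(1, kmax + 1):
        y = x
        # Solve the system with coefficient matrix Q
        x = np.linalg.solve(Q, (Q - A) @ y + b)
        if np.linalg.norm(x - y, np.inf) < eps:
            return x, k
    return x, kmax

A = np.array([[2, -1, 0],
              [-1, 3, -1],
              [0, -1, 2]], dtype=float)
b = np.array([1.0, 8.0, -5.0])

Q = np.diag(np.diag(A))   # one simple choice: the diagonal of A
x, k = basic_iteration(A, b, Q, np.zeros(3))
print(k, x)
```

Whether the sequence converges depends on the choice of Q; a different Q can be passed in without changing the loop.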
An outline of the pseudocode for carrying out the general iterative procedure (2) follows:
    integer k, kmax
    real array (x(0))1:n, (b)1:n, (c)1:n, (x)1:n, (y)1:n, (A)1:n×1:n, (Q)1:n×1:n
    x ← x(0)
    for k = 1 to kmax do
        y ← x
        c ← (Q − A)x + b
        solve Qx = c
        output k, x
        if ∥x − y∥ < ε then
            output “convergence”
            stop
        end if
    end for
    output “maximum iteration reached”

In choosing the nonsingular matrix Q, we are influenced by the following considerations:
• System (2) should be easy to solve for x^{(k)}, when the right-hand side is known.
• Matrix Q should be chosen to ensure that the sequence x^{(k)} converges, no matter what initial vector is used. Ideally, this convergence will be rapid.
One should not believe that it is necessary to compute the inverse of Q to carry out an iterative procedure. For small systems, we can easily compute the inverse of Q, but in general, this is definitely not to be done! We want to solve a linear system in which Q is the coefficient matrix. As was mentioned previously, we want to select Q so that a linear system with Q as the coefficient matrix is easy to solve. Examples of such matrices are diagonal, tridiagonal, banded, lower triangular, and upper triangular.
Now, let us view System (1) in its detailed form
$$
\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le n) \tag{3}
$$
Solving the ith equation for the ith unknown term, we obtain an equation that describes the Jacobi method:
$$
x_i^{(k)} = \left[ -\sum_{\substack{j=1 \\ j \ne i}}^{n} (a_{ij}/a_{ii}) \, x_j^{(k-1)} + (b_i/a_{ii}) \right] \qquad (1 \le i \le n) \tag{4}
$$
Here, we assume that all diagonal elements are nonzero. (If this is not the case, we can usually rearrange the equations so that it is.)
In the Jacobi method above, the equations are solved in order. As each new component $x_j^{(k)}$ is computed, it can be used immediately in place of the old value $x_j^{(k-1)}$. If this is done, we have the Gauss-Seidel method:
$$
x_i^{(k)} = \left[ -\sum_{\substack{j=1 \\ j < i}}^{n} (a_{ij}/a_{ii}) \, x_j^{(k)} - \sum_{\substack{j=1 \\ j > i}}^{n} (a_{ij}/a_{ii}) \, x_j^{(k-1)} + (b_i/a_{ii}) \right] \tag{5}
$$
If x^{(k−1)} is not saved, then we can dispense with the superscripts in the pseudocode as follows:

    integer i, j, k, kmax, n; real array (aij)1:n×1:n, (bi)1:n, (xi)1:n
    for k = 1 to kmax do
        for i = 1 to n do
            xi ← [bi − Σ (j = 1 to n, j ≠ i) aij xj] / aii
        end for
    end for

An acceleration of the Gauss-Seidel method is possible by the introduction of a relaxation factor ω, resulting in the successive overrelaxation (SOR) method:
$$
x_i^{(k)} = \omega \left\{ -\sum_{\substack{j=1 \\ j < i}}^{n} (a_{ij}/a_{ii}) \, x_j^{(k)} - \sum_{\substack{j=1 \\ j > i}}^{n} (a_{ij}/a_{ii}) \, x_j^{(k-1)} + (b_i/a_{ii}) \right\} + (1 - \omega) x_i^{(k-1)} \tag{6}
$$
The SOR method with ω = 1 reduces to the Gauss-Seidel method.
We now consider numerical examples using iterative methods associated with the names Jacobi, Gauss-Seidel, and successive overrelaxation.

EXAMPLE 1 (Jacobi iteration) Let
$$
A = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 8 \\ -5 \end{bmatrix}
$$
Carry out a number of iterations of the Jacobi iteration, starting with the zero initial vector.
Solution Rewriting the equations, we have the Jacobi method:
$$
\begin{aligned}
x_1^{(k)} &= \tfrac{1}{2} x_2^{(k-1)} + \tfrac{1}{2} \\
x_2^{(k)} &= \tfrac{1}{3} x_1^{(k-1)} + \tfrac{1}{3} x_3^{(k-1)} + \tfrac{8}{3} \\
x_3^{(k)} &= \tfrac{1}{2} x_2^{(k-1)} - \tfrac{5}{2}
\end{aligned}
$$
Taking the initial vector to be x^{(0)} = [0, 0, 0]^T, we find (with the aid of a computer program or a programmable calculator) that
$$
\begin{aligned}
x^{(0)} &= [0, 0, 0]^T \\
x^{(1)} &= [0.5000, 2.6667, -2.5000]^T \\
x^{(2)} &= [1.8333, 2.0000, -1.1667]^T \\
&\ \ \vdots \\
x^{(21)} &= [2.0000, 3.0000, -1.0000]^T
\end{aligned}
$$
The actual solution (to four decimal places rounded) is obtained. ■
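In the Jacobi iteration, Q is taken to be the diagonal of A:
$$
Q = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}
$$
Now
$$
Q^{-1} = \begin{bmatrix} \tfrac{1}{2} & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{2} \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} 1 & -\tfrac{1}{2} & 0 \\ -\tfrac{1}{3} & 1 & -\tfrac{1}{3} \\ 0 & -\tfrac{1}{2} & 1 \end{bmatrix}
$$
The Jacobi iterative matrix and constant vector are
$$
B = I - Q^{-1}A = \begin{bmatrix} 0 & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & 0 & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & 0 \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{8}{3} \\ -\tfrac{5}{2} \end{bmatrix}
$$
One can see that Q is close to A, Q^{-1}A is close to I, and I − Q^{-1}A is small. We write the Jacobi method as
$$
x^{(k)} = B x^{(k-1)} + h
$$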
EXAMPLE 2 (Gauss-Seidel iteration) Repeat the preceding example using the Gauss-Seidel iteration.
Solution The idea of the Gauss-Seidel iteration is simply to accelerate the convergence by incorporating each new component as soon as it has been computed. Obviously, it would be more efficient in the Jacobi method to use the updated value $x_1^{(k)}$ in the second equation instead of the old value $x_1^{(k-1)}$. Similarly, $x_2^{(k)}$ could be used in the third equation in place of $x_2^{(k-1)}$. Using the new iterates as soon as they become available, we have the Gauss-Seidel method:
$$
\begin{aligned}
x_1^{(k)} &= \tfrac{1}{2} x_2^{(k-1)} + \tfrac{1}{2} \\
x_2^{(k)} &= \tfrac{1}{3} x_1^{(k)} + \tfrac{1}{3} x_3^{(k-1)} + \tfrac{8}{3} \\
x_3^{(k)} &= \tfrac{1}{2} x_2^{(k)} - \tfrac{5}{2}
\end{aligned}
$$
Starting with the initial vector zero, some of the iterates are
$$
\begin{aligned}
x^{(0)} &= [0, 0, 0]^T \\
x^{(1)} &= [0.5000, 2.8333, -1.0833]^T \\
x^{(2)} &= [1.9167, 2.9444, -1.0278]^T \\
&\ \ \vdots \\
x^{(9)} &= [2.0000, 3.0000, -1.0000]^T
\end{aligned}
$$
In this example, the convergence of the Gauss-Seidel method is approximately twice as fast as that of the Jacobi method. ■
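In the iterative algorithm that goes by the name Gauss-Seidel, Q is chosen as the lower triangular part of A, including the diagonal. Using the data from the previous example, we now find that
$$
Q = \begin{bmatrix} 2 & 0 & 0 \\ -1 & 3 & 0 \\ 0 & -1 & 2 \end{bmatrix}
$$
The usual row operations give us
$$
Q^{-1} = \begin{bmatrix} \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{6} & \tfrac{1}{3} & 0 \\ \tfrac{1}{12} & \tfrac{1}{6} & \tfrac{1}{2} \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} 1 & -\tfrac{1}{2} & 0 \\ 0 & \tfrac{5}{6} & -\tfrac{1}{3} \\ 0 & -\tfrac{1}{12} & \tfrac{5}{6} \end{bmatrix}
$$
Again, we emphasize that in a practical problem we would not compute Q^{-1}. The Gauss-Seidel iterative matrix and constant vector are
$$
L = I - Q^{-1}A = \begin{bmatrix} 0 & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{6} & \tfrac{1}{3} \\ 0 & \tfrac{1}{12} & \tfrac{1}{6} \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{17}{6} \\ -\tfrac{13}{12} \end{bmatrix}
$$
We write the Gauss-Seidel method as
$$
x^{(k)} = L x^{(k-1)} + h
$$

EXAMPLE 3 (SOR iteration) Repeat the preceding example using the SOR iteration with ω = 1.1.
Solution Introducing a relaxation factor ω into the Gauss-Seidel method, we have the SOR method:
$$
\begin{aligned}
x_1^{(k)} &= \omega \left[ \tfrac{1}{2} x_2^{(k-1)} + \tfrac{1}{2} \right] + (1 - \omega) x_1^{(k-1)} \\
x_2^{(k)} &= \omega \left[ \tfrac{1}{3} x_1^{(k)} + \tfrac{1}{3} x_3^{(k-1)} + \tfrac{8}{3} \right] + (1 - \omega) x_2^{(k-1)} \\
x_3^{(k)} &= \omega \left[ \tfrac{1}{2} x_2^{(k)} - \tfrac{5}{2} \right] + (1 - \omega) x_3^{(k-1)}
\end{aligned}
$$
Starting with the initial vector of zeros and with ω = 1.1, some of the iterates are
$$
\begin{aligned}
x^{(0)} &= [0, 0, 0]^T \\
x^{(1)} &= [0.5500, 3.1350, -1.0257]^T \\
x^{(2)} &= [2.2193, 3.0574, -0.9658]^T \\
&\ \ \vdots \\
x^{(7)} &= [2.0000, 3.0000, -1.0000]^T
\end{aligned}
$$
In this example, the convergence of the SOR method is faster than that of the Gauss-Seidel method. ■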
In the iterative algorithm that goes by the name successive overrelaxation (SOR), Q is chosen as the lower triangular part of A including the diagonal, but each diagonal element a_ii is replaced by a_ii/ω, where ω is the so-called relaxation factor. (Initial work on the SOR method was done by Southwell [1946] and Young [1950].) From the previous example, this means that
$$
Q = \begin{bmatrix} \tfrac{20}{11} & 0 & 0 \\ -1 & \tfrac{30}{11} & 0 \\ 0 & -1 & \tfrac{20}{11} \end{bmatrix}
$$
Now
$$
Q^{-1} = \begin{bmatrix} \tfrac{11}{20} & 0 & 0 \\ \tfrac{121}{600} & \tfrac{11}{30} & 0 \\ \tfrac{1331}{12000} & \tfrac{121}{600} & \tfrac{11}{20} \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} \tfrac{11}{10} & -\tfrac{11}{20} & 0 \\ \tfrac{11}{300} & \tfrac{539}{600} & -\tfrac{11}{30} \\ \tfrac{121}{6000} & -\tfrac{671}{12000} & \tfrac{539}{600} \end{bmatrix}
$$
The SOR iterative matrix and constant vector are
$$
L_\omega = I - Q^{-1}A = \begin{bmatrix} -\tfrac{1}{10} & \tfrac{11}{20} & 0 \\ -\tfrac{11}{300} & \tfrac{61}{600} & \tfrac{11}{30} \\ -\tfrac{121}{6000} & \tfrac{671}{12000} & \tfrac{61}{600} \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac{11}{20} \\ \tfrac{627}{200} \\ -\tfrac{4103}{4000} \end{bmatrix}
$$
We write the SOR method as
$$
x^{(k)} = L_\omega x^{(k-1)} + h
$$

Pseudocode
We can write pseudocode for the Jacobi, Gauss-Seidel, and SOR methods assuming that the linear system (1) is stored in matrix-vector form:
    procedure Jacobi(A, b, x)
    real kmax ← 100, δ ← 10^{−10}, ε ← (1/2) × 10^{−4}
    integer i, j, k, kmax, n; real diag, sum
    real array (A)1:n×1:n, (b)1:n, (x)1:n, (y)1:n
    n ← size(A)
    for k = 1 to kmax do
        y ← x
        for i = 1 to n do
            sum ← bi
            diag ← aii
            if |diag| < δ then
                output “diagonal element too small”
                return
            end if
            for j = 1 to n do
                if j ≠ i then
                    sum ← sum − aij yj
                end if
            end for
            xi ← sum/diag
        end for
        output k, x
        if ∥x − y∥ < ε then
            output k, x
            return
        end if
    end for
    output “maximum iterations reached”
    return
    end Jacobi
Here, the vector y contains the old iterate values, and the vector x contains the updated ones. The values of kmax, δ, and ε are set either in a parameter statement or as global variables.
The pseudocode for the procedure Gauss_Seidel(A, b, x) would be the same as that for the Jacobi pseudocode above except that the innermost j-loop would be replaced by the following:

    for j = 1 to i − 1 do
        sum ← sum − aij xj
    end for
    for j = i + 1 to n do
        sum ← sum − aij xj
    end for

The pseudocode for procedure SOR(A, b, x, ω) would be the same as that for the Gauss-Seidel pseudocode with the statement following the j-loop replaced by the following:

    xi ← sum/diag
    xi ← ωxi + (1 − ω)yi
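The three procedures can be sketched in Python with NumPy as follows. This is an illustrative translation rather than the text's own code: the function names, the infinity-norm stopping test, and the returned iteration count are assumptions, and the small-diagonal check of the pseudocode is omitted for brevity. Gauss-Seidel is obtained from the SOR routine with ω = 1.

```python
import numpy as np

def jacobi(A, b, x0, eps=0.5e-4, kmax=100):
    """Jacobi iteration: every component is updated from the old iterate y."""
    x = np.array(x0, dtype=float)
    n = len(b)
    for k in range(1, kmax + 1):
        y = x.copy()
        for i in range(n):
            s = b[i] - A[i, :i] @ y[:i] - A[i, i + 1:] @ y[i + 1:]
            x[i] = s / A[i, i]
        if np.linalg.norm(x - y, np.inf) < eps:
            return x, k
    return x, kmax

def sor(A, b, x0, omega=1.0, eps=0.5e-4, kmax=100):
    """SOR iteration; omega = 1 gives the Gauss-Seidel method."""
    x = np.array(x0, dtype=float)
    n = len(b)
    for k in range(1, kmax + 1):
        y = x.copy()
        for i in range(n):
            # x[:i] already holds new values, x[i+1:] still holds old ones
            s = b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]
            x[i] = omega * s / A[i, i] + (1 - omega) * y[i]
        if np.linalg.norm(x - y, np.inf) < eps:
            return x, k
    return x, kmax

A = np.array([[2, -1, 0],
              [-1, 3, -1],
              [0, -1, 2]], dtype=float)
b = np.array([1.0, 8.0, -5.0])
x0 = np.zeros(3)

x_j, k_j = jacobi(A, b, x0)
x_gs, k_gs = sor(A, b, x0, omega=1.0)
x_sor, k_sor = sor(A, b, x0, omega=1.1)
print(k_j, x_j)
print(k_gs, x_gs)
print(k_sor, x_sor)
```

The iteration counts printed by this sketch depend on the stopping test chosen here, so they need not coincide with the counts quoted in Examples 1 to 3, which refer to agreement with the solution to four decimal places.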
In the solution of partial differential equations, iterative methods are frequently used to solve large sparse linear systems, which often have special structures. The partial derivatives are approximated by stencils composed of relatively few points, such as 5, 7, or 9. This leads to only a few nonzero entries per row in the linear system. In such systems, the coefficient matrix A is usually not stored since the matrix-vector product can be written directly in the code. See Chapter 15 for additional details on this and how it is related to solving elliptic partial differential equations.
Convergence Theorems
For the analysis of the method described by System (2), we write
$$
x^{(k)} = Q^{-1} \left[ (Q - A) x^{(k-1)} + b \right]
$$
or
$$
x^{(k)} = G x^{(k-1)} + h \tag{7}
$$
where the iteration matrix and vector are
$$
G = I - Q^{-1}A, \qquad h = Q^{-1}b
$$
Notice that in the pseudocode, we do not compute Q^{-1}. The matrix Q^{-1} is used to facilitate the analysis. Now let x be the solution of System (1). Since A is nonsingular, x exists and is unique. We have, from Equation (7) and the identity Q^{-1}b = Q^{-1}Ax,
$$
\begin{aligned}
x^{(k)} - x &= (I - Q^{-1}A) x^{(k-1)} - x + Q^{-1}b \\
&= (I - Q^{-1}A) x^{(k-1)} - (I - Q^{-1}A) x \\
&= (I - Q^{-1}A)(x^{(k-1)} - x)
\end{aligned}
$$
One can interpret e^{(k)} ≡ x^{(k)} − x as the current error vector. Thus, we have
$$
e^{(k)} = (I - Q^{-1}A) e^{(k-1)} \tag{8}
$$
We want e^{(k)} to become smaller as k increases. Equation (8) shows that e^{(k)} will be smaller than e^{(k−1)} if I − Q^{-1}A is small, in some sense. In turn, that means that Q^{-1}A should be close to I. Thus, Q should be close to A. (Norms can be used to make small and close precise.)

■ THEOREM 1  SPECTRAL RADIUS THEOREM
In order for the sequence generated by Qx^{(k)} = (Q − A)x^{(k−1)} + b to converge, no matter what starting point x^{(0)} is selected, it is necessary and sufficient that all eigenvalues of I − Q^{-1}A lie in the open unit disc, |z| < 1, in the complex plane.

The conclusion of this theorem can also be written as
$$
\rho(I - Q^{-1}A) < 1
$$
where ρ is the spectral radius function: For any n × n matrix G having eigenvalues λ_i, ρ(G) = max_{1≤i≤n} |λ_i|.

EXAMPLE 4 Determine whether the Jacobi, Gauss-Seidel, and SOR methods (with ω = 1.1) of the previous examples converge for all initial iterates.
Solution For the Jacobi method, we can easily compute the eigenvalues of the relevant matrix B. The steps are
$$
\det(B - \lambda I) = \det \begin{bmatrix} -\lambda & \tfrac{1}{2} & 0 \\ \tfrac{1}{3} & -\lambda & \tfrac{1}{3} \\ 0 & \tfrac{1}{2} & -\lambda \end{bmatrix} = -\lambda^3 + \tfrac{1}{6}\lambda + \tfrac{1}{6}\lambda = 0
$$
The eigenvalues are λ = 0, ±√(1/3) ≈ ±0.5774. Thus, by the preceding theorem, the Jacobi iteration succeeds for any starting vector in this example.
For the Gauss-Seidel method, the eigenvalues of the iteration matrix L are determined from
$$\det(L - \lambda I) = \det\begin{bmatrix} -\lambda & \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{6} - \lambda & \tfrac{1}{3} \\ 0 & \tfrac{1}{12} & \tfrac{1}{6} - \lambda \end{bmatrix} = -\lambda\left(\tfrac{1}{6} - \lambda\right)^2 + \tfrac{1}{36}\lambda = 0$$
The eigenvalues are $\lambda = 0, 0, \tfrac{1}{3} \approx 0.333$. Hence, the Gauss-Seidel iteration will also succeed for any initial vector in this example.
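For the SOR method with ω = 1.1, the eigenvalues of the iteration matrix $L_\omega$ are determined from
$$\det(L_\omega - \lambda I) = \det\begin{bmatrix} -\tfrac{1}{10} - \lambda & \tfrac{11}{20} & 0 \\ -\tfrac{11}{300} & \tfrac{61}{600} - \lambda & \tfrac{11}{30} \\ -\tfrac{121}{6000} & \tfrac{671}{12000} & \tfrac{61}{600} - \lambda \end{bmatrix}$$
$$= \left(-\tfrac{1}{10} - \lambda\right)\left(\tfrac{61}{600} - \lambda\right)^2 - \tfrac{121}{6000}\cdot\tfrac{11}{30}\cdot\tfrac{11}{20} + \tfrac{11}{20}\cdot\tfrac{11}{300}\left(\tfrac{61}{600} - \lambda\right) - \left(-\tfrac{1}{10} - \lambda\right)\tfrac{671}{12000}\cdot\tfrac{11}{30}$$
$$= -\tfrac{1}{1000} + \tfrac{31}{3000}\lambda + \tfrac{31}{300}\lambda^2 - \lambda^3 = 0$$
The eigenvalues are λ ≈ 0.1200, 0.0833, −0.1000. Hence, the SOR iteration will also succeed for any initial vector in this example. ■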
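The three spectral radii in this example can be checked numerically. The following is a minimal sketch, not the book's code. The coefficient matrix `A` is an assumption: this passage does not restate the system of the previous examples, and the matrix below was chosen because its Jacobi iteration matrix matches the B used in the solution. The helper name `spectral_radius` is likewise only illustrative.

```python
import numpy as np

def spectral_radius(G):
    """Largest eigenvalue magnitude of a square matrix G."""
    return max(abs(np.linalg.eigvals(G)))

# Assumed coefficient matrix (its Jacobi matrix has rows [0,1/2,0], [1/3,0,1/3], [0,1/2,0]).
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 3.0, -1.0],
              [0.0, -1.0, 2.0]])

D = np.diag(np.diag(A))      # diagonal part
CL = -np.tril(A, k=-1)       # strictly lower part, with the sign convention A = D - CL - CU
omega = 1.1

# Splitting matrix Q for each method.
splittings = {
    "Jacobi": D,
    "Gauss-Seidel": D - CL,
    "SOR": (D - omega * CL) / omega,
}

radii = {}
for name, Q in splittings.items():
    # G = I - Q^{-1} A, formed with a linear solve instead of an explicit inverse.
    G = np.eye(A.shape[0]) - np.linalg.solve(Q, A)
    radii[name] = spectral_radius(G)
    print(f"{name:13s} rho = {radii[name]:.4f}")
```

Each radius is below 1, which is the condition of the Spectral Radius Theorem, and the smaller the radius, the faster the error in Equation (8) shrinks.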
A condition that is easier to verify than the inequality $\rho(I - Q^{-1}A) < 1$ is the dominance of the diagonal elements over the other elements in the same row. As defined in Section 7.3, we can use the property of diagonal dominance
$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|$$
to determine whether the Jacobi and Gauss-Seidel methods converge via the following theorem.
■ THEOREM 2
JACOBI AND GAUSS-SEIDEL CONVERGENCE THEOREM
If A is diagonally dominant, then the Jacobi and Gauss-Seidel methods converge for any starting vector $x^{(0)}$.
Notice that this is a sufficient but not a necessary condition. Indeed, there are matrices that are not diagonally dominant for which these methods converge.
Another important property follows:
■ DEFINITION 1
SYMMETRIC POSITIVE DEFINITE
Matrix A is symmetric positive definite (SPD) if $A = A^T$ and $x^T A x > 0$ for all nonzero real vectors x.
■ THEOREM 3
For a matrix A to be SPD, it is necessary and sufficient that $A = A^T$ and that all eigenvalues of A are positive.
SOR CONVERGENCE THEOREM
Suppose that the matrix A has positive diagonal elements and that 0 < ω < 2. The SOR method converges for any starting vector $x^{(0)}$ if and only if A is symmetric and positive definite.
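Both hypotheses above are easy to test in code. The sketch below is illustrative only: the function names are hypothetical, the SPD test uses the eigenvalue characterization stated in Theorem 3, and the tolerance `tol` is an assumed guard against roundoff.

```python
import numpy as np

def is_diagonally_dominant(A):
    """True if |a_ii| > sum of |a_ij| over j != i, for every row i."""
    A = np.asarray(A, dtype=float)
    diag = np.abs(np.diag(A))
    off_diag = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off_diag))

def is_spd(A, tol=1e-12):
    """True if A is symmetric and all of its eigenvalues are positive."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
M = np.array([[1.0, 2.0], [2.0, 1.0]])   # symmetric, eigenvalues 3 and -1

print(is_diagonally_dominant(A), is_spd(A))   # both hypotheses hold for A
print(is_diagonally_dominant(M), is_spd(M))   # neither holds for M
```

Failing the diagonal dominance test does not by itself show that Jacobi or Gauss-Seidel diverges, since that condition is only sufficient.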
Matrix Formulation
For the formal theory of iterative methods, we split the matrix A into the sum of a nonzero diagonal matrix D, a strictly lower triangular matrix CL, and a strictly upper triangular matrix CU such that
A = D − CL − CU
Here, $D = \operatorname{diag}(A)$, $C_L = (-a_{ij})_{i>j}$, and $C_U = (-a_{ij})_{i<j}$.
We can derive the following:
for all nonzero vectors x ∈ Rn. In general, expressions such as ⟨u,v⟩ and ⟨u,v⟩A reduce to 1 × 1 matrices and are treated as scalar values. A quadratic form is a scalar quadratic function of a vector of the form
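$$f(x) = \tfrac{1}{2}\langle x, x\rangle_A - \langle b, x\rangle + c$$
Here, A is a matrix, x and b are vectors, and c is a scalar constant. The gradient of a quadratic form is
$$f'(x) = \left[\partial f(x)/\partial x_1,\ \partial f(x)/\partial x_2,\ \ldots,\ \partial f(x)/\partial x_n\right]^T$$
which for this f evaluates to
$$f'(x) = \tfrac{1}{2}A^T x + \tfrac{1}{2}Ax - b$$
If A is symmetric, this reduces to
$$f'(x) = Ax - b$$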
Setting the gradient to zero, we obtain the linear system to be solved, Ax = b. Therefore, the solution of Ax = b is a critical point of f (x). If A is symmetric and positive definite, then f (x) is minimized by the solution of Ax = b. So an alternative way of solving the linear system Ax = b is by finding an x that minimizes f (x).
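A quick numerical check of these statements, as a sketch under stated assumptions: the matrix and vector are the 2 × 2 data of Computer Problem 8.2.12, while the test point `x0`, the step `h`, and the random perturbations are arbitrary choices made only for illustration.

```python
import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # symmetric positive definite
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    """Quadratic form f(x) = (1/2) x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    """Gradient (1/2) A^T x + (1/2) A x - b, which is A x - b for symmetric A."""
    return 0.5 * A.T @ x + 0.5 * A @ x - b

# Compare the formula with central differences at an arbitrary point.
x0 = np.array([0.7, -1.3])
h = 1e-6
fd = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(2)])

# The solution of A x = b is a critical point and, for SPD A, the minimizer.
x_star = np.linalg.solve(A, b)
rng = np.random.default_rng(0)
directions = rng.standard_normal((5, 2))

print("gradient formula :", grad_f(x0))
print("central difference:", fd)
print("x* =", x_star, " gradient at x* =", grad_f(x_star))
```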
We want to solve the linear system
Ax = b
where the n × n matrix A is symmetric and positive definite.
Suppose that $\{p^{(1)}, p^{(2)}, \ldots, p^{(k)}, \ldots, p^{(n)}\}$ is a set containing a sequence of n mutually conjugate direction vectors. Then they form a basis for the space $\mathbb{R}^n$. Hence, we can expand the true solution vector $x^*$ of Ax = b into a linear combination of these basis vectors:
$$x^* = \alpha_1 p^{(1)} + \alpha_2 p^{(2)} + \cdots + \alpha_k p^{(k)} + \cdots + \alpha_n p^{(n)}$$
where the coefficients are given by
$$\alpha_k = \langle p^{(k)}, b\rangle / \langle p^{(k)}, p^{(k)}\rangle_A$$
This can be viewed as a direct method for solving the linear system Ax = b: First find the sequence of n conjugate direction vectors p(k), and then compute the coefficients αk. However, in practice, this approach is impractical because it would take too much computer time and storage.
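The expansion can be tried on a small system. In this sketch the conjugate directions are produced by Gram-Schmidt in the A-inner product applied to the standard basis; that construction and the function name `a_conjugate_directions` are assumptions made for illustration, since the text does not say how the directions are found. The data are those of Computer Problem 8.2.9.

```python
import numpy as np

def a_conjugate_directions(A):
    """Make n mutually A-conjugate vectors from the standard basis of R^n."""
    n = A.shape[0]
    directions = []
    for k in range(n):
        p = np.eye(n)[k].copy()
        for q in directions:
            # Remove the component of p along q, measured in the A-inner product.
            p -= (q @ A @ p) / (q @ A @ q) * q
        directions.append(p)
    return directions

A = np.array([[2.0, -0.3, -0.2],
              [-0.3, 2.0, -0.1],
              [-0.2, -0.1, 2.0]])
b = np.array([7.0, 5.0, 3.0])

P = a_conjugate_directions(A)

# alpha_k = <p_k, b> / <p_k, p_k>_A, and x* is the sum of alpha_k p_k.
x = np.zeros(3)
for p in P:
    x += (p @ b) / (p @ A @ p) * p

print("x =", x)
print("residual norm =", np.linalg.norm(b - A @ x))
```

Building all n directions this way costs on the order of n³ operations and stores n vectors, which is the expense the paragraph above warns about.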
On the other hand, if we view the conjugate gradient method as an iterative method, then we could solve large sparse linear systems in a reasonable amount of time and storage. The key is carefully choosing a small set of the conjugate direction vectors p(k) so that we do not need them all to obtain a good approximation to the true solution vector.
Start with an initial guess x(0) to the true solution x∗. We can assume without loss of generality that x(0) is the zero vector. The true solution x∗ is also the unique minimizer of
$$f(x) = \tfrac{1}{2}\langle x, x\rangle_A - \langle b, x\rangle = \tfrac{1}{2}x^T A x - x^T b$$
for x ∈ Rn. This suggests taking the first basis vector p(1) to be the gradient of f at x = x(0), which equals −b. The other vectors in the basis are now conjugate to the gradient—hence
the name conjugate gradient method. The kth residual vector is r(k) = b − Ax(k)
The gradient descent method moves in the direction r(k). Take the direction closest to the gradient vector r(k) by insisting that the direction vectors p(k) be conjugate to each other. Putting all this together, we obtain the expression
$$p^{(k+1)} = r^{(k)} - \frac{\langle p^{(k)}, r^{(k)}\rangle_A}{\langle p^{(k)}, p^{(k)}\rangle_A}\, p^{(k)}$$
After some simplifications, the algorithm is obtained for solving the linear system Ax = b, where the coefficient matrix A is real, symmetric, and positive definite. The input vector x(0) is an initial approximation to the solution or the zero vector.
In theory, the conjugate gradient iterative method solves a system of n linear equations in at most n steps, if the matrix A is symmetric and positive definite. Moreover, the nth iterative vector $x^{(n)}$ is the unique minimizer of the quadratic function $q(x) = \tfrac{1}{2}x^T A x - x^T b$.
When the conjugate gradient method was introduced by Hestenes and Stiefel [1952], the initial interest in it waned once it was discovered that this finite-termination property was not obtained in practice. But two decades later, there was renewed interest in this method when it was viewed as an iterative process by Reid [1971] and others. In practice, the solution of a system of linear equations can often be found with satisfactory precision in a number of steps considerably less than the order of the system.
Here is a pseudocode for the conjugate gradient algorithm:
k ← 0; x ← 0; r ← b − Ax; δ ← ⟨r, r⟩
while √δ>ε√⟨b,b⟩andk
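For readers who want something runnable, here is a minimal Python sketch of the standard conjugate gradient iteration. It is not a transcription of the pseudocode above. The iteration cap `kmax`, its default of 10n, and the default tolerance `eps` are assumptions; only the initialization (x = 0, r = b − Ax, δ = ⟨r, r⟩) and the relative residual test √δ > ε√⟨b, b⟩ are taken from the text.

```python
import numpy as np

def conjugate_gradient(A, b, eps=1e-10, kmax=None):
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    n = len(b)
    if kmax is None:
        kmax = 10 * n              # assumed safety cap on the iteration count
    x = np.zeros(n)
    r = b - A @ x                  # residual
    p = r.copy()                   # first search direction
    delta = r @ r
    bb = b @ b
    k = 0
    while np.sqrt(delta) > eps * np.sqrt(bb) and k < kmax:
        k += 1
        w = A @ p
        alpha = delta / (p @ w)    # step length along p
        x = x + alpha * p
        r = r - alpha * w
        delta_new = r @ r
        p = r + (delta_new / delta) * p   # next direction, A-conjugate to the previous one
        delta = delta_new
    return x, k

A = np.array([[2.0, -0.3, -0.2],
              [-0.3, 2.0, -0.1],
              [-0.2, -0.1, 2.0]])
b = np.array([7.0, 5.0, 3.0])
x, steps = conjugate_gradient(A, b)
print("x =", x, " steps =", steps)
```

The only use of A is the product `A @ p`, so for a large sparse system that line can be replaced by a routine that applies the matrix without storing it.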
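$$x_i^{(k)} = \omega\left\{ \sum_{j=1}^{i-1} (-a_{ij}/a_{ii})\, x_j^{(k)} + \sum_{j=i+1}^{n} (-a_{ij}/a_{ii})\, x_j^{(k-1)} + (b_i/a_{ii}) \right\} + (1 - \omega)\, x_i^{(k-1)}$$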
The SOR method reduces to the Gauss-Seidel method when ω = 1.
(3) For a matrix formulation, we split the matrix A:
$$A = D - C_L - C_U$$
where D is a nonzero diagonal matrix, $C_L$ is a strictly lower triangular matrix, and $C_U$ is a strictly upper triangular matrix. Here, $D = \operatorname{diag}(A)$, $C_L = (-a_{ij})_{i>j}$, and $C_U = (-a_{ij})_{i<j}$. For the Jacobi method, we have
$$Q = D, \qquad B = D^{-1}(C_L + C_U), \qquad h = D^{-1}b$$
For the Gauss-Seidel method, we have
$$Q = D - C_L, \qquad L = (D - C_L)^{-1}C_U, \qquad h = (D - C_L)^{-1}b$$
For the SOR method, we have
$$Q = \tfrac{1}{\omega}(D - \omega C_L), \qquad L_\omega = (D - \omega C_L)^{-1}[\omega C_U + (1 - \omega)D], \qquad h = \omega(D - \omega C_L)^{-1}b$$
(4) An iterative method converges for a specific matrix A if and only if ρ(I − Q−1 A) < 1
If A is diagonally dominant, then the Jacobi and Gauss-Seidel methods converge for any x(0). The SOR method converges, for 0 < ω < 2 and any x(0), if and only if A is symmetric and positive definite with positive diagonal elements.
Problems 8.2
1. Give an alternative solution to Example 4.
2. Write the matrix formula for the Gauss-Seidel overrelaxation method.
a3. (Multiple choice) In solving a system of equations Ax = b, it is often convenient to use an iterative method, which generates a sequence of x(k) vectors that should converge to a solution. The process is stopped when sufficient accuracy has been attained. A general procedure is to obtain x(k) by solving Qx(k) = (Q − A)x(k−1) + b. Here, Q is a certain matrix that is usually connected somehow to A. The process is repeated, starting with any available guess, x(0). What hypothesis guarantees that the method works, no matter what starting point is selected?
a. ||Q|| < 1 b. ||QA|| < 1 c. ||I − QA|| < 1 d. ||I − Q−1A|| < 1 e. None of these.
Hint: The spectral radius is less than or equal to the norm.
4. (Multiple choice) From a vector norm, we can create a subordinate matrix norm. Which relation is satisfied by every subordinate matrix norm?
a. ||Ax|| ≥ ||A|| ||x|| b. ||I|| = 1 c. ||AB|| ≥ ||A|| ||B|| d. ||A + B|| ≥ ||A|| + ||B|| e. None of these.
a5. (Multiple choice) The condition for diagonal dominance of a matrix A is:
a. |aii| < nj=1 |aij| b. |aii| nj=1 |aij| c. |aii| < nj=1 |aij| j ̸=i j ̸=i
d. |aii|>nj=1|aij|
e. None of these.
6. (Multiple choice) A necessary and sufficient condition for the standard iteration formula x(k) = Gx(k−1) + h to produce a sequence x(k) that converges to a solution of the equation (I − G)x = h is that:
a. The spectral radius of G is greater than 1.
b. The matrix G is diagonally dominant.
c. The spectral radius of G is less than 1.
d. G is nonsingular.
e. None of these.
7. (Multiple choice) A sufficient condition for the Jacobi method to converge for the linear system Ax = b.
a. A − I is diagonally dominant.
b. A is diagonally dominant.
c. G is nonsingular.
d. The spectral radius of G is less than 1.
e. None of these.
8. (Multiple choice) A sufficient condition for the Gauss-Seidel method to work on the linear system Ax = b.
a. A is diagonally dominant.
b. A − I is diagonally dominant.
c. The spectral radius of A is less than 1.
d. G is nonsingular.
e. None of these.
a9. (Multiple choice) Necessary and sufficient conditions for the SOR method, where 0 < ω < 2, to work on the linear system Ax = b.
a. A is diagonally dominant.
b. ρ(A) < 1.
c. A is symmetric positive definite.
d. x(0) = 0.
e. None of these.
10. The Frobenius norm is given by $\|A\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2\right)^{1/2}$, which is frequently used because it is so easy to compute. Find the value of this norm for these matrices:
a. $\begin{bmatrix} 1 & 2 & 3 \\ 0 & 5 & 4 \\ 2 & 1 & 3 \end{bmatrix}$
b. $\begin{bmatrix} 0 & 0 & 1 & 2 \\ 3 & 0 & 5 & 4 \\ 1 & 1 & 1 & 2 \\ 1 & 3 & 2 & 2 \end{bmatrix}$
c. $\begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 2 & 3 & 4 & 5 & 6 \\ 0 & 1 & 0 & 1 & 0 \\ 3 & 4 & 3 & 4 & 3 \\ 5 & 5 & 5 & 5 & 5 \end{bmatrix}$
11. Determine the condition numbers κ(A) of these matrices:
a. $\begin{bmatrix} -2 & 1 & 0 \\ 1 & -2 & 1 \\ 0 & 1 & -2 \end{bmatrix}$
b. $\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}$
c. $\begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
d. $\begin{bmatrix} -2 & -1 & 2 & -1 \\ 1 & 2 & 1 & -2 \\ 2 & -1 & 2 & 1 \\ 0 & 2 & 0 & 1 \end{bmatrix}$
Computer Problems 8.2
1. Redo several or all of Examples 1–5 using the linear system involving one of the following coefficient matrix and right-hand side vector pairs:
a. $A = \begin{bmatrix} 5 & -1 \\ -1 & 3 \end{bmatrix}$, $b = \begin{bmatrix} 7 \\ 4 \end{bmatrix}$
b. $A = \begin{bmatrix} 5 & -1 & 0 \\ -1 & 3 & -1 \\ 0 & -1 & 2 \end{bmatrix}$, $b = \begin{bmatrix} 7 \\ 4 \\ 5 \end{bmatrix}$
c. $A = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 6 & -2 \\ 4 & -3 & 8 \end{bmatrix}$, $b = \begin{bmatrix} 1 \\ 3 \\ 9 \end{bmatrix}$
d. $A = \begin{bmatrix} 7 & 3 & -1 \\ 3 & 8 & 1 \\ -1 & 1 & 4 \end{bmatrix}$, $b = \begin{bmatrix} 3 \\ -4 \\ 2 \end{bmatrix}$
2. Using the Jacobi, Gauss-Seidel, and SOR (ω = 1.1) iterative methods, write and execute a computer program to solve the following linear system to four decimal places (rounded) of accuracy:
$$\begin{bmatrix} 7 & 1 & -1 & 2 \\ 1 & 8 & 0 & -2 \\ -1 & 0 & 4 & -1 \\ 2 & -2 & -1 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 3 \\ -5 \\ 4 \\ -3 \end{bmatrix}$$
Compare the number of iterations needed in each case. Hint: The exact solution is x = (1, −1, 1, −1)T.
3. Using the Jacobi, Gauss-Seidel, and the SOR (ω = 1.4) iterative methods, write and run code to solve the following linear system to four decimal places of accuracy:
$$\begin{bmatrix} 7 & 3 & -1 & 2 \\ 3 & 8 & 1 & -4 \\ -1 & 1 & 4 & -1 \\ 2 & -4 & -1 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} -1 \\ 0 \\ -3 \\ 1 \end{bmatrix}$$
Compare the number of iterations in each case. Hint: Here, the exact solution is x = (−1, 1, −1, 1)T.
4. (Continuation) Solve the system using the SOR iterative method with values of ω = 1(0.1)2. Plot the number of iterations for convergence versus the values of ω. Which value of ω results in the fastest convergence?
5. Program and run the Jacobi, Gauss-Seidel, and SOR methods for the system of Example 1
a. using equations involving the splitting matrix Q.
b. using the equation formulations in Example 4.
c. using the pseudocode involving matrix-vector multiplication.
6. (Continuation) Select one or more of the systems in Computer Problem 1, and rerun these programs.
7. Consider the linear system
$$\begin{bmatrix} 9 & -3 \\ -2 & 8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 6 \\ -4 \end{bmatrix}$$
Using Maple or Matlab, compare solving it by using the Jacobi method and the Gauss-Seidel method starting with x(0) = (0, 0)T.
8. (Continuation)
a. Change the (1, 1) entry from 9 to 1 so that the coefficient matrix is no longer diagonally dominant and see whether the Gauss-Seidel method still works. Explain why or why not.
b. Then change the (2, 2) entry from 8 to 1 as well and test. Again explain the results.
9. Use the conjugate gradient method to solve this linear system:
$$\begin{bmatrix} 2.0 & -0.3 & -0.2 \\ -0.3 & 2.0 & -0.1 \\ -0.2 & -0.1 & 2.0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 7 \\ 5 \\ 3 \end{bmatrix}$$
10. (Euler-Bernoulli beam) A simple model for a bending beam under stress involves the Euler-Bernoulli differential equation. A finite difference discretization converts it into a system of linear equations. As the size of the discretization decreases, the linear system becomes larger and more ill-conditioned.
a. For a beam pinned at both ends, we obtain the following banded system of linear equations with a bandwidth of five:
$$\begin{bmatrix} 12 & -6 & \tfrac{4}{3} & & & & \\ -4 & 6 & -4 & 1 & & & \\ 1 & -4 & 6 & -4 & 1 & & \\ & \ddots & \ddots & \ddots & \ddots & \ddots & \\ & & 1 & -4 & 6 & -4 & 1 \\ & & & 1 & -4 & 6 & -4 \\ & & & & \tfrac{4}{3} & -6 & 12 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-2} \\ y_{n-1} \\ y_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-2} \\ b_{n-1} \\ b_n \end{bmatrix}$$
The right-hand side represents forces on the beam. Set the right-hand side so that there is a known solution, such as a sag in the middle of the beam. Using an iterative method, repeatedly solve the system by allowing n to increase. Does the error in the solution increase when n increases? Use mathematical software that computes the condition number of the coefficient matrix to explain what is happening.
b. The linear system of equations for a cantilever beam with a free boundary condition at only one end is
$$\begin{bmatrix} 12 & -6 & \tfrac{4}{3} & & & & \\ -4 & 6 & -4 & 1 & & & \\ 1 & -4 & 6 & -4 & 1 & & \\ & \ddots & \ddots & \ddots & \ddots & \ddots & \\ & & 1 & -4 & 6 & -4 & 1 \\ & & & 1 & -\tfrac{93}{25} & \tfrac{111}{25} & -\tfrac{43}{25} \\ & & & & \tfrac{12}{25} & \tfrac{24}{25} & \tfrac{12}{25} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_{n-2} \\ y_{n-1} \\ y_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-2} \\ b_{n-1} \\ b_n \end{bmatrix}$$
Repeat the numerical experiment for this system. See Sauer [2006] for additional details.
11. Consider this sparse linear system:
$$\begin{bmatrix} 3 & -1 & & & & & \tfrac{1}{2} \\ -1 & 3 & -1 & & & \tfrac{1}{2} & \\ & \ddots & \ddots & \ddots & \iddots & & \\ & & -1 & 3 & -1 & & \\ & & \iddots & \ddots & \ddots & \ddots & \\ & \tfrac{1}{2} & & & -1 & 3 & -1 \\ \tfrac{1}{2} & & & & & -1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-2} \\ x_{n-1} \\ x_n \end{bmatrix} = \begin{bmatrix} 2.5 \\ 1.5 \\ \vdots \\ 1.0 \\ \vdots \\ 1.5 \\ 2.5 \end{bmatrix}$$
The true solution is x = [1, 1, 1, . . . , 1, 1, 1]T. Use an iterative method to solve this system for increasing values of n.
12. Consider the sample two-dimensional linear system Ax = b, where $A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}$, $b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}$, and c = 0. Plot graphs to show the following:
a. The solution lies at the intersection of two lines.
b. Graph of the quadratic form $F(x) = c - b^T x + \tfrac{1}{2}x^T A x$ showing that the minimum point of this surface is the solution of Ax = b.
c. Contours of the quadratic form so each ellipsoidal curve has a constant value.
d. Gradient F′(x) of the quadratic form. Show that for every x, the gradient points in the direction of the steepest increase of F(x) and is orthogonal to the contour lines. (See Section 16.2.)
8.3 Eigenvalues and Eigenvectors
Let A be an n × n matrix. We ask the following natural question about A: Are there nonzero vectors v for which Av is a scalar multiple of v? Although we pose this question in the spirit of pure curiosity, there are many situations in scientific computation in which this question arises.
The answer to our question is a qualified Yes! We must be willing to consider complex scalars, as well as vectors with complex components. With that broadening of our viewpoint, such vectors always exist. Here are two examples. In the first, we need not bring in complex numbers to illustrate the situation, while in the second, the vectors and scalar factors must be complex.
EXAMPLE 1
Let $A = \begin{bmatrix} 3 & 2 \\ 7 & -2 \end{bmatrix}$. Find a nonzero vector v for which Av is a multiple of v.
Solution
One easily verifies that
$$A\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5\begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
$$A\begin{bmatrix} 2 \\ -7 \end{bmatrix} = \begin{bmatrix} -8 \\ 28 \end{bmatrix} = -4\begin{bmatrix} 2 \\ -7 \end{bmatrix}$$
We have two different answers (but we have not revealed how to find them). ■
EXAMPLE 2
Repeat the preceding example with the matrix $A = \begin{bmatrix} 1 & 1 \\ -2 & 3 \end{bmatrix}$.
Solution
As in Example 1, it can be verified that
$$A\begin{bmatrix} 1 \\ 1+i \end{bmatrix} = (2+i)\begin{bmatrix} 1 \\ 1+i \end{bmatrix}$$
$$A\begin{bmatrix} 1 \\ 1-i \end{bmatrix} = (2-i)\begin{bmatrix} 1 \\ 1-i \end{bmatrix}$$
In these equations, $i = \sqrt{-1}$. Surprisingly, we find answers involving complex numbers even though the matrix does not contain any complex entries! ■
When the equation Ax = λx is valid and x is not zero, we say that λ is an eigenvalue of A and x is an accompanying eigenvector. Thus, in Example 1, the matrix has 5 as an eigenvalue with accompanying eigenvector [1, 1]T , and −4 is another eigenvalue with accompanying eigenvector [2, −7]T . Example 2 emphasizes that a real matrix may have complex eigenvalues and complex eigenvectors. Notice that an equation A0 = λ0 and an equation A0 = 0x say nothing useful about eigenvalues and eigenvectors of A.
Many problems in science lead to eigenvalue problems in which the principal question usually is: What are the eigenvalues of a given matrix, and what are the accompanying eigenvectors? An outstanding application of this theory is to systems of linear differential equations, about which more will be said later.
Notice that if Ax = λx and x ≠ 0, then every nonzero multiple of x is an eigenvector (with the same eigenvalue). If λ is an eigenvalue of an n × n matrix A, then the set {x: Ax = λx} is a subspace of Rn called an eigenspace. It is necessarily of dimension at least 1.
Calculating Eigenvalues and Eigenvectors
Given a square matrix A, how does one discover its eigenvalues? Begin by observing that the equation Ax = λx is equivalent to (A − λI)x = 0. Since we are interested in nonzero solutions to this equation, the matrix A − λ I must be singular (noninvertible), and therefore, Det( A − λ I ) = 0. This is how (in principle) we can find all the eigenvalues of A. Specifically, form the function p by the definition p(λ) = Det( A − λ I ), and find the zeros of p. It turns out that p is a polynomial of degree n and must have n zeros, provided that we allow complex zeros and count each zero a number of times equal to its multiplicity. Even if the matrix A is real, we must be prepared for complex eigenvalues. The polynomial just described is called the characteristic polynomial of the matrix A. If this polynomial has a repeated factor, such as (λ − 3)k , then we say that 3 is a root of multiplicity k . Such roots are still eigenvalues, but they can be troublesome when k > 1.
To illustrate the calculation of eigenvalues, let us use the matrix in Example 1, namely,
$$A = \begin{bmatrix} 3 & 2 \\ 7 & -2 \end{bmatrix}$$
The characteristic polynomial is
$$p(\lambda) = \operatorname{Det}(A - \lambda I) = \operatorname{Det}\begin{bmatrix} 3-\lambda & 2 \\ 7 & -2-\lambda \end{bmatrix} = (3-\lambda)(-2-\lambda) - 14 = \lambda^2 - \lambda - 20 = (\lambda - 5)(\lambda + 4)$$
The eigenvalues are 5 and −4.
We can carry out this calculation with one or two commands in Matlab, Maple, or Mathematica. We can determine the characteristic polynomial and subsequently compute its zeros. This gives us the two roots of the characteristic polynomial, which are the eigenvalues 5 and −4. These mathematical software systems also have single commands to produce a list of eigenvalues, computed in the best possible way, which is usually not to determine the characteristic polynomial and subsequently compute its zeros!
In general, an n × n matrix will have a characteristic polynomial of degree n, and its roots are the eigenvalues of A. Since the calculation of zeros of a polynomial is numerically challenging if not unstable, this straightforward procedure is not recommended. (See Computer Problem 8.3.2 for an experiment pertaining to this situation.) For small values of n, it may be quite satisfactory, however. It is called the direct method for computing eigenvalues.
Once an eigenvalue λ has been determined for a matrix A, an eigenvector can be computed by solving the system (A − λI)x = 0. Thus, in Example 1, we must solve (A − 5I)x = 0, or
$$\begin{bmatrix} -2 & 2 \\ 7 & -7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
Of course, this matrix is singular, and the homogeneous equation has nontrivial solutions, such as [1, 1]T . The other eigenvalue is treated in the same way, leading to an eigenvector [2, −7]T . Any scalar multiple of an eigenvector is also an eigenvector.
This work can be done by using mathematical software to find an eigenvector for each eigenvalue λ via the null space of the matrix A − λI. Also, we can use a single command to compute all the eigenvalues directly or request the calculation of all the eigenvalues and eigenvectors at once. The Matlab command [V,D] = eig(A) produces two arrays, V and D. The array V has eigenvectors of A as its columns, and the array D contains all the eigenvalues of A on its diagonal. The program returns a vector of unit length such as [0.7071, 0.7071]T . That vector by itself provides a basis for the null space of A − 5 I .
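The same steps can be carried out in Python with NumPy, shown here as an illustrative sketch rather than a translation of any particular Matlab session. `np.poly`, `np.roots`, `np.linalg.eig`, and `np.linalg.svd` are existing NumPy routines; extracting the null space of A − 5I from the last right singular vector is one common choice among several.

```python
import numpy as np

A = np.array([[3.0, 2.0], [7.0, -2.0]])

# Direct method: coefficients of det(lambda*I - A), then the zeros of that polynomial.
coeffs = np.poly(A)            # lambda^2 - lambda - 20
roots = np.roots(coeffs)

# Single command: eigenvalues, and unit eigenvectors as the columns of V.
eigvals, V = np.linalg.eig(A)

# Eigenvector for lambda = 5 from the null space of A - 5I:
# the right singular vector that belongs to the zero singular value.
_, s, Vt = np.linalg.svd(A - 5.0 * np.eye(2))
v = Vt[-1]

print("characteristic polynomial coefficients:", coeffs)
print("roots:", roots)
print("eig  :", eigvals)
print("null-space vector for lambda = 5:", v)
```

The computed eigenvector may come back with either sign, since any nonzero multiple of an eigenvector is again an eigenvector.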
Notice that the eigenvalue-eigenvector problem is nonlinear. The equation Ax = λx has two unknowns, λ and x. They appear in the equation multiplied together. If either x or λ were known, finding the other would be a linear problem and very easy.
Mathematical Software
A typical, mundane use of mathematical software such as Matlab might be to compute the eigenvalues and eigenvectors of a matrix with a command such as [V,D] = eig(A) for the matrix
$$A = \begin{bmatrix} 1 & 3 & -7 \\ -3 & 4 & 1 \\ 2 & -5 & 3 \end{bmatrix}$$
Matlab responds instantly with the eigenvectors in the array V and the eigenvalues in the diagonal array D. The real eigenvalue is 0.0214 and the complex pair of eigenvalues are 3.9893 ± 5.5601i. Behind the scenes, much complicated computing may be taking place. The general procedure has these components: First, by means of similarity transformations, A is put into upper Hessenberg form. This means that all elements below the first subdiagonal are zero. Thus, the new A = (aij) satisfies aij = 0 when i > j + 1. Similarity transformations ensure that the eigenvalues are not disturbed. If A is real, further similarity transformations put A into a near-diagonal form in which each diagonal element is either a single real number or a 2 × 2 real matrix whose eigenvalues are a pair of conjugate complex numbers. Creating the additional zeros just below the diagonal requires some iterative process, because after all, we are in effect computing the zeros of a polynomial. The iterative process is reminiscent of the power method that will be described in Section 8.4.
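As a cross-check of the quoted eigenvalues, the computation can be repeated with NumPy. This is only a sketch: `np.linalg.eig` plays the role of the Matlab command here, and the internal reduction steps described above are not exposed by this call.

```python
import numpy as np

A = np.array([[1.0, 3.0, -7.0],
              [-3.0, 4.0, 1.0],
              [2.0, -5.0, 3.0]])

eigvals, V = np.linalg.eig(A)   # eigenvalues, and eigenvectors as columns of V

# Separate the real eigenvalue from the complex conjugate pair.
real_eig = [lam.real for lam in eigvals if abs(lam.imag) < 1e-12]
complex_pair = sorted((lam for lam in eigvals if abs(lam.imag) >= 1e-12),
                      key=lambda z: z.imag)

# Size of the residual A V - V D, which should be near machine precision.
residual = np.linalg.norm(A @ V - V @ np.diag(eigvals))

print("real eigenvalue:", real_eig)
print("complex pair   :", complex_pair)
print("residual       :", residual)
```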
Maple can be used to compute the eigenvalues and eigenvectors. The quantities are computed in exact arithmetic and then converted to floating-point. In some versions of Maple and Matlab, one can use some of the commands from one of these packages in the other. In Mathematica, we can use commands to obtain similar results.
The best advice for anyone who is confronted with challenging eigenvalue problems is to use the software in the package LAPACK. Special eigenvalue algorithms for various types of matrices are available there. For example, if the matrix in question is real and symmetric, one should use an algorithm tailored for that case. There are about a dozen categories available to choose from in LAPACK. Matlab itself employs some of the programs in LAPACK.
Properties of Eigenvalues
A theorem that summarizes the special properties of a matrix that impinge on the computing of its eigenvalues follows.
■ THEOREM 1
MATRIX EIGENVALUE PROPERTIES
The following statements are true for any square matrix A:
1. If λ is an eigenvalue of A, then p(λ) is an eigenvalue of p(A), for any polynomial p. In particular, $\lambda^k$ is an eigenvalue of $A^k$.
2. If A is nonsingular and λ is an eigenvalue of A, then p(1/λ) is an eigenvalue of $p(A^{-1})$, for any polynomial p. In particular, $\lambda^{-1}$ is an eigenvalue of $A^{-1}$.
3. If A is real and symmetric, then its eigenvalues are real.
4. If A is complex and Hermitian, then its eigenvalues are real.
5. If A is Hermitian and positive definite, then its eigenvalues are positive.
6. If P is nonsingular, then A and $PAP^{-1}$ have the same characteristic polynomial (and the same eigenvalues).
Recall that a matrix A is symmetric if $A = A^T$, where $A^T = (a_{ji})$ is the transpose of $A = (a_{ij})$. On the other hand, a complex matrix A is Hermitian if $A = A^*$, where $A^* = \overline{A}^T = (\overline{a}_{ji})$. Here $A^*$ is the conjugate transpose of the matrix A. Using the syntax of programming, we can write $A^T(i, j) = A(j, i)$ and $A^*(i, j) = \overline{A(j, i)}$. Recall also that A is positive definite if $x^T A x > 0$ for all nonzero vectors x.
Two matrices A and B are similar to each other if there exists a nonsingular matrix P such that $B = PAP^{-1}$. Similar matrices have the same characteristic polynomial:
$$\operatorname{Det}(B - \lambda I) = \operatorname{Det}(PAP^{-1} - \lambda I) = \operatorname{Det}(P(A - \lambda I)P^{-1}) = \operatorname{Det}(P) \cdot \operatorname{Det}(A - \lambda I) \cdot \operatorname{Det}(P^{-1}) = \operatorname{Det}(A - \lambda I)$$
Thus, we have an important theorem.
■ THEOREM 2
EIGENVALUES OF SIMILAR MATRICES
Similar matrices have the same eigenvalues.
This theorem suggests a strategy for finding eigenvalues of A. Transform the matrix A to a matrix B using a similarity transformation B = P A P −1 in which B has a special structure, and then find the eigenvalues of matrix B. Specifically, if B is triangular or diagonal, the eigenvalues of B (and those of A) are simply the diagonal elements of B.
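A small numerical illustration of both observations follows, as a sketch with assumed data: the nonsingular matrix `P` and the triangular matrix `T` are arbitrary choices made for the demonstration, and `A` is the matrix of Example 1.

```python
import numpy as np

A = np.array([[3.0, 2.0], [7.0, -2.0]])   # eigenvalues 5 and -4
P = np.array([[1.0, 2.0], [3.0, 4.0]])    # any nonsingular matrix will do

B = P @ A @ np.linalg.inv(P)              # similarity transformation B = P A P^{-1}

eig_A = np.sort(np.linalg.eigvals(A).real)
eig_B = np.sort(np.linalg.eigvals(B).real)

# For a triangular matrix, the eigenvalues can be read off the diagonal.
T = np.array([[2.0, 7.0, -1.0],
              [0.0, -3.0, 4.0],
              [0.0, 0.0, 5.0]])
eig_T = np.sort(np.linalg.eigvals(T).real)

print("eigenvalues of A:", eig_A)
print("eigenvalues of B:", eig_B)
print("diagonal of T   :", np.sort(np.diag(T)), " eigenvalues of T:", eig_T)
```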
Matrices A and B are said to be unitarily similar to each other if B = U∗ AU for some unitary matrix U . Recall that a matrix U is unitary if U U ∗ = I . This brings us naturally to another important theorem and two corollaries.
■ THEOREM 3
SCHUR’S THEOREM
Every square matrix is unitarily similar to a triangular matrix.
In this theorem, an arbitrary complex n × n matrix A is given, and the assertion made is that a unitary matrix U exists such that
$$UAU^* = T$$
where $UU^* = I$ and T is a triangular matrix.
The proof of Schur’s Theorem can be found in Kincaid and Cheney [2002] and Golub and Van Loan [1996].
■ COROLLARY 1
MATRIX SIMILAR TO A TRIANGULAR MATRIX
Every square real matrix is similar to a triangular matrix.
Thus the factorization
$$PAP^{-1} = T$$
is possible, where T is triangular, P is invertible, and A is real.
EXAMPLE 3
We illustrate Schur’s Theorem by finding the decomposition of this 2 × 2 matrix:
$$A = \begin{bmatrix} 3 & -2 \\ 8 & 3 \end{bmatrix}$$
Solution
From the characteristic equation $\det(A - \lambda I) = \lambda^2 - 6\lambda + 25 = 0$, the eigenvalues are 3 ± 4i. By solving $(A - \lambda I)v = 0$ with each of these eigenvalues, the corresponding eigenvectors are $v_1 = [i, 2]^T$ and $v_2 = [-i, 2]^T$. Using the Gram-Schmidt orthogonalization process, we obtain $u_1 = v_1$ and $u_2 = v_2 - [v_2^* u_1 / u_1^* u_1]u_1 = [-2, -i]^T$. After normalizing these vectors, we obtain the unitary matrix
$$U = \frac{1}{\sqrt{5}}\begin{bmatrix} i & -2 \\ 2 & -i \end{bmatrix}$$
which satisfies the property $UU^* = I$. Finally, we obtain the Schur form
$$UAU^* = \begin{bmatrix} 3+4i & -6 \\ 0 & 3-4i \end{bmatrix}$$
which is an upper triangular matrix with the eigenvalues on the diagonal. ■
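The factorization in Example 3 can be verified directly. The sketch below simply multiplies out the matrices given in the example with NumPy; nothing beyond the example's own A and U is assumed.

```python
import numpy as np

A = np.array([[3, -2], [8, 3]], dtype=complex)
U = np.array([[1j, -2], [2, -1j]]) / np.sqrt(5)

U_star = U.conj().T          # conjugate transpose of U
T = U @ A @ U_star           # Schur form U A U*

print("U U* =\n", np.round(U @ U_star, 12))
print("U A U* =\n", np.round(T, 12))
```

The (2, 1) entry of the product vanishes, and the diagonal carries the eigenvalues 3 + 4i and 3 − 4i.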
■ COROLLARY 2
HERMITIAN MATRIX UNITARILY SIMILAR TO A DIAGONAL MATRIX
Every square Hermitian matrix is unitarily similar to a diagonal matrix.
In the second corollary, a Hermitian matrix A is factored as
$$A = U^* D U$$
where D is diagonal and U is unitary.
Furthermore, $U^* A U = T$ and $U^* A^* U = T^*$ and $A = A^*$, so $T = T^*$, which must be a diagonal matrix.
Most numerical methods for finding eigenvalues of an n × n matrix A proceed by determining such similarity transformations. Then one eigenvalue at a time, say, λ, is computed, and a deflation process is used to produce an (n − 1) × (n − 1) matrix $\tilde{A}$ whose eigenvalues are the same as those of A, except for λ. Any such procedure can be repeated with the matrix $\tilde{A}$ to find as many eigenvalues of the matrix A as desired. In practice, this strategy must be used cautiously because the successive eigenvalues may be infected with roundoff error.
Gershgorin’s Theorem
Sometimes it is necessary to determine in a coarse manner where the eigenvalues of a matrix are situated in the complex plane C. The most famous of these so-called localization theorems is the following.
■ THEOREM 4
GERSHGORIN’S THEOREM
All eigenvalues of an n × n matrix $A = (a_{ij})$ are contained in the union of the n discs $C_i = C_i(a_{ii}, r_i)$ in the complex plane with center $a_{ii}$ and radii $r_i$ given by the sum of the magnitudes of the off-diagonal entries in the ith row.
The matrix A can have either real or complex entries. The region containing the eigenvalues of A can be written
$$\bigcup_{i=1}^{n} C_i = \bigcup_{i=1}^{n} \{z \in \mathbb{C} : |z - a_{ii}| \le r_i\}$$
where the radii are $r_i = \sum_{j=1,\, j \ne i}^{n} |a_{ij}|$.
The eigenvalues of A and $A^T$ are the same because the characteristic equation involves the determinant, which is the same for a matrix and its transpose. Therefore, we can apply the Gershgorin Theorem to $A^T$ and obtain the following useful result.
■ COROLLARY 3
MORE GERSHGORIN DISCS
All eigenvalues of an n × n matrix $A = (a_{ij})$ are contained in the union of the n discs $D_i = D_i(a_{ii}, s_i)$ in the complex plane having center at $a_{ii}$ and radii $s_i$ given by the sum of the magnitudes of the off-diagonal entries in the ith column of A.
Consequently, the region containing the eigenvalues of A can be written as
$$\bigcup_{i=1}^{n} D_i = \bigcup_{i=1}^{n} \{z \in \mathbb{C} : |z - a_{ii}| \le s_i\}$$
where the radii are $s_i = \sum_{j=1,\, j \ne i}^{n} |a_{ji}|$. Finally, the region containing the eigenvalues of A is
$$\left(\bigcup_{i=1}^{n} C_i\right) \cap \left(\bigcup_{i=1}^{n} D_i\right)$$
This may give tighter bounds on the eigenvalues in some cases. Also, a useful localization result is
■ COROLLARY 4
For a matrix A, the union of any k Gershgorin discs that do not intersect the remaining n − k discs contains exactly k (counting multiplicities) of the eigenvalues of A.
For a strictly diagonally dominant matrix, zero cannot lie in any of its Gershgorin discs, so it must be invertible. Consequently, we obtain the following result.
■ COROLLARY 5
Every strictly diagonally dominant matrix is nonsingular.
EXAMPLE 4
Consider the matrix
$$A = \begin{bmatrix} 4-i & 2 & i \\ -1 & 2i & 2 \\ 1 & -1 & -5 \end{bmatrix}$$
Draw the Gershgorin discs.
Solution
Using the rows of A, we find that the Gershgorin discs are C1(4 − i, 3), C2(2i, 3), and C3(−5, 2). By using the columns of A, we obtain more Gershgorin discs: D1(4 − i, 2), D2(2i, 3), and D3(−5, 3). Consequently, all the eigenvalues of A are in the three discs D1, C2, and C3, as shown in Figure 8.1. By other means, we compute the eigenvalues of A as λ1 = 3.7208 − 1.05461i, λ2 = −4.5602 − 0.2849i, and λ3 = −0.1605 + 2.3395i. In Figure 8.1, the centers of the discs are designated by dots • and the eigenvalues by ∗. ■
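The disc centers and radii in this example, and the fact that every eigenvalue lies in both unions, can be confirmed with a few lines of NumPy. This is a sketch, and the function name `gershgorin_discs` is hypothetical.

```python
import numpy as np

def gershgorin_discs(A):
    """Return disc centers a_ii, row radii r_i, and column radii s_i."""
    A = np.asarray(A, dtype=complex)
    centers = np.diag(A)
    row_radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    col_radii = np.sum(np.abs(A), axis=0) - np.abs(centers)
    return centers, row_radii, col_radii

A = np.array([[4 - 1j, 2, 1j],
              [-1, 2j, 2],
              [1, -1, -5]])

centers, r, s = gershgorin_discs(A)
eigvals = np.linalg.eigvals(A)

def in_union(z, radii, slack=1e-12):
    """True if z lies in at least one disc |z - a_ii| <= radius_i."""
    return bool(np.any(np.abs(z - centers) <= radii + slack))

for lam in eigvals:
    print(lam, "in row discs:", in_union(lam, r), " in column discs:", in_union(lam, s))
```

Since the trace of A is −1 + i, the computed eigenvalues must also sum to −1 + i, which gives an independent check on their signs.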
FIGURE 8.1 Gershgorin discs (plot in the complex plane with axes Re(z) and Im(z), showing the discs C1, C2 = D2, C3, D1, and D3 of Example 4 with the eigenvalues marked ∗)
Singular Value Decomposition
This subsection requires of the reader some further knowledge of linear algebra, in particular the diagonalization of symmetric matrices, eigenvalues, eigenvectors, rank, column space, and norms. See Appendix D for a brief review of these topics. (In the discussion below, we assume that the Euclidean norm is being used.)
The singular value decomposition is a general-purpose tool that has many uses, particularly in least-squares problems (Chapter 12). It can be applied to any matrix, whether square or not. We begin by stating that the singular values of a matrix A are the nonnegative square roots of the eigenvalues of $A^T A$.
■ THEOREM 5
MATRIX SPECTRAL THEOREM
Let A be m × n. Then $A^T A$ is an n × n symmetric matrix and it can be diagonalized by an orthogonal matrix, say, Q:
$$A^T A = QDQ^{-1}$$
where $QQ^T = Q^T Q = I$ and D is a diagonal n × n matrix.
Furthermore, the diagonal matrix D contains the eigenvalues of $A^T A$ on its diagonal. This follows from the fact that $A^T A Q = QD$, so the columns of Q are eigenvectors of $A^T A$. If λ is an eigenvalue of $A^T A$ and if x is a corresponding eigenvector, then $A^T A x = \lambda x$ whence
$$\|Ax\|^2 = (Ax)^T(Ax) = x^T A^T A x = x^T \lambda x = \lambda\|x\|^2$$
This equation shows that λ is real and nonnegative. We can order the eigenvalues as
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
(Reordering the eigenvalues requires reordering the columns of Q.) The numbers $\sigma_j = +\sqrt{\lambda_j}$ are the singular values of A.
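The definition of singular values as square roots of the eigenvalues of AᵀA can be compared with a library SVD. The sketch below uses an arbitrary 3 × 2 test matrix, which is an assumption for illustration, and clips tiny negative eigenvalues caused by roundoff before taking square roots.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])     # arbitrary m x n example with m = 3, n = 2

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
lam, Q = np.linalg.eigh(A.T @ A)

# Reorder so that lambda_1 >= lambda_2 >= ..., reordering the columns of Q to match.
order = np.argsort(lam)[::-1]
lam, Q = lam[order], Q[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))          # singular values from eigenvalues
sigma_ref = np.linalg.svd(A, compute_uv=False)    # library singular values

print("from eig(A^T A):", sigma)
print("from svd(A)    :", sigma_ref)
```

Forming AᵀA squares the condition number, so library routines compute the singular values without this product; the comparison here is only meant to illustrate the definition.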
Since Q is an orthogonal matrix, its columns form an orthonormal basis for $\mathbb{R}^n$. They are unit eigenvectors of $A^T A$, so if $v_j$ is the jth column of Q, then $A^T A v_j = \lambda_j v_j$. Some of the eigenvalues of $A^T A$ can be zero. Define r by the condition
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0 = \lambda_{r+1} = \cdots = \lambda_n$$
For a review of concepts such as rank, orthogonal basis, orthonormal basis, column
space, null space, and so on, see Appendix D.
■ THEOREM 6
ORTHOGONAL BASIS THEOREM
If the rank of A is r, then an orthogonal basis for the column space of A is $\{Av_j : 1 \le j \le r\}$.

Proof
Observe that
$$(Av_k)^T (Av_j) = v_k^T A^T A v_j = v_k^T \lambda_j v_j = \lambda_j \delta_{kj}$$
This establishes the orthogonality of the set $\{Av_j : 1 \le j \le n\}$. By letting k = j, we get $\|Av_j\|^2 = \lambda_j$. Hence, $Av_j \ne 0$ if and only if $1 \le j \le r$. If w is any vector in the column space of A, then w = Ax for some x in $\mathbb{R}^n$. Putting $x = \sum_{j=1}^{n} c_j v_j$, we get
$$w = Ax = \sum_{j=1}^{n} c_j A v_j = \sum_{j=1}^{r} c_j A v_j$$
and therefore, w is in the span of $\{Av_1, Av_2, \ldots, Av_r\}$. ■
The preceding theorem gives a reasonable way of computing the rank of a numerical matrix. First, compute its singular values. Any that are very small can be assumed to be zero. The remaining ones are strongly positive, and if there are r of them, we take r to be the numerically computed rank of A.
A singular value decomposition of an m × n matrix A is any representation of A in the form
$$A = U D V^T$$
where U and V are orthogonal matrices and D is an m × n diagonal matrix having nonnegative diagonal entries that are ordered $d_{11} \ge d_{22} \ge \cdots \ge 0$. Then from Problem 4, it follows that the diagonal elements $d_{ii}$ are necessarily the singular values of A. Note that the matrix U is m × m and V is n × n. A nonsquare matrix D is nevertheless said to be diagonal if the only elements that are not zero are among those whose two indices are equal.
One singular value decomposition of A (there are many of them) can be obtained from the work described above. Start with the vectors $v_1, v_2, \ldots, v_r$. Normalize the vectors $Av_j$ to get vectors $u_j$. Thus, we have
$$u_j = Av_j / \|Av_j\| \qquad (1 \le j \le r)$$
Extend this set to an orthonormal base for $\mathbb{R}^m$. Let U be the m × m matrix whose columns are $u_1, u_2, \ldots, u_m$. Define D to be the m × n matrix consisting of zeros except for $\sigma_1, \sigma_2, \ldots, \sigma_r$ on its diagonal. Let V = Q, where Q is as above.
To verify the equation $A = U D V^T$, first note that $\sigma_j = \|Av_j\|_2$ and that $\sigma_j u_j = A v_j$. Then compute UD. Since D is diagonal, this is easy. We get
$$UD = [u_1, u_2, \ldots, u_m] D = [\sigma_1 u_1, \sigma_2 u_2, \ldots, \sigma_r u_r, 0, \ldots, 0] = [Av_1, Av_2, \ldots, Av_r, \ldots, Av_n] = AQ = AV$$
This implies that
$$A = U D V^T$$
The condition number of a matrix can be expressed in terms of its singular values
$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$$
since $\|A\|_2 = \sqrt{\rho(A^T A)} = \sigma_{\max}(A)$ and $\|A^{-1}\|_2 = \sqrt{\rho(A^{-T} A^{-1})} = 1/\sigma_{\min}(A)$.

Numerical Examples of Singular Value Decomposition
The numerical determination of a singular value decomposition is best left to the available high-quality software. Such programs can be found in Matlab, Maple, LAPACK, and other software packages. The high-quality programs do not form $A^T A$ and seek its eigenvalues. One wishes to avoid using $A^T A$ in numerical work because its condition number may be much worse than that of A. This phenomenon is easily illustrated by the matrices
$$A = \begin{bmatrix} 1 & 1 & 1 \\ \varepsilon & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \varepsilon \end{bmatrix}, \qquad A^T A = \begin{bmatrix} 1+\varepsilon^2 & 1 & 1 \\ 1 & 1+\varepsilon^2 & 1 \\ 1 & 1 & 1+\varepsilon^2 \end{bmatrix}$$
There will be certain small values of ε for which A has rank 3 and $A^T A$ has rank 1 (in the computer).

EXAMPLE 5
In an example in Section 1.1 (p. 4), we encountered this matrix:
$$A = \begin{bmatrix} 0.1036 & 0.2122 \\ 0.2081 & 0.4247 \end{bmatrix}$$
Determine its eigenvalues, singular values, and condition number.
Solution
By using mathematical software, it is easy to find the eigenvalues $\lambda_1(A) \approx -0.0003$ and $\lambda_2(A) \approx 0.5286$. We can form the matrix
$$A^T A = \begin{bmatrix} 0.0540 & 0.1104 \\ 0.1104 & 0.2254 \end{bmatrix}$$
and find its eigenvalues $\lambda_1(A^T A) \approx 0.9150 \times 10^{-7}$ and $\lambda_2(A^T A) \approx 0.2794$. Therefore, the singular values are $\sigma_1(A) = \sqrt{|\lambda_1(A^T A)|} \approx 0.0003$ and $\sigma_2(A) = \sqrt{|\lambda_2(A^T A)|} \approx 0.5286$. Also, we can obtain the singular values directly as $\sigma_1 \approx 0.0003$ and $\sigma_2 \approx 0.5286$ using mathematical software. Consequently, the condition number is $\kappa(A) = \sigma_2/\sigma_1 \approx 1747.6$. Because of this large condition number, we now understand why there was difficulty in solving a linear system of equations with this coefficient matrix! ■
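As a rough numerical companion to the warning about forming $A^T A$, the following Python sketch (standard library only) does two things. First, for the 2 × 2 matrix of Example 5 it obtains the singular values from the eigenvalues of $A^T A$ by the quadratic formula and prints κ(A) together with κ($A^T A$), which is its square. Second, it shows that for a small ε the quantity 1 + ε² rounds to 1 in double precision, so the computed $A^T A$ for the 4 × 3 matrix containing ε would be the all-ones matrix. The helper name `sym2_eigs` and the value ε = 10⁻⁹ are illustrative assumptions.

```python
import math

def sym2_eigs(p, q, r):
    """Eigenvalues (small, large) of the symmetric 2x2 matrix [[p, q], [q, r]]."""
    tr, det = p + r, p * r - q * q
    large = 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))
    small = det / large          # product of eigenvalues is det; avoids cancellation
    return small, large

# Matrix of Example 5.
a11, a12, a21, a22 = 0.1036, 0.2122, 0.2081, 0.4247

# Entries of the symmetric matrix A^T A.
p = a11 * a11 + a21 * a21
q = a11 * a12 + a21 * a22
r = a12 * a12 + a22 * a22

lam_small, lam_large = sym2_eigs(p, q, r)
sigma_min, sigma_max = math.sqrt(lam_small), math.sqrt(lam_large)
kappa = sigma_max / sigma_min
print("singular values:", sigma_min, sigma_max)
print("kappa(A):", kappa)
print("kappa(A^T A):", lam_large / lam_small)

# For small eps, 1 + eps**2 is rounded to exactly 1 in floating point,
# so the diagonal of the computed A^T A carries no trace of eps.
eps = 1.0e-9
print("1 + eps^2 == 1 ?", 1.0 + eps * eps == 1.0)
```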
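EXAMPLE 6
Calculate the singular value decomposition of the matrix
$$A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \tag{1}$$
Solution
Here, the matrix A is m × n with m = 3 and n = 2. First, we find that the eigenvalues of the matrix
$$A^T A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$
arranged in descending order are $\lambda_1 = 3$ and $\lambda_2 = 1$. The number of nonzero eigenvalues of the matrix $A^T A$ is 2. Next, we determine that the eigenvectors of the matrix $A^T A$ are $[1, 1]^T$ for $\lambda_1 = 3$ and $[1, -1]^T$ for $\lambda_2 = 1$. Consequently, the orthonormal set of eigenvectors of $A^T A$ is $\left[\tfrac{1}{2}\sqrt{2}, \tfrac{1}{2}\sqrt{2}\right]^T$ for $\lambda_1 = 3$ and $\left[\tfrac{1}{2}\sqrt{2}, -\tfrac{1}{2}\sqrt{2}\right]^T$ for $\lambda_2 = 1$. Then we arrange them in the same order as the eigenvalues to form the column vectors of the n × n matrix V:
$$V = [v_1 \; v_2] = \begin{bmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
Now we form a diagonal matrix D, placing on the leading diagonal the singular values $\sigma_i = \sqrt{\lambda_i}$. Since $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$, the m × n singular value matrix is
$$D = \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$$
Here, on the leading diagonal are the square roots of the eigenvalues of $A^T A$ in descending order, and the rest of the entries of the matrix D are zeros. Next, we compute vectors $u_i = \sigma_i^{-1} A v_i$ for i = 1 and 2 and form the column vectors of the m × m matrix U. In this case, we find
$$u_1 = \sigma_1^{-1} A v_1 = \tfrac{1}{3}\sqrt{3} \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} \\ \tfrac{1}{6}\sqrt{6} \\ \tfrac{1}{6}\sqrt{6} \end{bmatrix}$$
and
$$u_2 = \sigma_2^{-1} A v_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} \end{bmatrix} = \begin{bmatrix} 0 \\ -\tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
Finally, we add to the matrix U the rest of the m − r vectors using the Gram-Schmidt orthogonalization process. So we make the vector $u_3$ perpendicular to $u_1$ and $u_2$:
$$u_3 = e_1 - (u_1^T e_1) u_1 - (u_2^T e_1) u_2 = \begin{bmatrix} \tfrac{1}{3} \\ -\tfrac{1}{3} \\ -\tfrac{1}{3} \end{bmatrix}$$
Normalizing the vector $u_3$, we get
$$u_3' = \begin{bmatrix} \tfrac{1}{3}\sqrt{3} \\ -\tfrac{1}{3}\sqrt{3} \\ -\tfrac{1}{3}\sqrt{3} \end{bmatrix}$$
So we have the matrix
$$U = [u_1 \; u_2 \; u_3'] = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} & 0 & \tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & -\tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \end{bmatrix}$$
The singular value decomposition of the matrix A is $A = U D V^T$:
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} & 0 & \tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & -\tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
So there we have it! Fortunately, there is mathematical software for doing all of this instantly! We can verify the results by computing the diagonal matrix and the matrix A from the factorization. ■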
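The verification suggested at the end of Example 6 can be carried out with a few lines of plain Python. This is a minimal sketch, with hypothetical helper names `matmul` and `transpose`; the middle column of U is taken to be $u_2 = \sigma_2^{-1} A v_2 = [0, -\tfrac{1}{2}\sqrt{2}, \tfrac{1}{2}\sqrt{2}]^T$, the vector computed in the example. It checks that $U D V^T$ reproduces A and that $U^T U$ is the identity, both up to roundoff.

```python
import math

def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*x)]

s2, s3, s6 = math.sqrt(2.0), math.sqrt(3.0), math.sqrt(6.0)

A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
U = [[s6 / 3.0,  0.0,       s3 / 3.0],
     [s6 / 6.0, -s2 / 2.0, -s3 / 3.0],
     [s6 / 6.0,  s2 / 2.0, -s3 / 3.0]]
D = [[s3, 0.0], [0.0, 1.0], [0.0, 0.0]]
V = [[s2 / 2.0,  s2 / 2.0],
     [s2 / 2.0, -s2 / 2.0]]

product = matmul(matmul(U, D), transpose(V))   # should reproduce A
gram = matmul(transpose(U), U)                 # should be the 3x3 identity

err_A = max(abs(product[i][j] - A[i][j]) for i in range(3) for j in range(2))
err_I = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
            for i in range(3) for j in range(3))
print("max |U D V^T - A| =", err_A)
print("max |U^T U - I|   =", err_I)
```

Flipping the sign of any column of U together with the matching column of V gives another valid decomposition, which is one reason a singular value decomposition is not unique.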
See Chapters 12 and 16 for some important applications of the singular value decomposition. Further examples are given there and in the problems of those chapters.
Application: Linear Differential Equations
The application of eigenvalue theory to systems of linear differential equations will be briefly explained here. Let us start with a single linear differential equation with one dependent variable x. The independent variable is t and often represents time. We write x′ = ax, or in more detail (d/dt)x(t) = ax(t). There is a family of solutions, namely, $x(t) = ce^{at}$, where c is an arbitrary real parameter. If an initial value x(0) is prescribed, we shall need parameter c to get the initial value right.
A pair of linear differential equations with two dependent variables, $x_1$ and $x_2$, will look like this:
$$\begin{aligned} x_1' &= a_{11}x_1 + a_{12}x_2 \\ x_2' &= a_{21}x_1 + a_{22}x_2 \end{aligned}$$
The general form of a system of n linear first-order differential equations, with constant coefficients, is simply x′ = Ax. Here, A is an n × n numerical matrix, and the vector x has n components, $x_j$, each being a function of t. Differentiation is with respect to t. To solve this, we are guided by the easy case of n = 1, discussed above. Here, we try $x(t) = e^{\lambda t}v$, where v is a constant vector. Taking the derivative of x, we have $x' = \lambda e^{\lambda t}v$. Now the system of equations has become $\lambda e^{\lambda t}v = Ae^{\lambda t}v$, or λv = Av. This is how eigenvalues come into the process. We have proved the following result.
■ THEOREM 7
LINEAR DIFFERENTIAL EQUATIONS
If λ is an eigenvalue of the matrix A and if v is an accompanying eigenvector, then one solution of the differential equation x′ = Ax is $x(t) = e^{\lambda t}v$.
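The theorem can be checked numerically for a particular eigenpair. In the Python sketch below, the 2 × 2 matrix, the eigenpair (λ = −1, v = [1, 1]ᵀ), the step size, and the function names are all illustrative assumptions. The derivative of $x(t) = e^{\lambda t}v$ is approximated by a central difference and compared with Ax(t).

```python
import math

A = [[-2.0, 1.0], [1.0, -2.0]]   # illustrative matrix (an assumption)
lam, v = -1.0, [1.0, 1.0]        # eigenpair: A v = lam v

def x(t):
    """Candidate solution x(t) = exp(lam t) v."""
    return [math.exp(lam * t) * vi for vi in v]

def residual(t, h=1.0e-5):
    """Largest component of x'(t) - A x(t), with x' from a central difference."""
    xp, xm, x0 = x(t + h), x(t - h), x(t)
    deriv = [(p - m) / (2.0 * h) for p, m in zip(xp, xm)]
    ax = [sum(A[i][j] * x0[j] for j in range(2)) for i in range(2)]
    return max(abs(d - a) for d, a in zip(deriv, ax))

print([residual(t) for t in (0.0, 0.5, 1.0)])
```

The printed residuals should be tiny; they are not exactly zero because the central difference has truncation and rounding error.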
FIGURE 8.2
Two-mass vibration problem
Application: A Vibration Problem
Eigenvalue-eigenvector analysis can be utilized for a variety of differential equations. Consider the system of two masses and three springs shown in Figure 8.2. Here, the masses are constrained to move only in the horizontal direction.
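From this situation, we write the equations of motion in matrix-vector form:
$$\begin{bmatrix} x_1'' \\ x_2'' \end{bmatrix} = \begin{bmatrix} -\beta & \alpha \\ \alpha & -\beta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad x'' = Ax$$
By assuming that the solution is purely oscillatory (no damping), we have
$$x = v e^{i\omega t}$$
In matrix form, we get
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} e^{i\omega t}$$
By differentiation, we obtain
$$x'' = -\omega^2 v e^{i\omega t} = -\omega^2 x$$
and
$$\begin{bmatrix} -\beta & \alpha \\ \alpha & -\beta \end{bmatrix} x = -\omega^2 x$$
This is the eigenvalue problem
$$Ax = \lambda x$$
where $\lambda = -\omega^2$. Eigenvalues can be found from the characteristic equation:
$$\det(A + \omega^2 I) = \det \begin{bmatrix} \omega^2 - \beta & \alpha \\ \alpha & \omega^2 - \beta \end{bmatrix} = 0$$
This is $(\omega^2 - \beta)^2 - \alpha^2 = \omega^4 - 2\beta\omega^2 + (\beta^2 - \alpha^2) = 0$, and
$$\omega^2 = \tfrac{1}{2}\left[ 2\beta \pm \sqrt{4\beta^2 - 4(\beta^2 - \alpha^2)} \right] = \beta \pm \alpha$$
For simplicity, we now assume unit masses and unit springs so that β = 2 and α = 1. Then we obtain
$$A = \begin{bmatrix} -2 & 1 \\ 1 & -2 \end{bmatrix}$$
Then the roots of the characteristic equation are $\omega_1^2 = \beta + \alpha = 3$ and $\omega_2^2 = \beta - \alpha = 1$. Next, we can find the eigenvectors. For the first eigenvalue, we obtain
$$(A + \omega_1^2 I) v_1 = 0, \qquad \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} v_{11} \\ v_{12} \end{bmatrix} = 0$$
Since $v_{11} = -v_{12}$, we obtain the first eigenvector
$$v_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
For the second eigenvector, we have
$$(A + \omega_2^2 I) v_2 = 0, \qquad \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} v_{21} \\ v_{22} \end{bmatrix} = 0$$
Since $v_{21} = v_{22}$, we obtain the second eigenvector
$$v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
The general solution for the equations of motion for the two-mass system is
$$x(t) = c_1 v_1 e^{i\omega_1 t} + c_2 v_1 e^{-i\omega_1 t} + c_3 v_2 e^{i\omega_2 t} + c_4 v_2 e^{-i\omega_2 t}$$
Because the solution was for the square of the frequency, each frequency is used twice (one positive and one negative). We can use initial conditions to solve for the unknown coefficients.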
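The closed-form result $\omega^2 = \beta \pm \alpha$ and the two mode shapes can be confirmed directly. The short Python sketch below uses a hypothetical function name, `vibration_modes`; it returns the squared frequencies with their mode shapes and checks that each pair satisfies $Av = -\omega^2 v$ for the unit-mass, unit-spring case β = 2, α = 1.

```python
import math

def vibration_modes(alpha, beta):
    """Squared frequencies and mode shapes for A = [[-beta, alpha], [alpha, -beta]].

    The characteristic equation (w^2 - beta)^2 - alpha^2 = 0 gives
    w^2 = beta + alpha and w^2 = beta - alpha.
    """
    return [(beta + alpha, [1.0, -1.0]),   # masses move in opposite directions
            (beta - alpha, [1.0, 1.0])]    # masses move together

alpha, beta = 1.0, 2.0
A = [[-beta, alpha], [alpha, -beta]]

for w2, v in vibration_modes(alpha, beta):
    av = [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
    # A v should equal lambda v with lambda = -w^2.
    print(f"omega^2 = {w2}, omega = {math.sqrt(w2):.4f}, A v = {av}")
```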
Summary
(1) An eigenvalue λ and eigenvector x satisfy the equation Ax = λx. The direct method to compute the eigenvalues is to find the roots of the characteristic equation p(λ) = det(A − λI) = 0. Then, for each eigenvalue λ, the eigenvectors can be found by solving the homogeneous system (A − λI)x = 0. There are software packages for finding the eigenvalue-eigenvector pairs using more sophisticated methods.
(2) There are many useful properties for matrices that influence their eigenvalues. For example, the eigenvalues are real when A is symmetric or Hermitian. The eigenvalues are positive when A is symmetric or Hermitian positive definite.
(3) Many eigenvalue procedures involve similarity or unitary transformations to produce triangular or diagonal matrices.
(4) Gershgorin’s discs can be used to localize the eigenvalues by finding coarse estimates of them.
(5) The singular value decomposition of an m × n matrix A is
$$A = U D V^T$$
where D is an m × n diagonal matrix whose diagonal entries are the singular values, U is an m × m orthogonal matrix, and V is an n × n orthogonal matrix. The singular values of A are the nonnegative square roots of the eigenvalues of $A^T A$.
Problems 8.3
1. Are $[i, -1+i]^T$ and $[-i, -1-i]^T$ eigenvectors of the matrix in Example 2?
2. Prove that if λ is an eigenvalue of a real matrix with eigenvector x, then $\bar{\lambda}$ is also an eigenvalue with eigenvector $\bar{x}$. (For a complex number z = x + iy, the conjugate is defined by $\bar{z} = x - iy$.)
3. Let
$$A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Account for the fact that the matrix A has the effect of rotating vectors counterclockwise through an angle θ and thus cannot map any vector into a multiple of itself.
4. Let A be an m × n matrix such that $A = U D V^T$, where U and V are orthogonal and D is diagonal and nonnegative. Prove that the diagonal elements of D are the singular values of A.
5. Let A, U, D, and V be as in the singular value decomposition: $A = U D V^T$. Let r be as described in the text. Define $U_r$ to consist of the first r columns of U. Let $V_r$ consist of the first r columns of V, and let $D_r$ be the r × r matrix having the same diagonal as D. Prove that $A = U_r D_r V_r^T$. (This factorization is called the economical version of the singular value decomposition.)
6. A linear map P is a projection if $P^2 = P$. We can use the same terminology for an n × n matrix: $A^2 = A$ is the projection property. Use the Pierce decomposition, I = A + (I − A), to show that every point in $\mathbb{R}^n$ is the sum of a vector in the range of A and a vector in the null space of A. What are the eigenvalues of a projection?
7. Find all of the Gershgorin discs for the following matrices. Indicate the smallest region(s) containing all of the eigenvalues:
3 −1 1 a. ⎣2 4 −2⎦
3 1 2 b. ⎣−1 4 −1⎦
1−i 1 i c. ⎣ 0 2i 2⎦
1 −2 9
3 −1 9
1 0 2
8. (Multiple choice) Let A be an n × n invertible (nonsingular) matrix. Let x be a nonzero vector. Suppose that Ax = λx. Which equation does not follow from these hypotheses?
a. $A^k x = \lambda^k x$  b. $\lambda^{-k} x = (A^{-1})^k x$ for $k \ge 0$  c. p(A)x = p(λ)x for any polynomial p  d. $A^k x = (1 - \lambda)^k x$  e. None of these.
a9. (Multiple choice) For what values of s will the matrix $I - svv^*$ be unitary, where v is a column vector of unit length?
a. 0, 1  b. 0, 2  c. 1, 2  d. 0, $\sqrt{2}$  e. None of these.
10. (Multiple choice) Let U and V be unitary n × n matrices, possibly complex. Which conclusion is not justified?
a. U + V is unitary.  b. $U^*$ is unitary.  c. UV is unitary.  d. $U - vv^*$ is unitary when $\|v\| = \sqrt{2}$ and v is a column vector.  e. None of these.
a11. (Multiple choice) Which assertion is true?
a. Every n × n matrix has n distinct (different) eigenvalues.
b. The eigenvalues of a real matrix are real.
c. If U is a unitary matrix, then $U^* = U^T$.
d. A square matrix and its transpose have the same eigenvalues.
e. None of these.
12. (Multiple choice) Consider the symmetric matrix
$$A = \begin{bmatrix} 1 & 3 & 4 & -1 \\ 3 & 7 & -6 & 1 \\ 4 & -6 & 3 & 0 \\ -1 & 1 & 0 & 5 \end{bmatrix}$$
What is the smallest interval derived from Gershgorin’s Theorem such that all eigenvalues of the matrix A lie in that interval?
a. [−7, 9]  b. [−7, 13]  c. [3, 7]  d. [−3, 17]  e. None of these.
13. (True or false) Gershgorin’s Theorem asserts that every eigenvalue λ of an n × n matrix A must satisfy one of these inequalities:
$$|\lambda - a_{ii}| \le \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \qquad \text{for } 1 \le i \le n.$$
14. (True or false) A consequence of Schur’s Theorem is that every square matrix A can be factored as $A = P T P^{-1}$, where P is a nonsingular matrix and T is upper triangular.
15. (True or false) A consequence of Schur’s Theorem is that every (real) symmetric matrix A can be factored in the form $A = P D P^{-1}$, where P is unitary and D is diagonal.
16. Explain why $\|UB\|_2 = \|B\|_2$ for any matrix B when $U^T U = I$.
17. Consider the matrix
A = ⎡4−1 0⎤ 55 ⎢2⎥ ⎣ 3 5 −3 ⎦ 013 2
Plot the Gershgorin discs in the complex plane for A and $A^T$ as well as indicate the locations of the eigenvalues.
18. (Continuation) Let B be the matrix obtained by changing the negative entries in A to positive numbers. Repeat the process for B.
19. (Continuation) Repeat for
$$C = \begin{bmatrix} 4 & 0 & -2 \\ 1 & 2 & 0 \\ 1 & 1 & 9 \end{bmatrix}$$
20. Find the Schur decomposition of $A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}$.
Computer Problems 8.3
1. Use Matlab, Maple, Mathematica, or other computer programs available to you to compute the eigenvalues and eigenvectors of these matrices:
a. $A = \begin{bmatrix} 1 & 7 \\ 2 & -5 \end{bmatrix}$
b. $\begin{bmatrix} 4 & -7 & 3 & 2 & 3 \\ 1 & 6 & 11 & -1 & 2 \\ 5 & -5 & -2 & -4 & 1 \\ 9 & -3 & 1 & 6 & 5 \\ 3 & 2 & 5 & -5 & 1 \end{bmatrix}$
c. Let n = 12, $a_{ij} = i/j$ when $i \le j$, and $a_{ij} = j/i$ when i > j. Find the eigenvalues.
d. Create an n × n matrix with a tridiagonal structure and nonzero elements (−1, 2, −1) in each row. For n = 5 and 20, find all of the eigenvalues, and verify that they are 2 − 2 cos(jπ/(n + 1)).
e. For any positive integer n, form the symmetric matrix A whose upper triangular part is given by
$$\begin{bmatrix} n & n-1 & n-2 & n-3 & \cdots & 2 & 1 \\ & n-1 & n-2 & n-3 & \cdots & 2 & 1 \\ & & n-2 & n-3 & \cdots & 2 & 1 \\ & & & \ddots & \cdots & \vdots & \vdots \\ & & & & \ddots & 2 & 1 \\ & & & & & 2 & 1 \\ & & & & & & 1 \end{bmatrix}$$
The eigenvalues of A are 1/{2 − 2 cos[(2i − 1)π/(2n + 1)]}. (See Frank [1958] and Gregory and Karney [1969].) Numerically verify this result for n = 30.
2. Use Matlab to compute the eigenvalues of a random 100 × 100 matrix by direct use of the command eig and by use of the commands poly and roots. Use the timing functions to determine the CPU time for each.
3. Let p be the polynomial of degree 20 whose roots are the integers 1, 2, . . . , 20. Find the usual power form of this polynomial so that $p(t) = t^{20} + a_{19}t^{19} + a_{18}t^{18} + \cdots + a_0$. Next, form the so-called companion matrix, which is 20 × 20 and has zeros in all positions except all 1’s on the superdiagonal and the coefficients $-a_0, -a_1, \ldots, -a_{19}$ as its bottom row. Find the eigenvalues of this matrix, and account for any difficulties encountered.
4. (Student research project) Investigate some modern methods for computing eigenvalues and eigenvectors. For the symmetric case, see the book by Parlett [1997]. Also, read the LAPACK User’s Guide. (See Anderson, et al. [1999].)
5. (Student research project) Experiment with the Cayley-Hamilton Theorem, which asserts that every square matrix satisfies its own characteristic equation. Check this numerically by using Matlab or some other mathematical software system. Use matrices of size 3, 6, 9, 12, and 15, and account for any surprises. If you can use higher-precision arithmetic, do so; Matlab works with 15 digits of precision.
6. (Student research project) Experiment with the QR algorithm and the singular value decomposition of matrices, for example, using Matlab. Try examples with four types of equations Ax = b, namely, (a) the system has a unique solution; (b) the system has many solutions; (c) the system is inconsistent but has a unique least-squares solution; (d) the system is inconsistent and has many least-squares solutions.
7. Using mathematical software such as Matlab, Maple, or Mathematica on each of the following matrices, compute the eigenvalues via the characteristic polynomial, compute the eigenvectors via the null space of the matrix, and compute the eigenvalues and eigenvectors directly:
a. $\begin{bmatrix} 3 & 2 \\ 7 & -1 \end{bmatrix}$  b. $\begin{bmatrix} 1 & 3 & -7 \\ -3 & 4 & 1 \\ 2 & -5 & 3 \end{bmatrix}$
8. Using mathematical software such as Matlab, Maple, or Mathematica, determine the execution time for computing all eigenvalues of a 1000 × 1000 matrix with random entries.
9. Using mathematical software such as Matlab, Maple, or Mathematica, compute the Schur factorization of these complex matrices, and verify the results according to Schur’s Theorem and its corollaries:
a. $\begin{bmatrix} 3-i & 2-i \\ 2+i & 3+i \end{bmatrix}$  b. $\begin{bmatrix} 2+i & 3+i \\ 3-i & 2-i \end{bmatrix}$  c. $\begin{bmatrix} 2-i & 2+i \\ 3-i & 3+i \end{bmatrix}$
10. Using mathematical software such as Matlab, Maple, or Mathematica, compute the singular value decomposition of these matrices, and verify that each result satisfies the equation $A = U D V^T$:
a. $\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}$  b. $\begin{bmatrix} 1 & 3 & -2 \\ 2 & 7 & 5 \\ -2 & -3 & 4 \\ 5 & -3 & -2 \end{bmatrix}$
Create the diagonal matrix $D = U^T A V$ to check the results (always recommended). One can see the effects of roundoff errors in these calculations, for the off-diagonal elements in D are theoretically zero.
a11. Consider
$$A = \begin{bmatrix} 5 & 4 & 1 & 1 \\ 4 & 5 & 1 & 1 \\ 1 & 1 & 4 & 2 \\ 1 & 1 & 2 & 4 \end{bmatrix}$$
Find the eigenvalues and accompanying eigenvectors of this matrix, from Gregory and Karney [1969], without using software. Hint: The answers can be integers.
12. Find the singular value decomposition of these matrices:
35√5√ a. 2 1 −2 b. 4 c. −2+3 3 2 3+3
⎡2222⎤ ⎡7−13√67+13√6⎤ ⎢ ⎥⎢26√26√⎥
d.⎣17 1 −17 −1⎦ e.⎣−7−13 6−7+13 6⎦ 10 10 10 10 2 6√ 2 6√
3 9 −3 −9 −13 6 13 6 55 55 6 6
13. Consider
$$B = \begin{bmatrix} -149 & -50 & -154 \\ 537 & 180 & 546 \\ -27 & -9 & -25 \end{bmatrix}$$
Find the eigenvalues, singular values, and condition number of the matrix B.

8.4 Power Method
A procedure called the power method can be employed to compute eigenvalues. It is an example of an iterative process that, under the right circumstances, will produce a sequence converging to an eigenvalue of a given matrix.
Suppose that A is an n × n matrix, and that its eigenvalues (which we do not know) have the following property:
$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|$$
Notice the strict inequality in this hypothesis. Except for that, we are simply ordering the eigenvalues according to decreasing absolute value. (This is only a matter of notation.) Each eigenvalue has a nonzero eigenvector u(i) and
Au(i) =λiu(i) (i =1,2,…,n) (1)
We assume that there is a linearly independent set of n eigenvectors {u(1), u(2), . . . , u(n)}. It is necessarily a basis for Cn .
We want to compute the single eigenvalue of maximum modulus (the dominant eigen- value) and an associated eigenvector. We select an arbitrary starting vector, x(0) ∈ Cn and express it as a linear combination of u(1), u(2), . . . , u(n):
x(0) =c1u(1) +c2u(2) +···+cnu(n)
In this equation, we must assume that c1 ≠ 0. Since the coefficients can be absorbed into
the vectors u(i), there is no loss of generality in assuming that
x(0) = u(1) +u(2) +···+u(n) (2)
Then we repeatedly carry out matrix-vector multiplication, using the matrix A to produce a sequence of vectors. Specifically, we have
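$$\begin{cases} x^{(1)} = Ax^{(0)} \\ x^{(2)} = Ax^{(1)} = A^2 x^{(0)} \\ x^{(3)} = Ax^{(2)} = A^3 x^{(0)} \\ \quad \vdots \\ x^{(k)} = Ax^{(k-1)} = A^k x^{(0)} \\ \quad \vdots \end{cases}$$
In general, we have
$$x^{(k)} = A^k x^{(0)} \qquad (k = 1, 2, 3, \ldots)$$
Substituting $x^{(0)}$ in Equation (2), we obtain
$$\begin{aligned} x^{(k)} &= A^k x^{(0)} \\ &= A^k u^{(1)} + A^k u^{(2)} + A^k u^{(3)} + \cdots + A^k u^{(n)} \\ &= \lambda_1^k u^{(1)} + \lambda_2^k u^{(2)} + \lambda_3^k u^{(3)} + \cdots + \lambda_n^k u^{(n)} \end{aligned}$$
by using Equation (1). This can be written in the form
$$x^{(k)} = \lambda_1^k \left[ u^{(1)} + \left(\frac{\lambda_2}{\lambda_1}\right)^k u^{(2)} + \left(\frac{\lambda_3}{\lambda_1}\right)^k u^{(3)} + \cdots + \left(\frac{\lambda_n}{\lambda_1}\right)^k u^{(n)} \right]$$
Since $|\lambda_1| > |\lambda_j|$ for j > 1, we have $|\lambda_j/\lambda_1| < 1$ and $(\lambda_j/\lambda_1)^k \to 0$ as $k \to \infty$. To simplify the notation, we write the above equation in the form
$$x^{(k)} = \lambda_1^k \left[ u^{(1)} + \varepsilon^{(k)} \right] \tag{3}$$
where $\varepsilon^{(k)} \to 0$ as $k \to \infty$. We let φ be any complex-valued linear functional on $\mathbb{C}^n$ such that $\varphi(u^{(1)}) \ne 0$. Recall that φ is a linear functional if φ(ax + by) = aφ(x) + bφ(y) for scalars a and b and vectors x and y. For example, $\varphi(x) = x_j$ for some fixed j ($1 \le j \le n$) is a linear functional. Now, looking back at Equation (3), we apply φ to it:
$$\varphi(x^{(k)}) = \lambda_1^k \left[ \varphi(u^{(1)}) + \varphi(\varepsilon^{(k)}) \right]$$
Next, we form ratios $r_1, r_2, \ldots$ as follows:
$$r_k \equiv \frac{\varphi(x^{(k+1)})}{\varphi(x^{(k)})} = \lambda_1 \, \frac{\varphi(u^{(1)}) + \varphi(\varepsilon^{(k+1)})}{\varphi(u^{(1)}) + \varphi(\varepsilon^{(k)})} \to \lambda_1 \qquad \text{as } k \to \infty$$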
Hence, we are able to compute the dominant eigenvalue λ1 as the limit of the sequence {rk}. With a little more care, we can get an accompanying eigenvector. In the definition of the vectors x(k) in Equation (2), we see nothing to prevent the vectors from growing or converging to zero. Normalization will cure this problem, as in one of the pseudocodes below.
Power Method Algorithms
Here we present pseudocode for calculating the dominant eigenvalue and an associated eigenvector for a prescribed matrix A. In each algorithm, φ is a linear functional chosen by the user. For example, one can use φ(x) = x1 (the first component of the vector).
Power Method Algorithm
integer k, kmax, n; real r
real array (A)1:n×1:n, (x)1:n, (y)1:n
external function φ
output 0, x
for k = 1 to kmax do
    y ← Ax
    r ← φ(y)/φ(x)
    x ← y
    output k, x, r
end do
We use a simple 2 × 2 matrix such as $A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$ to give a geometric illustration of the power method as shown in Figure 8.3. Clearly, the eigenvalues are $\lambda_1 = 2$ and $\lambda_2 = 4$ with eigenvectors $v^{(1)} = [-1, 1]^T$ and $v^{(2)} = [1, 1]^T$, respectively. Starting with $x^{(0)} = [0, 1]^T$, the power method repeatedly multiplies the matrix A by a vector. It produces a sequence of vectors $x^{(1)}$, $x^{(2)}$, and so on that move in the direction of the eigenvector $v^{(2)}$, which corresponds to the dominant eigenvalue $\lambda_2 = 4$.

FIGURE 8.3 In 2D, power method illustration (vectors x(0), x(1), x(2) turning toward v(2); v(1) also shown)
We can easily modify this algorithm to produce normalized eigenvectors by using the infinity vector norm $\|x\|_\infty = \max_{1 \le j \le n} |x_j|$, as in the following code:
Modified Power Method Algorithm with Normalization
integer k, kmax, n; real r
real array (A)1:n×1:n, (x)1:n, (y)1:n
external function φ
output 0, x
for k = 1 to kmax do
    y ← Ax
    r ← φ(y)/φ(x)
    x ← y/||y||∞
    output k, x, r
end do
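The normalized pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not a definitive implementation: the function name `power_method`, its argument order, and the choice of 40 steps are assumptions. It is applied to the 2 × 2 matrix with rows (3, 1) and (1, 3), starting from [0, 1]ᵀ with φ taken as the second component.

```python
def power_method(A, x, phi, kmax):
    """Normalized power method; returns (r, x) after kmax steps.

    A is a list of rows, x the starting vector, phi a linear functional.
    """
    r = None
    for _ in range(kmax):
        y = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]   # y <- A x
        r = phi(y) / phi(x)                                           # ratio estimate
        norm = max(abs(yi) for yi in y)                               # infinity norm
        x = [yi / norm for yi in y]                                   # x <- y / ||y||
    return r, x

A = [[3.0, 1.0], [1.0, 3.0]]
r, x = power_method(A, [0.0, 1.0], phi=lambda v: v[1], kmax=40)
print(r, x)
```

The ratio r approaches the dominant eigenvalue 4, and x approaches a multiple of [1, 1]ᵀ. The functional φ must not vanish on the iterates, which is why the second component is used with this starting vector.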
Aitken Acceleration
From a given sequence $\{r_k\}$, we can construct another sequence $\{s_k\}$ by means of the Aitken acceleration formula
$$s_k = r_k - \frac{(r_k - r_{k-1})^2}{r_k - 2r_{k-1} + r_{k-2}} \qquad (k \ge 3)$$
If the original sequence $\{r_k\}$ converges to r and if certain other conditions are satisfied, then the new sequence $\{s_k\}$ will converge to r more rapidly than the original one. (For details, see Kincaid and Cheney [2002].) Because subtractive cancellation may eventually spoil the results, the Aitken acceleration process should be stopped soon after the values become apparently stationary.

EXAMPLE 1
Use the modified power method algorithm and Aitken acceleration to find the dominant eigenvalue and an eigenvector of the given matrix A, with vector $x^{(0)}$ and φ(x) given as follows:
$$A = \begin{bmatrix} 6 & 5 & -5 \\ 2 & 6 & -2 \\ 2 & 5 & -1 \end{bmatrix}, \qquad x^{(0)} = \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}, \qquad \varphi(x) = x_2$$
Solution
After coding and running the modified power method algorithm with Aitken acceleration, we obtain the following results:
x(0)  = [−1.0000,  1.0000,  1.0000]T
x(1)  = [−1.0000,  0.3333,  0.3333]T    r0  =  2.0000
x(2)  = [−1.0000, −0.1111, −0.1111]T    r1  = −2.0000
x(3)  = [−1.0000, −0.4074, −0.4074]T    r2  = 22.0000
x(4)  = [−1.0000, −0.6049, −0.6049]T    r3  =  8.9091    s3  = 13.5294
x(5)  = [−1.0000, −0.7366, −0.7366]T    r4  =  7.3061    s4  =  7.0825
x(6)  = [−1.0000, −0.8244, −0.8244]T    r5  =  6.7151    s5  =  6.3699
  ⋮
x(14) = [−1.0000, −0.9931, −0.9931]T    r13 =  6.0208    s13 =  6.0005
The Aitken-accelerated sequence, $s_k$, converges noticeably faster than the sequence $\{r_k\}$. The actual dominant eigenvalue and an associated eigenvector are
$$\lambda_1 = 6, \qquad u^{(1)} = [1, 1, 1]^T$$ ■
The coding of the modified power method is very simple, and we leave the actual implementation as an exercise. We also use the simple infinity-norm for normalizing the vectors. The final vectors and estimates of the eigenvalue are displayed with 15 decimal digits.
In such a problem, one should always seek an independent verification of the purported answer. Here, we simply compute Ax to see whether it coincides with s14x. The last few commands in the code are doing this rough checking, taking s14 as probably the best estimate of the eigenvalue and the last x-vector as the best estimate of an eigenvector. The results after 14 steps are not very accurate. For better accuracy, take 80 steps!
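For readers who want to reproduce the numbers of Example 1, here is a minimal Python sketch of the modified power method combined with Aitken acceleration. The function name `aitken` and the zero-based list indexing (so that `ratios[k]` plays the role of $r_k$) are assumptions of this sketch; the accelerated values start at k = 3, following the formula in the text.

```python
def aitken(r):
    """Aitken-accelerated values s_k (k >= 3) from the list of ratios r."""
    s = {}
    for k in range(3, len(r)):
        denom = r[k] - 2.0 * r[k - 1] + r[k - 2]
        if denom != 0.0:                       # skip a stationary sequence
            s[k] = r[k] - (r[k] - r[k - 1]) ** 2 / denom
    return s

A = [[6.0, 5.0, -5.0], [2.0, 6.0, -2.0], [2.0, 5.0, -1.0]]
x = [-1.0, 1.0, 1.0]
ratios = []
for k in range(14):
    y = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]   # y <- A x
    ratios.append(y[1] / x[1])                 # phi(x) = x_2 (index 1 here)
    norm = max(abs(yi) for yi in y)            # infinity norm
    x = [yi / norm for yi in y]

s = aitken(ratios)
for k in range(len(ratios)):
    print(k, round(ratios[k], 4), round(s[k], 4) if k in s else "")
```

The guard on `denom` reflects the warning about subtractive cancellation: once the ratios become stationary, the second difference in the denominator is dominated by rounding error.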
Inverse Power Method
It is possible to compute other eigenvalues of a matrix by using modifications of the power method. For example, if A is invertible, we can compute its eigenvalue of smallest magnitude by noting this logical equivalence:
$$Ax = \lambda x \iff x = A^{-1}(\lambda x) \iff A^{-1}x = \frac{1}{\lambda}x$$
Thus, the smallest eigenvalue of A in magnitude is the reciprocal of the largest eigenvalue of $A^{-1}$. We compute it by applying the power method to $A^{-1}$ and taking the reciprocal of the result.
Suppose that there is a single smallest eigenvalue of A. With our usual ordering, this will be $\lambda_n$:
$$|\lambda_1| \ge |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_{n-1}| > |\lambda_n| > 0$$
It follows that A is invertible. (Why?) The eigenvalues of $A^{-1}$ are $\lambda_j^{-1}$ for $1 \le j \le n$. Therefore, we have
$$|\lambda_n^{-1}| > |\lambda_{n-1}^{-1}| \ge \cdots \ge |\lambda_1^{-1}| > 0$$
We can use the power method on the matrix $A^{-1}$ to compute its dominant eigenvalue $\lambda_n^{-1}$. The reciprocal of this is the eigenvalue of A that we sought. Notice that we need not compute $A^{-1}$ because the equation
$$x^{(k+1)} = A^{-1}x^{(k)}$$
is equivalent to the equation
$$Ax^{(k+1)} = x^{(k)}$$
and the vector $x^{(k+1)}$ can be more easily computed by solving this last linear system. To do this, we first find the LU factorization of A, namely, A = LU. Then we repeatedly update the right-hand side and back solve:
$$Ux^{(k+1)} = L^{-1}x^{(k)}$$
to obtain $x^{(1)}, x^{(2)}, \ldots$.

EXAMPLE 2
Compute the smallest eigenvalue and an associated eigenvector of the following matrix:
$$A = \frac{1}{3} \begin{bmatrix} -155 & 528 & 407 \\ 55 & -144 & -121 \\ -132 & 396 & 318 \end{bmatrix}$$
using the following initial vector and linear function:
$$x^{(0)} = [1, 2, 3]^T, \qquad \varphi(x) = x_2$$
Solution
We decide to take the easy route and use the inverse of A for producing the successive x vectors. We leave the actual implementation as an exercise. The ratios $r_k$ are saved, and once it is complete, the Aitken accelerated values, $s_k$, are computed. Notice that at the end, we will want the reciprocal of the limiting ratio. Hence, it is easier to use reciprocals at every step in the code. Thus, you see $r_k = x_2/y_2$ rather than $y_2/x_2$, and these ratios should converge to the smallest eigenvalue of A. The final results after 80 steps are these:
x = [0.26726101285547, −0.53452256017715, 0.80178375118802]T
s80 = 3.33333333343344
We can divide each entry in x by the first component and arrive at
x = [1.0, −2.00000199979120, 3.00000266638827]T
The eigenvalue is actually $\frac{10}{3}$, and the eigenvector should be $[1, -2, 3]^T$. (Indeed, $A[1, -2, 3]^T = \frac{1}{3}[10, -20, 30]^T$.) The discrepancy between Ax and $s_{80}x$ is about $2.6 \times 10^{-6}$. ■

Software Examples: Inverse Power Method
Using mathematical software on a small example,
$$A = \begin{bmatrix} 6 & 5 & -5 \\ 2 & 6 & 2 \\ 2 & 5 & -1 \end{bmatrix} \tag{4}$$
we can first get A−1 and then use the power method. (We have changed one entry in the matrix A from Example 1 to solve a different problem.) We leave the implementation of the code as an exercise. In the code, r is the reciprocal of the quantity r in the original power method. Thus, at the end of the computation, r should be the eigenvalue of A that has the smallest absolute value. After the prescribed 30 steps, we find that r = 0.214 and x = [0.7916, 0.5137, 0.3308]T . As usual, we can verify the result independently by computing Ax and rx, which should be equal. The method just illustrated is called the inverse power method. On larger examples, the successive vectors should be computed not via A−1 but rather by solving the equation A y = x for y. In mathematical software systems such as Matlab, Maple, and Mathematica, this can be done with a single command. Alternatively, one can get the LU factorization of A and solve Lz = x and U y = z.
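The LU-based variant described above (factor once, then solve Lz = x and Uy = z at every step) can be sketched in plain Python as follows. Several things here are assumptions of the sketch rather than the text's own code: the helper names `lu_factor`, `lu_solve`, and `inverse_power`; the use of a Doolittle factorization without pivoting; the starting vector; and the test matrix, which is the matrix of Example 1 in Section 8.4 rather than matrix (4), chosen because its eigenvalues 6, 4, 1 can be checked by hand (A maps [1, 1, 1]ᵀ to [6, 6, 6]ᵀ, [0, 1, 1]ᵀ to [0, 4, 4]ᵀ, and [1, 0, 1]ᵀ to itself).

```python
def lu_factor(a):
    """Doolittle factorization A = LU without pivoting (assumes nonzero pivots)."""
    n = len(a)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in a]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def lu_solve(L, U, b):
    """Solve L z = b by forward substitution, then U y = z by back substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (z[i] - sum(U[i][j] * y[j] for j in range(i + 1, n))) / U[i][i]
    return y

def inverse_power(a, x, j, steps):
    """Inverse power method with phi(x) = x[j]; r estimates the smallest |eigenvalue|."""
    L, U = lu_factor(a)                   # factor once, reuse at every step
    r = None
    for _ in range(steps):
        y = lu_solve(L, U, x)             # solves A y = x, i.e. y = A^{-1} x
        r = x[j] / y[j]                   # reciprocal ratio, as in the text
        norm = max(abs(yi) for yi in y)
        x = [yi / norm for yi in y]
    return r, x

A = [[6.0, 5.0, -5.0], [2.0, 6.0, -2.0], [2.0, 5.0, -1.0]]   # eigenvalues 6, 4, 1
r, x = inverse_power(A, [2.0, 2.0, 3.0], j=0, steps=30)
print(r, x)
```

The first component is used for φ because the eigenvector belonging to the eigenvalue 1 is [1, 0, 1]ᵀ, on which the second component vanishes. A production code would use partial pivoting in the factorization.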
In this example, two eigenvalues are complex. Since the matrix is real, they must be conjugate pairs of the form α + βi and α − βi. They have the same magnitude; thus, the hypothesis |λ1| > |λ2| needed in the convergence proof of the power method is violated. What happens when the power method is applied to A? The values of r for k = 26 to 30 are 0.76, −53.27, 8.86, 2.69, and −9.42. We leave the implementation of the code as a computer problem.
Shifted (Inverse) Power Method
Other eigenvalues of a matrix (besides the largest and smallest) can be computed by exploiting the following logical equivalences:
$$Ax = \lambda x \iff (A - \mu I)x = (\lambda - \mu)x \iff (A - \mu I)^{-1}x = \frac{1}{\lambda - \mu}x$$
If we want to compute an eigenvalue of A that is close to a given number μ, we can apply the inverse power method to A − μ I and take the reciprocal of the limiting value of r . This should be λ − μ.
We can also compute an eigenvalue of A that is farthest from a given number μ. Suppose that for some eigenvalue λj of matrix A, we have
$$|\lambda_j - \mu| > \varepsilon \quad \text{and} \quad 0 < |\lambda_i - \mu| < \varepsilon \quad \text{for all } i \ne j$$
Consider the shifted matrix A − μI. Applying the power method to the shifted matrix A − μI, we compute ratios $r_k$ that converge to $\lambda_j - \mu$. This procedure is called the shifted power method.
If we want to compute the eigenvalue of A that is closest to a given number μ, a variant of the above procedure is needed. Suppose that $\lambda_j$ is an eigenvalue of A such that
$$0 < |\lambda_j - \mu| < \varepsilon \quad \text{and} \quad |\lambda_i - \mu| > \varepsilon \quad \text{for all } i \ne j$$
Consider the shifted matrix A − μI. The eigenvalues of this matrix are $\lambda_i - \mu$. Applying the inverse power method to A − μI gives an approximate value for $(\lambda_j - \mu)^{-1}$. We can use the explicit inverse of A − μI or the LU factorization A − μI = LU. Now we repeatedly solve the equations
$$(A - \mu I)x^{(k+1)} = x^{(k)}$$
by solving instead $Ux^{(k+1)} = L^{-1}x^{(k)}$. Since the ratios $r_k$ converge to $(\lambda_j - \mu)^{-1}$, we have
$$\lambda_j = \mu + \left[ \lim_{k \to \infty} r_k \right]^{-1} = \mu + \lim_{k \to \infty} \frac{1}{r_k}$$
This algorithm is called the shifted inverse power method.
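The shifted inverse power method can be sketched in Python along the following lines. The helper names `solve` and `shifted_inverse_power`, the use of Gaussian elimination with partial pivoting at every step (instead of a stored LU factorization), the shift μ = 3.5, and the starting vector are all assumptions made for illustration. The test matrix has eigenvalues 6, 4, and 1 (it maps [1, 1, 1]ᵀ, [0, 1, 1]ᵀ, and [1, 0, 1]ᵀ to 6, 4, and 1 times themselves), so the eigenvalue closest to 3.5 is 4.

```python
def solve(a, b):
    """Solve a y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]                       # bring the pivot row up
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (m[i][n] - sum(m[i][j] * y[j] for j in range(i + 1, n))) / m[i][i]
    return y

def shifted_inverse_power(a, mu, x, j, steps):
    """Estimate the eigenvalue of a closest to mu, using phi(x) = x[j]."""
    n = len(a)
    shifted = [[a[i][k] - (mu if i == k else 0.0) for k in range(n)]
               for i in range(n)]
    r = None
    for _ in range(steps):
        y = solve(shifted, x)              # (A - mu I) y = x
        r = y[j] / x[j]                    # r_k -> 1 / (lambda_j - mu)
        norm = max(abs(yi) for yi in y)
        x = [yi / norm for yi in y]
    return mu + 1.0 / r, x

A = [[6.0, 5.0, -5.0], [2.0, 6.0, -2.0], [2.0, 5.0, -1.0]]   # eigenvalues 6, 4, 1
lam, x = shifted_inverse_power(A, mu=3.5, x=[2.0, 2.0, 3.0], j=1, steps=25)
print(lam, x)
```

Here the ratio is formed as y_j/x_j, so its limit is $(\lambda_j - \mu)^{-1}$ and the eigenvalue is recovered as μ + 1/r. The closer μ is to the wanted eigenvalue relative to the others, the faster the iteration converges.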
Example: Shifted Inverse Power Method
To illustrate the shifted inverse power method, we consider the following matrix:
$$A = \begin{bmatrix} 1 & 3 & 7 \\ 2 & -4 & 5 \\ 3 & 4 & -6 \end{bmatrix} \tag{5}$$
and use mathematical software to compute the eigenvalue closest to −6. The code we use takes ratios of y2/x2, and we are therefore expecting convergence of these ratios to λ + 6. After eight steps, we have r = 0.9590 and x = [−0.7081, 0.6145, 0.3478]T . Hence, the eigenvalue should be λ = 0.9590 − 6 = −5.0410. We can ask Matlab to confirm the eigenvalue and eigenvector by computing both Ax and λx to be approximately [3.57, −3.10, −1.75]T .
Summary
(1) We have considered the following methods for computing eigenvalues of a matrix. In the power method, we approximate the largest eigenvalue λ1 by generating a sequence of points using the formula
x(k+1) = Ax(k)
and then forming a sequence rk = φ(x(k+1))/φ(x(k)), where φ is a linear functional. Under the right circumstances, this sequence, rk , will converge to the largest eigenvalue of A.
(2) In the inverse power method, we find the smallest eigenvalue λn by using the preceding process on the inverse of the matrix. The reciprocal of the largest eigenvalue of A−1 is the smallest eigenvalue of A. We can also describe this process as one of computing the sequence so that
Ax(k+1) = x(k)
(3) In the shifted power method, we find the eigenvalue that is farthest from a given number μ by seeking the largest eigenvalue of A − μI. This involves an iteration to produce a sequence
x(k+1) =(A−μI)x(k)
(4) In the shifted inverse power method, we find the eigenvalue that is closest to μ by applying the inverse power method to $A - \mu I$. This requires solving the equation
$$(A - \mu I)x^{(k+1)} = x^{(k)} \qquad (A - \mu I = LU)$$
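Items (1) and (3) of the summary can be sketched as follows. This is a minimal illustration, not the text's own code: the names `power_method` and `shifted_power_method` are hypothetical, the iterate is normalized in the Euclidean norm at every step, and the ratio uses the assumed functional $\varphi(v) = x^{(k)T}v$. The matrix is the one from Problem 4, and the shift $\mu = 4$ matches Problem 5.

```python
import math

def power_method(A, x0, steps=100):
    """Return (r, x): r approximates the dominant eigenvalue of A."""
    n = len(A)
    x = list(x0)
    r = 0.0
    for _ in range(steps):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = A x
        # Assumed functional: phi(v) = x^T v, so r = phi(y) / phi(x).
        r = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return r, x

def shifted_power_method(A, mu, x0, steps=100):
    """Return the eigenvalue of A farthest from mu, via the power method on A - mu I."""
    n = len(A)
    shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    r, x = power_method(shifted, x0, steps)
    return r + mu, x  # r approximates lambda - mu

T = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
dominant, _ = power_method(T, [1.0, 1.0, 1.0])
farthest, _ = shifted_power_method(T, 4.0, [1.0, 1.0, 1.0])
print(dominant, farthest)  # close to 2 + sqrt(2) and 2 - sqrt(2)
```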
Additional References
For supplemental reading and study, see Anderson, Bai, Bischof, Blackford, Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, and McKenney [1999]; Axelsson [1994]; Bai, Demmel, Dongarra, Ruhe, and van der Vorst [2000]; Barrett, Berry, Chan, Demmel, Donato, Dongarra, Eijkhout, Pozo, Romine, and van der Vorst [1994]; Davis [2006]; Dekker and Hoffmann [1989]; Dekker, Hoffmann, and Potma [1997]; Demmel [1997]; Dongarra et al. [1990]; Elman, Silvester, and Wathen [2004]; Fox [1967]; Gautschi [1997]; Greenbaum [1997]; Hageman and Young [1981]; Heroux, Raghavan, and Simon [2006]; Jennings [1977]; Kincaid and Young [1979, 2000]; Lynch [2004]; Meurant [2006]; Noble and Daniel [1988]; Ortega [1990b]; Parlett [2000]; Saad [2003]; Shewchuk [1994]; Southwell [1946]; Stewart [1973]; Trefethen and Bau [1997]; Van der Vorst [2003]; Watkins [1991]; Wilkinson [1988]; and Young [1971].
Problems 8.4

ᵃ1. Let $A = \begin{bmatrix} 5 & 2 \\ 4 & 7 \end{bmatrix}$. The power method has been applied to the matrix A. The result is a long list of vectors that seem to settle down to a vector of the form $[h, 1]^T$, where |h| < 1. What is the largest eigenvalue, approximately, in terms of that number h?
a. 4h + 7   b. 5h + 2   c. 1/h   d. 5h + 4   e. None of these.
2. What is the expected final output of the following pseudocode?
integer n, kmax; real r
real array (A^{-1})_{1:n×1:n}, (x)_{1:n}, (y)_{1:n}
for k = 1 to 30 do
    y ← A^{-1} x
    r ← y_1/x_1    (first components of y and x)
    x ← y/||y||
    output r, x
end do
a. r is the eigenvalue of A largest in magnitude, and x is an accompanying eigenvector.
b. r = 1/λ, where λ is the smallest eigenvalue of A, and x is such that Ax = λx.
c. A vector x such that Ax = rx, where r is the eigenvalue of A having the smallest magnitude.
d. r is the largest (in magnitude) eigenvalue of A and x is a corresponding eigenvector of A.
e. None of these.
3. Briefly describe how to compute the following:
a. The dominant eigenvalue and associated eigenvector.
b. The next dominant eigenvalue and associated eigenvector.
c. The least dominant eigenvalue and associated eigenvector.
d. An eigenvalue other than the dominant or least dominant eigenvalue and associated eigenvectors.
4. Let $A = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}$. Carry out several iterations of the power method, starting with $x^{(0)} = (1, 1, 1)$. What is the purpose of this procedure?
5. Let $B = A - 4I = \begin{bmatrix} -2 & -1 & 0 \\ -1 & -2 & -1 \\ 0 & -1 & -2 \end{bmatrix}$. Carry out some iterations of the power method applied to B, starting with $x^{(0)} = (1, 1, 1)$. What is the purpose of this procedure?
6. Let $C = A^{-1} = \frac{1}{4}\begin{bmatrix} 3 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}$. Carry out a few iterations of the power method applied to C, starting with $x^{(0)} = (1, 1, 1)$. What is the purpose of this procedure?
7. The Rayleigh quotient is the expression $\langle x, x\rangle_A / \langle x, x\rangle = x^T A x / x^T x$. How can the Rayleigh quotient be used when Ax = λx?

Computer Problems 8.4

1. Use the power method, the inverse power method, and their shifted forms as well as Aitken’s acceleration to find some or all of the eigenvalues of the following matrices:
a. $\begin{bmatrix} 5 & 4 & 1 & 1 \\ 4 & 5 & 1 & 1 \\ 1 & 1 & 4 & 2 \\ 1 & 1 & 2 & 4 \end{bmatrix}$
b. $\begin{bmatrix} 2 & 3 & 4 \\ 7 & -1 & 3 \\ 1 & -1 & 5 \end{bmatrix}$
c. $\begin{bmatrix} -2 & 1 & 0 & 0 & 0 \\ 1 & -2 & 1 & 0 & 0 \\ 0 & 1 & -2 & 1 & 0 \\ 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 1 & -2 \end{bmatrix}$
2. Redo the examples in this section, using either Matlab, Maple, or Mathematica.
3. Modify and test the pseudocode for the power method to normalize the vector so that the largest component is always 1 in the infinity-norm. This procedure gives the eigenvector and eigenvalue without having to compute a linear functional.
4. Find the eigenvalues of the matrix
$$A = \begin{bmatrix} -57 & 192 & 148 \\ 20 & -53 & -44 \\ -48 & 144 & 115 \end{bmatrix}$$
that are close to −4, 2, and 8 by using the inverse power method.
5. Using mathematical software such as Matlab, Maple, or Mathematica, write and execute code for implementing the methods in Section 8.4. Verify that the results are consistent with those described in the text.
a. Example 1 using the modified power method.
b. Example 2 using the inverse power method with Aitken acceleration.
c. Matrix (4) using the inverse power method.
d. Matrix (5) using the shifted power method.
6. Consider the matrix $A = \begin{bmatrix} 1 & 1 & \frac{1}{2} \\ 1 & 1 & \frac{1}{4} \\ \frac{1}{2} & \frac{1}{4} & 2 \end{bmatrix}$
a. Use the normalized power method starting with $x^{(0)} = [1, 1, 1]^T$, and find the dominant eigenvalue and eigenvector of the matrix A.
b. Repeat, starting with the initial value $x^{(0)} = [-0.64966116, 0.74822116, 0]^T$. Explain the results. See Ralston [1965, pp. 475–476].
7. Let $A = \begin{bmatrix} -4 & 14 & 0 \\ -5 & 13 & 0 \\ -1 & 0 & 2 \end{bmatrix}$. Code and apply each of the following:
a. The modified power algorithm starting with $x^{(0)} = [1, 1, 1]^T$ as well as Aitken’s acceleration process.
b. The inverse power algorithm.
c. The shifted power algorithm.
d. The shifted inverse power algorithm.
8. (Continuation) Let $B = \begin{bmatrix} 4 & -1 & 1 \\ -1 & 3 & -2 \\ 1 & -2 & 3 \end{bmatrix}$. Repeat the previous problem starting with $x^{(0)} = [1, 0, 0]^T$.
9. (Continuation) Let $C = \begin{bmatrix} -8 & -5 & 8 \\ 6 & 3 & -8 \\ -3 & 1 & 9 \end{bmatrix}$. Use $x^{(0)} = [1, 1, 1]^T$. Repeat the previous problem starting with $x^{(0)} = [1, 0, 0]^T$.
10. By means of the power method, find an eigenvalue and associated eigenvector of these matrices from the historical books by Fox [1957] and Wilkinson [1965]. Verify your results by using mathematical software such as Matlab, Maple, or Mathematica.
a. $\begin{bmatrix} 0.9901 & 0.002 \\ -0.0001 & 0.9904 \end{bmatrix}$ starting with $x^{(0)} = [1, 0.9]^T$
b. $\begin{bmatrix} 8 & -1 & -5 \\ -4 & 4 & -2 \\ 18 & -5 & -7 \end{bmatrix}$ starting with $x^{(0)} = [1, 0.8, 1]^T$
c. $\begin{bmatrix} 1 & 1 & 3 \\ 1 & -2 & 1 \\ 3 & 1 & 3 \end{bmatrix}$ starting with $x^{(0)} = [1, 1, 1]^T$
d. $\begin{bmatrix} -2 & -1 & 4 \\ 2 & 1 & -2 \\ -1 & -1 & 3 \end{bmatrix}$ starting with $x^{(0)} = [3, 1, 2]^T$ without normalization and with normalization
11. Find all of the eigenvalues and associated eigenvectors of these matrices from Fox [1957] and Wilkinson [1965] by means of the power method and variations of it. Verify your results by using mathematical software such as Matlab, Maple, or Mathematica.
a. $\begin{bmatrix} 2 & 1 \\ 4 & 2 \end{bmatrix}$
b. $\begin{bmatrix} 0.4812 & 0.0023 \\ -0.0024 & 0.4810 \end{bmatrix}$
c. $\begin{bmatrix} 1 & 1 & 0 \\ -1 + 10^{-8} & 3 & 0 \\ 0 & 1 & 1 \end{bmatrix}$
d. $\begin{bmatrix} 5 & -1 & -2 \\ -1 & 3 & -2 \\ -2 & -2 & 5 \end{bmatrix}$
e. $\begin{bmatrix} 0.987 & 0.400 & -0.487 \\ -0.079 & 0.500 & -0.479 \\ 0.082 & 0.400 & 0.418 \end{bmatrix}$
Approximation by Spline Functions
FIGURE 9.1
By experimentation in a wind tunnel, an airfoil is constructed by trial and error so that it has certain desired characteristics. The cross section of the airfoil is then drawn as a curve on coordinate paper (see Figure 9.1). To study this airfoil by analytical methods or to manufacture it, it is essential to have a formula for this curve. To arrive at such a formula, one first obtains the coordinates of a finite set of points on the curve. Then a smooth curve called a cubic interpolating spline can be constructed to match these data points. This chapter discusses general polynomial spline functions and how they can be used in various numerical problems such as the data-fitting problem just described.
y
Airfoil cross
section x
9.1 First-Degree and Second-Degree Splines
The history of spline functions is rooted in the work of draftsmen, who often needed to draw a gently turning curve between points on a drawing. This process is called fairing and can be accomplished with a number of ad hoc devices, such as the French curve, made of plastic and presenting a number of curves of different curvature for the draftsman to select. Long strips of wood were also used, being made to pass through the control points by weights laid on the draftsman’s table and attached to the strips. The weights were called ducks and the strips of wood were called splines, even as early as 1891. The elastic nature of the wooden strips allowed them to bend only a little while still passing through the prescribed points. The wood was, in effect, solving a differential equation and minimizing the strain energy. The latter is known to be a simple function of the curvature. The mathematical theory of these curves owes much to the early investigators, particularly Isaac Schoenberg in the 1940s and 1950s. Other important names associated with the early development of the subject (i.e., prior to 1964) are Garrett Birkhoff, C. de Boor, J. H. Ahlberg, E. N. Nilson, H. Garabedian, R. S. Johnson, F. Landis, A. Whitney, J. L. Walsh, and J. C. Holladay. The first book giving a systematic exposition of spline theory was the book by Ahlberg, Nilson, and Walsh [1967].
FIGURE 9.2 First-degree spline function
First-Degree Spline
A spline function is a function that consists of polynomial pieces joined together with certain smoothness conditions. A simple example is the polygonal function (or spline of degree 1), whose pieces are linear polynomials joined together to achieve continuity, as in Figure 9.2. The points $t_0, t_1, \ldots, t_n$ at which the function changes its character are termed knots in the theory of splines. Thus, the spline function shown in Figure 9.2 has eight knots, $a = t_0, t_1, \ldots, t_7 = b$, and seven pieces, $S_0, S_1, \ldots, S_6$.
Such a function appears somewhat complicated when defined in explicit terms. We are forced to write
$$S(x) = \begin{cases} S_0(x) & x \in [t_0, t_1] \\ S_1(x) & x \in [t_1, t_2] \\ \quad\vdots & \quad\vdots \\ S_{n-1}(x) & x \in [t_{n-1}, t_n] \end{cases} \tag{1}$$
where
$$S_i(x) = a_i x + b_i \tag{2}$$
because each piece of S(x) is a linear polynomial. Such a function S(x) is piecewise linear. If the knots $t_0, t_1, \ldots, t_n$ were given and if the coefficients $a_0, b_0, a_1, b_1, \ldots, a_{n-1}, b_{n-1}$ were all known, then the evaluation of S(x) at a specific x would proceed by first determining the interval that contains x and then using the appropriate linear function for that interval.
If the function S defined by Equation (1) is continuous, we call it a first-degree spline. It is characterized by the following three properties.
SPLINE OF DEGREE 1
A function S is called a spline of degree 1 if:
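1. The domain of S is an interval [a, b].
2. S is continuous on [a, b].
3. There is a partitioning of the interval $a = t_0 < t_1 < \cdots < t_n = b$ such that S is a linear polynomial on each subinterval $[t_i, t_{i+1}]$.
Continuity of a function f at a point s can be defined by the condition
$$\lim_{x \to s^+} f(x) = \lim_{x \to s^-} f(x) = f(s)$$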
Here, limx→s+ means that the limit is taken over x values that converge to s from above s; that is, (x − s) is positive for all x values. Similarly, limx→s− means that the x values converge to s from below.
Determine whether this function is a first-degree spline function:
$$S(x) = \begin{cases} x & x \in [-1, 0] \\ 1 - x & x \in (0, 1) \\ 2x - 2 & x \in [1, 2] \end{cases}$$
The function is obviously piecewise linear but is not a spline of degree 1 because it is discontinuous at x = 0. Notice that limx→0+ S(x) = limx→0(1 − x) = 1, whereas limx→0− S(x) = limx→0 x = 0. ■
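The one-sided limits in this example can be checked numerically. The short sketch below is illustrative only: the function name `S` mirrors the example, while `one_sided_values` and the offset `eps` are hypothetical choices that approximate the limits by sampling very close to the point.

```python
def S(x):
    """The piecewise linear function from the example above."""
    if -1 <= x <= 0:
        return x
    if 0 < x < 1:
        return 1 - x
    if 1 <= x <= 2:
        return 2 * x - 2
    raise ValueError("x is outside [-1, 2]")

def one_sided_values(f, s, eps=1e-9):
    """Approximate the left and right limits of f at s by sampling near s."""
    return f(s - eps), f(s + eps)

left0, right0 = one_sided_values(S, 0.0)
left1, right1 = one_sided_values(S, 1.0)
print(left0, right0)  # the two values differ, so S jumps at x = 0
print(left1, right1)  # the two values agree, so S is continuous at x = 1
```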
The spline functions of degree 1 can be used for interpolation. Suppose the following table of function values is given:
x t0 t1 ··· tn
y y0 y1 ··· yn
There is no loss of generality in supposing that t0 < t1 < · · · < tn because this is only a matter of labeling the knots.
The table can be represented by a set of n + 1 points in the plane, (t0, y0), (t1, y1), . . . , (tn , yn ), and these points have distinct abscissas. Therefore, we can draw a polygonal line through the points without ever drawing a vertical segment. This polygonal line is the graph of a function, and this function is obviously a spline of degree 1. What are the equations of the individual line segments that make up this graph?
By referring to Figure 9.3 and using the point-slope form of a line, we obtain
$$S_i(x) = y_i + m_i(x - t_i) \tag{3}$$
on the interval $[t_i, t_{i+1}]$, where $m_i$ is the slope of the line and is therefore given by the formula
$$m_i = \frac{y_{i+1} - y_i}{t_{i+1} - t_i}$$
FIGURE 9.3 First-degree spline: linear $S_i(x)$ joining $(t_i, y_i)$ and $(t_{i+1}, y_{i+1})$
Notice that the function S that we are creating has 2n parameters in it: the n coefficients ai and the n constants bi in Equation (2). On the other hand, exactly 2n conditions are being imposed, since each constituent function Si must interpolate the data at the ends of its subinterval. Thus, the number of parameters equals the number of conditions. For the higher-degree splines, we shall encounter a mismatch in these two numbers; the spline of degree k will have k − 1 free parameters for us to use as we wish in the problem of interpolating at the knots.
The form of Equation (3) is better than that of Equation (2) for the practical evaluation of S(x) because some of the quantities $x - t_i$ must be computed in any case simply to determine which subinterval contains x. If $t_0 \le x \le t_n$, then the interval $[t_i, t_{i+1}]$ containing x is characterized by the fact that $x - t_i$ is the first of the quantities $x - t_{n-1}, x - t_{n-2}, \ldots, x - t_0$ that is nonnegative.
The following is a function procedure that utilizes n + 1 table values $(t_i, y_i)$ in linear arrays $(t_i)$ and $(y_i)$, assuming that $a = t_0 < t_1 < \cdots < t_n = b$.
real function Spline1(n, (t_i), (y_i), x)
integer i, n; real x; real array (t_i)_{0:n}, (y_i)_{0:n}
for i = n − 1 to 0 step −1 do
    if x − t_i ≥ 0 then exit loop
end for
Spline1 ← y_i + (x − t_i)[(y_{i+1} − y_i)/(t_{i+1} − t_i)]
end function Spline1
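For readers who prefer runnable code, the procedure Spline1 might be transcribed into Python roughly as follows. This is a sketch under assumptions: the name `spline1` and the list-based interface are illustrative, the knots are assumed strictly increasing with at least two entries, and an x to the left of $t_0$ is handled by extending the first segment.

```python
def spline1(t, y, x):
    """Evaluate the first-degree spline through (t[i], y[i]) at x."""
    n = len(t) - 1
    i = 0
    # Scan from the right for the first i with x - t[i] >= 0;
    # if none is found, the loop ends with i = 0.
    for i in range(n - 1, -1, -1):
        if x - t[i] >= 0:
            break
    slope = (y[i + 1] - y[i]) / (t[i + 1] - t[i])
    return y[i] + (x - t[i]) * slope

t = [0.0, 1.0, 3.0]
y = [0.0, 2.0, 0.0]
print(spline1(t, y, 0.5), spline1(t, y, 2.0))  # 1.0 1.0
```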
Modulus of Continuity
To assess the goodness of fit when we interpolate a function with a first-degree spline, it is useful to have something called the modulus of continuity of a function f . Suppose f is defined on an interval [a, b]. The modulus of continuity of f is
$$\omega(f; h) = \sup\{|f(u) - f(v)| : a \le u \le v \le b,\ |u - v| \le h\}$$
Here, sup is the supremum, which is the least upper bound of the given set of real numbers. The quantity ω(f; h) measures how much f can change over a small interval of width h. If f is continuous on [a, b], then it is uniformly continuous, and ω(f; h) will tend to zero as h tends to zero. If f is not continuous, ω(f; h) will not tend to zero. If f is differentiable on (a, b) (in addition to being continuous on [a, b]) and if f′(x) is bounded on (a, b), then the Mean Value Theorem can be used to get an estimate of the modulus of continuity: If u and v are as described in the definition of ω(f; h), then
$$|f(u) - f(v)| = |f'(c)(u - v)| \le M_1 |u - v| \le M_1 h$$
Here, $M_1$ denotes the maximum of |f′(x)| as x runs over (a, b). For example, if $f(x) = x^3$ and [a, b] = [1, 4], then we find that $\omega(f; h) \le 48h$.
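The bound $\omega(f; h) \le 48h$ for $f(x) = x^3$ on [1, 4] can be checked by brute force. The sketch below is only a grid-based estimate: the function name `modulus_estimate` and the sample count are assumptions, and since only grid points are compared, the result is a lower estimate of the true supremum.

```python
def modulus_estimate(f, a, b, h, samples=300):
    """Grid estimate of the modulus of continuity omega(f; h) on [a, b]."""
    step = (b - a) / samples
    reach = int(h / step + 1e-9)  # number of grid steps that fit in width h
    pts = [a + k * step for k in range(samples + 1)]
    best = 0.0
    for k in range(samples + 1):
        for m in range(k, min(k + reach, samples) + 1):
            best = max(best, abs(f(pts[k]) - f(pts[m])))
    return best

h = 0.1
est = modulus_estimate(lambda x: x ** 3, 1.0, 4.0, h)
print(est, 48 * h)  # the estimate stays below the bound 48h
```

For this f the largest change occurs at the right end of the interval, where $4^3 - 3.9^3 = 4.681$, which is indeed below $48h = 4.8$.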
THEOREM 1
FIRST-DEGREE POLYNOMIAL ACCURACY THEOREM
If p is the first-degree polynomial that interpolates a function f at the endpoints of an interval [a, b], then with h = b − a, we have
$$|f(x) - p(x)| \le \omega(f; h) \qquad (a \le x \le b)$$
Proof
The linear function p is given explicitly by the formula
$$p(x) = \frac{x - a}{b - a} f(b) + \frac{b - x}{b - a} f(a)$$
Hence,
$$f(x) - p(x) = \frac{x - a}{b - a}[f(x) - f(b)] + \frac{b - x}{b - a}[f(x) - f(a)]$$
Then we have
$$\begin{aligned} |f(x) - p(x)| &\le \frac{x - a}{b - a}|f(x) - f(b)| + \frac{b - x}{b - a}|f(x) - f(a)| \\ &\le \frac{x - a}{b - a}\,\omega(f; h) + \frac{b - x}{b - a}\,\omega(f; h) \\ &= \left[\frac{x - a}{b - a} + \frac{b - x}{b - a}\right]\omega(f; h) = \omega(f; h) \end{aligned}$$
■
FIRST-DEGREE SPLINE ACCURACY THEOREM
Let p be a first-degree spline having knots a = x0 < x1 <···
$$u_i = 2(h_i + h_{i-1}) - \frac{h_{i-1}^2}{u_{i-1}} > 2(h_i + h_{i-1}) - h_{i-1} > h_i$$
Then by induction, ui > 0 for i = 1,2,…,n − 1.
Equation (5) is not the best computational form for evaluating the cubic polynomial $S_i(x)$. We would prefer to have it in the form
$$S_i(x) = A_i + B_i(x - t_i) + C_i(x - t_i)^2 + D_i(x - t_i)^3 \tag{11}$$
because nested multiplication can then be utilized.
Notice that Equation (11) is the Taylor expansion of $S_i$ about the point $t_i$. Hence,
$$A_i = S_i(t_i), \quad B_i = S_i'(t_i), \quad C_i = \tfrac{1}{2}S_i''(t_i), \quad D_i = \tfrac{1}{6}S_i'''(t_i)$$
Therefore, $A_i = y_i$ and $C_i = z_i/2$. The coefficient of $x^3$ in Equation (11) is $D_i$, whereas the coefficient of $x^3$ in Equation (5) is $(z_{i+1} - z_i)/6h_i$. Therefore,
$$D_i = \frac{1}{6h_i}(z_{i+1} - z_i)$$
Finally, Equation (6) provides the value of $S_i'(t_i)$, which is
$$B_i = -\frac{h_i}{6}z_{i+1} - \frac{h_i}{3}z_i + \frac{1}{h_i}(y_{i+1} - y_i)$$
Thus, the nested form of $S_i(x)$ is
$$S_i(x) = y_i + (x - t_i)\left(B_i + (x - t_i)\left(\frac{z_i}{2} + \frac{1}{6h_i}(x - t_i)(z_{i+1} - z_i)\right)\right) \tag{12}$$
We now write routines for determining a natural cubic spline based on a table of values and for evaluating this function at a given value. First, we use Algorithm 1 for directly solving the tridiagonal System (10). This procedure, called Spline3 Coef , takes n + 1 table values (ti , yi ) in arrays (ti ) and (yi ) and computes the zi ’s, storing them in array (zi ). Intermediate (working) arrays (hi ), (bi ), (ui ), and (vi ) are needed.
Pseudocode for Natural Cubic Splines
procedure Spline3 Coef(n, (t_i), (y_i), (z_i))
integer i, n; real array (t_i)_{0:n}, (y_i)_{0:n}, (z_i)_{0:n}
allocate real array (h_i)_{0:n−1}, (b_i)_{0:n−1}, (u_i)_{1:n−1}, (v_i)_{1:n−1}
for i = 0 to n − 1 do
    h_i ← t_{i+1} − t_i
    b_i ← (y_{i+1} − y_i)/h_i
end for
u_1 ← 2(h_0 + h_1)
v_1 ← 6(b_1 − b_0)
for i = 2 to n − 1 do
    u_i ← 2(h_i + h_{i−1}) − h_{i−1}^2/u_{i−1}
    v_i ← 6(b_i − b_{i−1}) − h_{i−1}v_{i−1}/u_{i−1}
end for
z_n ← 0
for i = n − 1 to 1 step −1 do
    z_i ← (v_i − h_i z_{i+1})/u_i
end for
z_0 ← 0
deallocate array (h_i), (b_i), (u_i), (v_i)
end procedure Spline3 Coef
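A Python transcription of the procedure above might look like the following. It is a sketch, not the text's official code: the name `spline3_coef` and the choice to return the array z (instead of filling an argument) are illustrative assumptions, and the knots are assumed strictly increasing.

```python
def spline3_coef(t, y):
    """Return z[0..n], the second derivatives of the natural cubic spline at the knots."""
    n = len(t) - 1
    h = [t[i + 1] - t[i] for i in range(n)]
    b = [(y[i + 1] - y[i]) / h[i] for i in range(n)]
    u = [0.0] * n  # only u[1..n-1] are used
    v = [0.0] * n  # only v[1..n-1] are used
    z = [0.0] * (n + 1)  # natural end conditions: z[0] = z[n] = 0
    if n >= 2:
        u[1] = 2 * (h[0] + h[1])
        v[1] = 6 * (b[1] - b[0])
    # Forward elimination on the tridiagonal system.
    for i in range(2, n):
        u[i] = 2 * (h[i] + h[i - 1]) - h[i - 1] ** 2 / u[i - 1]
        v[i] = 6 * (b[i] - b[i - 1]) - h[i - 1] * v[i - 1] / u[i - 1]
    # Back substitution.
    for i in range(n - 1, 0, -1):
        z[i] = (v[i] - h[i] * z[i + 1]) / u[i]
    return z

print(spline3_coef([1, 2, 3, 4, 5], [0, 1, 0, 1, 0]))
```

For the table x = 1, …, 5 and y = 0, 1, 0, 1, 0, a hand elimination gives z = (0, −30/7, 36/7, −30/7, 0), and data lying on a straight line give z = 0 throughout.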
Now a procedure called Spline3 Eval is written for evaluating Equation (12), the natural cubic spline function S(x), for x a given value. The procedure Spline3 Eval first determines the interval [ti , ti+1] that contains x and then evaluates Si (x) using the nested form of this cubic polynomial:
real function Spline3 Eval(n, (t_i), (y_i), (z_i), x)
integer i; real h, tmp
real array (t_i)_{0:n}, (y_i)_{0:n}, (z_i)_{0:n}
for i = n − 1 to 0 step −1 do
    if x − t_i ≥ 0 then exit loop
end for
h ← t_{i+1} − t_i
tmp ← (z_i/2) + (x − t_i)(z_{i+1} − z_i)/(6h)
tmp ← −(h/6)(z_{i+1} + 2z_i) + (y_{i+1} − y_i)/h + (x − t_i)(tmp)
Spline3 Eval ← y_i + (x − t_i)(tmp)
end function Spline3 Eval
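The evaluation routine can be transcribed in the same spirit. In this sketch the name `spline3_eval` is an illustrative assumption, and the array z is taken as given; the values used below are the ones obtained by solving the tridiagonal system by hand for the table x = 1, …, 5, y = 0, 1, 0, 1, 0.

```python
def spline3_eval(t, y, z, x):
    """Evaluate the natural cubic spline in the nested form of Equation (12)."""
    n = len(t) - 1
    i = 0
    # Scan from the right for the first i with x - t[i] >= 0.
    for i in range(n - 1, -1, -1):
        if x - t[i] >= 0:
            break
    h = t[i + 1] - t[i]
    tmp = z[i] / 2 + (x - t[i]) * (z[i + 1] - z[i]) / (6 * h)
    tmp = -(h / 6) * (z[i + 1] + 2 * z[i]) + (y[i + 1] - y[i]) / h + (x - t[i]) * tmp
    return y[i] + (x - t[i]) * tmp

t = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 0.0, 1.0, 0.0]
z = [0.0, -30 / 7, 36 / 7, -30 / 7, 0.0]  # second derivatives at the knots
print(spline3_eval(t, y, z, 1.5), spline3_eval(t, y, z, 4.5))
```

Because the data are symmetric about x = 3, the two printed values agree; both equal 43/56.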
The function Spline3 Eval can be used repeatedly with different values of x after one call to procedure Spline3 Coef. For example, this would be the procedure when plotting a natural cubic spline curve. Since procedure Spline3 Coef stores the solution of the tridiagonal system corresponding to a particular spline function in the array $(z_i)$, the arguments n, $(t_i)$, $(y_i)$, and $(z_i)$ must not be altered between repeated uses of Spline3 Eval.
Using Pseudocode for Interpolating and Curve Fitting
To illustrate the use of the natural cubic spline routines Spline3 Coef and Spline3 Eval, we rework an example from Section 4.1.
EXAMPLE 3
Write pseudocode for a program that determines the natural cubic spline interpolant for sin x at ten equidistant knots in the interval [0, 1.6875]. Over the same interval, subdivide each subinterval into four equally spaced parts, and find the point where the value of |sin x − S(x)| is largest.
Solution
Here is a suitable pseudocode main program, which calls procedures Spline3 Coef and Spline3 Eval:
procedure Test Spline3
integer i, j; real e, h, temp, x
real array (t_i)_{0:n}, (y_i)_{0:n}, (z_i)_{0:n}
integer n ← 9
real a ← 0, b ← 1.6875
h ← (b − a)/n
for i = 0 to n do
    t_i ← a + ih
    y_i ← sin(t_i)
end for
call Spline3 Coef(n, (t_i), (y_i), (z_i))
temp ← 0
for j = 0 to 4n do
    x ← a + jh/4
    e ← |sin(x) − Spline3 Eval(n, (t_i), (y_i), (z_i), x)|
    if e > temp then temp ← e
    output j, x, e
end for
end Test Spline3
From the computer, the output is j = 19, x = 0.890625, and d = 0.930 × 10^{−5}. ■
We can use mathematical software such as in Matlab to plot the cubic spline curve for this data, but the Matlab routine spline uses the not-a-knot end condition, which is different from the natural end condition. It dictates that S′′′ be a single constant in the first two subintervals and another single constant in the last two subintervals. First, the original data are generated. Next, a finer subdivision of the interval [a, b] on the x-axis is made, and the corresponding y-values are obtained from the procedure spline. Finally, the original data points and the spline curve are plotted.
We now illustrate the use of spline functions in fitting a curve to a set of data. Consider the following table:
x    0.0    0.6    1.5    1.7    1.9    2.1    2.3    2.6    2.8    3.0
y   −0.8   −0.34   0.59   0.59   0.23   0.1    0.28   1.03   1.5    1.44

x    3.6    4.7    5.2    5.7    5.8    6.0    6.4    6.9    7.6    8.0
y    0.74  −0.82  −1.27  −0.92  −0.92  −1.04  −0.79  −0.06   1.0    0.0
These 20 points were selected from a wiggly freehand curve drawn on graph paper. We intentionally selected more points where the curve bent sharply and sought to reproduce the curve using an automatic plotter. A visually pleasing curve is provided by using the cubic spline routines Spline3 Coef and Spline3 Eval. Figure 9.8 shows the resulting natural cubic spline curve.
FIGURE 9.8 Natural cubic spline curve, y = S(x)
Alternatively, we can use mathematical software such as Matlab, Maple, or Mathematica to plot the cubic spline function for this table.
Space Curves
In two dimensions, two cubic spline functions can be used together to form a parametric representation of a complicated curve that turns and twists. Select points on the curve and label them t = 0, 1, …, n. For each value of t, read off the x- and y-coordinates of the point, thus producing a table:
t   0    1    ···   n
x   x0   x1   ···   xn
y   y0   y1   ···   yn
Then fit $x = S(t)$ and $y = \bar{S}(t)$, where $S$ and $\bar{S}$ are natural cubic spline interpolants. The two functions $S$ and $\bar{S}$ give a parametric representation of the curve. (See Computer Problem 9.2.6.)
EXAMPLE 4
Select 13 points on the well-known serpentine curve given by
$$y = \frac{x}{\frac{1}{4} + x^2}$$
So that the knots will not be equally spaced, write the curve in parametric form:
$$x = \tfrac{1}{2}\tan\theta, \qquad y = \sin 2\theta$$
and take θ = i(π/12), where i = −6, −5, …, 5, 6. Plot the natural cubic spline curve and the interpolation polynomial in order to compare them.
Solution
This is an example of curve fitting using both the polynomial interpolation routines Coef and Eval from Chapter 4 and the cubic spline routines Spline3 Coef and Spline3 Eval. Figure 9.9 shows the resulting cubic spline curve and the high-degree polynomial curve (dashed line) from an automatic plotter. The polynomial becomes extremely erratic after the fourth knot from the origin and oscillates wildly, whereas the spline is a near perfect fit. ■
FIGURE 9.9 Serpentine curve (cubic spline curve and polynomial curve)
EXAMPLE 5
Use cubic spline functions to produce the curve for the following data:
t   0     1     2     3     4     5     6     7
y   1.0   1.5   1.6   1.5   0.9   2.2   2.8   3.1
It is known that the curve is continuous but its slope is not.
Solution
A single cubic spline is not suitable. Instead, we can use two cubic spline interpolants, the first having knots 0, 1, 2, 3, 4 and the second having knots 4, 5, 6, 7. By carrying out two separate spline interpolation procedures, we obtain two cubic spline curves that meet at the point (4, 0.9). At this point, the two curves have different slopes. The resulting curve is shown in Figure 9.10. ■
FIGURE 9.10 Two cubic splines, $y = \hat{S}(x)$ and $y = \tilde{S}(x)$
Smoothness Property
Why do spline functions serve the needs of data fitting better than ordinary polynomials? To answer this, one should understand that interpolation by polynomials of high degree is often unsatisfactory because polynomials may exhibit wild oscillations. Polynomials are smooth in the technical sense of possessing continuous derivatives of all orders, whereas in this sense, spline functions are not smooth.
Wild oscillations in a function can be attributed to its derivatives being very large. Consider the function whose graph is shown in Figure 9.11. The slope of the chord that joins the points p and q is very large in magnitude. By the Mean-Value Theorem, the slope of that chord is the value of the derivative at some point between p and q. Thus, the derivative must attain large values. Indeed, somewhere on the curve between p and q, there is a point where f′(x) is large and negative. Similarly, between q and r, there is a point where f′(x) is large and positive. Hence, there is a point on the curve between p and r where f″(x) is large. This reasoning can be continued to higher derivatives if there are more oscillations. This is the behavior that spline functions do not exhibit. In fact, the following result shows that from a certain point of view, natural cubic splines are the best functions to use for curve fitting.
FIGURE 9.11 Wildly oscillating function (points p, q, and r marked on the curve)
THEOREM 1
CUBIC SPLINE SMOOTHNESS THEOREM
If S is the natural cubic spline function that interpolates a twice-continuously differentiable function f at knots $a = t_0 < t_1 < \cdots < t_n = b$, then
$$\int_a^b [S''(x)]^2\,dx \le \int_a^b [f''(x)]^2\,dx$$
Proof
To verify the assertion about $[S''(x)]^2$, we let
$$g(x) = f(x) - S(x)$$
so that $g(t_i) = 0$ for $0 \le i \le n$, and
$$f'' = S'' + g''$$
Now
$$\int_a^b (f'')^2\,dx = \int_a^b (S'')^2\,dx + \int_a^b (g'')^2\,dx + 2\int_a^b S''g''\,dx$$
If the last integral were 0, we would be finished because then
$$\int_a^b (f'')^2\,dx = \int_a^b (S'')^2\,dx + \int_a^b (g'')^2\,dx \ge \int_a^b (S'')^2\,dx$$
We apply the technique of integration by parts to the integral in question to show that it is 0.∗ We have
$$\int_a^b S''g''\,dx = S''g'\Big|_a^b - \int_a^b S'''g'\,dx = -\int_a^b S'''g'\,dx$$
∗The formula for integration by parts is $\int_a^b u\,dv = uv\big|_a^b - \int_a^b v\,du$.
interpolant.
ᵃ32. By hand calculation, find the natural cubic spline interpolant for this table:
x   1   2   3   4   5
y   0   1   0   1   0
ᵃ33. Find a cubic spline over knots −1, 0, and 1 such that the following conditions are satisfied: S″(−1) = S″(1) = 0, S(−1) = S(1) = 0, and S(0) = 1.
34. This problem and the next two lead to a more efficient algorithm for natural cubic spline interpolation in the case of equally spaced knots. Let $h_i = h$ in Equation (5), and replace the parameters $z_i$ by $q_i = h^2 z_i/6$. Show that the new form of Equation (5) is then
$$S_i(x) = q_{i+1}\left(\frac{x - t_i}{h}\right)^3 + q_i\left(\frac{t_{i+1} - x}{h}\right)^3 + (y_{i+1} - q_{i+1})\frac{x - t_i}{h} + (y_i - q_i)\frac{t_{i+1} - x}{h}$$
35. (Continuation) Establish the new continuity conditions:
$$q_0 = q_n = 0 \qquad q_{i-1} + 4q_i + q_{i+1} = y_{i+1} - 2y_i + y_{i-1} \quad (1 \le i \le n - 1)$$
36. (Continuation) Show that the parameters $q_i$ can be determined by backward recursion as follows:
$$q_n = 0 \qquad q_{n-1} = \beta_{n-1} \qquad q_i = \alpha_i q_{i+1} + \beta_i \quad (i = n - 2, n - 3, \ldots, 0)$$
where the coefficients $\alpha_i$ and $\beta_i$ are generated by ascending recursion from the formulas
$$\alpha_0 = 0 \qquad \alpha_i = -(\alpha_{i-1} + 4)^{-1} \quad (1 \le i \le n)$$
$$\beta_0 = 0 \qquad \beta_i = -\alpha_i(y_{i+1} - 2y_i + y_{i-1} - \beta_{i-1}) \quad (1 \le i \le n)$$
(This stable and efficient algorithm is due to MacLeod [1973].)
37. Prove that if S(x) is a spline of degree k on [a, b], then S′(x) is a spline of degree k − 1.
ᵃ38. How many coefficients are needed to define a piecewise quartic (fourth-degree) function with n + 1 knots? How many conditions will be imposed if the piecewise quartic function is to be a quartic spline? Justify your answers.
ᵃ39. Determine whether this function is a natural cubic spline:
$$S(x) = \begin{cases} x^3 + 3x^2 + 7x - 5 & (-1 \le x \le 0) \\ -x^3 + 3x^2 + 7x - 5 & (0 \le x \le 1) \end{cases}$$
40. Determine whether this function is or is not a natural cubic spline having knots 0, 1, and 2:
$$f(x) = \begin{cases} x^3 + x - 1 & (0 \le x \le 1) \\ -(x - 1)^3 + 3(x - 1)^2 + 4(x - 1) + 1 & (1 \le x \le 2) \end{cases}$$
41. Show that the natural cubic spline going through the points (0, 1), (1, 2), (2, 3), (3, 4), and (4, 5) must be y = x + 1. (The natural cubic spline interpolant to a given data set is unique, because the matrix in Equation (10) is diagonally dominant and nonsingular, as proven in Section 7.3.)

Computer Problems 9.2

1. Rewrite and test procedure Spline3 Coef using procedure Tri from Chapter 7. Use the symmetry of the (n − 1) × (n − 1) tridiagonal system.
2. The extra storage required in step 1 of the algorithm for solving the natural cubic spline tridiagonal system directly can be eliminated at the expense of a slight amount of extra computation, namely, by computing the $h_i$’s and $b_i$’s directly from the $t_i$’s and $y_i$’s in the forward elimination phase (step 2) and in the back substitution phase (step 3). Rewrite and test procedure Spline3 Coef using this idea.
3. Using at most 20 knots and the cubic spline routines Spline3 Coef and Spline3 Eval, plot on a computer plotter an outline of your:
a. school’s mascot.   b. signature.   c. profile.
4. Let S be the cubic spline function that interpolates $f(x) = (x^2 + 1)^{-1}$ at 41 equally spaced knots in the interval [−5, 5]. Evaluate S(x) − f(x) at 101 equally spaced points on the interval [0, 5].
5. Draw a free-form curve on graph paper, making certain that the curve is the graph of a function. Then read values of your function at a reasonable number of points, say, 10–50, and compute the cubic spline function that takes those values. Compare the freely drawn curve to the graph of the cubic spline.
6. Draw a spiral (or other curve that is not a function) and reproduce it by way of parametric spline functions. (See the figure below.)
[Figure: spiral curve in the xy-plane]
9.3
7. Write and test procedures that are as simple as possible to perform natural cubic spline interpolation with equally spaced knots. Hint: See Problems 9.2.34–9.2.36.
8. Write a program to estimate $\int_a^b f(x)\,dx$, assuming that we know the values of f at only certain prescribed knots $a = t_0 < t_1 < \cdots < t_n = b$. Approximate f first by an interpolating cubic spline, and then compute the integral of it using Equation (5).
9. Write a procedure to estimate f′(x) for any x in [a, b], assuming that we know only the values of f at knots $a = t_0 < t_1 < \cdots < t_n = b$.
By Equation (2), this assertion is true when k = 0. If it is true for index k − 1, then $B_i^{k-1}(x) > 0$ on $(t_i, t_{i+k})$ and $B_{i+1}^{k-1}(x) > 0$ on $(t_{i+1}, t_{i+k+1})$. In Equation (3), the factors that multiply $B_i^{k-1}(x)$ and $B_{i+1}^{k-1}(x)$ are positive when $t_i < x < t_{i+k+1}$. Thus, $B_i^k(x) > 0$ on this interval.
Figure 9.14 shows the first four B splines plotted on the same axes.
FIGURE 9.14 First four B-splines ($B_i^0$, $B_i^1$, $B_i^2$, $B_i^3$ plotted against x, with knots $t_i, t_{i+1}, \ldots, t_{i+4}$)
The principal use of the B splines $B_i^k$ (i = 0, ±1, ±2, …) is as a basis for the set of all kth-degree splines that have the same knot sequence. Thus, linear combinations
$$\sum_{i=-\infty}^{\infty} c_i B_i^k$$
are important objects of study. (We use $c_i$ for fixed k and $C_i^k$ to emphasize the degree k of the corresponding B splines.) Our first task is to develop an efficient method to evaluate a function of the form
$$f(x) = \sum_{i=-\infty}^{\infty} C_i^k B_i^k(x) \tag{4}$$
under the supposition that the coefficients $C_i^k$ are given (as well as the knot sequence $t_i$). Using Definition (3) and some simple series manipulations, we have
$$\begin{aligned} f(x) &= \sum_{i=-\infty}^{\infty} C_i^k \left[\frac{x - t_i}{t_{i+k} - t_i} B_i^{k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x)\right] \\ &= \sum_{i=-\infty}^{\infty} \left[C_i^k \frac{x - t_i}{t_{i+k} - t_i} + C_{i-1}^k \frac{t_{i+k} - x}{t_{i+k} - t_i}\right] B_i^{k-1}(x) \\ &= \sum_{i=-\infty}^{\infty} C_i^{k-1} B_i^{k-1}(x) \end{aligned} \tag{5}$$
where $C_i^{k-1}$ is defined to be the appropriate coefficient from the line preceding Equation (5).
This algebraic manipulation shows how a linear combination of $B_i^k(x)$ can be expressed as a linear combination of $B_i^{k-1}(x)$. Repeating this process k − 1 times, we eventually express f(x) in the form
$$f(x) = \sum_{i=-\infty}^{\infty} C_i^0 B_i^0(x) \tag{6}$$
If $t_m \le x < t_{m+1}$, then $f(x) = C_m^0$. The formula by which the coefficients $C_i^{j-1}$ are obtained is
$$C_i^{j-1} = C_i^j \frac{x - t_i}{t_{i+j} - t_i} + C_{i-1}^j \frac{t_{i+j} - x}{t_{i+j} - t_i} \tag{7}$$
A nice feature of Equation (4) is that only the k + 1 coefficients $C_m^k, C_{m-1}^k, \ldots, C_{m-k}^k$ are needed to compute f(x) if $t_m \le x < t_{m+1}$ (see Problem 9.3.6). Thus, if f is defined by Equation (4) and we want to compute f(x), we use Equation (7) to calculate the entries in the following triangular array:
$$\begin{array}{cccc} C_m^k & C_m^{k-1} & \cdots & C_m^0 \\ C_{m-1}^k & C_{m-1}^{k-1} & & \\ \vdots & & & \\ C_{m-k}^k & & & \end{array}$$
Although our notation does not show it, the coefficients in Equation (4) are independent of x, whereas the $C_i^{j-1}$’s calculated subsequently by Equation (7) do depend on x.
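The triangular scheme based on Equation (7) can be sketched in Python as follows. Everything here is illustrative: the function names `bspline` and `bspline_sum` are hypothetical, the knot sequence is a finite list of distinct values (so only x well inside the list is supported), and the direct recursion in `bspline` follows Definitions (2) and (3), with the degree-0 B spline equal to 1 on $[t_i, t_{i+1})$ and 0 elsewhere.

```python
def bspline(i, k, t, x):
    """B_i^k(x) computed directly from the recursive definition."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = (x - t[i]) / (t[i + k] - t[i]) * bspline(i, k - 1, t, x)
    right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
             * bspline(i + 1, k - 1, t, x))
    return left + right

def bspline_sum(c, k, t, x):
    """Evaluate sum_i c[i] * B_i^k(x) with the triangular array of Equation (7)."""
    # Locate m with t[m] <= x < t[m+1].
    m = max(j for j in range(len(t) - 1) if t[j] <= x)
    # Column of degree-k coefficients C_{m-k}^k, ..., C_m^k.
    C = {i: float(c[i]) for i in range(m - k, m + 1)}
    for j in range(k, 0, -1):
        # Equation (7): build the column of C^{j-1} from the column of C^j.
        C = {i: (C[i] * (x - t[i]) + C[i - 1] * (t[i + j] - x)) / (t[i + j] - t[i])
             for i in range(m - j + 1, m + 1)}
    return C[m]  # this is C_m^0 = f(x)

t = [float(j) for j in range(12)]     # assumed knots 0, 1, ..., 11
c = [(i - 3) ** 2 for i in range(8)]  # arbitrary coefficients for B_0^3, ..., B_7^3
x = 5.3
direct = sum(c[i] * bspline(i, 3, t, x) for i in range(8))
print(bspline_sum(c, 3, t, x), direct)  # the two values agree
```

Only k + 1 coefficients enter the array, whereas the direct sum evaluates every B spline recursively, which is the efficiency gain the text describes.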
It is now a simple matter to establish that
$$\sum_{i=-\infty}^{\infty} B_i^k(x) = 1 \qquad \text{for all } x \text{ and all } k \ge 0$$
If k = 0, we already know this. If k > 0, we use Equation (4) with $C_i^k = 1$ for all i. By Equation (7), all subsequent coefficients $C_i^k, C_i^{k-1}, C_i^{k-2}, \ldots, C_i^0$ are also equal to 1 (induction is needed here!). Thus, at the end, Equation (6) is true with $C_i^0 = 1$, and so f(x) = 1. Therefore, from Equation (4), the sum of all B splines of degree k is unity.
The smoothness of the B splines $B_i^k$ increases with the index k. In fact, we can show by induction that $B_i^k$ has a continuous k − 1st derivative.
The B splines can be used as substitutes for complicated functions in many mathematical situations. Differentiation and integration are important examples. A basic result about the derivatives of B splines is
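$$\frac{d}{dx} B_i^k(x) = \frac{k}{t_{i+k} - t_i} B_i^{k-1}(x) - \frac{k}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x) \tag{8}$$
This equation can be proved by induction using the recursive Formula (3). Once Equation (8) is established, we get the useful formula
$$\frac{d}{dx} \sum_{i=-\infty}^{\infty} c_i B_i^k(x) = \sum_{i=-\infty}^{\infty} d_i B_i^{k-1}(x) \tag{9}$$
where
$$d_i = k\,\frac{c_i - c_{i-1}}{t_{i+k} - t_i}$$
The verification is as follows. By Equation (8),
$$\begin{aligned} \frac{d}{dx} \sum_{i=-\infty}^{\infty} c_i B_i^k(x) &= \sum_{i=-\infty}^{\infty} c_i \frac{d}{dx} B_i^k(x) \\ &= \sum_{i=-\infty}^{\infty} c_i \left[\frac{k}{t_{i+k} - t_i} B_i^{k-1}(x) - \frac{k}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x)\right] \\ &= \sum_{i=-\infty}^{\infty} \left[\frac{c_i k}{t_{i+k} - t_i} - \frac{c_{i-1} k}{t_{i+k} - t_i}\right] B_i^{k-1}(x) \\ &= \sum_{i=-\infty}^{\infty} d_i B_i^{k-1}(x) \end{aligned}$$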
For numerical integration, the B splines are also recommended, especially for indefinite integration. Here is the basic result needed for integration:
$$\int_{-\infty}^{x} B_i^k(s)\,ds = \frac{t_{i+k+1} - t_i}{k+1} \sum_{j=i}^{\infty} B_j^{k+1}(x) \tag{10}$$
This equation can be verified by differentiating both sides with respect to x and simplifying by the use of Equation (9). To be sure that the two sides of Equation (10) do not differ by a constant, we note that for any x < ti , both sides reduce to zero.
The basic result (10) produces this useful formula:
$$\int_{-\infty}^{x} \sum_{i=-\infty}^{\infty} c_i B_i^k(s)\,ds = \sum_{i=-\infty}^{\infty} e_i B_i^{k+1}(x) \tag{11}$$
where
$$e_i = \frac{1}{k+1} \sum_{j=-\infty}^{i} c_j (t_{j+k+1} - t_j)$$
It should be emphasized that this formula gives an indefinite integral (antiderivative) of any function expressed as a linear combination of B splines. Any definite integral can be obtained by selecting a specific value of $x$. For example, if $x$ is a knot, say, $x = t_m$, then
$$\int_{-\infty}^{t_m} \sum_{i=-\infty}^{\infty} c_i B_i^k(s)\,ds = \sum_{i=-\infty}^{\infty} e_i B_i^{k+1}(t_m) = \sum_{i=m-k-1}^{m} e_i B_i^{k+1}(t_m)$$
Matlab has a Spline Toolbox, developed by Carl de Boor, that can be used for many tasks involving splines. For example, there are routines for interpolating data by splines with diverse end conditions and routines for least-squares fits to data. There are many demonstration routines in this Toolbox that exhibit plots and provide models for programming Matlab M-files. These demonstrations are quite instructive for visualizing and learning the concepts in spline theory, especially B splines.
Maple has a BSpline package for constructing B spline basis functions of degree k from a given knot list, which may include multiple knots. It is based on a divided-difference implementation found in Bartels, Beatty, and Barsky [1987]. It can be downloaded from the Maple Application Center at www.maplesoft.com.
Interpolation and Approximation by B Splines
We developed a number of properties of B splines and showed how B splines are used in various numerical tasks. The problem of obtaining a B spline representation of a given function was not discussed. Here, we consider the problem of interpolating a table of data; later, a noninterpolatory method of approximation is described.
A basic question is how to determine the coefficients in the expression
$$S(x) = \sum_{i=-\infty}^{\infty} A_i B_{i-k}^k(x) \tag{12}$$
so that the resulting spline function interpolates a prescribed table:
$$\begin{array}{c|cccc}
x & t_0 & t_1 & \cdots & t_n \\ \hline
y & y_0 & y_1 & \cdots & y_n
\end{array}$$
We mean by interpolate that
$$S(t_i) = y_i \qquad (0 \le i \le n) \tag{13}$$
The natural starting point is with the simplest splines, corresponding to $k = 0$. Since
$$B_i^0(t_j) = \delta_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \ne j) \end{cases}$$
the solution to the problem is immediate: Just set $A_i = y_i$ for $0 \le i \le n$. All other coefficients in Equation (12) are arbitrary. In particular, they can be zero. We arrive then at this result: The zero-degree B spline
$$S(x) = \sum_{i=0}^{n} y_i B_i^0(x)$$
has the interpolation property (13).
The next case, $k = 1$, also has a simple solution. We use the fact that
$$B_{i-1}^1(t_j) = \delta_{ij}$$
Hence, the following is true: The first-degree B spline
$$S(x) = \sum_{i=0}^{n} y_i B_{i-1}^1(x)$$
has the interpolation property (13). So $A_i = y_i$ again.
If the table has four entries ($n = 3$), for instance, we use $B_{-1}^1$, $B_0^1$, $B_1^1$, and $B_2^1$. They, in turn, require for their definition knots $t_{-1}, t_0, t_1, \ldots, t_4$. Knots $t_{-1}$ and $t_4$ can be arbitrary. Figure 9.15 shows the graphs of the four $B_i^1$ splines. In such a problem, if $t_{-1}$ and $t_4$ are not prescribed, it is natural to define them in such a way that $t_0$ is the midpoint of the interval $[t_{-1}, t_1]$ and $t_3$ is the midpoint of $[t_2, t_4]$.
[Figure 9.15: $B_i^1$ splines ($B_{-1}^1$, $B_0^1$, $B_1^1$, $B_2^1$ plotted over the knots $t_{-1}, t_0, \ldots, t_4$)]
In both elementary cases considered, the unknown coefficients $A_0, A_1, \ldots, A_n$ in Equation (12) were uniquely determined by the interpolation conditions (13). If terms were present in Equation (12) corresponding to values of $i$ outside the range $\{0, 1, \ldots, n\}$, then they would have no influence on the values of $S(x)$ at $t_0, t_1, \ldots, t_n$.
For higher-degree splines, we shall see that some arbitrariness exists in choosing coefficients. In fact, none of the coefficients is uniquely determined by the interpolation conditions. This fact can be advantageous if other properties are desired of the solution. In the quadratic case, we begin with the equation
$$\sum_{i=-\infty}^{\infty} A_i B_{i-2}^2(t_j) = \frac{1}{t_{j+1} - t_{j-1}} \left[ A_j (t_{j+1} - t_j) + A_{j+1} (t_j - t_{j-1}) \right] \tag{14}$$
Its justification is left to Problem 9.3.26. If the interpolation conditions (13) are now imposed, we obtain the following system of equations, which gives the necessary and sufficient conditions on the coefficients:
$$A_j (t_{j+1} - t_j) + A_{j+1} (t_j - t_{j-1}) = y_j (t_{j+1} - t_{j-1}) \qquad (0 \le j \le n) \tag{15}$$
This is a system of n + 1 linear equations in n + 2 unknowns A0, A1,..., An+1.
One way to solve Equation (15) is to assign any value to $A_0$ and then use Equation (15) to compute $A_1, A_2, \ldots, A_{n+1}$ recursively. For this purpose, the equations could be rewritten as
$$A_{j+1} = \alpha_j + \beta_j A_j \qquad (0 \le j \le n) \tag{16}$$
where these abbreviations have been used:
$$\alpha_j = y_j \, \frac{t_{j+1} - t_{j-1}}{t_j - t_{j-1}}, \qquad \beta_j = \frac{t_j - t_{j+1}}{t_j - t_{j-1}} \qquad (0 \le j \le n)$$
To keep the coefficients small in magnitude, we recommend selecting $A_0$ such that the expression
$$\Phi = \sum_{i=0}^{n+1} A_i^2$$
will be a minimum. To determine this value of $A_0$, we proceed as follows: By successive substitution using Equation (16), we can show that
$$A_{j+1} = \gamma_j + \delta_j A_0 \qquad (0 \le j \le n) \tag{17}$$
where the coefficients $\gamma_j$ and $\delta_j$ are obtained recursively by this algorithm:
$$\begin{aligned}
\gamma_0 &= \alpha_0, & \delta_0 &= \beta_0 \\
\gamma_j &= \alpha_j + \beta_j \gamma_{j-1}, & \delta_j &= \beta_j \delta_{j-1} \qquad (1 \le j \le n)
\end{aligned} \tag{18}$$
Then $\Phi$ is a quadratic function of $A_0$ as follows:
$$\begin{aligned}
\Phi &= A_0^2 + A_1^2 + \cdots + A_{n+1}^2 \\
&= A_0^2 + (\gamma_0 + \delta_0 A_0)^2 + (\gamma_1 + \delta_1 A_0)^2 + \cdots + (\gamma_n + \delta_n A_0)^2
\end{aligned}$$
To find the minimum of $\Phi$, we take its derivative with respect to $A_0$ and set it equal to zero:
$$\frac{d\Phi}{dA_0} = 2A_0 + 2(\gamma_0 + \delta_0 A_0)\delta_0 + 2(\gamma_1 + \delta_1 A_0)\delta_1 + \cdots + 2(\gamma_n + \delta_n A_0)\delta_n = 0$$
This is equivalent to $q A_0 + p = 0$, where
$$q = 1 + \delta_0^2 + \delta_1^2 + \cdots + \delta_n^2, \qquad p = \gamma_0 \delta_0 + \gamma_1 \delta_1 + \cdots + \gamma_n \delta_n$$
Pseudocode and a Curve-Fitting Example
A procedure that computes coefficients $A_0, A_1, \ldots, A_{n+1}$ in the manner outlined above is given now. In its calling sequence, $(t_i)_{0:n}$ is the knot array, $(y_i)_{0:n}$ is the array of ordinates, $(a_i)_{0:n+1}$ is the array of $A_i$ coefficients, and $(h_i)_{0:n+1}$ is an array that contains $h_i = t_i - t_{i-1}$. Only $n$, $(t_i)$, and $(y_i)$ are input values. They are available unchanged when the routine is finished. Arrays $(a_i)$ and $(h_i)$ are computed and available as output.
procedure BSpline2_Coef(n, (t_i), (y_i), (a_i), (h_i))
integer i, n; real δ, γ, p, q, r
real array (a_i)_{0:n+1}, (h_i)_{0:n+1}, (t_i)_{0:n}, (y_i)_{0:n}
for i = 1 to n do
    h_i ← t_i − t_{i−1}
end for
h_0 ← h_1
h_{n+1} ← h_n
δ ← −1
γ ← 2y_0
p ← δγ
q ← 2
for i = 1 to n do
    r ← h_{i+1}/h_i
    δ ← −rδ
    γ ← −rγ + (r + 1)y_i
    p ← p + γδ
    q ← q + δ^2
end for
a_0 ← −p/q
for i = 1 to n + 1 do
    a_i ← [(h_{i−1} + h_i)y_{i−1} − h_i a_{i−1}]/h_{i−1}
end for
end procedure BSpline2_Coef

Next we give a procedure function BSpline2_Eval for computing values of the quadratic spline given by $S(x) = \sum_{i=0}^{n+1} A_i B_{i-2}^2(x)$. Its calling sequence has some of the same variables as in the preceding pseudocode. The input variable x is a single real number that should lie between $t_0$ and $t_n$. The result of Problem 9.3.26 is used.

real function BSpline2_Eval(n, (t_i), (a_i), (h_i), x)
integer i, n; real d, e, x; real array (a_i)_{0:n+1}, (h_i)_{0:n+1}, (t_i)_{0:n}
for i = n − 1 to 0 step −1 do
    if x − t_i ≥ 0 then exit loop
end for
i ← i + 1
d ← [a_{i+1}(x − t_{i−1}) + a_i(t_i − x + h_{i+1})]/(h_i + h_{i+1})
e ← [a_i(x − t_{i−1} + h_{i−1}) + a_{i−1}(t_{i−1} − x + h_i)]/(h_{i−1} + h_i)
BSpline2_Eval ← [d(x − t_{i−1}) + e(t_i − x)]/h_i
end function BSpline2_Eval
Using the table of 20 points from Section 9.2, we can compare the resulting natural cubic spline curve with the quadratic spline produced by the procedures BSpline2_Coef and BSpline2_Eval. The first of these curves is shown in Figure 9.8, and the second is in Figure 9.16. The latter is reasonable but perhaps not as pleasing as the former. These curves show once again that cubic natural splines are simple and elegant functions for curve fitting.
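For readers who want to run the two procedures, here is a minimal Python transcription. It is a sketch under stated assumptions rather than the book's own code: the function names `bspline2_coef` and `bspline2_eval` are hypothetical, arrays are zero-based Python lists, at least two knots are supplied, and the sample data in the demo are invented (they are not the 20 points from Section 9.2).

```python
def bspline2_coef(t, y):
    """Return (a, h): coefficients A_0..A_{n+1} and spacings h_0..h_{n+1}."""
    n = len(t) - 1
    h = [0.0] * (n + 2)
    for i in range(1, n + 1):
        h[i] = t[i] - t[i - 1]
    h[0], h[n + 1] = h[1], h[n]           # fictitious end spacings
    delta, gamma = -1.0, 2.0 * y[0]       # delta_0 = beta_0, gamma_0 = alpha_0
    p, q = delta * gamma, 2.0             # q starts at 1 + delta_0**2
    for i in range(1, n + 1):
        r = h[i + 1] / h[i]
        delta = -r * delta
        gamma = -r * gamma + (r + 1.0) * y[i]
        p += gamma * delta
        q += delta * delta
    a = [0.0] * (n + 2)
    a[0] = -p / q                         # minimizes the sum of squares of the A_i
    for i in range(1, n + 2):
        a[i] = ((h[i - 1] + h[i]) * y[i - 1] - h[i] * a[i - 1]) / h[i - 1]
    return a, h


def bspline2_eval(t, a, h, x):
    """Evaluate the quadratic spline at x, assumed to lie in [t_0, t_n]."""
    n = len(t) - 1
    i = n - 1
    while i > 0 and x < t[i]:
        i -= 1
    i += 1                                # now t[i-1] <= x <= t[i]
    d = (a[i + 1] * (x - t[i - 1]) + a[i] * (t[i] - x + h[i + 1])) / (h[i] + h[i + 1])
    e = (a[i] * (x - t[i - 1] + h[i - 1]) + a[i - 1] * (t[i - 1] - x + h[i])) / (h[i - 1] + h[i])
    return (d * (x - t[i - 1]) + e * (t[i] - x)) / h[i]


if __name__ == "__main__":
    knots = [0.0, 1.0, 2.5, 3.0, 4.5]     # invented sample data
    values = [1.0, -2.0, 0.5, 3.0, 2.0]
    a, h = bspline2_coef(knots, values)
    for tj in knots:
        print(tj, bspline2_eval(knots, a, h, tj))
```

Evaluating the spline at the knots should reproduce the tabulated values up to roundoff, which is a convenient sanity check on both routines.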
[Figure 9.16: Quadratic interpolating spline]
Schoenberg’s Process
An efficient process due to Schoenberg [1967] can also be used to obtain B spline approximations to a given function. Its quadratic version is defined by
$$S(x) = \sum_{i=-\infty}^{\infty} f(\tau_i) B_i^2(x) \qquad \text{where } \tau_i = \tfrac12 (t_{i+1} + t_{i+2}) \tag{19}$$
Here, of course, the knots are $\{t_i\}_{i=-\infty}^{\infty}$, and the points where $f$ must be evaluated are midpoints between the knots.
Equation (19) is useful in producing a quadratic spline function that approximates $f$.
The salient properties of this process are as follows:
1. If $f(x) = ax + b$, then $S(x) = f(x)$.
2. If $f(x) \ge 0$ everywhere, then $S(x) \ge 0$ everywhere.
3. $\max_x |S(x)| \le \max_x |f(x)|$.
4. If $f$ is continuous on $[a, b]$, if $\delta = \max_i |t_{i+1} - t_i|$, and if $\delta < b - a$, then for $x$ in $[a, b]$,
$$|S(x) - f(x)| \le \frac32 \max_{a \le u \le v \le u + \delta \le b} |f(u) - f(v)|$$
5. The graph of $S$ does not cross any line in the plane a greater number of times than does the graph of $f$.
Some of these properties are elementary; others are more abstruse. Property 1 is outlined in Problem 9.3.29. Property 2 is obvious because $B_i^2(x) \ge 0$ for all $x$. Property 3 follows easily from Equation (19) because if $|f(x)| \le M$, then
$$|S(x)| = \left| \sum_{i=-\infty}^{\infty} f(\tau_i) B_i^2(x) \right| \le \sum_{i=-\infty}^{\infty} |f(\tau_i)| \, |B_i^2(x)| \le M \sum_{i=-\infty}^{\infty} B_i^2(x) = M$$
Properties 4 and 5 will be accepted without proof. Their significance, however, should not be overlooked. By Property 4, we can make the function $S$ close to a continuous function $f$ simply by making the mesh size $\delta$ small. This is because $|f(u) - f(v)|$ can be made as small as we wish simply by imposing the inequality $|u - v| \le \delta$ (uniform continuity property). Property 5 can be interpreted as a shape-preserving attribute of the approximation process. In a crude interpretation, $S$ should not exhibit more undulations than $f$.
Pseudocode
A pseudocode to obtain a spline approximation by means of Schoenberg's process is developed here. Suppose that $f$ is defined on an interval $[a, b]$ and that the spline approximation of Equation (19) is wanted on the same interval. We define nodes $\tau_i = a + ih$, where $h = (b - a)/n$. Here, $i$ can be any integer, but the nodes in $[a, b]$ are only $\tau_0, \tau_1, \ldots, \tau_n$. To have $\tau_i = \frac12 (t_{i+1} + t_{i+2})$, we define the knots $t_i = a + (i - \frac32)h$. In Equation (19), the only B splines $B_i^2$ that are active on $[a, b]$ are $B_{-1}^2, B_0^2, \ldots, B_{n+1}^2$. Hence, for our purposes, Equation (19) becomes
$$S(x) = \sum_{i=-1}^{n+1} f(\tau_i) B_i^2(x) \tag{20}$$
Thus, we require the values of $f$ at $\tau_{-1}, \tau_0, \ldots, \tau_{n+1}$. Two of these nodes are outside the interval $[a, b]$; therefore, we furnish linearly extrapolated values in the code by defining
$$f(\tau_{-1}) = 2f(\tau_0) - f(\tau_1), \qquad f(\tau_{n+1}) = 2f(\tau_n) - f(\tau_{n-1})$$
To use the formulas in Problem 9.3.26, we write
$$S(x) = \sum_{i=1}^{n+3} D_i B_{i-2}^2(x) \qquad [D_i = f(\tau_{i-2})]$$
A pseudocode to compute $D_1, D_2, \ldots, D_{n+3}$ is given now. In the calling sequence for procedure Schoenberg_Coef, $f$ is an external function. After execution, the $n + 3$ desired coefficients are in the $(d_i)$ array.
procedure Schoenberg_Coef(f, a, b, n, (d_i))
integer i, n; real a, b, h; real array (d_i)_{1:n+3}
external function f
h ← (b − a)/n
for i = 2 to n + 2 do
    d_i ← f(a + (i − 2)h)
end for
d_1 ← 2d_2 − d_3
d_{n+3} ← 2d_{n+2} − d_{n+1}
end procedure Schoenberg_Coef
After the coefficients $D_i$ have been obtained by the procedure just given, we can recover values of the spline $S(x)$ in Equation (20). Here, we use the algorithm of Problem 9.3.26. Given an $x$, we first need to know where it is relative to the knots. To determine $k$ such that $t_{k-1} \le x \le t_k$, we notice that $k$ should be the largest integer such that $t_{k-1} \le x$. This inequality is equivalent to the inequality $k \le \frac52 + (x - a)/h$, as is easily verified. This explains the calculations of $k$ in the pseudocode. The location of $x$ is indicated in Figure 9.17. In the calling sequence for function Schoenberg_Eval, $a$ and $b$ are the ends of the interval, and $x$ is a point where the value of $S(x)$ is desired. The procedure determines knots $t_i$ in such a way that the equally spaced points $\tau_i$ in the preceding procedure satisfy $\tau_i = \frac12 (t_{i+1} + t_{i+2})$.
[Figure 9.17: Location of $x$ relative to the knots, with $t_{k-1} \le x \le t_k$]
real function Schoenberg_Eval(a, b, n, (d_i), x)
integer k, n; real a, b, c, e, h, p, x; real array (d_i)_{1:n+3}
h ← (b − a)/n
k ← integer[(x − a)/h + 5/2]
p ← x − a − (k − 5/2)h
c ← [d_{k+1} p + d_k(2h − p)]/(2h)
e ← [d_k(p + h) + d_{k−1}(h − p)]/(2h)
Schoenberg_Eval ← [cp + e(h − p)]/h
end function Schoenberg_Eval
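The two Schoenberg procedures translate almost line for line into Python. The following is a minimal sketch, not the book's code: the names `schoenberg_coef` and `schoenberg_eval` are hypothetical, and the list `d` carries an unused slot `d[0]` so that the indices match $D_1, \ldots, D_{n+3}$ in the pseudocode.

```python
import math


def schoenberg_coef(f, a, b, n):
    """Return d with d[i] = f(tau_{i-2}) for i = 1..n+3 (d[0] is unused)."""
    h = (b - a) / n
    d = [0.0] * (n + 4)
    for i in range(2, n + 3):
        d[i] = f(a + (i - 2) * h)
    d[1] = 2.0 * d[2] - d[3]                  # extrapolated value for tau_{-1}
    d[n + 3] = 2.0 * d[n + 2] - d[n + 1]      # extrapolated value for tau_{n+1}
    return d


def schoenberg_eval(a, b, n, d, x):
    """Evaluate the quadratic Schoenberg spline at x in [a, b]."""
    h = (b - a) / n
    k = int((x - a) / h + 2.5)                # largest k with t_{k-1} <= x
    p = x - a - (k - 2.5) * h                 # p = x - t_{k-1}
    c = (d[k + 1] * p + d[k] * (2.0 * h - p)) / (2.0 * h)
    e = (d[k] * (p + h) + d[k - 1] * (h - p)) / (2.0 * h)
    return (c * p + e * (h - p)) / h


if __name__ == "__main__":
    n = 20
    d = schoenberg_coef(math.sin, 0.0, math.pi, n)
    for x in (0.5, 1.0, 2.0):
        print(x, schoenberg_eval(0.0, math.pi, n, d, x), math.sin(x))
```

Property 1 gives a simple check: for a linear $f$, the values returned by `schoenberg_eval` should agree with $f$ up to roundoff everywhere on $[a, b]$.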
Bézier Curves
In computer-aided design, it is useful to have a procedure for producing a curve that goes through (or near to) some control points, or a curve that can be easily manipulated to give a desired shape. High-degree polynomial interpolation is generally not suitable for this sort of task, as one might guess from the negative remarks previously made about them. Experience shows that if one specifies a number of control points through which the polynomial must pass, the overall shape of the resulting curve may be severely disappointing!
Polynomials can be used in a different way, however, leading to Bézier curves. Bézier curves use as a basis for the space $\Pi_n$ (all polynomials of degree not exceeding $n$) a special set of polynomials that lend themselves to the task at hand. We standardize to the interval $[0, 1]$ and fix a value of $n$. Next, we define basic polynomial functions
$$\phi_{ni}(x) = \binom{n}{i} x^i (1 - x)^{n-i} \qquad (0 \le i \le n)$$
The polynomials $\phi_{ni}$ are the constituents of the Bernstein polynomials. For a continuous function $f$ defined on $[0, 1]$, Bernstein, in 1912, proved that the sequence of polynomials
$$p_n(x) = \sum_{i=0}^{n} f\!\left(\frac{i}{n}\right) \phi_{ni}(x) \qquad (n \ge 1)$$
converges uniformly to $f$, thus providing a very attractive proof of the Weierstrass Approximation Theorem.
The graphs of a few polynomials φni are shown in Figure 9.18, where we used n = 7 and
i = 0, 1, 5. The Bernstein basic polynomials are found in mathematical software systems such as Maple or Mathematica, for example.
[Figure 9.18: First few Bernstein basis polynomials ($\phi_{70}$, $\phi_{71}$, $\phi_{75}$ on $[0, 1]$)]
■ PROPERTIES
Bernstein polynomials have two salient properties. For all $x$ satisfying $0 \le x \le 1$,
1. $\phi_{ni}(x) \ge 0$
2. $\sum_{i=0}^{n} \phi_{ni}(x) = 1$
Any set of functions having these two properties is called a partition of unity on the interval $[0, 1]$. Notice that the second equation above is actually valid for all real $x$. The set $\{\phi_{n0}, \phi_{n1}, \ldots, \phi_{nn}\}$ is a basis for the space $\Pi_n$. Consequently, every polynomial of degree at most $n$ has a representation
$$\sum_{i=0}^{n} a_i \phi_{ni}(x)$$
If we want to create a polynomial that comes close to interpolating values $(i/n, y_i)$ for $0 \le i \le n$, we can use $\sum_{i=0}^{n} y_i \phi_{ni}$ to start and then, after examining the resulting curve, adjust the coefficients to change the shape of the curve. This is one procedure that can be used in computer-aided design. Changing the value of $y_i$ will change the curve principally in the vicinity of $i/n$ because of the local nature of the basic polynomials $\phi_{ni}$.
Another way in which these polynomials can be used is in creating curves that are not simply graphs of a function $f$. Here, we turn to a vector form of the procedure suggested above. If $n + 1$ vectors $v_0, v_1, \ldots, v_n$ are prescribed, say, in $\mathbb{R}^2$ or $\mathbb{R}^3$, the expression
$$u(t) = \sum_{i=0}^{n} \phi_{ni}(t) \, v_i \qquad (0 \le t \le 1)$$
makes sense, since the right-hand side is (for each $t$) a linear combination of the vectors $v_i$. As $t$ runs over the interval $[0, 1]$, the vector $u(t)$ describes a curve in the space where the vectors $v_i$ are situated. This curve lies in the convex hull of the vectors $v_i$, because $u(t)$ is a convex linear combination of the $v_i$. This requires the two properties of $\phi_{ni}$ mentioned above.
To illustrate this procedure, we have selected seven points in the plane and have drawn the closed curve generated by the above equation; that is, by the vector $u(t)$. Figure 9.19 shows the resulting curve as well as the control points. In Figure 9.19, the control points are the vertices of the polygon, and the curve is the one that results in the manner described. Mathematical software systems such as Maple and Mathematica can be used to do this.
[Figure 9.19: Curve using control points]
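The vector formula for $u(t)$ is easy to try out directly. The sketch below is only an illustration: the helper names `bernstein` and `bezier_point` are hypothetical, and the seven control points are invented for the demo (they are not the points used in Figure 9.19). The first and last control points coincide, so the curve is closed.

```python
from math import comb


def bernstein(n, i, x):
    """phi_{n,i}(x) = C(n, i) * x**i * (1 - x)**(n - i)."""
    return comb(n, i) * x**i * (1.0 - x) ** (n - i)


def bezier_point(points, t):
    """u(t) = sum_i phi_{n,i}(t) * v_i for control points v_0..v_n (tuples)."""
    n = len(points) - 1
    dim = len(points[0])
    return tuple(
        sum(bernstein(n, i, t) * points[i][d] for i in range(n + 1))
        for d in range(dim)
    )


if __name__ == "__main__":
    control = [(0.0, 0.0), (1.0, 3.0), (3.0, 5.0), (5.0, 3.0),
               (4.0, 0.0), (2.0, 1.0), (0.0, 0.0)]      # invented sample points
    for j in range(11):
        print(bezier_point(control, j / 10))
```

Because the weights $\phi_{ni}(t)$ are nonnegative and sum to 1, every computed point stays inside the convex hull of the control points, and the curve starts at $v_0$ and ends at $v_n$.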
A glance at Figure 9.18 will suggest to the reader that perhaps B splines can be used in the role of the Bernstein functions $\phi_{ni}$. Indeed, that is the case, and B splines have taken over in most programs for computer-aided design. Thus, to obtain a curve that comes close to a set of points $(t_i, y_i)$, we can set up a system of B splines (for example, cubic B splines) having knots $t_i$. Then the linear combination $\sum_{i=0}^{n} y_i B_i^3$ can be examined to see whether it has the desired shape. Here, of course, $B_i^3$ denotes a cubic B spline whose support is the interval $(t_i, t_{i+4})$.
The vector case is like the one described above, except that the functions $\phi_{ni}$ are replaced by $B_i^3$. Also, it is easier to take the knots as integers and let $t$ run from 0 to $n$. The properties 1 and 2 of the $\phi_{ni}$ displayed above are also shared by the B splines.
Summary
(1) The B spline of degree 0 is
$$B_i^0(x) = \begin{cases} 1 & (t_i \le x < t_{i+1}) \\ 0 & (\text{otherwise}) \end{cases}$$
Higher-degree B splines are defined recursively:
$$B_i^k(x) = \frac{x - t_i}{t_{i+k} - t_i} B_i^{k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x)$$
where $k = 1, 2, \ldots$ and $i = 0, \pm 1, \pm 2, \ldots$.
(2) Some properties are
$$B_i^k(x) = 0 \quad x \notin [t_i, t_{i+k+1}), \qquad B_i^k(x) > 0 \quad x \in (t_i, t_{i+k+1})$$
An efficient method to evaluate a function of the form
$$f(x) = \sum_{i=-\infty}^{\infty} C_i^k B_i^k(x)$$
is to use
$$C_i^{j-1} = C_i^j \, \frac{x - t_i}{t_{i+j} - t_i} + C_{i-1}^j \, \frac{t_{i+j} - x}{t_{i+j} - t_i}$$
(3) The derivative of B splines is
$$\frac{d}{dx} B_i^k(x) = \frac{k}{t_{i+k} - t_i} B_i^{k-1}(x) - \frac{k}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x)$$
A useful formula is
$$\frac{d}{dx} \sum_{i=-\infty}^{\infty} c_i B_i^k(x) = \sum_{i=-\infty}^{\infty} d_i B_i^{k-1}(x)$$
where $d_i = k(c_i - c_{i-1})/(t_{i+k} - t_i)$. A basic result needed for integration is
$$\int_{-\infty}^{x} B_i^k(s)\,ds = \frac{t_{i+k+1} - t_i}{k+1} \sum_{j=i}^{\infty} B_j^{k+1}(x)$$
A resulting useful formula is
$$\int_{-\infty}^{x} \sum_{i=-\infty}^{\infty} c_i B_i^k(s)\,ds = \sum_{i=-\infty}^{\infty} e_i B_i^{k+1}(x)$$
where $e_i = \frac{1}{k+1} \sum_{j=-\infty}^{i} c_j (t_{j+k+1} - t_j)$.
(4) To determine the coefficients in the quadratic expression
$$S(x) = \sum_{i=-\infty}^{\infty} A_i B_{i-2}^2(x)$$
so that the resulting spline function interpolates a prescribed table, we use the condition
$$A_j (t_{j+1} - t_j) + A_{j+1} (t_j - t_{j-1}) = y_j (t_{j+1} - t_{j-1}) \qquad (0 \le j \le n)$$
This is a system of $n + 1$ linear equations in $n + 2$ unknowns $A_0, A_1, \ldots, A_{n+1}$ that can be solved recursively.
(5) Schoenberg's process is an efficient process to obtain B spline approximations to a given function. For example, its quadratic version is defined by
$$S(x) = \sum_{i=-\infty}^{\infty} f(\tau_i) B_i^2(x)$$
where $\tau_i = \frac12 (t_{i+1} + t_{i+2})$ and the knots are $\{t_i\}_{i=-\infty}^{\infty}$. The points $\tau_i$ where $f$ must be evaluated are midpoints between the knots.
(6) Bézier curves are used in computer-aided design for producing a curve that goes through (or near to) control points, or a curve that can be manipulated easily to give a desired shape. Bézier curves use Bernstein polynomials. For a continuous function $f$ defined on $[0, 1]$, the sequence of Bernstein polynomials
$$p_n(x) = \sum_{i=0}^{n} f\!\left(\frac{i}{n}\right) \phi_{ni}(x) \qquad (n \ge 1)$$
converges uniformly to $f$. The polynomials $\phi_{ni}$ are
$$\phi_{ni}(x) = \binom{n}{i} x^i (1 - x)^{n-i} \qquad (0 \le i \le n)$$
Additional References
See Ahlberg et al. [1967], de Boor [1978], Farin [1990], MacLeod [1973], Schoenberg [1946, 1967], Schultz [1973], Schumaker [1981], Subbotin [1967], and Yamaguchi [1988].
1. Show that the functions $f_n(x) = \cos nx$ are generated by this recursive definition:
$$f_0(x) = 1, \qquad f_1(x) = \cos x, \qquad f_{n+1}(x) = 2 f_1(x) f_n(x) - f_{n-1}(x) \quad (n \ge 1)$$
a2. What functions are generated by the following recursive definition?
$$f_0(x) = 1, \qquad f_1(x) = x, \qquad f_{n+1}(x) = 2x f_n(x) - f_{n-1}(x) \quad (n \ge 1)$$
a3. Find an expression for $B_i^2(x)$ and verify that it is piecewise quadratic. Show that $B_i^2(x)$ is zero at every knot except
$$B_i^2(t_{i+1}) = \frac{t_{i+1} - t_i}{t_{i+2} - t_i} \qquad \text{and} \qquad B_i^2(t_{i+2}) = \frac{t_{i+3} - t_{i+2}}{t_{i+3} - t_{i+1}}$$
4. Verify Equation (5).
a5. Establish that $\sum_{i=-\infty}^{\infty} f(t_i) B_{i-1}^1(x)$ is a first-degree spline that interpolates $f$ at every knot. What is the zero-degree spline that does so?
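6. Show that if $t_m \le x$
What is the maximum value of $B_i^2$ and where does it occur?
Let the knots be the integers, and prove that
$$B_0^3(x) = \begin{cases}
0 & (x < 0) \\
\frac16 x^3 & (0 \le x < 1) \\
\frac16 \left( 4 - 3x(x-2)^2 \right) & (1 \le x < 2) \\
\frac16 \left( 4 + 3(x-4)(x-2)^2 \right) & (2 \le x < 3) \\
\frac16 (4 - x)^3 & (3 \le x < 4) \\
0 & (x \ge 4)
\end{cases}$$
35. In the theory of Bézier curves, using the Bernstein basic polynomials, show that the curve passes through the first point, $v_0$.
36. Show that a linear B spline with integer knots can be written in matrix form as
$$S(x) = \begin{bmatrix} x & 1 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 2 & 0 \end{bmatrix} \begin{bmatrix} c_1 \\ c_0 \end{bmatrix} = b_{10} c_0 + b_{11} c_1$$
where
$$B_0^1(x) = \begin{cases} b_{10} = x & (0 \le x < 1) \\ b_{11} = 2 - x & (1 \le x < 2) \\ 0 & (\text{otherwise}) \end{cases}$$
37. Show that the quadratic B spline with integer knots can be written in matrix form as
$$S(x) = \frac12 \begin{bmatrix} x^2 & x & 1 \end{bmatrix} \begin{bmatrix} 1 & -2 & 1 \\ -6 & 6 & 0 \\ 9 & -3 & 0 \end{bmatrix} \begin{bmatrix} c_2 \\ c_1 \\ c_0 \end{bmatrix} = b_{20} c_0 + b_{21} c_1 + b_{22} c_2$$
where
$$B_0^2(x) = \begin{cases} b_{20} & (0 \le x < 1) \\ b_{21} & (1 \le x < 2) \\ b_{22} & (2 \le x < 3) \\ 0 & (\text{otherwise}) \end{cases}$$
Hint: See Problem 9.3.23.
38. Show that the cubic B spline with integer knots can be written as
$$S(x) = \frac16 \begin{bmatrix} x^3 & x^2 & x & 1 \end{bmatrix} \begin{bmatrix} -1 & 3 & -3 & 1 \\ 12 & -24 & 12 & 0 \\ -48 & 60 & -12 & 0 \\ 64 & -44 & 4 & 0 \end{bmatrix} \begin{bmatrix} c_3 \\ c_2 \\ c_1 \\ c_0 \end{bmatrix} = b_{30} c_0 + b_{31} c_1 + b_{32} c_2 + b_{33} c_3$$
where
$$B_0^3(x) = \begin{cases} b_{30} & (0 \le x < 1) \\ b_{31} & (1 \le x < 2) \\ b_{32} & (2 \le x < 3) \\ b_{33} & (3 \le x < 4) \\ 0 & (\text{otherwise}) \end{cases}$$
Hint: See Problem 9.3.34.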
Computer Problems 9.3
1. Using an automatic plotter, graph $B_0^k$ for $k = 0, 1, 2, 3, 4$. Use integer knots $t_i = i$ over the interval $[0, 5]$.
2. Let $t_i = i$ (so the knots are the integer points on the real line). Print a table of 100 values of the function $3B_7^1 + 6B_8^1 - 4B_9^1 + 2B_{10}^1$ on the interval $[6, 14]$. Using a plotter, construct the graph of this function on the given interval.
3. (Continuation) Repeat for the function $3B_7^2 + 6B_8^2 - 4B_9^2 + 2B_{10}^2$.
4. Assuming that $S(x) = \sum_{i=0}^{n} c_i B_i^k(x)$, write a procedure to evaluate $S'(x)$ at a specified $x$. Input is $n, k, x, t_0, \ldots, t_{n+k+1}$ and $c_0, c_1, \ldots, c_n$.
5. Write a procedure to evaluate $\int_a^b S(x)\,dx$, using the assumption that $S(x) = \sum_{i=0}^{n} c_i B_i^k(x)$. Input will be $n, k, a, b, c_0, c_1, \ldots, c_n, t_0, \ldots, t_{n+k+1}$.
6. (March of the B splines) Produce graphs of several B splines of the same degree marching across the x-axis. Use an automatic plotter or a computer package with on-screen graphics capabilities, such as Matlab.
a7. Historians have estimated the size of the Army of Flanders as follows:
Date          Number
Sept. 1572    67,259
Dec. 1573     62,280
Mar. 1574     62,350
Jan. 1575     59,250
May 1576      51,457
Feb. 1578     27,603
Sept. 1580    45,435
Oct. 1582     61,162
Apr. 1588     63,455
Nov. 1591     62,164
Mar. 1607     41,471
Fit the table with a quadratic B spline, and use it to find the average size of the army during the period given. (The average is defined by an integral.)
8. Rewrite procedures BSpline2_Coef and BSpline2_Eval so that the array $(h_i)$ is not used.
9. Rewrite procedures BSpline2_Coef and BSpline2_Eval for the special case of equally spaced knots, simplifying the code where possible.
10. Write a procedure to produce a spline approximation to $F(x) = \int_a^x f(t)\,dt$. Assume that $a \le x \le b$. Begin by finding a quadratic spline interpolant to $f$ at the $n$ points $t_i = a + i(b - a)/n$. Test your program on the following:
a. $f(x) = \sin x$ $(0 \le x \le \pi)$
b. $f(x) = e^x$ $(0 \le x \le 4)$
c. $f(x) = (x^2 + 1)^{-1}$ $(0 \le x \le 2)$
11. Write a procedure to produce a spline function that approximates f ′(x) for a given f on a given interval [a,b]. Begin by finding a quadratic spline interpolant to f at n + 1 points evenly spaced in [a, b], including endpoints. Test your procedure on the
functions suggested in the preceding computer problem.
12. Define f on [0, 6] to be a polygonal line that joins points (0, 0), (1, 2), (3, 3), (5, 3), and (6, 0). Determine spline approximations to f , using Schoenberg’s process and taking 7, 13, 19, 25, and 31 knots.
13. Write suitable code to calculate $\sum_{i=-\infty}^{\infty} f(s_i) B_i^2(x)$ with $s_i = \frac12 (t_{i+1} + t_{i+2})$. Assume that $f$ is defined on $[a, b]$ and that $x$ will lie in $[a, b]$. Assume also that $t_1 < a < t_2$ and $t_{n+1} < b < t_{n+2}$. (Make no assumption about the spacing of knots.)
14. Write a procedure to carry out this approximation scheme:
$$S(x) = \sum_{i=-\infty}^{\infty} f(\tau_i) B_i^3(x) \qquad \tau_i = \tfrac13 (t_{i+1} + t_{i+2} + t_{i+3})$$
Assume that $f$ is defined on $[a, b]$ and that $\tau_i = a + ih$ for $0 \le i \le n$, where $h = (b - a)/n$.
15. Using a mathematical software system such as Matlab with B spline routines, compute and plot the spline curve in Figure 9.16 based on the 20 data points from Section 9.2. Vary the degree of the B splines from 0, 1, 2, 3, through 4 and observe the resulting curves.
16. Using B splines, write a program to perform a natural cubic spline interpolation at
knots $t_0$
5. If $h_{\min} \le |h| \le h_{\max}$, then the step is repeated by returning to step 1 with $x(t)$ and the new $h$ value.
The procedure for this adaptive scheme is RK45_Adaptive. In the parameter list of the pseudocode, $f$ is the function $f(t, x)$ for the differential equation, $t$ and $x$ contain the initial values, $h$ is the initial step size, $t_b$ is the final value for $t$, itmax is the maximum number of steps to be taken in going from $a = t_a$ to $b = t_b$, $\varepsilon_{\min}$ and $\varepsilon_{\max}$ are lower and upper bounds on the allowable error estimate $\varepsilon$, $h_{\min}$ and $h_{\max}$ are bounds on the step size $h$, and iflag is an error flag that returns one of the following values:
iflag    Meaning
0        Successful march from t_a to t_b
1        Maximum number of iterations reached
On return, t and x are the exit values, and h is the final step size value considered or used:
procedure RK45_Adaptive(f, t, x, h, t_b, itmax, ε_max, ε_min, h_min, h_max, iflag)
integer iflag, itmax, k; external function f
real ε, ε_max, ε_min, d, h, h_min, h_max, t, t_b, x, xsave, tsave
real δ ← (1/2) × 10^{−5}
output 0, h, t, x
iflag ← 1
k ← 0
while k ≤ itmax
    k ← k + 1
    if |h| < h_min then h ← sign(h) h_min
    if |h| > h_max then h ← sign(h) h_max
    d ← |t_b − t|
    if d ≤ |h| then
        iflag ← 0
        if d ≤ δ · max{|t_b|, |t|} then exit loop
        h ← sign(h) d
    end if
    xsave ← x
    tsave ← t
    call RK45(f, t, x, h, ε)
    output k, h, t, x, ε
    if iflag = 0 then exit loop
    if ε < ε_min then h ← 2h
    if ε > ε_max then
        h ← h/2
        x ← xsave
        t ← tsave
        k ← k − 1
    end if
end while
end procedure RK45_Adaptive
454 Chapter 10
Ordinary Differential Equations
In the pseudocode, notice that several conditions must be checked to determine the size of the final step, since floating-point arithmetic is involved and the step size varies.
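The control logic can be exercised with a short Python sketch. Several things here are assumptions rather than part of the text above: the one-step routine `rk45_step` uses the standard Runge-Kutta-Fehlberg coefficients and returns the fifth-order value together with the difference between the fifth- and fourth-order values as the error estimate, the function names are hypothetical, and the intermediate output statements are omitted. The demo uses the test problem and parameter values of the industrial example in this section.

```python
import math


def rk45_step(f, t, x, h):
    """One Fehlberg step: returns (t + h, fifth-order x, error estimate)."""
    k1 = h * f(t, x)
    k2 = h * f(t + h / 4, x + k1 / 4)
    k3 = h * f(t + 3 * h / 8, x + 3 * k1 / 32 + 9 * k2 / 32)
    k4 = h * f(t + 12 * h / 13, x + (1932 * k1 - 7200 * k2 + 7296 * k3) / 2197)
    k5 = h * f(t + h, x + 439 * k1 / 216 - 8 * k2 + 3680 * k3 / 513 - 845 * k4 / 4104)
    k6 = h * f(t + h / 2, x - 8 * k1 / 27 + 2 * k2 - 3544 * k3 / 2565
               + 1859 * k4 / 4104 - 11 * k5 / 40)
    x4 = x + 25 * k1 / 216 + 1408 * k3 / 2565 + 2197 * k4 / 4104 - k5 / 5
    x5 = (x + 16 * k1 / 135 + 6656 * k3 / 12825 + 28561 * k4 / 56430
          - 9 * k5 / 50 + 2 * k6 / 55)
    return t + h, x5, abs(x5 - x4)


def rk45_adaptive(f, t, x, h, tb, itmax, eps_max, eps_min, h_min, h_max):
    """Adaptive march from t to tb; returns (t, x, h, iflag)."""
    delta = 0.5e-5
    iflag, k = 1, 0
    while k <= itmax:
        k += 1
        if abs(h) < h_min:
            h = math.copysign(h_min, h)
        if abs(h) > h_max:
            h = math.copysign(h_max, h)
        d = abs(tb - t)
        if d <= abs(h):                       # the next step reaches tb
            iflag = 0
            if d <= delta * max(abs(tb), abs(t)):
                break
            h = math.copysign(d, h)
        t_save, x_save = t, x
        t, x, eps = rk45_step(f, t, x, h)
        if iflag == 0:
            break
        if eps < eps_min:
            h = 2 * h
        if eps > eps_max:                     # reject the step and retry
            h = h / 2
            t, x = t_save, x_save
            k -= 1
    return t, x, h, iflag


if __name__ == "__main__":
    rhs = lambda t, x: 3 + 5 * math.sin(t) + 0.2 * x
    print(rk45_adaptive(rhs, 0.0, 0.0, 0.01, 10.0, 1000, 1e-5, 1e-8, 1e-6, 1.0))
```

As in the pseudocode, a rejected step restores the saved values of $t$ and $x$ and does not count toward itmax, and the final step is shortened so that the march ends at $t_b$.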
As an illustration, the reader should repeat the computer example in the previous section using RK45_Adaptive, which allows variable step size, instead of RK4. Compare the accuracy of these two computed solutions.
An Industrial Example
A first-order differential equation that arose in the modeling of an industrial chemical process is as follows:
$$x' = a + b \sin t + c x, \qquad x(0) = 0 \tag{1}$$
in which a = 3, b = 5, and c = 0.2 are constants. This equation is amenable to the solution techniques of calculus, in particular the use of an integrating factor. However, the analytic solution is complicated, and a numerical solution may be preferable.
To solve this problem numerically using the adaptive Runge-Kutta formulas, one need only identify (and program) the function $f$ that appears in the general description. In this problem, it is $f(t, x) = 3 + 5 \sin t + 0.2x$. Here is a brief pseudocode for solving the equation on the interval $[0, 10]$ with particular values assigned to the parameters in the routine RK45_Adaptive:

program Test_RK45_Adaptive
integer iflag; real t, x, h, t_b; external function f
integer itmax ← 1000
real ε_max ← 10^{−5}, ε_min ← 10^{−8}, h_min ← 10^{−6}, h_max ← 1.0
t ← 0.0; x ← 0.0; h ← 0.01; t_b ← 10.0
call RK45_Adaptive(f, t, x, h, t_b, itmax, ε_max, ε_min, h_min, h_max, iflag)
output itmax, iflag
end program Test_RK45_Adaptive

real function f(t, x)
real t, x
f ← 3 + 5 sin(t) + 0.2x
end function f
10.3 Stability and Adaptive Runge-Kutta and Multistep Methods 455
We obtain the approximation x(10) ≈ 135.917. The output from the code is a table of values that can be sent to a plotting routine. The resulting graph helps the user to visualize the solution curve.
Adams-Bashforth-Moulton Formulas
We now introduce a strategy in which numerical quadrature formulas are used to solve a single first-order ordinary differential equation. The model equation is
$$x'(t) = f(t, x(t))$$
and we suppose that the values of the unknown function have been computed at several points to the left of $t$, namely, $t, t - h, t - 2h, \ldots, t - (n-1)h$. We want to compute $x(t + h)$. By the theorems of calculus, we can write
$$\begin{aligned}
x(t + h) &= x(t) + \int_t^{t+h} x'(s)\,ds \\
&= x(t) + \int_t^{t+h} f(s, x(s))\,ds \\
&\approx x(t) + \sum_{j=1}^{n} c_j f_j
\end{aligned}$$
where the abbreviation $f_j = f(t - (j-1)h, x(t - (j-1)h))$ has been used. In the last line of the above equation, we have brought in a suitable numerical integration formula. The simplest case of such a formula will be for the interval $[0, 1]$ and will use values of the integrand at points $0, -1, -2, \ldots, 1 - n$ in the case of an Adams-Bashforth formula. Once we have such a basic rule, a change of variable will produce the rule for any other interval with any other uniform spacing.
Let us find a rule of the form
$$\int_0^1 F(r)\,dr \approx c_1 F(0) + c_2 F(-1) + \cdots + c_n F(1 - n)$$
There are $n$ coefficients $c_j$ at our disposal. We know from interpolation theory that the formula can be made exact for all polynomials of degree $n - 1$. It suffices that we insist on integrating each function $1, r, r^2, \ldots, r^{n-1}$ exactly. Hence, we write down the appropriate equation:
$$\int_0^1 r^{i-1}\,dr = \sum_{j=1}^{n} c_j (1 - j)^{i-1} \qquad (1 \le i \le n)$$
This is a system $Au = b$ of $n$ equations in $n$ unknowns. The elements of the matrix $A$ are $A_{ij} = (1 - j)^{i-1}$, and the right-hand side is $b_i = 1/i$.
When this program is run, the output is the vector of coefficients $\left( \frac{55}{24}, -\frac{59}{24}, \frac{37}{24}, -\frac{3}{8} \right)$. Of course, higher-order formulas are obtained by changing the value of $n$ in the code. To get the Adams-Moulton formulas, we start with a quadrature rule of the form
$$\int_0^1 G(r)\,dr \approx \sum_{j=1}^{n} C_j G(2 - j)$$
A program similar to the one above yields the coefficients $\left( \frac{9}{24}, \frac{19}{24}, -\frac{5}{24}, \frac{1}{24} \right)$. The distinction between the two quadrature rules is that one involves the value of the integrand at 1 and the other does not.
How do we arrive at formulas for $\int_t^{t+h} g(s)\,ds$ from the work already done? Use the change of variable from $s$ to $\sigma$ given by $s = h\sigma + t$. In these considerations, think of $t$ as a constant. The new integral will be $h \int_0^1 g(h\sigma + t)\,d\sigma$, which can be treated with either of the two formulas already designed for the interval $[0, 1]$. For example,
$$\int_t^{t+h} F(r)\,dr \approx \frac{h}{24} \left[ 55F(t) - 59F(t - h) + 37F(t - 2h) - 9F(t - 3h) \right]$$
$$\int_t^{t+h} G(r)\,dr \approx \frac{h}{24} \left[ 9G(t + h) + 19G(t) - 5G(t - h) + G(t - 2h) \right]$$
The method of undetermined coefficients used here to obtain the quadrature formulas does not, by itself, provide the error terms that we would like to have. An assessment of the error can be made from interpolation theory, because the methods considered here come from integrating an interpolating polynomial. Details can be found in more advanced books. You can experiment with some of the Adams-Bashforth-Moulton formulas in Computer Problems 10.3.2–10.3.4. These methods are taken up again in Section 11.3.
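The coefficient vectors quoted above can be reproduced by solving the small linear system exactly. The following Python sketch is an illustration only and is not the program referred to in the text: it uses exact rational arithmetic from the standard `fractions` module with a plain Gauss-Jordan elimination, and the function names are hypothetical.

```python
from fractions import Fraction


def quadrature_coefficients(nodes):
    """Solve sum_j c_j * nodes[j]**(i-1) = 1/i for i = 1..n (exact for degree < n)."""
    n = len(nodes)
    # Augmented matrix [A | b] with A[i][j] = nodes[j]**i and b[i] = 1/(i + 1).
    rows = [[Fraction(node) ** i for node in nodes] + [Fraction(1, i + 1)]
            for i in range(n)]
    for col in range(n):                          # Gauss-Jordan elimination
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [u - factor * v for u, v in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


def adams_bashforth(n):
    """Coefficients c_1..c_n for the nodes 0, -1, ..., 1 - n."""
    return quadrature_coefficients([1 - j for j in range(1, n + 1)])


def adams_moulton(n):
    """Coefficients C_1..C_n for the nodes 1, 0, ..., 2 - n."""
    return quadrature_coefficients([2 - j for j in range(1, n + 1)])


if __name__ == "__main__":
    print(adams_bashforth(4))
    print(adams_moulton(4))
```

Working with `Fraction` avoids roundoff entirely, so the results can be compared term by term with the fractions in the text; changing the argument `n` yields rules of other orders.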
Stability Analysis
Let us now resume the discussion of errors that inevitably occur in the numerical solution of an initial-value problem
$$x' = f(t, x), \qquad x(a) = s \tag{2}$$
The exact solution is a function x(t). It depends on the initial value s, and to show this, we write x(t,s). The differential equation therefore gives rise to a family of solution curves, each corresponding to one value of the parameter s. For example, the differential equation
$$x' = x, \qquad x(a) = s$$
gives rise to the family of solution curves $x = s e^{(t-a)}$ that differ in their initial values $x(a) = s$. A few such curves are shown in Figure 10.4. The fact that the curves there diverge from one another as $t$ increases has important numerical significance. Suppose, for instance, that initial value $s$ is read into the computer with some roundoff error. Then even if all subsequent calculations are precise and no truncation errors occur, the computed solution will be wrong. An error made at the beginning has the effect of selecting the wrong curve from the family of all solution curves. Since these curves diverge from one another, any minute error made at the beginning is responsible for an eventual complete loss of accuracy. This phenomenon is not restricted to errors made in the first step, because each point in the numerical solution can be interpreted as the initial value for succeeding points.
FIGURE 10.4 Solution curves to x′ = x with x(a) = s. (Figure not reproduced: curves $x = s e^{t-a}$ for initial values s1 through s5, plotted against t with steps t0 through t5 and the global error marked.)
For an example in which this difficulty does not arise, consider
$$x' = -x \qquad x(a) = s$$
Its solutions are $x = s e^{-(t-a)}$. As t increases, these curves come closer together, as in Figure 10.5. Thus, errors made in the numerical solution still result in selecting the wrong curve, but the effect is not as serious because the curves coalesce.
At a given step, the global error of an approximate solution to an ordinary differential equation contains both the local error at that step and the accumulative effect of all the local errors at all previous steps. For divergent solution curves, the local errors at each step are magnified over time, and the global error may be greater than the sum of all the local errors. In Figure 10.4 and Figure 10.5, the steps in the numerical solution are indicated by dots connected by dark lines. Also, the local errors are indicated by small vertical bars and the global error by a vertical bar at the right end of the curves.
For convergent solution curves, the local errors at each step are reduced over time, and the global error may be less than the sum of all the local errors.
FIGURE 10.5 Solution curves to x′ = −x with x(a) = s. (Figure not reproduced: curves $x = s e^{-(t-a)}$ for initial values s1 through s5, plotted against t with steps t0 through t5 and the global error marked.)
For the general differential Equation (2), how can the two modes of behavior just discussed be distinguished? It is simple. If $f_x > \delta$ for some positive δ, the curves diverge. However, if $f_x < -\delta$, they converge. To see why, consider two nearby solution curves that correspond to initial values s and s + h. By Taylor series, we have
$$x(t, s+h) = x(t, s) + h\,\frac{\partial}{\partial s}x(t, s) + \frac{1}{2}h^2\,\frac{\partial^2}{\partial s^2}x(t, s) + \cdots$$
whence
$$x(t, s+h) - x(t, s) \approx h\,\frac{\partial}{\partial s}x(t, s)$$
Thus, the divergence of the curves means that
$$\lim_{t\to\infty} |x(t, s+h) - x(t, s)| = \infty$$
and can be written as
$$\lim_{t\to\infty} \left|\frac{\partial}{\partial s}x(t, s)\right| = \infty$$
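To calculate this partial derivative, start with the differential equation satisfied by x(t, s):
$$\frac{\partial}{\partial t}x(t, s) = f(t, x(t, s))$$
and differentiate partially with respect to s:
$$\frac{\partial}{\partial s}\frac{\partial}{\partial t}x(t, s) = \frac{\partial}{\partial s}f(t, x(t, s))$$
Hence,
$$\frac{\partial}{\partial t}\frac{\partial}{\partial s}x(t, s) = f_x(t, x(t, s))\,\frac{\partial}{\partial s}x(t, s) + f_t(t, x(t, s))\,\frac{\partial t}{\partial s} \tag{3}$$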
But s and t are independent variables (a change in s produces no change in t), so ∂t/∂s = 0. If s is now fixed and if we put u(t) = (∂/∂s)x(t, s) and $q(t) = f_x(t, x(t, s))$, then Equation (3) becomes
$$u' = qu \tag{4}$$
This is a linear differential equation with solution $u(t) = c\,e^{Q(t)}$, where Q is the indefinite integral (antiderivative) of q. The condition $\lim_{t\to\infty} |u(t)| = \infty$ is met if $\lim_{t\to\infty} Q(t) = \infty$. This situation, in turn, occurs if q(t) is positive and bounded away from zero because then
$$Q(t) = \int_a^t q(\theta)\,d\theta > \int_a^t \delta\,d\theta = \delta(t - a) \to \infty$$
as t → ∞ if $f_x = q > \delta > 0$.
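The contrast between divergent and convergent families can be checked numerically. The following Python sketch is an illustration only: the helper name `curve_gap` and the sample values of s, h, and t are assumptions, not part of the text. It uses the exact solutions of x′ = x and x′ = −x to measure the gap between the curves that start at s and s + h.

```python
import math

def curve_gap(rate, s, h, a, t):
    """Return |x(t, s+h) - x(t, s)| for x' = rate*x with x(a) = s.

    Uses the exact solution x(t, s) = s*exp(rate*(t - a)); here f_x = rate.
    """
    growth = math.exp(rate * (t - a))
    return abs((s + h) * growth - s * growth)

if __name__ == "__main__":
    s, h, a = 1.0, 1e-6, 0.0   # hypothetical initial value and perturbation
    for t in (0.0, 5.0, 10.0):
        diverging = curve_gap(1.0, s, h, a, t)    # f_x = 1 > 0
        converging = curve_gap(-1.0, s, h, a, t)  # f_x = -1 < 0
        print(f"t = {t:4.1f}  gap for x' = x: {diverging:.3e}  "
              f"gap for x' = -x: {converging:.3e}")
```

The gap grows like $e^{t-a}$ in the first case and shrinks like $e^{-(t-a)}$ in the second, in line with $u(t) = c\,e^{Q(t)}$.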
To illustrate, consider the differential equation x′ = t + tan x. The solution curves diverge from one another as t → ∞ because $f_x(t, x) = \sec^2 x \ge 1 > 0$.
Summary
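(1) The Runge-Kutta-Fehlberg method is
$$x(t+h) = x(t) + \tfrac{25}{216}K_1 + \tfrac{1408}{2565}K_3 + \tfrac{2197}{4104}K_4 - \tfrac{1}{5}K_5$$
$$\bar{x}(t+h) = x(t) + \tfrac{16}{135}K_1 + \tfrac{6656}{12825}K_3 + \tfrac{28561}{56430}K_4 - \tfrac{9}{50}K_5 + \tfrac{2}{55}K_6$$
where
$$\begin{cases}
K_1 = h f(t, x) \\
K_2 = h f\left(t + \tfrac{1}{4}h,\; x + \tfrac{1}{4}K_1\right) \\
K_3 = h f\left(t + \tfrac{3}{8}h,\; x + \tfrac{3}{32}K_1 + \tfrac{9}{32}K_2\right) \\
K_4 = h f\left(t + \tfrac{12}{13}h,\; x + \tfrac{1932}{2197}K_1 - \tfrac{7200}{2197}K_2 + \tfrac{7296}{2197}K_3\right) \\
K_5 = h f\left(t + h,\; x + \tfrac{439}{216}K_1 - 8K_2 + \tfrac{3680}{513}K_3 - \tfrac{845}{4104}K_4\right) \\
K_6 = h f\left(t + \tfrac{1}{2}h,\; x - \tfrac{8}{27}K_1 + 2K_2 - \tfrac{3544}{2565}K_3 + \tfrac{1859}{4104}K_4 - \tfrac{11}{40}K_5\right)
\end{cases}$$
The quantity $\varepsilon = |x(t+h) - \bar{x}(t+h)|$ can be used in an adaptive step-size procedure.
(2) A fourth-order multistep method is the Adams-Bashforth-Moulton method: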
$$\tilde{x}(t+h) = x(t) + \frac{h}{24}\,[55f(t, x(t)) - 59f(t-h, x(t-h)) + 37f(t-2h, x(t-2h)) - 9f(t-3h, x(t-3h))]$$
$$x(t+h) = x(t) + \frac{h}{24}\,[9f(t+h, \tilde{x}(t+h)) + 19f(t, x(t)) - 5f(t-h, x(t-h)) + f(t-2h, x(t-2h))]$$
The value $\tilde{x}(t+h)$ is the predicted value, and x(t + h) is the corrected value. The truncation errors for these two formulas are $O(h^5)$. Since the value of x(a) is given, the values for x(a + h), x(a + 2h), x(a + 3h), x(a + 4h) are computed by some single-step method such as the fourth-order Runge-Kutta method.
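The predictor-corrector pair in item (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's procedure: the function names `rk4_step` and `abm4` are hypothetical, the step size is fixed, the corrector is applied once per step, and only the three starting values x(a + h), x(a + 2h), x(a + 3h) that the four-step predictor needs are generated by the classical Runge-Kutta method.

```python
import math

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step for x' = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h * k1 / 2)
    k3 = f(t + h / 2, x + h * k2 / 2)
    k4 = f(t + h, x + h * k3)
    return x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def abm4(f, a, b, s, nsteps):
    """Approximate x(b) for x' = f(t, x), x(a) = s, with nsteps >= 3 steps."""
    h = (b - a) / nsteps
    ts, xs = [a], [s]
    # Starting values from a single-step method (RK4).
    for _ in range(3):
        xs.append(rk4_step(f, ts[-1], xs[-1], h))
        ts.append(ts[-1] + h)
    fs = [f(t, x) for t, x in zip(ts, xs)]
    for k in range(3, nsteps):
        t, x = ts[k], xs[k]
        # Adams-Bashforth predictor.
        pred = x + h / 24 * (55 * fs[k] - 59 * fs[k - 1]
                             + 37 * fs[k - 2] - 9 * fs[k - 3])
        # Adams-Moulton corrector, evaluated at the predicted value.
        corr = x + h / 24 * (9 * f(t + h, pred) + 19 * fs[k]
                             - 5 * fs[k - 1] + fs[k - 2])
        ts.append(t + h)
        xs.append(corr)
        fs.append(f(t + h, corr))
    return xs[-1]

if __name__ == "__main__":
    approx = abm4(lambda t, x: -x, 0.0, 1.0, 1.0, 100)
    print(approx, math.exp(-1.0))
```

The test problem x′ = −x, x(0) = 1 is chosen only because its exact solution $e^{-t}$ is known, so the two printed numbers can be compared directly.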
Additional References
See Aiken [1985], Butcher [1987], Dekker and Verwer [1984], England [1969], Fehlberg [1969], Henrici [1962], Hundsdorfer [1985], Lambert [1973], Lapidus and Seinfeld [1971], Miranker [1981], Moulton [1930], Shampine and Gordon [1975], and Stetter [1973].
Problems 10.3
a1. Solve the problem
$$x' = -x \qquad x(0) = 1$$
by using the Trapezoid Rule, as discussed at the beginning of this chapter. Compare the true solution at t = 1 to the approximate solution obtained with n steps. Show, for example, that for n = 5, the error is 0.00123.
a2. Derive an implicit multistep formula based on Simpson’s rule, involving uniformly spaced points x(t − h), x(t), and x(t + h), for numerically solving the ordinary differential equation x′ = f.
3. An alert student noticed that the coefficients in the Adams-Bashforth formula add up to 1. Why is that so?
a4. Derive a formula of the form
$$x(t+h) = a\,x(t) + b\,x(t-h) + h\,[c\,x'(t+h) + d\,x''(t) + e\,x'''(t-h)]$$
that is accurate for polynomials of as high a degree as possible. Hint: Use polynomials 1, t, $t^2$, and so on.
a5. Determine the coefficients of an implicit, one-step, ordinary differential equation method of the form
$$x(t+h) = a\,x(t) + b\,x'(t) + c\,x'(t+h)$$
so that it is exact for polynomials of as high a degree as possible. What is the order of the error term?
6. The differential equation that is used to illustrate the adaptive Runge-Kutta program can be solved with an integrating factor. Do so.
7. Establish Equation (4).
a8. The initial-value problem $x' = (1 + t^2)x$ with x(0) = 1 is to be solved on the interval [0, 9]. How sensitive is x(9) to perturbations in the initial value x(0)?
9. For each differential equation, determine regions in which the solution curves tend to diverge from one another as t increases:
aa. $x' = \sin t + e^x$    b. $x' = x + te^{-t}$    ac. $x' = xt$
d. $x' = x^3(t^2 + 1)$    ae. $x' = \cos t - e^x$    f. $x' = (1 - x^3)(1 + t^2)$
a10. For the differential equation $x' = t(x^3 - 6x^2 + 15x)$, determine whether the solution curves diverge from one another as t → ∞.
a11. Determine whether the solution curves of $x' = (1 + t^2)^{-1}x$ diverge from one another as t → ∞.
Computer Problems 10.3
1. Use mathematical software to solve systems of linear equations whose solutions are
a. Adams-Bashforth coefficients    b. Adams-Moulton coefficients
2. The second-order Adams-Bashforth-Moulton method is given by
$$\tilde{x}(t+h) = x(t) + \frac{h}{2}\,[3f(t, x(t)) - f(t-h, x(t-h))]$$
$$x(t+h) = x(t) + \frac{h}{2}\,[f(t+h, \tilde{x}(t+h)) + f(t, x(t))]$$
The approximate single-step error is $\varepsilon \equiv K|x(t+h) - \tilde{x}(t+h)|$, where K = 1/6. Using ε to monitor the convergence, write and test an adaptive procedure for solving an ODE of your choice using these formulas.
3. (Continuation) Carry out the instructions of the previous computer problem for the third-order Adams-Bashforth-Moulton method:
$$\tilde{x}(t+h) = x(t) + \frac{h}{12}\,[23f(t, x(t)) - 16f(t-h, x(t-h)) + 5f(t-2h, x(t-2h))]$$
$$x(t+h) = x(t) + \frac{h}{12}\,[5f(t+h, \tilde{x}(t+h)) + 8f(t, x(t)) - f(t-h, x(t-h))]$$
where K = 1/10 in the expression for the approximate single-step error.
4. (Predictor-corrector scheme) Using the fourth-order Adams-Bashforth-Moulton method, derive the predictor-corrector scheme given by the following equations:
$$\tilde{x}(t+h) = x(t) + \frac{h}{24}\,[55f(t, x(t)) - 59f(t-h, x(t-h)) + 37f(t-2h, x(t-2h)) - 9f(t-3h, x(t-3h))]$$
$$x(t+h) = x(t) + \frac{h}{24}\,[9f(t+h, \tilde{x}(t+h)) + 19f(t, x(t)) - 5f(t-h, x(t-h)) + f(t-2h, x(t-2h))]$$
Write and test a procedure for the Adams-Bashforth-Moulton method. Note: This is a multistep process because values of x at t, t − h, t − 2h, and t − 3h are used to determine the predicted value $\tilde{x}(t+h)$, which, in turn, is used with values of x at t, t − h, and t − 2h to obtain the corrected value x(t + h). The error terms for these formulas are $(251/720)h^5 f^{(4)}(\xi)$ and $-(19/720)h^5 f^{(4)}(\eta)$, respectively. (See Section 9.3 for additional discussion of these methods.)
a5. Solve
$$\begin{cases} x' = \dfrac{3x}{t} + \dfrac{9}{2}t - 13 \\ x(3) = 6 \end{cases}$$
for x(1/2) using procedure RK45_Adaptive to obtain the desired solution to nine decimal places. Compare with the true solution:
$$x = t^3 - \tfrac{9}{2}t^2 + \tfrac{13}{2}t$$
a6. (Continuation) Repeat the previous problem for x(−1/2).
7. It is known that the fourth-order Runge-Kutta method described in Equation (12) of Section 10.2 has a local truncation error that is $O(h^5)$. Devise and carry out a numerical experiment to test this. Suggestions: Take just one step in the numerical solution of a nontrivial differential equation whose solution is known beforehand. However, use a variety of values for h, such as $2^{-n}$, where 1 ≤ n ≤ 24. Test whether the ratio of errors to $h^5$ remains bounded as h → 0. A multiple-precision calculation may be needed. Print the indicated ratios.
8. Compute the numerical solution of
$$x' = -x \qquad x(0) = 1$$
using the midpoint method
$$x_{n+1} = x_{n-1} + 2h\,x_n'$$
with $x_0 = 1$ and $x_1 = -h + \sqrt{1 + h^2}$. Are there any difficulties in using this method for this problem? Carry out an analysis of the stability of this method. Hint: Consider fixed h and assume $x_n = \lambda^n$.
a9. Tabulate and graph the function [1 − ln v(x)]v(x) on [0, e], where v(x) is the solution of the initial-value problem (dv/dx)[ln v(x)] = 2x, v(0) = 1. Check value: v(1) = e.
10. Determine the numerical value of
$$2\pi \int_4^5 \frac{e^s}{s}\,ds$$
in three ways: solving the integral, an ordinary differential equation, and using the exact formula.
11. Compute and print a table of the function
$$f(\varphi) = \int_0^{\varphi} \sqrt{1 - \tfrac{1}{4}\sin^2\theta}\;d\theta$$
by solving an appropriate initial-value problem. Cover the interval [0, 90°] with steps of 1° and use the Runge-Kutta method of order 4. Check values: Use f(30°) = 0.51788 193 and f(90°) = 1.46746 221. Note: This is an example of an elliptic integral of the second kind. It arises in finding an arc length on an ellipse and in many engineering problems.
a12. By solving an appropriate initial-value problem, make a table of the function
$$f(x) = \int_{1/x}^{\infty} \frac{dt}{t\,e^t}$$
on the interval [0, 1]. Determine how well f is approximated by $x e^{-1/x}$. Hint: Let t = −ln s.
a13. By solving an appropriate initial-value problem, make a table of the function
$$f(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$
on the interval 0 ≤ x ≤ 2. Determine how accurately f(x) is approximated on this interval by the function
$$g(x) = 1 - (ay + by^2 + cy^3)\,\frac{2}{\sqrt{\pi}}\,e^{-x^2}$$
where
a = 0.30842 84    b = −0.08497 13
c = 0.66276 98    $y = (1 + 0.47047x)^{-1}$
14. Use the Runge-Kutta method to compute $\int_0^1 \sqrt{1 + s^3}\,ds$.
a15. Write and run a program to print an accurate table of the sine integral
$$\mathrm{Si}(x) = \int_0^x \frac{\sin r}{r}\,dr$$
The table should cover the interval 0 ≤ x ≤ 1 in steps of size 0.01. [Use sin(0)/0 = 1. See Computer Problem 5.1.2]
16. Compute a table of the function
$$\mathrm{Shi}(x) = \int_0^x \frac{\sinh t}{t}\,dt$$
by finding an initial-value problem that it satisfies and then solving the initial-value problem. Your table should be accurate to nearly machine precision. [Use sinh(0)/0 = 1.]
17. Design and carry out a numerical experiment to verify that a slight perturbation in an initial-value problem can cause catastrophic errors in the numerical solution. Note: An initial-value problem is an ordinary differential equation with conditions specified only at the initial point. (Compare this with a boundary value problem as given in Chapter 12.)
18. Run example programs for solving the industrial example in Equation (1), compare the solutions, and produce the plots.
19. Another adaptive Runge-Kutta method was developed by England [1969]. The Runge-Kutta-England method is similar to the Runge-Kutta-Fehlberg method in that it combines a fourth-order Runge-Kutta formula and a companion fifth-order one. To reduce the number of function evaluations, the formulas are derived so that some of the same function evaluations are used in each pair of formulas. (A fourth-order Runge-Kutta formula requires at least four function evaluations, and a fifth-order one requires at least six.) The Runge-Kutta-England method uses the fourth-order Runge-Kutta method in Computer Problem 10.2.14a and takes two half steps as follows:
$$x\left(t + \tfrac{1}{2}h\right) = x(t) + \tfrac{1}{6}(K_1 + 4K_3 + K_4)$$
where
$$\begin{cases}
K_1 = \tfrac{1}{2}h f(t, x(t)) \\
K_2 = \tfrac{1}{2}h f\left(t + \tfrac{1}{4}h,\; x(t) + \tfrac{1}{2}K_1\right) \\
K_3 = \tfrac{1}{2}h f\left(t + \tfrac{1}{4}h,\; x(t) + \tfrac{1}{4}K_1 + \tfrac{1}{4}K_2\right) \\
K_4 = \tfrac{1}{2}h f\left(t + \tfrac{1}{2}h,\; x(t) - K_2 + 2K_3\right)
\end{cases}$$
and
$$x(t+h) = x\left(t + \tfrac{1}{2}h\right) + \tfrac{1}{6}(K_5 + 4K_7 + K_8)$$
where
$$\begin{cases}
K_5 = \tfrac{1}{2}h f\left(t + \tfrac{1}{2}h,\; x\left(t + \tfrac{1}{2}h\right)\right) \\
K_6 = \tfrac{1}{2}h f\left(t + \tfrac{3}{4}h,\; x\left(t + \tfrac{1}{2}h\right) + \tfrac{1}{2}K_5\right) \\
K_7 = \tfrac{1}{2}h f\left(t + \tfrac{3}{4}h,\; x\left(t + \tfrac{1}{2}h\right) + \tfrac{1}{4}K_5 + \tfrac{1}{4}K_6\right) \\
K_8 = \tfrac{1}{2}h f\left(t + h,\; x\left(t + \tfrac{1}{2}h\right) - K_6 + 2K_7\right)
\end{cases}$$
With these two half steps, there are enough function evaluations so that only one more
$$K_9 = \tfrac{1}{2}h f\left(t + h,\; x(t) - \tfrac{1}{12}(K_1 + 96K_2 - 92K_3 + 121K_4 - 144K_5 - 6K_6 + 12K_7)\right)$$
is needed to obtain a fifth-order Runge-Kutta method:
$$\bar{x}(t+h) = x(t) + \tfrac{1}{90}(14K_1 + 64K_3 + 32K_4 - 8K_5 + 64K_7 + 15K_8 - K_9)$$
An adaptive procedure can be developed by using an error estimation based on the two values x(t + h) and $\bar{x}(t+h)$. Program and test such a procedure. (See, for example, Shampine, Allen, and Pruess [1997].)
20. Investigate the numerical solution of the initial-value problem
$$x' = -\sqrt{1 - x^2} \qquad x(0) = 1$$
This problem is ill-conditioned, since x(t) = cos t is a solution and x(t) = 1 is also. For more information on this and other test problems, see Cash [2003] or www.ma.ic.ac.uk/~jcash/.
21. (Student research project) Learn about algebraic differential equations.
22. Write software to implement the following pseudocodes and verify the numerical results given in the text:
a. Test_RK45 and RK45    b. Test_RK45_Adaptive and RK45_Adaptive
11
Systems of Ordinary Differential Equations
A simple model to account for the way in which two different animal species sometimes interact is the predator-prey model. If u(t) is the number of individuals in the predator species and v(t) the number of individuals in the prey species, then under suitable simplifying assumptions and with appropriate constants a, b, c, and d,
$$\begin{cases} \dfrac{du}{dt} = a(v + b)u \\[1ex] \dfrac{dv}{dt} = c(u + d)v \end{cases}$$
This is a pair of nonlinear ordinary differential equations (ODEs) that govern the populations of the two species (as functions of time t). In this chapter, numerical procedures are developed for solving such problems.
11.1 Methods for First-Order Systems
In Chapter 10, ordinary differential equations were considered in the simplest context; that is, we restricted our attention to a single differential equation of the first order with an accompanying auxiliary condition. Scientific and technological problems often lead to more complicated situations, however. The next degree of complication occurs with systems of several first-order equations.
Uncoupled and Coupled Systems
The sun and the nine planets form a system of particles moving under the jurisdiction of Newton’s law of gravitation. The position vectors of the planets constitute a system of 27 functions, and the Newtonian laws of motion can be written, then, as a system of 54 first-order ordinary differential equations. In principle, the past and future positions of the planets can be obtained by solving these equations numerically.
Taking an example of more modest scope, we consider two equations with two auxiliary conditions. Let x and y be two functions of t subject to the system
$$\begin{cases} x'(t) = x(t) - y(t) + 2t - t^2 - t^3 \\ y'(t) = x(t) + y(t) - 4t^2 + t^3 \end{cases} \tag{1}$$
with initial conditions
$$x(0) = 1 \qquad y(0) = 0$$
This is an example of an initial-value problem that involves a system of two first-order differential equations. Note that in the example given, it is not possible to solve either of the two differential equations by itself because the first equation governing x′ involves the unknown function y, and the second equation governing y′ involves the unknown function x. In this situation, we say that the two differential equations are coupled.
The reader is invited to verify that the analytic solution is
$$x(t) = e^t\cos(t) + t^2 = \cos(t)[\cosh(t) + \sinh(t)] + t^2$$
$$y(t) = e^t\sin(t) - t^3 = \sin(t)[\cosh(t) + \sinh(t)] - t^3$$
Let us look at another example that is superficially similar to the first but is actually simpler:
$$\begin{cases} x'(t) = x(t) + 2t - t^2 - t^3 \\ y'(t) = y(t) - 4t^2 + t^3 \end{cases}$$
with initial conditions
$$x(0) = 1 \qquad y(0) = 0$$
These two equations are not coupled and can be solved separately as two unrelated initial-value problems (using, for instance, the methods of Chapter 10). Naturally, our concern here is with systems that are coupled, although methods that solve coupled systems also solve those that are not. The procedures discussed in Chapter 10 extend to systems whether coupled or uncoupled.
Taylor Series Method
We illustrate the Taylor series method for System (1) and begin by differentiating the equations constituting it:
$$\begin{aligned}
x' &= x - y + 2t - t^2 - t^3 \\
y' &= x + y - 4t^2 + t^3 \\
x'' &= x' - y' + 2 - 2t - 3t^2 \\
y'' &= x' + y' - 8t + 3t^2 \\
x''' &= x'' - y'' - 2 - 6t \\
y''' &= x'' + y'' - 8 + 6t \\
x^{(4)} &= x''' - y''' - 6 \\
y^{(4)} &= x''' + y''' + 6
\end{aligned} \tag{2}$$
etc.
A program to proceed from x(t) to x(t + h) and from y(t) to y(t + h) is easily written by using a few terms of the Taylor series:
$$x(t+h) = x + hx' + \frac{h^2}{2}x'' + \frac{h^3}{6}x''' + \frac{h^4}{24}x^{(4)} + \cdots$$
$$y(t+h) = y + hy' + \frac{h^2}{2}y'' + \frac{h^3}{6}y''' + \frac{h^4}{24}y^{(4)} + \cdots$$
together with equations for the various derivatives. Here, x and y and all their derivatives are functions of t; that is, x = x(t), y = y(t), x′ = x′(t), y′′ = y′′(t), and so on.
A pseudocode program that generates and prints a numerical solution from 0 to 1 in 100 steps is as follows. Terms up to h4 have been used in the Taylor series.
program Taylor_System1
integer k; real h, t, x, y, x′, y′, x′′, y′′, x′′′, y′′′, x(4), y(4)
integer nsteps ← 100; real a ← 0, b ← 1
x ← 1; y ← 0; t ← a
output 0, t, x, y
h ← (b − a)/nsteps
for k = 1 to nsteps do
    x′ ← x − y + t(2 − t(1 + t))
    y′ ← x + y + t²(−4 + t)
    x′′ ← x′ − y′ + 2 − t(2 + 3t)
    y′′ ← x′ + y′ + t(−8 + 3t)
    x′′′ ← x′′ − y′′ − 2 − 6t
    y′′′ ← x′′ + y′′ − 8 + 6t
    x(4) ← x′′′ − y′′′ − 6
    y(4) ← x′′′ + y′′′ + 6
    x ← x + h[x′ + (1/2)h[x′′ + (1/3)h[x′′′ + (1/4)h[x(4)]]]]
    y ← y + h[y′ + (1/2)h[y′′ + (1/3)h[y′′′ + (1/4)h[y(4)]]]]
    t ← t + h
    output k, t, x, y
end for
end program Taylor_System1
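For readers who want to run the Taylor series pseudocode, one possible Python translation is sketched below. The function name `taylor_system1` and its default arguments are choices made for this illustration; the derivative formulas and the nested evaluation of the series follow the pseudocode, and the comparison against the analytic solution is added here as a check.

```python
import math

def taylor_system1(nsteps=100, a=0.0, b=1.0):
    """Fourth-order Taylor series method for System (1), x(0) = 1, y(0) = 0."""
    h = (b - a) / nsteps
    t, x, y = a, 1.0, 0.0
    for _ in range(nsteps):
        # Derivatives obtained by repeatedly differentiating System (1).
        x1 = x - y + t * (2 - t * (1 + t))
        y1 = x + y + t * t * (-4 + t)
        x2 = x1 - y1 + 2 - t * (2 + 3 * t)
        y2 = x1 + y1 + t * (-8 + 3 * t)
        x3 = x2 - y2 - 2 - 6 * t
        y3 = x2 + y2 - 8 + 6 * t
        x4 = x3 - y3 - 6
        y4 = x3 + y3 + 6
        # Nested form of x + h x' + (h^2/2) x'' + (h^3/6) x''' + (h^4/24) x''''.
        x += h * (x1 + h / 2 * (x2 + h / 3 * (x3 + h / 4 * x4)))
        y += h * (y1 + h / 2 * (y2 + h / 3 * (y3 + h / 4 * y4)))
        t += h
    return t, x, y

if __name__ == "__main__":
    t, x, y = taylor_system1()
    print("computed:", x, y)
    print("analytic:", math.exp(t) * math.cos(t) + t**2,
          math.exp(t) * math.sin(t) - t**3)
```

Only the final values are returned rather than printed at every step, which keeps the sketch short; a list of (t, x, y) triples could be collected inside the loop if a table is wanted.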
Vector Notation
Observe that System (1) can be written in vector notation as
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} x - y + 2t - t^2 - t^3 \\ x + y - 4t^2 + t^3 \end{bmatrix} \tag{3}$$
with initial conditions
$$\begin{bmatrix} x(0) \\ y(0) \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
This is a special case of a more general problem that can be written as
$$X' = F(t, X) \qquad X(a) = S, \text{ given} \tag{4}$$
where
$$X = \begin{bmatrix} x \\ y \end{bmatrix} \qquad X' = \begin{bmatrix} x' \\ y' \end{bmatrix}$$
and F is the vector whose two components are given by the right-hand sides in Equation (1). Since F depends on t and X, we write F(t, X).
Systems of ODEs
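We can continue this idea in order to handle a system of n first-order differential equations. First, we write them as
$$\begin{cases}
x_1' = f_1(t, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(t, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(t, x_1, x_2, \ldots, x_n)
\end{cases}$$
$$x_1(a) = s_1,\; x_2(a) = s_2,\; \ldots,\; x_n(a) = s_n \quad \text{all given}$$
Then we let
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad X' = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix} \qquad F = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix} \qquad S = \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_n \end{bmatrix}$$
and we obtain Equation (4), which is an ordinary differential equation written in vector notation.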
Taylor Series Method: Vector Notation
The Taylor series method of order m would be written as
$$X(t+h) = X + hX' + \frac{h^2}{2}X'' + \cdots + \frac{h^m}{m!}X^{(m)} \tag{5}$$
where X = X(t), X′ = X′(t), X′′ = X′′(t), and so on.
A pseudocode for the Taylor series method of order 4 applied to the preceding problem
can be easily rewritten by a simple change of variables and the introduction of an array and an inner loop.
program Taylor_System2
integer i, k; real h, t; real array (xi)1:n, (dij)1:n×1:4
integer n ← 2, nsteps ← 100
real a ← 0, b ← 1
t ← 0; (xi) ← (1, 0)
output 0, t, (xi)
h ← (b − a)/nsteps
for k = 1 to nsteps do
    d11 ← x1 − x2 + t(2 − t(1 + t))
    d21 ← x1 + x2 + t²(−4 + t)
    d12 ← d11 − d21 + 2 − t(2 + 3t)
    d22 ← d11 + d21 + t(−8 + 3t)
    d13 ← d12 − d22 − 2 − 6t
    d23 ← d12 + d22 − 8 + 6t
    d14 ← d13 − d23 − 6
    d24 ← d13 + d23 + 6
    for i = 1 to n do
        xi ← xi + h[di1 + (1/2)h[di2 + (1/3)h[di3 + (1/4)h[di4]]]]
    end for
    t ← t + h
    output k, t, (xi)
end for
end program Taylor_System2
Here, a two-dimensional array is used instead of all the different derivative variables; that is, $d_{ij} \leftrightarrow x_i^{(j)}$. In fact, this and other methods in this chapter become particularly easy to program if the computer language supports vector operations.
Runge-Kutta Method
The Runge-Kutta methods of Chapter 10 also extend to systems of differential equations. The classical fourth-order Runge-Kutta method for System (4) uses these formulas:
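$$X(t+h) = X + \frac{h}{6}(K_1 + 2K_2 + 2K_3 + K_4) \tag{6}$$
where
$$\begin{cases}
K_1 = F(t, X) \\
K_2 = F\left(t + \tfrac{1}{2}h,\; X + \tfrac{1}{2}hK_1\right) \\
K_3 = F\left(t + \tfrac{1}{2}h,\; X + \tfrac{1}{2}hK_2\right) \\
K_4 = F(t + h,\; X + hK_3)
\end{cases}$$
Here, X = X(t), and all quantities are vectors with n components except variables t and h.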
A procedure for carrying out the Runge-Kutta method is given next. It is assumed that the system to be solved is in the form of Equation (4) and that there are n equations in the system. The user furnishes the initial value of t, the initial value of X, the step size h, and the number of steps to be taken, nsteps. Furthermore, procedure XP_System(n, t, (xi), (fi)) is needed, which evaluates the right-hand side of Equation (4) for a given value of array (xi) and stores the result in array (fi). (The name XP_System is chosen as an abbreviation of X′ for a system.)
procedure RK4_System1(n, h, t, (xi), nsteps)
integer i, j, n; real h, t; real array (xi)1:n
allocate real array (yi)1:n, (Ki,j)1:n×1:4
output 0, t, (xi)
for j = 1 to nsteps do
    call XP_System(n, t, (xi), (Ki,1))
    for i = 1 to n do
        yi ← xi + (1/2)h Ki,1
    end for
    call XP_System(n, t + h/2, (yi), (Ki,2))
    for i = 1 to n do
        yi ← xi + (1/2)h Ki,2
    end for
    call XP_System(n, t + h/2, (yi), (Ki,3))
    for i = 1 to n do
        yi ← xi + h Ki,3
    end for
    call XP_System(n, t + h, (yi), (Ki,4))
    for i = 1 to n do
        xi ← xi + (1/6)h[Ki,1 + 2Ki,2 + 2Ki,3 + Ki,4]
    end for
    t ← t + h
    output j, t, (xi)
end for
deallocate array (yi), (Ki,j)
end procedure RK4_System1
To illustrate the use of this procedure, we again use System (1) for our example. Of course, it must be rewritten in the form of Equation (4). A suitable main program and a procedure for computing the right-hand side of Equation (4) follow:
program Test_RK4_System1
integer n ← 2, nsteps ← 100
real a ← 0, b ← 1
real h, t; real array (xi)1:n
t ← 0
(xi) ← (1, 0)
h ← (b − a)/nsteps
call RK4_System1(n, h, t, (xi), nsteps)
end program Test_RK4_System1

procedure XP_System(n, t, (xi), (fi))
real array (xi)1:n, (fi)1:n
integer n
real t
f1 ← x1 − x2 + t(2 − t(1 + t))
f2 ← x1 + x2 − t²(4 − t)
end procedure XP_System
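The same computation can be sketched in Python. The names `xp_system` and `rk4_system` are chosen here to echo the pseudocode but are not the book's code; vectors are plain lists, the right-hand side is passed in as a function argument, and only the final state is returned instead of printing every step.

```python
import math

def xp_system(t, x):
    """Right-hand side F(t, X) of System (1), with x[0] = x and x[1] = y."""
    return [x[0] - x[1] + t * (2 - t * (1 + t)),
            x[0] + x[1] - t * t * (4 - t)]

def rk4_system(f, t, x, h, nsteps):
    """Take nsteps classical fourth-order Runge-Kutta steps for X' = f(t, X)."""
    n = len(x)
    for _ in range(nsteps):
        k1 = f(t, x)
        k2 = f(t + h / 2, [x[i] + h / 2 * k1[i] for i in range(n)])
        k3 = f(t + h / 2, [x[i] + h / 2 * k2[i] for i in range(n)])
        k4 = f(t + h, [x[i] + h * k3[i] for i in range(n)])
        x = [x[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(n)]
        t += h
    return t, x

if __name__ == "__main__":
    t, x = rk4_system(xp_system, 0.0, [1.0, 0.0], 0.01, 100)
    print("computed:", x)
    print("analytic:", [math.exp(t) * math.cos(t) + t**2,
                        math.exp(t) * math.sin(t) - t**3])
```

Because `rk4_system` never looks inside `f`, the same routine serves any first-order system once a suitable right-hand-side function is supplied.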
A numerical experiment to compare the results of the Taylor series method and the Runge-Kutta method with the analytic solution of System (1) is suggested in Computer Problem 11.1.1. At the point t = 1.0, the results are as follows:
            Taylor Series    Runge-Kutta    Analytic Solution
x(1.0) ≈    2.46869 40       2.46869 42     2.46869 39399
y(1.0) ≈    1.28735 46       1.28735 61     1.28735 52872
We can use mathematical software routines found in Matlab, Maple, or Mathematica to obtain the numerical solution of the system of ordinary differential equations (1). For t over the interval [0,1], we invoke an ODE procedure to march from t = 0 at which x(0) = 1 and y(0) = 0 to t = 1 at which x(1) = 2.468693912 and y(1) = 1.287355325.
To obtain the numerical solution of the ordinary differential equation defined for t over the interval [1, 1.5], invoke an ordinary differential equation solving procedure to march from t = 1 at which x(1) = 2 and y(1) = −2 to t = 1.5 at which x(1.5) ≈ 15.5028 and y(1.5) ≈ 6.15486.
Autonomous ODE
When we wrote the system of differential equations in vector form
X′ = F(t, X)
we assumed that the variable t was explicitly separated from the other variables and treated differently. It is not necessary to do this. Indeed, we can introduce a new variable x0 that is t in disguise and add a new differential equation x0′ = 1. A new initial condition must also be provided, x0(a) = a. In this way, we increase the number of differential equations from n to n + 1 and obtain a system written in the more elegant vector form
$$X' = F(X) \qquad X(a) = S, \text{ given}$$
Consider the system of two equations given by Equation (1). We write it as a system with three variables by letting
$$x_0 = t, \qquad x_1 = x, \qquad x_2 = y$$
Thus, we have
$$\begin{bmatrix} x_0' \\ x_1' \\ x_2' \end{bmatrix} = \begin{bmatrix} 1 \\ x_1 - x_2 + 2x_0 - x_0^2 - x_0^3 \\ x_1 + x_2 - 4x_0^2 + x_0^3 \end{bmatrix}$$
The auxiliary condition for the vector X is $X(0) = [0, 1, 0]^T$.
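The change of variables can be expressed as a small wrapper. In this Python sketch, `make_autonomous` and `system1` are hypothetical names introduced for illustration: the wrapper takes a right-hand side f(t, X) and returns a function of the augmented vector $[x_0, x_1, \ldots, x_n]$ whose first component is the added equation x0′ = 1.

```python
def make_autonomous(f):
    """Turn F(t, X) into G(Z), where Z = [x0, x1, ..., xn] and x0 plays the role of t."""
    def g(z):
        # The first component is x0' = 1; the rest are f evaluated at t = z[0].
        return [1.0] + list(f(z[0], z[1:]))
    return g

def system1(t, x):
    """Right-hand side of System (1), with x[0] = x and x[1] = y."""
    return [x[0] - x[1] + 2 * t - t**2 - t**3,
            x[0] + x[1] - 4 * t**2 + t**3]

g = make_autonomous(system1)
print(g([0.0, 1.0, 0.0]))  # derivative vector at X(0) = [0, 1, 0]; prints [1.0, 1.0, 1.0]
```

Any solver written for the autonomous form X′ = F(X) can then be applied to `g` without a separate time argument.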
As a result of the preceding remarks, we sacrifice no generality in considering a system of n + 1 first-order differential equations written as
$$\begin{cases}
x_0' = f_0(x_0, x_1, x_2, \ldots, x_n) \\
x_1' = f_1(x_0, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(x_0, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(x_0, x_1, x_2, \ldots, x_n)
\end{cases}$$
$$x_0(a) = s_0,\; x_1(a) = s_1,\; x_2(a) = s_2,\; \ldots,\; x_n(a) = s_n \quad \text{all given}$$
We can write this system in general vector notation as
$$X' = F(X) \qquad X(a) = S, \text{ given} \tag{7}$$
where
$$X = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad X' = \begin{bmatrix} x_0' \\ x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix} \qquad F = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix} \qquad S = \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ \vdots \\ s_n \end{bmatrix}$$
A system of differential equations without the t variable explicitly present is said to be autonomous. The numerical methods that we discuss do not require that x0 = t or f0 = 1 or s0 = a.
For an autonomous system, the classical fourth-order Runge-Kutta method for System (7) uses these formulas:
$$X(t+h) = X + \frac{h}{6}(K_1 + 2K_2 + 2K_3 + K_4) \tag{8}$$
where
$$\begin{cases}
K_1 = F(X) \\
K_2 = F\left(X + \tfrac{1}{2}hK_1\right) \\
K_3 = F\left(X + \tfrac{1}{2}hK_2\right) \\
K_4 = F(X + hK_3)
\end{cases}$$
Here, X = X(t), and all quantities are vectors with n + 1 components except the variable h. In the previous example, the procedure RK4_System1 would need to be modified by beginning the arrays with 0 rather than 1 and omitting the variable t. (We call it RK4_System2 and leave it as Computer Problem 11.1.4.) Then the calling programs would be as follows:
program Test_RK4_System2
real h, t; real array (xi)0:n
integer n ← 2, nsteps ← 100
real a ← 0, b ← 1
(xi) ← (0, 1, 0)
h ← (b − a)/nsteps
call RK4_System2(n, h, (xi), nsteps)
end program Test_RK4_System2
procedure XP_System(n, (xi), (fi))
real array (xi)0:n, (fi)0:n
integer n
f0 ← 1
f1 ← x1 − x2 + x0(2 − x0(1 + x0))
f2 ← x1 + x2 − x0²(4 − x0)
end procedure XP_System
It is typical in ordinary differential equation solvers, such as those found in mathematical software libraries, for the user to interface with them by writing a subprogram in a nonautonomous format. In other words, the ordinary differential equation solver takes as input both the independent variable and the dependent variable and returns values for the right-hand side of the ordinary differential equation. Consequently, the nonautonomous programming convention may seem more natural to those who are using these software packages.
It is a useful exercise to find a physical application in your field of study or profession involving the solution of an ordinary differential equation. It is instructive to analyze and solve the physical problem by determining the appropriate numerical method and translating the problem into the format that is compatible with the available software.
Summary
(1) A system of ordinary differential equations
$$\begin{cases}
x_1' = f_1(t, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(t, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(t, x_1, x_2, \ldots, x_n)
\end{cases}$$
$$x_1(a) = s_1,\; x_2(a) = s_2,\; \ldots,\; x_n(a) = s_n \quad \text{all given}$$
can be written in vector notation as
$$X' = F(t, X) \qquad X(a) = S, \text{ given}$$
where we define the following n-component vectors
$$\begin{cases}
X = [x_1, x_2, \ldots, x_n]^T \\
X' = [x_1', x_2', \ldots, x_n']^T \\
F = [f_1, f_2, \ldots, f_n]^T \\
X(a) = [x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{cases}$$
(2) The Taylor series method of order m is
$$X(t+h) = X + hX' + \frac{h^2}{2}X'' + \cdots + \frac{h^m}{m!}X^{(m)}$$
where X = X(t), X′ = X′(t), X′′ = X′′(t), and so on.
(3) The Runge-Kutta method of order 4 is
$$X(t+h) = X + \frac{h}{6}(K_1 + 2K_2 + 2K_3 + K_4)$$
where
$$\begin{cases}
K_1 = F(t, X) \\
K_2 = F\left(t + \tfrac{1}{2}h,\; X + \tfrac{1}{2}hK_1\right) \\
K_3 = F\left(t + \tfrac{1}{2}h,\; X + \tfrac{1}{2}hK_2\right) \\
K_4 = F(t + h,\; X + hK_3)
\end{cases}$$
Here, X = X(t), and all quantities are vectors with n components except variables t and h.
(4) We can absorb the t variable into the vector by letting x0 = t and then writing the autonomous form for the system of ordinary differential equations in vector notation as
$$X' = F(X) \qquad X(a) = S, \text{ given}$$
where vectors are defined to have n + 1 components. Then
$$\begin{cases}
X = [x_0, x_1, x_2, \ldots, x_n]^T \\
X' = [x_0', x_1', x_2', \ldots, x_n']^T \\
F = [1, f_1, f_2, \ldots, f_n]^T \\
X(a) = [a, x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{cases}$$
(5) The Runge-Kutta method of order 4 for the system of ordinary differential equations in autonomous form is
$$X(t+h) = X + \frac{h}{6}(K_1 + 2K_2 + 2K_3 + K_4)$$
where
$$\begin{cases}
K_1 = F(X) \\
K_2 = F\left(X + \tfrac{1}{2}hK_1\right) \\
K_3 = F\left(X + \tfrac{1}{2}hK_2\right) \\
K_4 = F(X + hK_3)
\end{cases}$$
Here, X = X(t), and all quantities F and Ki are vectors with n + 1 components except the variables t and h.
Problems 11.1
a1. Consider
$$\begin{cases} x' = y \\ y' = x \end{cases} \quad \text{with} \quad \begin{cases} x(0) = -1 \\ y(0) = 0 \end{cases}$$
Write down the equations, without derivatives, to be used in the Taylor series method of order 5.
a2. How would you solve this system of differential equations numerically?
$$\begin{cases} x_1' = x_1^2 + e^t - t^2 \\ x_2' = x_2 - \cos t \end{cases} \qquad x_1(0) = 0 \quad x_2(1) = 0$$
a3. How would you solve the initial-value problem
$$\begin{cases} x_1'(t) = x_1(t)e^t + \sin t - t^2 \\ x_2'(t) = [x_2(t)]^2 - e^t + x_2(t) \end{cases} \qquad x_1(1) = 2 \quad x_2(1) = 4$$
if a computer program were available to solve an initial-value problem of the form x′ = f(t, x) involving a single unknown function x = x(t)?
a4. Write an equivalent system of first-order differential equations without t appearing on the right-hand side:
$$\begin{cases} x' = x^2 + \log(y) + t^2 \\ y' = e^y - \cos(x) + \sin(tx) - (xy)^7 \end{cases} \qquad x(0) = 1 \quad y(0) = 3$$
Computer Problems 11.1
a1. Solve the system of differential equations (1) by using two different methods given in this section and compare the results with the analytic solution.
a2. Solve the initial-value problem
$$\begin{cases} x' = t + x^2 - y \\ y' = t^2 - x + y^2 \end{cases} \qquad x(0) = 3 \quad y(0) = 2$$
by means of the Taylor series method using h = 1/128 on the interval [0, 0.38]. Include terms involving three derivatives in x and y. How accurate are the computed function values?
3. Write the Runge-Kutta procedure to solve
$$\begin{cases} x_1' = -3x_2 \\ x_2' = \tfrac{1}{3}x_1 \end{cases} \qquad x_1(0) = 0 \quad x_2(0) = 1$$
on the interval 0 ≤ t ≤ 4. Plot the solution.
a4. Write procedure RK4_System2 and a driver program for solving the ordinary differential equation system given by Equation (2). Use $h = -10^{-2}$, and print out the values of x0, x1, and x2, together with the true solution on the interval [−1, 0]. Verify that the true solution is $x(t) = e^t + 6 + 6t + 4t^2 + t^3$ and $y(t) = e^t - t^3 + t^2 + 2t + 2$.
a5. Using the Runge-Kutta procedure, solve the following initial-value problem on the interval 0 ≤ t ≤ 2π. Plot the resulting curves (x1(t), x2(t)) and (x3(t), x4(t)). They should be circles.
$$\begin{cases}
X' = \begin{bmatrix} x_3 \\ x_4 \\ -x_1\left(x_1^2 + x_2^2\right)^{-3/2} \\ -x_2\left(x_1^2 + x_2^2\right)^{-3/2} \end{bmatrix} \\[1ex]
X(0) = [1, 0, 0, 1]^T
\end{cases}$$
6. Solve the problem
$$\begin{cases} x_0' = 1 \\ x_1' = -x_2 + \cos x_0 \\ x_2' = x_1 + \sin x_0 \end{cases} \qquad x_0(1) = 1 \quad x_1(1) = 0 \quad x_2(1) = -1$$
Use the Runge-Kutta method and the interval −1 ≤ t ≤ 2.
a7. Write and test a program, using the Taylor series method of order 5, to solve the system
$$\begin{cases} x' = tx - y^2 + 3t \\ y' = x^2 - ty - t^2 \end{cases} \qquad x(5) = 2 \quad y(5) = 3$$
on the interval [5, 6] using $h = 10^{-3}$. Print values of x and y at steps of 0.1.
8. Print a table of sin t and cos t on the interval [0, π/2] by numerically solving the system
$$\begin{cases} x' = y \\ y' = -x \end{cases} \qquad x(0) = 0 \quad y(0) = 1$$
9. Write a program for using the Taylor series method of order 3 to solve the system
$$\begin{cases} x' = tx + y' - t^2 \\ y' = ty + 3t \\ z' = tz - y' + 6t^3 \end{cases} \qquad x(0) = 1 \quad y(0) = 2 \quad z(0) = 3$$
on the interval [0, 0.75] using h = 0.01.
10. Write and test a short program for solving the system of differential equations
$$\begin{cases} y' = x^3 - t^2y - t^2 \\ x' = tx^2 - y^4 + 3t \end{cases} \qquad y(2) = 5 \quad x(2) = 3$$
over the interval [2, 5] with h = 0.25. Use the Taylor series method of order 4.
11. Recode and test procedure RK4_System2 using a computer language that supports vector operations.
12. Verify the numerical results given in the text for the system of differential equations (1) from programs Test_RK4_System1 and RK4_System2.
13. (Continuation) Use mathematical software such as Matlab, Maple, or Mathematica containing symbolic manipulation capabilities to verify the analytic solution for the system of differential equations (1).
14. (Continuation) Use mathematical software routines such as are found in Matlab, Maple, or Mathematica to verify the numerical solutions given in the text. Plot the resulting solution curve. Compare with the results from programs Test_RK4_System1 and Test_RK4_System2.
11.2 Higher-Order Equations and Systems
Consider the initial-value problem for ordinary differential equations of order higher than 1. A differential equation of order n is normally accompanied by n auxiliary conditions. This many initial conditions are needed to specify the solution of the differential equation precisely (assuming certain smoothness conditions are present). Take, for example, a particular second-order initial-value problem
$$x''(t) = -3\cos^2(t) + 2, \qquad x(0) = 0, \quad x'(0) = 0 \tag{1}$$
Without the auxiliary conditions, the general analytic solution is
$$x(t) = \tfrac{1}{4} t^2 + \tfrac{3}{8} \cos(2t) + c_1 t + c_2$$
where $c_1$ and $c_2$ are arbitrary constants. To select one specific solution, $c_1$ and $c_2$ must be fixed, and two initial conditions allow this to be done. In fact, $x(0) = 0$ yields $c_2 = -\tfrac{3}{8}$, and $x'(0) = 0$ forces $c_1 = 0$.
Higher-Order Differential Equations
In general, higher-order problems can be much more complicated than this simple example because System (1) has the special property that the function on the right-hand side of the differential equation does not involve x. The most general form of an ordinary differential equation with initial conditions that we shall consider is
$$x^{(n)} = f(t, x, x', x'', \ldots, x^{(n-1)}), \qquad x(a),\ x'(a),\ x''(a),\ \ldots,\ x^{(n-1)}(a) \text{ all given} \tag{2}$$
This can be solved numerically by turning it into a system of first-order differential equations. To do so, we define new variables $x_1, x_2, \ldots, x_n$ as follows:
$$x_1 = x, \quad x_2 = x', \quad x_3 = x'', \quad \ldots, \quad x_{n-1} = x^{(n-2)}, \quad x_n = x^{(n-1)}$$
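Consequently, the original initial-value problem (2) is equivalent to
$$\begin{cases} x_1' = x_2 \\ x_2' = x_3 \\ \quad \vdots \\ x_{n-1}' = x_n \\ x_n' = f(t, x_1, x_2, \ldots, x_n) \end{cases} \qquad x_1(a),\ x_2(a),\ \ldots,\ x_n(a) \text{ all given}$$
or, in vector notation,
$$X' = F(t, X), \qquad X(a) = S, \text{ given} \tag{3}$$
where
$$X = [x_1, x_2, \ldots, x_n]^T, \qquad X' = [x_1', x_2', \ldots, x_n']^T$$
and
$$F = [x_2, x_3, x_4, \ldots, x_n, f]^T, \qquad X(a) = [x_1(a), x_2(a), \ldots, x_n(a)]^T$$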
Whenever a problem must be transformed by introducing new variables, it is recom- mended that a dictionary be provided to show the relationship between the new and the old variables. At the same time, this information, together with the differential equations and the initial values, can be displayed in a chart. Such systematic bookkeeping can be helpful in a complicated situation.
To illustrate, let us transform the initial-value problem
$$x''' = \cos x + \sin x' - e^{x''} + t^2, \qquad x(0) = 3, \quad x'(0) = 7, \quad x''(0) = 13 \tag{4}$$
into a form suitable for solution by the Runge-Kutta procedure. A chart summarizing the transformed problem is as follows:
Old Variable   New Variable   Initial Value   Differential Equation
$x$            $x_1$          3               $x_1' = x_2$
$x'$           $x_2$          7               $x_2' = x_3$
$x''$          $x_3$          13              $x_3' = \cos x_1 + \sin x_2 - e^{x_3} + t^2$
So the corresponding first-order system is
$$X' = \begin{bmatrix} x_2 \\ x_3 \\ \cos x_1 + \sin x_2 - e^{x_3} + t^2 \end{bmatrix}$$
and $X(0) = [3, 7, 13]^T$.
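The chart for problem (4) translates almost line by line into code. The following is a minimal Python sketch, not the text's own routine: the names `f_example` and `rk4_step` are hypothetical, and the step size is chosen very small only because $e^{13}$ makes the third component change rapidly near $t = 0$.

```python
import math

def f_example(t, x):
    """F(t, X) for x''' = cos x + sin x' - exp(x'') + t^2.

    x[0], x[1], x[2] hold the new variables x1 = x, x2 = x', x3 = x''.
    """
    x1, x2, x3 = x
    return [x2, x3, math.cos(x1) + math.sin(x2) - math.exp(x3) + t * t]

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step for X' = F(t, X)."""
    k1 = f(t, x)
    k2 = f(t + h / 2, [xi + h / 2 * k for xi, k in zip(x, k1)])
    k3 = f(t + h / 2, [xi + h / 2 * k for xi, k in zip(x, k2)])
    k4 = f(t + h, [xi + h * k for xi, k in zip(x, k3)])
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

X0 = [3.0, 7.0, 13.0]                      # x(0), x'(0), x''(0) from the chart
F0 = f_example(0.0, X0)                    # slope vector at t = 0
X1 = rk4_step(f_example, 0.0, X0, 1e-6)    # one small step (assumed step size)
```

The first two components of `F0` simply repeat the initial values 7 and 13, which is a quick check that the dictionary between old and new variables was entered correctly.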
Systems of Higher-Order Differential Equations
By systematically introducing new variables, we can transform a system of differential equations of various orders into a larger system of first-order equations. For instance, the system
$$\begin{cases} x'' = x - y - (3x')^2 + (y')^3 + 6y'' + 2t \\ y''' = y'' - x' + e^x - t \end{cases} \qquad x(1) = 2, \quad x'(1) = -4, \quad y(1) = -2, \quad y'(1) = 7, \quad y''(1) = 6 \tag{5}$$
can be solved by the Runge-Kutta procedure if we first transform it according to the following chart:
Old Variable   New Variable   Initial Value   Differential Equation
$x$            $x_1$          2               $x_1' = x_2$
$x'$           $x_2$          −4              $x_2' = x_1 - x_3 - 9x_2^2 + x_4^3 + 6x_5 + 2t$
$y$            $x_3$          −2              $x_3' = x_4$
$y'$           $x_4$          7               $x_4' = x_5$
$y''$          $x_5$          6               $x_5' = x_5 - x_2 + e^{x_1} - t$
Hence, we have
$$X' = \begin{bmatrix} x_2 \\ x_1 - x_3 - 9x_2^2 + x_4^3 + 6x_5 + 2t \\ x_4 \\ x_5 \\ x_5 - x_2 + e^{x_1} - t \end{bmatrix}$$
and $X(1) = [2, -4, -2, 7, 6]^T$.
Autonomous ODE Systems
We notice that t is present on the right-hand side of Equation (3) and that therefore the equations x0 = t and x0′ = 1 can be introduced to form an autonomous system of ordinary differential equations in vector notation. It is easy to show that a higher-order system of differential equations having the form in Equation (2) can be written in vector notation as
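$$X' = F(X), \qquad X(a) = S, \text{ given}$$
where
$$X = [x_0, x_1, x_2, \ldots, x_n]^T, \qquad X' = [x_0', x_1', x_2', \ldots, x_n']^T$$
and
$$F = [1, x_2, x_3, x_4, \ldots, x_n, f]^T, \qquad X(a) = [a, x_1(a), x_2(a), \ldots, x_n(a)]^T$$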
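Absorbing t into the state vector can be done mechanically. The sketch below, in Python, wraps any right-hand side of the form $F(t, X)$ as an autonomous $G(Y)$ with $Y = [x_0, x_1, \ldots, x_n]$ and $x_0 = t$. The helper name `make_autonomous` and the small test equation $x'' = -x + t$ are assumptions made for illustration; they do not appear in the text.

```python
def make_autonomous(f):
    """Wrap F(t, X) as G(Y), where Y = [x0, x1, ..., xn] and x0 plays the role of t."""
    def g(y):
        t, x = y[0], y[1:]
        return [1.0] + list(f(t, x))    # x0' = 1, then the original components
    return g

# Hypothetical example: x'' = -x + t, written as x1' = x2, x2' = -x1 + t
def f(t, x):
    return [x[1], -x[0] + t]

g = make_autonomous(f)
y_a = [0.5, 1.0, 2.0]     # [a, x1(a), x2(a)] with a = 0.5
slopes = g(y_a)           # [1.0, 2.0, -0.5]
```

The initial vector gains the leading entry a, exactly as in the definition of $X(a)$ above.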
As an example, the ordinary differential equation system in Equation (5) can be written in autonomous form as
$$X' = \begin{bmatrix} 1 \\ x_2 \\ x_1 - x_3 - 9x_2^2 + x_4^3 + 6x_5 + 2x_0 \\ x_4 \\ x_5 \\ x_5 - x_2 + e^{x_1} - x_0 \end{bmatrix}$$
and $X(1) = [1, 2, -4, -2, 7, 6]^T$.
Summary
(1) A single nth-order ordinary differential equation with initial values has the form
$$x^{(n)} = f(t, x, x', x'', \ldots, x^{(n-1)}), \qquad x(a),\ x'(a),\ x''(a),\ \ldots,\ x^{(n-1)}(a), \text{ all given}$$
It can be turned into a system of first-order equations of the form
$$X' = F(t, X), \qquad X(a) = S, \text{ given}$$
where
$$X = [x_1, x_2, \ldots, x_n]^T$$
$$X' = [x_1', x_2', \ldots, x_n']^T$$
$$F = [x_2, x_3, x_4, \ldots, x_n, f]^T$$
$$X(a) = [x_1(a), x_2(a), \ldots, x_n(a)]^T$$
(2) We can absorb the variable t into the vector notation by letting $x_0 = t$ and extending the vectors to length n + 1. Thus, a single nth-order ordinary differential equation can be written as
$$X' = F(X), \qquad X(a) = S, \text{ given}$$
where
$$X = [x_0, x_1, x_2, \ldots, x_n]^T$$
$$X' = [x_0', x_1', x_2', \ldots, x_n']^T$$
$$F = [1, x_2, x_3, x_4, \ldots, x_n, f]^T$$
$$X(a) = [a, x_1(a), x_2(a), \ldots, x_n(a)]^T$$
Problems 11.2
a1. Turn this differential equation into a system of first-order equations suitable for applying the Runge-Kutta method:
$$x''' = 2x' + \log(x'') + \cos(x), \qquad x(0) = 1, \quad x'(0) = -3, \quad x''(0) = 5$$
2. a. Assuming that a program is available for solving initial-value problems of the form in Equation (3), how can it be used to solve the following differential equation?
$$x''' = t + x + 2x' + 3x'', \qquad x(1) = 3, \quad x'(1) = -7, \quad x''(1) = 4$$
b. How would this problem be solved if the initial conditions were $x(1) = 3$, $x'(1) = -7$, and $x'''(1) = 0$?
a3. How would you solve this differential equation problem numerically?
$$\begin{cases} x_1'' = x_1' + x_1^2 - \sin t \\ x_2'' = x_2 - (x_2')^{1/2} + t \end{cases} \qquad x_1(0) = 1, \quad x_2(1) = 3, \quad x_1'(0) = 0, \quad x_2'(1) = -2$$
a4. Convert to a first-order system the orbital equations
$$x'' + x(x^2 + y^2)^{-3/2} = 0, \qquad y'' + y(x^2 + y^2)^{-3/2} = 0$$
with initial conditions
$$x(0) = 0.5, \quad x'(0) = 0.75, \quad y(0) = 0.25, \quad y'(0) = 1.0$$
a5. Rewrite the following equation as a system of first-order differential equations without t appearing on the right-hand side:
$$x^{(4)} = (x''')^2 + \cos(x'x'') - \sin(tx) + \log\left(\frac{x}{t}\right), \qquad x(0) = 1, \quad x'(0) = 3, \quad x''(0) = 4, \quad x'''(0) = 5$$
a6. Express the system of ordinary differential equations
$$\begin{cases} \dfrac{d^2 z}{dt^2} - 2t\,\dfrac{dz}{dt} = 2t e^{xz} \\[1ex] \dfrac{d^2 x}{dt^2} - 2xz\,\dfrac{dx}{dt} = 3x^2 y t^2 \\[1ex] \dfrac{d^2 y}{dt^2} - e^{y}\,\dfrac{dy}{dt} = 4x t^2 z \end{cases} \qquad z(1) = x'(1) = y'(1) = 2, \quad z'(1) = x(1) = y(1) = 3$$
as a system of first-order ordinary differential equations.
7. Determine a system of first-order equations equivalent to each of the following:
aa. $x''' + x'' \sin x + t x' + x = 0$
b. $x^{(4)} + x'' \cos x' + t x x' = 0$
c. $x'' = 3x^2 - 7y^2 + \sin t + \cos(x'y')$, $\quad y''' = y + x^2 - \cos t - \sin(xy'')$
a8. Consider
$$x'' = x' - x, \qquad x(0) = 0, \quad x'(0) = 1$$
Determine the associated first-order system and its auxiliary initial conditions.
a9. The problem
$$\begin{cases} x''(t) = x + y - 2x' + 3y' + \log t \\ y''(t) = 2x - 3y + 5x' + t y' - \sin t \end{cases} \qquad x(0) = 1, \quad x'(0) = 2, \quad y(0) = 3, \quad y'(0) = 4$$
is to be put into the form of an autonomous system of five first-order equations. Give the resulting system and the appropriate initial values.
10. Write procedure XP System for use with the fourth-order Runge-Kutta routine RK4 System1 for the following differential equation:
$$x''' = 10 e^{x''} - x''' \sin(x'x) - (xt)^{10}, \qquad x(2) = 6.5, \quad x'(2) = 4.1, \quad x''(2) = 3.2$$
11. If we are going to solve the initial-value problem
$$x''' = x' - t x'' + x + \ln t, \qquad x(1) = x'(1) = x''(1) = 1$$
using Runge-Kutta formulas, how should the problem be transformed?
12. Convert this problem involving differential equations into an autonomous system of first-order equations (with initial values):
$$\begin{cases} 3x' + \tan x'' - x^2 = \sqrt{t^2 + 1} + y^2 + (y')^2 \\ -3y' + \cot y'' + y^2 = t^2 + (x + 1)^{1/2} + 4x' \end{cases} \qquad x(1) = 2, \quad x'(1) = -2, \quad y(1) = 7, \quad y'(1) = 3$$
13. Follow the instructions in the preceding problem on this example:
$$\begin{cases} t x y z + x'y'/t = t x^2 + x/y'' + z \\ t^2 x/z + y'z't = y^2 - (z'')^2 x + x'y' \\ t y z - x'z'y' = z^2 - z x'' - (yz)' \end{cases}$$
$$x(3) = 1, \quad y(3) = 2, \quad z(3) = 4, \quad x'(3) = 5, \quad y'(3) = 6, \quad z'(3) = 7$$
14. Turn this pair of differential equations into a second-order differential equation involving x alone:
$$x' = -x + axy, \qquad y' = 3y - xy$$
Computer Problems 11.2
1. Use RK4 System1 to solve each of the following for $0 \le t \le 1$. Use $h = 2^{-k}$ with k = 5, 6, and 7, and compare results.
a. $x'' = 2(e^{2t} - x^2)^{1/2}$, $\quad x(0) = 0$, $\quad x'(0) = 1$
b. $x'' = x^2 - y + e^t$, $\quad y'' = x - y^2 - e^t$, $\quad x(0) = 0$, $\quad x'(0) = 0$, $\quad y(0) = 1$, $\quad y'(0) = -2$
2. Solve the Airy differential equation
$$x'' = tx, \qquad x(0) = 0.35502\,80538\,87817, \quad x'(0) = -0.25881\,94037\,92807$$
on the interval [0, 4.5] using the Runge-Kutta method. Check value: The value $x(4.5) = 0.00033\,02503$ is correct.
3. Solve
$$x'' + x' + x^2 - 2t = 0, \qquad x(0) = 0, \quad x'(0) = 0.1$$
on [0, 3] by any convenient method. If a plotter is available, graph the solution.
4. Solve
$$x'' = 2x' - 5x, \qquad x(0) = 0, \quad x'(0) = 0.4$$
on the interval [−2, 0].
5. Write computer programs based on the pseudocode in the text to find the numerical solution of these ordinary differential equation systems:
a. (1) b. (4) c. (5)
6. (Continuation) Use mathematical software such as Matlab, Maple, or Mathematica with symbolic manipulation capabilities to find their analytical solutions.
7. (Continuation) Use mathematical software routines such as are found in Matlab, Maple, or Mathematica to verify the numerical solutions for these ordinary differential equation systems. Plot the resulting solution curves.
11.3 Adams-Bashforth-Moulton Methods
A Predictor-Corrector Scheme
The procedures explained so far have solved the initial-value problem
$$X' = F(X), \qquad X(a) = S, \text{ given} \tag{1}$$
by means of single-step numerical methods. In other words, if the solution X(t) is known at a particular point t, then X(t + h) can be computed with no knowledge of the solution at points earlier than t. The Runge-Kutta and Taylor series methods compute X(t + h) in terms of X(t) and various values of F.
More efficient methods can be devised if several values $X(t), X(t-h), X(t-2h), \ldots$ are used in computing $X(t+h)$. Such methods are called multistep methods. They have the obvious drawback that at the beginning of the numerical solution, no prior values of X are available. So it is usual to start a numerical solution with a single-step method, such as the Runge-Kutta procedure, and transfer to a multistep procedure for efficiency as soon as enough starting values have been computed.
An example of a multistep formula is known as the Adams-Bashforth method (see Section 10.3 and the related problem). It is
$$\widetilde{X}(t+h) = X(t) + \frac{h}{24}\bigl\{55F[X(t)] - 59F[X(t-h)] + 37F[X(t-2h)] - 9F[X(t-3h)]\bigr\} \tag{2}$$
Here, $\widetilde{X}(t+h)$ is the predicted value of $X(t+h)$ computed by using Formula (2). If the solution X has been computed at the four points t, t − h, t − 2h, and t − 3h, then Formula (2) can be used to compute $\widetilde{X}(t+h)$. If this is done systematically, then only one evaluation of F is required for each step. This represents a considerable saving over the fourth-order Runge-Kutta procedure; the latter requires four evaluations of F per step. (Of course, a consideration of truncation error and stability might permit a larger step size in the Runge-Kutta method and make it much more competitive.)
In practice, Formula (2) is never used by itself. Instead, it is used as a predictor, and then another formula is used as a corrector. The corrector that is usually used with Formula (2) is the Adams-Moulton formula:
$$X(t+h) = X(t) + \frac{h}{24}\bigl\{9F[\widetilde{X}(t+h)] + 19F[X(t)] - 5F[X(t-h)] + F[X(t-2h)]\bigr\} \tag{3}$$
Thus, Equation (2) predicts a tentative value of X(t +h), and Equation (3) computes this X value more accurately. The combination of the two formulas results in a predictor-corrector scheme.
With initial values of X specified at a, three steps of a Runge-Kutta method can be performed to determine enough X values that the Adams-Bashforth-Moulton procedure can begin. The fourth-order Adams-Bashforth and Adams-Moulton formulas, started with the fourth-order Runge-Kutta method, are referred to as the Adams-Moulton method. Predictor and corrector formulas of the same order are used so that only one application of the corrector formula is needed. Some suggest iterating the corrector formula, but experience has demonstrated that the best overall approach is only one application per step.
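The starting procedure and the predictor-corrector pair can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the text's pseudocode: the function names `rk4_step`, `abm4`, and `circle` are hypothetical, all past states are kept in plain lists rather than in a cyclic array, and the test system $x' = y$, $y' = -x$ with $x(0) = 0$, $y(0) = 1$ is chosen because its solution ($\sin t$, $\cos t$) is known.

```python
import math

def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for X' = F(X)."""
    k1 = f(x)
    k2 = f([xi + h / 2 * k for xi, k in zip(x, k1)])
    k3 = f([xi + h / 2 * k for xi, k in zip(x, k2)])
    k4 = f([xi + h * k for xi, k in zip(x, k3)])
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def abm4(f, x0, h, nsteps):
    """Three RK4 starting steps, then Adams-Bashforth predictor (2)
    and Adams-Moulton corrector (3), applied once per step."""
    xs = [list(x0)]
    for _ in range(min(3, nsteps)):
        xs.append(rk4_step(f, xs[-1], h))
    fs = [f(x) for x in xs]                       # saved values of F
    for _ in range(3, nsteps):
        x = xs[-1]
        f0, f1, f2, f3 = fs[-1], fs[-2], fs[-3], fs[-4]
        xp = [xi + h / 24 * (55 * a - 59 * b + 37 * c - 9 * d)
              for xi, a, b, c, d in zip(x, f0, f1, f2, f3)]      # predictor
        fp = f(xp)
        xc = [xi + h / 24 * (9 * p + 19 * a - 5 * b + c)
              for xi, p, a, b, c in zip(x, fp, f0, f1, f2)]      # corrector
        xs.append(xc)
        fs.append(f(xc))
    return xs

def circle(x):
    """Autonomous test system x' = y, y' = -x."""
    return [x[1], -x[0]]

states = abm4(circle, [0.0, 1.0], 0.01, 100)      # integrate from t = 0 to t = 1
x_end, y_end = states[-1]
```

After the starting phase, each step costs two evaluations of F (one at the predicted value and one at the accepted value), compared with four for a Runge-Kutta step.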
Pseudocode
Storage of the approximate solution at previous steps in the Adams-Moulton method is usually handled either by storing in an array of dimension larger than the total number of steps to be taken or by physically shifting data after each step (discarding the oldest data and storing the newest in their place). If an adaptive process is used, the total number of steps to be taken cannot be determined beforehand. Physical shifting of data can be eliminated by cycling the indices of a storage array of fixed dimension. For the Adams-Moulton method, the xi data for X(t) are stored in a two-dimensional array with entries zim in locations m = 1,2,3,4,5,1,2,… for t = a,a + h,a + 2h,a + 3h,a + 4h,a + 5h,a + 6h,…, respectively. The sketch in Figure 11.1 shows the first several t values with corresponding m values and abbreviations for the formulas used.
FIGURE 11.1 Starting values for applications of RK and AB/AM methods
m:        1    2      3       4       5       1       2
t:        a    a+h    a+2h    a+3h    a+4h    a+5h    a+6h
Formula:       RK     RK      RK      AB/AM   AB/AM   AB/AM
An error analysis can be conducted after each step of the Adams-Moulton method. If $x_i^{(p)}$ is the numerical approximation of the ith equation in System (1) at t + h obtained by predictor Formula (2) and $x_i$ is that from corrector Formula (3) at t + h, then it can be shown that the single-step error for the ith component at t + h is given approximately by
$$\varepsilon_i = \frac{19}{270}\,\frac{\bigl|x_i - x_i^{(p)}\bigr|}{|x_i|}$$
So we compute
$$\text{est} = \max_{1 \le i \le n} |\varepsilon_i|$$
in the Adams-Moulton procedure AM System to obtain an estimate of the maximum single-step error at t + h.
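As a small illustration of this estimate, the Python sketch below computes est from a predicted and a corrected vector. The function name `single_step_error` and the sample numbers are assumptions for illustration only; the sketch also assumes that no corrected component is exactly zero, since the formula divides by $|x_i|$.

```python
def single_step_error(predicted, corrected):
    """est = max over i of (19/270) * |x_i - x_i^(p)| / |x_i|.

    Assumes every corrected component is nonzero.
    """
    return max(19.0 / 270.0 * abs(xc - xp) / abs(xc)
               for xp, xc in zip(predicted, corrected))

# Hypothetical predictor and corrector values for a two-component system
est = single_step_error([1.0010, 2.0], [1.0000, 2.0])
```

Here only the first component differs, by 0.001 relative to a corrected value of 1, so est is 19/270 times 0.001.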
A control procedure is needed that calls the Runge-Kutta procedure three times and then calls the Adams-Moulton predictor-corrector scheme to compute the remaining steps. Such a procedure for doing nsteps steps with a fixed step size h follows:
procedure AMRK(n, h, (x_i), nsteps)
integer i, k, m, n; real est, h; real array (x_i)_{0:n}
allocate real array (f_{ij})_{0:n×0:4}, (z_{ij})_{0:n×0:4}
m ← 0
output h
output 0, (x_i)
for i = 0 to n do
    z_{im} ← x_i
end for
for k = 1 to 3 do
    call RK System(m, n, h, (z_{ij}), (f_{ij}))
    output k, (z_{im})
end for
for k = 4 to nsteps do
    call AM System(m, n, h, est, (z_{ij}), (f_{ij}))
    output k, (z_{im})
    output est
end for
for i = 0 to n do
    x_i ← z_{im}
end for
deallocate array (f, z)
end procedure AMRK
The Adams-Moulton method for a system and the computation of the single-step error are accomplished in the following pseudocode:
procedure AM System(m, n, h, est, (z_{ij}), (f_{ij}))
integer i, j, k, m, mp1; real d, dmax, est, h
real array (z_{ij})_{0:n×0:4}, (f_{ij})_{0:n×0:4}
allocate real array (s_i)_{0:n}, (y_i)_{0:n}
real array (a_i)_{1:4} ← (55, −59, 37, −9)
real array (b_i)_{1:4} ← (9, 19, −5, 1)
mp1 ← (1 + m) mod 5
call XP System(n, (z_{im}), (f_{im}))
for i = 0 to n do
    s_i ← 0
end for
for k = 1 to 4 do
    j ← (m − k + 6) mod 5
    for i = 0 to n do
        s_i ← s_i + a_k f_{ij}
    end for
end for
for i = 0 to n do
    y_i ← z_{im} + h s_i/24
end for
call XP System(n, (y_i), (f_{i,mp1}))
for i = 0 to n do
    s_i ← 0
end for
for k = 1 to 4 do
    j ← (mp1 − k + 6) mod 5
    for i = 0 to n do
        s_i ← s_i + b_k f_{ij}
    end for
end for
for i = 0 to n do
    z_{i,mp1} ← z_{im} + h s_i/24
end for
m ← mp1
dmax ← 0
for i = 0 to n do
    d ← |z_{im} − y_i| / |z_{im}|
    if d > dmax then
        dmax ← d
        j ← i
    end if
end for
est ← 19 dmax/270
deallocate array (s, y)
end procedure AM System
Here, the function evaluations are stored cyclically in fim for use by Formulas (2) and (3).
Various optimization techniques are possible in this pseudocode. For example, the programmer may wish to move the computation of $\tfrac{1}{24}h$ outside of the loops.
A companion Runge-Kutta procedure is needed, which is a modification of procedure RK4 System2 from Section 11.1:
procedure RK System(m, n, h, (z_{ij}), (f_{ij}))
integer i, m, mp1, n; real h; real array (z_{ij})_{0:n×0:4}, (f_{ij})_{0:n×0:4}
allocate real array (g_{ij})_{0:n×0:3}, (y_i)_{0:n}
mp1 ← (1 + m) mod 5
call XP System(n, (z_{im}), (f_{im}))
for i = 0 to n do
    y_i ← z_{im} + (1/2) h f_{im}
end for
call XP System(n, (y_i), (g_{i,1}))
for i = 0 to n do
    y_i ← z_{im} + (1/2) h g_{i,1}
end for
call XP System(n, (y_i), (g_{i,2}))
for i = 0 to n do
    y_i ← z_{im} + h g_{i,2}
end for
call XP System(n, (y_i), (g_{i,3}))
for i = 0 to n do
    z_{i,mp1} ← z_{im} + h[f_{im} + 2g_{i,1} + 2g_{i,2} + g_{i,3}]/6
end for
m ← mp1
deallocate array (g_{ij}), (y_i)
end procedure RK System
As before, the programmer may wish to move $\tfrac{1}{6}h$ out of the loop.
To use the Adams-Moulton pseudocode, we supply the procedure XP System that defines the system of ordinary differential equations and write a driver program with a call to procedure AMRK. The complete program then consists of the following five parts: the main program and procedures XP System, AMRK, RK System, and AM System.
As an illustration, the pseudocode for the last example in Section 11.2 (p. 479) is as follows:
program Test AMRK
real h; real array (x_i)_{0:n}
integer n ← 5, nsteps ← 100
real a ← 0, b ← 1
(x_i) ← (1, 2, −4, −2, 7, 6)
h ← (b − a)/nsteps
call AMRK(n, h, (x_i), nsteps)
end program Test AMRK
procedure XP System(n, (x_i), (f_i))
integer n; real array (x_i)_{0:n}, (f_i)_{0:n}
f_0 ← 1
f_1 ← x_2
f_2 ← x_1 − x_3 − 9x_2^2 + x_4^3 + 6x_5 + 2x_0
f_3 ← x_4
f_4 ← x_5
f_5 ← x_5 − x_2 + e^{x_1} − x_0
end procedure XP System
Here, we have programmed this procedure for an autonomous system of ordinary differential equations.
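For readers working in a language with list or vector types, the same right-hand side can be written as a single function. The following Python sketch mirrors procedure XP System for the autonomous form of system (5); the function name `xp_system` and the variable `X_init` are hypothetical stand-ins for the pseudocode names.

```python
import math

def xp_system(x):
    """Right-hand side F(X) of the autonomous form of system (5).

    x = [x0, x1, x2, x3, x4, x5], where x0 carries the independent variable t.
    """
    x0, x1, x2, x3, x4, x5 = x
    return [
        1.0,                                                      # x0' = 1
        x2,                                                       # x1' = x2
        x1 - x3 - 9.0 * x2**2 + x4**3 + 6.0 * x5 + 2.0 * x0,      # x2'
        x4,                                                       # x3' = x4
        x5,                                                       # x4' = x5
        x5 - x2 + math.exp(x1) - x0,                              # x5'
    ]

X_init = [1.0, 2.0, -4.0, -2.0, 7.0, 6.0]   # X(1) from the text
slopes = xp_system(X_init)
```

Evaluating the function once at the initial vector is a cheap way to catch transcription errors before a full integration is attempted.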
An Adaptive Scheme
Since an estimate of the error is available from the Adams-Moulton method, it is natural to replace procedure AMRK with one that employs an adaptive scheme—that is, one that changes the step size. A procedure similar to the one used in Section 10.3 is outlined here. The Runge-Kutta method is used to compute the first three steps, and then the Adams- Moulton method is used. If the error test determines that halving or doubling of the step size is necessary in the first step using the Adams-Moulton method, then the step size is halved or doubled, and the whole process starts again with the initial values—so at least one step of the Adams-Moulton method must take place. If during this process the error test indicates that halving is required at some point within the interval [a, b], then the step size is halved. A retreat is made back to a previously computed value, and after three Runge-Kutta steps have been computed, the process continues, using the Adams-Moulton method again but with the new step size. In other words, the point at which the error was too large should be computed by the Adams-Moulton method, not the Runge-Kutta method. Doubling the step size is handled in an analogous manner. Doubling the step size requires only saving an appropriate number of previous values; however, one can simplify this process (whether halving or doubling the step size) by always backing up two steps with the old step size and then using this as the beginning point of a new initial-value problem with the new step size. Other, more complicated procedures can be designed and can be the subject of numerical experimentation. (See Computer Problem 11.3.3.)
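The halving and doubling decision itself is simple to express. The Python sketch below shows only that decision; the restart logic (backing up and recomputing three Runge-Kutta steps) is deliberately left out. The function name `choose_step` is hypothetical, and the parameter names follow the calling sequence suggested in Computer Problem 11.3.3.

```python
def choose_step(h, est, eps_min, eps_max, h_min, h_max):
    """Return (new_h, action), where action is 'halve', 'double', or 'keep'.

    est is the single-step error estimate from the Adams-Moulton step.
    """
    if est > eps_max and h / 2 >= h_min:
        return h / 2, "halve"       # error too large: retreat and restart with h/2
    if est < eps_min and 2 * h <= h_max:
        return 2 * h, "double"      # error very small: restart with 2h
    return h, "keep"

# Hypothetical tolerances and step-size limits
decision = choose_step(0.1, 1e-3, 1e-8, 1e-5, 1e-4, 1.0)
```

In a full implementation, a "halve" or "double" result would trigger the retreat to a previously computed value described above before the Adams-Moulton formulas are used again.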
An Engineering Example
In chemical engineering, a complicated production activity may involve several reactors connected with inflow and outflow pipes. The concentration of a certain chemical in the ith reactor is an unknown quantity, $x_i$. Each $x_i$ is a function of time. If there are n reactors, the whole process is governed by a system of n differential equations of the form
$$X' = AX + V, \qquad X(0) = S, \text{ given}$$
where X is the vector containing the unknown quantities xi , A is an n × n matrix, and V is a constant vector. The entries in A depend on the flow rates permitted between different reactors of the system.
There are several approaches to solving this problem. One is to diagonalize the matrix A by finding a nonsingular matrix P for which $P^{-1}AP$ is diagonal and then using the matrix exponential function to solve the system in an analytic form. This is a task that mathematical software can handle. On the other hand, we can simply turn the problem over to an ODE solver and get the numerical solution. One piece of information that is always wanted in such a problem is a description of the steady state of the system. That means the values of all variables at $t = \infty$. Each function $x_i$ should be a linear combination of exponential functions of the form $t \mapsto e^{\lambda t}$, in which $\lambda < 0$. Here is a simple example that can illustrate all of this:
$$\begin{bmatrix} x_1' \\ x_2' \\ x_3' \end{bmatrix} = \begin{bmatrix} -8/3 & -4/3 & 1 \\ -17/3 & -4/3 & 1 \\ -35/3 & 14/3 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} 12 \\ 29 \\ 48 \end{bmatrix} \tag{4}$$
Using mathematical software such as Matlab, Maple, or Mathematica, we can obtain a closed-form solution:
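$$x(t) = \tfrac{1}{6} e^{-3t}\bigl(6 - 50e^{t} + 10e^{2t} + 34e^{3t}\bigr)$$
$$y(t) = \tfrac{1}{6} e^{-3t}\bigl(12 - 125e^{t} + 40e^{2t} + 73e^{3t}\bigr)$$
$$z(t) = \tfrac{1}{6} e^{-3t}\bigl(14 - 200e^{t} + 70e^{2t} + 116e^{3t}\bigr)$$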
For a system of ordinary differential equations with a large number of variables, it may be more convenient to represent them in a computer program with an array such as x(i,t) rather than by separate variable names. To see the numerical value of the analytic solution at a single point, say, t = 2.5, we obtain x(2.5) ≈ 5.74788, y(2.5) ≈ 12.5746, z(2.5) ≈ 20.0677. Also, we can produce a graph of the analytic solution to the problem.
Finally, the programs presented in this section can be used to generate a numerical solution on a prescribed interval with a prescribed number of points.
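As a quick numerical cross-check of the closed-form values quoted above, the following Python sketch integrates system (4) with the classical Runge-Kutta method. It assumes the initial vector X(0) = [0, 0, 0], which is what the closed-form solution gives at t = 0; the names `A`, `V`, `f`, and `rk4_step` and the step size h = 0.01 are choices made for this illustration.

```python
A = [[-8.0 / 3.0, -4.0 / 3.0, 1.0],
     [-17.0 / 3.0, -4.0 / 3.0, 1.0],
     [-35.0 / 3.0, 14.0 / 3.0, -2.0]]
V = [12.0, 29.0, 48.0]

def f(x):
    """Right-hand side AX + V of system (4)."""
    return [sum(a * xj for a, xj in zip(row, x)) + v for row, v in zip(A, V)]

def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for X' = F(X)."""
    k1 = f(x)
    k2 = f([xi + h / 2 * k for xi, k in zip(x, k1)])
    k3 = f([xi + h / 2 * k for xi, k in zip(x, k2)])
    k4 = f([xi + h * k for xi, k in zip(x, k3)])
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

x = [0.0, 0.0, 0.0]        # closed-form solution evaluated at t = 0
h, nsteps = 0.01, 250      # integrate from t = 0 to t = 2.5
for _ in range(nsteps):
    x = rk4_step(f, x, h)

steady = [34.0 / 6.0, 73.0 / 6.0, 116.0 / 6.0]   # limits of x, y, z as t grows
residual = f(steady)                              # should vanish at a steady state
```

The steady state is read off from the constant terms of the closed-form solution, and the residual confirms that it satisfies AX + V = 0.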
Some Remarks about Stiff Equations
In many applications of differential equations there are several functions to be tracked together as functions of time. A system of ordinary differential equations may be used to model the physical phenomena. In such a situation, it can happen that different solution functions (or different components of a single solution) have quite disparate behavior that makes the selection of the step size in the numerical solution problematic. For example, one component of a function may require a small step in the numerical solution because it is varying rapidly, whereas another component may vary slowly and not require a small step size for its computation. Such a system is said to be stiff. Figure 11.2 illustrates a slowly varying solution surrounded by other solutions with rapidly decaying transients.
An example will illustrate this possibility. Consider a system of two differential equa- tions with initial conditions:
$$\begin{cases} x' = -20x - 19y, & x(0) = 2 \\ y' = -19x - 20y, & y(0) = 0 \end{cases} \tag{5}$$
FIGURE 11.2 Solution curves for a stiff ODE (x plotted against t)
The solution is easily seen to be
$$x(t) = e^{-39t} + e^{-t}$$
$$y(t) = e^{-39t} - e^{-t}$$
The component $e^{-39t}$ quickly becomes negligible as t increases, starting at 0. The solution is then approximately given by $x(t) = -y(t) = e^{-t}$, and this function is smooth and decreasing to 0. It would seem that in almost any numerical solution, a large step size could be used. However, let us examine the simplest of numerical procedures: Euler's method. It generates the solution by using the following equations:
$$x_{n+1} = x_n + h(-20x_n - 19y_n), \qquad x_0 = 2$$
$$y_{n+1} = y_n + h(-19x_n - 20y_n), \qquad y_0 = 0$$
These difference equations can be solved in closed form, and we have
$$x_n = (1 - 39h)^n + (1 - h)^n$$
$$y_n = (1 - 39h)^n - (1 - h)^n$$
For the numerical solution to converge to 0 (and thus imitate the actual solution), it is necessary that $h < \tfrac{2}{39}$. If we were solving only the differential equation $x' = -x$ to get the solution $x(t) = e^{-t}$, the step size could be nearly as large as h = 2 and still give the correct behavior as t increased. (See Problem 11.3.2.)
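The step-size restriction is easy to observe numerically. The Python sketch below applies Euler's method to system (5) with one step size just above $2/39 \approx 0.0513$ and one below it; the function name `euler` and the particular values h = 0.06 and h = 0.04 are assumptions chosen for illustration.

```python
def euler(h, nsteps):
    """Explicit Euler for x' = -20x - 19y, y' = -19x - 20y, x(0) = 2, y(0) = 0."""
    x, y = 2.0, 0.0
    for _ in range(nsteps):
        # both right-hand sides use the old values of x and y
        x, y = x + h * (-20 * x - 19 * y), y + h * (-19 * x - 20 * y)
    return x, y

x_bad, y_bad = euler(0.06, 100)   # h > 2/39: the factor |1 - 39h| = 1.34 exceeds 1
x_ok, y_ok = euler(0.04, 100)     # h < 2/39: the factor |1 - 39h| = 0.56 is below 1

# closed-form value of x_n for comparison with the stable run
x_closed = (1 - 39 * 0.04) ** 100 + (1 - 0.04) ** 100
```

With h = 0.06 the computed values grow without bound even though the true solution decays, whereas with h = 0.04 the iterates follow the closed-form expression for $x_n$.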
To see that numerical success (in the sense of being able to use a reasonable step size) depends on the method used, let us consider the implicit Euler method. For a single differential equation, this employs the formula
$$x_{n+1} = x_n + h f(t_{n+1}, x_{n+1})$$
Since $x_{n+1}$ appears on both sides of this equation, the equation must be solved for $x_{n+1}$. In the example being considered, the implicit Euler equations are
$$x_{n+1} = x_n + h(-20x_{n+1} - 19y_{n+1})$$
$$y_{n+1} = y_n + h(-19x_{n+1} - 20y_{n+1})$$
This pair of equations has the form $X_{n+1} = X_n + AX_{n+1}$, where A is the 2 × 2 matrix in the previous pair of equations (with the factor h absorbed) and $X_n$ is the vector having components $x_n$ and $y_n$. This matrix equation can be written $(I - A)X_{n+1} = X_n$ or $X_{n+1} = (I - A)^{-1}X_n$. A consequence is that the explicit solution is $X_n = (I - A)^{-n}X_0$. At this point, it is necessary to appeal to a result concerning such iterative processes. For $X_n$ to converge to 0 for any choice of initial vector $X_0$, it is necessary and sufficient that all eigenvalues of $(I - A)^{-1}$ be less than one in modulus (see Kincaid and Cheney [2002]). Equivalently, the eigenvalues of $I - A$ should be greater than 1 in modulus. An easy calculation shows that for positive h this condition is met, without further hypotheses: the eigenvalues of $I - A$ are $1 + 39h$ and $1 + h$. Thus, the implicit Euler method can be used with any reasonable step size on this problem. In the literature on stiff equations, much more information can be found, and there are books that address this topic thoroughly. Some essential references are Dekker and Verwer [1984], Gear [1971], Miranker [1981], and Shampine and Gordon [1975].
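For comparison with the explicit method, the implicit Euler iteration on system (5) can be sketched in Python as follows. The 2 × 2 linear system is solved by Cramer's rule at each step; the function name `implicit_euler`, the helper variables `a`, `b`, and `det`, and the deliberately large step size h = 0.5 are assumptions made for this illustration.

```python
def implicit_euler(h, nsteps):
    """Implicit Euler for x' = -20x - 19y, y' = -19x - 20y, x(0) = 2, y(0) = 0.

    Each step solves [[a, b], [b, a]] [x_new, y_new]^T = [x, y]^T,
    where a = 1 + 20h and b = 19h.
    """
    x, y = 2.0, 0.0
    a, b = 1 + 20 * h, 19 * h
    det = a * a - b * b                  # equals (1 + 39h)(1 + h), positive for h > 0
    for _ in range(nsteps):
        x, y = (a * x - b * y) / det, (a * y - b * x) / det
    return x, y

h, n = 0.5, 20
x_n, y_n = implicit_euler(h, n)

# values implied by X_n = (I - A)^(-n) X_0 with eigenvalues 1 + 39h and 1 + h
x_closed = (1 + 39 * h) ** (-n) + (1 + h) ** (-n)
y_closed = (1 + 39 * h) ** (-n) - (1 + h) ** (-n)
```

Even with a step size almost ten times larger than the explicit limit $2/39$, the iterates decay toward 0, as the eigenvalue argument predicts.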
In general, stiff ordinary differential equations are rather difficult to solve. This is compounded by the fact that in most cases, one does not know beforehand whether an ordinary differential equation that one is trying to solve numerically is stiff. Software packages usually have ordinary differential equation solvers specifically designed to handle stiff ordinary differential equations. Some of these procedures may vary both the step size and the order of the method. In such algorithms, the Jacobian matrix $\partial F/\partial X$ may play a role. Solving an associated linear system involving the Jacobian matrix is critical to the reliability and efficiency of the code. The Jacobian matrix may be sparse, an indication that the function F does not depend on some of the variables in the problem.
For readers who are interested in the history of numerical analysis, we recommend the book by Goldstine [1977]. The textbook on differential equations by Moulton [1930] gives some insight into the numerical methods used prior to the advent of high-speed computing machines. He also (page 224) gives some of the history, going back to Newton! The calculation of orbits in celestial mechanics has always been a stimulus for the invention of numerical methods; so also have been the needs of ballistic science. Moulton mentions that the retardation of a projectile by air resistance is a very complicated function of velocity that necessitates numerical solution of the otherwise simple equations of ballistics.
Summary
(1) For the autonomous form for a system of ordinary differential equations in vector notation
$$X' = F(X), \qquad X(a) = S, \text{ given}$$
the Adams-Bashforth-Moulton method of fourth order is
$$\widetilde{X}(t+h) = X(t) + \frac{h}{24}\bigl\{55F[X(t)] - 59F[X(t-h)] + 37F[X(t-2h)] - 9F[X(t-3h)]\bigr\}$$
$$X(t+h) = X(t) + \frac{h}{24}\bigl\{9F[\widetilde{X}(t+h)] + 19F[X(t)] - 5F[X(t-h)] + F[X(t-2h)]\bigr\}$$
Here, $\widetilde{X}(t+h)$ is the predictor, and $X(t+h)$ is the corrector. The Adams-Bashforth-Moulton formulas use five values of F per step; when earlier values are saved, only two of them are new evaluations (as in procedure AM System, which calls XP System twice per step). With the initial vector X(a) given, the values for X(a + h), X(a + 2h), X(a + 3h) are computed by the Runge-Kutta method of fourth order. Then the Adams-Bashforth-Moulton method can be used repeatedly. The predicted value $\widetilde{X}(t+h)$ is computed from the four X values at t, t − h, t − 2h, and t − 3h, and then the corrected value $X(t+h)$ can be computed by using the predictor value $\widetilde{X}(t+h)$ and previously evaluated values of F at t, t − h, and t − 2h.
Additional References
See Aiken [1985], Ascher and Petzold [1998], Boyce and DiPrima [2003], Butcher [1987], Carrier and Pearson [1991], Chicone [2006], Collatz [1966], Dekker and Verwer [1984], Edwards and Penny [2004], England [1969], Enright [2006], Fehlberg [1969], Gear [1971], Golub and Ortega [1992], Henrici [1962], Hull et al. [1972], Hundsdorfer [1985], Lambert [1973, 1991], Lapidus and Seinfeld [1971], Miranker [1981], Moulton [1930], and Shampine and Gordon [1975].
Problems 11.3
a1. Find the general solution of this system by turning it into a first-order system of four equations:
$$x'' = \alpha y, \qquad y'' = \beta x$$
2. Verify the assertions made about the step size h in the discussion of stiff equations.
Computer Problems 11.3
1. Test the procedure AMRK on the system given in Computer Problem 11.2.2.
2. The single-step error is closely controlled by using fourth-order formulas; however, the roundoff error in performing the computations in Equations (2) and (3) can be large. It is logical to carry these out in what is known as partial double-precision arithmetic. The function F would be evaluated in single precision at the desired points $X(t + ih)$, but the linear combination $\sum_i c_i F(X(t + ih))$ would be accumulated in double precision. Also, the addition of X(t) to this result is done in double precision. Recode the Adams-Moulton method so that partial double-precision arithmetic is used. Compare this code with that in the text for a system with a known solution. How do they compare with regard to roundoff error at each step?
3. Write and test an adaptive process similar to RK45 Adaptive in Section 10.3 with calling sequence
procedure AMRK Adaptive(n, h, ta, tb, (x_i), itmax, εmin, εmax, hmin, hmax, iflag)
This routine should carry out the adaptive procedure outlined in this section and be used in place of the AMRK procedure.
4. Solve the predator-prey problem in the example at the beginning of this chapter with $a = -10^{-2}$, $b = -\tfrac{1}{4} \times 10^{2}$, $c = 10^{-2}$, and $d = -10^{2}$ and with initial values u(0) = 80, v(0) = 30. Plot u (the prey) and v (the predator) as functions of time t.
5. Solve and plot the numerical solution of the system of ordinary differential equa- tions given by Equation (4) using mathematical software such as Matlab, Maple, or Mathematica.
6. (Continuation) Repeat for Equation (5) using a routine specifically designed to handle stiff ordinary differential equations.
7. Solve the following test problems and plot their solution curves.
a. This problem corresponds to a recently discovered stable orbit that arises in the restricted three-body problem in which the orbits are co-planar. The two spatial coordinates of the jth body are $x_{1j}$ and $x_{2j}$ for j = 1, 2, 3. Each of the six coordinates satisfies a second-order differential equation:
$$x_{ij}'' = \sum_{\substack{k=1 \\ k \ne j}}^{3} m_k \,\frac{x_{ik} - x_{ij}}{d_{jk}^{3}}$$
where $d_{jk}^{2} = \sum_{i=1}^{2} (x_{ij} - x_{ik})^2$ for k, j = 1, 2, 3. Assume that the bodies have equal mass, say, $m_1 = m_2 = m_3 = 1$, and with the appropriate starting conditions, they will follow the same figure-eight orbit as a periodic steady-state solution. When the system is rewritten as a first-order system, the dimension of the problem is 12, and the initial conditions at t = 0 are given by
$x_{11} = -0.97000436$     $x_{11}' = 0.466203685$
$x_{21} = 0.24308753$      $x_{21}' = 0.43236573$
$x_{12} = 0.0$             $x_{12}' = -0.93240737$
$x_{22} = 0.0$             $x_{22}' = -0.86473146$
$x_{13} = 0.97000436$      $x_{13}' = 0.466203685$
$x_{23} = -0.24308753$     $x_{23}' = 0.43236573$
Solve the problem for $t \in [0, 20]$.
b. The Lorenz problem is well known, and it arises in the study of dynamical systems:
$$\begin{cases} x_1' = 10(x_2 - x_1) \\ x_2' = x_1(28 - x_3) - x_2 \\ x_3' = x_1 x_2 - \tfrac{8}{3} x_3 \end{cases} \qquad x_1(0) = 15, \quad x_2(0) = 15, \quad x_3(0) = 36$$
Solve the problem for $t \in [0, 20]$. It is known to have solutions that are potentially poorly conditioned.
For additional details on these problems, see Enright [2006].
8. Write a computer program based on pseudocode Test AMRK to find the numerical solution to the ordinary differential equation systems, and compare the results with that by using a built-in routine such as can be found in Matlab, Maple, or Mathematica. Plot the resulting solution curves.
9. (Tacoma Narrows Bridge project) In 1940, the third longest suspension bridge in the world collapsed in a high wind. The following system of differential equations is a mathematical model that attempts to explain how twisting oscillations can be magnified
and cause such a calamity:
$$y'' = -y'd - \frac{K}{ma}\left[e^{a(y - l\sin\theta)} - 1 + e^{a(y + l\sin\theta)} - 1\right] + 0.2\,W \sin\omega t$$
$$\theta'' = -\theta' d + \frac{3\cos\theta}{l}\,\frac{K}{ma}\left[e^{a(y - l\sin\theta)} - e^{a(y + l\sin\theta)}\right]$$
The last term in the y equation is the forcing term for the wind W, which adds a strictly vertical oscillation to the bridge. Here, the roadway has width 2l hanging between two suspended cables, y is the current distance from the center of the roadway as it hangs below its equilibrium point, and θ is the angle the roadway makes with the horizontal. Also, Newton's law F = ma is used, and K is Hooke's constant. Explore how ODE solvers are used to generate numerical trajectories for various parameter settings. Illustrate different types of phenomena that are available in this model. For additional details, see McKenna and Tuama [2001] and Sauer [2006].
12
Smoothing of Data and the Method of Least Squares
Surface tension S in a liquid is known to be a linear function of temperature T . For a particular liquid, measurements have been made of the surface tension at certain temperatures. The results were as follows:
T |    0    10    20    30    40    80    90    95
S | 68.0  67.1  66.4  65.6  64.6  61.8  61.0  60.0
How can the most probable values of the constants in the equation
S = aT + b
be determined? Methods for solving such problems are developed in this
chapter.
12.1 Method of Least Squares
Linear Least Squares
In experimental, social, and behavioral sciences, an experiment or survey often produces a mass of data. To interpret the data, the investigator may resort to graphical methods. For instance, an experiment in physics might produce a numerical table of the form
x | x0  x1  ···  xm
y | y0  y1  ···  ym        (1)
and from it, m + 1 points on a graph could be plotted. Suppose that the resulting graph looks like Figure 12.1. A reasonable tentative conclusion is that the underlying function is linear and that the failure of the points to fall precisely on a straight line is due to experimental error. If one proceeds on this assumption—or if theoretical reasons exist for believing that the function is indeed linear—the next step is to determine the correct function. Assuming that
y = ax + b
what are the coefficients a and b? Thinking geometrically, we ask: What line most nearly passes through the eight points plotted?
FIGURE 12.1 Experimental data: the eight points (x0, y0), …, (x7, y7) plotted in the xy-plane
To answer this question, suppose that a guess is made about the correct values of a and b. This is equivalent to deciding on a specific line to represent the data. In general, the data points will not fall on the line y = ax + b. If by chance the kth datum falls on the line, then
\[ a x_k + b - y_k = 0 \]
If it does not, then there is a discrepancy or error of magnitude
\[ |a x_k + b - y_k| \]
The total absolute error for all m + 1 points is therefore
\[ \sum_{k=0}^{m} |a x_k + b - y_k| \]
This is a function of a and b, and it would be reasonable to choose a and b so that the function assumes its minimum value. This problem is an example of l1 approximation and can be solved by the techniques of linear programming, a subject dealt with in Chapter 17. (The methods of calculus do not work on this function because it is not generally differentiable.)
In practice, it is common to minimize a different error function of a and b:
\[ \varphi(a, b) = \sum_{k=0}^{m} (a x_k + b - y_k)^2 \qquad (2) \]
This function is suitable because of statistical considerations. Explicitly, if the errors follow a normal probability distribution, then the minimization of φ produces a best estimate of a and b. This is called an l2 approximation. Another advantage is that the methods of calculus can be used on Equation (2).
The l1 and l2 approximations are related to specific cases of the lp norm defined by
\[ \|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \qquad (1 \le p < \infty) \]
for the vector x = [x1, x2, ..., xn]T.
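As a small illustration of the lp norm just defined, the following Python sketch evaluates it for p = 1 and p = 2 on a made-up residual vector. The function name p_norm and the sample values are assumptions chosen only for this illustration; they do not come from the text.

```python
def p_norm(x, p):
    """Return the l_p norm (sum of |x_i|^p)^(1/p) for 1 <= p < infinity."""
    if p < 1:
        raise ValueError("p must satisfy 1 <= p < infinity")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


# Hypothetical residual vector (not data from the text).
residuals = [3.0, -4.0]
l1 = p_norm(residuals, 1)  # sum of absolute values
l2 = p_norm(residuals, 2)  # square root of the sum of squares
print(l1, l2)
```

Minimizing the l1 quantity over a and b leads to the linear programming problem mentioned above, whereas minimizing the square of the l2 quantity is the least-squares problem of Equation (2).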
Let us try to make φ(a, b) a minimum. By calculus, the conditions
\[ \frac{\partial\varphi}{\partial a} = 0 \qquad \frac{\partial\varphi}{\partial b} = 0 \]
(partial derivatives of φ with respect to a and b, respectively) are necessary at the minimum. Taking derivatives in Equation (2), we obtain
\[
\begin{cases}
\displaystyle\sum_{k=0}^{m} 2(a x_k + b - y_k)\,x_k = 0 \\[2ex]
\displaystyle\sum_{k=0}^{m} 2(a x_k + b - y_k) = 0
\end{cases}
\]
This is a pair of simultaneous linear equations in the unknowns a and b. They are called the normal equations and can be written as
\[
\begin{cases}
\displaystyle\Big(\sum_{k=0}^{m} x_k^2\Big) a + \Big(\sum_{k=0}^{m} x_k\Big) b = \sum_{k=0}^{m} y_k x_k \\[2ex]
\displaystyle\Big(\sum_{k=0}^{m} x_k\Big) a + (m + 1)\, b = \sum_{k=0}^{m} y_k
\end{cases}
\qquad (3)
\]
Here, of course, Σ_{k=0}^{m} 1 = m + 1, which is the number of data points. To simplify the notation, we set
\[ p = \sum_{k=0}^{m} x_k \qquad q = \sum_{k=0}^{m} y_k \qquad r = \sum_{k=0}^{m} x_k y_k \qquad s = \sum_{k=0}^{m} x_k^2 \]
The system of Equations (3) is now
\[ \begin{bmatrix} s & p \\ p & m+1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ q \end{bmatrix} \]
We solve this pair of equations by Gaussian elimination and obtain the following algorithm. Alternatively, since this is a 2 × 2 linear system, we can use Cramer’s Rule∗ to solve it. The determinant of the coefficient matrix is
\[ d = \operatorname{Det}\begin{bmatrix} s & p \\ p & m+1 \end{bmatrix} = (m + 1)s - p^2 \]
Moreover, we obtain
\[
a = \frac{1}{d}\operatorname{Det}\begin{bmatrix} r & p \\ q & m+1 \end{bmatrix} = \frac{1}{d}\,[(m + 1)r - pq]
\qquad
b = \frac{1}{d}\operatorname{Det}\begin{bmatrix} s & r \\ p & q \end{bmatrix} = \frac{1}{d}\,[sq - pr]
\]
We can write this as an algorithm:
■ ALGORITHM 1 Linear Least Squares
The coefficients in the least-squares line y = ax + b through the set of m + 1 data points (xk, yk) for k = 0, 1, 2, ..., m are computed (in order) as follows:
1. p = Σ_{k=0}^{m} xk
2. q = Σ_{k=0}^{m} yk
3. r = Σ_{k=0}^{m} xk yk
4. s = Σ_{k=0}^{m} xk²
5. d = (m + 1)s − p²
6. a = [(m + 1)r − pq]/d
7. b = [sq − pr]/d
∗Cramer’s Rule is given in Appendix D.
Another form of this result is
\[
\begin{aligned}
a &= \frac{1}{d}\left[(m + 1)\sum_{k=0}^{m} x_k y_k - \Big(\sum_{k=0}^{m} x_k\Big)\Big(\sum_{k=0}^{m} y_k\Big)\right] \\
b &= \frac{1}{d}\left[\Big(\sum_{k=0}^{m} x_k^2\Big)\Big(\sum_{k=0}^{m} y_k\Big) - \Big(\sum_{k=0}^{m} x_k\Big)\Big(\sum_{k=0}^{m} x_k y_k\Big)\right]
\end{aligned}
\qquad (4)
\]
where
\[ d = (m + 1)\sum_{k=0}^{m} x_k^2 - \Big(\sum_{k=0}^{m} x_k\Big)^2 \]
Linear Example
The preceding analysis illustrates the least-squares procedure in the simple linear case.
EXAMPLE 1
As a concrete example, find the linear least-squares solution for the following table of values:
x | 4  7  11  13  17
y | 2  0   2   6   7
Plot the original data points and the line using a finer set of grid points.
Solution
The equations in Algorithm 1 lead to this system of two equations:
\[
\begin{cases}
644a + 52b = 227 \\
\phantom{6}52a + \phantom{5}5b = 17
\end{cases}
\]
whose solution is a = 0.4864 and b = −1.6589. By Equation (2), we obtain the value φ(a, b) = 10.7810. Figure 12.2 is a plot of the given data and the linear least-squares straight line.
FIGURE 12.2 Linear least squares
■
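The steps of Algorithm 1 can be sketched in a few lines of Python and applied to the table of Example 1. This is only an illustrative sketch: the function names linear_least_squares and phi are assumptions introduced here, not routines from the text.

```python
def linear_least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b (Algorithm 1)."""
    count = len(xs)                          # this is m + 1 in the text
    p = sum(xs)                              # step 1
    q = sum(ys)                              # step 2
    r = sum(x * y for x, y in zip(xs, ys))   # step 3
    s = sum(x * x for x in xs)               # step 4
    d = count * s - p * p                    # step 5
    if d == 0:
        raise ValueError("d = 0: at least two distinct abscissas are needed")
    a = (count * r - p * q) / d              # step 6
    b = (s * q - p * r) / d                  # step 7
    return a, b


def phi(a, b, xs, ys):
    """Sum of squared errors from Equation (2)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


# Table of values from Example 1.
xs = [4, 7, 11, 13, 17]
ys = [2, 0, 2, 6, 7]
a, b = linear_least_squares(xs, ys)
error = phi(a, b, xs, ys)
print(a, b, error)
```

For this table the sums are p = 52, q = 17, r = 227, and s = 644, which are the coefficients of the 2 × 2 system in the example.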
We can use mathematical software such as Matlab, Maple, or Mathematica to fit a linear least-squares polynomial to the data and verify the value of φ. (See Computer Problem 12.1.5.)
To understand what is going on here, we want to determine the equation of a line of the form y = ax + b that fits the data best in the least-squares sense. With four data points (xi, yi), we have four equations yi = axi + b for i = 1, 2, 3, 4 that can be written as
\[ Ax = y \]
where
\[
\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
\]
In general, we want to solve a linear system
\[ Ax = b \]
where A is an m × n matrix and m > n. The least-squares solution coincides with the solution of the normal equations
\[ A^T A x = A^T b \]
This corresponds to minimizing ||Ax − b||2.
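To make the matrix form concrete, here is a minimal Python sketch that forms AT A and AT b for a small overdetermined system and solves the resulting square system by Gaussian elimination with partial pivoting. The four data points and the helper names solve_linear_system and least_squares_via_normal_equations are assumptions made up for this illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares_via_normal_equations(A, b):
    """Form A^T A and A^T b, then solve the n-by-n normal equations."""
    rows, cols = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(rows)) for j in range(cols)]
           for i in range(cols)]
    Atb = [sum(A[k][i] * b[k] for k in range(rows)) for i in range(cols)]
    return solve_linear_system(AtA, Atb)


# Hypothetical data: four points (x_i, y_i); each row of A is [x_i, 1].
A = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 4.0, 8.0]
slope, intercept = least_squares_via_normal_equations(A, y)
print(slope, intercept)
```

Here AT A is the 2 × 2 matrix with entries 14, 6, 6, 4 and AT y is [35, 16], so the sketch reproduces the system of Equations (3) for this data.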
Nonpolynomial Example
The method of least squares is not restricted to linear (first-degree) polynomials or to any specific functional form. Suppose, for instance, that we want to fit a table of values (xk , yk ), where k = 0,1,…,m, by a function of the form
\[ y = a \ln x + b \cos x + c e^{x} \]
in the least-squares sense. The unknowns in this problem are the three coefficients a, b, and c. We consider the function
\[ \varphi(a, b, c) = \sum_{k=0}^{m} \left( a \ln x_k + b \cos x_k + c e^{x_k} - y_k \right)^2 \]
and set ∂φ/∂a = 0, ∂φ/∂b = 0, and ∂φ/∂c = 0. This results in the following three normal equations:
\[
\begin{cases}
\displaystyle a \sum_{k=0}^{m} (\ln x_k)^2 + b \sum_{k=0}^{m} (\ln x_k)(\cos x_k) + c \sum_{k=0}^{m} (\ln x_k)\,e^{x_k} = \sum_{k=0}^{m} y_k \ln x_k \\[2ex]
\displaystyle a \sum_{k=0}^{m} (\ln x_k)(\cos x_k) + b \sum_{k=0}^{m} (\cos x_k)^2 + c \sum_{k=0}^{m} (\cos x_k)\,e^{x_k} = \sum_{k=0}^{m} y_k \cos x_k \\[2ex]
\displaystyle a \sum_{k=0}^{m} (\ln x_k)\,e^{x_k} + b \sum_{k=0}^{m} (\cos x_k)\,e^{x_k} + c \sum_{k=0}^{m} (e^{x_k})^2 = \sum_{k=0}^{m} y_k e^{x_k}
\end{cases}
\]
EXAMPLE 2
Fit a function of the form y = a ln x + b cos x + c e^x to the following table values:
x | 0.24   0.65   0.95   1.24  1.73  2.01   2.23  2.52  2.77  2.99
y | 0.23  −0.26  −1.10  −0.45  0.27  0.10  −0.29  0.24  0.56  1.00
Solution
Using the table and the equations above, we obtain the 3 × 3 system
\[
\begin{cases}
\phantom{-}6.79410a - \phantom{4}5.34749b + \phantom{10}63.25889c = \phantom{-}1.61627 \\
-5.34749a + \phantom{4}5.10842b - \phantom{10}49.00859c = -2.38271 \\
\phantom{-}63.25889a - 49.00859b + 1002.50650c = 26.77277
\end{cases}
\]
It has the solution a = −1.04103, b = −1.26132, and c = 0.03073. So the curve
\[ y = -1.04103 \ln x - 1.26132 \cos x + 0.03073\, e^{x} \]
has the required form and fits the table in the least-squares sense. The value of φ(a, b, c) is 0.92557. Figure 12.3 is a plot of the given data and the nonpolynomial least-squares curve.
FIGURE 12.3 Nonpolynomial least squares
■
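The normal equations of Example 2 can be assembled mechanically from the three functions ln x, cos x, and e^x. The Python sketch below does this for the table of the example and solves the 3 × 3 system by Gaussian elimination with partial pivoting. The helper names normal_equations and gauss_solve are assumptions for this illustration and are not the Gauss and Solve routines of Chapter 7.

```python
import math


def normal_equations(basis, xs, ys):
    """Return (A, rhs) with A[i][j] = sum_k g_i(x_k) g_j(x_k), rhs[i] = sum_k y_k g_i(x_k)."""
    n = len(basis)
    G = [[g(x) for g in basis] for x in xs]      # G[k][j] = g_j(x_k)
    A = [[sum(row[i] * row[j] for row in G) for j in range(n)] for i in range(n)]
    rhs = [sum(y * row[i] for row, y in zip(G, ys)) for i in range(n)]
    return A, rhs


def gauss_solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


# Table of values from Example 2.
xs = [0.24, 0.65, 0.95, 1.24, 1.73, 2.01, 2.23, 2.52, 2.77, 2.99]
ys = [0.23, -0.26, -1.10, -0.45, 0.27, 0.10, -0.29, 0.24, 0.56, 1.00]
A, rhs = normal_equations([math.log, math.cos, math.exp], xs, ys)
a, b, c = gauss_solve(A, rhs)
print(a, b, c)
```

Because normal_equations accepts any list of functions, the same sketch applies to other families of the form y = c0 g0(x) + c1 g1(x) + ··· + cn gn(x).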
We can use mathematical software such as Matlab, Maple, or Mathematica to verify these results and to plot the solution curve. (See Computer Problem 12.1.6.)
Basis Functions {g0, g1, . . . , gn}
The principle of least squares, illustrated in these two simple cases, can be extended to general linear families of functions without involving any new ideas. Suppose that the data in Equation (1) are thought to conform to a relationship such as
\[ y = \sum_{j=0}^{n} c_j g_j(x) \qquad (5) \]
in which the functions g0, g1, …, gn (called basis functions) are known and held fixed. The coefficients c0, c1, …, cn are to be determined according to the principle of least squares. In other words, we define the expression
\[ \varphi(c_0, c_1, \ldots, c_n) = \sum_{k=0}^{m} \left[ \sum_{j=0}^{n} c_j g_j(x_k) - y_k \right]^2 \qquad (6) \]
and select the coefficients to make it as small as possible. Of course, the expression φ(c0, c1, …, cn) is the sum of the squares of the errors associated with each entry (xk, yk) in the given table.
Proceeding as before, we write down as necessary conditions for the minimum the n + 1 equations
\[ \frac{\partial\varphi}{\partial c_i} = 0 \qquad (0 \le i \le n) \]
These partial derivatives are obtained from Equation (6). Indeed,
\[ \frac{\partial\varphi}{\partial c_i} = \sum_{k=0}^{m} 2\left[ \sum_{j=0}^{n} c_j g_j(x_k) - y_k \right] g_i(x_k) \qquad (0 \le i \le n) \]
When set equal to zero, the resulting equations can be rearranged as
\[ \sum_{j=0}^{n} \left[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) \right] c_j = \sum_{k=0}^{m} y_k g_i(x_k) \qquad (0 \le i \le n) \qquad (7) \]
These are the normal equations in this situation and serve to determine the best values of the parameters c0, c1, …, cn. The normal equations are linear in ci; thus, in principle, they can be solved by the method of Gaussian elimination (see Chapter 7).
In practice, the normal equations may be difficult to solve if care is not taken in choosing the basis functions g0, g1, …, gn. First, the set {g0, g1, …, gn} should be linearly independent. This means that no linear combination Σ_{i=0}^{n} ci gi can be the zero function (except in the trivial case when c0 = c1 = ··· = cn = 0). Second, the functions g0, g1, …, gn should be appropriate to the problem at hand. Finally, one should choose a set of basis functions that is well conditioned for numerical work. We elaborate on this aspect of the problem in the next section.
Summary
(1) We wish to find a line y = ax + b that most nearly passes through the m + 1 pairs of points (xi, yi) for 0 ≤ i ≤ m. An example of l1 approximation is to choose a and b so that the total absolute error for all these points is minimized:
\[ \sum_{k=0}^{m} |a x_k + b - y_k| \]
This can be solved by the techniques of linear programming.
(2) An l2 approximation will minimize a different error function of a and b:
\[ \varphi(a, b) = \sum_{k=0}^{m} (a x_k + b - y_k)^2 \]
The minimization of φ produces a best estimate of a and b in the least-squares sense. One solves the normal equations
\[
\begin{cases}
\displaystyle\Big(\sum_{k=0}^{m} x_k^2\Big) a + \Big(\sum_{k=0}^{m} x_k\Big) b = \sum_{k=0}^{m} y_k x_k \\[2ex]
\displaystyle\Big(\sum_{k=0}^{m} x_k\Big) a + (m + 1)\, b = \sum_{k=0}^{m} y_k
\end{cases}
\]
(3) In a more general case, the data points conform to a relationship such as
\[ y = \sum_{j=0}^{n} c_j g_j(x) \]
in which the basis functions g0, g1, …, gn are known and held fixed. The coefficients c0, c1, …, cn are to be determined according to the principle of least squares. The normal equations in this situation are
\[ \sum_{j=0}^{n} \left[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) \right] c_j = \sum_{k=0}^{m} y_k g_i(x_k) \qquad (0 \le i \le n) \]
and can be solved, in principle, by the method of Gaussian elimination to determine the best values of the parameters c0, c1, …, cn.
Problems 12.1
a1. Using the method of least squares, find the constant function that best fits the following data:
x | −1    2    3
y | 5/4  4/3  5/12
a2. Determine the constant function c that is produced by the least-squares theory applied to the Table on p. 495. Does the resulting formula involve the points xk in any way? Apply your general formula to the preceding problem.
a3. Find an equation of the form y = a e^{x²} + b x³ that best fits the points (−1, 0), (0, 1), and (1, 2) in the least-squares sense.
4. Suppose that the x points in Table (1) are situated symmetrically about 0 on the x-axis. In this case, there is an especially simple formula for the line that best fits the points. Find it.
a5. Find the equation of a parabola of form y = ax² + b that best represents the following data. Use the method of least squares.
x | −1    0    1
y | 3.1  0.9  2.9
6. Suppose that Table (1) is known to conform to a function like y = x² − x + c. What value of c is obtained by the least-squares theory?
a7. Suppose that Table (1) is thought to be represented by a function y = c log x. If so, what value for c emerges from the least-squares theory?
8. Show that Equation (4) is the solution of Equation (3).
9. (Continuation) How do we know that divisor d is not zero? In fact, show that d is positive for m ≥ 1. Hint: Show that
\[ d = \sum_{k=0}^{m} \sum_{l=0}^{k-1} (x_k - x_l)^2 \]
by induction on m. The Cauchy-Schwarz inequality can also be used to prove that d > 0.
10. (Continuation) Show that a and b can also be computed as follows:
\[
\bar{x} = \frac{1}{m+1} \sum_{k=0}^{m} x_k \qquad
\bar{y} = \frac{1}{m+1} \sum_{k=0}^{m} y_k \qquad
c = \sum_{k=0}^{m} (x_k - \bar{x})^2
\]
\[
a = \frac{1}{c} \sum_{k=0}^{m} (x_k - \bar{x})(y_k - \bar{y}) \qquad
b = \bar{y} - a\bar{x}
\]
Hint: Show that d = (m + 1)c.
a11. How do we know that the coefficients c0, c1, …, cn that satisfy the normal Equations (7) do not lead to a maximum in the function defined by Equation (6)?
a12. If Table (1) is thought to conform to a relationship y = log(cx), what is the value of c obtained by the method of least squares?
a13. What straight line best fits the following data
x | 1  2  3  4
y | 0  1  1  2
in the least-squares sense?
14. In analytic geometry, we learn that the distance from a point (x0, y0) to a line represented by the equation ax + by = c is (ax0 + by0 − c)(a² + b²)^{−1/2}. Determine a straight line that fits a table of data points (xi, yi), for 0 ≤ i ≤ m, in such a way that the sum of the squares of the distances from the points to the line is minimized.
15. Show that if a straight line is fitted to a table (xi, yi) by the method of least squares, then the line will pass through the point (x∗, y∗), where x∗ and y∗ are the averages of the xi’s and yi’s, respectively.
a16. The viscosity V of a liquid is known to vary with temperature according to a quadratic law V = a + bT + cT². Find the best values of a, b, and c for the following table:
T |    1     2     3     4     5     6     7
V | 2.31  2.01  1.80  1.66  1.55  1.47  1.41
17. An experiment involves two independent variables x and y and one dependent variable z. How can a function z = a + bx + cy be fitted to the table of points (xk, yk, zk)? Give the normal equations.
a18. Find the best function (in the least-squares sense) that fits the following data points and is of the form f(x) = a sin πx + b cos πx:
x | −1  −1/2  0  1/2  1
y | −1    0   1   2   1
a19. Find the quadratic polynomial that best fits the following data in the sense of least squares:
x | −2  −1  0  1  2
y |  2   1  1  1  2
a20. What line best represents the following data in the least-squares sense?
x | 0   1  2
y | 5  −6  7
a21. What constant c makes the expression
\[ \sum_{k=0}^{m} \left[ f(x_k) - c e^{x_k} \right]^2 \]
as small as possible?
22. Show that the formula for the best line to fit data (k, yk) at the integers k for 1 ≤ k ≤ n is
\[ y = ax + b \]
where
\[
a = \frac{6}{n(n^2 - 1)} \left[ 2 \sum_{k=1}^{n} k y_k - (n + 1) \sum_{k=1}^{n} y_k \right]
\qquad
b = \frac{2}{n(n - 1)} \left[ (2n + 1) \sum_{k=1}^{n} y_k - 3 \sum_{k=1}^{n} k y_k \right]
\]
23. Establish the normal equations and verify the results in Example 1.
24. A vector v is asserted to be the least-squares solution of an inconsistent system Ax = b. How can we test v without going through the entire least-squares procedure?
25. Find the normal equations for the following data points:
x | 1.0  2.0  2.5  3.0
y | 3.7  4.1  4.3  5.0
Determine the straight line that best fits the data in the least-squares sense. Plot the data points and the least-squares line.
26. For the case n = 4, show directly that by forming the normal equations from the data points (xi, yi), we obtain the results in Theorem 1.
Computer Problems 12.1
1. Write a procedure that sets up the normal Equations (7). Using that procedure and other routines, such as Gauss and Solve from Chapter 7, verify the solution given for the problem involving ln x, cos x, and e^x in the subsection entitled “Nonpolynomial Example.”
2. Write a procedure that fits a straight line to Table (1). Use this procedure to find the constants in the equation S = aT + b for the table in the example that begins this chapter. Also, verify the results obtained for the problem in the subsection entitled “Linear Example.”
3. Write and test a program that takes m + 1 points in the plane (xi, yi), where 0 ≤ i ≤ m, with x0 < x1 < ··· < xm, and computes the best linear fit by the method of least squares. Then the program should create a plot of the points and the best line determined by the least-squares method.
4. The Internal Revenue Service (IRS) publishes the following table of values having to do with minimal distributions of pension plans:
x |    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
y | 29.9  29.0  28.1  27.1  26.2  25.3  24.4  23.6  22.7  21.8  21.0  20.1  19.3  18.5  17.7  16.9
What simple function represents the data? Use Equation (5), and plot the data and the results using either plotting software such as gnuplot or some mathematics software system such as Maple, Matlab, or Mathematica.
5. Using mathematical software such as Matlab, Maple, or Mathematica, fit a linear least-squares polynomial to the data in Example 1. Then plot the original data and the polynomial using a fine set of grid points.
6. (Continuation) Verify the results in Example 2 and plot the curve.
12.2 Orthogonal Systems and Chebyshev Polynomials
Orthonormal Basis Functions {g0, g1, . . . , gn}
Once the functions g0, g1, ..., gn of Equation (5) in Section 12.1 have been chosen, the least-squares problem can be interpreted as follows: The set of all functions g that can be expressed as linear combinations of g0, g1, ..., gn is a vector space G. (Familiarity with vector spaces is not essential to understanding the discussion here.) In symbols, we have
\[ G = \left\{ g : \text{there exist } c_0, c_1, \ldots, c_n \text{ such that } g(x) = \sum_{j=0}^{n} c_j g_j(x) \right\} \]
The function that is being sought in the least-squares problem is thus an element of the vector space G. Since the functions g0, g1, ..., gn form a basis for G, the set is not linearly dependent. However, a given vector space has many different bases, and they can differ drastically in their numerical properties.
Let us turn our attention away from the given basis {g0 , g1 , . . . , gn } to the vector space G generated by that basis. Without changing G, we ask: What basis for G should be chosen for numerical work? In the present problem, the principal numerical task is to solve the normal equations—that is, Equation (7) in Section 12.1:
\[ \sum_{j=0}^{n} \left[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) \right] c_j = \sum_{k=0}^{m} y_k g_i(x_k) \qquad (0 \le i \le n) \qquad (1) \]
The nature of this system obviously depends on the basis {g0,g1,...,gn}. We want these equations to be easily solved or to be capable of being accurately solved. The ideal situation occurs when the coefficient matrix in Equation (1) is the identity matrix. This happens if the basis {g0,g1,...,gn} has the property of orthonormality:
\[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) = \delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \]
In the presence of this property, Equation (1) simplifies dramatically to
\[ c_j = \sum_{k=0}^{m} y_k g_j(x_k) \qquad (0 \le j \le n) \]
which is no longer a system of equations to be solved but rather an explicit formula for the coefficients cj.
Under rather general conditions, the space G has a basis that is orthonormal in the sense just described. A procedure known as the Gram-Schmidt process can be used to obtain such a basis. There are some situations in which the effort of obtaining an orthonormal basis is justified, but simpler procedures often suffice. We describe one such procedure now.
Remember that our goal is to make Equation (1) well disposed for numerical solution. We want to avoid any matrix of coefficients that involves the difficulties encountered in connection with the Hilbert matrix (see Computer Problem 7.2.4). This objective can be met if the basis for the space G is well chosen.
We now consider the space G that consists of all polynomials of degree at most n, which is an important example of the least-squares theory. It may seem natural to use the following n + 1 functions as a basis for G:
\[ g_0(x) = 1 \qquad g_1(x) = x \qquad g_2(x) = x^2 \qquad \ldots \qquad g_n(x) = x^n \]
Using this basis, we write a typical element of the space G in the form
\[ g(x) = \sum_{j=0}^{n} c_j g_j(x) = \sum_{j=0}^{n} c_j x^j = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n \]
This basis, however natural, is almost always a poor choice for numerical work. For many purposes, the Chebyshev polynomials (suitably defined for the interval involved) do form a good basis.
Figure 12.4 gives an indication of why the monomials x^j do not form a good basis for numerical work: These functions are too much alike! If a function g is given and we wish to express it as a linear combination of the monomials, g(x) = Σ_{j=0}^{n} cj x^j, it is difficult to determine the coefficients cj precisely. Figure 12.4 also shows a few of the Chebyshev polynomials; they are quite different from one another.
FIGURE 12.4 Polynomials x^k and Chebyshev polynomials Tk
For simplicity, assume that the points in our least-squares problem have the property −1=x0
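\[ \sigma_n^2 = \frac{1}{m - n} \sum_{i=0}^{m} \left[ y_i - p_n(x_i) \right]^2 \qquad (6) \]
[…] infected by noise), then
\[ \sigma_0^2 > \sigma_1^2 > \cdots > \sigma_N^2 = \sigma_{N+1}^2 = \sigma_{N+2}^2 = \cdots = \sigma_{m-1}^2 \]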
This fact suggests the following strategy for dealing with the case in which N is not known: Compute σ_0^2, σ_1^2, ... in succession. As long as these are decreasing significantly, continue the calculation. When an integer N is reached for which σ_N^2 ≈ σ_{N+1}^2 ≈ σ_{N+2}^2 ≈ ···, stop and declare pN to be the polynomial sought.
If σ_0^2, σ_1^2, ... are to be computed directly from the definition in Equation (6), then each of the polynomials p0, p1, ... will have to be determined. The procedure described below can avoid the determination of all but the one desired polynomial.
In the remainder of the discussion, the abscissas xi are to be held fixed. These points are assumed to be distinct, although the theory can be extended to include cases in which some points repeat. If f and g are two functions whose domains include the points {x0, x1, . . . , xm }, then the following notation is used:
\[ \langle f, g \rangle = \sum_{i=0}^{m} f(x_i)\, g(x_i) \qquad (7) \]
This quantity is called the inner product of f and g. Much of our discussion does not depend on the exact form of the inner product but only on certain of its properties. An inner product ⟨· , ·⟩ has the following properties:
■ PROPERTIES Defining Properties of an Inner Product
1. ⟨f, g⟩ = ⟨g, f⟩
2. ⟨f, f⟩ > 0 unless f(xi) = 0 for all i
3. ⟨af, g⟩ = a⟨f, g⟩ where a ∈ R
4. ⟨f, g + h⟩ = ⟨f, g⟩ + ⟨f, h⟩
The reader should verify that the inner product defined in Equation (7) has the properties listed.
A set of functions is now said to be orthogonal if ⟨f, g⟩ = 0 for any two different functions f and g in that set. An orthogonal set of polynomials can be generated recursively by the following formulas:
\[
\begin{cases}
q_0(x) = 1 \\
q_1(x) = x - \alpha_0 \\
q_{n+1}(x) = x q_n(x) - \alpha_n q_n(x) - \beta_n q_{n-1}(x) \qquad (n \ge 1)
\end{cases}
\]
where
\[
\alpha_n = \frac{\langle x q_n, q_n \rangle}{\langle q_n, q_n \rangle}
\qquad
\beta_n = \frac{\langle x q_n, q_{n-1} \rangle}{\langle q_{n-1}, q_{n-1} \rangle}
\]
In these formulas, a slight abuse of notation occurs where “xqn” is used to denote the function whose value at x is xqn(x).
To understand how this definition leads to an orthogonal system, let’s examine a few cases. First,
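\[ \langle q_1, q_0 \rangle = \langle x - \alpha_0, q_0 \rangle = \langle x q_0 - \alpha_0 q_0, q_0 \rangle = \langle x q_0, q_0 \rangle - \alpha_0 \langle q_0, q_0 \rangle = 0 \]
Notice that several properties of an inner product listed previously have been used here. Also, the definition of α0 was used. Another of the first few cases is this:
\[
\begin{aligned}
\langle q_2, q_1 \rangle &= \langle x q_1 - \alpha_1 q_1 - \beta_1 q_0, q_1 \rangle \\
&= \langle x q_1, q_1 \rangle - \alpha_1 \langle q_1, q_1 \rangle - \beta_1 \langle q_0, q_1 \rangle = 0
\end{aligned}
\]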
Here, the definition of α1 has been used, as well as the fact (established above) that ⟨q1 , q0 ⟩ = 0. The next step in a formal proof is to verify that ⟨q2,q0⟩ = 0. Then an inductive proof completes the argument.
One part of this proof consists in showing that the coefficients αn and βn are well defined. This means that the denominators ⟨qn, qn⟩ are not zero. To verify that this is the case, suppose that ⟨qn, qn⟩ = 0. Then Σ_{i=0}^{m} [qn(xi)]² = 0, and consequently, qn(xi) = 0 for each value of i. This means that the polynomial qn has m + 1 roots, x0, x1, …, xm. Since the degree n is less than m, we conclude that qn is the zero polynomial. However, this is not possible because obviously
\[
\begin{aligned}
q_0(x) &= 1 \\
q_1(x) &= x - \alpha_0 \\
q_2(x) &= x^2 + (\text{lower-order terms})
\end{aligned}
\]
and so on. Observe that this argument requires n < m.
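The recursion for q0, q1, … can be tried numerically by storing each polynomial through its values at the abscissas, since the inner product of Equation (7) only uses those values. The Python sketch below is one such arrangement under that assumption; the function names inner and orthogonal_polynomial_values and the five sample points are hypothetical choices for this illustration.

```python
def inner(u, v):
    """Discrete inner product of Equation (7) for functions given by their values."""
    return sum(ui * vi for ui, vi in zip(u, v))


def orthogonal_polynomial_values(xs, highest):
    """Return the values of q_0, ..., q_highest at the points xs."""
    m = len(xs) - 1
    if not 0 <= highest < m:
        raise ValueError("the argument in the text requires n < m")
    q_prev = [1.0 for _ in xs]                       # q_0(x) = 1
    table = [q_prev]
    if highest == 0:
        return table
    alpha0 = inner([x * q for x, q in zip(xs, q_prev)], q_prev) / inner(q_prev, q_prev)
    q_curr = [x - alpha0 for x in xs]                # q_1(x) = x - alpha_0
    table.append(q_curr)
    for _ in range(1, highest):
        xq = [x * q for x, q in zip(xs, q_curr)]     # values of the function x*q_n
        alpha = inner(xq, q_curr) / inner(q_curr, q_curr)
        beta = inner(xq, q_prev) / inner(q_prev, q_prev)
        q_next = [v - alpha * qc - beta * qp for v, qc, qp in zip(xq, q_curr, q_prev)]
        table.append(q_next)
        q_prev, q_curr = q_curr, q_next
    return table


# Hypothetical abscissas, symmetric about 0 (m = 4).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
table = orthogonal_polynomial_values(xs, 3)
off_diagonal = [inner(table[i], table[j]) for i in range(4) for j in range(i)]
print(off_diagonal)
```

For these points the recursion gives α0 = α1 = 0 and β1 = 2, so q2(x) = x² − 2, and all the inner products of distinct polynomials vanish up to roundoff.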
The system of orthogonal polynomials {q0, q1, . . . , qm−1} generated by the above algorithm is a basis for the vector space Π_{m−1} of all polynomials of degree at most m − 1. It is clear from the algorithm that each qn starts with the highest term x^n. If it is desired to express a given polynomial p of degree n (n ≤ m − 1) as a linear combination of q0, q1, ..., qn, this can be done as follows: Set
\[ p = \sum_{i=0}^{n} a_i q_i \qquad (8) \]
On the right-hand side, only one summand contains x^n. It is the term an qn. On the left-hand side, there is also a term in x^n. One chooses an so that an x^n on the right is equal to the corresponding term in p. Now write
\[ p - a_n q_n = \sum_{i=0}^{n-1} a_i q_i \]
On both sides of this equation, there are polynomials of degree at most n − 1 (because of the choice of an). Hence, we can now choose an−1 in the way we chose an; that is, choose an−1 so that the terms in x^{n−1} are the same on both sides. By continuing in this way, we discover the unique values that the coefficients ai must have. This establishes that {q0, q1, . . . , qn} is a basis for Π_n, for n = 0, 1, ..., m − 1.
Another way of determining the coefficients ai (once we know that they exist!) is to take the inner product of both sides of Equation (8) with qj. The result is
\[ \langle p, q_j \rangle = \sum_{i=0}^{n} a_i \langle q_i, q_j \rangle \qquad (0 \le j \le n) \]
Since the set q0, q1, ..., qn is orthogonal, ⟨qi, qj⟩ = 0 for each i different from j. Hence, we obtain
\[ \langle p, q_j \rangle = a_j \langle q_j, q_j \rangle \]
This gives aj as a quotient of two inner products.
Now we return to the least-squares problem. Let F be a function that we wish to fit by a polynomial pn of degree n. We shall find the polynomial that minimizes the expression
\[ \sum_{i=0}^{m} \left[ F(x_i) - p_n(x_i) \right]^2 \]
The solution is given by the formulas
\[ p_n = \sum_{i=0}^{n} c_i q_i \qquad c_i = \frac{\langle F, q_i \rangle}{\langle q_i, q_i \rangle} \qquad (9) \]
It is especially noteworthy that ci does not depend on n. This implies that the various polynomials p0, p1, . . . that we are seeking can all be obtained by simply truncating one series, namely, Σ_{i=0}^{m−1} ci qi. To prove that pn, as given in Equation (9), solves our problem, we return to the normal equations, Equation (1). The basis functions now being used are q0, q1, . . . , qn. Thus, the normal equations are
\[ \sum_{j=0}^{n} \left[ \sum_{k=0}^{m} q_i(x_k) q_j(x_k) \right] c_j = \sum_{k=0}^{m} y_k q_i(x_k) \qquad (0 \le i \le n) \]
Using the inner product notation, we get
\[ \sum_{j=0}^{n} \langle q_i, q_j \rangle c_j = \langle F, q_i \rangle \qquad (0 \le i \le n) \]
where F is some function such that F(xk) = yk for 0 ≤ k ≤ m. Next, apply the orthogonality property ⟨qi, qj⟩ = 0 when i ≠ j. The result is
\[ \langle q_i, q_i \rangle c_i = \langle F, q_i \rangle \qquad (0 \le i \le n) \qquad (10) \]
Now we return to the variance numbers σ_0^2, σ_1^2, ... and show how they can be easily computed. First, an important observation: The set {q0, q1, . . . , qn, F − pn} is orthogonal! The only new fact here is that ⟨F − pn, qi⟩ = 0 for 0 ≤ i ≤ n. To check this, write
\[
\begin{aligned}
\langle F - p_n, q_i \rangle &= \langle F, q_i \rangle - \langle p_n, q_i \rangle \\
&= \langle F, q_i \rangle - \Big\langle \sum_{j=0}^{n} c_j q_j, q_i \Big\rangle \\
&= \langle F, q_i \rangle - \sum_{j=0}^{n} c_j \langle q_j, q_i \rangle \\
&= \langle F, q_i \rangle - c_i \langle q_i, q_i \rangle = 0
\end{aligned}
\]
In this computation, we used Equations (9) and (10). Since pn is a linear combination of q0, q1, ..., qn, it follows easily that
\[ \langle F - p_n, p_n \rangle = 0 \]
Now recall that the variance σ_n^2 was defined by
\[ \sigma_n^2 = \frac{\rho_n}{m - n} \qquad \rho_n = \sum_{i=0}^{m} \left[ y_i - p_n(x_i) \right]^2 \]
The quantities ρn can be written in another way:
\[
\begin{aligned}
\rho_n &= \langle F - p_n, F - p_n \rangle \\
&= \langle F - p_n, F \rangle \\
&= \langle F, F \rangle - \langle F, p_n \rangle \\
&= \langle F, F \rangle - \sum_{i=0}^{n} c_i \langle F, q_i \rangle \\
&= \langle F, F \rangle - \sum_{i=0}^{n} \frac{\langle F, q_i \rangle^2}{\langle q_i, q_i \rangle}
\end{aligned}
\]
Thus, the numbers ρ0, ρ1, . . . can be generated recursively by the algorithm
\[
\begin{cases}
\rho_0 = \langle F, F \rangle - \dfrac{\langle F, q_0 \rangle^2}{\langle q_0, q_0 \rangle} \\[2ex]
\rho_n = \rho_{n-1} - \dfrac{\langle F, q_n \rangle^2}{\langle q_n, q_n \rangle} \qquad (n \ge 1)
\end{cases}
\]
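The formulas for ci, ρn, and σ_n^2 combine with the three-term recursion into a short computation. The following Python sketch is one possible arrangement, not the text's own program: polynomials are again represented by their values at the abscissas, and the function name polynomial_regression is an assumption. It is applied to the table of Example 1 in Section 12.1, so ρ1 should agree with the value of φ reported there.

```python
def inner(u, v):
    """Discrete inner product of Equation (7) for functions given by their values."""
    return sum(ui * vi for ui, vi in zip(u, v))


def polynomial_regression(xs, ys, max_degree):
    """Return lists (c, rho, variance) for n = 0, ..., max_degree."""
    m = len(xs) - 1
    if not 0 <= max_degree < m:
        raise ValueError("need 0 <= max_degree < m")
    q_prev = None
    q_curr = [1.0 for _ in xs]                   # values of q_0
    rho = inner(ys, ys)                          # <F, F>
    coeffs, rhos, variances = [], [], []
    for n in range(max_degree + 1):
        qq = inner(q_curr, q_curr)
        fq = inner(ys, q_curr)
        coeffs.append(fq / qq)                   # c_n from Equation (9)
        rho -= fq * fq / qq                      # rho_n = rho_{n-1} - <F,q_n>^2 / <q_n,q_n>
        rhos.append(rho)
        variances.append(rho / (m - n))          # sigma_n^2 = rho_n / (m - n)
        xq = [x * q for x, q in zip(xs, q_curr)]
        alpha = inner(xq, q_curr) / qq
        if q_prev is None:
            q_next = [v - alpha * qc for v, qc in zip(xq, q_curr)]
        else:
            beta = inner(xq, q_prev) / inner(q_prev, q_prev)
            q_next = [v - alpha * qc - beta * qp
                      for v, qc, qp in zip(xq, q_curr, q_prev)]
        q_prev, q_curr = q_curr, q_next
    return coeffs, rhos, variances


# Table of values from Example 1 in Section 12.1 (m = 4).
xs = [4.0, 7.0, 11.0, 13.0, 17.0]
ys = [2.0, 0.0, 2.0, 6.0, 7.0]
coeffs, rhos, variances = polynomial_regression(xs, ys, 1)
print(coeffs, rhos, variances)
```

Since q1(x) = x − α0 has leading coefficient 1, the coefficient c1 equals the slope a of the least-squares line, and ρ1 is the minimum value of φ(a, b).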
Summary
(1) We use Chebyshev polynomials {Tj} as an orthogonal basis that can be generated recursively by
\[ T_j(x) = 2x\,T_{j-1}(x) - T_{j-2}(x) \qquad (j \ge 2) \]
with T0(x) = 1 and T1(x) = x. The coefficient matrix A = (aij)0:n×0:n and the right-hand side b = (bi)0:n of the normal equations are
\[
a_{ij} = \sum_{k=0}^{m} T_i(z_k)\,T_j(z_k) \quad (0 \le i, j \le n)
\qquad
b_i = \sum_{k=0}^{m} y_k\,T_i(z_k) \quad (0 \le i \le n)
\]
A linear combination of Chebyshev polynomials
\[ g(x) = \sum_{j=0}^{n} c_j T_j(x) \]
can be evaluated recursively:
\[
\begin{cases}
w_{n+2} = w_{n+1} = 0 \\
w_j = c_j + 2x\,w_{j+1} - w_{j+2} \qquad (j = n, n-1, \ldots, 0) \\
g(x) = w_0 - x\,w_1
\end{cases}
\]
(2) We discuss smoothing of data by polynomial regression.
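The backward recursion for evaluating g(x) = Σ cj Tj(x) can be sketched in Python as follows. The function name chebyshev_sum and the sample coefficients are assumptions made for this illustration; the check against cos(j arccos x) uses the alternate definition of Tj on [−1, 1] that appears in the problems of this section.

```python
import math


def chebyshev_sum(coeffs, x):
    """Evaluate sum_j coeffs[j] * T_j(x) by the recursion w_j = c_j + 2x w_{j+1} - w_{j+2}."""
    w_next = 0.0          # plays the role of w_{j+1}, starting with w_{n+1} = 0
    w_next2 = 0.0         # plays the role of w_{j+2}, starting with w_{n+2} = 0
    for c in reversed(coeffs):
        w_curr = c + 2.0 * x * w_next - w_next2
        w_next2, w_next = w_next, w_curr
    # After the loop, w_next holds w_0 and w_next2 holds w_1.
    return w_next - x * w_next2


# Hypothetical coefficients c_0, ..., c_3 and evaluation point.
coeffs = [1.0, -2.0, 0.5, 3.0]
x = 0.3
value = chebyshev_sum(coeffs, x)
direct = sum(c * math.cos(j * math.acos(x)) for j, c in enumerate(coeffs))
print(value, direct)
```

The recursion needs only two running values, so no array of the individual Tj(x) has to be stored.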
Problems 12.2
1. Let g0, g1, ..., gn be a set of functions such that Σ_{k=0}^{m} gi(xk) gj(xk) = 0 if i ≠ j. What linear combination of these functions best fits the data of Table (1) in Section 12.1?
a2. Consider polynomials g0, g1, ..., gn defined by g0(x) = 1, g1(x) = x − 1, and gj(x) = 3x gj−1(x) + 2gj−2(x). Develop an efficient algorithm for computing values of the function f(x) = Σ_{j=0}^{n} cj gj(x).
a3. Show that cos nθ = 2 cos θ cos(n − 1)θ − cos(n − 2)θ. Hint: Use the familiar identity cos(A ∓ B) = cos A cos B ± sin A sin B.
4. (Continuation) Show that if fn(x) = cos(n arccos x), then f0(x) = 1, f1(x) = x, and fn(x) = 2x fn−1(x) − fn−2(x).
a5. (Continuation) Show that an alternate definition of Chebyshev polynomials is Tn(x) = cos(n arccos x) for −1 ≤ x ≤ 1.
a6. (Continuation) Give a one-line proof that Tn(Tm(x)) = Tnm(x).
a7. (Continuation) Show that |Tn(x)| ≤ 1 for x in the interval [−1, 1].
a8. Define gk(x) = Tk(½x + ½). What recursive relation do these functions satisfy?
9. Show that T0, T2, T4, . . . are even and that T1, T3, . . . are odd functions. Recall that an even function satisfies the equation f(x) = f(−x); an odd function satisfies the equation f(x) = −f(−x).
a10. Count the number of operations involved in the algorithm used to compute g(x) = Σ_{j=0}^{n} cj Tj(x).
11. Show that the algorithm for computing g(x) = Σ_{j=0}^{n} cj Tj(x) can be modified to read
\[
\begin{cases}
w_{n-1} = c_{n-1} + 2x\,c_n \\
w_k = c_k + 2x\,w_{k+1} - w_{k+2} \qquad (n - 2 \ge k \ge 1) \\
g(x) = c_0 + x\,w_1 - w_2
\end{cases}
\]
thus making wn+2, wn+1, and w0 unnecessary.
a12. (Continuation) Count the operations for the algorithm in the preceding problem.
a13. Determine T6(x) as a polynomial in x.
14. Verify the four properties of an inner product that were listed in the text, using Definition (7).
15. Verify these formulas:
\[
p_0(x) = \frac{1}{m+1} \sum_{i=0}^{m} y_i
\qquad
\beta_n = \frac{\langle q_n, q_n \rangle}{\langle q_{n-1}, q_{n-1} \rangle}
\qquad
c_n = \frac{\rho_{n-1} - \rho_n}{\langle F, q_n \rangle}
\]
16. Complete the proof that the algorithm for generating the orthogonal system of polynomials works.
a17. There is a function f of the form
\[ f(x) = \alpha x^{12} + \beta x^{13} \]
for which f(0.1) = 6 × 10^{−13} and f(0.9) = 3 × 10^{−2}. What is it? Are α and β sensitive to perturbations in the two given values of f(x)?
18. (Multiple choice) Let x1 = [2, 2, 1]T, x2 = [1, 1, 5]T, and x3 = [−3, 2, 1]T. If the Gram-Schmidt process is applied to this ordered set of vectors to produce an orthonormal set {u1, u2, u3}, what is u1?
a. [2/3, 2/3, 1/3]T
b. [2, 2, 1]T
c. [2/5, 2/5, 1/5]T
d. [1, 0, 0]T
e. None of these.
19. (Multiple choice, continuation) What is u2?
a. (1/√27)[1, 1, 5]T
b. (1/√18)[−1, −1, 4]T
c. [2, 2, 1]T
d. [1, 1, −4]T
e. None of these.
Computer Problems 12.2
1. Carry out an experiment in data smoothing as follows: Start with a polynomial of modest degree, say, 7. Compute 100 values of this polynomial at random points in the interval [−1, 1]. Perturb these values by adding random numbers chosen from a small interval, say, [−1/8, 1/8]. Try to recover the polynomial from these perturbed values by using the method of least squares.
2. Write real function Cheb(n, x) for evaluating Tn(x). Use the recursive formula satisfied by Chebyshev polynomials. Do not use a subscripted variable. Test the program on these 15 cases: n = 0, 1, 3, 6, 12 and x = 0, −1, 0.5.
3. Write real function Cheb(n, x, (yi)) to calculate T0(x), T1(x), ..., Tn(x), and store these numbers in the array (yi). Use your routine, together with suitable plotting routines, to obtain graphs of T0, T1, T2, ..., T8 on [−1, 1].
4. Write real function F(n, (ci), x) for evaluating f(x) = Σ_{j=0}^{n} cj Tj(x). Test your routine by means of the formula Σ_{k=0}^{∞} t^k Tk(x) = (1 − tx)/(1 − 2tx + t²), valid for |t| < 1. If |t| ≤ 1/2, then only a few terms of the series are needed to give full machine precision. Add terms in ascending order of magnitude.
5. Obtain a graph of Tn for some reasonable value of n by means of the following idea: Generate 100 equally spaced angles θi in the interval [0, π]. Define xi = cos θi and yi = Tn(xi) = cos(n arccos xi) = cos nθi. Send the points (xi, yi) to a suitable plotting routine.
6. Write suitable code to carry out the procedure outlined in the text for fitting a table with a linear combination of Chebyshev polynomials. Test it in the manner of Computer Problem 12.2.1, first by using an unperturbed polynomial. Find out experimentally how large n can be in this process before roundoff errors become serious.
a7. Define xk = cos[(2k − 1)π/(2m)]. Select modest values of n and m > 2n. Compute and print the matrix A whose elements are
\[ a_{ij} = \sum_{k=0}^{m} T_i(x_k)\,T_j(x_k) \qquad (0 \le i, j \le n) \]
Interpret the results in terms of the least-squares polynomial-fitting problem.
8. Program the algorithm for finding σ_0^2, σ_1^2, ... in the polynomial regression problem.
9. Program the complete polynomial regression algorithm. The output should be αn, βn, σ_n^2, and cn for 0 ≤ n ≤ N, where N is determined by the condition σ_{N−1}^2 > σ_N^2 ≈ σ_{N+1}^2.
10. Using orthogonal polynomials, find the quadratic polynomial that fits the following data in the sense of least squares:
a.
x | −1  −1/2  0  1/2  1
y | −1    0   1   2   1
b.
x | −2  −1  0  1  2
y |  2   1  1  1  2
12.3 Other Examples of the Least-Squares Principle
The principle of least squares is also used in other situations. In one of these, we attempt to solve an inconsistent system of linear equations of the form
\[ \sum_{j=0}^{n} a_{kj} x_j = b_k \qquad (0 \le k \le m) \qquad (1) \]
j=0
begins by factoring
Ax = b A=QR
j=0 k=0
k=0
12.3 Other Examples of the Least-Squares Principle 519
in which m > n. Here, there are m + 1 equations but only n + 1 unknowns. If a given n+1-tuple(x0,x1,…,xn)issubstitutedontheleft,thediscrepancybetweenthetwosides of the kth equation is termed the kth residual. Ideally, of course, all residuals should be zero.Ifitisnotpossibletoselect(x0,x1,…,xn)soastomakeallresidualszero,System(1) is said to be inconsistent or incompatible. In this case, an alternative is to minimize the sum of the squares of the residuals. So we are led to minimize the expression
m n 2 φ(x0,x1,…,xn)= akjxj −bk
k=0 j=0
(2)
by making an appropriate choice of (x0,x1,…,xn). Proceeding as before, we take partial derivatives with respect to xi and set them equal to zero, thereby arriving at the normal equations
nm m
akiakj xj = bkaki (0in) (3)
This is a linear system of just n + 1 equations involving unknowns x0, x1, . . ., xn . It can be shown that this system is consistent, provided that the column vectors in the original coef- ficient array are linearly independent. System (3) can be solved, for instance, by Gaussian elimination. The solution of System (3) is then a best approximate solution of Equation (1) in the least-squares sense.
Special methods have been devised for the problem just discussed. Generally, they gain in precision over the simple approach outlined above. One such algorithm for solving System (1),
wherematrixQis(m+1)×(n+1)satisfyingQT Q=IandmatrixRis(n+1)×(n+1) satisfying ri i > 0 and ri j = 0 for j < i . Then the least-squares solution is obtained by an algorithm called the modified Gram-Schmidt process.
A more elaborate (and more versatile) algorithm depends on the singular value decomposition of the matrix A. This is a factoring, A = UVT , in which UT U = Im+1, VTV = In+1,andisan(m+1)×(n+1)diagonalmatrixthathasnonnegativeentries. For these more reliable procedures, the reader is referred to material at the end of this section and to Stewart [1973] and Lawson and Hanson [1995].
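As a rough illustration of the two approaches just described, the following Python sketch solves a small inconsistent system both by the normal equations (3) and by a QR factorization. The 3 × 2 system, the variable names, and the use of NumPy are assumptions made for illustration only; NumPy's `qr` routine does not enforce $r_{ii} > 0$, which does not change the computed solution.

```python
import numpy as np

# Hypothetical inconsistent system: three equations, two unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

# Normal equations (A^T A) x = A^T b, solved by Gaussian elimination.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# QR approach: A = QR with Q^T Q = I and R upper triangular,
# so the least-squares solution satisfies R x = Q^T b.
Q, R = np.linalg.qr(A)            # reduced form: Q is 3x2, R is 2x2
x_qr = np.linalg.solve(R, Q.T @ b)

residual = A @ x_qr - b
print("normal equations:", x_normal)
print("QR factorization:", x_qr)
print("sum of squared residuals:", residual @ residual)
```

For a well-conditioned system such as this one the two answers agree to rounding error; the gain in precision mentioned in the text shows up only when the columns of $A$ are nearly dependent.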
Use of a Weight Function w (x)
Another important example of the principle of least squares occurs in fitting or approximating functions on intervals rather than discrete sets. For example, a given function $f$ defined on an interval $[a, b]$ may have to be approximated by a function such as
$$g(x) = \sum_{j=0}^{n} c_j g_j(x)$$
It is natural, then, to attempt to minimize the expression
$$\varphi(c_0, c_1, \ldots, c_n) = \int_a^b [g(x) - f(x)]^2 \, dx \tag{4}$$
by choosing coefficients appropriately. In some applications, it is desirable to force functions $g$ and $f$ into better agreement in certain parts of the interval. For this purpose, we can modify Equation (4) by including a positive weight function $w(x)$, which can, of course, be $w(x) \equiv 1$ if all parts of the interval are to be treated the same. The result is
$$\varphi(c_0, c_1, \ldots, c_n) = \int_a^b [g(x) - f(x)]^2 w(x) \, dx$$
The minimum of $\varphi$ is again sought by differentiating with respect to each $c_i$ and setting the partial derivatives equal to zero. The result is a system of normal equations:
$$\sum_{j=0}^{n} \Bigl[ \int_a^b g_i(x) g_j(x) w(x) \, dx \Bigr] c_j = \int_a^b f(x) g_i(x) w(x) \, dx \qquad (0 \le i \le n) \tag{5}$$
This is a system of $n + 1$ linear equations in $n + 1$ unknowns $c_0, c_1, \ldots, c_n$ and can be solved by Gaussian elimination. Earlier remarks about choosing a good basis apply here also. The ideal situation is to have functions $g_0, g_1, \ldots, g_n$ that have the orthogonality property:
$$\int_a^b g_i(x) g_j(x) w(x) \, dx = 0 \qquad (i \ne j) \tag{6}$$
Many such orthogonal systems have been developed over the years. For example, Chebyshev polynomials form one such system, namely,
$$\int_{-1}^{1} T_i(x) T_j(x) (1 - x^2)^{-1/2} \, dx = \begin{cases} 0 & i \ne j \\ \pi/2 & i = j > 0 \\ \pi & i = j = 0 \end{cases}$$
The weight function $(1 - x^2)^{-1/2}$ assigns heavy weight to the ends of the interval $[-1, 1]$. If a sequence of nonzero functions $g_0, g_1, \ldots, g_n$ is orthogonal according to Equation (6), then the sequence $\lambda_0 g_0, \lambda_1 g_1, \ldots, \lambda_n g_n$ is orthonormal for appropriate positive real numbers $\lambda_j$, namely,
$$\lambda_j = \Bigl\{ \int_a^b [g_j(x)]^2 w(x) \, dx \Bigr\}^{-1/2}$$
Nonlinear Example
As another example of the least-squares principle, here is a nonlinear problem. Suppose that a table of points $(x_k, y_k)$ is to be fitted by a function of the form
$$y = e^{cx}$$
Proceeding as before leads to the problem of minimizing the function
$$\varphi(c) = \sum_{k=0}^{m} (e^{c x_k} - y_k)^2$$
The minimum occurs for a value of $c$ such that
$$0 = \frac{\partial \varphi}{\partial c} = \sum_{k=0}^{m} 2 (e^{c x_k} - y_k) e^{c x_k} x_k$$
This equation is nonlinear in $c$. One could contemplate solving it by Newton's method or the secant method. On the other hand, the problem of minimizing $\varphi(c)$ could be attacked directly. Since there can be multiple roots in the normal equation and local minima in $\varphi$ itself, a direct minimization of $\varphi$ would be safer. This type of difficulty is typical of nonlinear least-squares problems. Consequently, other methods of curve fitting are often preferred if the unknown parameters do not occur linearly in the problem.
Alternatively, this particular example can be linearized by a change of variables $z = \ln y$ and by considering
$$z = cx$$
The problem of minimizing the function
$$\varphi(c) = \sum_{k=0}^{m} (c x_k - z_k)^2 \qquad z_k = \ln y_k$$
is easy and leads to
$$c = \sum_{k=0}^{m} z_k x_k \Big/ \sum_{k=0}^{m} x_k^2$$
This value of $c$ is not the solution of the original problem but may be satisfactory in some applications.
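To make the difference between the linearized and the original problem concrete, here is a minimal Python sketch. The data table, the function names, and the crude grid search used as a stand-in for direct minimization are all assumptions for illustration; the linearized formula requires every $y_k$ to be positive.

```python
import math

def linearized_c(xs, ys):
    """c minimizing sum (c*x_k - ln y_k)^2; requires every y_k > 0."""
    num = sum(math.log(y) * x for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

def phi(c, xs, ys):
    """Original objective: sum of (exp(c*x_k) - y_k)^2."""
    return sum((math.exp(c * x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical table of points (x_k, y_k).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 1.5, 2.9, 4.3]

c_lin = linearized_c(xs, ys)

# Crude direct minimization of phi on a fine grid centered at c_lin.
candidates = [c_lin + j * 1e-4 for j in range(-2000, 2001)]
c_direct = min(candidates, key=lambda c: phi(c, xs, ys))

print("linearized c:", c_lin, "phi:", phi(c_lin, xs, ys))
print("grid-search c:", c_direct, "phi:", phi(c_direct, xs, ys))
```

The grid search can only do as well as or better than the linearized value, because the grid contains $c$ from the linearized fit itself; in general the two values differ, which is the point made in the text.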
Linear and Nonlinear Example
The final example contains elements of linear and nonlinear theory. Suppose that an (xk , yk ) table is given with m + 1 entries and that a functional relationship such as
y = a sin(bx)
is suspected. Can the least-squares principle be used to obtain the appropriate values of the parameters a and b?
Notice that parameter b enters this function in a nonlinear way, creating some difficulty, as will be seen. According to the principle of least squares, the parameters should be chosen such that the expression
$$\sum_{k=0}^{m} [a \sin(b x_k) - y_k]^2$$
has a minimum value. The minimum value is sought by differentiating this expression with respect to $a$ and $b$ and setting these partial derivatives equal to zero. The results are
$$\begin{cases} \displaystyle \sum_{k=0}^{m} 2 [a \sin(b x_k) - y_k] \sin(b x_k) = 0 \\[2ex] \displaystyle \sum_{k=0}^{m} 2 [a \sin(b x_k) - y_k] \, a x_k \cos(b x_k) = 0 \end{cases}$$
If $b$ were known, $a$ could be obtained from either equation. The correct value of $b$ is the one for which these corresponding two $a$ values are identical. So each of the preceding equations should be solved for $a$, and the results set equal to each other. This process leads to the equation
$$\frac{\sum_{k=0}^{m} y_k \sin b x_k}{\sum_{k=0}^{m} (\sin b x_k)^2} = \frac{\sum_{k=0}^{m} x_k y_k \cos b x_k}{\sum_{k=0}^{m} x_k \sin b x_k \cos b x_k}$$
which can now be solved for parameter $b$, using, for example, the bisection method or the secant method. Then either side of this equation can be evaluated as the value of $a$.
Additional Details on SVD
The singular value decomposition (SVD) of a matrix is a factorization that can reveal important properties of the matrix that otherwise could escape detection. For example, from the SVD of a square matrix one could be alerted to the near-singularity of the matrix. Or from the SVD of a nonsquare matrix an unexpected loss of rank could be revealed. Since the SVD of a matrix yields a complete orthogonal decomposition, it provides a technique for computing the least-squares solution of a system of equations and at the same time producing the norm of the error vector.
Suppose that a given $m \times n$ matrix has the factorization
$$A = U D V^T$$
where $U = [u_1, u_2, \ldots, u_m]$ is an $m \times m$ orthogonal matrix, $V = [v_1, v_2, \ldots, v_n]$ is an $n \times n$ orthogonal matrix, and the $m \times n$ diagonal matrix $D$ contains the singular values of $A$ on its diagonal, listed in decreasing order. The singular values of a matrix $A$ are the positive square roots of the eigenvalues of $A^T A$. These are denoted by $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$. In detail, we have
$$U^T A V = D = \begin{bmatrix} \sigma_1 & & & & & \\ & \sigma_2 & & & & \\ & & \ddots & & & \\ & & & \sigma_r & & \\ & & & & 0 & \\ & & & & & \ddots \end{bmatrix}_{m \times n}$$
where $U^T U = I_m$ and $V^T V = I_n$. (In the above matrix, blank space corresponds to zero entries.) Moreover, we have $A v_i = \sigma_i u_i$ and $\sigma_i = \|A v_i\|_2$ where $v_i$ is column $i$ in $V$ and
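$u_i$ is column $i$ in $U$. Since $U$ is orthogonal, we obtain
$$\begin{aligned} \|Ax - b\|_2^2 &= \|U^T (Ax - b)\|_2^2 = \|U^T A x - U^T b\|_2^2 \\ &= \|U^T A (V V^T) x - U^T b\|_2^2 \\ &= \|(U^T A V)(V^T x) - U^T b\|_2^2 \\ &= \|D V^T x - U^T b\|_2^2 = \|D y - c\|_2^2 \\ &= \sum_{i=1}^{r} (\sigma_i y_i - c_i)^2 + \sum_{i=r+1}^{m} c_i^2 \end{aligned}$$
where $y = V^T x$ and $c = U^T b$. Here, $y$ is defined by $y_i = c_i/\sigma_i$ and $x$ by $x = Vy$. Since $c_i = u_i^T b$ and $x = Vy$, if $y_i = \sigma_i^{-1} c_i$ for $1 \le i \le r$ (and $y_i = 0$ for $i > r$), then the least-squares solution is
$$x_{LS} = \sum_{i=1}^{n} y_i v_i = \sum_{i=1}^{r} \sigma_i^{-1} c_i v_i = \sum_{i=1}^{r} \sigma_i^{-1} (u_i^T b) v_i$$
and
$$\|A x_{LS} - b\|_2^2 = \sum_{i=r+1}^{m} c_i^2 = \sum_{i=r+1}^{m} (u_i^T b)^2$$
which is the smallest of all two-norm minimizers. For additional details, see Golub and Van Loan [1996].
In conclusion, we obtain the following theorem.
■ THEOREM 1
SVD LEAST SQUARES THEOREM
Let $A$ be an $m \times n$ matrix of rank $r$. Let the SVD factorization be $A = U D V^T$. The least-squares solution of the system $Ax = b$ is $x_{LS} = \sum_{i=1}^{r} (\sigma_i^{-1} c_i) v_i$, where $c_i = u_i^T b$. If there exist many least-squares solutions to the given system, then the one of least 2-norm is $x_{LS}$ as described above.
EXAMPLE 1
Find the least-squares solution of this nonsquare system
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$$
using the singular value decomposition:
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} \frac{1}{3}\sqrt{6} & 0 & \frac{1}{3}\sqrt{3} \\ \frac{1}{6}\sqrt{6} & -\frac{1}{2}\sqrt{2} & -\frac{1}{3}\sqrt{3} \\ \frac{1}{6}\sqrt{6} & \frac{1}{2}\sqrt{2} & -\frac{1}{3}\sqrt{3} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \frac{1}{2}\sqrt{2} & \frac{1}{2}\sqrt{2} \\ \frac{1}{2}\sqrt{2} & -\frac{1}{2}\sqrt{2} \end{bmatrix}$$
Solution
We have $r = \operatorname{rank}(A) = 2$ and the singular values $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$. This leads to
$$c_1 = u_1^T b = \begin{bmatrix} \frac{1}{3}\sqrt{6} & \frac{1}{6}\sqrt{6} & \frac{1}{6}\sqrt{6} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} = \tfrac{1}{3}\sqrt{6}$$
and
$$c_2 = u_2^T b = \begin{bmatrix} 0 & -\frac{1}{2}\sqrt{2} & \frac{1}{2}\sqrt{2} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} = \sqrt{2}$$
and
$$x_{LS} = \sigma_1^{-1} c_1 v_1 + \sigma_2^{-1} c_2 v_2 = \frac{1}{\sqrt{3}} \cdot \frac{\sqrt{6}}{3} \begin{bmatrix} \frac{1}{2}\sqrt{2} \\ \frac{1}{2}\sqrt{2} \end{bmatrix} + \sqrt{2} \begin{bmatrix} \frac{1}{2}\sqrt{2} \\ -\frac{1}{2}\sqrt{2} \end{bmatrix} = \begin{bmatrix} \frac{1}{3} \\ \frac{1}{3} \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} \frac{4}{3} \\ -\frac{2}{3} \end{bmatrix}$$
This solution is the same as that from the normal equations. ■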
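The arithmetic in Example 1 can be checked numerically. The sketch below assumes NumPy and uses illustrative variable names; the signs of the singular vectors returned by the library may differ from those shown above, but the products $\sigma_i^{-1} (u_i^T b) v_i$ do not depend on that choice.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, -1.0, 1.0])

# Full SVD: U is 3x3, s holds the singular values, Vt is V^T (2x2).
U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-12))          # numerical rank

c = U.T @ b                         # c_i = u_i^T b
# x_LS = sum over i <= r of (c_i / sigma_i) v_i, where v_i is row i of V^T.
x_ls = sum((c[i] / s[i]) * Vt[i] for i in range(r))

# Squared residual norm predicted by the theorem: sum of c_i^2 for i > r.
predicted = float(np.sum(c[r:] ** 2))
actual = float(np.sum((A @ x_ls - b) ** 2))

print("singular values:", s)
print("x_LS:", x_ls)
print("squared residual (predicted, actual):", predicted, actual)
```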
Using the Singular Value Decomposition
This material requires the theory of the singular value decomposition discussed in Section 8.3.
An important application of the singular value decomposition is in the matrix least-squares problem, to which we now return. For any system of linear equations $Ax = b$, we want to define a unique minimal solution. This is described as follows. Let $A$ be $m \times n$, and define
$$\rho = \inf \{ \|Ax - b\|_2 : x \in \mathbb{R}^n \}$$
The minimal solution of our system is taken to be the point of smallest norm in the set $\{x : \|Ax - b\|_2 = \rho\}$. If the system is consistent, then $\rho = 0$, and we are simply asking for the point of least norm among all solutions. If the system is inconsistent, we want $Ax$ to be as close as possible to $b$; that is, $\|Ax - b\|_2 = \rho$. If there are many such points, we choose the one closest to the origin.
The minimal solution is produced by using the pseudo-inverse of A, and this object, in turn, can be computed from the singular value decomposition of A as discussed in Section 8.3. First, consider a diagonal m × n matrix of the following form, where the σ j are positive numbers:
$$D = \begin{bmatrix} \sigma_1 & & & & & \\ & \sigma_2 & & & & \\ & & \ddots & & & \\ & & & \sigma_r & & \\ & & & & 0 & \\ & & & & & \ddots \end{bmatrix}_{m \times n}$$
Its pseudo-inverse $D^+$ is defined to be of the same form, except that it is to be $n \times m$ and it has $1/\sigma_j$ on its diagonal. For example,
$$D = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix} \qquad D^+ = \begin{bmatrix} \frac{1}{5} & 0 \\ 0 & \frac{1}{2} \\ 0 & 0 \end{bmatrix}$$
If $A$ is any $m \times n$ matrix and if $U D V^T$ is one of its singular value decompositions, we define the pseudo-inverse of $A$ to be
$$A^+ = V D^+ U^T$$
We do not stop to prove that the pseudo-inverse of $A$ is unique if we impose the order $\sigma_1 \ge \sigma_2 \ge \cdots$.
■ THEOREM 2
MINIMAL SOLUTION THEOREM
Consider a system of linear equations $Ax = b$, in which $A$ is an $m \times n$ matrix. The minimal solution of the system is $A^+ b$.
Proof
Use the notation established above, and let $x$ be any point in $\mathbb{R}^n$. Define $y = V^T x$ and $c = U^T b$. Using the properties of $V$ and $U$, we obtain
$$\begin{aligned} \rho &= \inf_x \|Ax - b\|_2 \\ &= \inf_x \|U D V^T x - b\|_2 \\ &= \inf_x \|U^T (U D V^T x - b)\|_2 \\ &= \inf_x \|D V^T x - U^T b\|_2 \\ &= \inf_y \|D y - c\|_2 \end{aligned}$$
Exploiting the special nature of $D$, we have
$$\|D y - c\|_2^2 = \sum_{i=1}^{r} (\sigma_i y_i - c_i)^2 + \sum_{i=r+1}^{m} c_i^2$$
To minimize this last expression, we define $y_i = c_i/\sigma_i$ for $1 \le i \le r$. The other components can remain unspecified. But to get the $y$ of least norm, we must set $y_i = 0$ for $r + 1 \le i \le n$. This construction is carried out by the pseudo-inverse $D^+$, so $y = D^+ c$. Hence, we obtain
$$x = V y = V D^+ c = V D^+ U^T b = A^+ b$$
Let us express the minimal solution in another form, taking advantage of the zero components in the vector $y$. Since $y_i = 0$ for $i > r$, we require only the first $r$ components of $y$. These are given by $y_i = c_i/\sigma_i$. Now it is evident that only the first $r$ components of $c$ are needed. Since $c = U^T b$, $c_i$ is the inner product of row $i$ in $U^T$ with the vector $b$. That is the same as the inner product of the $i$th column of $U$ with $b$. Thus,
$$y_i = u_i^T b / \sigma_i \qquad (1 \le i \le r)$$
The minimal solution, which we may denote by $x^*$, is then
$$x^* = V y = \sum_{i=1}^{r} y_i v_i \qquad ■$$
An example of this procedure can be carried out in mathematical software such as Matlab, Maple, or Mathematica. We can generate a system of 20 equations with three unknowns by a random process. This technique is often used in testing software, especially in benchmarking studies, in which a large number of examples is run with careful timing. The software has a provision for entering random matrices. When executed, the computer program first exhibits the random input. The three singular values of matrix $A$ are displayed. Then the diagonal $20 \times 3$ matrix $D$ is displayed. A check on the numerical work is made by computing $U D V^T$, which should equal $A$. Then the pseudo-inverse $D^+$ is computed. Next, the pseudo-inverse $A^+$ is computed. The minimal solution, $x = A^+ b$, is computed, as well as the residual vector, $r = Ax - b$. Then the orthogonality condition $A^T r = 0$ is checked. This program is therefore carrying out all the steps described above for obtaining the minimal solution of a system of equations. Another example will be given below to show what happens in the case of a loss in rank. (See Computer Problem 12.3.10.)
In problems of this type, the user must examine the singular values and decide whether any are small enough to warrant being set equal to zero. The necessity of this step becomes clear when we look at the definition of $D^+$. The reciprocals of the singular values are the principal constituents of this matrix. Any very small singular value that is not set equal to zero will therefore have a disruptive effect on the subsequent calculations. A rule of thumb that has been recommended is to drop any singular value whose magnitude is less than $\sigma_1$ times the inherent accuracy of the coefficient matrix. Thus, if the data are accurate to three decimal places and if $\sigma_1 = 5$, then any $\sigma_i$ less than 0.005 should be set equal to zero.
An example of a small matrix having a near-deficiency in rank is given next. In the Maple program, certain singular values are set equal to zero if they fail to meet the relative size criterion mentioned in the previous paragraph. Also, we have added, as a check on the calculations, a verification of the following four Penrose properties for a pseudo-inverse.
■ THEOREM 3
PENROSE PROPERTIES OF THE PSEUDO-INVERSE
The pseudo-inverse $A^+$ for the matrix $A$ has these four properties:
$$A = A A^+ A \qquad A^+ = A^+ A A^+ \qquad A A^+ = (A A^+)^T \qquad A^+ A = (A^+ A)^T$$
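The random 20 × 3 experiment and the Penrose checks can be sketched in Python along the following lines. This is not the Matlab or Maple program referred to in the text; the helper name `pseudo_inverse`, the seed, and the use of NumPy are assumptions, and the helper assumes that no singular value needs to be set to zero (true with probability 1 for a random matrix).

```python
import numpy as np

def pseudo_inverse(A):
    """A+ = V D+ U^T, assuming every singular value is safely nonzero."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A)          # U is m x m, Vt is n x n
    D_plus = np.zeros((n, m))
    k = len(s)
    D_plus[:k, :k] = np.diag(1.0 / s)    # reciprocals on the diagonal
    return Vt.T @ D_plus @ U.T

rng = np.random.default_rng(0)           # arbitrary seed for repeatability
A = rng.random((20, 3))                  # 20 equations, 3 unknowns
b = rng.random(20)

A_plus = pseudo_inverse(A)
x = A_plus @ b                           # minimal solution
r = A @ x - b                            # residual vector

checks = {
    "A^T r = 0": np.allclose(A.T @ r, 0.0),
    "A = A A+ A": np.allclose(A @ A_plus @ A, A),
    "A+ = A+ A A+": np.allclose(A_plus @ A @ A_plus, A_plus),
    "A A+ symmetric": np.allclose(A @ A_plus, (A @ A_plus).T),
    "A+ A symmetric": np.allclose(A_plus @ A, (A_plus @ A).T),
}
for name, ok in checks.items():
    print(name, ok)
```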
We can use mathematical software such as Matlab, Maple, or Mathematica for finding the pseudo-inverse of a matrix that has a deficiency in rank. For example, consider this
5 × 3 matrix:
$$A = \begin{bmatrix} -85 & -55 & -115 \\ -35 & 97 & -167 \\ 79 & 56 & 102 \\ 63 & 57 & 69 \\ 45 & -8 & 97.5 \end{bmatrix} \tag{7}$$
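For Matrix (7), the third column is almost twice the first column minus the second (only the last row departs from this, by 0.5), so one singular value is tiny compared with the others. The following Python sketch drops it with a relative tolerance and checks the Penrose properties for the resulting rank-2 matrix. The tolerance value `REL_TOL = 1e-2` and the variable names are assumptions chosen for illustration, not values taken from the text.

```python
import numpy as np

A = np.array([[-85.0, -55.0, -115.0],
              [-35.0,  97.0, -167.0],
              [ 79.0,  56.0,  102.0],
              [ 63.0,  57.0,   69.0],
              [ 45.0,  -8.0,   97.5]])

REL_TOL = 1e-2   # assumed relative accuracy of the data

# Thin SVD: U is 5x3, s has 3 entries in decreasing order, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
keep = s >= REL_TOL * s[0]               # singular values that survive
rank = int(keep.sum())

s_inv = np.where(keep, 1.0 / s, 0.0)     # reciprocal, or 0 if dropped
A_plus = Vt.T @ np.diag(s_inv) @ U.T     # pseudo-inverse after truncation
A_trunc = U @ np.diag(np.where(keep, s, 0.0)) @ Vt   # nearby rank-deficient matrix

print("singular values:", s)
print("numerical rank:", rank)
print("distance from A to truncated matrix:", np.linalg.norm(A - A_trunc, 2))
```

Note that after truncation the four Penrose properties hold for the truncated matrix `A_trunc`, and only approximately (to within the dropped singular value) for `A` itself.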
A tolerance value is set so that in the evaluation of singular values any value whose magnitude is less than the tolerance is treated as zero. We can verify the Penrose properties for this matrix. (See Computer Problem 12.3.11.)
Summary
(1) We attempt to solve an inconsistent system
$$\sum_{j=0}^{n} a_{kj} x_j = b_k \qquad (0 \le k \le m)$$
in which there are $m + 1$ equations but only $n + 1$ unknowns with $m > n$. We minimize the sum of the squares of the residuals and are led to minimize the expression
$$\varphi(x_0, x_1, \ldots, x_n) = \sum_{k=0}^{m} \Bigl( \sum_{j=0}^{n} a_{kj} x_j - b_k \Bigr)^2$$
We solve the $(n + 1) \times (n + 1)$ system of normal equations
$$\sum_{j=0}^{n} \Bigl( \sum_{k=0}^{m} a_{ki} a_{kj} \Bigr) x_j = \sum_{k=0}^{m} b_k a_{ki} \qquad (0 \le i \le n)$$
by Gaussian elimination, and the solution is a best approximate solution of the original system in the least-squares sense.
Additional References
See Acton [1959], Björck [1996], Branham [1990], Cheney [1982, 2001], Forsythe [1957], van Huffel and Vandewalle [1991], Lawson and Hanson [1995], Rice [1971], Rice and White [1964], Rivlin [1990], Späth [1992], and Whittaker and Robinson [1944].
Problems 12.3
1. Analyze the least-squares problem of fitting data by a function of the form $y = x^c$.
ᵃ2. Show that the Hilbert matrix (Computer Problem 7.2.4) arises in the normal equations when we minimize
$$\int_0^1 \Bigl( \sum_{j=0}^{n} c_j x^j - f(x) \Bigr)^2 dx$$
ᵃ3. Find a function of the form $y = e^{cx}$ that best fits this table:
x: 0, 1
y: 1/2, 1
ᵃ4. (Continuation) Repeat the preceding problem for the following table:
x: 0, 1
y: a, b
5. (Continuation) Repeat the preceding problem under the supposition that $b$ is negative.
ᵃ6. Show that the normal equation for the problem of fitting $y = e^{cx}$ to points $(1, -12)$ and $(2, 7.5)$ has two real roots: $c = \ln 2$ and $c = 0$. Which value is correct for the fitting problem?
7. Consider the inconsistent System (1). Suppose that each equation has associated with it a positive number $w_i$ indicating its relative importance or reliability. How should Equations (2) and (3) be modified to reflect this?
ᵃ8. Determine the best approximate solution of the inconsistent system of linear equations
$$\begin{cases} 2x + 3y = 1 \\ x - 4y = -9 \\ 2x - y = -1 \end{cases}$$
in the least-squares sense.
9. ᵃa. Find the constant $c$ for which $cx$ is the best approximation in the sense of least squares to the function $\sin x$ on the interval $[0, \pi/2]$.
ᵃb. Do the same for $e^x$ on $[0, 1]$.
10. Analyze the problem of fitting a function $y = (c - x)^{-1}$ to a table of $m + 1$ points.
11. Show that the normal equations for the least-squares solution of $Ax = b$ can be written $(A^T A)x = A^T b$.
12. Derive the normal equations given by System (5).
13. A table of values $(x_k, y_k)$, where $k = 0, 1, \ldots, m$, is obtained from an experiment. When plotted on semilogarithmic graph paper, the points lie nearly on a straight line, implying that $y \approx e^{ax+b}$. Suggest a simple procedure for obtaining parameters $a$ and $b$.
ᵃ14. In fitting a table of values to a function of the form $a + bx^{-1} + cx^{-2}$, we try to make each point lie on the curve. This leads to $a + bx_k^{-1} + cx_k^{-2} = y_k$ for $0 \le k \le m$. An equivalent equation is $ax_k^2 + bx_k + c = y_k x_k^2$ for $0 \le k \le m$. Are the least-squares problems for these systems of equations equivalent?
ᵃ15. A table of points $(x_k, y_k)$ is plotted and appears to lie on a hyperbola of the form $y = (a + bx)^{-1}$. How can the linear theory of least squares be used to obtain good estimates of $a$ and $b$?
ᵃ16. Consider $f(x) = e^{2x}$ over $[0, \pi]$. We wish to approximate the function by a trigonometric polynomial of the form $p(x) = a + b\cos(x) + c\sin(x)$. Determine the linear system to be solved for determining the least-squares fit of $p$ to $f$.
ᵃ17. Find the constant $c$ that makes the expression $\int_0^1 (e^x - cx)^2 \, dx$ a minimum.
18. Show that in every least-squares matrix problem, the normal equations have a symmetric coefficient matrix.
19. Verify that the following steps produce the least-squares solution of $Ax = b$.
a. Factor $A = QR$, where $Q$ and $R$ have the properties described in the text.
b. Define $y = Q^T b$.
c. Solve the upper triangular system $Rx = y$.
ᵃ20. What value of $c$ should be used if a table of experimental data $(x_i, y_i)$ for $0 \le i \le m$ is to be represented by the formula $y = c \sin x$? An explicit usable formula for $c$ is required. Use the principle of least squares.
21. Refer to the formulas leading to the minimal solution of the system $Ax = b$. Prove that the $y$-vector is given by the formula $y_i = \sigma_i^{-2} b^T A v_i$ for $1 \le i \le r$.
22. Prove that the pseudo-inverse satisfies the four Penrose equations.
23. Use the four Penrose properties to find the pseudo-inverse of the matrix $[a, 0]^T$, where $a > 0$. Prove that the pseudo-inverse is a discontinuous function of $a$.
24. Use the technique suggested in the preceding problem to find the pseudo-inverse of the $m \times n$ matrix consisting solely of 1's.
25. Use the Penrose equations to find the pseudo-inverse of any $1 \times n$ matrix and any $m \times 1$ matrix.
26. (Multiple choice) Let $A = PDQ$, where $A$ is an $m \times n$ matrix, $P$ is an $m \times m$ unitary matrix, $D$ is an $m \times n$ diagonal matrix, and $Q$ is an $n \times n$ unitary matrix. Which equation can be deduced from those hypotheses?
a. $A^* = P^* D^* Q^*$ b. $A^{-1} = Q^* D^{-1} P^*$ c. $D = PAQ$ d. $A^* A = Q^* D^* D Q$ e. None of these.
27. (Multiple choice, continuation) Assume the hypotheses of the preceding problem. Use the notation $+$ to indicate a pseudo-inverse. Which equation is correct?
a. $A^+ = P D^+ Q$ b. $A^* = Q^* D^{-1} P^*$ c. $A^+ = Q^* D^+ P^*$ d. $A^{-1} = Q^* D^+ P^*$ e. None of these.
28. (Multiple choice) Let $D$ be an $m \times n$ diagonal matrix with diagonal elements $p_1, p_2, \ldots, p_r, 0, 0, \ldots, 0$. Here all the numbers $p_i$, for $1 \le i \le r$, are positive. Which assertion is not valid?
a. $D^+$ is the $m \times n$ diagonal matrix with diagonal elements $(1/p_1, 1/p_2, \ldots, 1/p_r, 0, 0, \ldots, 0)$
b. $D^+$ is the $n \times m$ diagonal matrix with diagonal elements $(1/p_1, 1/p_2, \ldots, 1/p_r, 0, 0, \ldots, 0)$
c. $(D^+)^* = (D^*)^+$
d. $D^{++} = D$
e. None of these.
29. (Multiple choice) Consider an inconsistent system of equations $Ax = b$. Let $U$ be a unitary matrix and let $E = U^* A$. Let $v$, $w$, and $z$ be vectors such that $Uv = Eb$, $Uw = E^* b$, $Ey = U^* b$, and $Ex = Ub$. A vector that solves the least-squares problem for the original system $Ax = b$ is:
a. $v$ b. $w$ c. $y$ d. $z$ e. None of these.
Computer Problems 12.3
ᵃ1. Using the method suggested in the text, fit the data in the table
x: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
y: 0.6, 1.1, 1.6, 1.8, 2.0, 1.9, 1.7, 1.3
by a function $y = a \sin bx$.
2. (Prony's method, $n = 1$) To fit a table of the form
x: 1, 2, ..., m
y: $y_1$, $y_2$, ..., $y_m$
by the function $y = ab^x$, we can proceed as follows: If $y$ is actually $ab^x$, then $y_k = ab^k$ and $y_{k+1} = b y_k$ for $k = 1, 2, \ldots, m - 1$. So we determine $b$ by solving this system of equations using the least-squares method. Having found $b$, we find $a$ by solving the equations $y_k = ab^k$ in the least-squares sense. Write a program to carry out this procedure, and test it on an artificial example.
3. (Continuation) Modify the procedure of the preceding computer problem to handle any case of equally spaced points.
4. A quick way of fitting a function of the form
$$f(x) \approx \frac{a + bx}{1 + cx}$$
is to apply the least-squares method to the problem $(1 + cx) f(x) \approx a + bx$. Use this technique to fit the world population data given here:
Year    Population (billions)
1000    0.340
1650    0.545
1800    0.907
1900    1.61
1950    2.51
1960    3.15
1970    3.65
1980    4.20
1990    5.30
Determine when the world population will become infinite!
5. (Student research project) Explore the question of whether the least-squares method should be used to predict. For example, study the variances in the preceding problem to determine whether a polynomial of any degree would be satisfactory.
6. Write a procedure that takes as input an $(m + 1) \times (n + 1)$ matrix $A$ and an $(m + 1)$-vector $b$ and returns the least-squares solution of the system $Ax = b$.
7. Write a Maple program to find the minimal solution of any system of equations, $Ax = b$.
8. (Continuation) Write a Matlab program for the task in the preceding problem.
9. Investigate some of the newer methods for solving inconsistent linear equations $Ax = b$, when the criterion is to make $Ax$ close to $b$ in one of the other useful norms, namely, the maximum norm $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$ or the $\ell_1$ norm $\|x\|_1 = \sum_{i=1}^{n} |x_i|$. Use some of the available software.
10. Using mathematical software such as Matlab, Maple, or Mathematica, generate a system of twenty equations with three unknowns by a random number generator. Form the pseudo-inverse matrix and verify the properties in Theorem 2.
11. (Continuation) Repeat using Matrix (7).
12. Write a computer program for carrying out the least-squares curve fit using Chebyshev polynomials. Test the code on a suitable data set and plot the results.
13
Monte Carlo Methods and Simulation
FIGURE 13.1
Traffic flow
A highway engineer wishes to simulate the flow of traffic for a proposed design of a major freeway intersection. The information that is obtained will then be used to determine the capacity of storage lanes (in which cars must slow down to yield the right of way). The intersection has the form shown in Figure 13.1, and various flows (cars per minute) are postulated at the points where arrows are drawn. By writing and running a simulation program, the engineer can study the effect of different speed limits, determine which flows lead to saturation (bottlenecks), and so on. Some techniques for constructing such programs are developed in this chapter.
13.1 Random Numbers
This chapter differs from most of the others in its point of view. Instead of addressing clear-cut mathematical problems, it attempts to develop methods for simulating complicated processes or phenomena. If the computer can be made to imitate an experiment or a process, then by repeating the computer simulation with different data, we can draw statistical conclusions. In such an approach, the conclusions may lack a high degree of mathematical precision but still be sufficiently accurate to enable us to understand the process being simulated.
Particular emphasis is given to problems in which the computer simulation involves an element of chance. The whimsical name of Monte Carlo methods was applied some years
ago by Stanislaw M. Ulam (1909–1984) to this way of imitating reality by a computer. Since chance or randomness is part of the method, we begin with the elusive concept of random numbers.
Consider a sequence of real numbers $x_1, x_2, \ldots$ all lying in the unit interval $(0, 1)$. Expressed informally, the sequence is random if the numbers seem to be distributed haphazardly throughout the interval and if there seems to be no pattern in the progression $x_1, x_2, \ldots$. For example, if all the numbers in decimal form begin with the digit 3, then the numbers are clustered in the subinterval $0.3 \le x < 0.4$ and are not randomly distributed in $(0, 1)$. If the numbers are monotonically increasing, they are not random. If each $x_i$ is obtained from its predecessor by a simple continuous function, say, $x_i = f(x_{i-1})$, then the sequence is not random (although it might appear to be so). A precise definition of randomness is quite difficult to formulate, and the interested reader may wish to consult an article by Chaitin [1975], in which randomness is related to the complexity of computer algorithms! Thus, it seems best, at least in introductory material, to accept intuitively the notion of a random sequence of numbers in an interval and to accept certain algorithms for generating sequences that are more or less random.
A recommended reference is the book of Niederreiter [1992].
Random-Number Algorithms and Generators
Most computer systems have random-number generators, which are procedures that produce either a single random number or an entire array of random numbers with each call. In this chapter, we call such a procedure Random. The reader can use a random-number generator available on his or her own computing system, one available within the computer language being used, or one of the generators described below. For example, random-number generators are contained in mathematical software systems such as Matlab, Maple, and Mathematica as well as many computer programming languages. These random-number procedures return one or an array of uniformly distributed pseudo-random numbers in the unit interval (0, 1) depending on whether the argument is a scalar variable or an array. A random seed procedure restarts or queries the pseudo-random-number generator. The random-number generator can produce hundreds of thousands of pseudo-random numbers before repeating itself, at least theoretically.
For the problems in this chapter, one should select a routine to provide random numbers uniformly distributed in the interval $(0, 1)$. A sequence of numbers is uniformly distributed in the interval $(0, 1)$ if no subset of the interval contains more than its share of the numbers. In particular, the probability that an element $x$ drawn from the sequence falls within the subinterval $[a, a + h]$ should be $h$ and hence independent of the number $a$. Similarly, if $p_i = (x_i, y_i)$ are random points in the plane uniformly distributed in some rectangle, then the number of these points that fall inside a small square of area $k$ should depend only on $k$ and not on where the square is located inside the rectangle.
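The subinterval property can be checked empirically. The sketch below uses Python's standard `random` module as a stand-in for the procedure Random; the seed, the sample size, and the helper name `fraction_in_subinterval` are assumptions chosen for illustration.

```python
import random

def fraction_in_subinterval(samples, a, h):
    """Fraction of the samples that fall in [a, a + h]."""
    return sum(1 for x in samples if a <= x <= a + h) / len(samples)

rng = random.Random(12345)               # arbitrary fixed seed
samples = [rng.random() for _ in range(100_000)]

h = 0.1
starts = (0.0, 0.25, 0.5, 0.9)
fractions = [fraction_in_subinterval(samples, a, h) for a in starts]

# For a uniform sequence, each fraction should be close to h
# regardless of where the subinterval starts.
for a, frac in zip(starts, fractions):
    print(f"[{a}, {a + h}]: {frac:.4f}")
```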
Random numbers produced by a computer code cannot be truly random because the manner in which they are produced is completely deterministic; that is, no element of chance is actually present. But the sequences that are produced by these routines appear to be random, and they do pass certain tests for randomness. Some authors prefer to emphasize this point by calling such computer-generated sequences pseudo-random numbers.
If the reader wishes to program a random-number generator, the following one should be satisfactory on a machine that has 32-bit word length. This algorithm generates $n$ random numbers $x_1, x_2, \ldots, x_n$ uniformly distributed in the open interval $(0, 1)$ by means of the following recursive algorithm:
Here, all $l_i$'s are integers in the range $1 < l_i < 2^{31} - 1$. The initial integer $l_0$ is called the seed for the sequence and is selected as any integer between 1 and the Mersenne prime number $2^{31} - 1 = 2147483647$.
For information on portable random-number generators, the reader should consult the article by Schrage [1979]. A fast normal random-number generator can be written in only a few lines of code as presented in Leva [1992]. It is based on the ratio of uniform deviates method of Kinderman and Monahan [1977].
An external function procedure to generate a new array of pseudo-random numbers per call could be based on the following pseudocode:
integer array $(l_i)_{0:n}$; real array $(x_i)_{1:n}$
$l_0 \leftarrow$ any integer such that 1
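The recurrence itself and the rest of the pseudocode are not reproduced in this copy. As a sketch only, the following Python code assumes the classical Lehmer multiplicative congruential generator with multiplier $7^5 = 16807$ and modulus $2^{31} - 1$ (often called the Park-Miller "minimal standard" generator); the multiplier and the function names are assumptions, not taken from the surviving text.

```python
MODULUS = 2**31 - 1        # Mersenne prime 2147483647 (stated in the text)
MULTIPLIER = 7**5          # 16807: assumed multiplier, not given in this copy

def lehmer_states(seed, n):
    """Return the integers l_1, ..., l_n generated from l_0 = seed."""
    if not 1 <= seed < MODULUS:
        raise ValueError("seed must satisfy 1 <= seed < 2**31 - 1")
    states = []
    l = seed
    for _ in range(n):
        l = (MULTIPLIER * l) % MODULUS   # l_i = 7^5 * l_{i-1} mod (2^31 - 1)
        states.append(l)
    return states

def random_array(seed, n):
    """Return x_1, ..., x_n in the open interval (0, 1)."""
    return [l / MODULUS for l in lehmer_states(seed, n)]

xs = random_array(seed=123456789, n=5)   # arbitrary seed
print(xs)
```

Python integers do not overflow, so the product can be formed directly; on a machine limited to 32-bit arithmetic the multiplication must be arranged more carefully, which is the portability issue the text attributes to Schrage [1979].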
Writeaprogramtogenerateandprint1000pointsuniformlyandrandomlydistributed inthecircle(x−3)2 +(y+1)29.
Generate 1000 random numbers xi according to a uniform distribution in the interval (0, 1). Define a function f on (0, 1) as follows: f (t ) is the number of random numbers x1, x2, . . . , x1000 less than t. Compute f (t)/1000 for 200 points t uniformly distributed in (0, 1). What do you expect f (t)/1000 to be? Is this expectation borne out by the experiment? If a plotter is available, plot f (t )/1000.
Let ni (1 i 1000) be a sequence of integers that satisfies 0 ni 9. Write a program to test the given sequence for periodicity. (The sequence is periodic if there is an integer k such that ni = ni+k for all i.)
Generateinthecomputer1000randomnumbersintheinterval(0,1).Printandexamine them for evidence of nonrandom behavior.
Generate1000randomnumbersxi (1i1000)onyourcomputer.Letni denotethe eighth decimal digit in xi . Count how many 0’s, 1’s, . . . , 9’s there are among the 1000 numbers ni . How many of each would you expect? This code can be written with nine statements.
(Continuation) Using a random-number generator, generate 1000 random numbers, and count how many times the digit i occurs in the jth decimal place. Print a table of these values—that is, frequency of digit versus decimal place. By examining the table, determine which decimal place seems to produce the best uniform distribution of random digits. Hint: Use the routine from Computer Problem 1.1.7 to compute the arithmetic mean, variance, and standard deviations of the table entries.
Using random integers, write a short program to simulate five people matching coin flips. Print the percentage of match-ups (five of a kind) after 125 flips.
Write a program to generate 1600 random points uniformly distributed in the sphere defined by $x^2 + y^2 + z^2 \le 1$. Count the number of random points in the first octant.
Write a program to simulate 1000 simultaneous flips of three coins. Print the number of times that two of the three coins come up heads.
Compute 1000 triples of random numbers drawn from a uniform distribution. For each triple (x, y, z), compute the leading significant digit of the product xyz. (The leading significant digit is one of 1, 2, . . . , 9.) Determine the frequencies with which the digits 1 through 9 occur among the 1000 cases. Try to account for the fact that these digits do not occur with the same frequency. (For example, 1 occurs approximately 7 times more often than 9.) If you are intrigued by this, you may wish to consult the articles by Flehinger [1966], Raimi [1969], and Turner [1982].
Run the example programs in this section and see whether similar results are obtained on your computer system.
13.1 Random Numbers 543
544
Chapter 13
Monte Carlo Methods and Simulation
20. Write a program to generate and plot 1000 pseudo-random points with the following exponential distribution inside the figure below: x = − ln(1 − r )/λ for r ∈ [0, 1) and λ = 1/30.
[Figure for Problem 20: region drawn on x, y, and z axes]
21. Improve the program Coarse Check by using ten or a hundred buckets instead of two.
22. (Student research project) Investigate some of the latest developments on random-number generators and explore parallel random-number generators. Random numbers are often needed for distributions other than the uniform distribution, so this has a statistical aspect.

13.2 Estimation of Areas and Volumes by Monte Carlo Techniques

Numerical Integration
Now we turn to applications, the first being the approximation of a definite integral by the Monte Carlo method. If we select the first n elements x1, x2, . . . , xn from a random sequence in the interval (0, 1), then
$$\int_0^1 f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$
Here, the integral is approximated by the average of n numbers f(x1), f(x2),…, f(xn). When this is actually carried out, the error is of order 1/√n, which is not at all competitive with good algorithms, such as the Romberg method. However, in higher dimensions, the Monte Carlo method can be quite attractive. For example,
$$\int_0^1\!\int_0^1\!\int_0^1 f(x,y,z)\,dx\,dy\,dz \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i,y_i,z_i)$$
where $(x_i, y_i, z_i)$ is a random sequence of $n$ points in the unit cube $0 \le x \le 1$, $0 \le y \le 1$, and $0 \le z \le 1$. To obtain random points in the cube, we assume that we have a random sequence in (0, 1) denoted by $\xi_1, \xi_2, \xi_3, \xi_4, \xi_5, \xi_6, \ldots$ To get our first random point $p_1$ in the cube, just let $p_1 = (\xi_1, \xi_2, \xi_3)$. The second is, of course, $p_2 = (\xi_4, \xi_5, \xi_6)$, and so on.
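The two averaging formulas above can be sketched in Python. This is only an illustration under stated assumptions: the function names `mc_unit_interval` and `mc_unit_cube` are hypothetical, Python's standard `random` module stands in for the random sequence $\xi_1, \xi_2, \ldots$, and the test integrands are chosen here because their exact integrals are known.

```python
import math
import random


def mc_unit_interval(f, n, rng):
    """Approximate the integral of f over (0, 1) by the mean of n samples."""
    return sum(f(rng.random()) for _ in range(n)) / n


def mc_unit_cube(f, n, rng):
    """Approximate the integral of f over the unit cube by the mean of n samples."""
    total = 0.0
    for _ in range(n):
        # Three consecutive members of the random sequence form one point p_i.
        x = rng.random()
        y = rng.random()
        z = rng.random()
        total += f(x, y, z)
    return total / n


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed, fixed only for repeatability
    # Integral of e^x over (0, 1); the exact value is e - 1.
    print(mc_unit_interval(math.exp, 10_000, rng), "vs exact", math.e - 1.0)
    # Integral of xyz over the unit cube; the exact value is 1/8.
    print(mc_unit_cube(lambda x, y, z: x * y * z, 10_000, rng), "vs exact", 0.125)
```

Because the error behaves like $1/\sqrt{n}$, gaining one more correct decimal digit should require roughly a hundred times as many points.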
If the interval (in a one-dimensional integral) is not of length 1 but, say, is the general case $(a, b)$, then the average of $f$ over $n$ random points in $(a, b)$ is not simply an approximation for the integral but rather for
$$\frac{1}{b-a}\int_a^b f(x)\,dx$$
which agrees with our intention that the function $f(x) = 1$ have an average of 1. Similarly, in higher dimensions, the average of $f$ over a region is obtained by integrating and dividing by the area, volume, or measure of that region. For instance,
$$\frac{1}{8}\int_1^3\!\int_{-1}^{1}\!\int_0^2 f(x,y,z)\,dx\,dy\,dz$$
is the average of $f$ over the parallelepiped described by the following three inequalities: $0 \le x \le 2$, $-1 \le y \le 1$, $1 \le z \le 3$.
To keep the limits of integration straight, recall that
$$\int_a^b\!\int_c^d f(x,y)\,dx\,dy = \int_a^b\left[\int_c^d f(x,y)\,dx\right]dy$$
and
$$\int_{a_1}^{a_2}\!\int_{b_1}^{b_2}\!\int_{c_1}^{c_2} f(x,y,z)\,dx\,dy\,dz = \int_{a_1}^{a_2}\left\{\int_{b_1}^{b_2}\left[\int_{c_1}^{c_2} f(x,y,z)\,dx\right]dy\right\}dz$$
So if $(x_i, y_i)$ denote random points with appropriate uniform distribution, the following examples illustrate Monte Carlo techniques:
$$\int_0^5 f(x)\,dx \approx \frac{5}{n}\sum_{i=1}^{n} f(x_i)$$
$$\int_2^5\!\int_1^6 f(x,y)\,dx\,dy \approx \frac{15}{n}\sum_{i=1}^{n} f(x_i,y_i)$$
In each case, the random points should be uniformly distributed in the regions involved. In general, we have
$$\int_A f \approx (\text{measure of } A) \times (\text{average of } f \text{ over } n \text{ random points in } A)$$
Here, we are using the fact that the average of a function on a set is equal to the integral of the function over the set divided by the measure of the set.

Example and Pseudocode

Let us consider the problem of obtaining the numerical value of the integral
$$\iint_\Omega \sin\sqrt{\ln(x+y+1)}\,dx\,dy = \iint_\Omega f(x,y)\,dx\,dy$$
over the disk in xy-space, defined by the inequality
$$\Omega = \left\{(x,y) : \left(x-\tfrac{1}{2}\right)^2 + \left(y-\tfrac{1}{2}\right)^2 \le \tfrac{1}{4}\right\}$$

FIGURE 13.6 Sketch of surface f(x,y) above disk
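A sketch of this domain, with a surface above it, is shown in Figure 13.6. We proceed by generating random points in the square and discarding those that do not lie in the disk. We take $n = 5000$ points in the disk. If the points are $p_i = (x_i, y_i)$, then the integral is estimated to be
$$\begin{aligned}
\iint_\Omega f(x,y)\,dx\,dy &\approx (\text{area of disk}) \times (\text{average height of } f \text{ over } n \text{ random points}) \\
&= \pi r^2\left[\frac{1}{n}\sum_{i=1}^{n} f(p_i)\right] \\
&= \frac{\pi}{4n}\sum_{i=1}^{n} f(p_i)
\end{aligned}$$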
The pseudocode for this example follows. Intermediate estimates of the integral are printed when n is a multiple of 1000. This gives us some idea of how the correct value is being approached by our averaging process.
program Double Integral
integer i, j; real sum, vol, x, y
integer n ← 5000, iprt ← 1000
real array (r_{ij})_{1:n×1:2}
external function f
call Random((r_{ij}))
j ← 0; sum ← 0
for i = 1 to n do
    x ← r_{i,1}; y ← r_{i,2}
    if (x − 1/2)^2 + (y − 1/2)^2 ≤ 1/4 then
        j ← j + 1
        sum ← sum + f(x, y)
        if mod(j, iprt) = 0 then
            vol ← (π/4)sum/real(j)
            output j, vol
        end if
    end if
end for
vol ← (π/4)sum/real(j)
output j, vol
end program Double Integral

real function f(x, y)
real x, y
f ← sin(√(ln(x + y + 1)))
end function f
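For readers who want to run the procedure, the pseudocode above can be mirrored in Python roughly as follows. This is a sketch, not the text's program: Python's `random` module replaces the procedure Random, the seed is arbitrary, and the `verbose` flag is an added convenience for printing the intermediate estimates.

```python
import math
import random


def f(x, y):
    """Integrand sin(sqrt(ln(x + y + 1))) from the example."""
    return math.sin(math.sqrt(math.log(x + y + 1.0)))


def double_integral(n=5000, iprt=1000, seed=0, verbose=False):
    """Estimate the integral of f over the disk of radius 1/2 centered at (1/2, 1/2)."""
    rng = random.Random(seed)  # arbitrary seed, fixed only for repeatability
    j = 0          # number of points that fall inside the disk
    total = 0.0    # running sum of f over those points
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            j += 1
            total += f(x, y)
            if verbose and j % iprt == 0:
                print(j, (math.pi / 4.0) * total / j)
    if j == 0:
        raise ValueError("no sample point landed in the disk")
    # (area of disk) * (average of f over the j accepted points)
    return j, (math.pi / 4.0) * total / j


if __name__ == "__main__":
    print(double_integral(verbose=True))
```

As in the pseudocode, the average is taken over the `j` accepted points rather than over all `n` points generated in the square.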
We obtain an approximate value of 0.57 for the integral.

Computing Volumes
The volume of a complicated region in 3-space can be computed by a Monte Carlo technique. Taking a simple case, let us determine the volume of the region whose points satisfy the inequalities
$$\begin{cases}
0 \le x \le 1, \quad 0 \le y \le 1, \quad 0 \le z \le 1 \\
x^2 + \sin y \le z \\
x - z + e^y \le 1
\end{cases}$$
The first line defines a cube whose volume is 1. The region defined by all the given inequalities is therefore a subset of this cube. If we generate n random points in the cube and determine that m of them satisfy the last two inequalities, then the volume of the desired region is approximately m/n. Here is a pseudocode that carries out this procedure:
program Volume Region
integer i, m; real array (r_{ij})_{1:n×1:3}; real vol, x, y, z
integer n ← 5000, iprt ← 1000; m ← 0
call Random((r_{ij}))
for i = 1 to n do
    x ← r_{i,1}
    y ← r_{i,2}
    z ← r_{i,3}
    if x^2 + sin y ≤ z and x − z + e^y ≤ 1 then m ← m + 1
    if mod(i, iprt) = 0 then
        vol ← real(m)/real(i)
        output i, vol
    end if
end for
end program Volume Region
Observe that intermediate estimates are printed out when we reach 1000, 2000, . . . , 5000 points. An approximate value of 0.14 is determined for the volume of the region.
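The same counting procedure can be sketched in Python. The function name `volume_region` and the seed are assumptions made for illustration, and `random.Random` stands in for the procedure Random.

```python
import math
import random


def volume_region(n=5000, seed=0):
    """Estimate the volume of the region inside the unit cube as the fraction m/n."""
    rng = random.Random(seed)  # arbitrary seed, fixed only for repeatability
    m = 0
    for _ in range(n):
        x, y, z = rng.random(), rng.random(), rng.random()
        # Count the point only if both extra inequalities hold.
        if x * x + math.sin(y) <= z and x - z + math.exp(y) <= 1.0:
            m += 1
    return m / n


if __name__ == "__main__":
    for n in (1000, 2000, 3000, 4000, 5000):
        print(n, volume_region(n))
```

For this particular region a hand calculation is available as a check: inside the unit cube the second inequality implies the first, and integrating $2 - x - e^y$ over the part of the unit square where it is positive gives $2\ln 2 - \tfrac{5}{4} \approx 0.136$, which is consistent with the value quoted above.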
Ice Cream Cone Example
Consider the problem of finding the volume above the cone $z^2 = x^2 + y^2$ and inside the sphere $x^2 + y^2 + (z - 1)^2 = 1$ as shown in Figure 13.7. The volume is contained in the box bounded by $-1 \le x \le 1$, $-1 \le y \le 1$, and $0 \le z \le 2$, which has volume 8. Thus, we want to generate random points inside this box and multiply by 8 the ratio of those inside the desired volume to the total number generated. A pseudocode for doing this follows:
program Cone
integer i, m; real vol, x, y, z; real array (r_{ij})_{1:n×1:3}
integer n ← 5000, iprt ← 1000; m ← 0
call Random((r_{ij}))
for i = 1 to n do
    x ← 2r_{i,1} − 1; y ← 2r_{i,2} − 1; z ← 2r_{i,3}
    if x^2 + y^2 ≤ z^2 and x^2 + y^2 ≤ z(2 − z) then m ← m + 1
    if mod(i, iprt) = 0 then
        vol ← 8 real(m)/real(i)
        output i, vol
    end if
end for
end program Cone
The volume of the cone is approximately 3.3.
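A Python sketch of the same computation is given below. The name `cone_volume` and the seed are illustrative assumptions. The sphere condition is written as $x^2 + y^2 \le z(2 - z)$, which is the inequality $x^2 + y^2 + (z - 1)^2 \le 1$ rearranged.

```python
import math
import random


def cone_volume(n=5000, seed=0):
    """Estimate the volume above the cone and inside the sphere by sampling the box."""
    rng = random.Random(seed)  # arbitrary seed, fixed only for repeatability
    m = 0
    for _ in range(n):
        # Map three uniform numbers in [0, 1) to the box [-1, 1) x [-1, 1) x [0, 2).
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        z = 2.0 * rng.random()
        rho2 = x * x + y * y
        if rho2 <= z * z and rho2 <= z * (2.0 - z):
            m += 1
    # (volume of box) * (fraction of points inside the region)
    return 8.0 * m / n


if __name__ == "__main__":
    print(cone_volume(), "compared with pi =", math.pi)
```

As a check, the region consists of a cone of height 1 and base radius 1 (volume $\pi/3$) capped by a hemisphere of radius 1 (volume $2\pi/3$), so its exact volume is $\pi$. Estimates from a few thousand points can be expected to scatter around that value.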
FIGURE 13.7 Ice cream cone region
Summary
(1) We discuss the approximating of integrals by the Monte Carlo method to estimate areas and volumes. We use
$$\int_0^1 f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$
$$\int_0^1\!\int_0^1\!\int_0^1 f(x,y,z)\,dx\,dy\,dz \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i,y_i,z_i)$$
where $\{x_i\}$ is a sequence of random numbers in the unit interval and $(x_i, y_i, z_i)$ is a random sequence of $n$ points in the unit cube.
(2) In general, we have
$$\int_A f \approx (\text{measure of } A) \times (\text{average of } f \text{ over } n \text{ random points in } A)$$
Problems 13.2

a1. It is proposed to calculate π by using the Monte Carlo method. A circle of radius 1 is inside a square of side 2. We count how many of m random points in the square happen to lie in the circle. Assume that the error is $1/\sqrt{m}$. How many points must be taken to obtain π with three accurate figures (i.e., 3.142)?

Computer Problems 13.2

1. Run the codes given in this section on your computer system and verify that they produce reasonable answers.
a2. Write and test a program to evaluate the integral $\int_0^1 e^x\,dx$ by the Monte Carlo method, using n = 25, 50, 100, 200, 400, 800, 16000, and 32000. Observe that 32,000 random numbers are needed and that the work in each case can be used in the next case. Print the exact answer. Plot the results using a logarithmic scale to show the rate of growth.
3. Write a program to verify numerically that $\pi = \int_0^2 (4 - x^2)^{1/2}\,dx$. Use the Monte Carlo method and 2500 random numbers.
a4. Use the Monte Carlo method to approximate the integral
$$\int_{-1}^{1}\!\int_{-1}^{1}\!\int_{-1}^{1} (x^2 + y^2 + z^2)\,dx\,dy\,dz$$
Compare with the correct answer.
a5. Write a program to estimate
$$\int_0^2\!\int_3^6\!\int_{-1}^{1} (yx^2 + z\log y + e^x)\,dx\,dy\,dz$$
6. Using the Monte Carlo technique, write a pseudocode to approximate the integral
$$\iiint_\Omega (e^x \sin y \log z)\,dx\,dy\,dz$$
where $\Omega$ is the circular cylinder that has height 3 and circular base $x^2 + y^2 \le 4$.
a7. Estimate the area under the curve $y = e^{-(x+1)^2}$ and inside the triangle that has vertices (1, 0), (0, 1), and (−1, 0) by writing and testing a short program.
8. Using the Monte Carlo approach, find the area of the irregular figure defined by
$$\begin{cases}
1 \le x \le 3, \quad -1 \le y \le 4 \\
x^3 + y^3 \le 29 \\
y \ge e^x - 2
\end{cases}$$
a9. Use the Monte Carlo method to estimate the volume of the solid whose points (x, y, z) satisfy
$$\begin{cases}
0 \le x \le y, \quad 1 \le y \le 2, \quad -1 \le z \le 3 \\
e^x \le y \\
(\sin z)\,y \ge 0
\end{cases}$$
a10. Using a Monte Carlo technique, estimate the area of the region determined by the inequalities $0 \le x \le 1$, $10 \le y \le 13$, $y \ge 12\cos x$, and $y \ge 10 + x^3$. Print intermediate answers.
11. Use the Monte Carlo method to approximate the following integrals.
a. $\displaystyle\int_{-1}^{1}\!\int_{-1}^{1}\!\int_{-1}^{1} (x^2 - y^2 - z^2)\,dx\,dy\,dz$
b. $\displaystyle\int_1^4\!\int_2^5 (x^2 - y^2 + xy - 3)\,dx\,dy$
c. $\displaystyle\int_2^3\!\int_{1+y}^{\sqrt{y}} (x^2 y + xy^2)\,dx\,dy$
d. $\displaystyle\int_0^1\!\int_{y^2}^{\sqrt{y}}\!\int_0^{y+z} xy\,dx\,dy\,dz$
12. The value of the integral
$$\int_0^{\pi/4}\!\int_0^{2\cos\varphi}\!\int_0^{2\pi} \rho^2 \sin\varphi\,d\theta\,d\rho\,d\varphi$$
using spherical coordinates is the volume above the cone $z^2 = x^2 + y^2$ and inside the sphere $x^2 + y^2 + (z - 1)^2 = 1$. Use the Monte Carlo method to approximate this integral and compare the results with that from the example in the text.
13. Let R denote the region in the xy-plane defined by the inequalities
$$\tfrac{1}{3} \le 3x \le 9 - y, \qquad \sqrt{x} \le y \le 3$$
Estimate the integral
$$\iint_R (e^x + \cos xy)\,dx\,dy$$
a14. Using a Monte Carlo technique, estimate the area of the region defined by the inequalities $4x^2 + 9y^2 \le 36$ and $y \le \arctan(x + 1)$.
15. Write a program to estimate the area of the region defined by the inequalities
$$x^2 + y^2 \le 4, \qquad |y| \le e^x$$
16. An integral can be estimated by the formula
$$\int_0^1 f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$
even if the $x_i$’s are not random numbers; in fact, some nonrandom sequences may be better. Use the sequence $x_i$ = fractional part of $i\sqrt{2}$ and test the corresponding numerical integration scheme. Test whether the estimates converge at the rate $1/n$ or $1/\sqrt{n}$ by using some simple examples, such as $\int_0^1 e^x\,dx$ and $\int_0^1 (1 + x^2)^{-1}\,dx$.
17. Consider the ellipsoid
$$\frac{x^2}{4} + \frac{y^2}{16} + \frac{z^2}{4} = 1$$
a. Write a program to generate and store 5000 random points uniformly distributed in the first octant of this ellipsoid.
ab. Write a program to estimate the volume of this ellipsoid in the first octant.
18. A Monte Carlo method for estimating $\int_a^b f(x)\,dx$ if $f(x) \ge 0$ is as follows: Let $c \ge \max_{a \le x \le b} f(x)$. Then generate n random points (x, y) in the rectangle $a \le x \le b$, $0 \le y \le c$. Count the number k of these random points (x, y) that satisfy $y \le f(x)$. Then $\int_a^b f(x)\,dx \approx kc(b - a)/n$. Verify this and test the method on $\int_1^2 x^2\,dx$, $\int_0^1 (2x^2 - x + 1)\,dx$, and $\int_0^1 (x^2 + \sin 2x)\,dx$.
19. (Continuation) Use the method of Computer Problem 13.2.18 to estimate the value of $\pi = 4\int_0^1 \sqrt{1 - x^2}\,dx$. Generate random points in $0 \le x \le 1$, $0 \le y \le 1$. Use n = 1000, 2000, . . . , 10000 and try to determine whether the error is behaving like $1/\sqrt{n}$.
20. (Continuation) Modify the method outlined in Computer Problem 13.2.19 to handle the case when f takes positive and negative values on [a, b]. Test the method on $\int_{-1}^{1} x^3\,dx$.
21. Another Monte Carlo method for evaluating $\int_a^b f(x)\,dx$ is as follows: Generate an odd number of random numbers in (a, b). Reorder these points so that $a < x_1 < x_2 < \cdots < x_n < b$. Now compute
$$f(x_1)(x_2 - a) + f(x_3)(x_4 - x_2) + f(x_5)(x_6 - x_4) + \cdots + f(x_n)(b - x_{n-1})$$
Test this method on
$$\int_0^1 (1 + x^2)^{-1}\,dx \qquad \int_0^1 (1 - x^2)^{-1/2}\,dx \qquad \int_0^1 x^{-1}\sin x\,dx$$
22. What is the expected value of the volume of a tetrahedron formed by four points chosen randomly inside the tetrahedron whose vertices are (0, 0, 0), (0, 1, 0), (0, 0, 1), and (1, 0, 0)? (The precise answer is unknown!)
23. Write a program to compute the area under the curve y = sin x and above the curve y = ln(x + 2). Use the Monte Carlo method, and print intermediate results.
24. Estimate the integral
$$\int_{3.2}^{5.9} \frac{e^{\sin x + x^2}}{\ln x}\,dx$$
by the Monte Carlo method.
25. Test the random-number generator that is available to you in the following manner: Begin by creating a list of N random numbers $r_k$, uniformly distributed in the interval [0, 1]. Create a list of random integers $n_k$ by extracting the integer part of $10 r_k$ for $1 \le k \le N$. Compute the elements in a 10 × 10 matrix $(m_{ij})$, where $m_{ij}$ is the number of times i is followed by j in the list $(n_k)$. Compare these numbers to the values predicted by elementary probability theory. If possible, display the values of $m_{ij}$ graphically.
26. (Student research project) Investigate some of the latest developments on Monte Carlo methods for multivariable integration.
13.3 Simulation
We next illustrate the idea of simulation. We consider a physical situation in which an element of chance is present and try to imitate the situation on the computer. Statistical conclusions can be drawn if the experiment is performed many times. Applications include the simulation of servers, clients, and queues as might occur in businesses such as banks or grocery stores.
Loaded Die Problem
In simulation problems, we must often produce random variables with a prescribed distribution. Suppose, for example, that we want to simulate the throw of a loaded die and that the probabilities of various outcomes have been determined as shown:

Outcome       1     2     3     4     5     6
Probability   0.2   0.14  0.22  0.16  0.17  0.11

If the random variable x is uniformly distributed in the interval (0, 1), then by breaking this interval into six subintervals of lengths given by the table, we can simulate the throw of this loaded die. For example, we agree that if x is in (0, 0.2), the die shows 1; if x is in [0.2, 0.34), the die shows 2, and so on. A pseudocode to count the outcome of 5000 throws of this die and compute the probability might be written as follows:
program Loaded Die
integer i, j; real array (yi )1:6, (mi )1:6, (ri )1:n real n ← 5000
(yi)6 ←(0.2,0.34,0.56,0.72,0.89,1.0)
(mi)6 ←(0.0,0.0,0.0,0.0,0.0,0.0)
call Random((ri ))
for i = 1 to n do
for j = 1 to 6 do ifri
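The interval-splitting idea for the loaded die can be sketched in Python as follows. This is a minimal illustration rather than the text's program: the names `CUMULATIVE`, `throw`, and `simulate` are hypothetical, the cumulative breakpoints are the partial sums of the probabilities in the table, and the standard-library `bisect` module is used here to locate the subinterval that contains x.

```python
import bisect
import random

PROBABILITIES = [0.2, 0.14, 0.22, 0.16, 0.17, 0.11]
# Right endpoints of the six subintervals of (0, 1): partial sums of the table.
CUMULATIVE = [0.2, 0.34, 0.56, 0.72, 0.89, 1.0]


def throw(rng):
    """Return the face (1..6) whose subinterval contains a uniform x in [0, 1)."""
    x = rng.random()
    # bisect_right counts the breakpoints <= x, so x in [0.2, 0.34) gives face 2.
    return bisect.bisect_right(CUMULATIVE, x) + 1


def simulate(n=5000, seed=0):
    """Throw the loaded die n times and return the observed relative frequencies."""
    rng = random.Random(seed)  # arbitrary seed, fixed only for repeatability
    counts = [0] * 6
    for _ in range(n):
        counts[throw(rng) - 1] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    for face, (freq, p) in enumerate(zip(simulate(), PROBABILITIES), start=1):
        print(face, freq, p)
```

The observed frequencies should approach the tabulated probabilities as the number of throws grows.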
6. Establish the properties claimed for the function g in Equation (10).
7. Show that for the simple problem
$$x'' = -x, \qquad x(a) = \alpha, \quad x(b) = \beta$$
the tridiagonal system to be solved can be written as
$$\begin{cases}
(2 - h^2)x_1 - x_2 = \alpha \\
-x_{i-1} + (2 - h^2)x_i - x_{i+1} = 0 \qquad (2 \le i \le n - 2) \\
-x_{n-2} + (2 - h^2)x_{n-1} = \beta
\end{cases}$$
a8. Write down the system of equations Ax = b that results from using the usual second-order central difference approximation to solve
$$x'' = (1 + t)x, \qquad x(0) = 0, \quad x(1) = 1$$
a9. Let u be a solution of the initial-value problem
$$u'' = e^t u + t^2 u', \qquad u(1) = 0, \quad u'(1) = 1$$
How do we solve the following two-point boundary-value problem by utilizing u?
$$x'' = e^t x + t^2 x', \qquad x(1) = 0, \quad x(2) = 7$$
10. How would you solve the problem
$$x' = f(t, x), \qquad Ax(a) + Bx(b) = C$$
where a, b, A, B, and C are given real numbers? (Assume that A and B are not both zero.)
a11. Use the shooting method on this two-point boundary-value problem, and explain what happens:
$$x'' = -x, \qquad x(0) = 3, \quad x(\pi) = 7$$
This problem is to be solved analytically, not by computer or calculator.
14.2 A Discretization Method 579
580 Chapter 14
Boundary-Value Problems for Ordinary Differential Equations
Computer Problems 14.2

1. Explain the main steps in setting up a program to solve this two-point boundary value problem by the finite-difference method.
$$x'' = x\sin t + x'\cos t - e^t, \qquad x(0) = 0, \quad x(1) = 1$$
Show any preliminary work that must be done before programming. Exploit the linearity of the differential equation. Program and compare the results when different values of n are used, say, n = 10, 100, and 1000.
2. Solve the following two-point boundary value problem numerically. For comparisons, the exact solutions are given.
aa. $x'' = \dfrac{(1 - t)x + 1}{(1 + t)^2}, \qquad x(0) = 1, \quad x(1) = 0.5$
ab. $x'' = \dfrac{1}{3}\left[(2 - t)e^{2x} + (1 + t)^{-1}\right], \qquad x(0) = 0, \quad x(1) = -\log 2$
3. Solve the boundary-value problem
$$x'' = -x + tx' - 2t\cos t + t, \qquad x(0) = 0, \quad x(\pi) = \pi$$
by discretization. Compare with the exact solution, which is $x(t) = t + 2\sin t$.
4. Repeat Computer Problem 14.1.2, using a discretization method.
5. Write a computer program to implement
a. program BVP1.
b. program BVP2.
6. (Continuation) Using built-in routines in mathematical software systems such as Matlab, Maple, or Mathematica, solve and plot the solution curve for the boundary-value problem associated with
a. program BVP1.
b. program BVP2.
7. Investigate the computation of numerical solutions to the following challenging test problems, which are nonlinear:
a. $x'' = e^x, \qquad x(0) = 0, \quad x(1) = 0$
b. $\varepsilon x'' + (x')^2 = 1, \qquad x(0) = 0, \quad x(1) = 1$
Vary $\varepsilon = 10^{-1}, 10^{-2}, 10^{-3}, \ldots$ Compare to the true solution $x(t) = 1 + \varepsilon \ln\cosh((t - 0.745)/\varepsilon)$, which has a corner at t = 0.745.
c. Troesch’s problem: $x'' = \mu\sinh(\mu x), \qquad x(0) = 0, \quad x(1) = 1$, using $\mu = 50$.
d. Bratu’s problem: $x'' + \lambda e^x = 0, \qquad x(0) = 0, \quad x(1) = 0$, using $\lambda = 3.55$. If we let $\lambda^* = 3.51383\ldots$, there are two solutions when $\lambda < \lambda^*$, one solution when $\lambda = \lambda^*$, and no solutions when $\lambda > \lambda^*$.
e. $\varepsilon x'' + tx' = 0, \qquad x(-1) = 0, \quad x(1) = 2$, using $\varepsilon = 10^{-8}$. Compare to the true solution $x(t) = 1 + \operatorname{erf}(t/\sqrt{2\varepsilon})/\operatorname{erf}(1/\sqrt{2\varepsilon})$.
Cash [2003] uses these and other test problems in his research. For more information on them, see www.ma.ic.ac.uk/∼jcash/
8. (Buckling of a circular ring project) A model for a circular ring with compressibility c under hydrostatic pressure p from all directions is given by the following boundary-value problem involving a system of seven differential equations:
$$\begin{aligned}
y_1' &= -1 - cy_5 + (c + 1)y_7, & y_1(0) &= \tfrac{\pi}{2}, \quad y_1\!\left(\tfrac{\pi}{2}\right) = 0 \\
y_2' &= [1 + c(y_5 - y_7)]\cos y_1, & y_2\!\left(\tfrac{\pi}{2}\right) &= 0 \\
y_3' &= [1 + c(y_5 - y_7)]\sin y_1, & y_3(0) &= 0 \\
y_4' &= 1 + c(y_5 - y_7), & y_4(0) &= 0 \\
y_5' &= y_6[-1 - cy_5 + (c + 1)y_7] \\
y_6' &= y_5 y_7 - [1 + c(y_5 - y_7)](y_5 + p), & y_6(0) &= 0, \quad y_6\!\left(\tfrac{\pi}{2}\right) = 0 \\
y_7' &= [1 + c(y_5 - y_7)]y_6
\end{aligned}$$
Various simplifications are useful in the study of the buckling or collapse of the circular ring such as by considering only a quarter-circle by symmetry (sketch (a) below). As the pressure increases, the radius of the circle decreases, and a bifurcation or a change of state can occur (sketch (b) below). The shooting method together with more advanced numerical methods can be used to solve this problem. Explore some of them. See Huddleston [2000] and Sauer [2006] for additional details.
[Sketches (a) and (b) of the ring under pressure p]
15
Partial Differential Equations
In the theory of elasticity, it is shown that the stress in a cylindrical beam under torsion can be derived from a function u(x, y) that satisfies the Poisson equation
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + 2 = 0$$
In the case of a beam whose cross section is the square defined by $|x| \le 1$, $|y| \le 1$, the function u must satisfy Poisson’s equation inside the square and must be zero at each point on the perimeter of the square. By using the methods of this chapter, we can construct a table of approximate values of u(x, y).
15.1 Parabolic Problems
Many physical phenomena can be modeled mathematically by differential equations. When the function that is being studied involves two or more independent variables, the differential equation is usually a partial differential equation. Since functions of several variables are intrinsically more complicated than those of one variable, partial differential equations can lead to some of the most challenging of numerical problems. In fact, their numerical solution is one type of scientific calculation in which the resources of the fastest and most expensive computing systems easily become taxed. We shall see later why this is so.
Some Partial Differential Equations from Applied Problems
Some important partial differential equations and the physical phenomena that they govern are listed here:
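• The wave equation in three spatial variables (x, y, z) and time t is
$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}$$
The function u represents the displacement at time t of a particle whose position at rest is (x, y, z). With appropriate boundary conditions, this equation governs vibrations of a three-dimensional elastic body.
• The heat equation is
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}$$
The function u represents the temperature at time t in a physical body at the point that has coordinates (x, y, z).
• Laplace’s equation is
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = 0$$
It governs the steady-state distribution of heat in a body or the steady-state distribution of electrical charge in a body. Laplace’s equation also governs gravitational, electric, and magnetic potentials and velocity potentials in irrotational flows of incompressible fluids. The form of Laplace’s equation given above applies to rectangular coordinates. In cylindrical and spherical coordinates, it takes these respective forms:
$$\frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \varphi^2} + \frac{\partial^2 u}{\partial z^2} = 0$$
$$\frac{1}{r}\frac{\partial^2}{\partial r^2}(ru) + \frac{1}{r^2 \sin\theta}\frac{\partial}{\partial \theta}\left(\sin\theta\,\frac{\partial u}{\partial \theta}\right) + \frac{1}{r^2 \sin^2\theta}\frac{\partial^2 u}{\partial \varphi^2} = 0$$
• The biharmonic equation is
$$\frac{\partial^4 u}{\partial x^4} + 2\frac{\partial^4 u}{\partial x^2\,\partial y^2} + \frac{\partial^4 u}{\partial y^4} = 0$$
It occurs in the study of elastic stress, and from its solution the shearing and normal stresses can be derived for an elastic body.
• The Navier-Stokes equations are
$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + \frac{\partial p}{\partial x} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$$
$$\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + \frac{\partial p}{\partial y} = \frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}$$
Here, u and v are components of the velocity vector in a fluid flow. The function p is the pressure, and the fluid is assumed to be incompressible but viscous.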
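In three dimensions, the following operators are useful in writing many standard partial differential equations:
$$\nabla = \frac{\partial}{\partial x} + \frac{\partial}{\partial y} + \frac{\partial}{\partial z}$$
$$\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2} \qquad \text{(Laplacian operator)}$$
For example, we have

Heat equation:        $\dfrac{1}{k}\dfrac{\partial u}{\partial t} = \nabla^2 u$
Diffusion equation:   $\dfrac{\partial u}{\partial t} = \nabla(d\,\nabla u) + \rho$
Wave equation:        $\dfrac{1}{\nu^2}\dfrac{\partial^2 u}{\partial t^2} = \nabla^2 u$
Laplace equation:     $\nabla^2 u = 0$
Poisson equation:     $\nabla^2 u = -4\pi\rho$
Helmholtz equation:   $\nabla^2 u = -k^2 u$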
The diffusion equation with diffusion constant d has the same structure as the heat equation because heat transfer is a diffusion process. Some authors use alternate notation such as $\Delta u = \operatorname{div}(\operatorname{grad}(u)) = \nabla^2 u$.
Additional examples from quantum mechanics, electromagnetism, hydrodynamics, elasticity, and so on could also be given, but the five partial differential equations shown already exhibit a great diversity. The Navier-Stokes equation, in particular, illustrates a very complicated problem: a pair of nonlinear, simultaneous partial differential equations.
To specify a unique solution to a partial differential equation, additional conditions must be imposed on the solution function. Typically, these conditions occur in the form of boundary values that are prescribed on all or part of the perimeter of the region in which the solution is sought. The nature of the boundary and the boundary values are usually the determining factors in setting up an appropriate numerical scheme for obtaining the approximate solution.
Matlab includes a PDE Toolbox for partial differential equations. It contains many commands for such tasks as describing the domain of an equation, generating meshes, computing numerical solutions, and plotting. Within Matlab, the command pdetool invokes a graphical user interface (GUI) that is a self-contained graphical environment for solving partial differential equations. One draws the domain and indicates the boundary, fills in menus with the problem and boundary specifications, and selects buttons to solve the problem and plot the results. Although this interface may provide a convenient working environment, there are situations in which command-line functions are needed for additional flexibility. A suite of demonstrations and help files is useful in finding one’s way. For example, this software can handle PDEs of the following types
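Parabolic PDE:    $b\,\dfrac{\partial u}{\partial t} - \nabla\cdot(c\nabla u) + au = f$
Hyperbolic PDE:   $b\,\dfrac{\partial^2 u}{\partial t^2} - \nabla\cdot(c\nabla u) + au = f$
Elliptic PDE:     $-\nabla\cdot(c\nabla u) + au = f$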
for x and y on the two-dimensional domain for the problem. On the boundaries of the domain, the following boundary conditions can be handled:
Dirichlet:             $hu = r$
Generalized Neumann:   $\vec{n}\cdot(c\nabla u) + qu = g$
Mixed:                 combination of Dirichlet/Neumann
Here, n⃗ = du/dν is the outward unit length normal derivative. While the PDE can be entered via a dialog box, both the boundary conditions and the PDE coefficients a, c, d can be entered in a variety of ways. One can construct the geometry of the domain by drawing solid objects (circle, polygon, rectangle, and ellipse) that may be overlapped, moved, and rotated.
Heat Equation Model Problem
In this section, we consider a model problem of modest scope to introduce some of the essential ideas. For technical reasons, the problem is said to be of the parabolic type. In it we have the heat equation in one spatial variable accompanied by boundary conditions appropriate to a certain physical phenomenon:
$$\begin{cases}
\dfrac{\partial^2}{\partial x^2}u(x,t) = \dfrac{\partial}{\partial t}u(x,t) \\[1ex]
u(0,t) = u(1,t) = 0 \\
u(x,0) = \sin \pi x
\end{cases} \tag{1}$$
These equations govern the temperature u(x, t) in a thin rod of length 1 when the ends are held at temperature 0, under the assumption that the initial temperature in the rod is given by the function $\sin \pi x$ (see Figure 15.1). In the xt-plane, the region in which the solution is sought is described by inequalities $0 \le x \le 1$ and $t \ge 0$. On the boundary of this region (shaded in Figure 15.2), the values of u have been prescribed.
FIGURE 15.1 Heated rod (rod along 0 ≤ x ≤ 1 with ice at each end)
FIGURE 15.2 Heat equation: xt-plane

Finite-Difference Method

A principal approach to the numerical solution of such a problem is the finite-difference method. It proceeds by replacing the derivatives in the equation by finite differences. Two formulas from Section 4.3 are useful in this context:
$$f'(x) \approx \frac{1}{h}\,[f(x+h) - f(x)]$$
$$f''(x) \approx \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)]$$
If the formulas are used in the differential Equation (1), with possibly different step lengths h and k, the result is
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t+k) - u(x,t)] \tag{2}$$
This equation is now interpreted as a means of advancing the solution step by step in the t variable. That is, if u(x, t) is known for $0 \le x \le 1$ and $0 \le t \le t_0$, then Equation (2) allows us to evaluate the solution for $t = t_0 + k$.
Equation (2) can be rewritten in the form
$$u(x,t+k) = \sigma u(x+h,t) + (1 - 2\sigma)u(x,t) + \sigma u(x-h,t) \tag{3}$$
where
$$\sigma = \frac{k}{h^2}$$
A sketch showing the location of the four points involved in this equation is given in Figure 15.3. Since the solution is known on the boundary of the region, it is possible to compute an approximate solution inside the region by systematically using Equation (3). It is, of course, an approximate solution because Equation (2) is only a finite-difference analog of Equation (1).
FIGURE 15.3 Heat equation: Explicit stencil, linking the point (x, t + k) to the points (x − h, t), (x, t), and (x + h, t)

To obtain an approximate solution on a computer, we select values for h and k and use Equation (3). An analysis of this procedure, which is outside the scope of this text, shows that for stability of the computation, the coefficient 1−2σ in Equation (3) should be nonnegative. (If this condition is not met, errors made at one step will probably be magnified at subsequent steps, ultimately spoiling the solution.) The reader is referred to Kincaid and Cheney [2002] or Forsythe and Wasow [1960] for a discussion of stability. Using this algorithm, we can continue the solution indefinitely in the t-variable by computations involving only prior values of t. This is an example of a marching problem or marching method.
Pseudocode for Explicit Method
For utmost simplicity, we select h = 0.1 and k = 0.005. Coefficient σ is now 0.5. This choice makes the coefficient 1 − 2σ equal to zero. Our pseudocode first prints u(ih, 0) for $0 \le i \le 10$ because they are known boundary values. Then it computes and prints u(ih, k) for $0 \le i \le 10$ using Equation (3) and boundary values u(0, t) = u(1, t) = 0. This procedure is continued until t reaches the value 0.1. The single subscripted arrays $(u_i)$ and $(v_i)$ are used to store the values of the approximate solution at t and t + k, respectively. Since the analytic solution of the problem is $u(x,t) = e^{-\pi^2 t}\sin \pi x$ (see Problem 15.1.3), the error can be printed out at each step.
The procedure described is an example of an explicit method. The approximate values of u(x, t + k) are calculated explicitly in terms of u(x, t). Not only is this situation atypical, but even in this problem the procedure is rather slow because considerations of stability force us to select
$$k \le \tfrac{1}{2}h^2$$
Since h must be rather small to represent the derivative accurately by the finite difference formula, the corresponding k must be extremely small. Values such as h = 0.1 and k = 0.005 are representative, as are h = 0.01 and k = 0.00005. With such small values of k, an inordinate amount of computation is necessary to make much progress in the t variable.
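The effect of the stability restriction can be observed in a small experiment. The Python sketch below is an illustration under assumptions that are not in the text: it applies Equation (3) to a single unit spike at the midpoint instead of the initial profile $\sin \pi x$, so that every grid frequency is present from the start, and the names `explicit_step` and `max_after` are hypothetical.

```python
def explicit_step(u, sigma):
    """Apply Equation (3) once; the two end values stay at zero."""
    n = len(u) - 1
    v = [0.0] * (n + 1)
    for i in range(1, n):
        v[i] = sigma * u[i + 1] + (1.0 - 2.0 * sigma) * u[i] + sigma * u[i - 1]
    return v


def max_after(sigma, steps=100, n=10):
    """Largest |u_i| after the given number of steps, starting from a unit spike."""
    u = [0.0] * (n + 1)
    u[n // 2] = 1.0  # assumed initial data: one spike at the midpoint
    for _ in range(steps):
        u = explicit_step(u, sigma)
    return max(abs(x) for x in u)


if __name__ == "__main__":
    print("sigma = 0.5:", max_after(0.5))  # coefficients nonnegative, stays bounded
    print("sigma = 0.6:", max_after(0.6))  # 1 - 2*sigma < 0, values grow rapidly
```

With σ = 0.5 each new value is an average of old values, so the largest magnitude cannot increase. With σ = 0.6 the coefficient 1 − 2σ is negative and the computed values should grow by many orders of magnitude within a hundred steps.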
program Parabolic1
integer i, j; real array (u_i)_{0:n}, (v_i)_{0:n}
integer n ← 10, m ← 20
real h ← 0.1, k ← 0.005
real u_0 ← 0, v_0 ← 0, u_n ← 0, v_n ← 0
for i = 1 to n − 1 do
    u_i ← sin(πih)
end for
output (u_i)
for j = 1 to m do
    for i = 1 to n − 1 do
        v_i ← (u_{i−1} + u_{i+1})/2
    end for
    output (v_i)
    t ← jk
    for i = 1 to n − 1 do
        u_i ← e^{−π^2 t} sin(πih) − v_i
    end for
    output (u_i)
    for i = 1 to n − 1 do
        u_i ← v_i
    end for
end for
end program Parabolic1
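A runnable counterpart of program Parabolic1 might look like the Python sketch below. It is an illustration under assumptions: the function name `parabolic1` is hypothetical, the update is written in the general form of Equation (3) rather than the special case σ = 0.5, and instead of printing every line it returns the final values together with the largest error observed against the analytic solution $e^{-\pi^2 t}\sin \pi x$.

```python
import math


def parabolic1(n=10, m=20, k=0.005):
    """Explicit method for the model heat problem on a grid with h = 1/n."""
    h = 1.0 / n
    sigma = k / (h * h)
    u = [math.sin(math.pi * i * h) for i in range(n + 1)]
    u[0] = u[n] = 0.0  # boundary values u(0, t) = u(1, t) = 0
    max_err = 0.0
    for j in range(1, m + 1):
        v = [0.0] * (n + 1)
        for i in range(1, n):
            # Equation (3); with sigma = 0.5 this reduces to (u[i-1] + u[i+1]) / 2.
            v[i] = sigma * u[i + 1] + (1.0 - 2.0 * sigma) * u[i] + sigma * u[i - 1]
        t = j * k
        for i in range(1, n):
            exact = math.exp(-math.pi ** 2 * t) * math.sin(math.pi * i * h)
            max_err = max(max_err, abs(exact - v[i]))
        u = v
    return u, max_err


if __name__ == "__main__":
    values, err = parabolic1()
    print("u at t = 0.1:", values)
    print("largest error seen:", err)
```

Calling `parabolic1(n=10, m=20, k=0.005)` corresponds to the choices h = 0.1 and k = 0.005 used in the text.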
Crank-Nicolson Method
An alternative procedure of the implicit type goes by the name of its inventors, John Crank and Phyllis Nicolson, and is based on a simple variant of Equation (2):
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \tag{4}$$
If a numerical solution at grid points x = ih, t = jk has been obtained up to a certain level in the t variable, Equation (4) governs the values of u on the next t level. Therefore, Equation (4) should be rewritten as
$$-u(x-h,t) + ru(x,t) - u(x+h,t) = su(x,t-k) \tag{5}$$
in which
$$r = 2 + s \qquad \text{and} \qquad s = \frac{h^2}{k}$$
The locations of the four points in this equation are shown in Figure 15.4.

FIGURE 15.4 Crank-Nicolson method: Implicit stencil, linking the points (x − h, t), (x, t), and (x + h, t) to the point (x, t − k)

On the t level, u is unknown, but on the (t − k) level, u is known. So we can introduce unknowns $u_i = u(ih, t)$ and known quantities $b_i = su(ih, t - k)$ and write Equation (5) in matrix form:
$$\begin{bmatrix}
r & -1 & & & & \\
-1 & r & -1 & & & \\
 & -1 & r & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & r & -1 \\
 & & & & -1 & r
\end{bmatrix}
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{n-2} \\ u_{n-1} \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-2} \\ b_{n-1} \end{bmatrix} \tag{6}$$
The simplifying assumption that u(0, t) = u(1, t) = 0 has been used here. Also, h = 1/n. The system of equations is tridiagonal and diagonally dominant because $|r| = 2 + h^2/k > 2$. Hence, it can be solved by the efficient method of Section 7.3.
An elementary argument shows that this method is stable. We shall see that if the initial values u(x, 0) lie in an interval [α, β], then values subsequently calculated by using Equation (5) will also lie in [α, β ], thereby ruling out any unstable growth. Since the solution is built up line by line in a uniform way, we need only verify that the values on the first computed line, u(x, k), lie in [α, β]. Let j be the index of the largest ui that occurs on this line t = k. Then
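$$-u_{j-1} + ru_j - u_{j+1} = b_j$$
Since $u_j$ is the largest of the u’s, $u_{j-1} \le u_j$ and $u_{j+1} \le u_j$. Thus,
$$ru_j = b_j + u_{j-1} + u_{j+1} \le b_j + 2u_j$$
Since $r = 2 + s$ and $b_j = su(jh, 0)$, the previous inequality leads at once to
$$u_j \le u(jh, 0) \le \beta$$
Since $u_j$ is the largest of the $u_i$, we have
$$u_i \le \beta \qquad \text{for all } i$$
Similarly,
$$u_i \ge \alpha \qquad \text{for all } i$$
thus establishing our assertion.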
Pseudocode for the Crank-Nicolson Method
A pseudocode to carry out the Crank-Nicolson method on the model problem is given next. In it, h = 0.1, k = h²/2, and the solution is continued until t = 0.1. The value of r is 4 and s = 2. It is easier to compute and print only the values of u at interior points on each horizontal line. At boundary points, we have u(0, t) = u(1, t) = 0. The program calls procedure Tri from Section 7.3.
program Parabolic2
integer i, j; real array (c_i)_{1:n−1}, (d_i)_{1:n−1}, (u_i)_{1:n−1}, (v_i)_{1:n−1}
integer n ← 10, m ← 20
real h ← 0.1, k ← 0.005
real r, s, t
s ← h²/k
r ← 2 + s
for i = 1 to n − 1 do
    d_i ← r
    c_i ← −1
    u_i ← sin(πih)
end for
output (u_i)
for j = 1 to m do
    for i = 1 to n − 1 do
        d_i ← r
        v_i ← s u_i
    end for
    call Tri(n − 1, (c_i), (d_i), (c_i), (v_i), (v_i))
    output (v_i)
    t ← jk
    for i = 1 to n − 1 do
        u_i ← e^{−π²t} sin(πih) − v_i
    end for
    output (u_i)
    for i = 1 to n − 1 do
        u_i ← v_i
    end for
end for
end program Parabolic2
We used the same values for h and k in the pseudocode for the two methods (explicit and Crank-Nicolson), so a fair comparison can be made of the outputs. Because the Crank-Nicolson method is stable, a much larger k could have been used.
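The implicit scheme based on Equations (5) and (6) can be sketched in Python as follows. This is a minimal illustration, not the book's program: the names `solve_tridiagonal`, `implicit_heat`, and `max_error` are hypothetical, and a small tridiagonal solver is written out here in place of procedure Tri from Section 7.3. The initial condition u(x, 0) = sin πx and the exact solution e^{−π²t} sin πx are taken from the pseudocode.

```python
import math

def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve a tridiagonal system by elimination (inputs are not modified)."""
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    for i in range(1, n):                      # forward elimination
        w = sub[i - 1] / d[i - 1]
        d[i] -= w * sup[i - 1]
        b[i] -= w * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = (b[i] - sup[i] * x[i + 1]) / d[i]
    return x

def implicit_heat(n=10, m=20, k=0.005):
    """Advance m time steps of Equation (5) with u(0,t) = u(1,t) = 0."""
    h = 1.0 / n
    s = h * h / k
    r = 2.0 + s
    off = [-1.0] * (n - 2)                     # sub- and superdiagonal of System (6)
    diag = [r] * (n - 1)
    u = [math.sin(math.pi * i * h) for i in range(1, n)]   # interior values at t = 0
    for _ in range(m):
        u = solve_tridiagonal(off, diag, off, [s * ui for ui in u])
    return u

def max_error(u, n, t):
    """Largest deviation from exp(-pi^2 t) sin(pi x) at the interior grid points."""
    h = 1.0 / n
    return max(abs(math.exp(-math.pi ** 2 * t) * math.sin(math.pi * i * h) - u[i - 1])
               for i in range(1, n))

if __name__ == "__main__":
    u = implicit_heat()
    print(max_error(u, 10, 0.1))
```

Calling `implicit_heat(n=10, m=1, k=0.1)` takes a single step with k = h, which is one way to try the remark that a much larger k could have been used.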
Alternative Version of the Crank-Nicolson Method
Another version of the Crank-Nicolson method is obtained as follows: The central differences at $(x, t - \tfrac12 k)$ in Equation (4) produce
\[ \frac{1}{h^2}\left[u\!\left(x+h, t-\tfrac12 k\right) - 2u\!\left(x, t-\tfrac12 k\right) + u\!\left(x-h, t-\tfrac12 k\right)\right] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \]
Since the u values are known only at integer multiples of k, terms such as $u(x, t - \tfrac12 k)$ are replaced by the average of u values at adjacent grid points; that is,
\[ u\!\left(x, t-\tfrac12 k\right) \approx \tfrac12\,[u(x,t) + u(x,t-k)] \]
So we have
\[ \frac{1}{2h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t) + u(x+h,t-k) - 2u(x,t-k) + u(x-h,t-k)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \]
The computational form of this equation is
\[ -u(x-h,t) + 2(1+s)\,u(x,t) - u(x+h,t) = u(x-h,t-k) + 2(s-1)\,u(x,t-k) + u(x+h,t-k) \tag{7} \]
where
\[ s = \frac{h^2}{k} \equiv \frac{1}{\sigma} \]
The six points in this equation are shown in Figure 15.5. This leads to a tridiagonal system of form (6) with r = 2(1 + s) and
\[ b_i = u((i-1)h, t-k) + 2(s-1)\,u(ih, t-k) + u((i+1)h, t-k) \]

FIGURE 15.5 Crank-Nicolson method: Alternative stencil. Points (x − h, t), (x, t), (x + h, t) on the new level and (x − h, t − k), (x, t − k), (x + h, t − k) on the previous level.

Stability
At the heart of the explicit method is Equation (3), which shows how the values of u for t + k depend on the values of u at the previous time step, t. If we introduce the values of u on the mesh by writing $u_{ij} = u(ih, jk)$, then we can assemble all the values for one t-level into a vector $v^{(j)}$ as follows:
\[ v^{(j)} = [u_{0j}, u_{1j}, u_{2j}, \ldots, u_{nj}]^T \]
Equation (3) can now be written in the form
\[ u_{i,j+1} = \sigma u_{i+1,j} + (1 - 2\sigma)\,u_{ij} + \sigma u_{i-1,j} \]
This equation shows how $v^{(j+1)}$ is obtained from $v^{(j)}$. It is simply
\[ v^{(j+1)} = A v^{(j)} \]
where A is the matrix whose elements are
\[
A = \begin{bmatrix}
1-2\sigma & \sigma & & & \\
\sigma & 1-2\sigma & \sigma & & \\
 & \sigma & 1-2\sigma & \sigma & \\
 & & \ddots & \ddots & \ddots \\
 & & & \sigma & 1-2\sigma
\end{bmatrix}
\]
Our equations tell us that
\[ v^{(j)} = A v^{(j-1)} = A^2 v^{(j-2)} = A^3 v^{(j-3)} = \cdots = A^j v^{(0)} \]
From physical considerations, the temperature in the bar should approach zero. After all, the heat is being lost through the ends of the rod, which are being kept at temperature 0. Hence, $A^j v^{(0)}$ should converge to 0 as j → ∞.
At this juncture, we need a theorem in linear algebra that asserts (for any matrix A) that $A^j v \to 0$ for all vectors v if and only if all eigenvalues of A satisfy $|\lambda_i| < 1$. The eigenvalues of the matrix A in the present analysis are known to be
\[ \lambda_i = 1 - 2\sigma(1 - \cos\theta_i) \qquad \theta_i = \frac{i\pi}{n+1} \]
In our problem, we therefore must have
\[ -1 < 1 - 2\sigma(1 - \cos\theta_i) < 1 \]
This leads to 0 < σ ≤ 1/2, because $\theta_i$ can be arbitrarily close to π. This in turn leads to the step-size condition k ≤ h²/2.

FIGURE 15.6 Heat equation: (a) Solution surface; (b) Contour plot
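The effect of the step-size condition can be observed with a short Python experiment. This is a sketch under stated assumptions: the helper names `spectral_radius`, `explicit_step`, and `max_after` are hypothetical, the initial data sin πx is taken from the model problem, and a perturbation of size 10⁻⁶ is added at one grid point (an assumption made only to excite the high-frequency modes deterministically).

```python
import math

def spectral_radius(sigma, size):
    """Largest |lambda_i| from lambda_i = 1 - 2*sigma*(1 - cos(i*pi/(size+1)))
    for a size-by-size tridiagonal matrix A."""
    return max(abs(1.0 - 2.0 * sigma * (1.0 - math.cos(i * math.pi / (size + 1))))
               for i in range(1, size + 1))

def explicit_step(u, sigma):
    """One step of u_i <- sigma*u_{i+1} + (1 - 2*sigma)*u_i + sigma*u_{i-1},
    with both end values held at 0."""
    n = len(u) - 1
    new = [0.0] * (n + 1)
    for i in range(1, n):
        new[i] = sigma * u[i + 1] + (1.0 - 2.0 * sigma) * u[i] + sigma * u[i - 1]
    return new

def max_after(sigma, n=10, steps=200):
    """Largest |u| after the given number of explicit steps."""
    h = 1.0 / n
    u = [math.sin(math.pi * i * h) for i in range(n + 1)]
    u[0] = u[n] = 0.0
    u[n // 2] += 1e-6          # small perturbation (assumption, see lead-in)
    for _ in range(steps):
        u = explicit_step(u, sigma)
    return max(abs(x) for x in u)

if __name__ == "__main__":
    for sigma in (0.4, 0.5, 0.6):
        print(sigma, spectral_radius(sigma, 9), max_after(sigma))
```

With n = 10 there are nine interior unknowns, so `spectral_radius(sigma, 9)` applies the eigenvalue formula to the 9 × 9 matrix acting on them.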
Mathematical software systems such as Matlab, Maple, or Mathematica contain routines that solve partial differential equations. For example, in Maple and Mathematica, we can invoke commands to verify the general analytical solution. (See Problem 15.1.3.) In Matlab, there is a sample program to numerically solve our model heat equation. In Figure 15.6, we solve the heat equation, generate a three-dimensional plot of its solution surface, and produce a two-dimensional contour plot, which is displayed in color to indicate the various contours.
The PDE Toolbox within Matlab produces solutions to partial differential equations using the finite-element formulation of the scalar PDE problem. (See Section 15.3 for additional discussion of the finite-element method.) This software library contains a graphical user interface with graphical tools for describing domains, generating triangular meshes on them, discretizing the PDEs on the mesh, building systems of equations, obtaining numerical approximations for their solution, and visualizing the results. In particular, Matlab has the function parabolic for solving parabolic PDEs. As is found in the Matlab documentation, one can solve the two-dimensional heat equation
\[ \frac{\partial u}{\partial t} = \nabla^2 u \]
on the square −1 ≤ x, y ≤ 1. There are Dirichlet boundary conditions u = 0 and discontinuous initial conditions u(0) = 1 in the circle x² + y² < 2/5 and u(0) = 0 otherwise. A Matlab demonstration continues with a movie of the solution curves.
Summary
(1) We consider a model problem involving the following parabolic partial differential equation
\[ \frac{\partial^2}{\partial x^2}\,u(x,t) = \frac{\partial}{\partial t}\,u(x,t) \]
Using finite differences with step size h in the x-direction and k in the t-direction, we obtain
\[ \frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t+k) - u(x,t)] \]
The computational form is
\[ u(x,t+k) = \sigma u(x+h,t) + (1 - 2\sigma)\,u(x,t) + \sigma u(x-h,t) \]
where σ = k/h². An alternative approach is the Crank-Nicolson method based on other finite differences for the right-hand side:
\[ \frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \]
Its computational form is
\[ -u(x-h,t) + r\,u(x,t) - u(x+h,t) = s\,u(x,t-k) \]
where r = 2 + s and s = h²/k. Yet another variant of the Crank-Nicolson method is based on these finite differences:
\[ \frac{1}{h^2}\left[u\!\left(x+h, t-\tfrac12 k\right) - 2u\!\left(x, t-\tfrac12 k\right) + u\!\left(x-h, t-\tfrac12 k\right)\right] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \]
Then by using
\[ u\!\left(x, t-\tfrac12 k\right) \approx \tfrac12\,[u(x,t) + u(x,t-k)] \]
the computational form is
\[ -u(x-h,t) + 2(1+s)\,u(x,t) - u(x+h,t) = u(x-h,t-k) + 2(s-1)\,u(x,t-k) + u(x+h,t-k) \]
where s = h²/k. This results in a tridiagonal system of equations to be solved.
Problems 15.1

1. A second-order linear differential equation with two variables has the form
\[ A\,\frac{\partial^2 u}{\partial x^2} + B\,\frac{\partial^2 u}{\partial x\,\partial y} + C\,\frac{\partial^2 u}{\partial y^2} + \cdots = 0 \]
Here, A, B, and C are functions of x and y, and the terms not written are of lower order. The equation is said to be elliptic, parabolic, or hyperbolic at a point (x, y), depending on whether B² − 4AC is negative, zero, or positive, respectively. Classify each of these equations in this manner:
aa. $u_{xx} + u_{yy} + u_x + \sin x\,u_y - u = x^2 + y^2$
b. $u_{xx} - u_{yy} + 2u_x + 2u_y + e^x u = x - y$
ac. $u_{xx} = u_y + u - u_x + y$
d. $u_{xy} = u - u_x - u_y$
e. $3u_{xx} + u_{xy} + u_{yy} = e^{xy}$
af. $e^x u_{xx} + \cos y\,u_{xy} - u_{yy} = 0$
g. $u_{xx} + 2u_{xy} + u_{yy} = 0$
h. $x u_{xx} + y u_{xy} + u_{yy} = 0$
a2. Derive the two-dimensional form of Laplace's equation in polar coordinates.
3. Show that the function
\[ u(x,t) = \sum_{n=1}^{N} c_n e^{-(n\pi)^2 t} \sin n\pi x \]
is a solution of the heat conduction problem $u_{xx} = u_t$ and satisfies the boundary condition
\[ u(0,t) = u(1,t) = 0 \qquad u(x,0) = \sum_{n=1}^{N} c_n \sin n\pi x \qquad \text{for all } N \ge 1 \]
a4. Refer to the model problem solved numerically in this section and show that if there is no roundoff, the approximate solution values obtained by using Equation (3) lie in the interval [0, 1]. (Assume 1 ≥ 2k/h².)
a5. Find a solution of Equation (3) that has the form $u(x,t) = a^t \sin \pi x$, where a is a constant.
a6. In using Equation (5), how must the linear System (6) be modified for $u(0,t) = c_0$ and $u(1,t) = c_n$ with $c_0 \ne 0$, $c_n \ne 0$? When using Equation (7)?
a7. Describe in detail how Equation (1) with boundary conditions u(0, t) = q(t), u(1, t) = g(t), and u(x, 0) = f(x) can be solved numerically by using System (6). Here q, g, and f are known functions.
a8. What finite difference equation should be a suitable replacement for the equation ∂²u/∂x² = ∂u/∂t + ∂u/∂x in numerical work?
a9. Consider the partial differential equation ∂u/∂x + ∂u/∂t = 0 with u = u(x, t) in the region [0, 1] × [0, ∞], subject to the boundary conditions u(0, t) = 0 and u(x, 0) specified. For fixed t, we discretize only the first term using $(u_{i+1} - u_{i-1})/(2h)$ for i = 1, 2, ..., n − 1 and $(u_n - u_{n-1})/h$, where h = 1/n. Here, $u_i = u(x_i, t)$ and $x_i = ih$ with fixed t. In this way, the original problem can be considered a first-order initial-value problem
\[ \frac{dy}{dt} + \frac{1}{2h}\,A y = 0 \]
where
\[ y = [u_1, u_2, \ldots, u_n]^T \qquad \frac{dy}{dt} = [u_1', u_2', \ldots, u_n']^T \qquad u_i' = \frac{\partial u_i}{\partial t} \]
Determine the n × n matrix A.
10. Refer to the discussion of the stability of the Crank-Nicolson procedure, and establish the inequality $u_i \ge \alpha$.
11. What happens to System (6) when k = h²?
12. (Multiple choice) In solving the heat equation uxx = ut on the domain t1 and 0 x 1, one can use the explicit method. Suppose the approximate solution on one horizontal line is a vector V j . Then the whole process turns out to be described by
Vj+1 = AVj
where A is a tridiagonal matrix, having 1−2σ on its diagonal and σ in the superdiagonal and subdiagonal positions. Here σ = k/h2, where k is the time step and h is the x-step. For stability in the numerical solution, what should we require?
a. σ = 1/2
b. All eigenvalues of A satisfy |λ| < 1.
c. k ≤ h²/2
d. h = 0.01 and k = 5 × 10⁻³
e. None of these.
13. (Continuation) The fully implicit method for solving the heat conduction problem requires at each step the solution of the equation
\[ A V_{j-1} = V_j \]
Here, A is not the same as in the preceding problem, but is similar: It has 1 + 2σ on the diagonal and −σ on the subdiagonal and superdiagonal. What do we know about the eigenvalues of this matrix, A? Hint: This question concerns eigenvalues of A, not $A^{-1}$.
a. They are all negative.
b. They are all in the open interval (0, 1).
c. They are greater than 1.
d. They are in the interval (−1, 0).
e. None of these.
Computer Problems 15.1

1. Solve the same heat conduction problem as in the text except use h = 2⁻⁴, k = 2⁻¹⁰, and u(x, 0) = x(1 − x). Carry out the solution until t = 0.0125.
2. Modify the Crank-Nicolson code in the text so that it uses the alternative scheme (7). Compare the two programs on the same problems with the same spacing.
3. Recode and test the pseudocode in this section using a computer language that supports vector operations.
4. Run the Crank-Nicolson code with different choices of h and k, in particular, letting k be much larger. Try k = h, for example.
5. Try to take advantage of any special commands or procedures in mathematical software such as in Matlab, Maple, or Mathematica to solve the numerical example (1).
6. (Continuation) Use the symbolic manipulation capabilities in Maple or Mathematica to verify the general analytical solution of (1). Hint: See Problem 15.1.3.

15.2 Hyperbolic Problems
Wave Equation Model Problem
The wave equation with one space variable
\[ \frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} \tag{1} \]
governs the vibration of a string (transverse vibration in a plane) or the vibration in a rod (longitudinal vibration). It is an example of a second-order linear differential equation of the hyperbolic type. If Equation (1) is used to model the vibrating string, then u(x, t) represents the deflection at time t of a point on the string whose coordinate is x when the string is at rest.
To pose a definite model problem, we suppose that the points on the string have coordinates x in the interval 0 ≤ x ≤ 1 (see Figure 15.7). Let us suppose that at time t = 0, the deflections satisfy equations u(x, 0) = f(x) and $u_t(x, 0) = 0$. Assume also that the ends of the string remain fixed. Then u(0, t) = u(1, t) = 0. A fully defined boundary-value problem, then, is
\[
\begin{cases}
u_{tt} - u_{xx} = 0 \\
u(x,0) = f(x) \\
u_t(x,0) = 0 \\
u(0,t) = u(1,t) = 0
\end{cases}
\tag{2}
\]

FIGURE 15.7 Vibrating string. Deflection u(x, t) plotted against x on the interval from 0 to 1.

The region in the xt-plane where a solution is sought is the semi-infinite strip defined by inequalities 0 ≤ x ≤ 1 and t ≥ 0. As in the heat conduction problem of Section 15.1, the values of the unknown function are prescribed on the boundary of the region shown (see Figure 15.8).
FIGURE 15.8 Wave equation: xt-plane. Horizontal axis x from 0 to 1; vertical axis t.
Analytic Solution
The model problem in (2) is so simple that it can be immediately solved. Indeed, the solution is
\[ u(x,t) = \tfrac12\,[f(x+t) + f(x-t)] \tag{3} \]
provided that f possesses two derivatives and has been extended to the whole real line by defining
\[ f(-x) = -f(x) \qquad \text{and} \qquad f(x+2) = f(x) \]
To verify that Equation (3) is a solution, we compute derivatives using the chain rule:
\[ u_x = \tfrac12\,[f'(x+t) + f'(x-t)] \qquad u_t = \tfrac12\,[f'(x+t) - f'(x-t)] \]
\[ u_{xx} = \tfrac12\,[f''(x+t) + f''(x-t)] \qquad u_{tt} = \tfrac12\,[f''(x+t) + f''(x-t)] \]
Obviously,
\[ u_{tt} = u_{xx} \]
Also,
\[ u(x,0) = f(x) \]
Furthermore, we have
\[ u_t(x,0) = \tfrac12\,[f'(x) - f'(x)] = 0 \]
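Formula (3), together with the odd, 2-periodic extension of f, can be evaluated directly. The following Python sketch is one possible way to do it; the names `extend_odd_periodic` and `dalembert` are hypothetical, and f is assumed to be supplied on [0, 1] with f(0) = f(1) = 0.

```python
import math

def extend_odd_periodic(f):
    """Return the extension of f from [0, 1] to the real line that satisfies
    f(-x) = -f(x) and f(x + 2) = f(x)."""
    def extended(x):
        y = math.fmod(x, 2.0)      # reduce to (-2, 2) using the period 2
        if y < 0.0:
            y += 2.0               # now y lies in [0, 2]
        if y <= 1.0:
            return f(y)
        return -f(2.0 - y)         # f(y) = f(y - 2) = -f(2 - y) for 1 < y <= 2
    return extended

def dalembert(f, x, t):
    """Evaluate u(x, t) = (1/2) [f(x + t) + f(x - t)] with the extended f."""
    g = extend_odd_periodic(f)
    return 0.5 * (g(x + t) + g(x - t))

if __name__ == "__main__":
    f = lambda x: math.sin(math.pi * x)
    # For this f the solution is sin(pi x) cos(pi t), so the two numbers should agree.
    print(dalembert(f, 0.3, 0.7), math.sin(0.3 * math.pi) * math.cos(0.7 * math.pi))
```

Only two values of the extended f are needed for each point (x, t), which is why the sketch builds the extension as a function rather than tabulating it.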
In checking endpoint conditions, we use the formulas by which f was extended:
\[ u(0,t) = \tfrac12\,[f(t) + f(-t)] = 0 \]
\[
\begin{aligned}
u(1,t) &= \tfrac12\,[f(1+t) + f(1-t)] \\
&= \tfrac12\,[f(1+t) - f(t-1)] \\
&= \tfrac12\,[f(1+t) - f(t-1+2)] = 0
\end{aligned}
\]
The extension of f from its original domain to the entire real line makes it an odd periodic function of period 2. Odd means that
\[ f(x) = -f(-x) \]
and the periodicity is expressed by
\[ f(x+2) = f(x) \]
for all x. To compute u(x, t), we need to know f at only two points on the x-axis, x + t and x − t, as in Figure 15.9.

FIGURE 15.9 Wave equation: f stencil. The point (x, t) and the points (x − t, 0), (x, 0), (x + t, 0) on the x-axis.

Numerical Solution
The model problem is used next to illustrate again the principle of numerical solution. Choosing step sizes h and k for x and t, respectively, and using the familiar approximations for derivatives, we have from Equation (1)
\[ \frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k^2}\,[u(x,t+k) - 2u(x,t) + u(x,t-k)] \]
which can be rearranged as
\[ u(x,t+k) = \rho u(x+h,t) + 2(1-\rho)\,u(x,t) + \rho u(x-h,t) - u(x,t-k) \tag{4} \]
Here,
\[ \rho = \frac{k^2}{h^2} \]
Figure 15.10 shows the point (x, t + k) and the nearby points that enter into Equation (4).

FIGURE 15.10 Wave equation: Explicit stencil. Points (x, t + k), (x − h, t), (x, t), (x + h, t), and (x, t − k).
The boundary conditions in Problem (2) can be written as
\[
\begin{cases}
u(x,0) = f(x) \\
\dfrac{1}{k}\,[u(x,k) - u(x,0)] = 0 \\
u(0,t) = u(1,t) = 0
\end{cases}
\tag{5}
\]
The problem defined by Equations (4) and (5) can be solved by beginning at the line t = 0, where u is known, and then progressing one line at a time with t = k, t = 2k, t = 3k,.... Note that because of (5), our approximate solution satisfies
\[ u(x,k) = u(x,0) = f(x) \tag{6} \]
The use of the O(k) approximation for $u_t$ leads to low accuracy in the computed solution to Problem (2). Suppose that there is a row of grid points (x, −k). Letting t = 0 in Equation (4), we have
\[ u(x,k) = \rho u(x+h,0) + 2(1-\rho)\,u(x,0) + \rho u(x-h,0) - u(x,-k) \]
Now the central difference approximation
\[ \frac{1}{2k}\,[u(x,k) - u(x,-k)] = 0 \]
for
\[ u_t(x,0) = 0 \]
can be used to eliminate the fictitious grid point (x, −k). So instead of Equation (6), we set
\[ u(x,k) = \tfrac12 \rho\,[f(x+h) + f(x-h)] + (1-\rho)\,f(x) \tag{7} \]
because u(x, 0) = f(x). Values of u(x, nk), n ≥ 2, can now be computed from Equation (4).
Pseudocode
A pseudocode to carry out this numerical process is given next. For simplicity, three one-dimensional arrays (u_i), (v_i), and (w_i) are used: (u_i) represents the solution being computed on the new t line; (v_i) and (w_i) represent solutions on the preceding two t lines.

program Hyperbolic
integer i, j; real t, x, ρ; real array (u_i)_{0:n}, (v_i)_{0:n}, (w_i)_{0:n}
integer n ← 10, m ← 20
real h ← 0.1, k ← 0.05
u_0 ← 0; v_0 ← 0; w_0 ← 0; u_n ← 0; v_n ← 0; w_n ← 0
ρ ← (k/h)²
for i = 1 to n − 1 do
    x ← ih
    w_i ← f(x)
    v_i ← (1/2)ρ[f(x − h) + f(x + h)] + (1 − ρ)f(x)
end for
for j = 2 to m do
    for i = 1 to n − 1 do
        u_i ← ρ(v_{i+1} + v_{i−1}) + 2(1 − ρ)v_i − w_i
    end for
    output j, (u_i)
    for i = 1 to n − 1 do
        w_i ← v_i
        v_i ← u_i
        t ← jk
        x ← ih
        u_i ← True_Solution(x, t) − v_i
    end for
    output j, (u_i)
end for
end program Hyperbolic

real function f(x)
real x
f ← sin(πx)
end function f

real function True_Solution(x, t)
real t, x
True_Solution ← sin(πx) cos(πt)
end function True_Solution
This pseudocode requires accompanying functions to compute values of f(x) and the true solution. We chose f(x) = sin(πx) in our example. It is assumed that the x interval is [0, 1], but when h or n is changed, the interval can be [0, b]; that is, nh = b. The numerical solution is printed on the t lines that correspond to 1k, 2k, ..., mk.
More advanced treatments show that the ratio
\[ \rho = \frac{k^2}{h^2} \]
must not exceed 1 if the solution of the finite difference equations is to converge to a solution of the differential problem as k → 0 and h → 0. Furthermore, if ρ > 1, roundoff errors that occur at one stage of the computation would probably be magnified at later stages and thereby ruin the numerical solution.
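The explicit scheme, started with Equation (7) and continued with Equation (4), might be sketched in Python as follows. The function name `solve_wave` is hypothetical, and the sketch assumes f(0) = f(1) = 0 so that the fixed-end conditions can be imposed by keeping the first and last array entries at zero.

```python
import math

def solve_wave(f, n=10, m=20, k=0.05):
    """Return the grid values u(ih, mk), i = 0..n, for u_tt = u_xx on [0, 1]
    with u(x,0) = f(x), u_t(x,0) = 0, and u(0,t) = u(1,t) = 0 (requires m >= 1)."""
    h = 1.0 / n
    rho = (k / h) ** 2
    w = [f(i * h) for i in range(n + 1)]       # values on the line t = 0
    w[0] = w[n] = 0.0                          # fixed ends (assumes f(0) = f(1) = 0)
    v = [0.0] * (n + 1)                        # values on the line t = k, Equation (7)
    for i in range(1, n):
        v[i] = 0.5 * rho * (w[i - 1] + w[i + 1]) + (1.0 - rho) * w[i]
    for _ in range(2, m + 1):                  # lines t = 2k, ..., mk, Equation (4)
        u = [0.0] * (n + 1)
        for i in range(1, n):
            u[i] = rho * (v[i + 1] + v[i - 1]) + 2.0 * (1.0 - rho) * v[i] - w[i]
        w, v = v, u
    return v

if __name__ == "__main__":
    n, m, k = 10, 20, 0.05
    u = solve_wave(lambda x: math.sin(math.pi * x), n, m, k)
    t = m * k
    err = max(abs(u[i] - math.sin(math.pi * i / n) * math.cos(math.pi * t))
              for i in range(n + 1))
    print(err)
```

Here ρ = 0.25, which satisfies the requirement ρ ≤ 1. Rerunning with k > h is a simple way to see the growth described above.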
In Matlab, the PDE Toolbox has a function for producing the solution of hyperbolic problems using the finite element formulation of the scalar PDE problem. An example found in the Matlab documentation finds the numerical solution of the two-dimensional wave propagation problem
\[ \frac{\partial^2 u}{\partial t^2} = \nabla^2 u \]
on the square −1 ≤ x, y ≤ 1 with Dirichlet boundary conditions on the left and right boundaries, u = 0 for x = ±1, and zero values of the normal derivatives on the top and bottom boundaries. Further, there are Neumann boundary conditions ∂u/∂ν = 0 for y = ±1. The initial conditions $u(0) = \arctan\left(\cos\left(\tfrac{\pi}{2}x\right)\right)$ and $du(0)/dt = 3\sin(\pi x)\exp\left(\sin\left(\tfrac{\pi}{2}y\right)\right)$ are chosen to avoid putting too much energy into the higher vibration modes.
Advection Equation
We focus on the advection equation
\[ \frac{\partial u}{\partial t} = -c\,\frac{\partial u}{\partial x} \]
Here, u = u(x,t) and c = c(x,t) in which one can consider x as space and t as time. The advection equation is a hyperbolic partial differential equation that governs the motion of a conserved scalar as it is advected by a known velocity field. For example, the advection equation applies to the transport of dissolved salt in water. Even in one space dimension and constant velocity, the system remains difficult to solve. Since the advection equation is difficult to solve numerically, interest typically centers on discontinuous shock solutions, which are notoriously hard for numerical schemes to handle.
Using the forward difference approximation in time and the central-difference approximation in space, we have
\[ \frac{1}{k}\,[u(x,t+k) - u(x,t)] = -c\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)] \]
This gives
\[ u(x,t+k) = u(x,t) - \tfrac12 \sigma\,[u(x+h,t) - u(x-h,t)] \]
where σ = (k/h)c(x, t). All numerical solutions grow in magnitude for all time steps k. For all σ > 0, this scheme is unstable by Fourier stability analysis.
Lax Method
In the central-difference scheme above, replace the first term on the right-hand side, u(x, t), by $\tfrac12[u(x-h,t) + u(x+h,t)]$. Then we obtain
\[
\begin{aligned}
u(x,t+k) &= \tfrac12\,[u(x-h,t) + u(x+h,t)] - \tfrac12 \sigma\,[u(x+h,t) - u(x-h,t)] \\
&= \tfrac12 (1+\sigma)\,u(x-h,t) + \tfrac12 (1-\sigma)\,u(x+h,t)
\end{aligned}
\]
This is the Lax method, and this simple change makes the method conditionally stable.
Upwind Method
Another way of obtaining a stable method is by using a one-sided approximation to $u_x$ in the advection equation as long as the side is taken in the upwind direction. If c > 0, the transport is to the right. This can be interpreted as the wind of speed c blowing the solution from left to right. So the upwind direction is to the left for c > 0 and to the right for c < 0. Thus, the upwind difference approximation is
\[
u_x(x,t) \approx
\begin{cases}
[u(x,t) - u(x-h,t)]/h & (c > 0) \\
[u(x+h,t) - u(x,t)]/h & (c < 0)
\end{cases}
\]
Then the upwind scheme for the advection equation is
\[
u(x,t+k) = u(x,t) - \sigma
\begin{cases}
u(x,t) - u(x-h,t) & (c > 0) \\
u(x+h,t) - u(x,t) & (c < 0)
\end{cases}
\]
Lax-Wendroff Method
The Lax-Wendroff scheme is second-order in space and time. The following is one of several possible forms of this method. We start with a Taylor series expansion over one time step:
\[ u(x,t+k) = u(x,t) + k\,u_t(x,t) + \tfrac12 k^2 u_{tt}(x,t) + O(k^3) \]
Now use the advection equation to replace time derivatives on the right-hand side by space derivatives:
\[
\begin{aligned}
u_t &= -c\,u_x \\
u_{tt} &= (-c\,u_x)_t = -c_t u_x - c\,(u_x)_t = -c_t u_x - c\,(u_t)_x = -c_t u_x + c\,(c\,u_x)_x
\end{aligned}
\]
Here, we have let c = c(x, t) and have not assumed c is a constant. Substituting for $u_t$ and $u_{tt}$ gives us
\[ u(x,t+k) = u(x,t) - c k\,u_x + \tfrac12 k^2\,[-c_t u_x + c\,(c\,u_x)_x] + O(k^3) \]
where everything on the right-hand side is evaluated at (x, t). If we approximate the space derivative with second-order differences, we will have a second-order scheme in space and time:
\[ u(x,t+k) \approx u(x,t) - c k\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)] + \tfrac12 k^2 \left\{ -c_t\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)] + c\,(c\,u_x)_x \right\} \]
The difficulty with this scheme arises when c depends on space and we must evaluate the last term in the expression above. In the case in which c is a constant, we obtain
\[ c\,(c\,u_x)_x = c^2 u_{xx} \approx \frac{c^2}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] \]
The Lax-Wendroff scheme becomes
\[ u(x,t+k) = u(x,t) - \tfrac12 \sigma\,[u(x+h,t) - u(x-h,t)] + \tfrac12 \sigma^2\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] \]
where σ = c(k/h). As does the Lax method, this method has numerical dissipation (loss of amplitude); however, it is relatively weak.
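The behavior of the four schemes for constant c > 0 can be compared with a small Python experiment. This is only a sketch: the function names are hypothetical, periodic boundary conditions on [0, 1] are an assumption made here to avoid boundary effects, and the initial profile sin 2πx is chosen for illustration. After one full period the exact solution returns to its initial shape with amplitude 1, so the printed maxima show growth (central-difference scheme) or dissipation (the other three).

```python
import math

# Each step function advances a periodic grid function by one time step.
# Python's u[i - 1] wraps around for i = 0, which supplies the left neighbor.

def step_central(u, sigma):
    n = len(u)
    return [u[i] - 0.5 * sigma * (u[(i + 1) % n] - u[i - 1]) for i in range(n)]

def step_lax(u, sigma):
    n = len(u)
    return [0.5 * (1 + sigma) * u[i - 1] + 0.5 * (1 - sigma) * u[(i + 1) % n]
            for i in range(n)]

def step_upwind(u, sigma):
    # One-sided difference taken to the left, valid for c > 0.
    return [u[i] - sigma * (u[i] - u[i - 1]) for i in range(len(u))]

def step_lax_wendroff(u, sigma):
    n = len(u)
    return [u[i] - 0.5 * sigma * (u[(i + 1) % n] - u[i - 1])
            + 0.5 * sigma ** 2 * (u[(i + 1) % n] - 2.0 * u[i] + u[i - 1])
            for i in range(n)]

def run_scheme(step, n=50, sigma=0.5, steps=100):
    """Advect sin(2 pi x) and return max |u| after the given number of steps.
    With c = 1, h = 1/n, and k = sigma*h, the defaults cover one period."""
    u = [math.sin(2.0 * math.pi * i / n) for i in range(n)]
    for _ in range(steps):
        u = step(u, sigma)
    return max(abs(x) for x in u)

if __name__ == "__main__":
    for step in (step_central, step_lax, step_upwind, step_lax_wendroff):
        print(step.__name__, run_scheme(step))
```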
Summary
(1) We consider a model problem involving the following hyperbolic partial differential equation:
\[ \frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} \]
Using finite differences, we approximate it by
\[ \frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k^2}\,[u(x,t+k) - 2u(x,t) + u(x,t-k)] \]
The computational form is
\[ u(x,t+k) = \rho u(x+h,t) + 2(1-\rho)\,u(x,t) + \rho u(x-h,t) - u(x,t-k) \]
where ρ = k²/h² < 1. At t = 0, we use
\[ u(x,k) = \tfrac12 \rho\,[f(x+h) + f(x-h)] + (1-\rho)\,f(x) \]
Problems 15.2

a1. What is the solution of the boundary-value problem
\[ u_{tt} = u_{xx} \qquad u(x,0) = x(1-x) \qquad u_t(x,0) = 0 \qquad u(0,t) = u(1,t) = 0 \]
at the point where x = 0.3 and t = 4?
a2. Show that the function u(x, t) = f(x + at) + g(x − at) satisfies the wave equation $u_{tt} = a^2 u_{xx}$.
a3. (Continuation) Using the idea in the preceding problem, solve this boundary-value problem:
\[ u_{tt} = u_{xx} \qquad u(x,0) = F(x) \qquad u_t(x,0) = G(x) \qquad u(0,t) = u(1,t) = 0 \]
4. Show that the boundary-value problem
\[ u_{tt} = u_{xx} \qquad u(x,0) = 2f(x) \qquad u_t(x,0) = 2g(x) \]
has the solution
\[ u(x,t) = f(x+t) + f(x-t) + G(x+t) - G(x-t) \]
where G is an antiderivative (i.e., indefinite integral) of g. Here, we assume that −∞ < x < ∞ and t ≥ 0.
5. (Continuation) Solve the preceding problem on a finite x interval, for example, 0 ≤ x ≤ 1, adding boundary condition u(0, t) = u(1, t) = 0. In this case, f and g are defined only for 0 ≤ x ≤ 1.
Computer Problems 15.2

a1. Given f(x) defined on [0, 1], write and test a function for calculating the extended f that obeys the equations f(−x) = −f(x) and f(x + 2) = f(x).
2. (Continuation) Write a program to compute the solution of u(x, t) at any given point (x, t) for the boundary-value problem of Equation (2).
3. Compare the accuracy of the computed solution, using first Equation (6) and then Equation (7), in the computer program in the text.
4. Use the program in the text to solve boundary-value Problem (2) with
\[ f(x) = \tfrac14 \left( \tfrac12 - \left| x - \tfrac12 \right| \right) \qquad h = \tfrac{1}{16} \qquad k = \tfrac{1}{32} \]
5. Modify the code in the text to solve boundary-value Problem (2) when $u_t(x,0) = g(x)$. Hint: Equations (5) and (7) will be slightly different (a fact that affects only the initial loop in the program).
6. (Continuation) Use the program that you wrote for the preceding computer problem to solve the following boundary-value problem:
\[
\begin{cases}
u_{tt} = u_{xx} & (0 \le x \le 1,\ t \ge 0) \\
u(x,0) = \sin \pi x \\
u_t(x,0) = \tfrac14 \sin 2\pi x \\
u(0,t) = u(1,t) = 0
\end{cases}
\]
7. Modify the code in the text to solve the following boundary-value problem:
\[
\begin{cases}
u_{tt} = u_{xx} & (-1 \le x \le 1,\ t \ge 0) \\
u(x,0) = |x| - 1 \\
u_t(x,0) = 0 \\
u(-1,t) = u(1,t) = 0
\end{cases}
\]
8. Modify the code in the text to avoid storage of the (v_i) and (u_i) arrays.
9. Simplify the code in the text for the special case in which ρ = 1. Compare the numerical solution at the same grid points for a problem in which ρ = 1 and ρ ≠ 1.
10. Use mathematical software such as in Matlab, Maple, or Mathematica to solve the wave Equation (2) and plot both the solution surface and the contour plot.
11. Use the symbolic manipulation capabilities in Maple or Mathematica to verify that Equation (3) is the general analytical solution of the wave equation.
15.3 Elliptic Problems
One of the most important partial differential equations in mathematical physics and engineering is Laplace's equation, which has the following form in two variables:
\[ \nabla^2 u \equiv \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \]
Closely related to it is Poisson's equation:
\[ \nabla^2 u = g(x, y) \]
These are examples of elliptic equations. (Refer to Problem 15.1.1 for the classification of equations.) The boundary conditions associated with elliptic equations generally differ from those for parabolic and hyperbolic equations. A model problem is considered here to illustrate the numerical procedures that are often used.
Helmholtz Equation Model Problem
Suppose that a function u = u(x, y) of two variables is the solution to a certain physical problem. This function is unknown but has some properties that, theoretically, determine it uniquely. We assume that on a given region R in the xy-plane,
\[
\begin{cases}
\nabla^2 u + f u = g \\
u(x, y) \text{ known on the boundary of } R
\end{cases}
\tag{1}
\]
Here, f = f(x, y) and g = g(x, y) are given continuous functions defined in R. The boundary values could be given by a third function
\[ u(x, y) = q(x, y) \]
on the perimeter of R. When f is a constant, this partial differential equation is called the Helmholtz equation. It arises in looking for oscillatory solutions of the wave equations.
Finite-Difference Method
As before, we find an approximate solution of such a problem by the finite-difference method. The first step is to select approximate formulas for the derivatives in our problem. In the present situation, we use the standard formula
\[ f''(x) \approx \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)] \tag{2} \]
derived in Section 4.3. If it is used on a function of two variables, we obtain the five-point formula approximation to Laplace's equation:
\[ \nabla^2 u \approx \frac{1}{h^2}\,[u(x+h,y) + u(x-h,y) + u(x,y+h) + u(x,y-h) - 4u(x,y)] \tag{3} \]
This formula involves the five points displayed in Figure 15.11. The local error inherent in the five-point formula is
\[ -\frac{h^2}{12} \left[ \frac{\partial^4 u}{\partial x^4}(\xi, y) + \frac{\partial^4 u}{\partial y^4}(x, \eta) \right] \tag{4} \]
and for this reason, Formula (3) is said to provide an approximation of order O(h2). In other words, if grids are used with smaller and smaller spacing, h → 0, then the error that is committed in replacing ∇2u by its finite-difference approximation goes to zero as rapidly as does h2. Equation (3) is called the five-point formula because it involves values of u at (x, y) and at the four nearest grid points.
FIGURE 15.11 Laplace's equation: Five-point stencil. Points (x, y), (x − h, y), (x + h, y), (x, y − h), and (x, y + h).
It should be emphasized that when the differential equation in (1) is replaced by the finite-difference analog, we have changed the problem. Even if the analogous finite-difference problem is solved with complete precision, the solution is that of a problem that only simulates the original one. This simulation of one problem by another becomes better and better as h is made to decrease to zero, but the computing cost will inevitably increase.
We should also note that other representations of the derivatives can be used. For example, the nine-point formula is
\[
\begin{aligned}
\nabla^2 u \approx \frac{1}{6h^2}\,[\,&4u(x+h,y) + 4u(x-h,y) + 4u(x,y+h) + 4u(x,y-h) \\
&+ u(x+h,y+h) + u(x-h,y+h) + u(x+h,y-h) \\
&+ u(x-h,y-h) - 20u(x,y)\,]
\end{aligned}
\tag{5}
\]
This formula is of order O(h²). In the special case that u is a harmonic function (which means it is a solution of Laplace's equation), the nine-point formula is of order O(h⁶). For additional details, see Forsythe and Wasow [1960, pp. 194–195]. Hence, it is an extremely accurate approximation in using finite-difference methods and solving the Poisson equation ∇²u = g, with g a harmonic function. For more general problems, the nine-point Formula (5) has the same order error term as the five-point Formula (3) [namely, O(h²)] and would not be an improvement over it.
If the mesh spacing is not regular (say, $h_1$, $h_2$, $h_3$, and $h_4$ are the left, bottom, right, and top spacing, respectively), then it is not difficult to show that at (x, y) the irregular five-point formula is
\[
\begin{aligned}
\nabla^2 u \approx{} & \frac{1}{\tfrac12 h_1 h_3 (h_1 + h_3)}\,[h_1 u(x+h_3, y) + h_3 u(x-h_1, y)] \\
& + \frac{1}{\tfrac12 h_2 h_4 (h_2 + h_4)}\,[h_2 u(x, y+h_4) + h_4 u(x, y-h_2)] \\
& - 2 \left( \frac{1}{h_1 h_3} + \frac{1}{h_2 h_4} \right) u(x, y)
\end{aligned}
\tag{6}
\]
which is only of order h when $h_i = \alpha_i h$ for $0 < \alpha_i < 1$. This formula is usually used near boundary points, as in Figure 15.12. If the mesh is small, however, the boundary points can be moved over slightly to avoid the use of (6). This perturbation of the region R (in most cases for small h) produces an error no greater than that introduced by using the irregular scheme (6).

FIGURE 15.12 Boundary points: Irregular mesh spacing. Spacings h₁ (left), h₂ (bottom), h₃ (right), and h₄ (top) on a grid of regular spacing h.
Returning to the model Problem (1), we cover the region R by mesh points
\[ x_i = ih \qquad y_j = jh \qquad (i, j \ge 0) \tag{7} \]
At this time, it is convenient to introduce an abbreviated notation:
\[ u_{ij} = u(x_i, y_j) \qquad f_{ij} = f(x_i, y_j) \qquad g_{ij} = g(x_i, y_j) \tag{8} \]
With it, the five-point formula takes on a simple form at the point $(x_i, y_j)$:
\[ (\nabla^2 u)_{ij} \approx \frac{1}{h^2}\,(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{ij}) \tag{9} \]
If this approximation is made in the differential Equation (1), the result is (the reader should verify it)
\[ -u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} + (4 - h^2 f_{ij})\,u_{ij} = -h^2 g_{ij} \tag{10} \]
The coefficients of this equation can be illustrated by a five-point star in which each point corresponds to the coefficient of u in the grid (see Figure 15.13).

FIGURE 15.13 Helmholtz equations: Five-point star. Coefficient 4 − h²f_ij at the center and −1 at each of the four neighboring points.

To be specific, we assume that the region R is a unit square and that the grid has spacing h = 1/4 (see Figure 15.14). We obtain a single linear equation of the form (10) for each of the nine interior grid points.

FIGURE 15.14 Uniform grid spacing. Grid lines numbered 1 to 5 in each direction.

These nine equations are as follows:
\[
\begin{cases}
-u_{21} - u_{01} - u_{12} - u_{10} + (4 - h^2 f_{11})\,u_{11} = -h^2 g_{11} \\
-u_{31} - u_{11} - u_{22} - u_{20} + (4 - h^2 f_{21})\,u_{21} = -h^2 g_{21} \\
-u_{41} - u_{21} - u_{32} - u_{30} + (4 - h^2 f_{31})\,u_{31} = -h^2 g_{31} \\
-u_{22} - u_{02} - u_{13} - u_{11} + (4 - h^2 f_{12})\,u_{12} = -h^2 g_{12} \\
-u_{32} - u_{12} - u_{23} - u_{21} + (4 - h^2 f_{22})\,u_{22} = -h^2 g_{22} \\
-u_{42} - u_{22} - u_{33} - u_{31} + (4 - h^2 f_{32})\,u_{32} = -h^2 g_{32} \\
-u_{23} - u_{03} - u_{14} - u_{12} + (4 - h^2 f_{13})\,u_{13} = -h^2 g_{13} \\
-u_{33} - u_{13} - u_{24} - u_{22} + (4 - h^2 f_{23})\,u_{23} = -h^2 g_{23} \\
-u_{43} - u_{23} - u_{34} - u_{32} + (4 - h^2 f_{33})\,u_{33} = -h^2 g_{33}
\end{cases}
\]
This system of equations could be solved through Gaussian elimination, but let us examine them more closely. There are 45 coefficients. Since u is known at the boundary points, we move these 12 terms to the right-hand side, leaving only 33 nonzero entries out of 81 in our 9 × 9 system. The standard Gaussian elimination causes a great deal of fill-in in the forward elimination phase; that is, zero entries are replaced by nonzero values. So we seek a method that retains the sparse structure of this system. To illustrate how sparse this system of equations is, we write it in matrix notation:
\[ Au = b \tag{11} \]
Suppose that we order the unknowns from left to right and bottom to top:
\[ u = [u_{11}, u_{21}, u_{31}, u_{12}, u_{22}, u_{32}, u_{13}, u_{23}, u_{33}]^T \tag{12} \]
This is called the natural ordering. Now the coefficient matrix is
\[
A = \begin{bmatrix}
4-h^2 f_{11} & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 4-h^2 f_{21} & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & -1 & 4-h^2 f_{31} & 0 & 0 & -1 & 0 & 0 & 0 \\
-1 & 0 & 0 & 4-h^2 f_{12} & -1 & 0 & -1 & 0 & 0 \\
0 & -1 & 0 & -1 & 4-h^2 f_{22} & -1 & 0 & -1 & 0 \\
0 & 0 & -1 & 0 & -1 & 4-h^2 f_{32} & 0 & 0 & -1 \\
0 & 0 & 0 & -1 & 0 & 0 & 4-h^2 f_{13} & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & -1 & 4-h^2 f_{23} & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4-h^2 f_{33}
\end{bmatrix}
\]
and the right-hand side is
\[
b = \begin{bmatrix}
-h^2 g_{11} + u_{10} + u_{01} \\
-h^2 g_{21} + u_{20} \\
-h^2 g_{31} + u_{30} + u_{41} \\
-h^2 g_{12} + u_{02} \\
-h^2 g_{22} \\
-h^2 g_{32} + u_{42} \\
-h^2 g_{13} + u_{14} + u_{03} \\
-h^2 g_{23} + u_{24} \\
-h^2 g_{33} + u_{34} + u_{43}
\end{bmatrix}
\]
Notice that if f(x, y) < 0, then A is a diagonally dominant matrix.
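The matrix A and right-hand side b in the natural ordering can be assembled mechanically from Equation (10). The Python sketch below does this for a unit square with n subintervals per side; the function name `assemble` and the boundary function `q` passed to it are illustrative assumptions, and a dense list-of-lists is used only to make the sparsity easy to count.

```python
def assemble(n, f, g, q):
    """Assemble A and b of Au = b for Equation (10) on the unit square with
    spacing h = 1/n, natural ordering, and boundary values u = q."""
    h = 1.0 / n
    m = n - 1                                   # interior points per grid line

    def index(i, j):
        return (j - 1) * m + (i - 1)            # left to right, bottom to top

    A = [[0.0] * (m * m) for _ in range(m * m)]
    b = [0.0] * (m * m)
    for j in range(1, n):
        for i in range(1, n):
            p = index(i, j)
            x, y = i * h, j * h
            A[p][p] = 4.0 - h * h * f(x, y)
            b[p] = -h * h * g(x, y)
            for ii, jj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if 1 <= ii <= n - 1 and 1 <= jj <= n - 1:
                    A[p][index(ii, jj)] = -1.0  # interior neighbor stays on the left
                else:
                    b[p] += q(ii * h, jj * h)   # known boundary value moves to b
    return A, b

if __name__ == "__main__":
    A, b = assemble(4, lambda x, y: -0.04, lambda x, y: 0.0, lambda x, y: 1.0)
    nonzeros = sum(1 for row in A for entry in row if entry != 0.0)
    print(len(A), nonzeros)
```

For n = 4 the count of nonzero entries can be compared with the figure of 33 out of 81 quoted in the text.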
Gauss-Seidel Iterative Method
Since the equations are similar in form, iterative methods are often used to solve such sparse systems. Solving for the diagonal unknown, we have from Equation (10) the Gauss-Seidel method or iteration given by
$$u_{ij}^{(k+1)} = \frac{1}{4-h^2 f_{ij}}\left[u_{i+1,j}^{(k)} + u_{i-1,j}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k+1)} - h^2 g_{ij}\right]$$
If we have approximate values of the unknowns at each grid point, this equation can be used to generate new values. We call $u^{(k)}$ the current values of the unknowns at iteration k and $u^{(k+1)}$ the values in the next iteration. Moreover, the new values are used in this equation as soon as they become available. The Gauss-Seidel method and other iterative methods are discussed in Section 8.2.
The pseudocode for this method on a rectangle is as follows:
procedure Seidel(ax, ay, nx, ny, h, itmax, (uij))
integer i, j, k, nx, ny, itmax
real ax, ay, x, y; real array (uij)0:nx,0:ny
for k = 1 to itmax do
    for j = 1 to ny − 1 do
        y ← ay + jh
        for i = 1 to nx − 1 do
            x ← ax + ih
            v ← ui+1,j + ui−1,j + ui,j+1 + ui,j−1
            uij ← (v − h²g(x, y))/(4 − h²f(x, y))
        end for
    end for
end for
end procedure Seidel
In using this procedure, one must decide on the number of iterative steps to be computed, itmax. The coordinates of the lower left-hand corner of the rectangle, (ax, ay), and the step size h are specified. The number of x grid points is nx, and the number of y grid points is ny.
Numerical Example and Pseudocode
Let us illustrate this procedure on the boundary-value problem
$$\begin{cases} \nabla^2 u - \frac{1}{25}u = 0 & \text{inside } R \text{ (unit square)} \\ u = q & \text{on the boundary of } R \end{cases} \tag{13}$$
where $q = \cosh\bigl(\tfrac{1}{5}x\bigr) + \cosh\bigl(\tfrac{1}{5}y\bigr)$. This problem has the known solution u = q. A driver pseudocode for the Gauss-Seidel procedure, starting with u = 1 and taking 20 iterations, is given next. Notice that only 81 words of storage are needed for the array in solving the 49 × 49 linear system iteratively. Here, h = 1/8.

program Elliptic
integer i, j; real h, x, y; real array (uij)0:nx,0:ny
integer nx ← 8, ny ← 8, itmax ← 20
real ax ← 0, bx ← 1, ay ← 0, by ← 1
h ← (bx − ax)/nx
for j = 0 to ny do
    y ← ay + jh
    u0j ← Bndy(ax, y)
    unx,j ← Bndy(bx, y)
end for
for i = 0 to nx do
    x ← ax + ih
    ui0 ← Bndy(x, ay)
    ui,ny ← Bndy(x, by)
end for
for j = 1 to ny − 1 do
    y ← ay + jh
    for i = 1 to nx − 1 do
        x ← ax + ih
        uij ← Ustart(x, y)
    end for
end for
output 0, Norm((uij), nx, ny)
call Seidel(ax, ay, nx, ny, h, itmax, (uij))
output itmax, Norm((uij), nx, ny)
for j = 0 to ny do
    y ← ay + jh
    for i = 0 to nx do
        x ← ax + ih
        uij ← |TrueSolution(x, y) − uij|
    end for
end for
output itmax, Norm((uij), nx, ny)
end program Elliptic

For this model problem, the accompanying functions are given next:

real function f(x, y)
real x, y
f ← −0.04
end function f

real function g(x, y)
real x, y
g ← 0
end function g

real function Bndy(x, y)
real x, y
Bndy ← TrueSolution(x, y)
end function Bndy

real function Ustart(x, y)
real x, y
Ustart ← 1
end function Ustart

real function TrueSolution(x, y)
real x, y
TrueSolution ← cosh(0.2x) + cosh(0.2y)
end function TrueSolution

real function Norm((uij), nx, ny)
real array (uij)0:nx,0:ny
t ← 0
for i = 1 to nx − 1 do
    for j = 1 to ny − 1 do
        t ← t + uij²
    end for
end for
Norm ← √t
end function Norm
After 75 iterations, the computed values at the 49 interior grid points are as follows:
2.0000 2.0003 2.0013 2.0028 2.0050 2.0078 2.0113 2.0154 2.0201
2.0003 2.0006 2.0016 2.0031 2.0053 2.0081 2.0116 2.0157 2.0204
2.0013 2.0016 2.0025 2.0041 2.0062 2.0091 2.0125 2.0166 2.0213
2.0028 2.0031 2.0041 2.0056 2.0078 2.0106 2.0141 2.0182 2.0229
2.0050 2.0053 2.0062 2.0078 2.0100 2.0128 2.0163 2.0204 2.0251
2.0078 2.0081 2.0091 2.0106 2.0128 2.0156 2.0191 2.0232 2.0279
2.0113 2.0116 2.0125 2.0141 2.0163 2.0191 2.0225 2.0266 2.0313
2.0154 2.0157 2.0166 2.0182 2.0204 2.0232 2.0266 2.0307 2.0354
2.0201 2.0204 2.0213 2.0229 2.0251 2.0279 2.0313 2.0354 2.0401
The Euclidean norm $\|u\|_2 = \Bigl(\sum_{i=1}^{n_x-1}\sum_{j=1}^{n_y-1} u_{ij}^2\Bigr)^{1/2}$ of the difference between the computed values and the known solution of the boundary-value problem (13) is approximately 0.47 × 10⁻⁴.
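The Seidel procedure and its driver translate almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the names `seidel`, `true_solution`, and `solve_model_problem` are illustrative, the array is stored as a list of lists indexed `u[i][j]`, and the default of 75 sweeps is taken from the results quoted above. No particular error value is asserted.

```python
import math


def seidel(ax, ay, nx, ny, h, itmax, u, f, g):
    """Perform itmax Gauss-Seidel sweeps in place on the interior of u."""
    for _ in range(itmax):
        for j in range(1, ny):
            y = ay + j * h
            for i in range(1, nx):
                x = ax + i * h
                v = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                # New values overwrite old ones at once, as Gauss-Seidel requires.
                u[i][j] = (v - h * h * g(x, y)) / (4.0 - h * h * f(x, y))


def true_solution(x, y):
    return math.cosh(0.2 * x) + math.cosh(0.2 * y)


def solve_model_problem(n=8, itmax=75):
    """Solve problem (13) on the unit square; return the grid and the error norm."""
    h = 1.0 / n
    u = [[1.0] * (n + 1) for _ in range(n + 1)]   # starting guess u = 1
    for k in range(n + 1):                        # boundary values from the known solution
        s = k * h
        u[0][k] = true_solution(0.0, s)
        u[n][k] = true_solution(1.0, s)
        u[k][0] = true_solution(s, 0.0)
        u[k][n] = true_solution(s, 1.0)
    seidel(0.0, 0.0, n, n, h, itmax, u, lambda x, y: -0.04, lambda x, y: 0.0)
    err2 = sum(
        (u[i][j] - true_solution(i * h, j * h)) ** 2
        for i in range(1, n)
        for j in range(1, n)
    )
    return u, math.sqrt(err2)
```

Calling `solve_model_problem(8, 20)` and `solve_model_problem(8, 75)` and comparing the two returned norms shows how the error over the interior points decreases as more sweeps are taken.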
This example is a good illustration of the fact that the numerical problem being solved is the system of linear Equations (11), which is a discrete approximation to the continuous boundary-value problem (13). When comparing the true solution of (13) with the computed solution of the system, remember the discretization error involved in making the approximation. This error is O(h²). With h as large as h = 1/8, most of the errors in the computed solution are due to the discretization error! To obtain a better agreement between the discrete and continuous problems, select a much smaller mesh size. Of course, the resulting linear system will have a coefficient matrix that is extremely large and quite sparse. Iterative methods are ideal for solving such systems that arise from partial differential equations. For additional information, see the references listed at the end of this section.
For a range of engineering and science applications, Matlab has a PDE Toolbox for the numerical solution of partial differential equations. It can accommodate two space variables and one time variable. After discretizing the equation over an unstructured mesh, it applies finite elements to solve it and offers a provision for visualizing the results. The first example is Poisson's equation
$$\nabla^2 u = -1$$
in the unit circle with u = 0 on the boundary. A comparison of the finite-element solution is made with the exact solution.
Finite-Element Methods
The finite-element method has become one of the major strategies for solving partial differential equations. It provides an alternative to the finite-difference methods discussed up to now in this chapter.
As an illustration, we develop a version of the finite-element method for Poisson's equation
$$\nabla^2 u \equiv u_{xx} + u_{yy} = r$$
where r is a constant function. The partial differential equation holds over a specified region R in a two-dimensional plane. Solving Poisson's equation is equivalent to minimizing the expression
$$J(u) = \iint_R \left[\tfrac{1}{2}\bigl(u_x^2 + u_y^2\bigr) + ru\right] dx\,dy$$
This means that if the function u minimizes the expression above, then u obeys Poisson's equation. Suppose the region is subdivided into triangles using approximations as necessary. The function u is approximated by a function φ that is a composite of plane triangular elements, each defined over a triangular piece of R. Then consider the substitute problem of minimizing
$$\sum_e J\bigl(\phi^{(e)}\bigr)$$
where each term in the summation is evaluated over its own base triangle T as described below. (By accepting this theory on faith, you should be able to grasp the general idea of the finite-element method.)
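Assume that a base triangle has vertices $(x_i, y_i)$, $(x_j, y_j)$, and $(x_k, y_k)$. The solution surface above the triangle is approximated by a plane triangular element denoted $\phi^{(e)}(x, y)$, where the superscript indicates this element. Let $z_i$, $z_j$, and $z_k$ be the distances up to the plane at the triangle corners, called nodes. Let $L_i^{(e)}$ be one at node i and zero at nodes j and k. Similarly, let $L_j^{(e)}$ be one at node j and zero at nodes i and k, and let $L_k^{(e)}$ be one at node k and zero at nodes i and j.
As is shown in Figure 15.15, the area of the base triangle, denoted $\Delta_e$, is given by
$$\Delta_e = \frac{1}{2}\operatorname{Det}\begin{bmatrix} 1 & x_i & y_i \\ 1 & x_j & y_j \\ 1 & x_k & y_k \end{bmatrix} = \frac{1}{2}\bigl(x_j y_k + x_i y_j + x_k y_i - x_j y_i - x_i y_k - x_k y_j\bigr)$$
[Figure 15.15: Base triangle in the region R, with vertices $(x_i, y_i)$, $(x_j, y_j)$, and $(x_k, y_k)$]
Consequently, we obtain
$$L_i^{(e)} = \frac{1}{2}\Delta_e^{-1}\operatorname{Det}\begin{bmatrix} 1 & x & y \\ 1 & x_j & y_j \\ 1 & x_k & y_k \end{bmatrix} = \frac{1}{2}\Delta_e^{-1}\bigl[(x_j y_k - x_k y_j) + (y_j - y_k)x + (x_k - x_j)y\bigr] \equiv \frac{1}{2}\Delta_e^{-1}\bigl[a_i^{(e)} + b_i^{(e)} x + c_i^{(e)} y\bigr]$$
We have defined the coefficients $a_i^{(e)}$, $b_i^{(e)}$, and $c_i^{(e)}$. Similarly, we find
$$L_j^{(e)} = \frac{1}{2}\Delta_e^{-1}\operatorname{Det}\begin{bmatrix} 1 & x & y \\ 1 & x_k & y_k \\ 1 & x_i & y_i \end{bmatrix} = \frac{1}{2}\Delta_e^{-1}\bigl[(x_k y_i - x_i y_k) + (y_k - y_i)x + (x_i - x_k)y\bigr] \equiv \frac{1}{2}\Delta_e^{-1}\bigl[a_j^{(e)} + b_j^{(e)} x + c_j^{(e)} y\bigr]$$
and
$$L_k^{(e)} = \frac{1}{2}\Delta_e^{-1}\operatorname{Det}\begin{bmatrix} 1 & x & y \\ 1 & x_i & y_i \\ 1 & x_j & y_j \end{bmatrix} = \frac{1}{2}\Delta_e^{-1}\bigl[(x_i y_j - x_j y_i) + (y_i - y_j)x + (x_j - x_i)y\bigr] \equiv \frac{1}{2}\Delta_e^{-1}\bigl[a_k^{(e)} + b_k^{(e)} x + c_k^{(e)} y\bigr]$$
Finally, we obtain
$$\phi^{(e)} = L_i^{(e)} z_i + L_j^{(e)} z_j + L_k^{(e)} z_k$$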
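We have
$$J\bigl(\phi^{(e)}\bigr) = \iint_T \left[\tfrac{1}{2}\Bigl(\bigl(\phi_x^{(e)}\bigr)^2 + \bigl(\phi_y^{(e)}\bigr)^2\Bigr) + r\phi^{(e)}\right] dx\,dy \equiv F(z_i, z_j, z_k)$$
To solve the minimization problem, we set the appropriate derivatives to zero, which requires derivatives of the components. Notice that
$$\phi_x^{(e)} = \frac{1}{2}\Delta_e^{-1}\bigl(b_i^{(e)} z_i + b_j^{(e)} z_j + b_k^{(e)} z_k\bigr)$$
and
$$\phi_y^{(e)} = \frac{1}{2}\Delta_e^{-1}\bigl(c_i^{(e)} z_i + c_j^{(e)} z_j + c_k^{(e)} z_k\bigr)$$
We carry out the differentiations
$$\begin{aligned}
\partial F/\partial z_i &= \iint_T \Bigl[\phi_x^{(e)}\bigl(\phi_x^{(e)}\bigr)_{z_i} + \phi_y^{(e)}\bigl(\phi_y^{(e)}\bigr)_{z_i} + r\bigl(\phi^{(e)}\bigr)_{z_i}\Bigr] dx\,dy \\
&= \iint_T \Bigl[\phi_x^{(e)}\,\tfrac{1}{2}\Delta_e^{-1} b_i^{(e)} + \phi_y^{(e)}\,\tfrac{1}{2}\Delta_e^{-1} c_i^{(e)} + r L_i^{(e)}\Bigr] dx\,dy \\
&= \tfrac{1}{4}\Delta_e^{-1}\Bigl[\bigl(b_i^{(e)2} + c_i^{(e)2}\bigr) z_i + \bigl(b_i^{(e)} b_j^{(e)} + c_i^{(e)} c_j^{(e)}\bigr) z_j + \bigl(b_i^{(e)} b_k^{(e)} + c_i^{(e)} c_k^{(e)}\bigr) z_k\Bigr] + r\,\tfrac{1}{3}\Delta_e
\end{aligned}$$
Here, the integrations are straightforward by elementary calculus. Moreover, it can be shown that
$$\iint_T L_i^{(e)}\,dx\,dy = \iint_T L_j^{(e)}\,dx\,dy = \iint_T L_k^{(e)}\,dx\,dy = \tfrac{1}{3}\Delta_e$$
where $\Delta_e$ is the area of each triangle T. Similar results are obtained for $\partial F/\partial z_j$ and $\partial F/\partial z_k$. Consequently, we set
$$\begin{bmatrix} \partial F/\partial z_i \\ \partial F/\partial z_j \\ \partial F/\partial z_k \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$
and we obtain
$$\begin{bmatrix}
b_i^{(e)2} + c_i^{(e)2} & b_i^{(e)} b_j^{(e)} + c_i^{(e)} c_j^{(e)} & b_i^{(e)} b_k^{(e)} + c_i^{(e)} c_k^{(e)} \\
b_i^{(e)} b_j^{(e)} + c_i^{(e)} c_j^{(e)} & b_j^{(e)2} + c_j^{(e)2} & b_j^{(e)} b_k^{(e)} + c_j^{(e)} c_k^{(e)} \\
b_i^{(e)} b_k^{(e)} + c_i^{(e)} c_k^{(e)} & b_j^{(e)} b_k^{(e)} + c_j^{(e)} c_k^{(e)} & b_k^{(e)2} + c_k^{(e)2}
\end{bmatrix}
\begin{bmatrix} z_i \\ z_j \\ z_k \end{bmatrix} = -\frac{4r\Delta_e^2}{3}\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$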
This matrix equation contains all the ingredients we need to assemble the partial derivatives. In a particular application, we need to do the proper assembling. For each element $\phi^{(e)}$, the active nodes i, j, and k are those that contribute nonzero values. These contributions are recorded for derivatives relative to the corresponding variables among the $z_i$, $z_j$, $z_k$, and so on.
EXAMPLE 1
Apply the finite-element method to solve Poisson's equation $u_{xx} + u_{yy} = 4$ over the unit square with the triangularizations shown in Figure 15.16 and using boundary values corresponding to the exact solution $u(x, y) = x^2 + y^2$.
[Figure 15.16: Triangularization of the unit square, showing nodes 1 to 4 and the elements (e = 1) and (e = 2)]
Solution
By symmetry, we need to consider only the bottom right-hand part of the square, which has been split into two triangles. The input ingredients are nodes 1 to 4, where the coordinates (x, y) are as follows: node 1: (1/2, 1/2), node 2: (0, 0), node 3: (1, 0), and node 4: (1, 1). The elements are two triangles with node numbers indicated: e = 1: 1, 2, 3 and e = 2: 1, 3, 4. The astute reader will notice that the z coordinates need to be determined only for node 1, since they are boundary values for nodes 2, 3, 4! However, we will ignore this fact for the moment to illustrate the assembly process in the finite-element method. Notice that the areas of the triangular elements are $\Delta_1 = \Delta_2 = \frac{1}{4}$ and r = 4. First, we compute the $a^{(e)}$, $b^{(e)}$, $c^{(e)}$ coefficients from this basic information. In the following table, each column corresponds to a node (i, j, k):

            e = 1                   e = 2
         i     j     k          i     j     k
a(e)     0    1/2    0          1     0   −1/2
b(e)     0   −1/2   1/2        −1    1/2   1/2
c(e)     1   −1/2  −1/2         0   −1/2   1/2

One can verify that the columns do produce the desired $L_i^{(e)}$, $L_j^{(e)}$, and $L_k^{(e)}$ functions. For example, the first column gives $L_i^{(1)} = \frac{1}{2}\Delta_1^{-1}[0 + 0 \cdot x + 1 \cdot y] = 2y$. At node 1, this gives the value of 1, while at nodes 2 and 3, it gives the value 0. Similarly, the other columns produce the desired results.
Next, we obtain the matrix equation for element e = 1:
$$\begin{bmatrix} 1 & -\frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} & 0 \\ -\frac{1}{2} & 0 & \frac{1}{2} \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} -\frac{1}{3} \\ -\frac{1}{3} \\ -\frac{1}{3} \end{bmatrix}$$
and the matrix equation for element e = 2:
$$\begin{bmatrix} 1 & -\frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} & 0 \\ -\frac{1}{2} & 0 & \frac{1}{2} \end{bmatrix}\begin{bmatrix} z_1 \\ z_3 \\ z_4 \end{bmatrix} = \begin{bmatrix} -\frac{1}{3} \\ -\frac{1}{3} \\ -\frac{1}{3} \end{bmatrix}$$
Then we assemble the two matrices, obtaining
$$\begin{bmatrix} 2 & -\frac{1}{2} & -1 & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} & 0 & 0 \\ -1 & 0 & 1 & 0 \\ -\frac{1}{2} & 0 & 0 & \frac{1}{2} \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix} = \begin{bmatrix} -\frac{2}{3} \\ -\frac{1}{3} \\ -\frac{2}{3} \\ -\frac{1}{3} \end{bmatrix}$$
Now that we have illustrated the process of assembling the elements, we can quickly find the solution using the fact that $z_2 = 0$, $z_3 = 1$, and $z_4 = 2$, since they are boundary values. Using these values in the last matrix equation above, we immediately find that $z_1 = \frac{2}{3}$. This is a rough approximation, since the true value is $\frac{1}{2}$. Remember that $u(x, y) = x^2 + y^2$ is the exact solution.
■
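The assembly in Example 1 can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the helper names `element_coefficients` and `assemble` are hypothetical, nodes are numbered from 0 instead of 1, and each element contributes the derivative ∂F/∂z directly, that is, its matrix is scaled by 1/(4Δe) and its right-hand side is −rΔe/3. For the two triangles here 4Δe = 1, so the numbers coincide with the matrices shown in the example. Exact rational arithmetic is used so that the result can be compared with 2/3 without rounding.

```python
from fractions import Fraction


def element_coefficients(p_i, p_j, p_k):
    """Return the area and the (a, b, c) coefficient triples of one triangle."""
    (xi, yi), (xj, yj), (xk, yk) = p_i, p_j, p_k
    area = Fraction(1, 2) * (xj * yk + xi * yj + xk * yi - xj * yi - xi * yk - xk * yj)
    a = (xj * yk - xk * yj, xk * yi - xi * yk, xi * yj - xj * yi)
    b = (yj - yk, yk - yi, yi - yj)
    c = (xk - xj, xi - xk, xj - xi)
    return area, a, b, c


def assemble(nodes, elements, r):
    """Sum the element contributions to dF/dz = 0 into K z = rhs."""
    n = len(nodes)
    K = [[Fraction(0)] * n for _ in range(n)]
    rhs = [Fraction(0)] * n
    for tri in elements:
        area, _, b, c = element_coefficients(*(nodes[t] for t in tri))
        for p in range(3):
            rhs[tri[p]] -= Fraction(r) * area / 3
            for q in range(3):
                K[tri[p]][tri[q]] += (b[p] * b[q] + c[p] * c[q]) / (4 * area)
    return K, rhs


half = Fraction(1, 2)
nodes = [(half, half), (Fraction(0), Fraction(0)), (Fraction(1), Fraction(0)), (Fraction(1), Fraction(1))]
elements = [(0, 1, 2), (0, 2, 3)]          # e = 1 and e = 2, zero-based node numbers
K, rhs = assemble(nodes, elements, 4)

# Boundary values of u = x^2 + y^2 at the three boundary nodes.
known = {1: Fraction(0), 2: Fraction(1), 3: Fraction(2)}
z1 = (rhs[0] - sum(K[0][q] * known[q] for q in known)) / K[0][0]
```

With these inputs, the first row of `K` is (2, −1/2, −1, −1/2) and `z1` equals 2/3, matching the hand computation.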
We can obtain more accurate approximations by adding more elements and writing a computer program to handle the computations. (See Computer Problem 15.3.15.) For additional details, see Scheid [1990] and Sauer [2006].
More on Finite Elements
At first, we take a very general approach to this topic, supposing that we have a linear transformation A and want to solve the equation
Au = b
for u, when b is given. This obviously includes the case when A is an m × n matrix and b is a vector of m components. But there are many complicated problems that fit this same mold. For example, A can be a linear differential operator, and we may wish to solve a two-point boundary-value problem involving it, such as
$$u''(t) + 2u(t) = t^2 \quad (0 \le t \le 1), \qquad u(0) = u(1) = 0$$
Here, A operates on functions and is defined by the equation Au = u′′ + 2u.
Another example of great importance is the model problem Equation (1). In this case, A would be the Laplacian differential operator. This problem is discussed in Chapter 17 as well. The basic strategy of the finite-element method for solving the equation Au = b is to select basic functions v1, v2, ..., vn and try to solve the equation with a linear combination of these basic functions. Since A is assumed to be a linear transformation, we obtain
$$Au = A\Bigl(\sum_{j=1}^{n} c_j v_j\Bigr) = \sum_{j=1}^{n} c_j A v_j = b$$
Now the unknowns in the problem are the coefficients $c_j$. Typically, the equation just displayed is inconsistent because b is not in the linear span of the set of functions {Av1, Av2, ..., Avn}. In this case, one must compromise and accept an approximate solution to the set of equations. Many different tactics can be used to arrive at an approximate solution to the problem. For example, a least-squares approach can be used if the linear space involved has an inner product, ⟨·, ·⟩. The coefficients $c_j$ would then be chosen so that the orthogonality condition was fulfilled; that is,
$$\sum_{j=1}^{n} c_j A v_j - b \;\perp\; \operatorname{Span}\{v_1, v_2, \ldots, v_n\}$$
This leads to the normal equations
$$\sum_{j=1}^{n} \langle A v_j, v_i \rangle c_j = \langle b, v_i \rangle \qquad (1 \le i \le n)$$
These equations for the unknown coefficients $c_j$ are also known (in this context) as the Galerkin equations. They form a system of n linear equations in n unknowns.
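As a small illustration of the Galerkin equations, consider the two-point problem u′′ + 2u = t², u(0) = u(1) = 0 mentioned earlier. The sketch below makes a choice that is not in the text: the basic functions are taken to be v_j(t) = sin(jπt), which vanish at both endpoints. With the inner product ⟨p, q⟩ = ∫₀¹ p q dt these functions are orthogonal, so the Galerkin matrix is diagonal and each coefficient can be computed separately. The helper names and the Simpson rule are likewise illustrative assumptions.

```python
import math


def simpson(func, lo, hi, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    step = (hi - lo) / m
    total = func(lo) + func(hi)
    for s in range(1, m):
        total += (4 if s % 2 else 2) * func(lo + s * step)
    return total * step / 3.0


def galerkin_coefficients(n):
    """Coefficients c_1..c_n for u'' + 2u = t^2 with basis v_j = sin(j pi t)."""
    coeffs = []
    for j in range(1, n + 1):
        w = j * math.pi
        # A v_j = (2 - w^2) sin(w t) and <v_j, v_j> = 1/2, so <A v_j, v_j> = (2 - w^2)/2.
        diagonal = (2.0 - w * w) * 0.5
        load = simpson(lambda t: t * t * math.sin(w * t), 0.0, 1.0)   # <b, v_j>
        coeffs.append(load / diagonal)
    return coeffs


def u_galerkin(t, coeffs):
    return sum(c * math.sin((j + 1) * math.pi * t) for j, c in enumerate(coeffs))


def u_exact(t):
    """Closed-form solution, used only to check the approximation."""
    s = math.sqrt(2.0)
    return (0.5 * t * t - 0.5 + 0.5 * math.cos(s * t)
            - 0.5 * math.cos(s) / math.sin(s) * math.sin(s * t))
```

Evaluating `u_galerkin(0.5, galerkin_coefficients(20))` and `u_exact(0.5)` gives values that agree to about three decimal places; adding basic functions tightens the agreement.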
We shall illustrate this process with a two-point boundary-value problem involving a second-order ordinary differential equation:
$$u''(t) + g(t)u(t) = f(t), \qquad u(0) = a, \quad u(1) = b$$
The finite-element method usually uses local functions as the basic functions in the previous discussion. This means that each basic function should be zero except on a short interval. B splines have this property and are therefore often used in the finite-element method. In the present problem, we shall want to use B splines having two continuous derivatives because the operator A will be defined by
$$Au = u'' + gu$$
Hence, cubic splines would suggest themselves. Define knots $t_i = ih$, where h is a chosen step size. (Its reciprocal should be an integer in this example.) Let $B_j^3$ be the cubic B splines corresponding to the given knots. This is an infinite list of B splines, as discussed in Chapter 9. All but a finite number are zero on the interval [0, 1]. The ones that are not identically zero on the interval [0, 1] can be relabeled as v1, v2, ..., vn. These are our test functions. Proceeding as before, we arrive at a set of n linear equations in n unknowns. The details require one to find the functions $Av_j$ by using the B spline formulas in Chapter 9. This is tedious and not very instructive.
Similar considerations can be applied to Laplace's equation on a given domain. To illustrate, we take the domain to be a square of side 2, where 0 ≤ x, y ≤ 2. On the boundary of the square, we require u(x, y) = sin(xy). Such a problem is called a Dirichlet problem. For base functions, we use functions $v_j$ that already satisfy the homogeneous part of the problem. That is, we want each $v_j$ to satisfy Laplace's equation inside the square domain. Functions that satisfy Laplace's equation are said to be harmonic. We can exploit the fact that the real and imaginary parts of an analytic function are harmonic. Thus, if we set z = x + iy and compute $z^k$, we will be able to extract harmonic functions that are polynomials. Here are a few harmonic polynomials, $v_j$ for 0 ≤ j ≤ 6:

z⁰ = 1             v0(x, y) = 1
z¹ = x + iy        v1(x, y) = x            v2(x, y) = y
z² = (x + iy)²     v3(x, y) = x² − y²      v4(x, y) = 2xy
z³ = (x + iy)³     v5(x, y) = x³ − 3xy²    v6(x, y) = 3x²y − y³
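The claim that these polynomials are harmonic can be checked numerically with the five-point formula used earlier in this section. Because the polynomials have degree at most 3, their fourth derivatives vanish and the five-point approximation to the Laplacian is exact up to roundoff. In the sketch below, the function names and the test point are illustrative assumptions.

```python
def harmonic_pair(k, x, y):
    """Real and imaginary parts of (x + iy)^k."""
    z = complex(x, y) ** k
    return z.real, z.imag


def five_point_laplacian(func, x, y, h=0.1):
    """Five-point finite-difference approximation to the Laplacian of func at (x, y)."""
    return (func(x + h, y) + func(x - h, y) + func(x, y + h) + func(x, y - h)
            - 4.0 * func(x, y)) / (h * h)


# Largest discrete Laplacian over v_0, ..., v_6 at an arbitrary test point.
worst = 0.0
for k in range(4):
    for part in (0, 1):
        value = five_point_laplacian(lambda x, y: harmonic_pair(k, x, y)[part], 0.7, 1.3)
        worst = max(worst, abs(value))
```

After the loop, `worst` is at the level of floating-point roundoff, which is consistent with each $v_j$ satisfying Laplace's equation.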
Using these seven functions, we form $u = \sum_{j=0}^{6} c_j v_j$. This satisfies Laplace's equation, and we can concentrate on making u close to the specified boundary value sin(xy) on the perimeter of the square. There are many ways to proceed, and we choose first to use a method called collocation. In this process, we select a number of points on the boundary and write down an equation at each point that says the value of $\sum_{j=0}^{6} c_j v_j$ equals the prescribed value. If the number of points equals the number of basic functions, we have the classical collocation method. Here, we took eight points, whereas there are only seven functions and seven coefficients. Hence, we ask for a least-squares solution. We took the so-called collocation points to be (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0), and (0, 1). This led to the following system of eight equations:
$$\begin{bmatrix}
1 & 0 & 2 & -4 & 0 & 0 & -8 \\
1 & 1 & 2 & -3 & 4 & -11 & -2 \\
1 & 2 & 2 & 0 & 8 & -16 & 16 \\
1 & 2 & 1 & 3 & 4 & 2 & 11 \\
1 & 2 & 0 & 4 & 0 & 8 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & -1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \\ c_4 \\ c_5 \\ c_6 \end{bmatrix} =
\begin{bmatrix} 0 \\ \sin(2) \\ \sin(4) \\ \sin(2) \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
The least-squares solution is a c-vector having components
$$c = [0.3219, -0.8585, -0.8585, 0, 1.1931, 0.2146, -0.2146]^T$$
The residual function is $\sum_{j=0}^{6} c_j v_j - b$, where b(x, y) = sin(xy). Its absolute value is 0.3219 at each of the eight collocation points. To improve the accuracy, one must employ more basic functions and more collocation points.
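The least-squares collocation computation can be reproduced in a few lines. The sketch below is one way to do it, not necessarily how the quoted numbers were obtained: it forms the normal equations VᵀVc = Vᵀd for the 8 × 7 system and solves them by Gaussian elimination with partial pivoting. The helper names `basis` and `solve_linear` are illustrative assumptions.

```python
import math


def basis(x, y):
    """Harmonic polynomials v_0..v_6 evaluated at (x, y)."""
    return [1.0, x, y, x * x - y * y, 2 * x * y,
            x ** 3 - 3 * x * y * y, 3 * x * x * y - y ** 3]


def solve_linear(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][k] * sol[k] for k in range(i + 1, n))
        sol[i] = s / aug[i][i]
    return sol


points = [(0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0), (0, 0), (0, 1)]
V = [basis(x, y) for x, y in points]            # 8 x 7 collocation matrix
d = [math.sin(x * y) for x, y in points]        # prescribed boundary values

# Normal equations V^T V c = V^T d for the least-squares solution.
VtV = [[sum(V[p][i] * V[p][j] for p in range(8)) for j in range(7)] for i in range(7)]
Vtd = [sum(V[p][i] * d[p] for p in range(8)) for i in range(7)]
c = solve_linear(VtV, Vtd)
residual = [sum(V[p][j] * c[j] for j in range(7)) - d[p] for p in range(8)]
```

The computed `c` can be compared with the coefficient vector quoted above, and the entries of `residual` all have the same magnitude, as the text states. For larger or less well-conditioned systems, an orthogonal factorization would be preferable to the normal equations.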
Another technique that is often used in the finite-element method is the replacement of a differential equation by an optimization problem. This can be illustrated by a two-point boundary-value problem such as
$$(hu')' - gu = f, \qquad u(a) = \alpha, \quad u(b) = \beta$$
Here, u is the unknown function, while h, g, and f are prescribed functions, all defined on the interval [a, b]. This problem is called a Sturm-Liouville problem. There is an accompanying functional, defined by
$$\Phi(u) = \int_a^b \bigl[(u')^2 h + u^2 g + 2uf\bigr]\,dx$$
The functional and the two-point boundary-value problem are related by several theorems. One of these states roughly that if we find the function u that minimizes the functional Φ(u) subject to the side conditions u(a) = α and u(b) = β, then we will have the solution of the boundary-value problem. It is possible to exploit the fact that Φ(u) is defined as long as u has a derivative, whereas in the differential equation, we require a function possessing two derivatives. In fact, for the functional, we require only that u be piecewise differentiable, a property that spline functions of degrees 0 and 1 possess. These ideas extend to functions of two or more variables and allow one to use spline functions of low degree in two or more variables to approximate the solution to a differential equation. These are the principal features of the finite-element method. For the mathematical theory of finite-element methods, see the books by Brenner and Scott [2002], Strang [2006], and others.
Summary
(1) We study a model problem involving the following elliptic partial differential equation
∇2u + f u = g
over a region, with the value of u given on the boundary. The first term involves the Laplace operator ∇2, which is
$$\nabla^2 u \equiv \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$$
By placing a grid over the region with uniform spacing h in both directions, the Laplacian term can be approximated by using the five-point finite differences
$$\nabla^2 u \approx \frac{1}{h^2}\bigl[u(x+h, y) + u(x-h, y) + u(x, y+h) + u(x, y-h) - 4u(x, y)\bigr]$$
At each interior grid point, we write $u_{ij} = u(x_i, y_j) = u(ih, jh)$, and we obtain the following equation for our model problem:
$$-u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} + \bigl(4 - h^2 f_{ij}\bigr)u_{ij} = -h^2 g_{ij}$$
Usually, the resulting linear system of equations is large and sparse, and iterative methods can be used to solve it. For example, the Gauss-Seidel iterative method for our linear system is
$$u_{ij}^{(k+1)} = \frac{1}{4-h^2 f_{ij}}\left[u_{i+1,j}^{(k)} + u_{i-1,j}^{(k+1)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k+1)} - h^2 g_{ij}\right]$$
The grid points can be ordered in different ways, such as the natural ordering or the red-black ordering, which affects the rate of convergence of the iterative procedures.
(2) The distinguishing feature of the finite-element method is that we solve an equation Au = b approximately by setting $u = \sum_{j=1}^{n} c_j v_j$, where v1, v2, ..., vn are chosen by the user. The unknown coefficients $c_j$ are computed so that $\sum_{j=1}^{n} c_j A v_j$ is as close as possible to b. Typically, in partial differential equations, the functions $v_j$ will be multidimensional spline functions.
Additional References
For additional study and reading, see Ames [1992], Evans [2000], Forsythe and Wasow [1960], Gockenbach [2002], Mattheij, Rienstra and Boonkkamp [2005], Ortega and Voigt [1985], Rice and Boisvert [1984], Smith [1965], Street [1973], Varga [1962, 2002], Wachspress [1966], Young [1971], and Young and Gregory [1972].
Problems 15.3
1. Establish the formula for the error in the
   a. five-point formula, Equation (3).
   b. nine-point formula, Equation (5).
2. Establish the irregular five-point Formula (6) and its error term.
3. Write the matrices that occur in Equation (11) when the unknowns are ordered according to the vector u = [u11, u31, u22, u13, u33, u21, u12, u32, u23]^T. This is known as red-black or checkerboard ordering.
4. a. Verify Equation (10).
   b. Verify that the solution of Equation (13) is as given in the text.
ᵃ5. Consider the problem of solving the partial differential equation
$$20u_{xx} - 30u_{yy} + \frac{5}{x+y}u_x + \frac{1}{y}u_y = 69$$
in a region R with u prescribed on the boundary. Derive a five-point finite difference equation of order O(h²) that corresponds to this equation at some interior point $(x_i, y_j)$.
ᵃ6. Solve this boundary-value problem to estimate $u\bigl(\tfrac{1}{2}, \tfrac{1}{2}\bigr)$ and $u\bigl(0, \tfrac{1}{2}\bigr)$:
$$\nabla^2 u = 0 \quad (x, y) \in R, \qquad u = x \quad (x, y) \in \partial R$$
The region R with boundary ∂R is shown in the figure (the arc is circular). Use h = 1/2.
Note: This problem (and many others in this text) can be stated in physical terms also. For example, in this problem, we are finding the steady-state temperature in a beam of cross section R if the surface of the beam is held at temperature u(x, y) = x.
[Figure: region R in the xy-plane; the arc is circular]
ᵃ7. Consider the boundary-value problem
$$\nabla^2 u = 9(x^2 + y^2) \quad (x, y) \in R, \qquad u = x - y \quad (x, y) \in \partial R_1$$
for the region in the unit square with h = 1/3 in the figure below. Here, ∂R is the boundary of R, $\partial R_2 = \{(x, y) \in \partial R : \tfrac{2}{3} \le x < 1,\ \tfrac{2}{3} \le y < 1\}$, and $\partial R_1 = \partial R - \partial R_2$. At the mesh points, determine the system of linear equations that yields an approximate value for u(x, y). Write the system in the form Au = b.
[Figure: region R in the unit square, with axis marks at 1/3, 2/3, and 1]
8. Determine the linear system to be solved if the nine-point Formula (5) is used as the approximation in the problem of Equation (1). Notice the pattern in the coefficient matrix with both the five-point and nine-point formulas when unknowns in each row are grouped together. (Draw dotted lines through A to form 3 × 3 submatrices.)
9. In Equation (11), show that A is diagonally dominant when f(x, y) ≤ 0.
10. What is the linear system if an alternative nine-point formula
$$\nabla^2 u \approx \frac{1}{12h^2}\bigl[16u(x+h, y) + 16u(x-h, y) + 16u(x, y+h) + 16u(x, y-h) - u(x+2h, y) - u(x-2h, y) - u(x, y+2h) - u(x, y-2h) - 60u(x, y)\bigr]$$
is used? What are the advantages and disadvantages of using it? Hint: It has accuracy O(h⁴).
11. (Multiple choice) What is Laplace's equation in three variables?
   a. u−x + uy + uz = 0   b. uxx + uyy = 0   c. uxx + uyy + yzz = 0   d. uxx + uyy = yut   e. None of these.
12. (Multiple choice) Which of these is not a harmonic function of (x, y)?
   a. x² − y²   b. 2xy   c. x³y − xy³   d. x³ − xy³   e. None of these.
13. (Multiple choice) In solving the Dirichlet problem on the unit square, where 0
then x* ∈ [b′, b]. Taking the midpoint of this interval, we obtain $\hat{x} = \frac{1}{2}(b' + b)$ as our estimate of x* and find that $|\hat{x} - x^*| \le \frac{1}{6}(b - a)$. On the other hand, if F(b′) < F(b′ + δ), then x* ∈ [a′, b′ + δ]. Again we take the midpoint, $\hat{x} = \frac{1}{2}(a' + b' + \delta)$, and find that $|\hat{x} - x^*| \le \frac{1}{6}(b - a) + \frac{1}{2}\delta$. So if we ignore the small quantity δ/2, our accuracy is $\frac{1}{6}(b - a)$ in using three evaluations of F.
By continuing the search pattern outlined, we find an estimate $\hat{x}$ of x* with only n evaluations of F and with an error not exceeding
$$\frac{1}{2}\,\frac{b - a}{\lambda_n} \tag{1}$$
where $\lambda_n$ is the (n + 1)st member of the Fibonacci sequence:
$$\lambda_1 = 1, \quad \lambda_2 = 1, \qquad \lambda_k = \lambda_{k-1} + \lambda_{k-2} \quad (k \ge 3) \tag{2}$$
For example, elements $\lambda_1$ through $\lambda_8$ are 1, 1, 2, 3, 5, 8, 13, and 21.
In the Fibonacci search algorithm, we initially determine the number of steps N for a desired accuracy ε > δ by selecting N to be the subscript of the smallest Fibonacci number greater than $\frac{1}{2}(b - a)/\varepsilon$. We define a sequence of intervals, starting with the given interval [a, b] of length ℓ = b − a, and, for k = N, N − 1, …, 3, use these formulas for updating:
$$\begin{aligned}
\Delta &= \frac{\lambda_{k-2}}{\lambda_k}(b - a) \\
a' &= a + \Delta \qquad b' = b - \Delta \\
a &= a' \ \text{ if } F(a') \ge F(b') \\
b &= b' \ \text{ if } F(a') < F(b')
\end{aligned} \tag{3}$$
At the step k = 2, we set
$$\begin{aligned}
a' &= \tfrac{1}{2}(a + b) - 2\delta \qquad b' = \tfrac{1}{2}(a + b) + 2\delta \\
a &= a' \ \text{ if } F(a') \ge F(b') \\
b &= b' \ \text{ if } F(a') < F(b')
\end{aligned}$$
and we have the final interval [a, b], from which we compute $\hat{x} = \frac{1}{2}(a + b)$. This algorithm requires only one function evaluation per step after the initial step.
[Figure 16.7: Fibonacci search algorithm: Verify using a typical situation]
To verify the algorithm, consider the situation shown in Figure 16.7. Since $\lambda_k = \lambda_{k-1} + \lambda_{k-2}$, we have
$$\ell' = \ell - \Delta = \ell - \frac{\lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-1}}{\lambda_k}\,\ell \tag{4}$$
and the length of the interval of uncertainty has been reduced by the factor $(\lambda_{k-1}/\lambda_k)$. The next step yields
$$\Delta' = \frac{\lambda_{k-3}}{\lambda_{k-1}}\,\ell' \tag{5}$$
and Δ′ is actually the distance between a′ and b′. Therefore, one of the preceding points at which the function was evaluated is at one end or the other of [a, b]; that is,
$$b' - a' = \ell - 2\Delta = \ell - 2\frac{\lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_k - 2\lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-1} - \lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-3}}{\lambda_k}\,\ell = \frac{\lambda_{k-3}}{\lambda_{k-1}}\,\ell' = \Delta'$$
by Equations (2), (4), and (5).
It is clear by Equation (4) that after N − 1 function evaluations, the next-to-last interval has length $(1/\lambda_N)$ times the length of the initial interval [a, b]. So the final interval is $(b - a)(1/\lambda_N)$ wide, and the maximum error (1) is established. The final step is similar to that outlined, and F is evaluated at a point 2δ away from the midpoint of the next-to-last interval. Finally, set $\hat{x} = \frac{1}{2}(b + a)$ from the last interval [a, b].
One disadvantage of the Fibonacci search is that the algorithm is rather complicated. Also, the desired precision must be given in advance, and the number of steps to be computed for this precision must be determined before beginning the computation. Thus, the initial evaluation points for the function F depend on N , the number of steps.
Golden Section Search Algorithm
A similar algorithm that is free of these drawbacks is described next. It has been termed the golden section search because it depends on a ratio ρ known to the early Greeks as the golden section ratio:
$$\rho = \tfrac{1}{2}\bigl(1 + \sqrt{5}\bigr) \approx 1.6180339887$$
The mathematical history of this number can be found in Roger [1998], and ρ satisfies the equation ρ² = ρ + 1, which has roots $\frac{1}{2}(1 + \sqrt{5}) \approx 1.61803\ldots$ and $\frac{1}{2}(1 - \sqrt{5}) \approx -0.61803\ldots$. In each step of this iterative algorithm, an interval [a, b] is available from the previous work. It is an interval that is known to contain the minimum point x*, and our objective is to replace it by a smaller interval that is also known to contain x*. In each step, two values of F are needed:
$$\begin{aligned}
x &= a + r(b - a) & u &= F(x) \\
y &= a + r^2(b - a) & v &= F(y)
\end{aligned}$$
where r = 1/ρ and r² + r = 1, which has roots $\frac{1}{2}(-1 + \sqrt{5}) \approx 0.61803\ldots$ and $\frac{1}{2}(-1 - \sqrt{5}) \approx -1.61803\ldots$. There are two cases to consider: Either u > v or u ≤ v. Let us take the first. Figure 16.8 depicts this situation.
[Figure 16.8: Golden section search algorithm: u > v]
Since F is assumed continuous and unimodal, the minimum of F must be in the interval [a, x]. This interval is the input interval at the beginning of the next step. Observe now that within the interval [a, x], one evaluation of F is already available, namely, at y. Also note that
$$a + r(x - a) = y \tag{6}$$
because x −a = r(b−a). In the next step, therefore, y will play the role of x, and we shall need the value of F at the point a + r2(x − a).
So what must be done in this step is to carry out the following replacements in order:
    b ← x
    x ← y
    u ← v
    y ← a + r²(b − a)
    v ← F(y)
The other case is similar. If u ≤ v, the picture might be as in Figure 16.9. In this case, the minimum point must lie in [y, b]. Within this interval, one value of F is available, namely, at x. Observe that
$$y + r^2(b - y) = x$$
(See Problem 16.1.9.) Thus, x should now be given the role of y, and the value of F is to be computed at y + r(b − y). The following ordered replacements accomplish this:
    a ← y
    y ← x
    v ← u
    x ← a + r(b − a)
    u ← F(x)
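The two sets of ordered replacements combine into a short loop. The following Python sketch is a minimal rendering, assuming F is unimodal on [a, b]; the function name, the stopping tolerance `tol`, and the choice to return the midpoint of the final interval are illustrative assumptions and are not prescribed by the text.

```python
import math


def golden_section_search(F, a, b, tol=1e-6):
    """Shrink [a, b] around the minimum of a unimodal F; return the final midpoint."""
    r = (math.sqrt(5.0) - 1.0) / 2.0      # r = 1/rho, so r*r + r = 1
    x = a + r * (b - a)
    u = F(x)
    y = a + r * r * (b - a)
    v = F(y)
    while b - a > tol:
        if u > v:
            # Minimum lies in [a, x]; the old y becomes the new x.
            b, x, u = x, y, v
            y = a + r * r * (b - a)
            v = F(y)
        else:
            # Minimum lies in [y, b]; the old x becomes the new y.
            a, y, v = y, x, u
            x = a + r * (b - a)
            u = F(x)
    return 0.5 * (a + b)


x_min = golden_section_search(lambda s: (s - 2.0) ** 2, 0.0, 5.0)
```

Each pass through the loop costs exactly one new evaluation of F and reduces the interval length by the factor r. For the sample function, `x_min` lies within the tolerance of the true minimum point 2.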
Problems 16.1.10 and 16.1.11 hint at a shortcoming of this procedure: It is quite slow. Slowness in this context refers to the large number of function evaluations that are needed to achieve reasonable precision. It can be surmised that this slowness is attributable to the extreme generality of the algorithm. No advantage has been taken of any smoothness that the function F may possess.
[Figure 16.9: Golden section search algorithm: u ≤ v]
If [a, b] is the starting interval in the search for a minimum of F, then at the beginning, with one evaluation of F, we can be sure only that the minimum point, x*, is in an interval of width b − a. In the golden section search, the corresponding lengths in successive steps are r(b − a) for two evaluations of F, r²(b − a) for three evaluations of F, and so on. After n steps, the minimum point has been pinned down to an interval of length $r^{n-1}(b - a)$. How does this compare with the Fibonacci search algorithm using n evaluations? The corresponding width of interval, at the last step of this algorithm, is $\lambda_n^{-1}(b - a)$. Now, the Fibonacci algorithm should be better, because it is designed to do as well as possible with a prescribed number of steps. So we expect the ratio $r^{n-1}/\lambda_n^{-1}$ to be greater than 1. But it approaches 1.17 as n → ∞. (See Problem 16.1.8.) Thus, one may conclude that the extra complexity of the Fibonacci algorithm, together with the disadvantage of having the algorithm itself depend on the number of evaluations permitted, militates against its use in general.
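The limiting value of this ratio is easy to examine numerically. The sketch below takes $\lambda_n$ to be the (n + 1)st member of the sequence 1, 1, 2, 3, 5, ..., as in the error bound (1); this reading, like the function names, is an assumption made for the illustration.

```python
import math


def fibonacci_member(m):
    """m-th member (m >= 1) of the sequence 1, 1, 2, 3, 5, ..."""
    p, q = 1, 1
    for _ in range(m - 1):
        p, q = q, p + q
    return p


def width_ratio(n):
    """Ratio r^(n-1) / (1/lambda_n) of golden section width to Fibonacci width."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    lam_n = fibonacci_member(n + 1)
    return r ** (n - 1) * lam_n


ratios = [width_ratio(n) for n in (3, 5, 10, 20, 40)]
limit = ((1.0 + math.sqrt(5.0)) / 2.0) ** 2 / math.sqrt(5.0)   # rho^2 / sqrt(5)
```

The entries of `ratios` settle quickly near `limit`, which is approximately 1.17, so the golden section interval is only about 17% wider than the Fibonacci interval for the same number of evaluations.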
In the golden section search algorithm, how is the correct ratio r determined? Remember that when we pass from one interval to the next in the algorithm, one of the points x or y is to be retained in the next step. Here, we present first a sketch of the first interval in which we let x = a + r(b − a) and y = b + r(a − b). It is followed by a sketch of the next interval.
[Sketches: the first interval, with points in the order a, y, x, b; the next interval, with points in the order a, z, y, x, b]
In this new interval, the same ratios should hold, so we have y = a + r(x − a). Since x − a = r(b − a), we can write y = a + r[r(b − a)]. Setting the two formulas for y equal to each other gives us
$$a + r^2(b - a) = b + r(a - b)$$
whence
$$a - b + r^2(b - a) = r(a - b)$$
Dividing by (a − b) gives
$$r^2 + r - 1 = 0$$
The roots of this quadratic equation are as given previously.
Quadratic Interpolation Algorithm
Suppose that F is represented by a Taylor series in the vicinity of the point x*. Then
$$F(x) = F(x^*) + (x - x^*)F'(x^*) + \tfrac{1}{2}(x - x^*)^2 F''(x^*) + \cdots$$
Since x* is a minimum point of F, we have F′(x*) = 0. Thus,
$$F(x) \approx F(x^*) + \tfrac{1}{2}(x - x^*)^2 F''(x^*)$$
This tells us that, in the neighborhood of x*, F(x) is approximated by a quadratic function whose minimum is also at x*. Since we do not know x* and do not want to involve derivatives in our algorithms, a natural stratagem is to interpolate F by a quadratic polynomial. Any three values $(x_i, F(x_i))$, i = 1, 2, 3, can be used for this purpose. The minimum point of the resulting quadratic function may be a better approximation to x* than is x1, x2, or x3. Writing an algorithm that carries out this idea iteratively is not trivial, and many unpleasant cases must be handled. What should be done if the quadratic interpolant has a maximum instead of a minimum, for example? There is also the possibility that F′′(x*) = 0, in which case higher-order terms of the Taylor series determine the nature of F near x*.
Here is the outline of an algorithm for this procedure. At the beginning, we have a function F whose minimum is sought. Two starting points x and y are given, as well as two control numbers δ and ε. Computing begins by evaluating the two numbers
$$u = F(x) \qquad v = F(y)$$
Now let
$$z = \begin{cases} 2x - y & \text{if } u < v \\ 2y - x & \text{if } u \geq v \end{cases}$$
In either case, the number
$$w = F(z)$$
is to be computed.
At this stage, we have three points x, y, and z together with corresponding function values u, v, and w. In the main iteration step of the algorithm, one of these points and its accompanying function value are replaced by a new point and new function value. The process is repeated until a success or failure is reached.
In the main calculation, a quadratic polynomial q is determined to interpolate F at the three current points x, y, and z. The formulas are discussed below. Next, the point t where q′(t) = 0 is determined. Under ideal circumstances, t is a minimum point of q and an approximate minimum point of F. So one of the x, y, or z should be replaced by t. We are interested in examining q′′(t) to determine the shape of the curve q near t.
For the complete description of this algorithm, the formulas for t and q′′(t) must be given. They are obtained as follows:
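$$\begin{cases}
a = \dfrac{v - u}{y - x} \\[2mm]
b = \dfrac{w - v}{z - y} \\[2mm]
c = \dfrac{b - a}{z - x} \\[2mm]
t = \dfrac{1}{2}\left[x + y - \dfrac{a}{c}\right] \\[2mm]
q''(t) = 2c
\end{cases}$$
Their derivation is outlined in Problem 16.1.12.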
The solution case occurs if
q′′(t) > 0 and max{|t − x|,|t − y|,|t − z|} < ε
The condition q′′(t) > 0 indicates, of course, that q′ is increasing in the vicinity of t, so t is indeed a minimum point of q. The second condition indicates that this estimate, t, of the minimum point of F is within distance ε of each of the three points x, y, and z. In this case, t is accepted as a solution.
The usual case occurs if
$$q''(t) > 0 \quad \text{and} \quad \delta \geq \max\{|t - x|, |t - y|, |t - z|\} \geq \varepsilon$$
These inequalities indicate that t is a minimum point of q but not near enough to the three initial points to be accepted as a solution. Also, t is not farther than δ units from each of x, y, and z and can thus be accepted as a reasonable new point. The old point that has the greatest function value is now replaced by t and its function value by F(t).
The first bad case occurs if
$$q''(t) > 0 \quad \text{and} \quad \max\{|t - x|, |t - y|, |t - z|\} > \delta$$
Here, t is a minimum point of q but is so remote that there is some danger in using it as a new point. We identify one of the original three points that is farthest from t, for example, x, and also we identify the point closest to t, say z. Then we replace x by z + δ sign(t − z) and u by F(x). Figure 16.10 shows this case. The curve is the graph of q.
FIGURE 16.10
Taylor series algorithm: First bad case
[Figure: graph of q with the points x, y, z, t, and z + δ sign(t − z) marked.]
The second bad case occurs if
$$q''(t) < 0$$
thus indicating that t is a maximum point of q. In this case, identify the greatest and the least among u, v, and w. Suppose, for example, that u ≥ v ≥ w. Then replace x by z + δ sign(z − x). An example is shown in Figure 16.11.
FIGURE 16.11
Taylor series algorithm: Second bad case
[Figure: graph of q with the values u, v and the points x, y, z, t, and z + δ sign(z − x) marked.]
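The four cases above can be collected into a single iteration step. The following is a minimal sketch, not a complete implementation: the function name `interpolation_step`, the status strings, and the choice to re-evaluate F at the replacement point in the second bad case are assumptions made for illustration, and the degenerate case c = 0 is simply rejected.

```python
import math

def sign(s):
    """Return +1.0 or -1.0 according to the sign of s."""
    return math.copysign(1.0, s)

def interpolation_step(F, pts, vals, delta, eps):
    """One main step. pts = [x, y, z], vals = [u, v, w].

    Returns (status, pts, vals, t), where status names the case that occurred.
    """
    x, y, z = pts
    u, v, w = vals
    a = (v - u) / (y - x)
    b = (w - v) / (z - y)
    c = (b - a) / (z - x)
    if c == 0.0:
        raise ValueError("q is not a proper quadratic (c = 0)")
    t = 0.5 * (x + y - a / c)        # stationary point of q
    q2 = 2.0 * c                     # q''(t)
    dist = [abs(t - p) for p in pts]
    pts, vals = list(pts), list(vals)

    if q2 > 0 and max(dist) < eps:
        return "solution", pts, vals, t
    if q2 > 0 and max(dist) <= delta:
        # Usual case: the point with the greatest function value gives way to t.
        i = vals.index(max(vals))
        pts[i], vals[i] = t, F(t)
        return "usual", pts, vals, t
    if q2 > 0:
        # First bad case: t is a minimum of q but too remote.
        far = dist.index(max(dist))
        near = dist.index(min(dist))
        pts[far] = pts[near] + delta * sign(t - pts[near])
        vals[far] = F(pts[far])
        return "first bad", pts, vals, t
    # Second bad case: t is a maximum of q.
    hi = vals.index(max(vals))
    lo = vals.index(min(vals))
    pts[hi] = pts[lo] + delta * sign(pts[lo] - pts[hi])
    vals[hi] = F(pts[hi])            # assumption: refresh the function value
    return "second bad", pts, vals, t
```

A driver would call this step repeatedly until the status is "solution" or an iteration limit is reached; guarding against coincident points is left out of the sketch.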
Summary
We consider the problem of finding the local minimum of a unimodal function of one variable. Algorithms discussed are Fibonacci search, golden section search, and quadratic interpolation.
Problems 16.1
a1. For the function $F(x_1, x_2, x_3) = x_1^2 + 3x_2^2 + 2x_3^2 - 4x_1 - 6x_2 + 8x_3$, find the unconstrained minimum point. Then find the constrained minimum over the set K defined by inequalities $x_1 \leq 0$, $x_2 \leq 0$, and $x_3 \leq 0$. Next, solve the same problem when K is defined by $x_1 \leq 2$, $x_2 \leq 0$, and $x_3 \leq -2$.
a2. For the function $F(x, y) = 13x^2 + 13y^2 - 10xy - 18x - 18y$, find the unconstrained minimum. Hint: Try substituting x = u + v and y = u − v.
3. If F is unimodal and continuous on the interval [a, b], how many local maxima may F have on [a, b]?
a4. For the Fibonacci search algorithm, write expressions for x in the two cases n = 2, 3.
5. Carry out four steps of the Fibonacci search algorithm using ε = 1 to determine the following:
aa. Minimum of $F(x) = x^2 - 6x + 2$ on [0, 10]
b. Minimum of $F(x) = 2x^3 - 9x^2 + 12x + 2$ on [0, 3]
c. Maximum of $F(x) = 2x^3 - 9x^2 + 12x$ on [0, 2]
6. Let F be a continuous unimodal function defined on the interval [a, b]. Suppose that the values of F are known at n points, namely, a = t1
1(a+a′),andb′ =aifF(a′)
⎧⎪ ⎪ ⎪ ⎪ b ← b ′
⎪b′←a′
⎪ ⎪ ⎪⎨ v ← u
λ
⎪← k−2 (b−a) ⎪ λk
⎪a′←a+ ⎩u←F(a′)
λ
= N−2 (b − a)
λN a′ =a+
b′ =b− u = F(a′) v = F(b′)
Then loop on k from N − 1 downward to 3, updating as follows:
5. (Berman algorithm) Suppose that F is unimodal on [a, b]. Then if x1 and x2 are any twopointssuchthatax1
at x0 +ih, i = 1,2,…,q, with h = (b−a)/2q, until we find a point x1 from which F begins to increase again (or until we reach b). Then we repeat this procedure starting at x1 and using a smaller step length h/q. Here, q is the maximal number of evaluations at each step, say, 4. Write a subroutine to perform the Berman algorithm and test it for evaluating the approximate minimization of one-dimensional functions. Note: The total number of evaluations of F needed for executing this algorithm up to some iterative step k depends on the location of x∗. If, for example, x∗ = b, then clearly, we need q evaluations at each iteration and hence kq evaluations. This number will decrease the closer x∗ is to x0, and it can be shown that with q = 4, the expected number of evaluations is three per step. It is interesting to compare the efficiency of the Berman algorithm (q = 4) with that of the Fibonacci search algorithm. The expected number of evaluations per step is three, and the uncertainty interval decreases by a factor 4−1/3 ≈ 0.63 per evaluation. In comparison, the Fibonacci search algorithm has
a reduction factor of $2/(1 + \sqrt{5}) \approx 0.62$. Of course, the factor 0.63 in the Berman
algorithm represents only an average and can be considerably lower but also as high
as 4−1/4 ≈ 0.87.
6. Select a routine from your program library or from a package such as Matlab, Maple, or Mathematica for finding the minimum point of a function of one variable. Experiment with the function $F(x) = x^4 + \sin(23x)$ to determine whether this routine encounters any difficulties in finding a global minimum point. Use starting values both near to and far from the global minimum point. (See Figure 16.2.)
7. (Student project) The Greek mathematician Euclid of Alexandria (325–265 B.C.E.) wrote a collection of 13 books on mathematics and geometry. In book six, Proposition 30 shows how to divide a line into its mean and extreme mean, which is finding the golden section point on a line. This states that the ratio of the smaller part of a line segment to the larger part is the same as the ratio of the larger part to the whole line segment. For a line segment of length 1, denote the larger part by r and the smaller part by 1 − r as shown here:
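[Sketch: the segment from 0 to 1 divided into a larger part of length r and a smaller part of length 1 − r.]
Hence, we have the ratios
$$\frac{1 - r}{r} = \frac{r}{1}$$
and we obtain the quadratic equation
$$r^2 = 1 - r$$
This equation has two roots, one positive and one negative. The reciprocal of the positive root is the golden ratio $\tfrac{1}{2}(1 + \sqrt{5})$, which was of interest to Pythagoras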
(580–500 B.C.E.). It was also used in the construction of the Great Pyramid of Gizah. Mathematical software systems such as Matlab, Maple, or Mathematica contain the golden ratio constant. In fact, the default width-to-height ratio for the plot function is the golden ratio. Investigate the golden section ratio and its use in scientific computing.
8. Using a mathematical software system such as Matlab, Maple, or Mathematica, write a computer program to reproduce
a. Figure 16.1.
b. Figure 16.2. Also, find the global minimum of the function as well as several local minimum points near the origin.
16.2 Multivariate Case
Now we consider a real-valued function of n real variables F: Rn → R. As before, a point x∗ is sought such that
$$F(x^*) \leq F(x) \quad \text{for all } x \in \mathbb{R}^n$$
Some of the theory of multivariate functions must be developed to understand the rather sophisticated minimization algorithms in current use.
Taylor Series for F : Gradient Vector and Hessian Matrix
If the function F possesses partial derivatives of certain low orders (which is usually assumed in the development of these algorithms), then at any given point x, a gradient vector $G(x) = (G_i)_n$ is defined with components
$$G_i = G_i(x) = \frac{\partial F(x)}{\partial x_i} \qquad (1 \leq i \leq n) \tag{1}$$
and a Hessian matrix $H(x) = (H_{ij})_{n \times n}$ is defined with components
$$H_{ij} = H_{ij}(x) = \frac{\partial^2 F(x)}{\partial x_i \, \partial x_j} \qquad (1 \leq i, j \leq n) \tag{2}$$
We interpret G(x) as an n-component vector and H(x) as an n × n matrix, both depending on x.
Using the gradient and Hessian, we can write the first few terms of the Taylor series for F as
$$F(x + h) = F(x) + \sum_{i=1}^{n} G_i(x) h_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} h_i H_{ij}(x) h_j + \cdots \tag{3}$$
Equation (3) can also be written in an elegant matrix-vector form:
$$F(x + h) = F(x) + G(x)^T h + \tfrac{1}{2} h^T H(x) h + \cdots \tag{4}$$
Here, x is the fixed point of expansion in Rn, and h is the variable in Rn with components h1, h2, …, hn. The three dots indicate higher-order terms in h that are not needed in this discussion.
A result in calculus states that the order in which partial derivatives are taken is immaterial if all partial derivatives that occur are continuous. In the special case of the Hessian matrix, if the second partial derivatives of F are all continuous, then H is a symmetric matrix; that is, $H = H^T$ because
$$H_{ij} = \frac{\partial^2 F}{\partial x_i \, \partial x_j} = \frac{\partial^2 F}{\partial x_j \, \partial x_i} = H_{ji}$$
EXAMPLE 1
To illustrate Formula (4), let us compute the first three terms in the Taylor series for the function
$$F(x_1, x_2) = \cos(\pi x_1) + \sin(\pi x_2) + e^{x_1 x_2}$$
taking (1, 1) as the point of expansion.
Solution
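Partial derivatives are
$$\frac{\partial F}{\partial x_1} = -\pi \sin(\pi x_1) + x_2 e^{x_1 x_2} \qquad \frac{\partial F}{\partial x_2} = \pi \cos(\pi x_2) + x_1 e^{x_1 x_2}$$
$$\frac{\partial^2 F}{\partial x_1^2} = -\pi^2 \cos(\pi x_1) + x_2^2 e^{x_1 x_2} \qquad \frac{\partial^2 F}{\partial x_2 \, \partial x_1} = (x_1 x_2 + 1) e^{x_1 x_2}$$
$$\frac{\partial^2 F}{\partial x_1 \, \partial x_2} = (x_1 x_2 + 1) e^{x_1 x_2} \qquad \frac{\partial^2 F}{\partial x_2^2} = -\pi^2 \sin(\pi x_2) + x_1^2 e^{x_1 x_2}$$
Note the equality of cross derivatives; that is, $\partial^2 F/\partial x_1 \partial x_2 = \partial^2 F/\partial x_2 \partial x_1$. At the particular point $x = [1, 1]^T$, we have
$$F(x) = -1 + e, \qquad G(x) = \begin{bmatrix} e \\ -\pi + e \end{bmatrix}, \qquad H(x) = \begin{bmatrix} \pi^2 + e & 2e \\ 2e & e \end{bmatrix}$$
So by Equation (4),
$$F(1 + h_1, 1 + h_2) = -1 + e + [e, \; -\pi + e] \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} + \frac{1}{2} [h_1, h_2] \begin{bmatrix} \pi^2 + e & 2e \\ 2e & e \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} + \cdots$$
or equivalently, by Equation (3),
$$F(1 + h_1, 1 + h_2) = -1 + e + e h_1 + (-\pi + e) h_2 + \tfrac{1}{2}\left[(\pi^2 + e) h_1^2 + (2e) h_1 h_2 + (2e) h_2 h_1 + e h_2^2\right] + \cdots \quad ■$$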
In mathematical software systems such as Maple or Mathematica, we can verify these calculations using built-in routines for the gradient and Hessian. Also, we can obtain two terms in the Taylor series in two variables expanded about the point (1, 1) and then carry out a change of variables to obtain results similar to those above.
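The expansion in Example 1 can also be checked with a few lines of ordinary code. The following is a minimal sketch using only the Python standard library; the names `taylor2` and `taylor_error` are hypothetical. It compares F with its quadratic Taylor approximation about (1, 1) along the direction h = (s, −s); if the gradient and Hessian are right, the error should shrink like s³.

```python
import math

def F(x1, x2):
    return math.cos(math.pi * x1) + math.sin(math.pi * x2) + math.exp(x1 * x2)

def taylor2(h1, h2):
    """Quadratic Taylor approximation of F about (1, 1), as in Equation (4)."""
    e, pi = math.e, math.pi
    F0 = -1.0 + e
    G = (e, -pi + e)
    H = ((pi ** 2 + e, 2.0 * e), (2.0 * e, e))
    h = (h1, h2)
    linear = sum(G[i] * h[i] for i in range(2))
    quad = 0.5 * sum(h[i] * H[i][j] * h[j] for i in range(2) for j in range(2))
    return F0 + linear + quad

def taylor_error(s):
    """Absolute error of the approximation at the point (1 + s, 1 - s)."""
    return abs(F(1.0 + s, 1.0 - s) - taylor2(s, -s))

# Errors for a sequence of shrinking steps; each should be roughly
# a thousand times smaller than the one before.
errors = [taylor_error(s) for s in (1e-1, 1e-2, 1e-3)]
```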
Alternative Form of Taylor Series
Another form of the Taylor series is useful. First let z be the point of expansion, and then let h = x − z. Now from Equation (4),
$$F(x) = F(z) + G(z)^T (x - z) + \tfrac{1}{2} (x - z)^T H(z) (x - z) + \cdots \tag{5}$$
We illustrate with two special types of functions. First, the linear function has the form
$$F(x) = c + \sum_{i=1}^{n} b_i x_i = c + b^T x$$
for appropriate coefficients c, b1, b2, …, bn. Clearly, the gradient and Hessian are $G_i(z) = b_i$ and $H_{ij}(z) = 0$, so Equation (5) yields
$$F(x) = F(z) + \sum_{i=1}^{n} b_i (x_i - z_i) = F(z) + b^T (x - z)$$
Second, consider a general quadratic function. For simplicity, we take only two variables. The form of the function is
$$F(x_1, x_2) = c + (b_1 x_1 + b_2 x_2) + \tfrac{1}{2}\left(a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2\right) \tag{6}$$
which can be interpreted as the Taylor series for F when the point of expansion is (0, 0). To verify this assertion, the partial derivatives must be computed and evaluated at (0, 0):
$$\frac{\partial F}{\partial x_1} = b_1 + a_{11} x_1 + a_{12} x_2 \qquad \frac{\partial F}{\partial x_2} = b_2 + a_{22} x_2 + a_{12} x_1$$
$$\frac{\partial^2 F}{\partial x_1^2} = a_{11} \qquad \frac{\partial^2 F}{\partial x_1 \, \partial x_2} = a_{12}$$
$$\frac{\partial^2 F}{\partial x_2 \, \partial x_1} = a_{12} \qquad \frac{\partial^2 F}{\partial x_2^2} = a_{22}$$
Letting $z = [0, 0]^T$, we obtain from Equation (5)
$$F(x) = c + [b_1, b_2] \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \frac{1}{2} [x_1, x_2] \begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
This is the matrix form of the original quadratic function of two variables. It can also be written as
$$F(x) = c + b^T x + \tfrac{1}{2} x^T A x \tag{7}$$
where c is a scalar, b a vector, and A a matrix. Equation (7) holds for a general quadratic function of n variables, with b an n-component vector and A an n × n matrix.
Returning to Equation (3), we now write out the complicated double sum in complete detail to assist in understanding it:
$$\begin{aligned}
x^T H x = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i H_{ij} x_j
&= \sum_{j=1}^{n} x_1 H_{1j} x_j + \sum_{j=1}^{n} x_2 H_{2j} x_j + \cdots + \sum_{j=1}^{n} x_n H_{nj} x_j \\
&= \;\; x_1 H_{11} x_1 + x_1 H_{12} x_2 + \cdots + x_1 H_{1n} x_n \\
&\quad + x_2 H_{21} x_1 + x_2 H_{22} x_2 + \cdots + x_2 H_{2n} x_n \\
&\quad + \cdots \\
&\quad + x_n H_{n1} x_1 + x_n H_{n2} x_2 + \cdots + x_n H_{nn} x_n
\end{aligned}$$
Thus, xT Hx can be interpreted as the sum of all n2 terms in a square array of which the (i, j ) element is xi Hi j x j .
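The square-array interpretation translates directly into a double loop. The following is a minimal sketch with hypothetical helper names `quad_form` and `quadratic` and made-up coefficients; it evaluates a two-variable quadratic both in the matrix form of Equation (7) and term by term as in Equation (6).

```python
def quad_form(x, H):
    """x^T H x computed as the sum of all n^2 terms x_i * H_ij * x_j."""
    n = len(x)
    return sum(x[i] * H[i][j] * x[j] for i in range(n) for j in range(n))

def quadratic(x, c, b, A):
    """F(x) = c + b^T x + (1/2) x^T A x, the form of Equation (7)."""
    return c + sum(bi * xi for bi, xi in zip(b, x)) + 0.5 * quad_form(x, A)

# Illustrative coefficients (assumptions, not taken from the text).
c, b = 7.0, [5.0, 3.0]
a11, a12, a22 = 2.0, -1.0, 4.0
A = [[a11, a12], [a12, a22]]
x = [1.5, -2.0]

# The same function written out term by term, as in Equation (6).
explicit = (c + (b[0] * x[0] + b[1] * x[1])
            + 0.5 * (a11 * x[0] ** 2 + 2.0 * a12 * x[0] * x[1] + a22 * x[1] ** 2))
matrix_form = quadratic(x, c, b, A)
```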
Steepest Descent Procedure
A crucial property of the gradient vector G(x) is that it points in the direction of the most rapid increase in the function F, which is the direction of steepest ascent. Conversely, −G(x) points in the direction of the steepest descent. This fact is so important that it is worth a few words of justification. Suppose that h is a unit vector, $\sum_{i=1}^{n} h_i^2 = 1$. The rate of change of F (at x) in the direction h is defined naturally by
$$\frac{d}{dt} F(x + th) \Big|_{t=0}$$
This rate of change can be evaluated by using Equation (4). From that equation, it follows that
$$F(x + th) = F(x) + t \, G(x)^T h + \tfrac{1}{2} t^2 h^T H(x) h + \cdots \tag{8}$$
Differentiation with respect to t leads to
$$\frac{d}{dt} F(x + th) = G(x)^T h + t \, h^T H(x) h + \cdots \tag{9}$$
By letting t = 0 here, we see that the rate of change of F in the direction h is nothing else than
$$G(x)^T h$$
Now we ask: For what unit vector h is the rate of change a maximum? The simplest path to the answer is to invoke the powerful Cauchy-Schwarz inequality:
$$\sum_{i=1}^{n} u_i v_i \leq \left( \sum_{i=1}^{n} u_i^2 \right)^{1/2} \left( \sum_{i=1}^{n} v_i^2 \right)^{1/2} \tag{10}$$
where equality holds only if one of the vectors u or v is a nonnegative multiple of the other. Applying this to
$$G(x)^T h = \sum_{i=1}^{n} G_i(x) h_i$$
and remembering that $\sum_{i=1}^{n} h_i^2 = 1$, we conclude that the maximum occurs when h is a positive multiple of G(x), that is, when h points in the direction of G.
On the basis of the foregoing discussion, a minimization procedure called best-step steepest descent can be described. At any given point x, the gradient vector G(x) is calculated. Then a one-dimensional minimization problem is solved by determining the value t∗ for which the function
$$\varphi(t) = F(x + tG(x))$$
is a minimum. Then we replace x by x + t∗G(x) and begin anew.
The general method of steepest descent takes a step of any size in the direction of the negative gradient. It is not usually competitive with other methods, but it has the advantage of simplicity. One way of speeding it up is described in Computer Problem 16.2.2.
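Best-step steepest descent can be sketched in a few lines. The version below is an illustration under stated assumptions, not a production routine: the one-dimensional minimization uses a golden section search, the minimizing t∗ is assumed to lie in the bracket [t_min, 0] (it is negative because F decreases along −G), and φ is assumed unimodal on that bracket. The function names and default parameters are hypothetical.

```python
import math

R = (math.sqrt(5.0) - 1.0) / 2.0   # golden section ratio

def golden_section(phi, a, b, tol=1e-10):
    """Approximate the minimizer of a unimodal function phi on [a, b]."""
    x = a + R * (b - a)
    y = b + R * (a - b)
    fx, fy = phi(x), phi(y)
    while b - a > tol:
        if fy < fx:                 # minimum lies in [a, x]; old y becomes new x
            b, x, fx = x, y, fy
            y = b + R * (a - b)
            fy = phi(y)
        else:                       # minimum lies in [y, b]; old x becomes new y
            a, y, fy = y, x, fx
            x = a + R * (b - a)
            fx = phi(x)
    return 0.5 * (a + b)

def best_step_steepest_descent(F, grad, x, steps=200, t_min=-1.0):
    """Repeat: compute G(x), minimize phi(t) = F(x + t G(x)), move to x + t* G(x)."""
    for _ in range(steps):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < 1e-12:
            break                   # gradient is essentially zero
        phi = lambda t: F([xi + t * gi for xi, gi in zip(x, g)])
        t_star = golden_section(phi, t_min, 0.0)
        x = [xi + t_star * gi for xi, gi in zip(x, g)]
    return x

# Example: F(x) = 25 x1^2 + x2^2, whose minimum point is the origin.
F = lambda x: 25.0 * x[0] ** 2 + x[1] ** 2
grad = lambda x: [50.0 * x[0], 2.0 * x[1]]
x_min = best_step_steepest_descent(F, grad, [1.0, 1.0])
```

For this elongated quadratic the iterates zigzag toward the origin, which is why many steps are allowed even though each line search is nearly exact.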
Contour Diagrams
In understanding how these methods work on functions of two variables, it is often helpful to draw contour diagrams. A contour of a function F is a set of the form
{x : F(x) = c}
where c is a given constant. For example, the contours of function
$$F(x) = 25 x_1^2 + x_2^2$$
are ellipses, as shown in Figure 16.12. Contours are also called level sets by some authors. At any point on a contour, the gradient of F is perpendicular to the curve. So, in general, the path of steepest descent may look like Figure 16.13.
FIGURE 16.12
Contours of $F(x) = 25 x_1^2 + x_2^2$
[Figure: elliptical contours $c = 25x^2 + y^2$ plotted in the (x, y) plane.]
FIGURE 16.13
Path of steepest descent
[Figure: the points x1, x2, x3, x4, x5 along a path of steepest descent, lying on the contours F(x) = F(x1), …, F(x) = F(x5).]
More Advanced Algorithms
To explain more advanced algorithms, we consider a general real-valued function F of n variables. Suppose that we have obtained the first three terms in the Taylor series of F in the vicinity of a point z. How can they be used to guess the minimum point of F? Obviously, we could ignore all terms beyond the quadratic terms and find the minimum of the resulting quadratic function:
$$F(x + z) = F(z) + G(z)^T x + \tfrac{1}{2} x^T H(z) x + \cdots \tag{11}$$
Here, z is fixed and x is the variable. To find the minimum of this quadratic function of x, we must compute the first partial derivatives and set them equal to zero. Denoting this quadratic function by Q and simplifying the notation slightly, we have
$$Q(x) = F(z) + \sum_{i=1}^{n} G_i x_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} x_i H_{ij} x_j \tag{12}$$
from which it follows that
$$\frac{\partial Q}{\partial x_k} = G_k + \sum_{j=1}^{n} H_{kj} x_j \qquad (1 \leq k \leq n) \tag{13}$$
(See Problem 16.2.13.) The point x that is sought is thus a solution of the system of n equations
$$\sum_{j=1}^{n} H_{kj} x_j = -G_k \qquad (1 \leq k \leq n)$$
or, equivalently,
$$H(z) x = -G(z) \tag{14}$$
The preceding analysis suggests the following iterative procedure for locating a minimum point of a function F: Start with a point z that is a current estimate of the minimum point. Compute the gradient and Hessian of F at the point z. They can be denoted by G and H, respectively. Of course, G is an n-component vector of numbers and H is an n × n matrix of numbers. Then solve the matrix equation
$$Hx = -G$$
obtaining an n-component vector x. Replace z by z + x and return to the beginning of the procedure.
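The iterative procedure just described can be sketched as follows. This is a minimal illustration that assumes the user supplies procedures `grad` and `hess` for the gradient and Hessian; the linear solver (Gaussian elimination with partial pivoting), the stopping tolerance, and the test function are choices made here, not part of the description above.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def newton_minimize(grad, hess, z, steps=20, tol=1e-12):
    """Repeat: solve H x = -G at the current z, then replace z by z + x."""
    for _ in range(steps):
        G, H = grad(z), hess(z)
        x = solve(H, [-g for g in G])
        z = [zi + xi for zi, xi in zip(z, x)]
        if max(abs(xi) for xi in x) < tol:
            break
    return z

# Illustrative function F(x, y) = e^x - x + y^2, with minimum point (0, 0).
grad = lambda z: [math.exp(z[0]) - 1.0, 2.0 * z[1]]
hess = lambda z: [[math.exp(z[0]), 0.0], [0.0, 2.0]]
z_min = newton_minimize(grad, hess, [1.0, 1.0])
```

Nothing in this sketch checks that the stationary point found is a minimum; that question is taken up next.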
Minimum, Maximum, and Saddle Points
There are many reasons for expecting trouble from the iterative procedure just outlined. One especially noisome aspect is that we can expect to find a point only where the first partial derivatives of F vanish; it need not be a minimum point. It is what we call a stationary point. Such points can be classified into three types: minimum point, maximum point, and saddle point. They can be illustrated by simple quadratic surfaces familiar from analytic geometry:
• Minimum of $F(x, y) = x^2 + y^2$ at (0, 0) (See Figure 16.14(a).)
• Maximum of $F(x, y) = 1 - x^2 - y^2$ at (0, 0) (See Figure 16.14(b).)
• Saddle point of $F(x, y) = x^2 - y^2$ at (0, 0) (See Figure 16.14(c).)
FIGURE 16.14
Simple quadratic surfaces
[Figure: three surfaces, (a) Minimum point, (b) Maximum point, (c) Saddle point.]
Positive Definite Matrix
If z is a stationary point of F, then
$$G(z) = 0$$
Moreover, a criterion ensuring that Q, as defined in Equation (12), has a minimum point is as follows:
■ THEOREM 1
QUADRATIC FUNCTION THEOREM
If the matrix H has the property that $x^T H x > 0$ for every nonzero vector x, then the quadratic function Q has a minimum point.
(See Problem 16.2.15.) A matrix that has this property is said to be positive definite. Notice that this theorem involves only second-degree terms in the quadratic function Q.
As examples of quadratic functions that do not have minima, consider the following:
$$-x_1^2 - x_2^2 + 13x_1 + 6x_2 + 12 \qquad\qquad x_1^2 - x_2^2 + 3x_1 + 5x_2 + 7$$
$$x_1^2 - 2x_1 x_2 + x_1 + 2x_2 + 3 \qquad\qquad 2x_1 + 4x_2 + 6$$
In the first two examples, let x1 = 0 and x2 → ∞. In the third, let x1 = x2 → ∞. In the last, let x1 = 0 and x2 → −∞. In each case, the function values approach −∞, and no global minimum can exist.
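The behavior of these four examples is easy to observe numerically. The following is a minimal sketch (the names `cases` and `values_along_path` are hypothetical) that evaluates each function along the path named in the text for increasing values of a parameter s.

```python
def f1(x1, x2):
    return -x1 ** 2 - x2 ** 2 + 13 * x1 + 6 * x2 + 12

def f2(x1, x2):
    return x1 ** 2 - x2 ** 2 + 3 * x1 + 5 * x2 + 7

def f3(x1, x2):
    return x1 ** 2 - 2 * x1 * x2 + x1 + 2 * x2 + 3

def f4(x1, x2):
    return 2 * x1 + 4 * x2 + 6

# Each path maps s > 0 to the point (x1, x2) along which the text lets s grow.
cases = [
    (f1, lambda s: (0.0, s)),     # x1 = 0, x2 -> infinity
    (f2, lambda s: (0.0, s)),     # x1 = 0, x2 -> infinity
    (f3, lambda s: (s, s)),       # x1 = x2 -> infinity
    (f4, lambda s: (0.0, -s)),    # x1 = 0, x2 -> -infinity
]

def values_along_path(f, path, scales=(1e1, 1e2, 1e3)):
    """Function values at the path points for increasing s."""
    return [f(*path(s)) for s in scales]

tables = [values_along_path(f, path) for f, path in cases]
```

In every row of `tables` the values decrease without bound as s grows, so none of the four functions can have a global minimum.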
Quasi-Newton Methods
Algorithms that converge faster than steepest descent in general and that are currently recommended for minimization are of a type called quasi-Newton. The principal example is an algorithm introduced in 1959 by Davidon, called the variable metric algorithm. Subsequently, important modifications and improvements were made by others, such as R. Fletcher, M. J. D. Powell, C. G. Broyden, P. E. Gill, and W. Murray. These algorithms proceed iteratively, assuming in each step that a local quadratic approximation is known for the function F whose minimum is sought. The minimum of this quadratic function either provides the new point directly or is used to determine a line along which a one-dimensional search can be carried out. In implementation of the algorithm, the gradient can be either provided in the form of a procedure or computed numerically by finite differences. The Hessian H is not computed, but an estimate of its LU factorization is kept up to date as the process continues.
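When no procedure for the gradient is available, a finite-difference estimate can stand in for it. The following is a minimal sketch using central differences; the choice of central rather than forward differences, the step size h, and the name `numerical_gradient` are assumptions made for illustration.

```python
import math

def numerical_gradient(F, x, h=1e-6):
    """Central-difference estimate of the gradient of F at the point x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h                  # step forward in coordinate i
        xm[i] -= h                  # step backward in coordinate i
        g.append((F(xp) - F(xm)) / (2.0 * h))
    return g

# Check against the exact gradient [e, -pi + e] of the function of Example 1 at (1, 1).
F = lambda x: math.cos(math.pi * x[0]) + math.sin(math.pi * x[1]) + math.exp(x[0] * x[1])
approx = numerical_gradient(F, [1.0, 1.0])
exact = [math.e, -math.pi + math.e]
```

Each gradient estimate costs 2n evaluations of F, which is one reason a user-supplied gradient procedure is preferred when it is available.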
Nelder-Mead Algorithm
For minimizing a function F: Rn → R, another method called the Nelder-Mead algorithm is available. It is a method of direct search and proceeds without involving any derivatives of the function F and without any line searches.
Before beginning the calculations, the user assigns values to three parameters: α, β, and γ. The default values are 1, ½, and 1, respectively. In each step of the algorithm, a set of n + 1 points in Rn is given: {x0, x1, …, xn}. This set is in general position in Rn. This means that the set of n points xi − x0, with 1 ≤ i ≤ n, is linearly independent. A consequence of this assumption is that the convex hull of the original set {x0, x1, …, xn} is an n-simplex. For example, a 2-simplex is a triangle in R2, and a 3-simplex is a tetrahedron in R3. To make the description of the algorithm as simple as possible, we assume that the points have been relabeled (if necessary) so that F(x0) ≥ F(x1) ≥ ··· ≥ F(xn). Since we are trying to minimize the function F, the point x0 is the worst of the current set, because it produces the highest value of F.
We compute the point
$$u = \frac{1}{n} \sum_{i=1}^{n} x_i$$
This is the centroid of the face of the current simplex opposite the worst vertex, x0. Next, we compute a reflected point v = (1 + α)u − αx0.
If F(v) is less than F(xn), then this is a favorable situation, and one is tempted to replace x0 by v and begin anew. However, we first compute an expanded reflected point w = (1 + γ)v − γu and test to see whether F(w) is less than F(xn). If so, we replace x0 by w and begin anew. Otherwise, we replace x0 by v as originally suggested and begin with the new simplex.
Assume now that F(v) is not less than F(xn). If F(v) ≤ F(x1), then replace x0 by v and begin again. Having disposed of all cases when F(v) ≤ F(x1), we now consider two further cases. First, if F(v) ≤ F(x0), then define w = u + β(v − u). If F(v) > F(x0), compute w = u + β(x0 − u). With w now defined, test whether F(w) < F(x0). If this is true, replace x0 by w and begin anew. However, if F(w) ≥ F(x0), shrink the simplex by using $x_i \leftarrow \tfrac{1}{2}(x_i + x_n)$ for 0 ≤ i ≤ n − 1. Then begin anew.
The algorithm needs a stopping test in each major step. One such test is whether the relative flatness is small. That is the quantity
$$\frac{F(x_0) - F(x_n)}{|F(x_0)| + |F(x_n)|}$$
Other tests to make sure progress is being made can be added. In programming the algorithm, one keeps the number of evaluations of F to a minimum. In fact, only three indices are needed: the indices of the greatest F(xi), the next greatest, and the least.
In addition to the original paper of Nelder and Mead [1965], one can consult Dennis and Woods [1987], Dixon [1974], and Torczon [1997]. Different authors give slightly different versions of the algorithm. We have followed the original description by Nelder and Mead.
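The steps described above can be sketched compactly. The following is an illustration only: for readability it sorts the whole simplex and re-evaluates F freely, rather than keeping evaluations to a minimum as recommended, and the function name, tolerance, and step limit are assumptions. The stopping test is the relative flatness, so the sketch is best tried on a function whose minimum value is not zero.

```python
def nelder_mead(F, simplex, alpha=1.0, beta=0.5, gamma=1.0,
                tol=1e-14, max_steps=2000):
    """Minimize F starting from a list of n + 1 points in general position."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1
    for _ in range(max_steps):
        # Relabel so that F(x0) >= F(x1) >= ... >= F(xn); x0 is the worst point.
        pts.sort(key=F, reverse=True)
        f0, f1, fn = F(pts[0]), F(pts[1]), F(pts[n])
        # Stopping test: relative flatness.
        denom = abs(f0) + abs(fn)
        if denom == 0.0 or (f0 - fn) / denom < tol:
            break
        x0 = pts[0]
        # Centroid of the face opposite the worst vertex.
        u = [sum(p[k] for p in pts[1:]) / n for k in range(n)]
        # Reflected point.
        v = [(1.0 + alpha) * uk - alpha * xk for uk, xk in zip(u, x0)]
        fv = F(v)
        if fv < fn:
            # Expanded reflected point; keep it only if it also beats F(xn).
            w = [(1.0 + gamma) * vk - gamma * uk for vk, uk in zip(v, u)]
            pts[0] = w if F(w) < fn else v
        elif fv <= f1:
            pts[0] = v
        else:
            if fv <= f0:
                w = [uk + beta * (vk - uk) for uk, vk in zip(u, v)]
            else:
                w = [uk + beta * (xk - uk) for uk, xk in zip(u, x0)]
            if F(w) < f0:
                pts[0] = w
            else:
                # Shrink the simplex toward the best point xn.
                xn = pts[n]
                for i in range(n):
                    pts[i] = [0.5 * (a + b) for a, b in zip(pts[i], xn)]
    return min(pts, key=F)

# Illustrative test function with minimum point (1.3, -0.7) and minimum value 3.
F = lambda p: (p[0] - 1.3) ** 2 + 2.0 * (p[1] + 0.7) ** 2 + 3.0
best = nelder_mead(F, [[0.0, 0.0], [1.2, 0.1], [0.3, 0.9]])
```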
Method of Simulated Annealing
This method has been proposed and found to be effective for the minimization of difficult functions, especially if they have many purely local minimum points. It involves no derivatives or line searches; indeed, it has found great success in minimizing discrete functions, such as arise in the traveling salesman problem.
Suppose we are given a real-valued function of n real variables; that is, F: Rn → R. We must be able to compute the values F(x) for any x in Rn. It is desired to locate a global minimum point of F, which is a point x∗ such that F(x∗) ≤ F(x) for all x in Rn. In other words, F(x∗) is equal to $\inf_{x \in \mathbb{R}^n} F(x)$. The algorithm generates a sequence of points x1, x2, x3, ..., and one hopes that $\min_{j \leq k} F(x_j)$ converges to inf F(x) as k → ∞.
It suffices to describe the computation that leads to xk+1, assuming that xk has been computed. We begin by generating a modest number of random points u1, u2, ..., um in a large neighborhood of xk. For each of these points, the value of F must be computed. The next point, xk+1, in our sequence is chosen to be one of the points u1, u2, ..., um. This choice is made as follows. Select an index j such that
F(uj) = min{F(u1), F(u2),..., F(um)}
If F(uj) < F(xk), then set xk+1 = uj. In the other case, for each i, we assign a probability pi to ui by the formula
$$p_i = e^{\alpha [F(x_k) - F(u_i)]} \qquad (1 \leq i \leq m)$$
Here, α is a positive parameter chosen by the user of the code. We normalize the probabilities by dividing each by their sum. That is, we compute
$$S = \sum_{i=1}^{m} p_i$$
and then carry out a replacement
$$p_i \leftarrow p_i / S$$
Finally, a random choice is made among the points u1, u2, ..., um, taking account of the probabilities pi that have been assigned to them. This randomly chosen ui becomes xk+1.
The simplest way to make this random choice is to employ a random number generator to get a random point ξ in the interval (0, 1). Select i to be the first integer such that
$$\xi \leq p_1 + p_2 + \cdots + p_i$$
Thus, if ξ ≤ p1, let i = 1 (and xk+1 = u1). If p1 < ξ ≤ p1 + p2, then let i = 2 (and xk+1 = u2), and so on.
The formula for the probabilities pi is taken from the theory of thermodynamics. The interested reader can consult the original articles by Metropolis et al. [1953] or Otten and van Ginneken [1989]. Presumably, other functions can serve in this role as well.
What is the purpose of the complicated choice for xk+1? Because of the possibility of encountering local minima, the algorithm must occasionally choose a point that is uphill from the current point. Then there is a chance that subsequent points might begin to move toward a different local minimum. An element of randomness is introduced to make this possible.
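One step of the method, together with a simple driver, can be sketched as follows. Several details are assumptions made for illustration: the "large neighborhood" is taken to be a box of half-width `radius` around xk, the random points are uniform in that box, the parameter values are arbitrary, and a guard is added in case every probability underflows to zero.

```python
import math
import random

def annealing_step(F, xk, alpha, m, radius, rng):
    """Compute x_{k+1} from x_k by the selection rule described above."""
    candidates = [[xi + rng.uniform(-radius, radius) for xi in xk]
                  for _ in range(m)]
    values = [F(u) for u in candidates]
    fk = F(xk)
    j = values.index(min(values))
    if values[j] < fk:
        return candidates[j]        # a downhill candidate exists: take the best
    # Otherwise choose at random with p_i proportional to exp(alpha [F(xk) - F(ui)]).
    p = [math.exp(alpha * (fk - fu)) for fu in values]
    S = sum(p)
    if S == 0.0:
        return candidates[j]        # guard: all probabilities underflowed
    p = [pi / S for pi in p]
    xi = rng.random()
    cumulative = 0.0
    for i, pi in enumerate(p):
        cumulative += pi
        if xi <= cumulative:        # first i with xi <= p1 + ... + pi
            return candidates[i]
    return candidates[-1]           # guard against rounding in the running sum

def anneal(F, x0, alpha=5.0, m=10, radius=1.0, steps=200, seed=0):
    """Generate x1, x2, ... and return the best point seen so far."""
    rng = random.Random(seed)
    xk, best = list(x0), list(x0)
    for _ in range(steps):
        xk = annealing_step(F, xk, alpha, m, radius, rng)
        if F(xk) < F(best):
            best = xk
    return best

F = lambda x: x[0] ** 2 + x[1] ** 2
best = anneal(F, [3.0, 3.0])
```

Tracking the best point seen corresponds to monitoring the minimum of F(xj) over j ≤ k, since the current point xk itself may move uphill.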
With minor modifications, the algorithm can be used for functions f : X → R, where X is any set. For example, in the traveling salesman problem, X will be the set of all permutations of a set of integers {1, 2, 3, . . . , N }. All that is required is a procedure for generating random permutations and, of course, a code for evaluating the function f .
Computer programs for this algorithm can be found on the Internet such as at the websites http://www.netlib.gov and http://www.ingber.com. A collection of papers on this subject, emphasizing parallel computation, is Azencott [1992].
Summary
(1) In a typical minimization problem, we seek a point x∗ such that
$$F(x^*) \leq F(x) \quad \text{for all } x \in \mathbb{R}^n$$
where F is a real-valued multivariate function.
(2) A gradient vector G(x) has components
$$G_i = G_i(x) = \frac{\partial F(x)}{\partial x_i} \qquad (1 \leq i \leq n)$$
and a Hessian matrix H(x) has components
$$H_{ij} = H_{ij}(x) = \frac{\partial^2 F(x)}{\partial x_i \, \partial x_j} \qquad (1 \leq i, j \leq n)$$
It is a symmetric matrix if the second-order derivatives are continuous.
(3) The Taylor series for F is
$$F(x + h) = F(x) + G(x)^T h + \tfrac{1}{2} h^T H(x) h + \cdots$$
Here, x is the fixed point of expansion in Rn and h is the variable in Rn with components h1, h2, ..., hn. The three dots indicate higher-order terms in h that are not needed in this discussion.
(4) An alternative form of the Taylor series is
$$F(x) = F(z) + G(z)^T (x - z) + \tfrac{1}{2} (x - z)^T H(z) (x - z) + \cdots$$
For example, a linear function $F(x) = c + b^T x$ has the Taylor series
$$F(x) = F(z) + b^T (x - z)$$
A quadratic function is
$$F(x) = c + b^T x + \tfrac{1}{2} x^T A x$$
(5) An iterative procedure for locating a minimum point of a function F is to start with a point z that is a current estimate of the minimum point, compute the gradient G and Hessian H of F at the point z, and solve the matrix equation
$$Hx = -G$$
for x. Then replace z by z + x and repeat.
(6) If the matrix H has the property that $x^T H x > 0$ for every nonzero vector x, then the quadratic function Q has a unique minimum point.
(7) Algorithms that are discussed are steepest descent, Nelder-Mead, and simulated annealing.
Additional References
For more reading on the subject of optimization, see books and papers by Azencott [1992], Baldick [2006], Beale [1988], Cvijovic and Klinowski [1995], Dennis and Schnabel [1983, 1996], Dennis and Woods [1987], Dixon [1974], Fletcher [1976], Floudas and Pardalos [1992], Gill, Murray, and Wright [1981], Herz-Fischer [1998], Horst, Pardalos, and Thoai [2000], Kelley [2003], Kirkpatrick et al. [1983], Lootsma [1972], Moré and Wright [1993], Nelder and Mead [1965], Nocedal and Wright [2006], Otten and van Ginneken [1989], Rheinboldt [1998], Roos, Terlaky, and Vial [1997], Torczon [1997], and Törn and Zilinskas [1989].
Problems 16.2
1. Determine whether these functions have minimum values in R2:
aa. $x_1^2 - x_1 x_2 + x_2^2 + 3x_1 + 6x_2 - 4$
ab. $x_1^2 - 3x_1 x_2 + x_2^2 + 7x_1 + 3x_2 + 5$
c. $2x_1^2 - 3x_1 x_2 + x_2^2 + 4x_1 - x_2 + 6$
d. $a x_1^2 - 2b x_1 x_2 + c x_2^2 + d x_1 + e x_2 + f$
Hint: Use the method of completing the square.
a2. Locate the minimum point of $3x^2 - 2xy + y^2 + 3x - 4y + 7$ by finding the gradient and Hessian and solving the appropriate linear equations.
a3. Using (0, 0) as the point of expansion, write the first three terms of the Taylor series for $F(x, y) = e^x \cos y - y \ln(x + 1)$.
4. Using (1, 1) as the point of expansion, write the first three terms of the Taylor series for $F(x, y) = 2x^2 - 4xy + 7y^2 - 3x + 5y$.
5. The Taylor series expansion about zero can be written as
$$F(x) = F(0) + G(0)^T x + \tfrac{1}{2} x^T H(0) x + \cdots$$
Show that the Taylor series about z can be written in a similar form by using matrix-vector notation; that is,
$$F(x) = F(z) + \mathcal{G}(z)^T X + \tfrac{1}{2} X^T \mathcal{H}(z) X + \cdots$$
where
$$X = \begin{bmatrix} x \\ z \end{bmatrix}, \qquad \mathcal{G}(z) = \begin{bmatrix} G(z) \\ -G(z) \end{bmatrix}, \qquad \mathcal{H}(z) = \begin{bmatrix} H(z) & -H(z) \\ -H(z) & H(z) \end{bmatrix}$$
a6. Show that the gradient of F(x, y) is perpendicular to the contour. Hint: Interpret the equation F(x, y) = c as defining y as a function of x. Then by the chain rule,
$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx} = 0$$
From it obtain the slope of the tangent to the contour.
7. Consider the function
$$F(x_1, x_2, x_3) = 3e^{x_1 x_2} - x_3 \cos x_1 + x_2 \ln x_3$$
a. Determine the gradient vector and Hessian matrix.
ab. Derive the first three terms of the Taylor series expansion about (0, 1, 1).
c. What linear system should be solved for a reasonable guess as to the minimum point for F? What is the value of F at this point?
8. It is asserted that the Hessian of an unknown function F at a certain point is
$$\begin{bmatrix} 3 & 2 \\ 1 & 4 \end{bmatrix}$$
What conclusion can be drawn about F?
9. What are the gradients of the following functions at the points indicated?
aa. $F(x, y) = x^2 y - 2x + y$ at (1, 0)
ab. $F(x, y, z) = xy + yz^2 + x^2 z$ at (1, 2, 1)
a10. Consider $F(x, y, z) = y^2 z^2 (1 + \sin^2 x) + (y + 1)^2 (z + 3)^2$. We want to find the minimum of the function. The program to be used requires the gradient of the function. What formulas must we program for the gradient?
11. Let F be a function of two variables whose gradient at (0, 0) is $[-5, 1]^T$ and whose Hessian is
$$\begin{bmatrix} 6 & -1 \\ -1 & 2 \end{bmatrix}$$
Make a reasonable guess as to the minimum point of F. Explain.
a12. Write the function $F(x_1, x_2) = 3x_1^2 + 6x_1 x_2 - 2x_2^2 + 5x_1 + 3x_2 + 7$ in the form of Equation (7) with appropriate A, b, and c. Show in matrix form the linear equations that must be solved in order to find a point where the first partial derivatives of F vanish. Finally, solve these equations to locate this point numerically.
13. Verify Equation (13). In differentiating the double sum in Equation (12), first write all terms that contain xk. Then differentiate and use the symmetry of the matrix H.
14. Consider the quadratic function Q in Equation (12). Show that if H is positive definite, then the stationary point is a minimum point.
15. (General quadratic function) Generalize Equation (6) to n variables. Show that a general quadratic function Q(x) of n variables can be written in the matrix-vector form of Equation (7), where A is an n × n symmetric matrix, b a vector of length n, and c a scalar. Establish that the gradient and Hessian are
$$G(x) = Ax + b \quad \text{and} \quad H(x) = A$$
respectively.
16. Let A be an n × n symmetric matrix and define an upper triangular matrix $U = (u_{ij})$ by putting
$$u_{ij} = \begin{cases} a_{ij} & i = j \\ 2a_{ij} & i < j \\ 0 & i > j \end{cases}$$
Show that $x^T U x = x^T A x$ for all vectors x.
17. Show that the general quadratic function Q(x) of n variables can be written
$$Q(x) = c + b^T x + \tfrac{1}{2} x^T U x$$
where U is an upper triangular matrix. Can this simplify the work of finding the stationary point of Q?
18. Show that the gradient and Hessian satisfy the equation
$$H(z)(x - z) = G(x) - G(z)$$
for a general quadratic function of n variables.
19. Using Taylor series, show that a general quadratic function of n variables can be written in block form
$$Q(x) = \tfrac{1}{2} X^T \mathcal{A} X + \mathcal{B}^T X + c$$
where
$$X = \begin{bmatrix} x \\ z \end{bmatrix}, \qquad \mathcal{A} = \begin{bmatrix} A & -A \\ -A & A \end{bmatrix}, \qquad \mathcal{B} = \begin{bmatrix} b \\ -b \end{bmatrix}$$
Here z is the point of expansion.
20. (Least-squares problem) Consider the function
$$F(x) = (b - Ax)^T (b - Ax) + \alpha x^T x$$
where A is a real m × n matrix, b is a real column vector of order m, and α is a positive real number. We want the minimum point of F for given A, b, and α. Show that
$$F(x + h) - F(x) = (Ah)^T (Ah) + \alpha h^T h \geq 0$$
for h a vector of order n, provided that
$$(A^T A + \alpha I) x = A^T b$$
This means that any solution of this linear system minimizes F(x); hence, this is the normal equation.
21. (Multiple choice) What is the gradient of the function $f(x) = 3x_1^2 - \sin(x_1 x_2)$ at the point (3, 0)?
a. (6, −3) b. (3, −1) c. (18, 0) d. (18, −3) e. None of these.
22. (Multiple choice, continuation) The directional derivative of the function f at the point x in the direction u is given by the expression
d f(x+tu)|t=0 dt
In this description, u should be a unit vector. What is the numerical value of the directional derivative where f (x) is the function defined in the preceding problem,
x = (1, π/2), and u = (1, 1)/√2.
a. 6/√2 b. 6 c. 18 d. 3 e. None of these.
23. (Multiple choice, continuation) If f is a real-valued function of n variables, the Hessian $H = (H_{ij})$ is given by $H_{ij} = \partial^2 f / \partial x_i \partial x_j$, all terms being evaluated at a specific point x. What is the entry $H_{22}$ in this matrix in the case of f as given in the previous problem and x = (1, π/2)?
a. 6 b. 6/√2 c. 1 d. π²/2 e. None of these.
24. (Multiple choice) Let f be a real-valued function of n real variables. Let x and u be given as numerical vectors, and u ≠ 0. Then the expression f(x + tu) defines a function of t. Suppose that the minimum of f(x + tu) occurs at t = 0. What conclusion can be drawn?
a. The gradient of f at x, denoted by G(x), is 0.
b. u is perpendicular to the gradient of f at x.
c. u = G(x), where G(x) denotes the gradient of f at x.
d. G(x) is perpendicular to x.
e. None of these.
25. (Multiple choice) If f is a (real-valued) quadratic function of n real variables, we can write it in the form $f(x) = c - b^T x + \tfrac{1}{2} x^T A x$. The gradient of f is then:
a. $Ax$ b. $b - Ax$ c. $Ax - b$ d. $\tfrac{1}{2}Ax - b$ e. None of these.
Computer Problems 16.2
1. Select a routine from your program library or from a package such as Matlab, Maple, or Mathematica for minimizing a function of many variables without the need to program derivatives. Test it on one or more of the following well-known functions. The ordering of our variables is (x, y, z, w).
a. Rosenbrock's: $100(y - x^2)^2 + (1 - x)^2$. Start at (−1.2, 1.0).
b. Powell's: $(x + 10y)^2 + 5(z - w)^2 + (y - 2z)^4 + 10(x - w)^4$. Start at (3, −1, 0, 1).
c. Powell's: $x^2 + 2y^2 + 3z^2 + 4w^2 + (x + y + z + w)^4$. Start at (1, −1, 1, 1).
d. Fletcher and Powell's: $100(z - 10\varphi)^2 + \left(\sqrt{x^2 + y^2} - 1\right)^2 + z^2$, in which φ is an angle determined from (x, y) by
$$\cos 2\pi\varphi = \frac{x}{\sqrt{x^2 + y^2}} \qquad \text{and} \qquad \sin 2\pi\varphi = \frac{y}{\sqrt{x^2 + y^2}}$$
where $-\pi/2 < 2\pi\varphi \le 3\pi/2$. Start at (1, 1, 1).
e. Wood's: $100(x^2 - y)^2 + (1 - x)^2 + 90(z^2 - w)^2 + (1 - z)^2 + 10(y - 1)^2 + (w - 1)^2 + 19.8(y - 1)(w - 1)$. Start at (−3, −1, −3, −1).
2. (Accelerated steepest descent) This version of steepest descent is superior to the basic one. A sequence of points $x_1, x_2, \ldots$ is generated as follows: Point $x_1$ is specified as the starting point. Then $x_2$ is obtained by one step of steepest descent from $x_1$. In the general step, if $x_1, x_2, \ldots, x_m$ have been obtained, we find a point z by steepest descent from $x_m$. Then $x_{m+1}$ is taken as the minimum point on the line $x_{m-1} + t(z - x_{m-1})$. Program and test this algorithm on one of the examples in Computer Problem 16.2.1.
3. Using a routine in your program library or in Matlab, Maple, or Mathematica,
a. solve the minimization problem that begins this chapter.
b. plot and solve for the minimum point, the maximum point, and the saddle point of these functions, respectively: $x^2 + y^2$, $1 - x^2 - y^2$, $x^2 - y^2$.
c. plot and numerically experiment with these functions that do not have minima: $-x^2 - y^2 + 13x + 6y + 12$, $x^2 - y^2 + 3x + 5y + 7$, $x^2 - 2xy + x + 2y + 3$, $2x + 4y + 6$.
4. We want to find the minimum of $F(x, y, z) = z^2 \cos x + x^2 y^2 + x^2 e^z$ using a computer program that requires procedures for the gradient of F together with F. Write the necessary procedures. Find the minimum using a preprogrammed code that uses the gradient.
5. Assume that
procedure Xmin(f, (grad_i), n, (x_i), (g_ij))
is available to compute the minimum value of a function of two variables. Suppose that this routine requires not only the function but also its gradient. If we are going to use this routine with the function $F(x, y) = e^x \cos^2(xy)$, what procedure will be needed? Write the appropriate code. Find the minimum using a preprogrammed code that uses the gradient.
6. Program and test the Nelder-Mead algorithm.
7. Program and test the Simulated Annealing algorithm.
8. (Student research project) Explore one of the newer methods for minimization such as genetic algorithms, methods of simulated annealing, or the Nelder-Mead algorithm. Use some of the software that is available for them.
9. Use built-in routines in mathematical software systems such as Maple or Mathematica to verify the calculations in Example 1. Hint: In Maple, use grad and Hessian, and in Mathematica, use Series. For example, obtain two terms in the Taylor series in two variables expanded about the point (1, 1), and then carry out a change of variables.
10. (Molecular conformation: Protein folding project) Forces that govern folding of amino acids into proteins are due to bonds between individual atoms and to weaker interactions between unbound atoms such as electrostatic and Van der Waals forces. The Van der Waals forces are modeled by the Lennard-Jones potential
$$U(r) = \frac{1}{r^{12}} - \frac{2}{r^{6}}$$
where r is the distance between atoms.
[Figure: graph of the Lennard-Jones potential, horizontal axis x (ticks at 1, 2, 3), vertical axis y.]
In the figure, the energy minimum is −1 and it is achieved at r = 1. Explore this subject and the numerical methods used. One approach is to predict the conformation of the proteins in finding the minimum potential energy of the total configuration of amino acids. For a cluster of atoms with positions (x1, y1, z1) to (xn, yn, zn), the objective function to be minimized is
$$U = \sum_{i<j} \left( \frac{1}{r_{ij}^{12}} - \frac{2}{r_{ij}^{6}} \right)$$
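As a small numerical illustration of the single-pair potential $U(r) = 1/r^{12} - 2/r^{6}$, the sketch below locates its minimum with a golden section search. The function names, the bracketing interval [0.8, 2.0], and the tolerance are assumptions chosen for this example only; any one-dimensional minimizer could be substituted.

```python
import math


def lennard_jones(r):
    """Scaled Lennard-Jones potential U(r) = 1/r**12 - 2/r**6."""
    return r ** -12 - 2.0 * r ** -6


def golden_section_min(f, a, b, tol=1e-7):
    """Minimize a unimodal function f on [a, b] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0      # about 0.618
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                             # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:                                   # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)


# U decreases on (0, 1) and increases on (1, infinity), so it is unimodal here.
r_min = golden_section_min(lennard_jones, 0.8, 2.0)
u_min = lennard_jones(r_min)
print(r_min, u_min)
```

The computed values should agree with the statement that the energy minimum is −1 and is attained at r = 1, up to the search tolerance.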
⎪⎩ constraints:
yT A cT y0
yx−2
668 Chapter 17
Linear Programming
Here, nonunique and unbounded “solutions” may be obtained.
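ᵃa. $c = [2, -4]^T$, $A = \begin{bmatrix} -3 & -5 \\ 4 & 9 \end{bmatrix}$, $b = [-15, 36]^T$
b. $c = [2, \tfrac{1}{2}]^T$, $A = \begin{bmatrix} 6 & 5 \\ 4 & 1 \end{bmatrix}$, $b = [30, 12]^T$
ᵃc. $c = [3, 2]^T$, $A = \begin{bmatrix} -3 & 2 \\ -4 & 9 \end{bmatrix}$, $b = [6, 36]^T$
d. $c = [2, -3]^T$, $A = \begin{bmatrix} -1 & 1 \\ 0 & 1 \end{bmatrix}$, $b = [0, 5]^T$
e. $c = [-4, 11]^T$, $A = \begin{bmatrix} -3 & 4 \\ -4 & 11 \end{bmatrix}$, $b = [12, 44]^T$
ᵃf. $c = [-3, 4]^T$, $A = \begin{bmatrix} 2 & 3 \\ -4 & -5 \end{bmatrix}$, $b = [6, -20]^T$
g. $c = [2, 1]^T$, $A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$, $b = [0, -2]^T$
ᵃh. $c = [3, 1]^T$, $A = \begin{bmatrix} 2 & 4 \\ 5 & 3 \end{bmatrix}$, $b = [21, 18]^T$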
ᵃ14. Solve the following linear programming problem by hand, using a graph for help:
$$\begin{cases} \text{maximize:} & 4x + 4y + z \\ \text{constraints:} & \begin{cases} 3x + 2y + z = 12 \\ 7x + 7y + 2z \le 144 \\ 7x + 5y + 2z \le 80 \\ 11x + 7y + 3z \le 132 \\ x \ge 0 \quad y \ge 0 \end{cases} \end{cases}$$
Hint: Use the equation to eliminate z from all other expressions. Solve the resulting two-dimensional problem.
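15. Put this linear programming problem into second primal form. You may want to make changes of variables. If so, include a dictionary relating new and old variables.
$$\begin{cases} \text{minimize:} & \varepsilon_1 + \varepsilon_2 + \varepsilon_3 \\ \text{constraints:} & \begin{cases} |3x + 4y + 6| \le \varepsilon_1 \\ |2x - 8y - 4| \le \varepsilon_2 \\ |-x - 3y + 5| \le \varepsilon_3 \\ \varepsilon_1 > 0 \quad \varepsilon_2 > 0 \quad \varepsilon_3 > 0 \quad x > 0 \quad y > 0 \end{cases} \end{cases}$$
Solve the resulting problem.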
16. Consider the following linear programming problem:
$$\begin{cases} \text{maximize:} & c_1 x_1 + c_2 x_2 \\ \text{constraints:} & a_1 x_1 + a_2 x_2 \le b \qquad x_1 \ge 0 \quad x_2 \ge 0 \end{cases}$$
In the special case in which all data are positive, show that the dual problem has the same extreme value as the original problem.
ᵃ17. Suppose that a linear programming problem in first primal form has the property that $c^T x$ is not bounded on the feasible set. What conclusion can be drawn about the dual problem?
18. (Multiple choice) Which of these problems is formulated in the first primal form for a linear programming problem?
a. maximize $c^T x$ subject to $Ax \le b$
b. minimize $c^T x$ subject to $Ax \le b$, $x \ge 0$
c. maximize $c^T x$ subject to $Ax = b$, $x \ge 0$
d. maximize $c^T x$ subject to $Ax \le b$, $x \ge 0$
e. None of these.
Computer Problems 17.1
ᵃ1. A western shop wishes to purchase 300 felt and 200 straw cowboy hats. Bids have been received from three wholesalers. Texas Hatters has agreed to supply not more than 200 hats, Lone Star Hatters not more than 250, and Lariat Ranch Wear not more than 150. The owner of the shop has estimated that his profit per hat sold from Texas Hatters would be $3/felt and $4/straw, from Lone Star Hatters $3.80/felt and $3.50/straw, and from Lariat Ranch Wear $4/felt and $3.60/straw. Set up a linear programming problem to maximize the owner's profits. Solve by using a program from your software library.
2. The ABC Drug Company makes two types of liquid painkiller that have brand names Relieve (R) and Ease (E) and contain different mixtures of three basic drugs, A, B, and C, produced by the company. Each bottle of R requires 7/9 unit of drug A, 1/2 unit of drug B, and 3/4 unit of drug C. Each bottle of E requires 4/9 unit of drug A, 5/2 unit of drug B, and 1/4 unit of drug C. The company is able to produce each day only 5 units of drug A, 7 units of drug B, and 9 units of C. Moreover, Food and Drug Administration regulations stipulate that the number of bottles of R manufactured cannot exceed twice the number of bottles of E. The profit margin for each bottle of E and R is $7 and $3, respectively. Set up the linear programming problem in first primal form to determine the number of bottles of the two painkillers that the company should produce each day so as to maximize their profits. Solve by using available software.
ᵃ3. Suppose that the university student government wishes to charter planes to transport at least 750 students to the bowl game. Two airlines, α and β, agree to supply aircraft for the trip. Airline α has five aircraft available carrying 75 passengers each, and airline β has three aircraft available carrying 250 passengers each. The cost per aircraft is $900 and $3250 for the trip from airlines α and β, respectively. The student government wants to charter at most six aircraft. How many of each type should be chartered to minimize the cost of the airlift? How much should the student government charge to make 50¢ profit per student? Solve by the graphical method, and verify by using a routine from your program library.
4. (Continuation) Rework the preceding computer problem in the following two possibly different ways:
a. The number of students going on the airlift is maximized.
b. The cost per student is minimized.
ᵃ5. (Diet problem) A university dining hall wishes to provide at least 5 units of vitamin C and 3 units of vitamin E per serving. Three foods are available containing these vitamins. Food f1 contains 2.5 and 1.25 units per ounce of vitamins C and E, respectively, whereas food f2 contains just the opposite amounts. The third food f3 contains an equal amount of each vitamin at 1 unit per ounce. Food f1 costs 25¢ per ounce, food f2 costs 56¢ per ounce, and food f3 costs 10¢ per ounce. The dietitian wishes to provide the meal at a minimum cost per serving that satisfies the minimum vitamin requirements. Set up this linear programming problem in second primal form. Solve with the aid of a code from your computer program library.
6. Use built-in routines in mathematical software systems such as Matlab, Maple, or Mathematica to solve the linear programming problems with the equation numbers below in first primal form, in second primal form, and in dual form:
a. (2) b. (3) c. (4) d. (5) e. (6)
17.2 Simplex Method
The principal algorithm that is used in solving linear programming problems is the simplex method. Here, enough of the background of this method is described that the reader can use available computer programs that incorporate it.
Consider a linear programming problem in second primal form:
$$\begin{cases} \text{maximize:} & c^T x \\ \text{constraints:} & Ax = b \qquad x \ge 0 \end{cases}$$
It is assumed that c and x are n-component vectors, b is an m-component vector, and A is an m × n matrix. Also, it is assumed that $b \ge 0$ and that A contains an m × m identity matrix in its last m columns. As before, we define the set of feasible points as
$$K = \{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}$$
The points of K are exactly the points that are competing to maximize $c^T x$.
Vertices in K and Linearly Independent Columns of A
The set K is a polyhedral set in $\mathbb{R}^n$, and the algorithm to be described proceeds from vertex to vertex in K, always increasing the value of $c^T x$ as it goes from one to another. Let us give a precise definition of vertex. A point x in K is called a vertex if it is impossible to express it as $x = \tfrac{1}{2}(u + v)$, with both u and v in K and $u \ne v$. In other words, x is not the midpoint of any line segment whose endpoints lie in K.
We denote by $a^{(1)}, a^{(2)}, \ldots, a^{(n)}$ the column vectors constituting the matrix A. The following theorem relates the columns of A to the vertices of K:
■ THEOREM 1
THEOREM ON VERTICES AND COLUMN VECTORS
Let $x \in K$ and define $I(x) = \{i : x_i > 0\}$. Then the following are equivalent:
1. x is a vertex of K.
2. The set $\{a^{(i)} : i \in I(x)\}$ is linearly independent.
Proof
If Statement 1 is false, then we can write $x = \tfrac{1}{2}(u + v)$, with $u \in K$, $v \in K$, and $u \ne v$. For every index i that is not in the set I(x), we have $x_i = 0$, $u_i \ge 0$, $v_i \ge 0$, and $x_i = \tfrac{1}{2}(u_i + v_i)$. This forces $u_i$ and $v_i$ to be zero. Thus, all the nonzero components of u and v correspond to indices i in I(x). Since u and v belong to K,
$$b = Au = \sum_{i=1}^{n} u_i a^{(i)} = \sum_{i \in I(x)} u_i a^{(i)}$$
and
$$b = Av = \sum_{i=1}^{n} v_i a^{(i)} = \sum_{i \in I(x)} v_i a^{(i)}$$
Hence, we obtain
$$\sum_{i \in I(x)} (u_i - v_i)\, a^{(i)} = 0$$
showing the linear dependence of the set $\{a^{(i)} : i \in I(x)\}$. Thus, Statement 2 is false. Consequently, Statement 2 implies Statement 1.
For the converse, assume that Statement 2 is false. From the linear dependence of column vectors $a^{(i)}$ for $i \in I(x)$, we have
$$\sum_{i \in I(x)} y_i a^{(i)} = 0 \qquad \text{with} \qquad \sum_{i \in I(x)} |y_i| \ne 0$$
for appropriate coefficients $y_i$. For each $i \notin I(x)$, let $y_i = 0$. Form the vector y with components $y_i$ for $i = 1, 2, \ldots, n$. Then, for any λ, we see that because $x \in K$,
$$A(x \pm \lambda y) = \sum_{i=1}^{n} (x_i \pm \lambda y_i)\, a^{(i)} = \sum_{i=1}^{n} x_i a^{(i)} \pm \lambda \sum_{i \in I(x)} y_i a^{(i)} = Ax = b$$
Now select the real number λ positive but so small that $x + \lambda y \ge 0$ and $x - \lambda y \ge 0$. [To see that it is possible, consider separately the components for $i \in I(x)$ and $i \notin I(x)$.] The resulting vectors, $u = x + \lambda y$ and $v = x - \lambda y$, belong to K. They differ, and obviously, $x = \tfrac{1}{2}(u + v)$. Thus, x is not a vertex of K; that is, Statement 1 is false. So Statement 1 implies Statement 2. ■
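The theorem gives a mechanical test for vertices: collect the columns of A whose indices lie in I(x) and check whether they are linearly independent. The following is a minimal sketch of that test using exact rational arithmetic. The helper names `rank` and `is_vertex` and the small matrix A are hypothetical and chosen only for illustration; the code assumes that x is already known to be feasible.

```python
from fractions import Fraction


def rank(columns):
    """Rank of the matrix with the given columns, by exact Gaussian elimination."""
    rows = [list(map(Fraction, r)) for r in zip(*columns)]   # transpose to rows
    rk = 0
    for col in range(len(columns)):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue                                         # no pivot in this column
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def is_vertex(A, x):
    """Vertex test of the theorem; assumes Ax = b and x >= 0 already hold."""
    support = [i for i, xi in enumerate(x) if xi > 0]        # the index set I(x)
    cols = [[row[i] for row in A] for i in support]
    return rank(cols) == len(support)                        # linearly independent?


# Hypothetical data in second primal form: identity in the last two columns, b = (4, 6).
A = [[1, 1, 1, 0],
     [1, 3, 0, 1]]
print(is_vertex(A, [4, 0, 0, 2]))   # support {1, 4}: independent columns
print(is_vertex(A, [3, 1, 0, 0]))   # support {1, 2}: independent columns
print(is_vertex(A, [1, 1, 2, 2]))   # four columns in R^2 cannot be independent
```

The third point is feasible but lies in the interior of K, so the test reports that it is not a vertex.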
Given a linear programming problem, there are three possibilities:
1. There are no feasible points; that is, the set K is empty.
2. K is not empty, and cTx is not bounded on K.
3. K is not empty, and cTx is bounded on K.
It is true (but not obvious) that in the third case, there is a point x in K such that $c^T x \ge c^T y$ for all y in K. We have assumed that our problem is in the second primal form so that possibility 1 cannot occur. Indeed, A contains an m × m identity matrix and so has the form
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} & 1 & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & a_{2k} & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mk} & 0 & 0 & \cdots & 1 \end{bmatrix}$$
where k = n − m. Consequently, we can construct a feasible point x easily by setting $x_1 = x_2 = \cdots = x_k = 0$ and $x_{k+1} = b_1$, $x_{k+2} = b_2$, and so on. It is then clear that Ax = b. The inequality $x \ge 0$ follows from our initial assumption that $b \ge 0$.
Simplex Method
Next we present a brief outline of the simplex method for solving linear programming problems. It involves a sequence of exchanges so that the trial solution proceeds systematically from one vertex to another in K. This procedure is stopped when the value of $c^T x$ is no longer increased as a result of the exchange.
The following is an outline of the simplex algorithm.
■ ALGORITHM 1
Simplex
Select a small positive value for ε. In each step, we have a set of m indices $\{k_1, k_2, \ldots, k_m\}$.
1. Put columns $a^{(k_1)}, a^{(k_2)}, \ldots, a^{(k_m)}$ into B, and solve Bx = b.
2. If $x_i > 0$ for $1 \le i \le m$, continue. Otherwise, exit because the algorithm has failed.
3. Set $e = [c_{k_1}, c_{k_2}, \ldots, c_{k_m}]^T$, and solve $B^T y = e$.
4. Choose any s in $\{1, 2, \ldots, n\}$ but not in $\{k_1, k_2, \ldots, k_m\}$ for which $c_s - y^T a^{(s)}$ is greatest.
5. If $c_s - y^T a^{(s)} < \varepsilon$, exit because x is the solution.
6. Solve $Bz = a^{(s)}$.
7. If $z_i \le \varepsilon$ for $1 \le i \le m$, then exit because the objective function is unbounded on K.
8. Among the ratios $x_i / z_i$ that have $z_i > 0$ for $1 \le i \le m$, let $x_r / z_r$ be the smallest. In case of a tie, let r be the first occurrence.
9. Replace $k_r$ by s, and go to step 1.
A few remarks on this algorithm are in order. In the beginning, select the indices $k_1, k_2, \ldots, k_m$ such that $a^{(k_1)}, a^{(k_2)}, \ldots, a^{(k_m)}$ form an m × m identity matrix. At step 5, where we say that x is a solution, we mean that the vector $v = (v_i)$ given by $v_{k_i} = x_i$ for $1 \le i \le m$ and $v_i = 0$ for $i \notin \{k_1, k_2, \ldots, k_m\}$ is the solution. A convenient choice for the tolerance ε that occurs in steps 5 and 7 might be $10^{-6}$.
In any reasonable implementation of the simplex method, advantage must be taken of the fact that succeeding occurrences of step 1 are very similar. In fact, only one column of B changes at a time. Similar remarks hold for steps 3 and 6.
We do not recommend that the reader attempt to program the simplex algorithm. Efficient codes, refined over many years of experience, are usually available in software libraries. Many of them can provide solutions to a given problem and to its dual with very little additional computing. Sometimes this feature can be exploited to decrease the execution time of a problem. To see why, consider a linear programming problem in first primal form:
$$\text{(P)} \quad \begin{cases} \text{maximize:} & c^T x \\ \text{constraints:} & Ax \le b \qquad x \ge 0 \end{cases}$$
As usual, we assume that x is an n vector and that A is an m × n matrix. When the simplex algorithm is applied to this problem, it performs an iterative process on an m × m matrix denoted by B in the preceding description. If the number of inequality constraints m is very large relative to n, then the dual problem may be easier to solve, since the B matrices for it will be of dimension n × n. Indeed, the dual problem is
$$\text{(D)} \quad \begin{cases} \text{minimize:} & b^T y \\ \text{constraints:} & A^T y \ge c \qquad y \ge 0 \end{cases}$$
and the number of inequality constraints here is n. An example of this technique appears in the next section.
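Purely to make the nine steps of the outline concrete, and not as a substitute for library codes, here is a small sketch that follows the outline literally. It is written in plain Python with 0-based indices; the helper `solve` (dense Gaussian elimination with partial pivoting), the iteration cap `max_iter`, and the test problem are assumptions introduced for this example. Unlike a serious implementation, it re-solves each linear system from scratch instead of exploiting the fact that only one column of B changes.

```python
def solve(M, rhs):
    """Solve the square system M t = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, M[i])) + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    t = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * t[j] for j in range(i + 1, n))
        t[i] = (aug[i][n] - s) / aug[i][i]
    return t


def simplex(A, b, c, eps=1e-6, max_iter=100):
    """Maximize c^T x subject to Ax = b, x >= 0, with an identity in the last m columns."""
    m, n = len(A), len(A[0])
    k = list(range(n - m, n))                          # initial indices: identity columns
    for _ in range(max_iter):
        B = [[A[i][j] for j in k] for i in range(m)]
        x = solve(B, b)                                # step 1
        if any(xi <= 0 for xi in x):                   # step 2
            raise ArithmeticError("algorithm failed: some x_i is not positive")
        Bt = [list(col) for col in zip(*B)]
        y = solve(Bt, [c[j] for j in k])               # step 3

        def reduced(s):
            return c[s] - sum(y[i] * A[i][s] for i in range(m))

        s = max((j for j in range(n) if j not in k), key=reduced)   # step 4
        if reduced(s) < eps:                           # step 5: expand x to full length
            v = [0.0] * n
            for i, j in enumerate(k):
                v[j] = x[i]
            return v
        z = solve(B, [A[i][s] for i in range(m)])      # step 6
        if all(zi <= eps for zi in z):                 # step 7
            raise ArithmeticError("objective function is unbounded on K")
        r = min((i for i in range(m) if z[i] > 0),     # step 8 (min keeps the first tie)
                key=lambda i: x[i] / z[i])
        k[r] = s                                       # step 9
    raise RuntimeError("iteration limit reached")


# Hypothetical test problem: maximize 3*x1 + 2*x2 with two slack variables.
A = [[1, 1, 1, 0],
     [1, 3, 0, 1]]
b = [4, 6]
c = [3, 2, 0, 0]
v = simplex(A, b, c)
print(v, sum(ci * vi for ci, vi in zip(c, v)))
```

For this test problem the iteration moves from the starting vertex (0, 0, 4, 6) to the vertex (4, 0, 0, 2), where the objective value is 12 and no remaining column can increase it.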
Summary
(1) For the second primal form, the set of feasible points is
$$K = \{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}$$
which are the points of K competing to maximize $c^T x$.
(2) For a linear programming problem, there are these possibilities: There are no feasible points, that is, the set K is empty; K is not empty, and $c^T x$ is not bounded on K; K is not empty, and $c^T x$ is bounded on K.
(3) Denote by $a^{(1)}, a^{(2)}, \ldots, a^{(n)}$ the column vectors constituting the matrix A. Let $x \in K$ and define $I(x) = \{i : x_i > 0\}$. Then x is a vertex of K if and only if the set $\{a^{(i)} : i \in I(x)\}$ is linearly independent.
(4) The simplex method involves a sequence of exchanges so that the trial solution proceeds systematically from one vertex to another in the set of feasible points K . This procedure is stopped when the value of cTx is no longer increased as a result of exchanges.
Problems 17.2
ᵃ1. Show that the linear programming problem
$$\begin{cases} \text{maximize:} & c^T x \\ \text{constraints:} & Ax \le b \end{cases}$$
can be put into first primal form by increasing the number of variables by just one. Hint: Replace $x_j$ by $y_j - y_0$.
ᵃ2. Show that the set K can have only a finite number of vertices.
3. Suppose that u and v are solution points for a linear programming problem and that $x = \tfrac{1}{2}(u + v)$. Show that x is also a solution.
4. Using the simplex method as described, solve the numerical example in the text.
ᵃ5. Using standard manipulations, put the dual problem (D) into first and second primal forms.
ᵃ6. Show how a code for solving a linear programming problem in first primal form can be used to solve a system of n linear equations in n variables.
7. Using standard techniques, put the dual problem (D) into first primal form (P); then take the dual of it. What is the result?
Computer Problems 17.2
1. Select a linear programming code from your computing center library and use it to solve these problems:
a.
$$\begin{cases} \text{minimize:} & 8x_1 + 6x_2 + 6x_3 + 9x_4 \\ \text{constraints:} & \begin{cases} x_1 + 2x_2 + x_4 \ge 2 \\ 3x_1 + x_2 + x_4 \ge 4 \\ x_3 + x_4 \ge 1 \\ x_1 + x_3 \ge 1 \\ x_1 \ge 0 \quad x_2 \ge 0 \quad x_3 \ge 0 \quad x_4 \ge 0 \end{cases} \end{cases}$$
ᵃb.
$$\begin{cases} \text{minimize:} & 10x_1 - 5x_2 - 4x_3 + 7x_4 + x_5 \\ \text{constraints:} & \begin{cases} 4x_1 - 3x_2 - x_3 + 4x_4 + x_5 = 1 \\ -x_1 + 2x_2 + 2x_3 + x_4 + 3x_5 = 4 \\ x_1 \ge 0 \quad x_2 \ge 0 \quad x_3 \ge 0 \quad x_4 \ge 0 \quad x_5 \ge 0 \end{cases} \end{cases}$$
ᵃc.
$$\begin{cases} \text{maximize:} & 2x_1 + 4x_2 + 3x_3 \\ \text{constraints:} & \begin{cases} 4x_1 + 2x_2 + 3x_3 \le 15 \\ 3x_1 + 2x_2 + x_3 \le 7 \\ x_1 + x_2 + 2x_3 \le 6 \\ x_1 \ge 0 \quad x_2 \ge 0 \quad x_3 \ge 0 \end{cases} \end{cases}$$
2. (Student research project) Investigate recent developments in computational linear programming algorithms, especially by interior-point methods.
17.3 Approximate Solution of Inconsistent Linear Systems
Linear programming can be used for the approximate solution of systems of linear equations that are inconsistent. An m × n system of equations
$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m)$$
is said to be inconsistent if there is no vector $x = [x_1, x_2, \ldots, x_n]^T$ that simultaneously satisfies all m equations in the system. For instance, the system
$$\begin{cases} 2x_1 + 3x_2 = 4 \\ x_1 - x_2 = 2 \\ x_1 + 2x_2 = 7 \end{cases} \tag{1}$$
is inconsistent, as can be seen by attempting to carry out the Gaussian elimination process.
l1 Problem
Since no vector x can solve an inconsistent system of equations, the residuals
$$r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i \qquad (1 \le i \le m)$$
cannot be made to vanish simultaneously. Hence, $\sum_{i=1}^{m} |r_i| > 0$. Now it is natural to ask for an x vector that renders the expression $\sum_{i=1}^{m} |r_i|$ as small as possible. This problem is called the l1 problem for this system of equations. Other criteria, leading to different approximate solutions, might be to minimize $\sum_{i=1}^{m} r_i^2$ or $\max_{1 \le i \le m} |r_i|$. Chapter 12 discusses in detail the problem of minimizing $\sum_{i=1}^{m} r_i^2$.
The minimization of $\sum_{i=1}^{m} |r_i|$ by appropriate choice of the x vector is a problem for which special algorithms have been designed (see Barrodale and Roberts [1974]). However, if one of these special programs is not available or if the problem is small in scope, linear programming can be used.
A simple, direct restatement of the problem is
$$\begin{cases} \text{minimize:} & \displaystyle\sum_{i=1}^{m} \varepsilon_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n} a_{ij} x_j - b_i \le \varepsilon_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n} a_{ij} x_j + b_i \le \varepsilon_i & (1 \le i \le m) \end{cases} \end{cases} \tag{2}$$
If a linear programming code is at hand in which the variables are not required to be nonnegative, then it can be used on Problem (2). If the variables must be nonnegative, the following technique can be applied. Introduce a variable $y_{n+1}$, and write $x_j = y_j - y_{n+1}$. Then define $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$. This step creates an additional column in the matrix A. Now consider the linear programming problem
$$\begin{cases} \text{maximize:} & \displaystyle -\sum_{i=1}^{m} \varepsilon_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le -b_i & (1 \le i \le m) \\ y \ge 0 \quad \varepsilon \ge 0 \end{cases} \end{cases} \tag{3}$$
which is in first primal form with m + n + 1 variables and 2m inequality constraints.
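The effect of the additional column can be checked numerically. The sketch below builds the extra column $a_{i,n+1} = -\sum_j a_{ij}$ for the coefficient matrix of System (1) and confirms, for one sample vector, that the sum over n + 1 columns applied to y equals the original sum applied to x. The function name `augment`, the sample vector x, and the particular shift used for $y_{n+1}$ are assumptions made for illustration.

```python
def augment(A):
    """Append the column a_{i,n+1} = -(a_i1 + ... + a_in) to each row of A."""
    return [list(row) + [-sum(row)] for row in A]


A = [[2, 3], [1, -1], [1, 2]]            # coefficient matrix of System (1)
A_aug = augment(A)

x = [2, -7]                              # sample vector with a negative component
shift = max(0, -min(x))                  # one possible value of y_{n+1}
y = [xj + shift for xj in x] + [shift]   # y_j = x_j + y_{n+1}, so every y_j >= 0

lhs = [sum(a * yj for a, yj in zip(row, y)) for row in A_aug]   # n + 1 columns, y
rhs = [sum(a * xj for a, xj in zip(row, x)) for row in A]       # n columns, x
print(A_aug)
print(y, lhs, rhs)
```

Because the two lists of row sums coincide, any constraint written in terms of x can be rewritten in terms of the nonnegative vector y without changing its meaning.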
It is not hard to verify that Problem (3) is equivalent to Problem (2). The main point is that
$$\sum_{j=1}^{n+1} a_{ij} y_j = \sum_{j=1}^{n} a_{ij}(x_j + y_{n+1}) + a_{i,n+1}\, y_{n+1} = \sum_{j=1}^{n} a_{ij} x_j + y_{n+1} \sum_{j=1}^{n} a_{ij} - y_{n+1} \sum_{j=1}^{n} a_{ij} = \sum_{j=1}^{n} a_{ij} x_j$$
Another technique can be used to replace the 2m inequality constraints in Problem (3) by a set of m equality constraints. We write
$$\varepsilon_i = |r_i| = u_i + v_i$$
where $u_i = r_i$ and $v_i = 0$ if $r_i \ge 0$ but $v_i = -r_i$ and $u_i = 0$ if $r_i < 0$. The resulting linear programming problem is
$$\begin{cases} \text{maximize:} & \displaystyle -\sum_{i=1}^{m} u_i - \sum_{i=1}^{m} v_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - u_i + v_i = b_i & (1 \le i \le m) \\ u \ge 0 \quad v \ge 0 \quad y \ge 0 \end{cases} \end{cases}$$
Using the preceding formulas, we have
$$r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i = \sum_{j=1}^{n} a_{ij}(y_j - y_{n+1}) - b_i = \sum_{j=1}^{n} a_{ij} y_j - y_{n+1} \sum_{j=1}^{n} a_{ij} - b_i = \sum_{j=1}^{n+1} a_{ij} y_j - b_i = u_i - v_i$$
From it, we conclude that $r_i + v_i = u_i \ge 0$. Now $v_i$ and $u_i$ should be as small as possible, consistent with this restriction, because we are attempting to minimize $\sum_{i=1}^{m} (u_i + v_i)$. So if $r_i \ge 0$, we take $v_i = 0$ and $u_i = r_i$, whereas if $r_i < 0$, we take $v_i = -r_i$ and $u_i = 0$. In either case, $|r_i| = u_i + v_i$. Thus, minimizing $\sum_{i=1}^{m} (u_i + v_i)$ is the same as minimizing $\sum_{i=1}^{m} |r_i|$.
The example of the inconsistent linear system given by (1) could be solved in the l1 sense by solving the linear programming problem
$$\begin{cases} \text{minimize:} & u_1 + v_1 + u_2 + v_2 + u_3 + v_3 \\ \text{constraints:} & \begin{cases} 2y_1 + 3y_2 - 5y_3 - u_1 + v_1 = 4 \\ y_1 - y_2 - u_2 + v_2 = 2 \\ y_1 + 2y_2 - 3y_3 - u_3 + v_3 = 7 \\ y_1, y_2, y_3 \ge 0 \quad u_1, u_2, u_3 \ge 0 \quad v_1, v_2, v_3 \ge 0 \end{cases} \end{cases} \tag{4}$$
The solution is
$$\begin{array}{lll} u_1 = 0 & v_1 = 0 & y_1 = 2 \\ u_2 = 0 & v_2 = 0 & y_2 = 0 \\ u_3 = 0 & v_3 = 5 & y_3 = 0 \end{array}$$
From it, we recover the l1 solution of System (1) in the form
$$\begin{array}{ll} x_1 = y_1 - y_3 = 2 & r_1 = u_1 - v_1 = 0 \\ x_2 = y_2 - y_3 = 0 & r_2 = u_2 - v_2 = 0 \\ & r_3 = u_3 - v_3 = -5 \end{array}$$
We can use mathematical software systems such as Matlab, Maple, or Mathematica to solve this linear programming problem. For example, we obtain $u_1 = v_1 = u_2 = v_2 = u_3 = y_2 = y_3 = 0$, $v_3 = 5$, and $y_1 = 2$, with 5 as the value of the objective function. For another system, we need to set the equality constraints. We obtain the solution corresponding to $y_1 = y_2 = y_3 = 684.2887$, $u_1 = u_2 = u_3 = v_1 = v_2 = 0$, and $v_3 = 5$ with 5 as the value of the objective function. The x vector is $x_1 = 2$ and $x_2 = 3.1494 \times 10^{-11}$. This solution is slightly different from the one previously obtained, owing to roundoff errors, but the minimum value for the objective function is the same and all the constraints are satisfied.
l∞ Problem
Consider again a system of m linear equations in n unknowns:
$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m)$$
If the system is inconsistent, we know that the residuals $r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i$ cannot all be zero for any x vector. So the quantity $\varepsilon = \max_{1 \le i \le m} |r_i|$ is positive. The problem of making ε a minimum is called the l∞ problem for the system of equations. An equivalent linear programming problem is
$$\begin{cases} \text{minimize:} & \varepsilon \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le -b_i & (1 \le i \le m) \end{cases} \end{cases}$$
If a linear programming code is available in which the variables need not be greater than or equal to zero, then it can be used to solve the l∞ problem as formulated above. If the variables must be nonnegative, we first introduce a variable $y_{n+1}$ so large that the quantities $y_j = x_j + y_{n+1}$ are positive. Next, we solve the linear programming problem
$$\begin{cases} \text{minimize:} & \varepsilon \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le -b_i & (1 \le i \le m) \\ \varepsilon \ge 0 \quad y_j \ge 0 & (1 \le j \le n+1) \end{cases} \end{cases} \tag{5}$$
Here, we have again defined $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$.
For our System (1), the solution that minimizes the quantity
$$\max\{|2x_1 + 3x_2 - 4|,\ |x_1 - x_2 - 2|,\ |x_1 + 2x_2 - 7|\}$$
is obtained from the linear programming problem
$$\begin{cases} \text{minimize:} & \varepsilon \\ \text{constraints:} & \begin{cases} 2y_1 + 3y_2 - 5y_3 - \varepsilon \le 4 \\ y_1 - y_2 - \varepsilon \le 2 \\ y_1 + 2y_2 - 3y_3 - \varepsilon \le 7 \\ -2y_1 - 3y_2 + 5y_3 - \varepsilon \le -4 \\ -y_1 + y_2 - \varepsilon \le -2 \\ -y_1 - 2y_2 + 3y_3 - \varepsilon \le -7 \\ y_1, y_2, y_3 \ge 0 \quad \varepsilon \ge 0 \end{cases} \end{cases} \tag{6}$$
The solution is
$$y_1 = \tfrac{8}{9} \qquad y_2 = \tfrac{5}{3} \qquad y_3 = 0 \qquad \varepsilon = \tfrac{25}{9}$$
From it, the l∞ solution of (1) is recovered as follows:
$$x_1 = y_1 - y_3 = \tfrac{8}{9} \qquad x_2 = y_2 - y_3 = \tfrac{5}{3}$$
We can use mathematical software systems such as Matlab, Maple, or Mathematica to solve the linear programming problem (6). For example, we obtain the solution $y_1 = \tfrac{8}{9}$, $y_2 = \tfrac{5}{3}$, $y_3 = 0$, and $\varepsilon = \tfrac{25}{9}$ from two of these systems. But for one of the mathematical systems, we obtain the solution corresponding to $y_1 = 1.0423 \times 10^3$, $y_2 = 1.0431 \times 10^3$, $y_3 = 1.0414 \times 10^3$, and $\varepsilon = 2.778$. We do obtain the same results as before: $(0.8889, 1.6667) \approx \left(\tfrac{8}{9}, \tfrac{5}{3}\right)$.
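The two approximate solutions of System (1) quoted above can be cross-checked without a linear programming code. The sketch below is not the linear programming approach of this section: for the l1 problem it uses brute force, relying on the standard fact that when A has full column rank some l1 minimizer satisfies n of the equations exactly, so for two unknowns it suffices to try every pair of equations. For the l∞ problem it only evaluates the residuals at the stated solution. All helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

A = [[2, 3], [1, -1], [1, 2]]     # System (1)
b = [4, 2, 7]


def residuals(x):
    """r_i = sum_j a_ij x_j - b_i, computed exactly."""
    return [sum(Fraction(a) * xj for a, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


def solve_pair(i, j):
    """Solve equations i and j exactly by Cramer's rule; None if they are dependent."""
    (a, bb), (c, d) = A[i], A[j]
    det = a * d - bb * c
    if det == 0:
        return None
    return [Fraction(b[i] * d - bb * b[j], det), Fraction(a * b[j] - c * b[i], det)]


def l1_norm(x):
    return sum(abs(r) for r in residuals(x))


# l1 problem: the best point among all pairwise intersections of the three lines.
candidates = [p for p in (solve_pair(i, j) for i, j in combinations(range(3), 2)) if p]
x_l1 = min(candidates, key=l1_norm)
l1_value = l1_norm(x_l1)

# l-infinity problem: evaluate the residuals at the solution stated in the text.
x_linf = [Fraction(8, 9), Fraction(5, 3)]
r_linf = residuals(x_linf)
linf_value = max(abs(r) for r in r_linf)

print(x_l1, residuals(x_l1), l1_value)
print(r_linf, linf_value)
```

At the l∞ point all three residuals have the same magnitude, which is typical of a minimax solution; at the l1 point two residuals vanish and the third carries the whole error.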
In problems like (6), m is often much larger than n. Thus, in accordance with remarks made in Section 17.2, it may be preferable to solve the dual problem because it would have 2m variables but only n + 2 inequality constraints. To illustrate, the dual of Problem (6) is
$$\begin{cases} \text{maximize:} & 4u_1 + 2u_2 + 7u_3 - 4u_4 - 2u_5 - 7u_6 \\ \text{constraints:} & \begin{cases} 2u_1 + u_2 + u_3 - 2u_4 - u_5 - u_6 \le 0 \\ 3u_1 - u_2 + 2u_3 - 3u_4 + u_5 - 2u_6 \le 0 \\ -5u_1 - 3u_3 + 5u_4 + 3u_6 \le 0 \\ -u_1 - u_2 - u_3 - u_4 - u_5 - u_6 \ge -1 \\ u_i \ge 0 \quad (1 \le i \le 6) \end{cases} \end{cases}$$
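As a plausibility check on this dual, the sketch below evaluates one dual-feasible point exactly and compares its objective value with the primal optimum ε = 25/9. It assumes that the first three dual constraints are "≤ 0" and the last is "≥ −1". The candidate point u is not taken from the text; it was worked out by hand for this illustration.

```python
from fractions import Fraction as F

# Coefficients of u1..u6 in the four dual constraints.
G = [[2, 1, 1, -2, -1, -1],
     [3, -1, 2, -3, 1, -2],
     [-5, 0, -3, 5, 0, 3],
     [-1, -1, -1, -1, -1, -1]]
d = [4, 2, 7, -4, -2, -7]          # objective coefficients of the dual

# Hypothetical candidate point (an assumption of this example).
u = [F(0), F(1, 9), F(5, 9), F(1, 3), F(0), F(0)]

lhs = [sum(g * ui for g, ui in zip(row, u)) for row in G]
feasible = (all(ui >= 0 for ui in u)
            and all(value <= 0 for value in lhs[:3])
            and lhs[3] >= -1)
dual_value = sum(di * ui for di, ui in zip(d, u))

print(lhs, feasible, dual_value)
```

Since this feasible point attains the value 25/9, which equals the minimum of Problem (6), duality indicates that it is an optimal solution of the dual under the stated assumptions.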
The three types of approximate solution that have been discussed (for an overdetermined system of linear equations) are useful in different situations. Broadly speaking, an l∞ solution is preferred when the data are known to be accurate. An l2 solution is preferred when the data are contaminated with errors that are believed to conform to the normal probability distribution. The l1 solution is often used when data are suspected of containing wild points—points that result from gross errors, such as the incorrect placement of a decimal point. Additional information can be found in Rice and White [1964]. The l2 problem is discussed in Chapter 12 also.
Summary
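(1) We consider an inconsistent system of m linear equations in n unknowns
$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m)$$
For the residuals $r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i$, the l1 problem for this system is to minimize the expression $\sum_{i=1}^{m} |r_i|$. A direct restatement of the problem is
$$\begin{cases} \text{minimize:} & \displaystyle\sum_{i=1}^{m} \varepsilon_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n} a_{ij} x_j - b_i \le \varepsilon_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n} a_{ij} x_j + b_i \le \varepsilon_i & (1 \le i \le m) \end{cases} \end{cases}$$
where $\varepsilon_i = |r_i|$. If the variables must be nonnegative, we introduce a variable $y_{n+1}$ and write $x_j = y_j - y_{n+1}$. Define $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$; an equivalent linear programming problem is
$$\begin{cases} \text{maximize:} & \displaystyle -\sum_{i=1}^{m} \varepsilon_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le -b_i & (1 \le i \le m) \\ y \ge 0 \quad \varepsilon \ge 0 \end{cases} \end{cases}$$
which is in first primal form with m + n + 1 variables and 2m inequality constraints.
(2) Another technique is to replace the 2m inequality constraints by a set of m equality constraints. We write $\varepsilon_i = |r_i| = u_i + v_i$, where $u_i = r_i$ and $v_i = 0$ if $r_i \ge 0$ but $v_i = -r_i$ and $u_i = 0$ if $r_i < 0$. The resulting linear programming problem is
$$\begin{cases} \text{maximize:} & \displaystyle -\sum_{i=1}^{m} u_i - \sum_{i=1}^{m} v_i \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - u_i + v_i = b_i & (1 \le i \le m) \\ u \ge 0 \quad v \ge 0 \quad y \ge 0 \end{cases} \end{cases}$$
(3) For an inconsistent system, the problem of making $\varepsilon = \max_{1 \le i \le m} |r_i|$ a minimum is the l∞ problem for the system. An equivalent linear programming problem is
$$\begin{cases} \text{minimize:} & \varepsilon \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le -b_i & (1 \le i \le m) \end{cases} \end{cases}$$
If the variables must be nonnegative, we introduce a large variable $y_{n+1}$ so that the quantities $y_j = x_j + y_{n+1}$ are positive and we have an equivalent linear programming problem:
$$\begin{cases} \text{minimize:} & \varepsilon \\ \text{constraints:} & \begin{cases} \displaystyle\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le b_i & (1 \le i \le m) \\ \displaystyle -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le -b_i & (1 \le i \le m) \\ \varepsilon \ge 0 \quad y_j \ge 0 & (1 \le j \le n+1) \end{cases} \end{cases}$$
where we defined $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$.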
Additional References
See Armstrong and Godfrey [1979], Barrodale and Phillips [1975], Barrodale and Roberts [1974], Bartels [1971], Bloomfield and Steiger [1983], Branham [1990], Cärtner [2006], Cooper and Steinberg [1974], Dantzig, Orden, and Wolfe [1963], Huard [1979], Nering and Tucker [1992], Orchard-Hays [1968], Rabinowitz [1968], Roos et al. [1997], Schrijver [1986], Wright [1997], Ye [1997], and Zhang [1995].
Problems 17.3
1. Consider the inconsistent linear system
$$\begin{cases} 5x_1 + 2x_2 = 6 \\ x_1 + x_2 + x_3 = 2 \\ 7x_2 - 5x_3 = 11 \\ 6x_1 + 9x_3 = 9 \end{cases}$$
Write the following with nonnegative variables:
ᵃa. The equivalent linear programming problem for solving the system in the l1 sense.
ᵃb. The equivalent linear programming problem for solving the system in the l∞ sense.
2. (Continuation) Repeat the preceding problem for the system
$$\begin{cases} 3x + y = 7 \\ x - y = 11 \\ x + 6y = 13 \\ -x + 3y = -12 \end{cases}$$
ᵃ3. We want to find a polynomial p of degree n that approximates a function f as well as possible from below; that is, we want $0 \le f - p \le \varepsilon$ for minimum ε. Show how p could be obtained with reasonable precision by solving a linear programming problem.
ᵃ4. To solve the l1 problem for the system of equations
$$\begin{cases} x - y = 4 \\ 2x - 3y = 7 \\ x + y = 2 \end{cases}$$
we can solve a linear programming problem. What is it?
Computer Problems 17.3
ᵃ1. Obtain numerical answers for Parts a and b of Problem 17.3.1.
2. (Continuation) Repeat for Problem 17.3.2.
ᵃ3. Find a polynomial of degree 4 that represents the function $e^x$ in the following sense: Select 20 equally spaced points $x_i$ in interval [0, 1] and require the polynomial to minimize the expression $\max_{1 \le i \le 20} |e^{x_i} - p(x_i)|$. Hint: This is the same as solving 20 equations in five variables in the l∞ sense. The ith equation is $A + Bx_i + Cx_i^2 + Dx_i^3 + Ex_i^4 = e^{x_i}$, and the unknowns are A, B, C, D, and E.
4. Use built-in routines in mathematical software systems such as Matlab, Maple, or Mathematica to solve the linear programming problems with the equation numbers below in first primal form, in second primal form, and in dual form:
a. (4) b. (6)
A Advice on Good Programming Practices
Because the programming of numerical schemes is essential to understanding them, we offer here a few words of advice on good programming practices.
A.1 Programming Suggestions
The suggestions and techniques given here should be considered in context. They are not intended to be complete, and some good programming suggestions have been omitted to keep the discussion brief. Our purpose is to encourage the reader to be attentive to considerations of efficiency, economy, readability, and roundoff errors. Of course, some of these suggestions and admonitions may vary depending on the particular programming language that is being used and features in the language.
Be Careful and Be Correct Strive to write programs carefully and correctly. This is of utmost importance.
Use Pseudocode Before beginning the coding, write out in complete detail the mathematical algorithm to be used in pseudocode such as that used in this text. The pseudocode serves as a bridge between the mathematics and the computer program. It need not be defined in a formal way, as is done for a computer language, but it should contain sufficient detail that the implementation is straightforward. When writing the pseudocode, use a style that is easy to read and understand. For maintainability, it should be easy for a person who is unfamiliar with the code to read it and understand what it does.
Check and Double-Check Check the code thoroughly for errors and omissions before beginning to edit on a computer terminal. Spend time checking the code before running it to avoid executing the program, showing the output, discovering an error, correcting the error, and repeating the process ad nauseam.∗ Modern computing environments may allow the user to accomplish this process in only a few seconds, but this advice is still valid if for no other reason than that it is dangerously easy to write programs that may work on a simple test but not on a more complicated one. No function key or mouse can tell you what is wrong!
∗In 1962, the rocket carrying the Mariner I space probe to Venus went off course after only five minutes of flight and was destroyed. An investigation revealed that a single line of faulty Fortran code caused the disaster. A period was typed in the code DO 5 I=1,3 instead of the comma, resulting in the loop being executed once instead of three times. It has been estimated that this single typographical error cost the United States National Aeronautics and Space Administration $18.5 million! For additional details, see material available online such as www-aix.gsi.de/~giese/swr/mariner1.html and www-aix.gsi.de/~giese/swr/literatur1.html for a general reference.
Use Test Cases After writing the pseudocode, check and trace through it using pencil-and-paper calculations on a typical yet simple example. Checking boundary cases, such as the values of the first and second iterations in a loop and the processing of the first and last elements in a data structure, will often reveal embarrassing errors. These same sample cases can be used as the first set of test cases on the computer.
Modularize Code Build a program in steps by writing and testing a series of segments (subprograms, procedures, or functions); that is, write self-contained subtasks as separate routines. Try to keep these program segments reasonably small, less than a page whenever possible, to make reading and debugging easier.
Generalize Slightly If the code can be written to handle a slightly more general situation, then in many cases, it is worth the extra effort to do so. A program that was written for only a particular set of numbers must be completely rewritten for another set. For example, only a few additional statements are required to write a program with an arbitrary step size compared with a program in which the step size is fixed numerically. However, one should be careful not to introduce too much generality into the code because it can make a simple programming task overly complicated.
Show Intermediate Results Print out or display intermediate results and diagnostic messages to assist in debugging and understanding the program’s operation. Always echo-print the input data unless it is impractical to do so, such as with a large amount of data. Using the default read and print commands frees the programmer from errors associated with misalignment of data. Fancy output formats are not necessary, but some simple labeling of the output is recommended.
Include Warning Messages A robust program always warns the user of a situation that it is not designed to handle. In general, write programs so that they are easy to debug when the inevitable bug appears.
Use Meaningful Variable Names It is often helpful to assign meaningful names to the variables because they may have greater mnemonic value than single-letter variables. There is perennial confusion between the characters O (letter “oh”) and 0 (number zero) and between l (letter “ell”) and 1 (number one).
Declare All Variables All variables should be listed in type declarations in each program or program segment. Implicit type assignments can be ignored when one writes declaration statements that include all variables used. Historically, in Fortran, variables beginning with I/i, J/j, K/k, L/l, M/m, and N/n are integer variables, and ones beginning with other letters are floating-point real variables. It may be a good idea to adhere to this scheme so that one can immediately recognize the type of a variable without looking it up in the type
declarations. In this book, we present algorithms using pseudocode and therefore do not
always follow this advice.
Include Comments Comments within a routine are helpful for revealing at some later time what the program does. Extensive comments are not necessary, but we recommend that you include a preface to each program or program segment explaining the purpose, the input and output variables, and the algorithm used and that you provide a few comments between major segments of the code. Indent each block of code a consistent number of spaces to improve readability. Inserting blank comment lines and blank spaces can greatly improve the readability of the code as well. To save space, we have not included any comments in the pseudocode in this book.
Use Clean Loops Never put unnecessary statements within loops. Move expressions and variables outside a loop from inside a loop if they do not depend on the loop or do not change. Also, indenting loops can add to the readability of the code, particularly for nested loops. Use a nonexecutable statement as the terminator of a loop so that the code may be altered easily.
Declare Nonchanging Constants Use a parameter statement to assign the values of key constants. Parameter values correspond to constants that do not change throughout the routine. Such parameter statements are easy to change when one wants to rerun the program with different values. Also, they clarify the role key constants play in the code and make the routines more readable and easier to understand.
Use Appropriate Data Structures Use data structures that are natural to the problem at hand. If the problem adapts more easily to a three-dimensional array than to several one-dimensional arrays, then a three-dimensional array should be used.
Use Arrays of All Types The elements of arrays, whether one-, two-, or higher-dimensional, are usually stored in consecutive words of memory. Since the compiler may map the value of an index for two- and higher-subscripted arrays into a single subscript value that is used as a pointer to determine the location of elements in storage, the use of two- and higher-dimensional arrays can be considered a notational convenience for the user. However, any advantage in using only a one-dimensional array and performing complicated subscript calculation is slight. Such matters are best left to the compiler.
Use Built-in Functions In scientific programming languages, many built-in mathematical functions are available for common functions such as sin, log, exp, arcsin, and so on. Also, numeric functions such as integer, real, complex, and imaginary are usually available for type conversion. One should utilize these and others as much as possible. Some of these intrinsic functions accept arguments of more than one type and return a result whose type may vary depending on the type of the argument used. Such functions are called generic functions, for they represent an entire family of related functions. Of course, care should be taken not to use the wrong argument type.
Use Program Libraries For a programming project, a preprogrammed routine from a program library should be used, when applicable, in preference to one that you might write yourself. Such routines can be expected to be state-of-the-art software, well tested, and, of course, completely debugged.
Do Not Overoptimize Students should be primarily concerned with writing readable code that correctly computes the desired results. There are any number of tricks of the trade for making code run faster or more efficiently. Save them for use later on in your programming career. We are primarily concerned with understanding and testing various numerical methods. Do not sacrifice the clarity of a program in an effort to make the code run faster. Clarity of code may be preferable to optimization of code when the two criteria conflict.
Case Studies
We present some case studies that may be helpful.
Computing Sums When a long list of floating-point numbers is added in the computer, there will generally be less roundoff error if the numbers are added in order of increasing magnitude. (Roundoff errors are discussed in detail in Chapter 2.)
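As a small illustration of this advice, the following Python sketch (our own example, not part of the pseudocode in this text) adds the same list of numbers in two orders. The list `values` and the helper `running_sum` are names chosen only for this illustration. Python floats are double precision, so each number 1.0e-16 is too small to change a running total that already equals 1.0.

```python
# Sketch: the same numbers added in two different orders.
# One large value and many values that are tiny by comparison.
values = [1.0] + [1.0e-16] * 10_000


def running_sum(numbers):
    """Add the numbers left to right in ordinary double precision."""
    total = 0.0
    for v in numbers:
        total += v
    return total


# Decreasing magnitude: each tiny term is lost when added to 1.0.
largest_first = running_sum(sorted(values, key=abs, reverse=True))

# Increasing magnitude: the tiny terms accumulate before meeting 1.0.
smallest_first = running_sum(sorted(values, key=abs))

print(largest_first)
print(smallest_first)
```

In the first ordering, the small terms contribute nothing to the result; in the second, their combined contribution of about 1.0e-12 survives. When the standard library is available, a routine such as `math.fsum` may be preferable to a hand-written loop.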
Mathematical Constants Some students are surprised to learn that in many programming languages, the computer does not automatically know the values of common mathematical constants such as π and e and must be explicitly told their values. Since it is easy to mistype a long sequence of digits in a mathematical constant, such as the real number π,
pi ← 3.14159 26535 89793
the use of simple calculations involving mathematical functions is recommended. For example, the real numbers π and e can be easily and safely entered with nearly full machine precision by using standard intrinsic functions such as
pi ← 4.0 arctan(1.0)
e ← exp(1.0)
Another reason for this advice is to avoid the problem that arises if one uses a short approximation such as pi ← 3.14159 on a computer with limited precision but later moves the code to another computer that has more precision. If you overlook changing this assignment statement, then all results that depend on this value will be less accurate than they should be.
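The two assignment statements above carry over directly to most languages. A minimal Python sketch follows; the variable `short_pi` is our own addition, included only to show how much accuracy a six-digit approximation gives up.

```python
import math

# Constants obtained from intrinsic functions rather than typed digits.
pi = 4.0 * math.atan(1.0)
e = math.exp(1.0)

# A short hand-typed approximation, for comparison only.
short_pi = 3.14159

print(pi, e)
print(abs(pi - math.pi))        # discrepancy of the computed value
print(abs(short_pi - math.pi))  # discrepancy of the short approximation
```

Python also supplies `math.pi` and `math.e` directly, so in that language the computed forms serve mainly as a check.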
Exponents In coding for the computer, exercise some care in writing statements that involve exponents. The general function $x^y$ is computed on many computers as $\exp(y \ln x)$ whenever y is not an integer. Sometimes this is unnecessarily complicated and may contribute to roundoff errors. For example, it is preferable to write code with integer exponents such as 5 rather than 5.0. Similarly, using exponents such as $\frac{1}{2}$ or 0.5 is not recommended because the built-in function sqrt may be used.
There is rarely any need for a calculation such as j ← $(-1)^k$ because there are better ways of obtaining the same result. For example, in a loop, we can write j ← 1 before the loop and j ← −j inside the loop.
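These suggestions can be sketched in Python as follows. The function `alternating_harmonic` is a hypothetical example of ours; it sums the series 1 − 1/2 + 1/3 − ··· by flipping a sign variable inside the loop instead of evaluating a power of −1.

```python
import math


def alternating_harmonic(n_terms):
    """Partial sum 1 - 1/2 + 1/3 - ... with a sign flip instead of (-1)**k."""
    total = 0.0
    sign = 1.0                 # j <- 1 before the loop
    for k in range(1, n_terms + 1):
        total += sign / k
        sign = -sign           # j <- -j inside the loop
    return total


x = 3.0
fifth_power = x ** 5           # integer exponent 5, not 5.0
root = math.sqrt(x)            # built-in sqrt, not x ** 0.5

print(alternating_harmonic(100_000), math.log(2.0))
print(fifth_power, root)
```

The partial sums approach ln 2 slowly, which makes the series a convenient check that the sign is alternating as intended.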
Avoid Mixed Mode In general, one should avoid mixing real and integer expressions in the computer code. Mixed expressions are formulas in which variables and constants of
different types appear together. If the floating-point form of an integer variable is needed, use a function such as real. Similarly, a function such as integer is generally available for obtaining the integer part of a real variable. In other words, use the intrinsic type conversion functions whenever converting from complex to real, real to integer, or vice versa. For example, in floating-point calculations, m/n should be coded as real(m)/real(n) when m and n are integer variables so that it computes the correct real value of m/n. Similarly, 1/m should be coded as 1.0/real(m) and 1/2 as 0.5 and so on.
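How a language treats m/n for integer operands varies, so it is worth checking rather than assuming. As one illustration, in Python 3 the operator `//` performs integer (floor) division while `/` always returns a floating-point result. The sketch below makes the intended type explicit in each case; the variable names are ours.

```python
m, n = 7, 2

integer_quotient = m // n              # integer division: 3
real_quotient = float(m) / float(n)    # explicit conversion: 3.5

reciprocal = 1.0 / float(m)            # 1/m coded with real operands
half = 0.5                             # 1/2 written as a real constant

truncated = int(3.99)                  # integer part of a real value: 3

print(integer_quotient, real_quotient, reciprocal, half, truncated)
```

The explicit calls to `float` are redundant in Python 3 but harmless, and they document the programmer's intent in the same way that real(m)/real(n) does in the pseudocode.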
Precision In the usual mode of representing numbers in a computer, one word of storage is used for each number. This mode of representation is called single precision. In calculations that require greater precision (called double precision or extended precision), it is possible to allot two or more words of storage to each number. On a 32-bit computer, approximately seven decimal places of precision can be obtained in single precision, and approximately 17 decimal places of precision can be obtained in double precision. Double precision is usually more time-consuming than single precision because it may use software rather than hardware to carry out the arithmetic. However, if more accuracy is needed than single precision can provide, then double or extended precision should be used. This is particularly true on computers with limited precision, such as a 32-bit computer, on which roundoff errors can quickly accumulate in long computations and reduce the accuracy to only three or four decimal places! (This topic is discussed in Chapter 2.)
Usually, two words of memory are used to store the real and imaginary parts of a complex number. Complex variables and arrays must be explicitly declared as being of complex type. Expressions involving variables and constants of complex type are evaluated according to the normal rules of complex arithmetic. Intrinsic functions such as complex, real, and imaginary should be used to convert between real and complex types.
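The difference between the two precisions can be observed directly. The Python sketch below assumes a platform whose floats follow the IEEE 754 standard (64-bit double precision, with a 32-bit single-precision format available through the `struct` module); the helper `to_single` is our own name.

```python
import struct
import sys


def to_single(x):
    """Round a double-precision float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


x = 0.1
single_error = abs(to_single(x) - x)

print(sys.float_info.epsilon)   # spacing of double-precision numbers near 1.0
print(to_single(x))             # 0.1 after rounding to single precision
print(single_error)
```

The value 0.1 cannot be stored exactly in either format, but the single-precision copy already differs from the double-precision one in about the ninth decimal place.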
Memory Fetches When using loops, write the code so that fetches are made from adjacent words in memory. To illustrate, suppose we want to store values in a two-dimensional array $(a_{ij})$ in which the elements of each column are stored in consecutive memory locations. Using i and j loops with the i loop as the innermost one would process elements down the columns. For some programs and computer languages, this detail may be of only secondary concern. However, some computers have immediate access to only a portion or a few pages of memory at a time. In this case, it is advantageous to process the elements of an array so that they are taken from or stored in adjacent memory locations.
When to Avoid Arrays Although the mathematical description of an algorithm may indicate that a sequence of values is computed, thus seeming to imply the need for an array, it is often possible to avoid arrays. (This is especially true if only the final value of a sequence is required.) For example, the theoretical description of Newton’s method (Chapter 3) reads
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
but the pseudocode can be written within a loop simply as
for n = 1 to 10 do
x ← x − f(x)/f′(x)
end for
where x is a real variable and function procedures for f and f ′ have been written. Such an assignment statement automatically effects the replacement of the value of the old x with the new numerical value of x − f (x)/f ′(x).
Limit Iterations In a repetitive algorithm, one should always limit the number of permissible steps by the use of a loop with a control variable. This will prevent endless cycling due to unforeseen problems (e.g., programming errors and roundoff errors). For example, in Newton’s method above, one might write
    d ← f(x)/f′(x)
    while |d| > $\frac{1}{2} \times 10^{-6}$ do
        x ← x − d
        output x
        d ← f(x)/f′(x)
    end while
If the function involves some erratic behavior, there is a danger here in not limiting the number of repetitions. It is better to use a loop with a control variable:
    for n = 1 to n_max do
        d ← f(x)/f′(x)
        x ← x − d
        output n, x
        if |d| ≤ $\frac{1}{2} \times 10^{-6}$ then exit loop
    end for
where n and n_max are integer variables and the value of n_max is an upper bound on the number of desired repetitions. All others are real variables.
Floating-Point Equality The sequence of steps in a routine should not depend on whether two floating-point numbers are equal. Instead, reasonable tolerances should be permitted to allow for floating-point arithmetic roundoff errors. For example, a suitable branching statement for n decimal digits of accuracy might be
    if |x − y| < ε then . . . end if
provided that it is known that x and y have magnitude comparable to 1. Here, x, y, and ε are real variables with $\varepsilon = \frac{1}{2} \times 10^{-n}$. This corresponds to requiring that the absolute error between x and y be less than ε. However, if x and y have very large or small orders of magnitude, then the relative error between x and y would be needed, as in the branching statement
    if |x − y| < ε max{|x|, |y|} then . . . end if
Equal Floating-Point Steps In some situations, notably in solving differential equations (see Chapter 8), a variable t assumes a succession of values equally spaced a distance of h apart along the real line. One way of coding this is
    t ← t0
    output 0, t
    for i = 1 to n do
        ⋮
        t ← t + h
        output i, t
    end for
Here, i and n are integer variables, and t0, t, and h are real variables. An alternative way is
    for i = 0 to n do
        ⋮
        t ← t0 + real(i)h
        output i, t
    end for
In the first pseudocode, n additions occur, each with possible roundoff error. In the second, this situation is avoided but at the added cost of n multiplications. Which is better depends on the particular situation at hand.
Function Evaluations When values of a function at arbitrary points are needed in a program, several ways of coding this are available. For example, suppose values of the function
    f(x) = 2x + ln x − sin x
are needed. A simple approach is to use an assignment statement such as
    y ← 2x + ln(x) − sin(x)
at appropriate places within the program. Here, x and y are real variables. Equivalently, an internal function procedure corresponding to the pseudocode
    f(x) ← 2x + ln(x) − sin(x)
could be evaluated at 2.5 by
    y ← f(2.5)
or whatever value of x is desired. Finally, a function subprogram can be used such as in the following pseudocode:
    real function f(x)
        real x
        f ← 2x + ln(x) − sin(x)
    end function f
Which implementation is best? It depends on the situation at hand. The assignment statement is simple and safe. An internal or external function procedure can be used to avoid duplicating code. A separate external function subprogram is the best way to avoid difficulties that inadvertently occur when someone must insert code into another’s program. In using program library routines, the user may be required to furnish an external function procedure to communicate function values to the library routine. If the external function procedure f is passed as an argument in another procedure, then a special interface must be used to designate it as an external function.
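Several of the preceding case studies can be combined in one short Python sketch: the function f(x) = 2x + ln x − sin x is written as a separate procedure and passed as an argument, only the current iterate is stored, the loop has an upper bound on the number of steps, and the stopping test uses a tolerance rather than equality. The names `newton`, `n_max`, and `eps`, the starting value 1.0, and the mixed absolute/relative form of the test are our own choices for illustration, not a prescription.

```python
import math


def newton(f, fprime, x, n_max=25, eps=0.5e-6):
    """Newton's method with an iteration cap and a tolerance-based stopping test.

    Returns (x, n, converged). No array of iterates is kept.
    """
    for n in range(1, n_max + 1):
        d = f(x) / fprime(x)
        x = x - d
        # Relative test for large |x|, absolute test when |x| is below 1.
        if abs(d) <= eps * max(abs(x), 1.0):
            return x, n, True
    return x, n_max, False


def f(x):
    return 2.0 * x + math.log(x) - math.sin(x)


def fprime(x):
    return 2.0 + 1.0 / x - math.cos(x)


root, steps, ok = newton(f, fprime, 1.0)
print(root, steps, ok)
```

Returning a convergence flag along with the iterate lets the calling program issue a warning when the step limit is reached, in keeping with the advice on warning messages given earlier.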
On Developing Mathematical Software
Fred Krogh [2003] has written a paper listing some of the things he has learned from a career at the Jet Propulsion Laboratory involving the development and writing of mathematical software used in application packages. Some of his helpful hints and random thoughts to remember in code development are as follows: Include internal output in order to see what your algorithm is doing; support debugging by including output at the interfaces; provide detailed error messages; fine-tune your code; provide understandable test cases; verify results with care; take advantage of your mistakes; keep units consistent; test the extremes; the algorithm matters; work on what does work; toss out what does not work; do not give up too soon on ideas for improving or debugging your code; your subconscious is a powerful tool, so learn to use it; test your assumptions; in the comments, keep a dictionary of variables in alphabetical order because it is quite helpful when looking at a code years after it was written; write the user documentation first; know what performance you should expect to get; do not pay too much, but just enough, attention to others; see setbacks as learning opportunities and as the staircase for keeping one’s spirits up; when comparing codes, do not change their features or capabilities in order to make the comparison fair, since you may not fully understand the other person’s code; keep action lists; categorize code features; organize things into groups; the organization of the code may be one of the most important decisions the developer makes; isolate the linear algebra parts of the code in an application package so that the user may make modifications to them; reverse communication is a helpful feature that allows users to leave the code and carry out matrix-vector operations using their own data structures; save and restore variables when the user is allowed to leave the code and return; portability is more important than efficiency.
This is just a random sampling of some of the items in this paper.
B
Representation of Numbers in Different Bases
In this appendix, we review some basic concepts on number representation in different bases.
B.1 Representation of Numbers in Different Bases
We begin with a discussion of general number representation but move quickly to bases 2, 8, and 16, as they are the bases primarily used in computer arithmetic.
The familiar decimal notation for numbers uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. When we write a whole number such as 37294, the individual digits represent coefficients of powers of 10 as follows:
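$$37294 = 4 + 90 + 200 + 7000 + 30000 = 4 \times 10^0 + 9 \times 10^1 + 2 \times 10^2 + 7 \times 10^3 + 3 \times 10^4$$
Thus, in general, a string of digits represents a number according to the formula
$$a_n a_{n-1} \ldots a_2 a_1 a_0 = a_0 \times 10^0 + a_1 \times 10^1 + \cdots + a_{n-1} \times 10^{n-1} + a_n \times 10^n$$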
This takes care of only the positive whole numbers. A number between 0 and 1 is represented by a string of digits to the right of a decimal point. For example, we see that
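$$0.7215 = \frac{7}{10} + \frac{2}{100} + \frac{1}{1000} + \frac{5}{10000} = 7 \times 10^{-1} + 2 \times 10^{-2} + 1 \times 10^{-3} + 5 \times 10^{-4}$$
In general, we have the formula
$$0.b_1 b_2 b_3 \ldots = b_1 \times 10^{-1} + b_2 \times 10^{-2} + b_3 \times 10^{-3} + \cdots$$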
Note that there can be an infinite string of digits to the right of the decimal point; indeed, there must be an infinite string to represent some numbers. For example, we note that
√2 = 1.41421 35623 73095 04880 16887 24209 69 . . .
e = 2.71828 18284 59045 23536 02874 71352 66 . . .
π = 3.14159 26535 89793 23846 26433 83279 50 . . .
ln 2 = 0.69314 71805 59945 30941 72321 21458 17 . . .
1/3 = 0.33333 33333 33333 33333 33333 33333 33 . . .
For a real number of the form
$$(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_{10} = \sum_{k=0}^{n} a_k 10^k + \sum_{k=1}^{\infty} b_k 10^{-k}$$
the integer part is the first summation in the expansion and the fractional part is the second summation. If ambiguity can arise, a number represented in base β is signified by enclosing it in parentheses and adding a subscript β.
Base β Numbers
The foregoing discussion pertains to the usual representation of numbers with base 10. Other bases are also used, especially in computers. For example, the binary system uses 2 as the base, the octal system uses 8, and the hexadecimal system uses 16.
In the octal representation of a number, the digits that are used are 0, 1, 2, 3, 4, 5, 6, and 7. Thus, we see that
$$\begin{aligned}
(21467)_8 &= 7 + 6 \times 8 + 4 \times 8^2 + 1 \times 8^3 + 2 \times 8^4 \\
&= 7 + 8(6 + 8(4 + 8(1 + 8(2)))) \\
&= 9015
\end{aligned}$$
A number between 0 and 1, expressed in octal, is represented with combinations of $8^{-1}$, $8^{-2}$, and so on. For example, we have
$$\begin{aligned}
(0.36207)_8 &= 3 \times 8^{-1} + 6 \times 8^{-2} + 2 \times 8^{-3} + 0 \times 8^{-4} + 7 \times 8^{-5} \\
&= 8^{-5}(3 \times 8^4 + 6 \times 8^3 + 2 \times 8^2 + 7) \\
&= 8^{-5}(7 + 8^2(2 + 8(6 + 8(3)))) \\
&= \frac{15495}{32768} \\
&= 0.47286\,987\ldots
\end{aligned}$$
We shall see presently how to convert easily to decimal form without having to find a common denominator.
If we use another base, say, β, then numbers represented in the β-system look like this:
$$(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_\beta = \sum_{k=0}^{n} a_k \beta^k + \sum_{k=1}^{\infty} b_k \beta^{-k}$$
The digits are 0, 1, . . . , β − 2, and β − 1 in this representation. If β > 10, it is necessary to introduce symbols for 10, 11, . . . , β − 1. The separator between the integer and fractional part is called the radix point, since decimal point is reserved for base-10 numbers.
Conversion of Integer Parts
We now formalize the process of converting a number from one base to another. It is advisable to consider separately the integer and fractional parts of a number. Consider, then, a positive integer N in the number system with base γ:
$$N = (a_n a_{n-1} \ldots a_1 a_0)_\gamma = \sum_{k=0}^{n} a_k \gamma^k$$
Suppose that we wish to convert this to the number system with base β and that the calculations are to be performed in arithmetic with base β. Write N in its nested form:
$$N = a_0 + \gamma(a_1 + \gamma(a_2 + \cdots + \gamma(a_{n-1} + \gamma(a_n)) \cdots))$$
and then replace each of the numbers on the right by its representation in base β. Next, carry out the calculations in β-arithmetic. The replacement of the $a_k$’s and γ by equivalent base-β numbers requires a table showing how each of the numbers 0, 1, . . . , γ − 1 appears in the β-system. Moreover, a base-β multiplication table may be required.
To illustrate this procedure, consider the conversion of the decimal number 3781 to binary form. Using the decimal binary equivalences and longhand multiplication in base 2, we have
$$\begin{aligned}
(3781)_{10} &= 1 + 10(8 + 10(7 + 10(3))) \\
&= (1)_2 + (1\,010)_2((1\,000)_2 + (1\,010)_2((111)_2 + (1\,010)_2(11)_2)) \\
&= (111\,011\,000\,101)_2
\end{aligned}$$
This arithmetic calculation in binary is easy for a computer that operates in binary but tedious for humans.
Another procedure should be used for hand calculations. Write down an equation containing the digits $c_0, c_1, \ldots, c_m$ that we seek:
$$N = (c_m c_{m-1} \ldots c_1 c_0)_\beta = c_0 + \beta(c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots))$$
Next, observe that if N is divided by β, then the remainder in this division is $c_0$, and the quotient is
$$c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots)$$
If this number is divided by β, the remainder is $c_1$, and so on. Thus, we divide repeatedly by β, saving remainders $c_0, c_1, \ldots, c_m$ and quotients.
EXAMPLE 1 Convert the decimal number 3781 to binary form using the division algorithm.
Solution As was indicated above, we divide repeatedly by 2, saving the remainders along the way. Here is the work:
        Quotients    Remainders
    2 ) 3781
    2 ) 1890         1 = c0    ↓̇
    2 )  945         0 = c1
    2 )  472         1 = c2
    2 )  236         0 = c3
    2 )  118         0 = c4
    2 )   59         0 = c5
    2 )   29         1 = c6
    2 )   14         1 = c7
    2 )    7         0 = c8
    2 )    3         1 = c9
    2 )    1         1 = c10
           0         1 = c11
Here, the symbol ↓̇ is used to remind us that the digits $c_i$ are obtained beginning with the digit next to the binary point. Thus, we have
$$(3781.)_{10} = (111\,011\,000\,101.)_2$$
and not the other way around: $(101\,000\,110\,111.)_2 = (2615)_{10}$. ■
EXAMPLE 2 Convert the number $N = (111\,011\,000\,101)_2$ to decimal form by nested multiplication.
Solution
$$\begin{aligned}
N &= 1 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 0 \times 2^3 + 0 \times 2^4 + 0 \times 2^5 + 1 \times 2^6 + 1 \times 2^7 + 0 \times 2^8 + 1 \times 2^9 + 1 \times 2^{10} + 1 \times 2^{11} \\
&= 1 + 2(0 + 2(1 + 2(0 + 2(0 + 2(0 + 2(1 + 2(1 + 2(0 + 2(1 + 2(1 + 2(1))))))))))) \\
&= 3781
\end{aligned}$$
The nested multiplication with repeated multiplication and addition can be carried out on a hand-held calculator more easily than can the previous form with exponentiation. ■
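Examples 1 and 2 can be checked with a short Python sketch. The functions `to_base` and `from_base` are hypothetical names of ours: the first carries out the repeated division by β, saving remainders, and the second carries out the nested multiplication. Python integers hide the base in which the arithmetic is actually performed, so this illustrates the bookkeeping only.

```python
def to_base(n, beta):
    """Digits of the nonnegative integer n in base beta, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, beta)   # quotient and remainder in one step
        digits.append(remainder)         # remainders arrive as c0, c1, c2, ...
    return digits[::-1]                  # reverse so that c0 is written last


def from_base(digits, gamma):
    """Nested multiplication for a digit list (most significant first) in base gamma."""
    value = 0
    for a in digits:
        value = value * gamma + a
    return value


binary = to_base(3781, 2)
print("".join(map(str, binary)))         # digits of Example 1
print(from_base(binary, 2))              # Example 2 recovers the decimal value
print(from_base([2, 1, 4, 6, 7], 8))     # the octal number (21467) in decimal
```

Reversing the list of remainders corresponds to the warning in Example 1 that the first remainder is the digit next to the binary point.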
Another conversion problem exists in going from an integer in base γ to an integer in base β when using calculations in base γ. As before, the unknown coefficients in the equation
$$N = c_0 + c_1 \beta + c_2 \beta^2 + \cdots + c_m \beta^m$$
are determined by a process of successive division, and this arithmetic is carried out in the γ-system. At the end, the numbers $c_k$ are in base γ, and a table of γ-β equivalents is used. For example, we can convert a binary integer into decimal form by repeated division by $(1\,010)_2$ [which equals $(10)_{10}$], carrying out the operations in binary. A table of binary-decimal equivalents is used at the final step. However, since binary division is easy only for computers, we shall develop alternative procedures presently.
Conversion of Fractional Parts
We can convert a fractional number such as (0.372)10 to binary by using a direct yet naive approach as follows:
$$\begin{aligned}
(0.372)_{10} &= 3 \times 10^{-1} + 7 \times 10^{-2} + 2 \times 10^{-3} \\
&= \frac{1}{10}\left(3 + \frac{1}{10}\left(7 + \frac{1}{10}(2)\right)\right) \\
&= \frac{1}{(1\,010)_2}\left((011)_2 + \frac{1}{(1\,010)_2}\left((111)_2 + \frac{1}{(1\,010)_2}(010)_2\right)\right)
\end{aligned}$$
Dividing in binary arithmetic is not straightforward, so we look for easier ways of doing this conversion.
Suppose that x is in the range 0 < x < 1 and that the digits $c_k$ in the representation
$$x = \sum_{k=1}^{\infty} c_k \beta^{-k} = (0.c_1 c_2 c_3 \ldots)_\beta$$
are to be determined. Observe that
$$\beta x = (c_1 . c_2 c_3 c_4 \ldots)_\beta$$
because it is necessary to shift the radix point only when multiplying by base β. Thus, the unknown digit $c_1$ can be described as the integer part of βx. It is denoted by I(βx). The fractional part, $(0.c_2 c_3 c_4 \ldots)_\beta$, is denoted by F(βx). The process is repeated in the following pattern:
    d0 = x
    d1 = F(βd0)    c1 = I(βd0)    ↓̇
    d2 = F(βd1)    c2 = I(βd1)
    etc.
In this algorithm, the arithmetic is carried out in the decimal system.
EXAMPLE 3 Use the preceding algorithm to convert the decimal number $x = (0.372)_{10}$ to binary form.
Solution The algorithm consists in repeatedly multiplying by 2 and removing the integer parts. Here is the work:
    ↓̇          0.372
                  × 2
      c1 = 0    .744
                  × 2
      c2 = 1    .488
                  × 2
      c3 = 0    .976
                  × 2
      c4 = 1    .952
                  × 2
      c5 = 1    .904
                  × 2
      c6 = 1    .808
      etc.
Thus, we have $(0.372)_{10} = (0.010\,111\ldots)_2$. ■
Base Conversion 10 ↔ 8 ↔ 2
Most computers use the binary system (base 2) for their internal representation of numbers. The octal system (base 8) is particularly useful in converting from the decimal system (base 10) to the binary system and vice versa. With base 8, the positional values of the numbers are $8^0 = 1$, $8^1 = 8$, $8^2 = 64$, $8^3 = 512$, $8^4 = 4096$, and so on. Thus, for example, we have
$$\begin{aligned}
(26031)_8 &= 2 \times 8^4 + 6 \times 8^3 + 0 \times 8^2 + 3 \times 8 + 1 \\
&= ((((2)8 + 6)8 + 0)8 + 3)8 + 1 \\
&= 11289
\end{aligned}$$
and
$$\begin{aligned}
(7152.46)_8 &= 7 \times 8^3 + 1 \times 8^2 + 5 \times 8 + 2 + 4 \times 8^{-1} + 6 \times 8^{-2} \\
&= (((7)8 + 1)8 + 5)8 + 2 + 8^{-2}[(4)8 + 6] \\
&= 3690 + \frac{38}{64} \\
&= 3690.59375
\end{aligned}$$
When numbers are converted between decimal and binary form by hand, it is convenient to use octal representation as an intermediate step. In the octal system, the base is 8, and, of course, the digits 8 and 9 are not used. Conversion between octal and decimal proceeds according to the principles already stated. Conversion between octal and binary is especially simple. Groups of three binary digits can be translated directly to octal according to the following table:
    Binary   000  001  010  011  100  101  110  111
    Octal      0    1    2    3    4    5    6    7
This grouping starts at the binary point and proceeds in both directions. Thus, we have
$$(101\,101\,001.110\,010\,100)_2 = (551.624)_8$$
To justify this convenient sleight of hand, we consider, for instance, a fraction expressed in binary form:
$$\begin{aligned}
x &= (0.b_1 b_2 b_3 b_4 b_5 b_6 \ldots)_2 \\
&= b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + b_4 2^{-4} + b_5 2^{-5} + b_6 2^{-6} + \cdots \\
&= (4b_1 + 2b_2 + b_3)8^{-1} + (4b_4 + 2b_5 + b_6)8^{-2} + \cdots
\end{aligned}$$
In the last line of this equation, the parentheses enclose numbers from the set {0, 1, 2, 3, 4, 5, 6, 7} because the $b_i$’s are either 0 or 1. Hence, this must be the octal representation of x.
Conversion of an octal number to binary can be done in a similar manner but in reverse order. It is easy! Just replace each octal digit with the corresponding three binary digits. Thus, for example,
$$(5362.74)_8 = (101\,011\,110\,010.111\,100)_2$$
EXAMPLE 4 What is $(2576.35546\,875)_{10}$ in octal and binary forms?
Solution We convert the original decimal number first to octal and then to binary. For the integer part, we repeatedly divide by 8:
    8 ) 2576
    8 )  322    0    ↓̇
    8 )   40    2
    8 )    5    0
           0    5
Thus, we have
$$2576. = (5020.)_8 = (101\,000\,010\,000.)_2$$
using the rules for grouping binary digits. For the fractional part, we repeatedly multiply by 8
    ↓̇     0.35546875
                  × 8
      2   .84375000
                  × 8
      6   .75000000
                  × 8
      6   .00000000
so that
$$0.35546875 = (0.266)_8 = (0.010\,110\,110)_2$$
Finally, we obtain the result
$$2576.35546\,875 = (101\,000\,010\,000.010\,110\,110)_2$$
Although this approach is longer for this example, we feel that it is easier, in general, and less likely to lead to error because one is working with single-digit numbers most of the time. ■
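The steps of Example 4 can be traced in a Python sketch. The helper names `integer_digits` and `fraction_digits` and the lookup table `OCTAL_TO_BINARY` are our own. Exact rational arithmetic from the standard `fractions` module is used so that the decimal input is not disturbed by binary roundoff before the conversion starts.

```python
from fractions import Fraction

# Each octal digit corresponds to a group of three binary digits.
OCTAL_TO_BINARY = {str(d): format(d, "03b") for d in range(8)}


def integer_digits(n, beta):
    """Repeated division by beta; digits are returned most significant first."""
    digits = []
    while n > 0:
        n, r = divmod(n, beta)
        digits.append(r)
    return digits[::-1] or [0]


def fraction_digits(x, beta, count):
    """Repeated multiplication by beta; first `count` digits after the radix point."""
    digits = []
    for _ in range(count):
        x *= beta
        c = int(x)          # integer part I(beta * x)
        digits.append(c)
        x -= c              # fractional part F(beta * x)
    return digits


x = Fraction("2576.35546875")
whole = int(x)

octal_int = "".join(map(str, integer_digits(whole, 8)))
octal_frac = "".join(map(str, fraction_digits(x - whole, 8, 3)))
binary_int = "".join(OCTAL_TO_BINARY[d] for d in octal_int)
binary_frac = "".join(OCTAL_TO_BINARY[d] for d in octal_frac)

print(octal_int + "." + octal_frac)
print(binary_int + "." + binary_frac)
```

The same `fraction_digits` routine with β = 2 reproduces the work of Example 3 when it is called as `fraction_digits(Fraction("0.372"), 2, 6)`.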
Base 16
Some computers whose word lengths are multiples of 4 use the hexadecimal system (base 16) in which A, B, C, D, E, and F represent 10, 11, 12, 13, 14, and 15, respectively, as given in the following table of equivalences:
    Hexadecimal     0     1     2     3     4     5     6     7
    Binary       0000  0001  0010  0011  0100  0101  0110  0111
    Hexadecimal     8     9     A     B     C     D     E     F
    Binary       1000  1001  1010  1011  1100  1101  1110  1111
Conversion between binary numbers and hexadecimal numbers is particularly easy. We need only regroup the binary digits from groups of three to groups of four. For example, we have
$$(010\,101\,110\,101\,101)_2 = (0010\,1011\,1010\,1101)_2 = (2\mathrm{BAD})_{16}$$
$$(111\,101\,011\,110\,010.110\,010\,011\,110)_2 = (0111\,1010\,1111\,0010.1100\,1001\,1110)_2 = (7\mathrm{AF}2.\mathrm{C}9\mathrm{E})_{16}$$
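The regrouping into fours can be sketched in Python as follows. The function `binary_to_hex` is a hypothetical helper of ours; it pads the integer part with zeros on the left and the fractional part with zeros on the right so that both split evenly into groups of four digits.

```python
def binary_to_hex(bits):
    """Regroup a binary string (optionally with a radix point) into hexadecimal digits."""
    whole, _, frac = bits.partition(".")
    # Round each length up to a multiple of 4 and pad with zeros.
    whole = whole.zfill(-(-len(whole) // 4) * 4)        # pad on the left
    frac = frac.ljust(-(-len(frac) // 4) * 4, "0")      # pad on the right

    def groups(s):
        # Translate each group of four binary digits into one hexadecimal digit.
        return "".join(format(int(s[i:i + 4], 2), "X") for i in range(0, len(s), 4))

    return groups(whole) + ("." + groups(frac) if frac else "")


print(binary_to_hex("010101110101101"))
print(binary_to_hex("111101011110010.110010011110"))
```

Padding on the correct side matters: zeros added to the left of the integer part or to the right of the fractional part do not change the value of the number.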
More Examples
Continuing with more examples, let us convert (0.276)8 , (0.C8)16 , and (492)10 into different number systems. We show one way for each number and invite the reader to work out the details for other ways and to verify the answers by converting them back into the original base.
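$$\begin{aligned}
(0.276)_8 &= 2 \times 8^{-1} + 7 \times 8^{-2} + 6 \times 8^{-3} \\
&= 8^{-3}[((2)8 + 7)8 + 6] \\
&= (0.37109\,375)_{10}
\end{aligned}$$
$$\begin{aligned}
(0.\mathrm{C}8)_{16} &= (0.110\,010)_2 \\
&= (0.62)_8 \\
&= 6 \times 8^{-1} + 2 \times 8^{-2} \\
&= 8^{-2}[(6)8 + 2] \\
&= (0.78125)_{10}
\end{aligned}$$
and
$$\begin{aligned}
(492)_{10} &= (754)_8 \\
&= (111\,101\,100)_2 \\
&= (1\mathrm{EC})_{16}
\end{aligned}$$
because
    8 ) 492
    8 )  61    4    ↓̇
    8 )   7    5
          0    7
Summary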
(1) It might seem that there are several different procedures for converting between number systems. Actually, there are only two basic techniques. The first procedure for converting the number $(N)_\gamma$ to base β can be outlined as follows:
• Express $(N)_\gamma$ in nested form using powers of γ.
• Replace each digit by the corresponding base-β numbers.
• Carry out the indicated arithmetic in base β.
This outline holds whether N is an integer or a fraction. The second procedure is either the divide-by-β and remainder-quotient-split process for N an integer or the multiply-by-β and integer-fraction-split process for N a fraction. The first procedure is preferred when γ < β and the second when γ > β. Of course, the 10 ↔ 8 ↔ 2 ↔ 16 base conversion procedure should be used whenever possible because it is the easiest way to convert numbers between the decimal, octal, binary, or hexadecimal systems.
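The two techniques summarized above can be sketched in Python as follows. This is a minimal illustration, not a routine from the text: the function names `int_to_base`, `frac_to_base`, and `nested_value` are hypothetical, exact rational arithmetic (`fractions.Fraction`) is assumed so that no roundoff enters, and the arithmetic is carried out in ordinary integer and rational arithmetic rather than in base β.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"

def int_to_base(n, beta):
    """Divide-by-beta process: the remainders, read last to first,
    are the base-beta digits of the nonnegative integer n."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, beta)      # remainder-quotient split
        digits.append(DIGITS[r])
    return "".join(reversed(digits))

def frac_to_base(x, beta, max_digits=20):
    """Multiply-by-beta process for a fraction 0 <= x < 1: the integer
    parts, read first to last, are the base-beta digits."""
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= beta
        d = int(x)                  # integer-fraction split
        digits.append(DIGITS[d])
        x -= d
    return "0." + "".join(digits)

def nested_value(digits, gamma):
    """Nested form for the fraction (0.d1 d2 ... dk) in base gamma:
    gamma^(-k) * [((d1) gamma + d2) gamma + ... + dk]."""
    acc = 0
    for ch in digits:
        acc = acc * gamma + DIGITS.index(ch)
    return Fraction(acc, gamma ** len(digits))

octal_492 = int_to_base(492, 8)
octal_frac = frac_to_base(Fraction("0.35546875"), 8)
value_276 = nested_value("276", 8)
```

The `max_digits` cutoff is needed because many fractions, such as 1/5 in base 2, have no finite expansion in the target base.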
Problems B.1
1. Find the binary representation and check by reconverting to decimal representation.
   ᵃa. $e \approx (2.718)_{10}$   b. $\frac{7}{8}$   c. $(592)_{10}$
2. Convert the following decimal numbers to octal numbers.
   a. 27.1   b. 12.34   c. 3.14   d. 23.58   e. 75.232   f. 57.321
ᵃ3. Convert to hexadecimal, to octal, and then to decimal.
   ᵃa. $(110111001.101011101)_2$   b. $(1001100101.01101)_2$
ᵃ4. Convert the following numbers:
   a. $(100101101)_2 = (\quad)_8 = (\quad)_{10}$
   b. $(0.782)_{10} = (\quad)_8 = (\quad)_2$
   ᵃc. $(47)_{10} = (\quad)_8 = (\quad)_2$
   d. $(0.47)_{10} = (\quad)_8 = (\quad)_2$
   ᵃe. $(51)_{10} = (\quad)_8 = (\quad)_2$
   f. $(0.694)_{10} = (\quad)_8 = (\quad)_2$
   ᵃg. $(110011.1110101101101)_2 = (\quad)_8 = (\quad)_{10}$
   h. $(361.4)_8 = (\quad)_2 = (\quad)_{10}$
5. Convert $(45653.127664)_8$ to binary and to decimal.
ᵃ6. Convert $(0.4)_{10}$ first to octal and then to binary. Check by converting directly to binary.
7. Prove that the decimal number $\frac{1}{5}$ cannot be represented by a finite expansion in the binary system.
8. Do you expect your computer to calculate $3 \times \frac{1}{3}$ with infinite precision? What about $2 \times \frac{1}{2}$ or $10 \times \frac{1}{10}$?
ᵃ9. Explain the algorithm for converting an integer in base 10 to one in base 2, assuming that the calculations will be performed in binary arithmetic. Illustrate by converting $(479)_{10}$ to binary.
10. Justify mathematically the conversion between binary and hexadecimal numbers by regrouping.
11. Justify for integers the rule given for the conversion between octal and binary numbers.
ᵃ12. Prove that a real number has a finite representation in the binary number system if and only if it is of the form $\pm m/2^n$, where n and m are positive integers.
13. Prove that any number that has a finite representation in the binary system must have a finite representation in the decimal system.
14. Some countries measure temperature in Fahrenheit (F), while other countries use Celsius (C). Similarly, for distance, some use miles and others use kilometers. As a frequent traveler, you may be in need of a quick approximate conversion scheme that you can do in your head.
   a. Fahrenheit and Celsius are related by the equation F = 32 + (9/5)C. Verify the following simple conversion scheme for going from Celsius to Fahrenheit: A rough approximation is to double the Celsius temperature and add 32. To refine your approximation, shift the decimal place to the left in the doubled number (2C) and subtract it from the approximation obtained previously: F = [(2C) + 32] − (2C)/10.
   b. Determine a simple scheme to convert from Fahrenheit to Celsius.
   c. Determine a simple scheme to convert from miles to kilometers.
   d. Determine a simple scheme to convert from kilometers to miles.
15. Convert fractions such as $\frac{1}{3}$ and $\frac{1}{11}$ into their binary representation.
16. (Mayan arithmetic) The Maya civilization of Central America (2600 B.C. to 1200 A.D.) understood the concept of zero hundreds of years before many other civilizations. For their calculations, the vigesimal (base 20) system was used, not the decimal (base 10) system. So instead of 1, 10, 100, 1000, 10000, they used 1, 20, 400, 8000, 160000. They used a dot for 1 and a bar for 5, and zero was represented by the shell symbol. For example, the calculations 11131 + 7520 = 18651 and 11131 − 7520 = 3611 were as follows:
   [Figure: Mayan-symbol columns for 11131, 7520, 18651, and 3611, with rows labeled 8000s, 400s, 20s, and 1s]
   Here, as an aid, some of our numbers are included; on the left, they indicate the powers used, and above, they are the numbers represented by the columns.
   Do these calculations using Mayan symbols and arithmetic:
   a. 92819 + 56313 = 149132, 92819 − 56313 = 36506
   b. 3296 + 853 = 4149, 3296 − 853 = 2443
   c. 2273 + 729 = 3002, 2273 − 729 = 1544
   d. Investigate how the Mayans might have done multiplication and division in their number system. Work out some simple examples.
17. (Babylonian arithmetic) Babylonians of ancient Mesopotamia (now Iraq) used a sexagesimal (base 60) positional number system with a decimal (base 10) system within it. The Babylonians based their number system on only two symbols! The influence of Babylonian arithmetic is still with us today. An hour consists of 60 minutes, each of which is divided into 60 seconds, and a circle is measured in divisions of 360 degrees. Numbers are frequently called digits, from the Latin word for “finger.” The base-10 and base-20 systems most likely arose from the fact that ten fingers and ten toes could be used in counting. Investigate the early history of numbers and of doing arithmetic calculations in different number systems.
Computer Problems B.1
1. Read into your computer x = 1.1 (base 10), and print it out using several different formats. Explain the results.
2. Show that $e^{\pi\sqrt{163}}$ is incredibly close to being the 18-digit integer 262 53741 26407 68744. Hint: More than 30 decimal digits will be needed to see any difference.
3. Write and test a routine for converting integers into octal and binary forms.
4. (Continuation) Write and test a routine for converting decimal fractions into octal and binary forms.
5. (Continuation) Using the two routines of the preceding problems, write and test a program that reads in decimal numbers and prints out the decimal, octal, and binary representations of these numbers.
6. See how many binary digits your computer has for $(0.1)_{10}$. See the introductory remarks at the beginning of this chapter.
7. Some mathematical software systems have commands for converting numbers between binary, decimal, hex, octal, and vice versa. Explore these commands using various numerical values. Also, see whether there are commands for determining the precision (the number of significant decimal digits in a number) and the accuracy (the number of significant decimal digits to the right of the decimal point in a number).
8. Write a computer program to verify the conclusions in evaluating f(x) = x − sin x for various values of x near 1.9, say, over the interval [0.1, 2.5] with increments of 0.1. For these values, compute the approximate value of f, the true calculated value, and the absolute error between them. Single-precision and double-precision computations may be necessary.
Appendix C Additional Details on IEEE Floating-Point Arithmetic
In this appendix, we summarize some additional features in IEEE standard floating-point arithmetic. (See Overton [2001] for additional details.)
C.1 More on IEEE Standard Floating-Point Arithmetic
In the early 1980s, a working committee of the Institute of Electrical and Electronics Engineers (IEEE) established a standard floating-point arithmetic system for computers that is now known as the IEEE floating-point standard. Previously, manufacturers of different computers each developed their own internal floating-point number systems. This led to inconsistencies in numerical results in moving code from machine to machine, for example, in porting source code from an IBM computer to a Cray machine. Some important requirements for all machines adopting the IEEE floating-point standard include the following:
• Correctly rounded arithmetic
• Consistent representation of floating-point numbers across machines
• Consistent and sensible treatment of exceptional situations
Suppose that we are using a 32-bit computer with IEEE standard floating-point arithmetic. There are exactly 23 bits of precision in the fraction field in a single-precision normalized number. By counting the hidden bit, this means that there are 24 bits in the significand and the unit roundoff error is $u = 2^{-24}$. In single precision, the machine epsilon is $\varepsilon_{\text{single}} = 2^{-23}$ because $1 + 2^{-23}$ is the first single-precision number larger than 1. Since $2^{-23} \approx 1.19 \times 10^{-7}$, we can expect only approximately six accurate decimal digits in the output. This accuracy may be reduced further by errors of various types, such as roundoff errors in the arithmetic, truncation errors in the formulas used, and so on.
For example, when computing the single-precision approximation to π, we obtain six accurate digits: 3.14159. Converting and printing the 24-bit binary number results in an actual decimal number with more than six nonzero digits, but only the first six digits are considered accurate approximations to π.
The first double-precision number larger than 1 is $1 + 2^{-52}$. So the double-precision machine epsilon is $\varepsilon_{\text{double}} = 2^{-52}$. Since $2^{-52} \approx 2.22 \times 10^{-16}$, there are only approximately 15 accurate decimal digits in the output in the absence of errors. The fraction field has exactly 52 bits of precision, and this results in 53 bits in the significand when the hidden bit is counted.
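A small Python experiment can reproduce both machine epsilons. This is a sketch under assumptions: Python floats are taken to be IEEE doubles, and single precision is simulated by rounding through a 4-byte `struct` field (the helper names `to_single` and `machine_epsilon` are hypothetical). The loop halves a trial value until adding half of it to 1 no longer changes the rounded result.

```python
import math
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def machine_epsilon(rnd=lambda x: x):
    """Return the gap between 1 and the next larger number in the
    format whose rounding is described by `rnd`."""
    eps = 1.0
    # Stop as soon as 1 + eps/2 rounds back to 1.
    while rnd(1.0 + eps / 2) > 1.0:
        eps /= 2
    return eps

eps_double = machine_epsilon()            # double precision
eps_single = machine_epsilon(to_single)   # simulated single precision
pi_single = to_single(math.pi)            # pi rounded to 24 significant bits
```

Printing `pi_single` shows more than six nonzero digits, but comparing it with `math.pi` shows agreement only in roughly the first seven significant digits, in line with the discussion above.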
For example, when approximating π in double precision, we obtain 15 accurate digits: 3.14159 26535 8979. As in the case with single precision, converting and printing the 53-bit binary significand results in more than 15 digits, but only the first 15 digits are accurate approximations to π.
There are some useful special numbers in the IEEE standard. Instead of terminating with an overflow when dividing a nonzero number by 0, the machine representation for ∞ is stored, which is the mathematically sensible thing to do. Because of the hidden bit representation, a special technique for storing zero is necessary. Note that all zeros in the fraction field (mantissa) represent the significand 1.0 rather than 0.0. Moreover, there are two different representations for the same number zero, namely, +0 and −0. On the other hand, there are two different representations for infinity that correspond to two quite different numbers, +∞ and −∞. NaN stands for Not a Number and is an error pattern rather than a number.
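The special values described above can be inspected from Python, as in the following sketch. Two assumptions should be kept in mind: the helper `bits32` is a hypothetical name introduced here, and Python itself raises `ZeroDivisionError` for `1.0 / 0.0` rather than returning ∞, so the infinities are taken from `math.inf` instead of being produced by a division.

```python
import math
import struct

def bits32(x):
    """Return the 32-bit single-precision pattern of x as a string of 0s and 1s."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{n:032b}"

plus_zero_bits = bits32(0.0)        # all bits zero
minus_zero_bits = bits32(-0.0)      # only the sign bit set
plus_inf_bits = bits32(math.inf)    # exponent all ones, fraction all zeros
nan_bits = bits32(math.nan)         # exponent all ones, fraction nonzero

zeros_compare_equal = (0.0 == -0.0)                 # same number, two patterns
infinities_differ = (math.inf != -math.inf)         # two different numbers
nan_equals_itself = (math.nan == math.nan)          # False: NaN is not a number
```

The bit strings make the two representations of zero visible even though the comparison `0.0 == -0.0` is true.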
Is it possible to represent numbers smaller than the smallest normalized floating-point number $2^{-126}$ in IEEE standard floating-point format? Yes! If the exponent field contains a bit string of all zeros and the fraction field contains a nonzero bit string, then this representation is called a subnormal number. Subnormal numbers cannot be normalized because this would result in an exponent that does not fit into the exponent field. These subnormal numbers are less accurate than normal numbers because they have less room in the fraction field for nonzero bits.
By using various system inquiry functions (such as those in Table C.1 from Fortran 90), we can determine some of the characteristics of the floating-point number system on a typical PC with 32-bit IEEE standard floating-point arithmetic. Table C.2 contains the results. In most cases, simple programs can also be written to determine these values.
In Table C.3, we show the relationship between the exponent field and the possible single-precision 32-bit floating-point numbers corresponding to it. In this table, all lines except the first and the last are normalized floating-point numbers. The first line shows that zero is represented by +0 when all bits $b_i = 0$, and by −0 when all bits are zero except $b_1 = 1$. The last line shows that +∞ and −∞ have bit strings of all ones in the exponent field, except for possibly the sign bit, together with all zeros in the mantissa field.
TABLE C.1
Some Numeric Inquiry Functions in Fortran 90
EPSILON(X)      Machine epsilon (number almost negligible compared to 1)
TINY(X)         Smallest positive number
HUGE(X)         Largest number
PRECISION(X)    Decimal precision (number of significant decimal digits in output)
TABLE C.2
Results with IEEE Standard Floating-Point on 32-Bit Machine
                X Single Precision                                                          X Double Precision
EPSILON(X)      $1.192 \times 10^{-7} \approx 2^{-23}$                                      $2.220 \times 10^{-16} \approx 2^{-52}$
TINY(X)         $1.175 \times 10^{-38} \approx 2^{-126}$                                    $2.225 \times 10^{-308} \approx 2^{-1022}$
HUGE(X)         $3.403 \times 10^{38} \approx (2 - 2^{-23}) \times 2^{127} \approx 2^{128}$   $1.798 \times 10^{308} \approx 2^{1024}$
PRECISION(X)    6                                                                           15
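For readers without a Fortran 90 compiler, the double-precision column can be checked with Python's standard `sys.float_info` record. The pairing of its fields with the Fortran inquiry functions below is an assumption made for illustration, not a correspondence stated in the text, and it presumes that the Python float is an IEEE double.

```python
import sys

info = sys.float_info

epsilon_double = info.epsilon   # plays the role of EPSILON(X)
tiny_double = info.min          # plays the role of TINY(X): smallest positive normalized number
huge_double = info.max          # plays the role of HUGE(X)
precision_double = info.dig     # plays the role of PRECISION(X)
```

Printing these four values gives the double-precision entries of Table C.2; the single-precision column is not available from `sys.float_info`.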
TABLE C.3 Single-Precision 32-Bit Word with Sign Bit $b_1 = 0$ for + and $b_1 = 1$ for −.
Exponent Field $(b_2 b_3 \ldots b_9)_2$      Numerical Representation
$(00000000)_2 = (0)_{10}$                    ±0, if $b_{10} = b_{11} = \cdots = b_{32} = 0$; subnormal, otherwise
$(00000001)_2 = (1)_{10}$                    $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-126}$
$(00000010)_2 = (2)_{10}$                    $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-125}$
$(00000011)_2 = (3)_{10}$                    $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-124}$
$(00000100)_2 = (4)_{10}$                    $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-123}$
    ⋮                                            ⋮
$(01111101)_2 = (125)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-2}$
$(01111110)_2 = (126)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-1}$
$(01111111)_2 = (127)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{0}$
$(10000000)_2 = (128)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{1}$
$(10000001)_2 = (129)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{2}$
    ⋮                                            ⋮
$(11111011)_2 = (251)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{124}$
$(11111100)_2 = (252)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{125}$
$(11111101)_2 = (253)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{126}$
$(11111110)_2 = (254)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{127}$
$(11111111)_2 = (255)_{10}$                  ±∞, if $b_{10} = b_{11} = \cdots = b_{32} = 0$; NaN, otherwise
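The table can be turned into a short decoder. The sketch below is illustrative only: the function name `decode_single` is hypothetical, the word is passed as a Python integer with $b_1$ as its most significant bit, and the value assigned to subnormal numbers, $\pm(0.b_{10}b_{11}\cdots b_{32})_2 \times 2^{-126}$, is the usual IEEE convention rather than something spelled out in the table.

```python
def decode_single(word):
    """Classify a 32-bit pattern and return (kind, value) following Table C.3."""
    sign = -1.0 if (word >> 31) & 1 else 1.0   # b1
    exponent = (word >> 23) & 0xFF             # b2 ... b9
    fraction = word & 0x7FFFFF                 # b10 ... b32
    if exponent == 0:
        if fraction == 0:
            return "zero", sign * 0.0
        # Subnormal: no hidden bit, exponent fixed at -126 (assumed IEEE convention).
        return "subnormal", sign * (fraction / 2 ** 23) * 2.0 ** -126
    if exponent == 255:
        if fraction == 0:
            return "infinity", sign * float("inf")
        return "NaN", float("nan")
    # Normalized: hidden bit 1 and exponent bias 127.
    return "normal", sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

one = decode_single(0x3F800000)
smallest_normal = decode_single(0x00800000)
largest_normal = decode_single(0x7F7FFFFF)
smallest_subnormal = decode_single(0x00000001)
```

The second and third calls reproduce the TINY(X) and HUGE(X) values for single precision.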
In the IEEE floating-point standard, the round to nearest or correctly rounded value of the real number x, denoted round(x), is defined as follows. First, let $x_+$ be the closest floating-point number greater than x, and let $x_-$ be the closest one less than x. If x is a floating-point number, then round(x) = x. Otherwise, the value of round(x) depends on the rounding mode selected:
• Round to nearest: round(x) is either $x_-$ or $x_+$, whichever is nearer to x. (If there is a tie, choose the one with the least significant bit equal to 0.)
• Round toward 0: round(x) is either $x_-$ or $x_+$, whichever is between 0 and x.
• Round toward −∞/round down: round(x) = $x_-$.
• Round toward +∞/round up: round(x) = $x_+$.
Round to nearest is almost always used, since it is the most useful and gives the floating-point number closest to x.
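The four rounding modes can be imitated in Python for double precision. This is a sketch under assumptions: the real number x is given exactly as a `fractions.Fraction`, the names `neighbors` and `round_to` are hypothetical, `math.nextafter` requires Python 3.9 or later, and Python's integer division `numerator / denominator` is relied on to deliver the round-to-nearest result.

```python
import math
from fractions import Fraction

def neighbors(x):
    """Return (x_minus, x_plus), the doubles just below and just above
    the exact rational x; both equal x if x is itself a double."""
    nearest = x.numerator / x.denominator      # round to nearest (assumed correctly rounded)
    if Fraction(nearest) == x:
        return nearest, nearest
    if Fraction(nearest) < x:
        return nearest, math.nextafter(nearest, math.inf)
    return math.nextafter(nearest, -math.inf), nearest

def round_to(x, mode):
    """Round the exact rational x to a double in the given mode."""
    lo, hi = neighbors(x)
    if mode == "down":            # toward minus infinity
        return lo
    if mode == "up":              # toward plus infinity
        return hi
    if mode == "toward_zero":
        return lo if x > 0 else hi
    if mode == "nearest":
        return x.numerator / x.denominator
    raise ValueError(mode)

tenth = Fraction(1, 10)
x_minus, x_plus = neighbors(tenth)
```

For x = 1/10 the double nearest to x lies slightly above it, so rounding up and rounding to nearest agree while rounding down gives the adjacent smaller double.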
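[Figure: layout of the 32-bit single-precision word: sign bit $b_1$, exponent field $b_2 b_3 b_4 \cdots b_9$, fraction field $b_{10} b_{11} \cdots b_{32}$]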
Appendix D Linear Algebra Concepts and Notation
In this appendix, we review some basic concepts and standard notation used in linear algebra.
D.1 Elementary Concepts
The two concepts from linear algebra that we are most concerned with are vectors and matrices because of their usefulness in compressing complicated expressions into a compact notation. The vectors and matrices in this text are most often real, since they consist of real numbers. These concepts easily generalize to complex vectors and matrices.
Vectors
A vector $x \in \mathbb{R}^n$ can be thought of as a one-dimensional array of numbers and is written as
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
where $x_i$ is called the ith element, entry, or component. An alternative notation that is useful in pseudocodes is $x = (x_i)_n$. Sometimes the vector x displayed above is said to be a column vector to distinguish it from a row vector y written as
$$y = [y_1, y_2, \ldots, y_n]$$
For example, here are some vectors:
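$$\begin{bmatrix} \tfrac{1}{5} \\ 3 \\ -\tfrac{5}{6} \\ \tfrac{2}{7} \end{bmatrix} \qquad [\pi,\ e,\ 5,\ -4] \qquad \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{3} \end{bmatrix}$$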
To save space, a column vector x can be written as a row vector such as
$$x = [x_1, x_2, \ldots, x_n]^T \quad \text{or} \quad x^T = [x_1, x_2, \ldots, x_n]$$
by adding a T (for transpose) to indicate that we are interchanging or transposing a row or column vector. As an example, we have
$$[1\ \ 2\ \ 3\ \ 4]^T = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}$$
Many operations involving vectors are component-by-component operations. For vectors x and y
$$x = [x_1, x_2, \ldots, x_n]^T, \qquad y = [y_1, y_2, \ldots, y_n]^T$$
the following definitions apply.
Equality      x = y if and only if $x_i = y_i$ for all i ($1 \le i \le n$)
Inequality    x
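$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad \begin{bmatrix} 3 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 7 & 0 \\ 0 & 0 & 0 & 9 \end{bmatrix} \qquad \begin{bmatrix} 5 & 3 & 0 & 0 & 0 \\ 2 & 5 & 3 & 0 & 0 \\ 0 & 2 & 9 & 2 & 0 \\ 0 & 0 & 3 & 7 & 2 \\ 0 & 0 & 0 & 3 & 7 \end{bmatrix}$$
$$\begin{bmatrix} 6 & 0 & 0 & 0 \\ 3 & 6 & 0 & 0 \\ 4 & -2 & 7 & 0 \\ 5 & -3 & 9 & 21 \end{bmatrix} \qquad \begin{bmatrix} 1 & -1 & 2 & 1 \\ 0 & 5 & -5 & 1 \\ 0 & 0 & 9 & -3 \\ 0 & 0 & 0 & 2 \end{bmatrix}$$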
As with vectors, many operations involving matrices correspond to component operations. For matrices A and B,
⎡a a ···a⎤ ⎡b b ···b⎤
11 12
⎢a21 a22 ···
11 12
⎢b21 b22 ···
Inequality A
■ PROPERTIES
2. $\|ax\| = |a|\,\|x\|$ for all vectors x and all scalars a.
3. $\|x + y\| \le \|x\| + \|y\|$ for all vectors x and y.
    $\|x\|_1 = \sum_{i=1}^{n} |x_i|$                              ($l_1$-vector norm)
    $\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$          (Euclidean/$l_2$-vector norm)
    $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$                    ($l_\infty$-vector norm)
Here, $x_i$ denotes the ith component of the vector. Any norm can be thought of as assigning a length to each vector. It is the Euclidean norm that corresponds directly to our usual concept of length, but other norms are sometimes much more convenient for our purposes. For example, if we know that $\|x - y\|_\infty < 10^{-8}$, then we know that each component of x differs from the corresponding component of y by at most $10^{-8}$ and that the converse is also true. When we solve a system of linear equations Ax = b numerically, we shall want to know (among other things) how big the residual vector is. That is conveniently measured by $\|Ax - b\|$, where some norm has been specified.
For n × n matrices, we can also have matrix norms, subject to the following requirements:
Properties of Matrix Norms
1. $\|A\| > 0$ if $A \ne 0$
2. $\|\alpha A\| = |\alpha|\,\|A\|$
3. $\|A + B\| \le \|A\| + \|B\|$ (triangular inequality)
for matrices A, B and scalars α.
We usually prefer matrix norms that are related to a vector norm. When a vector norm has been specified on $\mathbb{R}^n$, there is a standard way of introducing a related matrix norm for n × n matrices, namely,
    $\|A\| = \sup\{\|Ax\| : x \in \mathbb{R}^n,\ \|x\| \le 1\}$
We say that this matrix norm is the subordinate norm to the given vector norm or the norm induced by the given vector norm. The close relationship between the two is useful, because it leads to the following inequality, which is true for all vectors x:
    $\|Ax\| \le \|A\|\,\|x\|$
The matrix norms subordinate to the vector norms discussed above are, respectively,
    $\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|$           ($l_1$-matrix norm)
    $\|A\|_2 = \max_{1 \le k \le n} \sigma_k$                           (Spectral/$l_2$-matrix norm)
    $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$      ($l_\infty$-matrix norm)
Here, $\sigma_k$ are the singular values of A. (Refer to Section 8.2 for definitions.) Note from above that the matrix norm subordinate to the Euclidean vector norm is not what most students think that it should be, namely,
    $\|A\|_F = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2}$    (Frobenius norm)
This is indeed a matrix norm; however, it is not the one induced by the Euclidean vector norm.
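The column-sum, row-sum, and Frobenius formulas are easy to evaluate directly, as in the following Python sketch. The function names and the 2 × 2 test matrix are hypothetical choices made for illustration; the spectral norm is omitted because it requires singular values. The last lines check the inequality $\|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty$ on one vector.

```python
import math

def mat_norm_1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def mat_norm_frobenius(A):
    """Square root of the sum of squares of all entries."""
    return math.sqrt(sum(a * a for row in A for a in row))

def vec_norm_inf(x):
    """Largest absolute component."""
    return max(abs(a) for a in x)

def matvec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

A = [[1, -2], [3, 4]]
x = [1, -1]
n1 = mat_norm_1(A)
ninf = mat_norm_inf(A)
nfro = mat_norm_frobenius(A)
lhs = vec_norm_inf(matvec(A, x))     # ||Ax||_inf
rhs = ninf * vec_norm_inf(x)         # ||A||_inf * ||x||_inf
```

For this matrix the three norms take three different values, which illustrates that the choice of norm matters when a bound is quoted.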
Gram-Schmidt Process
The projection operator is defined to be
    $\operatorname{proj}_y x = \dfrac{\langle x, y \rangle}{\langle y, y \rangle}\, y$
which projects the vector x orthogonally onto the vector y. The Gram-Schmidt process can be written as
    $z_1 = v_1$,                                                                   $q_1 = z_1 / \|z_1\|$
    $z_2 = v_2 - \operatorname{proj}_{z_1} v_2$,                                   $q_2 = z_2 / \|z_2\|$
    $z_3 = v_3 - \operatorname{proj}_{z_1} v_3 - \operatorname{proj}_{z_2} v_3$,   $q_3 = z_3 / \|z_3\|$
In general, the kth step is
    $z_k = v_k - \sum_{j=1}^{k-1} \operatorname{proj}_{z_j} v_k$,                  $q_k = z_k / \|z_k\|$
Here $\{z_1, z_2, z_3, \ldots, z_k\}$ is an orthogonal set and $\{q_1, q_2, q_3, \ldots, q_k\}$ is an orthonormal set. When implemented on a computer, the Gram-Schmidt process is numerically unstable because the vectors $z_k$ may not be exactly orthogonal due to roundoff errors. By a minor modification, the Gram-Schmidt process can be stabilized. Instead of computing each vector $z_k$ as above, it can be computed a term at a time. A computer algorithm for the modified Gram-Schmidt process is as follows:
for j = 1 to k
    for i = 1 to j − 1
        s ← ⟨v_j, v_i⟩
        v_j ← v_j − s v_i
    end for
    v_j ← v_j/||v_j||
end for
Here the vectors $v_1, v_2, \ldots, v_k$ are replaced with orthonormal vectors that span the same subspace. The i-loop removes components in the $v_i$ direction, followed by normalization of the vector. In exact arithmetic, this computation gives the same results as the original form above. However, it produces smaller errors in finite-precision computer arithmetic.
EXAMPLE 1
Consider the vectors $v_1 = (1, \varepsilon, 0, 0)$, $v_2 = (1, 0, \varepsilon, 0)$, and $v_3 = (1, 0, 0, \varepsilon)$. Assume ε is a small number, small enough that $\varepsilon^2$ is negligible compared with 1. Carry out the standard Gram-Schmidt procedure and the modified Gram-Schmidt procedure. Check the orthogonality conditions of the resulting vectors.
Solution
Using the classical Gram-Schmidt process, we obtain $u_1 = (1, \varepsilon, 0, 0)$, $u_2 = (0, -1, 1, 0)/\sqrt{2}$, and $u_3 = (0, -1, 0, 1)/\sqrt{2}$. Using the modified Gram-Schmidt process, we find $z_1 = (1, \varepsilon, 0, 0)$, $z_2 = (0, -1, 1, 0)/\sqrt{2}$, and $z_3 = (0, -1, -1, 2)/\sqrt{6}$. Checking orthogonality, we find $\langle u_2, u_3 \rangle = \frac{1}{2}$ and $\langle z_2, z_3 \rangle = 0$. ■
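The example can be reproduced numerically with the sketch below. Several assumptions are made for illustration: ε is set to $10^{-8}$ so that $1 + \varepsilon^2$ rounds to 1 in double precision, the function names are hypothetical, and both versions project onto the already normalized vectors, which is mathematically equivalent to projecting onto the unnormalized ones. The only difference between the two routines is whether the inner product uses the original vector (classical) or the partially reduced vector (modified).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def classical_gs(vectors):
    q = []
    for v in vectors:
        z = list(v)
        for qi in q:
            s = dot(v, qi)                         # uses the ORIGINAL vector v
            z = [a - s * b for a, b in zip(z, qi)]
        q.append(normalize(z))
    return q

def modified_gs(vectors):
    q = []
    for v in vectors:
        z = list(v)
        for qi in q:
            s = dot(z, qi)                         # uses the partially reduced vector z
            z = [a - s * b for a, b in zip(z, qi)]
        q.append(normalize(z))
    return q

eps = 1e-8                                         # assumed value: eps**2 is lost when added to 1
V = [[1, eps, 0, 0], [1, 0, eps, 0], [1, 0, 0, eps]]
classical = classical_gs(V)
modified = modified_gs(V)
classical_23 = dot(classical[1], classical[2])     # inner product of 2nd and 3rd vectors
modified_23 = dot(modified[1], modified[2])
```

Comparing `classical_23` with `modified_23` shows the loss of orthogonality in the classical version and its recovery in the modified one, in agreement with the values 1/2 and 0 found in the solution.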
Answers for Selected Problems*
Problems 1.1
2. $x = \frac{6032}{9990}$; $x = \frac{6032}{10010}$
3. $6 \times 10^{-5}$
4. Two other ways: pi ← 2.0 arcsin(1.0) or pi ← 2.0 arccos(0.0)
5a. sum ← 0
    for i = 1 to n do
        for j = 1 to n do
            sum ← sum + a_ij
        end for
    end for
5d. sum ← 0.0
    for i = 1 to n do
        sum ← sum + a_ii
    end for
    for j = 2 to n do
        for i = j to n do
            sum ← sum + a_{i,i−j+1} + a_{i−j+1,i}
        end for
    end for
6. n multiplications and n additions/subtractions
8a. for i = 1 to 5 do
        x ← x · x
    end for
    p ← x
8c. z ← x + 2
p ← z3 6 + z4 9 + z8 3 − z10
10. z ← a_n/b_n
    for i = 1 to n − 1 do
        z ← a_{n−i}(z + 1/b_{n−i})
    end for
*Answers to problems marked in the text with the symbol a are given here and in the Student’s Solution Manual with more details.
11b. z←1 v←1
fori =1ton−1do v ← vx
z ← vz + 1 end for
for i = 1 to n do
aij ←1.0/real(i+j−1)
end for end for
Computer Problems 1.1
z ← vxz
12b. v=n aixi
12e. v=anxn +xn i=1
an−ixn−i
13. z=1+n i=2
i bj j=2
14. n(n+1)/2
i=0
15b. for j = 1 to n do
4. exp(1.0)≈2.71828182846
9. Computation deviates from theory when a1 = 10−12, 10−8, 10−4, 1020, for example. 10. x may underflow and be set to zero. 12. 40 different spellings
20a. The computation m/n may result in truncation so that x ≠ y.
Problems 1.2
4a. First derivative +∞ at 0. 4b. First derivative not continuous. 4e. Function −∞ at 0.
x2 6a.ecosx =e 1− 2 +···
+···
9. Yes. By using this formula, we avoid the series for e−x and use the one for ex .
x2 x3
15a. sinx+cosx=1+x− − +···; sin(0.001)+cos(0.001)≈1.00099949983
26
∞x2k
(2k)!; cosh0.7≈1.25517
5. coshx =
6b. sin(cosx) = (sin1)−(cos1)
k=0
x2 2
7. m = 2 8. At least 18 terms ∞ x k 1 + x ∞ x 2 k − 1
k ; ln 1−x =2 (2k−1) k=1 k=1
11. ln(1−x)=−
12. x = 1 , ln 2 = 0.69313 (four terms); At least 10 terms.
3
15b. (sinx)(cosx)=x−2×3+ 2 x5− 4 x7+···; sin(0.0006)cos(0.0006)≈0.000599999857 3 15 315
∞ 1 x n
16. ln(e + x) = 1 + (−1)n−1
ne
n=1
17. At least seven terms.
24. s ← 0
for i = 2 to n do
s ← s + log(i)
output i, s end for
18. At least 100 terms.
20. − 5 h4 23. 1 x − 17 884
x2 1 1
28.cosx− 1− < =
2 16×24 384
32. Maclaurin series: f (x) = 3 + 7x − 1.33x2 + 19.2x4;
(x −2)2 (x −2)3 (x −2)4
f (x) = 318.88 + (x − 2)616.08 + 918.94 + 921.6 + 460.8
2! 3! 4! 35. 400 terms. √
π1∞ h2k 3∞ h2k−1
38. cos + h 3
= (−1)k + 2 (2k)!
2
(−1)k ; cos(60.001◦ ) ≈ 0.49998 488 (2k − 1)!
k=0
39. sin(45.0005 )≈0.70711295
k=1
42. f(x−h)=(x−h) =x −mhx
◦
47. n=16orn=17 50b. lim arctanx =1 50c. lim cosx+1 =0
mmm−1 h2m−2 +m(m−1) x +···
x→0 x x→π sinx 2x3x5x7
2!
51. Atleast38terms.
10 5 53. 10 54. 10
17. α50 = 2 81437 53123
52. erf(x)=√π x− 3 +5(2!)−7(3!)+··· ; erf(1)≈0.8382
Computer Problems 1.2
1. c = 1 c = 108
x1 0 −1
x2 −108 −108
14. g converges faster (in five iterations)
Problems 2.1
1c. [B5000000]16
16. λ50 = 1 25862 69025
2d. [3FA 0000000000000]16; [BFA 0000000000000]16
4d. [3E7 00000]16,[3FCE 0000000000000]16
5d. −∞ 8a. −3.131968×106 8d. 9.992892×106 8g. −3.39×103
11c. m = −1,0,1. Nonnegative machine numbers: 0, 1, 1, 3, 1, 3, 1, 3 84824 2
15. 1 17. 1.00005; 1.0 18. |x| < 5 × 10−5 21. ≈3×2−25 25. ≈3×2−24 26. ≈2−22
19. β1−n
30. ≈n×2−24;n=1000,≈2−14
39. The relative error cannot exceed 5 × 2−24.
37. 1 × 10−12 rounding; 10−12 chopping 2
38. 9%
42. q − 2−252m, q + 2−252m Problems 2.2
4. y = cos2 x 1+sinx
6. f(x)=−1x3 − 1x4; 2 2
f(0.0125)≈−9.888×10−7
8. f(x)= √ 1 1+x2 +1
10. f(x)= √ 1
x2 +1+x
+3−1.7x2; ⎧√
⎨ ln x+ x2+1 11.f(x)=⎩ 0 √2
x2 x3
16. f(x)≈1−x+ 3 − 6; f(0.008)≈0.992020915 20.arctanx−x≈x
f(0)=3.5
x>0 x=0 −ln−x+x+1 x<0
x4 13.z=√x4+4+2
3 1 21 −3+x 5+x
2 1 −7
Answers for Selected Problems 727 22. e2x − 12x ≈ 1 + x(1 + (x/3)(2 + x)) 24a. Near π/2, sine curve is relatively flat.
26b. ln x − 1 = ln(x/e) 26d. x−2(sin x − ex + 1) ≈ −1 − x when x → 0 √23
28. |x| < 6ε, where ε machine precision 29. x1 ≈ 105, x2 ≈ 10−5 30. Not much. Expect to compute b2 − 4ac in double precision.
Computer Problems 2.2
1. No solution; (0, 0); (0, 0); Any solution; (−1., 0.); (−0.10208 42383, −4.89791 57617);
10.
x
1 −1 0.5 −0.123 −25.5 −1776 3.14159
Series 1.0
2.71828 18285 0.36787 94412 1.64872 12707 0.88426 36626 8.42346 37545 0
(4.00000 00001, 4.0009 99999); (1.99683 77223, 2.00316 22777)
(−0.10208 42383, −4.89791 57617); n
(1.0000 00000,
1.00000 0000E34);
Problems 3.1
1. 4.
0
1 10 10 8 5 × 10−12 25 25 17
23.14063 12270
14. |x| < 10−15 15. ρ50 = 2.85987
0.61906; 1.51213
π π 3π 5π
− − δ, 0, + ε, + ε, + ε, . . . , where δ ≈ 0.2 and ε starts at approximately 0.4 and decreases.
4444
9. 12.
0,±π,±π,±3π,±2π,... 10. x =0 22
If the original interval is of width h, then after, say, k steps, we have reduced the interval containing the root to width h2−k. From then on, we add one bit at each step. About three steps are needed for each decimal digit.
20 steps 18b. This could be false because if r is close to bn then r − an ≈ bn − an = 2−n(b0 − a0).
17.
18d. This is true because 0 r − an (obvious) and r − an bn − an = 2−n (b0 − a0 ). 19e. True. 21. n23. 23. No;No.
19a. False in some cases.
Computer Problems 3.1
10. 1,2,3,3−2i,3+2i,5+5i,5−5i,16
Problems 3.2
11. 2.365
3. xn+1 = 1[xn +1/(Rxn)]
1.6 7. y =
√2
x +
π
4. 0.79; 11. xn+1 =2xn xn2R+1 ;−0.49985
1−
9. π
√2
√2 2 4
2
12a. Yes,−3 R. 13a. xn+1 = 1 2xn +R/xn2 3
13c. x =x x3+2R2x3+R 13e. x = xn 4R−x3 13g. x = R2x6+12Rx3+1
n+1 n n n 15. x1 = 1 17. xn+1 = −1
n+1 3R n n+1 xn2 n n
19. |x0| < √3
27. x =(m−1)xm +R/mxm−1; x =x (m+1)R−xm/(mR)
22
21. Newton’s method cycles if x0 ≠ 0.
22. x ← R
for n = 1 to n max do
x ← (2x + Rx2)/3 end for
n+1 n n n+1n n 29. Diverges. 31. xn+1 = xn − f (xn) f ′(xn)
′′2 ′′
32. xn+1 = xn − f (xn) + [f (xn)] −2f(xn)f (xn)
[ f ′(xn)]2 − f (xn) f ′′(xn)
35. en+1 =en2⎢ ⎣
1 2 f′′
36. en+1 = 2en g
(m − 1)! m! ′
⎡ f′′(xn)
f′′(xn) ⎤ f (m+1)(ξn)
f (m+1)(ηn) − m!
(m+1)(m−1)!⎥ f(m)(r) + en f(m+1)(ηn) ⎦
Computer Problems 3.2
37. |g(r)|<1if0<ω<2
41. 4thorder
4. 0.3279677853318183622377546 5. 2.094551481542326591482386540579
8. 1.8392867552 9. 0.47033169 10a. 1.8954942670340 10b. 1.9926668631307
10c. 0.51097 34293 8857 10d. 2.58280 14730 552 14. 3.13108; 3.15145
(two nearby roots)
17. x=4.510187
Problems 3.3
1.2.7385
12. en+1 = 1 −
3.−3 2
9.xn+1=xn− xn2−R
4.ln2 xn − x0
Computer Problems 3.3
1. −0.45896; 3.73308 6a. 1.53209 6b. 1.23618 7.
Problems 4.1
1. p3(x)=7−2x+x3 3. l2(x)=−(x−4)(x2−1)/8
7a. p3(x)=2+(x+1)(−3+(x−1)(2+(x−3)(−11/24)))
xn + xn−1
13a. Linear convergence
f ′(ξn ) en
13c. Quadraticconvergence 15. Show|ξ−xn+1|c|ξ−xn|.
f(xn)− f(x0)
√
16. 2
8. p4(x)=−1+(x−1)2 +(x−2)1 +(x−2.5)3 +(x−3)11 3846
1.36880 81078 21373
9. 20.80485 4
9a. 0 1
8
13
1
223 0 35
493 12 83
9
14
7
1
6 259
9b. f(4.2)=104.49
13a. x3 − 3x2 + 2x − 1
12. q(x)=x4−x3+x2−x+1− 31(x+2)(x+1)(x)(x−1)(x−2) 120
14. p(x) = x − 2.5 16. α0 = 1 2
18. 2+x(−1+(x −1)(1−(x −3)x))
19. p4(x)=−1+2(x+2)−(x+2)(x+1)+(x+2)(x+1)x; p2(x)=1+2(x+1)x
22. p(x) = 0.76(x − 1.73)(x − 1.82)(x − 5.22)(x − 8.26) 25. 1.5727; No advantage
27. p(x) = −3 x3 − 2 x2 + 1 28. 0.38099; 0.077848
55
39. 0.85527; 0.87006 40. Divisions: 1 (n − 1); additions/subtractions: n(n − 1)
2n
42. zero 45. False, only unique for polynomial p of degree n − 1.
Computer Problems 4.1
1. p(x)=2+46(x−1)+89(x−1)(x−2)+6(x−1)(x−2)(x−3)+4(x−1)(x−2)(x−3)(x+4)
Problems 4.2
1. f [x0, x1, x2, x3, x4] = 0 6. 1.25 × 10−5 7. 9. 4.105 × 10−14 (Thm 1), 1.1905 × 10−23 (Thm 2)
Errors: 8.1 × 10−6, 10. 2.6 × 10−6
6.1 × 10−6 13. n 7
8. 497 table entries
n−1
hn(2n)!
|x−xi| 22nn! 16. Yes. Problems 4.3
14.
i=0
1. −hf′′(ξ) 2. Errorterm=−hf′′(ξ)forξ ∈(0,2h) 4. Nosuchformulaexists.
6. The point ξ for the first Taylor series is such that ξ ∈ (x, x + h), while the second is ξ ∈ (x − h, x). Clearly, they are
h2 (6) 9b. − 6 f (ξ)
not the same.
2 2 ′′′ h2 (5)
8a. −3h f (ξ) 9a. − 4 f (ξ) h2 ′′′
h ′′ 11. α=1,errorterm=− f (ξ); α≠ 1,errorterm=−(α−1) f (ξ)
262 12.Errorterm=−h f′′′(ξ1)+1f(4)(ξ2) forsomeξi∈(x−h,x+h). 13. p′ x0+x1 = f(x1)−f(x0)
6 2 2 x1−x0 h h2 h h h
16. L≈2φ −φ(h) 20. L≈ φ −φ(h)φ 2φ −φ(h)−φ
2 2323
Computer Problems 4.3
3. 0.2021158503 Problems 5.1
1. 7 3. 0.00010000250006 6. n56738 7. U −L = 1[f(1)− f(0)] 18 n
11. L(f;P)M(f;P)U(f;P) Computer Problems 5.1
2. 0.94598385; 0.94723395 4. 4.84422
Problems 5.2
1. ≈ 0.70833 1 dx
2. T(f;P)=0.775; x2 +1 ≈0.7854;Error=0.0104
0 6.h 2 1 1/2 1/4
L 0 0 1/2 3/4 U 2 2 3/2 5/4 T2111
7. n1607439; toosmall
8. T = 1 1(n−1)(2n−1)n + 1 n3 6 2n
12. 0.000025 13. T( f ; P) ≈ 4.37132 14. T( f ; P) ≈ 0.43056 15. | error term | 0.3104 16. T(f;P)=7.125;No,theycannotbecomputedfromthegivendata.
(b−a)h ′ (b−a)h2 ′′
17a. =− f (ξ)forsomeξ ∈(a,b). 17b. =− f (ξ)forsomeξ ∈(a,b).
24
30. b f(x)dx=h2n f(a+ih)+EwhereE=1(b−a)hf′(ξ)forξ∈(a,b) a i=0 2
Computer Problems 5.2
2a. 2 2b. 1.71828 2c. 0.43882 Problems 5.3
1. 13 3. −136 5. 4.267 7. Notwell. 15
8. R(1,1)= h{f(−h)+4f(0)+ f(h)} Simpson’srule 10. 1+2m−1 3
13. R(2,2)= 2h [7f(a)+32f(a+h)+12f(a+2h)+32f(a+3h)+7f(b)] 45
26 18a. 1 h3 f′′(ξ) 18b. 1 n h3 f′′(ξi) 18c. b−ah2 f′′(ξ)
24 24 i=1 i
24. f(x)=xn (n>3)on[0,1],withpartition{0,1}
b′
25. L a f(x)dxTU 26. n1155 29. −(b−a)hf (ξ)/2
14. X=(27v−u)/26 15. Z=4096f h −1344f h + 84 f h − 1 f(h) 2835 8 2835 4 2835 2 2835
17. xn+1+n3(xn+1−xn)/(3n2+3n+1) 18. |I−R(n,m)|=O(h2m)ash→0
23. Showb f(x)dx−R(n,0)≈c4−(n+1). 24. Letm=1andletn→∞inFormula(2). a 2π2m
27. E = A2m(2π) 4 [±42m cos(4ξ)] ± (2π)2m+142m+1 A2m cos(4ξ) Computer Problems 5.3
1. R(7,7) = 0.499969819 5. R(5,0) = 1.813799364 6. 2 = 0.22222 … 9
7. 0.62135732 11. R(7,7)=0.765197687 Problems 6.1
1. π 2a. h<0.03orn>33.97. 4
3a. 7.1667 3b. 7.0833 3c. 7.0777 7. b f(x)dx = 16S2(n−1) − 1 Sn−1
22. R(n+1,m+1)=R(n+1,m)+[R(n+1,m)−R(n,m)]/(8m−1)
2b. h<0.15orn>7.5.
2 dx −4
Problems 6.2
4. 1 x = 0.6933; Bound is 5.2 × 10 . 8. − 3 h5 f(4)(ξ)
a 15 15 80
1. ≈ 0.91949 4a. x = ± 1 4b. x = ±0.861136, ±0.339981 3
5.α=γ=4, β=−2 6. A=(b−a), B=1(b−a)2 332
5h2hh
7. f(a)+ f(a+h)− f(a+2h) 9. α= 5, a=c= 7 , b= 8
12312 72575 hh3 h3
10.w1=w2=2, w3=w4=−24 11. A=2h, B=0; C= 3 12. A=8, B=−4, C=8 Yes.Exactforpolynomialsofdegree3.
333
13. A = h , D = 0, C = h , B = 4 h 333
Computer Problems 6.2
2a. 1.4183 8a. 2.034805318577
Problems 7.1
14. True for n 3
8b. 0.892979511569
8c.0.43398771
1. Homogeneous: α = 0, zero solution; α = ±1, infinite number of solutions
2. For α ≈ 1, erroneous answer is produced.
3a. No solution
3b. Infinite number of solutions
4.
x1 = −697.3 x2 = 343.9
x1 = −720.79976 x2 = 356.28760
−0.001343
5. r= −0.001572 ,r= 0.0000000 ,e= −0.001 ,e=
6a. x2 =1, x1 =0 6b. x2 =1, x1 =1 6c. Letb1 =b2 =1.Thenx2 =1, x1 =0,whichisexact.
−0.0000001
−0.001
−0.659 0.913
7a. x1=2, x2=1, x3=0 7b. x1=x2=x3=1
7c. x1 ≈ −7.233, x2 ≈ 1.133, x3 ≈ 2.433, x4 = 4.5 Computer Problems 7.1
6. z=[2i,i,i,i]T,λ=1+5i; z=[1,2,1,1]T,λ=2+6i; z=[−i,−i,0,−i]T,λ= −3−7i; z = [1, 1, 1, 0]T , λ = −4 − 8i
7a. (3.75, 90◦ ); (3.27, −65.7◦ ); (0.775, 172.9◦ ) 7b. (2.5, −90◦ );
Problems 7.2
⎡1/2 5/2 −4 −1⎤
1. ⎢⎣ 1/4 −1/2 −5/19 −62/19 ⎥⎦ 2. x = [1/3, 3, 1/3]T
3/4 9/10 38/5 9/10 4104
⎡1 0 3 0⎤ ⎡0 1 3−2⎤ ⎡0 1 3.⎢⎣0 1 3−1⎥⎦⇒⎢⎣0 1 3−1⎥⎦⇒⎢⎣0 0
(2.08, 56.3◦);
3−2⎤ 0 1⎥⎦
0 6 −2 −2
(1.55, −60.2◦)
3 −3 0 6 0 2 4 −6
⎡1/4 5/2 7/4 1/2⎤ 5.⎢⎣4 2 1 2⎥⎦
1/2 0 5/9 17/9 1/4 3/5 27/10 1/5
3 −3 0 6 0 2 4 −6
3 −3 0 0
6. l=(1,3,2),thesecondpivotrowisthethirdrow.
10. x4=−1, x3=0, x2=2, x1=1 13b. x3=1, x2=1, x1=1 13d. x1 ≈ 4.267, x2 ≈ −4.133, x3 ≈ −2.467 17. n(n + 1)
18. 29(n2 −1)+ 7 n(n−1)(2n−1)10−6 seconds
8. x3 =−1, x2 =1, x1 =0
10
19. n Time
30
10 102
1 × 10−3 sec 1 sec 33
103 5.56 min
104 3.86 days
0.005¢ 5¢
21. Solve these: U T y = b, L T x = y 23a. x1 = 5 , x2 = 2 , x3 = 1 × 10−9
Cost
$46.30
$46,296.30 999
Computer Problems 7.2
2. [3.4606,1.5610,−2.9342,−0.4301]T 3. [6.7831,3.5914,−6.4451,−1.5179]T
4. 2n10,xi ≈1foralli; forlargen,manyxi ≠ 1 6.x2=1, xi=0, fori≠ 2
Problems 7.3
5. bi =n2+2(i−1)
2a. 5n−4 3. n+2nk−k(k+1) 6. Yes,itdoes.
7. D−1AD=tridiagonal±√a c , d, ±√ac i−1 i−1 i i i
Computer Problems 7.3
3.
xn ←bn
12.
di ←di −1/di−1
bi ←bi −bi−1/di−1 (2in) xi ←(bi −xi+1)/xi
(n−1 j1) ⎪⎨⎛⎞
x =1 n−1 4.1 −1 11a.
aijxj⎠ aii (1=n−1,…,1)
xi =1−(4xi−1) ⎧
⎪⎩di+1 ←di+1−ai+1ci bi+1 ←bi+1−ai+1bi
⎪⎩xi ←⎝bi −
⎪⎨ c i ← c i / d i bi ← bi/di
j=1
Problems8.1
5a.M=⎢⎣ 0 1 0 0⎥⎦ 0 −x/b 1 0
−w/a (xy)/(bc) ⎡a 0 0 0
5b.L′=⎢⎣0 b 0 0 0xc 0
−y/c 1 ⎤
⎥⎦
6c. L′=⎢⎣−1 −1
⎡ 0
15/4 0 0⎥⎦ −1/4 56/15 0
−1 −16/15 24/7 ⎤
⎡0 0 y d−(wz)/a
⎤
⎤ U = ⎢⎣ 0 15/4 −1/4 −1 ⎥⎦
0 0 56/15 −16/15 0 0 0 24/7
4000
(2i100)
(2in)
2√000 ⎢−1/2 (1/2) 15 0 0⎥ 6d. L′′ =⎢⎣−1/2 −1/2√15 2√14/15 0⎥⎦
6e. 192
(1in−1)
⎧⎪x1 ←b1/a11
bn ← bn/dn
bi ←bi −cibi+1
⎡ ⎤ ⎡ ⎤ 100 303 1000 1002
1/3 −3 1 0 0 8 0 −3 1 0 −5 0 −2 1
1a.L= 0 10,U=0−13 2a.M=⎢⎣0 1 00⎥⎦ U=⎢⎣0300⎥⎦
⎡⎤⎡⎤
0 0 4 0 0 0 0 0
1 0 0 0 0 ⎢0 1 0 0 0⎥ 3a. M=⎢⎣ 0 −2 1 0 0⎥⎦
0 0 −2 1 0
⎡−40001⎤ 0⎡00020⎤
1000a00z
25 0 0 0 1 ⎢027 4 3 2⎥
321 U=⎢⎣ 0 0 50 −6 −4⎥⎦ 4b. A= 2 2 1
1000 4−1−10
6a. L = ⎢⎣ −1/4 1 0 −1/4 −1/15 1 0 −4/15 −2/7
0 ⎥⎦ 0
1
⎡4 0 0 0⎤ ⎡1−1/4−1/4 0⎤ 6b. D = ⎢⎣ 0 15/4 0 0 ⎥⎦ U ′ = ⎢⎣ 0 1 −1/15 −4/15 ⎥⎦
0 0 56/15 0 0 0 1 −2/7 ⎡0 0 024/7 ⎤ 0 0 0 1
0 0 0 0 0
1 1 1
U=⎢⎣0 b 0 0 ⎥⎦ 0 0 c 0
0 0 0 d − (wz)/a
⎡1 0 0 z/a⎤ U′=⎢⎣0 1 0 0 ⎥⎦
0010 0⎡0 0 1
0 −2/√15 −(4/7)√14/15 2√6/7
734 Answers for Selected Problems ⎡1001⎤ ⎡1000⎤
8.U=⎢⎣0 1 0 −2⎥⎦ 0 0 1 4
0 0 0 −8
L=⎢⎣ 1 1 0 0⎥⎦ −1 1 1 0
1 −1 1 1
1 0 0
1 1 0 D=
3−11 003 001
9a. L=
10a.L= 2 1 0 , D= 0 1 0 , U′=
9b. x=[−1,2,1]T
1−1/2 1
0 1 1 10b.x=[−1,1,1]T 001 ⎤
0 0
l22u23 0 ⎥⎦
l32u23 +l33 l33u34
12. A−1 = 1 15
⎡ 16a. X−1 =⎢⎣
11 −5 −7
l11 l11u12
14a. ⎢⎣l21 l21u12 +l22
1 0 0 −2 0 0 −131 0⎡0−1
−13 −8
10 5
11 1
0 l32 0 0⎡
l43 ⎤ l43u34 +l44 0 −1 −1 1
2 0 0 1−1/2 1 0 −2 0 U′= 0 1 −1/2
⎤
1 1 −1 1⎥⎦ −1 0 1 −1
0 0 0 1
1 0 0 −1
Computer Problems 8.1
⎡ 536 −668 3. Case 4: p5(A) = ⎢⎣−668 994
Problems 8.2
3. d. 5. e. 9. b. Problems 8.3
9. c. 11. d.
Computer Problems 8.3
458 −854
−186 ⎤ 458⎥⎦
458 −854 994 −668 −186 458 −668 536
16b. X−1 =⎢⎣−1 0 −1 1⎥⎦ −1 −1 0 1
1 1 1 −1
11. Eigenvalues/eigenvectors: 1, (−1, 1, 0, 0); 2, (0, 0, −1, 1); 5, (−1, 1, 2, 2)
Problems 8.4
1. a. Problems 9.1
1. Yes
6. In Problem 9.1.5, the bracketed expression is f ′(ξ1) − f ′(ξ2) and in magnitude does not exceed 2C.
10. Σ_{i=1}^{n} f(t_i)S_i is a linear combination of 1st-degree spline functions having knots t_0, ..., t_n. Hence, it is also such a function. Its value at t_j is Σ_{i=1}^{n} f(t_i)S_i(t_j) = f(t_j). S_i(x) = 0 if x < t_{i−1} or x > t_{i+1}. On (t_{i−1}, t_i), S_i(x) is given by (x − t_{i−1})/(t_i − t_{i−1}). On (t_i, t_{i+1}), S_i(x) is given by (x − t_{i+1})/(t_i − t_{i+1}). S_0 and S_n are slightly different.
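The hat functions described in answer 10 can be sketched in a few lines. This is an illustrative version only: the names hat and interpolant are hypothetical, the knots are passed as a sorted list t[0..n], and the end functions S_0 and S_n are handled by simply dropping the missing half.

```python
def hat(i, t, x):
    """First-degree spline S_i on knots t with S_i(t[i]) = 1 and S_i(t[j]) = 0 for j != i."""
    if i > 0 and t[i - 1] <= x <= t[i]:
        return (x - t[i - 1]) / (t[i] - t[i - 1])   # rising part on (t[i-1], t[i])
    if i < len(t) - 1 and t[i] <= x <= t[i + 1]:
        return (x - t[i + 1]) / (t[i] - t[i + 1])   # falling part on (t[i], t[i+1])
    return 0.0


def interpolant(t, f_values, x):
    """Linear combination sum_i f(t_i) S_i(x); it reproduces f at every knot."""
    return sum(fv * hat(i, t, x) for i, fv in enumerate(f_values))
```

Between two knots only the two neighboring hat functions are nonzero, so the sum reduces to ordinary linear interpolation.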
9. Knots 50π × 10⁸ ≈ 1.57 × 10¹⁰.
12.
17. 19.
If S is piecewise quadratic, then clearly S′ is piecewise linear. If S is a quadratic spline then S ∈ C1. Hence, S′ ∈ C.
Hence, S′ is piecewise linear and continuous.
Q0(x) = −(x + 1)² + 2, Q1(x) = −2x + 1, Q2(x) = 8(x − 1/2)² − 2(x − 1/2)
Q3(x) = −5(x − 1)² + 6(x − 1) + 1, Q4(x) = 12(x − 2)² − 4(x − 2) + 2
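A quick numerical check that the pieces Q0, ..., Q4 above fit together as a quadratic spline (continuous with continuous first derivative) might look like this. The joining knots 0, 1/2, 1, 2 are an assumption inferred from the shifted arguments of the pieces; the problem statement itself is not reproduced here, and the helper name max_jump is hypothetical.

```python
# Each entry is (Q_i, Q_i') for the pieces listed in the answer.
pieces = [
    (lambda x: -(x + 1) ** 2 + 2,                    lambda x: -2 * (x + 1)),
    (lambda x: -2 * x + 1,                           lambda x: -2.0),
    (lambda x: 8 * (x - 0.5) ** 2 - 2 * (x - 0.5),   lambda x: 16 * (x - 0.5) - 2),
    (lambda x: -5 * (x - 1) ** 2 + 6 * (x - 1) + 1,  lambda x: -10 * (x - 1) + 6),
    (lambda x: 12 * (x - 2) ** 2 - 4 * (x - 2) + 2,  lambda x: 24 * (x - 2) - 4),
]
joints = [0.0, 0.5, 1.0, 2.0]  # assumed interior knots


def max_jump():
    """Largest mismatch in value or slope between neighboring pieces at the joints."""
    worst = 0.0
    for k, knot in enumerate(joints):
        (q_left, dq_left), (q_right, dq_right) = pieces[k], pieces[k + 1]
        worst = max(worst,
                    abs(q_left(knot) - q_right(knot)),
                    abs(dq_left(knot) - dq_right(knot)))
    return worst
```

A zero result from max_jump() is consistent with S ∈ C¹, which is the property the preceding answer relies on.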
The answer is given by Equation (8). 20a. Yes 20b. No 20c. No 21. Yes
32. S0(x) = −(5/7)(x − 1)³ + (12/7)(x − 1)
Problems 9.2
1. No 2. No 4. a=−4, b=−6, c=−3, d=−1, e=−3 5. a=−5, b=−26, c=−27, d=27 6. No
8a. (m+1)n 8b. 2n 8c. (m−1)(n−1) 8d.m−1 ⎧
⎨x2 [0,1] 10. S=⎩1+2(x−1)+(x−1)2+(x−1)3 [1,2] 5+7(x −2)+4(x −2)2 [2,3]
12. a=3, b=3, c=1 13. No 15. a=−1, b=3,
2
7a. S(x) is not continuous at x = −1. S′′(x) is not continuous at x = −1, 1.
c=−2, d=2
No 26. S is linear.
S2(x) = −(5/7)(x − 3)³ + (6/7)(4 − x)³ + (12/7)(x − 3) − (6/7)(4 − x)
S3(x) = −(5/7)(5 − x)³ + (12/7)(5 − x)
33. The conditions on S make it an even function. If S(x) = S0(x) in [−1, 0] and S(x) = S1(x) in [0, 1], then S1(0) = 1, S1′(0) = 0, S1′′(1) = 0, and S1(1) = 0. An easy calculation yields S1(x) = 1 − (3/2)x² + (1/2)x³.
19. f is not a cubic spline 22. p3(x) = x − 0.0175x² + 0.1927x³;
17. n+3
S1(x) = (6/7)(x − 2)³ − (5/7)(3 − x)³ − (6/7)(x − 2) + (12/7)(3 − x)
11 22 38. 5n,n+4 39. Yes
Problems 9.3
2. Chebyshev polynomials recurrence relation. See Section 12.2.
3. B_i^2(x) =
   (x − t_i)² / [(t_{i+2} − t_i)(t_{i+1} − t_i)], on [t_i, t_{i+1}]
   (x − t_i)(t_{i+2} − x) / [(t_{i+2} − t_i)(t_{i+2} − t_{i+1})] + (t_{i+3} − x)(x − t_{i+1}) / [(t_{i+3} − t_{i+1})(t_{i+2} − t_{i+1})], on [t_{i+1}, t_{i+2}]
   (t_{i+3} − x)² / [(t_{i+3} − t_{i+1})(t_{i+3} − t_{i+2})], on [t_{i+2}, t_{i+3}]
   0, elsewhere
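The piecewise formula for the quadratic B-spline in answer 3 translates directly into code. The sketch below is an illustration under two assumptions: the knots are distinct and sorted in a list t, and half-open intervals are used so that each x falls into exactly one piece. The function name bspline2 is hypothetical.

```python
def bspline2(i, t, x):
    """Quadratic B-spline B_i^2(x) on knots t[i], t[i+1], t[i+2], t[i+3]."""
    a, b, c, d = t[i], t[i + 1], t[i + 2], t[i + 3]
    if a <= x < b:
        return (x - a) ** 2 / ((c - a) * (b - a))
    if b <= x < c:
        return ((x - a) * (c - x) / ((c - a) * (c - b))
                + (d - x) * (x - b) / ((d - b) * (c - b)))
    if c <= x < d:
        return (d - x) ** 2 / ((d - b) * (d - c))
    return 0.0
```

A useful sanity check is that the B-splines covering a point sum to one there, for example sum(bspline2(i, t, x) for i in range(len(t) - 3)) at any x with at least two knots on each side.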
5. ∞ f(ti)B0(x) 14. n−kim−1
i=−∞ i
15. Use induction on k and Bk+i (x) = 0 on [ti , ti+1].
19. ∞ dx i=1 i i=1 i
24. Use Equation (14) with all A’s zero except A_j = 1. Next, take all A’s zero except A_{j+1} = 1. 28. No 30. Let C_i^2 = t_{i+1}t_{i+2}, then C_i^1 = x t_{i+1}, and C_i^0 = x².
32. B_i^k(t_j) = 0 iff t_j ≥ t_{i+k+1} or t_j ≤ t_i 33. x = (t_{i+3}t_{i+2} − t_i t_{i+1})/(t_{i+3} + t_{i+2} − t_{i+1} − t_i)
17. No
20. In Equation (9), take all ci = 1. Then di = 0. Hence, d n Bk(x) = 0 and n Bk(x) is constant.
i+i
16. No
i=−∞i
ti+1 B1(x)
Computer Problems 9.3
7. 47040
Problems 10.1
1a. x = (1/4)t⁴ + (7/3)t³ − (2/3)t^{3/2} + c 1b. x = ce^t
1e. x = c1 e^t + c2 e^{−t} or x = c1 cosh t + c2 sinh t
2a. x = (1/3)t³ + (3/4)t^{4/3} + 7
2c.
5a.
f(t,x)=+ x/ 1−t2
real function f(t, x)
real t, x
f ← t²/(1 − t + 2x)
end function f
3. x(−0.2)=1.92
∞
(−1)n
t2n+1
(2n + 1)(2n + 1)!
2 2 +c 3d. x=e−t /2 t2et /2dt+c
3c. x=
4. x =a0 +a0 n=1(−1) 2n−1(2n)! t + n=1(−1) (2n+1)!
n=0 ∞ n (2n − 1)! 2n ∞ n−1 n!2n
t
2n+1
6. Let p(t) = a0 + a1 t + a2 t² + ··· and determine a_i.
9. t = 10, Error = 2.2 × 10⁴ε; t = 20, Error = 4.8 × 10⁸ε
10. x^(4) = 18x x′x′′ + 6(x′)³ + 3x²x′′′
11a. x′ = x + e^x; x′′ = (1 + e^x)x′; x′′′ = (1 + e^x)x′′ + e^x(x′)²; x^(4) = (1 + e^x)x′′′ + 3e^x x′x′′ + e^x(x′)³. 12. x(0.1) = 1.21633
14. n ← 20
s ← x^(n)
for i = 1 to n − 1 do
    s ← x^(n−i) + [h/real(n + 1 − i)]s
end for
s ← x + h[s]
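The nested evaluation in answer 14 computes x + h x′ + (h²/2!) x′′ + ··· + (hⁿ/n!) x^(n) without forming factorials. A Python rendering is sketched below; the function name taylor_step and the convention that derivs[k − 1] holds x^(k) are choices made for this illustration, not part of the text.

```python
import math


def taylor_step(x, derivs, h):
    """Advance x by one Taylor-series step of size h.

    derivs[k - 1] must hold the k-th derivative x^(k) at the current point,
    for k = 1, ..., n.  The loop mirrors the pseudocode: start from x^(n)
    and fold in the lower derivatives with factors h/(n + 1 - i).
    """
    n = len(derivs)
    s = derivs[n - 1]
    for i in range(1, n):
        s = derivs[n - i - 1] + (h / (n + 1 - i)) * s
    return x + h * s
```

For example, with x′ = x and x(0) = 1 every derivative equals 1, so taylor_step(1.0, [1.0] * 20, 0.5) should agree closely with math.exp(0.5).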
Computer Problems 10.1
1. x(2.77)=385.79118 3. x(10) = 22026.47
7. x(1)=1.6487212691
Problems 10.2
2b. x(1.75)=0.632999983 4a. Error at t = 1 is 1.8 × 10−10.
2c. x(5)=−0.2087351554 5. x(0) = 0.03245 34427
9. x(0)=1.6798409205×10−3
10. x(0)=−3.7594073450
8. Solve df =e−x2, f(0)=0. dx 3
2 2 2 10.h3 1−α D2f+h fxDf where D=∂+f∂ and D2=∂ +2f ∂ +f2∂
646 ∂t∂x∂t2∂x∂t∂x2 11.h= 1
1024
12. Let’s make local truncation error ≤ 10⁻¹³. Thus, 100h⁵ ≤ 10⁻¹³ or h ≤ 10⁻³. So take h = 10⁻³ and hope that the three extra digits will be enough to preserve 10-digit precision.
(4)3 2 2 3∂3 ∂3 2∂3 3∂3 14b.x =D f+fxD f+3DfxDf+fx Df where D =∂t3+3f∂x∂t2+3f ∂t∂x2+f ∂x2
15. f(x + th, y + tk) = f(x, y) + t[f1(x, y)h + f2(x, y)k] + (1/2)t²[f11(x, y)h² + 2f12(x, y)hk + f22(x, y)k²] + ···. Now let t = 1 to get the usual form of Taylor’s series in two variables.
17. Taylor series of f (x, y) = g(x) + h(y) about (a, b) is equal to the Taylor series of g(x) about a plus that of h(y) about b.
18. f(1+h,k)≈−3h+3h2+k2 19. e1−xy ≈3−x−y 20. A=1+k+1k2,B=h(1+k) 22
21. A=1, B=h−k, C=(h−k)2
22. f(x + h, y + k) ≈ [1 + 2xh + k + (1 + 2x²)h² + 2hkx + (1/2)k²] f; f(0.001, 0.998) ≈ 2.7128534
Computer Problems 10.2
2. x (1) = 1.5708 3b. n = 7; x (2) = 0.82356 78972 (RK),
3c. n = 7; x(2) = −0.49999 99998 (RK), −0.50000 00012 (TS) 4. x(1) = 0.60653 = x(3) 5. x(3) = 1.5 6. x(0) = 1.0 = x(1.6) 8. x(1) = 3.95249 9. x(10) = 1.344 × 10⁴³
Problems 10.3
1. Let h = 1/n. Then x(1) = e⁻¹ (true solution) and x_n = {[1 − 1/(2n)]/[1 + 1/(2n)]}ⁿ (approximate solution).
2. x(t + h) = x(t − h) + (h/3)[f(t − h, x(t − h)) + 4f(t, x(t)) + f(t + h, x(t + h))]
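The closed form in answer 1 can be checked numerically. The sketch below assumes the underlying problem is the implicit trapezoid rule applied to x′ = −x, x(0) = 1 on [0, 1], which is what the closed form suggests; the problem statement itself is not reproduced here, and the function names are hypothetical.

```python
import math


def trapezoid_decay(n):
    """Implicit trapezoid rule for x' = -x, x(0) = 1, taking n steps of size h = 1/n."""
    h = 1.0 / n
    x = 1.0
    for _ in range(n):
        # x_new = x + (h/2)(-x - x_new), solved for x_new.
        x = x * (1.0 - h / 2.0) / (1.0 + h / 2.0)
    return x


def closed_form(n):
    """The approximate solution x_n quoted in the answer."""
    return ((1.0 - 1.0 / (2 * n)) / (1.0 + 1.0 / (2 * n))) ** n
```

Comparing abs(trapezoid_decay(n) - math.exp(-1)) for n and 2n shows the error shrinking by roughly a factor of four, as expected for a second-order method.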
4. a = 24/13, b = −11/13, c = 2/13, d = 10/13, e = −(2/39)h²
5. a = 1, b = c = h/2; Error term is O(h³).
0.82356 78970 (TS)
8. ∂x(9, s)/∂s = e²⁵² ≈ 10¹⁰⁹ 9a. All t.
Computer Problems 10.3
5. x1=2.25 6. x−1=−4.5 22
12. 0.2193839244 13. 0.9953087432 Problems 11.1
9c. Positive t.
9e. No t.
11. Divergent for all t.
y(x)=[1−lnv(x)]v(x)
9. y(e)=−6.3890560989 15. Si(1)=0.9460830703
where
1. x(t + h) = x[1 + (1/2)h² + (1/24)h⁴] + y[h + (1/6)h³ + (1/120)h⁵], y(t + h) = y[1 + (1/2)h² + (1/24)h⁴] + x[h + (1/6)h³ + (1/120)h⁵]
2. Since system is not coupled, solve two separate problems.
3. System is not coupled so each differential equation can be solved individually by the program.
⎡⎤
1
4. X′=⎣x12+logx2+x02 ⎦, X(0)=[0,1,3]T
ex2 −cosx1+sin(x0x1)−(x1x2)7 Computer Problems 11.1
1. x(1)=2.4686939399, y(1)=1.2873552872 2. x(0.38)=1.90723×1012, y(0.38)=−8.28807×104 π π π π
4. x(−1)=3.36788, y(−1)=2.36788 5. x1 2 =x4 2 =0, x2 2 =1, x3 2 =−1 7. x(6) = 4.39411, y(6) = 3.10378
Problems 11.2
x 2
1.X′= x3 X(0)=[1,−3,5]T 2×2 +logx3 +cosx1
3. Solve each equation separately since they are not coupled.
⎡x
2
−3/2
⎤ ⎡0.5 ⎤
4.X′=⎢−x1 x12+x32
⎣x4⎦ 0.25
⎡⎤
1
⎢x2 ⎥ 5. X′ =⎢⎣x3 ⎥⎦
x4
x42 + cos(x2x3) − sin(x0x1) + log(x1/x0) ⎡⎤⎡⎤
x4 3
⎢x5 ⎥ ⎢3 ⎥
⎥ X(0)=⎢⎣0.75⎥⎦ − x 3 x 12 + x 32 − 3 / 2 1 . 0
X(0)=[0,1,3,4,5]T
2tx6 +2tex1x3 3 x
2 7a. Letx1=x,x2=x′,x3=x′′.ThenX′= x3
6. X′=⎢x6 ⎥ X(1)=⎢2 ⎥ ⎢2x1x3x4 +3x12x2t2⎥ ⎢−79/12⎥
⎣ex2x5 +4x1t2x3 ⎦ ⎣2 ⎦
8.X′= x2
x2 − x1
X(0)=[0,1]T
−x3 sin x1 − t x2 − x3 ⎡⎤
1
⎢x3 ⎥ 9. Letx0=t,x1=x,x2=y,x3=x′,x4=y′.ThenX′=⎢⎣x4 ⎥⎦
X(0) = [0,1,3,2,4]T Problems 11.3
1. xj(t)=eλjtxj(0) Problems 12.1
1. y(x)=1
1 m
maxima exists—only a minima.
12. c = 10**(m + 1)−1 mk=0(yk − log xk). 13. y = (6x − 5)/10 16. a ≈ 2.5929, b ≈ −0.32583, c ≈ 0.022738
x1 +x2 −2×3 +3×4 +logx0 2×1 −3×2 +5×3 +x0x2 −sinx0
yk=(y0+···+ym)/(m+1),theaverageoftheyvalueswhichdoesnotinvolveanyxi. m m
2. f(x)=m+1
3. a = (1 + 2e)/(1 + 2e2), b = 1 5. a = 2.1, b = 0.9 7. c = k=0 yk log xk k=0 (log xk)2
k=0
11. φ involves the sum of m + 1 polynomials of degree two in c which is either concave upward or a constant. Thus, no
m 18. a=1, b=1 19. y(x)=2×2+29 20. y=x+1 21. c=
k=0
m exk f(xk)
k=0
Problems 12.2
2. 3.
5. 6. 7.
8.
wn+2 =wn+1 =0
wk =ck+3xwk+1+2wk+2 (k=n,n−1,…,0)
3 735
e2xk
f (x) = w0 − (1 + 2x)w1
Since cos(n − 2)θ = cos[(n − 1)θ − θ ] = cos(n − 1)θ cos θ + sin(n − 1)θ sin θ , we have 2 cos θ cos(n − 1)θ −
cos(n − 2)θ = cos(n − 1)θ cos θ − sin(n − 1)θ sin θ = cos(n θ ). Note if gn (θ ) = cos n θ , then gn (θ ) = 2 cos θ gn −1 (θ ) − gn−2(θ).
By the previous problem, the recursive relation is the same as (2) so that Tn(x) = fn(x) = cos(n arccos x). Tn(Tm(x)) = cos(n arccos(cos(m arccos x))) = cos(nm arccos x) = Tnm(x).
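The identities used in these answers are easy to confirm numerically. The sketch below evaluates T_n by the three-term recurrence T_n(x) = 2x T_{n−1}(x) − T_{n−2}(x) and leaves the comparison with cos(n arccos x) and with the composition T_n(T_m(x)) = T_{nm}(x) to the caller; the function name cheb is chosen for this illustration.

```python
import math


def cheb(n, x):
    """Chebyshev polynomial T_n(x) from T_0 = 1, T_1 = x, T_k = 2x T_{k-1} - T_{k-2}."""
    if n == 0:
        return 1.0
    t_prev, t_curr = 1.0, x
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr
```

For x in [−1, 1], cheb(n, x) should agree with math.cos(n * math.acos(x)), and cheb(2, cheb(3, x)) should agree with cheb(6, x).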
|Tn(x)| = |cos(n arccos x)| ≤ 1 for all x ∈ [−1, 1] since |cos y| ≤ 1 and for arccos x to exist x must be |x| ≤ 1.
g0(x) = 1
g1(x) = (x + 1)/2
gj(x) = (x + 1)gj−1(x) − gj−2(x) (j ≥ 2)
10. n + 2 multiplications, 2n + 1 additions/subtractions if 2x is computed as x + x 12. n multiplications, 2n additions/subtractions
13. T6(x) = 32x⁶ − 48x⁴ + 18x² − 1
y1x13 − y2x13
17. α = 2 1 . α is very sensitive to perturbations in y1.
Computer Problems 12.2
x12x12(x2 − x1) 12
0 7.aij= (m+1)
(m + 1)/2 Problems 12.3
(i≠ j) (i=j=1) (i = j > 1)
2. Coefficient matrix for the normal equations has elements a_ij = 1/(i + j − 1)
3. c = 0 4. y = bx 6. c = ln 2 8. x = −1, y = 20 13
15.
16.
y≈ 1 .Changeto1≈a+bx.
a+bx y⎡ 2π ⎤ π02a (1/2)(e−1)
0 π/2 2 0
0 b π/2 c
=⎣−(2/5)(e2π +1)⎦ (1/5)(e2π +1)
c=3
Computer Problems 12.3
17.
1.a=2, b=3
Problems 13.1
y sinxn (sinx)2
20. c=n
i=1i i i=1 i
1. l0 = 123456; x1 = .96621 2243; x2 = .12917 3003; x3 = .01065 6910
by (5). 9a. c = 24
9b. c = 3
14. No.
π3
Computer Problems 13.1
8. 32.5% 11. Sequence not periodic. 0123456789
13.
97 93 97 107 90 115 88 101 113 99 15. 5.6% 16. 200
Problems 13.2
1. m > 4 million
Computer Problems 13.2
2. 1.71828 4. 8 5. 49.9 7. 0.518 9. 1.11 10. 2.000346869 14. 0.635 17b. 8.3
Computer Problems 13.3
1. 2 2. 0.898 4. 7 6. 1.05 3 16
14. 11.6 kilometers 15. 0.14758 Problems 14.1
7. 5.24 17. 0.009
9. 0.996 12. 0.6394
21. 24.2 revolutions 23. 0.6617
2. c1 =1−2e1−e2, c2 =2e−e21−e2 3a. x(t)=eπ+t −eπ−te2π −1 3b. x(t)=t4−25t+1212 4a. x(t)=βsint+αcostforall(α,β)
4b. x(t) = c1 sin t + α cos t for all α + β = 0 with c1 arbitrary 6. φ(t) = z 7. φ(z) = z 8. φ(z) = √(9 + 6z)
9. φ(z)=e5+e+ze4−z2e2 10. Two ways: Use x′′(a) = z or x′(b) = z, x′′(b) = w.
11. x(t)=−et +2ln(t+1)+3t
14a. This is a linear problem. So two initial-value problems can be solved as in the text to obtain the solution. The two sets of initial values would be x(0) = 0, x′(0) = 1 and x(0) = 1, x′(0) = 0.
15. Solution of x′′ = −x, x(0) = 1, x′(0) = z is x(t) = cos t + z sin t. So φ(z) = x(π) = −1. Since φ is constant, we cannot get φ(z) = 3 by any choice of z!
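The constancy of φ in answer 15 can be observed numerically. The sketch below integrates x′′ = −x, x(0) = 1, x′(0) = z to t = π with the classical fourth-order Runge-Kutta method and returns x(π); the choice of integrator and the step count are assumptions made for this illustration, and the function name phi simply mirrors the notation of the answer.

```python
import math


def phi(z, n=2000):
    """Shooting function: x(pi) for x'' = -x, x(0) = 1, x'(0) = z, via RK4 on (x, v)."""
    h = math.pi / n
    x, v = 1.0, z
    for _ in range(n):
        k1x, k1v = v, -x
        k2x, k2v = v + 0.5 * h * k1v, -(x + 0.5 * h * k1x)
        k3x, k3v = v + 0.5 * h * k2v, -(x + 0.5 * h * k2x)
        k4x, k4v = v + h * k3v, -(x + h * k3x)
        x += h * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return x
```

Whatever slope z is tried, the returned value stays at −1 up to integration error, so no choice of z can hit a boundary value of 3.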
2. x1 ≈0.29427, x2 ≈0.57016, x3 ≈0.81040
1′1
z1 = x (0) = B. So x1 = 3 cos t + z1 sin t, x2 = 3 cos t + z2 sin t. By Equation (10), x = λx1 + (1 − λ)x2 and
λ = [β − x2(b)]/[x1(b) − x2(b)] = [7 − (−3)]/[(−3) − (−3)] = 10/0.
Problems 14.2
1. − 1− h xi−1 +2(1+h2)x1 − 1− h xi+1 =−h2t 2 2
4.x′(0)=5 8.−xi−1+2+(1+ti)h2 xi−xi+1=0 3
9. x(t) = [7/u(2)]u(t)
11. x′′ = −x1,x1(0) = 3,x′(0) = z1 implies x = Acost + Bsint,3 = x(0) = A,x′ = −Asint + Bcost. Let
Computer Problems 14.2
2a. x =1/(1+t) 2b. x =−log(1+t) Problems 15.1
1a. Elliptic. 1c. Parabolic. 1f. Hyperbolic. 2. (1/r) ∂/∂r (r ∂u/∂r) + (1/r²) ∂²u/∂θ² = 0
4. Equation (3) shows that u(x, t + k) is a convex combination of values of u(x, t) in the interval [0, 1]. So it remains in the interval.
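The convex-combination argument in answer 4 can be illustrated with one step of an explicit scheme. The sketch assumes that Equation (3) has the standard form u(x, t + k) = σu(x + h, t) + (1 − 2σ)u(x, t) + σu(x − h, t) with σ = k/h²; that form is an assumption here, since the equation itself is not reproduced in this answer section, and the function name is hypothetical.

```python
def explicit_heat_step(u, sigma):
    """One explicit time step for the heat equation with fixed boundary values.

    For 0 <= sigma <= 1/2 the three weights sigma, 1 - 2*sigma, sigma are
    nonnegative and sum to 1, so each new interior value is a convex
    combination of old values and stays between their minimum and maximum.
    """
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = sigma * u[i + 1] + (1.0 - 2.0 * sigma) * u[i] + sigma * u[i - 1]
    return new
```

With σ > 1/2 the middle weight turns negative and the bound can fail; for instance, one step from [0, 1, 0] with σ = 0.6 produces a negative interior value.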
5. a = [1 + 2kh⁻²(cos πh − 1)]^{1/k}
6. The right-hand side is changed by b1 + c0 in place of b1 and bn−1 + cn replacing bn−1 for both (5) and (7).
7. In (6), b1 is replaced by b1 + g(t), b_{n−1} by b_{n−1} + g(t). At the level zero, b_i = f(ih) for 1 ≤ i ≤ n − 1.
8. u(x,t+k)=h2(1−h)u(x+h,t)+h2 ⎡01 ⎤
⎢−1 0 1 ⎥
9. A = ⎢ … … … ⎥
⎣ −101⎦ −2 2
Problems 15.2
k +h−2 u(x,t)+h2u(x−h,t)
k kh2 k
1. −0.21 2. u_xx = f′′(x + at) + g′′(x − at), u_tt = a²f′′(x + at) + a²g′′(x − at) = a²u_xx
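The identity u_tt = a²u_xx for u(x, t) = f(x + at) + g(x − at) can also be checked with central second differences. The sketch below is only a numerical illustration; the helper names, the step size d, and the sample functions mentioned afterward are choices made here, not part of the text.

```python
import math


def second_diff(func, z, d):
    """Central second-difference approximation to func''(z)."""
    return (func(z + d) - 2.0 * func(z) + func(z - d)) / d ** 2


def wave_residual(f, g, a, x, t, d=1e-3):
    """Approximate u_tt - a^2 u_xx at (x, t) for u(x, t) = f(x + a t) + g(x - a t)."""
    def u(xx, tt):
        return f(xx + a * tt) + g(xx - a * tt)
    u_xx = second_diff(lambda xx: u(xx, t), x, d)
    u_tt = second_diff(lambda tt: u(x, tt), t, d)
    return u_tt - a * a * u_xx
```

For smooth choices such as f = math.sin and g(s) = exp(−s²), the residual is small and shrinks further as d is reduced (until rounding error takes over).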
3. u(x, t) = (1/2)[F(x + t) − F(−x + t)] + (1/2)[G(x + t) − G(−x + t)] where G is the antiderivative of G
Computer Problems 15.2
1. real function fbar(x)
real x, xbar
xbar ← x + 2 real(integer(−(1 + x)/2))
if xbar < 0 then
    fbar ← −f(−xbar)
else
    fbar ← f(xbar)
end if
end function fbar
Problems 15.3
22
5. 20+ 2.5h ui+1,j + 20− 2.5h ui−1,j + −30+0.5h ui,j+1+ xi +yj xi +yj yj
−30+0.5h ui,j−1+20uij =69h2 yj
⎡−4 1 1 0⎤ 6. u0,1≈−8.932×10−3; u1,1≈4.643×10−1 7. A=⎢⎣ 1 −4 0 1⎥⎦
2 2 2 1 0 −4 1 0 1 1 −4
Computer Problems 15.3
5. 18.41◦ 13.75◦
41.47◦ 36.60◦ 24.41◦
69.41◦ 66.77◦ 61.05◦ 53.01◦ 51.00◦
Problems 16.1
1. F(2,1,−2) = −15; F(0,0,−2) = −8; F(2,0,−2) = −12 2. F 9, 9 = −20.25 88
x =(3a+b)/4+δ if ax∗b′
4. Casen=2: x=(a+3b)/4−δ if a′x∗b 5a. ExactsolutionF(3)=−7.
√√ 7.A=α/5, A=−β/5
9. By (6), y + rb = a + r²(b − a) + rb = ar + b since r² + r = 1. Moreover, r(y + rb) = a + r(b − a) = x. Thus, yr + r²b = x or y + r²(b − y) = x.
10. n ≥ 1 + (k + log l − log 2)/|log r| 11. n ≥ 48
13. Minimum point of F is a root of F′. Newton’s method to find root of F′: x_{n+1} = x_n − F′(x_n)/F′′(x_n). Formula does not involve F itself.
14. To find minimum of F, look for root of F′. Secant method to find root of F′ is x_{n+1} = x_n − F′(x_n)(x_n − x_{n−1})/[F′(x_n) − F′(x_{n−1})]. Formula does not involve F.
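Both iterations in answers 13 and 14 can be sketched compactly. The versions below are illustrative only: the function names, tolerances, and iteration limits are assumptions, and neither routine checks that the stationary point it finds is actually a minimum rather than a maximum.

```python
def newton_minimize(dF, d2F, x0, tol=1e-12, max_iter=50):
    """Newton's method applied to F': x <- x - F'(x)/F''(x)."""
    x = x0
    for _ in range(max_iter):
        step = dF(x) / d2F(x)
        x -= step
        if abs(step) < tol:
            break
    return x


def secant_minimize(dF, x0, x1, tol=1e-12, max_iter=100):
    """Secant method applied to F'; only values of F' are needed."""
    for _ in range(max_iter):
        g0, g1 = dF(x0), dF(x1)
        if g1 == g0:          # flat secant line: cannot divide, stop here
            break
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            break
    return x1
```

As an example, for F(x) = x⁴/4 − x one has F′(x) = x³ − 1 and F′′(x) = 3x², and both routines started near x = 2 approach the minimizer x = 1.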
15b. Square both sides to obtain r² = 1 + √(1 + √(1 + ···)) = 1 + r.
15d. 1 + r⁻¹ + r⁻² + ··· = (1 − r⁻¹)⁻¹ by series expansion. Hence, r = (1 − r⁻¹)⁻¹ − 1 = 1/(r − 1) or r² = r + 1.
Problems 16.2
1a. Yes 1b. No 2. 1/4, 9/4 3. F(x, y) = 1 + x − xy + (1/2)x² − (1/2)y² + ···
6. The slope of the tangent is dy/dx = −Fx/Fy ≡ m1. The gradient has direction numbers Fx and Fy, and its slope is Fy/Fx ≡ m2. The condition of perpendicularity m1m2 = −1 is met.
7b. F(x)= 3 − 1x2 +3x1x2 +x2x3 +2x2 − 1x2 +··· 2212325
9a. G(1,0)=
Fx
−2 5
9b. G(1,2,1)= 2
⎡2y2z2 sinx cosx ⎤ 10. G=⎣2yz2(1+sin2x)+2(y+1)(z+3)2⎦
2y2z(1+sin2 x)+2(y+1)2(z+3) Problems 17.1
2. maximize: −5x −6x +2x ⎧123
⎪−2x1 + 3x2 −5
⎨ x1 + x2 15 constraints: ⎪ 2x1 − x2 + x3 25 ⎪ ⎪⎩ − x 1 − x 2 + x 3 − 1
x1 0, x2 0, x3 0 4a. Minimum value 1.5 at (1.5, 0).
12. −19,−1 30 5
5b. maximize: −3x+2y−5z
⎧⎪⎨ − x − y − z − 4
constraints: x−y−z 2 ⎪⎩ − x + y + z − 2
x0, y0, z0 6a. maximize: 2x +2x −6x −x
6b. minimize: 25y +20y −5y ⎧1 2 3
7. Maximum of 36 at (2, 6)
8. Minimum of 36 at (0, 3, 1) 13c. Unbounded solution
y1, y2, y3, y4 ≥ 0
11. Minimum 2 for (x, x − 2) where x ≥ 3
13f. No solution
13a. Maximum of 18 at (9, 0)
13h. Maximum of 54 at 18 , 0 55
14. Maximum of 100 at (24, 32, −124)
17. Its feasible set is empty.
⎧1 2 3 4
+ x4 ⎨x1+x2+x3+x4 =20
= 25 constraints: ⎪−4x1 − 6x3 + x5 = −5
⎪ 3x1
⎪3y1 + y2 − 4y3 − 2y4 2 ⎪⎨ y2 2 y2 − 6y3 − 3y4 −6 ⎪⎩ y1 +y2 −2y4 −1
⎪⎩−2x1 −3x3 −2x4 +x6 = 0 x1, x2, x3, x4, x5, x60
constraints:
Computer Problems 17.1
1.
Felt Straw 0 200 150 0 150 0
Texas Hatters Lone Star Hatters Lariat Ranch Wear
3. $13.50 5. Cost 50¢ for 1.6 ounces of food f1, 1 ounce of food f3, and none of food f2.
Problems 17.2
1. maximize: nj=0 cj yj Here c0 = −nj=1 cj and ai0 = −nj=1 aij.
nj = 0 a i j y j b i yi0 (0in)
5. First primal form: maximize: −bT y
−AT y −c
constraints:
y0
constraints: 2. At most 2n .
6. GivenAx=b.Letyj =xj +yn+1.Now minimize: yn+1
n n n aijxj −bi = aijyj −yn+1
constraints:
⎧⎪⎨n n aijyj + − aij
⎪⎩ j = 1 y0
j = 1
Computer Problems 17.2
1b. x = [0, 0, 5/3, 2/3, 0]^T 1c. x = [0, 8/3, 5/3]^T
j=1
yn+1 =bi
j=1
(1in+1)
aij −bi. j=1
Problems 17.3
1a. maximize: −4 (ui+vi) ⎧ i=1
⎪5y1 +2y2 − 7y4 −u1 +v1 = 6
⎨ y1+ y2+ y3− 3y4−u2+v2= 2 constraints:⎪ 7y2 −5y3 − 2y4 −u3 +v3 =11 ⎪⎩6y1 +9y3 −15y4 −u4 +v4 = 9
u0 v0 y0
1b. minimize: ε
⎧⎪ 5y1+2y2 − 7y4−ε 6
⎪ y1+ y2+ y3− 3y4−ε 2 ⎪ 7y2 −5y3 − 2y4 −ε 11 ⎪⎨ 6y1 +9y3−15y4−ε 9
constraints: ⎪−5y1 − 2y2 + 7y4 − ε −6 ⎪−y1− y2− y3+ 3y4−ε −2 ⎪ −7y2 +5y3 + 2y4 −ε −11 ⎪⎩−6y1 − 9y3 + 15y4 − ε −9
ε0 yj0 (1i4)
n
minimize:
constraints:
ε ⎧⎪n
ajxj. ⎪ axj f(x) (1im)
3. Takempointsxi (i=1,2,...,m).Letp(x)=
⎪ji i
⎨ j=0
⎪ n
⎪ ajxj+ε f(xi) (1im) ⎪ i
⎪⎩ j = 0 ε0
4. minimize: u1 +v1 +u2 +v2 +u3 +v3
⎧⎪⎨ y 1 − y 2 − u 1 + v 1 = 4
constraints: 2y1 −3y2 + y3 −u2 +v2 =7 ⎪⎩ y 1 + y 2 − 2 y 3 − u 3 + v 3 = 2
y1, y2, y30, u1, u2, u30, v1, v2, v30 Computer Problems 17.3
1a. x1 = 0.353, x2 = 2.118, x3 = 0.765 1b. x1 = 0.671, x2 = 1.768, x3 = 0.453
3. p(x) = 1.0001 + 0.9978x + 0.51307x² + 0.13592x³ + 0.071344x⁴
Problems B
1a. e ≈ (2.718)10 = (010.101 101 111 100 111 . . .)2
2d. (27.45075 341 . . .)8 2e. (113.1666213...)8 2f. (71.24426416...)8
3a. (441.681640625)10 3b. (613.40625)10
4c. (101111)2 4e. (110011)2 4g. (33.72664)8
6. (0.31463146...)8
9. (479)10 = (111011111)2
12. A real number R has a finite representation in binary system. ⇔ R = (a_m a_{m−1} ... a_1 a_0 . b_1 b_2 ... b_n)2. ⇔ R = (a_m ... a_1 a_0 b_1 b_2 ... b_n)2 × 2⁻ⁿ = m × 2⁻ⁿ where m = (a_m a_{m−1} ... a_1 a_0 b_1 b_2 ... b_n)2.
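The criterion in answer 12 and the conversions in answers 1a and 9 can be reproduced with exact rational arithmetic. The sketch below uses Python's standard fractions module; the function names has_finite_binary and to_binary are hypothetical, and only nonnegative inputs are handled.

```python
from fractions import Fraction


def has_finite_binary(r):
    """True when r = m * 2**(-n) for integers m and n >= 0.

    Equivalently, the denominator of r in lowest terms is a power of two.
    """
    d = Fraction(r).denominator
    return d & (d - 1) == 0


def to_binary(r, max_frac_digits=32):
    """Binary digit string of a nonnegative rational, truncated after max_frac_digits."""
    r = Fraction(r)
    if r < 0:
        raise ValueError("only nonnegative values are handled in this sketch")
    whole = r.numerator // r.denominator
    frac = r - whole
    digits = []
    while frac != 0 and len(digits) < max_frac_digits:
        frac *= 2                                  # shift one binary place left
        bit = frac.numerator // frac.denominator   # next digit is the integer part
        digits.append(str(bit))
        frac -= bit
    text = bin(whole)[2:]
    return text + ("." + "".join(digits) if digits else "")
```

For instance, to_binary(479) gives the digits quoted in answer 9, and to_binary(Fraction(2718, 1000), 15) gives the first fifteen fractional bits quoted in answer 1a.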
Bibliography
Abell, M. L., and J. P. Braselton. 1993. The Mathematical Handbook. New York: Academic Press.
Abramowitz, M., and I. A. Stegun (eds.). 1964. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. National Bureau of Standards. New York: Dover, 1965 (reprint).
Acton, F. S. 1959. Analysis of Straight-Line Data. New York: Wiley. New York: Dover, 1966 (reprint).
Acton, F. S. 1990. Numerical Methods That (Usually) Work. Washington, D.C.: Mathematical Association of America.
Acton, F. S. 1996. Real Computing Made Real: Preventing Errors in Scientific and Engineering Calculations. Princeton, New Jersey: Princeton University Press.
Ahlberg, J. H., E. N. Nilson, and J. L. Walsh. 1967. The Theory of Splines and Their Applications. New York: Academic Press.
Aiken, R. C., ed. 1985. Stiff Computation. New York: Oxford University Press.
Ames, W. F. 1992. Numerical Methods for Partial Differential Equations, 3rd Ed. New York: Academic Press.
Ammar, G. S., D. Calvetti, and L. Reichel. 1999. “Computation of Gauss-Kronrod quadrature rules with non-positive weights.” Electronic Transactions on Numerical Analysis 9, 26–38. http://etna.mcs.kent.edu
Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK Users’ Guide, 3rd Ed. Philadelphia: SIAM.
Armstrong, R. D., and J. Godfrey. 1979. “Two linear programming algorithms for the linear discrete l1 norm problem.” Mathematics of Computation 33, 289–300.
Ascher, U. M., R. M. M. Mattheij, and R. D. Russell. 1995. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations. Philadelphia: SIAM.
Ascher, U. M., and L. R. Petzold. 1998. Computer Methods for Ordinary Differential Equations and Differential Algebraic Equations. Philadelphia: SIAM.
Atkinson, K. 1993. Elementary Numerical Analysis. New York: Wiley.
Atkinson, K. A. 1988. An Introduction to Numerical Analysis, 2nd Ed. New York: Wiley.
Axelsson, O. 1994. Iterative Solution Methods. New York: Cambridge University Press.
Axelsson, O., and V. A. Barker. 2001. Finite Element Solution of Boundary Value Problems: Theory and Computations. Philadelphia: SIAM.
Azencott, R., ed. 1992. Simulated Annealing: Parallelization Techniques. New York: Wiley.
Bai, Z., J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. 2000. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Philadelphia: SIAM.
Baldick, R. 2006. Applied Optimization. New York: Cambridge University Press.
Barnsley, M. F. 2006. SuperFractals. New York: Cambridge University Press.
Barrett, R., M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Philadelphia: SIAM.
Barrodale, I., and C. Phillips. 1975. “Solution of an overdetermined system of linear equations in the Chebyshev norm.” Association for Computing Machinery Transactions on Mathematical Software 1, 264–270.
Barrodale, I., and F. D. K. Roberts. 1974. “Solution of an overdetermined system of equations in the l1 norm.” Communications of the Association for Computing Machinery 17, 319–320.
Barrodale, I., F. D. K. Roberts, and B. L. Ehle. 1971. Elementary Computer Applications. New York: Wiley.
Bartels, R. H. 1971. “A stabilization of the simplex method.” Numerische Mathematik 16, 414–434.
Bartels, R., J. Beatty, and B. Barsky. 1987. An Introduction to Splines for Use in Computer Graphics and Geometric Modelling. San Francisco: Morgan Kaufmann.
Bassien, S. 1998. “The dynamics of a family of one-dimensional maps.” American Mathematical Monthly 105, 118–130.
Bayer, D., and P. Diaconis. 1992. “Trailing the dovetail shuffle to its lair.” Annals of Applied Probability, 2, 294–313.
Beale, E. M. L. 1988. Introduction to Optimization. New York: Wiley.
Björck, Å. 1996. Numerical Methods for Least Squares Problems. Philadelphia: SIAM.
Bloomfield, P., and W. Steiger. 1983. Least Absolute Deviations, Theory, Applications, and Algorithms. Boston: Birkhäuser.
Bornemann, F., D. Laurie, S. Wagon, and J. Waldvogel. 2004. The SIAM 100-Digit Challenge: A Study in High-Accuracy Numerical Computing. Philadelphia: SIAM.
Borwein, J. M., and P. B. Borwein. 1984. “The arithmetic-geometric mean and fast computation of elementary functions.” Society for Industrial and Applied Mathematics Review 26, 351–366.
Borwein, J. M., and P. B. Borwein. 1987. Pi and the AGM: A Study in Analytic Number Theory and Computational Complexity. New York: Wiley.
Boyce, W. E., and R. C. DiPrima. 2003. Elementary Differential Equations and Boundary Value Problems, 7th Ed. New York: Wiley.
Branham, R. 1990. Scientific Data Analysis: An Introduction to Overdetermined Systems. New York: Springer-Verlag.
Brenner, S., and R. Scott. 2002. The Mathematical Theory of Finite Element Methods. New York: Springer-Verlag.
Brent, R. P. 1976. “Fast multiple precision evaluation of elementary functions.” Journal of the Association for Computing Machinery 23, 242–251.
Briggs, W. 2004. Ants, Bikes, and Clocks: Problem Solving for Undergraduates. Philadelphia: SIAM.
Buchanan, J. L., and P. R. Turner. 1992. Numerical Methods and Analysis. New York: McGraw-Hill.
Burden, R. L., and J. D. Faires. 2001. Numerical Analysis, 7th Ed. Pacific Grove, California: Brooks/Cole.
Bus, J. C. P., and T. J. Dekker. 1975. “Two efficient algorithms with guaranteed convergence for finding a zero of a function.” Association for Computing Machinery Transactions on Mathematical Software 1, 330–345.
Butcher, J. C. 1987. The Numerical Analysis of Ordinary Differential Equations: Runge-Kutta and General Linear Methods. New York: Wiley.
Calvetti, D., G. H. Golub, W. B. Gragg, and L. Reichel. 2000. “Computation of Gauss-Kronrod quadrature rules.” Mathematics of Computation 69, 1035–1052.
Carrier, G., and C. Pearson. 1991. Ordinary Differential Equations. Philadelphia: SIAM.
Gärtner, B. 2006. Understanding and Using Linear Programming. New York: Springer.
Cash, J. 2003. “Mesh selection for nonlinear two-point boundary-value problems.” Journal of Computational Methods in Science and Engineering.
Chaitin, G. J. 1975. “Randomness and mathematical proof.” Scientific American May, 47–52.
Chapman, S. J. 2000. MATLAB Programming for Engineering. Pacific Grove, California: Brooks/Cole.
Cheney, E. W. 1982. Introduction to Approximation Theory, 2nd Ed. Washington, D.C.: AMS.
Cheney, E. W. 2001. Analysis for Applied Mathematics, New York: Springer.
Chicone, C. 2006. Ordinary Differential Equations with Applications. 2nd Ed. New York: Springer.
Clenshaw, C. W., and A. R. Curtis. 1960. “A method for numerical integration on an automatic computer.” Numerische Mathematik 2, 197–205.
Coleman, T. F., and C. Van Loan. 1988. Handbook for Matrix Computations. Philadelphia: SIAM.
Collatz, L. 1966. The Numerical Treatment of Differential Equations, 3rd Ed. Berlin: Springer-Verlag.
Conte, S. D., and C. de Boor. 1980. Elementary Numerical Analysis, 3rd Ed. New York: McGraw-Hill.
Cooper, L., and D. Steinberg. 1974. Methods and Applications of Linear Programming. Philadelphia: Saunders.
Crilly, A. J., R. A. Earnshaw, and H. Jones, eds. 1991. Fractals and Chaos. New York: Springer-Verlag.
Cvijovic, D., and J. Klinowski. 1995. “Taboo search: An approach to the multiple minima problem.” Science 267, 664–666.
Dahlquist, G., and A. Björck. 1974. Numerical Methods. Englewood Cliffs, New Jersey: Prentice-Hall.
Dantzig, G. B., A. Orden, and P. Wolfe. 1963. “Generalized simplex method for minimizing a linear form under linear inequality constraints.” Pacific Journal of Mathematics 5, 183–195.
Davis, P. J., and P. Rabinowitz. 1984. Methods of Numerical Integration, 2nd Ed. New York: Academic Press.
Davis, T. 2006. Direct Methods for Sparse Linear Systems. Philadelphia: SIAM.
de Boor, C. 1971. “CADRE: An algorithm for numerical quadrature.” In Mathematical Software, edited by J. R. Rice, 417–449. New York: Academic Press.
de Boor, C. 1984. A Practical Guide to Splines. 2nd Ed. New York: Springer-Verlag.
Dekker, T. J. 1969. “Finding a zero by means of successive linear interpolation.” In Constructive Aspects of the Fundamental Theorem of Algebra, edited by B. Dejon and P. Henrici. New York: Wiley-Interscience.
Dekker, T. J., and W. Hoffmann. 1989. “Rehabilitation of the Gauss-Jordan algorithm.” Numerische Mathematik 54, 591–599.
Dekker, T. J., W. Hoffmann, and K. Potma. 1997. “Stability of the Gauss-Huard algorithm with partial pivoting.” Computing 58, 225–244.
Dekker, K., and J. G. Verwer. 1984. “Stability of Runge-Kutta methods for stiff nonlinear differential equations.” CWI Monographs 2. Amsterdam: Elsevier Science.
Demmel, J. W., 1997. Applied Numerical Linear Algebra. Philadelphia: SIAM.
Dennis, J. E., and R. Schnabel. 1983. Quasi-Newton Methods for Nonlinear Problems. Englewood Cliffs, New Jersey: Prentice-Hall.
Dennis, J. E., and R. B. Schnabel. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Philadelphia: SIAM.
Dennis, J. E., and D. J. Woods. 1987. “Optimization on microcomputers: The Nelder-Mead simplex algorithm.” In New Computing Environments, edited by A. Wouk. Philadelphia: SIAM.
de Temple, D. W. 1993. “A quicker convergence to Euler’s Constant.” American Mathematical Monthly 100, 468–470.
Devitt, J. S. 1993. Calculus with Maple V. Pacific Grove, California: Brooks/Cole.
Dixon, V. A. 1974. “Numerical quadrature: a survey of the available algorithms.” In Software for Numerical Mathematics, edited by D. J. Evans. New York: Academic Press.
Dongarra, J. J., I. S. Duff, D. C. Sorensen, and H. van der Vorst. 1990. Solving Linear Systems on Vector and Shared Memory Computers. Philadelphia: SIAM.
Dorn, W. S., and D. D. McCracken. 1972. Numerical Methods with FORTRAN IV Case Studies. New York: Wiley.
Edwards, C., and D. Penney. 2004. Differential Equations and Boundary Value Problems, 5th Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Ellis, W., Jr., E. W. Johnson, E. Lodi, and D. Schwalbe. 1997. Maple V Flight Manual: Tutorials for Calculus, Linear Algebra, and Differential Equations. Pacific Grove, California: Brooks/Cole.
Ellis, W., Jr., and E. Lodi. 1991. A Tutorial Introduction to Mathematica. Pacific Grove, California: Brooks/Cole.
Elman, H., D. J. Silvester, and A. Wathen. 2004. Finite Element and Fast Iterative Solvers. New York: Oxford University Press.
England, R. 1969. “Error estimates for Runge-Kutta type solutions of ordinary differential equations.” Computer Journal 12, 166–170.
Enright, W. H. 2006. “Verifying approximate solutions to differential equations.” Journal of Computational and Applied Mathematics 185, 203–311.
Epureanu, B. I., and H. S. Greenside. 1998. “Fractal basins of attraction associated with a damped Newton’s method.” SIAM Review 40, 102–109.
Evans, G., J. Blackledge, and P. Yardley. 2000. Numerical Methods for Partial Differential Equations. New York: Springer-Verlag.
Evans, G. W., G. F. Wallace, and G. L. Sutherland. 1967. Simulation Using Digital Computers. Englewood Cliffs, New Jersey: Prentice-Hall.
Farin, G. 1990. Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide, 2nd Ed. New York: Academic Press.
Fauvel, J., R. Flood, M. Shortland, and R. Wilson (eds.). 1988. Let Newton Be! London: Oxford University Press.
Feder, J. 1988. Fractals. New York: Plenum Press.
Fehlberg, E. 1969. “Klassische Runge-Kutta formeln fünfter und siebenter ordnung mit schrittweitenkontrolle.” Computing 4, 93–106.
Flehinger, B. J. 1966. “On the probability that a random integer has initial digit A.” American Mathematical Monthly 73, 1056–1061.
Fletcher, R. 1976. Practical Methods of Optimization. New York: Wiley.
Floudas, C. A., and P. M. Pardalos (eds.). 1992. Recent Advances in Global Optimization. Princeton, New Jersey: Princeton University Press.
Flowers, B. H. 1995. An Introduction to Numerical Methods in C++. New York: Oxford University Press.
Ford, J. A. 1995. “Improved Algorithms of Illinois-Type for the Numerical Solution of Nonlinear Equations.” Technical Report, Department of Computer Science, University of Essex, Colchester, Essex, UK.
Forsythe, G. E. 1957. “Generation and use of orthogonal polynomials for data-fitting with a digital computer.” Society for Industrial and Applied Mathematics Journal 5, 74–88.
Forsythe, G. E. 1970. “Pitfalls in computation, or why a math book isn’t enough.” American Mathematical Monthly 77, 931–956.
Forsythe, G. E., M. A. Malcolm, and C. B. Moler. 1977. Computer Methods for Mathematical Computations. Englewood Cliffs, New Jersey: Prentice-Hall.
Forsythe, G. E., and C. B. Moler. 1967. Computer Solution of Linear Algebraic Systems. Englewood Cliffs, New Jersey: Prentice-Hall.
Forsythe, G. E., and W. R. Wasow. 1960. Finite Difference Methods for Partial Differential Equations. New York: Wiley.
Fox, L. 1957. The Numerical Solution of Two-Point Boundary Problems in Ordinary Differential Equations. Oxford: Clarendon Press.
Fox, L. 1964. An Introduction to Numerical Linear Algebra, Monograph on Numerical Analysis. Oxford: Clarendon Press. Reprinted 1974. New York: Oxford University Press.
Fox, L., D. Huskey, and J. H. Wilkinson. 1948. “Notes on the solution of algebraic linear simultaneous equations.” Quarterly Journal of Mechanics and Applied Mathematics 1, 149–173.
Frank, W. 1958. “Computing eigenvalues of complex matrices by determinant evaluation and by methods of Danilewski and Wielandt.” Journal of SIAM 6, 37–49.
Fraser, W., and M. W. Wilson. 1966. “Remarks on the Clenshaw-Curtis quadrature scheme.” SIAM Review 8, 322–327.
Friedman, A., and N. Littman. 1994. Industrial Mathematics: A Course in Solving Real-World Problems. Philadelphia: SIAM.
Fröberg, C.-E. 1969. Introduction to Numerical Analysis. Reading, Massachusetts: Addison-Wesley.
Gallivan, K. A., M. Heath, E. Ng, B. Peyton, R. Plemmons, J. Ortega, C. Romine, A. Sameh, and R. Voigt. 1990. Parallel Algorithms for Matrix Computations. Philadelphia: SIAM.
Gander, W., and W. Gautschi. 2000. “Adaptive quadrature—revisited.” BIT 40, 84–101.
Garvan, F. 2002. The Maple Book. Boca Raton, Florida: Chapman & Hall/CRC.
Gautschi, W. 1990. “How (un)stable are Vandermonde systems?” In Asymptotic and Computational Analysis, 193–210, Lecture Notes in Pure and Applied Mathematics, 124. New York: Dekker.
Gautschi, W. 1997. Numerical Analysis: An Introduction. Boston, Massachusetts: Birkhäuser.
Gear, C. W. 1971. Numerical Initial Value Problems in Ordinary Differential Equations. Englewood Cliffs, New Jersey: Prentice-Hall.
Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods, 2nd Ed. New York: Springer-Verlag.
Gentleman, W. M. 1972. “Implementing Clenshaw-Curtis quadrature.” Communications of the ACM 15, 337–346, 353.
Gerald, C. F., and P. O. Wheatley. 1999. Applied Numerical Analysis, 6th Ed. Reading, Massachusetts: Addison-Wesley.
Ghizzetti, A., and A. Ossicini. 1970. Quadrature Formulae. New York: Academic Press.
Gill, P. E., W. Murray, and M. H. Wright. 1981. Practical Optimization. New York: Academic Press.
Gleick, J. 1992. Genius: The Life and Science of Richard Feynman. New York: Pantheon.
Gockenbach, M. S. 2002. Partial Differential Equations: Analytical and Numerical Methods. Philadelphia: SIAM.
Goldberg, D. 1991. “What every computer scientist should know about floating-point arithmetic.” ACM Computing Surveys 23, 5–48.
Goldstine, H. H. 1977. A History of Numerical Analysis from the 16th to the 19th Century. New York: Springer-Verlag.
Golub, G. H., and J. M. Ortega. 1992. Scientific Computing and Differential Equations. New York: Harcourt Brace Jovanovich.
Golub, G. H., and J. M. Ortega. 1993. An Introduction with Parallel Scientific Computing. New York: Academic Press.
Golub, G. H., and C. F. Van Loan. 1996. Matrix Computations, 3rd Ed. Baltimore: Johns Hopkins University Press.
Good, I. J. 1972. “What is the most amazing approximate integer in the universe?” Pi Mu Epsilon Journal 5, 314–315.
Greenbaum, A. 1997. Iterative Methods for Solving Linear Systems. Philadelphia: SIAM.
Greenbaum, A. 2002. “Card Shuffling and the Polynomial Numerical Hull of Degree k,” Mathematics Department, University of Washington, Seattle, Washington.
Gregory, R. T., and D. Karney. 1969. A Collection of Matrices for Testing Computational Algorithms. New York: Wiley.
Griewank, A. 2000. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Philadelphia: SIAM.
Groetsch, C. W. 1998. “Lanczos’ generalized derivative.” American Mathematical Monthly 105, 320–326.
Haberman, R. 2004. Applied Partial Differential Equations with Fourier Series and Boundary Value Problems. Upper Saddle River, New Jersey: Prentice-Hall.
Hageman, L. A., and D. M. Young. 1981. Applied Iterative Methods. New York: Academic Press; Dover 2004 (reprint).
Hämmerlin, G., and K.-H. Hoffmann. 1991. Numerical Mathematics. New York: Springer-Verlag.
Hammersley, J. M., and D. C. Handscomb. 1964. Monte Carlo Methods. London: Methuen.
Hansen, T., G. L. Mullen, and H. Niederreiter. 1993. “Good parameters for a class of node sets in quasi-Monte Carlo integration.” Mathematics of Computation 61, 225–234.
Haruki, H., and S. Haruki. 1983. “Euler’s Integrals.” American Mathematical Monthly 7, 465.
Hastings, H. M. and G. Sugihara. 1993. Fractals: A User’s Guide for the Natural Sciences. New York: Oxford University Press.
Havie, T. 1969. “On a modification of the Clenshaw-Curtis quadrature formula.” BIT 9, 338–350.
Heath, J. M. 2002. Scientific Computing: An Introductory Survey, 2nd Ed. New York: McGraw-Hill.
Henrici, P. 1962. Discrete Variable Methods in Ordinary Differential Equations. New York: Wiley.
Heroux, M., P. Raghavan, and H. Simon. 2006. Parallel Processing for Scientific Computing. Philadelphia: SIAM.
Herz-Fischler, R. 1998. A Mathematical History of the Golden Number. New York: Dover.
Hestenes, M. R., and E. Stiefel. 1952. “Methods of conjugate gradient for solving linear systems.” Journal Research National Bureau of Standards 49, 409–436.
Higham, D., and N. J. Higham. 2006. MATLAB Guide, 2nd Ed. Philadelphia: SIAM.
Higham, N. J. 2002. Accuracy and Stability of Numerical Algorithms, 2nd Ed. Philadelphia: SIAM.
Hildebrand, F. B. 1974. Introduction to Numerical Analysis. New York: McGraw-Hill.
Hodges, A. 1983. Alan Turing: The Enigma. New York: Simon & Schuster.
Hoffmann, W. 1989. “A fast variant of the Gauss-Jordan algorithm with partial pivoting. Basic transformations in linear algebra for vector computing.” Doctoral dissertation, University of Amsterdam, The Netherlands.
Hofmann-Wellenhof, B., H. Lichtenegger, and J. Collins. 2001. Global Positioning System: Theory and Practice, 5th Ed. New York: Springer-Verlag.
Horst, R., P. M. Pardalos, and N. V. Thoai. 2000. Introduction to Global Optimization, 2nd Ed. Boston: Kluwer.
Householder, A. S. 1970. The Numerical Treatment of a Single Nonlinear Equation. New York: McGraw-Hill.
Huard, P. 1979. “La méthode du simplexe sans inverse explicite.” Bull. E.D.F. Série C 2.
Huddleston, J. V. 2000. Extensibility and Compressibility in One-Dimensional Structures. 2nd Ed. Buffalo, NY: ECS Publ.
Hull, T. E., and A. R. Dobell. 1962. “Random number generators.” Society for Industrial and Applied Mathematics Review 4, 230–254.
Hull, T. E., W. H. Enright, B. M. Fellen, and A. E. Sedgwick. 1972. “Comparing numerical methods for ordinary differential equations.” Society for Industrial and Applied Mathematics Journal on Numerical Analysis 9, 603–637.
Hundsdorfer, W. H. 1985. “The numerical solution of nonlinear stiff initial value problems: an analysis of one step methods.” CWI Tract, 12. Amsterdam: Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica.
Isaacson, E., and H. B. Keller. 1966. Analysis of Numerical Methods. New York: Wiley.
Jeffrey, A. 2000. Handbook of Mathematical Formulas and Integrals. Boston: Academic Press.
Jennings, A. 1977. Matrix Computation for Engineers and Scientists. New York: Wiley.
Johnson, L. W., R. D. Riess, and J. T. Arnold. 1997. Introduction to Linear Algebra. New York: Addison-Wesley.
Kahaner, D. K. 1971. “Comparison of numerical quadrature formulas.” In Mathematical Software, edited by J. R. Rice. New York: Academic Press.
Kahaner, D., C. Moler, and S. Nash. 1989. Numerical Methods and Software. Englewood Cliffs, New Jersey: Prentice-Hall.
Keller, H. B. 1968. Numerical Methods for Two-Point Boundary-Value Problems. Toronto: Blaisdell.
Keller, H. B. 1976. Numerical Solution of Two-Point Boundary Value Problems. Philadelphia: SIAM.
Kelley, C. T. 1995. Iterative Methods for Linear and Nonlinear Equations. Philadelphia: SIAM.
Kelley, C. T. 2003. Solving Nonlinear Equations with Newton’s Method. Philadelphia: SIAM.
Kincaid, D., and W. Cheney. 2002. Numerical Analysis: Mathematics of Scientific Computing, 3rd Ed. Belmont, California: Thomson Brooks/Cole.
Kincaid, D. R., and D. M. Young. 1979. “Survey of iterative methods.” In Encyclopedia of Computer Science and Technology, edited by J. Belzer, A. G. Holzman, and A. Kent. New York: Dekker.
Kincaid, D. R., and D. M. Young. 2000. “Partial differential equations.” In Encyclopedia of Computer Science, 4th Ed., edited by A. Ralston, E. D. Reilly, D. Hemmendinger. New York: Grove’s Dictionaries.
Kinderman, A. J., and J. F. Monahan. 1977. “Computer generation of random variables using the ratio of uniform deviates.” Association of Computing Machinery Transactions on Mathematical Software 3, 257–260.
Kirkpatrick, S., C. D. Gelatt, Jr., and M. P. Vecchi. 1983. “Optimization by simulated annealing.” Science 220, 671–680.
Knight, A. 2000. Basics of MATLAB and Beyond. Boca Raton, Florida: CRC Press.
Knuth, D. E. 1997. The Art of Computer Programming, 3rd Ed. Vol. 2, Seminumerical Algorithms. New York: Addison-Wesley.
Krogh, F. T. 2003. “On developing mathematical software.” Journal of Computational and Applied Mathematics 185, 196–202.
Kronrod, A. S. 1964. “Nodes and Weights of Quadrature Rules.” Doklady Akad. Nauk SSSR, 154, 283–286. [Russian] (1965. New York: Consultants Bureau.)
Krylov, V. I. 1962. Approximate Calculation of Integrals, translated by A. Stroud. New York: Macmillan.
Lambert, J. D. 1973. Computational Methods in Ordinary Differential Equations. New York: Wiley.
Lambert, J. D. 1991. Numerical Methods for Ordinary Differential Equations. New York: Wiley.
Lapidus, L., and J. H. Seinfeld. 1971. Numerical Solution of Ordinary Differential Equations. New York: Academic Press.
Laurie, D. P. 1997. “Calculation of Gauss-Kronrod quadrature formulae.” Mathematics of Computation, 1133–1145.
Lawson, C. L., and R. J. Hanson. 1995. Solving Least Squares Problems. Philadelphia: SIAM.
Leva, J. L. 1992. “A fast normal random number generator.” Association of Computing Machinery Transactions on Mathematical Software 18, 449–455.
Lindfield, G., and J. Penny. 2000. Numerical Methods Using MATLAB, 2nd Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Lootsma, F. A., ed. 1972. Numerical Methods for Nonlinear Optimization. New York: Academic Press.
Lozier, D. W., and F. W. J. Olver. 1994. “Numerical evaluation of special functions.” In Mathematics of Computation 1943–1993: A Half-Century of Computational Mathematics 48, 79–125. Providence, Rhode Island: AMS.
Lynch, S. 2004. Dynamical Systems with Applications. Boston: Birkhäuser.
MacLeod, M. A. 1973. “Improved computation of cubic natural splines with equi-spaced knots.” Mathematics of Computation 27, 107–109.
Maron, M. J. 1991. Numerical Analysis: A Practical Approach. Boston: PWS Publishers.
Marsaglia, G. 1968. “Random numbers fall mainly in the planes.” Proceedings of the National Academy of Sciences 61, 25–28.
Marsaglia, G., and W. W. Tsang. 2000. “The Ziggurat Method for generating random variables.” Journal of Statistical Software 5, 1–7.
Mattheij, R. M. M., S. W. Rienstra, and J. H. M. ten Thije Boonkkamp. 2005. Partial Differential Equations: Modeling, Analysis, Computation. Philadelphia: SIAM.
McCartin, B. J. 1998. “Seven deadly sins of numerical computations,” American Mathematical Monthly 105, No. 10, 929–941.
McKenna, P. J., and C. Tuama. 2001. “Large torsional oscillations in suspension bridges visited again: Vertical forcing creates torsional response.” American Mathematical Monthly 108, 738–745.
Mehrotra, S. 1992. “On the implementation of a primal-dual interior point method.” SIAM Journal on Optimization 2, 575–601.
Metropolis, N. et al. 1953. “Equation of state calculations by fast computing machines.” Journal of Chemical Physics 21, 1087–1092.
Meurant, G. 2006. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations. Philadelphia: SIAM.
Meyer, C. D., 2000. Matrix Analysis and Applied Linear Algebra. Philadelphia: SIAM.
Miranker, W. L. 1981. “Numerical methods for stiff equations and singular perturbation problems.” In Mathematics and its Applications, Vol. 5. Dordrecht-Boston, Massachusetts: D. Reidel.
Moler, C. B. 2004. Numerical Computing with MATLAB. Philadelphia: SIAM.
Moré, J. J., and S. J. Wright. 1993. Optimization Software Guide. Philadelphia: SIAM.
Moulton, F. R. 1930. Differential Equations. New York: Macmillan.
Nelder, J. A., and R. Mead. 1965. “A simplex method for function minimization.” Computer Journal 7, 308–313.
Nerinckx, D., and A. Haegemans. 1976. “A comparison of nonlinear equation solvers.” Journal of Computational and Applied Mathematics 2, 145–148.
Nering, E. D., and A. W. Tucker. 1992. Linear Programs and Related Problems. New York: Academic Press.
Niederreiter, H. 1978. “Quasi-Monte Carlo methods.” Bulletin of the American Mathematical Society 84, 957–1041.
Niederreiter, H. 1992. Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM.
Nievergelt, J., J. G. Farrar, and E. M. Reingold. 1974. Computer Approaches to Mathematical Problems. Englewood Cliffs, New Jersey: Prentice-Hall.
Noble, B., and J. W. Daniel. 1988. Applied Linear Algebra, 3rd Ed. Englewood Cliffs, New Jersey: Prentice-Hall.
Nocedal, J., and S. Wright. 2006. Numerical Optimization. 2nd Ed. New York: Springer.
Novak, E., K. Ritter, and H. Woźniakowski. 1995. “Average-case optimality of a hybrid secant-bisection method.” Mathematics of Computation 64, 1517–1540.
Novak, M., ed. 1998. Fractals and Beyond: Complexities in the Sciences. River Edge, NJ: World Scientific.
O’Hara, H., and F. J. Smith. 1968. “Error estimation in Clenshaw-Curtis quadrature formula.” Computer Journal 11, 213–219.
Oliveira, S., and D. E. Stewart. 2006. Writing Scientific Software: A Guide to Good Style. New York: Cambridge University Press.
Orchard-Hays, W. 1968. Advanced Linear Programming Computing Techniques. New York: McGraw-Hill.
Ortega, J., and R. G. Voigt. 1985. Solution of Partial Differential Equations on Vector and Parallel Computers. Philadelphia: SIAM.
Ortega, J. M. 1990a. Numerical Analysis: A Second Course. Philadelphia: SIAM.
Ortega, J. M. 1990b. Introduction to Parallel and Vector Solution of Linear Systems. New York: Plenum.
Ortega, J. M., and W. C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic Press. (2000. Reprint. Philadelphia: SIAM.)
Ostrowski, A. M. 1966. Solution of Equations and Systems of Equations, 2nd Ed. New York: Academic Press.
Overton, M. L. 2001. Numerical Computing with IEEE Floating Point Arithmetic. Philadelphia: SIAM.
Otten, R. H. J. M., and L. P. P. van Ginneken. 1989. The Annealing Algorithm. Dordrecht, Netherlands: Kluwer.
Pacheco, P. 1997. Parallel Programming with MPI. San Francisco: Morgan Kaufmann.
Patterson, T. N. L. 1968. “The optimum addition of points to quadrature formulae.” Mathematics of Computation 22, 847–856, and in 1969 Mathematics of Computation 23, 892.
Parlett, B. N. 1997. The Symmetric Eigenvalue Problem. Philadelphia: SIAM.
Parlett, B. 2000. “The QR Algorithm,” Computing in Science and Engineering 2, 38–42.
Piessens, R., E. de Doncker, C. W. Überhuber, and D. K. Kahaner. 1983. QUADPACK: A Subroutine Package for Automatic Integration. New York: Springer-Verlag.
Peterson, I. 1997. The Jungles of Randomness: A Mathematical Safari. New York: Wiley.
Phillips, G. M., and P. J. Taylor. 1973. Theory and Applications of Numerical Analysis. New York: Academic Press.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 2002. Numerical Recipes in C++, 2nd Ed. New York: Cambridge University Press.
Quinn, M. J. 1994. Parallel Computing: Theory and Practice. New York: McGraw-Hill.
Rabinowitz, P. 1968. “Applications of linear programming to numerical analysis.” Society for Industrial and Applied Mathematics Review 10, 121–159.
Rabinowitz, P. 1970. Numerical Methods for Nonlinear Algebraic Equations. London: Gordon & Breach.
Raimi, R. A. 1969. “On the distribution of first significant figures.” American Mathematical Monthly 76, 342–347.
Ralston, A. 1965. A First Course in Numerical Analysis. New York: McGraw-Hill.
Ralston, A., and C. L. Meek (eds.) 1976. Encyclopedia of Computer Science. New York: Petrocelli/Charter.
Ralston, A., and P. Rabinowitz. 2001. A First Course in Numerical Analysis, 2nd Ed. New York: Dover.
Recktenwald, G. 2000. Numerical Methods with MATLAB: Implementation and Applications. New York: Prentice-Hall.
Reid, J. 1971. “On the method of conjugate gradient for the solution of large sparse systems of linear equations.” In Large Sparse Sets of Linear Equations, J. Reid (ed.), London: Academic Press.
Rheinboldt, W. C. 1998. Methods for Solving Systems of Nonlinear Equations, 2nd Ed. Philadelphia: SIAM.
Rice, J. R. 1971. “SQUARS: An algorithm for least squares approximation.” In Mathematical Software, edited by J. R. Rice. New York: Academic Press.
Rice, J. R. 1983. Numerical Methods, Software, and Analysis. New York: McGraw-Hill.
Rice, J. R., and R. F. Boisvert. 1984. Solving Elliptic Problems Using ELLPACK. New York: Springer-Verlag.
Rice, J. R., and J. S. White. 1964. “Norms for smoothing and estimation.” Society for Industrial and Applied Mathematics Review 6, 243–256.
Rivlin, T. J. 1990. The Chebyshev Polynomials, 2nd Ed. New York: Wiley.
Roger, H.-F. 1998. A Mathematical History of the Golden Number. New York: Dover.
Roos, C., T. Terlaky, and J.-Ph. Vial. 1997. Theory and Algorithms for Linear Optimization: An Interior Point Approach. New York: Wiley.
Saad, Y. 2003. Iterative Methods for Sparse Linear Systems. Philadelphia: SIAM.
Salamin, E. 1976. “Computation of π using arithmetic-geometric mean.” Mathematics of Computation 30, 565–570.
Sauer, T. 2006. Numerical Analysis. New York: Pearson, Addison-Wesley.
Scheid, F. 1968. Theory and Problems of Numerical Analysis. New York: McGraw-Hill.
Scheid, F. 1990. 2000 Solved Problems in Numerical Analysis. Schaum’s Solved Problem Series. New York: McGraw-Hill.
Schilling, R. J., and S. L. Harris. 2000. Applied Numerical Methods for Engineering Using MATLAB and C. Pacific Grove, California: Brooks/Cole.
Schmidt 1908. Title unknown. Rendiconti del Circolo Matematico di Palermo 25, 53–77.
Schoenberg, I. J. 1946. “Contributions to the problem of approximation of equidistant data by analytic functions.” Quarterly of Applied Mathematics 4, 45–99, 112–141.
Schoenberg, I. J. 1967. “On spline functions.” In Inequalities, edited by O. Shisha, 255–291. New York: Academic Press.
Schrage, L. 1979. “A more portable Fortran random number generator.” Association for Computing Machinery Transactions on Mathematical Software 5, 132–138.
Schrijver, A. 1986. Theory of Linear and Integer Programming. Somerset, New Jersey: Wiley.
Schultz, M. H. 1973. Spline Analysis. Englewood Cliffs, New Jersey: Prentice-Hall.
Schumaker, L. L. 1981. Spline Functions: Basic Theory. New York: Wiley.
Shampine, J. D. 1994. Numerical Solutions of Ordinary Differential Equations. London: Chapman and Hall.
Shampine, L. F., R. C. Allen, and S. Pruess. 1997. Fundamentals of Numerical Computing. New York: Wiley.
Shampine, L. F., and M. K. Gordon. 1975. Computer Solution of Ordinary Differential Equations. San Francisco: W. H. Freeman.
Shewchuk, J. R. 1994. “An introduction to the conjugate gradient method without the agonizing pain,” online Wikipedia.
Skeel, R. D., and J. B. Keiper. 1992. Elementary Numerical Computing with Mathematica. New York: McGraw-Hill.
Smith, G. D. 1965. Solution of Partial Differential Equations. New York: Oxford University Press.
Sobol, I. M. 1994. A Primer for the Monte Carlo Method. Boca Raton, Florida: CRC Press.
Southwell, R. V. 1946. Relaxation Methods in Theoretical Physics. Oxford: Clarendon Press.
Späth, H. 1992. Mathematical Algorithms for Linear Regression. New York: Academic Press.
Stakgold, I. 2000. Boundary Value Problems of Mathematical Physics. Philadelphia: SIAM.
Steele, J. M., 1997. Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM.
Stetter, H. J. 1973. Analysis of Discretization Methods for Ordinary Differential Equations. Berlin: Springer-Verlag.
Stewart, G. W. 1973. Introduction to Matrix Computations. New York: Academic Press.
Stewart, G. W. 1996. Afternotes on Numerical Analysis. Philadelphia: SIAM.
Stewart, G. W. 1998a. Afternotes on Numerical Analysis: Afternotes Goes to Graduate School. Philadelphia: SIAM.
Stewart, G. W. 1998b. Matrix Algorithms: Basic Decompositions, Vol. 1. Philadelphia: SIAM.
Stewart, G. W. 2001. Matrix Algorithms: Eigensystems, Vol. 2. Philadelphia: SIAM.
Stoer, J., and R. Bulirsch. 1993. Introduction to Numerical Analysis, 2nd Ed. New York: Springer-Verlag.
Strang, G. 2006. Linear Algebra and Its Applications. Belmont, California: Thomson Brooks/Cole.
Strang, G., and K. Borre. 1997. Linear Algebra, Geodesy, and GPS. Cambridge, MA: Wellesley Cambridge Press.
Street, R. L. 1973. The Analysis and Solution of Partial Differential Equations. Pacific Grove, California: Brooks/Cole.
Stroud, A. H. 1974. Numerical Quadrature and Solution of Ordinary Differential Equations. New York: Springer-Verlag.
Stroud, A. H., and D. Secrest. 1966. Gaussian Quadrature Formulas. Englewood Cliffs, New Jersey: Prentice-Hall.
Subbotin, Y. N. 1967. “On piecewise-polynomial approximation.” Matematicheskie Zametki 1, 63–70. (Translation: 1967. Math. Notes 1, 41–46.)
Szabo, F. 2002. Linear Algebra: An Introduction Using MAPLE. San Diego, California: Harcourt/Academic Press.
Torczon, V. 1997. “On the convergence of pattern search methods.” Society for Industrial and Applied Mathematics Journal on Optimization 7, 1–25.
Törn, A., and A. Zilinskas. 1989. Global Optimization. Lecture Notes in Computer Science 350. Berlin: Springer-Verlag.
Traub, J. F. 1964. Iterative Methods for the Solution of Equations. Englewood Cliffs, New Jersey: Prentice-Hall.
Trefethen, L. N., and D. Bau. 1997. Numerical Linear Algebra. Philadelphia: SIAM.
Turner, P. R. 1982. “The distribution of leading significant digits.” Journal of the Institute of Mathematics and Its Applications 2, 407–412.
van Huffel, S., and J. Vandewalle. 1991. The Total Least Squares Problem: Computational Aspects and Analysis. Philadelphia: SIAM.
Van Loan, C. F. 1997. Introduction to Computational Science and Mathematics. Sudbury, Massachusetts: Jones and Bartlett.
Van Loan, C. F. 2000. Introduction to Scientific Computing, 2nd Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Van der Vorst, H. A. 2003. Iterative Krylov Methods for Large Linear Systems. New York: Cambridge University Press.
Varga, R. S. 1962. Matrix Iterative Analysis. Englewood Cliffs, New Jersey: Prentice-Hall. (2000. Matrix Iterative Analysis: Second Revised and Expanded Edition. New York: Springer-Verlag.)
Wachspress, E. L. 1966. Iterative Solutions to Elliptic Systems. Englewood Cliffs, New Jersey: Prentice-Hall.
Watkins, D. S. 1991. Fundamentals of Matrix Computation. New York: Wiley.
Westfall, R. 1995. Never at Rest: A Biography of Isaac Newton, 2nd Ed. London: Cambridge University Press.
Whittaker, E., and G. Robinson. 1944. The Calculus of Observation, 4th Ed. London: Blackie. New York: Dover, 1967 (reprint).
Wilkinson, J. H. 1965. The Algebraic Eigenvalue Problem. Oxford: Clarendon Press. Reprinted 1988. New York: Oxford University Press.
Wilkinson, J. H. 1963. Rounding Errors in Algebraic Processes. Englewood Cliffs, New Jersey: Prentice-Hall. New York: Dover 1994 (reprint).
Wood, A. 1999. Introduction to Numerical Analysis. New York: Addison-Wesley.
Wright, S. J. 1997. Primal-Dual Interior-Point Methods. Philadelphia: SIAM.
Yamaguchi, F. 1988. Curves and Surfaces in Computer Aided Geometric Design. New York: Springer-Verlag.
Ye, Yinyu. 1997. Interior Point Algorithms. New York: Wiley.
Young, D. M. 1950. Iterative methods for solving partial difference equations of elliptic type. Ph.D. thesis. Cambridge, MA: Harvard University. See www.sccm.stanford.edu/pub/sccm/david young thesis.ps.gz.
Young, D. M. 1971. Iterative Solution of Large Linear Systems. New York: Academic Press; Dover 2003 (reprint).
Young, D. M., and R. T. Gregory. 1972. A Survey of Numerical Mathematics, Vols. 1–2. Reading, Massachusetts: Addison-Wesley. New York: Dover 1988 (reprint).
Ypma, T. J. 1995. “Historical development of the Newton-Raphson method.” Society for Industrial and Applied Mathematics Review 37, 531–551.
Zhang, Y. 1995. “Solving large-scale linear programs by interior-point methods under the MATLAB environment.” Technical Report TR96–01, Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD.
Index
Absolute errors, 5
Abstract vector spaces in linear algebra, 716–723
  bases for, 718
  change in similarity of, 719–720
  eigenvalues and eigenvectors in, 719
  Gram-Schmidt process for, 722–723
  linear independence in, 717–718
  linear transformations for, 718–719
  norms for, 721–722
  orthogonal matrices and spectral theorem in, 720–721
  subspaces in, 717
Accelerated steepest descent procedure, 655 (CPb 16.2.2)
Accuracy
  first-degree polynomial, 375
  first-degree spline, 375
  in ordinary differential equation (ODE) solutions, 435
  precision and, 5–6
A⁻¹ computation, 307
A-conjugate vectors, 332
Adams-Bashforth-Moulton methods
  adaptive scheme for, 488
  example of, 488–489
  for first-order ordinary differential equations, 455–456
  predictor-corrector scheme in, 483–484
  problems on, 241 (Pb 6.2.15), 461 (CPb 10.3.2–4)
  pseudocode for, 484–488
  stiff equations and, 489–491
Adaptive Runge-Kutta methods, 450–454
Adaptive Simpson’s rule, 221–225
Adaptive two-point Gaussian integration, 242 (CPb 6.2.7)
Advection equation, 601–602
Aitken acceleration formula, 363
A-inner product, of vectors, 332
Airy differential equation, 483 (CPb 11.2.2)
Algebra. See Linear algebra
Algorithms
  Berman, 638 (CPb 16.1.5)
  complete Horner’s, 7, 23–24
  conjugate gradient, 334
  converting bases of numbers, 696
  Fibonacci search, 628–631
  Gauss-Huard, 279–280 (CPb 7.2.24)
  Gaussian, 248, 250–251
  golden section search, 631–633
  Gram-Schmidt process, 519
  linear least squares, 497
  Moler-Morrison, 122 (CPb 3.3.14)
  multivariate case of minimization of functions, 644–646
  natural cubic spline functions, 388–392
  Nelder-Mead, 647–648
  Neville’s, 142–144
  Newton, 129
  normalized tridiagonal, 289 (CPb 7.2.12)
  orthogonal systems, 508–510
  polynomial interpolation, 136–138
  power method, 361–362
  quadratic interpolation, 633–635
  random numbers, 533–535, 535
  Romberg, 165, 168, 204–215
    description of, 204–205
    Euler-Maclaurin formula and, 206–209
    pseudocode for, 205–206
    Richardson extrapolation of, 209–211
  secant method for roots of equations, 112–113
  shooting method for ordinary differential equations, 565–567
  simplex, 672–673
  variable metric, 647
Alternating series theorem, 28–30, 32 (Pb 1.2.13)
Antiderivative, 181. See also Integration, numerical
Approximation. See Least squares method; Spline functions
Area and volume estimation, 544–552
  computing, 547–548
  “ice cream cone” example of, 548
  numerical integration for, 544–545
  pseudocode for, 545–547
Arithmetic
  Babylonian, 701
  IEEE standard floating-point, 703–705
  Mayan, 700–701
  partial double-precision, 492 (CPb 11.3.2)
Arithmetic mean, 15 (CPb 1.1.7)
Arrays, 686, 688–689
Attraction, fractal basins of, 99–100, 108 (CPb 3.2.27)
Autonomous ordinary differential equations, 471–472, 479–480
Back substitution, in Gaussian algorithm, 248, 250–251
Backward error analysis, 52
Banded storage mode, 291 (CPb 7.2.19)
Banded systems of linear equations, 280–292
  block pentadiagonal, 285–286
  pentadiagonal, 283–285
  strictly diagonal dominance in, 282–283
  tridiagonal, 280–282
Banker’s rounding, 6
Bases for numbers, 692–702
  β, 693
  conversion between, 693–696
  16, 698
  10, 692–693
  from 10 to 8 to 2, 696–698
Basic Simpson’s rule, 216–220, 228 (Pb 6.1.8)
Basic trapezoid rule, 190
Basins of attraction, 99–100, 108 (CPb 3.2.27)
Basis functions, 500–501, 505–508
Berman algorithm, 638 (CPb 16.1.5)
Bernoulli numbers, 208
Bernstein polynomials, 416
Bessel functions, 42 (CPb 1.2.23), 186, 215 (CPb 5.3.11)
Best-step steepest descent procedure, 643
Bézier curves, 416–418
Big O notation, 27
Biharmonic equation, 583
Binary search, for intervals, 384 (CPb 9.1.2)
Binary system, 693, 696–697. See also Bases for numbers
Binomial series, 31 (Pb 1.2.1)
Birthday problem, 553–555
Bisection method for locating roots of equations, 76–85
  convergence analysis in, 81–83
  example of, 79–81
  false position method in, 83–84
  pseudocode in, 78–79
  secant method and Newton’s method versus, 117
Bivariate functions, 144–145
Block pentadiagonal systems of linear equations, 285–286
Boundary cases, 685
Boundary-value problems. See Ordinary differential equations, boundary-value problems in
Bratu’s problem, 581 (CPb 14.2.7)
B spline functions, 404–425
  for Bézier curves, 416–418
  interpolation and approximation by, 410–412
  pseudocode and example of, 412–413
  Schoenberg’s process for, 414–415
  theory of, 404–410
Buckling of a circular ring project, 581 (CPb 14.2.8)
Buffon’s needle problem, 555–556
Calculus, Fundamental Theorem of, 181, 195
Cantilever beam, 341 (CPb 8.1.10)
Cardinal polynomials, 126–127
Case studies in programming, 687–691
Cauchy-Riemann equation, 105 (Pb 3.2.40)
Cauchy-Schwarz inequality, 503 (Pb 12.1.9), 643
Cayley-Hamilton Theorem, 358 (CPb 8.2.5)
Central difference formula, 15 (CPb 1.1.3), 166, 171
Centroids, 648
Chapeau functions of B splines, 406
Characteristic equations, 719
Characteristic polynomials, 343
Chebyshev nodes, 155–156, 158, 163 (CPb 4.2.10), 174
Chebyshev polynomials
  orthogonal systems and, 505–518
    algorithm for, 508–510
    orthonormal basis functions in, 505–508
    polynomial regression in, 510–515
  properties of, 140–141
Checkerboard ordering, 620 (Pb 15.3.3)
Cholesky factorization, 305–306, 315 (Pb 8.1.24)
Chopping numbers, 6, 51
Clamped cubic splines, 387
Clean loops, 686
Code, modularizing, 685, 687–688
Coefficients a_j, 131–136
Collocation method, 618
Column vectors, 671–672, 706
Companion matrix, 358 (CPb 8.2.3)
Complete Horner’s algorithm, 23–24
Complete partial pivoting, 261–264
Components, in vectors, 706
Composite Gaussian three-point rule, 243 (CPb 6.2.11)
Composite midpoint rule for equal subintervals, 188 (Pb 5.1.12)
Composite (left) rectangle rule, 202 (Pb 5.2.28)
Composite rectangle rule with uniform spacing, 202–203 (Pb 5.2.29)
Composite Simpson’s rule, 220–221, 228 (Pb 6.1.6), 243 (CPb 6.2.11)
Composite trapezoid rule, 191, 194, 243 (CPb 6.2.11)
Composite trapezoid rule with unequal spacing, 203 (Pb 5.2.32)
Computation, noise in, 174
Computer-aided geometric design, 425 (CPb 9.3.19)
Condition number, in linear equations, 321–322
Conjugate gradient method, 332–335
Constrained minimization problems, 625–626
Continuity of functions, 373–375
Contour diagrams, 644
Control points, in drawing curves, 371, 416
Convergence analysis
  in bisection method, 81–83
  in Newton’s method, 93–96
  in secant method, 114–116
Convergence theorems, 328–331
Convex hull, of vectors, 417
Corollaries on divided differences, 160
Correctly rounded value, 705
Correct rounding, 50
Cramer’s Rule, 715
Crank-Nicolson method, 588–591
Crout factorization, 317 (CPb 8.1.2)
Cubic B spline, 423 (Pb 9.3.38)
Cubic interpolating spline, 371. See also Spline functions
Curves. See Ordinary differential equations; Spline functions
Dawson integral, 439 (CPb 10.1.12)
Decimal places, accuracy to, 5
Decimal point, 693
Decomposition, in matrix factorizations, 296
Deflation of polynomials, 8, 11
Delay ordinary differential equations, 450 (CPb 10.2.17)
Derivatives, 164–179
  of B splines, 408
  divided differences and, 159
  of functions, 9–10
  Lanczos’ generalized, 178 (Pb 4.3.21)
  noise in computation and, 174
  polynomial interpolation estimating of, 170–174
  Richardson extrapolation for, 166–170
  Taylor series estimating of, 164–166
Determinants, 278 (CPb 7.2.14)
Diagonal dominance, 282–283, 330
Diagonal matrices, 346–347, 709
Diet problem, 670 (CPb 17.1.5)
Differential equations, 353–355. See also Ordinary differential equations; Partial differential equations
Differentiation, 718
Diffusion equation, 584
Dimension, 718
Direct error analysis, 52
Direction vectors, 333
Direct method, for eigenvalues, 343
Dirichlet function, 154, 184, 584, 593, 618
Discretization method, 570–572
Divergent curves, 458
Divided differences
  for calculating coefficients a_j, 131–136
  corollary on, 160
  derivatives and, 159
Doolittle factorization, 300, 317 (CPb 8.1.2)
Dot product of vectors, 708
Double-precision floating-point representation, 48–49
Dual problem, in linear programming, 661–663, 673
Economical version of singular value decomposition, 356 (Pb 8.2.5)
Eigenvalues and eigenvectors, 258 (CPb 7.1.6), 342–360. See also Power method for linear equations
  calculating, 343–344
  Gershgorin’s Theorem and, 347–348
  in linear algebra, 719
  in linear differential equations, 353–355
  in mathematical software, 344
  matrix spectral theory of, 349–351
  properties of, 345–347
  singular value decomposition of, 348–349, 351–353
Elements, in vectors, 706, 708
Elliptic integrals, 39 (CPb 1.2.14), 180, 186
Elliptic problems, in differential equations, 584, 594 (Pb 15.1.1), 605–624
  finite-difference method for, 606–609
  finite-element methods for, 613–619
  Gauss-Seidel iterative method for, 610
  Helmholtz equation model, 605–606
  pseudocode for, 610–613
Entry, in vectors, 706, 708
Epsilon, machine, 47–48
Equal oscillation property, 141
Equations, roots of. See Roots of equations, locating
Error. See also Polynomial interpolation
  absolute and relative, 5
  in ordinary differential equations (ODE), 435
  roundoff, 50, 52, 54, 63, 253, 687
  single-step, 453
  trapezoid rule analysis of, 192–196
  truncation, 165–166, 174
  unit roundoff, 703
  vectors of, 254–255, 279 (CPb 7.2.19)
Error function, 34 (Pb 1.2.52), 185–186
Error term, 25, 27, 174
Euclidean/l2-vector norm, 721
Euler-Bernoulli beam, 340 (CPb 8.1.10)
Euler-Maclaurin formula, 206–209, 214 (Pb 5.3.26)
Euler’s constant, 59–60 (CPb 2.1.7)
Euler’s method, 432–433, 437 (Pb 10.1.15)
European Space Agency, 54
Expanded reflected points, 648
Expansion, finite, 44
Explicit method for partial differential equations, 587, 591, 595 (Pb 15.1.12)
Exponents, 44, 544 (CPb 13.1.20), 687
Factorial notation, 21 Factoring, 296. See also Matrix
factorizations Fairing curves, 371
False position method, 83–84
Feasible set, of vectors, 658
Fehlberg method of order 4, 451 Fibonacci numbers, 40 (CPb 1.2.16), 115,
628–631
Finite-difference method, 570–571, 574,
606–609
Finite-dimensional number, 718 Finite-element methods, 613–619
Finite expansion, 44
First bad case, of quadratic interpolation
algorithm, 635 First-degree polynomial accuracy
theorem, 375
First-degree spline accuracy theorem, 375 First-derivative formulas, 164–166,
170–174
First primal form, in linear programming,
657–658, 660–661, 673 Five-point formula for Laplace’s
equation, 606–607
Fixed point iteration, 117–118 Flatness test, 648
Floating-point numbers, 43–55, 102
(Pb 3.2.24)
computer errors in, 50–51, 54, 687
double-precision, 48–49
equality of, 689–690
floating-point machine number [fl(x)]
and, 51–55
IEEE standard arithmetic for, 703–705 normalized, 44–46
single-precision, 46–47
standard, 46
Forward elimination, in Gaussian algorithm, 248, 250
Fourier series, 73 (CPb 2.2.15)
Fractal basins of attraction, 99–100, 108 (CPb 3.2.27)
Fractional numbers, converting bases of,
695–696 Fractional parts, 696
French curves, 371
French railroad system problem, 559
(CPb 13.3.3)
Fresnel integral, 186, 204 (CPb 5.2.5) Frobenius norm, 338 (Pb 8.1.10) Fully implicit method for partial
differential equations, 595
(Pb 15.1.13)
Functions, minimization of, 625–658
multivariate case of, 639–656
advanced algorithms for, 644–646
contour diagrams for, 644
minimum, maximum and saddle points in, 646
Nelder-Mead algorithm for, 647–648
positive definite matrix and, 647
quasi-Newton methods for, 647
simulated annealing method for, 648–649
steepest descent procedure for, 643
Taylor Series for F in, 640–642
one-variable case of, 625–639
Fibonacci search algorithm and, 628–631
golden section search algorithm and, 631–633
quadratic interpolation algorithm and, 633–635
special case of, 626–627
unconstrained and constrained problems in, 625–626
unimodal functions F as, 627–628
Fundamental Theorem of Calculus,
181, 195
Galerkin equation, 617 Gauss-Huard algorithm, 279–280
(CPb 7.2.24)
Gaussian continued functions, 73
(CPb 2.2.18) Gaussian elimination
naive, 245–258 algorithm for, 248–250 example of, 247–248 failure of, 259–260
in matrix factorizations, 295–296, 311 (Pb 8.1.1)
pseudocode for, 250–254 residual and error vectors in,
254–255
with scaled partial pivoting, 259–280
complete partial pivoting versus, 261–264
example of, 265–266
long operation count for, 269–270 numerical stability of, 271 pseudocode for, 266–269
Gaussian method for elliptic integrals, 39 (CPb 1.2.14)
Gaussian quadrature formulas, 230–244 change of intervals in, 231
composite three-point, 243
(CPb 6.2.11)
description of, 230–231
integrals with singularities in, 237–239 Legendre polynomials in, 234–237 nodes and weights in, 232–234
Gauss-Jordan algorithm, 279–280 (CPb 7.2.24)
Gauss-Legendre quadrature formulas, 232
Gauss-Seidel method, 323–325, 330–331, 610
Generalized Neumann equation, 584 Generalized Newton’s method, 104
(Pb 3.2.36)
General quadratic functions, 652
(Pb 16.2.15) Gershgorin’s Theorem, 347–348 Global positioning systems, 111
(CPb 3.2.41)
Golden ratio, 115, 638 (CPb 16.1.5) Golden section search algorithm,
631–633 Goodness of fit, 374
Gradient of quadratic forms, 333 Gradient vector matrix, 640–641 Gram-Schmidt process, 506, 519,
722–723
Greatest lower bound, in integration, 182 Great Internet Mersenne Prime Search
(GIMPS), 541
Halley’s method, 122 (CPb 3.3.13)
Handbook of Mathematical Functions with Formulas, Graphs, and
Mathematical Tables
(Abramowitz and Stegun), 186 Harmonic functions, 607, 618
Harmonic series, 59–60 (CPb 2.1.7) Hat functions of B splines, 406 Heat equation model, 583–586 Helmholtz equation model, 584,
605–606 Hermitian matrices, 345 Hessian matrix, 640–641
Heun’s method, 437 (Pb 10.1.15)
Hexadecimal system, 693, 698. See also Bases for numbers
Hidden bits, 47
Hilbert matrix, 276 (CPb 7.2.4), 527
(Pb 12.3.2)
Histograms, 560 (CPb 13.3.13)
Horner’s algorithm, 7, 23–24
Hyperbolic problems, in differential equations, 584, 594 (Pb 15.1.1), 596–605
advection equation as, 601
analytical solution for, 597–598
Lax method for, 602
Lax-Wendroff method for, 602–603
numerical solution for, 598–599
pseudocode for, 600–601
upwind method for, 602
wave equation model as, 596–597
Identity matrix, 709
IEEE floating-point standard arithmetic,
703–705 Ill-conditioning, 321–322, 448
(CPb 10.2.5) Improved Euler’s method, 437
(Pb 10.1.15)
IMSL mathematical library, 10 Incompatible systems, 519 Inconsistent systems, 519
Index vector, 262, 266
Inductive definition, in Newton’s
method, 91
Initial-value problem, 426–428, 431, 463
(CPb 10.3.17)
Inner product, 332, 512, 708
Integer parts, 696 Integrals
Dawson, 439 (CPb 10.1.12)
elliptic, 39 (CPb 1.2.14), 180, 186 sine, 189 (CPb 5.1.2), 204 (CPb 5.2.5),
463 (CPb 10.3.15) Integration, numerical, 180–244
for area and volume estimation, 544–545
definite and indefinite, 180–181 Gaussian quadrature formulas in,
230–244
change of intervals in, 231 description of, 230–231 integrals with singularities in,
237–239
Legendre polynomials in, 234–237 nodes and weights in, 232–234
lower and upper sums in, 181–183 of ordinary differential equations
(ODE), 428–429 pseudocode and examples of,
184–187 Riemann-integrable functions in,
183–184
Romberg algorithm in, 204–215
description of, 204–205
Euler-Maclaurin formula and, 206–209
pseudocode for, 205–206 Richardson extrapolation of,
209–211
Simpson’s rule in, 216–229
adaptive, 221–225
basic, 216–220
composite, 220–221 Newton-Cotes rules and, 225–226
trapezoid rule in, 190–204
error analysis in, 192–197 multidimensional integration in,
198–199
uniform spacing in, 191–192
Intermediate-value theorem, 78, 194 Interpolation. See B spline functions;
Polynomial interpolation;
Quadratic interpolation algorithm Invariance theorem, 135
Inverse polynomial interpolation, 141–142, 567
Inverse power method, 364–365 Irregular five-point formula for Laplace’s
equation, 607
Iterations. See also Linear equations,
systems of
fixed point, 117–118 limiting, 689 Newton-Raphson, 89 Richardson, 322–323
Jacobian matrix, 97–98, 100
Jacobi method, 323–325, 330–331
Jacobi overrelaxation (JOR) method, 332
Kepler’s equation, 106 (CPb 3.2.6) Knots, in spline theory, 372, 378 Kronecker delta equation, 145
kth residual, 519
Lagrange form of polynomial interpolation, 25, 126–128, 144
Lanczos’ generalized derivative, 178 (Pb 4.3.21)
LAPACK mathematical software, 344, 351
Laplace’s equations, 286, 583–584, 605–606, 618
Laws of Motion, Newton’s, 428, 465 Lax method, 602
Lax-Wendroff method, 602–603 LDLT factorizations, 302–304, 315
(Pb 8.1.24)
Least lower bound, in integration, 182
Least squares method, 495–505, 518–531, 652 (Pb 16.1.20)
basis function in, 500–501
linear example of, 521–522
nonlinear example of, 520–522
nonpolynomial example of, 499–500
singular value decomposition (SVD) and, 522–527
weight function in, 519–520
Least upper bound, of number set, 374
Lebesgue constants, 73 (CPb 2.2.15)
Legendre polynomials, 234–237
Legendre’s elliptic integral relation, 39 (CPb 1.2.14)
Lemma, upper bound, 157
Length of vectors, 320
L’Hôpital’s rule, 34 (Pb 1.2.49)
Libraries, program, 10, 686–687
Linear algebra, 706–723
abstract vector spaces in, 716–723
bases for, 718–720
change in similarity of, 719–720
eigenvalues and eigenvectors in, 719
Gram-Schmidt process for, 722–723
linear independence in, 717–718
linear transformations for, 718–719
norms for, 721–722
orthogonal matrices and spectral theorem in, 720–721
subspaces in, 717
Cramer’s Rule and, 715
matrices in, 708–710
matrix product in, 711–713
matrix-vector product in, 711
symmetric matrices in, 714–715
transpose matrices in, 713–714
vectors in, 706–708
Linear B spline, 422 (Pb 9.3.36) Linear combinations, 707
Linear convergence, 82
Linear equations, systems of, 245–370
banded, 280–292
block pentadiagonal, 285–286 pentadiagonal, 283–285 strictly diagonal dominance in,
282–283 tridiagonal, 280–282
eigenvalues and eigenvectors in, 342–360
calculating, 343–344
Gershgorin’s Theorem and, 347–348 in linear differential equations,
353–355
in mathematical software, 344 matrix spectral theory of, 349–351 properties of, 345–347
singular value decomposition of,
348–349, 351–353 Gaussian elimination with scaled
partial pivoting of, 259–280 complete partial pivoting versus,
261–264
example of, 265–266
long operation count for, 269–270 numerical stability of, 271 pseudocode for, 266–269
inconsistent, 675–683 iterative solutions of, 319–341
basic methods of, 322–327 condition number and
ill-conditioning in, 321–322 conjugate gradient method of,
332–335
convergence theorems for, 328–331 matrix formulation for, 331–332 overrelaxation in, 332
pseudocode for, 327–328
vector and matrix norms in, 319–320
matrix factorizations in, 293–319 Cholesky factorization as, 305–306 derivation of, 296–300
example of, 294–296
A−1 in, 307
LDLT factorization as, 302–304 LU factorization as, 300–302 multiple right-hand sides in,
306–307
pseudocode for, 300
software package example of,
307–309
naive Gaussian elimination of,
245–258
algorithm for, 248–250 example of, 247–248
failure of, 259–260
pseudocode for, 250–254
residual and error vectors in, 254–255, 279 (CPb 7.2.19)
power method for, 360–370
Aitken acceleration formula for, 363
algorithms for, 361–362
inverse, 364–365
in mathematical software, 363
shifted inverse, 365–366
Linear functions, 361, 641
Linear interpolation, 162 (Pb 4.2.8) Linearize and solve approach to solving
nonlinear equations, 96, 117 Linearly independent sets, 501
Linear polynomial interpolation, 125–126 Linear programming, 657–683
approximate solution of inconsistent linear systems from, 675–683
l∞ problem for, 678–680
l1 problem for, 676–678 dual problem in, 661–663 first primal form in, 657–658,
660–661
optimization example of, 658–660 second primal form in, 663–664 simplex method for, 670–675
l∞-matrix norm, 320
l∞ problem, 678–680
l∞-vector norm, 320, 721
l∞-matrix norm, 722
l1-matrix norm, 722
Loaded die problem, 552–553 Localization theorems, 347
Local minimum points of functions, 626
Local truncation error, 435 Logarithmic integral, 186, 189
(CPb 5.1.3)
l1 approximation, 496
l1-matrix norm, 320
l1 problem, 676–678
l1-vector norm, 320, 721
Loops, clean, 686
Lower and upper sums, in integration,
181–183
Lower triangular matrix, 710
Lucas-Lehmer test, 540 LU factorization
derivation of, 296–300
description of, 294
problems in, 314–315 (Pb 8.1.18), 319
(CPb 8.1.14)
solving linear systems with, 300–302
Machine epsilon, 47–48, 703 Machine numbers, 44, 51. See also
Floating-point numbers Maclaurin series, 31 (Pb 1.2.1), 41
(CPb 1.2.21)
Macsyma mathematical software, 10 Magnitude of vectors, 320
Main diagonal matrix, 710 Mantissa, normalized, 44, 47
Maple mathematical software, 10
boundary-value problem, 577 differential equations, 427 eigenvalues, 343–344
error function in, 186
linear programming, 678–679 LU factorization in, 308 minimal solution, 526 minimization problems, 626 nonlinear equations, 99, 111
(CPb 3.2.42), 123 (CPb 3.3.19) partial differential equations, 592 polynomial interpolation in, 153
(CPb 4.1.11), 164 (CPb 4.2.12) random numbers, 533, 535
roots of equations in, 81, 88
(CPb 3.1.12), 93
singular value decomposition, 351 splines, 409–410, 418
symbolic verification in, 20
(CPb 1.1.26)
Marching problem/method, 586
March of B splines, 424 (CPb 9.3.6) Mathematica mathematical software, 10
boundary-value problem, 577 differential equations, 427 eigenvalues, 343–344
error function in, 186
linear programming, 678–679 LU factorization in, 308 minimal solution, 526 minimization problems, 626 nonlinear equations, 99, 111
(CPb 3.2.42), 123 (CPb 3.3.19)
partial differential equations, 592 polynomial interpolation in, 153
(CPb 4.1.11), 164 (CPb 4.2.12) random numbers, 533, 535
roots of equations in, 81, 88
(CPb 3.1.12), 93 splines, 418
symbolic verification in, 20 (CPb 1.1.26)
Matlab mathematical software, 10 boundary-value problem, 577 eigenvalues, 343–344
error function in, 186
linear programming, 678–679 LU factorization in, 308 minimal solution, 526 minimization problems, 626 nonlinear equations in, 99, 111
(CPb 3.2.42), 123 (CPb 3.3.19) not-a-knot condition, of
splines, 394
PDE Toolbox, 584, 592–593, 612 polynomial interpolation in, 153
(CPb 4.1.11), 164 (CPb 4.2.12) random numbers, 533, 535
roots of equations in, 81, 88
(CPb 3.1.12), 93
singular value decomposition, 351 splines, 409
vector fields, 430
Matrices. See also Linear algebra; Singular value decomposition (SVD)
companion, 358 (CPb 8.2.3)
diagonal, 346–347
Gershgorin’s Theorem and, 348
gradient vector, 640–641
Hermitian, 345–346
Hessian, 640–641
Hilbert, 276 (CPb 7.2.4), 527 (Pb 12.3.2)
Jacobian, 97–98
of near-deficiency in rank, 526
permutation, 307
positive definite, 305, 332–333, 345, 647
pseudo-inverse of, 525–526
row-equilibrated, 275 (Pb 7.2.23)
similar, 345
singular values of, 349
symmetric, 332, 345, 640
symmetric positive definite (SPD), 305, 330
transpose of, 345
triangular, 346
unitarily similar, 345–346
Vandermonde, 139–141, 152 (Pb 4.1.47), 254
Matrix factorizations, 293–319
Cholesky, 305–306 derivation of, 296–300 example of, 294–296
A−1 in, 307
LDLT , 302–304
LU, 300–302
multiple right-hand sides in, 306–307 pseudocode for, 300
software package example of, 307–309
Matrix formulations, 331–332
Matrix norms, 319–320, 721–722 Matrix spectral theory, 349–351 Maximal linearly independent basis, 718 Maximum points of functions, 646 Mayan arithmetic, 700–701
Mean, arithmetic, 15 (CPb 1.1.7) Mean-Value Theorem, 26, 193, 397 Memory fetches, 688
Mersenne prime number, 534 Midpoint method, 188 (Pb 5.1.10), 188
(Pb 5.1.12), 201 (Pb 5.2.18), 462
(CPb 10.3.8)
Minimal solution, to linear equations,
524–526
Minimization of functions. See
Functions, minimization of Minimum points of functions, 626, 646 Mixed Dirichlet/Neumann equation, 584 Mixed mode coding, 687–688
Modified false position method, 84 Modified Newton’s method, 104
(Pb 3.2.35) Modularizing code, 685
Modulus of continuity in spline functions, 374–375
Molecular conformation, 655
(CPb 16.2.2), 655 (CPb 16.2.10)
Moler-Morrison algorithm, 122 (CPb 3.3.14)
Monte Carlo methods. See also Simulation
area and volume estimation by, 544–552
computing, 547–548
“ice cream cone” example of, 548 numerical integration for, 544–545 pseudocode for, 545–547
random numbers and, 532–544 algorithms and generators for,
533–535
examples of, 535–537 pseudocode for, 537–541
Muller’s method, 123 (CPb 3.3.17) Multidimensional integration, 198–199 Multiple zero, 96, 104 (Pb 3.2.35) Multiplication, nested, 7–9, 12
(Pb 1.1.6), 131
Multipliers, in Gaussian algorithm, 249 Multistep methods, 483
Multivariate case of minimization of functions
advanced algorithms for, 644–646
contour diagrams for, 644
minimum, maximum and saddle points in, 646
Nelder-Mead algorithm for, 647–648
positive definite matrix and, 647
quasi-Newton methods for, 647
simulated annealing method for, 648–649
steepest descent procedure for, 643
Taylor Series for F in, 640–642
NAG mathematical library, 10 NaN (Not a Number), 704 Natural cubic spline functions
algorithm for, 388–392
introduction to, 385–387 pseudocode for, 392–394 smoothness property from, 396–398 space curves from, 394–396
Natural logarithm (ln), 1
Natural ordering, 262–264, 609
Navier-Stokes equation, 583–584
Near-deficiency in rank, matrix with, 526
Nelder-Mead algorithm, 647–648
Nested form of polynomial interpolation,
130–131
Nested multiplication, 7–9, 12
(Pb 1.1.6), 131 Neumann equation, 584
Neutron shielding simulation, 557–558 Neville’s algorithm, 142–144 Newton-Cotes rules, 225–226, 229
(CPb 6.1.7) Newton-Raphson iteration, 89 Newton’s form of polynomial
interpolation, 128–130, 133, 150–151 (Pb 4.1.38), 164 (CPb 4.2.14)
Newton’s Laws of Motion, 428, 465
Newton’s method for locating roots of equations, 89–100
bisection method and secant method versus, 117
convergence analysis in, 93–96
fractal basins of attraction in, 99–100
generalized, 104 (Pb 3.2.37)
interpretation of, 90–91
modified, 104 (Pb 3.2.35)
nonlinear equation systems in, 96–99
pseudocode in, 92–93
Newton’s method for nonlinear systems, 98
Nine-point formula for Laplace’s equation, 607, 621 (Pb 15.3.10)
Nodes
Chebyshev, 155–156, 158, 163
(CPb 4.2.10), 174 Gaussian, 230, 232–234
in polynomial interpolation, 125 in spline theory, 378
Noise in computation, 174
Nonlinear equation systems, 83, 96–99,
104 (Pb 3.2.39)
Nonlinear least squares problems,
520–522
Nonperiodic spline filter, 291 (CPb 7.2.22)
Normal equations, 497, 499, 501, 617 Normalized floating-point representation,
44–46
Normalized mantissa, 44, 47
Normalized scientific notation, 43 Normalized tridiagonal algorithm, 289
(CPb 7.2.12) Norm induced, 721
Norms, 319–320, 721–722 n-simplex sets, 648 Number representation. See
Floating-point numbers
Objective functions, 658
Octal system, 693, 696–697. See also
Bases for numbers
Octave mathematical software, 10 Odd periodic functions, 598
Olver’s method, 122 (CPb 3.3.12) One-variable case of minimization of
functions, 625–639 Fibonacci search algorithm and,
628–631
golden section search algorithm and,
631–633
quadratic interpolation algorithm and,
633–635
special case of, 626–627 unconstrained and constrained
problems in, 625–626 unimodal functions F as, 627–628
Optimization example, of linear programming, 658–660
Ordering, natural, 262–264, 609 Ordering, red-black (checkerboard), 620
(Pb 15.3.3)
Ordinary differential equations (ODE),
426–464 Adams-Bashforth-Moulton formulas
for, 455–456 error types in, 435
Euler’s method pseudocode for, 432–433
initial-value problem in, 426–428 integration and, 428–429 Runge-Kutta methods for, 439–450
adaptive, 450–454
example of, 454–455
of order 4, 442–443
of order 2, 441–442
pseudocode for, 443–444
Taylor series in two variables and,
440–441
stability analysis for, 456–459 Taylor series methods for, 431–435 vector fields in, 429–431
Ordinary differential equations, boundary-value problems in,
563–581
discretization method for, 570–572
shooting method for
algorithm for, 565–567 in linear case, 574–575 overview of, 563–565 pseudocode for, 575–577 refinements to, 567
Ordinary differential equations, systems of, 465–494
Adams-Bashforth-Moulton methods for, 483–494
adaptive scheme for, 488 example of, 488–489 predictor-corrector scheme in,
483–484
pseudocode for, 484–488 stiff equations and, 489–491
first order methods for, 465–477
for autonomous ODE, 471–471 Runge-Kutta, 469–471
Taylor series, 466–469
uncoupled and coupled systems in,
465–466
vector notation for, 467–469
higher order, 477–483
Orthogonal matrices, 720–721 Orthogonal systems. See also Chebyshev
polynomials
algorithm for, 508–510 orthonormal basis functions in,
505–508
polynomial regression in, 510–515
Overflow, of range, 45
Overrelaxation, 324, 326–327, 331–332
Padé interpolation, 153 (CPb 4.1.17)
Padé rational approximation, 41 (CPb 1.2.22), 73 (CPb 2.2.17)
Parabolic problems, in differential equations, 582–596, 594 (Pb 15.1.1)
applied, 582–585
Crank-Nicolson alternative method for, 590–591
Crank-Nicolson method for, 588–589
heat equation model as, 585–586
pseudocode for Crank-Nicolson method for, 589–590
pseudocode for explicit model of, 587
stability and, 591–593
Parametric representation, of curves, 394 Partial differential equations, 582–624
elliptic problems in, 605–624 finite-difference method for,
606–609
finite-element methods for,
613–619
Gauss-Seidel iterative method
for, 610
Helmholtz equation model, 605–606 pseudocode for, 610–613
hyperbolic problems in, 596–605 advection equation as, 601 analytical solution for, 597–598 Lax method for, 602
Lax-Wendroff method for, 602–603 numerical solution for, 598–599 pseudocode for, 600–601
upwind method for, 602
wave equation model, 596–597 parabolic problems in, 582–596
applied, 582–585
Crank-Nicolson alternative method
for, 590–591 Crank-Nicolson method for,
588–589
heat equation model as, 585–586 pseudocode for Crank-Nicolson
method for, 589–590 pseudocode for explicit model
of, 587
stability and, 591–593
Partial double-precision arithmetic, 492 (CPb 11.3.2)
Partition of unity on interval, 417 Pascal’s triangle, 37 (CPb 1.2.10c) Penrose properties, 526–527 Pentadiagonal systems of linear
equations, 280, 283–285 Periodic cubic splines, 387, 401
(Pb 9.2.23) Periodicity, 67, 598
Periodic sequences of random numbers, 535
Periodic spline filter, 292 (CPb 7.2.23) Permutation matrices, 307
Piecewise bilinear polynomial, 384
(CPb 9.1.3)
Piecewise linear functions, 372
Pierce decomposition, 356 (Pb 8.2.6)
π, computing value of, 12 (Pb 1.1.1, Pb 1.1.4)
Pivoting, 246
pivot element for, 249, 271 pivot equation for, 247, 249 scaled partial, 259–280
complete partial pivoting and, 261 example of, 265–266
Gaussian elimination with, 262–264 long operational count and, 269–270 numerical stability and, 271 pseudocode for, 266–269
Poisson equation, 584, 605, 613, 615 Polygonal functions, 372
Polyhedral set, 671
Polynomial(s), 8, 11, 343 Polynomial interpolation, 124–164
algorithms and pseudocode for, 136–138
of bivariate functions, 144–145 derivative estimating by, 170–174 divided differences for calculating
coefficients aj in, 131–136
errors in, 153–164
Dirichlet function as, 154 Runge function as, 154–156 theorems on, 156–160
inverse, 141–142
Lagrange form of, 126–128
linear, 125–126, 162 (Pb 4.2.8) nested form of, 130–131
Neville’s algorithm for, 142–144 Newton form of, 128–130 Vandermonde matrix for, 139–141
Polynomial regression, 510–515 Positive definite matrices, 305, 332–333,
345, 647
Power method for linear equations. See also Eigenvalues and eigenvectors
Aitken acceleration formula for, 363
algorithms for, 361–362
inverse, 364–365
in mathematical software, 363
shifted inverse, 365–366
Precision, 3–6, 63–64, 688. See also IEEE floating-point standard
arithmetic Preconditioning, 335
Predator-prey models, 465 Predictor-corrector scheme, 461
(CPb 10.3.4), 483–484 Prime numbers, 534, 540
Probability integral, 204 (CPb 5.2.5) Product, matrix, 711–713
Program libraries, 686–687 Programming derivatives, 9–10 Programming suggestions, 684–691 Projection, 356 (Pb 8.2.6) Projection operator, 722
Prony’s method, 530 (CPb 12.3.2) Protein folding, 655 (CPb 16.2.10) Pseudocode
Adams-Bashforth-Moulton methods, 484–488
area and volume estimation, 545–547 bisection method, 78–79
as bridge, 684
B spline functions, 412–413 conjugate gradient algorithm, 334 Crank-Nicolson method, 589–590 elliptic problems, 610–613
Euler’s method, 432–433
explicit model of partial differential
equations, 587
Gaussian elimination with scaled
partial pivoting, 266–269 Gauss-Seidel method, 327, 610 hyperbolic problems, 600–601 Jacobi method, 327
linear equations, 327–328
loaded die problems, 552–553
matrix factorizations, 300
naive Gaussian elimination, 250–254 natural cubic spline functions, 392–394
Newton’s method, 92–93
numerical integration, 184–187 polynomial interpolation, 136–138 power method, 361–362
random numbers, 535, 537–541 Romberg algorithm, 205–206 Runge-Kutta-Fehlberg methods, 452 Runge-Kutta methods, 443–444,
453–454
Schoenberg’s process, 415 secant method, 112
shooting method for ordinary
differential equations (ODE),
575–577
successive overrelaxation (SOR)
method, 327
Taylor series of order 4, 468–469
Pseudo-inverse, of matrices, 525–526 Pseudo-random numbers, 533
Quadratic B spline, 423 (Pb 9.3.37) Quadratic convergence, 93, 100 Quadratic form, 333
Quadratic functions, 642, 652
(Pb 16.2.15)
Quadratic interpolation algorithm,
633–635
Quadratic splines, 376–378
Quadrature rules, 187
Quasi-Newton methods for minimization
of functions, 647 Quasi-random number sequences, 540
Radix point, 693
Random numbers, 532–544
algorithms and generators for, 533–535
examples of, 535–537
pseudocode for, 537–541 Random walk problem, 561
(CPb 13.3.17–18) Range, of computer, 45 Range reduction, 67–68
Rationalizing numerators, 64
Rayleigh quotient, 368 (Pb 8.3.7) Reciprocals of numbers, 102 (Pb 3.2.23) Recursive definition, in Newton’s
method, 91
Recursive property of divided differences
theorem, 134
Recursive trapezoid formula for equal
subintervals, 196–197 Red-black ordering, 620 (Pb 15.3.3) Reflected points, 648
Regression, polynomial, 510–515 Regula falsi method, 83–84 Relative errors, 5
Relaxation factor, 326. See also
Overrelaxation Remainder, 25
Residual, 254–255, 279 (CPb 7.2.19), 519, 619
Richardson extrapolation
estimating derivatives and, 166–170,
177 (Pb 4.3.19) Euler-Maclaurin formula and, 207 of Romberg algorithm, 209–211
Richardson iteration, 322–323 Riemann-integrable functions,
183–184
Riffle shuffles, 562 (CPb 13.3.27) Rising sequences, 562 (CPb 13.3.27) Robust software, 269
Rolle’s Theorem, 156–157
Romberg algorithm
convergence in, 165 description of, 204–205 Euler-Maclaurin formula and,
206–209 notation for, 196
pseudocode for, 205–206 Richardson extrapolation and, 168,
209–211
Roots of equations, locating, 76–123
bisection method for, 76–85 convergence analysis in, 81–83 example of, 79–81
false position method in, 83–84 pseudocode for, 78–79
Newton’s method for, 89–100
convergence analysis in, 93–96
fractal basins of attraction in, 99–100
interpretation of, 90–91
nonlinear equation systems in, 96–99
pseudocode in, 92–93
secant method for, 111–119 algorithm for, 112–113 bisection and Newton’s methods
versus, 117
convergence analysis in, 114–116 fixed point iteration and, 117–118
Rounding modes, 705
Rounding numbers, 6, 50
Roundoff error, 50, 52, 54, 63, 253, 435,
687, 703 Round-to-even method, 6
Round to nearest value, 705 Row-equilibrated matrix, 275 (Pb 7.2.23) Row vectors, 706
Runge function, 125, 154–156 Runge-Kutta-England method, 463–464
(CPb 10.3.19) Runge-Kutta methods, 439–450
adaptive, 450–454
example of, 454–455
of order 5, 451
of order 4, 442–443
of order 3, 445–446 (Pb 10.2.7)
of order 2, 441–442
pseudocode for, 443–444
for systems of ordinary differential
equations, 469–472
Taylor series in two variables and, 440–441
Saddle points of functions, 646 Scale vector, 262
Scaling, 271
Schoenberg’s process, 414–415 Scientific notation, normalized, 43 Secant method for locating roots of
equations, 111–119 algorithm for, 112–113 bisection and Newton’s methods
versus, 117
convergence analysis in, 114–116 fixed point iteration and, 117–118
Second bad case, of quadratic interpolation algorithm, 635
Second-derivative formulas, 173–174 Second primal form, in linear
programming, 663–664
Seed, for random number sequence, 534 Serpentine curves, 395
Shifted inverse power method, 365–366 Shooting method for ordinary differential
equations (ODE), 563–570 algorithm for, 565–567
in linear case, 574–575 overview of, 563–565 pseudocode for, 575–577 refinements to, 567
Schur’s Theorem, 346
Significance
loss of, 61–68
avoiding in subtraction, 64–67 computer-caused, 62–63 range reduction and, 67–68 theorem for, 63–64
significant digits in, 3–5, 61 Significands, 47
Similar matrices, 345
Simplex method, 670–675 Simple zero, 93
Simpson’s rule, 216–229
adaptive, 221–225
basic, 216–220, 228 (Pb 6.1.8) composite, 220–221, 228 (Pb 6.1.6),
243 (CPb 6.2.11)
Simulated annealing method, 648–649 Simulation, 552–562. See also Monte
Carlo methods
birthday problem as, 553–555 Buffon’s needle problem as, 555–556 loaded die problem as, 552–553 neutron shielding, 557–558
two dice problem as, 556–557
Simultaneous nonlinear equations, 104 (Pb 3.2.39)
Sine integral, 189 (CPb 5.1.2), 204 (CPb 5.2.5), 463 (CPb 10.3.15)
Single-precision floating-point representation, 46–47
Single-step error, 453
Single-step methods, 483
Singular value decomposition (SVD) economical version of, 356 (Pb 8.3.5) eigenvalues and eigenvectors and,
348–349
least squares method and, 519,
522–527
matrix spectral theory and, 350 numerical examples of, 351–353
Singular values, 320
sin x, periodicity of, 67 Smoothing data, 396–398. See also
Chebyshev polynomials; Least
squares method Software, mathematical, 10–11
boundary-value problem, 577 development of, 691 differential equations, 427 eigenvalues and eigenvectors,
343–344
error function in, 186
linear programming, 678–679 LU factorization, 308
matrix factorizations, 307–309 minimal solution, 526 minimization problems, 626 nonlinear equations, 99, 111
(CPb 3.2.42), 123 (CPb 3.3.19) partial differential equations, 584, 592 polynomial interpolation, 153
(CPb 4.1.11), 164 (CPb 4.2.12) power method for linear equations, 363 random numbers, 533, 535
robust, 269
roots of equations, 81, 88
(CPb 3.1.12), 93
singular value decomposition, 351 splines, 394, 409–410, 418
symbolic verification, 20 (CPb 1.1.26) vector fields, 430
Solution case, of quadratic interpolation algorithm, 634
Solutions for differential equations, 426 Sparse factorization, 315 (Pb 8.1.24) Spectral/l2-matrix norm, 320. See also
Matrix spectral theory Spectral/l2-vector norm, 722
Spectral radius, 320, 329 Spectral theorem, 720–721 Spline functions, 371–425
B, 404–425
for Bézier curves, 416–418 interpolation and approximation by,
410–412
pseudocode and example of,
412–413
Schoenberg’s process for, 414–415 theory of, 404–410
first-degree, 371–374
interpolating quadratic, Q(x), 376–378 modulus of continuity in, 374–375 natural cubic, 385–404
algorithm for, 388–392
introduction to, 385–387 pseudocode for, 392–394 smoothness property from, 396–398 space curves from, 394–396
second-degree, 376
Subbotin quadratic, 378–380 Spurious zeros, 62
Stability
numerical, 271
in ordinary differential equations
(ODE), 456–459
in partial differential equations,
591–593
Standard deviation, 15 (CPb 1.1.7) Standard floating-point
representation, 46
Stationary points of functions, 646 Statistician’s rounding, 6
Steady state of systems, 489 Steepest descent procedure, 643, 655
(CPb 16.2.2)
Steffensen’s method, 104 (Pb 3.2.36) Stiff equations, 489–491
Stirling’s formula, 34 (Pb 1.2.47) Subbotin quadratic spline functions,
378–380
Subdiagonal matrix, 280, 710
Subnormal numbers, 704
Subordinate norms, 721
Subtraction, significance and, 64–67 Successive overrelaxation (SOR) method,
324, 326, 331–332 Superdiagonal matrix, 280, 710 Superlinear convergence, 84, 115 Supremum (least upper bound), 374 Symbolic computations, 435 Symbolic verification, 20 (CPb 1.1.26) Symmetric banded storage mode, 291
(CPb 7.2.20)
Symmetric matrices, 332, 345, 640,
714–715
Symmetric positive definite (SPD)
matrices, 305, 330 Symmetric storage mode, 278
(CPb 7.2.13) Synthetic division, 7
Tacoma Narrows Bridge project, 493 (CPb 11.3.9)
Taylor series, 20–31, 177 (Pb 4.3.19) alternating series and, 28–30 complete Horner’s algorithm in,
23–24
derivative estimating by, 164–166 examples of, 20–22
of f at the point c, 22–23
for F in minimization of functions,
640–642
machine precision and, 70 (Pb 2.2.28) in Mean-Value Theorem, 26
for natural logarithm (ln), 1
for ordinary differential equations, 431–435, 466–469
Runge-Kutta methods and, 440–441 Taylor’s Theorem in terms of h and,
27–28
Taylor’s Theorem in terms of (x − c)
and, 24–26
Telescoped rational functions, 73
(CPb 2.2.18) Tensor-product interpolation, 144 Tent function, 122 (CPb 3.3.15) Test cases, 685
Theorems
alternating series, 28–30, 32 (Pb 1.2.13)
axioms for a vector space, 716 bisection method, 82 Cayley-Hamilton, 358 (CPb 8.2.5) Cholesky factorizations, 305
cubic spline smoothness, 397
divided differences and derivatives, 159
duality, 662
eigenvalues of similar matrices, 345
Euler-Maclaurin formula, 208
on existence of polynomial
interpolation, 128
first-degree polynomial accuracy, 375 first-degree spline accuracy, 375
first primal form, 658
Fundamental Theorem of Calculus,
181, 195
Gaussian quadrature, 232 Gershgorin’s, 347
initial value problem uniqueness, 431 intermediate-value, 78, 194
on interpolation errors, 156–160
on interpolation properties, 143 invariance, 135
Jacobi and Gauss-Seidel
convergence, 330
linear differential equations, 354
linear independence, 718
localization, 347
long operations, 270
on loss of precision, 63–64
LU factorization, 298
matrix spectral, 349
Mean-Value, 26, 397
Mean-Value Theorem for Integrals, 193 minimal solution, 525
Newton’s method of locating roots of
equations, 94 orthogonal basis, 350 Penrose properties of
pseudo-inverse, 526
on polynomial interpolation error,
156–160
primal and dual problems, 662 recursive property of divided
differences, 134
recursive trapezoid formula, 197 Richardson extrapolation, 168
Riemann integral, 183 Rolle’s, 156–157 second primal form, 663 Shure’s, 346
spectral, 720–721
spectral radius, 329
spectral theorem for symmetric
matrices, 720
successive overrelaxation (SOR), 331 SVD least squares, 523
Taylor’s, 166
Taylor’s Theorem in terms of h, 27–28 Taylor’s Theorem in terms of (x - c),
24–26
trapezoid rule precision, 192 vertices and column vectors, 671 Weierstrass approximation, 416 weighted Gaussian quadrature, 232
3-simplex sets, 648
Transpose of matrices, 345, 707, 713–714
Trapezoid rule, 190–204. See also Simpson’s rule
composite, 191, 194, 243 (CPb 6.2.11)
composite with unequal spacing, 203 (Pb 5.2.32)
error analysis in, 192–196
multidimensional integration in, 198–199
recursive formula for equal subintervals in, 196–197
uniform spacing in, 191–192
Triangular inequality, 320, 721
Triangular matrix, 346, 710
Tridiagonal matrix, 709
Tridiagonal systems of linear equations, 280–282, 289 (CPb 7.2.12)
Troesch’s problem, 581 (CPb 14.2.7)
Truncated series, 25, 28
Truncation error, 165–166, 174, 435
Two dice problem, 556–557
Two-dimensional integration over the unit square, 198
2-simplex sets, 648
Unconstrained minimization problems, 625–626
Underflow, of range, 45
Undetermined coefficients, method of, 233
Uniformly distributed numbers, 533
Unimodal functions F, 627–628
Unitarily similar matrices, 345–346
Unit roundoff error, 50, 703
Unit vectors, 708
Unstable functions, roots as, 88 (CPb 3.1.12)
Upper bound lemma, 157
Upper triangular matrix, 710
Upper triangular system, 248
Upwind method, 602
Usual case, of quadratic interpolation algorithm, 634
Vandermonde matrix, 139–141, 152 (Pb 4.1.47), 254
Variable metric algorithm, 647
Variables, declaring, 685–686
Variance, 15 (CPb 1.1.7, CPb 1.1.8)
Vector norms, 319–320, 721
Vector notation, 467–469
Vectors. See also Abstract vector spaces in linear algebra; Eigenvalues and eigenvectors
column, 671
A-conjugate, 332
convex hull of, 417
direction, 333
gradient, 640–641
index, 262, 266
inner product of, 332
in linear algebra, 706–708
matrix-vector product and, 711
in ordinary differential equations (ODE), 429–431
residual, 254–255, 279 (CPb 7.2.19)
scale, 262
vector inequality of, 658
Verification, symbolic, 20 (CPb 1.1.26)
Vertices in K, 671–672
Volume estimation. See Area and volume estimation
Warning messages, 685
Wave equation model, 582, 584, 596–597
Weierstrass approximation theorem, 416
Weight function, 519–520
Weights, Gaussian, 230, 232–234
Wilkinson’s polynomial, 88 (CPb 3.1.12), 121 (CPb 3.3.9)
Zeros
of f, 76–77, 81
of multiplicity, 96, 104 (Pb 3.2.35)
simple, 93
spurious, 62
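Formulas from Integral Calculus
$\int x^a\,dx = \dfrac{x^{a+1}}{a+1} + C \quad (a \neq -1)$
$\int e^x\,dx = e^x + C$
$\int e^{ax}\,dx = \dfrac{1}{a}e^{ax} + C$
$\int x e^{ax}\,dx = \dfrac{1}{a^2}e^{ax}(ax - 1) + C$
$\int x^{-1}\,dx = \ln|x| + C$
$\int \ln x\,dx = x\ln|x| - x + C$
$\int x\ln x\,dx = \dfrac{x^2}{2}\ln|x| - \dfrac{x^2}{4} + C$
$\int \dfrac{dx}{a+bx} = \dfrac{1}{b}\ln|a+bx| + C$
$\int \dfrac{dx}{(a+bx)^2} = \dfrac{-1}{b(a+bx)} + C$
$\int \dfrac{dx}{x(ax+b)} = \dfrac{1}{b}\ln\left|\dfrac{x}{ax+b}\right| + C$
$\int \dfrac{dx}{a^2+x^2} = \dfrac{1}{a}\arctan\dfrac{x}{a} + C$
$\int \dfrac{dx}{a+bx^2} = \dfrac{1}{\sqrt{ab}}\arctan\dfrac{x\sqrt{ab}}{a} + C$
$\int \dfrac{dx}{\sqrt{a^2-x^2}} = \arcsin\dfrac{x}{a} + C$
$\int \dfrac{dx}{\sqrt{x^2+a^2}} = \ln\left|\sqrt{x^2+a^2} + x\right| + C$
$\int \sqrt{x^2 \pm a^2}\,dx = \dfrac{x}{2}\sqrt{x^2 \pm a^2} \pm \dfrac{a^2}{2}\ln\left|x + \sqrt{x^2 \pm a^2}\right| + C$
$\int \sin x\,dx = -\cos x + C$
$\int \cos x\,dx = \sin x + C$
$\int \tan x\,dx = \ln|\sec x| + C$
$\int \sec x\,dx = \ln|\sec x + \tan x| + C$
$\int x\sin x\,dx = \sin x - x\cos x + C$
$\int \sec^2 x\,dx = \tan x + C$
$\int \sec x\tan x\,dx = \sec x + C$
$\int \sin^2 x\,dx = \dfrac{x}{2} - \dfrac{1}{4}\sin 2x + C$
$\int \cos^2 x\,dx = \dfrac{x}{2} + \dfrac{1}{4}\sin 2x + C$
$\int \sinh x\,dx = \cosh x + C$
$\int \cosh x\,dx = \sinh x + C$
$\int \tanh x\,dx = \ln|\cosh x| + C$
$\int \coth x\,dx = \ln|\sinh x| + C$
$\int \arcsin x\,dx = x\arcsin x + \sqrt{1-x^2} + C$
$\int \arccos x\,dx = x\arccos x - \sqrt{1-x^2} + C$
$\int \arctan x\,dx = x\arctan x - \dfrac{1}{2}\ln(1+x^2) + C$
$\int F'(g(x))\,g'(x)\,dx = F(g(x)) + C$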
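A quick way to sanity-check an entry in the integration table above is to differentiate the claimed antiderivative numerically and compare it with the integrand. The following is a minimal sketch of that idea, not part of the original formula sheet: the helper names `central_difference` and `max_mismatch`, the sample parameters `a = 2` and `b = 3`, and the step size `h` are all assumptions chosen for illustration.

```python
import math

def central_difference(F, x, h=1e-5):
    """Approximate F'(x) with a symmetric difference quotient (assumed step h)."""
    return (F(x + h) - F(x - h)) / (2.0 * h)

# Arbitrary sample parameters for this check (not from the formula sheet).
a, b = 2.0, 3.0

# Each entry pairs an integrand f with the antiderivative F listed in the table.
CHECKS = [
    ("x e^{ax}",
     lambda x: x * math.exp(a * x),
     lambda x: math.exp(a * x) * (a * x - 1.0) / a**2),
    ("1/(a + b x^2)",
     lambda x: 1.0 / (a + b * x * x),
     lambda x: math.atan(x * math.sqrt(a * b) / a) / math.sqrt(a * b)),
    ("sqrt(x^2 + a^2)",
     lambda x: math.sqrt(x * x + a * a),
     lambda x: 0.5 * x * math.sqrt(x * x + a * a)
               + 0.5 * a * a * math.log(x + math.sqrt(x * x + a * a))),
    ("sin^2 x",
     lambda x: math.sin(x) ** 2,
     lambda x: 0.5 * x - 0.25 * math.sin(2.0 * x)),
]

def max_mismatch(f, F, points):
    """Largest |F'(x) - f(x)| over the sample points, with F' taken numerically."""
    return max(abs(central_difference(F, x) - f(x)) for x in points)

if __name__ == "__main__":
    for name, f, F in CHECKS:
        print(name, max_mismatch(f, F, [0.1, 0.5, 1.0]))
```

A mismatch near the size of the finite-difference error suggests the entry is consistent; a mismatch of order one would point to a sign or constant error in the formula as transcribed.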
Fundamental Theorem of Calculus
$\dfrac{d}{dx}\int_a^x f(t)\,dt = f(x)$
Integration by Parts
$\int u\,dv = uv - \int v\,du$
Mean Value for Integrals
$\int_a^b f(x)g(x)\,dx = f(\xi)\int_a^b g(x)\,dx \quad (g(x) \geq 0)$
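Series
$e^x = 1 + x + \dfrac{x^2}{2!} + \dfrac{x^3}{3!} + \dfrac{x^4}{4!} + \dfrac{x^5}{5!} + \dfrac{x^6}{6!} + \cdots = \sum_{k=0}^{\infty} \dfrac{x^k}{k!} \quad (|x| < \infty)$
$a^x = 1 + x\ln a + \dfrac{(x\ln a)^2}{2!} + \dfrac{(x\ln a)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \dfrac{(x\ln a)^k}{k!} \quad (|x| < \infty)$
$\sin x = x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} - \dfrac{x^7}{7!} + \dfrac{x^9}{9!} - \dfrac{x^{11}}{11!} + \cdots = \sum_{k=0}^{\infty} (-1)^k \dfrac{x^{2k+1}}{(2k+1)!} \quad (|x| < \infty)$
$\cos x = 1 - \dfrac{x^2}{2!} + \dfrac{x^4}{4!} - \dfrac{x^6}{6!} + \dfrac{x^8}{8!} - \dfrac{x^{10}}{10!} + \cdots = \sum_{k=0}^{\infty} (-1)^k \dfrac{x^{2k}}{(2k)!} \quad (|x| < \infty)$
$\tan x = x + \dfrac{x^3}{3} + \dfrac{2x^5}{15} + \dfrac{17x^7}{315} + \dfrac{62x^9}{2835} + \cdots \quad \left(x^2 < \dfrac{\pi^2}{4}\right)$
$\arcsin x = x + \dfrac{x^3}{6} + \dfrac{1 \cdot 3}{2 \cdot 4}\,\dfrac{x^5}{5} + \dfrac{1 \cdot 3 \cdot 5}{2 \cdot 4 \cdot 6}\,\dfrac{x^7}{7} + \cdots \quad (x^2 < 1)$
$\arctan x = x - \dfrac{x^3}{3} + \dfrac{x^5}{5} - \dfrac{x^7}{7} + \cdots = \sum_{k=0}^{\infty} (-1)^k \dfrac{x^{2k+1}}{2k+1} \quad (x^2 < 1)$
$\ln(1+x) = x - \dfrac{x^2}{2} + \dfrac{x^3}{3} - \dfrac{x^4}{4} + \cdots = \sum_{k=1}^{\infty} (-1)^{k-1} \dfrac{x^k}{k} \quad (-1 < x \leq 1)$
$\ln\dfrac{1+x}{1-x} = 2\left(x + \dfrac{x^3}{3} + \dfrac{x^5}{5} + \dfrac{x^7}{7} + \cdots\right) = 2\sum_{k=1}^{\infty} \dfrac{x^{2k-1}}{2k-1} \quad (|x| < 1)$
$(x+y)^n = x^n + n x^{n-1}y + \dfrac{n(n-1)}{2!}x^{n-2}y^2 + \dfrac{n(n-1)(n-2)}{3!}x^{n-3}y^3 + \cdots = \sum_{k=0}^{n} \dbinom{n}{k} x^{n-k}y^k$
$\dfrac{1}{1-x} = 1 + x + x^2 + x^3 + x^4 + x^5 + \cdots = \sum_{k=0}^{\infty} x^k \quad (|x| < 1)$
Formal Taylor Series for f about c
$f(x) \sim f(c) + f'(c)(x-c) + \dfrac{f''(c)(x-c)^2}{2!} + \dfrac{f'''(c)(x-c)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \dfrac{f^{(k)}(c)(x-c)^k}{k!}$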
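The series above can be checked numerically by comparing partial sums with library values. The following is a minimal sketch, assuming Python's standard `math` module as the reference; the function names `exp_partial_sum` and `log1p_partial_sum` are hypothetical helpers introduced only for this illustration.

```python
import math

def exp_partial_sum(x, n):
    """Sum of x**k / k! for k = 0..n, building each term from the previous one."""
    term = 1.0
    total = 1.0
    for k in range(1, n + 1):
        term *= x / k          # x**k / k! obtained from x**(k-1) / (k-1)!
        total += term
    return total

def log1p_partial_sum(x, n):
    """Sum of (-1)**(k-1) * x**k / k for k = 1..n; intended for -1 < x <= 1."""
    power = 1.0
    total = 0.0
    for k in range(1, n + 1):
        power *= x             # running value of x**k
        sign = 1.0 if k % 2 == 1 else -1.0
        total += sign * power / k
    return total

if __name__ == "__main__":
    print(exp_partial_sum(1.0, 20), math.e)
    print(log1p_partial_sum(0.5, 60), math.log(1.5))
```

Updating each term from the previous one avoids recomputing factorials and powers, which is the usual way such sums are evaluated in practice.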
Taylor Series for f(x)
$f(x) = \sum_{k=0}^{n} \dfrac{f^{(k)}(c)(x-c)^k}{k!} + E_{n+1}$ where $E_{n+1} = \dfrac{f^{(n+1)}(\xi)(x-c)^{n+1}}{(n+1)!}$
Taylor Series for f(x + h)
$f(x+h) = \sum_{k=0}^{n} \dfrac{f^{(k)}(x)h^k}{k!} + E_{n+1}$ where $E_{n+1} = \dfrac{f^{(n+1)}(\xi)h^{n+1}}{(n+1)!}$
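As an illustration of the error term, take f(x) = e^x about c = 0. Every derivative is e^x, so the remainder satisfies |E_{n+1}| ≤ e^{max(x, 0)} |x|^{n+1} / (n+1)!, because ξ lies between 0 and x. The sketch below compares the actual truncation error with this bound; the function names `taylor_exp` and `remainder_bound` are hypothetical and chosen only for this example.

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of exp about c = 0, evaluated at x."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

def remainder_bound(x, n):
    """Upper bound on |E_{n+1}| for exp about 0: the (n+1)-st derivative e^xi
    is at most e^max(x, 0) for xi between 0 and x."""
    return math.exp(max(x, 0.0)) * abs(x) ** (n + 1) / math.factorial(n + 1)

if __name__ == "__main__":
    for x in (-1.0, 0.5, 2.0):
        for n in (2, 5, 8):
            actual = abs(math.exp(x) - taylor_exp(x, n))
            print(x, n, actual, remainder_bound(x, n))
```

The bound is usually pessimistic by a modest factor, since it replaces the unknown ξ with the worst case on the interval.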
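Alternating Series
If $a_1 \geq a_2 \geq \cdots \geq a_n \geq \cdots \geq 0$ for all $n$ and $\lim_{n\to\infty} a_n = 0$, then
$\sum_{k=1}^{\infty} (-1)^{k-1}a_k = \lim_{n\to\infty} \sum_{k=1}^{n} (-1)^{k-1}a_k = \lim_{n\to\infty} S_n = S$. Moreover, $|S - S_n| \leq a_{n+1}$ for all $n$.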
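The error estimate in the alternating series statement can be observed on the series 1 − 1/2 + 1/3 − ··· = ln 2, which is the ln(1 + x) series at x = 1. The sketch below is only an illustration under that choice of series; the names `harmonic_term`, `alternating_partial_sum`, and `error_and_bound` are hypothetical.

```python
import math

def harmonic_term(k):
    """a_k = 1/k: positive, nonincreasing, and tending to zero."""
    return 1.0 / k

def alternating_partial_sum(term, n):
    """S_n = sum over k = 1..n of (-1)**(k-1) * term(k)."""
    return sum((-1) ** (k - 1) * term(k) for k in range(1, n + 1))

def error_and_bound(n):
    """Return (|S - S_n|, a_{n+1}) for the alternating harmonic series, S = ln 2."""
    s_n = alternating_partial_sum(harmonic_term, n)
    return abs(math.log(2.0) - s_n), harmonic_term(n + 1)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, *error_and_bound(n))
```

For this series the actual error is roughly half the bound, so the estimate is informative even though convergence is slow.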
Mean-Value Theorem
$f(b) = f(a) + (b-a)f'(\xi)$ for some $\xi$ in $(a, b)$