NUMERICAL MATHEMATICS AND COMPUTING
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
SEVENTH EDITION
NUMERICAL MATHEMATICS AND COMPUTING
Ward Cheney & David Kincaid
The University of Texas at Austin
Australia • Brazil • Japan • Korea • Mexico • Singapore • Spain • United Kingdom • United States
This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by
ISBN#, author, title, or keyword for materials in your areas of interest.
Numerical Mathematics and Computing, Seventh Edition
Ward Cheney & David Kincaid
Publisher: Richard Stratton
Acquisitions Editor: Molly Taylor
Assistant Editor: Shaylin Walsh Hogan
Editorial Assistant: Alex Gontar
Media Editor: Andrew Coppola
Marketing Manager: Jennifer P. Jones
Marketing Communications Manager: Mary Anne Payumo
Content Project Manager: Alison Eigel Zade
Senior Art Director: Linda May
Manufacturing Planner: Doug Bertke
Rights Acquisition Specialist: Shalice Shah-Caldwell
Production Service: MPS Limited
Interior Chapter Opener image: Roofoo/©istockphoto
Illustrator: MPS Limited
Cover Designer: KeDesign
Cover Image: Roofoo/©istockphoto
Compositor: MPS Limited
© 2013, 2008, 2004 Brooks/Cole, Cengage Learning
ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act, without the prior written permission of the publisher.
Library of Congress Control Number: 2012935124
ISBN-13: 978-1-133-10371-4
ISBN-10: 1-133-10371-5
Brooks/Cole
20 Channel Center Street Boston, MA 02210
USA
Cengage Learning is a leading provider of customized learning solutions with office locations around the globe, including Singapore, the United Kingdom, Australia, Mexico, Brazil and Japan. Locate your local office at: international.cengage.com/region.
Cengage Learning products are represented in Canada by Nelson Education, Ltd.
For your course and learning solutions, visit www.cengage.com. Purchase any of our products at your local college store or at our
preferred online store www.cengagebrain.com
Instructors: Please visit login.cengage.com and log in to access
instructor-specific resources.
Printed in the United States of America 1 2 3 4 5 6 7 16 15 14 13 12
For product information and technology assistance, contact us at
Cengage Learning Customer & Sales Support, 1-800-354-9706. For permission to use material from this text or product,
submit all requests online at www.cengage.com/permissions. Further permissions questions can be e-mailed to permissionrequest@cengage.com.
Contents
Preface xix
1 Mathematical Preliminaries and Floating-Point Representation 1
1.1 Introduction 2
Objectives 2
Limitations 2
Programming 2
Strategies 3
Significant Digits of Precision: Examples 3 Errors: Absolute and Relative 5
Accuracy and Precision 6
Rounding and Chopping 7
Computation of π 8
Nested Multiplication and Horner’s Algorithm 8 Pairs of Easy/Hard Problems 11
First Programming Experiment 12 Mathematical Software 13 Websites 13
Additional Reading 13
Summary 1.1 14
Exercises 1.1 14 Computer Exercises 1.1 16
1.2 Mathematical Preliminaries 20
Taylor Series 20
Complete Horner’s Algorithm 24 Taylor’s Theorem in Terms of (x − c) 25 Mean-Value Theorem 26
Taylor’s Theorem in Terms of h 27 Alternating Series 28
Summary 1.2 31
Exercises 1.2 31
Computer Exercises 1.2 34
1.3 Floating-Point Representation 38
Normalized Floating-Point Representation 38 Floating-Point Representation 41 Single-Precision Floating-Point Form 42
Double-Precision Floating-Point Form 43 Computer Errors in Representing Numbers 46 Notation fl(x) and Backward Error Analysis 47 Historical Notes 51
Summary 1.3 51
Exercises 1.3 52 Computer Exercises 1.3 54
1.4 Loss of Significance 56
Significant Digits 56
Computer-Caused Loss of Significance 57 Theorem on Loss of Precision 59
Avoiding Loss of Significance in Subtraction 60 Range Reduction 62
Summary 1.4 63
Exercises 1.4 63
Computer Exercises 1.4 65
2 Linear Systems 69
2.1 Naive Gaussian Elimination 69
A Larger Numerical Example 71 Algorithm 72
Pseudocode 75
Testing the Pseudocode 77 Residual and Error Vectors 79 Summary 2.1 79
Exercises 2.1 80 Computer Exercises 2.1 81
2.2 Gaussian Elimination with Scaled Partial Pivoting 82
Naive Gaussian Elimination Can Fail 82
Partial Pivoting and Full (Complete) Pivoting 84 Gaussian Elimination with Scaled Partial Pivoting 85 A Larger Numerical Example 87
Pseudocode 89
Long Operation Count 92
Numerical Stability 93
Scaling 94
Variants of Gaussian Eliminations 94
Condition Number 97
Backslash Operator in MATLAB 97
Summary 2.2 97
Exercises 2.2 98
Computer Exercises 2.2 100
2.3 Tridiagonal and Banded Systems 103
Tridiagonal Systems 103
Strictly Diagonal Dominance 106
Pentadiagonal Systems 107 Block Pentadiagonal Systems 108 Summary 2.3 109
Exercises 2.3 110 Computer Exercises 2.3 110
3 Nonlinear Equations 114
3.1 Bisection Method 114
Introduction 114
Bisection Algorithm 116
Pseudocode 116
Numerical Examples 117
Convergence Analysis 119
False Position (Regula Falsi) Method 121 Modified False Position Method 121 Summary 3.1 122
Exercises 3.1 123
Computer Exercises 3.1 124
3.2 Newton’s Method 125
Interpretations of Newton’s Method 126 Pseudocode 127
Illustration 128
Convergence Analysis 129
Discussion of Newton’s Method 130 Systems of Nonlinear Equations 132 Fractal Basins of Attraction 134 Summary 3.2 135
Exercises 3.2 136 Computer Exercises 3.2 138
3.3 Secant Method 142
Secant Algorithm 143 Convergence Analysis 144 Comparison of Methods 147 Hybrid Schemes 148 Fixed-Point Iteration 148 Summary 3.3 149
Exercises 3.3 149 Computer Exercises 3.3 150
4 Interpolation and Numerical Differentiation 153
4.1 Polynomial Interpolation 153
Preliminary Remarks 153 Polynomial Interpolation 154
Interpolating Polynomial: Lagrange Form 155 Existence of Interpolating Polynomial 156 Interpolating Polynomial: Newton Form 157 Nested Form 158
Calculating Coefficients ai Using Divided Differences 160 Algorithms and Pseudocode 165
Vandermonde Matrix 167
Inverse Interpolation 169
Polynomial Interpolation by Neville’s Algorithm 170 Interpolation of Bivariate Functions 172
Summary 4.1 173
Exercises 4.1 174
Computer Exercises 4.1 177
4.2 Errors in Polynomial Interpolation 178
Introduction 178
Dirichlet Function 179
Runge Function 179
Theorems on Interpolation Errors 181 Summary 4.2 185
Exercises 4.2 186
Computer Exercises 4.2 187
4.3 Estimating Derivatives and Richardson Extrapolation 187
First-Derivative Formulas via Taylor Series 188 Richardson Extrapolation 190
First-Derivative Formulas via Interpolation Polynomials 194
Second-Derivative Formulas via Taylor Series 196 Noise in Computation 197
Summary 4.3 197
Exercises 4.3 198
Computer Exercises 4.3 200
5 Numerical Integration 201
5.1 Trapezoid Method 201
Definite and Indefinite Integrals 201 Trapezoid Rule: Nonuniform Spacing 202 Composite Trapezoid Rule: Uniform Spacing 203 Pseudocode 204
Error Analysis 205
Applying the Error Term 208
Recursive Trapezoid Formula 210 Multidimensional Integration 211
Remarks 212
Summary 5.1 213
Exercises 5.1 214 Computer Exercises 5.1 216
5.2 Romberg Algorithm 217
Description 217
Pseudocode 218 Euler-Maclaurin Formula 220 General Extrapolation 222 Summary 5.2 224
Exercises 5.2 224
Computer Exercises 5.2 226
5.3 Simpson’s Rules and Newton-Cotes Rules 227
Basic Trapezoid Rule 227
Basic Simpson’s Rule 228
Basic Simpson’s Rule: Uniform Spacing 229 Composite Simpson’s Rule 231
An Adaptive Simpson’s Scheme 231
Example Using Adaptive Simpson Procedure 234 Newton-Cotes Rules 235
Some Closed Newton-Cotes Rules with Error Terms 235 Some Open Newton-Cotes Rules with Error Terms 236 Summary 5.3 236
Exercises 5.3 237
Computer Exercises 5.3 238
5.4 Gaussian Quadrature Formulas 239
Description 239
Change of Intervals 240
Gaussian Quadrature Rules 241 Legendre Polynomials 243
Gaussian Quadrature Nodes and Weights 244 Integrals with Singularities 246 Summary 5.4 246
Exercises 5.4 248
Computer Exercises 5.4 249
6 Spline Functions 252
6.1 First-Degree and Second-Degree Splines 252
First-Degree Splines 252
Modulus of Continuity 255 Second-Degree Splines 256 Interpolating Quadratic Splines 257 Subbotin Quadratic Splines 258
Summary 6.1 260
Exercises 6.1 261 Computer Exercises 6.1 263
6.2 Natural Cubic Splines 263
Introduction 263
Natural Cubic Splines 265
Algorithm for Natural Cubic Splines 266
Algorithm for Solving Tridiagonal System 269 Pseudocode for Natural Cubic Splines 271
Using Pseudocode for Interpolating and Curve Fitting 272 Space Curves 273
Smoothness Property 274
Summary 6.2 276
Exercises 6.2 277
Computer Exercises 6.2 280
6.3 B Splines: Interpolation and Approximation 281
Introduction 281
Bi0 Splines 281
Bi1 and Bik Splines 282
Linear Combination of Bik Splines 283 Derivatives of B Splines 284
Integration of B Splines 285
Interpolation and Approximation by B Splines 286 Higher-Degree B Splines 287
Pseudocode and a Curve-Fitting Example 288 Schoenberg’s Process 289
Pseudocode: Schoenberg’s Process 290
Bézier Curves 291
Summary 6.3 293
Exercises 6.3 295
Computer Exercises 6.3 297
7 Initial Values Problems 299
7.1 Taylor Series Methods 299
Initial-Value Problem: Analytical versus Numerical Solution 299 An Example of a Practical Problem 301
Solving Differential Equations and Integration 302
Vector Fields 303
Taylor Series Methods 304
Euler’s Method and Pseudocode 305
Taylor Series Method of Higher Order 306
Pseudocode and Results 307
Types of Errors 307
Taylor Series Method Using Symbolic Computations 308 Summary 7.1 308
Exercises 7.1 309 Computer Exercises 7.1 310
7.2 Runge-Kutta Methods 311
Introduction 311
Taylor Series for f (x, y) 311 Runge-Kutta Method of Order 2 312 Runge-Kutta Method of Order 4 314 Pseudocode 315
Summary 7.2 316
Exercises 7.2 316
Computer Exercises 7.2 318
7.3 Adaptive Runge-Kutta and Multistep Methods 320
An Adaptive Runge-Kutta-Fehlberg Method 320 Pseudocode 321
An Industrial Example 323 Adams-Bashforth-Moulton Formulas 324 Stability Analysis 325
Summary 7.3 328
Exercises 7.3 328 Computer Exercises 7.3 329
7.4 Methods for First and Higher-Order Systems 331
Uncoupled and Coupled Systems 332 Taylor Series Method 332
Vector Notation 334
Systems of ODEs 334
Taylor Series Method: Vector Notation 334 Runge-Kutta Method 335
Pseudocode 336
Autonomous ODE 337
Higher-Order Differential Equations and Systems 339 Higher-Order Differential Equations 339
Systems of Higher-Order Differential Equations 341 Autonomous ODE Systems 341
Summary 7.4 342
Exercises 7.4 344 Computer Exercises 7.4 346
7.5 Adams-Bashforth-Moulton Methods 347
A Predictor-Corrector Scheme 347 Pseudocode 348
An Adaptive Scheme 352
An Engineering Example 352
Stiff ODEs and an Example 353
History of ODE Numerical Methods 355 Summary 7.5 355
Exercises 7.5 355
Computer Exercises 7.5 356
8 More on Linear Systems 358
8.1 Matrix Factorizations 358
LU Factorization 358
Numerical Example 359
Formal Derivation 361
Pseudocode 365
Solving Linear Systems Using LU Factorization 365 LDLT Factorization 367
Cholesky Factorization 369
Multiple Right-Hand Sides 371 Computing A−1 371
Example Using Software Packages 372 Summary 8.1 373
Exercises 8.1 375 Computer Exercises 8.1 378
8.2 Eigenvalues and Eigenvectors 380
Av Scalar Multiple of v 380
Calculating Eigenvalues and Eigenvectors 381 Mathematical Software 382
Properties of Eigenvalues 383
Gershgorin’s Theorem 385
Singular Value Decomposition 387
Numerical Examples of Singular Value Decomposition 389 Application: Linear Differential Equations 391 Application: A Vibration Problem 392
Summary 8.2 393
Exercises 8.2 394
Computer Exercises 8.2 395
8.3 Power Method 396
Mathematical Derivation 396
Power Method Pseudocode 398
Aitken Acceleration 399
Inverse Power Method 400
Example: Inverse Power Method 401 Shifted (Inverse) Power Method 401 Example: Shifted Inverse Power Method 402 Summary 8.3 402
Exercises 8.3 403 Computer Exercises 8.3 404
8.4 Iterative Solutions of Linear Systems 405
Vector and Matrix Norms 405
Condition Number and Ill-Conditioning 406 Basic Iterative Methods 408
Pseudocode 412
Convergence Theorems 414
Matrix Formulation of Iterative Methods 416 Another View of Overrelaxation 417 Conjugate Gradient Method 417
CG Pseudocode and Features 419
Summary 8.4 421
Exercises 8.4 422 Computer Exercises 8.4 423
9 Least Squares Methods and Fourier Series 426
9.1 Method of Least Squares 426
Linear Least Squares 426
Linear Example 429 Nonpolynomial Example 430 Basis Functions {g0, g1, …, gn} 431 Summary 9.1 432
Exercises 9.1 433 Computer Exercises 9.1 434
9.2 Orthogonal Systems and Chebyshev Polynomials 435
Orthonormal Basis Functions {g0, g1, …, gn} 435 Outline of Algorithm 438
Smoothing Data: Polynomial Regression 439
A Procedure for Polynomial Regression 440 Least-Squares Problem Revisited 443 Gram-Schmidt Process 444
Summary 9.2 445
Exercises 9.2 446 Computer Exercises 9.2 447
9.3 Examples of the Least-Squares Principle 447
Inconsistent Systems 447
Modified Gram-Schmidt Process 448
Use of a Weight Function w(x) 448
Nonlinear Example 449
Linear and Nonlinear Example 450
Additional Details on SVD 451
Using the Singular Value Decomposition 453 Examples 455
Orthogonal Matrices and Spectral Theorem 456 Summary 9.3 456
Exercises 9.3 457
Computer Exercises 9.3 458
9.4 Fourier Series 459
Introduction 459
Least-Squares Approximation 459 Orthogonality Properties 460
Standard Integrals 461
Trigonometric Polynomial 462
Cosine Series and Sine Series 463 Sawtooth Wave Function on [−π, π ] 464 Infinite Fourier Series 465
Periodic on [−P/2, P/2] 466
Periodic on [−L, L] 466
Periodic on [0, 2L] 466
Fourier Series and Music 467
Sawtooth Wave Function on [0, 2L] 467 Fourier Series Examples 468
Euler’s and de Moivre Formulas 469 Complex Fourier Series 470
Roots of Unity 471
Discrete Fourier Transform 472
Fast Fourier Transforms 473 Mathematical Software 474
Summary 9.4 475
Exercises 9.4 478
Computer Exercises 9.4 479
10 Monte Carlo Methods and Simulation 481
10.1 Random Numbers 481
Random-Number Algorithms and Generators 482 Examples 484
Uses of Pseudocode Random 486
Summary 10.1 489
Exercises 10.1 490 Computer Exercises 10.1 490
10.2 Estimation of Areas and Volumes by Monte Carlo Techniques 491 Numerical Integration 491
Example and Pseudocode 493
Computing Volumes 494
Ice Cream Cone Example 495 Summary 10.2 495
Exercises 10.2 496
Computer Exercises 10.2 496
10.3 Simulation 498
Loaded Die Problem 498 Birthday Problem 499
Buffon’s Needle Problem 501 Two Dice Problem 502 Neutron Shielding Problem 502
Summary 10.3 504 Computer Exercises 10.3 504
11 Boundary-Value Problems 507
11.1 Shooting Method 507
Introduction 507
Shooting Method 508
Shooting Method Algorithm 508 Modifications and Refinements 510 Summary 11.1 511
Exercises 11.1 512
Computer Exercises 11.1 513
11.2 A Discretization Method 513
Finite-Difference Approximations 513 The Linear Case 514
Numerical Example and Pseudocode 515 Pseudocode 515
Shooting Method in the Linear Case 516 Numerical Example Revisited 517 Pseudocode 518
Summary 11.2 520
Exercises 11.2 521 Computer Exercises 11.2 522
12 Partial Differential Equations 524
12.1 Parabolic Problems 524
Some Partial Differential Equations from Applied Problems 524 Heat Equation Model Problem 526
Finite-Difference Method 527
Pseudocode for Explicit Method 528
Crank-Nicolson Method 529
Pseudocode for the Crank-Nicolson Method 530 Alternative Version of the Crank-Nicolson Method 531 Stability 532
Summary 12.1 534
Exercises 12.1 535
Computer Exercises 12.1 536
12.2 Hyperbolic Problems 536
Wave Equation Model Problem 536 Analytic Solution 537
Numerical Solution 538
Pseudocode 539
Advection Equation 541
Lax Method 541
Upwind Method 541 Lax-Wendroff Method 542 Summary 12.2 542
Exercises 12.2 543 Computer Exercises 12.2 543
12.3 Elliptic Problems 544
Helmholtz Equation 544 Finite-Difference Method 544 Gauss-Seidel Iterative Method 548 Numerical Example and Pseudocode 549 Finite-Element Methods 551
More on Finite Elements 555 Summary 12.3 557
Exercises 12.3 558 Computer Exercises 12.3 559
13 Minimization of Functions 561
13.1 One-Variable Case 561
Unconstrained and Constrained Minimization Problems 561 One-Variable Case 562
Unimodal Functions F 563
Fibonacci Search Algorithm 563
Golden Section Search Algorithm 566 Quadratic Interpolation Algorithm 568 Summary 13.1 571
Exercises 13.1 571
Computer Exercises 13.1 572
13.2 Multivariate Case 573
Taylor Series for F: Gradient Vector and Hessian Matrix 573 Alternative Form of Taylor Series 575
Steepest Descent Procedure 576
Contour Diagrams 577
More Advanced Algorithms 577
Minimum, Maximum, and Saddle Points 579 Positive Definite Matrix 580
Quasi-Newton Methods 580
Nelder-Mead Algorithm 580
Method of Simulated Annealing 581 Summary 13.2 582
Exercises 13.2 583
Computer Exercises 13.2 585
14 Linear Programming Problems 587
14.1 Standard Forms and Duality 587
First Primal Form 587
Numerical Example 588
Transforming Problems into First Primal Form 590 Dual Problem 590
Second Primal Form 592
Summary 14.1 593
Exercises 14.1 594
Computer Exercises 14.1 596
14.2 Simplex Method 597
Vertices in K and Linearly Independent Columns of A 597 Simplex Method 599
Summary 14.2 600
Exercises 14.2 600
Computer Exercises 14.2 601
14.3 Inconsistent Linear Systems 601
l1 Problem 601
l∞ Problem 604
Summary 14.3 605
Exercises 14.3 607 Computer Exercises 14.3 607
Appendix A Advice on Good Programming Practices 608
A.1 Programming Suggestions 608
Case Studies 611
On Developing Mathematical Software 615
Appendix B Representation of Numbers in Different Bases 616
B.1 Representation of Numbers in Different Bases 616
Base β Numbers 617
Conversion of Integer Parts 617 Conversion of Fractional Parts 619 Base Conversion 10 ↔ 8 ↔ 2 620 Base 16 622
More Examples 622
Summary B.1 623
Exercises B.1 623
Computer Exercises B.1 624
Appendix C Additional Details on IEEE Floating-Point Arithmetic 626
C.1 More on IEEE Standard Floating-Point Arithmetic 626
Appendix D Linear Algebra Concepts and Notation 629
D.1 Elementary Concepts 629
Vectors 629
Matrices 631 Matrix-Vector Product 634 Matrix Product 634
Other Concepts 636 Cramer’s Rule 638
Answers for Selected Exercises 639 Bibliography 657
Index 665
Preface
Our basic objective is to acquaint students of science and engineering with the potentialities of using computers for solving numerical problems that may arise in their professions. A secondary objective is to give students an opportunity to hone their skills in computer programming and problem solving. A final objective is to help students arrive at an understanding of the important subject of errors that inevitably arise in scientific computing and to learn methods for detecting, predicting, and controlling them.
Much of science today involves complex computations built upon mathematical software systems. The users may have little knowledge of the underlying numerical algorithms used in these problem-solving environments. By studying numerical methods, one can become a more informed user and be better prepared to evaluate and judge the accuracy of the results. Students should study mathematical algorithms to learn not only how they work, but also how they can fail! Critical thinking and constant skepticism are traits we want students to acquire. An extensive numerical calculation, even when carried out by state-of-the-art software, may need to be subjected to independent verification, if possible.
We have tried to achieve an elementary style of presentation since we want this book to be accessible to students who may not be advanced in their formal study of mathematics and computer sciences. Toward this end, we have provided numerous examples and figures for illustrative purposes and fragments of pseudocode, which are informal descriptions of computer algorithms.
Believing that most students at this level need a survey of the subject of numerical mathematics and computing, we have presented a wide diversity of topics, including some rather advanced ones that play an important role in scientific computing. We recommend that the reader have at least a one-year study of calculus, plus some basic knowledge of matrices, vectors, and ordinary differential equations.
Seventh Edition Features
Following suggestions and comments from the reviewers and based on our experience in teaching this material, we have revised and enhanced the entire book.
• Some chapters have been combined and the order of others changed. Also, the Linear Systems chapter has been moved earlier because it is a topic that arises frequently and needs to be presented sooner.
• A new section on Fourier Series has been added because it arises in various engineering applications. The section on lower and upper sums in numerical integration has been removed.
• Now displayed in double columns, the problems have been renamed as exercises to clarify that they are intended for practice and for further learning.
• In an effort to make the new edition more student friendly, margin notes have been added as well as additional figures, tables, examples, and exercises.
Suggestions for Use
Numerical Mathematics and Computing, Seventh Edition, can be used in a variety of ways, depending on the emphasis the instructor prefers and the inevitable time constraints. Exercises have been supplied in abundance to enhance the book’s versatility. They are divided into two categories: Exercises and Computer Exercises. In the first category, there are more than 800 exercises in analysis that require pencil, paper, and possibly a calculator. In the second category, there are approximately 500 computer exercises involving programming and using a computer. Students are asked to solve some exercises using advanced software systems such as MATLAB, Mathematica, or Maple. While students may be asked to write their own computer codes, they can often follow a model or example to assist them, but in other cases they must proceed on their own from a mathematical description.
In some of the Computer Exercises, there is something to be learned beyond simply writing code: a moral, if you like. For example, this can happen if the exercise being solved and the pseudocode provided are somehow mismatched. Some Computer Exercises are designed to give students experience in using mathematical software.
The Student Solution Manual is sold as a separate publication. Also, teachers who adopt the book can obtain the Instructor Solution Manual from the publisher or via the Instructor Companion website. Sample programs based on the pseudocode have been coded in several programming languages and are on the textbook website.
To access course materials and companion resources, please visit the websites on p. iv; e.g., www.cengagebrain.com. At the CengageBrain.com home page, one can search using either the International Standard Book Number (ISBN), the title of the book, or the last name of one of the authors. This takes you to the product page, where free companion resources can be found. The authors have established the following website:
www.ma.utexas.edu/CNA/NMC7/
The arrangement of chapters reflects our own view of how the material might best unfold for a student new to the subject. In some cases, there may be little mutual dependence among the chapters, and the instructor can order the sequence of presentation in various ways. Certainly, the instructor may omit some sections and chapters for want of time.
Our own recommendations for courses based on this text are as follows:
• A one-term course carefully covering Chapters 1 through 7 (possibly omitting Sections 4.2, 6.3, and 7.3–5, for example), followed by a selection of material from the remaining chapters as time permits.
• A one-term survey rapidly skimming over almost all of the chapters and omitting some of the more advanced material.
• A two-term course carefully covering all chapters.
Student Research Projects
Throughout there are some Computer Exercises designated as Student Research Projects. These are opportunities for students to explore topics beyond the scope of the textbook. Many of these involve application areas for numerical methods. These projects usually include computer programming and numerical experiments. A favorable aspect of these assignments is to allow students to choose a topic of interest to them, possibly something that may arise in their future profession or their major study area. For example, any topic suggested may be delved into more deeply by consulting other texts and references. In
preparing such a project, the students have to learn about the topic, locate the significant references (books, research papers, and websites), do the computing, and write a report that explains all this in a coherent way. Students can avail themselves of mathematical software systems such as MATLAB, Maple, or Mathematica, as well as doing their own computer programming in whatever programming language they prefer.
Acknowledgments
We have profited from advice and suggestions kindly offered by a large number of colleagues, students, and users of the previous editions.
We are grateful to have had the opportunity and privilege of teaching classes in scientific computing, numerical analysis, and various other topics in the Department of Mathematics and the Department of Computer Sciences of The University of Texas at Austin. Without their support and the use of their computing facilities, writing this book would not have been possible. In particular, we thank Maorong Zou and Margaret Combs for help with computer issues.
Recently, the second author taught classes in the Department of Petroleum Engineering and Geosystems Engineering of The University of Texas at Austin and in the Applied Mathematics and Computer Sciences Division of the King Abdullah University of Science and Technology in Saudi Arabia and wishes to thank them both.
Valuable comments and suggestions were made by our colleagues and friends. In particular, we miss our friend David Young who was always very generous with suggestions for improving the accuracy and clarity of the exposition in previous editions. Some parts of those editions were typed with great care and attention to detail by Sheri Brice, Katy Burrell, Kata Carbone, Margaret Combs, and Belinda Trevino. Aaron Naiman was particularly helpful in sharing material he prepared for his courses.
We wish to acknowledge the reviewers who have provided detailed critiques for this new edition: Eugino Aulisa, Texas Tech University; Erin Bach, University of Wisconsin, Madison; Marcin Bownik, University of Oregon; Olga Brezhneva, Miami University; George Grossman, Central Michigan University; Luke Olson, University of Illinois at Urbana-Champaign; Ronald Taylor, Wright State University.
Reviewers from previous editions were Krishan Agrawal, Eric Back, Neil Berger, Thomas Boger, Marcin Bownik, Olga Brezhneva, Jose E. Castillo, Charles Collins, Charles Cullen, Elias Y. Deeba, F. Emad, Gentil A. Estévez, Terry Feagin, Jose Flores, Leslie Foster, Bob Funderlic, Mahadevan Ganesh, William Gearhart, Juan Gil, John Gregory, George Grossman, Bruce P. Hillam, Patrick Lang, Ren Chi Li, Wu Li, Xiaofan Li, Vania Mascioni, Bernard Maxum, Edward Neuman, Roy Nicolaides, Luke Olson, Amar Raheja, J. N. Reddy, Daniel Reynolds, Asok Sen, Ching-Kuang Shene, William Slough, Ralph Smart, Thiab Taha, Ronald Taylor, Jin Wang, Stephen Wirkus, Marcus Wright, Quiang Ye, Tjalling Ypma, and Shangyou Zhan.
Many individuals took the time to write us with suggestions and criticisms. We are grateful to the following individuals and others who have sent us e-mails concerning the textbook or solution manuals: A. Aawwal, Nabeel S. Abo-Ghander, Krishan Agrawal, Roger Alexander, Husain Ali Al-Mohssen, Kistone Anand, Keven Anderson, Vladimir Andrijevik, Jon Ashland, Hassan Basir, Steve Batterson, Neil Berger, Adarsh Beohar, Bernard Bialecki, Jason Brazile, Keith M. Briggs, Carl de Boor, Jose E. Castillo, Fatih Celiker, Debao Chen, Ellen Chen, Hwen Chin, Edmond Chow, Lloyd Clark, John Cook, Brad Copper, Roger Crawfis, Charles Cullen, Antonella Cupillari, Jonathan Dautrich, James Arthur Davis, Tim Davis, Elias Y. Deeba, Suhrit Dey, Alan Donoho, Jason Durheim, Wayne
Dymacek, John Eisenmenger, Fawzi P. Emad, Paul Enigenbury, Terry Feagin, Leslie Foster, Peter Fraser, Richard Gardner, Mohamad El Gharamti, John Gregory, Katherine Hua Guo, Scott Hagerup, Kent Harris, Scott Henry, Bruce P. Hillam, Tom Hogan, Jackie Hohnson, Christopher M. Hoss, Jason S. Howel, Kwang-il In, Victoria Interrante, Sadegh Jokar, Erni Jusuf, Jason Karns, Jacob Y. Kazakia, Grant Keady, Achim Kehrein, Jacek Kierzenka, Daniel Kopelove, S. A. (Seppo) Korpela, Andrew Knyazev, Gary Krenz, Jihoon Kwak, Kim Kyungjin, Minghorng Lai, Patrick Lang, Kevin Lee, Wu Li, Grace Liu, Wenguo Liu, Stacy Long, Mark C. Malburg, Igor Malkiman, P. W. Manual, Hamidreza Mashalyekh, Peter McNamara, Juan Meza, F. Milianazzo, Milan Miklavcic, Sue Minkoff, George Minty, Baharen Momken, Justin Montgomery, Ramon E. Moore, Harunrashid Muhammad, Aaron Naiman, Asha Nallana, Edward Neuman, Durene Ngo, Roy Nicolaides, Jeff Nunemacher, Valia Guerra Ones, David Parker, Tony Praseuth, Rolfe G. Petschek, Terri Prakash, Mihaela Quirk, Helia Niroomand Rad, Jeremy Rahe, Frank Roberts, Frank Rogers, Simen Rokaas, Hossein Roodi, Robert S. Raposo, Chris C. Seib, Granville Sewell, Keh-Ming Shyue, Daniel Somerville, Nathan Smith, Mandayam Srinivas, Alexander Stromberger, Xingping Sun, Thiab Taha, Hidajaty Thajeb, Joseph Traub, Phuoc Truong, Vincent Tsao, Bi Roubolo Vona, David Wallace, Charles Walters, Kegnag Wang, Layne T. Watson, Andre Weideman, Perry Wong, Richard Fa Wai, Yuan Xu, and Rick Zaccone.
It is our pleasure to thank those who helped with the task of preparing the new edition. The staff of Cengage Learning and associated individuals have been most understanding and patient in bringing this book to fruition. In particular, we thank Shaylin Walsh, Alison Eigel Zade, Charu Khanna, and Christine Sabooni for their efforts on behalf of this project. Some of those who were involved with previous editions were Seema Atwal, Craig Barth, Carol Benedict, Stacy Green, Jeremy Hayhurst, Janet Hill, Cheryll Linthicum, Gary Ostedt, Merrill Peterson, Bob Pirtle, Sara Planck, Elizabeth Rammel, Ragu Raghavan, Elizabeth Rodio, Anne Seitz, and Marlene Thom.
We offer our heartfelt gratitude to Victoria Cheney, Joyce Pfluger, and Martha Wells.
We would appreciate any comments, questions, suggestions, or corrections that readers may wish to communicate to us using the e-mail address below.
Ward Cheney & David Kincaid kincaid@ices.utexas.edu
Dedication
In memory of
our friend and colleague
David M. Young, Jr. (“Dr. SOR”) (1923–2008)
1 Mathematical Preliminaries and Floating-Point Representation
The Taylor series for the natural logarithm ln(1 + x), evaluated at x = 1, gives
$$\ln 2 = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \frac{1}{7} - \frac{1}{8} + \cdots$$
Adding together the eight terms shown, we obtain ln 2 ≈ 0.63452,∗ which is a poor approximation to ln 2 = 0.69315. . . . On the other hand, the Taylor series for ln[(1 + x)/(1 − x)] gives us (with x = 1/3)
$$\ln 2 = 2\left(3^{-1} + \frac{3^{-3}}{3} + \frac{3^{-5}}{5} + \frac{3^{-7}}{7} + \cdots\right)$$
By adding the four terms shown between the parentheses and multiplying by 2, we obtain ln 2 ≈ 0.69313. This illustrates the fact that rapid convergence of a Taylor series can be expected near the point of expansion but not at remote points. Evaluating the series ln[(1 + x)/(1 − x)] at x = 1/3 is a mechanism for evaluating ln 2 near the point of expansion. It also gives an example in which the properties of a function can be exploited to obtain a more rapidly convergent series. Taylor series and Taylor’s Theorem are two of the principal topics we discuss in this chapter. They are ubiquitous features in much of numerical analysis.
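The two approximations of ln 2 quoted above are easy to reproduce. The following Python sketch is an illustration only: the function names `ln2_alternating` and `ln2_fast` are our own and are not part of the text. It sums the first eight terms of the series for ln(1 + x) at x = 1 and the first four terms of the series for ln[(1 + x)/(1 − x)] at x = 1/3.

```python
import math


def ln2_alternating(n_terms):
    """Partial sum 1 - 1/2 + 1/3 - ... of the series for ln(1 + x) at x = 1."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))


def ln2_fast(n_terms):
    """Partial sum of 2*(x + x^3/3 + x^5/5 + ...) with x = 1/3."""
    x = 1.0 / 3.0
    return 2.0 * sum(x ** (2 * k + 1) / (2 * k + 1) for k in range(n_terms))


slow = ln2_alternating(8)   # about 0.63452
fast = ln2_fast(4)          # about 0.69313
print(slow, fast, math.log(2.0))
```

With half as many terms, the second series already agrees with ln 2 to about four decimal places, while the first is wrong in the second decimal place.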
Computers usually do not use base-10 arithmetic for storage or computation. Numbers that have a finite expression in one number system may have an infinite expression in another system. This phenomenon is illustrated when the familiar decimal number 1/10 is converted into the binary system:
$$(0.1)_{10} = (0.0\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ 0011\ \ldots)_2$$
We explain the floating-point number system and develop basic facts about roundoff errors. Another topic is loss of significance, which occurs when nearly equal numbers are subtracted. It is studied and shown to be avoidable by various programming techniques.
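The repeating binary expansion of 1/10 can be generated by repeated doubling: the integer part produced at each step is the next binary digit. The sketch below is a hedged illustration in Python; the helper name `binary_fraction_digits` is hypothetical, and exact rational arithmetic from the standard `fractions` module is used so that no rounding enters the demonstration.

```python
from fractions import Fraction


def binary_fraction_digits(value, n_digits):
    """Return the first n_digits binary digits after the point of 0 <= value < 1."""
    frac = Fraction(value)
    digits = []
    for _ in range(n_digits):
        frac *= 2
        bit = int(frac)          # integer part is the next binary digit (0 or 1)
        digits.append(str(bit))
        frac -= bit              # keep only the fractional part
    return "".join(digits)


one_tenth_bits = binary_fraction_digits(Fraction(1, 10), 12)
print("0." + one_tenth_bits)     # the pattern 0011 repeats after the first 0

# The stored double nearest to 0.1 is therefore not exactly 1/10.
print(Fraction(0.1) == Fraction(1, 10))
```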
∗The symbol ≈ means “approximately equal to.”
1.1 Introduction
Objectives
The objective of this text is to help the reader to understand some of the many methods for solving scientific problems using computers. We intentionally limit ourselves to the typical problems that arise in science, engineering, and technology. Thus, we do not touch on problems of accounting, modeling in the social sciences, information retrieval, artificial intelligence, and so on.
Usually, our treatment of problems does not begin at the source, for that would take us far afield into such areas as physics, engineering, and chemistry. Instead, we consider problems after they have been cast into certain standard mathematical forms. The reader is therefore asked to accept on faith the assertion that the chosen topics are indeed important ones in scientific computing.
To survey many topics, we must treat some in a superficial way. But it is hoped that the reader acquires a good bird’s-eye view of the subject and therefore is better prepared for a further, deeper study of numerical analysis.
For each principal topic, we list good current sources for more information. In any realistic computing situation, considerable thought should be given to the choice of method to be employed. Although most procedures presented here are useful and important, they may not be the optimum ones for a particular problem. In choosing among available methods for solving a problem, the analyst or programmer should consult recent references.
Limitations
Becoming familiar with basic numerical methods without realizing their limitations would be foolhardy. Numerical computations are almost invariably contaminated by errors, and it is important to understand the source, propagation, magnitude, and rate of growth of these errors. Numerical methods that provide approximations and error estimates are more valuable than those that provide only approximate answers. While we cannot help but be impressed by the speed and accuracy of the modern computer, we should temper our admiration with generous measures of skepticism. As the eminent numerical analyst Carl-Erik Fröberg once remarked:
Never in the history of mankind has it been possible to produce so many wrong answers so quickly!
Thus, one of our goals is to help the reader arrive at this state of skepticism, armed with methods for detecting, estimating, and controlling errors.
Programming
The reader is expected to be familiar with the rudiments of programming. Algorithms are presented as pseudocode, and no particular programming language is adopted. The pseudocode is an important intermediate step between the algorithms presented in the textbook and coding, running, and debugging them in a particular computer language on a computer.
Some of the primary issues related to numerical methods are the nature of numerical errors, the propagation of errors, and the efficiency of the computations involved, as well as the number of operations and their possible reduction.
Many students have graphing calculators and access to mathematical software systems that can produce solutions to complicated numerical problems with minimal difficulty. The purpose of a numerical mathematics course is to examine the underlying algorithmic techniques so that students learn how the software or calculator found the answer. In this way, they would have a better understanding of the inherent limits on the accuracy that must be anticipated in working with such systems.
Strategies
One of the fundamental strategies behind many numerical methods is the replacement of a difficult problem with a string of simpler ones. By carrying out an iterative process, the solutions of the simpler problems can be put together to obtain the solution of the original, difficult problem. This strategy succeeds in solving linear systems (Chapters 2 and 8), finding zeros of functions (Chapter 3), interpolation (Chapter 4), numerical integration (Chapter 5), and more.
Students majoring in computer science and mathematics as well as those majoring in engineering and other sciences are usually well aware that numerical methods are needed to solve problems that they frequently encounter. It may not be as well recognized that scientific computing is quite important for solving problems that come from fields other than engineering and science, such as economics. For example, finding zeros of functions may arise in problems using the formulas for loans, interest, and payment schedules. Also, problems in areas such as those involving the stock market may require least-squares solutions (Chapter 9). In fact, the field of computational finance requires solving quite complex mathematical problems utilizing a great deal of computing power. Economic models routinely require the analysis of linear systems of equations with thousands of unknowns.
Significant Digits of Precision: Examples
Significant digits are digits beginning with the leftmost nonzero digit and ending with the rightmost correct digit, including final zeros that are exact.
EXAMPLE 1
Using an industrial laser cutting machine, a technician cuts a 2-meter by 3-meter rectangular sheet of steel into two equal triangular pieces. What is the diagonal measurement of each triangle? Can these pieces be slightly modified so the diagonals are exactly 3.6 meters?
Solution
Since the piece is rectangular, the Pythagorean Theorem can be invoked. Thus, to compute the diagonal, d, in Figure 1.1 (p. 4), we write $2^2 + 3^2 = d^2$. It follows that
$$d = \sqrt{4 + 9} = \sqrt{13} \approx 3.605551275$$
This last number is obtained by using a handheld calculator. The accuracy of d as given can be verified by computing (3.60555 1275) ∗ (3.60555 1275) = 13.
Is this numerical value for the diagonal to be taken seriously? Certainly not. To begin with, the given dimensions of the rectangle cannot be expected to be precisely 2 and 3. If the dimensions are accurate to 1 millimeter, the dimensions may be as large as 2.001 and 3.001.
FIGURE 1.1 Rectangular sheet of steel (sides 2 and 3, diagonal d)
Using the Pythagorean Theorem again, one finds that the diagonal may be as large as
$$d = \sqrt{2.001^2 + 3.001^2} = \sqrt{4.004001 + 9.006001} = \sqrt{13.010002} \approx 3.6069$$
Similar reasoning indicates that d may be as small as 3.6042. These are both worst cases. We can conclude that
$$3.6042 \le d \le 3.6069$$
No greater accuracy can be claimed for the diagonal.
If we want the diagonal to be exactly 3.6, we require
$$(3 - c)^2 + (2 - c)^2 = 3.6^2$$
Here, for simplicity, we reduce each of the two perpendicular sides by the same amount c. This leads to the equation
$$c^2 - 5c + 0.02 = 0$$
Using the quadratic formula, we find the smaller root to be
$$c = 2.5 - \sqrt{6.23} \approx 0.00400$$
By cutting off 4 millimeters from the two perpendicular sides, we have triangular pieces of sizes 1.996 by 2.996 meters. Check: $(1.996)^2 + (2.996)^2 \approx 3.6^2$. ■
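The arithmetic of this example can be checked with a few lines of Python. This is a sketch under stated assumptions: the variable names are ours, and the 1-millimeter tolerance `tol` is the one assumed in the discussion above. The second expression for the root, `c_stable`, is an addition not found in the text; it uses the fact that the product of the two roots of c² − 5c + 0.02 = 0 is 0.02, which avoids subtracting two nearly equal numbers.

```python
import math

tol = 0.001                                   # assumed measurement tolerance: 1 mm

d_nominal = math.hypot(2.0, 3.0)              # sqrt(13)
d_max = math.hypot(2.0 + tol, 3.0 + tol)      # both sides as large as possible
d_min = math.hypot(2.0 - tol, 3.0 - tol)      # both sides as small as possible
print(d_min, d_nominal, d_max)

# Smaller root of c^2 - 5c + 0.02 = 0 from the quadratic formula.
c = 2.5 - math.sqrt(6.23)

# Equivalent form using (product of roots) = 0.02; no cancellation occurs here.
c_stable = 0.02 / (2.5 + math.sqrt(6.23))

# Check: trimming c from each perpendicular side gives a diagonal of 3.6.
print(c, c_stable, math.hypot(2.0 - c_stable, 3.0 - c_stable))
```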
To show the effect of the number of significant digits used in a calculation, we consider the problem of solving a linear system of equations.
EXAMPLE 2
Let’s concentrate on solving for the variable y in this linear system of equations in two variables
$$\begin{cases} 0.1036\,x + 0.2122\,y = 0.7381 \\ 0.2081\,x + 0.4247\,y = 0.9327 \end{cases} \tag{1}$$
First, carry only three significant digits of precision in the calculations. Second, repeat with four significant digits throughout. Finally, use ten significant digits.
Solution
Step 1. In the first task, we round all numbers in the original problem to three digits and round all the calculations, keeping only three significant digits. We take a multiple α of the first equation and subtract it from the second equation to eliminate the x-term in the second equation. The multiplier is α = 0.208/0.104 ≈ 2.00. Thus, in the second equation, the new coefficient of the x-term is 0.208 − (2.00)(0.104) ≈ 0.208 − 0.208 = 0 and the new y-term coefficient is 0.425 − (2.00)(0.212) ≈ 0.425 − 0.424 = 0.001. The right-hand side is 0.933 − (2.00)(0.738) = 0.933 − 1.48 = −0.547. Hence, we find that
y = −0.547/(0.001) ≈ −547
FIGURE 1.2 In 2D, well-conditioned and ill-conditioned linear systems
Step 2. We decide to keep four significant digits throughout and repeat the calculations. Now the multiplier is α = 0.2081/0.1036 ≈ 2.009. In the second equation, the new coefficient of the x-term is 0.2081 − (2.009)(0.1036) ≈ 0.2081 − 0.2081 = 0, the new coefficient of the y-term is 0.4247 − (2.009)(0.2122) ≈ 0.4247 − 0.4263 = −0.001600, and the new right-hand side is 0.9327 − (2.009)(0.7381) ≈ 0.9327 − 1.483 ≈ −0.5503. Hence, we find
y = −0.5503/(−0.001600) ≈ 343.9
We are shocked to find that the answer has changed from −547 to 343.9, which is a huge difference!
Step 3. In fact, if we repeat this process and carry ten significant digits, we find that even 343.9 is not accurate, since we obtain y ≈ 356.2907199.
Using a computer, we find
$$y \approx 3.56290\,74421\,51324 \times 10^{2}$$
The lesson learned in this example is that data thought to be accurate should be carried with full precision and not be rounded prior to each of the calculations. ■
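The three steps of this example can be imitated with Python's standard `decimal` module, whose context precision rounds the result of every operation to a chosen number of significant digits. The sketch below is illustrative only: the function name `solve_for_y` is hypothetical, and we assume that the module's default round-half-even rule is an acceptable stand-in for the hand rounding used above.

```python
from decimal import Decimal, localcontext


def solve_for_y(digits):
    """Eliminate x from system (1), rounding every operation to `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        # Unary plus rounds each datum to the current precision.
        a11, a12, b1 = [+Decimal(s) for s in ("0.1036", "0.2122", "0.7381")]
        a21, a22, b2 = [+Decimal(s) for s in ("0.2081", "0.4247", "0.9327")]
        alpha = a21 / a11                 # multiplier
        new_a22 = a22 - alpha * a12       # new coefficient of y
        new_b2 = b2 - alpha * b1          # new right-hand side
        return new_b2 / new_a22


for digits in (3, 4, 10):
    print(digits, solve_for_y(digits))
```

The results for three and four digits reproduce the values −547 and 343.9 found by hand, and ten digits gives a value close to 356.29.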
Figure 1.2 shows a geometric illustration of what can happen in solving two equations in two unknowns. The point of intersection of the two lines is the exact solution. As is shown by the dotted lines, there may be a degree of uncertainty from errors in the measurements or roundoff errors. So instead of a sharply defined point, there may be a small trapezoidal area containing many possible solutions. However, if the two lines are nearly parallel, then this area of possible solutions can increase dramatically! This is related to well-conditioned and ill-conditioned systems of linear equations, which are discussed more in Chapter 8.
In most computers, the arithmetic operations are carried out in a double-length accumulator that has twice the precision of the stored quantities. However, even this may not avoid a loss of accuracy! Loss of accuracy can happen in various ways such as from roundoff errors and subtracting nearly equal numbers. We shall discuss loss of precision in Section 1.4, and the solving of linear systems in Chapter 2.
Errors: Absolute and Relative
Suppose that α and β are two numbers, of which one is regarded as an approximation to the other. The error of β as an approximation to α is α − β; that is, the error equals the exact value minus the approximate value. The absolute error of β as an approximation to α is
|α − β|
The relative error of β as an approximation to α is
|α − β| / |α|
Notice that in computing the absolute error, the roles of α and β are the same, whereas in computing the relative error, it is essential to distinguish one of the two numbers as correct. (Observe that the relative error is undefined in the case α = 0.) For practical reasons, the relative error is usually more meaningful than the absolute error.
In summary, we have
Absolute Error = |Exact Value − Approximate Value|
Relative Error = |Exact Value − Approximate Value| / |Exact Value|
Here the exact value may be the true value or the best known value.
EXAMPLE 3
Let $\alpha_1 = 1.333$, $\beta_1 = 1.334$, and $\alpha_2 = 0.001$, $\beta_2 = 0.002$. What are the absolute errors and relative errors of $\beta_i$ as an approximation to $\alpha_i$?
Solution
The absolute error of $\beta_i$ as an approximation to $\alpha_i$ is the same in both cases, namely, $10^{-3}$. However, the relative errors are $\frac{3}{4} \times 10^{-3}$ and 1, respectively. The relative error clearly indicates that $\beta_1$ is a good approximation to $\alpha_1$, but that $\beta_2$ is a poor approximation to $\alpha_2$. ■
EXAMPLE 4
Consider $x = 0.00347$ rounded to $\tilde{x} = 0.0035$ and $y = 30.158$ rounded to $\tilde{y} = 30.16$. In each case, what are the number of significant digits, absolute errors, and relative errors? Interpret the results.
Solution
Case 1. $\tilde{x} = 0.35 \times 10^{-2}$ has two significant digits, absolute error $0.3 \times 10^{-4}$ and relative error $0.865 \times 10^{-2}$.
Case 2. $\tilde{y} = 0.3016 \times 10^{2}$ has four significant digits, absolute error $0.2 \times 10^{-2}$ and relative error $0.66 \times 10^{-4}$.
Clearly, the relative error is a better indication of the number of significant digits than the absolute error. ■
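The definitions of absolute and relative error translate directly into code. The following Python sketch uses function names of our own choosing and applies them to the data of the two examples above; it is meant only as an illustration of the definitions.

```python
def absolute_error(exact, approx):
    """|exact - approx|"""
    return abs(exact - approx)


def relative_error(exact, approx):
    """|exact - approx| / |exact|; undefined when exact == 0."""
    if exact == 0:
        raise ZeroDivisionError("relative error is undefined for an exact value of 0")
    return abs(exact - approx) / abs(exact)


# (exact value, approximate value) pairs taken from the examples above
pairs = [(1.333, 1.334), (0.001, 0.002), (0.00347, 0.0035), (30.158, 30.16)]
for exact, approx in pairs:
    print(exact, approx, absolute_error(exact, approx), relative_error(exact, approx))
```

The first two pairs have the same absolute error but relative errors that differ by three orders of magnitude, which is the point of the first example.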
Accuracy and Precision
Accurate to n decimal places means that you can trust n digits to the right of the decimal place. Accurate to n significant digits means that you can trust a total of n digits as being meaningful beginning with the leftmost nonzero digit.
Suppose you use a ruler graduated in millimeters to measure lengths. So the measurements are accurate to 1 millimeter, or 0.001m, which is three decimal places written in meters. A measurement such as 12.345m would be accurate to three decimal places. A measurement such as 12.34567 89m would be meaningless, since the ruler produces only three decimal places, and it should be 12.345m or 12.346m. If the measurement 12.345m has five dependable digits, then it is accurate to five significant figures. On the other hand, a measurement such as 0.076m has only two significant figures.
When using a calculator or computer in a laboratory experiment, one may get a false sense of having higher precision than is warranted by the data. For example, the result
(1.2) + (3.45) = 4.65
actually has only two significant digits of accuracy because the second digit in 1.2 may be the effect of rounding 1.24 down or rounding 1.16 up to two significant figures. Then the left-hand side could be as large as
(1.249) + (3.454) = (4.703)
or as small as
(1.16) + (3.449) = (4.609)
There are really only two significant digits in the answer! In adding and subtracting numbers, the result is accurate only to the smallest number of significant digits used in any step of the calculation. In the above example, the term 1.2 has two significant digits; therefore, the final calculation has an uncertainty in the third digit.
In multiplication and division of numbers, the results may be even more misleading. For instance, perform these computations on a calculator:
(1.23)(4.5) = 5.535, (1.23)/(4.5) = 0.27333 3333
You might think that there are four and nine significant digits in the results, but there are really only two! As a rule of thumb, one should keep as many significant digits in a sequence of calculations as there are in the least accurate number involved in the computations.
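One way to see why only two digits can be trusted is to ask how far the product and quotient could move if each factor were only known to the digits shown. The Python sketch below makes an explicit assumption that is ours, not the text's: each printed number is taken to be correct to within half a unit in its last decimal place, and both operands are assumed positive. The helper `rounding_interval` is hypothetical.

```python
def rounding_interval(text):
    """(low, high) for a decimal string assumed correct to half a unit in its last place."""
    value = float(text)
    decimals = len(text.split(".")[1]) if "." in text else 0
    half_unit = 0.5 * 10.0 ** (-decimals)
    return value - half_unit, value + half_unit


a_lo, a_hi = rounding_interval("1.23")   # roughly 1.225 to 1.235
b_lo, b_hi = rounding_interval("4.5")    # roughly 4.45 to 4.55

# For positive operands, the extreme products and quotients occur at the endpoints.
product_range = (a_lo * b_lo, a_hi * b_hi)
quotient_range = (a_lo / b_hi, a_hi / b_lo)
print(product_range, quotient_range)
```

Under this assumption the product lies between about 5.45 and 5.62, so the digits after the second in 5.535 carry no information.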
Rounding and Chopping
Rounding reduces the number of significant digits in a number. The result of rounding is a number similar in magnitude that is a shorter number having fewer nonzero digits. There are several slightly different rules for rounding. The round-to-even method is also known as statistician’s rounding or bankers’ rounding. It is discussed next. Over a large set of data, the round-to-even rule tends to reduce the total rounding error with (on average) an equal portion of numbers rounding up as well as rounding down.
We say that a number x is chopped to n digits or figures when all digits that follow the nth digit are discarded and none of the remaining n digits are changed. Conversely, x is rounded to n digits or figures when x is replaced by an n-digit number that approximates x with minimum error.
The question of whether to round up or down an (n + 1)-digit decimal number that ends with a 5 is best handled by always selecting the rounded n-digit number with an even nth digit. This may seem strange at first, but remarkably, this is essentially what computers do in rounding decimal calculations when using the standard floating-point arithmetic! (This is a topic discussed in Section 1.3.)
EXAMPLE 5
Give some examples of rounding three-decimal numbers to two digits.
Solution
The results of rounding are
0.217 ≈ 0.22, 0.365 ≈ 0.36, 0.475 ≈ 0.48, 0.592 ≈ 0.59
while chopping them gives
0.217 ≈ 0.21, 0.365 ≈ 0.36, 0.475 ≈ 0.47, 0.592 ≈ 0.59 ■
On computers, the user sometimes has the option to have all arithmetic operations done with either chopping or rounding. The latter is usually preferable, of course!
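The round-to-even and chopping rules can be tried out with Python's standard `decimal` module. This is a sketch, not a prescribed implementation: the helper names are ours, and the numbers are passed as strings because a binary floating-point value such as 0.365 is not stored exactly, so rounding the float directly may not follow the decimal rule.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

TWO_PLACES = Decimal("0.01")


def round_half_even(text):
    """Round a decimal string to two places, ties going to the even digit."""
    return Decimal(text).quantize(TWO_PLACES, rounding=ROUND_HALF_EVEN)


def chop(text):
    """Discard all digits after the second decimal place."""
    return Decimal(text).quantize(TWO_PLACES, rounding=ROUND_DOWN)


samples = ["0.217", "0.365", "0.475", "0.592"]
rounded = [str(round_half_even(s)) for s in samples]
chopped = [str(chop(s)) for s in samples]
print(rounded)
print(chopped)
```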
Computation of π
History of Computing π
Computing π has a rich and colorful history. In 1650 BCE, the Rhind Papyrus of ancient Egypt contained numerical algorithms such as this approximation
$$\pi \left(\frac{9}{2}\right)^{2} \approx 8^{2} \;\Rightarrow\; \pi \approx \frac{256}{81} \approx 3.1605$$
Archimedes (287–212 BCE) determined that
$$3.1408 \approx \frac{223}{71} < \pi < \frac{22}{7} \approx 3.1429$$
by noting that the value of π lies between the perimeters of a polygon inscribed in and a polygon circumscribed about a circle of radius one-half. Around 1700, John Machin discovered the identity
$$\pi = 16 \tan^{-1}\frac{1}{5} - 4 \tan^{-1}\frac{1}{239}$$
and calculated the first hundred digits of π. In 1973, the first million digits of π were determined. Since then, many more sophisticated techniques have been developed for computing π. (See Moler [2011], for a recent article.)
Nested Multiplication and Horner’s Algorithm
We begin with some remarks on evaluating a polynomial efficiently and on rounding and chopping real numbers. To evaluate the polynomial
$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} + a_n x^n \tag{2}$$
we group the terms in a nested multiplication:
$$p(x) = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x(a_n)) \cdots ))$$
The pseudocode‡ that evaluates p(x) starts with the innermost parentheses and works outward. It can be written as

    integer i, n; real p, x
    real array (a_i)_{0:n}
    p ← a_n
    for i = n − 1 to 0
        p ← a_i + x p
    end for
    Nested Multiplication Pseudocode

Here we assume that numerical values have been assigned to the integer variable n, the real variable x, as well as the coefficients $a_0, a_1, \ldots, a_n$, which are stored in a real linear array. (Throughout, we use semicolons between these declarative statements to save space.) The left-pointing arrow (←) means that the value on the right is stored in the location named on the left (i.e., overwrites from right to left). The for-loop index i runs backward, taking values n − 1, n − 2, ..., 0. The final value of p is the value of the polynomial at x. This nested multiplication procedure is also known as Horner's algorithm or synthetic division.

‡ A pseudocode is a compact and informal description of an algorithm that uses the conventions of a programming language but omits the detailed syntax. When convenient, it may be augmented with natural language. Usually, writing the pseudocode is a good way of organizing the ideas in a mathematical algorithm before coding them in a particular programming language.
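As an illustration only, the nested multiplication pseudocode might be transcribed into Python as follows. This is a sketch under the assumption that the coefficients are stored in a list `a` with `a[i]` holding $a_i$; the function name `horner` is ours, not the text's.

```python
def horner(a, x):
    """Evaluate a[0] + a[1]*x + ... + a[n]*x**n by nested multiplication.

    `a` must be a non-empty sequence of coefficients, lowest degree first.
    """
    p = a[-1]                        # p <- a_n
    for coeff in reversed(a[:-1]):   # i = n-1, n-2, ..., 0
        p = coeff + x * p            # p <- a_i + x p
    return p


if __name__ == "__main__":
    # p(x) = 5 + 3x - 7x^2 + 2x^3 evaluated at x = 2
    print(horner([5, 3, -7, 2], 2))
```

Each pass through the loop performs one multiplication and one addition, so a polynomial of degree n costs n of each.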
In the nested multiplication pseudocode, there is exactly one addition and one multiplication each time the loop is traversed. Consequently, Horner's algorithm can evaluate a polynomial with only n additions and n multiplications. This is the minimum number of operations possible. A naive method of evaluating a polynomial would require many more operations.

EXAMPLE 6
Show how $p(x) = 5 + 3x - 7x^2 + 2x^3$ should be computed.

Solution
Let
$$p(x) = 5 + x(3 + x(-7 + x(2)))$$
for a given value of x. We have avoided all the exponentiation operations by using nested multiplication! ■

The polynomial in Equation (2) can be written in an alternative form by utilizing the mathematical symbols for sum $\sum$ and product $\prod$; namely,
$$p(x) = \sum_{i=0}^{n} a_i x^i = \sum_{i=0}^{n} a_i \prod_{j=1}^{i} x$$
Recall that if n ≦ m, we write
$$\sum_{k=n}^{m} x_k = x_n + x_{n+1} + \cdots + x_m$$
and
$$\prod_{k=n}^{m} x_k = x_n x_{n+1} \cdots x_m$$
By convention, whenever m < n, we define
$$\sum_{k=n}^{m} x_k = 0 \qquad\text{and}\qquad \prod_{k=n}^{m} x_k = 1$$

Horner's algorithm can be used in the deflation of a polynomial. This is the process of removing a linear factor from a polynomial. If r is a root of the polynomial p, then x − r is a factor of p. The remaining roots of p are the n − 1 roots of a polynomial q of degree 1 less than the degree of p such that
$$p(x) = (x - r)\,q(x) + p(r) \tag{3}$$
where
$$q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1} \tag{4}$$
The pseudocode for Horner's algorithm can be written as follows:

    integer i, n; real r
    real array (a_i)_{0:n}, (b_i)_{0:n−1}
    b_{n−1} ← a_n
    for i = n − 1 to 0
        b_{i−1} ← a_i + r b_i
    end for
    Synthetic Division Pseudocode

Notice that $b_{-1} = p(r)$ in this pseudocode. If r is an exact root, then $b_{-1} = p(r) = 0$.

If the calculation in Horner's algorithm is to be carried out with pencil and paper, the following arrangement is often used:

          a_n       a_{n−1}      a_{n−2}     ...   a_1      a_0
    r)              r b_{n−1}    r b_{n−2}   ...   r b_1    r b_0
          ---------------------------------------------------------
          b_{n−1}   b_{n−2}      b_{n−3}     ...   b_0      b_{−1}

EXAMPLE 7
Use Horner's algorithm to evaluate p(3), where p is the polynomial
$$p(x) = x^4 - 4x^3 + 7x^2 - 5x - 2$$

Solution
We arrange the calculation as suggested:

          1    −4     7    −5    −2
    3)          3    −3    12    21
          ---------------------------
          1    −1     4     7    19

Thus, we obtain p(3) = 19, and we can write
$$p(x) = (x - 3)(x^3 - x^2 + 4x + 7) + 19$$ ■

In the deflation process, if r is a zero of the polynomial p, then x − r is a factor of p, and conversely. The remaining zeros of p are the n − 1 zeros of q(x).

EXAMPLE 8
Deflate the polynomial p of the preceding example, using the fact that 2 is one of its zeros.

Solution
We use the same arrangement of computations as explained previously:

          1    −4     7    −5    −2
    2)          2    −4     6     2
          ---------------------------
          1    −2     3     1     0

Thus, we have p(2) = 0, and
$$x^4 - 4x^3 + 7x^2 - 5x - 2 = (x - 2)(x^3 - 2x^2 + 3x + 1)$$ ■
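The synthetic division recurrence can be sketched in Python as shown here. The list layout (`a[i]` holds $a_i$, `b[i]` holds $b_i$) and the name `deflate` are assumptions made for this illustration; the remainder plays the role of $b_{-1}$.

```python
def deflate(a, r):
    """Synthetic division of p by (x - r).

    `a` lists the coefficients of p, lowest degree first (degree >= 1).
    Returns (b, remainder) with p(x) = (x - r) * q(x) + remainder, where
    q has coefficients b (lowest degree first) and remainder = p(r).
    """
    n = len(a) - 1
    b = [0] * n
    carry = a[n]                     # b_{n-1} <- a_n
    for i in range(n - 1, -1, -1):   # i = n-1, ..., 0
        b[i] = carry                 # store b_i
        carry = a[i] + r * carry     # b_{i-1} <- a_i + r b_i
    return b, carry                  # carry is b_{-1} = p(r)


if __name__ == "__main__":
    p = [-2, -5, 7, -4, 1]           # x^4 - 4x^3 + 7x^2 - 5x - 2
    print(deflate(p, 3))             # quotient and p(3), as in Example 7
    print(deflate(p, 2))             # quotient and p(2), as in Example 8
```

With the polynomial of Examples 7 and 8, the two calls reproduce the bottom rows of the pencil-and-paper arrangements.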
We can use Horner's algorithm for evaluating a derivative of a polynomial. By Equation (3), for any number r we can write the polynomial p as
$$p(x) = (x - r)\,q(x) + p(r)$$
where q(x) is given by Equation (4). If we differentiate, we obtain
$$p'(x) = q(x) + (x - r)\,q'(x)$$
Clearly, we have
$$p'(r) = q(r)$$
The pseudocode for Horner's algorithm for determining p(r) and p′(r) can be written as follows:

    integer i, n; real r, α, β
    real array (a_i)_{0:n}
    α ← a_n; β ← 0
    for i = n − 1 to 0
        β ← α + rβ
        α ← a_i + rα
    end for
    Horner's Algorithm Pseudocode

Notice that the final values are α = p(r) and β = p′(r).
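A possible Python rendering of this pseudocode is sketched below, again assuming the coefficients sit in a list with the lowest degree first. The names `alpha` and `beta` mirror α and β; the function name is hypothetical.

```python
def horner_with_derivative(a, r):
    """Return (p(r), p'(r)) for p with coefficients a, lowest degree first."""
    alpha = a[-1]                    # alpha <- a_n
    beta = 0                         # beta  <- 0
    for coeff in reversed(a[:-1]):   # i = n-1, ..., 0
        beta = alpha + r * beta      # update beta first: it needs the old alpha
        alpha = coeff + r * alpha
    return alpha, beta


if __name__ == "__main__":
    # p(x) = x^4 - 4x^3 + 7x^2 - 5x - 2 at r = 3
    print(horner_with_derivative([-2, -5, 7, -4, 1], 3))
```

The order of the two updates matters: β must be updated before α, because β accumulates q(r) from the values that α held on the previous pass.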
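If the calculation in Horner's algorithm (synthetic division) is to be carried out with pencil and paper, the following arrangement is often used:

          a_n       a_{n−1}      a_{n−2}     ...   a_2      a_1      a_0
    r)              r b_{n−1}    r b_{n−2}   ...   r b_2    r b_1    r b_0
          ------------------------------------------------------------------
          b_{n−1}   b_{n−2}      b_{n−3}     ...   b_1      b_0      b_{−1} = p(r)
    r)              r c_{n−2}    r c_{n−3}   ...   r c_1    r c_0
          ---------------------------------------------------------
          c_{n−2}   c_{n−3}      c_{n−4}     ...   c_0      c_{−1} = p′(r)

EXAMPLE 9
Given the polynomial
$$p(x) = x^4 - 4x^3 + 7x^2 - 5x - 2$$
use synthetic division to find p(3) and p′(3).

Solution
We use this arrangement of computation as explained previously:

          1    −4     7    −5    −2
    3)          3    −3    12    21
          ---------------------------
          1    −1     4     7    19 = p(3)
    3)          3     6    30
          ---------------------
          1     2    10    37 = p′(3)

Thus, we have
$$x^4 - 4x^3 + 7x^2 - 5x - 2 = (x - 3)(x^3 - x^2 + 4x + 7) + 19$$
So p(x) = (x − 3)q(x) + 19 and p′(x) = q(x) + (x − 3)q′(x), where $q(x) = x^3 - x^2 + 4x + 7$. Hence, we have p(3) = 19 and p′(3) = q(3) = 37. ■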
Pairs of Easy/Hard Problems
Sometimes in scientific computing, we encounter a pair of problems, one of which is easy and the other hard, and they are inverses of each other. This is the main idea in cryptology, in which multiplying two numbers together is trivial, but the reverse problem (factoring a huge number) verges on the impossible!
The same phenomenon arises with polynomials. Given the roots, we can easily find the power form of the polynomial as in (2). Given the power form, it may be a hard problem to compute the roots (or it may be an ill-conditioned problem). Computer Exercise 1.1.24 calls for the writing of code to compute the coefficients in the power form of a polynomial from its roots. It is a do-loop with simple formulas. One adjoins one factor (x − r) at a time. This theme arises again in linear algebra, in which computing b = Ax is trivial, but finding x from A and b (the inverse problem) is hard. (See Section 2.1.)
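The easy direction for polynomials, adjoining one factor (x − r) at a time, can be sketched as follows. This is one possible arrangement of the loop, not necessarily the one intended by the exercise, and the function name is hypothetical.

```python
def power_form_from_roots(roots):
    """Coefficients (lowest degree first) of (x - r1)(x - r2)...(x - rn)."""
    coeffs = [1]                           # start with the constant polynomial 1
    for r in roots:
        new = [0] * (len(coeffs) + 1)      # degree rises by one
        for j, c in enumerate(coeffs):
            new[j + 1] += c                # contribution of x * (c x^j)
            new[j] -= r * c                # contribution of -r * (c x^j)
        coeffs = new
    return coeffs


if __name__ == "__main__":
    # (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
    print(power_form_from_roots([1, 2, 3]))
```

Each root costs a single pass over the current coefficients, which is why this direction is regarded as the easy one.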
Easy/hard problems come up again in two-point boundary value problems. Finding Df, f(0), and f(1) when f is given and D is a differential operator is easy, but finding f from knowledge of Df, f(0), and f(1) is hard. (See Section 9.1.)
Likewise, computing the eigenvalues/eigenvectors of a matrix is a hard problem. Suppose we are given the eigenvalues λ1, λ2, ..., λn of an n × n matrix A and corresponding eigenvectors v1, v2, ..., vn. We can get A by putting the eigenvalues on the diagonal of a diagonal matrix D and the eigenvectors as columns in a matrix V. Then AV = VD, and we can get A from this by solving the equation for A; when V is invertible, $A = V D V^{-1}$. But finding λi and vi from A itself is hard. (See Section 8.2.)
The reader may think of other examples.
It needs to be pointed out that there may be a huge gap between the complexity of factoring integers and that of the other problems mentioned, which are probably all solvable in polynomial time. Also, there may be a cultural difference at work here. To a numerical analyst, hard usually means: “I can solve it in polynomial time, but the degree of the polynomial is too high,” whereas to a complexity theorist, hard means: “It is not solvable in polynomial time, as far as I know.” On the other hand, there are NP-hard problems with a numerical flavor such as the traveling salesman problem.
First Programming Experiment
We conclude this section with a short programming experiment involving numerical computations. Here we consider, from the computational point of view, a familiar operation in calculus, namely, taking the derivative of a function. Recall that the derivative of a function f at a point x is defined by the equation
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
A computer has the capacity of imitating the limit operation by using a sequence of numbers h such as
$$h = 4^{-1}, 4^{-2}, 4^{-3}, \ldots, 4^{-n}, \ldots$$
This sequence certainly approaches zero rapidly! Of course, many other simple sequences are possible, such as $1/n$, $1/n^2$, and $1/10^n$. The sequence $1/4^n$ consists of machine numbers in a binary computer and, for this experiment on a 32-bit computer, it is sufficiently close to zero when n is 10.
If f (x) = sin x, here is pseudocode for computing f ′(x) at the point x = 0.5:
    program First
    integer i, imin, n ← 30
    real error, y, x ← 0.5, h ← 1, emin ← 1
    for i = 1 to n
        h ← 0.25h
        y ← [sin(x + h) − sin(x)]/h
        error ← |cos(x) − y|; output i, h, y, error
        if error < emin then
            emin ← error; imin ← i
        end if
    end for
    output imin, emin
    end program First
    First Pseudocode
We have neither explained the purpose of the experiment nor shown the output from this pseudocode. We invite the reader to discover this by coding and running it (or one like it) on a computer. (See Computer Exercises 1.1.1–1.1.3.)
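For readers who want a starting point, here is one way the program First might be coded in Python. It is a sketch only: Python floats are double precision, so the numbers will differ from those of a single-precision run, and returning the rows as a list (as well as starting `imin` at 0) is our choice rather than part of the pseudocode.

```python
import math


def first(x=0.5, n=30):
    """Forward-difference estimates of the derivative of sin at x.

    Returns (rows, imin, emin), where each row is (i, h, y, error) and
    imin is the index of the smallest error emin seen in the loop.
    """
    h = 1.0
    emin, imin = 1.0, 0
    rows = []
    for i in range(1, n + 1):
        h = 0.25 * h
        y = (math.sin(x + h) - math.sin(x)) / h
        error = abs(math.cos(x) - y)       # cos is the exact derivative of sin
        rows.append((i, h, y, error))
        if error < emin:
            emin, imin = error, i
    return rows, imin, emin


if __name__ == "__main__":
    rows, imin, emin = first()
    for i, h, y, error in rows:
        print(f"{i:3d} {h:.6e} {y:.15f} {error:.6e}")
    print("imin =", imin, "emin =", emin)
```

Interpreting the printed table is left to the reader, in keeping with the invitation above.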
Mathematical Software
The algorithms and programming problems in this book have been coded and tested in a variety of ways, and they are available on the website for this book as given in the Preface. Some are best done by using a scientific programming language such as C, C++, Fortran, or any other that allows for calculations with adequate precision. Sometimes it is instructive to utilize mathematical software systems such as MATLAB, Maple, Mathematica, or Octave, since they contain built-in problem-solving procedures. Alternatively, one could use a mathematical program library such as IMSL, NAG, or others when locally available. Some numerical libraries may have been specifically optimized for particular processors, such as those from Intel and AMD. Software systems are particularly useful for obtaining graphical results as well as for experimenting with various numerical methods for solving a difficult problem. Mathematical software packages containing symbolic-manipulation capabilities, such as Maple, Mathematica, and Macsyma, are particularly useful for obtaining exact as well as numerical solutions. In solving the computer problems, students should focus on gaining insights and better understanding of the numerical methods involved. Appendix A offers advice on computer programming for scientific computations. The suggestions are independent of the particular language being used.
Websites
With the World Wide Web and the Internet, good mathematical software has become easy to locate and to use. Browsers, search engines, and websites may be used to find software that is applicable to a particular area of interest. Collections of mathematical software exist, ranging from large, comprehensive libraries to smaller versions of these libraries for personal computers; some of these may be interactive. Also, references to computer programs and collections of routines can be found in books and technical reports. The textbook website is given in the Preface. It contains an overview of mathematical software and other available supporting material.
Additional Reading
For additional study of topics found in this book, see the references in the Bibliography as well as an extensive list of items at the textbook website. Two interesting papers containing numerous examples of why numerical methods are critically important are Forsythe [1970] and McCartin [1998]. See Briggs [2004] and Friedman and Littman [1994] for some industrial and real-world problems.
Summary 1.1
• Absolute Error = |Exact Value − Approximate Value|
  Relative Error = |Exact Value − Approximate Value| / |Exact Value|
• Use nested multiplication to evaluate a polynomial efficiently:
$$p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} + a_n x^n = a_0 + x(a_1 + x(a_2 + \cdots + x(a_{n-1} + x(a_n)) \cdots ))$$
  A segment of pseudocode for doing this is

    p ← a_n
    for k = 1 to n
        p ← x p + a_{n−k}
    end for

• Deflation of the polynomial p(x) is removing a linear factor:
$$p(x) = (x - r)\,q(x) + p(r)$$
  where
$$q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1}$$
  The pseudocode for Horner's algorithm for deflation of a polynomial is:

    b_{n−1} ← a_n
    for i = n − 1 to 0
        b_{i−1} ← a_i + r b_i
    end for

  Hence, we obtain $b_{-1} = p(r)$.
Exercises 1.1∗

∗Exercises marked with ᵃ have answers in the back of the book.

1. In high school, some students have been misled to believe that 22/7 is either the actual value of π or an acceptable approximation to π. Show that 355/113 is a better approximation in terms of both absolute and relative errors. Find some other simple rational fractions n/m that approximate π. For example, ones for which $|\pi - n/m| < 10^{-9}$.
   Hint: See Exercise 1.1.4.

ᵃ2. A real number x is represented approximately by 0.6032, and we are told that the relative error is at most 0.1%. What is x?
   Note: There are two answers.

ᵃ3. What is the relative error involved in rounding 4.9997 to 5.000?

ᵃ4. The value of π can be generated by the computer to nearly full machine precision by the assignment statement
        pi ← 4.0 arctan(1.0)
   Suggest at least four other ways to compute π using basic functions on your computer system.

5. A given doubly subscripted array $(a_{ij})_{n \times n}$ can be added in any order. Write the pseudocode segments for each of the following parts. Which is best?
   ᵃa. $\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}$
   b. $\sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij}$
   c. $\sum_{i=1}^{n} \left[ \sum_{j=1}^{i} a_{ij} + \sum_{j=1}^{i-1} a_{ji} \right]$
   ᵃd. $\sum_{k=0}^{n-1} \sum_{|i-j|=k} a_{ij}$
   e. $\sum_{k=2}^{2n} \sum_{i+j=k} a_{ij}$

ᵃ6. Count the number of operations involved in evaluating a polynomial using nested multiplication. Do not count subscript calculations.

7. For small x, show that $(1 + x)^2$ can sometimes be more accurately computed from $(x + 2)x + 1$. Explain. What other expressions can be used to compute it?

8. Show how these polynomials can be efficiently evaluated:
   ᵃa. $p(x) = x^{32}$
   b. $p(x) = 3(x - 1)^5 + 7(x - 1)^9$
   ᵃc. $p(x) = 6(x + 2)^3 + 9(x + 2)^7 + 3(x + 2)^{15} - (x + 2)^{31}$
   d. $p(x) = x^{127} - 5x^{37} + 10x^{17} - 3x^7$

9. Using the exponential function exp(x), write an efficient pseudocode segment for the statement $y = 5e^{3x} + 7e^{2x} + 9e^{x} + 11$.

ᵃ10. Write a pseudocode segment to evaluate the expression
$$z = \sum_{i=1}^{n} b_i^{-1} \prod_{j=1}^{i} a_j$$
   where $(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$ are linear arrays containing given values.

11. Write segments of pseudocode to evaluate the following expressions efficiently:
   a. $p(x) = \sum_{k=0}^{n-1} k x^k$
   ᵃb. $z = \sum_{i=1}^{n} \prod_{j=1}^{i} x_{n-j+1}$
   c. $z = \prod_{i=1}^{n} \sum_{j=1}^{i} x_j$
   d. $p(t) = \sum_{i=1}^{n} a_i \prod_{j=1}^{i-1} (t - x_j)$

12. Using summation and product notation, write mathematical expressions for the following pseudocode segments:
   a. integer i, n; real v, x
      real array (a_i)_{0:n}
      v ← a_0
      for i = 1 to n
          v ← v + x a_i
      end for
   ᵃb. integer i, n; real v, x
      real array (a_i)_{0:n}
      v ← a_n
      for i = 1 to n
          v ← v x + a_{n−i}
      end for
   c. integer i, n; real v, x
      real array (a_i)_{0:n}
      v ← a_0
      for i = 1 to n
          v ← v x + a_i
      end for
   d. integer i, n; real v, x, z
      real array (a_i)_{0:n}
      v ← a_0
      z ← x
      for i = 1 to n
          v ← v + z a_i
          z ← x z
      end for
   ᵃe. integer i, n; real v
      real array (a_i)_{0:n}
      v ← a_n
      for i = 1 to n
          v ← (v + a_{n−i}) x
      end for

ᵃ13. Express in mathematical notation without parentheses the final value of z in the following pseudocode segment:
      integer k, n; real z
      real array (b_i)_{0:n}
      z ← b_n + 1
      for k = 1 to n − 2
          z ← z b_{n−k} + 1
      end for

ᵃ14. How many multiplications occur in executing the following pseudocode segment?
      integer i, j, n; real x
      real array (a_ij)_{0:n×0:n}, (b_ij)_{0:n×0:n}
      x ← 0.0
      for j = 1 to n
          for i = 1 to j
              x ← x + a_ij b_ij
          end for
      end for

15. Criticize the following pseudocode segments and write improved versions:
   a. integer i, n; real x, z; real array (a_i)_{0:n}
      for i = 1 to n
          x ← z^2 + 5.7
          a_i ← x/i
      end for
   ᵃb. integer i, j, n
      real array (a_ij)_{0:n×0:n}
      for i = 1 to n
          for j = 1 to n
              a_ij ← 1/(i + j − 1)
          end for
      end for
   c. integer i, j, n; real array (a_ij)_{0:n×0:n}
      for j = 1 to n
          for i = 1 to n
              a_ij ← 1/(i + j − 1)
          end for
      end for

16. a. Solve Example 1.1.2 to full precision.
   b. Repeat for this augmented matrix
$$\left[\begin{array}{cc|c} 3.5713 & 2.1426 & 7.2158 \\ 10.714 & 6.4280 & 1.3379 \end{array}\right]$$
      for a system of two equations and two unknowns x and y.
   c. Can small changes in the data lead to massive changes in each of these solutions?

17. A base 60 approximation (circa 1750 BCE) is
$$\sqrt{2} \approx 1 + \frac{24}{60} + \frac{51}{60^2} + \frac{10}{60^3}$$
   Determine how accurate it is. See Sauer [2006] for additional details.

18. Use Horner's algorithm to evaluate each of these polynomials at the point indicated.
   a. $2x^4 + 9x^2 - 16x + 12$ at −6
   b. $2x^4 - 3x^3 - 5x^2 + 3x + 8$ at 2
   c. $3x^5 - 38x^3 + 5x^2 - 1$ at 4

Computer Exercises 1.1

1. Write and run a computer program that corresponds to the pseudocode program First described in this section and interpret the results.

2. (Continuation) Select a function f and a point x and carry out a computer experiment like the one given in the text. Interpret the results. Do not select too simple a function. For example, you might consider $1/x$, $\log x$, $e^x$, $\tan x$, $\cosh x$, or $x^3 - 23x$.

3. (Continuation) As we saw in the computer experiment First, the accuracy of a formula for numerical differentiation may deteriorate as the step-size h decreases. Study the following central difference formula:
$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$
   as h → 0. We learn in Section 4.3 that the truncation error for this formula is $-\frac{1}{6} h^2 f'''(\xi)$ for some ξ in the interval (x − h, x + h). Modify and run the code for the experiment First so that approximate values for the rounding error and truncation error are computed. On the same graph, plot the rounding error, the truncation error, and the total error (sum of these two errors) using a log-scale; that is, the axes in the plot should be $-\log_{10} |\text{error}|$ versus $\log_{10} h$. Analyze these results.

ᵃ4. The limit
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$
   defines the number e in calculus. Estimate e by taking the value of this expression for $n = 8, 8^2, 8^3, \ldots, 8^{10}$. Compare with e obtained from e ← exp(1.0). Interpret the results.
5. It is not difficult to see that the numbers
$$p_n = \int_0^1 x^n e^x \, dx$$
   satisfy the inequalities $p_1 > p_2 > p_3 > \cdots > 0$. Establish this fact. Next, use integration by parts to show that
$$p_{n+1} = e - (n + 1)\,p_n$$
   and that $p_1 = 1$. In the computer, use the recurrence relation to generate the first 20 values of $p_n$ and explain why the inequalities shown are violated. Do not use subscripted variables. (See Dorn and McCracken [1972], pp. 120–129.)

6. (Continuation) Let $p_{20} = \frac{1}{8}$ and use the formula in the preceding computer problem to compute $p_{19}, p_{18}, \ldots, p_2$, and $p_1$. Do the numbers generated obey the inequalities $1 = p_1 > p_2 > p_3 > \cdots > 0$? Explain the difference in the two procedures. Repeat with $p_{20} = 20$ or $p_{20} = 100$. Explain what happens.

7. Write an efficient routine that accepts as input a list of real numbers $a_1, a_2, \ldots, a_n$ and then computes the following:
   Arithmetic mean: $m = \frac{1}{n} \sum_{k=1}^{n} a_k$
   Variance: $v = \frac{1}{n-1} \sum_{k=1}^{n} (a_k - m)^2$
   Standard deviation: $\sigma = \sqrt{v}$
   Test the routine on a set of data of your choice.

8. (Continuation) Show that another formula is
   Variance: $v = \frac{1}{n-1} \left[ \sum_{k=1}^{n} a_k^2 - n m^2 \right]$
   Of the two given formulas for v, which is more accurate in the computer? Verify on the computer with a data set.
   Hint: Use a large set of real numbers that vary in magnitude from very small to very large.

ᵃ9. Let $a_1$ be given. Write a program to compute for 1 ≦ n ≦ 1000 the numbers
$$b_n = n a_{n-1}, \qquad a_n = b_n / n$$
   Print the numbers $a_{100}, a_{200}, \ldots, a_{1000}$. Do not use subscripted variables. What should $a_n$ be? Account for the deviation of fact from theory. Determine four values for $a_1$ so that the computation does deviate from theory on your computer.
   Hint: Consider extremely small and large numbers and print to full machine precision.

ᵃ10. In a computer, it can happen that a + x = a when x ≠ 0. Explain why. Describe the set of n for which $1 + 2^{-n} = 1$ in your computer. Write and run appropriate programs to illustrate the phenomenon.

11. Write a program to test the programming suggestion concerning the roundoff error in the computation of t ← t + h versus t ← t_0 + ih. For example, use $h = \frac{1}{10}$ and compute t ← t + h in double precision for the correct single-precision value of t; print the absolute values of the differences between this calculation and the values of the two procedures. What is the result of the test when h is a machine number, such as $h = \frac{1}{128}$, on a binary computer (with more than seven bits per word)?

ᵃ12. The Russian mathematician P. L. Chebyshev (1821–1894) spelled his name Чебышев. Many transliterations from the Cyrillic to the Latin alphabet are possible. Cheb can alternatively be rendered as Ceb, Tscheb, or Tcheb. The y can be rendered as i. Shev can also be rendered as schef, cev, cheff, or scheff. Taking all combinations of these variants, program a computer to print all possible spellings.

13. Compute n! using logarithms, integer arithmetic, and double-precision floating-point arithmetic. For each part, print a table of values for 0 ≦ n ≦ 30, and determine the largest correct value.

14. Given two arrays, a real array $v = (v_1, v_2, \ldots, v_n)$ and an integer permutation array $p = (p_1, p_2, \ldots, p_n)$ of integers 1, 2, ..., n, can we form a new permuted array $v = (v_{p_1}, v_{p_2}, \ldots, v_{p_n})$ by overwriting v and not involving another array in memory? If so, write and test the code for doing it. If not, use an additional array and test. Consider these cases:
   Case 1. v = (6.3, 4.2, 9.3, 6.7, 7.8, 2.4, 3.8, 9.7), p = (2, 3, 8, 7, 1, 4, 6, 5)
   Case 2. v = (0.7, 0.6, 0.1, 0.3, 0.2, 0.5, 0.4), p = (3, 5, 4, 7, 6, 2, 1)

15. Using a computer algebra system (e.g., Maple, Mathematica, MATLAB, etc.), print 200 decimal digits of $\sqrt{10}$.

16. a. Repeat Example 1.1.1 on loss of significant digits of accuracy, but perform the calculations with twice the precision before rounding them. Does this help?
   b. Use Maple or some other mathematical software system in which you can set the number of digits of precision.
      Hint: In Maple, use Digits.
18
Chapter 1 Mathematical Preliminaries and Floating-Point Representation
17.
18.
In1706,Machinusedtheformula Whatisthepurposeofeachprogram?Isitachieved?Ex- π = 16arctan1−4arctan 1 plain. Code and run each one to verify your conclusions.
5 239 20. Consider some oversights involving assignment state- to compute 100 digits of π. Derive this formula. Repro- ments.
duce Machin’s calculations by using suitable software.
Hint: Let tan θ = 1 , and use standard trigonometric aa. 5
identities.
Using a symbol-manipulating program such as Maple, Mathematica, MATLAB, or Macsyma, carry out the following tasks. Record your work in some manner, for example, by using a diary or script command.
a. Find the Taylor series, up to and including the term x10, for the function (tan x)2, using 0 as the point x0.
b. Find the indefinite integral of (cos x )−4 .
c. Find the definite integral 01 log |log x | d x .
d. Findthefirstprimenumbergreaterthan27448.
e. Obtain the numerical value of 01 1 + sin3 x d x .
f. Findthesolutionofthedifferentialequationy′+y= (1+ex)−1.
g. Definethefunction f(x,y)=9×4−y4+2y2−1.You wanttoknowthevalueof f(40545,70226).Compute this in the straightforward way by direct substitution of x = 40545 and y = 70226 in the definition of
f (x, y), using first 6-decimal accuracy, then 7-, 8-, and so on up to 24-decimal digits of accuracy. Next, prove by means of elementary algebra that
f(x,y)=(3×2 −y2 +1)(3×2 +y2 −1)
Use this formula to compute the same value of f(x,y), again using different precisions, from 6- decimal to 24-decimal. Describe what you have learned. To force the program to do floating-point operations instead of integer arithmetic, write your
What is the difference between the following two assignment statements? Write a code that contains them and illustrate with specific examples to show that sometimes x = y and sometimes x ̸= y.
numbers in the form 9.0, 40545.0, and so forth. 19. Considerthefollowingpseudocodesegments:
b. Whatvaluedoesnreceive?
What happens when the last statement is replaced with the following?
n ← integer(x) + integer(y)
21. Writeacomputercodethatcontainsthefollowingassign-
ment statements exactly as shown. Analyze the results.
a. Print these values first using the default format and then with an extremely large format field:
real p,q,u,v,w,x,y,z x ← 0.1
y ← 0.01
z←x−y
p ← 1.0/3.0
q ← 3.0p
u ← 7.6
v ← 2.9
w←u−v
output x, y, z, p, q, u, v, w
a. integeri;realx,y,z fori =1to20
x ←2+1.0/8i
y ← arctan(x) − arctan(2) z ← 8i y
output x, y, z end for
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
integer m, n; real x, y x ← real(m/n)
y ← real(m)/real(n) output x, y
integer n; real x, y x ← 7.4
y ← 3.8
n←x+y
output n
b. realepsi←1 while 1 < 1 + epsi
epsi ← epsi/2
output epsi end while
b. Whatvalueswouldbecomputedforx,y,andzifthis code is used?
integer n; real x, y, z for n = 1 to 10
x ←(n−1)/2 y ← n2/3.0
z ← 1.0 + 1/n output x, y, z
end for
c. What values would the following assignment state- ments produce?
d. Discuss what is wrong with the following pseudocode segment:

    real area, circum, radius
    radius ← 1
    area ← (22/7)(radius)^2
    circum ← 2(3.1416)radius
    output area, circum

22. Criticize the following pseudocode for evaluating $\lim_{x \to 0} \arctan(|x|)/x$. Code and run it to see what happens.

    integer i; real x, y
    x ← 1
    for i = 1 to 24
        x ← x/2.0
        y ← arctan(|x|)/x
        output x, y
    end for

23. Carry out some computer experiments to illustrate or test the programming suggestions in Appendix A. Specific topics to include are these:
a. When to avoid arrays.
b. When to limit iterations.
c. Checking for floating-point equality.
d. Ways for taking equal floating-point steps.
e. Various ways to evaluate functions.
Hint: Comparing single and double precision results may be helpful.

24. (Easy/Difficult Problem Pairs) Write a computer program to obtain the power form of a polynomial from its roots. Let the roots be $r_1, r_2, \ldots, r_n$. Then (except for a scalar factor) the polynomial is the product
$$p(x) = (x - r_1)(x - r_2) \cdots (x - r_n)$$
Find the coefficients in the expression
$$p(x) = \sum_{j=0}^{n} a_j x^j$$
Test your code on the Wilkinson polynomials in Computer Exercises 3.1.10 and 3.3.9. Explain why this task of getting the power form of the polynomial is trivial, whereas the inverse problem of finding the roots from the power form is quite difficult.

25. A prime number is a positive integer that has no integer factors other than itself and 1. How many prime numbers are there in each of these open intervals: (1, 40), (1, 80), (1, 160), and (1, 2000)? Make a guess as to the percentage of prime numbers among all numbers.

26. Mathematical software systems such as Maple, Mathematica, and MATLAB are able to do both numerical calculations and symbolic manipulations. Verify symbolically that a nested multiplication is correct for a general polynomial of degree ten.

27. In MATLAB, the rat function finds a rational fraction approximation (numerator and denominator) within a certain tolerance to a given floating-point number. For example, [a,b]=rat(pi, 8000e-6) returns a=22 and b=7. However, the relative error between 19/6 and π is 0.007981306248670 in format long, which is less than the tolerance 0.008. What's going on here? In terms of absolute and relative errors, is 19/6 or 22/7 the better approximation to π?

28. Use mathematical software to reproduce the three solutions to Example 1.1.2.
Hint: In MATLAB, use commands str2num(num2str(x,4)) for rounding to four significant decimal digits as well as format long.

29. Explain the results from coding and executing the following pseudocode using mathematical software such as MATLAB with format long:

    integer i, j; real c, f, x, half
    x ← 10/3
    i ← integer(x + 1/2)
    half ← 1/2
    j ← integer(half)
    c ← (5/9)(f − 32)
    f ← 9/5c + 32
    output x, i, half, j, c, f
    integer k; real dt, s, t
    t ← 0; s ← 1
    dt ← 0.1
    for k = 1 to 10
        t ← t + dt
        s ← s ∗ dt
        output k, t, s
    end for

Hint: Print results with a very large number of decimal places.

30. By plotting ln x and ln[(1 + x)/(1 − x)], show that they both contain the point ln 2. Are there other values that match up?

1.2 Mathematical Preliminaries

Most students have encountered infinite series (particularly Taylor series) in their study of calculus without necessarily having acquired a good understanding of this topic. Consequently, this section is particularly important for numerical analysis and deserves careful study.

Once students are well grounded with a basic understanding of Taylor series, the Mean-Value Theorem, and alternating series (all topics in this section) as well as computer number representation (Section 1.3), they can proceed to study the fundamentals of numerical methods with better comprehension. Well-prepared students may wish to skip over some of this material.

Taylor Series

Familiar (and useful) examples of Taylor series are the following:

Common Taylor series
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!} \qquad (|x| < \infty) \tag{1}$$
$$\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k+1}}{(2k+1)!} \qquad (|x| < \infty) \tag{2}$$
$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots = \sum_{k=0}^{\infty} (-1)^k \frac{x^{2k}}{(2k)!} \qquad (|x| < \infty) \tag{3}$$
$$\frac{1}{1-x} = 1 + x + x^2 + x^3 + \cdots = \sum_{k=0}^{\infty} x^k \qquad (|x| < 1) \tag{4}$$
$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{x^k}{k} \qquad (-1 < x \le 1) \tag{5}$$

For each case, the series represents the given function and converges in the interval specified. Series (1)–(5) are Taylor series expanded about c = 0.
A Taylor series expanded about c = 1 is
$$\ln(x) = (x-1) - \frac{(x-1)^2}{2} + \frac{(x-1)^3}{3} - \cdots = \sum_{k=1}^{\infty} (-1)^{k-1} \frac{(x-1)^k}{k}$$
where 0 < x ≦ 2. The reader should recall the factorial notation
$$n! = 1 \cdot 2 \cdot 3 \cdot 4 \cdots n$$
for n ≧ 1 and the special definition of 0! = 1.

Series of this type are often used to compute good approximate values of complicated functions at specific points.

EXAMPLE 1
Use five terms in the ln(1 + x) series (5) to approximate ln(1.1).

Solution
Taking x = 0.1 in the first five terms of the series for ln(1 + x) gives us
$$\ln(1.1) \approx 0.1 - \frac{0.01}{2} + \frac{0.001}{3} - \frac{0.0001}{4} + \frac{0.00001}{5} = 0.09531\,03333\ldots$$
where ≈ means approximately equal. This value is correct to six decimal places of accuracy. ■

On the other hand, such good results are not always obtained in using series.

EXAMPLE 2
Try to compute $e^8$ by using the $e^x$ series (1).

Solution
The result is
$$e^8 = 1 + 8 + \frac{64}{2} + \frac{512}{6} + \frac{4096}{24} + \frac{32768}{120} + \cdots$$
It is apparent that many terms are needed to compute $e^8$ with reasonable precision. By repeated squaring, we find $e^2 = 7.389056$, $e^4 = 54.5981500$, and $e^8 = 2980.957987$. The first six terms given yield 570.06666 5. ■

These examples illustrate a general rule:

Rule of Thumb
A Taylor series converges rapidly near the point of expansion and slowly (or not at all) at more remote points.

A graphical depiction of the phenomenon can be obtained by graphing a few partial sums of a Taylor series. In Figure 1.3, we show the function
$$y = \sin x$$
and the partial-sum functions

Partial Summations for sin x
$$S_1 = x, \qquad S_3 = x - \frac{x^3}{6}, \qquad S_5 = x - \frac{x^3}{6} + \frac{x^5}{120}$$
which come from the sin x series (2). While $S_1$ may be an acceptable approximation to sin x when x ≈ 0, the graphs for $S_3$ and $S_5$ match that of sin x on larger intervals about the origin.

FIGURE 1.3 Approximations to sin x (graphs of sin x, $S_1$, $S_3$, and $S_5$ for −3 ≦ x ≦ 3)
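The contrast between Examples 1 and 2 can be checked numerically. The following Python sketch sums the leading terms of series (5) at x = 0.1 and of series (1) at x = 8 and compares each partial sum with the library value. The function names `ln1p_partial` and `exp_partial` are illustrative choices, not names used in the text.

```python
import math

def ln1p_partial(x, n):
    """Sum of the first n terms of series (5) for ln(1 + x)."""
    return sum((-1) ** (k - 1) * x ** k / k for k in range(1, n + 1))

def exp_partial(x, n):
    """Sum of the first n terms of series (1) for e**x."""
    return sum(x ** k / math.factorial(k) for k in range(n))

if __name__ == "__main__":
    # Example 1: five terms, evaluated close to the expansion point c = 0.
    approx = ln1p_partial(0.1, 5)
    print("ln(1.1):", approx, "error:", abs(approx - math.log(1.1)))

    # Example 2: six terms, evaluated far from the expansion point.
    approx = exp_partial(8.0, 6)
    print("e^8:   ", approx, "error:", abs(approx - math.exp(8.0)))
```

Near the point of expansion five terms already give six correct decimal places, whereas at x = 8 six terms are still far from the true value, in line with the rule of thumb.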
All of the series illustrated here are examples of the following general series:

■ Theorem 1: Formal Taylor Series for f about c
$$f(x) \sim f(c) + f'(c)(x-c) + \frac{f''(c)}{2!}(x-c)^2 + \frac{f'''(c)}{3!}(x-c)^3 + \cdots$$
$$f(x) \sim \sum_{k=0}^{\infty} \frac{f^{(k)}(c)}{k!}(x-c)^k \tag{6}$$

Here, rather than using =, we have written ∼ to indicate that we are not allowed to assume that f(x) equals the series on the right. All we have at the moment is a formal series that can be written down provided that the successive derivatives f′, f″, f‴, . . . exist at the point c. Series (6) is called the Taylor series of f at the point c.

In the special case c = 0, the formal Taylor series (6) is also called a Maclaurin series:
$$f(x) \sim f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \cdots$$
$$f(x) \sim \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}x^k \tag{7}$$
The first term is f(0) when k = 0.

EXAMPLE 3
What is the Taylor series of the function
$$f(x) = 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5$$
at the point c = 2?

Solution
To compute the coefficients in the series, we need the numerical values of $f^{(k)}(2)$ for k ≧ 0. Here are the details of the computation:
$$\begin{aligned}
f(x) &= 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5 &&\Rightarrow& f(2) &= 207 \\
f'(x) &= 15x^4 - 8x^3 + 45x^2 + 26x - 12 &&\Rightarrow& f'(2) &= 396 \\
f''(x) &= 60x^3 - 24x^2 + 90x + 26 &&\Rightarrow& f''(2) &= 590 \\
f'''(x) &= 180x^2 - 48x + 90 &&\Rightarrow& f'''(2) &= 714 \\
f^{(4)}(x) &= 360x - 48 &&\Rightarrow& f^{(4)}(2) &= 672 \\
f^{(5)}(x) &= 360 &&\Rightarrow& f^{(5)}(2) &= 360 \\
f^{(k)}(x) &= 0 &&\Rightarrow& f^{(k)}(2) &= 0 \qquad (k \ge 6)
\end{aligned}$$
Therefore, we have
$$f(x) \sim 207 + 396(x-2) + 295(x-2)^2 + 119(x-2)^3 + 28(x-2)^4 + 3(x-2)^5$$
In this example, it is not difficult to see that ∼ may be replaced by =. Simply expand all the terms in the Taylor series and collect them to get the original form for f. Taylor's Theorem, discussed soon, allows us to draw this conclusion without doing any work! ■
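The coefficients in Example 3 can be reproduced directly from formula (6) by differentiating the coefficient list repeatedly. The sketch below is one minimal way to do this in Python; the helper name `taylor_coeffs` and the ascending-power coefficient layout are assumptions made for illustration.

```python
from math import factorial

def taylor_coeffs(a, c):
    """Return [f(c), f'(c)/1!, f''(c)/2!, ...] for f(x) = sum a[j] * x**j."""
    coeffs = []
    deriv = list(a)          # coefficients of the current derivative
    k = 0
    while deriv:
        value = sum(d * c ** j for j, d in enumerate(deriv))
        coeffs.append(value / factorial(k))
        # Differentiate: j * a_j becomes the coefficient of x**(j-1).
        deriv = [j * d for j, d in enumerate(deriv)][1:]
        k += 1
    return coeffs

if __name__ == "__main__":
    # f(x) = 3x^5 - 2x^4 + 15x^3 + 13x^2 - 12x - 5, listed from x^0 upward.
    print(taylor_coeffs([-5, -12, 13, 15, -2, 3], 2))
```

This mirrors the hand computation term by term; it is not the most economical route for polynomials.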
It is interesting to show how well we can approximate a function f(x) at a point x = a by taking only a few terms of the Maclaurin series (7). We illustrate with three cases. With only the first term, the function is assumed to be a constant: f(a) ≈ f(0). With two terms, the slope at 0 is taken into account by way of the straight line from f(0) to f(a); namely, $f(a) \approx f(0) + f'(0)x$ when x = a. With three terms, the curvature due to f″(0) comes into play and we obtain a parabolic curve: $f(a) \approx f(0) + f'(0)x + \frac{1}{2} f''(0)x^2$ when x = a. Each additional term improves the accuracy in the approximation for f(a). In Figure 1.4, we show these partial sums in the Maclaurin series (7) when $f(x) = e^x$ and x = a = 1.
FIGURE 1.4 Approximations to $e^x$ (graphs of $f = e^x$, $f_0 = 1$, $f_1 = 1 + x$, and $f_2 = 1 + x + x^2/2$ for 0 ≦ x ≦ 1; the values marked at x = 1 are 1, 2, 2.5, and 2.7183)
Complete Horner’s Algorithm
An application of Horner's algorithm is that of finding the Taylor expansion of a polynomial about any point. Let p(x) be a given polynomial of degree n with coefficients $a_k$ as in (2) in Section 1.1, and suppose that we desire the coefficients $c_k$ in the equation
$$\begin{aligned}
p(x) &= a_n x^n + a_{n-1} x^{n-1} + \cdots + a_0 \\
     &= c_n (x-r)^n + c_{n-1}(x-r)^{n-1} + \cdots + c_1 (x-r) + c_0
\end{aligned}$$
Of course, Taylor's Theorem asserts that $c_k = p^{(k)}(r)/k!$, but we seek a more efficient algorithm. Notice that $p(r) = c_0$, so this coefficient is obtained by applying Horner's algorithm to the polynomial p with the point r. The algorithm also yields the polynomial
$$q(x) = \frac{p(x) - p(r)}{x - r} = c_n (x-r)^{n-1} + c_{n-1}(x-r)^{n-2} + \cdots + c_1$$
This shows that the second coefficient, $c_1$, can be obtained by applying Horner's algorithm to the polynomial q with point r, because $c_1 = q(r)$. Notice that the first application of Horner's algorithm does not yield q in the form shown, but rather as a sum of powers of x. (See (3)–(4) in Section 1.1.) This process is repeated until all coefficients $c_k$ are found.

We call the algorithm just described the complete Horner's algorithm. The pseudocode for executing it is arranged so that the coefficients $c_k$ overwrite the input coefficients $a_k$. In the inner loop, the index j runs downward from n − 1 to k.

Complete Horner's Algorithm Pseudocode

    integer n, k, j; real r; real array (a_i)_{0:n}
    for k = 0 to n − 1
        for j = n − 1 to k
            a_j ← a_j + r a_{j+1}
        end for
    end for
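As a concrete rendering of the pseudocode, here is a minimal Python sketch. It assumes the coefficients are stored in ascending order (`a[i]` multiplies `x**i`), and it returns a new list rather than overwriting the input; the function name `complete_horner` is an illustrative choice.

```python
def complete_horner(a, r):
    """Return c with p(x) = sum c[k] * (x - r)**k, given p(x) = sum a[i] * x**i."""
    c = list(a)              # work on a copy; c[k] ends up holding the Taylor coefficient
    n = len(c) - 1           # degree of the polynomial
    for k in range(n):
        # Inner index runs downward from n - 1 to k, as in the pseudocode.
        for j in range(n - 1, k - 1, -1):
            c[j] += r * c[j + 1]
    return c

if __name__ == "__main__":
    # Polynomial of Example 3, expanded about r = 2.
    print(complete_horner([-5, -12, 13, 15, -2, 3], 2))
```

Applied to the polynomial of Example 3 with r = 2, the routine returns the coefficients 207, 396, 295, 119, 28, 3 found there by differentiation, using only additions and multiplications.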
This procedure can be used in carrying out Newton's method for finding roots of a polynomial, which we discuss in Chapter 3. Moreover, it can be done in complex arithmetic to handle polynomials with complex roots or coefficients.

EXAMPLE 4
Using the complete Horner's algorithm, find the Taylor expansion of the polynomial
$$p(x) = x^4 - 4x^3 + 7x^2 - 5x + 2$$
about the point r = 3.

Solution
The work can be arranged as follows:

    3)   1   −4    7   −5    2
              3   −3   12   21
         1   −1    4    7   23
              3    6   30
         1    2   10   37
              3   15
         1    5   25
              3
         1    8

The calculation shows that
$$p(x) = (x-3)^4 + 8(x-3)^3 + 25(x-3)^2 + 37(x-3) + 23$$
■

Taylor's Theorem in Terms of (x − c)

In practical computations with Taylor series, it is usually necessary to truncate the series because it is not possible to carry out an infinite number of additions. A series is said to be truncated if we ignore all terms after a certain point. Thus, if we truncate the exponential series (1) after seven terms, the result is

$e^x$ Series
$$e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \frac{x^6}{6!}$$
This no longer represents $e^x$ except when x = 0. But the truncated series should approximate $e^x$. Here is where we need Taylor's Theorem. With its help, we can assess the difference between a function f and its truncated Taylor series.

■ Theorem 2: Taylor's Theorem for f(x)
If the function f possesses continuous derivatives of orders 0, 1, 2, . . . , (n + 1) in a closed interval I = [a, b], then for any c and x in I,
$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!}(x-c)^k + E_{n+1} \tag{8}$$
where the error term $E_{n+1}$ can be given in the form
$$E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-c)^{n+1}$$
Here ξ is a point that lies between c and x and depends on both.

The explicit assumption in this theorem is that $f(x), f'(x), f''(x), \ldots, f^{(n+1)}(x)$ are all continuous functions in the interval I = [a, b]. The final term $E_{n+1}$ in (8) is the remainder or error term. The given formula for $E_{n+1}$ is valid when we assume only that $f^{(n+1)}$ exists at each point of the open interval (a, b). The error term is similar to the terms preceding it, but notice that $f^{(n+1)}$ must be evaluated at a point other than c. This point ξ depends on x and is in the open interval (c, x) or (x, c). Other forms of the remainder are possible; the one given here is Lagrange's form. (We do not prove Taylor's Theorem here.)

EXAMPLE 5
Derive the Taylor series for $e^x$ at c = 0, and prove that it converges to $e^x$ by using Taylor's Theorem.

Solution
If $f(x) = e^x$, then $f^{(k)}(x) = e^x$ for k ≧ 0. Therefore, $f^{(k)}(c) = f^{(k)}(0) = e^0 = 1$ for all k. From (8), we have
$$e^x = \sum_{k=0}^{n} \frac{x^k}{k!} + \frac{e^{\xi}}{(n+1)!} x^{n+1} \tag{9}$$
Now let us consider all the values of x in some symmetric interval around the origin, for example, −s ≦ x ≦ s. Then |x| ≦ s, |ξ| ≦ s, and $e^{\xi} \le e^{s}$. Hence, the remainder term satisfies this inequality:
$$\lim_{n\to\infty} \left| \frac{e^{\xi}}{(n+1)!} x^{n+1} \right| \le \lim_{n\to\infty} \frac{e^{s}}{(n+1)!} s^{n+1} = 0$$
Thus, if we take the limit as n → ∞ on both sides of (9), we obtain
$$e^x = \lim_{n\to\infty} \sum_{k=0}^{n} \frac{x^k}{k!} = \sum_{k=0}^{\infty} \frac{x^k}{k!}$$
■

This example illustrates how we can establish, in specific cases, that a formal Taylor series (6) actually represents the function. Let's examine another example to see how the formal series can fail to represent the function.

EXAMPLE 6
Derive the formal Taylor series for f(x) = ln(1 + x) at c = 0, and determine the range of positive x for which the series represents the function.

Solution
We need $f^{(k)}(x)$ and $f^{(k)}(0)$ for k ≧ 1. Here is the work:
$$\begin{aligned}
f(x) &= \ln(1+x) &&\Rightarrow& f(0) &= 0 \\
f'(x) &= (1+x)^{-1} &&\Rightarrow& f'(0) &= 1 \\
f''(x) &= -(1+x)^{-2} &&\Rightarrow& f''(0) &= -1 \\
f'''(x) &= 2(1+x)^{-3} &&\Rightarrow& f'''(0) &= 2 \\
f^{(4)}(x) &= -6(1+x)^{-4} &&\Rightarrow& f^{(4)}(0) &= -6 \\
&\ \ \vdots &&& &\ \ \vdots \\
f^{(k)}(x) &= (-1)^{k-1}(k-1)!\,(1+x)^{-k} &&\Rightarrow& f^{(k)}(0) &= (-1)^{k-1}(k-1)!
\end{aligned}$$
Hence by Taylor's Theorem, we obtain
$$\begin{aligned}
\ln(1+x) &= \sum_{k=1}^{n} (-1)^{k-1} \frac{(k-1)!}{k!} x^k + \frac{(-1)^n n!\,(1+\xi)^{-n-1}}{(n+1)!} x^{n+1} \\
&= \sum_{k=1}^{n} (-1)^{k-1} \frac{x^k}{k} + \frac{(-1)^n}{n+1} (1+\xi)^{-n-1} x^{n+1}
\end{aligned} \tag{10}$$
For the infinite series to represent ln(1 + x), it is necessary and sufficient that the error term converge to zero as n → ∞. Assume that 0 ≦ x ≦ 1. Then 0 ≦ ξ ≦ x (because zero is the point of expansion); thus, 0 ≦ x/(1 + ξ) ≦ 1. Hence, the error term converges to zero in this case. If x > 1, the terms in the series do not approach zero, and the series does not converge. Hence, the series represents ln(1 + x) if 0 ≦ x ≦ 1, but not if x > 1. (The series also represents ln(1 + x) for −1
notice that the function $f(x) = \sqrt{x}$ possesses derivatives of all orders at any point x > 0. In (12), let $h = 10^{-5}$. Then
$$\sqrt{1.00001} \approx 1 + 0.5 \times 10^{-5} - 0.125 \times 10^{-10} = 1.00000\,49999\,87500$$
By substituting −h for h in the series, we obtain
$$\sqrt{1-h} = 1 - \frac{1}{2}h - \frac{1}{8}h^2 - \frac{1}{16}h^3 \xi^{-5/2}$$
Hence, we have
$$\sqrt{0.99999} \approx 0.99999\,49999\,87500$$
Since 1 < ξ < 1 + h, the absolute error does not exceed
$$\frac{1}{16} h^3 \xi^{-5/2} < \frac{1}{16} 10^{-15} = 0.00000\,00000\,00000\,0625$$
and both numerical values are correct to all 15 decimal places shown. ■
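Double precision carries only about 16 significant digits, so checking these two values requires extended precision. The following sketch uses Python's standard `decimal` module with 40 significant digits, an arbitrary choice for illustration; the variable names are likewise assumptions.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40                  # work with 40 significant digits

h = Decimal("1e-5")
approx_plus = 1 + h / 2 - h * h / 8     # truncated series for sqrt(1 + h)
approx_minus = 1 - h / 2 - h * h / 8    # truncated series for sqrt(1 - h)

# Compare against square roots computed to the working precision.
err_plus = abs((1 + h).sqrt() - approx_plus)
err_minus = abs((1 - h).sqrt() - approx_minus)

print(approx_plus, err_plus)
print(approx_minus, err_minus)
```

Both errors come out below $10^{-16}$, consistent with the claim that the 15 decimal places shown are correct.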
Alternating Series
Another theorem from calculus is often useful in establishing the convergence of a series and in estimating the error involved in truncation. From it, we have the following important principle for alternating series:
Alternating Series Principle
Principle. If the magnitudes of the terms in an alternating series converge monotonically to zero, then the error in truncating the series is no larger than the magnitude of the first omitted term.

This theorem applies only to alternating series, that is, series in which the successive terms are alternately positive and negative.

■ Theorem 4: Alternating Series Theorem
If $a_1 \ge a_2 \ge \cdots \ge a_n \ge \cdots \ge 0$ for all n and $\lim_{n\to\infty} a_n = 0$, then the alternating series
$$a_1 - a_2 + a_3 - a_4 + \cdots$$
converges; that is,
$$\sum_{k=1}^{\infty} (-1)^{k-1} a_k = \lim_{n\to\infty} \sum_{k=1}^{n} (-1)^{k-1} a_k = \lim_{n\to\infty} S_n = S$$
where S is its sum and $S_n$ is the nth partial sum. Moreover, for all n,
$$|S - S_n| \le a_{n+1}$$

EXAMPLE 8
If the sine series is to be used in computing sin 1 with an error less than $\frac{1}{2} \times 10^{-6}$, how many terms are needed?

Solution
From Series (2), we have
$$\sin 1 = 1 - \frac{1}{3!} + \frac{1}{5!} - \frac{1}{7!} + \cdots$$
If we stop at 1/(2n − 1)!, the error does not exceed the first neglected term, which is 1/(2n + 1)!. Thus, we should select n so that
$$\frac{1}{(2n+1)!} < \frac{1}{2} \times 10^{-6}$$
Using logarithms to base 10, we obtain log(2n + 1)! > log 2 + 6 = 6.3. With a calculator, we compute a table of values for log n! and find that log 10! ≈ 6.6. Hence, if n ≧ 5, the error is acceptable. ■
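The count in Example 8 can be confirmed with a short search. This Python sketch finds the smallest n whose first neglected term 1/(2n + 1)! is below the tolerance and then checks the actual error of the n-term partial sum against `math.sin`. The function names are illustrative.

```python
import math

def terms_needed_sin1(tol):
    """Smallest n such that the first neglected term 1/(2n+1)! is below tol."""
    n = 1
    while 1.0 / math.factorial(2 * n + 1) >= tol:
        n += 1
    return n

def sin1_partial(n):
    """Sum of the first n terms of the sine series (2) at x = 1."""
    return sum((-1) ** k / math.factorial(2 * k + 1) for k in range(n))

if __name__ == "__main__":
    tol = 0.5e-6
    n = terms_needed_sin1(tol)
    print("terms:", n, "actual error:", abs(math.sin(1.0) - sin1_partial(n)))
```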
EXAMPLE 9
If the logarithmic series (5) is to be used for computing ln 2 with an error of less than $\frac{1}{2} \times 10^{-6}$, how many terms are required?

Solution
To compute ln 2, we take x = 1 in the series, and using ≈ to mean approximate equality, we have
$$S = \ln 2 \approx 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots + \frac{(-1)^{n-1}}{n} = S_n$$
By the Alternating Series Theorem, the error involved when the series is truncated with n terms is
$$|S - S_n| \le \frac{1}{n+1}$$
We select n so that
$$\frac{1}{n+1} < \frac{1}{2} \times 10^{-6}$$
Hence, more than two million terms would be needed! We conclude that this method of computing ln 2 is not practical. (See Exercises 1.2.10–1.2.12 for several good alternatives.) ■

Caution
A word of caution is needed about this technique of calculating the number of terms to be used in a series by just making the (n + 1)st term less than some tolerance. This procedure is valid only for alternating series in which the terms decrease in magnitude to zero, although it is occasionally used to get rough estimates in other cases. For example, it can be used to identify a nonalternating series as one that converges slowly. When this technique cannot be used, a bound on the remaining terms of the series has to be established. Determining such a bound may be somewhat difficult.

EXAMPLE 10
It is known that
$$\frac{\pi^4}{90} = 1^{-4} + 2^{-4} + 3^{-4} + \cdots$$
How many terms should we take to compute $\pi^4/90$ with an error of at most $\frac{1}{2} \times 10^{-6}$?

Solution
A naive approach is to take
$$1^{-4} + 2^{-4} + 3^{-4} + \cdots + n^{-4}$$
where n is chosen so that the next term, $(n+1)^{-4}$, is less than $\frac{1}{2} \times 10^{-6}$. This value of n is 37, but this is an erroneous answer because the partial sum
$$S_{37} = \sum_{k=1}^{37} k^{-4}$$
differs from $\pi^4/90$ by approximately $6 \times 10^{-6}$. What we should do, of course, is to select n so that all the omitted terms add up to less than $\frac{1}{2} \times 10^{-6}$; that is,
$$\sum_{k=n+1}^{\infty} k^{-4} < \frac{1}{2} \times 10^{-6}$$
By a technique familiar from calculus (see Figure 1.5), we have
$$\sum_{k=n+1}^{\infty} k^{-4} < \int_{n}^{\infty} x^{-4}\,dx = \left. \frac{x^{-3}}{-3} \right|_{n}^{\infty} = \frac{1}{3n^3}$$

FIGURE 1.5 Illustrating Example 10 (graph of $y = x^{-4}$ with rectangles of heights $(n+1)^{-4}$, $(n+2)^{-4}$, $(n+3)^{-4}$, etc., starting at x = n, n + 1, n + 2, n + 3)

Thus, it suffices to select n so that $(3n^3)^{-1} < \frac{1}{2} \times 10^{-6}$, or n ≧ 88. (A more sophisticated analysis improves this considerably.) ■
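The term counts in Examples 9 and 10 can be checked with a few lines of Python. The sketch below contrasts the naive next-term rule with the integral bound for the series for $\pi^4/90$, and verifies the alternating-series bound for ln 2 at a modest n. All function names are illustrative assumptions.

```python
import math

def naive_terms(tol):
    """Smallest n with (n+1)**-4 < tol (the misleading next-term rule)."""
    n = 1
    while (n + 1) ** -4 >= tol:
        n += 1
    return n

def integral_bound_terms(tol):
    """Smallest n with 1/(3 n**3) < tol (bound on the whole omitted tail)."""
    n = 1
    while 1.0 / (3 * n ** 3) >= tol:
        n += 1
    return n

def partial_sum(n):
    """S_n = 1**-4 + 2**-4 + ... + n**-4."""
    return sum(k ** -4 for k in range(1, n + 1))

def alt_harmonic(n):
    """n-term partial sum of 1 - 1/2 + 1/3 - ... for ln 2."""
    return sum((-1) ** (k - 1) / k for k in range(1, n + 1))

if __name__ == "__main__":
    tol = 0.5e-6
    target = math.pi ** 4 / 90
    for n in (naive_terms(tol), integral_bound_terms(tol)):
        print(n, abs(target - partial_sum(n)))
    # Example 9: the error after n terms is bounded by 1/(n+1).
    print(abs(math.log(2) - alt_harmonic(1000)), 1 / 1001)
```

The partial sum with 37 terms misses the tolerance by roughly a factor of ten, while 88 terms meet it.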
Summary 1.2

• Complete Horner's Algorithm:

    for k = 0 to n − 1
        for j = n − 1 to k
            a_j ← a_j + r a_{j+1}
        end for
    end for

• The Taylor series expansion about c for f(x) is
$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(c)}{k!}(x-c)^k + E_{n+1}$$
with error term
$$E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x-c)^{n+1}$$
A more useful form for us is the Taylor series expansion for f(x + h), which is
$$f(x+h) = \sum_{k=0}^{n} \frac{f^{(k)}(x)}{k!} h^k + E_{n+1}$$
with error term
$$E_{n+1} = \frac{f^{(n+1)}(\xi)}{(n+1)!} h^{n+1} = O(h^{n+1})$$

• An alternating series
$$S = \sum_{k=1}^{\infty} (-1)^{k-1} a_k$$
converges when the terms $a_k$ converge downward to zero. Furthermore, the partial sums $S_n$ differ from S by an amount that is bounded by
$$|S - S_n| \le a_{n+1}$$

Exercises 1.2

1. The Maclaurin series for $(1 + x)^n$ is also known as the binomial series. It states that
$$(1+x)^n = 1 + nx + \frac{n(n-1)}{2!}x^2 + \cdots \qquad (x^2 < 1)$$
Derive this series. Then give its particular forms in summation notation by letting n = 2, n = 3, and n = 1/2. Next use the last form to compute $\sqrt{1.0001}$ correct to 15 decimal places (rounded).

2. (Continuation) Use the series in the preceding problem to obtain series (4). How could this series be used on a computing machine to produce x/y if only addition and multiplication are built-in operations?
3. (Continuation) Use the previous problem to obtain a series for $(1 + x^2)^{-1}$.

4. Why do the following functions not possess Taylor series expansions at x = 0?
a. $f(x) = \sqrt{x}$
b. $f(x) = |x|$
c. $f(x) = \arcsin(x - 1)$
d. $f(x) = \cot x$

a. f(x) = (sin x) + (cos x) and find an approximate value for f(0.001).
b. g(x) = (sin x)(cos x) and find an approximate value for g(0.0006).
Compare the accuracy of these approximations to those obtained from tables or via a calculator.
Verify this Taylor series and prove that it converges on the interval −e

$$q_k(x) = \sum_{j=0}^{k} b_j x^j$$
where $b_0 = 1$. Here we have normalized with respect to $b_0 \ne 0$ and the values of m and k are modest. We choose the k coefficients $b_j$ and the m + 1 coefficients $a_i$ in $R_{m,k}$ to match f and a specified number of its derivatives at the fixed point x = 0. First, we construct the truncated Maclaurin series $\sum_{i=0}^{n} c_i x^i$ in which $c_i = f^{(i)}(0)/i!$ and $c_i = 0$ for i < 0. Next, we match the first m + k + 1 derivatives of $R_{m,k}$ with respect to x at x = 0 to the first m + k + 1 coefficients $c_i$. This leads to the following displayed equations. Since $b_0 = 1$, we solve this k × k system of equations for $b_1, b_2, \ldots, b_k$
$$\begin{bmatrix}
c_m & c_{m-1} & \cdots & c_{m-(k-2)} & c_{m-(k-1)} \\
c_{m+1} & c_m & \cdots & c_{m-(k-3)} & c_{m-(k-2)} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c_{m+(k-1)} & c_{m+(k-2)} & \cdots & c_{m+1} & c_m
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}
=
\begin{bmatrix} -c_{m+1} \\ -c_{m+2} \\ \vdots \\ -c_{m+k} \end{bmatrix}$$

a. Determine the rational functions $R_{1,1}(x)$ and $R_{2,2}(x)$. Produce and compare computer plots for $f(x) = e^x$, $R_{1,1}$, and $R_{2,2}$. Do these low-order rational functions approximate the exponential function $e^x$ satisfactorily within [−1, 1]? How do they compare to the truncated Maclaurin polynomials of the preceding problem?
b. Repeat using $R_{2,2}(x)$ and $R_{3,1}(x)$ for the function g(x) = ln(1 + x).
Information on the life and work of the French mathematician Henri Eugène Padé (1863–1953) can be found in Wood [1999]. This reference also has examples and exercises similar to these. Further examples of Padé approximation can be seen.

23. (Continuation) Repeat for the Bessel function $J_0(2x)$, whose Maclaurin series is
$$1 - x^2 + \frac{x^4}{4} - \frac{x^6}{36} + \cdots = \sum_{i=0}^{\infty} (-1)^i \left( \frac{x^i}{i!} \right)^2$$
Then determine $R_{2,2}(x)$, $R_{4,3}(x)$, and $R_{2,4}(x)$ as well as comparing plots.

24. Carry out the details in the introductory example to this chapter by first deriving the Taylor series for ln(1 + x) and computing ln 2 ≈ 0.63452 using the first eight terms. Then establish the series ln[(1 + x)/(1 − x)] and calculate ln 2 ≈ 0.69313 using the terms shown. Determine the absolute error and relative errors for each of these answers.
25. Reproduce Figure 1.3 using your computer as well as adding the curve for $S_4$.

26. Use a mathematical software system that does symbolic manipulations such as Maple or Mathematica to carry out
a. Example 1.2.3
b. Example 1.2.6

27. Can you obtain the following numerical results?
$$\sqrt{1.00001} = 1.00000\,49999\,87500\,06249\,96093\,77734\,37500\,0000$$
$$\sqrt{0.99999} = 0.99999\,49999\,87499\,93749\,96093\,72265\,62500\,00000$$
Are these answers accurate to all digits shown?

28. Sometimes the values of a Taylor series cannot be easily reformulated. For 15 values of x = 1, 0.1, 0.01, . . ., compare the following.
a. $(-1 + e^x)/x$ versus eight terms in the truncated Taylor series.
b. $(1 - \cos x)/x^2$ versus $\sin^2 x/[x^2 + \cos x]$.

1.3 Floating-Point Representation
The standard way to represent a nonnegative real number in decimal form is with an integer part and a fractional part with a decimal point between them such as
37.21829, 0.00227 1828, 30 00527.11059
(We group five digits together as shown.)
Another standard form, often called normalized scientific notation, is obtained by
shifting the decimal point and supplying appropriate powers of 10. Thus, the preceding numbers have alternative representations as
$$37.21829 = 0.37218\,29 \times 10^{2}$$
$$0.00227\,1828 = 0.22718\,28 \times 10^{-2}$$
$$30\,00527.11059 = 0.30005\,27110\,59 \times 10^{7}$$
In normalized scientific notation, the number is represented by a fraction multiplied by a power of 10, and the leading digit in the fraction is not zero (except when the number involved is zero). Thus, we write 79325 as $0.79325 \times 10^5$, not $0.07932\,5 \times 10^6$ or $7.9325 \times 10^4$ or some other way.
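The shift of the decimal point described above can be carried out exactly with Python's `decimal` module. The sketch below is one possible approach: the helper name `normalize` is an assumption, and the input is passed as a string so that no binary rounding occurs before the shift.

```python
from decimal import Decimal

def normalize(text):
    """Return (r, n) with x = r * 10**n and 1/10 <= |r| < 1, for nonzero x."""
    x = Decimal(text)
    if x == 0:
        raise ValueError("zero has no normalized form")
    n = x.adjusted() + 1      # adjusted() is the exponent of the leading digit
    return x.scaleb(-n), n    # scaleb shifts the decimal point by -n places

if __name__ == "__main__":
    for s in ("37.21829", "0.002271828", "3000527.11059", "79325"):
        print(s, "=", normalize(s))
```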
Normalized Floating-Point Representation
In the context of computer science, normalized scientific notation is also called normalized floating-point representation. In the decimal system, any real number x (other than zero) can be represented in normalized floating-point form as
$$x = \pm 0.d_1 d_2 d_3 \ldots \times 10^n$$
where $d_1 \ne 0$ and n is an integer (positive, negative, or zero). Each of the numbers $d_1, d_2, d_3, \ldots$ is one of the decimal digits 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Stated another way, the real number x, if different from zero, can be represented in normalized floating-point decimal form as
$$x = \pm r \times 10^n \qquad \left( \tfrac{1}{10} \le r < 1 \right)$$
This representation consists of three parts: a sign that is either + or −, a number r in the interval [1/10, 1), and an integer power of 10. The number r is called the normalized mantissa and n the exponent.

The floating-point representation in the binary system is similar to that in the decimal system in several ways. If x ≠ 0, it can be written as
$$x = \pm q \times 2^m \qquad \left( \tfrac{1}{2} \le q < 1 \right)$$
The mantissa q would be expressed as a sequence of zeros or ones in the form $q = (0.b_1 b_2 b_3 \ldots)_2$, where $b_1 \ne 0$. Hence, $b_1 = 1$ and then necessarily q ≧ 1/2 and q < 1.

A floating-point number system within a computer is similar to what we have just
described, with one important difference: Every computer has only a finite word length and a finite total capacity, so only numbers with a finite number of digits can be represented. A number is allotted only one word of storage in the single-precision mode (two or more words in double or long-double precision). In either case, the degree of precision is strictly limited. Clearly, irrational numbers cannot be represented, nor can those rational numbers that do not fit the finite format imposed by the computer. Furthermore, numbers may be either too large or too small to be representable. The real numbers that are representable in a computer are called its machine numbers.
Since any number used in calculations with a computer must conform to the format of numbers in that computer system, it must have a finite expansion. Numbers that have a nonterminating expansion cannot be accommodated precisely. Moreover, a number that has a terminating expansion in one base may have a nonterminating expansion in another. A good example of this is the following simple fraction as given in the introductory example to this chapter:
$$\frac{1}{10} = (0.1)_{10} = (0.06314631463146314\ldots)_8 = (0.0\,0011\,0011\,0011\,0011\,0011\,0011\,0011\,0011\ldots)_2$$
The important point here is that most real numbers cannot be represented exactly in a computer. (See Appendix B for a discussion of representation of numbers in different bases.)
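The nonterminating expansions of 1/10 can be generated digit by digit with exact rational arithmetic. The following Python sketch uses the multiply-and-take-integer-part method; the function name `fractional_digits` is illustrative. It also shows that the double-precision value stored for 0.1 is a nearby rational number, not 1/10 itself.

```python
from fractions import Fraction

def fractional_digits(x, base, count):
    """First `count` digits after the radix point of x (0 <= x < 1) in `base`."""
    digits = []
    for _ in range(count):
        x *= base
        d = int(x)            # integer part is the next digit
        digits.append(d)
        x -= d                # keep only the fractional part
    return digits

if __name__ == "__main__":
    tenth = Fraction(1, 10)
    print("octal: ", fractional_digits(tenth, 8, 9))
    print("binary:", fractional_digits(tenth, 2, 12))
    # The machine number nearest to 0.1 differs from the exact fraction 1/10.
    print(Fraction(0.1) == tenth, Fraction(0.1))
```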
We illustrate that the effective number system for a computer is not a continuum, but a rather peculiar discrete set.
List all the floating-point numbers that can be expressed in the form x = ±(0.b1b2b3)2 × 2±k
where b1 , b2 , b3 , and m are allowed to have only the value 0 or 1. Then allow only normalized floating-point numbers; that is, all numbers (with the exception of zero) having the form
x = ±(0.b1b2b3)2 × 2±k
There are two choices for the ±, two choices for $b_1$, two choices for $b_2$, two choices for $b_3$, and three choices for the exponent. Thus, at first, one would expect $2 \times 2 \times 2 \times 2 \times 3 = 48$ different numbers. For example, all of the possible nonnegative numbers in this system are as follows:

Mantissa      × 2^{-1}    × 2^{0}    × 2^{1}
(0.000)_2     0           0          0
(0.001)_2     1/16        1/8        1/4
(0.010)_2     2/16        2/8        2/4
(0.011)_2     3/16        3/8        3/4
(0.100)_2     4/16        4/8        4/4
(0.101)_2     5/16        5/8        5/4
(0.110)_2     6/16        6/8        6/4
(0.111)_2     7/16        7/8        7/4

Here there are many duplications! So we obtain only these nonnegative numbers: 1/16, 3/16, 5/16, 7/16; 1/8, 3/8, 5/8, 7/8; 1/4, 3/4, 5/4, 7/4; 1/2, 3/2; 0, 1. Altogether there are 31 distinct numbers in the system (these 16 nonnegative values together with the 15 negatives of the positive ones).
The nonnegative numbers obtained are shown as dots on a line in Figure 1.6.
[Figure 1.6: number line with dots at 0, 1/16, 1/8, 3/16, 1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4]
FIGURE 1.6 Example 1: Nonnegative machine numbers
Observe that the numbers are symmetrically but unevenly distributed about zero.
Now allowing only normalized floating-point numbers ($b_1 = 1$), we cannot represent 1/16, 1/8, and 3/16. Hence, there are only 25 distinct numbers, and the nonnegative machine numbers are now distributed as in Figure 1.7.
[Figure 1.7: number line with dots at 0, 1/4, 5/16, 3/8, 7/16, 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4; the gap between 0 and 1/4 is labeled "Hole at zero" and "Underflow", and the region beyond 7/4 is labeled "Overflow"]
FIGURE 1.7 Example 1: Nonnegative normalized machine numbers
■
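The counts in Example 1 can be checked by brute force. The sketch below enumerates the toy system with exact rational arithmetic; the function name `toy_numbers` is a hypothetical helper, and zero is always kept as a member of the set.

```python
from fractions import Fraction
from itertools import product

def toy_numbers(normalized: bool) -> set[Fraction]:
    """All values +-(0.b1b2b3)_2 * 2^k with k in {-1, 0, 1}; zero is always kept."""
    values = {Fraction(0)}
    for sign, b1, b2, b3, k in product((1, -1), (0, 1), (0, 1), (0, 1), (-1, 0, 1)):
        if normalized and b1 == 0:
            continue  # normalized numbers must have leading bit b1 = 1
        mantissa = Fraction(4 * b1 + 2 * b2 + b3, 8)
        values.add(sign * mantissa * Fraction(2) ** k)
    return values

all_numbers = toy_numbers(normalized=False)
normal_numbers = toy_numbers(normalized=True)
print(len(all_numbers), len(normal_numbers))
print(sorted(x for x in normal_numbers if x > 0))
```

Storing the values in a set removes the duplications automatically, which is why 48 candidate bit patterns collapse to far fewer distinct numbers.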
If, in the course of a computation, a number x is produced that is outside the computer's permissible range, then we say that an overflow has occurred or that x is outside the range of the computer. Generally, an overflow results in a fatal error (or exception), and the normal execution of the program stops! An underflow is usually treated automatically by setting x to zero without any interruption of the program and, in most computers, without a warning message.
In a computer whose floating-point numbers are restricted to the form in Example 1, any number closer to zero than 1/4 would underflow to zero, and any number to the left of −1.75 or to the right of +1.75 would overflow to machine −∞ or +∞, respectively. Notice that there is a relatively wide gap between zero and the smallest positive machine number, which is $(0.100)_2 \times 2^{-1} = \frac{1}{4}$. This creates a phenomenon known as the hole at zero. Figure 1.7 illustrates concepts such as the hole at zero, underflow, and overflow for Example 1.
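The same phenomena can be observed on an actual machine. The sketch below assumes Python floats are IEEE double-precision numbers. Note that the behavior on overflow varies by language and operation: here plain multiplication quietly returns infinity, whereas the `math` module raises an exception.

```python
import math
import sys

big = sys.float_info.max        # largest finite double
tiny = sys.float_info.min       # smallest positive normalized double

overflowed = big * 2.0          # too large: becomes +infinity, no exception here
underflowed = tiny * 2.0**-60   # too small even for a subnormal: becomes 0.0 silently

print(overflowed, underflowed)

try:
    math.exp(1000.0)            # the math module reports overflow as an exception
except OverflowError as err:
    print("OverflowError:", err)
```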
We can store the normalized floating-point numbers from Example 1 in a five-bit computer with one bit for the sign of the number, two bits for the exponent, and two bits for the mantissa:

| ± | e1 | e2 | b2 | b3 |

All possible combinations of positive normalized floating-point numbers $(0.1b_2b_3)_2 \times 2^m$ with $m = -1, 0, 1$ are

Mantissa              × 2^{-1}    × 2^{0}    × 2^{1}
(0.100)_2 = 1/2       1/4         1/2        1
(0.101)_2 = 5/8       5/16        5/8        5/4
(0.110)_2 = 3/4       3/8         3/4        3/2
(0.111)_2 = 7/8       7/16        7/8        7/4

A machine number in floating-point single-precision is of the form
$$(-1)^s q \times 2^m = (-1)^s \times 2^{c-1} \times (1.b_2b_3)_2$$
Here we set the exponent to $m = c - 1$. (In a 32-bit computer with an 8-bit exponent, $m = c - 127$. Why?) Some special cases are for ±0, ±∞, and so on.

Floating-Point Representation
Before the Standard for Floating-Point Arithmetic (IEEE-754) was established in 1985, computers used many different forms of floating-point representation, differing in the word length, the format of the representation, and the rounding used between operations! Now IEEE-754 has been accepted by almost all hardware and software manufacturers worldwide. It defines the floating-point number system used by computers, and it offers several rounding schemes, which affect the accuracy, among other things. In most computers, there are three common levels of precision for floating-point numbers, with the number of bits allocated for each organized as shown in the following table.

Precision      Bits   Sign   Exponent   Mantissa
Single         32     1      8          23
Double         64     1      11         52
Long Double    80     1      15         64
A computer that operates in floating-point mode represents numbers as described earlier except for the limitations imposed by the finite word length. Many binary computers have a word length of 32 or 64 bits (binary digits). We shall describe a machine of this type whose features mimic many workstations and personal computers in widespread use. The internal representation of numbers and their storage is standard floating-point form, which is used in almost all computers. For simplicity, we have left out a discussion of some of the details and features. Fortunately, one need not know all the details of the floating-point arithmetic system used in a computer to use it intelligently. Nevertheless, it is generally helpful in debugging a program to have a basic understanding of the representation of numbers in your computer.
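The bit allocations in the table of precision levels determine a few derived quantities. The sketch below computes them under two stated assumptions that are not spelled out in the text: the exponent bias is taken as $2^{e-1} - 1$ for an e-bit exponent (which gives the 127 and 1023 used in this section), and the 80-bit long-double format is assumed to store its leading mantissa bit explicitly rather than hide it.

```python
import math

# (total bits, exponent bits, stored mantissa bits, hidden bit?) from the table.
FORMATS = {
    "single": (32, 8, 23, True),
    "double": (64, 11, 52, True),
    "long double": (80, 15, 64, False),  # assumption: leading bit is stored, not hidden
}

def summarize(name: str) -> tuple[int, int, float]:
    """Return (exponent bias, significant bits, approximate decimal digits)."""
    total, exp_bits, frac_bits, hidden = FORMATS[name]
    assert 1 + exp_bits + frac_bits == total      # sign + exponent + mantissa fill the word
    bias = 2 ** (exp_bits - 1) - 1                # assumed excess code
    precision = frac_bits + (1 if hidden else 0)  # significant binary digits
    return bias, precision, precision * math.log10(2)

for name in FORMATS:
    print(name, summarize(name))
```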
By single-precision floating-point numbers, we mean all acceptable numbers in a computer using the standard single-precision floating-point arithmetic format. (In this discussion, we are assuming that such a computer stores these numbers in 32-bit words.) This set is a finite subset of the real numbers. It consists of ±0, ±∞, normal and subnormal single-precision floating-point numbers, and even Not a Number (NaN) values. (More detail on these subjects is in Appendix B and in the references.)
Recall that most real numbers cannot be represented exactly as floating-point numbers, since they have infinite decimal or binary expansions (all irrational numbers and some rational numbers); for example, π, e, 1/3, 0.1, and so on.
Because of the 32-bit word length, as much as possible of the normalized floating-point number
$$\pm q \times 2^m$$
must be contained in those 32 bits. One way of allocating the 32 bits is as follows:

sign of q       1 bit
integer |m|     8 bits
number q        23 bits

Information on the sign of m is contained in the 8 bits allocated for the integer |m|. In such a scheme, we can represent real numbers with |m| as large as $2^7 - 1 = 127$. The exponent represents numbers from −127 through 128.
Single-Precision Floating-Point Form
We now describe a machine number of the following form in standard single-precision floating-point representation:
$$(-1)^s \times 2^{c-127} \times (1.f)_2$$
The leftmost bit is used for the sign of the mantissa, where s = 0 corresponds to + and s = 1 corresponds to −. The next 8 bits are used to represent the number c in the exponent of $2^{c-127}$, which is interpreted as an excess-127 code. Finally, the last 23 bits represent f from the fractional part of the mantissa in the 1-plus form: $(1.f)_2$. Each floating-point single-precision word is partitioned as in Figure 1.8.
[Figure 1.8: a 32-bit word split into the sign bit s (sign of mantissa), the biased exponent c, and the 23-bit field f from the one-plus mantissa $(1.f)_2$; the sign and exponent together occupy 9 bits, and the radix point sits just before f]
FIGURE 1.8 Partitioned floating-point single-precision computer word
In the normalized representation of a nonzero floating-point number, the first bit in the mantissa is always 1 so that this bit does not have to be stored. This can be accomplished by shifting the binary point to a “1-plus” form (1.f )2 . The mantissa is the rightmost 23 bits and contains f with an understood binary point as in Figure 1.8. So the mantissa (significand) actually corresponds to 24 binary digits since there is a hidden bit. (An important exception is the number ±0.)
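To make the partition concrete, the following sketch extracts the three fields from a single-precision word and rebuilds the value from them. It assumes that the `struct` format code `f` denotes the IEEE single-precision format, and the names `fields32` and `value32` are hypothetical helpers.

```python
import struct

def fields32(x: float) -> tuple[int, int, int]:
    """Round x to single precision and return its (s, c, f) bit fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))  # the 32 bits as an unsigned integer
    s = word >> 31              # 1 sign bit
    c = (word >> 23) & 0xFF     # 8-bit biased exponent
    f = word & 0x7FFFFF         # 23 stored mantissa bits (hidden leading 1 not included)
    return s, c, f

def value32(s: int, c: int, f: int) -> float:
    """Rebuild (-1)^s * 2^(c-127) * (1.f)_2 for a normalized word (0 < c < 255)."""
    return (-1) ** s * 2.0 ** (c - 127) * (1 + f / 2**23)

print(fields32(1.0))
print(fields32(0.15625))
print(value32(*fields32(0.15625)))
```

The field f never contains the leading 1 of the mantissa, which is why `value32` adds it back explicitly.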
We now outline the procedure for determining the representation of a real number x. If x is zero, it is represented by a full word of zero bits with the possible exception of the sign bit. For a nonzero x, first assign the sign bit for x and consider |x|. Then convert both the integer and fractional parts of |x| from decimal to binary. Next one-plus normalize $(|x|)_2$ by shifting the binary point so that the first bit to the left of the binary point is a 1 and all bits to the left of this 1 are 0. To compensate for this shift of the binary point, adjust the exponent of 2; that is, multiply by the appropriate power of 2. The 24-bit one-plus-normalized mantissa in binary is thus found. Now the current exponent of 2 should be set equal to c − 127 to determine c, which is then converted from decimal to binary. The sign bit of the mantissa is combined with $(c)_2$ and $(f)_2$. Finally, write the 32-bit representation of x as eight hexadecimal digits.
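The procedure just outlined can be sketched as follows. This is an illustration rather than a full encoder: the hypothetical helper `single_hex` assumes that x is nonzero, lies in the normalized range, and is exactly representable in single precision (no rounding step is included). The final line cross-checks against the machine's own encoding, assuming `struct` uses the IEEE single format.

```python
import math
import struct

def single_hex(x: float) -> str:
    """Encode a nonzero, exactly representable x as eight hexadecimal digits."""
    s = 1 if x < 0 else 0
    q, e = math.frexp(abs(x))        # abs(x) = q * 2**e with 0.5 <= q < 1
    mantissa, m = 2 * q, e - 1       # one-plus form: abs(x) = mantissa * 2**m
    c = m + 127                      # biased exponent, from m = c - 127
    f = int((mantissa - 1) * 2**23)  # the 23 stored bits of (1.f)_2
    word = (s << 31) | (c << 23) | f
    return f"{word:08X}"

print(single_hex(-52.234375))
# Cross-check against the built-in single-precision encoding.
print(struct.pack(">f", -52.234375).hex().upper())
```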
The value of c in the representation of a floating-point number in single precision is restricted by the inequality
$$0 < c < (11\,111\,111)_2 = 255$$
The values 0 and 255 are reserved for special cases, including ±0 and ±∞, respectively. Hence, the actual exponent of the number is restricted by the inequality
$$-126 \le c - 127 \le 127$$
Likewise, we find that the mantissa of each nonzero number is restricted by the inequality
$$1 \le (1.f)_2 \le (1.11111111111111111111111)_2 = 2 - 2^{-23}$$
The largest number representable is therefore $(2 - 2^{-23})2^{127} \approx 2^{128} \approx 3.4 \times 10^{38}$. The smallest positive normalized number is $2^{-126} \approx 1.2 \times 10^{-38}$.
The binary machine floating-point number $\varepsilon = 2^{-24}$ is called the machine epsilon when using single precision. It marks the threshold at which adding ε to 1 begins to register: with rounding to nearest, fl(1 + ε) = 1 for every positive $\varepsilon < 2^{-24}$, whereas fl(1 + ε) > 1 for every $\varepsilon > 2^{-24}$. Because $2^{-24} \approx 5.96 \times 10^{-8}$, we infer that in a simple computation, approximately 7 significant decimal digits of accuracy may be obtained in single precision. Recall that 23 bits are allocated for the mantissa plus the hidden bit.
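These extreme values can be read directly from bit patterns. The sketch below is a check under the assumption that the `struct` format code `f` is the IEEE single format; the helper name `from_hex32` is hypothetical. It also shows the spacing between 1 and the next machine number, which is twice $2^{-24}$.

```python
import struct

def from_hex32(word: str) -> float:
    """Interpret eight hexadecimal digits as a single-precision machine number."""
    return struct.unpack(">f", bytes.fromhex(word))[0]

largest = from_hex32("7F7FFFFF")         # c = 254, f = all ones
smallest = from_hex32("00800000")        # c = 1,   f = 0
next_after_one = from_hex32("3F800001")  # c = 127, f = 1

print(largest, smallest, next_after_one - 1.0)
```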
Double-Precision Floating-Point Form
When more precision is needed, double precision can be used, in which case each double-precision floating-point number is stored in two computer words in memory. In double precision, there are 52 bits allocated for the mantissa. The double-precision machine epsilon is $2^{-53} \approx 1.11 \times 10^{-16}$, so approximately 15 significant decimal digits of precision are available. There are 11 bits allowed for the exponent, which is biased by 1023. The exponent represents numbers from −1022 through 1023. A machine number in standard double-precision floating-point form corresponds to
$$(-1)^s \times 2^{c-1023} \times (1.f)_2$$
The leftmost bit is used for the sign of the mantissa with s = 0 for + and s = 1 for −. The next 11 bits are used to represent the exponent c corresponding to $2^{c-1023}$. Finally, 52 bits represent f from the fractional part of the mantissa in the one-plus form: $(1.f)_2$.
The value of c in the representation of a floating-point number in double precision is restricted by the inequality
$$0 < c < (11\,111\,111\,111)_2 = 2047$$
As in single precision, the values at the ends of this interval are reserved for special cases. Hence, the actual exponent of the number is restricted by the inequality
$$-1022 \le c - 1023 \le 1023$$
We find that the mantissa of each nonzero number is restricted by the inequality
$$1 \le (1.f)_2 \le (1.111111111 \cdots 1111111111)_2 = 2 - 2^{-52}$$
Because $2^{-52} \approx 2.2 \times 10^{-16}$, we infer that in a simple computation approximately 15 significant decimal digits of accuracy may be obtained in double precision.
Recall that 52 bits are allocated for the mantissa. The largest double-precision machine number is $(2 - 2^{-52})2^{1023} \approx 2^{1024} \approx 1.8 \times 10^{308}$. The smallest double-precision positive machine number is $2^{-1022} \approx 2.2 \times 10^{-308}$.
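As a quick check of these double-precision values, the sketch below compares them with the constants reported by Python's `sys.float_info`. It assumes that Python's `float` is an IEEE double, which holds on typical platforms but is not guaranteed by the language.

```python
import sys

info = sys.float_info   # describes Python's float type

largest = (2 - 2**-52) * 2.0**1023
smallest = 2.0**-1022

print(info.max == largest, info.min == smallest)
print(info.mant_dig, info.epsilon == 2.0**-52, info.dig)
```

Here `info.epsilon` is the spacing between 1 and the next larger machine number, which is twice the double-precision value $2^{-53}$ quoted earlier in this section.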
Single precision on a 64-bit computer is comparable to double precision on a 32-bit computer, whereas double precision on a 64-bit computer gives four times the precision available on a 32-bit computer.
In single precision, 31 bits are available for an integer because only 1 bit is needed for the sign. Consequently, the range for integers is from $-(2^{31} - 1)$ to $(2^{31} - 1) = 2147483647$. In double precision, 63 bits are used for integers, giving integers in the range $-(2^{63} - 1)$ to $(2^{63} - 1)$. In using integer arithmetic, accurate calculations can result in only approximately 9 digits in single precision and 18 digits in double precision! For high accuracy, most computations should be done by using double-precision floating-point arithmetic.
At this point, some students may wish to read Appendix B for a review of representing numbers in different bases.
EXAMPLE 2
Determine the machine representation of the decimal number −52.234375 in both single precision and double precision.
Solution
Converting the integer part to binary, we have $(52.)_{10} = (64.)_8 = (110\,100.)_2$. Next, converting the fractional part, we have $(.234375)_{10} = (.17)_8 = (.001\,111)_2$. Now
$$(52.234375)_{10} = (110\,100.001\,111)_2 = (1.101\,000\,011\,110)_2 \times 2^5$$
is the corresponding one-plus form in base 2, and $(.101\,000\,011\,110)_2$ is the stored mantissa. Next the exponent is $(5)_{10}$, and since $c - 127 = 5$, we immediately see that $(132)_{10} = (204)_8 = (10\,000\,100)_2$ is the stored exponent. Thus, the single-precision machine representation of −52.234375 is
$$[1\;10000100\;10100001111000000000000]_2 = [1100\,0010\,0101\,0000\,1111\,0000\,0000\,0000]_2 = [\mathrm{C250F000}]_{16}$$
Here is the bit pattern for −52.234375 in single-precision floating-point using 32 bits:

sign bit           1
8-bit exponent     10000100
23-bit mantissa    10100001111000000000000

In double precision, for the exponent $(5)_{10}$, we let $c - 1023 = 5$, and we have $(1028)_{10} = (2004)_8 = (10\,000\,000\,100)_2$, which is the stored exponent. Thus, the double-precision machine representation of −52.234375 is
$$[1\;10000000100\;101000011110\,0 \cdots 00]_2 = [1100\,0000\,0100\,1010\,0001\,1110\,0000 \cdots 0000]_2 = [\mathrm{C04A1E0000000000}]_{16}$$
Here $[\cdots]_k$ is the bit pattern of the machine word(s) that represents floating-point numbers, which is displayed in base k. Here is the bit pattern for −52.234375 in double-precision floating-point using 64 bits:

sign bit           1
11-bit exponent    10000000100
52-bit mantissa    101000011110 followed by 40 zeros
■
Mathematical software can be used to display the range of the numbers in a computer. In MATLAB, commands realmin('single') and realmax('single') return the smallest and largest finite floating-point numbers in single precision; double precision is similar. Here is a table of the range of floating-point normalized numbers with the hole at zero. Because this number range is not a continuum, there are lots of holes or gaps (discontinuities) throughout.

Precision           Range
Single Precision    $-2^{128} \approx -3.4 \times 10^{38} \le x \le -1.2 \times 10^{-38} \approx -2^{-126}$
                    $x = 0$
                    $2^{-126} \approx 1.2 \times 10^{-38} \le x \le 3.4 \times 10^{38} \approx 2^{128}$
Double Precision    $-2^{1024} \approx -1.8 \times 10^{308} \le x \le -2.2 \times 10^{-308} \approx -2^{-1022}$
                    $x = 0$
                    $2^{-1022} \approx 2.2 \times 10^{-308} \le x \le 1.8 \times 10^{308} \approx 2^{1024}$

EXAMPLE 3
Determine the decimal numbers that correspond to these machine words:
$$[\mathrm{45DE4000}]_{16} \qquad [\mathrm{BA390000}]_{16}$$
Solution
The first number in binary is
$$[0100\,0101\,1101\,1110\,0100\,0000\,0000\,0000]_2$$
The stored exponent is $(10\,001\,011)_2 = (213)_8 = (139)_{10}$, so $139 - 127 = 12$. The mantissa is positive and represents the number
$$(1.101\,111\,001)_2 \times 2^{12} = (1\,101\,111\,001\,000.)_2 = (15710.)_8$$
$$= 0 \times 1 + 1 \times 8 + 7 \times 8^2 + 5 \times 8^3 + 1 \times 8^4 = 8(1 + 8(7 + 8(5 + 8(1)))) = 7112$$
Similarly, the second word in binary is
$$[1011\,1010\,0011\,1001\,0000\,0000\,0000\,0000]_2$$
The exponential part of the word is $(01\,110\,100)_2 = (164)_8 = 116$, so the exponent is $116 - 127 = -11$. The mantissa is negative and corresponds to the following floating-point number:
$$-(1.011\,100\,100)_2 \times 2^{-11} = -(0.000\,000\,000\,010\,111\,001)_2 = -(0.000271)_8$$
$$= -2 \times 8^{-4} - 7 \times 8^{-5} - 1 \times 8^{-6} = -8^{-6}(1 + 8(7 + 8(2))) = -\frac{185}{262144} \approx -7.0571899 \times 10^{-4}$$
■
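The decoding in Example 3 can be reproduced with exact rational arithmetic. In this sketch the helper name `decode32` is hypothetical, only normalized words are handled, and the final line cross-checks against `struct` under the assumption that its `f` format is IEEE single precision.

```python
import struct
from fractions import Fraction

def decode32(word: str) -> Fraction:
    """Exact value of a normalized single-precision word given as eight hex digits."""
    bits = int(word, 16)
    s = bits >> 31
    c = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    assert 0 < c < 255, "special cases (zero, subnormal, infinity, NaN) are not handled"
    return (-1) ** s * Fraction(2) ** (c - 127) * (1 + Fraction(f, 2**23))

first = decode32("45DE4000")
second = decode32("BA390000")
print(first, second, float(second))
# Cross-check with the built-in decoding of the same bytes.
print(struct.unpack(">f", bytes.fromhex("BA390000"))[0])
```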
Computer Errors in Representing Numbers
We turn now to the errors that can occur when we attempt to represent a given real number x in the computer. We use a model computer with a 32-bit word length. Suppose first that we let $x = 2^{5321697}$ or $x = 2^{-32591}$. The exponents of these numbers far exceed the limitations of the machine (as described above). These numbers would overflow and underflow, respectively, and the relative error in replacing x by the closest machine number would be very large. Such numbers are outside the range of a 32-bit word-length computer.
Consider next a positive real number x in normalized floating-point form
$$x = q \times 2^m \qquad \left(\tfrac{1}{2} \le q < 1,\; -125 \le m \le 128\right)$$
The process of replacing x by its nearest machine number is called correct rounding, and the error involved is called roundoff error. We want to know how large it can be. We suppose that q is expressed in normalized binary notation, so
$$x = (0.1b_2b_3b_4 \ldots b_{24}b_{25}b_{26} \ldots)_2 \times 2^m$$
One nearby machine number can be obtained by rounding down or by simply dropping the excess bits $b_{25}b_{26}\ldots$, since only 23 bits have been allocated to the stored mantissa. This machine number is
$$x_- = (0.1b_2b_3b_4 \ldots b_{24})_2 \times 2^m$$
It lies to the left of x on the real-number axis. Another machine number, $x_+$, is just to the right of x on the real axis and is obtained by rounding up. It is found by adding one unit to $b_{24}$ in the expression for $x_-$. Thus, we have
$$x_+ = \left[(0.1b_2b_3b_4 \ldots b_{24})_2 + 2^{-24}\right] \times 2^m$$
The closer of these machine numbers is the one chosen to represent x.
[Figure 1.9: two number lines showing x between $x_-$ and $x_+$, once nearer to $x_-$ and once nearer to $x_+$]
FIGURE 1.9 A possible relationship between $x_-$, $x_+$, and x.
The two situations are illustrated by the simple diagrams in Figure 1.9. If x lies closer to $x_-$ than to $x_+$, then
$$|x - x_-| \le \tfrac{1}{2}|x_+ - x_-| = 2^{-25+m}$$
In this case, the relative error is bounded as follows:
$$\left|\frac{x - x_-}{x}\right| \le \frac{2^{-25+m}}{(0.1b_2b_3b_4 \ldots)_2 \times 2^m} \le \frac{2^{-25}}{1/2} = 2^{-24} = u$$
where $u = 2^{-24}$ is the unit roundoff error for a 32-bit binary computer with standard floating-point arithmetic.
Recall that machine epsilon is $\varepsilon = 2^{-24}$, so $u = \varepsilon$. Moreover, $u = 2^{-k}$, where k is the number of binary digits used in the mantissa, including the hidden bit (k = 24 in single precision and k = 53 in double precision). On the other hand, if x lies closer to $x_+$ than to $x_-$, then
$$|x - x_+| \le \tfrac{1}{2}|x_+ - x_-|$$
and the same analysis shows that the relative error is no greater than $2^{-24} = u$. So in the case of rounding to the nearest machine number, the relative error is bounded by u. We note in passing that when all excess digits or bits are discarded, the process is called chopping. If a 32-bit word-length computer has been designed to chop numbers, the relative error bound would be twice as large as above.
The terms machine epsilon (ε) and unit roundoff error (u) are used interchangeably. Machine epsilon is used to study the effect of rounding errors because the actual errors of machine arithmetic are extremely complicated. Program libraries may provide precomputed values for these and other standard numerical quantities.
Mathematical software can be used to display the precision of a computer. For example, in MATLAB, the command eps('single') returns the distance from 1.0 to the next larger single-precision floating-point number; double precision is similar.

             Single Precision                         Double Precision
Precision    $2^{-23} \approx 1.2 \times 10^{-7}$     $2^{-52} \approx 2.2 \times 10^{-16}$

Often students are assigned the textbook exercise to compute an approximate value for machine epsilon. It is done in the sense of the spacing of the floating-point numbers at 1 rather than in the sense of the unit roundoff error. The following pseudocode produces an approximation to machine epsilon (within a factor of 2).

    epsi ← 1.0
    while (1.0 + epsi > 1.0)
        epsi ← epsi/2.0
    end while
    epsi ← 2.0 ∗ epsi

Machine Epsilon Pseudocode
As with any computational results, it depends on the particular computer platform used as well as the programming language, the floating-point format (float, double, long double, etc.), and the runtime library.
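A runnable version of the machine epsilon loop might look as follows. This is a sketch under stated assumptions: Python floats are IEEE doubles, single precision is simulated by rounding through `struct`, and the names `machine_epsilon` and `fl32` are hypothetical helpers. The loop stops at the first epsi for which 1 + epsi rounds back to 1 and then backs up one step, so it reports the spacing at 1.

```python
import struct
import sys

def machine_epsilon(fl=float) -> float:
    """Halve epsi until fl(1 + epsi) is no longer greater than 1, then back up one step."""
    epsi = 1.0
    while fl(1.0 + epsi) > 1.0:
        epsi = epsi / 2.0
    return 2.0 * epsi

def fl32(x: float) -> float:
    """Round a double to the nearest single-precision number (assumes IEEE formats)."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

eps_double = machine_epsilon()
eps_single = machine_epsilon(fl32)
print(eps_double, eps_single)
print(sys.float_info.epsilon)   # library-provided value for doubles
```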
Notation fl(x) and Backward Error Analysis
Next let us turn to the errors that are produced in the course of elementary arithmetic operations. To illustrate the principles, suppose that we are working with a five-place decimal machine and wish to add numbers. Two typical machine numbers in normalized floating-point form are
$$x = 0.37218 \times 10^4, \qquad y = 0.71422 \times 10^{-1}$$
Many computers perform arithmetic operations in a double-length work area, so assume that our computer has a ten-place accumulator. First, the exponent of the smaller number is adjusted so that both exponents are the same. Then the numbers are added in the accumulator, and the rounded result is placed in a computer word:
$$\begin{aligned} x &= 0.37218\,00000 \times 10^4 \\ y &= 0.00000\,71422 \times 10^4 \\ x + y &= 0.37218\,71422 \times 10^4 \end{aligned}$$
The nearest machine number is $z = 0.37219 \times 10^4$, and the relative error involved in this machine addition is
$$\frac{|x + y - z|}{|x + y|} = \frac{0.00000\,28578 \times 10^4}{0.37218\,71422 \times 10^4} \approx 0.77 \times 10^{-5}$$
This relative error would be regarded as acceptable on a machine of such low precision.
To facilitate the analysis of such errors, it is convenient to introduce the notation fl(x) to denote the floating-point machine number that corresponds to the real number x. Of course, the function fl depends on the particular computer involved. Our hypothetical five-decimal-digit machine would give
$$\mathrm{fl}(0.37218\,71422 \times 10^4) = 0.37219 \times 10^4$$
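The five-place decimal machine can be imitated with Python's `decimal` module. In this sketch, a context with five significant digits plays the role of the hypothetical machine, while the default 28-digit context serves as the double-length accumulator; the rounding mode chosen here is an assumption, and it does not matter for this particular sum.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

machine = Context(prec=5, rounding=ROUND_HALF_EVEN)  # five significant decimal digits

x = Decimal("0.37218E4")
y = Decimal("0.71422E-1")

exact = x + y                 # default context keeps 28 digits, so this sum is exact
z = machine.add(x, y)         # rounded to the five-digit machine
rel_err = abs(exact - z) / abs(exact)

print(exact, z, rel_err)
```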
For a 32-bit word-length computer, we established previously that if x is any real number within the range of the computer, then
$$\frac{|x - \mathrm{fl}(x)|}{|x|} \le u \qquad (u = 2^{-24}) \tag{1}$$
Here and throughout, we assume that correct rounding is used. This inequality can also be expressed in the more useful form
$$\mathrm{fl}(x) = x(1 + \delta) \qquad (|\delta| \le 2^{-24})$$
To see that these two inequalities are equivalent, simply let $\delta = [\mathrm{fl}(x) - x]/x$. Then, by Inequality (1), we have $|\delta| \le 2^{-24}$, and solving for fl(x) yields $\mathrm{fl}(x) = x(1 + \delta)$.
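Inequality (1) can be tested empirically. In the sketch below, Python doubles stand in for "real" numbers x, single precision is simulated through `struct` (assuming IEEE formats and rounding to nearest), and δ is computed exactly with rational arithmetic. The sample range and seed are arbitrary choices, and `fl32` and `delta` are hypothetical helper names.

```python
import random
import struct
from fractions import Fraction

def fl32(x: float) -> float:
    """Nearest single-precision machine number to x."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def delta(x: float) -> Fraction:
    """Exact delta in fl(x) = x(1 + delta), computed with rational arithmetic."""
    return (Fraction(fl32(x)) - Fraction(x)) / Fraction(x)

random.seed(0)  # arbitrary seed, only to make the run repeatable
samples = [random.uniform(1e-3, 1e3) for _ in range(1000)]
worst = max(abs(delta(x)) for x in samples)
print(float(worst), 2.0**-24)
```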
By considering the details in the addition 1 + ε when ε is a power of 2, we see that if $\varepsilon \ge 2^{-23}$, then fl(1 + ε) > 1, whereas if $\varepsilon \le 2^{-24}$, then fl(1 + ε) = 1 (the tie at $\varepsilon = 2^{-24}$ rounds back to 1 under the round-to-even rule of standard arithmetic). Consequently, if machine epsilon is the smallest positive power of 2 such that
$$\mathrm{fl}(1 + \varepsilon) > 1$$
then $\varepsilon = 2^{-23}$. Sometimes it is necessary to furnish the machine epsilon to a program. Because it is a machine-dependent constant, it can be found either by calling a system routine or by writing a simple program that finds the smallest positive number $x = 2^m$ such that 1 + x > 1 in the machine.
Now let the symbol ⊙ denote any one of the arithmetic operations +, −, ×, or ÷. Suppose a 32-bit word-length computer has been designed so that whenever two machine numbers x and y are to be combined arithmetically, the computer produces fl(x ⊙ y) instead of x ⊙ y. We can imagine that x ⊙ y is first correctly formed, then normalized, and finally rounded to become a machine number. Under this assumption, the relative error does not exceed $2^{-24}$ by the previous analysis:
$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \delta) \qquad (|\delta| \le 2^{-24})$$
Special cases of this are, of course,
$$\mathrm{fl}(x \pm y) = (x \pm y)(1 + \delta)$$
$$\mathrm{fl}(xy) = xy(1 + \delta)$$
$$\mathrm{fl}\!\left(\frac{x}{y}\right) = \frac{x}{y}(1 + \delta)$$
In these equations, δ is variable but satisfies $-2^{-24} \le \delta \le 2^{-24}$. The assumptions that we have made about a model 32-bit word-length computer are not quite true for a real computer. For example, it is possible for x and y to be machine numbers and for x ⊙ y to overflow or underflow. Nevertheless, the assumptions should be realistic for most computing machines.
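The model fl(x ⊙ y) = (x ⊙ y)(1 + δ) can be checked on a real machine. The sketch below works in double precision rather than the 32-bit model of the text, so the bound becomes $2^{-53}$; it assumes Python floats are IEEE doubles with correctly rounded basic operations. Operands are drawn from [0.5, 2] so that no overflow or underflow can occur, and the seed is arbitrary.

```python
import operator
import random
from fractions import Fraction

U = Fraction(1, 2**53)  # unit roundoff for double precision (k = 53)

def delta(op, x: float, y: float) -> Fraction:
    """Exact delta in fl(x op y) = (x op y)(1 + delta) for Python floats x and y."""
    exact = op(Fraction(x), Fraction(y))   # true result, rational arithmetic
    computed = Fraction(op(x, y))          # machine result, converted exactly
    return (computed - exact) / exact

random.seed(1)  # arbitrary seed for repeatability
ops = (operator.add, operator.sub, operator.mul, operator.truediv)
worst = Fraction(0)
for _ in range(500):
    x, y = random.uniform(0.5, 2.0), random.uniform(0.5, 2.0)
    for op in ops:
        if op is operator.sub and x == y:
            continue                       # exact zero: relative error undefined
        worst = max(worst, abs(delta(op, x, y)))
print(float(worst), float(U))
```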
The equations given can be written in a variety of ways, some of which suggest alternative interpretations of roundoff. For example, we have
$$\mathrm{fl}(x + y) = x(1 + \delta) + y(1 + \delta)$$
This says that the result of adding machine numbers x and y is not in general x + y but is the true sum of x(1 + δ) and y(1 + δ). We can think of x(1 + δ) as the result of slightly perturbing x. Thus, the machine version of x + y, which is fl(x + y), is the exact sum of a slightly perturbed x and a slightly perturbed y. The reader can supply similar interpretations in the examples given in the exercises.
This interpretation is an example of backward error analysis. It attempts to determine what perturbation of the original data would cause the computer results to be the exact results for a perturbed problem. In contrast, a direct error analysis attempts to determine how computed answers differ from exact answers based on the same data. In this aspect of scientific computing, computers have stimulated a new way of looking at computational errors.
Because the set of machine numbers is finite, some of the basic mathematical operations are not well defined and may break down in floating-point arithmetic. For example, we have
$$\mathrm{fl}(x) = x(1 + \epsilon)$$
in which $|\epsilon| \le \epsilon_m$, where $\epsilon_m$ is machine epsilon, which is the smallest number such that $\mathrm{fl}(1 + \epsilon_m) \ne 1$. Moreover, we obtain
$$\mathrm{fl}(x \odot y) = (x \odot y)(1 + \epsilon_\odot)$$
where $|\epsilon_\odot| \le \epsilon_m$ for the operations ⊙ = +, −, ∗, /, and
$$\mathrm{fl}(x \odot y) = \mathrm{fl}(y \odot x)$$
for operations ⊙ = +, ∗.
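A small experiment shows one property that survives and one that does not. The commutativity check mirrors the last equation above; the associativity check is an added illustration that is not stated in the text. The sketch assumes Python floats are IEEE doubles.

```python
a, b, c = 0.1, 0.2, 0.3

commutes = (a + b == b + a) and (a * b == b * a)   # operand order does not matter
left = (a + b) + c                                  # grouping does matter
right = a + (b + c)

print(commutes, left == right, left, right)
```

The two groupings differ in the last binary digit because each intermediate sum is rounded separately.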
EXAMPLE 4
If x, y, and z are machine numbers in a 32-bit word-length computer, what upper bound can be given for the relative roundoff error in computing z(x + y)?
Solution
In the computer, the calculation of x + y would be done first. This arithmetic operation produces the machine number fl(x + y), which differs from x + y because of roundoff. By the principles established above, there is a $\delta_1$ such that
$$\mathrm{fl}(x + y) = (x + y)(1 + \delta_1) \qquad (|\delta_1| \le 2^{-24})$$
When z multiplies the machine number fl(x + y), the result is the machine number fl[z fl(x + y)] because z is already a machine number. This, too, differs from its exact counterpart, and we have, for some $\delta_2$,
$$\mathrm{fl}[z\,\mathrm{fl}(x + y)] = z\,\mathrm{fl}(x + y)(1 + \delta_2) \qquad (|\delta_2| \le 2^{-24})$$
Putting both of our equations together, we have
$$\begin{aligned} \mathrm{fl}[z\,\mathrm{fl}(x + y)] &= z(x + y)(1 + \delta_1)(1 + \delta_2) \\ &= z(x + y)(1 + \delta_1 + \delta_2 + \delta_1\delta_2) \\ &\approx z(x + y)(1 + \delta_1 + \delta_2) \\ &= z(x + y)(1 + \delta) \qquad (|\delta| \le 2^{-23}) \end{aligned}$$
In this calculation, $|\delta_1\delta_2| \le 2^{-48}$, and so we ignore it. Also, we put $\delta = \delta_1 + \delta_2$ and then reason that $|\delta| = |\delta_1 + \delta_2| \le |\delta_1| + |\delta_2| \le 2^{-24} + 2^{-24} = 2^{-23}$. ■
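The bound on the error in z(x + y) can be checked numerically. The sketch below simulates single-precision arithmetic by rounding double results through `struct`, assuming IEEE formats and rounding to nearest. The operands are restricted to [1, 2] so that each double sum and product is exact before rounding, which makes every `fl32` call a single correct rounding. The seed and sample size are arbitrary.

```python
import random
import struct
from fractions import Fraction

def fl32(x: float) -> float:
    """Nearest single-precision machine number to x."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

random.seed(2)  # arbitrary seed for repeatability
worst = Fraction(0)
for _ in range(1000):
    # Three single-precision machine numbers in [1, 2].
    x, y, z = (fl32(random.uniform(1.0, 2.0)) for _ in range(3))
    computed = fl32(z * fl32(x + y))                    # fl[z fl(x + y)]
    exact = Fraction(z) * (Fraction(x) + Fraction(y))   # true z(x + y)
    worst = max(worst, abs(Fraction(computed) - exact) / exact)

bound = Fraction(1, 2**23) + Fraction(1, 2**48)   # |d1 + d2 + d1*d2|
print(float(worst), float(bound))
```

The bound used here keeps the product term $\delta_1\delta_2$, so it holds rigorously rather than approximately.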
EXAMPLE 5
Critique the following attempt to estimate the relative roundoff error in computing the sum of two real numbers, x and y. In a 32-bit word-length computer, the calculation yields
$$\begin{aligned} z &= \mathrm{fl}[\mathrm{fl}(x) + \mathrm{fl}(y)] \\ &= [x(1 + \delta) + y(1 + \delta)](1 + \delta) \\ &= (x + y)(1 + \delta)^2 \\ &\approx (x + y)(1 + 2\delta) \end{aligned}$$
Therefore, the relative error is bounded as follows:
$$\left|\frac{(x + y) - z}{(x + y)}\right| = \left|\frac{2\delta(x + y)}{(x + y)}\right| = |2\delta| \le 2^{-23}$$
Why is this calculation not correct?
Solution
The quantities δ that occur in such calculations are not, in general, equal to each other. The correct calculation is
$$\begin{aligned} z &= \mathrm{fl}[\mathrm{fl}(x) + \mathrm{fl}(y)] \\ &= [x(1 + \delta_1) + y(1 + \delta_2)](1 + \delta_3) \\ &= [(x + y) + \delta_1 x + \delta_2 y](1 + \delta_3) \\ &= (x + y) + \delta_1 x + \delta_2 y + \delta_3 x + \delta_3 y + \delta_1\delta_3 x + \delta_2\delta_3 y \\ &\approx (x + y) + x(\delta_1 + \delta_3) + y(\delta_2 + \delta_3) \end{aligned}$$
Therefore, the relative roundoff error is
$$\left|\frac{(x + y) - z}{(x + y)}\right| = \left|\frac{x(\delta_1 + \delta_3) + y(\delta_2 + \delta_3)}{(x + y)}\right| = \left|\frac{(x + y)\delta_3 + x\delta_1 + y\delta_2}{(x + y)}\right| = \left|\delta_3 + \frac{x\delta_1 + y\delta_2}{(x + y)}\right|$$
This cannot be bounded, because the second term has a denominator that can be zero or close to zero. Notice that if x and y are machine numbers, then $\delta_1$ and $\delta_2$ are zero, and a useful bound results, namely, $\delta_3$. But we do not need this calculation to know that! It has been assumed that when machine numbers are combined with any of the four arithmetic operations, the relative roundoff error would not exceed $2^{-24}$ in magnitude (on a 32-bit word-length computer). ■
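A concrete instance shows how large the relative error can become when x + y is close to zero. The two values below are chosen purely for illustration and neither is a machine number. Single precision is simulated through `struct`, assuming IEEE formats and rounding to nearest.

```python
import struct
from fractions import Fraction

def fl32(x) -> float:
    """Nearest single-precision machine number to x."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

# Two reals that are not machine numbers and nearly cancel (illustrative values).
x = Fraction(1, 3)
y = Fraction(-333333, 1000000)

z = fl32(fl32(x) + fl32(y))               # z = fl[fl(x) + fl(y)]
rel_err = abs((x + y) - Fraction(z)) / abs(x + y)
print(float(rel_err), 2.0**-23)
```

Here the final addition is exact, so the entire error comes from the initial roundings $\delta_1$ and $\delta_2$, magnified by the small denominator x + y.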
Historical Notes
In the 1991 Gulf War, a failure of the Patriot missile defense system was the result of a software conversion error. The system clock measured time in tenths of a second, but it was stored as a 24-bit floating-point number, resulting in rounding errors. The system failed to intercept an incoming Iraqi Scud missile, which resulted in the death of 28 American soldiers in a barracks in Dhahran, Saudi Arabia. (Field data had shown that the system would fail to track and intercept an incoming missile after being on for 20 consecutive hours and would need to be rebooted—it had been on for 100 hours!)
In 1994, Professor Thomas R. Nicely discovered that the Intel Pentium floating-point processor returned erroneous results for certain division operations. For example, 1/824633702441.0 was calculated incorrectly for all digits beyond the eighth significant digit. After initially declaring that it would not impact many users, Intel eventually set aside $420 million to fix the problem and replaced the chip for anyone who requested it!
In 1996, the Ariane 5 rocket launched by the European Space Agency exploded 40 seconds after lift-off from Kourou, French Guiana. An investigation determined that the horizontal velocity required the conversion of a 64-bit floating-point number to a 16-bit signed integer. It failed because the number was larger than 32,767, which was the largest integer of this type that could be stored in memory. The rocket and its cargo were valued at $500 million.
Further details about these disasters can be found by searching the World Wide Web on the Internet. There are other interesting accounts of calamities that could have been averted by more careful computer programming, especially in using floating-point arithmetic.
Summary 1.3
• A single-precision floating-point number in a 32-bit word-length computer with standard floating-point representation is stored in a single word with the bit pattern
$$ b_1 b_2 b_3 \cdots b_9 b_{10} b_{11} \cdots b_{32} $$
which is interpreted as the real number
$$ (-1)^{b_1} \times 2^{(b_2 b_3 \ldots b_9)_2} \times 2^{-127} \times (1.b_{10} b_{11} \ldots b_{32})_2 $$
• A double-precision floating-point number in a 32-bit word-length computer with standard floating-point representation is stored in two words with the bit pattern
$$ b_1 b_2 b_3 \cdots b_9 b_{10} b_{11} b_{12} b_{13} \cdots b_{32} b_{33} b_{34} b_{35} \cdots b_{64} $$
which is interpreted as the real number
$$ (-1)^{b_1} \times 2^{(b_2 b_3 \ldots b_{12})_2} \times 2^{-1023} \times (1.b_{13} b_{14} \ldots b_{64})_2 $$
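The single-precision interpretation above can be checked mechanically. The sketch below is an illustration only, assuming Python; the function names `decode_single` and `decode_with_struct` are hypothetical, and the formula is applied only to normalized numbers (exponent field neither all zeros nor all ones).

```python
import struct


def decode_single(word):
    """Apply (-1)^b1 * 2^((b2...b9)_2 - 127) * (1.b10...b32)_2 to a 32-bit word."""
    sign_bit = (word >> 31) & 0x1          # b1
    exponent_field = (word >> 23) & 0xFF   # (b2 b3 ... b9)_2
    fraction_field = word & 0x7FFFFF       # b10 b11 ... b32
    if exponent_field in (0, 255):
        raise ValueError("zero, subnormal, infinity, or NaN: formula does not apply")
    mantissa = 1 + fraction_field / 2**23  # (1.b10 b11 ... b32)_2
    return (-1) ** sign_bit * 2.0 ** (exponent_field - 127) * mantissa


def decode_with_struct(word):
    """Let the standard library interpret the same 32 bits, for comparison."""
    return struct.unpack(">f", word.to_bytes(4, "big"))[0]


for word in (0x3F800000, 0xC0000000, 0x40490FDB):
    print(f"[{word:08X}]16 -> {decode_single(word)!r}")
```

Both routes should agree for every normalized bit pattern, since each arithmetic step in `decode_single` is exact in double precision.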
• The relationship between a real number x and the floating-point machine number fl(x) can be written as
$$ \mathrm{fl}(x) = x(1+\delta) \qquad |\delta| \le 2^{-24} $$
If ⊙ denotes any one of the arithmetic operations, then we write
$$ \mathrm{fl}(x \odot y) = (x \odot y)(1+\delta) $$
In these equations, δ depends on x and y.

Exercises 1.3

1. Determine the machine representation in single precision on a 32-bit word-length computer for the following decimal numbers.
   a. $2^{-30}$   b. 64.015625   ᵃc. $-8 \times 2^{-24}$
2. Determine the single-precision and double-precision machine representation in a 32-bit word-length computer of the following decimal numbers:
   a. 0.5, −0.5   b. 0.125, −0.125   c. 0.0625, −0.0625   ᵃd. 0.03125, −0.03125
3. Which of these are machine numbers?
   a. $10^{403}$   b. $1 + 2^{-32}$   c. 1/5   d. 1/10   e. 1/256
4. Determine the single-precision and double-precision machine representation of the following decimal numbers:
   a. 1.0, −1.0   b. +0.0, −0.0   c. −9876.54321   ᵃd. 0.234375
   ᵃe. 492.78125   f. 64.37109375   g. −285.75   h. $10^{-2}$
5. Identify the floating-point numbers corresponding to the following bit strings:
   a. 0 00000000 00000000000000000000000
   b. 1 00000000 00000000000000000000000
   c. 0 11111111 00000000000000000000000
   ᵃd. 1 11111111 00000000000000000000000
   e. 0 00000001 00000000000000000000000
   f. 0 10000001 01100000000000000000000
   g. 0 01111111 00000000000000000000000
   h. 0 01111011 10011001100110011001100
6. What are the bit-string machine representations for the following subnormal numbers?
   a. $2^{-127} + 2^{-128}$   b. $2^{-127} + 2^{-150}$   c. $2^{-127} + 2^{-130}$   d. $\sum_{k=127}^{150} 2^{-k}$
7. Determine the decimal numbers that have the following machine representations:
   a. [3F27E520]16   b. [3BCDCA00]16   c. [BF4F9680]16   d. [CB187ABC]16
8. Determine the decimal numbers that have the following machine representations:
   ᵃa. [CA3F2900]16   b. [C705A700]16   c. [494F96A0]16   ᵃd. [4B187ABC]16
   e. [45223000]16   f. [45607000]16   ᵃg. [C553E000]16   h. [437F0000]16
9. Are these machine representations? Why or why not?
   a. [4BAB2BEB]16   b. [1A1AIA1A]16   c. [FADEDEAD]16   d. [CABE6G94]16
10. The computer word associated with the variable appears as [7F7FFFFF]16, which is the largest representable floating-point single-precision number. What is the decimal value of ? The variable ε appears as [00800000]16, which is the smallest positive number. What is the decimal value of ε?
11. Enumerate the set of numbers in the floating-point number system that have binary representations of the form $\pm(0.b_1 b_2)_2 \times 2^k$, where
   a. k ∈ {−1, 0}   b. k ∈ {−1, 1}   ᵃc. k ∈ {−1, 0, 1}
12. What are the machine numbers immediately to the right and left of $2^m$? How far is each from $2^m$?
13. Generally, when a list of floating-point numbers is added, less roundoff error will occur if the numbers are added in order of increasing magnitude. Give some examples to illustrate this principle.
14. (Continuation) The principle of the preceding exercise is not universally valid. Consider a decimal machine with two decimal digits allocated to the mantissa. Show that the four numbers 0.25, 0.0034, 0.00051, and 0.061 can be added with less roundoff error if not added in ascending order.
ᵃ15. In the case of machine underflow, what is the relative error involved in replacing a number x by zero?
ᵃ25. If x and y are real numbers within the range of a 32-bit word-length computer and if xy is also within the range, what relative error can there be in the machine computation of xy?
Hint: The machine produces fl[fl(x)fl(y)].
ᵃ26. Let x and y be positive real numbers that are not machine numbers but are within the exponent range of a 32-bit word-length computer. What is the largest possible relative error in the machine representation of $x + y^2$? Include errors made to get the numbers in the machine as well as errors in the arithmetic.
27. Show that if x and y are positive real numbers that have the same first n digits in their decimal representations, then y approximates x with relative error less than $10^{1-n}$. Is the converse true?
28. Show that a rough bound on the relative roundoff error when n machine numbers are multiplied in a 32-bit word-length computer is $(n-1)2^{-24}$.
29. Show that fl(x + y) = y on a 32-bit word-length computer if x and y are positive machine numbers and x
       x = x/2.0
   end while
   y = 2.0∗x
   output y
c. x = 1.0
   while x > 0.0
       y = x
       x = x/2.0
   end while
   output y
14. MATLAB has the commands dec2bin and bin2dec for converting numbers from decimal to binary and vice versa, as well as hex2dec and dec2hex for converting hexadecimal numbers to double precision and vice versa. Test these commands with these numbers:
   a. (11)₁₀   b. (197)₁₀   c. (0.625)₁₀   d. (0.2)₁₀

1.4 Loss of Significance
In this section, we show how loss of significance in subtraction can often be reduced or eliminated by various techniques, such as the use of rationalization, Taylor series, trigonometric identities, logarithmic properties, double precision, and/or range reduction. These are some of the techniques that can be used when one wants to guard against the degradation of precision in a calculation. Of course, we cannot always know when a loss of significance has occurred in a long computation, but we should be alert to the possibility and take steps to avoid it, if possible.
Significant Digits
We first address the elusive concept of significant digits in a number. Suppose that x is a real number expressed in normalized scientific notation in the decimal system
$$ x = \pm r \times 10^n \qquad \left(\tfrac{1}{10} \le r < 1\right) $$
For example, x might be
$$ x = 0.37214\,98 \times 10^{-5} $$
The digits 3, 7, 2, 1, 4, 9, 8 used to express r do not all have the same significance because they represent different powers of 10. Thus, we say that 3 is the most significant digit, and the significance of the digits diminishes from left to right. In this example, 8 is the least significant digit.
If x is a mathematically exact real number, then its approximate decimal form can be given with as many significant digits as we wish. Thus, we may write
$$ \frac{\pi}{10} \approx 0.31415\,92653\,58979 $$
and all the digits given are correct. If x is a measured quantity, however, the situation is quite different. Every measured quantity involves an error whose magnitude depends on the nature of the measuring device. Thus, if a meter stick is used, it is not reasonable to measure any length with precision better than 1 millimeter. Therefore, the result of measuring, say, a plate glass window with a meter stick should not be reported as 2.73594 meters. That would be misleading! Only digits that are believed to be correct or in error by at most a few units should be reported. It is a scientific convention that the least significant digit given in a measured quantity should be in error by at most five units; that is, the result is rounded correctly.
Similar remarks pertain to quantities computed from measured quantities. For example, if the side of a square is reported to be s = 0.736 meter, then one can assume that the error does not exceed a few units in the third decimal place. The diagonal of that square is then
$$ s\sqrt{2} \approx 0.10408\,61182 \times 10^{1} $$
but should be reported as 0.1041 × 10^1 or (more conservatively) 0.104 × 10^1. The infinite precision available in √2,
$$ \sqrt{2} = 1.41421\,35623\,73095\ldots $$
does not convey any more precision to s√2 than was already present in s.

Computer-Caused Loss of Significance
Perhaps it is surprising that a loss of significance can occur within a computer. It is essential to understand this process so that blind trust will not be placed in numerical output from a computer. One of the most common causes for a deterioration in precision is the subtraction of one quantity from another nearly equal quantity. This effect is potentially quite serious and can be catastrophic. The closer these two numbers are to each other, the more pronounced is the effect.
To illustrate this phenomenon, consider the assignment statement
$$ y \leftarrow x - \sin(x) $$
and suppose that at some point in a computer program this statement is executed with an x value of 1/15. Assume further that our computer works with floating-point numbers that have ten decimal digits. Then
$$
\begin{aligned}
x &\leftarrow 0.66666\,66667 \times 10^{-1} \\
\sin(x) &\leftarrow 0.66617\,29492 \times 10^{-1} \\
x - \sin(x) &\leftarrow 0.00049\,37175 \times 10^{-1} \\
x - \sin(x) &\leftarrow 0.49371\,75000 \times 10^{-4}
\end{aligned}
$$
In the last step, the result has been shifted to normalized floating-point form. Three zeros have then been supplied by the computer in the three least significant decimal places. We refer to these as spurious zeros; they are not significant digits. In fact, the ten-decimal-digit correct value is
$$ \frac{1}{15} - \sin\frac{1}{15} \approx 0.49371\,74327 \times 10^{-4} $$
Another way of interpreting this is to note that the final digit in x − sin(x) is derived from the tenth digits in x and sin(x). When the eleventh digit in either x or sin(x) is 5, 6, 7, 8, or 9, the numerical values are rounded up to ten digits so that their tenth digits may be altered by plus one unit. Since these tenth digits may be in error, the final digit in x − sin(x) may also be in error, which it is!
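The ten-digit computation above can be imitated on an ordinary machine. The following sketch is an illustration under stated assumptions: it is written in Python, the helper `round_to_digits` is hypothetical, and ten-decimal-digit arithmetic is emulated by rounding double-precision values to ten significant digits after each step. The double-precision value of 1/15 − sin(1/15) is used as the reference.

```python
import math


def round_to_digits(value, digits=10):
    """Round a float to the given number of significant decimal digits."""
    return float(f"{value:.{digits - 1}e}")


# Emulate the ten-decimal-digit machine of the text.
x = round_to_digits(1 / 15)
sin_x = round_to_digits(math.sin(1 / 15))
naive = round_to_digits(x - sin_x)

# Reference value computed with the full precision of a double.
reference = 1 / 15 - math.sin(1 / 15)
relative_error = abs(naive - reference) / reference

print(f"ten-digit result : {naive:.9e}")
print(f"reference        : {reference:.9e}")
print(f"relative error   : {relative_error:.1e}")
```

The ten-digit result carries only about seven correct digits, even though each operand was correct to ten.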
58 Chapter 1 Mathematical Preliminaries and Floating-Point Representation
EXAMPLE 1
If x = 0.37214 48693 and y = 0.37202 14371, what is the relative error in the computation of x − y in a computer that has five decimal digits of accuracy?

Solution
The numbers would first be rounded to x̃ = 0.37214 and ỹ = 0.37202. Then we have x̃ − ỹ = 0.00012, while the correct answer is x − y = 0.00012 34322. The relative error involved is
$$ \frac{|(x-y) - (\tilde{x}-\tilde{y})|}{|x-y|} = \frac{0.00000\,34322}{0.00012\,34322} \approx 3 \times 10^{-2} $$
This magnitude of relative error must be judged quite large when compared with the relative errors of x̃ and ỹ. (They cannot exceed ½ × 10^−4 by the coarsest estimates, and in this example, they are, in fact, approximately 1.3 × 10^−5.) ■

It should be emphasized that this discussion pertains not to the operation
$$ \mathrm{fl}(x-y) \leftarrow x - y $$
but rather to the operation
$$ \mathrm{fl}[\mathrm{fl}(x) - \mathrm{fl}(y)] \leftarrow x - y $$
Roundoff error in the former case is governed by the equation
$$ \mathrm{fl}(x-y) = (x-y)(1+\delta) $$
where |δ| ≦ 2^−24 on a 32-bit word-length computer, and on a five-decimal-digit computer in the preceding example |δ| ≦ ½ × 10^−4.
In Example 1 above, we observe that the computed difference of 0.00012 has only two significant figures of accuracy, whereas in general, one expects the numbers and calculations in this computer to have five significant figures of accuracy.
The remedy for this difficulty is first to anticipate that it may occur and then to reprogram. The simplest technique may be to carry out part of a computation in double- or long-double-precision arithmetic (that means roughly twice as many significant digits), but often a slight change in the formulas is required. Several illustrations of this will be given, and additional examples are found among the exercises.
Again consider Example 1, but imagine that the calculations to obtain x , y, and x − y are being done in double precision. Suppose that single-precision arithmetic is used thereafter. In the computer, all ten digits of x, y, and x − y are retained, but at the end, x − y is rounded to its five-digit form, which is 0.12343 × 10−3. This answer has five significant digits of accuracy, as we would like. Of course, the programmer or analyst must know in advance where double-precision arithmetic may be necessary in the computation. Programming everything in double precision is very wasteful if it is not needed. This approach has another drawback: There may be such serious cancellation of significant digits that even double precision might not help!
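Example 1 and the double-precision remedy just described can be reproduced with decimal arithmetic. This is a sketch only, assuming Python's standard `decimal` module; the five-digit and ten-digit "computers" are modeled by two `Context` objects, and all variable names are illustrative.

```python
from decimal import Context, Decimal

five_digits = Context(prec=5)    # the "single-precision" decimal machine
ten_digits = Context(prec=10)    # the "double-precision" decimal machine

x = Decimal("0.3721448693")
y = Decimal("0.3720214371")
exact = x - y  # the default context keeps 28 digits, so this is exact

# Single precision throughout: round the operands, then subtract.
x5 = five_digits.create_decimal(x)
y5 = five_digits.create_decimal(y)
single = five_digits.subtract(x5, y5)
rel_single = abs((exact - single) / exact)

# Double precision for the subtraction, rounded to five digits at the end.
double_then_round = five_digits.create_decimal(ten_digits.subtract(x, y))
rel_double = abs((exact - double_then_round) / exact)

print(single, float(rel_single))
print(double_then_round, float(rel_double))
```

The first route keeps two significant digits (0.00012), and the second recovers the five-digit value 0.12343 × 10^−3 quoted in the text.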
Theorem on Loss of Precision

Before considering other techniques for avoiding this problem, we ask the following question: Exactly how many significant binary digits are lost in the subtraction x − y when x is close to y? The closeness of x and y is conveniently measured by |1 − (y/x)|. Here is the result:

■ Theorem 1
Loss of Precision Theorem
Let x and y be normalized floating-point machine numbers, where x > y > 0. If 2^−p ≦ 1 − (y/x) ≦ 2^−q for some positive integers p and q, then at most p and at least q significant binary bits are lost in the subtraction x − y.

Proof
We prove the second part of the theorem and leave the first as an exercise. To this end, let x = r × 2^n and y = s × 2^m, where ½ ≦ r, s < 1. (This is the normalized binary floating-point form.) Since y < x, the computer may have to shift y before carrying out the subtraction. In any case, y must first be expressed with the same exponent as x. Hence, y = (s 2^{m−n}) × 2^n and
$$ x - y = (r - s2^{m-n}) \times 2^n $$
The mantissa of this number satisfies
$$ r - s2^{m-n} = r\left(1 - \frac{s2^m}{r2^n}\right) = r\left(1 - \frac{y}{x}\right) < 2^{-q} $$
Hence, to normalize the representation of x − y, a shift of at least q bits to the left is necessary. Then at least q (spurious) zeros are supplied on the right-hand end of the mantissa. This means that at least q bits of precision have been lost. ■
EXAMPLE 2
In the subtraction 37.593621 − 37.584216, how many bits of significance are lost?

Solution
Let x denote the first number and y the second. Then
$$ 1 - \frac{y}{x} = 0.00025\,01754 $$
This lies between 2^−12 and 2^−11. These two numbers are 0.000244 and 0.000488. Hence, at least 11 but not more than 12 bits are lost. ■

Here is an example in decimal form.

EXAMPLE 3
In the subtraction of y = .6311 from x = .6353, how many significant digits are lost?

Solution
These numbers are close, and 1 − y/x = .00661 < 10^−2. In the subtraction, we have x − y = .0042. There are two significant figures in the answer, although there were four significant figures in x and y. ■
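The bounds in the Theorem on Loss of Precision are easy to evaluate for given operands. The sketch below assumes Python; `bits_lost_bounds` is a hypothetical helper that returns the tightest integers q and p with 2^−p ≦ 1 − y/x ≦ 2^−q, and it is applied to the numbers of Example 2.

```python
import math


def bits_lost_bounds(x, y):
    """Return (q, p): at least q and at most p bits are lost in x - y.

    Assumes x > y > 0, as in the theorem. When 1 - y/x exceeds 1/2 the
    lower bound q comes out as 0, meaning no loss is guaranteed.
    """
    if not x > y > 0:
        raise ValueError("the theorem requires x > y > 0")
    closeness = 1.0 - y / x
    exponent = -math.log2(closeness)
    q = math.floor(exponent)  # largest q with closeness <= 2**-q
    p = math.ceil(exponent)   # smallest p with 2**-p <= closeness
    return q, p


q, p = bits_lost_bounds(37.593621, 37.584216)
print(f"at least {q} and at most {p} bits are lost")
```

For the operands of Example 2 this reproduces the conclusion that between 11 and 12 bits are lost.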
Avoiding Loss of Significance in Subtraction

Now we take up various techniques that can be used to avoid the loss of significance that may occur in subtraction.

EXAMPLE 4
Explore the function
$$ f(x) = \sqrt{x^2+1} - 1 \tag{1} $$
whose values may be required for x near zero.

Solution
Since √(x² + 1) ≈ 1 when x ≈ 0, we see that there is a potential loss of significance in the subtraction. However, the function can be rewritten in the form
$$ f(x) = \left(\sqrt{x^2+1} - 1\right)\frac{\sqrt{x^2+1}+1}{\sqrt{x^2+1}+1} = \frac{x^2}{\sqrt{x^2+1}+1} \tag{2} $$
by rationalizing the numerator, that is, removing the radical in the numerator. This procedure allows terms to be canceled and thereby removes the subtraction. For example, if we use five-decimal-digit arithmetic and if x = 10^−3, then f(x) is computed incorrectly as zero by the first formula, but as ½ × 10^−6 by the second. If we use the first formula together with double precision, the difficulty is ameliorated, but not circumvented altogether. For example, in double precision, we have the same problem when x = 10^−6. ■
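Formulas (1) and (2) of Example 4 can be compared directly. The following is a minimal sketch assuming Python with IEEE double-precision floats; the function names are illustrative, and x = 10^−8 is chosen here (rather than the values in the text) because x² + 1 then rounds to exactly 1 in double precision.

```python
import math


def f_naive(x):
    """Formula (1): subtracts two nearly equal numbers when x is near zero."""
    return math.sqrt(x * x + 1.0) - 1.0


def f_rationalized(x):
    """Formula (2): the subtraction has been removed by rationalizing."""
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)


for x in (1.0e-3, 1.0e-6, 1.0e-8):
    print(f"x = {x:.0e}  naive = {f_naive(x):.6e}  rationalized = {f_rationalized(x):.6e}")
```

Away from zero the two formulas agree closely, so Formula (2) can simply be used for all x.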
EXAMPLE 5
How can accurate values of the function
$$ f(x) = x - \sin x \tag{3} $$
be computed near x = 0?

Solution
A careless programmer might code this function just as indicated in Equation (3), not realizing that a serious loss of accuracy occurs. Recall from calculus that
$$ \lim_{x \to 0} \frac{\sin x}{x} = 1 $$
to see that sin x ≈ x when x ≈ 0. One cure for this problem is to use the Taylor series for sin x:
$$ \sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots $$
This series is known to represent sin x for all real values of x. For x near zero, it converges quite rapidly. Using this series, we can write the function f as
$$ f(x) = x - \left(x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots\right) = \frac{x^3}{3!} - \frac{x^5}{5!} + \frac{x^7}{7!} - \cdots \tag{4} $$
We see in this equation where the original difficulty arose; namely, for small values of x, the term x in the sine series is much larger than x³/3! and thus more important. But when f(x) is formed, this dominant x term disappears, leaving only the lesser terms. The series that starts with x³/3! is very effective for calculating f(x) when x is small. ■
In Example 5, further analysis is needed to determine the range in which Series (4) should be used and the range in which Formula (3) can be used. Using the Theorem on Loss of Precision, we see that the loss of bits in the subtraction of Formula (3) can be limited to at most 1 bit by restricting x so that ½ ≦ 1 − sin x/x. (Here we are considering only the case when sin x > 0.) With a calculator, it is easy to see that x must be at least 1.9. Thus, for |x| < 1.9, we use the first few terms in Series (4), and for |x| ≧ 1.9, we use f(x) = x − sin x. We can verify that for the worst case (x = 1.9), ten terms in the series give f(x) with an error of at most 10^−16. (That is good enough for double precision on a 32-bit word-length computer.)
Now we can construct a function procedure for computing f(x) = x − sin x and write its pseudocode. Notice that the terms in the series can be obtained inductively by the algorithm
$$ t_1 = \frac{x^3}{6}, \qquad t_{n+1} = \frac{-t_n x^2}{(2n+2)(2n+3)} \quad (n \ge 1) $$
Then the partial sums can be obtained inductively by
$$ s_1 = t_1, \qquad s_{n+1} = s_n + t_{n+1} \quad (n \ge 1) $$
so that
$$ s_n = \sum_{k=1}^{n} t_k = \sum_{k=1}^{n} (-1)^{k+1} \frac{x^{2k+1}}{(2k+1)!} $$
A suitable pseudocode for the function in Example 5 is given here. Inside the loop, the term t_i is formed from t_{i−1}, so the recurrence is applied with n = i − 1 and the denominator is (2i)(2i + 1).

f(x) = x − sin x Function Pseudocode

    real function f(x)
    integer i, n ← 10; real s, t, x
    if |x| ≧ 1.9 then
        s ← x − sin x
    else
        t ← x³/6
        s ← t
        for i = 2 to n
            t ← −t x²/[(2i)(2i + 1)]
            s ← s + t
        end for
    end if
    f ← s
    end function f
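For readers who want to run the procedure, here is one possible Python rendering of the pseudocode for Example 5. It is a sketch under stated assumptions rather than a definitive implementation: the name `x_minus_sin` and the keyword argument `terms` are illustrative, and each new term is obtained from the recurrence for t_{n+1} with n = i − 1, which gives the denominator (2i)(2i + 1).

```python
import math


def x_minus_sin(x, terms=10):
    """Evaluate f(x) = x - sin(x), switching to Series (4) when |x| < 1.9."""
    if abs(x) >= 1.9:
        # At most about 1 bit is lost here, by the Theorem on Loss of Precision.
        return x - math.sin(x)
    t = x**3 / 6.0          # t_1 = x^3 / 3!
    s = t                   # s_1 = t_1
    for i in range(2, terms + 1):
        # t_i = -t_{i-1} * x^2 / [(2i)(2i + 1)]
        t = -t * x * x / ((2 * i) * (2 * i + 1))
        s += t              # s_i = s_{i-1} + t_i
    return s


for x in (1.0e-5, 0.1, 1.8, 2.5):
    print(f"x = {x:<8g} series/switch = {x_minus_sin(x):.16e}  direct = {x - math.sin(x):.16e}")
```

For very small x the direct formula loses most of its digits, while the series value stays close to x³/6.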
EXAMPLE 6
How can accurate values of the function
$$ f(x) = e^x - e^{-2x} $$
be computed in the vicinity of x = 0?

Solution
Since e^x and e^{−2x} are both equal to 1 when x = 0, there is a loss of significance in the subtraction when x is close to zero. Inserting the appropriate Taylor series, we obtain
$$
\begin{aligned}
f(x) &= \left(1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots\right) - \left(1 - 2x + \frac{4x^2}{2!} - \frac{8x^3}{3!} + \cdots\right) \\
     &= 3x - \frac{3}{2}x^2 + \frac{3}{2}x^3 - \cdots
\end{aligned}
$$
An alternative approach is to write
$$ f(x) = e^{-2x}\left(e^{3x} - 1\right) = e^{-2x}\left(3x + \frac{9}{2!}x^2 + \frac{27}{3!}x^3 + \cdots\right) $$
By using the Theorem on Loss of Precision, we find that at most 1 bit is lost in the subtraction e^x − e^{−2x} when x > 0 and
$$ \tfrac{1}{2} \le 1 - e^{-3x} $$
This inequality is valid when x ≧ (1/3) ln 2 = 0.23105. Similar reasoning when x < 0 shows that for x ≦ −0.23105 at most 1 bit is lost. Hence, the series should be used for |x| < 0.23105. ■
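The factored form in Example 6 can be tried out as follows. This sketch assumes Python and swaps in the standard library function `math.expm1`, which returns e^u − 1 accurately for small u, in place of summing the series by hand; that substitution is an assumption of this illustration, not the method of the text. The function names and the constant `THRESHOLD` are illustrative.

```python
import math

THRESHOLD = math.log(2.0) / 3.0  # (1/3) ln 2, the crossover found in Example 6


def f_naive(x):
    """Direct formula: loses significance when x is close to zero."""
    return math.exp(x) - math.exp(-2.0 * x)


def f_factored(x):
    """e^(-2x) * (e^(3x) - 1), with expm1 supplying e^(3x) - 1."""
    return math.exp(-2.0 * x) * math.expm1(3.0 * x)


def f(x):
    """Use the direct formula only where at most about 1 bit is lost."""
    return f_naive(x) if abs(x) >= THRESHOLD else f_factored(x)


for x in (1.0e-10, 1.0e-3, 0.5):
    print(f"x = {x:<8g} naive = {f_naive(x):.16e}  f = {f(x):.16e}")
```

Near zero the factored form tracks the leading series term 3x, whereas the direct formula keeps only a few correct digits.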
EXAMPLE 7
Criticize the assignment statement
$$ y \leftarrow \cos^2(x) - \sin^2(x) $$

Solution
When cos²(x) − sin²(x) is computed, there is a loss of significance at x = π/4 (and at other points). The simple trigonometric identity
$$ \cos 2\theta = \cos^2\theta - \sin^2\theta $$
can be used. Thus, the assignment statement can be replaced by
$$ y \leftarrow \cos(2x) $$
■

EXAMPLE 8
Criticize the assignment statement
$$ y \leftarrow \ln(x) - 1 $$

Solution
If the expression ln x − 1 is used for x near e, there is a cancellation of digits and a loss of accuracy. Use elementary facts about logarithms to overcome the difficulty. Thus, we have y = ln x − 1 = ln x − ln e = ln(x/e). Here is a suitable assignment statement
$$ y \leftarrow \ln\left(\frac{x}{e}\right) $$
■
Range Reduction
Another cause of loss of significant figures is the evaluation of various library functions with large arguments. This problem is more subtle than those previously discussed. We illustrate with the sine function.
A basic property of the function sin x is its periodicity: sin x = sin(x + 2nπ)
for all real values of x and for all integer values of n. Because of this relationship, we need to know only the values of sin x in some fixed interval of length 2π to compute sin x for arbitrary x. This property can be used in the computer evaluation of sinx and is called range reduction.
EXAMPLE 9
Discuss how to evaluate sin(12532.14) by subtracting integer multiples of 2π. Show that it equals sin(3.47), if we retain only two decimal digits of accuracy.

Solution
From sin(12532.14) = sin(12532.14 − 2kπ), we want 12532 ≈ 2kπ, so k = 12532/(2π) ≈ 1994. Consequently, we obtain 12532.14 − 2(1994)π ≈ 3.47 and sin(12532.14) ≈ sin(3.47). Thus, although our original argument 12532.14 had seven significant figures, the reduced argument has only three. The remaining digits disappeared in the subtraction of 3988π. Since 3.47 has only three significant figures, our computed value of sin(12532.14) has no more than three significant figures. This decrease in precision is unavoidable, if there is no way of increasing the precision of the original argument. If the original argument (12532.14) can be obtained with more significant figures, these additional figures are present in the reduced argument (3.47). In some cases, double- or long-double-precision programming may be helpful. ■
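The reduction in Example 9 takes only a few lines to reproduce. The sketch below assumes Python with double-precision floats; the variable names are illustrative. It performs the reduction by explicit subtraction of 2kπ, as in the text, rather than relying on the library's own internal range reduction.

```python
import math

x = 12532.14

# Number of whole periods contained in x, then the reduced argument.
k = math.floor(x / (2.0 * math.pi))
reduced = x - 2.0 * k * math.pi

print(f"k = {k}")
print(f"reduced argument = {reduced:.6f}")
print(f"sin(x)           = {math.sin(x):.12f}")
print(f"sin(reduced)     = {math.sin(reduced):.12f}")
```

In double precision the reduced argument still carries many digits, because x itself was stored with about sixteen; if x is known only to the seven figures shown, just the leading digits 3.47 of the reduced argument are meaningful.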
EXAMPLE 10
For sin x, how many binary bits of significance are lost in range reduction to the interval [0, 2π)?

Solution
Given an argument x > 2π, we determine an integer n that satisfies the inequality 0 ≦ x − 2nπ < 2π. Then in evaluating elementary trigonometric functions, we use f(x) = f(x − 2nπ). In the subtraction x − 2nπ, there is a loss of significance. By the Theorem on Loss of Precision, at least q bits are lost if
$$ 1 - \frac{2n\pi}{x} \le 2^{-q} $$
Since
$$ 1 - \frac{2n\pi}{x} = \frac{x - 2n\pi}{x} < \frac{2\pi}{x} $$
we conclude that at least q bits are lost if 2π/x ≦ 2^−q. Stated otherwise, at least q bits are lost if 2^q ≦ x/2π. ■
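The criterion at the end of Example 10 gives a quick estimate of the damage for any argument. The sketch below assumes Python; `min_bits_lost_in_reduction` is a hypothetical helper that returns the largest integer q with 2^q ≦ x/2π.

```python
import math


def min_bits_lost_in_reduction(x):
    """Largest q with 2**q <= x / (2*pi): at least q bits are lost."""
    if x <= 2.0 * math.pi:
        raise ValueError("the estimate applies only for x > 2*pi")
    return math.floor(math.log2(x / (2.0 * math.pi)))


for x in (12532.14, 1.0e6, 1.0e12):
    print(f"x = {x:<12g} at least {min_bits_lost_in_reduction(x)} bits lost")
```

For the argument 12532.14 of Example 9 the estimate is 10 bits, which corresponds to roughly three decimal digits.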
Summary 1.4
• To avoid loss of significance in subtraction, one may be able to reformulate the expression using rationalizing, series expansions, or mathematical identities.
• If x and y are positive normalized floating-point machine numbers with
$$ 2^{-p} \le 1 - \frac{y}{x} \le 2^{-q} $$
then at most p and at least q significant binary bits are lost in computing x − y.
(Note: It is permissible to leave out the hypothesis x > y here.)
√
1. Howcanvaluesofthefunction f(x)= x+4−2be
computed accurately when x is small?
2. Calculate f (10−2) for the function
f(x)=ex −x−1
The answer should have five significant figures and can
easily be obtained with pencil and paper. Contrast it
with the straightforward evaluation of f (10−2) using e0.01 ≈ 1.0101.
3. What is a good way to compute values of the function f (x) = ex − e if full machine precision is needed?
Note: There is some difficulty when x = 1.
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Exercises 1.4
64
Chapter 1 Mathematical Preliminaries and Floating-Point Representation
a4.
5.
a6.
7.
a8.
9.
a10. a 11.
12.
a 13.
What difficulty could the following assignment cause? y ← 1 − sin x
Circumvent it without resorting to a Taylor series if possible.
14.
15. a 16.
17.
18.
19.
a 20. 21.
a22. 23.
24.
assuming that z is sometimes needed for an x close to zero.
√√ Howcanvaluesofthefunction f(x)= x+2− x
The hyperbolic sine function is defined by sinhx = 1 (ex − e−x ). What drawback could there be in using
√4 x+4−√4 xforpositivex. Find a way to calculate
be computed accurately when x is large? Writeafunctionthatcomputesaccuratevaluesoff(x)=
2
this formula to obtain values of the function? How can
values of sinh x be computed to full machine precision when|x|≦1?
2
Determine the first two nonzero terms in the expansion about zero for the function
tanx −sinx f (x) = √
x− 1+x2 Give an approximate value for f (0.0125).
Find a method for computing
1
y← (sinhx−tanhx)
x
that avoids loss of significance when x is small. Find ap- propriate identities to solve this problem without using Taylor series.
Find a way to calculate accurate values for
√
f(x)= 1+x2−1− x2sinx
x2 x − tan x Determine limx→0 f (x).
For some values of x, the assignment statement y ← 1 − cos x involves a difficulty. What is it, what values of x are involved, and what remedy do you propose?
√
Forsomevaluesofx,thefunction f(x)=
cannot be accurately computed by using this formula. Explain and find a way around the difficulty.
The inverse hyperbolic sine is given by f (x ) = ln x + √2
x + 1 . Show how to avoid loss of significance in computing f (x) when x is negative.
Hint: Find and exploit the relationship between f (x) and
f(−x). Onmostcomputers,ahighlyaccurateroutineforcosxis
provided. It is proposed to base a routine for sin x on the √
formula sin x = ± 1 − cos2 x . From the standpoint of precision (not efficiency), what problems do you foresee and how can they be avoided if we insist on using the routine for cos x ?
Criticize and recode the assignment statement
−x f(x)=(cosx−e )/sinx
correctly. Determine f (0.008) correctly to ten decimal places (rounded).
Without using series, how could the function
sin x f (x) = √
x− x2−1
be computed to avoid loss of significance?
Write a function procedure that returns accurate values of the hyperbolic tangent function
ex −e−x tanhx = ex +e−x
for all values of x . Notice the difficulty when |x | < 1 . 2
Findagoodwaytocomputesinx+cosx−1forx near zero.
Find a good way to compute arctan x − x for x near zero. Findagoodboundfor|sinx−x|usingTaylorseriesand
assuming that |x | < 1 . 10
Howwouldyoucompute(e2x−1)/(2x)toavoidlossof significance near zero?
For any x0 > −1, the sequence defined recursively by
x2 +1−x
xn+1 =2n+11+2−nxn −1
converges to ln(x0 + 1). Arrange this formula in a way
that avoids loss of significance.
Indicate how the following formulas may be useful for arranging computations to avoid loss of significant digits.
aa. sinx−siny=2sin1(x−y)cos1(x+y) 22
(n≧0)
b. logx−logy=log(x/y) c. ex−y =ex/ey
d. 1−cosx=2sin2(x/2)
e. arctanx−arctany=arctan
x−y 1 + xy
z ←
x4 + 4 − 2
25. Whatisagoodwaytocomputetan x−x whenx isnear zero?
or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole deemed that any suppressed content does not materially affect the overall learning experience.
26. Find ways to compute these functions without serious loss of significant figures:
a. $e^x - \sin x - \cos x$
ab. $\ln(x) - 1$
c. $\log x - \log(1/x)$
ad. $x^{-2}(\sin x - e^x + 1)$
e. $x - \operatorname{arctanh} x$
27. Let
$$a(x) = \frac{1 - \cos x}{\sin x}, \qquad b(x) = \frac{\sin x}{1 + \cos x}, \qquad c(x) = \frac{x}{2} + \frac{x^3}{24}$$
Show that b(x) is identical to a(x) and that c(x) approximates a(x) in a neighborhood of zero.
a28. On your computer determine the range of x for which (sin x)/x ≈ 1 with full machine precision.
Hint: Use Taylor series.
a29. The familiar quadratic formula
$$x = \frac{1}{2a}\left(-b \pm \sqrt{b^2 - 4ac}\right)$$
will cause a problem when the quadratic equation $x^2 - 10^5 x + 1 = 0$ is solved with a machine that carries only eight decimal digits. Investigate the example, observe the difficulty, and propose a remedy.
Hint: An example in the text is similar.
a30. When accurate values for the roots of a quadratic equation are desired, some loss of significance may occur if $b^2 \approx 4ac$. What (if anything) can be done to overcome this when writing a computer routine?
31. Refer to the discussion of the function f(x) = x − sin x given in the text. Show that when 0 < x < 1.9, there will be no undue loss of significance from subtraction in Equation (3).
32. Discuss the exercise of computing $\tan(10^{100})$. (See Gleick [1992], p. 178.)
33. Let x and y be two normalized binary floating-point machine numbers. Assume that $x = q \times 2^n$, $y = r \times 2^{n-1}$, $\frac{1}{2} \leqq r, q < 1$, and $2q - 1 \geqq r$. How much loss of significance occurs in subtracting x − y? Answer the same question when 2q − 1 < r. Observe that the Theorem on Loss of Precision is not strong enough to solve this exercise precisely.
34. Prove the first part of the Theorem on Loss of Precision.
35. Show that if x is a machine number on a 32-bit computer that satisfies the inequality $x > \pi 2^{25}$, then sin x will be computed with no significant digits.
36. Let x and y be two positive normalized floating-point machine numbers in a 32-bit computer. Let $x = q \times 2^m$ and $y = r \times 2^n$ with $\frac{1}{2} \leqq r, q < 1$. Show that if n = m, then at least 1 bit of significance is lost in the subtraction x − y.
37. (Student Research Project) Read about and discuss the difference between cancellation error, a bad algorithm, and an ill-conditioned problem.
Suggestion: One example involves the quadratic equation. Read Stewart [1996].
38. On a three-significant-digit computer, calculate $\sqrt{9.01} - 3.00$, with as much accuracy as possible.
Computer Exercises 1.4
a1. Write a routine for computing the two roots $x_1$ and $x_2$ of the quadratic equation $f(x) = ax^2 + bx + c = 0$ with real constants a, b, and c and for evaluating $f(x_1)$ and $f(x_2)$. Use formulas that reduce roundoff errors and write efficient code. Test your routine on the following (a, b, c) values: (0, 0, 1); (0, 1, 0); (1, 0, 0); (0, 0, 0); (1, 1, 0); (2, 10, 1); (1, −4, 3.99999); (1, −8.01, 16.004); $(2 \times 10^{17}, 10^{18}, 10^{17})$; and $(10^{-17}, -10^{17}, 10^{17})$.
2. (Continuation) Write and test a routine for solving a quadratic equation that may have complex roots.
3. Alter and test the pseudocode in the text for computing x − sin x by using nested multiplication to evaluate the series.
4. Write a routine for the function $f(x) = e^x - e^{-2x}$ using the examples in the text for guidance.
5. Write code using double or extended precision to evaluate $f(x) = \cos(10^4 x)$ on the interval [0, 1]. Determine how many significant figures the values of f(x) will have.
6. Write a procedure to compute f(x) = sin x − 1 + cos x. The routine should produce nearly full machine precision for all x in the interval [0, π/4].
Hint: The trigonometric identity $\sin^2\theta = \frac{1}{2}(1 - \cos 2\theta)$ may be useful.
7. Write a procedure to compute $f(x, y) = \int_1^x t^y\,dt$ for arbitrary x and y.
Note: Notice the exceptional case y = −1 and the numerical problem near the exceptional case.
8. Suppose that we wish to evaluate the function $f(x) = (x - \sin x)/x^3$ for values of x close to zero.
a. Write a routine for this function. Evaluate f(x) 16 times. Initially, let x ← 1, and then let $x \leftarrow \frac{1}{10}x$ 15 times. Explain the results.
Note: L'Hôpital's Rule indicates that f(x) should tend to $\frac{1}{6}$. Test this code.
b. Write a function procedure that produces more accurate values of f(x) for all values of x. Test this code.
9. Write a program to print a table of the function $f(x) = 5 - \sqrt{25 + x^2}$ for x = 0 to 1 with steps of 0.01. Be sure that your program yields full machine precision, but do not program the exercise in double precision. Explain the results.
a10. Write a routine that computes $e^x$ by summing n terms of the Taylor series until the (n + 1)st term t is such that $|t| < \varepsilon = 10^{-6}$. Use the reciprocal of $e^x$ for negative values of x. Test on the following data: 0, +1, −1, 0.5, −0.123, −25.5, −1776, 3.14159. Compute the relative error, the absolute error, and n for each case, using the exponential function on your computer system for the exact value. Sum no more than 25 terms.
11. (Continuation) The computation of $e^x$ can be reduced to computing $e^u$ for |u| < (ln 2)/2 only. This algorithm removes powers of 2 and computes $e^u$ in a range where the series converges very rapidly. It is given by
$$e^x = 2^m e^u$$
where m and u are computed by the steps
    $z \leftarrow x/\ln 2$;  $m \leftarrow \text{integer}(z \pm \frac{1}{2})$
    $w \leftarrow z - m$;  $u \leftarrow w \ln 2$
Here the minus sign is used if x < 0 because z < 0. Incorporate this range reduction technique into the code.
12. (Continuation) Write a routine that uses range reduction $e^x = 2^m e^u$ and computes $e^u$ from the even part of the Gaussian continued fraction; that is,
$$e^u = \frac{s + u}{s - u}, \qquad s = 2 + u^2\,\frac{2520 + 28u^2}{15120 + 420u^2 + u^4}$$
Test on the data given in Computer Exercise 1.4.10.
Note: Some of the computer exercises in this section contain rather complicated algorithms for computing various intrinsic functions that correspond to those actually used on a large mainframe computer system. Descriptions of these and other similar library functions are frequently found in the supporting documentation of your computer system.
13. Quite important in many numerical calculations is the accurate computation of the absolute value |z| of a complex number z = a + bi. Design and carry out a computer experiment to compare the following three schemes:
a. $|z| = (a^2 + b^2)^{1/2}$
b. $|z| = v\left[1 + \left(\dfrac{w}{v}\right)^2\right]^{1/2}$
c. $|z| = 2v\left[\dfrac{1}{4} + \left(\dfrac{w}{2v}\right)^2\right]^{1/2}$
where v = max{|a|, |b|} and w = min{|a|, |b|}. Use very small and large numbers for the experiment.
14. For what range of x is the approximation $(e^x - 1)/2x \approx 0.5$ correct to 15 decimal digits of accuracy? Using this information, write a function procedure for $(e^x - 1)/2x$, producing 15 decimals of accuracy throughout the interval [−10, 10].
a15. In the theory of Fourier series, some numbers known as Lebesgue constants play a role. A formula for them is
$$\rho_n = \frac{1}{2n + 1} + \frac{2}{\pi}\sum_{k=1}^{n} \frac{1}{k}\tan\frac{\pi k}{2n + 1}$$
Write and run a program to compute $\rho_1, \rho_2, \ldots, \rho_{100}$ with eight decimal digits of accuracy. Then test the validity of the inequality
$$0 \leqq \frac{4}{\pi^2}\ln(2n + 1) + 1 - \rho_n \leqq 0.0106$$
16. Compute in double or extended precision the following number:
$$x = \left[\frac{1}{\pi}\ln(640320^3 + 744)\right]^2$$
What is the point of this exercise?
17. Write a routine to compute sin x for x in radians as follows. First, using properties of the sine function, reduce the range so that −π/2 ≦ x ≦ π/2. Then if $|x| < 10^{-8}$, set sin x ≈ x; if |x| > π/6, set u = x/3, compute sin u by the formula below, and then set $\sin x \approx [3 - 4\sin^2 u]\sin u$; if |x| ≦ π/6, set u = x and compute sin u as follows:
$$\sin u \approx u\,\frac{1 - \dfrac{29593}{207636}u^2 + \dfrac{34911}{7613320}u^4 - \dfrac{479249}{11511339840}u^6}{1 + \dfrac{1671}{69212}u^2 + \dfrac{97}{351384}u^4 + \dfrac{2623}{1644477120}u^6}$$
Try to determine whether the sine function on your computer system uses this algorithm.
Note: This is the Padé rational approximation for sine.
18. Write a routine to compute the natural logarithm by the algorithm outlined here based on telescoped rational and Gaussian continued fractions for ln x and test for several values of x. First check whether x = 1 and return zero if so. Reduce the range of x by determining n and r such that $x = r \times 2^n$ with $\frac{1}{2} \leqq r < 1$. Next, set $u = (r - \sqrt{2}/2)/(r + \sqrt{2}/2)$, and compute ln[(1 + u)/(1 − u)] by the approximation
[…]
$$\tan u \approx u\,\frac{135135 - 17336.106u^2 + 379.23564u^4 - 1.0118625u^6}{135135 - 62381.106u^2 + 3154.9377u^4 - 28.17694u^6}$$
Finally, if |x| > π/4, set tan x ≈ 1/tan u; if |x| ≦ π/4, set tan x ≈ tan u.
Note: This algorithm is obtained from the telescoped rational and Gaussian continued fraction for the tangent function.
20. Write a routine to compute arcsin x based on the following algorithm, using telescoped polynomials for the arcsine. If $|x| < 10^{-8}$, set arcsin x ≈ x. Otherwise, if $0 \leqq x \leqq \frac{1}{2}$, set u = x, a = 0, and b = 1; if $\frac{1}{2}$ […]
[…] $+\,0.009342806u^{16} + 0.01835667u^{18} - 0.01186224u^{20} + 0.03162712u^{22}$
Finally, set arcsin x ≈ a + b arcsin u. Test this routine for various values of x.
21. Write and test a routine to compute arctan x for x in radians as follows. If $0 \leqq x \leqq 1.7 \times 10^{-9}$, set arctan x ≈ x. If $1.7 \times 10^{-9} < x \leqq 2 \times 10^{-2}$, use the series approximation
$$\arctan x \approx x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7}$$
Otherwise, set y = x, a = 0, and b = 1 if 0 ≦ x ≦ 1; set y = 1/x, a = π/2, and b = −1 if 1 […]
Finally, set arctan x ≈ a + b(c + arctan u).
Note: This algorithm uses telescoped rational and Gaussian continued fractions.
22. A fast algorithm for computing arctan x to n-bit precision for x in the interval (0, 1] is as follows: Set $a = 2^{-n/2}$, $b = x/(1 + \sqrt{1 + x^2})$, c = 1, and d = 1. Then repeatedly update these variables by these formulas (in order from left to right and top to bottom):
[…]
After each sweep, print f = c ln[(1 + b)/(1 − b)]. Stop when $1 - a \leqq 2^{-n}$. Write a double-precision routine to implement this algorithm and test it for various values of x. Compare the results to those obtained from the arctangent function on your computer system.
Note: This fast multiple-precision algorithm depends on the theory of elliptic integrals, using the arithmetic-geometric mean iteration and ascending Landen transformations. Other fast algorithms for trigonometric functions are discussed in Brent [1976].
On your computer, show that in single precision, you have only six decimal digits of accuracy if you enter 20 digits. Show that going to double precision is effective only if all work is done in double precision. For example, if you use
                j ← i
            end if
        end for
        $l_j \leftrightarrow l_k$
        for i = k + 1 to n
            xmult ← $a_{l_i,k}/a_{l_k,k}$
            $a_{l_i,k}$ ← xmult
            for j = k + 1 to n
                $a_{l_i,j} \leftarrow a_{l_i,j} - (\text{xmult})\,a_{l_k,j}$
            end for
        end for
    end for
    deallocate array $(s_i)$
end procedure Gauss
Gauss Procedure Pseudocode
Discussion of Pseudocode
A detailed explanation of the above procedure is now presented. In the first loop, the initial form of the index array is being established, namely, $l_i = i$. Then the scale array $(s_i)$ is computed.
The statement for k = 1 to n − 1 initiates the principal outer loop. The index k is the subscript of the variable whose coefficients will be made 0 in the array $(a_{ij})$; that is, k is the index of the column in which new 0's are to be created. Remember that the 0's in the array $(a_{ij})$ do not actually appear because those storage locations are used for the multipliers. This fact can be seen in the line of the procedure where xmult is stored in the array $(a_{ij})$. (See Section 8.1 on the LU factorization of A for why this is done.)
Once k has been set, the first task is to select the correct pivot row, which is done by computing $|a_{l_i k}|/s_{l_i}$ for i = k, k + 1, …, n. The next set of lines in the pseudocode is calculating this greatest ratio, called rmax in the routine, and the index j where it occurs. Next, $l_k$ and $l_j$ are interchanged in the array $(l_i)$.
The arithmetic modifications in the array $(a_{ij})$ due to subtracting multiples of row $l_k$ from rows $l_{k+1}, l_{k+2}, \ldots, l_n$ all occur in the final lines. First the multiplier is computed and stored; then the subtraction occurs in a loop.
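The forward elimination just described can be sketched in Python as follows. This is only an illustrative translation, not the book's code: it uses 0-based indices, the function name `gauss` and the 4 × 4 test matrix are assumptions chosen for the example, and it omits any check for a zero scale factor or zero pivot.

```python
def gauss(a):
    """Forward elimination with scaled partial pivoting, done in place.

    `a` is a list of n rows, each a list of n floats. On return, the
    multipliers sit where zeros would have been created, and the returned
    index list `l` records the order in which rows were used as pivots.
    """
    n = len(a)
    l = list(range(n))                            # index array, l[i] = i
    s = [max(abs(v) for v in row) for row in a]   # scale array (row maxima)
    for k in range(n - 1):
        # Pivot row: first row in index order with the largest scaled ratio.
        j = max(range(k, n), key=lambda i: abs(a[l[i]][k]) / s[l[i]])
        l[j], l[k] = l[k], l[j]
        for i in range(k + 1, n):
            xmult = a[l[i]][k] / a[l[k]][k]
            a[l[i]][k] = xmult                    # store multiplier in place of the zero
            for col in range(k + 1, n):
                a[l[i]][col] -= xmult * a[l[k]][col]
    return l


# Illustrative test matrix (an assumption for this sketch).
a = [[3.0, -13.0, 9.0, 3.0],
     [-6.0, 4.0, 1.0, -18.0],
     [6.0, -2.0, 2.0, 4.0],
     [12.0, -8.0, 6.0, 10.0]]
pivot_order = gauss(a)
print(pivot_order)  # [2, 0, 1, 3]
```

Note that the rows of `a` are never moved; only the entries of the index list change, which mirrors the role of the array $(l_i)$ in the pseudocode.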
Warning
Caution: Values in array $(a_{ij})$ that result as output from procedure Gauss are not the same as those in array $(a_{ij})$ at input. If the original array must be retained, store a duplicate of it in another array.
In the procedure Naive Gauss for naive Gaussian elimination from Section 2.1, the right-hand side b was modified during the forward elimination phase; however, this was not done in the procedure Gauss. Therefore, we need to update b before considering the back substitution phase. For simplicity, we discuss updating b for the naive forward elimination first. Stripping out the pseudocode from Naive Gauss that involves the $(b_i)$ array in the forward elimination phase, we obtain
    for k = 1 to n − 1
        for i = k + 1 to n
            $b_i = b_i - a_{ik} b_k$
        end for
    end for
Naive Forward Elimination on rhs
This updates the $(b_i)$ array based on the stored multipliers from the $(a_{ij})$ array. When scaled partial pivoting is done in the forward elimination phase, such as in procedure Gauss, the multipliers for each step are not one below another in the $(a_{ij})$ array, but are jumbled around. To unravel this situation, all we have to do is introduce the index array $(l_i)$ into the above pseudocode:
    for k = 1 to n − 1
        for i = k + 1 to n
            $b_{l_i} = b_{l_i} - a_{l_i,k}\, b_{l_k}$
        end for
    end for
Modified Forward Elimination on rhs
After the array b has been processed in the forward elimination, the back substitution process is carried out. It begins by solving the equation
$$a_{l_n,n}\, x_n = b_{l_n} \tag{6}$$
whence
$$x_n = \frac{b_{l_n}}{a_{l_n,n}}$$
Then the equation
$$a_{l_{n-1},n-1}\, x_{n-1} + a_{l_{n-1},n}\, x_n = b_{l_{n-1}}$$
is solved for $x_{n-1}$:
$$x_{n-1} = \frac{1}{a_{l_{n-1},n-1}}\left(b_{l_{n-1}} - a_{l_{n-1},n}\, x_n\right)$$
Back Substitution Process
After $x_n, x_{n-1}, \ldots, x_{i+1}$ have been determined, $x_i$ is found from the equation
$$a_{l_i,i}\, x_i + a_{l_i,i+1}\, x_{i+1} + \cdots + a_{l_i,n}\, x_n = b_{l_i}$$
whose solution is
$$x_i = \frac{1}{a_{l_i,i}}\left(b_{l_i} - \sum_{j=i+1}^{n} a_{l_i,j}\, x_j\right) \tag{7}$$
Except for the presence of the index array $l_i$, this is similar to the back substitution formula (7) in Section 2.1 (p. 76) obtained for naive Gaussian elimination.
Solve Procedure Pseudocode
The procedure for processing the array b and performing the back substitution phase is given next:
procedure Solve(n, $(a_{ij})$, $(l_i)$, $(b_i)$, $(x_i)$)
integer i, j, k, n; real sum
integer array $(l_i)_{1:n}$
real array $(a_{ij})_{1:n \times 1:n}$, $(b_i)_{1:n}$, $(x_i)_{1:n}$
for k = 1 to n − 1
    for i = k + 1 to n
        $b_{l_i} \leftarrow b_{l_i} - a_{l_i,k}\, b_{l_k}$
    end for
end for
$x_n \leftarrow b_{l_n}/a_{l_n,n}$
for i = n − 1 to 1 step −1
    sum ← $b_{l_i}$
    for j = i + 1 to n
        sum ← sum − $a_{l_i,j}\, x_j$
    end for
    $x_i$ ← sum/$a_{l_i,i}$
end for
end procedure Solve
Here, the first loop carries out the forward elimination process on array (bi ), using arrays (ai j ) and (li ) that result from procedure Gauss. The next line carries out the solution of Equation (6). The final part carries out Equation (7). The variable sum is a temporary variable for accumulating the terms in parentheses.
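As a rough illustration of procedure Solve, here is a Python sketch with 0-based indices. The function name `solve` and the sample data are assumptions made for this example: `a_factored` and `l` are hand-computed stand-ins for the output of a scaled-partial-pivoting factorization of a 4 × 4 system, and no guard against a zero pivot is included.

```python
def solve(a, l, b):
    """Update b with the stored multipliers, then back-substitute.

    `a` holds the factored array (multipliers below the pivots),
    `l` is the pivot index list, and `b` is the right-hand side.
    """
    n = len(a)
    b = list(b)                                   # work on a copy of b
    # Forward elimination on the right-hand side, Equation-(6)-style indexing.
    for k in range(n - 1):
        for i in range(k + 1, n):
            b[l[i]] -= a[l[i]][k] * b[l[k]]
    # Back substitution, Equation (7).
    x = [0.0] * n
    x[n - 1] = b[l[n - 1]] / a[l[n - 1]][n - 1]
    for i in range(n - 2, -1, -1):
        total = b[l[i]]
        for j in range(i + 1, n):
            total -= a[l[i]][j] * x[j]
        x[i] = total / a[l[i]][i]
    return x


# Hypothetical factored data for illustration (multipliers stored in place).
a_factored = [[0.5, -12.0, 8.0, 1.0],
              [-1.0, -1 / 6, 13 / 3, -83 / 6],
              [6.0, -2.0, 2.0, 4.0],
              [2.0, 1 / 3, -2 / 13, -6 / 13]]
l = [2, 0, 1, 3]
b = [-19.0, -34.0, 16.0, 26.0]
x = solve(a_factored, l, b)
print(x)  # approximately [3, 1, -2, 1]
```

Because the right-hand side is processed separately from the factorization, the same factored array can be reused for several right-hand sides at a cost of about $n^2$ long operations each.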
As with most pseudocode in this book, those in this chapter contain only the basic ingredients for good mathematical software. They are not suitable as production code for various reasons. For example, procedures for optimizing code are ignored. Furthermore, the procedures do not give warnings for difficulties that may be encountered, such as division by zero! General-purpose software should be robust; that is, it should anticipate every possible situation and deal with each in a prescribed way. (See Computer Exercise 2.2.11.)
Long Operation Count
Solving large systems of linear equations can be expensive in terms of computer time. To understand why, let us perform an operation count on the two algorithms whose codes have been given. We count only multiplications and divisions (long operations) because they are more time consuming than addition/subtraction. Furthermore, we lump multiplications and divisions together even though division is slower than multiplication. In modern computers, all floating-point operations are done in hardware, so long operations may not be as significant, but this still gives an indication of the operational cost of Gaussian elimination.
Consider first procedure Gauss. In Step 1, the choice of a pivot element requires the calculation of n ratios, that is, n divisions. Then for rows $l_2, l_3, \ldots, l_n$, we first compute a multiplier and then subtract from row $l_i$ that multiplier times row $l_1$. The zero that is being created in this process is not computed. So the elimination requires n − 1 multiplications per row. If we include the calculation of the multiplier, there are n long operations (divisions or multiplications) per row. There are n − 1 rows to be processed for a total of n(n − 1) operations. If we add the cost of computing the ratios, a total of $n^2$ operations is needed for Step 1.
Step 2 is like Step 1 except that row $l_1$ is not affected, nor is the column of multipliers created and stored in Step 1. So Step 2 requires $(n - 1)^2$ multiplications or divisions because it operates on a system without row $l_1$ and without column 1. Continuing this reasoning, we conclude that the total number of long operations for procedure Gauss is
Gauss ops Count
$$n^2 + (n-1)^2 + (n-2)^2 + \cdots + 4^2 + 3^2 + 2^2 = \frac{n}{6}(n+1)(2n+1) - 1 \approx \frac{n^3}{3} = O(n^3)$$
(The derivation of this formula is outlined in Exercise 2.2.16.) Note that the number of long operations in this procedure grows like $n^3/3$, the dominant term.
Now consider procedure Solve. The forward processing of the array $(b_i)$ involves n − 1 steps. Step 1 contains n − 1 multiplications, Step 2 contains n − 2 multiplications, and so on. The total of the forward processing of array $(b_i)$ is thus
Solve ops Count
$$(n-1) + (n-2) + \cdots + 3 + 2 + 1 = \frac{n}{2}(n-1)$$
(See Exercise 2.2.15.) In the back substitution procedure, one long operation is involved in Step 1, two in Step 2, and so on. The total is
$$1 + 2 + 3 + \cdots + n = \frac{n}{2}(n+1) = O(n^2)$$
Thus, procedure Solve involves altogether $n^2$ long operations. To summarize:
■ Theorem 1
Theorem on Long Operations
The forward elimination phase of the Gaussian elimination algorithm with scaled partial pivoting, if applied only to the n×n coefficient array, involves approximately $n^3/3$ long operations (multiplications or divisions). Solving for x requires an additional $n^2$ long operations.
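The counts above can be checked by tallying long operations loop by loop. The following Python sketch does only the bookkeeping (no arithmetic on matrices); the function names are hypothetical, and the tally follows the counting convention of the text, in which each ratio, multiplier, and row-update multiplication is one long operation.

```python
def gauss_long_ops(n):
    """Count multiplications/divisions in the forward elimination."""
    ops = 0
    for k in range(1, n):                 # steps k = 1, ..., n-1
        ops += n - k + 1                  # ratios |a|/s for rows k..n (divisions)
        for _ in range(k + 1, n + 1):     # each remaining row
            ops += 1                      # one division for the multiplier
            ops += n - k                  # multiplications for columns k+1..n
    return ops


def solve_long_ops(n):
    """Count multiplications/divisions in processing b and back substitution."""
    forward = sum(n - k for k in range(1, n))         # updates of b
    back = sum(n - i + 1 for i in range(1, n + 1))    # (n-i) products + 1 division
    return forward + back


for n in (2, 5, 10, 100):
    closed_form = n * (n + 1) * (2 * n + 1) // 6 - 1
    print(n, gauss_long_ops(n), closed_form, solve_long_ops(n), n * n)
```

For n = 100 the elimination count is 338,349, already close to $n^3/3 \approx 333{,}333$, while the solve phase costs exactly 10,000.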
Remarks on Numerical Stability
Brief History of Gaussian Elimination
An intuitive way to think of this result is that the Gaussian elimination algorithm involves a triply nested for-loop. So an O(n3) algorithmic structure is driving the elimination process, and the work is heavily influenced by the cube of n (the number of equation unknowns).
Numerical Stability
The numerical stability of a numerical algorithm is related to the accuracy of the procedure. An algorithm can have different levels of numerical stability because many computations can be achieved in various ways that are algebraically equivalent, but may produce different results. A robust numerical algorithm with a high level of numerical stability is desirable. Gaussian elimination is numerically stable for strictly diagonally dominant matrices or symmetric positive definite matrices. (These properties are discussed in Section 2.3 and Chapter 8, respectively.) For matrices with a general dense structure, Gaussian elimination with partial pivoting is usually numerically stable in practice. Nevertheless, there exist unstable pathological examples in which it may fail. For additional details, see Golub and Van Loan [1996] and Higham [2002].
An early version of Gaussian elimination was found in the Chinese mathematics text (jiuzhang suanshu or The Nine Chapters on the Mathematical Art, Chapter 8, Rectangular Arrays). The method was illustrated in 18 problems with two to five equations. The first reference to this book is dated 179 C.E., but parts of it were written as early as approximately 150 B.C.E. In Europe, Isaac Newton's notes on solving simultaneous equations were published in 1707. In 1816, Carl Friedrich Gauss devised a notation for symmetric elimination. Because of some confusion over its history, the Gaussian elimination method was named for Gauss in the 1950s.
Discussion on Scaling
Scaling
Readers should not confuse scaling in Gaussian elimination (which is not recommended) with our discussion of scaled partial pivoting in Gaussian elimination.
The word scaling has more than one meaning. It could mean actually dividing each row by its maximum element in absolute value. We certainly do not advocate that. In other words, we do not recommend scaling of the matrix at all! However, we do compute a scale array and use it in selecting the pivot element in Gaussian elimination with scaled partial pivoting. We do not actually scale the rows; we just keep a vector of the “row infinity norms,” that is, the maximum element in absolute value for each row. This and the need for a vector of indices to keep track of the pivot rows makes the algorithm somewhat complicated, but that is the price to be paid for some degree of robustness in the procedure.
The simple 2 × 2 system in Equations (2) and (4) shows that scaling does not help in choosing a good pivot row. In this example, scaling is of no use. Scaling of the rows is contemplated in Exercise 2.2.23 and Computer Exercise 2.2.17. Notice that this procedure requires at least $n^2$ arithmetic operations. Again, we are not recommending it for a general-purpose code.
Some codes actually move the rows around in storage. Because that should not be done in practice, we do not do it in the code, since it might be misleading. Also, to avoid misleading the casual reader, we called our initial algorithm (in Section 2.1) naive, hoping that nobody would mistake it for a reliable code.
Variants of Gaussian Elimination
There are four ways to view Gaussian elimination:
• as the elimination of variables in a linear system,
• as row operations on a linear system,
• as a transformation of a matrix into triangular form by using elementary lower triangular matrices,
• as the factorization (or decomposition) of a matrix into the product of lower and upper triangular factors.
Variant of Gaussian Elimination
Pivot/Pivoting
Since these approaches are related, they yield variants of the Gaussian elimination algorithm, each having its own advantages and disadvantages in specific applications. As we have seen, it is easy to incorporate pivoting into Gaussian elimination. Some of the standard terminology used is as follows. At the k-th stage of the algorithm, the element $a_{kk}$ is the pivot element or simply the pivot. The process of performing interchanges of rows or columns is pivoting, and it alters the selection of the pivot.
The process of selecting pivots has two aspects: where the pivots come from and how the pivots are chosen. The details of pivoting depend on the algorithm used and its application. It is useful to note that Gaussian elimination with pivoting is equivalent to making all the interchanges in the original matrix first and then performing Gaussian elimination without pivoting on the resulting matrix. This is nice in theory, but in practice, this is easier said than done!
Partial Pivoting for Size
Pivoting for Sparsity
Classical Gaussian Elimination
Since the basic Gaussian elimination algorithm can fail when a division by zero is encountered, row and/or column interchanges may be used to avoid this difficulty; in other words, pivoting. Although pivoting is a simple idea, it is a nontrivial matter to decide which element to use as a pivot! Of all the pivoting strategies, by far the most common is partial pivoting for size, which means selecting a pivot from a set of candidates by choosing the one that is largest in magnitude. (This usage is natural for dense matrices where pivoting for size utilizes a norm.) In fact, Gaussian elimination and variations of it are some of the most frequently used algorithms in computational mathematics!
A major virtue of Gaussian elimination is its ability to be adapted to special structured matrices such as sparse or banded matrices. The introduction of nonzero elements in the place of a zero element is called fill-in. In many applications involving large sparse linear systems, the coefficient matrix has predominantly zero elements. Clearly, the choice of the pivot strategy influences the amount of fill-in. Most algorithms for sparse matrices use a pivoting strategy that reduces fill-in, called pivoting for sparsity. Unfortunately, pivoting for size and pivoting for sparsity can be at odds with one another!
Classical Gaussian elimination can be thought of as expanding the L U decomposition or factorization of the coefficient matrix A in a linear system. In Figure 2.2, the shaded area represents the part of the LU decomposition that has already been computed with L and U separated by a diagonal line. The thin horizontal rectangular area represents the next partial row of U to be computed; whereas the thin vertical rectangular area represents the corresponding partial column elements of L. These two rectangles overlap, reflecting the fact that the diagonal elements of L are 1’s, which are not stored. (We discuss LU factorization more in Section 8.1.) For a given matrix, the operations for classical Gaussian elimination are well known. However, there is considerable freedom in how these operations can be interleaved one with another. Moreover, each style of interleaving gives rise to a variant of the basic algorithm.
FIGURE 2.2 Classical Gaussian elimination A = LU (no pivoting). Diagram labels: A, b, U (expanding), Pivot.
Partial Pivoting
Partial pivoting involves row interchanges. To minimize the roundoff errors, the row that moves the largest pivot to the diagonal position is chosen. A pivot element is selected from the left thin vertical rectangular area in Figure 2.3 (p. 96), and an interchange of that entire row in the A array is done with the pivot row (as indicated by the arrows).
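A minimal Python sketch of partial pivoting with explicit row interchanges is given below. It is not the book's procedure (which uses an index array and scale factors instead of moving rows); the function name and the 2 × 2 test system with a tiny leading coefficient are assumptions chosen to show why the largest candidate is moved to the diagonal.

```python
def solve_partial_pivoting(a, b):
    """Solve a x = b by elimination with partial pivoting (rows are swapped)."""
    n = len(a)
    a = [row[:] for row in a]             # work on copies
    b = list(b)
    for k in range(n - 1):
        # Choose the row whose entry in column k is largest in magnitude.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        total = b[i] - sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = total / a[i][i]
    return x


# Tiny pivot in the (1,1) position; the true solution is very nearly (1, 1).
x = solve_partial_pivoting([[1e-20, 1.0], [1.0, 1.0]], [1.0, 2.0])
print(x)  # [1.0, 1.0]
```

Without the interchange, the multiplier would be $10^{20}$ and the computed first component would be 0 in double precision, so the swap is what preserves the answer here.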
FIGURE 2.3 Partial pivoting Gaussian elimination. Diagram labels: A, b, Pivot.
Scaled Partial Pivoting
Full Pivoting
Scaled partial pivoting is a rather clever modification of partial pivoting that simulates full pivoting by using an index vector and a scale vector containing information about the relative sizes of elements in each row.
Complete (full) pivoting is more complicated, involving exchanges of both rows and columns, which can change the order of the unknowns. For increased stability, the largest possible pivot is sought, requiring a search in the entire submatrix as shown in Figure 2.4. Full pivoting is less susceptible to roundoff errors, but this increase in numerical stability comes at the cost of an increase in the work associated with searching and in the amount of data movement involved. The general feeling is that the benefits of full pivoting are not worth the extra effort!
FIGURE 2.4 Complete pivoting Gaussian elimination. Diagram labels: A, b, Pivot, Interchange rows, Interchange columns.
Backward Stable
In practice, applying Gaussian elimination with partial pivoting and back substitution gives the exact solution to a nearby problem, which is exactly the right answer to nearly the right question! (See Trefethen and Bau [1997].) Such an algorithm is called backward stable. Gaussian elimination without partial pivoting is not backward stable for a linear system with a general coefficient matrix A, but it is if A is symmetric and positive definite.
In 1810, Gauss gave the first layout, up to diagonal scaling, of the algorithm now known as “classical Gaussian elimination.” The terms “partial pivoting” and “complete pivoting” are attributable to Wilkinson [1965]. There are four or more algorithmic variants for classical Gaussian elimination. For additional details, see Stewart [1998b].
Condition Number
Often mathematical software for solving linear systems returns not only the approximate solution but also the condition number of the linear system. Use the following rules to interpret these results.
■ Rule
Rules of Thumb
1. The condition number κ(A) (p. 406) indicates how close A is to being numerically singular (non-invertible).
2. In practice, applying Gaussian elimination with a variant of partial pivoting and back substitution to solve Ax = b yields a numerical solution $\tilde{x}$ such that the residual vector $r = b - A\tilde{x}$ is small even if the condition number κ(A) is large.
3. If κ(A) is large, A is ill-conditioned, and even the best numerical algorithm produces a solution that cannot be guaranteed to be close to the true solution.
4. If A and b are stored to machine precision $\varepsilon_m$, the numerical solution to Ax = b by any variant of Gaussian elimination is correct to $d = |\log_{10} \varepsilon_m| - \log_{10} \kappa(A)$ digits.
MATLAB Backslash
See Section 8.4 for more details on these topics.
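The rules of thumb can be tried on a small ill-conditioned example. The sketch below is an illustration under stated assumptions: it uses the 8 × 8 Hilbert matrix, the infinity norm for the condition number, and helper names (`hilbert`, `inverse_exact`, `solve_float`) that are invented for this example. The inverse is computed in exact rational arithmetic so that the condition number itself is not contaminated by roundoff.

```python
from fractions import Fraction
import math
import sys


def hilbert(n):
    """Hilbert matrix with exact rational entries 1/(i+j-1), 1-based i, j."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def inverse_exact(A):
    """Gauss-Jordan inversion over the rationals (no roundoff)."""
    n = len(A)
    M = [row[:] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        p = next(i for i in range(k, n) if M[i][k] != 0)
        M[k], M[p] = M[p], M[k]
        piv = M[k][k]
        M[k] = [v / piv for v in M[k]]
        for i in range(n):
            if i != k and M[i][k] != 0:
                f = M[i][k]
                M[i] = [u - f * v for u, v in zip(M[i], M[k])]
    return [row[n:] for row in M]


def norm_inf(A):
    """Matrix infinity norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)


def solve_float(A, b):
    """Gaussian elimination with partial pivoting in double precision."""
    n = len(A)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


n = 8
H = hilbert(n)
b = [sum(row) for row in H]          # exact solution is all ones
x = solve_float(H, b)

kappa = float(norm_inf(H) * norm_inf(inverse_exact(H)))
Hf = [[float(v) for v in row] for row in H]
bf = [float(v) for v in b]
residual = max(abs(bf[i] - sum(Hf[i][j] * x[j] for j in range(n)))
               for i in range(n))
error = max(abs(xi - 1.0) for xi in x)
digits = abs(math.log10(sys.float_info.epsilon)) - math.log10(kappa)

print("kappa    =", kappa)
print("residual =", residual)   # small, as in Rule 2
print("error    =", error)      # much larger than the residual (Rule 3)
print("digits   =", digits)     # estimate from Rule 4
```

The printed residual and error illustrate the contrast between Rules 2 and 3; the exact figures depend on the machine arithmetic, so none are quoted here.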
Backslash Operator in MATLAB
The system of equations $Ax = b$ has the formal solution $x = A^{-1}b$. In MATLAB notation, the system is solved with the backslash command: x = A\b. The software attempts to solve the system with the method that gives the least roundoff error and fewest operations. When $A$ is an $n \times n$ matrix, MATLAB examines $A$ to see:
1. Whether it is a permutation of a triangular system; if so, the appropriate triangular solve is used.
2. Whether it appears to be symmetric and positive definite; if so, a Cholesky factorization and two triangular solves are attempted.
3. If Cholesky factorization fails or if $A$ does not appear to be symmetric, an LU factorization and two triangular solves are attempted.
If $A$ is $n \times m$ with $n \ne m$, MATLAB attempts to solve the system using the appropriate algorithms, some of which are discussed later.
Other mathematical software systems such as Maple and Mathematica have similar multi-algorithmic procedures or hybrid schemes.
Summary 2.2
• In performing Gaussian elimination, partial pivoting is highly recommended to avoid zero pivots and small pivots. In Gaussian elimination with scaled partial pivoting, we use a scale vector $s = [s_1, s_2, \ldots, s_n]^T$ in which
\[ s_i = \max_{1 \le j \le n} |a_{ij}| \qquad (1 \le i \le n) \]
and an index vector $l = [l_1, l_2, \ldots, l_n]$, initially set as $l = [1, 2, \ldots, n]$. The scale array is set once at the beginning of the algorithm. The elements in the index array are interchanged rather than the rows of the matrix $A$, which reduces the amount of data movement considerably. The key step in the pivoting procedure is to select $j$ to be the first index associated with the largest ratio in the set
\[ \left\{ \frac{|a_{l_i,k}|}{s_{l_i}} : k \le i \le n \right\} \]
and interchange $l_j$ with $l_k$ in the index array $l$. Then use multipliers
\[ \frac{a_{l_i,k}}{a_{l_k,k}} \]
times row $l_k$ and subtract from equations $l_i$ for $k + 1 \le i \le n$.
• The forward elimination applied to equation $l_i$, for $k + 1 \le i \le n$, is
\[ a_{l_i,j} \leftarrow a_{l_i,j} - \left(a_{l_i,k}/a_{l_k,k}\right) a_{l_k,j} \qquad (k \le j \le n) \]
\[ b_{l_i} \leftarrow b_{l_i} - \left(a_{l_i,k}/a_{l_k,k}\right) b_{l_k} \]
The steps involving the vector $b$ are usually done separately just before the back substitution phase, which we call updating the right-hand side.
• The back substitution is
\[ x_i = \frac{1}{a_{l_i,i}} \left( b_{l_i} - \sum_{j=i+1}^{n} a_{l_i,j} x_j \right) \qquad (i = n, n-1, n-2, \ldots, 1) \]
• For an $n \times n$ system of linear equations $Ax = b$, the forward elimination phase of Gaussian elimination with scaled partial pivoting involves approximately $n^3/3$ long operations (multiplications or divisions), whereas updating the right-hand side and back substitution together require only $n^2$ long operations.
a1. Show how Gaussian elimination with scaled partial pivoting works on the following matrix A:
\[ \begin{bmatrix} 2 & 3 & -4 & 1 \\ 1 & -1 & 0 & -2 \\ 3 & 3 & 4 & 3 \\ 4 & 1 & 0 & 4 \end{bmatrix} \]
a2. Solve the following system using Gaussian elimination with scaled partial pivoting:
\[ \begin{bmatrix} 1 & -1 & 2 \\ -2 & 1 & -1 \\ 4 & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -2 \\ 2 \\ -1 \end{bmatrix} \]
Show intermediate matrices at each step.
a3. Carry out Gaussian elimination with scaled partial pivoting on the matrix
\[ \begin{bmatrix} 1 & 0 & 3 & 0 \\ 0 & 1 & 3 & -1 \\ 3 & -3 & 0 & 6 \\ 0 & 2 & 4 & -6 \end{bmatrix} \]
Show intermediate matrices.
4. Consider the matrix
\[ \begin{bmatrix} -0.0013 & 56.4972 & 123.4567 & 987.6543 \\ 0.0000 & -0.0145 & 8.8990 & 833.3333 \\ 0.0000 & 102.7513 & -7.6543 & 69.6869 \\ 0.0000 & -1.3131 & -9876.5432 & 100.0001 \end{bmatrix} \]
Identify the entry that is used as the next pivot element for naive Gaussian elimination, for Gaussian elimination
Exercises 2.2
a5.
a6.
7.
a
9.
a 10.
# # #
# # 0
# # #
0 # 0 # 0
8.
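a.
\[ \begin{aligned} 3x_1 + 4x_2 + 3x_3 &= 10 \\ x_1 + 5x_2 - x_3 &= 7 \\ 6x_1 + 3x_2 + 7x_3 &= 15 \end{aligned} \]
ab.
\[ \begin{aligned} 3x_1 + 2x_2 - 5x_3 &= 0 \\ 2x_1 - 3x_2 + x_3 &= 0 \\ x_1 + 4x_2 - x_3 &= 4 \end{aligned} \]
c.
\[ \begin{bmatrix} 1 & -1 & 2 & 1 \\ 3 & 2 & 1 & 4 \\ 5 & 8 & 6 & 3 \\ 4 & 2 & 5 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \end{bmatrix} \]
ad.
\[ \begin{aligned} 3x_1 + 2x_2 - x_3 &= 7 \\ 5x_1 + 3x_2 + 2x_3 &= 4 \\ -x_1 + x_2 - 3x_3 &= -1 \end{aligned} \]
e.
\[ \begin{aligned} x_1 + 3x_2 + 2x_3 + x_4 &= -2 \\ 4x_1 + 2x_2 + x_3 + 2x_4 &= 2 \\ 2x_1 + x_2 + 2x_3 + 3x_4 &= 1 \\ x_1 + 2x_2 + 4x_3 + x_4 &= -1 \end{aligned} \]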
with partial pivoting (the scale vector is [1, 1, 1, 1]), and for Gaussian elimination with scaled partial pivoting (the scale vector is [987.6543, 46.79, 256.29, 1.096]).
Without using a computer, determine the final contents of the array $(a_{ij})$ after procedure Gauss has processed the following array. Indicate the multipliers by underlining them.
\[ \begin{bmatrix} 1 & 3 & 2 & 1 \\ 4 & 2 & 1 & 2 \\ 2 & 1 & 2 & 3 \\ 1 & 2 & 4 & 1 \end{bmatrix} \]
If the Gaussian elimination algorithm with scaled partial pivoting is used on the matrix shown, what is the scale vector? What is the second pivot row?
\[ \begin{bmatrix} 4 & 7 & 3 \\ 1 & 3 & 2 \\ 2 & -4 & -1 \end{bmatrix} \]
If the Gaussian elimination algorithm with scaled partial pivoting is used on the example shown, which row is selected as the third pivot row?
\[ \begin{bmatrix} 8 & -1 & 4 & 9 & 2 \\ 1 & 0 & 3 & 9 & 7 \\ -5 & 0 & 1 & 3 & 5 \\ 4 & 3 & 2 & 2 & 7 \\ 3 & 0 & 0 & 0 & 9 \end{bmatrix} \]
Solve the system
\[ \begin{aligned} 2x_1 + 4x_2 - 2x_3 &= 6 \\ x_1 + 3x_2 + 4x_3 &= -1 \\ 5x_1 + 2x_2 &= 2 \end{aligned} \]
using Gaussian elimination with scaled partial pivoting. Show intermediate results at each step; in particular, display the scale and index vectors.
Consider the linear system
\[ \begin{aligned} 2x_1 + 3x_2 &= 8 \\ -x_1 + 2x_2 - x_3 &= 0 \\ 3x_1 + 2x_3 &= 9 \end{aligned} \]
Solve for $x_1$, $x_2$, and $x_3$ using Gaussian elimination with scaled partial pivoting. Show intermediate matrices and vectors.
11.
Consider Gaussian elimination with scaled partial pivot- ing applied to the coefficient matrix
Consider the linear system of equations
\[ \begin{aligned} -x_1 + x_2 - 3x_4 &= 4 \\ x_1 + 3x_3 + x_4 &= 0 \\ x_2 - x_3 - x_4 &= 3 \\ 3x_1 + x_3 + 2x_4 &= 1 \end{aligned} \]
Solve this system using Gaussian elimination with scaled partial pivoting. Show all intermediate steps, and write down the index vector at each step.
pivoting code, except that you can include the right-hand side of the system in your calculations as you go along.
\[ \begin{aligned} 2x_1 - x_2 + 3x_3 + 7x_4 &= 15 \\ 4x_1 + 4x_2 + 7x_4 &= 11 \\ 2x_1 + x_2 + x_3 + 3x_4 &= 7 \\ 6x_1 + 5x_2 + 4x_3 + 17x_4 &= 31 \end{aligned} \]
12. 13.
where each # denotes a different nonzero element. Circle the locations of elements in which multipliers are stored and mark with an f those where fill-in occurs. The final index vector is l = [2, 3, 1, 5, 4].
Repeat Exercise 2.1.6a using Gaussian elimination with scaled partial pivoting.
Solve each of the following systems using Gaussian elimination with scaled partial pivoting. Carry four significant figures. What are the contents of the index array at each step?
14.
Using scaled partial pivoting, show how a computer
would solve the following system of equations. Show
the scale array, tell how the pivot rows are selected, and
carry out the computations. Include the index array for
each step. There are no fractions in the correct solution,
except for certain ratios that must be looked at to se-
lect pivots. You should follow exactly the scaled-partial-
# #
0 #
0
# 0
0 0 # #
15.
16.
a17.
Derive the formula
\[ \sum_{k=1}^{n} k = \frac{n}{2}(n + 1) \]
Hint: Set $S = \sum_{k=1}^{n} k$; also observe that
\[ 2S = (1 + 2 + \cdots + n) + [n + (n-1) + \cdots + 2 + 1] = (n + 1) + (n + 1) + \cdots \]
or use induction.
Derive the formula
\[ \sum_{k=1}^{n} k^2 = \frac{n}{6}(n + 1)(2n + 1) \]
Hint: Induction is probably easiest.
Count the number of operations in the following pseudocode:
a 21. After processing a matrix A by procedure Gauss, how can the results be used to solve a system of equations of the form $A^T x = b$?
22. What modifications would make procedure Gauss more efficient if division were much slower than multiplica- tion?
23. The matrix $A = (a_{ij})_{n \times n}$ is row-equilibrated if it is scaled so that
\[ \max_{1 \le j \le n} |a_{ij}| = 1 \qquad (1 \le i \le n) \]
In solving a system of equations $Ax = b$, we can produce an equivalent system in which the matrix is row-equilibrated by dividing the $i$th equation by $\max_{1 \le j \le n} |a_{ij}|$.
aa. Solve the system of equations
\[ \begin{bmatrix} 1 & 1 & 2 \times 10^9 \\ 2 & -1 & 10^9 \\ 1 & 2 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \]
by Gaussian elimination with scaled partial pivoting.
b. Solve by using row-equilibrated naive Gaussian elimination. Are the answers the same? Why or why not?
24. Solve each system using partial pivoting and scaled partial pivoting carrying four significant digits. Also, find the true solutions.
a. 0.004000x + 69.13y = 69.17 4.281x − 5.230y = 41.91
a 18.
a 19.
20.
1.
Count the number of divisions in procedure Gauss. Count the number of multiplications. Count the number of additions or subtractions. Using execution times in microseconds (multiplication 1, division 2.9, addition 0.4, subtraction 0.4), write a function of n that represents the time used in these arithmetic operations.
Considering long operations only and assuming 1-microsecond execution time for all long operations, give the approximate execution times and costs for procedure Gauss when $n = 10, 10^2, 10^3, 10^4$. Use only the dominant term in the operation count. Estimate costs at $500 per hour.
(Continuation) How much time would be used on the computer to solve 2000 equations using Gaussian elimination with scaled partial pivoting? How much would it cost? Give a rough estimate based on operation times.
Test numerical example (5) in the text using naive Gaussian algorithm and Gaussian algorithm with scaled partial pivoting.
b. c. d. e. f.
40.00x + 691300y = 691700 4.281x − 5.230y = 41.91
0.003000x + 59.14y = 59.17 5.291x − 6.130y = 46.78
0.8000x + 1825y = 2040
6 k=1
(n+1)(2n+1)
real array (a_ij)_{1:n×1:n}, (x_ij)_{1:n×1:n}
real z; integer i, j, n
for i = 1 to n
    for j = 1 to i
        z = z + a_ij x_ij
    end for
end for
30.00x + 591400y = 591700 5.291x − 6.130y = 46.78
0.7000x + 1725y = 1739 0.4352x − 5.433y = 5.278
0.4321x − 5.432y = 7.531
a 2.
Consider the augmented matrix
\[ \left[ \begin{array}{cccc|c} 0.4096 & 0.1234 & 0.3678 & 0.2943 & 0.4043 \\ 0.2246 & 0.3872 & 0.4015 & 0.1129 & 0.1550 \\ 0.3645 & 0.1920 & 0.3781 & 0.0643 & 0.4240 \\ 0.1784 & 0.4002 & 0.2786 & 0.3927 & 0.2557 \end{array} \right] \]
Computer Exercises 2.2
Solve it by Gaussian elimination with scaled partial piv- oting using procedures Gauss and Solve.
a3. (Continuation) Assume that an error was made when the coefficient matrix in Computer Exercise 2.2.2 was typed and that a single digit was mistyped, namely, 0.3645 became 0.3345. Solve this system, and notice the effect of this small change. Explain.
a4. The Hilbert matrix of order n is defined by $a_{ij} = (i + j - 1)^{-1}$ for $1 \le i, j \le n$. It is often used for test purposes because of its ill-conditioned nature. Define $b_i = \sum_{j=1}^{n} a_{ij}$. Then the solution of the system of equations $\sum_{j=1}^{n} a_{ij} x_j = b_i$ for $1 \le i \le n$ is $x = [1, 1, \ldots, 1]^T$. Verify this. Select some values of n in the range $2 \le n \le 15$, solve the system of equations for x using procedures Gauss and Solve, and see whether the result is as predicted. Do the case n = 2 by hand to see what difficulties occur in the computer.
a5. Define the $n \times n$ array $(a_{ij})$ by $a_{ij} = -1 + 2\max\{i, j\}$. Set up array $(b_i)$ in such a way that the solution of the system $Ax = b$ is $x_i = 1$ for $1 \le i \le n$. Test procedures Gauss and Solve on this system for a moderate value of n, say, n = 30.
Compare the results and explain.
\[ \left[ \begin{array}{cccc|c} 0.0001 & -5.0300 & 5.8090 & 7.8320 & 9.5740 \\ 2.2660 & 1.9950 & 1.2120 & 8.0080 & 7.2190 \\ 8.8500 & 5.6810 & 4.5520 & 1.3020 & 5.7300 \\ 6.7750 & -2.2530 & 2.9080 & 3.9700 & 6.2910 \end{array} \right] \]
10. Without changing the parameter list, rewrite and test procedure Gauss so that it does both forward elimination and back substitution. Increase the size of array $(a_{ij})$, and store the right-hand side array $(b_i)$ in the $(n + 1)$st column of $(a_{ij})$. Also, return the solution in this column.
11.
Modify procedures Gauss and Solve so that they are more robust. Two suggested changes are as follows: (i) skip elimination if ali ,k = 0 and (ii) add an error parameter ierr to the parameter list and perform error checking (e.g., on division by zero or a row of zeros). Test the modified code on linear systems of varying sizes.
12. Rewrite procedures Gauss and Solve so that they are column oriented, that is, so that all inner loops vary the first index of $(a_{ij})$. On some computer systems, this implementation may avoid paging or swapping between high-speed and secondary memory and be more efficient for large matrices.
a6. Select a modest value of n, say, $5 \le n \le 20$, and let $a_{ij} = (i - 1)^{j-1}$ and $b_i = i - 1$. Solve the system $Ax = b$ on the computer. By looking at the output, guess what the correct solution is. Establish algebraically that your guess is correct. Account for the errors in the computed solution.
7. For a fixed value of n from 2 to 4, let
\[ a_{ij} = (i + j)^2, \qquad b_i = n\,i\,(i + n + 1) + \tfrac{1}{6}\, n \left(1 + n(2n + 3)\right) \]
Show that the vector $x = [1, 1, \ldots, 1]^T$ solves the system $Ax = b$. Test whether procedures Gauss and Solve can compute x correctly for n = 2, 3, 4. Explain what happens.
8. Using each value of n from 2 to 9, solve the $n \times n$ system $Ax = b$, where A and b are defined by
\[ a_{ij} = (i + j - 1)^7, \qquad b_i = p(n + i - 1) - p(i - 1) \]
where
13.
Computer memory can be minimized by using a different storage mode when the coefficient matrix is symmetric. An $n \times n$ symmetric matrix $A = (a_{ij})$ has the property that $a_{ij} = a_{ji}$, so only the elements on and below the main diagonal need to be stored in a vector of length $n(n + 1)/2$. The elements of the matrix A are placed in a vector $v = (v_k)$ in this order: $a_{11}, a_{21}, a_{22}, a_{31}, a_{32}, a_{33}, \ldots, a_{n,n}$. Storing a matrix in this way is known as symmetric storage mode and effects a savings of $n(n - 1)/2$ memory locations. Here, $a_{ij} = v_k$, where $k = \frac{1}{2} i(i - 1) + j$ for $i \ge j$. Verify these statements.
Write and test procedures
Gauss Sym(n, (v_i), (l_i))
Solve Sym(n, (v_i), (l_i), (b_i))
which are analogous to procedures Gauss and Solve, except that the coefficient matrix is stored in symmetric storage mode in a one-dimensional array $(v_i)$ and the solution is returned in array $(b_i)$.
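The index mapping of symmetric storage mode can be checked with a few lines of Python. This is only a sketch of the storage scheme, not of the procedures requested in the exercise; the helper names `pack_symmetric`, `sym_index`, and `sym_get` are hypothetical, and the 1-based subscripts of the text are mapped onto Python's 0-based lists inside the helpers.

```python
def pack_symmetric(A):
    """Store the lower triangle of A row by row: a11, a21, a22, a31, ..."""
    n = len(A)
    return [A[i][j] for i in range(n) for j in range(i + 1)]


def sym_index(i, j):
    """Map 1-based (i, j) to the 1-based position k = i(i-1)/2 + j, i >= j."""
    if i < j:
        i, j = j, i          # use symmetry: a_ij = a_ji
    return i * (i - 1) // 2 + j


def sym_get(v, i, j):
    """Read a_ij (1-based subscripts) from the packed vector v."""
    return v[sym_index(i, j) - 1]


A = [[4, 1, 2, 3],
     [1, 5, 6, 7],
     [2, 6, 8, 9],
     [3, 7, 9, 10]]
v = pack_symmetric(A)
n = len(A)
print(len(v), n * (n + 1) // 2)      # packed length versus n(n+1)/2
print(sym_get(v, 2, 4), A[1][3])     # same entry read both ways
```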
14. The determinant of a square matrix can be easily computed with the help of procedure Gauss. We require three facts about determinants. First, the determinant of a triangular matrix is the product of the elements on its diagonal. Second, if a multiple of one row is added to another row, the determinant of the matrix does not change. Third, if two rows in a matrix are interchanged, the determinant changes sign. Procedure Gauss can be interpreted as a
\[ p(x) = \frac{x^2}{24} \left( 2 + x^2 \left( -7 + x^2 \left( 14 + x(12 + 3x) \right) \right) \right) \]
Explain what happens.
9. Solve the following augmented matrix using procedures Gauss and Solve and then using procedure Naive Gauss.
15.
16.
17.
18.
19.
1.0 0.5
0.333333 0.25
0.2
bi ← 7560bi
procedure for reducing a matrix to upper triangular form by interchanging rows and adding multiples of one row to another. Write a function det(n, (ai j )) that computes the determinant of an n × n matrix. It will call procedure Gauss and utilize the arrays (ai j ) and (li ) that result from that call. Numerically verify function det by using the following test matrices with several values of n:
and both with the right-hand side vector $b = [1, 0, 0, 0, 0]^T$. Solve both systems using single-precision Gaussian elimination with scaled partial pivoting. For each system, compute the $l_2$-norms $\|u\|_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$ of the residual vector $\tilde{r} = A\tilde{x} - b$ and of the error vector $\tilde{e} = \tilde{x} - x$, where $\tilde{x}$ is the computed solution and $x$ is the true, or exact, solution. For the first system, the exact solution is $x = [25, -300, 1050, -1400, 630]^T$, and for the second system, the exact solution, to six decimal digits of accuracy, is $x = [26.9314, -336.018, 1205.11, -1634.03, 744.411]^T$. Do not change the input data of the second system to include more than the number of digits shown. Analyze the results. What have you learned?
20. (Continuation) Repeat the preceding computer problem, but set
\[ a_{ij} \leftarrow 7560\, a_{ij}; \qquad b_i \leftarrow 7560\, b_i \]
for each system before solving.
21. Write complex arithmetic versions of procedures Gauss and Solve by declaring certain variables complex and making other necessary changes in the code. Test them on the complex linear systems given in Computer Exercise 2.1.6.
22. (Continuation) Solve the complex linear systems given in Computer Exercise 2.1.7.
a. $a_{ij} = |i - j|$, $\operatorname{Det}(A) = (-1)^{n-1}(n - 1)2^{n-2}$
ij b. aij =
1 j≧i −j j
|ai j | (1 ≦ i ≦ n)
■ Definition1 Strictly Diagonally
Dominant
Tridiagonal System Case
In the case of the tridiagonal system of Equation (1), strict diagonal dominance means simply that (with $a_0 = c_n = 0$)
\[ |d_i| > |a_{i-1}| + |c_i| \qquad (1 \le i \le n) \]
Let us verify that the forward elimination phase in procedure Tri preserves strict diagonal dominance. The new coefficient matrix produced by Gaussian elimination has 0 elements where the $a_i$'s originally stood, and new diagonal elements are determined recursively by
\[ \hat{d}_1 = d_1 \]
\[ \hat{d}_i = d_i - \left( \frac{a_{i-1}}{\hat{d}_{i-1}} \right) c_{i-1} \qquad (2 \le i \le n) \]
where $\hat{d}_i$ denotes new diagonal elements. The $c_i$ elements are unaltered. Now we assume that $|d_i| > |a_{i-1}| + |c_i|$, and we want to be sure that $|\hat{d}_i| > |c_i|$. Obviously, this is true for $i = 1$ because $\hat{d}_1 = d_1$. If it is true for index $i - 1$ (that is, $|\hat{d}_{i-1}| > |c_{i-1}|$), then it is true for index $i$ because
\[ |\hat{d}_i| = \left| d_i - \left( \frac{a_{i-1}}{\hat{d}_{i-1}} \right) c_{i-1} \right| \ge |d_i| - |a_{i-1}| \frac{|c_{i-1}|}{|\hat{d}_{i-1}|} > |a_{i-1}| + |c_i| - |a_{i-1}| = |c_i| \]
Although the number of long operations in Gaussian elimination on full matrices is O(n3), it is only O(n) for tridiagonal matrices. Also, the scaled pivoting strategy is not needed on strictly diagonally dominant tridiagonal systems.
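The O(n) elimination and the preservation of diagonal dominance can be seen in a short Python sketch. It follows the forward elimination and back substitution recurrences for tridiagonal systems, but the function name `tri`, the 0-based indexing, the returned modified diagonal, and the test system are assumptions of this illustration rather than the book's procedure Tri.

```python
def tri(a, d, c, b):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    a: subdiagonal (length n-1), d: diagonal (length n),
    c: superdiagonal (length n-1), b: right-hand side (length n).
    Returns the solution x and the modified diagonal (the d-hat values).
    """
    n = len(d)
    d = [float(v) for v in d]      # work on copies; inputs stay untouched
    b = [float(v) for v in b]
    for i in range(1, n):          # forward elimination: O(n) long operations
        xmult = a[i - 1] / d[i - 1]
        d[i] -= xmult * c[i - 1]
        b[i] -= xmult * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = (b[i] - c[i] * x[i + 1]) / d[i]
    return x, d


# Strictly diagonally dominant example with exact solution x_i = 1.
n = 8
a = [-1.0] * (n - 1)
c = [-1.0] * (n - 1)
d = [4.0] * n
b = [3.0] + [2.0] * (n - 2) + [3.0]

x, d_hat = tri(a, d, c, b)
print(x)
# Each new diagonal entry still dominates the superdiagonal entry in its row.
print(all(abs(d_hat[i]) > abs(c[i]) for i in range(n - 1)))
```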
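Pentadiagonal Systems
The principles illustrated by procedure Tri can be applied to matrices that have wider bands of nonzero elements. A procedure called Penta is given here to solve the five-diagonal system:
Pentadiagonal n × n System
\[
\begin{bmatrix}
d_1 & c_1 & f_1 \\
a_1 & d_2 & c_2 & f_2 \\
e_1 & a_2 & d_3 & c_3 & f_3 \\
 & e_2 & a_3 & d_4 & c_4 & f_4 \\
 & & \ddots & \ddots & \ddots & \ddots & \ddots \\
 & & & e_{i-2} & a_{i-1} & d_i & c_i & f_i \\
 & & & & \ddots & \ddots & \ddots & \ddots & \ddots \\
 & & & & & e_{n-4} & a_{n-3} & d_{n-2} & c_{n-2} & f_{n-2} \\
 & & & & & & e_{n-3} & a_{n-2} & d_{n-1} & c_{n-1} \\
 & & & & & & & e_{n-2} & a_{n-1} & d_n
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_i \\ \vdots \\ x_{n-2} \\ x_{n-1} \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ \vdots \\ b_i \\ \vdots \\ b_{n-2} \\ b_{n-1} \\ b_n \end{bmatrix}
\]
In the pseudocode, the solution vector is placed in an n × 1 array $(x_i)$. Also, one should not use this routine if n ≦ 4. (Why?)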
procedure Penta(n, (e_i), (a_i), (d_i), (c_i), (f_i), (b_i), (x_i))
integer i, n; real r, s, t, xmult
real array (e_i)_{1:n}, (a_i)_{1:n}, (d_i)_{1:n}, (c_i)_{1:n}, (f_i)_{1:n}, (b_i)_{1:n}, (x_i)_{1:n}
r ← a_1
s ← a_2
t ← e_1
for i = 2 to n − 1
    xmult ← r/d_{i−1}
    d_i ← d_i − (xmult) c_{i−1}
    c_i ← c_i − (xmult) f_{i−1}
    b_i ← b_i − (xmult) b_{i−1}
    xmult ← t/d_{i−1}
    r ← s − (xmult) c_{i−1}
    d_{i+1} ← d_{i+1} − (xmult) f_{i−1}
    b_{i+1} ← b_{i+1} − (xmult) b_{i−1}
    s ← a_{i+1}
    t ← e_i
end for
xmult ← r/d_{n−1}
d_n ← d_n − (xmult) c_{n−1}
x_n ← (b_n − (xmult) b_{n−1})/d_n
x_{n−1} ← (b_{n−1} − c_{n−1} x_n)/d_{n−1}
for i = n − 2 to 1 step −1
    x_i ← (b_i − f_i x_{i+2} − c_i x_{i+1})/d_i
end for
end procedure Penta
Penta Procedure Pseudocode
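A direct Python transcription of the pseudocode may help in checking the index bookkeeping. This is a sketch under stated assumptions: all five diagonals are passed as lists of length n (padded at the end, as the text assumes), indices are shifted to 0-based, the inputs are copied rather than overwritten, and the helper `penta_matvec` exists only to build a right-hand side with a known solution.

```python
def penta(e, a, d, c, f, b):
    """Solve a pentadiagonal system without pivoting (requires n >= 5).

    e, a: second and first subdiagonals; d: diagonal;
    c, f: first and second superdiagonals; b: right-hand side.
    All lists have length n; trailing pad entries are never used in arithmetic.
    """
    n = len(d)
    if n < 5:
        raise ValueError("penta expects n >= 5")
    d = [float(v) for v in d]
    c = [float(v) for v in c]
    b = [float(v) for v in b]
    r, s, t = a[0], a[1], e[0]
    for i in range(1, n - 1):          # pseudocode rows 2 .. n-1
        xmult = r / d[i - 1]
        d[i] -= xmult * c[i - 1]
        c[i] -= xmult * f[i - 1]
        b[i] -= xmult * b[i - 1]
        xmult = t / d[i - 1]
        r = s - xmult * c[i - 1]
        d[i + 1] -= xmult * f[i - 1]
        b[i + 1] -= xmult * b[i - 1]
        s = a[i + 1]
        t = e[i]
    xmult = r / d[n - 2]
    d[n - 1] -= xmult * c[n - 2]
    x = [0.0] * n
    x[n - 1] = (b[n - 1] - xmult * b[n - 2]) / d[n - 1]
    x[n - 2] = (b[n - 2] - c[n - 2] * x[n - 1]) / d[n - 2]
    for i in range(n - 3, -1, -1):
        x[i] = (b[i] - f[i] * x[i + 2] - c[i] * x[i + 1]) / d[i]
    return x


def penta_matvec(e, a, d, c, f, x):
    """Multiply the pentadiagonal matrix by x (used only to build a test)."""
    n = len(d)
    y = []
    for i in range(n):
        v = d[i] * x[i]
        if i >= 1:
            v += a[i - 1] * x[i - 1]
        if i >= 2:
            v += e[i - 2] * x[i - 2]
        if i + 1 < n:
            v += c[i] * x[i + 1]
        if i + 2 < n:
            v += f[i] * x[i + 2]
        y.append(v)
    return y


n = 7
e = [0.5 + 0.1 * i for i in range(n)]
a = [-1.0 - 0.2 * i for i in range(n)]
d = [10.0 + i for i in range(n)]
c = [2.0 - 0.1 * i for i in range(n)]
f = [1.0 + 0.3 * i for i in range(n)]
x_true = [float(i + 1) for i in range(n)]
b = penta_matvec(e, a, d, c, f, x_true)
x = penta(e, a, d, c, f, b)
print(x)
```

The test matrix is made strongly diagonally dominant on purpose, since this routine does no pivoting.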
Symmetric Penta Key Call
To be able to solve symmetric pentadiagonal systems with the same code and with a minimum of storage, we have used variables r, s, and t to store temporarily some information rather than overwriting into arrays. This allows us to solve a symmetric pentadiagonal system with a procedure call of the form
call Penta(n, ( fi ), (ci ), (di ), (ci ), ( fi ), (bi ), (bi ))
This reduces the number of linear arrays from seven to four! Of course, the original data in some of these arrays may be corrupted. The computed solution may be stored in the (bi ) array. Here, we assume that all linear arrays are padded with zeros to length n in order not to exceed the array dimensions in the pseudocode.
Block Pentadiagonal Systems
Many mathematical problems involve matrices with block structures. In many cases, there are advantages in exploiting the block structure in the numerical solution. This is particularly true in solving partial differential equations numerically as in Section 12.3.
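We can consider a pentadiagonal system as a block tridiagonal system
\[
\begin{bmatrix}
D_1 & C_1 \\
A_1 & D_2 & C_2 \\
 & A_2 & D_3 & C_3 \\
 & & \ddots & \ddots & \ddots \\
 & & & A_{i-1} & D_i & C_i \\
 & & & & \ddots & \ddots & \ddots \\
 & & & & & A_{m-2} & D_{m-1} & C_{m-1} \\
 & & & & & & A_{m-1} & D_m
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_i \\ \vdots \\ X_{m-1} \\ X_m \end{bmatrix}
=
\begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_i \\ \vdots \\ B_{m-1} \\ B_m \end{bmatrix}
\]
where
\[ D_i = \begin{bmatrix} d_{2i-1} & c_{2i-1} \\ a_{2i-1} & d_{2i} \end{bmatrix}, \qquad A_i = \begin{bmatrix} e_{2i-1} & a_{2i} \\ 0 & e_{2i} \end{bmatrix}, \qquad C_i = \begin{bmatrix} f_{2i-1} & 0 \\ c_{2i} & f_{2i} \end{bmatrix} \]
Here, we assume that n is even, say n = 2m, so that there are m block rows. If n is not even, then the system can be padded with an extra equation $x_{n+1} = 1$ so that the number of rows is even.
The algorithm for this block tridiagonal system is similar to the one for tridiagonal systems. Hence, we have the forward elimination phase
\[ D_i \leftarrow D_i - A_{i-1} D_{i-1}^{-1} C_{i-1} \]
\[ B_i \leftarrow B_i - A_{i-1} D_{i-1}^{-1} B_{i-1} \qquad (2 \le i \le m) \]
and the back substitution phase
\[ X_m \leftarrow D_m^{-1} B_m \]
\[ X_i \leftarrow D_i^{-1} \left( B_i - C_i X_{i+1} \right) \qquad (i = m-1, m-2, \ldots, 1) \]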
Here, we let
\[ D_i^{-1} = \frac{1}{\Delta} \begin{bmatrix} d_{2i} & -c_{2i-1} \\ -a_{2i-1} & d_{2i-1} \end{bmatrix} \]
where $\Delta = d_{2i} d_{2i-1} - a_{2i-1} c_{2i-1}$.
Code for solving a pentadiagonal system using this block procedure is left as Computer Exercise 2.3.21. The results from the block pentadiagonal code are the same as those from the procedure Penta, except for roundoff error. Also, this procedure can be used for symmetric pentadiagonal systems (in which the subdiagonals are the same as the superdiagonals).
In Section 12.3, we discuss two-dimensional elliptic partial differential equations. For example, the Laplace equation is defined on the unit square with a 3 × 3 mesh of grid points placed over the unit square region which are ordered in the natural ordering (left-to-right and up) as shown in Figure 2.5.
7 8 9
4 5 6
1 2 3
FIGURE 2.5 Mesh points in natural order
In the Laplace equation, the second-order partial derivatives are approximated by second-order centered finite difference formulas. This results in a 9 × 9 system of linear equations having a sparse coefficient matrix with this nonzero pattern:
Sample Sparse 9 × 9 System
\[
A = \begin{bmatrix}
\times & \times & & \times & & & & & \\
\times & \times & \times & & \times & & & & \\
 & \times & \times & & & \times & & & \\
\times & & & \times & \times & & \times & & \\
 & \times & & \times & \times & \times & & \times & \\
 & & \times & & \times & \times & & & \times \\
 & & & \times & & & \times & \times & \\
 & & & & \times & & \times & \times & \times \\
 & & & & & \times & & \times & \times
\end{bmatrix}
\]
Here, the nonzero entries in the matrix are indicated by the × symbol, and the zero entries are a blank. This matrix is block tridiagonal, and each nonzero block is either tridiagonal or diagonal. Other orderings of the mesh points result in sparse matrices with different patterns.
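The nonzero pattern can be generated programmatically from the mesh ordering. The sketch below assumes the usual five-point stencil with 4 on the diagonal and −1 for each mesh neighbor; the text specifies only the pattern, so these values and the function name `laplace_matrix` are illustrative assumptions.

```python
def laplace_matrix(m):
    """Five-point-stencil matrix for an m x m mesh in natural ordering.

    Mesh point (row, col), counted left-to-right and then upward,
    is unknown number k = row * m + col (0-based).
    """
    n = m * m
    A = [[0] * n for _ in range(n)]
    for row in range(m):
        for col in range(m):
            k = row * m + col
            A[k][k] = 4
            if col > 0:
                A[k][k - 1] = -1      # left neighbor
            if col < m - 1:
                A[k][k + 1] = -1      # right neighbor
            if row > 0:
                A[k][k - m] = -1      # neighbor below
            if row < m - 1:
                A[k][k + m] = -1      # neighbor above
    return A


A = laplace_matrix(3)
for line in A:
    print(" ".join("x" if v != 0 else "." for v in line))
```

Changing the mapping from (row, col) to k is enough to experiment with other orderings of the mesh points.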
Summary 2.3
• In many applications, tridiagonal, pentadiagonal, and other banded systems are solved using special algorithms that use Gaussian elimination without pivoting. The forward elimination procedure for a tridiagonal linear system A = Tridiagonal[$(a_i)$, $(d_i)$, $(c_i)$] is
\[ d_i \leftarrow d_i - \left( \frac{a_{i-1}}{d_{i-1}} \right) c_{i-1} \]
\[ b_i \leftarrow b_i - \left( \frac{a_{i-1}}{d_{i-1}} \right) b_{i-1} \qquad (2 \le i \le n) \]
The back substitution procedure, starting from $x_n \leftarrow b_n / d_n$, is
\[ x_i \leftarrow \frac{1}{d_i} \left( b_i - c_i x_{i+1} \right) \qquad (i = n-1, n-2, \ldots, 1) \]
• A strictly diagonally dominant matrix $A = (a_{ij})_{n \times n}$ is one in which the magnitude of the diagonal entry is larger than the sum of the magnitudes of the off-diagonal entries in the same row, and this is true for all rows, namely,
\[ |a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \qquad (1 \le i \le n) \]
For strictly diagonally dominant tridiagonal coefficient matrices, partial pivoting is not necessary because zero divisors will not be encountered.
• The forward elimination and back substitution procedures for a pentadiagonal linear system A = Pentadiagonal[$(e_i)$, $(a_i)$, $(d_i)$, $(c_i)$, $(f_i)$] are similar to those for a tridiagonal system.
Exercises 2.3
1. What happens to the tridiagonal System (1) if Gaussian elimination with partial pivoting is used to solve it? In general, what happens to a banded system?
2. Count the long arithmetic operations involved in procedures:
aa. Tri b. Penta
a3. How many storage locations are needed for a system of n linear equations if the coefficient matrix has banded structure in which $a_{ij} = 0$ for $|i - j| \ge k + 1$?
4. Give an example of a system of linear equations in tridiagonal form that cannot be solved without pivoting.
5. What is the appearance of a matrix A if its elements satisfy $a_{ij} = 0$ when:
a. ji+1
a6. Consider a strictly diagonally dominant matrix A whose elements satisfy $a_{ij} = 0$ when $i > j + 1$. Does Gaussian elimination without pivoting preserve the strict diagonal dominance? Why or why not?
a7. Let A be a matrix of form (1) such that $a_i c_i > 0$ for $1 \le i \le n - 1$. Find the general form of the diagonal matrix $D = \operatorname{Diag}(\alpha_i)$ with $\alpha_i \ne 0$ such that $D^{-1} A D$ is symmetric. What is the general form of $D^{-1} A D$?
Computer Exercises 2.3
1. Rewrite procedure Tri using only four arrays, $(a_i)$, $(d_i)$, $(c_i)$, and $(b_i)$, and storing the solution in the $(b_i)$ array. Test the code with both a nonsymmetric and a symmetric tridiagonal system.
2. Repeat the previous computer problem for procedure Penta with six arrays $(e_i)$, $(a_i)$, $(d_i)$, $(c_i)$, $(f_i)$, and $(b_i)$. Use the example that begins this chapter as one of the test cases.
a3. Write and test a special procedure to solve the tridiagonal system in which $a_i = c_i = 1$ for all i.
a4. Use procedure Tri to solve the following system of 100 equations. Compare the numerical solution to the obvious exact solution.
For n odd, write and test
procedure X Gauss(n, (a_i), (d_i), (b_i))
that does the forward elimination phase of Gaussian elimination (without scaled partial pivoting) and
procedure X Solve(n, (a_i), (d_i), (b_i), (x_i))
that does the back substitution for cross-systems of this form.
11. Consider the n × n lower-triangular system $Ax = b$, where $A = (a_{ij})$ and $a_{ij} = 0$ for $i < j$.
aa. Write an algorithm (in mathematical terms) for solving for x by forward substitution.
b. Write
procedure Forward Sub(n, (a_i), (b_i), (x_i))
which uses this algorithm.
c. Determine the number of divisions, multiplications, and additions (or subtractions) in using this algorithm to solve for x.
d. Should Gaussian elimination with partial pivoting be used to solve such a system?
a 12. (Normalized Tridiagonal Algorithm) Construct an algorithm for handling tridiagonal systems in which the normalized Gaussian elimination procedure without pivoting is used. In this process, each pivot row is divided by the diagonal element before a multiple of the row is subtracted from the successive rows. Write the equations involved in the forward elimination phase and store the upper diagonal entries back in array $(c_i)$ and the right-hand side entries back in array $(b_i)$. Write the equations for the back substitution phase, storing the solution in array $(b_i)$. Code and test this procedure. What are its advantages and disadvantages?
13. For a (2n) × (2n) tridiagonal system, write and test a procedure that proceeds as follows: In the forward elimination phase, the routine simultaneously eliminates the elements in the subdiagonal from the top to the middle and in the superdiagonal from the bottom to the middle. In the back substitution phase, the unknowns are determined two at a time from the middle outward.
14. (Continuation) Rewrite and test the procedure in the preceding computer problem for a general n × n tridiagonal matrix.
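\[ \begin{aligned} x_1 + 0.5x_2 &= 1.5 \\ 0.5x_{i-1} + x_i + 0.5x_{i+1} &= 2.0 \qquad (2 \le i \le 99) \\ 0.5x_{99} + x_{100} &= 1.5 \end{aligned} \]
5. Solve the system
\[ \begin{aligned} 4x_1 - x_2 &= -20 \\ x_{j-1} - 4x_j + x_{j+1} &= 40 \qquad (2 \le j \le n-1) \\ -x_{n-1} + 4x_n &= -20 \end{aligned} \]
using procedure Tri with n = 100.
6. Let A be the 50 × 50 tridiagonal matrix
\[ \begin{bmatrix} 5 & -1 \\ -1 & 5 & -1 \\ & -1 & 5 & -1 \\ & & \ddots & \ddots & \ddots \\ & & & -1 & 5 & -1 \\ & & & & -1 & 5 \end{bmatrix} \]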
Consider the problem Ax = b for 50 different vectors b
of the form
[1,2,...,49,50]T , [2,3,...,50,1]T ,
[3,4,...,50,1,2]T , ...
Write and test an efficient code for solving this problem.
Hint: Rewrite procedure Tri.
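7. Rewrite and test procedure Tri so that it performs Gaussian elimination with scaled partial pivoting.
Hint: Additional temporary storage arrays may be needed.
8. Rewrite and test Penta so that it does Gaussian elimination with scaled partial pivoting. Is this worthwhile?
9. Using the ideas illustrated in Penta, write a procedure for solving seven-diagonal systems. Test it on several such systems.
10. Consider the system of equations (n = 7)
\[ \begin{bmatrix} d_1 & & & & & & a_7 \\ & d_2 & & & & a_6 & \\ & & d_3 & & a_5 & & \\ & & & d_4 & & & \\ & & a_3 & & d_5 & & \\ & a_2 & & & & d_6 & \\ a_1 & & & & & & d_7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} \]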
15. Suppose
procedure Tri Normal(n, (ai ), (di ), (ci ), (bi ), (xi )) performs the normalized Gaussian elimination algorithm
of Computer Exercise 2.3.12 and
procedure Tri 2n(n, (ai ), (di ), (ci ), (bi ), (xi )) performs the algorithm outlined in Computer Exer-
cise 2.3.13. Using a timing routine on your computer, compare Tri, Tri Normal, and Tri 2n to determine which of them is fastest for the tridiagonal system
ai =i(n−i+1), ci =(i+1)(n−i−1), di =(2i+1)n−i−2i, bi =i
with a large even value of n.
Note: Mathematical algorithms may behave differently on parallel and vector computers. Generally speaking, parallel computations completely alter our conventional notions about what’s best or most efficient.
16. Consider a special bidiagonal linear system of the following form (illustrated with n = 7) with nonzero diagonal elements:
$$
\begin{bmatrix}
d_1 \\
a_1 & d_2 \\
& a_2 & d_3 \\
& & a_3 & d_4 & a_4 \\
& & & & d_5 & a_5 \\
& & & & & d_6 & a_6 \\
& & & & & & d_7
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
$$
Write and test
procedure Bi_Diagional(n, (a_i), (d_i), (b_i))
to solve the general system of order n (odd). Store the solution in array b, and assume that all arrays are of length n. Do not use forward elimination because the system can be solved quite easily without it.
17. Write and test
procedure Backward_Tri(n, (a_i), (d_i), (c_i), (b_i), (x_i))
for solving a backward tridiagonal system of linear equations of the form
$$
\begin{bmatrix}
& & & & a_1 & d_1 \\
& & & a_2 & d_2 & c_1 \\
& & a_3 & d_3 & c_2 & \\
& \cdots & \cdots & \cdots & & \\
a_{n-1} & d_{n-1} & c_{n-2} & & & \\
d_n & c_{n-1} & & & &
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{n-1} \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-1} \\ b_n \end{bmatrix}
$$
using Gaussian elimination without pivoting.
18. An upper Hessenberg matrix is of the form
$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
& a_{32} & a_{33} & \cdots & a_{3n} \\
& & \ddots & \ddots & \vdots \\
& & & a_{n,n-1} & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix}
$$
Write a procedure for solving such a system, and test it on a system having 10 or more equations.
19. An n × n banded coefficient matrix with l subdiagonals and m superdiagonals can be stored in banded storage mode in an n × (l + m + 1) array. The matrix is stored with the row and diagonal structure preserved with almost all 0 elements unstored. If the original n × n banded matrix had the form shown in the figure, then the n × (l + m + 1) array in banded storage mode would be as shown. The main diagonal would be the (l + 1)st column of the new array. Write and test a procedure for solving a linear system with the coefficient matrix stored in banded storage mode.
[Figure: Banded array and its BLAS-General-Band Storage mode]
20. An n × n symmetric banded coefficient matrix with m subdiagonals and m superdiagonals can be stored in symmetric banded storage mode in an n × (m + 1) array. Only the main diagonal and subdiagonals are stored so that the main diagonal is the last column in the new array. Write and test a procedure for solving a linear system with the coefficient matrix stored in symmetric banded storage mode.
[Figure: Symmetric banded array and its Lower-Band-Packed Storage mode]
21. Write code for solving block pentadiagonal systems and test it on the systems with block submatrices. Compare the code to Penta using symmetric and nonsymmetric systems.
22. (Nonperiodic Spline Filter) The filter equation for the nonperiodic spline filter is given by the n × n system
$$
(I + \alpha^4 Q)\,w = z
$$
where the matrix is
$$
Q = \begin{bmatrix}
1 & -2 & 1 \\
-2 & 5 & -4 & 1 \\
1 & -4 & 6 & -4 & 1 \\
& 1 & -4 & 6 & -4 & 1 \\
& & \ddots & \ddots & \ddots & \ddots & \ddots \\
& & & 1 & -4 & 6 & -4 & 1 \\
& & & & 1 & -4 & 5 & -2 \\
& & & & & 1 & -2 & 1
\end{bmatrix}
$$
Here the parameter α = 1/[2 sin(πx/λc)] involves measurement values of the profile, dimensions, and wavelength over a sampling interval. The solution w gives the profile values for the long wave components, and z − w are those for the short wave components. Use this system to test the Penta code using various values of α.
Hint: For test systems, select a simple solution vector w = [1, −1, 1, −1, ..., 1]^T with a modest value for n, and then compute the right-hand side using matrix-vector multiplication z = (I + α^4 Q)w.
23. (Continuation, Periodic Spline Filter) The filter equation for the periodic spline filter is given by the n × n system
$$
(I + \alpha^4 Q)\,w = z
$$
where the matrix is
$$
Q = \begin{bmatrix}
6 & -4 & 1 & & & 1 & -4 \\
-4 & 6 & -4 & 1 & & & 1 \\
1 & -4 & 6 & -4 & 1 & & \\
& \ddots & \ddots & \ddots & \ddots & \ddots & \\
& & 1 & -4 & 6 & -4 & 1 \\
1 & & & 1 & -4 & 6 & -4 \\
-4 & 1 & & & 1 & -4 & 6
\end{bmatrix}
$$
Periodic spline filters are used in cases of filtering closed profiles. Making use of the symmetry, modify the Penta pseudocode to handle this system and then code and test it.
24. Use mathematical software such as MATLAB, Maple, or Mathematica to generate a tridiagonal system and solve it. For example, use the 5 × 5 tridiagonal system A = Band_Matrix(−1, 2, 1) with right-hand side b = [1, 4, 9, 16, 25]^T.
3 Nonlinear Equations
An electric power cable is suspended (at points of equal height) from two towers that are 100 meters apart. The cable is allowed to dip 10 meters in the middle. How long is the cable?
It is known that the curve assumed by a suspended cable is a catenary. When the y-axis passes through the lowest point, we can assume an equation of the form y = λ cosh(x/λ). Here λ is a parameter to be determined. The conditions of the problem are that y(50) = y(0) + 10. Hence, we obtain
$$
\lambda \cosh\frac{50}{\lambda} = \lambda + 10
$$
By the methods of this chapter, the parameter is found to be λ = 126.632. After this value is substituted into the arc length formula of the catenary, the length is determined to be 102.619 meters. (See Computer Exercise 5.1.4.)
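The cable computation can be checked with a few lines of Python. This is only a sketch: it uses a plain interval-halving loop (the bisection idea developed in Section 3.1), the bracket [100, 150] is an assumption chosen because the residual changes sign there, and the arc length of y = λ cosh(x/λ) over [−50, 50] is taken to be 2λ sinh(50/λ). The names `cable_equation` and `bisect` are illustrative, not from the text.

```python
import math

def cable_equation(lam):
    """Residual of lam*cosh(50/lam) = lam + 10; zero at the sought parameter."""
    return lam * math.cosh(50.0 / lam) - lam - 10.0

def bisect(f, a, b, tol=1e-10):
    """Halve [a, b] until it is shorter than tol; f(a), f(b) must differ in sign."""
    fa = f(a)
    if fa * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > tol:
        c = 0.5 * (a + b)
        fc = f(c)
        if fa * fc <= 0:      # sign change (or exact zero) in [a, c]
            b = c
        else:                 # otherwise the root lies in [c, b]
            a, fa = c, fc
    return 0.5 * (a + b)

# Assumed bracket: the residual is positive at 100 and negative at 150.
lam = bisect(cable_equation, 100.0, 150.0)

# Arc length of the catenary between x = -50 and x = 50.
length = 2.0 * lam * math.sinh(50.0 / lam)

print(f"lambda = {lam:.3f}, cable length = {length:.3f} m")
```

The printed values should agree with the figures quoted above to about three decimal places.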
[Figure: Cable suspended between x = −50 and x = 50, with y(50) lying 10 m above y(0)]
3.1 Bisection Method
Introduction
Let f be a real- or complex-valued function of a real or complex variable. A number r, real or complex, for which f(r) = 0 is called a root of that equation or a zero of f. For example, the function
$$
f(x) = 6x^2 - 7x + 2
$$
has $\tfrac12$ and $\tfrac23$ as zeros, as can be verified by direct substitution or by writing f in its factored form:
$$
f(x) = (2x - 1)(3x - 2)
$$
For another example, the function
$$
g(x) = \cos 3x - \cos 7x
$$
has not only the obvious zero x = 0, but every integer multiple of π/5 and of π/2 as well, which we discover by applying the trigonometric identity
$$
\cos A - \cos B = 2 \sin\left[\tfrac12 (A + B)\right] \sin\left[\tfrac12 (B - A)\right]
$$
Consequently, we find
$$
g(x) = 2 \sin(5x) \sin(2x)
$$
Why is locating roots important?
Frequently, the solution to a scientific problem is a number about which we have little information other than that it satisfies some equation. Since every equation can be written so that a function stands on one side and zero on the other, the desired number must be a zero of the function. Thus, if we possess an arsenal of methods for locating zeros of functions, we shall be able to solve such problems.
We illustrate this claim by use of a specific engineering problem whose solution is the root of an equation. In a certain electrical circuit, the voltage V and current I are related by two equations of the form
$$
I = a(e^{bV} - 1), \qquad c = dI + V
$$
in which a, b, c, and d are constants. For our purpose, these four numbers are assumed to be known. When these equations are combined by eliminating I between them, the result is a single equation:
$$
c = ad(e^{bV} - 1) + V
$$
In a concrete case, this might reduce to
$$
12 = 14.3(e^{2V} - 1) + V
$$
and its solution is required. (It turns out that V ≈ 0.299 in this case.)
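The quoted value V ≈ 0.299 can be sanity-checked without a full solver. The sketch below, in which the function name `residual` and the two test points are illustrative choices, evaluates the residual of the concrete equation on either side of 0.299; a sign change between 0.2985 and 0.2995 means a root lies in that interval, and every number in that interval rounds to 0.299.

```python
import math

def residual(V):
    """Right side minus left side of 12 = 14.3*(exp(2V) - 1) + V."""
    return 14.3 * (math.exp(2.0 * V) - 1.0) + V - 12.0

lo, hi = 0.2985, 0.2995          # assumed test points around the quoted value
print(residual(lo), residual(hi))
print("sign change:", residual(lo) * residual(hi) < 0)
```

Because the residual is continuous and increasing in V, the sign change also shows that this is the only root near 0.299.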
In some problems in which a root of an equation is sought, we can perform the required calculation with a hand calculator. But how can we locate zeros of complicated functions
such as these?
$$
\begin{aligned}
f(x) &= 3.24x^8 - 2.42x^7 + 10.34x^6 + 11.01x^2 + 47.98 \\
g(x) &= 2^{x^2} - 10x + 1 \\
h(x) &= \cosh\left(\sqrt{x^2 + 1} - e^x\right) + \log|\sin x|
\end{aligned}
$$
What is needed is a general numerical method that does not depend on special properties of our functions. Of course, continuity and differentiability are special properties, but they are common attributes of functions that are usually encountered. The sort of special property that we probably cannot easily exploit in general-purpose codes is typified by the trigonometric identity mentioned previously.
Hundreds of methods are available for locating zeros of functions, and three of the most useful have been selected for study here: the bisection method, Newton’s method, and the secant method.
Let f be a function that has values of opposite sign at the two ends of an interval. Suppose also that f is continuous on that interval. To fix the notation, let a < b and f(a)f(b) < 0. It then follows that f has a root in the interval (a, b). In other words, there must exist a number r that satisfies the two conditions a < r < b and f(r) = 0. How is this conclusion reached? One must recall the Intermediate-Value Theorem.∗ If x traverses an interval [a, b], then the values of f(x) completely fill out the interval between f(a) and f(b). No intermediate values can be skipped. Hence, a specific function f must take on the value zero somewhere in the interval (a, b) because f(a) and f(b) are of opposite signs.
Bisection Algorithm
The bisection method exploits this property of continuous functions. At each step in this algorithm, we have an interval [a, b] and the values u = f(a) and v = f(b). The numbers u and v satisfy uv < 0. Next, we construct the midpoint of the interval, c = ½(a + b), and compute w = f(c). It can happen fortuitously that f(c) = 0. If so, the objective of the algorithm has been fulfilled. In the usual case, w ≠ 0, and either wu < 0 or wv < 0. (Why?) If wu < 0, we can be sure that a root of f exists in the interval [a, c]. Consequently, we store the value of c in b and w in v. If wu > 0, then we cannot be sure that f has a root in [a, c], but since wv < 0, f must have a root in [c, b]. In this case, we store the value of c in a and w in u. In either case, the situation at the end of this step is just like that at the beginning except that the final interval is half as large as the initial interval. This step can now be repeated until the interval is satisfactorily small, say |b − a| < ½ × 10⁻⁶. At the end, the best estimate of the root would be (a + b)/2, where [a, b] is the last interval in the procedure.
Pseudocode
Now let’s construct pseudocode to carry out this procedure. We shall not try to create a piece of high-quality software with many “bells and whistles,” but we write the pseudocode in the form of a procedure for general use. This allows the reader an opportunity to review how a main program and one or more procedures can be connected.
As a general rule, in programming routines to locate the roots of arbitrary functions, unnecessary evaluations of the function should be avoided because a given function may be costly to evaluate in terms of computer time. Thus, any value of the function that may be needed later should be stored rather than recomputed. A careless programming of the bisection method might violate this principle.
The procedure to be constructed operates on an arbitrary function f . An interval [a, b] is also specified, and the number of steps to be taken, nmax, is given. Pseudocode to perform nmax steps of the bisection algorithm follows:
procedure Bisection(f, a, b, nmax, ε)
integer n, nmax; real a, b, c, fa, fb, fc, error
fa ← f(a)
fb ← f(b)
if sign(fa) = sign(fb) then
    output a, b, fa, fb
    output “function has same signs at a and b”
    return
end if
error ← b − a
for n = 0 to nmax
    error ← error/2
    c ← a + error
    fc ← f(c)
    output n, c, fc, error
    if |error| < ε then
        output “convergence”
        return
    end if
    if sign(fa) ≠ sign(fc) then
        b ← c
        fb ← fc
    else
        a ← c
        fa ← fc
    end if
end for
end procedure Bisection

∗Intermediate-Value Theorem: If the function f is continuous on the closed interval [a, b], and if f(a) ≦ y ≦ f(b) or f(b) ≦ y ≦ f(a), then there exists a point c such that a ≦ c ≦ b and f(c) = y.

Here many modifications are incorporated to enhance the pseudocode. For example, we use fa, fb, fc as mnemonics for u, v, w, respectively. Also, we illustrate some techniques of structured programming and some other alternatives, such as a test for convergence. For example, if u, v, or w is close to zero, then uv or wu may underflow. Similarly, an overflow situation may arise. A test involving the intrinsic function sign could be used to avoid these difficulties, such as a test that determines whether sign(u) ≠ sign(v). Here, the iterations terminate if they exceed nmax or if the error bound (discussed later in this section) is less than ε. The reader should trace the steps in the routine to see that it does what is claimed!
Numerical Examples
Now we want to illustrate how the bisection pseudocode can be used. Suppose that we have two functions, and for each, we seek a zero in a specified interval:
f(x) = x³ − 3x + 1, on [0, 1]
g(x) = x³ − 2 sin x, on [0.5, 2]
First, we write two procedure functions to compute f(x) and g(x). Then we input the initial intervals and the number of steps to be performed in a main program. Since this is a rather simple example, this information can be assigned directly in the main program or by way of statements in the subprograms rather than being read into the program. Also, depending on the computer language being used, an external or interface statement is needed to tell the compiler that the parameter f in the bisection procedure is not an ordinary variable with numerical values, but the name of a function procedure defined externally to the main program. In this example, there are two function procedures and two calls to the bisection procedure.
A main program follows which calls the bisection routine for each of these functions:
program Test_Bisection
integer n, nmax ← 20
real a, b, ε ← ½ × 10⁻⁶
external function f, g
a ← 0.0
b ← 1.0
call Bisection(f, a, b, nmax, ε)
a ← 0.5
b ← 2.0
call Bisection(g, a, b, nmax, ε)
end program Test_Bisection

real function f(x)
real x
f ← x³ − 3x + 1
end function f

real function g(x)
real x
g ← x³ − 2 sin x
end function g

Here are the computer results with the iterative steps of the bisection method for f(x):
 n    cn            f(cn)            error
 0    0.5           −0.375           0.5
 1    0.25           0.266           0.25
 2    0.375         −7.23 × 10⁻²     0.125
 3    0.3125         9.30 × 10⁻²     6.25 × 10⁻²
 4    0.34375        9.37 × 10⁻³     3.125 × 10⁻²
 .
 19   0.34729 67    −9.54 × 10⁻⁷     9.54 × 10⁻⁷
 20   0.34729 62     3.58 × 10⁻⁷     4.77 × 10⁻⁷
Also, the results for g(x) are as follows:
 n    cn            g(cn)            error
 0    1.25           5.52 × 10⁻²     0.75
 1    0.875         −0.865           0.375
 2    1.0625        −0.548           0.188
 3    1.15625       −0.285           9.38 × 10⁻²
 4    1.20312 5     −0.125           4.69 × 10⁻²
 .
 19   1.23618 27    −4.88 × 10⁻⁶     1.43 × 10⁻⁶
 20   1.23618 34    −2.15 × 10⁻⁶     7.15 × 10⁻⁷
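For readers who want to reproduce the tabulated iterates, the Bisection pseudocode and its test program can be transcribed into Python roughly as follows. This is a sketch under a few assumptions: the procedure returns its iteration history instead of printing inside the loop, the sign test uses `math.copysign`, and the names `bisection`, `hist_f`, and `hist_g` are illustrative.

```python
import math

def bisection(f, a, b, nmax, eps):
    """Run at most nmax + 1 bisection steps; return a list of (n, c, f(c), error)."""
    fa, fb = f(a), f(b)
    if math.copysign(1.0, fa) == math.copysign(1.0, fb):
        raise ValueError("function has same signs at a and b")
    error = b - a
    history = []
    for n in range(nmax + 1):
        error /= 2                      # half-width of the current interval
        c = a + error                   # midpoint
        fc = f(c)                       # the only new function evaluation per step
        history.append((n, c, fc, error))
        if abs(error) < eps:            # convergence test on the error bound
            break
        if math.copysign(1.0, fa) != math.copysign(1.0, fc):
            b, fb = c, fc               # root lies in [a, c]
        else:
            a, fa = c, fc               # root lies in [c, b]
    return history

def f(x):
    return x**3 - 3*x + 1

def g(x):
    return x**3 - 2*math.sin(x)

hist_f = bisection(f, 0.0, 1.0, 20, 0.5e-6)
hist_g = bisection(g, 0.5, 2.0, 20, 0.5e-6)

for name, hist in (("f", hist_f), ("g", hist_g)):
    print(name)
    for n, c, fc, err in hist:
        print(f"{n:3d} {c:.7f} {fc: .2e} {err:.2e}")
```

Each step reuses the stored values fa and fc, so the function is evaluated only once per iteration, in line with the advice given before the pseudocode.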
Mathematical Software
To verify these results, we use sophisticated procedures in mathematical software such as MATLAB, Mathematica, or Maple to find the desired roots of f and g to be 0.34729 63553 and 1.23618 3928, respectively. Since f is a polynomial, we can use a routine for finding numerical approximations to all the zeros of a polynomial function. However, when more complicated nonpolynomial functions are involved, there is generally no systematic procedure for finding all zeros. In this case, a routine can be used to search for zeros (one at a time), but we have to specify a point at which to start the search, and different starting points may result in the same or different zeros. It may be particularly troublesome to find all the zeros of a function whose behavior is unknown.
Convergence Analysis
Now let us investigate the accuracy with which the bisection method determines a root of a function. Suppose that f is a continuous function that takes values of opposite sign at the ends of an interval [a0, b0]. Then there is a root r in [a0, b0], and if we use the midpoint c0 = (a0 + b0)/2 as our estimate of r, we have
$$
|r - c_0| \le \frac{b_0 - a_0}{2}
$$
as illustrated in Figure 3.1.
FIGURE 3.1 Bisection method: Illustrating error upper bound
If the bisection algorithm is now applied and if the computed quantities are denoted by a0, b0, c0, a1, b1, c1, and so on, then by the same reasoning,
$$
|r - c_n| \le \frac{b_n - a_n}{2} \qquad (n \ge 0)
$$
Since the widths of the intervals are divided by 2 in each step, we conclude that
$$
|r - c_n| \le \frac{b_0 - a_0}{2^{n+1}} \tag{1}
$$
To summarize, a theorem can be written as follows:
■ Theorem 1 (Bisection Method Theorem)
If the bisection algorithm is applied to a continuous function f on an interval [a, b], where f(a)f(b) < 0, then, after n steps, an approximate root will have been computed with error at most $(b - a)/2^{n+1}$.
If an error tolerance has been prescribed in advance, it is possible to determine the number of steps required in the bisection method. Suppose that we want
$$
|r - c_n| < \varepsilon
$$
Then it is necessary to solve the following inequality for n:
$$
\frac{b - a}{2^{n+1}} < \varepsilon
$$
By taking logarithms (with any convenient base), we obtain
$$
n > \frac{\log(b - a) - \log(2\varepsilon)}{\log 2} \tag{2}
$$
EXAMPLE 1 How many steps of the bisection algorithm are needed to compute a root of f to full machine single precision on a 32-bit word-length computer if a = 16 and b = 17?
Solution The root is between the two binary numbers $a = (10\,000.0)_2$ and $b = (10\,001.0)_2$. Thus, we already know five of the binary digits in the answer. Since we can use only 24 bits altogether, that leaves 19 bits to determine. We want the last one to be correct, so we want the error to be less than $2^{-19}$ or $2^{-20}$ (being conservative). Since a 32-bit word-length computer has a 24-bit mantissa, we can expect the answer to have an accuracy of only $2^{-20}$. From the equation above, we want
$$
\frac{b - a}{2^{n+1}} < \varepsilon
$$
Since b − a = 1 and $\varepsilon = 2^{-20}$, we have $1/2^{n+1} < 2^{-20}$. Taking reciprocals gives $2^{n+1} > 2^{20}$, or n ≧ 20.
Alternatively, we can use Inequality (2), which in this case is
$$
n > \frac{\log 1 - \log 2^{-19}}{\log 2}
$$
Using a basic property of logarithms ($\log x^y = y \log x$), we find that n ≧ 20. In this example, each step of the algorithm determines the root with one additional binary digit of precision. ■
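The step count in Example 1 can also be found by direct search on the error bound. The sketch below is an illustration only: the function name `bisection_steps` is hypothetical, and it loops on the bound (b − a)/2^(n+1) rather than evaluating the logarithm formula, which avoids rounding trouble when the quotient in Inequality (2) is an exact integer.

```python
def bisection_steps(a, b, eps):
    """Smallest n for which the bound (b - a) / 2**(n + 1) is below eps."""
    n = 0
    while (b - a) / 2 ** (n + 1) >= eps:
        n += 1
    return n

# Example 1: a = 16, b = 17, eps = 2**-20
print(bisection_steps(16.0, 17.0, 2.0 ** -20))   # 20
```

The same function can be used with any interval and tolerance for which the bound is meaningful, that is, b > a and eps > 0.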
A sequence {xn} exhibits linear convergence to a limit x if there is a constant C in the interval [0, 1) such that
$$
|x_{n+1} - x| \le C|x_n - x| \qquad (n \ge 1) \tag{3}
$$
If this inequality is true for all n, then
$$
|x_{n+1} - x| \le C|x_n - x| \le C^2|x_{n-1} - x| \le \cdots \le C^n|x_1 - x|
$$
Thus, it is a consequence of linear convergence that
$$
|x_{n+1} - x| \le AC^n \qquad (0 \le C < 1) \tag{4}
$$
The sequence produced by the bisection method obeys Inequality (4), as we see from Inequality (1). However, the sequence need not obey Inequality (3).
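The distinction between Inequalities (3) and (4) can be seen numerically on the earlier example f(x) = x³ − 3x + 1 on [0, 1]. The following sketch, whose helper name `bisection_midpoints` is illustrative and which takes the root value 0.3472963553 quoted earlier as the reference, prints the actual errors |r − cn| next to the bounds from Inequality (1). The errors stay under the bounds, yet they do not decrease at every step.

```python
def bisection_midpoints(f, a, b, steps):
    """Return the midpoints c_0, ..., c_steps produced by bisection on [a, b]."""
    fa = f(a)
    midpoints = []
    for _ in range(steps + 1):
        c = (a + b) / 2
        fc = f(c)
        midpoints.append(c)
        if fa * fc < 0:          # sign change in [a, c]
            b = c
        else:                    # otherwise continue with [c, b]
            a, fa = c, fc
    return midpoints

def f(x):
    return x**3 - 3*x + 1

r = 0.3472963553                 # root value quoted in the text
cs = bisection_midpoints(f, 0.0, 1.0, 10)
errors = [abs(r - c) for c in cs]
bounds = [1.0 / 2 ** (n + 1) for n in range(len(cs))]   # (b0 - a0)/2^(n+1), b0 - a0 = 1

for n, (e, bd) in enumerate(zip(errors, bounds)):
    print(f"{n:2d}  error = {e:.3e}  bound = {bd:.3e}")
```

In this run the error at n = 3 is larger than the error at n = 2, so no constant C < 1 can make Inequality (3) hold for every n, while the bound of Inequality (1) still halves at each step.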
The bisection method is the simplest way to solve a nonlinear equation f (x) = 0. It arrives at the root by constraining the interval in which a root lies, and it eventually makes the interval quite small. Because the bisection method halves the width of the interval at each step, one can predict exactly how long it takes to find the root within any desired degree of accuracy. In the bisection method, not every guess is closer to the root than the previous guess because the bisection method does not use the nature of the function itself. Often the bisection method is used to get close to the root before switching to a faster method. (Root finding by the bisection method uses the same idea as in the binary search method taught in data structures.)
False Position (Regula Falsi) Method
The false position method retains the main feature of the bisection method: that a root is trapped in a sequence of intervals of decreasing size. Rather than selecting the midpoint of each interval, this method uses the point where the secant lines intersect the x-axis.
FIGURE 3.2 False position method
In Figure 3.2, the secant line over the interval [a, b] is the chord between (a, f(a)) and (b, f(b)). The two right triangles in the figure are similar, which means that
$$
\frac{b - c}{f(b)} = \frac{c - a}{-f(a)}
$$
It is easy to show that
$$
c = b - f(b)\left[\frac{a - b}{f(a) - f(b)}\right]
  = a - f(a)\left[\frac{b - a}{f(b) - f(a)}\right]
  = \frac{a f(b) - b f(a)}{f(b) - f(a)}
$$
We then compute f(c) and proceed to the next step with the interval [a, c] if f(a)f(c) < 0 or to the interval [c, b] if f(c)f(b) < 0.
In the general case, the false position method starts with the interval [a0, b0] containing a root: f(a0) and f(b0) are of opposite signs. The false position method uses intervals [ak, bk] that contain roots in almost the same way that the bisection method does. However, instead of finding the midpoint of the interval, it finds where the secant line joining (ak, f(ak)) and (bk, f(bk)) crosses the x-axis and then selects it to be the new endpoint. At the kth step, it computes
$$
c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}
$$
If f(ak) and f(ck) have the same sign, then set a_{k+1} = ck and b_{k+1} = bk; otherwise, set a_{k+1} = ak and b_{k+1} = ck. The process is repeated until the root is approximated sufficiently well.
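The kth-step rule just described can be sketched in Python as follows. The stopping test on |f(ck)|, the iteration cap `kmax`, and the function name `false_position` are assumptions added for illustration, since the text only says that the process is repeated until the root is approximated sufficiently well.

```python
def false_position(f, a, b, tol=1e-12, kmax=200):
    """False position on [a, b]; stops when |f(c)| < tol or after kmax steps."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    c = a
    for _ in range(kmax):
        # x-intercept of the secant line through (a, fa) and (b, fb)
        c = (a * fb - b * fa) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:
            break
        if fa * fc > 0:          # f(a) and f(c) have the same sign: move a
            a, fa = c, fc
        else:                    # otherwise move b
            b, fb = c, fc
    return c

root = false_position(lambda x: x**3 - 3*x + 1, 0.0, 1.0)
print(root)
```

For this test function the left endpoint a = 0 is never replaced, which is the one-sided behavior that motivates the modified method discussed next.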
Modified False Position Method
For some functions, the false position method may repeatedly select the same endpoint, and the process may degrade to linear convergence. There are various approaches to rectify this.
For example, when the same endpoint is to be retained twice, the modified false position method uses
$$
c_k^{(m)} =
\begin{cases}
\dfrac{a_k f(b_k) - 2 b_k f(a_k)}{f(b_k) - 2 f(a_k)}, & \text{if } f(a_k) f(b_k) < 0 \\[2ex]
\dfrac{2 a_k f(b_k) - b_k f(a_k)}{2 f(b_k) - f(a_k)}, & \text{if } f(a_k) f(b_k) > 0
\end{cases}
$$
So rather than selecting points on the same side of the root as the regular false position method does, the modified false position method changes the slope of the straight line so that it is closer to the root. See Figure 3.3.
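One way to realize this idea in code is the variant commonly known as the Illinois algorithm: whenever the same endpoint has been retained on two consecutive steps, the stored function value at that endpoint is halved before the next secant line is drawn. The sketch below is an interpretation, not necessarily the exact rule intended by the formula above; the bookkeeping variable `side`, the stopping test, and the function name are assumptions.

```python
def modified_false_position(f, a, b, tol=1e-12, kmax=100):
    """False position with halving of the function value at a stagnant endpoint."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    side = 0                     # +1 if a moved last step, -1 if b moved, 0 at start
    c = a
    for _ in range(kmax):
        c = (a * fb - b * fa) / (fb - fa)
        fc = f(c)
        if abs(fc) < tol:
            break
        if fa * fc > 0:          # root is in [c, b]: replace a, retain b
            a, fa = c, fc
            if side == +1:       # b was also retained on the previous step
                fb /= 2
            side = +1
        else:                    # root is in [a, c]: replace b, retain a
            b, fb = c, fc
            if side == -1:       # a was also retained on the previous step
                fa /= 2
            side = -1
    return c

# Test function and interval taken from Computer Exercise 8 of this section.
root = modified_false_position(lambda x: x**3 + 2*x**2 + 10*x - 20, 1.0, 2.0)
print(root)
```

Halving only the stored value, not the function itself, keeps the sign at the retained endpoint unchanged, so the interval continues to bracket the root.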
FIGURE 3.3 Modified false position method
The bisection method uses only the fact that f (a) f (b) < 0 for each new interval [a, b], but the false position method uses the values of f (a) and f (b). This is an example showing how to include additional information in an algorithm to build a better one. In the next section, Newton’s method uses not only the function, but also its first derivative.
Some variants of the modified false position procedure have superlinear convergence, which we discuss in Section 3.3. (See, for example, Ford [1995].) Another modified false position method replaces the secant lines by straight lines with ever smaller slope until the iterate falls to the opposite side of the root. (See Conte and de Boor [1980].) Early versions of the false position method date back to a Chinese mathematical text (200 B.C.E. to 100 C.E.) and an Indian mathematical text (3 B.C.E.).
Summary 3.1
• For finding a zero r of a given continuous function f in an interval [a, b], n steps of the bisection method produce a sequence of intervals [a, b] = [a0, b0], [a1, b1], [a2, b2], ..., [an, bn], each containing the desired root of the function. The midpoints of these intervals c0, c1, c2, ..., cn form a sequence of approximations to the root, namely, $c_i = \tfrac12(a_i + b_i)$. On each interval [ai, bi], the error $e_i = r - c_i$ obeys the inequality
$$
|e_i| \le \tfrac12 (b_i - a_i)
$$
and after n steps we have
$$
|e_n| \le \frac{1}{2^{n+1}} (b_0 - a_0)
$$
• For an error tolerance ε such that |en| < ε, n steps are needed, where n satisfies the inequality
$$
n > \frac{\log(b - a) - \log 2\varepsilon}{\log 2}
$$
• For the kth step of the false position method over the interval [ak, bk], let
$$
c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}
$$
If f(ak)f(ck) > 0, set a_{k+1} = ck and b_{k+1} = bk; otherwise, set a_{k+1} = ak and b_{k+1} = ck.
Exercises 3.1
ᵃ1. Find where the graphs of y = 3x and y = e^x intersect by finding roots of e^x − 3x = 0 correct to four decimal digits.
2. Give a graphical demonstration that the equation tan x = x has infinitely many roots. Determine one root precisely and another approximately by using a graph. Hint: Use the approach of the preceding exercise.
3. Demonstrate graphically that the equation 50π + sin x = 100 arctan x has infinitely many solutions.
ᵃ4. By graphical methods, locate approximations to all roots of the nonlinear equation ln(x + 1) + tan(2x) = 0.
5. Give an example of a function for which the bisection method does not converge linearly.
6. Draw a graph of a function that is discontinuous yet the bisection method converges. Repeat, getting a function for which it diverges.
7. Prove Inequality (1).
8. If a = 0.1 and b = 1.0, how many steps of the bisection method are needed to determine the root with an error of at most ½ × 10⁻⁸?
ᵃ9. Find all the roots of f(x) = cos x − cos 3x. Use two different methods.
ᵃ10. (Continuation) Find the root or roots of ln[(1 + x)/(1 − x²)] = 0.
11. If f has an inverse, then the equation f(x) = 0 can be solved by simply writing x = f⁻¹(0). Does this remark eliminate the problem of finding roots of equations? Illustrate with sin x = 1/π.
ᵃ12. How many binary digits of precision are gained in each step of the bisection method? How many steps are required for each decimal digit of precision?
13. Try to devise a stopping criterion for the bisection method to guarantee that the root is determined with relative error at most ε.
14. Denote the successive intervals that arise in the bisection method by [a0, b0], [a1, b1], [a2, b2], and so on. Show that
a. a0 ≦ a1 ≦ a2 ≦ ··· and b0 ≧ b1 ≧ b2 ≧ ···.
b. $b_n - a_n = 2^{-n}(b_0 - a_0)$.
c. $a_n b_n + a_{n-1} b_{n-1} = a_{n-1} b_n + a_n b_{n-1}$, for all n.
15. (Continuation) Can a0 = a1 = a2 = ··· happen?
16. (Continuation) Let cn = (an + bn)/2. Show that
$$
\lim_{n\to\infty} c_n = \lim_{n\to\infty} a_n = \lim_{n\to\infty} b_n
$$
ᵃ17. (Continuation) Consider the bisection method with the initial interval [a0, b0]. Show that after 10 steps with this method,
$$
\left| \tfrac12 (a_{10} + b_{10}) - \tfrac12 (a_9 + b_9) \right| = 2^{-11}(b_0 - a_0)
$$
Also, determine how many steps are required to guarantee an approximation of a root to six decimal places (rounded).
18. (True-False) If the bisection method generates intervals [a0, b0], [a1, b1], and so on, which of these inequalities are true for the root r that is being calculated? Give proofs or counterexamples in each case.
a. $|r - a_n| \le 2|r - b_n|$
ᵃb. $|r - a_n| \le 2^{-n-1}(b_0 - a_0)$
c. $|r - \tfrac12(a_n + b_n)| \le 2^{-n-2}(b_0 - a_0)$
ᵃd. $0 \le r - a_n \le 2^{-n}(b_0 - a_0)$
e. $|r - b_n| \le 2^{-n-1}(b_0 - a_0)$
19. (True-False) Using the notation of the text, determine which of these assertions are true and which are generally false:
ᵃa. $|r - c_n| < |r - c_{n-1}|$
b. $a_n \le r \le c_n$
c. $c_n \le r \le b_n$
d. $|r - a_n| \le 2^{-n}$
ᵃe. $|r - b_n| \le 2^{-n}(b_0 - a_0)$
20. Prove that $|c_n - c_{n+1}| = 2^{-n-2}(b_0 - a_0)$.
ᵃ21. If the bisection method is applied with starting interval [a, a + 1] and a ≧ 2^m, where m ≧ 0, what is the correct number of steps to compute the root with full machine precision on a 32-bit word-length computer?
22. If the bisection method is applied with starting interval [2^m, 2^(m+1)], where m is a positive or negative integer, how many steps should be taken to compute the root to full machine precision on a 32-bit word-length computer?
ᵃ23. Every polynomial of degree n has n zeros (counting multiplicities) in the complex plane. Does every real polynomial have n real zeros? Does every polynomial of infinite degree $f(x) = \sum_{n=0}^{\infty} a_n x^n$ have infinitely many zeros?
Computer Exercises 3.1
1. Using the bisection method, determine the point of intersection of the curves given by $y = x^3 - 2x + 1$ and $y = x^2$.
2. Find a root of the following equation in the interval [0, 1] by using the bisection method: $9x^4 + 18x^3 + 38x^2 - 57x + 14 = 0$.
3. Find a root of the equation tan x = x on the interval [4, 5] by using the bisection method. What happens on the interval [1, 2]?
4. Find a root of the equation $6(e^x - x) = 6 + 3x^2 + 2x^3$ between −1 and +1 using the bisection method.
5. Use the bisection method to find a zero of the equation λ cosh(50/λ) = λ + 10 that begins this chapter.
6. Program the bisection method as a recursive procedure and test it on one or two of the examples in the text.
7. Use the bisection method to determine roots of these functions on the intervals indicated. Process all three functions in one computer run.
$f(x) = x^3 + 3x - 1$ on [0, 1]
$g(x) = x^3 - 2\sin x$ on [0.5, 2]
$h(x) = x + 10 - x\cosh(50/x)$ on [120, 130]
Find each root to full machine precision. Use the correct number of steps, at least approximately. Repeat using the false position method.
8. Test the three bisection routines on $f(x) = x^3 + 2x^2 + 10x - 20$, with a = 1 and b = 2. The zero is 1.36880 8108. In programming this polynomial function, use nested multiplication. Repeat using the modified false position method.
9. Write a program to find a zero of a function f in the following way: In each step, an interval [a, b] is given and f(a) f(b) < 0. Then c is computed as the root of the linear function that agrees with f at a and b. We retain either [a, c] or [c, b], depending on whether f(a) f(c) < 0 or f(c) f(b) < 0. Test your program on several functions.
a10. Select a routine from your program library to solve polynomial equations and use it to find the roots of the equation
\[ x^8 - 36x^7 + 546x^6 - 4536x^5 + 22449x^4 - 67284x^3 + 118124x^2 - 109584x + 40320 = 0 \]
The correct roots are the integers 1, 2, . . . , 8. Next, solve the same equation when the coefficient of $x^7$ is changed to −37. Observe how a minor perturbation in the coefficients can cause massive changes in the roots. Thus, the roots are unstable functions of the coefficients. (Be sure to program the exercise to allow for complex roots.) Cultural Note: This is a simplified version of Wilkinson’s polynomial, which is found in Computer Exercise 3.3.9.
a11. A circular metal shaft is being used to transmit power. It is known that at a certain critical angular velocity ω, any jarring of the shaft during rotation will cause the shaft to deform or buckle. This is a dangerous situation because the shaft might shatter under the increased centrifugal force. To find this critical velocity ω, we must first compute a number x that satisfies the equation
\[ \tan x + \tanh x = 0 \]
This number is then used in a formula to obtain ω. Solve for x (x > 0).
12. Using built-in routines in mathematical software systems such as MATLAB, Mathematica, or Maple, find the roots for $f(x) = x^3 - 3x + 1$ on [0, 1] and $g(x) = x^3 - \sin x$ on [0.5, 2] to more digits of accuracy than shown in the text.
13. (Engineering Problem) Nonlinear equations occur in almost all fields of engineering. For example, suppose a given task is expressed in the form f(x) = 0 and the objective is to find values of x that satisfy this condition. It is often difficult to find an explicit solution, and an approximate solution is sought with the aid of mathematical software. Find a solution of
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(1/2)x^2} + \frac{1}{10}\sin(\pi x) \]
Plot the curve in the range [−3.5, 3.5] for x values and [−0.5, 0.5] for y = f(x) values.
14. (Circuit Problem) A simple circuit with resistance R, capacitance C in series with a battery of voltage V is given by $Q = CV[1 - e^{-T/(RC)}]$, where Q is the charge of the capacitor and T is the time needed to obtain the charge. We wish to solve for the unknown C. For example, solve this exercise
\[ f(x) = \left[ 10x\left(1 - e^{-0.004/(2000x)}\right) \right] - 0.00001 \]
Plot the curve. Hint: You may wish to magnify the vertical scale by using $y = 10^5 f(x)$.
15. (Engineering Polynomials) Equations such as
\[ A + Bx^2 e^{Cx} = 0 \qquad A + Bx + Cx^2 + Dx^3 + Ex^4 = 0 \]
occur in engineering problems. Using mathematical software, find one or more solutions to the following equations and plot their curves:
a. $2 - x^2 e^{-0.385x} = 0$
b. $1 - 32x + 160x^2 - 256x^3 + 128x^4 = 0$
16. (Reinforced Concrete) In the design of reinforced concrete with regard to stress, one needs to solve numerically a quadratic equation such as
24147 07.2x [450 − 0.822x (225)] − 265,000,000 = 0
Find approximate values of the roots.
17. (Board Across Hallways) In a building, two intersecting halls with widths $w_1 = 9$ feet and $w_2 = 7$ feet meet at an angle α = 125°, as shown in the accompanying figure. Assuming a two-dimensional situation, what is the longest board that can negotiate the turn? Ignore the thickness of the board. The relationship between the angles θ and the length of the board $l = l_1 + l_2$ is $l_1 = w_1 \csc(\beta)$, $l_2 = w_2 \csc(\gamma)$, $\beta = \pi - \alpha - \gamma$, and $l = w_1 \csc(\pi - \alpha - \gamma) + w_2 \csc(\gamma)$. The maximum length of the board that can make the turn is found by minimizing l as a function of γ. Taking the derivative and setting dl/dγ = 0, we obtain
\[ w_1 \cot(\pi - \alpha - \gamma)\csc(\pi - \alpha - \gamma) - w_2 \cot(\gamma)\csc(\gamma) = 0 \]
Substitute in the known values and numerically solve the nonlinear equation.
18. Find the rectangle of maximum area if its vertices are at (0, 0), (x, 0), (x, cos x), (0, cos x). Assume that 0 ≦ x ≦ π/2.
19. Program the false position algorithm and test it on some examples such as some of the nonlinear exercises in the text or in the computer exercises. Compare your results with those given for the bisection method.
20. Program the modified false position method, test it, and compare it to the false position method when using some sample functions.
3.2 Newton’s Method
The procedure known as Newton’s method is also called the Newton-Raphson iteration. It has a more general form than the one seen here, and the more general form can be used to find roots of systems of equations. Indeed, it is one of the more important procedures in numerical analysis, and its applicability extends to differential equations and integral equations. Here it is being applied to a single equation of the form f (x) = 0. As before, we seek one or more points at which the value of the function f is zero.
Tangent Line Approach
Interpretations of Newton’s Method
In Newton’s method, it is assumed at once that the function f is differentiable. This implies that the graph of f has a definite slope at each point and hence a unique tangent line. Now let’s pursue the following simple idea. At a certain point (x0, f (x0)) on the graph of f , there is a tangent, which is a rather good approximation to the curve in the vicinity of that point. Analytically, it means that the linear function
\[ l(x) = f'(x_0)(x - x_0) + f(x_0) \]
is close to the given function f near $x_0$. At $x_0$, the two functions l and f agree. We take the zero of l as an approximation to the zero of f. The zero of l is easily found:
\[ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} \]
Thus, starting with point $x_0$ (which we may interpret as an approximation to the root sought), we pass to a new point $x_1$ obtained from the preceding formula. Naturally, this process can be repeated (iterated) to produce a sequence of points:
\[ x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}, \qquad x_3 = x_2 - \frac{f(x_2)}{f'(x_2)}, \qquad \text{and so on.} \]
Under favorable conditions, the sequence of points approaches a zero of f.
Geometric Approach
The geometry of Newton’s method is shown in Figure 3.4. The line y = l(x) is tangent to the curve y = f(x). It intersects the x-axis at a point $x_1$. The slope of l(x) is $f'(x_0)$.
FIGURE 3.4 Newton’s method (curve y = f(x) with tangent line y = l(x); the points r, $x_1$, and $x_0$ are marked on the x-axis)
Taylor Series Approach
There are other ways of interpreting Newton’s method. Suppose again that x0 is an
initial approximation to a root of f . We ask:
What correction h should be added to x0 to obtain the root precisely?
Obviously, we want
f (x0 + h) = 0
If f is a sufficiently well-behaved function, it has a Taylor series at $x_0$. (See Equation (11) in Section 1.2.) Thus, we could write
\[ f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(x_0) + \cdots = 0 \]
Determining h from this equation is, of course, not easy. Therefore, we give up the expectation of arriving at the true root in one step and seek only an approximation to h. This can be obtained by ignoring all but the first two terms in the series:
\[ f(x_0) + h f'(x_0) = 0 \]
The h that solves this is not the h that solves $f(x_0 + h) = 0$, but it is the easily computed number
\[ h = -\frac{f(x_0)}{f'(x_0)} \]
Our new approximation is then
\[ x_1 = x_0 + h = x_0 - \frac{f(x_0)}{f'(x_0)} \]
and the process can be repeated. In retrospect, we see that the Taylor series was not needed after all because we used only the first two terms. In the analysis to be given later, it is assumed that $f''$ is continuous in a neighborhood of the root. This assumption enables us to estimate the errors in the process.
Newton’s Method Formula
If Newton’s method is described in terms of a sequence $x_0, x_1, \ldots$, then the following recursive or inductive definition applies:
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]
Naturally, the interesting question is whether
\[ \lim_{n \to \infty} x_n = r \]
where r is the desired root.
EXAMPLE 1
If $f(x) = x^3 - x + 1$ and $x_0 = 1$, what are $x_1$ and $x_2$ in the Newton iteration?
Solution
From the basic formula, $x_1 = x_0 - f(x_0)/f'(x_0)$. Now $f'(x) = 3x^2 - 1$, and so $f'(1) = 2$. Also, we find $f(1) = 1$. Hence, we have $x_1 = 1 - \frac{1}{2} = \frac{1}{2}$. Similarly, we obtain $f\left(\frac{1}{2}\right) = \frac{5}{8}$, $f'\left(\frac{1}{2}\right) = -\frac{1}{4}$, and $x_2 = 3$. ■
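The arithmetic in Example 1 can be checked exactly with rational numbers. The following is a minimal sketch; the helper name newton_step is an illustrative choice, not something defined in the text.

```python
from fractions import Fraction


def f(x):
    # f(x) = x^3 - x + 1 from Example 1
    return x**3 - x + 1


def fprime(x):
    # f'(x) = 3x^2 - 1
    return 3 * x**2 - 1


def newton_step(x):
    """One Newton step x - f(x)/f'(x) (hypothetical helper name)."""
    return x - f(x) / fprime(x)


x0 = Fraction(1)
x1 = newton_step(x0)
x2 = newton_step(x1)
print(x1, x2)  # prints: 1/2 3
```

Note that $x_2 = 3$ lies farther from $x_0$ than $x_1$ does, so two steps from this starting point do not yet indicate convergence.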
Pseudocode
A pseudocode procedure for Newton’s method can be written as follows:
procedure Newton(f, f′, x, nmax, ε, δ)
integer n, nmax; real x, fx, fp, ε, δ
external function f, f′
fx ← f(x)
output 0, x, fx
for n = 1 to nmax
    fp ← f′(x)
    if |fp| < δ then
        output “small derivative”
        return
    end if
    d ← fx/fp
    x ← x − d
    fx ← f(x)
    output n, x, fx
    if |d| < ε then
        output “convergence”
        return
    end if
end for
end procedure Newton
Sample Problem
Using the initial value of x as the starting point, we carry out a maximum of nmax iterations of Newton’s method. Procedures must be supplied for the two external functions f (x) and f ′(x). The parameters ε and δ are used to control the convergence and are related to the
accuracy desired or to the machine precision available.
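One possible Python rendering of the pseudocode is sketched below. The default values for nmax, eps, and delta, the returned history list, and the "max iterations" status string are assumptions made for this illustration; the pseudocode itself only prints its output.

```python
def newton(f, fprime, x, nmax=20, eps=1e-12, delta=1e-12):
    """Newton iteration following the pseudocode.

    Returns (x, history, status); history holds the (n, x, f(x)) triples
    that the pseudocode would output.
    """
    fx = f(x)
    history = [(0, x, fx)]
    for n in range(1, nmax + 1):
        fp = fprime(x)
        if abs(fp) < delta:
            # derivative too small to divide by safely
            return x, history, "small derivative"
        d = fx / fp
        x = x - d
        fx = f(x)
        history.append((n, x, fx))
        if abs(d) < eps:
            # the last correction was below the tolerance
            return x, history, "convergence"
    return x, history, "max iterations"


# Usage: the square root of 2 as the positive zero of x^2 - 2.
root, hist, status = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
print(status, root)
```

Returning the history instead of printing inside the loop keeps the routine reusable; a caller can still print each triple to mimic the output statements.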
Illustration
Now we illustrate Newton’s method by locating a root of $x^3 + x = 2x^2 + 3$. We apply the method to the function $f(x) = x^3 - 2x^2 + x - 3$, starting with $x_0 = 3$. Of course, $f'(x) = 3x^2 - 4x + 1$, and these two functions should be arranged in nested form for efficiency:
\[ f(x) = ((x - 2)x + 1)x - 3 \qquad f'(x) = (3x - 4)x + 1 \]
To see in greater detail the rapid convergence of Newton’s method, we use arithmetic with double the normal precision in the program and obtain the following results:
n    x_n                       f(x_n)
0    3.0                       9.0
1    2.4375                    2.04
2    2.21303 27224 73144 5     2.56 × 10^−1
3    2.17555 49386 14368 4     6.46 × 10^−3
4    2.17456 01006 55071 4     4.48 × 10^−6
5    2.17455 94102 93284 1     1.97 × 10^−12
Notice the doubling of the accuracy in f (x) (and also in x) until the maximum precision of the computer is encountered. Figure 3.5 shows a computer plot of three iterations of Newton’s method for this sample problem.
FIGURE 3.5 Three steps of Newton’s method, $f(x) = x^3 - 2x^2 + x - 3$ (curve y = f(x) for x from 2 to 3.2 and y from 0 to 10, with the root r and the iterates $x_2$, $x_1$, $x_0$ marked on the x-axis)
Mathematical Software
Using mathematical software that allows for complex roots such as in MATLAB, Maple, or Mathematica, we find that this polynomial has a single real root, 2.17456, and a pair of complex conjugate roots, −0.0872797 ± 1.17131i .
Convergence Analysis
Anyone who has experimented with Newton’s method—for instance, by working some of the exercises in this section—has observed the remarkable rapidity in the convergence of the sequence to the root. This phenomenon is also noticeable in the example just given. Usually, the number of correct figures in the answer nearly doubles at each successive step! Indeed, in the preceding example, we have first 0 and then 1, 2, 3, 6, 12, 24, . . . accurate digits from each Newton iteration. Five or six steps of Newton’s method often suffice to yield full machine precision in the determination of a root. There is a theoretical basis for this dramatic performance, as we shall now see.
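The growth in the number of correct digits can be observed directly by repeating the preceding example in high-precision arithmetic. The sketch below uses Python's decimal module; the working precision of 80 digits and the use of the twelfth iterate as a stand-in for the exact root r are assumptions made for the illustration.

```python
from decimal import Decimal, getcontext

getcontext().prec = 80  # working precision in decimal digits (arbitrary choice)


def f(x):
    return ((x - 2) * x + 1) * x - 3


def fprime(x):
    return (3 * x - 4) * x + 1


def newton_iterates(x0, steps):
    """Return [x0, x1, ..., x_steps] for Newton's method."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs


xs = newton_iterates(Decimal(3), 12)
r = xs[-1]  # reference value for the root, converged to working precision

# -log10 of the error: roughly the number of correct digits in x_n
digits = [float(-(abs(x - r)).log10()) for x in xs[:7]]
for n, d in enumerate(digits):
    print(n, round(d, 1))
```

Once the iterates are close to the root, each printed value is roughly twice the previous one.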
Error Bound Quadratic Convergence
Let the function f, whose zero we seek, possess two continuous derivatives $f'$ and $f''$, and let r be a zero of f. Assume further that r is a simple zero; that is, $f'(r) \neq 0$. Then Newton’s method, if started sufficiently close to r, converges quadratically to r. This means that the errors in successive steps obey an inequality of the form
\[ |r - x_{n+1}| \leqq c\,|r - x_n|^2 \]
We shall establish this fact presently, but first, an informal interpretation of the inequality may be helpful.
Informal Interpretation
Suppose, for simplicity, that c = 1. Suppose also that $x_n$ is an estimate of the root r that differs from it by at most one unit in the kth decimal place. This means that
\[ |r - x_n| \leqq 10^{-k} \]
The two inequalities above imply that
\[ |r - x_{n+1}| \leqq 10^{-2k} \]
In other words, $x_{n+1}$ differs from r by at most one unit in the (2k)th decimal place. So $x_{n+1}$ has approximately twice as many correct digits as $x_n$! This is the doubling of significant digits alluded to previously.
■ Theorem 1
Newton’s Method Theorem
If f, $f'$, and $f''$ are continuous in a neighborhood of a root r of f and if $f'(r) \neq 0$, then there is a positive δ with the following property: If the initial point in Newton’s method satisfies $|r - x_0| \leqq \delta$, then all subsequent points $x_n$ satisfy the same inequality, converge to r, and do so quadratically; that is,
\[ |r - x_{n+1}| \leqq c(\delta)\,|r - x_n|^2 \]
where c(δ) is given by Equation (2).
Proof
To establish the quadratic convergence of Newton’s method, let $e_n = r - x_n$. The formula that defines the sequence $\{x_n\}$ then gives
\[ e_{n+1} = r - x_{n+1} = r - x_n + \frac{f(x_n)}{f'(x_n)} = e_n + \frac{f(x_n)}{f'(x_n)} = \frac{e_n f'(x_n) + f(x_n)}{f'(x_n)} \]
Error Vector Recursive Relation
By Taylor’s Theorem (see Section 1.2), there exists a point $\xi_n$ situated between $x_n$ and r for which
\[ 0 = f(r) = f(x_n + e_n) = f(x_n) + e_n f'(x_n) + \frac{1}{2} e_n^2 f''(\xi_n) \]
(The subscript on $\xi_n$ emphasizes the dependence on $x_n$.) This last equation can be rearranged to read
\[ e_n f'(x_n) + f(x_n) = -\frac{1}{2} e_n^2 f''(\xi_n) \]
and if this is used in the previous equation for $e_{n+1}$, the result is
\[ e_{n+1} = -\frac{1}{2}\, \frac{f''(\xi_n)}{f'(x_n)}\, e_n^2 \qquad (1) \]
This is, at least qualitatively, the sort of equation we want. Continuing the analysis, we define a function
\[ c(\delta) = \frac{1}{2}\, \frac{\max_{|x - r| \leqq \delta} |f''(x)|}{\min_{|x - r| \leqq \delta} |f'(x)|} \qquad (\delta > 0) \qquad (2) \]
By virtue of this definition, we can assert that, for any two points x and ξ within distance δ of the root r, the inequality $\frac{1}{2}|f''(\xi)/f'(x)| \leqq c(\delta)$ is true. Now select δ so small that $\delta c(\delta) < 1$. This is possible because as δ approaches 0, c(δ) converges to $\frac{1}{2}|f''(r)/f'(r)|$, and so $\delta c(\delta)$ converges to 0. Recall that we assumed that $f'(r) \neq 0$. Let $\rho = \delta c(\delta)$. In the remainder of this argument, we hold δ, c(δ), and ρ fixed with ρ < 1.
Suppose now that some iterate $x_n$ lies within distance δ from the root r. We have
\[ |e_n| = |r - x_n| \leqq \delta \quad \text{and} \quad |\xi_n - r| \leqq \delta \]
By the definition of c(δ), it follows that $\frac{1}{2}|f''(\xi_n)|/|f'(x_n)| \leqq c(\delta)$. From Equation (1), we now have
\[ |e_{n+1}| = \left| \frac{1}{2}\, \frac{f''(\xi_n)}{f'(x_n)}\, e_n^2 \right| \leqq c(\delta)\, e_n^2 \leqq \delta c(\delta)\, |e_n| = \rho\, |e_n| \]
Consequently, $x_{n+1}$ is also within distance δ of r because
\[ |r - x_{n+1}| = |e_{n+1}| \leqq \rho\, |e_n| \leqq |e_n| \leqq \delta \]
If the initial point $x_0$ is chosen within distance δ of r, then
\[ |e_n| \leqq \rho\, |e_{n-1}| \leqq \rho^2 |e_{n-2}| \leqq \cdots \leqq \rho^n |e_0| \]
Since 0 < ρ < 1, $\lim_{n \to \infty} \rho^n = 0$ and $\lim_{n \to \infty} e_n = 0$. In other words, we obtain
\[ \lim_{n \to \infty} x_n = r \]
Error Bound (Quadratic Convergence)
In this process, we have
\[ |e_{n+1}| \leqq c(\delta)\, e_n^2 \qquad ■ \]
Discussion of Newton’s Method
In the use of Newton’s method, consideration must be given to the proper choice of a starting point. Usually, one must have some insight into the shape of the graph of the function. Sometimes a coarse graph is adequate, but in other cases, a step-by-step evaluation of the function at various points may be necessary to find a point near the root. Often several steps of the bisection method are used to obtain a suitable starting point, so that Newton’s method converges more rapidly.
FIGURE 3.6 Failure of Newton’s method due to bad starting points: (a) Runaway, (b) Flat spot, (c) Cycle
Poor Choices of x0
Although Newton’s method is truly a marvelous invention, its convergence depends upon hypotheses that are difficult to verify a priori. Some graphical examples show what can happen. In Figure 3.6a, the tangent to the graph of the function f at x0 intersects the x-axis at a point remote from the root r, and successive points in Newton’s iteration recede from r instead of converging to r. The difficulty can be ascribed to a poor choice of the initial point x0; it is not sufficiently close to r. In Figure 3.6b, the tangent to the curve is parallel to the x-axis and x1 = ±∞, or it is assigned the value of machine infinity in a computer. In Figure 3.6c, the iteration values cycle because x2 = x0. In a computer, roundoff errors or limited precision may eventually cause this situation to become unbalanced such that the iterates either spiral inward and converge or spiral outward and diverge.
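A small experiment makes the cycling failure concrete and shows how a few bisection steps can supply a better starting point. The test function $x^3 - 2x + 2$, the bracket [−3, 0], and the helper names below are assumptions chosen for this sketch; they do not come from the text or from Figure 3.6.

```python
def newton_iterates(f, fprime, x0, steps):
    """Return [x0, x1, ..., x_steps] for Newton's method."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs


def bisect(f, a, b, steps):
    """Midpoint of the bracket after `steps` bisection steps.

    Assumes f(a) and f(b) have opposite signs.
    """
    fa = f(a)
    for _ in range(steps):
        c = 0.5 * (a + b)
        fc = f(c)
        if fa * fc < 0:
            b = c            # sign change lies in [a, c]
        else:
            a, fa = c, fc    # sign change lies in [c, b]
    return 0.5 * (a + b)


f = lambda x: x**3 - 2.0 * x + 2.0
fp = lambda x: 3.0 * x**2 - 2.0

# Starting at 0 the iterates alternate between 0 and 1 (a cycle with x2 = x0).
cycle = newton_iterates(f, fp, 0.0, 6)
print(cycle)

# Three bisection steps on [-3, 0] give a start from which Newton converges.
start = bisect(f, -3.0, 0.0, 3)
good = newton_iterates(f, fp, start, 6)
print(start, good[-1])
```

The bracket only needs to be narrowed until the iterates stop wandering; after that, Newton's method takes over with its faster convergence.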
Multiple Zero
The analysis that establishes the quadratic convergence discloses another troublesome hypothesis; namely, $f'(r) \neq 0$. If $f'(r) = 0$, then r is a zero of f and $f'$. Such a zero is termed a multiple zero of f; in this case, at least a double zero. Newton’s iteration for a multiple zero converges only linearly! Ordinarily, one would not know in advance that the zero sought was a multiple zero. If one knew that the multiplicity was m, however, Newton’s method could be accelerated by modifying the equation to read
Modified Newton’s Method
\[ x_{n+1} = x_n - m\, \frac{f(x_n)}{f'(x_n)} \]
in which m is the multiplicity of the zero in question. The multiplicity of the zero r is the least m such that $f^{(k)}(r) = 0$ for $0 \leqq k < m$, but $f^{(m)}(r) \neq 0$. (See Exercise 3.2.35.)
As is shown in Figure 3.7 (p. 132), the equation $p_2(x) = x^2 - 2x + 1 = 0$ has a root at 1 of multiplicity 2, and the equation $p_3(x) = x^3 - 3x^2 + 3x - 1 = 0$ has a root at 1 of multiplicity 3. It is instructive to plot these curves. Both curves are rather flat at the roots, which slows down the convergence of the regular Newton’s method. Also, the figures illustrate the curves of two nonlinear functions with multiplicities as well as their regions of uncertainty about the curves. So the computed solutions could be anywhere within the indicated intervals on the x-axis. This is an indication of the difficulty in obtaining precise solutions of nonlinear functions with multiplicities.
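The effect of the multiplicity factor m can be seen on $p_2(x) = x^2 - 2x + 1$, whose double zero is r = 1. In the sketch below, the starting point $x_0 = 2$, the step count, and the helper name newton_errors are assumptions made for the illustration.

```python
def newton_errors(f, fprime, x0, root, steps, m=1):
    """Errors |x_n - root| for x_{n+1} = x_n - m f(x_n)/f'(x_n)."""
    x = x0
    errs = [abs(x - root)]
    for _ in range(steps):
        fp = fprime(x)
        if fp == 0.0:
            break  # landed exactly on the multiple zero; stop before dividing
        x = x - m * f(x) / fp
        errs.append(abs(x - root))
    return errs


p2 = lambda x: (x - 2.0) * x + 1.0   # x^2 - 2x + 1 in nested form
dp2 = lambda x: 2.0 * x - 2.0

plain = newton_errors(p2, dp2, 2.0, 1.0, 10)           # m = 1: error halves each step
modified = newton_errors(p2, dp2, 2.0, 1.0, 10, m=2)   # m = 2: reaches the zero at once
print(plain)
print(modified)
```

For this quadratic the unmodified iteration reduces the error by exactly one half per step, which is linear convergence, while the modified iteration with m = 2 reaches the zero in a single step.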
FIGURE 3.7 Curves $p_2$ and $p_3$ with multiplicity 2 and 3: (a) $p_2(x) = x^2 - 2x + 1$, (b) $p_3(x) = x^3 - 3x^2 + 3x - 1$ (each panel shows the curve on [0, 2] with the root r = 1 and its interval of uncertainty on the x-axis)
Systems of Nonlinear Equations
Some physical problems involve the solution of systems of N nonlinear equations in N unknowns. One approach is to linearize and solve, repeatedly. This is the same strategy used by Newton’s method in solving a single nonlinear equation. Not surprisingly, a natural extension of Newton’s method for nonlinear systems can be found. The topic of systems of nonlinear equations requires some familiarity with matrices and their inverses.
Nonlinear Systems N × N
In the general case, a system of N nonlinear equations in N unknowns $x_i$ can be displayed in the form
\[ \begin{aligned} f_1(x_1, x_2, \ldots, x_N) &= 0 \\ f_2(x_1, x_2, \ldots, x_N) &= 0 \\ &\;\;\vdots \\ f_N(x_1, x_2, \ldots, x_N) &= 0 \end{aligned} \]
Using vector notation, we can write this system in a more elegant form:
\[ F(X) = 0 \]
by defining column vectors as
\[ F = [f_1, f_2, \ldots, f_N]^T \qquad X = [x_1, x_2, \ldots, x_N]^T \]
The extension of Newton’s method for nonlinear systems is
\[ X^{(k+1)} = X^{(k)} - \left[ F'\left(X^{(k)}\right) \right]^{-1} F\left(X^{(k)}\right) \]
where $F'(X^{(k)})$ is the Jacobian matrix, which will be defined presently. It comprises partial derivatives of F evaluated at $X^{(k)} = [x_1^{(k)}, x_2^{(k)}, \ldots, x_N^{(k)}]^T$. This formula is similar to the previously seen version of Newton’s method, except that the derivative expression is not in the denominator but in the numerator as the inverse of a matrix. In the computational form of the formula, $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, \ldots, x_N^{(0)}]^T$ is an initial approximation vector, taken to be close to the solution of the nonlinear system, and the inverse of the Jacobian matrix is not computed, but rather a related system of equations is solved.
Nonlinear System 3 × 3
We illustrate the development of this procedure using three nonlinear equations
\[ \begin{aligned} f_1(x_1, x_2, x_3) &= 0 \\ f_2(x_1, x_2, x_3) &= 0 \\ f_3(x_1, x_2, x_3) &= 0 \end{aligned} \qquad (3) \]
Recall the Taylor expansion in three variables for i = 1, 2, 3:
\[ f_i(x_1 + h_1, x_2 + h_2, x_3 + h_3) = f_i(x_1, x_2, x_3) + h_1 \frac{\partial f_i}{\partial x_1} + h_2 \frac{\partial f_i}{\partial x_2} + h_3 \frac{\partial f_i}{\partial x_3} + \cdots \qquad (4) \]
where the partial derivatives are evaluated at the point $(x_1, x_2, x_3)$. Here only the linear terms in step sizes $h_i$ are shown. Suppose that the vector $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, x_3^{(0)}]^T$ is an approximate solution to (3). Let $H = [h_1, h_2, h_3]^T$ be a computed correction to the initial guess so that $X^{(0)} + H = [x_1^{(0)} + h_1, x_2^{(0)} + h_2, x_3^{(0)} + h_3]^T$ is a better approximate solution. Discarding the higher-order terms in the Taylor expansion (4), we have in vector notation
\[ 0 \approx F\left(X^{(0)} + H\right) \approx F\left(X^{(0)}\right) + F'\left(X^{(0)}\right) H \qquad (5) \]
Jacobian Matrix
where the Jacobian matrix is defined by
\[ F'\left(X^{(0)}\right) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \dfrac{\partial f_1}{\partial x_3} \\[2mm] \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \dfrac{\partial f_2}{\partial x_3} \\[2mm] \dfrac{\partial f_3}{\partial x_1} & \dfrac{\partial f_3}{\partial x_2} & \dfrac{\partial f_3}{\partial x_3} \end{bmatrix} \]
Here all of the partial derivatives are evaluated at $X^{(0)}$; namely,
\[ \frac{\partial f_i}{\partial x_j} = \frac{\partial f_i\left(X^{(0)}\right)}{\partial x_j} \]
Also, we assume that the Jacobian matrix $F'(X^{(0)})$ is nonsingular, so its inverse exists. Solving for H in (5), we have
\[ H \approx -\left[ F'\left(X^{(0)}\right) \right]^{-1} F\left(X^{(0)}\right) \]
Let $X^{(1)} = X^{(0)} + H$ be the better approximation after the correction; we then arrive at the first iteration of Newton’s method for nonlinear systems
\[ X^{(1)} = X^{(0)} - \left[ F'\left(X^{(0)}\right) \right]^{-1} F\left(X^{(0)}\right) \]
In general, Newton’s method uses this iteration:
\[ X^{(k+1)} = X^{(k)} - \left[ F'\left(X^{(k)}\right) \right]^{-1} F\left(X^{(k)}\right) \]
Jacobian Linear Systems
In practice, the computational form of Newton’s method does not involve inverting the Jacobian matrix but rather solves the Jacobian linear systems
\[ \left[ F'\left(X^{(k)}\right) \right] H^{(k)} = -F\left(X^{(k)}\right) \qquad (6) \]
Newton’s Method for Nonlinear Systems
The next iteration of Newton’s method is then
\[ X^{(k+1)} = X^{(k)} + H^{(k)} \qquad (7) \]
This is Newton’s method for nonlinear systems. The linear system (6) can be solved by procedures Gauss and Solve as discussed in Chapter 2. Small systems of order 2 can be solved easily. (See Exercise 2.2.39.)
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
134 Chapter 3 Nonlinear Equations
EXAMPLE 2
Solution
As an illustration, we can write a pseudocode to solve the following nonlinear system of equations using a variant of Newton’s method given by (6) and (7):
\[ \begin{aligned} x + y + z &= 3 \\ x^2 + y^2 + z^2 &= 5 \\ e^x + xy - xz &= 1 \end{aligned} \qquad (8) \]
With a sharp eye, the reader immediately sees that the solution of this system is x = 0, y = 1, z = 2. But in most realistic problems, the solution is not so obvious. We wish to develop a numerical procedure for finding such a solution. Here is the pseudocode:
X = [0.1, 1.2, 2.5]^T
for k = 1 to 10
    F = [ x1 + x2 + x3 − 3
          x1^2 + x2^2 + x3^2 − 5
          e^{x1} + x1 x2 − x1 x3 − 1 ]
    J = [ 1                   1       1
          2 x1                2 x2    2 x3
          e^{x1} + x2 − x3    x1      −x1 ]
    solve JH = F
    X = X − H
end for
When this pseudocode was programmed and executed on a computer, we found that it converges to x = (0, 1, 2), but when we change to a different starting vector, (1, 0, 1), it converges to another root, (1.2244, −0.0931, 1.8687). (Why?) ■
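The pseudocode for system (8) can be sketched in plain Python as follows. The small Gaussian elimination routine solve_linear stands in for the procedures Gauss and Solve of Chapter 2 and is an assumption of this sketch, as are the function names F, J, and newton_system.

```python
import math


def solve_linear(A, b):
    """Solve A h = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    h = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * h[j] for j in range(i + 1, n))
        h[i] = (M[i][n] - s) / M[i][i]
    return h


def F(X):
    x1, x2, x3 = X
    return [x1 + x2 + x3 - 3.0,
            x1**2 + x2**2 + x3**2 - 5.0,
            math.exp(x1) + x1 * x2 - x1 * x3 - 1.0]


def J(X):
    x1, x2, x3 = X
    return [[1.0, 1.0, 1.0],
            [2.0 * x1, 2.0 * x2, 2.0 * x3],
            [math.exp(x1) + x2 - x3, x1, -x1]]


def newton_system(X, iterations=10):
    for _ in range(iterations):
        H = solve_linear(J(X), F(X))            # solve J H = F
        X = [xi - hi for xi, hi in zip(X, H)]   # X = X - H
    return X


sol_a = newton_system([0.1, 1.2, 2.5])
sol_b = newton_system([1.0, 0.0, 1.0])
print(sol_a)
print(sol_b)
```

One observation that may help with the question posed in the example: at (0, 1, 2) the third row of J is $(e^0 + 1 - 2,\ 0,\ 0) = (0, 0, 0)$, so the Jacobian is singular at that solution, and ten iterations should be expected to give only a few correct digits there.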
Mathematical software such as MATLAB, Maple, or Mathematica provides procedures for solving the system of nonlinear equations (8). An important application of solving systems of nonlinear equations arises in Chapter 13 on minimization of functions.
Fractal Basins of Attraction
The applicability of Newton’s method for finding complex roots is one of its outstanding strengths. One need only program Newton’s method using complex arithmetic.
The frontiers of numerical analysis and nonlinear dynamics overlap in some intriguing ways. Computer-generated displays with fractal patterns, such as in Figure 3.8, can easily be created with the help of the Newton iteration. The resulting pictures show intricately interwoven sets in the plane that are quite beautiful and colorful when displayed on a computer. One begins with a polynomial in the complex variable z. For example, $p(z) = z^4 - 1$ is suitable. This polynomial has four zeros, which are the fourth roots of unity. Each of these zeros has a basin of attraction, that is, the set of all points $z_0$ such that Newton’s iteration, started at $z_0$, will converge to that zero. These four basins of attraction are disjoint from each other, because if the Newton iteration starting at $z_0$ converges to one zero, then it cannot also converge to another zero. One would naturally expect each basin to be a simple set surrounding the zero in the complex plane. But they turn out to be far from simple. To see what they are, we can systematically determine, for a large number of points, which zero of p the Newton iteration converges to if started at $z_0$. Points in each basin can be assigned different colors. The (rare) points for which the Newton iteration does not converge can be left uncolored. Computer Exercise 3.2.27 suggests how to do this.
FIGURE 3.8 Basins of attraction
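The classification of starting points can be sketched with Python's built-in complex arithmetic. The grid size, the step limit, the tolerance, and the character map used in place of colors are all assumptions made for this illustration; Computer Exercise 3.2.27 may suggest a different approach.

```python
ROOTS = [1 + 0j, 1j, -1 + 0j, -1j]   # the fourth roots of unity


def basin(z0, max_steps=50, tol=1e-10):
    """Index in ROOTS of the zero reached from z0, or None if none is reached."""
    z = complex(z0)
    for _ in range(max_steps):
        d = 4 * z**3                 # p'(z) for p(z) = z^4 - 1
        if d == 0:
            return None              # Newton step undefined
        z = z - (z**4 - 1) / d
        for k, root in enumerate(ROOTS):
            if abs(z - root) < tol:
                return k
    return None


def picture(n=21, extent=2.0):
    """Coarse character map of the square [-extent, extent] x [-extent, extent]."""
    symbols = "0123"
    lines = []
    for i in range(n):
        y = extent - 2 * extent * i / (n - 1)
        row = ""
        for j in range(n):
            x = -extent + 2 * extent * j / (n - 1)
            k = basin(complex(x, y))
            row += "." if k is None else symbols[k]
        lines.append(row)
    return lines


print("\n".join(picture()))
```

Each character names the zero that attracts the corresponding starting point, and a dot marks a point that was not classified within the step limit. A finer grid and a color per basin would bring out the interwoven structure described above.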
Summary 3.2
• For finding a zero of a continuous and differentiable function f, Newton’s method is given by
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \qquad (n \geqq 0) \]
It requires a given initial value $x_0$ and two function evaluations (for f and $f'$) per step.
• The errors are related by
\[ e_{n+1} = -\frac{1}{2}\, \frac{f''(\xi_n)}{f'(x_n)}\, e_n^2 \]
which leads to the inequality
\[ |e_{n+1}| \leqq c\,|e_n|^2 \]
This means that Newton’s method has quadratic convergence behavior for $x_0$ sufficiently close to the root r.
• For an N × N system of nonlinear equations F(X) = 0, Newton’s method is written as
\[ X^{(k+1)} = X^{(k)} - \left[ F'\left(X^{(k)}\right) \right]^{-1} F\left(X^{(k)}\right) \qquad (k \geqq 0) \]
which involves the Jacobian matrix $F'(X^{(k)}) = J = \left[ \partial f_i(X^{(k)}) / \partial x_j \right]_{N \times N}$. In practice, one solves the Jacobian linear system
\[ \left[ F'\left(X^{(k)}\right) \right] H^{(k)} = -F\left(X^{(k)}\right) \]
using Gaussian elimination and then finds the next iterate from the equation
\[ X^{(k+1)} = X^{(k)} + H^{(k)} \]
Exercises 3.2
1. Verify that when Newton’s method is used to compute $\sqrt{R}$ (by solving the equation $x^2 = R$), the sequence of iterates is defined by
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{R}{x_n}\right) \]
2. (Continuation) Show that if the sequence $\{x_n\}$ is defined as in the preceding exercise, then
\[ x_{n+1}^2 - R = \left[\frac{x_n^2 - R}{2x_n}\right]^2 \]
Interpret this equation in terms of quadratic convergence.
a3. Write Newton’s method in simplified form for determining the reciprocal of the square root of a positive number. Perform two iterations to approximate $1/\pm\sqrt{5}$, starting with $x_0 = 1$ and $x_0 = -1$.
a4. Two of the four zeros of $x^4 + 2x^3 - 7x^2 + 3$ are positive. Find them by Newton’s method, correct to two significant figures.
5. The equation $x - Rx^{-1} = 0$ has $x = \pm R^{1/2}$ for its solution. Establish Newton’s iterative scheme, in simplified form, for this situation. Carry out five steps for $R = 25$ and $x_0 = 1$.
6. Using a calculator, observe the sluggishness with which Newton’s method converges in the case of $f(x) = (x - 1)^m$ with $m = 8$ or $12$. Reconcile this with the theory. Use $x_0 = 1.1$.
a7. What linear function $y = ax + b$ approximates $f(x) = \sin x$ best in the vicinity of $x = \pi/4$? How does this exercise relate to Newton’s method?
8. In Exercises 1.2.10–1.2.12, several methods are suggested for computing $\ln 2$. Compare them with the use of Newton’s method applied to the equation $e^x = 2$.
a9. Define a sequence $x_{n+1} = x_n - \tan x_n$ with $x_0 = 3$. What is $\lim_{n\to\infty} x_n$?
10. The iteration formula
\[ x_{n+1} = x_n - (\cos x_n)(\sin x_n) + R\cos^2 x_n \]
where $R$ is a positive constant, was obtained by applying Newton’s method to some function $f(x)$. What was $f(x)$? What can this formula be used for?
a11. Establish Newton’s iterative scheme in simplified form, not involving the reciprocal of $x$, for the function $f(x) = xR - x^{-1}$. Carry out three steps of this procedure using $R = 4$ and $x_0 = -1$.
12. Consider the following procedures:
aa. $x_{n+1} = \frac{1}{3}\left(2x_n - \frac{r}{x_n^2}\right)$
b. $x_{n+1} = \frac{1}{2}x_n + \frac{1}{x_n}$
Do they converge for any nonzero initial point? If so, to what values?
13. Each of the following functions has $\sqrt[3]{R}$ as a zero for any positive real number $R$. Determine the formulas for Newton’s method for each and any necessary restrictions on the choice for $x_0$.
aa. $a(x) = x^3 - R$
b. $b(x) = 1/x^3 - 1/R$
ac. $c(x) = x^2 - R/x$
d. $d(x) = x - R/x^2$
ae. $e(x) = 1 - R/x^3$
f. $f(x) = 1/x - x^2/R$
ag. $g(x) = 1/x^2 - x/R$
h. $h(x) = 1 - x^3/R$
14. Determine the formulas for Newton’s method for finding a root of the function $f(x) = x - e/x$. What is the behavior of the iterates?
a15. If Newton’s method is used on $f(x) = x^3 - x + 1$ starting with $x_0 = 1$, what will $x_1$ be?
16. Locate the root of $f(x) = e^{-x} - \cos x$ that is nearest $\pi/2$.
a17. If Newton’s method is used on $f(x) = x^5 - x^3 + 3$ and if $x_n = 1$, what is $x_{n+1}$?
18. Determine Newton’s iteration formula for computing the cube root of $N/M$ for nonzero integers $N$ and $M$.
a19. For what starting values will Newton’s method converge if the function $f$ is $f(x) = x^2/(1 + x^2)$?
20. Starting at $x = 3$, $x < 3$, or $x > 3$, analyze what happens when Newton’s method is applied to the function $f(x) = 2x^3 - 9x^2 + 12x + 15$.
a21. (Continuation) Repeat for $f(x) = \sqrt{|x|}$, starting with $x < 0$ or $x > 0$.
a22. To determine $x = \sqrt[3]{R}$, we can solve the equation $x^3 = R$ by Newton’s method. Write the loop that carries out this process, starting from the initial approximation $x_0 = R$.
23. The reciprocal of a number $R$ can be computed without division by the iterative formula
\[ x_{n+1} = x_n(2 - x_n R) \]
Establish this relation by applying Newton’s method to some $f(x)$. Beginning with $x_0 = 0.2$, compute the reciprocal of 4 correct to six decimal digits or more by this rule. Tabulate the error at each step and observe the quadratic convergence.
24. On a certain modern computer, floating-point numbers have a 48-bit mantissa. Moreover, floating-point hardware can perform addition, subtraction, multiplication, and reciprocation, but not division. Unfortunately, the reciprocation hardware produces a result accurate to less than full precision, whereas the other operations produce results accurate to full floating-point precision.
a. Show that Newton’s method can be used to find a zero of the function $f(x) = 1 - 1/(ax)$. This will provide an approximation to $1/a$ that is accurate to full floating-point precision. How many iterations are required?
b. Show how to obtain an approximation to $b/a$ that is accurate to full floating-point precision.
25. Newton’s method for finding $\sqrt{R}$ is
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{R}{x_n}\right) \]
Perform three iterations of this scheme for computing $\sqrt{2}$, starting with $x_0 = 1$, and of the bisection method for $\sqrt{2}$, starting with interval $[1, 2]$. How many iterations are needed for each method in order to obtain $10^{-6}$ accuracy?
26. (Continuation) Newton’s method for finding $\sqrt{R}$, where $R = AB$, gives this approximation:
\[ \sqrt{AB} \approx \frac{A + B}{4} + \frac{AB}{A + B} \]
Show that if $x_0 = A$ or $B$, then two iterations of Newton’s method are needed to obtain this approximation, whereas if $x_0 = \frac{1}{2}(A + B)$, then only one iteration is needed.
a27. Show that Newton’s method applied to $x^m - R$ and to $1 - (R/x^m)$ for determining $\sqrt[m]{R}$ results in two similar yet different iterative formulas. Here $R > 0$, $m \ge 2$. Which formula is better and why?
28. Using a handheld calculator, carry out three iterations of Newton’s method using $x_0 = 1$ and $f(x) = 3x^3 + x^2 - 15x + 3$.
a29. What happens if the Newton iteration is applied to $f(x) = \arctan x$ with $x_0 = 2$? For what starting values will Newton’s method converge? (See Computer Exercise 3.2.7.)
30. Newton’s method can be interpreted as follows: Suppose that $f(x + h) = 0$. Then $f'(x) \approx [f(x + h) - f(x)]/h = -f(x)/h$. Continue this argument.
a31. Derive a formula for Newton’s method for the function $F(x) = f(x)/f'(x)$, where $f(x)$ is a function with simple zeros that is three times continuously differentiable. Show that the convergence of the resulting method to any zero $r$ of $f(x)$ is at least quadratic.
Hint: Apply the result in the text to $F$, making sure that $F$ has the required properties.
a32. The Taylor series for a function $f$ looks like this:
\[ f(x + h) = f(x) + hf'(x) + \frac{h^2}{2}f''(x) + \frac{h^3}{6}f'''(x) + \cdots \]
Suppose that $f(x)$, $f'(x)$, and $f''(x)$ are easily computed. Derive an algorithm like Newton’s method that uses three terms in the Taylor series. The algorithm should take as input an approximation to the root and produce as output a better approximation to the root. Show that the method is cubically convergent.
Hint: Use $e_n = e_{n+1} - h$ and ignore $e_{n+1}^2$ terms as being negligible.
33. To avoid computing the derivative at each step in Newton’s method, it has been proposed to replace $f'(x_n)$ by $f'(x_0)$. Derive the rate of convergence for this method.
34. Refer to the discussion of Newton’s method and establish that
\[ \lim_{n\to\infty} e_{n+1}e_n^{-2} = -\frac{1}{2}\,\frac{f''(r)}{f'(r)} \]
How can this be used in a practical case to test whether the convergence is quadratic? Devise an example in which $r$, $f'(r)$, and $f''(r)$ are all known, and test numerically the convergence of $e_{n+1}e_n^{-2}$.
a35. Show that in the case of a zero of multiplicity $m$, the modified Newton’s method
\[ x_{n+1} = x_n - m\,\frac{f(x_n)}{f'(x_n)} \]
is quadratically convergent.
Hint: Use Taylor series for each of $f(r + e_n)$ and $f'(r + e_n)$.
a36. The Steffensen method for solving the equation $f(x) = 0$ uses the formula
\[ x_{n+1} = x_n - \frac{f(x_n)}{g(x_n)} \]
in which $g(x) = \{f[x + f(x)] - f(x)\}/f(x)$. It is quadratically convergent, like Newton’s method. How many function evaluations are necessary per step? Using Taylor series, show that $g(x) \approx f'(x)$ if $f(x)$ is small and thus relate Steffensen’s iteration to Newton’s. What advantage does Steffensen’s have? Establish the quadratic convergence.
a37. A proposed generalization of Newton’s method is
\[ x_{n+1} = x_n - \omega\,\frac{f(x_n)}{f'(x_n)} \]
where the constant $\omega$ is an acceleration factor chosen to increase the rate of convergence. For what range of values of $\omega$ is a simple root $r$ of $f(x)$ a point of attraction; that is, $|g'(r)| < 1$, where $g(x) = x - \omega f(x)/f'(x)$? This method is quadratically convergent only if $\omega = 1$ because $g'(r) \ne 0$ when $\omega \ne 1$.
38. Suppose that $r$ is a double root of $f(x) = 0$; that is, $f(r) = f'(r) = 0$ but $f''(r) \ne 0$, and suppose that $f$ and all derivatives up to and including the second are continuous in some neighborhood of $r$. Show that $e_{n+1} \approx \frac{1}{2}e_n$ for Newton’s method and thereby conclude that the rate of convergence is linear near a double root. (If the root has multiplicity $m$, then $e_{n+1} \approx [(m - 1)/m]e_n$.)
39. (Simultaneous Nonlinear Equations) Using the Taylor series in two variables $(x, y)$ of the form
\[ f(x + h, y + k) = f(x, y) + hf_x(x, y) + kf_y(x, y) + \cdots \]
where $f_x = \partial f/\partial x$ and $f_y = \partial f/\partial y$, establish that Newton’s method for solving the two simultaneous nonlinear equations
\[ f(x, y) = 0, \qquad g(x, y) = 0 \]
can be described with the formulas
\[ x_{n+1} = x_n - \frac{f g_y - g f_y}{f_x g_y - g_x f_y}, \qquad y_{n+1} = y_n - \frac{f_x g - g_x f}{f_x g_y - g_x f_y} \]
Here the functions $f$, $f_x$, and so on are evaluated at $(x_n, y_n)$.
40. Newton’s method can be defined for the equation $f(z) = g(x, y) + ih(x, y)$, where $f(z)$ is an analytic function of the complex variable $z = x + iy$ ($x$ and $y$ real) and $g(x, y)$ and $h(x, y)$ are real functions for all $x$ and $y$. The derivative $f'(z)$ is given by $f'(z) = g_x + ih_x = h_y - ig_y$ because the Cauchy-Riemann equations $g_x = h_y$ and $h_x = -g_y$ hold. Here the partial derivatives are defined as $g_x = \partial g/\partial x$, $g_y = \partial g/\partial y$, and so on. Show that Newton’s method
\[ z_{n+1} = z_n - \frac{f(z_n)}{f'(z_n)} \]
can be written in the form
\[ x_{n+1} = x_n - \frac{g h_y - h g_y}{g_x h_y - g_y h_x}, \qquad y_{n+1} = y_n - \frac{h g_x - g h_x}{g_x h_y - g_y h_x} \]
Here all functions are evaluated at $z_n = x_n + iy_n$.
a41. Consider the algorithm of which one step consists of two steps of Newton’s method. What is its order of convergence?
42. (Continuation) Using the idea of the preceding exercise, show how we can easily create methods of arbitrarily high order for solving $f(x) = 0$. Why is the order of a method not the only criterion that should be considered in assessing its merits?
43. If we want to solve the equation $2 - x = e^x$ using Newton’s iteration, what are the equations and functions that must be coded? Give a pseudocode for doing this exercise. Include a suitable starting point and a suitable stopping criterion.
44. Suppose that we want to compute $\sqrt{2}$ by using Newton’s method on the equation $x^2 = 2$ (in the obvious, straightforward way). If the starting point is $x_0 = \frac{7}{5}$, what is the numerical value of the correction that must be added to $x_0$ to get $x_1$?
Hint: The arithmetic is quite easy if you do it using ratios of integers.
45. Apply Newton’s method to the equation $f(x) = 0$ with $f(x)$ as given below. Find out what happens and why.
a. $f(x) = e^x$
b. $f(x) = e^x + x^2$
46. Consider Newton’s method $x_{n+1} = x_n - f(x_n)/f'(x_n)$. If the sequence converges, then the limit point is a solution. Explain why or why not.
Computer Exercises 3.2
1. Using the procedure Newton and a single computer run, test your code on these examples: $f(t) = \tan t - t$ with $x_0 = 7$ and $g(t) = e^t - \sqrt{t + 9}$ with $x_0 = 2$. Print each iterate and its accompanying function value.
2. Write a simple, self-contained program to apply Newton’s method to the equation $x^3 + 2x^2 + 10x = 20$, starting with $x_0 = 2$. Evaluate the appropriate $f(x)$ and $f'(x)$, using nested multiplication. Stop the computation when two successive points differ by $\frac{1}{2} \times 10^{-5}$ or some other convenient tolerance close to your machine’s capability. Print all intermediate points and function values. Put an upper limit of ten on the number of steps.
3. (Continuation) Repeat using double precision and more steps.
a4. Find the root of the equation
\[ 2x(1 - x^2 + x)\ln x = x^2 - 1 \]
in the interval $[0, 1]$ by Newton’s method using double precision. Make a table that shows the number of correct digits in each step.
a5. In 1685, John Wallis published a book called Algebra, in which he described a method devised by Newton for solving equations. In slightly modified form, this method was also published by Joseph Raphson in 1690. This form is the one now commonly called Newton’s method or the Newton-Raphson method. Newton himself discussed the method in 1669 and illustrated it with the equation $x^3 - 2x - 5 = 0$. Wallis used the same example. Find a root of this equation in double precision, thus continuing the tradition that every numerical analysis student should solve this venerable equation.
6. In celestial mechanics, Kepler’s equation is important. It reads $x = y - \varepsilon \sin y$, in which $x$ is a planet’s mean anomaly, $y$ its eccentric anomaly, and $\varepsilon$ the eccentricity of its orbit. Taking $\varepsilon = 0.9$, construct a table of $y$ for 30 equally spaced values of $x$ in the interval $0 \le x \le \pi$. Use Newton’s method to obtain each value of $y$. The $y$ corresponding to an $x$ can be used as the starting point for the iteration when $x$ is changed slightly.
7. In Newton’s method, we progress in each step from a given point $x$ to a new point $x - h$, where $h = f(x)/f'(x)$. A refinement that is easily programmed is this: If $|f(x - h)|$ is not smaller than $|f(x)|$, then reject this value of $h$ and use $h/2$ instead. Test this refinement.
a8. Write a brief program to compute a root of the equation $x^3 = x^2 + x + 1$, using Newton’s method. Be careful to select a suitable starting value.
a9. Find the root of the equation $5(3x^4 - 6x^2 + 1) = 2(3x^5 - 5x^3)$ that lies in the interval $[0, 1]$ by using Newton’s method and a short program.
10. For each equation, write a brief program to compute and print eight steps of Newton’s method for finding a positive root.
aa. $x = 2\sin x$
ab. $x^3 = \sin x + 7$
ac. $\sin x = 1 - x$
ad. $x^5 + x^2 = 1 + 7x^3$ for $x \ge 2$
11. Write and test a recursive procedure for Newton’s method.
12. Rewrite and test the Newton procedure so that it is a character function and returns key words such as iterating, success, near-zero, max-iteration. Then a case statement can be used to print the results.
13. Would you like to see the number 0.55887766 come out of a calculation? Take three steps in Newton’s method on $10 + x^3 - 12\cos x = 0$ starting with $x_0 = 1$.
a14. Write a short program to solve for a root of the equation $e^{-x^2} = \cos x + 1$ on $[0, 4]$. What happens in Newton’s method if we start with $x_0 = 0$ or with $x_0 = 1$?
15. Find the root of the equation $\frac{1}{2}x^2 + x + 1 - e^x = 0$ by Newton’s method, starting with $x_0 = 1$, and account for the slow convergence.
16. Using $f(x) = x^5 - 9x^4 - x^3 + 17x^2 - 8x - 8$ and $x_0 = 0$, study and explain the behavior of Newton’s method. Hint: The iterates are initially cyclic.
17. Find the zero of the function f(x) = x − tanx that is closest to 99 (radians) by both the bisection method and Newton’s method.
Hint: Extremely accurate starting values are needed for this function. Use the computer to construct a table of values of f (x) around 99 to determine the nature of this function.
18. Using the bisection method, find the positive root of 2x(1 + x2)−1 = arctan x. Using the root as x0, apply Newton’s method to the function arctanx. Interpret the results.
19. If the root of $f(x) = 0$ is a double root, then Newton’s method can be accelerated by using
\[ x_{n+1} = x_n - 2\,\frac{f(x_n)}{f'(x_n)} \]
Numerically compare the convergence of this scheme with Newton’s method on a function with a known double root.
20. Program and test Steffensen’s method, as described in Exercise 3.2.36.
21. Consider the nonlinear system
f (x, y) = x2 + y2 − 25 = 0
g(x, y) = x2 − y − 2 = 0
Using a software package that has 2D plotting capabili- ties, illustrate what is going on in solving such a system by plotting f (x, y), g(x, y), and show their intersection with the (x, y)-plane. Determine approximate roots of these equations from the graphical results.
22. Solve this pair of simultaneous nonlinear equations by first eliminating $y$ and then solving the resulting equation in $x$ by Newton’s method. Start with the initial value $x_0 = 1.0$.
\[ x^3 - 2xy + y^7 - 4x^3y = 5 \]
\[ y\sin x + 3x^2y + \tan x = 4 \]
23. Using Equations (6) and (7), code Newton’s methods for nonlinear systems. Test your program by solving one or more of the following systems:
a. System in Computer Exercise 3.2.21.
b. System in Computer Exercise 3.2.22.
c. System (3) using starting values $(0, 0, 0)$.
d. Using starting values $\left(\frac{3}{4}, \frac{1}{2}, -\frac{1}{2}\right)$, solve
\[ x + y + z = 0, \qquad x^2 + y^2 + z^2 = 2, \qquad x(y + z) = -1 \]
e. Using starting values $(-0.01, -0.01)$, solve
\[ 4y^2 + 4y + 52x - 19 = 0, \qquad 169x^2 + 3y^2 + 111x - 10y - 10 = 0 \]
f. Select starting values, and solve
\[ \sin(x + y) = e^{x - y}, \qquad \cos(x + 6) = x^2y^2 \]
24. Investigate the behavior of Newton’s method for finding complex roots of polynomials with real coefficients. For example, the polynomial $p(x) = x^2 + 1$ has the complex conjugate pair of roots $\pm i$ and Newton’s method is $x_{n+1} = \frac{1}{2}(x_n - 1/x_n)$. First, program this method using real arithmetic and real numbers as starting values. Second, modify the program using complex arithmetic but still using only real starting values. Finally, use complex numbers as starting values. Observe the behavior of the iterates in each case.
25. Using Exercise 3.2.40, find a complex root of each of the following:
a. $z^3 - z - 1 = 0$
b. $z^4 - 2z^3 - 2iz^2 + 4iz = 0$
c. $2z^3 - 6(1 + i)z^2 - 6(1 - i) = 0$
d. $z = e^z$
Hint: For the last part, use Euler’s relation $e^{iy} = \cos y + i\sin y$.
26. In the Newton method for finding a root $r$ of $f(x) = 0$, we start with $x_0$ and compute the sequence $x_1, x_2, \ldots$ using the formula $x_{n+1} = x_n - f(x_n)/f'(x_n)$. To avoid computing the derivative at each step, it has been proposed to replace $f'(x_n)$ with $f'(x_0)$ in all steps. It has also been suggested that the derivative in Newton’s formula be computed only every other step. This method is given by
\[ x_{2n+1} = x_{2n} - \frac{f(x_{2n})}{f'(x_{2n})}, \qquad x_{2n+2} = x_{2n+1} - \frac{f(x_{2n+1})}{f'(x_{2n})} \]
Numerically compare both proposed methods to Newton’s method for several simple functions that have known roots. Print the error of each method on every iteration to monitor the convergence. How well do the proposed methods work?
27. (Basin of Attraction) Consider the complex polynomial $z^3 - 1$, whose zeros are the three cube roots of unity. Generate a picture showing three basins of attraction in the complex plane in the square region defined by $-1 \le \mathrm{Real}(z) \le 1$ and $-1 \le \mathrm{Imaginary}(z) \le 1$. To do this, use a mesh of $1000 \times 1000$ pixels inside the square. The center point of each pixel is used to start the iteration of Newton’s method. Assign a particular basin color to each pixel if convergence to a root is obtained with nmax = 10 iterations. The large number of iterations suggested can be avoided by doing some analysis with the aid of Theorem 1, since the iterates get within a certain neighborhood of the root and the iteration can be stopped. The criterion for convergence is to check both $|z_{n+1} - z_n| < \varepsilon$ and $|z_{n+1}^3 - 1| < \varepsilon$ with a small value such as $\varepsilon = 10^{-4}$ as well as a maximum number of iterations.
Hint: It is best to debug your program and get a crude picture with only a small number of pixels such as $10 \times 10$.
28. (Continuation) Repeat for the polynomial $z^4 - 1 = 0$.
29. Write real function Sqrt(x) to compute the square root of a real argument $x$ by the following algorithm: First, reduce the range of $x$ by finding a real number $r$ and an integer $m$ such that $x = 2^{2m}r$ with $\frac{1}{4} \le r < 1$. Next, compute $x_2$ by using three iterations of Newton’s method given by
\[ x_{n+1} = \frac{1}{2}\left(x_n + \frac{r}{x_n}\right) \]
with the special initial approximation
\[ x_0 = 1.27235367 + 0.242693281\,r - \frac{1.02966039}{1 + r} \]
Then set $\sqrt{x} \approx 2^m x_2$. Test this algorithm on various values of $x$. Obtain a listing of the code for the square-root function on your computer system. By reading the comments, try to determine what algorithm it uses.
30. The following method has third-order convergence for computing $\sqrt{R}$:
\[ x_{n+1} = \frac{x_n\left(x_n^2 + 3R\right)}{3x_n^2 + R} \]
Carry out some numerical experiments using this method and the method of the preceding exercise to see whether you observe a difference in the rate of convergence. Use the same starting procedures of range reduction and initial approximation.
31. Write real function CubeRoot(x) to compute the cube root of a real argument $x$ by the following procedure: First, determine a real number $r$ and an integer $m$ such that $x = r2^{3m}$ with $\frac{1}{8} \le r < 1$. Compute $x_4$ using four iterations of Newton’s method:
\[ x_{n+1} = \frac{2}{3}\left(x_n + \frac{r}{2x_n^2}\right) \]
with the special starting value
\[ x_0 = 2.502926 - \frac{8.045125\,(r + 0.3877552)}{(r + 4.612244)(r + 0.3877552) - 0.3598496} \]
Then set $\sqrt[3]{x} \approx 2^m x_4$. Test this algorithm on a variety of $x$ values.
32. Use mathematical software in MATLAB, Maple, or Mathematica to compute ten iterates of Newton’s method starting with $x_0 = 0$ for $f(x) = x^3 - 2x^2 + x - 3$. With 100 decimal places of accuracy and after nine iterations, show that the value of $x$ is
2.17455 94102 92980 07420 23189 88695 65392 56759 48725 33708 24983 36733 92030 23647 64792 75760 66115 28969 38832 0640
Show that the values of the function at each iteration are 9.0, 2.0, 0.26, 0.0065, $0.45 \times 10^{-5}$, $0.22 \times 10^{-11}$, $0.50 \times 10^{-24}$, $0.27 \times 10^{-49}$, $0.1 \times 10^{-98}$, and $0.1 \times 10^{-98}$. Again notice that the number of digits of accuracy in Newton’s method doubles (approximately) with each iteration once the iterates are sufficiently close to the root. (Also, see Bornemann, Wagon, and Waldvogel [2004] for a 100-Digit Challenge, which is a study in high-accuracy numerical computing.)
33. (Continuation) Use MATLAB, Maple, or Mathematica to discover that this root is exactly
\[ \sqrt[3]{\frac{79}{54} + \frac{1}{6}\sqrt{77}} + \frac{1}{9\sqrt[3]{\dfrac{79}{54} + \dfrac{1}{6}\sqrt{77}}} + \frac{2}{3} \]
Clearly, the decimal results are of more interest to us in our study of numerical methods.
34. (Continuation) Find all the roots including complex roots.
35. Numerically, find all the roots of the following systems of nonlinear equations. Then plot the curves to verify your results:
a. $y = 2x^2 + 3x - 4$, $y = x^2 + 2x + 3$
b. $y + x + 3 = 0$, $x^2 + y^2 = 17$
c. $y = \frac{1}{2}x - 5$, $y = x^2 + 2x - 15$
d. $xy = 1$, $x + y = 2$
e. $y = x^2$, $x^2 + (y - 2)^2 = 4$
f. $3x^2 + 2y^2 = 35$, $4x^2 - 3y^2 = 24$
g. $x^2 - xy + y^2 = 21$, $x^2 + 2xy - 8y^2 = 0$
36. Apply Newton’s method on these test problems:
a. $f(x) = x^2$.
Hint: The first derivative is zero at the root, and convergence may not be quadratic.
b. $f(x) = x + x^{4/3}$.
Hint: There is no second derivative at the root, and convergence may fail to be quadratic.
c. $f(x) = x + x^2\sin(2/x)$ for $x \ne 0$ and $f(0) = 0$, with $f'(x) = 1 + 2x\sin(2/x) - 2\cos(2/x)$ for $x \ne 0$ and $f'(0) = 1$.
Hint: The derivative of this function is not continuous at the root, and convergence may fail.
37. Let
\[ F(X) = \begin{bmatrix} x_1^2 - x_2 + c \\ x_2^2 - x_1 + c \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
Each component equation $f_1(x) = 0$ and $f_2(x) = 0$ describes a parabola. Any point $(x^*, y^*)$ where these two parabolas intersect is a solution to the nonlinear system of equations. Using Newton’s method for systems of nonlinear equations, find the solutions for each of these values of the parameter $c = \frac{1}{2}, \frac{1}{4}, -\frac{1}{2}, -1$. Give the Jacobian matrix for each. Also for each of these values, plot the resulting curves showing the points of intersection (Heath [2000], p. 218).
38.
39.
40.
LetF(X)=
2 x1 +2x2 −2 = 0 .
x1 + 4x2 − 4 0
Solve this nonlinear system starting with X(0) = (1, 2). Give the Jacobian matrix. Also plot the resulting curves showing the point(s) of intersection.
Using Newton’s method, find the zeros of f (z) = z3 − z with these starting values z(0) = 1 + 1.5i, 1 + 1.1i, 1 + 1.2i, 1 + 1.3i.
Use Halley’s method to produce a plot of the basins of attraction for p(z) = z6 − 1. Compare to Figure 3.8.
41. (Global Positioning System Project) Each time a GPS is used, a system of nonlinear equations of the form
\[ (x - a_1)^2 + (y - b_1)^2 + (z - c_1)^2 = [C(t_1 - D)]^2 \]
\[ (x - a_2)^2 + (y - b_2)^2 + (z - c_2)^2 = [C(t_2 - D)]^2 \]
\[ (x - a_3)^2 + (y - b_3)^2 + (z - c_3)^2 = [C(t_3 - D)]^2 \]
\[ (x - a_4)^2 + (y - b_4)^2 + (z - c_4)^2 = [C(t_4 - D)]^2 \]
is solved for the $(x, y, z)$ coordinates of the receiver. For each satellite $i$, the locations are $(a_i, b_i, c_i)$, and $t_i$ is the synchronized transmission time from the satellite. Further, $C$ is the speed of light, and $D$ is the difference between the synchronized time of the satellite clocks and the earth-bound receiver clock. Although there are only two points on the intersection of three spheres (one of which can be determined to be the desired location), a fourth sphere (satellite) must be used to resolve the inaccuracy in the clock contained in the low-cost receiver on earth. Explore various ways for solving such a nonlinear system. (See Hofmann-Wellenhof, Lichtenegger, and Collins [2001], Sauer [2012], and Strang and Borre [1997].)
42. Use mathematical software such as in MATLAB, Maple, or Mathematica and their built-in procedures to solve the system of nonlinear equations (8) in Example 2. Also, plot the given surfaces and the solution obtained.
Hint: You may need to use a slightly perturbed starting point $(0.5, 1.5, 0.5)$ to avoid a singularity in the Jacobian matrix.
3.3 Secant Method
Method Description
We now consider a general-purpose procedure that converges almost as fast as Newton’s method. This method mimics Newton’s method, but avoids the calculation of derivatives. Recall that Newton’s iteration defines xn+1 in terms of xn via the formula
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \tag{1} \]
In the secant method, we replace $f'(x_n)$ in Formula (1) by an approximation that is easily computed. Since the derivative is defined by
\[ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]
we can say that for small $h$,
\[ f'(x) \approx \frac{f(x + h) - f(x)}{h} \]
(In Section 4.3, we revisit this subject and learn that this is a finite difference approximation to the first derivative.) In particular, if $x = x_n$ and $h = x_{n-1} - x_n$, we have
\[ f'(x_n) \approx \frac{f(x_{n-1}) - f(x_n)}{x_{n-1} - x_n} \tag{2} \]
When this is used in Equation (1), the result defines the secant method:
Secant Method Formula
\[ x_{n+1} = x_n - \left[\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\right] f(x_n) \tag{3} \]
The secant method (like Newton’s) can be used to solve systems of equations as well.
The name of the method is taken from the fact that the right member of Equation (2) is the slope of a secant line to the graph of $f$ (see Figure 3.9). Of course, the left member is the slope of a tangent line to the graph of $f$. (Similarly, Newton’s method could be called the “tangent method.”)
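As a small illustration of Equations (2) and (3), the following Python sketch computes the secant slope and carries out a single secant step. The function names `secant_slope` and `secant_step` are hypothetical helpers introduced here, not procedures from the text, and the test functions are chosen only for convenience.

```python
import math

def secant_slope(f, x_prev, x_curr):
    """Right member of Equation (2): slope of the secant line through the two points."""
    return (f(x_prev) - f(x_curr)) / (x_prev - x_curr)

def secant_step(f, x_prev, x_curr):
    """One step of Equation (3): Newton's step with f'(x_n) replaced by the secant slope."""
    return x_curr - f(x_curr) / secant_slope(f, x_prev, x_curr)

# The secant slope of sin between two nearby points is close to cos at that point.
slope = secant_slope(math.sin, 1.001, 1.0)
print(slope, math.cos(1.0))

# One secant step for f(x) = x^2 - 2 from x0 = 1, x1 = 2.
x2 = secant_step(lambda x: x * x - 2.0, 1.0, 2.0)
print(x2)
```

The closer the two points are, the closer the secant slope is to the tangent slope, which is why the step resembles a Newton step once the iterates cluster near a root.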
FIGURE 3.9 Secant method. [Graph of $y = f(x)$ with a secant line; the points $r$, $x_{n+1}$, $x_n$, and $x_{n-1}$ are marked on the $x$-axis.]
A few remarks about Equation (3) are in order. Clearly, $x_{n+1}$ depends on two previous elements of the sequence. So to start, two points ($x_0$ and $x_1$) must be provided. Equation (3) can then generate $x_2, x_3, \ldots$. In programming the secant method, we could calculate and test the quantity $f(x_n) - f(x_{n-1})$. If it is nearly zero, an overflow can occur in Equation (3). Of course, if the method is succeeding, the points $x_n$ will be approaching a zero of $f$, so $f(x_n)$ will be converging to zero. (We are assuming that $f$ is continuous.) Also, $f(x_{n-1})$ will be converging to zero, and, a fortiori, $f(x_n) - f(x_{n-1})$ will approach zero. If the terms $f(x_n)$ and $f(x_{n-1})$ have the same sign, additional significant digits are canceled in the subtraction. So we could perhaps halt the iteration when $|f(x_n) - f(x_{n-1})| \le \delta|f(x_n)|$ for some specified tolerance $\delta$, such as $\frac{1}{2} \times 10^{-6}$. (See Computer Exercise 3.3.18.)
Secant Algorithm
A procedure pseudocode for nmax steps of the secant method applied to the function f starting with the interval [a, b] = [x0, x1] can be written as follows:
Secant Pseudocode

    procedure Secant(f, a, b, nmax, ε)
    integer n, nmax; real a, b, fa, fb, ε, d
    external function f
    fa ← f(a)
    fb ← f(b)
    if |fa| > |fb| then
        a ←→ b
        fa ←→ fb
    end if
    output 0, a, fa
    output 1, b, fb
    for n = 2 to nmax
        if |fa| > |fb| then
            a ←→ b
            fa ←→ fb
        end if
        d ← (b − a)/(fb − fa)
        b ← a
        fb ← fa
        d ← d · fa
        if |d| < ε then
            output “convergence”
            return
        end if
        a ← a − d
        fa ← f(a)
        output n, a, fa
    end for
    end procedure Secant
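The procedure can be sketched in Python roughly as follows. This is one possible translation under stated assumptions, not the authors' program: it returns the rows it would print instead of writing output, and it adds a guard against a zero denominator that the pseudocode does not contain. It runs in double precision, so later digits may differ from a 32-bit run.

```python
def secant(f, a, b, nmax, eps):
    """Secant iteration modeled on the pseudocode; returns rows (n, x, f(x))."""
    fa, fb = f(a), f(b)
    if abs(fa) > abs(fb):
        a, b = b, a            # interchange so that |f(a)| <= |f(b)|
        fa, fb = fb, fa
    rows = [(0, a, fa), (1, b, fb)]
    for n in range(2, nmax + 1):
        if abs(fa) > abs(fb):
            a, b = b, a
            fa, fb = fb, fa
        if fb == fa:           # added guard (assumption): avoid division by zero
            break
        d = (b - a) / (fb - fa)
        b, fb = a, fa
        d = d * fa
        if abs(d) < eps:       # correction is negligible: treat as convergence
            break
        a = a - d
        fa = f(a)
        rows.append((n, a, fa))
    return rows

# Usage with an illustrative polynomial and starting points.
rows = secant(lambda x: x**5 + x**3 + 3.0, -1.0, 1.0, nmax=20, eps=1e-12)
for n, x, fx in rows:
    print(n, x, fx)
```

Returning the rows keeps the sketch easy to test; a caller that wants the printed table of the pseudocode can simply loop over the result as shown.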
Here ←→ means interchange values. The endpoints $[a, b]$ are interchanged, if necessary, to keep $|f(a)| \le |f(b)|$. Consequently, the absolute values of the function are nonincreasing; thus, we have $|f(x_n)| \ge |f(x_{n+1})|$ for $n \ge 1$.
EXAMPLE 1
If the secant method is used on $p(x) = x^5 + x^3 + 3$ with $x_0 = -1$ and $x_1 = 1$, what is $x_8$?
Solution
The output from the computer program corresponding to the pseudocode for the secant method is as follows. (We used a 32-bit word-length computer.)

    n    xn          p(xn)
    0    −1.0        1.0
    1     1.0        5.0
    2    −1.5        −7.97
    3    −1.05575    0.512
    4    −1.11416    −9.991 × 10^−2
    5    −1.10462    7.593 × 10^−3
    6    −1.10529    1.011 × 10^−4
    7    −1.10530    2.990 × 10^−7
    8    −1.10530    2.990 × 10^−7

We can use mathematical software to find the single real root, −1.1053, and the two pairs of complex roots, −0.319201 ± 1.35008i and 0.871851 ± 0.806311i. ■
Convergence Analysis
The advantages of the secant method are that (after the first step) only one function evaluation is required per step (in contrast to Newton’s iteration, which requires two) and that it is almost as rapidly convergent. It can be shown that the basic secant method defined by Equation (3) obeys an equation of the form
Error Vector Approximation
\[ e_{n+1} = -\frac{1}{2}\,\frac{f''(\xi_n)}{f'(\zeta_n)}\,e_n e_{n-1} \approx -\frac{1}{2}\,\frac{f''(r)}{f'(r)}\,e_n e_{n-1} \tag{4} \]
where $\xi_n$ and $\zeta_n$ are in the smallest interval that contains $r$, $x_n$, and $x_{n-1}$. Thus, the ratio $e_{n+1}(e_n e_{n-1})^{-1}$ converges to $-\frac{1}{2}f''(r)/f'(r)$. The rapidity of convergence of this method is, in general, between those for bisection and for Newton’s method.
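The limiting ratio can be observed numerically. The sketch below is an illustration under assumptions, not part of the text: it uses the test function $f(x) = x^2 - 2$ with root $r = \sqrt{2}$, the starting points 1 and 2, and a hypothetical helper named `secant_error_ratios`. For this function, $-\frac{1}{2}f''(r)/f'(r) = -1/(2\sqrt{2})$.

```python
import math

def secant_error_ratios(f, x0, x1, r, steps):
    """Return the ratios e_{n+1} / (e_n * e_{n-1}) for the secant iterates, with e_n = r - x_n."""
    xs = [x0, x1]
    for _ in range(steps):
        xp, xc = xs[-2], xs[-1]
        xs.append(xc - f(xc) * (xc - xp) / (f(xc) - f(xp)))  # Equation (3)
    e = [r - x for x in xs]
    return [e[n + 1] / (e[n] * e[n - 1]) for n in range(1, len(e) - 1)]

r = math.sqrt(2.0)
ratios = secant_error_ratios(lambda x: x * x - 2.0, 1.0, 2.0, r, steps=5)
limit = -0.5 * 2.0 / (2.0 * r)  # -(1/2) f''(r)/f'(r) with f'' = 2 and f'(r) = 2r
for ratio in ratios:
    print(ratio, limit)
```

Only a few steps are taken because, in double precision, the errors soon reach the rounding level and the computed ratios stop being meaningful.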
To prove the second part of Equation (4), we begin with the definition of the secant method in Equation (3) and the error
en+1 = r − xn+1
=r− f(xn)xn−1−f(xn−1)xn
f(xn)− f(xn−1)
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
= f (xn)en−1 − f (xn−1)en
f(xn)− f(xn−1) f(xn) − f(xn−1)
= xn −xn−1 en en−1 e e (5)
3.3 Secant Method 145
xn −xn−1 n n−1 f(xn)= f(r −en)= f(r)−en f′(r)+ 1en2 f′′(r)+Oen3
f(xn)− f(xn−1) By Taylor’s Theorem, we establish
Since f (r) = 0, this gives us
en 2
2 f(xn)=−f′(r)+1en f′′(r)+Oen2
Changing the index to n − 1 yields
f (xn−1) = − f ′(r) + 1e f ′′(r) + Oe2
en−1 2 n−1 n−1 By subtraction between these equations, we arrive at
Error Bound (Superlinear Convergence)
f(xn) − f(xn−1) = 1(e −e )f′′(r)+Oe2 en en−1 2n n−1 n−1
Since xn − xn−1 = en−1 − en , we reach the equation f(xn)− f(xn−1)
en en−1 ≈−1 f′′(r) xn − xn−1 2
The first bracketed expression in Equation (5) can be written as xn−xn−1 ≈ 1
f (xn) − f (xn−1) f ′(r)
Hence, we have shown the second part of Equation (4). We leave the establishment of the first part of Equation (4) as an exercise because it depends on some material to be covered in Chapter 4. (See Exercise 3.3.18.)
From Equation (4), the order of convergence for the secant method can be expressed in terms of the inequality

$$|e_{n+1}| \leqq C|e_n|^{\alpha} \tag{6}$$

where $\alpha = \frac{1}{2}(1 + \sqrt{5}) \approx 1.62$ is the golden ratio. Since $\alpha > 1$, we say that the convergence is superlinear. Assuming that Inequality (6) is true, we can show that the secant method converges under certain conditions.
Let $c = c(\delta)$ be defined as in Equation (2) of Section 3.2. If $|r - x_n| \leqq \delta$ and $|r - x_{n-1}| \leqq \delta$, for some root $r$, then Equation (4) yields

$$|e_{n+1}| \leqq c|e_n||e_{n-1}| \tag{7}$$

Suppose that the initial points $x_0$ and $x_1$ are sufficiently close to $r$ that $c|e_0| \leqq D$ and $c|e_1| \leqq D$ for some $D < 1$. Then

$$
\begin{aligned}
c|e_1| &\leqq D, \qquad c|e_0| \leqq D \\
c|e_2| &\leqq c|e_1|\, c|e_0| \leqq D^2 \\
c|e_3| &\leqq c|e_2|\, c|e_1| \leqq D^3 \\
c|e_4| &\leqq c|e_3|\, c|e_2| \leqq D^5 \\
c|e_5| &\leqq c|e_4|\, c|e_3| \leqq D^8 \\
&\text{etc.}
\end{aligned}
$$

In general, we have

$$|e_n| \leqq c^{-1} D^{\lambda_{n+1}} \tag{8}$$

where inductively,

$$\lambda_1 = 1, \qquad \lambda_2 = 1, \qquad \lambda_n = \lambda_{n-1} + \lambda_{n-2} \quad (n \geqq 3) \tag{9}$$

This is the recurrence relation for generating the famous Fibonacci sequence, 1, 1, 2, 3, 5, 8, . . . . It can be shown to have the surprising explicit form

$$\lambda_n = \frac{1}{\sqrt{5}}\left(\alpha^n - \beta^n\right) \tag{10}$$

where $\alpha = \frac{1}{2}(1 + \sqrt{5})$ and $\beta = \frac{1}{2}(1 - \sqrt{5})$. Since $D < 1$ and $\lambda_n \to \infty$, we conclude from Inequality (8) that $e_n \to 0$. Hence, $x_n \to r$ as $n \to \infty$, and the secant method converges to the root $r$ if $x_0$ and $x_1$ are sufficiently close to it.
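The explicit form (10) is easy to check numerically. The sketch below, with illustrative names `fibonacci`, `lam`, `binet`, and `tail`, compares the recurrence (9) with the closed form in floating point and also evaluates $\lambda_{n+1} - \alpha\lambda_n$, which by (10) equals $\beta^n$ and therefore tends to zero.

```python
import math

alpha = (1 + math.sqrt(5)) / 2
beta = (1 - math.sqrt(5)) / 2


def fibonacci(count):
    """Return [lambda_1, ..., lambda_count] from the recurrence (9)."""
    lam = [1, 1]
    while len(lam) < count:
        lam.append(lam[-1] + lam[-2])
    return lam[:count]


lam = fibonacci(30)  # lam[n - 1] holds lambda_n

# Closed form (10) for n = 1, ..., 30
binet = [(alpha**n - beta**n) / math.sqrt(5) for n in range(1, 31)]
max_gap = max(abs(l - b) for l, b in zip(lam, binet))

# lambda_{n+1} - alpha * lambda_n for n = 1, ..., 29
tail = [lam[n] - alpha * lam[n - 1] for n in range(1, 30)]

print(max_gap)
print(tail[0], tail[-1])
```

Because $|\beta| < 1$, the entries of `tail` shrink geometrically, which is the fact requested in Exercise 3.3.6.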
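Next, we show that Inequality (6) is in fact reasonable, although the argument is not a proof. From Inequality (7), we now have

$$
\begin{aligned}
|e_{n+1}| &\leqq c|e_n||e_{n-1}| \\
&= c|e_n|^{\alpha}\,|e_n|^{1-\alpha}\,|e_{n-1}| \\
&\approx c|e_n|^{\alpha}\left(c^{-1}D^{\lambda_{n+1}}\right)^{1-\alpha} c^{-1}D^{\lambda_n} \\
&= |e_n|^{\alpha}\, c^{\alpha-1} D^{\lambda_{n+1}(1-\alpha)+\lambda_n} \\
&= |e_n|^{\alpha}\, c^{\alpha-1} D^{\lambda_{n+2}-\alpha\lambda_{n+1}}
\end{aligned}
$$

by using an approximation to Inequality (8). In the last line, we used the recurrence relation (9). Now $\lambda_{n+2} - \alpha\lambda_{n+1}$ converges to zero. (See Exercise 3.3.6.) Hence, $c^{\alpha-1} D^{\lambda_{n+2}-\alpha\lambda_{n+1}}$ is bounded, say by $C$, as a function of $n$. Thus, we have

$$|e_{n+1}| \approx C|e_n|^{\alpha}$$

which is a reasonable approximation to Inequality (6).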
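The exponent $\alpha \approx 1.62$ can also be observed experimentally. The following sketch is an illustration under stated assumptions rather than part of the analysis: it applies the plain secant formula to the test function $f(x) = x^2 - 2$ (chosen here because its root $\sqrt{2}$ is known), uses Python's `decimal` module with 400 digits so that the errors can be followed well below double precision, and estimates the order from consecutive errors as $\log|e_{n+1}/e_n| / \log|e_n/e_{n-1}|$.

```python
from decimal import Decimal, getcontext

getcontext().prec = 400  # enough digits to follow errors down to about 1e-178


def f(x):
    return x * x - 2


root = Decimal(2).sqrt()
x_prev, x_curr = Decimal(1), Decimal(2)

# Natural logarithms of |e_n| = |r - x_n| for n = 0, 1, ...
log_errors = [abs(root - x_prev).ln(), abs(root - x_curr).ln()]
for _ in range(11):
    x_next = x_curr - f(x_curr) * (x_curr - x_prev) / (f(x_curr) - f(x_prev))
    x_prev, x_curr = x_curr, x_next
    log_errors.append(abs(root - x_curr).ln())

# Observed order: log|e_{n+1}/e_n| / log|e_n/e_{n-1}|
orders = [
    (log_errors[n + 1] - log_errors[n]) / (log_errors[n] - log_errors[n - 1])
    for n in range(1, len(log_errors) - 1)
]
for value in orders[-4:]:
    print(float(value))
```

The early estimates are erratic, but the later ones approach the golden ratio, in line with Inequality (6).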
Another derivation (with a bit of hand waving) for the order of convergence of the secant method can be given by using a general recurrence relation. Equation (4) gives us

$$e_{n+1} \approx K e_n e_{n-1}$$

where $K = -\frac{1}{2} f''(r)/f'(r)$. We can write this as

$$|K e_{n+1}| \approx |K e_n||K e_{n-1}|$$

Let $z_i = \log|K e_i|$. Then we want to solve the recurrence equation

$$z_{n+1} = z_n + z_{n-1}$$

where $z_0$ and $z_1$ are arbitrary. This is a linear recurrence relation with constant coefficients similar to the one for the Fibonacci numbers (9) except that the first two values $z_0$ and $z_1$ are unknown. The solution is of the form

$$z_n = A\alpha^n + B\beta^n \tag{11}$$

where $\alpha = \frac{1}{2}(1 + \sqrt{5})$ and $\beta = \frac{1}{2}(1 - \sqrt{5})$. These are the roots of the quadratic equation $\lambda^2 - \lambda - 1 = 0$. Since $|\alpha| > |\beta|$, the term $A\alpha^n$ dominates, and we can say that

$$z_n \approx A\alpha^n$$

for large $n$ and for some constant $A$. Consequently, we have

$$|K e_n| \approx 10^{A\alpha^n}$$

Then it follows that

$$|K e_{n+1}| \approx 10^{A\alpha^{n+1}} = \left(10^{A\alpha^n}\right)^{\alpha} = |K e_n|^{\alpha}$$

Hence, we have

$$|e_{n+1}| \approx C|e_n|^{\alpha} \tag{12}$$

for large $n$ and for some constant $C$. Again, Inequality (6) is essentially established! A rigorous proof of Inequality (6) is tedious and quite long.

Comparison of Methods

In this chapter, three primary methods for solving an equation f(x) = 0 have been presented. The bisection method is reliable but slow. Newton's method is fast but often only near the root, and it requires f′. The secant method is nearly as fast as Newton's method and does not require knowledge of the derivative f′, which may not be available or may be too expensive to compute. The user of the bisection method must provide two points at which the signs of f(x) differ, and the function f need only be continuous. In using Newton's method, one must specify a starting point near the root, and f must be differentiable. The secant method requires two good starting points. Newton's procedure can be interpreted as the repetition of a two-step procedure summarized by the prescription linearize and solve. This strategy is applicable in many other numerical problems, and its importance cannot be overemphasized. Both Newton's method and the secant method fail to bracket a root. The modified false position method can retain the advantages of both methods.

The secant method is often faster at approximating roots of nonlinear functions than bisection and false position. Unlike these two methods, it does not require that the endpoints of the intervals [ak, bk] lie on opposite sides of the root with a change of sign in f. On the other hand, the slope of the secant line can become quite small, and a step can then move far from the current point. The secant method can therefore fail to find a root of a nonlinear function that has a small slope near the root because the secant line can jump a large amount.

For nice functions and guesses relatively close to the root, most of these methods require relatively few iterations before coming close to the root. However, there are pathological functions that can cause trouble for any of these methods. When selecting a method for solving a given nonlinear problem, one must consider many issues, such as what is known about the behavior of the function, whether an interval [a, b] satisfying f(a)f(b) < 0 is available, whether the first derivative of the function is available, whether there is a good initial guess to the desired root, and so on.
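The qualitative comparison of bisection, Newton's method, and the secant method can be illustrated by counting steps on one easy problem. The sketch below is an assumption-laden illustration: the test function $f(x) = x^2 - 2$, the starting data, the tolerance `tol`, and the three helper routines are all choices made here, and step counts on other problems can differ considerably.

```python
import math


def bisection(f, a, b, tol):
    """Halve [a, b] until its half-width is at most tol."""
    steps, fa = 0, f(a)
    while (b - a) / 2 > tol:
        c = (a + b) / 2
        fc = f(c)
        if fa * fc <= 0:      # sign change in [a, c]
            b = c
        else:                 # sign change in [c, b]
            a, fa = c, fc
        steps += 1
    return (a + b) / 2, steps


def newton(f, df, x, tol, nmax=50):
    """Newton's method; stops when the correction is smaller than tol."""
    for steps in range(1, nmax + 1):
        dx = f(x) / df(x)
        x -= dx
        if abs(dx) < tol:
            break
    return x, steps


def secant(f, x0, x1, tol, nmax=50):
    """Plain secant method; stops when the correction is smaller than tol."""
    f0, f1 = f(x0), f(x1)
    for steps in range(1, nmax + 1):
        if f1 == f0:          # horizontal secant line
            break
        dx = f1 * (x1 - x0) / (f1 - f0)
        x0, f0 = x1, f1
        x1 -= dx
        f1 = f(x1)
        if abs(dx) < tol:
            break
    return x1, steps


def f(x):
    return x * x - 2


def df(x):
    return 2 * x


tol = 1e-10
r_bis, n_bis = bisection(f, 1.0, 2.0, tol)
r_new, n_new = newton(f, df, 2.0, tol)
r_sec, n_sec = secant(f, 1.0, 2.0, tol)
print("bisection:", n_bis, "Newton:", n_new, "secant:", n_sec)
```

On this problem bisection needs several times as many steps as either of the other two methods, while the secant method needs only slightly more than Newton's method and uses no derivative.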
Combining Schemes
Yet Another Approach
Hybrid Schemes
In an effort to find the best algorithm for finding a zero of a given function, various hybrid methods have been developed. Some of these procedures combine the bisection method (used during the early iterations) with either the secant method or the Newton method. Also, adaptive schemes are used for monitoring the iterations and for carrying out stopping rules. More information on some hybrid secant-bisection methods and hybrid Newton-bisection methods with adaptive stopping rules can be found in Bus and Dekker [1975], Dekker [1969], Kahaner, Moler, and Nash [1989], and Novak, Ritter, and Woźniakowski [1995].
Fixed-Point Iteration
For a nonlinear equation f (x) = 0, we seek a point where the curve f intersects the x-axis (y = 0). An alternative approach is to recast the problem as a fixed-point problem x = g(x) for a related nonlinear function g. For the fixed point problem, we seek a point where the curve g intersects the diagonal line y = x. A value of x such that x = g(x) is a fixed point of g because x is unchanged when g is applied to it. Many iterative algorithms for solving a nonlinear equation f (x) = 0 are based on a fixed-point iterative method
$$x^{(n+1)} = g\left(x^{(n)}\right)$$
where g has fixed points that are solutions of f (x) = 0. An initial starting value x(0) is
selected, and the iterative method is applied repeatedly until it converges sufficiently well.
EXAMPLE 2
Apply the fixed-point procedure, where $g(x) = 1 + 2/x$, starting with $x^{(0)} = 1$, to compute a zero of the nonlinear function $f(x) = x^2 - x - 2$. Graphically, trace the convergence process.

Solution
The fixed-point method is

$$x^{(n+1)} = 1 + \frac{2}{x^{(n)}}$$

Eight steps of the iterative algorithm are $x^{(0)} = 1$, $x^{(1)} = 3$, $x^{(2)} = 5/3$, $x^{(3)} = 11/5$, $x^{(4)} = 21/11$, $x^{(5)} = 43/21$, $x^{(6)} = 85/43$, $x^{(7)} = 171/85$, and $x^{(8)} = 341/171 \approx 1.99415$. In Figure 3.10, we see that these steps spiral into the fixed point 2.

[FIGURE 3.10 Fixed-point iterations for $f(x) = x^2 - x - 2$: the curve $y = 1 + 2/x$ and the line $y = x$, with the iterates spiraling into the fixed point $r$.] ■
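The iterates listed in Example 2 can be reproduced exactly with rational arithmetic. This is a minimal sketch using Python's `fractions` module; the names `g` and `iterates` are illustrative.

```python
from fractions import Fraction


def g(x):
    """Fixed-point function g(x) = 1 + 2/x for f(x) = x^2 - x - 2."""
    return 1 + 2 / x


iterates = [Fraction(1)]          # x^(0) = 1
for _ in range(8):
    iterates.append(g(iterates[-1]))

for n, x in enumerate(iterates):
    print(n, x, float(x))
```

The values alternate above and below 2, which is the spiral pattern described for Figure 3.10.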
For a given nonlinear equation f(x) = 0, there may be many equivalent fixed-point problems x = g(x) with different functions g, some better than others. A simple way to characterize the behavior of an iterative method $x^{(n+1)} = g(x^{(n)})$ is through local convergence: the method is locally convergent for $x^*$ if $x^* = g(x^*)$ and $|g'(x^*)| < 1$. By locally convergent, we mean that there is an interval containing $x^*$ such that the fixed-point method converges for any starting value $x^{(0)}$ within that interval. If $|g'(x^*)| > 1$, then the fixed-point method diverges for any starting point $x^{(0)}$ other than $x^*$. Fixed-point iterative methods are used in standard practice for solving many science and engineering problems. In fact, the fixed-point theory can simplify the proof of the convergence of Newton's method.
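The role of $|g'(x^*)|$ can be seen by comparing two fixed-point forms of $x^2 - x - 2 = 0$ near the root $x^* = 2$. The sketch below uses $g(x) = 1 + 2/x$, for which $g'(2) = -1/2$, and the alternative rearrangement $g(x) = x^2 - 2$, for which $g'(2) = 4$; the second form, the starting value 2.1, and the helper `iterate` are choices made here for illustration.

```python
def iterate(g, x, steps):
    """Apply x <- g(x) the given number of times and return the result."""
    for _ in range(steps):
        x = g(x)
    return x


def g_contracting(x):
    return 1 + 2 / x      # g'(2) = -2/4 = -1/2, so |g'(2)| < 1


def g_expanding(x):
    return x * x - 2      # g'(2) = 4, so |g'(2)| > 1


near = iterate(g_contracting, 2.1, 60)
far = iterate(g_expanding, 2.1, 5)
print(near, far)
```

Starting from the same point, the first iteration is drawn into the fixed point 2, whereas the second is pushed rapidly away from it.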
For additional details and sample plots, see Kincaid and Cheney [2002] or Epureanu and Greenside [1998]. Also, look at other items in the Bibliography. For example, an expository paper by Ypma [1995] traces the historical development of Newton’s method through notes, letters, and publications by Isaac Newton, Joseph Raphson, and Thomas Simpson.
Summary 3.3
• The secant method for finding a zero r of a function f(x) is written as

$$x_{n+1} = x_n - \left[\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}\right] f(x_n)$$

for n ≧ 1, which requires two initial values x0 and x1. After the first step, only one new function evaluation per step is needed.
• After n + 1 steps of the secant method, the error iterates $e_i = r - x_i$ obey the equation

$$e_{n+1} = -\frac{1}{2}\left[\frac{f''(\xi_n)}{f'(\zeta_n)}\right] e_n e_{n-1}$$

which leads to the approximation

$$|e_{n+1}| \approx C|e_n|^{(1+\sqrt{5})/2} \approx C|e_n|^{1.62}$$

Therefore, the secant method has superlinear convergence behavior.
Exercises 3.3

a1. Calculate an approximate value for $4^{3/4}$ using one step of the secant method with $x_0 = 3$ and $x_1 = 2$.

2. If we use the secant method on $f(x) = x^3 - 2x + 2$ starting with $x_0 = 0$ and $x_1 = 1$, what is $x_2$?

a3. If the secant method is used on $f(x) = x^5 + x^3 + 3$ and if $x_{n-2} = 0$ and $x_{n-1} = 1$, what is $x_n$?

a4. If $x_{n+1} = x_n + (2 - e^{x_n})(x_n - x_{n-1})/(e^{x_n} - e^{x_{n-1}})$ with $x_0 = 0$ and $x_1 = 1$, what is $\lim_{n\to\infty} x_n$?

5. Using the bisection method, Newton's method, and the secant method, find the largest positive root correct to three decimal places of $x^3 - 5x + 3 = 0$. (All roots are in [−3, +3].)

6. Prove that in the first analysis of the secant method, $\lambda_{n+1} - \alpha\lambda_n$ converges to zero as $n \to \infty$.

7. Establish Equation (10).

8. Write out the derivation of the order of convergence of the secant method that uses recurrence relations; that is, find the constants A and B in Equation (11), and fill in the details in arriving at Equation (12).

a9. What is the appropriate formula for finding square roots using the secant method? (Refer to Exercise 3.2.1.)

10. Establish that the formula for the secant method can be written as

$$x_{n+1} = \frac{x_{n-1} f(x_n) - x_n f(x_{n-1})}{f(x_n) - f(x_{n-1})}$$

Explore whether it or Equation (3) is more numerically stable with better achievable accuracy in a computer program.

11. Show that if the iterates in Newton's method converge to a point r for which f′(r) ≠ 0, then f(r) = 0. Establish the same assertion for the secant method.
Hint: In the latter, the Mean-Value Theorem of Differential Calculus is useful. This is the case n = 0 in Taylor's Theorem.

a12. A method of finding a zero of a given function f proceeds as follows. Two initial approximations x0 and x1 to the zero are chosen, the value of x0 is fixed, and successive iterations are given by

$$x_{n+1} = x_n - \left[\frac{x_n - x_0}{f(x_n) - f(x_0)}\right] f(x_n)$$

This process will converge to a zero of f under certain conditions. Show that the rate of convergence to a simple zero is linear under some conditions.

13. Test the following sequences for different types of convergence (i.e., linear, superlinear, or quadratic), where n = 1, 2, 3, ….
aa. $x_n = n^{-2}$
b. $x_n = 2^{-n}$
ac. $x_n = 2^{-2^n}$
d. $x_n = 2^{-a_n}$ with $a_0 = a_1 = 1$ and $a_{n+1} = a_n + a_{n-1}$ for n ≧ 2

14. This exercise and the next three deal with the method of functional iteration. The method of functional iteration is as follows: Starting with any $x_0$, we define $x_{n+1} = f(x_n)$, where n = 0, 1, 2, …. Show that if f is continuous and if the sequence $\{x_n\}$ converges, then its limit is a fixed point of f.

a15. (Continuation) Show that if f is a function defined on the whole real line whose derivative satisfies |f′(x)| ≦ c with a constant c less than 1, then the method of functional iteration produces a fixed point of f.
Hint: In establishing this, the Mean-Value Theorem from Section 1.2 is helpful.

a16. (Continuation) With a calculator, try the method of functional iteration with f(x) = x/2 + 1/x, taking x0 = 1. What is the limit of the resulting sequence?

a17. (Continuation) Using functional iteration, show that the equation 10 − 2x + sin x = 0 has a root. Locate the root approximately by drawing a graph. Starting with your approximate root, use functional iteration to obtain the root accurately by using a calculator.
Hint: Write the equation in the form $x = 5 + \frac{1}{2}\sin x$.

18. Establish the first part of Equation (4) using Equation (5).
Hint: Use the relationship between divided differences and derivatives from Section 4.3.

Computer Exercises 3.3

a1. Use the secant method to find the zero near −0.5 of $f(x) = e^x - 3x^2$. This function also has a zero near 4. Find this positive zero by Newton's method.

2. Write

procedure Secant(f, x1, x2, epsi, delta, maxf, x, ierr)

which uses the secant method to solve f(x) = 0. The input parameters are as follows: f is the name of the given function; x1 and x2 are the initial estimates of the solution; epsi is a positive tolerance such that the iteration stops if the difference between two consecutive iterates is smaller than this value; delta is a positive tolerance such that the iteration stops if a function value is smaller in magnitude than this value; and maxf is a positive integer bounding the number of evaluations of the function allowed. The output parameters are as follows: x is the final estimate of the solution, and ierr is an integer error flag that indicates whether a tolerance test was violated. Test this routine using the function of Computer Exercise 3.3.1. Print the final estimate of the solution and the value of the function at this point.

3. Find a zero of one of the functions given in the introduction of this chapter using one of the methods introduced in this chapter.

4. Write and test a recursive procedure for the secant method.

5. Rerun the example in this section with x0 = 0 and x1 = 1. Explain any unusual results.

6. Write a simple program to compare the secant method with Newton's method for finding a root of each function.
aa. $x^3 - 3x + 1$ with $x_0 = 2$
ab. $x^3 - 2\sin x$ with $x_0 = \frac{1}{2}$
Use the x1 value from Newton's method as the second starting point for the secant method. Print out each iteration for both methods.
a7. Write a simple program to find the root of $f(x) = x^3 + 2x^2 + 10x - 20$ using the secant method with starting values $x_0 = 2$ and $x_1 = 1$. Let it run at most 20 steps, and include a stopping test as well. Compare the number of steps needed here to the number needed in Newton's method. Is the convergence quadratic?

8. Test the secant method on the set of functions $f_k(x) = 2e^{-k}x + 1 - 3e^{-kx}$ for k = 1, 2, 3, …, 10. Use the starting points 0 and 1 in each case.

a9. An example by Wilkinson [1963] shows that minute alterations in the coefficients of a polynomial may have massive effects on the roots. Let

$$f(x) = (x - 1)(x - 2) \cdots (x - 20)$$

which has become known as the Wilkinson polynomial. The zeros of f are, of course, the integers 1, 2, . . . , 20. Try to determine what happens to the zero r = 20 when the function is altered to $f(x) - 10^{-8}x^{19}$.
Hint: The secant method in double precision will locate a zero in the interval [20, 21].

10. Test the secant method on an example in which r, f′(r), and f″(r) are known in advance. Monitor the ratios $e_{n+1}/(e_n e_{n-1})$ to see whether they converge to $-\frac{1}{2} f''(r)/f'(r)$. The function f(x) = arctan x is suitable for this experiment.

11. Using a function of your choice, verify numerically that the iterative method

$$x_{n+1} = x_n - \frac{f(x_n)}{\sqrt{[f'(x_n)]^2 - f(x_n) f''(x_n)}}$$

is cubically convergent at a simple root but only linearly convergent at a multiple root.

12. Test numerically whether Olver's method, given by

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} - \frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)}\left[\frac{f(x_n)}{f'(x_n)}\right]^2$$

is cubically convergent to a root of f. Try to establish that it is.

13. (Continuation) Repeat for Halley's method

$$x_{n+1} = x_n - \frac{1}{a_n} \quad\text{with}\quad a_n = \frac{f'(x_n)}{f(x_n)} - \frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)}$$

14. (Moler-Morrison Algorithm) Computing an approximation for $\sqrt{x^2 + y^2}$ does not require square roots. It can be done as follows:

real function f(x, y)
integer n; real a, b, c, x, y
f ← max{|x|, |y|}
a ← min{|x|, |y|}
for n = 1 to 3
    b ← (a/f)^2
    c ← b/(4 + b)
    f ← f + 2cf
    a ← ca
end for
end function f

Test the algorithm on some simple cases such as (x, y) = (3, 4), (−5, 12), and (7, −24). Then write a routine that uses the function f(x, y) for approximating the Euclidean norm of a vector $x = (x_1, x_2, \ldots, x_n)$; that is, the nonnegative number $\|x\| = \left(x_1^2 + x_2^2 + \cdots + x_n^2\right)^{1/2}$.

15. Study the following functions by starting with any initial value of x0 in the domain [0, 2] and iterating $x_{n+1} = F(x_n)$. First use a calculator and then a computer. Explain the results.
a. Use the tent function

$$F(x) = \begin{cases} 2x & \text{if } 2x < 1 \\ 2x - 1 & \text{if } 2x \geqq 1 \end{cases}$$

b. Repeat using the function

$$F(x) = 10x \pmod{1}$$

Hint: Don't be surprised by chaotic behavior. The interested reader can learn more about the dynamics of one-dimensional maps by reading papers such as the one by Bassien [1998].

16. Show how the secant method can be used to solve systems of equations such as those in Computer Exercises 3.2.21–3.2.23.

17. (Student Research Project) Muller's method is an algorithm for computing solutions of an equation f(x) = 0. It is similar to the secant method in that it replaces f locally by a simple function and finds a root of it. Naturally, this step is repeated. The simple function chosen in Muller's method is a quadratic polynomial, p, that interpolates f at the three most recent points. After p has been determined, its roots are computed, and one of them is chosen as the next point in the sequence. Since this quadratic function may have complex roots, the algorithm should be programmed with this in mind. Suppose that points $x_{n-2}$, $x_{n-1}$, and $x_n$ have been computed. Set

$$p(x) = a(x - x_n)(x - x_{n-1}) + b(x - x_n) + c$$

where a, b, and c are determined so that p interpolates f at the three points mentioned previously. Then find the roots of p and take $x_{n+1}$ to be the root of p closest to $x_n$. At the beginning, three points must be furnished by the user. Program the method, allowing for complex numbers throughout. Test your program on the example

$$p(x) = x^3 + x^2 - 10x - 10$$

If the first three points are 1, 2, 3, then you should find that the polynomial is p(x) = 7(x − 3)(x − 2) + 14(x − 3) − 4 and x4 = 3.17971 086. Next, test your code on a polynomial having real coefficients but some complex roots.

18. Program and test the code for the secant algorithm after incorporating the stopping criterion described in the text.

19. Using mathematical software such as MATLAB, Mathematica, and Maple, find the real zero of the polynomial $p(x) = x^5 + x^3 + 3$. Attain more digits of accuracy than shown in the solution to Example 1 in the text.

20. (Continuation) Using mathematical software that allows for complex roots, find all zeros of the polynomial.

21. Program a hybrid method for solving several of the nonlinear problems given as examples in the text, and compare your results with those given.

22. Find the fixed points for each of the following functions:
a. $e^x + 1$
b. $e^{-x} - x$
c. $x^2 - 4\sin x$
d. $x^3 + 6x^2 + 11x - 6$
e. $\sin x$

23. For the nonlinear equation $f(x) = x^2 - x - 2 = 0$ with roots −1 and 2, write four fixed-point problems x = g(x) that are equivalent. Plot all of these, and show that they all intersect the line x = y. Also, plot the convergence steps of each of these fixed-point iterations for different starting values $x^{(0)}$. Show that the behavior of these fixed-point schemes can vary wildly: slow convergence, fast convergence, and divergence.
4
Interpolation and Numerical Differentiation
The viscosity of water has been experimentally determined at different temperatures, as indicated in the following table:
Temperature    0°       5°       10°      15°
Viscosity      1.792    1.519    1.308    1.140

From this table, how can we estimate a reasonable value for the viscosity at temperature 8°?
The method of polynomial interpolation, described in Section 4.1, can be used to create a polynomial of degree 3 that assumes the values in the table. This polynomial should provide acceptable intermediate values for temperatures not tabulated. The value of that polynomial at the point 8° turns out to be 1.386.
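The quoted value 1.386 can be checked directly. The sketch below evaluates the cubic through the four tabulated points as a weighted sum of the tabulated viscosities; the function name `interpolate` is illustrative, and the weights it computes are the cardinal polynomials that Section 4.1 introduces.

```python
def interpolate(xs, ys, x):
    """Evaluate at x the polynomial of least degree through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += weight * yi
    return total


temperature = [0.0, 5.0, 10.0, 15.0]
viscosity = [1.792, 1.519, 1.308, 1.140]

estimate = interpolate(temperature, viscosity, 8.0)
print(round(estimate, 3))
```

The unrounded value is about 1.386176, which agrees with the figure stated in the text.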
4.1 Polynomial Interpolation

Preliminary Remarks

We pose three problems concerning the representation of functions to give an indication of the subject matter in this chapter, in Chapter 6 (on splines), and in Chapter 9 (on least squares).

First, suppose that we have a table of numerical values of a function:

x    x0    x1    ···    xn
y    y0    y1    ···    yn

Is it possible to find a simple and convenient formula that reproduces the given points exactly?

The second problem is similar, but it is assumed that the given table of numerical values is contaminated by errors, as might occur if the values came from a physical experiment. Now we ask for a formula that represents the data (approximately) and, if possible, filters out the errors.

As a third problem, a function f is given, perhaps in the form of a computer procedure, but it is an expensive function to evaluate. In this case, we ask for another function g that is simpler to evaluate and produces a reasonable approximation to f. Sometimes in this problem, we want g to approximate f with full machine precision.

In all of these problems, a simple function p can be obtained that represents or approximates the given table or function f. The representation p can always be taken to be a polynomial, although many other types of simple functions can also be used. Once a simple function p has been obtained, it can be used in place of f in many situations. For example, the integral of f could be estimated by the integral of p, and the latter should generally be easier to evaluate.
In many situations, a polynomial solution to the problems outlined above is unsatisfactory from a practical point of view, and other classes of functions must be considered. In this book, one other class of versatile functions is discussed: the spline functions (see Chapter 6). The present chapter concerns polynomials exclusively, and Chapter 9 discusses general linear families of functions, of which splines and polynomials are important examples.
The obvious way in which a polynomial can fail as a practical solution to one of the preceding problems is that its degree may be unreasonably high. For instance, if the table considered contains 1,000 entries, a polynomial of degree 999 may be required to represent it. Polynomials also may have the surprising defect of being highly oscillatory. If the table is precisely represented by a polynomial p, then p(xi ) = yi for 0 ≦ i ≦ n. For points other than the given xi, however, p(x) may be a very poor representation of the function from which the table arose. The example in Section 4.2 involving the Runge function illustrates this phenomenon.
Polynomial Interpolation
We begin again with a table of values:
x x0 x1 ··· xn
y y0 y1 ··· yn
and assume that the xi's form a set of n + 1 distinct points. The table represents n + 1 points in the Cartesian plane, and we want to find a polynomial curve that passes through all points. Thus, we seek to determine a polynomial that is defined for all x and takes on the corresponding values of yi for each of the n + 1 distinct xi's in this table. A polynomial p for which p(xi) = yi when 0 ≦ i ≦ n is said to interpolate the table. The points xi are called nodes.
Consider the first and simplest case, n = 0. Here, a constant function solves the problem. In other words, the polynomial p of degree 0 defined by the equation p(x ) = y0 reproduces the one-node table.
The next simplest case occurs when n = 1. Since a straight line can be passed through two points, a linear function is capable of solving the problem. Explicitly, the polynomial p defined by

$$
\begin{aligned}
p(x) &= \left(\frac{x - x_1}{x_0 - x_1}\right) y_0 + \left(\frac{x - x_0}{x_1 - x_0}\right) y_1 \\
&= y_0 + \left(\frac{y_1 - y_0}{x_1 - x_0}\right)(x - x_0)
\end{aligned}
$$

is of first degree (at most) and reproduces the table. That means (in this case) that p(x0) = y0 and p(x1) = y1, as is easily verified. This p is used for linear interpolation.

EXAMPLE 1
Find the polynomial of least degree that interpolates this table:

x    1.4    1.25
y    3.7    3.9

Solution
By the linear case equation, the polynomial that is sought is

$$
\begin{aligned}
p(x) &= \left(\frac{x - 1.25}{1.4 - 1.25}\right) 3.7 + \left(\frac{x - 1.4}{1.25 - 1.4}\right) 3.9 \\
&= 3.7 + \left(\frac{3.9 - 3.7}{1.25 - 1.4}\right)(x - 1.4) \\
&= 3.7 - \frac{4}{3}(x - 1.4)
\end{aligned}
$$
■

As we can see, an interpolating polynomial can be written in a variety of forms, including the Newton form and the Lagrange form. The Newton form is probably the most convenient and efficient; however, conceptually, the Lagrange form has several advantages. We begin with the Lagrange form, since it may be easier to understand.

Interpolating Polynomial: Lagrange Form

Suppose that we wish to interpolate arbitrary functions at a set of fixed nodes x0, x1, . . . , xn. We first define a system of n + 1 special polynomials of degree n known as cardinal polynomials in interpolation theory. These are denoted by $l_0, l_1, \ldots, l_n$ and have the property

$$l_i(x_j) = \delta_{ij} = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$

Once these are available, we can interpolate any function f by the Lagrange form of the interpolation polynomial:

$$p_n(x) = \sum_{i=0}^{n} l_i(x) f(x_i) \tag{1}$$

This function $p_n$, being a linear combination of the polynomials $l_i$, is itself a polynomial of degree at most n. Furthermore, when we evaluate $p_n$ at $x_j$, we get $f(x_j)$:

$$p_n(x_j) = \sum_{i=0}^{n} l_i(x_j) f(x_i) = l_j(x_j) f(x_j) = f(x_j)$$

Thus, $p_n$ is the interpolating polynomial for the function f at nodes x0, x1, . . . , xn. It remains now only to write the formula for the cardinal polynomial $l_i$, which is

$$l_i(x) = \prod_{\substack{j=0 \\ j \neq i}}^{n} \left(\frac{x - x_j}{x_i - x_j}\right) \qquad (0 \leqq i \leqq n) \tag{2}$$

This formula indicates that $l_i(x)$ is the product of n linear factors:

$$l_i(x) = \left(\frac{x - x_0}{x_i - x_0}\right)\left(\frac{x - x_1}{x_i - x_1}\right) \cdots \left(\frac{x - x_{i-1}}{x_i - x_{i-1}}\right)\left(\frac{x - x_{i+1}}{x_i - x_{i+1}}\right) \cdots \left(\frac{x - x_n}{x_i - x_n}\right)$$

(The denominators are just numbers; the variable x occurs only in the numerators.) Thus, $l_i$ is a polynomial of degree n. Notice that when $l_i(x)$ is evaluated at $x = x_i$, each factor in the preceding equation becomes 1. Hence, we obtain $l_i(x_i) = 1$. But when $l_i(x)$ is evaluated at any other node, say, $x_j$, one of the factors in the preceding equation is 0, and $l_i(x_j) = 0$, for i ≠ j.
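Figure 4.1 shows the first few Lagrange cardinal polynomials: $l_0(x)$, $l_1(x)$, $l_2(x)$, $l_3(x)$, and $l_4(x)$.

[FIGURE 4.1 First few Lagrange cardinal polynomials: curves $l_0$, $l_1$, $l_2$, $l_3$, $l_4$ plotted against x over −1 ≦ x ≦ 1.]

EXAMPLE 2
Write out the cardinal polynomials appropriate to the problem of interpolating the following table, and give the Lagrange form of the interpolating polynomial:

x       1/3    1/4    1
f(x)    2      −1     7

Solution
Using Equation (2), we have

$$
\begin{aligned}
l_0(x) &= \frac{\left(x - \frac{1}{4}\right)(x - 1)}{\left(\frac{1}{3} - \frac{1}{4}\right)\left(\frac{1}{3} - 1\right)} = -18\left(x - \frac{1}{4}\right)(x - 1) \\
l_1(x) &= \frac{\left(x - \frac{1}{3}\right)(x - 1)}{\left(\frac{1}{4} - \frac{1}{3}\right)\left(\frac{1}{4} - 1\right)} = 16\left(x - \frac{1}{3}\right)(x - 1) \\
l_2(x) &= \frac{\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)}{\left(1 - \frac{1}{3}\right)\left(1 - \frac{1}{4}\right)} = 2\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)
\end{aligned}
$$

Therefore, the interpolating polynomial in Lagrange's form is

$$p_2(x) = -36\left(x - \frac{1}{4}\right)(x - 1) - 16\left(x - \frac{1}{3}\right)(x - 1) + 14\left(x - \frac{1}{3}\right)\left(x - \frac{1}{4}\right)$$
■

Existence of Interpolating Polynomial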
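The computation in Example 2 can be verified with exact rational arithmetic. The sketch below implements Equations (1) and (2) as written; the function names `cardinal` and `lagrange` are illustrative, and `fractions.Fraction` is used so that the Kronecker delta property holds exactly rather than only to rounding error.

```python
from fractions import Fraction


def cardinal(nodes, i, x):
    """Evaluate the cardinal polynomial l_i at x, following Equation (2)."""
    value = Fraction(1)
    for j, xj in enumerate(nodes):
        if j != i:
            value *= (x - xj) / (nodes[i] - xj)
    return value


def lagrange(nodes, values, x):
    """Evaluate the Lagrange form (1) of the interpolating polynomial at x."""
    return sum(cardinal(nodes, i, x) * values[i] for i in range(len(nodes)))


nodes = [Fraction(1, 3), Fraction(1, 4), Fraction(1)]
values = [2, -1, 7]

# Kronecker delta property: delta[i][j] = l_i(x_j)
delta = [[cardinal(nodes, i, xj) for xj in nodes] for i in range(3)]

# The interpolant reproduces the table and can be evaluated elsewhere, e.g. at 0
table_check = [lagrange(nodes, values, xj) for xj in nodes]
p_at_zero = lagrange(nodes, values, Fraction(0))
print(delta, table_check, p_at_zero)
```

Evaluating the hand-derived $p_2$ at $x = 0$ gives $-9 - \frac{16}{3} + \frac{7}{6} = -\frac{79}{6}$, which matches the value computed by the code.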
Existence of p(x)
The Lagrange interpolation formula proves the existence of an interpolating polynomial for any table of values. There is another constructive way of proving this fact, and it leads to a different formula.
Suppose that we have succeeded in finding a polynomial p that reproduces part of the table. Assume, say, that p(xi ) = yi for 0 ≦ i ≦ k. We shall attempt to add to p another term that enables the new polynomial to reproduce one more entry in the table. We consider
\[ p(x) + c(x - x_0)(x - x_1) \cdots (x - x_k) \]
where $c$ is a constant to be determined. This is surely a polynomial. It also reproduces the first $k + 1$ points in the table because $p$ itself does so, and the added portion takes the value 0 at each of the points $x_0, x_1, \ldots, x_k$. (Its form is chosen for precisely this reason.) Now we adjust the parameter $c$ so that the new polynomial takes the value $y_{k+1}$ at $x_{k+1}$. Imposing this condition, we obtain
\[ p(x_{k+1}) + c(x_{k+1} - x_0)(x_{k+1} - x_1) \cdots (x_{k+1} - x_k) = y_{k+1} \]
The proper value of $c$ can be obtained from this equation because none of the factors $x_{k+1} - x_i$, for $0 \le i \le k$, can be zero. Remember our original assumption that the $x_i$'s are all distinct.
This analysis is an example of inductive reasoning. We have shown that the process can be started and that it can be continued. Hence, Theorem 1, the theorem on existence of polynomial interpolation, has been partially justified.
Two parts of this formal statement must still be established. First, the degree of the polynomial increases by at most 1 in each step of the inductive argument. At the beginning, the degree was at most 0, so at the end, the degree is at most $n$.
Second, we establish the uniqueness of the polynomial $p$. Suppose that another polynomial $q$ claims to accomplish what $p$ does; that is, $q$ is also of degree at most $n$ and satisfies $q(x_i) = y_i$ for $0 \le i \le n$. Then the polynomial $p - q$ is of degree at most $n$ and takes the value 0 at the $n + 1$ distinct points $x_0, x_1, \ldots, x_n$. Recall, however, that a nonzero polynomial of degree $n$ can have at most $n$ roots. Hence $p - q$ must be the zero polynomial, and we conclude that $p = q$, which establishes the uniqueness of $p$.
Interpolating Polynomial: Newton Form
In Example 2, we found the Lagrange form of the interpolating polynomial:
\[ p_2(x) = -36\left(x - \tfrac14\right)(x - 1) - 16\left(x - \tfrac13\right)(x - 1) + 14\left(x - \tfrac13\right)\left(x - \tfrac14\right) \]
It can be simplified to
\[ p_2(x) = -\tfrac{79}{6} + \tfrac{349}{6}x - 38x^2 \]
Now, we learn that this polynomial can be written in another form called the nested Newton form:
\[ p_2(x) = 2 + \left(x - \tfrac13\right)\left[36 + \left(x - \tfrac14\right)(-38)\right] \]
It involves the fewest arithmetic operations and is recommended for evaluating $p_2(x)$. It cannot be overemphasized that the Newton and Lagrange forms are just two different derivations for precisely the same polynomial. The Newton form has the advantage of easy extensibility to accommodate additional data points.
The preceding discussion provides a method for constructing an interpolating polynomial. The method is known as the Newton algorithm, and the resulting polynomial is the Newton form of the interpolating polynomial.
Theorem 1 (Theorem on Existence of Polynomial Interpolation)
If points $x_0, x_1, \ldots, x_n$ are distinct, then for arbitrary real values $y_0, y_1, \ldots, y_n$, there is a unique polynomial $p$ of degree at most $n$ such that $p(x_i) = y_i$ for $0 \le i \le n$.
EXAMPLE 3
Using the Newton algorithm, find the interpolating polynomial of least degree for this table:

x    0     1     −1     2     −2
y    −5    −3    −15    39    −9

Solution
In the construction, five successive polynomials appear; these are labeled $p_0$, $p_1$, $p_2$, $p_3$, and $p_4$. The polynomial $p_0$ is defined to be
\[ p_0(x) = -5 \]
The polynomial $p_1$ has the form
\[ p_1(x) = p_0(x) + c(x - x_0) = -5 + c(x - 0) \]
The interpolation condition placed on $p_1$ is that $p_1(1) = -3$. Therefore, we have $-5 + c(1 - 0) = -3$. Hence, $c = 2$, and $p_1$ is
\[ p_1(x) = -5 + 2x \]
The polynomial $p_2$ has the form
\[ p_2(x) = p_1(x) + c(x - x_0)(x - x_1) = -5 + 2x + cx(x - 1) \]
The interpolation condition placed on $p_2$ is that $p_2(-1) = -15$. Hence, we have $-5 + 2(-1) + c(-1)(-1 - 1) = -15$. This yields $c = -4$, so
\[ p_2(x) = -5 + 2x - 4x(x - 1) \]
The remaining steps for $p_3$ and $p_4$ are similar. The final result is the Newton form of the interpolating polynomial:
\[ p_4(x) = -5 + 2x - 4x(x - 1) + 8x(x - 1)(x + 1) + 3x(x - 1)(x + 1)(x - 2) \]
■
Later, we develop a better algorithm for constructing the Newton interpolating polynomial. Nevertheless, the method just explained is a systematic one and involves very little computation. An important feature to notice is that each new polynomial in the algorithm is obtained from its predecessor by adding a new term. Thus, at the end, the final polynomial exhibits all the previous polynomials as constituents.
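The step-by-step construction used in Example 3 can be sketched in Python as follows. This is an illustrative sketch, not the algorithm developed later in the section; the function name `newton_incremental` is an assumption, and exact rational arithmetic is used so that the integer coefficients of the example come out exactly.

```python
from fractions import Fraction

def newton_incremental(xs, ys):
    """Build the Newton coefficients one data point at a time.

    At step k the constant c is chosen so that
    p_{k-1}(x_k) + c (x_k - x_0)...(x_k - x_{k-1}) = y_k.
    """
    coeffs = []
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        p_at_xk = Fraction(0)   # value of the current polynomial p_{k-1} at x_k
        product = Fraction(1)   # running product (x_k - x_0)...(x_k - x_{i-1})
        for i in range(k):
            p_at_xk += coeffs[i] * product
            product *= xk - xs[i]
        # Here product = (x_k - x_0)...(x_k - x_{k-1}), nonzero for distinct nodes.
        coeffs.append((yk - p_at_xk) / product)
    return coeffs

# Table from Example 3.
xs = [0, 1, -1, 2, -2]
ys = [-5, -3, -15, 39, -9]
print(newton_incremental(xs, ys))
```

For the table of Example 3 the computed coefficients are −5, 2, −4, 8, and 3, in agreement with the final form of $p_4$.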
Nested Form
Before continuing, let's rewrite the Newton form of the interpolating polynomial for efficient evaluation.

EXAMPLE 4
Write the polynomial $p_4$ of Example 3 in nested form and use it to evaluate $p_4(3)$.

Solution
We write $p_4$ as
\[ p_4(x) = -5 + x\Bigl(2 + (x - 1)\bigl(-4 + (x + 1)(8 + (x - 2)3)\bigr)\Bigr) \]
Therefore, we obtain
\[ p_4(3) = -5 + 3\bigl(2 + 2(-4 + 4(8 + 3))\bigr) = 241 \]
Another solution, also in nested form, is
\[ p_4(x) = -5 + x\bigl(4 + x(-7 + x(2 + 3x))\bigr) \]
from which we obtain
\[ p_4(3) = -5 + 3\bigl(4 + 3(-7 + 3(2 + 3 \cdot 3))\bigr) = 241 \]
This form is obtained by expanding and systematic factoring of the original polynomial. It is also known as a nested form, and its evaluation is by nested multiplication. ■

To describe nested multiplication in a formal way (so that it can be translated into a pseudocode), consider a general polynomial in the Newton form. It might be
\[ p_n(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots + a_n(x - x_0)(x - x_1) \cdots (x - x_{n-1}) \]
The nested form of $p_n(x)$ is
\[ p_n(x) = a_0 + (x - x_0)\Bigl(a_1 + (x - x_1)\bigl(a_2 + \cdots + (x - x_{n-2})(a_{n-1} + (x - x_{n-1})a_n) \cdots \bigr)\Bigr) \]
The Newton interpolation polynomial can be written succinctly as
\[ p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j) \tag{3} \]
Here $\prod_{j=0}^{-1} (x - x_j)$ is interpreted to be 1. Also, we can write it as
\[ p_n(x) = \sum_{i=0}^{n} a_i \pi_i(x) \]
where
\[ \pi_i(x) = \prod_{j=0}^{i-1} (x - x_j) \tag{4} \]
Figure 4.2 shows the first few Newton polynomials: $\pi_0(x)$, $\pi_1(x)$, $\pi_2(x)$, $\pi_3(x)$, $\pi_4(x)$, and $\pi_5(x)$.

FIGURE 4.2 First few Newton polynomials (figure not reproduced)
In evaluating $p(t)$ for a given numerical value of $t$, we naturally start with the innermost parentheses, forming successively the following quantities:
\[
\begin{aligned}
v_0 &= a_n \\
v_1 &= v_0(t - x_{n-1}) + a_{n-1} \\
v_2 &= v_1(t - x_{n-2}) + a_{n-2} \\
&\;\;\vdots \\
v_n &= v_{n-1}(t - x_0) + a_0
\end{aligned}
\]
The quantity $v_n$ is now $p(t)$. In the following pseudocode, a subscripted variable is not needed for $v_i$. Instead, we can write

Evaluation of Interpolation Polynomial Pseudocode

    integer i, n; real t, v; real array (a_i)_{0:n}, (x_i)_{0:n}
    v ← a_n
    for i = n − 1 to 0 step −1
        v ← v(t − x_i) + a_i
    end for

Here, the array $(a_i)_{0:n}$ contains the $n + 1$ coefficients of the Newton form of the interpolating polynomial (3) of degree at most $n$, and the array $(x_i)_{0:n}$ contains the $n + 1$ nodes $x_i$.
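The nested-multiplication loop translates almost line for line into Python. The sketch below is illustrative only; the name `newton_eval` and its argument order are assumptions, and the data are the coefficients and nodes of $p_4$ from Example 3.

```python
def newton_eval(coeffs, nodes, t):
    """Evaluate a_0 + a_1(t - x_0) + ... + a_n(t - x_0)...(t - x_{n-1})
    by nested multiplication, starting from the innermost parentheses."""
    n = len(coeffs) - 1
    v = coeffs[n]
    for i in range(n - 1, -1, -1):
        v = v * (t - nodes[i]) + coeffs[i]
    return v

# Coefficients and nodes of p4 from Example 3.
coeffs = [-5, 2, -4, 8, 3]
nodes = [0, 1, -1, 2, -2]
print(newton_eval(coeffs, nodes, 3))  # prints 241
```

Only the first $n$ nodes are used in the loop; the last node $x_n$ does not appear in the Newton form.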
Calculating Coefficients $a_i$ Using Divided Differences
We turn now to the problem of determining the coefficients $a_0, a_1, \ldots, a_n$ efficiently. Again we start with a table of values of a function $f$:

x       x_0       x_1       x_2       ···    x_n
f(x)    f(x_0)    f(x_1)    f(x_2)    ···    f(x_n)

The points $x_0, x_1, \ldots, x_n$ are assumed to be distinct, but no assumption is made about their positions on the real line.
Previously, we established that for each $n = 0, 1, \ldots$, there exists a unique polynomial $p_n$ such that
• The degree of $p_n$ is at most $n$.
• $p_n(x_i) = f(x_i)$ for $i = 0, 1, \ldots, n$.
It was shown that $p_n$ can be expressed in the Newton form
\[ p_n(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + \cdots + a_n(x - x_0) \cdots (x - x_{n-1}) \]
A crucial observation about $p_n$ is that the coefficients $a_0, a_1, \ldots$ do not depend on $n$. In other words, $p_n$ is obtained from $p_{n-1}$ by adding one more term, without altering the coefficients already present in $p_{n-1}$ itself. This is because we began with the hope that $p_n$ could be expressed in the form
\[ p_n(x) = p_{n-1}(x) + a_n(x - x_0) \cdots (x - x_{n-1}) \]
and discovered that it was indeed possible.
A way of systematically determining the unknown coefficients $a_0, a_1, \ldots, a_n$ is to set $x$ equal in turn to $x_0, x_1, \ldots, x_n$ in the Newton form (3) and to write down the resulting equations:
\[
\begin{aligned}
f(x_0) &= a_0 \\
f(x_1) &= a_0 + a_1(x_1 - x_0) \\
f(x_2) &= a_0 + a_1(x_2 - x_0) + a_2(x_2 - x_0)(x_2 - x_1) \\
&\;\;\text{etc.}
\end{aligned}
\tag{5}
\]
The compact form of Equations (5) is
\[ f(x_k) = \sum_{i=0}^{k} a_i \prod_{j=0}^{i-1} (x_k - x_j) \qquad (0 \le k \le n) \tag{6} \]
Equations (5) can be solved for the $a_i$'s in turn, starting with $a_0$. Then we see that $a_0$ depends on $f(x_0)$, that $a_1$ depends on $f(x_0)$ and $f(x_1)$, and so on. In general, $a_k$ depends on $f(x_0), f(x_1), \ldots, f(x_k)$. In other words, $a_k$ depends on the values of $f$ at the nodes $x_0, x_1, \ldots, x_k$.
The traditional notation is
\[ a_k = f[x_0, x_1, \ldots, x_k] \tag{7} \]
This equation defines $f[x_0, x_1, \ldots, x_k]$. The quantity $f[x_0, x_1, \ldots, x_k]$ is called the divided difference of order $k$ for $f$. Notice also that the coefficients $a_0, a_1, \ldots, a_k$ are uniquely determined by System (6). Indeed, there is no possible choice for $a_0$ other than $a_0 = f(x_0)$. Similarly, there is now no choice for $a_1$ other than $[f(x_1) - a_0]/(x_1 - x_0)$, and so on. Using Equations (5), we see that the first few divided differences can be written as
\[
\begin{aligned}
a_0 &= f(x_0) \\
a_1 &= \frac{f(x_1) - a_0}{x_1 - x_0} = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \\
a_2 &= \frac{f(x_2) - a_0 - a_1(x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)}
     = \frac{\dfrac{f(x_2) - f(x_1)}{x_2 - x_1} - \dfrac{f(x_1) - f(x_0)}{x_1 - x_0}}{x_2 - x_0}
\end{aligned}
\]

EXAMPLE 5
For the table:

x       1     −4     0
f(x)    3     13     −23

determine the quantities $f[x_0]$, $f[x_0, x_1]$, and $f[x_0, x_1, x_2]$.

Solution
We write out the system of Equations (5) for this concrete case:
\[
\begin{aligned}
3 &= a_0 \\
13 &= a_0 + a_1(-5) \\
-23 &= a_0 + a_1(-1) + a_2(-1)(4)
\end{aligned}
\]
The solution is $a_0 = 3$, $a_1 = -2$, and $a_2 = 7$. Hence, for this function, $f[1] = 3$, $f[1, -4] = -2$, and $f[1, -4, 0] = 7$. ■

With this new notation, the Newton form of the interpolating polynomial takes the form
\[ p_n(x) = \sum_{i=0}^{n} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x - x_j) \tag{8} \]
with the usual convention that $\prod_{j=0}^{-1} (x - x_j) = 1$. Notice that the coefficient of $x^n$ in $p_n$ is $f[x_0, x_1, \ldots, x_n]$ because the term $x^n$ occurs only in $\prod_{j=0}^{n-1} (x - x_j)$. It follows that if $f$ is a polynomial of degree $\le n - 1$, then $f[x_0, x_1, \ldots, x_n] = 0$.
We return to the question of how to compute the required divided differences $f[x_0, x_1, \ldots, x_k]$. From System (5) or System (6), it is evident that this computation can be performed recursively. We simply solve Equation (6) for $a_k$ as follows:
\[ f(x_k) = a_k \prod_{j=0}^{k-1} (x_k - x_j) + \sum_{i=0}^{k-1} a_i \prod_{j=0}^{i-1} (x_k - x_j) \]
and
\[ a_k = \frac{f(x_k) - \displaystyle\sum_{i=0}^{k-1} a_i \prod_{j=0}^{i-1} (x_k - x_j)}{\displaystyle\prod_{j=0}^{k-1} (x_k - x_j)} \]
Using Equation (7), we have
\[ f[x_0, x_1, \ldots, x_k] = \frac{f(x_k) - \displaystyle\sum_{i=0}^{k-1} f[x_0, x_1, \ldots, x_i] \prod_{j=0}^{i-1} (x_k - x_j)}{\displaystyle\prod_{j=0}^{k-1} (x_k - x_j)} \tag{9} \]

Algorithm: Computing the Divided Differences of $f$
• Set $f[x_0] = f(x_0)$.
• For $k = 1, 2, \ldots, n$, compute $f[x_0, x_1, \ldots, x_k]$ by Equation (9).    (10)

EXAMPLE 6
Using Algorithm (10), write out the divided differences formulas for $f[x_0]$, $f[x_0, x_1]$, $f[x_0, x_1, x_2]$, and $f[x_0, x_1, x_2, x_3]$.

Solution
\[
\begin{aligned}
f[x_0] &= f(x_0) \\
f[x_0, x_1] &= \frac{f(x_1) - f[x_0]}{x_1 - x_0} \\
f[x_0, x_1, x_2] &= \frac{f(x_2) - f[x_0] - f[x_0, x_1](x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)} \\
f[x_0, x_1, x_2, x_3] &= \frac{f(x_3) - f[x_0] - f[x_0, x_1](x_3 - x_0) - f[x_0, x_1, x_2](x_3 - x_0)(x_3 - x_1)}{(x_3 - x_0)(x_3 - x_1)(x_3 - x_2)}
\end{aligned}
\]
■

Algorithm (10) is easily programmed and is capable of computing the divided differences $f[x_0], f[x_0, x_1], \ldots, f[x_0, x_1, \ldots, x_n]$ at the cost of $\frac12 n(3n + 1)$ additions, $(n - 1)(n - 2)$ multiplications, and $n$ divisions, excluding arithmetic operations on the indices. Now a more refined method is presented for which the pseudocode requires only three statements (!) and costs only $\frac12 n(n + 1)$ divisions and $n(n + 1)$ additions.
At the heart of the new method is the following remarkable theorem:

Theorem 2 (Recursive Property of Divided Differences)
The divided differences obey the formula
\[ f[x_0, x_1, \ldots, x_k] = \frac{f[x_1, x_2, \ldots, x_k] - f[x_0, x_1, \ldots, x_{k-1}]}{x_k - x_0} \tag{11} \]

Proof
Since $f[x_0, x_1, \ldots, x_k]$ was defined to be equal to the coefficient $a_k$ in the Newton form of the interpolating polynomial $p_k$ of Equation (3), we can say that $f[x_0, x_1, \ldots, x_k]$ is the coefficient of $x^k$ in the polynomial $p_k$ of degree $\le k$, which interpolates $f$ at $x_0, x_1, \ldots, x_k$. Similarly, $f[x_1, x_2, \ldots, x_k]$ is the coefficient of $x^{k-1}$ in the polynomial $q$ of degree $\le k - 1$, which interpolates $f$ at $x_1, x_2, \ldots, x_k$. Likewise, $f[x_0, x_1, \ldots, x_{k-1}]$ is the coefficient of $x^{k-1}$ in the polynomial $p_{k-1}$ of degree $\le k - 1$, which interpolates $f$ at $x_0, x_1, \ldots, x_{k-1}$. The three polynomials $p_k$, $q$, and $p_{k-1}$ are intimately related. In fact,
\[ p_k(x) = q(x) + \frac{x - x_k}{x_k - x_0}\,[q(x) - p_{k-1}(x)] \tag{12} \]
To establish Equation (12), observe that the right side is a polynomial of degree at most $k$. Evaluating it at $x_i$, for $1 \le i \le k - 1$, results in $f(x_i)$:
\[ q(x_i) + \frac{x_i - x_k}{x_k - x_0}\,[q(x_i) - p_{k-1}(x_i)] = f(x_i) + \frac{x_i - x_k}{x_k - x_0}\,[f(x_i) - f(x_i)] = f(x_i) \]
Similarly, evaluating it at $x_0$ and $x_k$ gives $f(x_0)$ and $f(x_k)$, respectively. By the uniqueness of interpolating polynomials, the right side of Equation (12) must be $p_k(x)$, and Equation (12) is established.
Completing the argument to justify Equation (11), we take the coefficient of $x^k$ on both sides of Equation (12). The result is Equation (11). Indeed, we see that $f[x_1, x_2, \ldots, x_k]$ is the coefficient of $x^{k-1}$ in $q$, and $f[x_0, x_1, \ldots, x_{k-1}]$ is the coefficient of $x^{k-1}$ in $p_{k-1}$. ■

Notice that $f[x_0, x_1, \ldots, x_k]$ is not changed if the nodes $x_0, x_1, \ldots, x_k$ are permuted. Thus, for example, we have
\[ f[x_0, x_1, x_2] = f[x_1, x_2, x_0] \]
The reason is that $f[x_0, x_1, x_2]$ is the coefficient of $x^2$ in the quadratic polynomial interpolating $f$ at $x_0, x_1, x_2$, whereas $f[x_1, x_2, x_0]$ is the coefficient of $x^2$ in the quadratic polynomial interpolating $f$ at $x_1, x_2, x_0$. These two polynomials are, of course, the same! A formal statement in mathematical language is as follows:

Theorem 3 (Invariance Theorem)
The divided difference $f[x_0, x_1, \ldots, x_k]$ is invariant under all permutations of the arguments $x_0, x_1, \ldots, x_k$.
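Both theorems are easy to check on a small table. The following Python sketch is an illustration only; the function name `divided_difference` is an assumption, and the data are taken from Example 5. It applies the recursive formula (11) directly and then confirms that every reordering of the (node, value) pairs gives the same top-order divided difference.

```python
from fractions import Fraction
from itertools import permutations

def divided_difference(xs, fs):
    """Return f[x_0, ..., x_k] using the recursive formula (11)."""
    if len(xs) == 1:
        return Fraction(fs[0])
    left = divided_difference(xs[:-1], fs[:-1])   # f[x_0, ..., x_{k-1}]
    right = divided_difference(xs[1:], fs[1:])    # f[x_1, ..., x_k]
    return (right - left) / (xs[-1] - xs[0])

# Table from Example 5.
xs = [1, -4, 0]
fs = [3, 13, -23]
top = divided_difference(xs, fs)

# Invariance: permute the (node, value) pairs together and recompute.
pairs = list(zip(xs, fs))
all_equal = all(
    divided_difference([p[0] for p in perm], [p[1] for p in perm]) == top
    for perm in permutations(pairs)
)
print(top, all_equal)
```

This plain recursion recomputes lower-order differences many times, so it is meant as a check of the formulas rather than as an efficient method.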
Since the variables $x_0, x_1, \ldots, x_k$ and $k$ are arbitrary, the recursive Formula (11) can also be written as
\[ f[x_i, x_{i+1}, \ldots, x_{j-1}, x_j] = \frac{f[x_{i+1}, x_{i+2}, \ldots, x_j] - f[x_i, x_{i+1}, \ldots, x_{j-1}]}{x_j - x_i} \tag{13} \]
The first three divided differences are thus
\[
\begin{aligned}
f[x_i] &= f(x_i) \\
f[x_i, x_{i+1}] &= \frac{f[x_{i+1}] - f[x_i]}{x_{i+1} - x_i} \\
f[x_i, x_{i+1}, x_{i+2}] &= \frac{f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]}{x_{i+2} - x_i}
\end{aligned}
\]
Using Formula (13), we can construct a divided-difference table for a function $f$. It is customary to arrange it as follows (here $n = 3$):

x      f[ ]      f[ , ]        f[ , , ]          f[ , , , ]
x0     f[x0]
                 f[x0,x1]
x1     f[x1]                   f[x0,x1,x2]
                 f[x1,x2]                        f[x0,x1,x2,x3]
x2     f[x2]                   f[x1,x2,x3]
                 f[x2,x3]
x3     f[x3]

In the table, the coefficients along the top diagonal are the ones needed to form the Newton form of the interpolating polynomial (3).

EXAMPLE 7
Construct a divided-difference diagram for the function $f$ given in the following table, and write out the Newton form of the interpolating polynomial.

x       1     3/2     0     2
f(x)    3     13/4    3     5/3

Solution
The first entry is
\[ f[x_0, x_1] = \frac{\frac{13}{4} - 3}{\frac32 - 1} = \frac12 \]
After completion of column 3, the first entry in column 4 is
\[ f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{\frac16 - \frac12}{0 - 1} = \frac13 \]
The complete diagram is

x       f[ ]     f[ , ]    f[ , , ]    f[ , , , ]
1       3
                 1/2
3/2     13/4               1/3
                 1/6                   −2
0       3                  −5/3
                 −2/3
2       5/3

Thus, we obtain
\[ p_3(x) = 3 + \tfrac12(x - 1) + \tfrac13(x - 1)\left(x - \tfrac32\right) - 2(x - 1)\left(x - \tfrac32\right)x \]
■
Algorithms and Pseudocode
Turning next to algorithms, we suppose that a table for $f$ is given at points $x_0, x_1, \ldots, x_n$ and that all the divided differences $a_{ij} \equiv f[x_i, x_{i+1}, \ldots, x_{i+j}]$ are to be computed. The following pseudocode accomplishes this:

Divided Differences Pseudocode

    integer i, j, n; real array (a_ij)_{0:n×0:n}, (x_i)_{0:n}
    for i = 0 to n
        a_{i,0} ← f(x_i)
    end for
    for j = 1 to n
        for i = 0 to n − j
            a_{i,j} ← (a_{i+1,j−1} − a_{i,j−1})/(x_{i+j} − x_i)
        end for
    end for

Observe that the coefficients of the interpolating polynomial (3) are stored in the first row of the array $(a_{ij})_{0:n \times 0:n}$.
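As an illustration, the two-dimensional scheme can be written in Python as below. This is a sketch rather than a reference implementation; the name `divided_difference_table` is an assumption, entry `a[i][j]` holds the divided difference of order `j` that starts at node `x_i`, and exact fractions are used so that the result can be compared with the diagram of Example 7.

```python
from fractions import Fraction

def divided_difference_table(xs, fs):
    """Return a with a[i][j] = f[x_i, ..., x_{i+j}].

    Entries with i + j > n are not defined and are left as None.
    """
    n = len(xs) - 1
    a = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        a[i][0] = Fraction(fs[i])
    for j in range(1, n + 1):
        for i in range(n - j + 1):
            a[i][j] = (a[i + 1][j - 1] - a[i][j - 1]) / (xs[i + j] - xs[i])
    return a

# Table from Example 7.
xs = [Fraction(1), Fraction(3, 2), Fraction(0), Fraction(2)]
fs = [Fraction(3), Fraction(13, 4), Fraction(3), Fraction(5, 3)]
table = divided_difference_table(xs, fs)
newton_coeffs = table[0]   # first row: f[x0], f[x0,x1], f[x0,x1,x2], ...
print(newton_coeffs)
```

The first row contains 3, 1/2, 1/3, and −2, which are the entries along the top diagonal of the diagram in Example 7.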
If the divided differences are being computed for use only in constructing the Newton form of the interpolation polynomial
\[ p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j) \]
where $a_i = f[x_0, x_1, \ldots, x_i]$, there is no need to store all of them. Only $f[x_0], f[x_0, x_1], \ldots, f[x_0, x_1, \ldots, x_n]$ need to be stored.
When a one-dimensional array $(a_i)_{0:n}$ is used, the divided differences can be overwritten each time from the last storage location backward so that, finally, only the desired coefficients remain. In this case, the amount of computing is the same as in the preceding case, but the storage requirements are less. (Why?) Here is a pseudocode to do this:

Improved Pseudocode

    integer i, j, n; real array (a_i)_{0:n}, (x_i)_{0:n}
    for i = 0 to n
        a_i ← f(x_i)
    end for
    for j = 1 to n
        for i = n to j step −1
            a_i ← (a_i − a_{i−1})/(x_i − x_{i−j})
        end for
    end for
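One way to check the overwriting scheme is to run it on a small table. The Python sketch below is illustrative only (the name `newton_coefficients` is an assumption); it traces the case $n = 3$ on the data of Example 7 with exact fractions.

```python
from fractions import Fraction

def newton_coefficients(xs, ys):
    """Overwrite a copy of ys so that a[i] ends up as f[x_0, ..., x_i]."""
    n = len(xs) - 1
    a = list(ys)
    for j in range(1, n + 1):
        # Sweep from the last entry backward, so that a[i - 1] still holds
        # the difference of order j - 1 when a[i] is updated.
        for i in range(n, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

# The case n = 3, using the table from Example 7.
xs = [Fraction(1), Fraction(3, 2), Fraction(0), Fraction(2)]
ys = [Fraction(3), Fraction(13, 4), Fraction(3), Fraction(5, 3)]
coeffs = newton_coefficients(xs, ys)
print(coeffs)
```

After the pass $j = 1$ the array holds 3, 1/2, 1/6, −2/3; after $j = 2$ it holds 3, 1/2, 1/3, −5/3; and after $j = 3$ it holds the coefficients 3, 1/2, 1/3, −2.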
This algorithm is more intricate, and the reader is invited to verify it, say, in the case $n = 3$. For the numerical experiments suggested in the computer problems, the following two procedures should be satisfactory. The first is called Coef. It requires as input the number $n$ and tabular values in the arrays $(x_i)$ and $(y_i)$. Remember that the number of points in the table is $n + 1$. The procedure then computes the coefficients required in the Newton interpolating polynomial, storing them in the array $(a_i)$.

Coef Pseudocode

    procedure Coef(n, (x_i), (y_i), (a_i))
    integer i, j, n; real array (x_i)_{0:n}, (y_i)_{0:n}, (a_i)_{0:n}
    for i = 0 to n
        a_i ← y_i
    end for
    for j = 1 to n
        for i = n to j step −1
            a_i ← (a_i − a_{i−1})/(x_i − x_{i−j})
        end for
    end for
    end procedure Coef
The second is function Eval. It requires as input the array (xi ) from the original table and the array (ai ), which is output from Coef. The array (ai ) contains the coefficients for the Newton form of the interpolation polynomial. Finally, as input, a single real value for t is given. The function then returns the value of the interpolating polynomial at t.
Eval Pseudocode

    real function Eval(n, (x_i), (a_i), t)
    integer i, n; real t, temp; real array (x_i)_{0:n}, (a_i)_{0:n}
    temp ← a_n
    for i = n − 1 to 0 step −1
        temp ← (temp)(t − x_i) + a_i
    end for
    Eval ← temp
    end function Eval
Since the coefficients of the interpolating polynomial need be computed only once, we call Coef first, and then all subsequent calls for evaluating this polynomial are accomplished with Eval. Notice that only the $t$ argument should be changed between successive calls to function Eval.

EXAMPLE 8
Write pseudocode for the Newton form of the interpolating polynomial $p$ for $\sin x$ at ten equidistant points in the interval $[0, 1.6875]$. The code finds the maximum value of $|\sin x - p(x)|$ over a finer set of equally spaced points in the same interval.

Solution
If we take 10 points, including the ends of the interval, then we create 9 subintervals, each of length $h = 0.1875$. The points are then $x_i = ih$ for $i = 0, 1, \ldots, 9$. After obtaining the polynomial, we divide each subinterval into four panels, and we evaluate $|\sin x - p(x)|$ at 37 points (called $t$ in the pseudocode). These are $t_j = jh/4$ for $j = 0, 1, \ldots, 36$. Here is a suitable main program in pseudocode that calls routines Coef and Eval previously given:

Test Coef Eval Pseudocode

    program Test_Coef_Eval
    integer j, k, n, jmax; real e, h, p, t, emax, pmax, tmax
    real array (x_i)_{0:n}, (y_i)_{0:n}, (a_i)_{0:n}
    n ← 9
    h ← 1.6875/n
    for k = 0 to n
        x_k ← kh
        y_k ← sin(x_k)
    end for
    call Coef(n, (x_i), (y_i), (a_i))
    output (a_i); emax ← 0
    for j = 0 to 4n
        t ← jh/4
        p ← Eval(n, (x_i), (a_i), t)
        e ← |sin(t) − p|
        output j, t, p, e
        if e > emax then
            jmax ← j; tmax ← t; pmax ← p; emax ← e
        end if
    end for
    output jmax, tmax, pmax, emax
    end program Test_Coef_Eval

The first coefficient in the Newton form of the interpolating polynomial is 0 (why?), and the others range in magnitude from approximately 0.99 to $0.18 \times 10^{-5}$. The deviation between $\sin x$ and $p(x)$ is practically zero at each interpolation node. (Because of roundoff errors, they are not precisely zero.) From the computer output, the largest error is at jmax = 35, where $\sin(1.640625) \approx 0.9975631$ with an error of $1.19 \times 10^{-7}$. ■
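For readers who want to run the experiment, the following Python sketch mirrors the routines Coef and Eval and the test program. It is an illustrative translation, not the authors' code; the names `coef`, `eval_newton`, and `test_coef_eval` are assumptions. Python floats are double precision, so the location and size of the largest error it prints need not match the figures quoted above.

```python
import math

def coef(xs, ys):
    """Newton coefficients a_i = f[x_0, ..., x_i], computed in place."""
    n = len(xs) - 1
    a = list(ys)
    for j in range(1, n + 1):
        for i in range(n, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

def eval_newton(xs, a, t):
    """Evaluate the Newton form at t by nested multiplication."""
    temp = a[-1]
    for i in range(len(a) - 2, -1, -1):
        temp = temp * (t - xs[i]) + a[i]
    return temp

def test_coef_eval(n=9, length=1.6875):
    """Interpolate sin at n + 1 equidistant nodes and scan 4n + 1 test points."""
    h = length / n
    xs = [k * h for k in range(n + 1)]
    ys = [math.sin(x) for x in xs]
    a = coef(xs, ys)
    jmax, tmax, pmax, emax = 0, 0.0, 0.0, 0.0
    for j in range(4 * n + 1):
        t = j * h / 4
        p = eval_newton(xs, a, t)
        e = abs(math.sin(t) - p)
        if e > emax:
            jmax, tmax, pmax, emax = j, t, p, e
    return a, jmax, tmax, pmax, emax

a, jmax, tmax, pmax, emax = test_coef_eval()
print("coefficients:", a)
print("largest error:", jmax, tmax, pmax, emax)
```

The first coefficient is exactly zero because $a_0 = f(x_0) = \sin 0$.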
Vandermonde Matrix
Another view of interpolation is that for a given set of $n + 1$ data points $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$, we want to express an interpolating function $f(x)$ as a linear combination of a set of basis functions $\varphi_0, \varphi_1, \varphi_2, \ldots, \varphi_n$ so that
\[ f(x) \approx c_0\varphi_0(x) + c_1\varphi_1(x) + c_2\varphi_2(x) + \cdots + c_n\varphi_n(x) \]
Here the coefficients $c_0, c_1, c_2, \ldots, c_n$ are to be determined. We want the function $f$ to interpolate the data $(x_i, y_i)$. This means that we have linear equations of the form
\[ f(x_i) = c_0\varphi_0(x_i) + c_1\varphi_1(x_i) + c_2\varphi_2(x_i) + \cdots + c_n\varphi_n(x_i) = y_i \]
for each $i = 0, 1, 2, \ldots, n$. This is a system of linear equations
\[ Ac = y \]
Here, the entries in the coefficient matrix $A$ are given by $a_{ij} = \varphi_j(x_i)$, which is the value of the $j$th basis function evaluated at the $i$th data point. The right-hand side vector $y$ contains the known data values $y_i$, and the components of the vector $c$ are the unknown coefficients $c_i$. Systems of linear equations are discussed in Chapters 2 and 8.
Polynomials are the simplest and most common basis functions. The natural basis for $P_n$ consists of the monomials
\[ \varphi_0(x) = 1, \quad \varphi_1(x) = x, \quad \varphi_2(x) = x^2, \quad \ldots, \quad \varphi_n(x) = x^n \]
Figure 4.3 shows the first few monomials: $1, x, x^2, x^3, x^4$, and $x^5$.
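With the monomial basis, row $i$ of the matrix $A$ is $1, x_i, x_i^2, \ldots, x_i^n$, and for a small table the system $Ac = y$ can be solved directly. The Python sketch below is illustrative only: the helper names `vandermonde` and `solve` are assumptions, the solver is plain Gaussian elimination written out for self-containment, and exact fractions are used on the table of Example 2, so no conditioning issues arise at this size.

```python
from fractions import Fraction

def vandermonde(xs):
    """Rows [1, x_i, x_i^2, ..., x_i^n] of the monomial-basis matrix."""
    n = len(xs) - 1
    return [[Fraction(x) ** j for j in range(n + 1)] for x in xs]

def solve(A, y):
    """Solve A c = y by Gaussian elimination with row swaps.

    Assumes A is invertible, which holds here when the nodes are distinct.
    """
    n = len(y)
    M = [row[:] + [Fraction(b)] for row, b in zip(A, y)]  # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    c = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        s = sum(M[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

# Table from Example 2: nodes 1/3, 1/4, 1 with values 2, -1, 7.
xs = [Fraction(1, 3), Fraction(1, 4), Fraction(1)]
ys = [2, -1, 7]
c = solve(vandermonde(xs), ys)
print(c)
```

The solution is $c_0 = -79/6$, $c_1 = 349/6$, $c_2 = -38$, the same polynomial that was obtained earlier by simplifying the Lagrange form of $p_2$.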
FIGURE 4.3 First few monomials (figure not reproduced)

Consequently, a given polynomial $p_n$ has the form
\[ p_n(x) = c_0 + c_1x + c_2x^2 + \cdots + c_nx^n \]
The corresponding linear system $Ac = y$ has the form
\[
\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^n \\
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n & x_n^2 & \cdots & x_n^n
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
\]
The coefficient matrix is called a Vandermonde matrix. It can be shown that this matrix is invertible provided that the points $x_0, x_1, x_2, \ldots, x_n$ are distinct. So we can, in theory, solve the system for the polynomial interpolant. Although the Vandermonde matrix is invertible, it becomes increasingly ill-conditioned as $n$ increases. For large $n$, the monomials are less distinguishable from one another, as shown in Figure 4.3. Moreover, the columns of the Vandermonde matrix become nearly linearly dependent in this case. High-degree polynomials often oscillate wildly and are highly sensitive to small changes in the data.
As Figures 4.1–4.3 show, we have discussed three choices for the basis functions: the Lagrange cardinal polynomials $l_i(x)$, the Newton polynomials $\pi_i(x)$, and the monomials. It turns out that there are better choices for the basis functions; namely, the Chebyshev polynomials have more desirable features.
The Chebyshev polynomials play an important role in mathematics because they have several special properties such as the recursive relation
\[ T_0(x) = 1, \qquad T_1(x) = x, \qquad T_i(x) = 2xT_{i-1}(x) - T_{i-2}(x) \]
for $i = 2, 3, 4$, and so on. Thus, the first six Chebyshev polynomials are
\[
\begin{aligned}
T_0(x) &= 1, & T_1(x) &= x, & T_2(x) &= 2x^2 - 1, \\
T_3(x) &= 4x^3 - 3x, & T_4(x) &= 8x^4 - 8x^2 + 1, & T_5(x) &= 16x^5 - 20x^3 + 5x
\end{aligned}
\]
The curves for these polynomials, as shown in Figure 4.4, are quite different from one another. The Chebyshev polynomials are usually employed on the interval $[-1, 1]$. With changes of variable, they can be used on any interval, but the results will be more complicated.
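The recursive relation for the Chebyshev polynomials can be carried out on coefficient lists. The short Python sketch below is an illustration; the function name `chebyshev_coefficients` and the convention of storing coefficients lowest degree first are assumptions.

```python
def chebyshev_coefficients(count):
    """Return coefficient lists (lowest degree first) for T_0, ..., T_{count-1}."""
    polys = [[1], [0, 1]]                       # T_0 = 1, T_1 = x
    for i in range(2, count):
        prev, prev2 = polys[i - 1], polys[i - 2]
        shifted = [0] + [2 * c for c in prev]   # coefficients of 2x * T_{i-1}
        padded = prev2 + [0] * (len(shifted) - len(prev2))
        polys.append([s - p for s, p in zip(shifted, padded)])
    return polys[:count]

for i, coeffs in enumerate(chebyshev_coefficients(6)):
    print(f"T{i}:", coeffs)
```

For example, the list produced for $T_5$ is `[0, 5, 0, -20, 0, 16]`, which corresponds to $16x^5 - 20x^3 + 5x$.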
FIGURE 4.4 First few Chebyshev polynomials (figure not reproduced)

One of the important properties of the Chebyshev polynomials is the equal oscillation property. Notice in Figure 4.4 that successive extreme points of the Chebyshev polynomials are equal in magnitude and alternate in sign. This property tends to distribute the error uniformly when the Chebyshev polynomials are used as the basis functions. In polynomial interpolation for continuous functions, it is particularly advantageous to select as the interpolation points the roots or the extreme points of a Chebyshev polynomial. This causes the maximum error over the interval of interpolation to be minimized. An example of this is given in Section 4.2. In Section 9.2, we discuss Chebyshev polynomials in more detail.
Inverse Interpolation
A process called inverse interpolation is often used to approximate an inverse function. Suppose that values $y_i = f(x_i)$ have been computed at $x_0, x_1, \ldots, x_n$. Using the table

y    y_0    y_1    ···    y_n
x    x_0    x_1    ···    x_n

we form the interpolation polynomial
\[ p(y) = \sum_{i=0}^{n} c_i \prod_{j=0}^{i-1} (y - y_j) \]
The original relationship, y = f (x), has an inverse, under certain conditions. This inverse is being approximated by x = p(y). Coef and Eval can be used to carry out the inverse interpolation by reversing the arguments x and y in the calling sequence for Coef.
Inverse interpolation can be used to find where a given function f has a root or zero. This means inverting the equation f (x) = 0. We propose to do this by creating a table of values ( f (xi ), xi ) and interpolating with a polynomial, p. Thus, we obtain p(yi ) = xi . The points xi should be chosen near the unknown root, r. The approximate root is then given by r ≈ p(0). See Figure 4.5 for an example of function y = f (x) and its inverse function x = g(y) with the root r = g(0).
For a concrete case, let the table of known values be
y −0.57892 00 −0.36263 70 −0.18491 60 −0.03406 42 0.09698 58
Finding a Root
EXAMPLE 9
x 1.0 2.0
Find the inverse interpolation polynomial.
3.0 4.0 5.0
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
170
Chapter 4 Interpolation and Numerical Differentiation
yx y 5 f(x)
x 5 g(y)
FIGURE 4.5
Function y = f (x) and inverse function x = g(y)
Solution
Inverse Interpolation Polynomial
0
r
f(r) 5 0
r 5 g(0) xy
0
Neville: 1st Recursive Relation
The nodes in this problem are the points in the row of the table headed y, and the function values being interpolated are in the x row. The resulting polynomial is
$$p(y) = 0.25y^4 + 1.2y^3 + 3.69y^2 + 7.39y + 4.247470086$$
and $p(0) = 4.247470086$. Only the last coefficient is shown with all the digits carried in the calculation, as it is the only one needed for the problem at hand. ■
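The computation in Example 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `divided_differences` and `newton_eval` are hypothetical stand-ins for the roles played by Coef and Eval, and the y-row of the table is passed as the nodes while the x-row supplies the values.

```python
def divided_differences(nodes, values):
    """Return Newton coefficients c[i] = f[t_0, ..., t_i] (hypothetical stand-in for Coef)."""
    c = list(values)
    n = len(nodes)
    for j in range(1, n):
        # Sweep from the bottom so the lower-order differences are still available.
        for i in range(n - 1, j - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (nodes[i] - nodes[i - j])
    return c


def newton_eval(nodes, coeffs, t):
    """Evaluate the Newton form by nested multiplication (hypothetical stand-in for Eval)."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (t - nodes[i]) + coeffs[i]
    return result


# Table from Example 9.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [-0.5789200, -0.3626370, -0.1849160, -0.0340642, 0.0969858]

# Inverse interpolation: the arguments are reversed, so y holds the nodes.
c = divided_differences(y, x)
root_estimate = newton_eval(y, c, 0.0)  # r is approximated by p(0)
print(root_estimate)
```

The only change from ordinary interpolation is the order of the two arrays in the call that builds the coefficients.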
Polynomial Interpolation by Neville’s Algorithm
Another method of obtaining a polynomial interpolant from a given table of values

x    x0   x1   ···   xn
y    y0   y1   ···   yn

was given by Neville. It builds the polynomial in steps, just as the Newton algorithm does. The constituent polynomials have interpolating properties of their own.
Let $P_{a,b,\ldots,s}(x)$ be the polynomial interpolating the given data at a sequence of nodes $x_a, x_b, \ldots, x_s$. We start with constant polynomials $P_i(x) = f(x_i)$. Selecting two nodes $x_i$ and $x_j$ with $i > j$, we define recursively
$$P_{u,\ldots,v}(x) = \frac{x - x_j}{x_i - x_j}\, P_{u,\ldots,j-1,j+1,\ldots,v}(x) + \frac{x_i - x}{x_i - x_j}\, P_{u,\ldots,i-1,i+1,\ldots,v}(x)$$
Using this formula repeatedly, we can create an array of polynomials:

x0   P0(x)
x1   P1(x)   P0,1(x)
x2   P2(x)   P1,2(x)   P0,1,2(x)
x3   P3(x)   P2,3(x)   P1,2,3(x)   P0,1,2,3(x)
x4   P4(x)   P3,4(x)   P2,3,4(x)   P1,2,3,4(x)   P0,1,2,3,4(x)

Here, each successive polynomial can be determined from two adjacent polynomials in the previous column.
We can simplify the notation by letting
$$S_{ij}(x) = P_{i-j,\,i-j+1,\ldots,\,i-1,\,i}(x)$$
where $S_{ij}(x)$ for $i \ge j$ denotes the interpolating polynomial of degree $j$ on the $j + 1$ nodes $x_{i-j}, x_{i-j+1}, \ldots, x_{i-1}, x_i$. Next we can rewrite the recurrence relation above as
$$S_{ij}(x) = \frac{x - x_{i-j}}{x_i - x_{i-j}}\, S_{i,j-1}(x) + \frac{x_i - x}{x_i - x_{i-j}}\, S_{i-1,j-1}(x)$$
So the displayed array becomes

x0   S00(x)
x1   S10(x)   S11(x)
x2   S20(x)   S21(x)   S22(x)
x3   S30(x)   S31(x)   S32(x)   S33(x)
x4   S40(x)   S41(x)   S42(x)   S43(x)   S44(x)
To prove some theoretical results, we change the notation by making the superscript the degree of the polynomial. At the beginning, we define constant polynomials (i.e., polynomials of degree 0) as $P_i^0(x) = y_i$ for $0 \le i \le n$. Then we define
$$P_i^j(x) = \frac{x - x_{i-j}}{x_i - x_{i-j}}\, P_i^{j-1}(x) + \frac{x_i - x}{x_i - x_{i-j}}\, P_{i-1}^{j-1}(x) \qquad (14)$$
In this equation, the superscripts are simply indices, not exponents. The range of j is 1 ≦ j ≦ n, while that of i is j ≦ i ≦ n. Formula (14) is seen again, in slightly different form, in the theory of B splines in Section 6.3.
The interpolation properties of these polynomials are given in the next result.
■ Theorem 4 (Interpolation Properties)
The polynomials $P_i^j$ defined above interpolate as follows:
$$P_i^j(x_k) = y_k \qquad (0 \le i - j \le k \le i \le n) \qquad (15)$$
Proof
We use induction on j. When j = 0, the assertion in Equation (15) reads
$$P_i^0(x_k) = y_k \qquad (0 \le i \le k \le i \le n)$$
In other words, $P_i^0(x_i) = y_i$, which is true by the definition of $P_i^0$.
Now assume, as an induction hypothesis, that for some j ≧ 1,
$$P_i^{j-1}(x_k) = y_k \qquad (0 \le i - j + 1 \le k \le i \le n)$$
To prove the next case in Equation (15), we begin by verifying the two extreme cases for k, namely, k = i − j and k = i. We have, by Equation (14),
$$P_i^j(x_{i-j}) = \frac{x_i - x_{i-j}}{x_i - x_{i-j}}\, P_{i-1}^{j-1}(x_{i-j}) = P_{i-1}^{j-1}(x_{i-j}) = y_{i-j}$$
The last equality is justified by the induction hypothesis. It is necessary to observe that 0 ≦ i − 1 − j + 1 ≦ i − j ≦ i − 1 ≦ n. In the same way, we compute
$$P_i^j(x_i) = \frac{x_i - x_{i-j}}{x_i - x_{i-j}}\, P_i^{j-1}(x_i) = P_i^{j-1}(x_i) = y_i$$
Here, in using the induction hypothesis, observe that 0 ≦ i − j + 1 ≦ i ≦ i ≦ n.
Now let i − j < k < i. Then
$$P_i^j(x_k) = \frac{x_k - x_{i-j}}{x_i - x_{i-j}}\, P_i^{j-1}(x_k) + \frac{x_i - x_k}{x_i - x_{i-j}}\, P_{i-1}^{j-1}(x_k)$$
In this equation, $P_i^{j-1}(x_k) = y_k$ by the induction hypothesis, because 0 ≦ i − j + 1 ≦ k ≦ i ≦ n. Likewise, $P_{i-1}^{j-1}(x_k) = y_k$ because 0 ≦ i − 1 − j + 1 ≦ k ≦ i − 1 ≦ n. Thus, we have
$$P_i^j(x_k) = \frac{x_k - x_{i-j}}{x_i - x_{i-j}}\, y_k + \frac{x_i - x_k}{x_i - x_{i-j}}\, y_k = y_k \qquad ■$$
An algorithm follows in pseudocode to evaluate $P_n^n(t)$ when a table of values is given:

    integer i, j, n; real array (x_i)_{0:n}, (y_i)_{0:n}, (S_ij)_{0:n×0:n}
    for i = 0 to n
        S_i0 ← y_i
    end for
    for j = 1 to n
        for i = j to n
            S_ij ← [(t − x_{i−j}) S_{i,j−1} + (x_i − t) S_{i−1,j−1}] / (x_i − x_{i−j})
        end for
    end for
    return S_nn
We begin the algorithm by finding the node nearest the point t at which the evaluation is to be made. In general, interpolation is more accurate when this is done.
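The pseudocode for evaluating the Neville array translates almost line for line into Python. The following is a minimal sketch, assuming the table is passed as two lists; the function name `neville` is a hypothetical choice, and the step of reordering the nodes by distance from t is not included.

```python
def neville(x, y, t):
    """Evaluate the interpolating polynomial at t by filling the array S[i][j]."""
    n = len(x) - 1
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        S[i][0] = y[i]  # degree-0 polynomials are the table values
    for j in range(1, n + 1):
        for i in range(j, n + 1):
            # Combine two adjacent entries of the previous column.
            S[i][j] = ((t - x[i - j]) * S[i][j - 1]
                       + (x[i] - t) * S[i - 1][j - 1]) / (x[i] - x[i - j])
    return S[n][n]


# Four samples of t**3 determine the cubic exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [v ** 3 for v in xs]
value = neville(xs, ys, 1.5)
print(value)
```

Because only the previous column is needed at each stage, the two-dimensional array could be replaced by a single vector; the full array is kept here to mirror the notation $S_{ij}$.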
Interpolation of Bivariate Functions
The methods we have discussed for interpolating functions of one variable by polynomials extend to some cases of functions of two or more variables. An important case occurs when a function (x, y) → f (x, y) is to be approximated on a rectangle. This leads to what is known as tensor-product interpolation.
Suppose the rectangle is the Cartesian product of two intervals: [a, b] × [α, β]. That is, the variables x and y run over the intervals [a, b] and [α, β], respectively. Select n nodes $x_i$ in [a, b], and define the Lagrangian polynomials
$$l_i(x) = \prod_{\substack{j=1 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j} \qquad (1 \le i \le n)$$
Similarly, we select m nodes $y_i$ in [α, β] and define
$$\bar{l}_i(y) = \prod_{\substack{j=1 \\ j \ne i}}^{m} \frac{y - y_j}{y_i - y_j} \qquad (1 \le i \le m)$$
Then the function
$$P(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, l_i(x)\, \bar{l}_j(y)$$
is a polynomial in two variables that interpolates f at the grid points $(x_i, y_j)$. There are nm such points of interpolation.
The proof of the interpolation property is quite simple because $l_i(x_q) = \delta_{iq}$ and $\bar{l}_j(y_p) = \delta_{jp}$. Consequently, we have
$$P(x_q, y_p) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, l_i(x_q)\, \bar{l}_j(y_p) = \sum_{i=1}^{n} \sum_{j=1}^{m} f(x_i, y_j)\, \delta_{iq}\, \delta_{jp} = f(x_q, y_p)$$
The same procedure can be used with spline interpolants (or indeed any other type of function).
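As an illustration of tensor-product interpolation with Lagrangian polynomials, here is a short Python sketch. The names `lagrange_basis` and `tensor_interpolate`, as well as the test function and the node sets, are assumptions made for this example only.

```python
def lagrange_basis(nodes, i, t):
    """Value at t of the cardinal polynomial that is 1 at nodes[i] and 0 at the other nodes."""
    value = 1.0
    for j, node in enumerate(nodes):
        if j != i:
            value *= (t - node) / (nodes[i] - node)
    return value


def tensor_interpolate(xs, ys, f, x, y):
    """Evaluate P(x, y) = sum_i sum_j f(x_i, y_j) l_i(x) l_j(y)."""
    total = 0.0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            total += f(xi, yj) * lagrange_basis(xs, i, x) * lagrange_basis(ys, j, y)
    return total


def f(x, y):
    # Degree 2 in x and degree 1 in y, so a 3-by-2 grid reproduces it exactly.
    return x * x * y + x


xs = [0.0, 1.0, 2.0]   # n = 3 nodes in [a, b]
ys = [0.0, 1.0]        # m = 2 nodes in [alpha, beta]
approx = tensor_interpolate(xs, ys, f, 1.5, 0.5)
print(approx, f(1.5, 0.5))
```

The same basis routine serves both variables; only the node list differs, which is exactly the structure of the double sum.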
Summary 4.1
• The Lagrange form of the interpolation polynomial is
$$p_n(x) = \sum_{i=0}^{n} l_i(x)\, f(x_i)$$
with cardinal polynomials
$$l_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j} \qquad (0 \le i \le n)$$
that obey the Kronecker delta equation
$$l_i(x_j) = \delta_{ij} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$
• The Newton form of the interpolation polynomial is
$$p_n(x) = \sum_{i=0}^{n} a_i \prod_{j=0}^{i-1} (x - x_j)$$
with divided differences
$$a_i = f[x_0, x_1, \ldots, x_i] = \frac{f[x_1, x_2, \ldots, x_i] - f[x_0, x_1, \ldots, x_{i-1}]}{x_i - x_0}$$
These are two different forms of the unique polynomial p of degree at most n that interpolates a table of n + 1 pairs of points $(x_i, f(x_i))$ for 0 ≦ i ≦ n.
• We can illustrate this with a small table for n = 2:

x      x0      x1      x2
f(x)   f(x0)   f(x1)   f(x2)

The Lagrange interpolating polynomial is
$$p_2(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\, f(x_0) + \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}\, f(x_1) + \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}\, f(x_2)$$
Clearly, $p_2(x_0) = f(x_0)$, $p_2(x_1) = f(x_1)$, and $p_2(x_2) = f(x_2)$. Next, we form the divided-difference table:

x0   f(x0)
               f[x0, x1]
x1   f(x1)                  f[x0, x1, x2]
               f[x1, x2]
x2   f(x2)

Using the divided-difference entries from the top diagonal, we have
$$p_2(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)$$
Again, it can be easily shown that $p_2(x_0) = f(x_0)$, $p_2(x_1) = f(x_1)$, and $p_2(x_2) = f(x_2)$.
• We can use inverse polynomial interpolation to find an approximate value of a root r of the equation f(x) = 0 from a table of values $(x_i, y_i)$ for 1 ≦ i ≦ n. Here we are assuming that the table values are in the vicinity of this zero of the function f. Flipping the table values, we use the reversed table values $(y_i, x_i)$ to determine the interpolating polynomial called $p_n(y)$. Now evaluating it at 0, we find a value that approximates the desired zero, namely, $r \approx p_n(0)$ and $f(p_n(0)) \approx f(r) = 0$.
• Other advanced polynomial interpolation methods discussed are Neville’s algorithm and bivariate function interpolation.

Exercises 4.1
a1. Use the Lagrange interpolation process to obtain a polynomial of least degree that assumes these values:

x   0   2    3    4
y   7   11   28   63

2. (Continuation) Rearrange the points in the table of the preceding problem and find the Newton form of the interpolating polynomial. Show that the polynomials obtained are identical, although their forms may differ.
a3. For the four interpolation nodes −1, 1, 3, 4, what are the $l_i$ functions (2) required in the Lagrange interpolation procedure? Draw the graphs of these four functions to show their essential properties.
4. Verify that the polynomials
$p(x) = 5x^3 - 27x^2 + 45x - 21$
$q(x) = x^4 - 5x^3 + 8x^2 - 5x + 3$
interpolate the data

x   1   2   3   4
y   2   1   6   47

and explain why this does not violate the uniqueness part of the theorem on existence of polynomial interpolation.
5. Verify that the polynomials
$p(x) = 3 + 2(x - 1) + 4(x - 1)(x + 2)$
$q(x) = 4x^2 + 6x - 7$
are both interpolating polynomials for the following table, and explain why this does not violate the uniqueness part of the existence theorem for polynomial interpolation.

x   1   −2   0
y   3   −3   −7

6. Find the polynomial p of least degree that takes these values: p(0) = 2, p(2) = 4, p(3) = −4, p(5) = 82. Use divided differences to get the correct polynomial. It is not necessary to write the polynomial in the standard form $a_0 + a_1 x + a_2 x^2 + \cdots$.
7. Complete the following divided-difference tables, and use them to obtain polynomials of degree 3 that interpolate the function values indicated:
aa.
x f[] f[, ] f[, , ] −1 2
1−4 2 36
f[, , , ]
a 12.
The polynomial $p(x) = x^4 - x^3 + x^2 - x + 1$ has the following values:

x      −2   −1   0   1   2    3
p(x)   31   5    1   1   11   61

Find a polynomial q that takes these values:

x      −2   −1   0   1   2    3
q(x)   31   5    1   1   11   30

Hint: This can be done with little work.
Use the divided-difference method to obtain a polynomial of least degree that fits the values shown.

a a.  x   0    1    2    −1   3
      y   −1   −1   −1   −7   5

b.    x   1   3   −2   4    5
      y   2   6   −1   −4   2

Find the interpolating polynomial for these data:
2
b. x f[] f[,] f[,,] f[,,,]
−1 2
1 −4
3 46
53.5
4 99.5
Write the final polynomials in a form most efficient for
computing.
a8. Find an interpolating polynomial for this table:

x   1    2    2.5   3   4
y   −1   −1   3     4   25

9. Given the data

x      0   1   2    4    6
f(x)   1   9   23   93   259

do the following.
5 10
13.
a14.
15.
a16.
17.
a18.
a 19.
x
f (x)
1.0 2.0 −1.5 −0.5
2.5 3.0 4.0 0.0 0.5 1.5
3 32 3
It is suspected that the table

x   −2   −1   0    1    2    3
y   1    4    11   16   13   −4

comes from a cubic polynomial. How can this be tested? Explain.
There exists a unique polynomial p(x) of degree 2 or less such that p(0) = 0, p(1) = 1, and p′(α) = 2 for any value of α between 0 and 1 (inclusive) except one value of α, say, α0. Determine α0, and give this polynomial for α ≠ α0.
Determine by two methods the polynomial of degree 2 or less whose graph passes through the points (0, 1.1), (1, 2), and (2, 4.2). Verify that they are the same.
Develop the divided-difference table from the given data. Write down the interpolating polynomial, and rearrange it for fast computation without simplifying.

x      0   1   3   2   5
f(x)   2   1   5   6   −183

Checkpoint: f[1, 3, 2, 5] = −7.
Let $f(x) = x^3 + 2x^2 + x + 1$. Find the polynomial of degree 4 that interpolates the values of f at
aa. ab.
10. a.
b.
Construct the divided-difference table.
Using Newton’s interpolation polynomial, find an approximation to f(4.2).
Hint: Use polynomials starting with 93 and involving factors (x − 4).
Construct Newton’s interpolation polynomial for the data shown.

x   0   2    3    4
y   7   11   28   63

Without simplifying it, write the polynomial obtained in nested form for easy evaluation.
11. From census data, the approximate population of the United States was 150.7 million in 1950, 179.3 million in 1960, 203.3 million in 1970, 226.5 million in 1980, and 249.6 million in 1990. Using Newton’s interpolation polynomial for these data, find an approximate value for the population in 2000. Then use the polynomial to esti- mate the population in 1920 based on these data. What conclusion should be drawn?
20.
21.
a22.
23.
24.
a 25.
26.
a 27. a 28.
0.30103 0.47712 0.54407 0.60206
x = −2, −1, 0, 1, 2. Find the polynomial of degree 2 that interpolates the values of f at x = −1, 0, 1.
Without using a divided-difference table, derive and simplify the polynomial of least degree that assumes these values:

x   −2   −1   0   1   2
y   2    14   4   2   2

(Continuation) Find a polynomial that takes the values shown in the preceding problem and has at x = 3 the value 10.
Hint: Add a suitable polynomial to the p(x) of the previous problem.
Find a polynomial of least degree that takes these values:

x   1.73   1.82   2.61   5.22   8.26
y   0      0      7.8    0      0

Hint: Rearrange the table so that the nonzero value of y is the last entry, or think of some better way.
Form a divided-difference table for the following and explain what happened.

x   1   2   3   1
y   3   5   5   7

Simple polynomial interpolation in two dimensions is not always possible. For example, suppose that the following data are to be represented by a polynomial of first degree in x and y, p(t) = a + bx + cy, where t = (x, y):

t      (1, 1)   (3, 2)   (5, 3)
f(t)   3        2        6

Show that it is not possible.
Consider a function f(x) such that f(2) = 1.5713, f(3) = 1.5719, f(5) = 1.5738, and f(6) = 1.5751. Estimate f(4) using a second-degree interpolating polynomial and a third-degree polynomial. Round the final results off to four decimal places. Is there any advantage here in using a third-degree polynomial?
Use inverse interpolation to find an approximate value of x such that f(x) = 0 given the following table of values for f. Look into what happens and draw a conclusion.

x      −2    −1   1   2    3
f(x)   −31   5    1   11   61

Find a polynomial p(x) of degree at most 3 such that p(0) = 1, p(1) = 0, p′(0) = 0, and p′(−1) = −1.
From a table of logarithms, we obtain the following val- ues of log x at the indicated tabular points:
x 1 1.5 2 3
log x
Form a divided-difference table based on these values. Interpolate for log 2.4 and log 1.2 using third-degree in- terpolation polynomials in Newton form.
Show that the divided differences are linear maps; that is,
$$(\alpha f + \beta g)[x_0, x_1, \ldots, x_n] = \alpha f[x_0, x_1, \ldots, x_n] + \beta g[x_0, x_1, \ldots, x_n]$$
Hint: Use induction.
Show that another form for the polynomial pn of degree at most n that takes values y0,y1,...,yn at abscissas x0,x1,...,xn is
n i−1
f[xn,xn−1,...,xn−i](x −xn−j)
i=0 j=0
Use the uniqueness of the interpolating polynomial to verify that
n n i−1
f (xi )li (x) = f [x0, x1, . . . , xi ] (x − x j )
i=0 i=0 j=0
(Continuation) Show that the following explicit formula is valid for divided differences:
n f[x0,x1,...,xn]=
0 0.17609
3.5 4
29.
30.
31.
32.
33.
34.
35.
36.
n j ≠ i
(xi −xj)−1 Hint: If two polynomials are equal, the coefficients of xn
f(xi)
li(x) = 1
for the case n = 1. Then establish the result for arbitrary
values of n.
Write the Lagrange form (1) of the interpolating polynomial of degree at most 2 that interpolates f(x) at x0, x1, and x2, where x0 < x1 < x2.
(Continuation) Write the Newton form of the interpolating polynomial p2(x), and show that it is equivalent to the Lagrange form.
(Continuation) Show directly that
p′′(x)=2f[x ,x ,x ] 2 012
in each are equal. Verifydirectlythat
n i=0
i = 0
j=0
37.
38.
(Continuation) Show directly for uniform spacing h = x1 − x0 = x2 − x1 that
41.
a 42. 43.
Verify directly that for any three distinct points x0, x1,
and x2,
f[x0,x1,x2] = f[x2,x0,x1] = f[x1,x2,x0]
Compare this argument to the one in the text.
Let p be a polynomial of degree n. What is
p[x0,x1,...,xn+1]?
Show that if f is continuously differentiable on the in-
terval [x0, x1], then f [x0, x1] = f ′(c) for some c in (x0, x1).
If f is a polynomial of degree n, show that in a divided-difference table for f, the nth column has a single constant value—a column containing entries
f[xi,xi+1,...,xi+n].
Determine whether the following assertion is true or false. If x0, x1, ..., xn are distinct, then for arbitrary real values y0, y1, ..., yn, there is a unique polynomial $p_{n+1}$ of degree ≦ n + 1 such that $p_{n+1}(x_i) = y_i$ for all i = 0, 1, ..., n.
Show that if a function g interpolates the function f at x0,x1,...,xn−1 and h interpolates f at x1,x2,...,xn,
a 39.
cient [s!]/[(s − m)! m!], and s!/(s − m)! = s(s − 1)(s − 2)···(s−m+1)becauses canbeanyrealnumberand m! has the usual definition because m is an integer.
(Continuation) From the following table of values of lnx, interpolate to obtain ln2.352 and ln2.387 using the Newton forward-difference form of the interpolating polynomial:
f[x0,x1]= f0, h,
f[x0,x1,x2]= 2 f0 2h2
2 wherefi=fi+1−fi,fi=fi+1−fi,and
fi = f(xi).
(Continuation) Establish Newton’s forward-difference form of the interpolating polynomial with uniform spacing
s s p2(x) = f0 + f0 +
44.
45.
46.
47.
2 f0
where x = x0 + sh. Here, m is the binomial coeffi-
12
s
a
2 f −0.00001
−0.00002 −0.00002
[g(x) − h(x)]
then
x0 − x xn − x0
x f(x)
2.35 0.85442
2.36 0.85866
2.37 0.86289
2.38 0.86710
2.39 0.87129
f 0.00424
0.00423 0.00421 0.00419
g(x) +
interpolates f at x0,x1,...,xn.
a 40.
a1.
1x0 x2 0
Using the correctly rounded values ln 2.352 ≈ 0.85527 and ln 2.387 ≈ 0.87004, show that the forward-difference formula is more accurate near the top of the table than it is near the bottom.
Count the number of multiplications, divisions, and additions/subtractions in the generation of the divided- difference table that has n + 1 points.
Test the procedure given in the text for determining the Newton form of the interpolating polynomial. For exam- ple, consider this table:
1x0 f 0 1x1 f 1 1x2 f2
(Vandermonde Determinant) Using fi = f (xi ), show the following:
a. f[x0,x1]=
f1 1 x0
1 1
f 0
1 x1
b. f[x0,x1,x2]=
x 1 2 3 −4 5
y 2 48 272 1182 2262
Find the interpolating polynomial and verify that
p(−1) = 12.
1x1 x 12 1x2 x2
Computer Exercises 4.1
2. Find the polynomial of degree 10 that interpolates the function arctan x at 11 equally spaced points in the in- terval [1,6]. Print the coefficients in the Newton form of the polynomial. Compute and print the difference be- tween the polynomial and the function at 33 equally spaced points in the interval [0, 8]. What conclusion can be drawn?
3. Write a simple program using procedure Coef that inter- polates ex by a polynomial of degree 10 on [0,2] and then compares the polynomial to exp at 100 points.
4. Use as input data to procedure Coef the annual rainfall in your town for each of the last five years. Using func- tion Eval, predict the rainfall for this year. Is the answer reasonable?
5. A table of values of a function f is given at the points $x_i = i/10$ for 0 ≦ i ≦ 100. In order to obtain a graph of f with the aid of an automatic plotter, the values of f are required at the points $z_i = i/20$ for 0 ≦ i ≦ 200. Write a procedure to do this, using a cubic interpolating polynomial with nodes $x_i$, $x_{i+1}$, $x_{i+2}$, and $x_{i+3}$ to compute f at $\frac{1}{2}(x_{i+1} + x_{i+2})$. For $z_1$ and $z_{199}$, use the cubic polynomial associated with $z_3$ and $z_{197}$, respectively. Compare this routine to Coef for a given function.
6. Write routines analogous to Coef and Eval using the Lagrange form of the interpolation polynomial. Test on the example given in this section at 20 points with h/2. Does the Lagrange form have any advantage over the Newton form?
7. (Continuation) Design and carry out a numerical exper- iment to compare the accuracy of the Newton and La- grange forms of the interpolation polynomials at values throughout the interval [x0, xn].
8. Rewrite and test routines Coef and Eval so that the array (ai) is not used.
Hint: When the elements in the array (yi ) are no longer needed, store the divided differences in their places.
9. Write a procedure for carrying out inverse interpolation to solve equations of the form f (x) = 0. Test it on the introductory example at the beginning of this chapter.
10. For Example 8, compare the results from your code with that in the text. Redo using linear interpolation based on the 10 equidistant points. How do the errors compare at intermediate points? Plot curves to visualize the difference between linear interpolation and a higher-degree polynomial interpolation.
11. Use mathematical software such as MATLAB, Maple, or Mathematica to find an interpolation polynomial for the points (0, 0), (1, 1), (2, 2.001), (3, 3), (4, 4), (5, 5). Evaluate the polynomial at the point x = 14 or x = 20 to show that slight roundoff errors in the data can lead to suspicious results in extrapolation.
12. Use symbolic mathematical software such as MATLAB, Maple, or Mathematica to generate the interpolation poly- nomial for the data points in Example 3. Plot the polyno- mial and the data points.
13. (Continuation.) Repeat these instructions using Exam- ple 7.
14. Carry out the details in Example 8 by writing a computer program that plots the data points and the curve for the interpolation polynomial.
15. (Continuation.) Repeat the instructions using Example 9.
16. Using mathematical software, carry out the details and verify the results in the introductory example to this chapter.
17. (Padé Interpolation) Find a rational function of the form
$$g(x) = \frac{a + bx}{1 + cx}$$
that interpolates the function f(x) = arctan(x) at the points x0 = 1, x1 = 2, and x2 = 3. On the same axes, plot the graphs of f and g, using dashed and dotted lines, respectively.
4.2 Errors in Polynomial Interpolation
Introduction
When a function f is approximated on an interval [a,b] by means of an interpolating polynomial p, the discrepancy between f and p will (theoretically) be 0 at each node of interpolation. A natural expectation is that the function f is well approximated at all intermediate points and that as the number of nodes increases, this agreement will become better and better.
In the history of numerical mathematics, it was a severe shock to realize that this expectation was ill-founded! Of course, if the function being approximated is not required to be continuous, then there may be no agreement at all between p(x) and f(x) except at the nodes.
EXAMPLE 1
Consider these five data points: (0, 8), (1, 12), (3, 2), (4, 6), (8, 0). Construct and plot the interpolation polynomial using the two outermost points. Repeat this process by adding one additional point at a time until all the points are included. What conclusions can you draw?
FIGURE 4.6
Interpolant polynomials over data points (curves p1, p2, p3, and p4 plotted for 0 ≦ x ≦ 8)
Solution
The first interpolation polynomial is the line between the outermost points (0, 8) and (8, 0). Then we added the points (3, 2), (4, 6), and (1, 12) in that order and plotted a curve for each additional point. All of these polynomials are shown in Figure 4.6. We were hoping for a smooth curve going through these points without wide fluctuations, but this did not happen! (Why?) It may seem counterintuitive, but as we added more points, the situation became worse instead of better! The reason for this comes from the nature of high-degree polynomials. A polynomial of degree n has n zeros. If all of these zero points are real, then the curve crosses the x-axis n times. The resulting curve must make many turns for this to happen, resulting in wild oscillations. In Chapter 6, we discuss fitting the data points with spline curves. ■
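The experiment in Example 1 can be repeated numerically. The sketch below is an illustration only: the helper names `newton_coefficients` and `newton_eval` are hypothetical, the points are added in the order stated in the Solution, and the range of each interpolant is sampled on a uniform grid over [0, 8] rather than plotted.

```python
def newton_coefficients(xs, ys):
    """Divided-difference coefficients of the Newton form."""
    c = list(ys)
    for j in range(1, len(xs)):
        for i in range(len(xs) - 1, j - 1, -1):
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - j])
    return c


def newton_eval(xs, c, t):
    """Nested evaluation of the Newton form at t."""
    value = c[-1]
    for i in range(len(c) - 2, -1, -1):
        value = value * (t - xs[i]) + c[i]
    return value


# Outermost points first, then (3, 2), (4, 6), and (1, 12), one at a time.
points = [(0.0, 8.0), (8.0, 0.0), (3.0, 2.0), (4.0, 6.0), (1.0, 12.0)]
grid = [8.0 * k / 400 for k in range(401)]

ranges = {}  # degree -> (smallest sampled value, largest sampled value)
for m in range(2, len(points) + 1):
    xs = [p[0] for p in points[:m]]
    ys = [p[1] for p in points[:m]]
    c = newton_coefficients(xs, ys)
    vals = [newton_eval(xs, c, t) for t in grid]
    ranges[m - 1] = (min(vals), max(vals))
    print(f"p{m - 1}: min = {min(vals):8.3f}, max = {max(vals):8.3f}")
```

Comparing the printed ranges with the range of the data values, 0 to 12, gives a rough measure of how far each interpolant swings away from the data.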
Dirichlet Function
As a pathological example, consider the so-called Dirichlet function f , defined to be 1 at each irrational point and 0 at each rational point. If we choose nodes that are rational numbers, then p(x) ≡ 0 and f (x)− p(x) = 0 for all rational values of x, but f (x)− p(x) = 1 for all irrational values of x.
However, if the function f is well behaved, can we not assume that the differences | f (x) − p(x)| are small when the number of interpolating nodes is large? The answer is still no, even for functions that possess continuous derivatives of all orders on the interval!
Runge Function
A specific example of this remarkable phenomenon is provided by the Runge function:
$$f(x) = (1 + x^2)^{-1} \qquad (1)$$
on the interval [−5, 5]. Let $p_n$ be the polynomial that interpolates this function at n + 1 equally spaced points on the interval [−5, 5], including the endpoints. Then
$$\lim_{n \to \infty} \max_{-5 \le x \le 5} |f(x) - p_n(x)| = \infty$$
Thus, the effect of requiring the agreement of f and pn at more and more points is to increase the error at nonnodal points, and the error actually increases beyond all bounds!
The moral of this example, then, is that polynomial interpolation of high degree with many nodes is a risky operation; the resulting polynomials may be very unsatisfactory as representations of functions unless the set of nodes is chosen with great care.
The reader can easily observe the phenomenon just described by using the pseudocodes already developed in this chapter. See Computer Exercise 4.2.1 for a suggested numerical experiment. In a more advanced study of this topic, it would be shown that the divergence of the polynomials can often be ascribed to the fact that the nodes are equally spaced. Again, contrary to intuition, equally distributed nodes are usually a very poor choice in interpolation. A much better choice for n + 1 nodes in [−1, 1] is the set of Chebyshev nodes:
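$$x_i = \cos\left[\left(\frac{2i + 1}{2n + 2}\right)\pi\right] \qquad (0 \le i \le n)$$
The corresponding set of nodes on an arbitrary interval [a, b] would be derived from a linear mapping to obtain
$$x_i = \frac{1}{2}(a + b) + \frac{1}{2}(b - a)\cos\left[\left(\frac{2i + 1}{2n + 2}\right)\pi\right] \qquad (0 \le i \le n)$$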
Notice that these nodes are numbered from right to left. Since the theory does not depend on any particular ordering of the nodes, this is not troublesome.
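The node formula on an arbitrary interval [a, b] is easy to tabulate. Here is a minimal Python sketch; the function name `chebyshev_nodes` is a hypothetical choice, and the interval [−5, 5] with n = 8 is used only as a sample input.

```python
import math


def chebyshev_nodes(a, b, n):
    """Return the n + 1 Chebyshev nodes on [a, b], numbered from right to left."""
    return [0.5 * (a + b) + 0.5 * (b - a) * math.cos((2 * i + 1) * math.pi / (2 * n + 2))
            for i in range(n + 1)]


nodes = chebyshev_nodes(-5.0, 5.0, 8)
for i, node in enumerate(nodes):
    print(i, round(node, 6))
```

The printed list decreases with i, and the nodes crowd toward the two ends of the interval; none of them coincides with an endpoint.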
A simple graph illustrates this phenomenon best. Again, consider Equation (1) on the interval [−5, 5]. First, we select nine equally spaced nodes and use routines Coef and Eval with an automatic plotter to graph p8. As shown in Figure 4.7, the resulting curve assumes negative values, which, of course, f(x) does not have! Adding more equally spaced nodes, and thereby obtaining a higher-degree polynomial, only makes matters worse with wilder oscillations. In Figure 4.8, nine Chebyshev nodes are used, and the resulting polynomial curve is smoother. However, cubic splines (discussed in Section 6.2) produce an even better curve fit.
FIGURE 4.7
Polynomial interpolant with nine equally spaced nodes
FIGURE 4.8
Polynomial interpolant with nine Chebyshev nodes
FIGURE 4.9
Chebyshev nodes of T9
The Chebyshev nodes of T9 are obtained by taking equally spaced points on the unit circle and projecting them onto the horizontal axis, as in Figure 4.9.
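The comparison between nine equally spaced nodes and nine Chebyshev nodes can also be made without a plotter. The following Python sketch is an illustration under stated assumptions: it evaluates the interpolant in Lagrange form instead of calling Coef and Eval, the helper names are hypothetical, and the error is sampled on a grid of 1001 points rather than maximized exactly.

```python
import math


def runge(x):
    return 1.0 / (1.0 + x * x)


def interpolate(nodes, values, t):
    """Evaluate the interpolating polynomial at t using the Lagrange form."""
    total = 0.0
    for i, xi in enumerate(nodes):
        term = values[i]
        for j, xj in enumerate(nodes):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


n = 8  # nine nodes, polynomial p8
equal_nodes = [-5.0 + 10.0 * i / n for i in range(n + 1)]
cheb_nodes = [5.0 * math.cos((2 * i + 1) * math.pi / (2 * n + 2)) for i in range(n + 1)]
grid = [-5.0 + 10.0 * k / 1000 for k in range(1001)]


def summarize(nodes):
    """Return the smallest sampled value of p8 and the largest sampled error."""
    values = [runge(x) for x in nodes]
    p = [interpolate(nodes, values, t) for t in grid]
    max_err = max(abs(runge(t) - pt) for t, pt in zip(grid, p))
    return min(p), max_err


equal_min, equal_err = summarize(equal_nodes)
cheb_min, cheb_err = summarize(cheb_nodes)
print("equally spaced: min p8 =", equal_min, " max error =", equal_err)
print("Chebyshev:      min p8 =", cheb_min, " max error =", cheb_err)
```

Raising n in this sketch is a quick way to watch the equally spaced error grow while the Chebyshev error shrinks.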
Theorems on Interpolation Errors
It is possible to assess the errors of interpolation by means of a formula that involves the (n + 1)st derivative of the function being interpolated. Here is the formal statement:
■ Theorem 1 (First Interpolation Error Theorem)
If p is the polynomial of degree at most n that interpolates f at the n + 1 distinct nodes x0, x1, ..., xn belonging to an interval [a, b] and if $f^{(n+1)}$ is continuous, then for each x in [a, b], there is a ξ in (a, b) for which
$$f(x) - p(x) = \frac{1}{(n + 1)!}\, f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i) \qquad (2)$$
Proof
Observe first that Equation (2) is obviously valid if x is one of the nodes xi because then both sides of the equation reduce to zero. If x is not a node, let it be fixed in the remainder of the discussion, and define
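$$\begin{aligned} w(t) &= \prod_{i=0}^{n} (t - x_i) && \text{(polynomial in the variable } t\text{)} \\ c &= \frac{f(x) - p(x)}{w(x)} && \text{(constant)} \\ \varphi(t) &= f(t) - p(t) - c\,w(t) && \text{(function in the variable } t\text{)} \end{aligned} \qquad (3)$$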
Observe that c is well defined because w(x) ≠ 0 (x is not a node). Note also that φ takes the value 0 at the n + 2 points x0, x1, ..., xn, and x. Now invoke Rolle’s Theorem,∗ which states that between any two roots of φ, there must occur a root of φ′. Thus, φ′ has at least n + 1 roots. By similar reasoning, φ′′ has at least n roots, φ′′′ has at least n − 1 roots, and so on. Finally, it can be inferred that $\varphi^{(n+1)}$ must have at least one root. Let ξ be a root of $\varphi^{(n+1)}$. All the roots being counted in this argument are in (a, b). Thus, we obtain
0 = φ(n+1)(ξ) = f (n+1)(ξ) − p(n+1)(ξ) − cw(n+1)(ξ)
∗ Rolle’s Theorem: Let f be a function that is continuous on [a, b] and differentiable on (a, b). If
f(a)= f(b)=0,then f′(c)=0forsomepointcin(a,b).
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
182 Chapter 4
Interpolation and Numerical Differentiation
■ Lemma1
Proof
To establish this inequality, fix x and select j so that xj ≦ x ≦ xj+1. It is an exercise in calculus (Exercise 5.2.2) to show that
h2
|x−xj||x−xj+1|≦ 4 (5)
FIGURE 4.10
Typical location of x in equally spaced nodes
a 5 x0
. . .
x
xj11 xj12
. . .
In this equation, p(n+1)(ξ) = 0 because p is a polynomial of degree ≦ n. Also, w(n+1)(ξ) = (n + 1)! because w(t) = tn+1+ (lower-order terms in t). Thus, we have
0= f(n+1)(ξ)−c(n+1)!= f(n+1)(ξ)−(n+1)![f(x)−p(x)] w(x )
This equation is a rearrangement of Equation (2).
■
A special case that often arises is the one in which the interpolation nodes are equally spaced.
Upper Bound Lemma
Supposethatxi =a+ihfori=0,1,...,nandthath=(b−a)/n.Thenforany x ∈ [a, b]
n 1 n + 1
|x−xi|≦ 4h n! (4) i=0
Using Equation (5), we have
n j−1 n
|x−xi|≦ h2 (x−xi) (xi −x) i=0 4 i=0 i=j+2
The sketch in Figure 4.10, showing a typical case of equally spaced nodes, may be helpful.
xn21 xn 5 b
Nowusethefactthatxi =a+ih.Thenwehavexj+1 −xi =(j−i+1)handxi −xj = (i − j)h. Therefore, we obtain
x1 x2 x3
xj21 xj
Since xj ≦ x ≦ xj+1, we have further
n j−1 n
|x−xi|≦h2 (xj+1−xi) (xi −xj) i=0 4 i=0 i=j+2
n j−1n h2
|x − xi | ≦ 4 h j hn−( j+2)+1 ( j − i + 1)
i=0 i=0 i=j+2
≦ 1hn+1(j+1)!(n−j)!≦1hn+1n! 44
(i − j)
Inthelaststep,weusethefactthatif0≦ j≦n−1,then(j+1)!(n−j)!≦n!.This,too, is left as an exercise (Exercise 5.2.3). Hence, Inequality (4) is established. ■
We can now find a bound on the interpolation error.

■ Theorem 2: Second Interpolation Error Theorem
Let f be a function such that $f^{(n+1)}$ is continuous on [a, b] and satisfies $|f^{(n+1)}(x)| \leqq M$. Let p be the polynomial of degree ≦ n that interpolates f at n + 1 equally spaced nodes in [a, b], including the endpoints. Then on [a, b],
$$|f(x) - p(x)| \leqq \frac{1}{4(n+1)}\, M h^{n+1} \tag{6}$$
where h = (b − a)/n is the spacing between nodes.
Proof
Use Theorem 1 on interpolation errors and Inequality (4) in Lemma 1. ■

This theorem gives loose upper bounds on the interpolation error for different values of n. By other means, one can find tighter upper bounds for small values of n (Cf. Exercise 4.2.5). If the nodes are not uniformly spaced, then a better bound can be found by use of the Chebyshev nodes.

EXAMPLE 2
Assess the error if sin x is replaced by an interpolation polynomial that has 10 equally spaced nodes in [0, 1.6875]. (See the related Example 8 in Section 4.1.)
Solution
We use Theorem 2 on interpolation errors, taking f(x) = sin x, n = 9, a = 0, and b = 1.6875. Since $f^{(10)}(x) = -\sin x$, we have $|f^{(10)}(x)| \leqq 1$. Hence, in Equation (6), we can let M = 1. The result is
$$|\sin x - p(x)| \leqq 1.34 \times 10^{-9}$$
Thus, p(x) represents sin x on this interval with an error of at most two units in the ninth decimal place. Therefore, the interpolation polynomial that has 10 equally spaced nodes on the interval [0, 1.6875] approximates sin x to at least eight decimal digits of accuracy. In fact, a careful check on a computer would reveal that the polynomial is accurate to even more decimal places. (Why?) ■

The error expression in polynomial interpolation can also be given in terms of divided differences:
■ Theorem 3: Third Interpolation Error Theorem
If p is the polynomial of degree n that interpolates the function f at nodes $x_0, x_1, \ldots, x_n$, then for any x that is not a node,
$$f(x) - p(x) = f[x_0, x_1, \ldots, x_n, x] \prod_{i=0}^{n} (x - x_i)$$
Proof
Let t be any point, other than a node, where f(t) is defined. Let q be the polynomial of degree ≦ n + 1 that interpolates f at $x_0, x_1, \ldots, x_n, t$. By the Newton form of the interpolation formula (Equation (8) in Section 4.1), we have
$$q(x) = p(x) + f[x_0, x_1, \ldots, x_n, t] \prod_{i=0}^{n} (x - x_i)$$
Since q(t) = f(t), this yields at once
$$f(t) = p(t) + f[x_0, x_1, \ldots, x_n, t] \prod_{i=0}^{n} (t - x_i)$$
■
The following theorem shows that there is a relationship between divided differences
and derivatives.
■ Theorem 4: Divided Differences and Derivatives
If $f^{(n)}$ is continuous on [a, b] and if $x_0, x_1, \ldots, x_n$ are any n + 1 distinct points in [a, b], then for some ξ in (a, b),
$$f[x_0, x_1, \ldots, x_n] = \frac{1}{n!}\, f^{(n)}(\xi)$$
Proof
Let p be the polynomial of degree ≦ n − 1 that interpolates f at $x_0, x_1, \ldots, x_{n-1}$. By Theorem 1 on interpolation errors, there is a point ξ such that
$$f(x_n) - p(x_n) = \frac{1}{n!}\, f^{(n)}(\xi) \prod_{i=0}^{n-1} (x_n - x_i)$$
By Theorem 3 on interpolation errors, we obtain
$$f(x_n) - p(x_n) = f[x_0, x_1, \ldots, x_{n-1}, x_n] \prod_{i=0}^{n-1} (x_n - x_i)$$
Because the nodes are distinct, the product is nonzero, and comparing the two equations gives the result. ■

As an immediate consequence of this theorem, we observe that all high-order divided differences are zero for a polynomial.

■ Corollary 1: Divided Differences Corollary
If f is a polynomial of degree n, then all of the divided differences $f[x_0, x_1, \ldots, x_i]$ are zero for i ≧ n + 1.

EXAMPLE 3
Is there a cubic polynomial that takes these values?

x     1    −2     0     3    −1     7
y    −2   −56    −2     4   −16   376

Solution
If such a polynomial exists, its fourth-order divided differences f[ , , , , ] would all be zero. We form a divided-difference table to check this possibility:
Sample Divided-Difference Table

 x     f[ ]    f[ , ]    f[ , , ]    f[ , , , ]    f[ , , , , ]
 1      −2
                 18
−2     −56                  −9
                 27                       2
 0      −2                  −5                          0
                  2                       2
 3       4                  −3                          0
                  5                       2
−1     −16                  11
                 49
 7     376

The data can be represented by a cubic polynomial because the fourth-order divided differences f[ , , , , ] are zero. From the Newton form of the interpolation formula, this polynomial is
$$p_3(x) = -2 + 18(x - 1) - 9(x - 1)(x + 2) + 2(x - 1)(x + 2)x$$
■

Summary 4.2
• The Runge function $f(x) = 1/(1 + x^2)$ on the interval [−5, 5] shows that high-degree polynomial interpolation and uniform spacing of nodes may not be satisfactory. The Chebyshev nodes for the interval [a, b] are given by
$$x_i = \frac{1}{2}(a + b) + \frac{1}{2}(b - a) \cos\left[\frac{2i + 1}{2n + 2}\,\pi\right]$$
• There is a relationship between differences and derivatives:
$$f[x_0, x_1, \ldots, x_n] = \frac{1}{n!}\, f^{(n)}(\xi)$$
• Expressions for errors in polynomial interpolation are
$$f(x) - p(x) = \frac{1}{(n+1)!}\, f^{(n+1)}(\xi) \prod_{i=0}^{n} (x - x_i)$$
$$f(x) - p(x) = f[x_0, x_1, \ldots, x_n, x] \prod_{i=0}^{n} (x - x_i)$$
• For n + 1 equally spaced nodes, an upper bound on the error is given by
$$|f(x) - p(x)| \leqq \frac{M}{4(n+1)} \left(\frac{b - a}{n}\right)^{n+1}$$
Here M is an upper bound on $|f^{(n+1)}(x)|$ when a ≦ x ≦ b.
• If f is a polynomial of degree n, then all of the divided differences $f[x_0, x_1, \ldots, x_i]$ are zero for i ≧ n + 1.
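Two of the items summarized above lend themselves to a quick numerical check. The following Python sketch is illustrative only: the function names `divided_difference_table` and `equispaced_error_bound` are assumptions, not routines from the text. It rebuilds the divided-difference table for the data of Example 3 and evaluates the equally spaced error bound for sin x with n = 9 on [0, 1.6875].

```python
def divided_difference_table(xs, ys):
    """Return the table columns; table[k][i] holds f[x_i, ..., x_{i+k}]."""
    table = [[float(y) for y in ys]]
    n = len(xs)
    for k in range(1, n):
        prev = table[-1]
        # Each entry is the difference of two neighbors in the previous
        # column divided by the spread of the nodes involved.
        col = [(prev[i + 1] - prev[i]) / (xs[i + k] - xs[i])
               for i in range(n - k)]
        table.append(col)
    return table

def equispaced_error_bound(M, a, b, n):
    """Evaluate M / (4 (n + 1)) * ((b - a) / n) ** (n + 1)."""
    h = (b - a) / n
    return M * h ** (n + 1) / (4 * (n + 1))

if __name__ == "__main__":
    xs = [1, -2, 0, 3, -1, 7]
    ys = [-2, -56, -2, 4, -16, 376]
    table = divided_difference_table(xs, ys)
    for k, col in enumerate(table):
        print(f"order {k}: {col}")
    # The Newton coefficients are the first entries of the columns.
    print("Newton coefficients:", [col[0] for col in table])
    print("bound for sin x, n = 9:", equispaced_error_bound(1.0, 0.0, 1.6875, 9))
```

The first entries of the columns are the coefficients of the Newton form, so the vanishing fourth-order and fifth-order entries confirm that a cubic fits the data.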
Exercises 4.2

a1. Use a divided-difference table to show that the following data can be represented by a polynomial of degree 3:

x    −2    −1     0     1     2     3
y     1     4    11    16    13    −4

2. Fill in a detail in the proof of Inequality (4) by proving Inequality (5).
3. (Continuation) Fill in another detail in the proof of Inequality (4) by showing that (j + 1)!(n − j)! ≦ n! if 0 ≦ j ≦ n − 1. Induction and a symmetry argument can be used.
4. For nonuniformly distributed nodes a = x0 < x1 < ···
a9. An interpolating polynomial of degree 20 is to be used to approximate $e^{-x}$ on the interval [0, 2]. How accurate will it be? (Use 21 uniform nodes, including the endpoints of the interval. Compare results, using Theorems 1 and 2.)
a10. Let the function f(x) = ln x be approximated by an interpolation polynomial of degree 9 with 10 nodes uniformly distributed in the interval [1, 2]. What bound can be placed on the error?
11. In the first theorem on interpolation errors, show that if x0 < x1 < ··· < xn and x0 < x < xn, then x0 < ξ < xn.
12. (Continuation) In the same theorem, considering ξ as a function of x, show that $f^{(n)}[\xi(x)]$ is a continuous function of x.
Note: ξ(x) need not be a continuous function of x.
a13. Suppose cos x is to be approximated by an interpolating polynomial of degree n, using n + 1 equally spaced nodes in the interval [0, 1]. How accurate is the approximation? (Express your answer in terms of n.) How accurate is the
Computer Exercises 4.2

7. Let f(x) = max{0, 1 − x}. Sketch the function f. Then find interpolating polynomials p of degrees 2, 4, 8, 16, and 32 to f on the interval [−4, 4], using equally spaced nodes. Print out the discrepancy f(x) − p(x) at 128 equally spaced points. Then redo the problem using Chebyshev nodes.
8. Using Coef and Eval and an automatic plotter, fit a polynomial through the following data:

x     0.0    0.60    1.50    1.70    1.90    2.1    2.30    2.60    2.8    3.00
y    −0.8   −0.34    0.59    0.59    0.23    0.1    0.28    1.03    1.5    1.44

Does the resulting curve look like a good fit? Explain.
9. Find the polynomial p of degree ≦ 10 that interpolates |x| on [−1, 1] at 11 equally spaced points. Print the difference |x| − p(x) at 41 equally spaced points. Then do the same with Chebyshev nodes. Compare.
a10. Why are the Chebyshev nodes generally better than equally spaced nodes in polynomial interpolation? The answer lies in the term $\prod_{i=0}^{n} (x - x_i)$ that occurs in the error formula. If $x_i = \cos[(2i + 1)\pi/(2n + 2)]$, then
$$\left|\prod_{i=0}^{n} (x - x_i)\right| \leqq 2^{-n}$$
for all x in [−1, 1]. Carry out a numerical experiment to test the given inequality for n = 3, 7, 15.
11. (Student Research Project) Explore the topic of interpolation of multivariate scattered data, such as those in geophysics and other areas.
12. Use mathematical software such as found in MATLAB, Maple, or Mathematica to reproduce Figures 4.7–4.8.
13. Use symbolic mathematical software such as MATLAB, Maple, or Mathematica to generate the interpolation polynomial for the data points in Example 2. Plot the polynomial and the data points.
14. Use graphical software to plot four or five points that happen to generate an interpolating polynomial that exhibits a great deal of oscillations. This piece of software should let you use your computer mouse to click on three or four points that visually appear to be part of a smooth curve. Next it uses Newton’s interpolating polynomial to sketch the curve through these points. Then add another point that is somewhat remote from the curve and refit all the points. Repeat, adding other points. After a few points have been added in this way, you should have evidence that polynomials can oscillate wildly.

4.3 Estimating Derivatives and Richardson Extrapolation
A numerical experiment outlined in Section 1.1 (p. 12) showed that determining the derivative of a function f at a point x is not a trivial numerical problem. Specifically, if f(x) can be computed with only n digits of precision, it is difficult to calculate f′(x) numerically
with n digits of precision. This difficulty can be traced to the subtraction between quantities that are nearly equal. In this section, several alternatives are offered for the numerical computation of f′(x) and f′′(x).

First-Derivative Formulas via Taylor Series
First, consider again the obvious method based on the definition of f′(x). It consists of selecting one or more small values of h and writing
$$f'(x) \approx \frac{1}{h}\,[f(x + h) - f(x)] \tag{1}$$
What truncation error is involved in this formula? To find out, use Taylor’s Theorem:
$$f(x + h) = f(x) + h f'(x) + \frac{1}{2} h^2 f''(\xi)$$
Rearranging this equation gives
$$f'(x) = \frac{1}{h}\,[f(x + h) - f(x)] - \frac{1}{2} h f''(\xi) \tag{2}$$
Hence, we see that approximation (1) has error term $-\frac{1}{2} h f''(\xi) = O(h)$, where ξ is in the interval having endpoints x and x + h.

Equation (2) shows that in general, as h → 0, the difference between f′(x) and the estimate $h^{-1}[f(x + h) - f(x)]$ approaches zero at the same rate that h does, that is, O(h). Of course, if f′′(x) = 0, then the error term is $-\frac{1}{6} h^2 f'''(\gamma)$, which converges to zero somewhat faster at $O(h^2)$. But usually, f′′(x) is not zero.

Equation (2) gives the truncation error for this numerical procedure, namely, $-\frac{1}{2} h f''(\xi)$. This error is present even if the calculations are performed with infinite precision; it is due to our imitating the mathematical limit process by means of an approximation formula. Additional (and worse) errors must be expected when calculations are performed on a computer with finite word length.

EXAMPLE 1
In Section 1.1 (p. 12), the program named First used the forward difference rule (1) to approximate the first derivative of the function f(x) = sin x at x = 0.5. Explain what happens when a large number of iterations are performed, say n = 50.
Solution
There is a total loss of all significant digits! When we examine the computer output closely, we find that, in fact, a good approximation f′(0.5) ≈ 0.87758 was found, but it deteriorated as the process continued. This was caused by the subtraction of two nearly equal quantities f(x + h) and f(x), resulting in a loss of significant digits as well as a magnification of this effect from dividing by a small value of h. We need to stop the iterations sooner! When to stop an iterative process is a common question in numerical algorithms. In this case, monitor the iterations to determine when they settle down, namely, when two successive ones are within a prescribed tolerance. Alternatively, we can use the truncation error term. If we want six significant digits of accuracy in the results, we set
$$\left| -\frac{1}{2} h f''(\xi) \right| \leqq \frac{1}{2}\, 4^{-n} < \frac{1}{2}\, 10^{-6}$$
since |f′′(x)| < 1 and $h = 1/4^n$. We find n > 6/log 4 ≈ 9.97. So we should stop after about 10 steps in the process. (The least error of $3.1 \times 10^{-9}$ was found at iteration 14.) ■
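The behavior described in Example 1 can be reproduced with a short experiment. The Python sketch below is not the program First from Section 1.1; it is an assumed stand-in that applies the forward difference rule (1) to sin x at x = 0.5 with $h = 4^{-n}$, and the name `forward_difference` is introduced here only for illustration. It runs in double precision, so the iteration at which the error is smallest may differ from the figure quoted in the example.

```python
import math

def forward_difference(f, x, h):
    """Forward difference rule (1): [f(x + h) - f(x)] / h."""
    return (f(x + h) - f(x)) / h

if __name__ == "__main__":
    x = 0.5
    exact = math.cos(x)  # derivative of sin x
    h = 1.0
    for n in range(1, 31):
        h /= 4.0  # h = 4**(-n), as in the error estimate of Example 1
        approx = forward_difference(math.sin, x, h)
        print(f"{n:2d}  h = {h:.3e}  approx = {approx:.10f}  "
              f"error = {abs(approx - exact):.3e}")
```

For moderate n the error shrinks in proportion to h; once h is so small that x + h rounds to x, the computed difference is zero and every significant digit is lost.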
As we saw in Newton’s method (Section 3.2) and will see in the Romberg method (Section 5.2), it is advantageous to have the convergence of numerical processes occur with higher powers of some quantity approaching zero. In the present situation, we want an approximation to f′(x) in which the error behaves like $O(h^2)$. One such method is easily obtained with the aid of the following two Taylor series:
$$\begin{aligned}
f(x + h) &= f(x) + h f'(x) + \frac{1}{2!} h^2 f''(x) + \frac{1}{3!} h^3 f'''(x) + \frac{1}{4!} h^4 f^{(4)}(x) + \cdots \\
f(x - h) &= f(x) - h f'(x) + \frac{1}{2!} h^2 f''(x) - \frac{1}{3!} h^3 f'''(x) + \frac{1}{4!} h^4 f^{(4)}(x) - \cdots
\end{aligned} \tag{3}$$
By subtraction, we obtain
$$f(x + h) - f(x - h) = 2h f'(x) + \frac{2}{3!} h^3 f'''(x) + \frac{2}{5!} h^5 f^{(5)}(x) + \cdots$$
This leads to a very important formula for approximating f′(x):
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{h^2}{3!} f'''(x) - \frac{h^4}{5!} f^{(5)}(x) - \cdots \tag{4}$$
Expressed otherwise,
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{5}$$
with an error whose leading term is $-\frac{1}{6} h^2 f'''(x)$, which makes it $O(h^2)$.

By using Taylor’s Theorem with its error term, we could have obtained the following two expressions:
$$\begin{aligned}
f(x + h) &= f(x) + h f'(x) + \frac{1}{2} h^2 f''(x) + \frac{1}{6} h^3 f'''(\xi_1) \\
f(x - h) &= f(x) - h f'(x) + \frac{1}{2} h^2 f''(x) - \frac{1}{6} h^3 f'''(\xi_2)
\end{aligned}$$
Then the subtraction would lead to
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{1}{6} h^2\, \frac{f'''(\xi_1) + f'''(\xi_2)}{2}$$
The error term here can be simplified by the following reasoning: The expression $\frac{1}{2}[f'''(\xi_1) + f'''(\xi_2)]$ is the average of two values of f′′′ on the interval [x − h, x + h]. It therefore lies between the least and greatest values of f′′′ on this interval. If f′′′ is continuous on this interval, then this average value is assumed at some point ξ. Hence, the formula with its error term can be written as
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] - \frac{1}{6} h^2 f'''(\xi)$$
This is based on the sole assumption that f′′′ is continuous on [x − h, x + h]. This formula for numerical differentiation turns out to be very useful in the numerical solution of certain differential equations, as we shall see in Chapter 11 (on boundary value problems) and Chapter 12 (on partial differential equations).

EXAMPLE 2
Modify program First in Section 1.1 so that it uses the central difference formula (5) to approximate the first derivative of the function f(x) = sin x at x = 0.5. Determine how many iterations are needed for error less than $\frac{1}{2}\, 10^{-6}$.
Solution
Using the truncation error term for the central difference formula (5), we set
$$\left| -\frac{1}{6} h^2 f'''(\xi) \right| \leqq \frac{1}{6}\, 4^{-2n} < \frac{1}{2}\, 10^{-6}$$
or n > (6 − log 3)/log 16 ≈ 4.59. We obtain a good approximation after about five iterations with this higher-order formula. (The least error of $3.6 \times 10^{-12}$ was at step 9.) ■
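The iteration count found in Example 2 can be checked with a small experiment. The following Python sketch is an assumed stand-in for the modified program First, with the hypothetical name `central_difference`. It applies formula (5) to sin x at x = 0.5 with $h = 4^{-n}$ and reports the first n at which the error falls below $\frac{1}{2}\, 10^{-6}$.

```python
import math

def central_difference(f, x, h):
    """Central difference rule (5): [f(x + h) - f(x - h)] / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

if __name__ == "__main__":
    x = 0.5
    exact = math.cos(x)  # derivative of sin x
    tolerance = 0.5e-6
    first_ok = None
    for n in range(1, 13):
        h = 4.0 ** (-n)
        error = abs(central_difference(math.sin, x, h) - exact)
        print(f"{n:2d}  h = {h:.3e}  error = {error:.3e}")
        if first_ok is None and error < tolerance:
            first_ok = n
    print("first n meeting the tolerance:", first_ok)
```

Each step divides h by 4, so the truncation error of the central rule drops by a factor of about 16 per step, compared with about 4 for the forward rule.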
Richardson Extrapolation
Returning now to Equation (4), we write it in a simpler form:
$$f'(x) = \frac{1}{2h}\,[f(x + h) - f(x - h)] + a_2 h^2 + a_4 h^4 + a_6 h^6 + \cdots \tag{6}$$
in which the constants $a_2, a_4, \ldots$ depend on f and x. When such information is available about a numerical process, it is possible to use a powerful technique known as Richardson extrapolation to wring more accuracy out of the method. This procedure is explained here, using Equation (6) as our model.

Holding f and x fixed, we define a function of h by the formula
$$\phi(h) = \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{7}$$
From Equation (6), we see that φ(h) is an approximation to f′(x) with error of order $O(h^2)$. Our objective is to compute $\lim_{h \to 0} \phi(h)$ because this is the quantity f′(x) that we wanted in the first place. If we select a function f and plot φ(h) for $h = 1, \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots$, then we get a graph (Computer Exercise 4.3.5). Near zero, where we cannot actually calculate the value of φ from Equation (7), φ is approximately a quadratic function of h, since the higher-order terms from Equation (6) are negligible. Richardson extrapolation seeks to estimate the limiting value at 0 from some computed values of φ(h) near 0. Obviously, we can take any convenient sequence $h_n$ that converges to zero, calculate $\phi(h_n)$ from Equation (7), and use these as approximations to f′(x).

But something much more clever can be done. Suppose we compute φ(h) for some h and then compute φ(h/2). By Equation (6), we have
$$\begin{aligned}
\phi(h) &= f'(x) - a_2 h^2 - a_4 h^4 - a_6 h^6 - \cdots \\
\phi\!\left(\frac{h}{2}\right) &= f'(x) - a_2 \left(\frac{h}{2}\right)^2 - a_4 \left(\frac{h}{2}\right)^4 - a_6 \left(\frac{h}{2}\right)^6 - \cdots
\end{aligned}$$
We can eliminate the dominant term in the error series by simple algebra. To do so, multiply the second equation by 4 and subtract it from the first equation. The result is
$$\phi(h) - 4\phi\!\left(\frac{h}{2}\right) = -3 f'(x) - \frac{3}{4} a_4 h^4 - \frac{15}{16} a_6 h^6 - \cdots$$
We divide by −3 and rearrange this to get
$$\phi\!\left(\frac{h}{2}\right) + \frac{1}{3}\left[\phi\!\left(\frac{h}{2}\right) - \phi(h)\right] = f'(x) + \frac{1}{4} a_4 h^4 + \frac{5}{16} a_6 h^6 + \cdots$$
This is a marvelous discovery. Simply by adding $\frac{1}{3}[\phi(h/2) - \phi(h)]$ to φ(h/2), we have apparently improved the precision to $O(h^4)$ because the error series that accompanies this new combination begins with $\frac{1}{4} a_4 h^4$. When h is small, this is a dramatic improvement!

We can repeat this process by letting
$$\Phi(h) = \frac{4}{3}\,\phi\!\left(\frac{h}{2}\right) - \frac{1}{3}\,\phi(h)$$
Then we have from the previous derivation that
$$\begin{aligned}
\Phi(h) &= f'(x) + b_4 h^4 + b_6 h^6 + \cdots \\
\Phi\!\left(\frac{h}{2}\right) &= f'(x) + b_4 \left(\frac{h}{2}\right)^4 + b_6 \left(\frac{h}{2}\right)^6 + \cdots
\end{aligned}$$
We can combine these equations to eliminate the first term in the error series
$$\Phi(h) - 16\,\Phi\!\left(\frac{h}{2}\right) = -15 f'(x) + \frac{3}{4} b_6 h^6 + \cdots$$
Hence, we have
$$\Phi\!\left(\frac{h}{2}\right) + \frac{1}{15}\left[\Phi\!\left(\frac{h}{2}\right) - \Phi(h)\right] = f'(x) - \frac{1}{20} b_6 h^6 + \cdots$$
This is yet another apparent improvement in the precision to $O(h^6)$. And now, to top it off, note that the same procedure can be repeated over and over again to kill higher and higher terms in the error. This is Richardson extrapolation.

Essentially the same situation arises in the derivation of Romberg’s algorithm in Section 5.2. We begin a general discussion of the procedure here. We start with an equation that includes both situations. Let φ be a function such that
$$\phi(h) = L - \sum_{k=1}^{\infty} a_{2k} h^{2k} \tag{8}$$
where the coefficients $a_{2k}$ are not known. Equation (8) is not interpreted as the definition of φ, but rather as a property that φ possesses. It is assumed that φ(h) can be computed for any h > 0 and that our objective is to approximate L accurately using φ.

Select a convenient h, and compute the numbers
$$D(n, 0) = \phi\!\left(\frac{h}{2^n}\right) \qquad (n \geqq 0) \tag{9}$$
Because of Equation (8), we have
$$D(n, 0) = L + \sum_{k=1}^{\infty} A(k, 0) \left(\frac{h}{2^n}\right)^{2k}$$
where $A(k, 0) = -a_{2k}$. These quantities D(n, 0) give a crude estimate of the unknown number $L = \lim_{x \to 0} \phi(x)$. More accurate estimates are obtained via Richardson extrapolation. The extrapolation formula is
$$D(n, m) = \frac{4^m}{4^m - 1}\, D(n, m-1) - \frac{1}{4^m - 1}\, D(n-1, m-1) \qquad (1 \leqq m \leqq n) \tag{10}$$

■ Theorem 1: Richardson Extrapolation Theorem
The quantities D(n, m) defined in the Richardson extrapolation process (10) obey the equation
$$D(n, m) = L + \sum_{k=m+1}^{\infty} A(k, m) \left(\frac{h}{2^n}\right)^{2k} \qquad (0 \leqq m \leqq n) \tag{11}$$
Proof
Equation (11) is true by hypothesis if m = 0. For the purpose of an inductive proof, we assume that Equation (11) is valid for an arbitrary value of m − 1, and we prove that Equation (11) is then valid for m. Now from Equations (10) and (11) for a fixed value m, we have
$$D(n, m) = \frac{4^m}{4^m - 1}\left[L + \sum_{k=m}^{\infty} A(k, m-1) \left(\frac{h}{2^n}\right)^{2k}\right] - \frac{1}{4^m - 1}\left[L + \sum_{k=m}^{\infty} A(k, m-1) \left(\frac{h}{2^{n-1}}\right)^{2k}\right]$$
After simplification, this becomes
$$D(n, m) = L + \sum_{k=m}^{\infty} A(k, m-1)\, \frac{4^m - 4^k}{4^m - 1} \left(\frac{h}{2^n}\right)^{2k} \tag{12}$$
Thus, we are led to define
$$A(k, m) = A(k, m-1)\, \frac{4^m - 4^k}{4^m - 1}$$
At the same time, we notice that A(m, m) = 0. Hence, Equation (12) can be written as
$$D(n, m) = L + \sum_{k=m+1}^{\infty} A(k, m) \left(\frac{h}{2^n}\right)^{2k}$$
Equation (11) is true for m, and the induction is complete. ■

The significance of Equation (11) is that the summation begins with the term $(h/2^n)^{2m+2}$. Since $h/2^n$ is small, this indicates that the numbers D(n, m) are approaching L very rapidly, namely,
$$D(n, m) = L + O\!\left(\frac{h^{2(m+1)}}{2^{2n(m+1)}}\right)$$
In practice, one can arrange the quantities in a two-dimensional triangular array as follows:

D(0, 0)
D(1, 0)    D(1, 1)
D(2, 0)    D(2, 1)    D(2, 2)
   .          .          .
   .          .          .
D(N, 0)    D(N, 1)    D(N, 2)    ···    D(N, N)          (13)

The main tasks to generate such an array are as follows:

■ Algorithm: Richardson Extrapolation
1. Write a function for φ.
2. Decide on suitable values for N and h.
3. For i = 0, 1, …, N, compute $D(i, 0) = \phi(h/2^i)$.
4. For 1 ≦ j ≦ i ≦ N, compute
$$D(i, j) = D(i, j-1) + (4^j - 1)^{-1}\,[D(i, j-1) - D(i-1, j-1)]$$

Notice that in this algorithm, the computation of D(i, j) follows Equation (10) but has been rearranged slightly to improve its numerical properties.

EXAMPLE 3
Write a procedure to compute the derivative of a function at a point by using Equation (5) and Richardson extrapolation.
Solution
The inputs to the procedure are a function f, a specific point x, a value of h, and a number n signifying how many rows in the array (13) are to be computed. The output is the array (13). Here is a suitable pseudocode:

Derivative Pseudocode

    procedure Derivative(f, x, n, h, (d_ij))
    integer i, j, n; real h, x; real array (d_ij)_{0:n×0:n}
    external function f
    for i = 0 to n
        d_{i,0} ← [f(x + h) − f(x − h)]/(2h)
        for j = 1 to i
            d_{i,j} ← d_{i,j−1} + (d_{i,j−1} − d_{i−1,j−1})/(4^j − 1)
        end for
        h ← h/2
    end for
    end procedure Derivative

To test the procedure, choose f(x) = sin x, where x0 = 1.23095 94154 and h = 1. Then f′(x) = cos x and f′(x0) = 1/3. A pseudocode is written as follows:

Test Derivative Pseudocode

    program Test Derivative
    real array (d_ij)_{0:n×0:n}; external function f
    integer n ← 10; real h ← 1; x ← 1.23095 94154
    call Derivative(f, x, n, h, (d_ij))
    output (d_ij)
    end program Test Derivative

    real function f(x)
    real x
    f ← sin(x)
    end function f
We invite the reader to program the pseudocode and execute it on a computer. The computer output is the triangular array $(d_{ij})$ with indices 0 ≦ j ≦ i ≦ 10. The most accurate value is $d_{4,1}$ = 0.33333 33433. The values $d_{i0}$, which are obtained solely by Equations (7) and (9) without any extrapolation, are not as accurate, having no more than four correct digits. ■
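One possible way to carry out this invitation is sketched below in Python. It follows the Derivative pseudocode step by step, but the name `richardson_derivative` and the list-of-rows storage are choices made here for illustration, not part of the text. Because Python floats are double precision, the digits it prints need not match the values quoted in the example.

```python
import math

def richardson_derivative(f, x, n, h):
    """Return rows d[i][j], 0 <= j <= i <= n, of the triangular array (13)."""
    d = [[0.0] * (i + 1) for i in range(n + 1)]
    for i in range(n + 1):
        # Column 0: central difference phi(h / 2**i), Equations (7) and (9).
        d[i][0] = (f(x + h) - f(x - h)) / (2.0 * h)
        for j in range(1, i + 1):
            # Rearranged form of the extrapolation formula (10).
            d[i][j] = d[i][j - 1] + (d[i][j - 1] - d[i - 1][j - 1]) / (4.0 ** j - 1.0)
        h /= 2.0  # halve the step before computing the next row
    return d

if __name__ == "__main__":
    x0 = 1.2309594154
    table = richardson_derivative(math.sin, x0, 10, 1.0)
    for row in table:
        print("  ".join(f"{value:.10f}" for value in row))
```

Reading across a row shows the effect of extrapolation at a fixed smallest step; reading down the first column shows the much slower improvement obtained from halving h alone.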
Mathematical software is now available with algebraic manipulation capabilities. Using them, we could write a computer program to find derivatives symbolically for a rather large class of functions, probably all those you would encounter in a calculus course. For example, we could verify the numerical results above by first finding the derivative exactly and then evaluating the numerical answer cos(1.23095 94154) ≈ 0.33333 33355 since arccos(1/3) ≈ 1.23095 941543. Of course, the procedures discussed in this section are for approximating derivatives that cannot be determined exactly.

First-Derivative Formulas via Interpolation Polynomials
An important general stratagem can be used to approximate derivatives (as well as integrals and other quantities). The function f is first approximated by a polynomial p so that f ≈ p. Then we simply proceed to the approximation f′(x) ≈ p′(x) as a consequence. Of course, this strategy should be used very cautiously because the behavior of the interpolating polynomial can be oscillatory.

In practice, the approximating polynomial p is often determined by interpolation at a few points. For example, suppose that p is the polynomial of degree at most 1 that interpolates f at two nodes, x0 and x1. Then from Equation (8) in Section 4.1 with n = 1, we have
$$p_1(x) = f(x_0) + f[x_0, x_1](x - x_0)$$
Consequently, we have
$$f'(x) \approx p_1'(x) = f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \tag{14}$$
If x0 = x and x1 = x + h (see Figure 4.11), this formula is one previously considered, namely, Equation (1):
$$f'(x) \approx \frac{1}{h}\,[f(x + h) - f(x)] \tag{15}$$

FIGURE 4.11 Forward difference: two nodes (x0 = x, x1 = x + h)

If x0 = x − h and x1 = x + h (see Figure 4.12), the resulting formula is Equation (5):
$$f'(x) \approx \frac{1}{2h}\,[f(x + h) - f(x - h)] \tag{16}$$

FIGURE 4.12 Central difference: two nodes (x0 = x − h, x1 = x + h)

Now consider interpolation with three nodes, x0, x1, and x2. The interpolating polynomial is obtained from Equation (8) in Section 4.1:
$$p_2(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1)$$
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Here the right-hand side consists of two terms. The first is the previous estimate in Equa- tion (14), and the second is a refinement or correction term.
If Equation (17) is used to evaluate f ′(x) when x = 1 (x0 + x1), as in Equation (16), 2
then the correction term in Equation (17) is zero. Thus, the first term in this case must be more accurate than those in other cases because the correction term adds nothing. This is why Equation (16) is more accurate than (15).
An analysis of the errors in this general procedure goes as follows: Suppose that pn is thepolynomialofleastdegreethatinterpolates f atthenodesx0,x1,…,xn.Thenaccording to Theorem 1 on interpolating errors in Section 4.2 (p. 181),
f (x) − pn(x) = 1 f (n+1)(ξ)w(x) (n + 1)!
whereξ isdependentonx,andw(x)=(x−x0)(x−x1)···(x−xn).Differentiatinggives f ′(x) − pn′ (x) = 1 w(x) d f (n+1)(ξ) + 1 f (n+1)(ξ)w′(x) (18)
(n + 1)! dx (n + 1)!
Here, we had to assume that f (n+1)(ξ) is differentiable as a function of x, a fact that is known if f (n+2) exists and is continuous.
The first observation to make about the error formula in Equation (18) is that w(x) vanishes at each node, so if the evaluation is at a node xi , the resulting equation is simpler:
f′(xi)= pn′(xi)+ 1 f(n+1)(ξ)w′(xi) (n + 1)!
For example, taking just two points x0 and x1, we obtain with n = 1 and i = 0, f′(x0)= f[x0,x1]+1f′′(ξ)d [(x−x0)(x−x1)]
2 dx x=x0 = f[x0,x1]+1f′′(ξ)(x0−x1)
2
This is Equation (2) in disguise when x0 = x and x1 = x + h. Similar results follow with
n = 1 and i = 1.
The second observation to make about Equation (18) is that it becomes simpler if x is
chosen as a point where w′(x) = 0. For instance, if n = 1, then w is a quadratic function that vanishes at the two nodes x0 and x1. Because a parabola is symmetric about its axis, w′[(x0 + x1)/2] = 0. The resulting formula is
′x0+x1 1 2d ′′
f = f[x0,x1]− (x1 −x0) f (ξ)
2 8 dx
As a final example, consider four interpolation points: x0, x1, x2, and x3. The interpo-
lating polynomial from Equation (8) in Section 4.1 with n = 3 is p3(x)= f(x0)+ f[x0,x1](x−x0)+ f[x0,x1,x2](x−x0)(x−x1)
and its derivative is
4.3 Estimating Derivatives and Richardson Extrapolation 195
p2′(x)= f[x0,x1]+ f[x0,x1,x2](2x−x0 −x1) (17)
f′ x0+x1 with Finite 2
Difference Formula
Its derivative is
+ f[x0,x1,x2,x3](x−x0)(x−x1)(x−x2)
p3′(x)= f[x0,x1]+ f[x0,x1,x2](2x−x0 −x1) + f [x0, x1, x2, x3]((x − x1)(x − x2)
+(x −x0)(x −x2)+(x −x0)(x −x1))
FIGURE 4.13 Central difference: four nodes (x2 = x − 2h, x0 = x − h, x1 = x + h, x3 = x + 2h)
A useful special case occurs if x0 = x − h, x1 = x + h, x2 = x − 2h, and x3 = x + 2h (see Figure 4.13). The resulting formula is
$$f'(x) \approx \frac{2}{3h}\,[f(x+h) - f(x-h)] - \frac{1}{12h}\,[f(x+2h) - f(x-2h)]$$
This can be arranged in a form in which it is computed with a principal term plus a correction or refining term:
f′(x) Finite Difference Formula at Four Nodes
$$f'(x) \approx \frac{1}{2h}\,[f(x+h) - f(x-h)] - \frac{1}{12h}\,\big\{f(x+2h) - 2[f(x+h) - f(x-h)] - f(x-2h)\big\}$$
The error term is $-\frac{1}{30} h^4 f^{(5)}(\xi) = O(h^4)$.
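The relative accuracy of the forward, central, and four-node formulas can be checked numerically. The following is a minimal sketch, not taken from the text: the function names, the test function sin x, and the choices x = 1 and h = 0.1 are illustrative assumptions.

```python
import math

def forward_diff(f, x, h):
    # Equation (15): one-sided difference, error O(h)
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # Equation (16): central difference, error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

def four_node_diff(f, x, h):
    # Principal term (central difference) minus the correction term, error O(h^4)
    correction = (f(x + 2 * h) - 2 * (f(x + h) - f(x - h)) - f(x - 2 * h)) / (12 * h)
    return central_diff(f, x, h) - correction

x, h = 1.0, 0.1              # illustrative choices
exact = math.cos(x)          # derivative of sin at x
err_forward = abs(forward_diff(math.sin, x, h) - exact)
err_central = abs(central_diff(math.sin, x, h) - exact)
err_four = abs(four_node_diff(math.sin, x, h) - exact)
print(err_forward, err_central, err_four)
```

With these choices, the three errors should decrease in the order listed, in line with the error terms O(h), O(h²), and O(h⁴).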
Second-Derivative Formulas via Taylor Series
In the numerical solution of differential equations, it is often necessary to approximate second derivatives. We shall derive the most important formula for accomplishing this. Simply add the two Taylor series (3) for f (x + h) and f (x − h). The result is
$$f(x+h) + f(x-h) = 2f(x) + h^2 f''(x) + 2\left[\frac{1}{4!}\,h^4 f^{(4)}(x) + \cdots\right]$$
f′′(x) Central Difference Formula and Error Term
When this is rearranged, we get
$$f''(x) = \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)] + E \tag{19}$$
where the error series is
$$E = -2\left[\frac{1}{4!}\,h^2 f^{(4)}(x) + \frac{1}{6!}\,h^4 f^{(6)}(x) + \cdots\right]$$
By carrying out the same process using Taylor’s formula with a remainder, one can show that E is also given by
$$E = -\frac{1}{12}\,h^2 f^{(4)}(\xi) = O(h^2)$$
for some ξ in the interval (x − h, x + h). Hence, we have the approximation
$$f''(x) \approx \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)] \tag{20}$$
with error O(h²).
EXAMPLE 4
Repeat Example 2, using the central difference formula (20) to approximate the second derivative of the function f(x) = sin x at the given point x = 0.5.
Solution
Using the truncation error term, we set
$$\left|-\frac{1}{12}\,h^2 f^{(4)}(\xi)\right| \le \frac{1}{12}\,4^{-2n} < \frac{1}{2}\,10^{-6}$$
Noise and Truncation Error
f ′(x) Central Difference Formula with Error Term
and we obtain n > (6 − log 6)/ log 16 ≈ 4.34. Hence, the modified program First finds a good approximation of f ′′(0.5) ≈ −0.47942 after about four iterations. (The least error of 3.1 × 10−9 was obtained at iteration 6.) ■
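The experiment in Example 4 can be sketched as follows. This is an illustrative stand-in, not the program First referred to in the text: the function name is hypothetical, the step sizes h = 4⁻ⁿ follow the example, and the arithmetic is double precision, so the observed errors need not match the figures quoted above.

```python
import math

def second_central_diff(f, x, h):
    # Formula (20): central difference approximation of f''(x), error O(h^2)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

x = 0.5
exact = -math.sin(x)                     # second derivative of sin is -sin
for n in range(1, 9):
    h = 4.0 ** (-n)                      # step sizes used in the example
    approx = second_central_diff(math.sin, x, h)
    print(n, approx, abs(approx - exact))

approx_n5 = second_central_diff(math.sin, x, 4.0 ** (-5))
```

For very small h the subtraction f(x + h) − 2f(x) + f(x − h) loses significant digits, so the error eventually grows again instead of continuing to shrink.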
Approximate derivative formulas of high order can be obtained by using unequally spaced points such as at Chebyshev nodes. Recently, software packages have been developed for automatic differentiation of functions that are expressible by a computer program. They produce true derivatives with only rounding errors and no discretization errors.
Noise in Computation
An interesting question is how noise in the evaluation of f (x) affects the computation of derivatives when using the standard formulas.
The formulas for derivatives are derived with the expectation that evaluation of the function at any point is possible, with complete precision. Then the approximate derivative produced by the formula differs from the actual derivative by a quantity called the error term, which involves the spacing of the sample points and some higher derivative of the function.
If there are errors in the values of the function (noise), they can vitiate the whole process! Those errors could overwhelm the error inherent in the formulas. The inherent error arises from the fact that in deriving the formulas, a Taylor series was truncated after only a few terms. It is called the truncation error. It is present even if the evaluation of the function at the required sample points is absolutely correct.
For example, consider the formula
$$f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{h^2}{6}\, f'''(\xi)$$
The term with h2 is the error term. The point ξ is a nearby point (unknown). If f (x + h) and f (x − h) are in error by at most d, then one can see that the formula produces a value for f ′(x) that is in error by d/h, which is large when h is small. Noise completely spoils the process if d is large.
For a specific numerical case, suppose that h = 10⁻² and |f′′′(ξ)| ≦ 6. Then the truncation error, E, satisfies |E| ≦ 10⁻⁴. The derivative computed from the formula with complete precision is within 10⁻⁴ of the actual derivative. Suppose, however, that there is noise in the evaluation of f(x ± h) of magnitude d = h. The correct value of [f(x + h) − f(x − h)]/(2h) may differ from the noisy value by (2d)/(2h) = 1.
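The numerical case above can be reproduced with a short sketch. The test function sin x (for which |f′′′| ≦ 1 ≦ 6), the point x = 1, and the worst-case noise pattern (+d at x + h, −d at x − h) are illustrative assumptions, not part of the text.

```python
import math

def central_from_values(f_plus, f_minus, h):
    # Central difference built from two (possibly noisy) function values
    return (f_plus - f_minus) / (2 * h)

x, h = 1.0, 1e-2
d = h                                    # noise magnitude, as in the text

clean = central_from_values(math.sin(x + h), math.sin(x - h), h)
# Worst case: the two evaluations are off by d in opposite directions
noisy = central_from_values(math.sin(x + h) + d, math.sin(x - h) - d, h)

truncation_error = abs(clean - math.cos(x))   # bounded by (h^2 / 6) * max|f'''|
noise_error = abs(noisy - clean)              # equals (2d) / (2h)
print(truncation_error, noise_error)
```

Here the truncation error stays below 10⁻⁴ while the noise contributes an error of about 1, so the noise dominates completely.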
Summary 4.3
• A forward difference formula is
$$f'(x) \approx \frac{1}{h}\,[f(x+h) - f(x)]$$
with error term $-\frac{1}{2} h f''(\xi)$.
• A central difference formula is
$$f'(x) \approx \frac{1}{2h}\,[f(x+h) - f(x-h)]$$
with error $-\frac{1}{6} h^2 f'''(\xi) = O(h^2)$.
• A central difference formula with a correction term is
$$f'(x) \approx \frac{1}{2h}\,[f(x+h) - f(x-h)] - \frac{1}{12h}\,[f(x+2h) - 2f(x+h) + 2f(x-h) - f(x-2h)]$$
with error term $-\frac{1}{30} h^4 f^{(5)}(\xi) = O(h^4)$.
• For f′′(x), a central difference formula is
$$f''(x) \approx \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)]$$
with error term $-\frac{1}{12} h^2 f^{(4)}(\xi)$.
• If φ(h) is one of these formulas with error series a2h² + a4h⁴ + a6h⁶ + ···, then we can apply Richardson extrapolation as follows:
$$D(n,0) = \varphi(h/2^n)$$
$$D(n,m) = D(n,m-1) + \frac{D(n,m-1) - D(n-1,m-1)}{4^m - 1}$$
with error terms
$$D(n,m) = L + O\!\left(\frac{h^{2(m+1)}}{2^{2n(m+1)}}\right)$$

Exercises 4.3

a1. Determine the error term for the formula
$$f'(x) \approx \frac{1}{4h}\,[f(x+3h) - f(x-h)]$$
a2. Using Taylor series, establish the error term for the formula
$$f'(0) \approx \frac{1}{2h}\,[f(2h) - f(0)]$$
3. Derive the approximation formula
$$f'(x) \approx \frac{1}{2h}\,[4f(x+h) - 3f(x) - f(x+2h)]$$
Show that its error term is of the form $\frac{1}{3} h^2 f'''(\xi)$.
a4. Can you find an approximation formula for f′(x) that has error term O(h³) and involves only two evaluations of the function f? Prove or disprove.
5. Averaging the forward-difference formula f′(x) ≈ [f(x + h) − f(x)]/h and the backward-difference formula f′(x) ≈ [f(x) − f(x − h)]/h, each with error term O(h), results in the central-difference formula f′(x) ≈ [f(x + h) − f(x − h)]/(2h) with error O(h²). Show why.
Hint: Determine at least the first term in the error series for each formula.
a6. Criticize the following analysis. By Taylor’s formula, we have
$$f(x+h) - f(x) = h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi)$$
$$f(x-h) - f(x) = -h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi)$$
So by adding, we obtain an exact expression for f′′(x):
$$f(x+h) + f(x-h) - 2f(x) = h^2 f''(x)$$
7. Criticize the following analysis. By Taylor’s formula, we have
$$f(x+h) - f(x) = h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi_1)$$
$$f(x-h) - f(x) = -h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi_2)$$
Therefore, we have
$$\frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)] = f''(x) + \frac{h}{6}\,[f'''(\xi_1) - f'''(\xi_2)]$$
The error in the approximation formula for f′′ is thus O(h).
8. Derive the two formulas
aa. $f'(x) \approx \frac{1}{4h}\,[f(x+2h) - f(x-2h)]$
b. $f''(x) \approx \frac{1}{4h^2}\,[f(x+2h) - 2f(x) + f(x-2h)]$
Establish formulas for the errors in using them.
9. Derive the following rules for estimating derivatives and establish their error terms. Which is more accurate?
a. $f'''(x) \approx \frac{1}{2h^3}\,[f(x+2h) - 2f(x+h) + 2f(x-h) - f(x-2h)]$
b. $f^{(4)}(x) \approx \frac{1}{h^4}\,[f(x+2h) - 4f(x+h) + 6f(x) - 4f(x-h) + f(x-2h)]$
Hint: Consider the Taylor series for D(h) ≡ f(x + h) − f(x − h) and S(h) ≡ f(x + h) + f(x − h).
10. Establish the formula
$$f''(x) \approx \frac{2}{h^2}\left[\frac{f(x_0)}{1+\alpha} - \frac{f(x_1)}{\alpha} + \frac{f(x_2)}{\alpha(\alpha+1)}\right]$$
in the following two ways, using the unevenly spaced points x0 < x1 < x2, where x1 − x0 = h and x2 − x1 = αh. Notice that this formula reduces to the standard central-difference formula (20) when α = 1.
a. Approximate f(x) by the Newton form of the interpolating polynomial of degree 2.
b. Calculate the undetermined coefficients A, B, and C in the expression f′′(x) ≈ A f(x0) + B f(x1) + C f(x2) by making it exact for the three polynomials 1, x − x1, and (x − x1)² and thus exact for all polynomials of degree ≦ 2.
a11. (Continuation) Using Taylor series, show that
$$f'(x_1) = \frac{f(x_2) - f(x_0)}{x_2 - x_0} + (\alpha - 1)\,\frac{h}{2}\, f''(x_1) + O(h^2)$$
Establish that the error for approximating f′(x1) by [f(x2) − f(x0)]/(x2 − x0) is O(h²) when x1 is midway between x0 and x2, but only O(h) otherwise.
a12. A certain calculation requires an approximation formula for f′(x) + f′′(x). How well does the expression
$$\frac{2+h}{2h^2}\, f(x+h) - \frac{2}{h^2}\, f(x) + \frac{2-h}{2h^2}\, f(x-h)$$
serve? Derive this approximation and its error term.
a13. The values of a function f are given at three points x0, x1, and x2. If a quadratic interpolating polynomial is used to estimate f′(x) at x = (x0 + x1)/2, what formula will result?
14. Consider Equation (19).
a. Fill in the details in its derivation.
b. Using Taylor series, derive its error term.
15. Show how Richardson extrapolation would work on Formula (20).
a16. If φ(h) = L − c1h − c2h² − c3h³ − ···, then what combination of φ(h) and φ(h/2) should give an accurate estimate of L?
17. (Continuation) State and prove a theorem analogous to the theorem on Richardson extrapolation for the situation of the preceding problem.
18. If $\varphi(h) = L - c_1 h^{1/2} - c_2 h^{2/2} - c_3 h^{3/2} - \cdots$, then what combination of φ(h) and φ(h/2) should give an accurate estimate of L?
19. Show that Richardson extrapolation can be carried out for any two values of h. Thus, if $\varphi(h) = L - O(h^p)$, then from φ(h1) and φ(h2), a more accurate estimate of L is given by
$$\varphi(h_2) + \frac{h_2^p}{h_1^p - h_2^p}\,[\varphi(h_2) - \varphi(h_1)]$$
a20. Consider a function φ such that $\lim_{h\to 0} \varphi(h) = L$ and $L - \varphi(h) \approx c\,e^{-1/h}$ for some constant c. By combining φ(h), φ(h/2), and φ(h/3), find an accurate estimate of L.
21. Consider the approximate formula
$$f'(x) \approx \frac{3}{2h^3} \int_{-h}^{h} t\, f(x+t)\, dt$$
Determine its error term. Does the function f have to be differentiable for the formula to be meaningful?
Hint: This is a novel method of doing numerical differentiation. Read more about Lanczos’ generalized derivative in Groetsch [1998].
22. Derive the error terms for D(3, 0), D(3, 1), D(3, 2), and D(3, 3).
23. Differentiation and integration are mutual inverse processes. Differentiation is an inherently sensitive problem in which small changes in the data can cause large changes in the results. Integration is a smoothing process and is inherently stable. Display two functions that have very different derivatives but equal definite integrals and vice versa.
24. Establish the error terms for these rules:
a. $f'''(x) \approx \frac{1}{2h^3}\,[3f(x+h) - 10f(x) + 12f(x-h) - 6f(x-2h) + f(x-3h)]$
b. $f'(x) + \frac{h}{2} f'' \approx \frac{1}{h}\,[f(x+h) - f(x)]$
c. $f^{(iv)}(x) \approx \frac{1}{h^4}\left[\frac{4}{3} f(x+3h) - 6f(x+2h) + 12f(x+h)\right]$ if f(x) = f′(x) = 0.

Computer Exercises 4.3

1. Test procedure Derivative on the following functions at the points indicated in a single computer run. Interpret the results.
a. f(x) = cos x at x = 0
b. f(x) = arctan x at x = 1
c. f(x) = |x| at x = 0
2. (Continuation) Write and test a procedure similar to Derivative that computes f′′(x) with repeated Richardson extrapolation.
a3. Find f′(0.25) as accurately as possible, using only the function corresponding to the following pseudocode and a method for numerical differentiation:

real function f(x)
    integer i; real a, b, c, x
    a ← 1; b ← cos(x)
    for i = 1 to 5
        c ← b
        b ← √(ab)
        a ← (a + c)/2
    end for
    f ← 2 arctan(1)/a
end function f

4. Carry out a numerical experiment to compare the accuracy of Formulas (5) and (19) on a function f whose derivative can be computed precisely. Take a sequence of values for h, such as $4^{-n}$ with 0 ≦ n ≦ 12.
5. Using the discussion of the geometric interpretation of Richardson extrapolation, produce a graph to show that φ(h) looks like a quadratic curve in h.
6. Use symbolic mathematical software such as MATLAB, Maple, or Mathematica to establish the first term in the error series for Equation (19).
7. Use mathematical software such as found in MATLAB, Maple, or Mathematica to redo Example 1.
5
Numerical Integration
In electrical field theory, it is proved that the magnetic field induced by a current flowing in a circular loop of wire has intensity
$$H(x) = \frac{4Ir}{r^2 - x^2} \int_0^{\pi/2} \left[1 - \left(\frac{x}{r}\right)^2 \sin^2\theta\right]^{1/2} d\theta$$
where I is the current, r is the radius of the loop, and x is the distance from the center to the point where the magnetic intensity is being computed (0 ≦ x ≦ r). If I, r, and x are given, we have a formidable integral to evaluate. It is an elliptic integral and not expressible in terms of familiar functions. But H can be computed precisely by the methods of this chapter. For example, if I = 15.3, r = 120, and x = 84, we find H = 1.355661135 accurate to nine decimals.

5.1 Trapezoid Method

One of the main mathematical procedures that is the focus of elementary calculus is integration, which is examined from the standpoint of numerical mathematics in this chapter.

Definite and Indefinite Integrals

It is customary to distinguish two types of integrals: the definite and the indefinite integral. The indefinite integral of a function is another function or a class of functions, whereas the definite integral of a function over a fixed interval is a number. For example, we have

Indefinite integral:
$$\int x^2\, dx = \frac{1}{3}x^3 + C$$
Definite integral:
$$\int_0^2 x^2\, dx = \frac{8}{3}$$
Actually, a function has not just one, but many indefinite integrals. These differ from each other by constants. Thus, in the preceding example, any constant value may be assigned to C, and the result is still an indefinite integral. In elementary calculus, the concept of an indefinite integral is identical with the concept of an antiderivative. An antiderivative of a function f is any function F having the property that F′ = f.
Examples of Fundamental Theorem of Calculus
Derivative of an Integral
Antiderivative versus Area Under Curves
The definite and indefinite integrals are related by the Fundamental Theorem of Calculus,∗ which states that $\int_a^b f(x)\,dx$ can be computed by first finding an antiderivative F of f and then evaluating F(b) − F(a). Thus, using traditional notation, we have
$$\int_1^3 (x^2 - 2)\, dx = \left[\frac{x^3}{3} - 2x\right]_1^3 = \left(\frac{27}{3} - 6\right) - \left(\frac{1}{3} - 2\right) = \frac{14}{3}$$
As another example of the Fundamental Theorem of Calculus, we can write
$$\int_a^b F'(x)\, dx = F(b) - F(a)$$
$$\int_a^x F'(t)\, dt = F(x) - F(a)$$
If this second equation is differentiated with respect to x, the result is (and here we have put f = F′)
$$\frac{d}{dx} \int_a^x f(t)\, dt = f(x)$$
Nonuniform Partition
This last equation shows that $\int_a^x f(t)\,dt$ must be an antiderivative (indefinite integral) of f. The foregoing technique for computing definite integrals is virtually the only one emphasized in elementary calculus. The definite integral of a function, however, has an interpretation as the area under a curve, and so the existence of a numerical value for $\int_a^b f(x)\,dx$ should not depend logically on our limited ability to find antiderivatives. Thus, for instance,
$$\int_0^1 e^{x^2}\, dx$$
has a precise numerical value despite the fact that there is no elementary function F such that $F'(x) = e^{x^2}$. By the preceding remarks, $e^{x^2}$ does have antiderivatives, one of which is
$$F(x) = \int_0^x e^{t^2}\, dt$$
However, this form of the function F is of no help in determining the numerical value sought.
Trapezoid Rule: Nonuniform Spacing
The trapezoid rule is based on an estimation of the area beneath a curve using trapezoids. Moreover, it is an important ingredient of the Romberg algorithm of the next section. The estimation of $\int_a^b f(x)\,dx$ is approached by first dividing the interval [a, b] into subintervals according to the partition
P={a=x0
2
induces us to take 92 subintervals. ■
Recursive Trapezoid Formula
In the next section, we require a formula for the composite trapezoid rule when the interval [a, b] is subdivided into $2^n$ equal parts. By the composite trapezoid rule (1), we have
$$T(f;P) = h\sum_{i=1}^{n-1} f(x_i) + \frac{h}{2}\,[f(x_0) + f(x_n)] = h\sum_{i=1}^{n-1} f(a+ih) + \frac{h}{2}\,[f(a) + f(b)]$$
If we now replace n by $2^n$ and use $h = (b-a)/2^n$, the preceding formula becomes
$$R(n,0) = h\sum_{i=1}^{2^n-1} f(a+ih) + \frac{h}{2}\,[f(a) + f(b)] \tag{8}$$
Here, we have introduced the notation that is used in Section 5.2 on the Romberg algorithm, namely, R(n, 0). It denotes the result of applying the composite trapezoid rule with $2^n$ equal subintervals.
In the Romberg algorithm, it is necessary to have a means of computing R(n, 0) from R(n − 1, 0) without involving unneeded evaluations of f. For example, the computation of R(2, 0) utilizes the values of f at the five points a, a + (b − a)/4, a + 2(b − a)/4, a + 3(b − a)/4, and b. In computing R(3, 0), we need values of f at these five points, as well as at four new points: a + (b − a)/8, a + 3(b − a)/8, a + 5(b − a)/8, and a + 7(b − a)/8 (see Figure 5.3). The computation should take advantage of the previously computed result. The manner of doing so is explained below.
FIGURE 5.3 $2^n$ equal subintervals: the arrays R(0, 0), R(1, 0), R(2, 0), R(3, 0) use $2^0$, $2^1$, $2^2$, $2^3$ subintervals of [a, b]
If R(n − 1, 0) has been computed and R(n, 0) is to be computed, we use the identity
$$R(n,0) = \tfrac{1}{2} R(n-1,0) + \left[R(n,0) - \tfrac{1}{2} R(n-1,0)\right]$$
It is desirable to compute the bracketed expression with as little additional work as possible. Fixing $h = (b-a)/2^n$ for the analysis and putting
$$C = \frac{h}{2}\,[f(a) + f(b)]$$
we have, from Equation (8),
$$R(n,0) = h\sum_{i=1}^{2^n-1} f(a+ih) + C \tag{9}$$
$$R(n-1,0) = 2h\sum_{j=1}^{2^{n-1}-1} f(a+2jh) + 2C \tag{10}$$
Notice that the subintervals for R(n − 1, 0) are twice the size of those for R(n, 0). Now from Equations (9) and (10), we have
$$R(n,0) - \tfrac{1}{2} R(n-1,0) = h\sum_{i=1}^{2^n-1} f(a+ih) - h\sum_{j=1}^{2^{n-1}-1} f(a+2jh) = h\sum_{k=1}^{2^{n-1}} f[a + (2k-1)h]$$
Here, we have taken account of the fact that each term in the first sum that corresponds to an even value of i is canceled by a term in the second sum. This leaves only terms that correspond to odd values of i.
To summarize, we have the Recursive Trapezoid Formula.
■ Theorem 2
Recursive Trapezoid Formula
If R(n − 1, 0) is available, then R(n, 0) can be computed by the formula
$$R(n,0) = \tfrac{1}{2} R(n-1,0) + h\sum_{k=1}^{2^{n-1}} f[a + (2k-1)h] \qquad (n \ge 1) \tag{11}$$
using $h = (b-a)/2^n$. Here, $R(0,0) = \tfrac{1}{2}(b-a)[f(a) + f(b)]$.
This formula allows us to compute a sequence of approximations to a definite integral using the trapezoid rule without reevaluating the integrand at points where it has already been evaluated.
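A minimal sketch of the recursive trapezoid formula follows. The function name, the list-based return value, and the test integrand sin x on [0, π] are illustrative assumptions; only the recurrence itself comes from the text.

```python
import math

def recursive_trapezoid(f, a, b, n_max):
    """Return [R(0,0), R(1,0), ..., R(n_max,0)] for the integral of f over [a, b]."""
    R = [0.5 * (b - a) * (f(a) + f(b))]          # R(0,0): a single trapezoid
    for n in range(1, n_max + 1):
        h = (b - a) / 2 ** n
        # Sum f only at the new (odd-indexed) points a + (2k - 1)h
        new_points = sum(f(a + (2 * k - 1) * h) for k in range(1, 2 ** (n - 1) + 1))
        R.append(0.5 * R[-1] + h * new_points)
    return R

R = recursive_trapezoid(math.sin, 0.0, math.pi, 10)
for n, value in enumerate(R):
    print(n, value)
```

Each new row costs only $2^{n-1}$ additional evaluations of f, because the values at the old points are already contained in R(n − 1, 0).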
Multidimensional Integration
Here, we give a brief account of multidimensional numerical integration. For simplicity, we illustrate with the trapezoid rule for the interval [0, 1], using n + 1 equally spaced points. The step size is therefore h = 1/n. The composite trapezoid rule is then
$$\int_0^1 f(x)\, dx \approx \frac{1}{2n}\left[f(0) + 2\sum_{i=1}^{n-1} f\!\left(\frac{i}{n}\right) + f(1)\right]$$
We write this in the form
$$\int_0^1 f(x)\, dx \approx \sum_{i=0}^{n} C_i\, f\!\left(\frac{i}{n}\right)$$
where
$$C_i = \begin{cases} 1/(2n), & i = 0 \\ 1/n, & 0 < i < n \\ 1/(2n), & i = n \end{cases}$$
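The error is $O(h^2) = O(n^{-2})$ for functions having a continuous second derivative.
If one is faced with a two-dimensional integration over the unit square, then the trapezoid rule can be applied twice:
$$\int_0^1\!\!\int_0^1 f(x,y)\, dx\, dy \approx \int_0^1 \sum_{\alpha_1=0}^{n} C_{\alpha_1} f\!\left(\frac{\alpha_1}{n}, y\right) dy = \sum_{\alpha_1=0}^{n} C_{\alpha_1} \int_0^1 f\!\left(\frac{\alpha_1}{n}, y\right) dy$$
$$\approx \sum_{\alpha_1=0}^{n} C_{\alpha_1} \sum_{\alpha_2=0}^{n} C_{\alpha_2} f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}\right) = \sum_{\alpha_1=0}^{n} \sum_{\alpha_2=0}^{n} C_{\alpha_1} C_{\alpha_2} f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}\right)$$
The error here is again $O(h^2)$, because each of the two applications of the trapezoid rule entails an error of $O(h^2)$.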
In the same way, we can integrate a function of k variables. Suitable notation is the vector x = (x1, x2, . . . , xk)ᵀ for the independent variable. The region now is taken to be the k-dimensional cube $[0,1]^k \equiv [0,1] \times [0,1] \times \cdots \times [0,1]$. Then we obtain a multidimensional numerical integration rule
$$\int_{[0,1]^k} f(\mathbf{x})\, d\mathbf{x} \approx \sum_{\alpha_1=0}^{n} \sum_{\alpha_2=0}^{n} \cdots \sum_{\alpha_k=0}^{n} C_{\alpha_1} C_{\alpha_2} \cdots C_{\alpha_k}\, f\!\left(\frac{\alpha_1}{n}, \frac{\alpha_2}{n}, \ldots, \frac{\alpha_k}{n}\right)$$
The error is still $O(h^2) = O(n^{-2})$, provided that f has continuous partial derivatives $\partial^2 f/\partial x_i^2$.
Besides the error involved, one must consider the effort, or work, required to attain a desired level of accuracy. The work in the one-variable case is O(n). In the two-variable case, it is $O(n^2)$, and it is $O(n^k)$ for k variables. The error, now expressed as a function of the number of nodes $N = n^k$, is
$$O(h^2) = O(n^{-2}) = O\big((n^k)^{-2/k}\big) = O(N^{-2/k})$$
Thus, the quality of the numerical approximation of the integral declines very quickly as the number of variables, k, increases. Expressed in other terms, if a constant order of accuracy is to be retained while the number of variables, k, goes up, the number of nodes must go up like $n^k$. These remarks indicate why the Monte Carlo method for numerical integration becomes more attractive for high-dimensional integration. (This subject is discussed in Chapter 10.)
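The two-dimensional rule can be sketched directly from the weights Cᵢ. The function names and the test integrand x² + y² (whose exact integral over the unit square is 2/3) are illustrative assumptions.

```python
def trapezoid_weights(n):
    # C_i = 1/(2n) at the two endpoints, 1/n at interior nodes
    return [1 / (2 * n) if i in (0, n) else 1 / n for i in range(n + 1)]

def trapezoid_2d(f, n):
    # Product rule on the unit square: sum of C_i * C_j * f(i/n, j/n)
    C = trapezoid_weights(n)
    return sum(C[i] * C[j] * f(i / n, j / n)
               for i in range(n + 1) for j in range(n + 1))

def g(x, y):
    return x * x + y * y                 # exact integral over [0,1]^2 is 2/3

err_10 = abs(trapezoid_2d(g, 10) - 2 / 3)
err_20 = abs(trapezoid_2d(g, 20) - 2 / 3)
print(err_10, err_20, err_10 / err_20)
```

Halving h should reduce the error by a factor of about 4, consistent with O(h²), while the number of function evaluations grows roughly fourfold in two dimensions.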
Remarks
For historical reasons, formulas for approximating definite integrals are called rules such as the trapezoid rule and many other rules, some of which are found in the exercises and
subsequent chapters of this book. A large collection of these quadrature rules can be found in Abramowitz and Stegun [1964], Standard Mathematical Tables, which had its origins in a U.S. Federal Government public works project during the Great Depression in the 1930s.
The word quadrature has several meanings both in mathematics and in astronomy. In the dictionary, the first mathematical meaning is the process of finding a square whose area is equal to the area enclosed by a given curve. The general mathematical meaning is the process of determining the area of a surface, especially one bounded by a curve. We use it primarily to mean the approximation of the area under a curve in numerical integration.
In the exercises of this chapter, we have used various well-known integrals to illustrate numerical integration. Many of these integrals have been thoroughly investigated and tabulated. Examples are elliptic integrals, the sine integral, the Fresnel integral, the logarithmic integral, the error function, and Bessel functions. When one is faced with a daunting integral, the first question to raise is whether the integral has already been studied and perhaps tabulated. The first places to search are websites on the Internet and books such as M. Abramowitz and I. Stegun [1964]. In modern scientific computing, such tables are of limited use because of the ready availability of software packages such as MATLAB, Maple, and Mathematica. Nevertheless, one needs a basic understanding of numerical methods to intelligently use mathematical software.
Summary 5.1
• To estimate $\int_a^b f(x)\,dx$, divide the interval [a, b] into subintervals according to the partition P = {a = x0
Extrapolation of the same type can be used in still more general situations, as is illustrated next (and in the exercises).
If φ is a function with the property
$$\varphi(x) = L + a_1 x^{-1} + a_2 x^{-2} + a_3 x^{-3} + \cdots$$
how can L be estimated using Richardson extrapolation?
Obviously, $L = \lim_{x\to\infty} \varphi(x)$. Thus, L can be estimated by evaluating φ(x) for a succession of ever larger values of x. To use extrapolation, we write
$$\varphi(x) = L + a_1 x^{-1} + a_2 x^{-2} + a_3 x^{-3} + \cdots$$
$$\varphi(2x) = L + 2^{-1} a_1 x^{-1} + 2^{-2} a_2 x^{-2} + 2^{-3} a_3 x^{-3} + \cdots$$
$$2\varphi(2x) = 2L + a_1 x^{-1} + 2^{-1} a_2 x^{-2} + 2^{-2} a_3 x^{-3} + \cdots$$
$$\psi(x) = 2\varphi(2x) - \varphi(x) = L - 2^{-1} a_2 x^{-2} - 3 \cdot 2^{-2} a_3 x^{-3} - \cdots$$
Having computed φ(x) and φ(2x), we can compute a new function ψ(x), which should be a better approximation to L because its error series begins with $x^{-2}$ and is $O(x^{-2})$ as x → ∞. This process can be repeated, as in the Romberg algorithm. ■
Here is a concrete illustration of the preceding example. We want to estimate $\lim_{x\to\infty} \varphi(x)$ from the following table of numerical values:

x       1        2        4        8        16       32       64       128
φ(x)    21.1100  16.4425  14.3394  13.3455  12.8629  12.6253  12.5073  12.4486

A tentative hypothesis is that φ has the form in the preceding example. When we compute the values of the function ψ(x) = 2φ(2x) − φ(x), we get a new table of values:

x       1        2        4        8        16       32       64
ψ(x)    11.7750  12.2363  12.3516  12.3803  12.3877  12.3893  12.3899

It seems reasonable to believe that the value of $\lim_{x\to\infty} \varphi(x)$ is approximately 12.3899. If we do another extrapolation, we should compute θ(x) = [4ψ(2x) − ψ(x)]/3; values for this table are
Thus, the special linear combination
2α h 1 h 1h

x       1        2        4        8        16       32
θ(x)    12.3901  12.3900  12.3899  12.3902  12.3898  12.3901

For the precision of the given data, we conclude that $\lim_{x\to\infty} \varphi(x) = 12.3900$ to within roundoff error.
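The two extrapolation steps can be reproduced from the tabulated data. In this sketch the list names phi, psi, and theta are illustrative; the data values are those of the first table.

```python
# Tabulated values of phi at x = 1, 2, 4, ..., 128
phi = [21.1100, 16.4425, 14.3394, 13.3455, 12.8629, 12.6253, 12.5073, 12.4486]

# First extrapolation: psi(x) = 2*phi(2x) - phi(x) removes the x^(-1) term
psi = [2 * phi[i + 1] - phi[i] for i in range(len(phi) - 1)]

# Second extrapolation: theta(x) = [4*psi(2x) - psi(x)] / 3 removes the x^(-2) term
theta = [(4 * psi[i + 1] - psi[i]) / 3 for i in range(len(psi) - 1)]

print([round(v, 4) for v in psi])
print([round(v, 4) for v in theta])
```

Adjacent list entries correspond to x and 2x, so each extrapolation shortens the table by one entry, just as in the tables above.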
Summary 5.2
• By using the recursive trapezoid rule, we find that the first column of the Romberg algorithm is
$$R(n,0) = \tfrac{1}{2} R(n-1,0) + h\sum_{k=1}^{2^{n-1}} f[a + (2k-1)h]$$
where $h = (b-a)/2^n$ and n ≧ 1. The second and successive columns in the Romberg array are generated by the Richardson extrapolation formula and are
$$R(n,m) = R(n,m-1) + \frac{1}{4^m - 1}\,[R(n,m-1) - R(n-1,m-1)]$$
with n ≧ 1 and m ≧ 1. The error is $O(h^2)$ for the first column, $O(h^4)$ for the second column, $O(h^6)$ for the third column, and so on. Check the ratios
$$\frac{R(n,m) - R(n-1,m)}{R(n+1,m) - R(n,m)} \approx 4^{m+1} \quad \text{as } h \to 0$$
to test whether the algorithm is working.
• If the expression L is approximated by φ(h) and if these entities are related by the error series
$$L = \varphi(h) + a h^{\alpha} + b h^{\beta} + c h^{\gamma} + \cdots$$
then a more accurate approximation is
$$L \approx \varphi\!\left(\frac{h}{2}\right) + \frac{1}{2^{\alpha} - 1}\left[\varphi\!\left(\frac{h}{2}\right) - \varphi(h)\right]$$
with error $O(h^{\beta})$.

Exercises 5.2

a1. What is R(5, 3) if R(5, 2) = 12 and R(4, 2) = −51, in the Romberg algorithm?
2. If R(3, 2) = −54 and R(4, 2) = 72, what is R(4, 3)?
a3. Compute R(5, 2) from R(3, 0) = R(4, 0) = 8 and R(5, 0) = −4.
4. Let $f(x) = 2^x$. Approximate $\int_0^4 f(x)\,dx$ by the trapezoid rule using partition points 0, 2, and 4. Repeat by using partition points 0, 1, 2, 3, and 4. Now apply Romberg extrapolation to obtain a better approximation.
a5. By the Romberg algorithm, approximate $\int_0^2 4\,dx/(1 + x^2)$ by evaluating R(1, 1).
6. Using the Romberg scheme, establish a numerical value for the approximation
$$\int_0^1 e^{-(10x)^2}\, dx \approx R(1,1)$$
Compute the approximation to only three decimal places of accuracy.
a7. We are going to use the Romberg method to estimate $\int_0^1 \sqrt{x}\,\cos x\, dx$. Will the method work? Will it work well? Explain.
a8. By combining R(0, 0) and R(1, 0) for the partition P = {−h < 0 < h}, determine R(1, 1).
9. In calculus, a technique of integration by substitution is developed. For example, if the substitution $x = z^2$ is made in the integral $\int_0^1 (e^x/\sqrt{x})\,dx$, the result is $2\int_0^1 e^{z^2}\,dz$. Verify this and discuss the numerical aspects of this example. Which form is likely to produce a more accurate answer by the Romberg method?
a10. How many evaluations of the function (integrand) are needed if the Romberg array with n rows and n columns is to be constructed?
11. Using Equation (2), fill in the circles in the following diagram with coefficients used in the Romberg algorithm:
R(0, 0)
R(1, 0)  R(1, 1)
R(2, 0)  R(2, 1)  R(2, 2)
R(3, 0)  R(3, 1)  R(3, 2)  R(3, 3)
R(4, 0)  R(4, 1)  R(4, 2)  R(4, 3)  R(4, 4)
12. Derive the quadrature rule for R(1, 1) in terms of the function f evaluated at partition points a, a + h, and a + 2h, where h = (b − a)/2. Do the same for R(n, 1) with $h = (b-a)/2^n$.
a13. (Continuation) Derive the quadrature rule R(2, 2) in terms of the function f evaluated at a, a + h, a + 2h, a + 3h, and b, where h = (b − a)/4.
a14. We want to compute $X = \lim_{n\to\infty} S_n$, and we have already computed the two numbers $u = S_{10}$ and $v = S_{30}$. It is known that $X = S_n + Cn^{-3}$. What is X in terms of u and v?
a15. Suppose that we want to estimate $Z = \lim_{h\to 0} f(h)$ and that we calculate $f(1), f(2^{-1}), f(2^{-2}), f(2^{-3}), \ldots, f(2^{-10})$. Then suppose also that it is known that $Z = f(h) + ah^2 + bh^4 + ch^6$. Show how to obtain an improved estimate of Z from the 11 numbers already computed. Show how Z can be determined exactly from any 4 of the 11 computed numbers.
16. Show how Richardson extrapolation works on a sequence $x_1, x_2, x_3, \ldots$ that converges to L as n → ∞ in such a way that $L - x_n = a_2 n^{-2} + a_3 n^{-3} + a_4 n^{-4} + \cdots$.
a17. Let $x_n$ be a sequence that converges to L as n → ∞. If $L - x_n$ is known to be of the form $a_3 n^{-3} + a_4 n^{-4} + \cdots$ (in which the coefficients are unknown), how can the convergence of the sequence be accelerated by taking combinations of $x_n$ and $x_{n+1}$?
a18. If the Romberg algorithm is operating on a function that possesses continuous derivatives of all orders on the interval of integration, then what is a bound on the quantity $\left|\int_a^b f(x)\,dx - R(n, m)\right|$ in terms of h?
19. Show that the precise form of Equation (5) is
$$\int_a^b f(x)\,dx = R(n, 1) - \sum_{j=1}^{\infty} \frac{4^j - 1}{3 \times 4^j}\, a_{2j+2} h^{2j+2}$$
20. Derive Equation (6), and show that its precise form is
$$\int_a^b f(x)\,dx = R(n, 2) + \sum_{j=2}^{\infty} \left(\frac{4^j - 1}{3 \times 4^j}\right)\left(\frac{4^{j-1} - 1}{15 \times 4^{j-1}}\right) a_{2j+2} h^{2j+2}$$
21. Use the fact that the coefficients in Equation (3) have the form
$$a_k = c_k\,[f^{(k-1)}(b) - f^{(k-1)}(a)]$$
to prove that $\int_a^b f(x)\,dx = R(n, m)$ if f is a polynomial of degree ≦ 2m − 2.
a22. In the Romberg algorithm, R(n, 0) denotes an estimate of $\int_a^b f(x)\,dx$ with subintervals of size $h = (b-a)/2^n$. If it were known that
$$\int_a^b f(x)\,dx = R(n, 0) + a_3 h^3 + a_6 h^6 + \cdots$$
how would we have to modify the Romberg algorithm?
a23. Show that if f′′ is continuous, then the first column in the Romberg array converges to the integral in such a way that the error at the nth step is bounded in magnitude by a constant times $4^{-n}$.
a24. Assuming that the first column of the Romberg array converges to $\int_a^b f(x)\,dx$, show that the second column does also.
25. (Continuation) In the preceding problem, we established the elementary property that if $\lim_{n\to\infty} R(n, 0) = \int_a^b f(x)\,dx$, then $\lim_{n\to\infty} R(n, 1) = \int_a^b f(x)\,dx$. Show that
$$\lim_{n\to\infty} R(n, 2) = \lim_{n\to\infty} R(n, 3) = \cdots = \lim_{n\to\infty} R(n, n) = \int_a^b f(x)\,dx$$
26. a. Using Formula (7), prove that Euler-Maclaurin coefficients can be generated recursively:
$$A_0 = 1, \qquad A_k = -\sum_{j=1}^{k} \frac{A_{k-j}}{(j+1)!}$$
b. Determine $A_k$ for 1 ≦ k ≦ 6.
a27. Evaluate E in the theorem on the Euler-Maclaurin formula for this special case: a = 0, b = 2π, f(x) = 1 + cos 4x, n = 4, and m arbitrary.
Computer Exercises 5.2
1. Compute eight rows and columns in the Romberg array for $\int_{1.3}^{2.19} x^{-1} \sin x\,dx$.
2. Design and carry out an experiment using the Romberg algorithm. Suggestions: For a function that possesses many continuous derivatives on the interval, the method should work well. Try such a function first. If you choose one whose integral you can compute by other means, you will acquire a better understanding of the accuracy in the Romberg algorithm. For example, try definite integrals for each of these:
$$\int (1+x)^{-1}\,dx = \ln(1+x), \qquad \int e^x\,dx = e^x, \qquad \int (1+x^2)^{-1}\,dx = \arctan x$$
3. Test the Romberg algorithm on a bad function, such as $\sqrt{x}$ on [0, 1]. Why is it bad?
4. The transcendental number π is the area of a circle whose radius is 1. Show that
$$8 \int_0^{1/\sqrt{2}} \left(\sqrt{1 - x^2} - x\right) dx = \pi$$
with the help of a diagram, and use this integral to approximate π by the Romberg method.
a5. Apply the Romberg method to estimate $\int_0^{\pi} (2 + \sin 2x)^{-1}\,dx$. Observe the high precision obtained in the first column of the array, that is, by the simple trapezoidal estimates.
a6. Compute $\int_0^{\pi} x \cos 3x\,dx$ by the Romberg algorithm using n = 6. What is the correct answer?
a7. An integral of the form $\int_0^{\infty} f(x)\,dx$ can be transformed into an integral on a finite interval by making a change of variable. Verify, for instance, that the substitution x = −ln y changes the integral $\int_0^{\infty} f(x)\,dx$ into $\int_0^1 y^{-1} f(-\ln y)\,dy$. Use this idea to compute $\int_0^{\infty} [e^{-x}/(1 + x^2)]\,dx$ by means of the Romberg algorithm, using 128 evaluations of the transformed function.
8. By the Romberg algorithm, calculate
$$\int_0^{\infty} e^{-x} \sqrt{1 - \sin x}\,dx$$
9. Calculate
$$\int_0^1 \frac{\sin x}{\sqrt{x}}\,dx$$
by the Romberg algorithm. Hint: Consider making a change of variable.
10. Compute log 2 by using the Romberg algorithm on a suitable integral.
11. The Bessel function of order 0 is defined by the equation
$$J_0(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta)\,d\theta$$
Calculate $J_0(1)$ by applying the Romberg algorithm to the integral.
12. Recode the Romberg procedure so that all the trapezoid rule results are computed first and stored in the first column. Then in a separate procedure,
procedure Extrapolate(n, (r_i))
carry out Richardson extrapolation, and store the results in the lower triangular part of the (r_i) array. What are the advantages and disadvantages of this procedure over the routine given in the text? Test on the two integrals $\int_0^4 dx/(1 + x)$ and $\int_{-1}^1 e^x\,dx$ using only one computer run.
13. (Student Research Project) Study the Clenshaw-Curtis method for numerical quadrature. If possible, read the original paper by Clenshaw and Curtis [1960] and then program the method. If programmed well, it should be superior to the Romberg method in many cases. For further information on it, consult papers by Dixon [1974], Fraser and Wilson [1966], Gentleman [1972], Havie [1969], Kahaner [1971], and O’Hara and Smith [1968].
14. (Student Research Project) Numerical integration is an ideal problem for use on a parallel computer, since the interval of integration can be subdivided into subintervals on each of which the integral can be approximated simultaneously and independently of each other. Investigate how numerical integration can be done in parallel. If you have access to a parallel computer or can simulate a parallel computer on a collection of PCs, write a parallel program to approximate π by using the standard example
$$\int_0^1 (1 + x^2)^{-1}\,dx$$
with a basic rule such as the midpoint rule. Vary the number of processors used and the number of subintervals. You can read about parallel computing in books such as Pacheco [1997], Quinn [1994], and others or at any of the numerous sites on the Internet.
15. Use a mathematical software system with symbolic capabilities to verify the relationship between $A_k$ and the Bernoulli numbers for k = 6.
5.3 Simpson’s Rules and Newton-Cotes Rules

Basic Trapezoid Rule
The basic trapezoid rule for approximating $\int_a^b f(x)\,dx$ is based on an estimation of the area beneath the curve over the interval [a, b] using a trapezoid. The function of integration f(x) is taken to be a straight line between f(a) and f(b). The numerical integration formula is of the form
$$\int_a^b f(x)\,dx \approx A f(a) + B f(b)$$
where the values of A and B are selected so that the resulting integration rule correctly integrates any linear function. It suffices to integrate exactly the two functions 1 and x because a polynomial of degree at most 1 is a linear combination of these two monomials. (This process is also called the method of undetermined coefficients.) To simplify the calculations, let a = 0 and b = 1 and find a formula of the following type:
$$\int_0^1 f(x)\,dx \approx A f(0) + B f(1)$$
Thus, these equations should be fulfilled:
$$f(x) = 1: \quad \int_0^1 dx = 1 = A + B$$
$$f(x) = x: \quad \int_0^1 x\,dx = \tfrac{1}{2} = B$$
The solution is $A = B = \tfrac{1}{2}$, and the integration formula is
$$\int_0^1 f(x)\,dx \approx \tfrac{1}{2}\,[f(0) + f(1)]$$
By a linear mapping y = (b − a)x + a from [0, 1] to [a, b], the basic trapezoid rule for the interval [a, b] is obtained:
$$\int_a^b f(x)\,dx \approx \tfrac{1}{2}(b - a)[f(a) + f(b)] \qquad (1)$$
See Figure 5.4 for a graphical illustration.
FIGURE 5.4 Basic trapezoid rule (graph of y = f(x) and the interpolating line $p_1(x)$ through f(a) and f(b) over [a, b])
Basic Simpson’s Rule
The next obvious generalization is to take two subintervals $\left[a, \frac{a+b}{2}\right]$ and $\left[\frac{a+b}{2}, b\right]$ and to approximate $\int_a^b f(x)\,dx$ by taking the function of integration f(x) to be a quadratic polynomial passing through the three points f(a), $f\!\left(\frac{a+b}{2}\right)$, and f(b). Let us seek a numerical integration rule of the following type:
$$\int_a^b f(x)\,dx \approx A f(a) + B f\!\left(\frac{a+b}{2}\right) + C f(b)$$
The function f is assumed to be continuous on the interval [a, b]. The coefficients A, B, and C are chosen such that the formula above gives correct values for the integral whenever f is a quadratic polynomial. It suffices to integrate correctly the three functions 1, x, and $x^2$ because a polynomial of degree at most 2 is a linear combination of those 3 monomials.
To simplify the calculations, let a = −1 and b = 1 and consider the formula
$$\int_{-1}^{1} f(x)\,dx \approx A f(-1) + B f(0) + C f(1)$$
Thus, these equations should be fulfilled:
$$f(x) = 1: \quad \int_{-1}^{1} dx = 2 = A + B + C$$
$$f(x) = x: \quad \int_{-1}^{1} x\,dx = 0 = -A + C$$
$$f(x) = x^2: \quad \int_{-1}^{1} x^2\,dx = \tfrac{2}{3} = A + C$$
The solution is $A = \tfrac{1}{3}$, $C = \tfrac{1}{3}$, and $B = \tfrac{4}{3}$. The resulting rule is
$$\int_{-1}^{1} f(x)\,dx \approx \tfrac{1}{3}\,[f(-1) + 4f(0) + f(1)]$$
Using a linear mapping $y = \tfrac{1}{2}(b - a)x + \tfrac{1}{2}(a + b)$ from [−1, 1] to [a, b], we obtain the basic Simpson’s rule over the interval [a, b]:
$$\int_a^b f(x)\,dx \approx \tfrac{1}{6}(b - a)\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right] \qquad (2)$$
See Figure 5.5 for an illustration.
FIGURE 5.5 Basic Simpson’s rule (graph of y = f(x) and the interpolating quadratic $p_2(x)$ through f(a), $f\!\left(\frac{a+b}{2}\right)$, and f(b))
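The method of undetermined coefficients used for the trapezoid and Simpson weights amounts to solving a small linear system: the rule must reproduce the exact integral of each monomial 1, x, x², and so on. The following Python sketch is one way to do that in exact rational arithmetic; the function name `rule_weights` and its interface are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def rule_weights(nodes, a, b):
    """Weights w_i such that sum(w_i * x_i**k) equals the integral of x**k
    over [a, b] for k = 0, ..., len(nodes) - 1 (hypothetical helper)."""
    m = len(nodes)
    # Augmented matrix: row k holds x_0^k ... x_{m-1}^k | exact moment of x^k
    rows = [[Fraction(x) ** k for x in nodes]
            + [(Fraction(b) ** (k + 1) - Fraction(a) ** (k + 1)) / (k + 1)]
            for k in range(m)]
    # Gauss-Jordan elimination with exact fractions
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(m):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

print(rule_weights([0, 1], 0, 1))        # trapezoid weights on [0, 1]
print(rule_weights([-1, 0, 1], -1, 1))   # Simpson weights on [-1, 1]
```

With the nodes and intervals used in the text, this should recover A = B = 1/2 for the trapezoid rule and A = C = 1/3, B = 4/3 for Simpson's rule.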
Figure 5.6 shows graphically the difference between the basic trapezoid rule (1) and the basic Simpson’s rule (2).
FIGURE 5.6 Example of trapezoid rule vs. Simpson’s rule (graph of y = f(x) with the trapezoid interpolant $p_1(x)$ and the Simpson interpolant $p_2(x)$ over the points a, (a + b)/2, and b)

EXAMPLE 1
Find approximate values for the integral
$$\int_0^1 e^{-x^2}\,dx$$
using the basic trapezoid rule and the basic Simpson’s rule. Carry five significant digits.

Solution
Let a = 0 and b = 1. For the basic trapezoid rule (1), we obtain
$$\int_0^1 e^{-x^2}\,dx \approx \tfrac{1}{2}\left[e^0 + e^{-1}\right] \approx 0.5[1 + 0.36788] = 0.68394$$
which is correct to only one significant decimal place (rounded). For the basic Simpson’s rule (2), we find
$$\int_0^1 e^{-x^2}\,dx \approx \tfrac{1}{6}\left[e^0 + 4e^{-0.25} + e^{-1}\right] \approx 0.16667[1 + 4(0.77880) + 0.36788] = 0.7472$$
which is correct to three significant decimal places (rounded). Recall that $\int_0^1 e^{-x^2}\,dx = \tfrac{1}{2}\sqrt{\pi}\,\mathrm{erf}(1) \approx 0.74682$. ■

Basic Simpson’s Rule: Uniform Spacing
A numerical integration rule over two equal subintervals with partition points a, a + h, and a + 2h = b is the widely used basic Simpson’s rule:
$$\int_a^{a+2h} f(x)\,dx \approx \frac{h}{3}\,[f(a) + 4f(a + h) + f(a + 2h)] \qquad (3)$$
Simpson’s rule computes exactly the integral of an interpolating quadratic polynomial over an interval of length 2h using three points; namely, the two endpoints and the middle point. It can be derived by integrating over the interval [0, 2h] the Lagrange quadratic polynomial $p_2$ through the points (0, f(0)), (h, f(h)), and (2h, f(2h)):
$$\int_0^{2h} f(x)\,dx \approx \int_0^{2h} p_2(x)\,dx = \frac{h}{3}\,[f(0) + 4f(h) + f(2h)]$$
where
$$p_2(x) = \frac{1}{2h^2}(x - h)(x - 2h) f(0) - \frac{1}{h^2}\,x(x - 2h) f(h) + \frac{1}{2h^2}\,x(x - h) f(2h)$$
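The numbers in Example 1 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and `math.erf` supplies the reference value.

```python
import math

def f(x):
    return math.exp(-x * x)

a, b = 0.0, 1.0
trapezoid = 0.5 * (b - a) * (f(a) + f(b))                          # rule (1)
simpson = (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))   # rule (2)
reference = 0.5 * math.sqrt(math.pi) * math.erf(1.0)

print(f"trapezoid = {trapezoid:.5f}")
print(f"simpson   = {simpson:.5f}")
print(f"reference = {reference:.5f}")
```

In full double precision the Simpson value rounds to 0.74718; the 0.7472 in the example comes from carrying rounded intermediate values.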
The error term in Simpson’s rule can be established by using the Taylor series from Section 1.2:
$$f(a + h) = f + h f' + \frac{1}{2!}h^2 f'' + \frac{1}{3!}h^3 f''' + \frac{1}{4!}h^4 f^{(4)} + \cdots$$
where the functions f, f′, f′′, . . . on the right-hand side are evaluated at a. Now replacing h by 2h, we have
$$f(a + 2h) = f + 2h f' + 2h^2 f'' + \frac{4}{3}h^3 f''' + \frac{2^4}{4!}h^4 f^{(4)} + \cdots$$
Combining these two series, we obtain
$$f(a) + 4f(a + h) + f(a + 2h) = 6f + 6h f' + 4h^2 f'' + 2h^3 f''' + \frac{20}{4!}h^4 f^{(4)} + \cdots$$
and, thereby, we have
$$\frac{h}{3}\,[f(a) + 4f(a + h) + f(a + 2h)] = 2h f + 2h^2 f' + \frac{4}{3}h^3 f'' + \frac{2}{3}h^4 f''' + \frac{20}{3 \cdot 4!}h^5 f^{(4)} + \cdots \qquad (4)$$
Hence, we have a series for the right-hand side of Equation (3). Now let’s find one for the left-hand side. The Taylor series for F(a + 2h) is
$$F(a + 2h) = F(a) + 2h F'(a) + 2h^2 F''(a) + \frac{4}{3}h^3 F'''(a) + \frac{2}{3}h^4 F^{(4)}(a) + \frac{2^5}{5!}h^5 F^{(5)}(a) + \cdots$$
Let
$$F(x) = \int_a^x f(t)\,dt$$
By the Fundamental Theorem of Calculus, F′ = f. We observe that F(a) = 0 and F(a + 2h) is the integral on the left-hand side of Equation (3). Since F′′ = f′, F′′′ = f′′, and so on, we have
$$\int_a^{a+2h} f(x)\,dx = 2h f + 2h^2 f' + \frac{4}{3}h^3 f'' + \frac{2}{3}h^4 f''' + \frac{2^5}{5 \cdot 4!}h^5 f^{(4)} + \cdots \qquad (5)$$
Subtracting Equation (4) from Equation (5), we obtain
$$\int_a^{a+2h} f(x)\,dx = \frac{h}{3}\,[f(a) + 4f(a + h) + f(a + 2h)] - \frac{h^5}{90} f^{(4)} - \cdots$$
A more detailed analysis shows that the error term for the basic Simpson’s rule (2) is $-(h^5/90) f^{(4)}(\xi) = O(h^5)$ as h → 0, for some ξ between a and a + 2h. By letting b = a + 2h, the basic Simpson’s rule over the interval [a, b] is
$$\int_a^b f(x)\,dx \approx \frac{(b - a)}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right]$$
with error term
$$-\frac{1}{90}\left(\frac{b - a}{2}\right)^5 f^{(4)}(\xi) \qquad (6)$$
for some ξ in (a, b).
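The leading error coefficient can be sanity-checked on a function whose fourth derivative is constant. For f(x) = x⁴ we have f⁽⁴⁾ = 24 everywhere, so the error term −(h⁵/90) f⁽⁴⁾(ξ) should hold with equality. The sketch below verifies this in exact rational arithmetic; the starting point a = 1/3 and step h = 1/10 are arbitrary choices made for illustration.

```python
from fractions import Fraction

def simpson_basic(f, a, h):
    """Basic Simpson's rule (3) on [a, a + 2h]."""
    return h / 3 * (f(a) + 4 * f(a + h) + f(a + 2 * h))

a, h = Fraction(1, 3), Fraction(1, 10)
f = lambda x: x ** 4                          # fourth derivative is 24 everywhere
exact = ((a + 2 * h) ** 5 - a ** 5) / 5       # integral of x^4 over [a, a + 2h]
error = exact - simpson_basic(f, a, h)
predicted = -(h ** 5) / 90 * 24               # -(h^5 / 90) * f''''(xi)
print(error, predicted)
```

Because Simpson's rule integrates cubics exactly, only the x⁴ term contributes, which is why the two printed fractions agree for any choice of a.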
Composite Simpson’s Rule
Suppose that the interval [a, b] is subdivided into an even number of subintervals, say n, each of width h = (b − a)/n. Then the partition points are $x_i = a + ih$ for 0 ≦ i ≦ n, where n is divisible by 2. Now from basic calculus, we have
$$\int_a^b f(x)\,dx = \sum_{i=1}^{n/2} \int_{a+2(i-1)h}^{a+2ih} f(x)\,dx \qquad (7)$$
Using the basic Simpson’s rule, we have, for the right-hand side,
$$\approx \sum_{i=1}^{n/2} \frac{h}{3}\,\{f(a + 2(i-1)h) + 4f(a + (2i-1)h) + f(a + 2ih)\}$$
$$= \frac{h}{3}\left\{f(a) + \sum_{i=1}^{(n/2)-1} f(a + 2ih) + 4\sum_{i=1}^{n/2} f(a + (2i-1)h) + \sum_{i=1}^{(n/2)-1} f(a + 2ih) + f(b)\right\}$$
Thus, we obtain the composite Simpson’s rule
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left\{[f(a) + f(b)] + 4\sum_{i=1}^{n/2} f[a + (2i-1)h] + 2\sum_{i=1}^{(n-2)/2} f(a + 2ih)\right\}$$
where h = (b − a)/n. The error term is
$$-\frac{1}{180}(b - a)h^4 f^{(4)}(\xi) \qquad (8)$$
Caution: Many formulas for numerical integration have error estimates that involve derivatives of the function being integrated. An important point that is frequently overlooked is that such error estimates depend on the function having derivatives. So if a piecewise function is being integrated, the numerical integration should be broken up over the region to coincide with the regions of smoothness of the function. Another important point is that no polynomial ever becomes infinite in the finite plane, so any integration technique that uses polynomials to approximate the integrand may fail to give good results without extra work at integrable singularities.

An Adaptive Simpson’s Scheme
Now we develop an adaptive scheme based on Simpson’s rule for obtaining a numerical approximation to the integral
$$\int_a^b f(x)\,dx$$
In this adaptive algorithm, the partitioning of the interval [a, b] is not selected beforehand but is automatically determined. The partition is generated adaptively so that more and smaller subintervals are used in some parts of the interval and fewer and larger subintervals are used in other parts.
In the adaptive process, we divide the interval [a, b] into two subintervals and then decide whether each of them is to be divided into more subintervals. This procedure is continued until some specified accuracy is obtained throughout the entire interval [a, b].
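For comparison with the adaptive scheme, it may help to see the fixed-partition composite Simpson's rule in executable form. The Python sketch below follows the composite formula with n even subintervals of width h = (b − a)/n; the function name and the test integrand sin x on [0, π] (exact value 2) are illustrative choices, not taken from the text.

```python
import math

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule with n (even) subintervals of width h = (b - a)/n."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    # Odd-indexed interior points carry weight 4, even-indexed ones weight 2
    odd = sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    even = sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return h / 3.0 * (f(a) + f(b) + 4.0 * odd + 2.0 * even)

for n in (2, 4, 8, 16):
    approx = composite_simpson(math.sin, 0.0, math.pi, n)
    print(n, approx, abs(approx - 2.0))
```

Each doubling of n should shrink the error by roughly a factor of 16, in line with the h⁴ factor in error term (8). A uniform partition spends the same effort everywhere, which is the inefficiency the adaptive scheme is meant to avoid.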
Since the integrand f may vary in its behavior on the interval [a, b], we do not expect the final partitioning to be uniform, but to vary in the density of the partition points.
It is necessary to develop the test for deciding whether subintervals should continue to be divided. One application of Simpson’s rule over the interval [a, b] can be written as
$$I \equiv \int_a^b f(x)\,dx = S(a, b) + E(a, b)$$
where
$$S(a, b) = \frac{(b - a)}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right]$$
and
$$E(a, b) = -\frac{1}{90}\left(\frac{b - a}{2}\right)^5 f^{(4)}(a) + \cdots$$
Letting h = b − a, we have
$$I = S^{(1)} + E^{(1)} \qquad (9)$$
where
$$S^{(1)} = S(a, b)$$
and
$$E^{(1)} = -\frac{1}{90}\left(\frac{h}{2}\right)^5 f^{(4)}(a) + \cdots = -\frac{1}{90}\left(\frac{h}{2}\right)^5 C$$
Here we assume that $f^{(4)}$ remains a constant value C throughout the interval [a, b]. Now two applications of Simpson’s rule over the interval [a, b] give
$$I = S^{(2)} + E^{(2)} \qquad (10)$$
where
$$S^{(2)} = S(a, c) + S(c, b)$$
where c = (a + b)/2, as in Figure 5.7, and
$$E^{(2)} = -\frac{1}{90}\left(\frac{h/2}{2}\right)^5 f^{(4)}(a) + \cdots - \frac{1}{90}\left(\frac{h/2}{2}\right)^5 f^{(4)}(c) + \cdots$$
$$= -\frac{1}{90}\left(\frac{h/2}{2}\right)^5 \left[f^{(4)}(a) + f^{(4)}(c)\right] + \cdots$$
$$= -\frac{1}{90}\,\frac{1}{2^5}\left(\frac{h}{2}\right)^5 (2C) = \frac{1}{16}\left[-\frac{1}{90}\left(\frac{h}{2}\right)^5 C\right]$$
FIGURE 5.7 Simpson’s rule (one Simpson’s rule over [a, b] with width h; two Simpson’s rules over [a, c] and [c, b], each of width h/2, with c = (a + b)/2)
Again, we use the assumption that $f^{(4)}$ remains a constant value C throughout the interval [a, b]. We find that
$$16E^{(2)} = E^{(1)}$$
Subtracting Equation (10) from (9), we have
$$S^{(2)} - S^{(1)} = E^{(1)} - E^{(2)} = 15E^{(2)}$$
From this equation and Equation (10), we have
$$I = S^{(2)} + E^{(2)} = S^{(2)} + \frac{1}{15}\left[S^{(2)} - S^{(1)}\right]$$
We use the inequality
$$\frac{1}{15}\left|S^{(2)} - S^{(1)}\right| < \varepsilon \qquad (11)$$
to guide the adaptive process so that $|I - S^{(2)}| < \varepsilon$.
If Test (11) is not satisfied, the interval [a, b] is split into two subintervals, [a, c] and [c, b], where c is the midpoint c = (a + b)/2. On each of these subintervals, we again use Test (11) with ε replaced by ε/2 so that the resulting tolerance is ε over the entire interval [a, b]. A recursive procedure handles this quite nicely.
To see why, we take ε/2 on each left and right subinterval
$$I = \int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx = I_{\text{left}} + I_{\text{right}}$$
If S is the sum of approximations $S^{(2)}_{\text{left}}$ over [a, c] and $S^{(2)}_{\text{right}}$ over [c, b], we have
$$|I - S| = \left|I_{\text{left}} + I_{\text{right}} - S^{(2)}_{\text{left}} - S^{(2)}_{\text{right}}\right|$$
$$\leq \left|I_{\text{left}} - S^{(2)}_{\text{left}}\right| + \left|I_{\text{right}} - S^{(2)}_{\text{right}}\right|$$
$$= \frac{1}{15}\left|S^{(2)}_{\text{left}} - S^{(1)}_{\text{left}}\right| + \frac{1}{15}\left|S^{(2)}_{\text{right}} - S^{(1)}_{\text{right}}\right|$$
using Inequality (11). Hence, if we require
$$\frac{1}{15}\left|S^{(2)}_{\text{left}} - S^{(1)}_{\text{left}}\right| < \frac{\varepsilon}{2} \quad \text{and} \quad \frac{1}{15}\left|S^{(2)}_{\text{right}} - S^{(1)}_{\text{right}}\right| < \frac{\varepsilon}{2}$$
then |I − S| < ε over the entire interval [a, b].
We now describe an adaptive Simpson recursive procedure. The interval [a, b] is partitioned into four subintervals of width (b − a)/4. Two Simpson approximations are computed by using two double-width subintervals and four single-width subintervals; that is,
$$\textit{one\_simpson} \leftarrow \frac{h}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right]$$
$$\textit{two\_simpson} \leftarrow \frac{h}{12}\left[f(a) + 4f\!\left(\frac{a+c}{2}\right) + 2f(c) + 4f\!\left(\frac{c+b}{2}\right) + f(b)\right]$$
where h = b − a and c = (a + b)/2.
According to Inequality (11), if one_simpson and two_simpson agree to within 15ε, then the interval [a, b] does not need to be subdivided further to obtain an accurate approximation
to the integral $\int_a^b f(x)\,dx$. In this case, the value of [16(two_simpson) − (one_simpson)]/15 is used as the approximate value of the integral over the interval [a, b]. If the desired accuracy for the integral has not been obtained, then the interval [a, b] is divided in half. The subintervals [a, c] and [c, b], where c = (a + b)/2, are used in a recursive call to the adaptive Simpson procedure with tolerance ε/2 on each. This procedure terminates when all subintervals satisfy Inequality (11). Alternatively, a maximum number of allowable levels of subdividing intervals is used to terminate the procedure prematurely. The recursive procedure provides an elegant and simple way to keep track of which subintervals satisfy the tolerance test and which need to be divided further.

Example Using Adaptive Simpson Procedure
The main program for calling the adaptive Simpson procedure can best be presented in terms of a concrete example. An approximate value for the integral
$$\int_0^{5\pi/4} \frac{\cos(2x)}{e^x}\,dx \qquad (12)$$
is desired with accuracy $\tfrac{1}{2} \times 10^{-3}$.
FIGURE 5.8 Adaptive integration of $\int_0^{5\pi/4} \cos(2x)/e^x\,dx$ (graph of f(x) = cos(2x)/e^x for 0 ≦ x ≦ 4)
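The recursive procedure described above, applied to integral (12), can be sketched in Python as follows. This is an illustrative translation under stated assumptions rather than the book's driver program: the default `level_max=20` is an arbitrary choice, the "maximum level reached" message is omitted, and the closed-form value used for comparison comes from the antiderivative e^{−x}(2 sin 2x − cos 2x)/5.

```python
import math

def adaptive_simpson(f, a, b, eps, level=0, level_max=20):
    """Recursive adaptive Simpson estimate of the integral of f over [a, b]."""
    level += 1
    h = b - a
    c = 0.5 * (a + b)
    one_simpson = h * (f(a) + 4.0 * f(c) + f(b)) / 6.0
    d = 0.5 * (a + c)
    e = 0.5 * (c + b)
    two_simpson = h * (f(a) + 4.0 * f(d) + 2.0 * f(c) + 4.0 * f(e) + f(b)) / 12.0
    if level >= level_max:
        return two_simpson            # depth limit reached; accept current estimate
    if abs(two_simpson - one_simpson) < 15.0 * eps:
        # Test (11) passed: add the error estimate (S2 - S1)/15
        return two_simpson + (two_simpson - one_simpson) / 15.0
    # Otherwise split at the midpoint and give each half tolerance eps/2
    return (adaptive_simpson(f, a, c, eps / 2.0, level, level_max)
            + adaptive_simpson(f, c, b, eps / 2.0, level, level_max))

def integrand(x):
    return math.cos(2.0 * x) / math.exp(x)

approx = adaptive_simpson(integrand, 0.0, 1.25 * math.pi, 0.5e-3)
exact = 0.2 + 0.4 * math.exp(-1.25 * math.pi)   # closed form from the antiderivative
print(approx, exact)
```

Note that this sketch re-evaluates f at points shared between levels; a production version would pass the already computed function values down the recursion.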
The graph of the integrand function is shown in Figure 5.8. This function has many turns and twists, so accurately determining the area under the curve may be difficult. A function procedure f is written for the integrand. Its name is the first argument in the procedure, and interface statements may be needed. In the pseudocode below, the other arguments are the lower and upper limits a and b of the integral, the desired accuracy ε, the level of the current subinterval, and the maximum level depth.

recursive real function Simpson(f, a, b, ε, level, level_max) result(simpson_result)
integer level, level_max; real a, b, c, d, e, h
external function f
level ← level + 1
h ← b − a
c ← (a + b)/2
one_simpson ← h[f(a) + 4f(c) + f(b)]/6
d ← (a + c)/2
e ← (c + b)/2
two_simpson ← h[f(a) + 4f(d) + 2f(c) + 4f(e) + f(b)]/12
if level ≧ level_max then
    simpson_result ← two_simpson
    output “maximum level reached”
else
    if |two_simpson − one_simpson| < 15ε then
        simpson_result ← two_simpson + (two_simpson − one_simpson)/15
    else
        left_simpson ← Simpson(f, a, c, ε/2, level, level_max)
        right_simpson ← Simpson(f, c, b, ε/2, level, level_max)
        simpson_result ← left_simpson + right_simpson
    end if
end if
end function Simpson

By writing a driver computer program for this pseudocode and executing it on a computer, we obtain an approximate value of 0.208 for the integral. The adaptive Simpson procedure uses a different number of panels for different parts of the curve, as shown in Figure 5.8.

Newton-Cotes Rules
Newton-Cotes quadrature formulas for approximating $\int_a^b f(x)\,dx$ are obtained by approximating the function of integration f(x) with interpolating polynomials. The rules are closed when they involve function values at the ends of the interval of integration. Otherwise, they are said to be open.
In the following, $a = x_0$, $b = x_n$, $x_i = x_0 + ih$, for i = 0, 1, ..., n, where h = (b − a)/n, $f_i = f(x_i)$, and $a = x_0 < \xi$
Trapezoid Rule:
$$\int_{x_0}^{x_1} f(x)\,dx = \tfrac{1}{2}h[f_0 + f_1] - \tfrac{1}{12}h^3 f''(\xi)$$
Simpson’s 1/3 Rule:
Simpson’s 3/8 Rule:
Continuity of a function f at a point s can be defined by the condition
$$\lim_{x \to s^+} f(x) = f(s) = \lim_{x \to s^-} f(x)$$
Here, $\lim_{x \to s^+}$ means that the limit is taken over x values that converge to s from above s; that is, (x − s) is positive for all x values. Similarly, $\lim_{x \to s^-}$ means that the x values converge to s from below.
degree 1), whose pieces are linear polynomials joined together to achieve continuity, as in Figure 6.2. The points $t_0, t_1, \ldots, t_n$ at which the function changes its character are termed knots in the theory of splines. Thus, the spline function shown in Figure 6.2 has eight knots.
Such a function appears somewhat complicated when defined in explicit terms. We are forced to write
$$S(x) = \begin{cases} S_0(x), & x \in [t_0, t_1] \\ S_1(x), & x \in [t_1, t_2] \\ \quad\vdots & \quad\vdots \\ S_{n-1}(x), & x \in [t_{n-1}, t_n] \end{cases} \qquad (1)$$
where
$$S_i(x) = a_i x + b_i \qquad (2)$$
because each piece of S(x) is a linear polynomial. Such a function S(x) is piecewise linear. If the knots $t_0, t_1, \ldots, t_n$ were given and if the coefficients $a_0, b_0, a_1, b_1, \ldots, a_{n-1}, b_{n-1}$ were all known, then the evaluation of S(x) at a specific x would proceed by first determining the interval that contains x and then using the appropriate linear function for that interval.
If the function S defined by Equation (1) is continuous, we call it a first-degree spline. It is characterized by the following three properties.
Spline of Degree 1
A function S is called a spline of degree 1 if:
1. The domain of S is an interval [a, b].
2. S is continuous on [a, b].
3. There is a partitioning of the interval a = t0
real function Spline1(n, (t_i), (y_i), x)
integer i, n; real x; real array (t_i)_{0:n}, (y_i)_{0:n}
for i = n − 1 to 0 step −1
    if x − t_i ≧ 0 then exit loop
end for
Spline1 ← y_i + (x − t_i)[(y_{i+1} − y_i)/(t_{i+1} − t_i)]
end function Spline1
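The Spline1 pseudocode translates almost line for line into Python. The sketch below is one possible rendering; the knot and value lists are made-up sample data, and the choice to fall back to the leftmost piece when x lies below every knot is an assumption about how the loop is meant to end.

```python
def spline1(t, y, x):
    """Evaluate the first-degree spline with knots t[0] < ... < t[n] and values y at x."""
    n = len(t) - 1
    i = n - 1
    # Scan from the right for the first knot with t[i] <= x; stop at i = 0 if none is found
    while i > 0 and x - t[i] < 0:
        i -= 1
    return y[i] + (x - t[i]) * (y[i + 1] - y[i]) / (t[i + 1] - t[i])

knots = [0.0, 1.0, 3.0, 4.0]      # hypothetical sample knots
values = [1.0, 3.0, 2.0, 2.0]     # hypothetical sample ordinates
print([spline1(knots, values, x) for x in (0.5, 1.0, 2.0, 3.5)])
```

For points outside [t[0], t[n]] this evaluates the first or last linear piece, so the spline is extended by its end segments.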
First-Degree Polynomial Accuracy Theorem
If p is the first-degree polynomial that interpolates a function f at the endpoints of an interval [a, b], then with h = b − a, we have
|f(x)−p(x)|≦ω(f;h) (a≦x≦b)
Then we have
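$$|f(x) - p(x)| \leq \frac{x - a}{b - a}\,|f(x) - f(b)| + \frac{b - x}{b - a}\,|f(x) - f(a)|$$
$$\leq \frac{x - a}{b - a}\,\omega(f; h) + \frac{b - x}{b - a}\,\omega(f; h)$$
$$= \left[\frac{x - a}{b - a} + \frac{b - x}{b - a}\right]\omega(f; h) = \omega(f; h)$$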
First-Degree Spline Accuracy Theorem
Let p be a first-degree spline having knots a = x0 < x1 <···
$$u_i = 2(h_i + h_{i-1}) - \frac{h_{i-1}^2}{u_{i-1}} > 2(h_i + h_{i-1}) - h_{i-1} > h_i$$
Then by induction, ui > 0 for i = 1,2,…,n − 1.
Equation (5) is not the best computational form for evaluating the cubic polynomial $S_i(x)$. We would prefer to have it in the form
$$S_i(x) = A_i + B_i(x - t_i) + C_i(x - t_i)^2 + D_i(x - t_i)^3 \qquad (11)$$
because nested multiplication can then be utilized.
Notice that Equation (11) is the Taylor expansion of $S_i$ about the point $t_i$. Hence, we obtain
$$A_i = S_i(t_i), \quad B_i = S_i'(t_i), \quad C_i = \tfrac{1}{2}S_i''(t_i), \quad D_i = \tfrac{1}{6}S_i'''(t_i)$$
Therefore, we have $A_i = y_i$ and $C_i = z_i/2$. The coefficient of $x^3$ in Equation (11) is $D_i$, whereas the coefficient of $x^3$ in Equation (5) is $(z_{i+1} - z_i)/6h_i$. Therefore, we obtain
$$D_i = \frac{1}{6h_i}(z_{i+1} - z_i)$$
Finally, Equation (6) provides the value of $S_i'(t_i)$, which is
$$B_i = -\frac{h_i}{6}z_{i+1} - \frac{h_i}{3}z_i + \frac{1}{h_i}(y_{i+1} - y_i)$$
Thus, the nested form of $S_i(x)$ is
$$S_i(x) = y_i + (x - t_i)\left\{B_i + (x - t_i)\left[\frac{z_i}{2} + \frac{1}{6h_i}(x - t_i)(z_{i+1} - z_i)\right]\right\} \qquad (12)$$
Pseudocode for Natural Cubic Splines
We now write routines for determining a natural cubic spline based on a table of values and for evaluating this function at a given value. First, we use Algorithm 1 for directly solving the tridiagonal System (10). This procedure, called Spline3 Coef , takes n + 1 table values (ti , yi ) in arrays (ti ) and (yi ) and computes the zi ’s, storing them in array (zi ). Intermediate (working) arrays (hi ), (bi ), (ui ), and (vi ) are needed.
Now a procedure called Spline3 Eval is written for evaluating Equation (12), the natural cubic spline function S(x), for x a given value. The procedure Spline3 Eval first determines the interval [ti , ti+1] that contains x and then evaluates Si (x) using the nested form of this cubic polynomial:
6.2 Natural Cubic Splines 271
procedure Spline3 Coef (n, (ti ), (yi ), (zi ))
integer i, n; real array (ti )0:n , (yi )0:n , (zi )0:n
allocate real array (hi )0:n−1, (bi )0:n−1, (ui )1:n−1, (vi )1:n−1 fori =0ton−1
hi ←ti+1−ti
bi ←(yi+1−yi)/hi end for
u1 ← 2(h0 + h1) v1 ← 6(b1 − b0) fori =2ton−1
u←2(h+h )−h2 /u
i i i−1 i−1 i−1
vi ← 6(bi − bi−1) − hi−1vi−1/ui−1 end for
zn ←0
fori =n−1to1 step−1
zi ←(vi −hizi+1)/ui end for
z0 ← 0
deallocate array (hi ), (bi ), (ui ), (vi ) end procedure Spline3 Coef
Spline3 Coef Pseudocode
real function Spline3_Eval(n, (t_i), (y_i), (z_i), x)
integer i; real h, tmp
real array (t_i)_{0:n}, (y_i)_{0:n}, (z_i)_{0:n}
for i = n − 1 to 0 step −1
    if x − t_i ≧ 0 then exit loop
end for
h ← t_{i+1} − t_i
tmp ← (z_i/2) + (x − t_i)(z_{i+1} − z_i)/(6h)
tmp ← −(h/6)(z_{i+1} + 2z_i) + (y_{i+1} − y_i)/h + (x − t_i)(tmp)
Spline3_Eval ← y_i + (x − t_i)(tmp)
end function Spline3_Eval
Spline3_Eval Pseudocode
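The two routines translate almost line for line into a conventional language. The following is a minimal Python sketch, not the authors' code: the names `spline3_coef` and `spline3_eval`, the use of Python lists, and the special handling of a table with only two points are assumptions made for illustration.

```python
from typing import List, Sequence


def spline3_coef(t: Sequence[float], y: Sequence[float]) -> List[float]:
    """Return z[0..n], the values S''(t_i) of the natural cubic spline."""
    n = len(t) - 1
    if n < 1 or len(y) != n + 1:
        raise ValueError("need at least two knots and one y value per knot")
    h = [t[i + 1] - t[i] for i in range(n)]
    b = [(y[i + 1] - y[i]) / h[i] for i in range(n)]
    z = [0.0] * (n + 1)              # natural end conditions: z[0] = z[n] = 0
    if n == 1:                       # assumption: two points give a straight line
        return z
    u = [0.0] * n                    # only u[1..n-1] and v[1..n-1] are used
    v = [0.0] * n
    u[1] = 2.0 * (h[0] + h[1])
    v[1] = 6.0 * (b[1] - b[0])
    for i in range(2, n):            # forward elimination
        u[i] = 2.0 * (h[i] + h[i - 1]) - h[i - 1] ** 2 / u[i - 1]
        v[i] = 6.0 * (b[i] - b[i - 1]) - h[i - 1] * v[i - 1] / u[i - 1]
    for i in range(n - 1, 0, -1):    # back substitution
        z[i] = (v[i] - h[i] * z[i + 1]) / u[i]
    return z


def spline3_eval(t: Sequence[float], y: Sequence[float],
                 z: Sequence[float], x: float) -> float:
    """Evaluate the spline at x using the nested form of the cubic piece."""
    n = len(t) - 1
    i = 0                            # if x < t[0], the first piece is used
    for k in range(n - 1, -1, -1):   # search from the right, as in the pseudocode
        if x - t[k] >= 0:
            i = k
            break
    h = t[i + 1] - t[i]
    tmp = z[i] / 2.0 + (x - t[i]) * (z[i + 1] - z[i]) / (6.0 * h)
    tmp = -(h / 6.0) * (z[i + 1] + 2.0 * z[i]) + (y[i + 1] - y[i]) / h + (x - t[i]) * tmp
    return y[i] + (x - t[i]) * tmp
```

A typical use calls `spline3_coef` once and then `spline3_eval` for each plotting abscissa, passing the same `t`, `y`, and `z` unchanged each time.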
The function Spline3_Eval can be used repeatedly with different values of x after one call to procedure Spline3_Coef. For example, this would be the procedure when plotting a natural cubic spline curve. Since procedure Spline3_Coef stores the solution of the tridiagonal system corresponding to a particular spline function in the array $(z_i)$, the arguments n, $(t_i)$, $(y_i)$, and $(z_i)$ must not be altered between repeated uses of Spline3_Eval.
Using Pseudocode for Interpolating and Curve Fitting
To illustrate the use of the natural cubic spline routines Spline3_Coef and Spline3_Eval, we rework Example 3 from Section 4.1 (p. 166).
EXAMPLE 3
Write pseudocode for a program that determines the natural cubic spline interpolant for sin x at 10 equidistant knots in the interval [0, 1.6875]. Over the same interval, subdivide each subinterval into four equally spaced parts, and find the point where the value of |sin x − S(x)| is largest.
Solution
Here is a suitable pseudocode main program, which calls procedures Spline3_Coef and Spline3_Eval:
program Test_Spline3
integer i, j; real e, h, temp, x
integer n ← 9
real array (t_i)_{0:n}, (y_i)_{0:n}, (z_i)_{0:n}
real a ← 0, b ← 1.6875
h ← (b − a)/n
for i = 0 to n
    t_i ← a + ih
    y_i ← sin(t_i)
end for
call Spline3_Coef(n, (t_i), (y_i), (z_i))
temp ← 0
for j = 0 to 4n
    x ← a + jh/4
    e ← |sin(x) − Spline3_Eval(n, (t_i), (y_i), (z_i), x)|
    if e > temp then temp ← e
    output j, x, e
end for
end Test_Spline3
Test_Spline3 Pseudocode
The output is j = 19, x = 0.890625, and $d = 0.930 \times 10^{-5}$. ■
Mathematical Software
We can use mathematical software to plot the cubic spline curve for this data. Caution: MATLAB routine spline uses the not-a-knot end condition! It dictates that S′′′ be a single constant in the first two subintervals and another single constant in the last two subintervals. First, the original data are generated. Next, a finer subdivision of the interval [a, b] on the x-axis is made, and the corresponding y-values are obtained from the procedure spline. Finally, the original data points and the spline curve are plotted.
Mathematica and Maple, as well as many other software packages, can also be used to plot cubic spline curves.
Sample Data Set
We now illustrate the use of spline functions in fitting a curve to a set of data. Consider the following table:
x |  0.0    0.6    1.5    1.7    1.9    2.1    2.3    2.6    2.8    3.0
y | −0.8   −0.34   0.59   0.59   0.23   0.1    0.28   1.03   1.5    1.44

x |  3.6    4.7    5.2    5.7    5.8    6.0    6.4    6.9    7.6    8.0
y |  0.74  −0.82  −1.27  −0.92  −0.92  −1.04  −0.79  −0.06   1.0    0.0
These 20 points were selected from a wiggly freehand curve drawn on graph paper. We intentionally selected more points where the curve bent sharply. A visually pleasing curve is obtained by using the cubic spline routines Spline3 Coef and Spline3 Eval and plotting the resulting natural cubic spline curve. (See Figure 6.8.)
FIGURE 6.8 Natural cubic spline curve, y = S(x)
Alternatively, we can use mathematical software such as MATLAB, Maple, or Mathematica to plot the cubic spline function for this table.
Space Curves
2D Parametric Cubic Splines
In two dimensions, two cubic spline functions can be used together to form a parametric representation of a complicated curve that turns and twists. Select points on the curve and label them t = 0,1,…,n. For each value of t, read off the x- and y-coordinates of the point, thus producing a table:
t | 0    1    ···  n
x | x_0  x_1  ···  x_n
y | y_0  y_1  ···  y_n
Then fit $x = S(t)$ and $y = \bar{S}(t)$, where $S$ and $\bar{S}$ are natural cubic spline interpolants. The two functions $S$ and $\bar{S}$ give a parametric representation of the curve. (See Computer Exercises 6.2.6.)
EXAMPLE 4
Select 13 points on the well-known serpentine curve given by
\[ y = \frac{x}{1/4 + x^2} \]
So that the knots will not be equally spaced, write the curve in parametric form:
\[ x = \tfrac{1}{2}\tan\theta, \qquad y = \sin 2\theta \]
and take $\theta = i(\pi/14)$, where $i = -6, -5, \ldots, 5, 6$. Plot the natural cubic spline curve and the interpolation polynomial in order to compare them.
Solution
This is an example of curve fitting using the polynomial interpolation routines Coef and Eval from Chapter 4 (p. 166) and the cubic spline routines Spline3_Coef and Spline3_Eval. Figure 6.9 shows the resulting cubic spline curve and the high-degree polynomial curve (dashed line). The polynomial becomes extremely erratic after the fourth knot from the origin and oscillates wildly, whereas the spline is a near perfect fit.
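As a quick check (a worked verification added for illustration, not part of the original example), substituting the parametric form into the Cartesian equation confirms that both describe the same curve:

```latex
\begin{align*}
\tfrac14 + x^2 &= \tfrac14\bigl(1 + \tan^2\theta\bigr) = \frac{1}{4\cos^2\theta},\\[2pt]
\frac{x}{\tfrac14 + x^2} &= \tfrac12\tan\theta \cdot 4\cos^2\theta
  = 2\sin\theta\cos\theta = \sin 2\theta = y .
\end{align*}
```

Because $x = \tfrac12\tan\theta$ is nonlinear in $\theta$, equally spaced values of $\theta$ produce unequally spaced abscissas, which is the purpose of the parametric form here.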
FIGURE 6.9 Serpentine curve: cubic spline curve and polynomial curve (dashed)
EXAMPLE 5
Use cubic spline functions to produce the curve for the following data:
t | 0    1    2    3    4    5    6    7
y | 1.0  1.5  1.6  1.5  0.9  2.2  2.8  3.1
It is known that the curve is continuous but its slope is not.
Solution
A single cubic spline is not suitable. Instead, we can use two cubic spline interpolants, the first having knots 0, 1, 2, 3, 4 and the second having knots 4, 5, 6, 7. By carrying out two separate spline interpolation procedures, we obtain two cubic spline curves that meet at the point (4, 0.9). At this point, the two curves have different slopes. The resulting curve is shown in Figure 6.10. ■
Smoothness Property
Why do spline functions serve the needs of data fitting better than ordinary polynomials? To answer this, one should understand that interpolation by polynomials of high degree is often unsatisfactory because polynomials may exhibit wild oscillations. Polynomials are smooth in the technical sense of possessing continuous derivatives of all orders, whereas in this sense, spline functions are not smooth.
FIGURE 6.10 Two cubic splines, $y = \hat{S}(x)$ and $y = \tilde{S}(x)$
Natural Cubic Splines and Avoiding Wild Oscillations
Wild oscillations in a function can be attributed to its derivatives being very large. Consider the function whose graph is shown in Figure 6.11. The slope of the chord that joins the points p and q is very large in magnitude. By the Mean-Value Theorem, the slope of that chord is the value of the derivative at some point between p and q. Thus, the derivative must attain large values. Indeed, somewhere on the curve between p and q, there is a point where f′(x) is large and negative. Similarly, between q and r, there is a point where f′(x) is large and positive. Hence, there is a point on the curve between p and r where f′′(x) is large. This reasoning can be continued to higher derivatives if there are more oscillations. This is the behavior that spline functions do not exhibit. In fact, the following result shows that from a certain point of view, natural cubic splines are the best functions to use for curve fitting!
FIGURE 6.11 Wildly oscillating function (points p, q, and r marked on the curve)
THEOREM 1
Cubic Spline Smoothness Theorem
If S is the natural cubic spline function that interpolates a twice-continuously differentiable function f at knots $a = t_0$
2 determining the cubic spline interpolant.
By hand calculation, find the natural cubic spline interpolant for this table:
x | 1  2  3  4  5
y | 0  1  0  1  0
Find a cubic spline over knots −1, 0, and 1 such that the following conditions are satisfied: S′′(−1) = S′′(1) = 0, S(−1) = S(1) = 0, and S(0) = 1.
Show that $S_i$ can also be written in the form
\[ S_i(x) = y_i + A_i(x - t_i) + \tfrac{1}{2} z_i (x - t_i)^2 + \frac{z_{i+1} - z_i}{6h_i}(x - t_i)^3 \]
with
\[ A_i = -\frac{h_i}{3} z_i - \frac{h_i}{6} z_{i+1} - \frac{y_i}{h_i} + \frac{y_{i+1}}{h_i} \]
29. Carry out the details in deriving Equation (9), starting with Equation (5).
30. Verify that the algorithm for computing the $(z_i)$ array is correct by showing that if $(z_i)$ satisfies Equation (9), then it satisfies the equation in Step 3 of the algorithm.
34. This problem and the next two lead to a more efficient algorithm for natural cubic spline interpolation in the case of equally spaced knots. Let $h_i = h$ in Equation (5), and replace the parameters $z_i$ by $q_i = h^2 z_i/6$. Show that the new form of Equation (5) is then
\[ S_i(x) = q_{i+1}\Bigl(\frac{x - t_i}{h}\Bigr)^3 + q_i\Bigl(\frac{t_{i+1} - x}{h}\Bigr)^3 + (y_{i+1} - q_{i+1})\Bigl(\frac{x - t_i}{h}\Bigr) + (y_i - q_i)\Bigl(\frac{t_{i+1} - x}{h}\Bigr) \]
35. (Continuation) Establish the new continuity conditions:
\[ q_0 = q_n = 0, \qquad q_{i-1} + 4q_i + q_{i+1} = y_{i+1} - 2y_i + y_{i-1} \quad (1 \le i \le n - 1) \]
36. (Continuation) Show that the parameters $q_i$ can be determined by backward recursion as follows:
\[ q_n = 0, \qquad q_{n-1} = \beta_{n-1}, \qquad q_i = \alpha_i q_{i+1} + \beta_i \quad (i = n-2, n-3, \ldots, 0) \]
where the coefficients $\alpha_i$ and $\beta_i$ are generated by ascending recursion from the formulas
\[ \alpha_0 = 0, \qquad \alpha_i = -(\alpha_{i-1} + 4)^{-1} \quad (1 \le i \le n) \]
\[ \beta_0 = 0, \qquad \beta_i = -\alpha_i (y_{i+1} - 2y_i + y_{i-1} - \beta_{i-1}) \quad (1 \le i \le n) \]
(This stable and efficient algorithm is due to MacLeod [1973].)
37. Prove that if S(x) is a spline of degree k on [a, b], then S′(x) is a spline of degree k − 1.
a38. How many coefficients are needed to define a piecewise quartic (fourth-degree) function with n + 1 knots? How many conditions will be imposed if the piecewise quartic function is to be a quartic spline? Justify your answers.
a39. Determine whether this function is a natural cubic spline:
\[ S(x) = \begin{cases} x^3 + 3x^2 + 7x - 5 & (-1 \le x \le 0) \\ -x^3 + 3x^2 + 7x - 5 & (0 \le x \le 1) \end{cases} \]
40. Determine whether this function is or is not a natural cubic spline having knots 0, 1, and 2:
\[ f(x) = \begin{cases} x^3 + x - 1 & (0 \le x \le 1) \\ -(x-1)^3 + 3(x-1)^2 + 4(x-1) + 1 & (1 \le x \le 2) \end{cases} \]
41. Show that the natural cubic spline going through the points (0, 1), (1, 2), (2, 3), (3, 4), and (4, 5) must be y = x + 1. (The natural cubic spline interpolant to a given data set is unique, because the matrix in Equation (10) is diagonally dominant and invertible, as proven in Section 2.3.)
Computer Exercises 6.2
1. Rewrite and test procedure Spline3_Coef using procedure Tri from Section 2.3. Use the symmetry of the (n − 1) × (n − 1) tridiagonal system.
2. The extra storage required in Step 1 of the algorithm for solving the natural cubic spline tridiagonal system directly can be eliminated at the expense of a slight amount of extra computation, namely, by computing the $h_i$'s and $b_i$'s directly from the $t_i$'s and $y_i$'s in the forward elimination phase (Step 2) and in the back substitution phase (Step 3). Rewrite and test procedure Spline3_Coef using this idea.
3. Using at most 20 knots and the cubic spline routines Spline3_Coef and Spline3_Eval, plot on a computer plotter an outline of your:
a. School's mascot. b. Signature. c. Profile.
4. Let S be the cubic spline function that interpolates $f(x) = (x^2 + 1)^{-1}$ at 41 equally spaced knots in the interval [−5, 5]. Evaluate S(x) − f(x) at 101 equally spaced points on the interval [0, 5].
5. Draw a free-form curve on graph paper, making certain that the curve is the graph of a function. Then read values of your function at a reasonable number of points, say, 10–50, and compute the cubic spline function that takes those values. Compare the freely drawn curve to the graph of the cubic spline.
6. Draw a spiral (or other curve that is not a function) and reproduce it by way of parametric spline functions. (See the following figure.)
[Figure: spiral curve in the xy-plane with sample points labeled 0 through 9]
7. Write and test procedures that are as simple as possible to perform natural cubic spline interpolation with equally spaced knots.
Hint: See Exercises 6.2.34–6.2.36.
8. Write a program to estimate $\int_a^b f(x)\,dx$, assuming that we know the values of f at only certain prescribed knots $a = t_0$
By Equation (2), this assertion is true when k = 0. If it is true for index k − 1, then $B_i^{k-1}(x) > 0$ on $(t_i, t_{i+k})$ and $B_{i+1}^{k-1}(x) > 0$ on $(t_{i+1}, t_{i+k+1})$. In Equation (3), the factors that multiply $B_i^{k-1}(x)$ and $B_{i+1}^{k-1}(x)$ are positive when $t_i < x < t_{i+k+1}$. Thus, $B_i^k(x) > 0$ on this interval.
The principal use of the B splines $B_i^k$ ($i = 0, \pm 1, \pm 2, \ldots$) is as a basis for the set of all kth-degree splines that have the same knot sequence. Thus, linear combinations
\[ \sum_{i=-\infty}^{\infty} c_i B_i^k \]
are important objects of study. (We use $c_i$ for fixed k and $C_i^k$ to emphasize the degree k of the corresponding B splines.)
Our first task is to develop an efficient method to evaluate a function of the form
\[ f(x) = \sum_{i=-\infty}^{\infty} C_i^k B_i^k(x) \qquad (4) \]
under the supposition that the coefficients $C_i^k$ are given (as well as the knot sequence $t_i$). Using Definition (3) and some simple series manipulations, we have
\[ f(x) = \sum_{i=-\infty}^{\infty} C_i^k \Bigl[ \frac{x - t_i}{t_{i+k} - t_i} B_i^{k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x) \Bigr] \]
\[ = \sum_{i=-\infty}^{\infty} \Bigl[ C_i^k \frac{x - t_i}{t_{i+k} - t_i} + C_{i-1}^k \frac{t_{i+k} - x}{t_{i+k} - t_i} \Bigr] B_i^{k-1}(x) \]
\[ = \sum_{i=-\infty}^{\infty} C_i^{k-1} B_i^{k-1}(x) \qquad (5) \]
where $C_i^{k-1}$ is defined to be the appropriate coefficient from the line preceding Equation (5).
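Applying this reduction repeatedly, one degree at a time, evaluates f(x) without forming any B spline explicitly. The following Python sketch illustrates the idea under stated assumptions: a finite, strictly increasing knot list `t`, a finite coefficient list `c` with `c[i]` multiplying $B_i^k$, and an evaluation point x in an interval where all k + 1 relevant coefficients exist. The function names are hypothetical, and `bspline` is included only to cross-check against the recursive definition.

```python
from typing import Sequence


def bspline(i: int, k: int, t: Sequence[float], x: float) -> float:
    """B_i^k(x) by the recursive definition; degree 0 is 1 on [t_i, t_{i+1})."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = (x - t[i]) / (t[i + k] - t[i]) * bspline(i, k - 1, t, x)
    right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline(i + 1, k - 1, t, x)
    return left + right


def eval_bspline_sum(c: Sequence[float], k: int, t: Sequence[float], x: float) -> float:
    """Evaluate sum_i c[i] * B_i^k(x) by lowering the degree k times."""
    m = None
    for j in range(len(t) - 1):      # locate m with t[m] <= x < t[m+1]
        if t[j] <= x < t[j + 1]:
            m = j
            break
    if m is None or m - k < 0 or m >= len(c):
        raise ValueError("x must lie where all k+1 needed coefficients exist")
    # Coefficients of degree k that can influence f(x): indices m-k .. m.
    level = {i: float(c[i]) for i in range(m - k, m + 1)}
    for j in range(k, 0, -1):        # go from degree j to degree j-1
        lower = {}
        for i in range(m - j + 1, m + 1):
            denom = t[i + j] - t[i]
            lower[i] = (level[i] * (x - t[i]) + level[i - 1] * (t[i + j] - x)) / denom
        level = lower
    return level[m]                  # the single degree-0 coefficient at index m
```

Only k(k + 1)/2 coefficient updates are needed per evaluation, whereas the recursive `bspline` repeats work across overlapping calls.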
This algebraic manipulation shows how a linear combination of $B_i^k(x)$ can be expressed as a linear combination of $B_i^{k-1}(x)$. Repeating this process k − 1 times, we eventually express f(x) in the form
\[ f(x) = \sum_{i=-\infty}^{\infty} C_i^0 B_i^0(x) \qquad (6) \]
Coefficients Recurrence Relation
If $t_m \le x < t_{m+1}$, then $f(x) = C_m^0$. The formula by which the coefficients $C_i^{j-1}$ are obtained is
\[ C_i^{j-1} = C_i^j \frac{x - t_i}{t_{i+j} - t_i} + C_{i-1}^j \frac{t_{i+j} - x}{t_{i+j} - t_i} \qquad (7) \]
A nice feature of Equation (4) is that only the k + 1 coefficients $C_m^k, C_{m-1}^k, \ldots, C_{m-k}^k$ are needed to compute f(x) if $t_m \le x < t_{m+1}$ (see Exercise 6.3.6). Thus, if f is defined by Equation (4) and we want to compute f(x), we use Equation (7) to calculate the entries in the following triangular array:
\[ \begin{array}{cccc} C_m^k & C_m^{k-1} & \cdots & C_m^0 \\ C_{m-1}^k & C_{m-1}^{k-1} & & \\ \vdots & & & \\ C_{m-k}^k & & & \end{array} \]
Although our notation does not show it, the coefficients in Equation (4) are independent of x, whereas the $C_i^{j-1}$'s calculated subsequently by Equation (7) do depend on x.
It is now a simple matter to establish that
\[ \sum_{i=-\infty}^{\infty} B_i^k(x) = 1 \quad \text{for all } x \text{ and all } k \ge 0 \]
If k = 0, we already know this. If k > 0, we use Equation (4) with $C_i^k = 1$ for all i. By Equation (7), all subsequent coefficients $C_i^k, C_i^{k-1}, C_i^{k-2}, \ldots, C_i^0$ are also equal to 1 (induction is needed here!). Thus, at the end, Equation (6) is true with $C_i^0 = 1$, and so f(x) = 1. Therefore, from Equation (4), the sum of all B splines of degree k is unity.
The smoothness of the B splines $B_i^k$ increases with the index k. In fact, we can show by induction that $B_i^k$ has a continuous (k − 1)st derivative.
The B splines can be used as substitutes for complicated functions in many mathematical situations. Differentiation and integration are important examples.
Derivatives of B Splines
Derivative Property I
A basic result about the derivatives of B splines is
\[ \frac{d}{dx} B_i^k(x) = \frac{k}{t_{i+k} - t_i} B_i^{k-1}(x) - \frac{k}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x) \qquad (8) \]
This equation can be proved by induction using the recursive Formula (3). Once Equation (8) is established, we get the useful formula
Derivative Property II
\[ \frac{d}{dx} \sum_{i=-\infty}^{\infty} c_i B_i^k(x) = \sum_{i=-\infty}^{\infty} d_i B_i^{k-1}(x) \qquad (9) \]
where
\[ d_i = k\,\frac{c_i - c_{i-1}}{t_{i+k} - t_i} \]
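Formula (9) can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the book's code: it uses a finite, strictly increasing knot list and a finite coefficient list, so the missing neighbors $c_{-1}$ and $c_M$ are taken to be zero, and the names `bspline` and `derivative_coefficients` are hypothetical.

```python
from typing import List, Sequence


def bspline(i: int, k: int, t: Sequence[float], x: float) -> float:
    """B_i^k(x) by the recursive definition; degree 0 is 1 on [t_i, t_{i+1})."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = (x - t[i]) / (t[i + k] - t[i]) * bspline(i, k - 1, t, x)
    right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * bspline(i + 1, k - 1, t, x)
    return left + right


def derivative_coefficients(c: Sequence[float], k: int, t: Sequence[float]) -> List[float]:
    """Return d_0..d_M with d_i = k (c_i - c_{i-1}) / (t_{i+k} - t_i).

    M = len(c); the finite sum is extended by c_{-1} = c_M = 0 (an assumption
    needed because the source formula is written for doubly infinite sums).
    """
    m = len(c)
    padded = [0.0] + [float(ci) for ci in c] + [0.0]   # padded[i + 1] holds c_i
    return [k * (padded[i + 1] - padded[i]) / (t[i + k] - t[i]) for i in range(m + 1)]
```

With `d = derivative_coefficients(c, k, t)`, the sum of `d[i] * bspline(i, k - 1, t, x)` should agree with a finite-difference estimate of the derivative of the degree-k combination at points x that are not knots.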
The verification is as follows. By Equation (8), we obtain
\[ \frac{d}{dx} \sum_{i=-\infty}^{\infty} c_i B_i^k(x) = \sum_{i=-\infty}^{\infty} c_i \frac{d}{dx} B_i^k(x) \]
\[ = \sum_{i=-\infty}^{\infty} c_i \Bigl[ \frac{k}{t_{i+k} - t_i} B_i^{k-1}(x) - \frac{k}{t_{i+k+1} - t_{i+1}} B_{i+1}^{k-1}(x) \Bigr] \]
\[ = \sum_{i=-\infty}^{\infty} \Bigl[ \frac{c_i k}{t_{i+k} - t_i} - \frac{c_{i-1} k}{t_{i+k} - t_i} \Bigr] B_i^{k-1}(x) \]
\[ = \sum_{i=-\infty}^{\infty} d_i B_i^{k-1}(x) \]
Integration of B Splines
For numerical integration, the B splines are also recommended, especially for indefinite integration. Here is the basic result needed for integration:
Integration Property I
\[ \int_{-\infty}^{x} B_i^k(s)\,ds = \frac{t_{i+k+1} - t_i}{k+1} \sum_{j=i}^{\infty} B_j^{k+1}(x) \qquad (10) \]
This equation can be verified by differentiating both sides with respect to x and simplifying by the use of Equation (9). To be sure that the two sides of Equation (10) do not differ by a constant, we note that for any $x < t_i$, both sides reduce to zero.
The basic result (10) produces this useful formula:
Integration Property II
\[ \int_{-\infty}^{x} \sum_{i=-\infty}^{\infty} c_i B_i^k(s)\,ds = \sum_{i=-\infty}^{\infty} e_i B_i^{k+1}(x) \qquad (11) \]
where
\[ e_i = \frac{1}{k+1} \sum_{j=-\infty}^{i} c_j (t_{j+k+1} - t_j) \]
It should be emphasized that this formula gives an indefinite integral (antiderivative) of any function expressed as a linear combination of B splines. Any definite integral can be obtained by selecting a specific value of x. For example, if x is a knot, say, $x = t_m$, then
\[ \int_{-\infty}^{t_m} \sum_{i=-\infty}^{\infty} c_i B_i^k(s)\,ds = \sum_{i=-\infty}^{\infty} e_i B_i^{k+1}(t_m) = \sum_{i=m-k-1}^{m} e_i B_i^{k+1}(t_m) \]
Mathematical Software
MATLAB has a Spline Toolbox, developed by Carl de Boor, that can be used for many tasks involving splines. For example, there are routines for interpolating data by splines with diverse end conditions and routines for least-squares fits to data. There are many demonstration routines in this Toolbox that exhibit plots and provide models for programming MATLAB M-files. These demonstrations are quite instructive for visualizing and learning the concepts in spline theory, especially B splines.
Maple has a BSpline package for constructing B spline basis functions of degree k from a given knot list, which may include multiple knots. It is based on a divided-difference implementation found in Bartels, Beatty, and Barsky [1987].
Mathematica has a splines package also.
Interpolation and Approximation by B Splines
We developed a number of properties of B splines and showed how B splines are used in various numerical tasks. The problem of obtaining a B spline representation of a given function was not discussed. Here, we consider the problem of interpolating a table of data; later, a noninterpolatory method of approximation is described.
Basic Property
A basic question is how to determine the coefficients in the expression
\[ S(x) = \sum_{i=-\infty}^{\infty} A_i B_{i-k}^k(x) \qquad (12) \]
so that the resulting spline function interpolates a prescribed table:
x | t_0  t_1  ···  t_n
y | y_0  y_1  ···  y_n
Interpolation Conditions
We mean by interpolate that
\[ S(t_i) = y_i \quad (0 \le i \le n) \qquad (13) \]
B Splines of Degree 0
The natural starting point is with the simplest splines, corresponding to k = 0. Since
\[ B_i^0(t_j) = \delta_{ij} = \begin{cases} 1 & (i = j) \\ 0 & (i \ne j) \end{cases} \]
the solution to the problem is immediate: Just set $A_i = y_i$ for $0 \le i \le n$. All other coefficients in Equation (12) are arbitrary. In particular, they can be zero. We arrive then at this result: The B spline of degree 0
\[ S(x) = \sum_{i=0}^{n} y_i B_i^0(x) \]
has the interpolation property (13).
B Spline of Degree 1
The next case, k = 1, also has a simple solution. We use the fact that
\[ B_{i-1}^1(t_j) = \delta_{ij} \]
Hence, the following is true: The B spline of degree 1
\[ S(x) = \sum_{i=0}^{n} y_i B_{i-1}^1(x) \]
has the interpolation property (13). So $A_i = y_i$ again.
If the table has four entries (n = 3), for instance, we use $B_{-1}^1$, $B_0^1$, $B_1^1$, and $B_2^1$. They, in turn, require for their definition knots $t_{-1}, t_0, t_1, \ldots, t_4$. Knots $t_{-1}$ and $t_4$ can be arbitrary. Figure 6.15 shows the graphs of the four $B_i^1$ splines. In such a problem, if $t_{-1}$ and $t_4$ are not prescribed, it is natural to define them in such a way that $t_0$ is the midpoint of the interval $[t_{-1}, t_1]$ and $t_3$ is the midpoint of $[t_2, t_4]$.
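The degree-1 case is easy to try out, because each $B_{i-1}^1$ is a hat function that equals 1 at $t_i$ and vanishes at the neighboring knots. The Python sketch below is illustrative only: the names `hat` and `linear_bspline_interp` are hypothetical, and the two extra knots are chosen by the midpoint rule described above.

```python
from typing import Sequence


def hat(left: float, peak: float, right: float, x: float) -> float:
    """Degree-1 B spline on knots left < peak < right (value 1 at peak)."""
    if left <= x <= peak:
        return (x - left) / (peak - left)
    if peak < x < right:
        return (right - x) / (right - peak)
    return 0.0


def linear_bspline_interp(t: Sequence[float], y: Sequence[float], x: float) -> float:
    """Evaluate S(x) = sum_i y_i * B^1_{i-1}(x) for knots t_0 < ... < t_n."""
    # Extra knots t_{-1} and t_{n+1}, placed so that t_0 and t_n are midpoints.
    ext = [2.0 * t[0] - t[1]] + list(t) + [2.0 * t[-1] - t[-2]]
    # ext[i + 1] is t_i, so B^1_{i-1} lives on (ext[i], ext[i + 2]).
    return sum(y[i] * hat(ext[i], ext[i + 1], ext[i + 2], x) for i in range(len(t)))
```

Between two knots only two hat functions are nonzero, so S reduces to the straight line joining the neighboring data points.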
FIGURE 6.15 $B_i^1$ splines: $B_{-1}^1$, $B_0^1$, $B_1^1$, and $B_2^1$ on the knots $t_{-1}, t_0, \ldots, t_4$
In both elementary cases considered, the unknown coefficients $A_0, A_1, \ldots, A_n$ in Equation (12) were uniquely determined by the interpolation conditions (13). If terms were present in Equation (12) corresponding to values of i outside the range {0, 1, . . . , n}, then they would have no influence on the values of S(x) at $t_0, t_1, \ldots, t_n$.
Higher-Degree B Splines
Quadratic Case
For higher-degree splines, we shall see that some arbitrariness exists in choosing coefficients. In fact, none of the coefficients is uniquely determined by the interpolation conditions. This fact can be advantageous if other properties are desired of the solution. In the quadratic case, we begin with the equation
\[ \sum_{i=-\infty}^{\infty} A_i B_{i-2}^2(t_j) = \frac{1}{t_{j+1} - t_{j-1}} \bigl[ A_j (t_{j+1} - t_j) + A_{j+1} (t_j - t_{j-1}) \bigr] \qquad (14) \]
Its justification is left to Exercise 6.3.26. If the interpolation conditions (13) are now imposed, we obtain the following system of equations, which gives the necessary and sufficient conditions on the coefficients:
\[ A_j (t_{j+1} - t_j) + A_{j+1} (t_j - t_{j-1}) = y_j (t_{j+1} - t_{j-1}) \quad (0 \le j \le n) \qquad (15) \]
This is a system of n + 1 linear equations in n + 2 unknowns $A_0, A_1, \ldots, A_{n+1}$.
One way to solve Equation (15) is to assign any value to $A_0$ and then use Equation (15) to compute for $A_1, A_2, \ldots, A_{n+1}$, recursively. For this purpose, the equations could be rewritten as
\[ A_{j+1} = \alpha_j + \beta_j A_j \quad (0 \le j \le n) \qquad (16) \]
where these abbreviations have been used:
\[ \alpha_j = y_j\,\frac{t_{j+1} - t_{j-1}}{t_j - t_{j-1}}, \qquad \beta_j = \frac{t_j - t_{j+1}}{t_j - t_{j-1}} \quad (0 \le j \le n) \]
To keep the coefficients small in magnitude, we recommend selecting $A_0$ such that the expression
\[ \Phi = \sum_{i=0}^{n+1} A_i^2 \]
will be a minimum. To determine this value of $A_0$, we proceed as follows: By successive substitution using Equation (16), we can show that
\[ A_{j+1} = \gamma_j + \delta_j A_0 \quad (0 \le j \le n) \qquad (17) \]
where the coefficients $\gamma_j$ and $\delta_j$ are obtained recursively by this algorithm:
\[ \gamma_0 = \alpha_0, \quad \delta_0 = \beta_0; \qquad \gamma_j = \alpha_j + \beta_j \gamma_{j-1}, \quad \delta_j = \beta_j \delta_{j-1} \quad (1 \le j \le n) \qquad (18) \]
Then $\Phi$ is a quadratic function of $A_0$ as follows:
\[ \Phi = A_0^2 + A_1^2 + \cdots + A_{n+1}^2 = A_0^2 + (\gamma_0 + \delta_0 A_0)^2 + (\gamma_1 + \delta_1 A_0)^2 + \cdots + (\gamma_n + \delta_n A_0)^2 \]
To find the minimum of $\Phi$, we take its derivative with respect to $A_0$ and set it equal to zero:
\[ \frac{d\Phi}{dA_0} = 2A_0 + 2(\gamma_0 + \delta_0 A_0)\delta_0 + 2(\gamma_1 + \delta_1 A_0)\delta_1 + \cdots + 2(\gamma_n + \delta_n A_0)\delta_n = 0 \]
This is equivalent to $qA_0 + p = 0$, where
\[ q = 1 + \delta_0^2 + \delta_1^2 + \cdots + \delta_n^2, \qquad p = \gamma_0\delta_0 + \gamma_1\delta_1 + \cdots + \gamma_n\delta_n \]
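Solving this linear equation gives the minimizing coefficient explicitly. The short derivation below writes $\Phi$ for the sum of squares of the coefficients and is added as a worked step, not quoted from the text:

```latex
\[
qA_0 + p = 0 \;\Longrightarrow\; A_0 = -\frac{p}{q},
\qquad
\frac{d^2\Phi}{dA_0^2} = 2q = 2\Bigl(1 + \sum_{j=0}^{n} \delta_j^2\Bigr) > 0 .
\]
```

Since the second derivative is positive for every choice of data, the stationary point is the unique minimum of the quadratic $\Phi$.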
Pseudocode and a Curve-Fitting Example
A procedure that computes coefficients $A_0, A_1, \ldots, A_{n+1}$ in the manner previously outlined is given now. In its calling sequence, $(t_i)_{0:n}$ is the knot array, $(y_i)_{0:n}$ is the array of ordinates (the data values at the knots), $(a_i)_{0:n+1}$ is the array of $A_i$ coefficients, and $(h_i)_{0:n+1}$ is an array that contains $h_i = t_i - t_{i-1}$. Only n, $(t_i)$, and $(y_i)$ are input values. They are available unchanged when the routine is finished. Arrays $(a_i)$ and $(h_i)$ are computed and available as output.
procedure BSpline2_Coef(n, (t_i), (y_i), (a_i), (h_i))
integer i, n; real δ, γ, p, q, r
real array (a_i)_{0:n+1}, (h_i)_{0:n+1}, (t_i)_{0:n}, (y_i)_{0:n}
for i = 1 to n
    h_i ← t_i − t_{i−1}
end for
h_0 ← h_1
h_{n+1} ← h_n
δ ← −1
γ ← 2y_0
p ← δγ
q ← 2
for i = 1 to n
    r ← h_{i+1}/h_i
    δ ← −rδ
    γ ← −rγ + (r + 1)y_i
    p ← p + γδ
    q ← q + δ^2
end for
a_0 ← −p/q
for i = 1 to n + 1
    a_i ← [(h_{i−1} + h_i)y_{i−1} − h_i a_{i−1}]/h_{i−1}
end for
end procedure BSpline2_Coef
BSpline2_Coef Pseudocode
Next we give a function BSpline2_Eval for computing values of the quadratic spline given by $S(x) = \sum_{i=0}^{n+1} A_i B_{i-2}^2(x)$. Its calling sequence has some of the same variables as in the preceding pseudocode. The input variable x is a single real number that should lie between $t_0$ and $t_n$. The result of Exercise 6.3.26 is used.
real function BSpline2_Eval(n, (t_i), (a_i), (h_i), x)
integer i, n; real d, e, x; real array (a_i)_{0:n+1}, (h_i)_{0:n+1}, (t_i)_{0:n}
for i = n − 1 to 0 step −1
    if x − t_i ≧ 0 then exit loop
end for
i ← i + 1
d ← [a_{i+1}(x − t_{i−1}) + a_i(t_i − x + h_{i+1})]/(h_i + h_{i+1})
e ← [a_i(x − t_{i−1} + h_{i−1}) + a_{i−1}(t_{i−1} − x + h_i)]/(h_{i−1} + h_i)
BSpline2_Eval ← [d(x − t_{i−1}) + e(t_i − x)]/h_i
end function BSpline2_Eval
BSpline2_Eval Pseudocode
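For readers who want to experiment, the two quadratic-spline routines can be transcribed into Python roughly as follows. This is a sketch under assumptions, not the authors' implementation: the names `bspline2_coef` and `bspline2_eval` are hypothetical, and the coefficient routine returns the arrays instead of filling output arguments.

```python
from typing import List, Sequence, Tuple


def bspline2_coef(t: Sequence[float], y: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (a, h): coefficients a[0..n+1] and spacings h[0..n+1]."""
    n = len(t) - 1
    h = [0.0] * (n + 2)
    for i in range(1, n + 1):
        h[i] = t[i] - t[i - 1]
    h[0] = h[1]                      # fictitious end spacings copy their neighbors
    h[n + 1] = h[n]
    delta, gamma = -1.0, 2.0 * y[0]
    p, q = delta * gamma, 2.0
    for i in range(1, n + 1):        # accumulate p and q for the least-squares a[0]
        r = h[i + 1] / h[i]
        delta = -r * delta
        gamma = -r * gamma + (r + 1.0) * y[i]
        p += gamma * delta
        q += delta * delta
    a = [0.0] * (n + 2)
    a[0] = -p / q
    for i in range(1, n + 2):        # remaining coefficients from the recursion
        a[i] = ((h[i - 1] + h[i]) * y[i - 1] - h[i] * a[i - 1]) / h[i - 1]
    return a, h


def bspline2_eval(t: Sequence[float], a: Sequence[float],
                  h: Sequence[float], x: float) -> float:
    """Evaluate the quadratic spline at x, intended for t[0] <= x <= t[n]."""
    n = len(t) - 1
    i = 0
    for k in range(n - 1, -1, -1):   # search from the right, as in the pseudocode
        if x - t[k] >= 0:
            i = k
            break
    i += 1                           # now t[i-1] <= x <= t[i]
    d = (a[i + 1] * (x - t[i - 1]) + a[i] * (t[i] - x + h[i + 1])) / (h[i] + h[i + 1])
    e = (a[i] * (x - t[i - 1] + h[i - 1]) + a[i - 1] * (t[i - 1] - x + h[i])) / (h[i - 1] + h[i])
    return (d * (x - t[i - 1]) + e * (t[i] - x)) / h[i]
```

As with the cubic routines, the coefficient computation is done once and the evaluation function is then called for each plotting abscissa.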
Using the table of 20 points from Section 6.2, we can compare the resulting natural cubic spline curve with the quadratic spline produced by the procedures BSpline2 Coef and BSpline2 Eval. The first of these curves is shown in Figure 6.8 (p. 273), and the second is in Figure 6.16. The latter is reasonable, but perhaps not as pleasing as the former. These curves show once again that cubic natural splines are simple and elegant functions for curve fitting.
FIGURE 6.16 Quadratic interpolating spline, y = S(x)
Schoenberg’s Process
Quadratic Case
An efficient process due to Schoenberg [1967] can also be used to obtain B spline approximations to a given function. Its quadratic version is defined by
\[ S(x) = \sum_{i=-\infty}^{\infty} f(\tau_i) B_i^2(x) \quad \text{where} \quad \tau_i = \tfrac{1}{2}(t_{i+1} + t_{i+2}) \qquad (19) \]
Here, of course, the knots are $\{t_i\}_{i=-\infty}^{\infty}$, and the points where f must be evaluated are midpoints between the knots. Equation (19) is useful in producing a quadratic spline function that approximates f.
The salient properties of this process are as follows:
1. If f(x) = ax + b, then S(x) = f(x).
2. If f(x) ≧ 0 everywhere, then S(x) ≧ 0 everywhere.
3. $\max_x |S(x)| \le \max_x |f(x)|$.
4. If f is continuous on [a, b], if $\delta = \max_i |t_{i+1} - t_i|$, and if δ
0 $x \in (t_i, t_{i+k+1})$
An efficient method to evaluate a function of the form
\[ f(x) = \sum_{i=-\infty}^{\infty} C_i^k B_i^k(x) \]
is to use
\[ C_i^{j-1} = C_i^j \frac{x - t_i}{t_{i+j} - t_i} + C_{i-1}^j \frac{t_{i+j} - x}{t_{i+j} - t_i} \]
• The derivative of B splines is
d Bk(x) = k Bk−1(x) −
Bk−1(x) where di = k (ci − ci −1 )/(ti +k − ti ). A basic result needed for integration is
dx i ti+k −ti i A useful formula is
k
ti+k+1 −ti+1 i+1
−∞
ik+1j j=i
d∞ ∞ c Bk(x) =
d Bk−1(x) dxii ii
i =−∞ i =−∞
x t i + k + 1 − t i ∞ Bk(s)ds = Bk+1(x)
A resulting useful formula is
x ∞ ∞
c Bk(s)ds = e Bk+1(x)
j =−∞
• To determine the coefficients in the expression
−∞ i=−∞
whereei =1/(k+1) i cj(tj+k+1 −tj).
ii ii i=−∞
S(x)=
AB2 (x)
i =−∞
so that the resulting spline function interpolates a prescribed table, we use the condition
Aj(tj+1 −tj)+ Aj+1(tj −tj−1)= yj(tj+1 −tj−1) (0≦ j≦n) Thisisasystemofn+1linearequationsinn+2unknowns A0,A1,…,An+1 thatcan
be solved recursively.
• Schoenberg’s process is an efficient process to obtain B spline approximations to a given function. For example, its quadratic version is defined by
∞ i =−∞
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
S(x) =
f (τi )Bi2(x)
∞
i i−k
2.
Whatfunctionsaregeneratedbythefollowingrecursive
definition?
f0(x) = 1, f1(x) = x
fn+1(x) = 2x fn(x) − fn−1(x)
Find an expression for Bi2(x) and verify that it is piece-
on the real line (ti = i). Showthat
this recursive definition:
f0(x) = 1, f1(x) = cos x fn+1(x) = 2 f1(x) fn(x) − fn−1(x)
(n ≧ 1)
i i+1 Showthat Bik(x) = B0k(x−ti)iftheknotsaretheintegers
6.3 B Splines: Interpolation and Approximation 295 whereτ = 1(t +t )andtheknotsare{t}∞ .Thepointsτ where f mustbe
i 2 i+1 i+2 i i=−∞ i evaluated are midpoints between the knots.
• Be ́ziercurvesareusedincomputer-aideddesignforproducingacurvethatgoesthrough (or near to) control points, or a curve that can be manipulated easily to give a desired shape. Be ́zier curves use Bernstein polynomials. For a continuous function f defined on [0, 1], the sequence of Bernstein polynomials
n i
pn(x)= f n φni(x) (n≧1)
i=0
converges uniformly to f . The polynomials φni are
φ n i ( x ) = ni x i ( 1 − x ) n − i ( 0 ≦ i ≦ n )
1.
a
a3.
4. a5.
6.
7.
8.
Showthatthefunctions fn(x)=cosnxaregeneratedby
9.
10. 11.
12.
13. 14.
Forequallyspacedknots,showthatk(k+1)−1Bik(x)lies in the interval with endpoints Bk−1(x) and Bk−1(x).
(n ≧ 1)
wise quadratic. Show that Bi2(x) is zero at every knot
$$\int_{-\infty}^{\infty} B_i^k(x)\,dx = \frac{t_{i+k+1} - t_i}{k + 1}$$
except
$$B_i^2(t_{i+1}) = \frac{t_{i+1} - t_i}{t_{i+2} - t_i}, \qquad B_i^2(t_{i+2}) = \frac{t_{i+3} - t_{i+2}}{t_{i+3} - t_{i+1}}$$
Verify Equation (5).
Show that the class of all spline functions of degree m that have knots x0, x1, . . . , xn includes the class of polynomials of degree m.
Establish Equation (8) by induction.
Which B splines Bik have a nonzero value on the interval
(tn , tm )? Explain.
Show that on [t ,t ] we have
Establish that $\sum_{i=-\infty}^{\infty} f(t_i)\, B_{i-1}^1(x)$ is a first-degree spline that interpolates f at every knot. What is the zero-degree spline that does so?
a 15.
a 16. a 17. 18. 19.
i i+1
Let $h_i = t_{i+1} - t_i$. Show that if $S(x) = \sum_{i=-\infty}^{\infty} c_i B_i^2(x)$ and
$$c_{i-1}h_{i-1} + c_{i-2}h_i = y_i(h_i + h_{i-1})$$
for all i, then $S(t_m) = y_m$ for all m. Hint: Use Exercise 6.3.3.
Show that the coefficients $C_i^{j-1}$ generated by Equation (7) satisfy the condition $\min_i C_i^{j-1} \leqq f(x) \leqq \max_i C_i^{j-1}$.
coefficients in order that $\sum_{i=-\infty}^{\infty} c_i B_i^k = 0$? State and prove.
a
Expand the function f(x) = x in an infinite series $\sum_{i=-\infty}^{\infty} c_i B_i^1$.
a
Showthatiftm≦x
x
0 12
(x < 0)
(0 ≦ x < 1)
(1 ≦ x < 2) (2 ≦ x < 3)
i−1 ti+1 −ti−1 hi +hi−1 B2 (t ) = ti+1 − ti = hi
i−2 i ti+1−ti−1 hi+hi−1 wherehi=ti+1−ti.
Show by induction that if
$$A_j = \frac{1}{t_{j-1} - t_{j-2}}\left[y_{j-1}(t_j - t_{j-2}) - A_{j-1}(t_j - t_{j-1})\right]$$
for j = 2, 3, ..., n + 1, then
j ≧ i + k + 1.
What is the maximum value of Bi and where does it
occur? Let the knots be the integers, and prove that
2
1 ( 6 x − 3 − 2 x 2 )
a
33. 34.
2
B 02 ( x ) =
Establish formulas
B2 (ti)= i i−1 = i−1
1 (3 − x )2 2
12
2 0
(x < 0)
(0 ≦ x < 1)
(1 ≦ x < 2)
(2 ≦ x < 3)
Show that a linear B spline with integer knots can be written in matrix form as
(x ≧ 3) t−th
0
24.
1x3 6
(4−3x(x−2)) 6
B 03 ( x ) =
6
In the theory of Bézier curves, using the Bernstein basic
point, v0.
1
2 (4+3(x −4)(x −2) )
6
1 (4 − x )3
(3 ≦ x < 4) (x≧4)
$$\sum_{i=0}^{n+1} A_i B_{i-2}^2(t_j) = y_j \qquad (0 \leqq j \leqq n)$$
−1 1 c1 2 0 c0
b =x (0≦x<1) 10
35.
36.
0
polynomials, show that the curve passes through the first
Show that if we set $S(x) = \sum_{i=-\infty}^{\infty} A_i B_{i-2}^2(x)$ and $t_{j-1} \leqq x \leqq t_j$, then
S(x) = [x 1] where
= b10c0 + b11c1
with
d = t j+1
and
1
−t 1
[Aj+1(x −tj−1)+ Aj(tj+1 −x)]
B01(x)= b11=2−x (1≦x<2) 0 (otherwise)
Show that the quadratic B spline with integer knots can be written in matrix form as
$$S(x) = \frac{1}{t_j - t_{j-1}}\left[d(x - t_{j-1}) + e(t_j - x)\right]$$
j−1
37.
$[A_j(x - t_{j-2}) + A_{j-1}(t_j - x)]$
Verify Equations (17) and (18) by induction, using Equation (16).
1 −2 1c2 S(x)=1[x2 x 1] −6 6 0 c1
e=
tj −tj−2
27.
2
= b20c0 + b21c1 + b22c2
9 −3 0 c0
where
b 20
B02(x) = 21 b22
(2 ≦ x < 3) (otherwise)
0 Hint: See Exercise 6.3.23.
B03(x)= b32 b33
b (1 ≦ x < 2)
b 31 b
(0≦x<1) (1≦x<2) (2≦x<3) (3≦x<4) (otherwise)
38. Show that the cubic B spline with integer knots can be
0 Hint: See Exercise 6.3.34.
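written as
$$S(x) = \frac{1}{6}\begin{bmatrix} x^3 & x^2 & x & 1 \end{bmatrix}\begin{bmatrix} -1 & 3 & -3 & 1 \\ 12 & -24 & 12 & 0 \\ -48 & 60 & -12 & 0 \\ 64 & -44 & 4 & 0 \end{bmatrix}\begin{bmatrix} c_3 \\ c_2 \\ c_1 \\ c_0 \end{bmatrix}$$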
( 0 ≦ x < 1 )
30
where
= b30c0 + b31c1 + b32c2 + b33c3
Computer Exercises 6.3
1. Using an automatic plotter, graph $B_0^k$ for k = 0, 1, 2, 3, 4. Use integer knots $t_i = i$ over the interval [0, 5].
2. Let $t_i = i$ (so the knots are the integer points on the real line). Print a table of 100 values of the function $3B_7^1 + 6B_8^1 - 4B_9^1 + 2B_{10}^1$ on the interval [6, 14]. Using a plotter, construct the graph of this function on the given interval.
3. (Continuation) Repeat for the function $3B_7^2 + 6B_8^2 - 4B_9^2 + 2B_{10}^2$.
4. Assuming that $S(x) = \sum_{i=0}^{n} c_i B_i^k(x)$, write a procedure to evaluate $S'(x)$ at a specified x. Input is $n, k, x, t_0, \ldots, t_{n+k+1}$ and $c_0, c_1, \ldots, c_n$.
5. Write a procedure to evaluate $\int_a^b S(x)\,dx$, using the assumption that $S(x) = \sum_{i=0}^{n} c_i B_i^k(x)$. Input will be $n, k, a, b, c_0, c_1, \ldots, c_n, t_0, \ldots, t_{n+k+1}$.
6. (March of the B splines) Produce graphs of several B splines of the same degree marching across the x-axis. Use mathematical software such as MATLAB, Maple, or Mathematica.
a7. Historians have estimated the size of the Spanish Army of Flanders in the Spanish Netherlands as follows:
   Date          Number
   Sept. 1572    67,259
   Dec. 1573     62,280
   Mar. 1574     62,350
   Jan. 1575     59,250
   May 1576      51,457
   Feb. 1578     27,603
   Sept. 1580    45,435
   Oct. 1582     61,162
   Apr. 1588     63,455
   Nov. 1591     62,164
   Mar. 1607     41,471
   Fit the table with a quadratic B spline, and use it to find the average size of the army during the period given. (The average is defined by an integral.)
8. Rewrite procedures BSpline2_Coef and BSpline2_Eval so that the array $(h_i)$ is not used.
9. Rewrite procedures BSpline2_Coef and BSpline2_Eval for the special case of equally spaced knots, simplifying the code where possible.
10. Write a procedure to produce a spline approximation to $F(x) = \int_a^x f(t)\,dt$. Assume that $a \leqq x \leqq b$. Begin by finding a quadratic spline interpolant to f at the n points $t_i = a + i(b - a)/n$. Test your program on the following:
   a. $f(x) = \sin x$ $(0 \leqq x \leqq \pi)$
   b. $f(x) = e^x$ $(0 \leqq x \leqq 4)$
   c. $f(x) = (x^2 + 1)^{-1}$ $(0 \leqq x \leqq 2)$
11. Write a procedure to produce a spline function that approximates $f'(x)$ for a given f on a given interval [a, b]. Begin by finding a quadratic spline interpolant to f at n + 1 points evenly spaced in [a, b], including endpoints. Test your procedure on the functions suggested in the preceding computer exercise.
12. Define f on [0, 6] to be a polygonal line that joins points (0, 0), (1, 2), (3, 3), (5, 3), and (6, 0). Determine spline approximations to f, using Schoenberg’s process and taking 7, 13, 19, 25, and 31 knots.
13. Write suitable code to calculate $\sum_{i=-\infty}^{\infty} f(s_i)\, B_i^2(x)$ with $s_i = \frac{1}{2}(t_{i+1} + t_{i+2})$. Assume that f is defined on [a, b] and that x will lie in [a, b]. Assume also that $t_1 < a < t_2$ and $t_{n+1} < b < t_{n+2}$. (Make no assumption about the spacing of knots.)
14. Write a procedure to carry out this approximation scheme:
$$S(x) = \sum_{i=-\infty}^{\infty} f(\tau_i)\, B_i^3(x), \qquad \text{where } \tau_i = \tfrac{1}{3}(t_{i+1} + t_{i+2} + t_{i+3})$$
   Assume that f is defined on [a, b] and that $\tau_i = a + ih$ for $0 \leqq i \leqq n$, where $h = (b - a)/n$.
15. Using a mathematical software system such as MATLAB, Maple, or Mathematica with B spline routines, compute and plot the spline curve in Figure 6.16 (p. 273) based on the 20 data points from Section 6.2. Vary the degree of the B splines from 0, 1, 2, 3, through 4 and observe the resulting curves.
16. Using B splines, write a program to perform a natural cubic spline interpolation at knots $t_0 < t_1 < \cdots < t_n$.
17. The documentation preparation system LaTeX is widely available and contains facilities for drawing some simple curves such as Bézier curves. Use this system to reproduce the following figure.
18. Show how to use mathematical software such as found in MATLAB, Maple, or Mathematica to plot the functions corresponding to
   a. Figure 6.14. b. Figure 6.15.
   c. Figure 6.18. d. Figure 6.19.
19. (Computer-Aided Geometric Design) Use mathematical software for drawing two-dimensional Bézier spline curves, and graph the script number five shown, using spline points and control points. See Farin [1990], Sauer [2012], and Yamaguchi [1988] for additional details.
7
Initial Values Problems
In a simple electrical circuit, the current I in amperes is a function of time: I(t). The function I(t) satisfies an ordinary differential equation of the form
$$\frac{dI}{dt} = f(t, I)$$
Here, the right-hand side is a function of t and I that depends on the circuit and on the nature of the electromotive force supplied to the circuit.
A model to account for the way in which two different animal species sometimes interact is the predator-prey model. If u(t) is the number of individuals in the predator species and v(t) the number of individuals in the prey species, then under suitable simplifying assumptions and with appropriate constants a, b, c, and d,
$$\frac{du}{dt} = a(v + b)u, \qquad \frac{dv}{dt} = c(u + d)v$$
This is a pair of nonlinear ordinary differential equations (ODEs) that govern the populations of the two species (as functions of time t).
Numerical procedures are developed for solving such problems.
7.1 Taylor Series Methods
First, we present a general discussion of ordinary differential equations and their solutions.
Initial-Value Problem: Analytical versus Numerical Solution
An ordinary differential equation (ODE) is an equation that involves one or more derivatives of an unknown function. A solution of a differential equation is a specific function that satisfies the equation. Here are some examples of differential equations with their solutions. In each case, t is the independent variable and x is the dependent variable. Thus, x is the name of the unknown function of the independent variable t:
ODE Examples
   Equation                 Solution
   $x' - x = e^t$           $x(t) = te^t + ce^t$
   $x'' + 9x = 0$           $x(t) = c_1 \sin 3t + c_2 \cos 3t$
   $x' + 1/(2x) = 0$        $x(t) = \sqrt{c - t}$
IVP Standard Form
In these three examples, the letter c denotes an arbitrary constant. The fact that such constants appear in the solutions is an indication that a differential equation does not, in general, determine a unique solution function. When occurring in a scientific problem, a differential equation is usually accompanied by auxiliary conditions that (together with the differential equation) specify the unknown function precisely.
In this chapter, we concentrate on one type of differential equation and one type of auxiliary condition: the initial-value problem for a first-order differential equation. The standard form that has been adopted is
$$x' = f(t, x), \qquad x(a) \text{ is given} \qquad (1)$$
It is understood that x is a function of t, so the differential equation written in more detail looks like this:
$$\frac{dx(t)}{dt} = f(t, x(t))$$
Problem (1) is termed an initial-value problem because t can be interpreted as time and t = a can be thought of as the initial instant in time. We want to be able to determine the value of x at any time t before or after a.
Here are some examples of initial-value problems, together with their solutions:
IVP Examples
   Equation               Initial Value    Solution
   $x' = x + 1$           $x(0) = 0$       $x = e^t - 1$
   $x' = 6t - 1$          $x(1) = 6$       $x = 3t^2 - t + 4$
   $x' = t/(x + 1)$       $x(0) = 0$       $x = \sqrt{t^2 + 1} - 1$
Although many methods exist for obtaining analytical solutions of differential equations, they are primarily limited to special differential equations. When applicable, they produce a solution in the form of a formula, such as shown in the preceding examples. Frequently, however, in practical problems, a differential equation is not amenable to solution by special methods, and a numerical solution must be sought. Even when a formal solution can be obtained, a numerical solution may be preferable, especially if the formal solution is very complicated. A numerical solution of a differential equation is usually obtained in the form of a table; the functional form of the solution remains unknown insofar as a specific formula is concerned.
Direct Approach of Indefinite Integration
The form of the differential equation adopted here permits the function f to depend on t and x. If f does not involve x, as in the second preceding example, then the differential equation can be solved by a direct process of indefinite integration. To illustrate, consider the initial-value problem
$$x' = 3t^2 - 4t^{-1} + (1 + t^2)^{-1}, \qquad x(5) = 17 \qquad (2)$$
The differential equation can be integrated to produce
$$x(t) = t^3 - 4\ln t + \arctan t + C$$
The constant C can then be chosen so that x(5) = 17. We can use a mathematical software system such as MATLAB, Maple, or Mathematica to solve this differential equation explicitly and thereby find the value of this constant as C = 4 ln(5) − arctan(5) − 108.
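The value of C quoted for problem (2) can be checked by hand or with a few lines of code. The following Python sketch is an illustration only; the names `x`, `rhs`, and `C` are assumptions chosen here. It computes C from the condition x(5) = 17 and compares it with the closed form.

```python
import math


def x(t, C):
    """Antiderivative t^3 - 4 ln t + arctan t + C of the right-hand side in (2)."""
    return t**3 - 4.0 * math.log(t) + math.atan(t) + C


def rhs(t):
    """Right-hand side 3t^2 - 4/t + 1/(1 + t^2) of the differential equation in (2)."""
    return 3.0 * t**2 - 4.0 / t + 1.0 / (1.0 + t**2)


# Choose C so that x(5) = 17: C = 17 - (125 - 4 ln 5 + arctan 5).
C = 17.0 - x(5.0, 0.0)

if __name__ == "__main__":
    print("C from the initial condition:", C)
    print("closed form 4 ln 5 - arctan 5 - 108:", 4.0 * math.log(5.0) - math.atan(5.0) - 108.0)
```

A centered difference quotient of `x` can be compared with `rhs` at any t > 0 as a further check that the antiderivative is correct.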
No Closed-Form Solution
We often want a numerical solution to a differential equation because (a) the closed-form solution may be very complicated and difficult to evaluate or (b) there is no other choice; that is, no closed-form solution can be found. Consider, for instance, the differential equation
$$x' = e^{-\sqrt{t^2 - \sin t}} + \ln\left|\sin t + \tanh t^3\right| \qquad (3)$$
The solution is obtained by taking the integral or antiderivative of the right-hand side. It can be done in principle, but not in practice. In other words, a function x exists for which dx/dt is the right-hand member of Equation (3), but it is not possible to write x(t) in terms of familiar functions.
Solving ordinary differential equations on a computer may require a large number of steps with small step size, so a significant amount of roundoff error can accumulate. Consequently, multiple-precision computations may be necessary on small-word-length computers.
An Example of a Practical Problem
Many practical problems in dynamics involve Newton’s three Laws of Motion, particularly the Second Law. It states symbolically that F = ma, where F is the force acting on a body of mass m and a is the resulting acceleration of that body. This law is a differential equation in disguise because a, the acceleration, is the derivative of velocity and velocity is, in turn, the derivative of the position. We illustrate with a simplified model of a rocket being fired at time t = 0. Its motion is to be vertically upward, and we measure its height with the variable x. The propulsive force is a constant value, namely, 5370. (Units are chosen to be consistent with each other.) There is a negative force due to air resistance whose magnitude is $v^{3/2}/\ln(2 + v)$, where v is the velocity of the rocket. The mass is decreasing at a steady rate due to the burning of fuel and is taken to be 321 − 24t. The independent variable is time, t. The fuel is completely consumed by the time t = 10. There is a downward force, due to gravity, of magnitude 981. Putting all these terms into the equation F = ma, we have
$$5370 - 981 - v^{3/2}/\ln(2 + v) = (321 - 24t)\,v' \qquad (4)$$
The initial condition is v = 0 at t = 0.
We shall develop methods to solve such differential equations in the succeeding sections. Moreover, one can also invoke a mathematical software system to solve this problem. A computer code for solving ordinary differential equations produces a table of discrete values, whereas the mathematical solution is a continuous function. One may need additional values within an interval for various purposes, such as plotting. Interpolation procedures can be used to obtain all values of the approximate numerical solution within a given interval. For example, a piecewise polynomial interpolation scheme may yield a numerical solution that is continuous and has a continuous first derivative matching the derivative of the solution. In using any ODE solver, an approximation to x′(t) is available from the fact that
$$x'(t) = f(t, x)$$
Mathematical packages for solving ODEs may include automatic plotting capabilities because the best way to make sense out of the large amount of data that may be returned as the solution is to display the solution curves on a graphical monitor or plot them on paper.
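To make the idea of a table of discrete values concrete, here is a minimal Python sketch for the rocket model, Equation (4). It is an illustration under stated assumptions, not the method of any particular software package: it uses the simplest stepping rule v(t + h) ≈ v(t) + h v′(t), which this chapter develops later as Euler’s method, and the names `v_prime` and `rocket_velocity` and the step count are choices made here.

```python
import math


def v_prime(t, v):
    """Equation (4) solved for v': (5370 - 981 - v^{3/2}/ln(2 + v)) / (321 - 24 t)."""
    drag = v**1.5 / math.log(2.0 + v)
    return (5370.0 - 981.0 - drag) / (321.0 - 24.0 * t)


def rocket_velocity(t_end=10.0, n=10000):
    """Step from v(0) = 0 to t_end with n equal steps; return a table of (t, v) pairs."""
    h = t_end / n
    t, v = 0.0, 0.0
    table = [(t, v)]
    for k in range(1, n + 1):
        v += h * v_prime(t, v)   # v(t + h) is approximated by v(t) + h v'(t)
        t = k * h
        table.append((t, v))
    return table


if __name__ == "__main__":
    table = rocket_velocity()
    for t, v in table[::1000]:   # print every 1000th row of the table
        print(f"t = {t:5.2f}   v = {v:10.4f}")
```

Values between table rows would come from interpolation, as described above; the slope needed for a derivative-matching interpolant is available at each row as `v_prime(t, v)`.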
Simplified Model of a Rocket
Using an ODE Computer Code
Solving Differential Equations and Integration
Integration Rules Yield ODE Methods
There is a close connection between solving differential equations and integration. Consider the differential equation
$$\frac{dx}{dr} = f(r, x), \qquad x(a) = s$$
Integrating from t to t + h, we have
$$\int_t^{t+h} dx = \int_t^{t+h} f(r, x(r))\,dr$$
Hence, we obtain
$$x(t + h) = x(t) + \int_t^{t+h} f(r, x(r))\,dr$$
Replacing the integral with one of the numerical integration rules from Chapter 5, we obtain a formula for solving the differential equation. For example, Euler’s method, Equation (6), is obtained from the left rectangle approximation (see Exercise 5.2.33):
$$\int_t^{t+h} f(r, x(r))\,dr \approx h f(t, x(t))$$
The trapezoid rule
$$\int_t^{t+h} f(r, x(r))\,dr \approx \frac{h}{2}\left[f(t, x(t)) + f(t + h, x(t + h))\right]$$
gives the formula
$$x(t + h) = x(t) + \frac{h}{2}\left[f(t, x(t)) + f(t + h, x(t + h))\right]$$
Since x(t + h) appears on both sides of this equation, it is called an implicit formula. If Euler’s method
$$x(t + h) = x(t) + h f(t, x(t))$$
is used for the x(t + h) on the right-hand side, then we obtain the Runge-Kutta formula of order 2, namely, Equation (10) in Section 7.2.
Integration versus Solving IVP
Using the Fundamental Theorem of Calculus, we can easily show that an approximate numerical value for the integral
$$\int_a^b f(r, x(r))\,dr$$
can be computed by solving the following initial-value problem for x(b):
$$\frac{dx}{dr} = f(r, x), \qquad x(a) = 0$$
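Both ideas in this subsection can be sketched in a few lines of Python. The code below is an illustration under stated assumptions: `heun_step` applies the trapezoid formula with Euler’s method supplying the x(t + h) on the right-hand side, and `integrate_via_ivp` (a name chosen here) computes an integral by stepping the initial-value problem x′ = g(r), x(a) = 0 up to r = b.

```python
import math


def heun_step(f, t, x, h):
    """One trapezoid-formula step, with an Euler predictor replacing x(t + h)."""
    predictor = x + h * f(t, x)                           # Euler's method
    return x + 0.5 * h * (f(t, x) + f(t + h, predictor))  # trapezoid rule


def integrate_via_ivp(g, a, b, n=100):
    """Approximate the integral of g over [a, b] by solving x' = g(r), x(a) = 0."""
    h = (b - a) / n
    x = 0.0
    for k in range(n):
        x = heun_step(lambda r, _x: g(r), a + k * h, x, h)
    return x


if __name__ == "__main__":
    # The exact value of the integral of sin over [0, pi] is 2.
    print(integrate_via_ivp(math.sin, 0.0, math.pi))
```

When the right-hand side does not depend on x, as here, the predictor is irrelevant and the stepping reduces to the composite trapezoid rule.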
Generic First-Order ODE
Vector Fields
Consider a generic first-order differential equation with prescribed initial condition:
$$x'(t) = f(t, x(t)), \qquad x(a) = b$$
Before addressing the question of solving such an initial-value problem numerically, it is helpful to think about the intuitive meaning of the equation. The function f provides the slope of the solution function in the tx-plane. At every point where f(t, x) is defined, we can imagine a short line segment being drawn through that point and having the prescribed slope. We cannot graph all of these short segments, but we can draw as many as we wish, in the hope of understanding how the solution function x(t) traces its way through this forest of line segments while keeping its slope at every point equal to the slope of the line segment drawn at that point. The diagram of line segments illustrates discretely the so-called vector field of the differential equation.
For example, let us consider the equation
$$x' = \sin(x + t^2)$$
with initial value x (0) = 0. In the rectangle described by the inequalities −4 ≦ x ≦ 4 and −4 ≦ t ≦ 4, we can direct mathematical software, such as MATLAB, to furnish a picture of the vector field engendered by our differential equation. Using commands in an environment of windows, we bring up a window with the differential equation shown in a rectangle. Behind the scenes, the mathematical software carries out immense calculations to provide the vector field for this differential equation, and displays it, correctly labeled. To see the solution going through any point in the diagram, it is necessary only to use the mouse to position the pointer on such a point. By clicking the left mouse button, the software displays the solution sought. By using such a software tool, one can see immediately the effect of changing initial conditions. For the problem under consideration, several solution curves (corresponding to different initial values) are shown in Figure 7.1.
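The forest of line segments is easy to compute without any plotting package. The following Python sketch is an assumption-laden illustration (the function name `slope_segments`, the grid size, and the segment length are choices made here): it returns the endpoints of short segments of slope f(t, x) centered on a grid in the rectangle −4 ≦ t ≦ 4, −4 ≦ x ≦ 4, which any plotting tool could then draw.

```python
import math


def slope_segments(f, t_min, t_max, x_min, x_max, n=9, length=0.3):
    """Return segments ((t0, x0), (t1, x1)) of slope f(t, x) on an n-by-n grid."""
    segments = []
    for i in range(n):
        t = t_min + i * (t_max - t_min) / (n - 1)
        for j in range(n):
            x = x_min + j * (x_max - x_min) / (n - 1)
            slope = f(t, x)
            # Half-length step along the unit direction (1, slope)/sqrt(1 + slope^2).
            dt = 0.5 * length / math.sqrt(1.0 + slope * slope)
            dx = slope * dt
            segments.append(((t - dt, x - dx), (t + dt, x + dx)))
    return segments


field = slope_segments(lambda t, x: math.sin(x + t * t), -4.0, 4.0, -4.0, 4.0)

if __name__ == "__main__":
    for (t0, x0), (t1, x1) in field[:5]:
        print(f"({t0:6.3f}, {x0:6.3f}) -> ({t1:6.3f}, {x1:6.3f})")
```

Every segment has the same length, so only its direction carries information; a solution curve must be tangent to the segment at each grid point it passes through.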
First Vector Field Example
FIGURE 7.1 Vector field and some solution curves for $x' = \sin(x + t^2)$ (horizontal axis t and vertical axis x, each from −4 to 4)
Second-Order Field Example
Another example, treated in the same way, is the differential equation
$$x' = x^2 - t$$
Figure 7.2 shows a vector field for this equation and some of its solutions. Notice the phenomenon of many quite different curves all seeming to arise from the same initial condition. What is happening here? This is an extreme example of a differential equation whose solutions are exceedingly sensitive to the initial condition! Expect trouble in solving this differential equation with an initial value prescribed at t = −2.
FIGURE 7.2 Vector field and some solution curves for $x' = x^2 - t$ (horizontal axis t from −2 to 10, vertical axis x from −4 to 4)
How do we know that the differential equation $x' = x^2 - t$, together with an initial value, $x(t_0) = x_0$, has a unique solution? There are many theorems in the subject of differential equations that concern such existence and uniqueness questions. One of the easiest to use is as follows.
■ Theorem 1
Uniqueness of Initial-Value Problems
If f and ∂f/∂x are continuous in the rectangle defined by $|t - t_0| < \alpha$ and $|x - x_0| < \beta$, then the initial-value problem $x' = f(t, x)$, $x(t_0) = x_0$ has a unique continuous solution in some interval $|t - t_0| < \epsilon$.
From the theorem just quoted, we cannot conclude that the solution in question is defined for $|t - t_0| < \beta$. However, the value of $\epsilon$ in the theorem is at least β/M, where M is an upper bound for |f(t, x)| in the original rectangle.
Taylor Series Methods
The numerical method described in this section does not have the utmost generality, but it is natural and capable of high precision. Its principle is to represent the solution of a differential equation locally by a few terms of its Taylor series.
In what follows, we shall assume that our solution function x is represented by its Taylor series∗
Taylor Series
$$x(t + h) = x(t) + h x'(t) + \frac{1}{2!}h^2 x''(t) + \frac{1}{3!}h^3 x'''(t) + \frac{1}{4!}h^4 x^{(iv)}(t) + \frac{1}{5!}h^5 x^{(v)}(t) + \cdots + \frac{1}{m!}h^m x^{(m)}(t) + \cdots \qquad (5)$$
∗Remember that some functions such as $e^{-1/x^2}$ are smooth, but not represented by a Taylor series at 0.
Euler’s Method
For numerical purposes, the Taylor series truncated after m + 1 terms enables us to compute x(t + h) rather accurately if h is small and if $x(t), x'(t), x''(t), \ldots, x^{(m)}(t)$ are known. When only terms through $h^m x^{(m)}(t)/m!$ are included in the Taylor series, the method that results is called the Taylor series method of order m. We begin with the case m = 1.
Euler’s Method and Pseudocode
The Taylor series method of order 1 is known as Euler’s method. To find approximate values of the solutions to the initial-value problem
$$x' = f(t, x(t)), \qquad x(a) = x_a$$
over the interval [a, b], the first two terms in the Taylor series (5) are used:
$$x(t + h) \approx x(t) + h x'(t)$$
Hence, the formula
$$x(t + h) = x(t) + h f(t, x(t)) \qquad (6)$$
can be used to step from t = a to t = b with n steps of size h = (b − a)/n.
The pseudocode for Euler’s method can be written as follows, where some prescribed values for n, a, b, and $x_a$ are used:
Euler Pseudocode
   program Euler
   integer k; real h, t
   integer n ← 100
   external function f
   real a ← 1, b ← 2, x ← −4
   h ← (b − a)/n
   t ← a
   output 0, t, x
   for k = 1 to n
      x ← x + h f(t, x)
      t ← t + h
      output k, t, x
   end for
   end program Euler
To use this program, a code for f(t, x) is needed, as shown in Example 1.
EXAMPLE 1
Using Euler’s method, compute an approximate value for x(2) for the differential equation
$$x' = 1 + x^2 + t^3$$
with the initial value x(1) = −4 using 100 steps.
Solution
Use the pseudocode above with the initial values given and combine with the following function:
   real function f(t, x)
   real t, x
   f ← 1 + x² + t³
   end function
The computed value is x(2) ≈ 4.23585. ■
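The Euler pseudocode and the function f of Example 1 translate almost line for line into Python. The sketch below is one possible rendering, not the book’s code; the only deliberate change is that t is recomputed as a + kh rather than accumulated, which keeps roundoff out of t.

```python
def euler(f, a, b, xa, n):
    """Step x' = f(t, x), x(a) = xa, from a to b in n Euler steps; return x(b)."""
    h = (b - a) / n
    t, x = a, xa
    for k in range(1, n + 1):
        x = x + h * f(t, x)   # Formula (6)
        t = a + k * h         # recomputed from k instead of t + h
    return x


def f(t, x):
    """Right-hand side of Example 1."""
    return 1.0 + x**2 + t**3


if __name__ == "__main__":
    # Example 1 in the text reports x(2) ≈ 4.23585 for these inputs.
    print(euler(f, 1.0, 2.0, -4.0, 100))
```

Passing f as an argument plays the role of the pseudocode’s external function declaration, so the same routine can be reused for other right-hand sides.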
We can write a computer program to execute Euler’s method on this very simple problem:
x′(t) = x x(0) = 1
We obtain the results x(2) ≈ 7.3891. The plot produced by the code is shown in Figure 7.3. The solution, x(t) = et, is the solid curve, and the points produced by Euler’s method are shown by dots. Can you understand why the dots are always below the curve?
FIGURE 7.3 Euler’s method curves (horizontal axis x from 0 to 4, vertical axis y from 0 to 50)
Important Questions
Sample IVP
Before accepting these results and continuing, one should raise some questions such as: How accurate are the answers? Are higher-order Taylor series methods ever needed? Unfortunately, Euler’s method is not very accurate because only two terms in the Taylor series (5) are used. Consequently, the truncation error is $O(h^2)$.
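The accuracy question can be explored on the sample problem x′ = x, x(0) = 1, where Euler’s method has a closed form: each step multiplies x by 1 + h, so n steps give (1 + h)ⁿ. Since 1 + h < eʰ for h > 0, every computed point lies below the solution curve eᵗ. The Python sketch below is an illustration with names chosen here; it prints the error at t = 2 for several step counts.

```python
import math


def euler_exp(n, b=2.0):
    """Euler's method for x' = x, x(0) = 1 on [0, b] with n steps; equals (1 + h)^n."""
    h = b / n
    x = 1.0
    for _ in range(n):
        x += h * x
    return x


if __name__ == "__main__":
    exact = math.exp(2.0)
    for n in (10, 100, 1000):
        approx = euler_exp(n)
        print(f"n = {n:5d}   x(2) ~ {approx:.6f}   error = {exact - approx:.6f}")
```

Although each step commits an error of order h², there are n = (b − a)/h steps, so the error at a fixed time shrinks only in proportion to h: ten times as many steps buys roughly one more correct digit.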
Taylor Series Method of Higher Order
Example 1 can be used to explain the Taylor series method of higher order. Consider again the initial-value problem
$$x' = 1 + x^2 + t^3, \qquad x(1) = -4 \qquad (7)$$
If the functions in the differential equation are differentiated several times with respect to t, the results are as follows. (Remember that a function of x must be differentiated with respect to t by using the chain rule.)
$$\begin{aligned} x' &= 1 + x^2 + t^3 \\ x'' &= 2xx' + 3t^2 \\ x''' &= 2xx'' + 2x'x' + 6t \\ x^{(iv)} &= 2xx''' + 6x'x'' + 6 \end{aligned} \qquad (8)$$
If numerical values of t and x(t) are known, these four formulas, applied in order, yield x′(t), x′′(t), x′′′(t), and x(iv)(t). Thus, it is possible from this work to use the first five terms in the Taylor series, Equation (5). Since x(1) = −4, we have a suitable starting point, and we select n = 100, which determines h. Next, we can compute an approximation to x(a+h) from Formulas (5) and (8). The same process can be repeated to compute x(a + 2h) using x(a + h), x′(a + h), . . . , x(iv)(a + h).
Pseudocode and Results
Here is the pseudocode:
liable. Here is a coarse assessment. Since terms up to 1 h4x(iv)(t) are included, the first 24
7.1 Taylor Series Methods 307
program Taylor
integer k; real h, t, x, x′, x′′, x′′′, x(iv) integer n ← 100
real a ← 1, b ← 2, x ← −4
h ← (b − a)/n
t←a
output 0, t, x
for k = 1 to n
x′ ←1+x2 +t3
x′′ ←2xx′ +3t2
x′′′ ←2xx′′ +2(x′)2 +6t
t ← a + kh
output k, t, x end for
end program Taylor
x(iv) ←2xx′′′ +6x′x′′ +6
x ← x +hx′ + 1hx′′ + 1hx′′′ + 1hx(iv) 234
Taylor Pseudocode
Computer Results
A few words of explanation may be helpful here. In this example, we compute the solution of the differential equation over the interval a = 1 ≦ t ≦ 2 = b using 100 steps. In each step, the current value of t is an integer multiple of the step size h. The assignment statements that define x′, x′′, x′′′, and x(iv) are simply carrying out calculations of the derivatives according to Equation (8). The final calculation carries out the evaluation of the Taylor series in Equation (5) using five terms. Since this equation is a polynomial in h, it is evaluated most efficiently by using nested multiplication, which explains the formula for x in the pseudocode. The computation t ← t + h may cause a small amount of roundoff error to accumulate in the value of t. This is minimized by using t ← a + kh.
As one might expect, the results of using only two terms in the Taylor series (Euler’s method) are not as accurate as when five terms are used:
Euler’s Method Taylor Series Method (Order 4)
x (2) ≈ 4.23585 41 x (2) ≈ 4.37120 96
By further analysis, one can prove that the correct value to more significant figures is x (2) ≈ 4.37122 1866. Here, the computations were done with more precision just to show that lack of precision was not a contributing factor.
Types of Errors
When the pseudocode described above is programmed and run on a computer, what sort
of accuracy can we expect? Are all the digits printed by the machine for the variable x
accurate? Of course not! On the other hand, it is not easy to say how many digits are re-
term not included in the Taylor series is 1 h5x(v)(t). The error may be larger than this, 120
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
308 Chapter 7
Initial Values Problems
Local Truncation Error
Roundoff Error
but the factor h5 = (10−2)5 ≈ 10−10 is affecting only the tenth decimal place. The printed solution is perhaps accurate to eight decimal places. Bridges or airplanes should not be built on such shoddy analysis, but for now, our attention is focused on the general form of the procedure.
Actually, there are two types of errors to consider. At each step, if x(t) is known and x(t + h) is computed from the first few terms of the Taylor series, an error occurs because we have truncated the Taylor series. This error, then, is called the truncation error or, to be more precise, the local truncation error. In the preceding example, it is roughly $\frac{1}{120} h^5 x^{(v)}(\xi)$. In this situation, we say that the local truncation error is of order $h^5$, abbreviated by $O(h^5)$.
The second type of error obviously present is due to the accumulated effects of all local truncation errors. Indeed, the calculated value of x(t + h) is in error because x(t) is already wrong (because of previous truncation errors) and because another local truncation error occurs in the computation of x(t + h) by means of the Taylor series.
Additional sources of errors must be considered in a complete theory. One is roundoff
error. Although not serious in any one step of the solution procedure, after hundreds or thousands of steps, it may accumulate and contaminate the calculated solution seriously! Remember that an error that is made at a certain step is carried forward into all succeeding steps. Depending on the differential equation and the method that is used to solve it, such errors may be magnified by succeeding steps.
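The way local truncation errors accumulate can be observed numerically. The sketch below is an illustration only: it uses Euler’s method (local error of order h²) on the assumed test problem x′ = x, x(0) = 1 over [0, 1], and the helper names euler and global_error are hypothetical. Because roughly 1/h local errors are added up, the accumulated error behaves like a multiple of h, so doubling the number of steps should roughly halve it.

```python
import math

def euler(f, a, b, x0, n):
    """Euler's method: n steps of size h = (b - a)/n starting from x(a) = x0."""
    h = (b - a) / n
    t, x = a, x0
    for k in range(1, n + 1):
        x = x + h * f(t, x)
        t = a + k * h
    return x

def global_error(n):
    """Accumulated error at t = 1 for the assumed problem x' = x, x(0) = 1."""
    return abs(euler(lambda t, x: x, 0.0, 1.0, 1.0, n) - math.e)

if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, global_error(n))
```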
Taylor Series Method Using Symbolic Computations
Various routine mathematical calculations of both a nonnumerical and a numerical type, including differentiation and integration of even rather complicated expressions, can now be turned over to the computer. Of course, this applies only to a restricted class of functions, but this class is broad enough to include all the functions that one encounters in the typical calculus textbook. With the use of such a program for symbolic computations, the Taylor series method of high order can be carried out without difficulty. Using the algebraic manipulation potentialities in mathematical software such as MATLAB, Maple, or Mathematica, we can write code to solve the initial value problem (7). The final result is x(2) ≈ 4.37121 00522 49692 27234 569.
Summary 7.1
• We wish to solve the first-order initial-value problem
$$x'(t) = f(t, x(t)), \qquad x(a) = x_a$$
over the interval [a, b] with step size h = (b − a)/n.
• The Taylor series method of order m is
$$x(t+h) = x(t) + h x'(t) + \frac{1}{2!} h^2 x''(t) + \frac{1}{3!} h^3 x'''(t) + \frac{1}{4!} h^4 x^{(iv)}(t) + \frac{1}{5!} h^5 x^{(v)}(t) + \cdots + \frac{1}{m!} h^m x^{(m)}(t)$$
where all of the derivatives $x'', x''', \ldots, x^{(m)}$ have been determined analytically.
• Euler’s method is the Taylor series method of order 1 and can be written as
$$x(t+h) = x(t) + h f(t, x(t))$$
Because only two terms in the Taylor series are used, the truncation error is large, and the results cannot be computed with much accuracy. Consequently, higher-order Taylor series methods are used most often. Of course, they require determining more derivatives, with more chances for mathematical errors.
Exercises 7.1
1. Give the solutions of these differential equations:
   ᵃa. $x' = t^3 + 7t^2 - t^{1/2}$
   ᵃb. $x' = x$
   c. $x' = -x$
   d. $x'' = -x$
   ᵃe. $x'' = x$
   f. $x'' + x' - 2x = 0$ Hint: Try $x = e^{at}$.
2. Give the solutions of these initial-value problems:
   ᵃa. $x' = t^2 + t^{1/3}$, $x(0) = 7$
   b. $x' = 2x$, $x(0) = 15$
   c. $x'' = -x$, $x(\pi) = 0$, $x'(\pi) = 3$
3. Solve the following differential equations:
   a. $x' = 1 + x^2$ Hint: $1 + \tan^2 t = \sec^2 t$
   b. $x' = \sqrt{1 - x^2}$ Hint: $\sin^2 t + \cos^2 t = 1$
   ᵃc. $x' = t^{-1} \sin t$ Hint: See Computer Exercise 5.1.2.
   ᵃd. $x' + tx = t^2$ Hint: Multiply by $f(t) = \exp(t^2/2)$ so left-hand side becomes $(xf)'$.
4. Solve Exercise 7.1.3b by substituting a power series $x(t) = \sum_{n=0}^{\infty} a_n t^n$ and then determining appropriate values of the coefficients.
5. Determine $x''$ when $x' = xt^2 + x^3 + e^{x}t$.
ᵃ6. Find a polynomial p with the property $p - p' = t^3 + t^2 - 2t$.
7. The general first-order linear differential equation is $x' + px + q = 0$, where p and q are functions of t. Show that the solution is $x = -y^{-1}(z + c)$, where y and z are functions obtained as follows: Let u be an antiderivative of p. Put $y = e^u$, and let z be an antiderivative of yq.
8. Here is an initial-value problem that has two solutions: $x' = x^{1/3}$, $x(0) = 0$. Verify that the two solutions are $x_1(t) = 0$ and $x_2(t) = \left(\frac{2}{3}t\right)^{3/2}$ for t ≧ 0. If the Taylor series method is applied, what happens?
ᵃ9. Consider the problem $x' = x$. If the initial condition is x(0) = c, then the solution is $x(t) = ce^t$. If a roundoff error of ε occurs in reading the value of c into the computer, what effect is there on the solution at the point t = 10? At t = 20? Do the same for $x' = -x$.
ᵃ10. If the Taylor series method is used on the initial-value problem $x' = t^2 + x^3$, $x(0) = 0$, and if we intend to use the derivatives of x up to and including $x^{(iv)}$, what are the five main equations that must be programmed?
11. In solving the following differential equations by the Taylor series method of order n, what are the main equations in the algorithm?
   ᵃa. $x' = x + e^x$, n = 4
   b. $x' = x^2 - \cos x$, n = 5
ᵃ12. Calculate an approximate value for x(0.1) using one step of the Taylor series method of order 3 on the ordinary differential equation
$$x'' = x^2 e^t + x', \qquad x(0) = 1, \quad x'(0) = 2$$
ᵃ13. Suppose that a differential equation is solved numerically on an interval [a, b] and that the local truncation error is $ch^p$. Show that if all truncation errors have the same sign (the worst possible case), then the total truncation error is $(b - a)ch^{p-1}$, where h = (b − a)/n.
ᵃ14. If we plan to use the Taylor series method with terms up to $h^{20}$, how should the computation $\sum_{n=0}^{20} x^{(n)}(t) h^n / n!$ be carried out? Assume that $x(t), x^{(1)}(t), x^{(2)}(t), \ldots$, and $x^{(20)}(t)$ are available.
   Hint: Only a few statements suffice.
15. Explain how to use the ODE method that is based on the trapezoid rule:
$$\hat{x}(t+h) = x(t) + h f(t, x(t))$$
$$x(t+h) = x(t) + \frac{h}{2}\left[f(t, x(t)) + f(t+h, \hat{x}(t+h))\right]$$
   This is called the improved Euler’s method or Heun’s method. Here, $\hat{x}(t+h)$ is computed by using Euler’s method.
16. (Continuation) Use the improved Euler’s method to solve the following differential equation over the interval [0, 1] with step size h = 0.1:
$$x' = -x + t + \tfrac{1}{2}, \qquad x(0) = 1$$
17. Consider the initial-value problem
$$x' = -100x^2, \qquad x(0) = 1$$
   In the improved Euler’s method, replace $\hat{x}(t+h)$ with $x(t+h)$ and try to solve with one step of size h = 0.1. Explain what happens. Find the closed-form solution by substituting $x = (a + bt)^c$ and determining a, b, c.
Computer Exercises 7.1
ᵃ1. Write and test a program for applying the Taylor series method to the initial-value problem
$$x' = x + x^2, \qquad x(1) = \frac{e}{16 - e} = 0.20466\ 34172\ 89155\ 26943$$
   Generate the solution in the interval [1, 2.77]. Use derivatives up to $x^{(v)}$ in the Taylor series. Use h = 1/100. Print out for comparison the values of the exact solution $x(t) = e^t/(16 - e^t)$. Verify that it is the exact solution.
2. Write a program to solve each problem on the indicated intervals. Use the Taylor series method with h = 1/100, and include terms to $h^3$. Account for any difficulties.
   a. $x' = t + x^2$ on [0, 0.9], x(0) = 1
   ᵃb. $x' = x - t$ on [1, 1.75], x(1) = 1
   ᵃc. $x' = tx + t^2x^2$ on [2, 5], x(2) = −0.63966 25333
ᵃ3. Solve the differential equation $x' = x$ with initial value x(0) = 1 by the Taylor series method on the interval [0, 10]. Compare the result with the exact solution $x(t) = e^t$. Use derivatives up to and including the tenth. Use step size h = 1/100.
4. Solve for x(1):
   ᵃa. $x' = 1 + x^2$, x(0) = 0
   b. $x' = (1 + t)^{-1}x$, x(0) = 1
   Use the Taylor series method of order 5 with h = 1/100, and compare with the exact solutions, which are tan t and 1 + t, respectively.
ᵃ5. Solve the initial-value problem $x' = t + x + x^2$ on the interval [0, 1] with initial condition x(1) = 1. Use the Taylor series method of order 5.
6. Solve the initial-value problem $x' = (x + t)^2$ with x(0) = −1 on the interval [0, 1] using the Taylor series method with derivatives up to and including the fourth. Compare this to Taylor series methods of orders 1, 2, and 3.
ᵃ7. Write a program to solve on the interval [0, 1] the initial-value problem
$$x' = tx, \qquad x(0) = 1$$
   using the Taylor series method of order 20; that is, include terms in the Taylor series up to and including $h^{20}$. Observe that a simple recursive formula can be used to obtain $x^{(n)}$ for n = 1, 2, ..., 20.
8. Write a program to solve the initial-value problem $x' = \sin x + \cos t$, using the Taylor series method. Continue the solution from t = 2 to t = 5, starting with x(2) = 0.32. Include terms up to and including $h^3$.
ᵃ9. Write a program to solve the initial-value problem $x' = e^{t}x$ with x(2) = 1 on the interval 0 ≦ t ≦ 2 using the Taylor series method. Include terms up to $h^4$.
ᵃ10. Write a program to solve $x' = tx + t^4$ on the interval 0 ≦ t ≦ 5 with x(5) = 3. Use the Taylor series method with terms to $h^4$.
11. Write a program to solve the initial-value problem of the example in this section over the interval [1, 3]. Explain.
12. Compute a table, at 101 equally spaced points in the interval [0, 2], of the Dawson integral
$$f(x) = \exp(-x^2) \int_0^x \exp(t^2)\, dt$$
   by numerically solving, with the Taylor series method of suitable order, an initial-value problem of which f is the solution. Make the table accurate to eight decimal places, and print only eight decimal places.
   Hint: Find the relationship between $f'(x)$ and $x f(x)$. The Fundamental Theorem of Calculus is useful.
   Check values: f(1) = 0.53807 95069 and f(2) = 0.30134 03889.
13. Solve the initial-value problem $x' = t^3 + e^x$ with x(3) = 7.4 on the interval 0 ≦ t ≦ 3 by means of the fourth-order Taylor series method.
14. Use a symbolic manipulation package such as Maple to solve the differential equations of Example 1 by the fourth-order Taylor series method to high accuracy, carrying 24 decimal digits.
15. Program the pseudocodes Euler and Taylor and compare the numerical results to those given in the text.
16. (Continuation) Repeat by calling directly an ordinary differential equation solver routine within a mathematical software system such as MATLAB, Maple, or Mathematica.
17. Use mathematical software such as MATLAB, Maple, or Mathematica to find analytical or numerical solutions for these ordinary differential equations:
   a. ODE (2) b. ODE (3) c. ODE (4)
18. Write computer programs to reproduce these figures:
   a. Figure 7.1 b. Figure 7.2 c. Figure 7.3
7.2 Runge-Kutta Methods
Introduction
The methods named after Carl Runge and Wilhelm Kutta are designed to imitate the Taylor series method without requiring analytic differentiation of the original differential equation. Recall that in using the Taylor series method on the initial-value problem
$$x' = f(t, x), \qquad x(a) = x_a \tag{1}$$
we need to obtain x′′, x′′′, ... by differentiating the function f. This requirement can be a serious obstacle to using the method. The user of this method must do some preliminary analytical work before writing a computer program. Ideally, a method for solving Equation (1) should involve nothing more than writing a code to evaluate f. The Runge-Kutta methods accomplish this.
For purposes of exposition, the Runge-Kutta method of order 2 is presented, although its low precision usually precludes its use in actual scientific calculations. Later, the Runge-Kutta method of order 4 is given without a derivation. It is in common use. The order-2 Runge-Kutta procedure does find application in real-time calculations on small computers. For example, it is used in some aircraft by the on-board mini-computer.
At the heart of any method for solving an initial-value problem is a procedure for advancing the solution function one step at a time; that is, a formula must be given for x(t + h) in terms of known quantities. As examples of known quantities, we can cite x(t), x(t − h), x(t − 2h), ... if the solution process has gone through a number of steps. At the beginning, only x(a) is known. Of course, we assume that f(t, x) can be computed for any point (t, x).
Taylor Series for f(x, y)
Before explaining the Runge-Kutta method of order 2, let us present the Taylor series in two variables. The infinite series is
$$f(x+h, y+k) = \sum_{i=0}^{\infty} \frac{1}{i!} \left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{i} f(x, y) \tag{2}$$
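This series is analogous to the Taylor series in one variable given by Equation (11) in Section 1.2. The mysterious-looking terms in Equation (2) are interpreted as follows:
$$\left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{0} f(x, y) = f$$
$$\left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{1} f(x, y) = h\frac{\partial f}{\partial x} + k\frac{\partial f}{\partial y}$$
$$\left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{2} f(x, y) = h^2\frac{\partial^2 f}{\partial x^2} + 2hk\frac{\partial^2 f}{\partial x\,\partial y} + k^2\frac{\partial^2 f}{\partial y^2}$$
$$\vdots$$
where f and all partial derivatives are evaluated at (x, y). As in the one-variable case, if the Taylor series is truncated, an error term or remainder term is needed to restore the equality. Here is the appropriate equation:
$$f(x+h, y+k) = \sum_{i=0}^{n-1} \frac{1}{i!} \left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{i} f(x, y) + \frac{1}{n!} \left( h\frac{\partial}{\partial x} + k\frac{\partial}{\partial y} \right)^{n} f(\bar{x}, \bar{y}) \tag{3}$$
The point $(\bar{x}, \bar{y})$ lies on the line segment that joins (x, y) to (x + h, y + k) in the plane.
In applying Taylor series, we use subscripts to denote partial derivatives. So, for instance, we define
$$f_x = \frac{\partial f}{\partial x}, \qquad f_t = \frac{\partial f}{\partial t}, \qquad f_{xx} = \frac{\partial^2 f}{\partial x^2}, \qquad f_{xt} = \frac{\partial^2 f}{\partial t\,\partial x} \tag{4}$$
We are dealing with functions for which the order of these subscripts is immaterial; for example, $f_{xt} = f_{tx}$. Thus, we have
$$f(x+h, y+k) = f + (h f_x + k f_y) + \frac{1}{2!}\left( h^2 f_{xx} + 2hk f_{xy} + k^2 f_{yy} \right) + \frac{1}{3!}\left( h^3 f_{xxx} + 3h^2k f_{xxy} + 3hk^2 f_{xyy} + k^3 f_{yyy} \right) + \cdots$$
As special cases, we notice that
$$f(x+h, y) = f + h f_x + \frac{h^2}{2!} f_{xx} + \frac{h^3}{3!} f_{xxx} + \cdots$$
$$f(x, y+k) = f + k f_y + \frac{k^2}{2!} f_{yy} + \frac{k^3}{3!} f_{yyy} + \cdots$$
Runge-Kutta Method of Order 2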
In the Runge-Kutta method of order 2, a formula is adopted that has two function evaluations of the special form
$$K_1 = h f(t, x)$$
$$K_2 = h f(t + \alpha h, x + \beta K_1)$$
and a linear combination of these is added to the value of x at t to obtain the value at t + h:
$$x(t+h) = x(t) + w_1 K_1 + w_2 K_2$$
or, equivalently,
$$x(t+h) = x(t) + w_1 h f(t, x) + w_2 h f(t + \alpha h, x + \beta h f(t, x)) \tag{5}$$
The objective is to determine constants $w_1$, $w_2$, α, and β so that Equation (5) is as accurate as possible. Explicitly, we want to reproduce as many terms as possible in the Taylor series
$$x(t+h) = x(t) + h x'(t) + \frac{1}{2!} h^2 x''(t) + \frac{1}{3!} h^3 x'''(t) + \frac{1}{4!} h^4 x^{(iv)}(t) + \frac{1}{5!} h^5 x^{(v)}(t) + \cdots \tag{6}$$
Now compare Equation (5) with Equation (6). One way to force them to agree up through the term in h is to set $w_1 = 1$ and $w_2 = 0$ because x′ = f. However, this simply reproduces Euler’s method (described in the preceding section), and its order of precision is only 1. Agreement up through the $h^2$ term is possible by a more adroit choice of parameters. To see how, apply the two-variable form of the Taylor series to the final term in Equation (5). We use n = 2 in the two-variable Taylor series given by Formula (3), with t, αh, x, and βhf playing the role of x, h, y, and k, respectively:
$$f(t + \alpha h, x + \beta h f) = f + \alpha h f_t + \beta h f f_x + \frac{1}{2} \left( \alpha h \frac{\partial}{\partial t} + \beta h f \frac{\partial}{\partial x} \right)^{2} f(\bar{x}, \bar{y})$$
Using the above equation results in a new form for Equation (5). We have
$$x(t+h) = x(t) + (w_1 + w_2) h f + \alpha w_2 h^2 f_t + \beta w_2 h^2 f f_x + O(h^3) \tag{7}$$
Equation (6) is also given a new form by using differential Equation (1). Since x′ = f, we have
$$x'' = \frac{dx'}{dt} = \frac{df(t, x)}{dt} = \frac{\partial f}{\partial t}\frac{dt}{dt} + \frac{\partial f}{\partial x}\frac{dx}{dt} = f_t + f_x f$$
So Equation (6) implies that
$$x(t+h) = x + h f + \frac{1}{2} h^2 f_t + \frac{1}{2} h^2 f f_x + O(h^3) \tag{8}$$
Agreement between Equations (7) and (8) is achieved by stipulating that
$$w_1 + w_2 = 1, \qquad \alpha w_2 = \frac{1}{2}, \qquad \beta w_2 = \frac{1}{2} \tag{9}$$
A convenient solution to Equation (9) is
$$\alpha = 1, \qquad \beta = 1, \qquad w_1 = \frac{1}{2}, \qquad w_2 = \frac{1}{2}$$
The resulting second-order Runge-Kutta method is then, from Equation (5),
$$x(t+h) = x(t) + \frac{h}{2} f(t, x) + \frac{h}{2} f(t + h, x + h f(t, x))$$
or, equivalently,
$$x(t+h) = x(t) + \frac{1}{2}(K_1 + K_2) \tag{10}$$
where
$$K_1 = h f(t, x)$$
$$K_2 = h f(t + h, x + K_1)$$
Formula (10) shows that the solution function at t + h is computed at the expense of two evaluations of the function f .
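Notice that other solutions for the nonlinear System (9) are possible. For example, α can be arbitrary (but nonzero), and then
$$\beta = \alpha, \qquad w_1 = 1 - \frac{1}{2\alpha}, \qquad w_2 = \frac{1}{2\alpha}$$
One can show (see Exercise 7.2.10) that the error term for Runge-Kutta methods of order 2 is
$$\frac{h^3}{4}\left( \frac{2}{3} - \alpha \right)\left( \frac{\partial}{\partial t} + f\frac{\partial}{\partial x} \right)^{2} f + \frac{h^3}{6} f_x \left( \frac{\partial}{\partial t} + f\frac{\partial}{\partial x} \right) f \tag{11}$$
Notice that the method with α = 2/3 is especially interesting. However, none of the second-order Runge-Kutta methods is widely used on large computers because the error is only $O(h^3)$.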
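The one-parameter family of second-order methods described by Equation (5) and System (9) can be sketched as follows. This is a minimal illustration, not a library routine: the name rk2 and its argument order are assumptions, the weights are computed from α as β = α, w₂ = 1/(2α), w₁ = 1 − w₂, and the test problem x′ = x is chosen here only because its exact solution is known.

```python
import math

def rk2(f, t, x, h, alpha, n):
    """Carry out n steps of the order-2 Runge-Kutta family with parameter alpha != 0."""
    beta = alpha
    w2 = 1.0 / (2.0 * alpha)
    w1 = 1.0 - w2
    t0 = t
    for j in range(1, n + 1):
        k1 = h * f(t, x)
        k2 = h * f(t + alpha * h, x + beta * k1)
        x = x + w1 * k1 + w2 * k2
        t = t0 + j * h
    return x

def rk2_error(alpha, n):
    """Error at t = 1 for the assumed test problem x' = x, x(0) = 1."""
    return abs(rk2(lambda t, x: x, 0.0, 1.0, 1.0 / n, alpha, n) - math.e)

if __name__ == "__main__":
    for alpha in (0.5, 2.0 / 3.0, 1.0):
        # Halving h should cut the accumulated error by about a factor of 4.
        print(alpha, rk2_error(alpha, 50), rk2_error(alpha, 100))
```

With α = 1 this reduces to Formula (10); the other values of α give different members of the same family.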
Runge-Kutta Method of Order 4
One algorithm in common use for the initial-value Problem (1) is the classical fourth-order Runge-Kutta method. Its formulas are as follows:
$$x(t+h) = x(t) + \frac{1}{6}(K_1 + 2K_2 + 2K_3 + K_4) \tag{12}$$
where
$$K_1 = h f(t, x)$$
$$K_2 = h f\left(t + \tfrac{1}{2}h, x + \tfrac{1}{2}K_1\right)$$
$$K_3 = h f\left(t + \tfrac{1}{2}h, x + \tfrac{1}{2}K_2\right)$$
$$K_4 = h f(t + h, x + K_3)$$
The derivation of the Runge-Kutta formulas of order 4 is tedious. Very few textbooks give the details. Two exceptions are the books of Henrici [1962] and Ralston [1965]. There exist higher-order Runge-Kutta formulas, and they are still more laborious to derive. However, symbolic manipulation software packages such as in MATLAB, Maple, or Mathematica can be used to develop the formulas.
As shown, the solution x(t + h) is obtained at the expense of evaluating the function f four times. The final formula agrees with the Taylor expansion up to and including the term in $h^4$. The error therefore contains $h^5$, but no lower powers of h. Without knowing the coefficient of $h^5$ in the error, we cannot be precise about the local truncation error. In treatises devoted to this subject, these matters are explored further. See, for example, Butcher [1987] or Gear [1971].
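The agreement with the Taylor expansion through the term in $h^4$ can be checked by hand in a simple special case. As an illustration only (the test equation $x' = x$ is chosen here for convenience and is not singled out by the text at this point), Formula (12) gives:

```latex
\begin{align*}
K_1 &= hx \\
K_2 &= h\left(x + \tfrac{1}{2}K_1\right) = hx\left(1 + \tfrac{1}{2}h\right) \\
K_3 &= h\left(x + \tfrac{1}{2}K_2\right) = hx\left(1 + \tfrac{1}{2}h + \tfrac{1}{4}h^2\right) \\
K_4 &= h\left(x + K_3\right) = hx\left(1 + h + \tfrac{1}{2}h^2 + \tfrac{1}{4}h^3\right) \\
x(t+h) &= x + \tfrac{1}{6}\left(K_1 + 2K_2 + 2K_3 + K_4\right)
        = x\left(1 + h + \tfrac{1}{2}h^2 + \tfrac{1}{6}h^3 + \tfrac{1}{24}h^4\right)
\end{align*}
```

Since the exact solution satisfies $x(t+h) = e^h x(t)$, the computed step reproduces the Taylor series of $e^h$ through $h^4$, and the first discrepancy is the term $\frac{1}{120}h^5 x$.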
Pseudocode
Here is a pseudocode to implement the classical Runge-Kutta method of order 4:

procedure RK4(f, t, x, h, n)
integer j, n; real K1, K2, K3, K4, h, t, ta, x
external function f
output 0, t, x
ta ← t
for j = 1 to n
    K1 ← h f(t, x)
    K2 ← h f(t + (1/2)h, x + (1/2)K1)
    K3 ← h f(t + (1/2)h, x + (1/2)K2)
    K4 ← h f(t + h, x + K3)
    x ← x + (1/6)(K1 + 2K2 + 2K3 + K4)
    t ← ta + jh
    output j, t, x
end for
end procedure RK4

To illustrate the use of the preceding pseudocode, consider the initial-value problem
$$x' = 2 + (x - t - 1)^2, \qquad x(1) = 2 \tag{13}$$
whose exact solution is x(t) = 1 + t + tan(t − 1). A pseudocode to solve this problem on the interval [1, 1.5625] by the Runge-Kutta procedure follows. The step size needed is calculated by dividing the length of the interval by the number of steps, say, n = 72.

program Test_RK4
real h, t; external function f
integer n ← 72
real a ← 1, b ← 1.5625, x ← 2
h ← (b − a)/n
t ← a
call RK4(f, t, x, h, n)
end program Test_RK4

real function f(t, x)
real t, x
f ← 2 + (x − t − 1)^2
end function f
We include an external-function statement both in the main program and in procedure RK4 because the procedure f is passed in the argument list of RK4. The final value of the computed numerical solution is x (1.5625) = 3.19293 7699.
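For readers who want to run the procedure, the following is one possible Python rendering of the RK4 and Test_RK4 pseudocode. It is a sketch under stated assumptions: the Python names rk4 and f mirror the pseudocode but are choices made here, the per-step output is dropped in favor of returning the final (t, x), and the comparison value comes from the exact solution 1 + t + tan(t − 1) quoted with Equation (13).

```python
import math

def rk4(f, t, x, h, n):
    """Advance x' = f(t, x) by n steps of size h with the classical RK4 formulas."""
    ta = t
    for j in range(1, n + 1):
        k1 = h * f(t, x)
        k2 = h * f(t + 0.5 * h, x + 0.5 * k1)
        k3 = h * f(t + 0.5 * h, x + 0.5 * k2)
        k4 = h * f(t + h, x + k3)
        x = x + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t = ta + j * h  # recompute t from ta to avoid accumulated roundoff
    return t, x

def f(t, x):
    """Right-hand side of initial-value problem (13)."""
    return 2.0 + (x - t - 1.0) ** 2

a, b, n = 1.0, 1.5625, 72
t_end, x_end = rk4(f, a, 2.0, (b - a) / n, n)
exact = 1.0 + b + math.tan(b - 1.0)

if __name__ == "__main__":
    print(t_end, x_end, exact)
```

The printed numerical value should lie close to the exact solution at t = 1.5625; how many digits agree depends on the floating-point precision in use.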
General-purpose routines incorporating the Runge-Kutta algorithm usually include additional programming to monitor the truncation error and make necessary adjustments in the step size as the solution progresses. In general terms, the step size can be large when the solution is slowly varying, but should be small when it is rapidly varying. Such a program is presented in Section 7.3.
Summary 7.2
• The second-order Runge-Kutta method is
1
x(t + h) = x(t) + 2(K1 + K2) where K1 =hf(t,x)
K2 =hf(t+h,x+K1)
This method requires two evaluations of the function f per step. It is equivalent to a
Taylor series method of order 2.
• One of the most popular single-step methods for solving ODEs is the fourth-order
Runge-Kutta method
x(t + h) = x(t) + 1(K1 + 2K2 + 2K3 + K4) 6
where
K =hf(t,x)
1
K3 =hf t+1h,x+1K2
K2 =hf t+1h,x+1K1
1.
2.
Derive the equations needed to apply the fourth-order Taylor series method to the differential equation x′ = t x 2 + x − 2t . Compare them in complexity with the equations required for the fourth-order Runge-Kutta method.
Put these differential equations into a form suitable for numerical solution by the Runge-Kutta method.
a. x+2xx′−x′=0 b. logx′=t2−x2 ac. (x′)2(1−t2)=x
at t = −0.2, correct to two decimal places, using one step of the Taylor series method of order 2 and one step of the Runge-Kutta method of order 2.
4. Considertheordinarydifferentialequation x′ =(tx)3 −(x/t)2
x(1) = 1
Take one step of the Taylor series method of order 2 with h = 0.1 and then use the Runge-Kutta method of order 2 to recompute x(1.1). Compare answers.
5. Insolvingthefollowingdifferentialequationsbyusinga Runge-Kutta procedure, it is necessary to write code for a function f (t , x ). Do so for each of the following:
aa. x′=t2+tx′−2xx′ b. x′=et+x′cosx+t2
or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
a3. Solvethedifferentialequation dx
d t = − t x 2 x(0) = 2
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole deemed that any suppressed content does not materially affect the overall learning experience.
2 2 22
K4 =hf(t+h,x+K3)
It needs four evaluations of the function f per step. Since it is equivalent to a Taylor series method of order 4, it has truncation error of order O(h5). The small number of function evaluations and high-order truncation error account for its popularity.
Exercises 7.2
6. Consider the ordinary differential equation x′ = t3x2 − 2x3/t2 with x(1) = 1. Determine the equations that would be used in applying the Taylor series method of order 3 and the Runge-Kutta method of order 4.
7. Considerthethird-orderRunge-Kuttamethod: x(t + h) = x(t) + 1(2K1 + 3K2 + 4K3)
13. An important theorem of calculus states that the equa- tion ftx = fxt is true, provided that at least one of these two partial derivatives exists and is continuous. Test this equation on some functions, such as f (t , x ) = xt2 + x2t + x3t4, log(x − t−1), and ex sinh(t + x) + cos(2x − 3t).
14. a. Ifx′ = f(t,x),then
x′′=Df, x′′′=D2f+fxDff
where
9 K 1 = h f ( t , x )
where
K2 = h f t + 1 h, x + 1 K1
7.2 Runge-Kutta Methods 317
22 222 K3=hf t+3h,x+3K2 ∂∂2∂∂2∂
44 D=∂t+f∂x, D =∂t2+2f∂x∂t+f ∂x2 a. Show that it agrees with the Taylor series method of
the same order for the differential equation x ′ = x + t .
b. Prove that this third-order Runge-Kutta method re- produces the Taylor series of the solution up to and including terms in h3 for any differential equation.
a8. Describehowthefourth-orderRunge-Kuttamethodcan
be used to produce a table of values for the function x
f (x) = e−t2 dt 0
at 100 equally spaced points in the unit interval.
Hint: Find an appropriate initial-value problem whose solution is f .
9. Showthatthefourth-orderRunge-Kuttaformulareduces to a simple form when applied to an ordinary differential equation of the form
x′ = f(t)
a10. Establishtheerrorterm(11)forRunge-Kuttamethodsof
order 2.
a11. Onacertaincomputer,itwasfoundthatwhenthefourth- order Runge-Kutta method was used over an interval [a, b] with h = (b − a)/n, the total error due to round- off was about 36n2−50 and the total truncation error was 9nh5, where n is the number of steps and h is the step size. What is an optimum value of h?
Hint: Minimize the total error: roundoff error plus trun- cation error.
a 15.
16.
a 17.
a18. a 19.
a 20.
a 21.
Verify these equations.
ab. Determine x(iv) in a similar form.
Derive the two-variable form of the Taylor series from the one-variable form by considering the function of one variable φ(t) = f (x + th, y + tk) and expanding it by Taylor’s Theorem.
The Taylor series expansion about point (a, b) in terms of two variables x and y is given by
∞ i 1∂∂
i! (x−a)∂x +(y−b)∂y
Show that Formula (2) can be obtained from this form by
a change of variables.
(Continuation) Using the form given in the preceding problem, determine the first four nonzero terms in the Taylor series for f (x, y) = sin x + cos y about the point (0, 0). Compare the result to the known series for sin x and cos y. Make a conjecture about the Taylor series for functions that have the special form f (x, y) = g(x) + h(y).
Forthefunction f(x,y)=y2−3lnx,writethefirstsix termsintheTaylorseriesof f(1+h,0+k).
Using the truncated Taylor series about (1,1), give a three-term approximation to e(1−xy).
Hint: Use Exercise 7.2.16.
The function f (x, y) = xey can be approximated by theTaylorseriesintwovariablesby f(x+h,y+k)≈ (Ax + B)ey. Determine A and B when terms through the second partial derivatives are used in the series.
For f (x, y) = (y − x)−1, the Taylor series can be written as
f (x + h, y + k) = A f + B f 2 + C f 3 + · · ·
f(x,y)=
f(a,b)
a12. Howwouldyousolvetheinitial-valueproblem ′
on the interval [0, 1] if ten decimal places of accuracy, 10−10, are required? Assume that you use a computer with adequate precision, and assume that the fourth-order Runge-Kutta method involves truncation error of magni- tude 100h5.
x =sinx+sint x(0) = 0
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
i=0
318 Chapter 7 Initial Values Problems
where f = f (x, y). Determine the coefficients A, B,
terms. Use this result to obtain an approximate value for f (0.001, 0.998).
Show that the improved Euler’s method is a Runge-Kutta method of order 2.
x(0) = 0. Integrate it with the fourth-order Runge- Kutta method on the interval [0,3], using step sizes h = 0.015, 0.020, 0.025, 0.030. Observe the numerical instability!
Consider the differential equation
x+t, −1≦t≦0
0 ≦ t ≦ 1
Using the Runge-Kutta procedure RK4 with step size h = 0.1, solve this problem over the interval [−1, 1]. Now solve by using h = 0.09. Which numerical solution is more accurate and why?
Hint: The true solution is given by x = e(t+1) − (t + 1) ift≦0andx=e(t+1)−2et +(t+1)ift≧0.
Solvet−x′+2xt =0withx(0)=0ontheinterval [0, 10] using the Runge-Kutta formulas with h = 0.1.
Compare with the true solution: 1 (et 2 − 1). Draw a graph 2
or have one created by an automatic plotter. Then graph the logarithm of the solution.
′
Write a program to solve x = sin(xt) + arctan t on
1 ≦ t ≦ 7 with x (2) = 4 using the Runge-Kutta proce- dure RK4.
The general form of Runge-Kutta methods of order 2 is given by Equation (5). Write and test procedure RK2( f, t, x, h, α, n) for carrying out n steps with step size h and initial conditions t and x for several given α values.
Wewanttosolve
x′ = et x2 + e3
x(2) = 4
at x(5) with step size 0.5. Solve it in the following two
ways.
a. Code the function f (t , x ) that is needed and use pro-
cedure RK4.
b. Write a short program that uses the Taylor series
method including terms up to h4.
a 22.
1.
a2.
3.
a
a5.
a6.
7.
andC.
Consider the function ex2+y. Determine its Taylor series 23.
about the point (0, 1) through second-partial-derivative
Run the sample pseudocode given in the text for dif- ferential Equation (13) to illustrate the Runge-Kutta method.
Solve the initial-value problem x′ = x/t + t sec(x/t)
with x(0) = 0 by the fourth-order Runge-Kutta method. a8. Continue the solution to t = 1 using step size h = 2−7.
Compare the numerical solution with the exact solution,
which is x(t) = t arcsin t. Define f (0, 0) = 0, where
f (t, x) = x/t + t sec(x/t).
Select one of the following initial-value problems, and compare the numerical solutions obtained with fourth- order Runge-Kutta formulas and fourth-order Taylor se- ries.Usedifferentvaluesofh = 2−n,forn = 2,3,...,7, to compute the solution on the interval [1, 2].
a. x′ =1+x/t, x(1)=1
ab. x′=1/x2−xt, x(1)=1 a9. ac. x′=1/t2−x/t−x2, x(1)=−1
x′ = x(−1) = 1
2 10.
x − t,
4.
Select a Runge-Kutta routine from a program library, and test it on the initial-value problem x′ = (2 − t)x with x(2) = 1. Compare with the exact solution, x = exp−1(t−2)2.
(Ill-Conditioned ODE) Solve the ordinary differential equation x′ = 10x + 11t − 5t2 − 1 with initial value x(0) = 0. Continue the solution from t = 0 to t = 3, using the fourth-order Runge-Kutta method with h = 2−8. Print the numerical solution and the exact solution (t2/2−t) at every tenth step, and draw a graph of the two solutions. Verify that the solution of the same differential equation with initial value x(0) = ε is εe10t + t2/2 − t and thus account for the discrepancy between the numer- ical and exact solutions of the original problem.
Solve the initial-value problem $x' = x\sqrt{x^2 - 1}$ with $x(0) = 1$ by the Runge-Kutta method on the interval 0 ≦ t ≦ 1.6, and account for any difficulties. Then, using negative h, solve the same differential equation on the same interval with initial value x(1.6) = 1.0.
The following pathological example has been given by Dahlquist and Björck [1974]. Consider the differential equation $x' = 100(\sin t - x)$ with initial value
11.
12.
Computer Exercises 7.2
13. Plot the solution for differential equation (13).
14. Select a differential equation with a known solution and compare the classical fourth-order Runge-Kutta method with one or both of the following ones. Print the errors at each step. Is the ratio of the two errors a constant at each step? What are the advantages and disadvantages of each method?
a. A fourth-order Runge-Kutta method similar to the classical one is given by
$$x(t+h) = x(t) + \tfrac{1}{6}(K_1 + 4K_3 + K_4)$$
where
$$\begin{aligned}
K_1 &= h f(t, x) \\
K_2 &= h f\left(t + \tfrac{1}{2}h,\; x + \tfrac{1}{2}K_1\right) \\
K_3 &= h f\left(t + \tfrac{1}{2}h,\; x + \tfrac{1}{4}K_1 + \tfrac{1}{4}K_2\right) \\
K_4 &= h f(t + h,\; x - K_2 + 2K_3)
\end{aligned}$$
See England [1969] or Shampine, Allen, and Pruess [1997].
b. Another fourth-order Runge-Kutta method is given by
$$x(t+h) = x(t) + w_1K_1 + w_2K_2 + w_3K_3 + w_4K_4$$
where
$$\begin{aligned}
K_1 &= h f(t, x) \\
K_2 &= h f\left(t + \tfrac{2}{5}h,\; x + \tfrac{2}{5}K_1\right) \\
K_3 &= h f\left(t + \tfrac{1}{16}\left(14 - 3\sqrt{5}\right)h,\; x + c_{31}K_1 + c_{32}K_2\right) \\
K_4 &= h f(t + h,\; x + c_{41}K_1 + c_{42}K_2 + c_{43}K_3)
\end{aligned}$$
Here the appropriate constants are
$$\begin{aligned}
c_{31} &= \frac{3\left(-963 + 476\sqrt{5}\right)}{1024}, & c_{32} &= \frac{5\left(757 - 324\sqrt{5}\right)}{1024} \\
c_{41} &= \frac{-3365 + 2094\sqrt{5}}{6040}, & c_{42} &= \frac{-975 - 3046\sqrt{5}}{2552} \\
c_{43} &= \frac{32\left(14595 + 6374\sqrt{5}\right)}{240845} \\
w_1 &= \frac{263 + 24\sqrt{5}}{1812}, & w_2 &= \frac{125\left(1 - 8\sqrt{5}\right)}{3828} \\
w_3 &= \frac{1024\left(3346 + 1623\sqrt{5}\right)}{5924787}, & w_4 &= \frac{2\left(15 - 2\sqrt{5}\right)}{123}
\end{aligned}$$
Note: There are any number of Runge-Kutta methods of any order. The higher the order, the more complicated are the formulas. Since the one given by Equation (12) has error $O(h^5)$ and is rather simple, it is the most popular fourth-order Runge-Kutta method. The error term for the method of part b of this problem is also $O(h^5)$, and it is optimum in a certain sense. (See Ralston [1965] for details.)
15. A fifth-order Runge-Kutta method is given by
$$x(t+h) = x(t) + \tfrac{1}{24}K_1 + \tfrac{5}{48}K_4 + \tfrac{27}{56}K_5 + \tfrac{125}{336}K_6$$
where
$$\begin{aligned}
K_1 &= h f(t, x) \\
K_2 &= h f\left(t + \tfrac{1}{2}h,\; x + \tfrac{1}{2}K_1\right) \\
K_3 &= h f\left(t + \tfrac{1}{2}h,\; x + \tfrac{1}{4}K_1 + \tfrac{1}{4}K_2\right) \\
K_4 &= h f(t + h,\; x - K_2 + 2K_3) \\
K_5 &= h f\left(t + \tfrac{2}{3}h,\; x + \tfrac{7}{27}K_1 + \tfrac{10}{27}K_2 + \tfrac{1}{27}K_4\right) \\
K_6 &= h f\left(t + \tfrac{1}{5}h,\; x + \tfrac{28}{625}K_1 - \tfrac{1}{5}K_2 + \tfrac{546}{625}K_3 + \tfrac{54}{625}K_4 - \tfrac{378}{625}K_5\right)
\end{aligned}$$
Write and test a procedure that uses this formula.
16. a. Use a symbol manipulation package such as Maple or Mathematica to find the general Runge-Kutta method of order 2.
b. Repeat for order 3.
17. (Delay Ordinary Differential Equation) Investigate procedures for determining the numerical solution of an ordinary differential equation with a constant delay such as
$$x'(t) = -x(t) + x(t - 20) + \tfrac{1}{20}\cos\left(\tfrac{t}{20}\right) + \sin\left(\tfrac{t}{20}\right) - \sin\left(\tfrac{t}{20} - 1\right)$$
on the interval 0 ≦ t ≦ 1000, where $x(t) = \sin(t/20)$ for t ≦ 0. Use a step size less than or equal to 20 so that no overlapping occurs. Compare to the exact solution $x(t) = \sin(t/20)$.
18. Write software for program Test RK4 and routine RK4, and verify the numerical results given in the text.
7.3 Adaptive Runge-Kutta and Multistep Methods
An Adaptive Runge-Kutta-Fehlberg Method
In realistic situations involving the numerical solution of initial-value problems, there is always a need to estimate the precision attained in the computation. Usually, an error tolerance is prescribed, and the numerical solution must not deviate from the true solution beyond this tolerance. Once a method has been selected, the error tolerance dictates the largest allowable step size. Even if we consider only the local truncation error, determining an appropriate step size may be difficult. Moreover, often a small step size is needed on one portion of the solution curve, whereas a larger one may suffice elsewhere.
For the reasons given, various methods have been developed for automatically adjusting the step size in algorithms for the initial-value problem. One simple procedure is now described. Consider the classical fourth-order Runge-Kutta method discussed in Section 7.2. To advance the solution curve from t to t +h, we can take one step of size h using the Runge- Kutta formulas. But we can also take two steps of size h/2 to arrive at t + h. If there were no truncation error, the value of the numerical solution x(t + h) would be the same for both procedures. The difference in the numerical results can be taken as an estimate of the local truncation error. So, in practice, if this difference is within the prescribed tolerance, the current step size h is satisfactory. If this difference exceeds the tolerance, the step size is halved. If the difference is very much less than the tolerance, the step size is doubled.
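The step-doubling idea in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names rk4_step and step_doubling, the return convention, and the threshold tol/128 used for "very much less than the tolerance" are hypothetical choices, not taken from the text.

```python
import math

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step from t to t + h."""
    k1 = h * f(t, x)
    k2 = h * f(t + h / 2, x + k1 / 2)
    k3 = h * f(t + h / 2, x + k2 / 2)
    k4 = h * f(t + h, x + k3)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6

def step_doubling(f, t, x, h, tol):
    """Compare one step of size h with two steps of size h/2.

    Returns (new_x, next_h, estimate); new_x is None when the step is rejected.
    """
    one = rk4_step(f, t, x, h)                  # a single step of size h
    half = rk4_step(f, t, x, h / 2)             # first half step
    two = rk4_step(f, t + h / 2, half, h / 2)   # second half step
    estimate = abs(two - one)                   # local truncation error estimate
    if estimate > tol:
        return None, h / 2, estimate            # reject and halve the step
    if estimate < tol / 128:                    # assumed threshold for doubling
        return two, 2 * h, estimate
    return two, h, estimate
```

Each trial costs eleven evaluations of f (four for the full step and seven more for the two half steps, since the first evaluation is shared in principle), which is the waste that motivates the embedded formulas below.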
The procedure just outlined is easily programmed but rather wasteful of computing time and is not recommended. A more sophisticated method was developed by Fehlberg [1969]. The Runge-Kutta-Fehlberg method of order 4 is
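$$x(t+h) = x(t) + \frac{25}{216}K_1 + \frac{1408}{2565}K_3 + \frac{2197}{4104}K_4 - \frac{1}{5}K_5 \tag{1}$$
where
$$\begin{aligned}
K_1 &= h f(t, x) \\
K_2 &= h f\left(t + \tfrac{1}{4}h,\; x + \tfrac{1}{4}K_1\right) \\
K_3 &= h f\left(t + \tfrac{3}{8}h,\; x + \tfrac{3}{32}K_1 + \tfrac{9}{32}K_2\right) \\
K_4 &= h f\left(t + \tfrac{12}{13}h,\; x + \tfrac{1932}{2197}K_1 - \tfrac{7200}{2197}K_2 + \tfrac{7296}{2197}K_3\right) \\
K_5 &= h f\left(t + h,\; x + \tfrac{439}{216}K_1 - 8K_2 + \tfrac{3680}{513}K_3 - \tfrac{845}{4104}K_4\right)
\end{aligned}$$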
Since this scheme requires one more function evaluation than the classical Runge-Kutta method of order 4, it is of questionable value alone. However, with an additional function evaluation
$$K_6 = h f\left(t + \tfrac{1}{2}h,\; x - \tfrac{8}{27}K_1 + 2K_2 - \tfrac{3544}{2565}K_3 + \tfrac{1859}{4104}K_4 - \tfrac{11}{40}K_5\right) \tag{2}$$
we can obtain a fifth-order Runge-Kutta method, namely,
$$x(t+h) = x(t) + \frac{16}{135}K_1 + \frac{6656}{12825}K_3 + \frac{28561}{56430}K_4 - \frac{9}{50}K_5 + \frac{2}{55}K_6 \tag{3}$$
The difference between the values of x(t + h) obtained from the fourth- and fifth-order procedures is an estimate of the local truncation error in the fourth-order procedure. So six function evaluations give a fifth-order approximation, together with an error estimate!
Pseudocode
A pseudocode for the Runge-Kutta-Fehlberg method is given in procedure RK45:
procedure RK45( f, t, x, h, ε)
real ε, K1, K2, K3, K4, K5, K6, h, t, x, x4
external function f
real c20 ← 0.25, c21 ← 0.25
real c30 ← 0.375, c31 ← 0.09375, c32 ← 0.28125
real c40 ← 12./13., c41 ← 1932./2197.
real c42 ← −7200./2197., c43 ← 7296./2197.
real c51 ← 439./216., c52 ← −8.
real c53 ← 3680./513., c54 ← −845./4104.
real c60 ← 0.5, c61 ← −8./27., c62 ← 2.
real c63 ← −3544./2565., c64 ← 1859./4104.
real c65 ← −0.275
real a1 ← 25./216., a2 ← 0., a3 ← 1408./2565.
real a4 ← 2197./4104., a5 ← −0.2
real b1 ← 16./135., b2 ← 0., b3 ← 6656./12825.
real b4 ← 28561./56430., b5 ← −0.18
real b6 ← 2./55.
K1 ← h f(t, x)
K2 ← h f(t + c20 h, x + c21 K1)
K3 ← h f(t + c30 h, x + c31 K1 + c32 K2)
K4 ← h f(t + c40 h, x + c41 K1 + c42 K2 + c43 K3)
K5 ← h f(t + h, x + c51 K1 + c52 K2 + c53 K3 + c54 K4)
K6 ← h f(t + c60 h, x + c61 K1 + c62 K2 + c63 K3 + c64 K4 + c65 K5)
x4 ← x + a1 K1 + a3 K3 + a4 K4 + a5 K5
x ← x + b1 K1 + b3 K3 + b4 K4 + b5 K5 + b6 K6
t ← t + h
ε ← |x − x4|
end procedure RK45
Of course, the programmer may wish to consider various optimization techniques such as assigning numerical values to the coefficients with decimal expansions corresponding to the precision of the computer being used so that the fractions do not need to be recomputed at each call to the procedure.
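For readers who prefer running code, procedure RK45 can be transcribed into Python roughly as follows. This is a sketch rather than a reference implementation: the names rk45_step and run_fixed and the tuple return value are assumptions made for illustration, and the coefficients are those of Equations (1) through (3). The small driver takes n equal steps on the test problem $x' = 2 + (x - t - 1)^2$, $x(1) = 2$.

```python
import math

def rk45_step(f, t, x, h):
    """One Runge-Kutta-Fehlberg step: returns (t + h, fifth-order x, error estimate)."""
    k1 = h * f(t, x)
    k2 = h * f(t + h / 4, x + k1 / 4)
    k3 = h * f(t + 3 * h / 8, x + 3 / 32 * k1 + 9 / 32 * k2)
    k4 = h * f(t + 12 * h / 13,
               x + 1932 / 2197 * k1 - 7200 / 2197 * k2 + 7296 / 2197 * k3)
    k5 = h * f(t + h,
               x + 439 / 216 * k1 - 8 * k2 + 3680 / 513 * k3 - 845 / 4104 * k4)
    k6 = h * f(t + h / 2,
               x - 8 / 27 * k1 + 2 * k2 - 3544 / 2565 * k3
               + 1859 / 4104 * k4 - 11 / 40 * k5)
    # Fourth-order value, Equation (1)
    x4 = x + 25 / 216 * k1 + 1408 / 2565 * k3 + 2197 / 4104 * k4 - k5 / 5
    # Fifth-order value, Equation (3)
    x5 = (x + 16 / 135 * k1 + 6656 / 12825 * k3 + 28561 / 56430 * k4
          - 9 / 50 * k5 + 2 / 55 * k6)
    return t + h, x5, abs(x5 - x4)

def f(t, x):
    return 2.0 + (x - t - 1.0) ** 2

def run_fixed(f, a, b, x, n):
    """Nonadaptive use: n equal steps from a to b, tracking the largest estimate."""
    h = (b - a) / n
    t = a
    worst = 0.0
    for _ in range(n):
        t, x, eps = rk45_step(f, t, x, h)
        worst = max(worst, eps)
    return t, x, worst

t_end, x_end, worst_eps = run_fixed(f, 1.0, 1.5625, 2.0, 72)
```

Substitution shows that $x(t) = t + 1 + \tan(t - 1)$ satisfies this initial-value problem, so the computed value at t = 1.5625 can be checked against it.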
We can use the RK45 procedure in a nonadaptive fashion such as in the following test program:
program Test RK45
  integer k; real t, h, ε; external function f
  integer n ← 72
  real a ← 1.0, b ← 1.5625, x ← 2.0
  h ← (b − a)/n
  t ← a
  output 0, t, x
  for k = 1 to n
    call RK45(f, t, x, h, ε)
    output k, t, x, ε
  end for
end program Test RK45

real function f(t, x)
  real t, x
  f ← 2.0 + (x − t − 1.0)^2
end function f
Here, we print the error estimate at each step. However, we can use it in an adaptive procedure, since the error estimate ε can tell us when to adjust the step size to control the single-step error.
A Simple Adaptive Procedure
We now describe a simple adaptive procedure. In the RK45 procedure, the fourth- and fifth-order approximations for x(t + h), say, x4 and x5, are computed from six function evaluations, and the error estimate ε = |x4 − x5| is known. From user-specified bounds on the allowable error estimate (εmin ≦ ε ≦ εmax), the step size h is doubled or halved as needed to keep ε within these bounds. A range for the allowable step size h is also specified by the user (hmin ≦ |h| ≦ hmax). Clearly, the user must set the bounds (εmin, εmax, hmin, hmax) carefully so that the adaptive procedure does not get caught in a loop, trying repeatedly to halve and double the step size from the same point to meet error bounds that are too restrictive for the given differential equation.
Basically, our adaptive process is as follows:
1. Given a step size h and an initial value x(t), the RK45 routine computes the value x(t + h) and an error estimate ε.
2. If εmin ≦ ε ≦ εmax, then the step size h is not changed and the next step is taken by repeating step 1 with initial value x(t + h).
3. If ε < εmin, then h is replaced by 2h, provided that |2h| ≦ hmax.
4. If ε > εmax, then h is replaced by h/2, provided that |h/2| ≧ hmin.
5. If hmin ≦ |h| ≦ hmax, then the step is repeated by returning to step 1 with x(t) and the new h value.
The procedure for this adaptive scheme is RK45 Adaptive. In the parameter list of the pseudocode, f is the function f (t , x ) for the differential equation, t and x contain the initial values, h is the initial step size, tb is the final value for t, itmax is the maximum number of steps to be taken in going from a = ta to b = tb, εmin and εmax are lower and upper bounds on the allowable error estimate ε, hmin and hmax are bounds on the step size h, and iflag is an error flag that returns one of the following values:
iflag   Meaning
0       Successful march from ta to tb
1       Maximum number of iterations reached
On return, t and x are the exit values, and h is the final step size value considered or used:
procedure RK45 Adaptive(f, t, x, h, tb, itmax, εmax, εmin, hmin, hmax, iflag)
  integer iflag, itmax, k; external function f
  real ε, εmax, εmin, d, h, hmin, hmax, t, tb, x, xsave, tsave
  real δ ← (1/2) × 10^−5
  output 0, h, t, x
  iflag ← 1
  k ← 0
  while k ≦ itmax
    k ← k + 1
    if |h| < hmin then h ← sign(h)hmin
    if |h| > hmax then h ← sign(h)hmax
    d ← |tb − t|
    if d ≦ |h| then
      iflag ← 0
      if d ≦ δ · max{|tb|, |t|} then exit loop
      h ← sign(h)d
    end if
    xsave ← x
    tsave ← t
    call RK45(f, t, x, h, ε)
    output k, h, t, x, ε
    if iflag = 0 then exit loop
    if ε < εmin then h ← 2h
    if ε > εmax then
      h ← h/2
      x ← xsave
      t ← tsave
      k ← k − 1
    end if
  end while
end procedure RK45 Adaptive
In the pseudocode, notice that several conditions must be checked to determine the size of the final step, since floating-point arithmetic is involved and the step size varies.
Repeat the computer example in the previous section using RK45 Adaptive, which allows variable step size, instead of RK4. Compare the accuracy of these two computed solutions. (See Computer Exercise 7.4.22.)
An Industrial Example
A first-order differential equation that arose in the modeling of an industrial chemical process is as follows:
$$x' = a + b\sin t + cx, \qquad x(0) = 0 \tag{4}$$
in which a = 3, b = 5, and c = 0.2 are constants. This equation is amenable to the solution techniques of calculus, in particular the use of an integrating factor. However, the analytic solution is complicated, and a numerical solution may be preferable.
To solve this problem numerically using the adaptive Runge-Kutta formulas, identify (and program) the function f that appears in the general description. In this problem, it is f (t, x) = 3 + 5 sin t + 0.2x. Here is a brief pseudocode for solving the equation on the interval [0,10] with particular values assigned to the parameters in the routine RK45 Adaptive:
program Test RK45 Adaptive
  integer iflag; real t, x, h, tb; external function f
  integer itmax ← 1000
  real εmax ← 10^−5, εmin ← 10^−8, hmin ← 10^−6, hmax ← 1.0
  t ← 0.0; x ← 0.0; h ← 0.01; tb ← 10.0
  call RK45 Adaptive(f, t, x, h, tb, itmax, εmax, εmin, hmin, hmax, iflag)
  output itmax, iflag
end program Test RK45 Adaptive

real function f(t, x)
  real t, x
  f ← 3 + 5 sin(t) + 0.2x
end function f
We obtain the approximation x(10) ≈ 135.917. The output from the code is a table of values that can be sent to a plotting routine. The resulting graph helps the user to visualize the solution curve.
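The adaptive march for this example can be tried out with the following Python sketch. It mirrors the logic of the RK45 Adaptive pseudocode but is not a definitive implementation: the function names, the returned tuple, and the omission of the per-step output are assumptions made for illustration.

```python
import math

def rk45_step(f, t, x, h):
    """One Runge-Kutta-Fehlberg step: returns (t + h, fifth-order x, error estimate)."""
    k1 = h * f(t, x)
    k2 = h * f(t + h / 4, x + k1 / 4)
    k3 = h * f(t + 3 * h / 8, x + 3 / 32 * k1 + 9 / 32 * k2)
    k4 = h * f(t + 12 * h / 13,
               x + 1932 / 2197 * k1 - 7200 / 2197 * k2 + 7296 / 2197 * k3)
    k5 = h * f(t + h,
               x + 439 / 216 * k1 - 8 * k2 + 3680 / 513 * k3 - 845 / 4104 * k4)
    k6 = h * f(t + h / 2,
               x - 8 / 27 * k1 + 2 * k2 - 3544 / 2565 * k3
               + 1859 / 4104 * k4 - 11 / 40 * k5)
    x4 = x + 25 / 216 * k1 + 1408 / 2565 * k3 + 2197 / 4104 * k4 - k5 / 5
    x5 = (x + 16 / 135 * k1 + 6656 / 12825 * k3 + 28561 / 56430 * k4
          - 9 / 50 * k5 + 2 / 55 * k6)
    return t + h, x5, abs(x5 - x4)

def rk45_adaptive(f, t, x, h, tb, itmax, emax, emin, hmin, hmax):
    """March from t to tb, doubling or halving h; returns (t, x, h, iflag)."""
    delta = 0.5e-5
    iflag = 1                        # 1 means the iteration limit was reached
    k = 0
    while k <= itmax:
        k += 1
        if abs(h) < hmin:
            h = math.copysign(hmin, h)
        if abs(h) > hmax:
            h = math.copysign(hmax, h)
        d = abs(tb - t)
        if d <= abs(h):              # the final step lands on tb
            iflag = 0
            if d <= delta * max(abs(tb), abs(t)):
                break
            h = math.copysign(d, h)
        tsave, xsave = t, x
        t, x, eps = rk45_step(f, t, x, h)
        if iflag == 0:
            break
        if eps < emin:
            h = 2 * h                # accept the step and try a larger h next
        if eps > emax:
            h = h / 2                # reject: restore and retry with smaller h
            t, x = tsave, xsave
            k -= 1
    return t, x, h, iflag

def f(t, x):
    return 3 + 5 * math.sin(t) + 0.2 * x

t_end, x_end, h_end, iflag = rk45_adaptive(
    f, 0.0, 0.0, 0.01, 10.0, 1000, 1e-5, 1e-8, 1e-6, 1.0)
```

With these parameter values, the returned x_end should agree with the quoted value x(10) ≈ 135.917 to the digits shown. As in the pseudocode, a rejected step does not count toward itmax, so the error bounds must be chosen sensibly for the loop to terminate.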
Adams-Bashforth-Moulton Formulas
We now introduce a strategy in which numerical quadrature formulas are used to solve a single first-order ordinary differential equation. The model equation is
x′(t)= f(t,x(t))
and we suppose that the values of the unknown function have been computed at several points to the left of t, namely, t,t − h,t − 2h,…,t − (n − 1)h. We want to compute x(t + h). By the theorems of calculus, we can write
$$\begin{aligned}
x(t+h) &= x(t) + \int_t^{t+h} x'(s)\,ds \\
&= x(t) + \int_t^{t+h} f(s, x(s))\,ds \\
&\approx x(t) + \sum_{j=1}^{n} c_j f_j
\end{aligned}$$
where the abbreviation $f_j = f(t - (j-1)h,\, x(t - (j-1)h))$ has been used. In the last line of the above equation, we have brought in a suitable numerical integration rule. The simplest case of such a formula over interval [0, 1] uses values of the integrand at points 0, −1, −2, ..., 1 − n in the case of an Adams-Bashforth formula. Once we have such a basic rule, a change of variable produces the rule for any other interval with any other uniform spacing.
Let’s find a rule of the form
$$\int_0^1 F(r)\,dr \approx c_1F(0) + c_2F(-1) + \cdots + c_nF(1-n)$$
There are n coefficients $c_j$ at our disposal. We know from interpolation theory that the rule can be made exact for all polynomials of degree n − 1. It suffices that we insist on integrating each function $1, r, r^2, \ldots, r^{n-1}$ exactly. Hence, we write down the appropriate equation:
$$\int_0^1 r^{i-1}\,dr = \sum_{j=1}^{n} c_j(1-j)^{i-1} \qquad (1 \leq i \leq n)$$
This is a linear system Au = b of n equations in n unknowns. The elements of the matrix A are Aij = (1 − j)i−1, and the right-hand side is bi = 1/i.
When this system is set up and solved by a short program with n = 4, the output is the vector of coefficients (55/24, −59/24, 37/24, −3/8). Of course, higher-order formulas are obtained by changing the value of n in the code. To get the Adams-Moulton formulas, we start with a quadrature rule of the form
$$\int_0^1 G(r)\,dr \approx \sum_{j=1}^{n} C_j G(2-j)$$
A similar program yields the coefficients 9/24, 19/24, −5/24, 1/24. The distinction between the two quadrature rules is that one involves the value of the integrand at 1 and the other does not.
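One way to carry out this computation is sketched below in Python, using exact rational arithmetic so that the coefficients come out as fractions. The helper names solve_exact and adams_coefficients are hypothetical, and the plain Gauss-Jordan elimination is just one convenient choice of linear solver.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def adams_coefficients(n, first_node):
    """Weights of the rule on [0, 1] with nodes first_node, first_node - 1, ...

    first_node = 0 gives Adams-Bashforth (nodes 0, -1, ..., 1 - n);
    first_node = 1 gives Adams-Moulton (nodes 1, 0, ..., 2 - n).
    """
    nodes = [Fraction(first_node - j) for j in range(n)]
    A = [[node ** i for node in nodes] for i in range(n)]   # row i: r^i at each node
    b = [Fraction(1, i + 1) for i in range(n)]              # integral of r^i on [0, 1]
    return solve_exact(A, b)

ab4 = adams_coefficients(4, 0)
am4 = adams_coefficients(4, 1)
```

Here ab4 and am4 hold the fourth-order Adams-Bashforth and Adams-Moulton weights, respectively; changing the first argument gives rules of other orders.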
How do we arrive at formulas for $\int_t^{t+h} g(s)\,ds$ from the work already done? Use the change of variable from s to σ given by s = hσ + t. In these considerations, think of t as a constant. The new integral is $h\int_0^1 g(h\sigma + t)\,d\sigma$, which can be treated with either of the two formulas already designed for the interval [0, 1]. For example, we have
$$\begin{aligned}
\int_t^{t+h} F(r)\,dr &\approx \frac{h}{24}\left[55F(t) - 59F(t-h) + 37F(t-2h) - 9F(t-3h)\right] \\
\int_t^{t+h} G(r)\,dr &\approx \frac{h}{24}\left[9G(t+h) + 19G(t) - 5G(t-h) + G(t-2h)\right]
\end{aligned} \tag{5}$$
The method of undetermined coefficients used here to obtain the quadrature rules does not, by itself, provide the error terms that we would like to have. An assessment of the error can be made from interpolation theory, because the methods considered here come from integrating an interpolating polynomial. Details can be found in more advanced books. You can experiment with some of the Adams-Bashforth-Moulton formulas in Computer Exercises 7.3.2 and 7.3.4. These methods are taken up again in Section 7.5.
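As a concrete illustration of how the two rules in Equation (5) combine into a predictor-corrector step, here is a minimal Python sketch. It assumes the usual arrangement in which the Adams-Bashforth rule predicts x(t + h) and the Adams-Moulton rule corrects it once, with three classical Runge-Kutta steps supplying the starting values; the names abm4 and rk4_step are hypothetical.

```python
import math

def rk4_step(f, t, x, h):
    """One classical fourth-order Runge-Kutta step (used only for starting values)."""
    k1 = h * f(t, x)
    k2 = h * f(t + h / 2, x + k1 / 2)
    k3 = h * f(t + h / 2, x + k2 / 2)
    k4 = h * f(t + h, x + k3)
    return x + (k1 + 2 * k2 + 2 * k3 + k4) / 6

def abm4(f, a, x0, h, n):
    """Take n steps of size h from t = a (n >= 3); returns lists of t and x values."""
    ts, xs = [a], [x0]
    for k in range(1, 4):                       # starting values by Runge-Kutta
        xs.append(rk4_step(f, ts[-1], xs[-1], h))
        ts.append(a + k * h)
    fs = [f(t, x) for t, x in zip(ts, xs)]      # stored values of f
    for k in range(4, n + 1):
        x = xs[-1]
        t_new = a + k * h
        # Predictor: Adams-Bashforth rule from Equation (5)
        pred = x + h / 24 * (55 * fs[-1] - 59 * fs[-2] + 37 * fs[-3] - 9 * fs[-4])
        # Corrector: Adams-Moulton rule, with f evaluated at the predicted value
        corr = x + h / 24 * (9 * f(t_new, pred) + 19 * fs[-1] - 5 * fs[-2] + fs[-3])
        ts.append(t_new)
        xs.append(corr)
        fs.append(f(t_new, corr))
    return ts, xs

ts, xs = abm4(lambda t, x: -x, 0.0, 1.0, 0.01, 100)
```

For the test problem x′ = −x, x(0) = 1, the last entry of xs approximates $e^{-1}$. After the start-up phase, each step costs two new evaluations of f, compared with four for the classical Runge-Kutta method.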
Stability Analysis
Let us now resume the discussion of errors that inevitably occur in the numerical solution of an initial-value problem
$$x' = f(t, x), \qquad x(a) = s \tag{6}$$
The exact solution is a function x(t). It depends on the initial value s, and to show this, we write x(t, s). The differential equation therefore gives rise to a family of solution curves, each corresponding to one value of the parameter s. For example, the differential equation
$$x' = x, \qquad x(a) = s$$
gives rise to the family of solution curves $x = se^{(t-a)}$ that differ in their initial values x(a) = s. A few such curves are shown in Figure 7.4. The fact that the curves there diverge from one another as t increases has important numerical significance. Suppose, for instance, that initial value s is read into the computer with some roundoff error. Then even if all subsequent calculations are precise and no truncation errors occur, the computed solution is still wrong! An error made at the beginning has the effect of selecting the wrong curve from the family of all solution curves. Since these curves diverge from one another, any minute error made at the beginning is responsible for an eventual complete loss of accuracy. This phenomenon is not restricted to errors made in the first step, because each point in the numerical solution can be interpreted as the initial value for succeeding points.
FIGURE 7.4 Solution curves to x′ = x with x(a) = s
For an example in which this difficulty does not arise, consider
$$x' = -x, \qquad x(a) = s \tag{7}$$
Its solutions are $x = se^{-(t-a)}$. As t increases, these curves come closer together, as in Figure 7.5. Thus, errors made in the numerical solution still result in selecting the wrong curve, but the effect is not as serious because the curves coalesce.
FIGURE 7.5 Solution curves to x′ = −x with x(a) = s
At a given step, the global error of an approximate solution to an ordinary differential equation contains both the local error at that step and the accumulative effect of all the local errors at all previous steps. For divergent solution curves, the local errors at each step are magnified over time, and the global error may be greater than the sum of all the local errors. In Figures 7.4 and 7.5, the steps in the numerical solution are indicated by dots connected by dark lines. Also, the local errors are indicated by small vertical bars and the global error by a vertical bar at the right end of the curves.
For convergent solution curves, the local errors at each step are reduced over time, and the global error may be less than the sum of all the local errors.
For the general differential Equation (6), how can the two modes of behavior just discussed be distinguished? It is simple. If $f_x > \delta$ for some positive δ, the curves diverge. However, if $f_x < -\delta$, they converge. To see why, consider two nearby solution curves that correspond to initial values s and s + h. By Taylor series, we have
$$x(t, s+h) = x(t, s) + h\frac{\partial}{\partial s}x(t, s) + \frac{1}{2}h^2\frac{\partial^2}{\partial s^2}x(t, s) + \cdots$$
whence
$$x(t, s+h) - x(t, s) \approx h\frac{\partial}{\partial s}x(t, s)$$
Thus, the divergence of the curves means that
$$\lim_{t\to\infty} |x(t, s+h) - x(t, s)| = \infty$$
and can be written as
$$\lim_{t\to\infty} \left|\frac{\partial}{\partial s}x(t, s)\right| = \infty$$
To calculate this partial derivative, start with the differential equation satisfied by x(t, s):
$$\frac{\partial}{\partial t}x(t, s) = f(t, x(t, s))$$
and differentiate partially with respect to s:
$$\frac{\partial}{\partial s}\frac{\partial}{\partial t}x(t, s) = \frac{\partial}{\partial s}f(t, x(t, s))$$
Hence, we obtain
$$\frac{\partial}{\partial t}\frac{\partial}{\partial s}x(t, s) = f_x(t, x(t, s))\frac{\partial}{\partial s}x(t, s) + f_t(t, x(t, s))\frac{\partial t}{\partial s} \tag{8}$$
But s and t are independent variables (a change in s produces no change in t), so ∂t/∂s = 0. If s is now fixed and if we put $u(t) = (\partial/\partial s)x(t, s)$ and $q(t) = f_x(t, x(t, s))$, then Equation (8) becomes
$$u' = qu \tag{9}$$
This is a linear differential equation with solution $u(t) = ce^{Q(t)}$, where Q is the indefinite integral (antiderivative) of q. The condition $\lim_{t\to\infty} |u(t)| = \infty$ is met if $\lim_{t\to\infty} Q(t) = \infty$. This situation, in turn, occurs if q(t) is positive and bounded away from zero because then
$$Q(t) = \int_a^t q(\theta)\,d\theta > \int_a^t \delta\,d\theta = \delta(t - a) \to \infty$$
as t → ∞ if $f_x = q > \delta > 0$.
To illustrate, consider the differential equation $x' = t + \tan x$. The solution curves diverge from one another as t → ∞ because $f_x(t, x) = \sec^2 x \geq 1$.
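The effect of the sign of $f_x$ can be observed numerically. The following Python sketch integrates two neighboring initial values with the classical fourth-order Runge-Kutta method and reports how far apart the computed solutions end up; the function names and the particular values (interval [0, 5], perturbation $10^{-6}$, 500 steps) are arbitrary choices made for illustration.

```python
import math

def rk4_solve(f, a, b, s, n):
    """Integrate x' = f(t, x), x(a) = s, to t = b with n classical RK4 steps."""
    h = (b - a) / n
    x = s
    for i in range(n):
        t = a + i * h
        k1 = h * f(t, x)
        k2 = h * f(t + h / 2, x + k1 / 2)
        k3 = h * f(t + h / 2, x + k2 / 2)
        k4 = h * f(t + h, x + k3)
        x += (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x

def final_gap(f, a, b, s, ds, n):
    """Distance at t = b between the curves that start at s and at s + ds."""
    return abs(rk4_solve(f, a, b, s + ds, n) - rk4_solve(f, a, b, s, n))

ds = 1e-6
gap_diverging = final_gap(lambda t, x: x, 0.0, 5.0, 1.0, ds, 500)    # f_x = +1
gap_converging = final_gap(lambda t, x: -x, 0.0, 5.0, 1.0, ds, 500)  # f_x = -1
```

For x′ = x the initial perturbation is magnified by roughly $e^5$, whereas for x′ = −x it shrinks by roughly the same factor, in line with u′ = qu for q = ±1.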
Summary 7.3
• The Runge-Kutta-Fehlberg method is
$$\begin{aligned}
x(t+h) &= x(t) + \frac{25}{216}K_1 + \frac{1408}{2565}K_3 + \frac{2197}{4104}K_4 - \frac{1}{5}K_5 \\
\bar{x}(t+h) &= x(t) + \frac{16}{135}K_1 + \frac{6656}{12825}K_3 + \frac{28561}{56430}K_4 - \frac{9}{50}K_5 + \frac{2}{55}K_6
\end{aligned}$$
where
$$\begin{aligned}
K_1 &= h f(t, x) \\
K_2 &= h f\left(t + \tfrac{1}{4}h,\; x + \tfrac{1}{4}K_1\right) \\
K_3 &= h f\left(t + \tfrac{3}{8}h,\; x + \tfrac{3}{32}K_1 + \tfrac{9}{32}K_2\right) \\
K_4 &= h f\left(t + \tfrac{12}{13}h,\; x + \tfrac{1932}{2197}K_1 - \tfrac{7200}{2197}K_2 + \tfrac{7296}{2197}K_3\right) \\
K_5 &= h f\left(t + h,\; x + \tfrac{439}{216}K_1 - 8K_2 + \tfrac{3680}{513}K_3 - \tfrac{845}{4104}K_4\right) \\
K_6 &= h f\left(t + \tfrac{1}{2}h,\; x - \tfrac{8}{27}K_1 + 2K_2 - \tfrac{3544}{2565}K_3 + \tfrac{1859}{4104}K_4 - \tfrac{11}{40}K_5\right)
\end{aligned}$$
The quantity $\varepsilon = |x(t+h) - \bar{x}(t+h)|$ can be used in an adaptive step-size procedure.
• A fourth-order multistep method is the Adams-Bashforth-Moulton method:
$$\begin{aligned}
\tilde{x}(t+h) &= x(t) + \frac{h}{24}\left[55f(t, x(t)) - 59f(t-h, x(t-h)) + 37f(t-2h, x(t-2h)) - 9f(t-3h, x(t-3h))\right] \\
x(t+h) &= x(t) + \frac{h}{24}\left[9f(t+h, \tilde{x}(t+h)) + 19f(t, x(t)) - 5f(t-h, x(t-h)) + f(t-2h, x(t-2h))\right]
\end{aligned}$$
The value $\tilde{x}(t+h)$ is the predicted value, and $x(t+h)$ is the corrected value. The truncation errors for these two formulas are $O(h^5)$. Since the value of x(a) is given, the values for x(a + h), x(a + 2h), x(a + 3h), x(a + 4h) are computed by some single-step method such as the fourth-order Runge-Kutta method.
Exercises 7.3
a1. Solve the problem
$$x' = -x, \qquad x(0) = 1$$
by using the trapezoid rule, as discussed at the beginning of this chapter. Compare the true solution at t = 1 to the approximate solution obtained with n steps. Show, for example, that for n = 5, the error is 0.00123.
a2. Derive an implicit multistep formula based on Simpson's rule, involving uniformly spaced points x(t − h), x(t), and x(t + h), for numerically solving the ordinary differential equation x′ = f.
3. An alert student noticed that the coefficients in the Adams-Bashforth formula add up to 1. Why is that so?
a4. Derive a formula of the form
x(t +h) = ax(t)+bx(t −h)
+ h[cx′(t + h) + dx′′(t) + ex′′′(t − h)]
that is accurate for polynomials of as high a degree as possible.
Hint: Use polynomials 1, t, t2, and so on.
a5. Determine the coefficients of an implicit, one-step, ordinary differential equation method of the form
x(t + h) = ax(t) + bx′(t) + cx′(t + h)
so that it is exact for polynomials of as high a degree as
possible. What is the order of the error term?
6. The differential equation that is used to illustrate the adaptive Runge-Kutta program can be solved with an integrating factor. Do so.
1. Use mathematical software to solve systems of linear equations whose solutions are
a. Adams-Bahforth coefficients
b. Adams-Moulton coefficients
2. The second-order Adams-Bashforth-Moulton method is given by
x(t +h) = x(t)+ h[3f(t,x(t))− f(t −h,x(t −h))] 2
aa. x′=sint+ex ac. x′=xt
ae. x′=cost−ex
b. x′=x+te−t d. x′=x3(t2+1)
f. x′=(1−x3)(1+t2)
7.3
Adaptive Runge-Kutta and Multistep Methods 329 7. Establish Equation (9).
a8. The initial-value problem x′ = (1 + t2)x with x(0) = 1 is to be solved on the interval [0, 9]. How sensitive is x (9) to perturbations in the initial value x(0)?
9. For each differential equation, determine regions in which the solution curves tend to diverge from one another as t increases:
10. For the differential equation x′ = t(x3 − 6×2 + 15x), determine whether the solution curves diverge from one another as t → ∞.
a11. Determine whether the solution curves of x′ = (1 + t 2 )−1 x diverge from one another as t → ∞.
4. (Predictor-Corrector Scheme) Using the fourth- order Adams-Bashforth-Moulton method, derive the predictor-corrector scheme given by the following equations:
x(t+h)=x(t)+ h [55f(t,x(t))−59f(t−h,x(t−h)) 24
+ 37 f (t − 2h, x(t − 2h)) − 9 f (t − 3h, x(t − 3h))]
x(t +h) = x(t)+ h[f(t +h,x(t +h))+ f(t,x(t))] x(t+h)=x(t)+ h [9f(t+h,x(t+h))+19f(t,x(t))
2 24 The approximate single-step error is ε ≡ K|x(t + h) −
− 5 f (t − h, x(t − h)) + f(t −2h,x(t −2h))]
Write and test a procedure for the Adams-Bashforth- Moulton method.
Note: This is a multistep process because values of x at t , t −h, t −2h, and t −3h are used to determine the predicted valuex(t+h),which,inturn,isusedwithvaluesofx att, t−h,andt−2htoobtainthecorrectedvaluex(t+h).The error terms for these formulas are (251/720)h5 f (iv)(ξ) and −(19/720)h5 f (iv)(η), respectively. (See Section 7.5 for additional discussion of these methods.)
a5. Solve
x′ = 3x + 9t −13
t2 x(3) = 6
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
x(t + h)|, where K = 1 . Using ε to monitor the conver- 6
gence, write and test an adaptive procedure for solving an ODE of your choice using these formulas.
3. (Continuation) Carry out the instructions of the pre- vious computer problem for the third-order Adams- Bashforth-Moulton method:
x(t +h) = x(t)+ h [23f(t,x(t))−16f(t −h,x(t −h)) 12
+5f(t −2h,x(t −2h))] x(t +h) = x(t)+ 12[5f(t +h,x(t +h))8f(t,x(t))
− f (t − h, x(t − h))]
where K = 1 in the expression for the approximate
h
10 single-step error.
Computer Exercises 7.3
330 Chapter 7 Initial Values Problems
at x1 using procedure RK45 Adaptive to obtain the
the true solution:
3 92 13 x=t−2t+2t
a6. (Continuation) Repeat the previous problem for x − 1 . 2
7. Itisknownthatthefourth-orderRunge-Kuttamethodde- scribed in Equation (12) of Section 7.2 has a local trunca- tion error that is O(h5). Devise and carry out a numerical experiment to test this.
Suggestions: Take just one step in the numerical solu- tion of a nontrivial differential equation whose solution is known beforehand. However, use a variety of values for h, such as 2−n , where 1 ≦ n ≦ 24. Test whether the ratio of errors to h5 remains bounded as h → 0. A multiple- precision calculation may be needed. Print the indicated ratios.
8. Compute the numerical solution of x′ =−x
x(0) = 1 using the midpoint method
Note: This is an example of an elliptic integral of the sec- ond kind. It arises in finding an arc length on an ellipse and in many engineering problems.
By solving an appropriate initial-value problem, make a
2
desired solution to nine decimal places. Compare with
a 12.
a 13.
table of the function
f (x) =
∞
1/x tet
dt
on the interval [0, 1]. Determine how well f is approxi- mated by x e−1/x .
Hint: Let t = −lns.
By solving an appropriate initial-value problem, make a
table of the function x
2 −t2
f(x)= √π e 0
dt
on the interval 0 ≦ x ≦ 2. Determine how accurately f (x )
is approximated on this interval by the function
2 32−x2 g(x)=1− ay+by +cy √ e
π
b = −0.08497 13
x n + 1 = x n − 1 + 2 h x n′ √2
14.
a 15.
16.
17.
c = 0.66276 98, UsetheRunge-Kuttamethodtocompute
withx0 =1andx1 =−h+ 1+h.Arethereany difficulties in using this method for this problem? Carry out an analysis of the stability of this method.
Hint: Consider fixed h and assume xn = λn.
1 0
a9. Tabulate and graph the function [1 − ln v(x )]v(x ) on [0, e], where v(x) is the solution of the initial-value prob- lem (dv/dx)[lnv(x)] = 2x,v(0) = 1.
Check value: v(1) = e.
sine integral
x sinr Si(x)= r dr
0
The table should cover the interval 0 ≦ x ≦ 1 in steps of
size 0.01.
(Use sin(0)/0 = 1. See Computer Exercise 5.1.9.)
Compute a table of the function
x sinht
Shi(x) = t dt 0
by finding an initial-value problem that it satisfies and then solving the initial-value problem. Your table should be accurate to nearly machine precision.
(Use sinh(0)/ 0 = 1.)
Design and carry out a numerical experiment to verify that a slight perturbation in an initial-value problem can cause catastrophic errors in the numerical solution. Note: An initial-value problem is an ordinary differential equation with conditions specified only at the initial point. (Compare this with a boundary value problem as given in Chapter 11.)
where
a = 0.30842 84,
y = (1 + 0.47047x )−1 1 + s3 ds
Write and run a program to print an accurate table of the
10. Determinethenumericalvalueof 5s
in three ways: solving the integral, an ordinary differen- tial equation, and using the exact formula.
2π eds 4s
11. Computeandprintatableofthefunction
f (φ) = φ 1 − 1 sin2 θ dθ
4 0
by solving an appropriate initial-value problem. Cover the interval [0, 90°] with steps of 1° and use the Runge-Kutta method of order 4.
Check values: Use f (30◦ ) =
f (90◦) = 1.46746 221.
0.51788 193 and
18. Run example programs for solving the industrial example in Equation (4), compare the solutions, and produce the plots.
19. Another adaptive Runge-Kutta method was developed by England [1969]. The Runge-Kutta-England method is similar to the Runge-Kutta-Fehlberg method in that it combines a fourth-order Runge-Kutta formula and a companion fifth-order one. To reduce the number of function evaluations, the formulas are derived so that some of the same function evaluations are used in each pair of formulas. (A fourth-order Runge-Kutta formula requires at least four function evaluations, and a fifth-order one requires at least six.) The Runge-Kutta-England method uses the fourth-order Runge-Kutta methods in Computer Exercise 7.2.14a and takes two half-steps as follows:
With these two half-steps, there are enough function evaluations so that only one more
K =1hf t+h,x(t)−1(K +96K −92K 92 12123
+ 121K4 − 144K5 − 6K6 + 12K7
is needed to obtain a fifth-order Runge-Kutta method:
11
xt+2h =x(t)+6(K1+4K3+K4)
where
1
x(t+h)=x(t)+ 90 14K1+64K3+32K3
− 8K5 +64K7 +15K8 − K9
An adaptive procedure can be developed by using an error estimation based on the two values computed for x(t + h). Program and test such a procedure. (See, for example, Shampine, Allen, and Pruess [1997].)
K1 = 1hf(t,x(t))
2
20.
Investigate the numerical solution of the initial-value problem
K2=1hf t+1h,x(t)+1K1
242 √
K3 = 1hft+1h, x(t)+1K1+1K2 x′=− 1−x2 2 4 4 4 x(0) = 1
K = 1 h f t + 1 h , x ( t ) − K + 2 K 42223
This problem is ill-conditioned, since x(t) = cos t is a solution and x(t) = 1 is also. For more information on this and other test problems, see Cash [2003] or www2.imperial.ac.uk/~jcash/.
21. (Student Research Project) Learn and write a report about algebraic differential equations.
and 1 x(t+h)=xt+1h + K5+4K7+K8
26
1
where
K = hf t+1h,xt+1h
522 2
K =1hft+3h,xt+1h+1K
22. Write software to implement the following pseudocodes and verify the numerical results given in the text for IVP (13) in Section 7.2 and for IVP (4) in this section.
6 2 4 2 2 5
1 a.TestRK4andRK4
K7= hf t+3h,xt+1h +1K5+1K6
2 4 2 4 4 b. TestRK45andRK45
K8=1hft+h,xt+1h−K6+2K7 c. TestRK45AdaptiveandRK45Adaptive 22
7.4
Methods for First and Higher-Order Systems
So far in our study of the numerical solutions of ordinary differential equations, we have restricted our attention to a single differential equation of the first order with an accompanying auxiliary condition. Scientific and technological problems often lead to more complicated situations, however. The next degree of complication occurs with systems of several first-order equations.
Uncoupled and Coupled Systems
The sun and the nine planets form a system of particles moving under the jurisdiction of Newton’s law of gravitation. The position vectors of the planets constitute a system of 27 functions, and the Newtonian laws of motion can be written, then, as a system of 54 first-order ordinary differential equations. In principle, the past and future positions of the planets can be obtained by solving these equations numerically.
Taking an example of more modest scope, we consider two equations with two auxiliary conditions. Let x and y be two functions of t subject to the system
$$
\begin{cases}
x'(t) = x(t) - y(t) + 2t - t^2 - t^3 \\
y'(t) = x(t) + y(t) - 4t^2 + t^3
\end{cases}
\tag{1}
$$
with initial conditions
$$
x(0) = 1, \qquad y(0) = 0
$$
This is an example of an initial-value problem that involves a system of two first-order differential equations. Note that in the example given, it is not possible to solve either of the two differential equations by itself because the first equation governing x′ involves the unknown function y, and the second equation governing y′ involves the unknown function x. In this situation, we say that the two differential equations are coupled.
The reader is invited to verify that the analytic solution is
$$
\begin{aligned}
x(t) &= e^{t}\cos(t) + t^2 = \cos(t)[\cosh(t) + \sinh(t)] + t^2 \\
y(t) &= e^{t}\sin(t) - t^3 = \sin(t)[\cosh(t) + \sinh(t)] - t^3
\end{aligned}
$$
Let us look at another example that is superficially similar to the first but is actually simpler:
$$
\begin{cases}
x'(t) = x(t) + 2t - t^2 - t^3 \\
y'(t) = y(t) - 4t^2 + t^3
\end{cases}
\tag{2}
$$
with initial conditions
$$
x(0) = 1, \qquad y(0) = 0
$$
These two equations are not coupled and can be solved separately as two unrelated initial-value problems (using, for instance, the methods of Sections 7.1–3). Naturally, our concern here is with systems that are coupled, although methods that solve coupled systems also solve those that are not. The procedures extend to systems, whether coupled or uncoupled.
Taylor Series Method
We illustrate the Taylor series method for System (1) and begin by differentiating the equations constituting it:
$$
\begin{aligned}
x' &= x - y + 2t - t^2 - t^3 & y' &= x + y - 4t^2 + t^3 \\
x'' &= x' - y' + 2 - 2t - 3t^2 & y'' &= x' + y' - 8t + 3t^2 \\
x''' &= x'' - y'' - 2 - 6t & y''' &= x'' + y'' - 8 + 6t \\
x^{(iv)} &= x''' - y''' - 6 & y^{(iv)} &= x''' + y''' + 6 \\
x^{(v)} &= x^{(iv)} - y^{(iv)} & y^{(v)} &= x^{(iv)} + y^{(iv)}
\end{aligned}
$$
etc.
A program to proceed from x(t) to x(t + h) and from y(t) to y(t + h) is easily written by using a few terms of the Taylor series:
Taylor Series for x(t + h) and y(t + h)
$$
\begin{aligned}
x(t+h) &= x + h x' + \frac{h^2}{2}x'' + \frac{h^3}{6}x''' + \frac{h^4}{24}x^{(iv)} + \frac{h^5}{120}x^{(v)} + \cdots \\
y(t+h) &= y + h y' + \frac{h^2}{2}y'' + \frac{h^3}{6}y''' + \frac{h^4}{24}y^{(iv)} + \frac{h^5}{120}y^{(v)} + \cdots
\end{aligned}
$$
together with equations for the various derivatives. Here, x and y and all their derivatives are functions of t; that is, x = x(t), y = y(t), x′ = x′(t), y′′ = y′′(t), and so on.
A pseudocode program that generates and prints a numerical solution from 0 to 1 in 100 steps is as follows. Terms up to h^4 have been used in the Taylor series.

    program Taylor_System1
    integer k;  real h, t, x, y, x′, y′, x′′, y′′, x′′′, y′′′, x(iv), y(iv), x(v), y(v)
    integer nsteps ← 100;  real a ← 0, b ← 1
    x ← 1;  y ← 0;  t ← a
    output 0, t, x, y
    h ← (b − a)/nsteps
    for k = 1 to nsteps
        x′ ← x − y + t(2 − t(1 + t))
        y′ ← x + y + t^2(−4 + t)
        x′′ ← x′ − y′ + 2 − t(2 + 3t)
        y′′ ← x′ + y′ + t(−8 + 3t)
        x′′′ ← x′′ − y′′ − 2 − 6t
        y′′′ ← x′′ + y′′ − 8 + 6t
        x(iv) ← x′′′ − y′′′ − 6
        y(iv) ← x′′′ + y′′′ + 6
        x(v) ← x(iv) − y(iv)
        y(v) ← x(iv) + y(iv)
        x ← x + h[x′ + (1/2)h[x′′ + (1/3)h[x′′′ + (1/4)h[x(iv)]]]]
        y ← y + h[y′ + (1/2)h[y′′ + (1/3)h[y′′′ + (1/4)h[y(iv)]]]]
        t ← t + h
        output k, t, x, y
    end for
    end program Taylor_System1
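The Taylor series pseudocode can be sketched in Python as follows. This is only an illustrative translation under stated assumptions: the function name `taylor4_system1` and its default arguments are hypothetical, the fifth derivatives are omitted because the update uses terms only up to h^4, and the result is compared with the analytic solution of System (1) quoted earlier.

```python
import math

def taylor4_system1(nsteps=100, a=0.0, b=1.0):
    """Order-4 Taylor series method for System (1), x(0) = 1, y(0) = 0."""
    h = (b - a) / nsteps
    t, x, y = a, 1.0, 0.0
    for _ in range(nsteps):
        # Derivatives of x and y at the current t, obtained by
        # differentiating the two equations of System (1).
        x1 = x - y + t * (2 - t * (1 + t))
        y1 = x + y + t * t * (-4 + t)
        x2 = x1 - y1 + 2 - t * (2 + 3 * t)
        y2 = x1 + y1 + t * (-8 + 3 * t)
        x3 = x2 - y2 - 2 - 6 * t
        y3 = x2 + y2 - 8 + 6 * t
        x4 = x3 - y3 - 6
        y4 = x3 + y3 + 6
        # Nested (Horner-like) form of the truncated Taylor series.
        x += h * (x1 + h / 2 * (x2 + h / 3 * (x3 + h / 4 * x4)))
        y += h * (y1 + h / 2 * (y2 + h / 3 * (y3 + h / 4 * y4)))
        t += h
    return t, x, y

t, x, y = taylor4_system1()
x_exact = math.e * math.cos(1.0) + 1.0   # e^t cos t + t^2 at t = 1
y_exact = math.e * math.sin(1.0) - 1.0   # e^t sin t - t^3 at t = 1
print(t, x, y, abs(x - x_exact), abs(y - y_exact))
```

The nested update mirrors the pseudocode's bracketed form, which avoids computing powers of h explicitly.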
Vector Notation
Observe that System (1) can be written in vector notation as
$$
\begin{bmatrix} x' \\ y' \end{bmatrix}
=
\begin{bmatrix} x - y + 2t - t^2 - t^3 \\ x + y - 4t^2 + t^3 \end{bmatrix}
\tag{3}
$$
with initial conditions
$$
\begin{bmatrix} x(0) \\ y(0) \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \end{bmatrix}
$$
This is a special case of a more general problem that can be written as
$$
\begin{cases}
X' = F(t, X) \\
X(a) = S \quad \text{(given)}
\end{cases}
\tag{4}
$$
where
$$
X = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad
X' = \begin{bmatrix} x' \\ y' \end{bmatrix}
$$
and F is the vector whose two components are given by the right-hand sides in Equation (1). Since F depends on t and X, we write F(t, X).
Systems of ODEs
We can continue this idea in order to handle a system of n first-order differential equations. First, we write them as
$$
\begin{cases}
x_1' = f_1(t, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(t, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(t, x_1, x_2, \ldots, x_n) \\
x_1(a) = s_1,\ x_2(a) = s_2,\ \ldots,\ x_n(a) = s_n \quad \text{(all given)}
\end{cases}
$$
Then we let
$$
X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad
X' = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}, \qquad
F = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix}, \qquad
S = \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_n \end{bmatrix}
$$
and we obtain Equation (4), which is an ordinary differential equation written in vector notation.
Taylor Series Method: Vector Notation
The m-order Taylor series method would be written as
$$
X(t+h) = X + hX' + \frac{h^2}{2}X'' + \frac{h^3}{3!}X''' + \frac{h^4}{4!}X^{(iv)} + \frac{h^5}{5!}X^{(v)} + \cdots + \frac{h^m}{m!}X^{(m)}
\tag{5}
$$
where X = X(t), X′ = X′(t), X′′ = X′′(t), X′′′ = X′′′(t), and so on.
A pseudocode for the Taylor series method of order 4 applied to the preceding problem can be easily rewritten by a simple change of variables and the introduction of an array and an inner loop.

    program Taylor_System2
    integer i, k;  real h, t;  real array (x_i)_{1:n}, (d_ij)_{1:n×1:4}
    integer n ← 2, nsteps ← 100
    real a ← 0, b ← 1
    t ← 0;  (x_i) ← (1, 0)
    output 0, t, (x_i)
    h ← (b − a)/nsteps
    for k = 1 to nsteps
        d_11 ← x_1 − x_2 + t(2 − t(1 + t))
        d_21 ← x_1 + x_2 + t^2(−4 + t)
        d_12 ← d_11 − d_21 + 2 − t(2 + 3t)
        d_22 ← d_11 + d_21 + t(−8 + 3t)
        d_13 ← d_12 − d_22 − 2 − 6t
        d_23 ← d_12 + d_22 − 8 + 6t
        d_14 ← d_13 − d_23 − 6
        d_24 ← d_13 + d_23 + 6
        for i = 1 to n
            x_i ← x_i + h[d_i1 + (1/2)h[d_i2 + (1/3)h[d_i3 + (1/4)h[d_i4]]]]
        end for
        t ← t + h
        output k, t, (x_i)
    end for
    end program Taylor_System2

Here, a two-dimensional array is used instead of all the different derivative variables; that is, d_ij ↔ x_i^(j). In fact, this and other methods in this chapter become particularly easy to program if the computer language supports vector operations.
Runge-Kutta Method
The Runge-Kutta methods also extend to systems of differential equations. The classical fourth-order Runge-Kutta method for System (4) uses these formulas:
$$
X(t+h) = X + \frac{h}{6}\,(K_1 + 2K_2 + 2K_3 + K_4)
\tag{6}
$$
where
$$
\begin{aligned}
K_1 &= F(t, X) \\
K_2 &= F\!\left(t + \tfrac{1}{2}h,\ X + \tfrac{1}{2}hK_1\right) \\
K_3 &= F\!\left(t + \tfrac{1}{2}h,\ X + \tfrac{1}{2}hK_2\right) \\
K_4 &= F(t + h,\ X + hK_3)
\end{aligned}
$$
Here, X = X(t), and all quantities are vectors with n components except variables t and h.
A procedure for carrying out the Runge-Kutta method is given next. It is assumed that the system to be solved is in the form of Equation (4) and that there are n equations in the system. The user furnishes the initial value of t, the initial value of X, the step size h, and the number of steps to be taken, nsteps. Furthermore, procedure XP_System(n, t, (x_i), (f_i)) is needed, which evaluates the right-hand side of Equation (4) for a given value of array (x_i) and stores the result in array (f_i). (The name XP_System is chosen as an abbreviation of X′ for a system.)

    procedure RK4_System1(n, h, t, (x_i), nsteps)
    integer i, j, n;  real h, t;  real array (x_i)_{1:n}
    allocate real array (y_i)_{1:n}, (K_{i,j})_{1:n×1:4}
    output 0, t, (x_i)
    for j = 1 to nsteps
        call XP_System(n, t, (x_i), (K_{i,1}))
        for i = 1 to n
            y_i ← x_i + (1/2)h K_{i,1}
        end for
        call XP_System(n, t + h/2, (y_i), (K_{i,2}))
        for i = 1 to n
            y_i ← x_i + (1/2)h K_{i,2}
        end for
        call XP_System(n, t + h/2, (y_i), (K_{i,3}))
        for i = 1 to n
            y_i ← x_i + h K_{i,3}
        end for
        call XP_System(n, t + h, (y_i), (K_{i,4}))
        for i = 1 to n
            x_i ← x_i + (1/6)h[K_{i,1} + 2K_{i,2} + 2K_{i,3} + K_{i,4}]
        end for
        t ← t + h
        output j, t, (x_i)
    end for
    deallocate array (y_i), (K_{i,j})
    end procedure RK4_System1
To illustrate the use of this procedure, we again use System (1) for our example. Of course, it must be rewritten in the form of Equation (4). A suitable main program and a procedure for computing the right-hand side of Equation (4) follow:

    program Test_RK4_System1
    integer n ← 2, nsteps ← 100
    real a ← 0, b ← 1
    real h, t;  real array (x_i)_{1:n}
    t ← 0
    (x_i) ← (1, 0)
    h ← (b − a)/nsteps
    call RK4_System1(n, h, t, (x_i), nsteps)
    end program Test_RK4_System1

    procedure XP_System(n, t, (x_i), (f_i))
    real array (x_i)_{1:n}, (f_i)_{1:n}
    integer n
    real t
    f_1 ← x_1 − x_2 + t(2 − t(1 + t))
    f_2 ← x_1 + x_2 − t^2(4 − t)
    end procedure XP_System

We use a numerical experiment to compare the results of the Taylor series method and the Runge-Kutta method with the analytic solution of System (1). At the point t = 1.0, the results are as follows:

Computer Results
              Taylor Series    Runge-Kutta    Analytic Solution
    x(1.0) ≈  2.46869 40       2.46869 42     2.46869 39399
    y(1.0) ≈  1.28735 46       1.28735 61     1.28735 52872
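The Runge-Kutta comparison above can be repeated with a short Python sketch. The names `rk4_system` and `xp_system` are hypothetical stand-ins for RK4_System1 and XP_System, and plain lists replace the arrays; the digits printed depend on the floating-point arithmetic used and need not agree with the table to the last place.

```python
import math

def rk4_system(f, t, x, h, nsteps):
    """Classical fourth-order Runge-Kutta method for X' = F(t, X)."""
    for _ in range(nsteps):
        k1 = f(t, x)
        k2 = f(t + h / 2, [xi + h / 2 * k for xi, k in zip(x, k1)])
        k3 = f(t + h / 2, [xi + h / 2 * k for xi, k in zip(x, k2)])
        k4 = f(t + h, [xi + h * k for xi, k in zip(x, k3)])
        x = [xi + h / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        t += h
    return t, x

def xp_system(t, x):
    """Right-hand side of System (1) with x[0] = x and x[1] = y."""
    return [x[0] - x[1] + t * (2 - t * (1 + t)),
            x[0] + x[1] - t * t * (4 - t)]

t, x = rk4_system(xp_system, 0.0, [1.0, 0.0], 0.01, 100)
exact = [math.e * math.cos(1.0) + 1.0,   # x(1) from the analytic solution
         math.e * math.sin(1.0) - 1.0]   # y(1) from the analytic solution
print(t, x, exact)
```

Passing the right-hand side as a function argument, rather than calling a fixed procedure name, is a design choice of this sketch so that the same stepper can be reused for other systems.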
We can use mathematical software routines found in MATLAB, Maple, or Mathematica to obtain the numerical solution of the system of ordinary differential equations (1). For t over the interval [0,1], we invoke an ODE procedure to march from t = 0 at which x(0) = 1 and y(0) = 0 to t = 1 at which x(1) = 2.468693912 and y(1) = 1.287355325.
To obtain the numerical solution of the ordinary differential equation defined for t over the interval [1, 1.5], invoke an ordinary differential equation solving procedure to march from t = 1 at which x(1) = 2 and y(1) = −2 to t = 1.5 at which x(1.5) ≈ 15.5028 and y(1.5) ≈ 6.15486.
Autonomous ODE
When we wrote the system of differential equations in vector form
X′ = F(t, X)
we assumed that the variable t was explicitly separated from the other variables and treated differently. It is not necessary to do this. Indeed, we can introduce a new variable x0 that is t in disguise and add a new differential equation x0′ = 1. A new initial condition must also be provided, x0(a) = a. In this way, we increase the number of differential equations from n to n + 1 and obtain a system written in the more elegant vector form
X′ = F(X)
X(a) = S (given)
Consider the system of two equations given by Equation (1). We write it as a system with three variables by letting
$$
x_0 = t, \qquad x_1 = x, \qquad x_2 = y
$$
Thus, we have
$$
\begin{bmatrix} x_0' \\ x_1' \\ x_2' \end{bmatrix}
=
\begin{bmatrix} 1 \\ x_1 - x_2 + 2x_0 - x_0^2 - x_0^3 \\ x_1 + x_2 - 4x_0^2 + x_0^3 \end{bmatrix}
$$
The auxiliary condition for the vector X is X(0) = [0, 1, 0]^T.
As a result of the preceding remarks, we sacrifice no generality in considering a system of n + 1 first-order differential equations written as
$$
\begin{cases}
x_0' = f_0(x_0, x_1, x_2, \ldots, x_n) \\
x_1' = f_1(x_0, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(x_0, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(x_0, x_1, x_2, \ldots, x_n) \\
x_0(a) = s_0,\ x_1(a) = s_1,\ x_2(a) = s_2,\ \ldots,\ x_n(a) = s_n \quad \text{(all given)}
\end{cases}
$$
We can write this system in general vector notation as
$$
\begin{cases}
X' = F(X) \\
X(a) = S \quad \text{(given)}
\end{cases}
\tag{7}
$$
where
$$
X = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad
X' = \begin{bmatrix} x_0' \\ x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}, \qquad
F = \begin{bmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix}, \qquad
S = \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ \vdots \\ s_n \end{bmatrix}
$$
A system of differential equations without the t variable explicitly present is said to be autonomous. The numerical methods that we discuss do not require that x_0 = t or f_0 = 1 or s_0 = a.
For an autonomous system, the classical fourth-order Runge-Kutta method for System (7) uses these formulas:
$$
X(t+h) = X + \frac{h}{6}\,(K_1 + 2K_2 + 2K_3 + K_4)
\tag{8}
$$
where
$$
\begin{aligned}
K_1 &= F(X) \\
K_2 &= F\!\left(X + \tfrac{1}{2}hK_1\right) \\
K_3 &= F\!\left(X + \tfrac{1}{2}hK_2\right) \\
K_4 &= F(X + hK_3)
\end{aligned}
$$
Here, X = X(t), and all quantities are vectors with n + 1 components except the variable h. In the previous example, the procedure RK4_System1 would need to be modified by beginning the arrays with 0 rather than 1 and omitting the variable t. (We call it RK4_System2 and leave it as Computer Exercise 7.4.4.) Then the calling programs would be as follows:

    program Test_RK4_System2
    real h, t;  real array (x_i)_{0:n}
    integer n ← 2, nsteps ← 100
    real a ← 0, b ← 1
    (x_i) ← (0, 1, 0)
    h ← (b − a)/nsteps
    call RK4_System2(n, h, (x_i), nsteps)
    end program Test_RK4_System2

    procedure XP_System(n, (x_i), (f_i))
    real array (x_i)_{0:n}, (f_i)_{0:n}
    integer n
    f_0 ← 1
    f_1 ← x_1 − x_2 + x_0(2 − x_0(1 + x_0))
    f_2 ← x_1 + x_2 − x_0^2(4 − x_0)
    end procedure XP_System
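For the autonomous form, a Python sketch along the same lines might look as follows. It is not the RK4_System2 procedure left as Computer Exercise 7.4.4, only one possible illustration; the names `rk4_autonomous` and `xp_system_autonomous` are assumptions made here. The state list carries x_0 = t as its first component, so no separate time variable is passed.

```python
import math

def rk4_autonomous(f, x, h, nsteps):
    """Classical fourth-order Runge-Kutta method for X' = F(X)."""
    for _ in range(nsteps):
        k1 = f(x)
        k2 = f([xi + h / 2 * k for xi, k in zip(x, k1)])
        k3 = f([xi + h / 2 * k for xi, k in zip(x, k2)])
        k4 = f([xi + h * k for xi, k in zip(x, k3)])
        x = [xi + h / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x

def xp_system_autonomous(x):
    """System (1) in autonomous form: x0 = t, x1 = x, x2 = y."""
    x0, x1, x2 = x
    return [1.0,
            x1 - x2 + x0 * (2 - x0 * (1 + x0)),
            x1 + x2 - x0 * x0 * (4 - x0)]

x_final = rk4_autonomous(xp_system_autonomous, [0.0, 1.0, 0.0], 0.01, 100)
exact = [1.0,
         math.e * math.cos(1.0) + 1.0,
         math.e * math.sin(1.0) - 1.0]
print(x_final, exact)
```

Because x_0′ = 1, each stage evaluates the right-hand side at the same values of t as the nonautonomous formulas, so the two versions should produce the same approximations up to rounding.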
It is typical in ordinary differential equation solvers, such as those found in mathematical software libraries, for the user to interface with them by writing a subprogram in a nonautonomous format. In other words, the ordinary differential equation solver takes as input both the independent variable and the dependent variable and returns values for the right-hand side to the ordinary differential equation. Consequently, the nonautonomous programming convention may seem more natural to those who are using these software packages.
It is a useful exercise to find a physical application in your field of study or profession involving the solution of an ordinary differential equation. It is instructive to analyze and solve the physical problem by determining the appropriate numerical method and translating the problem into the format that is compatible with the available software.
Higher-Order Differential Equations and Systems
Higher-Order Differential Equations
Consider the initial-value problem for ordinary differential equations of order higher than 1. A differential equation of order n is normally accompanied by n auxiliary conditions. This many initial conditions are needed to specify the solution of the differential equation precisely (assuming certain smoothness conditions are present). Take, for example, a particular second-order initial-value problem
$$
\begin{cases}
x''(t) = -3\cos^2(t) + 2 \\
x(0) = 0, \quad x'(0) = 0
\end{cases}
\tag{9}
$$
Without the auxiliary conditions, the general analytic solution is
$$
x(t) = \tfrac{1}{4}t^2 + \tfrac{3}{8}\cos(2t) + c_1 t + c_2
$$
where c_1 and c_2 are arbitrary constants. To select one specific solution, c_1 and c_2 must be fixed, and two initial conditions allow this to be done. In fact, x(0) = 0 yields c_2 = −3/8, and x′(0) = 0 forces c_1 = 0.
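As a check on the general solution of Equation (9), the two integrations can be written out. This is a short worked derivation added for illustration; it uses only the identity cos²(t) = (1 + cos 2t)/2.

$$
\begin{aligned}
x''(t) &= -3\cos^2(t) + 2 = -\tfrac{3}{2}\,(1 + \cos 2t) + 2 = \tfrac{1}{2} - \tfrac{3}{2}\cos 2t \\
x'(t) &= \tfrac{1}{2}t - \tfrac{3}{4}\sin 2t + c_1 \\
x(t) &= \tfrac{1}{4}t^2 + \tfrac{3}{8}\cos 2t + c_1 t + c_2
\end{aligned}
$$

Setting t = 0 gives x′(0) = c_1 and x(0) = 3/8 + c_2, so the initial conditions x′(0) = 0 and x(0) = 0 give c_1 = 0 and c_2 = −3/8.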
In general, higher-order problems can be much more complicated than this simple example because System (9) has the special property that the function on the right-hand side of the differential equation does not involve x. The most general form of an ordinary differential equation with initial conditions that we shall consider is
$$
\begin{cases}
x^{(n)} = f(t, x, x', x'', \ldots, x^{(n-1)}) \\
x(a),\ x'(a),\ x''(a),\ \ldots,\ x^{(n-1)}(a) \quad \text{(all given)}
\end{cases}
\tag{10}
$$
This can be solved numerically by turning it into a system of first-order differential equations. To do so, we define new variables x_1, x_2, …, x_n as follows:
$$
x_1 = x, \quad x_2 = x', \quad x_3 = x'', \quad \ldots, \quad x_{n-1} = x^{(n-2)}, \quad x_n = x^{(n-1)}
$$
Consequently, the original Initial-Value Problem (10) is equivalent to
$$
\begin{cases}
x_1' = x_2 \\
x_2' = x_3 \\
\quad\vdots \\
x_{n-1}' = x_n \\
x_n' = f(t, x_1, x_2, \ldots, x_n) \\
x_1(a),\ x_2(a),\ \ldots,\ x_n(a) \quad \text{(all given)}
\end{cases}
$$
or, in vector notation,
$$
\begin{cases}
X' = F(t, X) \\
X(a) = S \quad \text{(given)}
\end{cases}
\tag{11}
$$
where
$$
\begin{aligned}
X &= [x_1, x_2, \ldots, x_n]^T \\
X' &= [x_1', x_2', \ldots, x_n']^T \\
F &= [x_2, x_3, x_4, \ldots, x_n, f]^T
\end{aligned}
$$
and
$$
X(a) = [x_1(a), x_2(a), \ldots, x_n(a)]^T
$$
Whenever a problem must be transformed by introducing new variables, it is recommended that a dictionary be provided to show the relationship between the new and the old variables. At the same time, this information, together with the differential equations and the initial values, can be displayed in a chart. Such systematic bookkeeping can be helpful in a complicated situation.
To illustrate, let us transform the initial-value problem
$$
\begin{cases}
x''' = \cos x + \sin x' - e^{x''} + t^2 \\
x(0) = 3, \quad x'(0) = 7, \quad x''(0) = 13
\end{cases}
\tag{12}
$$
into a form suitable for solution by the Runge-Kutta procedure. A chart summarizing the transformed problem is as follows:

    Old Variable    New Variable    Initial Value    Differential Equation
    x               x_1             3                x_1′ = x_2
    x′              x_2             7                x_2′ = x_3
    x′′             x_3             13               x_3′ = cos x_1 + sin x_2 − e^{x_3} + t^2

So the corresponding first-order system is
$$
X' = \begin{bmatrix} x_2 \\ x_3 \\ \cos x_1 + \sin x_2 - e^{x_3} + t^2 \end{bmatrix}
$$
and X(0) = [3, 7, 13]^T.
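The bookkeeping for Equation (12) can be mirrored in a short Python sketch. The helper names `first_order_system`, `rk4_step`, and `f12` are hypothetical and chosen here for illustration only. Because the term e^{x′′} is roughly 4.4 × 10^5 at the initial value x′′(0) = 13, the sketch assumes a very small step size (h = 10^{-6}) and advances only to t = 0.001; these values are illustrative assumptions, not recommendations from the text.

```python
import math

def first_order_system(f, n):
    """Build F(t, X) = [x2, ..., xn, f(t, x1, ..., xn)] for x^(n) = f(...)."""
    def F(t, X):
        return list(X[1:n]) + [f(t, *X)]
    return F

def rk4_step(F, t, X, h):
    """One classical fourth-order Runge-Kutta step for X' = F(t, X)."""
    K1 = F(t, X)
    K2 = F(t + h / 2, [x + h / 2 * k for x, k in zip(X, K1)])
    K3 = F(t + h / 2, [x + h / 2 * k for x, k in zip(X, K2)])
    K4 = F(t + h, [x + h * k for x, k in zip(X, K3)])
    return [x + h / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(X, K1, K2, K3, K4)]

def f12(t, x1, x2, x3):
    """Right-hand side of (12) with x1 = x, x2 = x', x3 = x''."""
    return math.cos(x1) + math.sin(x2) - math.exp(x3) + t * t

F = first_order_system(f12, 3)
F0 = F(0.0, [3.0, 7.0, 13.0])      # value of X' at the initial point

h, nsteps = 1.0e-6, 1000           # assumed step size and step count
t, X = 0.0, [3.0, 7.0, 13.0]
for _ in range(nsteps):
    X = rk4_step(F, t, X, h)
    t += h
print(F0, t, X)
```

The first two components of `F0` simply repeat the initial values 7 and 13, which is the chart's statement that x_1′ = x_2 and x_2′ = x_3.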
Systems of Higher-Order Differential Equations
By systematically introducing new variables, we can transform a system of differential equations of various orders into a larger system of first-order equations. For instance, the system
$$
\begin{cases}
x'' = x - y - (3x')^2 + (y')^3 + 6y'' + 2t \\
y''' = y'' - x' + e^{x} - t \\
x(1) = 2, \quad x'(1) = -4, \quad y(1) = -2, \quad y'(1) = 7, \quad y''(1) = 6
\end{cases}
\tag{13}
$$
can be solved by the Runge-Kutta procedure if we first transform it according to the following chart:

    Old Variable    New Variable    Initial Value    Differential Equation
    x               x_1             2                x_1′ = x_2
    x′              x_2             −4               x_2′ = x_1 − x_3 − 9x_2^2 + x_4^3 + 6x_5 + 2t
    y               x_3             −2               x_3′ = x_4
    y′              x_4             7                x_4′ = x_5
    y′′             x_5             6                x_5′ = x_5 − x_2 + e^{x_1} − t

Hence, we have
$$
X' = \begin{bmatrix}
x_2 \\
x_1 - x_3 - 9x_2^2 + x_4^3 + 6x_5 + 2t \\
x_4 \\
x_5 \\
x_5 - x_2 + e^{x_1} - t
\end{bmatrix}
$$
and X(1) = [2, −4, −2, 7, 6]^T.
Autonomous ODE Systems
We notice that t is present on the right-hand side of Equation (11) and that therefore the equations x_0 = t and x_0′ = 1 can be introduced to form an autonomous system of ordinary differential equations in vector notation. It is easy to show that a higher-order system of differential equations having the form in Equation (10) can be written in vector notation as
$$
\begin{cases}
X' = F(X) \\
X(a) = S \quad \text{(given)}
\end{cases}
$$
where
$$
\begin{aligned}
X &= [x_0, x_1, x_2, \ldots, x_n]^T \\
X' &= [x_0', x_1', x_2', \ldots, x_n']^T \\
F &= [1, x_2, x_3, x_4, \ldots, x_n, f]^T
\end{aligned}
$$
and
$$
X(a) = [a, x_1(a), x_2(a), \ldots, x_n(a)]^T
$$
As an example, the ordinary differential equation system in Equation (13) can be written in autonomous form as
$$
X' = \begin{bmatrix}
1 \\
x_2 \\
x_1 - x_3 - 9x_2^2 + x_4^3 + 6x_5 + 2x_0 \\
x_4 \\
x_5 \\
x_5 - x_2 + e^{x_1} - x_0
\end{bmatrix}
$$
and X(1) = [1, 2, −4, −2, 7, 6]^T.
Summary 7.4
• A system of ordinary differential equations
$$
\begin{cases}
x_1' = f_1(t, x_1, x_2, \ldots, x_n) \\
x_2' = f_2(t, x_1, x_2, \ldots, x_n) \\
\quad\vdots \\
x_n' = f_n(t, x_1, x_2, \ldots, x_n) \\
x_1(a) = s_1,\ x_2(a) = s_2,\ \ldots,\ x_n(a) = s_n \quad \text{(all given)}
\end{cases}
$$
can be written in vector notation as
$$
\begin{cases}
X' = F(t, X) \\
X(a) = S \quad \text{(given)}
\end{cases}
$$
where we define the following n component vectors
$$
\begin{aligned}
X &= [x_1, x_2, \ldots, x_n]^T \\
X' &= [x_1', x_2', \ldots, x_n']^T \\
F &= [f_1, f_2, \ldots, f_n]^T \\
X(a) &= [x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{aligned}
$$
• The Taylor series method of order m is
$$
X(t+h) = X + hX' + \frac{h^2}{2}X'' + \frac{h^3}{3!}X''' + \cdots + \frac{h^m}{m!}X^{(m)}
$$
where X = X(t), X′ = X′(t), X′′ = X′′(t), X′′′ = X′′′(t), and so on.
• The Runge-Kutta method of order 4 is
$$
X(t+h) = X + \frac{h}{6}\,(K_1 + 2K_2 + 2K_3 + K_4)
$$
where
$$
\begin{aligned}
K_1 &= F(t, X) \\
K_2 &= F\!\left(t + \tfrac{1}{2}h,\ X + \tfrac{1}{2}hK_1\right) \\
K_3 &= F\!\left(t + \tfrac{1}{2}h,\ X + \tfrac{1}{2}hK_2\right) \\
K_4 &= F(t + h,\ X + hK_3)
\end{aligned}
$$
Here, X = X(t), and all quantities are vectors with n components except variables t and h.
• We can absorb the t variable into the vector by letting x_0 = t and then writing the autonomous form for the system of ordinary differential equations in vector notation as
$$
\begin{cases}
X' = F(X) \\
X(a) = S \quad \text{(given)}
\end{cases}
$$
where vectors are defined to have n + 1 components. Then
$$
\begin{aligned}
X &= [x_0, x_1, x_2, \ldots, x_n]^T \\
X' &= [x_0', x_1', x_2', \ldots, x_n']^T \\
F &= [1, f_1, f_2, \ldots, f_n]^T \\
X(a) &= [a, x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{aligned}
$$
• The Runge-Kutta method of order 4 for the system of ordinary differential equations in autonomous form is
$$
X(t+h) = X + \frac{h}{6}\,(K_1 + 2K_2 + 2K_3 + K_4)
$$
where
$$
\begin{aligned}
K_1 &= F(X) \\
K_2 &= F\!\left(X + \tfrac{1}{2}hK_1\right) \\
K_3 &= F\!\left(X + \tfrac{1}{2}hK_2\right) \\
K_4 &= F(X + hK_3)
\end{aligned}
$$
Here, X = X(t), and all quantities F and K_i are vectors with n + 1 components except the variables t and h.
• A single nth-order ordinary differential equation with initial values has the form
$$
\begin{cases}
x^{(n)} = f(t, x, x', x'', \ldots, x^{(n-1)}) \\
x(a),\ x'(a),\ x''(a),\ \ldots,\ x^{(n-1)}(a) \quad \text{(all given)}
\end{cases}
$$
It can be turned into a system of first-order equations of the form
$$
\begin{cases}
X' = F(t, X) \\
X(a) = S \quad \text{(given)}
\end{cases}
$$
where
$$
\begin{aligned}
X &= [x_1, x_2, \ldots, x_n]^T \\
X' &= [x_1', x_2', \ldots, x_n']^T \\
F &= [x_2, x_3, x_4, \ldots, x_n, f]^T \\
X(a) &= [x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{aligned}
$$
• We can absorb the variable t into the vector notation by letting x_0 = t and extending the vectors to length n + 1. Thus, a single nth-order ordinary differential equation can be written as
$$
\begin{cases}
X' = F(X) \\
X(a) = S \quad \text{(given)}
\end{cases}
$$
where
$$
\begin{aligned}
X &= [x_0, x_1, x_2, \ldots, x_n]^T \\
X' &= [x_0', x_1', x_2', \ldots, x_n']^T \\
F &= [1, x_2, x_3, x_4, \ldots, x_n, f]^T \\
X(a) &= [a, x_1(a), x_2(a), \ldots, x_n(a)]^T
\end{aligned}
$$
Exercises 7.4
a1. Consider
$$
\begin{cases}
x' = y, \quad x(0) = -1 \\
y' = x, \quad y(0) = 0
\end{cases}
$$
Write down the equations, without derivatives, to be used in the Taylor series method of order 5.
a2. How would you solve this system of differential equations numerically?
$$
\begin{cases}
x_1' = x_1^2 + e^{t} - t^2 \\
x_2' = x_2 - \cos t \\
x_1(0) = 0, \quad x_2(1) = 0
\end{cases}
$$
a3. How would you solve the initial-value problem
$$
\begin{cases}
x_1'(t) = x_1(t)e^{t} + \sin t - t^2 \\
x_2'(t) = [x_2(t)]^2 - e^{t} + x_2(t) \\
x_1(1) = 2, \quad x_2(1) = 4
\end{cases}
$$
if a computer program were available to solve an initial-value problem of the form x′ = f(t, x) involving a single unknown function x = x(t)?
a4. Write an equivalent system of first-order differential equations without t appearing on the right-hand side:
$$
\begin{cases}
x' = x^2 + \log(y) + t^2 \\
y' = e^{y} - \cos(x) + \sin(tx) - (xy)^7 \\
x(0) = 1, \quad y(0) = 3
\end{cases}
$$
a5. Turn this differential equation into a system of first-order equations suitable for applying the Runge-Kutta method:
$$
\begin{cases}
x''' = 2x' + \log(x'') + \cos(x) \\
x(0) = 1, \quad x'(0) = -3, \quad x''(0) = 5
\end{cases}
$$
6. a. Assuming that a program is available for solving initial-value problems of the form in Equation (11), how can it be used to solve the following differential equation?
$$
\begin{cases}
x''' = t + x + 2x' + 3x'' \\
x(1) = 3, \quad x'(1) = -7, \quad x''(1) = 4
\end{cases}
$$
b. How would this problem be solved if the initial conditions were x(1) = 3, x′(1) = −7, and x′′′(1) = 0?
a7.
a8.
a9.
a10.
How would you solve this differential equation problem numerically?
$$
\begin{cases}
x_1'' = x_1' + x_1^2 - \sin t \\
x_2'' = x_2 - (x_2')^{1/2} + t
\end{cases}
$$
with initial conditions
$$
x_1(0) = 1, \quad x_2(1) = 3, \quad x_1'(0) = 0, \quad x_2'(1) = -2
$$
Convert to a first-order system the orbital equations
$$
\begin{cases}
x'' + x(x^2 + y^2)^{-3/2} = 0 \\
y'' + y(x^2 + y^2)^{-3/2} = 0
\end{cases}
$$
with initial conditions
$$
x(0) = 0.5, \quad x'(0) = 0.75, \quad y(0) = 0.25, \quad y'(0) = 1.0
$$
Rewritethefollowingequationasasystemoffirst-order differential equations without t appearing on the right- hand side:
7.4 Methods for First and Higher-Order Systems 345 Determine the associated first-order system and its aux-
11.
a
Determine a system of first-order equations equivalent to each of the following:
Follow the instructions in the preceding problem on this example:
$txyz + x'y'/t = tx^2 + x/y'' + z$
$t^2x/z + y'z't = y^2 - (z'')^2x + x'y'$
12.
Consider
x(iv) =(x′′′)2 +cos(x′x′′)−sin(tx)+logx
14.
15.
16.
17.
18.
t x(0)=1, x (0)=3, x (0)=4, x (0)=5
′ ′′ ′′′ Expressthesystemofordinarydifferentialequations
d2z dz
xz − 2t = 2te
dt2 dt
dx −2xzdx =3x2yt2
dt2 dt d2y dy
′ ′′2√2 2′2
a 13.
iliary initial conditions. The problem
x′′(t) = x + y − 2x′ + 3y′ + log t y′′(t) = 2x − 3y + 5x′ + ty′ − sin t x(0) = 1, x′(0) = 2
y(0) = 3, y′(0) = 4
is to be put into the form of an autonomous system of five first-order equations. Give the resulting system and the appropriate initial values.
Write procedure XP System for use with the fourth-order Runge-Kutta routine RK4 System1 for the following differential equation:
x′′′ = 10ex′′ − x′′′ sin(x′x) − (xt)10
x(2) = 6.5, x′(2) = 4.1, x′′(2) = 3.2
If we are going to solve the initial-value problem
$x''' = x' - tx'' + x + \ln t$
$x(1) = x'(1) = x''(1) = 1$
using Runge-Kutta formulas, how should the problem be transformed?
Convert this problem involving differential equations into an autonomous system of first-order equations (with initial values):
2
2 − ey = 4xt2z dt dt
z(1) = x′′(1) = y′(1) = 2 z′(1) = x(1) = y(1) = 3
as a system of first-order ordinary differential equations.
3x+tanx−x= t+1+y+(y)
−3y′ + cot y′′ + y2 = t2 + (x + 1)1/2 + 4x′ x(1)=2, x′(1)=−2, y(1)=7, y′(1)=3
aa. x′′′+x′′sinx+tx′+x=0 (iv) ′′ ′ ′
tyz − x′z′y′ = z2 − zx′′ − (yz)′ x(3)=1, y(3)=2, z(3)=4 x′(3) = 5, y′(3) = 6, z′(3) = 7
Turn this pair of differential equations into a second-order differential equation involving x alone:
b. x +x cosx +txx =0
c.
$x'' = 3x^2 - 7y^2 + \sin t + \cos(x'y')$
$y''' = y + x^2 - \cos t - \sin(xy'')$
x′′ =x′−x x(0)=0,
x′(0)=1
x′ =−x+axy y′ =3y−xy
a1. Solve the system of differential equations (1) by using a7. two different methods given in this section and compare
the results with the analytic solution.
a2. Solve the initial-value problem
x ′ = t + x 2 − y
y′ = t2 − x + y2 x(0)=3, y(0)=2
by means of the Taylor series method using h = 1/128 8. on the interval [0, 0.38]. Include terms involving three derivatives in x and y. How accurate are the computed
function values?
9.
a4. Write procedure RK4 System2 and a driver program
for solving the ordinary differential equation system
given by Equation (2). Use h = −10−2, and print out
the values of x0, x1, and x2, together with the true
solution on the interval [−1, 0]. Verify that the true so-
lution is x(t) = et +6+6t +4t2 +t3 and y(t) = 10. et−t3+t2+2t+2.
a5. Using the Runge-Kutta procedure, solve the following initial-value problem on the interval 0 ≦ t ≦ 2π . Plot the resulting curves (x1(t), x2(t)) and (x3(t), x4(t)). They should be circles.
Write and test a program, using the Taylor series method of order 5, to solve the system
x(5)=2, y(5)=3 −3
3. Write the Runge-Kutta procedure to solve x′ =−3x
Print a table of sin t and cos t on the interval [0, π/2] by numerically solving the system
x ′ = y y′ = −x
x(0)=0, y(0)=1
Write a program for using the Taylor series method of
1 2 x′ = 1x
2 3 1 x1(0) = 0,
order 3 to solve the system
x ′ = t x + y ′ − t 2
y ′ = t y + 3 t
z′ =tz−y′+6t3
x(0)=1, y(0)=2 z(0)=3 on the interval [0, 0.75] using h = 0.01.
Write and test a short program for solving the system of differential equations
X = − x 1 x 12 + x 2 2 11.
series method of order 4.
Recode and test procedure RK4 System2 using a computer language that supports vector operations.
Verify the numerical results given in the text for the system of differential equations (1) from programs Test RK4 System1 and RK4 System2.
(Continuation) Use mathematical software such as MATLAB, Maple, or Mathematica containing symbolic manipulation capabilities to verify the analytic solution for the system of differential equations (1).
(Continuation) Use mathematical software routines such as in MATLAB, Maple, or Mathematica to verify the numerical solutions given in the text. Plot the resulting
x4 ′ −3/2
x ′ = t x − y 2 + 3 t y′ = x2 − ty − t2
on the interval [5, 6] using h = 10 . Print values of x and y at steps of 0.1.
x2(0) = 1 on the interval 0 ≦ t ≦ 4. Plot the solution.
x 3
y(2)=5, x(2)=3
over the interval [2, 5] with h = 0.25. Use the Taylor
y ′ = x 3 − t 2 y − t 2 x′ = tx2 − y4 + 3t
−x x2 + x2−3/2 212
X(0) = [1, 0, 0, 1]T 12. 6. Solvetheproblem
x ′ = 1
x1′ =−x2+cosx0
13.
14.
x2′= x1+sinx0 0
x0(1) = 1, x1(1) = 0,
Use the Runge-Kutta method and the interval −1 ≦ t ≦ 2.
x2(1) = −1
Computer Exercises 7.4
solution curve. Compare with the results from programs Test RK4 System1 and Test RK4 System2.
15. Use RK4 System1 to solve each of the following for 0 ≦ t ≦ 1. Use h = 2−k with k = 5, 6, and 7, and compare results.
$x'' = 2(e^{2t} - x^2)^{1/2}$
17. Solve
$x'' + x' + x^2 - 2t = 0$
$x(0) = 0, \quad x'(0) = 0.1$
on [0, 3] by any convenient method. If a plotter is available, graph the solution.
18. Solve
$x'' = 2x' - 5x$
$x(0) = 0, \quad x'(0) = 0.4$
on the interval [−2, 0].
19. Write computer programs based on the pseudocode in the text to find the numerical solution of these ordinary differential equation systems:
a. IVP (9) b. IVP (12) c. IVP (13)
20. (Continuation) Use mathematical software such as MATLAB, Maple, or Mathematica with symbolic manipulation capabilities to find their analytical solutions.
21. (Continuation) Use mathematical software routines such as in MATLAB, Maple, or Mathematica to verify the numerical solutions for these ordinary differential equation systems. Plot the resulting solution curves.
a.
b.
x(0) = 0, x′(0) = 1 x′′ = x2 − y + et
y′′ = x − y2 − et
x(0) = 0, x′(0) = 0
y(0) = 1, y′(0) = −2
16. Solve the Airy differential equation
$x'' = tx$
x(0) = 0.35502 80538 87817, x′(0) = −0.25881 94037 92807
on the interval [0, 4.5] using the Runge-Kutta method. Check value: x(4.5) = 0.00033 02503 is correct.
7.5 Adams-Bashforth-Moulton Methods
A Predictor-Corrector Scheme
Single-Step versus Multi-Step Methods
The procedures explained so far have solved the initial-value problem
$$X' = F(X), \qquad X(a) = S \ \text{(given)} \tag{1}$$
by means of single-step numerical methods. In other words, if the solution X(t) is known at a particular point t, then X(t + h) can be computed with no knowledge of the solution at points earlier than t. The Runge-Kutta and Taylor series methods compute X(t + h) in terms of X(t) and various values of F.
More efficient methods can be devised if several values X(t), X(t − h), X(t − 2h), … are used in computing X(t + h). Such methods are called multi-step methods. They have the obvious drawback that at the beginning of the numerical solution, no prior values of X are available. So it is usual to start a numerical solution with a single-step method, such as the Runge-Kutta procedure, and transfer to a multi-step procedure for efficiency as soon as enough starting values have been computed.
Adams-Bashforth Predictor Step
An example of a multi-step formula is known as the Adams-Bashforth formula (see Section 7.3 (p. 325) and the related Computer Exercises 7.3.2–4). It is
$$\widetilde{X}(t+h) = X(t) + \frac{h}{24}\bigl\{55F[X(t)] - 59F[X(t-h)] + 37F[X(t-2h)] - 9F[X(t-3h)]\bigr\} \tag{2}$$
Adams-Moulton Corrector Step
Adams-Moulton (Predictor-Corrector) Method
Here, $\widetilde{X}(t+h)$ is the predicted value of X(t + h) computed by using Formula (2). If the solution X has been computed at the four points t, t − h, t − 2h, and t − 3h, then Formula (2) can be used to compute $\widetilde{X}(t+h)$. If this is done systematically, then only one evaluation of F is required for each step. This represents a considerable savings over the fourth-order Runge-Kutta procedure; the latter requires four evaluations of F per step. (Of course, a consideration of truncation error and stability might permit a larger step size in the Runge-Kutta method and make it much more competitive.)
In practice, Formula (2) is never used by itself. Instead, it is used as a predictor, and then another formula is used as a corrector. The corrector that is usually used with Formula (2) is the Adams-Moulton formula:
$$X(t+h) = X(t) + \frac{h}{24}\bigl\{9F[\widetilde{X}(t+h)] + 19F[X(t)] - 5F[X(t-h)] + F[X(t-2h)]\bigr\} \tag{3}$$
Thus, Formula (2) predicts a tentative value of X(t + h), and Formula (3) computes this X value more accurately. The combination of the two formulas results in a predictor-corrector scheme.
With initial values of X specified at a, three steps of a Runge-Kutta method can be performed to determine enough X values that the Adams-Bashforth-Moulton procedure can begin. The fourth-order Adams-Bashforth and Adams-Moulton formulas, started with the fourth-order Runge-Kutta method, are referred to as the Adams-Moulton method. Predictor and corrector formulas of the same order are used so that only one application of the corrector formula is needed. Some suggest iterating the corrector formula, but experience has demonstrated that the best overall approach is only one application per step.
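The start-up and predictor-corrector steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the text's procedures: the names `rk4_step` and `abm4` are hypothetical, states are plain lists, the test problem x′ = y, y′ = −x is chosen only for the demonstration, and components equal to zero are skipped in the relative error estimate (a guard that is an assumption of this sketch).

```python
import math


def rk4_step(f, x, h):
    """One classical fourth-order Runge-Kutta step for X' = F(X)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * k for xi, k in zip(x, k1)])
    k3 = f([xi + 0.5 * h * k for xi, k in zip(x, k2)])
    k4 = f([xi + h * k for xi, k in zip(x, k3)])
    return [xi + h * (a + 2 * b + 2 * c + d) / 6
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def abm4(f, x0, h, nsteps):
    """Three RK4 start-up steps, then predictor (2) and corrector (3).

    Returns the final state and the last single-step error estimate
    (None if no predictor-corrector step was taken).
    """
    xs = [list(x0)]
    fs = [f(xs[0])]
    for _ in range(min(3, nsteps)):          # start-up by Runge-Kutta
        xs.append(rk4_step(f, xs[-1], h))
        fs.append(f(xs[-1]))
    est = None
    for _ in range(3, nsteps):
        x = xs[-1]
        f0, f1, f2, f3 = fs[-1], fs[-2], fs[-3], fs[-4]
        # Predictor: Adams-Bashforth, Formula (2)
        pred = [xi + h * (55 * a - 59 * b + 37 * c - 9 * d) / 24
                for xi, a, b, c, d in zip(x, f0, f1, f2, f3)]
        fp = f(pred)                         # first new evaluation of F
        # Corrector: Adams-Moulton, Formula (3), applied once
        corr = [xi + h * (9 * p + 19 * a - 5 * b + c) / 24
                for xi, p, a, b, c in zip(x, fp, f0, f1, f2)]
        # Relative difference between corrector and predictor, scaled by 19/270
        ratios = [abs(c - p) / abs(c) for c, p in zip(corr, pred) if c != 0]
        est = 19 / 270 * max(ratios, default=0.0)
        xs.append(corr)
        fs.append(f(corr))                   # second new evaluation of F
        xs, fs = xs[-4:], fs[-4:]            # keep only the history still needed
    return xs[-1], est


def circle(x):
    """Demonstration system x' = y, y' = -x (solution sin t, cos t)."""
    return [x[1], -x[0]]


final, est = abm4(circle, [0.0, 1.0], 0.01, 100)
print(final, [math.sin(1.0), math.cos(1.0)], est)
```

Each predictor-corrector step in this sketch costs two new evaluations of F, because the values at earlier points are kept in `fs` rather than recomputed.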
Pseudocode
Storage of the approximate solution at previous steps in the Adams-Moulton method is usually handled either by storing in an array of dimension larger than the total number of steps to be taken or by physically shifting data after each step (discarding the oldest data and storing the newest in their place). If an adaptive process is used, the total number of steps to be taken cannot be determined beforehand. Physical shifting of data can be eliminated by cycling the indices of a storage array of fixed dimension. For the Adams-Moulton method, the xi data for X(t) are stored in a two-dimensional array with entries zim in locations m = 1,2,3,4,5,1,2,… for t = a,a + h,a + 2h,a + 3h,a + 4h,a + 5h,a + 6h,…, respectively. The sketch in Figure 7.6 shows the first several t values with corresponding m values and abbreviations for the formulas used.
How the Scheme Works
t:       a    a + h   a + 2h   a + 3h   a + 4h   a + 5h   a + 6h
m:       1    2       3        4        5        1        2
method:       RK      RK       RK       AB/AM    AB/AM    AB/AM
FIGURE 7.6 Starting values for applications of RK and AB/AM methods
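The index cycling described above can be illustrated with a short Python sketch. It assumes five storage slots numbered 0 to 4, as in the array bounds 0:4 of the pseudocode (the text's figure counts the same slots as 1 to 5); the helper names `next_slot` and `history_slots` are hypothetical.

```python
SLOTS = 5  # columns 0..4 of the storage arrays


def next_slot(m):
    """Slot that receives the solution at t + h when slot m holds t."""
    return (m + 1) % SLOTS


def history_slots(m):
    """Slots holding data at t, t - h, t - 2h, t - 3h when slot m is current.

    Uses the same index expression as the pseudocode: (m - k + 6) mod 5.
    """
    return [(m - k + 6) % SLOTS for k in range(1, 5)]


m = 0
visited = [m]
for _ in range(6):
    m = next_slot(m)
    visited.append(m)

print(visited)            # slots used for t = a, a + h, ..., a + 6h
print(history_slots(1))   # slots read when the current slot is 1
```

No data are moved: the oldest column is simply overwritten when the index wraps around.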
An error analysis can be conducted after each step of the Adams-Moulton method. If $x_i^{(p)}$ is the numerical approximation of the ith equation in System (1) at t + h obtained by predictor Formula (2) and $x_i$ is that from corrector Formula (3) at t + h, then it can be shown that the single-step error for the ith component at t + h is given approximately by
$$\varepsilon_i = \frac{19}{270}\,\frac{x_i - x_i^{(p)}}{|x_i|}$$
So we compute
$$\text{est} = \max_{1 \le i \le n} |\varepsilon_i|$$
in the Adams-Moulton procedure AM System to obtain an estimate of the maximum single- step error at t + h.
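The constant 19/270 can be motivated by comparing the standard local truncation errors of the two fourth-order formulas. The following is only a sketch of that argument for a single component; it assumes that the fifth derivative is nearly constant over the step, so that the same value $x^{(5)}(\xi)$ can be used in both error terms, and it gives the absolute error before division by $|x_i|$.

```latex
\begin{align*}
x(t+h) - x^{(p)} &\approx \tfrac{251}{720}\,h^5 x^{(5)}(\xi)
  && \text{(Adams-Bashforth predictor)}\\
x(t+h) - x &\approx -\tfrac{19}{720}\,h^5 x^{(5)}(\xi)
  && \text{(Adams-Moulton corrector)}\\
x - x^{(p)} &\approx \tfrac{270}{720}\,h^5 x^{(5)}(\xi)
  && \text{(first line minus second line)}\\
x(t+h) - x &\approx -\tfrac{19}{270}\,\bigl(x - x^{(p)}\bigr)
  && \text{(eliminating } h^5 x^{(5)}(\xi)\text{)}
\end{align*}
```

Dividing the magnitude of the last line by $|x|$ gives a relative estimate of the same form as $|\varepsilon_i|$.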
A control procedure is needed that calls the Runge-Kutta procedure three times and then calls the Adams-Moulton predictor-corrector scheme to compute the remaining steps. Such a procedure for doing nsteps steps with a fixed step size h follows:
procedure AMRK(n, h, (xi), nsteps)
integer i, k, m, n; real est, h; real array (xi)0:n
allocate real array (fij)0:n×0:4, (zij)0:n×0:4
m ← 0
output h
output 0, (xi)
for i = 0 to n
    zim ← xi
end for
for k = 1 to 3
    call RK System(m, n, h, (zij), (fij))
    output k, (zim)
end for
for k = 4 to nsteps
    call AM System(m, n, h, est, (zij), (fij))
    output k, (zim)
    output est
end for
for i = 0 to n
    xi ← zim
end for
deallocate array (f, z)
end procedure AMRK
AMRK Pseudocode
The Adams-Moulton method for a system and the computation of the single-step error are accomplished in the following pseudocode:
procedure AM System(m, n, h, est, (zij), (fij))
integer i, j, k, m, mp1; real d, dmax, est, h
real array (zij)0:n×0:4, (fij)0:n×0:4
allocate real array (si)0:n, (yi)0:n
real array (ai)1:4 ← (55, −59, 37, −9)
real array (bi)1:4 ← (9, 19, −5, 1)
mp1 ← (1 + m) mod 5
call XP System(n, (zim), (fim))
for i = 0 to n
    si ← 0
end for
for k = 1 to 4
    j ← (m − k + 6) mod 5
    for i = 0 to n
        si ← si + ak fij
    end for
end for
for i = 0 to n
    yi ← zim + h si/24
end for
call XP System(n, (yi), (fi,mp1))
for i = 0 to n
    si ← 0
end for
for k = 1 to 4
    j ← (mp1 − k + 6) mod 5
    for i = 0 to n
        si ← si + bk fij
    end for
end for
for i = 0 to n
    zi,mp1 ← zim + h si/24
end for
m ← mp1
dmax ← 0
for i = 0 to n
    d ← |zim − yi|/|zim|
    if d > dmax then
        dmax ← d
        j ← i
    end if
end for
est ← 19 dmax/270
deallocate array (s, y)
end procedure AM System
AM System Pseudocode
Here, the function evaluations are stored cyclically in fim for use by Formulas (2) and (3). Various optimization techniques are possible in this pseudocode. For example, the programmer may wish to move the computation of h/24 outside of the loops.
A companion Runge-Kutta procedure is needed, which is a modification of procedure RK4 System2 from Section 7.4:
procedure RK System(m, n, h, (zij), (fij))
integer i, m, mp1, n; real h; real array (zij)0:n×0:4, (fij)0:n×0:4
allocate real array (gij)0:n×0:3, (yi)0:n
mp1 ← (1 + m) mod 5
call XP System(n, (zim), (fim))
for i = 0 to n
    yi ← zim + (1/2)h fim
end for
call XP System(n, (yi), (gi,1))
for i = 0 to n
    yi ← zim + (1/2)h gi,1
end for
call XP System(n, (yi), (gi,2))
for i = 0 to n
    yi ← zim + h gi,2
end for
call XP System(n, (yi), (gi,3))
for i = 0 to n
    zi,mp1 ← zim + h[fim + 2gi,1 + 2gi,2 + gi,3]/6
end for
m ← mp1
deallocate array (gij), (yi)
end procedure RK System
RK System Pseudocode
As before, the programmer may wish to move h/6 out of the loop.
To use the Adams-Moulton pseudocode, we supply the procedure XP System that defines the system of ordinary differential equations and write a driver program with a call to procedure AMRK. The complete program then consists of the following five parts: the main program and procedures XP System, AMRK, RK System, and AM System.
As an illustration, the pseudocode for IVP (12) in Section 7.4 is as follows:
program Test AMRK
integer n ← 5, nsteps ← 100
real a ← 0, b ← 1
real h; real array (xi)0:n
(xi) ← (1, 2, −4, −2, 7, 6)
h ← (b − a)/nsteps
call AMRK(n, h, (xi), nsteps)
end program Test AMRK

procedure XP System(n, (xi), (fi))
integer n; real array (xi)0:n, (fi)0:n
f0 ← 1
f1 ← x2
f2 ← x1 − x3 − 9(x2)^2 + (x4)^3 + 6x5 + 2x0
f3 ← x4
f4 ← x5
f5 ← x5 − x2 + e^(x1) − x0
end procedure XP System
Test AMRK Pseudocode
Here, we have programmed this procedure for an autonomous system of ordinary differential equations.
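For readers who prefer to experiment in a programming language, the right-hand side for this example might be written in Python as follows. This is a sketch, not the text's code: the function name `xp_system` is hypothetical, and the flattened term in f2 is read here as 9(x2)^2, an interpretation of the printed pseudocode rather than something the extracted text shows unambiguously.

```python
import math


def xp_system(x):
    """Right-hand side F(X) of the autonomous system; x[0] plays the role of t."""
    x0, x1, x2, x3, x4, x5 = x
    return [
        1.0,                                              # f0: t' = 1
        x2,                                               # f1
        x1 - x3 - 9 * x2**2 + x4**3 + 6 * x5 + 2 * x0,    # f2 (assumed reading)
        x4,                                               # f3
        x5,                                               # f4
        x5 - x2 + math.exp(x1) - x0,                      # f5
    ]


# Slopes at the starting vector used in Test AMRK
f_start = xp_system([1.0, 2.0, -4.0, -2.0, 7.0, 6.0])
print(f_start)
```

Evaluating the slopes once at the initial vector is a cheap way to catch transcription errors before running a full integration.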
How Adaptive Scheme Works
An Adaptive Scheme
Since an estimate of the error is available from the Adams-Moulton method, it is natural to replace procedure AMRK with one that employs an adaptive scheme—that is, one that changes the step size. A procedure similar to the one used in Section 7.3 is outlined here. The Runge-Kutta method is used to compute the first three steps, and then the Adams-Moulton method is used. If the error test determines that halving or doubling of the step size is necessary in the first step using the Adams-Moulton method, then the step size is halved or doubled, and the whole process starts again with the initial values—so at least one step of the Adams-Moulton method must take place. If during this process the error test indicates that halving is required at some point within the interval [a, b], then the step size is halved. A retreat is made back to a previously computed value, and after three Runge-Kutta steps have been computed, the process continues, using the Adams-Moulton method again but with the new step size. In other words, the point at which the error was too large should be computed by the Adams-Moulton method, not the Runge-Kutta method. Doubling the step size is handled in an analogous manner. Doubling the step size requires only saving an appropriate number of previous values; however, one can simplify this process (whether halving or doubling the step size) by always backing up two steps with the old step size and then using this as the beginning point of a new initial-value problem with the new step size. Other, more complicated procedures can be designed and can be the subject of numerical experimentation. (See Computer Exercise 7.5.3.)
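The decision at the heart of such an adaptive scheme can be sketched as a small Python function. The rule below is an assumption made for illustration, since the text defers the details to Section 7.3: halve when the estimate exceeds an upper tolerance, double when it falls below a lower tolerance, and never leave the interval [hmin, hmax]. The function name `step_size_decision` is hypothetical; the parameter names mirror εmin, εmax, hmin, and hmax from the calling sequence suggested in the computer exercises.

```python
def step_size_decision(est, h, emin, emax, hmin, hmax):
    """Return (action, new_h) for one error test (assumed rule, see lead-in).

    est  : single-step error estimate from the predictor-corrector pair
    h    : current step size (its sign is preserved)
    emin : below this estimate the step is considered needlessly small
    emax : above this estimate the step is considered too large
    """
    if est > emax and abs(h) / 2 >= hmin:
        return "halve", h / 2
    if est < emin and abs(h) * 2 <= hmax:
        return "double", h * 2
    return "keep", h


print(step_size_decision(1e-3, 0.1, 1e-8, 1e-5, 1e-6, 1.0))
print(step_size_decision(1e-10, 0.1, 1e-8, 1e-5, 1e-6, 1.0))
print(step_size_decision(1e-6, 0.1, 1e-8, 1e-5, 1e-6, 1.0))
```

Whenever the action is "halve" or "double", the simplification described in the text applies: back up two steps taken with the old step size, treat that point as a new initial value, and recompute three Runge-Kutta steps with the new step size before resuming the Adams-Moulton method.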
An Engineering Example
In chemical engineering, a complicated production activity may involve several reactors connected with inflow and outflow pipes. The concentration of a certain chemical in the ith reactor is an unknown quantity, xi. Each xi is a function of time. If there are n reactors, the whole process is governed by a system of n differential equations of the form
$$X' = AX + V, \qquad X(0) = S \ \text{(given)}$$
where X is the vector containing the unknown quantities xi , A is an n × n matrix, and V is a constant vector. The entries in A depend on the flow rates permitted between different reactors of the system.
There are several approaches to solving this problem. One is to diagonalize the matrix A by finding a nonsingular matrix P for which $P^{-1}AP$ is diagonal and then using the matrix exponential function to solve the system in an analytic form. This is a task that mathematical software can handle. On the other hand, we can simply turn the problem over to an ODE solver and get the numerical solution. One piece of information that is always wanted in such a problem is a description of the steady state of the system, that is, the limiting values of all variables as t → ∞. Each function xi should be a linear combination of exponential functions of the form $t \mapsto e^{\lambda t}$, in which λ < 0 (plus a constant steady-state term). Here is a simple example that can illustrate all of this:
Sample ODE 1
$$\begin{bmatrix} x_1' \\ x_2' \\ x_3' \end{bmatrix} = \begin{bmatrix} -8/3 & -4/3 & 1 \\ -17/3 & -4/3 & 1 \\ -35/3 & 14/3 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} 12 \\ 29 \\ 48 \end{bmatrix} \tag{4}$$
Using mathematical software such as MATLAB, Maple, or Mathematica, we can obtain a closed-form solution:
$$x(t) = \tfrac{1}{6}e^{-3t}\bigl(6 - 50e^{t} + 10e^{2t} + 34e^{3t}\bigr)$$
$$y(t) = \tfrac{1}{6}e^{-3t}\bigl(12 - 125e^{t} + 40e^{2t} + 73e^{3t}\bigr)$$
$$z(t) = \tfrac{1}{6}e^{-3t}\bigl(14 - 200e^{t} + 70e^{2t} + 116e^{3t}\bigr)$$
FIGURE 7.7
Solution curves for a stiff ODE
Sample ODE 2
True Solution
For a system of ordinary differential equations with a large number of variables, it may be more convenient to represent them in a computer program with an array such as x(i,t) rather than by separate variable names. To see the numerical value of the analytic solution at a single point, say, t = 2.5, we obtain x(2.5) ≈ 5.74788, y(2.5) ≈ 12.5746, z(2.5) ≈ 20.0677. Also, we can produce a graph of the analytic solution to the problem.
Finally, the programs presented in this section can be used to generate a numerical solution on a prescribed interval with a prescribed number of points.
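The closed-form solution and the steady state of the reactor example can be checked with a few lines of Python. This sketch takes the entries of A and V in Equation (4) as read from the text and identifies x, y, z with x1, x2, x3; the names `closed_form`, `steady`, and `residual` are hypothetical. The steady state is taken to be the constant terms 34/6, 73/6, and 116/6 of the closed form, and exact rational arithmetic confirms that it satisfies AX + V = 0.

```python
import math
from fractions import Fraction as Fr

# Coefficient matrix and constant vector of Equation (4), as read from the text
A = [[Fr(-8, 3), Fr(-4, 3), Fr(1)],
     [Fr(-17, 3), Fr(-4, 3), Fr(1)],
     [Fr(-35, 3), Fr(14, 3), Fr(-2)]]
V = [Fr(12), Fr(29), Fr(48)]


def closed_form(t):
    """Closed-form solution (x, y, z) quoted in the text."""
    e1, e2, e3 = math.exp(-t), math.exp(-2 * t), math.exp(-3 * t)
    x = (6 * e3 - 50 * e2 + 10 * e1 + 34) / 6
    y = (12 * e3 - 125 * e2 + 40 * e1 + 73) / 6
    z = (14 * e3 - 200 * e2 + 70 * e1 + 116) / 6
    return x, y, z


# Limits of the closed form as t -> infinity
steady = [Fr(34, 6), Fr(73, 6), Fr(116, 6)]

# At a steady state X' = 0, so A X + V must vanish
residual = [sum(a * s for a, s in zip(row, steady)) + v
            for row, v in zip(A, V)]

print(closed_form(2.5))
print([float(s) for s in steady], residual)
```

The closed form also vanishes at t = 0, which suggests that the quoted solution corresponds to starting all three concentrations at zero.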
Stiff ODEs and an Example
In many applications of differential equations there are several functions to be tracked together as functions of time. A system of ordinary differential equations may be used to model the physical phenomena. In such a situation, it can happen that different solution functions (or different components of a single solution) have quite disparate behavior that makes the selection of the step size in the numerical solution problematic. For example, one component of a function may require a small step in the numerical solution because it is varying rapidly, whereas another component may vary slowly and not require a small step size for its computation. Such a system is said to be stiff. Figure 7.7 illustrates a slowly varying solution surrounded by other solutions with rapidly decaying transients.
An example illustrates this possibility. Consider a system of two differential equations with initial conditions:
$$\begin{cases} x' = -20x - 19y, & x(0) = 2 \\ y' = -19x - 20y, & y(0) = 0 \end{cases} \tag{5}$$
The solution is easily seen to be
$$x(t) = e^{-39t} + e^{-t}, \qquad y(t) = e^{-39t} - e^{-t}$$
The component $e^{-39t}$ quickly becomes negligible as t increases, starting at 0. The solution is then approximately given by $x(t) = -y(t) = e^{-t}$, and this function is smooth and decreasing to 0. It would seem that in almost any numerical solution, a large step size could be used. However, let's examine the simplest of numerical procedures: Euler's method. It generates the solution by using the following equations:
Difference Equations
$$\begin{cases} x_{n+1} = x_n + h(-20x_n - 19y_n), & x_0 = 2 \\ y_{n+1} = y_n + h(-19x_n - 20y_n), & y_0 = 0 \end{cases}$$
Closed Form
These difference equations can be solved in closed form, and we have
$$x_n = (1 - 39h)^n + (1 - h)^n, \qquad y_n = (1 - 39h)^n - (1 - h)^n$$
For the numerical solution to converge to 0 (and thus imitate the actual solution), it is necessary that $h < \tfrac{2}{39}$. If we were solving only the differential equation $x' = -x$ to get the solution $x(t) = e^{-t}$, any step size h < 2 would give the correct decay as t increased. (See Exercise 7.5.2.)
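The step-size restriction for Euler's method on this system is easy to observe numerically. The Python sketch below iterates the difference equations directly and compares them with the closed form; the function names `euler` and `euler_closed_form` and the two trial step sizes (one just below and one just above 2/39 ≈ 0.0513) are choices made for this illustration.

```python
import math


def euler(h, n):
    """Apply n explicit Euler steps of size h to System (5)."""
    x, y = 2.0, 0.0
    for _ in range(n):
        x, y = x + h * (-20 * x - 19 * y), y + h * (-19 * x - 20 * y)
    return x, y


def euler_closed_form(h, n):
    """Closed-form solution of the Euler difference equations."""
    fast, slow = (1 - 39 * h) ** n, (1 - h) ** n
    return fast + slow, fast - slow


stable = euler(0.05, 200)     # h < 2/39: iterates decay toward 0
unstable = euler(0.06, 200)   # h > 2/39: the (1 - 39h)^n term blows up
print(stable, unstable)
```

With h = 0.06 the factor 1 − 39h equals −1.34, so the rapidly decaying component of the true solution is replaced by an oscillation that grows without bound.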
To see that numerical success (in the sense of being able to use a reasonable step size) depends on the method used, let us consider the implicit Euler method. For a single differential equation, this employs the formula
$$x_{n+1} = x_n + hf(t_{n+1}, x_{n+1})$$
Since $x_{n+1}$ appears on both sides of this equation, the equation must be solved for $x_{n+1}$. In the example being considered, the implicit Euler equations are
$$x_{n+1} = x_n + h(-20x_{n+1} - 19y_{n+1})$$
$$y_{n+1} = y_n + h(-19x_{n+1} - 20y_{n+1})$$
This pair of equations has the form $X_{n+1} = X_n + AX_{n+1}$, where A is the 2 × 2 matrix in the previous pair of equations (with the factor h included) and $X_n$ is the vector having components $x_n$ and $y_n$. This matrix equation can be written $(I - A)X_{n+1} = X_n$ or $X_{n+1} = (I - A)^{-1}X_n$. A consequence is that the explicit solution is $X_n = (I - A)^{-n}X_0$. At this point, it is necessary to appeal to a result concerning such iterative processes. For $X_n$ to converge to 0 for any choice of initial vector $X_0$, it is necessary and sufficient that all eigenvalues of $(I - A)^{-1}$ be less than one in modulus (see Kincaid and Cheney [2002]). Equivalently, the eigenvalues of I − A should be greater than 1 in modulus. An easy calculation shows that the eigenvalues of I − A are 1 + h and 1 + 39h, so for positive h this condition is met, without further hypotheses. Thus, the implicit Euler method can be used with any reasonable step size on this problem. In the literature on stiff equations, much more information can be found, and there are books that address this topic thoroughly. Some essential references are Dekker and Verwer [1984], Gear [1971], Miranker [1981], and Shampine and Gordon [1975].
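The implicit Euler iteration for this example can be carried out by solving the 2 × 2 linear system at each step by hand. The Python sketch below does so; the function names are hypothetical, and the closed form used for comparison, with factors 1/(1 + 39h) and 1/(1 + h), follows from the eigenvalues of I − A and is stated here as part of the illustration rather than quoted from the text.

```python
import math


def implicit_euler(h, n):
    """Apply n implicit Euler steps of size h to System (5).

    Each step solves (I - hM) X_new = X_old with M = [[-20, -19], [-19, -20]],
    so I - hM = [[a, b], [b, a]] with a = 1 + 20h and b = 19h.
    """
    x, y = 2.0, 0.0
    a, b = 1 + 20 * h, 19 * h
    det = a * a - b * b           # equals (1 + h)(1 + 39h), positive for h > 0
    for _ in range(n):
        x, y = (a * x - b * y) / det, (a * y - b * x) / det
    return x, y


def implicit_closed_form(h, n):
    """Closed form of the implicit Euler iterates for x0 = 2, y0 = 0."""
    fast, slow = (1 + 39 * h) ** (-n), (1 + h) ** (-n)
    return fast + slow, fast - slow


print(implicit_euler(0.5, 20))    # a step size ten times the explicit limit
print(implicit_euler(10.0, 5))    # even a very large step still decays
```

Both runs decay toward 0, in contrast to the explicit method, although a large step naturally gives a less accurate approximation of the transient.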
In general, stiff ordinary differential equations are rather difficult to solve. This is compounded by the fact that in most cases, you do not know beforehand whether an ordinary differential equation that you're trying to solve numerically is stiff. Software packages usually have ordinary differential equation solvers specifically designed to handle stiff ordinary differential equations. Some of these procedures may vary both the step size and the order of the method. In such algorithms, the Jacobian matrix ∂F/∂X may play a role. Solving an associated linear system involving the Jacobian matrix is critical to the reliability and efficiency of the code. The Jacobian matrix may be sparse, an indication that the function F does not depend on some of the variables in the problem.
Euler’s Equations
Additional Reading
History of ODE Numerical Methods
For readers interested in the history of numerical analysis, we recommend the book by Goldstine [1977]. The textbook on differential equations by Moulton [1930] gives some insight into the numerical methods used prior to the advent of high-speed computing machines. Moulton also gives some of the history, going back to Newton! The calculation of orbits in celestial mechanics has always been a stimulus for the invention of numerical methods, and so have the needs of ballistic science. Moulton mentions that the retardation of a projectile by air resistance is a very complicated function of velocity that necessitates numerical solution of the otherwise simple equations of ballistics.
Summary 7.5
• For the autonomous form for a system of ordinary differential equations in vector notation
$$X' = F(X), \qquad X(a) = S \ \text{(given)}$$
the Adams-Bashforth-Moulton method of fourth order is
$$\widetilde{X}(t+h) = X(t) + \frac{h}{24}\bigl\{55F[X(t)] - 59F[X(t-h)] + 37F[X(t-2h)] - 9F[X(t-3h)]\bigr\}$$
$$X(t+h) = X(t) + \frac{h}{24}\bigl\{9F[\widetilde{X}(t+h)] + 19F[X(t)] - 5F[X(t-h)] + F[X(t-2h)]\bigr\}$$
Here, $\widetilde{X}(t+h)$ is the predictor, and X(t + h) is the corrector. The Adams-Bashforth-Moulton method uses five values of F per step; when the values at earlier points are saved, as in procedure AM System, only two of them are new evaluations. With the initial vector X(a) given, the values for X(a + h), X(a + 2h), X(a + 3h) are computed by the Runge-Kutta method of fourth order. Then the Adams-Bashforth-Moulton method can be used repeatedly. The predicted value $\widetilde{X}(t+h)$ is computed from the four X values at t, t − h, t − 2h, and t − 3h, and then the corrected value X(t + h) can be computed by using the predictor value $\widetilde{X}(t+h)$ and previously evaluated values of F at t, t − h, and t − 2h.
a1. Find the general solution of this system by turning it into a first-order system of four equations:
$x'' = \alpha y$
$y'' = \beta x$
2. Verify the assertions made about the step size h in the discussion of stiff Equation (5).
3. Write autonomous systems of first-order differential equations for each of these:
a. $y'' + yz = 0$
   $z' + 2yz = 4$
   $y(0) = 1, \quad y'(0) = 0, \quad z(0) = 3$
b. $x''' - \sin(x'') + e^t x' + 2t\cos x = 25$
   $x(0) = 5, \quad x'(0) = 3, \quad x''(0) = 7$
c. $x''' - [\sin x'' + e^t x']^2 + \cos x = 0$
   $x(0) = 3, \quad x'(0) = 4, \quad x''(0) = 5$
4. Convert the following system of higher-order differential equations into a system of first-order equations in which t does not appear explicitly:
$x''' - 5tx''y'' - \ln(x')z = 0$
$y'' - \sin(ty) + 7tx'' = 0$
$z' + 16ty' - e^t z x' = 0$
Exercises 7.5
1. Test the procedure AMRK on the system given in Computer Exercise 7.4.2.
2. The single-step error is closely controlled by using fourth-order formulas; however, the roundoff error in performing the computations in Equations (3) and (4) can be large. It is logical to carry these out in what is known as partial double-precision arithmetic. The function F would be evaluated in single precision at the desired points X(t + ih), but the linear combination $\sum_i c_i F(X(t + ih))$ would be accumulated in double precision. Also, the addition of X(t) to this result is done in double precision. Recode the Adams-Moulton method so that partial double-precision arithmetic is used. Compare this code with that in the text for a system with a known solution. How do they compare with regard to roundoff error at each step?
3. Write and test an adaptive process similar to RK45 Adaptive in Section 7.3 with calling sequence
procedure AMRK Adaptive(n, h, ta, tb, (xi), itmax, εmin, εmax, hmin, hmax, iflag)
This routine should carry out the adaptive procedure outlined in this section and be used in place of the AMRK procedure.
4. Solve the predator-prey problem in the example at the beginning of this chapter with $a = -10^{-2}$, $b = -\tfrac{1}{4} \times 10^{2}$, $c = 10^{-2}$, and $d = -10^{2}$ and with initial values u(0) = 80, v(0) = 30. Plot u (the prey) and v (the predator) as functions of time t.
5. Solve and plot the numerical solution of the system of ordinary differential equations given by Equation (4) using mathematical software such as MATLAB, Maple, or Mathematica.
6. (Continuation) Repeat for Equation (5) using a routine specifically designed to handle stiff ordinary differential equations.
7a. Solve the following test problem and plot the solution curves. This problem corresponds to a recently discovered stable orbit that arises in the restricted three-body problem in which the orbits are co-planar. The two spatial coordinates of the jth body are $x_{1j}$ and $x_{2j}$ for j = 1, 2, 3. Each of the six coordinates satisfies a second-order differential equation:
$$x_{ij}'' = \sum_{\substack{k=1 \\ k \ne j}}^{3} m_k (x_{ik} - x_{ij})/d_{jk}^{3}$$
where $d_{jk}^{2} = \sum_{i=1}^{2} (x_{ij} - x_{ik})^2$ for k, j = 1, 2, 3. Assume that the bodies have equal mass, say, $m_1 = m_2 = m_3 = 1$, and with the appropriate starting conditions, they will follow the same figure-eight orbit as a periodic steady-state solution. When the system is rewritten as a first-order system, the dimension of the problem is twelve, and the initial conditions at t = 0 are given by
$x_{11} = -0.97000436, \quad x_{11}' = 0.466203685$
$x_{21} = 0.24308753, \quad x_{21}' = 0.43236573$
$x_{12} = 0.0, \quad x_{12}' = -0.93240737$
$x_{22} = 0.0, \quad x_{22}' = -0.86473146$
$x_{13} = 0.97000436, \quad x_{13}' = 0.466203685$
$x_{23} = -0.24308753, \quad x_{23}' = 0.43236573$
Solve the problem for t ∈ [0, 20].
7b. Solve the following test problem and plot the solution curves. The Lorenz problem is well known, and it arises in the study of dynamical systems:
$x_1' = 10(x_2 - x_1)$
$x_2' = x_1(28 - x_3) - x_2$
$x_3' = x_1x_2 - \tfrac{8}{3}x_3$
$x_1(0) = 15, \quad x_2(0) = 15, \quad x_3(0) = 36$
Solve the problem for t ∈ [0, 20]. It is known to have solutions that are potentially poorly conditioned. Note: For additional details on these problems, see Enright [2006].
8. Write a computer program based on pseudocode Test AMRK to find the numerical solution to the ordinary differential equation systems, and compare the results with those obtained by using a built-in routine such as in MATLAB, Maple, or Mathematica. Plot the resulting solution curves.
9. (Tacoma Narrows Bridge) In 1940, the third-longest suspension bridge in the world collapsed in a high wind. The following system of differential equations is a mathematical model that attempts to explain how twisting oscillations can be magnified and cause such a calamity:
\[
y'' = -d\,y' - \frac{K}{ma}\left[ e^{a(y - l\sin\theta)} - 1 + e^{a(y + l\sin\theta)} - 1 \right] + 0.2\,W \sin\omega t
\]
\[
\theta'' = -d\,\theta' + \frac{3\cos\theta}{l}\,\frac{K}{ma}\left[ e^{a(y - l\sin\theta)} - e^{a(y + l\sin\theta)} \right]
\]
The last term in the y equation is the forcing term for the wind W, which adds a strictly vertical oscillation to the bridge. Here, the roadway has width 2l hanging between two suspended cables, y is the current distance from the center of the roadway as it hangs below its equilibrium point, and θ is the angle the roadway makes with the horizontal. The model uses Newton's law F = ma and Hooke's constant K. Explore how ODE solvers are used to generate numerical trajectories for various parameter settings. Illustrate different types of phenomena that are available in this model. For additional details, see McKenna and Tuama [2001] and Sauer [2012].
FIGURE 7.8 Tacoma Narrows Bridge collapsing in 1940. (AP Images)
8 More on Linear Systems
In applications that involve partial differential equations, large linear systems arise with sparse coefficient matrices such as
\[
A = \begin{bmatrix}
4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-1 & 4 & -1 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & -1 & 4 & 0 & 0 & -1 & 0 & 0 & 0 \\
-1 & 0 & 0 & 4 & -1 & 0 & -1 & 0 & 0 \\
0 & -1 & 0 & -1 & 4 & -1 & 0 & -1 & 0 \\
0 & 0 & -1 & 0 & -1 & 4 & 0 & 0 & -1 \\
0 & 0 & 0 & -1 & 0 & 0 & 4 & -1 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & -1 & 4 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & -1 & 4
\end{bmatrix}
\]
Gaussian elimination may cause fill-in of the zero entries by nonzero values. On the other hand, iterative methods preserve its sparse structure.
8.1 Matrix Factorizations
An n × n system of linear equations can be written in matrix form
\[
Ax = b \tag{1}
\]
where the coefficient matrix A has the form
\[
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{bmatrix}
\]
Our main objective is to show that the naive Gaussian algorithm applied to A yields a factorization of A into a product of two simple matrices, one unit lower triangular:
\[
L = \begin{bmatrix}
1 \\
l_{21} & 1 \\
l_{31} & l_{32} & 1 \\
\vdots & \vdots & \vdots & \ddots \\
l_{n1} & l_{n2} & l_{n3} & \cdots & 1
\end{bmatrix}
\]
and the other upper triangular:
\[
U = \begin{bmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\
 & u_{22} & u_{23} & \cdots & u_{2n} \\
 & & u_{33} & \cdots & u_{3n} \\
 & & & \ddots & \vdots \\
 & & & & u_{nn}
\end{bmatrix}
\]
In short, we refer to this as an LU factorization of A; that is, A = LU.
Numerical Example
The system of Equations (2) of Section 2.1 can be written succinctly in matrix form:
\[
\begin{bmatrix}
6 & -2 & 2 & 4 \\
12 & -8 & 6 & 10 \\
3 & -13 & 9 & 3 \\
-6 & 4 & 1 & -18
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ 26 \\ -19 \\ -34 \end{bmatrix}
\tag{2}
\]
Furthermore, the operations that led from this system to Equation (5) of Section 2.1, that is, the system
\[
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -9 \\ -3 \end{bmatrix}
\tag{3}
\]
could be effected by an appropriate matrix multiplication. The forward elimination phase can be interpreted as starting from Equation (1) and proceeding to
\[
MAx = Mb \tag{4}
\]
where M is a matrix chosen so that MA is the coefficient matrix for System (3). Hence, we have
\[
MA = \begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix} \equiv U
\]
which is an upper triangular matrix.
Step 1 of naive Gaussian elimination results in Equation (3) of Section 2.1 or the system
\[
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & -12 & 8 & 1 \\
0 & 2 & 3 & -14
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -27 \\ -18 \end{bmatrix}
\]
This step can be accomplished by multiplying (1) by a lower triangular matrix $M_1$:
\[
M_1 A x = M_1 b
\]
where
\[
M_1 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
-2 & 1 & 0 & 0 \\
-\tfrac12 & 0 & 1 & 0 \\
1 & 0 & 0 & 1
\end{bmatrix}
\]
Notice the special form of $M_1$. The diagonal elements are all 1's, and the only other nonzero elements are in the first column. These numbers are the negatives of the multipliers located in the positions where they created 0's as coefficients in Step 1 of the forward elimination phase. To continue, Step 2 resulted in Equation (4) of Section 2.1 or the system
\[
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 4 & -13
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} 16 \\ -6 \\ -9 \\ -21 \end{bmatrix}
\]
which is equivalent to
\[
M_2 M_1 A x = M_2 M_1 b
\]
where
\[
M_2 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & -3 & 1 & 0 \\
0 & \tfrac12 & 0 & 1
\end{bmatrix}
\]
Again, $M_2$ differs from an identity matrix by the presence of the negatives of the multipliers in the second column from the diagonal down. Finally, Step 3 gives System (3), which is equivalent to
\[
M_3 M_2 M_1 A x = M_3 M_2 M_1 b
\]
where
\[
M_3 = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & -2 & 1
\end{bmatrix}
\]
Now the forward elimination phase is complete, and with
\[
M = M_3 M_2 M_1 \tag{5}
\]
we have the upper triangular coefficient System (3).
Using Equations (4) and (5), we can give a different interpretation of the forward elimination phase of naive Gaussian elimination. Now we see that
\[
A = M^{-1}U = M_1^{-1} M_2^{-1} M_3^{-1} U = LU
\]
Since each $M_k$ has such a special form, its inverse is obtained by simply changing the signs of the negative multiplier entries! Hence, we have
\[
L = \begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\tfrac12 & 0 & 1 & 0 \\
-1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 3 & 1 & 0 \\
0 & -\tfrac12 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 2 & 1
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\tfrac12 & 3 & 1 & 0 \\
-1 & -\tfrac12 & 2 & 1
\end{bmatrix}
\]
It is somewhat amazing that L is a unit lower triangular matrix composed of the multipliers. Notice that in forming L, we did not determine M first and then compute $M^{-1} = L$. (Why?)
It is easy to verify that
\[
LU = \begin{bmatrix}
1 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 \\
\tfrac12 & 3 & 1 & 0 \\
-1 & -\tfrac12 & 2 & 1
\end{bmatrix}
\begin{bmatrix}
6 & -2 & 2 & 4 \\
0 & -4 & 2 & 2 \\
0 & 0 & 2 & -5 \\
0 & 0 & 0 & -3
\end{bmatrix}
=
\begin{bmatrix}
6 & -2 & 2 & 4 \\
12 & -8 & 6 & 10 \\
3 & -13 & 9 & 3 \\
-6 & 4 & 1 & -18
\end{bmatrix}
= A
\]
We see that A is factored or decomposed into a unit lower triangular matrix L and an upper triangular matrix U. The matrix L consists of the multipliers located in the positions of the elements they annihilated from A, of unit diagonal elements, and of 0 upper triangular elements. In fact, we now know the general form of L and can just write it down directly using the multipliers without forming the $M_k$'s and the $M_k^{-1}$'s. The matrix U is upper triangular (not generally having unit diagonal) and is the final coefficient matrix after the forward elimination phase is completed.
It should be noted that the pseudocode Naive Gauss of Section 2.1 replaces the original coefficient matrix with its LU factorization. The elements of U are in the upper triangular part of the $(a_{ij})$-array including the diagonal. The entries below the main diagonal in L (that is, the multipliers) are found below the main diagonal in the $(a_{ij})$-array. Since it is known that L has a unit diagonal, nothing is lost by not storing the 1's. (In fact, we have run out of room in the $(a_{ij})$-array anyway!)
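As a rough illustration of this storage scheme, the following Python sketch runs naive elimination on the 4 × 4 example and keeps each multiplier in the position of the entry it eliminates. The function names (`naive_gauss_lu`, `split_lu`, `matmul`) and the use of exact `Fraction` arithmetic are assumptions made for this illustration; it is not the Naive Gauss pseudocode of Section 2.1.

```python
from fractions import Fraction

def naive_gauss_lu(a):
    """Return a packed copy of `a`: U on and above the diagonal,
    the multipliers (entries of L) below it. No pivoting is done."""
    n = len(a)
    a = [[Fraction(v) for v in row] for row in a]  # exact arithmetic (illustrative choice)
    for k in range(n - 1):
        if a[k][k] == 0:
            raise ZeroDivisionError("zero pivot: naive elimination fails")
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]   # multiplier for row i
            a[i][k] = m             # store it where the zero would appear
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
    return a

def split_lu(packed):
    """Unpack the combined array into unit lower L and upper U."""
    n = len(packed)
    L = [[packed[i][j] if j < i else Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    U = [[packed[i][j] if j >= i else Fraction(0) for j in range(n)]
         for i in range(n)]
    return L, U

def matmul(x, y):
    """Plain triple-loop matrix product."""
    return [[sum(x[i][s] * y[s][j] for s in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

A = [[6, -2, 2, 4], [12, -8, 6, 10], [3, -13, 9, 3], [-6, 4, 1, -18]]
L, U = split_lu(naive_gauss_lu(A))
```

Multiplying the two unpacked factors with `matmul(L, U)` gives back `A`, which is one way to check the factorization.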
Formal Derivation
To see formally how the Gaussian elimination (in naive form) leads to an LU factorization, it is necessary to show that each row operation used in the algorithm can be effected by multiplying A on the left by an elementary matrix. Specifically, if we wish to subtract λ times row p from row q, we first apply this operation to the n × n identity matrix to create an elementary matrix $M_{qp}$. Then we form the matrix product $M_{qp}A$.
Before proceeding, let us verify that $M_{qp}A$ is obtained by subtracting λ times row p from row q in matrix A. Assume that p < q (for in the naive algorithm, this is always true). Then the elements of $M_{qp} = (m_{ij})$ are
\[
m_{ij} = \begin{cases}
1, & \text{if } i = j \\
-\lambda, & \text{if } i = q \text{ and } j = p \\
0, & \text{otherwise}
\end{cases}
\]
Therefore, the elements of $M_{qp}A$ are given by
\[
(M_{qp}A)_{ij} = \sum_{s=1}^{n} m_{is}a_{sj} = \begin{cases}
a_{ij}, & \text{if } i \neq q \\
a_{qj} - \lambda a_{pj}, & \text{if } i = q
\end{cases}
\]
The qth row of $M_{qp}A$ is the sum of the qth row of A and −λ times the pth row of A, as was to be proved.
Step k of Gaussian elimination corresponds to the matrix $M_k$, which is the product of n − k elementary matrices:
\[
M_k = M_{nk} M_{n-1,k} \cdots M_{k+1,k}
\]
Notice that each elementary matrix $M_{ik}$ here is lower triangular because i > k, and therefore $M_k$ is also lower triangular. If we carry out the Gaussian forward elimination process on A, the result will be an upper triangular matrix U. On the other hand, the result is obtained by applying a succession of factors such as $M_k$ to the left of A. Hence, the entire process is summarized by writing
\[
M_{n-1} \cdots M_2 M_1 A = U
\]
Since each $M_k$ is invertible, we have
\[
A = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} U
\]
Each $M_k$ is lower triangular having 1's on its main diagonal (unit lower triangular). Each inverse $M_k^{-1}$ has the same property, and the same is true of their product. Hence, the matrix
\[
L = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} \tag{6}
\]
is unit lower triangular, and we have
\[
A = LU
\]
This is the so-called LU factorization of A. Our construction of it depends upon not encountering any 0 divisors in the algorithm. It is easy to give examples of matrices that have no LU factorization; one of the simplest is
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
\]
(Also, see Exercise 8.1.4.)
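To see why the 2 × 2 matrix with rows (0, 1) and (1, 1) has no LU factorization, one can compare entries directly. The short derivation below assumes general triangular factors with entries $l_{11}, l_{21}, l_{22}$ and $u_{11}, u_{12}, u_{22}$, names introduced here only for illustration.

```latex
\begin{aligned}
\begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}
&= \begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}
   \begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}
 = \begin{bmatrix} l_{11}u_{11} & l_{11}u_{12} \\
                   l_{21}u_{11} & l_{21}u_{12} + l_{22}u_{22} \end{bmatrix} \\
&\Longrightarrow\quad
l_{11}u_{11} = 0, \qquad l_{11}u_{12} = 1, \qquad l_{21}u_{11} = 1 .
\end{aligned}
```

The first equation forces $l_{11} = 0$ or $u_{11} = 0$. The first choice contradicts $l_{11}u_{12} = 1$ and the second contradicts $l_{21}u_{11} = 1$, so no such factors exist.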
■ Theorem 1
LU Factorization Theorem
Let $A = (a_{ij})$ be an n × n matrix. Assume that the forward elimination phase of the naive Gaussian algorithm is applied to A without encountering any 0 divisors. Let the resulting matrix be denoted by $\tilde{A} = (\tilde{a}_{ij})$. If
\[
L = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
\tilde{a}_{21} & 1 & 0 & \cdots & 0 \\
\tilde{a}_{31} & \tilde{a}_{32} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
\tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & \tilde{a}_{n,n-1} & 1
\end{bmatrix}
\]
and
\[
U = \begin{bmatrix}
\tilde{a}_{11} & \tilde{a}_{12} & \tilde{a}_{13} & \cdots & \tilde{a}_{1n} \\
0 & \tilde{a}_{22} & \tilde{a}_{23} & \cdots & \tilde{a}_{2n} \\
0 & 0 & \tilde{a}_{33} & \cdots & \tilde{a}_{3n} \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & \tilde{a}_{nn}
\end{bmatrix}
\]
then
\[
A = LU
\]
Proof
We define the Gaussian algorithm formally as follows. Let $A^{(1)} = A$. Then we compute $A^{(2)}, A^{(3)}, \ldots, A^{(n)}$ recursively by the naive Gaussian algorithm, following these equations:
\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} \qquad (\text{if } i \leqq k \text{ or } j < k) \tag{7}
\]
\[
a_{ij}^{(k+1)} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} \qquad (\text{if } i > k \text{ and } j = k) \tag{8}
\]
\[
a_{ij}^{(k+1)} = a_{ij}^{(k)} - \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} \qquad (\text{if } i > k \text{ and } j > k) \tag{9}
\]
These equations describe in a precise form the forward elimination phase of the naive Gaussian elimination algorithm.
For example, Equation (7) states that in proceeding from $A^{(k)}$ to $A^{(k+1)}$, we do not alter rows 1, 2, …, k or columns 1, 2, …, k − 1. Equation (8) shows how the multipliers are computed and stored in passing from $A^{(k)}$ to $A^{(k+1)}$. Finally, Equation (9) shows how multiples of row k are subtracted from rows k + 1, k + 2, …, n to produce $A^{(k+1)}$ from $A^{(k)}$.
Notice that $A^{(n)}$ is the final result of the process. (It was referred to as $\tilde{A}$ in the statement of the theorem.) The formal definitions of $L = (l_{ik})$ and $U = (u_{kj})$ are therefore
\[
l_{ik} = 1 \qquad (i = k) \tag{10}
\]
\[
l_{ik} = a_{ik}^{(n)} \qquad (k < i) \tag{11}
\]
\[
l_{ik} = 0 \qquad (k > i) \tag{12}
\]
\[
u_{kj} = a_{kj}^{(n)} \qquad (j \geqq k) \tag{13}
\]
\[
u_{kj} = 0 \qquad (j < k) \tag{14}
\]
Now we draw some consequences of these equations. First, it follows immediately from Equation (7) that
\[
a_{ij}^{(i)} = a_{ij}^{(i+1)} = \cdots = a_{ij}^{(n)} \tag{15}
\]
Likewise, we have, from Equation (7),
\[
a_{ij}^{(j+1)} = a_{ij}^{(j+2)} = \cdots = a_{ij}^{(n)} \qquad (j < n) \tag{16}
\]
From Equations (16) and (8), we now have
\[
a_{ij}^{(n)} = a_{ij}^{(j+1)} = \frac{a_{ij}^{(j)}}{a_{jj}^{(j)}} \qquad (j < i) \tag{17}
\]
From Equations (17) and (11), it follows that
\[
l_{ik} = a_{ik}^{(n)} = \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}} \qquad (k < i) \tag{18}
\]
From Equations (13) and (15), we have
\[
u_{kj} = a_{kj}^{(n)} = a_{kj}^{(k)} \qquad (k \leqq j) \tag{19}
\]
With the aid of the preceding equations, we can verify that LU = A. First, consider the case i ≦ j. Then
\[
\begin{aligned}
(LU)_{ij} &= \sum_{k=1}^{n} l_{ik}u_{kj} = \sum_{k=1}^{i} l_{ik}u_{kj} = \sum_{k=1}^{i-1} l_{ik}u_{kj} + u_{ij} \\
&= \sum_{k=1}^{i-1} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} + a_{ij}^{(i)}
 = \sum_{k=1}^{i-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(i)} \\
&= a_{ij}^{(1)} = a_{ij}
\end{aligned}
\]
In the remaining case, i > j, we have
\[
\begin{aligned}
(LU)_{ij} &= \sum_{k=1}^{n} l_{ik}u_{kj} = \sum_{k=1}^{j} l_{ik}u_{kj}
 = \sum_{k=1}^{j} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} \\
&= \sum_{k=1}^{j-1} \frac{a_{ik}^{(k)}}{a_{kk}^{(k)}}\, a_{kj}^{(k)} + a_{ij}^{(j)}
 = \sum_{k=1}^{j-1} \left[ a_{ij}^{(k)} - a_{ij}^{(k+1)} \right] + a_{ij}^{(j)} \\
&= a_{ij}^{(1)} = a_{ij}
\end{aligned}
\]
■
Pseudocode
The following is the pseudocode for carrying out the LU factorization, which is sometimes called the Doolittle factorization:

    integer i, j, k, n, s; real array (a_ij)_{1:n×1:n}, (l_ij)_{1:n×1:n}, (u_ij)_{1:n×1:n}
    for k = 1 to n
        l_kk ← 1
        for j = k to n
            u_kj ← a_kj − Σ_{s=1}^{k−1} l_ks u_sj
        end do
        for i = k + 1 to n
            l_ik ← ( a_ik − Σ_{s=1}^{k−1} l_is u_sk ) / u_kk
        end do
    end do

Solving Linear Systems Using LU Factorization
Once the LU factorization of A is available, we can solve the system
\[
Ax = b
\]
by writing
\[
LUx = b
\]
Then we solve two triangular systems:
\[
Lz = b \tag{20}
\]
for z and
\[
Ux = z \tag{21}
\]
for x. This is particularly useful for problems that involve the same coefficient matrix A and many different right-hand vectors b.
Since L is unit lower triangular, z is obtained by the pseudocode

    integer i, j, n; real array (b_i)_{1:n}, (l_ij)_{1:n×1:n}, (z_i)_{1:n}
    z_1 ← b_1
    for i = 2 to n
        z_i ← b_i − Σ_{j=1}^{i−1} l_ij z_j
    end for

Likewise, x is obtained by the pseudocode

    integer i, j, n; real array (u_ij)_{1:n×1:n}, (x_i)_{1:n}, (z_i)_{1:n}
    x_n ← z_n / u_nn
    for i = n − 1 to 1 step −1
        x_i ← ( z_i − Σ_{j=i+1}^{n} u_ij x_j ) / u_ii
    end for
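The Doolittle, forward substitution, and backward substitution pseudocode translate almost line for line into Python. The sketch below is one possible rendering under a few assumptions: zero-based indexing, plain lists of floats, and the hypothetical function names `doolittle`, `forward_sub`, and `back_sub`. It is applied to System (2) from the numerical example.

```python
def doolittle(a):
    """Doolittle factorization A = LU with unit lower triangular L (no pivoting)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = 1.0
        for j in range(k, n):                      # row k of U
            U[k][j] = a[k][j] - sum(L[k][s] * U[s][j] for s in range(k))
        if U[k][k] == 0.0:
            raise ZeroDivisionError("zero divisor encountered; no pivoting is done")
        for i in range(k + 1, n):                  # column k of L
            L[i][k] = (a[i][k] - sum(L[i][s] * U[s][k] for s in range(k))) / U[k][k]
    return L, U

def forward_sub(L, b):
    """Solve Lz = b for unit lower triangular L."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    return z

def back_sub(U, z):
    """Solve Ux = z for upper triangular U with nonzero diagonal."""
    n = len(z)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

A = [[6.0, -2.0, 2.0, 4.0],
     [12.0, -8.0, 6.0, 10.0],
     [3.0, -13.0, 9.0, 3.0],
     [-6.0, 4.0, 1.0, -18.0]]
b = [16.0, 26.0, -19.0, -34.0]

L, U = doolittle(A)
z = forward_sub(L, b)   # right-hand side of System (3)
x = back_sub(U, z)      # solution of Ax = b
```

Because the factors are kept, a new right-hand side needs only the two substitution calls, not a new factorization.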
The forward substitution algorithm applies the forward phase of Gaussian elimination to the right-hand-side vector b. (Recall that the $l_{ij}$'s are the multipliers that have been stored in the array $(a_{ij})$.)
The easiest way to verify this assertion is to use Equation (6) and to rewrite the equation
\[
Lz = b
\]
in the form
\[
M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} z = b
\]
From this, we get immediately
\[
z = M_{n-1} \cdots M_2 M_1 b
\]
Thus, the same operations used to reduce A to U are to be used on b to produce z.
Another way to solve Equation (20) is to note that what must be done is to form
\[
M_{n-1} M_{n-2} \cdots M_2 M_1 b
\]
This can be accomplished by using only the array $(b_i)$ by putting the results back into b; that is,
\[
b \leftarrow M_k b
\]
We know what $M_k$ looks like because it is made up of negative multipliers that have been saved in the array $(a_{ij})$. Consequently, we have
\[
M_k b = \begin{bmatrix}
1 \\
 & \ddots \\
 & & 1 \\
 & & -a_{k+1,k} & 1 \\
 & & \vdots & & \ddots \\
 & & -a_{ik} & & & 1 \\
 & & \vdots & & & & \ddots \\
 & & -a_{nk} & & & & & 1
\end{bmatrix}
\begin{bmatrix} b_1 \\ \vdots \\ b_k \\ b_{k+1} \\ \vdots \\ b_i \\ \vdots \\ b_n \end{bmatrix}
\]
The entries $b_1$ to $b_k$ are not changed by this multiplication, while $b_i$ (for i ≧ k + 1) is replaced by $-a_{ik}b_k + b_i$. Hence, the following pseudocode updates the array $(b_i)$ based on the stored multipliers in the array a:

    integer i, k, n; real array (a_ij)_{1:n×1:n}, (b_i)_{1:n}
    for k = 1 to n − 1
        for i = k + 1 to n
            b_i ← b_i − a_ik b_k
        end for
    end for

This pseudocode should be familiar. It is the process for updating b from Section 2.2 (p. 91). The algorithm for solving Equation (21) is the back substitution phase of the naive Gaussian elimination process.
LDL^T Factorization
In the $LDL^T$ factorization, L is unit lower triangular, and D is a diagonal matrix. This factorization can be carried out if A is symmetric and has an ordinary LU factorization, with L unit lower triangular. To see this, we start with
\[
LU = A = A^T = (LU)^T = U^T L^T
\]
Since L is unit lower triangular, it is invertible, and we can write $U = L^{-1}U^T L^T$. Then $U(L^T)^{-1} = L^{-1}U^T$. Since the right side of this equation is lower triangular and the left side is upper triangular, both sides are diagonal, say, D. From the equation $U(L^T)^{-1} = D$, we have $U = DL^T$ and $A = LU = LDL^T$.
We now derive the pseudocode for obtaining the $LDL^T$ factorization of a symmetric matrix A in which L is unit lower triangular and D is diagonal. In our analysis, we write $a_{ij}$ as generic elements of A and $l_{ij}$ as generic elements of L. The diagonal of D has elements $d_{ii}$, or $d_i$. From the equation $A = LDL^T$, we have
\[
a_{ij} = \sum_{\nu=1}^{n} \sum_{\mu=1}^{n} l_{i\nu}\, d_{\nu\mu}\, l^{T}_{\mu j}
= \sum_{\nu=1}^{n} \sum_{\mu=1}^{n} l_{i\nu}\, d_{\nu}\, \delta_{\nu\mu}\, l_{j\mu}
= \sum_{\nu=1}^{n} l_{i\nu}\, d_{\nu}\, l_{j\nu} \qquad (1 \leqq i, j \leqq n)
\]
Use the fact that $l_{ij} = 0$ when j > i and $l_{ii} = 1$ to continue the argument
\[
a_{ij} = \sum_{\nu=1}^{\min(i,j)} l_{i\nu}\, d_{\nu}\, l_{j\nu} \qquad (1 \leqq i, j \leqq n)
\]
Assume now that j ≦ i. Then
\[
a_{ij} = \sum_{\nu=1}^{j} l_{i\nu}\, d_{\nu}\, l_{j\nu}
= \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij} d_j l_{jj}
= \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij} d_j \qquad (1 \leqq j \leqq i \leqq n)
\]
In particular, let j = i. We get
\[
a_{ii} = \sum_{\nu=1}^{i-1} l_{i\nu}\, d_{\nu}\, l_{i\nu} + d_i \qquad (1 \leqq i \leqq n)
\]
Equivalently, we have
\[
d_i = a_{ii} - \sum_{\nu=1}^{i-1} d_{\nu}\, l_{i\nu}^{2} \qquad (1 \leqq i \leqq n)
\]
Particular cases of this are
\[
d_1 = a_{11}, \qquad d_2 = a_{22} - d_1 l_{21}^{2}, \qquad d_3 = a_{33} - d_1 l_{31}^{2} - d_2 l_{32}^{2}
\]
etc.
Now we can limit our attention to the cases 1 ≦ j < i ≦ n, where we have
\[
a_{ij} = \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} + l_{ij} d_j \qquad (1 \leqq j < i \leqq n)
\]
Solving for $l_{ij}$, we obtain
\[
l_{ij} = \left[ a_{ij} - \sum_{\nu=1}^{j-1} l_{i\nu}\, d_{\nu}\, l_{j\nu} \right] \Big/ d_j \qquad (1 \leqq j < i \leqq n)
\]
Let's do some checking. Taking j = 1, we have
\[
l_{i1} = a_{i1}/d_1 \qquad (1 < i \leqq n)
\]
■ Theorem 2
Cholesky Theorem on LL^T Factorization
If A is a real, symmetric, and positive definite matrix, then it has a unique factorization
\[
A = LL^T
\]
in which L is lower triangular with a positive diagonal.
Proof
Recall that a matrix A is symmetric and positive definite if $A = A^T$ and $x^T A x > 0$ for every nonzero vector x. It follows at once that A is nonsingular because A obviously cannot map any nonzero vector into 0. Moreover, by considering special vectors of the form $x = (x_1, x_2, \ldots, x_k, 0, 0, \ldots, 0)^T$, we see that the leading principal minors of A are also positive definite. Theorem 1 implies that A has an LU decomposition. By the symmetry of A, we then have, from the previous discussion,
\[
A = LDL^T
\]
It can be shown that D is positive definite, and thus its elements $d_{ii}$ are positive. Denoting by $D^{1/2}$ the diagonal matrix whose diagonal elements are $\sqrt{d_{ii}}$, we have
\[
A = \tilde{L}\tilde{L}^T
\]
where $\tilde{L} \equiv LD^{1/2}$, which is the Cholesky factorization. We leave the proof of uniqueness to the reader.
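The recurrences for $d_i$ and $l_{ij}$ derived in the $LDL^T$ discussion can be turned into a short program. The following is a minimal sketch, not the text's own pseudocode: the function name `ldlt`, the zero-based indexing, the exact `Fraction` arithmetic, and the symmetric test matrix are all assumptions chosen for illustration.

```python
from fractions import Fraction

def ldlt(a):
    """Return (L, d) with A = L * diag(d) * L^T for symmetric A, where
    L is unit lower triangular. Uses
        d_j  = a_jj - sum_{v<j} d_v * l_jv^2
        l_ij = (a_ij - sum_{v<j} l_iv * d_v * l_jv) / d_j   for i > j
    and fails with ZeroDivisionError if some d_j is zero."""
    n = len(a)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    d = [Fraction(0)] * n
    for j in range(n):
        d[j] = Fraction(a[j][j]) - sum(d[v] * L[j][v] ** 2 for v in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][v] * d[v] * L[j][v] for v in range(j))
            L[i][j] = (Fraction(a[i][j]) - s) / d[j]
    return L, d

# Illustrative symmetric matrix (an assumption, not taken from the text above).
A = [[4, 3, 2, 1],
     [3, 3, 2, 1],
     [2, 2, 2, 1],
     [1, 1, 1, 1]]
L, d = ldlt(A)

# Rebuild A entry by entry from the factors as a consistency check.
rebuilt = [[sum(L[i][v] * d[v] * L[j][v] for v in range(4)) for j in range(4)]
           for i in range(4)]
```

When every $d_i$ is positive, scaling column i of L by $\sqrt{d_i}$ gives the factor $\tilde{L} = LD^{1/2}$ used in the proof of Theorem 2.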
The algorithm for the Cholesky factorization is a special case of the general LU factorization algorithm. If A is real, symmetric, and positive definite, then by Theorem 2, it has a unique factorization of the form
\[
A = LL^T
\]
in which L is lower triangular and has positive diagonal. Thus, in the equation A = LU, we must have $U = L^T$. In the kth step of the general algorithm, the diagonal entry is computed by
\[
l_{kk} = \left( a_{kk} - \sum_{s=1}^{k-1} l_{ks}^{2} \right)^{1/2} \tag{22}
\]
The algorithm for the Cholesky factorization is as follows:

    integer i, k, n, s; real array (a_ij)_{1:n×1:n}, (l_ij)_{1:n×1:n}
    for k = 1 to n
        l_kk ← ( a_kk − Σ_{s=1}^{k−1} l_ks² )^{1/2}
        for i = k + 1 to n
            l_ik ← ( a_ik − Σ_{s=1}^{k−1} l_is l_ks ) / l_kk
        end do
    end do

Theorem 2 guarantees that $l_{kk} > 0$. Observe that Equation (22) gives us the following bound:
\[
a_{kk} = \sum_{s=1}^{k} l_{ks}^{2} \geqq l_{kj}^{2} \qquad (j \leqq k)
\]
from which we conclude that
\[
|l_{kj}| \leqq \sqrt{a_{kk}} \qquad (1 \leqq j \leqq k)
\]
Hence, any element of L is bounded by the square root of a corresponding diagonal element in A. This implies that the elements of L do not become large relative to A even without any pivoting. In the Cholesky algorithm (and the Doolittle algorithms), the dot products of vectors should be computed in double precision to avoid a buildup of roundoff errors.
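The Cholesky pseudocode and the bound on $|l_{kj}|$ can be checked numerically. The sketch below is an illustration under assumptions: the function name `cholesky`, zero-based indexing, ordinary Python floats, and a test matrix chosen so that its factor can be compared with the decimal entries quoted in Example 2.

```python
import math

def cholesky(a):
    """Return lower triangular L with positive diagonal such that A = L L^T.
    The diagonal entry follows Equation (22)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        s = a[k][k] - sum(L[k][t] ** 2 for t in range(k))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite to working precision")
        L[k][k] = math.sqrt(s)
        for i in range(k + 1, n):
            L[i][k] = (a[i][k] - sum(L[i][t] * L[k][t] for t in range(k))) / L[k][k]
    return L

# Illustrative symmetric positive definite matrix (assumed for this sketch).
A = [[4.0, 3.0, 2.0, 1.0],
     [3.0, 3.0, 2.0, 1.0],
     [2.0, 2.0, 2.0, 1.0],
     [1.0, 1.0, 1.0, 1.0]]
L = cholesky(A)

# Check the bound |l_kj| <= sqrt(a_kk) for 1 <= j <= k (small slack for roundoff).
bound_holds = all(abs(L[k][j]) <= math.sqrt(A[k][k]) + 1e-12
                  for k in range(4) for j in range(k + 1))
```

The guard on `s` is a design choice for the sketch: in floating-point arithmetic a nearly singular matrix can produce a nonpositive value under the square root even though Theorem 2 rules this out in exact arithmetic.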
EXAMPLE 2
Determine the Cholesky factorization of the matrix in Example 1.
Solution
Using the results from Example 1, we write
\[
A = LDL^T = (LD^{1/2})(D^{1/2}L^T) = \tilde{L}\tilde{L}^T
\]
where
\[
\tilde{L} = LD^{1/2} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
\tfrac34 & 1 & 0 & 0 \\
\tfrac12 & \tfrac23 & 1 & 0 \\
\tfrac14 & \tfrac13 & \tfrac12 & 1
\end{bmatrix}
\begin{bmatrix}
2 & 0 & 0 & 0 \\
0 & \tfrac12\sqrt{3} & 0 & 0 \\
0 & 0 & \sqrt{\tfrac23} & 0 \\
0 & 0 & 0 & \tfrac{1}{\sqrt{2}}
\end{bmatrix}
\]
\[
= \begin{bmatrix}
2 & 0 & 0 & 0 \\
\tfrac32 & \tfrac12\sqrt{3} & 0 & 0 \\
1 & \tfrac13\sqrt{3} & \sqrt{\tfrac23} & 0 \\
\tfrac12 & \tfrac16\sqrt{3} & \tfrac12\sqrt{\tfrac23} & \tfrac{1}{\sqrt{2}}
\end{bmatrix}
= \begin{bmatrix}
2.0000 & 0 & 0 & 0 \\
1.5000 & 0.8660 & 0 & 0 \\
1.0000 & 0.5774 & 0.8165 & 0 \\
0.5000 & 0.2887 & 0.4082 & 0.7071
\end{bmatrix}
\]
Clearly, $\tilde{L}$ is the lower triangular matrix in the Cholesky factorization $A = \tilde{L}\tilde{L}^T$. ■
Multiple Right-Hand Sides
Many software packages for solving linear systems allow the input of multiple right-hand sides. Suppose an n × m matrix B is
\[
B = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
\]
in which each column corresponds to a right-hand side of the m linear systems
\[
Ax^{(j)} = b^{(j)}
\]
for 1 ≦ j ≦ m. Thus, we can write
\[
A[x^{(1)}, x^{(2)}, \ldots, x^{(m)}] = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]
\]
or
\[
AX = B
\]
From Section 2.2, procedure Gauss can be used once to produce a factorization of A, and procedure Solve can be used m times with right-hand side vectors $b^{(j)}$ to find the m solution vectors $x^{(j)}$ for 1 ≦ j ≦ m.
Since the factorization phase can be done in $\tfrac13 n^3$ long operations while each of the back substitution phases requires $n^2$ long operations, this entire process can be done in $\tfrac13 n^3 + mn^2$ long operations. This is much less than $m\left(\tfrac13 n^3 + n^2\right)$, which is what it would take if each of the m linear systems were solved separately.
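To get a feel for the size of this saving, the two operation counts can be evaluated for sample values of n and m. The helper names below are hypothetical, and the counts are the leading-order estimates quoted in the text, not measured costs.

```python
def long_ops_shared(n, m):
    """Factor A once, then do m solves: n^3/3 + m*n^2 long operations."""
    return n ** 3 / 3 + m * n ** 2

def long_ops_separate(n, m):
    """Solve each of the m systems from scratch: m*(n^3/3 + n^2)."""
    return m * (n ** 3 / 3 + n ** 2)

n, m = 30, 10
shared = long_ops_shared(n, m)        # 9000 + 9000
separate = long_ops_separate(n, m)    # 10 * (9000 + 900)
ratio = separate / shared
```

For n = 30 and m = 10 the shared factorization needs 18,000 long operations against 99,000 for separate solves, a ratio of 5.5. The advantage grows with n because the cubic factorization term is paid only once.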
Computing A^{-1}
In some applications, such as in statistics, it may be necessary to compute the inverse of a matrix A and explicitly display it as $A^{-1}$. If an n × n matrix A has an inverse, it is an n × n matrix X with the property that
\[
AX = I \tag{23}
\]
where I is the identity matrix. If $x^{(j)}$ denotes the jth column of X and $I^{(j)}$ denotes the jth column of I, then matrix Equation (23) can be written as
\[
A[x^{(1)}, x^{(2)}, \ldots, x^{(n)}] = [I^{(1)}, I^{(2)}, \ldots, I^{(n)}]
\]
This can be written as n linear systems of equations of the form
\[
Ax^{(j)} = I^{(j)} \qquad (1 \leqq j \leqq n)
\]
These systems can be solved by using procedures Gauss and Solve from Section 2.2. Use procedure Gauss once to produce a factorization of A, and use procedure Solve n times with the right-hand side vectors $I^{(j)}$ for 1 ≦ j ≦ n. This is equivalent to solving, one at a time, for the columns of $A^{-1}$, which are $x^{(j)}$. Hence, we obtain
\[
A^{-1} = [x^{(1)}, x^{(2)}, \ldots, x^{(n)}]
\]
A word of caution on computing the inverse of a matrix: In solving a linear system Ax = b, it is not advisable to determine $A^{-1}$ and then compute the matrix-vector product $x = A^{-1}b$ because this requires many unnecessary calculations, compared to directly solving Ax = b for x.
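The column-by-column computation of the inverse can be sketched as follows. This is an illustration only: `lu_factor`, `lu_solve`, and `inverse` are hypothetical stand-ins for procedures Gauss and Solve (they use naive elimination without pivoting), and the 3 × 3 test matrix and exact `Fraction` arithmetic are assumptions made here.

```python
from fractions import Fraction

def lu_factor(a):
    """Packed LU factors of `a` (multipliers below the diagonal), no pivoting.
    A zero pivot raises ZeroDivisionError."""
    n = len(a)
    lu = [[Fraction(v) for v in row] for row in a]
    for k in range(n - 1):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu

def lu_solve(lu, b):
    """Solve Ax = b from the packed factors: Lz = b, then Ux = z."""
    n = len(b)
    x = [Fraction(v) for v in b]
    for i in range(n):                       # forward substitution (unit diagonal)
        x[i] -= sum(lu[i][j] * x[j] for j in range(i))
    for i in range(n - 1, -1, -1):           # back substitution
        x[i] = (x[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x

def inverse(a):
    """Factor once, then solve A x^(j) = I^(j) for each column of the identity."""
    n = len(a)
    lu = lu_factor(a)
    cols = [lu_solve(lu, [int(i == j) for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]  # columns to rows

A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
X = inverse(A)
product = [[sum(A[i][s] * X[s][j] for s in range(3)) for j in range(3)]
           for i in range(3)]
```

With exact arithmetic, `product` equals the identity matrix. In floating point the same structure applies, but a pivoting factorization would normally be preferred.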
Example Using Software Packages
A permutation matrix is an n × n matrix P that arises from the identity matrix by permuting its rows. It then turns out that permuting the rows of any n × n matrix A can be accomplished by multiplying A on the left by P. Every permutation matrix is invertible, since the rows still form a basis for $\mathbb{R}^n$. When Gaussian elimination with row pivoting is performed on a matrix A, the result is expressible as
\[
PA = LU
\]
where L is lower triangular and U is upper triangular. The matrix PA is A with its rows rearranged.
If we have the LU factorization of PA, how do we solve the system Ax = b? First, write it as
\[
PAx = Pb
\]
then LUx = Pb. Let y = Ux, so that our problem is now
\[
Ly = Pb, \qquad Ux = y
\]
The first equation is easily solved for y, and then the second equation is easily solved for x. Mathematical software systems such as MATLAB, Maple, and Mathematica produce factorizations of the form PA = LU upon command.
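A minimal sketch of this procedure in Python is given here. It assumes partial pivoting (the largest entry in magnitude in the current column is chosen as pivot), stores the permutation as a list of row indices instead of a matrix P, and uses exact `Fraction` arithmetic; the names `plu` and `plu_solve` are hypothetical.

```python
from fractions import Fraction

def plu(a):
    """Return (lu, perm): packed factors of PA = LU and the row order,
    where row k of PA is row perm[k] of A."""
    n = len(a)
    lu = [[Fraction(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))  # pivot row
        if lu[p][k] == 0:
            raise ZeroDivisionError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]            # swap whole rows, multipliers included
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm

def plu_solve(lu, perm, b):
    """Solve Ax = b via Ly = Pb and Ux = y."""
    n = len(b)
    y = [Fraction(b[p]) for p in perm]         # y starts out as Pb
    for i in range(n):
        y[i] -= sum(lu[i][j] * y[j] for j in range(i))
    for i in range(n - 1, -1, -1):
        y[i] = (y[i] - sum(lu[i][j] * y[j] for j in range(i + 1, n))) / lu[i][i]
    return y

A = [[6, -2, 2, 4], [12, -8, 6, 10], [3, -13, 9, 3], [-6, 4, 1, -18]]
b = [16, 26, -19, -34]
lu, perm = plu(A)
x = plu_solve(lu, perm, b)
```

For this matrix the pivoting strategy takes the rows of A in the order 2, 3, 4, 1 (the list `perm` holds the zero-based indices), and the computed solution agrees with the one obtained without pivoting.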
Caution
Solving Ax = b Using PA = LU
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
Maple
A=LU=2 1 0 00 −4 2 2
1 3 1 00 0 2−5
MATLAB
0 0 1.0000 0
0.0909 1.0000
Mathematica
1000
where P is a permutation matrix corresponding to the pivoting strategy used. Finally, we
use Mathematica to create this LU decomposition: 3−139 3
−2 −22 19 −12 2−12 52 −166
11 11 11
4 −2 22 −6 13 13
The output is in a compact store scheme that contains both the lower triangular matrix and the upper triangular matrix in a single matrix. However, the storage arrangement may be complicated because the rows are usually permuted during the factorization in an effort to make the solution process numerically stable. Verify that this factorization corresponds to the permutation of rows of matrix A in the order 3, 4, 1, 2. ■
Summary 8.1
• If A = (ai j ) is an n × n matrix such that the forward elimination phase of the naive Gaussian algorithm can be applied to A without encountering any zero divisors, then
EXAMPLE 3
Solution
8.1 Matrix Factorizations 373 Use the mathematical software systems in Maple, MATLAB, and Mathematica to find the
LU factorization of this matrix:
6 −2 A=12 −8 3−13 −6 4
First, we use Maple and find this factorization: 1 0 0
2 4 6 10
9 3 1 −18
(24)
0 0 0 0.2727
0100 P=0 0 1 0 0001
06−2
2 4 −1−1 21000−3
2
2
Next, we use MATLAB and find a different factorization:
P A = L U
1.0000 0 0 0
L = 0.2500 1.0000 −0.5000 0 0.5000 −0.1818
12.0000 −8.0000 6.0000 10.0000 U= 0 −11.0000 7.5000 0.5000 0 0 4.0000 −13.0000
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
374 Chapter 8
More on Linear Systems
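the resulting matrix can be denoted by $\tilde{A} = (\tilde{a}_{ij})$, where
$$L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \tilde{a}_{21} & 1 & 0 & \cdots & 0 \\ \tilde{a}_{31} & \tilde{a}_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \tilde{a}_{n1} & \tilde{a}_{n2} & \cdots & \tilde{a}_{n,n-1} & 1 \end{bmatrix}$$
and
$$U = \begin{bmatrix} \tilde{a}_{11} & \tilde{a}_{12} & \tilde{a}_{13} & \cdots & \tilde{a}_{1n} \\ 0 & \tilde{a}_{22} & \tilde{a}_{23} & \cdots & \tilde{a}_{2n} \\ 0 & 0 & \tilde{a}_{33} & \cdots & \tilde{a}_{3n} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \tilde{a}_{nn} \end{bmatrix}$$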
This is the LU factorization of A, so A = LU, where L is unit lower triangular and U is upper triangular. When we carry out the Gaussian forward elimination process on A, the result is the upper triangular matrix U. The matrix L is the unit lower triangular matrix whose subdiagonal entries are the multipliers (the negatives of the corresponding entries of the elimination matrices $M_k$ described next), stored in the locations of the elements they zero out.
• We can also give a formal description as follows. The matrix U can be obtained by applying a succession of matrices $M_k$ to the left of A. The kth step of Gaussian elimination corresponds to a unit lower triangular matrix $M_k$, which is the product of n − k elementary matrices
$$M_k = M_{nk} M_{n-1,k} \cdots M_{k+1,k}$$
where each elementary matrix $M_{ik}$ is unit lower triangular. If $M_{qp}A$ is obtained by subtracting λ times row p from row q in matrix A with p < q, then the elements of $M_{qp} = (m_{ij})$ are
$$m_{ij} = \begin{cases} 1 & \text{if } i = j \\ -\lambda & \text{if } i = q \text{ and } j = p \\ 0 & \text{otherwise} \end{cases}$$
The entire Gaussian elimination process is summarized by writing
$$M_{n-1} \cdots M_2 M_1 A = U$$
Since each $M_k$ is invertible, we have
$$A = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1} U$$
Each $M_k$ is a unit lower triangular matrix, and the same is true of each inverse $M_k^{-1}$, as well as their products. Hence, the matrix
$$L = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}$$
is unit lower triangular.
• For symmetric matrices, we have the $LDL^T$ factorization, and for symmetric positive definite matrices, we have the $LL^T$ factorization, which is also known as the Cholesky factorization.
• If the LU factorization of A is available, we can solve the system
$$Ax = b$$
by solving two triangular systems:
$$Ly = b \ \text{ for } y, \qquad Ux = y \ \text{ for } x$$
This is useful for problems that involve the same coefficient matrix A and many different right-hand vectors b. For example, let B be an n × m matrix of the form
$$B = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]$$
where each column corresponds to a right-hand side of the m linear systems
$$Ax^{(j)} = b^{(j)} \qquad (1 \le j \le m)$$
Thus, we can write
$$A[x^{(1)}, x^{(2)}, \ldots, x^{(m)}] = [b^{(1)}, b^{(2)}, \ldots, b^{(m)}]$$
or
$$AX = B$$
A special case of this is to compute the inverse of an n × n invertible matrix A. We write
$$AX = I$$
where I is the identity matrix. If $x^{(j)}$ denotes the jth column of X and $I^{(j)}$ denotes the jth column of I, this can be written as
$$A[x^{(1)}, x^{(2)}, \ldots, x^{(n)}] = [I^{(1)}, I^{(2)}, \ldots, I^{(n)}]$$
or as n linear systems of equations of the form
$$Ax^{(j)} = I^{(j)} \qquad (1 \le j \le n)$$
We can use LU factorization to solve these n systems efficiently, obtaining
$$A^{-1} = [x^{(1)}, x^{(2)}, \ldots, x^{(n)}]$$
• When Gaussian elimination with row pivoting is performed on a matrix A, the result is expressible as
$$PA = LU$$
where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. Here, the matrix PA is A with its rows interchanged. We can solve the system Ax = b by solving
$$Ly = Pb \ \text{ for } y, \qquad Ux = y \ \text{ for } x$$

Exercises 8.1

1. Using naive Gaussian elimination, factor the following matrices in the form A = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix.
ᵃa. $A = \begin{bmatrix} 3 & 0 & 3 \\ 0 & -1 & 3 \\ 1 & 3 & 0 \end{bmatrix}$
b. $A = \begin{bmatrix} 1 & 0 & \tfrac{1}{3} & 0 \\ 0 & 1 & 3 & -1 \\ 3 & -3 & 0 & 6 \\ 0 & 2 & 4 & -6 \end{bmatrix}$
c. $A = \begin{bmatrix} -20 & -15 & -10 & -5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$

2. Consider the matrix
$$A = \begin{bmatrix} 1 & 0 & 0 & 2 \\ 0 & 3 & 0 & 0 \\ 0 & 9 & 4 & 0 \\ 5 & 0 & 8 & 10 \end{bmatrix}$$
ᵃa. Determine a unit lower triangular matrix M and an upper triangular matrix U such that MA = U.
b. Determine a unit lower triangular matrix L and an upper triangular matrix U such that A = LU. Show that ML = I so that $L = M^{-1}$.

3. Consider the matrix
$$A = \begin{bmatrix} 25 & 0 & 0 & 0 & 1 \\ 0 & 27 & 4 & 3 & 2 \\ 0 & 54 & 58 & 0 & 0 \\ 0 & 108 & 116 & 0 & 0 \\ 100 & 0 & 0 & 0 & 24 \end{bmatrix}$$
ᵃa. Determine the unit lower triangular matrix M and the upper triangular matrix U such that MA = U.
b. Determine $M^{-1} = L$ such that A = LU.

4. Consider the matrix
$$A = \begin{bmatrix} 2 & 2 & 1 \\ 1 & 1 & 1 \\ 3 & 2 & 1 \end{bmatrix}$$
a. Show that A cannot be factored into the product of a unit lower triangular matrix and an upper triangular matrix.
ᵃb. Interchange the rows of A so that this can be done.

5. Consider the matrix
$$A = \begin{bmatrix} a & 0 & 0 & z \\ 0 & b & 0 & 0 \\ 0 & x & c & 0 \\ w & 0 & y & d \end{bmatrix}$$
a. Determine a unit lower triangular matrix M and an upper triangular matrix U such that MA = U.
ᵃb. Determine a lower triangular matrix L′ and a unit upper triangular matrix U′ such that A = L′U′.

6. Consider the matrix
$$A = \begin{bmatrix} 4 & -1 & -1 & 0 \\ -1 & 4 & 0 & -1 \\ -1 & 0 & 4 & -1 \\ 0 & -1 & -1 & 4 \end{bmatrix}$$
Factor A in the following ways:
ᵃa. A = LU, where L is unit lower triangular and U is upper triangular.
ᵃb. A = LDU′, where L is unit lower triangular, D is diagonal, and U′ is unit upper triangular.
ᵃc. A = L′U′, where L′ is lower triangular and U′ is unit upper triangular.
ᵃd. A = (L″)(L″)ᵀ, where L″ is lower triangular.
ᵃe. Evaluate the determinant of A.
Hint: det(A) = det(L) det(D) det(U′) = det(D).

7. Consider the 3 × 3 Hilbert matrix
$$A = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{3} \\ \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{4} \\ \tfrac{1}{3} & \tfrac{1}{4} & \tfrac{1}{5} \end{bmatrix}$$
Repeat the preceding problem using this matrix.

ᵃ8. Find the LU decomposition, where L is unit lower triangular, for
10 A=1 1
0 0 1 1
−1 1 1 −1
1
− 1 1
−1 2

9. Consider
$$A = \begin{bmatrix} 2 & -1 & 2 \\ 2 & -3 & 3 \\ 6 & -1 & 8 \end{bmatrix}$$
ᵃa. Find the matrix factorization A = LDU′, where L is unit lower triangular, D is diagonal, and U′ is unit upper triangular.
ᵃb. Use this decomposition of A to solve Ax = b, where $b = [-2, -5, 0]^T$.

ᵃ10. Repeat the preceding problem for
$$A = \begin{bmatrix} -2 & 1 & -2 \\ -4 & 3 & -3 \\ 2 & 2 & 4 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 4 \\ 4 \end{bmatrix}$$

11. Consider the system of equations
$$\begin{aligned} 6x_1 &= 12 \\ 6x_2 + 3x_1 &= -12 \\ 7x_3 - 2x_2 + 4x_1 &= 14 \\ 21x_4 + 9x_3 - 3x_2 + 5x_1 &= -2 \end{aligned}$$
a. Solve for $x_1$, $x_2$, $x_3$, and $x_4$ (in order) by forward substitution.
b. Write this system in matrix notation Ax = b, where $x = [x_1, x_2, x_3, x_4]^T$. Determine the LU factorization A = LU, where L is unit lower triangular and U is upper triangular.
ᵃ12. Given
$$A = \begin{bmatrix} 3 & 2 & -1 \\ 5 & 3 & 2 \\ -1 & 1 & -3 \end{bmatrix}, \qquad L^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\tfrac{5}{3} & 1 & 0 \\ -8 & 5 & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 3 & 2 & -1 \\ 0 & -\tfrac{1}{3} & \tfrac{11}{3} \\ 0 & 0 & 15 \end{bmatrix}$$
obtain the inverse of A by solving $UX^{(j)} = L^{-1}I^{(j)}$ for j = 1, 2, 3.

13. Using the system of Equation (2), form $M = M_3 M_2 M_1$ and determine $M^{-1}$. Verify that $M^{-1} = L$. Why is this, in general, not a good idea?

14. Consider the matrix $A = \mathrm{Tridiagonal}(a_{i,i-1}, a_{ii}, a_{i,i+1})$, where $a_{ii} \ne 0$. Here is the 4 × 4 case.
ᵃa. Establish the algorithm

    integer i
    real array (a_ij)_{1:n×1:n}, (l_ij)_{1:n×1:n}, (u_ij)_{1:n×1:n}
    l_{1,1} ← a_{1,1}
    for i = 2 to 4
        l_{i,i−1} ← a_{i,i−1}
        u_{i−1,i} ← a_{i−1,i} / l_{i−1,i−1}
        l_{i,i} ← a_{i,i} − l_{i,i−1} u_{i−1,i}
    end for

for determining the elements of a lower bidiagonal matrix $L = (l_{ij})$ and a unit upper bidiagonal matrix $U = (u_{ij})$ such that A = LU.
b. Establish the algorithm

    integer i
    real array (a_ij)_{1:n×1:n}, (l_ij)_{1:n×1:n}, (u_ij)_{1:n×1:n}
    u_{1,1} ← a_{1,1}
    for i = 2 to 4
        u_{i−1,i} ← a_{i−1,i}
        l_{i,i−1} ← a_{i,i−1} / u_{i−1,i−1}
        u_{i,i} ← a_{i,i} − l_{i,i−1} u_{i−1,i}
    end for

for determining the elements of a unit lower bidiagonal matrix $L = (l_{ij})$ and an upper bidiagonal matrix $U = (u_{ij})$ such that A = LU.
By extending the loops, we can generalize these algorithms to n × n tridiagonal matrices.

15. Show that the equation AX = B can be solved by Gaussian elimination with scaled partial pivoting in $(n^3/3) + mn^2 + O(n^2)$ multiplications and divisions, where A, X, and B are matrices of order n × n, n × m, and n × m, respectively. Thus, if B is n × n, then the n × n solution matrix X can be found by Gaussian elimination with scaled partial pivoting in $\tfrac{4}{3}n^3 + O(n^2)$ multiplications and divisions.
Hint: If $X^{(j)}$ and $B^{(j)}$ are the jth columns of X and B, respectively, then $AX^{(j)} = B^{(j)}$.

16. Let X be a square matrix that has the form
$$X = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$
where A and D are square matrices and $A^{-1}$ exists. It is known that $X^{-1}$ exists if and only if $(D - CA^{-1}B)^{-1}$ exists. Verify that $X^{-1}$ is given by
$$X^{-1} = \begin{bmatrix} I & -A^{-1}B \\ 0 & I \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & (D - CA^{-1}B)^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -CA^{-1} & I \end{bmatrix}$$
As an application, compute the inverse of the following:
ᵃa. $X = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 2 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
ᵃb. $X = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}$

17. Let A be an n × n complex matrix such that $A^{-1}$ exists. Verify that
$$\begin{bmatrix} A & \overline{A} \\ -Ai & \overline{-Ai} \end{bmatrix}^{-1} = \frac{1}{2} \begin{bmatrix} A^{-1} & A^{-1}i \\ \overline{A}^{-1} & -\overline{A}^{-1}i \end{bmatrix}$$
where $\overline{A}$ denotes the complex conjugate of A; if $A = (a_{ij})$, then $\overline{A} = (\overline{a}_{ij})$. Recall that for a complex number z = a + bi, where a and b are real, $\overline{z} = a - bi$.

18. Find the LU factorization of this matrix:
$$A = \begin{bmatrix} 2 & 2 & 1 \\ 4 & 7 & 2 \\ 2 & 11 & 5 \end{bmatrix}$$
19. a. Prove that the product of two lower triangular matrices is lower triangular.
b. Prove that the product of two unit lower triangular matrices is unit lower triangular.
c. Prove that the inverse of a unit lower triangular matrix is unit lower triangular.
d. By using the transpose operation, prove that all of the preceding results are true for upper triangular matrices.
20. Let L be lower triangular, U be upper triangular, and D be diagonal.
a. If L and U are both unit triangular and LDU is diagonal, does it follow that L and U are diagonal?
b. If LDU is nonsingular and diagonal, does it follow that L and U are diagonal?
c. If L and U are both unit triangular and if LDU is diagonal, does it follow that L = U = I?
21. Determine the $LDL^T$ factorization for the following matrix:
$$A = \begin{bmatrix} 1 & 2 & -1 & 1 \\ 2 & 3 & -4 & 3 \\ -1 & -4 & -1 & 3 \\ 1 & 3 & 3 & 0 \end{bmatrix}$$

22. Find the Cholesky factorization of
$$A = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 25 & 19 \\ 10 & 19 & 62 \end{bmatrix}$$

23. Consider the system
$$\begin{bmatrix} A & 0 \\ B & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} b \\ d \end{bmatrix}$$
Show how to solve the system more cheaply using the submatrices rather than the overall system. Give an estimate of the computational cost of both the new and old approaches. This problem illustrates solving a block linear system with a special structure.

24. Determine the $LDL^T$ factorization of the matrix
$$A = \begin{bmatrix} 5 & 35 & -20 & 65 \\ 35 & 244 & -143 & 461 \\ -20 & -143 & 73 & -232 \\ 65 & 461 & -232 & 856 \end{bmatrix}$$
Can you find the Cholesky factorization?

25. (Sparse Factorizations) Consider the following sparse symmetric matrices with the nonzero pattern shown where nonzero entries in the matrix are indicated by the × symbol and zero entries are a blank. Show the nonzero pattern in the matrix L for the Cholesky factorization by using the symbol + for the fill-in of a zero entry by a nonzero entry.
×××
× × × ×
×× ×
× × a. A= × ×
× × × ×
×××× ×××
× ××
× ××
× ××
× b. A=×× ×
× × × × × ×
× ××× ×××××
×× × ×× × × × × × ×
c.A= × × × × ××
× ×××× ××××
×× ×××
××
×× × ××
×× ×× ×××

Computer Exercises 8.1

1. Write and test a procedure for implementing the algorithms of Exercise 8.1.14.

2. The n × n factorization A = LU, where $L = (l_{ij})$ is lower triangular and $U = (u_{ij})$ is upper triangular, can be computed directly by the following algorithm (provided zero divisions are not encountered): Specify either $l_{11}$ or $u_{11}$ and compute the other such that $l_{11}u_{11} = a_{11}$. Compute the first column in L by
$$l_{i1} = \frac{a_{i1}}{u_{11}} \qquad (1 \le i \le n)$$
and compute the first row in U by
$$u_{1j} = \frac{a_{1j}}{l_{11}} \qquad (1 \le j \le n)$$
Now suppose that columns 1, 2, ..., k − 1 have been computed in L and that rows 1, 2, ..., k − 1 have been computed in U. At the kth step, specify either $l_{kk}$ or $u_{kk}$, and compute the other such that
$$l_{kk}u_{kk} = a_{kk} - \sum_{m=1}^{k-1} l_{km}u_{mk}$$
Compute the kth column in L by
$$l_{ik} = \frac{1}{u_{kk}} \left[ a_{ik} - \sum_{m=1}^{k-1} l_{im}u_{mk} \right] \qquad (k \le i \le n)$$
and compute the kth row in U by
$$u_{kj} = \frac{1}{l_{kk}} \left[ a_{kj} - \sum_{m=1}^{k-1} l_{km}u_{mj} \right] \qquad (k \le j \le n)$$
This algorithm is continued until all elements of U and L are completely determined. When $l_{ii} = 1$ (1 ≤ i ≤ n), this procedure is called the Doolittle factorization, and when $u_{jj} = 1$ (1 ≤ j ≤ n), it is known as the Crout factorization.
Define the test matrix
$$A = \begin{bmatrix} 5 & 7 & 6 & 5 \\ 7 & 10 & 8 & 7 \\ 6 & 8 & 10 & 9 \\ 5 & 7 & 9 & 10 \end{bmatrix}$$
Using the algorithm above, compute and print factorizations so that the diagonal entries of L and U are of the following forms:

    diag(L)          diag(U)
    [1, 1, 1, 1]     [?, ?, ?, ?]     Doolittle
    [?, ?, ?, ?]     [1, 1, 1, 1]     Crout
    [1, ?, 1, ?]     [?, 1, ?, 1]
    [?, 1, ?, 1]     [1, ?, 1, ?]
    [?, ?, 7, 9]     [3, 5, ?, ?]

Here the question mark means that the entry is to be computed. Write code to check the results by multiplying L and U together.

3. Write
procedure Poly(n, $(a_{ij})$, $(c_i)$, k, $(y_{ij})$)
for computing the n × n matrix $p_k(A)$ stored in array $(y_{ij})$:
$$y_k = p_k(A) = c_0 I + c_1 A + c_2 A^2 + \cdots + c_k A^k$$
where A is an n × n matrix and $p_k$ is a kth-degree polynomial. Here $(c_i)$ are real constants for 0 ≤ i ≤ k. Use nested multiplication and write efficient code. Test procedure Poly on the following data:
Case 1. $A = I_5$, $p_3(x) = 1 - 5x + 10x^3$
Case 2. $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, $p_2(x) = 1 - 2x + x^2$
Case 3. $A = \begin{bmatrix} 0 & 2 & 4 \\ 0 & 0 & 8 \\ 0 & 0 & 0 \end{bmatrix}$, $p_3(x) = 1 + 3x - 3x^2 + x^3$
ᵃCase 4. $A = \begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{bmatrix}$, $p_5(x) = 10 + x - 2x^2 + 3x^3 - 4x^4 + 5x^5$
Case 5. $A = \begin{bmatrix} -20 & -15 & -10 & -5 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$, $p_4(x) = 5 + 10x + 15x^2 + 20x^3 + x^4$
Case 6. $A = \begin{bmatrix} 5 & 7 & 6 & 5 \\ 7 & 10 & 8 & 7 \\ 6 & 8 & 10 & 9 \\ 5 & 7 & 9 & 10 \end{bmatrix}$, $p_4(x) = 1 - 100x + 146x^2 - 35x^3 + x^4$

4. Write and test a procedure for determining $A^{-1}$ for a given square matrix A of order n. Your procedure should use procedures Gauss and Solve.

5. Write and test a procedure to solve the system AX = B in which A, X, and B are matrices of order n × n, n × m, and n × m, respectively. Verify that the procedure works on several test cases, one of which has B = I so that the solution X is the inverse of A.
Hint: See Exercise 8.1.15.

6. Write and test a procedure for directly computing the inverse of a tridiagonal matrix. Assume that pivoting is not necessary.

7. (Continuation) Test the procedure of the preceding computer problem on the symmetric tridiagonal matrix A of order 10:
$$A = \begin{bmatrix} -2 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ & & & 1 & -2 \end{bmatrix}$$
The inverse of this matrix is known to be
$$(A^{-1})_{ij} = (A^{-1})_{ji} = \frac{-i(n + 1 - j)}{n + 1} \qquad (i \le j)$$
8. Investigate the numerical difficulties in inverting the following matrix:
$$A = \begin{bmatrix} -0.0001 & 5.096 & 5.101 & 1.853 \\ 0. & 3.737 & 3.740 & 3.392 \\ 0. & 0. & 0.006 & 5.254 \\ 0. & 0. & 0. & 4.567 \end{bmatrix}$$

9. Consider the following two test matrices:
$$A = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 25 & 19 \\ 10 & 19 & 62 \end{bmatrix}, \qquad B = \begin{bmatrix} 4 & 6 & 10 \\ 6 & 13 & 19 \\ 10 & 19 & 62 \end{bmatrix}$$
Show that the first Cholesky factorization has all integers in the solution, while the second one is all integers until the last step, where there is a square root.
a. Program the Cholesky algorithm.
b. Use MATLAB, Maple, or Mathematica to find the Cholesky factorizations.

10. Let A be real, symmetric, and positive definite. Is the same true for the matrix obtained by removing the first row and column of A?

11. Devise a code for inverting a unit lower triangular matrix. Test it on the following matrix:
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 5 & 2 & 1 & 0 \\ 7 & 4 & -3 & 1 \end{bmatrix}$$

12. Verify Example 1 using MATLAB, Maple, or Mathematica.

13. In Example 3, verify the factorizations of matrix A using MATLAB, Maple, and Mathematica.

14. Find the PA = LU factorization of this matrix:
$$A = \begin{bmatrix} -0.05811 & -0.11696 & 0.51004 & -0.31330 \\ -0.04291 & 0.56850 & 0.07041 & 0.68747 \\ -0.01652 & 0.38953 & 0.01203 & -0.52927 \\ -0.06140 & 0.32179 & -0.22094 & 0.42448 \end{bmatrix}$$
which was studied by Wilkinson [1965, p. 640].

8.2 Eigenvalues and Eigenvectors

Let A be an n × n matrix. We ask the following natural question about A: Are there nonzero vectors v for which Av is a scalar multiple of v? Although we pose this question in the spirit of pure curiosity, there are many situations in scientific computation in which this question arises.
The answer to our question is a qualified Yes! We must be willing to consider complex scalars, as well as vectors with complex components. With that broadening of our viewpoint, such vectors always exist. Here are two examples. In the first, we need not bring in complex numbers to illustrate the situation, whereas in the second, the vectors and scalar factors must be complex.

EXAMPLE 1
Let $A = \begin{bmatrix} 3 & 2 \\ 7 & -2 \end{bmatrix}$. Find a nonzero vector v for which Av is a multiple of v.
Solution
You can easily verify that
$$A \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5 \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad A \begin{bmatrix} 2 \\ -7 \end{bmatrix} = \begin{bmatrix} -8 \\ 28 \end{bmatrix} = -4 \begin{bmatrix} 2 \\ -7 \end{bmatrix}$$
We have two different answers (but we have not revealed how to find them). ■
EXAMPLE 2
Repeat the preceding example with the matrix $A = \begin{bmatrix} 1 & 1 \\ -2 & 3 \end{bmatrix}$.
Solution
As in Example 1, it can be verified that
$$A \begin{bmatrix} 1 \\ 1 + i \end{bmatrix} = (2 + i) \begin{bmatrix} 1 \\ 1 + i \end{bmatrix}, \qquad A \begin{bmatrix} 1 \\ 1 - i \end{bmatrix} = (2 - i) \begin{bmatrix} 1 \\ 1 - i \end{bmatrix}$$
In these equations, $i = \sqrt{-1}$. Surprisingly, we find answers involving complex numbers even though the matrix does not contain any complex entries! ■

When the equation Ax = λx is valid and x is not zero, we say that λ is an eigenvalue of A and x is an accompanying eigenvector. Thus, in Example 1, the matrix has 5 as an eigenvalue with accompanying eigenvector $[1, 1]^T$, and −4 is another eigenvalue with accompanying eigenvector $[2, -7]^T$. Example 2 emphasizes that a real matrix may have complex eigenvalues and complex eigenvectors. Notice that an equation $A\mathbf{0} = \lambda\mathbf{0}$ and an equation $A\mathbf{0} = 0\mathbf{x}$ say nothing useful about eigenvalues and eigenvectors of A.
Many problems in science lead to eigenvalue problems in which the principal question usually is: What are the eigenvalues of a given matrix, and what are the accompanying eigenvectors? An outstanding application of eigenvalues and eigenvectors is to systems of linear differential equations, which we discuss later.
Notice that if Ax = λx and x ≠ 0, then every nonzero multiple of x is an eigenvector (with the same eigenvalue). If λ is an eigenvalue of an n × n matrix A, then the set {x : Ax = λx} is a subspace of $\mathbb{R}^n$ called an eigenspace. It is necessarily of dimension at least 1.

Calculating Eigenvalues and Eigenvectors
Given a square matrix A, how does one discover its eigenvalues? Begin by observing that the equation Ax = λx is equivalent to (A − λI)x = 0. Since we are interested in nonzero solutions to this equation, the matrix A − λI must be noninvertible, and therefore Det(A − λI) = 0. This is how (in principle) we can find all the eigenvalues of A. Specifically, form the function p defined by
$$p(\lambda) = \mathrm{Det}(A - \lambda I)$$
and find the zeros of p. It turns out that p is a polynomial of degree n and must have n zeros, provided that we allow complex zeros and count each zero a number of times equal to its multiplicity. Even if the matrix A is real, we must be prepared for complex eigenvalues. The polynomial just described is called the characteristic polynomial of the matrix A. If this polynomial has a repeated factor, such as $(\lambda - 3)^k$, then we say that 3 is a root of multiplicity k. Such roots are still eigenvalues, but they can be troublesome when k > 1.
To illustrate the calculation of eigenvalues, let us use the matrix in Example 1, namely,
$$A = \begin{bmatrix} 3 & 2 \\ 7 & -2 \end{bmatrix}$$
The characteristic polynomial is
$$p(\lambda) = \mathrm{Det}(A - \lambda I) = \mathrm{Det} \begin{bmatrix} 3 - \lambda & 2 \\ 7 & -2 - \lambda \end{bmatrix} = (3 - \lambda)(-2 - \lambda) - 14 = \lambda^2 - \lambda - 20 = (\lambda - 5)(\lambda + 4)$$
The eigenvalues are 5 and −4.
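For a 2 × 2 matrix the characteristic polynomial is $\lambda^2 - (\text{trace})\lambda + \text{Det}(A)$, so the calculation above reduces to the quadratic formula. The following is a minimal Python sketch under that assumption; the helper names `eig2x2` and `residual` are ours, not part of any package. It reproduces the eigenvalues of Examples 1 and 2 and checks the eigenvectors quoted there.

```python
import cmath

def eig2x2(a):
    """Eigenvalues of a 2x2 matrix as the zeros of lam^2 - tr*lam + det."""
    (a11, a12), (a21, a22) = a
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    disc = cmath.sqrt(tr * tr - 4 * det)   # complex square root covers a negative discriminant
    return (tr + disc) / 2, (tr - disc) / 2

def residual(a, lam, v):
    """Largest magnitude among the components of A v - lam v."""
    av = [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]
    return max(abs(av[k] - lam * v[k]) for k in range(2))

A1 = [[3, 2], [7, -2]]     # matrix of Example 1
A2 = [[1, 1], [-2, 3]]     # matrix of Example 2

print(eig2x2(A1))          # 5 and -4, returned as complex numbers
print(eig2x2(A2))          # 2+i and 2-i
print(residual(A1, 5, [1, 1]), residual(A1, -4, [2, -7]))
print(residual(A2, 2 + 1j, [1, 1 + 1j]), residual(A2, 2 - 1j, [1, 1 - 1j]))
```

All four residuals vanish, confirming that each quoted vector is an eigenvector for its eigenvalue.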
Mathematical Software
We can carry out this calculation with one or two commands in MATLAB, Maple, or Mathematica. We can determine the characteristic polynomial and subsequently compute its zeros. This gives us the two roots of the characteristic polynomial, which are the eigenvalues 5 and −4. These mathematical software systems also have single commands to produce a list of eigenvalues, computed in the best possible way, which is usually not to determine the characteristic polynomial and subsequently compute its zeros!
In general, an n × n matrix has a characteristic polynomial of degree n, and its roots are the eigenvalues of A. Since the calculation of zeros of a polynomial is numerically challenging if not unstable, this straightforward procedure is not recommended. (See Computer Exercise 8.2.2 for an experiment pertaining to this situation.) For small values of n, it may be quite satisfactory, however. It is called the direct method for computing eigenvalues.
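Once an eigenvalue λ has been determined for a matrix A, an eigenvector can be computed by solving the system (A − λI)x = 0. Thus, in Example 1, we must solve (A − 5I)x = 0, or
$$\begin{bmatrix} -2 & 2 \\ 7 & -7 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$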
Of course, this matrix is singular, and the homogeneous equation has nontrivial solutions, such as [1, 1]T . The other eigenvalue is treated in the same way, leading to an eigenvector [2, −7]T . Any scalar multiple of an eigenvector is also an eigenvector.
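As a small illustration of this step, here is a Python sketch that picks a nonzero solution of (A − λI)x = 0 for a 2 × 2 matrix. It assumes that λ really is an eigenvalue, so the two rows of A − λI are proportional; the function names are hypothetical helpers introduced only for this sketch.

```python
import math

def null_vector_2x2(a, lam, tol=1e-12):
    """Nonzero x with (A - lam*I) x = 0, assuming lam is an eigenvalue of the 2x2 matrix a."""
    m11, m12 = a[0][0] - lam, a[0][1]
    m21, m22 = a[1][0], a[1][1] - lam
    # (m12, -m11) is annihilated by the row (m11, m12); use it unless that row is zero.
    if abs(m11) > tol or abs(m12) > tol:
        return [m12, -m11]
    if abs(m21) > tol or abs(m22) > tol:
        return [m22, -m21]
    return [1, 0]            # A - lam*I is the zero matrix, so every vector works

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(abs(c) ** 2 for c in v))
    return [c / norm for c in v]

A = [[3, 2], [7, -2]]                      # matrix of Example 1
print(null_vector_2x2(A, 5))               # [2, 2], a multiple of [1, 1]
print(null_vector_2x2(A, -4))              # [2, -7]
print(normalize(null_vector_2x2(A, 5)))    # approximately [0.7071, 0.7071]
```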
This work can be done by using mathematical software to find an eigenvector for each eigenvalue λ via the null space of the matrix A − λI. Also, we can use a single command to compute all the eigenvalues directly or request the calculation of all the eigenvalues and eigenvectors at once. The MATLAB command [V,D] = eig(A) produces two arrays, V and D. The array V has eigenvectors of A as its columns, and the array D contains all the eigenvalues of A on its diagonal. The program returns a vector of unit length such as $[0.7071, 0.7071]^T$. That vector by itself provides a basis for the null space of A − 5I. (Maple and Mathematica have commands for computing eigenvalues and eigenvectors.)
Notice that the eigenvalue-eigenvector problem is nonlinear. The equation
$$Ax = \lambda x$$
has two unknowns, λ and x. They appear in the equation multiplied together. If either x or λ were known, finding the other would be a linear problem and very easy.

Mathematical Software
A typical, mundane use of mathematical software such as MATLAB might be to compute the eigenvalues and eigenvectors of this matrix
$$A = \begin{bmatrix} 1 & 3 & -7 \\ -3 & 4 & 1 \\ 2 & -5 & 3 \end{bmatrix} \tag{1}$$
with a command such as
[V,D] = eig(A)
MATLAB responds instantly with the eigenvectors in the array V and the eigenvalues in the diagonal array D. The real eigenvalue is 0.0214, and the complex pair of eigenvalues are 3.9893 ± 5.5601i. Behind the scenes, much complicated computing may be taking place! The general procedure has these components: First, by means of similarity transformations, A is put into upper Hessenberg form. This means that all elements below the first subdiagonal are zero. Thus, the new matrix $A = (a_{ij})$ satisfies $a_{ij} = 0$ when i > j + 1. Similarity transformations ensure that the eigenvalues are not disturbed. If A is real, further similarity transformations put A into a near-diagonal form in which each diagonal element is either a single real number or a 2 × 2 real matrix whose eigenvalues are a pair of conjugate complex numbers. Creating the additional zeros just below the diagonal requires some iterative process, because after all, we are in effect computing the zeros of a polynomial. The iterative process is reminiscent of the power method that is described in Section 8.3.
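The eigenvalues quoted for matrix (1) can be sanity-checked without any eigenvalue software. For a 3 × 3 matrix, the zeros of Det(A − λI) are the zeros of $\lambda^3 - c_2\lambda^2 + c_1\lambda - c_0$, where $c_2$ is the trace, $c_1$ is the sum of the principal 2 × 2 minors, and $c_0$ is the determinant. The Python sketch below (the function names are ours) forms these coefficients and evaluates the polynomial at the quoted values; the residuals are small but not zero because the eigenvalues are rounded to four decimals. This is only a consistency check, not the Hessenberg-based procedure described above.

```python
def char_poly_3x3(a):
    """Return (c2, c1, c0) with Det(A - lam*I) = -(lam^3 - c2*lam^2 + c1*lam - c0)."""
    tr = a[0][0] + a[1][1] + a[2][2]
    minors = (a[0][0] * a[1][1] - a[0][1] * a[1][0]
              + a[0][0] * a[2][2] - a[0][2] * a[2][0]
              + a[1][1] * a[2][2] - a[1][2] * a[2][1])
    det = (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
           - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
           + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    return tr, minors, det

def p(lam, coeffs):
    """Evaluate lam^3 - c2*lam^2 + c1*lam - c0 by nested multiplication."""
    c2, c1, c0 = coeffs
    return ((lam - c2) * lam + c1) * lam - c0

A = [[1, 3, -7], [-3, 4, 1], [2, -5, 3]]     # matrix (1)
coeffs = char_poly_3x3(A)
print(coeffs)                                 # (8, 47, 1)
for lam in (0.0214, 3.9893 + 5.5601j, 3.9893 - 5.5601j):
    print(lam, abs(p(lam, coeffs)))           # small residuals, limited by the rounding
```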
Maple can be used to compute the eigenvalues and eigenvectors. The quantities are computed in exact arithmetic and then converted to floating-point. In some versions of MATLAB, symbolic computations are available. In Mathematica, we can use either numerical or symbolic commands to obtain similar results.
The best advice for anyone who is confronted with challenging eigenvalue problems is to use the software in the package LAPACK. Special eigenvalue algorithms for various types of matrices are available there. For example, if the matrix in question is real and symmetric, use an algorithm tailored for that case. There are about a dozen categories available to choose from in LAPACK. MATLAB itself employs some of the routines in LAPACK.
Properties of Eigenvalues
A theorem that summarizes the special properties of a matrix that impinge on the computing of its eigenvalues follows.

■ Theorem 1  Matrix Eigenvalue Properties
The following statements are true for any square matrix A:
1. If λ is an eigenvalue of A, then p(λ) is an eigenvalue of p(A), for any polynomial p. In particular, $\lambda^k$ is an eigenvalue of $A^k$.
2. If A is invertible and λ is an eigenvalue of A, then p(1/λ) is an eigenvalue of $p(A^{-1})$, for any polynomial p. In particular, $\lambda^{-1}$ is an eigenvalue of $A^{-1}$.
3. If A is real and symmetric, then its eigenvalues are real.
4. If A is complex and Hermitian, then its eigenvalues are real.
5. If A is Hermitian and positive definite, then its eigenvalues are positive.
6. If P is invertible, then A and $PAP^{-1}$ have the same characteristic polynomial (and the same eigenvalues).

Recall that a matrix A is symmetric if $A = A^T$, where $A^T = (a_{ji})$ is the transpose of $A = (a_{ij})$. On the other hand, a complex matrix A is Hermitian if $A = A^*$, where $A^* = \overline{A}^T = (\overline{a}_{ji})$. Here $A^*$ is the conjugate transpose of the matrix A. Using the syntax of programming, we can write $A^T(i, j) = A(j, i)$ and $A^*(i, j) = \overline{A(j, i)}$. Recall also that A is positive definite if $x^T A x > 0$ for all nonzero vectors x.
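Properties 1 and 2 of Theorem 1 can be checked in exact rational arithmetic on the matrix of Example 1, whose eigenpairs are $(5, [1, 1]^T)$ and $(-4, [2, -7]^T)$. The Python sketch below is only an illustration: the test polynomial $p(x) = 2 - 3x + x^2$ and all function names are arbitrary choices of ours. It evaluates p(A) by nested multiplication and confirms that $p(A)v = p(\lambda)v$ and $A^{-1}v = \lambda^{-1}v$.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def matvec(a, v):
    return [a[i][0] * v[0] + a[i][1] * v[1] for i in range(2)]

def poly_of_matrix(coeffs, a):
    """Evaluate c0*I + c1*A + ... + ck*A^k by nested multiplication."""
    result = [[Fraction(0)] * 2 for _ in range(2)]
    for c in reversed(coeffs):
        result = matmul(result, a)
        result[0][0] += c
        result[1][1] += c
    return result

def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def p(x, coeffs):
    return sum(c * x ** k for k, c in enumerate(coeffs))

A = [[Fraction(3), Fraction(2)], [Fraction(7), Fraction(-2)]]   # matrix of Example 1
coeffs = [2, -3, 1]                                             # p(x) = 2 - 3x + x^2, an arbitrary test polynomial
pA = poly_of_matrix(coeffs, A)

for lam, v in ((5, [1, 1]), (-4, [2, -7])):
    assert matvec(pA, v) == [p(lam, coeffs) * x for x in v]               # property 1
    assert matvec(inverse_2x2(A), v) == [Fraction(1, lam) * x for x in v]  # property 2 with p(x) = x
print(pA)
```

Using `Fraction` keeps every comparison exact, so the assertions test the identities themselves rather than a floating-point approximation of them.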
Similar Matrices
Two matrices A and B are similar to each other if there exists an invertible matrix P such that $B = PAP^{-1}$. Similar matrices have the same characteristic polynomial:
$$\begin{aligned} \mathrm{Det}(B - \lambda I) &= \mathrm{Det}(PAP^{-1} - \lambda I) = \mathrm{Det}(P(A - \lambda I)P^{-1}) \\ &= \mathrm{Det}(P) \cdot \mathrm{Det}(A - \lambda I) \cdot \mathrm{Det}(P^{-1}) = \mathrm{Det}(A - \lambda I) \end{aligned}$$
Thus, we have this important theorem.

■ Theorem 2  Eigenvalues of Similar Matrices
Similar matrices have the same eigenvalues.

This theorem suggests a strategy for finding eigenvalues of A. Transform the matrix A to a matrix B using a similarity transformation
$$B = PAP^{-1}$$
in which B has a special structure, and then find the eigenvalues of matrix B. Specifically, if B is triangular or diagonal, the eigenvalues of B (and those of A) are simply the diagonal elements of B.
Matrices A and B are said to be unitarily similar to each other if $B = U^*AU$ for some unitary matrix U. Recall that a matrix U is unitary if $UU^* = I$. This brings us naturally to another important theorem and two corollaries.

■ Theorem 3  Schur’s Theorem
Every square matrix is unitarily similar to a triangular matrix.

In this theorem, an arbitrary complex n × n matrix A is given, and the assertion made is that a unitary matrix U exists such that
$$UAU^* = T \tag{2}$$
where $UU^* = I$ and T is a triangular matrix.
The proof of Schur’s Theorem can be found in Kincaid and Cheney [2002] and Golub and Van Loan [1996].

■ Corollary 1  Matrix Similar to a Triangular Matrix
Every square real matrix is similar to a triangular matrix.

Thus, the factorization
$$PAP^{-1} = T$$
is possible, where T is triangular, P is invertible, and A is real.
EXAMPLE 3
We illustrate Schur’s Theorem by finding the decomposition of this 2 × 2 matrix:
$$A = \begin{bmatrix} 3 & -2 \\ 8 & 3 \end{bmatrix}$$
Solution
From the characteristic equation
$$\mathrm{Det}(A - \lambda I) = \lambda^2 - 6\lambda + 25 = 0$$
the eigenvalues are 3 ± 4i. By solving (A − λI)v = 0 with each of these eigenvalues, the corresponding eigenvectors are $v_1 = [i, 2]^T$ and $v_2 = [-i, 2]^T$. Using the Gram-Schmidt orthogonalization process, we obtain $u_1 = v_1$ and $u_2 = v_2 - [v_2^* u_1 / u_1^* u_1] u_1$, which is a scalar multiple of $[-2, -i]^T$, so we may take $u_2 = [-2, -i]^T$. After normalizing these vectors, we obtain the unitary matrix
$$U = \frac{1}{\sqrt{5}} \begin{bmatrix} i & -2 \\ 2 & -i \end{bmatrix}$$
which satisfies the property $UU^* = I$. Finally, we obtain the Schur form
$$UAU^* = \begin{bmatrix} 3 + 4i & -6 \\ 0 & 3 - 4i \end{bmatrix}$$
which is an upper triangular matrix with the eigenvalues on the diagonal. ■
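The Schur form in Example 3 is easy to confirm with complex arithmetic. The Python sketch below (helper names are ours) builds U from the normalized vectors $u_1$ and $u_2$, checks that $UU^*$ is the identity, and forms $UAU^*$. Only floating-point roundoff separates the printed result from the matrix shown in the example.

```python
import math

def conj_transpose(m):
    return [[m[j][i].conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

s = 1 / math.sqrt(5)
U = [[1j * s, complex(-2 * s)], [complex(2 * s), -1j * s]]   # columns are u1 and u2 after normalization
A = [[3 + 0j, -2 + 0j], [8 + 0j, 3 + 0j]]
Ustar = conj_transpose(U)

identity_check = matmul(U, Ustar)
T = matmul(matmul(U, A), Ustar)      # U A U*
print(identity_check)                # approximately [[1, 0], [0, 1]]
print(T)                             # approximately [[3+4j, -6], [0, 3-4j]]
```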
■ Corollary 2  Hermitian Matrix Unitarily Similar to a Diagonal Matrix
Every square Hermitian matrix is unitarily similar to a diagonal matrix.

Corollary 2 says that a Hermitian matrix $A = A^*$ can be factored as $UAU^* = D$, where D is diagonal and U is unitary.
This follows from Corollary 1, since a Hermitian matrix ($A = A^*$) is unitarily similar to a triangular matrix T:
$$UAU^* = T$$
where $UU^* = I$. Taking the conjugate transpose of both sides, we have
$$UA^*U^* = T^*$$
Since $A = A^*$, we obtain $T = T^*$. A triangular matrix that equals its own conjugate transpose must be a diagonal matrix.
Deflation Process
Most numerical methods for finding eigenvalues of an n × n matrix A proceed by determining such similarity transformations. Then one eigenvalue at a time, say, λ, is computed, and a deflation process is used to produce an (n − 1) × (n − 1) matrix $\tilde{A}$ whose eigenvalues are the same as those of A, except for λ. Any such procedure can be repeated with the matrix $\tilde{A}$ to find as many eigenvalues of the matrix A as desired. In practice, this strategy must be used cautiously because the successive eigenvalues may be infected with roundoff error!
Gershgorin’s Theorem
Sometimes it is necessary to determine in a coarse manner where the eigenvalues of a matrix are situated in the complex plane C. The most famous of these so-called localization theorems is the following.
■ Theorem 4  Gershgorin’s Theorem
All eigenvalues of an n × n matrix $A = (a_{ij})$ are contained in the union of the n discs $C_i = C_i(a_{ii}, r_i)$ in the complex plane with center $a_{ii}$ and radius $r_i$ given by the sum of the magnitudes of the off-diagonal entries in the ith row.

The matrix A can have either real or complex entries. The region containing the eigenvalues of A can be written
$$\bigcup_{i=1}^{n} C_i = \bigcup_{i=1}^{n} \{ z \in \mathbb{C} : |z - a_{ii}| \le r_i \} \tag{3}$$
where the radii are $r_i = \sum_{j=1,\, j \ne i}^{n} |a_{ij}|$.
The eigenvalues of A and $A^T$ are the same because the characteristic equation involves the determinant, which is the same for a matrix and its transpose. Therefore, we can apply the Gershgorin Theorem to $A^T$ and obtain the following useful result.

■ Corollary 3  More Gershgorin Discs
All eigenvalues of an n × n matrix $A = (a_{ij})$ are contained in the union of the n discs $D_i = D_i(a_{ii}, s_i)$ in the complex plane having center at $a_{ii}$ and radius $s_i$ given by the sum of the magnitudes of the off-diagonal entries in the ith column of A.

Consequently, the region containing the eigenvalues of A can be written as
$$\bigcup_{i=1}^{n} D_i = \bigcup_{i=1}^{n} \{ z \in \mathbb{C} : |z - a_{ii}| \le s_i \} \tag{4}$$
where the radii are $s_i = \sum_{j=1,\, j \ne i}^{n} |a_{ji}|$. Finally, the region containing the eigenvalues of A is
$$\left( \bigcup_{i=1}^{n} C_i \right) \cap \left( \bigcup_{i=1}^{n} D_i \right) \tag{5}$$
This may result in tighter bounds on the eigenvalues in some cases. Also, a useful localization result is

■ Corollary 4  Localization
For a matrix A, the union of any k Gershgorin discs that do not intersect the remaining n − k discs contains exactly k (counting multiplicities) of the eigenvalues of A.

For a strictly diagonally dominant matrix, zero cannot lie in any of its Gershgorin discs, so it must be invertible. Consequently, we obtain the following result.

■ Corollary 5  Strictly Diagonally Dominant Matrix
Every strictly diagonally dominant matrix is invertible.
EXAMPLE 4
Consider the matrix
$$A = \begin{bmatrix} 4-i & 2 & i \\ -1 & 2i & 2 \\ 1 & -1 & -5 \end{bmatrix}$$
Draw the Gershgorin discs.
Solution
Using the rows of A, we find that the Gershgorin discs are $C_1(4 - i, 3)$, $C_2(2i, 3)$, and $C_3(-5, 2)$. By using the columns of A, we obtain more Gershgorin discs: $D_1(4 - i, 2)$, $D_2(2i, 3)$, and $D_3(-5, 3)$. Consequently, all the eigenvalues of A are in the three discs $D_1$, $C_2$, and $C_3$, as shown in Figure 8.1. By other means, we compute the eigenvalues of A as $\lambda_1 = 3.7208 - 1.0546i$, $\lambda_2 = -4.5602 - 0.2849i$, and $\lambda_3 = -0.1605 + 2.3395i$. (Their sum agrees with the trace of A, which is −1 + i.) In Figure 8.1, the centers of the discs are designated by dots • and the eigenvalues by ∗. ■
FIGURE 8.1  Gershgorin discs (horizontal axis Re(z), vertical axis Im(z))
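The disc centers and radii of Theorem 4 and Corollary 3 are simple to compute. The sketch below, in plain Python, applies them to the matrix of Example 4; the function names `gershgorin_discs` and `in_region` are illustrative assumptions, not names from the text.

```python
def gershgorin_discs(A):
    """Return (row_discs, col_discs); each disc is a (center, radius) pair."""
    n = len(A)
    rows = [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]
    cols = [(A[i][i], sum(abs(A[j][i]) for j in range(n) if j != i))
            for i in range(n)]
    return rows, cols

def in_region(z, rows, cols, tol=0.0):
    """True if z lies in the union of the row discs AND in the union of the
    column discs, that is, in the region described by Equation (5)."""
    in_rows = any(abs(z - c) <= r + tol for c, r in rows)
    in_cols = any(abs(z - c) <= s + tol for c, s in cols)
    return in_rows and in_cols

A = [[4 - 1j, 2, 1j],
     [-1, 2j, 2],
     [1, -1, -5]]
row_discs, col_discs = gershgorin_discs(A)
print("row discs:   ", row_discs)
print("column discs:", col_discs)
```

A computed eigenvalue can then be passed to `in_region` as a quick sanity check on an eigenvalue routine.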
Singular Value Decomposition
This subsection requires some further knowledge of linear algebra, in particular the diagonalization of symmetric matrices, eigenvalues, eigenvectors, rank, column space, and norms. See Appendix D.2, online at the textbook Web site, for a brief review of these topics. (In the discussion following, we assume that the Euclidean norm is being used.)
The singular value decomposition is a general-purpose tool that has many uses, particularly in least-squares problems (Chapter 9). It can be applied to any matrix, whether square or not. We begin by stating that the singular values of a matrix A are the nonnegative square roots of the eigenvalues of $A^TA$.

■ Theorem 5  Matrix Spectral Theorem
Let A be m × n. Then $A^TA$ is an n × n symmetric matrix, and it can be diagonalized by an orthogonal matrix, say, Q:
$$A^TA = QDQ^{-1}$$
where $QQ^T = Q^TQ = I$ and D is a diagonal n × n matrix.

Furthermore, the diagonal matrix D contains the eigenvalues of $A^TA$ on its diagonal. This follows from the fact that $A^TAQ = QD$, so the columns of Q are eigenvectors of $A^TA$.
Singular Values
If λ is an eigenvalue of $A^TA$ and if x is a corresponding eigenvector, then $A^TAx = \lambda x$, whence
$$\|Ax\|_2^2 = (Ax)^T(Ax) = x^TA^TAx = x^T\lambda x = \lambda\|x\|_2^2$$
This equation shows that λ is real and nonnegative. We can order the eigenvalues as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. (Reordering the eigenvalues requires reordering the columns of Q.) The numbers $\sigma_j = +\sqrt{\lambda_j}$ are the singular values of A.
Since Q is an orthogonal matrix, its columns form an orthonormal basis for $\mathbb{R}^n$. They are unit eigenvectors of $A^TA$, so if $v_j$ is the jth column of Q, then $A^TAv_j = \lambda_jv_j$. Some of the eigenvalues of $A^TA$ can be zero. Define r by the condition
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0 = \lambda_{r+1} = \cdots = \lambda_n$$
For a review of concepts such as rank, orthogonal basis, orthonormal basis, column space, null space, and so on, see Appendix D.2 on the textbook Web site.

■ Theorem 6  Orthogonal Basis Theorem
If the rank of A is r, then an orthogonal basis for the column space of A is $\{Av_j : 1 \le j \le r\}$.

Proof
Observe that
$$(Av_k)^T(Av_j) = v_k^TA^TAv_j = v_k^T\lambda_jv_j = \lambda_j\delta_{kj}$$
This establishes the orthogonality of the set $\{Av_j : 1 \le j \le n\}$. By letting k = j, we get $\|Av_j\|_2^2 = \lambda_j$. Hence, $Av_j \ne 0$ if and only if $1 \le j \le r$. If w is any vector in the column space of A, then w = Ax for some x in $\mathbb{R}^n$. Putting $x = \sum_{j=1}^{n} c_jv_j$, we get
$$w = Ax = \sum_{j=1}^{n} c_jAv_j = \sum_{j=1}^{r} c_jAv_j$$
and therefore, w is in the span of $\{Av_1, Av_2, \ldots, Av_r\}$. ■

Rank
The preceding theorem gives a reasonable way of computing the rank of a numerical matrix. First, compute its singular values. Any that are very small can be assumed to be zero. The remaining ones are strongly positive, and if there are r of them, we take r to be the numerically computed rank of A.
Singular Value Decomposition
A singular value decomposition of an m × n matrix A is any representation of A in the form
$$A = UDV^T \tag{6}$$
where U and V are orthogonal matrices and D is an m × n diagonal matrix having nonnegative diagonal entries that are ordered $d_{11} \ge d_{22} \ge \cdots \ge 0$. Then from Exercise 8.2.4, it follows that the diagonal elements $d_{ii}$ are necessarily the singular values of A. Note that the matrix U is m × m and V is n × n. A nonsquare matrix D is diagonal if the only elements that are not zero are among those whose two indices are equal.
Now a singular value decomposition of A (there are many of them) can be described. Start with the vectors $v_1, v_2, \ldots, v_r$. Normalize the vectors $Av_j$ to get vectors $u_j$. Thus, we have
$$u_j = Av_j / \|Av_j\| \qquad (1 \le j \le r)$$
Extend this set to an orthonormal basis for $\mathbb{R}^m$. Let U be the m × m matrix whose columns are $u_1, u_2, \ldots, u_m$. Define D to be the m × n matrix consisting of zeros except for $\sigma_1, \sigma_2, \ldots, \sigma_r$ on its diagonal. Let V = Q, where Q is as discussed.
Verifying $A = UDV^T$
To verify the equation $A = UDV^T$, first note that $\sigma_j = \|Av_j\|_2$ and that $\sigma_ju_j = Av_j$. Then compute UD. Since D is diagonal, this is easy. Because $Av_j = 0$ for j > r, we get
$$UD = [u_1, u_2, \ldots, u_m]D = [\sigma_1u_1, \sigma_2u_2, \ldots, \sigma_ru_r, 0, \ldots, 0] = [Av_1, Av_2, \ldots, Av_r, \ldots, Av_n] = AQ = AV$$
Since V is orthogonal, this implies that
$$A = UDV^T$$
Condition Number
The condition number of a matrix can be expressed in terms of its singular values
$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}} \tag{7}$$
since $\|A\|_2 = \sqrt{\rho(A^TA)} = \sigma_{\max}(A)$ and $\|A^{-1}\|_2 = \sqrt{\rho(A^{-T}A^{-1})} = 1/\sigma_{\min}(A)$.
Numerical Examples of Singular Value Decomposition
The numerical determination of a singular value decomposition is best left to mathematical software such as MATLAB, Maple, Mathematica, LAPACK, and other software packages. Usually, they do not form $A^TA$ in the numerical work because its condition number may be much worse than that of A. This phenomenon is easily illustrated by the matrices
$$A = \begin{bmatrix} 1 & 1 & 1 \\ \varepsilon & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \varepsilon \end{bmatrix}, \qquad A^TA = \begin{bmatrix} 1+\varepsilon^2 & 1 & 1 \\ 1 & 1+\varepsilon^2 & 1 \\ 1 & 1 & 1+\varepsilon^2 \end{bmatrix}$$
There are small values of ε for which A has rank 3 and $A^TA$ has rank 1 (in the computer).
EXAMPLE 5
In Example 2 of Section 1.1 (p. 4), we encountered this matrix:
$$A = \begin{bmatrix} 0.1036 & 0.2122 \\ 0.2081 & 0.4247 \end{bmatrix}$$
Determine its eigenvalues, singular values, and condition number.
Solution
By using mathematical software, it is easy to find the eigenvalues $\lambda_1(A) \approx -0.0003$ and $\lambda_2(A) \approx 0.5286$. We can form the matrix
$$A^TA = \begin{bmatrix} 0.0540 & 0.1104 \\ 0.1104 & 0.2254 \end{bmatrix}$$
and find its eigenvalues $\lambda_1(A^TA) \approx 0.915 \times 10^{-7}$ and $\lambda_2(A^TA) \approx 0.2794$. Therefore, the singular values are
$$\sigma_1(A) = \sqrt{|\lambda_1(A^TA)|} \approx 0.0003, \qquad \sigma_2(A) = \sqrt{|\lambda_2(A^TA)|} \approx 0.5286$$
Also, we can obtain the singular values directly as $\sigma_1 \approx 0.0003$ and $\sigma_2 \approx 0.5286$ using mathematical software. Consequently, the condition number is $\kappa(A) = \sigma_2/\sigma_1 \approx 1747.6$. Because of this large condition number, we now understand why there was difficulty in solving the linear system of equations with this coefficient matrix! ■
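The numbers in Example 5 can be reproduced without a linear algebra library. The sketch below is one possible approach for a 2 × 2 matrix only, not the method used by the packages named above: it takes the larger eigenvalue of $A^TA$ from the quadratic formula and recovers the smaller singular value from the identity $\sigma_{\max}\sigma_{\min} = |\det A|$, which avoids subtractive cancellation. The function name `singular_values_2x2` is an illustrative assumption, and the matrix is assumed to be nonzero.

```python
import math

def singular_values_2x2(A):
    """Singular values (largest, smallest) of a nonzero real 2 x 2 matrix."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A^T A = [[p, q], [q, r]]
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    # Larger eigenvalue of A^T A from the quadratic formula
    half_trace = (p + r) / 2
    disc = math.hypot((p - r) / 2, q)
    sigma_max = math.sqrt(half_trace + disc)
    # det(A^T A) = det(A)^2, so sigma_max * sigma_min = |det A|
    sigma_min = abs(a * d - b * c) / sigma_max
    return sigma_max, sigma_min

A = [[0.1036, 0.2122], [0.2081, 0.4247]]
sigma_max, sigma_min = singular_values_2x2(A)
kappa = sigma_max / sigma_min
print(sigma_max, sigma_min, kappa)
```

The smaller eigenvalue of $A^TA$ is the square of the smaller singular value, which is why forming $A^TA$ explicitly loses accuracy for ill-conditioned matrices.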
EXAMPLE 6
Calculate the singular value decomposition of the matrix
$$A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \tag{8}$$
Solution
Here, the matrix A is m × n with m = 3 and n = 2. First, we find that the eigenvalues of the matrix
$$A^TA = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$$
arranged in descending order are $\lambda_1 = 3$ and $\lambda_2 = 1$. So there are 2 nonzero eigenvalues of the matrix $A^TA$. Next, we determine that the eigenvectors of the matrix $A^TA$ are $[1, 1]^T$ for $\lambda_1 = 3$ and $[1, -1]^T$ for $\lambda_2 = 1$. Consequently, the orthonormal set of eigenvectors of $A^TA$ is $\left[\tfrac{1}{2}\sqrt{2}, \tfrac{1}{2}\sqrt{2}\right]^T$ for $\lambda_1 = 3$ and $\left[\tfrac{1}{2}\sqrt{2}, -\tfrac{1}{2}\sqrt{2}\right]^T$ for $\lambda_2 = 1$. Then we arrange them in the same order as the eigenvalues to form the column vectors of the n × n matrix V:
$$V = [v_1 \; v_2] = \begin{bmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
Now we form the m × n singular value matrix D by placing $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$ on the leading diagonal:
$$D = \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$$
Here, on the leading diagonal are the square roots of the eigenvalues of $A^TA$ in descending order, and the rest of the entries of the matrix D are zeros. Next, we compute vectors $u_i = \sigma_i^{-1}Av_i$ for i = 1 and 2, which form the column vectors of the m × m matrix U. In this case, we find
$$u_1 = \sigma_1^{-1}Av_1 = \frac{1}{\sqrt{3}} \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} \\ \tfrac{1}{6}\sqrt{6} \\ \tfrac{1}{6}\sqrt{6} \end{bmatrix}$$
and
$$u_2 = \sigma_2^{-1}Av_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} \end{bmatrix} = \begin{bmatrix} 0 \\ -\tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
Finally, we add to the matrix U the rest of the m − r vectors using the Gram-Schmidt orthogonalization process in Section 9.2 (p. 444). So we make the vector $u_3$ perpendicular to $u_1$ and $u_2$:
$$u_3 = e_1 - (u_1^Te_1)u_1 - (u_2^Te_1)u_2 = \begin{bmatrix} \tfrac{1}{3} \\ -\tfrac{1}{3} \\ -\tfrac{1}{3} \end{bmatrix}$$
Normalizing the vector $u_3$, we get
$$u_3 = \begin{bmatrix} \tfrac{1}{3}\sqrt{3} \\ -\tfrac{1}{3}\sqrt{3} \\ -\tfrac{1}{3}\sqrt{3} \end{bmatrix}$$
So we have the matrix
$$U = [u_1 \; u_2 \; u_3] = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} & 0 & \tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & -\tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \end{bmatrix}$$
The singular value decomposition of the matrix A is
$$A = UDV^T$$
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} & 0 & \tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & -\tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \\ \tfrac{1}{6}\sqrt{6} & \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
So there we have it! Fortunately, there is mathematical software for doing all of this instantly! We can verify the results by computing the matrix A from the factorization on the right-hand side. ■
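The verification suggested at the end of Example 6 can be carried out in a few lines. This is a minimal sketch in plain Python with U, D, and V entered as computed in the example (with $u_2 = \sigma_2^{-1}Av_2$ as its second column); the helper names `matmul` and `transpose` are illustrative.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)

U = [[s6 / 3, 0.0, s3 / 3],
     [s6 / 6, -s2 / 2, -s3 / 3],
     [s6 / 6, s2 / 2, -s3 / 3]]
D = [[s3, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
V = [[s2 / 2, s2 / 2],
     [s2 / 2, -s2 / 2]]

A_rebuilt = matmul(matmul(U, D), transpose(V))   # should reproduce A
UtU = matmul(transpose(U), U)                    # should be the identity

for row in A_rebuilt:
    print([round(entry, 12) for entry in row])
```

Checking that $U^TU = I$ as well confirms that the Gram-Schmidt step produced an orthonormal third column.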
See Chapters 9 and 13 for some important applications of the singular value decomposition. Further examples are given there and in the problems of those chapters.
Application: Linear Differential Equations
The application of eigenvalue theory to systems of linear differential equations is briefly explained here. Let us start with a single linear differential equation with one dependent variable x. The independent variable is t and often represents time. We write $x' = ax$, or in more detail $(d/dt)x(t) = ax(t)$. There is a family of solutions, namely, $x(t) = ce^{at}$, where c is an arbitrary real parameter. If an initial value x(0) is prescribed, we need to choose the parameter c = x(0) to get the initial value right.
A pair of linear differential equations with two dependent variables, $x_1$ and $x_2$, looks like this:
$$\begin{aligned} x_1' &= a_{11}x_1 + a_{12}x_2 \\ x_2' &= a_{21}x_1 + a_{22}x_2 \end{aligned}$$
The general form of a system of n linear first-order differential equations, with constant coefficients, is simply
$$x' = Ax \tag{9}$$
Here, A is an n × n numerical matrix, and the vector x has n components, $x_j$, each being a function of t. Differentiation is with respect to t. To solve this, we are guided by the easy case of n = 1, discussed above. Here, we try
$$x(t) = e^{\lambda t}v$$
where v is a constant vector. Taking the derivative of x, we have $x' = \lambda e^{\lambda t}v$. Now the system of equations has become $\lambda e^{\lambda t}v = Ae^{\lambda t}v$, or $\lambda v = Av$. This is how eigenvalues come into the process. We have proved the following result.

■ Theorem 7  Linear Differential Equations
If λ is an eigenvalue of the matrix A and if v is an accompanying eigenvector, then one solution of the differential equation $x' = Ax$ is $x(t) = e^{\lambda t}v$.
Application: A Vibration Problem
Eigenvalue-eigenvector analysis can be utilized for a variety of differential equations. Consider the system of two masses and three springs shown in Figure 8.2. Here, the masses are constrained to move only in the horizontal direction.
FIGURE 8.2  Two-mass vibration problem
From this situation, we write the equations of motion in matrix-vector form:
$$\begin{bmatrix} x_1'' \\ x_2'' \end{bmatrix} = \begin{bmatrix} -\beta & \alpha \\ \alpha & -\beta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad x'' = Ax \tag{10}$$
By assuming that the solution is purely oscillatory (no damping), we have
$$x = ve^{i\omega t}$$
In matrix form, we get
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} e^{i\omega t}$$
By differentiation, we obtain
$$x'' = -\omega^2ve^{i\omega t} = -\omega^2x$$
and
$$\begin{bmatrix} -\beta & \alpha \\ \alpha & -\beta \end{bmatrix} x = -\omega^2x$$
This is the eigenvalue problem
$$Ax = \lambda x$$
where $\lambda = -\omega^2$. Eigenvalues can be found from the characteristic equation:
$$\det(A + \omega^2I) = \det \begin{bmatrix} \omega^2 - \beta & \alpha \\ \alpha & \omega^2 - \beta \end{bmatrix} = 0$$
We find
$$(\omega^2 - \beta)^2 - \alpha^2 = \omega^4 - 2\beta\omega^2 + (\beta^2 - \alpha^2) = 0$$
$$\omega^2 = \tfrac{1}{2}\left[ 2\beta \pm \sqrt{4\beta^2 - 4(\beta^2 - \alpha^2)} \right] = \beta \pm \alpha$$
For simplicity, we now assume unit masses and unit springs so that β = 2 and α = 1. Then we obtain
$$A = \begin{bmatrix} -2 & 1 \\ 1 & -2 \end{bmatrix}$$
Then the roots of the characteristic equation are $\omega_1^2 = \beta + \alpha = 3$ and $\omega_2^2 = \beta - \alpha = 1$. Next, we can find the eigenvectors. For the first eigenvalue, we obtain
$$(A + \omega_1^2I)v_1 = 0 \;\Rightarrow\; \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} v_{11} \\ v_{12} \end{bmatrix} = 0$$
Since $v_{11} = -v_{12}$, we obtain the eigenvector
$$v_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
For the second eigenvector, we have
$$(A + \omega_2^2I)v_2 = 0 \;\Rightarrow\; \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} v_{21} \\ v_{22} \end{bmatrix} = 0$$
Since $v_{21} = v_{22}$, we obtain the eigenvector
$$v_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
The general solution for the equations of motion for the two-mass system is
$$x(t) = c_1v_1e^{i\omega_1t} + c_2v_1e^{-i\omega_1t} + c_3v_2e^{i\omega_2t} + c_4v_2e^{-i\omega_2t}$$
Because the solution was for the square of the frequency, each frequency is used twice (one positive and one negative). We can use initial conditions to solve for the unknown coefficients.
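The two vibration modes can be checked numerically. The following sketch, in plain Python, confirms that each pair $(\omega, v)$ satisfies $Av = -\omega^2v$ and that the real solution $x(t) = v\cos(\omega t)$ (the case $c_1 = c_2 = \tfrac{1}{2}$ for a single mode) satisfies $x'' = Ax$ up to finite-difference error. The step size h and the sample time t are arbitrary choices made for this illustration.

```python
import math

A = [[-2.0, 1.0], [1.0, -2.0]]
# (omega, eigenvector) pairs for beta = 2, alpha = 1
modes = [(math.sqrt(3.0), [1.0, -1.0]), (1.0, [1.0, 1.0])]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def eigen_residual(omega, v):
    """Largest component of A v + omega^2 v; zero for an exact eigenpair."""
    Av = matvec(A, v)
    return max(abs(Av[i] + omega ** 2 * v[i]) for i in range(len(v)))

def x(t, omega, v):
    """Real oscillatory solution x(t) = v cos(omega t)."""
    return [vi * math.cos(omega * t) for vi in v]

def ode_residual(t, omega, v, h=1e-4):
    """Largest component of x''(t) - A x(t), using a central difference."""
    xm, x0, xp = x(t - h, omega, v), x(t, omega, v), x(t + h, omega, v)
    second = [(xp[i] - 2 * x0[i] + xm[i]) / h ** 2 for i in range(len(v))]
    Ax = matvec(A, x0)
    return max(abs(second[i] - Ax[i]) for i in range(len(v)))

for omega, v in modes:
    print(omega, v, eigen_residual(omega, v), ode_residual(0.7, omega, v))
```

Both residuals should be small: the first at roundoff level, the second limited by the accuracy of the difference quotient.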
Summary 8.2
• An eigenvalue λ and eigenvector x satisfy the equation Ax = λx. The direct method to compute the eigenvalues is to find the roots of the characteristic equation p(λ) = det( A − λ I ) = 0. Then, for each eigenvalue λ, the eigenvectors can be found by solving the homogeneous system ( A − λ I )x = 0. There are software packages for finding the eigenvalue-eigenvector pairs using more-sophisticated methods.
• There are many useful properties for matrices that influence their eigenvalues. For example, the eigenvalues are real when A is symmetric or Hermitian. The eigenvalues are positive when A is symmetric or Hermitian positive definite.
• Many eigenvalue procedures involve similarity or unitary transformations to produce triangular or diagonal matrices.
• Gershgorin’s discs can be used to localize the eigenvalues by finding coarse estimates of them.
• The singular value decomposition of an m × n matrix A is
$$A = UDV^T$$
where D is an m × n diagonal matrix whose diagonal entries are the singular values, U is an m × m orthogonal matrix, and V is an n × n orthogonal matrix. The singular values of A are the nonnegative square roots of the eigenvalues of $A^TA$.
Exercises 8.2

ᵃ1. Are $[i, -1 + i]^T$ and $[-i, -1 - i]^T$ eigenvectors of the matrix in Example 2? Show why or why not.

2. Prove that if λ is an eigenvalue of a real matrix with eigenvector x, then $\bar{\lambda}$ is also an eigenvalue with eigenvector $\bar{x}$. (For a complex number z = x + iy, the conjugate is defined by $\bar{z} = x - iy$.)

ᵃ3. Let
$$A = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Account for the fact that the matrix A has the effect of rotating vectors counterclockwise through an angle θ and thus cannot map any vector into a multiple of itself.

4. Let A be an m × n matrix such that $A = UDV^T$, where U and V are orthogonal and D is diagonal and nonnegative. Prove that the diagonal elements of D are the singular values of A.

5. Let A, U, D, and V be as in the singular value decomposition: $A = UDV^T$. Let r be as described in the text. Define $U_r$ to consist of the first r columns of U. Let $V_r$ consist of the first r columns of V, and let $D_r$ be the r × r matrix having the same diagonal as D. Prove that $A = U_rD_rV_r^T$. (This factorization is called the economical version of the singular value decomposition.)

6. A linear map P is a projection if $P^2 = P$. We can use the same terminology for an n × n matrix: $A^2 = A$ is the projection property. Use the Pierce decomposition, I = A + (I − A), to show that every point in $\mathbb{R}^n$ is the sum of a vector in the range of A and a vector in the null space of A. What are the eigenvalues of a projection?

7. Find all of the Gershgorin discs for the following matrices. Indicate the smallest region(s) containing all of the eigenvalues:
ᵃa. $\begin{bmatrix} 3 & -1 & 1 \\ 2 & 4 & -2 \\ 3 & -1 & 9 \end{bmatrix}$  b. $\begin{bmatrix} 3 & 1 & 2 \\ -1 & 4 & -1 \\ 1 & -2 & 9 \end{bmatrix}$  ᵃc. $\begin{bmatrix} 1-i & 1 & i \\ 0 & 2i & 2 \\ 1 & 0 & 2 \end{bmatrix}$

8. (Multiple Choice) Let A be an n × n invertible (nonsingular) matrix. Let x be a nonzero vector. Suppose that Ax = λx. Which equation does not follow from these hypotheses?
a. $A^kx = \lambda^kx$
b. $\lambda^{-k}x = (A^{-1})^kx$ for $k \ge 0$
c. $p(A)x = p(\lambda)x$ for any polynomial p
d. $A^kx = (1 - \lambda)^kx$

ᵃ9. (Multiple Choice) For what values of s will the matrix $I - svv^*$ be unitary, where v is a column vector of unit length?
a. 0, 1  b. 0, 2  c. 1, 2  d. 0, $\sqrt{2}$  e. None of these.

10. (Multiple Choice) Let U and V be unitary n × n matrices, possibly complex. Which conclusion is not justified?
a. U + V is unitary.
b. $U^*$ is unitary.
c. UV is unitary.
d. $U - vv^*$ is unitary when $\|v\| = \sqrt{2}$ and v is a column vector.
e. None of these.

ᵃ11. (Multiple Choice) Which assertion is true?
a. Every n × n matrix has n distinct (different) eigenvalues.
b. The eigenvalues of a real matrix are real.
c. If U is a unitary matrix, then $U^* = U^T$.
d. A square matrix and its transpose have the same eigenvalues.
e. None of these.

12. (Multiple Choice) Consider the symmetric matrix
$$A = \begin{bmatrix} 1 & 3 & 4 & -1 \\ 3 & 7 & -6 & 1 \\ 4 & -6 & 3 & 0 \\ -1 & 1 & 0 & 5 \end{bmatrix}$$
What is the smallest interval derived from Gershgorin’s Theorem such that all eigenvalues of the matrix A lie in that interval?
a. [−7, 9]  b. [−7, 13]  c. [3, 7]  d. [−3, 17]  e. None of these.

ᵃ13. (True or False) Gershgorin’s Theorem asserts that every eigenvalue λ of an n × n matrix A must satisfy one of these inequalities:
$$|\lambda - a_{ii}| \le \sum_{j=1,\, j \ne i}^{n} |a_{ij}| \qquad \text{for } 1 \le i \le n.$$

14. (True or False) A consequence of Schur’s Theorem is that every square matrix A can be factored as $A = PTP^{-1}$, where P is a nonsingular matrix and T is upper triangular.

ᵃ15. (True or False) A consequence of Schur’s Theorem is that every (real) symmetric matrix A can be factored in the form $A = PDP^{-1}$, where P is unitary and D is diagonal.
16. Explain why $\|UB\|_2 = \|B\|_2$ for any matrix B when $U^TU = I$.

17. Consider the matrix
$$A = \begin{bmatrix} 4 & -\tfrac{1}{2} & 0 \\ \tfrac{3}{5} & 5 & -\tfrac{3}{5} \\ 0 & \tfrac{1}{2} & 3 \end{bmatrix}$$
Plot the Gershgorin discs in the complex plane for A and $A^T$ as well as indicate the locations of the eigenvalues.

18. (Continuation) Let B be the matrix obtained by changing the negative entries in A to positive numbers. Repeat the process for B.

19. (Continuation) Repeat for
$$C = \begin{bmatrix} 4 & 0 & -2 \\ 1 & 2 & 0 \\ 1 & 1 & 9 \end{bmatrix}$$

20. Find the Schur decomposition of
$$A = \begin{bmatrix} 5 & 7 \\ -2 & -4 \end{bmatrix}$$
Use MATLAB, Maple, Mathematica, or other computer programs available to compute the eigenvalues and eigenvectors of these matrices:
2.
3.
Use MATLAB to compute the eigenvalues of a random 100 × 100 matrix by direct use of the command eig and by use of the commands poly and roots. Use the timing functions to determine the CPU time for each.
Let p be the polynomial of degree 20 whose roots are the integers 1, 2, . . . , 20. Find the usual power form of this polynomial so that
p(t)=t20 +a19t19 +a18t18 +···+a0
Next, form the so-called companion matrix, which is 20 × 20 and has zeros in all positions except all 1’s on the superdiagonal and the coefficients −a0, −a1,…,−a19 as its bottom row. Find the eigenvalues of this matrix, and account for any difficulties encountered.
(Student Research Project) Investigate some modern methods for computing eigenvalues and eigenvectors. For the symmetric case, see the book by Parlett [1997]. Also, read the LAPACK User’s Guide. (See Anderson, et al. [1999].)
(Student Research Project) Experiment with the Cayley-Hamilton Theorem, which asserts that every square matrix satisfies its own characteristic equation. Check this numerically by using MATLAB or some other mathematical software system. Use matrices of size 3, 6, 9, 12, and 15, and account for any surprises. If you can use higher-precision arithmetic do so—MATLAB works with 15 digits of precision.
(Student Research Project) Experiment with the QR algorithm and the singular value decomposition of matrices—for example, using MATLAB. Try examples with four types of equations Ax = b—namely, (a) the
55
aa. A=1 7 2 −5
4−7
1 6 b. 5−5 9 −3 3 2
3 2 11 −1 −2 −4 1 6 5 −5
3 2
1 5
1
c. Let n = 12, $a_{ij} = i/j$ when $i \le j$, and $a_{ij} = j/i$ when i > j. Find the eigenvalues.
ad. Create an n × n matrix with a tridiagonal structure and nonzero elements (−1, 2, −1) in each row. For n = 5 and 20, find all of the eigenvalues, and verify that they are 2 − 2 cos( jπ/(n + 1)).
e. For any positive integer n, form the symmetric matrix A whose upper triangular part is given by
1 1
4.
5.
6.
n n−1 n−2 n−3 ··· 2
n−1 n−2 n−3 ··· 2
1 .
n−2 n−3 ··· 2
.. . .··· .
. 1
… 2 2
The eigenvalues of A are 1/{2 − 2 cos[(2i − 1)π/(2n + 1)]}. (See Frank [1958] and Gregory and Karney [1969].) Numerically verify this result for n = 30.
1 1
Computer Exercises 8.2
7.
Using mathematical software such as MATLAB, Maple, or Mathematica on each of the following matrices, compute the eigenvalues via the characteristic polynomial, compute the eigenvectors via the null space of the matrix, and compute the eigenvalues and eigenvectors directly:
a 3 2 a 1 3 −7 a.7−1 b.−341
2 −5 3
5411 a11. ConsiderA=4 5 1 1
Find the eigenvalues and accompanying eigenvectors of this matrix, from Gregory and Karney [1969], without using software.
Hint: The answers can be integers.
12. Findthesingularvaluedecompositionofthesematrices: 3
system has a unique solution; (b) the system has many solutions; (c) the system is inconsistent but has a unique least-squares solution; (d) the system is inconsistent and has many least-squares solutions.
Create the diagonal matrix D = U T A V to check the re- sults (always recommended). One can see the effects of roundoff errors in these calculations, for the off-diagonal elements in D are theoretically zero.
8. Using mathematical software such as MATLAB, Maple, or Mathematica, determine the execution time for computing all eigenvalues of a 1000 × 1000 matrix with random entries.
9. Using mathematical software such as MATLAB, Maple, or Mathematica, compute the Schur factorization of these complex matrices, and verify the results according to Schur’s Theorem and its corollaries:
a. $\begin{bmatrix} 3-i & 2-i \\ 2+i & 3+i \end{bmatrix}$  b. $\begin{bmatrix} 2+i & 3+i \\ 3-i & 2-i \end{bmatrix}$
a. 2 1 −2 b. 4 c.−5+3√3 5√3+3
1142 1124
d.17 1 −17 −1 10 10 10 10 39−3−9 5555
10. Using mathematical software such as MATLAB, Maple, 66
or Mathematica, compute the singular value decomposi- tion of these matrices, and verify that each result satisfies theequation A=UDVT:
22 2222
7 − 13√6 7 + 13√6
c. e.−7−13 6 −7+13 6
2−i2+i 26√26√
a.
8.3
Power Method
Mathematical Derivation
3−i 3+i 26√26√ −13 6 13 6
1 1 1 3 −2 0 1 b. 2 7 5 10 −2−34
5 −3 −2
−25
−149 −50 537 180 −27 −9
−154 546 .
13. Consider B =
Find the eigenvalues, singular values, and condition num-
ber of the matrix B.
Eigenvalues
A procedure called the power method can be employed to compute eigenvalues of a given matrix. It is an example of an iterative process that, under the right circumstances, produces a sequence converging to an eigenvalue of a given matrix.
Suppose that A is an n × n matrix, and that its eigenvalues (which we do not know) have the following property:
|λ1|>|λ2|≧|λ3|≧ ···≧|λn|
Notice the strict inequality in this hypothesis. Except for that, we are simply ordering the
eigenvalues according to decreasing absolute value. (This is only a matter of notation.) Each
eigenvalue has a nonzero eigenvector $u^{(i)}$ and
$$Au^{(i)} = \lambda_iu^{(i)} \qquad (i = 1, 2, \ldots, n) \tag{1}$$
Eigenvectors
We assume that there is a linearly independent set of n eigenvectors $\{u^{(1)}, u^{(2)}, \ldots, u^{(n)}\}$. It is necessarily a basis for $\mathbb{C}^n$.
Linear Combination
We want to compute the single eigenvalue of maximum modulus (the dominant eigenvalue) and an associated eigenvector. We select an arbitrary starting vector, $x^{(0)} \in \mathbb{C}^n$, and express it as a linear combination of the eigenvectors $u^{(1)}, u^{(2)}, \ldots, u^{(n)}$:
$$x^{(0)} = c_1u^{(1)} + c_2u^{(2)} + \cdots + c_nu^{(n)}$$
In this equation, we must assume that $c_1 \ne 0$. Since the coefficients can be absorbed into the vectors $u^{(i)}$, there is no loss of generality in assuming that
$$x^{(0)} = u^{(1)} + u^{(2)} + \cdots + u^{(n)} \tag{2}$$
Sequence of Vectors
Then we repeatedly carry out matrix-vector multiplication, using the matrix A to produce a sequence of vectors. Specifically, we have
$$\begin{aligned} x^{(1)} &= Ax^{(0)} \\ x^{(2)} &= Ax^{(1)} = A^2x^{(0)} \\ x^{(3)} &= Ax^{(2)} = A^3x^{(0)} \\ &\;\;\vdots \\ x^{(k)} &= Ax^{(k-1)} = A^kx^{(0)} \end{aligned}$$
In general, we have
$$x^{(k)} = A^kx^{(0)} \qquad (k = 1, 2, 3, \ldots)$$
Substituting $x^{(0)}$ in Equation (2), we obtain
$$\begin{aligned} x^{(k)} = A^kx^{(0)} &= A^ku^{(1)} + A^ku^{(2)} + A^ku^{(3)} + \cdots + A^ku^{(n)} \\ &= \lambda_1^ku^{(1)} + \lambda_2^ku^{(2)} + \lambda_3^ku^{(3)} + \cdots + \lambda_n^ku^{(n)} \end{aligned}$$
by using Equation (1). This can be written in the form
$$x^{(k)} = \lambda_1^k \left[ u^{(1)} + \left( \frac{\lambda_2}{\lambda_1} \right)^k u^{(2)} + \left( \frac{\lambda_3}{\lambda_1} \right)^k u^{(3)} + \cdots + \left( \frac{\lambda_n}{\lambda_1} \right)^k u^{(n)} \right]$$
Since $|\lambda_1| > |\lambda_j|$ for j > 1, we have $|\lambda_j/\lambda_1| < 1$ and $(\lambda_j/\lambda_1)^k \to 0$ as $k \to \infty$. To simplify the notation, we write the above equation in the form
$$x^{(k)} = \lambda_1^k \left[ u^{(1)} + \varepsilon^{(k)} \right] \tag{3}$$
where $\varepsilon^{(k)} \to 0$ as $k \to \infty$.
Linear Functional
We let φ be any complex-valued linear functional on $\mathbb{C}^n$ such that $\varphi(u^{(1)}) \ne 0$. Recall that φ is a linear functional if $\varphi(ax + by) = a\varphi(x) + b\varphi(y)$ for scalars a and b and vectors x and y. For example, $\varphi(x) = x_j$ for some fixed j ($1 \le j \le n$) is a linear functional. Now, looking back at Equation (3), we apply φ to it:
$$\varphi(x^{(k)}) = \lambda_1^k \left[ \varphi(u^{(1)}) + \varphi(\varepsilon^{(k)}) \right]$$
Ratio Converges to Dominant Eigenvalue
Next, we form ratios $r_1, r_2, \ldots$ as follows:
$$r_k \equiv \frac{\varphi(x^{(k+1)})}{\varphi(x^{(k)})} = \lambda_1 \left[ \frac{\varphi(u^{(1)}) + \varphi(\varepsilon^{(k+1)})}{\varphi(u^{(1)}) + \varphi(\varepsilon^{(k)})} \right] \to \lambda_1 \qquad \text{as } k \to \infty$$
Hence, we are able to compute the dominant eigenvalue $\lambda_1$ as the limit of the sequence $\{r_k\}$. With a little more care, we can get an accompanying eigenvector. In the definition of the vectors $x^{(k)}$ in Equation (2), we see nothing to prevent the vectors from growing or converging to zero. Normalization cures this problem, as in one of the following pseudocodes.
Power Method Pseudocode
Here we present pseudocode for calculating the dominant eigenvalue and an associated eigenvector for a prescribed matrix A. In each algorithm, φ is a linear functional chosen by the user. For example, one can use φ(x) = x1 (the first component of the vector).
    integer k, kmax, n; real r
    real array (A)1:n×1:n, (x)1:n, (y)1:n
    external function φ
    output 0, x
    for k = 1 to kmax
        y ← Ax
        r ← φ(y)/φ(x)
        x ← y
        output k, x, r
    end for
Power Method Pseudocode
We use a simple 2 × 2 matrix such as
$$A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}$$
to give a geometric illustration of the power method as shown in Figure 8.3. Clearly, the eigenvalues are λ1 = 2 and λ2 = 4 with eigenvectors v(1) = [−1, 1]T and v(2) = [1, 1]T, respectively. Starting with x(0) = [0, 1]T, the power method repeatedly multiplies the matrix A by a vector. It produces a sequence of vectors x(1), x(2), and so on that move in the direction of the eigenvector v(2), which corresponds to the dominant eigenvalue λ2 = 4.
FIGURE 8.3 In 2D, power method illustration (vectors shown: v(1), v(2), x(0), x(1), x(2))
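The unnormalized power method can be tried on the 2 × 2 matrix above. The following is a minimal Python sketch, not the authors' code: the helper names `mat_vec` and `power_method` are hypothetical, and the functional φ(x) = x2 is an assumption made here because φ(x) = x1 would vanish at the starting vector [0, 1]T.

```python
def mat_vec(A, x):
    """Return the product A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def power_method(A, x0, phi, kmax):
    """Unnormalized power method: y = A x, r = phi(y)/phi(x), x = y."""
    x = list(x0)
    ratios = []
    for _ in range(kmax):
        y = mat_vec(A, x)
        ratios.append(phi(y) / phi(x))
        x = y
    return ratios, x


A = [[3.0, 1.0], [1.0, 3.0]]
phi = lambda v: v[1]          # assumed functional: second component
ratios, x = power_method(A, [0.0, 1.0], phi, 40)

print(ratios[0], ratios[-1])  # first ratio is 3.0; later ratios approach 4
print(x[0] / x[1])            # approaches 1, the direction of v(2) = [1, 1]
```

The entries of x grow like 4^k here, which is why the normalized variant is preferred in practice.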
We can easily modify this algorithm to produce normalized eigenvectors by using the infinity vector norm $\|x\|_\infty = \max_{1 \le j \le n} |x_j|$, as in the following code:
    integer k, kmax, n; real r
    real array (A)1:n×1:n, (x)1:n, (y)1:n
    external function φ
    output 0, x
    for k = 1 to kmax
        y ← Ax
        r ← φ(y)/φ(x)
        x ← y/||y||∞
        output k, x, r
    end for
Normalized Power Method Pseudocode
Aitken Acceleration
From a given sequence {rk}, we can construct another sequence {sk} by means of the Aitken acceleration formula
$$s_k = r_k - \frac{(r_k - r_{k-1})^2}{r_k - 2r_{k-1} + r_{k-2}} \qquad (k \ge 3)$$
If the original sequence {rk} converges to r and if certain other conditions are satisfied, then the new sequence {sk} converges to r more rapidly than the original one. (For details, see Kincaid and Cheney [2002].) Because subtractive cancellation may eventually spoil the results, the Aitken acceleration process should be stopped soon after the values become apparently stationary.
EXAMPLE 1
Use the modified power method algorithm and Aitken acceleration to find the dominant eigenvalue and an eigenvector of the given matrix A, with vector x(0) and φ(x) given as follows:
$$A = \begin{bmatrix} 6 & 5 & -5 \\ 2 & 6 & -2 \\ 2 & 5 & -1 \end{bmatrix}, \qquad x^{(0)} = \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}, \qquad \varphi(x) = x_2$$
Solution
After coding and running the modified power method algorithm with Aitken acceleration, we obtain the following results:
Computer Results: Modified Power Method Algorithm with Normalization
    x(0)  = [−1.0000,  1.0000,  1.0000]T
    x(1)  = [−1.0000,  0.3333,  0.3333]T    r0  =  2.0000
    x(2)  = [−1.0000, −0.1111, −0.1111]T    r1  = −2.0000
    x(3)  = [−1.0000, −0.4074, −0.4074]T    r2  = 22.0000
    x(4)  = [−1.0000, −0.6049, −0.6049]T    r3  =  8.9091    s3  = 13.5294
    x(5)  = [−1.0000, −0.7366, −0.7366]T    r4  =  7.3061    s4  =  7.0825
    x(6)  = [−1.0000, −0.8244, −0.8244]T    r5  =  6.7151    s5  =  6.3699
      ⋮
    x(14) = [−1.0000, −0.9931, −0.9931]T    r13 =  6.0208    s13 =  6.0005
The Aitken-accelerated sequence, sk, converges noticeably faster than the sequence {rk}. The actual dominant eigenvalue and an associated eigenvector are
$$\lambda_1 = 6, \qquad u^{(1)} = [1, 1, 1]^T$$
The coding of the modified power method is very simple, and we leave the actual implementation as an exercise. We also use the simple infinity-norm for normalizing the vectors. The final vectors and estimates of the eigenvalue are displayed with 15 decimal digits. ■
In such a problem, one should always seek an independent verification of the purported answer. Here, we simply compute Ax to see whether it coincides with s14x. The last few commands in the code are doing this rough checking, taking s14 as probably the best estimate of the eigenvalue and the last x-vector as the best estimate of an eigenvector. The results after 14 steps are not very accurate. For better accuracy, take 80 steps!
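One possible implementation of the normalized power method with Aitken acceleration, applied to the data of Example 1, is sketched below in Python. It is an illustration rather than the authors' program; the function names are hypothetical, and the list `ratios` is indexed so that `ratios[k]` corresponds to rk in the table.

```python
def mat_vec(A, x):
    """Return the product A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def normalized_power_method(A, x0, phi, kmax):
    """Power method with infinity-norm normalization of each iterate."""
    x = list(x0)
    ratios = []
    for _ in range(kmax):
        y = mat_vec(A, x)
        ratios.append(phi(y) / phi(x))
        norm = max(abs(c) for c in y)
        x = [c / norm for c in y]
    return ratios, x


def aitken(r):
    """Return {k: s_k} for k >= 3 using the Aitken acceleration formula."""
    s = {}
    for k in range(3, len(r)):
        denom = r[k] - 2.0 * r[k - 1] + r[k - 2]
        if denom != 0.0:  # skip the step if the second difference vanishes
            s[k] = r[k] - (r[k] - r[k - 1]) ** 2 / denom
    return s


A = [[6.0, 5.0, -5.0], [2.0, 6.0, -2.0], [2.0, 5.0, -1.0]]
phi = lambda v: v[1]  # phi(x) = x2, as in Example 1
ratios, x = normalized_power_method(A, [-1.0, 1.0, 1.0], phi, 14)
s = aitken(ratios)

for k in (3, 4, 5, 13):
    print(k, round(ratios[k], 4), round(s[k], 4))
print([round(c, 4) for c in x])  # x(14)
```

Running more steps brings rk closer to 6, but the Aitken values should be watched for the loss of accuracy from subtractive cancellation that the text warns about.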
Inverse Power Method
It is possible to compute other eigenvalues of a matrix by using modifications of the power method. For example, if A is invertible, we can compute its eigenvalue of smallest magnitude by noting this logical equivalence:
$$Ax = \lambda x \iff x = A^{-1}(\lambda x) \iff A^{-1}x = \frac{1}{\lambda}x$$
Thus, the smallest eigenvalue of A in magnitude is the reciprocal of the largest eigenvalue of A−1. We compute it by applying the power method to A−1 and taking the reciprocal of the result.
Suppose that there is a single smallest eigenvalue of A, which is λn with our usual ordering:
$$|\lambda_1| \ge |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_{n-1}| > |\lambda_n| > 0$$
It follows that A is invertible. (Why?) The eigenvalues of A−1 are $\lambda_j^{-1}$ for $1 \le j \le n$. Therefore, we have
$$|\lambda_n^{-1}| > |\lambda_{n-1}^{-1}| \ge \cdots \ge |\lambda_1^{-1}| > 0$$
We can use the power method on the matrix A−1 to compute its dominant eigenvalue $\lambda_n^{-1}$. The reciprocal of this is the eigenvalue of A that we sought. Notice that we need not compute A−1 because the equation
$$x^{(k+1)} = A^{-1}x^{(k)}$$
is equivalent to the equation
$$Ax^{(k+1)} = x^{(k)}$$
so the vector x(k+1) can be more easily computed by solving this last linear system. To do this, we first find the LU factorization of A, namely, A = LU. Then we repeatedly update the right-hand side and back solve:
$$Ux^{(k+1)} = L^{-1}x^{(k)}$$
to obtain x(1), x(2), . . . .
EXAMPLE 2
Compute the smallest eigenvalue and an associated eigenvector of this matrix
$$A = \frac{1}{3}\begin{bmatrix} -154 & 528 & 407 \\ 55 & -144 & -121 \\ -132 & 396 & 318 \end{bmatrix}$$
using the following initial vector and linear functional:
$$x^{(0)} = [1, 2, 3]^T, \qquad \varphi(x) = x_2$$
Solution
We decide to take the easy route and use the inverse of A for producing the successive x vectors. We leave the actual implementation as an exercise. The ratios rk are saved, and once they are complete, the Aitken accelerated values, sk, are computed. Notice that at the end, we want the reciprocal of the limiting ratio. Hence, it is easier to use reciprocals at every step in the code. Thus, you see rk = x2/y2 rather than y2/x2, and these ratios should converge to the smallest eigenvalue of A. The final results after 80 steps are these:
    x = [0.26726101285547, −0.53452256017715, 0.80178375118802]T
    s80 = 3.33333333343344
We can divide each entry in x by the first component and arrive at
    x = [1.0, −2.00000199979120, 3.00000266638827]T
The eigenvalue is actually 10/3, and the eigenvector is [1, −2, 3]T. The discrepancy between Ax and s80x is about 2.6 × 10−6. ■
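The inverse power method can be sketched without forming A−1 by solving Ay = x at each step. The Python sketch below is an illustration under stated assumptions: it uses the small matrix with rows [3, 1] and [1, 3] (eigenvalues 2 and 4) rather than the matrix of Example 2, the functional φ(x) = x2, and a hypothetical helper `solve` that reruns Gaussian elimination at every step for brevity, where a production code would factor A = LU once and reuse the factors.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            m = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= m * M[col][j]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (M[i][n] - s) / M[i][i]
    return y


def inverse_power_method(A, x0, phi, kmax):
    """Return (r, x), where r estimates the eigenvalue of smallest magnitude."""
    x = list(x0)
    r = None
    for _ in range(kmax):
        y = solve(A, x)        # y = A^{-1} x, computed without A^{-1}
        r = phi(x) / phi(y)    # reciprocal ratio, as in Example 2
        norm = max(abs(c) for c in y)
        x = [c / norm for c in y]
    return r, x


A = [[3.0, 1.0], [1.0, 3.0]]
r, x = inverse_power_method(A, [0.0, 1.0], lambda v: v[1], 60)
print(r, x)  # r approaches 2, x approaches a multiple of [-1, 1]
```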
Example: Inverse Power Method
Using mathematical software on a small example,
$$A = \begin{bmatrix} 6 & 5 & -5 \\ 2 & 6 & 2 \\ 2 & 5 & -1 \end{bmatrix} \qquad (4)$$
we can first get A−1 and then use the power method. (We have changed one entry in the matrix A from Example 1 to solve a different problem.) We leave the implementation of the code as an exercise. In the code, r is the reciprocal of the quantity r in the original power method. Thus, at the end of the computation, r should be the eigenvalue of A that has the smallest absolute value. After the prescribed 30 steps, we find that r = 0.214 and x = [0.7916, 0.5137, 0.3308]T. As usual, we can verify the result independently by computing Ax and rx, which should be equal. The method just illustrated is called the inverse power method. On larger examples, the successive vectors should be computed not via A−1 but rather by solving the equation Ay = x for y. In mathematical software systems such as MATLAB, Maple, and Mathematica, this can be done with a single command. Alternatively, one can get the LU factorization of A and solve Lz = x and Uy = z.
In this example, two eigenvalues are complex. Since the matrix is real, they must be conjugate pairs of the form α + βi and α − βi. They have the same magnitude; thus, the hypothesis |λ1| > |λ2| needed in the convergence proof of the power method is violated. What happens when the power method is applied to A? The values of r for k = 26 to 30 are 0.76, −53.27, 8.86, 2.69, and −9.42. We leave the implementation of the code as a computer exercise.
Shifted (Inverse) Power Method
Other eigenvalues of a matrix (besides the largest and smallest) can be computed by exploiting the following logical equivalences:
$$Ax = \lambda x \iff (A - \mu I)x = (\lambda - \mu)x \iff (A - \mu I)^{-1}x = \frac{1}{\lambda - \mu}\,x$$
If we want to compute an eigenvalue of A that is close to a given number μ, we can apply the inverse power method to A − μ I and take the reciprocal of the limiting value of r . This should be λ − μ.
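The idea in the preceding paragraph can be sketched in Python as follows. This is an illustration, not the authors' code: the test matrix (tridiagonal with 2 on the diagonal and −1 beside it, whose eigenvalues are 2 − √2, 2, and 2 + √2), the shift μ = 3.5, the functional φ(x) = x1, and the helper names are all assumptions chosen so that the answer can be checked by hand.

```python
import math


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            m = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= m * M[col][j]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (M[i][n] - s) / M[i][i]
    return y


def shifted_inverse_power(A, mu, x0, phi, kmax):
    """Estimate the eigenvalue of A closest to mu and an eigenvector."""
    n = len(A)
    shifted = [[A[i][j] - (mu if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    x = list(x0)
    r = None
    for _ in range(kmax):
        y = solve(shifted, x)   # y = (A - mu I)^{-1} x
        r = phi(y) / phi(x)     # tends to 1/(lambda - mu)
        norm = max(abs(c) for c in y)
        x = [c / norm for c in y]
    return mu + 1.0 / r, x


A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, x = shifted_inverse_power(A, 3.5, [1.0, 1.0, 1.0], lambda v: v[0], 30)
print(lam, 2.0 + math.sqrt(2.0))  # the two values should agree closely
```

The closer μ lies to the wanted eigenvalue relative to the others, the faster the ratios settle down.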
We can also compute an eigenvalue of A that is farthest from a given number μ. Suppose that for some eigenvalue λj of matrix A, we have
$$|\lambda_j - \mu| > \varepsilon \quad \text{and} \quad 0 < |\lambda_i - \mu| < \varepsilon \quad \text{for all } i \ne j$$
Consider the shifted matrix A − μI. Applying the power method to the shifted matrix A − μI, we compute ratios rk that converge to λj − μ. This procedure is called the shifted power method.
If we want to compute the eigenvalue of A that is closest to a given number μ, a variant of the procedure is needed. Suppose that λj is an eigenvalue of A such that
$$0 < |\lambda_j - \mu| < \varepsilon \quad \text{and} \quad |\lambda_i - \mu| > \varepsilon \quad \text{for all } i \ne j$$
Consider the shifted matrix A − μI. The eigenvalues of this matrix are λi − μ. Applying the inverse power method to A − μI gives an approximate value for (λj − μ)−1. We can use the explicit inverse of A − μI or the LU factorization A − μI = LU. Now we repeatedly solve the equations
$$(A - \mu I)x^{(k+1)} = x^{(k)}$$
by solving instead $Ux^{(k+1)} = L^{-1}x^{(k)}$. Since the ratios rk converge to (λj − μ)−1, we have
$$\lambda_j = \mu + \left[\lim_{k \to \infty} r_k\right]^{-1} = \mu + \lim_{k \to \infty} \frac{1}{r_k}$$
This algorithm is called the shifted inverse power method.
Example: Shifted Inverse Power Method
To illustrate the shifted inverse power method, we consider this matrix
$$A = \begin{bmatrix} 1 & 3 & 7 \\ 2 & -4 & 5 \\ 3 & 4 & -6 \end{bmatrix} \qquad (5)$$
and we write mathematical software to compute the eigenvalue closest to −6. The code we use takes ratios of y2/x2, and we are therefore expecting convergence of these ratios to λ + 6. After eight steps, we have r = 0.9590 and x = [−0.7081, 0.6145, 0.3478]T. Hence, the eigenvalue should be λ = 0.9590 − 6 = −5.0410. We can ask MATLAB to confirm the eigenvalue and eigenvector by computing both Ax and λx to be approximately [3.57, −3.10, −1.75]T.
Summary 8.3
• We have considered the following methods for computing eigenvalues of a matrix. In the power method, we approximate the largest eigenvalue λ1 by generating a sequence of points using the formula
$$x^{(k+1)} = Ax^{(k)}$$
and then forming a sequence rk = φ(x(k+1))/φ(x(k)), where φ is a linear functional. Under the right circumstances, this sequence, rk, will converge to the largest eigenvalue of A.
• In the inverse power method, we find the smallest eigenvalue λn by using the preceding process on the inverse of the matrix. The reciprocal of the largest eigenvalue of A−1 is the smallest eigenvalue of A. We can also describe this process as one of computing the sequence so that
$$Ax^{(k+1)} = x^{(k)}$$
• In the shifted power method, we find the eigenvalue that is farthest from a given number μ by seeking the largest eigenvalue of A − μI. This involves an iteration to produce a sequence
$$x^{(k+1)} = (A - \mu I)x^{(k)}$$
• In the shifted inverse power method, we find the eigenvalue that is closest to μ by applying the inverse power method to A − μI. This requires solving the equation
$$(A - \mu I)x^{(k+1)} = x^{(k)} \qquad (A - \mu I = LU)$$
Exercises 8.3
a1. (Multiple Choice) Let $A = \begin{bmatrix} 5 & 2 \\ 4 & 7 \end{bmatrix}$. The power method has been applied to the matrix A. The result is a long list of vectors that seem to settle down to a vector of the form [h, 1]T, where |h| < 1. What is the largest eigenvalue, approximately, in terms of that number h?
    a. 4h + 7
    b. 5h + 2
    c. 1/h
    d. 5h + 4
    e. None of these.
a2. (Multiple Choice) What is the expected final output of the following pseudocode? Here y1 and x1 are the first components of y and x, respectively.
    integer n, kmax; real r
    real array (A−1)1:n×1:n, (x)1:n, (y)1:n
    for k = 1 to 30
        y ← A−1x
        r ← y1/x1
        x ← y/||y||
        output r, x
    end for
    a. r is the eigenvalue of A largest in magnitude, and x is an accompanying eigenvector.
    b. r = 1/λ, where λ is the smallest eigenvalue of A, and x is such that Ax = λx.
    c. A vector x such that Ax = rx, where r is the eigenvalue of A having the smallest magnitude.
    d. r is the largest (in magnitude) eigenvalue of A and x is a corresponding eigenvector of A.
    e. None of these.
3. Briefly describe how to compute the following:
    a. The dominant eigenvalue and associated eigenvector.
    b. The next dominant eigenvalue and associated eigenvector.
    c. The least dominant eigenvalue and associated eigenvector.
    d. An eigenvalue other than the dominant or least dominant eigenvalue and associated eigenvectors.
4. Let $A = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}$. Carry out several iterations of the power method, starting with x(0) = (1, 1, 1). What is the purpose of this procedure?
5. Let $B = A - 4I = \begin{bmatrix} -2 & -1 & 0 \\ -1 & -2 & -1 \\ 0 & -1 & -2 \end{bmatrix}$. Carry out some iterations of the power method applied to B, starting with x(0) = (1, 1, 1). What is the purpose of this procedure?
6. Let $C = A^{-1} = \frac{1}{4}\begin{bmatrix} 3 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}$. Carry out a few iterations of the power method applied to C, starting with x(0) = (1, 1, 1). What is the purpose of this procedure?
7. The Rayleigh quotient is the expression $\langle x, x\rangle_A / \langle x, x\rangle = x^T A x / x^T x$. How can the Rayleigh quotient be used when Ax = λx?
1. Use the power method, the inverse power method, and their shifted forms as well as Aitken’s acceleration to find some or all of the eigenvalues of the following matrices:
7.
−4 14 0 Let A = −5 13 0 .
−1 0 2
Code and apply each of the following:
1
1 b.
Maple, or Mathematica.
3. Modify and test the pseudocode for the power method to normalize the vector so that the largest component is always 1 in the infinity-norm. This procedure gives the eigenvector and eigenvalue without having to compute a linear functional.
−57 192 148 a4. LetA= 20 −53 −44 −48 144 115
Find the eigenvalues that are close to −4, 2, and 8 by using the inverse power method.
5. Using mathematical software such as MATLAB, Maple, or Mathematica, write and execute code for implementing the methods in Section 8.3. Verify that the results are consistent with those described in the text.
a. Example 1 using the modified power method.
b. Example 2 using the inverse power method with
Aitken acceleration.
c. Matrix (4) using the inverse power method.
5 4 1 a.4 5 1 1 1 4 1 1 2
234 7 −1 3
1 −1 5
0
−2
The modified power algorithm starting with x(0) = [1, 1, 1]T as well as the Aitken’s acceleration process.
The inverse power algorithm.
The shifted power algorithm.
The shifted inverse power algorithm.
−2
1 0 0 −2 1 0 c.0 1−2 1 0 01 −2 0 00 1
0 0 1
2 4
a.
b. c. d.
(Continuation) Let B =
Repeat the previous exercise starting with x(0) =
1
2. Redo the examples in this section, using either MATLAB,
d. Matrix (5) using the shifted power method.
111 113
8.
9.
10.
4 −1 1 −1 3 −2 . 1 −2 3
18 −5 −7
2 (0)T
[1,0,0]T. (Continuation) Let C =
−8 −5 8 6 3 −8 .
−3 1 9
Use x(0) = [1, 1, 1]T. Repeat the previous exercise starting with x(0) = [1, 0, 0]T.
By means of the power method, find an eigenvalue and associated eigenvector of these matrices from the historical books by Fox [1957] and Wilkinson [1965]. Use the given starting values and carry out the procedure with and without normalization. Verify your results by using mathematical software such as MATLAB, Maple, or Mathematica.
a. 0.9901 0.002, x(0) = [1,0.9]T −0.0001 0.9904
8−1−5 (0) T b. −4 4 −2 , x =[1,0.8,1]
6. Considerthematrix A=1 1 1 c. 1 −2 1 , x =[1,1,1] 114 313
242
a. Use the normalized power method starting with
x(0) = [1, 1, 1]T , and find the dominant eigenvalue and eigenvector of the matrix A.
b. Repeat, starting with the initial value x(0) = [−0.64966116, 0.74822116, 0]T . Explain the results. See Ralston [1965, p. 475–476].
−2 −1 4 (0)
d. 2 1 −2 , x =[3,1,2]
−1 −1 3
T
11.
Find all of the eigenvalues and associated eigenvectors of these matrices from Fox [1957] and Wilkinson [1965] by means of the power method and variations of it.
Computer Exercises 8.3
b.
8.4
0.4812 −0.0024
0.0023 0.4810
d. −1 3 −2 −2 −2 5
0.987 0.400 −0.487 e. −0.079 0.500 −0.479 0.082 0.400 0.418
110 c. −1+10−8 3 0 21 011
a. 4 2 5 −1 −2
Verify your results by using mathematical software such as MATLAB, Maple, or Mathematica.
Iterative Solutions of Linear Systems
In this section, we explore a completely different strategy for solving a nonsingular linear system
$$Ax = b \qquad (1)$$
This alternative approach is often used on enormous problems that arise in solving partial differential equations numerically. In that subject, systems having hundreds of thousands of equations arise routinely. (See Section 12.3.)
Vector and Matrix Norms
We first present a brief overview of vector and matrix norms because they are useful in the discussion of errors and in the stopping criteria for iterative methods. Norms can be defined on any vector space, but we usually use Rn or Cn. A vector norm ||x|| can be thought of as the length or magnitude of a vector x ∈ Rn. A vector norm is any mapping from Rn to R that obeys these three properties:
$$\|x\| > 0 \ \text{ if } x \ne 0$$
$$\|\alpha x\| = |\alpha| \, \|x\|$$
$$\|x + y\| \le \|x\| + \|y\| \qquad \text{(triangle inequality)}$$
for vectors x, y ∈ Rn and scalars α ∈ R. Examples of vector norms for the vector x = (x1, x2, …, xn)T ∈ Rn are
$$\|x\|_1 = \sum_{i=1}^{n} |x_i| \qquad \ell_1\text{-vector norm}$$
$$\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \qquad \text{Euclidean}/\ell_2\text{-vector norm}$$
$$\|x\|_\infty = \max_{1 \le i \le n} |x_i| \qquad \ell_\infty\text{-vector norm}$$
For n × n matrices, we can also have matrix norms, subject to the same requirements:
$$\|A\| > 0 \ \text{ if } A \ne 0$$
$$\|\alpha A\| = |\alpha| \, \|A\|$$
$$\|A + B\| \le \|A\| + \|B\| \qquad \text{(triangle inequality)}$$
for matrices A, B, and scalars α.
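The three vector norms listed above are simple to compute. The following short Python sketch uses hypothetical function names and two arbitrary sample vectors, and it also spot-checks the triangle inequality for each norm.

```python
import math


def norm_1(x):
    """l1-vector norm: sum of absolute values."""
    return sum(abs(xi) for xi in x)


def norm_2(x):
    """Euclidean (l2) vector norm: square root of the sum of squares."""
    return math.sqrt(sum(xi * xi for xi in x))


def norm_inf(x):
    """l-infinity vector norm: largest absolute component."""
    return max(abs(xi) for xi in x)


x = [3.0, -4.0, 0.0]
y = [1.0, 2.0, -2.0]
x_plus_y = [a + b for a, b in zip(x, y)]

print(norm_1(x), norm_2(x), norm_inf(x))  # 7.0 5.0 4.0
for norm in (norm_1, norm_2, norm_inf):
    # triangle inequality: ||x + y|| <= ||x|| + ||y||
    print(norm(x_plus_y) <= norm(x) + norm(y))
```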
We usually prefer matrix norms that are related to a vector norm. For a vector norm || · ||, the subordinate matrix norm is defined by
||A|| ≡ sup{||Ax|| : x ∈ Rn and ||x|| = 1}
Here, A is an n × n matrix. For a subordinate matrix norm, some additional properties are
||I|| = 1
||Ax|| ≦ ||A||||x||
||AB|| ≦ ||A||||B||
There are two meanings associated with the notation $\|\cdot\|_p$, one for vectors and another for matrices. The context determines which one is intended. Examples of subordinate matrix norms for an n × n matrix A are
$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| \qquad \ell_1\text{-matrix norm}$$
$$\|A\|_2 = \max_{1 \le i \le n} \sigma_i = \sigma_{\max} \qquad \text{spectral}/\ell_2\text{-matrix norm}$$
$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \qquad \ell_\infty\text{-matrix norm}$$
Here, $\sigma_i$ are the singular values of A, that is, the nonnegative square roots of the eigenvalues of $A^T A$, and $\sigma_{\max}$ is the largest of them. When A is symmetric, $\sigma_{\max}$ coincides with the spectral radius of A, the largest eigenvalue of A in absolute value. (See Sections 7.2 and 9.3 for a discussion of singular values.)
Condition Number and Ill-Conditioning
An important quantity that has some influence in the numerical solution of a linear system Ax = b is the condition number, which is defined as
$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2$$
It turns out that it is not necessary to compute the inverse of A to obtain an estimate of the condition number. Also, it can be shown that the condition number κ(A) gauges the transfer of error from the matrix A and the vector b to the solution x.
■ Rule of Thumb
If κ(A) = 10^k, then one can expect to lose at least k digits of precision in solving the system Ax = b.
If the linear system is sensitive to perturbations in the elements of A, or to perturbations of the components of b, then this fact is reflected in A having a large condition number. In such a case, the matrix A is said to be ill-conditioned. Briefly, the larger the condition number, the more ill-conditioned the system.
Suppose we want to solve an invertible linear system of equations
$$Ax = b$$
for a given coefficient matrix A and right-hand side b, but there may have been perturbations of the data owing to uncertainty in the measurements and roundoff errors in the calculations. Suppose that the right-hand side is perturbed by an amount assigned the symbol δb and the corresponding solution is perturbed an amount denoted by the symbol δx. Then we have
$$A(x + \delta x) = Ax + A\,\delta x = b + \delta b$$
where
$$A\,\delta x = \delta b$$
From the original linear system Ax = b and norms, we have
$$\|b\| = \|Ax\| \le \|A\| \, \|x\|$$
which gives us
$$\frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|}$$
From the perturbed linear system Aδx = δb, we obtain δx = A−1δb and
$$\|\delta x\| \le \|A^{-1}\| \, \|\delta b\|$$
Combining the two inequalities above, we obtain
$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \, \frac{\|\delta b\|}{\|b\|}$$
which contains the condition number of the original matrix A.
As an example of an ill-conditioned matrix consider the 3 × 3 Hilbert matrix
$$H_3 = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\ \frac{1}{3} & \frac{1}{4} & \frac{1}{5} \end{bmatrix}$$
We can use the MATLAB commands to generate the matrix and then to compute both the condition number using the 2-norm and the determinant of the matrix. We find the condition number is 524.0568 and the determinant is 4.6296 × 10−4. In solving linear systems, the condition number of the coefficient matrix measures the sensitivity of the system to errors in the data. When the condition number is large, the computed solution of the system may be dangerously in error! Further checks should be made before accepting the solution as being accurate. Values of the condition number near 1 indicate a well-conditioned matrix whereas large values indicate an ill-conditioned matrix. Using the determinant to check for singularity is appropriate only for matrices of modest size. Using mathematical software, one can compute the condition number to check for singular or near-singular matrices.
A goal in the study of numerical methods is to acquire an awareness of whether a numerical result can be trusted or whether it may be suspect (and therefore in need of further analysis). The condition number provides some evidence regarding this question. With the advent of sophisticated mathematical software systems, an estimate of the condition number is often returned, along with an approximate solution so that one can judge the trustworthiness of the results. In fact, some solution procedures involve advanced features that depend on an estimated condition number and may switch solution techniques based on it. For example, this criterion may result in a switch of the solution technique from a variant of Gaussian elimination to a least-squares solution for an ill-conditioned system. Unsuspecting users may not realize that this has happened unless they look at all of the results, including the estimate of the condition number. (Condition numbers can also be associated with other numerical problems, such as locating roots of equations.)
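The condition number and determinant quoted for the 3 × 3 Hilbert matrix can be reproduced without MATLAB. The Python sketch below is one possible approach, not the method used in the text: it relies on the fact that for a symmetric positive definite matrix the 2-norm condition number equals the ratio of the largest to the smallest eigenvalue, and it estimates those two eigenvalues with the power method and the inverse power method of Section 8.3, finishing with a Rayleigh quotient. All helper names are hypothetical.

```python
def mat_vec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            m = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= m * M[col][j]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (M[i][n] - s) / M[i][i]
    return y


def normalize(v):
    s = max(abs(c) for c in v)
    return [c / s for c in v]


def cond2_spd(A, kmax=100):
    """2-norm condition number of a symmetric positive definite matrix."""
    n = len(A)
    x = [1.0] * n
    for _ in range(kmax):                 # power method: largest eigenvalue
        x = normalize(mat_vec(A, x))
    lam_max = dot(x, mat_vec(A, x)) / dot(x, x)
    x = [1.0] * n
    for _ in range(kmax):                 # inverse power method: smallest
        x = normalize(solve(A, x))
    lam_min = dot(x, mat_vec(A, x)) / dot(x, x)
    return lam_max / lam_min


def det3(A):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


H3 = [[1.0 / (i + j + 1) for j in range(3)] for i in range(3)]
kappa = cond2_spd(H3)
det_H3 = det3(H3)
print(kappa, det_H3)  # approximately 524.0568 and 4.6296e-4
```

By the rule of thumb, a condition number near 5 × 10^2 suggests that two to three digits may be lost when solving a system with this coefficient matrix.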
Basic Iterative Methods
The iterative-method strategy produces a sequence of approximate solution vectors x(0), x(1), x(2), . . . for system Ax = b. The numerical procedure is designed so that, in principle, the sequence of vectors converges to the actual solution. The process can be stopped when sufficient precision has been attained. This stands in contrast to the Gaussian elimination algorithm, which has no provision for stopping midway and offering up an approximate solution. A general iterative algorithm for solving System (1) goes as follows: Select a nonsingular matrix Q, and having chosen an arbitrary starting vector x(0), generate vectors x(1), x(2), . . . recursively from the equation
$$Qx^{(k)} = (Q - A)x^{(k-1)} + b \qquad (k = 1, 2, \ldots) \qquad (2)$$
To see that this is sensible, suppose that the sequence x(k) does converge, to a vector x∗, say. Then by taking the limit as k → ∞ in System (2), we get
$$Qx^{*} = (Q - A)x^{*} + b$$
This leads to Ax∗ = b. Thus, if the sequence converges, its limit is a solution to the original System (1). For example, the Richardson iteration uses Q = I.
An outline of the pseudocode for carrying out the general iterative procedure (2) follows:
    integer k, kmax
    real array (x(0))1:n, (b)1:n, (c)1:n, (x)1:n, (y)1:n, (A)1:n×1:n, (Q)1:n×1:n
    x ← x(0)
    for k = 1 to kmax
        y ← x
        c ← (Q − A)x + b
        solve Qx = c
        output k, x
        if ||x − y|| < ε then
            output “convergence”
            stop
        end if
    end for
    output “maximum iteration reached”
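The general iterative procedure (2) translates almost line for line into Python. The sketch below is an illustration with assumptions that are not in the pseudocode: Q is passed as a full matrix and each system Qx = c is solved by a small, hypothetical Gaussian elimination helper, the infinity norm is used in the stopping test, and the sample system and the choice Q = diagonal of A are picked only for demonstration.

```python
def solve(Q, c):
    """Solve Q x = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [list(row) + [ci] for row, ci in zip(Q, c)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            m = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= m * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def general_iteration(A, b, Q, x0, kmax, eps):
    """Iterate Q x_new = (Q - A) x_old + b; return (x, steps, converged)."""
    n = len(b)
    x = list(x0)
    for k in range(1, kmax + 1):
        y = x
        c = [sum((Q[i][j] - A[i][j]) * y[j] for j in range(n)) + b[i]
             for i in range(n)]
        x = solve(Q, c)
        if max(abs(xi - yi) for xi, yi in zip(x, y)) < eps:
            return x, k, True
    return x, kmax, False          # maximum iteration reached


A = [[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 8.0, -5.0]
Q = [[A[i][j] if i == j else 0.0 for j in range(3)] for i in range(3)]
x, steps, converged = general_iteration(A, b, Q, [0.0, 0.0, 0.0], 200, 1e-8)
print(x, steps, converged)  # x is close to [2, 3, -1]
```

In practice Q is chosen so that Qx = c can be solved far more cheaply than by general elimination, for example by a single division per component when Q is diagonal.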
In choosing the invertible matrix Q, we are influenced by the following.
• System (2) should be easy to solve for x(k), when the right-hand side is known.
• Matrix Q should be chosen to ensure that the sequence x(k) converges, no matter what initial vector is used. Ideally, this convergence will be rapid.
One should not believe that it is necessary to compute the inverse of Q to carry out an iterative procedure. For small systems, we can easily compute the inverse of Q, but in general, this is definitely not to be done! We want to solve a linear system in which Q is the coefficient matrix. As was mentioned previously, we want to select Q so that a linear system with Q as the coefficient matrix is easy to solve. Examples of such matrices are diagonal, tridiagonal, banded, lower triangular, and upper triangular.
Now, let’s view System (1) in its detailed form
$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le n) \qquad (3)$$
Solving the ith equation for the ith unknown term, we obtain an equation that describes the Jacobi method:
$$x_i^{(k)} = -\sum_{\substack{j=1 \\ j \ne i}}^{n} (a_{ij}/a_{ii})\, x_j^{(k-1)} + (b_i/a_{ii}) \qquad (1 \le i \le n) \qquad (4)$$
Here, we assume that all diagonal elements are nonzero. (If this is not the case, we can usually rearrange the equations so that it is.)
In the Jacobi method above, the equations are solved in order. As each new value $x_j^{(k)}$ is computed, it can be used immediately in place of the old component $x_j^{(k-1)}$. If this is done, we have the Gauss-Seidel method:
$$x_i^{(k)} = -\sum_{\substack{j=1 \\ j < i}}^{n} (a_{ij}/a_{ii})\, x_j^{(k)} - \sum_{\substack{j=1 \\ j > i}}^{n} (a_{ij}/a_{ii})\, x_j^{(k-1)} + (b_i/a_{ii}) \qquad (5)$$
If x(k−1) is not saved, then we can dispense with the superscripts in the pseudocode as follows:
    integer i, j, k, kmax, n; real array (aij)1:n×1:n, (bi)1:n, (xi)1:n
    for k = 1 to kmax
        for i = 1 to n
            xi ← ( bi − Σ_{j=1, j≠i}^{n} aij xj ) / aii
        end for
    end for
Gauss-Seidel Pseudocode
An acceleration of the Gauss-Seidel method is possible by the introduction of a relaxation factor ω, resulting in the successive overrelaxation (SOR) method:
$$x_i^{(k)} = \omega \left\{ -\sum_{\substack{j=1 \\ j < i}}^{n} (a_{ij}/a_{ii})\, x_j^{(k)} - \sum_{\substack{j=1 \\ j > i}}^{n} (a_{ij}/a_{ii})\, x_j^{(k-1)} + (b_i/a_{ii}) \right\} + (1 - \omega)\, x_i^{(k-1)} \qquad (6)$$
The SOR method with ω = 1 reduces to the Gauss-Seidel method.
We now consider numerical examples using iterative methods associated with the names Jacobi, Gauss-Seidel, and successive overrelaxation.
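The SOR update can be sketched in a few lines of Python by overwriting the components of x in place, so that new values are used as soon as they are available. This is an illustrative sketch only: the function name `sor` is hypothetical, the sample system is a small one chosen for demonstration, and the value ω = 1.1 is an arbitrary assumption rather than a recommended relaxation factor.

```python
def sor(A, b, x0, omega, kmax):
    """Perform kmax SOR sweeps; omega = 1 gives the Gauss-Seidel method."""
    n = len(b)
    x = list(x0)
    for _ in range(kmax):
        for i in range(n):
            # x[j] already holds the new value for j < i, the old one for j > i
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel_value = (b[i] - s) / A[i][i]
            x[i] = omega * gauss_seidel_value + (1.0 - omega) * x[i]
    return x


A = [[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 8.0, -5.0]

one_gs_sweep = sor(A, b, [0.0, 0.0, 0.0], 1.0, 1)
x_sor = sor(A, b, [0.0, 0.0, 0.0], 1.1, 50)
print(one_gs_sweep)  # about [0.5, 2.8333, -1.0833]
print(x_sor)         # close to [2, 3, -1]
```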
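EXAMPLE 1 (Jacobi iteration) Let

$$A = \begin{bmatrix} 2 & -1 & 0\\ -1 & 3 & -1\\ 0 & -1 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1\\ 8\\ -5 \end{bmatrix}$$

Carry out a number of iterations of the Jacobi method, starting with the zero initial vector.

Solution Rewriting the equations, we have the Jacobi method:

$$\begin{aligned}
x_1^{(k)} &= \tfrac12 x_2^{(k-1)} + \tfrac12\\
x_2^{(k)} &= \tfrac13 x_1^{(k-1)} + \tfrac13 x_3^{(k-1)} + \tfrac83\\
x_3^{(k)} &= \tfrac12 x_2^{(k-1)} - \tfrac52
\end{aligned}$$

Taking the initial vector to be x(0) = [0, 0, 0]^T, we find (with the aid of a computer program or a programmable calculator) that

    x(0)  = [0, 0, 0]^T
    x(1)  = [0.5000, 2.6667, −2.5000]^T
    x(2)  = [1.8333, 2.0000, −1.1667]^T
      ⋮
    x(21) = [2.0000, 3.0000, −1.0000]^T

The actual solution (to four decimal places rounded) is obtained. ■

In the Jacobi method, Q is taken to be the diagonal of A:

$$Q = \begin{bmatrix} 2 & 0 & 0\\ 0 & 3 & 0\\ 0 & 0 & 2 \end{bmatrix}$$

Now

$$Q^{-1} = \begin{bmatrix} \tfrac12 & 0 & 0\\ 0 & \tfrac13 & 0\\ 0 & 0 & \tfrac12 \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} 1 & -\tfrac12 & 0\\ -\tfrac13 & 1 & -\tfrac13\\ 0 & -\tfrac12 & 1 \end{bmatrix}$$

The Jacobi iteration matrix and constant vector are

$$B = I - Q^{-1}A = \begin{bmatrix} 0 & \tfrac12 & 0\\ \tfrac13 & 0 & \tfrac13\\ 0 & \tfrac12 & 0 \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac12\\ \tfrac83\\ -\tfrac52 \end{bmatrix}$$

One can see that Q is close to A, Q^{-1}A is close to I, and I − Q^{-1}A is small. We write the Jacobi method as

$$x^{(k)} = Bx^{(k-1)} + h$$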
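The matrix form of the Jacobi iteration for this example can be checked numerically. The following is a minimal sketch assuming NumPy; the helper name `jacobi_iterates` is hypothetical and introduced only for illustration.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 8.0, -5.0])

d = np.diag(A)                    # diagonal of A, so Q = diag(d)
B = np.eye(3) - A / d[:, None]    # B = I - Q^{-1} A (row i divided by a_ii)
h = b / d                         # h = Q^{-1} b

def jacobi_iterates(B, h, steps):
    """Return [x^(0), ..., x^(steps)] for x^(k) = B x^(k-1) + h with x^(0) = 0."""
    xs = [np.zeros_like(h)]
    for _ in range(steps):
        xs.append(B @ xs[-1] + h)
    return xs

xs = jacobi_iterates(B, h, 21)
print(np.round(xs[1], 4), np.round(xs[2], 4), np.round(xs[21], 4))
```

Dividing row i of A by a_ii avoids forming Q^{-1} explicitly, in keeping with the remark that the inverse is used for analysis rather than computation.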
EXAMPLE 2 (Gauss-Seidel iteration) Repeat the preceding example using the Gauss-Seidel method.

Solution The idea of the Gauss-Seidel method is simply to accelerate the convergence by incorporating each vector as soon as it has been computed. Obviously, it would be more efficient in the Jacobi method to use the updated value $x_1^{(k)}$ in the second equation instead of the old value $x_1^{(k-1)}$. Similarly, $x_2^{(k)}$ could be used in the third equation in place of $x_2^{(k-1)}$. Using the new iterates as soon as they become available, we have the Gauss-Seidel method:

$$\begin{aligned}
x_1^{(k)} &= \tfrac12 x_2^{(k-1)} + \tfrac12\\
x_2^{(k)} &= \tfrac13 x_1^{(k)} + \tfrac13 x_3^{(k-1)} + \tfrac83\\
x_3^{(k)} &= \tfrac12 x_2^{(k)} - \tfrac52
\end{aligned}$$

Starting with the initial vector zero, some of the iterates are

    x(0) = [0, 0, 0]^T
    x(1) = [0.5000, 2.8333, −1.0833]^T
    x(2) = [1.9167, 2.9444, −1.0278]^T
      ⋮
    x(9) = [2.0000, 3.0000, −1.0000]^T

In this example, the convergence of the Gauss-Seidel method is approximately twice as fast as that of the Jacobi method. ■
In the iterative algorithm that goes by the name Gauss-Seidel, Q is chosen as the lower triangular part of A, including the diagonal. Using the data from the previous example, we now find that
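$$Q = \begin{bmatrix} 2 & 0 & 0\\ -1 & 3 & 0\\ 0 & -1 & 2 \end{bmatrix}$$

The usual row operations give us

$$Q^{-1} = \begin{bmatrix} \tfrac12 & 0 & 0\\ \tfrac16 & \tfrac13 & 0\\ \tfrac1{12} & \tfrac16 & \tfrac12 \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} 1 & -\tfrac12 & 0\\ 0 & \tfrac56 & -\tfrac13\\ 0 & -\tfrac1{12} & \tfrac56 \end{bmatrix}$$

Again, we emphasize that in a practical problem we would not compute Q^{-1}. The Gauss-Seidel iteration matrix and constant vector are

$$L = I - Q^{-1}A = \begin{bmatrix} 0 & \tfrac12 & 0\\ 0 & \tfrac16 & \tfrac13\\ 0 & \tfrac1{12} & \tfrac16 \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac12\\ \tfrac{17}{6}\\ -\tfrac{13}{12} \end{bmatrix}$$

We write the Gauss-Seidel method as

$$x^{(k)} = Lx^{(k-1)} + h$$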
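The Gauss-Seidel iteration matrix and constant vector can be reproduced without forming Q^{-1}, by solving with the triangular matrix Q instead. This is a hedged sketch assuming NumPy; `np.linalg.solve` is used for brevity, although a dedicated triangular solver would be the more economical choice in practice.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 8.0, -5.0])

Q = np.tril(A)                          # lower triangular part, diagonal included
L = np.eye(3) - np.linalg.solve(Q, A)   # L = I - Q^{-1} A, via a solve
h = np.linalg.solve(Q, b)               # h = Q^{-1} b

x = np.zeros(3)
for _ in range(9):                      # nine steps of x <- L x + h
    x = L @ x + h
print(np.round(x, 4))
```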
EXAMPLE 3 (SOR iteration) Repeat the preceding example using the SOR iteration with ω = 1.1.

Solution Introducing a relaxation factor ω into the Gauss-Seidel method, we have the SOR method:

$$\begin{aligned}
x_1^{(k)} &= \omega\left[\tfrac12 x_2^{(k-1)} + \tfrac12\right] + (1-\omega)\,x_1^{(k-1)}\\
x_2^{(k)} &= \omega\left[\tfrac13 x_1^{(k)} + \tfrac13 x_3^{(k-1)} + \tfrac83\right] + (1-\omega)\,x_2^{(k-1)}\\
x_3^{(k)} &= \omega\left[\tfrac12 x_2^{(k)} - \tfrac52\right] + (1-\omega)\,x_3^{(k-1)}
\end{aligned}$$

Starting with the initial vector of zeros and with ω = 1.1, some of the iterates are

    x(0) = [0, 0, 0]^T
    x(1) = [0.5500, 3.1350, −1.0257]^T
    x(2) = [2.2193, 3.0574, −0.9658]^T
      ⋮
    x(7) = [2.0000, 3.0000, −1.0000]^T

In this example, the convergence of the SOR method is faster than that of the Gauss-Seidel method. ■
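In the iterative algorithm that goes by the name successive overrelaxation (SOR), Q is chosen as the lower triangular part of A including the diagonal, but each diagonal element a_ii is replaced by a_ii/ω, where ω is the relaxation factor. (Initial work on the SOR method was done by Southwell [1946] and Young [1950].)

From the previous example, this means that

$$Q = \begin{bmatrix} \tfrac{20}{11} & 0 & 0\\ -1 & \tfrac{30}{11} & 0\\ 0 & -1 & \tfrac{20}{11} \end{bmatrix}$$

Now

$$Q^{-1} = \begin{bmatrix} \tfrac{11}{20} & 0 & 0\\ \tfrac{121}{600} & \tfrac{11}{30} & 0\\ \tfrac{1331}{12000} & \tfrac{121}{600} & \tfrac{11}{20} \end{bmatrix}, \qquad
Q^{-1}A = \begin{bmatrix} \tfrac{11}{10} & -\tfrac{11}{20} & 0\\ \tfrac{11}{300} & \tfrac{539}{600} & -\tfrac{11}{30}\\ \tfrac{121}{6000} & -\tfrac{671}{12000} & \tfrac{539}{600} \end{bmatrix}$$

The SOR iteration matrix and constant vector are

$$L_\omega = I - Q^{-1}A = \begin{bmatrix} -\tfrac{1}{10} & \tfrac{11}{20} & 0\\ -\tfrac{11}{300} & \tfrac{61}{600} & \tfrac{11}{30}\\ -\tfrac{121}{6000} & \tfrac{671}{12000} & \tfrac{61}{600} \end{bmatrix}, \qquad
h = Q^{-1}b = \begin{bmatrix} \tfrac{11}{20}\\ \tfrac{627}{200}\\ -\tfrac{4103}{4000} \end{bmatrix}$$

We write the SOR method as

$$x^{(k)} = L_\omega x^{(k-1)} + h$$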
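The rational entries of the SOR iteration matrix can be verified in exact arithmetic. The sketch below uses Python's standard `fractions` module; the function name `sor_matrices` is an assumption made for this illustration, and forward substitution stands in for an explicit inverse of Q.

```python
from fractions import Fraction as F

def sor_matrices(A, b, omega):
    """Exact L_omega = I - Q^{-1}A and h = Q^{-1}b for the SOR splitting matrix Q."""
    n = len(A)
    # Q: lower triangle of A with each diagonal entry a_ii replaced by a_ii/omega.
    Q = [[(A[i][j] / omega if i == j else A[i][j]) if j <= i else F(0)
          for j in range(n)] for i in range(n)]

    def forward_solve(rhs):
        # Solve Q y = rhs by forward substitution (Q is lower triangular).
        y = []
        for i in range(n):
            s = rhs[i] - sum(Q[i][j] * y[j] for j in range(i))
            y.append(s / Q[i][i])
        return y

    # Column j of Q^{-1}A comes from solving Q y = (column j of A).
    cols = [forward_solve([A[i][j] for i in range(n)]) for j in range(n)]
    L = [[(F(1) if i == j else F(0)) - cols[j][i] for j in range(n)]
         for i in range(n)]
    return L, forward_solve(b)

A = [[F(2), F(-1), F(0)], [F(-1), F(3), F(-1)], [F(0), F(-1), F(2)]]
b = [F(1), F(8), F(-5)]
L_omega, h = sor_matrices(A, b, F(11, 10))
for row in L_omega:
    print([str(entry) for entry in row])
print([str(entry) for entry in h])
```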
We can write pseudocode for the Jacobi, Gauss-Seidel, and SOR methods assuming that the linear System (1) is stored in matrix-vector form:
Jacobi Pseudocode

    procedure Jacobi(A, b, x)
    integer kmax ← 100; real δ ← 10^{−10}, ε ← (1/2) × 10^{−4}
    integer i, j, k, n; real diag, sum
    real array (A)_{1:n×1:n}, (b)_{1:n}, (x)_{1:n}, (y)_{1:n}
    n ← size(A)
    for k = 1 to kmax
        y ← x
        for i = 1 to n
            sum ← b_i
            diag ← a_ii
            if |diag| < δ then
                output “diagonal element too small”
                return
            end if
            for j = 1 to n
                if j ≠ i then
                    sum ← sum − a_ij y_j
                end if
            end for
            x_i ← sum/diag
        end for
        output k, x
        if ∥x − y∥ < ε then
            output k, x
            return
        end if
    end for
    output “maximum iterations reached”
    return
    end Jacobi

Here, the vector y contains the old iterate values, and the vector x contains the updated ones. The values of kmax, δ, and ε are set either in a parameter statement or as global variables.

Gauss-Seidel Pseudocode

The pseudocode for the procedure Gauss_Seidel(A, b, x) is the same as the Jacobi pseudocode shown above except that the innermost j-loop is replaced by the following:

    for j = 1 to i − 1
        sum ← sum − a_ij x_j
    end for
    for j = i + 1 to n
        sum ← sum − a_ij x_j
    end for

SOR Pseudocode

The pseudocode for procedure SOR(A, b, x, ω) is the same as that for the Gauss-Seidel pseudocode with the statement following the j-loop replaced by the following:

    x_i ← sum/diag
    x_i ← ωx_i + (1 − ω)y_i
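The three procedures differ only in which vector the inner sum reads and in the final relaxation step, so they can be sketched as one Python function. This is an illustrative translation under stated assumptions, not the book's code: the name `iterate`, the `method` switch, the use of exceptions in place of output statements, and the choice of the Euclidean norm in the stopping test are all assumptions, and the diagonal check is done once up front rather than inside the loop.

```python
import numpy as np

def iterate(A, b, x0, method="jacobi", omega=1.0,
            kmax=100, delta=1e-10, eps=0.5e-4):
    """Jacobi, Gauss-Seidel, or SOR iteration; returns (x, iterations used)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.array(x0, dtype=float)
    n = len(b)
    if np.any(np.abs(np.diag(A)) < delta):
        raise ZeroDivisionError("diagonal element too small")
    for k in range(1, kmax + 1):
        y = x.copy()                       # y holds the old iterate
        for i in range(n):
            # Jacobi reads only old values; the other two reuse updated x.
            src = y if method == "jacobi" else x
            s = b[i] - A[i, :i] @ src[:i] - A[i, i + 1:] @ src[i + 1:]
            x[i] = s / A[i, i]
            if method == "sor":
                x[i] = omega * x[i] + (1.0 - omega) * y[i]
        if np.linalg.norm(x - y) < eps:    # Euclidean norm (an assumption)
            return x, k
    raise RuntimeError("maximum iterations reached")

A = [[2, -1, 0], [-1, 3, -1], [0, -1, 2]]
b = [1, 8, -5]
for name, w in [("jacobi", 1.0), ("gauss-seidel", 1.0), ("sor", 1.1)]:
    x, k = iterate(A, b, [0, 0, 0], method=name, omega=w)
    print(name, k, np.round(x, 4))
```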
Large Sparse Linear Systems
In the solution of partial differential equations, iterative methods are frequently used to solve large sparse linear systems, which often have special structures. The partial derivatives are approximated by stencils composed of relatively few points, such as 5, 7, or 9. This leads to only a few nonzero entries per row in the linear system. In such systems, the coefficient matrix A is usually not stored because the matrix-vector product can be written directly in the code. See Section 12.3 for additional details on this and how it is related to solving elliptic partial differential equations.
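As an illustration of writing the matrix-vector product directly in code, the sketch below applies a 5-point stencil to a grid of unknowns without ever storing the coefficient matrix. The particular stencil (4 at the center, −1 at each of the four neighbors, zero values outside the grid) and the function name `apply_five_point` are assumptions chosen for this example.

```python
import numpy as np

def apply_five_point(u):
    """Return the 5-point stencil applied to the 2-D grid u (zero outside the grid)."""
    v = 4.0 * u
    v[1:, :] -= u[:-1, :]    # neighbor above
    v[:-1, :] -= u[1:, :]    # neighbor below
    v[:, 1:] -= u[:, :-1]    # neighbor to the left
    v[:, :-1] -= u[:, 1:]    # neighbor to the right
    return v

n = 4
u = np.arange(1.0, n * n + 1).reshape(n, n)
v = apply_five_point(u)      # same result as A @ u.ravel(), reshaped to n x n
```

For an n × n grid the stored matrix would have n⁴ entries, of which at most 5n² are nonzero; the stencil form needs only the grid itself.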
Convergence Theorems
For the analysis of the method described by System (2), we write

$$x^{(k)} = Q^{-1}\left[(Q - A)x^{(k-1)} + b\right]$$

or

$$x^{(k)} = Gx^{(k-1)} + h \tag{7}$$

where the iteration matrix and vector are

$$G = I - Q^{-1}A, \qquad h = Q^{-1}b$$

Notice that in the pseudocode, we do not compute Q^{-1}. The matrix Q^{-1} is used to facilitate the analysis. Now let x be the solution of System (1). Since A is invertible, x exists and is unique. Because Ax = b, we have x = (I − Q^{-1}A)x + Q^{-1}b, and so, from Equation (7),

$$\begin{aligned}
x^{(k)} - x &= (I - Q^{-1}A)x^{(k-1)} - x + Q^{-1}b\\
&= (I - Q^{-1}A)x^{(k-1)} - (I - Q^{-1}A)x\\
&= (I - Q^{-1}A)(x^{(k-1)} - x)
\end{aligned}$$

One can interpret $e^{(k)} \equiv x^{(k)} - x$ as the current error vector. Thus, we have

$$e^{(k)} = (I - Q^{-1}A)\,e^{(k-1)} \tag{8}$$

We want $e^{(k)}$ to become smaller as k increases. Equation (8) shows that $e^{(k)}$ is smaller than $e^{(k-1)}$ if I − Q^{-1}A is small, in some sense. In turn, Q^{-1}A should be close to I and Q should be close to A. (Norms can be used to make small and close precise.)

■ Theorem 1 Spectral Radius Theorem

In order for the sequence generated by

$$Qx^{(k)} = (Q - A)x^{(k-1)} + b$$

to converge, no matter what starting point x(0) is selected, it is necessary and sufficient that all eigenvalues of I − Q^{-1}A lie in the open unit disc, |z| < 1, in the complex plane.

The conclusion of this theorem can also be written as

$$\rho(I - Q^{-1}A) < 1$$

where ρ is the spectral radius function: For any n × n matrix G, having eigenvalues λ_i, we define

$$\rho(G) = \max_{1 \le i \le n} |\lambda_i|$$
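The spectral radius condition is easy to test numerically for a given splitting. The sketch below assumes NumPy and uses the 3 × 3 matrix A of Example 1; the helper names `spectral_radius` and `iteration_matrix` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def spectral_radius(G):
    """Largest eigenvalue magnitude of the square matrix G."""
    return float(np.max(np.abs(np.linalg.eigvals(G))))

def iteration_matrix(A, Q):
    """G = I - Q^{-1} A, computed with a linear solve instead of an inverse."""
    return np.eye(len(A)) - np.linalg.solve(Q, A)

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
D = np.diag(np.diag(A))

rho_jacobi = spectral_radius(iteration_matrix(A, D))
rho_gs = spectral_radius(iteration_matrix(A, np.tril(A)))
rho_sor = spectral_radius(iteration_matrix(A, np.tril(A, -1) + D / 1.1))
print(rho_jacobi, rho_gs, rho_sor)   # all three are below 1 for this matrix
```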
EXAMPLE 4 Determine whether the Jacobi, Gauss-Seidel, and SOR methods (with ω = 1.1) of Example 3 converge for all initial iterates.

Solution For the Jacobi method, we can easily compute the eigenvalues of the relevant matrix B. The steps are

$$\det(B - \lambda I) = \det\begin{bmatrix} -\lambda & \tfrac12 & 0\\ \tfrac13 & -\lambda & \tfrac13\\ 0 & \tfrac12 & -\lambda \end{bmatrix} = -\lambda^3 + \tfrac16\lambda + \tfrac16\lambda = 0$$

The eigenvalues are $\lambda = 0, \pm\sqrt{1/3} \approx \pm 0.5774$. Thus, by the Spectral Radius Theorem, the Jacobi iteration succeeds for any starting vector in this example.

For the Gauss-Seidel method, the eigenvalues of the iteration matrix L are determined from

$$\det(L - \lambda I) = \det\begin{bmatrix} -\lambda & \tfrac12 & 0\\ 0 & \tfrac16 - \lambda & \tfrac13\\ 0 & \tfrac1{12} & \tfrac16 - \lambda \end{bmatrix} = -\lambda\left(\tfrac16 - \lambda\right)^2 + \tfrac1{36}\lambda = 0$$

The eigenvalues are λ = 0, 0, 1/3 ≈ 0.333. Hence, the Gauss-Seidel iteration converges for any initial vector in this example.

For the SOR method with ω = 1.1, the eigenvalues of the iteration matrix L_ω are determined from

$$\begin{aligned}
\det(L_\omega - \lambda I) &= \det\begin{bmatrix} -\tfrac1{10} - \lambda & \tfrac{11}{20} & 0\\ -\tfrac{11}{300} & \tfrac{61}{600} - \lambda & \tfrac{11}{30}\\ -\tfrac{121}{6000} & \tfrac{671}{12000} & \tfrac{61}{600} - \lambda \end{bmatrix}\\
&= \left(-\tfrac1{10} - \lambda\right)\left(\tfrac{61}{600} - \lambda\right)^2 - \tfrac{121}{6000}\cdot\tfrac{11}{30}\cdot\tfrac{11}{20}\\
&\quad + \tfrac{11}{20}\cdot\tfrac{11}{300}\left(\tfrac{61}{600} - \lambda\right) - \left(-\tfrac1{10} - \lambda\right)\tfrac{671}{12000}\cdot\tfrac{11}{30}\\
&= -\tfrac{1}{1000} + \tfrac{31}{3000}\lambda + \tfrac{31}{300}\lambda^2 - \lambda^3 = 0
\end{aligned}$$

The eigenvalues are λ ≈ 0.1200, 0.0833, −0.1000. Hence, the SOR iteration converges for any initial vector in this example. ■

A condition that is easier to verify than the inequality ρ(I − Q^{-1}A) < 1 is the dominance of the diagonal elements over the other elements in the same row. As defined in Section 2.3, we can use the property of diagonal dominance

$$|a_{ii}| > \sum_{\substack{j=1\\ j \ne i}}^{n} |a_{ij}|$$

to determine whether the Jacobi and Gauss-Seidel methods converge.
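The diagonal dominance inequality can be checked row by row in a few lines. This is a minimal sketch assuming NumPy; the function name `is_diagonally_dominant` and the second test matrix are illustrative assumptions.

```python
import numpy as np

def is_diagonally_dominant(A):
    """True if |a_ii| > sum over j != i of |a_ij| holds in every row."""
    absA = np.abs(np.asarray(A, dtype=float))
    diag = np.diag(absA)
    off_diag_sums = absA.sum(axis=1) - diag   # row sums without the diagonal
    return bool(np.all(diag > off_diag_sums))

dominant = is_diagonally_dominant([[2, -1, 0], [-1, 3, -1], [0, -1, 2]])
not_dominant = is_diagonally_dominant([[1, 2], [3, 4]])
print(dominant, not_dominant)
```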
■ Theorem 2 Jacobi and Gauss-Seidel Convergence Theorem

If A is diagonally dominant, then the Jacobi and Gauss-Seidel methods converge for any starting vector x(0).

Notice that this is a sufficient, but not a necessary condition. Indeed, there are matrices that are not diagonally dominant for which these methods converge.

Another important property follows:

■ Definition 1 Symmetric Positive Definite

Matrix A is symmetric positive definite (SPD) if A = A^T and x^T A x > 0 for all nonzero real vectors x.

For a matrix A to be SPD, it is necessary and sufficient that A = A^T and that all eigenvalues of A are positive.

■ Theorem 3 SOR Convergence Theorem

Suppose that the matrix A has positive diagonal elements and that 0 < ω < 2. The SOR method converges for any starting vector x(0) if and only if A is symmetric and positive definite.

Matrix Formulation of Iterative Methods

For the formal theory of iterative methods, we split the matrix A into the sum of a nonzero diagonal matrix D, a strictly lower triangular matrix C_L, and a strictly upper triangular matrix C_U such that

$$A = D - C_L - C_U$$

Here, D = diag(A), C_L = (−a_ij)_{i>j}, and C_U = (−a_ij)_{i<j}. With this splitting, the Jacobi method is

$$Dx^{(k)} = (C_L + C_U)x^{(k-1)} + b$$

which corresponds to Equation (2) with Q = diag(A) = D. The Gauss-Seidel method becomes

$$(D - C_L)x^{(k)} = C_U x^{(k-1)} + b$$

This corresponds to Equation (2) with Q = diag(A) + lower triangular(A) = D − C_L.

From Equation (6), the SOR method can be written as

$$(D - \omega C_L)x^{(k)} = [\omega C_U + (1 - \omega)D]\,x^{(k-1)} + \omega b$$

This corresponds to Equation (2) with Q = (1/ω)diag(A) + lower triangular(A) = (1/ω)(D − ωC_L).
In summary, the iteration matrix and constant vector for the basic three iterative methods (Jacobi, Gauss-Seidel, and SOR) can be written in terms of this splitting. For the Jacobi method, we have Q = D and

$$B = I - Q^{-1}A = D^{-1}(C_L + C_U), \qquad h = Q^{-1}b = D^{-1}b$$

For the Gauss-Seidel method, we have Q = D − C_L and

$$L = I - Q^{-1}A = (D - C_L)^{-1}C_U, \qquad h = Q^{-1}b = (D - C_L)^{-1}b$$

For the SOR method, we have Q = (1/ω)(D − ωC_L) and

$$L_\omega = I - Q^{-1}A = (D - \omega C_L)^{-1}[\omega C_U + (1 - \omega)D], \qquad h = Q^{-1}b = \omega(D - \omega C_L)^{-1}b$$
Another View of Overrelaxation
In some cases, the rate of convergence of the basic iterative scheme (2) can be improved by the introduction of an auxiliary vector and an acceleration parameter ω as follows:

$$Qz^{(k)} = (Q - A)x^{(k-1)} + b, \qquad x^{(k)} = \omega z^{(k)} + (1 - \omega)x^{(k-1)}$$

or

$$x^{(k)} = \omega\left\{(I - Q^{-1}A)x^{(k-1)} + Q^{-1}b\right\} + (1 - \omega)x^{(k-1)}$$

The parameter ω gives a weighting in favor of the updated values. When ω = 1, this procedure reduces to the basic iterative method, and when 1 < ω < 2, the rate of convergence may be improved, which is called overrelaxation. When Q = D, we have the Jacobi overrelaxation (JOR) method:

$$x^{(k)} = \omega\left\{Bx^{(k-1)} + h\right\} + (1 - \omega)x^{(k-1)}$$

Overrelaxation has particular advantages when used with the Gauss-Seidel method in a slightly different way:

$$Dz^{(k)} = C_L x^{(k)} + C_U x^{(k-1)} + b, \qquad x^{(k)} = \omega z^{(k)} + (1 - \omega)x^{(k-1)}$$

and we have the SOR method:

$$x^{(k)} = L_\omega x^{(k-1)} + h$$

Conjugate Gradient Method

The conjugate gradient method is one of the most popular iterative methods for solving sparse systems of linear equations. This is particularly true for systems that arise in the numerical solutions of partial differential equations. (See Section 12.1.)

We begin with a brief presentation of definitions and associated notation. Assume that the real n × n matrix A is symmetric, meaning that A^T = A. The inner product of two vectors u = (u_1, u_2, ..., u_n) and v = (v_1, v_2, ..., v_n) can be written as $\langle u, v\rangle = u^T v = \sum_{i=1}^{n} u_i v_i$, which is the scalar sum. Note that ⟨u, v⟩ = ⟨v, u⟩. If u and v are mutually orthogonal, then ⟨u, v⟩ = 0. An A-inner product of two vectors u and v is defined as

$$\langle u, v\rangle_A = \langle Au, v\rangle = u^T A^T v$$
Two nonzero vectors u and v are A-conjugate if ⟨u, v⟩ A = 0. An n × n matrix A is positive definite if
⟨x,x⟩A >0
for all nonzero vectors x ∈ Rn. In general, expressions such as ⟨u,v⟩ and ⟨u,v⟩A reduce to 1 × 1 matrices and are treated as scalar values. A quadratic form is a scalar quadratic function of a vector of the form
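$$f(x) = \tfrac12\langle x, x\rangle_A - \langle b, x\rangle + c$$

Here, A is a matrix, x and b are vectors, and c is a scalar constant. The gradient of a quadratic form is

$$f'(x) = \left[\,\partial f(x)/\partial x_1,\ \partial f(x)/\partial x_2,\ \ldots,\ \partial f(x)/\partial x_n\,\right]^T$$

We can derive the following:

$$f'(x) = \tfrac12 A^T x + \tfrac12 Ax - b$$

If A is symmetric, this reduces to

$$f'(x) = Ax - b$$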
Setting the gradient to zero, we obtain the linear system to be solved, Ax = b. Therefore, the solution of Ax = b is a critical point of f (x). If A is symmetric and positive definite, then f (x) is minimized by the solution of Ax = b. So an alternative way of solving the linear system Ax = b is by finding an x that minimizes f (x).
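The link between the quadratic form and the linear system can be checked numerically. The sketch below assumes NumPy and uses the symmetric positive definite matrix of Example 1 as an illustrative choice; the random perturbations and the names `f` and `grad_f` are assumptions made for this check only.

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 8.0, -5.0])

def f(x, c=0.0):
    """Quadratic form f(x) = (1/2) x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_f(x):
    """Gradient of f for symmetric A."""
    return A @ x - b

x_star = np.linalg.solve(A, b)            # solution of A x = b
rng = np.random.default_rng(0)
# f evaluated at 100 randomly perturbed points around x_star.
perturbed = [f(x_star + rng.standard_normal(3)) for _ in range(100)]
print(grad_f(x_star), f(x_star), min(perturbed))
```

The gradient vanishes at the solution, and every perturbed point gives a larger value of f, as expected for a positive definite A.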
We want to solve the linear system
Ax = b
where the n × n matrix A is symmetric and positive definite.
Suppose that { p(1), p(2), . . . , p(k), . . . , p(n)} is a set containing a sequence of n mutually
conjugate direction vectors. Then they form a basis for the space Rn . Hence, we can expand the true solution vector x∗ of Ax = b into a linear combination of these basis vectors:
$$x^* = \alpha_1 p^{(1)} + \alpha_2 p^{(2)} + \cdots + \alpha_k p^{(k)} + \cdots + \alpha_n p^{(n)}$$

where the coefficients are given by

$$\alpha_k = \langle p^{(k)}, b\rangle \,/\, \langle p^{(k)}, p^{(k)}\rangle_A$$

This can be viewed as a direct method for solving the linear system Ax = b: First find the sequence of n conjugate direction vectors p(k), and then compute the coefficients α_k. However, this approach is impractical because it takes too much computer time and storage.
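The expansion in conjugate directions can be tried on a small system. In the sketch below, the directions are produced by Gram-Schmidt orthogonalization of the standard basis in the A-inner product; that construction, the function name `conjugate_directions`, and the 3 × 3 test system are assumptions made for illustration, since the text does not prescribe how the directions are obtained.

```python
import numpy as np

def conjugate_directions(A):
    """A-orthogonalize the standard basis (Gram-Schmidt in the A-inner product)."""
    P = []
    for e in np.eye(len(A)):
        p = e.copy()
        for q in P:
            p -= (q @ A @ e) / (q @ A @ q) * q   # remove the A-component along q
        P.append(p)
    return P

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 8.0, -5.0])

P = conjugate_directions(A)
alphas = [(p @ b) / (p @ A @ p) for p in P]       # alpha_k = <p_k, b> / <p_k, p_k>_A
x_star = sum(a * p for a, p in zip(alphas, P))    # x* = sum of alpha_k p_k
print(x_star)
```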
On the other hand, if we view the conjugate gradient method as an iterative method, then we are able to solve large sparse linear systems in a reasonable amount of time and storage. The key is carefully choosing a small set of the conjugate direction vectors p(k) so that we do not need them all to obtain a good approximation to the true solution vector.
Start with an initial guess x(0) to the true solution x∗. We can assume without loss of generality that x(0) is the zero vector. The true solution x∗ is also the unique minimizer of

$$f(x) = \tfrac12\langle x, x\rangle_A - \langle b, x\rangle = \tfrac12 x^T A x - b^T x$$

for x ∈ R^n. This suggests taking the first basis vector p(1) to be the gradient of f at x = x(0), which equals −b. The other vectors in the basis are now conjugate to the gradient, hence the name conjugate gradient method. The kth residual vector is

$$r^{(k)} = b - Ax^{(k)}$$

The gradient descent method moves in the direction r(k). Take the direction closest to the gradient vector r(k) by insisting that the direction vectors p(k) be conjugate to each other. Putting all this together, we obtain the expression

$$p^{(k+1)} = r^{(k)} - \frac{\langle p^{(k)}, r^{(k)}\rangle_A}{\langle p^{(k)}, p^{(k)}\rangle_A}\, p^{(k)}$$
After some simplifications, the algorithm is obtained for solving the linear system Ax = b, where the coefficient matrix A is real, symmetric, and positive definite. The input vector x(0) is an initial approximation to the solution or the zero vector.
In theory, the conjugate gradient iterative method solves a system of n linear equations in at most n steps, if the matrix A is symmetric and positive definite. Moreover, the nth iterative vector x(n) is the unique minimizer of the quadratic function
$$q(x) = \tfrac12 x^T A x - x^T b$$
When the conjugate gradient method was introduced by Hestenes and Stiefel [1952], the initial interest in it waned once it was discovered that this finite-termination property was not obtained in practice. But two decades later, there was renewed interest in this method when it was viewed as an iterative process by Reid [1971] and others. In practice, the solution of a system of linear equations can often be found with satisfactory precision in a number of steps considerably less than the order of the system.
CG Pseudocode and Features
Here is a pseudocode for the conjugate gradient algorithm:
    k ← 0; x ← 0; r ← b − Ax; δ ← ⟨r, r⟩
    while √δ > ε√⟨b, b⟩ and k
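For readers who want something runnable, the following is a minimal Python sketch of the standard conjugate gradient iteration, written to match the initialization and stopping test visible above. It is not claimed to be identical to the book's listing: the function name `conjugate_gradient`, the default values of `eps` and `kmax`, and the variable names inside the loop are assumptions.

```python
import numpy as np

def conjugate_gradient(A, b, eps=1e-10, kmax=1000):
    """Solve A x = b for symmetric positive definite A; returns (x, steps used)."""
    x = np.zeros(len(b))
    r = b - A @ x                 # residual; equals b because x starts at zero
    p = r.copy()                  # first search direction
    delta = r @ r
    k = 0
    while np.sqrt(delta) > eps * np.sqrt(b @ b) and k < kmax:
        w = A @ p
        alpha = delta / (p @ w)   # step length along p
        x += alpha * p
        r -= alpha * w
        delta_new = r @ r
        p = r + (delta_new / delta) * p   # next direction, A-conjugate to the previous ones
        delta = delta_new
        k += 1
    return x, k

A = np.array([[2.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 8.0, -5.0])
x, steps = conjugate_gradient(A, b)
print(x, steps)
```

Only one matrix-vector product is needed per step, which is why the method suits large sparse systems where A is applied rather than stored.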
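• The SOR method is

$$x_i^{(k)} = \omega\left\{ -\sum_{\substack{j=1\\ j<i}}^{n} (a_{ij}/a_{ii})\,x_j^{(k)} - \sum_{\substack{j=1\\ j>i}}^{n} (a_{ij}/a_{ii})\,x_j^{(k-1)} + (b_i/a_{ii}) \right\} + (1-\omega)\,x_i^{(k-1)}$$

The SOR method reduces to the Gauss-Seidel method when ω = 1.
• For a matrix formulation, we split the matrix A = (a_ij):

$$A = D - C_L - C_U$$

where D = diag(A) is a nonzero diagonal matrix, C_L = (−a_ij)_{i>j} is a strictly lower triangular matrix, and C_U = (−a_ij)_{i<j} is a strictly upper triangular matrix.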
The splitting matrices, iteration matrices, and constant vectors are as follows:
• For the Jacobi method, we have
    Q = D
    B = D^{-1}(C_L + C_U)
    h = D^{-1}b
• For the Gauss-Seidel method, we have
    Q = D − C_L
    L = (D − C_L)^{-1}C_U
    h = (D − C_L)^{-1}b
• For the SOR method, we have
    Q = (1/ω)(D − ωC_L)
    L_ω = (D − ωC_L)^{-1}[ωC_U + (1 − ω)D]
    h = ω(D − ωC_L)^{-1}b
• An iterative method converges for a specific matrix A if and only if ρ(I − Q^{-1}A) < 1. If A is diagonally dominant, then the Jacobi and Gauss-Seidel methods converge for any x(0). The SOR method converges, for 0 < ω < 2 and any x(0), if and only if A is symmetric and positive definite with positive diagonal elements.

Exercises 8.4

a1. Give an alternative solution to Example 4.

2. Write the matrix formula for the Gauss-Seidel overrelaxation method.

a3. (Multiple Choice) In solving a system of equations Ax = b, it is often convenient to use an iterative method, which generates a sequence of x(k) vectors that should converge to a solution. The process is stopped when sufficient accuracy has been attained. A general procedure is to obtain x(k) by solving Qx(k) = (Q − A)x(k−1) + b. Here, Q is a certain matrix that is usually connected somehow to A. The process is repeated, starting with any available guess, x(0). What hypothesis guarantees that the method works, no matter what starting point is selected?
    a. ||Q|| < 1
    b. ||QA|| < 1
    c. ||I − QA|| < 1
    d. ||I − Q^{-1}A|| < 1
    e. None of these.
    Hint: The spectral radius is less than or equal to the norm.

4. (Multiple Choice) From a vector norm, we can create a subordinate matrix norm. Which relation is satisfied by every subordinate matrix norm?
    a. ||Ax|| ≧ ||A|| ||x||
    b. ||I|| = 1
    c. ||AB|| ≧ ||A|| ||B||
    d. ||A + B|| ≧ ||A|| + ||B||
    e. None of these.

a5. (Multiple Choice) The condition for diagonal dominance of a matrix A is:
    a. |a_ii| < Σ_{j=1, j≠i}^{n} |a_ij|
    b. |a_ii| ≧ Σ_{j=1, j≠i}^{n} |a_ij|
    c. |a_ii| < Σ_{j=1}^{n} |a_ij|
    d. |a_ii| > Σ_{j=1}^{n} |a_ij|
    e. None of these.

6. (Multiple Choice) A necessary and sufficient condition for the standard iteration formula x(k) = Gx(k−1) + h to produce a sequence x(k) that converges to a solution of the equation (I − G)x = h is that:
    a. The spectral radius of G is greater than 1.
    b. The matrix G is diagonally dominant.
    c. The spectral radius of G is less than 1.
    d. G is nonsingular.
    e. None of these.

7. (Multiple Choice) A sufficient condition for the Jacobi method to converge for the linear system Ax = b.
    a. A − I is diagonally dominant.
    b. A is diagonally dominant.
    c. G is nonsingular.
    d. The spectral radius of G is less than 1.
    e. None of these.

8. (Multiple Choice) A sufficient condition for the Gauss-Seidel method to work on the linear system Ax = b.
    a. A is diagonally dominant.
    b. A − I is diagonally dominant.
    c. The spectral radius of A is less than 1.
    d. G is nonsingular.
    e. None of these.

a9. (Multiple Choice) Necessary and sufficient conditions for the SOR method, where 0 < ω < 2, to work on the linear system Ax = b.
    a. A is diagonally dominant.
    b. ρ(A) < 1.
    c. A is symmetric positive definite.
    d. x(0) = 0.
    e. None of these.
Redo several or all of Examples 1–5 using the linear sys- tem involving one of the following coefficient matrix and right-hand side vector pairs:
10. The Frobenius norm

$$\|A\|_F = \left(\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}^2\right)^{1/2}$$

is frequently used because it is so easy to compute. Find the value of this norm for these matrices:
a.
c.
1 23 0012 0 54 b.3054 2 13 1112
1 1 1 2 3 4 0 1 0 3 4 3
1322 11
5 6 −3 −1 3 −1
1 0 d. 1 3 1 −3 43 3−131 55 0301
5 5 5
11. Determine the condition numbers κ(A) of these matrices:
−2 1 0 0 0 1 a. 1 −2 1 b. 0 1 0
0 1 −2 3 0 0
c. 020 0 0 1
(rounded) of accuracy:
−1 0 4 −1 x3 4 2−2−16x4 −3
1 1 1 −2 −1
2 −1 d.1 2 1−2
2 −1 2 1 0201
1.
7 1 −1
1 8 0 − 2 x 2 = − 5
2.
Using the Jacobi, Gauss-Seidel, and SOR (ω = 1.1) iterative methods, write and execute a computer program to solve the following linear system to four decimal places
7 a.A=−13, b=4
2x1 3
5 b.A=−1
0 2
c.A=−1 4
7 d.A= 3
−1 3 −1
−1 6 −3
a3.
4.
5 −1
0 7 −1, b=4
25
Compare the number of iterations needed in each case. Hint: The exact solution is x = (1, −1, 1, −1)T .
Using the Jacobi, Gauss-Seidel, and the SOR (ω = 1.4) iterative methods, write and run code to solve the following linear system to four decimal places of accuracy:

$$\begin{bmatrix} 7 & 3 & -1 & 2\\ 3 & 8 & 1 & -4\\ -1 & 1 & 4 & -1\\ 2 & -4 & -1 & 6 \end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ x_4 \end{bmatrix} =
\begin{bmatrix} -1\\ 0\\ -3\\ 1 \end{bmatrix}$$

Compare the number of iterations in each case.
Hint: Here, the exact solution is x = (−1, 1, −1, 1)^T.

(Continuation) Solve the system using the SOR iterative method with values of ω = 1(0.1)2. Plot the number of iterations for convergence versus the values of ω. Which value of ω results in the fastest convergence?
3
8
01 −2, b=3 89
−1 3 1, b=−4 −114 2
Computer Exercises 8.4
5. Program and run the Jacobi, Gauss-Seidel, and SOR methods for the system of Example 1 as follows:
    a. Use equations involving the splitting matrix Q.
    b. Use matrix-vector multiplication.
    c. Use the equation formulations in Examples 1-3.

6. (Continuation) Select one or more of the systems in Computer Exercise 8.4.1, and rerun these programs.

a7. Consider the linear system

$$\begin{bmatrix} 9 & -3\\ -2 & 8 \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} 6\\ -4 \end{bmatrix}$$

Using mathematical software, compare solving it by using the Jacobi method and the Gauss-Seidel method starting with x(0) = (0, 0)^T.

8. (Continuation)
    a. Change the (1, 1) entry from 9 to 1 so that the coefficient matrix is no longer diagonally dominant and see whether the Gauss-Seidel method still works. Explain why or why not.
    b. Then change the (2, 2) entry from 8 to 1 as well and test. Again explain the results.

9. Use the conjugate gradient method to solve this linear system:
y1
y2
y3
y4
b1 b2
b3 b4
. bn−3
bn−2 bn−1
bn
The right-hand side represents forces on the beam. Set the right-hand side so that there is a known solution, such as a sag in the middle of the beam. Using an iterative method, repeatedly solve the system by allowing n to increase. Does the error in the solution increase when n increases? Use mathematical software that computes the condition number of the coefficient matrix to explain what is happening.
b. The linear system of equations for a cantilever beam with a free boundary condition at only one end is
× .
yn−3 yn−2 yn−1
=
yn
12−6 4 3
−46−41 1−46−41 1−46−41
25 25 25 12 24 12 25 25 25
2.0 −0.3 −0.2 x1 7 −0.3 2.0 −0.1 x2 = 5 −0.2 −0.1 2.0 x3 3
... ... ... ... ... ... 1 −4 6 −4 1
10. (Euler-Bernoulli Beam) A simple model for a bending beam under stress involves the Euler-Bernoulli differential equation. A finite difference discretization converts it into a system of linear equations. As the size of the discretization decreases, the linear system becomes larger and more ill-conditioned.
a. For a beam pinned at both ends, we obtain the following banded system of linear equations with a bandwidth of five:
1 −4 6 −4 1 1 −93 111 −43
3
y1
y2
y3
y4
b4
12 −6 4
−4 6 −4
1 −4 6 −4
1−46−41
× . = . .. yn−3 bn−3 y n − 2 b n − 2
b1 b2
b3
yn−1 bn−1 yn bn
1
... ... ... ... ... ...
1 −4 6 −4 1
1 −4 6 −4 1
1 −4 6 −4
4 3
Repeat the numerical experiment for this system. See Sauer [2012] for additional details.
1
−6 12
11. Consider this sparse linear system:

$$\begin{bmatrix}
3 & -1 & & & & & & \tfrac12\\
-1 & 3 & -1 & & & & \tfrac12 & \\
 & -1 & 3 & -1 & & \tfrac12 & & \\
 & & \ddots & \ddots & \ddots & & & \\
 & & & \ddots & \ddots & \ddots & & \\
 & & \tfrac12 & & -1 & 3 & -1 & \\
 & \tfrac12 & & & & -1 & 3 & -1\\
\tfrac12 & & & & & & -1 & 3
\end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3\\ \vdots\\ \vdots\\ x_{n-2}\\ x_{n-1}\\ x_n \end{bmatrix}
=
\begin{bmatrix} 2.5\\ 1.5\\ 1.5\\ \vdots\\ 1.0\\ \vdots\\ 1.5\\ 1.5\\ 2.5 \end{bmatrix}$$

The true solution is x = [1, 1, 1, ..., 1, 1, 1]^T. Use an iterative method to solve this system for increasing values of n.

12. Consider the sample two-dimensional linear system

$$Ax = \begin{bmatrix} 3 & 2\\ 2 & 6 \end{bmatrix}\begin{bmatrix} x_1\\ x_2 \end{bmatrix} = \begin{bmatrix} 2\\ -8 \end{bmatrix} = b$$

and c = 0. Plot graphs to show the following:
    a. The solution lies at the intersection of two lines.
    b. Graph of the quadratic form
       $$F(x) = c - b^T x + \tfrac12 x^T A x$$
       showing that the minimum point of this surface is the solution of Ax = b.
    c. Contours of the quadratic form so that each ellipsoidal curve has a constant value.
    d. Gradient F′(x) of the quadratic form. Show that for every x, the gradient points in the direction of the steepest increase of F(x) and is orthogonal to the contour lines. (See Section 13.2.)
9
Least Squares Methods and Fourier Series
Surface tension S in a liquid is known to be a linear function of temperature T. For a particular liquid, measurements have been made of the surface tension at certain temperatures. The results were as follows:

    T |    0     10     20     30     40     80     90     95
    S | 68.0   67.1   66.4   65.6   64.6   61.8   61.0   60.0
How can the most probable values of the constants in the equation
S = aT + b
be determined? Methods for solving such problems are developed in this
chapter.
9.1 Method of Least Squares

Linear Least Squares
In experimental, social, and behavioral sciences, an experiment or survey often produces a mass of data. To interpret the data, the investigator may resort to graphical methods. For instance, an experiment in physics might produce a numerical table of the form
    x | x_0   x_1   ···   x_m
    y | y_0   y_1   ···   y_m          (1)

and from it, m + 1 points on a graph could be plotted. Suppose that the resulting graph looks like Figure 9.1. A reasonable tentative conclusion is that the underlying function is linear and that the failure of the points to fall precisely on a straight line is due to experimental error.

FIGURE 9.1 Experimental data

Proceeding on this assumption (or if theoretical reasons exist for believing that the function is indeed linear), the next step is to determine the correct function. Assuming that

$$y = ax + b$$

we want to find the coefficients a and b. Thinking geometrically, we ask:
What line most nearly passes through the eight points plotted?
To answer this question, suppose that a guess is made about the correct values of a and b. This is equivalent to deciding on a specific line to represent the data. In general, the data points may not fall on the line y = ax + b. If by chance the kth datum falls on the line, then
$$a x_k + b - y_k = 0$$

If it does not, then there is a discrepancy or error of magnitude

$$|a x_k + b - y_k|$$

The total absolute error for all m + 1 points is therefore

$$\sum_{k=0}^{m} |a x_k + b - y_k|$$
This is a function of a and b, and it would be reasonable to choose a and b so that the function assumes its minimum value. This problem is an example of l1 approximation and can be solved by the techniques of linear programming, a subject dealt with in Chapter 14. (The methods of calculus do not work on this function because it is not generally differentiable.)
In practice, it is common to minimize a different error function of a and b:

$$\varphi(a, b) = \sum_{k=0}^{m} (a x_k + b - y_k)^2 \tag{2}$$

This function is suitable because of statistical considerations. Explicitly, if the errors follow a normal probability distribution, then the minimization of φ produces a best estimate of a and b. This is called an l2 approximation. Another advantage is that the methods of calculus can be used on Equation (2).

The l1 and l2 approximations are related to specific cases of the lp norm defined by

$$\|x\|_p = \Big( \sum_{i=1}^{n} |x_i|^p \Big)^{1/p} \qquad (1 \le p < \infty)$$

for the vector x = [x1, x2, ..., xn]^T.

Let us try to make φ(a, b) a minimum. By calculus, the conditions

$$\frac{\partial \varphi}{\partial a} = 0, \qquad \frac{\partial \varphi}{\partial b} = 0$$

(partial derivatives of φ with respect to a and b, respectively) are necessary at the minimum. Taking derivatives in Equation (2), we obtain

$$\sum_{k=0}^{m} 2(a x_k + b - y_k)\, x_k = 0$$

$$\sum_{k=0}^{m} 2(a x_k + b - y_k) = 0$$
This is a pair of simultaneous linear equations in the unknowns a and b. They are called the normal equations and can be written as

$$\begin{aligned}
\Big( \sum_{k=0}^{m} x_k^2 \Big) a + \Big( \sum_{k=0}^{m} x_k \Big) b &= \sum_{k=0}^{m} y_k x_k \\
\Big( \sum_{k=0}^{m} x_k \Big) a + (m + 1)\, b &= \sum_{k=0}^{m} y_k
\end{aligned} \tag{3}$$

Here, of course, $\sum_{k=0}^{m} 1 = m + 1$, which is the number of data points. To simplify the notation, we set

$$p = \sum_{k=0}^{m} x_k, \qquad q = \sum_{k=0}^{m} y_k, \qquad r = \sum_{k=0}^{m} x_k y_k, \qquad s = \sum_{k=0}^{m} x_k^2$$

The system of Equations (3) is now

$$\begin{bmatrix} s & p \\ p & m + 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ q \end{bmatrix}$$

We solve this pair of equations by Gaussian elimination and obtain the following algorithm. Alternatively, since this is a 2 × 2 linear system, we can use Cramer’s Rule∗ to solve it. The determinant of the coefficient matrix is

$$d = \operatorname{Det} \begin{bmatrix} s & p \\ p & m + 1 \end{bmatrix} = (m + 1)s - p^2$$

Moreover, we obtain

$$a = \frac{1}{d} \operatorname{Det} \begin{bmatrix} r & p \\ q & m + 1 \end{bmatrix} = \frac{1}{d} \big[ (m + 1) r - p q \big]$$

$$b = \frac{1}{d} \operatorname{Det} \begin{bmatrix} s & r \\ p & q \end{bmatrix} = \frac{1}{d} \big[ s q - p r \big]$$

We can write this as an algorithm:

■ Algorithm: Linear Least Squares
The coefficients in the least-squares line y = ax + b through the set of m + 1 data points (xk, yk) for k = 0, 1, 2, ..., m are computed (in order) as follows:
1. $p = \sum_{k=0}^{m} x_k$
2. $q = \sum_{k=0}^{m} y_k$
3. $r = \sum_{k=0}^{m} x_k y_k$
4. $s = \sum_{k=0}^{m} x_k^2$
5. $d = (m + 1)s - p^2$
6. $a = [(m + 1)r - pq]/d$
7. $b = [sq - pr]/d$

∗Cramer’s Rule is given in Appendix D.
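The seven steps of the linear least-squares algorithm translate almost line for line into code. The following is a minimal Python sketch, not the book's own program: the function names `linear_least_squares` and `phi` are illustrative choices, and the guard against d = 0 is an added assumption about how degenerate input should be handled. The sample data are the five points of the table used in Example 1 of this section.

```python
def linear_least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b.

    Follows steps 1-7 of the algorithm; len(xs) plays the role of m + 1.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    count = len(xs)                          # m + 1 in the text
    p = sum(xs)                              # step 1
    q = sum(ys)                              # step 2
    r = sum(x * y for x, y in zip(xs, ys))   # step 3
    s = sum(x * x for x in xs)               # step 4
    d = count * s - p * p                    # step 5
    if d == 0:
        # Assumption: treat coincident abscissas as an error.
        raise ValueError("all x values coincide; the line is not determined")
    a = (count * r - p * q) / d              # step 6
    b = (s * q - p * r) / d                  # step 7
    return a, b


def phi(a, b, xs, ys):
    """Sum of squared errors, Equation (2)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


xs = [4, 7, 11, 13, 17]
ys = [2, 0, 2, 6, 7]
a, b = linear_least_squares(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}, phi = {phi(a, b, xs, ys):.4f}")
# prints: a = 0.4864, b = -1.6589, phi = 10.7810
```

For these data the intermediate sums are p = 52, q = 17, r = 227, s = 644, and d = 516, so a = 251/516 and b = −856/516.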
Another form of this result is

$$a = \frac{1}{d} \Big[ (m + 1) \sum_{k=0}^{m} x_k y_k - \sum_{k=0}^{m} x_k \sum_{k=0}^{m} y_k \Big], \qquad
b = \frac{1}{d} \Big[ \sum_{k=0}^{m} x_k^2 \sum_{k=0}^{m} y_k - \sum_{k=0}^{m} x_k \sum_{k=0}^{m} x_k y_k \Big] \tag{4}$$

where

$$d = (m + 1) \sum_{k=0}^{m} x_k^2 - \Big( \sum_{k=0}^{m} x_k \Big)^2$$

The preceding analysis illustrates the least-squares procedure in the simple linear case.

EXAMPLE 1
As a concrete example, find the linear least-squares solution for the following table of values:

x    4    7    11    13    17
y    2    0    2     6     7

Plot the original data points and the line using a finer set of grid points.

Solution
The equations in Algorithm 1 lead to this system of two equations:

644a + 52b = 227
 52a +  5b =  17

whose solution is a = 0.4864 and b = −1.6589. By Equation (2), we obtain the value φ(a, b) = 10.7810. Figure 9.2 is a plot of the given data and the linear least-squares straight line. ■

[Figure 9.2: Linear least squares. The data points and the line y = ax + b, plotted for 0 ≦ x ≦ 20.]
We can use mathematical software such as MATLAB, Maple, or Mathematica to fit a linear least-squares polynomial to the data and verify the value of φ. (See Computer Exercise 9.1.5.) To understand what is going on here, we want to determine the equation of a line of the form y = ax + b that fits the data best in the least-squares sense. With four data points (xi, yi), we have four equations yi = axi + b for i = 1, 2, 3, 4 that can be written as

$$Ax = y$$

where

$$\begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ x_4 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}$$

In general, we want to solve a linear system

$$Ax = b$$

where A is an m × n matrix and m > n. The solution coincides with the solution of the normal equations

$$A^T A x = A^T b$$

This corresponds to minimizing $\|Ax - b\|_2$.

Nonpolynomial Example

The method of least squares is not restricted to linear (first-degree) polynomials or to any specific functional form. Suppose, for instance, that we want to fit a table of values (xk, yk), where k = 0, 1, …, m, by a function of the form

$$y = a \ln x + b \cos x + c e^x$$

in the least-squares sense. The unknowns in this problem are the three coefficients a, b, and c. We consider the function

$$\varphi(a, b, c) = \sum_{k=0}^{m} (a \ln x_k + b \cos x_k + c e^{x_k} - y_k)^2$$

and set ∂φ/∂a = 0, ∂φ/∂b = 0, and ∂φ/∂c = 0. This results in the following three normal equations:

$$\begin{aligned}
a \sum_{k=0}^{m} (\ln x_k)^2 + b \sum_{k=0}^{m} (\ln x_k)(\cos x_k) + c \sum_{k=0}^{m} (\ln x_k) e^{x_k} &= \sum_{k=0}^{m} y_k \ln x_k \\
a \sum_{k=0}^{m} (\ln x_k)(\cos x_k) + b \sum_{k=0}^{m} (\cos x_k)^2 + c \sum_{k=0}^{m} (\cos x_k) e^{x_k} &= \sum_{k=0}^{m} y_k \cos x_k \\
a \sum_{k=0}^{m} (\ln x_k) e^{x_k} + b \sum_{k=0}^{m} (\cos x_k) e^{x_k} + c \sum_{k=0}^{m} (e^{x_k})^2 &= \sum_{k=0}^{m} y_k e^{x_k}
\end{aligned}$$

EXAMPLE 2
Fit a function of the form y = a ln x + b cos x + ce^x to the following table values:

x    0.24   0.65    0.95    1.24    1.73   2.01   2.23    2.52   2.77   2.99
y    0.23   −0.26   −1.10   −0.45   0.27   0.10   −0.29   0.24   0.56   1.00

Solution
Using the table and the equations above, we obtain the 3 × 3 system

  6.79410a −  5.34749b +   63.25889c =  1.61627
 −5.34749a +  5.10842b −   49.00859c = −2.38271
 63.25889a − 49.00859b + 1002.50650c = 26.77277
It has the solution a = −1.04103, b = −1.26132, and c = 0.03073. So the curve

$$y = -1.04103 \ln x - 1.26132 \cos x + 0.03073 e^x$$

has the required form and fits the table in the least-squares sense. The value of φ(a, b, c) is 0.92557. Figure 9.3 is a plot of the given data and the nonpolynomial least-squares curve. ■

[Figure 9.3: Nonpolynomial least squares. The data points and the curve y = a ln x + b cos x + ce^x, plotted for 0 ≦ x ≦ 3.]
We can use mathematical software such as MATLAB, Maple, or Mathematica to verify these results and to plot the solution curve. (See Computer Exercise 9.1.6.)
Basis Functions {g0, g1, . . . , gn}
The principle of least squares, illustrated in these two simple cases, can be extended to general linear families of functions without involving any new ideas. Suppose that the data in Equation (1) are thought to conform to a relationship such as
$$y = \sum_{j=0}^{n} c_j g_j(x) \tag{5}$$

in which the functions g0, g1, …, gn (called basis functions) are known and held fixed. The coefficients c0, c1, …, cn are to be determined according to the principle of least squares. In other words, we define the expression

$$\varphi(c_0, c_1, \ldots, c_n) = \sum_{k=0}^{m} \Big[ \sum_{j=0}^{n} c_j g_j(x_k) - y_k \Big]^2 \tag{6}$$

and select the coefficients to make it as small as possible. Of course, the expression φ(c0, c1, …, cn) is the sum of the squares of the errors associated with each entry (xk, yk) in the given table.

Proceeding as before, we write down as necessary conditions for the minimum the n + 1 equations

$$\frac{\partial \varphi}{\partial c_i} = 0 \qquad (0 \le i \le n)$$

These partial derivatives are obtained from Equation (6). Indeed, we have

$$\frac{\partial \varphi}{\partial c_i} = \sum_{k=0}^{m} 2 \Big[ \sum_{j=0}^{n} c_j g_j(x_k) - y_k \Big] g_i(x_k) \qquad (0 \le i \le n)$$

When set equal to zero, the resulting equations can be rearranged as

$$\sum_{j=0}^{n} \Big[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) \Big] c_j = \sum_{k=0}^{m} y_k g_i(x_k) \qquad (0 \le i \le n) \tag{7}$$
These are the normal equations in this situation and serve to determine the best values of the parameters c0 , c1 , . . . , cn . The normal equations are linear in ci ; thus, in principle, they can be solved by the method of Gaussian elimination (see Chapter 2).
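To make the structure of the normal equations concrete, here is a minimal Python sketch that assembles the matrix and right-hand side of Equation (7) for an arbitrary list of basis functions and solves the system by Gaussian elimination. It is an illustration under stated assumptions rather than the book's routine: the names `normal_equations`, `solve`, and `phi` are hypothetical, partial pivoting is an added design choice, and the data are the table of Example 2 with the basis ln x, cos x, e^x.

```python
import math


def normal_equations(basis, xs, ys):
    """Return (A, rhs) for Equation (7): A[i][j] = sum_k g_i(x_k) g_j(x_k)."""
    G = [[g(x) for x in xs] for g in basis]      # G[i][k] = g_i(x_k)
    n = len(basis)
    A = [[sum(u * v for u, v in zip(G[i], G[j])) for j in range(n)]
         for i in range(n)]
    rhs = [sum(y * u for y, u in zip(ys, G[i])) for i in range(n)]
    return A, rhs


def solve(A, b):
    """Gaussian elimination with partial pivoting on a copy of [A | b]."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[piv][col] == 0.0:
            raise ValueError("singular normal equations (dependent basis?)")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):               # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def phi(coeffs, basis, xs, ys):
    """Sum of squared errors, Equation (6)."""
    return sum((sum(c * g(x) for c, g in zip(coeffs, basis)) - y) ** 2
               for x, y in zip(xs, ys))


# Table of Example 2 and its three basis functions.
xs = [0.24, 0.65, 0.95, 1.24, 1.73, 2.01, 2.23, 2.52, 2.77, 2.99]
ys = [0.23, -0.26, -1.10, -0.45, 0.27, 0.10, -0.29, 0.24, 0.56, 1.00]
basis = [math.log, math.cos, math.exp]

A, rhs = normal_equations(basis, xs, ys)
coeffs = solve(A, rhs)
print("coefficients:", coeffs)
print("phi:", phi(coeffs, basis, xs, ys))
```

The printed coefficients can be compared with the values of a, b, and c reported in Example 2. Forming the normal equations explicitly mirrors the text; for larger or poorly conditioned bases, other solution strategies may behave better numerically.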
In practice, the normal equations may be difficult to solve if care is not taken in choosing the basis functions g0, g1, …, gn. First, the set {g0, g1, …, gn} should be linearly independent. This means that no linear combination $\sum_{i=0}^{n} c_i g_i$ can be the zero function (except in the trivial case when c0 = c1 = ··· = cn = 0). Second, the functions g0, g1, …, gn should be appropriate to the problem at hand. Finally, one should choose a set of basis functions that is well conditioned for numerical work. We elaborate on this aspect of the problem in Section 9.2.
Summary 9.1
• We wish to find a line y = ax + b that most nearly passes through the m + 1 pairs of points (xi, yi) for 0 ≦ i ≦ m. An example of l1 approximation is to choose a and b so that the total absolute error for all these points is minimized:

$$\sum_{k=0}^{m} |a x_k + b - y_k|$$

This can be solved by the techniques of linear programming.

• An l2 approximation will minimize a different error function of a and b:

$$\varphi(a, b) = \sum_{k=0}^{m} (a x_k + b - y_k)^2$$

The minimization of φ produces a best estimate of a and b in the least-squares sense. One solves the normal equations

$$\begin{aligned}
\Big( \sum_{k=0}^{m} x_k^2 \Big) a + \Big( \sum_{k=0}^{m} x_k \Big) b &= \sum_{k=0}^{m} y_k x_k \\
\Big( \sum_{k=0}^{m} x_k \Big) a + (m + 1)\, b &= \sum_{k=0}^{m} y_k
\end{aligned}$$

• In a more general case, the data points conform to a relationship such as

$$y = \sum_{j=0}^{n} c_j g_j(x)$$

in which the basis functions g0, g1, …, gn are known and held fixed. The coefficients c0, c1, …, cn are to be determined according to the principle of least squares. The normal equations in this situation are

$$\sum_{j=0}^{n} \Big[ \sum_{k=0}^{m} g_i(x_k) g_j(x_k) \Big] c_j = \sum_{k=0}^{m} y_k g_i(x_k) \qquad (0 \le i \le n)$$

and can be solved, in principle, by the method of Gaussian elimination to determine the best values of the parameters c0, c1, …, cn.
Exercises 9.1

ᵃ1. Using the method of least squares, find the constant function that best fits the following data:

x    −1     2     3
y    5/4    4/3   5/12

ᵃ2. Determine the constant function c that is produced by the least-squares theory applied to a table of m + 1 data points. Does the resulting formula involve the points xk in any way? Apply your general formula to the preceding exercise.

ᵃ3. Find an equation of the form $y = a e^{x^2} + b x^3$ that best fits the points (−1, 0), (0, 1), and (1, 2) in the least-squares sense.

4. Suppose that the x points in a table of m + 1 data points are situated symmetrically about 0 on the x-axis. In this case, there is an especially simple formula for the line that best fits the points. Find it.

ᵃ5. Find the equation of a parabola of form y = ax² + b that best represents the following data. Use the method of least squares.

x    −1    0     1
y    3.1   0.9   2.9

6. Suppose that a table of m + 1 data points is known to conform to a function like y = x² − x + c. What value of c is obtained by the least-squares theory?

ᵃ7. Suppose that a table of m + 1 data points is thought to be represented by a function y = c log x. If so, what value for c emerges from the least-squares theory?

8. Show that Equation (4) is the solution of Equation (3).

9. (Continuation) How do we know that divisor d is not zero? In fact, show that d is positive for m ≧ 1.
Hint: Show that
$$d = \sum_{k=0}^{m} \sum_{l=0}^{k-1} (x_k - x_l)^2$$
by induction on m. The Cauchy-Schwarz inequality can also be used to prove that d > 0.

10. (Continuation) Show that a and b can also be computed as follows:
$$\bar{x} = \frac{1}{m+1} \sum_{k=0}^{m} x_k, \qquad \bar{y} = \frac{1}{m+1} \sum_{k=0}^{m} y_k, \qquad c = \sum_{k=0}^{m} (x_k - \bar{x})^2$$
$$a = \frac{1}{c} \sum_{k=0}^{m} (x_k - \bar{x})(y_k - \bar{y}), \qquad b = \bar{y} - a \bar{x}$$
Hint: Show that d = (m + 1)c.

ᵃ11. How do we know that the coefficients c0, c1, ..., cn that satisfy the normal Equations (7) do not lead to a maximum in the function defined by Equation (6)?

ᵃ12. If a table of m + 1 data points is thought to conform to a relationship y = log(cx), what is the value of c obtained by the method of least squares?

ᵃ13. What straight line best fits the following data in the least-squares sense?

x    1   2   3   4
y    0   1   1   2

14. In analytic geometry, we learn that the distance from a point (x0, y0) to a line represented by the equation ax + by = c is $(a x_0 + b y_0 - c)(a^2 + b^2)^{-1/2}$. Determine a straight line that fits a table of data points (xi, yi), for 0 ≦ i ≦ m, in such a way that the sum of the squares of the distances from the points to the line is minimized.

15. Show that if a straight line is fitted to a table (xi, yi) by the method of least squares, then the line will pass through the point (x∗, y∗), where x∗ and y∗ are the averages of the xi’s and yi’s, respectively.

ᵃ16. The viscosity V of a liquid is known to vary with temperature according to a quadratic law V = a + bT + cT². Find the best values of a, b, and c for the following table:

T    1      2      3      4      5      6      7
V    2.31   2.01   1.80   1.66   1.55   1.47   1.41

17. An experiment involves two independent variables x and y and one dependent variable z. How can a function z = a + bx + cy be fitted to the table of points (xk, yk, zk)? Give the normal equations.

ᵃ18. Find the best function (in the least-squares sense) that fits the following data points and is of the form f(x) = a sin πx + b cos πx:

x    −1    −1/2   0   1/2   1
y    −1    0      1   2     1

ᵃ19. Find the quadratic polynomial that best fits the following data in the sense of least squares:

x    −2   −1   0   1   2
y    2    1    1   1   2

ᵃ20. What line best represents the following data in the least-squares sense?

x    0   1    2
y    5   −6   7

ᵃ21. What constant c makes the expression
$$\sum_{k=0}^{m} [\,\cdots\,]$$
as small as possible?

22. Show that the formula for the best line to fit data (k, yk) at the integers k for 1 ≦ k ≦ n is
y = ax + b
where
$$a = \frac{6}{n(n^2 - 1)} \Big[ 2 \sum_{k=1}^{n} k y_k - (n + 1) \sum_{k=1}^{n} y_k \Big]$$
$$b = \frac{2}{n(n - 1)} \Big[ (2n + 1) \sum_{k=1}^{n} y_k - 3 \sum_{k=1}^{n} k y_k \Big]$$

Establish the normal equations and verify the results in Example 1.

A vector v is asserted to be the least-squares solution of an inconsistent system Ax = b. How can we test v without going through the entire least-squares procedure?

Find the normal equations for the following data points:

x    1.0   2.0   2.5   3.0
y    3.7   4.1   4.3   5.0

Determine the straight line that best fits the data in the least-squares sense. Plot the data points and the least-squares line.

For the case n = 4, show directly that by forming the normal equations from the data points (xi, yi), we obtain the results in Theorem 1.

x    1   2   3   4   5   6

Computer Exercises 9.1

1. Write a procedure that sets up the normal system of Equations (7). Using that procedure and other routines, such as Gauss and Solve from Section 2.2, verify the solution given for the problem involving ln x, cos x, and e^x in the subsection entitled “Nonpolynomial Example.”

2. Write a procedure that fits a straight line to Table (1). Use this procedure to find the constants in the equation S = aT + b for the table in the example that begins this chapter (p. 426). Also, verify the results obtained for Example 1 (p. 429).

3. Write and test a program that takes m + 1 points in the plane (xi, yi), where 0 ≦ i ≦ m, with x0
Statistical theory tells us that if the trend of the table is truly a polynomial of degree N (but infected by noise), then

$$\sigma_0^2 > \sigma_1^2 > \cdots > \sigma_N^2 = \sigma_{N+1}^2 = \sigma_{N+2}^2 = \cdots = \sigma_{m-1}^2$$

This fact suggests the following strategy for dealing with the case in which N is not known: Compute $\sigma_0^2, \sigma_1^2, \ldots$ in succession. As long as these are decreasing significantly, continue the calculation. When an integer N is reached for which $\sigma_N^2 \approx \sigma_{N+1}^2 \approx \sigma_{N+2}^2 \approx \cdots$, stop and declare pN to be the polynomial sought.

If $\sigma_0^2, \sigma_1^2, \ldots$ are to be computed directly from the definition in Equation (6), then each of the polynomials p0, p1, … will have to be determined. The procedure described next can avoid the determination of all but the one desired polynomial.

In the remainder of the discussion, the abscissas xi are to be held fixed. These points are assumed to be distinct, although the theory can be extended to include cases in which some points repeat. If f and g are two functions whose domains include the points {x0, x1, ..., xm}, then the following notation is used:

$$\langle f, g \rangle = \sum_{i=0}^{m} f(x_i)\, g(x_i) \tag{7}$$

This quantity is called the inner product of f and g. Much of our discussion does not depend on the exact form of the inner product but only on certain of its properties. An inner product ⟨· , ·⟩ has the following properties:

■ Properties: Defining Properties of an Inner Product
1. ⟨f, g⟩ = ⟨g, f⟩
2. ⟨f, f⟩ > 0 unless f(xi) = 0 for all i
3. ⟨af, g⟩ = a⟨f, g⟩ where a ∈ R
4. ⟨f, g + h⟩ = ⟨f, g⟩ + ⟨f, h⟩
The reader should verify that the inner product defined in Equation (7) has the properties listed.
A set of functions is now said to be orthogonal if ⟨ f, g⟩ = 0 for any two different functions f and g in that set. An orthogonal set of polynomials can be generated recursively by the following formulas:
$$\begin{aligned}
q_0(x) &= 1 \\
q_1(x) &= x - \alpha_0 \\
q_{n+1}(x) &= x q_n(x) - \alpha_n q_n(x) - \beta_n q_{n-1}(x) \qquad (n \ge 1)
\end{aligned}$$

where

$$\alpha_n = \frac{\langle x q_n, q_n \rangle}{\langle q_n, q_n \rangle}, \qquad \beta_n = \frac{\langle x q_n, q_{n-1} \rangle}{\langle q_{n-1}, q_{n-1} \rangle}$$
In these formulas, a slight abuse of notation occurs where “xqn” is used to denote the function whose value at x is xqn(x).
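Because the inner product of Equation (7) depends only on function values at the abscissas, the recurrence can be carried out on vectors of values q_n(x_i). The following Python sketch does this; it is illustrative only, and the names `inner` and `orthogonal_polynomials` as well as the choice to store values rather than polynomial coefficients are assumptions made for the example.

```python
def inner(f_vals, g_vals):
    """Discrete inner product of Equation (7), applied to value vectors."""
    return sum(f * g for f, g in zip(f_vals, g_vals))


def orthogonal_polynomials(xs, count):
    """Generate q_0, ..., q_{count-1} by the three-term recurrence.

    Returns (alphas, betas, Q), where Q[n][i] = q_n(xs[i]).
    betas[0] is a placeholder (beta_0 is not used by the recurrence).
    """
    if count < 1 or count > len(xs) - 1:
        # The text works with q_0, ..., q_{m-1} for m + 1 distinct points.
        raise ValueError("count must satisfy 1 <= count <= len(xs) - 1")
    q_prev = None
    q = [1.0] * len(xs)                           # q_0(x) = 1
    Q, alphas, betas = [q], [], []
    for n in range(count - 1):
        xq = [x * v for x, v in zip(xs, q)]       # values of x * q_n(x)
        alpha = inner(xq, q) / inner(q, q)
        if q_prev is None:                        # q_1 = x q_0 - alpha_0 q_0
            beta = 0.0
            q_next = [u - alpha * v for u, v in zip(xq, q)]
        else:
            beta = inner(xq, q_prev) / inner(q_prev, q_prev)
            q_next = [u - alpha * v - beta * w
                      for u, v, w in zip(xq, q, q_prev)]
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, q_next
        Q.append(q)
    return alphas, betas, Q


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
alphas, betas, Q = orthogonal_polynomials(xs, 4)
for i in range(len(Q)):
    for j in range(i):
        print(i, j, inner(Q[i], Q[j]))            # each value should be near 0
```

For the five equally spaced points used here, the recurrence gives q1(x) = x − 2 and q2(x) = x² − 4x + 2, whose values at the points are (−2, −1, 0, 1, 2) and (2, −1, −2, −1, 2).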
To understand how this definition leads to an orthogonal system, let’s examine a few cases. First, we have
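$$\langle q_1, q_0 \rangle = \langle x - \alpha_0, q_0 \rangle = \langle x q_0 - \alpha_0 q_0, q_0 \rangle = \langle x q_0, q_0 \rangle - \alpha_0 \langle q_0, q_0 \rangle = 0$$

Notice that several properties of an inner product listed previously have been used here. Also, the definition of α0 was used. Another of the first few cases is this:

$$\begin{aligned}
\langle q_2, q_1 \rangle &= \langle x q_1 - \alpha_1 q_1 - \beta_1 q_0, q_1 \rangle \\
&= \langle x q_1, q_1 \rangle - \alpha_1 \langle q_1, q_1 \rangle - \beta_1 \langle q_0, q_1 \rangle = 0
\end{aligned}$$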
Here, the definition of α1 has been used, as well as the fact (established above) that ⟨q1 , q0 ⟩ = 0. The next step in a formal proof is to verify that ⟨q2,q0⟩ = 0. Then an inductive proof completes the argument.
One part of this proof consists in showing that the coefficients αn and βn are well defined. This means that the denominators ⟨qn, qn⟩ are not zero. To verify that this is the case, suppose that ⟨qn, qn⟩ = 0. Then $\sum_{i=0}^{m} [q_n(x_i)]^2 = 0$, and consequently, qn(xi) = 0 for each value of i. This means that the polynomial qn has m + 1 roots, x0, x1, …, xm. Since the degree n is less than m, we conclude that qn is the zero polynomial. However, this is not possible because obviously

q0(x) = 1
q1(x) = x − α0
q2(x) = x² + (lower-order terms)

and so on. Observe that this argument requires n < m.
The system of orthogonal polynomials {q0, q1, ..., q_{m−1}} generated by the above algorithm is a basis for the vector space $\Pi_{m-1}$ of all polynomials of degree at most m − 1. It is clear from the algorithm that each qn starts with the highest term x^n. If it is desired to express a given polynomial p of degree n (n ≦ m − 1) as a linear combination of q0, q1, ..., qn, this can be done as follows: Set

$$p = \sum_{i=0}^{n} a_i q_i \tag{8}$$

On the right-hand side, only one summand contains x^n. It is the term an qn. On the left-hand side, there is also a term in x^n. One chooses an so that an x^n on the right is equal to the corresponding term in p. Now write

$$p - a_n q_n = \sum_{i=0}^{n-1} a_i q_i$$

On both sides of this equation, there are polynomials of degree at most n − 1 (because of the choice of an). Hence, we can now choose a_{n−1} in the way we chose an; that is, choose a_{n−1} so that the terms in x^{n−1} are the same on both sides. By continuing in this way, we discover the unique values that the coefficients ai must have. This establishes that {q0, q1, ..., qn} is a basis for $\Pi_n$, for n = 0, 1, ..., m − 1.

Another way of determining the coefficients ai (once we know that they exist!) is to take the inner product of both sides of Equation (8) with qj. The result is

$$\langle p, q_j \rangle = \sum_{i=0}^{n} a_i \langle q_i, q_j \rangle \qquad (0 \le j \le n)$$

Since the set q0, q1, ..., qn is orthogonal, ⟨qi, qj⟩ = 0 for each i different from j. Hence, we obtain

$$\langle p, q_j \rangle = a_j \langle q_j, q_j \rangle$$

This gives aj as a quotient of two inner products.

Least-Squares Problem Revisited

Now we return to the least-squares problem. Let F be a function that we wish to fit by a polynomial pn of degree n. We shall find the polynomial that minimizes the expression

$$\sum_{i=0}^{m} [F(x_i) - p_n(x_i)]^2$$

The solution is given by the formulas

$$p_n = \sum_{i=0}^{n} c_i q_i, \qquad c_i = \frac{\langle F, q_i \rangle}{\langle q_i, q_i \rangle} \tag{9}$$

It is especially noteworthy that ci does not depend on n. This implies that the various polynomials p0, p1, ... that we are seeking can all be obtained by simply truncating one series, namely, $\sum_{i=0}^{m-1} c_i q_i$. To prove that pn, as given in Equation (9), solves our problem, we return to the normal equation, Equation (1). The basis functions now being used are q0, q1, ..., qn. Thus, the normal equations are

$$\sum_{j=0}^{n} \Big[ \sum_{k=0}^{m} q_i(x_k) q_j(x_k) \Big] c_j = \sum_{k=0}^{m} y_k q_i(x_k) \qquad (0 \le i \le n)$$

Using the inner product notation, we get

$$\sum_{j=0}^{n} \langle q_i, q_j \rangle c_j = \langle F, q_i \rangle \qquad (0 \le i \le n)$$

where F is some function such that F(xk) = yk for 0 ≦ k ≦ m. Next, apply the orthogonality property ⟨qi, qj⟩ = 0 when i ≠ j. The result is

$$\langle q_i, q_i \rangle c_i = \langle F, q_i \rangle \qquad (0 \le i \le n) \tag{10}$$

Now we return to the variance numbers $\sigma_0^2, \sigma_1^2, \ldots$ and show how they can be easily computed. First, an important observation: The set {q0, q1, ..., qn, F − pn} is orthogonal! The only new fact here is that ⟨F − pn, qi⟩ = 0 for 0 ≦ i ≦ n. To check this, write

$$\begin{aligned}
\langle F - p_n, q_i \rangle &= \langle F, q_i \rangle - \langle p_n, q_i \rangle \\
&= \langle F, q_i \rangle - \Big\langle \sum_{j=0}^{n} c_j q_j, q_i \Big\rangle \\
&= \langle F, q_i \rangle - \sum_{j=0}^{n} c_j \langle q_j, q_i \rangle \\
&= \langle F, q_i \rangle - c_i \langle q_i, q_i \rangle = 0
\end{aligned}$$

In this computation, we used Equations (9) and (10). Since pn is a linear combination of q0, q1, ..., qn, it follows easily that

$$\langle F - p_n, p_n \rangle = 0$$

Now recall that the variance $\sigma_n^2$ was defined by

$$\sigma_n^2 = \frac{\rho_n}{m - n}, \qquad \rho_n = \sum_{i=0}^{m} [y_i - p_n(x_i)]^2$$

The quantities ρn can be written in another way:

$$\begin{aligned}
\rho_n &= \langle F - p_n, F - p_n \rangle \\
&= \langle F - p_n, F \rangle \\
&= \langle F, F \rangle - \langle F, p_n \rangle \\
&= \langle F, F \rangle - \sum_{i=0}^{n} c_i \langle F, q_i \rangle \\
&= \langle F, F \rangle - \sum_{i=0}^{n} \frac{\langle F, q_i \rangle^2}{\langle q_i, q_i \rangle}
\end{aligned}$$

Thus, the numbers ρ0, ρ1, ... can be generated recursively by the algorithm

$$\begin{aligned}
\rho_0 &= \langle F, F \rangle - \frac{\langle F, q_0 \rangle^2}{\langle q_0, q_0 \rangle} \\
\rho_n &= \rho_{n-1} - \frac{\langle F, q_n \rangle^2}{\langle q_n, q_n \rangle} \qquad (n \ge 1)
\end{aligned}$$
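The recurrences for q_n, c_n, and ρ_n combine into a short procedure that produces the variances without forming each polynomial p_n explicitly. The Python sketch below is one possible arrangement under stated assumptions: the name `regression_variances` and the tuple layout of its result are hypothetical, the polynomials are represented by their values at the abscissas, and the variance is taken as ρ_n/(m − n) as recalled in the text.

```python
def regression_variances(xs, ys, max_degree):
    """Return a list of (c_n, rho_n, sigma2_n) for n = 0, ..., max_degree.

    xs are assumed distinct; max_degree must be less than m = len(xs) - 1
    so that the divisor m - n stays positive.
    """
    m = len(xs) - 1
    if len(ys) != len(xs) or not 0 <= max_degree < m:
        raise ValueError("need len(xs) == len(ys) and 0 <= max_degree < m")

    def inner(u, v):
        return sum(a * b for a, b in zip(u, v))

    q_prev, q = None, [1.0] * len(xs)     # values of q_{n-1} and q_n
    rho = inner(ys, ys)                   # starts at <F, F>
    results = []
    for n in range(max_degree + 1):
        qq = inner(q, q)
        fq = inner(ys, q)
        c = fq / qq                       # Equation (9)
        rho -= fq * fq / qq               # recurrence for rho_n
        results.append((c, rho, rho / (m - n)))
        # Advance the three-term recurrence to q_{n+1}.
        xq = [x * v for x, v in zip(xs, q)]
        alpha = inner(xq, q) / qq
        if q_prev is None:
            q_next = [u - alpha * v for u, v in zip(xq, q)]
        else:
            beta = inner(xq, q_prev) / inner(q_prev, q_prev)
            q_next = [u - alpha * v - beta * w
                      for u, v, w in zip(xq, q, q_prev)]
        q_prev, q = q, q_next
    return results


# Noise-free quadratic data: rho_n should drop to (nearly) zero at n = 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x * x for x in xs]
for n, (c, rho, sigma2) in enumerate(regression_variances(xs, ys, 3)):
    print(n, c, rho, sigma2)
```

Because ρ_n is obtained by repeated subtraction from ⟨F, F⟩, roundoff can leave a tiny nonzero (even slightly negative) value where the exact result is zero; a practical stopping test would compare successive variances with a tolerance rather than test for equality.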
Gram-Schmidt Process

The projection operator is defined as

$$\operatorname{proj}_y x = \frac{\langle x, y \rangle}{\langle y, y \rangle}\, y$$

that projects the vector x orthogonally onto the vector y. The Gram-Schmidt process can be written as

$$\begin{aligned}
z_1 &= v_1, & q_1 &= \frac{z_1}{\|z_1\|} \\
z_2 &= v_2 - \operatorname{proj}_{z_1} v_2, & q_2 &= \frac{z_2}{\|z_2\|} \\
z_3 &= v_3 - \operatorname{proj}_{z_1} v_3 - \operatorname{proj}_{z_2} v_3, & q_3 &= \frac{z_3}{\|z_3\|}
\end{aligned}$$

In general, the kth step is

$$z_k = v_k - \sum_{j=1}^{k-1} \operatorname{proj}_{z_j} v_k, \qquad q_k = \frac{z_k}{\|z_k\|}$$
Here {z1, z2, z3, ..., zk} is an orthogonal set and {q1, q2, q3, ..., qk} is an orthonormal set. When implemented on a computer, the Gram-Schmidt process is numerically unstable because roundoff errors can cause the computed vectors zk to lose orthogonality. By a minor modification, the Gram-Schmidt process can be stabilized. Instead of computing each vector zk in a single step as above, it can be computed one term at a time. A computer algorithm for the modified Gram-Schmidt process is:

for j = 1 to k
    for i = 1 to j − 1
        s ← ⟨vj, vi⟩
        vj ← vj − s vi
    end for
    vj ← vj/||vj||
end for
Here the vectors v1, v2, ..., vk are replaced with orthonormal vectors that span the same subspace. The i-loop removes components in the vi direction, followed by normalization of the vector. In exact arithmetic, this computation gives the same results as the original form above. However, it produces smaller errors in finite-precision computer arithmetic.

EXAMPLE 1
Consider the vectors v1 = (1, ε, 0, 0), v2 = (1, 0, ε, 0), and v3 = (1, 0, 0, ε). Assume ε is a small number, so small that ε² is negligible and 1 + ε² rounds to 1 in the computer arithmetic. Carry out the standard Gram-Schmidt procedure and the modified Gram-Schmidt procedure. Check the orthogonality conditions of the resulting vectors.

Solution
Using the classical Gram-Schmidt process, we obtain u1 = (1, ε, 0, 0), u2 = (0, −1, 1, 0)/√2, and u3 = (0, −1, 0, 1)/√2. Using the modified Gram-Schmidt process, we find z1 = (1, ε, 0, 0), z2 = (0, −1, 1, 0)/√2, and z3 = (0, −1, −1, 2)/√6. Checking orthogonality, we find ⟨u2, u3⟩ = 1/2 and ⟨z2, z3⟩ = 0. ■
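The contrast in this example can be reproduced in double-precision arithmetic. The sketch below is a minimal Python illustration, with hypothetical function names and with ε = 10⁻⁸ chosen (as an assumption) so that 1 + ε² rounds to 1. The only difference between the two routines is whether the projection coefficients are computed from the original vector or from the partially reduced one.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def classical_gram_schmidt(vectors):
    """Classical form: every coefficient uses the ORIGINAL vector v."""
    basis = []
    for v in vectors:
        coeffs = [dot(v, q) for q in basis]
        w = list(v)
        for s, q in zip(coeffs, basis):
            w = [wi - s * qi for wi, qi in zip(w, q)]
        basis.append(normalize(w))
    return basis


def modified_gram_schmidt(vectors):
    """Modified form: each coefficient uses the UPDATED vector w."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            s = dot(w, q)
            w = [wi - s * qi for wi, qi in zip(w, q)]
        basis.append(normalize(w))
    return basis


eps = 1e-8   # assumption: small enough that 1 + eps**2 == 1 in doubles
vs = [[1.0, eps, 0.0, 0.0], [1.0, 0.0, eps, 0.0], [1.0, 0.0, 0.0, eps]]

cgs = classical_gram_schmidt(vs)
mgs = modified_gram_schmidt(vs)
print("classical <u2, u3> =", dot(cgs[1], cgs[2]))   # about 0.5
print("modified  <z2, z3> =", dot(mgs[1], mgs[2]))   # near zero
```

Both routines normalize each vector as soon as it is produced, which matches the loop structure of the modified algorithm given above.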
Summary 9.2
• WeuseChebyshevpolynomials{Tj}asanorthogonalbasisthatcanbegeneratedrecur- sively by
T j (x ) = 2x T j −1 (x ) − T j −2 (x ) ( j ≧ 2)
with T0(x) = 1 and T1(x) = x. The coefficient matrix A = (ai j )0:n×0:n and the right-hand
side b = (bi )0:n of the normal equations are
m k=0
m
k=0
A linear combination of Chebyshev polynomials
n j=0
aij = bi =
Ti(zk)Tj(zk) ykTi(zk)
(0≦i, j≦n) (0≦i≦n)
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
g(x) =
cj Tj (x)
446
Chapter 9
Least Squares Methods and Fourier Series
1.
a2.
a3.
4.
a5.
a6. a7.
a
Let g0,g1,...,gn be a set of functions such that mk=0 gi(xk)gj(xk) = 0 if i ≠ j. What linear combina- tion of these functions best fits the data at the beginning
of Section 9.1 (p. 426)?
Considerpolynomialsg0,g1,...,gn definedbyg0(x)= 1, g1(x) = x − 1, and gj(x) = 3xgj−1(x) + 2gj−2(x). Develop an efficient algorithm for computing values of thefunction f(x)= nj=0cjgj(x).
Showthatcosnθ=2cosθcos(n−1)θ−cos(n−2)θ. Hint: Use the familiar identity cos(A ∓ B) = cosAcosB±sinAsinB.
(Continuation) Show that if fn(x) = cos(n arccos x), then f0(x) = 1, f1(x) = x, and fn(x) = 2x fn−1(x) −
fn−2(x).
(Continuation) Show that an alternate definition of Chebyshev polynomials is Tn (x ) = cos(n arccos x ) for −1 ≦ x ≦ 1.
(Continuation) Give a one-line proof that Tn (Tm (x )) = Tnm(x).
(Continuation) Show that |Tn (x )| ≦ 1 for x in the interval
can be evaluated recursively:
$$w_{n+2} = w_{n+1} = 0$$
$$w_j = c_j + 2x\,w_{j+1} - w_{j+2} \qquad (j = n, n-1, \ldots, 0)$$
$$g(x) = w_0 - x\,w_1$$
• We discuss smoothing of data by polynomial regression.
[−1, 1].
8. Define $g_k(x) = T_k\bigl(\tfrac{1}{2}x + \tfrac{1}{2}\bigr)$. What recursive relation do these functions satisfy?
9. Show that $T_0, T_2, T_4, \ldots$ are even and that $T_1, T_3, \ldots$ are odd functions. Recall that an even function satisfies the equation $f(x) = f(-x)$; an odd function satisfies the equation $f(x) = -f(-x)$.
a10. Count the number of operations involved in the algorithm used to compute $g(x) = \sum_{j=0}^{n} c_j T_j(x)$.
11. Show that the algorithm for computing $g(x) = \sum_{j=0}^{n} c_j T_j(x)$ can be modified to read
$$w_{n-1} = c_{n-1} + 2x\,c_n$$
$$w_k = c_k + 2x\,w_{k+1} - w_{k+2} \qquad (n-2 \geqq k \geqq 1)$$
$$g(x) = c_0 + x\,w_1 - w_2$$
thus making $w_{n+2}$, $w_{n+1}$, and $w_0$ unnecessary.
a12. (Continuation) Count the operations for the algorithm in the preceding problem.
a13. Determine $T_6(x)$ as a polynomial in $x$.
14. Verify the four properties of an inner product that were listed in the text, using Definition (7).
15. Verify these formulas:
$$p_0(x) = \frac{1}{m+1}\sum_{i=0}^{m} y_i, \qquad \beta_n = \frac{\langle q_n, q_n\rangle}{\langle q_{n-1}, q_{n-1}\rangle}, \qquad c_n = \frac{\rho_{n-1} - \rho_n}{\langle F, q_n\rangle}$$
16. Complete the proof that the algorithm for generating the orthogonal system of polynomials works.
a17. There is a function $f$ of the form
$$f(x) = \alpha x^{12} + \beta x^{13}$$
for which $f(0.1) = 6 \times 10^{-13}$ and $f(0.9) = 3 \times 10^{-2}$. What is it? Are $\alpha$ and $\beta$ sensitive to perturbations in the two given values of $f(x)$?
18. (Multiple Choice) Let $x_1 = [2, 2, 1]^T$, $x_2 = [1, 1, 5]^T$, and $x_3 = [-3, 2, 1]^T$. If the Gram-Schmidt process is applied to this ordered set of vectors to produce an orthonormal set $\{u_1, u_2, u_3\}$, what is $u_1$?
a. $\bigl[\tfrac{2}{3}, \tfrac{2}{3}, \tfrac{1}{3}\bigr]^T$  b. $[2, 2, 1]^T$  c. $\bigl[\tfrac{2}{5}, \tfrac{2}{5}, \tfrac{1}{5}\bigr]^T$  d. $[1, 0, 0]^T$  e. None of these.
19. (Multiple Choice, continued) What is $u_2$?
a. $\tfrac{1}{\sqrt{27}}[1, 1, 5]^T$  b. $\tfrac{1}{\sqrt{18}}[-1, -1, 4]^T$  c. $[2, 2, 1]^T$  d. $[1, 1, -4]^T$  e. None of these.
Computer Exercises 9.2
1. Carry out an experiment in data smoothing as follows: Start with a polynomial of modest degree, say, 7. Compute 100 values of this polynomial at random points in the interval [−1, 1]. Perturb these values by adding random numbers chosen from a small interval, say, $\bigl[-\tfrac{1}{8}, \tfrac{1}{8}\bigr]$. Try to recover the polynomial from these perturbed values by using the method of least squares.
2. Write real function Cheb(n, x) for evaluating $T_n(x)$. Use the recursive formula satisfied by Chebyshev polynomials. Do not use a subscripted variable. Test the program on these 15 cases: n = 0, 1, 3, 6, 12 and x = 0, −1, 0.5.
3. Write real function Cheb(n, x, (y_i)) to calculate $T_0(x), T_1(x), \ldots, T_n(x)$, and store these numbers in the array (y_i). Use your routine, together with suitable plotting routines, to obtain graphs of $T_0, T_1, T_2, \ldots, T_8$ on [−1, 1].
4. Write real function F(n, (c_i), x) for evaluating $f(x) = \sum_{j=0}^{n} c_j T_j(x)$. Test your routine by means of the formula
$$\sum_{k=0}^{\infty} t^k T_k(x) = \frac{1 - tx}{1 - 2tx + t^2}$$
valid for $|t| < 1$. If $|t| \leqq \tfrac{1}{2}$, then only a few terms of the series are needed to give full machine precision. Add terms in ascending order of magnitude.
5. Obtain a graph of $T_n$ for some reasonable value of n by means of the following idea: Generate 100 equally spaced angles $\theta_i$ in the interval $[0, \pi]$. Define $x_i = \cos\theta_i$ and $y_i = T_n(x_i) = \cos(n \arccos x_i) = \cos n\theta_i$. Send the points $(x_i, y_i)$ to a suitable plotting routine.
6. Write suitable code to carry out the procedure outlined in the text for fitting a table with a linear combination of Chebyshev polynomials. Test it in the manner of Computer Exercise 9.2.1, first by using an unperturbed polynomial. Find out experimentally how large n can be in this process before roundoff errors become serious.
a7. Define $x_k = \cos[(2k - 1)\pi/(2m)]$. Select modest values of n and m > 2n. Compute and print the matrix A whose elements are
$$a_{ij} = \sum_{k=0}^{m} T_i(x_k)\,T_j(x_k) \qquad (0 \leqq i, j \leqq n)$$
Interpret the results in terms of the least-squares polynomial-fitting problem.
8. Program the algorithm for finding $\sigma_0^2, \sigma_1^2, \ldots$ in the polynomial regression problem.
9. Program the complete polynomial regression algorithm. The output should be $\alpha_n$, $\beta_n$, $\sigma_n^2$, and $c_n$ for $0 \leqq n \leqq N$, where N is determined by the condition $\sigma_{N-1}^2 > \sigma_N^2 \approx \sigma_{N+1}^2$.
10. Using orthogonal polynomials, find the quadratic polynomial that fits the following data in the sense of least squares:
a.
x | −1 | −1/2 | 0 | 1/2 | 1
y | −1 | 0 | 1 | 2 | 1
b.
x | −2 | −1 | 0 | 1 | 2
y | 2 | 1 | 1 | 1 | 2
9.3 Examples of the Least-Squares Principle
Inconsistent Systems
The principle of least squares is also used in other situations. In one of these, we attempt to solve an inconsistent system of linear equations of the form
$$\sum_{j=0}^{n} a_{kj} x_j = b_k \qquad (0 \leqq k \leqq m) \tag{1}$$
in which m > n. Here, there are m + 1 equations, but only n + 1 unknowns. If a given (n + 1)-tuple $(x_0, x_1, \ldots, x_n)$ is substituted on the left, the discrepancy between the two sides of the kth equation is termed the kth residual. Ideally, of course, all residuals should be zero. If it is not possible to select $(x_0, x_1, \ldots, x_n)$ so as to make all residuals zero, System (1) is said to be inconsistent. In this case, we minimize the sum of the squares of the residuals and are led to minimize the expression
$$\phi(x_0, x_1, \ldots, x_n) = \sum_{k=0}^{m} \Bigl( \sum_{j=0}^{n} a_{kj} x_j - b_k \Bigr)^2 \tag{2}$$
Setting the partial derivatives of $\phi$ with respect to each $x_i$ equal to zero gives the normal equations
$$\sum_{j=0}^{n} \Bigl( \sum_{k=0}^{m} a_{ki} a_{kj} \Bigr) x_j = \sum_{k=0}^{m} b_k a_{ki} \qquad (0 \leqq i \leqq n) \tag{3}$$
This is a linear system of just n + 1 equations involving unknowns $x_0, x_1, \ldots, x_n$. It can be shown that this system is consistent, provided that the column vectors in the original coefficient array are linearly independent. System (3) can be solved, for instance, by Gaussian elimination. The solution of System (3) is then a best approximate solution of Equation (1) in the least-squares sense.
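As a rough illustration of the normal-equations approach, the following Python sketch forms the coefficient matrix with entries $\sum_k a_{ki} a_{kj}$ and the right-hand side $\sum_k b_k a_{ki}$, then solves the small square system by Gaussian elimination with partial pivoting. The function names and the 3 × 2 test system are assumptions made for this illustration, not part of the text, and no attempt is made to guard against a singular system.

```python
def normal_equations(a, b):
    """Return (A^T A, A^T b) for the overdetermined system A x = b."""
    rows, cols = len(a), len(a[0])
    ata = [[sum(a[k][i] * a[k][j] for k in range(rows)) for j in range(cols)]
           for i in range(cols)]
    atb = [sum(a[k][i] * b[k] for k in range(rows)) for i in range(cols)]
    return ata, atb


def gauss_solve(m, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def least_squares(a, b):
    """Least-squares solution of A x = b via the normal equations."""
    ata, atb = normal_equations(a, b)
    return gauss_solve(ata, atb)


# Hypothetical inconsistent system: 3 equations, 2 unknowns.
A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
b = [1.0, -1.0, 1.0]
x = least_squares(A, b)   # approximately [4/3, -2/3]
```

Forming the normal equations squares the condition number of the problem, which is one reason factorization-based methods are often preferred in practice.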
Modified Gram-Schmidt Process
Special methods have been devised for the problem just discussed. Generally, they gain in precision over the simple approach outlined above. One such algorithm for solving System (1),
$$Ax = b$$
begins by factoring
$$A = QR$$
where matrix Q is (m + 1) × (n + 1) satisfying
$$Q^T Q = I$$
and matrix R is (n + 1) × (n + 1) satisfying $r_{ii} > 0$ and $r_{ij} = 0$ for j
The weight function $(1 - x^2)^{-1/2}$ assigns heavy weight to the ends of the interval [−1, 1]. If a sequence of nonzero functions $g_0, g_1, \ldots, g_n$ is orthogonal according to Equation (6), then the sequence $\lambda_0 g_0, \lambda_1 g_1, \ldots, \lambda_n g_n$ is orthonormal for appropriate positive real numbers $\lambda_j$, namely,
$$\lambda_j = \Bigl\{ \int_a^b [g_j(x)]^2\, w(x)\, dx \Bigr\}^{-1/2}$$
Nonlinear Example
As another example of the least-squares principle, here is a nonlinear problem. Suppose that a table of points $(x_k, y_k)$ is to be fitted by a function of the form
$$y = e^{cx}$$
Proceeding as before leads to the problem of minimizing the function
$$\phi(c) = \sum_{k=0}^{m} (e^{c x_k} - y_k)^2$$
The minimum occurs for a value of c such that
$$0 = \frac{\partial \phi}{\partial c} = \sum_{k=0}^{m} 2\,(e^{c x_k} - y_k)\, e^{c x_k} x_k$$
This equation is nonlinear in c. One could contemplate solving it by Newton's method or the secant method. On the other hand, the problem of minimizing $\phi(c)$ could be attacked directly. Since there can be multiple roots in the normal equation and local minima in $\phi$ itself, a direct minimization of $\phi$ would be safer. This type of difficulty is typical of nonlinear least-squares problems. Consequently, other methods of curve fitting are often preferred if the unknown parameters do not occur linearly in the problem.
Alternatively, this particular example can be linearized by a change of variables $z = \ln y$ and by considering
$$z = cx$$
The problem of minimizing the function
$$\phi(c) = \sum_{k=0}^{m} (c x_k - z_k)^2 \qquad z_k = \ln y_k$$
is easy and leads to
$$c = \sum_{k=0}^{m} z_k x_k \Big/ \sum_{k=0}^{m} x_k^2$$
This value of c is not the solution of the original problem, but may be satisfactory in some applications.
Linear and Nonlinear Example
The final example contains elements of linear and nonlinear theory. Suppose that an $(x_k, y_k)$ table is given with m + 1 entries and that a functional relationship such as
$$y = a \sin(bx)$$
is suspected. Can the least-squares principle be used to obtain the appropriate values of the parameters a and b?
Notice that parameter b enters this function in a nonlinear way, creating some difficulty, as will be seen. According to the principle of least squares, the parameters should be chosen such that the expression
$$\sum_{k=0}^{m} [a \sin(b x_k) - y_k]^2$$
has a minimum value. The minimum value is sought by differentiating this expression with respect to a and b and setting these partial derivatives equal to zero. The results are
$$\sum_{k=0}^{m} 2\,[a \sin(b x_k) - y_k] \sin(b x_k) = 0$$
$$\sum_{k=0}^{m} 2\,[a \sin(b x_k) - y_k]\, a\, x_k \cos(b x_k) = 0$$
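For a fixed trial value of b, each of the two conditions for the fit $y = a\sin(bx)$ (the partial derivatives with respect to a and b set to zero) can be solved for a. The Python sketch below does this with two illustrative helper functions; their names and the synthetic data (generated from a = 2, b = 3) are assumptions made for this example. The second helper assumes a ≠ 0 so that the condition can be divided by 2a.

```python
import math


def a_from_first_condition(b, xs, ys):
    """Solve sum [a sin(b x) - y] sin(b x) = 0 for a, with b held fixed."""
    num = sum(y * math.sin(b * x) for x, y in zip(xs, ys))
    den = sum(math.sin(b * x) ** 2 for x in xs)
    return num / den


def a_from_second_condition(b, xs, ys):
    """Solve sum [a sin(b x) - y] x cos(b x) = 0 for a, with b held fixed.

    This is the second condition after dividing by 2a (assumes a != 0).
    """
    num = sum(x * y * math.cos(b * x) for x, y in zip(xs, ys))
    den = sum(x * math.sin(b * x) * math.cos(b * x) for x in xs)
    return num / den


# Synthetic, noise-free data from y = 2 sin(3x); purely illustrative.
xs = [0.1 * k for k in range(1, 9)]
ys = [2.0 * math.sin(3.0 * x) for x in xs]

a1 = a_from_first_condition(3.0, xs, ys)
a2 = a_from_second_condition(3.0, xs, ys)
# At the value of b used to generate the data, both conditions give the same a.
```

For other trial values of b the two results generally differ, and that difference is the quantity a root finder would drive to zero.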
If b were known, a could be obtained from either equation. The correct value of b is the one for which these corresponding two a values are identical. So each of the preceding equations should be solved for a, and the results set equal to each other. This process leads to the equation
$$\frac{\sum_{k=0}^{m} y_k \sin b x_k}{\sum_{k=0}^{m} (\sin b x_k)^2} = \frac{\sum_{k=0}^{m} x_k y_k \cos b x_k}{\sum_{k=0}^{m} x_k \sin b x_k \cos b x_k}$$
which can now be solved for parameter b, using, for example, the bisection method or the secant method. Then either side of this equation can be evaluated as the value of a.
Additional Details on SVD
The singular value decomposition (SVD) of a matrix is a factorization that can reveal important properties of the matrix that otherwise could escape detection. For example, from the SVD of a square matrix one could be alerted to the near-singularity of the matrix. Or from the SVD factorization of a nonsquare matrix an unexpected loss of rank could be revealed. Since the SVD factorization of a matrix yields a complete orthogonal decomposition, it provides a technique for computing the least-squares solution of a system of equations and at the same time producing the norm of the error vector.
Suppose that a given m × n matrix has the factorization
$$A = U D V^T$$
where $U = [u_1, u_2, \ldots, u_m]$ is an m × m orthogonal matrix, $V = [v_1, v_2, \ldots, v_n]$ is an n × n orthogonal matrix, and the m × n diagonal matrix D contains the singular values of A on its diagonal, listed in decreasing order. The singular values of a matrix A are the positive square roots of the eigenvalues of $A^T A$. These are denoted by $\sigma_1 \geqq \sigma_2 \geqq \cdots \geqq \sigma_r \geqq 0$. In detail, we have
$$U^T A V = D = \begin{bmatrix}
\sigma_1 & & & & & & \\
& \sigma_2 & & & & & \\
& & \ddots & & & & \\
& & & \sigma_r & & & \\
& & & & 0 & & \\
& & & & & \ddots & \\
& & & & & & 0
\end{bmatrix}_{m \times n}$$
where $U^T U = I_m$ and $V^T V = I_n$. (In the above matrix, blank space corresponds to zero entries.) Moreover, we have $A v_i = \sigma_i u_i$ and $\sigma_i = \|A v_i\|_2$ where $v_i$ is column i in V and $u_i$ is column i in U. Since U is orthogonal, we obtain
$$\begin{aligned}
\|Ax - b\|_2^2 &= \|U^T(Ax - b)\|_2^2 = \|U^T A x - U^T b\|_2^2 \\
&= \|U^T A (V V^T) x - U^T b\|_2^2 \\
&= \|(U^T A V)(V^T x) - U^T b\|_2^2 \\
&= \|D V^T x - U^T b\|_2^2 = \|D y - c\|_2^2 \\
&= \sum_{i=1}^{r} (\sigma_i y_i - c_i)^2 + \sum_{i=r+1}^{m} c_i^2
\end{aligned}$$
where $y = V^T x$ and $c = U^T b$. Here, y is defined by $y_i = c_i/\sigma_i$ and x by $x = Vy$. Since $c_i = u_i^T b$ and $x = Vy$, if $y_i = \sigma_i^{-1} c_i$ for $1 \leqq i \leqq r$ then the least-squares solution is
$$x_{LS} = \sum_{i=1}^{n} y_i v_i = \sum_{i=1}^{r} \sigma_i^{-1} c_i v_i = \sum_{i=1}^{r} \sigma_i^{-1} (u_i^T b)\, v_i$$
and
$$\|A x_{LS} - b\|_2^2 = \sum_{i=r+1}^{m} c_i^2 = \sum_{i=r+1}^{m} (u_i^T b)^2$$
which is the smallest of all two-norm minimizers. For additional details, see Golub and Van Loan [1996].
In conclusion, we obtain the following theorem.
■ Theorem 1
SVD Least-Squares Theorem
Let A be an m × n matrix of rank r. Let the SVD factorization be $A = U D V^T$. The least-squares solution of the system $Ax = b$ is $x_{LS} = \sum_{i=1}^{r} (\sigma_i^{-1} c_i)\, v_i$, where $c_i = u_i^T b$. If there exist many least-squares solutions to the given system, then the one of least 2-norm is $x_{LS}$ as described above.
EXAMPLE 1
Find the least-squares solution of this nonsquare system
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$$
using the singular value decomposition:
$$\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix} =
\begin{bmatrix}
\tfrac{1}{3}\sqrt{6} & 0 & \tfrac{1}{3}\sqrt{3} \\
\tfrac{1}{6}\sqrt{6} & -\tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3} \\
\tfrac{1}{6}\sqrt{6} & \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{3}\sqrt{3}
\end{bmatrix}
\begin{bmatrix} \sqrt{3} & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} \tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} & -\tfrac{1}{2}\sqrt{2} \end{bmatrix}$$
Solution
We have r = rank(A) = 2 and the singular values $\sigma_1 = \sqrt{3}$ and $\sigma_2 = 1$. This leads to
$$c_1 = u_1^T b = \begin{bmatrix} \tfrac{1}{3}\sqrt{6} & \tfrac{1}{6}\sqrt{6} & \tfrac{1}{6}\sqrt{6} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} = \tfrac{1}{3}\sqrt{6}$$
and
$$c_2 = u_2^T b = \begin{bmatrix} 0 & -\tfrac{1}{2}\sqrt{2} & \tfrac{1}{2}\sqrt{2} \end{bmatrix} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix} = \sqrt{2}$$
and
$$x_{LS} = \sigma_1^{-1} c_1 v_1 + \sigma_2^{-1} c_2 v_2
= \frac{1}{\sqrt{3}} \Bigl( \tfrac{1}{3}\sqrt{6} \Bigr) \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ \tfrac{1}{2}\sqrt{2} \end{bmatrix} + \sqrt{2} \begin{bmatrix} \tfrac{1}{2}\sqrt{2} \\ -\tfrac{1}{2}\sqrt{2} \end{bmatrix}
= \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{1}{3} \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \end{bmatrix}
= \begin{bmatrix} \tfrac{4}{3} \\ -\tfrac{2}{3} \end{bmatrix}$$
This solution is the same as that from the normal equations. ■
Using the Singular Value Decomposition
This material requires the theory of the singular value decomposition discussed in Section 8.2.
An important application of the singular value decomposition is in the matrix least-squares problem, to which we now return. For any system of linear equations
$$Ax = b$$
we want to define a unique minimal solution. This is described as follows. Let A be m × n, and define
$$\rho = \inf\{\|Ax - b\|_2 : x \in \mathbb{R}^n\}$$
The minimal solution of our system is taken to be the point of smallest norm in the set $\{x : \|Ax - b\|_2 = \rho\}$. If the system is consistent, then $\rho = 0$, and we are simply asking for the point of least norm among all solutions. If the system is inconsistent, we want Ax to be as close as possible to b; that is, $\|Ax - b\|_2 = \rho$. If there are many such points, we choose the one closest to the origin.
The minimal solution is produced by using the pseudo-inverse of A, and this object, in turn, can be computed from the singular value decomposition of A as discussed in Section 8.2. First, consider a diagonal m × n matrix of the following form, where the $\sigma_j$ are positive numbers:
$$D = \begin{bmatrix}
\sigma_1 & & & & & & \\
& \sigma_2 & & & & & \\
& & \ddots & & & & \\
& & & \sigma_r & & & \\
& & & & 0 & & \\
& & & & & \ddots & \\
& & & & & & 0
\end{bmatrix}_{m \times n}$$
Its pseudo-inverse $D^+$ is defined to be of the same form, except that it is to be n × m and it has $1/\sigma_j$ on its diagonal. For example, we have
$$D = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 2 & 0 \end{bmatrix}, \qquad D^+ = \begin{bmatrix} \tfrac{1}{5} & 0 \\ 0 & \tfrac{1}{2} \\ 0 & 0 \end{bmatrix}$$
If A is any m × n matrix and if $U D V^T$ is one of its singular value decompositions, we define the pseudo-inverse of A to be
$$A^+ = V D^+ U^T$$
We do not stop to prove that the pseudo-inverse of A is unique if we impose the order $\sigma_1 \geqq \sigma_2 \geqq \cdots$.
■ Theorem 2
Minimal Solution Theorem
Consider a system of linear equations $Ax = b$, in which A is an m × n matrix. The minimal solution of the system is $A^+ b$.
Proof
Use the notation established above, and let x be any point in $\mathbb{R}^n$. Define $y = V^T x$ and $c = U^T b$. Using the properties of V and U, we obtain
$$\begin{aligned}
\rho &= \inf_x \|Ax - b\|_2 \\
&= \inf_x \|U D V^T x - b\|_2 \\
&= \inf_x \|U^T (U D V^T x - b)\|_2 \\
&= \inf_x \|D V^T x - U^T b\|_2 \\
&= \inf_y \|D y - c\|_2
\end{aligned}$$
Exploiting the special nature of D, we have
$$\|D y - c\|_2^2 = \sum_{i=1}^{r} (\sigma_i y_i - c_i)^2 + \sum_{i=r+1}^{m} c_i^2$$
To minimize this last expression, we define $y_i = c_i/\sigma_i$ for $1 \leqq i \leqq r$. The other components can remain unspecified. But to get the y of least norm, we must set $y_i = 0$ for $r + 1 \leqq i \leqq n$. This construction is carried out by the pseudo-inverse $D^+$, so $y = D^+ c$. Hence, we obtain
$$x = V y = V D^+ c = V D^+ U^T b = A^+ b$$
Let us express the minimal solution in another form, taking advantage of the zero components in the vector y. Since $y_i = 0$ for i > r, we require only the first r components of y. These are given by $y_i = c_i/\sigma_i$. Now it is evident that only the first r components of c are needed. Since $c = U^T b$, $c_i$ is the inner product of row i in $U^T$ with the vector b. That is the same as the inner product of the ith column of U with b. Thus,
$$y_i = u_i^T b / \sigma_i \qquad (1 \leqq i \leqq r)$$
The minimal solution, which we may denote by $x^*$, is then
$$x^* = V y = \sum_{i=1}^{r} y_i v_i$$
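The formula $x^* = \sum_{i=1}^{r} (u_i^T b/\sigma_i)\, v_i$ can be sketched with NumPy's `numpy.linalg.svd`. The function name `minimal_solution` and the relative threshold `rel_tol` used to decide the numerical rank r are assumptions made for this illustration; the text itself does not prescribe them.

```python
import numpy as np


def minimal_solution(a, b, rel_tol=1e-12):
    """Minimal (least 2-norm) least-squares solution of a x = b via the SVD.

    rel_tol is an illustrative threshold: singular values not exceeding
    rel_tol * sigma_1 are treated as zero when determining the rank r.
    """
    u, s, vt = np.linalg.svd(a, full_matrices=True)
    r = int(np.sum(s > rel_tol * s[0]))   # numerical rank
    c = u.T @ b                           # c = U^T b
    y = c[:r] / s[:r]                     # y_i = c_i / sigma_i for i <= r
    return vt[:r].T @ y                   # x* = sum of y_i v_i


# The 3 x 2 system of Example 1.
a = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
b = np.array([1.0, -1.0, 1.0])
x_star = minimal_solution(a, b)           # approximately [4/3, -2/3]

# A rank-deficient case: every x with x1 + x2 = 2 is a least-squares solution.
a_def = np.array([[1.0, 1.0], [1.0, 1.0]])
b_def = np.array([1.0, 3.0])
x_min = minimal_solution(a_def, b_def)    # the one of least norm, [1, 1]
```

In the rank-deficient case the result should agree with `numpy.linalg.pinv(a_def) @ b_def`, which applies the pseudo-inverse directly.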
Examples
An example of this procedure can be carried out in mathematical software such as MATLAB, Maple, or Mathematica. We can generate a system of 20 equations with three unknowns by a random process. This technique is often used in testing software, especially in benchmarking studies, in which a large number of examples is run with careful timing. The software has a provision for generating random matrices. When executed, the computer program first exhibits the random input. The three singular values of matrix A are displayed. Then the diagonal 20 × 3 matrix D is displayed. A check on the numerical work is made by computing $U D V^T$, which should equal A. Then the pseudo-inverse $D^+$ of D is computed. Next, the pseudo-inverse $A^+$ is computed. The minimal solution, $x = A^+ b$, is computed, as well as the residual vector, $r = Ax - b$. Then the orthogonality condition $A^T r = 0$ is checked. This program is therefore carrying out all the steps described above for obtaining the minimal solution of a system of equations. Another example is given below to show what happens in the case of a loss in rank. (See Computer Exercise 9.3.10.)
In problems of this type, the user must examine the singular values and decide whether any are small enough to warrant being set equal to zero. The necessity of this step becomes clear when we look at the definition of D+. The reciprocals of the singular values are the principal constituents of this matrix. Any very small singular value that is not set equal to zero will therefore have a disruptive effect on the subsequent calculations. A rule of thumb that has been recommended is to drop any singular value whose magnitude is less than σ1 times the inherent accuracy of the coefficient matrix. Thus, if the data are accurate to three decimal places and if σ1 = 5, then any σi less than 0.005 should be set equal to zero.
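The rule of thumb can be sketched as follows. The function name `truncated_pinv`, the parameter `data_accuracy`, and the small test matrix are hypothetical, chosen so that $\sigma_1 = 5$ and the data accuracy is $10^{-3}$, which gives the cutoff 0.005 mentioned above.

```python
import numpy as np


def truncated_pinv(a, data_accuracy):
    """Pseudo-inverse in which small singular values are set to zero.

    A singular value is dropped when it is below sigma_1 * data_accuracy,
    where data_accuracy is the assumed accuracy of the entries of a.
    """
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    cutoff = s[0] * data_accuracy
    # Reciprocals of the retained singular values, zero for the dropped ones.
    inv_s = np.array([1.0 / sv if (sv >= cutoff and sv > 0.0) else 0.0
                      for sv in s])
    return vt.T @ np.diag(inv_s) @ u.T, s, cutoff


# Hypothetical matrix with singular values 5 and 0.001.
a = np.array([[5.0, 0.0], [0.0, 0.001], [0.0, 0.0]])
a_plus, singular_values, cutoff = truncated_pinv(a, 1e-3)
# The value 0.001 is below the cutoff 0.005, so its reciprocal (1000)
# is not allowed to enter the pseudo-inverse.
```

Without the truncation, the entry 1/0.001 = 1000 would dominate $A^+$ and amplify any error in b by the same factor.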
An example of a small matrix having a near-deficiency in rank is given next. In the Maple program, certain singular values are set equal to zero if they fail to meet the relative size criterion mentioned in the previous paragraph. Also, we have added, as a check on the calculations, a verification of the following four Penrose properties for a pseudo-inverse.
■
■ Theorem 3
Penrose Properties of the Pseudo-Inverse
The pseudo-inverse $A^+$ for the matrix A has these four properties:
$$A = A A^+ A \qquad A^+ = A^+ A A^+ \qquad A A^+ = (A A^+)^T \qquad A^+ A = (A^+ A)^T$$
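The four Penrose properties are easy to check numerically. The sketch below uses NumPy's `numpy.linalg.pinv` to form the pseudo-inverse and reports the norm of the discrepancy in each property; the helper name `penrose_residuals` and the rank-deficient test matrix are assumptions made for this illustration.

```python
import numpy as np


def penrose_residuals(a, a_plus):
    """Norms of the discrepancies in the four Penrose properties."""
    return (
        np.linalg.norm(a @ a_plus @ a - a),                # A = A A+ A
        np.linalg.norm(a_plus @ a @ a_plus - a_plus),      # A+ = A+ A A+
        np.linalg.norm(a @ a_plus - (a @ a_plus).T),       # A A+ symmetric
        np.linalg.norm(a_plus @ a - (a_plus @ a).T),       # A+ A symmetric
    )


# Hypothetical 3 x 3 matrix of rank 2 (the second column is twice the first).
a = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 1.0]])
residuals = penrose_residuals(a, np.linalg.pinv(a))
# Each entry of residuals should be at the level of roundoff error.
```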
We can use mathematical software such as MATLAB, Maple, or Mathematica for finding the pseudo-inverse of a matrix that has a deficiency in rank. For example, consider this 5 × 3 matrix:
$$A = \begin{bmatrix}
-85 & -55 & -115 \\
-35 & 97 & -167 \\
79 & 56 & 102 \\
63 & 57 & 69 \\
45 & -8 & 97.5
\end{bmatrix} \tag{7}$$
A tolerance value is set so that in the evaluation of singular values any value whose magnitude is less than the tolerance is treated as zero. We can verify the Penrose properties for this matrix. (See Computer Exercise 9.3.11.)
Orthogonal Matrices and Spectral Theorem
A matrix Q is said to be orthogonal if
$$Q Q^T = Q^T Q = I$$
This forces Q to be square and nonsingular. Furthermore, $Q^{-1} = Q^T$.
With this concept available, we can state one of the principal theorems of linear algebra: the spectral theorem for symmetric matrices.
■ Theorem 4
Spectral Theorem for Symmetric Matrices
If A is a symmetric real matrix, then there exists an orthogonal matrix Q such that $Q^T A Q$ is a diagonal matrix.
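A numerical illustration of the theorem, assuming NumPy is available: `numpy.linalg.eigh` returns the eigenvalues of a symmetric matrix in ascending order together with an orthogonal matrix of eigenvectors, which can play the role of Q. The function name `diagonalize_symmetric` and the 3 × 3 test matrix are illustrative choices, not taken from the text.

```python
import numpy as np


def diagonalize_symmetric(a):
    """Return (q, d) with q orthogonal and d = q^T a q for symmetric a."""
    eigenvalues, q = np.linalg.eigh(a)   # columns of q are eigenvectors
    d = q.T @ a @ q                      # diagonal up to roundoff
    return q, d, eigenvalues


# Hypothetical symmetric tridiagonal matrix.
a = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
q, d, eigenvalues = diagonalize_symmetric(a)
# q.T @ q is the identity, and d carries the eigenvalues on its diagonal.
```

For this particular matrix the eigenvalues are $2 - \sqrt{2}$, $2$, and $2 + \sqrt{2}$.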
The equation
$$Q^T A Q = D$$
is equivalent to
$$A Q = Q D$$
If D is diagonal, the columns $v_i$ of Q obey the equation
$$A v_i = d_{ii} v_i$$
In other words, the columns of Q form an orthonormal system of eigenvectors of A, and
the diagonal elements of D are the eigenvalues of A.
Summary 9.3
• We attempt to solve an inconsistent system
$$\sum_{j=0}^{n} a_{kj} x_j = b_k \qquad (0 \leqq k \leqq m)$$
in which there are m + 1 equations but only n + 1 unknowns with m > n. We minimize the sum of the squares of the residuals and are led to minimize the expression
$$\phi(x_0, x_1, \ldots, x_n) = \sum_{k=0}^{m} \Bigl( \sum_{j=0}^{n} a_{kj} x_j - b_k \Bigr)^2$$
We solve the (n + 1) × (n + 1) system of normal equations
$$\sum_{j=0}^{n} \Bigl( \sum_{k=0}^{m} a_{ki} a_{kj} \Bigr) x_j = \sum_{k=0}^{m} b_k a_{ki} \qquad (0 \leqq i \leqq n)$$
by Gaussian elimination, and the solution is a best approximate solution of the original system in the least-squares sense.
Exercises 9.3
1. Analyze the least-squares problem of fitting data by a function of the form $y = x^c$.
a2. Show that the Hilbert matrix (Computer Exercise 2.2.4) arises in the normal equations when we minimize
$$\int_0^1 \Bigl( \sum_{j=0}^{n} c_j x^j - f(x) \Bigr)^2 dx$$
a3. Find a function of the form $y = e^{cx}$ that best fits this table:
x | 0 | 1
y | 1/2 | 1
a4. (Continuation) Repeat the preceding problem for the following table:
x | 0 | 1
y | a | b
5. (Continuation) Repeat the preceding problem under the supposition that b is negative.
a6. Show that the normal equation for the problem of fitting $y = e^{cx}$ to points (1, −12) and (2, 7.5) has two real roots: c = ln 2 and c = 0. Which value is correct for the fitting problem?
7. Consider the inconsistent System (1). Suppose that each equation has associated with it a positive number $w_i$ indicating its relative importance or reliability. How should Equations (2) and (3) be modified to reflect this?
a8. Determine the best approximate solution of the inconsistent system of linear equations
$$2x + 3y = 1, \qquad x - 4y = -9, \qquad 2x - y = -1$$
in the least-squares sense.
a9. a. Find the constant c for which cx is the best approximation in the sense of least squares to the function sin x on the interval $[0, \pi/2]$.
b. Do the same for $e^x$ on [0, 1].
10. Analyze the problem of fitting a function $y = (c - x)^{-1}$ to a table of m + 1 points.
11. Show that the normal equations for the least-squares solution of $Ax = b$ can be written $(A^T A)x = A^T b$.
12. Derive the normal equations given by System (5).
13. A table of values $(x_k, y_k)$, where k = 0, 1, …, m, is obtained from an experiment. When plotted on semilogarithmic graph paper, the points lie nearly on a straight line, implying that $y \approx e^{ax+b}$. Suggest a simple procedure for obtaining parameters a and b.
a14. In fitting a table of values to a function of the form $a + bx^{-1} + cx^{-2}$, we try to make each point lie on the curve. This leads to $a + b x_k^{-1} + c x_k^{-2} = y_k$ for $0 \leqq k \leqq m$. An equivalent equation is $a x_k^2 + b x_k + c = y_k x_k^2$ for $0 \leqq k \leqq m$. Are the least-squares problems for these systems of equations equivalent?
a15. A table of points $(x_k, y_k)$ is plotted and appears to lie on a hyperbola of the form $y = (a + bx)^{-1}$. How can the linear theory of least squares be used to obtain good estimates of a and b?
a16. Consider $f(x) = e^{2x}$ over $[0, \pi]$. We wish to approximate the function by a trigonometric polynomial of the form $p(x) = a + b\cos(x) + c\sin(x)$. Determine the linear system to be solved for determining the least-squares fit of p to f.
a17. Find the constant c that makes the expression
$$\int_0^1 (e^x - cx)^2\, dx$$
a minimum.
18. Show that in every least-squares matrix problem, the normal equations have a symmetric coefficient matrix.
19. Verify that the following steps produce the least-squares solution of $Ax = b$.
a. Factor A = QR, where Q and R have the properties described in the text.
b. Define $y = Q^T b$.
c. Solve the lower triangular system Rx = y.
a20. What value of c should be used if a table of experimental data $(x_i, y_i)$ for $0 \leqq i \leqq m$ is to be represented by the formula y = c sin x? An explicit usable formula for c is required. Use the principle of least squares.
21. Refer to the formulas leading to the minimal solution of the system $Ax = b$. Prove that the y-vector is given by the formula $y_i = \sigma_i^{-2}\, b^T A v_i$ for $1 \leqq i \leqq r$.
22. Prove that the pseudo-inverse satisfies the four Penrose equations.
23. Use the four Penrose properties to find the pseudo-inverse of the matrix $[a, 0]^T$, where a > 0. Prove that the pseudo-inverse is a discontinuous function of a.
24. Use the technique suggested in the preceding problem to find the pseudo-inverse of the m × n matrix consisting solely of 1's.
25. Use the Penrose equations to find the pseudo-inverse of any 1 × n matrix and any m × 1 matrix.
26. (Multiple Choice) Let A = PDQ, where A is an m × n matrix, P is an m × m unitary matrix, D is an m × n diagonal matrix, and Q is an n × n unitary matrix. Which equation can be deduced from those hypotheses?
a. $A^* = P^* D^* Q^*$  b. $A^{-1} = Q^* D^{-1} P^*$  c. $D = PAQ$  d. $A^* A = Q^* D^* D Q$  e. None of these.
27. (Multiple Choice, continued) Assume the hypotheses of the preceding problem. Use the notation + to indicate a pseudo-inverse. Which equation is correct?
a. $A^+ = P D^+ Q$  b. $A^* = Q^* D^{-1} P^*$  c. $A^+ = Q^* D^+ P^*$  d. $A^{-1} = Q^* D^+ P^*$  e. None of these.
28. (Multiple Choice) Let D be an m × n diagonal matrix with diagonal elements $p_1, p_2, \ldots, p_r, 0, 0, \ldots, 0$. Here all the numbers $p_i$, for $1 \leqq i \leqq r$, are positive. Which assertion is not valid?
a. $D^+$ is the m × n diagonal matrix with diagonal elements $(1/p_1, 1/p_2, \ldots, 1/p_r, 0, 0, \ldots, 0)$
b. $D^+$ is the n × m diagonal matrix with diagonal elements $(1/p_1, 1/p_2, \ldots, 1/p_r, 0, 0, \ldots, 0)$
c. $(D^+)^* = (D^*)^+$
d. $D^{++} = D$
e. None of these.
29. (Multiple Choice) Consider an inconsistent system of equations $Ax = b$. Let U be a unitary matrix and let $E = U^* A$. Let v, w, and z be vectors such that $Uv = Eb$, $Uw = E^* b$, $Ey = U^* b$, and $Ex = Ub$. A vector that solves the least-squares problem for the original system $Ax = b$ is:
a. v  b. w  c. y  d. z  e. None of these.
Computer Exercises 9.3
a1. Using the method suggested in the text, fit the data in the table
x | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8
y | 0.6 | 1.1 | 1.6 | 1.8 | 2.0 | 1.9 | 1.7 | 1.3
by a function y = a sin bx.
2. (Prony's Method, n = 1) To fit a table of the form
x | 1 | 2 | ··· | m
y | $y_1$ | $y_2$ | ··· | $y_m$
by the function $y = ab^x$, we can proceed as follows: If y is actually $ab^x$, then $y_k = ab^k$ and $y_{k+1} = b y_k$ for k = 1, 2, …, m − 1. So we determine b by solving this system of equations using the least-squares method. Having found b, we find a by solving the equations $y_k = ab^k$ in the least-squares sense. Write a program to carry out this procedure, and test it on an artificial example.
3. (Continuation) Modify the procedure of the preceding computer problem to handle any case of equally spaced points.
4. A quick way of fitting a function of the form
$$f(x) \approx \frac{a + bx}{1 + cx}$$
is to apply the least-squares method to the problem $(1 + cx) f(x) \approx a + bx$. Use this technique to fit the world population data given here (in billions):
Year | Population
1000 | 0.340
1650 | 0.545
1800 | 0.907
1900 | 1.61
1950 | 2.56
1960 | 3.15
1970 | 3.65
1980 | 4.20
1990 | 5.30
2000 | 6.12
2010 | 6.98
Determine when the world population becomes infinite!
5. (Student Research Project) Explore the question of whether the least-squares method should be used to predict. For example, study the variances in the preceding problem to determine whether a polynomial of any degree would be satisfactory.
6. Write a procedure that takes as input an (m + 1) × (n + 1) matrix A and an m + 1 vector b and returns the least-squares solution of the system $Ax = b$.
7. Write a Maple program to find the minimal solution of any system of equations, $Ax = b$.
8. (Continuation) Write a MATLAB program for the task in the preceding problem.
9. Investigate some of the newer methods for solving inconsistent linear equations $Ax = b$, when the criterion is to make Ax close to b in one of the other useful norms, namely, the maximum norm $\|x\|_\infty = \max_{1 \leqq i \leqq n} |x_i|$ or the $\ell_1$ norm $\|x\|_1 = \sum_{i=1}^{n} |x_i|$. Use some of the available software.
10. Using mathematical software such as MATLAB, Maple, or Mathematica, generate a system of 20 equations with 3 unknowns by a random-number generator. Form the pseudo-inverse matrix and verify the properties in Theorem 2.
11. (Continuation) Repeat using Matrix (7).
12. Write a computer program for carrying out the least-squares curve fit using Chebyshev polynomials. Test the code on a suitable data set and plot the results.
9.4 Fourier Series
Introduction
Introduced by Jean Baptiste Joseph Fourier (1763–1830), Fourier series decompose a periodic function into a linear combination of sines and cosines.∗ Since then, this idea has been expanded into an entire area of study known as Fourier analysis, which has applications in acoustics, optics, vibrations, quantum mechanics, signal processing, and many other fields.
∗In 1798, Joseph Fourier was Napoleon's scientific advisor during France's expedition to Egypt!
Least-Squares Approximation
First, we establish the least-squares approximation of a continuous function f (which is periodic over the interval $[-\pi, \pi]$) in the space of trigonometric polynomials, denoted $T[-\pi, \pi]$, and show that this space can be spanned by the orthogonal set
$$W = \{1, \cos x, \sin x, \cos 2x, \sin 2x, \ldots, \cos Nx, \sin Nx\}$$
where N is a positive integer. The space is geometrically defined using the inner product
$$\langle f, g \rangle = \int_{-\pi}^{\pi} f(x)\, g(x)\, dx \tag{1}$$
Any continuous function f on the interval $[-\pi, \pi]$ can be approximated by linear combinations of elements from the set W, as closely as needed with a sufficiently large value of N. Throughout, we assume n, m, j, k, N, etc. are integers.
The vectors in W are mutually orthogonal, and
$$\|1\|^2 = \langle 1, 1 \rangle = 2\pi$$
$$\|\cos nx\|^2 = \langle \cos nx, \cos nx \rangle = \pi$$
$$\|\sin nx\|^2 = \langle \sin nx, \sin nx \rangle = \pi$$
where integer n ≠ 0. (See Example 1.) Moreover, an orthonormal basis is
$$U = \{g_0, g_1, g_2, \ldots, g_{2N-1}, g_{2N}\} = \Bigl\{ \tfrac{1}{\sqrt{2\pi}},\ \tfrac{1}{\sqrt{\pi}}\cos x,\ \tfrac{1}{\sqrt{\pi}}\sin x,\ \tfrac{1}{\sqrt{\pi}}\cos 2x,\ \tfrac{1}{\sqrt{\pi}}\sin 2x,\ \ldots,\ \tfrac{1}{\sqrt{\pi}}\cos Nx,\ \tfrac{1}{\sqrt{\pi}}\sin Nx \Bigr\}$$
We use U and the formula
$$g(x) = \mathrm{Proj}_T f = \langle f, g_0 \rangle g_0 + \langle f, g_1 \rangle g_1 + \langle f, g_2 \rangle g_2 + \cdots + \langle f, g_{2N-1} \rangle g_{2N-1} + \langle f, g_{2N} \rangle g_{2N}$$
to find the least-squares approximation g of f in the general form
$$\begin{aligned}
g(x) = {} & \Bigl\langle f, \tfrac{1}{\sqrt{2\pi}} \Bigr\rangle \tfrac{1}{\sqrt{2\pi}} + \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\cos x \Bigr\rangle \tfrac{1}{\sqrt{\pi}}\cos x + \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\sin x \Bigr\rangle \tfrac{1}{\sqrt{\pi}}\sin x \\
& + \cdots + \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\cos Nx \Bigr\rangle \tfrac{1}{\sqrt{\pi}}\cos Nx + \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\sin Nx \Bigr\rangle \tfrac{1}{\sqrt{\pi}}\sin Nx
\end{aligned}$$
We introduce this convenient notation for the coefficients $a_0$, $a_n$, $b_n$:
$$\tfrac{1}{2} a_0 = \Bigl\langle f, \tfrac{1}{\sqrt{2\pi}} \Bigr\rangle \tfrac{1}{\sqrt{2\pi}} = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\, dx$$
$$a_n = \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\cos nx \Bigr\rangle \tfrac{1}{\sqrt{\pi}} = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos nx\, dx \qquad (1 \leqq n \leqq N)$$
$$b_n = \Bigl\langle f, \tfrac{1}{\sqrt{\pi}}\sin nx \Bigr\rangle \tfrac{1}{\sqrt{\pi}} = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin nx\, dx \qquad (1 \leqq n \leqq N)$$
Trigonometric Identities
The following trigonometric identities are particularly useful in Fourier series because they express products of sines and cosines as sums.
$$\cos mx \cos nx = \tfrac{1}{2}\{\cos[(m + n)x] + \cos[(m - n)x]\}$$

$$\sin mx \sin nx = \tfrac{1}{2}\{\cos[(m - n)x] - \cos[(m + n)x]\}$$

$$\sin x \cos y = \tfrac{1}{2}[\sin(x - y) + \sin(x + y)]$$
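Orthogonality Properties

Establish these orthogonality properties:

$$\langle \cos mx, \cos nx \rangle = \pi\delta_{mn} = \begin{cases} 0, & (m \ne n) \\ \pi, & (m = n \ne 0) \end{cases}$$

$$\langle \sin mx, \sin nx \rangle = \pi\delta_{mn} = \begin{cases} 0, & (m \ne n) \\ \pi, & (m = n \ne 0) \end{cases}$$

$$\langle \cos mx, \sin nx \rangle = 0$$

Here δmn is the Kronecker delta. These properties can be used to establish that W is an orthogonal set and U is an orthonormal set.

EXAMPLE 1
Establish the second property above, when m ≠ n. (We leave the others as exercises.)

Solution
Using the product identity sin mx sin nx = ½{cos[(m − n)x] − cos[(m + n)x]}, we have

$$\langle \sin mx, \sin nx \rangle = \int_{-\pi}^{\pi} \sin mx \sin nx\,dx = \frac{1}{2} \int_{-\pi}^{\pi} \{\cos[(m - n)x] - \cos[(m + n)x]\}\,dx = \frac{1}{2} \left[ \frac{1}{m - n}\sin[(m - n)x] - \frac{1}{m + n}\sin[(m + n)x] \right]_{-\pi}^{\pi} = 0$$

Since m − n and m + n are nonzero integers (taking m and n positive), both sine terms vanish at x = ±π. ■

Standard Integrals

The computation of the Fourier series is based on these standard integration formulas:

$$\int_{-\pi}^{\pi} \cos mx\,dx = 0, \qquad \int_{-\pi}^{\pi} \sin mx\,dx = 0, \qquad \int_{-\pi}^{\pi} \sin mx \cos nx\,dx = 0$$

$$\int_{-\pi}^{\pi} \cos mx \cos nx\,dx = \pi\delta_{mn} = \begin{cases} 0, & m \ne n \\ \pi, & m = n \end{cases}$$

$$\int_{-\pi}^{\pi} \sin mx \sin nx\,dx = \pi\delta_{mn} = \begin{cases} 0, & m \ne n \\ \pi, & m = n \end{cases}$$

for positive integers m and n.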
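The orthogonality relations above can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `inner`, `cos_n`, and `sin_n` are hypothetical, and the composite trapezoid rule is an assumed choice of quadrature (for 2π-periodic integrands the two endpoint values coincide, so every node receives the same weight).

```python
import math

def inner(f, g, m=2000):
    """Approximate <f, g> = integral of f(x) g(x) over [-pi, pi].

    Composite trapezoid rule on m equal subintervals; for 2*pi-periodic
    integrands the endpoint terms merge, so each node has weight h.
    """
    h = 2.0 * math.pi / m
    return h * sum(f(-math.pi + k * h) * g(-math.pi + k * h) for k in range(m))

def cos_n(n):
    """Return the function x -> cos(n x)."""
    return lambda x: math.cos(n * x)

def sin_n(n):
    """Return the function x -> sin(n x)."""
    return lambda x: math.sin(n * x)

if __name__ == "__main__":
    # Print a few inner products for visual comparison with pi * delta_mn.
    for m_, n_ in [(2, 3), (3, 3)]:
        print(m_, n_,
              inner(cos_n(m_), cos_n(n_)),
              inner(sin_n(m_), sin_n(n_)),
              inner(cos_n(m_), sin_n(n_)))
```

Calling `inner(cos_n(0), cos_n(0))` approximates ⟨1, 1⟩, which is why the constant function needs the separate normalization 1/√(2π).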
Some other useful standard integration formulas are

$$\int x \cos mx\,dx = \frac{1}{m}\,x \sin mx + \frac{1}{m^2}\cos mx + C \tag{2}$$

$$\int x \sin mx\,dx = -\frac{1}{m}\,x \cos mx + \frac{1}{m^2}\sin mx + C \tag{3}$$

which are obtained by integration by parts

$$\int u\,dv = uv - \int v\,du$$

For example, let u = x and dv = cos mx dx in Formula (2).
Trigonometric Polynomial
Consider the trigonometric polynomial of degree N of the form

$$p_N(x) = \tfrac{1}{2}a_0 + a_1\cos x + b_1\sin x + \cdots + a_N\cos Nx + b_N\sin Nx = \tfrac{1}{2}a_0 + \sum_{n=1}^{N} (a_n\cos nx + b_n\sin nx) \tag{4}$$

where aN ≠ 0 or bN ≠ 0. It is sometimes more advantageous to expand a function into a trigonometric series rather than a power series. For example, it makes sense to express some phenomena in nature as periodic functions. Recall that a Taylor series of a function f can be written as an N-th degree polynomial approximation

$$f(x) \approx \sum_{n=0}^{N} c_n (x - c)^n$$

where $c_n = f^{(n)}(c)/n!$. For Fourier series, we find formulas for the coefficients in terms of integrals of f rather than derivatives of f.

Consider a subspace of continuous functions over [−π, π] spanned by the elements of the set W. The best approximation to f by functions in this subspace is the N-th order Fourier approximation to f over [−π, π]. It is given by the orthogonal projection onto the subspace, which can be computed term by term since the functions in W are orthogonal. The standard formulas for the orthogonal projections are

$$a_n = \frac{\langle f, \cos nx \rangle}{\langle \cos nx, \cos nx \rangle} \quad (n \ge 1), \qquad b_n = \frac{\langle f, \sin nx \rangle}{\langle \sin nx, \sin nx \rangle} \quad (n \ge 1)$$

Using ⟨sin nx, sin nx⟩ = π, ⟨cos nx, cos nx⟩ = π, and the inner product Formula (1), we obtain the Fourier coefficients

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\cos nx\,dx \quad (n \ge 1), \qquad b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\sin nx\,dx \quad (n \ge 1)$$
We can obtain a0 by projecting f onto the constant function 1:

$$\frac{\langle f, 1 \rangle}{\langle 1, 1 \rangle} = \frac{\int_{-\pi}^{\pi} f(x)\cdot 1\,dx}{\int_{-\pi}^{\pi} 1\cdot 1\,dx} = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\,dx = \tfrac{1}{2}a_0$$

Solving for a0, we have

$$a_0 = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\,dx$$

Clearly, the formula for an also holds when n = 0 since cos 0 = 1. For clarity, we often write the formula for a0 separately from that for an.
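As an illustration of the coefficient formulas, the sketch below approximates an and bn by numerical quadrature and evaluates the partial sum pN. It is only a sketch under stated assumptions: the function names `fourier_coefficients` and `partial_sum` are hypothetical, and the midpoint rule with `m` subintervals is an assumed quadrature choice, not the method of the text.

```python
import math

def fourier_coefficients(f, N, m=4096):
    """Return lists (a, b), indexed 0..N, approximating
    a_n = (1/pi) * integral of f(x) cos(nx) over [-pi, pi] and
    b_n = (1/pi) * integral of f(x) sin(nx) over [-pi, pi]
    with the midpoint rule on m subintervals (b[0] is always ~0)."""
    h = 2.0 * math.pi / m
    xs = [-math.pi + (k + 0.5) * h for k in range(m)]
    fx = [f(x) for x in xs]
    a = [h / math.pi * sum(v * math.cos(n * x) for v, x in zip(fx, xs))
         for n in range(N + 1)]
    b = [h / math.pi * sum(v * math.sin(n * x) for v, x in zip(fx, xs))
         for n in range(N + 1)]
    return a, b

def partial_sum(a, b, x):
    """Evaluate p_N(x) = a_0/2 + sum_{n=1}^{N} (a_n cos nx + b_n sin nx)."""
    return 0.5 * a[0] + sum(a[n] * math.cos(n * x) + b[n] * math.sin(n * x)
                            for n in range(1, len(a)))

if __name__ == "__main__":
    # A trigonometric polynomial is reproduced by its own coefficients.
    f = lambda x: 1.0 + 2.0 * math.cos(x) + 3.0 * math.sin(2.0 * x)
    a, b = fourier_coefficients(f, 3)
    print(a, b, partial_sum(a, b, 0.7), f(0.7))
```

Note that the constant term enters the partial sum as a0/2, which is why a0 is defined with the factor 1/π rather than 1/(2π).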
The N-th order Fourier approximation of f(x) using the Fourier coefficients a0, an, and bn can be written as Summation (4). This approximation becomes increasingly better as N increases in the sense that ||f − pN|| becomes smaller; this norm measures the mean-square error in the approximation. The Fourier series of f on the interval [−π, π] is the infinite sum

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} (a_n\cos nx + b_n\sin nx) \tag{5}$$

Fourier series can be shown to hold for a wider class of functions such as piecewise continuous functions; that is, when f(x) is continuous except perhaps for a finite number of removable or jump discontinuities. Here is a typical theorem with regard to Fourier series for continuous and piecewise continuous functions.

■ Theorem 1
Fourier Convergence Theorem
The Fourier Series (5) is convergent if f is a continuous periodic function on [−π, π] with period 2π or if f and f′ are piecewise continuous. Where f is continuous, the sum of the Fourier series equals f(c) at all numbers c. When f is discontinuous at a point c, the sum of the Fourier series is the average of the left and right limits; namely,

$$\tfrac{1}{2}[f(c^+) + f(c^-)], \quad \text{where } f(c^+) = \lim_{x \to c^+} f(x) \text{ and } f(c^-) = \lim_{x \to c^-} f(x).$$

The Dirichlet conditions are sufficient conditions for a real-valued periodic function f(x) to be equal to its Fourier series at points of continuity, and the behavior of the Fourier series at points of discontinuity is prescribed as above. The Gibbs phenomenon describes the peculiar behavior of the Fourier series approximations at simple discontinuities. There are large oscillations near the jumps, and these overshoots (ringing) do not die out as the frequency increases, but they do approach a finite limit.

In summary, the Fourier series for the continuous function f(x), periodic on [−π, π], is

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} (a_n\cos nx + b_n\sin nx) \tag{6}$$

where

$$a_0 = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\,dx \tag{7}$$

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\cos nx\,dx \quad (n \ge 1) \tag{8}$$

$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\sin nx\,dx \quad (n \ge 1) \tag{9}$$

If f is continuous at x, the Fourier Series (6) converges to f(x) and the equality symbol = is used. On the other hand, if f is discontinuous at x, the Fourier Series (6) may not converge to f(x) and the approximation symbol ≈ is used. In the latter case, the Fourier series converges to the midpoint of the jump.

Cosine Series and Sine Series
Symmetry in a function f(x) can be exploited to reduce the computational effort of finding the Fourier coefficients and series when the symmetry exists either about the vertical axis or the origin.
An even function fe(x) is symmetric with respect to the vertical axis (x = 0) and satisfies

$$f_e(-x) = f_e(x) \tag{10}$$

An odd function fo(x) is symmetric with respect to the origin and satisfies

$$f_o(-x) = -f_o(x) \tag{11}$$

Based on these results, we obtain the following:

If fe(x) is an even function, then fe(x) sin nx is odd and the Fourier coefficients bn are

$$b_n = 0 \quad (n \ge 1)$$

Hence, we obtain a cosine series

$$f_e(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} a_n\cos nx \tag{12}$$

where

$$a_0 = \frac{2}{\pi} \int_{0}^{\pi} f_e(x)\,dx, \qquad a_n = \frac{2}{\pi} \int_{0}^{\pi} f_e(x)\cos nx\,dx \quad (n \ge 1)$$

If fo(x) is an odd function, then fo(x) cos nx is an odd function and the Fourier coefficients an are

$$a_n = 0 \quad (n \ge 0)$$

Hence, we obtain a sine series

$$f_o(x) = \sum_{n=1}^{\infty} b_n\sin nx \tag{13}$$

where

$$b_n = \frac{2}{\pi} \int_{0}^{\pi} f_o(x)\sin nx\,dx \quad (n \ge 1)$$

EXAMPLE 2
Sawtooth Wave Function on [−π, π]
Determine the N-th degree trigonometric polynomial approximation to the Fourier series for the periodic sawtooth wave function

$$f(x) = x \text{ on } (-\pi, \pi) \quad \text{with } f(-\pi) = 0 = f(\pi)$$

Solution
Since f(x) = x is an odd function on (−π, π), we obtain immediately

$$a_n = 0 \quad (n \ge 0)$$

We can verify this for the a0 coefficient

$$a_0 = \frac{1}{\pi} \int_{-\pi}^{\pi} x\,dx = \frac{1}{\pi} \left[ \frac{x^2}{2} \right]_{-\pi}^{\pi} = 0$$
Consequently, we need only determine the bn coefficients

$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} x\sin nx\,dx = \frac{2}{\pi} \int_{0}^{\pi} x\sin nx\,dx = \frac{2}{\pi} \left[ -\frac{1}{n}\,x\cos nx + \frac{1}{n^2}\sin nx \right]_{0}^{\pi} = -\frac{2}{n}\cos n\pi + \frac{2}{\pi n^2}\sin n\pi = 2(-1)^{n+1}\frac{1}{n} \quad (n \ge 1)$$

Here we use integration Formula (3) as well as cos nπ = (−1)^n and sin nπ = 0. Therefore, we have

$$f(x) \approx p_N(x) = 2\sum_{n=1}^{N} (-1)^{n+1}\frac{1}{n}\sin nx \tag{14}$$

In detail, we obtain

$$p_N(x) = 2\left[ \sin x - \tfrac{1}{2}\sin 2x + \tfrac{1}{3}\sin 3x - \tfrac{1}{4}\sin 4x + \cdots + (-1)^{N+1}\tfrac{1}{N}\sin Nx \right]$$

As expected for an odd function, this is a sine series. Also, pN(nπ) = 0 for all integers n. So at the points of discontinuity x = ±π, ±3π, …, the Fourier series converges to 0, which is the midpoint of the jump. See Figure 9.6. ■

FIGURE 9.6 Sawtooth wave on [−π, π] (axes: x, f(x))
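To see the convergence behavior of Series (14) concretely, the partial sums can be evaluated directly. The sketch below is illustrative only; the function name `sawtooth_partial_sum` is hypothetical and simply transcribes Equation (14).

```python
import math

def sawtooth_partial_sum(x, N):
    """Evaluate p_N(x) = 2 * sum_{n=1}^{N} (-1)^(n+1) * sin(n x) / n,
    the partial sum in Equation (14) for f(x) = x on (-pi, pi)."""
    return 2.0 * sum((-1) ** (n + 1) * math.sin(n * x) / n
                     for n in range(1, N + 1))

if __name__ == "__main__":
    # At a point of continuity the partial sums approach f(x) = x;
    # at the jump x = pi every partial sum is (numerically) zero.
    for N in (10, 100, 1000):
        print(N, sawtooth_partial_sum(math.pi / 2, N), sawtooth_partial_sum(math.pi, N))
```

At x = π/2 the series reduces to 2(1 − 1/3 + 1/5 − ···), which converges slowly to π/2; this slow convergence is typical for functions with jump discontinuities.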
Infinite Fourier Series
Assume that a continuous periodic function f(x) on [−π, π] can be expressed as a convergent infinite series of sines and cosines such as (6). It turns out that the formulas for the coefficients a0, an, and bn are the same as those given in Formulas (7)–(9), respectively. Integrating Series (6) over [−π, π], we have

$$\int_{-\pi}^{\pi} f(x)\,dx = \frac{1}{2} \int_{-\pi}^{\pi} a_0\,dx + \int_{-\pi}^{\pi} \sum_{n=1}^{\infty} (a_n\cos nx + b_n\sin nx)\,dx = \frac{1}{2}(2\pi)a_0 + \sum_{n=1}^{\infty} a_n \int_{-\pi}^{\pi} \cos nx\,dx + \sum_{n=1}^{\infty} b_n \int_{-\pi}^{\pi} \sin nx\,dx$$

Here we have assumed that f is a continuous function over the interval [−π, π] such that the infinite trigonometric series of sines and cosines exists and can be integrated term-by-term. Using standard integration formulas, both of the integrals inside the sums above have the value 0 and we obtain Formula (7) for a0. Multiplying infinite Series (6) for f(x) by cos nx and integrating term-by-term, we obtain Formula (8) for an. Similarly, we can determine Formula (9) for bn. We leave verifying an and bn as exercises.
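Periodic on [−P/2, P/2]
We can change the interval from [−π, π] to [−P/2, P/2] by using the change of variables xnew = [P/2π]xold. Consequently, for a function f(x) periodic on [−P/2, P/2] with period P, we can write the Fourier series as

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{2\pi n}{P}x\right) + b_n\sin\left(\frac{2\pi n}{P}x\right) \right] \tag{15}$$

where

$$a_0 = \frac{2}{P} \int_{-P/2}^{P/2} f(x)\,dx$$

$$a_n = \frac{2}{P} \int_{-P/2}^{P/2} f(x)\cos\left(\frac{2\pi n}{P}x\right) dx \quad (n \ge 1)$$

$$b_n = \frac{2}{P} \int_{-P/2}^{P/2} f(x)\sin\left(\frac{2\pi n}{P}x\right) dx \quad (n \ge 1)$$

Clearly, P = 2π gives the special case [−π, π] and the Fourier series given by Formulas (6)–(9).

Periodic on [−L, L]
Next, we find that the Fourier series for a function f(x) periodic on [−L, L], with period P = 2L, is given by

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{n\pi}{L}x\right) + b_n\sin\left(\frac{n\pi}{L}x\right) \right] \tag{16}$$

where

$$a_0 = \frac{1}{L} \int_{-L}^{L} f(x)\,dx$$

$$a_n = \frac{1}{L} \int_{-L}^{L} f(x)\cos\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$

$$b_n = \frac{1}{L} \int_{-L}^{L} f(x)\sin\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$

Again, [−π, π] is a special case with L = π.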
Periodic on [0, 2L]
Similarly, if a periodic function f(x) is defined over [0, 2L], with period P = 2L, the Fourier series becomes

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{n\pi}{L}x\right) + b_n\sin\left(\frac{n\pi}{L}x\right) \right] \tag{17}$$

where

$$a_0 = \frac{1}{L} \int_{0}^{2L} f(x)\,dx$$

$$a_n = \frac{1}{L} \int_{0}^{2L} f(x)\cos\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$

$$b_n = \frac{1}{L} \int_{0}^{2L} f(x)\sin\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$

Clearly, [0, 2π] is a special case with L = π. Moreover, any interval [x0, x0 + 2L] can be used for a periodic function with period P = 2L.

Fourier Series and Music
Fourier series can be used to analyze the sounds from musical instruments or to synthesize them, as well as other sounds. The difference between wavefronts can be expressed as a Fourier series such as

$$p_N(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{N} \left[ a_n\cos\left(\frac{n\pi}{L}x\right) + b_n\sin\left(\frac{n\pi}{L}x\right) \right]$$

Since a sound can be expressed as a sum of simple pure sounds, the difference between two instruments can be attributed to the relative sizes of the Fourier coefficients from these respective wavefronts. The N-th term of a Fourier series is called the N-th harmonic of pN, with amplitude $A_N = \sqrt{a_N^2 + b_N^2}$ and energy $A_N^2$.

EXAMPLE 3
Sawtooth Wave Function on [0, 2L]
Now we find the N-th order Fourier series approximation to the sawtooth wave function

$$f(x) = \frac{1}{2L}\,x \text{ on } (0, 2L) \quad \text{with } f(0) = \tfrac{1}{2} = f(2L)$$

and then we plot the function f as well as some of the Fourier series approximations.
The sawtooth wave resembles the teeth on a saw blade. This waveform is one of the best ones for synthesizing string musical instruments such as violins and cellos.
For example, consider a string of length 2L plucked at the right end and fixed at the left end.

Solution
First, we compute

$$a_0 = \frac{1}{L} \int_{0}^{2L} \frac{1}{2L}\,x\,dx = \frac{1}{4L^2} \left[ x^2 \right]_{0}^{2L} = 1$$

Next, we have

$$a_n = \frac{1}{L} \int_{0}^{2L} \frac{1}{2L}\,x\cos\left(\frac{n\pi}{L}x\right) dx = \frac{1}{2L^2} \left[ \frac{L}{n\pi}\,x\sin\left(\frac{n\pi}{L}x\right) + \frac{L^2}{n^2\pi^2}\cos\left(\frac{n\pi}{L}x\right) \right]_{0}^{2L} = \frac{1}{2L^2} \left[ \frac{2L^2}{n\pi}\sin 2n\pi + \frac{L^2}{n^2\pi^2}\cos 2n\pi - \frac{L^2}{n^2\pi^2} \right] = 0 \quad (n \ge 1)$$
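We have used integration-by-parts Formula (2), as well as sin 2nπ = 0 and cos 2nπ = 1. Finally, we have

$$b_n = \frac{1}{L} \int_{0}^{2L} \frac{1}{2L}\,x\sin\left(\frac{n\pi}{L}x\right) dx = \frac{1}{2L^2} \left[ -\frac{L}{n\pi}\,x\cos\left(\frac{n\pi}{L}x\right) + \frac{L^2}{n^2\pi^2}\sin\left(\frac{n\pi}{L}x\right) \right]_{0}^{2L} = \frac{1}{2L^2} \left[ -\frac{2L^2}{n\pi}\cos 2n\pi + \frac{L^2}{n^2\pi^2}\sin 2n\pi \right] = -\frac{1}{n\pi} \quad (n \ge 1)$$

We have used integration Formula (3), as well as cos 2nπ = 1 and sin 2nπ = 0.
The N-th order Fourier series approximation of the sawtooth function f is

$$f(x) \approx p_N(x) = \frac{1}{2} - \frac{1}{\pi} \sum_{n=1}^{N} \frac{1}{n}\sin\left(\frac{n\pi}{L}x\right) \tag{18}$$

When L = π, the approximation over [0, 2π] is

$$p_N(x) = \frac{1}{2} - \frac{1}{\pi} \left[ \sin x + \tfrac{1}{2}\sin 2x + \tfrac{1}{3}\sin 3x + \cdots + \tfrac{1}{N}\sin Nx \right]$$
■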
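The closed-form coefficients of Example 3 can be cross-checked by numerical integration. This is a minimal sketch under stated assumptions: the function name `sawtooth_coefficients`, the default half-period `L = 1.0`, and the midpoint rule with `m` subintervals are illustrative choices, not part of the text.

```python
import math

def sawtooth_coefficients(n, L=1.0, m=20000):
    """Approximate (a_n, b_n) for f(x) = x / (2L) on (0, 2L), where
    a_n = (1/L) * integral of f(x) cos(n pi x / L) over [0, 2L] and
    b_n = (1/L) * integral of f(x) sin(n pi x / L) over [0, 2L],
    using the midpoint rule on m subintervals."""
    h = 2.0 * L / m
    a = 0.0
    b = 0.0
    for k in range(m):
        x = (k + 0.5) * h
        fx = x / (2.0 * L)
        a += fx * math.cos(n * math.pi * x / L)
        b += fx * math.sin(n * math.pi * x / L)
    return a * h / L, b * h / L

if __name__ == "__main__":
    # Compare with a_0 = 1, a_n = 0, and b_n = -1/(n pi) from Example 3.
    for n in range(0, 4):
        a_n, b_n = sawtooth_coefficients(n)
        print(n, a_n, b_n, -1.0 / (n * math.pi) if n else None)
```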
Fourier Series Examples
Here are some common Fourier series, which are periodic on [0, 2L] with P = 2L. (See Figures 9.7–9.9 over [0, 2π] with approximations p2, p6, and p10.)

Sawtooth Wave

$$ST(x) = \frac{1}{2} - \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{1}{n}\sin\left(\frac{n\pi}{L}x\right) \tag{19}$$

FIGURE 9.7 Sawtooth Wave on [0, 2π] with approximations p2, p6, and p10

Square Wave

$$SW(x) = \frac{4}{\pi} \sum_{n=1,3,5,\ldots}^{\infty} \frac{1}{n}\sin\left(\frac{n\pi}{L}x\right) \tag{20}$$

FIGURE 9.8 Square Wave on [0, 2π] with approximations p2, p6, and p10

Triangle Wave

$$TW(x) = \frac{8}{\pi^2} \sum_{n=1,3,5,\ldots}^{\infty} \frac{(-1)^{(n-1)/2}}{n^2}\sin\left(\frac{n\pi}{L}x\right) \tag{21}$$

FIGURE 9.9 Triangle Wave on [0, 2π] with approximations p2, p6, and p10

Euler’s and de Moivre Formulas
In the complex plane ℂ, i = √−1 and we can use Euler’s equations

$$e^{i\theta} = \cos\theta + i\sin\theta, \qquad e^{-i\theta} = \cos\theta - i\sin\theta$$

to obtain

$$\cos\theta = \tfrac{1}{2}\left( e^{i\theta} + e^{-i\theta} \right), \qquad \sin\theta = \tfrac{1}{2i}\left( e^{i\theta} - e^{-i\theta} \right)$$
Moreover, de Moivre’s formula is

$$(\cos\theta + i\sin\theta)^k = \cos k\theta + i\sin k\theta$$

Also, we have these famous identities

$$e^{i2\pi} = \cos 2\pi + i\sin 2\pi = 1$$

$$e^{i2\pi k} = \cos 2\pi k + i\sin 2\pi k = 1$$

We can write a complex number x in polar form (r, θ), with a magnitude r and an angle θ, as

$$x = re^{i\theta} = r(\cos\theta + i\sin\theta)$$

So we obtain

$$x^k = r^k(\cos k\theta + i\sin k\theta)$$

Complex Fourier Series
Suppose the function f(x) has period P on the interval [−P/2, P/2]. Often it is convenient to write the Fourier series, containing real functions (sines and cosines), as a sum of exponential functions in the form

$$f(x) = \sum_{n=-\infty}^{\infty} \alpha_n e^{in\omega_0 x} \tag{22}$$

where ω0 = 2π/P is the frequency (for example, radians per second). Here the αn are the complex Fourier coefficients.
Recall Equation (15), which is the Fourier series for the periodic function f(x) of period P on the interval [−P/2, P/2]. By substituting these identities

$$\cos(n\omega_0 x) = \tfrac{1}{2}\left( e^{in\omega_0 x} + e^{-in\omega_0 x} \right), \qquad \sin(n\omega_0 x) = \tfrac{1}{2i}\left( e^{in\omega_0 x} - e^{-in\omega_0 x} \right)$$

we find the relationship between the trigonometric and exponential forms of the series

$$\alpha_n = \begin{cases} \tfrac{1}{2}\left( a_{|n|} + i\,b_{|n|} \right), & n < 0 \\ \tfrac{1}{2}a_0, & n = 0 \\ \tfrac{1}{2}\left( a_n - i\,b_n \right), & n > 0 \end{cases} \tag{23}$$

Thus, we have

$$f(x) = \alpha_0 + \sum_{n=1}^{\infty} \left( \alpha_n e^{in\omega_0 x} + \alpha_{-n} e^{-in\omega_0 x} \right)$$

Note that α−n is the complex conjugate of αn. (See Exercise 9.4.16.) To find the coefficients αn, multiply each side of Series (22) by $e^{-im\omega_0 x}$ and integrate over [−P/2, P/2], yielding

$$\int_{-P/2}^{P/2} f(x)\,e^{-im\omega_0 x}\,dx = \sum_{n=-\infty}^{\infty} \int_{-P/2}^{P/2} \alpha_n e^{i(n-m)\omega_0 x}\,dx$$

In the sum on the right-hand side, all the terms are zero except the one for which n = m because the exponentials with different exponents are orthogonal. Consequently, taking m = n, we obtain

$$\int_{-P/2}^{P/2} f(x)\,e^{-in\omega_0 x}\,dx = \alpha_n \int_{-P/2}^{P/2} dx = \alpha_n P$$

So we find that

$$\alpha_n = \frac{1}{P} \int_{-P/2}^{P/2} f(x)\,e^{-in\omega_0 x}\,dx$$

for all integers n.

Roots of Unity
Roots of unity arise when finding the complex roots of the polynomial equation

$$x^n = 1$$

(Remember that an n-th degree polynomial has n complex roots.) Let $x = re^{i\theta}$ in polar form with r = 1. Then $1 = x^n = r^n e^{in\theta} = e^{in\theta}$. Since $1 = e^{i2\pi k}$ for any integer k ≧ 0, this leaves us with $e^{in\theta} = e^{i2\pi k}$. Taking the natural logarithm of both sides, we obtain inθ = i2πk and θ = 2π(k/n). Each k = 0, 1, 2, …, n − 1 gives a distinct value of θ, but once we get to k ≧ n the values begin to repeat. Hence, the solutions to $x^n = 1$ are given by

$$x = e^{i2\pi(k/n)} = \cos\left(\frac{2\pi k}{n}\right) + i\sin\left(\frac{2\pi k}{n}\right) \qquad (k = 0, 1, 2, \ldots, n - 1)$$

All roots of unity lie on the unit circle in the complex plane. The roots are the vertices of a regular n-sided polygon in the complex plane. For n ≧ 2, the sum of the n-th roots of unity is zero. For a given positive integer n, the roots of the cyclotomic equation $x^n = 1$ are the n solutions

$$x = \sqrt[n]{1} = e^{i2\pi(k/n)} \qquad (0 \le k \le n - 1)$$

These are called the n-th roots of unity or de Moivre numbers. Moreover, the n-th roots of unity can be expressed as

$$\omega_n^k = e^{i2\pi(k/n)} = \cos\left(\frac{2\pi k}{n}\right) + i\sin\left(\frac{2\pi k}{n}\right) \quad \begin{cases} = \cos 0 + i\sin 0 = 1 & (k = 0) \\ \ne 1 & (1 \le k \le n - 1) \end{cases}$$

The root with k = 1, namely $\omega_n = e^{i2\pi/n}$, is a primitive n-th root of unity, since its powers generate all the others.
Examples of the first few of the n-th roots of unity are

$$\begin{array}{c|llll}
n & k = 0 & k = 1 & k = 2 & k = 3 \\ \hline
1 & \omega_1^0 = e^{i2\pi(0/1)} = 1 & & & \\
2 & \omega_2^0 = e^{i2\pi(0/2)} = 1 & \omega_2^1 = e^{i2\pi(1/2)} = -1 & & \\
3 & \omega_3^0 = e^{i2\pi(0/3)} = 1 & \omega_3^1 = e^{i2\pi(1/3)} = -\tfrac{1}{2} + i\tfrac{1}{2}\sqrt{3} & \omega_3^2 = e^{i2\pi(2/3)} = -\tfrac{1}{2} - i\tfrac{1}{2}\sqrt{3} & \\
4 & \omega_4^0 = e^{i2\pi(0/4)} = 1 & \omega_4^1 = e^{i2\pi(1/4)} = i & \omega_4^2 = e^{i2\pi(2/4)} = -1 & \omega_4^3 = e^{i2\pi(3/4)} = -i
\end{array}$$

Three cases are shown in Figure 9.10.

FIGURE 9.10 n-th roots of unity (panels: z² = 1, z³ = 1, z⁴ = 1)
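The properties of the n-th roots of unity stated above are easy to confirm numerically. The following sketch uses only the standard `cmath` module; the function name `roots_of_unity` is hypothetical.

```python
import cmath

def roots_of_unity(n):
    """Return the list [exp(i 2 pi k / n) for k = 0, 1, ..., n-1]."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

if __name__ == "__main__":
    for n in (2, 3, 4):
        roots = roots_of_unity(n)
        # Each root satisfies x^n = 1, and for n >= 2 the roots sum to zero
        # (both up to rounding error).
        print(n, roots, abs(sum(roots)))
```

For n = 1 the only root is 1, so the sum equals 1 rather than 0, which is why the zero-sum property needs n ≧ 2.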
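Discrete Fourier Transform
The Discrete Fourier Transform (DFT) of the vector $x = [x_0, x_1, x_2, \ldots, x_{n-1}]^T$ to the vector $y = [y_0, y_1, y_2, \ldots, y_{n-1}]^T$ is given by the sequence

$$y_j = \sum_{k=0}^{n-1} \omega_n^{jk} x_k \qquad (0 \le j \le n - 1)$$

In matrix notation, the DFT is

$$y = F_n x$$

Here x is the original input signal and y is the DFT of the signal. Also, DFTs are used for trigonometric interpolation among other things. The transformation matrix is defined by

$$F_n = (F_{jk})_{n \times n} = \left( \omega_n^{jk} \right)_{0 \le j \le n-1,\ 0 \le k \le n-1}$$

This is known as the Vandermonde matrix of the roots of unity $\omega_n^k$. To make the matrix unitary, multiply $F_n$ by the normalization factor $1/\sqrt{n}$. The numerical entries displayed below correspond to the sign convention $\omega_n = e^{-i2\pi/n}$, so that $\omega_4 = -i$. Notice this pattern

$$F_1 = \begin{bmatrix} \omega_1^0 \end{bmatrix} = \begin{bmatrix} 1 \end{bmatrix}, \qquad F_2 = \begin{bmatrix} \omega_2^0 & \omega_2^0 \\ \omega_2^0 & \omega_2^1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

$$F_4 = \begin{bmatrix} \omega_4^0 & \omega_4^0 & \omega_4^0 & \omega_4^0 \\ \omega_4^0 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ \omega_4^0 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ \omega_4^0 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}$$

For n = 4, we have

$$y = \begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^1 & \omega_4^2 & \omega_4^3 \\ 1 & \omega_4^2 & \omega_4^4 & \omega_4^6 \\ 1 & \omega_4^3 & \omega_4^6 & \omega_4^9 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = F_4 x$$

where

$$F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}$$

The Inverse DFT is the sequence

$$x_k = \frac{1}{n} \sum_{j=0}^{n-1} \omega_n^{-jk} y_j \qquad (0 \le k \le n - 1)$$

In matrix notation, the inverse of the DFT is

$$x = F_n^{-1} y \qquad \text{where} \qquad F_n^{-1} = \frac{1}{n} F_n^{*}$$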
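Here F∗ is the conjugate transpose of F. For example, with n = 4, we have

$$F_4^{-1} = \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & \omega_4^{-1} & \omega_4^{-2} & \omega_4^{-3} \\ 1 & \omega_4^{-2} & \omega_4^{-4} & \omega_4^{-6} \\ 1 & \omega_4^{-3} & \omega_4^{-6} & \omega_4^{-9} \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix}$$

Moreover, we can show that

$$F_4 P_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & -i & i \\ 1 & 1 & -1 & -1 \\ 1 & -1 & i & -i \end{bmatrix} = \begin{bmatrix} F_2 & D_2 F_2 \\ F_2 & -D_2 F_2 \end{bmatrix}$$

Thus, F4 can be rearranged into diagonally scaled blocks of F2, which holds for any even n. Here we use the permutation matrix

$$P_4 = \begin{bmatrix} e_1 & e_3 & e_2 & e_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

and the diagonal matrix

$$D_2 = \begin{bmatrix} \omega_4^0 & 0 \\ 0 & \omega_4^1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -i \end{bmatrix}$$

In general, Pn is a permutation matrix that groups even-numbered columns and odd-numbered columns and the diagonal matrix is

$$D_{n/2} = \operatorname{Diag}\left( 1, \omega_n, \omega_n^2, \ldots, \omega_n^{(n/2)-1} \right)$$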
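The matrix form of the DFT and its inverse can be sketched in a few lines. This is an illustrative sketch, not a library routine: the names `dft_matrix`, `inverse_dft_matrix`, and `matvec` are hypothetical, and the default `sign=-1` is an assumption chosen so that the 4 × 4 matrix reproduces the numerical entries of F4 displayed above (second row 1, −i, −1, i).

```python
import cmath

def dft_matrix(n, sign=-1):
    """Return F as a list of rows with F[j][k] = omega**(j*k),
    where omega = exp(sign * 2*pi*i / n)."""
    omega = cmath.exp(sign * 2j * cmath.pi / n)
    return [[omega ** (j * k) for k in range(n)] for j in range(n)]

def inverse_dft_matrix(n, sign=-1):
    """Return (1/n) times the conjugate transpose of dft_matrix(n, sign)."""
    F = dft_matrix(n, sign)
    return [[F[k][j].conjugate() / n for k in range(n)] for j in range(n)]

def matvec(A, x):
    """Multiply the matrix A (list of rows) by the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = matvec(dft_matrix(4), x)             # forward transform y = F4 x
    back = matvec(inverse_dft_matrix(4), y)  # recovers x up to rounding
    print(y)
    print(back)
```

Forming the matrix explicitly costs O(n²) storage and work, so this form is useful mainly for checking small cases.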
■ Theorem 2
Recursive Computation of DFT
The Discrete Fourier Transformation of an n-point sequence can be computed by two n/2-point Discrete Fourier Transformations (n even).

We can apply Fn by applying Fn/2 to the even and to the odd subsequences and then scaling the results by ±Dn/2, where necessary. This recursive DFT is called the Fast Fourier Transform (FFT).
Computing the DFT of the four-point problem reduces to computing the DFT of the two-point even and odd subsequences.
Fast Fourier Transforms
Direct computation of DFT requires O(n2) operations, but this can be reduced to O(n log2 n) by exploiting the efficiencies of the Fast Fourier Transform (FFT). The case used most often is when n is a power of two; namely, n = 2r . To reduce the total effort required in finding the FFT, we make use of the periodicity of the complex exponential function and clever reordering of the computations.
Both of the indices on the components of the transform and on the summation run from 0 to n − 1. Exact values of j can be written in binary form as $j = 2^{r-1} j_r + \cdots + 2^2 j_3 + 2 j_2 + j_1$, where each of the numbers j1, j2, …, jr is either 0 or 1. Reading these binary digits in the opposite order gives the reordering of the k values as they relate to the j values, which is known as bit reversal. This is one of the clever ideas in the FFT computations!
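Bit reversal can be made concrete with a short routine that reverses the r binary digits of an index. The sketch below is illustrative; the function names `bit_reverse` and `bit_reversal_order` are hypothetical.

```python
def bit_reverse(j, r):
    """Reverse the r-bit binary representation of the integer j."""
    result = 0
    for _ in range(r):
        result = (result << 1) | (j & 1)  # append the lowest bit of j
        j >>= 1
    return result

def bit_reversal_order(r):
    """List the indices 0, 1, ..., 2**r - 1 in bit-reversed order."""
    return [bit_reverse(j, r) for j in range(2 ** r)]

if __name__ == "__main__":
    # For n = 8 = 2**3: index 1 = 001 maps to 100 = 4, index 3 = 011 to 110 = 6.
    print(bit_reversal_order(3))
```

Applying the reversal twice returns the original index, so the reordering is its own inverse.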
The Cooley-Tukey Fast Fourier Transform (FFT) algorithm rearranges the input values in bit-reversal order and then builds the output transformation. The basic idea is breaking a transform of length N into two transforms of length N/2 using the identity

$$\sum_{n=0}^{N-1} a_n e^{-i2\pi nk/N} = \sum_{n=0}^{N/2-1} a_{2n} e^{-i2\pi(2n)k/N} + \sum_{n=0}^{N/2-1} a_{2n+1} e^{-i2\pi(2n+1)k/N} = \sum_{n=0}^{N/2-1} a_n^{\text{even}} e^{-i2\pi nk/(N/2)} + e^{-i2\pi k/N} \sum_{n=0}^{N/2-1} a_n^{\text{odd}} e^{-i2\pi nk/(N/2)}$$

This is also called the Danielson-Lanczos Lemma. It can be visualized via the Fourier matrix

$$F_n = (F_{jk})_{n \times n}$$

with entries

$$F_{jk} = e^{i2\pi(jk)/n} \equiv \omega_n^{jk} \qquad (0 \le j \le n - 1,\ 0 \le k \le n - 1)$$

where $\omega_n = e^{(i2\pi)/n}$. Multiplying by $1/\sqrt{n}$ makes the matrix unitary.
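For example, we can show that

$$F_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -i \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & i^2 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & i^2 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

In matrix form, we can write

$$F_4 = \begin{bmatrix} I_2 & D_2 \\ I_2 & -D_2 \end{bmatrix} \begin{bmatrix} F_2 & 0 \\ 0 & F_2 \end{bmatrix} \begin{bmatrix} \text{even-odd} \\ \text{shuffle} \end{bmatrix}$$

where

$$F_2 = \begin{bmatrix} 1 & 1 \\ 1 & i^2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$

and, with $\omega_4 = i$ in this display, D2 = Diag(1, i). So in general, we have

$$F_{2n} = \begin{bmatrix} I_n & D_n \\ I_n & -D_n \end{bmatrix} \begin{bmatrix} F_n & 0 \\ 0 & F_n \end{bmatrix} \begin{bmatrix} \text{even-odd} \\ \text{shuffle} \end{bmatrix}$$

and repeating we obtain

$$\begin{bmatrix} F_n & 0 \\ 0 & F_n \end{bmatrix} = \begin{bmatrix} I_{n/2} & D_{n/2} & 0 & 0 \\ I_{n/2} & -D_{n/2} & 0 & 0 \\ 0 & 0 & I_{n/2} & D_{n/2} \\ 0 & 0 & I_{n/2} & -D_{n/2} \end{bmatrix} \begin{bmatrix} F_{n/2} & 0 & 0 & 0 \\ 0 & F_{n/2} & 0 & 0 \\ 0 & 0 & F_{n/2} & 0 \\ 0 & 0 & 0 & F_{n/2} \end{bmatrix} \begin{bmatrix} \text{even-odd shuffle of } 0, 2 \pmod 4 \\ \text{even-odd shuffle of } 1, 3 \pmod 4 \end{bmatrix}$$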
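The even-odd splitting can be turned into a short recursive routine. The sketch below is a minimal illustration of the radix-2 idea, not an optimized implementation: the function names `dft` and `fft` are hypothetical, the input length is assumed to be a power of two, and the sign convention $e^{-i2\pi nk/N}$ is the one used in the Danielson-Lanczos identity (the conjugate of the Fourier matrix entries $e^{i2\pi jk/n}$).

```python
import cmath

def dft(a):
    """Direct O(N^2) transform: A_k = sum_n a_n * exp(-2*pi*i*n*k/N)."""
    N = len(a)
    return [sum(a[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft(a):
    """Recursive radix-2 transform of a sequence whose length is a power of two."""
    N = len(a)
    if N == 1:
        return list(a)
    if N % 2:
        raise ValueError("length must be a power of two")
    even = fft(a[0::2])   # transform of a_0, a_2, a_4, ...
    odd = fft(a[1::2])    # transform of a_1, a_3, a_5, ...
    out = [0j] * N
    for k in range(N // 2):
        # Twiddle factor exp(-2*pi*i*k/N) scales the odd half (the D matrix).
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t            # block row [ I   D ]
        out[k + N // 2] = even[k] - t   # block row [ I  -D ]
    return out

if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 2.5, 7.0]
    print(max(abs(u - v) for u, v in zip(fft(a), dft(a))))
```

The two assignments inside the loop correspond to the block rows [I, D] and [I, −D] of the factorization, and the recursion depth is log2 N, which is the source of the O(n log2 n) operation count.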
Mathematical Software
Mathematical software systems such as MATLAB, Maple, and Mathematica have routines for computing Fourier series and plotting them. Web pages associated with these packages are particularly useful. Moreover, there are software packages specifically designed for handling Fourier series.
Summary 9.4
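• The formulas for the orthogonal projections are

$$a_n = \frac{\langle f, \cos nx \rangle}{\langle \cos nx, \cos nx \rangle} \quad (n \ge 1), \qquad b_n = \frac{\langle f, \sin nx \rangle}{\langle \sin nx, \sin nx \rangle} \quad (n \ge 1)$$

where the inner product of f and g is

$$\langle f, g \rangle = \int_{-\pi}^{\pi} f(x)\,g(x)\,dx$$

• The N-th order Fourier approximation for a function f(x) periodic over [−π, π] is

$$p_N(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{N} (a_n\cos nx + b_n\sin nx)$$

where the Fourier coefficients are

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\cos nx\,dx \quad (0 \le n \le N), \qquad b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\sin nx\,dx \quad (1 \le n \le N)$$

• The Fourier series for f(x) periodic on [−π, π] is

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} (a_n\cos nx + b_n\sin nx)$$

where

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\cos nx\,dx \quad (n \ge 0), \qquad b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x)\sin nx\,dx \quad (n \ge 1)$$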
• An even function f is symmetric with respect to the vertical axis (x = 0) and satisfies
f(−x)= f(x)
• An odd function f is symmetric with respect to the origin and satisfies
f (−x) = − f (x)
• If fe(x) is an even function, we obtain a cosine series

$$f_e(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} a_n\cos nx$$

where

$$a_n = \frac{2}{\pi} \int_{0}^{\pi} f_e(x)\cos nx\,dx \quad (n \ge 0)$$

• If fo(x) is an odd function, we obtain a sine series

$$f_o(x) = \sum_{n=1}^{\infty} b_n\sin nx$$

where

$$b_n = \frac{2}{\pi} \int_{0}^{\pi} f_o(x)\sin nx\,dx \quad (n \ge 1)$$

• The sawtooth wave Fourier series over [−π, π] is

$$f(x) \approx p_N(x) = 2\sum_{n=1}^{N} (-1)^{n+1}\frac{1}{n}\sin nx$$
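• The Fourier series for a periodic function f(x) with period P over [−P/2, P/2] is

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{2n\pi}{P}x\right) + b_n\sin\left(\frac{2n\pi}{P}x\right) \right]$$

where the Fourier coefficients are

$$a_n = \frac{2}{P} \int_{-P/2}^{P/2} f(x)\cos\left(\frac{2n\pi}{P}x\right) dx \quad (n \ge 0), \qquad b_n = \frac{2}{P} \int_{-P/2}^{P/2} f(x)\sin\left(\frac{2n\pi}{P}x\right) dx \quad (n \ge 1)$$

• For f(x) periodic on [−L, L], the Fourier series is

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{n\pi}{L}x\right) + b_n\sin\left(\frac{n\pi}{L}x\right) \right]$$

where

$$a_n = \frac{1}{L} \int_{-L}^{L} f(x)\cos\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 0), \qquad b_n = \frac{1}{L} \int_{-L}^{L} f(x)\sin\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$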
• A function f(x) periodic on [0, 2L] can be approximated by a Fourier series of the form

$$f(x) = \tfrac{1}{2}a_0 + \sum_{n=1}^{\infty} \left[ a_n\cos\left(\frac{n\pi}{L}x\right) + b_n\sin\left(\frac{n\pi}{L}x\right) \right]$$

where

$$a_n = \frac{1}{L} \int_{0}^{2L} f(x)\cos\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 0), \qquad b_n = \frac{1}{L} \int_{0}^{2L} f(x)\sin\left(\frac{n\pi}{L}x\right) dx \quad (n \ge 1)$$

• Fourier series for some common periodic functions on [0, 2L] are

Sawtooth Wave
$$ST(x) = \frac{1}{2} - \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{1}{n}\sin\left(\frac{n\pi}{L}x\right)$$

Square Wave
$$SW(x) = \frac{4}{\pi} \sum_{n=1,3,5,\ldots}^{\infty} \frac{1}{n}\sin\left(\frac{n\pi}{L}x\right)$$

Triangle Wave
$$TW(x) = \frac{8}{\pi^2} \sum_{n=1,3,5,\ldots}^{\infty} \frac{(-1)^{(n-1)/2}}{n^2}\sin\left(\frac{n\pi}{L}x\right)$$

• The complex Fourier series for a function f(x) of period P over [−P/2, P/2] is

$$f(x) = \sum_{n=-\infty}^{\infty} \alpha_n e^{in\omega_0 x}$$

where ω0 = 2π/P and

$$\alpha_n = \frac{1}{P} \int_{-P/2}^{P/2} f(x)\,e^{-in\omega_0 x}\,dx$$

• Euler’s formula is

$$e^{\pm i\theta} = \cos\theta \pm i\sin\theta$$

• de Moivre’s formula is

$$(\cos\theta + i\sin\theta)^k = \cos k\theta + i\sin k\theta$$

• The n solutions of $x^n = 1$, called the n-th roots of unity or de Moivre numbers, are

$$x = \sqrt[n]{1} = e^{i2\pi(k/n)} \qquad (0 \le k \le n - 1)$$

• The n-th roots of unity are given by

$$\omega_n^k = e^{i2\pi(k/n)} = \cos\left(\frac{2\pi k}{n}\right) + i\sin\left(\frac{2\pi k}{n}\right) \quad \begin{cases} = \cos 0 + i\sin 0 = 1 & (k = 0) \\ \ne 1 & (1 \le k \le n - 1) \end{cases}$$

• The Discrete Fourier Transformation (DFT) of $x = [x_0, x_1, x_2, \ldots, x_{n-1}]^T$ to $y = [y_0, y_1, y_2, \ldots, y_{n-1}]^T$ is given in matrix-vector form as

$$y = F_n x$$

where

$$F_n = (F_{jk})_{n \times n} = \left( \omega_n^{jk} \right)_{0 \le j \le n-1,\ 0 \le k \le n-1}$$

which is the Vandermonde matrix of the roots of unity.
• The inverse of the DFT in matrix-vector form is

$$x = F_n^{-1} y \qquad \text{where} \qquad F_n^{-1} = \frac{1}{n} F_n^{*}$$

• In the Fast Fourier Transformation, we have

$$F_{2n} = \begin{bmatrix} I_n & D_n \\ I_n & -D_n \end{bmatrix} \begin{bmatrix} F_n & 0 \\ 0 & F_n \end{bmatrix} \begin{bmatrix} \text{even-odd} \\ \text{shuffle} \end{bmatrix}$$
1. 2.
3. 4.
a5.
6.
8.
9. 10.
For f (x) = x2 on the interval [−1, 1] with period 2, π2∞1 a
Establish the orthogonality properties (p. 461).
Establish these properties:
a. Product of two even functions is even.
b. Product of two odd functions is even.
c. Product of an even and an odd function is odd.
For the sawtooth wave, show how the results of Examples 2 and 3 are related; that is, Equations (14) and (18).
For a continuous function f (x) periodic over the interval [−π, π], determine Formulas (7), (8), and (9) for the coefficients a0, an, and bn of the Fourier Series (6) using term-by-term integration.
Write out the Fourier series for a periodic function f (x) over [−L, L] with period P = 2L under these conditions.
a. f (x) is an even function.
b. f (x) is an odd function.
Explain why it is possible for a0 ≠ 0 in Example 3, but f (−x) = − f (x).
Suppose we are given a periodic function with period P = 2L, but the function f (x) is defined only on the interval [0, L]. Show how it can be extended to the interval [−L, L] under the following conditions.
determine the Fourier series. Show that 6 = n2 . 11. n=1
a. b.
a.
f (x) is an even function. f (x) is an odd function.
Suppose a complex number z has n different n-th roots in the complex plane. So z = r(cos θ + i sin θ) satisfies the equation
\[ z^n = a \]
for a nonzero number a. Show how to determine the n-th roots of a.
What are the four different fourth roots of two?
Starting with the Fourier series for a periodic function f over [−π, π] and Formulas (6)–(9), write out the details for using a change of variables and the substitution rule to convert to the formulas for the Fourier series over these intervals:
a. [−L,L]
b. [−P/2,P/2]
c. [0,2L]
d. [x0,x0+2L]
ab.
Show that +1 is always an n-th root of unity, but −1 is
12. 13.
14.
x
2−L,L≦x<2L ∞6n
7. Find the Fourier series of these periodic functions given over the given intervals:
such a root only if n is even. Establish these beautiful identities
a. e^{2πi} = 1
b. ix = ln(cos x + i sin x)
Derive these Fourier series for the periodic function given over [−π,π]:
a. f(x)=
x
x ,
0 ≦ x < L 2−L, L≦x<2L
2
sinx, 0
Write a program to generate and print 1000 points uniformly and randomly distributed in the circle (x − 3)² + (y + 1)² ≦ 9.
12. Generate in the computer 1000 random numbers in the interval (0, 1). Print and examine them for evidence of nonrandom behavior.
a 13.
14.
a 15. a 16.
17.
18.
Generate 1000 random numbers xi (1 ≦ i ≦ 1000) on your computer. Let ni denote the eighth decimal digit in xi . Count how many 0’s, 1’s, . . . , 9’s there are among the 1000 numbers ni . How many of each would you expect? This code can be written with nine statements.
(Continuation) Using a random-number generator, generate 1000 random numbers, and count how many times the digit i occurs in the j th decimal place. Print a table of these values, that is, frequency of digit versus decimal place. By examining the table, determine which decimal place seems to produce the best uniform distribution of random digits.
Hint: Use Computer Exercise 1.1.7 (p. 17) to compute the arithmetic mean, variance, and standard deviations of the table entries.
Using random integers, write a short program to simulate five people matching coin flips. Print the percentage of match-ups (five of a kind) after 125 flips.
Write a program to generate 1600 random points uniformly distributed in the sphere defined by x² + y² + z² ≦ 1. Count the number of random points in the first octant.
Write a program to simulate 1000 simultaneous flips of three coins. Print the number of times that two of the three coins come up heads.
Compute 1000 triples of random numbers drawn from a uniform distribution. For each triple (x, y, z), compute the leading significant digit of the product xyz. (The leading significant digit is one of 1, 2, …, 9.) Determine the frequencies with which the digits 1 through 9 occur among the 1000 cases. Try to account for the fact that these digits do not occur with the same frequency. (For example, 1 occurs approximately 7 times more often than 9.) If you are intrigued by this, you may wish to consult the articles by Flehinger [1966], Raimi [1969], and Turner [1982].
19. Run the example programs in this section and see whether similar results are obtained on your computer system.
20. Write a program to generate and plot 1000 pseudorandom points with the following exponential distribution inside the following figure: x = − ln(1 − r)/λ for r ∈ [0, 1) and λ = 1/30.
Average of n Numbers
Numerical Integration
Now we turn to applications, the first being the approximation of a definite integral by the Monte Carlo method. If we select the first n elements x1, x2, . . . , xn from a random sequence in the interval (0, 1), then
\[ \int_0^1 f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i) \]
Here, the integral is approximated by the average of n numbers f(x1), f(x2), …, f(xn).
When this is actually carried out, the error is of order 1/√n, which is not at all competitive with good algorithms, such as the Romberg method. However, in higher dimensions, the
21. Improve the program Coarse Check by using ten or a hundred buckets instead of two.
22. (Student Research Project) Investigate some of the latest developments on random-number generators and explore parallel random number generators. Random numbers are often needed for distributions other than the uniform distribution, so this has a statistical aspect.
10.2 Estimation of Areas and Volumes by Monte Carlo Techniques
Monte Carlo method can be quite attractive. For example, the triple integral is approximated by
\[ \int_0^1\!\int_0^1\!\int_0^1 f(x, y, z)\,dx\,dy\,dz \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i, z_i) \]
where (xi , yi , zi ) is a random sequence of n points in the unit cube 0 ≦ x ≦ 1, 0 ≦ y ≦ 1, and 0 ≦ z ≦ 1. To obtain random points in the cube, we assume that we have a random sequence in (0, 1) denoted by ξ1, ξ2, ξ3, ξ4, ξ5, ξ6, . . . To get our first random point p1 in the cube, just let p1 = (ξ1, ξ2, ξ3). The second is, of course, p2 = (ξ4, ξ5, ξ6), and so on.
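As an illustration of the averaging idea on the unit interval and the unit cube, here is a minimal Python sketch. The function names `mc_unit_interval` and `mc_unit_cube`, the seed, the sample size, and the two test integrands are assumptions made for this example; Python's `random.Random` stands in for the random sequence ξ1, ξ2, ξ3, . . . , with consecutive triples forming the points p1, p2, . . . as described above.

```python
import math
import random

def mc_unit_interval(f, n, rng):
    """Average f over n random numbers in (0, 1)."""
    return sum(f(rng.random()) for _ in range(n)) / n

def mc_unit_cube(f, n, rng):
    """Average f over n random points; each point uses three consecutive random numbers."""
    total = 0.0
    for _ in range(n):
        x, y, z = rng.random(), rng.random(), rng.random()
        total += f(x, y, z)
    return total / n

rng = random.Random(12345)  # fixed seed so the run is repeatable
est1 = mc_unit_interval(math.exp, 20000, rng)               # exact value is e - 1
est3 = mc_unit_cube(lambda x, y, z: x * y * z, 20000, rng)  # exact value is 1/8
print(est1, math.e - 1.0)
print(est3, 0.125)
```

Both estimates should agree with the exact values to roughly two decimal places, in line with an error of order 1/√n.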
If the interval (in a one-dimensional integral) is not of length 1 but, say, is the general case (a, b), then the average of f over n random points in (a, b) is not simply an approximation for the integral, but rather for
\[ \frac{1}{b-a}\int_a^b f(x)\,dx \]
which agrees with our intention that the function f (x) = 1 have an average of 1. Similarly, in higher dimensions, the average of f over a region is obtained by integrating and dividing by the area, volume, or measure of that region. For instance,
\[ \frac{1}{8}\int_1^3\!\int_{-1}^{1}\!\int_0^2 f(x, y, z)\,dx\,dy\,dz \]
is the average of f over the parallelepiped described by the following three inequalities: 0 ≦ x ≦ 2, −1 ≦ y ≦ 1, 1 ≦ z ≦ 3.
Order of Integration
To keep the limits of integration straight, recall that
\[ \int_a^b\!\int_c^d f(x, y)\,dx\,dy = \int_a^b\Bigl[\int_c^d f(x, y)\,dx\Bigr]dy \]
and
\[ \int_{a_1}^{a_2}\!\int_{b_1}^{b_2}\!\int_{c_1}^{c_2} f(x, y, z)\,dx\,dy\,dz = \int_{a_1}^{a_2}\Bigl\{\int_{b_1}^{b_2}\Bigl[\int_{c_1}^{c_2} f(x, y, z)\,dx\Bigr]dy\Bigr\}dz \]
Example Integrals
So if (xi , yi ) denote random points with appropriate uniform distribution, the following examples illustrate Monte Carlo techniques:
\[ \int_0^5 f(x)\,dx \approx \frac{5}{n}\sum_{i=1}^{n} f(x_i) \]
\[ \int_2^5\!\int_1^6 f(x, y)\,dx\,dy \approx \frac{15}{n}\sum_{i=1}^{n} f(x_i, y_i) \]
In each case, the random points should be uniformly distributed in the regions involved.
General Case
In general, we have
\[ \int_A f \approx (\text{measure of } A) \times (\text{average of } f \text{ over } n \text{ random points in } A) \]
Here, we are using the fact that the average of a function on a set is equal to the integral of the function over the set divided by the measure of the set.
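The rule "measure times average" can be sketched for rectangular regions as follows. This is an illustrative helper only: the name `mc_box`, the integrands x² and x + y, the seed, and the sample size are assumptions chosen so that the exact answers (125/3 and 105) are easy to verify by hand.

```python
import random

def mc_box(f, bounds, n, rng):
    """Estimate the integral of f over the box given by bounds = [(lo, hi), ...]."""
    measure = 1.0
    for lo, hi in bounds:
        measure *= hi - lo          # length, area, or volume of the box
    total = 0.0
    for _ in range(n):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        total += f(*point)
    return measure * total / n      # measure times average of f

rng = random.Random(2024)
# Integral of x^2 over 0 <= x <= 5: the multiplier is 5.
est_a = mc_box(lambda x: x * x, [(0.0, 5.0)], 40000, rng)
# Integral of x + y over 1 <= x <= 6, 2 <= y <= 5: the multiplier is 15.
est_b = mc_box(lambda x, y: x + y, [(1.0, 6.0), (2.0, 5.0)], 40000, rng)
print(est_a, 125.0 / 3.0)
print(est_b, 105.0)
```

The bounds are listed in the order of the arguments of f, so the inner variable x comes first, matching the order of integration recalled above.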
Example and Pseudocode
Approximate \( \iint_\Omega f(x, y)\,dx\,dy \) Using Random Numbers
Let us consider the problem of obtaining the numerical value of the integral
\[ \iint_\Omega \sin\sqrt{\ln(x + y + 1)}\,dx\,dy = \iint_\Omega f(x, y)\,dx\,dy \]
over the disk in xy-space, defined by the inequality
\[ \Omega = \Bigl\{(x, y) : \bigl(x - \tfrac{1}{2}\bigr)^2 + \bigl(y - \tfrac{1}{2}\bigr)^2 \leqq \tfrac{1}{4}\Bigr\} \]
FIGURE 10.6 Sketch of surface f (x, y) above disk Ω
A sketch of this domain, with a surface above it, is shown in Figure 10.6. We proceed by generating random points in the square and discarding those that do not lie in the disk. We take n = 5000 points in the disk. If the points are pi = (xi , yi ), then the integral is estimated to be
\[ \iint_\Omega f(x, y)\,dx\,dy \approx (\text{area of disk}) \times \Bigl(\text{average height of } f \text{ over } n \text{ random points}\Bigr) = (\pi r^2)\,\frac{1}{n}\sum_{i=1}^{n} f(p_i) = \frac{\pi}{4n}\sum_{i=1}^{n} f(p_i) \]
The pseudocode for this example follows. Intermediate estimates of the integral are printed when n is a multiple of 1000. This gives us some idea of how the correct value is being approached by our averaging process.
Double Integral Pseudocode
program Double Integral
integer i, j; real sum, vol, x, y
integer n ← 5000, iprt ← 1000
real array (r_ij)_{1:n×1:2}
external function f
call Random((r_ij))
j ← 0; sum ← 0
for i = 1 to n
    x ← r_{i,1}; y ← r_{i,2}
    if (x − 1/2)² + (y − 1/2)² ≦ 1/4 then
        j ← j + 1
        sum ← sum + f(x, y)
        if mod(j, iprt) = 0 then
            vol ← (π/4) sum/real(j)
            output j, vol
        end if
    end if
end for
vol ← (π/4) sum/real(j)
output j, vol
end program Double Integral

real function f(x, y)
real x, y
f ← sin(√(ln(x + y + 1)))
end function f
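For readers who want to run this example, the pseudocode can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: it uses Python's `random.Random` in place of the procedure Random, draws the points one at a time instead of filling an array, and omits the intermediate printing; the seed is arbitrary.

```python
import math
import random

def f(x, y):
    """Integrand sin(sqrt(ln(x + y + 1)))."""
    return math.sin(math.sqrt(math.log(x + y + 1.0)))

def double_integral(n, rng):
    """Generate n points in the unit square and average f over those that land in the disk."""
    total = 0.0
    j = 0  # number of points accepted (inside the disk)
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            j += 1
            total += f(x, y)
    # area of the disk (pi/4) times the average height of f
    return (math.pi / 4.0) * total / j, j

estimate, accepted = double_integral(5000, random.Random(2012))
print(accepted, estimate)
```

As in the pseudocode, n counts the points generated in the square, so only about a fraction π/4 of them are accepted and contribute to the average.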
We obtain an approximate value of 0.57 for the integral.
Computing Volumes
Complicated Region in 3D
The volume of a complicated region in 3-space can be computed by a Monte Carlo technique. Taking a simple case, let us determine the volume of the region whose points satisfy the inequalities
\[ \begin{gathered} 0 \leqq x \leqq 1, \quad 0 \leqq y \leqq 1, \quad 0 \leqq z \leqq 1 \\ x^2 + \sin y \leqq z \\ x - z + e^y \leqq 1 \end{gathered} \]
The first line defines a cube whose volume is 1. The region defined by all the given inequalities is therefore a subset of this cube. If we generate n random points in the cube and determine that m of them satisfy the last two inequalities, then the volume of the desired region is approximately m/n. Here is a pseudocode that carries out this procedure:
Volume Region Pseudocode
program Volume Region
integer i, m; real array (r_ij)_{1:n×1:3}; real vol, x, y, z
integer n ← 5000, iprt ← 1000
m ← 0
call Random((r_ij))
for i = 1 to n
    x ← r_{i,1}
    y ← r_{i,2}
    z ← r_{i,3}
    if x² + sin y ≦ z and x − z + e^y ≦ 1 then m ← m + 1
    if mod(i, iprt) = 0 then
        vol ← real(m)/real(i)
        output i, vol
    end if
end for
end program Volume Region
Observe that intermediate estimates are printed out when we reach 1000, 2000, . . . , 5000 points. An approximate value of 0.14 is determined for the volume of the region.
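A runnable counterpart of the volume computation might look like the sketch below. The function name `volume_region`, the seed, and the use of `random.Random` instead of a prefilled array are assumptions for illustration; the two inequalities are the ones given in the text.

```python
import math
import random

def volume_region(n, rng):
    """Fraction of n random points in the unit cube that satisfy both inequalities."""
    m = 0  # count of points inside the region
    for _ in range(n):
        x, y, z = rng.random(), rng.random(), rng.random()
        if x * x + math.sin(y) <= z and x - z + math.exp(y) <= 1.0:
            m += 1
    return m / n  # the cube has volume 1, so the hit ratio is the volume estimate

vol = volume_region(20000, random.Random(7))
print(vol)
```

Because the enclosing cube has volume 1, no scaling factor is needed; for any other box the ratio would be multiplied by the volume of the box.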
Ice Cream Cone Example
Consider the problem of finding the volume above the cone z² = x² + y² and inside the sphere x² + y² + (z − 1)² = 1 as shown in Figure 10.7.
FIGURE 10.7 Ice cream cone region
The volume is contained in the box bounded by −1 ≦ x ≦ 1, −1 ≦ y ≦ 1, and 0 ≦ z ≦ 2, which has volume 8. Thus, we want to generate random points inside this box and multiply by 8 the ratio of those inside the desired volume to the total number generated. A pseudocode for doing this follows:
Cone Pseudocode
program Cone
integer i, m; real vol, x, y, z
real array (r_ij)_{1:n×1:3}
integer n ← 5000, iprt ← 1000; m ← 0
call Random((r_ij))
for i = 1 to n
    x ← 2r_{i,1} − 1; y ← 2r_{i,2} − 1; z ← 2r_{i,3}
    if x² + y² ≦ z² and x² + y² ≦ z(2 − z) then m ← m + 1
    if mod(i, iprt) = 0 then
        vol ← 8 real(m)/real(i)
        output i, vol
    end if
end for
end program Cone
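The cone pseudocode can be tried out with a short Python sketch such as the following. The name `cone_volume` and the seed are assumptions; the scaling of the random numbers into the box and the two membership tests follow the pseudocode. For comparison, the region consists of a cone of height 1 and radius 1 (volume π/3) capped by a hemisphere of radius 1 (volume 2π/3), so its exact volume is π ≈ 3.14, and any single run with 5000 points will differ from that by sampling error.

```python
import math
import random

def cone_volume(n, rng):
    """Estimate the volume above the cone and inside the sphere using n points in the box."""
    m = 0
    for _ in range(n):
        x = 2.0 * rng.random() - 1.0   # -1 <= x <= 1
        y = 2.0 * rng.random() - 1.0   # -1 <= y <= 1
        z = 2.0 * rng.random()         #  0 <= z <= 2
        r2 = x * x + y * y
        # above the cone: r2 <= z^2; inside the sphere: r2 <= z(2 - z)
        if r2 <= z * z and r2 <= z * (2.0 - z):
            m += 1
    return 8.0 * m / n                 # the box has volume 8

vol = cone_volume(5000, random.Random(99))
print(vol, math.pi)
```

The sphere condition x² + y² + (z − 1)² ≦ 1 is algebraically the same as x² + y² ≦ z(2 − z), which is the form used in the test.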
The volume of the cone is approximately 3.3.
Summary 10.2
• We can approximate integrals by using the Monte Carlo method to estimate areas and volumes:
\[ \int_0^1 f(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i) \]
\[ \int_0^1\!\int_0^1\!\int_0^1 f(x, y, z)\,dx\,dy\,dz \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i, z_i) \]
where {xi } is a sequence of random numbers in the unit interval and (xi , yi , zi ) is a random sequence of n points in the unit cube.
• In general, we have
\[ \int_A f \approx (\text{measure of } A) \times (\text{average of } f \text{ over } n \text{ random points in } A) \]
Exercises 10.2
a1. It is proposed to calculate π by using the Monte Carlo method. A circle of radius 1 is inside a square of side 2. We count how many of m random points in the square happen to lie in the circle. Assume that the error is 1/√m. How many points must be taken to obtain π with three accurate figures (i.e., 3.142)?
2. The Mean-Value Theorem for Integrals says that there exists a number c with a < c < b such that
\[ \int_a^b f(x)\,dx = (b - a) f(c) \]
assuming f (x) is continuous over the interval [a, b]. Consequently, this computation shows that the area under the curve is the base width b − a times the average height f (c). Let f (x) = sin x + (1/3) sin 3x. Find c so that
\[ \int_0^{\pi} \bigl(\sin x + \tfrac{1}{3}\sin 3x\bigr)\,dx = \pi f(c) \]
3. An intuitive method of finding the area under a curve is to approximate that area with a series of rectangles that lie above the subintervals [a = x0, x1], [x1, x2], . . . , [xn−1, xn = b]. Here the interval [a, b] is subdivided into n subintervals of equal width h = (b − a)/n. We form the equally spaced nodes ck = a + (k − 1/2)h for k = 1, 2, …, n. The Composite Midpoint Rule for n subintervals is
\[ \int_a^b f(x)\,dx \approx h\sum_{k=1}^{n} f(c_k) \]
Show that we can approximate the integral
\[ \int_a^b f(x)\,dx \approx (b - a)\hat{f} \]
where
\[ \hat{f} = \frac{1}{n}\sum_{k=1}^{n} f(c_k) \]
by sampling f (x) at the n equally spaced points ck = a + (k − 1/2)h for k = 1, 2, . . . , n, where h = (b − a)/n.
Computer Exercises 10.2
1. Run the codes given in this section on your computer system and verify that they produce reasonable answers.
a2. Write and test a program to evaluate the integral \( \int_0^1 e^x\,dx \) by the Monte Carlo method, using n = 25, 50, 100, 200, 400, 800, 16000, and 32000. Observe that 32,000 random numbers are needed and that the work in each case can be used in the next case. Print the exact answer. Plot the results using a logarithmic scale to show the rate of growth.
3. Write a program to verify numerically that
\[ \pi = \int_0^2 (4 - x^2)^{1/2}\,dx \]
Use the Monte Carlo method and 2500 random numbers.
a4. Use the Monte Carlo method to approximate the integral
\[ \int_{-1}^{1}\!\int_{-1}^{1}\!\int_{-1}^{1} (x^2 + y^2 + z^2)\,dx\,dy\,dz \]
Compare with the correct answer.
a5. Write a program to estimate
\[ \int_0^2\!\int_3^6\!\int_{-1}^{1} (yx^2 + z\log y + e^x)\,dx\,dy\,dz \]
6. Using the Monte Carlo technique, write a pseudocode to approximate the integral
\[ \iiint_\Omega (e^x \sin y \log z)\,dx\,dy\,dz \]
where Ω is the circular cylinder that has height 3 and circular base x² + y² ≦ 4.
a7. Estimate the area under the curve y = e^{−(x+1)²} and inside the triangle that has vertices (1, 0), (0, 1), and (−1, 0) by writing and testing a short program.
8. Using the Monte Carlo approach, find the area of the irregular figure defined by
15.
16.
Write a program to estimate the area of the region defined
by the inequalities
x2 + y2 ≦ 4
|y|≦ex
1≦x≦3, −1≦y≦4 x3 + y3 ≦ 29
y≧ex −2
a
the solid whose points (x, y, z) satisfy
0≦x≦y, 1≦y≦2, −1≦z≦3 ex ≦ y
(sinz)y≧0
An integral can be estimated by the formula 1 1 n
9. Use the Monte Carlo method to estimate the volume of
f(x)dx ≈ f(xi) n i=1
even if the xi ’s are not random numbers; in fact, some
numerical integration scheme. Test whether the estimates
converge at the rate 1/n or 1/√n by using some simple examples, such as \( \int_0^1 e^x\,dx \) and \( \int_0^1 (1 + x^2)^{-1}\,dx \).
Consider the ellipsoid
\[ \frac{x^2}{4} + \frac{y^2}{16} + \frac{z^2}{4} = 1 \]
a. Write a program to generate and store 5000 random
points uniformly distributed in the first octant of this ellipsoid.
nonrandom sequences may be better. Use the sequence
0
a10. Using a Monte Carlo technique, estimate the area of the region determined by the inequalities 0 ≦ x ≦ 1, 10 ≦ y ≦ 13, y ≧ 12 cos x, and y ≧ 10 + x³. Print intermediate answers.
11. Use the Monte Carlo method to approximate the following integrals.
a.
b.
c.
xi = fractional part of i√2 and test the corresponding
17.
18.
111 −1 −1 −1
(x2 −y2 −z2)dxdydz (x2 −y2 +xy−3)dxdy
ab. Write a program to estimate the volume of this ellipsoid in the first octant.
45
1 2 3y
A Monte Carlo method for estimating ab f (x ) d x if f(x)≧0 is as follows: Let c≧ max f(x).
√
2 1+y √
1 yy+z 0 y2 0
dom points (x, y) that satisfy y ≦ f (x). Then
(x2y + xy2)dx dy
a≦x≦b
Then generate n random points (x, y) in the rectangle a≦x≦b, 0≦y≦c. Count the number k of these ran-
d.
12. The value of the integral
xy dx dy dz
b
a π/4 2cosφ 2π Verify this and test the method on 2 x2 dx,
using spherical coordinates is the volume above the cone z2 = x2+y2 andinsidethespherex2+y2+(z−1)2 = 1. Use the Monte Carlo method to approximate this integral and compare the results with that from the example in the text.
13. Let R denote the region in the xy-plane defined by the inequalities
19.
(Continuation) Use the method to estimate the value of 1
1(2x2 − ρ2 sin φ dθ dρ dφ 110
0 0 0 x+1)dx,and 0(x2+sin2x)dx.
f(x)dx≈kc(b−a)/n
1 3
20.
21.
π = 4 1 − x2 dx 0
Generate random points in 0 ≦ x ≦ 1, 0 ≦ y ≦ 1. Use n = 1000, 2000, . . . , 10000 and try to determine whether the error is behaving like 1/√n.
(Continuation) Modify the method to handle the case
when f takes positive and negative values on [a, b]. Test
the method on 1
−1 b
Another Monte Carlo method for evaluating a f (x ) d x is as follows: Generate an odd number of ran- dom numbers in (a,b). Reorder these points so that a < x1 < x2 < · · ·
6. Establish the properties claimed for the function g in Equation (10).
7. Show that for the simple problem x′′ =−x
x(a)=α, x(b)=β
x′′ = −x
x(0)=3, x(π)=7
This problem is to be solved analytically, not by computer or calculator.
Exercises 11.2
522
Chapter 11 Boundary-Value Problems
1.
Explain the main steps in setting up a program to solve this two-point boundary value problem by the finite-difference method.
x′′ =xsint+x′cost−et x(0)=0, x(1)=1
Show any preliminary work that must be done before programming. Exploit the linearity of the differential equation. Program and compare the results when different values of n are used, say, n = 10, 100, and 1000.
Solve the following two-point boundary value problem numerically. For comparisons, the exact solutions are given.
εx′′+(x′)2=1 x(0)=0, x(1)=1
Vary ε = 10^{−1}, 10^{−2}, 10^{−3}, . . . . Compare to the true solution x(t) = 1 + ε ln cosh((t − 0.745)/ε), which has a corner at t = 0.745.
2.
x′′=(1−t)x+1 aa. (1+t)2
using λ = 3.55.
a
b. 3
x(0) = 0, x(1) = −log2
usingε=10 . Compare to the true solution x (t ) = 1 + erf(t / 2ε)/
x(0)=1, x(1)=0.5
x′′ = 1(2−t)e2x +(1+t)−1
we let λ = 3.51383 . . . , there are two solutions
b.
c.
d.
e.
Troesch’s problem:
x′′ =μsinh(μx)
x(0) = 0, x(1) = 1
Bratu’s problem:
x′′+λex =0
x(0) = 0,x(1) = 0
using μ = 50.
If
when λ < λ∗, one solution when λ = λ∗, and no solutions when λ > λ∗.
εx′′+tx′=0 −8
x(−1) = 0, x(1) = 2 √
erf(1/ 2ε).
√
3.
4. 5.
6.
7.
Solve the boundary-value problem
x′′ =−x+tx′−2tcost+t
Cash [2003] uses these and other test problems in his research. For more information on them, see www2.imperial.ac.uk/∼jcash/.
(Buckling of a Circular Ring Project) A model for a circular ring with compressibility c under hydrostatic pressure p from all directions is given by the following boundary-value problem involving a system of seven differential equations:
x(0)=0, x(π)=π
by discretization. Compare with the exact solution, which
is x (t ) = t + 2 sin t .
Repeat Computer Exercise 11.1.2 (p. 513), using a discretization method.
Write a computer program to implement
a. program BVP1 b. program BVP2
(Continuation) Using built-in routines in mathematical software systems such as MATLAB, Maple, or Mathematica, solve and plot the solution curve for the boundary-value problem associated with
a. program BVP1 b. program BVP2
Investigate the computation of numerical solutions to the following challenging test problems, which are nonlinear:
x′′ =ex
a.
8.
x(0)=0, x(1)=0
′
=−1−cy5 +(c+1)y7
=[1+c(y5−y7)]cosy1
=[1+c(y −y )]siny 571
y
y2′ 1
y5′
y′ 3
y4′ y6′
=1+c(y5−y7)
= y6[−1−cy5 +(c+1)y7] =y5y7−[1+c(y5−y7)](y5+p) =[1+c(y5−y7)]y6
y7′
where y1(0) = π/2, y1π/2 = 0, y2π/2 = 0,
y3(0)=0,y4(0)=0,andy6(0)=0,y6 π/2 =0. Various simplifications are useful in the study of the
Computer Exercises 11.2
buckling or collapse of the circular ring. Consider only a quarter-circle using symmetry. (See Sketch (a).) As the pressure increases, the radius of the circle decreases, and a bifurcation or a change of state can occur. (See Sketch (b).)
The shooting method together with more advanced numerical methods can be used to solve this problem. Explore some of them. See Huddleston [2000] and Sauer [2012].
The figure shows a non-insulated uniform rod positioned between two bodies of constant temperature, but different values; say, T1 > T2 and T2 > Ta. The resulting differential equation is
d2T
dx2 +α(Ta−T)=0
Here Ta is the temperature of the surrounding air and α is the radiative heat loss coefficient for the rate of heat dissipation to the surrounding air. For a simple case, consider a 10 meter rod with the temperature held fixed at the end values T(0) = T1 and T(L) = T2. With these values Ta = 20, T1 = T(0) = 40, T2 = T(10) = 200, and α = 10^{−2}, solve this problem using the following approaches.
a. Use calculus to determine the analytical solution.
b. Use the shooting method.
c. Use the finite-difference method.
Produce a table of computer results comparing the exact analytical solution with the shooting method and the finite-difference method at values of x = 0 to x = 10 in steps of 2. Show that the errors can be mitigated by decreasing the step size used in the numerical methods. (See Chapra [2012].)
11.2 A Discretization Method 523
[Sketch (a): quarter-circle of the ring under pressure p]
9. Use Conservation of Heat to determine the heat balance for a long thin rod. Assume the rod is not insulated along its length and the system is at a steady state.
[Sketch (b): the ring under pressure p after a change of state]
[Figure: rod from x = 0 to x = L with end temperatures T1 and T2 and surrounding air temperature Ta]
12
Partial Differential Equations
In the theory of elasticity, it is shown that the stress in a cylindrical beam under torsion can be derived from a function u(x, y) that satisfies the Poisson equation
\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + 2 = 0 \]
In the case of a beam whose cross section is the square defined by |x| ≦ 1, | y | ≦ 1, the function u must satisfy Poisson’s equation inside the square and must be zero at each point on the perimeter of the square. By using the methods of this chapter, we can construct a table of approximate values of u(x, y).
12.1 Parabolic Problems
Many physical phenomena can be modeled mathematically by differential equations. When the function that is being studied involves two or more independent variables, the differential equation is usually a partial differential equation. Since functions of several variables are intrinsically more complicated than those of one variable, partial differential equations (PDEs) can lead to some of the most challenging of numerical problems. In fact, their numerical solution is one type of scientific calculation in which the resources of the fastest and most expensive computing systems easily become taxed. We shall see later why this is so.
Some Partial Differential Equations from Applied Problems
Some important partial differential equations and the physical phenomena that they govern are listed next.
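The wave equation in three spatial variables (x, y, z) and time t is
\[ \frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \]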
The function u represents the displacement at time t of a particle whose position at rest is (x, y, z). With appropriate boundary conditions, this equation governs vibrations of a three-dimensional elastic body.
The heat equation is
\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \]
Challenging Problems
Wave Equation
Heat Equation
Laplace’s Equation
The function u represents the temperature at time t in a physical body at the point that has coordinates (x, y, z).
Laplace’s equation is
\[ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = 0 \]
It governs the steady-state distribution of heat in a body or the steady-state distribution of electrical charge in a body. Laplace’s equation also governs gravitational, electric, and magnetic potentials and velocity potentials in irrotational flows of incompressible fluids. The form of Laplace’s equation given above applies to rectangular coordinates. In cylindrical and spherical coordinates, it takes these respective forms:
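\[ \frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r} + \frac{1}{r^2}\frac{\partial^2 u}{\partial \varphi^2} + \frac{\partial^2 u}{\partial z^2} = 0 \]
\[ \frac{1}{r}\frac{\partial^2}{\partial r^2}(ru) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\Bigl(\sin\theta\,\frac{\partial u}{\partial\theta}\Bigr) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2 u}{\partial\varphi^2} = 0 \]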
The biharmonic equation is
\[ \frac{\partial^4 u}{\partial x^4} + 2\frac{\partial^4 u}{\partial x^2\,\partial y^2} + \frac{\partial^4 u}{\partial y^4} = 0 \]
It occurs in the study of elastic stress, and from its solution the shearing and normal stresses can be derived for an elastic body.
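The Navier-Stokes equations are
\[ \frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + \frac{\partial p}{\partial x} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \]
\[ \frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + \frac{\partial p}{\partial y} = \frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2} \]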
Here, u and v are components of the velocity vector in a fluid flow. The function p is the pressure, and the fluid is assumed to be incompressible but viscous.
In three dimensions, the following operators are useful in writing many standard partial differential equations
Biharmonic Equation
Navier-Stokes Equations
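Operators
\[ \nabla = \frac{\partial}{\partial x} + \frac{\partial}{\partial y} + \frac{\partial}{\partial z} \]
\[ \nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2} \qquad \text{(Laplacian operator)} \]
For example, we have
Classical PDEs Using Operators
Heat equation: \( \frac{1}{k}\frac{\partial u}{\partial t} = \nabla^2 u \)
Diffusion equation: \( \frac{\partial u}{\partial t} = \nabla(d\nabla u) + \rho \)
Wave equation: \( \frac{1}{\nu^2}\frac{\partial^2 u}{\partial t^2} = \nabla^2 u \)
Laplace equation: \( \nabla^2 u = 0 \)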
Boundary Conditions
Mathematical Software
Poisson equation: \( \nabla^2 u = -4\pi\rho \)
Helmholtz equation: \( \nabla^2 u = -k^2 u \)
The diffusion equation with diffusion constant d has the same structure as the heat equation because heat transfer is a diffusion process. Some authors use alternate notation such as Δu = div(grad(u)) = ∇²u.
Additional examples from quantum mechanics, electromagnetism, hydrodynamics, elasticity, and so on could also be given, but the five partial differential equations shown already exhibit a great diversity. The Navier-Stokes equation, in particular, illustrates a very complicated problem: a pair of nonlinear, simultaneous partial differential equations.
To specify a unique solution to a partial differential equation, additional conditions must be imposed on the solution function. Typically, these conditions occur in the form of boundary values that are prescribed on all or part of the perimeter of the region in which the solution is sought. The nature of the boundary and the boundary values are usually the determining factors in setting up an appropriate numerical scheme for obtaining the approximate solution.
MATLAB includes a PDE Toolbox for partial differential equations. It contains many commands for such tasks as describing the domain of an equation, generating meshes, computing numerical solutions, and plotting. Within MATLAB, the command pdetool invokes a graphical user interface (GUI) that is a self-contained graphical environment for solving partial differential equations. One draws the domain and indicates the boundary, fills in menus with the problem and boundary specifications, and selects buttons to solve the problem and plot the results. Although this interface may provide a convenient working environment, there are situations in which command-line functions are needed for additional flexibility. A suite of demonstrations and help files is useful in finding one’s way. For example, this software can handle PDEs of the following types:
Examples of Types of PDEs
Boundary Conditions
Unit Normal Derivative
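Parabolic PDE:    $b\,\dfrac{\partial u}{\partial t} - \nabla \cdot (c \nabla u) + au = f$
Hyperbolic PDE:   $b\,\dfrac{\partial^2 u}{\partial t^2} - \nabla \cdot (c \nabla u) + au = f$
Elliptic PDE:     $-\nabla \cdot (c \nabla u) + au = f$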
for x and y on the two-dimensional domain for the problem. On the boundaries of the domain, the following boundary conditions can be handled:
Dirichlet:             $hu = r$
Generalized Neumann:   $\vec{n} \cdot (c \nabla u) + qu = g$
Mixed:                 combination of Dirichlet/Neumann
Here, n⃗ is the outward unit normal vector, so that n⃗ · ∇u = ∂u/∂ν is the outward normal derivative. While the PDE can be entered via a dialog box, both the boundary conditions and the PDE coefficients a, c, d can be entered in a variety of ways. One can construct the geometry of the domain by drawing solid objects (circle, polygon, rectangle, and ellipse) that may be overlapped, moved, and rotated. Also, Maple, Mathematica, and specialized software packages can be used to solve a wide variety of PDEs.
Heat Equation Model Problem
Now, we consider a model problem of modest scope to introduce some of the essential ideas. For technical reasons, the problem is said to be of the parabolic type. In it we have
the heat equation in one spatial variable accompanied by boundary conditions appropriate to a certain physical phenomenon:
$$\frac{\partial^2}{\partial x^2} u(x,t) = \frac{\partial}{\partial t} u(x,t), \qquad u(0,t) = u(1,t) = 0, \qquad u(x,0) = \sin \pi x \tag{1}$$
These equations govern the temperature u(x, t) in a thin rod of length 1 when the ends are held at temperature 0, under the assumption that the initial temperature in the rod is given by the function sin πx (see Figure 12.1). In the xt-plane, the region in which the solution is sought is described by inequalities 0 ≦ x ≦ 1 and t ≧ 0. On the boundary of this region (shaded in Figure 12.2), the values of u have been prescribed.
FIGURE 12.1 Heated rod (rod on 0 ≦ x ≦ 1 with ice at both ends)
FIGURE 12.2 Heat equation: xt-plane (x from 0 to 1 on the horizontal axis, t on the vertical axis)
f′ and f′′ Finite Difference Approximations
Finite-Difference Method
A principal approach to the numerical solution of such a problem is the finite-difference method. It proceeds by replacing the derivatives in the equation by finite differences. Two formulas from Section 4.3 are useful in this context:
$$f'(x) \approx \frac{1}{h}\,[f(x+h) - f(x)]$$
$$f''(x) \approx \frac{1}{h^2}\,[f(x+h) - 2f(x) + f(x-h)]$$
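The behavior of these two formulas can be checked numerically. The following Python sketch is an illustration only: the function names `forward_diff` and `central_second_diff` and the test function sin x are assumptions made here, not part of the text. It applies both formulas at x = 1 for a few step lengths h so that the shrinking errors can be inspected.

```python
import math

def forward_diff(f, x, h):
    """Forward-difference estimate of f'(x): [f(x+h) - f(x)] / h."""
    return (f(x + h) - f(x)) / h

def central_second_diff(f, x, h):
    """Central-difference estimate of f''(x): [f(x+h) - 2f(x) + f(x-h)] / h^2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

if __name__ == "__main__":
    x0 = 1.0
    for h in (0.1, 0.05, 0.025):
        # Exact values for f = sin: f'(x) = cos(x), f''(x) = -sin(x).
        err1 = abs(forward_diff(math.sin, x0, h) - math.cos(x0))
        err2 = abs(central_second_diff(math.sin, x0, h) + math.sin(x0))
        print(f"h={h:<6} f' error={err1:.2e}  f'' error={err2:.2e}")
```

Halving h should roughly halve the error of the first formula and roughly quarter the error of the second, in line with their orders of accuracy.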
If the formulas are used in the differential Equation (1), with possibly different step lengths h and k, the result is
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t+k) - u(x,t)] \tag{2}$$
This equation is now interpreted as a means of advancing the solution step-by-step in the t variable. That is, if u(x,t) is known for 0≦ x ≦ 1 and 0≦ t ≦ t0, then Equation (2) allows us to evaluate the solution for t = t0 + k.
Equation (2) can be rewritten in the form
$$u(x,t+k) = \sigma u(x+h,t) + (1 - 2\sigma)\,u(x,t) + \sigma u(x-h,t) \tag{3}$$
where
$$\sigma = \frac{k}{h^2}$$
Stencil (4-Point Explicit)
FIGURE 12.3 Heat equation: Explicit stencil, with the points (x − h, t), (x, t), (x + h, t) on the known level and (x, t + k) on the new level
Marching Method
A sketch showing the location of the four points involved in this equation is given in Figure 12.3. Since the solution is known on the boundary of the region, it is possible to compute an approximate solution inside the region by systematically using Equation (3). It is, of course, an approximate solution because Equation (2) is only a finite-difference analog of Equation (1).
To obtain an approximate solution on a computer, we select values for h and k and use Equation (3). An analysis of this procedure, which is outside the scope of this text, shows that for stability of the computation, the coefficient 1−2σ in Equation (3) should be nonnegative. (If this condition is not met, errors made at one step may be magnified at subsequent steps, ultimately spoiling the solution.) The reader is referred to Kincaid and Cheney [2002] or Forsythe and Wasow [1960] for a discussion of stability. Using this algorithm, we can continue the solution indefinitely in the t-variable by computations involving only prior values of t. This is an example of a marching problem or marching method.
Pseudocode for Explicit Method
For utmost simplicity, we select h = 0.1 and k = 0.005. Coefficient σ is now 0.5. This choice makes the coefficient 1 − 2σ equal to zero. Our pseudocode first prints u(ih, 0) for 0 ≦ i ≦ 10 because they are known boundary values. Then it computes and prints u(ih, k) for 0 ≦ i ≦ 10 using Equation (3) and boundary values u(0, t) = u(1, t) = 0. This procedure is continued until t reaches the value 0.1. The single subscripted arrays $(u_i)$ and $(v_i)$ are used to store the values of the approximate solution at t and t + k, respectively. Since the analytic solution of the problem is
$$u(x,t) = e^{-\pi^2 t} \sin \pi x$$
(see Exercise 12.1.3), the error can be printed out at each step.
The procedure described is an example of an explicit method. The approximate values of u(x, t + k) are calculated explicitly in terms of u(x, t). Not only is this situation atypical, but even in this problem the procedure is rather slow because considerations of stability force us to select
$$k \leqq \tfrac{1}{2} h^2$$
Since h must be rather small to represent the derivative accurately by the finite difference formula, the corresponding k must be extremely small. Values such as h = 0.1 and k = 0.005 are representative, as are h = 0.01 and k = 0.00005. With such small values of k, an inordinate amount of computation is necessary to make much progress in the t variable.
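As a rough illustration of the explicit marching process, the following Python sketch applies Equation (3) to the model problem with h = 0.1 and k = 0.005 and measures the error against the analytic solution at t = 0.1. The function name `explicit_heat` and its parameters are choices made for this sketch, not names from the text.

```python
import math

def explicit_heat(n=10, m=20, k=0.005):
    """March u_t = u_xx on [0, 1] with u(0,t) = u(1,t) = 0, u(x,0) = sin(pi x).

    Uses Equation (3) for m time steps of length k on a grid with h = 1/n.
    Returns the grid values at t = m*k and the largest error there against
    the analytic solution exp(-pi^2 t) sin(pi x).
    """
    h = 1.0 / n
    sigma = k / h**2
    u = [math.sin(math.pi * i * h) for i in range(n + 1)]
    u[0] = u[n] = 0.0                      # boundary values
    for _ in range(m):
        v = [0.0] * (n + 1)                # new time level; ends stay 0
        for i in range(1, n):
            v[i] = sigma * u[i + 1] + (1.0 - 2.0 * sigma) * u[i] + sigma * u[i - 1]
        u = v
    t = m * k
    err = max(
        abs(math.exp(-math.pi**2 * t) * math.sin(math.pi * i * h) - u[i])
        for i in range(n + 1)
    )
    return u, err

if __name__ == "__main__":
    u, err = explicit_heat()
    print("midpoint value:", u[5], " max error:", err)
```

Calling `explicit_heat(k=0.006, m=400)`, which makes σ = 0.6 and violates the step-size restriction, is one way to observe the error growth that the stability condition is meant to prevent.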
Explicit Method
    program Parabolic1
    integer i, j; real array (u_i)_{0:n}, (v_i)_{0:n}
    integer n ← 10, m ← 20
    real h ← 0.1, k ← 0.005
    real u_0 ← 0, v_0 ← 0, u_n ← 0, v_n ← 0
    for i = 1 to n − 1
        u_i ← sin(πih)
    end for
    output (u_i)
    for j = 1 to m
        for i = 1 to n − 1
            v_i ← (u_{i−1} + u_{i+1})/2
        end for
        output (v_i)
        t ← jk
        for i = 1 to n − 1
            u_i ← e^{−π²t} sin(πih) − v_i
        end for
        output (u_i)
        for i = 1 to n − 1
            u_i ← v_i
        end for
    end for
    end program Parabolic1
Parabolic1 Pseudocode
Crank-Nicolson Method
An alternative procedure of the implicit type goes by the name of its inventors, John Crank and Phyllis Nicolson, and is based on a simple variant of Equation (2):
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)] \tag{4}$$
If a numerical solution at grid points x = ih, t = jk has been obtained up to a certain level in the t variable, Equation (4) governs the values of u on the next t level. Therefore, Equation (4) may be rewritten as
$$-u(x-h,t) + r\,u(x,t) - u(x+h,t) = s\,u(x,t-k) \tag{5}$$
in which
$$r = 2 + s \qquad \text{and} \qquad s = \frac{h^2}{k}$$
Stencil (4-Point Implicit)
The locations of the four points in this equation are shown in Figure 12.4.
FIGURE 12.4 Crank-Nicolson method: Implicit stencil, with the points (x − h, t), (x, t), (x + h, t) on the new level and (x, t − k) on the known level
On the t level, u is unknown, but on the (t − k) level, u is known. So we can introduce unknowns $u_i = u(ih, t)$ and known quantities $b_i = s\,u(ih, t-k)$ and write Equation (5) in matrix form:
$$\begin{bmatrix} r & -1 & & & & \\ -1 & r & -1 & & & \\ & -1 & r & -1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & -1 & r & -1 \\ & & & & -1 & r \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{n-2} \\ u_{n-1} \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_{n-2} \\ b_{n-1} \end{bmatrix} \tag{6}$$
Diagonally Dominant Tridiagonal System
The simplifying assumption that u(0, t) = u(1, t) = 0 has been used here. Also, h = 1/n. The system of equations is tridiagonal and diagonally dominant because |r| = 2 + h²/k > 2. Hence, it can be solved by the efficient method of Section 2.3.
Stable
An elementary argument shows that this method is stable. We shall see that if the initial values u(x, 0) lie in an interval [α, β], then values subsequently calculated by using Equation (5) will also lie in [α, β], thereby ruling out any unstable growth. Since the solution is built up line by line in a uniform way, we need only verify that the values on the first computed line, u(x, k), lie in [α, β]. Let j be the index of the largest $u_i$ that occurs on this line t = k. Then we have
$$-u_{j-1} + r\,u_j - u_{j+1} = b_j$$
Since $u_j$ is the largest of the u's, $u_{j-1} \leqq u_j$ and $u_{j+1} \leqq u_j$. Thus, we obtain
$$r\,u_j = b_j + u_{j-1} + u_{j+1} \leqq b_j + 2u_j$$
Since r = 2 + s and $b_j = s\,u(jh, 0)$, the previous inequality leads at once to
$$u_j \leqq u(jh, 0) \leqq \beta$$
Since $u_j$ is the largest of the $u_i$, we have
$$u_i \leqq \beta \qquad \text{for all } i$$
Similarly, we have
$$u_i \geqq \alpha \qquad \text{for all } i$$
thus establishing our assertion.
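The implicit step defined by Equation (5) and System (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `solve_tridiagonal` is a small stand-in written here for a tridiagonal solver such as the one referred to in Section 2.3 (it is not that procedure itself), and `implicit_step` and `implicit_heat` are hypothetical names. The last function lets one check that the computed values stay inside the interval spanned by the initial data, even for a large time step.

```python
import math

def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve a tridiagonal system by elimination without pivoting.

    sub[i-1] is the entry in row i, column i-1; sup[i] is the entry in
    row i, column i+1. Inputs are not modified.
    """
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    for i in range(1, n):
        factor = sub[i - 1] / d[i - 1]
        d[i] -= factor * sup[i - 1]
        b[i] -= factor * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - sup[i] * x[i + 1]) / d[i]
    return x

def implicit_step(u_interior, h, k):
    """Advance the interior values one time step using Equation (5)."""
    s = h**2 / k
    r = 2.0 + s
    size = len(u_interior)
    off = [-1.0] * (size - 1)              # sub- and superdiagonal of System (6)
    rhs = [s * ui for ui in u_interior]    # b_i = s * u(ih, t - k)
    return solve_tridiagonal(off, [r] * size, off, rhs)

def implicit_heat(n=10, m=20, k=0.005):
    """Model problem with u(x,0) = sin(pi x); returns interior values at t = m*k."""
    h = 1.0 / n
    u = [math.sin(math.pi * i * h) for i in range(1, n)]
    for _ in range(m):
        u = implicit_step(u, h, k)
    return u

if __name__ == "__main__":
    print(implicit_heat())                  # same h and k as the explicit run
    print(implicit_heat(m=5, k=0.1))        # much larger k, values stay in [0, 1]
```

The solver copies its inputs so that the same coefficient lists can be reused at every time step; a production code would more likely overwrite working arrays in place.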
Pseudocode for the Crank-Nicolson Method
Now we present a pseudocode for carrying out the Crank-Nicolson method on the model problem. In it, h = 0.1, k = h²/2, and the solution is continued until t = 0.1. The value of r is 4 and s = 2. It is easier to compute and print only the values of u at interior points on each horizontal line. At boundary points, we have u(0, t) = u(1, t) = 0. The program calls procedure Tri from Section 2.3.
    program Parabolic2
    integer i, j; real array (c_i)_{1:n−1}, (d_i)_{1:n−1}, (u_i)_{1:n−1}, (v_i)_{1:n−1}
    integer n ← 10, m ← 20
    real h ← 0.1, k ← 0.005
    real r, s, t
    s ← h²/k
    r ← 2 + s
    for i = 1 to n − 1
        d_i ← r
        c_i ← −1
        u_i ← sin(πih)
    end for
    output (u_i)
    for j = 1 to m
        for i = 1 to n − 1
            d_i ← r
            v_i ← s u_i
        end for
        call Tri(n − 1, (c_i), (d_i), (c_i), (v_i), (v_i))
        output (v_i)
        t ← jk
        for i = 1 to n − 1
            u_i ← e^{−π²t} sin(πih) − v_i
        end for
        output (u_i)
        for i = 1 to n − 1
            u_i ← v_i
        end for
    end for
    end program Parabolic2
Parabolic2 Pseudocode
We used the same values for h and k in the pseudocode for two methods (explicit and Crank-Nicolson), so a fair comparison can be made of the outputs. Because the Crank-Nicolson method is stable, a much larger k could have been used.
Alternative Version of the Crank-Nicolson Method
Another version of the Crank-Nicolson method can be obtained by using the central differences at (x, t − ½k) in Equation (4) to produce
$$\frac{1}{h^2}\left[u\!\left(x+h,\,t-\tfrac{1}{2}k\right) - 2u\!\left(x,\,t-\tfrac{1}{2}k\right) + u\!\left(x-h,\,t-\tfrac{1}{2}k\right)\right] = \frac{1}{k}\,[u(x,t) - u(x,t-k)]$$
Since the u values are known only at integer multiples of k, terms such as u(x, t − ½k) are replaced by the average of u values at adjacent grid points; that is,
$$u\!\left(x,\,t-\tfrac{1}{2}k\right) \approx \tfrac{1}{2}\,[u(x,t) + u(x,t-k)]$$
So we have
$$\frac{1}{2h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t) + u(x+h,t-k) - 2u(x,t-k) + u(x-h,t-k)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)]$$
The computational form of this equation is
$$-u(x-h,t) + 2(1+s)\,u(x,t) - u(x+h,t) = u(x-h,t-k) + 2(s-1)\,u(x,t-k) + u(x+h,t-k) \tag{7}$$
where
$$s = \frac{h^2}{k} \equiv \frac{1}{\sigma}$$
Stencil (6-Point Implicit)
The six points in this equation are shown in Figure 12.5. This leads to a tridiagonal system of form (6) with r = 2(1 + s) and
$$b_i = u((i-1)h,\,t-k) + 2(s-1)\,u(ih,\,t-k) + u((i+1)h,\,t-k)$$
FIGURE 12.5 Crank-Nicolson method: Alternative stencil, with the points (x − h, t), (x, t), (x + h, t) on the new level and (x − h, t − k), (x, t − k), (x + h, t − k) on the known level
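The six-point scheme of Equation (7) can be sketched along the same lines. In the following Python illustration, `thomas` is a small tridiagonal solver written for this sketch and `six_point_heat` is a hypothetical name; the zero boundary values of the model problem are assumed, so no boundary terms are added to the right-hand side.

```python
import math

def thomas(sub, diag, sup, rhs):
    """Tridiagonal solve by elimination without pivoting (inputs untouched)."""
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    for i in range(1, n):
        factor = sub[i - 1] / d[i - 1]
        d[i] -= factor * sup[i - 1]
        b[i] -= factor * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - sup[i] * x[i + 1]) / d[i]
    return x

def six_point_heat(n=10, m=20, k=0.005):
    """Model heat problem advanced with Equation (7); returns values at t = m*k."""
    h = 1.0 / n
    s = h**2 / k
    u = [math.sin(math.pi * i * h) for i in range(n + 1)]
    u[0] = u[n] = 0.0
    size = n - 1
    off = [-1.0] * (size - 1)
    diag = [2.0 * (1.0 + s)] * size        # r = 2(1 + s)
    for _ in range(m):
        # b_i = u_{i-1} + 2(s - 1) u_i + u_{i+1}, all taken from the known level
        b = [u[i - 1] + 2.0 * (s - 1.0) * u[i] + u[i + 1] for i in range(1, n)]
        u = [0.0] + thomas(off, diag, off, b) + [0.0]
    return u

if __name__ == "__main__":
    u = six_point_heat()
    exact_mid = math.exp(-math.pi**2 * 0.1)
    print("midpoint:", u[5], " exact:", exact_mid, " error:", abs(u[5] - exact_mid))
```

Running this next to a solver for Equation (5) with the same h and k gives a direct comparison of the two implicit variants on the model problem.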
Stability
Matrix Form (Explicit Method)
At the heart of the explicit method is Equation (3), which shows how the values of u for t + k depend on the values of u at the previous time step, t. If we introduce the values of u on the mesh by writing $u_{ij} = u(ih, jk)$, then we can assemble all the values for one t-level into a vector $v^{(j)}$ as follows:
$$v^{(j)} = [u_{0j}, u_{1j}, u_{2j}, \ldots, u_{nj}]^T$$
Equation (3) can now be written in the form
$$u_{i,j+1} = \sigma u_{i+1,j} + (1 - 2\sigma)\,u_{ij} + \sigma u_{i-1,j}$$
This equation shows how $v^{(j+1)}$ is obtained from $v^{(j)}$. It is simply
$$v^{(j+1)} = A v^{(j)}$$
where A is the matrix whose elements are
$$A = \begin{bmatrix} 1-2\sigma & \sigma & & & \\ \sigma & 1-2\sigma & \sigma & & \\ & \sigma & 1-2\sigma & \sigma & \\ & & \ddots & \ddots & \ddots \\ & & \sigma & 1-2\sigma & \sigma \\ & & & \sigma & 1-2\sigma \end{bmatrix}$$
Our equations tell us that
$$v^{(j)} = A v^{(j-1)} = A^2 v^{(j-2)} = A^3 v^{(j-3)} = \cdots = A^j v^{(0)}$$
From physical considerations, the temperature in the bar should approach zero. After all, the heat is being lost through the ends of the rod, which are being kept at temperature 0. Hence, $A^j v^{(0)}$ should converge to 0 as j → ∞.
Step-Size Condition
Mathematical Software
At this juncture, we need a theorem in linear algebra that asserts (for any matrix A) that Ajv → 0 for all vectors v if and only if all eigenvalues of A satisfy |λi| < 1. The eigenvalues of the matrix A in the present analysis are known to be
$$\lambda_i = 1 - 2\sigma(1 - \cos\theta_i) \qquad \text{where} \qquad \theta_i = \frac{i\pi}{n+1}$$
In our problem, we therefore must have
$$-1 < 1 - 2\sigma(1 - \cos\theta_i) < 1$$
This leads to 0 < σ ≦ ½, because $\theta_i$ can be arbitrarily close to π. This in turn leads to the step-size condition
$$k \leqq \tfrac{1}{2} h^2$$
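The eigenvalue formula and the resulting restriction on σ can be checked numerically. The sketch below is illustrative only: the function names are made up for this example, and n is taken to be the dimension of the tridiagonal matrix A, which is an assumption about how the formula is indexed. It verifies that the vectors with components sin(jθ_i) are eigenvectors of A and reports the largest |λ_i| for several values of σ.

```python
import math

def explicit_matrix(n, sigma):
    """n x n tridiagonal matrix with 1 - 2*sigma on the diagonal, sigma beside it."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1.0 - 2.0 * sigma
        if i > 0:
            A[i][i - 1] = sigma
        if i < n - 1:
            A[i][i + 1] = sigma
    return A

def eigenvalue(n, sigma, i):
    """lambda_i = 1 - 2*sigma*(1 - cos(theta_i)) with theta_i = i*pi/(n + 1)."""
    return 1.0 - 2.0 * sigma * (1.0 - math.cos(i * math.pi / (n + 1)))

def eigen_residual(n, sigma, i):
    """Max-norm of A v - lambda_i v for the trial vector v_j = sin(j*theta_i)."""
    theta = i * math.pi / (n + 1)
    v = [math.sin(j * theta) for j in range(1, n + 1)]
    A = explicit_matrix(n, sigma)
    lam = eigenvalue(n, sigma, i)
    Av = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
    return max(abs(Av[r] - lam * v[r]) for r in range(n))

def spectral_radius(n, sigma):
    """Largest |lambda_i| over i = 1, ..., n."""
    return max(abs(eigenvalue(n, sigma, i)) for i in range(1, n + 1))

if __name__ == "__main__":
    for sigma in (0.25, 0.5, 0.6):
        print(f"sigma={sigma}: spectral radius = {spectral_radius(9, sigma):.4f}")
```

For σ ≦ ½ the printed value should stay below 1, and for σ slightly above ½ it should exceed 1, which is the numerical counterpart of the step-size condition.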
Mathematical software systems such as MATLAB, Maple, or Mathematica contain routines that solve partial differential equations. For example in Maple and Mathematica, we can invoke commands to verify the general analytical solution. (See Exercise 12.1.3.) In MATLAB, there is a sample program to numerically solve our model heat equation example. In Figure 12.6, we solve the heat equation, generate a three-dimensional plot of its solution surface, and produce a two-dimensional contour plot, which is displayed in color for indicating the various contours.
PDE Toolbox
The PDE Toolbox within MATLAB produces solutions to partial differential equations using the finite-element formulation of the scalar PDE problem. (See Section 12.3 for additional discussion of the finite-element method.) This software library contains a graphical user interface with graphical tools for describing domains, generating triangular meshes on them, discretizing the PDEs on the mesh, building systems of equations, obtaining numerical approximations for their solution, and visualizing the results.
FIGURE 12.6 Heat equation: (a) Solution surface; (b) Contour plot
In particular, MATLAB has the function parabolic for solving parabolic PDEs. As is found in the online documentation, one can solve the two-dimensional heat equation
$$\frac{\partial u}{\partial t} = \nabla^2 u$$
on the square −1 ≦ x, y ≦ 1. There are Dirichlet boundary conditions u = 0 and discontinuous initial conditions u(0) = 1 in the circle x² + y² < 2/5 and u(0) = 0 otherwise. In fact, the demonstration continues with a movie of the solution curves.
Summary 12.1
• We consider a model problem involving the following parabolic partial differential equation
$$\frac{\partial^2}{\partial x^2} u(x,t) = \frac{\partial}{\partial t} u(x,t)$$
Using finite differences with step size h in the x-direction and k in the t-direction, we obtain
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t+k) - u(x,t)]$$
The computational form is
$$u(x,t+k) = \sigma u(x+h,t) + (1 - 2\sigma)\,u(x,t) + \sigma u(x-h,t)$$
where σ = k/h². An alternative approach is the Crank-Nicolson method based on other finite differences for the right-hand side:
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k}\,[u(x,t) - u(x,t-k)]$$
Its computational form is
$$-u(x-h,t) + r\,u(x,t) - u(x+h,t) = s\,u(x,t-k)$$
where r = 2 + s and s = h²/k. Yet another variant of the Crank-Nicolson method is based on these finite differences:
$$\frac{1}{h^2}\left[u\!\left(x+h,\,t-\tfrac{1}{2}k\right) - 2u\!\left(x,\,t-\tfrac{1}{2}k\right) + u\!\left(x-h,\,t-\tfrac{1}{2}k\right)\right] = \frac{1}{k}\,[u(x,t) - u(x,t-k)]$$
Then by using
$$u\!\left(x,\,t-\tfrac{1}{2}k\right) \approx \tfrac{1}{2}\,[u(x,t) + u(x,t-k)]$$
the computational form is
$$-u(x-h,t) + 2(1+s)\,u(x,t) - u(x+h,t) = u(x-h,t-k) + 2(s-1)\,u(x,t-k) + u(x+h,t-k)$$
where s = h²/k. This results in a tridiagonal system of equations to be solved.
Exercises 12.1

1. A second-order linear differential equation with two variables has the form
$$A\frac{\partial^2 u}{\partial x^2} + B\frac{\partial^2 u}{\partial x\,\partial y} + C\frac{\partial^2 u}{\partial y^2} + \cdots = 0$$
Here, A, B, and C are functions of x and y, and the terms not written are of lower order. The equation is said to be elliptic, parabolic, or hyperbolic at a point (x, y), depending on whether B² − 4AC is negative, zero, or positive, respectively. Classify each of these equations in this manner:
   ᵃa. $u_{xx} + u_{yy} + u_x + \sin x\, u_y - u = x^2 + y^2$
   b. $u_{xx} - u_{yy} + 2u_x + 2u_y + e^x u = x - y$
   ᵃc. $u_{xx} = u_y + u - u_x + y$
   d. $u_{xy} = u - u_x - u_y$
   e. $3u_{xx} + u_{xy} + u_{yy} = e^{xy}$
   ᵃf. $e^x u_{xx} + \cos y\, u_{xy} - u_{yy} = 0$
   g. $u_{xx} + 2u_{xy} + u_{yy} = 0$
   h. $x u_{xx} + y u_{xy} + u_{yy} = 0$

ᵃ2. Derive the two-dimensional form of Laplace's equation in polar coordinates.

3. Show that the function
$$u(x,t) = \sum_{n=1}^{N} c_n e^{-(n\pi)^2 t} \sin n\pi x$$
is a solution of the heat conduction problem $u_{xx} = u_t$ and satisfies the boundary condition
$$u(0,t) = u(1,t) = 0, \qquad u(x,0) = \sum_{n=1}^{N} c_n \sin n\pi x \qquad \text{for all } N \geqq 1$$

ᵃ4. Refer to the model problem solved numerically in this section and show that if there is no roundoff, the approximate solution values obtained by using Equation (3) lie in the interval [0, 1]. (Assume 1 ≧ 2k/h².)

ᵃ5. Find a solution of Equation (3) that has the form $u(x,t) = a^t \sin \pi x$, where a is a constant.

ᵃ6. In using Equation (5), how must the linear System (6) be modified for $u(0,t) = c_0$ and $u(1,t) = c_n$ with $c_0 \neq 0$, $c_n \neq 0$? When using Equation (7)?

ᵃ7. Describe in detail how Equation (1) with boundary conditions u(0, t) = q(t), u(1, t) = g(t), and u(x, 0) = f(x) can be solved numerically by using System (6). Here q, g, and f are known functions.

ᵃ8. What finite difference equation should be a suitable replacement for the equation ∂²u/∂x² = ∂u/∂t + ∂u/∂x in numerical work?

ᵃ9. Consider the partial differential equation ∂u/∂x + ∂u/∂t = 0 with u = u(x, t) in the region [0, 1] × [0, ∞), subject to the boundary conditions u(0, t) = 0 and u(x, 0) specified. For fixed t, we discretize only the first term using $(u_{i+1} - u_{i-1})/(2h)$ for i = 1, 2, ..., n − 1 and $(u_n - u_{n-1})/h$, where h = 1/n. Here, $u_i = u(x_i, t)$ and $x_i = ih$ with fixed t. In this way, the original problem can be considered a first-order initial-value problem
$$\frac{dy}{dt} + \frac{1}{2h} A y = 0$$
where
$$y = [u_1, u_2, \ldots, u_n]^T, \qquad \frac{dy}{dt} = [u_1', u_2', \ldots, u_n']^T, \qquad u_i' = \frac{\partial u_i}{\partial t}$$
Determine the n × n matrix A.

10. Refer to the discussion of the stability of the Crank-Nicolson procedure, and establish the inequality $u_i \geqq \alpha$.

11. What happens to System (6) when k = h²?

12. (Multiple Choice) In solving the heat equation $u_{xx} = u_t$ on the domain t ≧ 1 and 0 ≦ x ≦ 1, one can use the explicit method. Suppose the approximate solution on one horizontal line is a vector $V_j$. Then the whole process turns out to be described by
$$V_{j+1} = A V_j$$
where A is a tridiagonal matrix, having 1 − 2σ on its diagonal and σ in the superdiagonal and subdiagonal positions. Here σ = k/h², where k is the time step and h is the x-step. For stability in the numerical solution, what should we require?
   a. σ = ½
   b. All eigenvalues of A satisfy |λ| < 1.
   c. k ≧ h²/2
   d. h = 0.01 and k = 5 × 10⁻³
   e. None of these.

13. (Continuation) The fully implicit method for solving the heat conduction problem requires at each step the solution of the equation
$$A V_{j-1} = V_j$$
Here, A is not the same as in the preceding problem, but is similar: It has 1 + 2σ on the diagonal and −σ on the subdiagonal and superdiagonal. What do we know about the eigenvalues of this matrix, A?
Hint: This question concerns eigenvalues of A, not A⁻¹.
   a. They are all negative.
   b. They are all in the open interval (0, 1).
   c. They are greater than 1.
   d. They are in the interval (−1, 0).
   e. None of these.

Computer Exercises 12.1

1. Solve the same heat conduction problem as in the text except use h = 2⁻⁴, k = 2⁻¹⁰, and u(x, 0) = x(1 − x). Carry out the solution until t = 0.0125.

2. Modify the Crank-Nicolson code in the text so that it uses the alternative scheme (7). Compare the two programs on the same problems with the same spacing.

3. Recode and test the pseudocode in this section using a computer language that supports vector operations.

4. Run the Crank-Nicolson code with different choices of h and k, in particular, letting k be much larger. Try k = h, for example.

5. Try to take advantage of any special commands or procedures in mathematical software such as in MATLAB, Maple, or Mathematica to solve the numerical example (1).

6. (Continuation) Use the symbolic manipulation capabilities in MATLAB, Maple, or Mathematica to verify the general analytical solution of (1).
Hint: See Exercise 12.1.3.

12.2 Hyperbolic Problems
Wave Equation
Wave Equation Model Problem
The wave equation with one space variable
$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} \tag{1}$$
governs the vibration of a string (transverse vibration in a plane) or the vibration in a rod (longitudinal vibration). It is an example of a second-order linear differential equation of the hyperbolic type. If Equation (1) is used to model the vibrating string, then u(x, t) represents the deflection at time t of a point on the string whose coordinate is x when the string is at rest.
To pose a definite model problem, we suppose that the points on the string have coordinates x in the interval 0 ≦ x ≦ 1 (see Figure 12.7). Let's suppose that at time t = 0, the deflections satisfy equations u(x, 0) = f(x) and $u_t(x, 0) = 0$. Assume also that the ends of the string remain fixed. Then u(0, t) = u(1, t) = 0. A fully defined boundary-value problem (BVP), then, is
Model Problem
$$\begin{cases} u_{tt} - u_{xx} = 0 \\ u(x,0) = f(x) \\ u_t(x,0) = 0 \\ u(0,t) = u(1,t) = 0 \end{cases} \tag{2}$$
FIGURE 12.7 Vibrating string (deflection u(x, t) plotted over 0 ≦ x ≦ 1)
FIGURE 12.8 Wave equation: xt-plane (x from 0 to 1 on the horizontal axis, t on the vertical axis)
Solution
Extension Conditions
The region in the xt-plane where a solution is sought is the semi-infinite strip defined by inequalities 0 ≦ x ≦ 1 and t ≧ 0. As in the heat conduction problem of Section 12.1, the values of the unknown function are prescribed on the boundary of the region shown (see Figure 12.8).
Analytic Solution
The model problem in (2) is so simple that it can be immediately solved. Indeed, the solution is
$$u(x,t) = \tfrac{1}{2}\,[f(x+t) + f(x-t)] \tag{3}$$
provided that f possesses two derivatives and has been extended to the whole real line by defining
$$f(-x) = -f(x), \qquad f(x+2) = f(x)$$
To verify that Equation (3) is a solution, we compute derivatives using the chain rule:
$$u_x = \tfrac{1}{2}\,[f'(x+t) + f'(x-t)], \qquad u_t = \tfrac{1}{2}\,[f'(x+t) - f'(x-t)]$$
$$u_{xx} = \tfrac{1}{2}\,[f''(x+t) + f''(x-t)], \qquad u_{tt} = \tfrac{1}{2}\,[f''(x+t) + f''(x-t)]$$
Obviously, we obtain
$$u_{tt} = u_{xx}$$
Also, we find
$$u(x,0) = f(x)$$
Furthermore, we have
$$u_t(x,0) = \tfrac{1}{2}\,[f'(x) - f'(x)] = 0$$
In checking endpoint conditions, we use the formulas by which f was extended:
$$u(0,t) = \tfrac{1}{2}\,[f(t) + f(-t)] = 0$$
$$u(1,t) = \tfrac{1}{2}\,[f(1+t) + f(1-t)] = \tfrac{1}{2}\,[f(1+t) - f(t-1)] = \tfrac{1}{2}\,[f(1+t) - f(t-1+2)] = 0$$
Odd Function
Periodic Function
The extension of f from its original domain to the entire real line makes it an odd periodic function of period 2. Odd means that
f (x) = − f (−x)
and the periodicity is expressed by
f(x+2)= f(x)
for all x. To compute u(x, t), we need to know f at only two points on the x-axis, x + t and x − t, as in Figure 12.9.
FIGURE 12.9 Wave equation: f stencil, with the point (x, t) and the points (x − t, 0), (x, 0), (x + t, 0) on the x-axis
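Equation (3), together with the odd, period-2 extension of f, translates directly into a short routine. The Python sketch below is illustrative only; the names `extend_odd_periodic` and `dalembert` are chosen here and do not come from the text. It evaluates the analytic solution for any f given on [0, 1] with f(0) = f(1) = 0.

```python
import math

def extend_odd_periodic(f):
    """Extend f from [0, 1] to the real line as an odd function of period 2."""
    def f_ext(x):
        y = math.fmod(x, 2.0)        # reduce to the interval (-2, 2)
        if y < 0.0:
            y += 2.0                 # now 0 <= y <= 2
        if y <= 1.0:
            return f(y)
        # For 1 < y <= 2: f(y) = f(y - 2) by periodicity, and
        # f(y - 2) = -f(2 - y) by oddness, with 2 - y in [0, 1).
        return -f(2.0 - y)
    return f_ext

def dalembert(f, x, t):
    """Analytic solution u(x, t) = [f(x + t) + f(x - t)] / 2 of Model Problem (2)."""
    f_ext = extend_odd_periodic(f)
    return 0.5 * (f_ext(x + t) + f_ext(x - t))

if __name__ == "__main__":
    f = lambda x: math.sin(math.pi * x)
    x, t = 0.3, 1.7
    # For this f the solution is known to be sin(pi x) cos(pi t).
    print(dalembert(f, x, t), math.sin(math.pi * x) * math.cos(math.pi * t))
```

Only the two values f(x + t) and f(x − t) of the extended function are needed for each evaluation, which is what the f stencil expresses.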
Numerical Solution
Basic Scheme
The model problem is used next to illustrate again the principle of numerical solution. Choosing step sizes h and k for x and t , respectively, and using the familiar approximations for derivatives, we have from Equation (1)
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k^2}\,[u(x,t+k) - 2u(x,t) + u(x,t-k)]$$
which can be rearranged as
$$u(x,t+k) = \rho u(x+h,t) + 2(1-\rho)\,u(x,t) + \rho u(x-h,t) - u(x,t-k) \tag{4}$$
Here, we let
$$\rho = \frac{k^2}{h^2}$$
Figure 12.10 shows the point (x, t + k) and the nearby points that enter into Equation (4).
FIGURE 12.10 Wave equation: Explicit stencil, with the points (x, t + k), (x − h, t), (x, t), (x + h, t), and (x, t − k)
Boundary Conditions
The boundary conditions in Problem (2) can be written as
$$u(x,0) = f(x), \qquad \frac{1}{k}\,[u(x,k) - u(x,0)] = 0, \qquad u(0,t) = u(1,t) = 0 \tag{5}$$
The problem defined by Equations (4) and (5) can be solved by beginning at the line t = 0, where u is known, and then progressing one line at a time with t = k, t = 2k, t = 3k, .... Note that because of (5), our approximate solution satisfies
$$u(x,k) = u(x,0) = f(x) \tag{6}$$
Revised Scheme
The use of the O(k) approximation for $u_t$ leads to low accuracy in the computed solution to Model Problem (2). Suppose that there is a row of grid points (x, −k). Letting t = 0 in Equation (4), we have
$$u(x,k) = \rho u(x+h,0) + 2(1-\rho)\,u(x,0) + \rho u(x-h,0) - u(x,-k)$$
Now in the equation
$$u_t(x,0) = 0$$
we use the central difference approximation to obtain
$$\frac{1}{2k}\,[u(x,k) - u(x,-k)] = 0$$
which eliminates the fictitious grid point (x, −k). So instead of Equation (6), we set
$$u(x,k) = \tfrac{1}{2}\rho\,[f(x+h) + f(x-h)] + (1-\rho)\,f(x) \tag{7}$$
because u(x, 0) = f(x). Consequently, values of u(x, nk), for n ≧ 2, can now be computed from Equation (4).
Pseudocode
A pseudocode to carry out this numerical process is given next. For simplicity, three one-dimensional arrays $(u_i)$, $(v_i)$, and $(w_i)$ are used: $(u_i)$ represents the solution being computed on the new t line; $(v_i)$ and $(w_i)$ represent solutions on the preceding two t lines.
    program Hyperbolic
    integer i, j; real t, x, ρ; real array (u_i)_{0:n}, (v_i)_{0:n}, (w_i)_{0:n}
    integer n ← 10, m ← 20
    real h ← 0.1, k ← 0.05
    u_0 ← 0; v_0 ← 0; w_0 ← 0; u_n ← 0; v_n ← 0; w_n ← 0
    ρ ← (k/h)²
    for i = 1 to n − 1
        x ← ih
        w_i ← f(x)
        v_i ← ½ρ[f(x − h) + f(x + h)] + (1 − ρ) f(x)
    end for
    for j = 2 to m
        for i = 1 to n − 1
            u_i ← ρ(v_{i+1} + v_{i−1}) + 2(1 − ρ)v_i − w_i
        end for
        output j, (u_i)
        for i = 1 to n − 1
            w_i ← v_i
            v_i ← u_i
            t ← jk
            x ← ih
            u_i ← TrueSolution(x, t) − v_i
        end for
        output j, (u_i)
    end for
    end program Hyperbolic

    real function f(x)
    real x
        f ← sin(πx)
    end function f

    real function TrueSolution(x, t)
    real t, x
        TrueSolution ← sin(πx) cos(πt)
    end function TrueSolution
Hyperbolic Pseudocode
Mathematical Software
This pseudocode requires accompanying functions to compute values of f (x) and the true solution. We chose f(x) = sin(πx) in our example. It is assumed that the x interval is [0, 1], but when h or n is changed, the interval can be [0, b]; that is, nh = b. The numerical solution is printed on the t lines that correspond to 1k, 2k, . . . , mk.
More advanced treatments show that the ratio
$$\rho = \frac{k^2}{h^2}$$
must not exceed 1 if the solution of the finite difference equations is to converge to a solution of the differential problem as k → 0 and h → 0. Furthermore, if ρ > 1, roundoff errors that occur at one stage of the computation would probably be magnified at later stages and thereby ruin the numerical solution.
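For readers who want to experiment with the scheme and with the restriction on ρ, here is a minimal Python sketch of the marching process built from Equations (4) and (7). The function name `wave_explicit` and its arguments are assumptions made for this illustration, and the default step sizes are those of the pseudocode.

```python
import math

def wave_explicit(f, n=10, m=20, k=0.05):
    """Approximate u_tt = u_xx on [0, 1] with u(x,0) = f(x), u_t(x,0) = 0,
    u(0,t) = u(1,t) = 0. Returns the grid values at t = m*k.
    """
    h = 1.0 / n
    rho = (k / h) ** 2
    w = [f(i * h) for i in range(n + 1)]       # level t = 0
    w[0] = w[n] = 0.0
    v = [0.0] * (n + 1)                        # level t = k from Equation (7)
    for i in range(1, n):
        v[i] = 0.5 * rho * (w[i - 1] + w[i + 1]) + (1.0 - rho) * w[i]
    for _ in range(2, m + 1):                  # levels t = 2k, ..., mk from Equation (4)
        u = [0.0] * (n + 1)
        for i in range(1, n):
            u[i] = rho * (v[i + 1] + v[i - 1]) + 2.0 * (1.0 - rho) * v[i] - w[i]
        w, v = v, u
    return v

if __name__ == "__main__":
    f = lambda x: math.sin(math.pi * x)
    n, m, k = 10, 20, 0.05
    approx = wave_explicit(f, n, m, k)
    t = m * k
    err = max(
        abs(math.sin(math.pi * i / n) * math.cos(math.pi * t) - approx[i])
        for i in range(n + 1)
    )
    print("rho =", (k * n) ** 2, " max error at t =", t, ":", err)
```

Rerunning with a larger k, so that ρ exceeds 1, is a simple way to observe the error magnification described above.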
In MATLAB, the PDE Toolbox has a function for producing the solution of hyperbolic problems using the finite element formulation of the scalar PDE problem. An example found in the online documentation finds the numerical solution of the two-dimensional wave propagation problem
$$\frac{\partial^2 u}{\partial t^2} = \nabla^2 u$$
on the square −1 ≦ x, y ≦ 1 with Dirichlet boundary conditions on the left and right boundaries, u = 0 for x = ±1, and zero values of the normal derivatives on the top and bottom boundaries. Further, there are Neumann boundary conditions ∂u/∂ν = 0 for y = ±1. The initial conditions $u(0) = \arctan\left(\cos\frac{\pi x}{2}\right)$ and $du(0)/dt = 3\sin(\pi x)\exp\left(\sin\frac{\pi y}{2}\right)$ are chosen to avoid putting too much energy into the higher vibration modes.
Advection PDE
Advection Equation
We focus on the advection equation
$$\frac{\partial u}{\partial t} = -c\,\frac{\partial u}{\partial x}$$
Here, u = u(x,t) and c = c(x,t) in which one can consider x as space and t as time. The advection equation is a hyperbolic partial differential equation that governs the motion of a conserved scalar as it is advected by a known velocity field. For example, the advection equation applies to the transport of dissolved salt in water. Even in one space dimension and constant velocity, the system remains difficult to solve. Since the advection equation is difficult to solve numerically, interest typically centers on discontinuous shock solutions,
which are notoriously hard for numerical schemes to handle.
Using the forward-difference approximation in time and the central-difference approximation in space, we have
$$\frac{1}{k}\,[u(x,t+k) - u(x,t)] = -c\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)]$$
This gives
Central Difference Scheme
$$u(x,t+k) = u(x,t) - \tfrac12\sigma\,[u(x+h,t) - u(x-h,t)]$$
where σ = (k/h)c(x,t). All numerical solutions grow in magnitude for all time steps k. For all σ > 0, this scheme is unstable by Fourier stability analysis.

Lax Method
In the central-difference scheme above, replace the u(x,t) term on the right-hand side by $\tfrac12[u(x-h,t) + u(x+h,t)]$. Then we obtain
Lax Scheme
$$u(x,t+k) = \tfrac12[u(x-h,t) + u(x+h,t)] - \tfrac12\sigma\,[u(x+h,t) - u(x-h,t)]$$
$$\phantom{u(x,t+k)} = \tfrac12(1+\sigma)\,u(x-h,t) + \tfrac12(1-\sigma)\,u(x+h,t)$$
This is the Lax method, and this simple change makes the method conditionally stable.

Upwind Method
Another way of obtaining a stable method is by using a one-sided approximation to ux in the advection equation as long as the side is taken in the upwind direction. If c > 0, the transport is to the right. This can be interpreted as a wind of speed c blowing the solution from left to right. So the upwind direction is to the left for c > 0 and to the right for c < 0. Thus, the upwind difference approximation is
$$u_x(x,t) \approx \begin{cases} [u(x,t) - u(x-h,t)]/h & (c > 0) \\ [u(x+h,t) - u(x,t)]/h & (c < 0) \end{cases}$$
Then the upwind scheme for the advection equation is
Upwind Scheme
$$u(x,t+k) = u(x,t) - \sigma \begin{cases} u(x,t) - u(x-h,t) & (c > 0) \\ u(x+h,t) - u(x,t) & (c < 0) \end{cases}$$
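The three one-step updates discussed so far can be compared in a small Python sketch. It assumes a constant velocity c, so that σ = ck/h is a single number, and a periodic grid; the periodic boundary and the function names are assumptions made for illustration, not part of the text. The Lax update is written in its standard form, which averages the two spatial neighbors.

```python
def central_step(u, sigma):
    """Forward-time, central-space update (unstable for every sigma != 0)."""
    n = len(u)
    return [u[i] - 0.5 * sigma * (u[(i + 1) % n] - u[i - 1]) for i in range(n)]

def lax_step(u, sigma):
    """Lax update: u(x,t) is replaced by the average of its two neighbors."""
    n = len(u)
    return [0.5 * (1 + sigma) * u[i - 1] + 0.5 * (1 - sigma) * u[(i + 1) % n]
            for i in range(n)]

def upwind_step(u, sigma):
    """Upwind update: backward difference if sigma >= 0, forward otherwise."""
    n = len(u)
    if sigma >= 0:
        return [u[i] - sigma * (u[i] - u[i - 1]) for i in range(n)]
    return [u[i] - sigma * (u[(i + 1) % n] - u[i]) for i in range(n)]

if __name__ == "__main__":
    profile = [0.0, 1.0, 2.0, 0.0]
    print(upwind_step(profile, 1.0))   # sigma = 1 shifts the profile one cell right
    print(lax_step(profile, 1.0))
    print(central_step([0.0, 1.0, 0.0, -1.0], 0.5))
```

With σ = 1, both the Lax and upwind updates move the profile exactly one grid cell in the direction of transport, whereas the central update increases the sum of squares of a sinusoidal profile at every step.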
Lax-Wendroff Method
Taylor Series Expansion
The Lax-Wendroff scheme is second-order in space and time. The following is one of several possible forms of this method. We start with a Taylor series expansion over one time step:
$$u(x,t+k) = u(x,t) + k\,u_t(x,t) + \tfrac12 k^2\,u_{tt}(x,t) + O(k^3)$$
Now use the advection equation to replace time derivatives on the right-hand side by space derivatives:
$$u_t = -c\,u_x$$
$$u_{tt} = (-c\,u_x)_t = -c_t u_x - c\,(u_x)_t = -c_t u_x - c\,(u_t)_x = -c_t u_x + c\,(c\,u_x)_x$$
Here, we have let c = c(x, t) and have not assumed c is a constant. Substituting for u_t and u_tt gives us
$$u(x,t+k) = u(x,t) - ck\,u_x + \tfrac12 k^2\left[-c_t u_x + c\,(c\,u_x)_x\right] + O(k^3)$$
where everything on the right-hand side is evaluated at (x , t ). If we approximate the space derivative with second-order differences, we obtain a second-order scheme in space and time:
Second-Order Scheme
$$u(x,t+k) \approx u(x,t) - ck\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)] + \tfrac12 k^2\left\{-c_t\,\frac{1}{2h}\,[u(x+h,t) - u(x-h,t)] + c\,(c\,u_x)_x\right\}$$
The difficulty with this scheme arises when c depends on space and we must evaluate the last term in the expression above. In the case in which c is a constant, we obtain
$$c\,(c\,u_x)_x = c^2 u_{xx} \approx c^2\,\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)]$$
The Lax-Wendroff scheme becomes
Lax-Wendroff Scheme
$$u(x,t+k) = u(x,t) - \tfrac12\sigma\,[u(x+h,t) - u(x-h,t)] + \tfrac12\sigma^2\,[u(x+h,t) - 2u(x,t) + u(x-h,t)]$$
where σ = c(k/h). As does the Lax method, this method has numerical dissipation (loss of amplitude); however, it is relatively weak.
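A minimal Python sketch of one Lax-Wendroff step for constant c follows. The periodic grid and the name `lax_wendroff_step` are assumptions made for this illustration only.

```python
def lax_wendroff_step(u, sigma):
    """One Lax-Wendroff step for constant c on a periodic grid (sigma = c*k/h)."""
    n = len(u)
    new = []
    for i in range(n):
        left, right = u[i - 1], u[(i + 1) % n]
        new.append(u[i]
                   - 0.5 * sigma * (right - left)                   # first-order term
                   + 0.5 * sigma ** 2 * (right - 2 * u[i] + left))  # second-order term
    return new

if __name__ == "__main__":
    print(lax_wendroff_step([0.0, 1.0, 2.0, 0.0], 1.0))
```

For σ = 1 the two difference terms combine so that each new value equals the old value of the left neighbor, which is the exact solution of the advection equation over one step when ck = h.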
Summary 12.2
• We consider a model problem involving the following hyperbolic partial differential equation:
$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}$$
Using finite differences, we approximate it by
$$\frac{1}{h^2}\,[u(x+h,t) - 2u(x,t) + u(x-h,t)] = \frac{1}{k^2}\,[u(x,t+k) - 2u(x,t) + u(x,t-k)]$$
The computational form is
$$u(x,t+k) = \rho\,u(x+h,t) + 2(1-\rho)\,u(x,t) + \rho\,u(x-h,t) - u(x,t-k)$$
where ρ = k²/h² < 1. At t = 0, we use
$$u(x,k) = \tfrac12\rho\,[f(x+h) + f(x-h)] + (1-\rho)\,f(x)$$
Exercises 12.2
a1. What is the solution of the boundary-value problem
u_tt = u_xx, u(x, 0) = x(1 − x), u_t(x, 0) = 0, u(0, t) = u(1, t) = 0
at the point where x = 0.3 and t = 4?
a2. Show that the function u(x, t) = f(x + at) + g(x − at) satisfies the wave equation u_tt = a²u_xx.
a3. (Continuation) Using the idea in the preceding problem, solve this boundary-value problem:
u_tt = u_xx, u(x, 0) = F(x), u_t(x, 0) = G(x), u(0, t) = u(1, t) = 0
a1. Given f(x) defined on [0, 1], write and test a function for calculating the extended f that obeys the equations f(−x) = −f(x) and f(x + 2) = f(x).
2. (Continuation) Write a program to compute the solution of u(x, t) at any given point (x, t) for the boundary-value problem of Equation (2).
3. Compare the accuracy of the computed solution, using first Equation (6) and then Equation (7), in the computer program in the text.
4. Use the program in the text to solve boundary-value Problem (2) with
$$f(x) = \tfrac14\left(\tfrac12 - \left|x - \tfrac12\right|\right), \qquad h = \tfrac{1}{16}, \qquad k = \tfrac{1}{32}$$
5. Modify the code in the text to solve boundary-value Problem (2) when u_t(x, 0) = g(x).
Hint: Equations (5) and (7) will be slightly different (a fact that affects only the initial loop in the program).
4.
5.
6.
Show that the boundary-value problem
u_tt = u_xx, u(x, 0) = 2f(x), u_t(x, 0) = 2g(x)
has the solution
u(x, t) = f(x + t) + f(x − t) + G(x + t) − G(x − t)
where G is an antiderivative (i.e., indefinite integral) of g. Here, we assume that −∞
then x∗ ∈ [b′, b]. Taking the midpoint of this interval, we obtain $x = \tfrac12(b' + b)$ as our estimate of x∗ and find that $|x - x^*| \leqq \tfrac16(b - a)$. On the other hand, if F(b′) < F(b′ + δ), then x∗ ∈ [a′, b′ + δ]. Again we take the midpoint, $x = \tfrac12(a' + b' + \delta)$, and find that $|x - x^*| \leqq \tfrac16(b - a) + \tfrac12\delta$. So if we ignore the small quantity δ/2, our accuracy is $\tfrac16(b - a)$ in using three evaluations of F.

FIGURE 13.6 Fibonacci search algorithm: Reset
By continuing the search pattern outlined, we find an estimate x of x∗ with only n evaluations of F and with an error not exceeding
$$\frac12\left(\frac{b-a}{\lambda_n}\right) \tag{1}$$
where λn is the (n + 1)st member of the Fibonacci sequence:
Fibonacci Sequence
$$\lambda_1 = 1, \qquad \lambda_2 = 1, \qquad \lambda_k = \lambda_{k-1} + \lambda_{k-2} \quad (k \geqq 3) \tag{2}$$
For example, elements λ1 through λ8 are 1, 1, 2, 3, 5, 8, 13, and 21.
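The recurrence (2) is easy to tabulate. The sketch below uses the indexing λ1 = λ2 = 1 exactly as written in (2); the helper `first_subscript_exceeding`, which returns the subscript of the first member larger than a given bound, is a hypothetical convenience introduced only for this example.

```python
def fibonacci(count):
    """Return [lambda_1, ..., lambda_count] with lambda_1 = lambda_2 = 1."""
    lam = [1, 1]
    while len(lam) < count:
        lam.append(lam[-1] + lam[-2])
    return lam[:count]

def first_subscript_exceeding(bound):
    """Smallest k such that lambda_k > bound (hypothetical helper)."""
    k, current, following = 1, 1, 1
    while current <= bound:
        current, following = following, current + following
        k += 1
    return k

if __name__ == "__main__":
    print(fibonacci(8))                    # the eight values listed in the text
    print(first_subscript_exceeding(500))  # subscript of the first member above 500
```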
In the Fibonacci search algorithm, we initially determine the number of steps N for a desired accuracy ε > δ by selecting N to be the subscript of the smallest Fibonacci number greater than $\tfrac12(b - a)/\varepsilon$. We define a sequence of intervals, starting with the given interval [a, b] of length ℓ = b − a, and, for k = N, N − 1, …, 3, use these formulas for updating:
Fibonacci Search Algorithm
$$\Delta = \frac{\lambda_{k-2}}{\lambda_k}\,(b - a) \tag{3}$$
$$a' = a + \Delta, \qquad b' = b - \Delta$$
$$\begin{cases} a = a', & \text{if } F(a') \geqq F(b') \\ b = b', & \text{if } F(a') < F(b') \end{cases}$$
At the step k = 2, we set
$$a' = \tfrac12(a + b) - 2\delta, \qquad b' = \tfrac12(a + b) + 2\delta$$
$$\begin{cases} a = a', & \text{if } F(a') \geqq F(b') \\ b = b', & \text{if } F(a') < F(b') \end{cases}$$
and we have the final interval [a, b], from which we compute $x = \tfrac12(a + b)$. This algorithm requires only one function evaluation per step after the initial step.

FIGURE 13.7 Fibonacci search algorithm: Verify using a typical situation

To verify the algorithm, consider the situation shown in Figure 13.7. Since λk = λk−1 + λk−2, we have
$$\ell' = \ell - \Delta = \ell - \frac{\lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-1}}{\lambda_k}\,\ell \tag{4}$$
and the length of the interval of uncertainty has been reduced by the factor (λk−1/λk). The next step yields
$$\Delta' = \frac{\lambda_{k-3}}{\lambda_{k-1}}\,\ell' \tag{5}$$
and Δ′ is actually the distance between a′ and b′. Therefore, one of the preceding points at which the function was evaluated is at one end or the other of [a, b]; that is,
$$b' - a' = \ell - 2\Delta = \frac{\lambda_k - 2\lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-1} - \lambda_{k-2}}{\lambda_k}\,\ell = \frac{\lambda_{k-3}}{\lambda_k}\,\ell = \frac{\lambda_{k-3}}{\lambda_{k-1}}\,\ell' = \Delta'$$
by Equations (2), (4), and (5).
It is clear by Equation (4) that after N − 1 function evaluations, the next-to-last interval has length (1/λN) times the length of the initial interval [a, b]. So the final interval is (b − a)(1/λN) wide, and the maximum error (1) is established. The final step is similar to that outlined, and F is evaluated at a point 2δ away from the midpoint of the next-to-last interval. Finally, set $x = \tfrac12(b + a)$ from the last interval [a, b].

One disadvantage of the Fibonacci search is that the algorithm is rather complicated. Also, the desired precision must be given in advance, and the number of steps to be computed for this precision must be determined before beginning the computation. Thus, the initial evaluation points for the function F depend on N, the number of steps.

Golden Section Search Algorithm
A similar algorithm that is free of these drawbacks is described next. It has been termed the golden section search because it depends on a ratio ρ known to the early Greeks as the golden section ratio:
Golden Section Ratio
$$\rho = \tfrac12\left(1 + \sqrt5\right) \approx 1.6180339887$$
The mathematical history of this number can be found in Roger [1998], and ρ satisfies the equation ρ² = ρ + 1, which has roots $\tfrac12(1 + \sqrt5) \approx 1.61803\ldots$ and $\tfrac12(1 - \sqrt5) \approx -0.61803\ldots$.

Outline of Golden Section Search Algorithm
In each step of this iterative algorithm, an interval [a, b] is available from the previous
work. It is an interval that is known to contain the minimum point x∗, and our objective is to replace it by a smaller interval that is also known to contain x∗. In each step, two values of F are needed:
$$x = a + r(b - a), \qquad u = F(x)$$
$$y = a + r^2(b - a), \qquad v = F(y) \tag{6}$$
where r = 1/ρ and r² + r = 1, which has roots $\tfrac12(-1 + \sqrt5) \approx 0.61803\ldots$ and $\tfrac12(-1 - \sqrt5) \approx -1.61803\ldots$. There are two cases to consider: Either u > v or u ≦ v.
Let us take the first. Figure 13.8 depicts this situation. Since F is assumed continuous and unimodal, the minimum of F must be in the interval [a, x]. This interval is the input interval at the beginning of the next step. Observe now that within the interval [a, x], one evaluation of F is already available, at y. Also note that
a + r(x − a) = y
because x − a = r(b − a). In the next step, therefore, y will play the role of x, and we shall need the value of F at the point a + r²(x − a). In this step we must carry out the following replacements in order:
Case u > v
b ← x
x ← y
u ← v
y ← a + r²(b − a)
v ← F(y)

FIGURE 13.8 Golden section search algorithm: u > v

The other case is similar. If u ≦ v, the picture might be as in Figure 13.9. In this case, the minimum point must lie in [y, b]. Within this interval, one value of F is available, at x. Observe that
y + r²(b − y) = x
(See Exercise 13.1.9.) Thus, x should now be given the role of y, and the value of F is to be computed at y + r(b − y). The following ordered replacements accomplish this:
Case u ≦ v
a ← y
y ← x
v ← u
x ← a + r(b − a)
u ← F(x)

FIGURE 13.9 Golden section search algorithm: u ≦ v

Exercises 13.1.10–13.1.11 hint at a shortcoming of this procedure: It is quite slow. Slowness in this context refers to the large number of function evaluations that are needed to achieve reasonable precision. This slowness is attributable to the extreme generality of the algorithm. No advantage has been taken of any smoothness that the function F may possess.

Golden Section Search versus Fibonacci Search
If [a, b] is the starting interval in the search for a minimum of F, then at the beginning, with one evaluation of F, we can be sure only that the minimum point, x∗, is in an interval of width b − a. In the golden section search, the corresponding lengths in successive steps are r(b − a) for two evaluations of F, r²(b − a) for three evaluations of F, and so on. After n steps, the minimum point has been pinned down to an interval of length $r^{n-1}(b - a)$.
How does this compare with the Fibonacci search algorithm using n evaluations? The corresponding width of interval, at the last step of this algorithm, is $\lambda_n^{-1}(b - a)$. Now, the Fibonacci algorithm should be better, because it is designed to do as well as possible with a prescribed number of steps. So we expect the ratio $r^{n-1}/\lambda_n^{-1}$ to be greater than 1. But it approaches 1.17 as n → ∞. (See Exercise 13.1.8.) Thus, one may conclude that the extra complexity of the Fibonacci algorithm, together with the disadvantage of having the algorithm itself depend on the number of evaluations permitted, militates against its use in general.

How to Determine the Correct Ratio
In the golden section search algorithm, how is the correct ratio r determined? Remember that when we pass from one interval to the next in the algorithm, one of the points x or y is to be retained in the next step. Here, we present first a sketch of the first interval in which we let x = a + r(b − a) and y = b + r(a − b). It is followed by a sketch of the next interval.
a    y    x    b
a    z    y    x = b
In this new interval, the same ratios should hold, so we have y = a + r(x − a). Since x − a = r(b − a), we can write y = a + r[r(b − a)]. Setting the two formulas for y equal to each other gives us
$$a + r^2(b - a) = b + r(a - b)$$
whence
$$a - b + r^2(b - a) = r(a - b)$$
Dividing by (a − b) gives
$$r^2 + r - 1 = 0$$
The roots of this quadratic equation are as given previously.
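The ordered replacements for the two cases translate almost line for line into code. The following Python sketch is one possible rendering under stated assumptions: the function name, the fixed step count, and the choice to return the midpoint of the final interval are illustrative decisions, not prescribed by the text.

```python
import math

def golden_section_search(F, a, b, steps=40):
    """Shrink [a, b] around a minimum of a unimodal F; return the final midpoint."""
    r = (math.sqrt(5.0) - 1.0) / 2.0      # r = 1/rho, the positive root of r^2 + r = 1
    x = a + r * (b - a)
    u = F(x)
    y = a + r * r * (b - a)
    v = F(y)
    for _ in range(steps):
        if u > v:                          # minimum lies in [a, x]
            b = x
            x, u = y, v                    # y takes over the role of x
            y = a + r * r * (b - a)
            v = F(y)                       # the only new evaluation in this step
        else:                              # minimum lies in [y, b]
            a = y
            y, v = x, u                    # x takes over the role of y
            x = a + r * (b - a)
            u = F(x)
    return 0.5 * (a + b)

if __name__ == "__main__":
    print(golden_section_search(lambda t: (t - 0.3) ** 2, 0.0, 1.0))
```

Each pass through the loop costs one evaluation of F and shortens the interval by the factor r, so 40 steps reduce an interval of length 1 to roughly r⁴⁰, which is on the order of 10⁻⁹.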
Quadratic Interpolation Algorithm
Suppose that F is represented by a Taylor series in the vicinity of the point x∗. Then
$$F(x) = F(x^*) + (x - x^*)F'(x^*) + \tfrac12(x - x^*)^2F''(x^*) + \cdots$$
Since x∗ is a minimum point of F, we have F′(x∗) = 0. Thus,
$$F(x) \approx F(x^*) + \tfrac12(x - x^*)^2F''(x^*)$$
This tells us that, in the neighborhood of x∗, F(x) is approximated by a quadratic function whose minimum is also at x ∗ . Since we do not know x ∗ and do not want to involve derivatives in our algorithms, a natural stratagem is to interpolate F by a quadratic polynomial. Any three values (xi , F (xi )), i = 1, 2, 3, can be used for this purpose. The minimum point of the resulting quadratic function may be a better approximation to x∗ than is x1, x2, or x3. Writing an algorithm that carries out this idea iteratively is not trivial, and many unpleasant cases must be handled. What should be done if the quadratic interpolant has a maximum instead of a minimum, for example? There is also the possibility that F′′(x∗) = 0, in which case higher-order terms of the Taylor series determine the nature of F near x∗.
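To make the interpolation idea concrete, the sketch below fits a quadratic through three points by divided differences in Newton form, q(t) = u + a(t − x) + c(t − x)(t − y), and returns its stationary point together with q″. These are the standard divided-difference formulas, offered here as an assumption about how the interpolant might be computed; the function name is hypothetical, and none of the safeguards discussed in the text are included.

```python
def parabola_vertex(x, u, y, v, z, w):
    """Stationary point t and second derivative of the quadratic through
    (x, u), (y, v), (z, w), built from divided differences (illustrative only)."""
    a = (v - u) / (y - x)        # first divided difference on x, y
    b = (w - v) / (z - y)        # first divided difference on y, z
    c = (b - a) / (z - x)        # second divided difference
    if c == 0.0:
        raise ValueError("the three points are collinear; no stationary point")
    t = 0.5 * (x + y - a / c)    # solves q'(t) = a + c(2t - x - y) = 0
    return t, 2.0 * c            # q''(t) = 2c; positive means t is a minimum

if __name__ == "__main__":
    F = lambda s: (s - 2.0) ** 2 + 1.0
    print(parabola_vertex(0.0, F(0.0), 1.0, F(1.0), 3.0, F(3.0)))
```

When F is itself a quadratic, as in the usage lines, a single interpolation step lands on the minimum point; for a general F, the sign of the returned second derivative indicates whether t is a minimum or a maximum of the interpolant.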
Outline of Quadratic Interpolation Algorithm
Here is the outline of an algorithm for this procedure. At the beginning, we have a function F whose minimum is sought. Two starting points x and y are given, as well as two control numbers δ and ε. Computing begins by evaluating the two numbers
u = F(x),  v = F(y)
Now let
z = 2x − y, if u
In either case, the number
The usual case occurs if
q′′(t)>0, δ≧ max{|t−x|,|t−y|,|t−z|}≧ε
These inequalities indicate that t is a minimum point of q but not near enough to the three initial points to be accepted as a solution. Also, t is not farther than δ units from each of x, y, and z and can thus be accepted as a reasonable new point. The old point that has the greatest function value is now replaced by t and its function value by F(t).
The first bad case occurs if
q′′(t) > 0, max{|t − x|,|t − y|,|t − z|} > δ
Here, t is a minimum point of q but is so remote that there is some danger in using it as a new point. We identify one of the original three points that is farthest from t, for example, x, and also we identify the point closest to t, say z. Then we replace x by z + δ sign(t − z) and u by F(x). Figure 13.10 shows this case. The curve is the graph of q.
FIGURE 13.10 Taylor series algorithm: First bad case

Second Bad Case
The second bad case occurs if
q′′(t) < 0
thus indicating that t is a maximum point of q. In this case, identify the greatest and the least among u, v, and w. Suppose, for example, that u ≧ v ≧ w. Then replace x by z + δ sign(z − x). An example is shown in Figure 13.11.

FIGURE 13.11 Taylor series algorithm: Second bad case
Summary 13.1
• We consider the problem of finding the local minimum of a unimodal function of one variable.
• Algorithms discussed are Fibonacci search, golden section search, and quadratic interpolation.
a1. For the function $F(x_1, x_2, x_3) = x_1^2 + 3x_2^2 + 2x_3^2 - 4x_1 - 6x_2 + 8x_3$, find the unconstrained minimum point. Then find the constrained minimum over the set K defined by inequalities x1 ≦ 0, x2 ≦ 0, and x3 ≦ 0. Next, solve the same problem when K is defined by x1 ≦ 2, x2 ≦ 0, and x3 ≦ −2.
a2. For the function F(x, y) = 13x² + 13y² − 10xy − 18x − 18y, find the unconstrained minimum.
Hint: Try substituting x = u + v and y = u − v.
3. If F is unimodal and continuous on the interval [a,b], how many local maxima may F have on [a, b]?
and α + β = 1 so that α = 1/r and β = −r. Then establish that $r^n\lambda_n$ converges to 1/√5 as n → ∞.
Verify that y + r²(b − y) = x in the golden section algorithm.
Hint: Use r² + r = 1.
If F is unimodal on an interval of length l, how many evaluations are necessary in the golden section algorithm to estimate the minimum point with an error of at most $10^{-k}$?
(Continuation) In the preceding problem, how large must n be if l = 1 and k = 10?
Using the divided-difference algorithm on the table
x  y  z
u  v  w
show that the quadratic interpolant in Newton form is
q(t) = u + a(t − x) + c(t − x)(t − y)
with a, b, and c given by Equation (7). Then verify the formulas for t and q′′(t) given in (7).
If routines can be written easily for F, F′, and F′′, how can Newton’s method be used to locate the minimum point of F? Write down the formula that defines the iterative process. Does it involve F?
If routines are available for F and F′, how can the secant
a4. For the Fibonacci search algorithm, write expressions for x in the two cases n = 2, 3.
b. Minimum of F(x) = 2x³ − 9x² + 12x + 2 on [0, 3]
c. Maximum of F(x) = 2x³ − 9x² + 12x on [0, 2]
6. Let F be a continuous unimodal function defined on the interval [a, b]. Suppose that the values of F are known at n points, namely, a = t1
Computer Exercises 13.1

$a' = \tfrac12(a + a')$, and b′ = a if F(a′) < F(b′); a = a′, b = b′, $a' = \tfrac23 a + \tfrac13 b$, and $b' = \tfrac13 a + \tfrac23 b$ if F(a′) = F(b′).
Note: The construction ensures that a < a′ < b′ < b, and the minimum of F always occurs between a and b. Furthermore, only one new function value needs to be computed at each stage of the calculation after the first unless the case F(a′) = F(b′) is obtained. The values of a, a′, b′, and b tend to have the same limit, which is a minimum point of F. Notice the similarity to the method of bisection of Section 3.1.

4. Write and test a routine for the Fibonacci search algorithm. Verify that a partial algorithm for the Fibonacci search is as follows: Initially, set
Δ = (λN−2/λN)(b − a)
a′ = a + Δ
b′ = b − Δ
u = F(a′)
v = F(b′)
a ← a′, a′ ← b′, u ← v
b′ ← a′, v ← u
Δ ← (λk−2/λk)(b − a)

5. (Berman Algorithm) Suppose that F is unimodal on [a, b]. Then if x1 and x2 are any two points such that a ≦ x1 < x2 ≦ b, we have
F(x1) > F(x2) ⇒ x∗ ∈ (x1, b]
F(x1) = F(x2) ⇒ x∗ ∈ [x1, x2]
F(x1) < F(x2) ⇒ x∗ ∈ [a, x2)
So by evaluating F at x1 and x2 and comparing function values, we are able to reduce the size of the interval that is known to contain x∗. The simplest approach is to start at the midpoint $x_0 = \tfrac12(a + b)$ and if F is, say, decreasing for x > x0, we test F at x0 + ih, i = 1, 2, …, q, with h = (b − a)/2q, until we find a point x1 from which F begins to increase again (or until we reach b). Then we repeat this procedure starting at x1 and using a smaller step length h/q. Here, q is the maximal number of evaluations at each step, say, 4.
Write a subroutine to perform the Berman algorithm and test it for evaluating the approximate minimization of one-dimensional functions.
Note: The total number of evaluations of F needed for executing this algorithm up to some iterative step k depends on the location of x∗. If, for example, x∗ = b, then clearly we need q evaluations at each iteration and hence kq evaluations. This number will decrease the closer x∗ is to x0, and it can be shown that with q = 4, the expected number of evaluations is three per step. It is interesting to compare the efficiency of the Berman algorithm (q = 4) with that of the Fibonacci search algorithm. The expected number of evaluations per step is three, and the uncertainty interval decreases by a factor $4^{-1/3} \approx 0.63$ per evaluation. In comparison, the Fibonacci search algorithm has a reduction factor of $2/(1 + \sqrt5) \approx 0.62$. Of course, the factor 0.63 in the Berman algorithm represents only an average and can be considerably lower but also as high as $4^{-1/4} \approx 0.71$.

6. Select a routine from your program library or from a package such as MATLAB, Maple, or Mathematica for finding the minimum point of a function of one variable. Experiment with the function $F(x) = x^4 + \sin(23x)$ to determine whether this routine encounters any difficulties in finding a global minimum point. Use starting values both near to and far from the global minimum point. (See Figure 13.2.)

7. (Student Research Project) The Greek mathematician Euclid of Alexandria (325–265 B.C.E.) wrote a collection of 13 books on mathematics and geometry. In book six, Proposition 30 shows how to divide a line into its mean and extreme mean, which is finding the golden section point on a line. This states that the ratio of the smaller part of a line segment to the larger part is the same as the ratio of the larger part to the whole line segment. For a line segment of length 1, denote the larger part by r and the smaller part by 1 − r as shown here:
0 |------ r ------|--- 1 − r ---| 1
Hence, we have the ratios
$$\frac{1 - r}{r} = \frac{r}{1}$$
and we obtain the quadratic equation
$$r^2 = 1 - r$$
This equation has two roots, one positive and one negative. The reciprocal of the positive root is the golden ratio $\tfrac12(1 + \sqrt5)$, which was of interest to Pythagoras (580–500 B.C.E.). It was also used in the construction of the Great Pyramid of Gizah. Mathematical software systems such as MATLAB, Maple, or Mathematica contain the golden ratio constant. In fact, the default width-to-height ratio for the plot function is the golden ratio. Investigate the golden section ratio and its use in scientific computing.

8. Using a mathematical software system such as MATLAB, Maple, or Mathematica, write a computer program to reproduce
a. Figure 13.1.
b. Figure 13.2. Also, find the global minimum of the function as well as several local minimum points near the origin.

13.2 Multivariate Case

Now we consider a real-valued function of n real variables F: Rⁿ → R. As before, a point x∗ is sought such that
F(x∗) ≦ F(x) for all x ∈ Rⁿ
Some of the theory of multivariate functions must be developed to understand the rather sophisticated minimization algorithms in current use.

Taylor Series for F: Gradient Vector and Hessian Matrix
Gradient Vector
If the function F possesses partial derivatives of certain low orders (which is usually assumed in the development of these algorithms), then at any given point x, a gradient vector $G(x) = (G_i)_n$ is defined with components
$$G_i = G_i(x) = \frac{\partial F(x)}{\partial x_i} \qquad (1 \leqq i \leqq n) \tag{1}$$
Hessian Matrix
and a Hessian matrix $H(x) = (H_{ij})_{n \times n}$ is defined with components
$$H_{ij} = H_{ij}(x) = \frac{\partial^2 F(x)}{\partial x_i\,\partial x_j} \qquad (1 \leqq i, j \leqq n) \tag{2}$$
We interpret G(x) as an n-component vector and H(x) as an n × n matrix, both depending on x.
Taylor Series for F
Using the gradient and Hessian, we can write the first few terms of the Taylor series for F as
$$F(x + h) = F(x) + \sum_{i=1}^{n} G_i(x)\,h_i + \frac12\sum_{i=1}^{n}\sum_{j=1}^{n} h_i H_{ij}(x)\,h_j + \cdots \tag{3}$$
Matrix-Vector Form
Equation (3) can also be written in an elegant matrix-vector form:
$$F(x + h) = F(x) + G(x)^T h + \tfrac12 h^T H(x)\,h + \cdots \tag{4}$$
Here, x is the fixed point of expansion in Rⁿ, and h is the variable in Rⁿ with components h1, h2, ..., hn. The three dots indicate higher-order terms in h that are not needed in this discussion.
H Symmetric Matrix
A result in calculus states that the order in which partial derivatives are taken is immaterial if all partial derivatives that occur are continuous. In the special case of the Hessian matrix, if the second partial derivatives of F are all continuous, then H is a symmetric matrix; that is, H = Hᵀ because
$$H_{ij} = \frac{\partial^2 F}{\partial x_i\,\partial x_j} = \frac{\partial^2 F}{\partial x_j\,\partial x_i} = H_{ji}$$
EXAMPLE 1
To illustrate Formula (4), let us compute the first three terms in the Taylor series for the function
$$F(x_1, x_2) = \cos(\pi x_1) + \sin(\pi x_2) + e^{x_1 x_2}$$
taking (1, 1) as the point of expansion.
Solution
Partial derivatives are
$$\frac{\partial F}{\partial x_1} = -\pi\sin(\pi x_1) + x_2 e^{x_1 x_2}, \qquad \frac{\partial F}{\partial x_2} = \pi\cos(\pi x_2) + x_1 e^{x_1 x_2}$$
$$\frac{\partial^2 F}{\partial x_1^2} = -\pi^2\cos(\pi x_1) + x_2^2 e^{x_1 x_2}, \qquad \frac{\partial^2 F}{\partial x_2\,\partial x_1} = (x_1 x_2 + 1)e^{x_1 x_2}$$
$$\frac{\partial^2 F}{\partial x_1\,\partial x_2} = (x_1 x_2 + 1)e^{x_1 x_2}, \qquad \frac{\partial^2 F}{\partial x_2^2} = -\pi^2\sin(\pi x_2) + x_1^2 e^{x_1 x_2}$$
Note the equality of cross derivatives; that is, ∂²F/∂x1∂x2 = ∂²F/∂x2∂x1. At the particular point x = [1, 1]ᵀ, we have
$$F(x) = -1 + e, \qquad G(x) = \begin{bmatrix} e \\ -\pi + e \end{bmatrix}, \qquad H(x) = \begin{bmatrix} \pi^2 + e & 2e \\ 2e & e \end{bmatrix}$$
So by Equation (4),
$$F(1 + h_1, 1 + h_2) = -1 + e + [e,\ -\pi + e]\begin{bmatrix} h_1 \\ h_2 \end{bmatrix} + \frac12\,[h_1,\ h_2]\begin{bmatrix} \pi^2 + e & 2e \\ 2e & e \end{bmatrix}\begin{bmatrix} h_1 \\ h_2 \end{bmatrix} + \cdots$$
or equivalently, by Equation (3),
$$F(1 + h_1, 1 + h_2) = -1 + e + e\,h_1 + (-\pi + e)\,h_2 + \tfrac12\left[(\pi^2 + e)\,h_1^2 + (2e)\,h_1 h_2 + (2e)\,h_2 h_1 + e\,h_2^2\right] + \cdots$$
■
Mathematical Software
In mathematical software systems such as Maple or Mathematica, we can verify these calculations using built-in routines for the gradient and Hessian. Also, we can obtain two terms in the Taylor series in two variables expanded about the point (1, 1) and then carry out a change of variables to obtain similar results as shown.

Alternative Form of Taylor Series
Another form of the Taylor series is useful. First let z be the point of expansion, and then let h = x − z. Now from Equation (4),
$$F(x) = F(z) + G(z)^T(x - z) + \tfrac12(x - z)^T H(z)(x - z) + \cdots \tag{5}$$
We illustrate with two special types of functions. First, the linear function has the form
Linear Function
$$F(x) = c + \sum_{i=1}^{n} b_i x_i = c + b^T x$$
for appropriate coefficients c, b1, b2, ..., bn. Clearly, the gradient and Hessian are $G_i(z) = b_i$ and $H_{ij}(z) = 0$, so Equation (5) yields
$$F(x) = F(z) + \sum_{i=1}^{n} b_i(x_i - z_i) = F(z) + b^T(x - z)$$
Second, consider a general quadratic function. For simplicity, we take only two variables. The form of the function is
Quadratic Function, Case 2 Variables
$$F(x_1, x_2) = c + (b_1 x_1 + b_2 x_2) + \tfrac12\left(a_{11}x_1^2 + 2a_{12}x_1 x_2 + a_{22}x_2^2\right) \tag{6}$$
which can be interpreted as the Taylor series for F when the point of expansion is (0, 0). To verify this assertion, the partial derivatives must be computed and evaluated at (0, 0):
$$\frac{\partial F}{\partial x_1} = b_1 + a_{11}x_1 + a_{12}x_2, \qquad \frac{\partial F}{\partial x_2} = b_2 + a_{22}x_2 + a_{12}x_1$$
$$\frac{\partial^2 F}{\partial x_1^2} = a_{11}, \qquad \frac{\partial^2 F}{\partial x_1\,\partial x_2} = a_{12}$$
$$\frac{\partial^2 F}{\partial x_2\,\partial x_1} = a_{12}, \qquad \frac{\partial^2 F}{\partial x_2^2} = a_{22}$$
Matrix Form
Letting z = [0, 0]ᵀ, we obtain from Equation (5)
$$F(x) = c + [b_1,\ b_2]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \frac12\,[x_1,\ x_2]\begin{bmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
This is the matrix form of the original quadratic function of two variables. It can also be written as
$$F(x) = c + b^T x + \tfrac12 x^T A x \tag{7}$$
where c is a scalar, b a vector, and A a matrix. Equation (7) holds for a general quadratic function of n variables, with b an n-component vector and A an n × n matrix.
Returning to Equation (3), we now write out the complicated double sum in complete detail to assist in understanding it:
xT Hx Term in Detail
T
n n i=1 j=1
nj=1 x1H1jxj + n j = 1 x 2 H 2 j x j
xiHijxj =+··· + ··· + nj = 1 x n H n j x j
x1H11x1 + x1H12x2
+ x2H21x1 + x2H22x2 = +···
x Hx=
Thus, xT Hx can be interpreted as the sum of all n2 terms in a square array of which the (i, j ) element is xi Hi j x j .
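The matrix form and the written-out double sum can be compared numerically. The following is a minimal sketch, not a definitive implementation: it uses the value, gradient, and Hessian quoted for the expansion about (1, 1), evaluates the three-term Taylor model in both forms, and compares the model with a function. The function `F_assumed` and all helper names are assumptions made for illustration; `F_assumed` is not stated in this passage and was chosen only because its value and derivatives at (1, 1) agree with the quoted numbers.

```python
import math

# Values quoted in the text for the expansion about (1, 1).
F0 = -1.0 + math.e
G = [math.e, -math.pi + math.e]
H = [[math.pi ** 2 + math.e, 2.0 * math.e],
     [2.0 * math.e, math.e]]


def quad_form(H, x):
    """x^T H x as the double sum of all n^2 terms x_i * H_ij * x_j."""
    n = len(x)
    return sum(x[i] * H[i][j] * x[j] for i in range(n) for j in range(n))


def taylor_model(h):
    """First three Taylor terms in matrix form: F0 + G^T h + (1/2) h^T H h."""
    linear = sum(g * hi for g, hi in zip(G, h))
    return F0 + linear + 0.5 * quad_form(H, h)


def expanded_model(h):
    """The same three terms written out term by term."""
    h1, h2 = h
    return (F0 + math.e * h1 + (-math.pi + math.e) * h2
            + 0.5 * ((math.pi ** 2 + math.e) * h1 ** 2
                     + 2.0 * math.e * h1 * h2
                     + 2.0 * math.e * h2 * h1
                     + math.e * h2 ** 2))


def F_assumed(x1, x2):
    # ASSUMPTION: this function is not given in the passage above; it is
    # used only because its value, gradient, and Hessian at (1, 1) match
    # the quoted numbers.
    return math.cos(math.pi * x1) + math.sin(math.pi * x2) + math.exp(x1 * x2)


if __name__ == "__main__":
    for step in (1e-1, 1e-2, 1e-3):
        h = (step, -step)
        err = abs(F_assumed(1.0 + h[0], 1.0 + h[1]) - taylor_model(h))
        print(f"step {step:g}: |F - model| = {err:.3e}")
```

The printed error should shrink roughly like the cube of the step size, since the neglected terms are of third order in h.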
Steepest Descent Procedure
A crucial property of the gradient vector G(x) is that it points in the direction of the most rapid increase in the function F, which is the direction of steepest ascent. Conversely, −G(x) points in the direction of steepest descent. This fact is so important that it is worth a few words of justification. Suppose that h is a unit vector, $\sum_{i=1}^{n} h_i^2 = 1$. The rate of change of F (at x) in the direction h is defined naturally by
$$\frac{d}{dt}F(x + th)\Big|_{t=0}$$
This rate of change can be evaluated by using Equation (4). From that equation, it follows that
$$F(x + th) = F(x) + t\,G(x)^T h + \frac{1}{2}t^2\,h^T H(x)h + \cdots \tag{8}$$
Differentiation with respect to t leads to
$$\frac{d}{dt}F(x + th) = G(x)^T h + t\,h^T H(x)h + \cdots \tag{9}$$
By letting t = 0 here, we see that the rate of change of F in the direction h is nothing else than
$$G(x)^T h$$
Now we ask: For what unit vector h is the rate of change a maximum? The simplest path to the answer is to invoke the powerful Cauchy-Schwarz inequality:
$$\sum_{i=1}^{n} u_i v_i \leqq \left(\sum_{i=1}^{n} u_i^2\right)^{1/2}\left(\sum_{i=1}^{n} v_i^2\right)^{1/2} \tag{10}$$
where equality holds only if one of the vectors u or v is a nonnegative multiple of the other. Applying this to
$$G(x)^T h = \sum_{i=1}^{n} G_i(x)h_i$$
and remembering that $\sum_{i=1}^{n} h_i^2 = 1$, we conclude that the maximum occurs when h is a positive multiple of G(x), that is, when h points in the direction of G.
Best Step Steepest Descent
On the basis of the foregoing discussion, a minimization procedure called best-step steepest descent can be described. At any given point x, the gradient vector G(x) is calculated. Then a one-dimensional minimization problem is solved by determining the value t∗ for which the function
$$\varphi(t) = F(x + tG(x))$$
is a minimum. Then we replace x by x + t∗G(x) and begin anew. (Since F decreases along −G(x), the minimizing value t∗ is ordinarily negative.)
The general method of steepest descent takes a step of any size in the direction of the negative gradient. It is not usually competitive with other methods, but it has the advantage of simplicity. One way of speeding it up is described in Computer Exercise 13.2.2.
Contour Diagrams
In understanding how these methods work on functions of two variables, it is often helpful to draw contour diagrams. A contour of a function F is a set of the form
$$\{x : F(x) = c\}$$
where c is a given constant. For example, the contours of the function
$$F(x) = 25x_1^2 + x_2^2$$
are ellipses, as shown in Figure 13.12. Contours are also called level sets by some authors. At any point on a contour, the gradient of F is perpendicular to the curve. So, in general, the path of steepest descent may look like Figure 13.13.
More Advanced Algorithms
To explain more advanced algorithms, we consider a general real-valued function F of n variables. Suppose that we have obtained the first three terms in the Taylor series of F in the vicinity of a point z. How can they be used to guess the minimum point of F? Obviously, we could ignore all terms beyond the quadratic terms and find the minimum of the resulting quadratic function:
$$F(x + z) = F(z) + G(z)^T x + \frac{1}{2}x^T H(z)x + \cdots \tag{11}$$
FIGURE 13.12
Contours of $F(x) = 25x_1^2 + x_2^2$ (ellipses $c = 25x^2 + y^2$)
FIGURE 13.13
Path of steepest descent (points $x^1, \ldots, x^5$ on the contours $F(x) = F(x^1), \ldots, F(x^5)$)
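The zigzag path of steepest descent can be reproduced numerically. The following is a minimal sketch of best-step steepest descent applied to the sample function $F(x) = 25x_1^2 + x_2^2$ from the contour discussion. The golden-section line search, the search bracket [−1, 0] for t, the tolerances, the starting point, and all function names are assumptions made for illustration and are not part of the original procedure.

```python
import math


def F(x):
    """Sample function from the text: F(x) = 25*x1^2 + x2^2."""
    return 25.0 * x[0] ** 2 + x[1] ** 2


def gradient(x):
    """Gradient G(x) of the sample function."""
    return [50.0 * x[0], 2.0 * x[1]]


def golden_section(phi, a, b, tol=1e-10):
    """Minimize a unimodal function phi on [a, b] (hypothetical helper)."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c            # keep [a, d]; old c becomes new d
            c = b - r * (b - a)
        else:
            a, c = c, d            # keep [c, b]; old d becomes new c
            d = a + r * (b - a)
    return 0.5 * (a + b)


def best_step_steepest_descent(x, max_iter=2000, gtol=1e-6):
    """Repeat: compute G(x), minimize phi(t) = F(x + t*G(x)), update x."""
    path = [list(x)]
    for _ in range(max_iter):
        g = gradient(x)
        if math.sqrt(sum(gi * gi for gi in g)) < gtol:
            break
        phi = lambda t: F([xi + t * gi for xi, gi in zip(x, g)])
        # ASSUMPTION: the best step t* lies in [-1, 0]; it is negative
        # because F decreases along -G(x).
        t_star = golden_section(phi, -1.0, 0.0)
        x = [xi + t_star * gi for xi, gi in zip(x, g)]
        path.append(list(x))
    return x, path


if __name__ == "__main__":
    x_min, path = best_step_steepest_descent([1.0, 5.0])
    print("steps taken:", len(path) - 1)
    print("approximate minimum point:", x_min)
```

Printing the entries of `path` shows successive steps that are nearly perpendicular to one another, which is the behavior sketched in Figure 13.13.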
In Equation (11), z is fixed and x is the variable. To find the minimum of this quadratic function of x, we must compute the first partial derivatives and set them equal to zero. Denoting this quadratic function by Q and simplifying the notation slightly, we have
$$Q(x) = F(z) + \sum_{i=1}^{n} G_i x_i + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i H_{ij} x_j \tag{12}$$
from which it follows that
$$\frac{\partial Q}{\partial x_k} = G_k + \sum_{j=1}^{n} H_{kj}x_j \qquad (1 \leqq k \leqq n) \tag{13}$$
(See Exercise 13.2.13.) The point x that is sought is thus a solution of the system of n equations
$$\sum_{j=1}^{n} H_{kj}x_j = -G_k \qquad (1 \leqq k \leqq n)$$
or, equivalently,
$$H(z)x = -G(z) \tag{14}$$
The preceding analysis suggests the following iterative algorithm for locating a minimum point of a function F: Start with a point z that is a current estimate of the minimum point. Compute the gradient and Hessian of F at the point z. They can be denoted by G and H, respectively. Of course, G is an n-component vector of numbers and H is an n × n matrix of numbers. Then solve the matrix equation
$$Hx = -G$$
obtaining an n-component vector x. Replace z by z + x and return to the beginning of the algorithm.
Minimum, Maximum, and Saddle Points
There are many reasons for expecting trouble from the iterative procedure just outlined. One especially noisome aspect is that we can expect to find a point only where the first partial derivatives of F vanish; it need not be a minimum point. It is what we call a stationary point. Such points can be classified into three types: minimum point, maximum point, and saddle point. They can be illustrated by simple quadratic surfaces familiar from analytic geometry:
• Minimum of $F(x, y) = x^2 + y^2$ at (0, 0) (See Figure 13.14(a).)
• Maximum of $F(x, y) = 1 - x^2 - y^2$ at (0, 0) (See Figure 13.14(b).)
• Saddle point of $F(x, y) = x^2 - y^2$ at (0, 0) (See Figure 13.14(c).)
FIGURE 13.14
Simple quadratic surfaces: (a) Minimum point, (b) Maximum point, (c) Saddle point
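The iterative algorithm (solve Hx = −G, then replace z by z + x) and its weakness can both be seen in a small experiment. The following is a minimal sketch under stated assumptions: the helper `solve_linear`, the function `iterate`, the starting point, and the tolerance are hypothetical names and values chosen for illustration. Applied to the three quadratic surfaces listed above, the iteration reaches (0, 0) in every case, although only the first is a minimum point.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def iterate(grad, hess, z, steps=20, tol=1e-12):
    """Repeat: solve H x = -G at the current z, then replace z by z + x."""
    for _ in range(steps):
        G, H = grad(z), hess(z)
        x = solve_linear(H, [-g for g in G])
        z = [zi + xi for zi, xi in zip(z, x)]
        if max(abs(xi) for xi in x) < tol:
            break
    return z


# Gradient and Hessian of the three quadratic surfaces from the text.
surfaces = {
    "x^2 + y^2 (minimum)": (
        lambda z: [2.0 * z[0], 2.0 * z[1]],
        lambda z: [[2.0, 0.0], [0.0, 2.0]]),
    "1 - x^2 - y^2 (maximum)": (
        lambda z: [-2.0 * z[0], -2.0 * z[1]],
        lambda z: [[-2.0, 0.0], [0.0, -2.0]]),
    "x^2 - y^2 (saddle point)": (
        lambda z: [2.0 * z[0], -2.0 * z[1]],
        lambda z: [[2.0, 0.0], [0.0, -2.0]]),
}

if __name__ == "__main__":
    for name, (grad, hess) in surfaces.items():
        print(name, "->", iterate(grad, hess, [3.0, -4.0]))
```

Because each surface is exactly quadratic, a single solve lands on the stationary point; the sketch makes no attempt to decide which kind of stationary point was found.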
Positive Definite Matrix
If z is a stationary point of F, then
$$G(z) = 0$$
Moreover, a criterion ensuring that Q, as defined in Equation (12), has a minimum point is as follows:
■ Theorem 1
Quadratic Function Theorem
If the matrix H has the property that $x^T H x > 0$ for every nonzero vector x, then the quadratic function Q has a minimum point.
(See Exercise 13.2.15.) A matrix that has this property is said to be positive definite. Notice that this theorem involves only second-degree terms in the quadratic function Q.
As examples of quadratic functions that do not have minima, consider the following:
$$-x_1^2 - x_2^2 + 13x_1 + 6x_2 + 12, \qquad x_1^2 - x_2^2 + 3x_1 + 5x_2 + 7$$
$$x_1^2 - 2x_1x_2 + x_1 + 2x_2 + 3, \qquad 2x_1 + 4x_2 + 6$$
In the first two examples, let $x_1 = 0$ and $x_2 \to \infty$. In the third, let $x_1 = x_2 \to \infty$. In the last, let $x_1 = 0$ and $x_2 \to -\infty$. In each case, the function values approach −∞, and no global minimum can exist.
Quasi-Newton Methods
Algorithms that converge faster than steepest descent in general and that are currently recommended for minimization are of a type called quasi-Newton. The principal example is an algorithm introduced in 1959 by Davidon, called the variable metric algorithm. Subsequently, important modifications and improvements were made by others, such as R. Fletcher, M. J. D. Powell, C. G. Broyden, P. E. Gill, and W. Murray. These algorithms proceed iteratively, assuming in each step that a local quadratic approximation is known for the function F whose minimum is sought. The minimum of this quadratic function either provides the new point directly or is used to determine a line along which a one-dimensional search can be carried out. In implementation of the algorithm, the gradient can be either provided in the form of a procedure or computed numerically by finite differences. The Hessian H is not computed, but an estimate of its LU factorization is kept up to date as the process continues.
Nelder-Mead Algorithm
For minimizing a function $F: \mathbb{R}^n \to \mathbb{R}$, another method, called the Nelder-Mead algorithm, is available. It is a method of direct search and proceeds without involving any derivatives of the function F and without any line searches.
Before beginning the calculations, the user assigns values to three parameters: α, β, and γ. The default values are 1, 1/2, and 1, respectively. In each step of the algorithm, a set of n + 1 points in $\mathbb{R}^n$ is given: $\{x_0, x_1, \ldots, x_n\}$. This set is in general position in $\mathbb{R}^n$. This means that the set of n points $x_i - x_0$, with 1 ≦ i ≦ n, is linearly independent. A consequence of this assumption is that the convex hull of the original set $\{x_0, x_1, \ldots, x_n\}$ is an n-simplex. For example, a 2-simplex is a triangle in $\mathbb{R}^2$, and a 3-simplex is a tetrahedron in $\mathbb{R}^3$. To make the description of the algorithm as simple as possible, we assume that the points have been relabeled (if necessary) so that $F(x_0) \geqq F(x_1) \geqq \cdots \geqq F(x_n)$. Since we are trying to minimize the function F, the point $x_0$ is the worst of the current set, because it produces the highest value of F.
We compute the point
$$u = \frac{1}{n}\sum_{i=1}^{n} x_i$$
This is the centroid of the face of the current simplex opposite the worst vertex, $x_0$. Next, we compute a reflected point $v = (1 + \alpha)u - \alpha x_0$.
If F(v) is less than $F(x_n)$, then this is a favorable situation, and one is tempted to replace $x_0$ by v and begin anew. However, we first compute an expanded reflected point $w = (1 + \gamma)v - \gamma u$ and test to see whether F(w) is less than $F(x_n)$. If so, we replace $x_0$ by w and begin anew. Otherwise, we replace $x_0$ by v as originally suggested and begin with the new simplex.
Assume now that F(v) is not less than $F(x_n)$. If $F(v) \leqq F(x_1)$, then replace $x_0$ by v and begin again. Having disposed of all cases when $F(v) \leqq F(x_1)$, we now consider two further cases. First, if $F(v) \leqq F(x_0)$, then define $w = u + \beta(v - u)$. If $F(v) > F(x_0)$, compute $w = u + \beta(x_0 - u)$. With w now defined, test whether $F(w) < F(x_0)$. If this is true, replace $x_0$ by w and begin anew. However, if $F(w) \geqq F(x_0)$, shrink the simplex by using $x_i \leftarrow \frac{1}{2}(x_i + x_n)$ for 0 ≦ i ≦ n − 1. Then begin anew.
The algorithm needs a stopping test in each major step. One such test is whether the relative flatness is small. That is the quantity
$$\frac{F(x_0) - F(x_n)}{|F(x_0)| + |F(x_n)|}$$
Other tests to make sure progress is being made can be added. In programming the algorithm, one keeps the number of evaluations of F to a minimum. In fact, only three indices are needed: the indices of the greatest $F(x_i)$, the next greatest, and the least.
In addition to the original paper of Nelder and Mead [1965], one can consult Dennis and Woods [1987], Dixon [1974], and Torczon [1997]. Different authors give slightly different versions of the algorithm. We have followed the original description by Nelder and Mead.
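The steps of the Nelder-Mead algorithm described above can be sketched in code. The following is a minimal sketch under stated assumptions rather than a reference implementation: the function name `nelder_mead`, the flatness tolerance, the iteration limit, and the starting simplex in the usage example are hypothetical choices. For readability the sketch sorts the whole simplex and re-evaluates F freely, so it does not keep the number of function evaluations to a minimum as a careful program would.

```python
def nelder_mead(F, simplex, alpha=1.0, beta=0.5, gamma=1.0,
                flat_tol=1e-12, max_iter=5000):
    """Direct-search minimization following the steps in the text."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1
    for _ in range(max_iter):
        # Relabel so that F(x0) >= F(x1) >= ... >= F(xn); x0 is the worst.
        pts.sort(key=F, reverse=True)
        f0, f1, fn = F(pts[0]), F(pts[1]), F(pts[n])
        # Stopping test: relative flatness.
        denom = abs(f0) + abs(fn)
        if denom == 0.0 or (f0 - fn) / denom < flat_tol:
            break
        x0 = pts[0]
        # Centroid u of the face opposite the worst vertex x0.
        u = [sum(p[k] for p in pts[1:]) / n for k in range(n)]
        # Reflected point v = (1 + alpha) u - alpha x0.
        v = [(1.0 + alpha) * uk - alpha * xk for uk, xk in zip(u, x0)]
        fv = F(v)
        if fv < fn:
            # Expanded reflected point w = (1 + gamma) v - gamma u.
            w = [(1.0 + gamma) * vk - gamma * uk for vk, uk in zip(v, u)]
            pts[0] = w if F(w) < fn else v
        elif fv <= f1:
            pts[0] = v
        else:
            if fv <= f0:
                w = [uk + beta * (vk - uk) for uk, vk in zip(u, v)]
            else:
                w = [uk + beta * (xk - uk) for uk, xk in zip(u, x0)]
            if F(w) < f0:
                pts[0] = w
            else:
                # Shrink: xi <- (xi + xn) / 2 for 0 <= i <= n - 1.
                best = pts[n]
                pts = [[0.5 * (pk + bk) for pk, bk in zip(p, best)]
                       for p in pts[:n]] + [best]
    return min(pts, key=F)


if __name__ == "__main__":
    # Rosenbrock function from Computer Exercise 13.2.1a; the simplex built
    # around the start (-1.2, 1.0) is an arbitrary choice.
    rosenbrock = lambda p: 100.0 * (p[1] - p[0] ** 2) ** 2 + (1.0 - p[0]) ** 2
    start = [[-1.2, 1.0], [-0.2, 1.0], [-1.2, 2.0]]
    print(nelder_mead(rosenbrock, start))
```

The branch structure mirrors the case analysis in the text: expansion when the reflected point beats the best vertex, plain reflection when it beats the second-worst, and otherwise a contraction followed, if necessary, by a shrink toward the best vertex.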
Method of Simulated Annealing
This method has been proposed and found to be effective for the minimization of difficult functions, especially if they have many purely local minimum points. It involves no derivatives or line searches; indeed, it has found great success in minimizing discrete functions, such as arise in the traveling salesman problem.
Suppose we are given a real-valued function of n real variables; $F: \mathbb{R}^n \to \mathbb{R}$. We must be able to compute the values F(x) for any x in $\mathbb{R}^n$. It is desired to locate a global minimum point of F, which is a point x∗ such that
$$F(x^*) \leqq F(x) \quad \text{for all } x \text{ in } \mathbb{R}^n$$
In other words, F(x∗) is equal to $\inf_{x \in \mathbb{R}^n} F(x)$. The algorithm generates a sequence of points $x_1, x_2, x_3, \ldots$, and one hopes that $\min_{j \leqq k} F(x_j)$ converges to $\inf F(x)$ as $k \to \infty$.
In describing the computation that leads to $x_{k+1}$, assuming that $x_k$ has been computed, we begin by generating a modest number of random points $u_1, u_2, \ldots, u_m$ in a large neighborhood of $x_k$. For each of these points, the value of F must be computed. The next point, $x_{k+1}$, in our sequence is chosen to be one of the points $u_1, u_2, \ldots, u_m$. This choice is made as follows. Select an index j such that
$$F(u_j) = \min\{F(u_1), F(u_2), \ldots, F(u_m)\}$$
If $F(u_j) < F(x_k)$, then set $x_{k+1} = u_j$. In the other case, for each i, we assign a probability $p_i$ to $u_i$ by the formula
$$p_i = e^{\alpha[F(x_k) - F(u_i)]} \qquad (1 \leqq i \leqq m)$$
Here, α is a positive parameter chosen by the user of the code. We normalize the probabilities by dividing each by their sum. That is, we compute
$$S = \sum_{i=1}^{m} p_i$$
and then carry out a replacement
$$p_i \leftarrow p_i / S$$
Finally, a random choice is made among the points $u_1, u_2, \ldots, u_m$, taking account of the probabilities $p_i$ that have been assigned to them. This randomly chosen $u_i$ becomes $x_{k+1}$.
The simplest way to make this random choice is to employ a random number generator to get a random point ξ in the interval (0, 1). Select i to be the first integer such that
$$\xi \leqq p_1 + p_2 + \cdots + p_i$$
Thus, if $\xi \leqq p_1$, let i = 1 (and $x_{k+1} = u_1$). If $p_1 < \xi \leqq p_1 + p_2$, then let i = 2 (and $x_{k+1} = u_2$), and so on.
The formula for the probabilities $p_i$ is taken from the theory of thermodynamics. The interested reader can consult the original articles by Metropolis et al. [1953] or Otten and van Ginneken [1989]. Presumably, other functions can serve in this role as well.
What is the purpose of the complicated choice for $x_{k+1}$? Because of the possibility of encountering local minima, the algorithm must occasionally choose a point that is uphill from the current point. Then there is a chance that subsequent points might begin to move toward a different local minimum. An element of randomness is introduced to make this possible.
With minor modifications, the algorithm can be used for functions f : X → R, where X is any set. For example, in the traveling salesman problem, X is the set of all permutations of a set of integers {1, 2, 3, . . . , N}. All that is required is a procedure for generating random permutations and, of course, a code for evaluating the function f.
Computer programs for a variety of algorithms can be found online at http://www.netlib.org/. A collection of papers on simulated annealing, emphasizing parallel computation, is Azencott [1992].
Summary 13.2
• In a typical minimization problem, we seek a point x∗ such that
$$F(x^*) \leqq F(x) \quad \text{for all } x \in \mathbb{R}^n$$
where F is a real-valued multivariate function.
• A gradient vector G(x) has components
$$G_i = G_i(x) = \frac{\partial F(x)}{\partial x_i} \qquad (1 \leqq i \leqq n)$$
and a Hessian matrix H(x) has components
$$H_{ij} = H_{ij}(x) = \frac{\partial^2 F(x)}{\partial x_i\,\partial x_j} \qquad (1 \leqq i, j \leqq n)$$
It is a symmetric matrix if the second-order derivatives are continuous.
• The Taylor series for F is
$$F(x + h) = F(x) + G(x)^T h + \frac{1}{2}h^T H(x)h + \cdots$$
Here, x is the fixed point of expansion in $\mathbb{R}^n$ and h is the variable in $\mathbb{R}^n$ with components $h_1, h_2, \ldots, h_n$. The three dots indicate higher-order terms in h that are not needed in this discussion.
• An alternative form of the Taylor series is
$$F(x) = F(z) + G(z)^T(x - z) + \frac{1}{2}(x - z)^T H(z)(x - z) + \cdots$$
For example, a linear function $F(x) = c + b^T x$ has the Taylor series
$$F(x) = F(z) + b^T(x - z)$$
A quadratic function is
$$F(x) = c + b^T x + \frac{1}{2}x^T A x$$
• An iterative procedure for locating a minimum point of a function F is to start with a point z that is a current estimate of the minimum point, compute the gradient G and Hessian H of F at the point z, and solve the matrix equation
$$Hx = -G$$
for x. Then replace z by z + x and repeat.
• If the matrix H has the property that $x^T H x > 0$ for every nonzero vector x, then the quadratic function Q has a unique minimum point.
• Algorithms that are discussed are steepest descent, Nelder-Mead, and simulated annealing.
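As a companion to the description of simulated annealing in this section, the following is a minimal sketch of the rule for choosing $x_{k+1}$ from $x_k$. Several details are assumptions made for illustration and are not specified in the text: the "large neighborhood" is taken to be a box of half-width `radius`, the values of `m`, `alpha`, `iters`, and `seed` are arbitrary, and the function names are hypothetical. The sketch also records the best point visited, which corresponds to tracking $\min_{j \leqq k} F(x_j)$.

```python
import math
import random


def anneal_step(F, xk, m, radius, alpha, rng):
    """Choose x_{k+1} from random points u_1..u_m near x_k."""
    # ASSUMPTION: the neighborhood is a box of half-width `radius`.
    us = [[xi + rng.uniform(-radius, radius) for xi in xk] for _ in range(m)]
    fs = [F(u) for u in us]
    fk = F(xk)
    j = min(range(m), key=lambda i: fs[i])
    if fs[j] < fk:
        return us[j]                      # a downhill point exists: take it
    # Otherwise p_i = exp(alpha * (F(x_k) - F(u_i))), then normalize.
    ps = [math.exp(alpha * (fk - fi)) for fi in fs]
    S = sum(ps)
    if S == 0.0:
        return us[j]                      # guard against total underflow
    ps = [p / S for p in ps]
    # Select the first i with xi <= p_1 + p_2 + ... + p_i.
    xi_rand = rng.random()
    cumulative = 0.0
    for u, p in zip(us, ps):
        cumulative += p
        if xi_rand <= cumulative:
            return u
    return us[-1]                         # guard against rounding in the sum


def simulated_annealing(F, x0, iters=2000, m=10, radius=1.0,
                        alpha=5.0, seed=0):
    """Generate x_1, x_2, ... and return the best point visited."""
    rng = random.Random(seed)
    x, best = list(x0), list(x0)
    for _ in range(iters):
        x = anneal_step(F, x, m, radius, alpha, rng)
        if F(x) < F(best):
            best = list(x)
    return best


if __name__ == "__main__":
    # A one-variable function with several local minima (an arbitrary example).
    bumpy = lambda p: p[0] ** 2 + 10.0 * math.sin(3.0 * p[0]) ** 2
    print(simulated_annealing(bumpy, [4.0]))
```

In the uphill branch every exponent is nonpositive, so the unnormalized probabilities never exceed 1; larger values of `alpha` make uphill moves to much worse points less likely.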
Exercises 13.2
1. Determine whether these functions have minimum values in $\mathbb{R}^2$:
ᵃa. $x_1^2 - x_1x_2 + x_2^2 + 3x_1 + 6x_2 - 4$
ᵃb. $x_1^2 - 3x_1x_2 + x_2^2 + 7x_1 + 3x_2 + 5$
c. $2x_1^2 - 3x_1x_2 + x_2^2 + 4x_1 - x_2 + 6$
d. $ax_1^2 - 2bx_1x_2 + cx_2^2 + dx_1 + ex_2 + f$
Hint: Use the method of completing the square.
ᵃ2. Locate the minimum point of $3x^2 - 2xy + y^2 + 3x - 4y + 7$ by finding the gradient and Hessian and solving the appropriate linear equations.
ᵃ3. Using (0, 0) as the point of expansion, write the first three terms of the Taylor series for $F(x, y) = e^x \cos y - y\ln(x + 1)$.
4. Using (1, 1) as the point of expansion, write the first three terms of the Taylor series for $F(x, y) = 2x^2 - 4xy + 7y^2 - 3x + 5y$.
5. The Taylor series expansion about zero can be written as
$$F(x) = F(0) + G(0)^T x + \frac{1}{2}x^T H(0)x + \cdots$$
Show that the Taylor series about z can be written in a similar form by using matrix-vector notation; that is,
$$F(x) = F(z) + \mathcal{G}(z)^T X + \frac{1}{2}X^T \mathcal{H}(z) X + \cdots$$
where
$$X = \begin{bmatrix} x \\ z \end{bmatrix}, \qquad \mathcal{G}(z) = \begin{bmatrix} G(z) \\ -G(z) \end{bmatrix}, \qquad \mathcal{H}(z) = \begin{bmatrix} H(z) & -H(z) \\ -H(z) & H(z) \end{bmatrix}$$
ᵃ6. Show that the gradient of F(x, y) is perpendicular to the contour.
Hint: Interpret the equation F(x, y) = c as defining y as a function of x. Then by the chain rule,
$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y}\frac{dy}{dx} = 0$$
From it, obtain the slope of the tangent to the contour.
7. Consider the function
$$F(x_1, x_2, x_3) = 3e^{x_1x_2} - x_3\cos x_1 + x_2\ln x_3$$
a. Determine the gradient vector and Hessian matrix.
ᵃb. Derive the first three terms of the Taylor series expansion about (0, 1, 1).
c. What linear system should be solved for a reasonable guess as to the minimum point for F? What is the value of F at this point?
8. It is asserted that the Hessian of an unknown function F at a certain point is
$$\begin{bmatrix} 3 & 2 \\ 1 & 4 \end{bmatrix}$$
What conclusion can be drawn about F?
9. What are the gradients of the following functions at the points indicated?
ᵃa. $F(x, y) = x^2y - 2x + y$ at (1, 0)
ᵃb. $F(x, y, z) = xy + yz^2 + x^2z$ at (1, 2, 1)
ᵃ10. Consider $F(x, y, z) = y^2z^2(1 + \sin^2 x) + (y + 1)^2(z + 3)^2$. We want to find the minimum of the function. The program to be used requires the gradient of the function. What formulas must we program for the gradient?
11. Let F be a function of two variables whose gradient at (0, 0) is $[-5, 1]^T$ and whose Hessian is
$$\begin{bmatrix} 6 & -1 \\ -1 & 2 \end{bmatrix}$$
Make a reasonable guess as to the minimum point of F. Explain.
ᵃ12. Write the function $F(x_1, x_2) = 3x_1^2 + 6x_1x_2 - 2x_2^2 + 5x_1 + 3x_2 + 7$ in the form of Equation (7) with appropriate A, b, and c. Show in matrix form the linear equations that must be solved in order to find a point where the first partial derivatives of F vanish. Finally, solve these equations to locate this point numerically.
13. Verify Equation (13). In differentiating the double sum in Equation (12), first write all terms that contain $x_k$. Then differentiate and use the symmetry of the matrix H.
14. Consider the quadratic function Q in Equation (12). Show that if H is positive definite, then the stationary point is a minimum point.
15. (General Quadratic Function) Generalize Equation (6) to n variables. Show that a general quadratic function Q(x) of n variables can be written in the matrix-vector form of Equation (7), where A is an n × n symmetric matrix, b a vector of length n, and c a scalar. Establish that the gradient and Hessian are
$$G(x) = Ax + b, \qquad H(x) = A$$
respectively.
16. Let A be an n × n symmetric matrix and define an upper triangular matrix $U = (u_{ij})$ by putting
$$u_{ij} = \begin{cases} a_{ij}, & i = j \\ 2a_{ij}, & i < j \\ 0, & i > j \end{cases}$$
Show that $x^T U x = x^T A x$ for all vectors x.
17. Show that the general quadratic function Q(x) of n variables can be written
$$Q(x) = c + b^T x + \frac{1}{2}x^T U x$$
where U is an upper triangular matrix. Can this simplify the work of finding the stationary point of Q?
18. Show that the gradient and Hessian satisfy the equation
$$H(z)(x - z) = G(x) - G(z)$$
for a general quadratic function of n variables.
19. Using Taylor series, show that a general quadratic function of n variables can be written in block form
$$Q(x) = \frac{1}{2}X^T \mathcal{A} X + \mathcal{B}^T X + c$$
where
$$X = \begin{bmatrix} x \\ z \end{bmatrix}, \qquad \mathcal{A} = \begin{bmatrix} A & -A \\ -A & A \end{bmatrix}, \qquad \mathcal{B} = \begin{bmatrix} b \\ -b \end{bmatrix}$$
Here z is the point of expansion.
20. (Least-Squares Problem) Consider the function
$$F(x) = (b - Ax)^T(b - Ax) + \alpha x^T x$$
where A is a real m × n matrix, b is a real column vector of order m, and α is a positive real number. We want the minimum point of F for given A, b, and α. Show that
$$F(x + h) - F(x) = (Ah)^T(Ah) + \alpha h^T h \geqq 0$$
for h a vector of order n, provided that
$$(A^T A + \alpha I)x = A^T b$$
This means that any solution of this linear system minimizes F(x); hence, this is the normal equation.
21. (Multiple Choice) What is the gradient of the function $f(x) = 3x_1^2 - \sin(x_1x_2)$ at the point (3, 0)?
a. (6, −3) b. (3, −1) c. (18, 0) d. (18, −3) e. None of these.
22. (Multiple Choice, continuation) The directional derivative of the function f at the point x in the direction u is given by the expression
$$\frac{d}{dt}f(x + tu)\Big|_{t=0}$$
In this description, u should be a unit vector. What is the numerical value of the directional derivative where f(x) is the function defined in the preceding problem, $x = (1, \pi/2)$, and $u = (1, 1)/\sqrt{2}$?
a. $6/\sqrt{2}$ b. 6 c. 18 d. 3 e. None of these.
23. (Multiple Choice, continuation) If f is a real-valued function of n variables, the Hessian $H = (H_{ij})$ is given by $H_{ij} = \partial^2 f/\partial x_i\,\partial x_j$, all terms being evaluated at a specific point x. What is the entry $H_{22}$ in this matrix in the case of f as given in the previous problem and $x = (1, \pi/2)$?
a. 6 b. $6/\sqrt{2}$ c. 1 d. $\pi^2/2$ e. None of these.
24. (Multiple Choice) Let f be a real-valued function of n real variables. Let x and u be given as numerical vectors, and u ≠ 0. Then the expression f(x + tu) defines a function of t. Suppose that the minimum of f(x + tu) occurs at t = 0. What conclusion can be drawn?
a. The gradient of f at x, denoted by G(x), is 0.
b. u is perpendicular to the gradient of f at x.
c. u = G(x), where G(x) denotes the gradient of f at x.
d. G(x) is perpendicular to x.
e. None of these.
25. (Multiple Choice) If f is a (real-valued) quadratic function of n real variables, we can write it in the form $f(x) = c - b^T x + \frac{1}{2}x^T A x$. The gradient of f is then:
a. $Ax$ b. $b - Ax$ c. $Ax - b$ d. $\frac{1}{2}Ax - b$ e. None of these.
Computer Exercises 13.2
1. Select a routine from your program library or from a package such as MATLAB, Maple, or Mathematica for minimizing a function of many variables without the need to program derivatives. Test it on one or more of the following well-known functions. The ordering of our variables is (x, y, z, w).
ᵃa. Rosenbrock: $100(y - x^2)^2 + (1 - x)^2$. Start at (−1.2, 1.0).
b. Powell 1: $(x + 10y)^2 + 5(z - w)^2 + (y - 2z)^4 + 10(x - w)^4$. Start at (3, −1, 0, 1).
ᵃc. Powell 2: $x^2 + 2y^2 + 3z^2 + 4w^2 + (x + y + z + w)^4$. Start at (1, −1, 1, 1).
d. Fletcher and Powell: $100(z - 10\varphi)^2 + \left(\sqrt{x^2 + y^2} - 1\right)^2 + z^2$ in which φ is an angle determined from (x, y) by
$$\cos 2\pi\varphi = \frac{x}{\sqrt{x^2 + y^2}}, \qquad \sin 2\pi\varphi = \frac{y}{\sqrt{x^2 + y^2}}$$
where $-\pi/2 < 2\pi\varphi \leqq 3\pi/2$. Start at (1, 1, 1).
ᵃe. Woods: $100(x^2 - y)^2 + (1 - x)^2 + 90(z^2 - w)^2 + (1 - z)^2 + 10(y - 1)^2 + (w - 1)^2 + 19.8(y - 1)(w - 1)$. Start at (−3, −1, −3, −1).
2. (Accelerated Steepest Descent) This version of steepest descent is superior to the basic one. A sequence of points $x_1, x_2, \ldots$ is generated as follows: Point $x_1$ is specified as the starting point. Then $x_2$ is obtained by one step of steepest descent from $x_1$. In the general step, if $x_1, x_2, \ldots, x_m$ have been obtained, we find a point z by steepest descent from $x_m$. Then $x_{m+1}$ is taken as the minimum point on the line $x_{m-1} + t(z - x_{m-1})$. Program and test this algorithm on one of the examples in Computer Exercise 13.2.1.
3. Using a routine in your program library or in MATLAB, Maple, or Mathematica, do the following:
a. Solve the minimization problem that begins this chapter.
b. Plot and solve for the minimum point, the maximum point, and the saddle point of these functions, respectively: $x^2 + y^2$, $1 - x^2 - y^2$, $x^2 - y^2$.
c. Plot and numerically experiment with these functions that do not have minima: $-x^2 - y^2 + 13x + 6y + 12$, $x^2 - y^2 + 3x + 5y + 7$, $x^2 - 2xy + x + 2y + 3$, $2x + 4y + 6$.
4. We want to find the minimum of $F(x, y, z) = z^2\cos x + x^2y^2 + x^2e^z$ using a computer program that requires procedures for the gradient of F together with F. Write the necessary procedures. Find the minimum using a preprogrammed code that uses the gradient.
5. Assume that
procedure Xmin(f, (grad_i), n, (x_i), (g_ij))
is available to compute the minimum value of a function of two variables. Suppose that this routine requires not only the function but also its gradient. If we are going to use this routine with the function $F(x, y) = e^x\cos^2(xy)$, what procedure will be needed? Write the appropriate code. Find the minimum using a preprogrammed code that uses the gradient.
6. Program and test the Nelder-Mead algorithm.
7. Program and test the simulated annealing algorithm.
8. (Student Research Project) Explore one of the newer methods for minimization such as genetic algorithms, methods of simulated annealing, or the Nelder-Mead algorithm. Use some of the software that is available for them.
9. Use built-in routines in mathematical software systems such as Maple or Mathematica to verify the calculations in Example 1.
Hint: In Maple, use grad and Hessian, and in Mathematica, use Series. For example, obtain two terms in the Taylor series in two variables expanded about the point (1, 1), and then carry out a change of variables.
10. (Molecular Conformation: Protein Folding Project) Forces that govern folding of amino acids into proteins are due to bonds between individual atoms and to weaker interactions between unbound atoms such as electrostatic and Van der Waals forces. The Van der Waals forces are modeled by the Lennard-Jones potential
$$U(r) = \frac{1}{r^{12}} - \frac{2}{r^6}$$
where r is the distance between atoms.
[Figure: graph of the Lennard-Jones potential U(r)]
In the figure, the energy minimum is −1, and it is achieved at r = 1. Explore this subject and the numerical methods used. One approach is to predict the conformation of the proteins in finding the minimum potential energy of the total configuration of amino acids. For a cluster of atoms with positions $(x_1, y_1, z_1)$ to $(x_n, y_n, z_n)$, the objective function to be minimized is
$$U = \sum_{i<j}\left(\frac{1}{r_{ij}^{12}} - \frac{2}{r_{ij}^{6}}\right)$$
x1,x2,x3,x4 ≧0
Write in matrix-vector form the dual problem and the
second primal problem.
Solve each of the linear programming problems by the
graphical method. Determine x to
6x+5y≦17 c. Constraints: 2x + 11y ≦ 23
x≦0
6. Consider the following linear programming problem:
Maximize: $2x_1 + 2x_2 - 6x_3 - x_4$
3x + x =25 1 4
x1 + x2 + x3 + x4 = 20 Constraints: 4×1 + 6×3 ≧ 5
13.
2×1 +3×3+2×4≧ 0 Maximize: x1≧0, x2≧0, x3≧0, x4≧0
cT x
Ax ≦ b
aa. Reformulatethisprobleminsecondprimalform. Constraints: ab. Formulatethedualproblem.
a7. Solve the following linear programming problem graphically:
x≧0
Here, nonunique and unbounded “solutions” may be
Maximize:
1 2
aT
b = [−15, 36]
T
b = [6, 36]T
b = [0, 5]T
b=[12, 44]T
b = [6, −20]T
T
3x
x1
a. c = [2, −4] 1T
A=
A=−1 1
+ 5x
x2≦6
x1≧0, x2≧0
a8. (Continuation) Solve the dual problem of the preceding
problem.
9. Show that the dual problem may be written as
A=−3 4 −4 11
A=23 −4 −5
1 1 A=12
A=24 5 3
Maximize: Constraints:
bT y T
T
10. Describe how max{|x − y − 3|, |2x + y + 4|, |x + 2y − 7|} can be minimized by using a linear programming code.
yA≧c y ≧ 0
g. c = [2, 1] a h. c = [3, 1]T
b = [0, −2] b = [21, 18]T
≦ 4
b.c=2,2 ac. c=[3,2]T
b = [30, 12]
obtained.
d. c = [2, −3]T e. c=[−4, 11]T
af. c=[−3,4]T
T
−3 −5 A=4 9
6 5 A=41
0 1
Constraints: $3x_1 + 2x_2 \leqq 18$
−3 2 −4 9
a 14.
Solve the following linear programming problem by hand, using a graph for help:
Maximize: 4x+4y+z
16.
a17.
18.
Consider the following linear programming problem: Maximize: c1x1 +c2x2
15.
pressions. Solve the resulting two-dimensional problem.
Putthislinearprogrammingproblemintosecondprimal form. You may want to make changes of variables. If so, include a dictionary relating new and old variables.
a1.
2.
Constraints:
3x + 2y + z = 12
Constraints: a1 x1 + a2 x2 ≦ b
7x + 7y + 2z ≦ 144
x1≧0, x2≧0
Constraints: 7x + 5y + 2z ≦ 80
In the special case in which all data are positive, show that the dual problem has the same extreme value as the original problem.
Supposethatalinearprogrammingprobleminfirstpri- mal form has the property that cTx is not bounded on the feasible set. What conclusion can be drawn about the dual problem?
(Multiple Choice) Which of these problems is formu- lated in the first primal form for a linear programming problem?
a. Maximize cTx subject to Ax ≦ b
b. MinimizecTxsubjecttoAx≦b,x≧0 c. MaximizecTxsubjecttoAx=b,x≧0 d. MaximizecTxsubjecttoAx≦b,x≧0 e. Noneofthese.
problem in first primal form to determine the number of bottles of the two painkillers that the company should produce each day so as to maximize their profits. Solve by using mathematical software.
Suppose that the university student government wishes to charter planes to transport at least 750 students to the bowl game. Two airlines, α and β, agree to supply aircraft for the trip. Airline α has five aircraft available carrying 75 passengers each, and airline β has three aircraft avail- able carrying 250 passengers each. The cost per aircraft is $900 and $3250 for the trip from airlines α and β, respec- tively. The student government wants to charter at most six aircraft. How many of each type should be chartered to minimize the cost of the airlift? How much should the student government charge to make 50c/ profit per stu- dent? Solve by the graphical method, and verify by using mathematical software.
(Continuation) Rework the preceding computer problem in the following two possibly different ways:
a. The number of students going on the airlift is maxi- mized.
b. Thecostperstudentisminimized.
11x + 7y + 3z ≦ 132
x≧0, y≧0
Hint: Use the equation to eliminate z from all other ex-
Minimize:
ε1 + ε2 + ε3
|3x+4y+6| ≦ε1
|2x − 8y − 4| ≦ ε2 | − x − 3 y + 5 | ≦ ε 3
ε1>0, ε2>0, ε3>0, x>0, y>0 Solve the resulting problem.
A western shop wishes to purchase 300 felt and 200 straw cowboy hats. Bids have been received from three whole- salers. Texas Hatters has agreed to supply not more than 200 hats, Lone Star Hatters not more than 250, and Lariat Ranch Wear not more than 150. The owner of the shop has estimated that his profit per hat sold from Texas Hatters would be $3/felt and $4/straw, from Lone Star Hatters $3.80/felt and $3.50/straw, and from Lariat Ranch Wear $4/felt and $3.60/straw. Set up a linear programming problem to maximize the owner’s profits. Solve by us- ing mathematical software.
The ABC Drug Company makes two types of liquid
painkiller that have brand names Relieve (R) and Ease
(E) and contain different mixtures of three basic drugs,
A, B, and C, produced by the company. Each bottle of R
requires 7 unit of drug A, 1 unit of drug B, and 3 unit of 924
drug C. Each bottle of E requires 4 unit of drug A, 5 unit 92
of drug B, and 1 unit of drug C. The company is able to 4
produce each day only 5 units of drug A, 7 units of drug B, and 9 units of C. Moreover, Food and Drug Admin- istration regulations stipulate that the number of bottles of R manufactured cannot exceed twice the number of bottles of E. The profit margin for each bottle of E and R is $7 and $3, respectively. Set up the linear programming
a3.
Copyright 2012 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole deemed that any suppressed content does not materially affect the overall learning experience.
or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.
4.
Computer Exercises 14.1
a5. (Diet Problem) A university dining hall wishes to pro- vide at least 5 units of vitamin C and 3 units of vitamin E per serving. Three foods are available containing these vi- tamins. Food f1 contains 2.5 and 1.25 units per ounce of vitamins C and E, respectively, whereas food f2 contains just the opposite amounts. The third food f3 contains an equal amount of each vitamin at 1 unit per ounce. Food
f1 costs 25c/ per ounce, food f2 costs 56c/ per ounce, and food f3 costs 10c/ per ounce. The dietitian wishes to provide the meal at a minimum cost per serving that satis-
14.2 Simplex Method
fies the minimum vitamin requirements. Set up this linear programming problem in second primal form. Solve with the aid of mathematical software.
6. Use built-in routines in mathematical software systems such as MATLAB, Maple, or Mathematica to solve each of these linear programming problems in first primal form, in second primal form, and in dual form:
a. LPP (2) b. LPP (3) c. LPP (4) d. LPP (5) e. LPP (6)
Polyhedral Set
■ Theorem1
14.2 Simplex Method 597
Second Primal Form LPP
The principal algorithm that is used in solving linear programming problems is the simplex method. Here, enough of the background of this method is described that the reader can use available computer programs that incorporate it.
Consider a linear programming problem in second primal form:
Maximize: cTx
Constraints: Ax = b, x ≧ 0
It is assumed that c and x are n-component vectors, b is an m-component vector, and A is an m × n matrix. Also, it is assumed that b≧ 0 and that A contains an m × m identity matrix in its last m columns. As before, we define the set of feasible points as
K = {x ∈ Rn: Ax = b, x ≧ 0}
The points of K are exactly the points that are competing to maximize cTx.
Vertices in K and Linearly Independent Columns of A
The set K is a polyhedral set in Rn, and the algorithm to be described proceeds from vertex to vertex in K, always increasing the value of cTx as it goes from one to another. Let us give a precise definition of vertex. A point x in K is called a vertex if it is impossible to express it as x = (1/2)(u + v), with both u and v in K and u ≠ v. In other words, x is not the midpoint of any line segment whose endpoints lie in K.
We denote by a(1), a(2), . . . , a(n) the column vectors constituting the matrix A. The following theorem relates the columns of A to the vertices of K:
Theorem 1: Theorem on Vertices and Column Vectors
Let x ∈ K and define I(x) = {i: xi > 0}. Then the following are equivalent:
1. x is a vertex of K.
2. The set {a(i): i ∈ I(x)} is linearly independent.
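The equivalence in the theorem gives a direct computational test for a vertex. The following is a minimal sketch, not taken from the text: the helper names `rank` and `is_vertex` and the small example (the triangle x1 + x2 + x3 = 2, x ≧ 0) are assumptions made for illustration, and exact rational arithmetic is used so that no tolerance is needed.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Look for a nonzero pivot in this column at or below row r.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][col] / m[r][col]
            m[i] = [a - f * p for a, p in zip(m[i], m[r])]
        r += 1
    return r

def is_vertex(A, b, x):
    """True if the feasible point x of K = {x: Ax = b, x >= 0} is a vertex of K."""
    n = len(x)
    # x must belong to K for the theorem to apply.
    assert all(xi >= 0 for xi in x)
    assert all(sum(A[i][j] * x[j] for j in range(n)) == b[i] for i in range(len(A)))
    support = [j for j in range(n) if x[j] > 0]      # the index set I(x)
    if not support:
        return True                                   # an empty set of columns is independent
    cols = [[A[i][j] for j in support] for i in range(len(A))]
    # The columns a(i), i in I(x), are independent exactly when the rank equals their number.
    return rank(cols) == len(support)

A = [[1, 1, 1]]
b = [2]
print(is_vertex(A, b, [2, 0, 0]))   # a corner of the triangle
print(is_vertex(A, b, [1, 1, 0]))   # the midpoint of an edge
```

For the corner, I(x) contains a single index and one nonzero column is independent; for the edge midpoint, two columns of a one-row matrix cannot be independent, so the point is not a vertex.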
Proof
If Statement 1 is false, then we can write x = (1/2)(u + v), with u ∈ K, v ∈ K, and u ≠ v. For every index i that is not in the set I(x), we have xi = 0, ui ≧ 0, vi ≧ 0, and xi = (1/2)(ui + vi). This forces ui and vi to be zero. Thus, all the nonzero components of u and v correspond to indices i in I(x). Since u and v belong to the set K, we have
\[ b = Au = \sum_{i=1}^{n} u_i a^{(i)} = \sum_{i \in I(x)} u_i a^{(i)} \]
and
\[ b = Av = \sum_{i=1}^{n} v_i a^{(i)} = \sum_{i \in I(x)} v_i a^{(i)} \]
Hence, we obtain
\[ \sum_{i \in I(x)} (u_i - v_i) a^{(i)} = 0 \]
Because u ≠ v, the coefficients ui − vi are not all zero, showing the linear dependence of the set {a(i): i ∈ I(x)}. Thus, Statement 2 is false. Consequently, Statement 2 implies Statement 1.
For the converse, assume that Statement 2 is false. From the linear dependence of column vectors a(i) for i ∈ I(x), we have
\[ \sum_{i \in I(x)} y_i a^{(i)} = 0 \quad \text{with} \quad \sum_{i \in I(x)} |y_i| \neq 0 \]
for appropriate coefficients yi. For each i ∉ I(x), let yi = 0. Form the vector y with components yi for i = 1, 2, …, n. Then, for any λ, we see that because x ∈ K,
\[ A(x \pm \lambda y) = \sum_{i=1}^{n} (x_i \pm \lambda y_i) a^{(i)} = \sum_{i=1}^{n} x_i a^{(i)} \pm \lambda \sum_{i \in I(x)} y_i a^{(i)} = Ax = b \]
Now select the real number λ positive but so small that x + λy ≧ 0 and x − λy ≧ 0. (To see that it is possible, consider separately the components for i ∈ I(x) and i ∉ I(x).) The resulting vectors, u = x + λy and v = x − λy, belong to K. They differ, and obviously, x = (1/2)(u + v). Thus, x is not a vertex of K; that is, Statement 1 is false. So Statement 1 implies Statement 2. ■
Given a linear programming problem, there are three possibilities:
1. There are no feasible points; that is, the set K is empty.
2. K is not empty, and cTx is not bounded on K.
3. K is not empty, and cTx is bounded on K.
It is true (but not obvious) that in the third case, there is a point x in K such that cTx ≧ cTy for all y in K. We have assumed that our problem is in the second primal form so that possibility 1 cannot occur. Indeed, A contains an m × m identity matrix and so has the form
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} & 1 & 0 & \cdots & 0 \\
a_{21} & a_{22} & \cdots & a_{2k} & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mk} & 0 & 0 & \cdots & 1
\end{bmatrix}
\]
where k = n − m. Consequently, we can construct a feasible point x easily by setting x1 = x2 = ··· = xk = 0 and x_{k+1} = b1, x_{k+2} = b2, and so on. It is then clear that Ax = b. The inequality x ≧ 0 follows from our initial assumption that b ≧ 0.
Simplex Method
Next we present a brief outline of the simplex method for solving linear programming problems. It involves a sequence of exchanges so that the trial solution proceeds systematically from one vertex to another in K. This procedure is stopped when the value of cTx is no longer increased as a result of the exchange.
The following is an outline of the simplex algorithm.
■ Algorithm: Simplex Algorithm
Select a small positive value for ε. In each step, we have a set of m indices {k1, k2, …, km}.
1. Put columns a(k1), a(k2), . . . , a(km) into B, and solve Bx = b.
2. If xi > 0 for 1 ≦ i ≦ m, continue. Otherwise, exit because the algorithm has failed.
3. Set e = [c_{k1}, c_{k2}, …, c_{km}]T, and solve BTy = e.
4. Choose any s in {1, 2, …, n} but not in {k1, k2, …, km} for which cs − yTa(s) is greatest.
5. If cs − yTa(s) < ε, exit because x is the solution.
6. Solve Bz = a(s).
7. If zi ≦ ε for 1 ≦ i ≦ m, then exit because the objective function is unbounded on K.
8. Among the ratios xi/zi that have zi > 0 for 1 ≦ i ≦ m, let xr/zr be the smallest. In case of a tie, let r be the first occurrence.
9. Replace kr by s, and go to Step 1.
A few remarks on this algorithm are in order. In the beginning, select the indices k1, k2, …, km such that a(k1), a(k2), …, a(km) form an m × m identity matrix. At Step 5, where we say that x is a solution, we mean that the vector v = (vi) given by v_{ki} = xi for 1 ≦ i ≦ m and vi = 0 for i ∉ {k1, k2, …, km} is the solution. A convenient choice for the tolerance ε that occurs in Steps 5 and 7 might be 10^−6.
In any reasonable implementation of the simplex method, advantage must be taken of the fact that succeeding occurrences of Step 1 are very similar. In fact, only one column of B changes at a time. Similar remarks hold for Steps 3 and 6.
We do not recommend that the reader attempt to program the simplex algorithm. Efficient codes, refined over many years of experience, are usually available in software libraries. Many of them can provide solutions to a given problem and to its dual with very little additional computing. Sometimes this feature can be exploited to decrease the execution time of a problem. To see why, consider a linear programming problem in first primal form:
(P)  Maximize: cTx
     Constraints: Ax ≦ b, x ≧ 0
As usual, we assume that x is an n vector and that A is an m × n matrix. When the simplex algorithm is applied to this problem, it performs an iterative process on an m × m matrix denoted by B in the preceding description. If the number of inequality constraints m is very large relative to n, then the dual problem may be easier to solve, since the B matrices for it will be of dimension n × n. Indeed, the dual problem is
Minimize: bT y
(D) Constraints: AT y ≧ c
y≧0
and the number of inequality constraints here is n. An example of this technique appears in
the next section.
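Although library codes are the right tool for real problems, a small teaching sketch can make the nine steps of the outline concrete. The following is such a sketch under stated assumptions, not a production implementation: the names `solve` and `simplex` are hypothetical, every linear system is re-solved from scratch (ignoring the advice that only one column of B changes), exact rational arithmetic replaces floating point, and Step 8 uses the test zi > ε rather than zi > 0 so that it agrees with Step 7. The test problem (maximize x1 + 2x2 subject to x1 + x2 ≦ 4, x2 ≦ 3, with slack variables x3, x4 added) is an illustrative choice.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve the square system M t = rhs exactly by Gauss-Jordan elimination."""
    n = len(M)
    aug = [[Fraction(v) for v in M[i]] + [Fraction(rhs[i])] for i in range(n)]
    for col in range(n):
        piv = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [a - f * p for a, p in zip(aug[i], aug[col])]
    return [row[n] for row in aug]

def simplex(A, b, c, eps=Fraction(1, 10**6)):
    """Maximize c^T x subject to Ax = b, x >= 0 (second primal form).

    Assumes b > 0 and that the last m columns of A form an identity matrix.
    """
    m, n = len(A), len(A[0])

    def column(j):
        return [A[i][j] for i in range(m)]

    k = list(range(n - m, n))                          # start from the identity columns
    while True:
        B = [[A[i][j] for j in k] for i in range(m)]
        x = solve(B, b)                                # Step 1
        if any(xi <= 0 for xi in x):                   # Step 2
            raise RuntimeError("algorithm failed")
        Bt = [list(row) for row in zip(*B)]
        y = solve(Bt, [c[j] for j in k])               # Step 3
        reduced = {s: c[s] - sum(yi * ai for yi, ai in zip(y, column(s)))
                   for s in range(n) if s not in k}
        s = max(reduced, key=reduced.get)              # Step 4
        if reduced[s] < eps:                           # Step 5
            v = [Fraction(0)] * n
            for i, ki in enumerate(k):
                v[ki] = x[i]
            return v
        z = solve(B, column(s))                        # Step 6
        if all(zi <= eps for zi in z):                 # Step 7
            raise RuntimeError("objective function is unbounded on K")
        ratios = [(x[i] / z[i], i) for i in range(m) if z[i] > eps]
        r = min(ratios)[1]                             # Step 8: smallest ratio, first on ties
        k[r] = s                                       # Step 9

A = [[1, 1, 1, 0],
     [0, 1, 0, 1]]
b = [4, 3]
c = [1, 2, 0, 0]
x = simplex(A, b, c)
print([str(v) for v in x], sum(ci * xi for ci, xi in zip(c, x)))
```

For this data the sketch visits the vertices (0, 0), (0, 3), and (1, 3) and stops at x1 = 1, x2 = 3 with objective value 7, which agrees with evaluating x1 + 2x2 at the four corners of the feasible region by hand. Because Step 2 demands strictly positive basic variables, a degenerate vertex makes the sketch stop with an error, exactly as the outline prescribes.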
Summary 14.2
• For the second primal form, the set of feasible points is
K = {x ∈ Rn: Ax = b, x ≧ 0}
These are the points competing to maximize cTx.
• For a linear programming problem, there are these possibilities: there are no feasible points, that is, the set K is empty; K is not empty, and cTx is not bounded on K; K is not empty, and cTx is bounded on K.
• Denote by a(1), a(2), . . . , a(n) the column vectors constituting the matrix A. Let x ∈ K and define I(x) = {i: xi > 0}. Then x is a vertex of K if and only if the set {a(i): i ∈ I(x)} is linearly independent.
• The simplex method involves a sequence of exchanges so that the trial solution proceeds systematically from one vertex to another in the set of feasible points K. This procedure is stopped when the value of cTx is no longer increased as a result of exchanges.
Exercises 14.2
a1. Put this linear programming problem into first primal form by increasing the number of variables by just one:
Maximize: cTx
Constraints: Ax ≦ b
Hint: Replace xj by yj − y0.
a2. Show that the set K can have only a finite number of vertices.
3. Suppose that u and v are solution points for a linear programming problem and that x = (1/2)(u + v). Show that x is also a solution.
4. Using the simplex method as described, solve the numerical example in the text.
a5. Using standard manipulations, put the dual problem (D) into first and second primal forms.
a6. Show how a code for solving a linear programming problem in first primal form can be used to solve a system of n linear equations in n variables.
7. Using standard techniques, put the dual problem (D) into first primal form (P); then take the dual of it. What is the result?
Computer Exercises 14.2
1. Select a linear programming code from a mathematical software package or library and use it to solve these linear programming problems:
a. Minimize: 8x1 + 6x2 + 6x3 + 9x4
Constraints: x1 + 2x2 + x4 ≧ 2
3x1 + x2 + x4 ≧ 4
x3 + x4 ≧ 1
x1 + x3 ≧ 1
x1≧0, x2≧0, x3≧0, x4≧0
ab. Minimize: 10x1 − 5x2 − 4x3 + 7x4 + x5
Constraints: 4x1 − 3x2 − x3 + 4x4 + x5 = 1
−x1 + 2x2 + 2x3 + x4 + 3x5 = 4
x1≧0, x2≧0, x3≧0, x4≧0, x5≧0
ac. Maximize: 2x1 + 4x2 + 3x3
Constraints: 4x1 + 2x2 + 3x3 ≦ 15
3x1 + 2x2 + x3 ≦ 7
x1 + x2 + 2x3 ≦ 6
x1≧0, x2≧0, x3≧0
2. (Student Research Project) Investigate recent developments in computational linear programming algorithms, especially by interior-point methods.
14.3 Inconsistent Linear Systems
Linear programming can be used for the approximate solution of systems of linear equations that are inconsistent. An m × n system of equations
\[ \sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m) \]
is said to be inconsistent if there is no vector x = [x1, x2, …, xn]T that simultaneously satisfies all m equations in the system. For instance, the system
\[
\begin{cases}
2x_1 + 3x_2 = 4 \\
x_1 - x_2 = 2 \\
x_1 + 2x_2 = 7
\end{cases}
\tag{1}
\]
is inconsistent, as can be seen by attempting to carry out the Gaussian elimination process.
l1 Problem
Since no vector x can solve an inconsistent system of equations, the residuals
\[ r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i \qquad (1 \le i \le m) \]
cannot be made to vanish simultaneously. Hence, we have $\sum_{i=1}^{m} |r_i| > 0$. Now it is natural to ask for an x vector that renders the expression $\sum_{i=1}^{m} |r_i|$ as small as possible. This problem is called the l1 problem for this system of equations. Other criteria, leading to different approximate solutions, might be to minimize $\sum_{i=1}^{m} r_i^2$ or $\max_{1 \le i \le m} |r_i|$. Chapter 9 discusses in detail the problem of minimizing $\sum_{i=1}^{m} r_i^2$.
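The three criteria just mentioned can be compared numerically on System (1). The following is a minimal sketch: the function names are hypothetical, and the two trial vectors are chosen only for illustration. It evaluates the residuals exactly and reports the sum of absolute values, the sum of squares, and the largest absolute value.

```python
from fractions import Fraction

A = [[2, 3], [1, -1], [1, 2]]     # coefficients of System (1)
b = [4, 2, 7]

def residuals(x):
    """r_i = sum_j a_ij x_j - b_i for each equation of the system."""
    return [sum(Fraction(a) * xj for a, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]

def l1(x):
    return sum(abs(r) for r in residuals(x))

def l2_squared(x):
    return sum(r * r for r in residuals(x))

def linf(x):
    return max(abs(r) for r in residuals(x))

trial_points = [[Fraction(2), Fraction(0)], [Fraction(8, 9), Fraction(5, 3)]]
for x in trial_points:
    print([str(t) for t in x], l1(x), l2_squared(x), linf(x))
```

The first trial vector makes two residuals vanish and leaves the third at −5, whereas the second spreads the error evenly over the three equations (each residual has magnitude 25/9). Which vector counts as better therefore depends on the criterion chosen.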
The minimization of $\sum_{i=1}^{m} |r_i|$ by appropriate choice of the x vector is a problem for which special algorithms have been designed (see Barrodale and Roberts [1974]). However, if one of these special programs is not available or if the problem is small in scope, linear programming can be used.
A simple, direct restatement of the problem is
\[
\begin{aligned}
\text{Minimize:} \quad & \sum_{i=1}^{m} \varepsilon_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n} a_{ij} x_j - b_i \le \varepsilon_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n} a_{ij} x_j + b_i \le \varepsilon_i && (1 \le i \le m)
\end{aligned}
\tag{2}
\]
If a linear programming code is at hand in which the variables are not required to be nonnegative, then it can be used on Problem (2). If the variables must be nonnegative, the following technique can be applied. Introduce a variable y_{n+1}, and write xj = yj − y_{n+1}. Then define $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$. This step creates an additional column in the matrix A. Now consider the linear programming problem
\[
\begin{aligned}
\text{Maximize:} \quad & -\sum_{i=1}^{m} \varepsilon_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le -b_i && (1 \le i \le m) \\
& y \ge 0, \quad \varepsilon \ge 0
\end{aligned}
\tag{3}
\]
which is in first primal form with m + n + 1 variables and 2m inequality constraints.
It is not hard to verify that Problem (3) is equivalent to Problem (2). The main point is that
\[
\begin{aligned}
\sum_{j=1}^{n+1} a_{ij} y_j &= \sum_{j=1}^{n} a_{ij} (x_j + y_{n+1}) + a_{i,n+1} y_{n+1} \\
&= \sum_{j=1}^{n} a_{ij} x_j + y_{n+1} \sum_{j=1}^{n} a_{ij} + y_{n+1} \Bigl( -\sum_{j=1}^{n} a_{ij} \Bigr) \\
&= \sum_{j=1}^{n} a_{ij} x_j
\end{aligned}
\]
Another technique can be used to replace the 2m inequality constraints in Problem (3) by a set of m equality constraints. We write
\[ \varepsilon_i = |r_i| = u_i + v_i \]
where ui = ri and vi = 0 if ri ≧ 0 but vi = −ri and ui = 0 if ri < 0. The resulting linear programming problem is
\[
\begin{aligned}
\text{Maximize:} \quad & -\sum_{i=1}^{m} u_i - \sum_{i=1}^{m} v_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - u_i + v_i = b_i && (1 \le i \le m) \\
& u \ge 0, \quad v \ge 0, \quad y \ge 0
\end{aligned}
\]
Using the preceding formulas, we have
\[
\begin{aligned}
r_i &= \sum_{j=1}^{n} a_{ij} x_j - b_i = \sum_{j=1}^{n} a_{ij} (y_j - y_{n+1}) - b_i \\
&= \sum_{j=1}^{n} a_{ij} y_j - y_{n+1} \sum_{j=1}^{n} a_{ij} - b_i \\
&= \sum_{j=1}^{n+1} a_{ij} y_j - b_i = u_i - v_i
\end{aligned}
\]
From it, we conclude that ri + vi = ui ≧ 0. Now vi and ui should be as small as possible, consistent with this restriction, because we are attempting to minimize $\sum_{i=1}^{m} (u_i + v_i)$. So if ri ≧ 0, we take vi = 0 and ui = ri, whereas if ri < 0, we take vi = −ri and ui = 0. In either case, |ri| = ui + vi. Thus, minimizing $\sum_{i=1}^{m} (u_i + v_i)$ is the same as minimizing $\sum_{i=1}^{m} |r_i|$.
The example of the inconsistent linear system given by System (1) could be solved in the l1 sense by solving the linear programming problem
\[
\begin{aligned}
\text{Minimize:} \quad & u_1 + v_1 + u_2 + v_2 + u_3 + v_3 \\
\text{Constraints:} \quad & 2y_1 + 3y_2 - 5y_3 - u_1 + v_1 = 4 \\
& y_1 - y_2 - u_2 + v_2 = 2 \\
& y_1 + 2y_2 - 3y_3 - u_3 + v_3 = 7 \\
& y_1, y_2, y_3 \ge 0, \quad u_1, u_2, u_3 \ge 0, \quad v_1, v_2, v_3 \ge 0
\end{aligned}
\tag{4}
\]
The solution is
\[
\begin{array}{lll}
u_1 = 0, & v_1 = 0, & y_1 = 2, \\
u_2 = 0, & v_2 = 0, & y_2 = 0, \\
u_3 = 0, & v_3 = 5, & y_3 = 0
\end{array}
\]
From it, we recover the l1 solution of System (1) in the form
\[
\begin{array}{ll}
x_1 = y_1 - y_3 = 2, & r_1 = u_1 - v_1 = 0 \\
x_2 = y_2 - y_3 = 0, & r_2 = u_2 - v_2 = 0 \\
 & r_3 = u_3 - v_3 = -5
\end{array}
\]
Mathematical Software
We can use mathematical software systems such as MATLAB, Maple, or Mathematica to solve this linear programming problem. For example, we obtain u1 = v1 = u2 = v2 = u3 = y2 = y3 = 0, v3 = 5, and y1 = 2, with 5 as the value of the objective function. For another system, we need to set the equality constraints. We obtain the solution corresponding to y1 = y2 = y3 = 684.2887, u1 = u2 = u3 = v1 = v2 = 0, and v3 = 5 with 5 as the value of the objective function. The x vector is x1 = 2 and x2 = 3.1494 × 10^−11. This solution is slightly different from the one previously obtained, owing to roundoff errors, but the minimum value for the objective function is the same and all the constraints are satisfied.
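The change of variables behind Problem (4) can be checked mechanically. The following is a minimal sketch, with hypothetical helper names: it builds the extra column a_{i,n+1} for System (1), maps a given x to nonnegative variables (y, u, v) using the smallest shift y_{n+1} that keeps every yj nonnegative (one possible choice among many), and confirms that the equality constraints hold and that the objective equals the sum of the absolute residuals.

```python
from fractions import Fraction

A = [[2, 3], [1, -1], [1, 2]]     # System (1)
b = [4, 2, 7]
m, n = len(A), len(A[0])

# Extra column a_{i,n+1} = -(a_{i1} + ... + a_{in}) from the substitution x_j = y_j - y_{n+1}.
A_ext = [row + [-sum(row)] for row in A]

def to_lp_variables(x):
    """Map x to nonnegative (y, u, v) with x_j = y_j - y_{n+1} and r_i = u_i - v_i."""
    shift = max(Fraction(0), -min(x))             # plays the role of y_{n+1}
    y = [xj + shift for xj in x] + [shift]
    r = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    u = [max(ri, Fraction(0)) for ri in r]
    v = [max(-ri, Fraction(0)) for ri in r]
    return y, u, v

def satisfies_constraints(y, u, v):
    """Check the equality constraints of Problem (4) together with nonnegativity."""
    equal = all(sum(a * yj for a, yj in zip(A_ext[i], y)) - u[i] + v[i] == b[i]
                for i in range(m))
    return equal and all(t >= 0 for t in y + u + v)

x = [Fraction(2), Fraction(0)]
y, u, v = to_lp_variables(x)
print(A_ext)
print([str(t) for t in y], [str(t) for t in u], [str(t) for t in v])
print(sum(u) + sum(v), satisfies_constraints(y, u, v))
```

With x = (2, 0) the sketch reproduces the coefficients −5, 0, −3 of y3 in Problem (4) and the values y = (2, 0, 0), u = (0, 0, 0), v = (0, 0, 5) with objective value 5. It only verifies a candidate; finding the minimizer still requires a linear programming code.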
l∞ Problem
Consider again a system of m linear equations in n unknowns:
\[ \sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m) \]
If the system is inconsistent, we know that the residuals $r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i$ cannot all be zero for any x vector. So the quantity $\varepsilon = \max_{1 \le i \le m} |r_i|$ is positive. The problem of making ε a minimum is called the l∞ problem for the system of equations. An equivalent linear programming problem is
\[
\begin{aligned}
\text{Minimize:} \quad & \varepsilon \\
\text{Constraints:} \quad & \sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le -b_i && (1 \le i \le m)
\end{aligned}
\]
If a linear programming code is available in which the variables need not be greater than or equal to zero, then it can be used to solve the l∞ problem as formulated above. If the variables must be nonnegative, we first introduce a variable y_{n+1} so large that the quantities yj = xj + y_{n+1} are positive. Next, we solve the linear programming problem
\[
\begin{aligned}
\text{Minimize:} \quad & \varepsilon \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le -b_i && (1 \le i \le m) \\
& \varepsilon \ge 0, \quad y_j \ge 0 && (1 \le j \le n+1)
\end{aligned}
\tag{5}
\]
Here, we have again defined $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$.
For our System (1), the solution that minimizes the quantity
\[ \max\{\, |2x_1 + 3x_2 - 4|,\ |x_1 - x_2 - 2|,\ |x_1 + 2x_2 - 7| \,\} \]
is obtained from the linear programming problem
\[
\begin{aligned}
\text{Minimize:} \quad & \varepsilon \\
\text{Constraints:} \quad & 2y_1 + 3y_2 - 5y_3 - \varepsilon \le 4 \\
& y_1 - y_2 - \varepsilon \le 2 \\
& y_1 + 2y_2 - 3y_3 - \varepsilon \le 7 \\
& -2y_1 - 3y_2 + 5y_3 - \varepsilon \le -4 \\
& -y_1 + y_2 - \varepsilon \le -2 \\
& -y_1 - 2y_2 + 3y_3 - \varepsilon \le -7 \\
& y_1, y_2, y_3 \ge 0, \quad \varepsilon \ge 0
\end{aligned}
\tag{6}
\]
The solution is
\[ y_1 = \tfrac{8}{9}, \quad y_2 = \tfrac{5}{3}, \quad y_3 = 0, \quad \varepsilon = \tfrac{25}{9} \]
From it, the l∞ solution of System (1) is recovered as follows:
\[ x_1 = y_1 - y_3 = \tfrac{8}{9}, \qquad x_2 = y_2 - y_3 = \tfrac{5}{3} \]
Mathematical Software
We can use mathematical software systems such as MATLAB, Maple, or Mathematica to solve the linear programming Problem (6). For example, we obtain the solution y1 = 8/9, y2 = 5/3, y3 = 0, and ε = 25/9 from two of these systems. But for one of the mathematical systems, we obtain the solution corresponding to y1 = 1.0423 × 10^3, y2 = 1.0431 × 10^3, y3 = 1.0414 × 10^3, and ε = 2.778. We do obtain the same results as before: (0.8889, 1.6667) ≈ (8/9, 5/3).
In problems like (6), m is often much larger than n. Thus, in accordance with remarks made in Section 14.2, it may be preferable to solve the dual problem because it would have 2m variables but only n + 2 inequality constraints. To illustrate, the dual of Problem (6) is
\[
\begin{aligned}
\text{Minimize:} \quad & 4u_1 + 2u_2 + 7u_3 - 4u_4 - 2u_5 - 7u_6 \\
\text{Constraints:} \quad & 2u_1 + u_2 + u_3 - 2u_4 - u_5 - u_6 \ge 0 \\
& 3u_1 - u_2 + 2u_3 - 3u_4 + u_5 - 2u_6 \ge 0 \\
& -5u_1 - 3u_3 + 5u_4 + 3u_6 \ge 0 \\
& -u_1 - u_2 - u_3 - u_4 - u_5 - u_6 \ge -1 \\
& u_i \ge 0 \qquad (1 \le i \le 6)
\end{aligned}
\]
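The relation between Problem (6) and its dual can be checked with a few lines of exact arithmetic. The following is a minimal sketch under stated assumptions: Problem (6) is rewritten as "maximize −ε subject to Gw ≦ h, w ≧ 0" so that it matches form (P), the dual is taken in form (D), the helper names are hypothetical, and the dual point u below was worked out for this illustration from complementary slackness rather than quoted from the text.

```python
from fractions import Fraction as F

# Problem (6) as: maximize c^T w subject to G w <= h, w = (y1, y2, y3, eps) >= 0.
A_ext = [[2, 3, -5], [1, -1, 0], [1, 2, -3]]
b = [4, 2, 7]
G = [row + [-1] for row in A_ext] + [[-a for a in row] + [-1] for row in A_ext]
h = b + [-bi for bi in b]
c = [0, 0, 0, -1]

def primal_feasible(w):
    """w >= 0 and G w <= h."""
    return all(wi >= 0 for wi in w) and all(
        sum(g * wi for g, wi in zip(row, w)) <= hi for row, hi in zip(G, h))

def dual_feasible(u):
    """u >= 0 and G^T u >= c."""
    columns = list(zip(*G))                       # columns of G are the rows of G^T
    return all(ui >= 0 for ui in u) and all(
        sum(g * ui for g, ui in zip(col, u)) >= cj for col, cj in zip(columns, c))

w = [F(8, 9), F(5, 3), F(0), F(25, 9)]                  # primal solution quoted in the text
u = [F(1, 3), F(0), F(0), F(0), F(1, 9), F(5, 9)]       # dual point assumed for this sketch

primal_value = sum(cj * wj for cj, wj in zip(c, w))     # equals -eps
dual_value = sum(hi * ui for hi, ui in zip(h, u))       # 4u1 + 2u2 + 7u3 - 4u4 - 2u5 - 7u6
print(primal_feasible(w), dual_feasible(u), primal_value, dual_value)
```

Both points are feasible and both objective values equal −25/9. Since any feasible value of (P) is at most any feasible value of (D), equal values certify that each point is optimal for its problem, so the minimum of ε is 25/9.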
The three types of approximate solution that have been discussed (for an overdetermined system of linear equations) are useful in different situations. Broadly speaking, an l∞ solution is preferred when the data are known to be accurate. An l2 solution is preferred when the data are contaminated with errors that are believed to conform to the normal probability distribution. The l1 solution is often used when data are suspected of containing wild points—points that result from gross errors, such as the incorrect placement of a decimal point. Additional information can be found in Rice and White [1964]. The l2 problem is discussed in Chapter 9 also.
Summary 14.3
• We consider an inconsistent system of m linear equations in n unknowns
\[ \sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (1 \le i \le m) \]
For the residuals $r_i = \sum_{j=1}^{n} a_{ij} x_j - b_i$, the l1 problem for this system is to minimize the expression $\sum_{i=1}^{m} |r_i|$. A direct restatement of the problem is
\[
\begin{aligned}
\text{Minimize:} \quad & \sum_{i=1}^{m} \varepsilon_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n} a_{ij} x_j - b_i \le \varepsilon_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n} a_{ij} x_j + b_i \le \varepsilon_i && (1 \le i \le m)
\end{aligned}
\]
where εi = |ri|. If the variables must be nonnegative, we introduce a variable y_{n+1} and write xj = yj − y_{n+1}. Define $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$; an equivalent linear programming problem is
\[
\begin{aligned}
\text{Maximize:} \quad & -\sum_{i=1}^{m} \varepsilon_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon_i \le -b_i && (1 \le i \le m) \\
& y \ge 0, \quad \varepsilon \ge 0
\end{aligned}
\]
which is in first primal form with m + n + 1 variables and 2m inequality constraints.
• Another technique is to replace the 2m inequality constraints by a set of m equality constraints. We write εi = |ri| = ui + vi, where ui = ri and vi = 0 if ri ≧ 0 but vi = −ri and ui = 0 if ri < 0. The resulting linear programming problem is
\[
\begin{aligned}
\text{Maximize:} \quad & -\sum_{i=1}^{m} u_i - \sum_{i=1}^{m} v_i \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - u_i + v_i = b_i && (1 \le i \le m) \\
& u \ge 0, \quad v \ge 0, \quad y \ge 0
\end{aligned}
\]
• For an inconsistent system, the problem of making $\varepsilon = \max_{1 \le i \le m} |r_i|$ a minimum is the l∞ problem for the system. An equivalent linear programming problem is
\[
\begin{aligned}
\text{Minimize:} \quad & \varepsilon \\
\text{Constraints:} \quad & \sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n} a_{ij} x_j - \varepsilon \le -b_i && (1 \le i \le m)
\end{aligned}
\]
If the variables must be nonnegative, we introduce a large variable y_{n+1} so that the quantities yj = xj + y_{n+1} are positive and we have an equivalent linear programming problem:
\[
\begin{aligned}
\text{Minimize:} \quad & \varepsilon \\
\text{Constraints:} \quad & \sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le b_i && (1 \le i \le m) \\
& -\sum_{j=1}^{n+1} a_{ij} y_j - \varepsilon \le -b_i && (1 \le i \le m) \\
& \varepsilon \ge 0, \quad y_j \ge 0 && (1 \le j \le n+1)
\end{aligned}
\]
where we defined $a_{i,n+1} = -\sum_{j=1}^{n} a_{ij}$.
Exercises 14.3
1. Consider the inconsistent linear system
5x1 + 2x2 = 6
x1 + x2 + x3 = 2
7x2 − 5x3 = 11
6x1 + 9x3 = 9
Write the following with nonnegative variables:
aa. The equivalent linear programming problem for solving the system in the l1 sense.
ab. The equivalent linear programming problem for solving the system in the l∞ sense.
2. (Continuation) Repeat the preceding exercise for the system
3x + y = 7
x − y = 11
x + 6y = 13
−x + 3y = −12
a3. We want to find a polynomial p of degree n that approximates a function f as well as possible from below; that is, we want 0 ≦ f − p ≦ ε for minimum ε. Show how p could be obtained with reasonable precision by solving a linear programming problem.
a4. To solve the l1 problem for the system of equations
x − y = 4
2x − 3y = 7
x + y = 2
we can solve a linear programming problem. What is it?
Computer Exercises 14.3
a1. Obtain numerical answers for Parts a and b of Exercise 14.3.1.
2. (Continuation) Repeat for Exercise 14.3.2.
a3. Find a polynomial of degree 4 that represents the function e^x in the following sense: Select 20 equally spaced points xi in interval [0, 1] and require the polynomial to minimize the expression $\max_{1 \le i \le 20} |e^{x_i} - p(x_i)|$. Hint: This is the same as solving 20 equations in five variables in the l∞ sense. The ith equation is $A + Bx_i + Cx_i^2 + Dx_i^3 + Ex_i^4 = e^{x_i}$, and the unknowns are A, B, C, D, and E.
4. Use built-in routines in mathematical software systems such as MATLAB, Maple, or Mathematica to solve each of these linear programming problems in first primal form, in second primal form, and in dual form:
a. LPP (4) b. LPP (6)
A
Advice on Good Programming Practices
Because the programming of numerical schemes is essential to understanding them, we offer here a few words of advice on good programming practices.
A.1 Programming Suggestions
The suggestions and techniques given here should be considered in context. They are not intended to be complete, and some good programming suggestions have been omitted to keep the discussion brief. Our purpose is to encourage the reader to be attentive to considerations of efficiency, economy, readability, and roundoff errors. Of course, some of these suggestions and admonitions may vary depending on the particular mathematical software system or programming language that is being used.
Be Careful and Be Correct Strive to write programs carefully and correctly. This is of utmost importance.
Use Pseudocode Before beginning the coding, write out in complete detail the mathematical algorithm to be used in pseudocode such as that used in this text. The pseudocode serves as a bridge between the mathematics and the computer program. It need not be defined in a formal way, as is done for a computer language, but it should contain sufficient detail that the implementation is straightforward. When writing the pseudocode, use a style that is easy to read and understand. For maintainability, it should be easy for a person who is unfamiliar with the code to read it and understand what it does.
Check and Double-Check Check the code thoroughly for errors and omissions before beginning to edit on a computer terminal. Spend time checking the code before running it to avoid executing the program, showing the output, discovering an error, correcting the error, and repeating the process ad nauseam.∗
Modern computing environments may allow the user to accomplish this process in only a few seconds, but this advice is still valid if for no other reason than that it is dangerously easy to write programs that may work on a simple test but not on a more complicated one. No function key or mouse can tell you what is wrong!
∗In 1962, the rocket carrying the Mariner I space probe to Venus went off course after only five minutes of flight and was destroyed. An investigation revealed that a single line of faulty Fortran code caused the disaster. A period was typed in place of the comma in the statement DO 5 I=1,3, resulting in the loop being executed once instead of three times. It has been estimated that this single typographical error cost the U.S. National Aeronautics and Space Administration $18.5 million! For additional details, see material available online: history.nasa.gov/mariner.html
Use Test Cases After writing the pseudocode, check and trace through it using pencil-and-paper calculations on a typical yet simple example. Checking boundary cases, such as the values of the first and second iterations in a loop and the processing of the first and last elements in a data structure, may reveal embarrassing errors. These same sample cases can be used as the first set of test cases on the computer.
Modularize Code Build a program in steps by writing and testing a series of segments (subprograms, procedures, or functions); that is, write self-contained subtasks as separate routines. Try to keep these program segments reasonably small, less than a page or computer screen whenever possible, to make reading and debugging easier.
Generalize Slightly If the code can be written to handle a slightly more general situation, then in many cases it is worth the extra effort to do so. A program that was written for only a particular set of numbers may have to be completely rewritten for another set. For example, only a few additional statements are required to write a program with an arbitrary step size compared with a program in which the step size is fixed numerically. However, be careful not to introduce too much generality into the code because it can make a simple programming task overly complicated.
Show Intermediate Results Print out or display intermediate results and diagnostic messages to assist in debugging and understanding the program’s operation. Always echo-print or display the input data unless it is impractical to do so, such as with a large amount of data. Using simple read and print commands frees the programmer from errors associated with misalignment of data. Fancy output formats are not necessary, but some simple labeling of the output is recommended.
Include Warning Messages A robust program always warns the user of a situation that it is not designed to handle. In general, write programs so that they are easy to debug when the inevitable bug appears.
Use Meaningful Variable Names It is often helpful to assign meaningful names to the variables because they may have greater mnemonic value. There is perennial confusion between the characters O (letter “oh”) and 0 (number zero) and between l (letter “ell”) and 1 (number one).
Declare All Variables All variables should be listed in the type declarations in each program or program segment. Implicit type assignments can be ignored when declaration statements include all variables used. Historically, in Fortran, variables beginning with I/i, J/j, K/k, L/l, M/m, and N/n are integer variables, and ones beginning with other letters are floating-point real variables. It may be a good idea to adhere to this scheme to immediately recognize the type of a variable without looking it up in the type declarations. Nevertheless, algorithms written in pseudocode do not always need to follow this advice.
Include Comments Comments within a routine are helpful for revealing at some later time what the program does. Extensive comments are not necessary, but we recommend that you include a preface to each program or subprogram explaining the purpose, the input and output variables, and the algorithm used as well as providing a few comments between major segments of the code. Indent each block of code a consistent number of spaces to
improve readability. Inserting blank comment lines and blank spaces can greatly improve the readability of the code as well. To save space, we have not included any comments in the pseudocode.
Use Clean Loops Never put unnecessary statements within loops. Move expressions and variables outside a loop from inside a loop if they do not depend on the loop or do not change. Also, indenting loops can add to the readability of the code, particularly for nested loops. Use a nonexecutable statement as the terminator of a loop so that the code may be altered easily.
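As a small illustration of moving loop-invariant work out of a loop, the following Python sketch evaluates the same sum two ways. The function names and the sample formula are hypothetical, chosen only for this example; the point is where the invariant factor is computed.

```python
import math

def scaled_sines_cluttered(xs, a, b):
    """Recomputes a loop-invariant factor on every pass (to be avoided)."""
    total = 0.0
    for x in xs:
        factor = math.sqrt(a * a + b * b)  # does not depend on x
        total += factor * math.sin(x)
    return total

def scaled_sines_clean(xs, a, b):
    """Computes the invariant factor once, before the loop."""
    factor = math.sqrt(a * a + b * b)
    total = 0.0
    for x in xs:
        total += factor * math.sin(x)
    return total
```

Both routines perform the same arithmetic on each term, so they should return the same value; the second simply does less work per pass.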
Declare Nonchanging Constants Use a parameter statement for assigning values to key constants. Parameter values correspond to constants that do not change throughout the routine. Such parameter statements are easy to change when one needs to rerun the program with different values. Also, they clarify the role key constants play in the code and make the routines more readable and easier to understand.
Use Appropriate Data Structures Use data structures that are natural to the problem at hand. If the problem adapts more easily to a three-dimensional array than to several one-dimensional arrays, then by all means, use a three-dimensional array.
Use Arrays of All Types The elements of arrays, whether one-, two-, or higher-dimensional, are usually stored in consecutive words of memory. Since the compiler may map the value of an index for two- and higher-subscripted arrays into a single subscript value that is used as a pointer to determine the location of elements in storage, the use of two- and higher-dimensional arrays can be considered a notational convenience for the user. However, any advantage in using only a one-dimensional array and performing complicated subscript calculation is slight. Such matters are best left to the compiler.
Use Built-in Functions In scientific programming languages, many built-in mathematical functions are available for common functions such as sin, log, exp, arcsin, and so on. Also, numeric functions such as integer, real, complex, and imaginary are usually available for type conversion. One should utilize these and others as much as possible. Some of these intrinsic functions accept arguments of more than one type and return a result whose type may vary depending on the type of the argument used. Such functions are called generic functions, for they represent an entire family of related functions. Of course, care should be taken not to use the wrong argument type.
Use Program Libraries In preference to one that you might write yourself for a programming project, a preprogrammed routine from a mathematical software system or a computer program library should be used when applicable. Such routines can be expected to be state-of-the-art software, well tested, and, of course, completely debugged.
Do Not Overoptimize Students should be primarily concerned with writing readable code that correctly computes the desired results. There are any number of tricks of the trade for making code run faster or more efficiently. Save them for use later on in your programming life. We are primarily concerned with understanding and testing various numerical methods. Do not sacrifice the clarity of a program in an effort to make the code run faster. Clarity of code may be preferable to optimization of code when the two criteria conflict.
Case Studies
We present some case studies that may be helpful.
Computing Sums When a long list of floating-point numbers is added in the computer, there may be less roundoff error if the numbers are added in order of increasing magnitude. (Roundoff errors are discussed in detail in Section 1.4.)
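The effect of summation order can be seen in a short Python sketch. The data set here is an assumption made for illustration: one number of magnitude 1 followed by many tiny numbers, each too small to change the running total when added to 1 in double precision. The helper names are hypothetical, and `math.fsum` serves only as an accurate reference value.

```python
import math

def sum_in_given_order(values):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_by_increasing_magnitude(values):
    """Sort by absolute value first, so small terms accumulate before large ones."""
    return sum_in_given_order(sorted(values, key=abs))

values = [1.0] + [1.0e-16] * 100_000
reference = math.fsum(values)            # accurately rounded sum
error_given = abs(sum_in_given_order(values) - reference)
error_sorted = abs(sum_by_increasing_magnitude(values) - reference)
print("error, given order:      ", error_given)
print("error, increasing order: ", error_sorted)
```

In the given order, each tiny term is lost as soon as it meets the large partial sum; in increasing order, the tiny terms first build up to a value large enough to register.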
Mathematical Constants Some students are surprised to learn that in many programming languages, the computer does not automatically know the values of common mathematical constants such as π and e and must be explicitly told their values! It is easy to mistype a long sequence of digits in a mathematical constant, such as in the real number π coded as
pi ← 3.14159 26535 89793
We recommend using simple calculations involving mathematical functions. For example, the real numbers π and e can be easily and safely entered with nearly full machine precision by using standard intrinsic functions such as
pi ← 4.0 arctan(1.0)
e ← exp(1.0)
Another reason for this advice is to avoid the problem that arises if you use a short approximation such as pi ← 3.14159 on a computer with limited precision but later move the code to another computer that has more precision. If you overlook changing this assignment statement, then all results that depend on this value may be less accurate than they should be.
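In Python, the same idea can be sketched with the standard `math` module. Python also supplies `math.pi` and `math.e` directly; they are used below only to check the computed values. The variable names are illustrative.

```python
import math

pi_computed = 4.0 * math.atan(1.0)   # pi from an intrinsic function
e_computed = math.exp(1.0)           # e from an intrinsic function
pi_short = 3.14159                   # hand-typed short approximation

print("4*atan(1) - math.pi =", pi_computed - math.pi)
print("exp(1)    - math.e  =", e_computed - math.e)
print("3.14159   - math.pi =", pi_short - math.pi)
```

The hand-typed value is off in the sixth decimal place no matter how much precision the machine offers, whereas the computed values track the precision of the arithmetic in use.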
Exponents In coding for the computer, exercise care in writing statements that involve exponents. The general function $x^y$ may be calculated on many computers as $\exp(y \ln x)$ whenever $y$ is not an integer. Sometimes this is unnecessarily complicated and may contribute to roundoff errors. For example, it may be preferable to write code with integer exponents such as 5 rather than 5.0. Similarly, using exponents such as $\frac{1}{2}$ or 0.5 is not recommended because the built-in function sqrt may be used.
There is rarely any need for a calculation such as $j \leftarrow (-1)^k$ because there are better ways of obtaining the same result. For example, in a loop, we can write j ← 1 before the loop and j ← −j inside the loop.
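The sign-flipping idea can be sketched in Python as follows. The series chosen, the alternating harmonic series whose sum is ln 2, is an assumption made for illustration, as are the function names.

```python
import math

def alternating_harmonic(n_terms):
    """Partial sum 1 - 1/2 + 1/3 - ..., flipping a sign variable
    inside the loop instead of computing (-1)**k each time."""
    total = 0.0
    sign = 1.0                 # set once before the loop
    for k in range(1, n_terms + 1):
        total += sign / k
        sign = -sign           # flipped inside the loop
    return total

def fifth_power_and_root(x):
    """Integer exponent and sqrt in place of x**5.0 and x**0.5."""
    return x ** 5, math.sqrt(x)

print(alternating_harmonic(100_000), math.log(2.0))
```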
Avoid Mixed Mode In general, one should avoid mixing real and integer expressions in the computer code. Mixed expressions are formulas in which variables and constants of different types appear together. If the floating-point form of an integer variable is needed, use a function such as real. Similarly, a function such as integer is generally available for obtaining the integer part of a real variable. In other words, use the intrinsic type conversion functions whenever converting from complex to real, real to integer, or vice versa. For example, in floating-point calculations, m/n should be coded as real(m)/real(n) when m and n are integer variables so that it computes the correct real value of m/n. Similarly, 1/m should be coded as 1.0/real(m) and 1/2 as 0.5 and so on.
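How integer and real division differ depends on the language. As a hedged illustration in Python 3, the operator `//` performs floor division on integers, which mimics the integer division of many compiled languages for positive operands, while explicit `float` conversions play the role of the function real in the pseudocode. The function names are hypothetical.

```python
def ratio_integer_style(m, n):
    """Integer (floor) division: the fractional part is discarded."""
    return m // n

def ratio_real(m, n):
    """Explicit conversion, corresponding to real(m)/real(n) in the pseudocode."""
    return float(m) / float(n)

m, n = 7, 2
print(ratio_integer_style(m, n))   # integer quotient
print(ratio_real(m, n))            # real quotient
print(ratio_integer_style(1, 2), 0.5)
```

In Python 3 the single-slash operator already returns a real quotient for integer operands, so the explicit conversions here are for clarity and for portability of the habit to other languages.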
Precision In the usual mode of representing numbers in a computer, one word of storage is used for each number. This mode of representation is called single precision. In calculations that require greater precision (called double precision or extended precision), two or more words of storage are allotted to each number. On a 32-bit computer, one can obtain
approximately 6 decimal places of precision in single precision, and approximately 15 decimal places of precision in double precision. If more accuracy is needed than single precision can provide, then double or extended precision should be used.∗ This is particularly true on computers with limited precision, such as a 32-bit computer, on which roundoff errors can quickly accumulate in long computations and reduce the accuracy to only three or four decimal places! (This topic is discussed in Section 1.3.)
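The loss of accuracy in a long single-precision computation can be imitated in Python, whose `float` type is double precision on common platforms. The sketch below assumes IEEE 754 formats and uses the standard `struct` module to round each intermediate result to single precision; the helper names and the choice of adding 0.1 repeatedly are illustrative only.

```python
import struct
import sys

def to_single(x):
    """Round a double-precision value to the nearest single-precision value
    (assumes the platform's C float is IEEE 754 binary32)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def repeated_sum_single(term, count):
    """Add `term` to a running total `count` times, rounding to single each time."""
    total = 0.0
    step = to_single(term)
    for _ in range(count):
        total = to_single(total + step)
    return total

def repeated_sum_double(term, count):
    """The same accumulation carried out in double precision."""
    total = 0.0
    for _ in range(count):
        total += term
    return total

print("decimal digits held by a double:", sys.float_info.dig)
print("single:", repeated_sum_single(0.1, 100_000))
print("double:", repeated_sum_double(0.1, 100_000))
```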
Usually, two words of memory are used to store the real and imaginary parts of a complex number. Complex variables and arrays must be explicitly declared as being of complex type. Expressions involving variables and constants of complex type are evaluated according to the normal rules of complex arithmetic. Intrinsic functions such as complex, real, and imaginary should be used to convert between real and complex types.
Memory Fetches When using loops, write the code so that fetches are made from adjacent words in memory. To illustrate, suppose we want to store values in a two-dimensional array $(a_{ij})$ in which the elements of each column are stored in consecutive memory locations. Using i and j loops with the ith loop as the innermost one would process elements down the columns. For some mathematical software systems and computer programming languages, this detail may be of only secondary concern. However, some computers have immediate access to only a portion or a few pages of memory at a time. In this case, it is advantageous to process the elements of an array so that they are taken from, or stored in, adjacent or nearby memory locations.
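To make the storage pattern concrete, the following Python sketch models column-by-column storage with an explicit offset formula and records the order in which two loop nests touch memory. The function names and the 0-based indexing are assumptions of this illustration; an actual language's array layout should be checked in its documentation.

```python
def column_major_offset(i, j, m):
    """Offset of element (i, j), 0-based, in an m-row array stored column by column."""
    return i + j * m

def visit_order(m, n, innermost):
    """Offsets touched by nested i and j loops, in the order they are visited."""
    offsets = []
    if innermost == "i":
        for j in range(n):
            for i in range(m):       # inner loop runs down a column
                offsets.append(column_major_offset(i, j, m))
    else:
        for i in range(m):
            for j in range(n):       # inner loop runs along a row
                offsets.append(column_major_offset(i, j, m))
    return offsets

print(visit_order(3, 2, "i"))   # consecutive offsets
print(visit_order(3, 2, "j"))   # strided offsets
```

With i innermost the offsets increase by one at each step, which is the adjacent-fetch pattern the text recommends for column-wise storage.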
∗With the proliferation of 64-bit microprocessor(s), 64-bit computing is becoming more commonplace with the resulting improvement in performance and increase in precision. The meanings of single precision and double precision are changing!
When to Avoid Arrays Although the mathematical description of an algorithm may indicate that a sequence of values is computed, thus seeming to imply the need for an array, it is often possible to avoid arrays. (This is especially true if only the final value of a sequence is required.) For example, the theoretical description of Newton’s method (Section 3.2) reads
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
but the pseudocode can be written within a loop simply as
    for n = 1 to 10
        x ← x − f(x)/f′(x)
    end for
where x is a real variable and function procedures for f and f′ have been written. Such an assignment statement automatically effects the replacement of the value of the old x with the new numerical value of x − f(x)/f′(x).
Limit Iterations In a repetitive algorithm, one should always limit the number of permissible steps by the use of a loop with a control variable. This prevents endless cycling due to unforeseen problems (e.g., programming errors and roundoff errors). For example, in
Newton’s pseudocode, one might replace x ← x − f(x)/f′(x) with
    d ← f(x)/f′(x)
    while |d| > (1/2) × 10^(−6)
        x ← x − d
        output x
        d ← f(x)/f′(x)
    end while
If the function involves some erratic behavior, there is a danger here in not limiting the number of repetitions. It is better to use a loop with a control variable:
    for n = 1 to n_max
        d ← f(x)/f′(x)
        x ← x − d
        output n, x
        if |d| ≤ (1/2) × 10^(−6) then exit loop
    end for
where n and n_max are integer variables and the value of n_max is an upper bound on the number of desired repetitions. All others are real variables.
Floating-Point Equality The sequence of steps in a routine should not depend on whether two floating-point numbers are equal. Instead, reasonable tolerances should be permitted to allow for floating-point arithmetic roundoff errors. For example, a suitable branching statement for n decimal digits of accuracy might be
    if |x − y| < ε then … end if
provided that it is known that x and y have magnitude comparable to 1. Here, x, y, and ε are real variables with $\varepsilon = \frac{1}{2} \times 10^{-n}$. This corresponds to requiring that the absolute error between x and y be less than ε. However, if x and y have very large or small orders of magnitude, then the relative error between x and y would be needed, as in the branching statement
    if |x − y| < ε max{|x|, |y|} then … end if
Equal Floating-Point Steps In some situations, notably in solving differential equations (see Chapter 7), a variable t assumes a succession of values equally spaced a distance of h apart along the real line. One way of coding this is
    t ← t0
    output 0, t
    for i = 1 to n
        ⋮
        t ← t + h
        output i, t
    end for
Here, i and n are integer variables, and t0, t, and h are real variables. An alternative way is
    for i = 0 to n
        ⋮
        t ← t0 + real(i)h
        output i, t
    end for
In the first pseudocode, n additions occur, each with possible roundoff error. In the second, this situation is avoided, but at the added cost of n multiplications. Which is better depends on the particular situation at hand.
Function Evaluations When values of a function at arbitrary points are needed in a program, several ways of coding this are available. For example, suppose values of the function
$$f(x) = 2x + \ln x - \sin x$$
are needed. A simple approach is to use an assignment statement such as
    y ← 2x + ln(x) − sin(x)
at appropriate places within the program. Here, x and y are real variables. Equivalently, by using an internal function procedure corresponding to the pseudocode
    f(x) ← 2x + ln(x) − sin(x)
it could be evaluated at 2.5 by
    y ← f(2.5)
or whatever value of x is desired. Finally, a function subprogram can be used such as in the following pseudocode:
    real function f(x)
        real x
        f ← 2x + ln(x) − sin(x)
    end function f
Which implementation is best?
It depends on the situation at hand. The assignment statement is simple and safe. An internal or external function procedure can be used to avoid duplicating code. A separate external function subprogram is the best way to avoid difficulties that inadvertently occur when someone must insert code into another’s program. In using program library routines, the user may be required to furnish an external function procedure to communicate function values to the library routine. If the external function procedure f is passed as an argument in another procedure, then a special interface must be used to designate it as an external function.
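Several of the preceding case studies can be drawn together in one Python sketch: the sample function is written as a separate procedure and passed as an argument, Newton's iteration is capped by an upper bound on the number of steps, equality is tested with a relative tolerance, and equally spaced points are formed as t0 + i·h. The names `newton`, `nearly_equal`, and `time_points`, the default values, and the starting point 1.0 are assumptions made for this illustration.

```python
import math

def f(x):
    """The sample function f(x) = 2x + ln x - sin x."""
    return 2.0 * x + math.log(x) - math.sin(x)

def fprime(x):
    """Its derivative f'(x) = 2 + 1/x - cos x."""
    return 2.0 + 1.0 / x - math.cos(x)

def newton(func, dfunc, x, n_max=20, tol=0.5e-6):
    """Newton iteration with at most n_max steps.
    Returns (x, steps used, True if |d| <= tol was reached)."""
    for n in range(1, n_max + 1):
        d = func(x) / dfunc(x)
        x = x - d
        if abs(d) <= tol:
            return x, n, True
    return x, n_max, False

def nearly_equal(x, y, eps):
    """Relative test |x - y| < eps * max(|x|, |y|)."""
    return abs(x - y) < eps * max(abs(x), abs(y))

def time_points(t0, h, n):
    """Points t0 + i*h for i = 0..n, avoiding n accumulated additions."""
    return [t0 + float(i) * h for i in range(n + 1)]

root, steps, converged = newton(f, fprime, 1.0)
print(root, steps, converged)

t = 0.0
for _ in range(10):
    t += 0.1                      # accumulated additions
print(t == 1.0, nearly_equal(t, 1.0, 0.5e-12), time_points(0.0, 0.1, 10)[-1])
```

Returning a convergence flag rather than looping until the tolerance is met is one way to warn the caller when the step limit is reached; other designs, such as raising an error, would serve equally well.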
On Developing Mathematical Software
Fred Krogh [2003] has written a paper listing some of the things he has learned from a career at the Jet Propulsion Laboratory involving the development and writing of mathematical software used in application packages. Some of his helpful hints and random thoughts to remember in code development are as follows:
• Include internal output in order to see what your algorithm is doing
• Support debugging by including output at the interfaces
• Provide detailed error messages
• Fine-tune your code
• Provide understandable test cases
• Verify results with care
• Take advantage of your mistakes
• Keep units consistent
• Test the extremes
• The algorithm matters
• Work on what does work
• Toss out what does not work
• Do not give up too soon on ideas for improving or debugging your code
• Your subconscious is a powerful tool, so learn to use it
• Test your assumptions
• In the comments, keep a dictionary of variables in alphabetical order because it is quite helpful when looking at a code years after it was written
• Write the user documentation first
• Know what performance you should expect to get
• Do not pay too much, but just enough, attention to others
• See setbacks as learning opportunities and as the staircase for keeping one’s spirits up
• When comparing codes, do not change their features or capabilities in order to make the comparison fair, since you may not fully understand the other person’s code
• Keep action lists
• Categorize code features
• Organize things into groups
• The organization of the code may be one of the most important decisions the developer makes
• Isolate the linear algebra parts of the code in an application package so that the user may make modifications to them
• Reverse communication is a helpful feature that allows users to leave the code and carry out matrix-vector operations using their own data structures
• Save and restore variables when the user is allowed to leave the code and return
• Portability is more important than efficiency
These are just a random sampling of some of his insights. Use them as you see fit!
B
Representation of Numbers in Different Bases
In this appendix, we review some basic concepts on number representation in different bases.
B.1 Representation of Numbers in Different Bases
We begin with a discussion of general number representation, but move quickly to bases 2, 8, and 16, as they are the bases primarily used in computer arithmetic.
The familiar decimal notation for numbers uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. When we write a whole number such as 37294, the individual digits represent coefficients of powers of 10 as follows:
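$$37294 = 4 + 90 + 200 + 7000 + 30000 = 4 \times 10^0 + 9 \times 10^1 + 2 \times 10^2 + 7 \times 10^3 + 3 \times 10^4$$
Thus, in general, a string of digits represents a number according to the formula
$$(a_n a_{n-1} \ldots a_2 a_1 a_0)_{10} = a_0 \times 10^0 + a_1 \times 10^1 + \cdots + a_{n-1} \times 10^{n-1} + a_n \times 10^n$$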
This takes care of only the positive whole numbers. A number between 0 and 1 is represented by a string of digits to the right of a decimal point. For example, we see that
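$$0.7215 = \frac{7}{10} + \frac{2}{100} + \frac{1}{1000} + \frac{5}{10000} = 7 \times 10^{-1} + 2 \times 10^{-2} + 1 \times 10^{-3} + 5 \times 10^{-4}$$
In general, we have the formula
$$(0.b_1 b_2 b_3 \ldots)_{10} = b_1 \times 10^{-1} + b_2 \times 10^{-2} + b_3 \times 10^{-3} + \cdots$$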
Note that there can be an infinite string of digits to the right of the decimal point; indeed, there must be an infinite string to represent some numbers. For example, we note that
$$\begin{aligned}
\sqrt{2} &= 1.41421\,35623\,73095\,04880\,16887\,24209\,69\ldots \\
e &= 2.71828\,18284\,59045\,23536\,02874\,71352\,66\ldots \\
\pi &= 3.14159\,26535\,89793\,23846\,26433\,83279\,50\ldots \\
\ln 2 &= 0.69314\,71805\,59945\,30941\,72321\,21458\,17\ldots \\
\tfrac{1}{3} &= 0.33333\,33333\,33333\,33333\,33333\,33333\,33\ldots
\end{aligned}$$
For a real number of the form
$$(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_{10} = \sum_{k=0}^{n} a_k 10^k + \sum_{k=1}^{\infty} b_k 10^{-k}$$
the integer part is the first summation in the expansion and the fractional part is the second summation. If ambiguity can arise, a number represented in base β is signified by enclosing it in parentheses and adding a subscript β.
Base β Numbers
The foregoing discussion pertains to the usual representation of numbers with base 10. Other bases are also used, especially in computers. For example, the binary system uses 2 as the base, the octal system uses 8, and the hexadecimal system uses 16.
In the octal representation of a number, the digits that are used are 0, 1, 2, 3, 4, 5, 6, and 7. Thus, we see that
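$$(21467)_8 = 7 + 6 \times 8 + 4 \times 8^2 + 1 \times 8^3 + 2 \times 8^4 = 7 + 8(6 + 8(4 + 8(1 + 8(2)))) = 9015$$
A number between 0 and 1, expressed in octal, is represented with combinations of $8^{-1}$, $8^{-2}$, and so on. For example, we have
$$\begin{aligned}
(0.36207)_8 &= 3 \times 8^{-1} + 6 \times 8^{-2} + 2 \times 8^{-3} + 0 \times 8^{-4} + 7 \times 8^{-5} \\
&= 8^{-5}(3 \times 8^4 + 6 \times 8^3 + 2 \times 8^2 + 7) \\
&= 8^{-5}(7 + 8^2(2 + 8(6 + 8(3)))) \\
&= \frac{15495}{32768} \\
&= 0.47286\,987\ldots
\end{aligned}$$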
We shall see presently how to convert easily to decimal form without having to find a common denominator.
If we use another base, say β, then numbers represented in the β-system look like this:
$$(a_n a_{n-1} \ldots a_1 a_0 . b_1 b_2 b_3 \ldots)_\beta = \sum_{k=0}^{n} a_k \beta^k + \sum_{k=1}^{\infty} b_k \beta^{-k}$$
The digits are $0, 1, \ldots, \beta - 2$, and $\beta - 1$ in this representation. If $\beta > 10$, it is necessary to introduce symbols for $10, 11, \ldots, \beta - 1$. The separator between the integer and fractional part is called the radix point, since decimal point is reserved for base-10 numbers.
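The general expression can be evaluated for a finite digit string with nested multiplication. The Python sketch below is one way to do so, using exact rational arithmetic from the standard `fractions` module so that the octal examples given earlier can be checked without roundoff; the function name and argument conventions are assumptions of this illustration.

```python
from fractions import Fraction

def positional_value(int_digits, frac_digits, beta):
    """Value of (a_n ... a_1 a_0 . b_1 b_2 ... b_m) in base beta.
    int_digits lists a_n first; frac_digits lists b_1 first."""
    integer_part = Fraction(0)
    for a in int_digits:                 # a_0 + beta*(a_1 + beta*(...))
        integer_part = integer_part * beta + a
    fractional_part = Fraction(0)
    for b in reversed(frac_digits):      # (b_1 + (b_2 + ...)/beta)/beta
        fractional_part = (fractional_part + b) / beta
    return integer_part + fractional_part

print(positional_value([2, 1, 4, 6, 7], [], 8))
print(positional_value([], [3, 6, 2, 0, 7], 8))
```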
Conversion of Integer Parts
We now formalize the process of converting a number from one base to another. It is advisable to consider separately the integer and fractional parts of a number. Consider, then, a positive integer N in the number system with base γ :
$$N = (a_n a_{n-1} \ldots a_1 a_0)_\gamma = \sum_{k=0}^{n} a_k \gamma^k$$
Suppose that we wish to convert this to the number system with base β and that the calculations are to be performed in arithmetic with base β. Write N in its nested form:
$$N = a_0 + \gamma(a_1 + \gamma(a_2 + \cdots + \gamma(a_{n-1} + \gamma(a_n)) \cdots))$$
Then replace each of the numbers on the right by its representation in base β. Next, carry out the calculations in β-arithmetic. The replacement of the $a_k$’s and γ by equivalent base-β numbers requires a table showing how each of the numbers $0, 1, \ldots, \gamma - 1$ appears in the β-system. Moreover, a base-β multiplication table may be required.
To illustrate this procedure, consider the conversion of the decimal number 3781 to binary form. Using the decimal-binary equivalences and longhand multiplication in base 2, we have
$$\begin{aligned}
(3781)_{10} &= 1 + 10(8 + 10(7 + 10(3))) \\
&= (1)_2 + (1\,010)_2((1\,000)_2 + (1\,010)_2((111)_2 + (1\,010)_2(11)_2)) \\
&= (111\,011\,000\,101)_2
\end{aligned}$$
This arithmetic calculation in binary is easy for a computer that operates in binary but tedious for humans!
Another procedure should be used for hand calculations. Write down an equation containing the digits $c_0, c_1, \ldots, c_m$ that we seek:
$$N = (c_m c_{m-1} \ldots c_1 c_0)_\beta = c_0 + \beta(c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots))$$
Next, observe that if N is divided by β, then the remainder in this division is $c_0$, and the quotient is
$$c_1 + \beta(c_2 + \cdots + \beta(c_m) \cdots)$$
If this number is divided by β, the remainder is $c_1$, and so on. Thus, we divide repeatedly by β, saving remainders $c_0, c_1, \ldots, c_m$ and quotients.
EXAMPLE 1 Convert the decimal number 3781 to binary form using the division algorithm.
Solution As was indicated above, we divide repeatedly by 2, saving the remainders along the way. Here is the work:
    Quotients      Remainders
    2 ) 3781
    2 ) 1890       1 = c0    ↓
    2 )  945       0 = c1
    2 )  472       1 = c2
    2 )  236       0 = c3
    2 )  118       0 = c4
    2 )   59       0 = c5
    2 )   29       1 = c6
    2 )   14       1 = c7
    2 )    7       0 = c8
    2 )    3       1 = c9
    2 )    1       1 = c10
           0       1 = c11
Here, the symbol ↓ is used to remind us that the digits $c_i$ are obtained beginning with the digit next to the binary point. Thus, we have
$$(3781.)_{10} = (111\,011\,000\,101.)_2$$
and not the other way around: $(101\,000\,110\,111.)_2 = (2615)_{10}$. ■
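The division algorithm, together with the nested form used to go back the other way, can be sketched in Python as follows. The function names `to_base` and `from_base` are hypothetical, and the arithmetic is done with Python integers rather than by hand.

```python
def to_base(n, beta):
    """Digits of the nonnegative integer n in base beta, most significant first,
    found by repeated division, saving the remainders."""
    if n == 0:
        return [0]
    remainders = []
    while n > 0:
        n, r = divmod(n, beta)    # r is the next digit: c_0, then c_1, ...
        remainders.append(r)
    remainders.reverse()          # c_0 is found first but written last
    return remainders

def from_base(digits, beta):
    """Value of a digit string (most significant first) by nested multiplication."""
    value = 0
    for d in digits:
        value = value * beta + d
    return value

bits = to_base(3781, 2)
print("".join(str(b) for b in bits))
print(from_base(bits, 2))
```

Reversing the list of remainders is the step that guards against reading the digits "the other way around."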
EXAMPLE 2 Convert the number $N = (111\,011\,000\,101)_2$ to decimal form by nested multiplication.
Solution
$$\begin{aligned}
N &= 1 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 0 \times 2^3 + 0 \times 2^4 + 0 \times 2^5 \\
&\quad + 1 \times 2^6 + 1 \times 2^7 + 0 \times 2^8 + 1 \times 2^9 + 1 \times 2^{10} + 1 \times 2^{11} \\
&= 1 + 2(0 + 2(1 + 2(0 + 2(0 + 2(0 + 2(1 + 2(1 + 2(0 + 2(1 + 2(1 + 2(1))))))))))) \\
&= 3781
\end{aligned}$$
The nested multiplication with repeated multiplication and addition can be carried out on a hand-held calculator more easily than can the previous form with exponentiation. ■
Another conversion problem exists in going from an integer in base γ to an integer in base β when using calculations in base γ. As before, the unknown coefficients in the equation
$$N = c_0 + c_1 \beta + c_2 \beta^2 + \cdots + c_m \beta^m$$
are determined by a process of successive division, and this arithmetic is carried out in the γ-system. At the end, the numbers $c_k$ are in base γ, and a table of γ-β equivalents is used. For example, we can convert a binary integer into decimal form by repeated division by $(1\,010)_2$ (which equals $(10)_{10}$), carrying out the operations in binary. A table of binary-decimal equivalents is used at the final step. However, since binary division is easy only for computers, we shall develop alternative procedures presently.
Conversion of Fractional Parts
We can convert a fractional number such as $(0.372)_{10}$ to binary by using a direct yet naive approach as follows:
$$\begin{aligned}
(0.372)_{10} &= 3 \times 10^{-1} + 7 \times 10^{-2} + 2 \times 10^{-3} = \frac{1}{10}\left(3 + \frac{1}{10}\left(7 + \frac{1}{10}(2)\right)\right) \\
&= \frac{1}{(1\,010)_2}\left((011)_2 + \frac{1}{(1\,010)_2}\left((111)_2 + \frac{1}{(1\,010)_2}(010)_2\right)\right)
\end{aligned}$$
Dividing in binary arithmetic is not straightforward, so we look for easier ways of doing this conversion!
Suppose that x is in the range 0 < x < 1 and that the digits $c_k$ in the representation
$$x = \sum_{k=1}^{\infty} c_k \beta^{-k} = (0.c_1 c_2 c_3 \ldots)_\beta$$
are to be determined. Observe that
$$\beta x = (c_1 . c_2 c_3 c_4 \ldots)_\beta$$
because it is necessary to shift the radix point only when multiplying by base β.
Thus, the unknown digit $c_1$ can be described as the integer part of βx. It is denoted by $I(\beta x)$. The fractional part, $(0.c_2 c_3 c_4 \ldots)_\beta$, is denoted by $F(\beta x)$. The process is repeated in the following pattern:
    d0 = x
    d1 = F(β d0)      c1 = I(β d0)    ↓
    d2 = F(β d1)      c2 = I(β d1)
    etc.
In this algorithm, the arithmetic is carried out in the decimal system.
EXAMPLE 3 Use the preceding algorithm to convert the decimal number $x = (0.372)_{10}$ to binary form.
Solution The algorithm consists in repeatedly multiplying by 2 and removing the integer parts. Here is the work:
                0.372
                  × 2
    ↓  c1 = 0   .744
                  × 2
       c2 = 1   .488
                  × 2
       c3 = 0   .976
                  × 2
       c4 = 1   .952
                  × 2
       c5 = 1   .904
                  × 2
       c6 = 1   .808
       etc.
Thus, we have $(0.372)_{10} = (0.010\,111\ldots)_2$. ■
Base Conversion 10 ↔ 8 ↔ 2
Most computers use the binary system (base 2) for their internal representation of numbers. The octal system (base 8) is particularly useful in converting from the decimal system (base 10) to the binary system and vice versa. With base 8, the positional values of the numbers are $8^0 = 1$, $8^1 = 8$, $8^2 = 64$, $8^3 = 512$, $8^4 = 4096$, and so on. Thus, for example, we have
$$(26031)_8 = 2 \times 8^4 + 6 \times 8^3 + 0 \times 8^2 + 3 \times 8 + 1 = ((((2)8 + 6)8 + 0)8 + 3)8 + 1 = 11289$$
and
$$\begin{aligned}
(7152.46)_8 &= 7 \times 8^3 + 1 \times 8^2 + 5 \times 8 + 2 + 4 \times 8^{-1} + 6 \times 8^{-2} \\
&= (((7)8 + 1)8 + 5)8 + 2 + 8^{-2}[(4)8 + 6] \\
&= 3690 + \frac{38}{64} \\
&= 3690.59375
\end{aligned}$$
When numbers are converted between decimal and binary form by hand, it is convenient to use octal representation as an intermediate step. In the octal system, the base is 8, and, of course, the digits 8 and 9 are not used. Conversion between octal and decimal proceeds according to the principles already stated. Conversion between octal and binary is especially simple. Groups of three binary digits can be translated directly to octal according to the following table:
Binary-Octal Table
    Binary   000   001   010   011   100   101   110   111
    Octal      0     1     2     3     4     5     6     7
This grouping starts at the binary point and proceeds in both directions. Thus, we have
(101101001.110010100)2 =(551.624)8
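The grouping rule can be tried out with a short Python sketch. The function names are our own, and numbers are passed as digit strings so that no floating-point rounding interferes.

```python
def binary_to_octal(s):
    """Group binary digits in threes, starting at the binary point."""
    whole, _, frac = s.partition(".")
    whole = whole.zfill(-(-len(whole) // 3) * 3)    # pad with zeros on the left
    frac = frac.ljust(-(-len(frac) // 3) * 3, "0")  # pad with zeros on the right

    def groups(part):
        return "".join(str(int(part[i:i + 3], 2)) for i in range(0, len(part), 3))

    return groups(whole) + ("." + groups(frac) if frac else "")


def octal_to_binary(s):
    """Replace each octal digit with its three binary digits."""
    return "".join("." if ch == "." else format(int(ch, 8), "03b") for ch in s)


print(binary_to_octal("101101001.110010100"))   # 551.624
print(octal_to_binary("551.624"))               # 101101001.110010100
```

Padding with zeros on the left of the integer part and on the right of the fractional part does not change the value, which is why the grouping may start at the binary point and proceed in both directions.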
To justify this convenient sleight of hand, we consider, for instance, a fraction expressed
in binary form:
\[
\begin{aligned}
x &= (0.b_1b_2b_3b_4b_5b_6\ldots)_2 \\
  &= b_12^{-1} + b_22^{-2} + b_32^{-3} + b_42^{-4} + b_52^{-5} + b_62^{-6} + \cdots \\
  &= (4b_1 + 2b_2 + b_3)8^{-1} + (4b_4 + 2b_5 + b_6)8^{-2} + \cdots
\end{aligned}
\]
In the last line of this equation, the parentheses enclose numbers from the set {0, 1, 2, 3, 4, 5, 6, 7} because the $b_i$'s are either 0 or 1. Hence, this must be the octal representation of x.
Conversion of an octal number to binary can be done in a similar manner but in reverse order. It is easy! Just replace each octal digit with the corresponding three binary digits. Thus, for example, we obtain
\[ (5362.74)_8 = (101\,011\,110\,010.111\,100)_2 \]
EXAMPLE 4  What is $(2576.35546\,875)_{10}$ in octal and binary forms?
Solution  We convert the original decimal number first to octal and then to binary. For the integer part, we repeatedly divide by 8:
    8 ) 2576
    8 )  322    remainder 0
    8 )   40    remainder 2
    8 )    5    remainder 0
           0    remainder 5
Reading the remainders from the bottom up, we have
\[ 2576. = (5020.)_8 = (101\,000\,010\,000.)_2 \]
using the rules for grouping binary digits. For the fractional part, we repeatedly multiply by 8:
         0.35546875
                 ×8
    2    .84375000
                 ×8
    6    .75000000
                 ×8
    6    .00000000
so that
\[ 0.35546\,875 = (0.266)_8 = (0.010\,110\,110)_2 \]
Finally, we obtain the result
\[ 2576.35546\,875 = (101\,000\,010\,000.010\,110\,110)_2 \]
Although this approach is longer for this example, we feel that it is easier, in general, and less likely to lead to errors because one is working with single-digit numbers most of the time. ■
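The multiply-by-β and integer-fraction-split process used for the fractional parts can be sketched as follows. The name `fraction_digits` is hypothetical, and Python's `fractions.Fraction` is used (our choice, not the text's) so that the decimal arithmetic is exact rather than subject to binary rounding.

```python
from fractions import Fraction

def fraction_digits(x, beta, count):
    """First `count` base-beta digits of x, where 0 <= x < 1."""
    d = Fraction(x)        # exact value of the decimal string
    digits = []
    for _ in range(count):
        d *= beta
        c = int(d)         # integer part I(beta * d) is the next digit
        digits.append(c)
        d -= c             # fractional part F(beta * d) is carried forward
    return digits

print(fraction_digits("0.372", 2, 6))        # [0, 1, 0, 1, 1, 1]
print(fraction_digits("0.35546875", 8, 3))   # [2, 6, 6]
```

A plain `float` would not do here, because 0.372 has no finite binary expansion and the stored value would already differ from the decimal number being converted.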
Base 16
Some computers whose word lengths are multiples of 4 use the hexadecimal system (base 16) in which A, B, C, D, E, and F represent 10, 11, 12, 13, 14, and 15, respectively, as given in the following table of equivalences:
Hexadecimal-Binary Table
    Hexadecimal      0      1      2      3      4      5      6      7
    Binary        0000   0001   0010   0011   0100   0101   0110   0111
    Hexadecimal      8      9      A      B      C      D      E      F
    Binary        1000   1001   1010   1011   1100   1101   1110   1111
Conversion between binary numbers and hexadecimal numbers is particularly easy. We need only regroup the binary digits from groups of three to groups of four. For example, we have
\[ (010\,101\,110\,101\,101)_2 = (0010\,1011\,1010\,1101)_2 = (2\mathrm{BAD})_{16} \]
\[
\begin{aligned}
(111\,101\,011\,110\,010.110\,010\,011\,110)_2 &= (0111\,1010\,1111\,0010.1100\,1001\,1110)_2 \\
 &= (7\mathrm{AF}2.\mathrm{C}9\mathrm{E})_{16}
\end{aligned}
\]
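These two hexadecimal results can be checked with Python's built-in integer conversions. The sketch below takes a different route from regrouping by hand: it pads the fraction to whole groups of four bits, converts the resulting bit string as one integer, and then reinserts the point. The name `binary_to_hex` is our own.

```python
def binary_to_hex(s):
    """Convert a binary digit string (optionally with a point) to hexadecimal."""
    whole, _, frac = s.partition(".")
    frac = frac.ljust(-(-len(frac) // 4) * 4, "0")   # pad fraction to groups of four
    n_frac = len(frac) // 4                          # hex digits after the point
    digits = format(int((whole + frac) or "0", 2), "X").zfill(n_frac + 1)
    split = len(digits) - n_frac
    return digits[:split] + ("." + digits[split:] if n_frac else "")

print(binary_to_hex("010101110101101"))                # 2BAD
print(binary_to_hex("111101011110010.110010011110"))   # 7AF2.C9E
```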
More Examples
Continuing with more examples, let us convert (0.276)8 , (0.C8)16 , and (492)10 into different number systems. We show one way for each number and invite the reader to work out the details for other ways and to verify the answers by converting them back into the original base.
\[
\begin{aligned}
(0.276)_8 &= 2\times8^{-1} + 7\times8^{-2} + 6\times8^{-3} \\
          &= 8^{-3}[((2)8+7)8+6] \\
          &= (0.37109\,375)_{10}
\end{aligned}
\]
\[
\begin{aligned}
(0.\mathrm{C}8)_{16} &= (0.110\,010)_2 = (0.62)_8 \\
           &= 6\times8^{-1} + 2\times8^{-2} = 8^{-2}[(6)8+2] \\
           &= (0.78125)_{10}
\end{aligned}
\]
and
\[ (492)_{10} = (754)_8 = (111\,101\,100)_2 = (1\mathrm{EC})_{16} \]
because
    8 ) 492
    8 )  61    remainder 4
    8 )   7    remainder 5
          0    remainder 7
Summary B.1
It might seem that there are several different procedures for converting between number systems. Actually, there are only two basic techniques. The first procedure for converting the number $(N)_\gamma$ to base β can be outlined as follows:
• Express $(N)_\gamma$ in nested form using powers of γ.
• Replace each digit by the corresponding base-β numbers.
• Carry out the indicated arithmetic in base β.
This outline holds whether N is an integer or a fraction. The second procedure is either the divide-by-β and remainder-quotient-split process for N an integer or the multiply-by-β and integer-fraction-split process for N a fraction. The first procedure is preferred when γ < β and the second when γ > β. Of course, the 10 ↔ 8 ↔ 2 ↔ 16 base conversion procedure should be used whenever possible because it is the easiest way to convert numbers between the decimal, octal, binary, or hexadecimal systems.
Exercises B.1
1. Find the binary representation and check by reconverting to decimal representation.
    ᵃa. $e \approx (2.718)_{10}$    b. $\tfrac{7}{8}$    c. $(592)_{10}$
2. Convert the following decimal numbers to octal numbers.
    a. 27.1    b. 12.34    c. 3.14    ᵃd. 23.58    e. 75.232    f. 57.321
3. Convert to hexadecimal, to octal, and then to decimal.
    ᵃa. $(110\,111\,001.101\,011\,101)_2$    b. $(1\,001\,100\,101.011\,01)_2$
4. Convert the following numbers:
    a. $(100\,101\,101)_2 = (\quad)_8 = (\quad)_{10}$
    b. $(0.782)_{10} = (\quad)_8 = (\quad)_2$
    ᵃc. $(47)_{10} = (\quad)_8 = (\quad)_2$
    d. $(0.47)_{10} = (\quad)_8 = (\quad)_2$
    ᵃe. $(51)_{10} = (\quad)_8 = (\quad)_2$
    f. $(0.694)_{10} = (\quad)_8 = (\quad)_2$
    ᵃg. $(110\,011.111\,010\,110\,110\,1)_2 = (\quad)_8 = (\quad)_{10}$
    h. $(361.4)_8 = (\quad)_2 = (\quad)_{10}$
5. Convert $(45653.127664)_8$ to binary and to decimal.
ᵃ6. Convert $(0.4)_{10}$ first to octal and then to binary. Check: Convert directly to binary.
7. Prove that the decimal number $\tfrac{1}{5}$ cannot be represented by a finite expansion in the binary system.
8. Do you expect your computer to calculate $3 \times \tfrac{1}{3}$ with infinite precision? What about $2 \times \tfrac{1}{2}$ or $10 \times \tfrac{1}{10}$?
ᵃ9. Explain the algorithm for converting an integer in base 10 to one in base 2, assuming that the calculations are performed in binary arithmetic. Illustrate by converting $(479)_{10}$ to binary.
10. Justify mathematically the conversion between binary and hexadecimal numbers by regrouping.
11. Justify for integers the rule given for the conversion between octal and binary numbers.
ᵃ12. Prove that a real number has a finite representation in the binary number system if and only if it is of the form $\pm m/2^n$, where n and m are positive integers.
13. Prove that any number that has a finite representation in the binary system must have a finite representation in the decimal system.
14. Some countries measure temperature in Fahrenheit (F), while other countries use Celsius (C). Similarly, for distance, some use miles and others use kilometers. As a frequent traveler, you may be in need of a quick approximate conversion scheme that you can do in your head.
    a. Fahrenheit and Celsius are related by the equation F = 32 + (9/5)C. Verify the following simple conversion scheme for going from Celsius to Fahrenheit: A rough approximation is to double the Celsius temperature and add 32. To refine your approximation, shift the decimal place to the left in the doubled number (2C) and subtract it from the approximation obtained previously:
        F = [(2C) + 32] − (2C)/10
    b. Determine a simple scheme to convert from Fahrenheit to Celsius.
    c. Determine a simple scheme to convert from miles to kilometers.
    d. Determine a simple scheme to convert from kilometers to miles.
15. Convert fractions such as $\tfrac{1}{3}$ and $\tfrac{1}{11}$ into their binary representation.
16. (Mayan Arithmetic) The Maya civilization of Central America (2600 B.C. to 1200 A.D.) understood the concept of zero hundreds of years before many other civilizations. For their calculations, the vigesimal (base 20) system was used, not the decimal (base 10) system. So instead of 1, 10, 100, 1000, 10000, they used 1, 20, 400, 8000, 160000. They used a dot for 1 and a bar for 5, and 0 was represented by the shell symbol. For example, the calculations 11131 + 7520 = 18651 and 11131 − 7520 = 3611 were as follows:
    [Table of Mayan symbols not reproduced: rows 8000s, 400s, 20s, 1s; columns 11131, 7520, 18651, 3611]
    In the table above, some decimal numbers are included as an aid: on the left are the powers used, and across the top are the numbers represented by the columns.
    Do these calculations using Mayan symbols and arithmetic:
    a. 92819 + 56313 = 149132, 92819 − 56313 = 36506
    b. 3296 + 853 = 4149, 3296 − 853 = 2443
    c. 2273 + 729 = 3002, 2273 − 729 = 1544
    d. Investigate how the Mayans might have done multiplication and division in their number system. Work out some simple examples.
17. (Babylonian Arithmetic) Babylonians of ancient Mesopotamia (now Iraq) used a sexagesimal (base 60) positional number system with a decimal (base 10) system within it. The Babylonians based their number system on only two symbols! The influence of Babylonian arithmetic is still with us today. An hour consists of 60 minutes, each minute is divided into 60 seconds, and a circle is measured in divisions of 360 degrees. Numbers are frequently called digits, from the Latin word for “finger.” The base-10 and base-20 systems most likely arose from the fact that 10 fingers and 10 toes could be used in counting. Investigate the early history of numbers and doing arithmetic calculations in different number systems.
Computer Exercises B.1
1. Read into your computer x = 1.1 (base 10), and print it out using several different formats. Explain the results.
2. Show that $e^{\pi\sqrt{163}}$ is incredibly close to being the 18-digit integer 262 53741 26407 68744. Hint: More than 30 decimal digits will be needed to see any difference.
3. Write and test a routine for converting integers into octal and binary forms.
4. (Continuation) Write and test a routine for converting decimal fractions into octal and binary forms.
5. (Continuation) Using the two routines of the preceding problems, write and test a program that reads in decimal numbers and prints out the decimal, octal, and binary representations of these numbers.
6. See how many binary digits your computer has for $(0.1)_{10}$.
7. Some mathematical software systems have commands for converting numbers between binary, decimal, hex, octal, and vice versa. Explore these commands using various numerical values. Also, see whether there are commands for determining the precision (the number of significant decimal digits in a number) and the accuracy (the number of significant decimal digits to the right of the decimal point in a number).
8. Write a computer program to verify the conclusions in evaluating
    f(x) = x − sin x
for various values of x near 1.9, say, over the interval [0.1, 2.5] with increments of 0.1. For these values, compute the approximate value of f, the true calculated value, and the absolute error between them. Single-precision and double-precision computations may be necessary.
C
Additional Details on IEEE Floating-Point Arithmetic
In this appendix, we summarize some additional features in IEEE standard floating-point arithmetic. (See Overton [2001] for additional details.)
C.1 More on IEEE Standard Floating-Point Arithmetic
In the early 1980s, a working committee of the Institute of Electrical and Electronics Engineers (IEEE) established a standard floating-point arithmetic system for computers that is now known as the IEEE floating-point standard (IEEE-754). Previously, manufacturers of different computers each developed their own internal floating-point number systems. This led to inconsistencies in numerical results in moving code from machine to machine, for example, in porting source code from an IBM computer to a Cray machine. Some important requirements for all machines adopting the IEEE floating-point standard include the following:
• Correctly rounded arithmetic
• Consistent representation of floating-point numbers across machines
• Consistent and sensible treatment of exceptional situations
Suppose that we are using a 32-bit computer with IEEE standard floating-point arithmetic. There are exactly 23 bits of precision in the fraction field in a single-precision normalized number. By counting the hidden bit, this means that there are 24 bits in the significand and the unit roundoff error is $u = 2^{-24}$. In single precision, the machine epsilon is $\varepsilon_{\text{single}} = 2^{-23}$ because $1 + 2^{-23}$ is the first single-precision number larger than 1. Since $2^{-23} \approx 1.19 \times 10^{-7}$, we can expect only approximately 6 accurate decimal digits in the output. This accuracy may be reduced further by errors of various types, such as roundoff errors in the arithmetic, truncation errors in the formulas used, and so on.
For example, when computing the single-precision approximation to π, we obtain 6 accurate digits: 3.14159. Converting and printing the 24-bit binary number results in an actual decimal number with more than six nonzero digits, but only the first six digits are considered accurate approximations to π.
The first double-precision number larger than 1 is $1 + 2^{-52}$. So the double-precision machine epsilon is $\varepsilon_{\text{double}} = 2^{-52}$. Since $2^{-52} \approx 2.22 \times 10^{-16}$, there are only approximately 15 accurate decimal digits in the output in the absence of errors. The fraction field has exactly 52 bits of precision, and this results in 53 bits in the significand when the hidden bit is counted.
For example, when approximating π in double precision, we obtain 15 accurate digits: 3.14159 26535 8979. As in the case with single precision, converting and printing the 53-bit binary significand results in more than 15 digits, but only the first 15 digits are accurate approximations to π.
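These statements about the first number larger than 1 can be checked directly. The sketch below assumes that Python's `float` is an IEEE double (true on essentially all current platforms) and uses the standard `struct` module to round a value to single precision. The helper name `to_single` is our own.

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest IEEE single and return it as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Single precision: 1 + 2^-23 is representable, 1 + 2^-24 rounds back to 1.
print(to_single(1 + 2.0**-23) > 1.0)    # True
print(to_single(1 + 2.0**-24) == 1.0)   # True

# Double precision: 1 + 2^-52 is the next number after 1.
print(1 + 2.0**-52 > 1.0)               # True
print(1 + 2.0**-53 == 1.0)              # True

# Pi stored in single and in double precision, printed in full.
print(repr(to_single(math.pi)))
print(repr(math.pi))
```

Comparing the last two printed values with π shows how many leading digits of each can be trusted.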
There are some useful special numbers in the IEEE standard. Instead of terminating with an overflow when dividing a nonzero number by 0, the machine representation for ∞ is stored, which is the mathematically sensible thing to do. Because of the hidden bit representation, a special technique for storing zero is necessary. Note that all zeros in the fraction field (mantissa) represent the significand 1.0 rather than 0.0. Moreover, there are two different representations for the same number zero, namely, +0 and −0. On the other hand, there are two different representations for infinity that correspond to two quite different numbers, +∞ and −∞. NaN stands for Not a Number and is an error pattern rather than a number.
Is it possible to represent numbers smaller than the smallest normalized floating-point number $2^{-126}$ in IEEE standard floating-point format?
Yes! If the exponent field contains a bit string of all zeros and the fraction field contains a nonzero bit string, then this representation is called a subnormal number. Subnormal numbers cannot be normalized because this would result in an exponent that does not fit into the exponent field. These subnormal numbers are less accurate than normal numbers because they have less room in the fraction field for nonzero bits.
By using various system inquiry functions (such as those in Table C.1 from Fortran), we can determine some of the characteristics of the floating-point number system on a typical PC with 32-bit IEEE standard floating-point arithmetic. Table C.2 contains the results. In most cases, simple programs can also be written to determine these values.
TABLE C.1  Some Numeric Inquiry Functions in Fortran
    EPSILON(X)      Machine epsilon (number almost negligible compared to 1)
    TINY(X)         Smallest positive number
    HUGE(X)         Largest number
    PRECISION(X)    Decimal precision (number of significant decimal digits in output)
TABLE C.2  Results with IEEE Standard Floating-Point on 32-Bit Machine
                    X Single Precision                                           X Double Precision
    EPSILON(X)      $1.192 \times 10^{-7} \approx 2^{-23}$                        $2.220 \times 10^{-16} \approx 2^{-52}$
    TINY(X)         $1.175 \times 10^{-38} \approx 2^{-126}$                      $2.225 \times 10^{-308} \approx 2^{-1022}$
    HUGE(X)         $3.403 \times 10^{38} \approx (2 - 2^{-23}) \times 2^{127} \approx 2^{128}$    $1.798 \times 10^{308} \approx 2^{1024}$
    PRECISION(X)    6                                                            15
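Python offers a rough counterpart to these inquiry functions through the standard `sys.float_info` structure, which describes the double-precision type only. The pairing of its fields with the Fortran names in the comments below is our own reading, not something stated in the text.

```python
import sys

info = sys.float_info
print(info.epsilon)   # analogous to EPSILON(X): gap between 1 and the next double
print(info.min)       # analogous to TINY(X): smallest positive normalized double
print(info.max)       # analogous to HUGE(X): largest finite double
print(info.dig)       # analogous to PRECISION(X): reliable decimal digits

# The same quantities written as powers of 2.
print(info.epsilon == 2.0**-52)
print(info.min == 2.0**-1022)
print(info.max == (2 - 2.0**-52) * 2.0**1023)
```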
In Table C.3 (p. 628), we show the relationship between the exponent field and the possible single-precision 32-bit floating-point numbers corresponding to it. In this table, all lines except the first and the last are normalized floating-point numbers. The first line shows that zero is represented by +0 when all bits $b_i = 0$, and by −0 when all bits are zero except $b_1 = 1$. The last line shows that +∞ and −∞ have bit strings of all ones in the exponent field, except for possibly the sign bit, together with all zeros in the mantissa field.
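The correspondence between bit fields and values summarized in Table C.3 can be turned into a small decoder. This is a sketch under two assumptions: the function name `decode_single` is hypothetical, and the value $(0.b_{10}b_{11}\cdots b_{32})_2 \times 2^{-126}$ assigned to subnormal numbers is taken from the IEEE standard rather than from the text, which does not spell it out.

```python
import math
import struct

def decode_single(word):
    """Value of a 32-bit IEEE single given as an unsigned integer bit pattern."""
    sign = -1.0 if (word >> 31) & 1 else 1.0   # bit b1
    exponent = (word >> 23) & 0xFF             # bits b2 ... b9
    fraction = word & 0x7FFFFF                 # bits b10 ... b32
    if exponent == 255:                        # last line of the table
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:                          # first line: zero or subnormal
        return sign * fraction * 2.0**-149     # (0.f)_2 * 2^-126 (assumed, see above)
    return sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)

print(decode_single(0x3F800000))   # 1.0
print(decode_single(0x00800000) == 2.0**-126)   # True: smallest normalized number
print(decode_single(0xFF800000))   # -inf
```

The result can be compared against the machine's own interpretation with `struct.unpack(">f", struct.pack(">I", word))[0]`.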
TABLE C.3  Single-Precision 32-Bit Word with Sign Bit $b_1 = 0$ for + and $b_1 = 1$ for −
    Word layout:   $b_1$  |  $b_2b_3b_4\cdots b_9$  |  $b_{10}b_{11}\cdots b_{32}$

    $(b_2b_3\ldots b_9)_2$ Exponent Field      Numerical Representation
    $(00000000)_2 = (0)_{10}$                  ±0, if $b_{10} = b_{11} = \cdots = b_{32} = 0$; subnormal, otherwise
    $(00000001)_2 = (1)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-126}$
    $(00000010)_2 = (2)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-125}$
    $(00000011)_2 = (3)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-124}$
    $(00000100)_2 = (4)_{10}$                  $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-123}$
    ⋮                                          ⋮
    $(01111101)_2 = (125)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-2}$
    $(01111110)_2 = (126)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{-1}$
    $(01111111)_2 = (127)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{0}$
    $(10000000)_2 = (128)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{1}$
    $(10000001)_2 = (129)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{2}$
    ⋮                                          ⋮
    $(11111011)_2 = (251)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{124}$
    $(11111100)_2 = (252)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{125}$
    $(11111101)_2 = (253)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{126}$
    $(11111110)_2 = (254)_{10}$                $\pm(1.b_{10}b_{11}b_{12}\cdots b_{32})_2 \times 2^{127}$
    $(11111111)_2 = (255)_{10}$                ±∞, if $b_{10} = b_{11} = \cdots = b_{32} = 0$; NaN, otherwise
Rounding Modes
In the IEEE floating-point standard, the round to nearest or correctly rounded value of the real number x, denoted round(x), is defined as follows. First, let $x_+$ be the closest floating-point number greater than x, and let $x_-$ be the closest one less than x. If x is a floating-point number, then round(x) = x. Otherwise, the value of round(x) depends on the rounding mode selected:
• Round to nearest: round(x) is either $x_-$ or $x_+$, whichever is nearer to x. (If there is a tie, choose the one with the least significant bit equal to 0.)
• Round toward 0: round(x) is either $x_-$ or $x_+$, whichever is between 0 and x.
• Round toward −∞/round down: round(x) = $x_-$.
• Round toward +∞/round up: round(x) = $x_+$.
Round to nearest is almost always used, since it is the most useful and gives the floating-point number closest to x.
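The four rounding modes can be imitated in double precision with exact rational arithmetic. The sketch below is only an illustration: it assumes Python 3.9 or later for `math.nextafter`, it relies on CPython converting a `Fraction` to `float` with round to nearest, and the names `neighbors` and `round_to` are our own.

```python
import math
from fractions import Fraction

def neighbors(x):
    """Return (x_minus, x_plus), the doubles that bracket the exact rational x."""
    f = float(x)                    # round to nearest (assumed, see lead-in)
    exact = Fraction(f)
    if exact == x:                  # x is itself a floating-point number
        return f, f
    if exact < x:
        return f, math.nextafter(f, math.inf)
    return math.nextafter(f, -math.inf), f

def round_to(x, mode):
    """Round the exact rational x according to one of the four IEEE modes."""
    lo, hi = neighbors(x)
    if mode == "nearest":
        return float(x)
    if mode == "down":              # toward minus infinity
        return lo
    if mode == "up":                # toward plus infinity
        return hi
    if mode == "zero":              # whichever neighbor lies between 0 and x
        return lo if x > 0 else hi
    raise ValueError(f"unknown rounding mode: {mode}")

x = Fraction(1, 10)
for mode in ("nearest", "zero", "down", "up"):
    print(mode, repr(round_to(x, mode)))
```

Because 1/10 has no finite binary expansion, the modes that round down and the modes that round up give two different adjacent doubles.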
D
Linear Algebra Concepts and Notation
In this appendix, we review some basic concepts and standard notation used in linear algebra.
D.1 Elementary Concepts
Column Vector
Row Vector
The two concepts from linear algebra that we are most concerned with are vectors and matrices because of their usefulness in compressing complicated expressions into a compact notation. The vectors and matrices in this text are most often real, since they consist of real numbers. These concepts easily generalize to complex vectors and matrices.
Vectors
A vector $x \in \mathbb{R}^n$ can be thought of as a one-dimensional array of numbers and is written as
\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \]
where $x_i$ is called the i-th element, entry, or component. An alternative notation that is useful in pseudocodes is $x = (x_i)_n$. Sometimes the vector x displayed above is said to be a column vector to distinguish it from a row vector y written as
\[ y = [\,y_1, y_2, \ldots, y_n\,] \]
For example, here are some vectors:
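\[
\begin{bmatrix} \tfrac{1}{5} \\ 3 \\ -\tfrac{5}{6} \\ \tfrac{2}{7} \end{bmatrix}, \qquad
[\,\pi, e, 5, -4\,], \qquad
\begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{3} \end{bmatrix}
\]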
To save space, a column vector x can be written as a row vector such as
\[ x = [x_1, x_2, \ldots, x_n]^T \quad\text{or}\quad x^T = [x_1, x_2, \ldots, x_n] \]
by adding a T (for transpose) to indicate that we are interchanging or transposing a row or column vector. As an example, we have
Transpose
\[ [\,1 \;\; 2 \;\; 3 \;\; 4\,]^T = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \]
Column Vectors
\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \]
y = ax + b
Linear Combination
combination as
m i=1
m
α x(i) i1 i=1
m
α i x ( i )
Many operations involving vectors are component-by-component operations. For vectors x and y, the following definitions apply:
    Equality      x = y if and only if $x_i = y_i$ for all i (1 ≦ i ≦ n)
    Inequality    x
triangular, and upper triangular matrices, respectively, are as follows:
\[
\begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{bmatrix}, \quad
\begin{bmatrix} 3&0&0&0 \\ 0&5&0&0 \\ 0&0&7&0 \\ 0&0&0&9 \end{bmatrix}, \quad
\begin{bmatrix} 5&3&0&0&0 \\ 2&5&3&0&0 \\ 0&2&9&2&0 \\ 0&0&3&7&2 \\ 0&0&0&3&7 \end{bmatrix}
\]
\[
\begin{bmatrix} 6&0&0&0 \\ 3&6&0&0 \\ 4&-2&7&0 \\ 5&-3&9&21 \end{bmatrix}, \quad
\begin{bmatrix} 1&-1&2&1 \\ 0&5&-5&1 \\ 0&0&9&-3 \\ 0&0&0&2 \end{bmatrix}
\]
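As with vectors, many operations involving matrices correspond to component operations. For matrices A and B, we write
\[
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}, \qquad
B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nm} \end{bmatrix}
\]
the following definitions apply:
    Equality      A = B if and only if $a_{ij} = b_{ij}$ for all i (1 ≦ i ≦ n) and all j (1 ≦ j ≦ m)
    Inequality    A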
f (0.0125) ≈ −9.888 × 10−7 +3−1.7×2; f(0)=3.5
14. |x| < 10−15 Exercises 2.1
15. ρ50 = 2.85987
1
x2 + 1 + x 1.
Homogeneous: α = 0, zero solution; α = ±1, infinite number of solutions
11.f(x)= 0, √2 x=0 −ln −x+ x +1 , x<0
2. For α ≈ 1, erroneous answer is produced. 3a. No solution
x4 3b. x +4+2 4.
Infinite number of solutions
13. z=√4
16. f(x)≈1−x+ 3 − 6 ; f(0.008)≈0.992020915
x1 = −720.79976 x2 = 356.28760
x2 x3
20. arctan x − x ≈ x3−1 + x21 + x2−1 5.
x1 = −697.3 x2 = 343.9
−0.001343 −0.001572
−0.0000001 0.0000000 ,
r =
e= −0.001 , e= 0.913
r =
,
357 22. e2x −12x ≈1+x(1+(x/3)(2+x))
24a. Nearπ/2,sinecurveisrelativelyflat. 6a. 26b. lnx−1=ln(x/e)
26d. x−2(sinx−ex+1)≈−1−xwhenx→0
23
6ε, where ε machine precision
which is exact.
7a. x1 = −0.2752 + 0.9174i ,
x2 = 5.4312 − 0.7706i , x3 = −3.3394 − 0.018i
7b. x1 = 1.1927 − 0.6422i , x2 = −0.2018 + 3.3394i , x3 = 1.1376 − 2.4587i
−0.001
x2=1,x1=0 6b. x2=1,x1=1
−0.659 6c. Letb1 =b2 =1.Thenx2 =1, x1 =0,
√
29. x1 ≈ 105, x2 ≈ 10−5
28. |x | <
30. Not much. Expect to compute b2 − 4ac in double precision.
7c. x1 ≈ −7.233, x2 ≈ 1.133,
x3 ≈ 2.433, x4 = 4.5 Computer Exercises 2.1
6. z = [2i, i, i, i ]T , λ = 1 + 5i ;
z = [1, 2, 1, 1]T , λ = 2 + 6i ; z=[−i,−i,0,−i]T,λ= −3−7i; z = [1, 1, 1, 0]T , λ = −4 − 8i
21. Solvethese:UTy=b, LTx=y
23a. x1=5, x2=2, x3=1×10−9
999
Computer Exercises 2.2
2. [3.4606, 1.5610, −2.9342, −0.4301]T 3. [6.7831, 3.5914, −6.4451, −1.5179]T
4. 2≦n≦10,xi ≈1foralli; forlargen,manyxi ≠ 1 5. bi=n2+2(i−1) 6.x2=1,xi=0fori≠2
7a. (3.75, 90◦); (3.27, −65.7◦);
7b. (2.5, −90◦); (2.08, 56.3◦);
Exercises 2.2
(0.775, 172.9◦) (1.55, −60.2◦)
1/2 5/2 −4 −1
1/4 −1/2 −5/19 −62/19
1. 3/4 9/10 38/5 9/10
4104
2. x = [1/3, 3, 1/3]T
103 0 0 1 3−2
3. 013 − 1 ⇒ 0 1 3 − 1
3−30 6 3−306 0 2 4 −6 0 2 4 −6
0 1 3 −2
0001 ⇒3−306
Exercises 2.3
2a. 5n−4 3. n+2nk−k(k+1)
7. D−1AD=Tridiagonal±√a c i−1 i−1
Computer Exercises 2.3
6. Yes,itdoes. , d, ±√ac
0 0 −2 −2
x1 ←b1/a11
n−1
1/4 5/2 7/4 1/2 4212 5. 1/2 0 5/9 17/9
11a.
x i ← b i − a i j x j a i i
( 2 ≦ i ≦ n )
3.
di ←di −1/di−1 xn ← bn
(2≦i≦n) (n−1≧ j≧1)
bi ←bi −bi−1/di−1
xi ←(bi −xi+1)/xi
4. x1=1
x =1−(4x )−1 (2≦i≦100)
i i−1
j=1 c←c/d
i i i
1/4 3/5 27/10 1/5
6. l = (1, 3, 2), the second pivot row is the third row.
12.
i ii bi ←bi/di
8. x3 =−1, x2 =1, x1 =0
10. x4=−1, x3=0, x2=2, x1=1 13b. x3 =1, x2 =1, x1 =1
13d. x1 ≈ 4.267, x2 ≈ −4.133,
(1≦i≦n−1) bi ←bi−cibi+1 (1=n−1,...,1)
17. 18. 19.
n(n+1)
29(n2 − 1) + 7 n(n − 1)(2n − 1)10−6 seconds
10 30
n 10 102 103 104
1. 4.
9.
0.61906; 1.51213
−π −δ,0, π +ε, 3π +ε, 5π +ε,...,
4444
where δ ≈ 0.2 and ε starts at approximately
x3 ≈ −2.467
Exercises 3.1
d←d−ac i+1 i+1 i+1i
bi+1 ←bi+1−ai+1bi bn ← bn/dn
Time 1 ×10−3sec. 1 sec. 33
Cost 0.005 cents 5 cents
5.56min. $46.30
3.86days $46,296.30
0.4 and decreases.
0,±π,±π,±3π,±2π,... 10. x =0 22
12. If the original interval is of width h, then after, say, k steps, we have reduced the interval
27. x = (m − 1)xm + R/mxm−1; n+1 n n
containing the root to width h 2−k . From then we add one bit at each step. About three steps are needed for each decimal digit.
17. 20 steps
18b. False,becauseifrisclosetobn,then
r − an ≈ bn − an = 2−n(b0 − a0). 18d. True, because 0 ≦ r − an (obvious) and
r − an ≦ bn − an = 2−n(b0 − a0).
31. xn+1 = xn − f(xn)f′(xn)
[ f ′(xn)]2 − f (xn) f ′′(xn)
32. xn+1=xn− f′(xn) f′′(xn)
′2 ′′
+ [f (xn)] −2f(xn)f (xn)
f′′(xn)
f (m+1)(ηn) f (m+1)(ξn)
on,
xn+1=xn (m+1)R−xnm /(mR) 29. Diverges.
19a. False, in some cases. 19e. 21. n≧24−m. 23. No; No.
Computer Exercises 3.1
True.
m! − (m + 1)(m − 1)!
10. 1,2,3,3−2i,3+2i,5+5i,5−5i,16 11. 2.365
Exercises 3.2
35. en+1=en2 f(m)(r)+enf(m+1)(ηn) (m − 1)! m!
36. en+1=1en2f′′ 2g
37. |g′(r)|<1if0<ω<2 41. 4thorder
Computer Exercises 3.2
1
3. xn+1 = 2[xn + 1/(Rxn)]
4. 0.32796 77853 31818 36223 77546
5. 2.09455 14815 42326 59148 23865 40579 8. 1.83928 67552 9. 0.47033 169
10a. 1.8954942670340
10b. 1.9926668631307
10c. 0.51097342938857
10d. 2.5828014730552
14. 3.13108; 3.15145
4. 0.79; 7.y=x+1− 9.π
1.6
√2 √2 π 224
11. xn+1 = 2xn xn2 R + 1; −0.49985
12a. Yes, −√3 R. 13a. xn+1 = 12xn + R/xn2
3 13c. xn+1=xn xn3+2R 2xn3+R
13e. xn+1 = xn 4R − xn3 3R
(two nearby roots)
= R2x6+12Rx3+1 n+1 xn2 n n
13g. x
15.x=1 17.x =−1
Exercises 3.3
1. 2.7385 3. −3 2
4. ln2
1 2 n+1 2 19. |x0| < √3
9. xn+1=xn− xn2−R xn + xn−1
xn−x0 ′ 12. en+1 = 1− f(xn)− f(x0) f (ξn) en
13a. Linearconvergence
13c. Quadraticconvergence
15. Show|ξ −xn+1|≦c|ξ −xn|. 16. 2 17. x = 4.510187
21. Newton’s method cycles if x0 ≠ 0.
22. x←R
for n = 1 to n max
x ← (2x + Rx2)/3 end for
√
Computer Exercises 3.3
1. −0.45896; 3.73308 6a. 1.53209 6b. 1.23618 7. 1.3880 81078 21373
9. 20.80485 4
Exercises 4.1
1
40. Divisions: 2n (n − 1);
Additions/subtractions: n(n − 1) 42. 0
45. False, only unique for polynomial $p$ of degree $\leqq n - 1$.
Computer Exercises 4.1
p(x) = 2 + 46(x − 1) + 89(x − 1)(x − 2) + 6(x − 1)(x − 2)(x − 3)
+ 4(x − 1)(x − 2)(x − 3)(x + 4)
Exercises 4.2
1. 3.
$p_3(x) = 7 - 2x + x^3$
$\ell_2(x) = -(x - 4)(x^2 - 1)/8$
1.
7a. p3(x)=2+(x+1)−3+
(x − 1)2 − (x − 3) 11
24
(x − 2.5)3 + (x − 3)11 46
9a.0 1
8 193
14
223 0
1.25×10−5
16. Yes.
8. p4(x) = −1 + (x − 1)2 + (x − 2)1 + 38
1. f[x0,x1,x2,x3,x4] = 0 6. 7. Errors: 8.1 × 10−6 , 6.1 × 10−6 8. 497 table entries
9. 4.105 × 10−14 (Thm 1);
1.1905 × 10−23 (Thm 2)
10. 2.6×10−6 n−1
13. n≧7 14. |x−xi|≦hn(2n)!
1
7
4 12 83
6 259
9b. f (4.2) = 104.488
1 i=0
10. On [−1, 1], the interpolation error
with the Chebyshev nodes minimize: ||w||∞ = max−1 ≦ x ≦ 1 |w(x)|.
So they make the error formula most favorable.
Exercises 4.3
1. −hf′′(ξ) 2. Error term = −hf′′(ξ) for ξ ∈ (0, 2h)
4. No such formula exists.
6. The point ξ for the first Taylor series is such that ξ ∈ (x, x + h), while the second is such that ξ ∈ (x − h, x). They are not the same.
2 2 ′′′ h2 (5) h2 (6) 8a. −3hf(ξ) 9a.−4f (ξ)9b.−6f (ξ)
35
22nn! Computer Exercises 4.2
93
12. q(x)=x4 −x3+x2−x+1
(x +2)(x +1)(x)(x −1)(x −2)
w(x) the infinity norm of w on [−1, 1]
−
13a. x3−3x2+2x−1
31 120
f(n)(ξ) n!
f (x) − p(x) =
where w(x) = (x − x0)(x − x1)···(x − xn)
16. 19.
14. p(x)=x−2.5
α0 = 1 18. 2+x−1+(x −1)1−(x −3)x
2
p4(x) = − 1 + 2(x + 2) − (x + 2)(x + 1)
+ (x + 2)(x + 1)x; p2(x) = 1 + 2(x + 1)x
22.
25. 1.5727; No advantage
27. 39.
32 p(x)=−5x3−5x2+1
0.85527; 0.87006
28. 0.38099; 0.077848
p(x) = 0.76(x − 1.73)(x − 1.82)(x − 5.22)(x − 8.26)
11.
12.
13. 16. 20.
$\alpha = 1$, error term $= -\frac{h^2}{6} f'''(\xi)$;
$\alpha \neq 1$, error term $= -(\alpha - 1)\frac{h}{2} f''(\xi)$
h2 ′′′ 1 (4) Errorterm=− 6 f (ξ1)+ 2 f (ξ2)
with partition {0, 1}
b
a 2
f(a+ih)+E
L ≈2φh−φ(h) where E = 1(b−a)hf′(ξ)forξ ∈(a,b)
29. $f(x) = x^n$ $(n > 3)$ on $[0, 1]$,
30.
31. n≧1155
f(x)dx≦T 34. −1(b−a)hf′(ξ)
forsomeξi ∈(x−h,x+h). b2
p′ x0+x1 = f(x1)−f(x0) 35. f(x)dx=h
2 x1 − x0 a i=0
n
22
h2 h L≈ φ −φ(h)φ
Computer Exercises 5.1
2a. 2 2b. 1.71828 2c. 0.43882
9. 0.94598 385; 0.94723 395 11a. 4.84422
Exercises 5.2
h
23
2φh−φ(h)−φh 23
Computer Exercises 4.3
3. 0.20211 58503
Exercises 5.1
1. 13 3. − 136 5. 4.267 15
8. R(1,1)= 31 3
7. Not well. {f(−h)+4f(0)+ f(h)}
8. ≈ 0.70833 9. T ( f ; P ) = 0.775; x 2 + 1 ≈ 0.7854;
is Simpson’s rule. 1+2m−1
2h R(2, 2) = 7 f (a) + 32 f (a + h)
45
+ 12 f (a + 2h) + 32 f (a + 3h) + 7 f (b)
X = (27v − u)/26
4096 h 1344 h
Z=2835f 8 −2835f 4
+ 84 fh− 1 f(h)
2835 2 2835 xn+1 + n3(xn+1 − xn)/(3n2 + 3n + 1)
$|I - R(n, m)| = O(h^{2m})$ as $h \to 0$
$R(n+1, m+1) = R(n+1, m) + [R(n+1, m) - R(n, m)]/(8^m - 1)$
$\int_a^b f(x)\,dx - R(n, 0) \approx c\,4^{-(n+1)}$. Let $m = 1$ and let $n \to \infty$ in Formula (2).
1.
72
3. 0 00010 00025 0006
Error = 0.0104
13. h 2 1 1/2 1/4
T2111 14. n ≧ 16 07439; Too small.
15. T = 1 1(n−1)(2n−1)n+ 1 n3 6 2n
19. 0.000025 20. T ( f ; P) ≈ 4.37132
21. T(f;P)≈0.43056
22. | error term | ≦ 0.3104
23. T( f ; P) = 7.125; No, they cannot be computed from the given data.
24a. $-\frac{1}{2}(b - a)h f'(\xi)$ for some $\xi \in (a, b)$.
24b. $-\frac{1}{6}(b - a)h^2 f''(\xi)$ for some $\xi \in (a, b)$.
25a. $\frac{1}{24}h^3 f''(\xi)$ 25b. $\frac{1}{24}\sum_{i=1}^{n} h_i^3 f''(\xi_i)$
25c. $\frac{1}{24}(b - a)h^2 f''(\xi)$
23. Show
b a
1 dx 0
10. 13.
14. 15.
17. 18.
22.
24. 27.
E = A2m(2π)
2π2m 2m
4 [±4 cos(4ξ)]
± (2π)2m+142m+1 A2m cos(4ξ)
Computer Exercises 5.2
Computer Exercises 5.4
2a. 1.4183 8a. 2.03480 53185 77
1. R(7, 7) = 0.49996 9819 5. R(5, 0) = 1.81379 9364
6.
2 9
= 0.22222 . . .
8b. 0.89297 95115 69 Exercises 6.1
8c. 0.43398 771
7. 0.62135 732 9. 0.61748 11. R(7, 7) = 0.76519 7687
Exercises 5.3
1. 6.
Yes
In Exercise 6.1.5, the bracketed expression
is f ′(ξ1) − f ′(ξ2) and in magnitude does not exceed 2C.
1st-degree spline functions having knots t0 , t1 , . . . , tn .
1. $\pi/4$ 2a. h < 0.03 or n > 33.97.
2b. 3a.
4.
7. 8.
Computer Exercises 5.3
1. 3.1416; 3.1416
2. 0.1996
h < 0.15 or n > 7.5.
9.
2dx −4 10. n f (ti )Si is a linear combination of
7.1667 1x
3b. 7.0833 3c. 7.0777
= 0.6933; Bound is 5.2 × 10 .
Knots ≈ 1.57 × 1010. i=1
b
a iijj
Exercises 5.4
Q0(x) = −(x + 1)2 + 2 Q 1 ( x ) = − 2 x + 1
16 1 Hence, it is also such a function. Its value at t j is
f(x)dx = 15S2(n−1) − 15Sn−1 −3h5f(4)(ξ)
80
n f (t )S (t ) = f (t ). i=1
12.
1
3 17.
A = (b − a), B = 1(b − a)2 2
5h 2h h
f (a) + f (a + h) − f (a + 2h)
123 12
578 α= , a=c= , b=
7 25 75 h h3
w1 =w2 = 2, w3 =w4 =−24 h3
Si(x) = 0 if x < ti−1 or x > ti+1.
On (ti−1, ti ), Si (x) is given by (x − ti−1)/(ti − ti−1). On (ti , ti+1), Si (x) is given by (x − ti+1)/(ti − ti+1). S0 and Sn are slightly different.
If S is piecewise quadratic, then clearly S′ is piecewise linear. If S is a quadratic spline then S ∈ C1. Hence, S′ ∈ C. Hence, S′ is piecewise linear and continuous.
1. ≈ 0.91949 4a.
12 1 Q2(x)=8 x−2 −2 x−2
x = ±
x = ±0.861136, ±0.339981
Q3(x) = −5(x − 1)2 + 6(x − 1) + 1
4b.
5. α=γ=4, β=−2
Q4(x) = 12(x − 2)2 − 4(x − 2) + 2 The answer is given by Equation (8).
6. 7.
9. 10.
Exercises 6.2
1. No 2. No
4. a=−4, b=−6, c=−3, d=−1, e=−3
33
19.
20a. Yes 20b. No 20c. No 21. Yes
5. a=−5, b=−26, c=−27, d=27 11.A=2h, B=0, C=3 2
12.
13. 14.
848 A=3, B=−3, C=3
Yes. Exact for polynomials of degree ≦ 3. hh4
A= , D=0, C= , B= h 333
True for n≦3
6. No
7a. S(x) is not continuous at x = −1. S′′(x) is not continuous at x = −1, 1.
8a. (m+1)n
8c. (m−1)(n−1)
8b. 2n 8d.m−1
10. $S(x) = \begin{cases} x^2, & [0, 1] \\ 1 + 2(x - 1) + (x - 1)^2 + (x - 1)^3, & [1, 2] \\ 5 + 7(x - 2) + 4(x - 2)^2, & [2, 3] \end{cases}$
24. Use Equation (14) with all A’s zero except $A_j = 1$. Next, take all A’s zero except $A_{j+1} = 1$.
28. No 30. Let Ci2 = ti+1ti+2, then Ci1 = xti+1,
and Ci0 = x2.
32. $B_i^k(t_j) = 0$ iff $t_j \geqq t_{i+k+1}$ or $t_j \leqq t_i$
33. $x = (t_{i+3}t_{i+2} - t_i t_{i+1})/(t_{i+3} + t_{i+2} - t_{i+1} - t_i)$
Computer Exercises 6.3
7. 47040
Exercises 7.1
12. a = 3, b = 3, c = 1 13. No 15. a = −1, b = 3, c = −2, d = 2 17. n + 3 19. f is not a cubic spline 22. $p_3(x) = x - 0.0175x^2 + 0.1927x^3$; 26. S is linear.
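32. $S_0(x) = -\frac{5}{7}(x - 1)^3 + \frac{12}{7}(x - 1)$,
No
$S_1(x) = \frac{6}{7}(x - 2)^3 - \frac{5}{7}(3 - x)^3 - \frac{6}{7}(x - 2) + \frac{12}{7}(3 - x)$,
$S_2(x) = -\frac{5}{7}(x - 3)^3 + \frac{6}{7}(4 - x)^3 + \frac{12}{7}(x - 3) - \frac{6}{7}(4 - x)$,
$S_3(x) = -\frac{5}{7}(5 - x)^3 + \frac{12}{7}(5 - x)$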
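One way to sanity-check the four pieces in answer 32 is to confirm that values and first and second derivatives agree at the interior knots. The sketch below does this in exact rational arithmetic. It assumes the knots are 1, 2, 3, 4, 5 and that every coefficient is a multiple of 1/7 (both read off the printed pieces); the term encoding and function names are hypothetical.

```python
from fractions import Fraction as Fr

# A term (c, s, a, p) stands for c * (s * (x - a))**p with s = +1 or -1,
# so s = -1 encodes powers of (a - x).
PIECES = [
    [(Fr(-5, 7), 1, 1, 3), (Fr(12, 7), 1, 1, 1)],            # S0 on [1, 2]
    [(Fr(6, 7), 1, 2, 3), (Fr(-5, 7), -1, 3, 3),
     (Fr(-6, 7), 1, 2, 1), (Fr(12, 7), -1, 3, 1)],           # S1 on [2, 3]
    [(Fr(-5, 7), 1, 3, 3), (Fr(6, 7), -1, 4, 3),
     (Fr(12, 7), 1, 3, 1), (Fr(-6, 7), -1, 4, 1)],           # S2 on [3, 4]
    [(Fr(-5, 7), -1, 5, 3), (Fr(12, 7), -1, 5, 1)],          # S3 on [4, 5]
]


def value(piece, x):
    """Evaluate a piece at x."""
    return sum(c * (s * (x - a)) ** p for (c, s, a, p) in piece)


def deriv(piece):
    """Differentiate term by term; constant terms drop out."""
    return [(c * p * s, s, a, p - 1) for (c, s, a, p) in piece if p > 0]


def is_c2(pieces, first_knot=1):
    """Check continuity of S, S', S'' at each interior knot."""
    for k in range(len(pieces) - 1):
        t = first_knot + k + 1
        left, right = pieces[k], pieces[k + 1]
        for _ in range(3):
            if value(left, t) != value(right, t):
                return False
            left, right = deriv(left), deriv(right)
    return True


if __name__ == "__main__":
    print(is_c2(PIECES))
```

The same helpers can be used to confirm the natural end conditions by evaluating the second derivative of the first and last pieces at the end knots.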
33. The conditions on S make it an even function. If $S(x) = S_0(x)$ in $[-1, 0]$ and $S(x) = S_1(x)$ in $[0, 1]$, then $S_1(0) = 1$, $S_1'(0) = 0$, $S_1''(1) = 0$, and $S_1(1) = 0$.
1a.
1 4 7 3 2 3/2 t x=4t +3t −3t +c 1b. x=ce
An easy calculation yields $S_1(x) = 1 - \frac{3}{2}x^2 + \frac{1}{2}x^3$.
38. 5n, n + 4
Exercises 6.3
1e. 2a.
t −t x=c1e +c2e or
x=c1cosht+c2sinht
t2n+1
+c
(2n + 1)(2n + 1)!
n=0
22
∞ n (2n − 1)! 2n 4. x = a0 + a0 (−1) n−1 t
x=1t3+3t4/3+7 22 34
39. Yes
Chebyshev polynomials recurrence relation.
∞
(x − ti )2
3c. x= (−1)n
2.
3.
5.
15. 16. 19. 20.
3d. x=e−t/2 t2et/2dt+c
See Section 10.2.
(ti+2 − ti)(ti+1 − ti) ,
[ti,ti+1]
n=1 2 (2n)! n!2n
(x−ti)(ti+2−x)
∞
+ (−1)n−1 t2n+1
i + 2 i i + 2 i + 1
n=1 (2n + 1)!
6. Let $p(t) = a_0 + a_1 t + a_2 t^2 + \cdots$ and determine $a_i$.
9. $t = 10$, Error $= 2.2 \times 10^4\varepsilon$; $t = 20$, Error $= 4.8 \times 10^8\varepsilon$
10. $x^{(iv)} = 18xx'x'' + 6(x')^3 + 3x^2x'''$
11a. $x' = x + e^x$;
$x'' = (1 + e^x)x'$;
+ (t − t )(t − t )
B i2 ( x ) = (ti+3 − x)(x − ti+1) (ti+3 − ti+1)(ti+2 − ti+1)
[ti+1, ti+2]
(t −x)2 i+3
,
, [t ,t ]
(ti+3 − ti+1)(ti+3 − ti+2) 0, elsewhere
i+2
i+3
∞ 0
i=−∞f(ti)Bi(x) 14. n−k≦i≦m−1
Use induction on k and Bk+i (x) = 0 on [ti , ti+1]. i+i
No 17. No ∞ 1
i=−∞ti+1Bi (x)
In Equation (9), take all ci = 1. Then di = 0.
Hence, d n Bik(x) = 0 and n Bik(x) dx i=1 i=1
are constants.
12.
$x''' = (1 + e^x)x'' + e^x(x')^2$;
$x^{(iv)} = (1 + e^x)x''' + 3e^x x'x'' + e^x(x')^3$.
x(0.1) = 1.21633
14. n ← 20
    s ← x(n)
    for i = 1 to n − 1
        s ← x(n−i) + [h/real(n + 1 − i)]s
    end for
    s ← x + h[s]
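The pseudocode in answer 14 evaluates a degree-$n$ Taylor polynomial in nested form, $x + h[x' + \tfrac{h}{2}[x'' + \cdots + \tfrac{h}{n}x^{(n)}]]$, where `x(k)` denotes the $k$th derivative. A Python rendering of the same loop might look like the following; the function name and the list-based argument are assumptions made for illustration.

```python
import math


def taylor_step(x, derivs, h):
    """Nested evaluation of a Taylor polynomial for x(t + h).

    derivs[k - 1] holds the k-th derivative of x at t, for k = 1..n.
    """
    n = len(derivs)
    s = derivs[n - 1]                      # s <- x^(n)
    for i in range(1, n):
        # s <- x^(n-i) + [h / (n + 1 - i)] * s
        s = derivs[n - i - 1] + (h / (n + 1 - i)) * s
    return x + h * s


if __name__ == "__main__":
    # For x' = x with x(t) = 1, every derivative equals 1, so the
    # result should approximate e^h.
    print(taylor_step(1.0, [1.0] * 20, 0.5), math.exp(0.5))
```

The nested form needs no factorials and no explicit powers of $h$, which is the reason for the descending loop.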
Computer Exercises 7.1
1. x (2.77) = 385.79118
2b. x (1.75) = 0.63299 9983
2c. x (5) = −0.20873 51554
3. x (10) = 22026.47
4a. Error at t = 1 is $1.8 \times 10^{-10}$. 5. x (0) = 0.03245 34427
7. x (1) = 1.64872 12691
9. x (0) = 1.67984 09205 × 10−3
10. x (0) = −3.75940 73450
Exercises 7.2
2c. f(t,x)=+ x/1−t2 3. x(−0.2)=1.92
5a. real function f(t, x)
      real t, x
      f ← t^2/(1 − t + 2x)
    end function f
17. Taylor series of f (x, y) = g(x) + h(y) about (a, b) is equal to the Taylor series of g(x) about a plus
that of h(y) about b.
18. f(1+h,k)≈−3h+3h2+k2 2
19. $e^{1 - xy} \approx 3 - x - y$
20. $A = 1 + k + \frac{1}{2}k^2$, $B = h(1 + k)$
21. $A = 1$, $B = h - k$, $C = (h - k)^2$
22. f (x + h, y + k) ≈
1 + 2xh + k + (1 + 2×2)h2 + 2hkx + 1 k2 f ;
f (0.001, 0.998) ≈ 2.71285 34
Computer Exercises 7.2
2. x (1) = 1.5708
3b. n=7; x(2) = 0.82356 78972 (RK),
x (2) = 0.82356 78970 (TS)
3c. n=7; x(2) = −0.49999 99998 (RK), x (2) = −0.50000 00012 (TS)
2
8. Solvedf =e−x2, f(0)=0. dx 3
4. x(1) = 0.60653 = x(3) 6. x(0) = 1.0 = x(1.6)
9. x (10) = 1.344 × 1043
Exercises 7.3
5. x(3) = 1.5 8. x(1) = 3.95249
10. h3 1−α D2f+h fxDfwhere 646
$D = \dfrac{\partial}{\partial t} + f\dfrac{\partial}{\partial x}$
Let $h = 1/n$. Then $x(1) = e^{-1}$ (true solution) and
$x_n = \{[1 - 1/(2n)]/[1 + 1/(2n)]\}^n$ (approximate solution)
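The closed form above is what one obtains by applying the implicit trapezoid rule with step $h = 1/n$ to $x' = -x$, $x(0) = 1$: each step multiplies $x$ by $(1 - h/2)/(1 + h/2)$. The test problem and the method are inferred from the formula and are not stated in the answer itself. A short Python sketch, with hypothetical function names, compares the stepped value, the closed form, and $e^{-1}$:

```python
import math


def closed_form(n):
    """x_n = {[1 - 1/(2n)] / [1 + 1/(2n)]}^n from the answer."""
    return ((1 - 1 / (2 * n)) / (1 + 1 / (2 * n))) ** n


def trapezoid_steps(n):
    """March the trapezoid rule for x' = -x from t = 0 to t = 1."""
    h = 1.0 / n
    x = 1.0
    for _ in range(n):
        # x_{k+1} = x_k + (h/2)(-x_k - x_{k+1}), solved for x_{k+1}
        x = x * (1 - h / 2) / (1 + h / 2)
    return x


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, closed_form(n), abs(closed_form(n) - math.exp(-1)))
```

Halving $h$ should reduce the error by roughly a factor of four, as expected for a second-order method.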
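$x(t + h) = x(t - h) + \frac{h}{3}[f(t - h, x(t - h)) + 4f(t, x(t)) + f(t + h, x(t + h))]$
$a = \frac{24}{13}$, $b = -\frac{11}{13}$, $c = \frac{2}{13}$, $d = \frac{10}{13}$, $e = -\frac{2}{39}h^2$
$a = 1$, $b = c = \frac{h}{2}$; error term is $O(h^3)$.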
Computer Exercises 7.3
5. x(1) = 2.25 6. x(−1) = −4.5 2 2
1. $D^2 = \dfrac{\partial^2}{\partial t^2} + 2f\dfrac{\partial^2}{\partial x\,\partial t} + f^2\dfrac{\partial^2}{\partial x^2}$
11. h = 1/1024 2.
12. Let’s make local truncation error $\leqq 10^{-13}$. Thus, $100h^5 \leqq 10^{-13}$ or $h \leqq 10^{-3}$. So take $h = 10^{-3}$ and hope that the three extra digits are enough to preserve 10-digit precision.
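14b. $x^{(iv)} = D^3 f + f_x D^2 f + 3Df_x\,Df + f_x^2\,Df$ where
$D^3 = \dfrac{\partial^3}{\partial t^3} + 3f\dfrac{\partial^3}{\partial x\,\partial t^2} + 3f^2\dfrac{\partial^3}{\partial t\,\partial x^2} + f^3\dfrac{\partial^3}{\partial x^3}$
4.
15. $\dfrac{\partial}{\partial s}x(9, s) = e^{252} \approx 10^{109}$
5. $f(x + th, y + tk) = f(x, y) + t[f_1(x, y)h + f_2(x, y)k] + \frac{1}{2}t^2[f_{11}(x, y)h^2 + 2f_{12}(x, y)hk + f_{22}(x, y)k^2] + \cdots$
Now let t = 1 to get the usual form of Taylor’s series in two variables.
8.
9a. All t. 9c. Positive t. 9e. No t. 11. Divergent for all t.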
9. y(e) = −6.3890560989 where y(x) = [1 − ln v(x)]v(x)
12. 0.21938 39244 13. 0.99530 87432 15. Si(1) = 0.94608 30703
13. Let $x_0 = t$, $x_1 = x$, $x_2 = y$, $x_3 = x'$, $x_4 = y'$.
X(0) = [0, 1, 3, 2, 4]T x(t+h)=x1+1h2+ 1h4+yh+1h3+ 1 h5ComputerExercises7.4
Exercises 7.4
1.
2 24 6 120
2. Since system is not coupled, solve two separate problems.
3. System is not coupled so each differential equation can be solved individually by the program.
1
4. X′=x12+logx2+x02 , ex2 −cosx1+sin(x0x1)−(x1x2)7
X(0) = [0, 1, 3]T
′x2 T
5. X = x3 , X(0)=[1,−3,5] jj
1
′ x3 ThenX =x4 ,
x1 +x2 −2×3 +3×4 +logx0 2×1 −3×2 +5×3 +x0x2 −sinx0
2 24 6 120 y(t+h)=y 1+1h2+ 1h4 +x h+1h3+ 1 h5
1. x(1)=2.4686939399, y(1)=1.2873552872
2. x(0.38) = 1.90723 × 1012, y(0.38) = −8.28807 × 104
4. x(−1) = 3.36788, y(−1) = 2.36788
π π π π
5. x12=x42=0,×22=1,×32=−1
2×2 +logx3 +cosx1
7. Solve each equation separately since they are not
coupled.
x2 −3/2 0.5
8. X′=−x1 x12+x32 , X(0)=0.75
Exercises 8.1
x42 + cos(x2x3) − sin(x0x1) + log(x1/x0)
X(0) = [0,1,3,4,5]T 0 1 0 0 0
3a.M=0−2100, x4 3 0 0−2 1 0
x5 3 −4 0 0 0 1
10. X′=x6 , X(1)=2
25 0 U= 0 0 0
3
4b. A=2
1
0 0 0 1 27 4 3 2 0 50 −6 −4
0000 0 0 0 20
2 1
2 1
11
2x1x3x4 +3x12x2t2 ex2x5+4x1t2x3
−79/12
2 2tx6+2tex1x3 3
11a. Letx1 =x,x2 =x′,x3 =x′′. x2
ThenX′= x3
−x3 sin x1 − t x2 − x3
x
12. X′ = 2 , X(0) = [0, 1]T
x2 − x1
7. x(6) = 4.39411,
y(6) = 3.10378
Exercises 7.5
1. $x_j(t) = e^{\lambda_j t} x_j(0)$
100303 1a. L= 0 1 0, U=0 −1 3
1/3 −3 1 x4−3/20.25
−x3 x12+x32 1.0 1 0 0 0 1 0 0 2 1 2a.M=0 1 0 0, U=0 3 0 0 x2 0 −3 1 0 0 0 4 0
9. X′=x3 , −5 6 −2 1 0 0 0 0 x4 1 0 0 0 0
0 0 8
1000 1001
5a.M=0 1 0 0, 8. U = 0 1 0 −2 ,
40 6b. D=0 15/4
0 56/15
0 0 ,
15 −8 l11 l11u12
00 l22u23 0 l32u23 + l33 l33u34
0 −x/b 1 0 0014
−w/a (xy)/(bc) −y/c 1
a00z U = 0 b 0 0
0 0 0 −8
1000
L = 1 1 0 0 −1 1 1 0
1 −1 1 1
100 200 9a. L= 1 1 0 , D= 0 −2 0 ,
00c0
0 0 0 d−(wz)/a
a00 0 ′0b00
5b.L=0xc 0, 0 0 y d−(wz)/a
3−11 1 −1/2 U′ = 0 1 00
003 1
−1/2 1
1 0 0 z/a U′=0100
0010 0001
0 −4/15 −2/7 1
x = [−1, 2, 1]T
1 0 0 −2 0 0
1 0
6a. L=−1/4 1 0 0,
9b.
10a.L= 2 1 0,D= 0 1 0,
0 0 −1/4 −1/15 1 0
−1 3 1 −1/2 U′ = 0 1 00
1 00−1 1
1 1
−5 −7 10 11 51
4 −1 U=0 15/4 00 00
−1 −1/4 56/15 0
0
0 −1
−16/15 24/7
x = [−1, 1, 1]T 11
00 00
0 0 24/7
14a. l21 l21u12 +l22 0 l32
00
1 0 0 −1 16a.X−1=11−11
10b. 12.A−1=1−13
1 −1/4 −1/4 0 U′ =0 1 −1/15 −4/15 0 0 1−2/7
l43
l43u34 + l44
0001
4000 0001
6c. L′=−1 15/4 0 0 −1 −1/4 56/15 0
−1 0 1 −1
0 −1 −1 1
16b.X−1=−10−11 −1 −1 0 1
2√6/7 536 −668 458 −186
0 −1 −16/15 24/7
1 1 1 −1 0 ComputerExercises8.1
2 √ 0
0
−1/2 (1/2) 15 6d. L′′ =−1/2 −1/2√15
0 2√14/15 −(4/7)√14/15
0
0 3. Case4:
0 −2/ √15
p5( A) = −668 994 −854
458 −854 994 −668
6e. 192
458 −186 458 −668 536
Exercises 8.2
1. Yes.
3. Basis vectors mapping: (1, 0) → (cos θ, sin θ). Counterclockwise rotations through the angle θ.
7. Eigenvalues: a. 9.8393, 3.0804 ± 1.3763i
c. 0.4660 − 1.4971i, 2.6112 + 0.3313i, −0.0773 + 2.1658i
3. $a = (1 + 2e)/(1 + 2e^2)$, $b = 1$
5. a=2.1, b=0.9
m m
7. c= yklogxk (logxk)2 k=0 k=0
11. φ involves the sum of m + 1 polynomials of degree two in c, which is either concave upward or a constant. Thus, no maximum exists, only a minimum.
9. c. 11. d. 13. True.
Computer Exercises 8.2
√
1a. Eigenvalues: −2 ± Eigenvectors:[1,9−(3±√23)/7]T or [1,−1.11369]T and[1,0.25655]T.
1d. n = 5: Eigenvalues: 0.26795, 1, 2, 3, 3.73205. 7. Eigenvalues/Eigenvectors:
a. 5.2426,(0.6656,0.7463);−3.2426,(−0.3051,0.9523)
b. 3.9893±5.5601i,(0.7267,−0.0680±0.4533i, −0.3395 ∓ 0.3829i ); 0.0214, (0.7916, 0.5137, 0.3308) 11. Eigenvalues/Eigenvectors: 1, (−1, 1, 0, 0); 2, (0, 0, −1, 1);
5, (−1, 1, 2, 2)
Exercises 8.3
1. a.
Computer Exercises 8.3
4. Eigenvalues are −5, 7, and 3.
Exercises 8.4
1. Jacobi and Gauss-Seidel converge because A is diagonally dominant; SOR converges because A is symmetric and positive definite.
3. d. 5. e. 9. b.
Computer Exercises 8.4
3. Iterations: Jacobi 77, Gauss-Seidel 38, and SOR 12. 7. Both converge to approximately the true solution x = −4/11 and y = 6/11.
Exercises 9.1
f (x) = w0 − (1 + 2x)w1
1. 2.
y(x) = 1
f(x)=
y x13 − y x13 17.α= 12 21 .
2. b.
3. Since cos(n − 2)θ = cos[(n − 1)θ − θ] =
cos(n − 1)θ cosθ + sin(n − 1)θ sinθ,
we have 2 cos θ cos(n − 1)θ − cos(n − 2)θ =
cos(n − 1)θ cos θ − sin(n − 1)θ sin θ = cos(nθ). Note:
If $g_n(\theta) = \cos n\theta$, then $g_n(\theta) = 2\cos\theta\, g_{n-1}(\theta) - g_{n-2}(\theta)$.
5. By the previous problem, the recursive relation is the same as (2) so that $T_n(x) = f_n(x) = \cos(n \arccos x)$.
6. $T_n(T_m(x)) = \cos(n \arccos(\cos(m \arccos x))) = \cos(nm \arccos x) = T_{nm}(x)$
7. $|T_n(x)| = |\cos(n \arccos x)| \leqq 1$ for all $x \in [-1, 1]$ since $|\cos y| \leqq 1$, and for $\arccos x$ to exist, $x$ must satisfy $|x| \leqq 1$.
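Answers 5 through 7 can be checked numerically by comparing the three-term recurrence $T_n(x) = 2xT_{n-1}(x) - T_{n-2}(x)$ with the closed form $\cos(n \arccos x)$ on $[-1, 1]$. The Python sketch below is only an illustration; the function names are hypothetical.

```python
import math


def cheb_recurrence(n, x):
    """T_n(x) from T_0 = 1, T_1 = x, T_n = 2 x T_{n-1} - T_{n-2}."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t


def cheb_trig(n, x):
    """T_n(x) = cos(n arccos x), valid for |x| <= 1."""
    return math.cos(n * math.acos(x))


if __name__ == "__main__":
    x = 0.3
    # Composition property of answer 6 with n = 3, m = 2.
    print(cheb_recurrence(3, cheb_recurrence(2, x)), cheb_recurrence(6, x))
    # Boundedness property of answer 7.
    print(max(abs(cheb_recurrence(6, k / 50.0 - 1.0)) for k in range(101)))
```

The composition check works because $T_m(x)$ itself lies in $[-1, 1]$ whenever $x$ does, so the outer polynomial is evaluated inside its natural interval.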
8. $g_0(x) = 1$, $g_1(x) = (x + 1)/2$, $g_j(x) = (x + 1)g_{j-1}(x) - g_{j-2}(x)$ $(j \geqq 2)$
10. n + 2 multiplications, 2n + 1 additions/subtractions if 2x is computed as x + x
12. n multiplications, 2n additions/subtractions 13. $T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1$
$\frac{1}{m+1}\sum_{k=0}^{m} y_k = (y_0 + \cdots + y_m)/(m + 1)$, the average of the y values, which does not involve any $x_i$.
15. True.
23 = −6.79583, 2.79583.
k=0
13. y = (6x − 5)/10
16. a ≈ 2.5929, b ≈ −0.32583, c ≈ 0.022738
18. a=1, b=1 19. y(x)=2×2+29 3 735
12. $c = 10^{\alpha}$ where $\alpha = (m + 1)^{-1}\sum_{k=0}^{m}(y_k - \log x_k)$.
n+2 n+1
2. wk =ck +3xwk+1 +2wk+2
m k=0
e2xk
20. y=x+1
21. c=
exk f(xk)
Exercises 9.2 w=w=0
(k =n,n−1,…,0)
x12x12(x2 − x1) 1 2
α is very sensitive to perturbations in y1.
Computer Exercises 9.2
7. $a_{ij} = \begin{cases} 0 & (i \neq j) \\ m + 1 & (i = j = 1) \\ (m + 1)/2 & (i = j > 1) \end{cases}$
Exercises 9.3
2. Coefficient matrix for the normal equations has elements $a_{ij} = 1/(i + j - 1)$ by (5).
Computer Exercises 10.1
8. 13. 15.
32.5% 11. Sequence not periodic. 0123456789
97 93 97 107 90 115 88 101 113 99 5.6% 16. 200
3. c=0 4. y=bx 20
6. c=ln2 24
Exercises 10.2
1. m > 4 million
Computer Exercises 10.2
2. 1.71828 4. 8 5. 49.9
9. 1.11 10. 2.00034 6869 14.
Computer Exercises 10.3
9a. c=π3 9b. c=3 a + bx y
π02a(e2π−1)/2 16. 0 π/2 0 b=−2(e2π +1)/5
2 0 π/2 c (e2π +1)/5
8. x=−1, y=13
14. No. 15. y≈ 1 .Changeto1≈a+bx.
7. 0.518 0.635
n i=1
n i=1
(sinxi)2
1. 2 2. 0.898 4. 7 6. 1.05 3 16
17b. 8.3
17. c=3 20. c= Computer Exercises 9.3
1. a=2, b=3 Exercises 9.4
5. 7d.
2 1 ∞ n 4
x =3+ n=1(−1)n2π2cosnπx
f(x)= 4 ∞ 1sinnx π n=1,3,5,… n
7. 5.24 9. 0.996 12. 0.6394
14. 11.6 kilometers 15. 0.14758
17. 0.009 21. 24.2 revolutions 23. 0.6617
Exercises 11.1
2. c1=1−2e1−e2, 2 2
c2= 2e−e 1−e
3a. $x(t) = \dfrac{e^{\pi + t} - e^{\pi - t}}{e^{2\pi} - 1}$
3b. x(t) = t4 − 25t + 1212
4a. $x(t) = \beta\sin t + \alpha\cos t$ for all $(\alpha, \beta)$
4b. $x(t) = c_1\sin t + \alpha\cos t$ for all $\alpha + \beta = 0$ with $c_1$ arbitrary
6. φ(t)=z 7. φ(z)=z 8. φ(z)= 9+6z
9. φ(z)=e5 +e+ze4 −z2e2
10. Two ways: Use x′′(a) = z or x′(b) = z, x′′(b) = w.
11. $x(t) = -e^t + 2\ln(t + 1) + 3t$
14a. This is a linear problem. So two initial-value problems can be solved to obtain the solution. The two sets of initial values would be $\{x(0) = 0,\ x'(0) = 1\}$ and $\{x(0) = 1,\ x'(0) = 0\}$.
11a. √n a=√n rcos2πk/n)+isin(2πk/n) √√√√
yisinxi
11b. 4 2,i4 2,−i4 2,−4 2
17. Can reduce additions from 12 to 6 and
multiplications from 16 to 2.
Computer Exercises 9.4
√
6.f(x)=4A∞ 1sinnω0x π n=1,3,5,… n
∞
8a. f(x)=2− sin nπx
41 n=1 nπ 2
Exercises 10.1
1. l0 = 123456; x1 = .96621 2243;
x2 = .12917 3003; x3 = .01065 6910
15. Solution of x′′ = −x, x(0) = 1, x′(0) = z is x(t) = cos t + z sin t. So φ(z) = x(π) = −1.
5. $a = [1 + 2kh^{-2}(\cos \pi h - 1)]^{1/k}$
6. The right-hand side is changed by $b_1 + c_0$ in place of $b_1$ and $b_{n-1} + c_n$ replacing $b_{n-1}$ for both (5) and (7).
7. In (6), $b_1$ is replaced by $b_1 + g(t)$, $b_{n-1}$ by $b_{n-1} + g(t)$. At the level zero, $b_i = f(ih)$ for $1 \leqq i \leqq n - 1$.
k
8. u(x,t+k)= h2(1−h)u(x+h,t)
k h2 k
+h2 k +h−2 u(x,t)+h2u(x−h,t)
01
−1 0 1
Since φ is constant, we cannot get by any choice of z!
Computer Exercises 11.1
1. t        x
   0.00000  1.00000 00000 000
   0.11111  1.04596 43628 148
   0.22222  1.08421 37270 667
   0.33333  1.11366 39070 190
   0.44444  1.13294 09456 874
   0.55556  1.14031 38573 989
   0.66667  1.13362 66672 380
   0.77778  1.11023 44247 917
   0.88889  1.06694 88627 917
   1.00000  1.00000 00000 000
Exercises 11.2
φ(z) = 3
1. −1− hxi−1 +2(1+h2)x1 −1− hxi+1 = −h2t 22
9. A = …
… … −101
−2 2
1. −0.21
2. $u_{xx} = f''(x + at) + g''(x - at)$, $u_{tt} = a^2 f''(x + at) + a^2 g''(x - at) = a^2 u_{xx}$
3. u(x,t)= 1[F(x+t)−F(−x+t)]
2
+ 1G(x+t)−G(−x+t) 2
where G is the antiderivative of G
Computer Exercises 12.2
Exercises 12.2
2. x1 ≈ 0.29427, 4. x′(0)=5
x2 ≈ 0.57016, x3 ≈ 0.81040 8. −xi−1+2+(1+ti)h2xi
3
9. x(t) = [7/u(2)]u(t)
− xi+1 = 0
11. $x_1'' = -x_1$, $x_1(0) = 3$, $x_1'(0) = z_1$ implies $x_1 = A\cos t + B\sin t$, $3 = x_1(0) = A$, $x_1' = -A\sin t + B\cos t$. Let $z_1 = x_1'(0) = B$. So $x_1 = 3\cos t + z_1\sin t$, $x_2 = 3\cos t + z_2\sin t$. By Equation (10), $x = \lambda x_1 + (1 - \lambda)x_2$ and
$\lambda = [\beta - x_2(b)]/[x_1(b) - x_2(b)] = [7 - (-3)]/[(-3) - (-3)] = 10/0$
Computer Exercises 11.2
2a. x = 1/(1 + t) 2b. x = −log(1 + t)
1. real function fbar(x)
     real x, xbar
     xbar ← x + 2 real(integer(−(1 + x)/2))
     if xbar < 0 then
       fbar ← −f(−xbar)
     else
       fbar ← f(xbar)
     end if
   end function fbar
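The function `fbar` above appears to build the odd, period-2 extension of a function `f` given on $[0, 1]$: the argument is first shifted into $[-1, 1]$ and then reflected through the origin if negative. A Python sketch of the same logic follows. It assumes that `integer` truncates toward zero (as `math.trunc` does) and that the argument is at least $-1$; the wrapper name `make_fbar` is hypothetical.

```python
import math


def make_fbar(f):
    """Return the odd, 2-periodic extension of f (given on [0, 1])."""
    def fbar(x):
        # Shift x into [-1, 1]; trunc mirrors truncation toward zero.
        xbar = x + 2.0 * math.trunc(-(1.0 + x) / 2.0)
        if xbar < 0:
            return -f(-xbar)   # odd reflection
        return f(xbar)
    return fbar


if __name__ == "__main__":
    # sin(pi x) is already odd and 2-periodic, so fbar should reproduce it.
    fbar = make_fbar(lambda x: math.sin(math.pi * x))
    for x in (-0.5, 0.25, 1.5, 2.25, 4.75):
        print(x, fbar(x), math.sin(math.pi * x))
```

Using a function that is its own odd periodic extension, as in the example, gives a convenient self-check of the shifting logic.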
Exercises 12.1
1a. Elliptic. 1c. Parabolic.
1f. Hyperbolic.
Exercises 12.3
5. 20+ 2.5h ui+1,j +20− 2.5h ui−1,j xi +yj xi +yj
+ −30 + 0.5h ui, j+1 + −30 + 0.5h ui, j−1 yj yj
+ 20ui j = 69h2
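2. $\dfrac{1}{r}\dfrac{\partial}{\partial r}\left(r\dfrac{\partial u}{\partial r}\right) + \dfrac{1}{r^2}\dfrac{\partial^2 u}{\partial \theta^2} = 0$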
4. Equation (3) shows that u(x, t + k) is a convex
combination of values of u(x,t) in the interval [0,1]. So it remains in the interval.
6.
u0, 1 ≈ −8.932 × 10−3; u1, 1 ≈ 4.643 × 10−1 222
Exercises 13.2
7. $A = \begin{bmatrix} -4 & 1 & 1 & 0 \\ 1 & -4 & 0 & 1 \\ 1 & 0 & -4 & 1 \\ 0 & 1 & 1 & -4 \end{bmatrix}$
Computer Exercises 12.3
1a. 3. 6.
7b.
9a.
10.
12.
1 9 Yes 1b. No 2. 4,4
$F(x, y) = 1 + x - xy + \frac{1}{2}x^2 - \frac{1}{2}y^2 + \cdots$
The slope of the tangent is $\dfrac{dy}{dx} = -\dfrac{F_x}{F_y} \equiv m_1$. The gradient has direction numbers $F_x$ and $F_y$, and its slope is $\dfrac{F_y}{F_x} \equiv m_2$. The condition of perpendicularity $m_1 m_2 = -1$ is met.
5. 18.41◦ 13.75◦ 41.47◦ 36.60◦ 69.41◦ 66.77◦
Exercises 13.1
24.41◦ 61.05◦
53.01◦
51.00◦
F(2, 1, −2) = −15; F(2, 0, −2) = −12
9 9
F , = −20.25
88 Casen=2:
x =(3a+b)/4+δ, x =(a+3b)/4−δ,
F(0, 0, −2) = −8;
a≦x∗≦b′ a′≦x∗≦b
−2 5 2 9b. G(1,2,1) = 2
5
1.
2. 4.
5a. 7. 9.
10. 13.
14.
15b. Squarebothsidestoobtain 2√
G(1,0) =
Exact solution F(3) = −7. √√
A=α/ 5, A=−β/ 5
By (6), $y + rb = a + r^2(b - a) + rb = ar + b$ since $r^2 + r = 1$. Moreover, we have $r(y + rb) = a + r(b - a) = x$. Thus, we obtain $yr + r^2 b = x$ or $y + r^2(b - y) = x$.
Computer Exercises 13.2
1a. (1,1) 1c. (0,0,0,0) 1e. (1,1,1,1)
Exercises 14.1
G = 2yz2(1 + sin2 x) + 2(y + 1)(z + 3)2
191 −30,−5
2y2z2 sinx cosx
2y2z(1+sin2 x)+2(y+1)2(z+3)
$n \geqq 1 + (k + \log l - \log 2)/|\log r|$
11. Minimum point of F is a root of F′. Newton’s method to find a root of F′:
$x_{n+1} = x_n - F'(x_n)/F''(x_n)$
Formula does not involve F itself.
To find minimum of F, look for root of F′.
n≧48
Maximize:
−5x1 − 6x2 + 2x3 − 2 x 1 + 3 x 2
Secant method to find a root of F′ is:
Constraints:
Minimum value 1.5 at (1.5, 0).
$x_{n+1} = x_n - F'(x_n)\dfrac{x_n - x_{n-1}}{F'(x_n) - F'(x_{n-1})}$
Formula does not involve F.
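Both root-finding formulas for $F'$ can be sketched in a few lines of Python. Neither routine ever evaluates $F$ itself, which is the point the answers make. The function names, the fixed iteration count, and the test function $F(x) = x^4/4 - x$ (with minimum at $x = 1$) are all assumptions chosen for illustration.

```python
def newton_min(dF, d2F, x0, n_steps=50):
    """Newton's method on F': x <- x - F'(x) / F''(x)."""
    x = x0
    for _ in range(n_steps):
        x = x - dF(x) / d2F(x)
    return x


def secant_min(dF, x0, x1, n_steps=50):
    """Secant method on F'; uses only values of F'."""
    x_prev, x = x0, x1
    for _ in range(n_steps):
        denom = dF(x) - dF(x_prev)
        if denom == 0.0:
            # Iterates have stalled (typically converged); avoid 0/0.
            break
        x_prev, x = x, x - dF(x) * (x - x_prev) / denom
    return x


if __name__ == "__main__":
    dF = lambda x: x**3 - 1.0      # F'(x) for F(x) = x^4/4 - x
    d2F = lambda x: 3.0 * x**2     # F''(x)
    print(newton_min(dF, d2F, 2.0))
    print(secant_min(dF, 2.0, 1.5))
```

A root of $F'$ is only a candidate minimum; in practice one would also check that $F''$ is positive there.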
x1 ≧ 0, x2 ≧ 0, x3 ≧ 0
Maximize: −3x + 2y − 5z
−x−y−z≦−4
x−y−z≦ 2 Constraints: −x + y + z ≦ −2
x≧0,y≧0,z≧0
2.
4a.
5b.
≦ − 5 x1+x2 ≦15 2x1 − x2 + x3 ≦ 25 −x1−x2+x3≦−1
$r^2 = 1 + \sqrt{1 + \sqrt{1 + \cdots}} = 1 + r$. 15d. By series expansion, we have $1 + r^{-1} + r^{-2} + \cdots = (1 - r^{-1})^{-1}$. Hence, we have $r = (1 - r^{-1})^{-1} - 1 = \dfrac{1}{r - 1}$ or $r^2 = r + 1$.
Maximize: 2x1 + 2x2 − 6x3 − x4
3x1 +x4 =25 6a. x1+x2+x3+x4 =20 Constraints: −4x1 − 6x3 + x5 = −5 −2x1 −3x3−2x4+x6= 0
x1, x2, x3, x4, x5, x6 ≧0 Minimize: 25y1 + 20y2 − 5y3
2. At most 2n.
5. First Primal form:
3y1+y2−4y3−2y4≦2 6b. y2≦2 Constraints: y2 − 6y3 − 3y4 ≦ −6 y1+y2 −2y4≦−1
6. Given $Ax = b$. Let $y_j = x_j + y_{n+1}$.
y,y,y,y≧0 1234
n n
a i j y j + − a i j y n + 1
Constraints: j =1 j =1
= b i ( 1 ≦ i ≦ n + 1 )
y≧0
Computer Exercises 14.2
Exercises 14.3
7. Maximum of 36 at (2, 6)
8. Minimum of 36 at (0, 3, 1)
11. Minimum 2 for (x, x − 2) where x ≧ 3 13a. Maximum of 18 at (9, 0)
13c. Unbounded solution
13f. No solution
52T
1b. x= 0,0,3,3,0 1c. x= 0,3,3
Maximize: Constraints:
−bT y −ATy≦ −c
y ≧ 0
n n n j=1aijxj −bi = aijyj −yn+1
aij −bi.
Now
Minimize: yn+1
j=1 j=1
13h. Maximum of 54 at 18,0 55
85T
− 7y4 −u1 +v1 = 6
14. Maximum of 100 at (24, 32, −124) 17. Its feasible set is empty.
Computer Exercises 14.1
Maximize:
1a.
Constraints:
Minimize:
−4 (ui+vi) i=1
u≧0, v≧0, y≧0 ε
1.
3. 5.
Exercises 14.2
n
y1+ y2+ y3− 3y4−u2+v2= 2 7y2 −5y3 − 2y4 −u3 +v3 =11 6y1 +9y3 −15y4 −u4 +v4 = 9
1.
Maximize:
j = 0
1b.
Felt Straw 0 200 150 0 Lariat Ranch Wear 150 0
$13.50
Cost 50¢ for 1.6 ounces of food f ,
1ounceoffood f ,andnoneoffood f . 32
cj yj Constraints: j=0aijyj ≦bi
7y4 − ε ≦ 1 2 3 4
−6
n
yi≧0 (0≦i≦n)
−y−y−y+3y−ε≦ −2
Here c0 = −nj=1cj and ai0 = −nj=1aij.
1
y+y+y−
6y1 +9y3−15y4−ε≦
5y1 +2y2
Texas Hatters Lone Star Hatters
5y1+2y2 −
7y4−ε≦ 3y−ε≦
6
2
11
9
Constraints: −5y1 − 2y2 +
−7y2 +5y3 + 2y4 −ε ≦ −11
−6y1 −9y3 +15y4 −ε ≦ −9 ε≧0, yj≧0 (1≦i≦4)
1 2 3 7 y2 − 5 y3 −
4 2y4−ε≦
3.
Take m points xi (i = 1,2,...,m). n
Let p(x)= ajxj. j=0 Minimize: ε n
j=0 Constraints: n
3. p(x ) = 1.0001 + 0.9978x + 0.51307x 2 + 0.13592x 3 + 0.071344x 4
Exercises Appendix B.1
1a. e ≈ (2.718)10 = (010.101 101 111 100 111 . . .)2
2d. (27.45075 341 . . .)8
4.
y1 − y2 − u1 + v1 =4 6. (0.31463146...) 9. (479) =(111011111)
axj
ji i
ajxj+ε≧ f(xi) (1≦i≦m) i
2e. (113.16662 13 . . .)8 3b. (613.40625)10
Minimize: u +v +u +v +u +v 112233
≦ f(x) (1≦i≦m)
2f. (71.24426 416 . . .) 8
j = 0 ε≧0
3a. (441.68164 0625)10
4c. (101 111) 4e. (110 011) 4g. (63.72664)
228
Constraints: 2y1−3y2+ y3−u2+v2=7 8
y1 + y2 −2y3 −u3 +v3 =2 y1,y2,y3 ≧0;u1,u2,u3 ≧0;v1,v2,v3 ≧0
12. A real number R has a finite representation in the binary system
⇔ $R = (a_m a_{m-1} \ldots a_1 a_0 . b_1 b_2 \ldots b_n)_2$
Computer Exercises 14.3
1a. x1 = 0.353, x2 = 2.118, x3 = 0.765
1b. x1 = 0.671, x2 = 1.768, x3 = 0.453
2
⇔ $R = (a_m \ldots a_1 a_0 b_1 b_2 \ldots b_n)_2 \times 2^{-n} = m \times 2^{-n}$ where $m = (a_m a_{m-1} \ldots a_1 a_0 b_1 b_2 \ldots b_n)_2$.
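The characterization above, that $R$ has a finite binary representation exactly when $R$ is an integer times $2^{-n}$, translates into a simple test on a rational number in lowest terms: its denominator must be a power of two. The Python sketch below illustrates this with exact fractions. The function names are hypothetical, and the integer is called `k` to avoid clashing with the index $m$.

```python
from fractions import Fraction


def has_finite_binary_expansion(r):
    """True if the rational r equals k * 2**(-n) for integers k, n >= 0."""
    q = Fraction(r).denominator      # always >= 1, in lowest terms
    return q & (q - 1) == 0          # power-of-two test


def as_scaled_integer(r):
    """Return (k, n) with r == k * 2**(-n), or raise ValueError."""
    r = Fraction(r)
    q = r.denominator
    if q & (q - 1) != 0:
        raise ValueError("no finite binary representation")
    return r.numerator, q.bit_length() - 1


if __name__ == "__main__":
    print(has_finite_binary_expansion(Fraction(5, 8)))    # 0.101 in binary
    print(has_finite_binary_expansion(Fraction(1, 10)))   # repeating in binary
    print(as_scaled_integer(Fraction(5, 8)))
```

Pass `Fraction(1, 10)` rather than the float `0.1`: a float is already a binary fraction, so the test would report it as finite.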
10
Bibliography
Abell, M. L., and J. P. Braselton. 1993. The Mathematical Handbook. New York: Academic Press.
Abramowitz, M., and I. A. Stegun (eds.). 1964. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. National Bureau of Standards. New York: Dover, 1965 (reprint).
Acton, F. S. 1959. Analysis of Straight-Line Data. New York: Wiley. New York: Dover, 1966 (reprint).
Acton, F. S. 1990. Numerical Methods That (Usually) Work. Washington, D.C.: Mathematical Association of America.
Acton, F. S. 1996. Real Computing Made Real: Preventing Errors in Scientific and Engineering Calculations. Princeton, New Jersey: Princeton University Press.
Ahlberg, J. H., E. N. Nilson, and J. L. Walsh. 1967. The Theory of Splines and Their Applications. New York: Academic Press.
Aiken, R. C., ed. 1985. Stiff Computation. New York: Oxford University Press.
Ames, W. F. 1992. Numerical Methods for Partial Differential Equations, 3rd Ed. New York: Academic Press.
Ammar, G. S., D. Calvetti, and L. Reichel. 1999. “Computation of Gauss-Kronrod quadrature rules with non-positive weights.” Electronic Transactions on Numerical Analysis 9, 26–38. http://etna.mcs.kent.edu
Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK User’s Guide, 3rd Ed. Philadelphia: SIAM.
Armstrong, R. D., and J. Godfrey. 1979. “Two linear programming algorithms for the linear discrete l1 norm problem.” Mathematics of Computation 33, 289–300.
Ascher, U. M., R. M. M. Mattheij, and R. D. Russell. 1995. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations. Philadelphia: SIAM.
Ascher, U. M., and L. R. Petzold. 1998. Computer Methods for Ordinary Differential Equations and Differential Algebraic Equations. Philadelphia: SIAM.
Atkinson, K. 1993. Elementary Numerical Analysis. New York: Wiley.
Atkinson, K. A. 1988. An Introduction to Numerical Analysis, 2nd Ed. New York: Wiley.
Axelsson, O. 1994. Iterative Solution Methods. New York: Cambridge University Press.
Axelsson, O., and V. A. Barker. 2001. Finite Element Solution of Boundary Value Problems: Theory and Computations. Philadelphia: SIAM.
Azencott, R., ed. 1992. Simulated Annealing: Parallelization Techniques. New York: Wiley.
Bai, Z., J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. 2000. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Philadelphia: SIAM.
Baldick, R. 2006. Applied Optimization. New York: Cambridge University Press.
Barnsley, M. F. 2006. SuperFractals. New York: Cambridge University Press.
Barrett, R., M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. 1994. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Philadelphia: SIAM.
Barrodale, I., and C. Phillips. 1975. “Solution of an overdetermined system of linear equations in the Chebyshev norm.” Association for Computing Machinery Transactions on Mathematical Software 1, 264–270.
Barrodale, I., and F. D. K. Roberts. 1974. “Solution of an overdetermined system of equations in the l1 norm.” Communications of the Association for Computing Machinery 17, 319–320.
Barrodale, I., F. D. K. Roberts, and B. L. Ehle. 1971. Elementary Computer Applications. New York: Wiley.
Bartels, R. H. 1971. “A stabilization of the simplex method.” Numerische Mathematik 16, 414–434.
Bartels, R., J. Beatty, and B. Barsky. 1987. An Introduction to Splines for Use in Computer Graphics and Geometric Modelling. San Francisco: Morgan Kaufmann.
Bassien, S. 1998. “The dynamics of a family of one-dimensional maps.” American Mathematical Monthly 105, 118–130.
Bayer, D., and P. Diaconis. 1992. “Trailing the dovetail shuffle to its lair.” Annals of Applied Probability 2, 294–313.
Beale, E. M. L. 1988. Introduction to Optimization. New York: Wiley.
Berry, M. W., and M. Browne. 2005. Understanding Search Engines: Mathematical Modeling and Text Retrieval, 2nd Ed. Philadelphia: SIAM.
Björck, Å. 1996. Numerical Methods for Least Squares Problems. Philadelphia: SIAM.
Bloomfield, P., and W. Steiger. 1983. Least Absolute Deviations, Theory, Applications, and Algorithms. Boston: Birkhäuser.
Bornemann, F., D. Laurie, S. Wagon, and J. Waldvogel. 2004. The SIAM 100-Digit Challenge: A Study in High-Accuracy Numerical Computing. Philadelphia: SIAM.
Borwein, J. M., and P. B. Borwein. 1984. “The arithmetic-geometric mean and fast computation of elementary functions.” Society for Industrial and Applied Mathematics Review 26, 351–366.
Borwein, J. M., and P. B. Borwein. 1987. Pi and the AGM: A Study in Analytic Number Theory and Computational Complexity. New York: Wiley.
Boyce, W. E., and R. C. DiPrima. 2008. Elementary Differential Equations and Boundary Value Problems, 8th Ed. New York: Wiley.
Branham, R. 1990. Scientific Data Analysis: An Introduction to Overdetermined Systems. New York: Springer-Verlag.
Brenner, S., and R. Scott. 2002. The Mathematical Theory of Finite Element Methods. New York: Springer-Verlag.
Brent, R. P. 1976. “Fast multiple precision evaluation of elementary functions.” Journal of the Association for Computing Machinery 23, 242–251.
Briggs, W. 2004. Ants, Bikes, and Clocks: Problem Solving for Undergraduates. Philadelphia: SIAM.
Buchanan, J. L., and P. R. Turner. 1992. Numerical Methods and Analysis. New York: McGraw-Hill.
Burden, R. L., and J. D. Faires. 2011. Numerical Analysis, 9th Ed. Boston: Brooks/Cole Cengage Learning.
Bus, J. C. P., and T. J. Dekker. 1975. “Two efficient algorithms with guaranteed convergence for finding a zero of a function.” Association for Computing Machinery Transactions on Mathematical Software 1, 330–345.
Butcher, J. C. 1987. The Numerical Analysis of Ordinary Differential Equations: Runge-Kutta and General Linear Methods. New York: Wiley.
Calvetti, D., G. H. Golub, W. B. Gragg, and L. Reichel. 2000. “Computation of Gauss-Kronrod quadrature rules.” Mathematics of Computation 69, 1035–1052.
Carrier, G., and C. Pearson. 1991. Ordinary Differential Equations. Philadelphia: SIAM.
Gärtner, B. 2006. Understanding and Using Linear Programming. New York: Springer.
Cash, J. 2003. “Mesh selection for nonlinear two-point boundary-value problems.” Journal of Computational Methods in Science and Engineering.
Chaitin, G. J. 1975. “Randomness and mathematical proof.” Scientific American May, 47–52.
Chapman, S. J. 2000. MATLAB Programming for Engineers. Pacific Grove, California: Brooks/Cole.
Chapra, S. L. 2012. Applied Numerical Methods for Engineers and Scientists, 3rd Ed. New York: McGraw-Hill.
Cheney, E. W. 1982. Introduction to Approximation Theory, 2nd Ed. Washington, D.C.: AMS.
Cheney, E. W. 2001. Analysis for Applied Mathematics. New York: Springer.
Cheney, W., and D. Kincaid. 2012. Linear Algebra: Theory and Applications, 2nd Ed. Sudbury, Massachusetts: Jones & Bartlett Learning.
Chicone, C. 2006. Ordinary Differential Equations with Applications, 2nd Ed. New York: Springer.
Clenshaw, C. W., and A. R. Curtis. 1960. “A method for numerical integration on an automatic computer.” Numerische Mathematik 2, 197–205.
Coleman, T. F., and C. Van Loan. 1988. Handbook for Matrix Computations. Philadelphia: SIAM.
Collatz, L. 1966. The Numerical Treatment of Differential Equations, 3rd Ed. Berlin: Springer-Verlag.
Conte, S. D., and C. de Boor. 1980. Elementary Numerical Analysis, 3rd Ed. New York: McGraw-Hill.
Cooper, L., and D. Steinberg. 1974. Methods and Applications of Linear Programming. Philadelphia: Saunders.
Crilly, A. J., R. A. Earnshaw, and H. Jones, eds. 1991. Fractals and Chaos. New York: Springer-Verlag.
Cvijovic, D., and J. Klinowski. 1995. “Taboo search: An approach to the multiple minima problem.” Science 267, 664–666.
Dahlquist, G., and Å. Björck. 1974. Numerical Methods. Englewood Cliffs, New Jersey: Prentice-Hall.
Dantzig, G. B., A. Orden, and P. Wolfe. 1963. “Generalized simplex method for minimizing a linear form under linear inequality constraints.” Pacific Journal of Mathematics 5, 183–195.
Davis, P. J., and P. Rabinowitz. 1984. Methods of Numerical Integration, 2nd Ed. New York: Academic Press.
Davis, T. 2006. Direct Methods for Sparse Linear Systems. Philadelphia: SIAM.
de Boor, C. 1971. “CADRE: An algorithm for numerical quadrature.” In Mathematical Software, edited by J. R. Rice, 417–449. New York: Academic Press.
de Boor, C. 1984. A Practical Guide to Splines, 2nd Ed. New York: Springer-Verlag.
Dekker, T. J. 1969. “Finding a zero by means of successive linear interpolation.” In Constructive Aspects of the Fundamental Theorem of Algebra, edited by B. Dejon and P. Henrici. New York: Wiley-Interscience.
Dekker, T. J., and W. Hoffmann. 1989. “Rehabilitation of the Gauss-Jordan algorithm.” Numerische Mathematik 54, 591–599.
Dekker, T. J., W. Hoffmann, and K. Potma. 1997. “Stability of the Gauss-Huard algorithm with partial pivoting.” Computing 58, 225–244.
Dekker, K., and J. G. Verwer. 1984. “Stability of Runge-Kutta methods for stiff nonlinear differential equations.” CWI Monographs 2. Amsterdam: Elsevier Science.
Demmel, J. W. 1997. Applied Numerical Linear Algebra. Philadelphia: SIAM.
Dennis, J. E., and R. Schnabel. 1983. Quasi-Newton Methods for Nonlinear Problems. Englewood Cliffs, New Jersey: Prentice-Hall.
Dennis, J. E., and R. B. Schnabel. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Philadelphia: SIAM.
Dennis, J. E., and D. J. Woods. 1987. “Optimization on microcomputers: The Nelder-Mead simplex algorithm.” In New Computing Environments, edited by A. Wouk. Philadelphia: SIAM.
de Temple, D. W. 1993. “A quicker convergence to Euler’s Constant.” American Mathematical Monthly 100, 468–470.
Devitt, J. S. 1993. Calculus with Maple V. Pacific Grove, California: Brooks/Cole.
Dixon, V. A. 1974. “Numerical quadrature: a survey of the available algorithms.” In Software for Numerical Mathematics, edited by D. J. Evans. New York: Academic Press.
Dongarra, J. J., I. S. Duff, D. C. Sorensen, and H. van der Vorst. 1990. Solving Linear Systems on Vector and Shared Memory Computers. Philadelphia: SIAM.
Dorn, W. S., and D. D. McCracken. 1972. Numerical Methods with FORTRAN IV Case Studies. New York: Wiley.
Edwards, C., and D. Penny. 2004. Differential Equations and Boundary Value Problems, 5th Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Ellis, W., Jr., E. W. Johnson, E. Lodi, and D. Schwalbe. 1997. Maple V Flight Manual: Tutorials for Calculus, Linear Algebra, and Differential Equations. Pacific Grove, California: Brooks/Cole.
Ellis, W., Jr., and E. Lodi. 1991. A Tutorial Introduction to Mathematica. Pacific Grove, California: Brooks/Cole.
Elman, H., D. J. Silvester, and A. Wathen. 2004. Finite Element and Fast Iterative Solvers. New York: Oxford University Press.
England, R. 1969. “Error estimates for Runge-Kutta type solutions of ordinary differential equations.” Computer Journal 12, 166–170.
Enright, W. H. 2006. “Verifying approximate solutions to differential equations.” Journal of Computational and Applied Mathematics 185, 203–311.
Epureanu, B. I., and H. S. Greenside. 1998. “Fractal basins of attraction associated with a damped Newton’s method.” SIAM Review 40, 102–109.
Evans, G., J. Blackledge, and P. Yardley. 2000. Numerical Methods for Partial Differential Equations. New York: Springer-Verlag.
Evans, G. W., G. F. Wallace, and G. L. Sutherland. 1967. Simulation Using Digital Computers. Englewood Cliffs, New Jersey: Prentice-Hall.
Farin, G. 1990. Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide, 2nd Ed. New York: Academic Press.
Fauvel, J., R. Flood, M. Shortland, and R. Wilson (eds.). 1988. Let Newton Be! London: Oxford University Press.
Feder, J. 1988. Fractals. New York: Plenum Press.
Fehlberg, E. 1969. “Klassische Runge-Kutta formeln fünfter und siebenter ordnung mit schrittweitenkontrolle.” Computing 4, 93–106.
Ferris, M. C., O. L. Mangasarian, and S. J. Wright. 2007. Linear Programming with MATLAB. Philadelphia: SIAM.
Flehinger, B. J. 1966. “On the probability that a random integer has initial digit A.” American Mathematical Monthly 73, 1056–1061.
Fletcher, R. 1976. Practical Methods of Optimization. New York: Wiley.
Floudas, C. A., and P. M. Pardalos (eds.). 1992. Recent Advances in Global Optimization. Princeton, New Jersey: Princeton University Press.
Flowers, B. H. 1995. An Introduction to Numerical Methods in C++. New York: Oxford University Press.
Ford, J. A. 1995. “Improved Algorithms of Illinois-Type for the Numerical Solution of Nonlinear Equations.” Technical Report, Department of Computer Science, University of Essex, Colchester, Essex, UK.
Forsythe, G. E. 1957. “Generation and use of orthogonal polynomials for data-fitting with a digital computer.” Society for Industrial and Applied Mathematics Journal 5, 74–88.
Forsythe, G. E. 1970. “Pitfalls in computation, or why a math book isn’t enough.” American Mathematical Monthly 77, 931–956.
Forsythe, G. E., M. A. Malcolm, and C. B. Moler. 1977. Computer Methods for Mathematical Computations. Englewood Cliffs, New Jersey: Prentice-Hall.
Forsythe, G. E., and C. B. Moler. 1967. Computer Solution of Linear Algebraic Systems. Englewood Cliffs, New Jersey: Prentice-Hall.
Forsythe, G. E., and W. R. Wasow. 1960. Finite Difference Methods for Partial Differential Equations. New York: Wiley.
Fox, L. 1957. The Numerical Solution of Two-Point Boundary Problems in Ordinary Differential Equations. Oxford: Clarendon Press.
Fox, L. 1964. An Introduction to Numerical Linear Algebra, Monograph on Numerical Analysis. Oxford: Clarendon Press. Reprinted 1974. New York: Oxford University Press.
Fox, L., H. D. Huskey, and J. H. Wilkinson. 1948. “Notes on the solution of algebraic linear simultaneous equations.” Quarterly Journal of Mechanics and Applied Mathematics 1, 149–173.
Frank, W. 1958. “Computing eigenvalues of complex matrices by determinant evaluation and by methods of Danilewski and Wielandt.” Journal of SIAM 6, 37–49.
Fraser, W., and M. W. Wilson. 1966. “Remarks on the Clenshaw-Curtis quadrature scheme.” SIAM Review 8, 322–327.
Friedman, A., and N. Littman. 1994. Industrial Mathematics: A Course in Solving Real-World Problems. Philadelphia: SIAM.
Fröberg, C.-E. 1969. Introduction to Numerical Analysis. Reading, Massachusetts: Addison-Wesley.
Gallivan, K. A., M. Heath, E. Ng, B. Peyton, R. Plemmons, J. Ortega, C. Romine, A. Sameh, and R. Voigt. 1990. Parallel Algorithms for Matrix Computations. Philadelphia: SIAM.
Gander, W., and W. Gautschi. 2000. “Adaptive quadrature—revisited.” BIT 40, 84–101.
Garvan, F. 2002. The Maple Book. Boca Raton, Florida: Chapman & Hall/CRC.
Gautschi, W. 1990. “How (un)stable are Vandermonde systems?” In Asymptotic and Computational Analysis, 193–210, Lecture Notes in Pure and Applied Mathematics, 124. New York: Dekker.
Gautschi, W. 1997. Numerical Analysis: An Introduction. Boston: Birkhäuser.
Gear, C. W. 1971. Numerical Initial Value Problems in Ordinary Differential Equations. Englewood Cliffs, New Jersey: Prentice-Hall.
Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods, 2nd Ed. New York: Springer-Verlag.
Gentleman, W. M. 1972. “Implementing Clenshaw-Curtis quadrature.” Communications of the ACM 15, 337–346, 353.
Gerald, C. F., and P. O. Wheatley. 2003. Applied Numerical Analysis, 7th Ed. Reading, Massachusetts: Addison-Wesley Longman.
Ghizetti, A., and A. Ossiccini. 1970. Quadrature Formulae. New York: Academic Press.
Gill, P. E., W. Murray, and M. H. Wright. 1981. Practical Optimization. New York: Academic Press.
Gleick, J. 1992. Genius: The Life and Science of Richard Feynman. New York: Pantheon.
Gockenbach, M. S. 2002. Partial Differential Equations: Analytical and Numerical Methods. Philadelphia: SIAM.
Goldberg, D. 1991. “What every computer scientist should know about floating-point arithmetic.” ACM Computing Surveys 23, 5–48.
Goldstine, H. H. 1977. A History of Numerical Analysis from the 16th to the 19th Century. New York: Springer-Verlag.
Golub, G. H., and J. M. Ortega. 1992. Scientific Computing and Differential Equations. New York: Harcourt Brace Jovanovich.
Golub, G. H., and J. M. Ortega. 1993. An Introduction with Parallel Scientific Computing. New York: Academic Press.
Golub, G. H., and C. F. Van Loan. 1996. Matrix Computations, 3rd Ed. Baltimore: Johns Hopkins University Press.
Good, I. J. 1972. “What is the most amazing approximate integer in the universe?” Pi Mu Epsilon Journal 5, 314–315.
Greenbaum, A. 1997. Iterative Methods for Solving Linear Systems. Philadelphia: SIAM.
Greenbaum, A. 2002. “Card Shuffling and the Polynomial Numerical Hull of Degree k,” Mathematics Department, University of Washington, Seattle, Washington.
Gregory, R. T., and D. Karney. 1969. A Collection of Matrices for Testing Computational Algorithms. New York: Wiley.
Griewank, A. 2000. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Philadelphia: SIAM.
Groetsch, C. W. 1998. “Lanczos’ generalized derivative.” American Mathematical Monthly 105, 320–326.
Haberman, R. 2004. Applied Partial Differential Equations with Fourier Series and Boundary Value Problems. Upper Saddle River, New Jersey: Prentice-Hall.
Hageman, L. A., and D. M. Young. 1981. Applied Iterative Methods. New York: Academic Press; Dover 2004 (reprint).
Hämmerlin, G., and K.-H. Hoffmann. 1991. Numerical Mathematics. New York: Springer-Verlag.
Hammersley, J. M., and D. C. Handscomb. 1964. Monte Carlo Methods. London: Methuen.
Hansen, T., G. L. Mullen, and H. Niederreiter. 1993. “Good parameters for a class of node sets in quasi-Monte Carlo integration.” Mathematics of Computation 61, 225–234.
Haruki, H., and S. Haruki. 1983. “Euler’s Integrals.” American Mathematical Monthly 7, 465.
Hastings, H. M., and G. Sugihara. 1993. Fractals: A User’s Guide for the Natural Sciences. New York: Oxford University Press.
Havie, T. 1969. “On a modification of the Clenshaw-Curtis quadrature formula.” BIT 9, 338–350.
Heath, M. T. 2002. Scientific Computing: An Introductory Survey, 2nd Ed. New York: McGraw-Hill.
Henrici, P. 1962. Discrete Variable Methods in Ordinary Differential Equations. New York: Wiley.
Heroux, M., P. Raghavan, and H. Simon. 2006. Parallel Processing for Scientific Computing. Philadelphia: SIAM.
Herz-Fischler, R. 1998. A Mathematical History of the Golden Number. New York: Dover.
Hestenes, M. R., and E. Stiefel. 1952. “Methods of conjugate gradient for solving linear systems.” Journal of Research of the National Bureau of Standards 49, 409–436.
Higham, N. J. 2002. Accuracy and Stability of Numerical Algorithms, 2nd Ed. Philadelphia: SIAM.
Hildebrand, F. B. 1974. Introduction to Numerical Analysis. New York: McGraw-Hill.
Hodges, A. 1983. Alan Turing: The Enigma. New York: Simon & Schuster.
Hoffmann, W. 1989. “A fast variant of the Gauss-Jordan algorithm with partial pivoting. Basic transformations in linear algebra for vector computing.” Doctoral dissertation, University of Amsterdam, The Netherlands.
Hofmann-Wellenhof, B., H. Lichtenegger, and J. Collins. 2001. Global Positioning System: Theory and Practice, 5th Ed. New York: Springer-Verlag.
Horst, R., P. M. Pardalos, and N. V. Thoai. 2000. Introduction to Global Optimization, 2nd Ed. Boston: Kluwer.
Householder, A. S. 1970. The Numerical Treatment of a Single Nonlinear Equation. New York: McGraw-Hill.
Huard, P. 1979. “La méthode du simplexe sans inverse explicite.” Bull. E.D.F. Série C 2.
Huddleston, J. V. 2000. Extensibility and Compressibility in One-Dimensional Structures, 2nd Ed. Buffalo, NY: ECS Publ.
Hull, T. E., and A. R. Dobell. 1962. “Random number generators.” Society for Industrial and Applied Mathematics Review 4, 230–254.
Hull, T. E., W. H. Enright, B. M. Fellen, and A. E. Sedgwick. 1972. “Comparing numerical methods for ordinary differential equations.” Society for Industrial and Applied Mathematics Journal on Numerical Analysis 9, 603–637.
Hundsdorfer, W. H. 1985. “The numerical solution of nonlinear stiff initial value problems: an analysis of one step methods.” CWI Tract, 12. Amsterdam: Stichting Mathematisch Centrum, Centrum voor Wiskunde en Informatica.
Isaacson, E., and H. B. Keller. 1966. Analysis of Numerical Methods. New York: Wiley.
Jeffrey, A. 2000. Handbook of Mathematical Formulas and Integrals. Boston: Academic Press.
Jennings, A. 1977. Matrix Computation for Engineers and Scientists. New York: Wiley.
Johnson, L. W., R. D. Riess, and J. T. Arnold. 1997. Introduction to Linear Algebra. New York: Addison-Wesley.
Kahaner, D. K. 1971. “Comparison of numerical quadrature formulas.” In Mathematical Software, edited by J. R. Rice. New York: Academic Press.
Kahaner, D., C. Moler, and S. Nash. 1989. Numerical Methods and Software. Englewood Cliffs, New Jersey: Prentice-Hall.
Keller, H. B. 1968. Numerical Methods for Two-Point Boundary-Value Problems. Toronto: Blaisdell.
Keller, H. B. 1976. Numerical Solution of Two-Point Boundary Value Problems. Philadelphia: SIAM.
Kelley, C. T. 1995. Iterative Methods for Linear and Nonlinear Equations. Philadelphia: SIAM.
Kelley, C. T. 2003. Solving Nonlinear Equations with Newton’s Method. Philadelphia: SIAM.
Kincaid, D., and W. Cheney. 2002. Numerical Analysis: Mathematics of Scientific Computing, 3rd Ed. Providence, Rhode Island: American Mathematical Society.
Kincaid, D. R., and D. M. Young. 1979. “Survey of iterative methods.” In Encyclopedia of Computer Science and Technology, edited by J. Belzer, A. G. Holzman, and A. Kent. New York: Dekker.
Kincaid, D. R., and D. M. Young. 2000. “Partial differential equations.” In Encyclopedia of Computer Science, 4th Ed., edited by A. Ralston, E. D. Reilly, and D. Hemmendinger. New York: Grove’s Dictionaries.
Kinderman, A. J., and J. F. Monahan. 1977. “Computer generation of random variables using the ratio of uniform deviates.” Association for Computing Machinery Transactions on Mathematical Software 3, 257–260.
Kirkpatrick, S., C. D. Gelatt, Jr., and M. P. Vecchi. 1983. “Optimization by simulated annealing.” Science 220, 671–680.
Knuth, D. E. 1997. The Art of Computer Programming, 3rd Ed. Vol. 2, Seminumerical Algorithms. New York: Addison-Wesley.
Krogh, F. T. 2003. “On developing mathematical software.” Journal of Computational and Applied Mathematics 185, 196–202.
Kronrod, A. S. 1964. “Nodes and Weights of Quadrature Rules.” Doklady Akad. Nauk SSSR, 154, 283–286. [Russian] (1965. New York: Consultants Bureau.)
Krylov, V. I. 1962. Approximate Calculation of Integrals, translated by A. Stroud. New York: Macmillan.
Lambert, J. D. 1973. Computational Methods in Ordinary Differential Equations. New York: Wiley.
Lambert, J. D. 1991. Numerical Methods for Ordinary Differential Equations. New York: Wiley.
Lapidus, L., and J. H. Seinfeld. 1971. Numerical Solution of Ordinary Differential Equations. New York: Academic Press.
Laurie, D. P. 1997. “Calculation of Gauss-Kronrod quadrature formulae.” Mathematics of Computation, 1133–1145.
Lawson, C. L., and R. J. Hanson. 1995. Solving Least-Squares Problems. Philadelphia: SIAM.
Leva, J. L. 1992. “A fast normal random number generator.” Association for Computing Machinery Transactions on Mathematical Software 18, 449–455.
Lindfield, G., and J. Penny. 2000. Numerical Methods Using MATLAB, 2nd Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Lootsma, F. A., ed. 1972. Numerical Methods for Nonlinear Optimization. New York: Academic Press.
Lozier, D. W., and F. W. J. Olver. 1994. “Numerical evaluation of special functions.” In Mathematics of Computation 1943–1993: A Half-Century of Computational Mathematics 48, 79–125. Providence, Rhode Island: AMS.
Lynch, S. 2004. Dynamical Systems with Applications. Boston: Birkhäuser.
MacLeod, M. A. 1973. “Improved computation of cubic natural splines with equi-spaced knots.” Mathematics of Computation 27, 107–109.
Maron, M. J. 1991. Numerical Analysis: A Practical Approach. Boston: PWS Publishers.
Marsaglia, G. 1968. “Random numbers fall mainly in the planes.” Proceedings of the National Academy of Sciences 61, 25–28.
Marsaglia, G., and W. W. Tsang. 2000. “The Ziggurat Method for generating random variables.” Journal of Statistical Software 5, 1–7.
Mattheij, R. M. M., S. W. Rienstra, and J. H. M. ten Thije Boonkkamp. 2005. Partial Differential Equations: Modeling, Analysis, Computation. Philadelphia: SIAM.
McCartin, B. J. 1998. “Seven deadly sins of numerical computations,” American Mathematical Monthly 105, No. 10, 929–941.
McKenna, P. J., and C. Tuama. 2001. “Large torsional oscillations in suspension bridges visited again: Vertical forcing creates torsional response.” American Mathematical Monthly 108, 738–745.
Mehrotra, S. 1992. “On the implementation of a primal-dual interior point method.” SIAM Journal on Optimization 2, 575–601.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. “Equation of state calculations by fast computing machines.” Journal of Chemical Physics 21, 1087–1092.
Meurant, G. 2006. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations. Philadelphia: SIAM.
Meyer, C. D. 2000. Matrix Analysis and Applied Linear Algebra. Philadelphia: SIAM.
Miranker, W. L. 1981. “Numerical methods for stiff equations and singular perturbation problems.” In Mathematics and its Applications, Vol. 5. Dordrecht-Boston, Massachusetts: D. Reidel.
Moler, C. B. 2008. Numerical Computing with MATLAB, Revised Reprint. Philadelphia: SIAM.
Moler, C. 2011. Cleve’s Corner – Computing π, MathWorks News & Notes, www.mathworks.com
Moré, J. J., and S. J. Wright. 1993. Optimization Software Guide. Philadelphia: SIAM.
Moulton, F. R. 1930. Differential Equations. New York: Macmillan.
Nelder, J. A., and R. Mead. 1965. “A simplex method for function minimization.” Computer Journal 7, 308–313.
Nerinckx, D., and A. Haegemans. 1976. “A comparison of nonlinear equation solvers.” Journal of Computational and Applied Mathematics 2, 145–148.
Nering, E. D., and A. W. Tucker. 1992. Linear Programs and Related Problems. New York: Academic Press.
Niederreiter, H. 1978. “Quasi-Monte Carlo methods.” Bulletin of the American Mathematical Society 84, 957–1041.
Niederreiter, H. 1992. Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM.
Nievergelt, J., J. G. Farrar, and E. M. Reingold. 1974. Computer Approaches to Mathematical Problems. Englewood Cliffs, New Jersey: Prentice-Hall.
Noble, B., and J. W. Daniel. 1988. Applied Linear Algebra, 3rd Ed. Englewood Cliffs, New Jersey: Prentice-Hall.
Nocedal, J., and S. Wright. 2006. Numerical Optimization, 2nd Ed. New York: Springer.
Novak, E., K. Ritter, and H. Woźniakowski. 1995. “Average-case optimality of a hybrid secant-bisection method.” Mathematics of Computation 64, 1517–1540.
Novak, M., ed. 1998. Fractals and Beyond: Complexities in the Sciences. River Edge, New Jersey: World Scientific.
O’Hara, H., and F. J. Smith. 1968. “Error estimation in Clenshaw-Curtis quadrature formula.” Computer Journal 11, 213–219.
O’Leary, D. P. 2009. Scientific Computing with Case Studies. Philadelphia: SIAM.
Oliveira, S., and D. E. Stewart. 2006. Writing Scientific Software: A Guide to Good Style. New York: Cambridge University Press.
Orchard-Hays, W. 1968. Advanced Linear Programming Computing Techniques. New York: McGraw-Hill.
Ortega, J., and R. G. Voigt. 1985. Solution of Partial Differential Equations on Vector and Parallel Computers. Philadelphia: SIAM.
Ortega, J. M. 1990a. Numerical Analysis: A Second Course. Philadelphia: SIAM.
Ortega, J. M. 1990b. Introduction to Parallel and Vector Solution of Linear Systems. New York: Plenum.
Ortega, J. M., and W. C. Rheinboldt. 1970. Iterative Solution of Nonlinear Equations in Several Variables. New York: Academic Press. (2000. Reprint. Philadelphia: SIAM.)
Ostrowski, A. M. 1966. Solution of Equations and Systems of Equations, 2nd Ed. New York: Academic Press.
Overton, M. L. 2001. Numerical Computing with IEEE Floating Point Arithmetic. Philadelphia: SIAM.
Otten, R. H. J. M., and L. P. P. van Ginneken. 1989. The Annealing Algorithm. Dordrecht, The Netherlands: Kluwer.
Pacheco, P. 1997. Parallel Programming with MPI. San Francisco: Morgan Kaufmann.
Patterson, T. N. L. 1968. “The optimum addition of points to quadrature formulae.” Mathematics of Computation 22, 847–856, and in 1969 Mathematics of Computation 23, 892.
Parlett, B. N. 1997. The Symmetric Eigenvalue Problem. Philadelphia: SIAM.
Parlett, B. 2000. “The QR Algorithm,” Computing in Science and Engineering 2, 38–42.
Piessens, R., E. de Doncker, C. W. Überhuber, and D. K. Kahaner. 1983. QUADPACK: A Subroutine Package for Automatic Integration. New York: Springer-Verlag.
Peterson, I. 1997. The Jungles of Randomness: A Mathematical Safari. New York: Wiley.
Phillips, G. M., and P. J. Taylor. 1973. Theory and Applications of Numerical Analysis. New York: Academic Press.
Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 2007. Numerical Recipes in C++, 3rd Ed. New York: Cambridge University Press.
Quinn, M. J. 1994. Parallel Computing: Theory and Practice. New York: McGraw-Hill.
Rabinowitz, P. 1968. “Applications of linear programming to numerical analysis.” Society for Industrial and Applied Mathematics Review 10, 121–159.
Rabinowitz, P. 1970. Numerical Methods for Nonlinear Algebraic Equations. London: Gordon & Breach.
Raimi, R. A. 1969. “On the distribution of first significant figures.” American Mathematical Monthly 76, 342–347.
Ralston, A. 1965. A First Course in Numerical Analysis. New York: McGraw-Hill.
Ralston, A., and C. L. Meek (eds.). 1976. Encyclopedia of Computer Science. New York: Petrocelli/Charter.
Ralston, A., and P. Rabinowitz. 2001. A First Course in Numerical Analysis, 2nd Ed. New York: Dover.
Reid, J. 1971. “On the method of conjugate gradient for the solution of large sparse systems of linear equations.” In Large Sparse Sets of Linear Equations, edited by J. Reid. London: Academic Press.
Rheinboldt, W. C. 1998. Methods for Solving Systems of Nonlinear Equations, 2nd Ed. Philadelphia: SIAM.
Rice, J. R. 1971. “SQUARS: An algorithm for least squares approximation.” In Mathematical Software, edited by J. R. Rice. New York: Academic Press.
Rice, J. R. 1983. Numerical Methods, Software, and Analysis. New York: McGraw-Hill.
Rice, J. R., and R. F. Boisvert. 1984. Solving Elliptic Problems Using ELLPACK. New York: Springer-Verlag.
Rice, J. R., and J. S. White. 1964. “Norms for smoothing and estimation.” Society for Industrial and Applied Mathematics Review 6, 243–256.
Rivlin, T. J. 1990. The Chebyshev Polynomials, 2nd Ed. New York: Wiley.
Roger, H.-F. 1998. A Mathematical History of the Golden Number. New York: Dover.
Roos, C., T. Terlaky, and J.-Ph. Vial. 1997. Theory and Algorithms for Linear Optimization: An Interior Point Approach. New York: Wiley.
Saad, Y. 2011. Numerical Methods for Large Eigenvalue Problems, 2nd Ed. Philadelphia: SIAM.
Saad, Y. 2003. Iterative Methods for Sparse Linear Systems, 3rd Ed. Philadelphia: SIAM.
Salamin, E. 1976. “Computation of π using arithmetic-geometric mean.” Mathematics of Computation 30, 565–570.
Sauer, T. 2012. Numerical Analysis, 2nd Ed. New York: Pearson Addison-Wesley.
Scheid, F. 1968. Theory and Problems of Numerical Analysis. New York: McGraw-Hill.
Scheid, F. 1990. 2000 Solved Problems in Numerical Analysis. Schaum’s Solved Problem Series. New York: McGraw-Hill.
Schilling, R. J., and S. L. Harris. 2000. Applied Numerical Methods for Engineering Using MATLAB and C. Pacific Grove, California: Brooks/Cole.
Schmidt 1908. Title unknown. Rendiconti del Circolo Matematico di Palermo 25, 53–77.
Schoenberg, I. J. 1946. “Contributions to the problem of approximation of equidistant data by analytic functions.” Quarterly of Applied Mathematics 4, 45–99, 112–141.
Schoenberg, I. J. 1967. “On spline functions.” In Inequalities, edited by O. Shisha, 255–291. New York: Academic Press.
Schrage, L. 1979. “A more portable Fortran random number generator.” Association for Computing Machinery Transactions on Mathematical Software 5, 132–138.
Schrijver, A. 1986. Theory of Linear and Integer Programming. Somerset, New Jersey: Wiley.
Schultz, M. H. 1973. Spline Analysis. Englewood Cliffs, New Jersey: Prentice-Hall.
Schumaker, L. L. 1981. Spline Functions: Basic Theory. New York: Wiley.
Shampine, J. D. 1994. Numerical Solutions of Ordinary Differential Equations. London: Chapman and Hall.
Shampine, L. F., R. C. Allen, and S. Pruess. 1997. Fundamentals of Numerical Computing. New York: Wiley.
Shampine, L. F., and M. K. Gordon. 1975. Computer Solution of Ordinary Differential Equations. San Francisco: W. H. Freeman.
Shewchuk, J. R. 1994. “An introduction to the conjugate gradient method without the agonizing pain,” www.wikipedia.com.
Skeel, R. D., and J. B. Keiper. 1992. Elementary Numerical Computing with Mathematica. New York: McGraw-Hill.
Smith, G. D. 1965. Solution of Partial Differential Equations. New York: Oxford University Press.
Sobol, I. M. 1994. A Primer for the Monte Carlo Method. Boca Raton, Florida: CRC Press.
Southwell, R. V. 1946. Relaxation Methods in Theoretical Physics. Oxford: Clarendon Press.
Späth, H. 1992. Mathematical Algorithms for Linear Regression. New York: Academic Press.
Stakgold, I. 2000. Boundary Value Problems of Mathematical Physics. Philadelphia: SIAM.
Steele, J. M. 1997. Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM.
Stetter, H. J. 1973. Analysis of Discretization Methods for Ordinary Differential Equations. Berlin: Springer-Verlag.
Stewart, G. W. 1973. Introduction to Matrix Computations. New York: Academic Press.
Stewart, G. W. 1996. Afternotes on Numerical Analysis. Philadelphia: SIAM.
Stewart, G. W. 1998a. Afternotes Goes to Graduate School. Philadelphia: SIAM.
Stewart, G. W. 1998b. Matrix Algorithms: Basic Decompositions, Vol. 1. Philadelphia: SIAM.
Stewart, G. W. 2001. Matrix Algorithms: Eigensystems, Vol. 2. Philadelphia: SIAM.
Stoer, J., and R. Bulirsch. 1993. Introduction to Numerical Analysis, 2nd Ed. New York: Springer-Verlag.
Strang, G. 2006. Linear Algebra and Its Applications, 4th Ed. Belmont, California: Thomson Brooks/Cole.
Strang, G., and K. Borre. 1997. Linear Algebra, Geodesy, and GPS. Cambridge, Massachusetts: Wellesley Cambridge Press.
Street, R. L. 1973. The Analysis and Solution of Partial Differential Equations. Pacific Grove, California: Brooks/Cole.
Stroud, A. H. 1974. Numerical Quadrature and Solution of Ordinary Differential Equations. New York: Springer-Verlag.
Stroud, A. H., and D. Secrest. 1966. Gaussian Quadrature Formulas. Englewood Cliffs, New Jersey: Prentice-Hall.
Subbotin, Y. N. 1967. “On piecewise-polynomial approximation.” Matematicheskie Zametki 1, 63–70. (Translation: 1967. Math. Notes 1, 41–46.)
Szabo, F. 2002. Linear Algebra: An Introduction Using MAPLE. San Diego, California: Harcourt/Academic Press.
Torczon, V. 1997. “On the convergence of pattern search methods.” Society for Industrial and Applied Mathematics Journal on Optimization 7, 1–25.
Törn, A., and A. Zilinskas. 1989. Global Optimization. Lecture Notes in Computer Science 350. Berlin: Springer-Verlag.
Traub, J. F. 1964. Iterative Methods for the Solution of Equations. Englewood Cliffs, New Jersey: Prentice-Hall.
Trefethen, L. N., and D. Bau. 1997. Numerical Linear Algebra. Philadelphia: SIAM.
Turner, P. R. 1982. “The distribution of leading significant digits.” Journal of the Institute of Mathematics and Its Applications 2, 407–412.
van Huffel, S., and J. Vandewalle. 1991. The Total Least Squares Problem: Computational Aspects and Analysis. Philadelphia: SIAM.
Van Loan, C. F. 1997. Introduction to Computational Science and Mathematics. Sudbury, Massachusetts: Jones and Bartlett.
Van Loan, C. F. 2000. Introduction to Scientific Computing, 2nd Ed. Upper Saddle River, New Jersey: Prentice-Hall.
Van der Vorst, H. A. 2003. Iterative Krylov Methods for Large Linear Systems. New York: Cambridge University Press.
Varga, R. S. 2004. Geršgorin and His Circles. New York: Springer.
Varga, R. S. 1962. Matrix Iterative Analysis. Englewood Cliffs, New Jersey: Prentice-Hall. (2000. Matrix Iterative Analysis: Second Revised and Expanded Edition. New York: Springer-Verlag.)
Wachspress, E. L. 1966. Iterative Solutions to Elliptic Systems. Englewood Cliffs, New Jersey: Prentice-Hall.
Watkins, D. S. 1991. Fundamentals of Matrix Computation. New York: Wiley.
Westfall, R. 1995. Never at Rest: A Biography of Isaac Newton, 2nd Ed. London: Cambridge University Press.
Whittaker, E., and G. Robinson. 1944. The Calculus of Observation, 4th Ed. London: Blackie. New York: Dover, 1967 (reprint).
Wilkinson, J. H. 1965. The Algebraic Eigenvalue Problem. Oxford: Clarendon Press. Reprinted 1988. New York: Oxford University Press.
Wilkinson, J. H. 1963. Rounding Errors in Algebraic Processes. Englewood Cliffs, New Jersey: Prentice-Hall. New York: Dover 1994 (reprint).
Wood, A. 1999. Introduction to Numerical Analysis. New York: Addison-Wesley.
Wright, S. J. 1997. Primal-Dual Interior-Point Methods. Philadelphia: SIAM.
Yamaguchi, F. 1988. Curves and Surfaces in Computer Aided Geometric Design. New York: Springer-Verlag.
Ye, Yinyu. 1997. Interior Point Algorithms. New York: Wiley.
Young, D. M. 1950. Iterative methods for solving partial difference equations of elliptic type. Ph.D. thesis. Cambridge, MA: Harvard University. See www.sccm.stanford.edu/pub/sccm/david young thesis.ps.gz
Young, D. M. 1971. Iterative Solution of Large Linear Systems. New York: Academic Press. New York: Dover 2003 (reprint).
Young, D. M., and R. T. Gregory. 1972. A Survey of Numerical Mathematics, Vols. 1–2. Reading, Massachusetts: Addison-Wesley. New York: Dover 1988 (reprint).
Ypma, T. J. 1995. “Historical development of the Newton-Raphson method.” Society for Industrial and Applied Mathematics Review 37, 531–551.
Zhang, Y. 1995. “Solving large-scale linear programs by interior-point methods under the MATLAB environment.” Technical Report TR96–01, Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, MD.
Index
Absolute error(s), 5–6, 14
Accelerated steepest descent procedure, 585 (CEx 13.2.2)
Accurate to n decimal/significant digits, 6
Accuracy
    first-degree spline, 256
    ODEs solutions, 307
$A^{-1}$ computing, 371
A-conjugate vectors, 418
Adams-Bashforth formula, 325, 347
Adams-Bashforth-Moulton method(s), 324, 328, 347
    adaptive scheme, 352
    first-order ODEs, 324
    integration, 249 (Ex 5.4.15)
    predictor-corrector scheme, 347
    problems, 249 (Ex 5.4.15), 329 (CEx 7.3.2–4)
    stiff equations, 353
Adams-Moulton formula, 325, 348
Adaptive integration, 234 (Fig 5.8)
Adaptive Runge-Kutta methods, 323
    industrial example, 323
Adaptive scheme, 352
Adaptive Simpson’s scheme, 231, 232 (Fig 5.7), 233
Adaptive two-point Gaussian integration, 250 (CEx 5.4.7)
Advection equation, 541
Aitken acceleration formula, 399
A-inner product, vectors, 418
Airfoil cross section, 252 (Fig 6.1)
Airy differential equation, 347 (CEx 7.4.16)
Algorithms:
    Berman, 571 (CEx 13.1.5)
    Complete Horner’s, 24, 31
    Conjugate Gradient, 419
    Converting Number Bases, 620
    Divided Differences, 162
    Fibonacci Search, 563
    Gauss-Huard/Jordan, 102 (CEx 2.2.24)
    Gaussian Elimination, 73
    Golden Section Search, 566, 567 (Fig 13.8–9)
    Gram-Schmidt Process, 444
    Linear Least Squares, 428 (Fig 9.2)
    Mother of All Pseudo-Random-Number-Generator, 483
    Moler-Morrison, 151 (CEx 3.3.14)
    Multivariate, Minimization Functions, 573
    Naive Gaussian Elimination, 72
    Natural Cubic Spline Functions, 266, 268 (Fig 6.7), 276
    Nelder-Mead, 580
    Neville’s, 170–172
    Newton Form, 127
    Normalized Tridiagonal, 111 (CEx 2.3.12)
    Overview of Adaptive Process, 322
    Polynomial Fitting, 438
    Polynomial Interpolation, 165
    Power Method, 398
    Quadratic Interpolation Algorithm, 568
    Quadratic Spline Interpolation at the Knots, 258
    Quasi-Newton Method, 580
    rand() in Unix, 483
    Random Numbers, 481, 483
    Richardson Extrapolation, 193, 222–224
    Romberg, 217
        Euler-Maclaurin Formula, 221
        Richardson Extrapolation, 193, 222–224
    Secant Method, Roots of Equations, 142
    Shooting Method, IVPs, 508–510
        Modifications, Refinements, 510
    Simplex Method, 599
    Solving Natural Cubic Spline Tridiagonal System Directly, 269
    Variable Metric, 580
Alternating Series Principle/Theorem, 29, 31, 32 (Ex 1.2.13)
Anti-derivative, 201
Ariane rocket, 51
Arcsine routine, 67 (CEx 1.4.20)
Arctan routine/series, 34 (CEx 1.1.2), 67 (CEx 1.4.21–22)
Area/volume estimations, 491, 495
Area base triangle, 552
Arithmetic:
    Babylonian, 623
    IEEE standard floating-point, 41, 615
    Mayan, 624
Arithmetic mean, 17 (CEx 1.1.7)
Army of Flanders, 297 (CEx 6.3.7)
Attraction, fractal basins, 135 (Fig 3.8), 140 (CEx 3.2.27)
Autonomous ODEs, 337, 341
Augmented matrix, 598
$B_i^0$ spline, 281 (Fig 6.12)
$B_i^1$ spline, 282 (Fig 6.13)
$B_i^k$ spline recurrence, 282, 283 (Fig 6.14)
    linear combination, 283
Babylonian arithmetic, 624 (Ex B.1.17)
Backslash, MATLAB, 97
Back substitution,
    Gaussian algorithm, 72, 76, 80, 89, 91, 98
    Triangular system, 104, 110
    Pentadiagonal system, 108
Backward error analysis, 47, 49
Backward stable, 96
Bad algorithm, 65 (Ex 1.4.37)
Banded linear systems, 103–110
    block pentadiagonal, 108
    pentadiagonal, 107
    strictly diagonal dominance, 106, 110, 416
    tridiagonal, 103
Banded storage mode, 112 (CEx 2.3.19)
Banker’s rounding, 7
Bases, numbers, 616–623
    conversions:
        γ to β, 616
        Fahrenheit-Celsius, 624 (Ex B.1.14)
        fractional part, 616, 617, 619
        integer part, 617
        10 ↔ 2, 618
        10 ↔ 8 ↔ 2, 620
        16 ↔ 2, 622
Basic Simpson’s rule, 227, 228 (Fig 5.5), 238 (Ex 5.3.8)
    error term, 230
    uniform spacing, 229
Basic trapezoid rule, 203, 207, 213, 228 (Fig 5.4)
    typical trapezoid, 203 (Fig 5.1), 228 (Fig 5.4)
Basins of attraction, 134, 140 (CEx 3.2.27)
Basis functions, 167, 431, 432, 435, 555
Bell splines, 281
Berman algorithm, 571 (CEx 13.1.5)
Bernoulli numbers, 221
Bernstein polynomials, 292, 295
    first few, 292 (Fig 6.18)
Bessel functions, 37 (CEx 1.2.23), 226 (CEx 5.2.11)
Best-step steepest descent, 577
Bézier curves, 291, 295
Big O notation, 27
Biharmonic equation, 525
Binary-odd table, 621
Binary search, intervals, 263 (CEx 6.1.2)
Binary system, 617
Binomial series, 20, 31 (Ex 1.2.1)
Birthday problem, 499, 500 (Table 10.1)
Bisection method, nonlinear equations, 114, 116, 122
    convergence analysis, 119–120
    error bound, 119–120
    false position method, 121, 122
    illustrating error upper bound, 119 (Fig 3.1)
    secant vs. Newton’s method, 147
Bit reversal, 474
Bivariate functions, 172
Block pentadiagonal linear systems, 108
Boundary points, irregular mesh spacing, 546 (Fig 12.12)
Boundary value problems (BVPs), 507
Bratu’s problem, 522 (CEx 11.2.7b)
B spline, 281–298
    approximation/interpolation, 286
    Bézier curves, 295
    degree 0, 281, 293, 286
    degree 1, 282, 286
    degree k, 282
    higher degree, 287
    interpolation, 286
    Schoenberg’s process, 289
Buckling circular ring, 522–523 (CEx 11.2.8)
Buffon’s needle problem, 501
BVPs, 507
Calculus, Fundamental Theorem, 202
Cancellation error, 65 (Ex 1.4.37)
Cantilever beam, 424 (CEx 8.4.10b)
Cardinal polynomials, 155, 156 (Fig 4.1), 173
Cardioid, 504 (CEx 10.3.6)
Catenary, 114
Cauchy-Riemann equation, 137 (Ex 3.2.40)
Cauchy-Schwarz inequality, 433 (Ex 9.1.9), 577
Caution, 30, 90, 231, 372, 484, 563
Cayley-Hamilton Theorem, 395 (CEx 8.2.5)
Central difference formula, 16 (CEx 1.1.3), 194, 196–198 (CEx 4.3.5), 513
    four nodes, 196 (Fig 4.13)
    two nodes, 194 (Fig 4.12)
Centroids, 581
Change of intervals, 240
Change of variables, 437
Chapeau functions of B splines, 282
Characteristic equation, 381
Characteristic polynomials, 381, 449
Chebyshev nodes, 180, 185, 187 (CEx 4.2.10)
Chebyshev nodes of $T_9$, 181 (Fig 4.9)
Chebyshev polynomials, 168, 169 (Fig 4.4), 180, 181 (Fig 4.9), 435, 436 (Fig 9.4), 437–439
    orthogonal systems, 435
    algorithm, 438
    orthonormal basis functions, 435
    polynomial regression, 440
    properties, 441
    recurrence relation, 436, 445
Checkerboard ordering, 558 (Ex 12.3.3)
Cholesky factorization, 378 (Ex 8.1.24)
Chopping numbers, 7, 47
Chopped to n digits, 7
Clamped cubic splines, 265
Coefficients $a_i$, 160
Collocation method, 556
Companion matrix, 395 (CEx 8.2.3)
Complete Horner’s algorithm, 24, 31
Complete pivoting, 84–85, 96
Complex Fourier series, 470
Components, in vectors, 630
Component-by-component, 629
Composite Gaussian three-point rule, 251 (CEx 5.4.11c)
Composite midpoint rule with unequal/equal subintervals, 214 (Ex 5.1.6–7), 496 (Ex 10.2.3)
Composite rectangle rule, nonuniform/uniform spacing, 216 (Ex 5.1.33–24, 5.1.36–37)
Composite Gaussian rule, 251 (CEx 5.4.11c)
Composite Simpson’s rule, 231, 237 (Ex 5.3.6), 251 (CEx 5.4.11b)
    error term, 231
Composite trapezoid rule, 207, 213, 216 (Ex 5.1.36–37), 250 (CEx 5.4.11a)
Composite trapezoid rule with uniform spacing, 213, 216 (Ex 5.1.36)
Computation, noise, 197
Computer-aided geometric design, 298 (CEx 6.3.19)
Computer-caused, loss of precision, 57
Computer error, representing numbers, 46
Concrete, reinforced, 125 (CEx 3.1.16)
Condition number, bound, 407
Condition number, linear systems, 97, 388, 406
Conjugate gradient method, 419
    features, 420
Constrained minimization problems, 560
Continuity of functions, 253
Contour diagrams, 577
Control points, in drawing curves, 252, 291, 293 (Fig 6.19)
Convergence analysis
    bisection method, 119
    Newton’s method, 125–130
    secant method, 144
Convergence theorems, 414
Convergent solution curves, 326 (Fig 7.5)
Convex hull, of vectors, 293
Corollary:
    Divided Differences, 184
    Similar to Triangular Matrix, 384
    Strictly Diagonally Dominant Matrix, 384
Correctly rounded value, 46, 627
Correct rounding, 46
Corrected value, 328
Cosine series, 20, 464, 475
Cramer’s Rule, 428, 638
Crank-Nicolson method, 529, 531
    alternative stencil, 531, 532 (Fig 12.5)
    implicit stencil, 529 (Fig 12.4)
Crout factorization, 369, 379 (CEx 8.1.2)
Cryptology, 11
Cubic B spline, 297 (Ex 6.3.38)
    adjacent pieces $S_{i-1}$, $S_i$, 267 (Fig 6.6)
Cubic interpolating spline, 371
Cubic interpolation, 186 (Ex 4.2.5c)
Cubic splines, example, 268 (Fig 6.7)
Curve fitting, B splines, 288
Curve using control points, 293 (Fig 6.19)
Curvilinear regression, 440
D(n, 0) formula, 191
D(n, m) formula, 191
    error term, 192
Danielson-Lanczos Lemma, 474
Dawson integral, 310 (CEx 7.1.12)
Decimal places, accuracy, 6
Decimal point, 617
Decompose/factor matrix, 361
Definite/indefinite integrals, 201
Definitions:
    Matrices, 631
        Elementary consequences, 637
        Equality, 633
        Inequality, 633
        Addition/Subtraction, 633
        Scalar Product, 633
    Spline, Degree 1, 253, 260
    Spline, Degree 2, 256, 260
    Spline, Degree k, 264, 276
    Symmetric Positive Definite, 369, 383, 416, 580
    Vector properties:
        Addition/Subtraction, 630
        Equality, 630
        Inequality, 630
        Scalar Product, 630
Deflation process, 10, 385
Deflation of polynomials, 9
Delay ODEs, 319 (CEx 7.2.17)
de Moivre’s equation, 470, 477
de Moivre numbers, 471, 477
Derivative(s):
    approximations, 188–189
    B splines, 284, 294
    central difference rule, 189
    computation, 12–13
    divided differences, 185
    error analysis, 188
    estimating, 187
    f ′ central difference rule, 189, 197
    f ′ forward difference rule, 188–189, 194, 197
    f ′((x1 + x2)/2) central difference rule, 195
    f ′′ central difference rule, 196, 198
    f ′ via Taylor series, 188
    f ′ via interpolation polynomials, 194
    Lanczos’ generalized, 199 (Ex 4.3.21)
    noise in computation, 197
    polynomial interpolation estimating, 194–196
    Richardson extrapolation, 190–191
    second, via Taylor series, 196
    symbolically, 194
    Taylor series estimating, 188–189
Determinants, 101 (CEx 2.2.14)
Developing mathematical software, 615
Diagonal dominance, 529, 415
Diagonal matrices, 106, 632
Diet problem, 597 (CEx 14.1.5)
Differential equations, 299 (See ODEs; PDEs)
Differentiation, 153
Diffusion equation, 525
Direct error analysis, 49
Direct method, eigenvalues, 381
Dirichlet condition, 463, 526
Dirichlet function, 179
Dirichlet problem, 556
Discrete Fourier Transform (DFT), 472, 477
Discretization method, 513–520
Divergent solution curves, 326 (Fig 7.4)
Divided differences, 162, 164, 173
    calculating coefficients $a_i$, 160
    divided difference table, 164, 174, 185
Divided Differences Corollary, 184
Doolittle factorization, 369, 379 (CEx 8.1.2)
Dot product, vectors, 631, 634
Double-length accumulator, 5
Double-precision floating-point representation, 43, 51
    precision/range, 45
    computer word, 43, 45
Double root, 149 (CEx 3.2.19)
Dual problem, LPPs, 590, 591, 593, 600
$e^x$ approximation/routine/series, 20, 23, 66 (CEx 1.4.10–12)
$e^x$ partial sum approximations, 23 (Fig 1.4)
Easy/Difficult Problem Pairs, 11, 19 (CEx 1.1.24)
Economical version, singular value decomposition, 394 (Ex 8.2.5)
Eigenvalues/eigenvectors, 12, 81 (CEx 2.1.6), 380–393
    Gershgorin’s Theorem, 386
    linear differential equations, 391
    properties, 383
    singular value decomposition, 387–391
Electrical circuit/network, 69 (Fig 2.1), 81 (CEx 2.1.7)
Electrical field theory, 201
Electrical power cable, 114
Elementary matrix $M_{pq}$, 362, 374
Elliptic integrals, 35–36 (CEx 1.2.14), 67 (CEx 1.4.22), 201
Elliptic paraboloid, 561 (Fig 13.1)
Elliptic problems, PDEs, 535 (Ex 12.1.1), 544
    finite-difference method, 544
    finite-element methods, 551
    Gauss-Seidel method, 411, 422, 548
    Helmholtz equation, 526, 544
Engineering design problem, 560
Engineering example/problem, 115, 125 (CEx 3.1.14), 352
Engineering polynomials, 125 (CEx 3.1.15)
Engineering ODE example, 352
Epsilon, machine, 43, 48, 626
Equal oscillation property, 169
Error(s):
    absolute/relative, 5–6
    ODEs, 307
    polynomial interpolation, 178
    roundoff, 46
    single-step, 322
    trapezoid rule analysis, 205
    truncation, 188, 197
    unit roundoff, 46
    vector, 79, 102 (CEx 2.2.19)
Error function, 52 (Ex 1.2.52), 205
Error term, 27
Error vector, 79, 102 (CEx 2.2.19), 414, 421
Euclidean/l2-vector norm, 405
Euler-Bernoulli beam, 424 (CEx 8.4.10)
Euler-Maclaurin formula, 221, 225 (Ex 5.2.26)
Euler’s constant, 55 (CEx 1.3.7–8)
Euler’s equation/formula, 354, 469, 477, 507
Euler’s method, 305, 309
    example solution curve, 306 (Fig 7.3)
    improved, 309 (Ex 7.1.15)
Even function, 464, 475
Expanded reflected points, 581
Expansion, finite, 39
Experimental data/values, 440, 426 (Fig 9.1)
Explicit method, PDEs, 528, 535 (Ex 12.1.12)
Exponents, 39
Exponential distribution(s), 491 (CEx 10.1.20)
Extrapolation, general, 222–224
Extrapolation, Richardson, 191, 193
fl notation, 47, 49, 58
Factoring matrices, 358, 361–374
Failure, Gaussian elimination, 78
False position method, 121 (Fig 3.2)
    modified, 122
Fast Fourier Transform (FFT), 473, 477
Feasible set, vectors, 588
Fibonacci numbers/sequence/series, 36 (CEx 1.2.16), 146, 563
Fibonacci search algorithm, 563–565 (Fig 13.4–7)
Fill-in, 95
Finite-difference approximations, 513
Finite-difference method, 527, 544
Finite-element methods, 551, 555
    base triangle, 552 (Fig 12.15)
    triangularization, 553 (Fig 12.16)
Finite expansion, 39
First bad case, quadratic interpolation algorithm, 570
First-degree polynomial accuracy theorem, 255
First-degree spline, 252
First-degree spline accuracy theorem, 256
First-derivative formulas, 194–196
First-order ODE systems, 331
    Constrained/Unconstrained, 332
    Taylor series method, 332
First primal form, LPPs, 587, 590
    Nonnegative variables, 602
    Transforming into, 590
First programming experiment, 12
Five-point formula, Laplace’s equation, 545, 546, 557
Fixed point iteration, 148 (Fig 3.10)
Floating-point numbers, 137 (Ex 3.2.24)
    computer errors, 46
    double-precision, 43
    equality, 613
    floating-point machine number [fl(x)], 48, 52
    IEEE standard arithmetic, 626
    normalized, 38
    single-precision, 42
    standard, 41, 626
Floating-point representation, 38, 41
Forward difference, Newton’s form, 177 (Ex 4.1.38)
Forward difference, two nodes, 194 (Fig 4.11)
Forward elimination, in Gaussian algorithm, 71–72, 79, 88, 98
Forward elimination, pentadiagonal system, 108
Forward elimination, tridiagonal system, 104, 109–110
Fourier coefficients, 462, 463, 475
Fourier matrix, 475
Fourier series, 66 (CEx 1.4.15), 459, 463, 468
    examples, 468–469
    music, 467
    N-th degree approximations, 462, 475
    periodic on intervals, 466, 476
Fractal basins of attraction, 134, 140 (CEx 3.2.27)
Fractional numbers, converting bases, 619
Fractional parts, 615, 616, 618
French railroad system problem, 504 (CEx 10.3.3)
Fresnel sine integral, 216 (CEx 5.1.5c)
Frobenius norm, 423 (Ex 8.4.10)
Full pivoting, 84, 96
Fully implicit method, PDEs, 535 (Ex 12.1.13)
Function and Inverse Function, 170 (Fig 4.5)
Function iteration, 151 (CEx 3.3.14–17)
Function, widely oscillating, 275 (Fig 6.11)
Functions, minimization,
    multivariate case, 573
        advanced algorithms, 577
        alternative form, Taylor series, 575
        contour diagrams, 577
        contours $F(x) = 25x_1^2 + x_2^2$, 577 (Fig 13.12)
        Fibonacci search algorithm, 563 (Fig 13.4–7)
        $F(x) = x^2 + \sin(53x)$, 561 (Fig 13.2)
        Golden section search algorithm, 565, 566 (Fig 13.8–9)
        minimum/maximum/saddle points, 579
        Nelder-Mead algorithm, 580
        path, steepest descent, 577 (Fig 13.13)
        positive definite matrix, 369, 383, 416, 580
        quasi-Newton methods, 580
        simple quadratic surfaces, 578 (Fig 13.14)
        simulated annealing method, 581
        steepest descent procedure, 576
        Taylor series algorithm: 1st/2nd bad case, 569 (Fig 13.10–11)
        Taylor series, F, 572, 573
    one-variable case, 561
        Fibonacci search algorithm, 563
        golden section search algorithm, 566
        quadratic interpolation algorithm, 568
        unconstrained/constrained problems, 561
        unimodal functions F, 563
        unimodal/non-unimodal functions, 562 (Fig 13.3)
    test functions, 585 (CEx 13.2.1)
        Fletcher-Powell, Powell, Rosenbrock, Woods, 585
Fundamental Theorem of Calculus, 33–34 (Ex 1.2.52)
Galerkin equation, 555
Gauss-Huard algorithm, 102 (CEx 2.2.24)
Gaussian continued functions, 66 (CEx 1.4.12), 67 (CEx 1.4.18–19)
Gaussian elimination
    algorithm, 72
    classical (no pivot) A = LU, 95 (Fig 2.2)
    complete pivoting, 96 (Fig 2.4)
    naive, 69, 638
    no pivoting, 86, 95
    operation count, 92
    partial pivoting, 95, 96 (Fig 2.3)
    scaled partial pivoting, 85
    stability, 93
    updating rhs, 98
    variants, 94
Gaussian method, elliptic integrals, 35–36 (CEx 1.2.14)
    change of intervals, 240
    composite three-point, 251 (CEx 5.4.11)
    integrals with singularities, 246
    Legendre polynomials, 243–244
    nodes/weights, 239
Gaussian Probability Integral, 216 (CEx 5.1.5a)
Gaussian quadrature, 239, 244
Gaussian rules, 239, 241
Gauss-Jordan algorithm, 102 (CEx 2.2.24)
Gauss-Legendre quadrature formulas, 241
Gauss-Seidel iteration, 411, 548
Gauss-Seidel method, 409, 548
General extrapolation, 222
Generalized Neumann equation, 525
Generalized Newton’s method, 137 (Ex 3.2.36)
General quadratic functions, 583 (Ex 13.2.15)
Geometric series, 78
Gershgorin discs, 386, 387 (Fig 8.1)
Gershgorin’s Theorem, 386
Gibbs phenomenon, 463
Global minimum point of F, 563
Global positioning systems, 142 (CEx 3.2.41)
Golden ratio, 145, 570 (CEx 13.1.5), 565
Golden section search algorithm, 565, 566 (Fig 13.8–9)
    vs. Fibonacci search, 468
Goodness of fit, 266
Gradient vector matrix, 573
Gram-Schmidt process, 445
Great Internet Mersenne Prime Search (GIMPS), 489
Growth factors, 84
Halley’s method, 151 (CEx 3.3.13)
Hallways, board across, 125 (CEx 3.1.17)
Harmonic function, 545, 556
Harmonic series, 55 (CEx 1.3.7)
Hat functions, B splines, 282
Heat balance in a rod, 523 (CEx 11.2.9)
Heat equation model problem, 524, 525, 526
    contour plot, 533 (Fig 12.6)
    explicit stencil, 528 (Fig 12.3)
    step-size condition, 533
    solution surface, 533 (Fig 12.6)
    xt-plane, 527 (Fig 12.2)
Heated rod, 527 (Fig 12.1)
Helmholtz equation, 544
    five-point star, 546 (Fig 12.13)
    uniform grid spacing, 547 (Fig 12.14)
Hermitian matrices,
Hessenberg matrix, 112 (CEx 2.3.18)
Hessian matrix, 574, 583
Heun’s method, 309 (Ex 7.1.15)
Hexadecimal-Binary Table, 622
Hexadecimal system, 617
Hidden bits, 42
High degree polynomial, 154, 179
Higher-order ODE systems, 339
    autonomous, 341
    systems, 341
Hilbert matrix, 101 (CEx 2.2.4), 376 (Ex 8.1.7), 407, 435, 457 (Ex 9.3.2)
Histograms, 505 (CEx 10.3.13)
History, 51
    computing π, 8, 611
    Gaussian elimination, 93–94
    ODE numerical methods, 355
Hole at zero, 41
Horner’s algorithm, 9–10
Horner’s algorithm, complete, 24
Horner’s algorithm, p, p′, 11
Hyperbolic problems, PDEs, 526, 535 (Ex 12.1.1), 536
    advection equation, 541
    analytical solution, 537
    Lax method/scheme, 541
    Lax-Wendroff method/scheme, 542
    upwind method, 541
    wave equation, 524, 536
Hyperbolic sine, 64 (Ex 1.4.5)
Hyperbolic tangent, 64 (Ex 1.4.18)
Ice cream cone region, 495 (Fig 10.7)
Identity matrix, 632
IEEE floating-point standard arithmetic (IEEE-754), 626
    32-bit machine inquiry functions results, 627 (Table C.2)
Ill-conditioned linear system, 5, 79, 406, 407
Ill-conditioned matrix, 78, 97, 101, 406
Ill-conditioned ODE, 318 (CEx 7.2.5)
Ill-conditioned problem, 65 (Ex 1.4.37)
Ill-conditioning, 406
Improved Euler’s method, 309 (Ex 7.1.15)
Incompatible systems, 448
Inconsistent linear systems, 447, 601
Industrial example/problem, 14, 323
Initial value problems (IVPs), 299–300, 330 (CEx 7.3.17), 508, 511
Inner product, 441, 630
Inner product, integral, 460
Integer computer arithmetic, 44
Integer-fraction-split, 623
Integer parts, 616, 617
Integrals, 201
    Dawson, 310 (CEx 7.1.12)
    elliptic, 35–36 (CEx 1.2.14)
    Shi(x), 330 (CEx 7.3.6)
    Si(x), 217 (CEx 5.2.9)
    singularities, 246
    standard, 461
Integration by parts, 461
Integration, numerical, 201
    area/volume estimation, 491, 494
    B splines, 285
    definite/indefinite, 201
    Gaussian quadrature formulas/rules
        Boole’s, 235
        change of intervals, 240
        integrals with singularities, 246
        Legendre polynomials, 243
        Newton-Cotes Closed, 235
        Newton-Cotes Open, 236
        nodes/weights, 239
    Romberg algorithm, 218–219
        Romberg array, 218–219, 225 (Ex 5.2.11)
        Euler-Maclaurin formula, 221, 225 (Ex 5.2.26)
        Richardson extrapolation, 190, 198
    Simpson’s rule, 228
        adaptive, 234
        basic, 235
        composite, 231
    Newton-Cotes rules, 235–236
        Closed, 235
        Open, 236
    trapezoid rule,
        composite, 207
        error analysis, 205–207
        multidimensional integration, 211
        uniform spacing, 229
Integration formulas, 461
Intel Pentium processor, 51
Intermediate-Value Theorem, 116, 208
Interpolate, 154, 568
Interpolation, polynomials, 153
    splines, 358–258, 286, 371
Interpolation polynomial
    nine equal Chebyshev nodes, 180 (Fig 4.8)
    nine equal spaced nodes, 180 (Fig 4.7)
    over data points, 179 (Fig 4.6)
Invariance theorem, 163
Inverse Discrete Fourier Transform (DFT), 472, 477
Inverse function, 170 (Fig 4.5)
Inverse Hyperbolic sine, 64 (Ex 1.4.11)
Inverse polynomial interpolation, 169
Inverse power method, 400, 403
Irregular five-point formula, Laplace’s equation, 546
Iteration matrix, 408, 414
    Gauss-Seidel, 411, 422, 548
    Jacobi, 410, 422
    SOR, 412, 422
Iterative method, 405
    basic, 408
    Gauss-Seidel, 411, 421, 548
    Jacobi, 416, 421
    Richardson, 408
    SOR, 412, 421
Iterative solutions, linear systems, 405
    fixed point, 148
    Newton-Raphson, 125
    Richardson, 408
IVPs, 300, 330 (CEx 7.3.17), 508, 511
Jacobian matrix, 132–135
Jacobi iteration, 410
Jacobi method, 409, 421
Jacobi overrelaxation (JOR) method, 417
Kepler’s equation, 129 (CEx 3.2.6)
Kirchhoff’s law, 69
Knots, spline theory, 258
Kronecker delta, 155, 173, 461
kth residual, 447
l1/l2 approximation, 427, 432
l1/l∞ problem, 601, 604
Lagrange cardinal polynomial, 156 (Fig 4.1), 173
Lagrange form, polynomial interpolation, 155, 173
Lagrange quadrature polynomial, 229
Lanczos’ generalized derivative, 199 (Ex 4.3.21)
Landen transformation, ascending, 67 (CEx 1.4.22)
LAPACK, 383
Laplace’s equations, 109, 526, 544
    five-point stencil, 545 (Fig 12.11)
Laplace operator, 557
Laplacian operator, 525
Largest positive floating-point single-precision number, 52 (Ex 1.3.10)
Lax method/scheme, 541
Lax-Wendroff method/scheme, 542
$LDL^T$ factorizations, 367, 378 (Ex 8.1.24)
Least squares method, 426, 584 (Ex 13.2.20)
    basis function, 431
    linear example, 429 (Fig 9.2)
    nonlinear example, 450
    nonpolynomial example, 430, 431 (Fig 9.3)
    singular value decomposition (SVD), 388, 451–452
    weight function, 448
Least squares principle, 447
Least squares problem, 443
Least squares series, 459
Lebesgue constants, 66 (CEx 1.4.15)
Legendre polynomials, 243
Legendre’s elliptic integral relation, 35–36 (CEx 1.2.14)
Lemma: Upper Bound, 182
Lemniscate, 504 (CEx 10.3.7)
Length of vectors, 405
L’Hôpital’s rule, 33 (Ex 1.2.49)
Linear algebra, concepts/notation, 629–638
    Inequality constraints, 603
Linear B spline, 296 (Ex 6.3.36)
Linear combinations, 630
Linear convergence, 120
Linear functional φ, 397, 508 (Fig 11.3)
Linear interpolation, 154, 186 (Ex 4.2.5–8)
Linear least squares, 426, 429, 434 (Ex 9.1.22)
Linear polynomial interpolation, 154
Linear programming problems (LPPs), 587
Linear systems: banded, 103–110
block pentadiagonal, 108 diagonal matrix, 632 pentadiagonal, 107
strictly diagonal dominance, 106,
110, 416 tridiagonal, 632
eigenvalues/eigenvectors, 381 Gershgorin’s Theorem, 386 operation count, 82
singular value decomposition,
388 inconsistent, 601
iterative solutions, 405
condition number, 406–407 conjugate gradient method, 419 convergence theorems, 416 ill-conditioning, 388, 406 matrix formulation, 416, 422 overrelaxation, 409, 417
matrix factorizations: A−1, 371
LDLT factorization, 367
LU factorization, 358 multiple right-hand sides, 371
naive Gaussian elimination, 71 algorithm, 73
residual/error vectors,
102 (CEx 2.2.19) power method, 398, 402
Aitken acceleration formula, 399
algorithms, 398
inverse, 401, 403 shifted inverse, 402, 403
Linear transformation, 240, 247 Linearize-solve approach to solving
nonlinear systems, 132 Linearly independent sets, 432
Loaded die problem, 498
Local minimum points, functions, 562 Local truncation error, 308 Logarithmic integral, 217 (CEx 5.1.10) Logarithmic series, 20–21
Lorenz problem, 356 (CEx 7.5.7b) Loss of precision, computer-caused, 57 Loss of precision, avoiding in
subtraction, 60
Loss of Precision Theorem, 59 Lower triangular matrix, 358
LPPs, 587
approximate solution, inconsistent
linear systems, 601 l1 problem, 599, 606 l∞ problem, 604, 606
dual problem, 591, 593
first primal form, 586, 590, 593 graphical method dual problem,
592 (Fig 14.2) graphical solution method,
589 (Fig 14.1)
second primal form, 592, 594 simplex method, 599
LU factorization, 362
solving linear systems, 365
Machine epsilon, 49, 626
Machine numbers, 39
Maclaurin series, 22, 31 (Ex 1.2.1),
37 (CEx 1.2.21) Magnitude of vectors, 405
Mantissa, normalized, 39
Maple, mathematical software, 13 Marching problem/method, 528 March of B splines, 297 (CEx 6.3.6) Mariner space probe, 608
Markov chains, 506
Mathematica, mathematical
software, 13
Mathematical Preliminaries, 20 MATLAB, mathematical software, 13 Matrix/Matrices, 631
companion, 395 (CEx 8.2.3) conjugate transpose, 383 diagonal, 632
Gershgorin’s Theorem, 386 gradient vector, 573 Hermitian, 383
Hessian, 573
Hilbert, 101 (CEx 2.2.4),
457 (Ex 9.3.2) identity, 632
inverse, 637
Jacobian, 132–135
near-deficiency in rank, 455 permutation, 372
positive definite, 369, 383, 416, 580 row-equilibrated, 100 (Ex 2.2.23) similar, 384
singular values, 388
symmetric, 383
symmetric positive definite (SPD),
369, 383, 416, 580
transpose, 383 triangular, 630 unitarily similar, 384 unitary, 384
unit lower triangular, 358 upper triangular, 359 Vandermonde, 167 Vandermonde determinant,
177 (Ex 4.1.47) Matrix factorization(s), 358
A, 362, 372 A−1, 371–372
LDLT, 367–369 LLT, 369–371 LU, 70, 359–375 PA, 372, 375
QR, 448
Cholesky, 369, 370
Crout, 369
Doolittle, 369
multiple right-hand sides, 371 SVD, 451–454
Matrix inverse, 70, 636 Matrix-matrix product, 634
post-multiplication, 635
pre-multiplication, 635 Matrix multiplication
non-commutative, 637 Matrix norms:
l1, l2, l∞, 406
properties, 406
Matrix transpose, 628, 636
columns/rows, 636
Matrix triangular inequality, 405 Matrix-Vector form, 73 Matrix-Vector product by
columns/rows, 633 Maximum points, functions, 578 Mayan arithmetic, 623 (Ex B.1.16) Mean, arithmetic, 17 (CEx 1.1.7) Mean square error, 462 Mean-Value Theorem, 26–27 Measured quantity, 56
Mersenne prime number, 489 Mesh points, natural ordering,
109 (Fig 2.5), 547
red-black ordering, 558 (Ex 12.3.3)
Metal shaft, circular, 124 (CEx 3.1.11)
Midpoint composite rule, 214 (Ex 5.1.6–7),
330 (CEx 7.3.8)
Minimal solution, linear equations, 453
Minimization problems/functions, 560, 581
Constrained/Unconstrained, 561
One variable case, 562 Minimization test functions,
584 (CEx 13.2.1) Minimum points, functions, 578 Mixed Dirichlet/Neumann
equation, 526
Modified false position method, 122,
122 (Fig 3.2)
Modified Gram-Schmidt Process, 448 Modified Newton’s method, 131,
137 (Ex 3.2.35) Modulus of continuity, spline
functions, 255 Molecular conformation,
585 (CEx 13.2.10) Moler-Morrison algorithm,
151 (CEx 3.3.14) Monomial, 167, 168 (Fig 4.3),
436 (Fig 9.4) Monte Carlo methods, 481 area/volume estimation,
491, 494
approximation of integrals, 491
multi-integral, 492, 494 computing, 494
ice cream cone example, 495
random numbers, 481 algorithms/generators, 482
Muller’s method, 151 (CEx 3.3.17) Multidimensional integration, 211 Multiple zero, 96, 131, 137 (Ex 3.2.35) Multiplication, nested, 14,
15 (Ex 1.1.6), 619 Multiplicity of a zero, 131,
132 (Fig 3.7) Multiplier, 4, 71–73
Multipliers, in Gaussian algorithm, 73 Multi-step methods, 320, 347 Multivariate, minimization functions:
advanced algorithms, 577 contour diagrams, 577
general quadratic function, 585
(Ex 13.2.19) minimum/maximum/saddle
points, 579
Nelder-Mead algorithm, 580 positive definite matrix, 369, 383,
416, 580
quasi-Newton methods, 580 simulated annealing method, 581
steepest descent procedure, 576 Taylor series F, 572, 584 (Ex 13.2.5)
Naive Gaussian elimination, 69, 72, 75
algorithm, 72
failure, 82
NaN (Not a Number), 42, 627 Natural cubic spline functions,
263, 265 curve, 273 (Fig 6.8)
examples, 268 (Fig 6.7), 273 (Fig 6.8)
smoothness property, 274
space curves, 273
Natural logarithm (ln), 20, 26
routine, 67 (CEx 1.4.18) Natural ordering mesh
points, 109 (Fig 2.5), 547 Navier-Stokes equation, 525 Nelder-Mead algorithm, 580
illustration, 128
interpretation, 126
modified, 137 (Ex 3.2.35) nonlinear equation systems, 132 sample problem, 128
Taylor series approach, 126
three steps, 128 (Fig 3.5) Newton’s method, nonlinear
systems, 133 Nine-point formula, Laplace’s
equation, 545, 559 (Ex 12.3.10) Nodes
Chebyshev, 187 (CEx 4.2.10) Gaussian, 239, 245, 246
plane triangular elements, 551 polynomial interpolation, 154 spline theory, 257, 258
x in equally spaced nodes, 182 (Fig 4.10)
Noise in computation, 197 Nonlinear equation systems, 132,
138 (Ex 3.2.39) Nonlinear least squares
problems, 450 Nonperiodic spline filter, 113 (CEx 2.3.22)
Nonnegative machine numbers, 40 Nonnegative normalized machine
numbers, 40 Nonuniformly distributed random
numbers in circle,
488 (Fig 10.5) Nonuniformly distributed random
numbers in ellipse,
487 (Fig 10.3–4)
Normal equations, 428, 430, 431,
432, 435, 438, 439, 448, 449,
457, 555 Normalized floating-point
representation, 38 Normalized mantissa, 39
Normalized scientific notation, 38 Normalized tridiagonal algorithm,
111 (CEx 2.3.12) Norm(s):
matrix l1, l2, l∞, 406
vector l1, l2, l∞, 405, 427 n-simplex sets, 579
Numeric Inquiry Functions (Fortran),
627 (Table C.1) Numerical analysis, 2
Numerical computer failures, 51 Numerical differentiation, 187
Nested form of polynomial interpolation, 158
Nested multiplication, 8, 14, 159, 15 (Ex 1.1.6)
Nested polynomial evaluation, 8, 10, 14
Neumann boundary condition, 526
Neutron shielding problem, 502, 503 (Fig 10.9)
Neville’s algorithm, 170–172 Newton-Cotes rules Closed/Open,
235–236, 238 (CEx 5.3.7) Newton polynomials, 159 (Fig 4.2) Newton-Raphson iteration, 125 Newton’s form of polynomial
interpolation, 157, 161, 173, 177 (Ex 4.1.38–39), 185, 187 (CEx 4.2.14)
Newton’s method, nonlinear equations, 127, 126 (Fig 3.4)
bisection vs. secant method, 147 convergence analysis, 129
error bound, 129
failure due to bad starting points,
131 (Fig 3.6) formula, 127
forward-difference form, 177 (Ex 4.1.38–39)
fractal basins of attraction, 134, 137 (Ex 3.2.37)
geometric approach, 126 generalized, 137 (Ex 3.2.37)
Numerical integrals, 201 error analysis, 205 multidimensional, 212
Numerical stability, 93, 532
Objective functions, 588 Octal system, 617
Odd function, 464, 475 Odd periodic functions, 538 Ohm’s law, 69
Olver’s method, 151 (CEx 3.3.12) One-variable case, minimization
functions, 562
Fibonacci search algorithm, 563 Golden section search algorithm,
565, 566 (Fig 13.8–9) quadratic interpolation
algorithm, 568 unconstrained/constrained
problems, 561 unimodal functions F, 563
Operation count, Gauss/Solve, 93 Optimization example, LPPs, 588 Ordering, natural, 547
Ordering, red-black (checkerboard),
558 (Ex 12.3.3) Ordinary differential equations
(ODEs), 299 Adams-Bashforth-Moulton
formulas, 328, 347 error types, 307
integration, 302
IVPs, 347–348 Runge-Kutta methods, 311
adaptive, 320
order 2, 313, 316 order 3, 317 (Ex 7.2.7) order 4, 314, 316,
319 (CEx 7.2.14a–b), 343 order 5, 319 (CEx 7.2.15) Taylor series, two variables, 311
stability analysis, 325
Taylor series methods, higher
order, 306
vector fields, 303 (Fig 7.1),
304 (Fig 7.2) ODEs systems:
Adams-Bashforth-Moulton methods, 347–348
adaptive scheme, 352 autonomous ODE, 337, 343 predictor-corrector scheme, 347 stiff equations, 353
autonomous, 341 first order methods:
Runge-Kutta, 334
Taylor series, 335 uncoupled/coupled systems, 332
higher order, 339 system(s), 334, 341
ODEs/BVPs:
discretization method, 513 shooting method, 508
IVPs, 510, 511, 520
linear case, 514, 516, 520 linear function, 509 modifications/refinements, 510 nonlinear two-point, 513
(CEx 11.1.1)
two-point boundary-value
problem, 510, 513, 520 Orthogonal functions, 441
Orthogonal matrices, 456 Orthogonal, mutually, 460 Orthogonal projection, 475 Orthogonal systems, 435
algorithm, 438
orthonormal basis functions, 435 polynomial regression, 439
Orthogonality properties, 435, 449, 461
Orthonormal basis, 435, 460 Overflow, range, 40 Overrelaxation, 409, 412 Overwrite, 9
Padé interpolation, 178 (CEx 4.1.17) Padé rational approximation,
37 (CEx 1.2.22),
66–67 (CEx 1.4.17) Parabolic problems, PDEs, 524, 526,
534, 535 (Ex 12.1.1) Crank-Nicolson alternative
method, 531 Crank-Nicolson method, 529, 534 heat equation, 526
stability, 530, 532
stencil (4-point implicit), 529 stencil (6-point implicit), 531
Parametric representation, curves, 273
Partial differential equations (PDEs), 524
elliptic problems, 526
finite-difference method, 544 finite-element methods, 551, 555
Gauss-Seidel iterative method, 548
Helmholtz equation, 525 hyperbolic problems, 536, 542
advection equation, 541 analytical solution, 537 Lax method/scheme, 541 Lax-Wendroff method/
scheme, 542
upwind method, 541 wave equation, 524, 536
parabolic problems, 534 Crank-Nicolson alternative
method, 531 Crank-Nicolson method, 529 heat equation model, 526 stability, 532
Partial double-precision arithmetic, 356 (CEx 7.5.2)
Partial pivoting strategy, 84–87 Partial pivoting for size/sparsity, 95 Partition of unity on interval, 292 Pascal’s triangle, 35 (CEx 1.2.10c) Patriot missile, 51
Penrose properties, 455 Pentadiagonal linear systems, 107
block, 108 Pentadiagonal matrix, 107 Periodic cubic splines, 265,
279 (Ex 6.2.23) Periodicity, 538
Periodic sequences, random numbers, 484
Periodic spline filter, 113 (CEx 2.2.23)
Permutation matrices, 372
π, computing value, 14 (Ex 1.1.1),
15 (Ex 1.1.4), 17 (CEx 1.1.17),
35–36 (CEx 1.2.11–15), 611 π4/90 series approximation via
calculus, 30 (Fig 1.5) Piecewise bilinear polynomial,
263 (CEx 6.1.3)
Piecewise linear functions, 253 Pierce decomposition, 394 (Ex 8.2.6)
history, computing π, 8
Pivot cross section, 506 (Fig 11.1) Pivoting, 94
pivot element, 73–74, 94 pivot equation, 71, 73 scaled partial, 83
complete pivoting, 96 Gaussian elimination, 92
Pivoting (continued) operational count, 93 partial pivoting, 95
Plane triangular element, 551 Poisson’s equation, 526, 544, 551, 553 Polygonal functions, 252
Polyhedral set, 597
Polynomial(s):
N -th degree approximation, 462 piecewise bilinear, 262 (CEx 6.1.3) trigonometric, 462
Polynomial interpolation, 153, 154, 165
bivariate functions, 172 derivative estimating, 188 divided differences, calculating
coefficients ai , 160 errors, 178
Dirichlet function, 179 polynomial interpolation, 185 Runge function, 179, 185 theorems, 181
inverse, 169
Lagrange form, 155
linear, 186 (Ex 4.2.8)
nested form, 158
Neville’s algorithm, 170–172 Newton form, 157
nested form, 158
upper bound, 185 Vandermonde matrix, 168
Polynomial regression, 440
Poorly fitting polynomials, 179 Population problem, 175 (Ex 4.1.11) Positive definite matrices, 369, 383,
416, 580
Power method, 396–403
Aitken acceleration formula, 399 inverse, 400–401
shifted inverse, 401–402
Power series, 478
Precision, 6, 41, 47 Preconditioning, 420 Predator-prey model(s), 299 Predictor-corrector scheme,
329 (CEx 7.3.4), 347 Predicted value, 328
Primal forms, LPPs: First, 587, 593 Second, 592, 594, 597
Prime numbers, 19 (CEx 1.1.25), 489 Primitive n-th roots of unity, 471 Principle, 29
Probability integral, 216 (CEx 5.1.5a) Programming suggestions, 608 Projection, 394 (Ex 8.2.6)
Projection operator, 444
Prony’s method, 458 (CEx 9.1.3.2) Properties:
Bernstein Polynomials, 292 Characteristics of Random Number
Algorithms, 484 Elementary Consequences of
Definitions, 637 Inner Product, 441
Schoenberg’s Process, 290 Protein folding, 585 (CEx 13.2.10) Pseudocode:
AMRK, 349
AMRK Adaptive, 356 (CEx 7.5.3) AM System, 349–350 area/volume estimation, 493–494 back substitution, 76, 366 Backward Tri, 112 (CEx 2.3.17) Birthday, 500
Bisection, 116
Bi Diagonal, 112 (CEx 2.3.16)
Bi Linear, 263
BSpline2 Coef, 288, 263 BSpline2 Eval, 288, 263
BVP1, 515–516
BVP2, 518–519
Cholesky factorization, 370 Circle Erroneous, 488
Coarse Check, 484–485
Coef, 166
complete Horner’s algorithm, 24 compute A, b, 439
compute T matrix, 438
Cone, 495
conjugate gradient algorithm, 419 Crank-Nicolson method, 530–531 CubeRoot, 141 (CEx 3.2.31) Derivative, 193
differentiation, 200 (CEx 4.3.3) divided difference/improved, 165 Doolittle Factorization, 365 Double Integral, 493–494 Ellipse, 486
Ellipse Erroneous, 487
Euler, 305
Eval, 166
evaluate Pn(t), 173
evaluation, interpolation
polynomial, 160 explicit model PDEs, 528
Extrapolate, 226 (CEx 5.2.12) First, 12–13
first four digits in a random
number, 486
Forward Eliminating, 75 Forward Substitution, 365 Forward Sub, 111 (CEx 2.3.11) Gauss, 89–90
Gauss Sym, 101 (CEx 2.2.101) Gaussian elimination, key
loops, 77
Gauss-Seidel, 409, 413
general iterative, 408
Horner’s algorithm, 11, 14 Hyperbolic, 539–540
improved divided difference, 165 improved forward elimination, 75 Jacobi, 413
LDLT factorization, 368
linear systems, 89–90, 408–409,
413
Loaded Die, 499
machine epsilon, 47
matrix factorizations, 365–370 modified forward elimination, 91 modified Gram-Schmidt process,
445
Moler-Morrison algorithm,
151 (CEx 3.3.14) Naive Gauss, 77
natural cubic spline functions, 271–272
nested multiplication, 8
naive forward elimination, 91 Needle, 501
Newton, 127–128
Newton’s method, 511, 512,
612–613
normalized power method, 399 numerical integration, 204, 219,
234–235
Parabolic1, 528–529 Parabolic2, 530–531
Penta, 107
Poly, 379 (CEx 8.1.3) polynomial interpolation, 160,
165–167 power method, 398
Probably, 500
pseudo-random number, 489 Random, 483
random integers, 485, 489 random-numbers recursive, 482
random points in (a, b), 485, 489 recursive Chebyshev polynomials,
168
recursive random number, 482 RK4, 315
RK45, 321
RK45 Adaptive, 323
RK System, 350–351
RK System1, 336
Romberg, 219
Schoenberg’s process, 289, 290,
294
Schoenberg Coef, 291 Schoenberg Eval, 291
Secant, 143–144, 150 (CEx 3.3.2) secant method, 143–144,
151 (CEx 3.3.2) Shielding, 503
Simpson, 234–235
Solve, 92
Solve Sym, 101 (CEx 2.2.101) SOR, 413
Spline1, 255
Spline3 Coef, 271
Spline3 Eval, 271
Sqrt, 140 (CEx 3.2.29) successive overrelaxation (SOR)
method, 413
synthetic division, 10 Taylor, 307
Taylor System1, 333
Taylor System2, 335
Test AMRK, 351
Test Bisection, 118
Test Derivative, 193
Test Coef Eval, 166–167 Test NGE, 78
Test Random, 484
Test RK4, 315
Test RK System, 350
Test RK4 System1, 336–337 Test RK4 System2, 338–339 Test RK45, 321–322
Test RK45 Adaptive, 324 Test Spline3, 272 Trapezoid, 204
Trapezoid Uniform,
216 (CEx 5.1.1) Tri, 105
Tri Normal, 112 (CEx 2.3.15) Tri 2n, 112 (CEx 2.3.15) tridiagonal factorization,
377 (Ex 8.1.14)
Two Dice, 502
Volume Region, 494
X Gauss, 111 (CEx 2.3.10) X Shielding, 503
X Solve, 111 (CEx 2.3.10) x − sin x function, 61 update rhs b, 367
Pseudo-inverse, matrices, 454, 455 Pseudo-random numbers, 482 Pythagorean Theorem, use, 3
Quadratic B spline, 296 (Ex 6.3.37) Quadratic convergence, 129, 130 Quadratic form, 418, 425 (CEx 8.4.12) Quadratic formula, 34 (CEx 1.2.1),
65 (Ex 1.4.29)
Quadratic function(s), 419, 576, 582,
583 (Ex 13.2.15) Quadratic interpolation, 186 (Ex 4.2.5b)
Quadratic splines, 256–258 Quadrature, 213
Quadrature rules, 213, 240 Quasi-Newton methods, minimization
functions, 580 Quasi-random number sequences,
489, 498 Quotient, 618
Radix point, 617 rand() in Unix, 483 Random numbers, 481
algorithms/generators, 483, 489 diamond, 490 (CEx 10.1.7b) equilateral triangle,
490 (CEx 10.1.7a) uniformly distributed, 485
Random walk problem,
505 (CEx 10.3.17–18)
Randomness, 482 Range, computer, 45 Range reduction, 62 Rational approximation,
37 (CEx 1.2.22) Rationalizing, 60
Rayleigh quotient, 403 (Ex 8.3.7) Reciprocals, numbers, 136 (Ex 3.2.23) Recursive definition, in Newton’s
method, 130
Recursive property, divided differences
theorem, 163
Recursive trapezoid formula, equal
subintervals, 210
Red-black ordering, 558 (Ex 12.3.3) Reflected points, 580
Regression, polynomial, 440 Regula falsi method, 121
Relative error(s), 5–6, 14 Relaxation factor, 412
Remainder, 618 Remainder-quotient-split, 623 Replacement, 73
Representation, numbers, different
bases, 616–623
Residual vectors, 97, 102 (CEx 2.2.19),
557
Richardson extrapolation, 190, 198,
199 (Ex 4.3.19) D(n, 0), 191
D(n, m), 192 estimating derivatives,
199 (Ex 4.3.19) Euler-Maclaurin formula, 221 Romberg algorithm, 222–223
Richardson iteration, 408
Riffle shuffles, 506 (CEx 10.3.27) Right triangle with hypotenuse,
503 (Fig 10.10)
Rising sequences in card shuffle,
506 (CEx 10.3.27) Robust software, 92
Rocket model problem, 301 Rolle’s Theorem, 181 Romberg algorithm, 217
Euler-Maclaurin formula, 221 ratios, 221–222
Richardson extrapolation, 220
Roots, nonlinear equations, locating, 114 bisection method, 114
convergence analysis, 119 false position method, 121 multiplicity k, 381
Newton’s method, 127 convergence analysis, 129–130 fractal basins of attraction, 134, interpretation, 126
nonlinear equation systems, 132 polynomials, multiplicity 2, 3,
132 (Fig 3.7) secant method, 142
comparison of methods, 147 convergence analysis, 144–147 fixed point iteration, 148
Roots of polynomials, 11 Roots of unity, 471 (Fig 9.10)
Rounding modes, 628
round(ing) down/up, 46, 628 rounding numbers, 7
round to n digits, 7
round to even, 7
round to nearest value, 7, 627, 628 round toward 0, 628
round toward ±∞, 628
Roundoff error, 308 Row-equilibrated matrix,
100 (Ex 2.2.23) Rules of Thumb, 97, 406
Runge function, 179 Runge-Kutta-Adams-Moulton method,
348 (Fig 7.6) Runge-Kutta-England method, 320,
328, 331 (CEx 7.3.19) Runge-Kutta-Fehlberg method, 320
pseudocode, 321
overview of adaptive process, 322 Runge-Kutta methods, 311
adaptive, 322
order 2, 313
order 3, 314, 317 (Ex 7.2.7) order 4, 314, 335, 342
order 5, 319 (CEx 7.2.15), 320 pseudocode, 316
systems ODEs, 335
Taylor series, two variables, 311
sin x partial sum approximations, 22 (Fig 1.3)
polynomial interpolation, 166, 183, 189
Saddle points, functions, 579
Sample problem pairs, 12
Sawtooth wave approximations/series,
465 (Fig 9.6), 468 (Fig 9.7) Sawtooth wave function on [−π, π ],
465, 467
Sawtooth wave function on [0, 2L],
467, 468, 476
Sawtooth wave polynomial, 465, 476 Scale factor, 85
Scale vector, 85, 97
Scaled partial pivoting, 85, 87 Scaling, 94
Schoenberg’s process, 289,
290, 294 Schur’s Theorem, 384
Secant method, nonlinear equations, 142, 143 (Fig 3.9), 149
algorithm/formula, 142
bisection, secant, Newton’s methods, pros-cons, 147
convergence analysis, 144 error bound (superlinear
convergence), 145–146, error vector approximation, 144,
146, 149
fixed point iteration, 148 hybrid schemes, 148
Second bad case, quadratic interpolation algorithm, 570
Second-degree spline, 256 Second-derivative formulas, 196 Second primal form, LPPs, 592, 594,
597
Seed, random number sequence, 483 Series, geometric, 78
Serpentine curves, 274 (Fig 6.9) Shifted inverse power method, 401 Shooting method, BVPs, 508
illustrated, 508 (Fig 11.2) linear case, 514, 516 refinements, 517
typical equation, 514
Significance digits, 56 Significance, loss of, 56
x − sin x, 57
avoiding in subtraction, 60 computer-caused, 57 range reduction, 62
Significant decimal digits, 3–5, 43–44
Similar matrices, 384
Simplex algorithm/method, 599
2-simplex/3-simplex, 580 Simplex method, 597 Simple zero, 129
Simpson’s rule, 228
adaptive, 231
basic, 227, 235, 236–237,
238 (Ex 5.3.8) composite, 237, 237–238
(Ex 5.3.6–8), 251 (CEx 5.4.11) one/two basic rules, 232 (Fig 5.7) uniform spacing, 229
Simulated annealing method, 581 Simulation, 481, 498
birthday problem, 499 Buffon’s needle problem,
501 (Fig 10.8) loaded die problem, 498 neutron shielding, 503 two dice problem, 502
Simultaneous nonlinear equations, 138 (Ex 3.2.39)
Sine integral Si(x), 205, 216 (CEx 5.1.5b),
217 (CEx 5.1.9),
330 (CEx 7.3.15) Sine routine, 66 (CEx 1.4.17)
Sine series, 20, 464, 476 partial summations, 21
Single-precision floating-point form, 42
Single-precision floating-point representation, 42, 51
computer word, 42, 44 precision, 41, 47, 611
range, 45
32-bit word, 628 (Table C.3)
Single-step error control, 322 Single-step methods, 347 Singular value
decomposition/factorization
(SVD), 388, 393, 448, 451, 454 economical version, 394 (Ex 8.2.5) eigenvalues/eigenvectors, 387
least squares method, 522
Singular values, 388, 406 Singularity, 246
sin x , periodicity, 62
Smallest positive floating-point
single-precision number,
52 (Ex 1.3.10)
Smoothing data, 439, 440 (Fig 9.5) Smoothness properties, splines, 274 Solution case, quadratic interpolation
algorithm, 570 Solution curves, 303 (Fig 7.1),
304 (Fig 7.2), 326 (Fig 7.4),
326 (Fig 7.5) Solution curves, stiff ODE, 353 (Fig 7.7)
Solutions, ODEs, 299
SOR iteration, 409, 416
SOR method, 412, 421
Sparse factorization, 378 (Ex 8.1.25) Sparse linear system, 424
(CEx 8.4.11), 547 Sparse sample system, 109
Spectral radius, 406, 414 Spectral theorem, 456
Spherical surface, 55 (Ex 1.3.20) Spherical volume, 55 (Ex 1.3.20) Spline functions, 252
B, 281
cubic interpolating, 252 example, first/second degree,
259 (Fig 6.4)
first degree, 251, 253 (Fig 6.2)
approximation/interpolation, 280 Bézier curves, 295
Schoenberg’s process, 289,
290, 294 first-degree, linear Si (x ),
254 (Fig 6.3)
interpolating quadratic, Q(x), 257 modulus of continuity, 255 natural cubic, 265
smoothness property, 274
space curves, 273 quadratic interpolating spline,
289 (Fig 6.16) second-degree, 256
Subbotin quadratic, 258, 259 (Fig 6.5), 261
Spurious zeros, 57 Square root, computing,
136 (Ex 3.2.1–2),
137 (Ex 3.2.25–26)
Square wave approximations/series,
469 (Fig 9.8), 477, 480 Stability, 93, 325, 532
ODEs, 325
PDEs, 532
Standard deviation, 17 (CEx 1.1.7) Standard floating-point representation
(IEEE-754), 41, 626 Standard forms, LPP, 587
Stationary points, functions, 578 Statistician’s rounding, 7
Steady state, systems, 352
Steel, rectangular sheet, 4 (Fig 1.1) Steepest descent procedure, 575,
584 (CEx 13.2.2)
Steffensen’s method, 137 (Ex 3.2.36) Stencil (4-point explicit), 527
Stiff ODEs, 353 (Fig 7.7)
Stirling’s formula, 33 (Ex 1.2.47) Strategies, 3
Strictly diagonal dominant matrix, 106,
110, 416
Student Research Project, 54, 55, 65,
68, 102, 151, 187, 226, 251, 262, 331, 395, 459, 491, 498, 506, 585, 601
Sturm-Liouville problem, 557 Subbotin quadratic spline
functions, 258
Subdiagonal matrix, 103 Subnormal numbers, 627 Subordinate norms, 406 Subtraction, significance, 60 Successive overrelaxation (SOR)
method, 409 Superdiagonal matrix, 103
Superlinear convergence, 145, 149 Support of function, 281 Supremum (least upper bound), 255 Surface f (x, y) about disk,
493 (Fig 10.6) Symbolic computations, 308
Symbolic verification, 19 (CEx 1.1.26) Symmetric banded storage mode,
112 (CEx 2.3.20) Symmetric matrices, 637
Symmetric pentadiagonal system, 108 Symmetric positive definite (SPD)
matrices, 369, 383, 416, 580 Symmetric storage mode,
101 (CEx 2.2.13) Symmetric Tri call, 105
Symmetric tridiagonal system, 105 Synthetic division, 9
Systems ODEs, 334, 342
Autonomous ODEs, 337 nth-order, 343, 344
Runge-Kutta method, 335
Taylor series, vector notation, 334 vector notation, 343
Systems of linear equations, 69, 358
2n Equal Spaced Nodes, 168 (Fig 4.3), 210 (Fig 5.3)
Tacoma Narrows Bridge,
356 (CEx 7.5.9), 357 (Fig 7.8)
Tangent line, 126
Tangent routine, 67 (CEx 1.4.19) Taylor series, 20, 22, 25–27
alternating series, 28–31 complete Horner’s algorithm,
24, 31
derivative estimating, 187
f at point c, 22, 31
F, minimization, functions, 573,
574, 582
machine precision, 65 (Ex 1.4.28) Mean Value Theorem, 26–27 method, ODEs, 304, 306
method, symbolic computations, 308 method, vector notation, 334
method of order m, 308, 342
natural logarithm (ln), 20–21 ODEs, 332
Runge-Kutta methods, 311
Taylor’s Theorem, terms h, 27, 31 Taylor’s Theorem, terms (x − c), 25, 31 Telescoped rational functions,
67 (CEx 1.4.18–19) Tensor-product interpolation, 172 Tent function, 151 (CEx 3.3.15a) Theorem/Corollary/Lemma:
Alternating Series, 32 (Ex 1.2.13) Bisection Method, 119 Cayley-Hamilton, 395 (CEx 8.2.5) Cholesky Factorizations, 370 Cubic Spline Smoothness, 275 Divided Differences and
Derivatives, 184
Divided Differences Corollary, 184 Duality, 591
Eigenvalues of Similar Matrices, 384 Euler-Maclaurin Formula, 221
On Existence of Polynomial
Interpolation, 157 First-Degree Polynomial Accuracy,
255
First-Degree Spline Accuracy, 256 First Interpolation, Error, 181
First Primal Form, 587, 600 Fourier Convergence, 463 Fundamental Theorem of Calculus,
202, 230, 302
Gaussian Quadrature, 239, 241 Gershgorin’s, 386
Hermitian Matrix Unitarily Similar
to a Diagonal Matrix, 385 IVP Uniqueness, 304 Intermediate-Value, 116, 208 Interpolation Errors, 181–183 Interpolation Properties, 171 Invariance, 163 Jacobi/Gauss-Seidel
Convergence, 416
Linear Differential Equations, 392 Localization, 386
Long Operations, 93
Loss of Precision, 59
LU Factorization, 363
Matrix Eigenvalue Properties, 383 Matrix Similar to a Triangular
Matrix, 384 Matrix Spectral, 387 Mean-Value, 27
Integrals, 496 (Ex 10.2.2)
Theorem/Corollary/Lemma (continued)
Minimal Solution, 454
More Gershgorin Discs, 386 Newton’s Method Locating Roots of
Equations, 129 Orthogonal Basis, 388 Penrose Properties of
Pseudo-Inverse, 455 Polynomial Interpolation Errors,
181, 183
Precision of Trapezoid Rule, 206 Primal and Dual Problems, 591 Quadratic Function, 579 Recursive Property of Divided
Differences, 193
Recursive Trapezoid Formula, 211 Richardson Extrapolation, 191 Rolle’s, 181
Schur’s, 384
Second Interpolation Error, 183 Second Primal Form, 592, 594, 597 Similar to Triangular Matrix
Corollary, 384
SOR Convergence, 416 Spectral Radius, 414
Spectral Theorem, Symmetric
Matrices, 456
Strictly Diagonally Dominant
Matrix Corollary, 384 Successive Overrelaxation (SOR),
409, 413
SVD Least Squares, 453 Taylor’s Theorem, Terms h, 27 Taylor’s Theorem, Terms
(x − c), 25
Taylor Series, f about c, 22 Third Interpolation Error, 183 Trapezoid Rule Precision, 205 Uniqueness of IVP, 304
Upper Bound Lemma, 182 Vertices and Column Vectors, 597 Weierstrass Approximation, 292 Weighted Gaussian Quadrature,
244, 247
Traffic flow, 481 (Fig 10.1)
Transpose, matrices, 629
Transpose by columns/rows, 637 Trapezoid method/rule, 201, 235
basic, 203 (Fig 5.2), 227 composite, 203 (Fig 5.2), 203–204,
250 (CEx 5.4.11) composite with unequal spacing,
216 (Ex 5.1.34) error analysis, 205
multidimensional integration, 211 nonuniform partition, 202
ODE, 309 (Ex 7.1.15)
recursive formula, equal
subintervals, 210
vs. Simpson’s rule, 229 (Fig 5.6) uniform partition, 203
uniform spacing, 207
Triangle inequality, 405
Triangle wave approximations/series,
469 (Fig 9.9), 477 Triangular inequality, 405
Triangular matrix, 103
Tridiagonal matrix, 103, 515 Tridiagonal linear system, 103, 260,
529
Tridiagonal normalized algorithm,
111 (CEx 2.3.12) Tridiagonal symmetric matrix, 269 Troesch’s problem,
522 (CEx 11.2.7c) Truncation error, 188, 197, 308 Two dice problem, 502 Two-dimensional integration over
unit square,, 212
Uncertainty, 7
Unconstrained minimization problems, 560
Underflow, range, 40
Undetermined coefficients method, 227, 242
Uniformly distributed numbers, 482
Uniformly distributed random numbers in ellipse, 485 (Fig 10.2)
Unimodal functions F, 563
Unitarily similar matrices, 384
Unit roundoff error, 47, 626
Unit vectors, 630
Unstable functions, roots, 124 (CEx 3.1.12)
Upper Bound Lemma, 182
Upper Bi-diagonal system, 104
Upper triangular matrix, 359
Upper triangular system, 72, 76
Upwind method, 541
Vandermonde determinant, 177 (Ex 4.1.47)
Vandermonde matrix, 78, 168, 472, 477
Variable metric algorithm, 580
Variance, 17 (CEx 1.1.7–8), 441
Vector fields, 303 (Fig 7.1), 204 (Fig 7.2)
Vector norm(s), 405, l1, l2, l∞, 405
Vectors, 629
Verification, symbolic, 19 (CEx 1.1.26)
Vertices in K, 597
Vibrating string, 536 (Fig 12.7)
Vibration problem, two mass, 391 (Fig 8.2)
Volume estimations, 491, 495
Wave equation PDE, 524, 525, 536
    explicit stencil, 538 (Fig 12.10)
    f stencil, 538 (Fig 12.9)
    xt-plane, 537 (Fig 12.8)
Weierstrass Approximation Theorem, 292
Weight function, 448
Weights, Gaussian, 245, 246
Well/ill-conditioned 2D linear system, 5 (Fig 1.2), 406, 407
Wilkinson’s polynomial, 19 (CEx 1.1.24), 124 (CEx 3.1.10), 151 (CEx 3.3.9)
World population, 459 (CEx 9.3.4)
Zeros
    complicated functions, 115
    function f, 114
    multiplicity, 131, 137 (Ex 3.2.35)
    polynomial, 10, 132 (Fig 3.7)
    simple, 129
    spurious, 57