Jorge Nocedal Stephen J. Wright
Numerical Optimization Second Edition
Jorge Nocedal
EECS Department, Northwestern University, Evanston, IL 60208-3118, USA
nocedal@eecs.northwestern.edu
Series Editors:
Thomas V. Mikosch
University of Copenhagen, Laboratory of Actuarial Mathematics, DK-1017 Copenhagen, Denmark
mikosch@act.ku.dk
Sidney I. Resnick
Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY 14853, USA
sir1@cornell.edu
Stephen J. Wright
Computer Sciences Department, University of Wisconsin, 1210 West Dayton Street, Madison, WI 53706-1613, USA
swright@cs.wisc.edu
Stephen M. Robinson
Department of Industrial and Systems Engineering, University of Wisconsin, 1513 University Avenue, Madison, WI 53706-1539, USA
smrobins@facstaff.wisc.edu
Mathematics Subject Classification (2000): 90B30, 90C11, 90-01, 90-02
Library of Congress Control Number: 2006923897
ISBN-10: 0-387-30303-0    ISBN-13: 978-0-387-30303-1    Printed on acid-free paper.
© 2006 Springer Science+Business Media, LLC.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed in the United States of America. (TB/HAM)  9 8 7 6 5 4 3 2 1
springer.com
To Sue, Isabel and Martin and
To Mum and Dad
Contents

Preface
Preface to the Second Edition

1  Introduction
     Mathematical Formulation
     Example: A Transportation Problem
     Continuous versus Discrete Optimization
     Constrained and Unconstrained Optimization
     Global and Local Optimization
     Stochastic and Deterministic Optimization
     Convexity
     Optimization Algorithms
   Notes and References

2  Fundamentals of Unconstrained Optimization
   2.1  What Is a Solution?
     Recognizing a Local Minimum
     Nonsmooth Problems
   2.2  Overview of Algorithms
     Two Strategies: Line Search and Trust Region
     Search Directions for Line Search Methods
     Models for Trust-Region Methods
     Scaling
   Exercises

3  Line Search Methods
   3.1  Step Length
     The Wolfe Conditions
     The Goldstein Conditions
     Sufficient Decrease and Backtracking
   3.2  Convergence of Line Search Methods
   3.3  Rate of Convergence
     Convergence Rate of Steepest Descent
     Newton's Method
     Quasi-Newton Methods
   3.4  Newton's Method with Hessian Modification
     Eigenvalue Modification
     Adding a Multiple of the Identity
     Modified Cholesky Factorization
     Modified Symmetric Indefinite Factorization
   3.5  Step-Length Selection Algorithms
     Interpolation
     Initial Step Length
     A Line Search Algorithm for the Wolfe Conditions
   Notes and References
   Exercises

4  Trust-Region Methods
     Outline of the Trust-Region Approach
   4.1  Algorithms Based on the Cauchy Point
     The Cauchy Point
     Improving on the Cauchy Point
     The Dogleg Method
     Two-Dimensional Subspace Minimization
   4.2  Global Convergence
     Reduction Obtained by the Cauchy Point
     Convergence to Stationary Points
   4.3  Iterative Solution of the Subproblem
     The Hard Case
     Proof of Theorem 4.1
     Convergence of Algorithms Based on Nearly Exact Solutions
   4.4  Local Convergence of Trust-Region Newton Methods
   4.5  Other Enhancements
     Scaling
     Trust Regions in Other Norms
   Notes and References
   Exercises

5  Conjugate Gradient Methods
   5.1  The Linear Conjugate Gradient Method
     Conjugate Direction Methods
     Basic Properties of the Conjugate Gradient Method
     A Practical Form of the Conjugate Gradient Method
     Rate of Convergence
     Preconditioning
     Practical Preconditioners
   5.2  Nonlinear Conjugate Gradient Methods
     The Fletcher-Reeves Method
     The Polak-Ribière Method and Variants
     Quadratic Termination and Restarts
     Behavior of the Fletcher-Reeves Method
     Global Convergence
     Numerical Performance
   Notes and References
   Exercises

6  Quasi-Newton Methods
   6.1  The BFGS Method
     Properties of the BFGS Method
     Implementation
   6.2  The SR1 Method
     Properties of SR1 Updating
   6.3  The Broyden Class
   6.4  Convergence Analysis
     Global Convergence of the BFGS Method
     Superlinear Convergence of the BFGS Method
     Convergence Analysis of the SR1 Method
   Notes and References
   Exercises

7  Large-Scale Unconstrained Optimization
   7.1  Inexact Newton Methods
     Local Convergence of Inexact Newton Methods
     Line Search Newton-CG Method
     Trust-Region Newton-CG Method
     Preconditioning the Trust-Region Newton-CG Method
     Trust-Region Newton-Lanczos Method
   7.2  Limited-Memory Quasi-Newton Methods
     Limited-Memory BFGS
     Relationship with Conjugate Gradient Methods
     General Limited-Memory Updating
     Compact Representation of BFGS Updating
     Unrolling the Update
   7.3  Sparse Quasi-Newton Updates
   7.4  Algorithms for Partially Separable Functions
   7.5  Perspectives and Software
   Notes and References
   Exercises

8  Calculating Derivatives
   8.1  Finite-Difference Derivative Approximations
     Approximating the Gradient
     Approximating a Sparse Jacobian
     Approximating the Hessian
     Approximating a Sparse Hessian
   8.2  Automatic Differentiation
     An Example
     The Forward Mode
     The Reverse Mode
     Vector Functions and Partial Separability
     Calculating Jacobians of Vector Functions
     Calculating Hessians: Forward Mode
     Calculating Hessians: Reverse Mode
     Current Limitations
   Notes and References
   Exercises

9  Derivative-Free Optimization
   9.1  Finite Differences and Noise
   9.2  Model-Based Methods
     Interpolation and Polynomial Bases
     Updating the Interpolation Set
     A Method Based on Minimum-Change Updating
   9.3  Coordinate and Pattern-Search Methods
     Coordinate Search Method
     Pattern-Search Methods
   9.4  A Conjugate-Direction Method
   9.5  Nelder-Mead Method
   9.6  Implicit Filtering
   Notes and References
   Exercises

10  Least-Squares Problems
   10.1  Background
   10.2  Linear Least-Squares Problems
   10.3  Algorithms for Nonlinear Least-Squares Problems
     The Gauss-Newton Method
     Convergence of the Gauss-Newton Method
     The Levenberg-Marquardt Method
     Implementation of the Levenberg-Marquardt Method
     Convergence of the Levenberg-Marquardt Method
     Methods for Large-Residual Problems
   10.4  Orthogonal Distance Regression
   Notes and References
   Exercises

11  Nonlinear Equations
   11.1  Local Algorithms
     Newton's Method for Nonlinear Equations
     Inexact Newton Methods
     Broyden's Method
     Tensor Methods
   11.2  Practical Methods
     Merit Functions
     Line Search Methods
     Trust-Region Methods
   11.3  Continuation/Homotopy Methods
     Motivation
     Practical Continuation Methods
   Notes and References
   Exercises

12  Theory of Constrained Optimization
     Local and Global Solutions
     Smoothness
   12.1  Examples
     A Single Equality Constraint
     A Single Inequality Constraint
     Two Inequality Constraints
   12.2  Tangent Cone and Constraint Qualifications
   12.3  First-Order Optimality Conditions
   12.4  First-Order Optimality Conditions: Proof
     Relating the Tangent Cone and the First-Order Feasible Direction Set
     A Fundamental Necessary Condition
     Farkas' Lemma
     Proof of Theorem 12.1
   12.5  Second-Order Conditions
     Second-Order Conditions and Projected Hessians
   12.6  Other Constraint Qualifications
   12.7  A Geometric Viewpoint
   12.8  Lagrange Multipliers and Sensitivity
   12.9  Duality
   Notes and References
   Exercises

13  Linear Programming: The Simplex Method
     Linear Programming
   13.1  Optimality and Duality
     Optimality Conditions
     The Dual Problem
   13.2  Geometry of the Feasible Set
     Bases and Basic Feasible Points
     Vertices of the Feasible Polytope
   13.3  The Simplex Method
     Outline
     A Single Step of the Method
   13.4  Linear Algebra in the Simplex Method
   13.5  Other Important Details
     Pricing and Selection of the Entering Index
     Starting the Simplex Method
     Degenerate Steps and Cycling
   13.6  The Dual Simplex Method
   13.7  Presolving
   13.8  Where Does the Simplex Method Fit?
   Notes and References
   Exercises

14  Linear Programming: Interior-Point Methods
   14.1  Primal-Dual Methods
     Outline
     The Central Path
     Central Path Neighborhoods and Path-Following Methods
   14.2  Practical Primal-Dual Algorithms
     Corrector and Centering Steps
     Step Lengths
     Starting Point
     A Practical Algorithm
     Solving the Linear Systems
   14.3  Other Primal-Dual Algorithms and Extensions
     Other Path-Following Methods
     Potential-Reduction Methods
     Extensions
   14.4  Perspectives and Software
   Notes and References
   Exercises

15  Fundamentals of Algorithms for Nonlinear Constrained Optimization
   15.1  Categorizing Optimization Algorithms
   15.2  The Combinatorial Difficulty of Inequality-Constrained Problems
   15.3  Elimination of Variables
     Simple Elimination using Linear Constraints
     General Reduction Strategies for Linear Constraints
     Effect of Inequality Constraints
   15.4  Merit Functions and Filters
     Merit Functions
     Filters
   15.5  The Maratos Effect
   15.6  Second-Order Correction and Nonmonotone Techniques
     Nonmonotone (Watchdog) Strategy
   Notes and References
   Exercises

16  Quadratic Programming
   16.1  Equality-Constrained Quadratic Programs
     Properties of Equality-Constrained QPs
   16.2  Direct Solution of the KKT System
     Factoring the Full KKT System
     Schur-Complement Method
     Null-Space Method
   16.3  Iterative Solution of the KKT System
     CG Applied to the Reduced System
     The Projected CG Method
   16.4  Inequality-Constrained Problems
     Optimality Conditions for Inequality-Constrained Problems
     Degeneracy
   16.5  Active-Set Methods for Convex QPs
     Specification of the Active-Set Method for Convex QP
     Further Remarks on the Active-Set Method
     Finite Termination of Active-Set Algorithm on Strictly Convex QPs
     Updating Factorizations
   16.6  Interior-Point Methods
     Solving the Primal-Dual System
     Step Length Selection
     A Practical Primal-Dual Method
   16.7  The Gradient Projection Method
     Cauchy Point Computation
     Subspace Minimization
   16.8  Perspectives and Software
   Notes and References
   Exercises

17  Penalty and Augmented Lagrangian Methods
   17.1  The Quadratic Penalty Method
     Motivation
     Algorithmic Framework
     Convergence of the Quadratic Penalty Method
     Ill Conditioning and Reformulations
   17.2  Nonsmooth Penalty Functions
     A Practical ℓ₁ Penalty Method
     A General Class of Nonsmooth Penalty Methods
   17.3  Augmented Lagrangian Method: Equality Constraints
     Motivation and Algorithmic Framework
     Properties of the Augmented Lagrangian
   17.4  Practical Augmented Lagrangian Methods
     Bound-Constrained Formulation
     Linearly Constrained Formulation
     Unconstrained Formulation
   17.5  Perspectives and Software
   Notes and References
   Exercises

18  Sequential Quadratic Programming
   18.1  Local SQP Method
     SQP Framework
     Inequality Constraints
   18.2  Preview of Practical SQP Methods
     IQP and EQP
     Enforcing Convergence
   18.3  Algorithmic Development
     Handling Inconsistent Linearizations
     Full Quasi-Newton Approximations
     Reduced-Hessian Quasi-Newton Approximations
     Merit Functions
     Second-Order Correction
   18.4  A Practical Line Search SQP Method
   18.5  Trust-Region SQP Methods
     A Relaxation Method for Equality-Constrained Optimization
     Sℓ₁QP (Sequential ℓ₁ Quadratic Programming)
     Sequential Linear-Quadratic Programming (SLQP)
     A Technique for Updating the Penalty Parameter
   18.6  Nonlinear Gradient Projection
   18.7  Convergence Analysis
     Rate of Convergence
   18.8  Perspectives and Software
   Notes and References
   Exercises

19  Interior-Point Methods for Nonlinear Programming
   19.1  Two Interpretations
   19.2  A Basic Interior-Point Algorithm
   19.3  Algorithmic Development
     Primal vs. Primal-Dual System
     Solving the Primal-Dual System
     Updating the Barrier Parameter
     Handling Nonconvexity and Singularity
     Step Acceptance: Merit Functions and Filters
     Quasi-Newton Approximations
     Feasible Interior-Point Methods
   19.4  A Line Search Interior-Point Method
   19.5  A Trust-Region Interior-Point Method
     An Algorithm for Solving the Barrier Problem
     Step Computation
     Lagrange Multipliers Estimates and Step Acceptance
     Description of a Trust-Region Interior-Point Method
   19.6  The Primal Log-Barrier Method
   19.7  Global Convergence Properties
     Failure of the Line Search Approach
     Modified Line Search Methods
     Global Convergence of the Trust-Region Approach
   19.8  Superlinear Convergence
   19.9  Perspectives and Software
   Notes and References
   Exercises

A  Background Material
   A.1  Elements of Linear Algebra
     Vectors and Matrices
     Norms
     Subspaces
     Eigenvalues, Eigenvectors, and the Singular-Value Decomposition
     Determinant and Trace
     Matrix Factorizations: Cholesky, LU, QR
     Symmetric Indefinite Factorization
     Sherman-Morrison-Woodbury Formula
     Interlacing Eigenvalue Theorem
     Error Analysis and Floating-Point Arithmetic
     Conditioning and Stability
   A.2  Elements of Analysis, Geometry, Topology
     Sequences
     Rates of Convergence
     Topology of the Euclidean Space ℝⁿ
     Convex Sets in ℝⁿ
     Continuity and Limits
     Derivatives
     Directional Derivatives
     Mean Value Theorem
     Implicit Function Theorem
     Order Notation
     Root-Finding for Scalar Equations

B  A Regularization Procedure

References
Index
Preface
This is a book for people interested in solving optimization problems. Because of the wide and growing use of optimization in science, engineering, economics, and industry, it is essential for students and practitioners alike to develop an understanding of optimization algorithms. Knowledge of the capabilities and limitations of these algorithms leads to a better understanding of their impact on various applications, and points the way to future research on improving and extending optimization algorithms and software. Our goal in this book is to give a comprehensive description of the most powerful, state-of-the-art techniques for solving continuous optimization problems. By presenting the motivating ideas for each algorithm, we try to stimulate the reader's intuition and make the technical details easier to follow. Formal mathematical requirements are kept to a minimum.
Because of our focus on continuous problems, we have omitted discussion of important optimization topics such as discrete and stochastic optimization. However, there are a great many applications that can be formulated as continuous optimization problems; for instance,
finding the optimal trajectory for an aircraft or a robot arm;
identifying the seismic properties of a piece of the earth's crust by fitting a model of the region under study to a set of readings from a network of recording stations;
designing a portfolio of investments to maximize expected return while maintaining an acceptable level of risk;
controlling a chemical process or a mechanical device to optimize performance or meet standards of robustness;
computing the optimal shape of an automobile or aircraft component.
Every year optimization algorithms are being called on to handle problems that are much larger and more complex than in the past. Accordingly, the book emphasizes large-scale optimization techniques, such as interior-point methods, inexact Newton methods, limited-memory methods, and the role of partially separable functions and automatic differentiation. It treats important topics such as trust-region methods and sequential quadratic programming more thoroughly than existing texts, and includes comprehensive discussion of such core curriculum topics as constrained optimization theory, Newton and quasi-Newton methods, nonlinear least squares and nonlinear equations, the simplex method, and penalty and barrier methods for nonlinear programming.
The Audience
We intend that this book will be used in graduate-level courses in optimization, as offered in engineering, operations research, computer science, and mathematics departments. There is enough material here for a two-semester or three-quarter sequence of courses. We hope, too, that this book will be used by practitioners in engineering, basic science, and industry, and our presentation style is intended to facilitate self-study. Since the book treats a number of new algorithms and ideas that have not been described in earlier textbooks, we hope that this book will also be a useful reference for optimization researchers.
Prerequisites for this book include some knowledge of linear algebra (including numerical linear algebra) and the standard sequence of calculus courses. To make the book as self-contained as possible, we have summarized much of the relevant material from these areas in the Appendix. Our experience in teaching engineering students has shown us that the material is best assimilated when combined with computer programming projects in which the student gains a good feeling for the algorithms (their complexity, memory demands, and elegance) and for the applications. In most chapters we provide simple computer exercises that require only minimal programming proficiency.
Emphasis and Writing Style
We have used a conversational style to motivate the ideas and present the numerical algorithms. Rather than being as concise as possible, our aim is to make the discussion flow in a natural way. As a result, the book is comparatively long, but we believe that it can be read relatively rapidly. The instructor can assign substantial reading assignments from the text and focus in class only on the main ideas.
A typical chapter begins with a nonrigorous discussion of the topic at hand, including figures and diagrams and excluding technical details as far as possible. In subsequent sections,
the algorithms are motivated and discussed, and then stated explicitly. The major theoretical results are stated, and in many cases proved, in a rigorous fashion. These proofs can be skipped by readers who wish to avoid technical details.
The practice of optimization depends not only on efficient and robust algorithms, but also on good modeling techniques, careful interpretation of results, and user-friendly software. In this book we discuss the various aspects of the optimization process (modeling, optimality conditions, algorithms, implementation, and interpretation of results) but not with equal weight. Examples throughout the book show how practical problems are formulated as optimization problems, but our treatment of modeling is light and serves mainly to set the stage for algorithmic developments. We refer the reader to Dantzig [86] and Fourer, Gay, and Kernighan [112] for more comprehensive discussion of this issue. Our treatment of optimality conditions is thorough but not exhaustive; some concepts are discussed more extensively in Mangasarian [198] and Clarke [62]. As mentioned above, we are quite comprehensive in discussing optimization algorithms.
Topics Not Covered
We omit some important topics, such as network optimization, integer programming, stochastic programming, nonsmooth optimization, and global optimization. Network and integer optimization are described in some excellent texts: for instance, Ahuja, Magnanti, and Orlin [1] in the case of network optimization and Nemhauser and Wolsey [224], Papadimitriou and Steiglitz [235], and Wolsey [312] in the case of integer programming. Books on stochastic optimization are only now appearing; we mention those of Kall and Wallace [174] and Birge and Louveaux [22]. Nonsmooth optimization comes in many flavors. The relatively simple structures that arise in robust data fitting (which is sometimes based on the ℓ₁ norm) are treated by Osborne [232] and Fletcher [101]. The latter book also discusses algorithms for nonsmooth penalty functions that arise in constrained optimization; we discuss these briefly, too, in Chapter 18. A more analytical treatment of nonsmooth optimization is given by Hiriart-Urruty and Lemaréchal [170]. We omit detailed treatment of some important topics that are the focus of intense current research, including interior-point methods for nonlinear programming and algorithms for complementarity problems.
Additional Resource
The material in the book is complemented by an online resource called the NEOS Guide, which can be found on the World Wide Web at
http://www.mcs.anl.gov/otc/Guide/
The Guide contains information about most areas of optimization, and presents a number of case studies that describe applications of various optimization algorithms to realworld problems such as portfolio optimization and optimal dieting. Some of this material is interactive in nature and has been used extensively for class exercises.
For the most part, we have omitted detailed discussions of specific software packages, and refer the reader to Moré and Wright [217] or to the Software Guide section of the NEOS Guide, which can be found at
http://www.mcs.anl.gov/otc/Guide/SoftwareGuide/
Users of optimization software refer in great numbers to this web site, which is being constantly updated to reflect new packages and changes to existing software.
Acknowledgments
We are most grateful to the following colleagues for their input and feedback on various sections of this work: Chris Bischof, Richard Byrd, George Corliss, Bob Fourer, David Gay, Jean-Charles Gilbert, Phillip Gill, Jean-Pierre Goux, Don Goldfarb, Nick Gould, Andreas Griewank, Matthias Heinkenschloss, Marcelo Marazzi, Hans Mittelmann, Jorge Moré, Will Naylor, Michael Overton, Bob Plemmons, Hugo Scolnik, David Stewart, Philippe Toint, Luis Vicente, Andreas Wächter, and Ya-xiang Yuan. We thank Guanghui Liu, who provided help with many of the exercises, and Jill Lavelle, who assisted us in preparing the figures. We also express our gratitude to our sponsors at the Department of Energy and the National Science Foundation, who have strongly supported our research efforts in optimization over the years.
One of us (JN) would like to express his deep gratitude to Richard Byrd, who has taught him so much about optimization and who has helped him in very many ways throughout the course of his career.
Final Remark
In the preface to his 1987 book [101], Roger Fletcher described the field of optimization as "a fascinating blend of theory and computation, heuristics and rigor." The ever-growing realm of applications and the explosion in computing power is driving optimization research in new and exciting directions, and the ingredients identified by Fletcher will continue to play important roles for many years to come.
Jorge Nocedal Stephen J. Wright
Evanston, IL Argonne, IL
Preface to the Second Edition
During the six years since the first edition of this book appeared, the field of continuous optimization has continued to grow and evolve. This new edition reflects a better understanding of constrained optimization at both the algorithmic and theoretical levels, and of the demands imposed by practical applications. Perhaps most notably, new chapters have been added on two important topics: derivative-free optimization (Chapter 9) and interior-point methods for nonlinear programming (Chapter 19). The former topic has proved to be of great interest in applications, while the latter topic has come into its own in recent years and now forms the basis of successful codes for nonlinear programming.
Apart from the new chapters, we have revised and updated throughout the book, de-emphasizing or omitting less important topics, enhancing the treatment of subjects of evident interest, and adding new material in many places. The first part (unconstrained optimization) has been comprehensively reorganized to improve clarity. Discussion of Newton's method, the touchstone method for unconstrained problems, is distributed more naturally throughout this part rather than being isolated in a single chapter. An expanded discussion of large-scale problems appears in Chapter 7.
Some reorganization has taken place also in the second part (constrained optimization), with material common to sequential quadratic programming and interior-point methods now appearing in the chapter on fundamentals of nonlinear programming algorithms (Chapter 15) and the discussion of primal barrier methods moved to the new interior-point chapter. There is much new material in this part, including a treatment of nonlinear programming duality, an expanded discussion of algorithms for inequality-constrained quadratic programming, a discussion of dual simplex and presolving in linear programming, a summary of practical issues in the implementation of interior-point linear programming algorithms, a description of conjugate-gradient methods for quadratic programming, and a discussion of filter methods and nonsmooth penalty methods in nonlinear programming algorithms.
In many chapters we have added a "Perspectives and Software" section near the end, to place the preceding discussion in context and discuss the state of the art in software. The appendix has been rearranged with some additional topics added, so that it can be used in a more stand-alone fashion to cover some of the mathematical background required for the rest of the book. The exercises have been revised in most chapters. After these many additions, deletions, and changes, the second edition is only slightly longer than the first, reflecting our belief that careful selection of the material to include and exclude is an important responsibility for authors of books of this type.
A manual containing solutions for selected problems will be available to bona fide instructors through the publisher. A list of typos will be maintained on the book's web site, which is accessible from the web pages of both authors.
We acknowledge with gratitude the comments and suggestions of many readers of the first edition, who sent corrections to many errors and provided valuable perspectives on the material, which led often to substantial changes. We mention in particular Frank Curtis, Michael Ferris, Andreas Griewank, Jacek Gondzio, Sven Leyffer, Philip Loewen, Rembert Reemtsen, and David Stewart.
Our special thanks goes to Michael Overton, who taught from a draft of the second edition and sent many detailed and excellent suggestions. We also thank colleagues who read various chapters of the new edition carefully during development, including Richard Byrd, Nick Gould, Paul Hovland, Gabo López-Calva, Long Hei, Katya Scheinberg, Andreas Wächter, and Richard Waltz. We thank Jill Wright for improving some of the figures and for the new cover graphic.
We mentioned in the original preface several areas of optimization that are not covered in this book. During the past six years, this list has only grown longer, as the field has continued to expand in new directions. In this regard, the following areas are particularly noteworthy: optimization problems with complementarity constraints, second-order cone and semidefinite programming, simulation-based optimization, robust optimization, and mixed-integer nonlinear programming. All these areas have seen theoretical and algorithmic advances in recent years, and in many cases developments are being driven by new classes of applications. Although this book does not cover any of these areas directly, it provides a foundation from which they can be studied.
Jorge Nocedal Stephen J. Wright
Evanston, IL Madison, WI
CHAPTER 1

Introduction
People optimize. Investors seek to create portfolios that avoid excessive risk while achieving a high rate of return. Manufacturers aim for maximum efficiency in the design and operation of their production processes. Engineers adjust parameters to optimize the performance of their designs.
Nature optimizes. Physical systems tend to a state of minimum energy. The molecules in an isolated chemical system react with each other until the total potential energy of their electrons is minimized. Rays of light follow paths that minimize their travel time.
Optimization is an important tool in decision science and in the analysis of physical systems. To make use of this tool, we must first identify some objective, a quantitative measure of the performance of the system under study. This objective could be profit, time, potential energy, or any quantity or combination of quantities that can be represented by a single number. The objective depends on certain characteristics of the system, called variables or unknowns. Our goal is to find values of the variables that optimize the objective. Often the variables are restricted, or constrained, in some way. For instance, quantities such as electron density in a molecule and the interest rate on a loan cannot be negative.
The process of identifying objective, variables, and constraints for a given problem is known as modeling. Construction of an appropriate model is the first stepsometimes the most important stepin the optimization process. If the model is too simplistic, it will not give useful insights into the practical problem. If it is too complex, it may be too difficult to solve.
Once the model has been formulated, an optimization algorithm can be used to find its solution, usually with the help of a computer. There is no universal optimization algorithm but rather a collection of algorithms, each of which is tailored to a particular type of optimization problem. The responsibility of choosing the algorithm that is appropriate for a specific application often falls on the user. This choice is an important one, as it may determine whether the problem is solved rapidly or slowly and, indeed, whether the solution is found at all.
After an optimization algorithm has been applied to the model, we must be able to recognize whether it has succeeded in its task of finding a solution. In many cases, there are elegant mathematical expressions known as optimality conditions for checking that the current set of variables is indeed the solution of the problem. If the optimality conditions are not satisfied, they may give useful information on how the current estimate of the solution can be improved. The model may be improved by applying techniques such as sensitivity analysis, which reveals the sensitivity of the solution to changes in the model and data. Interpretation of the solution in terms of the application may also suggest ways in which the model can be refined or improved or corrected. If any changes are made to the model, the optimization problem is solved anew, and the process repeats.
MATHEMATICAL FORMULATION
Mathematically speaking, optimization is the minimization or maximization of a function subject to constraints on its variables. We use the following notation:
x is the vector of variables, also called unknowns or parameters;

f is the objective function, a scalar function of x that we want to maximize or minimize;

c_i are constraint functions, which are scalar functions of x that define certain equations and inequalities that the unknown vector x must satisfy.
Using this notation, the optimization problem can be written as follows:

\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad
\begin{aligned}
c_i(x) &= 0, \quad i \in \mathcal{E}, \\
c_i(x) &\ge 0, \quad i \in \mathcal{I}.
\end{aligned}
\tag{1.1}
\]

Here E and I are sets of indices for equality and inequality constraints, respectively. As a simple example, consider the problem

\[
\min \; (x_1 - 2)^2 + (x_2 - 1)^2 \quad \text{subject to} \quad
\begin{aligned}
x_1^2 - x_2 &\le 0, \\
x_1 + x_2 &\le 2.
\end{aligned}
\tag{1.2}
\]

We can write this problem in the form (1.1) by defining

\[
f(x) = (x_1 - 2)^2 + (x_2 - 1)^2, \qquad
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},
\]

\[
c(x) = \begin{bmatrix} c_1(x) \\ c_2(x) \end{bmatrix}
     = \begin{bmatrix} -x_1^2 + x_2 \\ -x_1 - x_2 + 2 \end{bmatrix},
\qquad \mathcal{I} = \{1, 2\}, \quad \mathcal{E} = \emptyset.
\]

[Figure 1.1: Geometrical representation of the problem (1.2), showing the contours of f, the constraint boundaries c₁ and c₂, the feasible region between them, and the solution x*.]

Figure 1.1 shows the contours of the objective function, that is, the set of points for which f(x) has a constant value. It also illustrates the feasible region, which is the set of points satisfying all the constraints (the area between the two constraint boundaries), and the point
x*, which is the solution of the problem. Note that the infeasible side of the inequality constraints is shaded.
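To make the formulation concrete, here is a minimal sketch (our addition, not part of the text) that solves problem (1.2) numerically with SciPy's general-purpose constrained solver. The constraints are passed in exactly the form (1.1), as functions cᵢ with cᵢ(x) ≥ 0 for i ∈ I.

```python
import numpy as np
from scipy.optimize import minimize

# Objective of problem (1.2): f(x) = (x1 - 2)^2 + (x2 - 1)^2.
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

# Inequality constraints in the form (1.1): c_i(x) >= 0, i in I = {1, 2}.
constraints = [
    {"type": "ineq", "fun": lambda x: -x[0] ** 2 + x[1]},    # c1(x) = -x1^2 + x2
    {"type": "ineq", "fun": lambda x: -x[0] - x[1] + 2.0},   # c2(x) = -x1 - x2 + 2
]

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)  # the computed solution x*
```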
The example above illustrates, too, that transformations are often necessary to express an optimization problem in the particular form (1.1). Often it is more natural or convenient to label the unknowns with two or three subscripts, or to refer to different variables by completely different names, so that relabeling is necessary to pose the problem in the form (1.1). Another common difference is that we are required to maximize rather than minimize f, but we can accommodate this change easily by minimizing −f in the formulation (1.1). Good modeling systems perform the conversion to standardized formulations such as (1.1) transparently to the user.
EXAMPLE: A TRANSPORTATION PROBLEM
We begin with a much simplified example of a problem that might arise in manufacturing and transportation. A chemical company has 2 factories F₁ and F₂ and a dozen retail outlets R₁, R₂, ..., R₁₂. Each factory Fᵢ can produce aᵢ tons of a certain chemical product each week; aᵢ is called the capacity of the plant. Each retail outlet Rⱼ has a known weekly demand of bⱼ tons of the product. The cost of shipping one ton of the product from factory Fᵢ to retail outlet Rⱼ is cᵢⱼ.

The problem is to determine how much of the product to ship from each factory to each outlet so as to satisfy all the requirements and minimize cost. The variables of the problem are xᵢⱼ, i = 1, 2, j = 1, ..., 12, where xᵢⱼ is the number of tons of the product shipped from factory Fᵢ to retail outlet Rⱼ; see Figure 1.2. We can write the problem as

\[
\begin{aligned}
\min \quad & \sum_{ij} c_{ij} x_{ij} & \quad & (1.3a) \\
\text{subject to} \quad & \sum_{j=1}^{12} x_{ij} \le a_i, \quad i = 1, 2, & & (1.3b) \\
& \sum_{i=1}^{2} x_{ij} \ge b_j, \quad j = 1, \dots, 12, & & (1.3c) \\
& x_{ij} \ge 0, \quad i = 1, 2, \ j = 1, \dots, 12. & & (1.3d)
\end{aligned}
\]

This type of problem is known as a linear programming problem, since the objective function and the constraints are all linear functions. In a more practical model, we would also include costs associated with manufacturing and storing the product. There may be volume discounts in practice for shipping the product; for example the cost (1.3a) could be represented by \(\sum_{ij} c_{ij} \sqrt{\delta + x_{ij}}\), where δ > 0 is a small subscription fee. In this case, the problem is a nonlinear program because the objective function is nonlinear.
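The linear program (1.3) can be handed directly to an off-the-shelf LP solver. The sketch below (ours, with invented capacities aᵢ, demands bⱼ, and costs cᵢⱼ) uses SciPy's linprog; since linprog expects constraints of the form A_ub x ≤ b_ub, the demand constraints (1.3c) are negated.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
c = rng.uniform(1.0, 5.0, size=(2, 12))   # hypothetical shipping costs c_ij
a = np.array([60.0, 70.0])                # hypothetical capacities a_i
b = rng.uniform(5.0, 10.0, size=12)       # hypothetical demands b_j

# Flatten x_ij row-wise: x = (x_11, ..., x_1,12, x_21, ..., x_2,12).
A_cap = np.zeros((2, 24))
A_cap[0, :12] = 1.0                       # sum_j x_1j
A_cap[1, 12:] = 1.0                       # sum_j x_2j
A_dem = np.hstack([np.eye(12), np.eye(12)])   # sum_i x_ij for each outlet j

# Capacity constraints (1.3b) as-is; demand constraints (1.3c) negated.
res = linprog(c=c.ravel(),
              A_ub=np.vstack([A_cap, -A_dem]),
              b_ub=np.hstack([a, -b]),
              bounds=[(0, None)] * 24)    # nonnegativity (1.3d)
print(res.x.reshape(2, 12))               # optimal shipments x_ij
```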
[Figure 1.2: A transportation problem, with factories F₁ and F₂ shipping amounts xᵢⱼ to retail outlets R₁, R₂, ..., R₁₂.]

CONTINUOUS VERSUS DISCRETE OPTIMIZATION
In some optimization problems the variables make sense only if they take on integer values. For example, a variable xᵢ could represent the number of power plants of type i that should be constructed by an electricity provider during the next 5 years, or it could indicate whether or not a particular factory should be located in a particular city. The mathematical formulation of such problems includes integrality constraints, which have the form xᵢ ∈ ℤ, where ℤ is the set of integers, or binary constraints, which have the form xᵢ ∈ {0, 1}, in addition to algebraic constraints like those appearing in (1.1). Problems of this type are called integer programming problems. If some of the variables in the problem are not restricted to be integer or binary variables, they are sometimes called mixed integer programming problems, or MIPs for short.
Integer programming problems are a type of discrete optimization problem. Generally, discrete optimization problems may contain not only integers and binary variables, but also more abstract variable objects such as permutations of an ordered set. The defining feature of a discrete optimization problem is that the unknown x is drawn from a finite (but often very large) set. By contrast, the feasible set for continuous optimization problems (the class of problems studied in this book) is usually uncountably infinite, as when the components of x are allowed to be real numbers. Continuous optimization problems are normally easier to solve because the smoothness of the functions makes it possible to use objective and constraint information at a particular point x to deduce information about the function's behavior at all points close to x. In discrete problems, by contrast, the behavior of the objective and constraints may change significantly as we move from one feasible point to another, even if the two points are close by some measure. The feasible sets for discrete optimization problems can be thought of as exhibiting an extreme form of nonconvexity, as a convex combination of two feasible points is in general not feasible.
Discrete optimization problems are not addressed directly in this book; we refer the reader to the texts by Papadimitriou and Steiglitz [235], Nemhauser and Wolsey [224], Cook et al. [77], and Wolsey [312] for comprehensive treatments of this subject. We note, however, that continuous optimization techniques often play an important role in solving discrete optimization problems. For instance, the branch-and-bound method for integer linear programming problems requires the repeated solution of linear programming relaxations, in which some of the integer variables are fixed at integer values, while for other integer variables the integrality constraints are temporarily ignored. These subproblems are usually solved by the simplex method, which is discussed in Chapter 13 of this book.
CONSTRAINED AND UNCONSTRAINED OPTIMIZATION
Problems with the general form (1.1) can be classified according to the nature of the objective function and constraints (linear, nonlinear, convex), the number of variables (large or small), the smoothness of the functions (differentiable or nondifferentiable), and so on. An important distinction is between problems that have constraints on the variables and those that do not. This book is divided into two parts according to this classification.
Unconstrained optimization problems, for which we have E = I = ∅ in (1.1), arise directly in many practical applications. Even for some problems with natural constraints on the variables, it may be safe to disregard them, as they do not affect the solution and do not interfere with algorithms. Unconstrained problems arise also as reformulations of constrained optimization problems, in which the constraints are replaced by penalization terms added to the objective function that have the effect of discouraging constraint violations.
Constrained optimization problems arise from models in which constraints play an essential role, for example in imposing budgetary constraints in an economic problem or shape constraints in a design problem. These constraints may be simple bounds such as 0 ≤ x₁ ≤ 100, more general linear constraints such as Σᵢ xᵢ ≤ 1, or nonlinear inequalities that represent complex relationships among the variables.
When the objective function and all the constraints are linear functions of x, the problem is a linear programming problem. Problems of this type are probably the most widely formulated and solved of all optimization problems, particularly in management, financial, and economic applications. Nonlinear programming problems, in which at least some of the constraints or the objective are nonlinear functions, tend to arise naturally in the physical sciences and engineering, and are becoming more widely used in management and economic sciences as well.
GLOBAL AND LOCAL OPTIMIZATION
Many algorithms for nonlinear optimization problems seek only a local solution, a point at which the objective function is smaller than at all other feasible nearby points. They do not always find the global solution, which is the point with lowest function value among all feasible points. Global solutions are needed in some applications, but for many problems they
are difficult to recognize and even more difficult to locate. For convex programming problems, and more particularly for linear programs, local solutions are also global solutions. General nonlinear problems, both constrained and unconstrained, may possess local solutions that are not global solutions.
In this book we treat global optimization only in passing and focus instead on the computation and characterization of local solutions. We note, however, that many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied.
Research papers on global optimization can be found in Floudas and Pardalos [109] and in the Journal of Global Optimization.
STOCHASTIC AND DETERMINISTIC OPTIMIZATION
In some optimization problems, the model cannot be fully specified because it depends on quantities that are unknown at the time of formulation. This characteristic is shared by many economic and financial planning models, which may depend for example on future interest rates, future demands for a product, or future commodity prices, but uncertainty can arise naturally in almost any type of application.
Rather than just use a best guess for the uncertain quantities, modelers may obtain more useful solutions by incorporating additional knowledge about these quantities into the model. For example, they may know a number of possible scenarios for the uncertain demand, along with estimates of the probabilities of each scenario. Stochastic optimization algorithms use these quantifications of the uncertainty to produce solutions that optimize the expected performance of the model.
Related paradigms for dealing with uncertain data in the model include chance constrained optimization, in which we ensure that the variables x satisfy the given constraints to some specified probability, and robust optimization, in which certain constraints are required to hold for all possible values of the uncertain data.
We do not consider stochastic optimization problems further in this book, focusing instead on deterministic optimization problems, in which the model is completely known. Many algorithms for stochastic optimization do, however, proceed by formulating one or more deterministic subproblems, each of which can be solved by the techniques outlined here.
Stochastic and robust optimization have seen a great deal of recent research activity. For further information on stochastic optimization, consult the books of Birge and Louveaux [22] and Kall and Wallace [174]. Robust optimization is discussed in Ben-Tal and Nemirovski [15].
CONVEXITY
The concept of convexity is fundamental in optimization. Many practical problems possess this property, which generally makes them easier to solve both in theory and practice.
The term convex can be applied both to sets and to functions. A set S ⊂ ℝⁿ is a convex set if the straight line segment connecting any two points in S lies entirely inside S. Formally, for any two points x ∈ S and y ∈ S, we have αx + (1 − α)y ∈ S for all α ∈ [0, 1]. The function f is a convex function if its domain S is a convex set and if for any two points x and y in S, the following property is satisfied:

\[
f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \text{for all } \alpha \in [0, 1]. \tag{1.4}
\]

Simple instances of convex sets include the unit ball {y ∈ ℝⁿ : ‖y‖₂ ≤ 1} and any polyhedron, which is a set defined by linear equalities and inequalities, that is,

\[
\{ x \in \mathbb{R}^n \mid Ax = b, \; Cx \le d \},
\]

where A and C are matrices of appropriate dimension, and b and d are vectors. Simple instances of convex functions include the linear function f(x) = cᵀx + α, for any constant vector c ∈ ℝⁿ and scalar α, and the convex quadratic function f(x) = xᵀHx, where H is a symmetric positive semidefinite matrix.

We say that f is strictly convex if the inequality in (1.4) is strict whenever x ≠ y and α is in the open interval (0, 1). A function f is said to be concave if −f is convex.
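As a quick numerical illustration (our addition), the following sketch spot-checks the defining inequality (1.4) for a convex quadratic f(x) = xᵀHx with H symmetric positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
H = M @ M.T                     # H = M M^T is symmetric positive semidefinite

def f(x):
    return x @ H @ x            # convex quadratic f(x) = x^T H x

# Check (1.4) at random point pairs and random alpha in [0, 1].
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    alpha = rng.uniform()
    lhs = f(alpha * x + (1 - alpha) * y)
    rhs = alpha * f(x) + (1 - alpha) * f(y)
    assert lhs <= rhs + 1e-9    # small tolerance for rounding error
print("inequality (1.4) held at all sampled points")
```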
If the objective function in the optimization problem (1.1) and the feasible region are both convex, then any local solution of the problem is in fact a global solution.
The term convex programming is used to describe a special case of the general constrained optimization problem (1.1) in which
the objective function is convex,
the equality constraint functions ci , i E , are linear, and
the inequality constraint functions ci , i I, are concave.
OPTIMIZATION ALGORITHMS
Optimization algorithms are iterative. They begin with an initial guess of the variable x and generate a sequence of improved estimates called iterates until they terminate, hopefully at a solution. The strategy used to move from one iterate to the next distinguishes one algorithm from another. Most strategies make use of the values of the objective function
f, the constraint functions cᵢ, and possibly the first and second derivatives of these functions. Some algorithms accumulate information gathered at previous iterations, while others use only local information obtained at the current point. Regardless of these specifics (which will receive plenty of attention in the rest of the book), good algorithms should possess the following properties:
Robustness. They should perform well on a wide variety of problems in their class, for all reasonable values of the starting point.
Efficiency. They should not require excessive computer time or storage.
Accuracy. They should be able to identify a solution with precision, without being overly sensitive to errors in the data or to the arithmetic rounding errors that occur when the algorithm is implemented on a computer.
These goals may conflict. For example, a rapidly convergent method for a large unconstrained nonlinear problem may require too much computer storage. On the other hand, a robust method may also be the slowest. Tradeoffs between convergence rate and storage requirements, and between robustness and speed, and so on, are central issues in numerical optimization. They receive careful consideration in this book.
The mathematical theory of optimization is used both to characterize optimal points and to provide the basis for most algorithms. It is not possible to have a good understanding of numerical optimization without a firm grasp of the supporting theory. Accordingly, this book gives a solid though not comprehensive treatment of optimality conditions, as well as convergence analysis that reveals the strengths and weaknesses of some of the most important algorithms.
NOTES AND REFERENCES
Optimization traces its roots to the calculus of variations and the work of Euler and Lagrange. The development of linear programming in the 1940s broadened the field and stimulated much of the progress in modern optimization theory and practice during the past 60 years.
Optimization is often called "mathematical programming," a somewhat confusing term coined in the 1940s, before the word "programming" became inextricably linked with computer software. The original meaning of this word (and the intended one in this context) was more inclusive, with connotations of algorithm design and analysis.
Modeling will not be treated extensively in the book. It is an essential subject in its own right, as it makes the connection between optimization algorithms and software on the one hand, and applications on the other hand. Information about modeling techniques for various application areas can be found in Dantzig [86], Ahuja, Magnanti, and Orlin [1], Fourer, Gay, and Kernighan [112], Winston [308], and Rardin [262].
CHAPTER 2
Fundamentals of Unconstrained Optimization
In unconstrained optimization, we minimize an objective function that depends on real variables, with no restrictions at all on the values of these variables. The mathematical formulation is
\[
\min_{x} f(x), \tag{2.1}
\]

where x ∈ ℝⁿ is a real vector with n ≥ 1 components and f : ℝⁿ → ℝ is a smooth function.
[Figure 2.1: Least squares data fitting problem, plotting the measurements y₁, y₂, ..., yₘ against the times t₁, t₂, ..., tₘ.]
Usually, we lack a global perspective on the function f. All we know are the values of f and maybe some of its derivatives at a set of points x₀, x₁, x₂, .... Fortunately, our algorithms get to choose these points, and they try to do so in a way that identifies a solution reliably and without using too much computer time or storage. Often, the information about f does not come cheaply, so we usually prefer algorithms that do not call for this information unnecessarily.
EXAMPLE 2.1
Suppose that we are trying to find a curve that fits some experimental data. Figure 2.1 plots measurements y₁, y₂, ..., yₘ of a signal taken at times t₁, t₂, ..., tₘ. From the data and our knowledge of the application, we deduce that the signal has exponential and oscillatory behavior of certain types, and we choose to model it by the function

\[
\phi(t; x) = x_1 + x_2 e^{-(x_3 - t)^2 / x_4} + x_5 \cos(x_6 t).
\]

The real numbers xᵢ, i = 1, 2, ..., 6, are the parameters of the model; we would like to choose them to make the model values φ(tⱼ; x) fit the observed data yⱼ as closely as possible. To state our objective as an optimization problem, we group the parameters xᵢ into a vector of unknowns x = (x₁, x₂, ..., x₆)ᵀ, and define the residuals

\[
r_j(x) = y_j - \phi(t_j; x), \quad j = 1, 2, \dots, m, \tag{2.2}
\]

which measure the discrepancy between the model and the observed data. Our estimate of
x will be obtained by solving the problem

\[
\min_{x \in \mathbb{R}^6} f(x) = r_1^2(x) + r_2^2(x) + \cdots + r_m^2(x). \tag{2.3}
\]
This is a nonlinear least-squares problem, a special case of unconstrained optimization. It illustrates that some objective functions can be expensive to evaluate even when the number of variables is small. Here we have n = 6, but if the number of measurements m is large (10⁵, say), evaluation of f(x) for a given parameter vector x is a significant computation.
Suppose that for the data given in Figure 2.1 the optimal solution of (2.3) is approximately x* = (1.1, 0.01, 1.2, 1.5, 2.0, 1.5) and the corresponding function value is f(x*) = 0.34. Because the optimal objective is nonzero, there must be discrepancies between the observed measurements yⱼ and the model predictions φ(tⱼ; x*) for some (usually most) values of j: the model has not reproduced all the data points exactly. How, then, can we verify that x* is indeed a minimizer of f? To answer this question, we need to define the term "solution" and explain how to recognize solutions. Only then can we discuss
algorithms for unconstrained optimization problems.
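To make Example 2.1 concrete, here is a small sketch (ours, with synthetic data standing in for the measurements, and with the model φ as reconstructed above) that evaluates the least-squares objective (2.3); any of the unconstrained algorithms discussed later can be applied to this function.

```python
import numpy as np

def phi(t, x):
    """Model of Example 2.1: phi(t; x) = x1 + x2 exp(-(x3 - t)^2 / x4) + x5 cos(x6 t)."""
    return x[0] + x[1] * np.exp(-(x[2] - t) ** 2 / x[3]) + x[4] * np.cos(x[5] * t)

def f(x, t, y):
    """Objective (2.3): the sum of squared residuals (2.2)."""
    r = y - phi(t, x)           # residuals r_j(x) = y_j - phi(t_j; x)
    return np.sum(r ** 2)

# Synthetic data for illustration; in practice t and y come from measurements.
t = np.linspace(0.0, 10.0, 50)
x_ref = np.array([1.1, 0.01, 1.2, 1.5, 2.0, 1.5])
y = phi(t, x_ref) + 0.1 * np.random.default_rng(2).standard_normal(t.size)
print(f(x_ref, t, y))           # nonzero because of the added noise
```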
2.1 WHAT IS A SOLUTION?
Generally, we would be happiest if we found a global minimizer of f, a point where the function attains its least value. A formal definition is
A point x* is a global minimizer if f(x*) ≤ f(x) for all x,
where x ranges over all of ℝⁿ (or at least over the domain of interest to the modeler). The global minimizer can be difficult to find, because our knowledge of f is usually only local. Since our algorithm does not visit many points (we hope!), we usually do not have a good picture of the overall shape of f, and we can never be sure that the function does not take a sharp dip in some region that has not been sampled by the algorithm. Most algorithms are able to find only a local minimizer, which is a point that achieves the smallest value of f in its neighborhood. Formally, we say:
A point x* is a local minimizer if there is a neighborhood N of x* such that f(x*) ≤ f(x) for all x ∈ N.
Recall that a neighborhood of x* is simply an open set that contains x*. A point that satisfies this definition is sometimes called a weak local minimizer. This terminology distinguishes it from a strict local minimizer, which is the outright winner in its neighborhood. Formally,

A point x* is a strict local minimizer (also called a strong local minimizer) if there is a neighborhood N of x* such that f(x*) < f(x) for all x ∈ N with x ≠ x*.

For the constant function f(x) = 2, every point x is a weak local minimizer, while the function f(x) = (x − 2)⁴ has a strict local minimizer at x* = 2.
A slightly more exotic type of local minimizer is defined as follows.

A point x* is an isolated local minimizer if there is a neighborhood N of x* such that x* is the only local minimizer in N.
Some strict local minimizers are not isolated, as illustrated by the function

\[
f(x) = x^4 \cos(1/x) + 2x^4, \qquad f(0) = 0,
\]

which is twice continuously differentiable and has a strict local minimizer at x* = 0. However, there are strict local minimizers at many nearby points xⱼ, and we can label these points so that xⱼ → 0 as j → ∞.
While strict local minimizers are not always isolated, it is true that all isolated local minimizers are strict.
Figure 2.2 illustrates a function with many local minimizers. It is usually difficult to find the global minimizer for such functions, because algorithms tend to be trapped at local minimizers. This example is by no means pathological. In optimization problems associated with the determination of molecular conformation, the potential function to be minimized may have millions of local minima.
Figure 2.2 A difficult case for global minimization.
Sometimes we have additional "global" knowledge about f that may help in identifying global minima. An important special case is that of convex functions, for which every local minimizer is also a global minimizer.
RECOGNIZING A LOCAL MINIMUM
From the definitions given above, it might seem that the only way to find out whether a point x∗ is a local minimizer is to examine all the points in its immediate vicinity, to make sure that none of them has a smaller function value. When the function f is smooth, however, there are more efficient and practical ways to identify local minima. In particular, if f is twice continuously differentiable, we may be able to tell that x∗ is a local minimizer (and possibly a strict local minimizer) by examining just the gradient ∇f(x∗) and the Hessian ∇²f(x∗).

The mathematical tool used to study minimizers of smooth functions is Taylor's theorem. Because this theorem is central to our analysis throughout the book, we state it now. Its proof can be found in any calculus textbook.
Theorem 2.1 (Taylor's Theorem).
Suppose that f : ℝⁿ → ℝ is continuously differentiable and that p ∈ ℝⁿ. Then we have that

  f(x + p) = f(x) + ∇f(x + tp)ᵀp,   (2.4)

for some t ∈ (0, 1). Moreover, if f is twice continuously differentiable, we have that

  ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt,   (2.5)

and that

  f(x + p) = f(x) + ∇f(x)ᵀp + ½ pᵀ∇²f(x + tp) p,   (2.6)

for some t ∈ (0, 1).
Necessary conditions for optimality are derived by assuming that x∗ is a local minimizer and then proving facts about ∇f(x∗) and ∇²f(x∗).

Theorem 2.2 (First-Order Necessary Conditions).
If x∗ is a local minimizer and f is continuously differentiable in an open neighborhood of x∗, then ∇f(x∗) = 0.

PROOF. Suppose for contradiction that ∇f(x∗) ≠ 0. Define the vector p = −∇f(x∗) and note that pᵀ∇f(x∗) = −‖∇f(x∗)‖² < 0. Because ∇f is continuous near x∗, there is a scalar T > 0 such that

  pᵀ∇f(x∗ + tp) < 0,   for all t ∈ [0, T].

For any τ ∈ (0, T], we have by Taylor's theorem that

  f(x∗ + τp) = f(x∗) + τpᵀ∇f(x∗ + tp),   for some t ∈ (0, τ).

Therefore, f(x∗ + τp) < f(x∗) for all τ ∈ (0, T]. We have found a direction leading away from x∗ along which f decreases, so x∗ is not a local minimizer, and we have a contradiction.

We call x∗ a stationary point if ∇f(x∗) = 0. According to Theorem 2.2, any local minimizer must be a stationary point.

For the next result we recall that a matrix B is positive definite if pᵀBp > 0 for all p ≠ 0, and positive semidefinite if pᵀBp ≥ 0 for all p (see the Appendix).
Theorem 2.3 (Second-Order Necessary Conditions).
If x∗ is a local minimizer of f and ∇²f exists and is continuous in an open neighborhood of x∗, then ∇f(x∗) = 0 and ∇²f(x∗) is positive semidefinite.

PROOF. We know from Theorem 2.2 that ∇f(x∗) = 0. For contradiction, assume that ∇²f(x∗) is not positive semidefinite. Then we can choose a vector p such that pᵀ∇²f(x∗)p < 0, and because ∇²f is continuous near x∗, there is a scalar T > 0 such that

  pᵀ∇²f(x∗ + tp)p < 0,   for all t ∈ [0, T].

By doing a Taylor series expansion around x∗, we have for all τ ∈ (0, T] and some t ∈ (0, τ) that

  f(x∗ + τp) = f(x∗) + τpᵀ∇f(x∗) + ½τ²pᵀ∇²f(x∗ + tp)p < f(x∗).

As in Theorem 2.2, we have found a direction from x∗ along which f is decreasing, and so again, x∗ is not a local minimizer.
We now describe sufficient conditions, which are conditions on the derivatives of f at the point x∗ that guarantee that x∗ is a local minimizer.

Theorem 2.4 (Second-Order Sufficient Conditions).
Suppose that ∇²f is continuous in an open neighborhood of x∗ and that ∇f(x∗) = 0 and ∇²f(x∗) is positive definite. Then x∗ is a strict local minimizer of f.

PROOF. Because the Hessian is continuous and positive definite at x∗, we can choose a radius r > 0 so that ∇²f(x) remains positive definite for all x in the open ball D = {z | ‖z − x∗‖ < r}. Taking any nonzero vector p with ‖p‖ < r, we have x∗ + p ∈ D and so

  f(x∗ + p) = f(x∗) + pᵀ∇f(x∗) + ½pᵀ∇²f(z)p
            = f(x∗) + ½pᵀ∇²f(z)p,

where z = x∗ + tp for some t ∈ (0, 1). Since z ∈ D, we have pᵀ∇²f(z)p > 0, and therefore f(x∗ + p) > f(x∗), giving the result.
Note that the second-order sufficient conditions of Theorem 2.4 guarantee something stronger than the necessary conditions discussed earlier; namely, that the minimizer is a strict local minimizer. Note too that the second-order sufficient conditions are not necessary: A point x∗ may be a strict local minimizer, and yet may fail to satisfy the sufficient conditions. A simple example is given by the function f(x) = x⁴, for which the point x∗ = 0 is a strict local minimizer at which the Hessian matrix vanishes (and is therefore not positive definite).
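The conditions in Theorems 2.2 through 2.4 can be checked numerically at a candidate point. The sketch below uses finite-difference derivatives and an illustrative test function, both our own choices rather than the text's, to test whether the gradient vanishes and whether the Hessian eigenvalues are all positive.

```python
import numpy as np

# Finite-difference check of the optimality conditions; the test function
# and step sizes are illustrative choices, not taken from the text.

def grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference
    return g

def hess(f, x, h=1e-4):
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
    return 0.5 * (H + H.T)                       # symmetrize

f = lambda x: (x[0] - 2) ** 4 + (x[0] - 2) ** 2 * x[1] ** 2 + (x[1] + 1) ** 2
x_star = np.array([2.0, -1.0])                   # candidate minimizer

print(np.linalg.norm(grad(f, x_star)))           # ~0: first-order condition holds
print(np.linalg.eigvalsh(hess(f, x_star)))       # all positive: Theorem 2.4 applies
```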
When the objective function is convex, local and global minimizers are simple to characterize.
Theorem 2.5.
When f is convex, any local minimizer x∗ is a global minimizer of f. If in addition f is differentiable, then any stationary point x∗ is a global minimizer of f.

PROOF. Suppose that x∗ is a local but not a global minimizer. Then we can find a point z ∈ ℝⁿ with f(z) < f(x∗). Consider the line segment that joins x∗ to z, that is,

  x = λz + (1 − λ)x∗,   for some λ ∈ (0, 1].   (2.7)

By the convexity property for f, we have

  f(x) ≤ λf(z) + (1 − λ)f(x∗) < f(x∗).   (2.8)

Any neighborhood N of x∗ contains a piece of the line segment (2.7), so there will always be points x ∈ N at which (2.8) is satisfied. Hence, x∗ is not a local minimizer.

For the second part of the theorem, suppose that x∗ is not a global minimizer and choose z as above. Then, from convexity, we have

  ∇f(x∗)ᵀ(z − x∗) = d/dλ f(x∗ + λ(z − x∗)) |_{λ=0}   (see the Appendix)
                  = lim_{λ↓0} [f(x∗ + λ(z − x∗)) − f(x∗)] / λ
                  ≤ lim_{λ↓0} [λf(z) + (1 − λ)f(x∗) − f(x∗)] / λ
                  = f(z) − f(x∗) < 0.

Therefore, ∇f(x∗) ≠ 0, and so x∗ is not a stationary point.

These results, which are based on elementary calculus, provide the foundations for unconstrained optimization algorithms. In one way or another, all algorithms seek a point where ∇f vanishes.
NONSMOOTH PROBLEMS
This book focuses on smooth functions, by which we generally mean functions whose second derivatives exist and are continuous. We note, however, that there are interesting problems in which the functions involved may be nonsmooth and even discontinuous. It is not possible in general to identify a minimizer of a general discontinuous function. If, however, the function consists of a few smooth pieces, with discontinuities between the pieces, it may be possible to find the minimizer by minimizing each smooth piece individually.
If the function is continuous everywhere but nondifferentiable at certain points, as in Figure 2.3, we can identify a solution by examining the subgradient or generalized gradient, which are generalizations of the concept of gradient to the nonsmooth case.

Figure 2.3 Nonsmooth function with minimum at a kink.

Nonsmooth optimization is beyond the scope of this book; we refer instead to Hiriart-Urruty and Lemaréchal [170] for an extensive discussion of theory. Here, we mention only that the minimization of a function such as the one illustrated in Figure 2.3 (which contains a jump discontinuity in the first derivative f′(x) at the minimum) is difficult because the behavior of f is not predictable near the point of nonsmoothness. That is, we cannot be sure that information about f obtained at one point can be used to infer anything about f at neighboring points, because points of nondifferentiability may intervene. However, minimization of certain special nondifferentiable functions, such as

  f(x) = ‖r(x)‖₁,   f(x) = ‖r(x)‖_∞,   (2.9)

where r(x) is a vector function, can be reformulated as smooth constrained optimization problems; see Exercise 12.5 in Chapter 12 and (17.31). The functions (2.9) are useful in data fitting, where r(x) is the residual vector whose components are defined in (2.2).
2.2 OVERVIEW OF ALGORITHMS
The last forty years have seen the development of a powerful collection of algorithms for unconstrained optimization of smooth functions. We now give a broad description of their main properties, and we describe them in more detail in Chapters 3, 4, 5, 6, and 7. All algorithms for unconstrained minimization require the user to supply a starting point, which we usually denote by x₀. The user with knowledge about the application and the data set may be in a good position to choose x₀ to be a reasonable estimate of the solution. Otherwise, the starting point must be chosen by the algorithm, either by a systematic approach or in some arbitrary manner.
Beginning at x₀, optimization algorithms generate a sequence of iterates {x_k}_{k=0}^∞ that terminate when either no more progress can be made or when it seems that a solution point has been approximated with sufficient accuracy. In deciding how to move from one iterate x_k to the next, the algorithms use information about the function f at x_k, and possibly also information from earlier iterates x₀, x₁, . . . , x_{k−1}. They use this information to find a new iterate x_{k+1} with a lower function value than x_k. There exist nonmonotone algorithms that do not insist on a decrease in f at every step, but even these algorithms require f to be decreased after some prescribed number m of iterations, that is,

  f(x_k) < f(x_{k−m}).
There are two fundamental strategies for moving from the current point x_k to a new iterate x_{k+1}. Most of the algorithms described in this book follow one of these approaches.
TWO STRATEGIES: LINE SEARCH AND TRUST REGION
In the line search strategy, the algorithm chooses a direction p_k and searches along this direction from the current iterate x_k for a new iterate with a lower function value. The distance to move along p_k can be found by approximately solving the following one-dimensional minimization problem to find a step length α:

  min_{α>0} f(x_k + αp_k).   (2.10)

By solving (2.10) exactly, we would derive the maximum benefit from the direction p_k, but an exact minimization may be expensive and is usually unnecessary. Instead, the line search algorithm generates a limited number of trial step lengths until it finds one that loosely approximates the minimum of (2.10). At the new point, a new search direction and step length are computed, and the process is repeated.

In the second algorithmic strategy, known as trust region, the information gathered about f is used to construct a model function m_k whose behavior near the current point x_k is similar to that of the actual objective function f. Because the model m_k may not be a good approximation of f when x is far from x_k, we restrict the search for a minimizer of m_k to some region around x_k. In other words, we find the candidate step p by approximately solving the following subproblem:

  min_p m_k(x_k + p),   where x_k + p lies inside the trust region.   (2.11)

If the candidate solution does not produce a sufficient decrease in f, we conclude that the trust region is too large, and we shrink it and re-solve (2.11). Usually, the trust region is a ball defined by ‖p‖₂ ≤ Δ, where the scalar Δ > 0 is called the trust-region radius. Elliptical and box-shaped trust regions may also be used.

The model m_k in (2.11) is usually defined to be a quadratic function of the form

  m_k(x_k + p) = f_k + pᵀ∇f_k + ½pᵀB_k p,   (2.12)

where f_k, ∇f_k, and B_k are a scalar, vector, and matrix, respectively. As the notation indicates, f_k and ∇f_k are chosen to be the function and gradient values at the point x_k, so that m_k and f are in agreement to first order at the current iterate x_k. The matrix B_k is either the Hessian ∇²f_k or some approximation to it.
Suppose that the objective function is given by f(x) = 10(x₂ − x₁²)² + (1 − x₁)². At the point x_k = (0, 1)ᵀ its gradient and Hessian are

  ∇f_k = [ −2  20 ]ᵀ,   ∇²f_k = [ −38   0
                                    0   20 ].
Figure 2.4 Two possible trust regions (circles) and their corresponding steps p_k. The solid lines are contours of the model function m_k.
The contour lines of the quadratic model (2.12) with B_k = ∇²f_k are depicted in Figure 2.4, which also illustrates the contours of the objective function f and the trust region. We have indicated contour lines where the model m_k has values 1 and 12. Note from Figure 2.4 that each time we decrease the size of the trust region after failure of a candidate iterate, the step from x_k to the new candidate will be shorter, and it usually points in a different direction from the previous candidate. The trust-region strategy differs in this respect from line search, which stays with a single search direction.

In a sense, the line search and trust-region approaches differ in the order in which they choose the direction and distance of the move to the next iterate. Line search starts by fixing the direction p_k and then identifying an appropriate distance, namely the step length α_k. In trust region, we first choose a maximum distance (the trust-region radius Δ_k) and then seek a direction and step that attain the best improvement possible subject to this distance constraint. If this step proves to be unsatisfactory, we reduce the distance measure Δ_k and try again.

The line search approach is discussed in more detail in Chapter 3. Chapter 4 discusses the trust-region strategy, including techniques for choosing and adjusting the size of the region and for computing approximate solutions to the trust-region problems (2.11). We now preview two major issues: choice of the search direction p_k in line search methods, and choice of the Hessian B_k in trust-region methods. These issues are closely related, as we now observe.
SEARCH DIRECTIONS FOR LINE SEARCH METHODS
The steepest descent direction −∇f_k is the most obvious choice for search direction for a line search method. It is intuitive; among all the directions we could move from x_k, it is the one along which f decreases most rapidly. To verify this claim, we appeal again to Taylor's theorem (Theorem 2.1), which tells us that for any search direction p and step-length parameter α, we have

  f(x_k + αp) = f(x_k) + αpᵀ∇f_k + ½α²pᵀ∇²f(x_k + tp)p,   for some t ∈ (0, α)

(see (2.6)). The rate of change in f along the direction p at x_k is simply the coefficient of α, namely, pᵀ∇f_k. Hence, the unit direction p of most rapid decrease is the solution to the problem

  min_p pᵀ∇f_k,   subject to ‖p‖ = 1.   (2.13)

Since pᵀ∇f_k = ‖p‖ ‖∇f_k‖ cos θ = ‖∇f_k‖ cos θ, where θ is the angle between p and ∇f_k, it is easy to see that the minimizer is attained when cos θ = −1 and

  p = −∇f_k / ‖∇f_k‖,

as claimed. As we illustrate in Figure 2.5, this direction is orthogonal to the contours of the function.
The steepest descent method is a line search method that moves along p_k = −∇f_k at every step. It can choose the step length α_k in a variety of ways, as we discuss in Chapter 3. One advantage of the steepest descent direction is that it requires calculation of the gradient ∇f_k but not of second derivatives. However, it can be excruciatingly slow on difficult problems.

Figure 2.5 Steepest descent direction for a function of two variables.

Line search methods may use search directions other than the steepest descent direction. In general, any descent direction (one that makes an angle of strictly less than π/2 radians with −∇f_k) is guaranteed to produce a decrease in f, provided that the step length is sufficiently small (see Figure 2.6).

Figure 2.6 A downhill direction p_k.

We can verify this claim by using Taylor's theorem. From (2.6), we have that

  f(x_k + εp_k) = f(x_k) + εp_kᵀ∇f_k + O(ε²).

When p_k is a downhill direction, the angle θ_k between p_k and −∇f_k has cos θ_k > 0, so that

  p_kᵀ∇f_k = −‖p_k‖ ‖∇f_k‖ cos θ_k < 0.
It follows that f(x_k + εp_k) < f(x_k) for all positive but sufficiently small values of ε.

Another important search direction, perhaps the most important one of all, is the Newton direction. This direction is derived from the second-order Taylor series approximation to f(x_k + p), which is

  f(x_k + p) ≈ f_k + pᵀ∇f_k + ½pᵀ∇²f_k p ≝ m_k(p).   (2.14)

Assuming for the moment that ∇²f_k is positive definite, we obtain the Newton direction by finding the vector p that minimizes m_k(p). By simply setting the derivative of m_k(p) to zero, we obtain the following explicit formula:

  p_kᴺ = −(∇²f_k)⁻¹ ∇f_k.   (2.15)

The Newton direction is reliable when the difference between the true function f(x_k + p) and its quadratic model m_k(p) is not too large. By comparing (2.14) with (2.6), we see that the only difference between these functions is that the matrix ∇²f(x_k + tp) in the third term of the expansion has been replaced by ∇²f_k. If ∇²f is sufficiently smooth, this difference introduces a perturbation of only O(‖p‖³) into the expansion, so that when ‖p‖ is small, the approximation f(x_k + p) ≈ m_k(p) is quite accurate.

The Newton direction can be used in a line search method when ∇²f_k is positive definite, for in this case we have

  ∇f_kᵀ p_kᴺ = −(p_kᴺ)ᵀ ∇²f_k p_kᴺ ≤ −σ_k ‖p_kᴺ‖²

for some σ_k > 0. Unless the gradient ∇f_k (and therefore the step p_kᴺ) is zero, we have that ∇f_kᵀ p_kᴺ < 0, so the Newton direction is a descent direction.
Unlike the steepest descent direction, there is a "natural" step length of 1 associated with the Newton direction. Most line search implementations of Newton's method use the unit step α = 1 where possible and adjust α only when it does not produce a satisfactory reduction in the value of f.
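A minimal sketch of the resulting iteration appears below. It reuses the derivatives of the example function from the previous snippet; the starting point is our own choice, made so that the Hessian stays positive definite and the raw step (2.15) is well defined.

```python
import numpy as np

# A sketch of the Newton iteration with the natural unit step length;
# grad_f and hess_f are those of the example f(x) = 10(x2 - x1^2)^2 + (1 - x1)^2.

def grad_f(x):
    return np.array([-40.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
                     20.0 * (x[1] - x[0] ** 2)])

def hess_f(x):
    return np.array([[-40.0 * x[1] + 120.0 * x[0] ** 2 + 2.0, -40.0 * x[0]],
                     [-40.0 * x[0], 20.0]])

x = np.array([1.2, 1.0])           # chosen so the Hessian is positive definite
for _ in range(5):
    # Solve the Newton system rather than forming the inverse in (2.15).
    p = np.linalg.solve(hess_f(x), -grad_f(x))
    x = x + p                      # natural step length alpha = 1
print(x)                           # close to the minimizer (1, 1)
```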
When ∇²f_k is not positive definite, the Newton direction may not even be defined, since (∇²f_k)⁻¹ may not exist. Even when it is defined, it may not satisfy the descent property ∇f_kᵀ p_kᴺ < 0, in which case it is unsuitable as a search direction. In these situations, line search methods modify the definition of p_k to make it satisfy the descent condition while retaining the benefit of the second-order information contained in ∇²f_k. We describe these modifications in Chapter 3.

Methods that use the Newton direction have a fast rate of local convergence, typically quadratic. After a neighborhood of the solution is reached, convergence to high accuracy often occurs in just a few iterations. The main drawback of the Newton direction is the need for the Hessian ∇²f(x). Explicit computation of this matrix of second derivatives can sometimes be a cumbersome, error-prone, and expensive process. Finite-difference and automatic differentiation techniques described in Chapter 8 may be useful in avoiding the need to calculate second derivatives by hand.
Quasi-Newton search directions provide an attractive alternative to Newton's method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence. In place of the true Hessian ∇²f_k, they use an approximation B_k, which is updated after each step to take account of the additional knowledge gained during the step. The updates make use of the fact that changes in the gradient provide information about the second derivative of f along the search direction. By using the expression (2.5) from our statement of Taylor's theorem, we have by adding and subtracting the term ∇²f(x)p that

  ∇f(x + p) = ∇f(x) + ∇²f(x)p + ∫₀¹ [∇²f(x + tp) − ∇²f(x)] p dt.

Because ∇f is continuous, the size of the final integral term is o(‖p‖). By setting x = x_k and p = x_{k+1} − x_k, we obtain

  ∇f_{k+1} = ∇f_k + ∇²f_k (x_{k+1} − x_k) + o(‖x_{k+1} − x_k‖).

When x_k and x_{k+1} lie in a region near the solution x∗, within which ∇²f is positive definite, the final term in this expansion is eventually dominated by the ∇²f_k(x_{k+1} − x_k) term, and we can write

  ∇²f_k (x_{k+1} − x_k) ≈ ∇f_{k+1} − ∇f_k.   (2.16)

We choose the new Hessian approximation B_{k+1} so that it mimics the property (2.16) of the true Hessian, that is, we require it to satisfy the following condition, known as the secant equation:

  B_{k+1} s_k = y_k,   (2.17)

where

  s_k = x_{k+1} − x_k,   y_k = ∇f_{k+1} − ∇f_k.
Typically, we impose additional conditions on B_{k+1}, such as symmetry (motivated by symmetry of the exact Hessian), and a requirement that the difference between successive approximations B_k and B_{k+1} have low rank.
Two of the most popular formulae for updating the Hessian approximation B_k are the symmetric-rank-one (SR1) formula, defined by

  B_{k+1} = B_k + (y_k − B_k s_k)(y_k − B_k s_k)ᵀ / ((y_k − B_k s_k)ᵀ s_k),   (2.18)

and the BFGS formula, named after its inventors, Broyden, Fletcher, Goldfarb, and Shanno, which is defined by

  B_{k+1} = B_k − (B_k s_k s_kᵀ B_k) / (s_kᵀ B_k s_k) + (y_k y_kᵀ) / (y_kᵀ s_k).   (2.19)
Note that the difference between the matrices B_k and B_{k+1} is a rank-one matrix in the case of (2.18) and a rank-two matrix in the case of (2.19). Both updates satisfy the secant equation and both maintain symmetry. One can show that the BFGS update (2.19) generates positive definite approximations whenever the initial approximation B₀ is positive definite and s_kᵀ y_k > 0. We discuss these issues further in Chapter 6.
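The BFGS formula (2.19) translates directly into code. The sketch below, with arbitrary test data of our own, applies one update and checks the two properties just mentioned: the secant equation and preservation of positive definiteness when s_kᵀ y_k > 0.

```python
import numpy as np

# One BFGS update (2.19) on arbitrary test data; a sketch, not a full
# quasi-Newton method (no line search, no safeguards).

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
B = np.eye(3)                                # positive definite B_0
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)         # for this data, s^T y > 0
B_next = bfgs_update(B, s, y)

print(np.allclose(B_next @ s, y))            # True: secant equation (2.17)
print(np.linalg.eigvalsh(B_next))            # all positive since s^T y > 0
```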
The quasi-Newton search direction is obtained by using B_k in place of the exact Hessian in the formula (2.15), that is,

  p_k = −B_k⁻¹ ∇f_k.   (2.20)
Some practical implementations of quasi-Newton methods avoid the need to factorize B_k at each iteration by updating the inverse of B_k, instead of B_k itself. In fact, the equivalent formula for (2.19), applied to the inverse approximation H_k ≝ B_k⁻¹, is

  H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,   ρ_k = 1/(y_kᵀ s_k).   (2.21)

Calculation of p_k can then be performed by using the formula p_k = −H_k ∇f_k. This matrix-vector multiplication is simpler than the factorization/back-substitution procedure that is needed to implement the formula (2.20).
Two variants of quasi-Newton methods designed to solve large problems, partially separable and limited-memory updating, are described in Chapter 7.
The last class of search directions we preview here is that generated by nonlinear conjugate gradient methods. They have the form

  p_k = −∇f(x_k) + β_k p_{k−1},

where β_k is a scalar that ensures that p_k and p_{k−1} are conjugate, an important concept in the minimization of quadratic functions that will be defined in Chapter 5. Conjugate gradient methods were originally designed to solve systems of linear equations Ax = b, where the coefficient matrix A is symmetric and positive definite. The problem of solving this linear system is equivalent to the problem of minimizing the convex quadratic function defined by

  φ(x) = ½xᵀAx − bᵀx,

so it was natural to investigate extensions of these algorithms to more general types of unconstrained minimization problems. In general, nonlinear conjugate gradient directions are much more effective than the steepest descent direction and are almost as simple to compute. These methods do not attain the fast convergence rates of Newton or quasi-Newton methods, but they have the advantage of not requiring storage of matrices. An extensive discussion of nonlinear conjugate gradient methods is given in Chapter 5.
All of the search directions discussed so far can be used directly in a line search framework. They give rise to the steepest descent, Newton, quasiNewton, and conjugate gradient line search methods. All except conjugate gradients have an analogue in the trust region framework, as we now discuss.
MODELS FOR TRUST-REGION METHODS
If we set B_k = 0 in (2.12) and define the trust region using the Euclidean norm, the trust-region subproblem (2.11) becomes

  min_p f_k + pᵀ∇f_k   subject to ‖p‖₂ ≤ Δ_k.

We can write the solution to this problem in closed form as

  p_k = −Δ_k ∇f_k / ‖∇f_k‖.
This is simply a steepest descent step in which the step length is determined by the trust-region radius; the trust-region and line search approaches are essentially the same in this case. A more interesting trust-region algorithm is obtained by choosing B_k to be the exact Hessian ∇²f_k in the quadratic model (2.12). Because of the trust-region restriction ‖p‖₂ ≤ Δ_k, the subproblem (2.11) is guaranteed to have a solution p_k even when ∇²f_k is not positive definite, as we see in Figure 2.4. The trust-region Newton method has proved to be highly effective in practice, as we discuss in Chapter 7.
If the matrix B_k in the quadratic model function m_k of (2.12) is defined by means of a quasi-Newton approximation, we obtain a trust-region quasi-Newton method.
SCALING
The performance of an algorithm may depend crucially on how the problem is formulated. One important issue in problem formulation is scaling. In unconstrained optimization, a problem is said to be poorly scaled if changes to x in a certain direction produce much larger variations in the value of f than do changes to x in another direction. A simple example is provided by the function f(x) = 10⁹x₁² + x₂², which is very sensitive to small changes in x₁ but not so sensitive to perturbations in x₂.

Poorly scaled functions arise, for example, in simulations of physical and chemical systems where different processes are taking place at very different rates. To be more specific, consider a chemical system in which four reactions occur. Associated with each reaction is a rate constant that describes the speed at which the reaction takes place. The optimization problem is to find values for these rate constants by observing the concentrations of each chemical in the system at different times. The four constants differ greatly in magnitude, since the reactions take place at vastly different speeds. Suppose we have the following rough estimates for the final values of the constants, each correct to within, say, an order of magnitude:

  x₁ ≈ 10⁻¹⁰,   x₂ ≈ x₃ ≈ 1,   x₄ ≈ 10⁵.
Before solving this problem we could introduce a new variable z defined by

  x = diag(10⁻¹⁰, 1, 1, 10⁵) z,   that is,   x₁ = 10⁻¹⁰z₁,  x₂ = z₂,  x₃ = z₃,  x₄ = 10⁵z₄,

and then define and solve the optimization problem in terms of the new variable z.

Figure 2.7 Poorly scaled and well scaled problems, and performance of the steepest descent direction.

The optimal values of z will be within about an order of magnitude of 1, making the solution more balanced. This kind of scaling of the variables is known as diagonal scaling.
Scaling is performed sometimes unintentionally when the units used to represent variables are changed. During the modeling process, we may decide to change the units of some variables, say from meters to millimeters. If we do, the range of those variables and their size relative to the other variables will both change.
Some optimization algorithms, such as steepest descent, are sensitive to poor scaling, while others, such as Newton's method, are unaffected by it. Figure 2.7 shows the contours of two convex nearly quadratic functions, the first of which is poorly scaled, while the second is well scaled. For the poorly scaled problem, the one with highly elongated contours, the steepest descent direction does not yield much reduction in the function, while for the well-scaled problem it performs much better. In both cases, Newton's method will produce a much better step, since the second-order quadratic model m_k in (2.14) happens to be a good approximation of f.
Algorithms that are not sensitive to scaling are preferable, because they can handle poor problem formulations in a more robust fashion. In designing complete algorithms, we try to incorporate scale invariance into all aspects of the algorithm, including the line search or trustregion strategies and convergence tests. Generally speaking, it is easier to preserve scale invariance for line search algorithms than for trustregion algorithms.
EXERCISES
2.1 Compute the gradient ∇f(x) and Hessian ∇²f(x) of the Rosenbrock function

  f(x) = 100(x₂ − x₁²)² + (1 − x₁)².   (2.22)

Show that x∗ = (1, 1)ᵀ is the only local minimizer of this function, and that the Hessian matrix at that point is positive definite.
2.2 Show that the function f(x) = 8x₁ + 12x₂ + x₁² − 2x₂² has only one stationary point, and that it is neither a maximum nor a minimum, but a saddle point. Sketch the contour lines of f.

2.3 Let a be a given n-vector, and A be a given n × n symmetric matrix. Compute the gradient and Hessian of f₁(x) = aᵀx and f₂(x) = xᵀAx.

2.4 Write the second-order Taylor expansion (2.6) for the function cos(1/x) around a nonzero point x, and the third-order Taylor expansion of cos(x) around any point x. Evaluate the second expansion for the specific case of x = 1.
2.5 Consider the function f : ℝ² → ℝ defined by f(x) = ‖x‖². Show that the sequence of iterates {x_k} defined by

  x_k = (1 + 1/2^k) (cos k, sin k)ᵀ

satisfies f(x_{k+1}) < f(x_k) for k = 0, 1, 2, . . . . Show that every point on the unit circle {x | ‖x‖₂ = 1} is a limit point for {x_k}. Hint: Every value θ ∈ [0, 2π] is a limit point of the subsequence {ξ_k} defined by

  ξ_k = k (mod 2π) = k − 2π ⌊k/(2π)⌋,

where the operator ⌊·⌋ denotes rounding down to the next integer.
2.6 Prove that all isolated local minimizers are strict. Hint: Take an isolated local minimizer x∗ and a neighborhood N. Show that for any x ∈ N, x ≠ x∗, we must have f(x) > f(x∗).

2.7 Suppose that f(x) = xᵀQx, where Q is an n × n symmetric positive semidefinite matrix. Show using the definition (1.4) that f(x) is convex on the domain ℝⁿ. Hint: It may be convenient to prove the following equivalent inequality:

  f(y + α(x − y)) − αf(x) − (1 − α)f(y) ≤ 0,   for all α ∈ [0, 1] and all x, y ∈ ℝⁿ.
2.8 Suppose that f is a convex function. Show that the set of global minimizers of f is a convex set.
2.9 Consider the function f(x₁, x₂) = (x₁ + x₂²)². At the point xᵀ = (1, 0) we consider the search direction pᵀ = (−1, 1). Show that p is a descent direction and find all minimizers of the problem (2.10).

2.10 Suppose that f̃(z) = f(x), where x = Sz + s for some S ∈ ℝⁿˣⁿ and s ∈ ℝⁿ. Show that

  ∇f̃(z) = Sᵀ∇f(x),   ∇²f̃(z) = Sᵀ∇²f(x)S.

Hint: Use the chain rule to express df̃/dz_j in terms of df/dx_i and dx_i/dz_j for all i, j = 1, 2, . . . , n.

2.11 Show that the symmetric-rank-one update (2.18) and the BFGS update (2.19) are scale-invariant if the initial Hessian approximations B₀ are chosen appropriately. That is, using the notation of the previous exercise, show that if these methods are applied to f(x) starting from x₀ = Sz₀ + s with initial Hessian B₀, and to f̃(z) starting from z₀ with initial Hessian SᵀB₀S, then all iterates are related by x_k = Sz_k + s. (Assume for simplicity that the methods take unit step lengths.)

2.12 Suppose that a function f of two variables is poorly scaled at the solution x∗. Write two Taylor expansions of f around x∗ (one along each coordinate direction) and use them to show that the Hessian ∇²f(x∗) is ill-conditioned.
2.13 For this and the following three questions, refer to the material on Rates of Convergence in Section A.2 of the Appendix. Show that the sequence x_k = 1/k is not Q-linearly convergent, though it does converge to zero. (This is called sublinear convergence.)

2.14 Show that the sequence x_k = 1 + (0.5)^{2^k} is Q-quadratically convergent to 1.

2.15 Does the sequence x_k = 1/k! converge Q-superlinearly? Q-quadratically?

2.16 Consider the sequence {x_k} defined by

  x_k = (1/4)^{2^k}   for k even,   x_k = x_{k−1}/k   for k odd.

Is this sequence Q-superlinearly convergent? Q-quadratically convergent? R-quadratically convergent?
CHAPTER 3
Line Search Methods
Each iteration of a line search method computes a search direction p_k and then decides how far to move along that direction. The iteration is given by

  x_{k+1} = x_k + α_k p_k,   (3.1)

where the positive scalar α_k is called the step length. The success of a line search method depends on effective choices of both the direction p_k and the step length α_k.

Most line search algorithms require p_k to be a descent direction (one for which p_kᵀ∇f_k < 0) because this property guarantees that the function f can be reduced along this direction, as discussed in the previous chapter. Moreover, the search direction often has the form

  p_k = −B_k⁻¹ ∇f_k,   (3.2)

where B_k is a symmetric and nonsingular matrix. In the steepest descent method, B_k is simply the identity matrix I, while in Newton's method, B_k is the exact Hessian ∇²f(x_k). In quasi-Newton methods, B_k is an approximation to the Hessian that is updated at every iteration by means of a low-rank formula. When p_k is defined by (3.2) and B_k is positive definite, we have

  p_kᵀ∇f_k = −∇f_kᵀ B_k⁻¹ ∇f_k < 0,

and therefore p_k is a descent direction.
In this chapter, we discuss how to choose α_k and p_k to promote convergence from remote starting points. We also study the rate of convergence of steepest descent, quasi-Newton, and Newton methods. Since the pure Newton iteration is not guaranteed to produce descent directions when the current iterate is not close to a solution, we discuss modifications in Section 3.4 that allow it to start from any initial point.

We now give careful consideration to the choice of the step-length parameter α_k.
3.1 STEP LENGTH
In computing the step length α_k, we face a tradeoff. We would like to choose α_k to give a substantial reduction of f, but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function φ(·) defined by

  φ(α) = f(x_k + αp_k),   α > 0,   (3.3)

but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local minimizer of φ to moderate precision generally requires too many evaluations of the objective function f and possibly the gradient ∇f. More practical strategies perform an inexact line search to identify a step length that achieves adequate reductions in f at minimal cost.

Typical line search algorithms try out a sequence of candidate values for α, stopping to accept one of these values when certain conditions are satisfied. The line search is done in two stages: A bracketing phase finds an interval containing desirable step lengths, and a bisection or interpolation phase computes a good step length within this interval. Sophisticated line search algorithms can be quite complicated, so we defer a full description until Section 3.5.
Figure 3.1 The ideal step length is the global minimizer.
We now discuss various termination conditions for line search algorithms and show that effective step lengths need not lie near minimizers of the univariate function φ defined in (3.3).

A simple condition we could impose on α_k is to require a reduction in f, that is, f(x_k + α_k p_k) < f(x_k). That this requirement is not enough to produce convergence to x∗ is illustrated in Figure 3.2, for which the minimum function value is f∗ = −1, but a sequence of iterates {x_k} for which f(x_k) = 5/k, k = 1, 2, . . ., yields a decrease at each iteration but has a limiting function value of zero. The insufficient reduction in f at each step causes it to fail to converge to the minimizer of this convex function. To avoid this behavior we need to enforce a sufficient decrease condition, a concept we discuss next.
Figure 3.2 Insufficient reduction in f.
THE WOLFE CONDITIONS
A popular inexact line search condition stipulates that α_k should first of all give sufficient decrease in the objective function f, as measured by the following inequality:

  f(x_k + αp_k) ≤ f(x_k) + c₁α∇f_kᵀp_k,   (3.4)

for some constant c₁ ∈ (0, 1). In other words, the reduction in f should be proportional to both the step length α_k and the directional derivative ∇f_kᵀp_k. Inequality (3.4) is sometimes called the Armijo condition.

The sufficient decrease condition is illustrated in Figure 3.3. The right-hand side of (3.4), which is a linear function, can be denoted by l(α). The function l(·) has negative slope c₁∇f_kᵀp_k, but because c₁ ∈ (0, 1), it lies above the graph of φ for small positive values of α. The sufficient decrease condition states that α is acceptable only if φ(α) ≤ l(α). The intervals on which this condition is satisfied are shown in Figure 3.3. In practice, c₁ is chosen to be quite small, say c₁ = 10⁻⁴.

Figure 3.3 Sufficient decrease condition.

The sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress because, as we see from Figure 3.3, it is satisfied for all sufficiently small values of α. To rule out unacceptably short steps we introduce a second requirement, called the curvature condition, which requires α_k to satisfy

  ∇f(x_k + α_k p_k)ᵀp_k ≥ c₂∇f_kᵀp_k,   (3.5)

for some constant c₂ ∈ (c₁, 1), where c₁ is the constant from (3.4). Note that the left-hand side is simply the derivative φ′(α_k), so the curvature condition ensures that the slope of φ at α_k is greater than c₂ times the initial slope φ′(0). This makes sense because if the slope φ′(α) is strongly negative, we have an indication that we can reduce f significantly by moving further along the chosen direction.

On the other hand, if φ′(α_k) is only slightly negative or even positive, it is a sign that we cannot expect much more decrease in f in this direction, so it makes sense to terminate the line search. The curvature condition is illustrated in Figure 3.4. Typical values of c₂ are 0.9 when the search direction p_k is chosen by a Newton or quasi-Newton method, and 0.1 when p_k is obtained from a nonlinear conjugate gradient method.

Figure 3.4 The curvature condition.
The sufficient decrease and curvature conditions are known collectively as the Wolfe conditions. We illustrate them in Figure 3.5 and restate them here for future reference:

  f(x_k + α_k p_k) ≤ f(x_k) + c₁α_k∇f_kᵀp_k,   (3.6a)
  ∇f(x_k + α_k p_k)ᵀp_k ≥ c₂∇f_kᵀp_k,   (3.6b)

with 0 < c₁ < c₂ < 1.
A step length may satisfy the Wolfe conditions without being particularly close to a minimizer of φ, as we show in Figure 3.5. We can, however, modify the curvature condition to force α_k to lie in at least a broad neighborhood of a local minimizer or stationary point of φ. The strong Wolfe conditions require α_k to satisfy

  f(x_k + α_k p_k) ≤ f(x_k) + c₁α_k∇f_kᵀp_k,   (3.7a)
  |∇f(x_k + α_k p_k)ᵀp_k| ≤ c₂|∇f_kᵀp_k|,   (3.7b)

with 0 < c₁ < c₂ < 1. The only difference with the Wolfe conditions is that we no longer allow the derivative φ′(α_k) to be too positive. Hence, we exclude points that are far from stationary points of φ.
Figure 3.5 Step lengths satisfying the Wolfe conditions.
It is not difficult to prove that there exist step lengths that satisfy the Wolfe conditions
for every function f that is smooth and bounded below.
Lemma 3.1.
Suppose that f : ℝⁿ → ℝ is continuously differentiable. Let p_k be a descent direction at x_k, and assume that f is bounded below along the ray {x_k + αp_k | α > 0}. Then if 0 < c₁ < c₂ < 1, there exist intervals of step lengths satisfying the Wolfe conditions (3.6) and the strong Wolfe conditions (3.7).
PROOF. Note that φ(α) = f(x_k + αp_k) is bounded below for all α > 0. Since 0 < c₁ < 1, the line l(α) = f(x_k) + αc₁∇f_kᵀp_k is unbounded below and must therefore intersect the graph of φ at least once. Let α′ > 0 be the smallest intersecting value of α, that is,

  f(x_k + α′p_k) = f(x_k) + α′c₁∇f_kᵀp_k.   (3.8)

The sufficient decrease condition (3.6a) clearly holds for all step lengths less than α′.

By the mean value theorem (see (A.55)), there exists α″ ∈ (0, α′) such that

  f(x_k + α′p_k) − f(x_k) = α′∇f(x_k + α″p_k)ᵀp_k.   (3.9)

By combining (3.8) and (3.9), we obtain

  ∇f(x_k + α″p_k)ᵀp_k = c₁∇f_kᵀp_k > c₂∇f_kᵀp_k,   (3.10)

since c₁ < c₂ and ∇f_kᵀp_k < 0. Therefore, α″ satisfies the Wolfe conditions (3.6), and the inequalities hold strictly in both (3.6a) and (3.6b). Hence, by our smoothness assumption on f, there is an interval around α″ for which the Wolfe conditions hold. Moreover, since
the term in the left-hand side of (3.10) is negative, the strong Wolfe conditions (3.7) hold in the same interval.
The Wolfe conditions are scale-invariant in a broad sense: Multiplying the objective function by a constant or making an affine change of variables does not alter them. They can be used in most line search methods, and are particularly important in the implementation of quasi-Newton methods, as we see in Chapter 6.
THE GOLDSTEIN CONDITIONS
Like the Wolfe conditions, the Goldstein conditions ensure that the step length achieves sufficient decrease but is not too short. The Goldstein conditions can also be stated as a pair of inequalities, in the following way:

  f(x_k) + (1 − c)α_k∇f_kᵀp_k ≤ f(x_k + α_k p_k) ≤ f(x_k) + cα_k∇f_kᵀp_k,   (3.11)

with 0 < c < 1/2. The second inequality is the sufficient decrease condition (3.4), whereas the first inequality is introduced to control the step length from below; see Figure 3.6.

A disadvantage of the Goldstein conditions vis-à-vis the Wolfe conditions is that the first inequality in (3.11) may exclude all minimizers of φ. However, the Goldstein and Wolfe conditions have much in common, and their convergence theories are quite similar. The Goldstein conditions are often used in Newton-type methods but are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation.
Figure 3.6 The Goldstein conditions.
SUFFICIENT DECREASE AND BACKTRACKING
We have mentioned that the sufficient decrease condition (3.6a) alone is not sufficient to ensure that the algorithm makes reasonable progress along the given search direction. However, if the line search algorithm chooses its candidate step lengths appropriately, by using a so-called backtracking approach, we can dispense with the extra condition (3.6b) and use just the sufficient decrease condition to terminate the line search procedure. In its most basic form, backtracking proceeds as follows.

Algorithm 3.1 (Backtracking Line Search).
  Choose ᾱ > 0, ρ ∈ (0, 1), c ∈ (0, 1); set α ← ᾱ;
  repeat until f(x_k + αp_k) ≤ f(x_k) + cα∇f_kᵀp_k
    α ← ρα;
  end (repeat)
  Terminate with α_k = α.
In this procedure, the initial step length ᾱ is chosen to be 1 in Newton and quasi-Newton methods, but can have different values in other algorithms such as steepest descent or conjugate gradient. An acceptable step length will be found after a finite number of trials, because α will eventually become small enough that the sufficient decrease condition holds (see Figure 3.3). In practice, the contraction factor ρ is often allowed to vary at each iteration of the line search. For example, it can be chosen by safeguarded interpolation, as we describe later. We need ensure only that at each iteration we have ρ ∈ [ρ_lo, ρ_hi], for some fixed constants 0 < ρ_lo < ρ_hi < 1.

The backtracking approach ensures either that the selected step length α_k is some fixed value (the initial choice ᾱ), or else that it is short enough to satisfy the sufficient decrease condition but not too short. The latter claim holds because the accepted value α_k is within a factor ρ of the previous trial value, α_k/ρ, which was rejected for violating the sufficient decrease condition, that is, for being too long.

This simple and popular strategy for terminating a line search is well suited for Newton methods but is less appropriate for quasi-Newton and conjugate gradient methods.
3.2 CONVERGENCE OF LINE SEARCH METHODS
To obtain global convergence, we must not only have well chosen step lengths but also well chosen search directions p_k. We discuss requirements on the search direction in this section, focusing on one key property: the angle θ_k between p_k and the steepest descent direction −∇f_k, defined by

  cos θ_k = −∇f_kᵀp_k / (‖∇f_k‖ ‖p_k‖).   (3.12)

The following theorem, due to Zoutendijk, has far-reaching consequences. It quantifies the effect of properly chosen step lengths α_k, and shows, for example, that the steepest descent method is globally convergent. For other algorithms, it describes how far p_k can deviate from the steepest descent direction and still produce a globally convergent iteration. Various line search termination conditions can be used to establish this result, but for concreteness we will consider only the Wolfe conditions (3.6). Though Zoutendijk's result appears at first to be technical and obscure, its power will soon become evident.
Theorem 3.2.
Consider any iteration of the form (3.1), where p_k is a descent direction and α_k satisfies the Wolfe conditions (3.6). Suppose that f is bounded below in ℝⁿ and that f is continuously differentiable in an open set N containing the level set L ≝ {x : f(x) ≤ f(x₀)}, where x₀ is the starting point of the iteration. Assume also that the gradient ∇f is Lipschitz continuous on N, that is, there exists a constant L > 0 such that

  ‖∇f(x) − ∇f(x̃)‖ ≤ L‖x − x̃‖,   for all x, x̃ ∈ N.   (3.13)

Then

  Σ_{k≥0} cos²θ_k ‖∇f_k‖² < ∞.   (3.14)
PROOF. From (3.6b) and (3.1) we have that

  (∇f_{k+1} − ∇f_k)ᵀp_k ≥ (c₂ − 1)∇f_kᵀp_k,

while the Lipschitz condition (3.13) implies that

  (∇f_{k+1} − ∇f_k)ᵀp_k ≤ α_k L ‖p_k‖².

By combining these two relations, we obtain

  α_k ≥ ((c₂ − 1)/L) (∇f_kᵀp_k / ‖p_k‖²).

By substituting this inequality into the first Wolfe condition (3.6a), we obtain

  f_{k+1} ≤ f_k − c₁ ((1 − c₂)/L) ((∇f_kᵀp_k)² / ‖p_k‖²).

From the definition (3.12), we can write this relation as

  f_{k+1} ≤ f_k − c cos²θ_k ‖∇f_k‖²,

where c = c₁(1 − c₂)/L. By summing this expression over all indices less than or equal to k, we obtain

  f_{k+1} ≤ f₀ − c Σ_{j=0}^{k} cos²θ_j ‖∇f_j‖².   (3.15)

Since f is bounded below, we have that f₀ − f_{k+1} is less than some positive constant, for all k. Hence, by taking limits in (3.15), we obtain

  Σ_{k=0}^{∞} cos²θ_k ‖∇f_k‖² < ∞,

which concludes the proof.
Similar results to this theorem hold when the Goldstein conditions (3.11) or strong Wolfe conditions (3.7) are used in place of the Wolfe conditions. For all these strategies, the step length selection implies inequality (3.14), which we call the Zoutendijk condition.

Note that the assumptions of Theorem 3.2 are not too restrictive. If the function f were not bounded below, the optimization problem would not be well defined. The smoothness assumption (Lipschitz continuity of the gradient) is implied by many of the smoothness conditions that are used in local convergence theorems (see Chapters 6 and 7) and are often satisfied in practice.
The Zoutendijk condition (3.14) implies that

  cos²θ_k ‖∇f_k‖² → 0.   (3.16)

This limit can be used in turn to derive global convergence results for line search algorithms. If our method for choosing the search direction p_k in the iteration (3.1) ensures that the angle θ_k defined by (3.12) is bounded away from 90°, there is a positive constant δ such that

  cos θ_k ≥ δ > 0,   for all k.   (3.17)

It follows immediately from (3.16) that

  lim_{k→∞} ‖∇f_k‖ = 0.   (3.18)

In other words, we can be sure that the gradient norms ‖∇f_k‖ converge to zero, provided that the search directions are never too close to orthogonality with the gradient. In particular, the method of steepest descent (for which the search direction p_k is parallel to the negative gradient) produces a gradient sequence that converges to zero, provided that it uses a line search satisfying the Wolfe or Goldstein conditions.
We use the term globally convergent to refer to algorithms for which the property (3.18) is satisfied, but note that this term is sometimes used in other contexts to mean different things. For line search methods of the general form (3.1), the limit (3.18) is the strongest global convergence result that can be obtained: We cannot guarantee that the method converges to a minimizer, but only that it is attracted by stationary points. Only by making additional requirements on the search direction p_k (by introducing negative curvature information from the Hessian ∇²f(x_k), for example) can we strengthen these results to include convergence to a local minimum. See the Notes and References at the end of this chapter for further discussion of this point.
Consider now the Newton-like method (3.1), (3.2) and assume that the matrices B_k are positive definite with a uniformly bounded condition number. That is, there is a constant M such that

  ‖B_k‖ ‖B_k⁻¹‖ ≤ M,   for all k.   (3.19)

It is easy to show from the definition (3.12) that

  cos θ_k ≥ 1/M   (3.20)

(see Exercise 3.5). By combining this bound with (3.16) we find that

  lim_{k→∞} ‖∇f_k‖ = 0.
Therefore, we have shown that Newton and quasi-Newton methods are globally convergent if the matrices B_k have a bounded condition number and are positive definite (which is needed to ensure that p_k is a descent direction), and if the step lengths satisfy the Wolfe conditions.
For some algorithms, such as conjugate gradient methods, we will not be able to prove the limit (3.18), but only the weaker result

  lim inf_{k→∞} ‖∇f_k‖ = 0.   (3.21)

In other words, just a subsequence of the gradient norms ‖∇f_{k_j}‖ converges to zero, rather than the whole sequence (see Appendix A). This result, too, can be proved by using Zoutendijk's condition (3.14), but instead of a constructive proof, we outline a proof by contradiction. Suppose that (3.21) does not hold, so that the gradients remain bounded away from zero, that is, there exists γ > 0 such that

  ‖∇f_k‖ ≥ γ,   for all k sufficiently large.   (3.22)
Then from (3.16) we conclude that

  cos θ_k → 0,   (3.23)

that is, the entire sequence {cos θ_k} converges to 0. To establish (3.21), therefore, it is enough to show that a subsequence {cos θ_{k_j}} is bounded away from zero. We will use this strategy in Chapter 5 to study the convergence of nonlinear conjugate gradient methods.
By applying this proof technique, we can prove global convergence in the sense of (3.20) or (3.21) for a general class of algorithms. Consider any algorithm for which (i) every iteration produces a decrease in the objective function, and (ii) every mth iteration is a steepest descent step, with step length chosen to satisfy the Wolfe or Goldstein conditions. Then, since cos θ_k = 1 for the steepest descent steps, the result (3.21) holds. Of course, we would design the algorithm so that it does something better than steepest descent at the other m − 1 iterates. The occasional steepest descent steps may not make much progress, but they at least guarantee overall global convergence.

Note that throughout this section we have used only the fact that Zoutendijk's condition implies the limit (3.16). In later chapters we will make use of the bounded sum condition (3.14), which forces the sequence {cos²θ_k ‖∇f_k‖²} to converge to zero at a sufficiently rapid rate.
3.3 RATE OF CONVERGENCE
It would seem that designing optimization algorithms with good convergence properties is easy, since all we need to ensure is that the search direction p_k does not tend to become orthogonal to the gradient ∇f_k, or that steepest descent steps are taken regularly. We could simply compute cos θ_k at every iteration and turn p_k toward the steepest descent direction if cos θ_k is smaller than some preselected constant δ > 0. Angle tests of this type ensure global convergence, but they are undesirable for two reasons. First, they may impede a fast rate of convergence, because for problems with an ill-conditioned Hessian, it may be necessary to produce search directions that are almost orthogonal to the gradient, and an inappropriate choice of the parameter δ may cause such steps to be rejected. Second, angle tests destroy the invariance properties of quasi-Newton methods.
Algorithmic strategies that achieve rapid convergence can sometimes conflict with the requirements of global convergence, and vice versa. For example, the steepest descent method is the quintessential globally convergent algorithm, but it is quite slow in practice, as we shall see below. On the other hand, the pure Newton iteration converges rapidly when started close enough to a solution, but its steps may not even be descent directions away from the solution. The challenge is to design algorithms that incorporate both properties: good global convergence guarantees and a rapid rate of convergence.
We begin our study of convergence rates of line search methods by considering the most basic approach of all: the steepest descent method.
Figure 3.7 Steepest descent steps.

CONVERGENCE RATE OF STEEPEST DESCENT
We can learn much about the steepest descent method by considering the ideal case, in which the objective function is quadratic and the line searches are exact. Let us suppose that

  f(x) = ½xᵀQx − bᵀx,   (3.24)

where Q is symmetric and positive definite. The gradient is given by ∇f(x) = Qx − b and the minimizer x∗ is the unique solution of the linear system Qx = b.

It is easy to compute the step length α_k that minimizes f(x_k − α∇f_k). By differentiating the function

  f(x_k − α∇f_k) = ½(x_k − α∇f_k)ᵀQ(x_k − α∇f_k) − bᵀ(x_k − α∇f_k)

with respect to α, and setting the derivative to zero, we obtain

  α_k = ∇f_kᵀ∇f_k / (∇f_kᵀQ∇f_k).   (3.25)

If we use this exact minimizer α_k, the steepest descent iteration for (3.24) is given by

  x_{k+1} = x_k − (∇f_kᵀ∇f_k / (∇f_kᵀQ∇f_k)) ∇f_k.   (3.26)

Since ∇f_k = Qx_k − b, this equation yields a closed-form expression for x_{k+1} in terms of x_k. In Figure 3.7 we plot a typical sequence of iterates generated by the steepest descent method on a two-dimensional quadratic objective function. The contours of f are ellipsoids whose axes lie along the orthogonal eigenvectors of Q. Note that the iterates zigzag toward the solution.
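The iteration (3.26) is easy to reproduce. The sketch below runs it on an illustrative two-dimensional quadratic; printing the iterates reveals the zigzag of Figure 3.7, and convergence slows as the condition number of Q grows.

```python
import numpy as np

# Steepest descent with the exact step length (3.25) on a two-dimensional
# quadratic (3.24); Q, b, and the starting point are illustrative.

Q = np.array([[2.0, 0.0],
              [0.0, 20.0]])               # condition number kappa(Q) = 10
b = np.zeros(2)
x = np.array([10.0, 1.0])

for k in range(20):
    g = Q @ x - b                         # grad f(x) = Qx - b
    alpha = (g @ g) / (g @ (Q @ g))       # exact minimizer along -g, (3.25)
    x = x - alpha * g                     # iteration (3.26)
    # printing x here shows the characteristic zigzag of Figure 3.7

print(x)                                  # approaches the solution x* = 0
```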
To quantify the rate of convergence we introduce the weighted norm

  ‖x‖²_Q = xᵀQx.   (3.27)

By using the relation Qx∗ = b, we can show that

  ½‖x − x∗‖²_Q = f(x) − f(x∗),

so this norm measures the difference between the current objective value and the optimal value. By using the equality (3.26) and noting that ∇f_k = Q(x_k − x∗), we can derive the equality

  ‖x_{k+1} − x∗‖²_Q = { 1 − (∇f_kᵀ∇f_k)² / [(∇f_kᵀQ∇f_k)(∇f_kᵀQ⁻¹∇f_k)] } ‖x_k − x∗‖²_Q   (3.28)

(see Exercise 3.7). This expression describes the exact decrease in f at each iteration, but since the term inside the brackets is difficult to interpret, it is more useful to bound it in terms of the condition number of the problem.
Theorem 3.3.
When the steepest descent method with exact line searches (3.26) is applied to the strongly convex quadratic function (3.24), the error norm (3.27) satisfies

  ‖x_{k+1} − x∗‖²_Q ≤ ((λ_n − λ₁)/(λ_n + λ₁))² ‖x_k − x∗‖²_Q,   (3.29)

where 0 < λ₁ ≤ λ₂ ≤ ··· ≤ λ_n are the eigenvalues of Q.
The proof of this result is given by Luenberger [195]. The inequalities (3.29) and (3.27) show that the function values f_k converge to the minimum f∗ at a linear rate. As a special case of this result, we see that convergence is achieved in one iteration if all the eigenvalues are equal. In this case, Q is a multiple of the identity matrix, so the contours in Figure 3.7 are circles and the steepest descent direction always points at the solution. In general, as the condition number κ(Q) = λ_n/λ₁ increases, the contours of the quadratic become more elongated, the zigzagging in Figure 3.7 becomes more pronounced, and (3.29) implies that the convergence degrades. Even though (3.29) is a worst-case bound, it gives an accurate indication of the behavior of the algorithm when n > 2.

The rate-of-convergence behavior of the steepest descent method is essentially the same on general nonlinear objective functions. In the following result we assume that the step length is the global minimizer along the search direction.
Theorem 3.4.
Suppose that f : ℝⁿ → ℝ is twice continuously differentiable, and that the iterates generated by the steepest-descent method with exact line searches converge to a point x∗ at which the Hessian matrix ∇²f(x∗) is positive definite. Let r be any scalar satisfying

  r ∈ ((λ_n − λ₁)/(λ_n + λ₁), 1),

where λ₁ ≤ λ₂ ≤ ··· ≤ λ_n are the eigenvalues of ∇²f(x∗). Then for all k sufficiently large, we have

  f(x_{k+1}) − f(x∗) ≤ r² [f(x_k) − f(x∗)].
In general, we cannot expect the rate of convergence to improve if an inexact line search is used. Therefore, Theorem 3.4 shows that the steepest descent method can have an unacceptably slow rate of convergence, even when the Hessian is reasonably well conditioned. For example, if κ(Q) = 800, f(x₁) = 1, and f(x∗) = 0, Theorem 3.4 suggests that the function value will still be about 0.08 after one thousand iterations of the steepest descent method with exact line search.

NEWTON'S METHOD
We now consider the Newton iteration, for which the search direction is given by

  p_kᴺ = −∇²f_k⁻¹ ∇f_k.   (3.30)

Since the Hessian matrix ∇²f_k may not always be positive definite, p_kᴺ may not always be a descent direction, and many of the ideas discussed so far in this chapter no longer apply. In Section 3.4 and Chapter 4 we will describe two approaches for obtaining a globally convergent iteration based on the Newton step: a line search approach, in which the Hessian ∇²f_k is modified, if necessary, to make it positive definite and thereby yield descent, and a trust-region approach, in which ∇²f_k is used to form a quadratic model that is minimized in a ball around the current iterate x_k.

Here we discuss just the local rate-of-convergence properties of Newton's method. We know that for all x in the vicinity of a solution point x∗ such that ∇²f(x∗) is positive definite, the Hessian ∇²f(x) will also be positive definite. Newton's method will be well defined in this region and will converge quadratically, provided that the step lengths α_k are eventually always 1.
Theorem 3.5.
Suppose that f is twice differentiable and that the Hessian ∇²f(x) is Lipschitz continuous (see (A.42)) in a neighborhood of a solution x∗ at which the sufficient conditions (Theorem 2.4) are satisfied. Consider the iteration x_{k+1} = x_k + p_k, where p_k is given by (3.30). Then

 (i) if the starting point x₀ is sufficiently close to x∗, the sequence of iterates converges to x∗;
 (ii) the rate of convergence of {x_k} is quadratic; and
 (iii) the sequence of gradient norms {‖∇f_k‖} converges quadratically to zero.
PROOF. From the definition of the Newton step and the optimality condition ∇f∗ = 0 we have that

  x_k + p_kᴺ − x∗ = x_k − x∗ − ∇²f_k⁻¹∇f_k
                  = ∇²f_k⁻¹ [∇²f_k(x_k − x∗) − (∇f_k − ∇f∗)].   (3.31)

Since Taylor's theorem (Theorem 2.1) tells us that

  ∇f_k − ∇f∗ = ∫₀¹ ∇²f(x_k + t(x∗ − x_k))(x_k − x∗) dt,

we have

  ‖∇²f(x_k)(x_k − x∗) − (∇f_k − ∇f(x∗))‖
    = ‖∫₀¹ [∇²f(x_k) − ∇²f(x_k + t(x∗ − x_k))](x_k − x∗) dt‖
    ≤ ∫₀¹ ‖∇²f(x_k) − ∇²f(x_k + t(x∗ − x_k))‖ ‖x_k − x∗‖ dt
    ≤ ‖x_k − x∗‖² ∫₀¹ Lt dt = ½L‖x_k − x∗‖²,   (3.32)

where L is the Lipschitz constant for ∇²f(x) for x near x∗. Since ∇²f(x∗) is nonsingular, there is a radius r > 0 such that ‖∇²f_k⁻¹‖ ≤ 2‖∇²f(x∗)⁻¹‖ for all x_k with ‖x_k − x∗‖ ≤ r. By substituting in (3.31) and (3.32), we obtain

  ‖x_k + p_kᴺ − x∗‖ ≤ L‖∇²f(x∗)⁻¹‖ ‖x_k − x∗‖² = L̃‖x_k − x∗‖²,   (3.33)

where L̃ = L‖∇²f(x∗)⁻¹‖. Choosing x₀ so that ‖x₀ − x∗‖ ≤ min(r, 1/(2L̃)), we can use this inequality inductively to deduce that the sequence converges to x∗, and the rate of convergence is quadratic.

By using the relations x_{k+1} − x_k = p_kᴺ and ∇f_k + ∇²f_k p_kᴺ = 0, we obtain that

  ‖∇f(x_{k+1})‖ = ‖∇f(x_{k+1}) − ∇f_k − ∇²f(x_k)p_kᴺ‖
    = ‖∫₀¹ ∇²f(x_k + tp_kᴺ)(x_{k+1} − x_k) dt − ∇²f(x_k)p_kᴺ‖
    ≤ ∫₀¹ ‖∇²f(x_k + tp_kᴺ) − ∇²f(x_k)‖ ‖p_kᴺ‖ dt
    ≤ ½L‖p_kᴺ‖²
    ≤ ½L‖∇²f(x_k)⁻¹‖² ‖∇f_k‖²
    ≤ 2L‖∇²f(x∗)⁻¹‖² ‖∇f_k‖²,

proving that the gradient norms converge to zero quadratically.
46 CHAPTER 3. LINE SEARCH METHODS
As the iterates generated by Newtons method approach the solution, the Wolfe or Goldstein conditions will accept the step length k 1 for all large k. This observation follows from Theorem 3.6 below. Indeed, when the search direction is given by Newtons method, the limit 3.35 is satisfiedthe ratio is zero for all k! Implementations of Newtons method using these line search conditions, and in which the line search always tries the unit steplengthfirst,willsetk 1foralllargekandattainalocalquadraticrateofconvergence.
QUASI-NEWTON METHODS
Suppose now that the search direction has the form
$$ p_k = -B_k^{-1} \nabla f_k, \qquad (3.34) $$
where the symmetric and positive definite matrix $B_k$ is updated at every iteration by a quasi-Newton updating formula. We already encountered one quasi-Newton formula, the BFGS formula, in Chapter 2; others will be discussed in Chapter 6. We assume here that the step length $\alpha_k$ is computed by an inexact line search that satisfies the Wolfe or strong Wolfe conditions, with the same proviso mentioned above for Newton's method: The line search algorithm will always try the step length $\alpha = 1$ first, and will accept this value if it satisfies the Wolfe conditions. (We could enforce this condition by setting $\bar{\alpha} = 1$ in Algorithm 3.1, for example.) This implementation detail turns out to be crucial in obtaining a fast rate of convergence.
The following result shows that if the search direction of a quasi-Newton method approximates the Newton direction well enough, then the unit step length will satisfy the Wolfe conditions as the iterates converge to the solution. It also specifies a condition that the search direction must satisfy in order to give rise to a superlinearly convergent iteration. To bring out the full generality of this result, we state it first in terms of a general descent iteration, and then examine its consequences for quasi-Newton and Newton methods.
Theorem 3.6.
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (3.6) with $c_1 \le 1/2$. If the sequence $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, and if the search direction satisfies

$$ \lim_{k \to \infty} \frac{\|\nabla f_k + \nabla^2 f_k p_k\|}{\|p_k\|} = 0, \qquad (3.35) $$

then

(i) the step length $\alpha_k = 1$ is admissible for all $k$ greater than a certain index $k_0$; and
(ii) if $\alpha_k = 1$ for all $k > k_0$, $\{x_k\}$ converges to $x^*$ superlinearly.
It is easy to see that if $c_1 > 1/2$, then the line search would exclude the minimizer of a quadratic, and unit step lengths may not be admissible.
If $p_k$ is a quasi-Newton search direction of the form (3.34), then (3.35) is equivalent to

$$ \lim_{k \to \infty} \frac{\|(B_k - \nabla^2 f(x^*)) p_k\|}{\|p_k\|} = 0. \qquad (3.36) $$
Hence, we have the surprising (and delightful) result that a superlinear convergence rate can be attained even if the sequence of quasi-Newton matrices $B_k$ does not converge to $\nabla^2 f(x^*)$; it suffices that the $B_k$ become increasingly accurate approximations to $\nabla^2 f(x^*)$ along the search directions $p_k$. Importantly, condition (3.36) is both necessary and sufficient for the superlinear convergence of quasi-Newton methods.
Theorem 3.7.
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. Consider the iteration $x_{k+1} = x_k + p_k$ (that is, the step length $\alpha_k$ is uniformly 1) and that $p_k$ is given by (3.34). Let us assume also that $\{x_k\}$ converges to a point $x^*$ such that $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite. Then $\{x_k\}$ converges superlinearly if and only if (3.36) holds.
PROOF. We first show that (3.36) is equivalent to

$$ p_k - p_k^N = o(\|p_k\|), \qquad (3.37) $$

where $p_k^N = -\nabla^2 f_k^{-1} \nabla f_k$ is the Newton step. Assuming that (3.36) holds, we have that

$$ p_k - p_k^N = \nabla^2 f_k^{-1} (\nabla^2 f_k p_k + \nabla f_k)
 = \nabla^2 f_k^{-1} (\nabla^2 f_k - B_k) p_k
 = O(\|(\nabla^2 f_k - B_k) p_k\|) = o(\|p_k\|), $$

where we have used the fact that $\|\nabla^2 f_k^{-1}\|$ is bounded above for $x_k$ sufficiently close to $x^*$, since the limiting Hessian $\nabla^2 f(x^*)$ is positive definite. The converse follows readily if we multiply both sides of (3.37) by $\nabla^2 f_k$ and recall (3.34).

By combining (3.33) and (3.37), we obtain that

$$ \|x_k + p_k - x^*\| \le \|x_k + p_k^N - x^*\| + \|p_k - p_k^N\| = O(\|x_k - x^*\|^2) + o(\|p_k\|). $$

A simple manipulation of this inequality reveals that $\|p_k\| = O(\|x_k - x^*\|)$, so we obtain

$$ \|x_k + p_k - x^*\| = o(\|x_k - x^*\|), $$

giving the superlinear convergence result. $\square$
We will see in Chapter 6 that quasi-Newton methods normally satisfy condition (3.36) and are therefore superlinearly convergent.
3.4 NEWTON'S METHOD WITH HESSIAN MODIFICATION
Away from the solution, the Hessian matrix $\nabla^2 f(x)$ may not be positive definite, so the Newton direction $p_k^N$ defined by

$$ \nabla^2 f(x_k) \, p_k^N = -\nabla f(x_k) \qquad (3.38) $$
(see (3.30)) may not be a descent direction. We now describe an approach to overcome this difficulty when a direct linear algebra technique, such as Gaussian elimination, is used to solve the Newton equations (3.38). This approach obtains the step $p_k$ from a linear system identical to (3.38), except that the coefficient matrix is replaced with a positive definite approximation, formed before or during the solution process. The modified Hessian is obtained by adding either a positive diagonal matrix or a full matrix to the true Hessian $\nabla^2 f(x_k)$. A general description of this method follows.
Algorithm 3.2 (Line Search Newton with Modification).
Given initial point $x_0$;
for k = 0, 1, 2, ...
  Factorize the matrix $B_k = \nabla^2 f(x_k) + E_k$, where $E_k = 0$ if $\nabla^2 f(x_k)$ is sufficiently positive definite; otherwise, $E_k$ is chosen to ensure that $B_k$ is sufficiently positive definite;
  Solve $B_k p_k = -\nabla f(x_k)$;
  Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe, Goldstein, or Armijo backtracking conditions;
end
Some approaches do not compute $E_k$ explicitly, but rather introduce extra steps and tests into standard factorization procedures, modifying these procedures on the fly so that the computed factors are the factors of a positive definite matrix. Strategies based on modifying a Cholesky factorization and on modifying a symmetric indefinite factorization of the Hessian are described in this section.
Algorithm 3.2 is a practical Newton method that can be applied from any starting point. We can establish fairly satisfactory global convergence results for it, provided that the strategy for choosing $E_k$ (and hence $B_k$) satisfies the bounded modified factorization property. This property is that the matrices in the sequence $\{B_k\}$ have bounded condition number whenever the sequence of Hessians $\{\nabla^2 f(x_k)\}$ is bounded; that is,

$$ \kappa(B_k) = \|B_k\| \, \|B_k^{-1}\| \le C, \qquad \text{for some } C > 0 \text{ and all } k = 0, 1, 2, \ldots. \qquad (3.39) $$
If this property holds, global convergence of the modified line search Newton method follows from the results of Section 3.2.
Theorem 3.8.
Let $f$ be twice continuously differentiable on an open set $\mathcal{D}$, and assume that the starting point $x_0$ of Algorithm 3.2 is such that the level set $\mathcal{L} = \{x \in \mathcal{D} : f(x) \le f(x_0)\}$ is compact. Then if the bounded modified factorization property holds, we have that

$$ \lim_{k \to \infty} \nabla f(x_k) = 0. $$
For a proof of this result, see [215].
We now consider the convergence rate of Algorithm 3.2. Suppose that the sequence of iterates $x_k$ converges to a point $x^*$ where $\nabla^2 f(x^*)$ is sufficiently positive definite in the sense that the modification strategies described in the next section return the modification $E_k = 0$ for all sufficiently large $k$. By Theorem 3.6, we have that $\alpha_k = 1$ for all sufficiently large $k$, so that Algorithm 3.2 reduces to a pure Newton method, and the rate of convergence is quadratic.
For problems in which $\nabla^2 f(x^*)$ is close to singular, there is no guarantee that the modification $E_k$ will eventually vanish, and the convergence rate may be only linear. Besides requiring the modified matrix $B_k$ to be well conditioned (so that Theorem 3.8 holds), we would like the modification to be as small as possible, so that the second-order information in the Hessian is preserved as far as possible. Naturally, we would also like the modified factorization to be computable at moderate cost.
To set the stage for the matrix factorization techniques that will be used in Algorithm 3.2, we will begin by assuming that the eigenvalue decomposition of $\nabla^2 f(x_k)$ is available. This is not realistic for large-scale problems because this decomposition is generally too expensive to compute, but it will motivate several practical modification strategies.
EIGENVALUE MODIFICATION
Consider a problem in which, at the current iterate $x_k$, $\nabla f(x_k) = (1, -3, 2)^T$ and $\nabla^2 f(x_k) = \mathrm{diag}(10, 3, -1)$, which is clearly indefinite. By the spectral decomposition theorem (see Appendix A) we can define $Q = I$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$, and write

$$ \nabla^2 f(x_k) = Q \Lambda Q^T = \sum_{i=1}^{n} \lambda_i q_i q_i^T. \qquad (3.40) $$
The pure Newton step (the solution of (3.38)) is $p_k^N = (-0.1, 1, 2)^T$, which is not a descent direction, since $\nabla f(x_k)^T p_k^N > 0$. One might suggest a modified strategy in which we replace $\nabla^2 f(x_k)$ by a positive definite approximation $B_k$, in which all negative eigenvalues in $\nabla^2 f(x_k)$ are replaced by a small positive number $\delta$ that is somewhat larger than machine precision $\mathbf{u}$; say $\delta = \sqrt{\mathbf{u}}$. For a machine precision of $10^{-16}$, the resulting matrix in our example is

$$ B_k = \sum_{i=1}^{2} \lambda_i q_i q_i^T + \delta q_3 q_3^T = \mathrm{diag}\left(10, 3, 10^{-8}\right), \qquad (3.41) $$
which is numerically positive definite and whose curvature along the eigenvectors q1 and q2 has been preserved. Note, however, that the search direction based on this modified Hessian is
$$ p_k = -B_k^{-1} \nabla f_k = -\sum_{i=1}^{2} \frac{1}{\lambda_i} q_i q_i^T \nabla f_k - \frac{1}{\delta} q_3 q_3^T \nabla f(x_k) \approx -\left(2 \times 10^{8}\right) q_3. \qquad (3.42) $$
For small $\delta$, this step is nearly parallel to $q_3$ (with relatively small contributions from $q_1$ and $q_2$) and quite long. Although $f$ decreases along the direction $p_k$, its extreme length violates the spirit of Newton's method, which relies on a quadratic approximation of the objective function that is valid in a neighborhood of the current iterate $x_k$. It is therefore not clear that this search direction is effective.
Various other modification strategies are possible. We could flip the signs of the negative eigenvalues in (3.40), which amounts to setting $\delta = 1$ in our example. We could set the last term in (3.42) to zero, so that the search direction has no components along the negative curvature directions. We could adapt the choice of $\delta$ to ensure that the length of the step is not excessive, a strategy that has the flavor of trust-region methods. As this discussion shows, there is a great deal of freedom in devising modification strategies, and there is currently no agreement on which strategy is best.
Setting the issue of the choice of $\delta$ aside for the moment, let us look more closely at the process of modifying a matrix so that it becomes positive definite. The modification (3.41) to the example matrix (3.40) can be shown to be optimal in the following sense. If $A$ is a symmetric matrix with spectral decomposition $A = Q \Lambda Q^T$, then the correction matrix $\Delta A$ of minimum Frobenius norm that ensures that $\lambda_{\min}(A + \Delta A) \ge \delta$ is given by

$$ \Delta A = Q \, \mathrm{diag}(\tau_i) \, Q^T, \quad \text{with } \tau_i = \begin{cases} 0, & \lambda_i \ge \delta, \\ \delta - \lambda_i, & \lambda_i < \delta. \end{cases} \qquad (3.43) $$

Here, $\lambda_{\min}(A)$ denotes the smallest eigenvalue of $A$, and the Frobenius norm of a matrix is defined as $\|A\|_F^2 = \sum_{i,j=1}^{n} a_{ij}^2$ (see (A.9)). Note that $\Delta A$ is not diagonal in general, and that the modified matrix is given by

$$ A + \Delta A = Q (\Lambda + \mathrm{diag}(\tau_i)) Q^T. $$
By using a different norm we can obtain a diagonal modification. Suppose again that $A$ is a symmetric matrix with spectral decomposition $A = Q \Lambda Q^T$. A correction matrix $\Delta A$ of minimum Euclidean norm that satisfies $\lambda_{\min}(A + \Delta A) \ge \delta$ is given by

$$ \Delta A = \tau I, \quad \text{with } \tau = \max(0, \delta - \lambda_{\min}(A)). \qquad (3.44) $$
The modified matrix now has the form

$$ A + \tau I, \qquad (3.45) $$

which happens to have the same form as the matrix occurring in (unscaled) trust-region methods (see Chapter 4). All the eigenvalues of (3.45) have thus been shifted up by $\tau$, and all are greater than or equal to $\delta$.
These results suggest that both diagonal and nondiagonal modifications can be considered. Even though we have not answered the question of what constitutes a good modification, various practical diagonal and nondiagonal modifications have been proposed and implemented in software. They do not make use of the spectral decomposition of the Hessian, since it is generally too expensive to compute. Instead, they use Gaussian elimination, choosing the modifications indirectly and hoping that somehow they will produce good steps. Numerical experience indicates that the strategies described next often (but not always) produce good search directions.
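The two corrections (3.43) and (3.44) can be stated in a few lines of NumPy. The sketch below is purely illustrative, since forming the spectral decomposition is exactly what the practical strategies that follow are designed to avoid.

    import numpy as np

    def modify_frobenius(A, delta):
        # Minimum Frobenius-norm correction (3.43): raise every eigenvalue
        # below delta up to delta, keeping the eigenvectors unchanged.
        lam, Q = np.linalg.eigh(A)
        return Q @ np.diag(np.maximum(lam, delta)) @ Q.T

    def modify_euclidean(A, delta):
        # Minimum Euclidean-norm correction (3.44): shift the whole spectrum
        # by tau = max(0, delta - lambda_min(A)).
        tau = max(0.0, delta - np.linalg.eigvalsh(A).min())
        return A + tau * np.eye(A.shape[0])

    A = np.diag([10.0, 3.0, -1.0])               # the indefinite example (3.40)
    print(np.diag(modify_frobenius(A, 1e-8)))    # [10, 3, 1e-8], as in (3.41)
    print(np.diag(modify_euclidean(A, 1e-8)))    # whole diagonal shifted up by about 1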
ADDING A MULTIPLE OF THE IDENTITY
Perhaps the simplest idea is to find a scalar $\tau > 0$ such that $\nabla^2 f(x_k) + \tau I$ is sufficiently positive definite. From the previous discussion we know that $\tau$ must satisfy (3.44), but a good estimate of the smallest eigenvalue of the Hessian is normally not available. The following algorithm describes a method that tries successively larger values of $\tau$. Here, $a_{ii}$ denotes a diagonal element of $A$.
Algorithm 3.3 (Cholesky with Added Multiple of the Identity).
Choose $\beta > 0$;
if $\min_i a_{ii} > 0$
  set $\tau_0 \leftarrow 0$;
else
  $\tau_0 \leftarrow -\min_i a_{ii} + \beta$;
end (if)
for k = 0, 1, 2, ...
  Attempt to apply the Cholesky algorithm to obtain $L L^T = A + \tau_k I$;
  if the factorization is completed successfully
    stop and return $L$;
  else
    $\tau_{k+1} \leftarrow \max(2\tau_k, \beta)$;
  end (if)
end (for)
The choice of $\beta$ is heuristic; a typical value is $\beta = 10^{-3}$. We could choose the first nonzero shift $\tau_0$ to be proportional to the final value of $\tau$ used in the latest Hessian modification; see also Algorithm B.1. The strategy implemented in Algorithm 3.3 is quite simple and may be preferable to the modified factorization techniques described next, but it suffers from one drawback. Every value of $\tau_k$ requires a new factorization of $A + \tau_k I$, and the algorithm can be quite expensive if several trial values are generated. Therefore it may be advantageous to increase $\tau$ more rapidly, say by a factor of 10 instead of 2 in the last else clause.
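Algorithm 3.3 translates almost line for line into Python, relying on the fact that NumPy's Cholesky routine raises an exception precisely when the matrix is not (numerically) positive definite; the function name and default $\beta$ are illustrative.

    import numpy as np

    def cholesky_add_identity(A, beta=1e-3):
        # Algorithm 3.3: find tau >= 0 such that A + tau*I has a Cholesky
        # factorization L L^T, doubling tau after each failed attempt.
        min_diag = np.min(np.diag(A))
        tau = 0.0 if min_diag > 0 else -min_diag + beta
        while True:
            try:
                L = np.linalg.cholesky(A + tau * np.eye(A.shape[0]))
                return L, tau
            except np.linalg.LinAlgError:
                tau = max(2.0 * tau, beta)  # factorization failed: increase shift

As the text suggests, replacing the factor 2 by 10 in the last line reduces the number of trial factorizations at the cost of a possibly larger shift.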
MODIFIED CHOLESKY FACTORIZATION
Another approach for modifying a Hessian matrix that is not positive definite is to perform a Cholesky factorization of $\nabla^2 f(x_k)$, but to increase the diagonal elements encountered during the factorization (where necessary) to ensure that they are sufficiently positive. This modified Cholesky approach is designed to accomplish two goals: It guarantees that the modified Cholesky factors exist and are bounded relative to the norm of the actual Hessian, and it does not modify the Hessian if it is sufficiently positive definite.
We begin our description of this approach by briefly reviewing the Cholesky factorization. Every symmetric positive definite matrix A can be written as
$$ A = L D L^T, \qquad (3.46) $$
where $L$ is a lower triangular matrix with unit diagonal elements and $D$ is a diagonal matrix with positive elements on the diagonal. By equating the elements in (3.46), column by column, it is easy to derive formulas for computing $L$ and $D$.
EXAMPLE 3.1. Consider the case $n = 3$. The equation $A = L D L^T$ is given by

$$ \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{21} & a_{22} & a_{32} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{bmatrix} \begin{bmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{bmatrix} \begin{bmatrix} 1 & l_{21} & l_{31} \\ 0 & 1 & l_{32} \\ 0 & 0 & 1 \end{bmatrix}. $$

(The notation indicates that $A$ is symmetric.) By equating the elements of the first column, we have

$$ a_{11} = d_1, $$
$$ a_{21} = d_1 l_{21} \;\Rightarrow\; l_{21} = a_{21}/d_1, $$
$$ a_{31} = d_1 l_{31} \;\Rightarrow\; l_{31} = a_{31}/d_1. $$

Proceeding with the next two columns, we obtain

$$ a_{22} = d_1 l_{21}^2 + d_2 \;\Rightarrow\; d_2 = a_{22} - d_1 l_{21}^2, $$
$$ a_{32} = d_1 l_{31} l_{21} + d_2 l_{32} \;\Rightarrow\; l_{32} = (a_{32} - d_1 l_{31} l_{21})/d_2, $$
$$ a_{33} = d_1 l_{31}^2 + d_2 l_{32}^2 + d_3 \;\Rightarrow\; d_3 = a_{33} - d_1 l_{31}^2 - d_2 l_{32}^2. $$
This procedure is generalized in the following algorithm.
Algorithm 3.4 (Cholesky Factorization, $LDL^T$ Form).
for j = 1, 2, ..., n
  $c_{jj} \leftarrow a_{jj} - \sum_{s=1}^{j-1} d_s l_{js}^2$;
  $d_j \leftarrow c_{jj}$;
  for i = j+1, ..., n
    $c_{ij} \leftarrow a_{ij} - \sum_{s=1}^{j-1} d_s l_{is} l_{js}$;
    $l_{ij} \leftarrow c_{ij}/d_j$;
  end
end
One can show (see, for example, Golub and Van Loan [136, Section 4.2.3]) that the diagonal elements $d_j$ are all positive whenever $A$ is positive definite. The scalars $c_{ij}$ have been introduced only to facilitate the description of the modified factorization discussed below. We should note that Algorithm 3.4 differs a little from the standard form of the Cholesky factorization, which produces a lower triangular matrix $M$ such that
$$ A = M M^T. \qquad (3.47) $$
In fact, we can make the identification $M = L D^{1/2}$ to relate $M$ to the factors $L$ and $D$ computed in Algorithm 3.4. The technique for computing $M$ appears as Algorithm A.2 in Appendix A.
If $A$ is indefinite, the factorization $A = L D L^T$ may not exist. Even if it does exist, Algorithm 3.4 is numerically unstable when applied to such matrices, in the sense that the elements of $L$ and $D$ can become arbitrarily large. It follows that a strategy of computing the $L D L^T$ factorization and then modifying the diagonal after the fact to force its elements to be positive may break down, or may result in a matrix that is drastically different from $A$.
Instead, we can modify the matrix $A$ during the course of the factorization in such a way that all elements in $D$ are sufficiently positive, and so that the elements of $D$ and $L$ are not too large. To control the quality of the modification, we choose two positive parameters $\delta$ and $\beta$, and require that during the computation of the $j$th columns of $L$ and $D$ in Algorithm 3.4 (that is, for each $j$ in the outer loop of the algorithm) the following
bounds be satisfied:

$$ d_j \ge \delta, \qquad |m_{ij}| \le \beta, \quad i = j+1, j+2, \ldots, n, \qquad (3.48) $$

where $m_{ij} = l_{ij} \sqrt{d_j}$. To satisfy these bounds we only need to change one step in Algorithm 3.4: The formula for computing the diagonal element $d_j$ in Algorithm 3.4 is replaced by

$$ d_j = \max\left( |c_{jj}|, \left(\frac{\theta_j}{\beta}\right)^2, \delta \right), \quad \text{with } \theta_j = \max_{j < i \le n} |c_{ij}|. \qquad (3.49) $$

To verify that (3.48) holds, we note from Algorithm 3.4 that $c_{ij} = l_{ij} d_j$, and therefore

$$ |m_{ij}| = |l_{ij}| \sqrt{d_j} = \frac{|c_{ij}|}{\sqrt{d_j}} \le \frac{|c_{ij}| \, \beta}{\theta_j} \le \beta, \quad \text{for all } i > j. $$
We note that $\theta_j$ can be computed prior to $d_j$ because the elements $c_{ij}$ in the second for loop of Algorithm 3.4 do not involve $d_j$. In fact, this is the reason for introducing the quantities $c_{ij}$ into the algorithm.
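A compact Python sketch of Algorithm 3.4 with the single change (3.49) is given below. It omits the symmetric row and column interchanges of the Gill-Murray-Wright algorithm described next, so it should be read as an illustration of the recurrences rather than a robust implementation; the default values of $\delta$ and $\beta$ are placeholders.

    import numpy as np

    def modified_ldl(A, delta=1e-8, beta=1e2):
        # Algorithm 3.4 with the modified diagonal formula (3.49), so that
        # d_j >= delta and |l_ij| * sqrt(d_j) <= beta. No pivoting.
        n = A.shape[0]
        L, d = np.eye(n), np.zeros(n)
        for j in range(n):
            c_jj = A[j, j] - np.sum(d[:j] * L[j, :j]**2)
            c = A[j+1:, j] - L[j+1:, :j] @ (d[:j] * L[j, :j])  # c_ij for i > j
            theta = np.max(np.abs(c)) if j < n - 1 else 0.0
            d[j] = max(abs(c_jj), (theta / beta)**2, delta)    # formula (3.49)
            L[j+1:, j] = c / d[j]
        return L, d  # L @ np.diag(d) @ L.T reproduces A when nothing is modified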
These observations are the basis of the modified Cholesky algorithm described in detail in Gill, Murray, and Wright [130], which introduces symmetric interchanges of rows and columns to try to reduce the size of the modification. If $P$ denotes the permutation matrix associated with the row and column interchanges, the algorithm produces the Cholesky factorization of the permuted, modified matrix $P A P^T + E$, that is,

$$ P A P^T + E = L D L^T = M M^T, \qquad (3.50) $$

where $E$ is a nonnegative diagonal matrix that is zero if $A$ is sufficiently positive definite. One can show (Moré and Sorensen [215]) that the matrices $B_k$ obtained by applying this modified Cholesky algorithm to the exact Hessians $\nabla^2 f(x_k)$ have bounded condition numbers, that is, the bound (3.39) holds for some value of $C$.
MODIFIED SYMMETRIC INDEFINITE FACTORIZATION
Another strategy for modifying an indefinite Hessian is to use a procedure based on a symmetric indefinite factorization. Any symmetric matrix $A$, whether positive definite or not, can be written as

$$ P A P^T = L B L^T, \qquad (3.51) $$

where $L$ is unit lower triangular, $B$ is a block diagonal matrix with blocks of dimension 1 or 2, and $P$ is a permutation matrix (see our discussion in Appendix A and also Golub and Van Loan [136, Section 4.4]). We mentioned earlier that attempting to compute the $L D L^T$ factorization of an indefinite matrix (where $D$ is a diagonal matrix) is inadvisable because even if the factors $L$ and $D$ are well defined, they may contain entries that are larger than the original elements of $A$, thus amplifying rounding errors that arise during the computation. However, by using the block diagonal matrix $B$, which allows $2 \times 2$ blocks as well as $1 \times 1$ blocks on the diagonal, we can guarantee that the factorization (3.51) always exists and can be computed by a numerically stable process.
EXAMPLE 3.2. The matrix

$$ A = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 2 & 2 \\ 2 & 2 & 3 & 3 \\ 3 & 2 & 3 & 4 \end{bmatrix} $$

can be written in the form (3.51) with $P = [e_1, e_4, e_3, e_2]$,

$$ L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ \frac{1}{9} & \frac{2}{3} & 1 & 0 \\ \frac{2}{9} & \frac{1}{3} & 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 3 & 0 & 0 \\ 3 & 4 & 0 & 0 \\ 0 & 0 & \frac{7}{9} & \frac{5}{9} \\ 0 & 0 & \frac{5}{9} & \frac{10}{9} \end{bmatrix}. \qquad (3.52) $$

Note that both diagonal blocks in $B$ are $2 \times 2$. Several algorithms for computing symmetric indefinite factorizations are discussed in Section A.1 of Appendix A.
The symmetric indefinite factorization allows us to determine the inertia of a matrix, that is, the number of positive, zero, and negative eigenvalues. One can show that the inertia of $B$ equals the inertia of $A$. Moreover, the $2 \times 2$ blocks in $B$ are always constructed to have one positive and one negative eigenvalue. Thus the number of positive eigenvalues in $A$ equals the number of positive $1 \times 1$ blocks plus the number of $2 \times 2$ blocks.
As for the Cholesky factorization, an indefinite symmetric factorization algorithm can be modified to ensure that the modified factors are the factors of a positive definite matrix. The strategy is first to compute the factorization (3.51), as well as the spectral decomposition $B = Q \Lambda Q^T$, which is inexpensive to compute because $B$ is block diagonal (see Exercise 3.12). We then construct a modification matrix $F$ such that $L(B + F)L^T$ is sufficiently positive definite. Motivated by the modified spectral decomposition (3.43), we choose a parameter $\delta > 0$ and define $F$ to be

$$ F = Q \, \mathrm{diag}(\tau_i) \, Q^T, \quad \tau_i = \begin{cases} 0, & \lambda_i \ge \delta, \\ \delta - \lambda_i, & \lambda_i < \delta, \end{cases} \quad i = 1, 2, \ldots, n, \qquad (3.53) $$

where $\lambda_i$ are the eigenvalues of $B$. The matrix $F$ is thus the modification of minimum Frobenius norm that ensures that all eigenvalues of the modified matrix $B + F$ are no less than $\delta$. This strategy therefore modifies the factorization (3.51) as follows:

$$ P(A + E)P^T = L(B + F)L^T, \quad \text{where } E = P^T L F L^T P. $$
Note that $E$ will not be diagonal, in general. Hence, in contrast to the modified Cholesky approach, this modification strategy changes the entire matrix $A$, not just its diagonal. The aim of strategy (3.53) is that the modified matrix satisfies $\lambda_{\min}(A + E) \approx \delta$ whenever the original matrix $A$ has $\lambda_{\min}(A) < \delta$. It is not clear, however, whether it always comes close to attaining this goal.
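SciPy provides a symmetric indefinite factorization of the form (3.51) through scipy.linalg.ldl, so the strategy (3.53) can be sketched as follows. Treating $B$ densely in the eigendecomposition is acceptable here only because $B$ is block diagonal (Exercise 3.12); note that in SciPy's convention the permutation is folded into the returned triangular factor.

    import numpy as np
    from scipy.linalg import ldl

    def modified_indefinite(A, delta=1e-8):
        # Factor A = L B L^T (permutation absorbed into L), then add
        # F = Q diag(tau_i) Q^T so that all eigenvalues of B + F are >= delta,
        # as in (3.53). Returns the modified matrix A + E with E = L F L^T.
        L, B, _ = ldl(A)
        lam, Q = np.linalg.eigh(B)
        F = Q @ np.diag(np.maximum(delta - lam, 0.0)) @ Q.T
        return L @ (B + F) @ L.T

    A = np.array([[0., 1, 2, 3], [1, 2, 2, 2], [2, 2, 3, 3], [3, 2, 3, 4]])
    print(np.linalg.eigvalsh(modified_indefinite(A)))  # all strictly positive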
3.5 STEP-LENGTH SELECTION ALGORITHMS
We now consider techniques for finding a minimum of the one-dimensional function

$$ \phi(\alpha) = f(x_k + \alpha p_k), \qquad (3.54) $$

or for simply finding a step length $\alpha_k$ satisfying one of the termination conditions described in Section 3.1. We assume that $p_k$ is a descent direction, that is, $\phi'(0) < 0$, so that our search can be confined to positive values of $\alpha$.

If $f$ is a convex quadratic, $f(x) = \frac{1}{2} x^T Q x - b^T x$, its one-dimensional minimizer along the ray $x_k + \alpha p_k$ can be computed analytically and is given by

$$ \alpha_k = -\frac{\nabla f_k^T p_k}{p_k^T Q p_k}. \qquad (3.55) $$
For general nonlinear functions, it is necessary to use an iterative procedure. The line search procedure deserves particular attention because it has a major impact on the robustness and efficiency of all nonlinear optimization methods.
Line search procedures can be classified according to the type of derivative information they use. Algorithms that use only function values can be inefficient since, to be theoretically sound, they need to continue iterating until the search for the minimizer is narrowed down to a small interval. In contrast, knowledge of gradient information allows us to determine whether a suitable step length has been located, as stipulated, for example, by the Wolfe conditions (3.6) or Goldstein conditions (3.11). Often, particularly when $x_k$ is close to the solution, the very first choice of $\alpha$ satisfies these conditions, so the line search need not be invoked at all. In the rest of this section, we discuss only algorithms that make use of derivative information. More information on derivative-free procedures is given in the notes at the end of this chapter.
All line search procedures require an initial estimate $\alpha_0$ and generate a sequence $\{\alpha_i\}$ that either terminates with a step length satisfying the conditions specified by the user (for example, the Wolfe conditions) or determines that such a step length does not exist. Typical procedures consist of two phases: a bracketing phase that finds an interval $[\bar{a}, \bar{b}]$ containing acceptable step lengths, and a selection phase that zooms in to locate the final step length. The selection phase usually reduces the bracketing interval during its search for the desired step length and interpolates some of the function and derivative information gathered on earlier steps to guess the location of the minimizer. We first discuss how to perform this interpolation.
In the following discussion we let $\alpha_k$ and $\alpha_{k-1}$ denote the step lengths used at iterations $k$ and $k-1$ of the optimization algorithm, respectively. On the other hand, we denote the trial step lengths generated during the line search by $\alpha_i$ (and $\alpha_{i-1}$) and also $\alpha_j$. We use $\alpha_0$ to denote the initial guess.
INTERPOLATION
We begin by describing a line search procedure based on interpolation of known function and derivative values of the function $\phi$. This procedure can be viewed as an enhancement of Algorithm 3.1. The aim is to find a value of $\alpha$ that satisfies the sufficient decrease condition (3.6a), without being too small. Accordingly, the procedures here generate a decreasing sequence of values $\alpha_i$ such that each value $\alpha_i$ is not too much smaller than its predecessor $\alpha_{i-1}$.
Note that we can write the sufficient decrease condition in the notation of (3.54) as

$$ \phi(\alpha_k) \le \phi(0) + c_1 \alpha_k \phi'(0), \qquad (3.56) $$

and that since the constant $c_1$ is usually chosen to be small in practice ($c_1 = 10^{-4}$, say), this condition asks for little more than descent in $f$. We design the procedure to be efficient in the sense that it computes the derivative $\nabla f(x)$ as few times as possible.
Suppose that the initial guess $\alpha_0$ is given. If we have

$$ \phi(\alpha_0) \le \phi(0) + c_1 \alpha_0 \phi'(0), $$

this step length satisfies the condition, and we terminate the search. Otherwise, we know that the interval $[0, \alpha_0]$ contains acceptable step lengths (see Figure 3.3). We form a quadratic approximation $\phi_q(\alpha)$ to $\phi$ by interpolating the three pieces of information available, namely $\phi(0)$, $\phi'(0)$, and $\phi(\alpha_0)$, to obtain

$$ \phi_q(\alpha) = \left( \frac{\phi(\alpha_0) - \phi(0) - \alpha_0 \phi'(0)}{\alpha_0^2} \right) \alpha^2 + \phi'(0)\,\alpha + \phi(0). \qquad (3.57) $$

Note that this function is constructed so that it satisfies the interpolation conditions $\phi_q(0) = \phi(0)$, $\phi_q'(0) = \phi'(0)$, and $\phi_q(\alpha_0) = \phi(\alpha_0)$. The new trial value $\alpha_1$ is defined as the minimizer of this quadratic, that is, we obtain

$$ \alpha_1 = -\frac{\phi'(0)\,\alpha_0^2}{2 \left[ \phi(\alpha_0) - \phi(0) - \phi'(0)\,\alpha_0 \right]}. \qquad (3.58) $$
If the sufficient decrease condition (3.56) is satisfied at $\alpha_1$, we terminate the search. Otherwise, we construct a cubic function that interpolates the four pieces of information $\phi(0)$, $\phi'(0)$, $\phi(\alpha_0)$, and $\phi(\alpha_1)$, obtaining

$$ \phi_c(\alpha) = a \alpha^3 + b \alpha^2 + \alpha \phi'(0) + \phi(0), $$

where

$$ \begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_0^2 \alpha_1^2 (\alpha_1 - \alpha_0)} \begin{bmatrix} \alpha_0^2 & -\alpha_1^2 \\ -\alpha_0^3 & \alpha_1^3 \end{bmatrix} \begin{bmatrix} \phi(\alpha_1) - \phi(0) - \phi'(0)\,\alpha_1 \\ \phi(\alpha_0) - \phi(0) - \phi'(0)\,\alpha_0 \end{bmatrix}. $$

By differentiating $\phi_c(\alpha)$, we see that the minimizer $\alpha_2$ of $\phi_c$ lies in the interval $[0, \alpha_1]$ and is given by

$$ \alpha_2 = \frac{-b + \sqrt{b^2 - 3a\,\phi'(0)}}{3a}. $$

If necessary, this process is repeated, using a cubic interpolant of $\phi(0)$, $\phi'(0)$ and the two most recent values of $\phi$, until an $\alpha$ that satisfies (3.56) is located. If any $\alpha_i$ is either too close to its predecessor $\alpha_{i-1}$ or else too much smaller than $\alpha_{i-1}$, we reset $\alpha_i = \alpha_{i-1}/2$. This safeguard procedure ensures that we make reasonable progress on each iteration and that the final $\alpha$ is not too small.
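A Python sketch of this interpolation procedure follows; phi0 and dphi0 denote $\phi(0)$ and $\phi'(0)$, and the single safeguard parameter rho (a placeholder value) implements the resetting rule described above in a slightly different form, by confining each new trial to a fixed fraction of its predecessor.

    import numpy as np

    def interp_backtrack(phi, phi0, dphi0, alpha0, c1=1e-4, rho=0.1):
        # Find alpha satisfying the sufficient decrease condition (3.56):
        # one quadratic step (3.58), then repeated safeguarded cubic steps.
        # Assumes dphi0 < 0.
        f0 = phi(alpha0)
        if f0 <= phi0 + c1 * alpha0 * dphi0:
            return alpha0
        a_old, f_old = alpha0, f0
        a = -dphi0 * alpha0**2 / (2.0 * (f0 - phi0 - dphi0 * alpha0))  # (3.58)
        while True:
            a = np.clip(a, rho * a_old, (1.0 - rho) * a_old)  # safeguard
            f = phi(a)
            if f <= phi0 + c1 * a * dphi0:
                return a
            # Cubic interpolating phi(0), phi'(0) and the two most recent values.
            denom = a_old**2 * a**2 * (a - a_old)
            r1 = f - phi0 - dphi0 * a
            r2 = f_old - phi0 - dphi0 * a_old
            ca = (a_old**2 * r1 - a**2 * r2) / denom
            cb = (-a_old**3 * r1 + a**3 * r2) / denom
            a_old, f_old = a, f
            a = (-cb + np.sqrt(cb**2 - 3.0 * ca * dphi0)) / (3.0 * ca)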
The strategy just described assumes that derivative values are significantly more expensive to compute than function values. It is often possible, however, to compute the directional derivative simultaneously with the function, at little additional cost; see Chapter 8. Accordingly, we can design an alternative strategy based on cubic interpolation of the values of $\phi$ and $\phi'$ at the two most recent values of $\alpha$.
Cubic interpolation provides a good model for functions with significant changes of curvature. Suppose we have an interval $[\bar{a}, \bar{b}]$ known to contain desirable step lengths, and two previous step length estimates $\alpha_{i-1}$ and $\alpha_i$ in this interval. We use a cubic function to interpolate $\phi(\alpha_{i-1})$, $\phi'(\alpha_{i-1})$, $\phi(\alpha_i)$, and $\phi'(\alpha_i)$. (This cubic function always exists and is unique; see, for example, Bulirsch and Stoer [41, p. 52].) The minimizer of this cubic in $[\bar{a}, \bar{b}]$ is either at one of the endpoints or else in the interior, in which case it is given by

$$ \alpha_{i+1} = \alpha_i - (\alpha_i - \alpha_{i-1}) \left[ \frac{\phi'(\alpha_i) + d_2 - d_1}{\phi'(\alpha_i) - \phi'(\alpha_{i-1}) + 2 d_2} \right], \qquad (3.59) $$

with

$$ d_1 = \phi'(\alpha_{i-1}) + \phi'(\alpha_i) - 3 \, \frac{\phi(\alpha_{i-1}) - \phi(\alpha_i)}{\alpha_{i-1} - \alpha_i}, $$
$$ d_2 = \mathrm{sign}(\alpha_i - \alpha_{i-1}) \left[ d_1^2 - \phi'(\alpha_{i-1}) \phi'(\alpha_i) \right]^{1/2}. $$
The interpolation process can be repeated by discarding the data at one of the step lengths $\alpha_{i-1}$ or $\alpha_i$ and replacing it by $\phi(\alpha_{i+1})$ and $\phi'(\alpha_{i+1})$. The decision on which of $\alpha_{i-1}$ and $\alpha_i$ should be kept and which discarded depends on the specific conditions used to terminate the line search; we discuss this issue further below in the context of the Wolfe conditions. Cubic interpolation is a powerful strategy, since it usually produces a quadratic rate of convergence of the iteration (3.59) to the minimizing value of $\alpha$.
INITIAL STEP LENGTH
For Newton and quasi-Newton methods, the step $\alpha_0 = 1$ should always be used as the initial trial step length. This choice ensures that unit step lengths are taken whenever they satisfy the termination conditions and allows the rapid rate-of-convergence properties of these methods to take effect.
For methods that do not produce well scaled search directions, such as the steepest descent and conjugate gradient methods, it is important to use current information about the problem and the algorithm to make the initial guess. A popular strategy is to assume that the first-order change in the function at iterate $x_k$ will be the same as that obtained at the previous step. In other words, we choose the initial guess $\alpha_0$ so that $\alpha_0 \nabla f_k^T p_k = \alpha_{k-1} \nabla f_{k-1}^T p_{k-1}$; that is,

$$ \alpha_0 = \alpha_{k-1} \frac{\nabla f_{k-1}^T p_{k-1}}{\nabla f_k^T p_k}. $$

Another useful strategy is to interpolate a quadratic to the data $f(x_{k-1})$, $f(x_k)$, and $\nabla f_{k-1}^T p_{k-1}$ and to define $\alpha_0$ to be its minimizer. This strategy yields

$$ \alpha_0 = \frac{2 (f_k - f_{k-1})}{\phi'(0)}. \qquad (3.60) $$
It can be shown that if $x_k \to x^*$ superlinearly, then the ratio in this expression converges to 1. If we adjust the choice (3.60) by setting

$$ \alpha_0 \leftarrow \min(1, 1.01 \alpha_0), $$

we find that the unit step length $\alpha_0 = 1$ will eventually always be tried and accepted, and the superlinear convergence properties of Newton and quasi-Newton methods will be observed.
A LINE SEARCH ALGORITHM FOR THE WOLFE CONDITIONS
The Wolfe (or strong Wolfe) conditions are among the most widely applicable and useful termination conditions. We now describe in some detail a one-dimensional search procedure that is guaranteed to find a step length satisfying the strong Wolfe conditions (3.7) for any parameters $c_1$ and $c_2$ satisfying $0 < c_1 < c_2 < 1$. As before, we assume that $p$ is a descent direction and that $f$ is bounded below along the direction $p$.
The algorithm has two stages. The first stage begins with a trial estimate $\alpha_1$, and keeps increasing it until it finds either an acceptable step length or an interval that brackets the desired step lengths. In the latter case, the second stage is invoked by calling a function called zoom (Algorithm 3.6, below), which successively decreases the size of the interval until an acceptable step length is identified.
A formal specification of the line search algorithm follows. We refer to (3.7a) as the sufficient decrease condition and to (3.7b) as the curvature condition. The parameter $\alpha_{\max}$ is a user-supplied bound on the maximum step length allowed. The line search algorithm terminates with $\alpha^*$ set to a step length that satisfies the strong Wolfe conditions.
Algorithm 3.5 (Line Search Algorithm).
Set $\alpha_0 \leftarrow 0$, choose $\alpha_{\max} > 0$ and $\alpha_1 \in (0, \alpha_{\max})$;
$i \leftarrow 1$;
repeat
  Evaluate $\phi(\alpha_i)$;
  if $\phi(\alpha_i) > \phi(0) + c_1 \alpha_i \phi'(0)$ or [$\phi(\alpha_i) \ge \phi(\alpha_{i-1})$ and $i > 1$]
    $\alpha^* \leftarrow \mathrm{zoom}(\alpha_{i-1}, \alpha_i)$ and stop;
  Evaluate $\phi'(\alpha_i)$;
  if $|\phi'(\alpha_i)| \le -c_2 \phi'(0)$
    set $\alpha^* \leftarrow \alpha_i$ and stop;
  if $\phi'(\alpha_i) \ge 0$
    set $\alpha^* \leftarrow \mathrm{zoom}(\alpha_i, \alpha_{i-1})$ and stop;
  Choose $\alpha_{i+1} \in (\alpha_i, \alpha_{\max})$;
  $i \leftarrow i + 1$;
end (repeat)
Note that the sequence of trial step lengths $\{\alpha_i\}$ is monotonically increasing, but that the order of the arguments supplied to the zoom function may vary. The procedure uses the knowledge that the interval $(\alpha_{i-1}, \alpha_i)$ contains step lengths satisfying the strong Wolfe conditions if one of the following three conditions is satisfied:
(i) $\alpha_i$ violates the sufficient decrease condition;
(ii) $\phi(\alpha_i) \ge \phi(\alpha_{i-1})$;
(iii) $\phi'(\alpha_i) \ge 0$.
The last step of the algorithm performs extrapolation to find the next trial value $\alpha_{i+1}$. To implement this step we can use approaches like the interpolation procedures above, or we can simply set $\alpha_{i+1}$ to some constant multiple of $\alpha_i$. Whichever strategy we use, it is important that the successive steps increase quickly enough to reach the upper limit $\alpha_{\max}$ in a finite number of iterations.
We now specify the function zoom, which requires a little explanation. The order of its input arguments is such that each call has the form $\mathrm{zoom}(\alpha_{lo}, \alpha_{hi})$, where

(a) the interval bounded by $\alpha_{lo}$ and $\alpha_{hi}$ contains step lengths that satisfy the strong Wolfe conditions;

(b) $\alpha_{lo}$ is, among all step lengths generated so far and satisfying the sufficient decrease condition, the one giving the smallest function value; and

(c) $\alpha_{hi}$ is chosen so that $\phi'(\alpha_{lo})(\alpha_{hi} - \alpha_{lo}) < 0$.
Each iteration of zoom generates an iterate $\alpha_j$ between $\alpha_{lo}$ and $\alpha_{hi}$, and then replaces one of these endpoints by $\alpha_j$ in such a way that the properties (a), (b), and (c) continue to hold.
Algorithm 3.6 (zoom).
repeat
  Interpolate (using quadratic, cubic, or bisection) to find a trial step length $\alpha_j$ between $\alpha_{lo}$ and $\alpha_{hi}$;
  Evaluate $\phi(\alpha_j)$;
  if $\phi(\alpha_j) > \phi(0) + c_1 \alpha_j \phi'(0)$ or $\phi(\alpha_j) \ge \phi(\alpha_{lo})$
    $\alpha_{hi} \leftarrow \alpha_j$;
  else
    Evaluate $\phi'(\alpha_j)$;
    if $|\phi'(\alpha_j)| \le -c_2 \phi'(0)$
      Set $\alpha^* \leftarrow \alpha_j$ and stop;
    if $\phi'(\alpha_j)(\alpha_{hi} - \alpha_{lo}) \ge 0$
      $\alpha_{hi} \leftarrow \alpha_{lo}$;
    $\alpha_{lo} \leftarrow \alpha_j$;
end (repeat)
If the new estimate $\alpha_j$ happens to satisfy the strong Wolfe conditions, then zoom has served its purpose of identifying such a point, so it terminates with $\alpha^* = \alpha_j$. Otherwise, if $\alpha_j$ satisfies the sufficient decrease condition and has a lower function value than $\alpha_{lo}$, then we set $\alpha_{lo} = \alpha_j$ to maintain condition (b). If this setting results in a violation of condition (c), we remedy the situation by setting $\alpha_{hi}$ to the old value of $\alpha_{lo}$. Readers should sketch some graphs to see for themselves how zoom works!
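Algorithms 3.5 and 3.6 fit comfortably in a page of Python. In the sketch below, bisection plays the role of the interpolation step inside zoom, which keeps the code short at the cost of the efficiency that safeguarded polynomial interpolation would provide; the iteration caps are arbitrary safety limits.

    def wolfe_search(phi, dphi, c1=1e-4, c2=0.9, alpha1=1.0, alpha_max=100.0):
        # Return alpha satisfying the strong Wolfe conditions (3.7).
        # phi(a), dphi(a) evaluate phi(alpha) = f(x_k + alpha p_k) and its
        # derivative; requires 0 < c1 < c2 < 1 and dphi(0) < 0.
        phi0, dphi0 = phi(0.0), dphi(0.0)

        def zoom(lo, hi, f_lo):
            for _ in range(50):
                a = 0.5 * (lo + hi)   # bisection; real codes interpolate
                fa = phi(a)
                if fa > phi0 + c1 * a * dphi0 or fa >= f_lo:
                    hi = a
                else:
                    da = dphi(a)
                    if abs(da) <= -c2 * dphi0:
                        return a
                    if da * (hi - lo) >= 0:
                        hi = lo
                    lo, f_lo = a, fa
            return lo

        a_prev, f_prev = 0.0, phi0
        a = alpha1
        for i in range(1, 100):
            fa = phi(a)
            if fa > phi0 + c1 * a * dphi0 or (fa >= f_prev and i > 1):
                return zoom(a_prev, a, f_prev)
            da = dphi(a)
            if abs(da) <= -c2 * dphi0:
                return a
            if da >= 0:
                return zoom(a, a_prev, fa)
            a_prev, f_prev = a, fa
            a = min(2.0 * a, alpha_max)  # extrapolation step
        return a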
As mentioned earlier, the interpolation step that determines $\alpha_j$ should be safeguarded to ensure that the new step length is not too close to the endpoints of the interval. Practical line search algorithms also make use of the properties of the interpolating polynomials to make educated guesses of where the next step length should lie; see [39, 216]. A problem that can arise is that as the optimization algorithm approaches the solution, two consecutive function values $f(x_k)$ and $f(x_{k+1})$ may be indistinguishable in finite-precision arithmetic. Therefore, the line search must include a stopping test if it cannot attain a lower function value after a certain number (typically, ten) of trial step lengths. Some procedures also stop if the relative change in $x$ is close to machine precision, or to some user-specified threshold.
A line search algorithm that incorporates all these features is difficult to code. We advocate the use of one of the several good software implementations available in the public domain. See Dennis and Schnabel [92], Lemaréchal [189], Fletcher [101], Moré and Thuente [216] (in particular), and Hager and Zhang [161].
One may ask how much more expensive it is to require the strong Wolfe conditions instead of the regular Wolfe conditions. Our experience suggests that for a loose line search (with parameters such as $c_1 = 10^{-4}$ and $c_2 = 0.9$), both strategies require a similar amount of work. The strong Wolfe conditions have the advantage that by decreasing $c_2$ we can directly control the quality of the search, by forcing the accepted value of $\alpha$ to lie closer to a local minimum. This feature is important in steepest descent or nonlinear conjugate gradient methods, and therefore a step selection routine that enforces the strong Wolfe conditions has wide applicability.
NOTES AND REFERENCES
For an extensive discussion of line search termination conditions see Ortega and Rheinboldt [230]. Akaike [2] presents a probabilistic analysis of the steepest descent method with exact line searches on quadratic functions. He shows that when $n > 2$, the worst-case bound (3.29) can be expected to hold for most starting points. The case $n = 2$ can be studied in closed form; see Bazaraa, Sherali, and Shetty [14]. Theorem 3.6 is due to Dennis and Moré.
Some line search methods (see Goldfarb [132] and Moré and Sorensen [213]) compute a direction of negative curvature, whenever it exists, to prevent the iteration from converging to nonminimizing stationary points. A direction of negative curvature $p_k$ is one that satisfies $p_k^T \nabla^2 f(x_k) p_k < 0$. These algorithms generate a search direction by combining $p_k$ with the steepest descent direction $-\nabla f_k$, often performing a curvilinear backtracking line search.
It is difficult to determine the relative contributions of the steepest descent and negative curvature directions. Because of this fact, the approach fell out of favor after the introduction of trust-region methods.
For a more thorough treatment of the modified Cholesky factorization see Gill, Murray, and Wright [130] or Dennis and Schnabel [92]. A modified Cholesky factorization based on Gershgorin disk estimates is described in Schnabel and Eskow [276]. The modified indefinite factorization is from Cheng and Higham [58].
Another strategy for implementing a line search Newton method when the Hessian contains negative eigenvalues is to compute a direction of negative curvature and use it to define the search direction (see Moré and Sorensen [213] and Goldfarb [132]).
Derivative-free line search algorithms include golden section and Fibonacci search. They share some features with the line search method given in this chapter. They typically store three trial points that determine an interval containing a one-dimensional minimizer. Golden section and Fibonacci search differ in the way in which the trial step lengths are generated; see, for example, [79, 39].
Our discussion of interpolation follows Dennis and Schnabel [92], and the algorithm for finding a step length satisfying the strong Wolfe conditions can be found in Fletcher [101].
EXERCISES
3.1 Program the steepest descent and Newton algorithms using the backtracking line search, Algorithm 3.1. Use them to minimize the Rosenbrock function (2.22). Set the initial step length $\alpha_0 = 1$ and print the step length used by each method at each iteration. First try the initial point $x_0 = (1.2, 1.2)^T$ and then the more difficult starting point $x_0 = (-1.2, 1)^T$.
3.2 Show that if $0 < c_2 < c_1 < 1$, there may be no step lengths that satisfy the Wolfe conditions.
3.3 Show that the one-dimensional minimizer of a strongly convex quadratic function is given by (3.55).

3.4 Show that the one-dimensional minimizer of a strongly convex quadratic function always satisfies the Goldstein conditions (3.11).
3.5 Prove that $\|Bx\| \ge \|x\| / \|B^{-1}\|$ for any nonsingular matrix $B$. Use this fact to establish (3.19).
3.6 Consider the steepest descent method with exact line searches applied to the convex quadratic function (3.24). Using the properties given in this chapter, show that if the initial point is such that $x_0 - x^*$ is parallel to an eigenvector of $Q$, then the steepest descent method will find the solution in one step.
3.7 Prove the result (3.28) by working through the following steps. First, use (3.26) to show that

$$ \|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2 = 2 \alpha_k \nabla f_k^T Q (x_k - x^*) - \alpha_k^2 \nabla f_k^T Q \nabla f_k, $$

where $\|\cdot\|_Q$ is defined by (3.27). Second, use the fact that $\nabla f_k = Q(x_k - x^*)$ to obtain

$$ \|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2 = \frac{2 (\nabla f_k^T \nabla f_k)^2}{\nabla f_k^T Q \nabla f_k} - \frac{(\nabla f_k^T \nabla f_k)^2}{\nabla f_k^T Q \nabla f_k} $$

and

$$ \|x_k - x^*\|_Q^2 = \nabla f_k^T Q^{-1} \nabla f_k. $$
3.8 Let $Q$ be a positive definite symmetric matrix. Prove that for any vector $x$, we have

$$ \frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)} \ge \frac{4 \lambda_n \lambda_1}{(\lambda_n + \lambda_1)^2}, $$

where $\lambda_n$ and $\lambda_1$ are, respectively, the largest and smallest eigenvalues of $Q$. (This relation, which is known as the Kantorovich inequality, can be used to deduce (3.29) from (3.28).)
3.9 Program the BFGS algorithm using the line search algorithm described in this chapter that implements the strong Wolfe conditions. Have the code verify that $y_k^T s_k$ is always positive. Use it to minimize the Rosenbrock function using the starting points given in Exercise 3.1.
3.10 Compute the eigenvalues of the $2 \times 2$ diagonal blocks of (3.52) and verify that each block has a positive and a negative eigenvalue. Then compute the eigenvalues of $A$ and verify that its inertia is the same as that of $B$.
3.11 Describe the effect that the modified Cholesky factorization (3.50) would have on the Hessian $\nabla^2 f(x_k) = \mathrm{diag}(-2, 12, 4)$.
3.12 Consider a block diagonal matrix $B$ with $1 \times 1$ and $2 \times 2$ blocks. Show that the eigenvalues and eigenvectors of $B$ can be obtained by computing the spectral decomposition of each diagonal block separately.
3.13 Show that the quadratic function that interpolates $\phi(0)$, $\phi'(0)$, and $\phi(\alpha_0)$ is given by (3.57). Then, make use of the fact that the sufficient decrease condition (3.6a) is not satisfied at $\alpha_0$ to show that this quadratic has positive curvature and that the minimizer satisfies

$$ \alpha_1 < \frac{\alpha_0}{2(1 - c_1)}. $$

Since $c_1$ is chosen to be quite small in practice, this inequality indicates that $\alpha_1$ cannot be much greater than $\frac{1}{2}\alpha_0$ (and may be smaller), which gives us an idea of the new step length.
3.14 If $\phi(\alpha_0)$ is large, (3.58) shows that $\alpha_1$ can be quite small. Give an example of a function and a step length $\alpha_0$ for which this situation arises. Drastic changes to the estimate of the step length are not desirable, since they indicate that the current interpolant does not provide a good approximation to the function and that it should be modified before being trusted to produce a good step length estimate. In practice, one imposes a lower bound (typically, $\rho = 0.1$) and defines the new step length as $\alpha_i = \max(\rho \alpha_{i-1}, \hat{\alpha}_i)$, where $\hat{\alpha}_i$ is the minimizer of the interpolant.
3.15 Suppose that the sufficient decrease condition (3.6a) is not satisfied at the step lengths $\alpha_0$ and $\alpha_1$, and consider the cubic interpolating $\phi(0)$, $\phi'(0)$, $\phi(\alpha_0)$, and $\phi(\alpha_1)$. By drawing graphs illustrating the two situations that can arise, show that the minimizer of the cubic lies in $[0, \alpha_1]$. Then show that if $\phi(0) < \phi(\alpha_1)$, the minimizer is less than $\frac{2}{3} \alpha_1$.
CHAPTER 4
Trust-Region Methods
Line search methods and trust-region methods both generate steps with the help of a quadratic model of the objective function, but they use this model in different ways. Line search methods use it to generate a search direction, and then focus their efforts on finding a suitable step length along this direction. Trust-region methods define a region around the current iterate within which they trust the model to be an adequate representation of the objective function, and then choose the step to be the approximate minimizer of the model in this region. In effect, they choose the direction and length of the step simultaneously. If a step is not acceptable, they reduce the size of the region and find a new
minimizer. In general, the direction of the step changes whenever the size of the trust region is altered.
The size of the trust region is critical to the effectiveness of each step. If the region is too small, the algorithm misses an opportunity to take a substantial step that will move it much closer to the minimizer of the objective function. If too large, the minimizer of the model may be far from the minimizer of the objective function in the region, so we may have to reduce the size of the region and try again. In practical algorithms, we choose the size of the region according to the performance of the algorithm during previous iterations. If the model is consistently reliable, producing good steps and accurately predicting the behavior of the objective function along these steps, the size of the trust region may be increased to allow longer, more ambitious, steps to be taken. A failed step is an indication that our model is an inadequate representation of the objective function over the current trust region. After such a step, we reduce the size of the region and try again.
Figure 4.1 illustrates the trust-region approach on a function $f$ of two variables in which the current point $x_k$ and the minimizer $x^*$ lie at opposite ends of a curved valley. The quadratic model function $m_k$, whose elliptical contours are shown as dashed lines, is constructed from function and derivative information at $x_k$ and possibly also on information accumulated from previous iterations and steps. A line search method based on this model searches along the step to the minimizer of $m_k$ (shown), but this direction will yield at most a small reduction in $f$, even if the optimal step length is used. The trust-region method steps to the minimizer of $m_k$ within the dotted circle (shown), yielding a more significant reduction in $f$ and better progress toward the solution.
In this chapter, we will assume that the model function mk that is used at each iterate xk is quadratic. Moreover, mk is based on the Taylorseries expansion of f around
Figure 4.1 (trust-region and line search steps): contours of the model $m_k$ and of $f$, showing the line search direction and the trust-region step within the trust region.
$x_k$, which is

$$ f(x_k + p) = f_k + g_k^T p + \tfrac{1}{2} p^T \nabla^2 f(x_k + tp) \, p, \qquad (4.1) $$

where $f_k = f(x_k)$ and $g_k = \nabla f(x_k)$, and $t$ is some scalar in the interval $(0, 1)$. By using an approximation $B_k$ to the Hessian in the second-order term, $m_k$ is defined as follows:

$$ m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p, \qquad (4.2) $$

where $B_k$ is some symmetric matrix. The difference between $m_k(p)$ and $f(x_k + p)$ is $O(\|p\|^2)$, which is small when $p$ is small.
When $B_k$ is equal to the true Hessian $\nabla^2 f(x_k)$, the approximation error in the model function $m_k$ is $O(\|p\|^3)$, so this model is especially accurate when $\|p\|$ is small. This choice $B_k = \nabla^2 f(x_k)$ leads to the trust-region Newton method, and will be discussed further in Section 4.4. In other sections of this chapter, we emphasize the generality of the trust-region approach by assuming little about $B_k$ except symmetry and uniform boundedness.
To obtain each step, we seek a solution of the subproblem

$$ \min_{p \in \mathbb{R}^n} m_k(p) = f_k + g_k^T p + \tfrac{1}{2} p^T B_k p \qquad \text{s.t. } \|p\| \le \Delta_k, \qquad (4.3) $$

where $\Delta_k > 0$ is the trust-region radius. In most of our discussions, we define $\|\cdot\|$ to be the Euclidean norm, so that the solution $p_k^*$ of (4.3) is the minimizer of $m_k$ in the ball of radius $\Delta_k$. Thus, the trust-region approach requires us to solve a sequence of subproblems (4.3) in which the objective function and constraint (which can be written as $p^T p \le \Delta_k^2$) are both quadratic. When $B_k$ is positive definite and $\|B_k^{-1} g_k\| \le \Delta_k$, the solution of (4.3) is easy to identify: it is simply the unconstrained minimum $p_k^B = -B_k^{-1} g_k$ of the quadratic $m_k(p)$. In this case, we call $p_k^B$ the full step. The solution of (4.3) is not so obvious in other cases, but it can usually be found without too much computational expense. In any case, as described below, we need only an approximate solution to obtain convergence and good practical behavior.
OUTLINE OF THE TRUSTREGION APPROACH
One of the key ingredients in a trust-region algorithm is the strategy for choosing the trust-region radius $\Delta_k$ at each iteration. We base this choice on the agreement between the model function $m_k$ and the objective function $f$ at previous iterations. Given a step $p_k$ we define the ratio

$$ \rho_k = \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)}; \qquad (4.4) $$

the numerator is called the actual reduction, and the denominator is the predicted reduction (that is, the reduction in $f$ predicted by the model function). Note that since the step $p_k$ is obtained by minimizing the model $m_k$ over a region that includes $p = 0$, the predicted reduction will always be nonnegative. Hence, if $\rho_k$ is negative, the new objective value $f(x_k + p_k)$ is greater than the current value $f(x_k)$, so the step must be rejected. On the other hand, if $\rho_k$ is close to 1, there is good agreement between the model $m_k$ and the function $f$ over this step, so it is safe to expand the trust region for the next iteration. If $\rho_k$ is positive but significantly smaller than 1, we do not alter the trust region, but if it is close to zero or negative, we shrink the trust region by reducing $\Delta_k$ at the next iteration. The following algorithm describes the process.
Algorithm 4.1 (Trust Region).
Given $\bar{\Delta} > 0$, $\Delta_0 \in (0, \bar{\Delta})$, and $\eta \in [0, \frac{1}{4})$:
for k = 0, 1, 2, ...
  Obtain $p_k$ by (approximately) solving (4.3);
  Evaluate $\rho_k$ from (4.4);
  if $\rho_k < \frac{1}{4}$
    $\Delta_{k+1} \leftarrow \frac{1}{4} \Delta_k$
  else
    if $\rho_k > \frac{3}{4}$ and $\|p_k\| = \Delta_k$
      $\Delta_{k+1} \leftarrow \min(2\Delta_k, \bar{\Delta})$
    else
      $\Delta_{k+1} \leftarrow \Delta_k$;
  if $\rho_k > \eta$
    $x_{k+1} \leftarrow x_k + p_k$
  else
    $x_{k+1} \leftarrow x_k$;
end (for).
Here $\bar{\Delta}$ is an overall bound on the step lengths. Note that the radius is increased only if $\|p_k\|$ actually reaches the boundary of the trust region. If the step stays strictly inside the region, we infer that the current value of $\Delta_k$ is not interfering with the progress of the algorithm, so we leave its value unchanged for the next iteration.
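A direct transcription of Algorithm 4.1 into Python is given below; the subproblem solver is left as a parameter, so any of the Cauchy-point, dogleg, or subspace strategies sketched later in this chapter can be plugged in. The parameter defaults are illustrative.

    import numpy as np

    def trust_region(f, grad, hess, solve_subproblem, x0, delta_max=10.0,
                     delta0=1.0, eta=1e-3, tol=1e-8, max_iter=500):
        # Algorithm 4.1. solve_subproblem(g, B, delta) must return an
        # approximate minimizer p of m(p) = f + g^T p + 0.5 p^T B p
        # subject to ||p|| <= delta.
        x, delta = np.asarray(x0, dtype=float), delta0
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            B = hess(x)
            p = solve_subproblem(g, B, delta)
            pred = -(g @ p + 0.5 * p @ B @ p)        # predicted reduction, >= 0
            rho = (f(x) - f(x + p)) / pred           # agreement ratio (4.4)
            if rho < 0.25:
                delta *= 0.25                        # model poor: shrink region
            elif rho > 0.75 and np.linalg.norm(p) >= delta - 1e-12:
                delta = min(2.0 * delta, delta_max)  # good step on the boundary
            if rho > eta:
                x = x + p                            # accept the step
        return x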
To turn Algorithm 4.1 into a practical algorithm, we need to focus on solving the trust-region subproblem (4.3). In discussing this matter, we sometimes drop the iteration subscript $k$ and restate the problem (4.3) as follows:

$$ \min_{p \in \mathbb{R}^n} m(p) \stackrel{\mathrm{def}}{=} f + g^T p + \tfrac{1}{2} p^T B p \qquad \text{s.t. } \|p\| \le \Delta. \qquad (4.5) $$

A first step to characterizing exact solutions of (4.5) is given by the following theorem (due to Moré and Sorensen [214]), which shows that the solution $p^*$ of (4.5) satisfies

$$ (B + \lambda I) p^* = -g \qquad (4.6) $$

for some $\lambda \ge 0$.
Theorem 4.1.
The vector $p^*$ is a global solution of the trust-region problem

$$ \min_{p \in \mathbb{R}^n} m(p) = f + g^T p + \tfrac{1}{2} p^T B p, \qquad \text{s.t. } \|p\| \le \Delta, \qquad (4.7) $$

if and only if $p^*$ is feasible and there is a scalar $\lambda \ge 0$ such that the following conditions are satisfied:

$$ (B + \lambda I) p^* = -g, \qquad (4.8a) $$
$$ \lambda (\Delta - \|p^*\|) = 0, \qquad (4.8b) $$
$$ (B + \lambda I) \text{ is positive semidefinite.} \qquad (4.8c) $$
We delay the proof of this result until Section 4.3, and instead discuss just its key features here with the help of Figure 4.2. The condition (4.8b) is a complementarity condition that states that at least one of the nonnegative quantities $\lambda$ and $(\Delta - \|p^*\|)$ must be zero. Hence, when the solution lies strictly inside the trust region (as it does when $\Delta = \Delta_1$ in Figure 4.2), we must have $\lambda = 0$ and so $B p^* = -g$ with $B$ positive semidefinite, from (4.8a) and (4.8c), respectively. In the other cases $\Delta = \Delta_2$ and $\Delta = \Delta_3$, we have $\|p^*\| = \Delta$, and so $\lambda$ is allowed to take a positive value. Note from (4.8a) that

$$ \lambda p^* = -B p^* - g = -\nabla m(p^*). $$

Figure 4.2 (solution of the trust-region subproblem for different radii $\Delta_1$, $\Delta_2$, $\Delta_3$): contours of $m$ with the corresponding solutions $p_1^*$, $p_2^*$, $p_3^*$.
Thus, when $\lambda > 0$, the solution $p^*$ is collinear with the negative gradient of $m$ and normal to its contours. These properties can be seen in Figure 4.2.
In Section 4.1, we describe two strategies for finding approximate solutions of the subproblem (4.3), which achieve at least as much reduction in $m_k$ as the reduction achieved by the so-called Cauchy point. This point is simply the minimizer of $m_k$ along the steepest descent direction $-g_k$, subject to the trust-region bound. The first approximate strategy is the dogleg method, which is appropriate when the model Hessian $B_k$ is positive definite. The second strategy, known as two-dimensional subspace minimization, can be applied when $B_k$ is indefinite, though it requires an estimate of the most negative eigenvalue of this matrix. A third strategy, described in Section 7.1, uses an approach based on the conjugate gradient method to minimize $m_k$, and can therefore be applied when $B_k$ is large and sparse.
Section 4.3 is devoted to a strategy in which an iterative method is used to identify the value of $\lambda$ for which (4.6) is satisfied by the solution of the subproblem. We prove global convergence results in Section 4.2. Section 4.4 discusses the trust-region Newton method, in which the Hessian $B_k$ of the model function is equal to the Hessian $\nabla^2 f(x_k)$ of the objective function. The key result of this section is that, when the trust-region Newton algorithm converges to a point $x^*$ satisfying second-order sufficient conditions, it converges superlinearly.
4.1 ALGORITHMS BASED ON THE CAUCHY POINT
THE CAUCHY POINT
As we saw in Chapter 3, line search methods can be globally convergent even when the optimal step length is not used at each iteration. In fact, the step length $\alpha_k$ need only satisfy fairly loose criteria. A similar situation applies in trust-region methods. Although in principle we seek the optimal solution of the subproblem (4.3), it is enough for purposes of global convergence to find an approximate solution $p_k$ that lies within the trust region and gives a sufficient reduction in the model. The sufficient reduction can be quantified in terms of the Cauchy point, which we denote by $p_k^C$ and define in terms of the following simple procedure.
Algorithm 4.2 (Cauchy Point Calculation).
Find the vector $p_k^S$ that solves a linear version of (4.3), that is,

$$ p_k^S = \arg\min_{p \in \mathbb{R}^n} f_k + g_k^T p \qquad \text{s.t. } \|p\| \le \Delta_k; \qquad (4.9) $$

Calculate the scalar $\tau_k > 0$ that minimizes $m_k(\tau p_k^S)$ subject to satisfying the trust-region bound, that is,

$$ \tau_k = \arg\min_{\tau \ge 0} m_k(\tau p_k^S) \qquad \text{s.t. } \|\tau p_k^S\| \le \Delta_k; \qquad (4.10) $$

Set $p_k^C = \tau_k p_k^S$.
It is easy to write down a closed-form definition of the Cauchy point. For a start, the solution of (4.9) is simply

$$ p_k^S = -\frac{\Delta_k}{\|g_k\|} g_k. $$

To obtain $\tau_k$ explicitly, we consider the cases of $g_k^T B_k g_k \le 0$ and $g_k^T B_k g_k > 0$ separately. For the former case, the function $m_k(\tau p_k^S)$ decreases monotonically with $\tau$ whenever $g_k \ne 0$, so $\tau_k$ is simply the largest value that satisfies the trust-region bound, namely, $\tau_k = 1$. For the case $g_k^T B_k g_k > 0$, $m_k(\tau p_k^S)$ is a convex quadratic in $\tau$, so $\tau_k$ is either the unconstrained minimizer of this quadratic, $\|g_k\|^3 / (\Delta_k \, g_k^T B_k g_k)$, or the boundary value 1, whichever comes first. In summary, we have

$$ p_k^C = -\tau_k \frac{\Delta_k}{\|g_k\|} g_k, \qquad (4.11) $$

where

$$ \tau_k = \begin{cases} 1 & \text{if } g_k^T B_k g_k \le 0; \\ \min\left( \|g_k\|^3 / (\Delta_k \, g_k^T B_k g_k), \, 1 \right) & \text{otherwise.} \end{cases} \qquad (4.12) $$
Figure 4.3 illustrates the Cauchy point for a subproblem in which $B_k$ is positive definite; in this example, $p_k^C$ lies strictly inside the trust region. (The figure shows $p_k^C$ along $-g_k$ against the contours of $m_k$.)

The Cauchy step $p_k^C$ is inexpensive to calculate (no matrix factorizations are required) and is of crucial importance in deciding if an approximate solution of the trust-region subproblem is acceptable. Specifically, a trust-region method will be globally convergent if its steps $p_k$ give a reduction in the model $m_k$ that is at least some fixed positive multiple of the decrease attained by the Cauchy step.
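In code, the closed form (4.11)-(4.12) is just a few lines; this small sketch can serve as the solve_subproblem argument of the trust-region driver sketched after Algorithm 4.1.

    import numpy as np

    def cauchy_point(g, B, delta):
        # Cauchy point (4.11)-(4.12): minimizer of the model along -g,
        # restricted to the trust region of radius delta.
        gnorm = np.linalg.norm(g)
        gBg = g @ B @ g
        tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
        return -tau * (delta / gnorm) * g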
IMPROVING ON THE CAUCHY POINT
Since the Cauchy point $p_k^C$ provides sufficient reduction in the model function $m_k$ to yield global convergence, and since the cost of calculating it is so small, why should we look any further for a better approximate solution of (4.3)? The reason is that by always taking the Cauchy point as our step, we are simply implementing the steepest descent method with a particular choice of step length. As we have seen in Chapter 3, steepest descent performs poorly even if an optimal step length is used at each iteration.

The Cauchy point does not depend very strongly on the matrix $B_k$, which is used only in the calculation of the step length. Rapid convergence can be expected only if $B_k$ plays a role in determining the direction of the step as well as its length, and if $B_k$ contains valid curvature information about the function.
A number of trust-region algorithms compute the Cauchy point and then try to improve on it. The improvement strategy is often designed so that the full step $p_k^B = -B_k^{-1} g_k$ is chosen whenever $B_k$ is positive definite and $\|p_k^B\| \le \Delta_k$. When $B_k$ is the exact Hessian $\nabla^2 f(x_k)$ or a quasi-Newton approximation, this strategy can be expected to yield superlinear convergence.
We now consider three methods for finding approximate solutions to (4.3) that have the features just described. Throughout this section we will be focusing on the internal workings of a single iteration, so we simplify the notation by dropping the subscript $k$ from the quantities $\Delta_k$, $p_k$, $m_k$, and $g_k$ and refer to the formulation (4.5) of the subproblem. In this section, we denote the solution of (4.5) by $p^*(\Delta)$, to emphasize the dependence on $\Delta$.
THE DOGLEG METHOD
The first approach we discuss goes by the descriptive title of the dogleg method. It can be used when B is positive definite.
To motivate this method, we start by examining the effect of the trust-region radius $\Delta$ on the solution $p^*(\Delta)$ of the subproblem (4.5). When $B$ is positive definite, we have already noted that the unconstrained minimizer of $m$ is $p^B = -B^{-1} g$. When this point is feasible for (4.5), it is obviously a solution, so we have

$$ p^*(\Delta) = p^B, \qquad \text{when } \Delta \ge \|p^B\|. \qquad (4.13) $$

When $\Delta$ is small relative to $\|p^B\|$, the restriction $\|p\| \le \Delta$ ensures that the quadratic term in $m$ has little effect on the solution of (4.5). For such $\Delta$, we can get an approximation to $p^*(\Delta)$
by simply omitting the quadratic term from (4.5) and writing

$$ p^*(\Delta) \approx -\Delta \frac{g}{\|g\|}, \qquad \text{when } \Delta \text{ is small.} \qquad (4.14) $$

For intermediate values of $\Delta$, the solution $p^*(\Delta)$ typically follows a curved trajectory like the one in Figure 4.4.

Figure 4.4 (exact trajectory and dogleg approximation): the optimal trajectory $p^*(\Delta)$, the unconstrained minimizer $p^U$ along $-g$, the full step $p^B$, and the dogleg path, shown within the trust region.
The dogleg method finds an approximate solution by replacing the curved trajectory for $p^*(\Delta)$ with a path consisting of two line segments. The first line segment runs from the origin to the minimizer of $m$ along the steepest descent direction, which is

$$ p^U = -\frac{g^T g}{g^T B g} \, g, \qquad (4.15) $$

while the second line segment runs from $p^U$ to $p^B$ (see Figure 4.4). Formally, we denote this trajectory by $\tilde{p}(\tau)$ for $\tau \in [0, 2]$, where

$$ \tilde{p}(\tau) = \begin{cases} \tau p^U, & 0 \le \tau \le 1, \\ p^U + (\tau - 1)(p^B - p^U), & 1 \le \tau \le 2. \end{cases} \qquad (4.16) $$
The dogleg method chooses p to minimize the model m along this path, subject to the trustregion bound. The following lemma shows that the minimum along the dogleg path can be found easily.
Lemma 4.2.
Let $B$ be positive definite. Then

(i) $\|\tilde{p}(\tau)\|$ is an increasing function of $\tau$, and
(ii) $m(\tilde{p}(\tau))$ is a decreasing function of $\tau$.

PROOF. It is easy to show that (i) and (ii) both hold for $\tau \in [0, 1]$, so we restrict our attention to the case of $\tau \in [1, 2]$. For (i), define $h(\alpha)$ by

$$ h(\alpha) = \tfrac{1}{2} \|\tilde{p}(1 + \alpha)\|^2
 = \tfrac{1}{2} \|p^U + \alpha (p^B - p^U)\|^2
 = \tfrac{1}{2} \|p^U\|^2 + \alpha \, p^{U\,T} (p^B - p^U) + \tfrac{1}{2} \alpha^2 \|p^B - p^U\|^2. $$

Our result is proved if we can show that $h'(\alpha) \ge 0$ for $\alpha \in (0, 1)$. Now,

$$ h'(\alpha) = -p^{U\,T} (p^U - p^B) + \alpha \|p^U - p^B\|^2
 \ge -p^{U\,T} (p^U - p^B) $$
$$ = \frac{g^T g}{g^T B g} \, g^T \left( -\frac{g^T g}{g^T B g} g + B^{-1} g \right)
 = \frac{(g^T g)(g^T B^{-1} g)}{g^T B g} \left[ 1 - \frac{(g^T g)^2}{(g^T B g)(g^T B^{-1} g)} \right] \ge 0, $$

where the final inequality is a consequence of the Cauchy-Schwarz inequality. We leave the details as an exercise.

For (ii), we define $h(\alpha) = m(\tilde{p}(1 + \alpha))$ and show that $h'(\alpha) \le 0$ for $\alpha \in (0, 1)$. Substitution of (4.16) into (4.5) and differentiation with respect to the argument leads to

$$ h'(\alpha) = (p^B - p^U)^T (g + B p^U) + \alpha (p^B - p^U)^T B (p^B - p^U) $$
$$ \le (p^B - p^U)^T \left( g + B p^U + B(p^B - p^U) \right)
 = (p^B - p^U)^T (g + B p^B) = 0, $$

giving the result. $\square$
It follows from this lemma that the path $\tilde{p}(\tau)$ intersects the trust-region boundary $\|p\| = \Delta$ at exactly one point if $\|p^B\| \ge \Delta$, and nowhere otherwise. Since $m$ is decreasing along the path, the chosen value of $p$ will be at $p^B$ if $\|p^B\| \le \Delta$, otherwise at the point of intersection of the dogleg and the trust-region boundary. In the latter case, we compute the appropriate value of $\tau$ by solving the following scalar quadratic equation:

$$ \|p^U + (\tau - 1)(p^B - p^U)\|^2 = \Delta^2. $$
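The complete dogleg computation is short; the sketch below assumes $B$ is positive definite, as the method requires, and solves the scalar quadratic for the boundary intersection in closed form.

    import numpy as np

    def dogleg(g, B, delta):
        # Dogleg step: follow the path (4.16) to the trust-region boundary.
        pB = -np.linalg.solve(B, g)                  # full step
        if np.linalg.norm(pB) <= delta:
            return pB                                # interior solution (4.13)
        pU = -(g @ g) / (g @ B @ g) * g              # steepest-descent minimizer (4.15)
        if np.linalg.norm(pU) >= delta:
            return -(delta / np.linalg.norm(g)) * g  # boundary step along -g, cf. (4.14)
        # Solve ||pU + s (pB - pU)||^2 = delta^2 for s = tau - 1 in [0, 1].
        d = pB - pU
        a, b, c = d @ d, 2.0 * (pU @ d), pU @ pU - delta**2
        s = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # positive root
        return pU + s * d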
Consider now the case in which the exact Hessian $\nabla^2 f(x_k)$ is available for use in the model problem (4.5). When $\nabla^2 f(x_k)$ is positive definite, we can simply set $B = \nabla^2 f(x_k)$ (that is, $p^B = -\nabla^2 f(x_k)^{-1} g_k$) and apply the procedure above to find the Newton-dogleg step. Otherwise, we can define $p^B$ by choosing $B$ to be one of the positive definite modified Hessians described in Section 3.4, then proceed as above to find the dogleg step. Near a solution satisfying second-order sufficient conditions (see Theorem 2.4), $p^B$ will be set to the usual Newton step, allowing the possibility of rapid local convergence of Newton's method (see Section 4.4).
The use of a modified Hessian in the Newton-dogleg method is not completely satisfying from an intuitive viewpoint, however. A modified factorization perturbs the diagonals of $\nabla^2 f(x_k)$ in a somewhat arbitrary manner, and the benefits of the trust-region approach may not be realized. In fact, the modification introduced during the factorization of the Hessian is redundant in some sense because the trust-region strategy introduces its own modification. As we show in Section 4.3, the exact solution of the trust-region problem (4.3) with $B_k = \nabla^2 f(x_k)$ is $-(\nabla^2 f(x_k) + \lambda I)^{-1} g_k$, where $\lambda$ is chosen large enough to make $(\nabla^2 f(x_k) + \lambda I)$ positive definite, and its value depends on the trust-region radius $\Delta_k$. We conclude that the Newton-dogleg method is most appropriate when the objective function is convex (that is, $\nabla^2 f(x_k)$ is always positive semidefinite). The techniques described below may be more suitable for the general case.
The dogleg strategy can be adapted to handle indefinite matrices B, but there is not much point in doing so because the full step pB is not the unconstrained minimizer of m in this case. Instead, we now describe another strategy, which aims to include directions of negative curvature that is, directions d for which dT Bd 0 in the space of candidate trustregion steps.
TWO-DIMENSIONAL SUBSPACE MINIMIZATION

When B is positive definite, the dogleg strategy can be made slightly more sophisticated by widening the search for p to the entire two-dimensional subspace spanned by p^U and p^B (equivalently, g and B^{-1} g). The subproblem (4.5) is replaced by

\min_p m(p) = f + g^T p + \frac{1}{2} p^T B p \quad \text{s.t. } \|p\| \le \Delta,\ p \in \mathrm{span}[g, B^{-1} g].   (4.17)

This is a problem in two variables that is computationally inexpensive to solve. (After some algebraic manipulation, it can be reduced to finding the roots of a fourth-degree polynomial.) Clearly, the Cauchy point p^C is feasible for (4.17), so the optimal solution of this subproblem yields at least as much reduction in m as the Cauchy point, resulting in global convergence of the algorithm. The two-dimensional subspace minimization strategy is obviously an extension of the dogleg method as well, since the entire dogleg path lies in \mathrm{span}[g, B^{-1} g].
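As an illustration, here is a rough Python sketch of the two-dimensional subspace minimization (4.17) for positive definite B. The coarse scan of the boundary circle is a stand-in for the exact quartic root-finding mentioned above; all names are hypothetical.

```python
import numpy as np

def two_dim_subspace_step(g, B, delta, n_angles=2000):
    """Sketch of (4.17): minimize the model over span[g, B^{-1}g], ||p|| <= delta."""
    # Orthonormal basis Q (n x 2) for the subspace.
    Q, _ = np.linalg.qr(np.column_stack([g, np.linalg.solve(B, g)]))
    gr, Br = Q.T @ g, Q.T @ B @ Q          # reduced 2-D gradient and Hessian
    y = np.linalg.solve(Br, -gr)           # interior (unconstrained) candidate
    if np.linalg.norm(y) > delta:
        # Boundary case: scan ||y|| = delta (stand-in for the quartic solve).
        theta = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
        Y = delta * np.vstack([np.cos(theta), np.sin(theta)])
        vals = gr @ Y + 0.5 * np.sum(Y * (Br @ Y), axis=0)
        y = Y[:, np.argmin(vals)]
    return Q @ y
```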
This strategy can be modified to handle the case of indefinite B in a way that is intuitive, practical, and theoretically sound. We mention just the salient points of the handling of the indefiniteness here, and refer the reader to papers by Byrd, Schnabel, and Schultz (see [54] and [279]) for details. When B has negative eigenvalues, the two-dimensional subspace in (4.17) is changed to

\mathrm{span}[g, (B + \alpha I)^{-1} g], \quad \text{for some } \alpha \in (-\lambda_1, -2\lambda_1],   (4.18)

where \lambda_1 denotes the most negative eigenvalue of B. This choice of \alpha ensures that B + \alpha I is positive definite, and the flexibility in the choice of \alpha allows us to use a numerical procedure such as the Lanczos method to compute it. When \|(B + \alpha I)^{-1} g\| \le \Delta, we discard the subspace search of (4.17), (4.18) and instead define the step to be

p = -(B + \alpha I)^{-1} g + v,   (4.19)

where v is a vector that satisfies v^T (B + \alpha I)^{-1} g \le 0. (This condition ensures that \|p\| \ge \|(B + \alpha I)^{-1} g\|.) When B has zero eigenvalues but no negative eigenvalues, we define the step to be the Cauchy point p = p^C.
When the exact Hessian is available, we can set B = \nabla^2 f(x_k), and note that -B^{-1} g is the Newton step. Hence, when the Hessian is positive definite at the solution x^* and when x_k is close to x^* and \Delta is sufficiently large, the subspace minimization problem (4.17) will be solved by the Newton step.

The reduction in model function m achieved by the two-dimensional subspace minimization strategy often is close to the reduction achieved by the exact solution of (4.5). Most of the computational effort lies in a single factorization of B or B + \alpha I (estimation of \alpha and solution of (4.17) are less significant), while strategies that find nearly exact solutions of (4.5) typically require two or three such factorizations (see Section 4.3).
4.2 GLOBAL CONVERGENCE
REDUCTION OBTAINED BY THE CAUCHY POINT
In the preceding discussion of algorithms for approximately solving the trust-region subproblem, we have repeatedly emphasized that global convergence depends on the approximate solution obtaining at least as much decrease in the model function m as the Cauchy point. (In fact, a fixed positive fraction of the Cauchy decrease suffices.) We start the global convergence analysis by obtaining an estimate of the decrease in m achieved by the Cauchy point. We then use this estimate to prove that the sequence of gradients \{g_k\} generated by Algorithm 4.1 has an accumulation point at zero, and in fact converges to zero when \eta is strictly positive.
Our first main result is that the dogleg and two-dimensional subspace minimization algorithms and Steihaug's algorithm (Algorithm 7.2) produce approximate solutions p_k of the subproblem (4.3) that satisfy the following estimate of decrease in the model function:

m_k(0) - m_k(p_k) \ge c_1 \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|B_k\|} \right),   (4.20)
for some constant c_1 \in (0, 1]. The usefulness of this estimate will become clear in the following two sections. For now, we note that when \Delta_k is the minimum value in (4.20), the condition is slightly reminiscent of the first Wolfe condition: The desired reduction in the model is proportional to the gradient and the size of the step.

We show now that the Cauchy point p_k^C satisfies (4.20), with c_1 = \frac{1}{2}.
Lemma 4.3.
The Cauchy point p_k^C satisfies (4.20) with c_1 = \frac{1}{2}, that is,

m_k(0) - m_k(p_k^C) \ge \frac{1}{2} \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|B_k\|} \right).   (4.21)

PROOF. For simplicity, we drop the iteration index k in the proof. We consider first the case g^T B g \le 0. Here, we have

m(p^C) - m(0) = m\left( -\frac{\Delta}{\|g\|} g \right) - f
 = -\Delta \|g\| + \frac{1}{2} \frac{\Delta^2}{\|g\|^2}\, g^T B g
 \le -\Delta \|g\|
 \le -\|g\| \min\left( \Delta, \frac{\|g\|}{\|B\|} \right),

and so (4.21) certainly holds.

For the next case, consider g^T B g > 0 and

\frac{\|g\|^3}{\Delta\, g^T B g} \le 1.   (4.22)

From (4.12), we have \tau = \|g\|^3 / (\Delta\, g^T B g), and so from (4.11) it follows that

m(p^C) - m(0) = -\frac{\|g\|^4}{g^T B g} + \frac{1}{2} \frac{\|g\|^4}{(g^T B g)^2}\, g^T B g
 = -\frac{1}{2} \frac{\|g\|^4}{g^T B g}
 \le -\frac{1}{2} \frac{\|g\|^4}{\|B\| \|g\|^2}
 = -\frac{1}{2} \frac{\|g\|^2}{\|B\|}
 \le -\frac{1}{2} \|g\| \min\left( \Delta, \frac{\|g\|}{\|B\|} \right),

so (4.21) holds here too.

In the remaining case, (4.22) does not hold, and therefore

g^T B g < \frac{\|g\|^3}{\Delta}.   (4.23)

From (4.12), we have \tau = 1, and using this fact together with (4.23), we obtain

m(p^C) - m(0) = -\frac{\Delta}{\|g\|} \|g\|^2 + \frac{1}{2} \frac{\Delta^2}{\|g\|^2}\, g^T B g
 \le -\Delta \|g\| + \frac{1}{2} \frac{\Delta^2}{\|g\|^2} \frac{\|g\|^3}{\Delta}
 = -\frac{1}{2} \Delta \|g\|
 \le -\frac{1}{2} \|g\| \min\left( \Delta, \frac{\|g\|}{\|B\|} \right),

yielding the desired result (4.21) once again.
To satisfy (4.20), our approximate solution p_k has only to achieve a reduction that is at least some fixed fraction c_2 of the reduction achieved by the Cauchy point. We state the observation formally as a theorem.

Theorem 4.4.
Let p_k be any vector such that \|p_k\| \le \Delta_k and m_k(0) - m_k(p_k) \ge c_2 \left( m_k(0) - m_k(p_k^C) \right). Then p_k satisfies (4.20) with c_1 = c_2/2. In particular, if p_k is the exact solution p_k^* of (4.3), then it satisfies (4.20) with c_1 = \frac{1}{2}.

PROOF. Since \|p_k\| \le \Delta_k, we have from Lemma 4.3 that

m_k(0) - m_k(p_k) \ge c_2 \left( m_k(0) - m_k(p_k^C) \right) \ge \frac{1}{2} c_2 \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|B_k\|} \right),

giving the result.

Note that the dogleg and two-dimensional subspace minimization algorithms both satisfy (4.20) with c_1 = \frac{1}{2}, because they both produce approximate solutions p_k for which m_k(p_k) \le m_k(p_k^C).

CONVERGENCE TO STATIONARY POINTS

Global convergence results for trust-region methods come in two varieties, depending on whether we set the parameter \eta in Algorithm 4.1 to zero or to some small positive value. When \eta = 0 (that is, the step is taken whenever it produces a lower value of f), we can show that the sequence of gradients \{g_k\} has a limit point at zero. For the more stringent acceptance test with \eta > 0, which requires the actual decrease in f to be at least some small fraction of the predicted decrease, we have the stronger result that g_k \to 0.

In this section we prove the global convergence results for both cases. We assume throughout that the approximate Hessians B_k are uniformly bounded in norm, and that f is bounded below on the level set

S \stackrel{def}{=} \{ x \mid f(x) \le f(x_0) \}.   (4.24)

For later reference, we define an open neighborhood of this set by

S(R_0) \stackrel{def}{=} \{ x \mid \|x - y\| < R_0 \text{ for some } y \in S \},

where R_0 is a positive constant.

To allow our results to be applied more generally, we also allow the length of the approximate solution p_k of (4.3) to exceed the trust-region bound, provided that it stays within some fixed multiple of the bound; that is,

\|p_k\| \le \gamma \Delta_k, \quad \text{for some constant } \gamma \ge 1.   (4.25)

The first result deals with the case \eta = 0.
Theorem 4.5.
Let \eta = 0 in Algorithm 4.1. Suppose that \|B_k\| \le \beta for some constant \beta, that f is bounded below on the level set S defined by (4.24) and Lipschitz continuously differentiable in the neighborhood S(R_0) for some R_0 > 0, and that all approximate solutions of (4.3) satisfy the inequalities (4.20) and (4.25), for some positive constants c_1 and \gamma. We then have

\liminf_{k \to \infty} \|g_k\| = 0.   (4.26)

PROOF. By performing some technical manipulation with the ratio \rho_k from (4.4), we obtain

|\rho_k - 1| = \left| \frac{f(x_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} - 1 \right| = \left| \frac{m_k(p_k) - f(x_k + p_k)}{m_k(0) - m_k(p_k)} \right|.

Since from Taylor's theorem (Theorem 2.1) we have that

f(x_k + p_k) = f(x_k) + g(x_k)^T p_k + \int_0^1 \left[ g(x_k + t p_k) - g(x_k) \right]^T p_k \, dt,

it follows from the definition (4.2) of m_k that

|m_k(p_k) - f(x_k + p_k)| = \left| \frac{1}{2} p_k^T B_k p_k - \int_0^1 \left[ g(x_k + t p_k) - g(x_k) \right]^T p_k \, dt \right| \le \frac{\beta}{2} \|p_k\|^2 + \beta_1 \|p_k\|^2,   (4.27)

where we have used \beta_1 to denote the Lipschitz constant for g on the set S(R_0), and assumed that \|p_k\| \le R_0 to ensure that x_k and x_k + t p_k both lie in the set S(R_0).

Suppose for contradiction that there is \epsilon > 0 and a positive index K such that

\|g_k\| \ge \epsilon, \quad \text{for all } k \ge K.   (4.28)

From (4.20), we have for k \ge K that

m_k(0) - m_k(p_k) \ge c_1 \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|B_k\|} \right) \ge c_1 \epsilon \min\left( \Delta_k, \frac{\epsilon}{\beta} \right).   (4.29)

Using (4.29), (4.27), and the bound (4.25), we have

|\rho_k - 1| \le \frac{\gamma^2 \Delta_k^2 (\beta/2 + \beta_1)}{c_1 \epsilon \min(\Delta_k, \epsilon/\beta)}.   (4.30)

We now derive a bound on the right-hand side that holds for all sufficiently small values of \Delta_k, that is, for all \Delta_k \le \bar{\Delta}, where \bar{\Delta} is defined as follows:

\bar{\Delta} = \min\left( \frac{1}{2} \frac{c_1 \epsilon}{\gamma^2 (\beta/2 + \beta_1)}, \frac{R_0}{\gamma} \right).   (4.31)

The R_0/\gamma term in this definition ensures that the bound (4.27) is valid (because \|p_k\| \le \gamma \Delta_k \le \gamma \bar{\Delta} \le R_0). Note that since c_1 \le 1 and \gamma \ge 1, we have \bar{\Delta} \le \epsilon/\beta. The latter condition implies that for all \Delta_k \in [0, \bar{\Delta}], we have \min(\Delta_k, \epsilon/\beta) = \Delta_k, so from (4.30) and (4.31), we have

|\rho_k - 1| \le \frac{\gamma^2 \Delta_k^2 (\beta/2 + \beta_1)}{c_1 \epsilon \Delta_k} = \frac{\gamma^2 \Delta_k (\beta/2 + \beta_1)}{c_1 \epsilon} \le \frac{\gamma^2 \bar{\Delta} (\beta/2 + \beta_1)}{c_1 \epsilon} \le \frac{1}{2}.

Therefore, \rho_k > \frac{1}{4}, and so by the workings of Algorithm 4.1, we have \Delta_{k+1} \ge \Delta_k whenever \Delta_k falls below the threshold \bar{\Delta}. It follows that reduction of \Delta_k by a factor of \frac{1}{4} can occur in our algorithm only if

\Delta_k \ge \bar{\Delta},

and therefore we conclude that

\Delta_k \ge \min\left( \Delta_K, \bar{\Delta}/4 \right) \quad \text{for all } k \ge K.   (4.32)

Suppose now that there is an infinite subsequence \mathcal{K} such that \rho_k \ge \frac{1}{4} for k \in \mathcal{K}. For k \in \mathcal{K} and k \ge K, we have from (4.29) that

f(x_k) - f(x_{k+1}) = f(x_k) - f(x_k + p_k) \ge \frac{1}{4} \left[ m_k(0) - m_k(p_k) \right] \ge \frac{1}{4} c_1 \epsilon \min\left( \Delta_k, \frac{\epsilon}{\beta} \right).

Since f is bounded below, it follows from this inequality that

\lim_{k \in \mathcal{K},\, k \to \infty} \Delta_k = 0,

contradicting (4.32). Hence no such infinite subsequence \mathcal{K} can exist, and we must have \rho_k < \frac{1}{4} for all k sufficiently large. In this case, \Delta_k will eventually be multiplied by \frac{1}{4} at every iteration, and we have \lim_{k \to \infty} \Delta_k = 0, which again contradicts (4.32). Hence, our original assertion (4.28) must be false, giving (4.26).
Our second global convergence result, for the case \eta > 0, borrows much of the analysis from the proof above. Our approach here follows that of Schultz, Schnabel, and Byrd [279].

Theorem 4.6.
Let \eta \in (0, \frac{1}{4}) in Algorithm 4.1. Suppose that \|B_k\| \le \beta for some constant \beta, that f is bounded below on the level set S (4.24) and Lipschitz continuously differentiable in S(R_0) for some R_0 > 0, and that all approximate solutions p_k of (4.3) satisfy the inequalities (4.20) and (4.25) for some positive constants c_1 and \gamma. We then have

\lim_{k \to \infty} g_k = 0.   (4.33)

PROOF. We consider a particular positive index m with g_m \ne 0. Using \beta_1 again to denote the Lipschitz constant for g on the set S(R_0), we have

\|g(x) - g_m\| \le \beta_1 \|x - x_m\|, \quad \text{for all } x \in S(R_0).

We now define the scalars \epsilon and R to satisfy

\epsilon = \frac{1}{2} \|g_m\|, \qquad R = \min\left( \frac{\epsilon}{\beta_1}, R_0 \right).

Note that the ball

B(x_m, R) = \{ x \mid \|x - x_m\| \le R \}

is contained in S(R_0), so Lipschitz continuity of g holds inside B(x_m, R). We have

x \in B(x_m, R) \ \Rightarrow\ \|g(x)\| \ge \|g_m\| - \|g(x) - g_m\| \ge \frac{1}{2} \|g_m\| = \epsilon.

If the entire sequence \{x_k\}_{k \ge m} stays inside the ball B(x_m, R), we would have \|g_k\| \ge \epsilon > 0 for all k \ge m. The reasoning in the proof of Theorem 4.5 can be used to show that this scenario does not occur. Therefore, the sequence \{x_k\}_{k \ge m} eventually leaves B(x_m, R).

Let the index l \ge m be such that x_{l+1} is the first iterate after x_m outside B(x_m, R). Since \|g_k\| \ge \epsilon for k = m, m+1, \dots, l, we can use (4.29) to write

f(x_m) - f(x_{l+1}) = \sum_{k=m}^{l} \left[ f(x_k) - f(x_{k+1}) \right]
 \ge \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \eta \left[ m_k(0) - m_k(p_k) \right]
 \ge \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \eta c_1 \epsilon \min\left( \Delta_k, \frac{\epsilon}{\beta} \right),   (4.34)

where we have limited the sum to the iterations k for which x_k \ne x_{k+1}, that is, those iterations on which a step was actually taken. If \Delta_k \le \epsilon/\beta for all k = m, m+1, \dots, l, we have

f(x_m) - f(x_{l+1}) \ge \eta c_1 \epsilon \sum_{k=m,\, x_k \ne x_{k+1}}^{l} \Delta_k \ge \eta c_1 \epsilon R = \eta c_1 \epsilon \min\left( \frac{\epsilon}{\beta_1}, R_0 \right).   (4.35)

Otherwise, we have \Delta_k > \epsilon/\beta for some k = m, m+1, \dots, l, and so

f(x_m) - f(x_{l+1}) \ge \eta c_1 \epsilon \frac{\epsilon}{\beta}.   (4.36)

Since the sequence \{f(x_k)\}_{k=0}^{\infty} is decreasing and bounded below, we have that

f(x_k) \downarrow f^*

for some f^* > -\infty. Therefore, using (4.35) and (4.36), we can write

f(x_m) - f^* \ge f(x_m) - f(x_{l+1})
 \ge \eta c_1 \epsilon \min\left( \frac{\epsilon}{\beta}, \frac{\epsilon}{\beta_1}, R_0 \right)
 = \frac{1}{2} \eta c_1 \|g_m\| \min\left( \frac{\|g_m\|}{2\beta}, \frac{\|g_m\|}{2\beta_1}, R_0 \right) > 0.

Since f(x_m) - f^* \downarrow 0, we must have g_m \to 0, giving the result.
4.3 ITERATIVE SOLUTION OF THE SUBPROBLEM
In this section, we describe a technique that uses the characterization (4.6) of the subproblem solution, applying Newton's method to find the value of \lambda which matches the given trust-region radius \Delta in (4.5). We also prove the key result, Theorem 4.1, concerning the characterization of solutions of (4.3).
The methods of Section 4.1 make no serious attempt to find the exact solution of the subproblem 4.5. They do, however, make some use of the information in the model Hessian Bk, and they have advantages of reasonable implementation cost and nice global convergence properties.
When the problem is relatively small (that is, n is not too large), it may be worthwhile to exploit the model more fully by looking for a closer approximation to the solution of the subproblem. In this section, we describe an approach for finding a good approximation at the cost of a few factorizations of the matrix B (typically three factorizations), as compared with a single factorization for the dogleg and two-dimensional subspace minimization methods. This approach is based on the characterization of the exact solution given in Theorem 4.1, together with an ingenious application of Newton's method in one variable. Essentially, the algorithm tries to identify the value of \lambda for which (4.6) is satisfied by the solution of (4.5).
The characterization of Theorem 4.1 suggests an algorithm for finding the solution p of (4.7). Either \lambda = 0 satisfies (4.8a) and (4.8c) with \|p\| \le \Delta, or else we define

p(\lambda) = -(B + \lambda I)^{-1} g

for \lambda sufficiently large that B + \lambda I is positive definite, and seek a value \lambda > 0 such that

\|p(\lambda)\| = \Delta.   (4.37)

This problem is a one-dimensional root-finding problem in the variable \lambda.
To see that a value of \lambda with all the desired properties exists, we appeal to the eigendecomposition of B and use it to study the properties of \|p(\lambda)\|. Since B is symmetric, there is an orthogonal matrix Q and a diagonal matrix \Lambda such that B = Q \Lambda Q^T, where

\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n),

and \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n are the eigenvalues of B; see (A.16). Clearly, B + \lambda I = Q (\Lambda + \lambda I) Q^T, and for \lambda \ne -\lambda_j, we have

p(\lambda) = -Q (\Lambda + \lambda I)^{-1} Q^T g = -\sum_{j=1}^{n} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j,   (4.38)

where q_j denotes the jth column of Q. Therefore, by orthonormality of q_1, q_2, \dots, q_n, we have

\|p(\lambda)\|^2 = \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2}.   (4.39)
Moreover, we have when q_j^T g \ne 0 that

\lim_{\lambda \to -\lambda_j} \|p(\lambda)\| = \infty.   (4.40)

Figure 4.5 \|p(\lambda)\| as a function of \lambda.

This expression tells us a lot about \|p(\lambda)\|. If \lambda > -\lambda_1, we have \lambda_j + \lambda > 0 for all j = 1, 2, \dots, n, and so \|p(\lambda)\| is a continuous, nonincreasing function of \lambda on the interval (-\lambda_1, \infty). In fact, we have that

\lim_{\lambda \to \infty} \|p(\lambda)\| = 0.   (4.41)

Figure 4.5 plots \|p(\lambda)\| against \lambda in a case in which q_1^T g, q_2^T g, and q_3^T g are all nonzero. Note that the properties (4.40) and (4.41) hold, and that \|p(\lambda)\| is a nonincreasing function of \lambda on (-\lambda_1, \infty). In particular, when q_1^T g \ne 0 (as is generically the case), there is a unique value \lambda^* \in (-\lambda_1, \infty) such that \|p(\lambda^*)\| = \Delta. (There may be other, smaller values of \lambda for which \|p(\lambda)\| = \Delta, but these will fail to satisfy (4.8c).)
We now sketch a procedure for identifying the \lambda^* \in (-\lambda_1, \infty) for which \|p(\lambda^*)\| = \Delta, which works when q_1^T g \ne 0. (We discuss the case of q_1^T g = 0 later.) First, note that when B is positive definite and \|B^{-1} g\| \le \Delta, the value \lambda = 0 satisfies (4.8), so the procedure can be terminated immediately with \lambda^* = 0. Otherwise, we could use the root-finding Newton's method (see the Appendix) to find the value of \lambda > -\lambda_1 that solves

\phi_1(\lambda) = \|p(\lambda)\| - \Delta = 0.   (4.42)
The disadvantage of this approach can be seen by considering the form of \|p(\lambda)\| when \lambda is greater than, but close to, -\lambda_1. For such \lambda, we can approximate \phi_1 by a rational function, as follows:

\phi_1(\lambda) \approx \frac{C_1}{\lambda + \lambda_1} + C_2,

where C_1 > 0 and C_2 are constants. Clearly this approximation (and hence \phi_1) is highly nonlinear, so the root-finding Newton's method will be unreliable or slow. Better results will be obtained if we reformulate the problem (4.42) so that it is nearly linear near the optimal \lambda. By defining

\phi_2(\lambda) = \frac{1}{\Delta} - \frac{1}{\|p(\lambda)\|},

it can be shown using (4.39) that for \lambda slightly greater than -\lambda_1, we have

\phi_2(\lambda) \approx \frac{1}{\Delta} - \frac{\lambda + \lambda_1}{C_3}

for some C_3 > 0. Hence, \phi_2 is nearly linear near -\lambda_1 (see Figure 4.6), and the root-finding Newton's method will perform well, provided that it maintains \lambda > -\lambda_1.

Figure 4.6 1/\|p(\lambda)\| as a function of \lambda.

The root-finding Newton's method applied to \phi_2 generates a sequence of iterates \lambda^{(l)} by setting

\lambda^{(l+1)} = \lambda^{(l)} - \frac{\phi_2(\lambda^{(l)})}{\phi_2'(\lambda^{(l)})}.   (4.43)

After some elementary manipulation, this updating formula can be implemented in the following practical way.

Algorithm 4.3 (Trust Region Subproblem).
Given \lambda^{(0)}, \Delta > 0:
for l = 0, 1, 2, \dots
  Factor B + \lambda^{(l)} I = R^T R;
  Solve R^T R p_l = -g, R^T q_l = p_l;
  Set
    \lambda^{(l+1)} = \lambda^{(l)} + \left( \frac{\|p_l\|}{\|q_l\|} \right)^2 \left( \frac{\|p_l\| - \Delta}{\Delta} \right);   (4.44)
end (for).
Safeguards must be added to this algorithm to make it practical; for instance, when \lambda^{(l)} < -\lambda_1, the Cholesky factorization B + \lambda^{(l)} I = R^T R will not exist. A slightly enhanced version of this algorithm does, however, converge to a solution of (4.37) in most cases.
The main work in each iteration of this method is, of course, the Cholesky factorization of B + \lambda^{(l)} I. Practical versions of this algorithm do not iterate until convergence to the optimal \lambda is obtained with high accuracy, but are content with an approximate solution that can be obtained in two or three iterations.
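The inner loop of Algorithm 4.3 can be sketched in a few lines of Python. The sketch below assumes that every B + \lambda^{(l)} I encountered is positive definite (the safeguards that keep \lambda^{(l)} > -\lambda_1 are omitted), and it stops after a fixed small number of iterations, in the spirit of the remark above.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def trust_region_lambda(B, g, delta, lam=0.0, iters=3):
    """Sketch of Algorithm 4.3; safeguards omitted for brevity."""
    for _ in range(iters):
        R = cholesky(B + lam * np.eye(len(g)))   # B + lam*I = R^T R, R upper
        # Solve R^T R p = -g by two triangular solves.
        p = -solve_triangular(R, solve_triangular(R.T, g, lower=True))
        q = solve_triangular(R.T, p, lower=True) # R^T q = p
        norm_p = np.linalg.norm(p)
        # Update (4.44).
        lam += (norm_p / np.linalg.norm(q)) ** 2 * (norm_p - delta) / delta
    return p, lam
```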
THE HARD CASE
Recall that in the discussion above, we assumed that q_1^T g \ne 0. In fact, the approach described above can be applied even when the most negative eigenvalue is a multiple eigenvalue (that is, 0 > \lambda_1 = \lambda_2 = \cdots), provided that Q_1^T g \ne 0, where Q_1 is the matrix whose columns span the subspace corresponding to the eigenvalue \lambda_1. When this condition does not hold, the situation becomes a little complicated, because the limit (4.40) does not hold for \lambda_j = \lambda_1, and so there may not be a value \lambda \in (-\lambda_1, \infty) such that \|p(\lambda)\| = \Delta (see Figure 4.7). Moré and Sorensen [214] refer to this case as the hard case. At first glance, it is not clear how p and \lambda can be chosen to satisfy (4.8) in the hard case. Clearly, our root-finding technique will not work, since there is no solution for \lambda in the open interval (-\lambda_1, \infty). But Theorem 4.1 assures us that the right value of \lambda lies in the interval [-\lambda_1, \infty), so there is only one possibility: \lambda = -\lambda_1.

Figure 4.7 The hard case: \|p(\lambda)\| < \Delta for all \lambda \in (-\lambda_1, \infty).

To find p, it is not enough to delete the terms for which \lambda_j = \lambda_1 from the formula (4.38) and set

p = -\sum_{j:\, \lambda_j \ne \lambda_1} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j.

Instead, we note that (B - \lambda_1 I) is singular, so there is a vector z such that \|z\| = 1 and (B - \lambda_1 I) z = 0. In fact, z is an eigenvector of B corresponding to the eigenvalue \lambda_1, so by orthogonality of Q we have q_j^T z = 0 for \lambda_j \ne \lambda_1. It follows from this property that if we set

p = -\sum_{j:\, \lambda_j \ne \lambda_1} \frac{q_j^T g}{\lambda_j + \lambda}\, q_j + \tau z   (4.45)

for any scalar \tau, we have

\|p\|^2 = \sum_{j:\, \lambda_j \ne \lambda_1} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2} + \tau^2,

so it is always possible to choose \tau to ensure that \|p\| = \Delta. It is easy to check that the conditions (4.8) hold for this choice of p and \lambda = -\lambda_1.
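A dense-linear-algebra sketch of the hard-case construction, using NumPy's symmetric eigendecomposition, might look as follows; the tolerance used to detect the \lambda_1-eigenspace is an arbitrary illustrative choice, and the function name is hypothetical.

```python
import numpy as np

def hard_case_step(B, g, delta, tol=1e-12):
    """Sketch of (4.45): the hard-case step, with lambda = -lambda_1."""
    lam, Q = np.linalg.eigh(B)        # ascending eigenvalues; lam[0] = lambda_1
    coef = Q.T @ g                    # the quantities q_j^T g
    mask = lam - lam[0] > tol         # indices j with lambda_j != lambda_1
    # Sum in (4.45) over lambda_j != lambda_1, evaluated at lambda = -lambda_1.
    p = -Q[:, mask] @ (coef[mask] / (lam[mask] - lam[0]))
    z = Q[:, 0]                       # unit eigenvector for lambda_1
    tau = np.sqrt(max(delta**2 - p @ p, 0.0))  # make ||p + tau z|| = delta
    return p + tau * z
```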
PROOF OF THEOREM 4.1
We now give a formal proof of Theorem 4.1, the result that characterizes the exact solution of 4.5. The proof relies on the following technical lemma, which deals with the unconstrained minimizers of quadratics and is particularly interesting in the case where the Hessian is positive semidefinite.
Lemma 4.7.
Let m be the quadratic function defined by

m(p) = g^T p + \frac{1}{2} p^T B p,   (4.46)

where B is any symmetric matrix. Then the following statements are true.

(i) m attains a minimum if and only if B is positive semidefinite and g is in the range of B. If B is positive semidefinite, then every p satisfying Bp = -g is a global minimizer of m.

(ii) m has a unique minimizer if and only if B is positive definite.
PROOF. We prove each claim in turn.

(i) We start by proving the "if" part. Since g is in the range of B, there is a p with Bp = -g. For all w \in \mathbb{R}^n, we have

m(p + w) = g^T (p + w) + \frac{1}{2} (p + w)^T B (p + w)
 = \left( g^T p + \frac{1}{2} p^T B p \right) + g^T w + (Bp)^T w + \frac{1}{2} w^T B w
 = m(p) + \frac{1}{2} w^T B w
 \ge m(p),   (4.47)

since B is positive semidefinite. Hence, p is a minimizer of m.

For the "only if" part, let p be a minimizer of m. Since \nabla m(p) = Bp + g = 0, we have that g is in the range of B. Also, we have \nabla^2 m(p) = B positive semidefinite, giving the result.

(ii) For the "if" part, the same argument as in (i) suffices with the additional point that w^T B w > 0 whenever w \ne 0. For the "only if" part, we proceed as in (i) to deduce that B is positive semidefinite. If B is not positive definite, there is a vector w \ne 0 such that Bw = 0. Hence, from (4.47), we have m(p + w) = m(p), so the minimizer is not unique, giving a contradiction.
To illustrate case (i), suppose that

B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 2 \end{bmatrix},

which has eigenvalues 0, 1, 2 and is therefore singular. If g is any vector whose second component is zero, then g will be in the range of B, and the quadratic will attain a minimum. But if the second element in g is nonzero, we can decrease m indefinitely by moving along the direction (0, -g_2, 0)^T.
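A quick numerical check of this example (the value of g is a hypothetical choice with nonzero second component):

```python
import numpy as np

B = np.diag([1.0, 0.0, 2.0])
g = np.array([1.0, 3.0, -2.0])               # second component nonzero
m = lambda p: g @ p + 0.5 * p @ B @ p        # the quadratic (4.46)
for alpha in (1.0, 10.0, 100.0):
    w = alpha * np.array([0.0, -g[1], 0.0])  # move along (0, -g_2, 0)^T
    print(m(w))                              # -9, -90, -900: decreases without bound
```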
We are now in a position to take account of the trust-region bound \|p\| \le \Delta and hence prove Theorem 4.1.
PROOF. (Theorem 4.1)
Assume first that there is \lambda \ge 0 such that the conditions (4.8) are satisfied. Lemma 4.7(i) implies that p^* is a global minimum of the quadratic function

\hat{m}(p) = g^T p + \frac{1}{2} p^T (B + \lambda I) p = m(p) + \frac{\lambda}{2} p^T p.   (4.48)

Since \hat{m}(p) \ge \hat{m}(p^*), we have

m(p) \ge m(p^*) + \frac{\lambda}{2} \left( (p^*)^T p^* - p^T p \right).   (4.49)

Because \lambda (\Delta - \|p^*\|) = 0 and therefore \lambda (\Delta^2 - (p^*)^T p^*) = 0, we have

m(p) \ge m(p^*) + \frac{\lambda}{2} \left( \Delta^2 - p^T p \right).

Hence, from \lambda \ge 0, we have m(p) \ge m(p^*) for all p with \|p\| \le \Delta. Therefore, p^* is a global minimizer of (4.7).
For the converse, we assume that p^* is a global solution of (4.7) and show that there is a \lambda \ge 0 that satisfies (4.8).

In the case \|p^*\| < \Delta, p^* is an unconstrained minimizer of m, and so

\nabla m(p^*) = B p^* + g = 0, \qquad \nabla^2 m(p^*) = B \text{ positive semidefinite},

and so the properties (4.8) hold for \lambda = 0.
Assume for the remainder of the proof that \|p^*\| = \Delta. Then (4.8b) is immediately satisfied, and p^* also solves the constrained problem

\min m(p) \quad \text{subject to} \quad \|p\| = \Delta.

By applying optimality conditions for constrained optimization to this problem (see (12.34)), we find that there is a \lambda such that the Lagrangian function defined by

L(p, \lambda) = m(p) + \frac{\lambda}{2} \left( p^T p - \Delta^2 \right)

has a stationary point at p^*. By setting \nabla_p L(p^*, \lambda) to zero, we obtain

B p^* + g + \lambda p^* = 0 \ \Rightarrow\ (B + \lambda I) p^* = -g,   (4.50)

so that (4.8a) holds. Since m(p) \ge m(p^*) for any p with p^T p = (p^*)^T p^* = \Delta^2, we have for such vectors p that

m(p) \ge m(p^*) + \frac{\lambda}{2} \left( (p^*)^T p^* - p^T p \right).

If we substitute the expression for g from (4.50) into this expression, we obtain after some rearrangement that

\frac{1}{2} (p - p^*)^T (B + \lambda I)(p - p^*) \ge 0.   (4.51)

Since the set of directions

\left\{ w : w = \pm \frac{p - p^*}{\|p - p^*\|}, \text{ for some } p \text{ with } \|p\| = \Delta \right\}

is dense on the unit sphere, (4.51) suffices to prove (4.8c).
It remains to show that \lambda \ge 0. Because (4.8a) and (4.8c) are satisfied by p^*, we have from Lemma 4.7(i) that p^* minimizes \hat{m}, so (4.49) holds. Suppose that there are only negative values of \lambda that satisfy (4.8a) and (4.8c). Then we have from (4.49) that m(p) \ge m(p^*) whenever \|p\| \ge \|p^*\| = \Delta. Since we already know that p^* minimizes m for \|p\| \le \Delta, it follows that p^* is in fact a global, unconstrained minimizer of m. From Lemma 4.7(i) it follows that B p^* = -g and B is positive semidefinite. Therefore conditions (4.8a) and (4.8c) are satisfied by \lambda = 0, which contradicts our assumption that only negative values of \lambda can satisfy the conditions. We conclude that \lambda \ge 0, completing the proof.
CONVERGENCE OF ALGORITHMS BASED ON NEARLY EXACT SOLUTIONS
As we noted in the discussion of Algorithm 4.3, the loop to determine the optimal values of \lambda and p for the subproblem (4.5) does not iterate until high accuracy is achieved. Instead, it is terminated after two or three iterations with a fairly loose approximation to the true solution. The inexactness in this approximate solution is measured in a different way from the dogleg and subspace minimization algorithms. We can add safeguards to the root-finding Newton's method to ensure that the key assumptions of Theorems 4.5 and 4.6 are satisfied by the approximate solution. Specifically, we require that
m(0) - m(p) \ge c_1 \left( m(0) - m(p^*) \right),   (4.52a)
\|p\| \le \gamma \Delta,   (4.52b)
where p^* is the exact solution of (4.3), for some constants c_1 \in (0, 1] and \gamma > 0. The condition (4.52a) ensures that the approximate solution achieves a significant fraction of the maximum decrease possible in the model function m. (It is not necessary to know p^*; there are practical termination criteria that imply (4.52a).) One major difference between (4.52) and the earlier criterion (4.20) is that (4.52) makes better use of the second-order part of m, that is, the p^T B p term. This difference is illustrated by the case in which g = 0 while B has negative eigenvalues, indicating that the current iterate x_k is a saddle point. Here, the right-hand side of (4.20) is zero (indeed, the algorithms we described earlier would terminate at such a point). The right-hand side of (4.52) is positive, indicating that decrease in the model function is still possible, so it forces the algorithm to move away from x_k.
The close attention that near-exact algorithms pay to the second-order term is warranted only if this term closely reflects the actual behavior of the function f; in fact, the trust-region Newton method, for which B = \nabla^2 f(x_k), is the only case that has been treated in the literature. For purposes of global convergence analysis, the use of the exact Hessian allows us to say more about the limit points of the algorithm than merely that they are stationary points. The following result shows that second-order necessary conditions (Theorem 2.3) are satisfied at the limit points.
Theorem 4.8.
Suppose that the assumptions of Theorem 4.6 are satisfied and in addition that f is twice continuously differentiable in the level set S. Suppose that B_k = \nabla^2 f(x_k) for all k, and that the approximate solution p_k of (4.3) at each iteration satisfies (4.52) for some fixed \gamma > 0. Then \lim_{k \to \infty} g_k = 0.
If, in addition, the level set S of 4.24 is compact, then either the algorithm terminates at a point xk at which the secondorder necessary conditions Theorem 2.3 for a local solution hold, or else xk has a limit point x in S at which the secondorder necessary conditions hold.
We omit the proof, which can be found in Moré and Sorensen [214, Section 4].
4.4 LOCAL CONVERGENCE OF TRUST-REGION NEWTON METHODS
Since global convergence of trust-region methods that use exact Hessians \nabla^2 f(x_k) is established above, we turn our attention now to local convergence issues. The key to attaining the fast rate of convergence usually associated with Newton's method is to show that the trust-region bound eventually does not interfere as we approach a solution. Specifically, we hope that near the solution, the (approximate) solution of the trust-region subproblem is well inside the trust region and becomes closer and closer to the true Newton step. Steps that satisfy the latter property are said to be asymptotically similar to Newton steps.
We first prove a general result that applies to any algorithm of the form of Algorithm 4.1 that generates steps that are asymptotically similar to Newton steps whenever the Newton steps easily satisfy the trust-region bound. It shows that the trust-region constraint eventually becomes inactive in algorithms with this property and that superlinear convergence can be attained. The result assumes that the exact Hessian B_k = \nabla^2 f(x_k) is used in (4.3) when x_k is close to a solution x^* that satisfies second-order sufficient conditions (see Theorem 2.4). Moreover, it assumes that the algorithm uses an approximate solution p_k of (4.3) that achieves a similar decrease in the model function m_k as the Cauchy point.
Theorem 4.9.
Let f be twice Lipschitz continuously differentiable in a neighborhood of a point x^* at which second-order sufficient conditions (Theorem 2.4) are satisfied. Suppose the sequence \{x_k\} converges to x^* and that for all k sufficiently large, the trust-region algorithm based on (4.3) with B_k = \nabla^2 f(x_k) chooses steps p_k that satisfy the Cauchy-point-based model reduction criterion (4.20) and are asymptotically similar to Newton steps p_k^N whenever \|p_k^N\| \le \frac{1}{2} \Delta_k, that is,

\|p_k - p_k^N\| = o(\|p_k^N\|).   (4.53)

Then the trust-region bound \Delta_k becomes inactive for all k sufficiently large and the sequence \{x_k\} converges superlinearly to x^*.
PROOF. We show that \|p_k^N\| \le \frac{1}{2} \Delta_k and \|p_k\| \le \Delta_k, for all sufficiently large k, so the near-optimal step p_k in (4.53) will eventually always be taken.

We first seek a lower bound on the predicted reduction m_k(0) - m_k(p_k) for all sufficiently large k. We assume that k is large enough that the o(\|p_k^N\|) term in (4.53) is less than \|p_k^N\|. When \|p_k^N\| \le \frac{1}{2} \Delta_k, we then have that \|p_k\| \le \|p_k^N\| + o(\|p_k^N\|) \le 2 \|p_k^N\|, while if \|p_k^N\| > \frac{1}{2} \Delta_k, we have \|p_k\| \le \Delta_k < 2 \|p_k^N\|. In both cases, then, we have

\|p_k\| \le 2 \|p_k^N\| \le 2 \|\nabla^2 f(x_k)^{-1}\| \|g_k\|,

and so

\|g_k\| \ge \frac{1}{2} \|p_k\| \big/ \|\nabla^2 f(x_k)^{-1}\|.

We have from the relation (4.20) that

m_k(0) - m_k(p_k)
 \ge c_1 \|g_k\| \min\left( \Delta_k, \frac{\|g_k\|}{\|\nabla^2 f(x_k)\|} \right)
 \ge c_1 \frac{\|p_k\|}{2 \|\nabla^2 f(x_k)^{-1}\|} \min\left( \|p_k\|, \frac{\|p_k\|}{2 \|\nabla^2 f(x_k)^{-1}\| \|\nabla^2 f(x_k)\|} \right)
 = \frac{c_1 \|p_k\|^2}{4 \|\nabla^2 f(x_k)^{-1}\|^2 \|\nabla^2 f(x_k)\|}.

Because x_k \to x^*, we use continuity of \nabla^2 f(x) and positive definiteness of \nabla^2 f(x^*) to deduce that the following bound holds for all k sufficiently large:

\frac{c_1}{4 \|\nabla^2 f(x_k)^{-1}\|^2 \|\nabla^2 f(x_k)\|} \ge \frac{c_1}{8 \|\nabla^2 f(x^*)^{-1}\|^2 \|\nabla^2 f(x^*)\|} \stackrel{def}{=} c_3,   (4.54)

where c_3 > 0. Hence, we have

m_k(0) - m_k(p_k) \ge c_3 \|p_k\|^2

for all sufficiently large k. By Lipschitz continuity of \nabla^2 f(x) near x^*, and using Taylor's theorem (Theorem 2.1), we have

| f(x_k) - f(x_k + p_k) - [m_k(0) - m_k(p_k)] |
 = \left| \frac{1}{2} p_k^T \nabla^2 f(x_k) p_k - \int_0^1 (1 - t)\, p_k^T \nabla^2 f(x_k + t p_k) p_k \, dt \right|
 \le \frac{L}{4} \|p_k\|^3,

where L > 0 is the Lipschitz constant for \nabla^2 f. Hence, by definition (4.4) of \rho_k, we have for sufficiently large k that

|\rho_k - 1| \le \frac{\|p_k\|^3 (L/4)}{c_3 \|p_k\|^2} = \frac{L}{4 c_3} \|p_k\| \le \frac{L}{4 c_3} \Delta_k.   (4.55)

Now, the trust-region radius can be reduced only if \rho_k < \frac{1}{4} (or some other fixed number less than 1), so it is clear from (4.55) that the sequence \{\Delta_k\} is bounded away from zero. Since x_k \to x^*, we have \|p_k^N\| \to 0 and therefore \|p_k\| \to 0 from (4.53). Hence, the trust-region bound is inactive for all k sufficiently large, and the bound \|p_k^N\| \le \frac{1}{2} \Delta_k is eventually always satisfied.

To prove superlinear convergence, we use the quadratic convergence of Newton's method, proved in Theorem 3.5. In particular, we have from (3.33) that

\|x_k + p_k^N - x^*\| = O(\|x_k - x^*\|^2),

which implies that \|p_k^N\| = O(\|x_k - x^*\|). Therefore, using (4.53), we have

\|x_k + p_k - x^*\| \le \|x_k + p_k^N - x^*\| + \|p_k^N - p_k\| = O(\|x_k - x^*\|^2) + o(\|p_k^N\|) = o(\|x_k - x^*\|),

thus proving superlinear convergence.

It is immediate from Theorem 3.5 that if p_k = p_k^N for all k sufficiently large, we have quadratic convergence of \{x_k\} to x^*.
Reasonable implementations of the dogleg, two-dimensional subspace minimization, and nearly exact algorithms of Section 4.3 with B_k = \nabla^2 f(x_k) eventually use the steps p_k = p_k^N under the conditions of Theorem 4.9, and therefore converge quadratically. In the case of the dogleg and two-dimensional subspace minimization methods, the exact step p_k^N is one of the candidates for p_k: it lies inside the trust region, along the dogleg path, and inside the two-dimensional subspace. Since, under the assumptions of Theorem 4.9, p_k^N is the unconstrained minimizer of m_k for k sufficiently large, it is certainly the minimizer in the more restricted domains, so we have p_k = p_k^N. For the approach of Section 4.3, if we follow the reasonable strategy of checking whether p_k^N is a solution of (4.3) prior to embarking on Algorithm 4.3, then eventually we will also have p_k = p_k^N.
4.5 OTHER ENHANCEMENTS
SCALING
As we noted in Chapter 2, optimization problems are often posed with poor scaling: the objective function f is highly sensitive to small changes in certain components of the vector x and relatively insensitive to changes in other components. Topologically, a symptom of poor scaling is that the minimizer x^* lies in a narrow valley, so that the contours of the objective f near x^* tend towards highly eccentric ellipses. Algorithms that fail to compensate for poor scaling can perform badly; see Figure 2.7 for an illustration of the poor performance of the steepest descent approach.
Recalling our definition of a trust region (a region around the current iterate within which the model m_k is an adequate representation of the true objective f), it is easy to see that a spherical trust region may not be appropriate when f is poorly scaled. Even if the model Hessian B_k is exact, the rapid changes in f along certain directions probably will cause m_k to be a poor approximation to f along these directions. On the other hand, m_k may be a more reliable approximation to f along directions in which f is changing more slowly. Since the shape of our trust region should be such that our confidence in the model is more or less the same at all points on the boundary of the region, we are led naturally to consider elliptical trust regions in which the axes are short in the sensitive directions and longer in the less sensitive directions.
Elliptical trust regions can be defined by
\|D p\| \le \Delta,   (4.56)

where D is a diagonal matrix with positive diagonal elements, yielding the following scaled trust-region subproblem:

\min_{p \in \mathbb{R}^n} m_k(p) \stackrel{def}{=} f_k + g_k^T p + \frac{1}{2} p^T B_k p \quad \text{s.t. } \|D p\| \le \Delta_k.   (4.57)
When f(x) is highly sensitive to the value of the ith component x_i, we set the corresponding diagonal element d_{ii} of D to be large, while d_{ii} is smaller for less-sensitive components.

Information to construct the scaling matrix D may be derived from the second derivatives \partial^2 f / \partial x_i^2. We can allow D to change from iteration to iteration; most of the theory of this chapter will still apply, with minor modifications, provided that each d_{ii} stays within some predetermined range [d_{lo}, d_{hi}], where 0 < d_{lo} \le d_{hi} < \infty. Of course, we do not need D to be a precise reflection of the scaling of the problem, so it is not necessary to devise elaborate heuristics or to perform extensive computations to get it just right.
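For instance, one simple (illustrative, not prescribed) recipe sets d_{ii} from the diagonal second derivatives, clipped to a safe range; with d_{ii} = \sqrt{|\partial^2 f / \partial x_i^2|}, the scaled Hessian D^{-1} B_k D^{-1} has roughly unit diagonal. The bounds below are arbitrary placeholder values for the predetermined range mentioned in the text.

```python
import numpy as np

def scaling_from_hessian_diag(hess_diag, d_lo=1e-2, d_hi=1e2):
    """Sketch: diagonal scaling d_ii = sqrt(|d^2 f / d x_i^2|), kept in [d_lo, d_hi]."""
    return np.clip(np.sqrt(np.abs(hess_diag)), d_lo, d_hi)
```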
The following procedure shows how the Cauchy point calculation (Algorithm 4.2) changes when we use a scaled trust region.

Algorithm 4.4 (Generalized Cauchy Point Calculation).
Find the vector p_k^S that solves

p_k^S = \arg\min_{p \in \mathbb{R}^n} f_k + g_k^T p \quad \text{s.t. } \|D p\| \le \Delta_k;   (4.58)

Calculate the scalar \tau_k > 0 that minimizes m_k(\tau p_k^S) subject to satisfying the trust-region bound, that is,

\tau_k = \arg\min_{\tau > 0} m_k(\tau p_k^S) \quad \text{s.t. } \|\tau D p_k^S\| \le \Delta_k;   (4.59)

Set p_k^C = \tau_k p_k^S.

For this scaled version, we find that

p_k^S = -\frac{\Delta_k}{\|D^{-1} g_k\|} D^{-2} g_k,   (4.60)

and that the step length \tau_k is obtained from the following modification of (4.12):

\tau_k = \begin{cases} 1 & \text{if } g_k^T D^{-2} B_k D^{-2} g_k \le 0, \\ \min\left( \dfrac{\|D^{-1} g_k\|^3}{\Delta_k\, g_k^T D^{-2} B_k D^{-2} g_k}, 1 \right) & \text{otherwise}. \end{cases}   (4.61)
The details are left as an exercise.
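As a sanity check on formulas (4.60) and (4.61), here is a short Python sketch; the names are hypothetical, and D is stored as a vector of its diagonal entries.

```python
import numpy as np

def generalized_cauchy_point(g, B, d, delta):
    """Sketch of Algorithm 4.4 via (4.60)-(4.61); d holds the diagonal of D."""
    Dinv_g = g / d                     # D^{-1} g
    D2inv_g = g / d**2                 # D^{-2} g
    pS = -(delta / np.linalg.norm(Dinv_g)) * D2inv_g      # (4.60)
    curv = D2inv_g @ B @ D2inv_g       # g^T D^{-2} B D^{-2} g
    if curv <= 0.0:
        tau = 1.0
    else:
        tau = min(np.linalg.norm(Dinv_g) ** 3 / (delta * curv), 1.0)  # (4.61)
    return tau * pS
```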
A simpler alternative for adjusting the definition of the Cauchy point and the various algorithms of this chapter to allow for the elliptical trust region is simply to rescale the variables p in the subproblem (4.57) so that the trust region is spherical in the scaled variables. By defining

\tilde{p} \stackrel{def}{=} D p,

and by substituting into (4.57), we obtain

\min_{\tilde{p} \in \mathbb{R}^n} \tilde{m}_k(\tilde{p}) \stackrel{def}{=} f_k + g_k^T D^{-1} \tilde{p} + \frac{1}{2} \tilde{p}^T D^{-1} B_k D^{-1} \tilde{p} \quad \text{s.t. } \|\tilde{p}\| \le \Delta_k.

The theory and algorithms can now be derived in the usual way by substituting \tilde{p} for p, D^{-1} g_k for g_k, D^{-1} B_k D^{-1} for B_k, and so on.
TRUST REGIONS IN OTHER NORMS
Trust regions may also be defined in terms of norms other than the Euclidean norm. For instance, we may have

\|p\|_1 \le \Delta_k \quad \text{or} \quad \|p\|_\infty \le \Delta_k,

or their scaled counterparts

\|D p\|_1 \le \Delta_k \quad \text{or} \quad \|D p\|_\infty \le \Delta_k,

where D is a positive diagonal matrix as before. Norms such as these offer no obvious advantages for small-to-medium unconstrained problems, but they may be useful for constrained problems. For instance, for the bound-constrained problem

\min_{x \in \mathbb{R}^n} f(x), \quad \text{subject to} \quad x \ge 0,

the trust-region subproblem may take the form

\min_{p \in \mathbb{R}^n} m_k(p) \stackrel{def}{=} f_k + g_k^T p + \frac{1}{2} p^T B_k p \quad \text{s.t. } x_k + p \ge 0,\ \|p\|_\infty \le \Delta_k.   (4.62)

When the trust region is defined by a Euclidean norm, the feasible region for (4.62) consists of the intersection of a sphere and the nonnegative orthant (an awkward object, geometrically speaking). When the \infty-norm is used, however, the feasible region is simply the rectangular box defined by

x_k + p \ge 0, \quad p \ge -\Delta_k e, \quad p \le \Delta_k e,

where e = (1, 1, \dots, 1)^T, so the solution of the subproblem is easily calculated by using techniques for bound-constrained quadratic programming.

For large problems, in which factorization or formation of the model Hessian B_k is not computationally desirable, the use of a trust region defined by \|\cdot\|_\infty will also give rise to a bound-constrained subproblem, which may be more convenient to solve than the standard subproblem (4.3). To our knowledge, there has not been much research on the relative performance of methods that use trust regions of different shapes on large problems.
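For the bound-constrained subproblem (4.62), the box feasible region described above translates into simple componentwise bounds on p; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def box_bounds(x_k, delta):
    """Componentwise bounds for (4.62) with an infinity-norm trust region:
    the feasible set is max(-x_k, -delta*e) <= p <= delta*e."""
    lower = np.maximum(-x_k, -delta * np.ones_like(x_k))
    upper = delta * np.ones_like(x_k)
    return lower, upper
```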
NOTES AND REFERENCES
One of the earliest works on trust-region methods is Winfield [307]. The influential paper of Powell [244] proves a result like Theorem 4.5 for the case of \eta = 0, where the algorithm takes a step whenever it decreases the function value. Powell uses a weaker assumption than ours on the matrices \|B_k\|, but his analysis is more complicated. Moré [211] summarizes developments in algorithms and software before 1982, paying particular attention to the importance of using a scaled trust-region norm.

Byrd, Schnabel, and Schultz [279], [54] provide a general theory for inexact trust-region methods; they introduce the idea of two-dimensional subspace minimization and also focus on proper handling of the case of indefinite B to ensure stronger local convergence results than Theorems 4.5 and 4.6. Dennis and Schnabel [93] survey trust-region methods as part of their overview of unconstrained optimization, providing pointers to many important developments in the literature.

The monograph of Conn, Gould, and Toint [74] is an exhaustive treatment of the state of the art in trust-region methods for both unconstrained and constrained optimization. It includes a comprehensive annotated bibliography of the literature in the area.
EXERCISES
4.1 Let f(x) = 10 (x_2 - x_1^2)^2 + (1 - x_1)^2. At x = (0, -1), draw the contour lines of the quadratic model (4.2), assuming that B is the Hessian of f. Draw the family of solutions of (4.3) as the trust-region radius varies from \Delta = 0 to \Delta = 2. Repeat this at x = (0, 0.5).
4.2 Write a program that implements the dogleg method. Choose B_k to be the exact Hessian. Apply it to solve Rosenbrock's function (2.22). Experiment with the update rule for the trust region by changing the constants in Algorithm 4.1, or by designing your own rules.
4.3 Program the trust-region method based on Algorithm 7.2. Choose B_k to be the exact Hessian, and use it to minimize the function

\min f(x) = \sum_{i=1}^{n} \left[ (1 - x_{2i-1})^2 + 10 (x_{2i} - x_{2i-1}^2)^2 \right]

with n = 10. Experiment with the starting point and the stopping test for the CG iteration. Repeat the computation with n = 50.
Your program should indicate, at every iteration, whether Algorithm 7.2 encountered negative curvature, reached the trustregion boundary, or met the stopping test.
4.4 Theorem 4.5 shows that the sequence \{\|g_k\|\} has an accumulation point at zero. Show that if the iterates x_k stay in a bounded set \mathcal{B}, then there is a limit point \bar{x} of the sequence \{x_k\} such that g(\bar{x}) = 0.

4.5 Show that \tau_k defined by (4.12) does indeed identify the minimizer of m_k along the direction -g_k.
4.6 The Cauchy-Schwarz inequality states that for any vectors u and v, we have

(u^T v)^2 \le (u^T u)(v^T v),

with equality only when u and v are parallel. When B is positive definite, use this inequality to show that

\gamma \stackrel{def}{=} \frac{\|g\|^4}{(g^T B g)(g^T B^{-1} g)} \le 1,

with equality only if g and Bg (and B^{-1} g) are parallel.

4.7 When B is positive definite, the double-dogleg method constructs a path with three line segments from the origin to the full step. The four points that define the path are

the origin;
the unconstrained Cauchy step p^C = -\left( \frac{g^T g}{g^T B g} \right) g;
a fraction of the full step, \bar{\gamma} p^B = -\bar{\gamma} B^{-1} g, for some \bar{\gamma} \in (\gamma, 1], where \gamma is defined in the previous question; and
the full step p^B = -B^{-1} g.

Show that \|p\| increases monotonically along this path.

(Note: The double-dogleg method, as discussed in Dennis and Schnabel [92, Section 6.4.2], was for some time thought to be superior to the standard dogleg method, but later testing has not shown much difference in performance.)
4.8 Show that (4.43) and (4.44) are equivalent. (Hints: Note that

\frac{d \phi_2(\lambda)}{d\lambda} = \frac{1}{2} \left( \|p(\lambda)\|^2 \right)^{-3/2} \frac{d \|p(\lambda)\|^2}{d\lambda},

\frac{d \|p(\lambda)\|^2}{d\lambda} = -2 \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^3}

from (4.39), and

\|q\|^2 = \|R^{-T} p\|^2 = p^T (B + \lambda I)^{-1} p = \sum_{j=1}^{n} \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^3}.)

4.9 Derive the solution of the two-dimensional subspace minimization problem in the case where B is positive definite.

4.10 Show that if B is any symmetric matrix, then there exists \lambda \ge 0 such that B + \lambda I is positive definite.

4.11 Verify that the definitions (4.60) for p_k^S and (4.61) for \tau_k are valid for the Cauchy point in the case of an elliptical trust region. (Hint: Using the theory of Chapter 12, we can show that the solution of (4.58) satisfies g_k + \lambda D^2 p_k^S = 0 for some scalar \lambda \ge 0.)

4.12 The following example shows that the reduction in the model function m achieved by the two-dimensional minimization strategy can be much smaller than that achieved by the exact solution of (4.5). In (4.5), set

g = \left( \frac{1}{\epsilon}, 1, 2\epsilon \right)^T,

where \epsilon is a small positive number. Set

B = \mathrm{diag}\left( \frac{1}{\epsilon^3}, 1, \epsilon^3 \right), \qquad \Delta = 0.5.

Show that the solution of (4.5) has components \left( O(\epsilon^2), -\frac{1}{2} + O(\epsilon), O(\epsilon) \right)^T and that the reduction in the model m is \frac{3}{8} + O(\epsilon). For the two-dimensional minimization strategy, show that the solution is a multiple of B^{-1} g and that the reduction in m is O(\epsilon).
CHAPTER 5
Conjugate Gradient Methods
Our interest in conjugate gradient methods is twofold. First, they are among the most useful techniques for solving large linear systems of equations. Second, they can be adapted to solve nonlinear optimization problems. The remarkable properties of both linear and nonlinear conjugate gradient methods will be described in this chapter.
The linear conjugate gradient method was proposed by Hestenes and Stiefel in the 1950s as an iterative method for solving linear systems with positive definite coefficient matrices. It is an alternative to Gaussian elimination that is well suited for solving large problems. The performance of the linear conjugate gradient method is determined by the
distribution of the eigenvalues of the coefficient matrix. By transforming, or preconditioning, the linear system, we can make this distribution more favorable and improve the convergence of the method significantly. Preconditioning plays a crucial role in the design of practical conjugate gradient strategies. Our treatment of the linear conjugate gradient method will highlight those properties of the method that are important in optimization.
The first nonlinear conjugate gradient method was introduced by Fletcher and Reeves in the 1960s. It is one of the earliest known techniques for solving largescale nonlinear optimization problems. Over the years, many variants of this original scheme have been proposed, and some are widely used in practice. The key features of these algorithms are that they require no matrix storage and are faster than the steepest descent method.
5.1 THE LINEAR CONJUGATE GRADIENT METHOD
In this section we derive the linear conjugate gradient method and discuss its essential convergence properties. For simplicity, we drop the qualifier linear throughout.
The conjugate gradient method is an iterative method for solving a linear system of equations
Ax = b,   (5.1)

where A is an n \times n symmetric positive definite matrix. The problem (5.1) can be stated
equivalently as the following minimization problem:
\min_x \phi(x) \stackrel{def}{=} \frac{1}{2} x^T A x - b^T x,   (5.2)
that is, both (5.1) and (5.2) have the same unique solution. This equivalence will allow us to interpret the conjugate gradient method either as an algorithm for solving linear systems or as a technique for minimizing convex quadratic functions. For future reference, we note that the gradient of \phi equals the residual of the linear system, that is,

\nabla \phi(x) = Ax - b \stackrel{def}{=} r(x),   (5.3)

so in particular at x = x_k we have

r_k = A x_k - b.   (5.4)

CONJUGATE DIRECTION METHODS
One of the remarkable properties of the conjugate gradient method is its ability to generate, in a very economical fashion, a set of vectors with a property known as conjugacy. A
set of nonzero vectors \{p_0, p_1, \dots, p_l\} is said to be conjugate with respect to the symmetric positive definite matrix A if

p_i^T A p_j = 0, \quad \text{for all } i \ne j.   (5.5)
It is easy to show that any set of vectors satisfying this property is also linearly independent. For a geometrical illustration of conjugate directions see Section 9.4.
The importance of conjugacy lies in the fact that we can minimize \phi(\cdot) in n steps by successively minimizing it along the individual directions in a conjugate set. To verify this claim, we consider the following conjugate direction method. (The distinction between the conjugate gradient method and the conjugate direction method will become clear as we proceed.) Given a starting point x_0 \in \mathbb{R}^n and a set of conjugate directions \{p_0, p_1, \dots, p_{n-1}\}, let us generate the sequence \{x_k\} by setting

x_{k+1} = x_k + \alpha_k p_k,   (5.6)

where \alpha_k is the one-dimensional minimizer of the quadratic function \phi(\cdot) along x_k + \alpha p_k, given explicitly by

\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k};   (5.7)

see (3.55). We have the following result.
Theorem 5.1.
For any x_0 \in \mathbb{R}^n the sequence \{x_k\} generated by the conjugate direction algorithm (5.6), (5.7) converges to the solution x^* of the linear system (5.1) in at most n steps.

PROOF. Since the directions \{p_i\} are linearly independent, they must span the whole space \mathbb{R}^n. Hence, we can write the difference between x_0 and the solution x^* in the following way:

x^* - x_0 = \sigma_0 p_0 + \sigma_1 p_1 + \cdots + \sigma_{n-1} p_{n-1},

for some choice of scalars \sigma_k. By premultiplying this expression by p_k^T A and using the conjugacy property (5.5), we obtain

\sigma_k = \frac{p_k^T A (x^* - x_0)}{p_k^T A p_k}.   (5.8)

We now establish the result by showing that these coefficients \sigma_k coincide with the step lengths \alpha_k generated by the formula (5.7).
If x_k is generated by algorithm (5.6), (5.7), then we have

x_k = x_0 + \alpha_0 p_0 + \alpha_1 p_1 + \cdots + \alpha_{k-1} p_{k-1}.

By premultiplying this expression by p_k^T A and using the conjugacy property, we have that

p_k^T A (x_k - x_0) = 0,

and therefore

p_k^T A (x^* - x_0) = p_k^T A (x^* - x_k) = p_k^T (b - A x_k) = -p_k^T r_k.

By comparing this relation with (5.7) and (5.8), we find that \sigma_k = \alpha_k, giving the result.
There is a simple interpretation of the properties of conjugate directions. If the matrix A in 5.2 is diagonal, the contours of the function are ellipses whose axes are aligned with the coordinate directions, as illustrated in Figure 5.1. We can find the minimizer of this function by performing onedimensional minimizations along the coordinate directions
Figure 5.1 Successive minimizations along the coordinate directions find the minimizer of a quadratic with a diagonal Hessian in n iterations.
Figure 5.2 Successive minimization along coordinate axes does not find the solution in n iterations, for a general convex quadratic.
e_1, e_2, \dots, e_n in turn. When A is not diagonal, its contours are still elliptical, but they are usually no longer aligned with the coordinate directions. The strategy of successive minimization along these directions in turn no longer leads to the solution in n iterations (or even in a finite number of iterations). This phenomenon is illustrated in the two-dimensional example of Figure 5.2. We can, however, recover the nice behavior of Figure 5.1 if we transform the problem to make A diagonal and then minimize along the coordinate directions. Suppose we transform the problem by defining new variables \hat{x} as

\hat{x} = S^{-1} x,   (5.9)

where S is the n \times n matrix defined by

S = [p_0\ p_1\ \cdots\ p_{n-1}],

where \{p_0, p_1, \dots, p_{n-1}\} is the set of conjugate directions with respect to A. The quadratic \phi defined by (5.2) now becomes

\hat{\phi}(\hat{x}) \stackrel{def}{=} \phi(S \hat{x}) = \frac{1}{2} \hat{x}^T (S^T A S) \hat{x} - (S^T b)^T \hat{x}.

By the conjugacy property (5.5), the matrix S^T A S is diagonal, so we can find the minimizing value of \hat{x} by performing n one-dimensional minimizations along the coordinate directions
of \hat{x}. Because of the relation (5.9), however, the ith coordinate direction in \hat{x}-space corresponds to the direction p_i in x-space. Hence, the coordinate search strategy applied to \hat{\phi} is equivalent to the conjugate direction algorithm (5.6), (5.7). We conclude, as in Theorem 5.1, that the conjugate direction algorithm terminates in at most n steps.
Returning to Figure 5.1, we note another interesting property: When the Hessian ma trix is diagonal, each coordinate minimization correctly determines one of the components of the solution x. In other words, after k onedimensional minimizations, the quadratic has been minimized on the subspace spanned by e1,e2,…,ek. The following theorem proves this important result for the general case in which the Hessian of the quadratic is not necessarily diagonal. Here and later, we use the notation span p0 , p1 , . . . , pk to denote the set of all linear combinations of the vectors p0 , p1 , . . . , pk . In proving the result we will make use of the following expression, which is easily verified from the relations 5.4 and 5.6:
r_{k+1} = r_k + \alpha_k A p_k.   (5.10)
Theorem 5.2 Expanding Subspace Minimization.
Let x_0 \in \mathbb{R}^n be any starting point and suppose that the sequence \{x_k\} is generated by the conjugate direction algorithm (5.6), (5.7). Then

r_k^T p_i = 0, \quad \text{for } i = 0, 1, \dots, k-1,   (5.11)

and x_k is the minimizer of \phi(x) = \frac{1}{2} x^T A x - b^T x over the set

\{ x \mid x = x_0 + \mathrm{span}\{p_0, p_1, \dots, p_{k-1}\} \}.   (5.12)
PROOF. We begin by showing that a point \tilde{x} minimizes \phi over the set (5.12) if and only if r(\tilde{x})^T p_i = 0, for each i = 0, 1, \dots, k-1. Let us define h(\sigma) = \phi(x_0 + \sigma_0 p_0 + \cdots + \sigma_{k-1} p_{k-1}), where \sigma = (\sigma_0, \sigma_1, \dots, \sigma_{k-1})^T. Since h is a strictly convex quadratic, it has a unique minimizer \sigma^* that satisfies

\frac{\partial h(\sigma^*)}{\partial \sigma_i} = 0, \quad i = 0, 1, \dots, k-1.

By the chain rule, this equation implies that

\nabla \phi\left( x_0 + \sigma_0^* p_0 + \cdots + \sigma_{k-1}^* p_{k-1} \right)^T p_i = 0, \quad i = 0, 1, \dots, k-1.

By recalling the definition (5.3), we have for the minimizer \tilde{x} = x_0 + \sigma_0^* p_0 + \cdots + \sigma_{k-1}^* p_{k-1} on the set (5.12) that r(\tilde{x})^T p_i = 0, as claimed.
We now use induction to show that x_k satisfies (5.11). For the case k = 1, we have from the fact that x_1 = x_0 + \alpha_0 p_0 minimizes \phi along p_0 that r_1^T p_0 = 0. Let us now make the induction hypothesis, namely, that r_{k-1}^T p_i = 0 for i = 0, 1, \dots, k-2. By (5.10), we have

r_k = r_{k-1} + \alpha_{k-1} A p_{k-1},

so that

p_{k-1}^T r_k = p_{k-1}^T r_{k-1} + \alpha_{k-1} p_{k-1}^T A p_{k-1} = 0,

by the definition (5.7) of \alpha_{k-1}. Meanwhile, for the other vectors p_i, i = 0, 1, \dots, k-2, we have

p_i^T r_k = p_i^T r_{k-1} + \alpha_{k-1} p_i^T A p_{k-1} = 0,

where p_i^T r_{k-1} = 0 because of the induction hypothesis and p_i^T A p_{k-1} = 0 because of conjugacy of the vectors p_i. We have shown that r_k^T p_i = 0, for i = 0, 1, \dots, k-1, so the proof is complete.
The fact that the current residual rk is orthogonal to all previous search directions, as expressed in 5.11, is a property that will be used extensively in this chapter.
The discussion so far has been general, in that it applies to a conjugate direction method (5.6), (5.7) based on any choice of the conjugate direction set \{p_0, p_1, \dots, p_{n-1}\}. There are many ways to choose the set of conjugate directions. For instance, the eigenvectors v_1, v_2, \dots, v_n of A are mutually orthogonal as well as conjugate with respect to A, so these could be used as the vectors \{p_0, p_1, \dots, p_{n-1}\}. For large-scale applications, however, computation of the complete set of eigenvectors requires an excessive amount of computation. An alternative approach is to modify the Gram-Schmidt orthogonalization process to produce a set of conjugate directions rather than a set of orthogonal directions. (This modification is easy to produce, since the properties of conjugacy and orthogonality are closely related in spirit.) However, the Gram-Schmidt approach is also expensive, since it requires us to store the entire direction set.
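To make the Gram-Schmidt idea concrete, the following Python sketch A-conjugates a set of linearly independent vectors; it also illustrates why the approach is expensive, since every earlier direction must be stored and revisited. (The function is a hypothetical illustration, not an algorithm from the text.)

```python
import numpy as np

def conjugate_directions(A, V):
    """Gram-Schmidt modified to produce A-conjugate directions.

    V: matrix whose columns are linearly independent; returns P whose
    columns satisfy p_i^T A p_j = 0 for i != j, as in (5.5)."""
    P = []
    for v in V.T:
        p = v.astype(float).copy()
        for q in P:  # subtract the A-"component" of v along each earlier direction
            p -= (q @ A @ v) / (q @ A @ q) * q
        P.append(p)
    return np.column_stack(P)
```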
BASIC PROPERTIES OF THE CONJUGATE GRADIENT METHOD
The conjugate gradient method is a conjugate direction method with a very special property: In generating its set of conjugate vectors, it can compute a new vector p_k by using only the previous vector p_{k-1}. It does not need to know all the previous elements p_0, p_1, \dots, p_{k-2} of the conjugate set; p_k is automatically conjugate to these vectors. This remarkable property implies that the method requires little storage and computation.
In the conjugate gradient method, each direction p_k is chosen to be a linear combination of the negative residual -r_k (which, by (5.3), is the steepest descent direction for the function \phi) and the previous direction p_{k-1}. We write
p_k = -r_k + \beta_k p_{k-1},   (5.13)
where the scalar \beta_k is to be determined by the requirement that p_{k-1} and p_k must be conjugate with respect to A. By premultiplying (5.13) by p_{k-1}^T A and imposing the condition p_{k-1}^T A p_k = 0, we find that

\beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}.
We choose the first search direction p0 to be the steepest descent direction at the initial point x0. As in the general conjugate direction method, we perform successive onedimensional minimizations along each of the search directions. We have thus specified a complete algorithm, which we express formally as follows:
Algorithm 5.1 (CG--Preliminary Version).
Given x_0;
Set r_0 \leftarrow A x_0 - b, p_0 \leftarrow -r_0, k \leftarrow 0;
while r_k \ne 0
  \alpha_k \leftarrow -\dfrac{r_k^T p_k}{p_k^T A p_k};   (5.14a)
  x_{k+1} \leftarrow x_k + \alpha_k p_k;   (5.14b)
  r_{k+1} \leftarrow A x_{k+1} - b;   (5.14c)
  \beta_{k+1} \leftarrow \dfrac{r_{k+1}^T A p_k}{p_k^T A p_k};   (5.14d)
  p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k;   (5.14e)
  k \leftarrow k + 1;   (5.14f)
end (while)
This version is useful for studying the essential properties of the conjugate gradient method, but we present a more efficient version later. We show first that the directions p0, p1, . . . , pn1 are indeed conjugate, which by Theorem 5.1 implies termination in n steps. The theorem below establishes this property and two other important properties. First, the residuals ri are mutually orthogonal. Second, each search direction pk and residual
r_k is contained in the Krylov subspace of degree k for r_0, defined as

\mathcal{K}(r_0; k) \stackrel{def}{=} \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\}.   (5.15)
Theorem 5.3.
Suppose that the kth iterate generated by the conjugate gradient method is not the solution point x^*. The following four properties hold:

r_k^T r_i = 0, \quad \text{for } i = 0, 1, \dots, k-1,   (5.16)
\mathrm{span}\{r_0, r_1, \dots, r_k\} = \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\},   (5.17)
\mathrm{span}\{p_0, p_1, \dots, p_k\} = \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\},   (5.18)
p_k^T A p_i = 0, \quad \text{for } i = 0, 1, \dots, k-1.   (5.19)

Therefore, the sequence \{x_k\} converges to x^* in at most n steps.
PROOF. The proof is by induction. The expressions (5.17) and (5.18) hold trivially for k = 0, while (5.19) holds by construction for k = 1. Assuming now that these three expressions are true for some k (the induction hypothesis), we show that they continue to hold for k + 1.
To prove (5.17), we show first that the set on the left-hand side is contained in the set on the right-hand side. Because of the induction hypothesis, we have from (5.17) and (5.18) that

r_k \in \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\}, \qquad p_k \in \mathrm{span}\{r_0, A r_0, \dots, A^k r_0\},   (5.20)

while by multiplying the second of these expressions by A, we obtain

A p_k \in \mathrm{span}\{A r_0, \dots, A^{k+1} r_0\}.

By applying (5.10), we find that

r_{k+1} \in \mathrm{span}\{r_0, A r_0, \dots, A^{k+1} r_0\}.

By combining this expression with the induction hypothesis for (5.17), we conclude that

\mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\} \subset \mathrm{span}\{r_0, A r_0, \dots, A^{k+1} r_0\}.
To prove that the reverse inclusion holds as well, we use the induction hypothesis on (5.18) to deduce that

A^{k+1} r_0 = A (A^k r_0) \in \mathrm{span}\{A p_0, A p_1, \dots, A p_k\}.

Since by (5.10) we have A p_i = (r_{i+1} - r_i)/\alpha_i for i = 0, 1, \dots, k, it follows that

A^{k+1} r_0 \in \mathrm{span}\{r_0, r_1, \dots, r_{k+1}\}.

By combining this expression with the induction hypothesis for (5.17), we find that

\mathrm{span}\{r_0, A r_0, \dots, A^{k+1} r_0\} \subset \mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\}.
Therefore, the relation (5.17) continues to hold when k is replaced by k + 1, as claimed.

We show that (5.18) continues to hold when k is replaced by k + 1 by the following argument:

\mathrm{span}\{p_0, p_1, \dots, p_k, p_{k+1}\}
 = \mathrm{span}\{p_0, p_1, \dots, p_k, r_{k+1}\}   [by (5.14e)]
 = \mathrm{span}\{r_0, A r_0, \dots, A^k r_0, r_{k+1}\}   [by induction hypothesis for (5.18)]
 = \mathrm{span}\{r_0, r_1, \dots, r_k, r_{k+1}\}   [by (5.17)]
 = \mathrm{span}\{r_0, A r_0, \dots, A^{k+1} r_0\}   [by (5.17) for k + 1].
Next, we prove the conjugacy condition (5.19) with k replaced by k + 1. By multiplying (5.14e) by A p_i, i = 0, 1, \dots, k, we obtain

p_{k+1}^T A p_i = -r_{k+1}^T A p_i + \beta_{k+1} p_k^T A p_i.   (5.21)

By the definition (5.14d) of \beta_{k+1}, the right-hand side of (5.21) vanishes when i = k. For i \le k - 1 we need to collect a number of observations. Note first that our induction hypothesis for (5.19) implies that the directions p_0, p_1, \dots, p_k are conjugate, so we can apply Theorem 5.2 to deduce that

r_{k+1}^T p_i = 0, \quad \text{for } i = 0, 1, \dots, k.   (5.22)

Second, by repeatedly applying (5.18), we find that for i = 0, 1, \dots, k - 1, the following inclusion holds:

A p_i \in A \cdot \mathrm{span}\{r_0, A r_0, \dots, A^i r_0\} = \mathrm{span}\{A r_0, A^2 r_0, \dots, A^{i+1} r_0\} \subset \mathrm{span}\{p_0, p_1, \dots, p_{i+1}\}.   (5.23)

By combining (5.22) and (5.23), we deduce that

r_{k+1}^T A p_i = 0, \quad \text{for } i = 0, 1, \dots, k - 1,

so the first term in the right-hand side of (5.21) vanishes for i = 0, 1, \dots, k - 1. Because of the induction hypothesis for (5.19), the second term vanishes as well, and we conclude that p_{k+1}^T A p_i = 0, i = 0, 1, \dots, k. Hence, the induction argument holds for (5.19) also.
It follows that the direction set generated by the conjugate gradient method is indeed a conjugate direction set, so Theorem 5.1 tells us that the algorithm terminates in at most n iterations.
Finally, we prove (5.16) by a noninductive argument. Because the direction set is conjugate, we have from (5.11) that r_k^T p_i = 0 for all i = 0, 1, \dots, k - 1 and any k = 1, 2, \dots, n - 1. By rearranging (5.14e), we find that

p_i = -r_i + \beta_i p_{i-1},

so that r_i \in \mathrm{span}\{p_i, p_{i-1}\} for all i = 1, \dots, k - 1. We conclude that r_k^T r_i = 0 for all i = 1, \dots, k - 1. To complete the proof, we note that r_k^T r_0 = -r_k^T p_0 = 0, by definition of p_0 in Algorithm 5.1 and by (5.11).
The proof of this theorem relies on the fact that the first direction p_0 is the steepest descent direction -r_0; in fact, the result does not hold for other choices of p_0. Since the gradients r_k are mutually orthogonal, the term "conjugate gradient method" is actually a misnomer. It is the search directions, not the gradients, that are conjugate with respect to A.
A PRACTICAL FORM OF THE CONJUGATE GRADIENT METHOD
We can derive a slightly more economical form of the conjugate gradient method by using the results of Theorems 5.2 and 5.3. First, we can use (5.14e) and (5.11) to replace the formula (5.14a) for \alpha_k by

\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}.

Second, we have from (5.10) that \alpha_k A p_k = r_{k+1} - r_k, so by applying (5.14e) and (5.11) once again we can simplify the formula for \beta_{k+1} to

\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}.

By using these formulae together with (5.10), we obtain the following standard form of the conjugate gradient method.
Algorithm 5.2 (CG).
Given x_0;
Set r_0 \leftarrow A x_0 - b, p_0 \leftarrow -r_0, k \leftarrow 0;
while r_k \ne 0
  \alpha_k \leftarrow \dfrac{r_k^T r_k}{p_k^T A p_k};   (5.24a)
  x_{k+1} \leftarrow x_k + \alpha_k p_k;   (5.24b)
  r_{k+1} \leftarrow r_k + \alpha_k A p_k;   (5.24c)
  \beta_{k+1} \leftarrow \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k};   (5.24d)
  p_{k+1} \leftarrow -r_{k+1} + \beta_{k+1} p_k;   (5.24e)
  k \leftarrow k + 1;   (5.24f)
end (while)
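A direct transcription of Algorithm 5.2 into Python might look as follows; the tolerance-based stopping test replaces the exact r_k \ne 0 test, a standard concession to floating-point arithmetic, and the function name is illustrative.

```python
import numpy as np

def cg(A, b, x0, tol=1e-10):
    """Sketch of Algorithm 5.2 for symmetric positive definite A."""
    x = x0.astype(float).copy()
    r = A @ x - b                  # r_0 = A x_0 - b
    p = -r
    rr = r @ r
    for _ in range(len(b)):        # at most n steps in exact arithmetic
        if np.sqrt(rr) <= tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)      # (5.24a)
        x += alpha * p             # (5.24b)
        r += alpha * Ap            # (5.24c)
        rr_new = r @ r
        p = -r + (rr_new / rr) * p # (5.24d), (5.24e)
        rr = rr_new
    return x
```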
At any given point in Algorithm 5.2 we never need to know the vectors x, r, and
p for more than the last two iterations. Accordingly, implementations of this algorithm
overwrite old values of these vectors to save on storage. The major computational tasks to be
performed at each step are computation of the matrixvector product Apk, calculation of
the inner products pT Apk and rT rk1, and calculation of three vector sums. The inner k k1
product and vector sum operations can be performed in a small multiple of n floatingpoint operations, while the cost of the matrixvector product is, of course, dependent on the problem. The CG method is recommended only for large problems; otherwise, Gaussian elimination or other factorization algorithms such as the singular value decomposition are to be preferred, since they are less sensitive to rounding errors. For large problems, the CG method has the advantage that it does not alter the coefficient matrix and in contrast to factorization techniques does not produce fill in the arrays holding the matrix. Another key property is that the CG method sometimes approaches the solution quickly, as we discuss next.
RATE OF CONVERGENCE
We have seen that in exact arithmetic the conjugate gradient method will terminate at the solution in at most n iterations. What is more remarkable is that when the distribution of the eigenvalues of A has certain favorable features, the algorithm will identify the solution in many fewer than n iterations. To explain this property, we begin by viewing the expanding subspace minimization property proved in Theorem 5.2 in a slightly different way, using it to show that Algorithm 5.2 is optimal in a certain important sense.
From (5.24b) and (5.18), we have that

$$x_{k+1} = x_0 + \alpha_0 p_0 + \cdots + \alpha_k p_k = x_0 + \gamma_0 r_0 + \gamma_1 A r_0 + \cdots + \gamma_k A^k r_0, \tag{5.25}$$

for some constants $\gamma_i$. We now define $P_k^*(\cdot)$ to be a polynomial of degree $k$ with coefficients $\gamma_0, \gamma_1, \ldots, \gamma_k$. Like any polynomial, $P_k^*$ can take either a scalar or a square matrix as its argument. For the matrix argument $A$, we have

$$P_k^*(A) = \gamma_0 I + \gamma_1 A + \cdots + \gamma_k A^k,$$

which allows us to express (5.25) as follows:

$$x_{k+1} = x_0 + P_k^*(A) r_0. \tag{5.26}$$
We now show that among all possible methods whose first $k$ steps are restricted to the Krylov subspace $\mathcal{K}(r_0; k)$ given by (5.15), Algorithm 5.2 does the best job of minimizing the distance to the solution after $k$ steps, when this distance is measured by the weighted norm measure $\|\cdot\|_A$ defined by

$$\|z\|_A^2 = z^T A z. \tag{5.27}$$
Recall that this norm was used in the analysis of the steepest descent method of Chapter 3. Using this norm, the definition (5.2) of $\phi$, and the fact that $x^*$ minimizes $\phi$, it is easy to show that

$$\tfrac{1}{2} \|x - x^*\|_A^2 = \tfrac{1}{2} (x - x^*)^T A (x - x^*) = \phi(x) - \phi(x^*). \tag{5.28}$$
Theorem 5.2 states that $x_{k+1}$ minimizes $\phi$, and hence $\|x - x^*\|_A^2$, over the set $x_0 + \operatorname{span}\{p_0, p_1, \ldots, p_k\}$, which by (5.18) is the same as $x_0 + \operatorname{span}\{r_0, A r_0, \ldots, A^k r_0\}$. It follows from (5.26) that the polynomial $P_k^*$ solves the following problem, in which the minimum is taken over the space of all possible polynomials of degree $k$:

$$\min_{P_k} \|x_0 + P_k(A) r_0 - x^*\|_A. \tag{5.29}$$

We exploit this optimality property repeatedly in the remainder of the section. Since

$$r_0 = A x_0 - b = A x_0 - A x^* = A (x_0 - x^*),$$

we have that

$$x_{k+1} - x^* = x_0 + P_k^*(A) r_0 - x^* = [I + P_k^*(A) A](x_0 - x^*). \tag{5.30}$$
Let $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ be the eigenvalues of $A$, and let $v_1, v_2, \ldots, v_n$ be the corresponding orthonormal eigenvectors, so that

$$A = \sum_{i=1}^{n} \lambda_i v_i v_i^T.$$

Since the eigenvectors span the whole space $\mathbb{R}^n$, we can write

$$x_0 - x^* = \sum_{i=1}^{n} \xi_i v_i, \tag{5.31}$$

for some coefficients $\xi_i$. It is easy to show that any eigenvector of $A$ is also an eigenvector of $P_k(A)$ for any polynomial $P_k$. For our particular matrix $A$ and its eigenvalues $\lambda_i$ and eigenvectors $v_i$, we have

$$P_k(A) v_i = P_k(\lambda_i) v_i, \quad i = 1, 2, \ldots, n.$$

By substituting (5.31) into (5.30) we have

$$x_{k+1} - x^* = \sum_{i=1}^{n} [1 + \lambda_i P_k^*(\lambda_i)] \xi_i v_i.$$

By using the fact that $\|z\|_A^2 = z^T A z = \sum_{i=1}^{n} \lambda_i (v_i^T z)^2$, we have

$$\|x_{k+1} - x^*\|_A^2 = \sum_{i=1}^{n} \lambda_i [1 + \lambda_i P_k^*(\lambda_i)]^2 \xi_i^2. \tag{5.32}$$

Since the polynomial $P_k^*$ generated by the CG method is optimal with respect to this norm, we have

$$\|x_{k+1} - x^*\|_A^2 = \min_{P_k} \sum_{i=1}^{n} \lambda_i [1 + \lambda_i P_k(\lambda_i)]^2 \xi_i^2.$$

By extracting the largest of the terms $[1 + \lambda_i P_k(\lambda_i)]^2$ from this expression, we obtain that

$$\|x_{k+1} - x^*\|_A^2 \le \min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2 \left( \sum_{j=1}^{n} \lambda_j \xi_j^2 \right) = \min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2 \, \|x_0 - x^*\|_A^2, \tag{5.33}$$

where we have used the fact that $\|x_0 - x^*\|_A^2 = \sum_{j=1}^{n} \lambda_j \xi_j^2$.
The expression (5.33) allows us to quantify the convergence rate of the CG method by estimating the nonnegative scalar quantity

$$\min_{P_k} \max_{1 \le i \le n} [1 + \lambda_i P_k(\lambda_i)]^2. \tag{5.34}$$
In other words, we search for a polynomial Pk that makes this expression as small as possible. In some practical cases, we can find this polynomial explicitly and draw some interesting conclusions about the properties of the CG method. The following result is an example.
Theorem 5.4.
If A has only r distinct eigenvalues, then the CG iteration will terminate at the solution in at most r iterations.
PROOF. Suppose that the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ take on the $r$ distinct values $\tau_1 < \tau_2 < \cdots < \tau_r$. We define a polynomial $Q_r(\lambda)$ by

$$Q_r(\lambda) = \frac{(-1)^r}{\tau_1 \tau_2 \cdots \tau_r} (\lambda - \tau_1)(\lambda - \tau_2) \cdots (\lambda - \tau_r),$$

and note that $Q_r(\lambda_i) = 0$ for $i = 1, 2, \ldots, n$ and $Q_r(0) = 1$. From the latter observation, we deduce that $Q_r(\lambda) - 1$ is a polynomial of degree $r$ with a root at $\lambda = 0$, so by polynomial division, the function $\bar{P}_{r-1}$ defined by

$$\bar{P}_{r-1}(\lambda) = \frac{Q_r(\lambda) - 1}{\lambda}$$

is a polynomial of degree $r - 1$. By setting $k = r - 1$ in (5.34), we have

$$0 \le \min_{P_{r-1}} \max_{1 \le i \le n} [1 + \lambda_i P_{r-1}(\lambda_i)]^2 \le \max_{1 \le i \le n} [1 + \lambda_i \bar{P}_{r-1}(\lambda_i)]^2 = \max_{1 \le i \le n} Q_r^2(\lambda_i) = 0.$$

Hence, the constant in (5.34) is zero for the value $k = r - 1$, so we have by substituting into (5.33) that $\|x_r - x^*\|_A^2 = 0$, and therefore $x_r = x^*$, as claimed. $\square$
By using similar reasoning, Luenberger [195] establishes the following estimate, which gives a useful characterization of the behavior of the CG method.

Theorem 5.5.
If $A$ has eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, we have that

$$\|x_{k+1} - x^*\|_A^2 \le \left( \frac{\lambda_{n-k} - \lambda_1}{\lambda_{n-k} + \lambda_1} \right)^2 \|x_0 - x^*\|_A^2. \tag{5.35}$$
[Figure 5.3: Two clusters of eigenvalues: the $n - m$ smallest clustered around 1, and the $m$ largest, $\lambda_{n-m+1}, \ldots, \lambda_n$, lying well to the right.]
Without giving details of the proof, we describe how this result is obtained from (5.33). One selects a polynomial $\bar{P}_k$ of degree $k$ such that the polynomial $Q_{k+1}(\lambda) = 1 + \lambda \bar{P}_k(\lambda)$ has roots at the $k$ largest eigenvalues $\lambda_n, \lambda_{n-1}, \ldots, \lambda_{n-k+1}$, as well as at the midpoint between $\lambda_1$ and $\lambda_{n-k}$. It can be shown that the maximum value attained by $Q_{k+1}$ on the remaining eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{n-k}$ is precisely $(\lambda_{n-k} - \lambda_1)/(\lambda_{n-k} + \lambda_1)$.
We now illustrate how Theorem 5.5 can be used to predict the behavior of the CG method on specific problems. Suppose we have the situation plotted in Figure 5.3, where the eigenvalues of $A$ consist of $m$ large values, with the remaining $n - m$ smaller eigenvalues clustered around 1. If we define $\epsilon = \lambda_{n-m} - \lambda_1$, Theorem 5.5 tells us that after $m+1$ steps of the conjugate gradient algorithm, we have

$$\|x_{m+1} - x^*\|_A \approx \epsilon \|x_0 - x^*\|_A.$$

For a small value of $\epsilon$, we conclude that the CG iterates will provide a good estimate of the solution after only $m+1$ steps.
Figure 5.4 shows the behavior of CG on a problem of this type, which has five large eigenvalues with all the smaller eigenvalues clustered between 0.95 and 1.05, and compares this behavior with that of CG on a problem in which the eigenvalues satisfy some random distribution. In both cases, we plot the log of the error measure $\|x_k - x^*\|_A^2$ after each iteration.

For the problem with clustered eigenvalues, Theorem 5.5 predicts a sharp decrease in the error measure at iteration 6. Note, however, that this decrease was achieved one iteration earlier, illustrating the fact that Theorem 5.5 gives only an upper bound, and that the rate of convergence can be faster. By contrast, we observe in Figure 5.4 that for the problem with randomly distributed eigenvalues (dashed line), the convergence rate is slower and more uniform.

Figure 5.4 illustrates another interesting feature: After one more iteration (a total of seven) on the problem with clustered eigenvalues, the error measure drops sharply. An extension of the arguments leading to Theorem 5.4 explains this behavior. It is almost true to say that the matrix $A$ has just six distinct eigenvalues: the five large eigenvalues and 1. Then we would expect the error measure to be zero after six iterations. Because the eigenvalues near 1 are slightly spread out, however, the error does not become very small until iteration 7.
[Figure 5.4: Performance of the conjugate gradient method on (a) a problem in which five of the eigenvalues are large and the remainder are clustered near 1, and (b) a matrix with uniformly distributed eigenvalues. Vertical axis: $\log \|x - x^*\|_A^2$; horizontal axis: iterations 1 through 7.]
To state this claim more precisely, it is generally true that if the eigenvalues occur in $r$ distinct clusters, the CG iterates will approximately solve the problem in about $r$ steps (see [136]). This result can be proved by constructing a polynomial $\bar{P}_{r-1}$ such that $1 + \lambda \bar{P}_{r-1}(\lambda)$ has zeros inside each of the clusters. This polynomial may not vanish at the eigenvalues $\lambda_i$, $i = 1, 2, \ldots, n$, but its value will be small at these points, so the constant defined in (5.34) will be small for $k \ge r - 1$. We illustrate this behavior in Figure 5.5, which shows the performance of CG on a matrix of dimension $n = 14$ that has four clusters of eigenvalues: single eigenvalues at 140 and 120, a cluster of 10 eigenvalues very close to 10, with the remaining eigenvalues clustered between 0.95 and 1.05. After four iterations, the error has decreased significantly. After six iterations, the solution is identified to good accuracy.

[Figure 5.5: Performance of the conjugate gradient method on a matrix in which the eigenvalues occur in four distinct clusters.]
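As an illustration of this clustering effect, the following small NumPy experiment (our own construction, not from the text) builds a diagonal matrix with a spectrum of the kind just described and prints the squared A-norm error after each CG iteration; one should observe a substantial drop after about four iterations, roughly one per cluster:

```python
import numpy as np

rng = np.random.default_rng(0)
# Four clusters, n = 14: single eigenvalues 140 and 120, ten eigenvalues
# near 10, and the remaining two between 0.95 and 1.05.
eigs = np.concatenate([[140.0, 120.0],
                       10.0 + 0.01 * rng.standard_normal(10),
                       [0.95, 1.05]])
A = np.diag(eigs)
x_star = rng.standard_normal(14)
b = A @ x_star

x = np.zeros(14)
r = A @ x - b
p = -r
rr = r @ r
for k in range(7):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x = x + alpha * p
    r = r + alpha * Ap
    rr_new = r @ r
    p = -r + (rr_new / rr) * p
    rr = rr_new
    err = x - x_star
    print(k + 1, float(err @ (A @ err)))   # squared A-norm of the error
```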
Another, more approximate, convergence expression for CG is based on the Euclidean condition number of $A$, which is defined by

$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2 = \lambda_n / \lambda_1.$$

It can be shown that

$$\|x_k - x^*\|_A \le 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k \|x_0 - x^*\|_A. \tag{5.36}$$

This bound often gives a large overestimate of the error, but it can be useful in those cases
where the only information we have about $A$ is estimates of the extreme eigenvalues $\lambda_1$ and $\lambda_n$. This bound should be compared with that of the steepest descent method given by (3.29), which is identical in form but which depends on the condition number $\kappa(A)$, and not on its square root $\sqrt{\kappa(A)}$.
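As a quick aside, the bound (5.36) is easy to evaluate when estimates of the extreme eigenvalues are available; the following helper (a sketch with our own hypothetical name) returns the factor by which the A-norm error is guaranteed to have decreased after $k$ iterations:

```python
import math

def cg_error_bound(lam_min, lam_max, k):
    """Right-hand-side factor in (5.36), from extreme eigenvalue estimates."""
    sqrt_kappa = math.sqrt(lam_max / lam_min)  # sqrt of the condition number
    return 2.0 * ((sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)) ** k
```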
PRECONDITIONING
We can accelerate the conjugate gradient method by transforming the linear system to improve the eigenvalue distribution of $A$. The key to this process, which is known as preconditioning, is a change of variables from $x$ to $\hat{x}$ via a nonsingular matrix $C$, that is,

$$\hat{x} = C x. \tag{5.37}$$

The quadratic $\phi$ defined by (5.2) is transformed accordingly to

$$\hat{\phi}(\hat{x}) = \tfrac{1}{2} \hat{x}^T (C^{-T} A C^{-1}) \hat{x} - (C^{-T} b)^T \hat{x}. \tag{5.38}$$

If we use Algorithm 5.2 to minimize $\hat{\phi}$ or, equivalently, to solve the linear system

$$C^{-T} A C^{-1} \hat{x} = C^{-T} b,$$

then the convergence rate will depend on the eigenvalues of the matrix $C^{-T} A C^{-1}$ rather than those of $A$. Therefore, we aim to choose $C$ such that the eigenvalues of $C^{-T} A C^{-1}$
are more favorable for the convergence theory discussed above. We can try to choose $C$ such that the condition number of $C^{-T} A C^{-1}$ is much smaller than the original condition number of $A$, for instance, so that the constant in (5.36) is smaller. We could also try to choose $C$ such that the eigenvalues of $C^{-T} A C^{-1}$ are clustered, which by the discussion of the previous section ensures that the number of iterates needed to find a good approximate solution is not much larger than the number of clusters.
It is not necessary to carry out the transformation (5.37) explicitly. Rather, we can apply Algorithm 5.2 to the problem (5.38), in terms of the variables $\hat{x}$, and then invert the transformations to re-express all the equations in terms of $x$. This process of derivation results in Algorithm 5.3 (Preconditioned Conjugate Gradient), which we now define. It happens that Algorithm 5.3 does not make use of $C$ explicitly, but rather the matrix $M = C^T C$, which is symmetric and positive definite by construction.
Algorithm 5.3 (Preconditioned CG).
Given $x_0$, preconditioner $M$;
Set $r_0 \leftarrow A x_0 - b$;
Solve $M y_0 = r_0$ for $y_0$;
Set $p_0 \leftarrow -y_0$, $k \leftarrow 0$;
while $r_k \ne 0$
  $\alpha_k \leftarrow \dfrac{r_k^T y_k}{p_k^T A p_k}$; (5.39a)
  $x_{k+1} \leftarrow x_k + \alpha_k p_k$; (5.39b)
  $r_{k+1} \leftarrow r_k + \alpha_k A p_k$; (5.39c)
  Solve $M y_{k+1} = r_{k+1}$; (5.39d)
  $\beta_{k+1} \leftarrow \dfrac{r_{k+1}^T y_{k+1}}{r_k^T y_k}$; (5.39e)
  $p_{k+1} \leftarrow -y_{k+1} + \beta_{k+1} p_k$; (5.39f)
  $k \leftarrow k + 1$; (5.39g)
end (while)
If we set $M = I$ in Algorithm 5.3, we recover the standard CG method, Algorithm 5.2. The properties of Algorithm 5.2 generalize to this case in interesting ways. In particular, the orthogonality property (5.16) of the successive residuals becomes

$$r_i^T M^{-1} r_j = 0 \quad \text{for all } i \ne j. \tag{5.40}$$
In terms of computational effort, the main difference between the preconditioned and unpreconditioned CG methods is the need to solve systems of the form $M y = r$ (step (5.39d)).
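A minimal NumPy sketch of Algorithm 5.3 follows; the preconditioner is supplied as a routine `solve_M` that returns the solution of $M y = r$ (our own interface choice, for illustration):

```python
import numpy as np

def preconditioned_cg(A, b, x0, solve_M, tol=1e-10, maxiter=None):
    """Sketch of Algorithm 5.3; solve_M(r) must return y solving M y = r."""
    x = x0.copy()
    r = A @ x - b                   # r_0 = A x_0 - b
    y = solve_M(r)                  # M y_0 = r_0
    p = -y
    ry = r @ y
    if maxiter is None:
        maxiter = len(b)
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = ry / (p @ Ap)       # (5.39a)
        x = x + alpha * p           # (5.39b)
        r = r + alpha * Ap          # (5.39c)
        y = solve_M(r)              # (5.39d)
        ry_new = r @ y
        p = -y + (ry_new / ry) * p  # (5.39e), (5.39f)
        ry = ry_new
    return x

# Example choice: Jacobi preconditioning with M = diag(A), solved elementwise:
#   solve_M = lambda r: r / np.diag(A)
```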
PRACTICAL PRECONDITIONERS
No single preconditioning strategy is best for all conceivable types of matrices: The tradeoff among the various objectives (effectiveness of $M$, inexpensive computation and storage of $M$, inexpensive solution of $M y = r$) varies from problem to problem.
Good preconditioning strategies have been devised for specific types of matrices, in particular, those arising from discretizations of partial differential equations (PDEs). Often, the preconditioner is defined in such a way that the system $M y = r$ amounts to a simplified version of the original system $A x = b$. In the case of a PDE, $M y = r$ could represent a coarser discretization of the underlying continuous problem than $A x = b$. As in many other areas of optimization and numerical analysis, knowledge about the structure and origin of a problem (in this case, knowledge that the system $A x = b$ is a finite-dimensional representation of a PDE) is the key to devising effective techniques for solving the problem.
General-purpose preconditioners have also been proposed, but their success varies greatly from problem to problem. The most important strategies of this type include symmetric successive overrelaxation (SSOR), incomplete Cholesky, and banded preconditioners. (See [272], [136], and [72] for discussions of these techniques.) Incomplete Cholesky is probably the most effective in general. The basic idea is simple: We follow the Cholesky procedure, but instead of computing the exact Cholesky factor $L$ that satisfies $A = L L^T$, we compute an approximate factor $\tilde{L}$ that is sparser than $L$. Usually, we require $\tilde{L}$ to be no denser (or not much denser) than the lower triangle of the original matrix $A$. We then have $A \approx \tilde{L} \tilde{L}^T$, and by choosing $C = \tilde{L}^T$, we obtain $M = \tilde{L} \tilde{L}^T$ and

$$C^{-T} A C^{-1} = \tilde{L}^{-1} A \tilde{L}^{-T} \approx I,$$

so the eigenvalue distribution of $C^{-T} A C^{-1}$ is favorable. We do not compute $M$ explicitly, but rather store the factor $\tilde{L}$ and solve the system $M y = r$ by performing two triangular substitutions with $\tilde{L}$. Because the sparsity of $\tilde{L}$ is similar to that of $A$, the cost of solving $M y = r$ is similar to the cost of computing the matrix-vector product $A p$.
There are several possible pitfalls in the incomplete Cholesky approach. One is that the resulting matrix may not be (sufficiently) positive definite, and in this case one may need to increase the values of the diagonal elements to ensure that a value for $\tilde{L}$ can be found. Numerical instability or breakdown can occur during the incomplete factorization because of the sparsity conditions we impose on the factor $\tilde{L}$. This difficulty can be remedied by allowing additional fill-in in $\tilde{L}$, but the denser factor will be more expensive to compute and to apply at each iteration.
5.2 NONLINEAR CONJUGATE GRADIENT METHODS
We have noted that the CG method, Algorithm 5.2, can be viewed as a minimization algorithm for the convex quadratic function $\phi$ defined by (5.2). It is natural to ask whether we can adapt the approach to minimize general convex functions, or even general nonlinear functions $f$. In fact, as we show in this section, nonlinear variants of the conjugate gradient method are well studied and have proved to be quite successful in practice.
THE FLETCHER-REEVES METHOD
Fletcher and Reeves [107] showed how to extend the conjugate gradient method to nonlinear functions by making two simple changes in Algorithm 5.2. First, in place of the formula (5.24a) for the step length $\alpha_k$ (which minimizes $\phi$ along the search direction $p_k$), we need to perform a line search that identifies an approximate minimum of the nonlinear function $f$ along $p_k$. Second, the residual $r$, which is simply the gradient of $\phi$ in Algorithm 5.2 (see (5.3)), must be replaced by the gradient of the nonlinear objective $f$. These changes give rise to the following algorithm for nonlinear optimization.
Algorithm 5.4 (FR).
Given $x_0$;
Evaluate $f_0 = f(x_0)$, $\nabla f_0 = \nabla f(x_0)$;
Set $p_0 \leftarrow -\nabla f_0$, $k \leftarrow 0$;
while $\nabla f_k \ne 0$
  Compute $\alpha_k$ and set $x_{k+1} = x_k + \alpha_k p_k$;
  Evaluate $\nabla f_{k+1}$;
  $\beta_{k+1}^{FR} \leftarrow \dfrac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}$; (5.41a)
  $p_{k+1} \leftarrow -\nabla f_{k+1} + \beta_{k+1}^{FR} p_k$; (5.41b)
  $k \leftarrow k + 1$; (5.41c)
end (while)
If we choose $f$ to be a strongly convex quadratic and $\alpha_k$ to be the exact minimizer, this algorithm reduces to the linear conjugate gradient method, Algorithm 5.2. Algorithm 5.4 is appealing for large nonlinear optimization problems because each iteration requires only evaluation of the objective function and its gradient. No matrix operations are required for the step computation, and just a few vectors of storage are required.
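A minimal NumPy sketch of Algorithm 5.4 follows, using SciPy's strong Wolfe line search (the required conditions are restated below); the restart fallback when the line search fails is our own addition, not part of the algorithm:

```python
import numpy as np
from scipy.optimize import line_search   # strong Wolfe line search

def fletcher_reeves(f, grad, x0, tol=1e-5, maxiter=1000):
    """Sketch of Algorithm 5.4 (FR); c2 = 0.1 < 1/2, as required by (5.43)."""
    x = x0.copy()
    g = grad(x)
    p = -g
    for _ in range(maxiter):
        if np.linalg.norm(g) < tol:
            break
        alpha = line_search(f, grad, x, p, gfk=g, c2=0.1)[0]
        if alpha is None:            # line search failed: restart along -g
            p = -g
            continue
        x = x + alpha * p
        g_new = grad(x)
        beta_fr = (g_new @ g_new) / (g @ g)   # (5.41a)
        p = -g_new + beta_fr * p              # (5.41b)
        g = g_new
    return x
```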
To make the specification of Algorithm 5.4 complete, we need to be more precise about the choice of the line search parameter $\alpha_k$. Because of the second term in (5.41b), the search direction $p_k$ may fail to be a descent direction unless $\alpha_k$ satisfies certain conditions.
By taking the inner product of (5.41b) (with $k$ replacing $k+1$) with the gradient vector $\nabla f_k$, we obtain

$$\nabla f_k^T p_k = -\|\nabla f_k\|^2 + \beta_k^{FR} \nabla f_k^T p_{k-1}. \tag{5.42}$$
If the line search is exact, so that $\alpha_{k-1}$ is a local minimizer of $f$ along the direction $p_{k-1}$, we have that $\nabla f_k^T p_{k-1} = 0$. In this case we have from (5.42) that $\nabla f_k^T p_k < 0$, so that $p_k$ is indeed a descent direction. If the line search is not exact, however, the second term in (5.42) may dominate the first term, and we may have $\nabla f_k^T p_k > 0$, implying that $p_k$ is actually a direction of ascent. Fortunately, we can avoid this situation by requiring the step length $\alpha_k$ to satisfy the strong Wolfe conditions, which we restate here:
$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \tag{5.43a}$$
$$|\nabla f(x_k + \alpha_k p_k)^T p_k| \le -c_2 \nabla f_k^T p_k, \tag{5.43b}$$

where $0 < c_1 < c_2 < \frac{1}{2}$. Note that we impose $c_2 < \frac{1}{2}$ here, in place of the looser condition $c_2 < 1$ that was used in the earlier statement (3.7). By applying Lemma 5.6 below, we can show that condition (5.43b) implies that (5.42) is negative, and we conclude that any line search procedure that yields an $\alpha_k$ satisfying (5.43) will ensure that all directions $p_k$ are descent directions for the function $f$.
THE POLAK-RIBIÈRE METHOD AND VARIANTS
There are many variants of the Fletcher-Reeves method that differ from each other mainly in the choice of the parameter $\beta_k$. An important variant, proposed by Polak and Ribière, defines this parameter as follows:

$$\beta_{k+1}^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\|\nabla f_k\|^2}. \tag{5.44}$$
We refer to the algorithm in which (5.44) replaces (5.41a) as Algorithm PR. It is identical to Algorithm FR when $f$ is a strongly convex quadratic function and the line search is exact, since by (5.16) the gradients are mutually orthogonal, and so $\beta_{k+1}^{PR} = \beta_{k+1}^{FR}$. When applied to general nonlinear functions with inexact line searches, however, the behavior of the two algorithms differs markedly. Numerical experience indicates that Algorithm PR tends to be the more robust and efficient of the two.
A surprising fact about Algorithm PR is that the strong Wolfe conditions (5.43) do not guarantee that $p_k$ is always a descent direction. If we define the $\beta$ parameter as

$$\beta_{k+1}^{+} = \max\{\beta_{k+1}^{PR},\, 0\}, \tag{5.45}$$

giving rise to an algorithm we call Algorithm PR+, then a simple adaptation of the strong Wolfe conditions ensures that the descent property holds.
There are many other choices for $\beta_{k+1}$ that coincide with the Fletcher-Reeves formula $\beta_{k+1}^{FR}$ in the case where the objective is quadratic and the line search is exact. The Hestenes-Stiefel formula, which defines

$$\beta_{k+1}^{HS} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{(\nabla f_{k+1} - \nabla f_k)^T p_k}, \tag{5.46}$$

gives rise to an algorithm (called Algorithm HS) that is similar to Algorithm PR, both in terms of its theoretical convergence properties and in its practical performance. Formula (5.46) can be derived by demanding that consecutive search directions be conjugate with respect to the average Hessian over the line segment $[x_k, x_{k+1}]$, which is defined as

$$\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k) \, d\tau.$$

Recalling from Taylor's theorem (Theorem 2.1) that $\nabla f_{k+1} = \nabla f_k + \alpha_k \bar{G}_k p_k$, we see that for any direction of the form $p_{k+1} = -\nabla f_{k+1} + \beta_{k+1} p_k$, the condition $p_{k+1}^T \bar{G}_k p_k = 0$ requires $\beta_{k+1}$ to be given by (5.46).

Later, we see that it is possible to guarantee global convergence for any parameter $\beta_k$ satisfying the bound

$$|\beta_k| \le \beta_k^{FR}, \tag{5.47}$$

for all $k \ge 2$. This fact suggests the following modification of the PR method, which has performed well on some applications. For all $k \ge 2$ let

$$\beta_k = \begin{cases} -\beta_k^{FR} & \text{if } \beta_k^{PR} < -\beta_k^{FR}, \\ \beta_k^{PR} & \text{if } |\beta_k^{PR}| \le \beta_k^{FR}, \\ \beta_k^{FR} & \text{if } \beta_k^{PR} > \beta_k^{FR}. \end{cases} \tag{5.48}$$

The algorithm based on this strategy will be denoted by FR-PR.
Other variants of the CG method have recently been proposed. Two choices for $\beta_{k+1}$ that possess attractive theoretical and computational properties are

$$\beta_{k+1} = \frac{\|\nabla f_{k+1}\|^2}{(\nabla f_{k+1} - \nabla f_k)^T p_k} \tag{5.49}$$

(see [85]) and

$$\beta_{k+1} = \left( y_k - 2 p_k \frac{\|y_k\|^2}{y_k^T p_k} \right)^T \frac{\nabla f_{k+1}}{y_k^T p_k}, \quad \text{with } y_k = \nabla f_{k+1} - \nabla f_k \tag{5.50}$$

(see [161]). These two choices guarantee that $p_k$ is a descent direction, provided the step length $\alpha_k$ satisfies the Wolfe conditions. The CG algorithms based on (5.49) or (5.50) appear to be competitive with the Polak-Ribière method.
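For reference, the various choices of $\beta$ discussed in this section can be written as small functions of the consecutive gradients and the search direction (our own phrasing, assuming NumPy arrays; each formula is the one cited above):

```python
def beta_fr(g, g_new, p):        # Fletcher-Reeves, (5.41a)
    return (g_new @ g_new) / (g @ g)

def beta_pr(g, g_new, p):        # Polak-Ribiere, (5.44)
    return g_new @ (g_new - g) / (g @ g)

def beta_pr_plus(g, g_new, p):   # PR+, (5.45)
    return max(beta_pr(g, g_new, p), 0.0)

def beta_hs(g, g_new, p):        # Hestenes-Stiefel, (5.46)
    y = g_new - g
    return (g_new @ y) / (y @ p)

def beta_dy(g, g_new, p):        # formula (5.49), see [85]
    return (g_new @ g_new) / ((g_new - g) @ p)

def beta_hz(g, g_new, p):        # formula (5.50), see [161]
    y = g_new - g
    yp = y @ p
    return (y - 2.0 * p * (y @ y) / yp) @ g_new / yp
```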
QUADRATIC TERMINATION AND RESTARTS
Implementations of nonlinear conjugate gradient methods usually preserve their close connections with the linear conjugate gradient method. Usually, a quadratic (or cubic) interpolation along the search direction $p_k$ is incorporated into the line search procedure; see Chapter 3. This feature guarantees that when $f$ is a strictly convex quadratic, the step length $\alpha_k$ is chosen to be the exact one-dimensional minimizer, so that the nonlinear conjugate gradient method reduces to the linear method, Algorithm 5.2.
Another modification that is often used in nonlinear conjugate gradient procedures is to restart the iteration every $n$ steps by setting $\beta_k = 0$ in (5.41a), that is, by taking a steepest descent step. Restarting serves to periodically refresh the algorithm, erasing old information that may not be beneficial. We can even prove a strong theoretical result about restarting: It leads to $n$-step quadratic convergence, that is,

$$\|x_{k+n} - x^*\| = O\big(\|x_k - x^*\|^2\big). \tag{5.51}$$
After a little thought, this result is not so surprising. Consider a function $f$ that is a strongly convex quadratic in a neighborhood of the solution, but is non-quadratic everywhere else. Assuming that the algorithm is converging to the solution in question, the iterates will eventually enter the quadratic region. At some point, the algorithm will be restarted in that region, and from that point onward, its behavior will simply be that of the linear conjugate gradient method, Algorithm 5.2. In particular, finite termination will occur within $n$ steps of the restart. The restart is important, because the finite-termination property and other appealing properties of Algorithm 5.2 hold only when its initial search direction $p_0$ is equal to the negative gradient.

Even if the function $f$ is not exactly quadratic in the region of a solution, Taylor's theorem (Theorem 2.1) implies that it can still be approximated quite closely by a quadratic, provided that it is smooth. Therefore, while we would not expect termination in $n$ steps after the restart, it is not surprising that substantial progress is made toward the solution, as indicated by the expression (5.51).
Though the result (5.51) is interesting from a theoretical viewpoint, it may not be relevant in a practical context, because nonlinear conjugate gradient methods can be recommended only for solving problems with large $n$. Restarts may never occur in such problems because an approximate solution may be located in fewer than $n$ steps. Hence, nonlinear CG methods are sometimes implemented without restarts, or else they include strategies for restarting that are based on considerations other than iteration counts. The most popular restart strategy makes use of the observation (5.16), which is that the gradients are mutually orthogonal when $f$ is a quadratic function. A restart is performed whenever two consecutive gradients are far from orthogonal, as measured by the test

$$\frac{|\nabla f_k^T \nabla f_{k-1}|}{\|\nabla f_k\|^2} \ge \nu, \tag{5.52}$$

where a typical value for the parameter $\nu$ is 0.1.
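In code, the test (5.52) is a one-liner inside the CG loop; the helper below is a hypothetical sketch assuming NumPy gradient vectors:

```python
def should_restart(g_prev, g, nu=0.1):
    """Restart test (5.52): consecutive gradients far from orthogonal."""
    return abs(g @ g_prev) / (g @ g) >= nu
```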
We could also think of formula (5.45) as a restarting strategy, because $p_{k+1}$ will revert to the steepest descent direction whenever $\beta_k^{PR}$ is negative. In contrast to (5.52), these restarts are rather infrequent because $\beta_k^{PR}$ is positive most of the time.
BEHAVIOR OF THE FLETCHER-REEVES METHOD

We now investigate the Fletcher-Reeves algorithm, Algorithm 5.4, a little more closely, proving that it is globally convergent and explaining some of its observed inefficiencies.
The following result gives conditions on the line search under which all search directions are descent directions. It assumes that the level set $\mathcal{L} = \{x : f(x) \le f(x_0)\}$ is bounded and that $f$ is twice continuously differentiable, so that we have from Lemma 3.1 that there exists a step length $\alpha_k$ satisfying the strong Wolfe conditions.
Lemma 5.6.
Suppose that Algorithm 5.4 is implemented with a step length $\alpha_k$ that satisfies the strong Wolfe conditions (5.43) with $0 < c_2 < \frac{1}{2}$. Then the method generates descent directions $p_k$ that satisfy the following inequalities:

$$-\frac{1}{1 - c_2} \le \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2} \le \frac{2 c_2 - 1}{1 - c_2}, \quad \text{for all } k = 0, 1, \ldots. \tag{5.53}$$
PROOF. Note first that the function $t(\xi) \stackrel{\mathrm{def}}{=} (2\xi - 1)/(1 - \xi)$ is monotonically increasing on the interval $[0, \frac{1}{2}]$ and that $t(0) = -1$ and $t(\frac{1}{2}) = 0$. Hence, because of $c_2 \in (0, \frac{1}{2})$, we have

$$-1 < \frac{2 c_2 - 1}{1 - c_2} < 0. \tag{5.54}$$

The descent condition $\nabla f_k^T p_k < 0$ follows immediately once we establish (5.53).

The proof is by induction. For $k = 0$, the middle term in (5.53) is $-1$, so by using (5.54), we see that both inequalities in (5.53) are satisfied. Next, assume that (5.53) holds for some $k \ge 1$. From (5.41b) and (5.41a) we have

$$\frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} = -1 + \beta_{k+1}^{FR} \frac{\nabla f_{k+1}^T p_k}{\|\nabla f_{k+1}\|^2} = -1 + \frac{\nabla f_{k+1}^T p_k}{\|\nabla f_k\|^2}. \tag{5.55}$$
By using the line search condition (5.43b), we have

$$|\nabla f_{k+1}^T p_k| \le -c_2 \nabla f_k^T p_k,$$

so by combining with (5.55) and recalling (5.41a), we obtain

$$-1 + c_2 \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2} \le \frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} \le -1 - c_2 \frac{\nabla f_k^T p_k}{\|\nabla f_k\|^2}.$$

Substituting for the term $\nabla f_k^T p_k / \|\nabla f_k\|^2$ from the left-hand side of the induction hypothesis (5.53), we obtain

$$-\frac{1}{1 - c_2} \le \frac{\nabla f_{k+1}^T p_{k+1}}{\|\nabla f_{k+1}\|^2} \le -1 + \frac{c_2}{1 - c_2} = \frac{2 c_2 - 1}{1 - c_2},$$

which shows that (5.53) holds for $k + 1$ as well. $\square$

This result used only the second strong Wolfe condition (5.43b); the first Wolfe condition (5.43a) will be needed in the next section to establish global convergence. The bounds on $\nabla f_k^T p_k$ in (5.53) impose a limit on how fast the norms of the steps $\|p_k\|$ can grow, and they will play a crucial role in the convergence analysis given below.

Lemma 5.6 can also be used to explain a weakness of the Fletcher-Reeves method. We will argue that if the method generates a bad direction and a tiny step, then the next direction and next step are also likely to be poor. As in Chapter 3, we let $\theta_k$ denote the angle between $p_k$ and the steepest descent direction $-\nabla f_k$, defined by

$$\cos \theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\| \, \|p_k\|}. \tag{5.56}$$

Suppose that $p_k$ is a poor search direction, in the sense that it makes an angle of nearly $90^\circ$ with $-\nabla f_k$, that is, $\cos \theta_k \approx 0$. By multiplying both sides of (5.53) by $\|\nabla f_k\| / \|p_k\|$ and using (5.56), we obtain

$$\frac{1 - 2 c_2}{1 - c_2} \, \frac{\|\nabla f_k\|}{\|p_k\|} \le \cos \theta_k \le \frac{1}{1 - c_2} \, \frac{\|\nabla f_k\|}{\|p_k\|}, \quad \text{for all } k = 0, 1, \ldots. \tag{5.57}$$

From these inequalities, we deduce that $\cos \theta_k \approx 0$ if and only if $\|\nabla f_k\| \ll \|p_k\|$.

Since $p_k$ is almost orthogonal to the gradient, it is likely that the step from $x_k$ to $x_{k+1}$ is tiny, that is, $x_{k+1} \approx x_k$. If so, we have $\nabla f_{k+1} \approx \nabla f_k$, and therefore

$$\beta_{k+1}^{FR} \approx 1, \tag{5.58}$$
by the definition (5.41a). By using this approximation together with $\|\nabla f_{k+1}\| \approx \|\nabla f_k\| \ll \|p_k\|$ in (5.41b), we conclude that

$$p_{k+1} \approx p_k,$$

so the new search direction will improve little (if at all) on the previous one. It follows that if the condition $\cos \theta_k \approx 0$ holds at some iteration $k$ and if the subsequent step is small, a long sequence of unproductive iterates will follow.
The Polak-Ribière method behaves quite differently in these circumstances. If, as in the previous paragraph, the search direction $p_k$ satisfies $\cos \theta_k \approx 0$ for some $k$, and if the subsequent step is small, it follows by substituting $\nabla f_k \approx \nabla f_{k+1}$ into (5.44) that $\beta_{k+1}^{PR} \approx 0$. From the formula (5.41b), we find that the new search direction $p_{k+1}$ will be close to the steepest descent direction $-\nabla f_{k+1}$, and $\cos \theta_{k+1}$ will be close to 1. Therefore, Algorithm PR essentially performs a restart after it encounters a bad direction. The same argument can be applied to Algorithms PR+ and HS. For the FR-PR variant, defined by (5.48), we have noted already that $\beta_{k+1}^{FR} \approx 1$ and $\beta_{k+1}^{PR} \approx 0$. The formula (5.48) thus sets $\beta_{k+1} = \beta_{k+1}^{PR}$, as desired. Thus, the modification (5.48) seems to avoid the inefficiencies of the FR method, while falling back on this method for global convergence.
The undesirable behavior of the Fletcher-Reeves method predicted by the arguments given above can be observed in practice. For example, the paper [123] describes a problem with $n = 100$ in which $\cos \theta_k$ is of order $10^{-2}$ for hundreds of iterations and the steps $\|x_k - x_{k-1}\|$ are of order $10^{-2}$. Algorithm FR requires thousands of iterations to solve this problem, while Algorithm PR requires just 37 iterations. In this example, the Fletcher-Reeves method performs much better if it is periodically restarted along the steepest descent direction, since each restart terminates the cycle of bad steps. In general, Algorithm FR should not be implemented without some kind of restart strategy.
GLOBAL CONVERGENCE
Unlike the linear conjugate gradient method, whose convergence properties are well understood and which is known to be optimal as described above, nonlinear conjugate gradient methods possess surprising, sometimes bizarre, convergence properties. We now present a few of the main results known for the Fletcher-Reeves and Polak-Ribière methods using practical line searches.
For the purposes of this section, we make the following nonrestrictive assumptions on the objective function.
Assumptions 5.1.
(i) The level set $\mathcal{L} := \{x : f(x) \le f(x_0)\}$ is bounded;

(ii) In some open neighborhood $\mathcal{N}$ of $\mathcal{L}$, the objective function $f$ is Lipschitz continuously differentiable.
These assumptions imply that there is a constant $\bar{\gamma}$ such that

$$\|\nabla f(x)\| \le \bar{\gamma}, \quad \text{for all } x \in \mathcal{L}. \tag{5.59}$$

Our main analytical tool in this section is Zoutendijk's theorem (Theorem 3.2 in Chapter 3). It states that, under Assumptions 5.1, any line search iteration of the form $x_{k+1} = x_k + \alpha_k p_k$, where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (5.43), gives the limit

$$\sum_{k=0}^{\infty} \cos^2 \theta_k \, \|\nabla f_k\|^2 < \infty. \tag{5.60}$$
We can use this result to prove global convergence for algorithms that are periodically restarted by setting $\beta_k = 0$. If $k_1, k_2$, and so on denote the iterations on which restarts occur, we have from (5.60) that

$$\sum_{k = k_1, k_2, \ldots} \|\nabla f_k\|^2 < \infty. \tag{5.61}$$

If we allow no more than $n$ iterations between restarts, the sequence $\{k_j\}_{j=1}^{\infty}$ is infinite, and from (5.61) we have that $\lim_{j \to \infty} \|\nabla f_{k_j}\| = 0$. That is, a subsequence of gradients approaches zero, or equivalently,

$$\liminf_{k \to \infty} \|\nabla f_k\| = 0. \tag{5.62}$$
This result applies equally to restarted versions of all the algorithms discussed in this chapter. It is more interesting, however, to study the global convergence of unrestarted conjugate gradient methods, because for large problems (say $n \ge 1000$) we expect to find a solution in many fewer than $n$ iterations, the first point at which a regular restart would take place. Our study of large sequences of unrestarted conjugate gradient iterations reveals some surprising patterns in their behavior.
We can build on Lemma 5.6 and Zoutendijk's result (5.60) to prove a global convergence result for the Fletcher-Reeves method. While we cannot show that the limit of the sequence of gradients $\{\nabla f_k\}$ is zero, the following result shows that this sequence is not bounded away from zero.
Theorem 5.7 (Al-Baali [3]).
Suppose that Assumptions 5.1 hold, and that Algorithm 5.4 is implemented with a line search that satisfies the strong Wolfe conditions (5.43), with $0 < c_1 < c_2 < \frac{1}{2}$. Then

$$\liminf_{k \to \infty} \|\nabla f_k\| = 0. \tag{5.63}$$
PROOF. The proof is by contradiction. It assumes that the opposite of (5.63) holds, that is, there is a constant $\gamma > 0$ such that

$$\|\nabla f_k\| \ge \gamma, \tag{5.64}$$

for all $k$ sufficiently large. By substituting the left inequality of (5.57) into Zoutendijk's condition (5.60), we obtain

$$\sum_{k=0}^{\infty} \frac{\|\nabla f_k\|^4}{\|p_k\|^2} < \infty. \tag{5.65}$$
By using (5.43b) and (5.53), we obtain that

$$|\nabla f_k^T p_{k-1}| \le -c_2 \nabla f_{k-1}^T p_{k-1} \le \frac{c_2}{1 - c_2} \|\nabla f_{k-1}\|^2. \tag{5.66}$$

Thus, from (5.41b) and recalling the definition (5.41a) of $\beta_k^{FR}$, we obtain

$$\|p_k\|^2 \le \|\nabla f_k\|^2 + 2 \beta_k^{FR} |\nabla f_k^T p_{k-1}| + (\beta_k^{FR})^2 \|p_{k-1}\|^2 \le \|\nabla f_k\|^2 + \frac{2 c_2}{1 - c_2} \beta_k^{FR} \|\nabla f_{k-1}\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2 = \frac{1 + c_2}{1 - c_2} \|\nabla f_k\|^2 + (\beta_k^{FR})^2 \|p_{k-1}\|^2.$$

Applying this relation repeatedly, and defining $c_3 \stackrel{\mathrm{def}}{=} (1 + c_2)/(1 - c_2) \ge 1$, we have

$$\|p_k\|^2 \le c_3 \|\nabla f_k\|^2 + (\beta_k^{FR})^2 \left[ c_3 \|\nabla f_{k-1}\|^2 + (\beta_{k-1}^{FR})^2 \|p_{k-2}\|^2 \right] \le \cdots \le c_3 \|\nabla f_k\|^4 \sum_{j=0}^{k} \|\nabla f_j\|^{-2}, \tag{5.67}$$

where we used the facts that

$$(\beta_k^{FR})^2 (\beta_{k-1}^{FR})^2 \cdots (\beta_{k-i}^{FR})^2 = \frac{\|\nabla f_k\|^4}{\|\nabla f_{k-i-1}\|^4}$$

and $\|p_0\|^2 = \|\nabla f_0\|^2$. By using the bounds (5.59) and (5.64) in (5.67), we obtain

$$\|p_k\|^2 \le \frac{c_3 \bar{\gamma}^4}{\gamma^2} \, k, \tag{5.68}$$
which implies that

$$\sum_{k=1}^{\infty} \frac{1}{\|p_k\|^2} \ge \gamma_4 \sum_{k=1}^{\infty} \frac{1}{k}, \tag{5.69}$$

for some positive constant $\gamma_4$.

On the other hand, from (5.64) and (5.65), we have that

$$\sum_{k=1}^{\infty} \frac{1}{\|p_k\|^2} < \infty. \tag{5.70}$$

However, if we combine this inequality with (5.69), we obtain that $\sum_{k=1}^{\infty} 1/k < \infty$, which is not true. Hence, (5.64) does not hold, and the claim (5.63) is proved. $\square$
This global convergence result can be extended to any choice of $\beta_k$ satisfying (5.47), and in particular to the FR-PR method given by (5.48).
In general, if we can show that there exist constants $c_4, c_5 > 0$ such that

$$\cos \theta_k \ge c_4 \frac{\|\nabla f_k\|}{\|p_k\|}, \qquad \frac{\|\nabla f_k\|}{\|p_k\|} \ge c_5 > 0, \qquad k = 1, 2, \ldots,$$

it follows from (5.60) that

$$\lim_{k \to \infty} \|\nabla f_k\| = 0.$$

In fact, this result can be established for the Polak-Ribière method under the assumption that $f$ is strongly convex and that an exact line search is used.
For general (nonconvex) functions, however, it is not possible to prove a result like Theorem 5.7 for Algorithm PR. This fact is unexpected, since the Polak-Ribière method performs better in practice than the Fletcher-Reeves method. The following surprising result shows that the Polak-Ribière method can cycle infinitely without approaching a solution point, even if an ideal line search is used. By "ideal" we mean that the line search returns a value $\alpha_k$ that is the first positive stationary point of the function $t(\alpha) = f(x_k + \alpha p_k)$.
Theorem 5.8.
Consider the Polak-Ribière method (5.44) with an ideal line search. There exists a twice continuously differentiable objective function $f : \mathbb{R}^3 \to \mathbb{R}$ and a starting point $x_0 \in \mathbb{R}^3$ such that the sequence of gradients $\{\|\nabla f_k\|\}$ is bounded away from zero.
The proof of this result, given in [253], is quite complex. It demonstrates the existence of the desired objective function without actually constructing this function explicitly. The result is interesting, since the step length assumed in the proof (the first stationary point) may be accepted by any of the practical line search algorithms currently in use. The proof
of Theorem 5.8 requires that some consecutive search directions become almost negatives of each other. In the case of ideal line searches, this happens only if $\beta_k < 0$, so the analysis suggests Algorithm PR+ (see (5.45)), in which we reset $\beta_k$ to zero whenever it becomes negative. We mentioned earlier that a line search strategy based on a slight modification of the Wolfe conditions guarantees that all search directions generated by Algorithm PR+ are descent directions. Using these facts, it is possible to prove a global convergence result like Theorem 5.7 for Algorithm PR+. An attractive property of the formulae (5.49), (5.50) is that global convergence can be established without introducing any modification to a line search based on the Wolfe conditions.
NUMERICAL PERFORMANCE
Table 5.1 illustrates the performance of Algorithms FR, PR, and PR+ without restarts. For these tests, the parameters in the strong Wolfe conditions (5.43) were chosen to be $c_1 = 10^{-4}$ and $c_2 = 0.1$. The iterations were terminated when

$$\|\nabla f_k\|_{\infty} < 10^{-5} (1 + |f_k|).$$
If this condition was not satisfied after 10,000 iterations, we declare failure (indicated by a * in the table).
The final column, headed "mod," indicates the number of iterations of Algorithm PR+ for which the adjustment (5.45) was needed to ensure that $\beta_k^{PR} \ge 0$. Algorithm FR on problem GENROS takes very short steps far from the solution that lead to tiny improvements in the objective function, and convergence was not achieved within the maximum number of iterations.
The Polak-Ribière algorithm and its variation PR+ are not always more efficient than Algorithm FR, and they have the slight disadvantage of requiring one more vector of storage. Nevertheless, we recommend that users choose Algorithm PR, PR+ or FR-PR, or the methods based on (5.49) and (5.50).
Table 5.1 Iterations and function/gradient evaluations required by three nonlinear conjugate gradient methods on a set of test problems; see [123].

Problem     n      Alg FR (it/f-g)   Alg PR (it/f-g)   Alg PR+ (it/f-g)   mod
CALCVAR3    200    2808/5617         2631/5263         2631/5263          0
GENROS      500    *                 1068/2151         1067/2149          1
XPOWSING    1000   533/1102          212/473           97/229             3
TRIDIA1     1000   264/531           262/527           262/527            0
MSQRT1      1000   422/849           113/231           113/231            0
XPOWELL     1000   568/1175          212/473           97/229             3
TRIGON      1000   231/467           40/92             40/92              0
NOTES AND REFERENCES
The conjugate gradient method was developed in the 1950s by Hestenes and Stiefel [168] as an alternative to factorization methods for finding solutions of symmetric positive definite systems. It was not until some years later, in one of the most important developments in sparse linear algebra, that this method came to be viewed as an iterative method that could give good approximate solutions to systems in many fewer than $n$ steps. Our presentation of the linear conjugate gradient method follows that of Luenberger [195]. For a history of the development of the conjugate gradient and Lanczos methods, see Golub and O'Leary [135].
Interestingly enough, the nonlinear conjugate gradient method of Fletcher and Reeves [107] was proposed after the linear conjugate gradient method had fallen out of favor, but several years before it was rediscovered as an iterative method for linear systems. The Polak-Ribière method was introduced in [237], and the example showing that it may fail to converge on nonconvex problems is given by Powell [253]. Restart procedures are discussed in Powell [248].
Hager and Zhang [161] report some of the best computational results obtained to date with a nonlinear CG method. Their implementation is based on formula (5.50) and uses a high-accuracy line search procedure. The results in Table 5.1 are taken from Gilbert and Nocedal [123]. This paper also describes a line search that guarantees that Algorithm PR+ always generates descent directions and proves global convergence.
Analysis due to Powell [245] provides further evidence of the inefficiency of the Fletcher-Reeves method using exact line searches. He shows that if the iterates enter a region in which the function is the two-dimensional quadratic

$$f(x) = \tfrac{1}{2} x^T x,$$

then the angle between the gradient $\nabla f_k$ and the search direction $p_k$ stays constant. Since this angle can be arbitrarily close to $90^\circ$, the Fletcher-Reeves method can be slower than the steepest descent method. The Polak-Ribière method behaves quite differently in these circumstances: If a very small step is generated, the next search direction tends to the steepest descent direction, as argued above. This feature prevents a sequence of tiny steps.
The global convergence of nonlinear conjugate gradient methods has received much attention; see for example Al-Baali [3], Gilbert and Nocedal [123], Dai and Yuan [85], and Hager and Zhang [161]. For recent surveys on CG methods, see Gould et al. [147] and Hager and Zhang [162].
Most of the theory on the rate of convergence of conjugate gradient methods assumes that the line search is exact. Crowder and Wolfe [82] show that the rate of convergence is linear, and show by constructing an example that Q-superlinear convergence is not achievable. Powell [245] studies the case in which the conjugate gradient method enters a region where the objective function is quadratic, and shows that either finite termination occurs or the rate of convergence is linear. Cohen [63] and Burmeister [45] prove $n$-step quadratic convergence (5.51) for general objective functions. Ritter [265] shows that, in fact, the rate is superquadratic, that is,

$$\|x_{k+n} - x^*\| = o(\|x_k - x^*\|^2).$$

Powell [251] gives a slightly better result and performs numerical tests on small problems to measure the rate observed in practice. He also summarizes rate-of-convergence results for asymptotically exact line searches, such as those obtained by Baptist and Stoer [11] and Stoer [282]. Even faster rates of convergence can be established (see Schuller [278], Ritter [265]) under the assumption that the search directions are uniformly linearly independent, but this assumption is hard to verify and does not often occur in practice.
Nemirovsky and Yudin [225] devote some attention to the global efficiency of the Fletcher-Reeves and Polak-Ribière methods with exact line searches. For this purpose they define a measure of "laboriousness" and an optimal bound for it among a certain class of iterations. They show that on strongly convex problems not only do the Fletcher-Reeves and Polak-Ribière methods fail to attain the optimal bound, but they may also be slower than the steepest descent method. Subsequently, Nesterov [225] presented an algorithm that attains this optimal bound. It is related to PARTAN, the method of parallel tangents (see, for example, Luenberger [195]). We feel that this approach is unlikely to be effective in practice, but no conclusive investigation has been carried out, to the best of our knowledge.
EXERCISES
5.1 Implement Algorithm 5.2 and use it to solve linear systems in which $A$ is the Hilbert matrix, whose elements are $A_{i,j} = 1/(i + j - 1)$. Set the right-hand side to $b = (1, 1, \ldots, 1)^T$ and the initial point to $x_0 = 0$. Try dimensions $n = 5, 8, 12, 20$ and report the number of iterations required to reduce the residual below $10^{-6}$.
5.2 Show that if the nonzero vectors $p_0, p_1, \ldots, p_l$ satisfy (5.5), where $A$ is symmetric and positive definite, then these vectors are linearly independent. (This result implies that $A$ has at most $n$ conjugate directions.)

5.3 Verify the formula (5.7).
5.4 Show that if $f(x)$ is a strictly convex quadratic, then the function $h(\sigma) \stackrel{\mathrm{def}}{=} f(x_0 + \sigma_0 p_0 + \cdots + \sigma_{k-1} p_{k-1})$ also is a strictly convex quadratic in the variable $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})^T$.
5.5 Verify from the formulae (5.14) that (5.17) and (5.18) hold for $k = 1$.

5.6 Show that (5.24d) is equivalent to (5.14d).
5.7 Let $\{\lambda_i, v_i\}$, $i = 1, 2, \ldots, n$, be the eigenpairs of the symmetric matrix $A$. Show that the eigenvalues and eigenvectors of $[I + P_k(A) A]^T A [I + P_k(A) A]$ are $\lambda_i [1 + \lambda_i P_k(\lambda_i)]^2$ and $v_i$, respectively.
5.8 Construct matrices with various eigenvalue distributions (clustered and non-clustered) and apply the CG method to them. Comment on whether the behavior can be explained from Theorem 5.5.
5.9 Derive Algorithm 5.3 by applying the standard CG method in the variables $\hat{x}$ and then transforming back into the original variables.
5.10 Verify the modified conjugacy condition (5.40).

5.11 Show that when applied to a quadratic function, with exact line searches, both the Polak-Ribière formula given by (5.44) and the Hestenes-Stiefel formula given by (5.46) reduce to the Fletcher-Reeves formula (5.41a).
5.12 Prove that Lemma 5.6 holds for any choice of $\beta_k$ satisfying $|\beta_k| \le \beta_k^{FR}$.
CHAPTER 6
Quasi-Newton Methods
In the mid 1950s, W.C. Davidon, a physicist working at Argonne National Laboratory, was using the coordinate descent method (see Section 9.3) to perform a long optimization calculation. At that time computers were not very stable, and to Davidon's frustration, the computer system would always crash before the calculation was finished. So Davidon decided to find a way of accelerating the iteration. The algorithm he developed, the first quasi-Newton algorithm, turned out to be one of the most creative ideas in nonlinear optimization. It was soon demonstrated by Fletcher and Powell that the new algorithm was much faster and more reliable than the other existing methods, and this dramatic
advance transformed nonlinear optimization overnight. During the following twenty years, numerous variants were proposed and hundreds of papers were devoted to their study. An interesting historical irony is that Davidon's paper [87] was not accepted for publication; it remained as a technical report for more than thirty years until it appeared in the first issue of the SIAM Journal on Optimization in 1991 [88].
Quasi-Newton methods, like steepest descent, require only the gradient of the objective function to be supplied at each iterate. By measuring the changes in gradients, they construct a model of the objective function that is good enough to produce superlinear convergence. The improvement over steepest descent is dramatic, especially on difficult problems. Moreover, since second derivatives are not required, quasi-Newton methods are sometimes more efficient than Newton's method. Today, optimization software libraries contain a variety of quasi-Newton algorithms for solving unconstrained, constrained, and large-scale optimization problems. In this chapter we discuss quasi-Newton methods for small and medium-sized problems, and in Chapter 7 we consider their extension to the large-scale setting.
The development of automatic differentiation techniques has made it possible to use Newton's method without requiring users to supply second derivatives; see Chapter 8. Still, automatic differentiation tools may not be applicable in many situations, and it may be much more costly to work with second derivatives in automatic differentiation software than with the gradient. For these reasons, quasi-Newton methods remain appealing.
6.1 THE BFGS METHOD
The most popular quasi-Newton algorithm is the BFGS method, named for its discoverers Broyden, Fletcher, Goldfarb, and Shanno. In this section we derive this algorithm and its close relative, the DFP algorithm, and we describe their theoretical properties and practical implementation.
We begin the derivation by forming the following quadratic model of the objective function at the current iterate $x_k$:

$$m_k(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T B_k p. \tag{6.1}$$
Here $B_k$ is an $n \times n$ symmetric positive definite matrix that will be revised or updated at every iteration. Note that the function value and gradient of this model at $p = 0$ match $f_k$ and $\nabla f_k$, respectively. The minimizer $p_k$ of this convex quadratic model, which we can write explicitly as

$$p_k = -B_k^{-1} \nabla f_k, \tag{6.2}$$
is used as the search direction, and the new iterate is

$$x_{k+1} = x_k + \alpha_k p_k, \tag{6.3}$$
where the step length $\alpha_k$ is chosen to satisfy the Wolfe conditions (3.6). This iteration is quite similar to the line search Newton method; the key difference is that the approximate Hessian $B_k$ is used in place of the true Hessian.
Instead of computing $B_k$ afresh at every iteration, Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Suppose that we have generated a new iterate $x_{k+1}$ and wish to construct a new quadratic model, of the form
$$m_{k+1}(p) = f_{k+1} + \nabla f_{k+1}^T p + \tfrac{1}{2} p^T B_{k+1} p.$$

What requirements should we impose on $B_{k+1}$, based on the knowledge gained during the latest step? One reasonable requirement is that the gradient of $m_{k+1}$ should match the gradient of the objective function $f$ at the latest two iterates $x_k$ and $x_{k+1}$. Since $\nabla m_{k+1}(0)$ is precisely $\nabla f_{k+1}$, the second of these conditions is satisfied automatically. The first condition can be written mathematically as
$$\nabla m_{k+1}(-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k.$$

By rearranging, we obtain

$$B_{k+1} \alpha_k p_k = \nabla f_{k+1} - \nabla f_k. \tag{6.4}$$

To simplify the notation it is useful to define the vectors

$$s_k = x_{k+1} - x_k = \alpha_k p_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k, \tag{6.5}$$

so that (6.4) becomes

$$B_{k+1} s_k = y_k. \tag{6.6}$$
We refer to this formula as the secant equation.

Given the displacement $s_k$ and the change of gradients $y_k$, the secant equation requires that the symmetric positive definite matrix $B_{k+1}$ map $s_k$ into $y_k$. This will be possible only if $s_k$ and $y_k$ satisfy the curvature condition

$$s_k^T y_k > 0, \tag{6.7}$$

as is easily seen by premultiplying (6.6) by $s_k^T$. When $f$ is strongly convex, the inequality (6.7) will be satisfied for any two points $x_k$ and $x_{k+1}$ (see Exercise 6.1). However, this condition
will not always hold for nonconvex functions, and in this case we need to enforce (6.7) explicitly, by imposing restrictions on the line search procedure that chooses the step length $\alpha$. In fact, the condition (6.7) is guaranteed to hold if we impose the Wolfe (3.6) or strong Wolfe conditions (3.7) on the line search. To verify this claim, we note from (6.5) and (3.6b) that $\nabla f_{k+1}^T s_k \ge c_2 \nabla f_k^T s_k$, and therefore

$$y_k^T s_k \ge (c_2 - 1) \alpha_k \nabla f_k^T p_k. \tag{6.8}$$

Since $c_2 < 1$ and since $p_k$ is a descent direction, the term on the right is positive, and the curvature condition (6.7) holds.
When the curvature condition is satisfied, the secant equation (6.6) always has a solution $B_{k+1}$. In fact, it admits an infinite number of solutions, since the $n(n+1)/2$ degrees of freedom in a symmetric positive definite matrix exceed the $n$ conditions imposed by the secant equation. The requirement of positive definiteness imposes $n$ additional inequalities (all principal minors must be positive), but these conditions do not absorb the remaining degrees of freedom.
To determine Bk1 uniquely, we impose the additional condition that among all symmetric matrices satisfying the secant equation, Bk1 is, in some sense, closest to the current matrix Bk . In other words, we solve the problem
$$\min_B \|B - B_k\| \tag{6.9a}$$
$$\text{subject to } B = B^T, \quad B s_k = y_k, \tag{6.9b}$$
where $s_k$ and $y_k$ satisfy (6.7) and $B_k$ is symmetric and positive definite. Different matrix norms can be used in (6.9a), and each norm gives rise to a different quasi-Newton method. A norm that allows easy solution of the minimization problem (6.9) and gives rise to a scale-invariant optimization method is the weighted Frobenius norm

$$\|A\|_W \equiv \|W^{1/2} A W^{1/2}\|_F, \tag{6.10}$$

where $\|\cdot\|_F$ is defined by $\|C\|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}^2$. The weight matrix $W$ can be chosen as any matrix satisfying the relation $W y_k = s_k$. For concreteness, the reader can assume that $W = \bar{G}_k^{-1}$, where $\bar{G}_k$ is the average Hessian defined by

$$\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k) \, d\tau. \tag{6.11}$$

The property

$$y_k = \bar{G}_k \alpha_k p_k = \bar{G}_k s_k \tag{6.12}$$

follows from Taylor's theorem (Theorem 2.1). With this choice of weighting matrix $W$, the norm (6.10) is nondimensional, which is a desirable property, since we do not wish the solution of (6.9) to depend on the units of the problem.
With this weighting matrix and this norm, the unique solution of (6.9) is

$$\text{(DFP)} \qquad B_{k+1} = (I - \rho_k y_k s_k^T) B_k (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \tag{6.13}$$

with

$$\rho_k = \frac{1}{y_k^T s_k}. \tag{6.14}$$
This formula is called the DFP updating formula, since it is the one originally proposed by Davidon in 1959, and subsequently studied, implemented, and popularized by Fletcher and Powell.
The inverse of $B_k$, which we denote by

$$H_k = B_k^{-1},$$

is useful in the implementation of the method, since it allows the search direction (6.2) to be calculated by means of a simple matrix-vector multiplication. Using the Sherman-Morrison-Woodbury formula (A.28), we can derive the following expression for the update of the inverse Hessian approximation $H_k$ that corresponds to the DFP update of $B_k$ in (6.13):

$$\text{(DFP)} \qquad H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}. \tag{6.15}$$
Note that the last two terms on the right-hand side of (6.15) are rank-one matrices, so that $H_k$ undergoes a rank-two modification. It is easy to see that (6.13) is also a rank-two modification of $B_k$. This is the fundamental idea of quasi-Newton updating: Instead of recomputing the approximate Hessians (or inverse Hessians) from scratch at every iteration, we apply a simple modification that combines the most recently observed information about the objective function with the existing knowledge embedded in our current Hessian approximation.
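As a small illustration of this rank-two structure, the DFP inverse-Hessian update (6.15) can be written in a few lines of NumPy (a sketch with our own function name):

```python
import numpy as np

def dfp_update(H, s, y):
    """One DFP update (6.15) of the inverse Hessian approximation H."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)
```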
The DFP updating formula is quite effective, but it was soon superseded by the BFGS formula, which is presently considered to be the most effective of all quasi-Newton updating formulae. BFGS updating can be derived by making a simple change in the argument that led to (6.13). Instead of imposing conditions on the Hessian approximations $B_k$, we impose similar conditions on their inverses $H_k$. The updated approximation $H_{k+1}$ must be symmetric and positive definite, and must satisfy the secant equation (6.6), now written as

$$H_{k+1} y_k = s_k.$$
The condition of closeness to $H_k$ is now specified by the following analogue of (6.9):

$$\min_H \|H - H_k\| \tag{6.16a}$$
$$\text{subject to } H = H^T, \quad H y_k = s_k. \tag{6.16b}$$
The norm is again the weighted Frobenius norm described above, where the weight matrix $W$ is now any matrix satisfying $W s_k = y_k$. For concreteness, we assume again that $W$ is given by the average Hessian $\bar{G}_k$ defined in (6.11). The unique solution $H_{k+1}$ to (6.16) is given by

$$\text{(BFGS)} \qquad H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \tag{6.17}$$

with $\rho_k$ defined by (6.14).
Just one issue has to be resolved before we can define a complete BFGS algorithm: How should we choose the initial approximation $H_0$? Unfortunately, there is no magic formula that works well in all cases. We can use specific information about the problem, for instance by setting $H_0$ to the inverse of an approximate Hessian calculated by finite differences at $x_0$. Otherwise, we can simply set it to be the identity matrix, or a multiple of the identity matrix, where the multiple is chosen to reflect the scaling of the variables.
Algorithm 6.1 (BFGS Method).
Given starting point $x_0$, convergence tolerance $\epsilon > 0$, and inverse Hessian approximation $H_0$;
$k \leftarrow 0$;
while $\|\nabla f_k\| > \epsilon$
  Compute the search direction
    $p_k = -H_k \nabla f_k$; (6.18)
  Set $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is computed from a line search procedure to satisfy the Wolfe conditions (3.6);
  Define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$;
  Compute $H_{k+1}$ by means of (6.17);
  $k \leftarrow k + 1$;
end (while)
Each iteration can be performed at a cost of $O(n^2)$ arithmetic operations (plus the cost of function and gradient evaluations); there are no $O(n^3)$ operations such as linear system solves or matrix-matrix operations. The algorithm is robust, and its rate of convergence is superlinear, which is fast enough for most practical purposes. Even though Newton's method converges more rapidly (that is, quadratically), its cost per iteration usually is higher, because of its need for second derivatives and solution of a linear system.
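The following NumPy sketch of Algorithm 6.1 uses SciPy's Wolfe line search; the crude fallback step when the line search fails is our own addition, not part of the algorithm:

```python
import numpy as np
from scipy.optimize import line_search   # Wolfe line search

def bfgs(f, grad, x0, eps=1e-5, maxiter=1000):
    """Sketch of Algorithm 6.1 (BFGS), storing H_k rather than B_k."""
    n = len(x0)
    I = np.eye(n)
    H = I.copy()                    # simple choice of H_0
    x = x0.copy()
    g = grad(x)
    for _ in range(maxiter):
        if np.linalg.norm(g) <= eps:
            break
        p = -H @ g                                  # (6.18)
        alpha = line_search(f, grad, x, p, gfk=g)[0]
        if alpha is None:
            alpha = 1e-3                            # crude fallback (sketch only)
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        rho = 1.0 / (y @ s)                         # (6.14); y^T s > 0 under Wolfe
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)                  # (6.17)
        x, g = x_new, g_new
    return x
```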
We can derive a version of the BFGS algorithm that works with the Hessian approximation $B_k$ rather than $H_k$. The update formula for $B_k$ is obtained by simply applying the Sherman-Morrison-Woodbury formula (A.28) to (6.17) to obtain

$$\text{(BFGS)} \qquad B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}. \tag{6.19}$$
A naive implementation of this variant is not efficient for unconstrained minimization, because it requires the system $B_k p_k = -\nabla f_k$ to be solved for the step $p_k$, thereby increasing the cost of the step computation to $O(n^3)$. We discuss later, however, that less expensive implementations of this variant are possible by updating Cholesky factors of $B_k$.
PROPERTIES OF THE BFGS METHOD
It is usually easy to observe the superlinear rate of convergence of the BFGS method on practical problems. Below, we report the last few iterations of the steepest descent, BFGS, and (inexact) Newton methods on Rosenbrock's function (2.22). The table gives the value of $\|x_k - x^*\|$. The Wolfe conditions were imposed on the step length in all three methods. From the starting point $(-1.2, 1)$, the steepest descent method required 5264 iterations, whereas BFGS and Newton took only 34 and 21 iterations, respectively, to reduce the gradient norm to $10^{-5}$.
steepest descent   BFGS       Newton
1.827e-04          1.70e-03   3.48e-02
1.826e-04          1.17e-03   1.44e-02
1.824e-04          1.34e-04   1.82e-04
1.823e-04          1.01e-06   1.17e-08
A few points in the derivation of the BFGS and DFP methods merit further discussion. Note that the minimization problem (6.16) that gives rise to the BFGS update formula does not explicitly require the updated Hessian approximation to be positive definite. It is easy to show, however, that $H_{k+1}$ will be positive definite whenever $H_k$ is positive definite, by using the following argument. First, note from (6.8) that $y_k^T s_k$ is positive, so that the updating formula (6.17), (6.14) is well-defined. For any nonzero vector $z$, we have

$$z^T H_{k+1} z = w^T H_k w + \rho_k (z^T s_k)^2 \ge 0,$$

where we have defined $w = z - \rho_k y_k (s_k^T z)$. The right-hand side can be zero only if $s_k^T z = 0$, but in this case $w = z \ne 0$, which implies that the first term is greater than zero. Therefore, $H_{k+1}$ is positive definite.
To make quasi-Newton updating formulae invariant to transformations in the variables (such as scaling transformations), it is necessary for the objectives (6.9a) and (6.16a) to be invariant under the same transformations. The choice of the weighting matrices $W$ used to define the norms in (6.9a) and (6.16a) ensures that this condition holds. Many other choices of the weighting matrix $W$ are possible, each one of them giving a different update formula. However, despite intensive searches, no formula has been found that is significantly more effective than BFGS.
The BFGS method has many interesting properties when applied to quadratic functions. We discuss these properties later in the more general context of the Broyden family of updating formulae, of which BFGS is a special case.
It is reasonable to ask whether there are situations in which an updating formula such as (6.17) can produce bad results. If at some iteration the matrix $H_k$ becomes a poor approximation to the true inverse Hessian, is there any hope of correcting it? For example, when the inner product $y_k^T s_k$ is tiny (but positive), then it follows from (6.14), (6.17) that $H_{k+1}$ contains very large elements. Is this behavior reasonable? A related question concerns the rounding errors that occur in finite-precision implementations of these methods. Can these errors grow to the point of erasing all useful information in the quasi-Newton approximate Hessian?
These questions have been studied analytically and experimentally, and it is now known that the BFGS formula has very effective self-correcting properties. If the matrix $H_k$ incorrectly estimates the curvature in the objective function, and if this bad estimate slows down the iteration, then the Hessian approximation will tend to correct itself within a few steps. It is also known that the DFP method is less effective in correcting bad Hessian approximations; this property is believed to be the reason for its poorer practical performance. The self-correcting properties of BFGS hold only when an adequate line search is performed. In particular, the Wolfe line search conditions ensure that the gradients are sampled at points that allow the model (6.1) to capture appropriate curvature information.
It is interesting to note that the DFP and BFGS updating formulae are duals of each other, in the sense that one can be obtained from the other by the interchanges $s \leftrightarrow y$, $B \leftrightarrow H$. This symmetry is not surprising, given the manner in which we derived these methods above.
IMPLEMENTATION
A few details and enhancements need to be added to Algorithm 6.1 to produce an efficient implementation. The line search, which should satisfy either the Wolfe conditions (3.6) or the strong Wolfe conditions (3.7), should always try the step length $\alpha_k = 1$ first, because this step length will eventually always be accepted (under certain conditions), thereby producing superlinear convergence of the overall algorithm. Computational observations strongly suggest that it is more economical, in terms of function evaluations, to perform a fairly inaccurate line search. The values $c_1 = 10^{-4}$ and $c_2 = 0.9$ are commonly used in (3.6).

As mentioned earlier, the initial matrix $H_0$ often is set to some multiple $\beta I$ of the identity, but there is no good general strategy for choosing the multiple $\beta$. If $\beta$ is too large, so that the first step $p_0 = -\beta g_0$ is too long, many function evaluations may be required to find a suitable value for the step length $\alpha_0$. Some software asks the user to prescribe a value $\delta$ for the norm of the first step, and then sets $H_0 = \delta \|g_0\|^{-1} I$ to achieve this norm.
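These implementation choices are easy to see in code. Below is a minimal BFGS loop in Python (our sketch, not the book's reference implementation); it relies on `scipy.optimize.line_search`, which enforces the Wolfe conditions with $c_1 = 10^{-4}$, $c_2 = 0.9$ and tries $\alpha_k = 1$ first, and on SciPy's Rosenbrock test function for the demonstration run:

```python
import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

def bfgs(f, grad, x0, tol=1e-5, max_iter=1000):
    """Minimal BFGS loop in the spirit of Algorithm 6.1 (a sketch)."""
    n = x0.size
    I = np.eye(n)
    H = I                                   # provisional H_0 = I
    x, g = x0.astype(float), grad(x0)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -H @ g                          # search direction (6.18)
        alpha = line_search(f, grad, x, p, gfk=g, c1=1e-4, c2=0.9)[0]
        if alpha is None:                   # line search failed; crude fallback
            alpha = 1e-3
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-10:                   # curvature condition (6.7) holds
            rho = 1.0 / (y @ s)
            # inverse-Hessian BFGS update (6.17)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x, k

x_min, iters = bfgs(rosen, rosen_der, np.array([-1.2, 1.0]))
print(x_min, iters)   # converges to (1, 1) in a few dozen iterations
```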
A heuristic that is often quite effective is to scale the starting matrix after the first step has been computed but before the first BFGS update is performed. We change the provisional value $H_0 = I$ by setting

\[ H_0 \leftarrow \frac{y_k^T s_k}{y_k^T y_k} I, \tag{6.20} \]

before applying the update (6.14), (6.17) to obtain $H_1$. This formula attempts to make the size of $H_0$ similar to that of $\nabla^2 f(x_0)^{-1}$, in the following sense. Assuming that the average Hessian $\bar G_k$ defined in (6.11) is positive definite, there exists a square root satisfying $\bar G_k = \bar G_k^{1/2} \bar G_k^{1/2}$ (see Exercise 6.6). Therefore, by defining $z_k = \bar G_k^{1/2} s_k$ and using the relation (6.12), we have

\[ \frac{y_k^T s_k}{y_k^T y_k} = \frac{(\bar G_k^{1/2} s_k)^T \bar G_k^{1/2} s_k}{(\bar G_k^{1/2} s_k)^T \bar G_k \bar G_k^{1/2} s_k} = \frac{z_k^T z_k}{z_k^T \bar G_k z_k}. \tag{6.21} \]

The reciprocal of (6.21) is an approximation to one of the eigenvalues of $\bar G_k$, which in turn is close to an eigenvalue of $\nabla^2 f(x_k)$. Hence, the quotient (6.21) itself approximates an eigenvalue of $\nabla^2 f(x_k)^{-1}$. Other scaling factors can be used in (6.20), but the one presented here appears to be the most successful in practice.

In (6.19) we gave an update formula for a BFGS method that works with the Hessian approximation $B_k$ instead of the inverse Hessian approximation $H_k$. An efficient implementation of this approach does not store $B_k$ explicitly, but rather the Cholesky factorization $L_k D_k L_k^T$ of this matrix. A formula that updates the factors $L_k$ and $D_k$ directly in $O(n^2)$ operations can be derived from (6.19). Since the linear system $B_k p_k = -\nabla f_k$ also can be solved in $O(n^2)$ operations (by performing triangular substitutions with $L_k$ and $L_k^T$ and a diagonal substitution with $D_k$), the total cost is quite similar to the variant described in Algorithm 6.1. A potential advantage of this alternative strategy is that it gives us the option of modifying diagonal elements in the $D_k$ factor if they are not sufficiently large, to prevent instability when we divide by these elements during the calculation of $p_k$. However, computational experience suggests no real advantages for this variant, and we prefer the simpler strategy of Algorithm 6.1.

The performance of the BFGS method can degrade if the line search is not based on the Wolfe conditions. For example, some software implements an Armijo backtracking line search (see Section 3.1): the unit step length $\alpha_k = 1$ is tried first and is successively decreased until the sufficient decrease condition (3.6a) is satisfied. For this strategy, there is no guarantee that the curvature condition $y_k^T s_k > 0$ (6.7) will be satisfied by the chosen step, since a step length greater than 1 may be required to satisfy this condition. To cope with this shortcoming, some implementations simply skip the BFGS update by setting $H_{k+1} = H_k$ when $y_k^T s_k$ is negative or too close to zero. This approach is not recommended, because the updates may be skipped much too often to allow $H_k$ to capture important curvature information for the objective function $f$. In Chapter 18 we discuss a damped BFGS update that is a more effective strategy for coping with the case where the curvature condition (6.7) is not satisfied.
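In code, the scaling heuristic (6.20) is a one-line change applied between the first step and the first update; a minimal sketch:

```python
import numpy as np

def scaled_h0(s0, y0):
    """Scaling heuristic (6.20): H_0 <- (y^T s / y^T y) I, applied after the
    first step but before the first BFGS update (a sketch)."""
    n = s0.size
    return (y0 @ s0) / (y0 @ y0) * np.eye(n)

s0 = np.array([0.5, -0.2]); y0 = np.array([0.9, -0.3])
print(scaled_h0(s0, y0))
```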
6.2 THE SR1 METHOD
In the BFGS and DFP updating formulae, the updated matrix $B_{k+1}$ (or $H_{k+1}$) differs from its predecessor $B_k$ (or $H_k$) by a rank-2 matrix. In fact, as we now show, there is a simpler rank-1 update that maintains symmetry of the matrix and allows it to satisfy the secant equation. Unlike the rank-two update formulae, this symmetric-rank-1, or SR1, update does not guarantee that the updated matrix maintains positive definiteness. Good numerical results have been obtained with algorithms based on SR1, so we derive it here and investigate its properties.

The symmetric rank-1 update has the general form

\[ B_{k+1} = B_k + \sigma v v^T, \]

where $\sigma$ is either $+1$ or $-1$, and $\sigma$ and $v$ are chosen so that $B_{k+1}$ satisfies the secant equation (6.6), that is, $y_k = B_{k+1} s_k$. By substituting into this equation, we obtain

\[ y_k = B_k s_k + \sigma (v^T s_k) v. \tag{6.22} \]

Since the term $\sigma (v^T s_k)$ is a scalar, we deduce that $v$ must be a multiple of $y_k - B_k s_k$, that is, $v = \delta (y_k - B_k s_k)$ for some scalar $\delta$. By substituting this form of $v$ into (6.22), we obtain

\[ y_k - B_k s_k = \sigma \delta^2 \left[ s_k^T (y_k - B_k s_k) \right] (y_k - B_k s_k), \tag{6.23} \]

and it is clear that this equation is satisfied if (and only if) we choose the parameters $\sigma$ and $\delta$ to be

\[ \sigma = \operatorname{sign}\left[ s_k^T (y_k - B_k s_k) \right], \qquad \delta = \pm \left| s_k^T (y_k - B_k s_k) \right|^{-1/2}. \]

Hence, we have shown that the only symmetric rank-1 updating formula that satisfies the secant equation is given by

\[ \text{(SR1)} \qquad B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}. \tag{6.24} \]

By applying the Sherman-Morrison formula (A.27), we obtain the corresponding update formula for the inverse Hessian approximation $H_k$:

\[ \text{(SR1)} \qquad H_{k+1} = H_k + \frac{(s_k - H_k y_k)(s_k - H_k y_k)^T}{(s_k - H_k y_k)^T y_k}. \tag{6.25} \]

This derivation is so simple that the SR1 formula has been rediscovered a number of times. It is easy to see that even if $B_k$ is positive definite, $B_{k+1}$ may not have the same property. (The same is, of course, true of $H_k$.) This observation was considered a major drawback
in the early days of nonlinear optimization, when only line search iterations were used. However, with the advent of trust-region methods, the SR1 updating formula has proved to be quite useful, and its ability to generate indefinite Hessian approximations can actually be regarded as one of its chief advantages.

The main drawback of SR1 updating is that the denominator in (6.24) or (6.25) can vanish. In fact, even when the objective function is a convex quadratic, there may be steps on which there is no symmetric rank-1 update that satisfies the secant equation. It pays to reexamine the derivation above in the light of this observation.

By reasoning in terms of $B_k$ (similar arguments can be applied to $H_k$), we see that there are three cases:

1. If $(y_k - B_k s_k)^T s_k \ne 0$, then the arguments above show that there is a unique rank-one updating formula satisfying the secant equation (6.6), and that it is given by (6.24).

2. If $y_k = B_k s_k$, then the only updating formula satisfying the secant equation is simply $B_{k+1} = B_k$.

3. If $y_k \ne B_k s_k$ and $(y_k - B_k s_k)^T s_k = 0$, then (6.23) shows that there is no symmetric rank-one updating formula satisfying the secant equation.

The last case clouds an otherwise simple and elegant derivation, and suggests that numerical instabilities and even breakdown of the method can occur. It suggests that rank-one updating does not provide enough freedom to develop a matrix with all the desired characteristics, and that a rank-two correction is required. This reasoning leads us back to the BFGS method, in which positive definiteness (and thus nonsingularity) of all Hessian approximations is guaranteed.

Nevertheless, we are interested in the SR1 formula for the following reasons.

(i) A simple safeguard seems to adequately prevent the breakdown of the method and the occurrence of numerical instabilities.

(ii) The matrices generated by the SR1 formula tend to be good approximations to the true Hessian matrix, often better than the BFGS approximations.

(iii) In quasi-Newton methods for constrained problems, or in methods for partially separable functions (see Chapters 18 and 7), it may not be possible to impose the curvature condition $y_k^T s_k > 0$, and thus BFGS updating is not recommended. Indeed, in these two settings, indefinite Hessian approximations are desirable insofar as they reflect indefiniteness in the true Hessian.

We now introduce a strategy to prevent the SR1 method from breaking down. It has been observed in practice that SR1 performs well simply by skipping the update if the denominator is small. More specifically, the update (6.24) is applied only if

\[ \left| s_k^T (y_k - B_k s_k) \right| \ge r \, \|s_k\| \, \|y_k - B_k s_k\|, \tag{6.26} \]
where $r \in (0, 1)$ is a small number, say $r = 10^{-8}$. If (6.26) does not hold, we set $B_{k+1} = B_k$. Most implementations of the SR1 method use a skipping rule of this kind.
Why do we advocate skipping of updates for the SR1 method, when in the previous section we discouraged this strategy in the case of BFGS? The two cases are quite different. The condition $s_k^T (y_k - B_k s_k) \approx 0$ occurs infrequently, since it requires certain vectors to be aligned in a specific way. When it does occur, skipping the update appears to have no negative effects on the iteration. This is not surprising, since the skipping condition implies that $s_k^T \bar G s_k \approx s_k^T B_k s_k$, where $\bar G$ is the average Hessian over the last step, meaning that the curvature of $B_k$ along $s_k$ is already correct. In contrast, the curvature condition $s_k^T y_k > 0$ required for BFGS updating may easily fail if the line search does not impose the Wolfe conditions (for example, if the step is not long enough), and therefore skipping the BFGS update can occur often and can degrade the quality of the Hessian approximation.
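A sketch of the SR1 update (6.24) with the safeguard (6.26), on dense matrices; the quadratic check at the end echoes the hereditary property established for the inverse form in Theorem 6.1 below:

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """SR1 update (6.24) with the safeguard (6.26): skip the update when the
    denominator is small (a sketch)."""
    v = y - B @ s
    denom = v @ s
    if abs(denom) < r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                         # skip: B_{k+1} = B_k
    return B + np.outer(v, v) / denom

# On a quadratic (y = A s), SR1 reproduces the true Hessian after n
# linearly independent steps.
rng = np.random.default_rng(1)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
for _ in range(2):
    s = rng.standard_normal(2)
    B = sr1_update(B, s, A @ s)
print(np.allclose(B, A))                 # True
```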
We now give a formal description of an SR1 method using a trust-region framework, which we prefer over a line search framework because it can accommodate indefinite Hessian approximations more easily.
Algorithm 6.2 (SR1 Trust-Region Method).
Given starting point $x_0$, initial Hessian approximation $B_0$,
trust-region radius $\Delta_0$, convergence tolerance $\epsilon > 0$,
parameters $\eta \in (0, 10^{-3})$ and $r \in (0, 1)$;
$k \leftarrow 0$;
while $\|\nabla f_k\| > \epsilon$;
    Compute $s_k$ by solving the subproblem
    \[ \min_s \nabla f_k^T s + \tfrac12 s^T B_k s \qquad \text{subject to } \|s\| \le \Delta_k; \tag{6.27} \]
    Compute
        $y_k = \nabla f(x_k + s_k) - \nabla f_k$,
        ared $= f_k - f(x_k + s_k)$ (actual reduction)
        pred $= -(\nabla f_k^T s_k + \tfrac12 s_k^T B_k s_k)$ (predicted reduction);
    if ared/pred $> \eta$
        $x_{k+1} = x_k + s_k$;
    else
        $x_{k+1} = x_k$;
    end (if)
    if ared/pred $> 0.75$
        if $\|s_k\| \le 0.8 \Delta_k$
            $\Delta_{k+1} = \Delta_k$;
        else
            $\Delta_{k+1} = 2 \Delta_k$;
        end (if)
    else if $0.1 \le$ ared/pred $\le 0.75$
        $\Delta_{k+1} = \Delta_k$;
    else
        $\Delta_{k+1} = 0.5 \Delta_k$;
    end (if)
    if (6.26) holds
        Use (6.24) to compute $B_{k+1}$ (even if $x_{k+1} = x_k$);
    else
        $B_{k+1} \leftarrow B_k$;
    end (if)
    $k \leftarrow k + 1$;
end (while)
This algorithm has the typical form of a trust-region method (cf. Algorithm 4.1). For concreteness, we have specified a particular strategy for updating the trust-region radius, but other heuristics can be used instead.

To obtain a fast rate of convergence, it is important for the matrix $B_k$ to be updated even along a failed direction $s_k$. The fact that the step was poor indicates that $B_k$ is an inadequate approximation of the true Hessian in this direction. Unless the quality of the approximation is improved, steps along similar directions could be generated on later iterations, and repeated rejection of such steps could prevent superlinear convergence.
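The acceptance test and radius rules of Algorithm 6.2 translate directly into code; a minimal sketch (the value of $\eta$ is our choice within the stated interval):

```python
def accept_and_update_radius(ared, pred, s_norm, delta, eta=1e-4):
    """Acceptance test and trust-region radius rules of Algorithm 6.2
    (a transcription sketch; eta is an example value in (0, 1e-3))."""
    ratio = ared / pred
    accept = ratio > eta                 # otherwise x_{k+1} = x_k
    if ratio > 0.75:
        delta = delta if s_norm <= 0.8 * delta else 2.0 * delta
    elif ratio < 0.1:
        delta = 0.5 * delta              # shrink on poor agreement
    return accept, delta
```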
PROPERTIES OF SR1 UPDATING
One of the main advantages of SR1 updating is its ability to generate good Hessian approximations. We demonstrate this property by first examining a quadratic function. For functions of this type, the choice of step length does not affect the update, so to examine the effect of the updates, we can assume for simplicity a uniform step length of 1, that is,

\[ p_k = -H_k \nabla f_k, \qquad x_{k+1} = x_k + p_k. \tag{6.28} \]

It follows that $p_k = s_k$.

Theorem 6.1.
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is the strongly convex quadratic function $f(x) = b^T x + \frac12 x^T A x$, where $A$ is symmetric positive definite. Then for any starting point $x_0$ and any symmetric starting matrix $H_0$, the iterates $\{x_k\}$ generated by the SR1 method (6.25), (6.28) converge to the minimizer in at most $n$ steps, provided that $(s_k - H_k y_k)^T y_k \ne 0$ for all $k$. Moreover, if $n$ steps are performed, and if the search directions $p_i$ are linearly independent, then $H_n = A^{-1}$.
PROOF. Because of our assumption $(s_k - H_k y_k)^T y_k \ne 0$, the SR1 update is always well defined. We start by showing inductively that

\[ H_k y_j = s_j, \qquad j = 0, 1, \ldots, k - 1. \tag{6.29} \]

In other words, we claim that the secant equation is satisfied not only along the most recent search direction, but along all previous directions.

By definition, the SR1 update satisfies the secant equation, so we have $H_1 y_0 = s_0$. Let us now assume that (6.29) holds for some value $k \ge 1$ and show that it holds also for $k + 1$. From this assumption, we have from (6.29) that

\[ (s_k - H_k y_k)^T y_j = s_k^T y_j - y_k^T (H_k y_j) = s_k^T y_j - y_k^T s_j = 0, \qquad \text{all } j < k, \tag{6.30} \]

where the last equality follows because $y_i = A s_i$ for the quadratic function we are considering here. By using (6.30) and the induction hypothesis (6.29) in (6.25), we have

\[ H_{k+1} y_j = H_k y_j = s_j, \qquad \text{for all } j < k. \]

Since $H_{k+1} y_k = s_k$ by the secant equation, we have shown that (6.29) holds when $k$ is replaced by $k + 1$. By induction, then, this relation holds for all $k$.

If the algorithm performs $n$ steps and if these steps $s_j$ are linearly independent, we have

\[ s_j = H_n y_j = H_n A s_j, \qquad j = 0, 1, \ldots, n - 1. \]

It follows that $H_n A = I$, that is, $H_n = A^{-1}$. Therefore, the step taken at $x_n$ is the Newton step, and so the next iterate $x_{n+1}$ will be the solution, and the algorithm terminates.

Consider now the case in which the steps become linearly dependent. Suppose that $s_k$ is a linear combination of the previous steps, that is,

\[ s_k = \xi_0 s_0 + \cdots + \xi_{k-1} s_{k-1}, \tag{6.31} \]

for some scalars $\xi_i$. From (6.31) and (6.29) we have that

\begin{align*}
H_k y_k &= H_k A s_k \\
&= \xi_0 H_k A s_0 + \cdots + \xi_{k-1} H_k A s_{k-1} \\
&= \xi_0 H_k y_0 + \cdots + \xi_{k-1} H_k y_{k-1} \\
&= \xi_0 s_0 + \cdots + \xi_{k-1} s_{k-1} = s_k.
\end{align*}

Since $y_k = \nabla f_{k+1} - \nabla f_k$ and since $s_k = p_k = -H_k \nabla f_k$ from (6.28), we have that

\[ H_k (\nabla f_{k+1} - \nabla f_k) = -H_k \nabla f_k, \]

which, by the nonsingularity of $H_k$, implies that $\nabla f_{k+1} = 0$. Therefore, $x_{k+1}$ is the solution point. $\square$
The relation (6.29) shows that when $f$ is quadratic, the secant equation is satisfied along all previous search directions, regardless of how the line search is performed. A result like this can be established for BFGS updating only under the restrictive assumption that the line search is exact, as we show in the next section.
For general nonlinear functions, the SR1 update continues to generate good Hessian approximations under certain conditions.
Theorem 6.2.
Suppose that $f$ is twice continuously differentiable, and that its Hessian is bounded and Lipschitz continuous in a neighborhood of a point $x^*$. Let $\{x_k\}$ be any sequence of iterates such that $x_k \to x^*$. Suppose in addition that the inequality (6.26) holds for all $k$, for some $r \in (0, 1)$, and that the steps $s_k$ are uniformly linearly independent. Then the matrices $B_k$ generated by the SR1 updating formula satisfy

\[ \lim_{k \to \infty} \| B_k - \nabla^2 f(x^*) \| = 0. \]
The term uniformly linearly independent steps means, roughly speaking, that the steps do not tend to fall in a subspace of dimension less than $n$. This assumption is usually, but not always, satisfied in practice (see the Notes and References at the end of this chapter).
6.3 THE BROYDEN CLASS
So far, we have described the BFGS, DFP, and SR1 quasi-Newton updating formulae, but there are many others. Of particular interest is the Broyden class, a family of updates specified by the following general formula:

\[ B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \phi_k (s_k^T B_k s_k) v_k v_k^T, \tag{6.32} \]

where $\phi_k$ is a scalar parameter and

\[ v_k = \left[ \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k} \right]. \tag{6.33} \]

The BFGS and DFP methods are members of the Broyden class; we recover BFGS by setting $\phi_k = 0$ and DFP by setting $\phi_k = 1$ in (6.32). We can therefore rewrite (6.32) as a linear combination of these two methods, that is,

\[ B_{k+1} = (1 - \phi_k) B_{k+1}^{\text{BFGS}} + \phi_k B_{k+1}^{\text{DFP}}. \]

This relationship indicates that all members of the Broyden class satisfy the secant equation (6.6), since the BFGS and DFP matrices themselves satisfy this equation. Also, since BFGS and DFP updating preserve positive definiteness of the Hessian approximations when $s_k^T y_k > 0$, this relation implies that the same property will hold for the Broyden family if $0 \le \phi_k \le 1$.
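A direct transcription of (6.32)-(6.33) makes the secant property easy to verify numerically (the test values below are arbitrary):

```python
import numpy as np

def broyden_update(B, s, y, phi):
    """Broyden-class update (6.32)-(6.33) (a sketch): phi=0 gives BFGS,
    phi=1 gives DFP."""
    Bs = B @ s
    sBs = s @ Bs
    ys = y @ s
    v = y / ys - Bs / sBs                                  # (6.33)
    return (B - np.outer(Bs, Bs) / sBs
              + np.outer(y, y) / ys
              + phi * sBs * np.outer(v, v))                # (6.32)

B = np.eye(3)
s = np.array([1.0, 0.5, -0.2]); y = np.array([2.0, 0.3, 0.1])
for phi in (0.0, 0.5, 1.0):
    print(phi, np.allclose(broyden_update(B, s, y, phi) @ s, y))  # secant holds
```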
Much attention has been given to the so-called restricted Broyden class, which is obtained by restricting $\phi_k$ to the interval $[0, 1]$. It enjoys the following property when applied to quadratic functions. Since the analysis is independent of the step length, we assume for simplicity that each iteration has the form

\[ p_k = -B_k^{-1} \nabla f_k, \qquad x_{k+1} = x_k + p_k. \tag{6.34} \]
Theorem 6.3.
Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is the strongly convex quadratic function $f(x) = b^T x + \frac12 x^T A x$, where $A$ is symmetric and positive definite. Let $x_0$ be any starting point for the iteration (6.34) and $B_0$ be any symmetric positive definite starting matrix, and suppose that the matrices $B_k$ are updated by the Broyden formula (6.32) with $\phi_k \in [0, 1]$. Define $\lambda_1^k \le \lambda_2^k \le \cdots \le \lambda_n^k$ to be the eigenvalues of the matrix

\[ A^{1/2} B_k^{-1} A^{1/2}. \tag{6.35} \]

Then for all $k$, we have

\[ \min\{\lambda_i^k, 1\} \le \lambda_i^{k+1} \le \max\{\lambda_i^k, 1\}, \qquad i = 1, 2, \ldots, n. \tag{6.36} \]

Moreover, the property (6.36) does not hold if the Broyden parameter $\phi_k$ is chosen outside the interval $[0, 1]$.

Let us discuss the significance of this result. If the eigenvalues $\lambda_i^k$ of the matrix (6.35) are all 1, then the quasi-Newton approximation $B_k$ is identical to the Hessian $A$ of the quadratic objective function. This situation is the ideal one, so we should be hoping for these eigenvalues to be as close to 1 as possible. In fact, relation (6.36) tells us that the eigenvalues $\{\lambda_i^k\}$ converge monotonically (but not strictly monotonically) to 1. Suppose, for example, that at iteration $k$ the smallest eigenvalue is $\lambda_1^k = 0.7$. Then (6.36) tells us that at the next iteration $\lambda_1^{k+1} \in [0.7, 1]$. We cannot be sure that this eigenvalue has actually moved closer to 1, but it is reasonable to expect that it has. In contrast, the first eigenvalue can become smaller than 0.7 if we allow $\phi_k$ to be outside $[0, 1]$. Significantly, the result of Theorem 6.3 holds even if the line searches are not exact.
Although Theorem 6.3 seems to suggest that the best update formulas belong to the restricted Broyden class, the situation is not at all clear. Some analysis and computational testing suggest that algorithms that allow $\phi_k$ to be negative (in a strictly controlled manner) may in fact be superior to the BFGS method. The SR1 formula is a case in point: it is a member of the Broyden class, obtained by setting

\[ \phi_k = \frac{s_k^T y_k}{s_k^T y_k - s_k^T B_k s_k}, \]

but it does not belong to the restricted Broyden class, because this value of $\phi_k$ may fall outside the interval $[0, 1]$.

In the remaining discussion of this section, we determine more precisely the range of values of $\phi_k$ that preserve positive definiteness.
The last term in (6.32) is a rank-one correction, which by the interlacing eigenvalue theorem (Theorem A.1) increases the eigenvalues of the matrix when $\phi_k$ is positive. Therefore $B_{k+1}$ is positive definite for all $\phi_k \ge 0$. On the other hand, by Theorem A.1 the last term in (6.32) decreases the eigenvalues of the matrix when $\phi_k$ is negative. As we decrease $\phi_k$, this matrix eventually becomes singular and then indefinite. A little computation shows that $B_{k+1}$ is singular when $\phi_k$ has the value

\[ \phi_k^c = \frac{1}{1 - \mu_k}, \tag{6.37} \]

where

\[ \mu_k = \frac{(y_k^T B_k^{-1} y_k)(s_k^T B_k s_k)}{(y_k^T s_k)^2}. \tag{6.38} \]

By applying the Cauchy-Schwarz inequality (A.5) to (6.38), we see that $\mu_k \ge 1$ and therefore $\phi_k^c \le 0$. Hence, if the initial Hessian approximation $B_0$ is symmetric and positive definite, and if $s_k^T y_k > 0$ and $\phi_k > \phi_k^c$ for each $k$, then all the matrices $B_k$ generated by Broyden's formula (6.32) remain symmetric and positive definite.
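The degenerate value (6.37) can be checked numerically; for the arbitrary test data below, $\phi_k^c = -2$ and the resulting update is singular to machine precision:

```python
import numpy as np

# mu_k from (6.38), the critical value phi_c from (6.37), and a check that the
# Broyden update (6.32) is singular at phi_c.
B = np.array([[2.0, 0.0], [0.0, 1.0]])
s = np.array([1.0, 1.0]); y = np.array([1.0, 2.0])

mu = (y @ np.linalg.solve(B, y)) * (s @ B @ s) / (y @ s) ** 2   # (6.38)
phi_c = 1.0 / (1.0 - mu)                                        # (6.37): -2 here

Bs = B @ s
v = y / (y @ s) - Bs / (s @ Bs)                                 # (6.33)
B_new = (B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
         + phi_c * (s @ Bs) * np.outer(v, v))                   # (6.32)
print(phi_c, np.linalg.det(B_new))                              # -2.0, ~0.0
```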
When the line search is exact, all methods in the Broyden class with $\phi_k \ge \phi_k^c$ generate the same sequence of iterates. This result applies to general nonlinear functions and is based on the observation that, when all the line searches are exact, the directions generated by Broyden-class methods differ only in their lengths. The line searches identify the same minima along the chosen search direction, though the values of the step lengths may differ because of the different scaling.
The Broyden class has several remarkable properties when applied with exact line searches to quadratic functions. We state some of these properties in the next theorem, whose proof is omitted.
Theorem 6.4.
Suppose that a method in the Broyden class is applied to the strongly convex quadratic function $f(x) = b^T x + \frac12 x^T A x$, where $x_0$ is the starting point and $B_0$ is any symmetric positive definite matrix. Assume that $\alpha_k$ is the exact step length and that $\phi_k \ge \phi_k^c$ for all $k$, where $\phi_k^c$ is defined by (6.37). Then the following statements are true.

(i) The iterates are independent of $\phi_k$ and converge to the solution in at most $n$ iterations.

(ii) The secant equation is satisfied for all previous search directions, that is,

\[ B_k s_j = y_j, \qquad j = k - 1, k - 2, \ldots, 1. \]

(iii) If the starting matrix is $B_0 = I$, then the iterates are identical to those generated by the conjugate gradient method (see Chapter 5). In particular, the search directions are conjugate, that is,

\[ s_i^T A s_j = 0, \qquad \text{for } i \ne j. \]

(iv) If $n$ iterations are performed, we have $B_n = A$.

Note that parts (i), (ii), and (iv) of this result echo the statement and proof of Theorem 6.1, where similar results were derived for the SR1 update formula.

We can generalize Theorem 6.4 slightly: it continues to hold if the Hessian approximations remain nonsingular but not necessarily positive definite. Hence, we could allow $\phi_k$ to be smaller than $\phi_k^c$, provided that the chosen value did not produce a singular updated matrix. We can also generalize point (iii) as follows. If the starting matrix $B_0$ is not the identity matrix, then the Broyden-class method is identical to the preconditioned conjugate gradient method that uses $B_0$ as preconditioner.

We conclude by commenting that results like Theorem 6.4 would appear to be of mainly theoretical interest, since the inexact line searches used in practical implementations of Broyden-class methods (and all other quasi-Newton methods) cause their performance to differ markedly. Nevertheless, it is worth noting that this type of analysis guided much of the development of quasi-Newton methods.
6.4 CONVERGENCE ANALYSIS
In this section we present global and local convergence results for practical implementations of the BFGS and SR1 methods. We give more details for BFGS because its analysis is more general and illuminating than that of SR1. The fact that the Hessian approximations evolve by means of updating formulas makes the analysis of quasi-Newton methods much more complex than that of steepest descent and Newton's method.

Although the BFGS and SR1 methods are known to be remarkably robust in practice, we will not be able to establish truly global convergence results for general nonlinear objective functions. That is, we cannot prove that the iterates of these quasi-Newton methods approach a stationary point of the problem from any starting point and any suitable initial Hessian approximation. In fact, it is not yet known if the algorithms enjoy such properties. In our analysis we will either assume that the objective function is convex or that the iterates satisfy certain properties. On the other hand, there are well-known local, superlinear convergence results that are true under reasonable assumptions.

Throughout this section we use $\|\cdot\|$ to denote the Euclidean vector or matrix norm, and denote the Hessian matrix $\nabla^2 f(x)$ by $G(x)$.
GLOBAL CONVERGENCE OF THE BFGS METHOD
We study the global convergence of the BFGS method, with a practical line search, when applied to a smooth convex function from an arbitrary starting point $x_0$ and from any initial Hessian approximation $B_0$ that is symmetric and positive definite. We state our precise assumptions about the objective function formally, as follows.

Assumption 6.1.
(i) The objective function $f$ is twice continuously differentiable.
(ii) The level set $\mathcal{L} = \{ x \in \mathbb{R}^n : f(x) \le f(x_0) \}$ is convex, and there exist positive constants $m$ and $M$ such that

\[ m \|z\|^2 \le z^T G(x) z \le M \|z\|^2 \tag{6.39} \]

for all $z \in \mathbb{R}^n$ and $x \in \mathcal{L}$.

Part (ii) of this assumption implies that $G(x)$ is positive definite on $\mathcal{L}$ and that $f$ has a unique minimizer $x^*$ in $\mathcal{L}$.
By using (6.12) and (6.39) we obtain

\[ \frac{y_k^T s_k}{s_k^T s_k} = \frac{s_k^T \bar G_k s_k}{s_k^T s_k} \ge m, \tag{6.40} \]

where $\bar G_k$ is the average Hessian defined in (6.11). Assumption 6.1 implies that $\bar G_k$ is positive definite, so its square root is well defined. Therefore, as in (6.21), we have by defining $z_k = \bar G_k^{1/2} s_k$ that

\[ \frac{y_k^T y_k}{y_k^T s_k} = \frac{s_k^T \bar G_k^2 s_k}{s_k^T \bar G_k s_k} = \frac{z_k^T \bar G_k z_k}{z_k^T z_k} \le M. \tag{6.41} \]

We are now ready to present the global convergence result for the BFGS method. It does not seem to be possible to establish a bound on the condition number of the Hessian approximations $B_k$, as is done in Section 3.2. Instead, we will introduce two new tools in the analysis, the trace and determinant, to estimate the size of the largest and smallest eigenvalues of the Hessian approximations. The trace of a matrix (denoted by trace) is the sum of its eigenvalues, while the determinant (denoted by det) is the product of the eigenvalues; see the Appendix for a brief discussion of their properties.
Theorem 6.5.
Let $B_0$ be any symmetric positive definite initial matrix, and let $x_0$ be a starting point for which Assumption 6.1 is satisfied. Then the sequence $\{x_k\}$ generated by Algorithm 6.1 (with $\epsilon = 0$) converges to the minimizer $x^*$ of $f$.

PROOF. We start by defining

\[ m_k = \frac{y_k^T s_k}{s_k^T s_k}, \qquad M_k = \frac{y_k^T y_k}{y_k^T s_k}, \tag{6.42} \]

and note from (6.40) and (6.41) that

\[ m_k \ge m, \qquad M_k \le M. \tag{6.43} \]

By computing the trace of the BFGS update (6.19), we obtain that

\[ \operatorname{trace}(B_{k+1}) = \operatorname{trace}(B_k) - \frac{\|B_k s_k\|^2}{s_k^T B_k s_k} + \frac{\|y_k\|^2}{y_k^T s_k} \tag{6.44} \]

(see Exercise 6.11). We can also show (Exercise 6.10) that

\[ \det(B_{k+1}) = \det(B_k) \, \frac{y_k^T s_k}{s_k^T B_k s_k}. \tag{6.45} \]

We now define

\[ \cos\theta_k = \frac{s_k^T B_k s_k}{\|s_k\| \, \|B_k s_k\|}, \qquad q_k = \frac{s_k^T B_k s_k}{s_k^T s_k}, \tag{6.46} \]

so that $\theta_k$ is the angle between $s_k$ and $B_k s_k$. We then obtain that

\[ \frac{\|B_k s_k\|^2}{s_k^T B_k s_k} = \frac{\|B_k s_k\|^2 \, \|s_k\|^2}{(s_k^T B_k s_k)^2} \cdot \frac{s_k^T B_k s_k}{\|s_k\|^2} = \frac{q_k}{\cos^2\theta_k}. \tag{6.47} \]

In addition, we have from (6.42) that

\[ \det(B_{k+1}) = \det(B_k) \, \frac{y_k^T s_k}{s_k^T s_k} \cdot \frac{s_k^T s_k}{s_k^T B_k s_k} = \det(B_k) \, \frac{m_k}{q_k}. \tag{6.48} \]
We now combine the trace and determinant by introducing the following function of a positive definite matrix $B$:

\[ \psi(B) = \operatorname{trace}(B) - \ln(\det(B)), \tag{6.49} \]

where $\ln$ denotes the natural logarithm. It is not difficult to show that $\psi(B) > 0$; see Exercise 6.9. By using (6.42) and (6.44)-(6.49), we have that

\begin{align*}
\psi(B_{k+1}) &= \operatorname{trace}(B_k) + M_k - \frac{q_k}{\cos^2\theta_k} - \ln(\det(B_k)) - \ln m_k + \ln q_k \\
&= \psi(B_k) + (M_k - \ln m_k - 1) + \left[ 1 - \frac{q_k}{\cos^2\theta_k} + \ln \frac{q_k}{\cos^2\theta_k} \right] + \ln \cos^2\theta_k. \tag{6.50}
\end{align*}

Now, since the function $h(t) = 1 - t + \ln t$ is nonpositive for all $t > 0$ (see Exercise 6.8), the term inside the square brackets is nonpositive, and thus from (6.43) and (6.50) we have

\[ 0 < \psi(B_{k+1}) \le \psi(B_0) + c(k+1) + \sum_{j=0}^{k} \ln \cos^2\theta_j, \tag{6.51} \]

where we can assume the constant $c = M - \ln m - 1$ to be positive, without loss of generality.

We now relate these expressions to the results given in Section 3.2. Note from the form $s_k = -\alpha_k B_k^{-1} \nabla f_k$ of the quasi-Newton iteration that $\cos\theta_k$ defined by (6.46) is the angle between the steepest descent direction and the search direction, which plays a crucial role in the global convergence theory of Chapter 3. From (3.22), (3.23) we know that the sequence $\|\nabla f_k\|$ generated by the line search algorithm is bounded away from zero only if $\cos\theta_j \to 0$.

Let us then proceed by contradiction and assume that $\cos\theta_j \to 0$. Then there exists $k_1 > 0$ such that for all $j > k_1$, we have

\[ \ln \cos^2\theta_j < -2c, \]

where $c$ is the constant defined above. Using this inequality in (6.51) we find the following relations to be true for all $k > k_1$:

\[ 0 < \psi(B_0) + c(k+1) + \sum_{j=0}^{k_1} \ln \cos^2\theta_j + \sum_{j=k_1+1}^{k} (-2c) = \psi(B_0) + \sum_{j=0}^{k_1} \ln \cos^2\theta_j + 2ck_1 + c - ck. \]

However, the right-hand side is negative for large $k$, giving a contradiction. Therefore, there exists a subsequence of indices $\{j_k\}_{k=1,2,\ldots}$ such that $\cos\theta_{j_k} \ge \delta$ for some $\delta > 0$. By Zoutendijk's result (3.14) this limit implies that $\liminf \|\nabla f_k\| = 0$. Since the problem is strongly convex, the latter limit is enough to prove that $x_k \to x^*$. $\square$
Theorem 6.5 has been generalized to the entire restricted Broyden class, except for the DFP method. In other words, Theorem 6.5 can be shown to hold for all $\phi_k \in [0, 1)$ in (6.32), but the argument seems to break down as $\phi_k$ approaches 1 because some of the self-correcting properties of the update are weakened considerably.

An extension of the analysis just given shows that the rate of convergence of the iterates is linear. In particular, we can show that the sequence $\|x_k - x^*\|$ converges to zero rapidly enough that

\[ \sum_{k=1}^{\infty} \|x_k - x^*\| < \infty. \tag{6.52} \]

We will not prove this claim, but rather establish that if (6.52) holds, then the rate of convergence is actually superlinear.
SUPERLINEAR CONVERGENCE OF THE BFGS METHOD
The analysis of this section makes use of the Dennis and Moré characterization (3.36) of superlinear convergence. It applies to general nonlinear, not just convex, objective functions. For the results that follow we need to make an additional assumption.
Assumption 6.2.
The Hessian matrix $G$ is Lipschitz continuous at $x^*$, that is,

\[ \| G(x) - G(x^*) \| \le L \| x - x^* \|, \]

for all $x$ near $x^*$, where $L$ is a positive constant.
We start by introducing the quantities

\[ \tilde s_k = G^{1/2} s_k, \qquad \tilde y_k = G^{-1/2} y_k, \qquad \tilde B_k = G^{-1/2} B_k G^{-1/2}, \]

where $G = G(x^*)$ and $x^*$ is a minimizer of $f$. Similarly to (6.46), we define

\[ \cos\tilde\theta_k = \frac{\tilde s_k^T \tilde B_k \tilde s_k}{\|\tilde s_k\| \, \|\tilde B_k \tilde s_k\|}, \qquad \tilde q_k = \frac{\tilde s_k^T \tilde B_k \tilde s_k}{\tilde s_k^T \tilde s_k}, \]

while we echo (6.42) and (6.43) in defining

\[ \tilde M_k = \frac{\|\tilde y_k\|^2}{\tilde y_k^T \tilde s_k}, \qquad \tilde m_k = \frac{\tilde y_k^T \tilde s_k}{\tilde s_k^T \tilde s_k}. \]
By pre- and postmultiplying the BFGS update formula (6.19) by $G^{-1/2}$ and grouping terms appropriately, we obtain

\[ \tilde B_{k+1} = \tilde B_k - \frac{\tilde B_k \tilde s_k \tilde s_k^T \tilde B_k}{\tilde s_k^T \tilde B_k \tilde s_k} + \frac{\tilde y_k \tilde y_k^T}{\tilde y_k^T \tilde s_k}. \]

Since this expression has precisely the same form as the BFGS formula (6.19), it follows from the argument leading to (6.50) that

\[ \psi(\tilde B_{k+1}) = \psi(\tilde B_k) + (\tilde M_k - \ln \tilde m_k - 1) + \left[ 1 - \frac{\tilde q_k}{\cos^2\tilde\theta_k} + \ln \frac{\tilde q_k}{\cos^2\tilde\theta_k} \right] + \ln \cos^2\tilde\theta_k. \tag{6.53} \]

Recalling (6.12), we have that

\[ y_k - G s_k = (\bar G_k - G) s_k, \]

and thus

\[ \tilde y_k - \tilde s_k = G^{-1/2} (\bar G_k - G) G^{-1/2} \tilde s_k. \]

By Assumption 6.2, and recalling the definition (6.11), we have

\[ \| \tilde y_k - \tilde s_k \| \le \| G^{-1/2} \|^2 \, \| \tilde s_k \| \, \| \bar G_k - G \| \le \| G^{-1/2} \|^2 \, \| \tilde s_k \| \, L \epsilon_k, \]

where $\epsilon_k$ is defined by

\[ \epsilon_k = \max\{ \| x_{k+1} - x^* \|, \| x_k - x^* \| \}. \]
We have thus shown that

\[ \frac{\| \tilde y_k - \tilde s_k \|}{\| \tilde s_k \|} \le \bar c \, \epsilon_k, \tag{6.54} \]

for some positive constant $\bar c$. This inequality and (6.52) play an important role in superlinear convergence, as we now show.
Theorem 6.6.
Suppose that $f$ is twice continuously differentiable and that the iterates generated by the BFGS algorithm converge to a minimizer $x^*$ at which Assumption 6.2 holds. Suppose also that (6.52) holds. Then $x_k$ converges to $x^*$ at a superlinear rate.
PROOF. From (6.54), we have from the triangle inequality (A.4a) that

\[ \| \tilde y_k \| - \| \tilde s_k \| \le \bar c \epsilon_k \| \tilde s_k \|, \qquad \| \tilde s_k \| - \| \tilde y_k \| \le \bar c \epsilon_k \| \tilde s_k \|, \]

so that

\[ (1 - \bar c \epsilon_k) \| \tilde s_k \| \le \| \tilde y_k \| \le (1 + \bar c \epsilon_k) \| \tilde s_k \|. \tag{6.55} \]

By squaring (6.54) and using (6.55), we obtain

\[ (1 - \bar c \epsilon_k)^2 \| \tilde s_k \|^2 - 2 \tilde y_k^T \tilde s_k + \| \tilde s_k \|^2 \le \| \tilde y_k \|^2 - 2 \tilde y_k^T \tilde s_k + \| \tilde s_k \|^2 \le \bar c^2 \epsilon_k^2 \| \tilde s_k \|^2, \]

and therefore

\[ 2 \tilde y_k^T \tilde s_k \ge (1 - 2 \bar c \epsilon_k + \bar c^2 \epsilon_k^2 + 1 - \bar c^2 \epsilon_k^2) \| \tilde s_k \|^2 = 2 (1 - \bar c \epsilon_k) \| \tilde s_k \|^2. \]

It follows from the definition of $\tilde m_k$ that

\[ \tilde m_k = \frac{\tilde y_k^T \tilde s_k}{\| \tilde s_k \|^2} \ge 1 - \bar c \epsilon_k. \tag{6.56} \]

By combining (6.55) and (6.56), we obtain also that

\[ \tilde M_k = \frac{\| \tilde y_k \|^2}{\tilde y_k^T \tilde s_k} \le \frac{(1 + \bar c \epsilon_k)^2}{1 - \bar c \epsilon_k}. \tag{6.57} \]

Since $x_k \to x^*$, we have that $\epsilon_k \to 0$, and thus by (6.57) there exists a positive constant $c \ge 2\bar c$ such that the following inequality holds for all sufficiently large $k$:

\[ \tilde M_k \le 1 + 2 c \, \epsilon_k. \tag{6.58} \]

We again make use of the nonpositiveness of the function $h(t) = 1 - t + \ln t$. Therefore, we have

\[ \frac{-x}{1 - x} - \ln(1 - x) = h\!\left( \frac{1}{1 - x} \right) \le 0. \]

Now, for $k$ large enough we can assume that $\bar c \epsilon_k < \frac12$, and therefore

\[ \ln(1 - \bar c \epsilon_k) \ge \frac{-\bar c \epsilon_k}{1 - \bar c \epsilon_k} \ge -2 \bar c \epsilon_k. \]

This relation and (6.56) imply that for sufficiently large $k$, we have

\[ \ln \tilde m_k \ge \ln(1 - \bar c \epsilon_k) \ge -2 \bar c \epsilon_k > -2 c \epsilon_k. \tag{6.59} \]
We can now deduce from (6.53), (6.58), and (6.59) that

\[ \psi(\tilde B_{k+1}) \le \psi(\tilde B_k) + 3c \epsilon_k + \ln \cos^2\tilde\theta_k + \left[ 1 - \frac{\tilde q_k}{\cos^2\tilde\theta_k} + \ln \frac{\tilde q_k}{\cos^2\tilde\theta_k} \right]. \]

By summing this expression and making use of (6.52) we have that

\[ 0 < \psi(\tilde B_{k+1}) \le \psi(\tilde B_0) + 3c \sum_{j=0}^{\infty} \epsilon_j + \sum_{j=0}^{k} \left[ \ln \cos^2\tilde\theta_j + 1 - \frac{\tilde q_j}{\cos^2\tilde\theta_j} + \ln \frac{\tilde q_j}{\cos^2\tilde\theta_j} \right]. \tag{6.60} \]

Since the term in the square brackets is nonpositive, and since $\ln \frac{1}{\cos^2\tilde\theta_j} \ge 0$ for all $j$, we obtain the two limits

\[ \lim_{j\to\infty} \ln \frac{1}{\cos^2\tilde\theta_j} = 0, \qquad \lim_{j\to\infty} \left( 1 - \frac{\tilde q_j}{\cos^2\tilde\theta_j} + \ln \frac{\tilde q_j}{\cos^2\tilde\theta_j} \right) = 0, \]

which imply that

\[ \lim_{j\to\infty} \cos\tilde\theta_j = 1, \qquad \lim_{j\to\infty} \tilde q_j = 1. \tag{6.61} \]

The essence of the result has now been proven; we need only to interpret these limits in terms of the Dennis-Moré characterization of superlinear convergence.
Recalling (6.47), we have

\[ \frac{\| G^{-1/2} (B_k - G) s_k \|^2}{\| G^{1/2} s_k \|^2} = \frac{\| (\tilde B_k - I) \tilde s_k \|^2}{\| \tilde s_k \|^2} = \frac{\| \tilde B_k \tilde s_k \|^2 - 2 \tilde s_k^T \tilde B_k \tilde s_k + \tilde s_k^T \tilde s_k}{\tilde s_k^T \tilde s_k} = \frac{\tilde q_k^2}{\cos^2\tilde\theta_k} - 2 \tilde q_k + 1. \]

Since by (6.61) the right-hand side converges to 0, we conclude that

\[ \lim_{k \to \infty} \frac{\| (B_k - G) s_k \|}{\| s_k \|} = 0. \]

The limit (3.36) and Theorem 3.6 imply that the unit step length $\alpha_k = 1$ will satisfy the Wolfe conditions near the solution, and hence that the rate of convergence is superlinear. $\square$
CONVERGENCE ANALYSIS OF THE SR1 METHOD
The convergence properties of the SR1 method are not as well understood as those of the BFGS method. No global results like Theorem 6.5 or local superlinear results like Theorem 6.6 have been established, except the results for quadratic functions discussed earlier. There is, however, an interesting result for the trust-region SR1 algorithm, Algorithm 6.2. It states that when the objective function has a unique stationary point and the condition (6.26) holds at every step (so that the SR1 update is never skipped) and the Hessian approximations $B_k$ are bounded above, then the iterates converge to $x^*$ at an $(n+1)$-step superlinear rate. The result does not require exact solution of the trust-region subproblem (6.27).
We state the result formally as follows.
Theorem 6.7.
Suppose that the iterates $x_k$ are generated by Algorithm 6.2. Suppose also that the following conditions hold:

(c1) The sequence of iterates does not terminate, but remains in a closed, bounded, convex set $D$, on which the function $f$ is twice continuously differentiable, and in which $f$ has a unique stationary point $x^*$;

(c2) the Hessian $\nabla^2 f(x^*)$ is positive definite, and $\nabla^2 f(x)$ is Lipschitz continuous in a neighborhood of $x^*$;

(c3) the sequence of matrices $\{B_k\}$ is bounded in norm;

(c4) condition (6.26) holds at every iteration, where $r$ is some constant in $(0, 1)$.

Then $\lim_{k\to\infty} x_k = x^*$, and we have that

\[ \lim_{k \to \infty} \frac{\| x_{k+n+1} - x^* \|}{\| x_k - x^* \|} = 0. \]

Note that the BFGS method does not require the boundedness assumption (c3) to hold. As we have mentioned already, the SR1 update does not necessarily maintain positive definiteness of the Hessian approximations $B_k$. In practice, $B_k$ may be indefinite at any iteration, which means that the trust-region bound may continue to be active for arbitrarily large $k$. Interestingly, however, it can be shown that the SR1 Hessian approximations tend to be positive definite most of the time. The precise result is that

\[ \lim_{k \to \infty} \frac{\text{number of indices } j = 1, 2, \ldots, k \text{ for which } B_j \text{ is positive semidefinite}}{k} = 1, \]

under the assumptions of Theorem 6.7. This result holds regardless of whether the initial Hessian approximation is positive definite or not.
NOTES AND REFERENCES
For a comprehensive treatment of quasi-Newton methods see Dennis and Schnabel [92], Dennis and Moré [91], and Fletcher [101]. A formula for updating the Cholesky factors of the BFGS matrices is given in Dennis and Schnabel [92].

Several safeguards and modifications of the SR1 method have been proposed, but the condition (6.26) is favored in the light of the analysis of Conn, Gould, and Toint [71]. Computational experiments by Conn, Gould, and Toint [70, 73] and Khalfan, Byrd, and Schnabel [181], using both line search and trust-region approaches, indicate that the SR1 method appears to be competitive with the BFGS method. The proof of Theorem 6.7 is given in Byrd, Khalfan, and Schnabel [51].

A study of the convergence of BFGS matrices for nonlinear problems can be found in Ge and Powell [119] and Boggs and Tolle [32]; however, the results are not as satisfactory as for SR1 updating.

The global convergence of the BFGS method was established by Powell [246]. This result was extended to the restricted Broyden class, except for DFP, by Byrd, Nocedal, and Yuan [53]. For a discussion of the self-correcting properties of quasi-Newton methods see Nocedal [229]. Most of the early analysis of quasi-Newton methods was based on the bounded deterioration principle. This is a tool for the local analysis that quantifies the worst-case behavior of quasi-Newton updating. Assuming that the starting point is sufficiently close to the solution $x^*$ and that the initial Hessian approximation is sufficiently close to $\nabla^2 f(x^*)$, one can use the bounded deterioration bounds to prove that the iteration cannot stray away from the solution. This property can then be used to show that the quality of the quasi-Newton approximations is good enough to yield superlinear convergence. For details, see Dennis and Moré [91] or Dennis and Schnabel [92].
EXERCISES

6.1
(a) Show that if $f$ is strongly convex, then (6.7) holds for any vectors $x_k$ and $x_{k+1}$.
(b) Give an example of a function of one variable satisfying $g(0) = 1$ and $g(1) = \frac14$ and show that (6.7) does not hold in this case.

6.2 Show that the second strong Wolfe condition (3.7b) implies the curvature condition (6.7).

6.3 Verify that (6.19) and (6.17) are inverses of each other.

6.4 Use the Sherman-Morrison formula (A.27) to show that (6.24) is the inverse of (6.25).

6.5 Prove the statements (ii) and (iii) given in the paragraph following (6.25).

6.6 The square root of a matrix $A$ is a matrix $A^{1/2}$ such that $A^{1/2} A^{1/2} = A$. Show that any symmetric positive definite matrix $A$ has a square root, and that this square root is itself symmetric and positive definite. (Hint: Use the factorization $A = U D U^T$ (A.16), where $U$ is orthogonal and $D$ is diagonal with positive diagonal elements.)

6.7 Use the Cauchy-Schwarz inequality (A.5) to verify that $\mu_k \ge 1$, where $\mu_k$ is defined by (6.38).

6.8 Define $h(t) = 1 - t + \ln t$, and note that $h'(t) = -1 + 1/t$, $h''(t) = -1/t^2 < 0$, $h(1) = 0$, and $h'(1) = 0$. Show that $h(t) \le 0$ for all $t > 0$.

6.9 Denote the eigenvalues of the positive definite matrix $B$ by $\lambda_1, \lambda_2, \ldots, \lambda_n$, where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Show that the function $\psi$ defined in (6.49) can be written as

\[ \psi(B) = \sum_{i=1}^{n} (\lambda_i - \ln \lambda_i). \]

Use this form to show that $\psi(B) > 0$.

6.10 The object of this exercise is to prove (6.45).
(a) Show that $\det(I + x y^T) = 1 + y^T x$, where $x$ and $y$ are $n$-vectors. (Hint: Assuming that $x \ne 0$, we can find vectors $w_1, w_2, \ldots, w_{n-1}$ such that the matrix $Q$ defined by

\[ Q = [x, w_1, w_2, \ldots, w_{n-1}] \]

is nonsingular and $x = Q e_1$, where $e_1 = (1, 0, 0, \ldots, 0)^T$. If we define

\[ y^T Q = (z_1, z_2, \ldots, z_n), \]

then

\[ z_1 = y^T Q e_1 = y^T Q Q^{-1} x = y^T x, \]

and

\[ \det(I + x y^T) = \det\left( Q^{-1} (I + x y^T) Q \right) = \det(I + e_1 y^T Q).) \]

(b) Use a similar technique to prove that

\[ \det(I + x y^T + u v^T) = (1 + y^T x)(1 + v^T u) - (x^T v)(y^T u). \]

(c) Use this relation to establish (6.45).

6.11 Use the properties of the trace of a symmetric matrix and the formula (6.19) to prove (6.44).

6.12 Show that if $f$ satisfies Assumption 6.1 and if the sequence of gradients satisfies $\liminf \|\nabla f_k\| = 0$, then the whole sequence of iterates $\{x_k\}$ converges to the solution $x^*$.
CHAPTER 7
Large-Scale Unconstrained Optimization
Many applications give rise to unconstrained optimization problems with thousands or millions of variables. Problems of this size can be solved efficiently only if the storage and computational costs of the optimization algorithm can be kept at a tolerable level. A diverse collection of large-scale optimization methods has been developed to achieve this goal, each being particularly effective for certain problem types. Some of these methods are straightforward adaptations of the methods described in Chapters 3, 4, and 6. Other approaches are modifications of these basic methods that allow approximate steps to be calculated at lower cost in computation and storage. One set of approaches that we have already discussed (the nonlinear conjugate gradient methods of Section 5.2) can be applied to large problems without modification, because of its minimal storage demands and its reliance on only first-order derivative information.
The line search and trust-region Newton algorithms of Chapters 3 and 4 require matrix factorizations of the Hessian matrices $\nabla^2 f_k$. In the large-scale case, these factorizations can be carried out using sparse elimination techniques. Such algorithms have received much attention, and high-quality software implementations are available. If the computational cost and memory requirements of these sparse factorization methods are affordable for a given application, and if the Hessian matrix can be formed explicitly, Newton methods based on sparse factorizations constitute an effective approach for solving such problems.

Often, however, the cost of factoring the Hessian is prohibitive, and it is preferable to compute approximations to the Newton step using iterative linear algebra techniques. Section 7.1 discusses inexact Newton methods that use these techniques, in both line search and trust-region frameworks. The resulting algorithms have attractive global convergence properties and may be superlinearly convergent for suitable choices of parameters. They find effective search directions when the Hessian $\nabla^2 f_k$ is indefinite, and may even be implemented in a Hessian-free manner, without explicit calculation or storage of the Hessian.

The Hessian approximations generated by the quasi-Newton approaches of Chapter 6 are usually dense, even when the true Hessian is sparse, and the cost of storing and working with these approximations can be excessive for large $n$. Section 7.2 discusses limited-memory variants of the quasi-Newton approach, which use Hessian approximations that can be stored compactly by using just a few vectors of length $n$. These methods are fairly robust, inexpensive, and easy to implement, but they do not converge rapidly. Another approach, discussed briefly in Section 7.3, is to define quasi-Newton approximate Hessians $B_k$ that preserve sparsity, for example by mimicking the sparsity pattern of the Hessian.

In Section 7.4, we note that objective functions in large problems often possess a structural property known as partial separability, which means they can be decomposed into a sum of simpler functions, each of which depends on only a small subspace of $\mathbb{R}^n$. Effective Newton and quasi-Newton methods that exploit this property have been developed. Such methods usually converge rapidly and are robust, but they require detailed information about the objective function, which can be difficult to obtain in some applications.

We conclude the chapter with a discussion of software for large-scale unconstrained optimization problems.
7.1 INEXACT NEWTON METHODS
Recall from (2.15) that the basic Newton step $p_k^N$ is obtained by solving the symmetric $n \times n$ linear system

\[ \nabla^2 f_k \, p_k^N = -\nabla f_k. \tag{7.1} \]

In this section, we describe techniques for obtaining approximations to $p_k^N$ that are
inexpensive to calculate but are good search directions or steps. These approaches are based on solving (7.1) by using the conjugate gradient (CG) method (see Chapter 5) or the Lanczos method, with modifications to handle negative curvature in the Hessian $\nabla^2 f_k$. Both line search and trust-region approaches are described here. We refer to this family of methods by the general name inexact Newton methods.

The use of iterative methods for (7.1) spares us from concerns about the expense of a direct factorization of the Hessian $\nabla^2 f_k$ and the fill-in that may occur during this process. Further, we can customize the solution strategy to ensure that the rapid convergence properties associated with Newton's method are not lost in the inexact version. In addition, as noted below, we can implement these methods in a Hessian-free manner, so that the Hessian $\nabla^2 f_k$ need not be calculated or stored explicitly at all.

We examine first how the inexactness in the step calculation determines the local convergence properties of inexact Newton methods. We then consider line search and trust-region approaches based on using CG (possibly with preconditioning) to obtain an approximate solution of (7.1). Finally, we discuss the use of the Lanczos method for solving (7.1) approximately.
LOCAL CONVERGENCE OF INEXACT NEWTON METHODS
Most rules for terminating the iterative solver for (7.1) are based on the residual

\[ r_k = \nabla^2 f_k \, p_k + \nabla f_k, \tag{7.2} \]

where $p_k$ is the inexact Newton step. Usually, we terminate the CG iterations when

\[ \| r_k \| \le \eta_k \| \nabla f_k \|, \tag{7.3} \]

where the sequence $\{\eta_k\}$ (with $0 < \eta_k < 1$ for all $k$) is called the forcing sequence.

We now study how the rate of convergence of inexact Newton methods based on (7.1)-(7.3) is affected by the choice of the forcing sequence. The next two theorems apply not just to Newton-CG procedures but to all inexact Newton methods whose steps satisfy (7.2) and (7.3).

Our first result says that local convergence is obtained simply by ensuring that $\eta_k$ is bounded away from 1.
Theorem 7.1.
Suppose that $\nabla^2 f(x)$ exists and is continuous in a neighborhood of a minimizer $x^*$, and that $\nabla^2 f(x^*)$ is positive definite. Consider the iteration $x_{k+1} = x_k + p_k$, where $p_k$ satisfies (7.3), and assume that $\eta_k \le \eta$ for some constant $\eta \in [0, 1)$. Then, if the starting point $x_0$ is sufficiently near $x^*$, the sequence $\{x_k\}$ converges to $x^*$ and satisfies

\[ \| \nabla^2 f(x^*)(x_{k+1} - x^*) \| \le \hat\eta \, \| \nabla^2 f(x^*)(x_k - x^*) \|, \tag{7.4} \]

for some constant $\hat\eta$ with $\eta < \hat\eta < 1$.
Rather than giving a rigorous proof of this theorem, we present an informal derivation that contains the essence of the argument and motivates the next result.
Since the Hessian matrix $\nabla^2 f$ is positive definite at $x^*$ and continuous near $x^*$, there exists a positive constant $L$ such that $\|\nabla^2 f_k^{-1}\| \le L$ for all $x_k$ sufficiently close to $x^*$. We therefore have from (7.2) that the inexact Newton step satisfies

\[ \| p_k \| \le L \left( \| \nabla f_k \| + \| r_k \| \right) \le 2 L \| \nabla f_k \|, \]

where the second inequality follows from (7.3) and $\eta_k < 1$. Using this expression together with Taylor's theorem and the continuity of $\nabla^2 f(x)$, we obtain

\begin{align*}
\nabla f_{k+1} &= \nabla f_k + \nabla^2 f_k \, p_k + \int_0^1 \left[ \nabla^2 f(x_k + t p_k) - \nabla^2 f(x_k) \right] p_k \, dt \\
&= \nabla f_k + \nabla^2 f_k \, p_k + o(\| p_k \|) \\
&= \nabla f_k + (r_k - \nabla f_k) + o(\| \nabla f_k \|) = r_k + o(\| \nabla f_k \|). \tag{7.5}
\end{align*}

Taking norms and recalling (7.3), we have that

\[ \| \nabla f_{k+1} \| \le \eta_k \| \nabla f_k \| + o(\| \nabla f_k \|) = (\eta_k + o(1)) \| \nabla f_k \|. \tag{7.6} \]

When $x_k$ is close enough to $x^*$ that the $o(1)$ term in the last estimate is bounded by $(1 - \eta)/2$, we have

\[ \| \nabla f_{k+1} \| \le \left( \eta_k + \frac{1 - \eta}{2} \right) \| \nabla f_k \| \le \frac{1 + \eta}{2} \| \nabla f_k \|, \tag{7.7} \]

so the gradient norm decreases by a factor of $(1 + \eta)/2$ at this iteration. By choosing the initial point $x_0$ sufficiently close to $x^*$, we can ensure that this rate of decrease occurs at every iteration.

To prove (7.4), we note that under our smoothness assumptions, we have

\[ \nabla f_k = \nabla^2 f(x^*)(x_k - x^*) + o(\| x_k - x^* \|). \]

Hence it can be shown that for $x_k$ close to $x^*$, the gradient $\nabla f_k$ differs from the scaled error $\nabla^2 f(x^*)(x_k - x^*)$ by only a relatively small perturbation. A similar estimate holds at $x_{k+1}$, so (7.4) follows from (7.7).
From (7.6), we have that

\[ \frac{\| \nabla f_{k+1} \|}{\| \nabla f_k \|} \le \eta_k + o(1). \tag{7.8} \]

If $\lim_{k\to\infty} \eta_k = 0$, we have from this expression that

\[ \lim_{k\to\infty} \frac{\| \nabla f_{k+1} \|}{\| \nabla f_k \|} = 0, \]

indicating Q-superlinear convergence of the gradient norms $\|\nabla f_k\|$ to zero. Superlinear convergence of the iterates $\{x_k\}$ to $x^*$ can be proved as a consequence.

We can obtain quadratic convergence by making the additional assumption that the Hessian $\nabla^2 f(x)$ is Lipschitz continuous near $x^*$. In this case, the estimate (7.5) can be tightened to

\[ \nabla f_{k+1} = r_k + O(\| \nabla f_k \|^2). \]

By choosing the forcing sequence so that $\eta_k = O(\|\nabla f_k\|)$, we have from this expression that

\[ \| \nabla f_{k+1} \| = O(\| \nabla f_k \|^2), \]

indicating Q-quadratic convergence of the gradient norms to zero, and thus also Q-quadratic convergence of the iterates $\{x_k\}$ to $x^*$. The last two observations are summarized in the following theorem.
Theorem 7.2.
Suppose that the conditions of Theorem 7.1 hold, and assume that the iterates $\{x_k\}$ generated by the inexact Newton method converge to $x^*$. Then the rate of convergence is superlinear if $\eta_k \to 0$. If in addition $\nabla^2 f(x)$ is Lipschitz continuous for $x$ near $x^*$ and if $\eta_k = O(\|\nabla f_k\|)$, then the convergence is quadratic.

To obtain superlinear convergence, we can set, for example, $\eta_k = \min\left(0.5, \sqrt{\|\nabla f_k\|}\right)$; the choice $\eta_k = \min(0.5, \|\nabla f_k\|)$ would yield quadratic convergence.

All the results presented in this section, which are proved by Dembo, Eisenstat, and Steihaug [89], are local in nature: they assume that the sequence $\{x_k\}$ eventually enters the near vicinity of the solution $x^*$. They also assume that the unit step length $\alpha_k = 1$ is taken, and hence that globalization strategies do not interfere with rapid convergence. In the following pages we show that inexact Newton strategies can, in fact, be incorporated in practical line search and trust-region implementations of Newton's method, yielding algorithms with good local and global convergence properties. We start with a line search approach.
LINE SEARCH NEWTON-CG METHOD
In the line search Newton-CG method, also known as the truncated Newton method, we compute the search direction by applying the CG method to the Newton equations (7.1) and attempt to satisfy a termination test of the form (7.3). However, the CG method is designed to solve positive definite systems, and the Hessian $\nabla^2 f_k$ may have negative eigenvalues when $x_k$ is not close to a solution. Therefore, we terminate the CG iteration as soon as a direction of negative curvature is generated. This adaptation of the CG method produces a search direction $p_k$ that is a descent direction. Moreover, the adaptation guarantees that the fast convergence rate of the pure Newton method is preserved, provided that the step length $\alpha_k = 1$ is used whenever it satisfies the acceptance criteria.

We now describe Algorithm 7.1, a line search algorithm that uses a modification of Algorithm 5.2 as the inner iteration to compute each search direction $p_k$. For purposes of this algorithm, we write the linear system (7.1) in the form

\[ B_k \, p = -\nabla f_k, \tag{7.9} \]

where $B_k$ represents $\nabla^2 f_k$. For the inner CG iteration, we denote the search directions by $d_j$ and the sequence of iterates that it generates by $\{z_j\}$. When $B_k$ is positive definite, the inner iteration sequence $\{z_j\}$ will converge to the Newton step $p_k^N$ that solves (7.9). At each major iteration, we define a tolerance $\epsilon_k$ that specifies the required accuracy of the computed solution. For concreteness, we choose the forcing sequence to be $\eta_k = \min\left(0.5, \sqrt{\|\nabla f_k\|}\right)$ to obtain a superlinear convergence rate, but other choices are possible.
Algorithm 7.1 (Line Search Newton-CG).
Given initial point $x_0$;
for $k = 0, 1, 2, \ldots$
    Define tolerance $\epsilon_k = \min\left(0.5, \sqrt{\|\nabla f_k\|}\right) \|\nabla f_k\|$;
    Set $z_0 = 0$, $r_0 = \nabla f_k$, $d_0 = -r_0 = -\nabla f_k$;
    for $j = 0, 1, 2, \ldots$
        if $d_j^T B_k d_j \le 0$
            if $j = 0$
                return $p_k = -\nabla f_k$;
            else
                return $p_k = z_j$;
        Set $\alpha_j = r_j^T r_j / d_j^T B_k d_j$;
        Set $z_{j+1} = z_j + \alpha_j d_j$;
        Set $r_{j+1} = r_j + \alpha_j B_k d_j$;
        if $\|r_{j+1}\| < \epsilon_k$
            return $p_k = z_{j+1}$;
        Set $\beta_{j+1} = r_{j+1}^T r_{j+1} / r_j^T r_j$;
        Set $d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$;
    end (for)
    Set $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe, Goldstein, or Armijo backtracking conditions (using $\alpha_k = 1$ if possible);
end
The main differences between the inner loop of Algorithm 7.1 and Algorithm 5.2 are that the specific starting point $z_0 = 0$ is used; the use of a positive tolerance $\epsilon_k$ allows the CG iterations to terminate at an inexact solution; and the negative curvature test $d_j^T B_k d_j \le 0$ ensures that $p_k$ is a descent direction for $f$ at $x_k$. If negative curvature is detected on the first inner iteration $j = 0$, the returned direction $p_k = -\nabla f_k$ is both a descent direction and a direction of nonpositive curvature for $f$ at $x_k$.
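A Python transcription of this inner loop (our sketch; the iteration cap is an arbitrary safeguard, not part of Algorithm 7.1):

```python
import numpy as np

def newton_cg_direction(B, grad_fk, eps_k, max_cg=None):
    """Inner CG loop of Algorithm 7.1 (a sketch): approximately solve
    B p = -grad_fk, returning a descent direction even when B is indefinite."""
    n = grad_fk.size
    max_cg = 2 * n if max_cg is None else max_cg   # arbitrary safeguard
    z = np.zeros(n)
    r = grad_fk.copy()            # residual of B z + grad_fk, with z_0 = 0
    d = -r
    for j in range(max_cg):
        Bd = B @ d
        curv = d @ Bd
        if curv <= 0:             # negative curvature: bail out
            return -grad_fk if j == 0 else z
        alpha = (r @ r) / curv
        z = z + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) < eps_k:
            return z
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return z

g = np.array([1.0, -2.0, 0.5])
B = np.diag([2.0, 1.0, 0.5])
eps = min(0.5, np.sqrt(np.linalg.norm(g))) * np.linalg.norm(g)
print(newton_cg_direction(B, g, eps))
```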
We can modify the CG iterations in Algorithm 7.1 by introducing preconditioning, in the manner described in Chapter 5.
Algorithm 7.1 is well suited for large problems, but it has a weakness. When the Hessian $\nabla^2 f_k$ is nearly singular, the line search Newton-CG direction can be long and of poor quality, requiring many function evaluations in the line search and giving only a small reduction in the function. To alleviate this difficulty, we can try to normalize the Newton step, but good rules for doing so are difficult to determine. (They run the risk of undermining the rapid convergence of Newton's method in the case where the pure Newton step is well scaled.) It is preferable to introduce a threshold value into the test $d_j^T B_k d_j \le 0$, but good choices of the threshold are difficult to determine. The trust-region Newton-CG method described below deals more effectively with this problematic situation and is therefore preferable, in our opinion.

The line search Newton-CG method does not require explicit knowledge of the Hessian $B_k = \nabla^2 f_k$. Rather, it requires only that we can supply Hessian-vector products of the form $\nabla^2 f_k d$ for any given vector $d$. When the user cannot easily supply code to calculate second derivatives, or where the Hessian requires too much storage, the techniques of Chapter 8 (automatic differentiation and finite differencing) can be used to calculate these Hessian-vector products. Methods of this type are known as Hessian-free Newton methods.
To illustrate the finite-differencing technique briefly, we use the approximation

\[ \nabla^2 f_k \, d \approx \frac{\nabla f(x_k + h d) - \nabla f(x_k)}{h}, \tag{7.10} \]

for some small differencing interval $h$. It is easy to prove that the accuracy of this approximation is $O(h)$; appropriate choices of $h$ are discussed in Chapter 8. The price we pay for bypassing the computation of the Hessian is one new gradient evaluation per CG iteration.
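A sketch of (7.10) in Python; the differencing interval $h$ used here is an example value, not a recommendation:

```python
import numpy as np
from scipy.optimize import rosen_der

def hessian_vector_product(grad, x, d, h=1e-7):
    """Hessian-free product (7.10): approximate (Hessian at x) @ d using one
    extra gradient evaluation (a sketch; h is an example value)."""
    return (grad(x + h * d) - grad(x)) / h

x = np.array([-1.2, 1.0])
d = np.array([1.0, 0.0])
print(hessian_vector_product(rosen_der, x, d))
```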
TRUST-REGION NEWTON-CG METHOD

In Chapter 4, we discussed approaches for finding an approximate solution of the trust-region subproblem (4.3) that produce improvements on the Cauchy point. Here we define a modified CG algorithm for solving the subproblem with these properties. This algorithm, due to Steihaug [281], is specified below as Algorithm 7.2. A complete algorithm for minimizing $f$ is obtained by using Algorithm 7.2 to generate the step $p_k$ required by Algorithm 4.1 of Chapter 4, for some choice of tolerance $\epsilon_k$ at each iteration.
We use notation similar to (7.9) to define the trust-region subproblem for which Steihaug's method finds an approximate solution:

\[ \min_{p \in \mathbb{R}^n} m_k(p) \stackrel{\text{def}}{=} f_k + \nabla f_k^T p + \tfrac12 p^T B_k p \qquad \text{subject to } \|p\| \le \Delta_k, \tag{7.11} \]

where $B_k = \nabla^2 f_k$. As in Algorithm 7.1, we use $d_j$ to denote the search directions of this modified CG iteration and $\{z_j\}$ to denote the sequence of iterates that it generates.
Algorithm 7.2 (CG-Steihaug).
Given tolerance $\epsilon_k > 0$;
Set $z_0 = 0$, $r_0 = \nabla f_k$, $d_0 = -r_0 = -\nabla f_k$;
if $\|r_0\| < \epsilon_k$
    return $p_k = z_0 = 0$;
for $j = 0, 1, 2, \ldots$
    if $d_j^T B_k d_j \le 0$
        Find $\tau$ such that $p_k = z_j + \tau d_j$ minimizes $m_k(p_k)$ in (4.5)
            and satisfies $\|p_k\| = \Delta_k$;
        return $p_k$;
    Set $\alpha_j = r_j^T r_j / d_j^T B_k d_j$;
    Set $z_{j+1} = z_j + \alpha_j d_j$;
    if $\|z_{j+1}\| \ge \Delta_k$
        Find $\tau \ge 0$ such that $p_k = z_j + \tau d_j$ satisfies $\|p_k\| = \Delta_k$;
        return $p_k$;
    Set $r_{j+1} = r_j + \alpha_j B_k d_j$;
    if $\|r_{j+1}\| < \epsilon_k$
        return $p_k = z_{j+1}$;
    Set $\beta_{j+1} = r_{j+1}^T r_{j+1} / r_j^T r_j$;
    Set $d_{j+1} = -r_{j+1} + \beta_{j+1} d_j$;
end (for).
The first if statement inside the loop stops the method if its current search direction $d_j$ is a direction of nonpositive curvature along $B_k$, while the second if statement inside the loop causes termination if $z_{j+1}$ violates the trust-region bound. In both cases, the method returns the step $p_k$ obtained by intersecting the current search direction with the trust-region boundary.
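Both boundary returns reduce to the same calculation: find $\tau$ with $\|z_j + \tau d_j\| = \Delta_k$ by solving a scalar quadratic. A minimal sketch, taking the positive root (appropriate when moving outward along $d_j$):

```python
import numpy as np

def to_boundary(z, d, delta):
    """Solve ||z + tau d|| = delta for tau >= 0 (a sketch of the boundary
    step used twice in Algorithm 7.2)."""
    a = d @ d
    b = 2.0 * (z @ d)
    c = z @ z - delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)  # positive root
    return z + tau * d

z = np.array([0.3, 0.4]); d = np.array([1.0, 0.0])
p = to_boundary(z, d, 1.0)
print(np.linalg.norm(p))   # 1.0: the step lands on the trust-region boundary
```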
The choice of the tolerance $\epsilon_k$ at each call to Algorithm 7.2 is important in keeping the overall cost of the trust-region Newton-CG method low. Near a well-behaved solution $x^*$, the trust-region bound becomes inactive, and the method reduces to the inexact Newton method analyzed in Theorems 7.1 and 7.2. Rapid convergence can be obtained in these circumstances by choosing $\epsilon_k$ in a similar fashion to Algorithm 7.1.

The essential differences between Algorithm 5.2 and the inner loop of Algorithm 7.2 are that the latter terminates when it violates the trust-region bound $\|p\| \le \Delta_k$, when it encounters a direction of negative curvature in $\nabla^2 f_k$, or when it satisfies a convergence tolerance defined by the parameter $\epsilon_k$. In these respects, Algorithm 7.2 is quite similar to the inner loop of Algorithm 7.1.
The initialization of $z_0$ to zero in Algorithm 7.2 is a crucial feature of the algorithm. Provided $\|\nabla f_k\| \ge \epsilon_k$, Algorithm 7.2 terminates at a point $p_k$ for which $m_k(p_k) \le m_k(p_k^C)$, that is, when the reduction in model function equals or exceeds that of the Cauchy point. To demonstrate this fact, we consider several cases. First, if $d_0^T B_k d_0 = \nabla f_k^T B_k \nabla f_k \le 0$, then the condition in the first if statement is satisfied, and the algorithm returns the Cauchy point $p = -\Delta_k \nabla f_k / \|\nabla f_k\|$. Otherwise, Algorithm 7.2 defines $z_1$ as follows:

\[ z_1 = \alpha_0 d_0 = \frac{r_0^T r_0}{d_0^T B_k d_0} \, d_0 = -\frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T B_k \nabla f_k} \, \nabla f_k. \]

If $\|z_1\| < \Delta_k$, then $z_1$ is exactly the Cauchy point. Subsequent steps of Algorithm 7.2 ensure that the final $p_k$ satisfies $m_k(p_k) \le m_k(z_1)$. When $\|z_1\| \ge \Delta_k$, on the other hand, the second if statement is activated, and Algorithm 7.2 terminates at the Cauchy point, proving our claim. This property is important for global convergence: since each step is at least as good as the Cauchy point in reducing the model $m_k$, Algorithm 7.2 is globally convergent.

Another crucial property of the method is that each iterate $z_j$ is larger in norm than its predecessor. This property is another consequence of the initialization $z_0 = 0$. Its main implication is that it is acceptable to stop iterating as soon as the trust-region boundary is reached, because no further iterates giving a lower value of the model function $m_k$ will lie inside the trust region. We state and prove this property formally in the following theorem, which makes use of the expanding subspace property of the conjugate gradient algorithm, described in Theorem 5.2.
Theorem 7.3.
The sequence of vectors $\{z_j\}$ generated by Algorithm 7.2 satisfies

\[ 0 = \|z_0\|_2 < \cdots < \|z_j\|_2 < \|z_{j+1}\|_2 \le \cdots \le \|p_k\|_2 \le \Delta_k. \]
PROOF. We first show that the sequences of vectors generated by Algorithm 7.2 satisfy zTj rj 0for j 0andzTj dj 0for j 1.
Algorithm 7.2 computes z j 1 recursively in terms of z j ; but when all the terms of this recursion are written explicitly, we see that
j1 j1
zj z0 idi idi, i0 i0
since z0 0. Multiplying by r j and applying the expanding subspace property of conjugate gradients see Theorem 5.2, we obtain
j1
zTjrj
An induction proof establishes the relation zTj dj 0. By applying the expanding
subspace property again, we obtain
z 1T d 1 0 d 0 T r 1 1 d 0 0 1 d 0T d 0 0 .
We now make the inductive hypothesis that zTj dj 0 and deduce that zTj1dj1 0. From 7.12, we have zTj1rj1 0, and therefore
zTj1dj1 zTj1rj1 j1dj j1zTj1dj
j1zj jdjTdj
j 1 z Tj d j j j 1 d Tj d j .
Because of the inductive hypothesis and positivity of j1 and j, the last expression is positive.
We now prove the theorem. If Algorithm 7.2 terminates because d Tj Bk d j 0 or zj1 2 ak, then the final point pk is chosen to make pk 2 ak, which is the largest possible length. To cover all other possibilities in the algorithm, we must show that
zj 2 zj1 2whenzj1zjjdjandj1.Observethat
zj1 2 zj jdjTzj jdj zj 2 2jzTj dj 2j dj 2.
It follows from this expression and our intermediate result that proof is complete.
z j
2
z j 1
2 , so our
i0
FromthistheoremweseethatAlgorithm7.2sweepsoutpointszj thatmoveonsome interpolating path from z1 to the final solution pk, a path in which every step increases its total distance from the start point. When Bk 2 fk is positive definite, this path may be compared to the path of the dogleg method: Both methods start by minimizing mk along the negative gradient direction fk and subsequently progress toward pkN, until the trustregion boundary intervenes. One can show that, when Bk 2 fk is positive definite, Algorithm 7.2 provides a decrease in the model 7.11 that is at least half as good as the optimal decrease 320.
PRECONDITIONING THE TRUST-REGION NEWTON-CG METHOD
As discussed in Chapter 5, preconditioning can be used to accelerate the CG iteration. Preconditioning techniques are based on finding a nonsingular matrix $D$ such that the eigenvalues of $D^{-T} \nabla^2 f_k D^{-1}$ have a more favorable distribution. By generalizing Theorem 7.3, we can show that the iterates $z_j$ generated by a preconditioned variant of Algorithm 7.2 will grow monotonically in the weighted norm $\|Dz\|$. To be consistent, we should redefine the trust-region subproblem in terms of the same norm, as follows:
$$\min_{p \in \mathbb{R}^n}\; m_k(p) \stackrel{\rm def}{=} f_k + \nabla f_k^T p + \tfrac{1}{2} p^T B_k p \quad \mbox{subject to } \|Dp\| \le \Delta_k. \qquad (7.13)$$
Making the change of variables $\hat p = Dp$ and defining

$$\hat g_k = D^{-T} \nabla f_k, \qquad \hat B_k = D^{-T} \nabla^2 f_k D^{-1},$$

we can write (7.13) as

$$\min_{\hat p \in \mathbb{R}^n}\; f_k + \hat g_k^T \hat p + \tfrac{1}{2} \hat p^T \hat B_k \hat p \quad \mbox{subject to } \|\hat p\| \le \Delta_k,$$
which has exactly the form of (7.11). We can apply Algorithm 7.2 without any modification to this subproblem, which is equivalent to applying a preconditioned version of Algorithm 7.2 to the problem (7.13).
Many preconditioners can be used within this framework; we discuss some of them in Chapter 5. Of particular interest is incomplete Cholesky factorization, which has proved useful in a wide range of optimization problems. The incomplete Cholesky factorization of a positive definite matrix $B$ finds a lower triangular matrix $L$ such that

$$B = L L^T - R,$$

where the amount of fill-in in $L$ is restricted in some way. For instance, it is constrained to have the same sparsity structure as the lower triangular part of $B$, or is allowed to have a number of nonzero entries similar to that in $B$. The matrix $R$ accounts for the inexactness in the approximate factorization. The situation is complicated somewhat by the possible indefiniteness of the Hessian $\nabla^2 f_k$; we must be able to handle this indefiniteness as well as maintain the sparsity. The following algorithm combines incomplete Cholesky and a form of modified Cholesky to define a preconditioner for the trust-region Newton-CG approach.
Algorithm 7.3 (Inexact Modified Cholesky).
  Compute $T = \mbox{diag}(\|Be_1\|, \|Be_2\|, \ldots, \|Be_n\|)$, where $e_i$ is the $i$th coordinate vector;
  Set $\bar B \leftarrow T^{-1/2} B T^{-1/2}$; Set $\beta \leftarrow \|\bar B\|$;
  (compute a shift to ensure positive definiteness)
  if $\min_i \bar b_{ii} > 0$
    $\alpha_0 \leftarrow 0$
  else
    $\alpha_0 \leftarrow \beta/2$;
  for $k = 0, 1, 2, \ldots$
    Attempt to apply the incomplete Cholesky algorithm to obtain
      $L L^T \approx \bar B + \alpha_k I$;
    if the factorization is completed successfully
      stop and return $L$;
    else
      $\alpha_{k+1} \leftarrow \max(2\alpha_k, \beta/2)$;
  end (for)
We can then set the preconditioner to be $D = L^T$, where $L$ is the lower triangular matrix output from Algorithm 7.3. A trust-region Newton-CG method using this preconditioner is implemented in the LANCELOT [72] and TRON [192] codes.
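The shift-and-retry loop of Algorithm 7.3 is easy to sketch. The rough Python illustration below makes two substitutions for the sake of compactness, both assumptions of this example: numpy's dense exact Cholesky stands in for a true incomplete Cholesky routine (such as ICFS would provide), and the Frobenius norm stands in for $\|\bar B\|$.

```python
import numpy as np

def inexact_modified_cholesky(B, max_tries=30):
    """Rough sketch of Algorithm 7.3 for a dense symmetric matrix B."""
    n = B.shape[0]
    t = np.linalg.norm(B, axis=0)          # T = diag(||B e_1||, ..., ||B e_n||)
    scale = np.diag(1.0 / np.sqrt(t))
    Bbar = scale @ B @ scale               # Bbar = T^{-1/2} B T^{-1/2}
    beta = np.linalg.norm(Bbar, "fro")     # stand-in for ||Bbar||
    alpha = 0.0 if np.diag(Bbar).min() > 0 else beta / 2.0
    for _ in range(max_tries):
        try:
            # Stand-in for an incomplete factorization of Bbar + alpha*I.
            return np.linalg.cholesky(Bbar + alpha * np.eye(n))
        except np.linalg.LinAlgError:      # factorization failed: increase shift
            alpha = max(2.0 * alpha, beta / 2.0)
    raise RuntimeError("could not find a suitable shift")
```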
TRUST-REGION NEWTON-LANCZOS METHOD
A limitation of Algorithm 7.2 is that it accepts any direction of negative curvature, even when this direction gives an insignificant reduction in the model. Consider, for example, the case where the subproblem (7.11) is

$$\min_p\; m(p) = 10^{-3} p_1 - 10^{-4} p_1^2 - p_2^2 \quad \mbox{subject to } \|p\| \le 1,$$

where subscripts indicate elements of the vector $p$. The steepest descent direction at $p = 0$ is $(-10^{-3}, 0)^T$, which is a direction of negative curvature for the model. Algorithm 7.2 would follow this direction to the boundary of the trust region, yielding a reduction in model function $m$ of about $10^{-3}$. A step along $e_2$, also a direction of negative curvature, would yield a much greater reduction of 1.
Several remedies have been proposed. We have seen in Chapter 4 that when the Hessian $\nabla^2 f_k$ contains negative eigenvalues, the search direction should have a significant component along the eigenvector corresponding to the most negative eigenvalue of $\nabla^2 f_k$. This feature would allow the algorithm to move away rapidly from stationary points that are not minimizers. One way to achieve this is to compute a nearly exact solution of the trust-region subproblem (7.11) using the techniques described in Section 4.3. This approach requires the solution of a few linear systems with coefficient matrices of the form $B_k + \lambda I$. Although this approach is perhaps too expensive in the large-scale case, it generates productive search directions in all cases.
A more practical alternative is to use the Lanczos method (see, for example, [136]) rather than the CG method to solve the linear system $B_k p = -\nabla f_k$. The Lanczos method can be seen as a generalization of the CG method that is applicable to indefinite systems, and we can use it to continue the CG process while gathering negative curvature information.
After $j$ steps, the Lanczos method generates an $n \times j$ matrix $Q_j$ with orthogonal columns that span the Krylov subspace (5.15) generated by this method. This matrix has the property that $Q_j^T B Q_j = T_j$, where $T_j$ is tridiagonal. We can take advantage of this tridiagonal structure and seek an approximate solution of the trust-region subproblem in the range of the basis $Q_j$. To do so, we solve the problem
$$\min_{w \in \mathbb{R}^j}\; f_k + \left(e_1^T Q_j^T \nabla f_k\right) e_1^T w + \tfrac{1}{2} w^T T_j w \quad \mbox{subject to } \|w\| \le \Delta_k, \qquad (7.14)$$

where $e_1 = (1, 0, 0, \ldots, 0)^T$, and we define the approximate solution of the trust-region subproblem as $p_k = Q_j w$. Since $T_j$ is tridiagonal, problem (7.14) can be solved by factoring the system $T_j + \lambda I$ and following the nearly exact approach of Section 4.3.
The Lanczos iteration may be terminated, as in the Newton-CG methods, by a test of the form (7.3). Preconditioning can also be incorporated to accelerate the convergence of the Lanczos iteration. The additional robustness in this trust-region algorithm comes at the cost of a more expensive solution of the subproblem than in the Newton-CG approach. A sophisticated implementation of the Newton-Lanczos approach is provided in the GLTR package [145].
7.2 LIMITED-MEMORY QUASI-NEWTON METHODS
Limited-memory quasi-Newton methods are useful for solving large problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse. These methods maintain simple and compact approximations of Hessian matrices: Instead of storing fully dense $n \times n$ approximations, they save only a few vectors of length $n$ that represent the approximations implicitly. Despite these modest storage requirements, they often yield an acceptable (albeit linear) rate of convergence. Various limited-memory methods have been proposed; we focus mainly on an algorithm known as L-BFGS, which, as its name suggests, is based on the BFGS updating formula. The main idea of this method is to use curvature information from only the most recent iterations to construct the Hessian approximation. Curvature information from earlier iterations, which is less likely to be relevant to the actual behavior of the Hessian at the current iteration, is discarded in the interest of saving storage.
Following our discussion of L-BFGS and its convergence behavior, we discuss its relationship to the nonlinear conjugate gradient methods of Chapter 5. We then discuss
implementations of limited-memory schemes that make use of a compact representation of approximate Hessian information. These techniques can be applied not only to L-BFGS but also to limited-memory versions of other quasi-Newton procedures such as SR1. Finally, we discuss quasi-Newton updating schemes that impose a particular sparsity pattern on the approximate Hessian.

LIMITED-MEMORY BFGS

We begin our description of the L-BFGS method by recalling its parent, the BFGS method, which was described in Algorithm 6.1. Each step of the BFGS method has the form

$$x_{k+1} = x_k - \alpha_k H_k \nabla f_k, \qquad (7.15)$$

where $\alpha_k$ is the step length and $H_k$ is updated at every iteration by means of the formula

$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T \qquad (7.16)$$

(see (6.17)), where

$$\rho_k = \frac{1}{y_k^T s_k}, \qquad V_k = I - \rho_k y_k s_k^T, \qquad (7.17)$$

and

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k. \qquad (7.18)$$
Since the inverse Hessian approximation $H_k$ will generally be dense, the cost of storing and manipulating it is prohibitive when the number of variables is large. To circumvent this problem, we store a modified version of $H_k$ implicitly, by storing a certain number (say, $m$) of the vector pairs $\{s_i, y_i\}$ used in the formulas (7.16)-(7.18). The product $H_k \nabla f_k$ can be obtained by performing a sequence of inner products and vector summations involving $\nabla f_k$ and the pairs $\{s_i, y_i\}$. After the new iterate is computed, the oldest vector pair in the set of pairs $\{s_i, y_i\}$ is replaced by the new pair $\{s_k, y_k\}$ obtained from the current step (7.18). In this way, the set of vector pairs includes curvature information from the $m$ most recent iterations. Practical experience has shown that modest values of $m$ (between 3 and 20, say) often produce satisfactory results.
We now describe the updating process in a little more detail. At iteration $k$, the current iterate is $x_k$ and the set of vector pairs is given by $\{s_i, y_i\}$ for $i = k-m, \ldots, k-1$. We first choose some initial Hessian approximation $H_k^0$ (in contrast to the standard BFGS iteration, this initial approximation is allowed to vary from iteration to iteration) and find by repeated application of the formula (7.16) that the L-BFGS approximation $H_k$ satisfies the following formula:
$$\begin{aligned} H_k = {} & \left(V_{k-1}^T \cdots V_{k-m}^T\right) H_k^0 \left(V_{k-m} \cdots V_{k-1}\right) \\ & + \rho_{k-m} \left(V_{k-1}^T \cdots V_{k-m+1}^T\right) s_{k-m} s_{k-m}^T \left(V_{k-m+1} \cdots V_{k-1}\right) \\ & + \rho_{k-m+1} \left(V_{k-1}^T \cdots V_{k-m+2}^T\right) s_{k-m+1} s_{k-m+1}^T \left(V_{k-m+2} \cdots V_{k-1}\right) \\ & + \cdots \\ & + \rho_{k-1} s_{k-1} s_{k-1}^T. \end{aligned} \qquad (7.19)$$
From this expression we can derive a recursive procedure to compute the product $H_k \nabla f_k$ efficiently.
Algorithm 7.4 (L-BFGS two-loop recursion).
  $q \leftarrow \nabla f_k$;
  for $i = k-1, k-2, \ldots, k-m$
    $\alpha_i \leftarrow \rho_i s_i^T q$;
    $q \leftarrow q - \alpha_i y_i$;
  end (for)
  $r \leftarrow H_k^0 q$;
  for $i = k-m, k-m+1, \ldots, k-1$
    $\beta \leftarrow \rho_i y_i^T r$;
    $r \leftarrow r + s_i (\alpha_i - \beta)$;
  end (for)
  stop with result $H_k \nabla f_k = r$.
Without considering the multiplication $H_k^0 q$, the two-loop recursion scheme requires $4mn$ multiplications; if $H_k^0$ is diagonal, then $n$ additional multiplications are needed. Apart from being inexpensive, this recursion has the advantage that the multiplication by the initial matrix $H_k^0$ is isolated from the rest of the computations, allowing this matrix to be chosen freely and to vary between iterations. We may even use an implicit choice of $H_k^0$ by defining some initial approximation $B_k^0$ to the Hessian (not its inverse) and obtaining $r$ by solving the system $B_k^0 r = q$.
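In code, the two-loop recursion is only a few lines. The following Python sketch assumes the pairs are stored oldest first and that $H_k^0 = \gamma_k I$; the function name and storage layout are our own choices.

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma):
    """L-BFGS two-loop recursion (Algorithm 7.4): returns H_k @ grad.

    s_list, y_list hold the m most recent pairs (s_i, y_i), oldest first;
    H_k^0 = gamma * I is the initial matrix, e.g. gamma from (7.20).
    """
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    r = gamma * q                      # r = H_k^0 q
    # Second loop: oldest pair to newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += s * (alpha - beta)
    return r                           # r = H_k grad
```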
A method for choosing $H_k^0$ that has proved effective in practice is to set $H_k^0 = \gamma_k I$, where

$$\gamma_k = \frac{s_{k-1}^T y_{k-1}}{y_{k-1}^T y_{k-1}}. \qquad (7.20)$$

As discussed in Chapter 6, $\gamma_k$ is the scaling factor that attempts to estimate the size of the true Hessian matrix along the most recent search direction (see (6.21)). This choice helps to ensure that the search direction $p_k$ is well scaled, and as a result the step length $\alpha_k = 1$ is accepted in most iterations. As discussed in Chapter 6, it is important that the line search be based on the Wolfe conditions (3.6) or strong Wolfe conditions (3.7), so that BFGS updating is stable.
The limited-memory BFGS algorithm can be stated formally as follows.
Algorithm 7.5 (L-BFGS).
  Choose starting point $x_0$, integer $m > 0$;
  $k \leftarrow 0$;
  repeat
    Choose $H_k^0$ (for example, by using (7.20));
    Compute $p_k \leftarrow -H_k \nabla f_k$ from Algorithm 7.4;
    Compute $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ is chosen to
      satisfy the Wolfe conditions;
    if $k > m$
      Discard the vector pair $\{s_{k-m}, y_{k-m}\}$ from storage;
    Compute and save $s_k \leftarrow x_{k+1} - x_k$, $y_k \leftarrow \nabla f_{k+1} - \nabla f_k$;
    $k \leftarrow k + 1$;
  until convergence.
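A rough driver for Algorithm 7.5 might look as follows. It reuses the two_loop_recursion sketch above and delegates the Wolfe line search to scipy.optimize.line_search; it is an illustrative skeleton under these assumptions, not a robust implementation (for instance, it does no curvature-pair skipping).

```python
import numpy as np
from scipy.optimize import line_search

def lbfgs(f, grad, x0, m=10, tol=1e-5, max_iter=500):
    """Sketch of Algorithm 7.5 built on two_loop_recursion."""
    x = x0.copy()
    s_list, y_list = [], []
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        # Scaling (7.20) for H_k^0; use 1.0 on the first iteration.
        gamma = (np.dot(s_list[-1], y_list[-1]) /
                 np.dot(y_list[-1], y_list[-1])) if s_list else 1.0
        p = -two_loop_recursion(g, s_list, y_list, gamma)
        alpha, *_ = line_search(f, grad, x, p)   # Wolfe conditions
        if alpha is None:
            raise RuntimeError("line search failed")
        x_new = x + alpha * p
        g_new = grad(x_new)
        s_list.append(x_new - x)
        y_list.append(g_new - g)
        if len(s_list) > m:                      # keep the m most recent pairs
            s_list.pop(0)
            y_list.pop(0)
        x, g = x_new, g_new
    return x
```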
The strategy of keeping the $m$ most recent correction pairs $\{s_i, y_i\}$ works well in practice; indeed, no other strategy has yet proved to be consistently better. During its first $m-1$ iterations, Algorithm 7.5 is equivalent to the BFGS algorithm of Chapter 6 if the initial matrix $H_0$ is the same in both methods, and if L-BFGS chooses $H_k^0 = H_0$ at each iteration.
Table 7.1 presents results illustrating the behavior of Algorithm 7.5 for various levels of memory $m$. It gives the number of function and gradient evaluations (nfg) and the total CPU time. The test problems are taken from the CUTE collection [35], the number of variables is indicated by $n$, and the termination criterion $\|\nabla f_k\| \le 10^{-5}$ is used. The table shows that the algorithm tends to be less robust when $m$ is small. As the amount of storage increases, the number of function evaluations tends to decrease; but since the cost of each iteration increases with the amount of storage, the best CPU time is often obtained for small values of $m$. Clearly, the optimal choice of $m$ is problem dependent.
Algorithm 7.5 is often the approach of choice for large problems in which the true Hessian is not sparse, because some rival algorithms are inefficient in that setting. In particular, a Newton
Table 7.1  Performance of Algorithm 7.5.

                      L-BFGS m=3    L-BFGS m=5    L-BFGS m=17   L-BFGS m=29
  Problem      n      nfg   time    nfg   time    nfg   time    nfg    time
  DIXMAANL   1500     146   16.5    134   17.4    120   28.2    125    44.4
  EIGENALS    110     821   21.5    569   15.7    363   16.2    168    12.5
  FREUROTH   1000    >999     --   >999     --     69    8.1     38     6.3
  TRIDIA     1000     876   46.6    611   41.4    531   84.6    462   127.1
method in which the exact Hessian is computed and factorized is not practical in such circumstances. The L-BFGS approach may also outperform Hessian-free Newton methods such as Newton-CG approaches, in which Hessian-vector products are calculated by finite differences or automatic differentiation. The main weakness of the L-BFGS method is that it converges slowly on ill-conditioned problems, specifically, on problems where the Hessian matrix contains a wide distribution of eigenvalues. On certain applications, the nonlinear conjugate gradient methods discussed in Chapter 5 are competitive with limited-memory quasi-Newton methods.
RELATIONSHIP WITH CONJUGATE GRADIENT METHODS
Limited-memory methods evolved as an attempt to improve nonlinear conjugate gradient methods, and early implementations resembled conjugate gradient methods more than quasi-Newton methods. The relationship between the two classes is the basis of a memoryless BFGS iteration, which we now outline.
We start by considering the Hestenes-Stiefel form of the nonlinear conjugate gradient method (5.46). Recalling that $s_k = \alpha_k p_k$, we have that the search direction for this method is given by

$$p_{k+1} = -\nabla f_{k+1} + \frac{\nabla f_{k+1}^T y_k}{y_k^T p_k}\, p_k = -\left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) \nabla f_{k+1} \equiv -\hat H_{k+1} \nabla f_{k+1}. \qquad (7.21)$$

This formula resembles a quasi-Newton iteration, but the matrix $\hat H_{k+1}$ is neither symmetric nor positive definite. We could symmetrize it as $\hat H_{k+1}^T \hat H_{k+1}$, but this matrix does not satisfy the secant equation $H_{k+1} y_k = s_k$ and is, in any case, singular. An iteration matrix that is symmetric, positive definite, and satisfies the secant equation is given by

$$H_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right)\left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}. \qquad (7.22)$$

This matrix is exactly the one obtained by applying a single BFGS update (7.16) to the identity matrix. Hence, an algorithm whose search direction is given by $p_{k+1} = -H_{k+1} \nabla f_{k+1}$, with $H_{k+1}$ defined by (7.22), can be thought of as a memoryless BFGS method, in which the previous Hessian approximation is always reset to the identity matrix before updating it and where only the most recent correction pair $\{s_k, y_k\}$ is kept at every iteration. Alternatively, we can view the method as a variant of Algorithm 7.5 in which $m = 1$ and $H_k^0 = I$ at each iteration.

A more direct connection with conjugate gradient methods can be seen if we consider the memoryless BFGS formula (7.22) in conjunction with an exact line search, for which $\nabla f_{k+1}^T p_k = 0$ for all $k$. We then obtain

$$p_{k+1} = -H_{k+1} \nabla f_{k+1} = -\nabla f_{k+1} + \frac{\nabla f_{k+1}^T y_k}{y_k^T p_k}\, p_k, \qquad (7.23)$$

which is none other than the Hestenes-Stiefel conjugate gradient method. Moreover, it is easy to verify that when $\nabla f_{k+1}^T p_k = 0$, the Hestenes-Stiefel formula reduces to the Polak-Ribière formula (5.44). Even though the assumption of exact line searches is unrealistic, it is intriguing that the BFGS formula is related in this way to the Polak-Ribière and Hestenes-Stiefel methods.
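The identity linking (7.22) and (7.23) is easy to confirm numerically. The snippet below, a small self-contained check of our own construction, builds random data satisfying the exact-line-search condition $\nabla f_{k+1}^T p_k = 0$ and verifies that the memoryless BFGS direction coincides with the Hestenes-Stiefel direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
p_k = rng.standard_normal(n)
g_old = rng.standard_normal(n)
g_new = rng.standard_normal(n)
g_new -= (g_new @ p_k) / (p_k @ p_k) * p_k   # enforce g_{k+1}^T p_k = 0
alpha = 0.7                                   # an arbitrary step length
s = alpha * p_k
y = g_new - g_old

# Memoryless BFGS matrix (7.22) applied to -g_new ...
rho = 1.0 / (y @ s)
I = np.eye(n)
H = (I - rho * np.outer(s, y)) @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
p_bfgs = -H @ g_new

# ... agrees with the Hestenes-Stiefel direction (7.23).
p_hs = -g_new + (g_new @ y) / (y @ p_k) * p_k
print(np.allclose(p_bfgs, p_hs))   # True
```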
GENERAL LIMITED-MEMORY UPDATING
Limited-memory quasi-Newton approximations are useful in a variety of optimization methods. L-BFGS, Algorithm 7.5, is a line search method for unconstrained optimization that implicitly updates an approximation $H_k$ to the inverse of the Hessian matrix. Trust-region methods, on the other hand, require an approximation $B_k$ to the Hessian matrix, not to its inverse. We would also like to develop limited-memory methods based on the SR1 formula, which is an attractive alternative to BFGS; see Chapter 6. In this section we consider limited-memory updating in a general setting and show that by representing quasi-Newton matrices in a compact (or outer product) form, we can derive efficient implementations of all popular quasi-Newton update formulas, and their inverses. These compact representations will also be useful in designing limited-memory methods for constrained optimization, where approximations to the Hessian or reduced Hessian of the Lagrangian are needed; see Chapters 18 and 19.
We will consider only limited-memory methods (such as L-BFGS) that continuously refresh the correction pairs by removing and adding information at each stage. A different approach saves correction pairs until the available storage is exhausted and then discards all correction pairs (except perhaps one) and starts the process anew. Computational experience suggests that this second approach is less effective in practice.
Throughout this chapter we let $B_k$ denote an approximation to a Hessian matrix and $H_k$ the approximation to the inverse. In particular, we always have that $B_k^{-1} = H_k$.
COMPACT REPRESENTATION OF BFGS UPDATING
We now describe an approach to limited-memory updating that is based on representing quasi-Newton matrices in outer-product form. We illustrate it for the case of a BFGS approximation $B_k$ to the Hessian.
Theorem 7.4.
Let $B_0$ be symmetric and positive definite, and assume that the $k$ vector pairs $\{s_i, y_i\}_{i=0}^{k-1}$ satisfy $s_i^T y_i > 0$. Let $B_k$ be obtained by applying $k$ BFGS updates with these vector pairs to $B_0$, using the formula (6.19). We then have that

$$B_k = B_0 - \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix} \begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1} \begin{bmatrix} S_k^T B_0 \\ Y_k^T \end{bmatrix}, \qquad (7.24)$$

where $S_k$ and $Y_k$ are the $n \times k$ matrices defined by

$$S_k = [s_0, \ldots, s_{k-1}], \qquad Y_k = [y_0, \ldots, y_{k-1}], \qquad (7.25)$$

while $L_k$ and $D_k$ are the $k \times k$ matrices

$$(L_k)_{i,j} = \begin{cases} s_{i-1}^T y_{j-1} & \mbox{if } i > j, \\ 0 & \mbox{otherwise}, \end{cases} \qquad (7.26)$$

$$D_k = \mbox{diag}\left[s_0^T y_0, \ldots, s_{k-1}^T y_{k-1}\right]. \qquad (7.27)$$
This result can be proved by induction. We note that the conditions $s_i^T y_i > 0$, $i = 0, 1, \ldots, k-1$, ensure that the middle matrix in (7.24) is nonsingular, so that this expression is well defined. The utility of this representation becomes apparent when we consider limited-memory updating.
As in the L-BFGS algorithm, we keep the $m$ most recent correction pairs $\{s_i, y_i\}$ and refresh this set at every iteration by removing the oldest pair and adding a newly generated pair. During the first $m$ iterations, the update procedure described in Theorem 7.4 can be used without modification, except that usually we make the specific choice $B_k^0 = \delta_k I$ for the basic matrix, where $\delta_k = 1/\gamma_k$ and $\gamma_k$ is defined by (7.20).

At subsequent iterations $k > m$, the update procedure needs to be modified slightly to reflect the changing nature of the set of vector pairs $\{s_i, y_i\}$ for $i = k-m, k-m+1, \ldots, k-1$. Defining the $n \times m$ matrices $S_k$ and $Y_k$ by

$$S_k = [s_{k-m}, \ldots, s_{k-1}], \qquad Y_k = [y_{k-m}, \ldots, y_{k-1}], \qquad (7.28)$$

we find that the matrix $B_k$ resulting from $m$ updates to the basic matrix $B_k^0 = \delta_k I$ is given by

$$B_k = \delta_k I - \begin{bmatrix} \delta_k S_k & Y_k \end{bmatrix} \begin{bmatrix} \delta_k S_k^T S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1} \begin{bmatrix} \delta_k S_k^T \\ Y_k^T \end{bmatrix}, \qquad (7.29)$$

where $L_k$ and $D_k$ are now the $m \times m$ matrices defined by

$$(L_k)_{i,j} = \begin{cases} s_{k-m-1+i}^T\, y_{k-m-1+j} & \mbox{if } i > j, \\ 0 & \mbox{otherwise}, \end{cases}$$

$$D_k = \mbox{diag}\left[s_{k-m}^T y_{k-m}, \ldots, s_{k-1}^T y_{k-1}\right].$$
Figure 7.1  Compact (or outer product) representation of $B_k$ in (7.29): the correction to $\delta_k I$ is the product of an $n \times 2m$ matrix, a small $2m \times 2m$ matrix, and a $2m \times n$ matrix.
After the new iterate $x_{k+1}$ is generated, we obtain $S_{k+1}$ by deleting $s_{k-m}$ from $S_k$ and adding the new displacement $s_k$, and we update $Y_{k+1}$ in a similar fashion. The new matrices $L_{k+1}$ and $D_{k+1}$ are obtained in an analogous way.
Since the middle matrix in (7.29) is small (of dimension $2m$), its factorization requires a negligible amount of computation. The key idea behind the compact representation (7.29) is that the corrections to the basic matrix can be expressed as an outer product of two long-and-narrow matrices, $[\delta_k S_k \;\; Y_k]$ and its transpose, with an intervening multiplication by a small $2m \times 2m$ matrix. See Figure 7.1 for a graphical illustration.
The limited-memory updating procedure of $B_k$ requires approximately $2mn + O(m^3)$ operations, and matrix-vector products of the form $B_k v$ can be performed at a cost of $(4m+1)n + O(m^2)$ multiplications. These operation counts indicate that updating and manipulating the direct limited-memory BFGS matrix $B_k$ is quite economical when $m$ is small.
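As an illustration of (7.29), the sketch below forms the small middle matrix explicitly and applies $B_k$ to a vector at the $O(mn)$ cost quoted above. The data layout (columns stored oldest first) and the function name are assumptions of this example.

```python
import numpy as np

def compact_bfgs_matvec(v, S, Y, delta):
    """Apply the compact L-BFGS matrix B_k of (7.29) to a vector v.

    S, Y are n-by-m arrays holding the m most recent pairs as columns
    (oldest first); delta is the scalar defining B_k^0 = delta * I.
    """
    SY = S.T @ Y                       # m-by-m matrix of inner products
    D = np.diag(np.diag(SY))           # D_k: diagonal of s_i^T y_i
    L = np.tril(SY, k=-1)              # L_k: strictly lower triangle
    # Middle 2m-by-2m matrix of (7.29) and the long-narrow factor.
    M = np.block([[delta * (S.T @ S), L],
                  [L.T,              -D]])
    W = np.hstack([delta * S, Y])      # n-by-2m
    return delta * v - W @ np.linalg.solve(M, W.T @ v)
```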
This approximation $B_k$ can be used in a trust-region method for unconstrained optimization or, more significantly, in methods for bound-constrained and general constrained optimization. The program L-BFGS-B [322] makes extensive use of compact limited-memory approximations to solve large nonlinear optimization problems with bound constraints. In this situation, projections of $B_k$ into subspaces defined by the constraint gradients must be calculated repeatedly. Several codes for general constrained optimization, including KNITRO and IPOPT, make use of the compact limited-memory matrix $B_k$ to approximate the Hessian of the Lagrangian; see Section 19.3.
We can derive a formula, similar to (7.24), that provides a compact representation of the inverse BFGS approximation $H_k$; see [52] for details. An implementation of the unconstrained L-BFGS algorithm based on this expression requires a similar amount of computation as the algorithm described in the previous section.
Compact representations can also be derived for matrices generated by the symmetric rank-one (SR1) formula. If $k$ updates are applied to the symmetric matrix $B_0$ using the vector pairs $\{s_i, y_i\}_{i=0}^{k-1}$ and the SR1 formula (6.24), the resulting matrix $B_k$ can be expressed as

$$B_k = B_0 + (Y_k - B_0 S_k)\left(D_k + L_k + L_k^T - S_k^T B_0 S_k\right)^{-1} (Y_k - B_0 S_k)^T, \qquad (7.30)$$
where $S_k$, $Y_k$, $D_k$, and $L_k$ are as defined in (7.25), (7.26), and (7.27). Since the SR1 method is self-dual, the inverse formula $H_k$ can be obtained simply by replacing $B$, $s$, and $y$ by $H$, $y$, and $s$, respectively. Limited-memory SR1 methods can be derived in the same way as the BFGS method. We replace $B_0$ with the basic matrix $B_k^0$ at the $k$th iteration, and we redefine $S_k$ and $Y_k$ to contain the $m$ most recent corrections, as in (7.28). We note, however, that limited-memory SR1 updating is sometimes not as effective as L-BFGS updating because it may not produce positive definite approximations near a solution.
UNROLLING THE UPDATE
The reader may wonder whether limited-memory updating can be implemented in simpler ways. In fact, as we show here, the most obvious implementation of limited-memory BFGS updating is considerably more expensive than the approach based on compact representations discussed in the previous section.
The direct BFGS formula (6.19) can be written as

$$B_{k+1} = B_k - a_k a_k^T + b_k b_k^T, \qquad (7.31)$$

where the vectors $a_k$ and $b_k$ are defined by

$$a_k = \frac{B_k s_k}{(s_k^T B_k s_k)^{1/2}}, \qquad b_k = \frac{y_k}{(y_k^T s_k)^{1/2}}. \qquad (7.32)$$

We could continue to save the vector pairs $\{s_i, y_i\}$ but use the formula (7.31) to compute matrix-vector products. A limited-memory BFGS method that uses this approach would proceed by defining the basic matrix $B_k^0$ at each iteration and then updating according to the formula

$$B_k = B_k^0 + \sum_{i=k-m}^{k-1} \left[b_i b_i^T - a_i a_i^T\right]. \qquad (7.33)$$

The vector pairs $\{a_i, b_i\}$, $i = k-m, k-m+1, \ldots, k-1$, would then be recovered from the stored vector pairs $\{s_i, y_i\}$, $i = k-m, k-m+1, \ldots, k-1$, by the following procedure:

Procedure 7.6 (Unrolling the BFGS formula).
  for $i = k-m, k-m+1, \ldots, k-1$
    $b_i \leftarrow y_i / (y_i^T s_i)^{1/2}$;
    $a_i \leftarrow B_k^0 s_i + \sum_{j=k-m}^{i-1} \left[(b_j^T s_i) b_j - (a_j^T s_i) a_j\right]$;
    $a_i \leftarrow a_i / (s_i^T a_i)^{1/2}$;
  end (for)
Note that the vectors $a_i$ must be recomputed at each iteration because they all depend on the vector pair $\{s_{k-m}, y_{k-m}\}$, which is removed at the end of iteration $k$. On the other hand, the vectors $b_i$ and the inner products $b_j^T s_i$ can be saved from the previous iteration, so only the new values $b_{k-1}$ and $b_j^T s_{k-1}$ need to be computed at the current iteration.
By taking all these computations into account, and assuming that $B_k^0 = I$, we find that approximately $\frac{3}{2} m^2 n$ operations are needed to determine the limited-memory matrix. The actual computation of the inner product $B_k v$ (for arbitrary $v \in \mathbb{R}^n$) requires $4mn$ multiplications. Overall, therefore, this approach is less efficient than the one based on the compact matrix representation described previously. Indeed, while the product $B_k v$ costs the same in both cases, updating the representation of the limited-memory matrix by using the compact form requires only $2mn$ multiplications, compared to $\frac{3}{2} m^2 n$ multiplications needed when the BFGS formula is unrolled.
7.3 SPARSE QUASI-NEWTON UPDATES
We now discuss a quasi-Newton approach to large-scale problems that has intuitive appeal: We demand that the quasi-Newton approximations $B_k$ have the same (or similar) sparsity pattern as the true Hessian. This approach would reduce the storage requirements of the algorithm and perhaps give rise to more accurate Hessian approximations.
Suppose that we know which components of the Hessian may be nonzero at some point in the domain of interest. That is, we know the contents of the set $\Omega$ defined by

$$\Omega \stackrel{\rm def}{=} \left\{(i, j) \;\middle|\; [\nabla^2 f(x)]_{ij} \ne 0 \mbox{ for some } x \mbox{ in the domain of } f\right\}.$$
Suppose also that the current Hessian approximation $B_k$ mirrors the nonzero structure of the exact Hessian, that is, $(B_k)_{ij} = 0$ for $(i, j) \notin \Omega$. In updating $B_k$ to $B_{k+1}$, then, we could try to find the matrix $B_{k+1}$ that satisfies the secant condition, has the same sparsity pattern, and is as close as possible to $B_k$. Specifically, we define $B_{k+1}$ to be the solution of the following quadratic program:
$$\min_B\; \|B - B_k\|_F^2 = \sum_{(i,j) \in \Omega} \left[B_{ij} - (B_k)_{ij}\right]^2, \qquad (7.34{\rm a})$$

$$\mbox{subject to } B s_k = y_k, \quad B = B^T, \quad \mbox{and } B_{ij} = 0 \mbox{ for } (i, j) \notin \Omega. \qquad (7.34{\rm b})$$
One can show that the solution $B_{k+1}$ of this problem can be obtained by solving an $n \times n$ linear system whose sparsity pattern is $\Omega$, the same as the sparsity of the true Hessian. Once $B_{k+1}$ has been computed, we can use it, within a trust-region method, to obtain the new iterate $x_{k+1}$. We note that $B_{k+1}$ is not guaranteed to be positive definite.
We omit further details of this approach because it has several drawbacks. The updating process does not possess scale invariance under linear transformations of the variables and, more significantly, its practical performance has been disappointing. The fundamental weakness of this approach is that (7.34a) is an inadequate model and can produce poor Hessian approximations.
An alternative approach is to relax the secant equation, making sure that it is approximately satisfied along the last few steps, rather than requiring it to hold strictly on the latest step. To do so, we define $S_k$ and $Y_k$ by (7.28) so that they contain the $m$ most recent difference pairs. We can then define the new Hessian approximation $B_{k+1}$ to be the solution of
$$\min_B\; \|B S_k - Y_k\|_F^2 \quad \mbox{subject to } B = B^T \mbox{ and } B_{ij} = 0 \mbox{ for } (i, j) \notin \Omega.$$
This convex optimization problem has a solution, but it is not easy to compute. Moreover, this approach can produce singular or poorly conditioned Hessian approximations. Even though it frequently outperforms methods based on (7.34a), its performance on large problems has not been impressive.
7.4 ALGORITHMS FOR PARTIALLY SEPARABLE FUNCTIONS
In a separable unconstrained optimization problem, the objective function can be decomposed into a sum of simpler functions that can be optimized independently. For example, if we have

$$f(x) = f_1(x_1, x_3) + f_2(x_2, x_4, x_6) + f_3(x_5),$$

we can find the optimal value of $x$ by minimizing each function $f_i$, $i = 1, 2, 3$, independently, since no variable appears in more than one function. The cost of performing $m$ lower-dimensional optimizations is much less in general than the cost of optimizing an $n$-dimensional function.
In many large problems the objective function $f: \mathbb{R}^n \to \mathbb{R}$ is not separable, but it can still be written as the sum of simpler functions, known as element functions. Each element function has the property that it is unaffected when we move along a large number of linearly independent directions. If this property holds, we say that $f$ is partially separable. All functions whose Hessians $\nabla^2 f$ are sparse are partially separable, but so are many functions whose Hessian is not sparse. Partial separability allows for economical problem representation, efficient automatic differentiation, and effective quasi-Newton updating.
The simplest form of partial separability arises when the objective function can be written as

$$f(x) = \sum_{i=1}^{ne} f_i(x), \qquad (7.35)$$
where each of the element functions $f_i$ depends on only a few components of $x$. It follows that the gradients $\nabla f_i$ and Hessians $\nabla^2 f_i$ of each element function contain just a few nonzeros. By differentiating (7.35), we obtain

$$\nabla f(x) = \sum_{i=1}^{ne} \nabla f_i(x), \qquad \nabla^2 f(x) = \sum_{i=1}^{ne} \nabla^2 f_i(x).$$
A natural question is whether it is more effective to maintain quasi-Newton approximations to each of the element Hessians $\nabla^2 f_i(x)$ separately, rather than approximating the entire Hessian $\nabla^2 f(x)$. We will show that the answer is affirmative, provided that the quasi-Newton approximation fully exploits the structure of each element Hessian.
We introduce the concept by means of a simple example. Consider the objective function

$$f(x) = (x_1 - x_3^2)^2 + (x_2 - x_4^2)^2 + (x_3 - x_2^2)^2 + (x_4 - x_1^2)^2 \qquad (7.36)$$
$$\equiv f_1(x) + f_2(x) + f_3(x) + f_4(x).$$

The Hessians of the element functions $\nabla^2 f_i$ are $4 \times 4$ sparse, singular matrices with 4 nonzero entries.
Let us focus on $f_1$; all other element functions have exactly the same form. Even though $f_1$ is formally a function of all components of $x$, it depends only on $x_1$ and $x_3$, which we call the element variables for $f_1$. We assemble the element variables into a vector that we call $x_{[1]}$, that is,

$$x_{[1]} = \begin{bmatrix} x_1 \\ x_3 \end{bmatrix},$$

and note that

$$x_{[1]} = U_1 x \quad \mbox{with} \quad U_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$

If we define the function $\phi_1$ by

$$\phi_1(z_1, z_2) = (z_1 - z_2^2)^2,$$

then we can write $f_1(x) = \phi_1(U_1 x)$. By applying the chain rule to this representation, we obtain

$$\nabla f_1(x) = U_1^T \nabla \phi_1(U_1 x), \qquad \nabla^2 f_1(x) = U_1^T \nabla^2 \phi_1(U_1 x)\, U_1. \qquad (7.37)$$
In our case, we have

$$\nabla^2 \phi_1(U_1 x) = \begin{bmatrix} 2 & -4x_3 \\ -4x_3 & 12x_3^2 - 4x_1 \end{bmatrix}, \qquad \nabla^2 f_1(x) = \begin{bmatrix} 2 & 0 & -4x_3 & 0 \\ 0 & 0 & 0 & 0 \\ -4x_3 & 0 & 12x_3^2 - 4x_1 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

The matrix $U_1$, known as a compactifying matrix, allows us to map the derivative information for the low-dimensional function $\phi_1$ into the derivative information for the element function $f_1$.

Now comes the key idea: Instead of maintaining a quasi-Newton approximation to $\nabla^2 f_1$, we maintain a $2 \times 2$ quasi-Newton approximation $B_{[1]}$ of $\nabla^2 \phi_1$ and use the relation (7.37) to transform it into a quasi-Newton approximation to $\nabla^2 f_1$. To update $B_{[1]}$ after a typical step from $x$ to $x^+$, we record the information

$$s_{[1]} = x^+_{[1]} - x_{[1]}, \qquad y_{[1]} = \nabla \phi_1(x^+_{[1]}) - \nabla \phi_1(x_{[1]}), \qquad (7.38)$$

and use BFGS or SR1 updating to obtain the new approximation $B^+_{[1]}$. We therefore update small, dense quasi-Newton approximations with the property

$$B_{[1]} \approx \nabla^2 \phi_1(U_1 x) = \nabla^2 \phi_1(x_{[1]}). \qquad (7.39)$$

To obtain an approximation of the element Hessian $\nabla^2 f_1$, we use the transformation suggested by the relationship (7.37); that is,

$$\nabla^2 f_1(x) \approx U_1^T B_{[1]} U_1.$$

This operation has the effect of mapping the elements of $B_{[1]}$ to the correct positions in the full $n \times n$ Hessian approximation.
The previous discussion concerned only the first element function $f_1$, but we can treat all other functions $f_i$ in the same way. The full objective function can now be written as

$$f(x) = \sum_{i=1}^{ne} \phi_i(U_i x), \qquad (7.40)$$

and we maintain a quasi-Newton approximation $B_{[i]}$ for each of the functions $\phi_i$. To obtain a complete approximation to the full Hessian $\nabla^2 f$, we simply sum the element Hessian approximations as follows:

$$B = \sum_{i=1}^{ne} U_i^T B_{[i]} U_i. \qquad (7.41)$$
We may use this approximate Hessian in a trust-region algorithm, obtaining an approximate solution $p_k$ of the system

$$B_k p_k = -\nabla f_k. \qquad (7.42)$$

We need not assemble $B_k$ explicitly, but rather use the conjugate gradient approach to solve (7.42), computing matrix-vector products of the form $B_k v$ by performing operations with the matrices $U_i$ and $B_{[i]}$.
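A sketch of such a matrix-vector product appears below. Rather than forming the sparse matrices $U_i$, it stores for each element only the list of indices of its element variables, which is all that $U_i$ encodes; the data layout and function name are hypothetical choices of this example.

```python
import numpy as np

def element_hessian_matvec(v, elements):
    """Compute B v = sum_i U_i^T B_[i] U_i v for a partially separable
    approximation (7.41), without ever forming the n-by-n matrix B.

    `elements` is a list of (idx, B_elem) pairs, where idx holds the
    indices of the element variables and B_elem is the small dense
    quasi-Newton matrix for that element.
    """
    result = np.zeros_like(v)
    for idx, B_elem in elements:
        result[idx] += B_elem @ v[idx]   # U_i v is just v[idx]
    return result

# For the example (7.36): f_1 depends on x_1 and x_3 (0-based indices 0, 2),
# so one entry of `elements` would be (np.array([0, 2]), B1) with B1 2-by-2.
```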
To illustrate the usefulness of this element-by-element updating technique, let us consider a problem of the form (7.36), but this time involving 1000 variables, not just 4. The functions $\phi_i$ still depend on only two internal variables, so that each Hessian approximation $B_{[i]}$ is a $2 \times 2$ matrix. After just a few iterations, we will have sampled enough directions $s_{[i]}$ to make each $B_{[i]}$ an accurate approximation to $\nabla^2 \phi_i$. Hence the full quasi-Newton approximation (7.41) will tend to be a very good approximation to $\nabla^2 f(x)$. By contrast, a quasi-Newton method that ignores the partially separable structure of the objective function will attempt to estimate the total average curvature, that is, the sum of the individual curvatures of the element functions, by approximating the $1000 \times 1000$ Hessian matrix. When the number of variables $n$ is large, many iterations will be required before this quasi-Newton approximation is of good quality. Hence an algorithm of this type (for example, standard BFGS or L-BFGS) will require many more iterations than a method based on the partially separable approximate Hessian.
It is not always possible to use the BFGS formula to update the partial Hessian $B_{[i]}$, because there is no guarantee that the curvature condition $s_{[i]}^T y_{[i]} > 0$ will be satisfied. That is, even though the full Hessian $\nabla^2 f(x)$ is at least positive semidefinite at the solution $x^*$, some of the individual Hessians $\nabla^2 \phi_i(\cdot)$ may be indefinite. One way to overcome this obstacle is to apply the SR1 update to each of the element Hessians. This approach has proved effective in the LANCELOT package [72], which is designed to take full advantage of partial separability.
The main limitations of this quasi-Newton approach are the cost of the step computation (7.42), which is comparable to the cost of a Newton step, and the difficulty of identifying the partially separable structure of a function. The performance of quasi-Newton methods is satisfactory provided that we find the finest partially separable decomposition of the problem; see [72]. Furthermore, even when the partially separable structure is known, it may be more efficient to compute a Newton step. For example, the modeling language AMPL automatically detects the partially separable structure of a function $f$ and uses it to compute the Hessian $\nabla^2 f(x)$.
7.5 PERSPECTIVES AND SOFTWARE
Newton-CG methods have been used successfully to solve large problems in a variety of applications. Many of these implementations are developed by engineers and scientists and use problem-specific preconditioners. Freely available packages include TN/TNBC [220] and TNPACK [275]. Software for more general problems, such as LANCELOT [72], KNITRO/CG [50], and TRON [192], employs Newton-CG methods when applied to unconstrained problems. Other packages, such as LOQO [294], implement Newton methods with a sparse factorization modified to ensure positive definiteness. GLTR [145] offers a Newton-Lanczos method. There is insufficient experience to date to say whether the Newton-Lanczos method is significantly better in practice than the Steihaug strategy given in Algorithm 7.2.
Software for computing incomplete Cholesky preconditioners includes the ICFS [193] and MA57 [166] packages. A preconditioner for Newton-CG based on limited-memory BFGS approximations is provided in PREQN [209].
Limited-memory BFGS methods are implemented in LBFGS [194] and M1QN3 [122]; see Gill and Leonard [125] for a variant that requires less storage and appears to be quite efficient. The compact limited-memory representations of Section 7.2 are used in L-BFGS-B [322], IPOPT [301], and KNITRO.
The LANCELOT package exploits partial separability. It provides SR1 and BFGS quasi-Newton options as well as a Newton method. The step computation is obtained by a preconditioned conjugate gradient iteration using trust regions. If $f$ is partially separable, a general affine transformation will not in general preserve the partially separable structure. The quasi-Newton method for partially separable functions described in Section 7.4 is not invariant to affine transformations of the variables, but this is not a drawback because the method is invariant under transformations that preserve separability.
NOTES AND REFERENCES
A complete study of inexact Newton methods is given in [74]. For a discussion of the Newton-Lanczos method see [145]. Other iterative methods for the solution of a trust-region problem have been proposed by Hager [160] and by Rendl and Wolkowicz [263].
For further discussion of the L-BFGS method see Nocedal [228], Liu and Nocedal [194], and Gilbert and Lemaréchal [122]. The last paper also discusses various ways in which the scaling parameter can be chosen. Algorithm 7.4, the two-loop L-BFGS recursion, constitutes an economical procedure for computing the product $H_k \nabla f_k$. It is based, however, on the specific form of the BFGS update formula (7.16), and recursions of this type have not yet been developed (and may not exist) for other members of the Broyden class (for instance, the SR1 and DFP methods). Our discussion of compact representations of limited-memory matrices is based on Byrd, Nocedal, and Schnabel [52].
Sparse quasi-Newton updates have been studied by Toint [288, 289] and Fletcher et al. [102, 104], among others. The concept of partial separability was introduced by Griewank and Toint [156, 155]. For an extensive treatment of the subject see Conn, Gould, and Toint [72].
EXERCISES
7.1 Code Algorithm 7.5, and test it on the extended Rosenbrock function

$$f(x) = \sum_{i=1}^{n/2} \left[\alpha \left(x_{2i} - x_{2i-1}^2\right)^2 + \left(1 - x_{2i-1}\right)^2\right],$$

where $\alpha$ is a parameter that you can vary (for example, 1 or 100). The solution is $x^* = (1, 1, \ldots, 1)^T$, $f^* = 0$. Choose the starting point as $(-1, -1, \ldots, -1)^T$. Observe the behavior of your program for various values of the memory parameter $m$.
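For readers attempting this exercise, here is one possible Python encoding of the test function and its gradient; the names and the vectorized even/odd layout are our own choices.

```python
import numpy as np

def ext_rosenbrock(x, alpha=100.0):
    """Extended Rosenbrock function of Exercise 7.1 (n even)."""
    odd, even = x[0::2], x[1::2]          # x_{2i-1} and x_{2i} in 1-based terms
    return np.sum(alpha * (even - odd**2)**2 + (1.0 - odd)**2)

def ext_rosenbrock_grad(x, alpha=100.0):
    g = np.zeros_like(x)
    odd, even = x[0::2], x[1::2]
    g[0::2] = -4.0 * alpha * odd * (even - odd**2) - 2.0 * (1.0 - odd)
    g[1::2] = 2.0 * alpha * (even - odd**2)
    return g

# Suggested starting point: x0 = -np.ones(n); the minimizer is np.ones(n).
```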
7.2 Show that the matrix $\hat H_{k+1}$ in (7.21) is singular.

7.3 Derive the formula (7.23) under the assumption that line searches are exact.

7.4 Consider limited-memory SR1 updating based on (7.30). Explain how the storage can be cut in half if the basic matrix $B_k^0$ is kept fixed for all $k$. (Hint: Consider the matrix $Q_k = [q_0, \ldots, q_{k-1}] = Y_k - B_0 S_k$.)
7.5 Write the function defined by

$$f(x) = x_2 x_3 e^{x_1 + x_3 - x_4} + (x_2 x_3)^2 + (x_3 - x_4)$$

in the form (7.40). In particular, give the definition of each of the compactifying transformations $U_i$.
7.6 Does the approximation $B$ obtained by the partially separable quasi-Newton updating (7.38), (7.41) satisfy the secant equation $Bs = y$?
7.7 The minimum surface problem is a classical application of the calculus of variations and can be found in many textbooks. We wish to find the surface of minimum area, defined on the unit square, that interpolates a prescribed continuous function on the boundary of the square. In the standard discretization of this problem, the unknowns are the values of the sought-after function $z(x, y)$ on a $q \times q$ rectangular mesh of points over the unit square.

More specifically, we divide each edge of the square into $q$ intervals of equal length, yielding $(q+1)^2$ grid points. We label the grid points as

$$x_{(i-1)(q+1)+1}, \ldots, x_{i(q+1)} \quad \mbox{for } i = 1, 2, \ldots, q+1,$$

so that each value of $i$ generates a line. With each point we associate a variable $x_i$ that represents the height of the surface at this point. For the $4q$ grid points on the boundary of the unit square, the values of these variables are determined by the given function. The optimization problem is to determine the other $(q+1)^2 - 4q$ variables $x_i$ so that the total surface area is minimized.
A typical subsquare in this partition has the four corners $x_j$ and $x_{j+1}$ (on its lower edge) and $x_{j+q+1}$ and $x_{j+q+2}$ (on its upper edge). We denote this square by $A_j$ and note that its area is $q^{-2}$. The desired function is $z(x, y)$, and we wish to compute its surface area over $A_j$. Calculus books show that this area is given by

$$f_j(x) = \int\!\!\int_{(x,y) \in A_j} \sqrt{1 + \left(\frac{\partial z}{\partial x}\right)^2 + \left(\frac{\partial z}{\partial y}\right)^2}\; dx\, dy.$$
Approximate the derivatives by finite differences, and show that $f_j$ has the form

$$f_j(x) = \frac{1}{q^2}\left[1 + \frac{q^2}{2}\left((x_j - x_{j+q+2})^2 + (x_{j+1} - x_{j+q+1})^2\right)\right]^{1/2}. \qquad (7.43)$$
7.8 Compute the gradient of the element function (7.43) with respect to the full vector $x$. Show that it contains at most four nonzeros, and that two of these four nonzero components are negatives of the other two. Compute the Hessian of $f_j$, and show that, among the 16 nonzeros, only three different magnitudes are represented. Also show that this Hessian is singular.
CHAPTER 8
Calculating Derivatives
Most algorithms for nonlinear optimization and nonlinear equations require knowledge of derivatives. Sometimes the derivatives are easy to calculate by hand, and it is reasonable to expect the user to provide code to compute them. In other cases, the functions are too complicated, so we look for ways to calculate or approximate the derivatives automatically. A number of interesting approaches are available, of which the most important are probably the following.
Finite Differencing. This technique has its roots in Taylor's theorem (see Chapter 2). By observing the change in function values in response to small perturbations of the unknowns
near a given point $x$, we can estimate the response to infinitesimal perturbations, that is, the derivatives. For instance, the partial derivative of a smooth function $f: \mathbb{R}^n \to \mathbb{R}$ with respect to the $i$th variable $x_i$ can be approximated by the central-difference formula

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon},$$

where $\epsilon$ is a small positive scalar and $e_i$ is the $i$th unit vector, that is, the vector whose elements are all 0 except for a 1 in the $i$th position.
Automatic Differentiation. This technique takes the view that the computer code for evaluating the function can be broken down into a composition of elementary arithmetic operations, to which the chain rule (one of the basic rules of calculus) can be applied. Some software tools for automatic differentiation (such as ADIFOR [25]) produce new code that calculates both function and derivative values. Other tools (such as ADOL-C [154]) keep a record of the elementary computations that take place while the function evaluation code for a given point $x$ is executing on the computer. This information is processed to produce the derivatives at the same point $x$.
Symbolic Differentiation. In this technique, the algebraic specification for the function $f$ is manipulated by symbolic manipulation tools to produce new algebraic expressions for each component of the gradient. Commonly used symbolic manipulation tools can be found in the packages Mathematica [311], Maple [304], and Macsyma [197].
In this chapter we discuss the first two approaches: finite differencing and automatic differentiation.
The usefulness of derivatives is not restricted to algorithms for optimization. Modelers in areas such as design optimization and economics are often interested in performing postoptimal sensitivity analysis, in which they determine the sensitivity of the optimum to small perturbations in the parameter or constraint values. Derivatives are also important in other areas such as nonlinear differential equations and simulation.
8.1 FINITE-DIFFERENCE DERIVATIVE APPROXIMATIONS

Finite differencing is an approach to the calculation of approximate derivatives whose motivation (like that of so many algorithms in optimization) comes from Taylor's theorem. Many software packages perform automatic calculation of finite differences whenever the user is unable or unwilling to supply code to calculate exact derivatives. Although they yield only approximate values for the derivatives, the results are adequate in many situations.
By definition, derivatives are a measure of the sensitivity of the function to infinitesimal changes in the values of the variables. Our approach in this section is to make small, finite perturbations in the values of x and examine the resulting differences in the function values.
By taking ratios of the function difference to variable difference, we obtain approximations to the derivatives.
APPROXIMATING THE GRADIENT
An approximation to the gradient vector $\nabla f(x)$ can be obtained by evaluating the function $f$ at $n + 1$ points and performing some elementary arithmetic. We describe this technique, along with a more accurate variant that requires additional function evaluations.
A popular formula for approximating the partial derivative $\partial f/\partial x_i$ at a given point $x$ is the forward-difference, or one-sided-difference, approximation, defined as

$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x)}{\epsilon}. \qquad (8.1)$$

The gradient can be built up by simply applying this formula for $i = 1, 2, \ldots, n$. This process requires evaluation of $f$ at the point $x$ as well as the $n$ perturbed points $x + \epsilon e_i$, $i = 1, 2, \ldots, n$: a total of $n + 1$ points.
The basis for the formula (8.1) is Taylor's theorem, Theorem 2.1 in Chapter 2. When $f$ is twice continuously differentiable, we have

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp)\, p, \quad \mbox{some } t \in (0, 1) \qquad (8.2)$$

(see (2.6)). If we choose $L$ to be a bound on the size of $\|\nabla^2 f(\cdot)\|$ in the region of interest, it follows directly from this formula that the last term in this expression is bounded by $(L/2)\|p\|^2$, so that

$$\left|f(x + p) - f(x) - \nabla f(x)^T p\right| \le \frac{L}{2}\|p\|^2. \qquad (8.3)$$

We now choose the vector $p$ to be $\epsilon e_i$, so that it represents a small change in the value of a single component of $x$ (the $i$th component). For this $p$, we have that $\nabla f(x)^T p = \epsilon\, \nabla f(x)^T e_i = \epsilon\, \partial f/\partial x_i$, so by rearranging (8.3), we conclude that

$$\frac{\partial f}{\partial x_i}(x) = \frac{f(x + \epsilon e_i) - f(x)}{\epsilon} + \delta_\epsilon, \quad \mbox{where } |\delta_\epsilon| \le \frac{L}{2}\epsilon. \qquad (8.4)$$

We derive the forward-difference formula (8.1) by simply ignoring the error term $\delta_\epsilon$ in this expression, which becomes smaller and smaller as $\epsilon$ approaches zero.
An important issue in implementing the formula (8.1) is the choice of the parameter $\epsilon$. The error expression (8.4) suggests that we should choose $\epsilon$ as small as possible. Unfortunately, this expression ignores the roundoff errors that are introduced when the function $f$ is evaluated on a real computer, in floating-point arithmetic. From our discussion in the Appendix (see (A.30) and (A.31)), we know that the quantity $u$ known as unit roundoff is crucial: It is a bound on the relative error that is introduced whenever an arithmetic operation is performed on two floating-point numbers; $u$ is about $1.1 \times 10^{-16}$ in double-precision IEEE floating-point arithmetic. The effect of these errors on the final computed value of $f$ depends on the way in which $f$ is computed. It could come from an arithmetic formula, or from a differential equation solver, with or without refinement.
As a rough estimate, let us assume simply that the relative error in the computed $f$ is bounded by $u$, so that the computed values of $f(x)$ and $f(x + \epsilon e_i)$ are related to the exact values in the following way:

$$|\mbox{comp}(f(x)) - f(x)| \le u L_f, \qquad |\mbox{comp}(f(x + \epsilon e_i)) - f(x + \epsilon e_i)| \le u L_f,$$

where $\mbox{comp}(\cdot)$ denotes the computed value and $L_f$ is a bound on the value of $|f(\cdot)|$ in the region of interest. If we use these computed values of $f$ in place of the exact values in (8.4) and (8.1), we obtain an error that is bounded by

$$L\epsilon/2 + 2u L_f/\epsilon. \qquad (8.5)$$

Naturally, we would like to choose $\epsilon$ to make this error as small as possible; it is easy to see that the minimizing value is

$$\epsilon^2 = \frac{4 L_f u}{L}.$$

If we assume that the problem is well scaled, then the ratio $L_f/L$ (the ratio of function values to second derivative values) does not exceed a modest size. We can conclude that the following choice of $\epsilon$ is fairly close to optimal:

$$\epsilon = \sqrt{u}. \qquad (8.6)$$

In fact, this value is used in many of the optimization software packages that use finite differencing as an option for estimating derivatives. For this value of $\epsilon$, we have from (8.5) that the total error in the forward-difference approximation is fairly close to $\sqrt{u}$.
A more accurate approximation to the derivative can be obtained by using the central-difference formula, defined as

$$\frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}. \qquad (8.7)$$

As we show below, this approximation is more accurate than the forward-difference approximation (8.1). It is also about twice as expensive, since we need to evaluate $f$ at the points $x$ and $x \pm \epsilon e_i$, $i = 1, 2, \ldots, n$: a total of $2n + 1$ points.
The basis for the central-difference approximation is again Taylor's theorem. When the second derivatives of $f$ exist and are Lipschitz continuous, we have from (8.2) that

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp)\, p \quad \mbox{for some } t \in (0, 1)$$
$$= f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x)\, p + O\big(\|p\|^3\big). \qquad (8.8)$$

By setting $p = \epsilon e_i$ and $p = -\epsilon e_i$, respectively, we obtain

$$f(x + \epsilon e_i) = f(x) + \epsilon \frac{\partial f}{\partial x_i} + \tfrac{1}{2}\epsilon^2 \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3),$$
$$f(x - \epsilon e_i) = f(x) - \epsilon \frac{\partial f}{\partial x_i} + \tfrac{1}{2}\epsilon^2 \frac{\partial^2 f}{\partial x_i^2} + O(\epsilon^3).$$

Note that the final error terms in these two expressions are generally not the same, but they are both bounded by some multiple of $\epsilon^3$. By subtracting the second equation from the first and dividing by $2\epsilon$, we obtain the expression

$$\frac{\partial f}{\partial x_i}(x) = \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon} + O(\epsilon^2).$$

We see from this expression that the error is $O(\epsilon^2)$, as compared to the $O(\epsilon)$ error in the forward-difference formula (8.1). However, when we take evaluation error in $f$ into account, the accuracy that can be achieved in practice is less impressive; the same assumptions that were used to derive (8.6) lead to an optimal choice of $\epsilon$ of about $u^{1/3}$ and an error of about $u^{2/3}$. In some situations, the extra few digits of accuracy may improve the performance of the algorithm enough to make the extra expense worthwhile.
APPROXIMATING A SPARSE JACOBIAN
Consider now the case of a vector function $r: \mathbb{R}^n \to \mathbb{R}^m$, such as the residual vector that we consider in Chapter 10 or the system of nonlinear equations from Chapter 11. The matrix $J(x)$ of first derivatives for this function is defined as follows:

$$J(x) = \left[\frac{\partial r_j}{\partial x_i}\right]_{\substack{j = 1, 2, \ldots, m \\ i = 1, 2, \ldots, n}} = \begin{bmatrix} \nabla r_1(x)^T \\ \nabla r_2(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix}, \qquad (8.9)$$

where $r_j$, $j = 1, 2, \ldots, m$, are the components of $r$. The techniques described in the previous
section can be used to evaluate the full Jacobian $J(x)$ one column at a time. When $r$ is twice continuously differentiable, we can use Taylor's theorem to deduce that

$$\|r(x + p) - r(x) - J(x)p\| \le \frac{L}{2}\|p\|^2, \qquad (8.10)$$

where $L$ is a Lipschitz constant for $J(\cdot)$ in the region of interest. If we require an approximation to the Jacobian-vector product $J(x)p$ for a given vector $p$ (as is the case with inexact Newton methods for nonlinear systems of equations; see Section 11.1), this expression immediately suggests choosing a small nonzero $\epsilon$ and setting

$$J(x)p \approx \frac{r(x + \epsilon p) - r(x)}{\epsilon}, \qquad (8.11)$$

an approximation that is accurate to $O(\epsilon)$. A two-sided approximation can be derived from the formula (8.7).

If an approximation to the full Jacobian $J(x)$ is required, we can compute it a column at a time, analogously to (8.1), by setting $p = \epsilon e_i$ in (8.10) to derive the following estimate of the $i$th column:

$$\frac{\partial r}{\partial x_i}(x) \approx \frac{r(x + \epsilon e_i) - r(x)}{\epsilon}. \qquad (8.12)$$
A full Jacobian estimate can be obtained at a cost of $n + 1$ evaluations of the function $r$. When the Jacobian is sparse, however, we can often obtain the estimate at a much lower cost, sometimes just three or four evaluations of $r$. The key is to estimate a number of different columns of the Jacobian simultaneously, by making judicious choices of the perturbation vector $p$ in (8.10).
We illustrate the technique with a simple example. Consider the function $r: \mathbb{R}^n \to \mathbb{R}^n$ defined by

$$r(x) = \begin{bmatrix} 2(x_2^3 - x_1^2) \\ 3(x_2^3 - x_1^2) + 2(x_3^3 - x_2^2) \\ 3(x_3^3 - x_2^2) + 2(x_4^3 - x_3^2) \\ \vdots \\ 3(x_n^3 - x_{n-1}^2) \end{bmatrix}. \qquad (8.13)$$
Each component of $r$ depends on just two or three components of $x$, so that each row of the Jacobian contains only two or three nonzero elements. For the case of $n = 6$, the Jacobian has the following structure:

$$J(x) = \begin{bmatrix} \times & \times & & & & \\ \times & \times & \times & & & \\ & \times & \times & \times & & \\ & & \times & \times & \times & \\ & & & \times & \times & \times \\ & & & & \times & \times \end{bmatrix}, \qquad (8.14)$$
where each cross represents a nonzero element, with zeros represented by a blank space. Staying for the moment with the case $n = 6$, suppose that we wish to compute a finite-difference approximation to the Jacobian. Of course, it is easy to calculate this particular Jacobian by hand, but there are complicated functions with similar structure for which hand calculation is more difficult. A perturbation $p = \epsilon e_1$ to the first component of $x$ will affect only the first and second components of $r$. The remaining components will be unchanged, so that the right-hand side of formula (8.12) will correctly evaluate to zero in components 3, 4, 5, 6. It is wasteful, however, to reevaluate these components of $r$ when we know in advance that their values are not affected by the perturbation. Instead, we look for a way to modify the perturbation vector so that it does not have any further effect on components 1 and 2, but does produce a change in some of the components 3, 4, 5, 6, which we can then use as the basis of a finite-difference estimate for some other column of the Jacobian. It is not hard to see that the additional perturbation $\epsilon e_4$ has the desired property: It alters the 3rd, 4th, and 5th elements of $r$, but leaves the 1st and 2nd elements unchanged. The changes in $r$ as a result of the perturbations $\epsilon e_1$ and $\epsilon e_4$ do not interfere with each
other.
To express this discussion in mathematical terms, we set

$$p = \epsilon(e_1 + e_4),$$

and note that

$$r(x + p)_{[1,2]} = r(x + \epsilon e_1 + \epsilon e_4)_{[1,2]} = r(x + \epsilon e_1)_{[1,2]}, \qquad (8.15)$$

where the notation $[1,2]$ denotes the subvector consisting of the first and second elements, while

$$r(x + p)_{[3,4,5]} = r(x + \epsilon e_1 + \epsilon e_4)_{[3,4,5]} = r(x + \epsilon e_4)_{[3,4,5]}. \qquad (8.16)$$

By substituting (8.15) into (8.10), we obtain

$$r(x + p)_{[1,2]} = r(x)_{[1,2]} + \epsilon \left[J(x)e_1\right]_{[1,2]} + O(\epsilon^2).$$
By rearranging this expression, we obtain the following difference formula for estimating the $(1,1)$ and $(2,1)$ elements of the Jacobian matrix:

$$\begin{bmatrix} \partial r_1/\partial x_1\,(x) \\ \partial r_2/\partial x_1\,(x) \end{bmatrix} = \left[J(x)e_1\right]_{[1,2]} \approx \frac{r(x + p)_{[1,2]} - r(x)_{[1,2]}}{\epsilon}. \qquad (8.17)$$
A similar argument shows that the nonzero elements of the fourth column of the Jacobian can be estimated by substituting (8.16) into (8.10); we obtain

$$\begin{bmatrix} \partial r_3/\partial x_4\,(x) \\ \partial r_4/\partial x_4\,(x) \\ \partial r_5/\partial x_4\,(x) \end{bmatrix} = \left[J(x)e_4\right]_{[3,4,5]} \approx \frac{r(x + p)_{[3,4,5]} - r(x)_{[3,4,5]}}{\epsilon}. \qquad (8.18)$$
To summarize: We have been able to estimate two columns of the Jacobian $J(x)$ by evaluating the function $r$ at the single extra point $x + \epsilon(e_1 + e_4)$.
We can approximate the remainder of $J(x)$ in an economical manner as well. Columns 2 and 5 can be approximated by choosing $p = \epsilon(e_2 + e_5)$, while we can use $p = \epsilon(e_3 + e_6)$ to approximate columns 3 and 6. In total, we need 3 evaluations of the function $r$ (after the initial evaluation at $x$) to estimate the entire Jacobian matrix.
In fact, for any choice of $n$ in (8.13), no matter how large, three extra evaluations of $r$ are sufficient to approximate the entire Jacobian. The corresponding choices of perturbation vectors $p$ are

$$p = \epsilon(e_1 + e_4 + e_7 + e_{10} + \cdots), \quad p = \epsilon(e_2 + e_5 + e_8 + e_{11} + \cdots), \quad p = \epsilon(e_3 + e_6 + e_9 + e_{12} + \cdots).$$
In the first of these vectors, the nonzero components are chosen so that no two of the columns 1, 4, 7, . . . have a nonzero element in the same row. The same property holds for the other two vectors and, in fact, points the way to the criterion that we can apply to general problems to decide on a valid set of perturbation vectors.
Algorithms for choosing the perturbation vectors can be expressed conveniently in the language of graphs and graph coloring. For any function $r: \mathbb{R}^n \to \mathbb{R}^m$, we can construct a column incidence graph $G$ with $n$ nodes by drawing an arc between nodes $i$ and $k$ if there is some component of $r$ that depends on both $x_i$ and $x_k$. (In other words, the $i$th and $k$th columns of the Jacobian $J(x)$ each have a nonzero element in some row $j$, for some $j = 1, 2, \ldots, m$ and some value of $x$.) The intersection graph for the function defined in (8.13), with $n = 6$, is shown in Figure 8.1.

Figure 8.1  Column incidence graph for $r(x)$ defined in (8.13).

We now assign each node a color according to the following rule: Two nodes can have the same color if there is no arc that connects them. Finally, we choose one perturbation vector corresponding to each color: If nodes $i_1, i_2, \ldots, i_\ell$ have the same color, the corresponding $p$ is $\epsilon(e_{i_1} + e_{i_2} + \cdots + e_{i_\ell})$.
Usually, there are many ways to assign colors to the $n$ nodes in the graph in a way that satisfies the required condition. The simplest way is just to assign each node a different color, but since that scheme produces $n$ perturbation vectors, it is usually not the most efficient approach. It is generally very difficult to find the coloring scheme that uses the fewest possible colors, but there are simple algorithms that do a good job of finding a near-optimal coloring at low cost. Curtis, Powell, and Reid [83] and Coleman and Moré [68] provide descriptions of some methods and performance comparisons. Newsam and Ramsdell [227] show that by considering a more general class of perturbation vectors $p$, it is possible to evaluate the full Jacobian using no more than $n_z$ evaluations of $r$ (in addition to the evaluation at the point $x$), where $n_z$ is the maximum number of nonzeros in each row of $J(x)$.
For some functions $r$ with well-studied structures (those that arise from discretizations of differential operators, or those that give rise to banded Jacobians, as in the example above), optimal coloring schemes are known. For the tridiagonal Jacobian of (8.14) and its associated graph in Figure 8.1, the scheme with three colors is optimal.
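The sketch below illustrates grouped estimation for the banded example (8.14). The caller supplies the groups and the known row pattern of each column, which for this example is exactly the three-color scheme just described; the function and argument names are our own, and a general-purpose code would compute the coloring itself.

```python
import numpy as np

def grouped_fd_jacobian(r, x, groups, rows_of_col, eps=1e-7):
    """Estimate a sparse Jacobian by forward differences (8.12),
    perturbing a whole group of columns at once.

    groups       -- lists of column indices; within a group, no two
                    columns may share a nonzero row (the coloring rule)
    rows_of_col  -- rows_of_col(i) returns the rows in which column i
                    of the Jacobian may be nonzero
    """
    rx = r(x)
    J = np.zeros((rx.size, x.size))
    for cols in groups:
        p = np.zeros_like(x)
        p[cols] = eps                    # p = eps * sum of e_i over the group
        diff = (r(x + p) - rx) / eps     # one extra evaluation of r per group
        for i in cols:
            rows = rows_of_col(i)
            J[rows, i] = diff[rows]      # entries of column i
    return J

# For the tridiagonal example (8.14) with n variables (0-based indices):
# groups      = [list(range(c, n, 3)) for c in range(3)]
# rows_of_col = lambda i: [j for j in (i - 1, i, i + 1) if 0 <= j < n]
```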
APPROXIMATING THE HESSIAN
In some situations, the user may be able to provide a routine to calculate the gradient $\nabla f(x)$ but not the Hessian $\nabla^2 f(x)$. We can obtain the Hessian by applying the techniques described above for the vector function $r$ to the gradient $\nabla f$. By using the graph-coloring techniques discussed above, sparse Hessians often can be approximated in this manner by using considerably fewer than $n$ perturbation vectors. This approach ignores symmetry of the Hessian, and will usually produce a nonsymmetric approximation. We can recover symmetry by adding the approximation to its transpose and dividing the result by 2. Alternative differencing approaches that take symmetry of $\nabla^2 f(x)$ explicitly into account are discussed below.
Some important algorithms, most notably the Newton-CG methods described in Chapter 7, do not require knowledge of the full Hessian. Instead, each iteration requires us to supply the Hessian-vector product $\nabla^2 f(x)p$, for a given vector $p$. We can obtain an approximation to this matrix-vector product by appealing once again to Taylor's theorem. When second derivatives of $f$ exist and are Lipschitz continuous near $x$, we have

$$\nabla f(x + \epsilon p) = \nabla f(x) + \epsilon \nabla^2 f(x)p + O(\epsilon^2), \qquad (8.19)$$

so that

$$\nabla^2 f(x)p \approx \frac{\nabla f(x + \epsilon p) - \nabla f(x)}{\epsilon} \qquad (8.20)$$

(see also (7.10)). The approximation error is $O(\epsilon)$, and the cost of obtaining the approximation is a single gradient evaluation at the point $x + \epsilon p$. The formula (8.20) corresponds to the forward-difference approximation (8.1). A central-difference formula like (8.7) can be derived by evaluating $\nabla f(x - \epsilon p)$ as well.
For the case in which even gradients are not available, we can use Taylor's theorem once again to derive formulae for approximating the Hessian that use only function values. The main tool is the formula (8.8): By substituting the vectors $p = \epsilon e_i$, $p = \epsilon e_j$, and $p = \epsilon(e_i + e_j)$ into this formula and combining the results appropriately, we obtain

$$\frac{\partial^2 f}{\partial x_i \partial x_j}(x) = \frac{f(x + \epsilon e_i + \epsilon e_j) - f(x + \epsilon e_i) - f(x + \epsilon e_j) + f(x)}{\epsilon^2} + O(\epsilon). \qquad (8.21)$$

If we wished to approximate every element of the Hessian with this formula, then we would need to evaluate $f$ at $x + \epsilon(e_i + e_j)$ for all possible $i$ and $j$ (a total of $n(n+1)/2$ points) as well as at the $n$ points $x + \epsilon e_i$, $i = 1, 2, \ldots, n$. If the Hessian is sparse, we can, of course, reduce this operation count by skipping the evaluation whenever we know the element $\partial^2 f/\partial x_i \partial x_j$ to be zero.
APPROXIMATING A SPARSE HESSIAN
We noted above that a Hessian approximation can be obtained by applying finite-difference Jacobian estimation techniques to the gradient $\nabla f$, treated as a vector function. We now show how symmetry of the Hessian $\nabla^2 f$ can be used to reduce the number of perturbation vectors $p$ needed to obtain a complete approximation, when the Hessian is sparse. The key observation is that, because of symmetry, any estimate of the element $[\nabla^2 f(x)]_{i,j} = \partial^2 f(x) / \partial x_i \partial x_j$ is also an estimate of its symmetric counterpart $[\nabla^2 f(x)]_{j,i}$.
We illustrate the point with the simple function $f: \mathbb{R}^n \to \mathbb{R}$ defined by

$f(x) = x_1 \sum_{i=1}^{n} i^2 x_i^2$.   (8.22)

It is easy to show that the Hessian $\nabla^2 f$ has the arrowhead structure depicted below (nonzeros in the first row, the first column, and along the diagonal), for the case of $n = 6$:

$\begin{bmatrix} \times & \times & \times & \times & \times & \times \\ \times & \times & & & & \\ \times & & \times & & & \\ \times & & & \times & & \\ \times & & & & \times & \\ \times & & & & & \times \end{bmatrix}$   (8.23)

If we were to construct the intersection graph for the function $\nabla f$ (analogous to Figure 8.1), we would find that every node is connected to every other node, for the simple reason that row 1 has a nonzero in every column. According to the rule for coloring the graph, then, we would have to assign a different color to every node, which implies that we would need to evaluate $\nabla f$ at the $n + 1$ points $x$ and $x + \epsilon e_i$, $i = 1, 2, \dots, n$.

We can construct a much more efficient scheme by taking the symmetry into account. Suppose we first use the perturbation vector $p = \epsilon e_1$ to estimate the first column of $\nabla^2 f(x)$. Because of symmetry, the same estimates apply to the first row of $\nabla^2 f$. From (8.23), we see that all that remains is to find the diagonal elements $\partial^2 f / \partial x_2^2, \partial^2 f / \partial x_3^2, \dots, \partial^2 f / \partial x_6^2$. The intersection graph for these remaining elements is completely disconnected, so we can assign them all the same color and choose the corresponding perturbation vector to be

$p = \epsilon (e_2 + e_3 + \cdots + e_6) = \epsilon (0, 1, 1, 1, 1, 1)^T$.   (8.24)

Note that the second component of $\nabla f$ is not affected by the perturbations in components 3, 4, 5, 6 of the unknown vector, while the third component of $\nabla f$ is not affected by perturbations in components 2, 4, 5, 6 of $x$, and so on. As in (8.15) and (8.16), we have for each component $i$ that

$[\nabla f(x + p)]_i = [\nabla f(x + \epsilon(e_2 + e_3 + \cdots + e_6))]_i = [\nabla f(x + \epsilon e_i)]_i$.

By applying the forward-difference formula (8.1) to each of these individual components, we then obtain

$\dfrac{\partial^2 f}{\partial x_i^2}(x) \approx \dfrac{[\nabla f(x + \epsilon e_i)]_i - [\nabla f(x)]_i}{\epsilon} = \dfrac{[\nabla f(x + p)]_i - [\nabla f(x)]_i}{\epsilon}$,   $i = 2, 3, \dots, 6$.
By exploiting symmetry, we have been able to estimate the entire Hessian by evaluating $\nabla f$ only at $x$ and two other points.
Again, graph-coloring techniques can be used to choose the perturbation vectors $p$ economically. We use the adjacency graph in place of the intersection graph described earlier. The adjacency graph has $n$ nodes, with arcs connecting nodes $i$ and $k$ whenever $i \neq k$ and $\partial^2 f(x) / \partial x_i \partial x_k \neq 0$ for some $x$. The requirements on the coloring scheme are a little more complicated than before, however. We require not only that connected nodes have different colors, but also that any path of length 3 through the graph contain at least three colors. In other words, if there exist nodes $i_1$, $i_2$, $i_3$, $i_4$ in the graph that are connected by arcs $(i_1, i_2)$, $(i_2, i_3)$, and $(i_3, i_4)$, then at least three different colors must be used in coloring these four nodes. (See Coleman and Moré [69] for an explanation of this rule and for algorithms to compute valid colorings.) The perturbation vectors are constructed as before: Whenever the nodes $i_1, i_2, \dots, i_\ell$ have the same color, we set the corresponding perturbation vector to be

$p = \epsilon (e_{i_1} + e_{i_2} + \cdots + e_{i_\ell})$.
8.2 AUTOMATIC DIFFERENTIATION
Automatic differentiation is the generic name for techniques that use the computational representation of a function to produce analytic values for the derivatives. Some techniques produce code for the derivatives at a general point x by manipulating the function code directly. Other techniques keep a record of the computations made during the evaluation of the function at a specific point x and then review this information to produce a set of derivatives at x.
Automatic differentiation techniques are founded on the observation that any function, no matter how complicated, is evaluated by performing a sequence of simple elementary operations involving just one or two arguments at a time. Two-argument operations include addition, multiplication, division, and the power operation $a^b$. Examples of single-argument operations include the trigonometric, exponential, and logarithmic functions. Another common ingredient of the various automatic differentiation tools is their use of the chain rule. This is the well-known rule from elementary calculus that says that if $h$ is a function of the vector $y \in \mathbb{R}^m$, which is in turn a function of the vector $x \in \mathbb{R}^n$, we can write the derivative of $h$ with respect to $x$ as follows:

$\nabla_x h(y(x)) = \sum_{i=1}^{m} \dfrac{\partial h}{\partial y_i} \nabla y_i(x)$.   (8.25)
See Appendix A for further details.
There are two basic modes of automatic differentiation: the forward and reverse modes.
The difference between them can be illustrated by a simple example. We work through such
an example below, and indicate how the techniques can be extended to general functions, including vector functions.
AN EXAMPLE
Consider the following function of 3 variables:
$f(x) = \dfrac{x_1 x_2 \sin x_3 + e^{x_1 x_2}}{x_3}$.   (8.26)
Figure 8.2 shows how the evaluation of this function can be broken down into its elementary operations and also indicates the partial ordering associated with these operations. For instance, the multiplication $x_1 \cdot x_2$ must take place prior to the exponentiation $e^{x_1 x_2}$, or else we would obtain the incorrect result $(e^{x_1}) x_2$. This graph introduces the intermediate variables $x_4, x_5, \dots$ that contain the results of intermediate computations; they are distinguished from the independent variables $x_1$, $x_2$, $x_3$ that appear at the left of the graph. We can express the evaluation of $f$ in arithmetic terms as follows:
$x_4 = x_1 \cdot x_2$,   $x_5 = \sin x_3$,   $x_6 = e^{x_4}$,
$x_7 = x_4 \cdot x_5$,   $x_8 = x_6 + x_7$,   $x_9 = x_8 / x_3$.   (8.27)
The final node $x_9$ in Figure 8.2 contains the function value $f(x)$. In the terminology of graph theory, node $i$ is the parent of node $j$, and node $j$ the child of node $i$, whenever there is a directed arc from $i$ to $j$. Any node can be evaluated when the values of all its parents are known, so computation flows through the graph from left to right.
Figure 8.2 Computational graph for $f(x)$ defined in (8.26).
Flow of computation in this direction is known as a forward sweep. It is important to emphasize that software tools for automatic differentiation do not require the user to break down the code for evaluating the function into its elements, as in (8.27). Identification of intermediate quantities and construction of the computational graph is carried out, explicitly or implicitly, by the software tool itself.
THE FORWARD MODE
In the forward mode of automatic differentiation, we evaluate and carry forward a directional derivative of each intermediate variable $x_i$ in a given direction $p \in \mathbb{R}^n$, simultaneously with the evaluation of $x_i$ itself. For the three-variable example above, we use the following notation for the directional derivative for $p$ associated with each variable:

$D_p x_i \stackrel{\mathrm{def}}{=} (\nabla x_i)^T p = \sum_{j=1}^{3} \dfrac{\partial x_i}{\partial x_j} p_j$,   $i = 1, 2, \dots, 9$,   (8.28)

where $\nabla$ indicates the gradient with respect to the three independent variables. Our goal is to evaluate $D_p x_9$, which is the same as the directional derivative $\nabla f(x)^T p$. We note immediately that the initial values $D_p x_i$ for the independent variables $x_i$, $i = 1, 2, 3$, are simply the components $p_1$, $p_2$, $p_3$ of $p$. The direction $p$ is referred to as the seed vector.
As soon as the value of $x_i$ at any node is known, we can find the corresponding value of $D_p x_i$ from the chain rule. For instance, suppose we know the values of $x_4$, $D_p x_4$, $x_5$, and $D_p x_5$, and we are about to calculate $x_7$ in Figure 8.2. We have that $x_7 = x_4 x_5$; that is, $x_7$ is a function of the two variables $x_4$ and $x_5$, which in turn are functions of $x_1$, $x_2$, $x_3$. By applying the rule (8.25), we have that

$\nabla x_7 = \dfrac{\partial x_7}{\partial x_4} \nabla x_4 + \dfrac{\partial x_7}{\partial x_5} \nabla x_5 = x_5 \nabla x_4 + x_4 \nabla x_5$.

By taking the inner product of both sides of this expression with $p$ and applying the definition (8.28), we obtain

$D_p x_7 = \dfrac{\partial x_7}{\partial x_4} D_p x_4 + \dfrac{\partial x_7}{\partial x_5} D_p x_5 = x_5 D_p x_4 + x_4 D_p x_5$.   (8.29)

The directional derivatives $D_p x_i$ are therefore evaluated side by side with the intermediate results $x_i$, and at the end of the process we obtain $D_p x_9 = D_p f = \nabla f(x)^T p$.
The principle of the forward mode is straightforward enough, but what of its practical implementation and computational requirements? First, we repeat that the user does not need to construct the computational graph, break the computation down into elementary operations as in (8.27), or identify intermediate variables. The automatic differentiation software should perform these tasks implicitly and automatically. Nor is it necessary to store the information $x_i$ and $D_p x_i$ for every node of the computation graph at once (which is just as well, since this graph can be very large for complicated functions). Once all the children of any node have been evaluated, its associated values $x_i$ and $D_p x_i$ are not needed further and may be overwritten in storage.
The key to practical implementation is the side-by-side evaluation of $x_i$ and $D_p x_i$. The automatic differentiation software associates a scalar $D_p w$ with any scalar $w$ that appears in the evaluation code. Whenever $w$ is used in an arithmetic computation, the software performs an associated operation (based on the chain rule) on the directional derivative $D_p w$. For instance, if $w$ is combined in a division operation with another value $y$ to produce a new value $z$, that is,

$z = \dfrac{w}{y}$,

we use $w$, $z$, $D_p w$, and $D_p y$ to evaluate the directional derivative $D_p z$ as follows:

$D_p z = \dfrac{1}{y} D_p w - \dfrac{w}{y^2} D_p y$.   (8.30)
To obtain the complete gradient vector, we can carry out this procedure simultaneously for the $n$ seed vectors $p = e_1, e_2, \dots, e_n$. By the definition (8.28), we see that $p = e_j$ implies that $D_p f = \partial f / \partial x_j$, $j = 1, 2, \dots, n$. We note from the example (8.30) that the additional cost of evaluating $f$ and $\nabla f$ (over the cost of evaluating $f$ alone) may be significant. In this example, the single division operation on $w$ and $y$ needed to calculate $z$ gives rise to approximately $2n$ multiplications and $n$ additions in the computation of the gradient elements $D_{e_j} z$, $j = 1, 2, \dots, n$. It is difficult to obtain an exact bound on the increase in computation, since the costs of retrieving and storing the data should also be taken into account. The storage requirements may also increase by a factor as large as $n$, since we now have to store $n$ additional scalars $D_{e_j} x_i$, $j = 1, 2, \dots, n$, alongside each intermediate variable $x_i$. It is usually possible to make savings by observing that many of these quantities are zero, particularly in the early stages of the computation (that is, toward the left of the computational graph), so sparse data structures can be used to store the vectors $D_{e_j} x_i$, $j = 1, 2, \dots, n$ (see [27]).
The forward mode of automatic differentiation can be implemented by means of a precompiler, which transforms function evaluation code into extended code that evaluates the derivative vectors as well. An alternative approach is to use the operator-overloading facilities available in languages such as C++ to transparently extend the data structures and operations in the manner described above.
THE REVERSE MODE
The reverse mode of automatic differentiation does not perform function and gradient evaluations concurrently. Instead, after the evaluation of $f$ is complete, it recovers the partial derivatives of $f$ with respect to each variable $x_i$ (independent and intermediate variables alike) by performing a reverse sweep of the computational graph. At the conclusion of this process, the gradient vector $\nabla f$ can be assembled from the partial derivatives $\partial f / \partial x_i$ with respect to the independent variables $x_i$, $i = 1, 2, \dots, n$.
Instead of the gradient vectors $D_p x_i$ used in the forward mode, the reverse mode associates a scalar variable $\bar{x}_i$ with each node in the graph; information about the partial derivative $\partial f / \partial x_i$ is accumulated in $\bar{x}_i$ during the reverse sweep. The $\bar{x}_i$ are sometimes called the adjoint variables, and we initialize their values to zero, with the exception of the rightmost node in the graph (node $N$, say), for which we set $\bar{x}_N = 1$. This choice makes sense because $x_N$ contains the final function value $f$, so we have $\partial f / \partial x_N = 1$.
The reverse sweep makes use of the following observation, which is again based on the chain rule (8.25): For any node $i$, the partial derivative $\partial f / \partial x_i$ can be built up from the partial derivatives $\partial f / \partial x_j$ corresponding to its child nodes $j$ according to the following formula:

$\dfrac{\partial f}{\partial x_i} = \sum_{j \text{ a child of } i} \dfrac{\partial f}{\partial x_j} \dfrac{\partial x_j}{\partial x_i}$.   (8.31)

For each node $i$, we add the right-hand-side term in (8.31) to $\bar{x}_i$ as soon as it becomes known; that is, we perform the operation

$\bar{x}_i \mathrel{+}= \dfrac{\partial f}{\partial x_j} \dfrac{\partial x_j}{\partial x_i}$.   (8.32)

(In this expression and the ones below, we use the arithmetic notation of the programming language C, in which $x \mathrel{+}= a$ means $x = x + a$.) Once contributions have been received from all the child nodes of $i$, we have $\bar{x}_i = \partial f / \partial x_i$, so we declare node $i$ to be finalized. At this point, node $i$ is ready to contribute a term to the summation for each of its parent nodes according to the formula (8.31). The process continues in this fashion until all nodes are finalized. Note that for derivative evaluation, the flow of computation in the graph is from children to parents: the opposite direction to the computation flow for function evaluation.
During the reverse sweep, we work with numerical values, not with formulae or computer code involving the variables $x_i$ or the partial derivatives $\partial f / \partial x_i$. During the forward sweep (the evaluation of $f$), we not only calculate the values of each variable $x_i$, but we also calculate and store the numerical values of each partial derivative $\partial x_j / \partial x_i$. Each of these partial derivatives is associated with a particular arc of the computational graph. The numerical values of $\partial x_j / \partial x_i$ computed during the forward sweep are then used in the formula (8.32) during the reverse sweep.
We illustrate the reverse mode for the example function (8.26). In Figure 8.3 we fill in the graph of Figure 8.2 for a specific evaluation point $x = (1, 2, \pi/2)^T$, indicating the numerical values of the intermediate variables $x_4, x_5, \dots, x_9$ associated with each node and the partial derivatives $\partial x_j / \partial x_i$ associated with each arc.

Figure 8.3 Computational graph for $f(x)$ defined in (8.26), showing numerical values of intermediate variables and partial derivatives for the point $x = (1, 2, \pi/2)^T$. Notation: $p_{j,i} = \partial x_j / \partial x_i$.
As mentioned above, we initialize the reverse sweep by setting all the adjoint variables $\bar{x}_i$ to zero, except for the rightmost node, for which we have $\bar{x}_9 = 1$. Since $f(x) = x_9$ and since node 9 has no children, we have $\bar{x}_9 = \partial f / \partial x_9$, and so we can immediately declare node 9 to be finalized.

Node 9 is the child of nodes 3 and 8, so we use formula (8.32) to update the values of $\bar{x}_3$ and $\bar{x}_8$ as follows:
$\bar{x}_3 \mathrel{+}= \dfrac{\partial f}{\partial x_9} \dfrac{\partial x_9}{\partial x_3} = -\dfrac{2 + e^2}{(\pi/2)^2} = -\dfrac{8 + 4e^2}{\pi^2}$,   (8.33a)

$\bar{x}_8 \mathrel{+}= \dfrac{\partial f}{\partial x_9} \dfrac{\partial x_9}{\partial x_8} = 1 \cdot \dfrac{1}{\pi/2} = \dfrac{2}{\pi}$.   (8.33b)
Node 3 is not finalized after this operation; it still awaits a contribution from its other child, node 5. On the other hand, node 9 is the only child of node 8, so we can declare node 8 to be finalized with the value $\bar{x}_8 = \partial f / \partial x_8 = 2/\pi$. We can now update the values of $\bar{x}_i$ at the two parent nodes of node 8 by applying the formula (8.32) once again; that is,

$\bar{x}_6 \mathrel{+}= \dfrac{\partial f}{\partial x_8} \dfrac{\partial x_8}{\partial x_6} = \dfrac{2}{\pi}$;   $\bar{x}_7 \mathrel{+}= \dfrac{\partial f}{\partial x_8} \dfrac{\partial x_8}{\partial x_7} = \dfrac{2}{\pi}$.

At this point, nodes 6 and 7 are finalized, so we can use them to update nodes 4 and 5. At
the end of this process, when all nodes are finalized, nodes 1, 2, and 3 contain
$\bar{x}_1 = \dfrac{4 + 4e^2}{\pi}$,   $\bar{x}_2 = \dfrac{2 + 2e^2}{\pi}$,   $\bar{x}_3 = -\dfrac{8 + 4e^2}{\pi^2}$,

which are the components of the gradient $\nabla f(x)$,
and the derivative computation is complete.
The main appeal of the reverse mode is that its computational complexity is low for the scalar functions $f: \mathbb{R}^n \to \mathbb{R}$ discussed here. The extra arithmetic associated with the gradient computation is at most four or five times the arithmetic needed to evaluate the function alone. Taking the division operation in (8.33) as an example, we see that two multiplications, a division, and an addition are required for (8.33a), while a division and an addition are required for (8.33b). This is about five times as much work as the single division involving these nodes that was performed during the forward sweep.
As we noted above, the forward mode may require up to $n$ times more arithmetic to compute the gradient $\nabla f$ than to compute the function $f$ alone, making it appear uncompetitive with the reverse mode. When we consider vector functions $r: \mathbb{R}^n \to \mathbb{R}^m$, the relative costs of the forward and reverse modes become more similar as $m$ increases, as we describe in the next section.
An apparent drawback of the reverse mode is the need to store the entire computational graph, which is needed for the reverse sweep. In principle, storage of this graph is not too difficult to implement. Whenever an elementary operation is performed, we can form and store a new node containing the intermediate result, pointers to the (one or two) parent nodes, and the partial derivatives associated with these arcs. During the reverse sweep, the nodes can be read in the reverse order to that in which they were written, giving a particularly simple access pattern. The process of forming and writing the graph can be implemented as a straightforward extension to the elementary operations via operator overloading (as in ADOL-C [154]). The reverse sweep / gradient evaluation can be invoked as a simple function call.
Unfortunately, the computational graph may require a huge amount of storage. If each node can be stored in 20 bytes, then a function that requires one second of evaluation time on a 100 megaflop computer may produce a graph of up to 2 gigabytes in size. The storage requirements can be reduced, at the cost of some extra arithmetic, by performing partial forward and reverse sweeps on pieces of the computational graph, reevaluating portions of the graph as needed rather than storing the whole structure. Descriptions of this approach, sometimes known as checkpointing, can be found in Griewank [150] and Grimm, Pottier, and Rostaing-Schmidt [157]. An implementation of checkpointing in the context of variational data assimilation can be found in Restrepo, Leaf, and Griewank [264].
VECTOR FUNCTIONS AND PARTIAL SEPARABILITY
So far, we have looked at automatic differentiation of general scalar-valued functions $f: \mathbb{R}^n \to \mathbb{R}$. In nonlinear least-squares problems (Chapter 10) and nonlinear equations (Chapter 11), we have to deal with vector functions $r: \mathbb{R}^n \to \mathbb{R}^m$ with $m$ components $r_j$, $j = 1, 2, \dots, m$. The rightmost column of the computational graph then consists of $m$ nodes, none of which has any children, in place of the single node described above. The forward and reverse modes can be adapted in straightforward ways to find the Jacobian $J(x)$, the $m \times n$ matrix defined in (8.9).

Besides their applications to least-squares and nonlinear-equations problems, automatic differentiation of vector functions is a useful technique for dealing with partially separable functions. We recall that partial separability is commonly observed in large-scale optimization, and we saw in Chapter 7 that there exist efficient quasi-Newton procedures for the minimization of objective functions with this property. Since an automatic procedure for detecting the decomposition of a given function $f$ into its partially separable representation was developed recently by Gay [118], it has become possible to exploit the efficiencies that accrue from this property without asking much information from the user.

In the simplest sense, a function $f$ is partially separable if we can express it in the form

$f(x) = \sum_{i=1}^{ne} f_i(x)$,   (8.34)

where each element function $f_i$ depends on just a few components of $x$. If we construct the vector function $r$ from the partially separable components, that is,

$r(x) = \left[ f_1(x), \, f_2(x), \, \dots, \, f_{ne}(x) \right]^T$,

it follows from (8.34) that

$\nabla f(x) = J(x)^T e$,   (8.35)

where, as usual, $e = (1, 1, \dots, 1)^T$. Because of the partial separability property, most columns of $J(x)$ contain just a few nonzeros. This structure makes it possible to calculate $J(x)$ efficiently by applying graph-coloring techniques, as we discuss below. The gradient $\nabla f(x)$ can then be recovered from the formula (8.35).

In constrained optimization, it is often beneficial to evaluate the objective function $f$ and the constraint functions $c_i$, $i \in \mathcal{E}$, simultaneously. By doing so, we can take advantage of common expressions (which show up as shared intermediate nodes in the computation graph) and thus can reduce the total workload. In this case, the vector function $r$ can be defined as

$r(x) = \begin{bmatrix} f(x) \\ c_j(x), \; j \in \mathcal{E} \end{bmatrix}$.

An example of shared intermediate nodes was seen in Figure 8.2, where $x_4$ is shared during the computation of $x_6$ and $x_7$.
CALCULATING JACOBIANS OF VECTOR FUNCTIONS
The forward mode is the same for vector functions as for scalar functions. Given a seed vector $p$, we continue to associate quantities $D_p x_i$ with the node that calculates each intermediate variable $x_i$. At each of the rightmost nodes (containing $r_j$, $j = 1, 2, \dots, m$), this variable contains the quantity $D_p r_j = (\nabla r_j)^T p$, $j = 1, 2, \dots, m$. By assembling these $m$ quantities, we obtain $J(x) p$, the product of the Jacobian and our chosen vector $p$. As in the case of scalar functions ($m = 1$), we can evaluate the complete Jacobian by setting $p = e_1, e_2, \dots, e_n$ and evaluating the $n$ quantities $D_{e_j} x_i$ simultaneously. For sparse Jacobians, we can use the coloring techniques outlined above in the context of finite-difference methods to make more intelligent and economical choices of the seed vectors $p$. The factor of increase in cost of arithmetic, when compared to a single evaluation of $r$, is about equal to the number of seed vectors used.

The key to applying the reverse mode to a vector function $r(x)$ is to choose seed vectors $q \in \mathbb{R}^m$ and apply the reverse mode to the scalar function $r(x)^T q$. The result of this process is the vector

$\nabla \left( r(x)^T q \right) = \sum_{j=1}^{m} q_j \nabla r_j(x) = J(x)^T q$.

Instead of the Jacobian-vector product that we obtain with the forward mode, the reverse mode yields a Jacobian-transpose-vector product. The technique can be implemented by seeding the variables $\bar{x}_i$ in the $m$ dependent nodes that contain $r_1, r_2, \dots, r_m$ with the components $q_1, q_2, \dots, q_m$ of the vector $q$. At the end of the reverse sweep, the nodes for the independent variables $x_1, x_2, \dots, x_n$ will contain

$\dfrac{d}{d x_i} \left( r(x)^T q \right)$,   $i = 1, 2, \dots, n$,

which are simply the components of $J(x)^T q$.

As usual, we can obtain the full Jacobian by carrying out the process above for the $m$ unit vectors $q = e_1, e_2, \dots, e_m$. Alternatively, for sparse Jacobians, we can apply the usual coloring techniques to find a smaller number of seed vectors $q$, the only difference being
that the graphs and coloring strategies are defined with reference to the transpose $J(x)^T$ rather than to $J(x)$ itself. The factor of increase in the number of arithmetic operations required, in comparison to an evaluation of $r$ alone, is no more than 5 times the number of seed vectors. (The factor of 5 is the usual overhead from the reverse mode for a scalar function.) The space required for storage of the computational graph is no greater than in the scalar case. As before, we need only store the graph topology information together with the partial derivative associated with each arc.
The forward- and reverse-mode techniques can be combined to cumulatively reveal all the elements of $J(x)$. We can choose a set of seed vectors $p$ for the forward mode to reveal some columns of $J$, then perform the reverse mode with another set of seed vectors $q$ to reveal the rows that contain the remaining elements.
Finally, we note that for some algorithms, we do not need full knowledge of the Jacobian $J(x)$. For instance, iterative methods such as the inexact Newton method for nonlinear equations (see Section 11.1) require repeated calculation of $J(x) p$ for a succession of vectors $p$. Each such matrix-vector product can be computed using the forward mode with a single forward sweep, at a similar cost to evaluation of the function alone.
CALCULATING HESSIANS: FORWARD MODE
So far, we have described how the forward and reverse modes can be applied to obtain first derivatives of scalar and vector functions. We now outline extensions of these techniques to the computation of the Hessian $\nabla^2 f$ of a scalar function $f$, and evaluation of the Hessian-vector product $\nabla^2 f(x)\, p$ for a given vector $p$.
Recall that the forward mode makes use of the quantities $D_p x_i$, each of which stores $(\nabla x_i)^T p$ for each node $i$ in the computational graph and a given vector $p$. For a given pair of seed vectors $p$ and $q$ (both in $\mathbb{R}^n$) we now define another scalar quantity by

$D_{pq} x_i \stackrel{\mathrm{def}}{=} p^T (\nabla^2 x_i)\, q$,   (8.36)

for each node $i$ in the computational graph. We can evaluate these quantities during the forward sweep through the graph, alongside the function values $x_i$ and the first-derivative values $D_p x_i$. The initial values of $D_{pq}$ at the independent variable nodes $x_i$, $i = 1, 2, \dots, n$, will be 0, since the second derivatives of $x_i$ are zero at each of these nodes. When the forward sweep is complete, the value of $D_{pq} x_i$ in the rightmost node of the graph will be $p^T \nabla^2 f(x)\, q$.
The formulae for transformation of the $D_{pq} x_i$ variables during the forward sweep can once again be derived from the chain rule. For instance, if $x_i$ is obtained by adding the values at its two parent nodes, $x_i = x_j + x_k$, the corresponding accumulation operations on $D_p x_i$ and $D_{pq} x_i$ are as follows:

$D_p x_i = D_p x_j + D_p x_k$,   $D_{pq} x_i = D_{pq} x_j + D_{pq} x_k$.   (8.37)
The other binary operations ($-$, $\times$, $\div$) are handled similarly. If $x_i$ is obtained by applying the unitary transformation $L$ to $x_j$, we have

$x_i = L(x_j)$,   (8.38a)
$D_p x_i = L'(x_j)\, D_p x_j$,   (8.38b)
$D_{pq} x_i = L''(x_j)\, (D_p x_j)(D_q x_j) + L'(x_j)\, D_{pq} x_j$.   (8.38c)

We see in (8.38c) that computation of $D_{pq} x_i$ can rely on the first-derivative quantities $D_p x_i$ and $D_q x_i$, so both these quantities must be accumulated during the forward sweep as well.
We could compute a general dense Hessian by choosing the pairs $(p, q)$ to be all possible pairs of unit vectors $(e_j, e_k)$, for $j = 1, 2, \dots, n$ and $k = 1, 2, \dots, j$, a total of $n(n+1)/2$ vector pairs. (Note that we need only evaluate the lower triangle of $\nabla^2 f(x)$, because of symmetry.) When we know the sparsity structure of $\nabla^2 f(x)$, we need evaluate $D_{e_j e_k} x_i$ only for the pairs $(e_j, e_k)$ for which the $(j, k)$ component of $\nabla^2 f(x)$ is possibly nonzero.

The total increase factor for the number of arithmetic operations, compared with the amount of arithmetic to evaluate $f$ alone, is a small multiple of $1 + n + N_z(\nabla^2 f)$, where $N_z(\nabla^2 f)$ is the number of elements of $\nabla^2 f$ that we choose to evaluate. This number reflects the evaluation of the quantities $x_i$, $D_{e_j} x_i$ ($j = 1, 2, \dots, n$), and $D_{e_j e_k} x_i$ for the $N_z(\nabla^2 f)$ vector pairs $(e_j, e_k)$. The small multiple results from the fact that the update operations for $D_p x_i$ and $D_{pq} x_i$ may require a few times more operations than the update operation for $x_i$ alone; see, for example, (8.38). One storage location per node of the graph is required for each of the $1 + n + N_z(\nabla^2 f)$ quantities that are accumulated, but recall that storage of node $i$ can be overwritten once all its children have been evaluated.
When we do not need the complete Hessian, but only a matrix-vector product involving the Hessian (as in the Newton-CG algorithm of Chapter 7), the amount of arithmetic is, of course, smaller. Given a vector $q \in \mathbb{R}^n$, we use the techniques above to compute the first-derivative quantities $D_{e_1} x_i, \dots, D_{e_n} x_i$ and $D_q x_i$, as well as the second-derivative quantities $D_{e_1 q} x_i, \dots, D_{e_n q} x_i$, during the forward sweep. The final node will contain the quantities

$e_j^T \left( \nabla^2 f(x) \right) q = \left[ \nabla^2 f(x)\, q \right]_j$,   $j = 1, 2, \dots, n$,

which are the components of the vector $\nabla^2 f(x)\, q$. Since $2n + 1$ quantities in addition to $x_i$ are being accumulated during the forward sweep, the increase factor in the number of arithmetic operations is a small multiple of $2n$.
An alternative technique for evaluating sparse Hessians is based on the forward-mode propagation of first and second derivatives of univariate functions. To motivate this approach, note that the $(i, j)$ element of the Hessian can be expressed as follows:

$[\nabla^2 f(x)]_{ij} = e_i^T \nabla^2 f(x)\, e_j = \tfrac{1}{2} \left[ (e_i + e_j)^T \nabla^2 f(x) (e_i + e_j) - e_i^T \nabla^2 f(x)\, e_i - e_j^T \nabla^2 f(x)\, e_j \right]$.   (8.39)

We can use this interpolation formula to evaluate $[\nabla^2 f(x)]_{ij}$, provided that the second derivatives $D_{pp} x_k$, for $p = e_i$, $p = e_j$, $p = e_i + e_j$, and all nodes $x_k$, have been evaluated during the forward sweep through the computational graph. In fact, we can evaluate all the nonzero elements of the Hessian, provided that we use the forward mode to evaluate $D_p x_k$ and $D_{pp} x_k$ for a selection of vectors $p$ of the form $e_i + e_j$, where $i$ and $j$ are both indices in $\{1, 2, \dots, n\}$, possibly with $i = j$.
One advantage of this approach is that it is no longer necessary to propagate cross terms of the form $D_{pq} x_k$ for $p \neq q$ (see, for example, (8.37) and (8.38c)). The propagation formulae therefore simplify somewhat. Each $D_{pp} x_k$ is a function of $x_\ell$, $D_p x_\ell$, and $D_{pp} x_\ell$ for all parent nodes $\ell$ of node $k$.
Note, too, that if we define the univariate function $\psi$ by

$\psi(t) = f(x + t p)$,   (8.40)

then the values of $D_p f$ and $D_{pp} f$, which emerge at the completion of the forward sweep, are simply the first two derivatives of $\psi$ evaluated at $t = 0$; that is,

$D_p f = p^T \nabla f(x) = \psi'(t)\big|_{t=0}$,   $D_{pp} f = p^T \nabla^2 f(x)\, p = \psi''(t)\big|_{t=0}$.
Extension of this technique to third, fourth, and higher derivatives is possible. Interpolation formulae analogous to (8.39) can be used in conjunction with higher derivatives of the univariate functions $\psi$ defined in (8.40), again for a suitably chosen set of vectors $p$, where each $p$ is made up of a sum of unit vectors $e_i$. For details, see Bischof, Corliss, and Griewank [26].
CALCULATING HESSIANS: REVERSE MODE
We can also devise schemes based on the reverse mode for calculating Hessian-vector products $\nabla^2 f(x)\, q$, or the full Hessian $\nabla^2 f(x)$. A scheme for obtaining $\nabla^2 f(x)\, q$ proceeds as follows. We start by using the forward mode to evaluate both $f$ and $\nabla f(x)^T q$, by accumulating the two variables $x_i$ and $D_q x_i$ during the forward sweep in the manner described above. We then apply the reverse mode in the normal fashion to the computed function $\nabla f(x)^T q$. At the end of the reverse sweep, the nodes $i = 1, 2, \dots, n$ of the computational graph that correspond to the independent variables will contain

$\dfrac{\partial}{\partial x_i} \left( \nabla f(x)^T q \right) = \left[ \nabla^2 f(x)\, q \right]_i$,   $i = 1, 2, \dots, n$.
The number of arithmetic operations required to obtain $\nabla^2 f(x)\, q$ by this procedure increases by only a modest factor, independent of $n$, over the evaluation of $f$ alone. By the usual analysis for the forward mode, we see that the computation of $f$ and $\nabla f(x)^T q$ jointly requires a small multiple of the operation count for $f$ alone, while the reverse sweep introduces a further factor of at most 5. The total increase factor is approximately 12 over the evaluation of $f$ alone. If the entire Hessian $\nabla^2 f(x)$ is required, we could apply the procedure just described with $q = e_1, e_2, \dots, e_n$. This approach would introduce an additional factor of $n$ into the operation count, leading to an increase of at most $12n$ over the cost of $f$ alone.
Once again, when the Hessian is sparse with known structure, we may be able to use graph-coloring techniques to evaluate this entire matrix using many fewer than $n$ seed vectors. The choices of $q$ are similar to those used for finite-difference evaluation of the Hessian, described above. The increase in operation count over evaluating $f$ alone is a multiple of up to $12 N_c(\nabla^2 f)$, where $N_c$ is the number of seed vectors $q$ used in calculating $\nabla^2 f$.
CURRENT LIMITATIONS
The current generation of automatic differentiation tools has proved its worth through successful application to some large and difficult design optimization problems. However, these tools can run into difficulties with some commonly used programming constructs and some implementations of computer arithmetic. As an example, if the evaluation of $f(x)$ depends on the solution of a partial differential equation (PDE), then the computed value of $f$ may contain truncation error arising from the finite-difference or the finite-element technique that is used to solve the PDE numerically. That is, we have $\hat{f}(x) = f(x) + \tau(x)$, where $\hat{f}$ is the computed value of $f$ and $\tau$ is the truncation error. Though $|\tau(x)|$ is usually small, its derivative $\nabla \tau(x)$ may not be, so the error in the computed derivative $\nabla \hat{f}(x)$ is potentially large. (The finite-difference approximation techniques discussed in Section 8.1 experience the same difficulty.) Similar problems arise when the computer uses piecewise rational functions to approximate trigonometric functions.
Another source of potential difficulty is the presence of branching in the code, introduced to improve the speed or accuracy of function evaluation in certain domains. A pathological example is provided by the linear function $f(x) = x - 1$. If we used the following perverse (but valid) piece of code to evaluate this function,

if (x == 1.0) then f = 0.0; else f = x - 1.0;

then by applying automatic differentiation to this procedure we would obtain the derivative value $\nabla f(1) = 0$. For a discussion of such issues and an approach to dealing with them, see Griewank [151, 152].
In conclusion, automatic differentiation should be regarded as a set of increasingly sophisticated techniques that enhances optimization algorithms, allowing them to be applied more widely to practical problems involving complicated functions. By providing sensitivity information, it helps the modeler to extract more information from the results of the computation. Automatic differentiation should not be regarded as a panacea that absolves the user altogether from the responsibility of thinking about derivative calculations.
NOTES AND REFERENCES
A comprehensive and authoritative reference on automatic differentiation is the book of Griewank [152]. The web site www.autodiff.org contains a wealth of current information about theory, software, and applications. A number of edited collections of papers on automatic differentiation have appeared since 1991; see Griewank and Corliss [153], Berz et al. [20], and Bücker et al. [40]. An historical paper of note is Corliss and Rall [78], which includes an extensive bibliography. Software tool development in automatic differentiation makes use not only of forward and reverse modes but also includes mixed modes and cross-country algorithms that combine the two approaches; see for example Naumann [222].
The field of automatic differentiation grew considerably during the 1990s, and a number of good software tools appeared. These included ADIFOR [25], ADIC [28], and ADOL-C [154]. Tools developed in more recent years include TAPENADE, which accepts Fortran code through a web server and returns differentiated code; TAF, a commercial tool that also performs source-to-source automatic differentiation of Fortran codes; OpenAD, which works with Fortran, C, and C++; and TOMLAB/MAD, which works with MATLAB code.
The technique for calculating the gradient of a partially separable function was described by Bischof et al. [24], whereas the computation of the Hessian matrix has been considered by several authors; see, for example, Gay [118].
The work of Coleman and Moré [69] on efficient estimation of Hessians was predated by Powell and Toint [261], who did not use the language of graph coloring but nevertheless devised highly effective schemes. Software for estimating sparse Hessians and Jacobians is described by Coleman, Garbow, and Moré [66, 67]. The recent paper of Gebremedhin, Manne, and Pothen [120] contains a comprehensive discussion of the application of graph coloring to both finite-difference and automatic differentiation techniques.
EXERCISES
8.1 Show that a suitable value for the perturbation $\epsilon$ in the central-difference formula is $\epsilon = \mathbf{u}^{1/3}$, and that the accuracy achievable by this formula when the values of $f$ contain roundoff errors of size $\mathbf{u}$ is approximately $\mathbf{u}^{2/3}$. (Use similar assumptions to the ones used to derive the estimate (8.6) for the forward-difference formula.)
8.2 Derive a central-difference analogue of the Hessian-vector approximation formula (8.20).
8.3 Verify the formula (8.21) for approximating an element of the Hessian using only function values.

8.4 Verify that if the Hessian of a function $f$ has nonzero diagonal elements, then its adjacency graph is a subgraph of the intersection graph for $\nabla f$. In other words, show that any arc in the adjacency graph also belongs to the intersection graph.

8.5 Draw the adjacency graph for the function $f$ defined by (8.22). Show that the coloring scheme in which node 1 has one color while nodes $2, 3, \dots, n$ have another color is valid. Draw the intersection graph for $\nabla f$.
8.6 Construct the adjacency graph for the function whose Hessian has the nonzero structure depicted in the text, and find a valid coloring scheme with just four colors.
8.7 Trace the computations performed in the forward mode for the function $f(x)$ in (8.26), expressing the intermediate derivatives $\nabla x_i$, $i = 4, 5, \dots, 9$, in terms of quantities available at their parent nodes and then in terms of the independent variables $x_1$, $x_2$, $x_3$.
8.8 Formula (8.30) showed the gradient operations associated with scalar division. Derive similar formulae for the following operations:

$(s, t) \to s + t$   (addition);
$t \to e^t$   (exponentiation);
$t \to \tan(t)$   (tangent);
$(s, t) \to s^t$   (power).
8.9 By calculating the partial derivatives $\partial x_j / \partial x_i$ for the function (8.26) from the expressions (8.27), verify the numerical values for the arcs in Figure 8.3 for the evaluation point $x = (1, 2, \pi/2)^T$. Work through the remaining details of the reverse sweep process, indicating the order in which the nodes become finalized.
8.10 Using (8.33) as a guide, describe the reverse sweep operations corresponding to the following elementary operations in the forward sweep:

$x_k = x_i x_j$   (multiplication);
$x_k = \cos(x_i)$   (cosine).
In each case, compare the arithmetic workload in the reverse sweep to the workload required for the forward sweep.
8.11 Define formulae similar to (8.37) for accumulating the first derivatives $D_p x_i$ and the second derivatives $D_{pq} x_i$ when $x_i$ is obtained from the following three binary operations: $x_i = x_j - x_k$, $x_i = x_j x_k$, and $x_i = x_j / x_k$.
8.12 By using the definitions (8.28) of $D_p x_i$ and (8.36) of $D_{pq} x_i$, verify the differentiation formulae (8.38) for the unitary operation $x_i = L(x_j)$.
8.13 Let $a \in \mathbb{R}^n$ be a fixed vector and define $f$ as $f(x) = \frac{1}{2} \left( x^T x \right) \left( a^T x \right)^2$. Count the number of operations needed to evaluate $f$, $\nabla f$, $\nabla^2 f$, and the Hessian-vector product $\nabla^2 f(x)\, p$ for an arbitrary vector $p$.
CHAPTER 9
Derivative-Free Optimization
Many practical applications require the optimization of functions whose derivatives are not available. Problems of this kind can be solved, in principle, by approximating the gradient (and possibly the Hessian) using finite differences (see Chapter 8), and using these approximate gradients within the algorithms described in earlier chapters. Even though this finite-difference approach is effective in some applications, it cannot be regarded as a general-purpose technique for derivative-free optimization, because the number of function evaluations required can be excessive and the approach can be unreliable in the presence of noise. (For the purposes of this chapter we define noise to be inaccuracy in the function evaluation.) Because of these shortcomings, various algorithms have been developed that
do not attempt to approximate the gradient. Rather, they use the function values at a set of sample points to determine a new iterate by some other means.
Derivative-free optimization (DFO) algorithms differ in the way they use the sampled function values to determine the new iterate. One class of methods constructs a linear or quadratic model of the objective function and defines the next iterate by seeking to minimize this model inside a trust region. We pay particular attention to these model-based approaches because they are related to the unconstrained minimization methods described in earlier chapters. Other widely used DFO methods include the simplex-reflection method of Nelder and Mead, pattern-search methods, conjugate-direction methods, and simulated annealing. In this chapter we briefly discuss these methods, with the exception of simulated annealing, which is a nondeterministic approach and has little in common with the other techniques discussed in this book.
Derivative-free optimization methods are not as well developed as gradient-based methods; current algorithms are effective only for small problems. Although most DFO methods have been adapted to handle simple types of constraints, such as bounds, the efficient treatment of general constraints is still the subject of investigation. Consequently, we limit our discussion to the unconstrained optimization problem

$\min_{x \in \mathbb{R}^n} f(x)$.   (9.1)
Problems in which derivatives are not available arise often in practice. The evaluation of f x can, for example, be the result of an experimental measurement or a stochastic simulation, with the underlying analytic form of f unknown. Even if the objective function
$f$ is known in analytic form, coding its derivatives may be time consuming or impractical. Automatic differentiation tools (Chapter 8) may not be applicable if $f(x)$ is provided only in the form of binary computer code. Even when the source code is available, these tools cannot be applied if the code is written in a combination of languages.
Methods for derivative-free optimization are often used, with mixed success, to minimize problems with nondifferentiable functions or to try to locate the global minimizer of a function. Since we do not treat nonsmooth optimization or global optimization in this book, we will restrict our attention to smooth problems in which $f$ has a continuous derivative. We do, however, discuss the effects of noise in Sections 9.1 and 9.6.
9.1 FINITE DIFFERENCES AND NOISE
As mentioned above, an obvious DFO approach is to estimate the gradient by using finite differences and then employ a gradient-based method. This approach is sometimes successful and should always be considered, but the finite-difference estimates can be inaccurate when the objective function contains noise. We quantify the effect of noise in this section.
Noise can arise in function evaluations for various reasons. If $f(x)$ depends on a stochastic simulation, there will be a random error in the evaluated function because of the
finite number of trials in the simulation. When a differential equation solver or some other complex numerical procedure is needed to calculate f , small but nonzero error tolerances that are used during the calculations will produce noise in the value of f .
In many applications, then, the objective function $f$ has the form

$f(x) = h(x) + \phi(x)$,   (9.2)

where $h$ is a smooth function and $\phi$ represents the noise. Note that we have written $\phi$ to be a function of $x$, but in practice it need not be. For instance, if the evaluation of $f$ depends on a simulation, the value of $\phi$ will generally differ at each evaluation, even at the same $x$. The form (9.2) is, however, useful for illustrating some of the difficulties caused by noise in gradient estimates and for developing algorithms for derivative-free optimization.
Given a difference interval $\epsilon$, recall that the centered finite-difference approximation (8.7) to the gradient of $f$ at $x$ is defined as follows:

$\nabla_\epsilon f(x) = \left[ \dfrac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon} \right]_{i = 1, 2, \dots, n}$,   (9.3)

where $e_i$ is the $i$th unit vector (the vector whose only nonzero element is a 1 in the $i$th position). We wish to relate $\nabla_\epsilon f(x)$ to the gradient of the underlying smooth function $h(x)$, as a function of $\epsilon$ and the noise level. For this purpose we define the noise level $\eta$ to be the largest value of $|\phi|$ in a box of edge length $2\epsilon$ centered at $x$, that is,

$\eta(x; \epsilon) \stackrel{\mathrm{def}}{=} \sup_{\|z - x\|_\infty \le \epsilon} |\phi(z)|$.   (9.4)
By applying to the central difference formula (9.3) the argument that led to (8.5), we can establish the following result.

Lemma 9.1.
Suppose that $\nabla^2 h$ is Lipschitz continuous in a neighborhood of the box $\{ z \mid \|z - x\|_\infty \le \epsilon \}$, with Lipschitz constant $L_h$. Then we have

$\left\| \nabla_\epsilon f(x) - \nabla h(x) \right\|_\infty \le L_h \epsilon^2 + \dfrac{\eta(x; \epsilon)}{\epsilon}$.   (9.5)
Thus the error in the approximation (9.3) comes from both the intrinsic finite-difference approximation error (the $O(\epsilon^2)$ term) and the noise (the $\eta(x; \epsilon)/\epsilon$ term). If the noise dominates the difference interval $\epsilon$, we cannot expect any accuracy at all in $\nabla_\epsilon f(x)$, so it will only be pure luck if $-\nabla_\epsilon f(x)$ turns out to be a direction of descent for $f$.
Instead of computing a tight cluster of function values around the current iterate, as required by a finite-difference approximation to the gradient, it may be preferable to separate these points more widely and use them to construct a model of the objective function. This
approach, which we consider in the next section and in Section 9.6, may be more robust to the presence of noise.
9.2 MODELBASED METHODS
Some of the most effective algorithms for unconstrained optimization described in the previous chapters compute steps by minimizing a quadratic model of the objective function $f$. The model is formed by using function and derivative information at the current iterate. When derivatives are not available, we may define the model $m_k$ as the quadratic function that interpolates $f$ at a set of appropriately chosen sample points. Since such a model is usually nonconvex, the model-based methods discussed in this chapter use a trust-region approach to compute the step.
Suppose that at the current iterate $x_k$ we have a set of sample points $Y = \{ y^1, y^2, \dots, y^q \}$, with $y^i \in \mathbb{R}^n$, $i = 1, 2, \dots, q$. We assume that $x_k$ is an element of this set and that no point in $Y$ has a lower function value than $x_k$. We wish to construct a quadratic model of the form

$m_k(x_k + p) = c + g^T p + \tfrac{1}{2} p^T G p$.   (9.6)

We cannot define $g = \nabla f(x_k)$ and $G = \nabla^2 f(x_k)$ because these derivatives are not available. Instead, we determine the scalar $c$, the vector $g \in \mathbb{R}^n$, and the symmetric matrix $G \in \mathbb{R}^{n \times n}$ by imposing the interpolation conditions

$m_k(y^l) = f(y^l)$,   $l = 1, 2, \dots, q$.   (9.7)

Since there are $\frac{1}{2}(n+1)(n+2)$ coefficients in the model (9.6) (that is, the components of $c$, $g$, and $G$, taking into account the symmetry of $G$), the interpolation conditions (9.7) determine $m_k$ uniquely only if

$q = \tfrac{1}{2}(n+1)(n+2)$.   (9.8)

In this case, (9.7) can be written as a square linear system of equations in the coefficients of the model. If we choose the interpolation points $y^1, y^2, \dots, y^q$ so that this linear system is nonsingular, the model $m_k$ will be uniquely determined.

Once $m_k$ has been formed, we compute a step $p$ by approximately solving the trust-region subproblem

$\min_p \; m_k(x_k + p)$,   subject to $\|p\|_2 \le \Delta$,   (9.9)

for some trust-region radius $\Delta > 0$. We can use one of the techniques described in Chapter 4 to solve this subproblem. If $x_k + p$ gives a sufficient reduction in the objective function, the new iterate is defined as $x_{k+1} = x_k + p$, the trust-region radius $\Delta$ is updated, and a new iteration commences. Otherwise the step is rejected, and the interpolation set $Y$ may be improved or the trust region shrunk.
To reduce the cost of the algorithm, we update the model $m_k$ at every iteration, rather than recomputing it from scratch. In practice, we choose a convenient basis for the space of quadratic polynomials, the most common choices being Lagrange and Newton polynomials. The properties of these bases can be used both to measure the appropriateness of the sample set $Y$ and to change this set if necessary. A complete algorithm that treats all these issues effectively is far more complicated than the quasi-Newton methods discussed in Chapter 6. Consequently, we will provide only a broad outline of model-based DFO methods.
As is common in trust-region algorithms, the step-acceptance and trust-region update strategies are based on the ratio between the actual reduction in the function and the reduction predicted by the model, that is,

$\rho = \dfrac{f(x_k) - f(x_k^+)}{m_k(x_k) - m_k(x_k^+)}$,   (9.10)

where $x_k^+$ denotes the trial point. Throughout this section, the integer $q$ is defined by (9.8).
Algorithm 9.1 (Model-Based Derivative-Free Method).
Choose an interpolation set $Y = \{ y^1, y^2, \dots, y^q \}$ such that the linear system defined by (9.7) is nonsingular, and select $x_0$ as a point in this set such that $f(x_0) \le f(y^i)$ for all $y^i \in Y$. Choose an initial trust-region radius $\Delta_0$, a constant $\eta \in (0, 1)$, and set $k \leftarrow 0$.
repeat until a convergence test is satisfied:
  Form the quadratic model $m_k(x_k + p)$ that satisfies the interpolation conditions (9.7);
  Compute a step $p$ by approximately solving subproblem (9.9);
  Define the trial point as $x_k^+ = x_k + p$;
  Compute the ratio $\rho$ defined by (9.10);
  if $\rho \ge \eta$
    Replace an element of $Y$ by $x_k^+$;
    Choose $\Delta_{k+1} \ge \Delta_k$;
    Set $x_{k+1} \leftarrow x_k^+$;
    Set $k \leftarrow k + 1$ and go to the next iteration;
  else if the set $Y$ need not be improved
    Choose $\Delta_{k+1} < \Delta_k$;
    Set $x_{k+1} \leftarrow x_k$;
    Set $k \leftarrow k + 1$ and go to the next iteration;
  end if
  Invoke a geometry-improving procedure to update $Y$:
    at least one of the points in $Y$ is replaced by some other point, with the goal of improving the conditioning of (9.7);
  Set $\Delta_{k+1} \leftarrow \Delta_k$;
  Choose $\hat{x}$ as an element in $Y$ with lowest function value;
  Set $x_k^+ \leftarrow \hat{x}$ and recompute $\rho$ by (9.10);
  if $\rho \ge \eta$
    Set $x_{k+1} \leftarrow x_k^+$;
  else
    Set $x_{k+1} \leftarrow x_k$;
  end if
  Set $k \leftarrow k + 1$;
end repeat
The case of $\rho \ge \eta$, in which we obtain sufficient reduction in the merit function, is the simplest. In this case we always accept the trial point $x_k^+$ as the new iterate, include $x_k^+$ in $Y$, and remove an element from $Y$.
When sufficient reduction is not achieved ($\rho < \eta$), we look at two possible causes: inadequacy of the interpolation set $Y$ and a trust region that is too large. The first cause can arise when the iterates become restricted to a low-dimensional surface of $\mathbb{R}^n$ that does not contain the solution. The algorithm could then be converging to a minimizer in this subset. Behavior such as this can be detected by monitoring the conditioning of the linear system defined by the interpolation conditions (9.7). If the condition number is too high, we change $Y$ to improve it, typically by replacing one element of $Y$ with a new element so as to move the interpolation system (9.7) as far away from singularity as possible. If $Y$ seems adequate, we simply decrease the trust-region radius $\Delta$, as is done in the methods of Chapter 4.
A good initial choice for $Y$ is given by the vertices and the midpoints of the edges of a simplex in $\mathbb{R}^n$.
The use of quadratic models limits the size of problems that can be solved in practice. Performing $O(n^2)$ function evaluations just to start the algorithm is onerous, even for moderate values of $n$ (say, $n = 50$). In addition, the cost of the iteration is high. Even by updating the model $m_k$ at every iteration, rather than recomputing it from scratch, the number of operations required to construct $m_k$ and compute a step is $O(n^4)$ [257].
To alleviate these drawbacks, we can replace the quadratic model by a linear model in which the matrix $G$ in (9.6) is set to zero. Since such a model contains only $n + 1$ parameters, we need to retain only $n + 1$ interpolation points in the set $Y$, and the cost of each iteration is $O(n^3)$. Algorithm 9.1 can be applied with little modification when the model is linear, but it is not rapidly convergent because linear models cannot represent curvature of the problem. Therefore, some model-based algorithms start with $n + 1$ initial points and compute steps using a linear model, but after $q = \frac{1}{2}(n+1)(n+2)$ function values become available, they switch to using quadratic models.
INTERPOLATION AND POLYNOMIAL BASES
We now consider in more detail how to form a model of the objective function using interpolation techniques. We begin by considering a linear model of the form

$m_k(x_k + p) = f(x_k) + g^T p$.   (9.11)

To determine the vector $g \in \mathbb{R}^n$, we impose the interpolation conditions $m_k(y^l) = f(y^l)$, $l = 1, 2, \dots, n$, which can be written as

$(s^l)^T g = f(y^l) - f(x_k)$,   $l = 1, 2, \dots, n$,   (9.12)

where

$s^l = y^l - x_k$,   $l = 1, 2, \dots, n$.   (9.13)

Conditions (9.12) represent a linear system of equations in which the rows of the coefficient matrix are given by the vectors $(s^l)^T$. It follows that the model (9.11) is determined uniquely by (9.12) if and only if the interpolation points $y^1, y^2, \dots, y^n$ are such that the set $\{ s^l : l = 1, 2, \dots, n \}$ is linearly independent. If this condition holds, the simplex formed by the points $x_k, y^1, y^2, \dots, y^n$ is said to be nondegenerate.
Let us now consider how to construct a quadratic model of the form (9.6), with $c = f(x_k)$. We rewrite the model as

$m_k(x_k + p) = f(x_k) + g^T p + \sum_{i < j} G_{ij} p_i p_j + \tfrac{1}{2} \sum_i G_{ii} p_i^2$   (9.14)
$\qquad\qquad\; = f(x_k) + \hat{g}^T \hat{p}$,   (9.15)

where we have collected the elements of $g$ and $G$ in the $(q-1)$-vector of unknowns

$\hat{g} = \left( g^T, \; \{ G_{ij} \}_{i < j}, \; \{ \tfrac{1}{2} G_{ii} \} \right)^T$,

and where the $(q-1)$-vector $\hat{p}$ is given by

$\hat{p} = \left( p^T, \; \{ p_i p_j \}_{i < j}, \; \{ p_i^2 \} \right)^T$.   (9.16)

The model (9.15) has the same form as (9.11), and the determination of the vector of unknown coefficients $\hat{g}$ can be done as in the linear case.
Multivariate quadratic functions can be represented in various ways. The monomial basis (9.14) has the advantage that known structure in the Hessian can be imposed easily by setting appropriate elements in $G$ to zero. Other bases are, however, more convenient when one is developing mechanisms for avoiding singularity of the system (9.7).
We denote by $\{ \phi_i \}_{i=1}^{q}$ a basis for the linear space of $n$-dimensional quadratic functions. The function (9.6) can therefore be expressed as

$m_k(x) = \sum_{i=1}^{q} \lambda_i \phi_i(x)$,

for some coefficients $\lambda_i$. The interpolation set $Y = \{ y^1, y^2, \dots, y^q \}$ determines the coefficients $\lambda_i$ uniquely if the determinant

$\delta(Y) \stackrel{\mathrm{def}}{=} \det \begin{bmatrix} \phi_1(y^1) & \cdots & \phi_1(y^q) \\ \vdots & & \vdots \\ \phi_q(y^1) & \cdots & \phi_q(y^q) \end{bmatrix}$   (9.17)

is nonzero.

As model-based algorithms iterate, the determinant $\delta(Y)$ may approach zero, leading to numerical difficulties or even failure. Several algorithms therefore contain a mechanism for keeping the interpolation points well placed. We now describe one of those mechanisms.

UPDATING THE INTERPOLATION SET

Rather than waiting until the determinant $\delta(Y)$ becomes smaller than a threshold, we may invoke a geometry-improving procedure whenever a trial point does not provide sufficient decrease in $f$. The goal in this case is to replace one of the interpolation points so that the determinant (9.17) increases in magnitude. To guide us in this exchange, we use the following property of $\delta(Y)$, which we state in terms of Lagrange functions.

For every $y \in Y$, we define the Lagrange function $L(\cdot, y)$ to be a polynomial of degree at most 2 such that $L(y, y) = 1$ and $L(\hat{y}, y) = 0$ for $\hat{y} \neq y$, $\hat{y} \in Y$. Suppose that the set $Y$ is updated by removing a point $y_-$ and replacing it by some other point $y_+$, to give the new set $Y^+$. One can show that, after a suitable normalization and given certain conditions [256],

$\delta(Y^+) = |L(y_+, y_-)| \, \delta(Y)$.   (9.18)

Algorithm 9.1 can make good use of this relation to update the interpolation set. Consider first the case in which the trial point $x^+$ provides sufficient reduction in the objective function ($\rho \ge \eta$). We include $x^+$ in $Y$ and remove another point $y_-$ from $Y$.
Motivated by (9.18), we select the outgoing point as follows:

$y_- = \arg\max_{y \in Y} |L(x^+, y)|$.
Next, let us consider the case in which the reduction in $f$ is not sufficient ($\rho < \eta$). We first determine whether the set $Y$ should be improved, and for this purpose we use the following rule. We consider $Y$ to be adequate at the current iterate $x_k$ if, for all $y^i \in Y$ such that $\|x_k - y^i\| \le \Delta$, the determinant $\delta(Y)$ cannot be doubled by replacing one of these interpolation points $y^i$ with any point $y$ inside the trust region. If $Y$ is adequate but the reduction in $f$ was not sufficient, we decrease the trust-region radius and begin a new iteration.
If $Y$ is inadequate, the geometry-improving mechanism is invoked. We choose a point $y_- \in Y$ and replace it by some other point $y_+$ that is chosen solely with the objective of improving the determinant (9.17). For every point $y^i \in Y$, we define its potential replacement $y_r^i$ as

$y_r^i = \arg\max_{\|y - x_k\| \le \Delta} |L(y, y^i)|$.

The outgoing point $y_-$ is selected as the point for which $|L(y_r^i, y^i)|$ is maximized over all points $y^i \in Y$.
Implementing these rules efficiently in practice is not simple, and one must also consider several possible difficulties we have not discussed; see [76]. Strategies for improving the position of the interpolation set are the subject of ongoing investigation, and new developments are likely in the coming years.
A METHOD BASED ON MINIMUMCHANGE UPDATING
We now consider a method that can be viewed as an extension of the quasi-Newton approach discussed in Chapter 6. The method uses quadratic models but requires only $O(n^3)$ operations per iteration, substantially fewer than the $O(n^4)$ operations required by the methods described above. To achieve this economy, the method retains only $O(n)$ points for the interpolation conditions (9.7) and absorbs the remaining degrees of freedom in the model (9.6) by requiring that the Hessian of the model change as little as possible from one iteration to the next. This least-change property is one of the key ingredients in quasi-Newton methods, the other ingredient being the requirement that the model interpolate the gradient $\nabla f$ at the two most recent points. The method we describe now combines the least-change property with interpolation of function values.
At the $k$th iteration of the algorithm, a new quadratic model $m_{k+1}$ of the form (9.6) is constructed after taking a step from $x_k$ to $x_{k+1}$. The coefficients $f_{k+1}$, $g_{k+1}$, $G_{k+1}$ of the model $m_{k+1}$ are determined as the solution of the problem

$\min_{f, g, G} \; \|G - G_k\|_F^2$   (9.19a)
subject to $G$ symmetric,
$m(y^l) = f(y^l)$,   $l = 1, 2, \dots, q$,   (9.19b)

where $\|\cdot\|_F$ denotes the Frobenius norm (see (A.9)), $G_k$ is the Hessian of the previous model $m_k$, and $q$ is an integer comparable to $n$. One can show that the integer $q$ must be chosen larger than $n + 1$ to guarantee that $G_{k+1}$ is not equal to $G_k$. An appropriate value in practice is $q = 2n + 1$; for this choice the number of interpolation points is roughly twice that used for linear models.
Problem (9.19) is an equality-constrained quadratic program whose KKT conditions can be expressed as a system of equations. Once the model m_{k+1} is determined, we compute a new step by solving a trust-region problem of the form (9.9). In this approach, too, it is necessary to ensure that the geometry of the interpolation set Y is adequate. We therefore impose two minimum requirements. First, the set Y should be such that the equations (9.19b) can be satisfied for any right-hand side. Second, the points y^i should not all lie in a hyperplane. If these two conditions hold, problem (9.19) has a unique solution.
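The following sketch shows one way to form and solve this KKT system with numpy; it is our own illustration, not code from the text (the variable names, the lower-triangle parameterization of G, and the assumption that the interpolation set is poised with q > n + 1, so that the KKT matrix is nonsingular, are all ours).

    import numpy as np

    def min_change_quadratic(Y, fvals, G_prev):
        # Y: q x n array of interpolation points; fvals: the q values f(y^l);
        # G_prev: the previous model Hessian G_k (symmetric n x n).
        q, n = Y.shape
        idx = [(i, j) for i in range(n) for j in range(i + 1)]  # lower triangle of G
        ns = len(idx)                                           # n(n+1)/2 unknowns in G
        # Interpolation rows: m(y) = c + g'y + 0.5 y'Gy must equal f(y), cf. (9.19b).
        A = np.zeros((q, 1 + n + ns))
        for l in range(q):
            y = Y[l]
            A[l, 0] = 1.0
            A[l, 1:1 + n] = y
            for a, (i, j) in enumerate(idx):
                A[l, 1 + n + a] = 0.5 * y[i]**2 if i == j else y[i] * y[j]
        # ||G - G_k||_F^2 is a weighted sum of squares over the triangle:
        # weight 1 on diagonal entries, 2 on off-diagonal entries; zero weight on (c, g).
        w = np.zeros(1 + n + ns)
        s_prev = np.zeros(1 + n + ns)
        for a, (i, j) in enumerate(idx):
            w[1 + n + a] = 1.0 if i == j else 2.0
            s_prev[1 + n + a] = G_prev[i, j]
        W = np.diag(w)
        # KKT system for: min (s - s_prev)' W (s - s_prev)  subject to  A s = fvals.
        K = np.block([[2 * W, A.T], [A, np.zeros((q, q))]])
        rhs = np.concatenate([2 * W @ s_prev, fvals])
        s = np.linalg.solve(K, rhs)[:1 + n + ns]
        c, g = s[0], s[1:1 + n]
        G = np.zeros((n, n))
        for a, (i, j) in enumerate(idx):
            G[i, j] = G[j, i] = s[1 + n + a]
        return c, g, G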
A practical algorithm based on the subproblem (9.19) resembles Algorithm 9.1 in that it contains procedures both for generating new iterates and for improving the geometry of the set Y. The implementation described in [260] contains other features to ensure that the interpolation points are well separated and that steps are not too small. A strength of this method is that it requires only O(n) interpolation points to start producing productive steps. In practice the method often approaches a solution with fewer than (1/2)(n+1)(n+2) function evaluations. However, since this approach has been developed only recently, there is insufficient numerical experience to assess its full potential.
9.3 COORDINATE AND PATTERN-SEARCH METHODS
Rather than constructing a model of f explicitly based on function values, coordinate-search and pattern-search methods look along certain specified directions from the current iterate for a point with a lower function value. If such a point is found, they step to it and repeat the process, possibly modifying the directions of search for the next iteration. If no satisfactory new point is found, the step length along the current search directions may be adjusted, or new search directions may be generated.
We describe first a simple approach of this type that has been used often in practice. We then consider a generalized approach that is potentially more efficient and has stronger theoretical properties.
COORDINATE SEARCH METHOD
The coordinate search method (also known as the coordinate descent method or the alternating variables method) cycles through the n coordinate directions e₁, e₂, ..., e_n, obtaining new iterates by performing a line search along each direction in turn. Specifically, at the first iteration, we fix all components of x except the first one, x₁, and find a new value of this component that minimizes (or at least reduces) the objective function. On the next iteration, we repeat the process with the second component x₂, and so on. After n iterations, we return to the first variable and repeat the cycle. Though simple and somewhat intuitive, this method can be quite inefficient in practice, as we illustrate in Figure 9.1 for a quadratic function in two variables. Note that after a few iterations, neither the vertical (x₂) nor the horizontal (x₁) move makes much progress toward the solution.
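As a concrete illustration, here is a minimal coordinate-search loop in Python (our sketch, not code from the book); the one-dimensional minimization along each coordinate is delegated to scipy's derivative-free scalar minimizer.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_search(f, x0, n_cycles=100):
        # Cycle through the coordinate directions e_1, ..., e_n, performing
        # an (essentially exact) line search along each one in turn.
        x = np.asarray(x0, dtype=float).copy()
        n = x.size
        for _ in range(n_cycles):
            for i in range(n):
                phi = lambda t: f(np.concatenate([x[:i], [t], x[i+1:]]))
                x[i] = minimize_scalar(phi).x   # minimize over the i-th component
        return x

    # Example: a convex quadratic in two variables with coupled components.
    f = lambda x: x[0]**2 + 10*x[1]**2 + 4*x[0]*x[1]
    print(coordinate_search(f, [3.0, 1.0]))     # approaches the solution (0, 0)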
In general, the coordinate search method can iterate infinitely without ever approaching a point where the gradient of the objective function vanishes, even when exact line searches are used. By contrast, as we showed in Section 3.2, the steepest descent method produces a sequence of iterates x_k for which ∇f_k → 0, under reasonable assumptions. In fact, a cyclic search along any set of linearly independent directions does not guarantee global convergence [243]. Technically speaking, this difficulty arises because the steepest descent search direction −∇f_k may become more and more perpendicular to the coordinate search direction. In such circumstances, the Zoutendijk condition (3.14) is satisfied because cos θ_k approaches zero rapidly, even when ∇f_k does not approach zero.
When the coordinate search method does converge to a solution, it often converges much more slowly than the steepest descent method, and the difference between the two approaches tends to increase with the number of variables. However, coordinate search may still be useful because it does not require calculation of the gradient ∇f_k, and the speed of convergence can be quite acceptable if the variables are loosely coupled in the objective function f.

Figure 9.1 Coordinate search method makes slow progress on this function of two variables.
Many variants of the coordinate search method have been proposed, some of which allow a global convergence property to be proved. One simple variant is a back-and-forth approach in which we search along the sequence of directions

    e₁, e₂, ..., e_{n−1}, e_n, e_{n−1}, ..., e₂, e₁, e₂, ...  (repeats).
Another approach, suggested by Figure 9.1, is first to perform a sequence of coordinate descent steps and then search along the line joining the first and last points in the cycle. Several algorithms, such as that of Hooke and Jeeves, are based on these ideas; see Fletcher [101] and Gill, Murray, and Wright [130].
The pattern-search approach, described next, generalizes coordinate search in that it allows the use of a richer set of search directions at each iteration.
PATTERN-SEARCH METHODS
We consider pattern-search methods that choose a certain set of search directions at each iterate and evaluate f at a given step length along each of these directions. These candidate points form a frame, or stencil, around the current iterate. If a point with a significantly lower function value is found, it is adopted as the new iterate, and the center of the frame is shifted to this new point. Whether shifted or not, the frame may then be altered in some way (the set of search directions may be changed, or the step length may grow or shrink), and the process repeats. For certain methods of this type it is possible to prove global convergence results: typically, that there exists a stationary accumulation point.
The presence of noise or other forms of inexactness in the function values may affect the performance of patternsearch algorithms and certainly impacts the convergence theory. Nonsmoothness may also cause undesirable behavior, as can be shown by simple examples, although satisfactory convergence is often observed on nonsmooth problems.
To define pattern-search methods, we introduce some notation. For the current iterate x_k, we define D_k to be the set of possible search directions and γ_k to be the line search parameter. The frame consists of the points x_k + γ_k p_k, for all p_k ∈ D_k. When one of the points in the frame yields a significant decrease in f, we take the step and may also increase γ_k, so as to expand the frame for the next iteration. If none of the points in the frame has a significantly better function value than f_k, we reduce γ_k (contract the frame), set x_{k+1} = x_k, and repeat. In either case, we may change the direction set D_k prior to the next iteration, subject to certain restrictions.
A more precise description of the algorithm follows.
Algorithm 9.2 (Pattern-Search).
Given convergence tolerance γ_tol, contraction parameter θ_max,
    sufficient decrease function ρ: [0, ∞) → IR with ρ(t) an increasing
    function of t and ρ(t)/t → 0 as t ↓ 0;
Choose initial point x₀, initial step length γ₀ > γ_tol, initial direction set D₀;
for k = 1, 2, ...
    if γ_k ≤ γ_tol
        stop;
    if f(x_k + γ_k p_k) < f(x_k) − ρ(γ_k) for some p_k ∈ D_k
        Set x_{k+1} ← x_k + γ_k p_k for some such p_k;
        Set γ_{k+1} ← φ_k γ_k for some φ_k ≥ 1; (∗ increase step length ∗)
    else
        Set x_{k+1} ← x_k;
        Set γ_{k+1} ← θ_k γ_k, where 0 < θ_k ≤ θ_max < 1;
    end (if)
end (for)
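A bare-bones realization of Algorithm 9.2 in Python might look as follows. This is our own sketch under our own parameter choices (ρ(t) = t^{3/2}, expansion factor 2, contraction factor 1/2, and the coordinate direction set), not a production implementation.

    import numpy as np

    def pattern_search(f, x0, gamma0=1.0, gamma_tol=1e-6, max_iter=10000):
        x = np.asarray(x0, dtype=float)
        n = x.size
        # Coordinate direction set (9.23): {e_1, ..., e_n, -e_1, ..., -e_n}.
        D = np.vstack([np.eye(n), -np.eye(n)])
        gamma = gamma0
        rho = lambda t: t**1.5           # sufficient decrease function
        fx = f(x)
        for _ in range(max_iter):
            if gamma <= gamma_tol:
                break
            for p in D:                  # evaluate frame points one at a time,
                trial = f(x + gamma * p) # accepting the first sufficient decrease
                if trial < fx - rho(gamma):
                    x, fx = x + gamma * p, trial
                    gamma *= 2.0         # expand the frame
                    break
            else:
                gamma *= 0.5             # no acceptable point: contract the frame
        return x

Note that, as remarked below, accepting the first candidate that satisfies the sufficient decrease condition (rather than the best point in the frame) is permitted by the algorithm and saves function evaluations.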
A wise choice of the direction set D_k is crucial to the practical behavior of this approach and to the theoretical results that can be proved about it. A key condition is that at least one direction in this set should give a direction of descent for f whenever ∇f(x_k) ≠ 0, that is, whenever x_k is not a stationary point. To make this condition specific, we refer to formula (3.12), where we defined the angle between a possible search direction p and the gradient ∇f_k as follows:
    cos θ = −∇f_kᵀ p / ( ||∇f_k|| ||p|| ).   (9.20)
Recall from Theorem 3.2 that global convergence of a line-search method to a stationary point of f could be ensured if the search direction p at each iterate x_k satisfied cos θ ≥ δ, for some constant δ > 0, and if the line search parameter satisfied certain conditions. In the same spirit, we choose D_k so that at least one direction p ∈ D_k will yield cos θ ≥ δ, regardless of the value of ∇f_k. This condition is as follows:

    κ(D_k) := min_{v ∈ IRⁿ} max_{p ∈ D_k} vᵀp / ( ||v|| ||p|| ) ≥ δ.   (9.21)
A second condition on D_k is that the lengths of the vectors in this set are all roughly similar, so that the diameter of the frame formed by this set is captured adequately by the step length parameter γ_k. Thus, we impose the condition

    β_min ≤ ||p|| ≤ β_max,  for all p ∈ D_k,   (9.22)

for some positive constants β_min and β_max and all k. If the conditions (9.21) and (9.22) hold,
we have for any k that

    −∇f_kᵀ p ≥ κ(D_k) ||∇f_k|| ||p|| ≥ δ β_min ||∇f_k||,  for some p ∈ D_k.
Examples of sets D_k that satisfy the properties (9.21) and (9.22) include the coordinate direction set

    {e₁, e₂, ..., e_n, −e₁, −e₂, ..., −e_n},   (9.23)

and the set of n + 1 vectors defined by

    p_i = (1/(2n)) e − e_i,  i = 1, 2, ..., n;   p_{n+1} = (1/(2n)) e,   (9.24)

where e = (1, 1, ..., 1)ᵀ. For n = 3 these direction sets are sketched in Figure 9.2.
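For concreteness, both sets can be generated with a few lines of numpy (our illustration, not code from the text); printing the vector lengths of the simplex set shows that they are indeed roughly similar, in the spirit of (9.22).

    import numpy as np

    n = 3
    e = np.ones(n)
    coord_set = np.vstack([np.eye(n), -np.eye(n)])      # the 2n vectors of (9.23)
    simplex_set = np.vstack([e / (2 * n) - np.eye(n),   # p_i = e/(2n) - e_i
                             e / (2 * n)])              # p_{n+1} = e/(2n)
    print(np.linalg.norm(simplex_set, axis=1))          # roughly similar lengths, cf. (9.22)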
The coordinate descent method described above is similar to the special case of Algorithm 9.2 obtained by setting D_k = {e_i, −e_i} for some i ∈ {1, 2, ..., n} at each iteration. Note that for this choice of D_k, we have κ(D_k) = 0 for all k. Hence, as noted above, cos θ can be arbitrarily close to zero at each iteration.

Often, the directions that satisfy the properties (9.21) and (9.22) form only a subset of the direction set D_k, which may contain other directions as well. These additional directions could be chosen heuristically, according to some knowledge of the function f and its scaling, or according to experience on previous iterations. They could also be chosen as linear combinations of the core set of directions (the ones that ensure κ(D_k) ≥ δ).

Note that Algorithm 9.2 does not require us to choose the point x_k + γ_k p_k, p_k ∈ D_k, with the smallest objective value. Indeed, we may save on function evaluations by not evaluating f at all points in the frame, but rather performing the evaluations one at a time and accepting the first candidate point that satisfies the sufficient decrease condition.
Figure 9.2 Generating search sets in IR³: coordinate direction set (left) and simplex set (right).
Another important detail in the implementation of Algorithm 9.2 is the choice of the sufficient decrease function ρ(t). If ρ is chosen to be identically zero, then any candidate point that produces a decrease in f is acceptable as a new iterate. As we have seen in Chapter 3, such a weak condition does not lead to strong global convergence results in general. A more appropriate choice might be ρ(t) = M t^{3/2}, where M is some positive constant.
9.4 A CONJUGATE-DIRECTION METHOD
We have seen in Chapter 5 that the minimizer of a strictly convex quadratic function

    f(x) = (1/2) xᵀ A x + bᵀ x   (9.25)

can be located by performing one-dimensional minimizations along a set of n conjugate directions.
can be located by performing onedimensional minimizations along a set of n conjugate directions. These directions were defined in Chapter 5 as a linear combination of gradients. In this section, we show how to construct conjugate directions using only function values, and we therefore devise an algorithm for minimizing 9.25 that requires only function value calculations. Naturally, we also consider an extension of this approach to the case of a nonlinear objective f .
We use the parallel subspace property, which we describe first for the case n = 2. Consider two parallel lines l₁(α) = x₁ + αp and l₂(α) = x₂ + αp, where x₁, x₂, and p are given vectors in IR² and α is the scalar parameter that defines the lines. We show below that if x₁* and x₂* denote the minimizers of f(x) along l₁ and l₂, respectively, then x₁* − x₂* is conjugate to p. Hence, if we perform a one-dimensional minimization along the line joining x₁* and x₂*, we will reach the minimizer of f, because we have successively minimized along the two conjugate directions p and x₂* − x₁*. This process is illustrated in Figure 9.3.
This observation suggests the following algorithm for minimizing a two-dimensional quadratic function f. We choose a set of linearly independent directions, say the coordinate directions e₁ and e₂. From any initial point x₀, we first minimize f along e₂ to obtain the point x₁. We then perform successive minimizations along e₁ and e₂, starting from x₁, to obtain the point z. It follows from the parallel subspace property that z − x₁ is conjugate to e₂, because x₁ and z are minimizers along two lines parallel to e₂. Thus, if we perform a one-dimensional search from x₁ along the direction z − x₁, we will locate the minimizer of f.
We now state the parallel subspace minimization property in its most general form. Suppose that x₁, x₂ are two distinct points in IRⁿ and that {p₁, p₂, ..., p_l} is a set of linearly independent directions in IRⁿ. Let us define the two parallel linear varieties

    S₁ = { x₁ + Σ_{i=1}^{l} α_i p_i | α_i ∈ IR, i = 1, 2, ..., l },
    S₂ = { x₂ + Σ_{i=1}^{l} α_i p_i | α_i ∈ IR, i = 1, 2, ..., l }.
Figure 9.3 Geometric construction of conjugate directions. The minimizer of f is denoted by x*.
If we denote the minimizers of f on S₁ and S₂ by x₁* and x₂*, respectively, then x₂* − x₁* is conjugate to p₁, p₂, ..., p_l. It is easy to verify this claim. By the minimization property, we have that

    (∂/∂α_i) f( x₁* + Σ_i α_i p_i ) |_{α=0} = ∇f(x₁*)ᵀ p_i = 0,  i = 1, 2, ..., l,

and similarly for x₂*. Therefore we have from (9.25) that

    0 = [∇f(x₁*) − ∇f(x₂*)]ᵀ p_i
      = [(A x₁* + b) − (A x₂* + b)]ᵀ p_i
      = (x₁* − x₂*)ᵀ A p_i,  i = 1, 2, ..., l.   (9.26)
We now consider the case n = 3 and show how the parallel subspace property can be used to generate a set of three conjugate directions. We choose a set of linearly independent directions, say e₁, e₂, e₃. From any starting point x₀ we first minimize f along the last direction e₃ to obtain a point x₁. We then perform three successive one-dimensional minimizations, starting from x₁, along the directions e₁, e₂, e₃, and denote the resulting point by z. Next, we minimize f along the direction p₁ = z − x₁ to obtain x₂. As noted earlier, p₁ = z − x₁ is conjugate to e₃. We note also that x₂ is the minimizer of f on the set S₁ = { y + α₁ e₃ + α₂ p₁ | α₁ ∈ IR, α₂ ∈ IR }, where y is the intermediate point obtained after minimizing along e₁ and e₂.
A new iteration now commences. We discard e₁ and define the new set of search directions as e₂, e₃, p₁. We perform one-dimensional minimizations along e₂, e₃, p₁, starting from x₂, to obtain the point z̄. Note that z̄ can be viewed as the minimizer of f on the set S₂ = { ȳ + α₁ e₃ + α₂ p₁ | α₁ ∈ IR, α₂ ∈ IR }, for some intermediate point ȳ. Therefore, by applying the parallel subspace minimization property to the sets S₁ and S₂ just defined, we have that p₂ = z̄ − x₂ is conjugate to both e₃ and p₁. We then minimize f along p₂ to obtain a point x₃, which is the minimizer of f. This procedure thus generates the conjugate directions e₃, p₁, p₂.
We can now state the general algorithm, which consists of an inner and an outer iteration. In the inner iteration, n onedimensional minimizations are performed along a set of linearly independent directions. Upon completion of the inner iteration, a new conjugate direction is generated, which replaces one of the previously stored search directions.
Algorithm 9.3 (DFO Method of Conjugate Directions).
Choose an initial point x₀ and set p_i ← e_i, for i = 1, 2, ..., n;
Compute x₁ as the minimizer of f along the line x₀ + α p_n;
Set k ← 1;
repeat until a convergence test is satisfied
    Set z₁ ← x_k;
    for j = 1, 2, ..., n
        Calculate α_j so that f(z_j + α_j p_j) is minimized;
        Set z_{j+1} ← z_j + α_j p_j;
    end (for)
    Set p_j ← p_{j+1} for j = 1, 2, ..., n − 1, and p_n ← z_{n+1} − z₁;
    Calculate α_n so that f(z_{n+1} + α_n p_n) is minimized;
    Set x_{k+1} ← z_{n+1} + α_n p_n;
    Set k ← k + 1;
end (repeat)
The line searches can be performed by quadratic interpolation using three function values along each search direction. Since the restriction of (9.25) to a line is a strictly convex quadratic, the interpolating quadratic matches it exactly, and the one-dimensional minimizer can easily be computed. Note that at the end of the outer iteration k, the directions p_{n−k}, p_{n−k+1}, ..., p_n are conjugate, by the property mentioned above. Thus the algorithm terminates at the minimizer of (9.25) after n − 1 iterations, provided none of the conjugate directions is zero. Unfortunately, this possibility cannot be ruled out, and some safeguards described below must be incorporated to improve robustness. In the usual case that Algorithm 9.3 terminates after n − 1 iterations, it will perform O(n²) function evaluations.
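The three-point interpolation just described can be coded in a few lines (our sketch; the sample points 0, 1, 2 along the ray are an arbitrary choice, and the guard against a nonconvex interpolant is ours). Compare Exercise 9.6.

    import numpy as np

    def quad_line_search(f, z, p):
        # Fit the 1-D restriction q(a) = f(z + a p), a strictly convex quadratic
        # when f has the form (9.25), through the samples a = 0, 1, 2, and
        # return the exact minimizer of the interpolant.
        f0, f1, f2 = f(z), f(z + p), f(z + 2.0 * p)
        denom = f0 - 2.0 * f1 + f2      # equals twice the quadratic coefficient
        if denom <= 0:                  # interpolant not strictly convex: give up
            return z
        a_star = 0.5 + (f0 - f1) / denom
        return z + a_star * p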
Algorithm 9.3 can be extended to minimize nonquadratic objective functions. The only change is in the line search, which must be performed approximately, using interpolation. Because of the possible nonconvexity, this one-dimensional search must be done with care; see Brent [39] for a treatment of this subject. Numerical experience indicates that this extension of Algorithm 9.3 performs adequately for small-dimensional problems, but that sometimes the directions p_i tend to become linearly dependent. Several modifications of the algorithm have been proposed to guard against this possibility. One such modification measures the degree to which the directions p_i are conjugate. To do so, we define the scaled directions

    p̂_i = p_i / (p_iᵀ A p_i)^{1/2},  i = 1, 2, ..., n.   (9.27)
One can show [239] that the quantity

    |det( p̂₁, p̂₂, ..., p̂_n )|   (9.28)

is maximized if and only if the vectors p_i are conjugate with respect to A. This result suggests that we should not replace one of the existing search directions in the set {p₁, p₂, ..., p_n} by the most recently generated conjugate direction if this action causes the quantity (9.28) to decrease.
Procedure 9.4 implements this strategy for the case of the quadratic objective function (9.25). Some algebraic manipulations (which we do not present here) show that we can compute the scaled directions p̂_i without using the Hessian A, because the terms p_iᵀ A p_i are available from the line search along p_i. Further, only comparisons using computed function values are needed to ensure that (9.28) does not decrease. The following procedure is invoked immediately after the execution of the inner iteration (for-loop) of Algorithm 9.3.
Procedure 9.4 (Updating of the Set of Directions).
Find the integer m ∈ {1, 2, ..., n} such that Δ_m = f(z_m) − f(z_{m+1}) is maximized;
Let f₁ = f(z₁), f₂ = f(z_{n+1}), and f₃ = f(2 z_{n+1} − z₁);
if f₃ ≥ f₁ or (f₁ − 2f₂ + f₃)(f₁ − f₂ − Δ_m)² ≥ (1/2) Δ_m (f₁ − f₃)²
    Keep the set {p₁, p₂, ..., p_n} unchanged and set x_{k+1} ← z_{n+1};
else
    Set p̂ ← z_{n+1} − z₁ and calculate α̂ so that f(z_{n+1} + α̂ p̂) is minimized;
    Set x_{k+1} ← z_{n+1} + α̂ p̂;
    Remove p_m from the set of directions and add p̂ to this set;
end (if)
This procedure can be applied to general objective functions by implementing inexact one-dimensional line searches. The resulting conjugate-direction method has been found to be useful for solving small-dimensional problems.
9.5 NELDER-MEAD METHOD
The Nelder-Mead simplex-reflection method has been a popular DFO method since its introduction in 1965 [223]. It takes its name from the fact that at any stage of the algorithm, we keep track of n + 1 points of interest in IRⁿ, whose convex hull forms a simplex. (The method has nothing to do with the simplex method for linear programming discussed in Chapter 13.) Given a simplex S with vertices z₁, z₂, ..., z_{n+1}, we can define an associated matrix V(S) by taking the n edges from one of its vertices (z₁, say), as follows:

    V(S) = [ z₂ − z₁, z₃ − z₁, ..., z_{n+1} − z₁ ].
The simplex is said to be nondegenerate or nonsingular if V is a nonsingular matrix. For example, a simplex in IR3 is nondegenerate if its four vertices are not coplanar.
In a single iteration of the Nelder-Mead algorithm, we seek to remove the vertex with the worst function value and replace it with another point with a better value. The new point is obtained by reflecting, expanding, or contracting the simplex along the line joining the worst vertex with the centroid of the remaining vertices. If we cannot find a better point in this manner, we retain only the vertex with the best function value, and we shrink the simplex by moving all other vertices toward this vertex.
We specify a single step of the algorithm after defining some notation. The n + 1 vertices of the current simplex are denoted by x₁, x₂, ..., x_{n+1}, where we choose the ordering so that

    f(x₁) ≤ f(x₂) ≤ ··· ≤ f(x_{n+1}).

The centroid of the best n points is denoted by

    x̄ = (1/n) Σ_{i=1}^{n} x_i.

Points along the line joining x̄ and the worst vertex x_{n+1} are denoted by

    x̄(t) = x̄ + t (x_{n+1} − x̄).
Procedure 9.5 (One Step of Nelder-Mead Simplex).
Compute the reflection point x̄(−1) and evaluate f_{−1} = f(x̄(−1));
if f(x₁) ≤ f_{−1} < f(x_n)
    (∗ reflected point is neither best nor worst in the new simplex ∗)
    replace x_{n+1} by x̄(−1) and go to next iteration;
else if f_{−1} < f(x₁)
    (∗ reflected point is better than the current best; try to go farther along this direction ∗)
    Compute the expansion point x̄(−2) and evaluate f_{−2} = f(x̄(−2));
    if f_{−2} < f_{−1}
        replace x_{n+1} by x̄(−2) and go to next iteration;
    else
        replace x_{n+1} by x̄(−1) and go to next iteration;
else if f_{−1} ≥ f(x_n)
    (∗ reflected point is still worse than x_n; contract ∗)
    if f(x_n) ≤ f_{−1} < f(x_{n+1})
        (∗ try to perform outside contraction ∗)
        evaluate f_{−1/2} = f(x̄(−1/2));
        if f_{−1/2} ≤ f_{−1}
            replace x_{n+1} by x̄(−1/2) and go to next iteration;
    else
        (∗ try to perform inside contraction ∗)
        evaluate f_{1/2} = f(x̄(1/2));
        if f_{1/2} < f(x_{n+1})
            replace x_{n+1} by x̄(1/2) and go to next iteration;
    (∗ neither outside nor inside contraction was acceptable; shrink the simplex toward x₁ ∗)
    replace x_i ← (1/2)(x₁ + x_i) for i = 2, 3, ..., n + 1;
Procedure 9.5 is illustrated on a two-dimensional example in Figure 9.4. The worst current vertex is x₃, and the possible replacement points are x̄(−1), x̄(−2), x̄(−1/2), and x̄(1/2). If none of the replacement points proves to be satisfactory, the simplex is shrunk to the smaller triangle indicated by the dotted line, which retains the best vertex x₁. The scalars t used in defining the candidate points x̄(t) have been assigned the specific (and standard) values −1, −2, −1/2, and 1/2 in our description above. Different choices are also possible, subject to certain restrictions.
Practical performance of the Nelder-Mead algorithm is often reasonable, though stagnation has been observed to occur at nonoptimal points. Restarting can be used when stagnation is detected; see Kelley [178]. Note that unless the final shrinkage step is performed, the average function value

    (1/(n+1)) Σ_{i=1}^{n+1} f(x_i)   (9.29)

will decrease at each step. When f is convex, even the shrinkage step is guaranteed not to increase the average function value.
Figure 9.4 One step of the Nelder-Mead simplex method in IR², showing the current simplex (solid triangle) with vertices x₁, x₂, x₃, the reflection point x̄(−1), the expansion point x̄(−2), the inside contraction point x̄(1/2), the outside contraction point x̄(−1/2), and the shrunken simplex (dotted triangle).
A limited amount of convergence theory has been developed for the Nelder-Mead method in recent years; see, for example, Kelley [179] and Lagarias et al. [186].
9.6 IMPLICIT FILTERING
We now describe an algorithm designed for functions whose evaluations are modeled by (9.2), where h is smooth. This implicit filtering approach is, in its simplest form, a variant of the steepest descent algorithm with line search discussed in Chapter 3, in which the gradient ∇f_k is replaced by a finite-difference estimate such as (9.3), with a difference parameter ε that may not be particularly small.

Implicit filtering works best on functions for which the noise level decreases as the iterates approach a solution. This situation may occur when we have control over the noise level, as is the case when f is obtained by solving a differential equation to a user-specified tolerance, or by running a stochastic simulation for a user-specified number of trials (where an increase in the number of trials usually produces a decrease in the noise). The implicit filtering algorithm decreases ε systematically (but, one hopes, not as rapidly as the decay in the errors) so as to maintain reasonable accuracy in ∇_ε f(x), given the noise level at the current value of x. For each value ε_k of ε, it performs an inner loop that is simply an Armijo line search using the search direction −∇_{ε_k} f(x). If the inner loop is unable to find a satisfactory step length after backtracking at most a_max times, we return to the outer loop, choose a smaller value of ε, and repeat. A formal specification follows.
Algorithm 9.6 (Implicit Filtering).
Choose a sequence {ε_k} ↓ 0, Armijo parameters c and β in (0, 1),
    maximum backtracking parameter a_max;
Set k ← 1; choose initial point x = x₀;
repeat
    increment_k ← false;
    repeat
        Compute f(x) and ∇_{ε_k} f(x);
        if ||∇_{ε_k} f(x)|| ≤ ε_k
            increment_k ← true;
        else
            Find the smallest integer m between 0 and a_max such that
                f( x − β^m ∇_{ε_k} f(x) ) ≤ f(x) − c β^m ||∇_{ε_k} f(x)||²;
            if no such m exists
                increment_k ← true;
            else
                x ← x − β^m ∇_{ε_k} f(x);
    until increment_k;
    x_k ← x; k ← k + 1;
until a termination test is satisfied.
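In Python, the essence of Algorithm 9.6 might be coded as follows (our sketch; the central-difference loop implements an estimate in the spirit of (9.3), and all parameter defaults are our own choices).

    import numpy as np

    def grad_eps(f, x, eps):
        # Central-difference gradient estimate with difference parameter eps.
        n = x.size
        g = np.zeros(n)
        for i in range(n):
            e = np.zeros(n); e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
        return g

    def implicit_filtering(f, x0, eps_seq, c=1e-4, beta=0.5, a_max=30):
        x = np.asarray(x0, dtype=float)
        for eps in eps_seq:                     # outer loop over decreasing eps_k
            while True:                         # inner Armijo loop
                g = grad_eps(f, x, eps)
                if np.linalg.norm(g) <= eps:    # converged at this noise scale
                    break
                for m in range(a_max + 1):      # backtracking line search
                    step = beta**m
                    if f(x - step * g) <= f(x) - c * step * np.dot(g, g):
                        x = x - step * g
                        break
                else:
                    break                       # line search failed: shrink eps
        return x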
Note that the inner loop in Algorithm 9.6 is essentially the backtracking line search algorithm (Algorithm 3.1 of Chapter 3) with a convergence criterion added to detect whether the minimum appears to have been found to within the accuracy implied by the difference parameter ε_k. If the gradient estimate ∇_{ε_k} f is small, or if the line search fails to find a satisfactory new iterate (indicating that the gradient approximation ∇_{ε_k} f(x) is insufficiently accurate to produce descent in f), we decrease the difference parameter to ε_{k+1} and proceed.
A basic convergence result for Algorithm 9.6 is the following.
Theorem 9.2.
Suppose that ∇²h is Lipschitz continuous, that Algorithm 9.6 generates an infinite sequence of iterates x_k, and that

    lim_{k→∞} [ ε_k² + η(x_k; ε_k)/ε_k ] = 0.

Suppose, too, that all but a finite number of inner loops in Algorithm 9.6 terminate with ||∇_{ε_k} f(x_k)|| ≤ ε_k. Then all limit points of the sequence {x_k} are stationary.

PROOF. Using ε_k ↓ 0, we have under our assumptions on inner-loop termination that ∇_{ε_k} f(x_k) → 0. By invoking the error bound (9.5) and noting that the right-hand side of this expression is approaching zero, we conclude that ∇h(x_k) → 0. Hence all limit points x̄ satisfy ∇h(x̄) = 0, as claimed. □
More sophisticated versions of implicit filtering methods can be derived by using the gradient estimate ∇_{ε_k} f to construct quasi-Newton approximate Hessians, thus generating quasi-Newton search directions instead of the negative approximate-gradient search direction used in Algorithm 9.6.
NOTES AND REFERENCES
A classical reference on derivative-free methods is Brent [39], which focuses primarily on one-dimensional problems and includes discussion of roundoff errors and global minimization. Recent surveys on derivative-free methods include Wright [314], Powell [256], Conn, Scheinberg, and Toint [76], and Kolda, Lewis, and Torczon [183].

The first model-based method for derivative-free optimization was proposed by Winfield [307]. It uses quadratic models, which are determined by the interpolation conditions (9.7), and computes steps by solving a subproblem of the form (9.9). Practical procedures for improving the geometry of the interpolation points were first developed by Powell in the context of model-based methods using linear and quadratic polynomials; see [256] for a review of this work.

Conn, Scheinberg, and Toint [75] propose and analyze model-based methods and study the use of Newton fundamental polynomials. Methods that combine minimum-change updating and interpolation are discussed by Powell [258, 260]. Our presentation of model-based methods in Section 9.2 is based on [76, 259, 258].

For a comprehensive discussion of pattern-search methods of the type discussed here, we refer the reader to the review paper of Kolda, Lewis, and Torczon [183], and the references therein.

The method of conjugate directions given in Algorithm 9.3 was proposed by Powell [239]. For a discussion of the rate of convergence of the coordinate descent method and for more references about this method, see Luenberger [195]. For further information on implicit filtering, see Kelley [179] and Choi and Kelley [60] and the references therein.

Software packages that implement model-based methods include COBYLA [258], DFO [75], UOBYQA [257], WEDGE [200], and NEWUOA [260]. The earliest code is COBYLA, which employs linear models. DFO, UOBYQA, and WEDGE use quadratic models, whereas the method based on minimum-change updating (9.19) is implemented in NEWUOA. A pattern-search method is implemented in APPS [171], while DIRECT [173] is designed to find a global solution.
EXERCISES
9.1 Prove Lemma 9.1.

9.2
(a) Verify that the number of interpolation conditions needed to determine the coefficients in (9.6) uniquely is q + 1 = (1/2)(n+1)(n+2).
(b) Verify that the number of vertices and midpoints of the edges of a nondegenerate simplex in IRⁿ add up to q + 1 = (1/2)(n+1)(n+2), and can therefore be used as the initial interpolation set in a DFO algorithm.
(c) How many interpolation conditions would be required to determine the coefficients in (9.6) if the matrix G were identically 0? How many if G were diagonal? How many if G were tridiagonal?

9.3 Describe conditions on the vectors s_l that guarantee that the model (9.14) is uniquely determined.
9.4 Consider the determination of a quadratic function in two variables.
(a) Show that six points on a line do not determine the quadratic.
(b) Show that six points on a circle in the plane do not uniquely determine the quadratic.
9.5 Use induction to show that at the end of the outer iteration k of Algorithm 9.3, the directions p_{n−k}, p_{n−k+1}, ..., p_n are conjugate. Use this fact to show that if the step lengths α_j in Algorithm 9.3 are never zero, the iteration terminates at the minimizer of (9.25) after at most n outer iterations.
9.6 Write a program that computes the one-dimensional minimizer of a strictly convex quadratic function f along a direction p using quadratic interpolation. Describe the formulas used in your program.
9.7 Find the quadratic function

    m(x₁, x₂) = f + g₁ x₁ + g₂ x₂ + (1/2) G₁₁ x₁² + G₁₂ x₁ x₂ + (1/2) G₂₂ x₂²

that interpolates the following data: x₀ = y¹ = (0, 0)ᵀ, y² = (1, 0)ᵀ, y³ = (2, 0)ᵀ, y⁴ = (1, 1)ᵀ, y⁵ = (0, 2)ᵀ, y⁶ = (0, 1)ᵀ, and f(y¹) = 1, f(y²) = 2.0084, f(y³) = 7.0091, f(y⁴) = 1.0168, f(y⁵) = 0.9909, and f(y⁶) = 0.9916.
9.8 Find the value of δ for which the coordinate generating set (9.23) satisfies the property (9.21).
9.9 Show that κ(D_k) = 0, where κ is defined by (9.21) and D_k = {e_i, −e_i} for any i ∈ {1, 2, ..., n}.
9.10 (Hard) Prove that the generating set (9.24) satisfies the property (9.21) for a certain value δ > 0, and find this value of δ.
9.11 Justify the statement that the average function value at the Nelder-Mead simplex points will decrease over one step if any of the points x̄(−1), x̄(−2), x̄(−1/2), x̄(1/2) is adopted as a replacement for x_{n+1}.
9.12 Show that if f is a convex function, the shrinkage step in the Nelder-Mead simplex method will not increase the average value of the function over the simplex vertices defined by (9.29). Show that unless f(x₁) = f(x₂) = ··· = f(x_{n+1}), the average value will in fact decrease.
9.13 Suppose that for the f defined in (9.2), we define the approximate gradient ∇_ε f(x) by the forward-difference formula

    ∇_ε f(x) = [ ( f(x + ε e_i) − f(x) ) / ε ]_{i=1,2,...,n},

rather than the central-difference formula (9.3). This formula requires only half as many function evaluations but is less accurate. For this definition, prove the following variant of Lemma 9.1: Suppose that ∇h(x) is Lipschitz continuous in a neighborhood of the box {z | x ≤ z ≤ x + εe} with Lipschitz constant L_h. Then we have

    ||∇_ε f(x) − ∇h(x)|| ≤ L_h ε + η(x; ε)/ε,

where η(x; ε) is redefined as follows:

    η(x; ε) = sup_{x ≤ z ≤ x + εe} |η(z)|.
CHAPTER 10
Least-Squares Problems
In least-squares problems, the objective function f has the following special form:

    f(x) = (1/2) Σ_{j=1}^{m} r_j²(x),   (10.1)

where each r_j is a smooth function from IRⁿ to IR. We refer to each r_j as a residual, and we assume throughout this chapter that m ≥ n.

Least-squares problems arise in many areas of applications, and may in fact be the largest source of unconstrained optimization problems. Many who formulate a parametrized
model for a chemical, physical, financial, or economic application use a function of the form 10.1 to measure the discrepancy between the model and the observed behavior of the system see Example 2.1, for instance. By minimizing this function, they select values for the parameters that best match the model to the data. In this chapter we show how to devise efficient, robust minimization algorithms by exploiting the special structure of the function f and its derivatives.
To see why the special form of f often makes least-squares problems easier to solve than general unconstrained minimization problems, we first assemble the individual components r_j from (10.1) into a residual vector r: IRⁿ → IRᵐ, as follows:

    r(x) = ( r₁(x), r₂(x), ..., r_m(x) )ᵀ.   (10.2)

Using this notation, we can rewrite f as f(x) = (1/2)||r(x)||₂². The derivatives of f(x) can be expressed in terms of the Jacobian J(x), which is the m × n matrix of first partial derivatives of the residuals, defined by

    J(x) = [ ∂r_j/∂x_i ]  (j = 1, 2, ..., m; i = 1, 2, ..., n)
         = [ ∇r₁(x)ᵀ; ∇r₂(x)ᵀ; ... ; ∇r_m(x)ᵀ ],   (10.3)

where each ∇r_j(x), j = 1, 2, ..., m, is the gradient of r_j. The gradient and Hessian of f can then be expressed as follows:

    ∇f(x) = Σ_{j=1}^{m} r_j(x) ∇r_j(x) = J(x)ᵀ r(x),   (10.4)

    ∇²f(x) = Σ_{j=1}^{m} ∇r_j(x) ∇r_j(x)ᵀ + Σ_{j=1}^{m} r_j(x) ∇²r_j(x)
           = J(x)ᵀ J(x) + Σ_{j=1}^{m} r_j(x) ∇²r_j(x).   (10.5)
In many applications, the first partial derivatives of the residuals, and hence the Jacobian matrix J(x), are relatively easy or inexpensive to calculate. We can thus obtain the gradient ∇f(x) as written in formula (10.4). Using J(x), we also can calculate the first term J(x)ᵀJ(x) in the Hessian ∇²f(x) without evaluating any second derivatives of the functions r_j. This availability of part of ∇²f(x) "for free" is the distinctive feature of least-squares problems. Moreover, this term J(x)ᵀJ(x) is often more important than the second summation term in (10.5), either because the residuals r_j are close to affine near the solution (that is, the ∇²r_j(x) are relatively small) or because of small residuals (that is, the r_j(x) are relatively small). Most algorithms for nonlinear least-squares exploit these structural properties of the Hessian.
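In code, these quantities are assembled directly from the residual vector and its Jacobian. The sketch below (our illustration; the example data, model, and function names are all hypothetical) computes the gradient (10.4) and the "free" first term of the Hessian (10.5).

    import numpy as np

    def grad_and_gn_hessian(r, J, x):
        # Gradient (10.4) and the first term of the Hessian (10.5),
        # using only residuals and first derivatives.
        rx, Jx = r(x), J(x)
        g = Jx.T @ rx      # grad f(x) = J(x)^T r(x)
        H = Jx.T @ Jx      # J(x)^T J(x): the part of the Hessian available for free
        return g, H

    # Example: residuals r_j(x) = x1 * exp(x2 * t_j) - y_j for data (t_j, y_j).
    t = np.array([0.0, 1.0, 2.0, 3.0]); y = np.array([1.0, 2.7, 7.4, 20.1])
    r = lambda x: x[0] * np.exp(x[1] * t) - y
    J = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
    g, H = grad_and_gn_hessian(r, J, np.array([1.0, 1.0]))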
The most popular algorithms for minimizing (10.1) fit into the line search and trust-region frameworks described in earlier chapters. They are based on the Newton and quasi-Newton approaches described earlier, with modifications that exploit the particular structure of f.

Section 10.1 contains some background on applications. Section 10.2 discusses linear least-squares problems, which provide important motivation for algorithms for the nonlinear problem. Section 10.3 describes the major algorithms, while Section 10.4 briefly describes a variant of least squares known as orthogonal distance regression.

Throughout this chapter, we use the notation ||·|| to denote the Euclidean norm ||·||₂, unless a subscript indicates that some other norm is intended.
10.1 BACKGROUND
We discuss a simple parametrized model and show how least-squares techniques can be used to choose the parameters that best fit the model to the observed data.
EXAMPLE 10.1
We would like to study the effect of a certain medication on a patient. We draw blood samples at certain times after the patient takes a dose, and measure the concentration of the medication in each sample, tabulating the time t_j and concentration y_j for each sample.

Based on our previous experience in such experiments, we find that the following function φ(x; t) provides a good prediction of the concentration at time t, for appropriate values of the five-dimensional parameter vector x = (x₁, x₂, x₃, x₄, x₅):

    φ(x; t) = x₁ + t x₂ + t² x₃ + x₄ e^{−x₅ t}.   (10.6)
We choose the parameter vector x so that this model best agrees with our observations, in some sense. A good way to measure the difference between the predicted model values and the observations is the following least-squares function:

    (1/2) Σ_{j=1}^{m} [ φ(x; t_j) − y_j ]²,   (10.7)

which sums the squares of the discrepancies between predictions and observations at each t_j. This function has precisely the form (10.1) if we define

    r_j(x) = φ(x; t_j) − y_j.   (10.8)
Figure 10.1 Model (10.6) (smooth curve) and the observed measurements, with deviations indicated by vertical dotted lines.
Graphically, each term in (10.7) represents the square of the vertical distance between the curve φ(x; t) (plotted as a function of t) and the point (t_j, y_j), for a fixed choice of the parameter vector x; see Figure 10.1. The minimizer x* of the least-squares problem is the parameter vector for which the sum of squares of the lengths of the dotted lines in Figure 10.1 is minimized. Having obtained x*, we use φ(x*; t) to estimate the concentration of medication remaining in the patient's bloodstream at any time t.
This model is an example of what statisticians call a fixed-regressor model. It assumes that the times t_j at which the blood samples are drawn are known to high accuracy, while the observations y_j may contain more or less random errors due to the limitations of the equipment (or the lab technician!).

In general data-fitting problems of the type just described, the ordinate t in the model φ(x; t) could be a vector instead of a scalar. In the example above, for instance, t could have two dimensions, with the first dimension representing the time since the drug was administered and the second dimension representing the weight of the patient. We could then use observations for an entire population of patients, not just a single patient, to obtain the best parameters for this model.
The sum-of-squares function (10.7) is not the only way of measuring the discrepancy between the model and the observations. Other common measures include the maximum absolute value

    max_{j=1,2,...,m} | φ(x; t_j) − y_j |   (10.9)

and the sum of absolute values

    Σ_{j=1}^{m} | φ(x; t_j) − y_j |.   (10.10)

By using the definitions of the ℓ∞ and ℓ₁ norms, we can rewrite these two measures as

    f(x) = ||r(x)||_∞,   f(x) = ||r(x)||₁,   (10.11)

respectively. As we discuss in Chapter 17, the problem of minimizing the functions (10.11) can be reformulated as a smooth constrained optimization problem.
In this chapter we focus only on the ℓ₂-norm formulation (10.1). In some situations, there are statistical motivations for choosing the least-squares criterion. Changing the notation slightly, let the discrepancies between model and observation be denoted by ε_j, that is,

    ε_j = φ(x; t_j) − y_j.

It often is reasonable to assume that the ε_j's are independent and identically distributed with a certain variance σ² and probability density function g_σ(·). This assumption will often be true, for instance, when the model accurately reflects the actual process, and when the errors made in obtaining the measurements y_j do not contain a systematic bias. Under this assumption, the likelihood of a particular set of observations y_j, j = 1, 2, ..., m, given that the actual parameter vector is x, is given by the function

    p(y; x, σ) = Π_{j=1}^{m} g_σ(ε_j) = Π_{j=1}^{m} g_σ( φ(x; t_j) − y_j ).   (10.12)
Given the observations y₁, y₂, ..., y_m, the most likely value of x is obtained by maximizing p(y; x, σ) with respect to x. The resulting value of x is called the maximum likelihood estimate.
When we assume that the discrepancies follow a normal distribution, we have

    g_σ(ε) = (2πσ²)^{−1/2} exp( −ε² / (2σ²) ).
Substitution in (10.12) yields

    p(y; x, σ) = (2πσ²)^{−m/2} exp( −(1/(2σ²)) Σ_{j=1}^{m} [ φ(x; t_j) − y_j ]² ).
For any fixed value of the variance σ², it is obvious that p is maximized when the sum of squares (10.7) is minimized. To summarize: When the discrepancies are assumed to be independent and identically distributed with a normal distribution function, the maximum likelihood estimate is obtained by minimizing the sum of squares.
The assumptions on ε_j in the previous paragraph are common, but they do not describe the only situation for which the minimizer of the sum of squares makes good statistical sense. Seber and Wild [280] describe many instances in which minimization of functions like (10.7), or generalizations of this function such as

    r(x)ᵀ W r(x),  where W ∈ IR^{m×m} is symmetric,

is the crucial step in obtaining estimates of the parameters x from observed data.
10.2 LINEAR LEAST-SQUARES PROBLEMS
Many models φ(x; t) in data-fitting problems are linear functions of x. In these cases, the residuals r_j(x) defined by (10.8) also are linear, and the problem of minimizing (10.7) is called a linear least-squares problem. We can write the residual vector as r(x) = Jx − y for some matrix J and vector y, both independent of x, so that the objective is

    f(x) = (1/2) ||Jx − y||²,   (10.13)

where y = −r(0). We also have

    ∇f(x) = Jᵀ(Jx − y),   ∇²f(x) = JᵀJ.

Note that the second term in ∇²f(x) (see (10.5)) disappears, because ∇²r_j = 0 for all j = 1, 2, ..., m. It is easy to see that the f(x) in (10.13) is convex, a property that does not necessarily hold for the nonlinear problem (10.1). Theorem 2.5 tells us that any point x* for which ∇f(x*) = 0 is the global minimizer of f. Therefore, x* must satisfy the following linear system of equations:

    JᵀJ x* = Jᵀy.   (10.14)

These are known as the normal equations for (10.13).

We outline briefly three major algorithms for the unconstrained linear least-squares problem. We assume in most of our discussion that m ≥ n and that J has full column rank.
The first and most obvious algorithm is simply to form and solve the system (10.14) by the following three-step procedure:

compute the coefficient matrix JᵀJ and the right-hand side Jᵀy;
compute the Cholesky factorization of the symmetric matrix JᵀJ;
perform two triangular substitutions with the Cholesky factors to recover the solution x*.
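With scipy, the three steps read as follows (our sketch; cho_factor performs the Cholesky factorization and cho_solve carries out the two triangular substitutions).

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def lsq_normal_equations(J, y):
        # Solve min ||Jx - y|| via the normal equations (10.14).
        A = J.T @ J                    # form J^T J (this squares the condition number)
        b = J.T @ y                    # right-hand side
        c, low = cho_factor(A)         # Cholesky factorization, cf. (10.15)
        return cho_solve((c, low), b)  # two triangular substitutions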
The Cholesky factorization

    JᵀJ = R̄ᵀ R̄   (10.15)

(where R̄ is an n × n upper triangular matrix with positive diagonal elements) is guaranteed to exist when m ≥ n and J has rank n. This method is frequently used in practice and is often effective, but it has one significant disadvantage, namely, that the condition number of JᵀJ is the square of the condition number of J. Since the relative error in the computed solution of a problem is usually proportional to the condition number, the Cholesky-based method may result in less accurate solutions than those obtained from methods that avoid this squaring of the condition number. When J is ill conditioned, the Cholesky factorization process may even break down, since roundoff errors may cause small negative elements to appear on the diagonal during the factorization process.
A second approach is based on a QR factorization of the matrix J. Since the Euclidean norm of any vector is not affected by orthogonal transformations, we have

    ||Jx − y|| = ||Qᵀ(Jx − y)||   (10.16)

for any m × m orthogonal matrix Q. Suppose we perform a QR factorization with column pivoting on the matrix J (see (A.24)) to obtain

    J Π = Q [R; 0] = [Q₁ Q₂] [R; 0] = Q₁ R,   (10.17)

where
Π is an n × n permutation matrix (hence, orthogonal);
Q is m × m orthogonal;
Q₁ is the first n columns of Q, while Q₂ contains the last m − n columns;
R is n × n upper triangular with positive diagonal elements.
By combining (10.16) and (10.17), we obtain

    ||Jx − y||₂² = || [Q₁ᵀ; Q₂ᵀ] ( J Π Πᵀ x − y ) ||₂²
                = || [R; 0] (Πᵀx) − [Q₁ᵀy; Q₂ᵀy] ||₂²
                = || R (Πᵀx) − Q₁ᵀy ||₂² + || Q₂ᵀy ||₂².   (10.18)

No choice of x has any effect on the second term of this last expression, but we can minimize ||Jx − y|| by driving the first term to zero, that is, by setting

    x* = Π R⁻¹ Q₁ᵀ y.

In practice, we perform a triangular substitution to solve Rz = Q₁ᵀy, then permute the components of z to obtain x* = Πz.
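The same computation via pivoted QR looks like this in scipy (our sketch; note that scipy returns the permutation as an index vector rather than as a matrix Π).

    import numpy as np
    from scipy.linalg import qr, solve_triangular

    def lsq_qr(J, y):
        # Solve min ||Jx - y|| via QR with column pivoting, as in (10.17)-(10.18).
        Q, R, perm = qr(J, mode='economic', pivoting=True)
        z = solve_triangular(R, Q.T @ y)   # solve R z = Q_1^T y
        x = np.empty_like(z)
        x[perm] = z                        # undo the column permutation: x = Pi z
        return x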
This QR-based approach does not degrade the conditioning of the problem unnecessarily. The relative error in the final computed solution x* is usually proportional to the condition number of J, not its square, and this method is usually reliable. Some situations, however, call for greater robustness or more information about the sensitivity of the solution to perturbations in the data (J or y). A third approach, based on the singular-value decomposition (SVD) of J, can be used in these circumstances. Recall from (A.15) that the SVD of J is given by
    J = U [S; 0] Vᵀ = [U₁ U₂] [S; 0] Vᵀ = U₁ S Vᵀ,   (10.19)

where
U is m × m orthogonal;
U₁ contains the first n columns of U, U₂ the last m − n columns;
V is n × n orthogonal;
S is n × n diagonal, with diagonal elements σ₁ ≥ σ₂ ≥ ··· ≥ σ_n > 0.
Note that JᵀJ = V S² Vᵀ, so that the columns of V are eigenvectors of JᵀJ with eigenvalues σ_j², j = 1, 2, ..., n. By following the same logic that led to (10.18), we obtain

    ||Jx − y||² = || [S; 0] Vᵀx − [U₁ᵀy; U₂ᵀy] ||²
               = || S (Vᵀx) − U₁ᵀy ||² + || U₂ᵀy ||².   (10.20)
Again, the optimum is found by choosing x to make the first term equal to zero; that is, x* = V S⁻¹ U₁ᵀ y. Denoting the i-th columns of U and V by u_i ∈ IRᵐ and v_i ∈ IRⁿ, respectively, we have

    x* = Σ_{i=1}^{n} ( u_iᵀ y / σ_i ) v_i.   (10.21)
This formula yields useful information about the sensitivity of x*. When σ_i is small, x* is particularly sensitive to perturbations in y that affect u_iᵀy, and also to perturbations in J that affect this same quantity. Such information is particularly useful when J is nearly rank-deficient, that is, when σ_n/σ₁ ≪ 1. It is sometimes worth the extra cost of the SVD algorithm to obtain this sensitivity information.
All three approaches above have their place. The Cholesky-based algorithm is particularly useful when m ≫ n and it is practical to store JᵀJ but not J itself. It can also be less expensive than the alternatives when m ≫ n and J is sparse. However, this approach must be modified when J is rank-deficient or ill conditioned, to allow pivoting of the diagonal elements of JᵀJ. The QR approach avoids squaring of the condition number and hence may be more numerically robust. While potentially the most expensive, the SVD approach is the most robust and reliable of all. When J is actually rank-deficient, some of the singular values σ_i are exactly zero, and any vector x* of the form

    x* = Σ_{σ_i ≠ 0} ( u_iᵀ y / σ_i ) v_i + Σ_{σ_i = 0} τ_i v_i   (10.22)

for arbitrary coefficients τ_i is a minimizer of (10.20). Frequently, the solution with smallest norm is the most desirable, and we obtain it by setting each τ_i = 0 in (10.22). When J has full rank but is ill conditioned, the last few singular values σ_n, σ_{n−1}, ... are small relative to σ₁. The coefficients u_iᵀy/σ_i in (10.22) are particularly sensitive to perturbations in u_iᵀy when σ_i is small, so an approximate solution that is less sensitive to perturbations than the true solution can be obtained by omitting these terms from the summation.
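A sketch of the SVD-based solve, with an option to drop the terms corresponding to small σ_i as just described (the relative tolerance and its default are our own choices):

    import numpy as np

    def lsq_svd(J, y, rel_tol=0.0):
        # Solve min ||Jx - y|| via (10.21), omitting terms with
        # sigma_i <= rel_tol * sigma_1 for extra robustness.
        U, s, Vt = np.linalg.svd(J, full_matrices=False)
        keep = s > rel_tol * s[0]
        coeffs = (U.T @ y)[keep] / s[keep]   # the quantities u_i^T y / sigma_i
        return Vt[keep].T @ coeffs           # sum of coeffs_i * v_i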
When the problem is very large, it may be efficient to use iterative techniques, such as the conjugate gradient method, to solve the normal equations (10.14). A direct implementation of conjugate gradients (Algorithm 5.2) requires one matrix-vector multiplication with JᵀJ to be performed at each iteration. This operation can be performed by means of successive multiplications by J and Jᵀ; we need only the ability to perform matrix-vector multiplications with these two matrices to implement this algorithm. Several modifications of the conjugate gradient approach have been proposed that involve a similar amount of work per iteration (one matrix-vector multiplication each with J and Jᵀ) but that have superior numerical properties. Some alternatives are described by Paige and Saunders [234], who propose in particular an algorithm called LSQR, which has become the basis of a highly successful code.
10.3 ALGORITHMS FOR NONLINEAR LEAST-SQUARES PROBLEMS

THE GAUSS-NEWTON METHOD
We now describe methods for minimizing the nonlinear objective function (10.1) that exploit the structure in the gradient ∇f (10.4) and Hessian ∇²f (10.5). The simplest of these methods, the Gauss-Newton method, can be viewed as a modified Newton's method with line search. Instead of solving the standard Newton equations ∇²f(x_k) p = −∇f(x_k), we solve instead the following system to obtain the search direction p_k^GN:

    J_kᵀ J_k p_k^GN = −J_kᵀ r_k.   (10.23)
This simple modification gives a number of advantages over the plain Newton's method. First, our use of the approximation

    ∇²f_k ≈ J_kᵀ J_k   (10.24)

saves us the trouble of computing the individual residual Hessians ∇²r_j, j = 1, 2, ..., m, which are needed in the second term in (10.5). In fact, if we calculated the Jacobian J_k in the course of evaluating the gradient ∇f_k = J_kᵀ r_k, the approximation (10.24) does not require any additional derivative evaluations, and the savings in computational time can be quite significant in some applications. Second, there are many interesting situations in which the first term JᵀJ in (10.5) dominates the second term (at least close to the solution x*), so that J_kᵀ J_k is a close approximation to ∇²f_k and the convergence rate of Gauss-Newton is similar to that of Newton's method. The first term in (10.5) will be dominant when the norm of each second-order term (that is, |r_j(x)| ||∇²r_j(x)||) is significantly smaller than the eigenvalues of JᵀJ. As mentioned in the introduction, we tend to see this behavior when either the residuals r_j are small or when they are nearly affine (so that the ∇²r_j are small). In practice, many least-squares problems have small residuals at the solution, leading to rapid local convergence of Gauss-Newton.
A third advantage of Gauss-Newton is that whenever J_k has full rank and the gradient ∇f_k is nonzero, the direction p_k^GN is a descent direction for f, and therefore a suitable direction for a line search. From (10.4) and (10.23) we have

    (p_k^GN)ᵀ ∇f_k = (p_k^GN)ᵀ J_kᵀ r_k = −(p_k^GN)ᵀ J_kᵀ J_k p_k^GN = −||J_k p_k^GN||² ≤ 0.   (10.25)
The final inequality is strict unless J_k p_k^GN = 0, in which case we have by (10.23) and full rank of J_k that J_kᵀ r_k = ∇f_k = 0; that is, x_k is a stationary point. Finally, the fourth advantage of Gauss-Newton arises from the similarity between the equations (10.23) and the normal equations (10.14) for the linear least-squares problem. This connection tells us that p_k^GN is in fact the solution of the linear least-squares problem

    min_p (1/2) ||J_k p + r_k||².   (10.26)
Hence, we can find the search direction by applying linear least-squares algorithms to the subproblem (10.26). In fact, if the QR- or SVD-based algorithms are used, there is no need to calculate the Hessian approximation J_kᵀ J_k in (10.23) explicitly; we can work directly with the Jacobian J_k. The same is true if we use a conjugate-gradient technique to solve (10.26). For this method we need to perform matrix-vector multiplications with J_kᵀ J_k, which can be done by first multiplying by J_k and then by J_kᵀ.
If the number of residuals m is large while the number of variables n is relatively small, it may be unwise to store the Jacobian J explicitly. A preferable strategy may be to calculate the matrix JᵀJ and gradient vector Jᵀr by evaluating r_j and ∇r_j successively for j = 1, 2, ..., m and performing the accumulations

    JᵀJ = Σ_{j=1}^{m} ∇r_j ∇r_jᵀ,   Jᵀr = Σ_{j=1}^{m} r_j ∇r_j.   (10.27)

The Gauss-Newton steps can then be computed by solving the system (10.23) of normal equations directly.
The subproblem (10.26) suggests another motivation for the Gauss-Newton search direction. We can view this equation as being obtained from a linear model for the vector function, r(x_k + p) ≈ r_k + J_k p, substituted into the function (1/2)||·||². In other words, we use the approximation

    f(x_k + p) = (1/2) ||r(x_k + p)||² ≈ (1/2) ||J_k p + r_k||²,

and choose p_k^GN to be the minimizer of this approximation.
Implementations of the Gauss-Newton method usually perform a line search in the direction p_k^GN, requiring the step length α_k to satisfy conditions like those discussed in Chapter 3, such as the Armijo and Wolfe conditions; see (3.4) and (3.6).
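Putting the pieces together, a simple line-search Gauss-Newton iteration can be sketched as follows (our own sketch, not the book's implementation; the backtracking Armijo search, the tolerances, and the alpha floor are all our choices). It can be run, for instance, with the r and J of the earlier residual-structure example.

    import numpy as np

    def gauss_newton(r, J, x0, tol=1e-8, max_iter=100, c=1e-4):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            rx, Jx = r(x), J(x)
            g = Jx.T @ rx                       # grad f = J^T r, see (10.4)
            if np.linalg.norm(g) <= tol:
                break
            # Gauss-Newton direction from the linear subproblem (10.26).
            p, *_ = np.linalg.lstsq(Jx, -rx, rcond=None)
            fx, alpha = 0.5 * rx @ rx, 1.0
            while alpha > 1e-12 and \
                  0.5 * np.linalg.norm(r(x + alpha * p))**2 > fx + c * alpha * (g @ p):
                alpha *= 0.5                    # backtrack until the Armijo condition holds
            x = x + alpha * p
        return x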
CONVERGENCE OF THE GAUSS-NEWTON METHOD

The theory of Chapter 3 can be applied to study the convergence properties of the Gauss-Newton method. We prove a global convergence result under the assumption that the Jacobians J(x) have their singular values uniformly bounded away from zero in the region of interest; that is, there is a constant γ > 0 such that

    ||J(x) z|| ≥ γ ||z||   (10.28)

for all x in a neighborhood N of the level set

    L = { x | f(x) ≤ f(x₀) },   (10.29)

where x₀ is the starting point for the algorithm. We assume here and in the rest of the chapter that L is bounded. Our result is a consequence of Theorem 3.2.
Theorem 10.1.
Suppose each residual function r_j is Lipschitz continuously differentiable in a neighborhood N of the bounded level set (10.29), and that the Jacobians J(x) satisfy the uniform full-rank condition (10.28) on N. Then if the iterates x_k are generated by the Gauss-Newton method with step lengths α_k that satisfy (3.6), we have

    lim_{k→∞} J_kᵀ r_k = 0.
PROOF. First, we note that the neighborhood N of the bounded level set L can be chosen small enough that the following properties are satisfied for some positive constants L and β:

    |r_j(x)| ≤ β and ||∇r_j(x)|| ≤ β,
    |r_j(x) − r_j(x̃)| ≤ L ||x − x̃|| and ||∇r_j(x) − ∇r_j(x̃)|| ≤ L ||x − x̃||,

for all x, x̃ ∈ N and all j = 1, 2, ..., m. It is easy to deduce that there exists a constant β̄ > 0 such that ||J(x)ᵀJ(x)|| ≤ β̄ for all x ∈ L. In addition, by applying the results concerning Lipschitz continuity of products and sums (see for example (A.43)) to the gradient ∇f(x) = Σ_{j=1}^{m} r_j(x) ∇r_j(x), we can show that ∇f is Lipschitz continuous. Hence, the assumptions of Theorem 3.2 are satisfied.

We check next that the angle θ_k between the search direction p_k^GN and the negative gradient −∇f_k is uniformly bounded away from π/2. From (3.12), (10.25), and (10.28), we have for x = x_k ∈ L and p^GN = p_k^GN that

    cos θ_k = −∇fᵀ p^GN / ( ||p^GN|| ||∇f|| ) = ||J p^GN||² / ( ||p^GN|| ||JᵀJ p^GN|| ) ≥ γ² ||p^GN||² / ( β̄ ||p^GN||² ) = γ²/β̄ > 0.

It follows from (3.14) in Theorem 3.2 that ∇f(x_k) → 0, giving the result. □
If J_k is rank-deficient for some k (so that a condition like (10.28) is not satisfied), the coefficient matrix in (10.23) is singular. The system (10.23) still has a solution, however, because of the equivalence between this linear system and the minimization problem (10.26). In fact, there are infinitely many solutions for p_k^GN in this case; each of them has the form of (10.22). However, there is no longer an assurance that cos θ_k is uniformly bounded away from zero, so we cannot prove a result like Theorem 10.1.
The convergence of Gauss-Newton to a solution x* can be rapid if the leading term J_kᵀ J_k dominates the second-order term in the Hessian (10.5). Suppose that x_k is close to x* and that the assumption (10.28) is satisfied. Then, applying an argument like the Newton's method analysis (3.31), (3.32), (3.33) in Chapter 3, we have for a unit step in the Gauss-Newton direction that

    x_k + p_k^GN − x* = x_k − x* − [JᵀJ(x_k)]⁻¹ ∇f(x_k)
                     = [JᵀJ(x_k)]⁻¹ [ JᵀJ(x_k)(x_k − x*) + ∇f(x*) − ∇f(x_k) ],
where JᵀJ(x) is shorthand notation for J(x)ᵀJ(x). Using H(x) to denote the second-order term in (10.5), we have from (A.57) that

    ∇f(x_k) − ∇f(x*) = ∫₀¹ JᵀJ( x* + t(x_k − x*) ) (x_k − x*) dt
                     + ∫₀¹ H( x* + t(x_k − x*) ) (x_k − x*) dt.
A similar argument as in (3.32), (3.33), assuming Lipschitz continuity of H(·) near x*, shows that

    ||x_k + p_k^GN − x*|| ≤ ∫₀¹ || [JᵀJ(x_k)]⁻¹ H( x* + t(x_k − x*) ) || ||x_k − x*|| dt + O(||x_k − x*||²)
                         ≈ || [JᵀJ(x*)]⁻¹ H(x*) || ||x_k − x*|| + O(||x_k − x*||²).   (10.30)
JT Jx1Hx a 1, we can expect a unit step of GaussNewton to move
Hence, if
us much closer to the solution x, giving rapid local convergence. When Hx 0, the convergence is actually quadratic.
When n and m are both large and the Jacobian J(x) is sparse, the cost of computing steps exactly by factoring either J_k or J_kᵀ J_k at each iteration may become quite expensive relative to the cost of function and gradient evaluations. In this case, we can design inexact variants of the Gauss-Newton algorithm that are analogous to the inexact Newton algorithms discussed in Chapter 7. We simply replace the Hessian ∇²f(x_k) in these methods by its approximation J_kᵀ J_k. The positive semidefiniteness of this approximation simplifies the resulting algorithms in several places.
THE LEVENBERG-MARQUARDT METHOD
Recall that the Gauss-Newton method is like Newton's method with line search, except that we use the convenient and often effective approximation (10.24) for the Hessian. The Levenberg-Marquardt method can be obtained by using the same Hessian approximation, but replacing the line search with a trust-region strategy. The use of a trust region avoids one of the weaknesses of Gauss-Newton, namely, its behavior when the Jacobian J(x) is rank-deficient, or nearly so. Since the same Hessian approximations are used in each case, the local convergence properties of the two methods are similar.
The Levenberg-Marquardt method can be described and analyzed using the trust-region framework of Chapter 4. In fact, the Levenberg-Marquardt method is sometimes considered to be the progenitor of the trust-region approach for general unconstrained optimization discussed in Chapter 4. For a spherical trust region, the subproblem to be solved at each iteration is

    min_p (1/2) ||J_k p + r_k||²,  subject to ||p|| ≤ Δ_k,   (10.31)

where Δ_k > 0 is the trust-region radius. In effect, we are choosing the model function m_k in (4.3) to be

    m_k(p) = (1/2) ||r_k||² + pᵀ J_kᵀ r_k + (1/2) pᵀ J_kᵀ J_k p.   (10.32)
We drop the iteration counter k during the rest of this section and concern ourselves with the subproblem (10.31). The results of Chapter 4 allow us to characterize the solution of (10.31) in the following way: When the solution p^GN of the Gauss-Newton equations (10.23) lies strictly inside the trust region (that is, ||p^GN|| < Δ), then this step p^GN also solves the subproblem (10.31). Otherwise, there is a λ > 0 such that the solution p = p^LM of (10.31) satisfies ||p|| = Δ and

    (JᵀJ + λI) p = −Jᵀr.   (10.33)

This claim is verified in the following lemma, which is a straightforward consequence of
Theorem 4.1 from Chapter 4.
Lemma 10.2.
The vector p^LM is a solution of the trust-region subproblem

    min_p ||Jp + r||²,  subject to ||p|| ≤ Δ,

if and only if p^LM is feasible and there is a scalar λ ≥ 0 such that

    (JᵀJ + λI) p^LM = −Jᵀr,   (10.34a)
    λ (Δ − ||p^LM||) = 0.   (10.34b)
PROOF. In Theorem 4.1, the semidefiniteness condition (4.8c) is satisfied automatically, since JᵀJ is positive semidefinite and λ ≥ 0. The two conditions (10.34a) and (10.34b) follow from (4.8a) and (4.8b), respectively. □
Note that the equations (10.33) are just the normal equations for the following linear least-squares problem:

    min_p (1/2) || [J; √λ I] p + [r; 0] ||²,   (10.35)

where [J; √λ I] denotes the (m + n) × n matrix obtained by stacking √λ I below J. Just as in the Gauss-Newton case, the equivalence between (10.33) and (10.35) gives us a way of solving the subproblem without computing the matrix-matrix product JᵀJ and its Cholesky factorization.
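Equation (10.35) gives an easy way to compute p^LM for a given λ by stacking the matrices (our sketch; the search over λ that matches the radius Δ is omitted here).

    import numpy as np

    def lm_step(J, r, lam):
        # Solve (10.33) by forming the augmented least-squares problem (10.35):
        # min_p || [J; sqrt(lam) I] p + [r; 0] ||.
        m, n = J.shape
        A = np.vstack([J, np.sqrt(lam) * np.eye(n)])
        b = np.concatenate([-r, np.zeros(n)])
        p, *_ = np.linalg.lstsq(A, b, rcond=None)
        return p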
IMPLEMENTATION OF THE LEVENBERG-MARQUARDT METHOD
To find a value of λ that approximately matches the given Δ in Lemma 10.2, we can use the root-finding algorithm described in Chapter 4. It is easy to safeguard this procedure: The Cholesky factor R_λ is guaranteed to exist whenever the current estimate of λ is positive, since the approximate Hessian B = JᵀJ is already positive semidefinite. Because of the special structure of B, we do not need to compute the Cholesky factorization of B + λI from scratch in each iteration of Algorithm 4.1. Rather, we present an efficient technique for finding the following QR factorization of the coefficient matrix in (10.35):
\[
\begin{bmatrix} R_\lambda \\ 0 \end{bmatrix} = Q_\lambda^T \begin{bmatrix} J \\ \sqrt{\lambda}\, I \end{bmatrix}   (10.36)
\]
(Q_λ orthogonal, R_λ upper triangular). The upper triangular factor R_λ satisfies R_λᵀ R_λ = JᵀJ + λI.
We can save computer time in the calculation of the factorization (10.36) by using a combination of Householder and Givens transformations. Suppose we use Householder transformations to calculate the QR factorization of J alone as
\[
J = Q \begin{bmatrix} R \\ 0 \end{bmatrix}.   (10.37)
\]
We then have
\[
\begin{bmatrix} R \\ 0 \\ \sqrt{\lambda}\, I \end{bmatrix} = \begin{bmatrix} Q^T & \\ & I \end{bmatrix} \begin{bmatrix} J \\ \sqrt{\lambda}\, I \end{bmatrix}.   (10.38)
\]
The leftmost matrix in this formula is upper triangular except for the n nonzero terms of the matrix √λ I. These can be eliminated by a sequence of n(n + 1)/2 Givens rotations, in which the diagonal elements of the upper triangular part are used to eliminate the nonzeros of √λ I and the fill-in terms that arise in the process. The first few steps of this process are as follows:
- rotate row n of R with row n of √λ I, to eliminate the (n, n) element of √λ I;
- rotate row n − 1 of R with row n − 1 of √λ I, to eliminate the (n − 1, n − 1) element of the latter matrix. This step introduces fill-in in position (n − 1, n) of √λ I, which is eliminated by rotating row n of R with row n − 1 of √λ I, to eliminate the fill-in element at position (n − 1, n);
- rotate row n − 2 of R with row n − 2 of √λ I, to eliminate the (n − 2) diagonal in the latter matrix. This step introduces fill-in in the (n − 2, n − 1) and (n − 2, n) positions, which we eliminate by ...
and so on. If we gather all the Givens rotations into a matrix Q̄, we obtain from (10.38) that
\[
\bar{Q}^T \begin{bmatrix} R \\ 0 \\ \sqrt{\lambda}\, I \end{bmatrix} = \begin{bmatrix} R_\lambda \\ 0 \\ 0 \end{bmatrix},
\]
and hence (10.36) holds with
\[
Q_\lambda = \begin{bmatrix} Q & \\ & I \end{bmatrix} \bar{Q}.
\]
The advantage of this combined approach is that when the value of λ is changed in the root-finding algorithm, we need only recalculate Q̄ and not the Householder part of the factorization (10.38). This feature can save a lot of computation in the case of m ≫ n, since just O(n³) operations are required to recalculate Q̄ and R_λ for each value of λ, after the initial cost of O(mn²) operations needed to calculate Q in (10.37).
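The cost structure can be mimicked with dense factorizations. The following sketch is ours; it uses a full QR of the stacked (2n) × n matrix in place of the Givens scheme described above, so it shares the O(mn²)-once, O(n³)-per-λ pattern but not the constant factors:

import numpy as np

def factor_J(J):
    # One-time O(m n^2) Householder QR of J alone, as in (10.37).
    Q, R = np.linalg.qr(J)        # reduced factorization; R is n x n
    return Q, R

def R_lambda(R, lam):
    # O(n^3) work for each new lambda: refactor only [R; sqrt(lam) I].
    # The triangular factor satisfies R_lam^T R_lam = J^T J + lam*I.
    n = R.shape[0]
    _, Rlam = np.linalg.qr(np.vstack([R, np.sqrt(lam) * np.eye(n)]))
    return Rlam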
Least-squares problems are often poorly scaled. Some of the variables could have values of about 10⁴, while other variables could be of order 10⁻⁶. If such wide variations are ignored, the algorithms above may encounter numerical difficulties or produce solutions of poor quality. One way to reduce the effects of poor scaling is to use an ellipsoidal trust region in place of the spherical trust region defined above. The step is confined to an ellipse in which the lengths of the principal axes are related to the typical values of the corresponding variables. Analytically, the trust-region subproblem becomes
\[
\min_p \; \tfrac{1}{2} \|J_k p + r_k\|^2, \quad \text{subject to } \|D_k p\| \le \Delta_k,   (10.39)
\]
where D_k is a diagonal matrix with positive diagonal entries (cf. (7.13)). Instead of (10.33), the solution of (10.39) satisfies an equation of the form
\[
(J_k^T J_k + \lambda D_k^2)\, p_k^{LM} = -J_k^T r_k,   (10.40)
\]
and, equivalently, solves the linear least-squares problem
\[
\min_p \; \left\| \begin{bmatrix} J_k \\ \sqrt{\lambda}\, D_k \end{bmatrix} p + \begin{bmatrix} r_k \\ 0 \end{bmatrix} \right\|^2.   (10.41)
\]
The diagonals of the scaling matrix D_k can change from iteration to iteration, as we gather information about the typical range of values for each component of x. If the variation in these elements is kept within certain bounds, then the convergence theory for the spherical case continues to hold, with minor modifications. Moreover, the technique described above for calculating R_λ needs no modification. Seber and Wild [280] suggest choosing the diagonals of D_k² to match those of J_kᵀJ_k, to make the algorithm invariant under diagonal scaling of the components of x. This approach is analogous to the technique of scaling by diagonal elements of the Hessian, which was described in Section 4.5 in the context of trust-region algorithms for unconstrained optimization.
For problems in which m and n are large and J(x) is sparse, we may prefer to solve (10.31) or (10.39) approximately using the CG–Steihaug algorithm, Algorithm 7.2 from Chapter 7, with J_kᵀJ_k replacing the exact Hessian ∇²f_k. Positive semidefiniteness of the matrix J_kᵀJ_k makes for some simplification of this algorithm, because negative curvature cannot arise. It is not necessary to calculate J_kᵀJ_k explicitly to implement Algorithm 7.2; the matrix–vector products required by the algorithm can be found by forming matrix–vector products with J_k and J_kᵀ separately.
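The matrix-free idea can be expressed in a few lines. This sketch is ours and uses plain conjugate gradients from SciPy as a simplified stand-in for the trust-region CG–Steihaug variant described in the text:

from scipy.sparse.linalg import LinearOperator, cg

def gauss_newton_cg_step(J, r):
    # Approximately solve (J^T J) p = -J^T r using only products with
    # J and J^T; J may be dense or any scipy.sparse matrix.
    n = J.shape[1]
    H = LinearOperator((n, n), matvec=lambda v: J.T @ (J @ v), dtype=float)
    p, info = cg(H, -(J.T @ r))
    return p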
CONVERGENCE OF THE LEVENBERG–MARQUARDT METHOD
It is not necessary to solve the trust-region problem (10.31) exactly in order for the Levenberg–Marquardt method to enjoy global convergence properties. The following convergence result can be obtained as a direct consequence of Theorem 4.6.
Theorem 10.3.
Let η ∈ (0, 1/4) in Algorithm 4.1 of Chapter 4, and suppose that the level set L defined in (10.29) is bounded and that the residual functions r_j, j = 1, 2, ..., m, are Lipschitz continuously differentiable in a neighborhood N of L. Assume that for each k, the approximate solution p_k of (10.31) satisfies the inequality
\[
m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \; \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right),   (10.42)
\]
for some constant c₁ > 0, and in addition ‖p_k‖ ≤ γ Δ_k for some constant γ ≥ 1. We then have that
\[
\lim_{k \to \infty} \nabla f_k = \lim_{k \to \infty} J_k^T r_k = 0.
\]
PROOF. The smoothness assumption on r_j implies that we can choose a constant M > 0 such that ‖J_kᵀJ_k‖ ≤ M for all iterates k. Note too that the objective f is bounded below by zero. Hence, the assumptions of Theorem 4.6 are satisfied, and the result follows immediately. □
As in Chapter 4, there is no need to calculate the right-hand side in the inequality (10.42) or to check it explicitly. Instead, we can simply require the decrease given by our approximate solution p_k of (10.31) to at least match the decrease given by the Cauchy point, which can be calculated inexpensively in the same way as in Chapter 4. If we use the iterative CG–Steihaug approach, Algorithm 7.2, the condition (10.42) is satisfied automatically for c₁ = 1/2, since the Cauchy point is the first estimate of p_k computed by this approach, while subsequent estimates give smaller values for the model function.
The local convergence behavior of Levenberg–Marquardt is similar to that of the Gauss–Newton method. Near a solution x* at which the first term of the Hessian ∇²f(x*) (10.5) dominates the second term, the model function in (10.31) becomes an increasingly accurate representation of f, so the trust region eventually becomes inactive and the algorithm takes Gauss–Newton steps, giving the rapid local convergence expression (10.30).
METHODS FOR LARGE-RESIDUAL PROBLEMS
In large-residual problems, the quadratic model in (10.31) is an inadequate representation of the function f, because the second-order part of the Hessian ∇²f(x) is too significant to be ignored. In data-fitting problems, the presence of large residuals may indicate that the model is inadequate or that errors have been made in monitoring the observations. Still, the practitioner may need to solve the least-squares problem with the current model and data, to indicate where improvements are needed in the weighting of observations, modeling, or data collection.
On large-residual problems, the asymptotic convergence rate of Gauss–Newton and Levenberg–Marquardt algorithms is only linear: slower than the superlinear convergence rate attained by algorithms for general unconstrained problems, such as Newton or quasi-Newton. If the individual Hessians ∇²r_j are easy to calculate, it may be better to ignore the structure of the least-squares objective and apply Newton's method, with trust region or line search, to the problem of minimizing f. Quasi-Newton methods, which attain a superlinear convergence rate without requiring calculation of ∇²r_j, are another option. However, the behavior of both Newton and quasi-Newton methods on early iterations (before reaching a neighborhood of the solution) may be inferior to Gauss–Newton and Levenberg–Marquardt.
Of course, we often do not know beforehand whether a problem will turn out to have small or large residuals at the solution. It seems reasonable, therefore, to consider hybrid algorithms, which would behave like Gauss–Newton or Levenberg–Marquardt if the residuals turn out to be small (and hence take advantage of the cost savings associated with these methods) but switch to Newton or quasi-Newton steps if the residuals at the solution appear to be large.
There are a couple of ways to construct hybrid algorithms. One approach, due to Fletcher and Xu (see Fletcher [101]), maintains a sequence of positive definite Hessian approximations B_k. If the Gauss–Newton step from x_k reduces the function f by a certain fixed amount (say, a factor of 5), then this step is taken and B_k is overwritten by J_kᵀJ_k. Otherwise, a direction is computed using B_k, and the new point x_{k+1} is obtained by performing a line search. In either case, a BFGS-like update is applied to B_k to obtain a new approximation B_{k+1}. In the zero-residual case, the method eventually always takes Gauss–Newton steps (giving quadratic convergence), while it eventually reduces to BFGS in the nonzero-residual case (giving superlinear convergence). Numerical results in Fletcher [101, Tables 6.1.2, 6.1.3] show good results for this approach on small-, large-, and zero-residual problems.
A second way to combine Gauss–Newton and quasi-Newton ideas is to maintain approximations to just the second-order part of the Hessian. That is, we maintain a sequence of matrices S_k that approximate the summation term Σ_{j=1}^m r_j(x_k) ∇²r_j(x_k) in (10.5), and then use the overall Hessian approximation
\[
B_k = J_k^T J_k + S_k
\]
in a trust-region or line search model for calculating the step p_k. Updates to S_k are devised so that the approximate Hessian B_k, or its constituent parts, mimics the behavior of the corresponding exact quantities over the step just taken. The update formula is based on a secant equation, which arises also in the context of unconstrained minimization (6.6) and nonlinear equations (11.27). In the present instance, there are a number of different ways to define the secant equation and to specify the other conditions needed for a complete update formula for S_k. We describe the algorithm of Dennis, Gay, and Welsch [90], which is probably the best-known algorithm in this class because of its implementation in the well-known NL2SOL package.
In [90], the secant equation is motivated in the following way. Ideally, S_{k+1} should be a close approximation to the exact second-order term at x = x_{k+1}; that is,
\[
S_{k+1} \approx \sum_{j=1}^m r_j(x_{k+1}) \nabla^2 r_j(x_{k+1}).
\]
Since we do not want to calculate the individual Hessians ∇²r_j in this formula, we could replace each of them with an approximation (B_j)_{k+1} and impose the condition that (B_j)_{k+1} should mimic the behavior of its exact counterpart ∇²r_j over the step just taken; that is,
\[
(B_j)_{k+1} (x_{k+1} - x_k) = \nabla r_j(x_{k+1}) - \nabla r_j(x_k)
= (\text{row } j \text{ of } J(x_{k+1}))^T - (\text{row } j \text{ of } J(x_k))^T.
\]
This condition leads to a secant equation on S_{k+1}, namely,
\[
S_{k+1}(x_{k+1} - x_k) = \sum_{j=1}^m r_j(x_{k+1}) (B_j)_{k+1} (x_{k+1} - x_k)
= \sum_{j=1}^m r_j(x_{k+1}) \left[ (\text{row } j \text{ of } J(x_{k+1}))^T - (\text{row } j \text{ of } J(x_k))^T \right]
= J_{k+1}^T r_{k+1} - J_k^T r_{k+1}.
\]
As usual, this condition does not completely specify the new approximation S_{k+1}. Dennis, Gay, and Welsch add requirements that S_{k+1} be symmetric and that the difference S_{k+1} − S_k from the previous estimate S_k be minimized in a certain sense, and derive the following update formula:
\[
S_{k+1} = S_k + \frac{(y^\sharp - S_k s)\, y^T + y\, (y^\sharp - S_k s)^T}{y^T s} - \frac{(y^\sharp - S_k s)^T s}{(y^T s)^2}\, y y^T,   (10.43)
\]
where
\[
s = x_{k+1} - x_k, \quad y = J_{k+1}^T r_{k+1} - J_k^T r_k, \quad y^\sharp = J_{k+1}^T r_{k+1} - J_k^T r_{k+1}.
\]
Note that (10.43) is a slight variant on the DFP update for unconstrained minimization. It would be identical if y♯ and y were the same.
Dennis, Gay, and Welsch use their approximate Hessian J_kᵀJ_k + S_k in conjunction with a trust-region strategy, but a few more features are needed to enhance its performance. One deficiency of the basic update strategy for S_k is that this matrix is not guaranteed to vanish as the iterates approach a zero-residual solution, so it can interfere with superlinear convergence. This problem is avoided by scaling S_k prior to its update; we replace S_k by τ_k S_k on the right-hand side of (10.43), where
\[
\tau_k = \min\left( 1, \; \frac{|s^T y^\sharp|}{|s^T S_k s|} \right).
\]
A final modification in the overall algorithm is that the S_k term is omitted from the Hessian approximation when the resulting Gauss–Newton model produces a sufficiently good step.
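The pieces above assemble into a short update routine. The following is our illustrative rendering of the scaled update (10.43); the small guard in the denominator is ours, added only to avoid division by zero in this sketch:

import numpy as np

def dgw_update(S, s, y, y_sharp):
    # Dennis-Gay-Welsch update of the second-order part S_k, with the
    # scaling tau_k applied to S_k before the update (10.43).
    # s = x_{k+1} - x_k, y = J_{k+1}^T r_{k+1} - J_k^T r_k,
    # y_sharp = J_{k+1}^T r_{k+1} - J_k^T r_{k+1}; assumes s^T y != 0.
    sTy = s @ y
    tau = min(1.0, abs(s @ y_sharp) / max(abs(s @ (S @ s)), 1e-16))
    S = tau * S
    z = y_sharp - S @ s            # deviation from the secant equation
    return S + (np.outer(z, y) + np.outer(y, z)) / sTy \
             - ((z @ s) / sTy**2) * np.outer(y, y)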
10.4 ORTHOGONAL DISTANCE REGRESSION
In Example 10.1 we assumed that no errors were made in noting the time at which the blood samples were drawn, so that the differences between the model φ(x; t_j) and the observation y_j were due to inadequacy in the model or measurement errors in y_j. We assumed that any errors in the ordinates (the times t_j) are tiny by comparison with the errors in the observations. This assumption often is reasonable, but there are cases where the answer can be seriously distorted if we fail to take possible errors in the ordinates into account. Models that take these errors into account are known in the statistics literature as errors-in-variables models [280, Chapter 10], and the resulting optimization problems are referred to as total least squares in the case of a linear model (see Golub and Van Loan [136, Chapter 5]) or as orthogonal distance regression in the nonlinear case (see Boggs, Byrd, and Schnabel [30]).
We formulate this problem mathematically by introducing perturbations δ_j for the ordinates t_j, as well as perturbations ε_j for y_j, and seeking the values of these 2m perturbations that minimize the discrepancy between the model and the observations, as measured by a weighted least-squares objective function. To be precise, we relate the quantities t_j, y_j, ε_j, and δ_j by
\[
y_j = \phi(x; t_j + \delta_j) + \epsilon_j, \quad j = 1, 2, \ldots, m,   (10.44)
\]
and define the minimization problem as
\[
\min_{x, \delta, \epsilon} \; \tfrac{1}{2} \sum_{j=1}^m \left[ w_j^2 \epsilon_j^2 + d_j^2 \delta_j^2 \right], \quad \text{subject to (10.44)}.   (10.45)
\]
The quantities w_i and d_i are weights, selected either by the modeler or by some automatic estimate of the relative significance of the error terms.
It is easy to see how the term "orthogonal distance regression" originates when we graph this problem; see Figure 10.2. If all the weights w_i and d_i are equal, then each term in the summation (10.45) is simply the shortest distance between the point (t_j, y_j) and the curve φ(x; t) (plotted as a function of t). The shortest path between each point and the curve is orthogonal to the curve at the point of intersection.
Using the constraints (10.44) to eliminate the variables ε_j from (10.45), we obtain the unconstrained least-squares problem
\[
\min_{x, \delta} F(x, \delta) = \tfrac{1}{2} \sum_{j=1}^m \left[ w_j^2 (y_j - \phi(x; t_j + \delta_j))^2 + d_j^2 \delta_j^2 \right] = \tfrac{1}{2} \sum_{j=1}^{2m} r_j^2(x, \delta),   (10.46)
\]
Figure 10.2 Orthogonal distance regression minimizes the sum of squares of the distance from each point to the curve.
where δ = (δ₁, δ₂, ..., δ_m)ᵀ and we have defined
\[
r_j(x, \delta) = \begin{cases} w_j \left[ \phi(x; t_j + \delta_j) - y_j \right], & j = 1, 2, \ldots, m, \\ d_{j-m}\, \delta_{j-m}, & j = m+1, \ldots, 2m. \end{cases}   (10.47)
\]
Note that (10.46) is now a standard least-squares problem with 2m residuals and m + n unknowns, which we can solve by using the techniques in this chapter. A naive implementation of this strategy may, however, be quite expensive, since the number of parameters (n + m) and the number of observations (2m) may both be much larger than for the original problem.
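The 2m-vector of residuals is simple to assemble; the sketch below is ours, with a hypothetical model function `phi` that evaluates elementwise on a vector of ordinates, and it makes the structure of (10.47) explicit:

import numpy as np

def odr_residuals(x, delta, t, y, w, d, phi):
    # Stack the 2m residuals of (10.47): m weighted model errors on
    # top, m weighted ordinate perturbations below.
    top = w * (phi(x, t + delta) - y)      # j = 1, ..., m
    bottom = d * delta                     # j = m+1, ..., 2m
    return np.concatenate([top, bottom])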
Fortunately, the Jacobian matrix for (10.46) has a special structure that can be exploited in implementing the Gauss–Newton or Levenberg–Marquardt methods. Many of its components are zero; for instance, we have
\[
\frac{\partial r_j}{\partial \delta_i} = \frac{\partial \left[ w_j (\phi(x; t_j + \delta_j) - y_j) \right]}{\partial \delta_i} = 0, \quad i, j = 1, 2, \ldots, m, \; i \ne j,
\]
and
\[
\frac{\partial r_j}{\partial x_i} = 0, \quad j = m+1, \ldots, 2m, \quad i = 1, 2, \ldots, n.
\]
Additionally, we have for j = 1, 2, ..., m and i = 1, 2, ..., m that
\[
\frac{\partial r_{m+j}}{\partial \delta_i} = \begin{cases} d_j & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}
\]
Hence, we can partition the Jacobian of the residual function r defined by (10.47) into blocks and write
\[
J(x, \delta) = \begin{bmatrix} \hat{J} & V \\ 0 & D \end{bmatrix},   (10.48)
\]
where V and D are m × m diagonal matrices and Ĵ is the m × n matrix of partial derivatives of the functions w_j φ(x; t_j + δ_j) with respect to x. Boggs, Byrd, and Schnabel [30] apply the Levenberg–Marquardt algorithm to (10.46) and note that block elimination can be used to solve the subproblems (10.33), (10.35) efficiently. Given the partitioning (10.48), we can partition the step vector p and the residual vector r accordingly as
\[
p = \begin{bmatrix} p_x \\ p_\delta \end{bmatrix}, \quad r = \begin{bmatrix} \hat{r}_1 \\ \hat{r}_2 \end{bmatrix},
\]
and write the normal equations (10.33) in the partitioned form
\[
\begin{bmatrix} \hat{J}^T \hat{J} + \lambda I & \hat{J}^T V \\ V \hat{J} & V^2 + D^2 + \lambda I \end{bmatrix}
\begin{bmatrix} p_x \\ p_\delta \end{bmatrix}
= - \begin{bmatrix} \hat{J}^T \hat{r}_1 \\ V \hat{r}_1 + D \hat{r}_2 \end{bmatrix}.   (10.49)
\]
Since the lower right submatrix V² + D² + λI is diagonal, it is easy to eliminate p_δ from this system and obtain a smaller n × n system to be solved for p_x alone. The total cost of finding a step is only marginally greater than for the m × n problem arising from the standard least-squares model.
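A compact dense rendering of this elimination (our sketch; V and D are passed as length-m vectors holding the diagonals of the corresponding matrices) is:

import numpy as np

def odr_lm_step(Jhat, V, D, r1, r2, lam):
    # Solve the partitioned normal equations (10.49) by eliminating
    # p_delta, leaving an n x n system for p_x.
    g = V**2 + D**2 + lam              # diagonal of the lower-right block
    rhs_x = -(Jhat.T @ r1)
    rhs_d = -(V * r1 + D * r2)
    A = Jhat.T @ Jhat + lam * np.eye(Jhat.shape[1]) \
        - Jhat.T @ ((V**2 / g)[:, None] * Jhat)
    b = rhs_x - Jhat.T @ (V * rhs_d / g)
    p_x = np.linalg.solve(A, b)
    p_delta = (rhs_d - V * (Jhat @ p_x)) / g
    return p_x, p_delta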
NOTES AND REFERENCES
Algorithms for linear least squares are discussed comprehensively by Björck [29], who includes detailed error analyses of the different algorithms and software listings. He considers not just the basic problem (10.13) but also the situation in which there are bounds (for example, x ≥ 0) or linear constraints (for example, Ax ≥ b) on the variables. Golub and Van Loan [136, Chapter 5] survey the state of the art, including discussion of the suitability of the different approaches (for example, normal equations vs. QR factorization) for different problem types. A classical reference on linear least squares is Lawson and Hanson [188].
Very large nonlinear least-squares problems arise in numerous areas of application, such as medical imaging, geophysics, economics, and engineering design. In many instances, both the number of variables n and the number of residuals m are large, but it is also quite common that only m is large.
The original description of the Levenberg–Marquardt algorithm [190, 203] did not make the connection with the trust-region concept. Rather, it adjusted the value of λ in (10.33) directly, increasing or decreasing it by a certain factor according to whether or not the previous trial step was effective in decreasing f. The heuristics for adjusting λ were analogous to those used for adjusting the trust-region radius Δ_k in Algorithm 4.1. Similar convergence results to Theorem 10.3 can be proved for algorithms that use this approach (see, for instance, Osborne [231]), independently of trust-region analysis. The connection with trust regions was firmly established by Moré [210].
Wright and Holt [318] present an inexact Levenberg–Marquardt approach for large-scale nonlinear least squares that manipulates the parameter λ directly rather than making use of the connection to trust-region algorithms. This method takes steps p_k that, analogously to (7.2) and (7.3) in Chapter 7, satisfy the system
\[
\left\| (J_k^T J_k + \lambda_k I)\, p_k + J_k^T r_k \right\| \le \eta_k \|J_k^T r_k\|, \quad \text{for some } \eta_k \in [0, \eta],
\]
where η ∈ (0, 1) is a constant and {η_k} is a forcing sequence. A ratio of actual to predicted decrease is used to decide whether the step p_k should be taken, and convergence to stationary points can be proved under certain assumptions. The method can be implemented efficiently by using Algorithm LSQR of Paige and Saunders [234] to calculate the approximate solution of (10.35) since, for a small marginal cost, this algorithm can compute approximate solutions for a number of different values of λ_k simultaneously. Hence, we can compute values of p_k corresponding to a range of values of λ_k, and choose the actual step to be the one corresponding to the smallest λ_k for which the actual-to-predicted decrease ratio is satisfactory.
Nonlinear least-squares software is fairly prevalent because of the high demand for it. Major numerical software libraries such as IMSL, HSL, NAG, and SAS, as well as programming environments such as Mathematica and Matlab, contain robust nonlinear least-squares implementations. Other high-quality implementations include DFNLP, MINPACK, NL2SOL, and NLSSOL; see Moré and Wright [217, Chapter 3]. The nonlinear programming packages LANCELOT, KNITRO, and SNOPT provide large-scale implementations of the Gauss–Newton and Levenberg–Marquardt methods. The orthogonal distance regression algorithm is implemented by ODRPACK [31].
All these routines (which can be accessed through the web) give the user the option of either supplying Jacobians explicitly or else allowing the code to compute them by finite differencing. In the latter case, the user need only write code to compute the residual vector r(x); see Chapter 8. Seber and Wild [280, Chapter 15] describe some of the important practical issues in selecting software for statistical applications.
EXERCISES
10.1 Let J be an m × n matrix with m ≥ n, and let y ∈ IR^m be a vector.
(a) Show that J has full column rank if and only if JᵀJ is nonsingular.
(b) Show that J has full column rank if and only if JᵀJ is positive definite.

10.2 Show that the function f(x) in (10.13) is convex.

10.3 Show that
(a) if Q is an orthogonal matrix, then ‖Qx‖ = ‖x‖ for any vector x;
(b) the matrices R in (10.15) and R̄ in (10.17) are identical if Π = I, provided that J has full column rank n.

10.4
(a) Show that x* defined in (10.22) is a minimizer of (10.13).
(b) Find ‖x*‖² and conclude that this norm is minimized when τ_i = 0 for all i with σ_i = 0.

10.5 Suppose that each residual function r_j and its gradient are Lipschitz continuous with Lipschitz constant L, that is,
\[
|r_j(x) - r_j(\tilde{x})| \le L \|x - \tilde{x}\|, \quad \|\nabla r_j(x) - \nabla r_j(\tilde{x})\| \le L \|x - \tilde{x}\|,
\]
for all j = 1, 2, ..., m and all x, x̃ ∈ D, where D is a compact subset of IRⁿ. Assume also that the r_j are bounded on D, that is, there exists M > 0 such that |r_j(x)| ≤ M for all j = 1, 2, ..., m and all x ∈ D. Find Lipschitz constants for the Jacobian J (10.3) and the gradient ∇f (10.4) over D.

10.6 Express the solution p of (10.33) in terms of the singular-value decomposition of J(x) and the scalar λ ≥ 0. Express its squared norm ‖p‖² in these same terms, and show that
\[
\lim_{\lambda \to 0} p = -\sum_{\sigma_i \ne 0} \frac{u_i^T r}{\sigma_i} v_i.
\]
CHAPTER 11
Nonlinear Equations
In many applications we do not need to optimize an objective function explicitly, but rather to find values of the variables in a model that satisfy a number of given relationships. When these relationships take the form of n equalities (the same number of equality conditions as variables in the model), the problem is one of solving a system of nonlinear equations. We write this problem mathematically as
\[
r(x) = 0,   (11.1)
\]
where r : IRⁿ → IRⁿ is a vector function, that is,
\[
r(x) = \begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_n(x) \end{bmatrix}.
\]
In this chapter, we assume that each function r_i : IRⁿ → IR, i = 1, 2, ..., n, is smooth. A vector x* for which (11.1) is satisfied is called a solution or root of the nonlinear equations. A simple example is the system
\[
r(x) = \begin{bmatrix} x_2^2 - 1 \\ \sin x_1 - x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},
\]
which is a system of n = 2 equations with infinitely many solutions, two of which are x* = (3π/2, −1)ᵀ and x* = (π/2, 1)ᵀ. In general, the system (11.1) may have no solutions, a unique solution, or many solutions.
The techniques for solving nonlinear equations overlap in their motivation, analysis, and implementation with optimization techniques discussed in earlier chapters. In both optimization and nonlinear equations, Newton's method lies at the heart of many important algorithms. Features such as line searches, trust regions, and inexact solution of the linear algebra subproblems at each iteration are important in both areas, as are other issues such as derivative evaluation and global convergence.
Because some important algorithms for nonlinear equations proceed by minimizing a sum of squares of the equations, that is,
\[
\min_x \; \sum_{i=1}^n r_i^2(x),
\]
there are particularly close connections with the nonlinear least-squares problem discussed in Chapter 10. The differences are that in nonlinear equations, the number of equations equals the number of variables (instead of exceeding the number of variables, as is typically the case in Chapter 10), and that we expect all equations to be satisfied at the solution, rather than just minimizing the sum of squares. This point is important because the nonlinear equations may represent physical or economic constraints such as conservation laws or consistency principles, which must hold exactly in order for the solution to be meaningful.
Many applications require us to solve a sequence of closely related nonlinear systems, as in the following example.
EXAMPLE 11.1 (RHEINBOLDT; SEE [212])
An interesting problem in control is to analyze the stability of an aircraft in response to the commands of the pilot. The following is a simplified model based on force-balance equations, in which gravity terms have been neglected.
The equilibrium equations for a particular aircraft are given by a system of 5 equations in 8 unknowns of the form
\[
F(x) = Ax + \phi(x) = 0,   (11.2)
\]
where F : IR⁸ → IR⁵, the matrix A is given by
\[
A = \begin{bmatrix}
-3.933 & 0.107 & 0.126 & 0 & -9.99 & 0 & -45.83 & -7.64 \\
0 & -0.987 & 0 & -22.95 & 0 & -28.37 & 0 & 0 \\
0.002 & 0 & -0.235 & 0 & 5.67 & 0 & -0.921 & -6.51 \\
0 & 1.0 & 0 & -1.0 & 0 & -0.168 & 0 & 0 \\
0 & 0 & -1.0 & 0 & -0.196 & 0 & -0.0071 & 0
\end{bmatrix},
\]
and the nonlinear part is defined by
\[
\phi(x) = \begin{bmatrix}
-0.727\, x_2 x_3 + 8.39\, x_3 x_4 - 684.4\, x_4 x_5 + 63.5\, x_4 x_2 \\
0.949\, x_1 x_3 + 0.173\, x_1 x_5 \\
-0.716\, x_1 x_2 - 1.578\, x_1 x_4 + 1.132\, x_4 x_2 \\
-x_1 x_5 \\
x_1 x_4
\end{bmatrix}.
\]
The first three variables x₁, x₂, x₃ represent the rates of roll, pitch, and yaw, respectively, while x₄ is the incremental angle of attack and x₅ the sideslip angle. The last three variables x₆, x₇, x₈ are the controls; they represent the deflections of the elevator, aileron, and rudder, respectively.
For a given choice of the control variables x₆, x₇, x₈, we obtain a system of 5 equations and 5 unknowns. If we wish to study the behavior of the aircraft as the controls are changed, we need to solve a system of nonlinear equations with unknowns x₁, x₂, ..., x₅ for each setting of the controls. □
Despite the many similarities between nonlinear equations and unconstrained and least-squares optimization algorithms, there are also some important differences. To obtain quadratic convergence in optimization we require second derivatives of the objective function, whereas knowledge of the first derivatives is sufficient in nonlinear equations.
Figure 11.1 The function r(x) = sin(5x) − x has three roots.
Quasi-Newton methods are perhaps less useful in nonlinear equations than in optimization. In unconstrained optimization, the objective function is the natural choice of merit function that gauges progress towards the solution, but in nonlinear equations various merit functions can be used, all of which have some drawbacks. Line search and trust-region techniques play an equally important role in optimization, but one can argue that trust-region algorithms have certain theoretical advantages in solving nonlinear equations.
Some of the difficulties that arise in trying to solve nonlinear equations can be illustrated by a simple scalar example (n = 1). Suppose we have
\[
r(x) = \sin(5x) - x,   (11.3)
\]
as plotted in Figure 11.1. From this figure we see that there are three solutions of the problem r(x) = 0, also known as roots of r, located at zero and at approximately ±0.519148. This situation of multiple solutions is similar to optimization problems where, for example, a function may have more than one local minimum. It is not quite the same, however: In the case of optimization, one of the local minima may have a lower function value than the others (making it a better solution), while in nonlinear equations all solutions are equally good from a mathematical viewpoint. If the modeler decides that the solution
found by the algorithm makes no sense on physical grounds, their model may need to be reformulated.
In this chapter we start by outlining algorithms related to Newton's method and examining their local convergence properties. Besides Newton's method itself, these include Broyden's quasi-Newton method, inexact Newton methods, and tensor methods. We then address global convergence, which is the issue of trying to force convergence to a solution from a remote starting point. Finally, we discuss a class of methods in which an easy problem (one to which the solution is well known) is gradually transformed into the problem F(x) = 0. In these so-called continuation (or homotopy) methods, we track the solution as the problem changes, with the aim of finishing up at a solution of F(x) = 0.
Throughout this chapter we make the assumption that the vector function r is continuously differentiable in the region D containing the values of x we are interested in. In other words, the Jacobian J(x) (the matrix of first partial derivatives of r(x), defined in the Appendix and in (10.3)) exists and is continuous. We say that x* satisfying r(x*) = 0 is a degenerate solution if J(x*) is singular, and a nondegenerate solution otherwise.
11.1 LOCAL ALGORITHMS
NEWTON'S METHOD FOR NONLINEAR EQUATIONS
Recall from Theorem 2.1 that Newton's method for minimizing f : IRⁿ → IR forms a quadratic model function by taking the first three terms of the Taylor series approximation of f around the current iterate x_k. The Newton step is the vector that minimizes this model. In the case of nonlinear equations, Newton's method is derived in a similar way, but with a linear model, one that involves function values and first derivatives of the functions r_i(x), i = 1, 2, ..., n, at the current iterate x_k. We justify this strategy by referring to the following multidimensional variant of Taylor's theorem.
Theorem 11.1.
Suppose that r : IRⁿ → IRⁿ is continuously differentiable in some convex open set D and that x and x + p are vectors in D. We then have that
\[
r(x + p) = r(x) + \int_0^1 J(x + tp)\, p \, dt.   (11.4)
\]
We can define a linear model M_k(p) of r(x_k + p) by approximating the second term on the right-hand side of (11.4) by J(x)p, and writing
\[
M_k(p) \stackrel{\text{def}}{=} r(x_k) + J(x_k)\, p.   (11.5)
\]
Newton's method, in its pure form, chooses the step p_k to be the vector for which M_k(p_k) = 0, that is, p_k = −J(x_k)⁻¹ r(x_k). We define it formally as follows.
Algorithm 11.1 (Newton's Method for Nonlinear Equations).
Choose x₀;
for k = 0, 1, 2, ...
    Calculate a solution p_k to the Newton equations
    \[
    J(x_k)\, p_k = -r(x_k);   (11.6)
    \]
    x_{k+1} ← x_k + p_k;
end (for)
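As a concrete rendering (ours, not from the text), the following sketch implements Algorithm 11.1 with a dense linear solve and applies it to the small two-equation system given earlier in this chapter:

import numpy as np

def newton_system(r_and_J, x0, tol=1e-10, max_iter=50):
    # Pure Newton iteration for r(x) = 0 (Algorithm 11.1).
    x = x0.copy()
    for _ in range(max_iter):
        r, J = r_and_J(x)
        if np.linalg.norm(r) < tol:
            break
        p = np.linalg.solve(J, -r)        # Newton equations (11.6)
        x = x + p
    return x

# The example system r(x) = (x_2^2 - 1, sin x_1 - x_2) and its Jacobian:
def r_and_J(x):
    r = np.array([x[1]**2 - 1.0, np.sin(x[0]) - x[1]])
    J = np.array([[0.0, 2.0 * x[1]],
                  [np.cos(x[0]), -1.0]])
    return r, J

x_root = newton_system(r_and_J, np.array([1.0, 0.5]))   # near (pi/2, 1)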
We use a linear model to derive the Newton step, rather than a quadratic model as in unconstrained optimization, because the linear model normally has a solution and yields an algorithm with rapid convergence properties. In fact, Newton's method for unconstrained optimization (see (2.15)) can be derived by applying Algorithm 11.1 to the nonlinear equations ∇f(x) = 0. We see also in Chapter 18 that sequential quadratic programming for equality-constrained optimization can be derived by applying Algorithm 11.1 to the nonlinear equations formed by the first-order optimality conditions (18.3) for this problem. Another connection is with the Gauss–Newton method for nonlinear least squares; the formula (11.6) is equivalent to (10.23) in the usual case in which J(x_k) is nonsingular.
When the iterate x_k is close to a nondegenerate root x*, Newton's method converges superlinearly, as we show in Theorem 11.2 below. Potential shortcomings of the method include the following.
- When the starting point is remote from a solution, Algorithm 11.1 can behave erratically. When J(x_k) is singular, the Newton step may not even be defined.
- First-derivative information (the Jacobian matrix J) may be difficult to obtain.
- It may be too expensive to find and calculate the Newton step p_k exactly when n is large.
- The root x* in question may be degenerate, that is, J(x*) may be singular.
An example of a degenerate problem is the scalar function r(x) = x², which has a single degenerate root at x* = 0. Algorithm 11.1, when started from any nonzero x₀, generates the sequence of iterates
\[
x_k = \frac{1}{2^k} x_0,
\]
which converges to the solution 0, but only at a linear rate.
As we show later in this chapter, Newton's method can be modified and enhanced in
various ways to get around most of these problems. The variants we describe form the basis of much of the available software for solving nonlinear equations.
We summarize the local convergence properties of Algorithm 11.1 in the following theorem. For part of this result, we make use of a Lipschitz continuity assumption on the Jacobian, by which we mean that there is a constant L such that
\[
\|J(x_0) - J(x_1)\| \le L \|x_0 - x_1\|,   (11.7)
\]
for all x₀ and x₁ in the domain in question.
Theorem 11.2.
Suppose that r is continuously differentiable in a convex open set D ⊂ IRⁿ. Let x* ∈ D be a nondegenerate solution of r(x) = 0, and let {x_k} be the sequence of iterates generated by Algorithm 11.1. Then when x_k ∈ D is sufficiently close to x*, we have
\[
\|x_{k+1} - x^*\| = o(\|x_k - x^*\|),   (11.8)
\]
indicating local Q-superlinear convergence. When r is Lipschitz continuously differentiable near x*, we have for all x_k sufficiently close to x* that
\[
\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2),   (11.9)
\]
indicating local Q-quadratic convergence.
PROOF. Since r(x*) = 0, we have from Theorem 11.1 that
\[
r(x_k) = r(x_k) - r(x^*) = J(x_k)(x_k - x^*) + w(x_k, x^*),   (11.10)
\]
where
\[
w(x_k, x^*) = \int_0^1 \left[ J(x_k + t(x^* - x_k)) - J(x_k) \right] (x_k - x^*) \, dt.   (11.11)
\]
From (A.12) and continuity of J, we have
\[
\|w(x_k, x^*)\| = \left\| \int_0^1 \left[ J(x_k + t(x^* - x_k)) - J(x_k) \right] (x_k - x^*) \, dt \right\| = o(\|x_k - x^*\|).   (11.12)
\]
Since J(x*) is nonsingular, there is a radius δ > 0 and a positive constant β* such that for all x in the ball B(x*, δ) defined by
\[
B(x^*, \delta) \stackrel{\text{def}}{=} \{x : \|x - x^*\| \le \delta\},   (11.13)
\]
we have that
\[
\|J(x)^{-1}\| \le \beta^* \quad \text{and} \quad x \in D.   (11.14)
\]
Assuming that x_k ∈ B(x*, δ), and recalling the definition (11.6), we multiply both sides of (11.10) by J(x_k)⁻¹ to obtain
\[
p_k = -(x_k - x^*) + J(x_k)^{-1}\, o(\|x_k - x^*\|),
\]
so that
\[
\|x_k + p_k - x^*\| = \|x_{k+1} - x^*\| = o(\|x_k - x^*\|),   (11.15)
\]
which yields (11.8).
When the Lipschitz continuity assumption (11.7) is satisfied, we can obtain a sharper estimate for the remainder term w(x_k, x*) defined in (11.11). By using (11.7) in (11.12), we obtain
\[
\|w(x_k, x^*)\| = O(\|x_k - x^*\|^2).   (11.16)
\]
By multiplying (11.10) by J(x_k)⁻¹ as above, we obtain
\[
p_k + (x_k - x^*) = -J(x_k)^{-1} w(x_k, x^*),
\]
so the estimate (11.9) follows as in (11.15). □

INEXACT NEWTON METHODS
Instead of solving (11.6) exactly, inexact Newton methods use search directions p_k that satisfy the condition
\[
\|r_k + J_k p_k\| \le \eta_k \|r_k\|, \quad \text{for some } \eta_k \in [0, \eta],   (11.17)
\]
where η ∈ (0, 1) is a constant. As in Chapter 7, we refer to {η_k} as the forcing sequence. Different methods make different choices of the forcing sequence, and they use different algorithms for finding the approximate solutions p_k. The general framework for this class of methods can be stated as follows.
Framework 11.2 (Inexact Newton for Nonlinear Equations).
Given η ∈ [0, 1);
Choose x₀;
for k = 0, 1, 2, ...
    Choose forcing parameter η_k ∈ [0, η];
    Find a vector p_k that satisfies (11.17);
    x_{k+1} ← x_k + p_k;
end (for)
The convergence theory for these methods depends only on the condition (11.17) and not on the particular technique used to calculate p_k. The most important methods in this class, however, make use of iterative techniques for solving linear systems of the form Jp = −r, such as GMRES (Saad and Schultz [273], Walker [302]) or other Krylov-space methods. Like the conjugate-gradient algorithm of Chapter 5 (which is not directly applicable here, since the coefficient matrix J is not symmetric positive definite), these methods typically require us to perform a matrix–vector multiplication of the form Jd for some d at each iteration, and to store a number of work vectors of length n. GMRES requires an additional vector to be stored at each iteration, so it must be restarted periodically (often every 10 or 20 iterations) to keep memory requirements at a reasonable level.
The matrix–vector products Jd can be computed without explicit knowledge of the Jacobian J. A finite-difference approximation to Jd that requires one evaluation of r is given by the formula (8.11). Calculation of Jd exactly (at least, to within the limits of finite-precision arithmetic) can be performed by using the forward mode of automatic differentiation, at a cost of at most a small multiple of an evaluation of r. Details of this procedure are given in Section 8.2.
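For example, a forward-difference product costs one extra residual evaluation. The sketch below is ours; in practice the step size eps would be tied to the square root of machine precision and the scales of x and d:

def jacobian_vector_product(r, x, rx, d, eps=1e-8):
    # Approximate J(x) d by a forward difference; rx = r(x) is the
    # residual vector already available at the current iterate.
    return (r(x + eps * d) - rx) / eps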
We do not discuss the iterative methods for sparse linear systems here, but refer the interested reader to Kelley [177] and Saad [272] for comprehensive descriptions and implementations of the most interesting techniques. We prove a local convergence theorem for the method, similar to Theorem 11.2.
Theorem 11.3.
Suppose that r is continuously differentiable in a convex open set D ⊂ IRⁿ. Let x* ∈ D be a nondegenerate solution of r(x) = 0, and let {x_k} be the sequence of iterates generated by Framework 11.2. Then when x_k ∈ D is sufficiently close to x*, the following are true:
(i) If η in (11.17) is sufficiently small, the convergence of {x_k} to x* is Q-linear.
(ii) If η_k → 0, the convergence is Q-superlinear.
(iii) If, in addition, J(·) is Lipschitz continuous in a neighborhood of x* and η_k = O(‖r_k‖), the convergence is Q-quadratic.
PROOF. We first rewrite (11.17) as
\[
J(x_k)\, p_k = -r(x_k) + v_k, \quad \text{where } \|v_k\| \le \eta_k \|r(x_k)\|.   (11.18)
\]
Since x* is a nondegenerate root, we have as in (11.14) that there is a radius δ > 0 such that ‖J(x)⁻¹‖ ≤ β* for some constant β* and all x ∈ B(x*, δ). By multiplying both sides of (11.18) by J(x_k)⁻¹ and rearranging, we find that
\[
\|p_k + J(x_k)^{-1} r(x_k)\| = \|J(x_k)^{-1} v_k\| \le \beta^* \eta_k \|r(x_k)\|.   (11.19)
\]
As in (11.10), we have that
\[
r(x) = J(x)(x - x^*) + w(x, x^*),   (11.20)
\]
where ρ(x) ≝ ‖w(x, x*)‖ / ‖x − x*‖ → 0 as x → x*. By reducing δ if necessary, we have from this expression that the following bound holds for all x ∈ B(x*, δ):
\[
\|r(x)\| \le 2 \|J(x^*)\| \|x - x^*\| + o(\|x - x^*\|) \le 4 \|J(x^*)\| \|x - x^*\|.   (11.21)
\]
We now set x = x_k in (11.20), and use (11.19) and (11.21) to obtain
\[
\|x_k + p_k - x^*\| \le \|p_k + J(x_k)^{-1} r(x_k)\| + \|J(x_k)^{-1}\| \|w(x_k, x^*)\|
\le \beta^* \eta_k \|r(x_k)\| + \beta^* \rho(x_k) \|x_k - x^*\|
\le \beta^* \left[ 4 \eta_k \|J(x^*)\| + \rho(x_k) \right] \|x_k - x^*\|.   (11.22)
\]
By choosing x_k close enough to x* that β*ρ(x_k) ≤ 1/4, and choosing η small enough that 4β*η‖J(x*)‖ ≤ 1/4, we have that the coefficient of ‖x_k − x*‖ in (11.22) is at most 1/2. Hence, since x_{k+1} = x_k + p_k, this formula indicates Q-linear convergence of {x_k} to x*, proving part (i).
Part (ii) follows immediately from the fact that the term in brackets in (11.22) goes to zero as x_k → x* and η_k → 0. For part (iii), we combine the techniques above with the logic of the second part of the proof of Theorem 11.2. Details are left as an exercise. □
BROYDEN'S METHOD
Secant methods, also known as quasi-Newton methods, do not require calculation of the Jacobian J(x). Instead, they construct their own approximation to this matrix, updating it at each iteration so that it mimics the behavior of the true Jacobian J over the step just taken. The approximate Jacobian, which we denote at iteration k by B_k, is then used to construct a linear model analogous to (11.5), namely
\[
M_k(p) = r(x_k) + B_k\, p.   (11.23)
\]
We obtain the step by setting this model to zero. When B_k is nonsingular, we have the following explicit formula (cf. (11.6)):
\[
p_k = -B_k^{-1} r(x_k).   (11.24)
\]
The requirement that the approximate Jacobian should mimic the behavior of the true Jacobian can be specified as follows. Let s_k denote the step from x_k to x_{k+1}, and let y_k
11.1. LOCAL ALGORITHMS 279
280 CHAPTER 11. NONLINEAR EQUATIONS
be the corresponding change in r, that is,
\[
s_k = x_{k+1} - x_k, \quad y_k = r(x_{k+1}) - r(x_k).   (11.25)
\]
From Theorem 11.1, we have that s_k and y_k are related by the expression
\[
y_k = \int_0^1 J(x_k + t s_k)\, s_k \, dt = J(x_{k+1})\, s_k + o(\|s_k\|).   (11.26)
\]
We require the updated Jacobian approximation B_{k+1} to satisfy the following equation, which is known as the secant equation,
\[
y_k = B_{k+1} s_k,   (11.27)
\]
which ensures that B_{k+1} and J(x_{k+1}) have similar behavior along the direction s_k. Note the similarity with the secant equation (6.6) in quasi-Newton methods for unconstrained optimization; the motivation is the same in both cases. The secant equation does not say anything about how B_{k+1} should behave along directions orthogonal to s_k. In fact, we can view (11.27) as a system of n linear equations in n² unknowns, where the unknowns are the components of B_{k+1}, so for n > 1 the equation (11.27) does not determine all the components of B_{k+1} uniquely. (The scalar case of n = 1 gives rise to the scalar secant method; see (A.60).)
The most successful practical algorithm is Broyden's method, for which the update formula is
\[
B_{k+1} = B_k + \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k}.   (11.28)
\]
The Broyden update makes the smallest possible change to the Jacobian (as measured by the Euclidean norm ‖B_k − B_{k+1}‖) that is consistent with (11.27), as we show in the following lemma.

Lemma 11.4 (Dennis and Schnabel [92], Lemma 8.1.1).
Among all matrices B satisfying B s_k = y_k, the matrix B_{k+1} defined by (11.28) minimizes the difference ‖B − B_k‖.

PROOF. Let B be any matrix that satisfies B s_k = y_k. By the properties of the Euclidean norm (see (A.10)) and the fact that ‖s sᵀ / sᵀs‖ = 1 for any vector s (see Exercise 11.1), we have
\[
\|B_{k+1} - B_k\| = \left\| \frac{(y_k - B_k s_k)\, s_k^T}{s_k^T s_k} \right\|
= \left\| \frac{(B - B_k)\, s_k s_k^T}{s_k^T s_k} \right\|
\le \|B - B_k\| \left\| \frac{s_k s_k^T}{s_k^T s_k} \right\| = \|B - B_k\|.
\]
Hence, we have that
\[
B_{k+1} \in \arg \min_{B : \, y_k = B s_k} \|B - B_k\|,
\]
and the result is proved. □
In the specification of the algorithm below, we allow a line search to be performed along the search direction p_k, so that s_k = α p_k for some α > 0 in the formula (11.25). See below for details about line-search methods.
Algorithm 11.3 (Broyden).
Choose x₀ and a nonsingular initial Jacobian approximation B₀;
for k = 0, 1, 2, ...
    Calculate a solution p_k to the linear equations
    \[
    B_k\, p_k = -r(x_k);   (11.29)
    \]
    Choose α_k by performing a line search along p_k;
    x_{k+1} ← x_k + α_k p_k;
    s_k ← x_{k+1} − x_k;
    y_k ← r(x_{k+1}) − r(x_k);
    Obtain B_{k+1} from the formula (11.28);
end (for)
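A bare-bones rendering of Algorithm 11.3 (ours; it takes unit steps α_k = 1 and factors B_k afresh at each iteration, where a production code would use a line search and update a factorization of B_k instead):

import numpy as np

def broyden(r, x0, B0, tol=1e-10, max_iter=200):
    x, B = x0.copy(), B0.copy()
    rx = r(x)
    for _ in range(max_iter):
        if np.linalg.norm(rx) < tol:
            break
        p = np.linalg.solve(B, -rx)          # B_k p_k = -r(x_k), (11.29)
        x_new = x + p
        r_new = r(x_new)
        s, y = x_new - x, r_new - rx
        B = B + np.outer(y - B @ s, s) / (s @ s)   # Broyden update (11.28)
        x, rx = x_new, r_new
    return x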
Under certain assumptions, Broyden's method converges superlinearly, that is,
\[
\|x_{k+1} - x^*\| = o(\|x_k - x^*\|).   (11.30)
\]
This local convergence rate is fast enough for most practical purposes, though not as fast as the Q-quadratic convergence of Newton's method.
We illustrate the difference between the convergence rates of Newton's and Broyden's methods with a small example. The function r : IR² → IR² defined by
\[
r(x) = \begin{bmatrix} (x_1 + 3)(x_2^3 - 7) + 18 \\ \sin(x_2 e^{x_1} - 1) \end{bmatrix}   (11.31)
\]
has a nondegenerate root at x* = (0, 1)ᵀ. We start both methods from the point x₀ = (−0.5, 1.4)ᵀ, and use the exact Jacobian J(x₀) at this point as the initial Jacobian approximation B₀. Results are shown in Table 11.1.
Newton's method clearly exhibits Q-quadratic convergence, which is characterized by doubling of the exponent of the error at each iteration. Broyden's method takes twice as
Table 11.1 Convergence of iterates in the Broyden and Newton methods, measured by ‖x_k − x*‖₂.

Iteration k    Newton            Broyden
0              0.64 × 10^0       0.64 × 10^0
1              0.62 × 10^-1      0.62 × 10^-1
2              0.21 × 10^-3      0.52 × 10^-3
3              0.18 × 10^-7      0.25 × 10^-3
4              0.12 × 10^-15     0.43 × 10^-4
5                                0.14 × 10^-6
6                                0.57 × 10^-9
7                                0.18 × 10^-11
8                                0.87 × 10^-15

Table 11.2 Convergence of function norms in the Broyden and Newton methods, measured by ‖r(x_k)‖₂.

Iteration k    Newton            Broyden
0              0.74 × 10^1       0.74 × 10^1
1              0.59 × 10^0       0.59 × 10^0
2              0.23 × 10^-2      0.20 × 10^-2
3              0.16 × 10^-6      0.21 × 10^-2
4              0.22 × 10^-15     0.37 × 10^-3
5                                0.12 × 10^-5
6                                0.49 × 10^-8
7                                0.15 × 10^-10
8                                0.11 × 10^-18
many iterations as Newton's, and reduces the error at a rate that accelerates slightly towards the end. The function norms ‖r(x_k)‖ approach zero at a similar rate to the iteration errors ‖x_k − x*‖. As in (11.10), we have that
\[
r(x_k) = r(x_k) - r(x^*) \approx J(x^*)(x_k - x^*),
\]
so by nonsingularity of J(x*), the norms of r(x_k) and x_k − x* are bounded above and below by multiples of each other. For our example problem (11.31), convergence of the sequence of function norms in the two methods is shown in Table 11.2.
The convergence analysis of Broyden's method is more complicated than that of Newton's method. We state the following result without proof.
Theorem 11.5.
Suppose the assumptions of Theorem 11.2 hold. Then there are positive constants ε and δ such that if the starting point x₀ and the starting approximate Jacobian B₀ satisfy
\[
\|x_0 - x^*\| \le \delta, \quad \|B_0 - J(x^*)\| \le \epsilon,   (11.32)
\]
the sequence {x_k} generated by Broyden's method (11.24), (11.28) is well-defined and converges Q-superlinearly to x*.
The second condition in (11.32), that the initial Jacobian approximation B₀ must be close to the true Jacobian at the solution J(x*), is difficult to guarantee in practice. In contrast to the case of unconstrained minimization, a good choice of B₀ can be critical to the performance of the algorithm. Some implementations of Broyden's method recommend choosing B₀ to be J(x₀), or some finite-difference approximation to this matrix.
The Broyden matrix B_k will be dense in general, even if the true Jacobian J is sparse. Therefore, when n is large, an implementation of Broyden's method that stores B_k as a full n × n matrix may be inefficient. Instead, we can use limited-memory methods in which B_k is stored implicitly in the form of a number of vectors of length n, while the system (11.29) is solved by a technique based on application of the Sherman–Morrison–Woodbury formula (A.28). These methods are similar to the ones described in Chapter 7 for large-scale unconstrained optimization.
TENSOR METHODS
In tensor methods, the linear model M_k(p) used by Newton's method (11.5) is augmented with an extra term that aims to capture some of the nonlinear, higher-order behavior of r. By doing so, it achieves more rapid and reliable convergence to degenerate roots, in particular, to roots x* for which the Jacobian J(x*) has rank n − 1 or n − 2. We give a broad outline of the method here, and refer to Schnabel and Frank [277] for details.
We use M̂_k(p) to denote the model function on which tensor methods are based; this function has the form
\[
\hat{M}_k(p) = r(x_k) + J(x_k)\, p + \tfrac{1}{2} T_k\, p p,   (11.33)
\]
where T_k is a tensor defined by n³ elements (T_k)_{ijl}, whose action on a pair of arbitrary vectors u and v in IRⁿ is defined by
\[
(T_k u v)_i = \sum_{j=1}^n \sum_{l=1}^n (T_k)_{ijl}\, u_j v_l.
\]
If we followed the reasoning behind Newton's method, we could consider building T_k from the second derivatives of r at the point x_k, that is,
\[
(T_k)_{ijl} = \left[ \nabla^2 r_i(x_k) \right]_{jl}.
\]
For instance, in the example (11.31), we have that
\[
(T(x) u v)_1 = u^T \nabla^2 r_1(x)\, v = u^T \begin{bmatrix} 0 & 3x_2^2 \\ 3x_2^2 & 6x_2(x_1 + 3) \end{bmatrix} v
= 3x_2^2 (u_1 v_2 + u_2 v_1) + 6x_2(x_1 + 3)\, u_2 v_2.
\]
However, use of the exact second derivatives is not practical in most instances. If we were to store this information explicitly, about n³/2 memory locations would be needed: about n times the requirements of Newton's method. Moreover, there may be no vector p for which M̂_k(p) = 0, so the step may not even be defined.
Instead, the approach described in [277] defines T_k in a way that requires little additional storage, but which gives M̂_k some potentially appealing properties. Specifically, T_k is chosen so that M̂_k(p) interpolates the function r(x_k + p) at some previous iterates visited by the algorithm. That is, we require that
\[
\hat{M}_k(x_{k-j} - x_k) = r(x_{k-j}), \quad \text{for } j = 1, 2, \ldots, q,   (11.34)
\]
for some integer q > 0. By substituting from (11.33), we see that T_k must satisfy the condition
\[
\tfrac{1}{2} T_k s_{jk} s_{jk} = r(x_{k-j}) - r(x_k) - J(x_k)\, s_{jk}, \quad \text{where } s_{jk} \stackrel{\text{def}}{=} x_{k-j} - x_k, \; j = 1, 2, \ldots, q.
\]
In [277] it is shown that this condition can be ensured by choosing T_k so that its action on arbitrary vectors u and v is
\[
T_k u v = \sum_{j=1}^q a_j (s_{jk}^T u)(s_{jk}^T v),
\]
where a_j, j = 1, 2, ..., q, are vectors of length n. The number of interpolating points q is typically chosen to be quite modest, usually less than √n. This T_k can be stored in 2nq locations, which contain the vectors a_j and s_{jk} for j = 1, 2, ..., q. Note the connection between this idea and Broyden's method, which also chooses information in the model (albeit in the first-order part of the model) to interpolate the function value at the previous iterate.
This technique can be refined in various ways. The points of interpolation can be chosen to make the collection of directions s_{jk} more linearly independent. There may still not be a vector p for which M̂_k(p) = 0, but we can instead take the step to be the vector that
minimizes ‖M̂_k(p)‖₂, which can be found by using a specialized least-squares technique. There is no assurance that the step obtained in this way is a descent direction for the merit function f(x) = ½‖r(x)‖² (which is discussed in the next section), and in this case it can be replaced by the standard Newton direction −J_k⁻¹ r_k.
11.2 PRACTICAL METHODS
We now consider practical variants of the Newton-like methods discussed above, in which line-search and trust-region modifications to the steps are made in order to ensure better global convergence behavior.
MERIT FUNCTIONS
As mentioned above, neither Newton's method (11.6) nor Broyden's method (11.24), (11.28) with unit step lengths can be guaranteed to converge to a solution of r(x) = 0 unless they are started close to that solution. Sometimes, components of the unknown or function vector or the Jacobian will blow up. Another, more exotic, kind of behavior is cycling, where the iterates move between distinct regions of the parameter space without approaching a root. An example is the scalar function
\[
r(x) = -x^5 + x^3 + 4x,
\]
which has five nondegenerate roots. When started from the point x₀ = 1, Newton's method produces a sequence of iterates that oscillates between 1 and −1 (see Exercise 11.3) without converging to any of the roots.
The Newton and Broyden methods can be made more robust by using line-search and trust-region techniques similar to those described in Chapters 3 and 4. Before describing these techniques, we need to define a merit function, which is a scalar-valued function of x that indicates whether a new iterate is better or worse than the current iterate, in the sense of making progress toward a root of r. In unconstrained optimization, the objective function
f is itself a natural merit function; most algorithms for minimizing f require a decrease in f at each iteration. In nonlinear equations, the merit function is obtained by combining the n components of the vector r in some way.
The most widely used merit function is the sum of squares, defined by
\[
f(x) = \tfrac{1}{2} \|r(x)\|^2 = \tfrac{1}{2} \sum_{i=1}^n r_i^2(x).   (11.35)
\]
The factor 1/2 is introduced for convenience. Any root x* of r obviously has f(x*) = 0, and since f(x) ≥ 0 for all x, each root is a minimizer of f. However, local minimizers of f are not roots of r if f is strictly positive at the point in question. Still, the merit function
Figure 11.2 Plot of f(x) = ½(sin 5x − x)², showing its many local minima.
(11.35) has been used successfully in many applications and is implemented in a number of software packages.
The merit function for the example (11.3) is plotted in Figure 11.2. It shows three local minima corresponding to the three roots, but there are many other local minima (for example, those at around 1.53053). Local minima like these that are not roots of r satisfy an interesting property. Since
\[
\nabla f(x) = J(x)^T r(x) = 0,   (11.36)
\]
we can have r(x) ≠ 0 only if J(x) is singular.
Since local minima for the sum-of-squares merit function may be points of attraction
for the algorithms described in this section, global convergence results for the algorithms discussed here are less satisfactory than for similar algorithms applied to unconstrained optimization.
Other merit functions are also used in practice. One such is the ℓ₁-norm merit function defined by
\[
f_1(x) = \|r(x)\|_1 = \sum_{i=1}^m |r_i(x)|.
\]
This function is studied in Chapters 17 and 18 in the context of algorithms for constrained optimization.
LINE SEARCH METHODS
We can obtain algorithms with global convergence properties by applying the line search approach of Chapter 3 to the sum-of-squares merit function f(x) = ½‖r(x)‖². When it is well defined, the Newton step
\[
J(x_k)\, p_k = -r(x_k)   (11.37)
\]
is a descent direction for f whenever r_k ≠ 0, since
\[
p_k^T \nabla f(x_k) = -p_k^T J_k^T r_k = -\|r_k\|^2 < 0.   (11.38)
\]
Step lengths α_k are chosen by one of the procedures of Chapter 3, and the iterates are defined by the formula
\[
x_{k+1} = x_k + \alpha_k p_k, \quad k = 0, 1, 2, \ldots.   (11.39)
\]
For the case of line searches that choose α_k to satisfy the Wolfe conditions (3.6), we have the following convergence result, which follows directly from Theorem 3.2.

Theorem 11.6.
Suppose that J(x) is Lipschitz continuous in a neighborhood D of the level set L = {x : f(x) ≤ f(x₀)}, and that ‖J(x)‖ and ‖r(x)‖ are bounded above on D. Suppose that a line-search algorithm (11.39) is applied to f, where the search directions p_k satisfy p_kᵀ∇f_k < 0 while the step lengths α_k satisfy the Wolfe conditions (3.6). Then we have that the Zoutendijk condition holds, that is,
\[
\sum_{k \ge 0} \cos^2 \theta_k \, \|J_k^T r_k\|^2 < \infty,
\]
where
\[
\cos \theta_k = \frac{-p_k^T \nabla f(x_k)}{\|p_k\| \, \|\nabla f(x_k)\|}.   (11.40)
\]
We omit the proof, which verifies that f is Lipschitz continuous on D and that f is bounded below by 0 on D, and then applies Theorem 3.2.
Provided that the sequence of iterates satisfies
\[
\cos \theta_k \ge \delta, \quad \text{for some } \delta \in (0, 1) \text{ and all } k \text{ sufficiently large},   (11.41)
\]
Theorem 11.6 guarantees that J_kᵀr_k → 0, meaning that the iterates approach stationarity of the merit function f. Moreover, if we know that ‖J(x_k)⁻¹‖ is bounded, then we must have r_k → 0.
We now investigate the values of cos θ_k for the directions generated by the Newton and inexact Newton methods. From (11.40) and (11.38), we have for the exact Newton step (11.6) that
\[
\cos \theta_k = \frac{-p_k^T \nabla f(x_k)}{\|p_k\| \, \|\nabla f(x_k)\|} = \frac{\|r_k\|^2}{\|J_k^{-1} r_k\| \, \|J_k^T r_k\|} \ge \frac{1}{\|J_k^{-1}\| \, \|J_k^T\|} = \frac{1}{\kappa(J_k)}.   (11.42)
\]
When p_k is an inexact Newton direction, that is, one that satisfies the condition (11.17), we have that
\[
\|r_k + J_k p_k\|^2 \le \eta_k^2 \|r_k\|^2 \;\Rightarrow\; \|r_k\|^2 + 2 p_k^T J_k^T r_k + \|J_k p_k\|^2 \le \eta^2 \|r_k\|^2,
\]
and therefore
\[
-p_k^T \nabla f_k = -p_k^T J_k^T r_k \ge \tfrac{1}{2} (1 - \eta^2) \|r_k\|^2.
\]
Meanwhile,
\[
\|p_k\| = \|J_k^{-1} [(r_k + J_k p_k) - r_k]\| \le \|J_k^{-1}\| (1 + \eta) \|r_k\|,
\]
and
\[
\|\nabla f_k\| = \|J_k^T r_k\| \le \|J_k\| \|r_k\|.
\]
By combining these estimates, we obtain
\[
\cos \theta_k = \frac{-p_k^T \nabla f_k}{\|p_k\| \, \|\nabla f_k\|} \ge \frac{(1 - \eta^2)/2}{\|J_k\| \, \|J_k^{-1}\| (1 + \eta)} = \frac{1 - \eta}{2\, \kappa(J_k)}.
\]
We conclude that a bound of the form (11.41) is satisfied both for the exact and inexact Newton methods, provided that the condition number κ(J_k) is bounded.
When κ(J_k) is large, however, this lower bound is close to zero, and use of the Newton direction may cause poor performance of the algorithm. In fact, the following example shows that cos θ_k can converge to zero, causing the algorithm to fail. This example highlights a fundamental weakness of the line-search approach.
EXAMPLE 11.2 (POWELL [241])
Consider the problem of finding a solution of the nonlinear system
\[
r(x) = \begin{bmatrix} x_1 \\ \dfrac{10 x_1}{x_1 + 0.1} + 2 x_2^2 \end{bmatrix},   (11.43)
\]
with unique solution x* = 0. We try to solve this problem using the Newton iteration (11.37), (11.39), where α_k is chosen to minimize f along p_k. It is proved in [241] that, starting from the point (3, 1)ᵀ, the iterates converge to (1.8016, 0)ᵀ (to four digits of accuracy). However, this point is not a solution of (11.43). In fact, it is not even a stationary point of f, and a step from this point in the direction −∇f will produce a decrease in both components of r. To verify these claims, note that the Jacobian of r, which is
\[
J(x) = \begin{bmatrix} 1 & 0 \\ (x_1 + 0.1)^{-2} & 4 x_2 \end{bmatrix},
\]
is singular at all x for which x₂ = 0. For such points, we have
\[
\nabla f(x) = J(x)^T r(x) = \begin{bmatrix} x_1 + 10 x_1 (x_1 + 0.1)^{-3} \\ 0 \end{bmatrix},
\]
so that the gradient points in the direction of the positive x₁ axis whenever x₁ > 0. The point (1.8016, 0)ᵀ is therefore not a stationary point of f.
For this example, a calculation shows that the Newton step generated from an iterate that is close to (but not quite on) the x₁ axis tends to be parallel to the x₂ axis, making it nearly orthogonal to the gradient ∇f(x). That is, cos θ_k for the Newton direction may be arbitrarily close to zero.
In this example, a Newton method with exact line searches is attracted to a point of no interest at which the Jacobian is singular. Since systems of nonlinear equations often contain singular points, this behavior gives cause for concern.
To prevent this undesirable behavior and ensure that (11.41) holds, we may have to modify the Newton direction. One possibility is to add some multiple τ_k I of the identity to J_kᵀJ_k, and define the step p_k to be
\[
p_k = -(J_k^T J_k + \tau_k I)^{-1} J_k^T r_k.   (11.44)
\]
For any τ_k > 0 the matrix in parentheses is nonsingular, and if τ_k is bounded away from zero, a condition of the form (11.41) is satisfied. Therefore, some practical algorithms choose τ_k adaptively to ensure that the matrix in (11.44) does not approach singularity. This approach is analogous to the classical Levenberg–Marquardt algorithm discussed in Chapter 10. To implement it without forming J_kᵀJ_k explicitly and performing trial Cholesky factorizations of the matrices J_kᵀJ_k + τI, we can use the technique (10.36) illustrated earlier for the least-squares case. This technique uses the fact that the Cholesky factor of J_kᵀJ_k + τ_k I is
identical to Rᵀ, where R is the upper triangular factor from the QR factorization of the matrix
\[
\begin{bmatrix} J_k \\ \sqrt{\tau_k}\, I \end{bmatrix}.   (11.45)
\]
A combination of Householder and Givens transformations can be used, as for (10.36), and the savings noted in the discussion following (10.36) continue to hold if we need to perform this calculation for several candidate values of τ_k.
The drawback of this Levenberg–Marquardt approach is that it is difficult to choose τ_k. If it is too large, we can destroy the fast rate of convergence of Newton's method. (Note that p_k approaches a multiple of −J_kᵀr_k as τ_k → ∞, so the step becomes small and tends to point in the steepest-descent direction for f.) If τ_k is too small, the algorithm can be inefficient in the presence of Jacobian singularities. A more satisfactory approach is to follow the trust-region approach described below, which chooses τ_k indirectly.
We conclude by specifying an algorithm based on Newton-like steps and line searches that regularizes the step calculations where necessary. Several details are deliberately left vague; we refer the reader to the papers cited above for details.
Algorithm 11.4 (Line Search Newton-like Method).
Given c₁, c₂ with 0 < c₁ < c₂ < 1/2;
Choose x₀;
for k = 0, 1, 2, ...
    Calculate a Newton-like step from (11.6) (regularizing with (11.44) if J_k appears to be near-singular), or (11.17), or (11.24);
    if α = 1 satisfies the Wolfe conditions (3.6)
        Set α_k = 1;
    else
        Perform a line search to find α_k > 0 that satisfies (3.6);
    end (if)
    x_{k+1} ← x_k + α_k p_k;
end (for)
TRUST-REGION METHODS
The most widely used trust-region methods for nonlinear equations simply apply Algorithm 4.1 from Chapter 4 to the merit function f(x) = ½‖r(x)‖², using B_k = J(x_k)ᵀJ(x_k) as the approximate Hessian in the model function m_k(p), which is defined as follows:
\[
m_k(p) = \tfrac{1}{2} \|r_k + J_k p\|^2 = f_k + p^T J_k^T r_k + \tfrac{1}{2} p^T J_k^T J_k\, p.
\]
The step p_k is generated by finding an approximate solution of the subproblem
\[
\min_p \; m_k(p), \quad \text{subject to } \|p\| \le \Delta_k,   (11.46)
\]
where Δ_k is the radius of the trust region. The ratio ρ_k of actual to predicted reduction (see (4.4)), which plays a critical role in many trust-region algorithms, is therefore
\[
\rho_k = \frac{\|r(x_k)\|^2 - \|r(x_k + p_k)\|^2}{\|r(x_k)\|^2 - \|r(x_k) + J(x_k)\, p_k\|^2}.   (11.47)
\]
We can state the trust-region framework that results from this model as follows.
Algorithm 11.5 (Trust-Region Method for Nonlinear Equations).
Given Δ̄ > 0, Δ₀ ∈ (0, Δ̄), and η ∈ [0, 1/4):
for k = 0, 1, 2, ...
    Calculate p_k as an approximate solution of (11.46);
    Evaluate ρ_k from (11.47);
    if ρ_k < 1/4
        Δ_{k+1} = (1/4)‖p_k‖;
    else
        if ρ_k > 3/4 and ‖p_k‖ = Δ_k
            Δ_{k+1} = min(2Δ_k, Δ̄);
        else
            Δ_{k+1} = Δ_k;
        end (if)
    end (if)
    if ρ_k > η
        x_{k+1} = x_k + p_k;
    else
        x_{k+1} = x_k;
    end (if)
end (for)
The dogleg method is a special case of the trust-region algorithm, Algorithm 4.1, that constructs an approximate solution to (11.46) based on the Cauchy point p_kᶜ and the unconstrained minimizer of m_k. The Cauchy point is
\[
p_k^C = -\tau_k \left( \Delta_k / \|J_k^T r_k\| \right) J_k^T r_k,   (11.48)
\]
where
\[
\tau_k = \min\left( 1, \; \|J_k^T r_k\|^3 \, / \, \left[ \Delta_k \, r_k^T J_k (J_k^T J_k) J_k^T r_k \right] \right).   (11.49)
\]
By comparing with the general definition (4.11), (4.12), we see that it is not necessary to consider the case of an indefinite Hessian approximation in \(m_k(p)\), since the model Hessian \(J_k^T J_k\) that we use is positive semidefinite. The unconstrained minimizer of \(m_k(p)\) is unique when \(J_k\) is nonsingular. In this case, we denote it by \(p_k^{\rm J}\) and write
\[
p_k^{\rm J} = -(J_k^T J_k)^{-1} J_k^T r_k = -J_k^{-1} r_k.
\]
The selection of \(p_k\) in the dogleg method proceeds as follows.

Procedure 11.6 (Dogleg).
Calculate \(p_k^{\rm C}\);
if \(\|p_k^{\rm C}\| = \Delta_k\)
    \(p_k = p_k^{\rm C}\);
else
    Calculate \(p_k^{\rm J}\);
    \(p_k = p_k^{\rm C} + \tau (p_k^{\rm J} - p_k^{\rm C})\), where \(\tau\) is the largest value in \([0, 1]\) such that \(\|p_k\| \le \Delta_k\);
end (if).
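A NumPy sketch of Procedure 11.6 is given below; it assumes \(J_k\) is nonsingular whenever the else-branch is reached, as the text requires, and the boundary value of \(\tau\) is obtained by solving the quadratic \(\|p_k^{\rm C} + \tau (p_k^{\rm J} - p_k^{\rm C})\|^2 = \Delta_k^2\):

```python
import numpy as np

def dogleg_step(J, r, delta):
    """Procedure 11.6: dogleg approximation to the subproblem (11.46)."""
    g = J.T @ r                      # gradient of the merit function
    # Cauchy point (11.48)-(11.49); note g^T (J^T J) g = ||J g||^2.
    Jg = J @ g
    tau = min(1.0, np.linalg.norm(g)**3 / (delta * (Jg @ Jg)))
    pC = -tau * (delta / np.linalg.norm(g)) * g
    if abs(np.linalg.norm(pC) - delta) < 1e-12:
        return pC
    pJ = np.linalg.solve(J, -r)      # unconstrained minimizer p^J
    if np.linalg.norm(pJ) <= delta:
        return pJ
    # Largest tau in [0,1] with ||pC + tau*(pJ - pC)|| = delta.
    d = pJ - pC
    a, b, c = d @ d, 2 * (pC @ d), pC @ pC - delta**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return pC + tau * d
```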
Lemma 4.2 shows that when \(J_k\) is nonsingular, the vector \(p_k\) chosen above is the minimizer of \(m_k\) along the piecewise-linear path that leads from the origin to the Cauchy point and then to the unconstrained minimizer \(p_k^{\rm J}\). Hence, the reduction in model function at least matches the reduction obtained by the Cauchy point, which can be estimated by specializing the bound (4.20) to the least-squares case by writing
\[
m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \; \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right), \qquad (11.50)
\]
where \(c_1\) is some positive constant.
From Theorem 4.1, we know that the exact solution of (11.46) has the form
\[
p_k = -\left( J_k^T J_k + \lambda_k I \right)^{-1} J_k^T r_k, \qquad (11.51)
\]
for some \(\lambda_k \ge 0\), and that \(\lambda_k = 0\) if the unconstrained solution \(p_k^{\rm J}\) satisfies \(\|p_k^{\rm J}\| \le \Delta_k\). Note that (11.51) is identical to the formula (10.34a) from Chapter 10. In fact, the Levenberg-Marquardt approach for nonlinear equations is a special case of the same algorithm for nonlinear least-squares problems. The Levenberg-Marquardt algorithm uses the techniques of Section 4.3 to search for the value of \(\lambda_k\) that satisfies (11.51). The procedure described in the exact trust-region algorithm, Algorithm 4.3, is based on Cholesky factorizations, but as in Chapter 10, we can replace these by specialized algorithms to compute the QR factorization of the matrix (11.45). Even if the exact \(\lambda_k\) corresponding to the solution of (11.46) is not found, the \(p_k\) calculated from (11.51) will still yield global convergence if it satisfies the condition (11.50) for some value of \(c_1\), together with
\[
\|p_k\| \le \gamma \Delta_k, \quad \text{for some constant } \gamma \ge 1. \qquad (11.52)
\]
The dogleg method requires just one linear system to be solved per iteration, whereas methods that search for the exact solution of (11.46) require several such systems to be solved. As in Chapter 4, there is a tradeoff to be made between the amount of effort to spend on each iteration and the total number of function and derivative evaluations required.
We can also consider alternative trust-region approaches that are based on different merit functions and different definitions of the trust region. An algorithm based on the \(\ell_1\) merit function with an \(\ell_\infty\)-norm trust region gives rise to subproblems of the form
\[
\min_p \; \|J_k p + r_k\|_1 \quad \text{subject to} \quad \|p\|_\infty \le \Delta_k, \qquad (11.53)
\]
which can be formulated and solved using linear programming techniques. This approach is closely related to the S\(\ell_1\)QP and SLQP approaches for nonlinear programming discussed in Section 18.5.
Global convergence results for Algorithm 11.5 when the steps \(p_k\) satisfy (11.50) and (11.52) are given in the following theorem, which can be proved by referring directly to Theorems 4.5 and 4.6. The first result is for \(\eta = 0\), in which case the algorithm accepts all steps that produce a decrease in the merit function \(f_k\), while the second, stronger result requires a strictly positive choice of \(\eta\).
Theorem 11.7.
Suppose that \(J(x)\) is Lipschitz continuous and that \(\|J(x)\|\) is bounded above in a neighborhood \(D\) of the level set \(L = \{x : f(x) \le f(x_0)\}\). Suppose in addition that all approximate solutions of (11.46) satisfy the bounds (11.50) and (11.52). Then if \(\eta = 0\) in Algorithm 11.5, we have that
\[
\liminf_{k \to \infty} \|J_k^T r_k\| = 0,
\]
while if \(\eta \in (0, \tfrac14)\), we have
\[
\lim_{k \to \infty} \|J_k^T r_k\| = 0.
\]
We turn now to local convergence of the trust-region algorithm for the case in which the subproblem (11.46) is solved exactly. We assume that the sequence \(\{x_k\}\) converges to a nondegenerate solution \(x^*\) of the nonlinear equations \(r(x) = 0\). The significance of this result is that the algorithmic enhancements needed for global convergence do not, in well-designed algorithms, interfere with the fast local convergence properties described in Section 11.1.

Theorem 11.8.
Suppose that the sequence \(\{x_k\}\) generated by Algorithm 11.5 converges to a nondegenerate solution \(x^*\) of the problem \(r(x) = 0\). Suppose also that \(J(x)\) is Lipschitz continuous in an open neighborhood \(D\) of \(x^*\) and that the trust-region subproblem (11.46) is solved exactly for all sufficiently large \(k\). Then the sequence \(\{x_k\}\) converges quadratically to \(x^*\).

PROOF. We prove this result by showing that there is an index \(K\) such that the trust-region radius is not reduced further after iteration \(K\); that is, \(\Delta_k \ge \Delta_K\) for all \(k \ge K\). We then show that the algorithm eventually takes the pure Newton step at every iteration, so that quadratic convergence follows from Theorem 11.2.
Let \(p_k\) denote the exact solution of (11.46). Note first that \(p_k\) will simply be the unconstrained Newton step \(-J_k^{-1} r_k\) whenever this step satisfies the trust-region bound. Otherwise, we have \(\|J_k^{-1} r_k\| > \Delta_k\), while the solution \(p_k\) satisfies \(\|p_k\| \le \Delta_k\). In either case, we have
\[
\|p_k\| \le \|J_k^{-1} r_k\|. \qquad (11.54)
\]
We consider the ratio \(\rho_k\) of actual to predicted reduction defined by (11.47). We have directly from the definition that
\[
|1 - \rho_k| = \left| \frac{\|r_k + J_k p_k\|^2 - \|r(x_k + p_k)\|^2}{\|r(x_k)\|^2 - \|r(x_k) + J(x_k) p_k\|^2} \right|. \qquad (11.55)
\]
From Theorem 11.1, we have for the second term in the numerator that
\[
\|r(x_k + p_k)\|^2 = \|r(x_k) + J(x_k) p_k + w(x_k, x_k + p_k)\|^2, \qquad (11.56)
\]
where \(w(\cdot, \cdot)\) is defined as in (11.11). Because of Lipschitz continuity of \(J\) with Lipschitz constant \(L\) (see (11.7)), we have
\[
\|w(x_k, x_k + p_k)\| = \left\| \int_0^1 \left[ J(x_k + t p_k) - J(x_k) \right] p_k \, dt \right\| \le \int_0^1 L t \|p_k\|^2 \, dt = \frac{L}{2} \|p_k\|^2,
\]
so that, using (11.56) and the fact that \(\|r_k + J_k p_k\| \le \|r_k\| = \sqrt{2}\, f(x_k)^{1/2}\) (since \(p_k\) is the solution of (11.46)), we can bound the numerator as follows:
\[
\begin{aligned}
\left| \|r_k + J_k p_k\|^2 - \|r(x_k + p_k)\|^2 \right|
&\le 2 \|r_k + J_k p_k\| \, \|w(x_k, x_k + p_k)\| + \|w(x_k, x_k + p_k)\|^2 \\
&\le f(x_k)^{1/2} \sqrt{2}\, L \|p_k\|^2 + (L^2/4) \|p_k\|^4 \\
&= \epsilon(x_k) \|p_k\|^2, \qquad (11.57)
\end{aligned}
\]
where we define
\[
\epsilon(x_k) \stackrel{\rm def}{=} f(x_k)^{1/2} \sqrt{2}\, L + (L^2/4) \|p_k\|^2.
\]
Since \(x_k \to x^*\) by assumption, it follows that \(f(x_k) \to 0\) and \(\|r_k\| \to 0\). Because \(x^*\) is a nondegenerate root, we have as in (11.14) that \(\|J(x_k)^{-1}\|\) is bounded above (by \(\beta^*\), say) for all \(k\) sufficiently large, so from (11.54), we have
\[
\|p_k\| \le \|J_k^{-1} r_k\| \le \beta^* \|r_k\| \to 0. \qquad (11.58)
\]
Hence, \(\epsilon(x_k) \to 0\).
Turning now to the denominator of (11.55), we define \(\tilde p_k\) to be a step of the same length as the solution \(p_k\) in the Newton direction \(-J_k^{-1} r_k\), that is,
\[
\tilde p_k = - \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \, J_k^{-1} r_k.
\]
Since \(\tilde p_k\) is feasible for (11.46), and since \(p_k\) is optimal for this subproblem, we have
\[
\|r_k\|^2 - \|r_k + J_k p_k\|^2 \;\ge\; \|r_k\|^2 - \|r_k + J_k \tilde p_k\|^2
= \|r_k\|^2 - \left\| r_k - \frac{\|p_k\|}{\|J_k^{-1} r_k\|} r_k \right\|^2
= \left( 2 \frac{\|p_k\|}{\|J_k^{-1} r_k\|} - \frac{\|p_k\|^2}{\|J_k^{-1} r_k\|^2} \right) \|r_k\|^2
\;\ge\; \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \|r_k\|^2,
\]
where for the last inequality we have used (11.54). By using (11.58) again, we have from this bound that
\[
\|r_k\|^2 - \|r_k + J_k p_k\|^2 \ge \frac{\|p_k\|}{\|J_k^{-1} r_k\|} \|r_k\|^2 \ge \frac{1}{\beta^*} \|p_k\| \|r_k\|. \qquad (11.59)
\]
By substituting (11.57) and (11.59) into (11.55), and then applying (11.58) again, we have
\[
|1 - \rho_k| \le \beta^* \, \epsilon(x_k) \frac{\|p_k\|^2}{\|p_k\| \|r_k\|} = \beta^* \epsilon(x_k) \frac{\|p_k\|}{\|r_k\|} \le (\beta^*)^2 \epsilon(x_k) \to 0. \qquad (11.60)
\]
Therefore, for all \(k\) sufficiently large, we have \(\rho_k > \tfrac14\), and so the trust-region radius \(\Delta_k\) will not be reduced beyond this point. As claimed, there is an index \(K\) such that
\[
\Delta_k \ge \Delta_K, \quad \text{for all } k \ge K.
\]
Since \(\|J_k^{-1} r_k\| \le \beta^* \|r_k\| \to 0\), the Newton step \(-J_k^{-1} r_k\) will eventually be smaller than \(\Delta_K\) (and hence \(\Delta_k\)), so it will eventually always be accepted as the solution of (11.46). The result now follows from Theorem 11.2. \(\square\)

We can replace the assumption that \(x_k \to x^*\) with an assumption that the nondegenerate solution \(x^*\) is just one of the limit points of the sequence. (In fact, this condition implies that \(x_k \to x^*\); see Exercise 11.9.)
11.3 CONTINUATION/HOMOTOPY METHODS

MOTIVATION

We mentioned above that Newton-based methods all suffer from one shortcoming: Unless \(J(x)\) is nonsingular in the region of interest (a condition that often cannot be guaranteed), they are in danger of converging to a local minimum of the merit function that is not a solution of the nonlinear system. Continuation methods, which we outline in this section, are more likely to converge to a solution of \(r(x) = 0\) in difficult cases. Their underlying motivation is simple to describe: Rather than dealing with the original problem \(r(x) = 0\) directly, we set up an easy system of equations for which the solution is obvious. We then gradually transform the easy system into the original system \(r(x) = 0\), and follow the solution as it moves from the solution of the easy problem to the solution of the original problem.
One simple way to define the so-called homotopy map \(H(x, \lambda)\) is as follows:
\[
H(x, \lambda) = \lambda r(x) + (1 - \lambda)(x - a), \qquad (11.61)
\]
where \(\lambda\) is a scalar parameter and \(a \in \mathbb{R}^n\) is a fixed vector. When \(\lambda = 0\), (11.61) defines the artificial, easy problem \(H(x, 0) = x - a\), whose solution is obviously \(x = a\). When \(\lambda = 1\), we have \(H(x, 1) = r(x)\), the original system of equations.
To solve \(r(x) = 0\), consider the following algorithm: First, set \(\lambda = 0\) in (11.61) and set \(x = a\). Then, increase \(\lambda\) from 0 to 1 in small increments, and for each value of \(\lambda\), calculate the solution of the system \(H(x, \lambda) = 0\). The final value of \(x\), corresponding to \(\lambda = 1\), will solve the original problem \(r(x) = 0\).
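The naive strategy is easy to state in code. The sketch below (illustrative only; function names are ours) increases \(\lambda\) on a fixed grid and solves \(H(x, \lambda) = 0\) at each value by a few Newton steps, warm-started from the previous solution; as the discussion that follows shows, it breaks down at turning points:

```python
import numpy as np

def naive_continuation(r_fn, J_fn, a, n_steps=100, newton_iters=5):
    """Naive continuation for H(x, lam) = lam*r(x) + (1-lam)*(x - a):
    march lam from 0 to 1, solving H(x, lam) = 0 by Newton's method at
    each value, warm-started from the previous solution."""
    x = a.copy()                       # exact solution at lam = 0
    n = len(a)
    for lam in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        for _ in range(newton_iters):
            H = lam * r_fn(x) + (1 - lam) * (x - a)
            H_x = lam * J_fn(x) + (1 - lam) * np.eye(n)
            x = x + np.linalg.solve(H_x, -H)
    return x                           # approximate root of r at lam = 1
```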
This naive approach sounds plausible, and Figure 11.3 illustrates a situation in which it would be successful. In this figure, there is a unique solution \(x\) of the system \(H(x, \lambda) = 0\) for each value of \(\lambda\) in the range \([0, 1]\). The trajectory of points \((x, \lambda)\) for which \(H(x, \lambda) = 0\) is called the zero path.

Figure 11.3 Plot of a zero path: trajectory of points \((x, \lambda)\) with \(H(x, \lambda) = 0\).

Unfortunately, however, the approach often fails, as illustrated in Figure 11.4. Here, the algorithm follows the lower branch of the curve from \(\lambda = 0\) to \(\lambda = \lambda_T\), but it then loses the trail unless it is lucky enough to jump to the top branch of the path.

Figure 11.4 Zero path with turning points. The path joining \((a, 0)\) to \((x^*, 1)\) cannot be followed by increasing \(\lambda\) monotonically from 0 to 1.

The value \(\lambda_T\) is known as a turning point, since at this point we can follow the path smoothly only if we no longer insist on increasing \(\lambda\) at every step. In fact, practical continuation methods work by doing exactly as Figure 11.4 suggests, that is, they follow the zero path explicitly, even if this means allowing \(\lambda\) to decrease from time to time.
PRACTICAL CONTINUATION METHODS
In one practical technique, we model the zero path by allowing both \(x\) and \(\lambda\) to be functions of an independent variable \(s\) that represents arc length along the path. That is, \((x(s), \lambda(s))\) is the point that we arrive at by traveling a distance \(s\) along the path from the initial point \((x(0), \lambda(0)) = (a, 0)\). Because we have that
\[
H(x(s), \lambda(s)) = 0, \quad \text{for all } s \ge 0,
\]
we can take the total derivative of this expression with respect to \(s\) to obtain
\[
\frac{\partial}{\partial x} H(x, \lambda) \, \dot x + \frac{\partial}{\partial \lambda} H(x, \lambda) \, \dot \lambda = 0,
\quad \text{where } (\dot x, \dot \lambda) = \left( \frac{dx}{ds}, \frac{d\lambda}{ds} \right). \qquad (11.62)
\]
The vector \((\dot x(s), \dot \lambda(s))\) is the tangent vector to the zero path, as we illustrate in Figure 11.4. From (11.62), we see that it lies in the null space of the \(n \times (n+1)\) matrix
\[
\left[ \frac{\partial}{\partial x} H(x, \lambda) \quad \frac{\partial}{\partial \lambda} H(x, \lambda) \right]. \qquad (11.63)
\]
When this matrix has full rank, its null space has dimension 1, so to complete the definition of \((\dot x, \dot \lambda)\) in this case, we need to assign it a length and direction. The length is fixed by imposing the normalization condition
\[
\|\dot x(s)\|^2 + |\dot \lambda(s)|^2 = 1, \quad \text{for all } s, \qquad (11.64)
\]
which ensures that \(s\) is the true arc length along the path from \((a, 0)\) to \((x(s), \lambda(s))\). We need to choose the sign to ensure that we keep moving forward along the zero path. A heuristic that works well is to choose the sign so that the tangent vector \((\dot x, \dot \lambda)\) at the current value of \(s\) makes an angle of less than \(\pi/2\) with the tangent at the previous value of \(s\).
We can outline the complete procedure for computing \((\dot x, \dot \lambda)\) as follows:

Procedure 11.7 (Tangent Vector Calculation).
Compute a vector in the null space of (11.63) by performing a QR factorization with column pivoting,
\[
Q^T \left[ \frac{\partial}{\partial x} H(x, \lambda) \quad \frac{\partial}{\partial \lambda} H(x, \lambda) \right] \Pi = [\, R \;\; w \,],
\]
where \(Q\) is \(n \times n\) orthogonal, \(R\) is \(n \times n\) upper triangular, \(\Pi\) is an \((n+1) \times (n+1)\) permutation matrix, and \(w \in \mathbb{R}^n\). Set
\[
v = \Pi \begin{bmatrix} R^{-1} w \\ -1 \end{bmatrix};
\]
Set \((\dot x, \dot \lambda) = \pm v / \|v\|_2\), where the sign is chosen to satisfy the angle criterion mentioned above.

Details of the QR factorization procedure are given in the Appendix.
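For illustration, here is a sketch of Procedure 11.7 using SciPy's pivoted QR factorization (our own transcription; the argument names and the handling of the previous tangent are assumptions, not the book's specification):

```python
import numpy as np
from scipy.linalg import qr

def tangent(H_x, H_lam, prev_tangent=None):
    """Procedure 11.7: unit-norm vector in the null space of the
    n x (n+1) matrix [H_x  H_lam], with sign chosen by the angle criterion."""
    A = np.hstack([H_x, H_lam.reshape(-1, 1)])   # n x (n+1)
    n = A.shape[0]
    Q, R, piv = qr(A, pivoting=True)             # A[:, piv] = Q @ R = Q @ [R | w]
    w = R[:, n]                                  # last column of the permuted factor
    v_perm = np.append(np.linalg.solve(R[:, :n], w), -1.0)
    v = np.empty(n + 1)
    v[piv] = v_perm                              # undo the column permutation
    v /= np.linalg.norm(v)
    # Keep moving forward: flip sign if the angle with the previous tangent
    # exceeds pi/2.
    if prev_tangent is not None and v @ prev_tangent < 0:
        v = -v
    return v
```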
Since we can obtain the tangent at any given point \((x, \lambda)\), and since we know the initial point \((x(0), \lambda(0)) = (a, 0)\), we can trace the zero path by calling a standard initial-value first-order ordinary differential equation solver, terminating the algorithm when it finds a value of \(s\) for which \(\lambda(s) = 1\).
A second approach for following the zero path is quite similar to the one just described, except that it takes an algebraic viewpoint instead of a differential-equations viewpoint. Given a current point \((x, \lambda)\), we compute the tangent vector \((\dot x, \dot \lambda)\) as above, and take a small step (of length \(\epsilon\), say) along this direction to produce a predictor point \((x^P, \lambda^P)\); that is,
\[
(x^P, \lambda^P) = (x, \lambda) + \epsilon \, (\dot x, \dot \lambda).
\]
Usually, this new point will not lie exactly on the zero path, so we apply some corrector iterations to bring it back to the path, thereby identifying a new iterate \((x^+, \lambda^+)\) that satisfies \(H(x^+, \lambda^+) = 0\). This process is illustrated in Figure 11.5. During the corrections, we choose a component of the predictor step \((x^P, \lambda^P)\), one of the components that has been changing most rapidly during the past few steps, and hold this component fixed during the correction process. If the index of this component is \(i\), and if we use a pure Newton corrector process (often adequate, since \((x^P, \lambda^P)\) is usually quite close to the target point \((x^+, \lambda^+)\)), the steps will have the form
\[
\begin{bmatrix} \partial H / \partial x \;\;\; \partial H / \partial \lambda \\ e_i^T \end{bmatrix}
\begin{bmatrix} \delta x \\ \delta \lambda \end{bmatrix}
= \begin{bmatrix} -H \\ 0 \end{bmatrix},
\]

Figure 11.5 The algebraic predictor-corrector procedure, using \(\lambda\) as the fixed variable in the correction process.

where the quantities \(\partial H / \partial x\), \(\partial H / \partial \lambda\), and \(H\) are evaluated at the latest point of the corrector process. The last row of this system serves to fix the \(i\)th component of \((\delta x, \delta \lambda)\) at zero; the vector \(e_i \in \mathbb{R}^{n+1}\) contains all zeros, except for a 1 in the location \(i\) that corresponds to the fixed component. Note that in Figure 11.5 the component \(\lambda\) is chosen to be fixed on the current iteration. On the following iteration, it may be more appropriate to choose \(x\) as the fixed component, as we reach the turning point in \(\lambda\).
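A single corrector step of this form can be sketched as follows (again illustrative; `H_x`, `H_lam`, and `H_val` denote \(\partial H/\partial x\), \(\partial H/\partial \lambda\), and \(H\) evaluated at the current corrector iterate):

```python
import numpy as np

def corrector_step(H_x, H_lam, H_val, i):
    """One Newton corrector step: the n x (n+1) Jacobian [H_x  H_lam] is
    bordered with the row e_i^T, which holds the i-th component of
    (x, lam) fixed during the correction."""
    n = H_x.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = H_x
    A[:n, n] = H_lam
    A[n, i] = 1.0                      # the row e_i^T
    rhs = np.concatenate([-H_val, [0.0]])
    step = np.linalg.solve(A, rhs)     # (delta_x, delta_lam)
    return step[:n], step[n]
```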
The two variants on path-following described above are able to follow curves like those depicted in Figure 11.4 to a solution of the nonlinear system. They rely, however, on the \(n \times (n+1)\) matrix in (11.63) having full rank for all \((x, \lambda)\) along the path, so that the tangent vector is well-defined. The following result shows that full rank is guaranteed under certain assumptions.

Theorem 11.9 (Watson [305]).
Suppose that \(r\) is twice continuously differentiable. Then for almost all vectors \(a \in \mathbb{R}^n\), there is a zero path emanating from \((a, 0)\) along which the \(n \times (n+1)\) matrix (11.63) has full rank. If this path is bounded for \(\lambda \in [0, 1)\), then it has an accumulation point \((\bar x, 1)\) such that \(r(\bar x) = 0\). Furthermore, if the Jacobian \(J(\bar x)\) is nonsingular, the zero path between \((a, 0)\) and \((\bar x, 1)\) has finite arc length.

The theorem assures us that unless we are unfortunate in the choice of \(a\), the algorithms described above can be applied to obtain a path that either diverges or else leads to a point \(\bar x\) that is a solution of the original nonlinear system if \(J(\bar x)\) is nonsingular. More detailed convergence results can be found in Watson [305] and the references therein.
We conclude with an example to show that divergence of the zero path (the less desirable outcome of Theorem 11.9) can happen even for innocent-looking problems.
EXAMPLE 11.3
Consider the system \(r(x) = x^2 - 1\), for which there are two nondegenerate solutions \(+1\) and \(-1\). Suppose we choose \(a = -2\) and attempt to apply a continuation method to the function
\[
H(x, \lambda) = \lambda (x^2 - 1) + (1 - \lambda)(x + 2) = \lambda x^2 + (1 - \lambda) x + (2 - 3\lambda), \qquad (11.65)
\]
obtained by substituting into (11.61). The zero paths for this function are plotted in Figure 11.6. As can be seen from that diagram, there is no zero path that joins \((-2, 0)\) to either \((-1, 1)\) or \((1, 1)\), so the continuation methods fail on this example.

Figure 11.6 Zero paths for the example in which \(H(x, \lambda) = \lambda(x^2 - 1) + (1 - \lambda)(x + 2)\). There is no continuous zero path from \(\lambda = 0\) to \(\lambda = 1\).

We can find the values of \(\lambda\) for which no solution exists by using the formula for a quadratic root to obtain
\[
x = \frac{-(1 - \lambda) \pm \sqrt{(1 - \lambda)^2 - 4\lambda(2 - 3\lambda)}}{2\lambda}.
\]
Now, when the term in the square root is negative, the corresponding values of \(x\) are complex, that is, there are no real roots \(x\). It is easy to verify that such is the case when
\[
\lambda \in \left( \frac{5 - 2\sqrt{3}}{13}, \; \frac{5 + 2\sqrt{3}}{13} \right) \approx (0.118, 0.651).
\]
Note that the zero path starting from \((-2, 0)\) becomes unbounded, which is one of the possible outcomes of Theorem 11.9. \(\square\)
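The interval above is easy to confirm numerically; the following two-line check (ours, added for illustration) computes the roots of the discriminant \((1 - \lambda)^2 - 4\lambda(2 - 3\lambda) = 13\lambda^2 - 10\lambda + 1\):

```python
import numpy as np

# H(x, lam) = 0 has no real root exactly when 13*lam^2 - 10*lam + 1 < 0.
lams = np.roots([13, -10, 1])
print(np.sort(lams))   # approx [0.118, 0.651]
```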
This example indicates that continuation methods may fail to produce a solution even to a fairly simple system of nonlinear equations. However, it is generally true that they are more reliable than the merit-function methods described earlier in the chapter. The extra robustness comes at a price, since continuation methods typically require significantly more computational effort than the merit-function methods.
NOTES AND REFERENCES

Nonlinear differential equations and integral equations are a rich source of nonlinear equations. When formulated as finite-dimensional nonlinear equations, the unknown vector \(x\) is a discrete approximation to the infinite-dimensional solution. In other applications, the vector \(x\) is intrinsically finite-dimensional; it may represent the quantities of materials to be transported between pairs of cities in a distribution network, for instance. In all cases, the equations \(r_i\) enforce consistency, conservation, and optimality principles in the model. Moré [212] and Averick et al. [10] discuss a number of interesting practical applications.
For analysis of the convergence of Broyden's method, including proofs of Theorem 11.5, see Dennis and Schnabel [92, Chapter 8] and Kelley [177, Chapter 6]. Details on a limited-memory implementation of Broyden's method are given by Kelley [177, Section 7.3].
Example 11.2 and the algorithm described by Powell [241] have been influential beyond the field of nonlinear equations. The example shows that a line-search method may not be able to achieve sufficient decrease, whereas the Cauchy step in the trust-region approach is designed to guarantee that this condition holds, and hence that reasonable convergence properties are guaranteed. The dogleg algorithm proposed in [241] can be viewed as one of the first modern trust-region methods.
EXERCISES

11.1 Show that for any vector \(s \in \mathbb{R}^n\), we have
\[
\left\| I - \frac{s s^T}{s^T s} \right\| = 1,
\]
where \(\|\cdot\|\) denotes the Euclidean matrix norm.

11.2 Consider the function \(r : \mathbb{R} \to \mathbb{R}\) defined by \(r(x) = x^q\), where \(q\) is an integer greater than 2. Note that \(x^* = 0\) is the sole root of this function and that it is degenerate. Show that Newton's method converges Q-linearly, and find the value of the convergence ratio \(r\) in (A.34).

11.3 Show that Newton's method applied to the function \(r(x) = -x^5 + x^3 + 4x\) starting from \(x_0 = 1\) produces the cyclic behavior described in the text. Find the roots of this function, and check that they are nondegenerate.

11.4 For the scalar function \(r(x) = \sin(5x) - x\), show that the sum-of-squares merit function has infinitely many local minima, and find a general formula for such points.
11.5 When \(r : \mathbb{R}^n \to \mathbb{R}^n\), show that the function
\[
\left\| \left( J^T J + \lambda I \right)^{-1} J^T r \right\|
\]
is monotonically decreasing in \(\lambda\) unless \(J^T r = 0\). (Hint: Use the singular-value decomposition of \(J\).)

11.6 Prove part (iii) of Theorem 11.3.

11.7 Consider a line-search Newton method in which the step length \(\alpha_k\) is chosen to be the exact minimizer of the merit function \(f(\cdot)\); that is,
\[
\alpha_k = \arg\min_{\alpha} f\!\left( x_k - \alpha J_k^{-1} r_k \right).
\]
Show that if \(J(x)\) is nonsingular at the solution \(x^*\), then \(\alpha_k \to 1\) as \(x_k \to x^*\).
11.8 Let \(J \in \mathbb{R}^{n \times m}\) and \(r \in \mathbb{R}^n\) and suppose that \(J J^T r = 0\). Show that \(J^T r = 0\). (Hint: This doesn't even take one line!)

11.9 Suppose we replace the assumption of \(x_k \to x^*\) in Theorem 11.8 by an assumption that the nondegenerate solution \(x^*\) is a limit point of \(\{x_k\}\). By adding some logic to the proof of this result, show that in fact \(x^*\) is the only possible limit point of the sequence. (Hint: Show that \(\|J_{k+1}^{-1} r_{k+1}\| \le \frac12 \|J_k^{-1} r_k\|\) for all \(k\) sufficiently large, and hence that for any constant \(\epsilon > 0\), the sequence \(\{x_k\}\) satisfies \(\|x_k - x^*\| \le \epsilon\) for all \(k\) sufficiently large.)

11.10 Consider the following modification of our example of failure of continuation methods:
\[
r(x) = x^2 - 1, \qquad a = \tfrac12.
\]
Show that for this example there is a zero path for \(H(x, \lambda) = \lambda(x^2 - 1) + (1 - \lambda)(x - a)\) that connects \((\tfrac12, 0)\) to \((1, 1)\), so that continuation methods should work for this choice of starting point.
CHAPTER 12
Theory of Constrained Optimization

The second part of this book is about minimizing functions subject to constraints on the variables. A general formulation for these problems is
\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad
\begin{cases}
c_i(x) = 0, & i \in \mathcal{E}, \\
c_i(x) \ge 0, & i \in \mathcal{I},
\end{cases} \qquad (12.1)
\]
where \(f\) and the functions \(c_i\) are all smooth, real-valued functions on a subset of \(\mathbb{R}^n\), and \(\mathcal{I}\) and \(\mathcal{E}\) are two finite sets of indices. As before, we call \(f\) the objective function, while \(c_i\), \(i \in \mathcal{E}\) are the equality constraints and \(c_i\), \(i \in \mathcal{I}\) are the inequality constraints. We define the feasible set \(\Omega\) to be the set of points \(x\) that satisfy the constraints; that is,
\[
\Omega = \{ x \mid c_i(x) = 0, \; i \in \mathcal{E}; \;\; c_i(x) \ge 0, \; i \in \mathcal{I} \}, \qquad (12.2)
\]
so that we can rewrite (12.1) more compactly as
\[
\min_{x \in \Omega} f(x). \qquad (12.3)
\]
In this chapter we derive mathematical characterizations of the solutions of (12.3). As in the unconstrained case, we discuss optimality conditions of two types. Necessary conditions are conditions that must be satisfied by any solution point (under certain assumptions). Sufficient conditions are those that, if satisfied at a certain point \(x^*\), guarantee that \(x^*\) is in fact a solution.
For the unconstrained optimization problem of Chapter 2, the optimality conditions were as follows:
Necessary conditions: Local unconstrained minimizers have \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) positive semidefinite.
Sufficient conditions: Any point \(x^*\) at which \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*)\) is positive definite is a strong local minimizer of \(f\).
In this chapter, we derive analogous conditions to characterize the solutions of constrained optimization problems.
LOCAL AND GLOBAL SOLUTIONS

We have seen already that global solutions are difficult to find even when there are no constraints. The situation may be improved when we add constraints, since the feasible set might exclude many of the local minima and it may be comparatively easy to pick the global minimum from those that remain. However, constraints can also make things more difficult. As an example, consider the problem
\[
\min_x \; (x_2 + 100)^2 + 0.01 x_1^2, \quad \text{subject to} \quad x_2 - \cos x_1 \ge 0, \qquad (12.4)
\]
illustrated in Figure 12.1. Without the constraint, the problem has the unique solution \((0, -100)^T\). With the constraint, there are local solutions near the points
\[
x^{(k)} = (k\pi, -1)^T, \quad \text{for } k = \pm 1, \pm 3, \pm 5, \ldots.
\]
Definitions of the different types of local solutions are simple extensions of the corresponding definitions for the unconstrained case, except that now we restrict consideration to the feasible points in the neighborhood of \(x^*\). We have the following definition.
Figure 12.1 Constrained problem with many isolated local solutions.
A vector \(x^*\) is a local solution of the problem (12.3) if \(x^* \in \Omega\) and there is a neighborhood \(\mathcal{N}\) of \(x^*\) such that \(f(x) \ge f(x^*)\) for \(x \in \mathcal{N} \cap \Omega\).
Similarly, we can make the following definitions:
A vector \(x^*\) is a strict local solution (also called a strong local solution) if \(x^* \in \Omega\) and there is a neighborhood \(\mathcal{N}\) of \(x^*\) such that \(f(x) > f(x^*)\) for all \(x \in \mathcal{N} \cap \Omega\) with \(x \ne x^*\).
A point \(x^*\) is an isolated local solution if \(x^* \in \Omega\) and there is a neighborhood \(\mathcal{N}\) of \(x^*\) such that \(x^*\) is the only local solution in \(\mathcal{N} \cap \Omega\).
Note that isolated local solutions are strict, but that the reverse is not true (see Exercise 12.2).
SMOOTHNESS

Smoothness of objective functions and constraints is an important issue in characterizing solutions, just as in the unconstrained case. It ensures that the objective function and the constraints all behave in a reasonably predictable way and therefore allows algorithms to make good choices for search directions.
We saw in Chapter 2 that graphs of nonsmooth functions contain kinks or jumps where the smoothness breaks down. If we plot the feasible region for any given constrained optimization problem, we usually observe many kinks and sharp edges. Does this mean that the constraint functions that describe these regions are nonsmooth? The answer is often no, because the nonsmooth boundaries can often be described by a collection of smooth constraint functions. Figure 12.2 shows a diamond-shaped feasible region in \(\mathbb{R}^2\) that could be described by the single nonsmooth constraint
\[
\|x\|_1 = |x_1| + |x_2| \le 1. \qquad (12.5)
\]
It can also be described by the following set of smooth (in fact, linear) constraints:
\[
x_1 + x_2 \le 1, \quad x_1 - x_2 \le 1, \quad -x_1 + x_2 \le 1, \quad -x_1 - x_2 \le 1. \qquad (12.6)
\]
Each of the four constraints represents one edge of the feasible polytope. In general, the constraint functions are chosen so that each one represents a smooth piece of the boundary of \(\Omega\).
Figure 12.2 A feasible region with a nonsmooth boundary can be described by smooth constraints.

Nonsmooth, unconstrained optimization problems can sometimes be reformulated as smooth constrained problems. An example is the unconstrained minimization of a function
\[
f(x) = \max(x^2, x), \qquad (12.7)
\]
which has kinks at \(x = 0\) and \(x = 1\), and the solution at \(x^* = 0\). We obtain a smooth, constrained formulation of this problem by adding an artificial variable \(t\) and writing
\[
\min t \quad \text{s.t.} \quad t \ge x, \;\; t \ge x^2. \qquad (12.8)
\]
Reformulation techniques such as (12.6) and (12.8) are used often in cases where \(f\) is a maximum of a collection of functions or when \(f\) is a 1-norm or \(\infty\)-norm of a vector function.
In the examples above we expressed inequality constraints in a slightly different way from the form \(c_i(x) \ge 0\) that appears in the definition (12.1). However, any collection of inequality constraints with \(\ge\) and \(\le\) and nonzero right-hand sides can be expressed in the form \(c_i(x) \ge 0\) by simple rearrangement of the inequality.
12.1 EXAMPLES
To introduce the basic principles behind the characterization of solutions of constrained optimization problems, we work through three simple examples. The discussion here is informal; the ideas introduced will be made rigorous in the sections that follow.
We start by noting one important item of terminology that recurs throughout the rest of the book.
Definition 12.1.
The active set \(\mathcal{A}(x)\) at any feasible \(x\) consists of the equality constraint indices from \(\mathcal{E}\) together with the indices of the inequality constraints \(i\) for which \(c_i(x) = 0\); that is,
\[
\mathcal{A}(x) = \mathcal{E} \cup \{ i \in \mathcal{I} \mid c_i(x) = 0 \}.
\]
At a feasible point \(x\), the inequality constraint \(i \in \mathcal{I}\) is said to be active if \(c_i(x) = 0\) and inactive if the strict inequality \(c_i(x) > 0\) is satisfied.

A SINGLE EQUALITY CONSTRAINT
EXAMPLE 12.1
Our first example is a two-variable problem with a single equality constraint:
\[
\min \; x_1 + x_2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 2 = 0 \qquad (12.9)
\]
(see Figure 12.3). In the language of (12.1), we have \(f(x) = x_1 + x_2\), \(\mathcal{I} = \emptyset\), \(\mathcal{E} = \{1\}\), and \(c_1(x) = x_1^2 + x_2^2 - 2\). We can see by inspection that the feasible set for this problem is the circle of radius \(\sqrt{2}\) centered at the origin, just the boundary of this circle, not its interior. The solution \(x^*\) is obviously \((-1, -1)^T\). From any other point on the circle, it is easy to find a way to move that stays feasible (that is, remains on the circle) while decreasing \(f\). For instance, from the point \(x = (\sqrt{2}, 0)^T\), any move in the clockwise direction around the circle has the desired effect.

Figure 12.3 Problem (12.9), showing constraint and function gradients at various feasible points.
We also see from Figure 12.3 that at the solution \(x^*\), the constraint normal \(\nabla c_1(x^*)\) is parallel to \(\nabla f(x^*)\). That is, there is a scalar \(\lambda_1^*\) (in this case \(\lambda_1^* = -\tfrac12\)) such that
\[
\nabla f(x^*) = \lambda_1^* \nabla c_1(x^*). \qquad (12.10)
\]
We can derive (12.10) by examining first-order Taylor series approximations to the objective and constraint functions. To retain feasibility with respect to the function \(c_1(x) = 0\), we require any small (but nonzero) step \(s\) to satisfy that \(c_1(x + s) = 0\); that is,
\[
0 = c_1(x + s) \approx c_1(x) + \nabla c_1(x)^T s = \nabla c_1(x)^T s. \qquad (12.11)
\]
Hence, the step \(s\) retains feasibility with respect to \(c_1\), to first order, when it satisfies
\[
\nabla c_1(x)^T s = 0. \qquad (12.12)
\]
Similarly, if we want \(s\) to produce a decrease in \(f\), we would have
\[
0 > f(x + s) \approx f(x) + \nabla f(x)^T s,
\]
or, to first order,
\[
\nabla f(x)^T s < 0. \qquad (12.13)
\]
Existence of a small step \(s\) that satisfies both (12.12) and (12.13) strongly suggests existence of a direction \(d\) (where the size of \(d\) is not small; we could have \(d = s / \|s\|\) to ensure that the norm of \(d\) is close to 1) with the same properties, namely
\[
\nabla c_1(x)^T d = 0 \quad \text{and} \quad \nabla f(x)^T d < 0. \qquad (12.14)
\]
If, on the other hand, there is no direction \(d\) with the properties (12.14), then it is likely that we cannot find a small step \(s\) with the properties (12.12) and (12.13). In this case, \(x^*\) would appear to be a local minimizer.
By drawing a picture, the reader can check that the only way that a \(d\) satisfying (12.14) does not exist is if \(\nabla f(x)\) and \(\nabla c_1(x)\) are parallel, that is, if the condition \(\nabla f(x) = \lambda_1 \nabla c_1(x)\) holds at \(x\), for some scalar \(\lambda_1\). If in fact \(\nabla f(x)\) and \(\nabla c_1(x)\) are not parallel, we can set
\[
\bar d = \left( I - \frac{\nabla c_1(x) \nabla c_1(x)^T}{\|\nabla c_1(x)\|^2} \right) \nabla f(x), \qquad d = -\frac{\bar d}{\|\bar d\|}. \qquad (12.15)
\]
It is easy to verify that this \(d\) satisfies (12.14).
By introducing the Lagrangian function
\[
\mathcal{L}(x, \lambda_1) = f(x) - \lambda_1 c_1(x), \qquad (12.16)
\]
and noting that \(\nabla_x \mathcal{L}(x, \lambda_1) = \nabla f(x) - \lambda_1 \nabla c_1(x)\), we can state the condition (12.10) equivalently as follows: At the solution \(x^*\), there is a scalar \(\lambda_1^*\) such that
\[
\nabla_x \mathcal{L}(x^*, \lambda_1^*) = 0. \qquad (12.17)
\]
This observation suggests that we can search for solutions of the equality-constrained problem (12.9) by seeking stationary points of the Lagrangian function. The scalar quantity \(\lambda_1\) in (12.16) is called a Lagrange multiplier for the constraint \(c_1(x) = 0\).
Though the condition (12.10) (equivalently, (12.17)) appears to be necessary for an optimal solution of the problem (12.9), it is clearly not sufficient. For instance, in Example 12.1, condition (12.10) is satisfied at the point \(x = (1, 1)^T\) (with \(\lambda_1 = \tfrac12\)), but this point is obviously not a solution; in fact, it maximizes the function \(f\) on the circle. Moreover, in the case of equality-constrained problems, we cannot turn the condition (12.10) into a sufficient condition simply by placing some restriction on the sign of \(\lambda_1\). To see this, consider replacing the constraint \(x_1^2 + x_2^2 - 2 = 0\) by its negative \(2 - x_1^2 - x_2^2 = 0\) in Example 12.1. The solution of the problem is not affected, but the value of \(\lambda_1^*\) that satisfies the condition (12.10) changes from \(\lambda_1^* = -\tfrac12\) to \(\lambda_1^* = \tfrac12\).
A SINGLE INEQUALITY CONSTRAINT

EXAMPLE 12.2
This is a slight modification of Example 12.1, in which the equality constraint is replaced by an inequality. Consider
\[
\min \; x_1 + x_2 \quad \text{s.t.} \quad 2 - x_1^2 - x_2^2 \ge 0, \qquad (12.18)
\]
for which the feasible region consists of the circle of problem (12.9) and its interior (see Figure 12.4). Note that the constraint normal \(\nabla c_1\) points toward the interior of the feasible region at each point on the boundary of the circle. By inspection, we see that the solution is still \((-1, -1)^T\) and that the condition (12.10) holds for the value \(\lambda_1^* = \tfrac12\). However, this inequality-constrained problem differs from the equality-constrained problem (12.9) of Example 12.1 in that the sign of the Lagrange multiplier plays a significant role, as we now argue.
As before, we conjecture that a given feasible point \(x\) is not optimal if we can find a small step \(s\) that both retains feasibility and decreases the objective function \(f\) to first order. The main difference between problems (12.9) and (12.18) comes in the handling of the feasibility condition. As in (12.13), the step \(s\) improves the objective function, to first order, if \(\nabla f(x)^T s < 0\). Meanwhile, \(s\) retains feasibility if
\[
0 \le c_1(x + s) \approx c_1(x) + \nabla c_1(x)^T s,
\]
so, to first order, feasibility is retained if
\[
c_1(x) + \nabla c_1(x)^T s \ge 0. \qquad (12.19)
\]
In determining whether a step \(s\) exists that satisfies both (12.13) and (12.19), we
consider the following two cases, which are illustrated in Figure 12.4.
Case I: Consider first the case in which \(x\) lies strictly inside the circle, so that the strict inequality \(c_1(x) > 0\) holds. In this case, any step vector \(s\) satisfies the condition (12.19), provided only that its length is sufficiently small. In fact, whenever \(\nabla f(x) \ne 0\), we can obtain a step \(s\) that satisfies both (12.13) and (12.19) by setting
\[
s = -\alpha \nabla f(x),
\]
for any positive scalar \(\alpha\) sufficiently small. However, this definition does not give a step \(s\) with the required properties when
\[
\nabla f(x) = 0. \qquad (12.20)
\]

Figure 12.4 Improvement directions \(s\) from two feasible points \(x\) for the problem (12.18) at which the constraint is active and inactive, respectively.

Case II: Consider now the case in which \(x\) lies on the boundary of the circle, so that \(c_1(x) = 0\). The conditions (12.13) and (12.19) therefore become
\[
\nabla f(x)^T s < 0, \qquad \nabla c_1(x)^T s \ge 0.
\]
The first of these conditions defines an open half-space, while the second defines a closed half-space, as illustrated in Figure 12.5. It is clear from this figure that the intersection of these two regions is empty only when \(\nabla f(x)\) and \(\nabla c_1(x)\) point in the same direction, that is, when
\[
\nabla f(x) = \lambda_1 \nabla c_1(x), \quad \text{for some } \lambda_1 \ge 0. \qquad (12.21)
\]
Note that the sign of the multiplier is significant here. If (12.10) were satisfied with a negative value of \(\lambda_1\), then \(\nabla f(x)\) and \(\nabla c_1(x)\) would point in opposite directions, and we see from Figure 12.5 that the set of directions that satisfy both (12.13) and (12.19) would make up an entire open half-plane.

Figure 12.5 A direction \(d\) that satisfies both (12.13) and (12.19) lies in the intersection of a closed half-plane and an open half-plane; any \(d\) in this cone is a good search direction, to first order.

The optimality conditions for both cases I and II can again be summarized neatly with reference to the Lagrangian function \(\mathcal{L}\) defined in (12.16). When no first-order feasible descent direction exists at some point \(x^*\), we have that
\[
\nabla_x \mathcal{L}(x^*, \lambda_1^*) = 0, \quad \text{for some } \lambda_1^* \ge 0, \qquad (12.22)
\]
where we also require that
\[
\lambda_1^* c_1(x^*) = 0. \qquad (12.23)
\]
Condition (12.23) is known as a complementarity condition; it implies that the Lagrange multiplier \(\lambda_1\) can be strictly positive only when the corresponding constraint \(c_1\) is active. Conditions of this type play a central role in constrained optimization, as we see in the sections that follow. In case I, we have that \(c_1(x^*) > 0\), so (12.23) requires that \(\lambda_1^* = 0\). Hence, (12.22) reduces to \(\nabla f(x^*) = 0\), as required by (12.20). In case II, (12.23) allows \(\lambda_1^*\) to take on a nonnegative value, so (12.22) becomes equivalent to (12.21).
TWO INEQUALITY CONSTRAINTS

EXAMPLE 12.3
Suppose we add an extra constraint to the problem (12.18) to obtain
\[
\min \; x_1 + x_2 \quad \text{s.t.} \quad 2 - x_1^2 - x_2^2 \ge 0, \quad x_2 \ge 0, \qquad (12.24)
\]
for which the feasible region is the half-disk illustrated in Figure 12.6. It is easy to see that the solution lies at \((-\sqrt{2}, 0)^T\), a point at which both constraints are active. By repeating the arguments for the previous examples, we would expect a direction \(d\) of first-order feasible descent to satisfy
\[
\nabla c_i(x)^T d \ge 0, \;\; i \in \mathcal{I} = \{1, 2\}, \qquad \nabla f(x)^T d < 0. \qquad (12.25)
\]
However, it is clear from Figure 12.6 that no such direction can exist when \(x = (-\sqrt{2}, 0)^T\). The conditions \(\nabla c_i(x)^T d \ge 0\), \(i = 1, 2\), are both satisfied only if \(d\) lies in the quadrant defined by \(\nabla c_1(x)\) and \(\nabla c_2(x)\), but it is clear by inspection that all vectors \(d\) in this quadrant satisfy \(\nabla f(x)^T d \ge 0\).
Let us see how the Lagrangian and its derivatives behave for the problem (12.24) and the solution point \((-\sqrt{2}, 0)^T\). First, we include an additional term \(\lambda_i c_i(x)\) in the Lagrangian for each additional constraint, so the definition of \(\mathcal{L}\) becomes
\[
\mathcal{L}(x, \lambda) = f(x) - \lambda_1 c_1(x) - \lambda_2 c_2(x),
\]
where \(\lambda = (\lambda_1, \lambda_2)^T\) is the vector of Lagrange multipliers.

Figure 12.6 Problem (12.24), illustrating the gradients of the active constraints and objective at the solution.

The extension of condition (12.22) to this case is
\[
\nabla_x \mathcal{L}(x^*, \lambda^*) = 0, \quad \text{for some } \lambda^* \ge 0, \qquad (12.26)
\]
where the inequality \(\lambda^* \ge 0\) means that all components of \(\lambda^*\) are required to be nonnegative. By applying the complementarity condition (12.23) to both inequality constraints, we obtain
\[
\lambda_1^* c_1(x^*) = 0, \qquad \lambda_2^* c_2(x^*) = 0. \qquad (12.27)
\]
When \(x^* = (-\sqrt{2}, 0)^T\), we have
\[
\nabla f(x^*) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
\nabla c_1(x^*) = \begin{bmatrix} 2\sqrt{2} \\ 0 \end{bmatrix}, \quad
\nabla c_2(x^*) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},
\]
so that it is easy to verify that \(\nabla_x \mathcal{L}(x^*, \lambda^*) = 0\) when we select \(\lambda^*\) as follows:
\[
\lambda^* = \begin{bmatrix} 1/(2\sqrt{2}) \\ 1 \end{bmatrix}.
\]
Note that both components of \(\lambda^*\) are positive, so that (12.26) is satisfied.
We consider now some other feasible points that are not solutions of (12.24), and examine the properties of the Lagrangian and its gradient at these points.
For the point \(x = (\sqrt{2}, 0)^T\), we again have that both constraints are active (see Figure 12.7). However, it is easy to identify vectors \(d\) that satisfy (12.25): \(d = (-1, 0)^T\) is one such vector (there are many others). For this value of \(x\) it is easy to verify that the condition \(\nabla_x \mathcal{L}(x, \lambda) = 0\) is satisfied only when \(\lambda = (-1/(2\sqrt{2}), 1)^T\). Note that the first component \(\lambda_1\) is negative, so that the conditions (12.26) are not satisfied at this point.

Figure 12.7 Problem (12.24), illustrating the gradients of the active constraints and objective at a nonoptimal point.

Finally, we consider the point \(x = (1, 0)^T\), at which only the second constraint \(c_2\) is active. Since any small step \(s\) away from this point will continue to satisfy \(c_1(x + s) > 0\), we need to consider only the behavior of \(c_2\) and \(f\) in determining whether \(s\) is indeed a feasible descent step. Using the same reasoning as in the earlier examples, we find that the direction of feasible descent \(d\) must satisfy
\[
\nabla c_2(x)^T d \ge 0, \qquad \nabla f(x)^T d < 0. \qquad (12.28)
\]
By noting that
\[
\nabla f(x) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \nabla c_2(x) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},
\]
it is easy to verify that the vector \(d = \left( -\tfrac12, \tfrac14 \right)^T\) satisfies (12.28) and is therefore a descent direction.
To show that the optimality conditions (12.26) and (12.27) fail, we note first from (12.27) that since \(c_1(x) > 0\), we must have \(\lambda_1 = 0\). Therefore, in trying to satisfy \(\nabla_x \mathcal{L}(x, \lambda) = 0\), we are left to search for a value \(\lambda_2\) such that \(\nabla f(x) - \lambda_2 \nabla c_2(x) = 0\). No such \(\lambda_2\) exists, and thus this point fails to satisfy the optimality conditions. \(\square\)
12.2 TANGENT CONE AND CONSTRAINT QUALIFICATIONS
In this section we define the tangent cone \(T_\Omega(x)\) to the closed set \(\Omega\) at a point \(x \in \Omega\), and also the set \(\mathcal{F}(x)\) of first-order feasible directions at \(x\). We also discuss constraint qualifications. In the previous section, we determined whether or not it was possible to take a feasible descent step away from a given feasible point \(x\) by examining the first derivatives of \(f\) and the constraint functions \(c_i\). We used the first-order Taylor series expansion of these functions about \(x\) to form an approximate problem in which both objective and constraints are linear. This approach makes sense, however, only when the linearized approximation captures the essential geometric features of the feasible set near the point \(x\) in question. If, near \(x\), the linearization is fundamentally different from the feasible set (for instance, if it is an entire plane, while the feasible set is a single point), then we cannot expect the linear approximation to yield useful information about the original problem. Hence, we need to make assumptions about the nature of the constraints \(c_i\) that are active at \(x\) to ensure that the linearized approximation is similar to the feasible set, near \(x\). Constraint qualifications are assumptions that ensure similarity of the constraint set and its linearized approximation, in a neighborhood of \(x\).
Given a feasible point \(x\), we call \(\{z_k\}\) a feasible sequence approaching \(x\) if \(z_k \in \Omega\) for all \(k\) sufficiently large and \(z_k \to x\).
Later, we characterize a local solution of (12.1) as a point \(x\) at which all feasible sequences approaching \(x\) have the property that \(f(z_k) \ge f(x)\) for all \(k\) sufficiently large, and we will derive practical, verifiable conditions under which this property holds. We lay the groundwork in this section by characterizing the directions in which we can step away from \(x\) while remaining feasible.
A tangent is a limiting direction of a feasible sequence.
Definition 12.2.
The vector \(d\) is said to be a tangent (or tangent vector) to \(\Omega\) at a point \(x\) if there are a feasible sequence \(\{z_k\}\) approaching \(x\) and a sequence of positive scalars \(\{t_k\}\) with \(t_k \to 0\) such that
\[
\lim_{k \to \infty} \frac{z_k - x}{t_k} = d. \qquad (12.29)
\]
The set of all tangents to \(\Omega\) at \(x\) is called the tangent cone and is denoted by \(T_\Omega(x)\).

It is easy to see that the tangent cone is indeed a cone, according to the definition (A.36). If \(d\) is a tangent vector with corresponding sequences \(\{z_k\}\) and \(\{t_k\}\), then by replacing each \(t_k\) by \(\alpha^{-1} t_k\), for any \(\alpha > 0\), we find that \(\alpha d \in T_\Omega(x)\) also. We obtain that \(0 \in T_\Omega(x)\) by setting \(z_k \equiv x\) in the definition of feasible sequence.
We turn now to the linearized feasible direction set, which we define as follows.

Definition 12.3.
Given a feasible point \(x\) and the active constraint set \(\mathcal{A}(x)\) of Definition 12.1, the set of linearized feasible directions \(\mathcal{F}(x)\) is
\[
\mathcal{F}(x) = \left\{ d \;\middle|\;
\begin{array}{ll}
d^T \nabla c_i(x) = 0, & \text{for all } i \in \mathcal{E}, \\
d^T \nabla c_i(x) \ge 0, & \text{for all } i \in \mathcal{A}(x) \cap \mathcal{I}
\end{array}
\right\}.
\]
As with the tangent cone, it is easy to verify that \(\mathcal{F}(x)\) is a cone, according to the definition (A.36).
It is important to note that the definition of tangent cone does not rely on the algebraic specification of the set \(\Omega\), only on its geometry. The linearized feasible direction set does, however, depend on the definition of the constraint functions \(c_i\), \(i \in \mathcal{E} \cup \mathcal{I}\).
Figure 12.8 Constraint normal, objective gradient, and feasible sequence for problem (12.9).

We illustrate the tangent cone and the linearized feasible direction set by revisiting Examples 12.1 and 12.2.

EXAMPLE 12.4 (EXAMPLE 12.1, REVISITED)
Figure 12.8 shows the problem (12.9), the equality-constrained problem in which the feasible set is a circle of radius \(\sqrt{2}\), near the nonoptimal point \(x = (-\sqrt{2}, 0)^T\). The figure also shows a feasible sequence approaching \(x\). This sequence could be defined analytically by the formula
\[
z_k = \begin{bmatrix} -\sqrt{2 - 1/k^2} \\ -1/k \end{bmatrix}. \qquad (12.30)
\]
By choosing \(t_k = \|z_k - x\|\), we find that \(d = (0, -1)^T\) is a tangent. Note that the objective function \(f(x) = x_1 + x_2\) increases as we move along the sequence (12.30); in fact, we have \(f(z_{k+1}) > f(z_k)\) for all \(k = 2, 3, \ldots\). It follows that \(f(z_k) < f(x)\) for \(k = 2, 3, \ldots\), so \(x\) cannot be a solution of (12.9).
Another feasible sequence is one that approaches \(x = (-\sqrt{2}, 0)^T\) from the opposite direction. Its elements are defined by
\[
z_k = \begin{bmatrix} -\sqrt{2 - 1/k^2} \\ 1/k \end{bmatrix}.
\]
It is easy to show that \(f\) decreases along this sequence and that the tangents corresponding to this sequence are \(d = (0, 1)^T\). In summary, the tangent cone at \(x = (-\sqrt{2}, 0)^T\) is \(\{(0, d_2)^T \mid d_2 \in \mathbb{R}\}\).
For the definition (12.9) of this set, and Definition 12.3, we have that \(d = (d_1, d_2)^T \in \mathcal{F}(x)\) if
\[
0 = \nabla c_1(x)^T d = \begin{bmatrix} 2x_1 \\ 2x_2 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = -2\sqrt{2}\, d_1.
\]
Therefore, we obtain \(\mathcal{F}(x) = \{(0, d_2)^T \mid d_2 \in \mathbb{R}\}\). In this case, we have \(T_\Omega(x) = \mathcal{F}(x)\).
Suppose that the feasible set is defined instead by the formula
\[
\Omega = \{ x \mid c_1(x) = 0 \}, \quad \text{where } c_1(x) = \left( x_1^2 + x_2^2 - 2 \right)^2 = 0. \qquad (12.31)
\]
Note that \(\Omega\) is the same, but its algebraic specification has changed. The vector \(d\) belongs to the linearized feasible set if
\[
0 = \nabla c_1(x)^T d = \begin{bmatrix} 4(x_1^2 + x_2^2 - 2) x_1 \\ 4(x_1^2 + x_2^2 - 2) x_2 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix},
\]
which is true for all \(d = (d_1, d_2)^T\). Hence, we have \(\mathcal{F}(x) = \mathbb{R}^2\), so for this algebraic specification of \(\Omega\), the tangent cone and linearized feasible sets differ. \(\square\)
EXAMPLE 12.5 (EXAMPLE 12.2, REVISITED)
We now reconsider problem (12.18) in Example 12.2. The solution \(x^* = (-1, -1)^T\) is the same as in the equality-constrained case, but there is a much more extensive collection of feasible sequences that converge to any given feasible point (see Figure 12.9).

Figure 12.9 Feasible sequences converging to a particular feasible point for the region defined by \(x_1^2 + x_2^2 \le 2\).
From the point \(x = (-\sqrt{2}, 0)^T\), the various feasible sequences defined above for the equality-constrained problem are still feasible for (12.18). There are also infinitely many feasible sequences that converge to \(x = (-\sqrt{2}, 0)^T\) along a straight line from the interior of the circle. These sequences have the form
\[
z_k = (-\sqrt{2}, 0)^T + (1/k) w,
\]
where \(w\) is any vector whose first component is positive (\(w_1 > 0\)). The point \(z_k\) is feasible provided that \(\|z_k\| \le \sqrt{2}\), that is,
\[
\left( -\sqrt{2} + w_1 / k \right)^2 + \left( w_2 / k \right)^2 \le 2,
\]
which is true when \(k \ge (w_1^2 + w_2^2) / (2\sqrt{2}\, w_1)\). In addition to these straight-line feasible sequences, we can also define an infinite variety of sequences that approach \((-\sqrt{2}, 0)^T\) along a curve from the interior of the circle. To summarize, the tangent cone to this set at \((-\sqrt{2}, 0)^T\) is \(\{(w_1, w_2)^T \mid w_1 \ge 0\}\).
For the definition (12.18) of this feasible set, we have from Definition 12.3 that \(d \in \mathcal{F}(x)\) if
\[
0 \le \nabla c_1(x)^T d = \begin{bmatrix} -2x_1 \\ -2x_2 \end{bmatrix}^T \begin{bmatrix} d_1 \\ d_2 \end{bmatrix} = 2\sqrt{2}\, d_1.
\]
Hence, we obtain \(\mathcal{F}(x) = T_\Omega(x)\) for this particular algebraic specification of the feasible set. \(\square\)
Constraint qualifications are conditions under which the linearized feasible set \(\mathcal{F}(x)\) is similar to the tangent cone \(T_\Omega(x)\). In fact, most constraint qualifications ensure that these two sets are identical. As mentioned earlier, these conditions ensure that \(\mathcal{F}(x)\), which is constructed by linearizing the algebraic description of the set \(\Omega\) at \(x\), captures the essential geometric features of the set \(\Omega\) in the vicinity of \(x\), as represented by \(T_\Omega(x)\).
Revisiting Example 12.4, we see that both \(T_\Omega(x)\) and \(\mathcal{F}(x)\) consist of the vertical axis, which is qualitatively similar to the set \(\Omega\) in the neighborhood of \(x\). As a further example, consider the constraints
\[
c_1(x) = 1 - x_1^2 - (x_2 - 1)^2 \ge 0, \qquad c_2(x) = -x_2 \ge 0, \qquad (12.32)
\]
for which the feasible set is the single point \(\Omega = \{(0, 0)^T\}\) (see Figure 12.10). For this point \(x = (0, 0)^T\), it is obvious that the tangent cone is \(T_\Omega(x) = \{(0, 0)^T\}\), since all feasible sequences approaching \(x\) must have \(z_k = x = (0, 0)^T\) for all \(k\) sufficiently large. Moreover, it is easy to show that the linearized approximation to the feasible set is
\[
\mathcal{F}(x) = \{ (d_1, 0)^T \mid d_1 \in \mathbb{R} \},
\]
that is, the entire horizontal axis. In this case, the linearized feasible direction set does not capture the geometry of the feasible set, so constraint qualifications are not satisfied.

Figure 12.10 Problem (12.32), for which the feasible set is the single point of intersection between circle and line.
The constraint qualification most often used in the design of algorithms is the subject of the next definition.
Definition 12.4 (LICQ).
Given the point \(x\) and the active set \(\mathcal{A}(x)\) defined in Definition 12.1, we say that the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients \(\{\nabla c_i(x), \; i \in \mathcal{A}(x)\}\) is linearly independent.

Note that this condition is not satisfied for the examples (12.32) and (12.31). In general, if LICQ holds, none of the active constraint gradients can be zero. We mention other constraint qualifications in Section 12.6.
12.3 FIRST-ORDER OPTIMALITY CONDITIONS

In this section, we state first-order necessary conditions for \(x^*\) to be a local minimizer and show how these conditions are satisfied on a small example. The proof of the result is presented in subsequent sections.
As a preliminary to stating the necessary conditions, we define the Lagrangian function for the general problem (12.1):
\[
\mathcal{L}(x, \lambda) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i c_i(x). \qquad (12.33)
\]
We had previously defined special cases of this function for the examples of Section 12.1.
The necessary conditions defined in the following theorem are called first-order conditions because they are concerned with properties of the gradients (first-derivative vectors) of the objective and constraint functions. These conditions are the foundation for many of the algorithms described in the remaining chapters of the book.
Theorem 12.1 (First-Order Necessary Conditions).
Suppose that \(x^*\) is a local solution of (12.1), that the functions \(f\) and \(c_i\) in (12.1) are continuously differentiable, and that the LICQ holds at \(x^*\). Then there is a Lagrange multiplier vector \(\lambda^*\), with components \(\lambda_i^*\), \(i \in \mathcal{E} \cup \mathcal{I}\), such that the following conditions are satisfied at \((x^*, \lambda^*)\):
\[
\begin{alignedat}{2}
\nabla_x \mathcal{L}(x^*, \lambda^*) &= 0, & & \qquad (12.34a) \\
c_i(x^*) &= 0, & \quad \text{for all } i \in \mathcal{E}, & \qquad (12.34b) \\
c_i(x^*) &\ge 0, & \quad \text{for all } i \in \mathcal{I}, & \qquad (12.34c) \\
\lambda_i^* &\ge 0, & \quad \text{for all } i \in \mathcal{I}, & \qquad (12.34d) \\
\lambda_i^* c_i(x^*) &= 0, & \quad \text{for all } i \in \mathcal{E} \cup \mathcal{I}. & \qquad (12.34e)
\end{alignedat}
\]
The conditions (12.34) are often known as the Karush-Kuhn-Tucker conditions, or KKT conditions for short. The conditions (12.34e) are complementarity conditions; they imply that either constraint \(i\) is active or \(\lambda_i^* = 0\), or possibly both. In particular, since the Lagrange multipliers corresponding to inactive inequality constraints are zero, we can omit the terms for indices \(i \notin \mathcal{A}(x^*)\) from (12.34a) and rewrite this condition as
\[
0 = \nabla_x \mathcal{L}(x^*, \lambda^*) = \nabla f(x^*) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* \nabla c_i(x^*). \qquad (12.35)
\]
A special case of complementarity is important and deserves its own definition.

Definition 12.5 (Strict Complementarity).
Given a local solution \(x^*\) of (12.1) and a vector \(\lambda^*\) satisfying (12.34), we say that the strict complementarity condition holds if exactly one of \(\lambda_i^*\) and \(c_i(x^*)\) is zero for each index \(i \in \mathcal{I}\). In other words, we have that \(\lambda_i^* > 0\) for each \(i \in \mathcal{I} \cap \mathcal{A}(x^*)\).

Satisfaction of the strict complementarity property usually makes it easier for algorithms to determine the active set \(\mathcal{A}(x^*)\) and converge rapidly to the solution \(x^*\).
For a given problem (12.1) and solution point \(x^*\), there may be many vectors \(\lambda^*\) for which the conditions (12.34) are satisfied. When the LICQ holds, however, the optimal \(\lambda^*\) is unique (see Exercise 12.17).
The proof of Theorem 12.1 is quite complex, but it is important to our understanding of constrained optimization, so we present it in the next section. First, we illustrate the KKT conditions with another example.
Figure 12.11 Inequality-constrained problem (12.36) with solution at \((1, 0)^T\).

EXAMPLE 12.6
Consider the feasible region illustrated in Figure 12.2 and described by the four constraints (12.6). By restating the constraints in the standard form of (12.1) and including an objective function, the problem becomes
\[
\min_x \; \left( x_1 - \tfrac32 \right)^2 + \left( x_2 - \tfrac12 \right)^4
\quad \text{s.t.} \quad
\begin{bmatrix}
1 - x_1 - x_2 \\
1 - x_1 + x_2 \\
1 + x_1 - x_2 \\
1 + x_1 + x_2
\end{bmatrix} \ge 0. \qquad (12.36)
\]
It is fairly clear from Figure 12.11 that the solution is \(x^* = (1, 0)^T\). The first and second constraints in (12.36) are active at this point. Denoting them by \(c_1\) and \(c_2\) (and the inactive constraints by \(c_3\) and \(c_4\)), we have
\[
\nabla f(x^*) = \begin{bmatrix} -1 \\ -\tfrac12 \end{bmatrix}, \quad
\nabla c_1(x^*) = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \quad
\nabla c_2(x^*) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}.
\]
Therefore, the KKT conditions (12.34a)-(12.34e) are satisfied when we set
\[
\lambda^* = \left( \tfrac34, \tfrac14, 0, 0 \right)^T. \qquad \square
\]
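As a quick numerical confirmation (our own check, not part of the book's text), the following snippet evaluates each of the conditions (12.34) for problem (12.36) at \(x^* = (1, 0)^T\) with the multiplier vector given above:

```python
import numpy as np

# Check the KKT conditions (12.34) at x* = (1, 0), lambda* = (3/4, 1/4, 0, 0).
x = np.array([1.0, 0.0])
lam = np.array([0.75, 0.25, 0.0, 0.0])
c = np.array([1 - x[0] - x[1], 1 - x[0] + x[1],
              1 + x[0] - x[1], 1 + x[0] + x[1]])
grad_c = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
grad_f = np.array([2 * (x[0] - 1.5), 4 * (x[1] - 0.5)**3])

print(grad_f - grad_c.T @ lam)   # (12.34a): [0, 0]
print(c >= 0, lam >= 0)          # (12.34c) and (12.34d): all True
print(lam * c)                   # (12.34e): all zeros
```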
12.4 FIRST-ORDER OPTIMALITY CONDITIONS: PROOF

We now develop a proof of Theorem 12.1. A number of key subsidiary results are required, so the development is quite long. However, a complete treatment is worthwhile, since these results are so fundamental to the field of optimization.
RELATING THE TANGENT CONE AND THE FIRST-ORDER FEASIBLE DIRECTION SET

The following key result uses a constraint qualification (LICQ) to relate the tangent cone of Definition 12.2 to the set \(\mathcal{F}\) of first-order feasible directions of Definition 12.3. In the proof below and in later results, we use the notation \(A(x^*)\) to represent the matrix whose rows are the active constraint gradients at the optimal point, that is,
\[
A(x^*)^T = [\nabla c_i(x^*)]_{i \in \mathcal{A}(x^*)}, \qquad (12.37)
\]
where the active set \(\mathcal{A}(x^*)\) is defined as in Definition 12.1.

Lemma 12.2.
Let \(x^*\) be a feasible point. The following two statements are true.
(i) \(T_\Omega(x^*) \subset \mathcal{F}(x^*)\).
(ii) If the LICQ condition is satisfied at \(x^*\), then \(\mathcal{F}(x^*) = T_\Omega(x^*)\).

PROOF. Without loss of generality, let us assume that all the constraints \(c_i\), \(i = 1, 2, \ldots, m\), are active at \(x^*\). (We can arrive at this convenient ordering by simply dropping all inactive constraints, which are irrelevant in some neighborhood of \(x^*\), and renumbering the active constraints that remain.)
To prove (i), let \(\{z_k\}\) and \(\{t_k\}\) be the sequences for which (12.29) is satisfied, that is,
\[
\lim_{k \to \infty} \frac{z_k - x^*}{t_k} = d.
\]
Note in particular that \(t_k > 0\) for all \(k\). From this definition, we have that
\[
z_k = x^* + t_k d + o(t_k). \qquad (12.38)
\]
By taking \(i \in \mathcal{E}\) and using Taylor's theorem, we have that
\[
0 = \frac{1}{t_k} c_i(z_k)
= \frac{1}{t_k} \left[ c_i(x^*) + t_k \nabla c_i(x^*)^T d + o(t_k) \right]
= \nabla c_i(x^*)^T d + \frac{o(t_k)}{t_k}.
\]
By taking the limit as \(k \to \infty\), the last term in this expression vanishes, and we have \(\nabla c_i(x^*)^T d = 0\), as required. For the active inequality constraints \(i \in \mathcal{A}(x^*) \cap \mathcal{I}\), we have similarly that
\[
0 \le \frac{1}{t_k} c_i(z_k)
= \frac{1}{t_k} \left[ c_i(x^*) + t_k \nabla c_i(x^*)^T d + o(t_k) \right]
= \nabla c_i(x^*)^T d + \frac{o(t_k)}{t_k}.
\]
Hence, by a similar limiting argument, we have that \(\nabla c_i(x^*)^T d \ge 0\), as required.
For (ii), we use the implicit function theorem (see the Appendix or Lang [187, p. 131] for a statement of this result). First, since the LICQ holds, we have from Definition 12.4 that the \(m \times n\) matrix \(A(x^*)\) of active constraint gradients has full row rank \(m\). Let \(Z\) be a matrix whose columns are a basis for the null space of \(A(x^*)\); that is,
\[
Z \in \mathbb{R}^{n \times (n-m)}, \quad Z \text{ has full column rank}, \quad A(x^*) Z = 0. \qquad (12.39)
\]
(See the related discussion in Chapter 16.) Choose \(d \in \mathcal{F}(x^*)\) arbitrarily, and suppose that \(\{t_k\}_{k=0}^\infty\) is any sequence of positive scalars such that \(\lim_{k \to \infty} t_k = 0\). Define the parametrized system of equations \(R : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n\) by
\[
R(z, t) = \begin{bmatrix} c(z) - t A(x^*) d \\ Z^T (z - x^* - t d) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (12.40)
\]
We claim that the solutions \(z = z_k\) of this system for small \(t = t_k > 0\) give a feasible sequence that approaches \(x^*\) and satisfies the definition (12.29).
At \(t = 0\), \(z = x^*\), and the Jacobian of \(R\) at this point is
\[
\nabla_z R(x^*, 0) = \begin{bmatrix} A(x^*) \\ Z^T \end{bmatrix}, \qquad (12.41)
\]
which is nonsingular by construction of \(Z\). Hence, according to the implicit function theorem, the system (12.40) has a unique solution \(z_k\) for all values of \(t_k\) sufficiently small. Moreover, we have from (12.40) and Definition 12.3 that
\[
i \in \mathcal{E} \;\Rightarrow\; c_i(z_k) = t_k \nabla c_i(x^*)^T d = 0, \qquad (12.42a)
\]
\[
i \in \mathcal{A}(x^*) \cap \mathcal{I} \;\Rightarrow\; c_i(z_k) = t_k \nabla c_i(x^*)^T d \ge 0, \qquad (12.42b)
\]
so that \(z_k\) is indeed feasible.
It remains to verify that (12.29) holds for this choice of \(\{z_k\}\). Using the fact that \(R(z_k, t_k) = 0\) for all \(k\) together with Taylor's theorem, we find that
\[
0 = R(z_k, t_k) = \begin{bmatrix} c(z_k) - t_k A(x^*) d \\ Z^T (z_k - x^* - t_k d) \end{bmatrix}
= \begin{bmatrix} A(x^*)(z_k - x^*) + o(\|z_k - x^*\|) - t_k A(x^*) d \\ Z^T (z_k - x^* - t_k d) \end{bmatrix}
= \begin{bmatrix} A(x^*) \\ Z^T \end{bmatrix} \left( z_k - x^* - t_k d \right) + o(\|z_k - x^*\|).
\]
By dividing this expression by \(t_k\) and using nonsingularity of the coefficient matrix in the first term, we obtain
\[
\frac{z_k - x^*}{t_k} = d + \frac{o(\|z_k - x^*\|)}{t_k},
\]
from which it follows that (12.29) is satisfied. Hence, \(d \in T_\Omega(x^*)\) for an arbitrary \(d \in \mathcal{F}(x^*)\), so the proof of (ii) is complete. \(\square\)
A FUNDAMENTAL NECESSARY CONDITION

As mentioned above, a local solution of (12.1) is a point \(x\) at which all feasible sequences have the property that \(f(z_k) \ge f(x)\) for all \(k\) sufficiently large. The following result shows that if such a sequence exists, then its limiting directions must make a nonnegative inner product with the objective function gradient.

Theorem 12.3.
If \(x^*\) is a local solution of (12.1), then we have
\[
\nabla f(x^*)^T d \ge 0, \quad \text{for all } d \in T_\Omega(x^*). \qquad (12.43)
\]

PROOF. Suppose for contradiction that there is a tangent \(d\) for which \(\nabla f(x^*)^T d < 0\). Let \(\{z_k\}\) and \(\{t_k\}\) be the sequences satisfying Definition 12.2 for this \(d\). We have that
\[
f(z_k) = f(x^*) + (z_k - x^*)^T \nabla f(x^*) + o(\|z_k - x^*\|)
= f(x^*) + t_k d^T \nabla f(x^*) + o(t_k),
\]
where the second line follows from (12.38). Since \(d^T \nabla f(x^*) < 0\), the remainder term is eventually dominated by the first-order term, that is,
\[
f(z_k) < f(x^*) + \tfrac12 t_k d^T \nabla f(x^*), \quad \text{for all } k \text{ sufficiently large}.
\]
Hence, given any open neighborhood of \(x^*\), we can choose \(k\) sufficiently large that \(z_k\) lies within this neighborhood and has a lower value of the objective \(f\). Therefore, \(x^*\) is not a local solution. \(\square\)

Figure 12.12 Problem (12.44), showing various limiting directions of feasible sequences at the point \((0, 0)^T\).

The converse of this result is not necessarily true. That is, we may have \(\nabla f(x^*)^T d \ge 0\) for all \(d \in T_\Omega(x^*)\), yet \(x^*\) is not a local minimizer. An example is the following problem in two unknowns, illustrated in Figure 12.12:
\[
\min \; x_2 \quad \text{subject to} \quad x_2 \ge -x_1^2. \qquad (12.44)
\]
This problem is actually unbounded, but let us examine its behavior at \(x^* = (0, 0)^T\). It is not difficult to show that all limiting directions \(d\) of feasible sequences must have \(d_2 \ge 0\), so that \(\nabla f(x^*)^T d = d_2 \ge 0\). However, \(x^*\) is clearly not a local minimizer; the point \((\alpha, -\alpha^2)^T\) for \(\alpha > 0\) has a smaller function value than \(x^*\), and can be brought arbitrarily close to \(x^*\) by setting \(\alpha\) sufficiently small.
FARKAS' LEMMA

The most important step in proving Theorem 12.1 is a classical theorem of the alternative known as Farkas' Lemma. This lemma considers a cone \(K\) defined as follows:
\[
K = \{ B y + C w \mid y \ge 0 \}, \qquad (12.45)
\]
where \(B\) and \(C\) are matrices of dimension \(n \times m\) and \(n \times p\), respectively, and \(y\) and \(w\) are vectors of appropriate dimensions. Given a vector \(g \in \mathbb{R}^n\), Farkas' Lemma states that one (and only one) of two alternatives is true. Either \(g \in K\), or else there is a vector \(d \in \mathbb{R}^n\) such that
\[
g^T d < 0, \quad B^T d \ge 0, \quad C^T d = 0. \qquad (12.46)
\]
The two cases are illustrated in Figure 12.13 (for the case of \(B\) with three columns, \(C\) null, and \(n = 2\)). Note that in the second case, the vector \(d\) defines a separating hyperplane, which is a plane in \(\mathbb{R}^n\) that separates the vector \(g\) from the cone \(K\).

Figure 12.13 Farkas' Lemma: Either \(g \in K\) (left) or there is a separating hyperplane (right).
Lemma 12.4 (Farkas).
Let the cone \(K\) be defined as in (12.45). Given any vector \(g \in \mathbb{R}^n\), we have either that \(g \in K\) or that there exists \(d \in \mathbb{R}^n\) satisfying (12.46), but not both.

PROOF. We show first that the two alternatives cannot hold simultaneously. If \(g \in K\), there exist vectors \(y \ge 0\) and \(w\) such that \(g = B y + C w\). If there also exists a \(d\) with the property (12.46), we have by taking inner products that
\[
0 > d^T g = d^T B y + d^T C w = (B^T d)^T y + (C^T d)^T w \ge 0,
\]
where the final inequality follows from \(C^T d = 0\), \(B^T d \ge 0\), and \(y \ge 0\). Hence, we cannot have both alternatives holding at once.
We now show that one of the alternatives holds. To be precise, we show how to construct \(d\) with the properties (12.46) in the case that \(g \notin K\). For this part of the proof, we need to use the property that \(K\) is a closed set, a fact that is intuitively obvious but not trivial to prove (see Lemma 12.15 in the Notes and References below). Let \(\hat s\) be the vector in \(K\) that is closest to \(g\) in the sense of the Euclidean norm. Because \(K\) is closed, \(\hat s\) is well defined and is given by the solution of the following optimization problem:
\[
\min \; \|s - g\|_2^2 \quad \text{subject to} \quad s \in K. \qquad (12.47)
\]
Since \(\hat s \in K\), we have from the fact that \(K\) is a cone that \(\alpha \hat s \in K\) for all scalars \(\alpha \ge 0\). Since \(\|\alpha \hat s - g\|_2^2\) is minimized by \(\alpha = 1\), we have by simple calculus that
\[
\left. \frac{d}{d\alpha} \|\alpha \hat s - g\|_2^2 \right|_{\alpha = 1}
= \left. \left( -2 \hat s^T g + 2 \alpha \, \hat s^T \hat s \right) \right|_{\alpha = 1} = 0
\;\;\Rightarrow\;\; \hat s^T (\hat s - g) = 0. \qquad (12.48)
\]
Now, let \(s\) be any other vector in \(K\). Since \(K\) is convex, we have by the minimizing property of \(\hat s\) that
\[
\|\hat s + \theta (s - \hat s) - g\|_2^2 \ge \|\hat s - g\|_2^2 \quad \text{for all } \theta \in [0, 1],
\]
and hence
\[
2 \theta (s - \hat s)^T (\hat s - g) + \theta^2 \|s - \hat s\|_2^2 \ge 0.
\]
By dividing this expression by \(\theta\) and taking the limit as \(\theta \downarrow 0\), we have \((s - \hat s)^T (\hat s - g) \ge 0\). Therefore, because of (12.48),
\[
s^T (\hat s - g) \ge 0, \quad \text{for all } s \in K. \qquad (12.49)
\]
We claim now that the vector
\[
d = \hat s - g
\]
satisfies the conditions (12.46). Note that \(d \ne 0\) because \(g \notin K\). We have from (12.48) that
\[
d^T g = d^T \hat s - d^T (\hat s - g) = (\hat s - g)^T \hat s - d^T d = -\|d\|_2^2 < 0,
\]
so that \(d\) satisfies the first property in (12.46).
From (12.49), we have that \(d^T s \ge 0\) for all \(s \in K\), so that
\[
d^T (B y + C w) \ge 0 \quad \text{for all } y \ge 0 \text{ and all } w.
\]
By fixing \(y = 0\), we have that \((C^T d)^T w \ge 0\) for all \(w\), which is true only if \(C^T d = 0\). By fixing \(w = 0\), we have that \((B^T d)^T y \ge 0\) for all \(y \ge 0\), which is true only if \(B^T d \ge 0\). Hence, \(d\) also satisfies the second and third properties in (12.46), and our proof is complete. \(\square\)
12.4. FIRSTORDER OPTIMALITY CONDITIONS: PROOF 329
By applying Lemma 12.4 to the cone $N$ defined by
\[ N = \Big\{ \sum_{i \in \mathcal{A}(x^*)} \lambda_i \nabla c_i(x^*), \quad \lambda_i \ge 0 \text{ for } i \in \mathcal{A}(x^*) \cap \mathcal{I} \Big\}, \tag{12.50} \]
and setting $g = \nabla f(x^*)$, we have that either
\[ \nabla f(x^*) = \sum_{i \in \mathcal{A}(x^*)} \lambda_i \nabla c_i(x^*), \quad \lambda_i \ge 0 \text{ for } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \tag{12.51} \]
or else there is a direction $d$ such that $d^T \nabla f(x^*) < 0$ and $d \in \mathcal{F}(x^*)$.

PROOF OF THEOREM 12.1
Lemmas 12.2 and 12.4 can be combined to give the KKT conditions described in Theorem 12.1. We work through the final steps of the proof here. Suppose that $x^* \in \mathbb{R}^n$ is a feasible point at which the LICQ holds. The theorem claims that if $x^*$ is a local solution for (12.1), then there is a vector $\lambda^* \in \mathbb{R}^m$ that satisfies the conditions (12.34).

We show first that there are multipliers $\lambda_i$, $i \in \mathcal{A}(x^*)$, such that (12.51) is satisfied. Theorem 12.3 tells us that $d^T \nabla f(x^*) \ge 0$ for all tangent vectors $d \in T_\Omega(x^*)$. From Lemma 12.2, since LICQ holds, we have that $T_\Omega(x^*) = \mathcal{F}(x^*)$. By putting these two statements together, we find that $d^T \nabla f(x^*) \ge 0$ for all $d \in \mathcal{F}(x^*)$. Hence, from Lemma 12.4, there is a vector $\lambda$ for which (12.51) holds, as claimed.

We now define the vector $\lambda^*$ by
\[ \lambda_i^* = \begin{cases} \lambda_i, & i \in \mathcal{A}(x^*), \\ 0, & i \in \mathcal{I} \setminus \mathcal{A}(x^*), \end{cases} \tag{12.52} \]
and show that this choice of $\lambda^*$, together with our local solution $x^*$, satisfies the conditions (12.34). We check these conditions in turn.

The condition (12.34a) follows immediately from (12.51) and the definitions (12.33) of the Lagrangian function and (12.52) of $\lambda^*$.

Since $x^*$ is feasible, the conditions (12.34b) and (12.34c) are satisfied.

We have from (12.51) that $\lambda_i^* \ge 0$ for $i \in \mathcal{A}(x^*) \cap \mathcal{I}$, while from (12.52), $\lambda_i^* = 0$ for $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$. Hence, $\lambda_i^* \ge 0$ for $i \in \mathcal{I}$, so that (12.34d) holds.

We have for $i \in \mathcal{A}(x^*) \cap \mathcal{I}$ that $c_i(x^*) = 0$, while for $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$ we have $\lambda_i^* = 0$. Hence $\lambda_i^* c_i(x^*) = 0$ for $i \in \mathcal{I}$, so that (12.34e) is satisfied as well.

The proof is complete. □
12.5 SECOND-ORDER CONDITIONS
So far, we have described first-order conditions, the KKT conditions, which tell us how the first derivatives of $f$ and the active constraints $c_i$ are related to each other at a solution $x^*$. When these conditions are satisfied, a move along any vector $w$ from $\mathcal{F}(x^*)$ either increases the first-order approximation to the objective function (that is, $w^T \nabla f(x^*) > 0$), or else keeps this value the same (that is, $w^T \nabla f(x^*) = 0$).

What role do the second derivatives of $f$ and the constraints $c_i$ play in optimality conditions? We see in this section that second derivatives play a "tiebreaking" role. For the directions $w \in \mathcal{F}(x^*)$ for which $w^T \nabla f(x^*) = 0$, we cannot determine from first derivative information alone whether a move along this direction will increase or decrease the objective function $f$. Second-order conditions examine the second derivative terms in the Taylor series expansions of $f$ and $c_i$, to see whether this extra information resolves the issue of increase or decrease in $f$. Essentially, the second-order conditions concern the curvature of the Lagrangian function in the "undecided" directions, the directions $w \in \mathcal{F}(x^*)$ for which $w^T \nabla f(x^*) = 0$.

Since we are discussing second derivatives, stronger smoothness assumptions are needed here than in the previous sections. For the purpose of this section, $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, are all assumed to be twice continuously differentiable.
Given $\mathcal{F}(x^*)$ from Definition 12.3 and some Lagrange multiplier vector $\lambda^*$ satisfying the KKT conditions (12.34), we define the critical cone $\mathcal{C}(x^*, \lambda^*)$ as follows:
\[ \mathcal{C}(x^*, \lambda^*) = \{ w \in \mathcal{F}(x^*) \mid \nabla c_i(x^*)^T w = 0, \text{ all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* > 0 \}. \]
Equivalently,
\[ w \in \mathcal{C}(x^*, \lambda^*) \quad \Leftrightarrow \quad \begin{cases} \nabla c_i(x^*)^T w = 0, & \text{for all } i \in \mathcal{E}, \\ \nabla c_i(x^*)^T w = 0, & \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* > 0, \\ \nabla c_i(x^*)^T w \ge 0, & \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \text{ with } \lambda_i^* = 0. \end{cases} \tag{12.53} \]
The critical cone contains those directions $w$ that would tend to "adhere" to the active inequality constraints even when we were to make small changes to the objective (those indices $i \in \mathcal{I}$ for which the Lagrange multiplier component $\lambda_i^*$ is positive), as well as to the equality constraints. From the definition (12.53) and the fact that $\lambda_i^* = 0$ for all inactive components $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$, it follows immediately that
\[ w \in \mathcal{C}(x^*, \lambda^*) \;\Rightarrow\; \lambda_i^* \nabla c_i(x^*)^T w = 0 \quad \text{for all } i \in \mathcal{E} \cup \mathcal{I}. \tag{12.54} \]
Hence, from the first KKT condition (12.34a) and the definition (12.33) of the Lagrangian function, we have that
\[ w \in \mathcal{C}(x^*, \lambda^*) \;\Rightarrow\; w^T \nabla f(x^*) = \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i^* \, w^T \nabla c_i(x^*) = 0. \tag{12.55} \]
Figure 12.14: Problem (12.56), showing $\mathcal{F}(x^*)$ and $\mathcal{C}(x^*, \lambda^*)$.
Hence the critical cone $\mathcal{C}(x^*, \lambda^*)$ contains directions from $\mathcal{F}(x^*)$ for which it is not clear from first derivative information alone whether $f$ will increase or decrease.
EXAMPLE 12.7

Consider the problem
\[ \min x_1 \quad \text{subject to} \quad x_2 \ge 0, \quad 1 - (x_1 - 1)^2 - x_2^2 \ge 0, \tag{12.56} \]
illustrated in Figure 12.14. It is not difficult to see that the solution is $x^* = (0,0)^T$, with active set $\mathcal{A}(x^*) = \{1, 2\}$ and a unique optimal Lagrange multiplier $\lambda^* = (0, 0.5)^T$. Since the gradients of the active constraints at $x^*$ are $(0,1)^T$ and $(2,0)^T$, respectively, the LICQ holds, so the optimal multiplier is unique. The linearized feasible set is then
\[ \mathcal{F}(x^*) = \{ d \mid d \ge 0 \}, \]
while the critical cone is
\[ \mathcal{C}(x^*, \lambda^*) = \{ (0, w_2)^T \mid w_2 \ge 0 \}. \]
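The multiplier in this example can be recovered numerically by solving the linear system $\nabla f(x^*) = \sum_i \lambda_i \nabla c_i(x^*)$ over the active constraints. A minimal sketch in Python with NumPy, using the gradients computed above:

    import numpy as np

    grad_f = np.array([1.0, 0.0])       # gradient of f(x) = x_1
    grad_c = np.column_stack([
        np.array([0.0, 1.0]),           # gradient of c_1(x) = x_2 at x*
        np.array([2.0, 0.0]),           # gradient of c_2 at x* = (0,0)^T
    ])

    lam, *_ = np.linalg.lstsq(grad_c, grad_f, rcond=None)
    print(lam)                          # [0.  0.5], i.e. lambda* = (0, 0.5)^T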
The first theorem defines a necessary condition involving the second derivatives: If $x^*$ is a local solution, then the Hessian of the Lagrangian has nonnegative curvature along critical directions (that is, the directions in $\mathcal{C}(x^*, \lambda^*)$).
Theorem 12.5 (Second-Order Necessary Conditions).
Suppose that $x^*$ is a local solution of (12.1) and that the LICQ condition is satisfied. Let $\lambda^*$ be the Lagrange multiplier vector for which the KKT conditions (12.34) are satisfied. Then
\[ w^T \nabla^2_{xx} L(x^*, \lambda^*) w \ge 0, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*). \tag{12.57} \]

PROOF. Since $x^*$ is a local solution, all feasible sequences $\{z_k\}$ approaching $x^*$ must have $f(z_k) \ge f(x^*)$ for all $k$ sufficiently large. Our approach in this proof is to construct a feasible sequence whose limiting direction is $w$ and show that the property $f(z_k) \ge f(x^*)$ implies that (12.57) holds.
Since $w \in \mathcal{C}(x^*, \lambda^*) \subset \mathcal{F}(x^*)$, we can use the technique in the proof of Lemma 12.2 to choose a sequence $\{t_k\}$ of positive scalars and to construct a feasible sequence $\{z_k\}$ approaching $x^*$ such that
\[ \lim_{k \to \infty} \frac{z_k - x^*}{t_k} = w, \tag{12.58} \]
which we can write also as
\[ z_k - x^* = t_k w + o(t_k). \tag{12.59} \]
Because of the construction technique for $\{z_k\}$, we have from formula (12.42) that
\[ c_i(z_k) = t_k \nabla c_i(x^*)^T w, \quad \text{for all } i \in \mathcal{A}(x^*). \tag{12.60} \]
From (12.33), (12.60), and (12.54), we have
\[ L(z_k, \lambda^*) = f(z_k) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i^* c_i(z_k) = f(z_k) - t_k \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* \nabla c_i(x^*)^T w = f(z_k). \tag{12.61} \]
On the other hand, we can perform a Taylor series expansion to obtain an estimate of $L(z_k, \lambda^*)$ near $x^*$. By using Taylor's theorem (expression (2.6)) and continuity of the Hessians $\nabla^2 f$ and $\nabla^2 c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, we obtain
\[ L(z_k, \lambda^*) = L(x^*, \lambda^*) + (z_k - x^*)^T \nabla_x L(x^*, \lambda^*) + \tfrac{1}{2} (z_k - x^*)^T \nabla^2_{xx} L(x^*, \lambda^*) (z_k - x^*) + o(\|z_k - x^*\|^2). \tag{12.62} \]
By the complementarity conditions (12.34e), we have $L(x^*, \lambda^*) = f(x^*)$. From (12.34a), the second term on the right-hand side is zero. Hence, using (12.59), we can rewrite (12.62) as
\[ L(z_k, \lambda^*) = f(x^*) + \tfrac{1}{2} t_k^2 \, w^T \nabla^2_{xx} L(x^*, \lambda^*) w + o(t_k^2). \tag{12.63} \]
By substituting into (12.61), we obtain
\[ f(z_k) = f(x^*) + \tfrac{1}{2} t_k^2 \, w^T \nabla^2_{xx} L(x^*, \lambda^*) w + o(t_k^2). \tag{12.64} \]
If $w^T \nabla^2_{xx} L(x^*, \lambda^*) w < 0$, then (12.64) would imply that $f(z_k) < f(x^*)$ for all $k$ sufficiently large, contradicting the fact that $x^*$ is a local solution. Hence, the condition (12.57) must hold, as claimed. □
Sufficient conditions are conditions on $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, that ensure that $x^*$ is a local solution of the problem (12.1). They take the opposite tack to necessary conditions, which assume that $x^*$ is a local solution and deduce properties of $f$ and $c_i$ for the active indices $i$. The second-order sufficient condition stated in the next theorem looks very much like the necessary condition just discussed, but it differs in that the constraint qualification is not required, and the inequality in (12.57) is replaced by a strict inequality.
Theorem 12.6 (Second-Order Sufficient Conditions).
Suppose that for some feasible point $x^* \in \mathbb{R}^n$ there is a Lagrange multiplier vector $\lambda^*$ such that the KKT conditions (12.34) are satisfied. Suppose also that
\[ w^T \nabla^2_{xx} L(x^*, \lambda^*) w > 0, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*), \; w \ne 0. \tag{12.65} \]
Then $x^*$ is a strict local solution for (12.1).

PROOF. First, note that the set $\bar{\mathcal{C}} = \{ d \in \mathcal{C}(x^*, \lambda^*) \mid \|d\| = 1 \}$ is a compact subset of $\mathcal{C}(x^*, \lambda^*)$, so by (12.65), the minimizer of $d^T \nabla^2_{xx} L(x^*, \lambda^*) d$ over this set is a strictly positive number, say $\sigma$. Since $\mathcal{C}(x^*, \lambda^*)$ is a cone, we have that $w/\|w\| \in \bar{\mathcal{C}}$ if and only if $w \in \mathcal{C}(x^*, \lambda^*)$, $w \ne 0$. Therefore, condition (12.65) implies that
\[ w^T \nabla^2_{xx} L(x^*, \lambda^*) w \ge \sigma \|w\|^2, \quad \text{for all } w \in \mathcal{C}(x^*, \lambda^*), \tag{12.66} \]
for $\sigma > 0$ defined as above. Note that this inequality holds trivially for $w = 0$.
We prove the result by showing that every feasible sequence $\{z_k\}$ approaching $x^*$ has $f(z_k) \ge f(x^*) + (\sigma/4) \|z_k - x^*\|^2$ for all $k$ sufficiently large. Suppose for contradiction that this is not the case, and that there is a sequence $\{z_k\}$ approaching $x^*$ with
\[ f(z_k) < f(x^*) + (\sigma/4) \|z_k - x^*\|^2, \quad \text{for all } k \text{ sufficiently large}. \tag{12.67} \]
By taking a subsequence if necessary, we can identify a limiting direction $d$ such that
\[ \lim_{k \to \infty} \frac{z_k - x^*}{\|z_k - x^*\|} = d. \tag{12.68} \]
We have from Lemma 12.2(i) and Definition 12.3 that $d \in \mathcal{F}(x^*)$. From (12.33) and the facts that $\lambda_i^* \ge 0$ and $c_i(z_k) \ge 0$ for $i \in \mathcal{I}$ and $c_i(z_k) = 0$ for $i \in \mathcal{E}$, we have that
\[ L(z_k, \lambda^*) = f(z_k) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* c_i(z_k) \le f(z_k), \tag{12.69} \]
while the Taylor series approximation (12.63) from the proof of Theorem 12.5 continues to hold.
If $d$ were not in $\mathcal{C}(x^*, \lambda^*)$, we could identify some index $j \in \mathcal{A}(x^*) \cap \mathcal{I}$ such that the strict positivity condition
\[ \lambda_j^* \nabla c_j(x^*)^T d > 0 \tag{12.70} \]
is satisfied, while for the remaining indices $i \in \mathcal{A}(x^*)$ we have
\[ \lambda_i^* \nabla c_i(x^*)^T d \ge 0. \]
From Taylor's theorem and (12.68), we have for this particular value of $j$ that
\[ \lambda_j^* c_j(z_k) = \lambda_j^* c_j(x^*) + \lambda_j^* \nabla c_j(x^*)^T (z_k - x^*) + o(\|z_k - x^*\|) = \|z_k - x^*\| \, \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). \]
Hence, from (12.69), we have that
\[ L(z_k, \lambda^*) = f(z_k) - \sum_{i \in \mathcal{A}(x^*)} \lambda_i^* c_i(z_k) \le f(z_k) - \lambda_j^* c_j(z_k) = f(z_k) - \|z_k - x^*\| \, \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). \tag{12.71} \]
From the Taylor series estimate (12.63), we have meanwhile that
\[ L(z_k, \lambda^*) = f(x^*) + O(\|z_k - x^*\|^2), \]
and by combining with (12.71), we obtain
\[ f(z_k) \ge f(x^*) + \|z_k - x^*\| \, \lambda_j^* \nabla c_j(x^*)^T d + o(\|z_k - x^*\|). \]
Because of (12.70), this inequality is incompatible with (12.67). We conclude that $d \in \mathcal{C}(x^*, \lambda^*)$, and hence $d^T \nabla^2_{xx} L(x^*, \lambda^*) d \ge \sigma$.

By combining the Taylor series estimate (12.63) with (12.69) and using (12.68), we obtain
\[ f(z_k) \ge f(x^*) + \tfrac{1}{2} (z_k - x^*)^T \nabla^2_{xx} L(x^*, \lambda^*) (z_k - x^*) + o(\|z_k - x^*\|^2) = f(x^*) + \tfrac{1}{2} d^T \nabla^2_{xx} L(x^*, \lambda^*) d \, \|z_k - x^*\|^2 + o(\|z_k - x^*\|^2) \ge f(x^*) + (\sigma/2) \|z_k - x^*\|^2 + o(\|z_k - x^*\|^2). \]
This inequality yields the contradiction to (12.67). We conclude that every feasible sequence $\{z_k\}$ approaching $x^*$ must satisfy $f(z_k) \ge f(x^*) + (\sigma/4) \|z_k - x^*\|^2$ for all $k$ sufficiently large, so $x^*$ is a strict local solution. □
EXAMPLE 12.8 (EXAMPLE 12.2, ONE MORE TIME)

We now return to Example 12.2 to check the second-order conditions for problem (12.18). In this problem we have $f(x) = x_1 + x_2$, $c_1(x) = 2 - x_1^2 - x_2^2$, $\mathcal{E} = \emptyset$, and $\mathcal{I} = \{1\}$. The Lagrangian is
\[ L(x, \lambda) = (x_1 + x_2) - \lambda_1 (2 - x_1^2 - x_2^2), \]
and it is easy to show that the KKT conditions (12.34) are satisfied by $x^* = (-1, -1)^T$, with $\lambda_1^* = \tfrac{1}{2}$. The Lagrangian Hessian at this point is
\[ \nabla^2_{xx} L(x^*, \lambda^*) = \begin{bmatrix} 2\lambda_1^* & 0 \\ 0 & 2\lambda_1^* \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]
This matrix is positive definite, so it certainly satisfies the conditions of Theorem 12.6. We conclude that $x^* = (-1, -1)^T$ is a strict local solution for (12.18). In fact, it is the global solution of this problem, since, as we note later, this problem is a convex programming problem.
EXAMPLE 12.9

For a more complex example, consider the problem
\[ \min \; -0.1 (x_1 - 4)^2 + x_2^2 \quad \text{s.t.} \quad x_1^2 + x_2^2 - 1 \ge 0, \tag{12.72} \]
in which we seek to minimize a nonconvex function over the exterior of the unit circle. Obviously, the objective function is not bounded below on the feasible region, since we can take the feasible sequence
\[ (10, 0)^T, \; (20, 0)^T, \; (30, 0)^T, \; (40, 0)^T, \; \dots \]
and note that $f(x)$ approaches $-\infty$ along this sequence. Therefore, no global solution exists, but it may still be possible to identify a strict local solution on the boundary of the constraint. We search for such a solution by using the KKT conditions (12.34) and the second-order conditions of Theorem 12.6.
By defining the Lagrangian for (12.72) in the usual way, it is easy to verify that
\[ \nabla_x L(x, \lambda) = \begin{bmatrix} -0.2 (x_1 - 4) - 2\lambda_1 x_1 \\ 2 x_2 - 2\lambda_1 x_2 \end{bmatrix}, \tag{12.73a} \]
\[ \nabla^2_{xx} L(x, \lambda) = \begin{bmatrix} -0.2 - 2\lambda_1 & 0 \\ 0 & 2 - 2\lambda_1 \end{bmatrix}. \tag{12.73b} \]
The point $x^* = (1, 0)^T$ satisfies the KKT conditions with $\lambda_1^* = 0.3$ and the active set $\mathcal{A}(x^*) = \{1\}$. To check that the second-order sufficient conditions are satisfied at this point, we note that
\[ \nabla c_1(x^*) = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \]
so that the critical cone defined in (12.53) is simply
\[ \mathcal{C}(x^*, \lambda^*) = \{ (0, w_2)^T \mid w_2 \in \mathbb{R} \}. \]
Now, by substituting $x^*$ and $\lambda^*$ into (12.73b), we have for any $w \in \mathcal{C}(x^*, \lambda^*)$ with $w \ne 0$ that $w_2 \ne 0$ and thus
\[ w^T \nabla^2_{xx} L(x^*, \lambda^*) w = \begin{bmatrix} 0 \\ w_2 \end{bmatrix}^T \begin{bmatrix} -0.8 & 0 \\ 0 & 1.4 \end{bmatrix} \begin{bmatrix} 0 \\ w_2 \end{bmatrix} = 1.4 \, w_2^2 > 0. \]
Hence, the second-order sufficient conditions are satisfied, and we conclude from Theorem 12.6 that $x^* = (1, 0)^T$ is a strict local solution for (12.72).
SECOND-ORDER CONDITIONS AND PROJECTED HESSIANS

The second-order conditions are sometimes stated in a form that is slightly weaker but easier to verify than (12.57) and (12.65). This form uses a two-sided projection of the Lagrangian Hessian $\nabla^2_{xx} L(x^*, \lambda^*)$ onto subspaces that are related to $\mathcal{C}(x^*, \lambda^*)$.

The simplest case is obtained when the multiplier $\lambda^*$ that satisfies the KKT conditions (12.34) is unique (as happens, for example, when the LICQ condition holds) and strict complementarity holds. In this case, the definition (12.53) of $\mathcal{C}(x^*, \lambda^*)$ reduces to
\[ \mathcal{C}(x^*, \lambda^*) = \mathrm{Null} \, [\nabla c_i(x^*)^T]_{i \in \mathcal{A}(x^*)} = \mathrm{Null} \, A(x^*), \]
where $A(x^*)$ is defined as in (12.37). In other words, $\mathcal{C}(x^*, \lambda^*)$ is the null space of the matrix whose rows are the active constraint gradients at $x^*$. As in (12.39), we can define the matrix $Z$ with full column rank whose columns span the space $\mathcal{C}(x^*, \lambda^*)$; that is,
\[ \mathcal{C}(x^*, \lambda^*) = \{ Z u \mid u \in \mathbb{R}^{n - |\mathcal{A}(x^*)|} \}. \]
Hence, the condition (12.57) in Theorem 12.5 can be restated as
\[ u^T Z^T \nabla^2_{xx} L(x^*, \lambda^*) Z u \ge 0 \quad \text{for all } u, \]
or, more succinctly,
\[ Z^T \nabla^2_{xx} L(x^*, \lambda^*) Z \text{ is positive semidefinite.} \]
Similarly, the condition (12.65) in Theorem 12.6 can be restated as
\[ Z^T \nabla^2_{xx} L(x^*, \lambda^*) Z \text{ is positive definite.} \]
As we show next, $Z$ can be computed numerically, so that the positive semidefiniteness conditions can actually be checked by forming these matrices and finding their eigenvalues. One way to compute the matrix $Z$ is to apply a QR factorization to the matrix of active constraint gradients whose null space we seek. In the simplest case above (in which the multiplier $\lambda^*$ is unique and strict complementarity holds), we define $A(x^*)$ as in (12.37) and write the QR factorization of its transpose as
\[ A(x^*)^T = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [\, Q_1 \;\; Q_2 \,] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \tag{12.74} \]
where $R$ is a square upper triangular matrix and $Q$ is $n \times n$ orthogonal. If $R$ is nonsingular, we can set $Z = Q_2$. If $R$ is singular (indicating that the active constraint gradients are linearly dependent), a slight enhancement of this procedure that makes use of column pivoting during the QR procedure can be used to identify $Z$.
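The procedure just described is easy to carry out with standard linear algebra software. The following sketch (Python with NumPy, reusing the data of Example 12.9 at $x^* = (1,0)^T$, $\lambda_1^* = 0.3$, and assuming $R$ is nonsingular) forms $Z = Q_2$ from a full QR factorization of $A(x^*)^T$ and inspects the eigenvalues of the projected Hessian:

    import numpy as np

    A_active = np.array([[2.0, 0.0]])   # rows are active constraint gradients
    H = np.diag([-0.8, 1.4])            # Lagrangian Hessian (12.73b) at x*

    Q, R = np.linalg.qr(A_active.T, mode='complete')
    m = A_active.shape[0]               # here R is 1x1 and nonsingular
    Z = Q[:, m:]                        # columns of Q2 span Null(A(x*))

    proj = Z.T @ H @ Z                  # projected Hessian Z^T H Z
    print(np.linalg.eigvalsh(proj))     # [1.4] > 0: the second-order
                                        # sufficient condition holds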
12.6 OTHER CONSTRAINT QUALIFICATIONS
We now reconsider constraint qualifications, the conditions discussed in Sections 12.2 and 12.4 that ensure that the linearized approximation to the feasible set $\Omega$ captures the essential shape of $\Omega$ in a neighborhood of $x^*$.
One situation in which the linearized feasible direction set $\mathcal{F}(x^*)$ is obviously an adequate representation of the actual feasible set occurs when all the active constraints are already linear; that is,
\[ c_i(x) = a_i^T x + b_i, \tag{12.75} \]
for some $a_i \in \mathbb{R}^n$ and $b_i \in \mathbb{R}$. It is not difficult to prove a version of Lemma 12.2 for this situation.
Lemma 12.7.
Suppose that at some $x^* \in \Omega$, all active constraints $c_i$, $i \in \mathcal{A}(x^*)$, are linear functions. Then $\mathcal{F}(x^*) = T_\Omega(x^*)$.

PROOF. We have from Lemma 12.2(i) that $T_\Omega(x^*) \subset \mathcal{F}(x^*)$. To prove that $\mathcal{F}(x^*) \subset T_\Omega(x^*)$, we choose an arbitrary $w \in \mathcal{F}(x^*)$ and show that $w \in T_\Omega(x^*)$. By Definition 12.3 and the form (12.75) of the constraints, we have
\[ \mathcal{F}(x^*) = \{ d \mid a_i^T d = 0 \text{ for all } i \in \mathcal{E}; \;\; a_i^T d \ge 0 \text{ for all } i \in \mathcal{A}(x^*) \cap \mathcal{I} \}. \]
First, note that there is a positive scalar $\bar{t}$ such that the inactive constraints remain inactive at $x^* + t w$ for all $t \in [0, \bar{t}]$, that is,
\[ c_i(x^* + t w) > 0, \quad \text{for all } i \in \mathcal{I} \setminus \mathcal{A}(x^*) \text{ and all } t \in [0, \bar{t}]. \]
Now define the sequence $z_k$ by
\[ z_k = x^* + (\bar{t}/k) w, \quad k = 1, 2, \dots. \]
Since $a_i^T w \ge 0$ for all $i \in \mathcal{A}(x^*) \cap \mathcal{I}$, we have
\[ c_i(z_k) = c_i(z_k) - c_i(x^*) = a_i^T (z_k - x^*) = \frac{\bar{t}}{k} \, a_i^T w \ge 0, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \]
so that $z_k$ is feasible with respect to the active inequality constraints $c_i$, $i \in \mathcal{A}(x^*) \cap \mathcal{I}$. By the choice of $\bar{t}$, we find that $z_k$ is also feasible with respect to the inactive inequality constraints $i \in \mathcal{I} \setminus \mathcal{A}(x^*)$, and it is easy to show that $c_i(z_k) = 0$ for the equality constraints $i \in \mathcal{E}$. Hence, $z_k$ is feasible for each $k = 1, 2, \dots$. In addition, we have that
\[ \frac{z_k - x^*}{\bar{t}/k} = \frac{(\bar{t}/k) w}{\bar{t}/k} = w, \]
so that indeed $w$ is the limiting direction of $\{z_k\}$. Hence, $w \in T_\Omega(x^*)$, and the proof is complete. □
We conclude from this result that the condition that all active constraints be linear is another possible constraint qualification. It is neither weaker nor stronger than the LICQ condition; that is, there are situations in which one condition is satisfied but not the other (see Exercise 12.12).
Another useful generalization of the LICQ is the Mangasarian-Fromovitz constraint qualification (MFCQ).

Definition 12.6 (MFCQ).
We say that the Mangasarian-Fromovitz constraint qualification (MFCQ) holds if there exists a vector $w \in \mathbb{R}^n$ such that
\[ \nabla c_i(x^*)^T w > 0, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \]
\[ \nabla c_i(x^*)^T w = 0, \quad \text{for all } i \in \mathcal{E}, \]
and the set of equality constraint gradients $\{\nabla c_i(x^*), \; i \in \mathcal{E}\}$ is linearly independent.

Note the strict inequality involving the active inequality constraints.

The MFCQ is a weaker condition than LICQ. If LICQ is satisfied, then the system of equalities defined by
\[ \nabla c_i(x^*)^T w = 1, \quad \text{for all } i \in \mathcal{A}(x^*) \cap \mathcal{I}, \]
\[ \nabla c_i(x^*)^T w = 0, \quad \text{for all } i \in \mathcal{E}, \]
has a solution $w$, by full rank of the active constraint gradients. Hence, we can choose the $w$ of Definition 12.6 to be precisely this vector. On the other hand, it is easy to construct examples in which the MFCQ is satisfied but the LICQ is not; see Exercise 12.13.
It is possible to prove a version of the first-order necessary condition result (Theorem 12.1) in which MFCQ replaces LICQ in the assumptions. MFCQ gives rise to the nice property that it is equivalent to boundedness of the set of Lagrange multiplier vectors $\lambda^*$ for which the KKT conditions (12.34) are satisfied. (In the case of LICQ, this set consists of a unique vector $\lambda^*$, and so is trivially bounded.)
Note that constraint qualifications are sufficient conditions for the linear approximation to be adequate, not necessary conditions. For instance, consider the set defined by $x_2 \ge -x_1^2$ and $x_2 \le x_1^2$ and the feasible point $x^* = (0, 0)^T$. None of the constraint qualifications we have discussed are satisfied, but the linear approximation $\mathcal{F}(x^*) = \{ (w_1, 0)^T \mid w_1 \in \mathbb{R} \}$ accurately reflects the geometry of the feasible set near $x^*$.
12.7 A GEOMETRIC VIEWPOINT
Finally, we mention an alternative first-order optimality condition that depends only on the geometry of the feasible set, and not on its particular algebraic description in terms of the constraint functions $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$. In geometric terms, our problem (12.1) can be stated as
\[ \min f(x) \quad \text{subject to} \quad x \in \Omega, \tag{12.76} \]
where $\Omega$ is the feasible set.

To prove a geometric first-order condition, we need to define the normal cone to the set $\Omega$ at a feasible point $x$.

Definition 12.7.
The normal cone to the set $\Omega$ at the point $x \in \Omega$ is defined as
\[ N_\Omega(x) = \{ v \mid v^T w \le 0 \text{ for all } w \in T_\Omega(x) \}, \tag{12.77} \]
where $T_\Omega(x)$ is the tangent cone of Definition 12.2. Each vector $v \in N_\Omega(x)$ is said to be a normal vector.

Geometrically, each normal vector $v$ makes an angle of at least $\pi/2$ with every tangent vector.
The first-order necessary condition for (12.76) is delightfully simple.

Theorem 12.8.
Suppose that $x^*$ is a local minimizer of $f$ in $\Omega$. Then
\[ -\nabla f(x^*) \in N_\Omega(x^*). \tag{12.78} \]

PROOF. Given any $d \in T_\Omega(x^*)$, we have for the sequences $\{t_k\}$ and $\{z_k\}$ in Definition 12.2 that
\[ z_k \in \Omega, \quad z_k = x^* + t_k d + o(t_k), \quad \text{for all } k. \tag{12.79} \]
Since $x^*$ is a local solution, we must have
\[ f(z_k) \ge f(x^*) \]
for all $k$ sufficiently large. Hence, since $f$ is continuously differentiable, we have from Taylor's theorem (2.4) that
\[ f(z_k) - f(x^*) = t_k \nabla f(x^*)^T d + o(t_k) \ge 0. \]
By dividing by $t_k$ and taking limits as $k \to \infty$, we have
\[ \nabla f(x^*)^T d \ge 0. \]
Recall that $d$ was an arbitrary member of $T_\Omega(x^*)$, so we have $-\nabla f(x^*)^T d \le 0$ for all $d \in T_\Omega(x^*)$. We conclude from Definition 12.7 that $-\nabla f(x^*) \in N_\Omega(x^*)$. □
This result suggests a close relationship between $N_\Omega(x^*)$ and the conic combination of active constraint gradients given by (12.50). When the linear independence constraint qualification holds, the two sets are identical to within a change of sign.

Lemma 12.9.
Suppose that the LICQ assumption (Definition 12.4) holds at $x^*$. Then the normal cone $N_\Omega(x^*)$ is simply $-N$, where $N$ is the set defined in (12.50).

PROOF. The proof follows from Farkas' Lemma (Lemma 12.4) and Definition 12.7 of $N_\Omega(x^*)$. From Lemma 12.4, we have that
\[ g \in N \quad \Leftrightarrow \quad g^T d \ge 0 \;\; \text{for all } d \in \mathcal{F}(x^*). \]
Since we have $\mathcal{F}(x^*) = T_\Omega(x^*)$ from Lemma 12.2, it follows by switching the sign of this expression that
\[ v \in -N \quad \Leftrightarrow \quad v^T d \le 0 \;\; \text{for all } d \in T_\Omega(x^*). \]
We conclude from Definition 12.7 that $N_\Omega(x^*) = -N$, as claimed. □
12.8 LAGRANGE MULTIPLIERS AND SENSITIVITY
The importance of Lagrange multipliers in optimality theory should be clear, but what of their intuitive significance? We show in this section that each Lagrange multiplier $\lambda_i^*$ tells us something about the sensitivity of the optimal objective value $f(x^*)$ to the presence of the constraint $c_i$. To put it another way, $\lambda_i^*$ indicates how hard $f$ is "pushing" or "pulling" the solution $x^*$ against the particular constraint $c_i$.
We illustrate this point with some informal analysis. When we choose an inactive constraint $i \notin \mathcal{A}(x^*)$ such that $c_i(x^*) > 0$, the solution $x^*$ and function value $f(x^*)$ are indifferent to whether this constraint is present or not. If we perturb $c_i$ by a tiny amount, it will still be inactive and $x^*$ will still be a local solution of the optimization problem. Since $\lambda_i^* = 0$ from (12.34e), the Lagrange multiplier indicates accurately that constraint $i$ is not significant.

Suppose instead that constraint $i$ is active, and let us perturb the right-hand side of this constraint a little, requiring, say, that $c_i(x) \ge -\epsilon \|\nabla c_i(x^*)\|$ instead of $c_i(x) \ge 0$. Suppose that $\epsilon$ is sufficiently small that the perturbed solution $x^*(\epsilon)$ still has the same set of active constraints, and that the Lagrange multipliers are not much affected by the perturbation. (These conditions can be made more rigorous with the help of strict complementarity and second-order conditions.) We then find that
\[ -\epsilon \|\nabla c_i(x^*)\| = c_i(x^*(\epsilon)) - c_i(x^*) \approx (x^*(\epsilon) - x^*)^T \nabla c_i(x^*), \]
\[ 0 = c_j(x^*(\epsilon)) - c_j(x^*) \approx (x^*(\epsilon) - x^*)^T \nabla c_j(x^*), \]
for all $j \in \mathcal{A}(x^*)$ with $j \ne i$. The value of $f(x^*(\epsilon))$, meanwhile, can be estimated with the help of (12.34a). We have
\[ f(x^*(\epsilon)) - f(x^*) \approx (x^*(\epsilon) - x^*)^T \nabla f(x^*) = \sum_{j \in \mathcal{A}(x^*)} \lambda_j^* (x^*(\epsilon) - x^*)^T \nabla c_j(x^*) = -\epsilon \|\nabla c_i(x^*)\| \, \lambda_i^*. \]
By taking limits, we see that the family of solutions $x^*(\epsilon)$ satisfies
\[ \frac{d f(x^*(\epsilon))}{d\epsilon} = -\lambda_i^* \|\nabla c_i(x^*)\|. \tag{12.80} \]
A sensitivity analysis of this problem would conclude that if $\lambda_i^* \|\nabla c_i(x^*)\|$ is large, then the optimal value is sensitive to the placement of the $i$th constraint, while if this quantity is small, the dependence is not too strong. If $\lambda_i^*$ is exactly zero for some active constraint, small perturbations to $c_i$ in some directions will hardly affect the optimal objective value at all; the change is zero, to first order.
This discussion motivates the definition below, which classifies constraints according
to whether or not their corresponding Lagrange multiplier is zero.
Definition 12.8.
Let $x^*$ be a solution of the problem (12.1), and suppose that the KKT conditions (12.34) are satisfied. We say that an inequality constraint $c_i$ is strongly active or binding if $i \in \mathcal{A}(x^*)$ and $\lambda_i^* > 0$ for some Lagrange multiplier $\lambda^*$ satisfying (12.34). We say that $c_i$ is weakly active if $i \in \mathcal{A}(x^*)$ and $\lambda_i^* = 0$ for all $\lambda^*$ satisfying (12.34).
Note that the analysis above is independent of scaling of the individual constraints. For instance, we might change the formulation of the problem by replacing some active constraint $c_i$ by $10 c_i$. The new problem will actually be equivalent (that is, it has the same feasible set and same solution), but the optimal multiplier $\lambda_i^*$ corresponding to $c_i$ will be replaced by $\lambda_i^*/10$. However, since $\|\nabla c_i(x^*)\|$ is replaced by $10 \|\nabla c_i(x^*)\|$, the product $\lambda_i^* \|\nabla c_i(x^*)\|$ does not change. If, on the other hand, we replace the objective function $f$ by $10 f$, the multipliers $\lambda_i^*$ in (12.34) all will need to be replaced by $10 \lambda_i^*$. Hence in (12.80) we see that the sensitivity of $f$ to perturbations has increased by a factor of 10, which is exactly what we would expect.
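The relation (12.80) is easy to check numerically. A sketch in Python with SciPy, applied to problem (12.72) near $x^* = (1, 0)^T$ (the solver, starting point, and finite-difference step are choices of this illustration, and we assume SLSQP stays at the local solution): the predicted derivative is $-\lambda_1^* \|\nabla c_1(x^*)\| = -0.3 \cdot 2 = -0.6$.

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: -0.1 * (x[0] - 4.0)**2 + x[1]**2
    norm_grad_c = 2.0                   # ||grad c_1(x*)|| at x* = (1, 0)^T

    def optimal_value(eps):
        # perturbed constraint: c_1(x) >= -eps * ||grad c_1(x*)||
        cons = {'type': 'ineq',
                'fun': lambda x: x[0]**2 + x[1]**2 - 1.0 + eps * norm_grad_c}
        return minimize(f, x0=[1.0, 0.0], constraints=[cons],
                        method='SLSQP').fun

    eps = 1e-4
    print((optimal_value(eps) - optimal_value(0.0)) / eps)   # approx -0.6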
12.9 DUALITY
In this section we present some elements of the duality theory for nonlinear programming. This theory is used to motivate and develop some important algorithms, including the augmented Lagrangian algorithms of Chapter 17. In its full generality, duality theory ranges beyond nonlinear programming to provide important insight into the fields of convex nonsmooth optimization and even discrete optimization. Its specialization to linear programming proved central to the development of that area; see Chapter 13. We note that the discussion of linear programming duality in Section 13.1 can be read without consulting this section first.

Duality theory shows how we can construct an alternative problem from the functions and data that define the original optimization problem. This alternative "dual" problem is related to the original problem (which is sometimes referred to in this context as the "primal," for purposes of contrast) in fascinating ways. In some cases, the dual problem is easier to solve computationally than the original problem. In other cases, the dual can be used to obtain easily a lower bound on the optimal value of the objective for the primal problem. As remarked above, the dual has also been used to design algorithms for solving the primal problem.
Our results in this section are mostly restricted to the special case of (12.1) in which there are no equality constraints and the objective $f$ and the negatives $-c_i$ of the inequality constraints are all convex functions. For simplicity we assume that there are $m$ inequality constraints labelled $1, 2, \dots, m$, and rewrite (12.1) as follows:
\[ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c_i(x) \ge 0, \; i = 1, 2, \dots, m. \]
If we assemble the constraints into a vector function
\[ c(x) \stackrel{\text{def}}{=} (c_1(x), c_2(x), \dots, c_m(x))^T, \]
we can write the problem as
\[ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad c(x) \ge 0, \tag{12.81} \]
for which the Lagrangian function (12.16) with Lagrange multiplier vector $\lambda \in \mathbb{R}^m$ is
\[ L(x, \lambda) = f(x) - \lambda^T c(x). \]
We define the dual objective function $q : \mathbb{R}^m \to \mathbb{R}$ as follows:
\[ q(\lambda) \stackrel{\text{def}}{=} \inf_x L(x, \lambda). \tag{12.82} \]
In many problems, this infimum is $-\infty$ for some values of $\lambda$. We define the domain of $q$ as the set of $\lambda$ values for which $q$ is finite, that is,
\[ \mathcal{D} \stackrel{\text{def}}{=} \{ \lambda \mid q(\lambda) > -\infty \}. \tag{12.83} \]
The dual problem to (12.81) is defined as follows:
\[ \max_{\lambda \in \mathbb{R}^m} q(\lambda) \quad \text{subject to} \quad \lambda \ge 0. \tag{12.84} \]
Note that calculation of the infimum in (12.82) requires finding the global minimizer of the function $L(\cdot, \lambda)$ for the given $\lambda$, which, as we have noted in Chapter 2, may be extremely difficult in practice. However, when $f$ and $-c_i$ are convex functions and $\lambda \ge 0$ (the case in which we are most interested), the function $L(\cdot, \lambda)$ is also convex. In this situation, all local minimizers are global minimizers (as we verify in Exercise 12.4), so computation of $q(\lambda)$ becomes a more practical proposition.
EXAMPLE 12.10

Consider the problem
\[ \min_{x_1, x_2} \; 0.5 (x_1^2 + x_2^2) \quad \text{subject to} \quad x_1 - 1 \ge 0. \tag{12.85} \]
The Lagrangian is
\[ L(x_1, x_2, \lambda_1) = 0.5 (x_1^2 + x_2^2) - \lambda_1 (x_1 - 1). \]
If we hold $\lambda_1$ fixed, this is a convex function of $(x_1, x_2)^T$. Therefore, the infimum with respect to $(x_1, x_2)^T$ is achieved when the partial derivatives with respect to $x_1$ and $x_2$ are zero, that is,
\[ x_1 - \lambda_1 = 0, \qquad x_2 = 0. \]
By substituting these infimal values into $L(x_1, x_2, \lambda_1)$ we obtain the dual objective (12.82):
\[ q(\lambda_1) = 0.5 (\lambda_1^2 + 0) - \lambda_1 (\lambda_1 - 1) = -0.5 \lambda_1^2 + \lambda_1. \tag{12.86} \]
Hence, the dual problem (12.84) is
\[ \max_{\lambda_1 \ge 0} \; -0.5 \lambda_1^2 + \lambda_1, \]
which clearly has the solution $\lambda_1 = 1$.
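The dual function of this example can also be evaluated numerically: for each fixed $\lambda_1 \ge 0$, the inner infimum is a smooth unconstrained convex minimization. A sketch in Python with SciPy (solver choices are illustrative):

    import numpy as np
    from scipy.optimize import minimize, minimize_scalar

    def q(lam):
        # inner problem: minimize the Lagrangian over x for fixed lambda
        L = lambda x: 0.5 * (x[0]**2 + x[1]**2) - lam * (x[0] - 1.0)
        return minimize(L, x0=[0.0, 0.0]).fun

    for lam in [0.0, 0.5, 1.0, 2.0]:
        print(lam, q(lam), -0.5 * lam**2 + lam)  # matches closed form (12.86)

    res = minimize_scalar(lambda lam: -q(lam), bounds=(0.0, 5.0),
                          method='bounded')
    print(res.x)                                 # dual solution, approx 1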
In the remainder of this section, we show how the dual problem is related to (12.81). Our first result concerns concavity of $q$.
Theorem 12.10.
The function $q$ defined by (12.82) is concave and its domain $\mathcal{D}$ is convex.

PROOF. For any $\lambda^0$ and $\lambda^1$ in $\mathbb{R}^m$, any $x \in \mathbb{R}^n$, and any $\alpha \in [0, 1]$, we have
\[ L(x, (1-\alpha) \lambda^0 + \alpha \lambda^1) = (1-\alpha) L(x, \lambda^0) + \alpha L(x, \lambda^1). \]
By taking the infimum of both sides in this expression, using the definition (12.82), and using the result that the infimum of a sum is greater than or equal to the sum of infimums, we obtain
\[ q((1-\alpha) \lambda^0 + \alpha \lambda^1) \ge (1-\alpha) q(\lambda^0) + \alpha q(\lambda^1), \]
confirming concavity of $q$. If both $\lambda^0$ and $\lambda^1$ belong to $\mathcal{D}$, this inequality implies that $q((1-\alpha) \lambda^0 + \alpha \lambda^1) > -\infty$ also, and therefore $(1-\alpha) \lambda^0 + \alpha \lambda^1 \in \mathcal{D}$, verifying convexity of $\mathcal{D}$. □
The optimal value of the dual problem (12.84) gives a lower bound on the optimal objective value for the primal problem (12.81). This observation is a consequence of the following weak duality result.

Theorem 12.11 (Weak Duality).
For any $\bar{x}$ feasible for (12.81) and any $\bar{\lambda} \ge 0$, we have $q(\bar{\lambda}) \le f(\bar{x})$.
PROOF.
\[ q(\bar{\lambda}) = \inf_x \, [f(x) - \bar{\lambda}^T c(x)] \le f(\bar{x}) - \bar{\lambda}^T c(\bar{x}) \le f(\bar{x}), \]
where the final inequality follows from $\bar{\lambda} \ge 0$ and $c(\bar{x}) \ge 0$. □
For the remaining results, we note that the KKT conditions (12.34) specialized to (12.81) are as follows:
\[ \nabla f(\bar{x}) - \nabla c(\bar{x}) \bar{\lambda} = 0, \tag{12.87a} \]
\[ c(\bar{x}) \ge 0, \tag{12.87b} \]
\[ \bar{\lambda} \ge 0, \tag{12.87c} \]
\[ \bar{\lambda}_i c_i(\bar{x}) = 0, \quad i = 1, 2, \dots, m, \tag{12.87d} \]
where $\nabla c(\bar{x})$ is the $n \times m$ matrix defined by $\nabla c(\bar{x}) = [\nabla c_1(\bar{x}), \nabla c_2(\bar{x}), \dots, \nabla c_m(\bar{x})]$.

The next result shows that optimal Lagrange multipliers for (12.81) are solutions of the dual problem (12.84) under certain conditions. It is essentially due to Wolfe [309].
Theorem 12.12.
Suppose that $\bar{x}$ is a solution of (12.81) and that $f$ and $-c_i$, $i = 1, 2, \dots, m$, are convex functions on $\mathbb{R}^n$ that are differentiable at $\bar{x}$. Then any $\bar{\lambda}$ for which $(\bar{x}, \bar{\lambda})$ satisfies the KKT conditions (12.87) is a solution of (12.84).

PROOF. Suppose that $(\bar{x}, \bar{\lambda})$ satisfies (12.87). We have from $\bar{\lambda} \ge 0$ that $L(\cdot, \bar{\lambda})$ is a convex and differentiable function. Hence, for any $x$, we have
\[ L(x, \bar{\lambda}) \ge L(\bar{x}, \bar{\lambda}) + \nabla_x L(\bar{x}, \bar{\lambda})^T (x - \bar{x}) = L(\bar{x}, \bar{\lambda}), \]
where the last equality follows from (12.87a). Therefore, we have
\[ q(\bar{\lambda}) = \inf_x L(x, \bar{\lambda}) = L(\bar{x}, \bar{\lambda}) = f(\bar{x}) - \bar{\lambda}^T c(\bar{x}) = f(\bar{x}), \]
where the last equality follows from (12.87d). Since from Theorem 12.11 we have $q(\lambda) \le f(\bar{x})$ for all $\lambda \ge 0$, it follows immediately from $q(\bar{\lambda}) = f(\bar{x})$ that $\bar{\lambda}$ is a solution of (12.84). □
Note that if the functions are continuously differentiable and a constraint qualification such as LICQ holds at $\bar{x}$, then an optimal Lagrange multiplier is guaranteed to exist, by Theorem 12.1.

In Example 12.10, we see that $\lambda_1 = 1$ is both an optimal Lagrange multiplier for the problem (12.85) and a solution of (12.86). Note too that the optimal objective for both problems is 0.5.
We prove a partial converse of Theorem 12.12, which shows that solutions to the dual problem (12.84) can sometimes be used to derive solutions to the original problem (12.81). The essential condition is strict convexity of the function $L(\cdot, \hat{\lambda})$ for a certain value $\hat{\lambda}$. (We note that this condition holds if either $f$ is strictly convex, as is the case in Example 12.10, or if $-c_i$ is strictly convex for some $i = 1, 2, \dots, m$ with $\hat{\lambda}_i > 0$.)
Theorem 12.13.
Suppose that $f$ and $-c_i$, $i = 1, 2, \dots, m$, are convex and continuously differentiable on $\mathbb{R}^n$. Suppose that $\bar{x}$ is a solution of (12.81) at which LICQ holds. Suppose that $\hat{\lambda}$ solves (12.84) and that the infimum in $\inf_x L(x, \hat{\lambda})$ is attained at $\hat{x}$. Assume further that $L(\cdot, \hat{\lambda})$ is a strictly convex function. Then $\bar{x} = \hat{x}$ (that is, $\hat{x}$ is the unique solution of (12.81)), and $f(\bar{x}) = L(\hat{x}, \hat{\lambda})$.

PROOF. Assume for contradiction that $\bar{x} \ne \hat{x}$. From Theorem 12.1, because of the LICQ assumption, there exists $\bar{\lambda}$ satisfying (12.87). Hence, from Theorem 12.12, we have that $\bar{\lambda}$ also solves (12.84), so that
\[ L(\hat{x}, \hat{\lambda}) = q(\hat{\lambda}) = q(\bar{\lambda}) = L(\bar{x}, \bar{\lambda}). \]
Because $\hat{x} = \arg\min_x L(x, \hat{\lambda})$, we have from Theorem 2.2 that $\nabla_x L(\hat{x}, \hat{\lambda}) = 0$. Moreover, by strict convexity of $L(\cdot, \hat{\lambda})$, it follows that
\[ L(\bar{x}, \hat{\lambda}) - L(\hat{x}, \hat{\lambda}) > \nabla_x L(\hat{x}, \hat{\lambda})^T (\bar{x} - \hat{x}) = 0. \]
Hence, we have
\[ L(\bar{x}, \hat{\lambda}) > L(\hat{x}, \hat{\lambda}) = L(\bar{x}, \bar{\lambda}), \]
so in particular we have
\[ \hat{\lambda}^T c(\bar{x}) < \bar{\lambda}^T c(\bar{x}) = 0, \]
where the final equality follows from (12.87d). Since $\hat{\lambda} \ge 0$ and $c(\bar{x}) \ge 0$, this yields the contradiction, and we conclude that $\bar{x} = \hat{x}$, as claimed. □
In Example 12.10, at the dual solution $\lambda_1 = 1$, the infimum of $L(x_1, x_2, \lambda_1)$ is achieved at $(x_1, x_2) = (1, 0)^T$, which is the solution of the original problem (12.85).
A slightly different form of duality that is convenient for computations, known as the Wolfe dual [309], can be stated as follows:
\[ \max_{x, \lambda} \; L(x, \lambda) \tag{12.88a} \]
\[ \text{subject to} \quad \nabla_x L(x, \lambda) = 0, \quad \lambda \ge 0. \tag{12.88b} \]
The following result explains the relationship of the Wolfe dual to (12.81).

Theorem 12.14.
Suppose that $f$ and $-c_i$, $i = 1, 2, \dots, m$, are convex and continuously differentiable on $\mathbb{R}^n$. Suppose that $(\bar{x}, \bar{\lambda})$ is a solution pair of (12.81) at which LICQ holds. Then $(\bar{x}, \bar{\lambda})$ solves the problem (12.88).
PROOF. From the KKT conditions (12.87) we have that $(\bar{x}, \bar{\lambda})$ satisfies (12.88b), and that $L(\bar{x}, \bar{\lambda}) = f(\bar{x})$. Therefore for any pair $(x, \lambda)$ that satisfies (12.88b) we have that
\[ L(\bar{x}, \bar{\lambda}) = f(\bar{x}) \ge f(\bar{x}) - \lambda^T c(\bar{x}) = L(\bar{x}, \lambda) \ge L(x, \lambda) + \nabla_x L(x, \lambda)^T (\bar{x} - x) = L(x, \lambda), \]
where the second inequality follows from the convexity of $L(\cdot, \lambda)$. We have therefore shown that $(\bar{x}, \bar{\lambda})$ maximizes $L$ over the constraints (12.88b), and hence solves (12.88). □
EXAMPLE 12.11 (LINEAR PROGRAMMING)

An important special case of (12.81) is the linear programming problem
\[ \min c^T x \quad \text{subject to} \quad Ax - b \ge 0, \tag{12.89} \]
for which the dual objective is
\[ q(\lambda) = \inf_x \, [c^T x - \lambda^T (Ax - b)] = \inf_x \, [(c - A^T \lambda)^T x + b^T \lambda]. \]
If $c - A^T \lambda \ne 0$, the infimum is clearly $-\infty$ (we can set $x$ to be a large negative multiple of $c - A^T \lambda$ to make $q$ arbitrarily large and negative). When $c - A^T \lambda = 0$, on the other hand, the dual objective is simply $b^T \lambda$. In maximizing $q$, we can exclude $\lambda$ for which $c - A^T \lambda \ne 0$ from consideration (the maximum obviously cannot be attained at a point $\lambda$ for which $q(\lambda) = -\infty$). Hence, we can write the dual problem (12.84) as follows:
\[ \max_\lambda \; b^T \lambda \quad \text{subject to} \quad A^T \lambda = c, \; \lambda \ge 0. \tag{12.90} \]
The Wolfe dual of (12.89) can be written as
\[ \max \; c^T x - \lambda^T (Ax - b) \quad \text{subject to} \quad A^T \lambda = c, \; \lambda \ge 0, \]
and by substituting the constraint $A^T \lambda - c = 0$ into the objective we obtain (12.90) again.

For some matrices $A$, the dual problem (12.90) may be computationally easier to solve than the original problem (12.89). We discuss the possibilities further in Chapter 13.
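The equivalence of (12.89) and (12.90) is easy to observe numerically. A sketch in Python with SciPy's linprog, on a small instance whose data are hypothetical:

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, 0.0, 0.0])
    c = np.array([1.0, 2.0])

    # Primal (12.89): min c^T x s.t. Ax >= b, written as -Ax <= -b, x free.
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)

    # Dual (12.90): max b^T lam s.t. A^T lam = c, lam >= 0 (linprog minimizes).
    dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)

    print(primal.fun, -dual.fun)   # equal optimal values (both 1.0 here)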
EXAMPLE 12.12 (CONVEX QUADRATIC PROGRAMMING)

Consider
\[ \min \; \tfrac{1}{2} x^T G x + c^T x \quad \text{subject to} \quad Ax - b \ge 0, \tag{12.91} \]
where $G$ is a symmetric positive definite matrix. The dual objective for this problem is
\[ q(\lambda) = \inf_x L(x, \lambda) = \inf_x \left[ \tfrac{1}{2} x^T G x + c^T x - \lambda^T (Ax - b) \right]. \tag{12.92} \]
Since $G$ is positive definite, $L(\cdot, \lambda)$ is a strictly convex quadratic function, and the infimum is achieved when $\nabla_x L(x, \lambda) = 0$, that is,
\[ Gx + c - A^T \lambda = 0. \tag{12.93} \]
Hence, we can substitute for $x$ in the infimum expression and write the dual objective explicitly as follows:
\[ q(\lambda) = -\tfrac{1}{2} (A^T \lambda - c)^T G^{-1} (A^T \lambda - c) + b^T \lambda. \]
Alternatively, we can write the Wolfe dual form (12.88) by retaining $x$ as a variable and including the constraint (12.93) explicitly in the dual problem, to obtain
\[ \max_{(\lambda, x)} \; \tfrac{1}{2} x^T G x + c^T x - \lambda^T (Ax - b) \tag{12.94} \]
\[ \text{subject to} \quad Gx + c - A^T \lambda = 0, \quad \lambda \ge 0. \]
To make it clearer that the objective is concave, we can use the constraint to substitute $(c - A^T \lambda)^T x = -x^T G x$ in the objective, and rewrite the dual formulation as follows:
\[ \max_{(\lambda, x)} \; -\tfrac{1}{2} x^T G x + b^T \lambda, \quad \text{subject to} \quad Gx + c - A^T \lambda = 0, \; \lambda \ge 0. \tag{12.95} \]
Note that the Wolfe dual form requires only positive semidefiniteness of $G$.
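For a concrete check of this example, the explicit dual objective can be maximized over $\lambda \ge 0$ and $x$ recovered from (12.93). A sketch in Python with SciPy on hypothetical data ($G = I$ and a single constraint):

    import numpy as np
    from scipy.optimize import minimize

    G = np.eye(2)
    c = np.array([-2.0, -2.0])
    A = np.array([[1.0, -1.0]])    # single constraint x_1 - x_2 >= 1
    b = np.array([1.0])

    def neg_q(lam):
        r = A.T @ lam - c          # residual in the explicit dual objective
        return 0.5 * r @ np.linalg.solve(G, r) - b @ lam

    dual = minimize(neg_q, x0=np.zeros(1), bounds=[(0.0, None)])
    x_from_dual = np.linalg.solve(G, A.T @ dual.x - c)   # recover x by (12.93)
    print(dual.x, x_from_dual)     # lambda approx 0.5, x approx (2.5, 1.5)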
NOTES AND REFERENCES

The theory of constrained optimization is discussed in many books on numerical optimization. The discussion in Fletcher [101, Chapter 9] is similar to ours, though a little terser, and includes additional material on duality. Bertsekas [19, Chapter 3] emphasizes the role of duality and discusses sensitivity of the solution with respect to the active constraints in some detail. The classic treatment of Mangasarian [198] is particularly notable for its thorough description of constraint qualifications. It also has an extensive discussion of theorems of the alternative [198, Chapter 2], placing Farkas' Lemma firmly in the context of other related results.

The KKT conditions were described in a 1951 paper of Kuhn and Tucker [185], though they were derived earlier (and independently) in an unpublished 1939 master's thesis of W. Karush. Lagrange multipliers and optimality conditions for general problems (including nonsmooth problems) are described in the deep and wide-ranging article of Rockafellar [270].

Duality theory for nonlinear programming is described in the books of Rockafellar [198] and Bertsekas [19]; the latter treatment is particularly extensive and general. The material in Section 12.9 is adapted from these sources.
We return to our claim that the set $N$ defined by
\[ N = \{ By + Ct \mid y \ge 0 \}, \]
where $B$ and $C$ are matrices of dimension $n \times m$ and $n \times p$, respectively, and $y$ and $t$ are vectors of appropriate dimensions (see (12.45)), is a closed set. This fact is needed in the proof of Lemma 12.4 to ensure that the solution of the projection subproblem (12.47) is well-defined. The following technical result is well known; the proof given below is due to R. Byrd.

Lemma 12.15.
The set $N$ is closed.

PROOF. By splitting $t$ into positive and negative parts, it is easy to see that
\[ N = \left\{ [\, B \;\; C \;\; -C \,] \begin{bmatrix} y \\ t^+ \\ t^- \end{bmatrix} \;\middle|\; \begin{bmatrix} y \\ t^+ \\ t^- \end{bmatrix} \ge 0 \right\}. \]
Hence, we can assume without loss of generality that $N$ has the form $N = \{ By \mid y \ge 0 \}$, where $B$ has dimensions $n \times m$.

First, we show that for any $s \in N$, we can write $s = B_I y_I$ with $y_I \ge 0$, where $I \subset \{1, 2, \dots, m\}$, $B_I$ is the column submatrix of $B$ indexed by $I$ with full column rank, and $I$ has minimum cardinality. To prove this claim, we assume for contradiction that $K \subset \{1, 2, \dots, m\}$ is an index set with minimal cardinality such that $s = B_K y_K$, $y_K \ge 0$, yet the columns of $B_K$ are linearly dependent. Since $K$ is minimal, $y_K$ has no zero components. We then have a nonzero vector $w$ such that $B_K w = 0$. Since $s = B_K (y_K + \epsilon w)$ for any $\epsilon$, we can increase or decrease $\epsilon$ from 0 until one or more components of $y_K + \epsilon w$ become zero, while the other components remain positive. We define $\hat{K}$ by removing the indices from $K$ that correspond to zero components of $y_K + \epsilon w$, and define $\hat{y}_{\hat{K}}$ to be the vector of strictly positive components of $y_K + \epsilon w$. We then have that $s = B_{\hat{K}} \hat{y}_{\hat{K}}$ and $\hat{y}_{\hat{K}} \ge 0$, contradicting our assumption that $K$ was the set of minimal cardinality with this property.

Now let $\{s^k\}$ be a sequence with $s^k \in N$ for all $k$ and $s^k \to \bar{s}$. We prove the lemma by showing that $\bar{s} \in N$. By the claim of the previous paragraph, for all $k$ we can write $s^k = B_{I_k} y^k_{I_k}$ with $y^k_{I_k} \ge 0$, $I_k$ minimal, and the columns of $B_{I_k}$ linearly independent. Since there are only finitely many possible choices of index set $I_k$, at least one index set occurs infinitely often in the sequence. By choosing such an index set $\bar{I}$, we can take a subsequence if necessary and assume without loss of generality that $I_k = \bar{I}$ for all $k$. We then have that $s^k = B_{\bar{I}} y^k_{\bar{I}}$ with $y^k_{\bar{I}} \ge 0$, and $B_{\bar{I}}$ has full column rank. Because of the latter property, we have that $B_{\bar{I}}^T B_{\bar{I}}$ is invertible, so that $y^k_{\bar{I}}$ is defined uniquely as follows:
\[ y^k_{\bar{I}} = (B_{\bar{I}}^T B_{\bar{I}})^{-1} B_{\bar{I}}^T s^k, \quad k = 0, 1, 2, \dots. \]
By taking limits and using $s^k \to \bar{s}$, we have that
\[ y^k_{\bar{I}} \to \bar{y}_{\bar{I}} \stackrel{\text{def}}{=} (B_{\bar{I}}^T B_{\bar{I}})^{-1} B_{\bar{I}}^T \bar{s}, \]
and moreover $\bar{y}_{\bar{I}} \ge 0$, since $y^k_{\bar{I}} \ge 0$ for all $k$. Hence we can write $\bar{s} = B_{\bar{I}} \bar{y}_{\bar{I}}$ with $\bar{y}_{\bar{I}} \ge 0$, and therefore $\bar{s} \in N$. □
EXERCISES
12.1 (The following example, from [268], with a single variable $x \in \mathbb{R}$ and a single equality constraint, shows that strict local solutions are not necessarily isolated.) Consider
\[ \min x^2 \quad \text{subject to} \quad c(x) = 0, \quad \text{where} \quad c(x) = \begin{cases} x^6 \sin(1/x) & \text{if } x \ne 0, \\ 0 & \text{if } x = 0. \end{cases} \tag{12.96} \]
(a) Show that the constraint function is twice continuously differentiable at all $x$ (including at $x = 0$) and that the feasible points are $x = 0$ and $x = 1/(k\pi)$ for all nonzero integers $k$.
(b) Verify that each feasible point except $x = 0$ is an isolated local solution by showing that there is a neighborhood $\mathcal{N}$ around each such point within which it is the only feasible point.
(c) Verify that $x = 0$ is a global solution and a strict local solution, but not an isolated local solution.
12.2 Is an isolated local solution necessarily a strict local solution? Explain.
12.3 Does problem (12.4) have a finite or infinite number of local solutions? Use the first-order optimality conditions (12.34) to justify your answer.

12.4 If $f$ is convex and the feasible region $\Omega$ is convex, show that local solutions of the problem (12.3) are also global solutions. Show that the set of global solutions is convex. (Hint: See Theorem 2.5.)
12.5 Let $v : \mathbb{R}^n \to \mathbb{R}^m$ be a smooth vector function and consider the unconstrained optimization problems of minimizing $f(x)$, where
\[ f(x) = \|v(x)\|_\infty, \qquad f(x) = \max_{i=1,2,\dots,m} v_i(x). \]
Reformulate these (generally nonsmooth) problems as smooth constrained optimization problems.

12.6 Can you perform a smooth reformulation as in the previous question when $f$ is defined by
\[ f(x) = \min_{i=1,2,\dots,m} f_i(x)? \]
(N.B. min, not max.) Why or why not?
12.7 Show that the vector defined by (12.15) satisfies (12.14) when the first-order optimality condition (12.10) is not satisfied.
12.8 Verify that for the sequence $\{z_k\}$ defined by (12.30), the function $f(x) = x_1 + x_2$ satisfies $f(z_{k+1}) > f(z_k)$ for $k = 2, 3, \dots$. (Hint: Consider the trajectory $z(s) \stackrel{\text{def}}{=} (-\sqrt{2 - 1/s^2}, -1/s)^T$ and show that the function $h(s) \stackrel{\text{def}}{=} f(z(s))$ has $h'(s) > 0$ for all $s \ge 2$.)
12.9 Consider the problem (12.9). Specify two feasible sequences that approach the maximizing point $(1, 1)^T$, and show that neither sequence is a decreasing sequence for $f$.

12.10 Verify that neither the LICQ nor the MFCQ holds for the constraint set defined by (12.32) at $x^* = (0, 0)^T$.
12.11 Consider the feasible set $\Omega$ in $\mathbb{R}^2$ defined by $x_2 \ge 0$, $x_2 \le x_1^2$.
(a) For $x^* = (0, 0)^T$, write down $T_\Omega(x^*)$ and $\mathcal{F}(x^*)$.
(b) Is LICQ satisfied at $x^*$? Is MFCQ satisfied?
(c) If the objective function is $f(x) = -x_2$, verify that the KKT conditions (12.34) are satisfied at $x^*$.
(d) Find a feasible sequence $\{z_k\}$ approaching $x^*$ with $f(z_k) < f(x^*)$ for all $k$.
12.12 It is trivial to construct an example of a feasible set $\Omega$ and a feasible point $x^*$ at which the LICQ is satisfied but the constraints are nonlinear. Give an example of the reverse situation, that is, where the active constraints are linear but the LICQ is not satisfied.
12.13 Show that for the feasible region defined by
\[ (x_1 - 1)^2 + (x_2 - 1)^2 \le 2, \qquad (x_1 - 1)^2 + (x_2 + 1)^2 \le 2, \qquad x_1 \ge 0, \]
the MFCQ is satisfied at $x^* = (0, 0)^T$ but the LICQ is not satisfied.
12.14 Consider the half space defined by $H = \{ x \in \mathbb{R}^n \mid a^T x + \alpha \ge 0 \}$, where $a \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$ are given. Formulate and solve the optimization problem for finding the point $x$ in $H$ that has the smallest Euclidean norm.
12.15 Consider the following modification of (12.36), where $t$ is a parameter to be fixed prior to solving the problem:
\[ \min \; (x_1 - \tfrac{3}{2})^2 + (x_2 - t)^4 \quad \text{s.t.} \quad \begin{bmatrix} 1 - x_1 - x_2 \\ 1 - x_1 + x_2 \\ 1 + x_1 - x_2 \\ 1 + x_1 + x_2 \end{bmatrix} \ge 0. \tag{12.97} \]
(a) For what values of $t$ does the point $x^* = (1, 0)^T$ satisfy the KKT conditions?
(b) Show that when $t = 1$, only the first constraint is active at the solution, and find the solution.
12.16 (Fletcher [101]) Solve the problem
\[ \min_x \; x_1 + x_2 \quad \text{subject to} \quad x_1^2 + x_2^2 = 1 \]
by eliminating the variable $x_2$. Show that the choice of sign for a square root operation during the elimination process is critical; the wrong choice leads to an incorrect answer.
12.17 Prove that when the KKT conditions (12.34) and the LICQ are satisfied at a point $x^*$, the Lagrange multiplier $\lambda^*$ in (12.34) is unique.
12.18 Consider the problem of finding the point on the parabola $y = \tfrac{1}{5}(x - 1)^2$ that is closest to $(x, y) = (1, 2)$, in the Euclidean norm sense. We can formulate this problem as
\[ \min \; f(x, y) = (x - 1)^2 + (y - 2)^2 \quad \text{subject to} \quad (x - 1)^2 = 5y. \]
(a) Find all the KKT points for this problem. Is the LICQ satisfied?
(b) Which of these points are solutions?
(c) By directly substituting the constraint into the objective function and eliminating the variable $x$, we obtain an unconstrained optimization problem. Show that the solutions of this problem cannot be solutions of the original problem.
12.19 Consider the problem
\[ \min_{x \in \mathbb{R}^2} \; f(x) = -2 x_1 + x_2 \quad \text{subject to} \quad (1 - x_1)^3 - x_2 \ge 0, \quad x_2 + 0.25 x_1^2 - 1 \ge 0. \]
The optimal solution is $x^* = (0, 1)^T$, where both constraints are active.
(a) Does the LICQ hold at this point?
(b) Are the KKT conditions satisfied?
(c) Write down the sets $\mathcal{F}(x^*)$ and $\mathcal{C}(x^*, \lambda^*)$.
(d) Are the second-order necessary conditions satisfied? Are the second-order sufficient conditions satisfied?
12.20 Find the minima of the function $f(x) = x_1 x_2$ on the unit circle $x_1^2 + x_2^2 = 1$. Illustrate this problem geometrically.

12.21 Find the maxima of $f(x) = x_1 x_2$ over the unit disk defined by the inequality constraint $1 - x_1^2 - x_2^2 \ge 0$.

12.22 Show that for (12.1), the feasible set $\Omega$ is convex if $c_i$, $i \in \mathcal{E}$, are linear functions and $-c_i$, $i \in \mathcal{I}$, are convex functions.
CHAPTER 13
Linear Programming: The Simplex Method
Dantzig's development of the simplex method in the late 1940s marks the start of the modern era in optimization. This method made it possible for economists to formulate large models and analyze them in a systematic and efficient way. Dantzig's discovery coincided with the development of the first electronic computers, and the simplex method became one of the earliest important applications of this new and revolutionary technology. From those days to the present, computer implementations of the simplex method have been continually improved and refined. They have benefited particularly from interactions with numerical analysis, a branch of mathematics that also came into its own with the appearance of electronic computers, and have now reached a high level of sophistication.
Today, linear programming and the simplex method continue to hold sway as the most widely used of all optimization tools. Since 1950, generations of workers in management, economics, finance, and engineering have been trained in the techniques of formulating linear models and solving them with simplex-based software. Often, the situations they model are actually nonlinear, but linear programming is appealing because of the advanced state of the software, guaranteed convergence to a global minimum, and the fact that uncertainty in the model makes a linear model more appropriate than an overly complex nonlinear model. Nonlinear programming may replace linear programming as the method of choice in some applications as the nonlinear software improves, and a new class of methods known as interior-point methods (see Chapter 14) has proved to be faster for some linear programming problems, but the continued importance of the simplex method is assured for the foreseeable future.
LINEAR PROGRAMMING
Linear programs have a linear objective function and linear constraints, which may include both equalities and inequalities. The feasible set is a polytope, a convex, connected set with flat, polygonal faces. The contours of the objective function are planar. Figure 13.1 depicts a linear program in two-dimensional space, in which the contours of the objective function are indicated by dotted lines. The solution in this case is unique, a single vertex. A simple reorientation of the polytope or the objective gradient $c$ could, however, make the solution non-unique; the optimal value $c^T x$ could take on the same value over an entire edge. In higher dimensions, the set of optimal points can be a single vertex, an edge or face, or even the entire feasible set. The problem has no solution if the feasible set is empty (the infeasible case) or if the objective function is unbounded below on the feasible region (the unbounded case).
Linear programs are usually stated and analyzed in the following standard form:
\[ \min c^T x, \quad \text{subject to} \quad Ax = b, \; x \ge 0, \tag{13.1} \]
where $c$ and $x$ are vectors in $\mathbb{R}^n$, $b$ is a vector in $\mathbb{R}^m$, and $A$ is an $m \times n$ matrix. Simple devices can be used to transform any linear program to this form. For instance, given the problem
\[ \min c^T x, \quad \text{subject to} \quad Ax \ge b \]
(without any bounds on $x$), we can convert the inequality constraints to equalities by introducing a vector of slack variables $z$ and writing
\[ \min c^T x, \quad \text{subject to} \quad Ax - z = b, \; z \ge 0. \tag{13.2} \]
This form is still not quite standard, since not all the variables are constrained to be
Figure 13.1: A linear program in two dimensions with solution at $x^*$.
nonnegative. We deal with this by splitting $x$ into its nonnegative and nonpositive parts, $x = x^+ - x^-$, where $x^+ = \max(x, 0) \ge 0$ and $x^- = \max(-x, 0) \ge 0$. The problem (13.2) can now be written as
\[ \min \begin{bmatrix} c \\ -c \\ 0 \end{bmatrix}^T \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix}, \quad \text{s.t.} \quad [\, A \;\; -A \;\; -I \,] \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix} = b, \quad \begin{bmatrix} x^+ \\ x^- \\ z \end{bmatrix} \ge 0, \]
which clearly has the same form as (13.1).
Inequality constraints of the form $x \le u$ or $Ax \ge b$ can always be converted to equality constraints by adding or subtracting slack variables to make up the difference between the left- and right-hand sides. Hence,
\[ x \le u \;\Leftrightarrow\; x + w = u, \; w \ge 0, \]
\[ Ax \ge b \;\Leftrightarrow\; Ax - y = b, \; y \ge 0. \]
When we subtract the variables from the left-hand side, as in the second case, they are sometimes known as surplus variables. We can also convert a "maximize" objective $\max c^T x$ into the "minimize" form of (13.1) by simply negating $c$, to obtain $\min (-c)^T x$.
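These transformations are mechanical enough to automate. A minimal sketch in Python with NumPy, assuming the input problem has the form $\min c^T x$ subject to $Ax \ge b$ with free $x$ (the helper name is ours):

    import numpy as np

    def to_standard_form(c, A, b):
        """Return (c_s, A_s, b_s) for min c_s^T u s.t. A_s u = b_s, u >= 0,
        where u = (x^+, x^-, y), x = x^+ - x^-, and y are surplus variables."""
        m = A.shape[0]
        A_s = np.hstack([A, -A, -np.eye(m)])       # A x^+ - A x^- - y = b
        c_s = np.concatenate([c, -c, np.zeros(m)])
        return c_s, A_s, b

    c = np.array([1.0, 2.0])
    A = np.array([[1.0, 1.0]])
    b = np.array([1.0])
    print(to_standard_form(c, A, b)[1])            # [[ 1.  1. -1. -1. -1.]]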
We say that the linear program (13.1) is infeasible if the feasible set is empty. We say that the problem (13.1) is unbounded if the objective function is unbounded below on the feasible region, that is, there is a sequence of points $x_k$ feasible for (13.1) such that $c^T x_k \to -\infty$. Of course, unbounded problems have no solution.
Many linear programs arise from models of transshipment and distribution networks. These problems have additional structure in their constraints; special-purpose simplex algorithms that exploit this structure are highly efficient. We do not discuss such problems further in this book, except to note that the subject is important and complex, and that a number of fine texts on the topic are available (see, for example, Ahuja, Magnanti, and Orlin [1]).
For the standard formulation (13.1), we will assume throughout that $m < n$. Otherwise, the system $Ax = b$ contains redundant rows, or is infeasible, or defines a unique point. When $m \ge n$, factorizations such as the QR or LU factorization (see Appendix A) can be used to transform the system $Ax = b$ to one with a coefficient matrix of full row rank.
13.1 OPTIMALITY AND DUALITY
OPTIMALITY CONDITIONS
Optimality conditions for the problem (13.1) can be derived from the theory of Chapter 12. Only the first-order conditions, the Karush-Kuhn-Tucker (KKT) conditions, are needed. Convexity of the problem ensures that these conditions are sufficient for a global minimum. We do not need to refer to the second-order conditions from Chapter 12, which are not informative in any case, because the Hessian of the Lagrangian for (13.1) is zero.

The theory we developed in Chapter 12 makes derivation of optimality and duality results for linear programming much easier than in other treatments, where this theory is developed more or less from scratch.

The KKT conditions follow from Theorem 12.1. As stated in Chapter 12, this theorem requires linear independence of the active constraint gradients (LICQ). However, as we noted in Section 12.6, the result continues to hold for dependent constraints, provided they are linear, as is the case here.
We partition the Lagrange multipliers for the problem (13.1) into two vectors $\pi$ and $s$, where $\pi \in \mathbb{R}^m$ is the multiplier vector for the equality constraints $Ax = b$, while $s \in \mathbb{R}^n$ is the multiplier vector for the bound constraints $x \ge 0$. Using the definition (12.33), we can write the Lagrangian function for (13.1) as
\[ L(x, \pi, s) = c^T x - \pi^T (Ax - b) - s^T x. \tag{13.3} \]
Applying Theorem 12.1, we find that the first-order necessary conditions for $x^*$ to be a solution of (13.1) are that there exist vectors $\pi$ and $s$ such that
\[ A^T \pi + s = c, \tag{13.4a} \]
\[ Ax = b, \tag{13.4b} \]
\[ x \ge 0, \tag{13.4c} \]
\[ s \ge 0, \tag{13.4d} \]
\[ x_i s_i = 0, \quad i = 1, 2, \dots, n. \tag{13.4e} \]
The complementarity condition (13.4e), which essentially says that at least one of the components $x_i$ and $s_i$ must be zero for each $i = 1, 2, \dots, n$, is often written in the alternative form $x^T s = 0$. Because of the nonnegativity conditions (13.4c), (13.4d), the two forms are identical.
Let $(x^*, \pi^*, s^*)$ denote a vector triple that satisfies (13.4). By combining the three conditions (13.4a), (13.4b), and (13.4e), we find that
\[ c^T x^* = (A^T \pi^* + s^*)^T x^* = (A x^*)^T \pi^* = b^T \pi^*. \tag{13.5} \]
As we shall see in a moment, $b^T \pi$ is the objective function for the dual problem to (13.1), so (13.5) indicates that the primal and dual objectives are equal for vector triples $(x, \pi, s)$ that satisfy (13.4).
It is easy to show directly that the conditions (13.4) are sufficient for $x^*$ to be a global solution of (13.1). Let $\bar{x}$ be any other feasible point, so that $A\bar{x} = b$ and $\bar{x} \ge 0$. Then
\[ c^T \bar{x} = (A^T \pi^* + s^*)^T \bar{x} = b^T \pi^* + \bar{x}^T s^* \ge b^T \pi^* = c^T x^*. \tag{13.6} \]
We have used (13.4) and (13.5) here; the inequality relation follows trivially from $\bar{x} \ge 0$ and $s^* \ge 0$. The inequality (13.6) tells us that no other feasible point can have a lower objective value than $c^T x^*$. We can say more: The feasible point $\bar{x}$ is optimal if and only if
\[ \bar{x}^T s^* = 0, \]
since otherwise the inequality in (13.6) is strict. In other words, when $s_i^* > 0$, then we must have $\bar{x}_i = 0$ for all solutions $\bar{x}$ of (13.1).

THE DUAL PROBLEM
Given the data $c$, $b$, and $A$, which define the problem (13.1), we can define another, closely related, problem as follows:
\[ \max b^T \pi, \quad \text{subject to} \quad A^T \pi \le c. \tag{13.7} \]
This problem is called the dual problem for (13.1). In contrast, (13.1) is often referred to as the primal. We can restate (13.7) in a slightly different form by introducing a vector of dual slack variables $s$, and writing
\[ \max b^T \pi, \quad \text{subject to} \quad A^T \pi + s = c, \; s \ge 0. \tag{13.8} \]
The variables $(\pi, s)$ in this problem are sometimes referred to collectively as dual variables.
The primal and dual problems present two different viewpoints on the same data. Their close relationship becomes evident when we write down the KKT conditions for (13.7). Let us first restate (13.7) in the form
\[ \min -b^T \pi \quad \text{subject to} \quad c - A^T \pi \ge 0, \]
to fit the formulation (12.1) from Chapter 12. By using $x \in \mathbb{R}^n$ to denote the Lagrange multipliers for the constraints $A^T \pi \le c$, we see that the Lagrangian function is
\[ \bar{L}(\pi, x) = -b^T \pi - x^T (c - A^T \pi). \]
Using Theorem 12.1 again, we find the first-order necessary conditions for $\pi$ to be optimal for (13.7) to be that there exists $x$ such that
\[ Ax = b, \tag{13.9a} \]
\[ A^T \pi \le c, \tag{13.9b} \]
\[ x \ge 0, \tag{13.9c} \]
\[ x_i (c - A^T \pi)_i = 0, \quad i = 1, 2, \dots, n. \tag{13.9d} \]
Defining $s = c - A^T \pi$ as in (13.8), we find that the conditions (13.9) and (13.4) are identical! The optimal Lagrange multipliers $\pi$ in the primal problem are the optimal variables in the dual problem, while the optimal Lagrange multipliers $x$ in the dual problem are the optimal variables in the primal problem.
Analogously to (13.6), we can show that (13.9) are in fact sufficient conditions for a solution of the dual problem (13.7). Given $x^*$ and $\pi^*$ satisfying these conditions (so that the triple $(x^*, \pi^*, s^*) = (x^*, \pi^*, c - A^T \pi^*)$ satisfies (13.4)), we have for any other dual feasible point $\bar{\pi}$ (with $A^T \bar{\pi} \le c$) that
\[ b^T \bar{\pi} = (x^*)^T A^T \bar{\pi} = (x^*)^T (A^T \bar{\pi} - c) + c^T x^* \le c^T x^* \quad (\text{because } A^T \bar{\pi} - c \le 0 \text{ and } x^* \ge 0) \]
\[ = b^T \pi^* \quad (\text{from (13.5)}). \]
Hence $\pi^*$ achieves the maximum of the dual objective $b^T \pi$ over the dual feasible region $A^T \pi \le c$, so it solves the dual problem (13.7).
The primal-dual relationship is symmetric; by taking the dual of the dual problem (13.7), we recover the primal problem (13.1). We leave the proof of this claim as an exercise.

Given a feasible vector $x$ for the primal (satisfying $Ax = b$ and $x \ge 0$) and a feasible point $(\pi, s)$ for the dual (satisfying $A^T \pi + s = c$, $s \ge 0$), we have as in (13.6) that
\[ c^T x - b^T \pi = (c - A^T \pi)^T x = s^T x \ge 0. \tag{13.10} \]
Therefore we have $c^T x \ge b^T \pi$ (that is, the dual objective is a lower bound on the primal objective) when both the primal and dual variables are feasible, a result known as weak duality.
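Weak and strong duality, and the complementarity of $x^*$ and $s^*$, can all be observed on a small instance. A sketch in Python with SciPy's linprog (the data below are hypothetical and chosen to be feasible, bounded, and nondegenerate):

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    c = np.array([1.0, 1.0, 3.0])

    # Primal (13.1): min c^T x s.t. Ax = b, x >= 0 (linprog's default bounds).
    primal = linprog(c, A_eq=A, b_eq=b)

    # Dual (13.7): max b^T pi s.t. A^T pi <= c.
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

    s = c - A.T @ dual.x                  # dual slacks, as in (13.8)
    print(primal.fun, -dual.fun)          # equal optimal values (both 3.0)
    print(primal.x @ s)                   # complementarity x^T s = 0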
The following strong duality result is fundamental to the theory of linear programming.

Theorem 13.1 (Strong Duality).
(i) If either problem (13.1) or (13.7) has a (finite) solution, then so does the other, and the objective values are equal.
(ii) If either problem (13.1) or (13.7) is unbounded, then the other problem is infeasible.
PROOF. For (i), suppose that (13.1) has a finite optimal solution $x^*$. It follows from Theorem 12.1 that there are vectors $\pi^*$ and $s^*$ such that $(x^*, \pi^*, s^*)$ satisfies (13.4). We noted above that (13.4) and (13.9) are equivalent, and that (13.9) are sufficient conditions for $\pi^*$ to be a solution of the dual problem (13.7). Moreover, it follows from (13.5) that $c^T x^* = b^T \pi^*$, as claimed.

A symmetric argument holds if we start by assuming that the dual problem (13.7) has a solution.

To prove (ii), suppose that the primal is unbounded, that is, there is a sequence of points $x_k$, $k = 1, 2, 3, \dots$, such that
\[ c^T x_k \to -\infty, \qquad A x_k = b, \qquad x_k \ge 0. \]
Suppose too that the dual (13.7) is feasible, that is, there exists a vector $\bar{\pi}$ such that $A^T \bar{\pi} \le c$. From the latter inequality together with $x_k \ge 0$, we have that $\bar{\pi}^T A x_k \le c^T x_k$, and therefore
\[ \bar{\pi}^T b = \bar{\pi}^T A x_k \le c^T x_k \to -\infty, \]
yielding a contradiction. Hence, the dual must be infeasible.

A similar argument can be used to show that unboundedness of the dual implies infeasibility of the primal. □
As we showed in the discussion following Theorem 12.1, the multiplier values $\pi^*$ and $s^*$ for (13.1) indicate the sensitivity of the optimal objective value to perturbations in the constraints. In fact, the process of finding $(\pi^*, s^*)$ for a given optimal $x^*$ is often called sensitivity analysis. Considering the case of perturbations to the vector $b$ (the right-hand side in (13.1) and objective gradient in (13.7)), we can make an informal argument to illustrate the sensitivity. Suppose that this small change $\Delta b$ produces small perturbations in the primal and dual solutions, and that the vectors $\Delta s$ and $\Delta x$ have zeros in the same locations as $s^*$ and $x^*$, respectively. Since $x^*$ and $s^*$ are complementary (see (13.4e)), it follows that
\[ 0 = (x^*)^T s^* = (x^*)^T \Delta s = (\Delta x)^T s^* = (\Delta x)^T \Delta s. \]
We have from Theorem 13.1 that the optimal objectives of the primal and dual problems are equal, for both the original and perturbed problems, so
\[ c^T x^* = b^T \pi^*, \qquad c^T (x^* + \Delta x) = (b + \Delta b)^T (\pi^* + \Delta \pi). \]
Moreover, by feasibility of the perturbed solutions in the perturbed problems, we have
\[ A (x^* + \Delta x) = b + \Delta b, \qquad A^T \Delta \pi = -\Delta s. \]
Hence, the change in optimal objective due to the perturbation is as follows:
\[ c^T \Delta x = (b + \Delta b)^T (\pi^* + \Delta \pi) - b^T \pi^* = \Delta b^T \pi^* + (b + \Delta b)^T \Delta \pi = \Delta b^T \pi^* + (x^* + \Delta x)^T A^T \Delta \pi = \Delta b^T \pi^* - (x^* + \Delta x)^T \Delta s = \Delta b^T \pi^*. \tag{13.11} \]
In particular, if $\Delta b = \epsilon e_j$, where $e_j$ is the $j$th unit vector in $\mathbb{R}^m$, we have for all $\epsilon$ sufficiently small that
\[ c^T \Delta x = \epsilon \pi_j^*. \]
That is, the change in optimal objective is $\pi_j^*$ times the size of the perturbation to $b_j$, if the perturbation is small.
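The relation (13.11) is simple to verify numerically when the optimal basis is unchanged by the perturbation. A sketch in Python with SciPy, reusing the small hypothetical instance above (nondegenerate, so the dual solution $\pi^*$ is unique):

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    c = np.array([1.0, 1.0, 3.0])
    eps = 1e-3

    base = linprog(c, A_eq=A, b_eq=b)
    pert = linprog(c, A_eq=A, b_eq=b + eps * np.array([1.0, 0.0]))
    dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

    print(pert.fun - base.fun)     # change in optimal objective
    print(eps * dual.x[0])         # prediction eps * pi_1; both 1e-3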
13.2 GEOMETRY OF THE FEASIBLE SET
BASES AND BASIC FEASIBLE POINTS
We assume for the remainder of the chapter that
\[ \text{The matrix } A \text{ in (13.1) has full row rank.} \tag{13.12} \]
In practice, a preprocessing phase is applied to the user-supplied data to remove some redundancies from the given constraints and eliminate some of the variables. Reformulation by adding slack, surplus, and artificial variables can also result in $A$ satisfying the property (13.12).

Each iterate generated by the simplex method is a basic feasible point of (13.1). A vector $x$ is a basic feasible point if it is feasible and if there exists a subset $\mathcal{B}$ of the index set $\{1, 2, \dots, n\}$ such that
- $\mathcal{B}$ contains exactly $m$ indices;
- $i \notin \mathcal{B} \Rightarrow x_i = 0$ (that is, the bound $x_i \ge 0$ can be inactive only if $i \in \mathcal{B}$);
- the $m \times m$ matrix $B$ defined by
\[ B = [A_i]_{i \in \mathcal{B}} \tag{13.13} \]
is nonsingular, where $A_i$ is the $i$th column of $A$.

A set $\mathcal{B}$ satisfying these properties is called a basis for the problem (13.1). The corresponding matrix $B$ is called the basis matrix.
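Given a candidate basis $\mathcal{B}$, the corresponding point is obtained by solving an $m \times m$ linear system. A minimal sketch in Python with NumPy (hypothetical data; the function name is ours):

    import numpy as np

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])

    def basic_point(A, b, basis):
        B = A[:, basis]                   # basis matrix (13.13); assumed nonsingular
        x = np.zeros(A.shape[1])
        x[basis] = np.linalg.solve(B, b)  # basic components; others stay zero
        return x                          # basic *feasible* point iff x >= 0

    print(basic_point(A, b, [0, 1]))      # [1. 2. 0.]  (feasible)
    print(basic_point(A, b, [1, 2]))      # [0. 1. 1.]  (feasible)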
The simplex method's strategy of examining only basic feasible points will converge to a solution of (13.1) only if
(a) the problem has basic feasible points; and
(b) at least one such point is a basic optimal point, that is, a solution of (13.1) that is also a basic feasible point.
Happily, both (a) and (b) are true under reasonable assumptions, as the following result (sometimes known as the fundamental theorem of linear programming) shows.

Theorem 13.2.
(i) If (13.1) has a nonempty feasible region, then there is at least one basic feasible point;
(ii) If (13.1) has solutions, then at least one such solution is a basic optimal point;
(iii) If (13.1) is feasible and bounded, then it has an optimal solution.
PROOF. Among all feasible vectors x, choose one with the minimal number of nonzero components, and denote this number by p. Without loss of generality, assume that the nonzeros are x1,x2,…,xp, so we have
p
Aixi b.
i1
Suppose first that the columns A_1, A_2, …, A_p are linearly dependent. Then we can express one of them (A_p, say) in terms of the others, and write

    A_p = Σ_{i=1}^{p-1} A_i z_i,                                 (13.14)
for some scalars z_1, z_2, …, z_{p-1}. It is easy to check that the vector

    x(ε) = x + ε(z_1, z_2, …, z_{p-1}, -1, 0, 0, …, 0)^T = x + εz    (13.15)

satisfies Ax(ε) = b for any scalar ε. In addition, since x_i > 0 for i = 1, 2, …, p, we also have x_i(ε) > 0 for the same indices i = 1, 2, …, p and all ε sufficiently small in magnitude. However, there is a value ε̄ ∈ (0, x_p] such that x_i(ε̄) = 0 for some i = 1, 2, …, p. Hence, x(ε̄) is feasible and has at most p - 1 nonzero components, contradicting our choice of p as the minimal number of nonzeros.
Therefore, the columns A_1, A_2, …, A_p must be linearly independent, and so p ≤ m. If p = m, we are done, since then x is a basic feasible point and B is simply {1, 2, …, m}. Otherwise p < m and, because A has full row rank, we can choose m - p columns from among A_{p+1}, A_{p+2}, …, A_n to build up a set of m linearly independent vectors. We construct B by adding the corresponding indices to {1, 2, …, p}. The proof of (i) is complete.
The proof of (ii) is quite similar. Let x* be a solution with a minimal number of nonzero components p, and assume again that x*_1, x*_2, …, x*_p are the nonzeros. If the columns A_1, A_2, …, A_p are linearly dependent, we define

    x*(ε) = x* + εz,

where z is chosen exactly as in (13.14), (13.15). It is easy to check that x*(ε) will be feasible for all ε sufficiently small, both positive and negative. Hence, since x* is optimal, we must have

    c^T (x* + εz) = c^T x* + ε c^T z ≥ c^T x*

for all ε sufficiently small (positive and negative). Therefore, c^T z = 0, and so c^T x*(ε) = c^T x* for all ε. The same logic as in the proof of (i) can be applied to find ε̄ > 0 such that x*(ε̄) is feasible and optimal, with at most p - 1 nonzero components. This contradicts our choice of p as the minimal number of nonzeros, so the columns A_1, A_2, …, A_p must be linearly independent. We can now apply the same reasoning as above to conclude that x* is already a basic feasible point and therefore a basic optimal point.
The final statement (iii) is a consequence of finite termination of the simplex method. We comment on the latter property in the next section. □
The terminology we use here is not quite standard, as the following table shows:

    our terminology           terminology used elsewhere
    --------------------      -------------------------------
    basic feasible point      basic feasible solution
    basic optimal point       optimal basic feasible solution
The standard terms arose because solution and feasible solution were originally used as synonyms for feasible point. However, as the discipline of optimization developed,
Figure 13.2: Vertices of a three-dimensional polytope (indicated by ∗).
the word solution took on a more specific and intuitive meaning (as in "solution to the problem"). We maintain consistency with the rest of the book by following this more modern usage.
VERTICES OF THE FEASIBLE POLYTOPE
The feasible set defined by the linear constraints is a polytope, and the vertices of this polytope are the points that do not lie on a straight line between two other points in the set. Geometrically, they are easily recognizable; see Figure 13.2. Algebraically, the vertices are exactly the basic feasible points defined above. We therefore have an important relationship between the algebraic and geometric viewpoints and a useful aid to understanding how the simplex method works.
Theorem 13.3.
All basic feasible points for (13.1) are vertices of the feasible polytope {x | Ax = b, x ≥ 0}, and vice versa.
PROOF. Let x be a basic feasible point and assume without loss of generality that B = {1, 2, …, m}. The matrix B = [A_i]_{i=1,2,…,m} is therefore nonsingular, and

    x_{m+1} = x_{m+2} = ··· = x_n = 0.                           (13.16)

Suppose that x lies on a straight line between two other feasible points y and z. Then we can find α ∈ (0, 1) such that x = αy + (1 - α)z. Because of (13.16) and the fact that α and 1 - α are both positive, we must have y_i = z_i = 0 for i = m+1, m+2, …, n. Writing x_B = (x_1, x_2, …, x_m)^T and defining y_B and z_B likewise, we have from Ax = Ay = Az = b
that

    B x_B = B y_B = B z_B = b,

and so, by nonsingularity of B, we have x_B = y_B = z_B. Therefore, x = y = z, contradicting our assertion that y and z are two feasible points other than x. Therefore, x is a vertex.
Conversely, let x be a vertex of the feasible polytope, and suppose that the nonzero components of x are x_1, x_2, …, x_p. If the corresponding columns A_1, A_2, …, A_p are linearly dependent, then we can construct the vector x(ε) = x + εz as in (13.15). Since x(ε) is feasible for all ε with sufficiently small magnitude, we can define ε̄ > 0 such that x(ε̄) and x(-ε̄) are both feasible. Since x = x(0) obviously lies on a straight line between these two points, it cannot be a vertex. Hence our assertion that A_1, A_2, …, A_p are linearly dependent must be incorrect, so these columns must be linearly independent and p ≤ m. If p < m, then since A has full row rank, we can add m - p indices to {1, 2, …, p} to form a basis B, for which x is the corresponding basic feasible point. This completes our proof. □
We conclude this discussion of the geometry of the feasible set with a definition of degeneracy. This term has a variety of meanings in optimization, as we discuss in Chapter 16. For the purposes of this chapter, we use the following definition.
Definition 13.1 (Degeneracy).
A basis B is said to be degenerate if x_i = 0 for some i ∈ B, where x is the basic feasible point corresponding to B. A linear program (13.1) is said to be degenerate if it has at least one degenerate basis.
13.3 THE SIMPLEX METHOD
OUTLINE
In this section we give a detailed description of the simplex method for (13.1). There are actually a number of variants of the simplex method; the one described here is sometimes known as the revised simplex method. We will describe an alternative, known as the dual simplex method, in Section 13.6.
As we described above, all iterates of the simplex method are basic feasible points for (13.1) and therefore vertices of the feasible polytope. Most steps consist of a move from one vertex to an adjacent one for which the basis B differs in exactly one component. On most steps (but not all), the value of the primal objective function c^T x is decreased. Another type of step occurs when the problem is unbounded: the step is an edge along which the objective function is reduced, and along which we can move infinitely far without ever reaching a vertex.
The major issue at each simplex iteration is to decide which index to remove from the basis B. Unless the step is a direction of unboundedness, a single index must be removed from B and replaced by another from outside B. We can gain some insight into how this decision is made by looking again at the KKT conditions (13.4).

From B and (13.4), we can derive values for not just the primal variable x but also the dual variables (λ, s), as we now show. First, define the nonbasic index set N as the complement of B, that is,

    N = {1, 2, …, n} \ B.                                        (13.17)
Just as B is the basic matrix, whose columns are A_i for i ∈ B, we use N to denote the nonbasic matrix N = [A_i]_{i∈N}. We also partition the n-element vectors x, s, and c according to the index sets B and N, using the notation

    x_B = [x_i]_{i∈B},   s_B = [s_i]_{i∈B},   c_B = [c_i]_{i∈B},
    x_N = [x_i]_{i∈N},   s_N = [s_i]_{i∈N},   c_N = [c_i]_{i∈N}.

From the KKT condition (13.4b), we have that

    Ax = B x_B + N x_N = b.

The primal variable x for this simplex iterate is defined as

    x_B = B^{-1} b,   x_N = 0.                                   (13.18)
Since we are dealing only with basic feasible points, we know that B is nonsingular and that x_B ≥ 0, so this choice of x satisfies two of the KKT conditions: the equality constraints (13.4b) and the nonnegativity condition (13.4c).
We choose s to satisfy the complementarity condition (13.4e) by setting s_B = 0. The remaining components λ and s_N can be found by partitioning the dual feasibility condition (13.4a) into c_B and c_N components and using s_B = 0 to obtain

    B^T λ = c_B,   N^T λ + s_N = c_N.                            (13.19)

Since B is square and nonsingular, the first equation uniquely defines λ as

    λ = B^{-T} c_B.                                              (13.20)

The second equation in (13.19) implies a value for s_N:

    s_N = c_N - N^T λ = c_N - (B^{-1} N)^T c_B.                  (13.21)
Computation of the vector s_N is often referred to as pricing. The components of s_N are often called the reduced costs of the nonbasic variables x_N.
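In matrix terms, (13.18)-(13.21) amount to two linear solves (with B and B^T) and one matrix-vector product with N^T. A minimal NumPy sketch of these computations (the index lists defining B and N are assumed given):

    import numpy as np

    def primal_dual_from_basis(A, b, c, basis, nonbasis):
        """Recover (x_B, lambda, s_N) from a basis as in (13.18)-(13.21)."""
        B, N = A[:, basis], A[:, nonbasis]
        x_B = np.linalg.solve(B, b)            # (13.18): x_B = B^{-1} b, x_N = 0
        lam = np.linalg.solve(B.T, c[basis])   # (13.20): lambda = B^{-T} c_B
        s_N = c[nonbasis] - N.T @ lam          # (13.21): pricing / reduced costs
        return x_B, lam, s_N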
The only KKT condition that we have not enforced explicitly is the nonnegativity condition s ≥ 0. The basic components s_B certainly satisfy this condition, by our choice s_B = 0. If the vector s_N defined by (13.21) also satisfies s_N ≥ 0, we have found an optimal vector triple (x, λ, s), so the algorithm can terminate and declare success. Usually, however, one or more of the components of s_N are negative. The new index to enter the basis B (the entering index) is chosen to be one of the indices q ∈ N for which s_q < 0. As we show below, the objective c^T x will decrease when we allow x_q to become positive if and only if (i) s_q < 0 and (ii) it is possible to increase x_q away from zero while maintaining feasibility of x. Our procedure for altering B and changing x and s can be described as follows:
• allow x_q to increase from zero during the next step;

• fix all other components of x_N at zero, and figure out the effect of increasing x_q on the current basic vector x_B, given that we want to stay feasible with respect to the equality constraints Ax = b;

• keep increasing x_q until one of the components of x_B (x_p, say) is driven to zero, or until we determine that no such component exists (the unbounded case);

• remove the index p (known as the leaving index) from B and replace it with the entering index q.
This process of selecting entering and leaving indices, and performing the algebraic operations necessary to keep track of the values of the variables x, λ, and s, is sometimes known as pivoting.
We now formalize the pivoting procedure in algebraic terms. Since both the new iterate x^+ and the current iterate x should satisfy Ax = b, and since x_N = 0 and x_i^+ = 0 for i ∈ N \ {q}, we have

    A x^+ = B x_B^+ + A_q x_q^+ = B x_B = A x.

By multiplying this expression by B^{-1} and rearranging, we obtain

    x_B^+ = x_B - B^{-1} A_q x_q^+.                              (13.22)
Geometrically speaking, (13.22) is usually a move along an edge of the feasible polytope that decreases c^T x. We continue to move along this edge until a new vertex is encountered. At this vertex, a new constraint x_p ≥ 0 must have become active, that is, one of the components x_p, p ∈ B, has decreased to zero. We then remove this index p from the basis B and replace it by q.
We now show how the step defined by (13.22) affects the value of c^T x. From (13.22), we have

    c^T x^+ = c_B^T x_B^+ + c_q x_q^+ = c_B^T x_B - c_B^T B^{-1} A_q x_q^+ + c_q x_q^+.    (13.23)

From (13.20) we have c_B^T B^{-1} = λ^T, while from the second equation in (13.19), since q ∈ N, we have A_q^T λ = c_q - s_q. Therefore,

    c_B^T B^{-1} A_q x_q^+ = λ^T A_q x_q^+ = (c_q - s_q) x_q^+,

so by substituting into (13.23) we obtain

    c^T x^+ = c_B^T x_B - (c_q - s_q) x_q^+ + c_q x_q^+ = c^T x + s_q x_q^+.    (13.24)
Since q was chosen to have s_q < 0, it follows that the step (13.22) produces a decrease in the primal objective function c^T x whenever x_q^+ > 0.
It is possible that we can increase x_q^+ to ∞ without ever encountering a new vertex. In other words, the constraint x_B^+ = x_B - B^{-1} A_q x_q^+ ≥ 0 holds for all positive values of x_q^+. When this happens, the linear program is unbounded; the simplex method has identified a ray that lies entirely within the feasible polytope along which the objective c^T x decreases to -∞.
Figure 13.3 shows a path traversed by the simplex method for a problem in ℝ². In this example, the optimal vertex x* is found in three steps.
Figure 13.3: Simplex iterates for a two-dimensional problem. (The figure shows the simplex path through vertices 0, 1, 2, 3 and the objective direction c.)

If the basis B is nondegenerate (see Definition 13.1), then we are guaranteed that x_q^+ > 0, so we can be assured of a strict decrease in the objective function c^T x at this step. If the problem (13.1) is nondegenerate, we can ensure a decrease in c^T x at every step, and can therefore prove the following result concerning termination of the simplex method.
Theorem 13.4.
Provided that the linear program 13.1 is nondegenerate and bounded, the simplex method terminates at a basic optimal point.
PROOF. The simplex method cannot visit the same basic feasible point x at two different iterations, because it attains a strict decrease at each iteration. Since the number of possible bases B is finite (there are only a finite number of ways to choose a subset of m indices from {1, 2, …, n}), and since each basis defines a single basic feasible point, there are only a finite number of basic feasible points. Hence, the number of iterations is finite. Moreover, since the method is always able to take a step away from a nonoptimal basic feasible point, and since the problem is not unbounded, the method must terminate at a basic optimal point. □
This result gives us a proof of Theorem 13.2 iii in the case in which the linear program is nondegenerate. The proof of finite termination is considerably more complex when nondegeneracy of 13.1 is not assumed, as we discuss at the end of Section 13.5.
A SINGLE STEP OF THE METHOD
We have covered most of the mechanics of taking a single step of the simplex method. To make subsequent discussions easier to follow, we summarize our description.
Procedure 13.1 (One Step of Simplex).

    Given B, N, x_B = B^{-1} b ≥ 0, x_N = 0;
    Solve B^T λ = c_B for λ,
        and compute s_N = c_N - N^T λ;             (* pricing *)
    if s_N ≥ 0
        stop;                                      (* optimal point found *)
    Select q ∈ N with s_q < 0 as the entering index;
    Solve B d = A_q for d;
    if d ≤ 0
        stop;                                      (* problem is unbounded *)
    Calculate x_q^+ = min_{i | d_i > 0} (x_B)_i / d_i,
        and use p to denote the minimizing i;
    Update x_B^+ = x_B - d x_q^+, x_N^+ = (0, …, 0, x_q^+, 0, …, 0)^T;
    Change B by adding q and removing the basic variable
        corresponding to column p of B.
We illustrate this procedure with a simple example.
EXAMPLE 13.1

Consider the problem

    min -3x_1 - 2x_2   subject to
        x_1 + x_2 + x_3 = 5,
        2x_1 + (1/2)x_2 + x_4 = 8,
        x ≥ 0.
Suppose we start with the basis B = {3, 4}, for which we have

    x_B = (x_3, x_4)^T = (5, 8)^T,   λ = (0, 0)^T,   s_N = (s_1, s_2)^T = (-3, -2)^T,

and an objective value of c^T x = 0. Since both elements of s_N are negative, we could choose either 1 or 2 to be the entering variable. Suppose we choose q = 1. We obtain d = (1, 2)^T, so we cannot yet conclude that the problem is unbounded. By performing the ratio calculation, we find that p = 2 (corresponding to the index 4) and x_1^+ = 4. We update the basic and nonbasic index sets to B = {3, 1} and N = {4, 2}, and move to the next iteration.
At the second iteration, we have

    x_B = (x_3, x_1)^T = (1, 4)^T,   λ = (0, -3/2)^T,   s_N = (s_4, s_2)^T = (3/2, -5/4)^T,

with an objective value of -12. We see that s_N has one negative component, corresponding to the index q = 2, so we select this index to enter the basis. We obtain d = (3/4, 1/4)^T, so again we do not detect unboundedness. Continuing, we find that the maximum value of x_2^+ is 4/3, and that p = 1, which indicates that index 3 will leave the basis B. We update the index sets to B = {2, 1} and N = {4, 3} and continue.
At the start of the third iteration, we have

    x_B = (x_2, x_1)^T = (4/3, 11/3)^T,   λ = (-5/3, -2/3)^T,   s_N = (s_4, s_3)^T = (2/3, 5/3)^T,

with an objective value of c^T x = -41/3. We see that s_N ≥ 0, so the optimality test is satisfied, and we terminate. □
We need to flesh out Procedure 13.1 with specifics of three important aspects of the implementation:
• Linear algebra issues: maintaining an LU factorization of B that can be used to solve for λ and d.
• Selection of the entering index q from among the negative components of s_N. In general, there are many such components.
• Handling of degenerate bases and degenerate steps, in which it is not possible to choose a positive value of x_q without violating feasibility.
Proper handling of these issues is crucial to the efficiency of a simplex implementation. We give some details in the next three sections.
13.4 LINEAR ALGEBRA IN THE SIMPLEX METHOD
We have to solve two linear systems involving the matrix B at each step; namely,

    B^T λ = c_B,   B d = A_q.                                    (13.25)
We never calculate the inverse basis matrix B^{-1} explicitly just to solve these systems. Instead, we calculate or maintain some factorization of B (usually an LU factorization) and use triangular substitutions with the factors to recover λ and d. It is less expensive to update the factorization than to calculate it afresh at each iteration, because the basis matrix B changes by just a single column between iterations.
The standard factorization-updating procedures start with an LU factorization of B at the first iteration of the simplex algorithm. Since in practical applications B is large and sparse, its rows and columns are rearranged during the factorization to maintain both numerical stability and sparsity of the L and U factors. One successful pivot strategy that trades off between these two aims was proposed by Markowitz in 1957 [202]; it is still used as the basis of many practical sparse LU algorithms. Other considerations may also enter into our choice of row and column reordering of B. For example, it may help to improve the efficiency of the updating procedure if as many as possible of the leading columns of U contain just a single nonzero, on the diagonal. Many heuristics have been devised for choosing row and column permutations that produce this and other desirable structural features.
Let us assume for simplicity that row and column permutations are already incorporated in B, so that we write the initial LU factorization as

    L U = B,                                                     (13.26)

where L is unit lower triangular and U is upper triangular. The system B d = A_q can then be solved by the following two-step procedure:

    L d̄ = A_q,   U d = d̄.                                        (13.27)
Figure 13.4: Left: L^{-1} B^+, which is upper triangular except for the column that previously contained A_p (column p). Right: after the cyclic row and column permutation P_1, the non-upper-triangular part of P_1 L^{-1} B^+ P_1^T appears in the last row.
Similarly, the system B^T λ = c_B is solved by performing the following two triangular substitutions:

    U^T λ̄ = c_B,   L^T λ = λ̄.
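For illustration, the dense LU routines in SciPy can stand in for the sparse factorization: factor B once, then obtain d and λ from triangular substitutions alone. (This sketch carries the row permutation P explicitly, rather than assuming it has been folded into B as in (13.26); the data are invented.)

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    B = np.array([[4.0, 1.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    A_q = np.array([1.0, 0.0, 1.0])
    c_B = np.array([1.0, 2.0, 3.0])

    P, L, U = lu(B)                                   # B = P L U (dense LU)
    # B d = A_q via (13.27): L dbar = P^T A_q, then U d = dbar.
    dbar = solve_triangular(L, P.T @ A_q, lower=True)
    d = solve_triangular(U, dbar)
    # B^T lam = c_B: U^T lambar = c_B, then L^T y = lambar, lam = P y.
    lambar = solve_triangular(U.T, c_B, lower=True)
    lam = P @ solve_triangular(L.T, lambar, lower=False)
    print(np.allclose(B @ d, A_q), np.allclose(B.T @ lam, c_B))  # True True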
We now discuss a procedure for updating the factors L and U after one step of the simplex method, when the index p is removed from the basis B and replaced by the index q. The corresponding change to the basis matrix B is that the column B_p is removed from B and replaced by A_q. We call the resulting matrix B^+ and note that if we rewrite (13.26) as U = L^{-1} B, the modified matrix L^{-1} B^+ will be upper triangular except in column p. That is, L^{-1} B^+ has the form shown on the left in Figure 13.4.
We now perform a cyclic permutation that moves column p to the last column position m and moves columns p+1, p+2, …, m one position to the left to make room for it. If we apply the same permutation to rows p through m, the net effect is to move the non-upper-triangular part to the last row of the matrix, as shown in Figure 13.4. If we denote the permutation matrix by P_1, the matrix illustrated at right in Figure 13.4 is P_1 L^{-1} B^+ P_1^T.
Finally, we perform sparse Gaussian elimination on the matrix P_1 L^{-1} B^+ P_1^T to restore upper triangular form. That is, we find L_1 and U_1 (lower and upper triangular, respectively) such that

    P_1 L^{-1} B^+ P_1^T = L_1 U_1.                              (13.28)
It is easy to show that L_1 and U_1 have a simple form. The lower triangular matrix L_1 differs from the identity only in the last row, while U_1 is identical to P_1 L^{-1} B^+ P_1^T except that the (m, m) element is changed and the off-diagonal elements in the last row are eliminated.
We give details of this process for the case of m = 5. Using the notation

    L^{-1} B = U = [ u11 u12 u13 u14 u15 ]
                   [     u22 u23 u24 u25 ]
                   [         u33 u34 u35 ]                       (13.29)
                   [             u44 u45 ]
                   [                 u55 ],

    L^{-1} A_q = (w1, w2, w3, w4, w5)^T,

and supposing that p = 2 (so that the second column is replaced by L^{-1} A_q), we have

    L^{-1} B^+ = [ u11 w1 u13 u14 u15 ]
                 [     w2 u23 u24 u25 ]
                 [     w3 u33 u34 u35 ]
                 [     w4     u44 u45 ]
                 [     w5         u55 ].

After the cyclic permutation P_1, we have

    P_1 L^{-1} B^+ P_1^T = [ u11 u13 u14 u15 w1 ]
                           [     u33 u34 u35 w3 ]
                           [         u44 u45 w4 ]
                           [             u55 w5 ]
                           [     u23 u24 u25 w2 ].

The factors L_1 and U_1 are now as follows:

    L_1 = [ 1               ]     U_1 = [ u11 u13 u14 u15 w1 ]
          [    1            ]           [     u33 u34 u35 w3 ]
          [       1         ]           [         u44 u45 w4 ]    (13.30)
          [          1      ]           [             u55 w5 ]
          [ 0 l52 l53 l54 1 ],          [                 ŵ2 ],

for certain values of l52, l53, l54, and ŵ2 (see Exercise 13.10).

The result of this updating process is the factorization (13.28), which we can rewrite as follows:

    B^+ = L^+ U^+,  where  L^+ = L P_1^T L_1,  U^+ = U_1 P_1.    (13.31)
There is no need to calculate L^+ and U^+ explicitly. Rather, the nonzero elements in L_1 and the last column of U_1, and the permutation information in P_1, can be stored in compact form, so that triangular substitutions involving L^+ and U^+ can be performed by applying a number of permutations and sparse triangular substitutions involving these factors. The factorization updates from subsequent simplex steps are stored and applied in a similar fashion.
The procedure we have just outlined is due to Forrest and Tomlin [110]. It is quite efficient, because it requires the storage of little data at each update and does not require much movement of data in memory. Its major disadvantage is possible numerical instability. Large elements in the factors of a matrix are a sure indicator of instability, and the multipliers in the L_1 factor (l52 in (13.30), for example) may be very large. An earlier scheme of Bartels and Golub [12] allowed swapping of rows to avoid these problems. For instance, if |u33| < |u23| in (13.29), we could swap rows 2 and 5 to ensure that the subsequent multiplier l52 in the L_1 factor does not exceed 1 in magnitude. This improved stability comes at a price: The lower right corner of the upper triangular factor may become more dense during each update.
Although the update information for each iteration (the permutation matrices and the sparse triangular factors) can often be stored in a highly compact form, the total amount of space may build up to unreasonable levels after many such updates have been performed. As the number of updates builds up, so does the time needed to solve for the vectors d and λ in Procedure 13.1. If an unstable updating procedure is used, numerical errors may also come into play, blocking further progress by the simplex algorithm. For all these reasons, most simplex implementations periodically calculate a fresh LU factorization of the current basis matrix B and discard the accumulated updates. The new factorization uses the same permutation strategies that we apply to the very first factorization, which balance the requirements of stability, sparsity, and structure.
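The column-replacement update can be sketched in dense form. The helper below is hypothetical and for illustration only: real implementations work with sparse factors and include the stability safeguards discussed above. It replaces a column of B, applies the cyclic permutation, eliminates the last row, and then checks the identity (13.31).

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    def replace_column_update(U, p, w):
        """Dense sketch of the update (13.28)-(13.31).

        U = L^{-1} B is upper triangular; column p of B is replaced by A_q,
        with w = L^{-1} A_q, so H = L^{-1} B^+ is triangular except for the
        spike in column p.  A cyclic permutation P1 moves the offending
        entries to the last row, which is then eliminated.  Returns
        (L1, U1, P1) with P1 H P1^T = L1 U1; no stability pivoting is done.
        """
        m = U.shape[0]
        H = U.copy()
        H[:, p] = w                                   # the spike column
        perm = list(range(p)) + list(range(p + 1, m)) + [p]
        P1 = np.eye(m)[perm, :]
        H = P1 @ H @ P1.T
        L1 = np.eye(m)
        for k in range(p, m - 1):                     # eliminate the last row
            mult = H[-1, k] / H[k, k]
            H[-1, :] -= mult * H[k, :]
            L1[-1, k] = mult
        return L1, H, P1

    rng = np.random.default_rng(0)
    m, p, q_col = 5, 1, np.random.default_rng(1).standard_normal(5)
    P, L, U = lu(rng.standard_normal((m, m)) + 5 * np.eye(m))
    B = P @ L @ U
    w = solve_triangular(L, P.T @ q_col, lower=True)  # w = L^{-1} A_q
    L1, U1, P1 = replace_column_update(U, p, w)
    B_plus = B.copy()
    B_plus[:, p] = q_col
    # Check (13.31): B^+ = (P L P1^T L1)(U1 P1).
    print(np.allclose(B_plus, P @ L @ P1.T @ L1 @ U1 @ P1))      # True
    print(np.allclose(U1, np.triu(U1)))                          # True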
13.5 OTHER IMPORTANT DETAILS
PRICING AND SELECTION OF THE ENTERING INDEX
There are usually many negative components of s_N at each step. How do we choose one of these to become the index that enters the basis? Ideally, we would like to choose the sequence of entering indices q that gets us to the solution x* in the fewest possible steps, but we rarely have the global perspective needed to implement this strategy. Instead, we use more short-sighted but practical strategies that obtain a significant decrease in c^T x on just the present iteration. There is usually a tradeoff between the effort spent on finding a good entering index and the amount of decrease in c^T x resulting from this choice. Different pivot strategies resolve this tradeoff in different ways.
Dantzig's original selection rule is one of the simplest. It chooses q such that s_q is the most negative component of s_N = c_N - N^T λ. This rule, which is motivated by (13.24), gives the maximum improvement in c^T x per unit increase in the entering variable x_q. A large reduction in c^T x is not guaranteed, however. It could be that we can increase x_q^+ only a tiny amount from zero (or not at all) before reaching the next vertex.
Calculation of the entire vector s_N from (13.21) requires a multiplication by N^T, which can be expensive when the matrix N is very large. Partial pricing strategies calculate only a subvector of s_N and make the choice of entering variable from among the negative entries in this subvector. To give all the indices in N a chance to enter the basis, these strategies cycle through the nonbasic elements, periodically changing the subvector of s_N they evaluate so that no nonbasic index is ignored for too long.
Neither of these strategies guarantees that we can make a substantial move along the chosen edge before reaching a new vertex. Multiple pricing strategies are more thorough: For a small subset of indices q ∈ N, they evaluate s_q and, if s_q < 0, the maximum value of x_q^+ that maintains feasibility of x and the consequent change s_q x_q^+ in the objective function (see (13.24)). Calculation of x_q^+ requires evaluation of d = B^{-1} A_q as in Procedure 13.1, which is not cheap. Subsequent iterations deal with this same index subset until we reach an iteration at which all s_q are nonnegative for q in the subset. At this point, the full vector s_N is computed, a new subset of nonbasic indices is chosen, and the cycle begins again. This approach has the advantage that the columns of the matrix N outside the current subset of priced components need not be accessed at all, so memory access in the implementation is quite localized.
Naturally, it is possible to devise heuristics that combine partial and multiple pricing in various imaginative ways.
A sophisticated rule known as steepest edge chooses the most downhill direction from among all the candidates: the one that produces the largest decrease in c^T x per unit distance moved along the edge. By contrast, Dantzig's rule maximizes the decrease in c^T x per unit change in x_q, which is not the same thing, as a small change in x_q can correspond to a large distance moved along the edge. During the pivoting step, the overall change in x is
    x^+ - x = [ x_B^+ - x_B ]  =  [ -B^{-1} A_q ] x_q^+  =  η_q x_q^+,    (13.32)
              [ x_N^+ - x_N ]     [     e_q     ]

where e_q is the unit vector with a 1 in the position corresponding to the index q ∈ N and zeros elsewhere, and the vector η_q is defined as

    η_q = [ -B^{-1} A_q ]  =  [ -d  ]                            (13.33)
          [     e_q     ]     [ e_q ];

see (13.25). The change in c^T x per unit step along η_q is given by

    c^T η_q / ||η_q||.                                           (13.34)

The steepest-edge rule chooses q ∈ N to minimize this quantity.
If we had to compute each η_i by solving B d_i = A_i for each i ∈ N, the steepest-edge strategy would be prohibitively expensive. Goldfarb and Reid [134] showed that the measure (13.34) of edge steepness for all indices i ∈ N can, in fact, be updated quite economically at each iteration. We outline their steepest-edge procedure by showing how each c^T η_i and ||η_i|| can be updated at the current iteration.

First, note that we already know the numerator c^T η_i in (13.34) without calculating η_i, because by taking the inner product of (13.32) with c and using (13.24), we have that c^T η_i = s_i. To investigate the change in the denominator ||η_i|| at this step, we define γ_i = ||η_i||^2, where this quantity is defined before and after the update as follows:

    γ_i = ||η_i||^2 = ||B^{-1} A_i||^2 + 1,                      (13.35a)
    γ_i^+ = ||η_i^+||^2 = ||(B^+)^{-1} A_i||^2 + 1.              (13.35b)
Assume without loss of generality that the entering column A_q replaces the first column of the basis matrix B (that is, p = 1), and that this column corresponds to the index t. We can then express the update to B as follows:

    B^+ = B + (A_q - A_t) e_1^T = B + (A_q - B e_1) e_1^T,       (13.36)

where e_1 = (1, 0, 0, …, 0)^T. By applying the Sherman-Morrison formula (A.27) to the rank-one update formula in (13.36), we obtain

    (B^+)^{-1} = B^{-1} - (B^{-1} A_q - e_1) e_1^T B^{-1} / (1 + e_1^T (B^{-1} A_q - e_1))
               = B^{-1} - (d - e_1) e_1^T B^{-1} / (e_1^T d),

where again we have used the fact that d = B^{-1} A_q (see (13.25)). Therefore, we have that
    (B^+)^{-1} A_i = B^{-1} A_i - (e_1^T B^{-1} A_i / e_1^T d) (d - e_1).
By substituting for (B^+)^{-1} A_i in (13.35) and performing some simple manipulation, we obtain

    γ_i^+ = γ_i - 2 (e_1^T B^{-1} A_i / e_1^T d) A_i^T B^{-T} d
                + (e_1^T B^{-1} A_i / e_1^T d)^2 γ_q.            (13.37)
Once we solve the following two linear systems to obtain d̂ and r:

    B^T d̂ = d,   B^T r = e_1,                                    (13.38)
the formula (13.37) then becomes

    γ_i^+ = γ_i - 2 (r^T A_i / r^T A_q) d̂^T A_i + (r^T A_i / r^T A_q)^2 γ_q.    (13.39)
Hence, the entire set of γ_i^+ values, for i ∈ N with i ≠ q, can be calculated by solving the two systems (13.38) and then evaluating the inner products r^T A_i and d̂^T A_i, for each i.
The steepest-edge strategy does not guarantee that we can take a long step before reaching another vertex, but it has proved to be highly effective in practice.
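The recurrence (13.39) is easy to sanity-check against the definition (13.35) on random data. A dense sketch with p = 1, as in the derivation above (all data invented):

    import numpy as np

    rng = np.random.default_rng(1)
    m, n_N = 4, 3
    B = rng.standard_normal((m, m)) + 4 * np.eye(m)   # current basis matrix
    A_N = rng.standard_normal((m, n_N))               # nonbasic columns A_i
    A_q = rng.standard_normal(m)                      # entering column
    e1 = np.eye(m)[:, 0]                              # p = 1: first column leaves

    d = np.linalg.solve(B, A_q)                       # d = B^{-1} A_q
    gamma = np.sum(np.linalg.solve(B, A_N)**2, axis=0) + 1.0     # (13.35a)
    gamma_q = d @ d + 1.0

    dhat = np.linalg.solve(B.T, d)                    # (13.38): B^T dhat = d
    r = np.linalg.solve(B.T, e1)                      # (13.38): B^T r = e1
    ratio = (r @ A_N) / (r @ A_q)
    gamma_new = gamma - 2 * ratio * (dhat @ A_N) + ratio**2 * gamma_q    # (13.39)

    B_plus = B.copy()
    B_plus[:, 0] = A_q                                # replace the first column
    direct = np.sum(np.linalg.solve(B_plus, A_N)**2, axis=0) + 1.0       # (13.35b)
    print(np.allclose(gamma_new, direct))             # True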
STARTING THE SIMPLEX METHOD
The simplex method requires a basic feasible starting point x and a corresponding initial basis B ⊂ {1, 2, …, n} with |B| = m such that the basis matrix B defined by (13.13) is nonsingular and x_B = B^{-1} b ≥ 0 and x_N = 0. The problem of finding this initial point and basis may itself be nontrivial; in fact, its difficulty is equivalent to that of actually solving a linear program. We describe here the two-phase approach that is commonly used to deal with this difficulty in practical implementations.
In Phase I of this approach we set up an auxiliary linear program based on the data of (13.1), and solve it with the simplex method. The Phase-I problem is designed so that an initial basis and initial basic feasible point are trivial to find, and so that its solution gives a basic feasible starting point for the second phase. In Phase II, a second linear program similar to the original problem (13.1) is solved, with the Phase-I solution as a starting point. The solution of the original problem (13.1) can be extracted easily from the solution of the Phase-II problem.
In Phase I we introduce artificial variables z into (13.1) and redefine the objective function to be the sum of these artificial variables, as follows:

    min e^T z,  subject to  Ax + Ez = b,  (x, z) ≥ 0,            (13.40)

where z ∈ ℝ^m, e = (1, 1, …, 1)^T, and E is a diagonal matrix whose diagonal elements are

    E_jj = +1 if b_j ≥ 0,   E_jj = -1 if b_j < 0.

It is easy to see that the point (x, z) defined by

    x = 0,   z_j = |b_j|,  j = 1, 2, …, m,                       (13.41)

is a basic feasible point for (13.40). Obviously, this point satisfies the constraints in (13.40), while the initial basis matrix B is simply the diagonal matrix E, which is clearly nonsingular. At any feasible point for (13.40), the artificial variables z represent the amounts by which the constraints Ax = b are violated by the x component. The objective function is
simply the sum of these violations, so by minimizing this sum we are forcing x to become feasible for the original problem (13.1). It is not difficult to see that the Phase-I problem (13.40) has an optimal objective value of zero if and only if the original problem (13.1) is feasible, by using the following argument: If there exists a vector (x, z) that is feasible for (13.40) such that e^T z = 0, we must have z = 0, and therefore Ax = b and x ≥ 0, so x is feasible for (13.1). Conversely, if x is feasible for (13.1), then the point (x, 0) is feasible for (13.40) with an objective value of 0. Since the objective in (13.40) is obviously nonnegative at all feasible points, then (x, 0) must be optimal for (13.40), verifying our claim.
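Constructing the Phase-I data (13.40) and the starting point (13.41) is mechanical. A NumPy sketch (the returned basis indexes the artificial variables, so the initial basis matrix is E); the result can be handed to any standard-form simplex routine, such as the sketch shown earlier:

    import numpy as np

    def phase_one_data(A, b):
        """Build the Phase-I problem (13.40) and its starting point (13.41)."""
        m, n = A.shape
        E = np.diag(np.where(b >= 0, 1.0, -1.0))      # E_jj = +1 or -1
        A1 = np.hstack([A, E])                        # constraints [A E][x;z] = b
        c1 = np.concatenate([np.zeros(n), np.ones(m)])   # objective e^T z
        x0 = np.concatenate([np.zeros(n), np.abs(b)])    # x = 0, z = |b|
        basis = list(range(n, n + m))                 # initial basis matrix is E
        return A1, c1, x0, basis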
In Phase I, we apply the simplex method to (13.40) from the initial point (13.41). This linear program cannot be unbounded, because its objective function is bounded below by 0, so the simplex method will terminate at an optimal point (assuming that it does not cycle; see below). If the objective e^T z is positive at this solution, we conclude by the argument above that the original problem (13.1) is infeasible. Otherwise, the simplex method identifies a point (x, z) with e^T z = 0, which is also a basic feasible point for the following Phase-II problem:
    min c^T x  subject to  Ax + z = b,  x ≥ 0,  0 ≤ z ≤ 0.       (13.42)
Note that this problem differs from (13.40) in that the objective function is replaced by the original objective c^T x, while upper bounds of 0 have been imposed on z. In fact, (13.42) is equivalent to (13.1), because any solution (and indeed any feasible point) must have z = 0. We need to retain the artificial variables z in Phase II, however, since some components of z may still be present in the optimal basis from Phase I that we are using as the initial basis for (13.42), though of course the values z_j of these components must be zero. In fact, we can modify (13.42) to include only those components of z that are present in the optimal basis for (13.40).
The problem (13.42) is not quite in standard form because of the two-sided bounds on z. However, it is easy to modify the simplex method described above to handle upper and lower bounds on the variables (we omit the details). We can customize the simplex algorithm slightly by deleting each component of z from the problem (13.42) as soon as it is swapped out of the basis. This strategy ensures that components of z do not repeatedly enter and leave the basis, thereby avoiding unnecessary simplex iterations.
If (x, z) is a basic solution of (13.42), it must have z = 0, and so x is a solution of (13.1). In fact, x is a basic feasible point for (13.1), though this claim is not completely obvious because the final basis B for the Phase-II problem may still contain components of z, making it unsuitable as an optimal basis for (13.1). Since A has full row rank, however, we can construct an optimal basis for (13.1) in a postprocessing phase: Extract from B any components of z that are present, and replace them with nonbasic components of x in a way that maintains nonsingularity of the submatrix B defined by (13.13).
A final point to note is that in many problems we do not need to add a complete set of m artificial variables to form the Phase-I problem. This observation is particularly relevant when slack and surplus variables have already been added to the problem formulation, as
in (13.2), to obtain a linear program with inequality constraints in standard form (13.1). Some of these slack/surplus variables can play the roles of artificial variables, making it unnecessary to include such variables explicitly.
We illustrate this point with the following example.
EXAMPLE 13.2
Consider the inequality-constrained linear program defined by

    min 3x_1 + x_2 + x_3   subject to
        2x_1 + x_2 + x_3 ≤ 2,
        x_1 - x_2 - x_3 ≤ -1,
        x ≥ 0.                                                   (13.43)
By adding slack variables to both inequality constraints, we obtain the following equivalent problem in standard form:

    min 3x_1 + x_2 + x_3   subject to
        2x_1 + x_2 + x_3 + x_4 = 2,
        x_1 - x_2 - x_3 + x_5 = -1,
        x ≥ 0.                                                   (13.44)
By inspection, it is easy to see that the vector x = (0, 0, 0, 2, 0)^T is feasible with respect to the first linear constraint and the lower bound x ≥ 0, though it does not satisfy the second constraint. Hence, in forming the Phase-I problem, we add just a single artificial variable z_2 to the second constraint and obtain

    min z_2   subject to
        2x_1 + x_2 + x_3 + x_4 = 2,
        x_1 - x_2 - x_3 + x_5 - z_2 = -1,
        (x, z_2) ≥ 0.                                            (13.45)
It is easy to see that the vector (x, z_2) = (0, 0, 0, 2, 0, 1)^T is feasible with respect to (13.45). In fact, it is a basic feasible point, since the corresponding basis matrix B is

    B = [ 1  0 ]                                                 (13.46)
        [ 0 -1 ],
which is clearly nonsingular. In this example, the variable x_4 plays the role of artificial variable for the first constraint. There was no need to add an explicit artificial variable z_1. □
DEGENERATE STEPS AND CYCLING
As noted above, the simplex method may encounter situations in which, for the entering index q, we cannot set x_q^+ any greater than zero in (13.22) without violating the nonnegativity condition x ≥ 0. By referring to Procedure 13.1, we see that these situations arise when there is i with (x_B)_i = 0 and d_i > 0, where d is defined by (13.25). Steps of this type are called degenerate steps. On such steps, the components of x do not change and, therefore, the objective function c^T x does not decrease. However, the steps may still be useful because they change the basis B by replacing one index, and the updated B may be closer to the optimal basis. In other words, the degenerate step may be laying the groundwork for reductions in c^T x on later steps.
Sometimes, however, a phenomenon known as cycling can occur. After a number of successive degenerate steps, we may return to an earlier basis B. If we continue to apply the algorithm from this point using the same rules for selecting entering and leaving indices, we will repeat the same cycle ad infinitum, never converging.
Cycling was once thought to be a rare phenomenon, but in recent times it has been observed frequently in the large linear programs that arise as relaxations of integer programming problems. Since integer programs are an important source of linear programs, practical simplex codes usually incorporate a cycling avoidance strategy.
In the remainder of this section, we describe a perturbation strategy and its close relative, the lexicographic strategy.
Suppose that a degenerate basis is encountered at some simplex iteration, at which the basis is B and the basis matrix is B̄, say. We consider a modified linear program in which we add a small perturbation to the right-hand side of the constraints in (13.1), as follows:

    b → b + B̄ (ε, ε^2, …, ε^m)^T,

where ε is a very small positive number. This perturbation in b induces a perturbation in the components of the basic solution vector; we have

    x_B → x_B + (ε, ε^2, …, ε^m)^T.                              (13.47)
Retaining the perturbation for subsequent iterations, we see that subsequent basic solutions have the form

    x_B → x_B + B^{-1} B̄ (ε, ε^2, …, ε^m)^T = x_B + Σ_{k=1}^{m} (B^{-1} B̄)_k ε^k,    (13.48)

where (B^{-1} B̄)_k denotes the kth column of B^{-1} B̄ and x_B represents the basic solution for the unperturbed right-hand side b.
From (13.47), we have that for all ε sufficiently small but positive, (x_B(ε))_i > 0 for all i. Hence, the basis is nondegenerate for the perturbed problem, and we can perform a step of the simplex method that produces a nonzero (but tiny) decrease in the objective.
Indeed, if we retain the perturbation over all subsequent iterations, and provided that the initial choice of ε was small enough, we claim that all subsequent bases visited by the algorithm are nondegenerate. We prove this claim by contradiction, by assuming that there is some basis matrix B such that (x_B(ε))_i = 0 for some i and all ε sufficiently small. From (13.48), we see that this can happen only when (x_B)_i = 0 and (B^{-1} B̄)_ik = 0 for k = 1, 2, …, m. The latter relation implies that the ith row of B^{-1} B̄ is zero, which cannot occur, because both B and B̄ are nonsingular.
We conclude that, provided the initial choice of ε is sufficiently small to ensure nondegeneracy of all subsequent bases, no basis is visited more than once by the simplex method, and therefore, by the same logic as in the proof of Theorem 13.4, the method terminates finitely at a solution of the perturbed problem. The perturbation can be removed in a postprocessing phase, by resetting x_B = B^{-1} b for the final basis B and the original right-hand side b.
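A tiny numerical illustration of (13.47), with invented data and B̄ taken equal to the current basis matrix B: the perturbed right-hand side separates a zero basic component away from zero.

    import numpy as np

    B = np.array([[1.0, 1.0], [0.0, 1.0]])       # current basis matrix
    b = np.array([2.0, 2.0])
    print(np.linalg.solve(B, b))                  # x_B = (0, 2): degenerate

    eps = 1e-6
    pert = np.array([eps, eps**2])                # (eps, eps^2, ..., eps^m)
    b_pert = b + B @ pert                         # perturbed right-hand side
    print(np.linalg.solve(B, b_pert))             # (eps, 2 + eps^2): both > 0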
The question remains of how to choose ε small enough at the point at which the original degenerate basis B̄ is encountered. The lexicographic strategy finesses this issue by not making an explicit choice of ε, but rather keeping track of the dependence of each basic variable on each power of ε. When it comes to selecting the leaving variable, it chooses the index p that minimizes (x_B(ε))_i / d_i over all variables in the basis, for all ε sufficiently small. The choice of p is uniquely defined by this procedure, as we can show by an argument similar to the one above concerning nondegeneracy of each basis. We can extend the pivot procedure slightly to update the dependence of each basic variable on the powers of ε at each iteration, including the variable x_q that has just entered the basis.
13.6 THE DUAL SIMPLEX METHOD
Here we describe another variant of the simplex method that is useful in a variety of situations and is often faster on many practical problems than the variant described above. This dual
simplex method uses many of the same concepts and methodology described above, such as the splitting of the matrix A into column submatrices B and N, and the generation of iterates (x, λ, s) that satisfy the complementarity condition x^T s = 0. The method of Section 13.3 starts with a feasible x, with x_B ≥ 0 and x_N = 0, and a corresponding dual iterate (λ, s) for which s_B = 0 but s_N is not necessarily nonnegative. After making systematic column interchanges between B and N, it finally reaches a feasible dual point (λ, s) at which s_N ≥ 0, thus yielding a solution of both the primal problem (13.1) and the dual (13.8). By contrast, the dual simplex method starts with a point (λ, s) feasible for (13.8), at which s_N ≥ 0 and s_B = 0, and a corresponding primal point x satisfying Ax = b, for which x_N = 0 but x_B is not necessarily nonnegative. By making systematic column interchanges between B and N, it finally reaches a feasible primal point x for which x_B ≥ 0, signifying optimality. Note that although the matrix B used in this algorithm is a nonsingular column submatrix of A, it is no longer correct to refer to it as a basis matrix, since it does not satisfy the feasibility condition x_B = B^{-1} b ≥ 0.
We now describe a single step of this method in a similar fashion to Section 13.3, though the details are a little more complicated here. As mentioned above, we commence each step with submatrices B and N of A, and corresponding sets B and N. The primal and dual variables corresponding to these sets are defined as follows (cf. (13.18), (13.20), and (13.21)):
    x_B = B^{-1} b,   x_N = 0,                                   (13.49a)
    λ = B^{-T} c_B,                                              (13.49b)
    s_B = c_B - B^T λ = 0,   s_N = c_N - N^T λ ≥ 0.              (13.49c)
If x_B ≥ 0, the current point (x, λ, s) satisfies the optimality conditions (13.4), and we are done. Otherwise, we select a leaving index q ∈ B such that x_q < 0. Our aim is to move x_q to zero (thereby ensuring that nonnegativity holds for this component), while allowing s_q to increase away from zero. We will also identify an entering index r ∈ N, such that s_r becomes zero on this step while x_r increases away from zero. Hence, the index q will move from B to N, while r will move from N to B. How do we choose r, and how are x, λ, and s changed on this step? The description below provides the answer. We use (x^+, λ^+, s^+) to denote the updated values of our variables, after this step is taken.
First, let e_q be the vector of length m that contains all zeros except for a 1 in the position occupied by index q in the set B. Since we increase s_q away from zero while fixing the remaining components of s_B at zero, the updated value s_B^+ will have the form

    s_B^+ = s_B + α e_q = α e_q                                  (13.50)

for some positive scalar α to be determined. We write the corresponding update to λ as

    λ^+ = λ + α v,                                               (13.51)
for some vector v. In fact, since s_B^+ and λ^+ must satisfy the first equation in (13.49c), we must have

    s_B^+ = c_B - B^T λ^+
    ⇒  s_B + α e_q = c_B - B^T λ - α B^T v
    ⇒  e_q = -B^T v,                                             (13.52)

which is a system of equations that we can solve to obtain v.
To see how the dual objective value b^T λ changes as a result of this step, we use (13.52) and the fact that x_q = x_B^T e_q to obtain

    b^T λ^+ = b^T λ + α b^T v
            = b^T λ - α b^T B^{-T} e_q      (from (13.52))
            = b^T λ - α x_B^T e_q           (from (13.49a))
            = b^T λ - α x_q                 (by definition of e_q).
Since x_q < 0 and since our aim is to maximize the dual objective, we would like to choose α as large as possible. The upper bound on α is provided by the constraint s_N ≥ 0. Similarly to (13.49c), we have

    s_N^+ = c_N - N^T λ^+ = s_N - α N^T v = s_N + α w,

where we have defined

    w = -N^T v = N^T B^{-T} e_q.

The largest α for which s_N^+ ≥ 0 is given by the formula

    α = min_{j∈N, w_j<0} (-s_j / w_j).
We define the entering index r to be the index at which the minimum in this expression is achieved. Note that

    s_r^+ = 0  and  w_r = -A_r^T v < 0,                          (13.53)

where, as usual, A_r denotes the rth column of A.
Having now identified how λ and s are updated on this step, we need to figure out how x changes. For the leaving index q, we need to set x_q^+ = 0, while for the entering index r we can allow x_r^+ to be nonzero. We denote the direction of change for x_B by the vector d, defined by the following linear system:

    B d = Σ_{i∈B} A_i d_i = A_r.                                 (13.54)

Since, from (13.49a), we have

    Σ_{i∈B} A_i x_i = b,

it follows that

    Σ_{i∈B} A_i (x_i - γ d_i) + γ A_r = b,

for any scalar γ. To ensure that x_q^+ = 0, we set

    γ = x_q / d_q,                                               (13.55)

which is well defined only if d_q is nonzero. In fact, we have that d_q < 0, since

    d_q = d^T e_q = A_r^T B^{-T} e_q = -A_r^T v = w_r < 0,       (13.56)

where we have used the definition of e_q along with (13.54), (13.52), and (13.53) to derive these relationships. Since x_q < 0, it follows from (13.56) that γ > 0. Following (13.55), we can define the updated vector x^+ as follows:

    x_i^+ = x_i - γ d_i,   for i ∈ B with i ≠ q,
    x_i^+ = 0,             for i = q,
    x_i^+ = 0,             for i ∈ N with i ≠ r,
    x_i^+ = γ,             for i = r.
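The derivation above translates directly into a dense sketch of one dual simplex step (Python/NumPy; a dual-feasible basis is assumed given, and ties and degeneracy are not handled):

    import numpy as np

    def dual_simplex_step(A, b, c, basis):
        """One dual simplex step per (13.49)-(13.56); updates basis in place."""
        nonbasis = [j for j in range(A.shape[1]) if j not in basis]
        B, N = A[:, basis], A[:, nonbasis]
        x_B = np.linalg.solve(B, b)                    # (13.49a)
        lam = np.linalg.solve(B.T, c[basis])           # (13.49b)
        s_N = c[nonbasis] - N.T @ lam                  # (13.49c); >= 0 assumed
        if np.all(x_B >= 0):
            return basis, True                         # already optimal
        i_q = int(np.argmin(x_B))                      # leaving: some x_q < 0
        e_q = np.eye(len(basis))[:, i_q]
        v = np.linalg.solve(B.T, -e_q)                 # (13.52): B^T v = -e_q
        w = -(N.T @ v)                                 # w = N^T B^{-T} e_q
        mask = w < -1e-12
        if not mask.any():
            raise ValueError("dual unbounded: primal is infeasible")
        steps = np.full_like(s_N, np.inf)              # ratio test for alpha
        steps[mask] = -s_N[mask] / w[mask]
        i_r = int(np.argmin(steps))                    # entering index r
        basis[i_q] = nonbasis[i_r]                     # swap q out, r in
        return basis, False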
13.7 PRESOLVING

Presolving (also known as preprocessing) is carried out in practical linear programming codes to reduce the size of the user-defined linear programming problem before passing it to the solver. A variety of techniques, some obvious and some ingenious, are used to eliminate certain variables, constraints, and bounds from the problem. Often the reduction in problem size is quite dramatic, and the linear programming algorithm takes much less time when applied to the presolved problem than when applied to the original problem. Presolving is
beneficial regardless of what algorithm is used to solve the linear program; it is used both in simplex and interiorpoint codes. Infeasibility may also be detected by the presolver, eliminating the need to call the linear programming algorithm at all.
We mention just a few of the more straightforward preprocessing techniques here, referring the interested reader to Andersen and Andersen [4] for a more comprehensive list. For the purpose of this discussion, we assume that the linear program is formulated with both lower and upper bounds on x, that is,
    min c^T x,  subject to  Ax = b,  l ≤ x ≤ u,                  (13.57)

where some components l_i of the lower-bound vector may be -∞ and some upper bounds u_i may be +∞.
Consider first a row singleton, which happens when one of the equality constraints involves just one of the variables. Specifically, if constraint k involves only variable j (that is, A_kj ≠ 0, but A_ki = 0 for all i ≠ j), we can immediately set x_j = b_k / A_kj and eliminate x_j from the problem. Note that if this value of x_j violates its bounds (that is, x_j < l_j or x_j > u_j), we can declare the problem to be infeasible, and terminate.
Another obvious technique is the free column singleton, in which there is a variable x_j that occurs in only one of the equality constraints, and is free (that is, its lower bound is -∞ and its upper bound is +∞). In this case, we have for some k that A_kj ≠ 0 while A_lj = 0 for all l ≠ k. Here we can simply use constraint k to eliminate x_j from the problem, setting

    x_j = ( b_k - Σ_{p≠j} A_kp x_p ) / A_kj.
Once the values of x_p for p ≠ j have been obtained by solving a reduced linear program, we can substitute into this formula to recover x_j prior to returning the result to the user. This substitution does not require us to modify any other constraints, but it will change the cost vector c in general, whenever c_j ≠ 0. We will need to make the replacement

    c_p ← c_p - c_j A_kp / A_kj,  for all p ≠ j.

In this case, we can also determine the dual variable λ_k associated with constraint k. Since x_j is
a free variable, there is no dual slack associated with it, so the jth dual constraint becomes

    Σ_{l=1}^{m} A_lj λ_l = c_j  ⇒  A_kj λ_k = c_j,

from which we deduce that λ_k = c_j / A_kj.
Perhaps the simplest preprocessing check is for the presence of zero rows and columns in A. If A_ki = 0 for all i = 1, 2, …, n, then provided that the right-hand side is also zero (b_k = 0), we can simply delete this row from the problem and set the corresponding Lagrange multiplier λ_k to an arbitrary value. For a zero column (say, A_kj = 0 for all k = 1, 2, …, m), we can determine the optimal value of x_j by inspecting its cost coefficient c_j and its bounds l_j and u_j. If c_j < 0, we set x_j = u_j to minimize the product c_j x_j. (We are free to do so because x_j is not restricted by any of the equality constraints.) If c_j < 0 and u_j = +∞, then the problem is unbounded. Similarly, if c_j > 0, we set x_j = l_j, or else declare unboundedness if l_j = -∞.
A somewhat more subtle presolving technique is to check for forcing or dominated constraints. Rather than give a general specification, we illustrate this case with a simple example. Suppose that one of the equality constraints is as follows:

    5x_1 - x_4 + 2x_5 = 10,

where the variables in question have the following bounds:

    0 ≤ x_1 ≤ 1,   -1 ≤ x_4 ≤ 5,   0 ≤ x_5 ≤ 2.

It is not hard to see that the equality constraint can be satisfied only if x_1 and x_5 are at their upper bounds and x_4 is at its lower bound. Any other feasible values of these variables would result in the left-hand side of the equality constraint being strictly less than 10. Hence, we can set x_1 = 1, x_4 = -1, x_5 = 2, and eliminate these variables, and the equality constraint, from the problem.
We use a similar example to illustrate dominated constraints. Suppose that we have the following constraint involving three variables:

    2x_2 + x_6 - 3x_7 = 8,

where the variables in question have the following bounds:

    -10 ≤ x_2 ≤ 10,   0 ≤ x_6 ≤ 1,   0 ≤ x_7 ≤ 2.

By rearranging the constraint and using the bounds on x_6 and x_7, we find that

    x_2 = 4 - (1/2)x_6 + (3/2)x_7 ≤ 4 - 0 + (3/2)(2) = 7,

and similarly, using the opposite bounds on x_6 and x_7, we obtain x_2 ≥ 7/2. We conclude that the stated bounds of -10 and 10 on x_2 are redundant, since x_2 is implicitly confined to an even smaller interval by the combination of the equality constraint and the bounds on x_6 and x_7. Hence, we can drop the bounds on x_2 from the formulation and treat it as a free variable.
Presolving techniques are applied recursively, because the elimination of certain variables or constraints may create situations that allow further eliminations. As a trivial example,
suppose that the following two equality constraints are present in the problem:

    3x_2 = 6,   x_2 + 4x_5 = 10.

The first of these constraints is a row singleton, which we can use to set x_2 = 2 and eliminate this variable and constraint. After substitution, the second constraint becomes 4x_5 = 10 - x_2 = 8, which is again a row singleton. We can therefore set x_5 = 2 and eliminate this variable and constraint as well.
Relatively little information about presolving techniques has appeared in the literature, in part because they have commercial value as an important component of linear programming software.
13.8 WHERE DOES THE SIMPLEX METHOD FIT?
In linear programming, as in all optimization problems in which inequality constraints are present, the fundamental task of the algorithm is to determine which of these constraints are active at the solution (see Definition 12.1) and which are inactive. The simplex method belongs to a general class of algorithms for constrained optimization known as active set methods, which explicitly maintain estimates of the active and inactive index sets that are updated at each step of the algorithm. At each iteration, the basis B is our current estimate of the inactive set, that is, the set of indices i for which we suspect that x_i > 0 at the solution of the linear program. Like most active set methods, the simplex method makes only modest changes to these index sets at each step; a single index is exchanged between B and N.
Active set algorithms for quadratic programming, bound-constrained optimization, and nonlinear programming use the same basic strategy as simplex of making an explicit estimate of the active set and taking a step toward the solution of a reduced problem in which the constraints in this estimated active set are satisfied as equalities. When nonlinearity enters the problem, many of the features that make the simplex method so effective no longer apply. For example, it is no longer true in general that at least n - m of the bounds x ≥ 0 are active at the solution, and the specialized linear algebra techniques described in Section 13.4 no longer apply. Nevertheless, the simplex method is rightly viewed as the antecedent of the active set class of methods for constrained optimization.
One undesirable feature of the simplex method attracted attention from its earliest days. Though highly efficient on almost all practical problems (the method generally requires at most 2m to 3m iterations, where m is the row dimension of the constraint matrix in (13.1)), there are pathological problems on which the algorithm performs very poorly. Klee and Minty [182] presented an n-dimensional problem whose feasible polytope has 2^n vertices, for which the simplex method visits every single vertex before reaching the optimal point! This example verified that the complexity of the simplex method is exponential;
roughly speaking, its running time may be an exponential function of the dimension of the problem. For many years, theoreticians searched for a linear programming algorithm that has polynomial complexity, that is, an algorithm in which the running time is bounded by a polynomial function of the amount of storage required to define the problem. In the late 1970s, Khachiyan [180] described an ellipsoid method that indeed has polynomial complexity but turned out to be impractical. In the mid-1980s, Karmarkar [175] described a polynomial algorithm that approaches the solution through the interior of the feasible polytope rather than working its way around the boundary as the simplex method does. Karmarkar's announcement marked the start of intense research in the field of interior-point methods, which are the subject of the next chapter.
NOTES AND REFERENCES
The standard reference for the simplex method is Dantzig's book [86]. Later excellent texts include Chvátal [61] and Vanderbei [293].
Further information on steepest-edge pivoting can be found in Goldfarb and Reid [134] and Goldfarb and Forrest [133].
An alternative procedure for performing the Phase-I calculation of an initial basis was described by Wolfe [310]. This technique does not require artificial variables to be introduced in the problem formulation, but rather starts at any point x that satisfies Ax = b with at most m nonzero components in x. (Note that we do not require the basic part x_B to consist of all positive components.) Phase I then consists in solving the problem

    min_x Σ_{x_i<0} -x_i   subject to  Ax = b,

and terminating when an objective value of 0 is attained. This problem is not a linear program (its objective is only piecewise linear), but it can be solved by the simplex method nonetheless. The key is to redefine the cost vector f at each iteration x such that f_i = -1 for x_i < 0 and f_i = 0 otherwise.
EXERCISES
13.1 Convert the following linear program to standard form:

    max_{x,y} c^T x + d^T y   subject to  A_1 x = b_1,  A_2 x + B_2 y ≤ b_2,  l ≤ y ≤ u,

where there are no explicit bounds on x.
13.2 Verify that the dual of 13.8 is the original primal problem 13.1.
13.3 Complete the proof of Theorem 13.1 by showing that if the dual 13.7 is unbounded above, the primal 13.1 must be infeasible.
13.4 Theorem 13.1 does not exclude the possibility that both primal and dual are infeasible. Give a simple linear program for which such is the case.
13.5 Show that the dual of the linear program

    min c^T x  subject to  Ax ≥ b,  x ≥ 0,

is

    max b^T λ  subject to  A^T λ ≤ c,  λ ≥ 0.
13.6 Show that when m ≤ n and the rows of A are linearly dependent in (13.1), then the matrix B in (13.13) is singular, and therefore there are no basic feasible points.
13.7 Consider the overdetermined linear system Ax = b with m rows and n columns (m > n). When we apply Gaussian elimination with complete pivoting to A, we obtain

    P A Q = L [ U_11  U_12 ]
              [  0     0   ],

where P and Q are permutation matrices, L is m × m lower triangular, U_11 is m̄ × m̄ upper triangular and nonsingular, U_12 is m̄ × (n - m̄), and m̄ ≤ n is the rank of A.

(a) Show that the system Ax = b is feasible if the last m - m̄ components of L^{-1} P b are zero, and infeasible otherwise.

(b) When m̄ = n, find the unique solution of Ax = b.

(c) Show that the reduced system formed from the first m̄ rows of P A and the first m̄ components of P b is equivalent to Ax = b (i.e., a solution of one system also solves the other).
13.8 Verify formula 13.37.
13.9 Consider the following linear program:

    min -5x_1 - x_2   subject to
        x_1 + x_2 ≤ 5,
        2x_1 + (1/2)x_2 ≤ 8,
        x ≥ 0.
(a) Add slack variables x_3 and x_4 to convert this problem to standard form.

(b) Using Procedure 13.1, solve this problem using the simplex method, showing at each step the basis and the vectors λ, s_N, and x_B, and the value of the objective function. (The initial choice of B for which x_B ≥ 0 should be obvious once you have added the slacks in part (a).)

13.10 Calculate the values of l52, l53, l54, and ŵ2 in (13.30), by equating the last row of L_1 U_1 to the last row of the matrix in (13.29).

13.11 By extending the procedure (13.27) appropriately, show how the factorization (13.31) can be used to solve linear systems with coefficient matrix B^+ efficiently.
CHAPTER 14

Linear Programming: Interior-Point Methods
In the 1980s it was discovered that many large linear programs could be solved efficiently by using formulations and algorithms from nonlinear programming and nonlinear equations. One characteristic of these methods was that they required all iterates to satisfy the inequality constraints in the problem strictly, so they became known as interior-point methods. By the early 1990s, a subclass of interior-point methods known as primal-dual methods had distinguished themselves as the most efficient practical approaches, and proved to be strong competitors to the simplex method on large problems. These methods are the focus of this chapter.
Interior-point methods arose from the search for algorithms with better theoretical properties than the simplex method. As we mentioned in Chapter 13, the simplex method can be inefficient on certain pathological problems. Roughly speaking, the time required to solve a linear program may be exponential in the size of the problem, as measured by the number of unknowns and the amount of storage needed for the problem data. For almost all practical problems, the simplex method is much more efficient than this bound would suggest, but its poor worst-case complexity motivated the development of new algorithms with better guaranteed performance. The first such method was the ellipsoid method, proposed by Khachiyan [180], which finds a solution in time that is at worst polynomial in the problem size. Unfortunately, this method approaches its worst-case bound on all problems and is not competitive with the simplex method in practice.
Karmarkar's projective algorithm [175], announced in 1984, also has the polynomial complexity property, but it came with the added attraction of good practical behavior. The initial claims of excellent performance on large linear programs were never fully borne out, but the announcement prompted a great deal of research activity which gave rise to many new methods. All are related to Karmarkar's original algorithm, and to the log-barrier approach described in Chapter 19, but many of the approaches can be motivated and analyzed independently of the earlier methods.
Interior-point methods share common features that distinguish them from the simplex method. Each interior-point iteration is expensive to compute and can make significant progress towards the solution, while the simplex method usually requires a larger number of inexpensive iterations. Geometrically speaking, the simplex method works its way around the boundary of the feasible polytope, testing a sequence of vertices in turn until it finds the optimal one. Interior-point methods approach the boundary of the feasible set only in the limit. They may approach the solution either from the interior or the exterior of the feasible region, but they never actually lie on the boundary of this region.
In this chapter, we outline some of the basic ideas behind primal-dual interior-point methods, including the relationship to Newton's method and homotopy methods and the concept of the central path. We sketch the important methods in this class, and give a comprehensive convergence analysis of a particular interior-point method known as a long-step path-following method. We describe in some detail a practical predictor-corrector algorithm proposed by Mehrotra, which is the basis of much of the current generation of software.
14.1 PRIMAL-DUAL METHODS
OUTLINE
We consider the linear programming problem in standard form; that is,
\[
\min\; c^T x, \quad \text{subject to } Ax = b, \; x \ge 0, \qquad (14.1)
\]
where c and x are vectors in IR^n, b is a vector in IR^m, and A is an m × n matrix with full row rank. As in Chapter 13, we can preprocess the problem to remove dependent rows from A if necessary. The dual problem for (14.1) is
\[
\max\; b^T\lambda, \quad \text{subject to } A^T\lambda + s = c, \; s \ge 0, \qquad (14.2)
\]
where λ is a vector in IR^m and s is a vector in IR^n. As shown in Chapter 13, solutions of (14.1), (14.2) are characterized by the Karush-Kuhn-Tucker conditions (13.4), which we restate here as follows:
\[
A^T\lambda + s = c, \qquad (14.3a)
\]
\[
Ax = b, \qquad (14.3b)
\]
\[
x_i s_i = 0, \quad i = 1, 2, \dots, n, \qquad (14.3c)
\]
\[
(x, s) \ge 0. \qquad (14.3d)
\]
Primal-dual methods find solutions (x*, λ*, s*) of this system by applying variants of Newton's method to the three equalities in (14.3) and modifying the search directions and step lengths so that the inequalities (x, s) ≥ 0 are satisfied strictly at every iteration. The equations (14.3a), (14.3b), (14.3c) are linear or only mildly nonlinear and so are not difficult to solve by themselves. However, the problem becomes much more difficult when we add the nonnegativity requirement (14.3d), which gives rise to all the complications in the design and analysis of interior-point methods.
To derive primal-dual interior-point methods we restate the optimality conditions (14.3) in a slightly different form by means of a mapping F from IR^{2n+m} to IR^{2n+m}:
\[
F(x, \lambda, s) =
\begin{bmatrix} A^T\lambda + s - c \\ Ax - b \\ XSe \end{bmatrix} = 0, \qquad (14.4a)
\]
\[
(x, s) \ge 0, \qquad (14.4b)
\]
where
\[
X = \operatorname{diag}(x_1, x_2, \dots, x_n), \qquad S = \operatorname{diag}(s_1, s_2, \dots, s_n), \qquad (14.5)
\]
and e = (1, 1, …, 1)^T. Primal-dual methods generate iterates (x^k, λ^k, s^k) that satisfy the bounds (14.4b) strictly, that is, x^k > 0 and s^k > 0. This property is the origin of the term interior-point. By respecting these bounds, the methods avoid spurious solutions, that is, points that satisfy F(x, λ, s) = 0 but not (x, s) ≥ 0. Spurious solutions abound and do not provide useful information about solutions of (14.1) or (14.2), so it makes sense to exclude them altogether from the region of search.
Like most iterative algorithms in optimization, primal-dual interior-point methods have two basic ingredients: a procedure for determining the step and a measure of the desirability of each point in the search space. An important component of the measure of desirability is the average value of the pairwise products x_i s_i, i = 1, 2, …, n, which are all positive when x > 0 and s > 0. This quantity is known as the duality measure and is defined as follows:
\[
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i s_i = \frac{x^T s}{n}. \qquad (14.6)
\]
The procedure for determining the search direction has its origins in Newton's method for the nonlinear equations (14.4a). Newton's method forms a linear model for F around the current point and obtains the search direction (Δx, Δλ, Δs) by solving the following system of linear equations:
\[
J(x, \lambda, s)
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
= -F(x, \lambda, s),
\]
where J is the Jacobian of F. (See Chapter 11 for a detailed discussion of Newton's method for nonlinear systems.) If we use the notation r_c and r_b for the first two block rows in F, that is,
\[
r_b = Ax - b, \qquad r_c = A^T\lambda + s - c, \qquad (14.7)
\]
we can write the Newton equations as follows:
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} -r_c \\ -r_b \\ -XSe \end{bmatrix}. \qquad (14.8)
\]
Usually, a full step along this direction would violate the bound (x, s) ≥ 0, so we perform a line search along the Newton direction and define the new iterate as
\[
(x, \lambda, s) + \alpha(\Delta x, \Delta\lambda, \Delta s),
\]
for some line search parameter α ∈ (0, 1]. We often can take only a small step along this direction (α ≪ 1) before violating the condition (x, s) > 0. Hence, the pure Newton direction (14.8), sometimes known as the affine scaling direction, often does not allow us to make much progress toward a solution.
Most primal-dual methods use a less aggressive Newton direction, one that does not aim directly for a solution of (14.3a), (14.3b), (14.3c), but rather for a point whose pairwise products x_i s_i are reduced to a lower average value, not all the way to zero. Specifically, we take a Newton step toward the point for which x_i s_i = σμ, where μ is the current duality measure and σ ∈ [0, 1] is the reduction factor that we wish to achieve in the duality measure on this step. The modified step equation is then
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} -r_c \\ -r_b \\ -XSe + \sigma\mu e \end{bmatrix}. \qquad (14.9)
\]
We call σ the centering parameter, for reasons to be discussed below. When σ > 0, it usually is possible to take a longer step along the direction defined by (14.16) before violating the bounds (x, s) ≥ 0.
At this point, we have specified most of the elements of a path-following primal-dual interior-point method. The general framework for such methods is as follows.
Framework 14.1 (Primal-Dual Path-Following).
Given (x^0, λ^0, s^0) with (x^0, s^0) > 0;
for k = 0, 1, 2, …
  Choose σ_k ∈ [0, 1] and solve
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S^k & 0 & X^k \end{bmatrix}
\begin{bmatrix} \Delta x^k \\ \Delta\lambda^k \\ \Delta s^k \end{bmatrix}
=
\begin{bmatrix} -r_c^k \\ -r_b^k \\ -X^k S^k e + \sigma_k \mu_k e \end{bmatrix}, \qquad (14.10)
\]
  where μ_k = (x^k)^T s^k / n;
  Set
\[
(x^{k+1}, \lambda^{k+1}, s^{k+1}) = (x^k, \lambda^k, s^k) + \alpha_k(\Delta x^k, \Delta\lambda^k, \Delta s^k), \qquad (14.11)
\]
  choosing α_k so that (x^{k+1}, s^{k+1}) > 0;
end (for).
The choices of centering parameter σ_k and step length α_k are crucial to the performance of the method. Techniques for controlling these parameters, directly and indirectly, give rise to a wide variety of methods with diverse properties.
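To make Framework 14.1 concrete, the following sketch (ours, not from the book; Python with NumPy, assuming small dense data) performs a single iteration: it assembles and solves the system (14.10), then backs off the Newton step to keep (x, s) strictly positive. The helper name `pd_step` and the damping factor 0.995 are our own illustrative choices.

```python
import numpy as np

def pd_step(A, b, c, x, lam, s, sigma):
    """One primal-dual path-following step (Framework 14.1), dense sketch."""
    m, n = A.shape
    mu = x.dot(s) / n                      # duality measure (14.6)
    r_c = A.T @ lam + s - c                # dual residual (14.7)
    r_b = A @ x - b                        # primal residual (14.7)
    # Assemble the (2n+m) x (2n+m) coefficient matrix of (14.10).
    K = np.zeros((2 * n + m, 2 * n + m))
    K[:n, n:n + m] = A.T
    K[:n, n + m:] = np.eye(n)
    K[n:n + m, :n] = A
    K[n + m:, :n] = np.diag(s)
    K[n + m:, n + m:] = np.diag(x)
    rhs = np.concatenate([-r_c, -r_b, -x * s + sigma * mu * np.ones(n)])
    d = np.linalg.solve(K, rhs)
    dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
    # Step length keeping (x, s) strictly positive (fraction to the boundary).
    alpha = 1.0
    for v, dv in ((x, dx), (s, ds)):
        neg = dv < 0
        if neg.any():
            alpha = min(alpha, 0.995 * np.min(-v[neg] / dv[neg]))
    return x + alpha * dx, lam + alpha * dlam, s + alpha * ds
```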
Although software for implementing interior-point methods does not usually start from a point (x^0, λ^0, s^0) that is feasible with respect to the linear equations (14.3a) and (14.3b), most of the historical development of theory and algorithms assumed that these conditions are satisfied. In the remainder of this section, we discuss this feasible case, showing that a comprehensive convergence analysis can be presented in just a few pages, using only basic mathematical tools and concepts. Analysis of the infeasible case follows the same principles, but is considerably more complicated in the details, so we do not present it here. In Section 14.2, however, we describe a complete practical algorithm that does not require starting from a feasible initial point.
To begin our discussion and analysis of feasible interior-point methods, we introduce the concept of the central path, and then describe neighborhoods of this path.
THE CENTRAL PATH
The primal-dual feasible set F and strictly feasible set F° are defined as follows:
\[
\mathcal{F} = \{(x, \lambda, s) \mid Ax = b, \; A^T\lambda + s = c, \; (x, s) \ge 0\}, \qquad (14.12a)
\]
\[
\mathcal{F}^{\circ} = \{(x, \lambda, s) \mid Ax = b, \; A^T\lambda + s = c, \; (x, s) > 0\}. \qquad (14.12b)
\]
The central path C is an arc of strictly feasible points that plays a vital role in primal-dual algorithms. It is parametrized by a scalar τ > 0, and each point (x_τ, λ_τ, s_τ) ∈ C satisfies the following equations:
\[
A^T\lambda + s = c, \qquad (14.13a)
\]
\[
Ax = b, \qquad (14.13b)
\]
\[
x_i s_i = \tau, \quad i = 1, 2, \dots, n, \qquad (14.13c)
\]
\[
(x, s) > 0. \qquad (14.13d)
\]
These conditions differ from the KKT conditions only in the term τ on the right-hand side of (14.13c). Instead of the complementarity condition (14.3c), we require that the pairwise products x_i s_i have the same (positive) value τ for all indices i. From (14.13), we can define the central path as
\[
\mathcal{C} = \{(x_\tau, \lambda_\tau, s_\tau) \mid \tau > 0\}.
\]
It can be shown that (x_τ, λ_τ, s_τ) is defined uniquely for each τ > 0 if and only if F° is nonempty.
The conditions (14.13) are also the optimality conditions for a logarithmic-barrier formulation of the problem (14.1). By introducing log-barrier terms for the nonnegativity constraints, with barrier parameter τ > 0, we obtain
\[
\min\; c^T x - \tau \sum_{i=1}^{n} \ln x_i, \quad \text{subject to } Ax = b. \qquad (14.14)
\]
The KKT conditions (12.34) for this problem, with Lagrange multiplier λ for the equality constraint, are as follows:
\[
c_i - \frac{\tau}{x_i} - (A^T\lambda)_i = 0, \quad i = 1, 2, \dots, n, \qquad Ax = b.
\]
Since the objective in (14.14) is strictly convex, these conditions are sufficient as well as necessary for optimality. We recover (14.13) by defining s_i = τ/x_i, i = 1, 2, …, n.
Another way of defining C is to use the mapping F defined in (14.4) and write
\[
F(x_\tau, \lambda_\tau, s_\tau) =
\begin{bmatrix} 0 \\ 0 \\ \tau e \end{bmatrix},
\qquad (x_\tau, s_\tau) > 0. \qquad (14.15)
\]
The equations (14.13) approximate (14.3) more and more closely as τ goes to zero. If C converges to anything as τ ↓ 0, it must converge to a primal-dual solution of the linear program. The central path thus guides us to a solution along a route that maintains positivity of the x and s components and decreases the pairwise products x_i s_i, i = 1, 2, …, n, to zero at the same rate.
Most primal-dual algorithms take Newton steps toward points on C for which τ > 0, rather than pure Newton steps for F. Since these steps are biased toward the interior of the nonnegative orthant defined by (x, s) ≥ 0, it usually is possible to take longer steps along them than along the pure Newton (affine-scaling) steps, before violating the positivity condition.
In the feasible case of (x, λ, s) ∈ F°, we have r_b = 0 and r_c = 0, so the search direction satisfies a special case of (14.9), that is,
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ -XSe + \sigma\mu e \end{bmatrix}, \qquad (14.16)
\]
where μ is the duality measure defined by (14.6) and σ ∈ [0, 1] is the centering parameter. When σ = 1, the equations (14.16) define a centering direction, a Newton step toward the point (x_μ, λ_μ, s_μ) ∈ C, at which all the pairwise products x_i s_i are identical to the current average value of μ. Centering directions are usually biased strongly toward the interior of the nonnegative orthant and make little, if any, progress in reducing the duality measure μ. However, by moving closer to C, they set the scene for a substantial reduction in μ on the next iteration. At the other extreme, the value σ = 0 gives the standard Newton (affine-scaling) step. Many algorithms use intermediate values of σ from the open interval (0, 1) to trade off between the twin goals of reducing μ and improving centrality.
CENTRAL PATH NEIGHBORHOODS AND PATH-FOLLOWING METHODS
Path-following algorithms explicitly restrict the iterates to a neighborhood of the central path C and follow C to a solution of the linear program. By preventing the iterates from coming too close to the boundary of the nonnegative orthant, they ensure that it is possible to take a nontrivial step along each search direction. Moreover, by forcing the duality measure μ_k to zero as k → ∞, we ensure that the iterates (x^k, λ^k, s^k) come closer and closer to satisfying the KKT conditions (14.3).
The two most interesting neighborhoods of C are
\[
\mathcal{N}_2(\theta) = \{(x, \lambda, s) \in \mathcal{F}^{\circ} \mid \|XSe - \mu e\|_2 \le \theta\mu\}, \qquad (14.17)
\]
for some θ ∈ [0, 1), and
\[
\mathcal{N}_{-\infty}(\gamma) = \{(x, \lambda, s) \in \mathcal{F}^{\circ} \mid x_i s_i \ge \gamma\mu, \; \text{all } i = 1, 2, \dots, n\}, \qquad (14.18)
\]
for some γ ∈ (0, 1]. Typical values of the parameters are θ = 0.5 and γ = 10^{-3}. If a point lies in N_{−∞}(γ), each pairwise product x_i s_i must be at least some small multiple γ of their average value μ. This requirement is actually quite modest, and we can make N_{−∞}(γ) encompass most of the feasible region F by choosing γ close to zero. The N_2(θ) neighborhood is more restrictive, since certain points in F° do not belong to N_2(θ) no matter how close θ is chosen to its upper bound of 1.

By keeping all iterates inside one or other of these neighborhoods, path-following methods reduce all the pairwise products x_i s_i to zero at more or less the same rate. Figure 14.1 shows the projection of the central path C onto the primal variables for a typical problem, along with a typical neighborhood N_{−∞}.
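The membership tests for the two neighborhoods are one-liners. The following sketch (ours, not from the book; Python with NumPy) checks (14.17) and (14.18), assuming feasibility with respect to Ax = b and A^Tλ + s = c has been verified separately:

```python
import numpy as np

def in_N2(x, s, theta):
    """Membership test for N_2(theta) in (14.17), given strict feasibility."""
    mu = x.dot(s) / x.size
    return np.linalg.norm(x * s - mu) <= theta * mu

def in_Ninf(x, s, gamma):
    """Membership test for N_-inf(gamma) in (14.18)."""
    mu = x.dot(s) / x.size
    return bool(np.all(x * s >= gamma * mu))
```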
Path-following methods are akin to homotopy methods for general nonlinear equations, which also define a path to be followed to the solution. Traditional homotopy methods stay in a tight tubular neighborhood of their path, making incremental changes to the parameter and chasing the homotopy path all the way to a solution. For primal-dual methods, this neighborhood is horn-shaped rather than tubular, and it tends to be broad and loose for larger values of the duality measure μ. It narrows as μ → 0, however, because of the positivity requirement (x, s) > 0.
The algorithm we specify below, a special case of Framework 14.1, is known as a long-step path-following algorithm. This algorithm can make rapid progress because of its use of the wide neighborhood N_{−∞}(γ), for γ close to zero. It depends on two parameters σ_min and σ_max, which are lower and upper bounds on the centering parameter σ_k. The search direction is, as usual, obtained by solving (14.10), and we choose the step length α_k to be as large as possible, subject to the requirement that we stay inside N_{−∞}(γ).
Figure 14.1 Central path C, projected into the space of primal variables x, showing a typical neighborhood N_{−∞}.
Here and in later analysis, we use the notation
\[
(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \overset{\text{def}}{=} (x^k, \lambda^k, s^k) + \alpha(\Delta x^k, \Delta\lambda^k, \Delta s^k), \qquad (14.19a)
\]
\[
\mu_k(\alpha) \overset{\text{def}}{=} x^k(\alpha)^T s^k(\alpha)/n. \qquad (14.19b)
\]

Algorithm 14.2 (Long-Step Path-Following).
Given γ, σ_min, σ_max with γ ∈ (0, 1), 0 < σ_min ≤ σ_max < 1, and (x^0, λ^0, s^0) ∈ N_{−∞}(γ);
for k = 0, 1, 2, …
  Choose σ_k ∈ [σ_min, σ_max];
  Solve (14.10) to obtain (Δx^k, Δλ^k, Δs^k);
  Choose α_k as the largest value of α in [0, 1] such that
\[
(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \in \mathcal{N}_{-\infty}(\gamma); \qquad (14.20)
\]
  Set (x^{k+1}, λ^{k+1}, s^{k+1}) = (x^k(α_k), λ^k(α_k), s^k(α_k));
end (for).
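Choosing α_k in (14.20) is itself a small computation (see also Exercise 14.6). A simple, robust option is bisection on the membership condition, as in the following sketch (ours; Python with NumPy). Bisection returns a point at which (14.20) holds; it may slightly underestimate the largest admissible α if the admissible set fails to be an interval, which is acceptable for illustration:

```python
import numpy as np

def longstep_alpha(x, s, dx, ds, gamma, tol=1e-12):
    """Approximate the largest alpha in [0,1] keeping (14.20) satisfied."""
    n = x.size

    def ok(alpha):
        xa, sa = x + alpha * dx, s + alpha * ds
        mu_a = xa.dot(sa) / n
        return np.all(xa > 0) and np.all(sa > 0) and np.all(xa * sa >= gamma * mu_a)

    if ok(1.0):
        return 1.0
    lo, hi = 0.0, 1.0      # ok(0) holds since the current point lies in N_-inf
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo
```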
Typical behavior of the algorithm is illustrated in Figure 14.2 for the case of n = 2. The horizontal and vertical axes in this figure represent the pairwise products x_1 s_1 and x_2 s_2, so the central path C is the line emanating from the origin at an angle of 45°. (A point at the origin of this illustration is a primal-dual solution if it also satisfies the feasibility conditions (14.3a), (14.3b), and (14.3d).) In the unusual geometry of Figure 14.2, the search directions (Δx^k, Δλ^k, Δs^k) transform to curves rather than straight lines.

Figure 14.2 Iterates of Algorithm 14.2, plotted in (x_1 s_1, x_2 s_2) space, showing the central path C and the boundary of the neighborhood N_{−∞}.
As Figure 14.2 shows (and the analysis confirms), the lower bound σ_min on the centering parameter ensures that each search direction starts out by moving away from the boundary of N_{−∞}(γ) and into the relative interior of this neighborhood. That is, small steps along the search direction improve the centrality. Larger values of α take us outside the neighborhood again, since the error in approximating the nonlinear system (14.15) by the linear step equations (14.16) becomes more pronounced as α increases. Still, we are guaranteed that a certain minimum step can be taken before we reach the boundary of N_{−∞}(γ), as we show in the analysis below.
The analysis of Algorithm 14.2 appears in the next few pages. With judicious choices of σ_k, this algorithm is fairly efficient in practice. With a few more modifications, it becomes the basis of a truly competitive method, as we discuss in Section 14.2.
Our aim in the analysis below is to show that given some small tolerance ε > 0, the algorithm requires O(n log(1/ε)) iterations to reduce the duality measure by a factor of ε, that is, to identify a point (x^k, λ^k, s^k) for which μ_k ≤ εμ_0. For small ε, the point (x^k, λ^k, s^k) satisfies the primal-dual optimality conditions except for perturbations of about ε in the right-hand side of (14.3c), so it is usually very close to a primal-dual solution of the original linear program. The O(n log(1/ε)) estimate is a worst-case bound on the number of iterations required; on practical problems, the number of iterations required appears to increase only slightly (if at all) as n increases. The simplex method may require 2^n iterations to solve a problem with n variables, though in practice it usually requires a modest multiple of m iterations, where m is the row dimension of the constraint matrix A in (14.1).
As is typical for interior-point methods, the analysis builds from a purely technical lemma to a powerful theorem in just a few pages. We start with the technical result (Lemma 14.1) and use it to derive a bound on the vector of pairwise products Δx_i Δs_i, i = 1, 2, …, n (Lemma 14.2). Theorem 14.3 finds a lower bound on the step length α_k and a corresponding estimate of the reduction in μ on iteration k. Finally, Theorem 14.4 proves that O(n log(1/ε)) iterations are required to identify a point for which μ_k ≤ εμ_0, for a given ε ∈ (0, 1).
Lemma 14.1.
Let u and v be any two vectors in IR^n with u^T v ≥ 0. Then
\[
\|UVe\| \le 2^{-3/2}\|u + v\|^2,
\]
where
\[
U = \operatorname{diag}(u_1, u_2, \dots, u_n), \qquad V = \operatorname{diag}(v_1, v_2, \dots, v_n).
\]

PROOF. (When the subscript is omitted from ‖·‖, we mean ‖·‖_2, as is our convention throughout the book.) First, note that for any two scalars α and β with αβ ≥ 0, we have from the algebraic-geometric mean inequality that
\[
|\alpha\beta|^{1/2} \le \tfrac{1}{2}|\alpha + \beta|. \qquad (14.21)
\]
Since u^T v ≥ 0, we have
\[
0 \le u^T v = \sum_{u_i v_i \ge 0} u_i v_i + \sum_{u_i v_i < 0} u_i v_i = \sum_{i \in P} |u_i v_i| - \sum_{i \in M} |u_i v_i|, \qquad (14.22)
\]
where we partitioned the index set {1, 2, …, n} as
\[
P = \{i \mid u_i v_i \ge 0\}, \qquad M = \{i \mid u_i v_i < 0\}.
\]
Now,
\[
\begin{aligned}
\|UVe\| &= \left( \big\|[u_i v_i]_{i \in P}\big\|^2 + \big\|[u_i v_i]_{i \in M}\big\|^2 \right)^{1/2} \\
&\le \left( \big\|[u_i v_i]_{i \in P}\big\|_1^2 + \big\|[u_i v_i]_{i \in M}\big\|_1^2 \right)^{1/2} && \text{since } \|\cdot\|_2 \le \|\cdot\|_1 \\
&\le \left( 2\big\|[u_i v_i]_{i \in P}\big\|_1^2 \right)^{1/2} && \text{from (14.22)} \\
&= 2^{1/2} \sum_{i \in P} |u_i v_i| \\
&\le 2^{1/2} \sum_{i \in P} \tfrac{1}{4}(u_i + v_i)^2 && \text{from (14.21)} \\
&\le 2^{-3/2} \sum_{i=1}^{n} (u_i + v_i)^2 = 2^{-3/2}\|u + v\|^2,
\end{aligned}
\]
completing the proof. □
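As a quick numerical sanity check of the lemma (ours, not from the book), one can test the bound on random vectors that satisfy the hypothesis u^T v ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.standard_normal(8), rng.standard_normal(8)
    if u.dot(v) >= 0:  # hypothesis of Lemma 14.1
        assert np.linalg.norm(u * v) <= 2**-1.5 * np.linalg.norm(u + v)**2 + 1e-12
```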
For the next result, we omit the iteration counter k from (14.10), and define the diagonal matrices ΔX and ΔS similarly to (14.5), as follows:
\[
\Delta X = \operatorname{diag}(\Delta x_1, \Delta x_2, \dots, \Delta x_n), \qquad
\Delta S = \operatorname{diag}(\Delta s_1, \Delta s_2, \dots, \Delta s_n).
\]
Lemma 14.2.
If (x, λ, s) ∈ N_{−∞}(γ), then
\[
\|\Delta X\,\Delta S\,e\| \le 2^{-3/2}\left(1 + \frac{1}{\gamma}\right) n\mu.
\]

PROOF. It is easy to show using (14.10) that
\[
\Delta x^T \Delta s = 0. \qquad (14.23)
\]
By multiplying the last block row in (14.10) by (XS)^{-1/2} and using the definition D = X^{1/2}S^{-1/2}, we obtain
\[
D^{-1}\Delta x + D\,\Delta s = (XS)^{-1/2}(-XSe + \sigma\mu e). \qquad (14.24)
\]
Because (D^{-1}\Delta x)^T(D\,\Delta s) = \Delta x^T\Delta s = 0, we can apply Lemma 14.1 with u = D^{-1}Δx and v = DΔs to obtain
\[
\begin{aligned}
\|\Delta X\,\Delta S\,e\| &= \|(D^{-1}\Delta X)(D\,\Delta S)e\| \\
&\le 2^{-3/2}\,\|D^{-1}\Delta x + D\,\Delta s\|^2 && \text{from Lemma 14.1} \\
&= 2^{-3/2}\,\|(XS)^{-1/2}(-XSe + \sigma\mu e)\|^2 && \text{from (14.24)}.
\end{aligned}
\]
Expanding the squared Euclidean norm and using such relationships as x^T s = nμ and e^T e = n, we obtain
\[
\begin{aligned}
\|\Delta X\,\Delta S\,e\| &\le 2^{-3/2}\left[ x^T s - 2\sigma\mu\, e^T e + \sigma^2\mu^2 \sum_{i=1}^{n}\frac{1}{x_i s_i} \right] \\
&\le 2^{-3/2}\left[ x^T s - 2\sigma\mu n + \sigma^2\mu^2\,\frac{n}{\gamma\mu} \right] && \text{since } x_i s_i \ge \gamma\mu \\
&= 2^{-3/2}\left[ 1 - 2\sigma + \frac{\sigma^2}{\gamma} \right] n\mu
\;\le\; 2^{-3/2}\left(1 + \frac{1}{\gamma}\right) n\mu,
\end{aligned}
\]
as claimed. □

Theorem 14.3.
Given the parameters γ, σ_min, and σ_max in Algorithm 14.2, there is a constant δ independent of n such that
\[
\mu_{k+1} \le \left(1 - \frac{\delta}{n}\right)\mu_k, \qquad (14.25)
\]
for all k ≥ 0.
PROOF. We start by proving that
\[
(x^k(\alpha), \lambda^k(\alpha), s^k(\alpha)) \in \mathcal{N}_{-\infty}(\gamma)
\quad \text{for all } \alpha \in \left[0, \; 2^{3/2}\,\frac{\gamma(1-\gamma)}{1+\gamma}\,\frac{\sigma_k}{n}\right], \qquad (14.26)
\]
where (x^k(α), λ^k(α), s^k(α)) is defined as in (14.19). It follows that the step length α_k is at least as long as the upper bound of this interval, that is,
\[
\alpha_k \ge \frac{2^{3/2}}{n}\,\frac{\gamma(1-\gamma)}{1+\gamma}\,\sigma_k. \qquad (14.27)
\]
For any i = 1, 2, …, n, we have from Lemma 14.2 that
\[
|\Delta x_i^k \Delta s_i^k| \le \|\Delta X^k \Delta S^k e\|_2 \le 2^{-3/2}\left(1 + \frac{1}{\gamma}\right) n\mu_k. \qquad (14.28)
\]
Using (14.10), we have from x_i^k s_i^k ≥ γμ_k and (14.28) that
\[
\begin{aligned}
x_i^k(\alpha)\, s_i^k(\alpha) &= (x_i^k + \alpha\Delta x_i^k)(s_i^k + \alpha\Delta s_i^k) \\
&= x_i^k s_i^k + \alpha(x_i^k \Delta s_i^k + s_i^k \Delta x_i^k) + \alpha^2 \Delta x_i^k \Delta s_i^k \\
&\ge x_i^k s_i^k(1 - \alpha) + \alpha\sigma_k\mu_k - \alpha^2\, 2^{-3/2}(1 + 1/\gamma)\, n\mu_k \\
&\ge \gamma(1 - \alpha)\mu_k + \alpha\sigma_k\mu_k - \alpha^2\, 2^{-3/2}(1 + 1/\gamma)\, n\mu_k.
\end{aligned}
\]
By summing the n components of the equation S^k Δx^k + X^k Δs^k = −X^k S^k e + σ_k μ_k e (the third block row from (14.10)), and using (14.23) and the definitions of μ_k and μ_k(α) (see (14.19)), we obtain
\[
\mu_k(\alpha) = (1 - \alpha(1 - \sigma_k))\,\mu_k.
\]
From these last two formulas, we can see that the proximity condition
\[
x_i^k(\alpha)\, s_i^k(\alpha) \ge \gamma\,\mu_k(\alpha)
\]
is satisfied, provided that
\[
\gamma(1 - \alpha)\mu_k + \alpha\sigma_k\mu_k - \alpha^2\, 2^{-3/2}(1 + 1/\gamma)\, n\mu_k \ge \gamma(1 - \alpha(1 - \sigma_k))\mu_k.
\]
Rearranging this expression, we obtain
\[
\alpha\,\sigma_k\mu_k(1 - \gamma) \ge \alpha^2\, 2^{-3/2}\, n\mu_k\left(1 + \frac{1}{\gamma}\right),
\]
which is true if
\[
\alpha \le 2^{3/2}\,\frac{\gamma(1-\gamma)}{1+\gamma}\,\frac{\sigma_k}{n}.
\]
We have proved that (x^k(α), λ^k(α), s^k(α)) satisfies the proximity condition for N_{−∞}(γ) when α lies in the range stated in (14.26). It is not difficult to show that (x^k(α), λ^k(α), s^k(α)) ∈ F° for all α in the given range. Hence, we have proved (14.26) and therefore (14.27).
We complete the proof of the theorem by estimating the reduction in μ on the kth step. Because of (14.23), (14.27), and the last block row of (14.16), we have
\[
\begin{aligned}
\mu_{k+1} &= x^k(\alpha_k)^T s^k(\alpha_k)/n \\
&= \left[ (x^k)^T s^k + \alpha_k\left((x^k)^T\Delta s^k + (s^k)^T\Delta x^k\right) + \alpha_k^2 (\Delta x^k)^T\Delta s^k \right]/n \\
&= \left[ (x^k)^T s^k - \alpha_k(1 - \sigma_k)\, n\mu_k \right]/n \\
&= (1 - \alpha_k(1 - \sigma_k))\,\mu_k \\
&\le \left( 1 - \frac{2^{3/2}}{n}\,\frac{\gamma(1-\gamma)}{1+\gamma}\,\sigma_k(1 - \sigma_k) \right)\mu_k. \qquad (14.29)
\end{aligned}
\]
Now, the function σ(1 − σ) is a concave quadratic function of σ, so on any given interval it attains its minimum value at one of the endpoints. Hence, we have
\[
\sigma_k(1 - \sigma_k) \ge \min\{\sigma_{\min}(1 - \sigma_{\min}),\; \sigma_{\max}(1 - \sigma_{\max})\}, \quad \text{for all } \sigma_k \in [\sigma_{\min}, \sigma_{\max}].
\]
The proof is completed by substituting this estimate into (14.29) and setting
\[
\delta = 2^{3/2}\,\frac{\gamma(1-\gamma)}{1+\gamma}\,\min\{\sigma_{\min}(1 - \sigma_{\min}),\; \sigma_{\max}(1 - \sigma_{\max})\}. \qquad \square
\]
We conclude with a result showing that a reduction by a factor of ε in the duality measure can be obtained in O(n log(1/ε)) iterations.
Theorem 14.4.
Given ε ∈ (0, 1) and γ ∈ (0, 1), suppose the starting point in Algorithm 14.2 satisfies (x^0, λ^0, s^0) ∈ N_{−∞}(γ). Then there is an index K with K = O(n log(1/ε)) such that
\[
\mu_k \le \epsilon\,\mu_0, \quad \text{for all } k \ge K.
\]
PROOF. By taking logarithms of both sides in (14.25), we obtain
\[
\log\mu_{k+1} \le \log\left(1 - \frac{\delta}{n}\right) + \log\mu_k.
\]
By applying this formula repeatedly, we have
\[
\log\mu_k \le k\log\left(1 - \frac{\delta}{n}\right) + \log\mu_0.
\]
The following well-known estimate for the log function,
\[
\log(1 + \beta) \le \beta, \quad \text{for all } \beta > -1,
\]
implies that
\[
\log\mu_k - \log\mu_0 \le k\left(-\frac{\delta}{n}\right).
\]
Therefore, the condition μ_k ≤ εμ_0 is satisfied if we have
\[
k\left(-\frac{\delta}{n}\right) \le \log\epsilon.
\]
This inequality holds for all k that satisfy
\[
k \ge K \overset{\text{def}}{=} \frac{n}{\delta}\log\frac{1}{\epsilon} = \frac{n}{\delta}\,|\log\epsilon|,
\]
so the proof is complete. □
14.2 PRACTICAL PRIMAL-DUAL ALGORITHMS
Practical implementations of interior-point algorithms follow the spirit of the previous section, in that strict positivity of x^k and s^k is maintained throughout and each step is a Newton-like step involving a centering component. However, most implementations work with an infeasible starting point and infeasible iterates. Several aspects of theoretical algorithms are typically ignored, while several enhancements are added that have a significant effect on practical performance. In this section, we describe the algorithmic enhancements that are found in a typical implementation of an infeasible-interior-point method, and present the resulting method as Algorithm 14.3. Many of the techniques of this section are described in the paper of Mehrotra [207], which can be consulted for further details.
CORRECTOR AND CENTERING STEPS
A key feature of practical algorithms is their use of corrector steps that compensate for the linearization error made by the Newton (affine-scaling) step in modeling the equation x_i s_i = 0, i = 1, 2, …, n (see (14.3c)). Consider the affine-scaling direction (Δx^aff, Δλ^aff, Δs^aff) defined by
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x^{\text{aff}} \\ \Delta\lambda^{\text{aff}} \\ \Delta s^{\text{aff}} \end{bmatrix}
=
\begin{bmatrix} -r_c \\ -r_b \\ -XSe \end{bmatrix}, \qquad (14.30)
\]
where r_b and r_c are defined in (14.7). If we take a full step in this direction, we obtain
\[
(x_i + \Delta x_i^{\text{aff}})(s_i + \Delta s_i^{\text{aff}})
= x_i s_i + x_i\Delta s_i^{\text{aff}} + s_i\Delta x_i^{\text{aff}} + \Delta x_i^{\text{aff}}\Delta s_i^{\text{aff}}
= \Delta x_i^{\text{aff}}\Delta s_i^{\text{aff}}.
\]
That is, the updated value of x_i s_i is Δx_i^aff Δs_i^aff rather than the ideal value 0. We can solve the following system to obtain a step (Δx^cor, Δλ^cor, Δs^cor) that attempts to correct for this deviation from the ideal:
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x^{\text{cor}} \\ \Delta\lambda^{\text{cor}} \\ \Delta s^{\text{cor}} \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ -\Delta X^{\text{aff}}\Delta S^{\text{aff}} e \end{bmatrix}. \qquad (14.31)
\]
In many cases, the combined step (Δx^aff, Δλ^aff, Δs^aff) + (Δx^cor, Δλ^cor, Δs^cor) does a better job of reducing the duality measure than does the affine-scaling step alone.
Like theoretical algorithms such as the one analyzed in Section 14.1, practical algorithms make use of centering steps, with an adaptive choice of the centering parameter σ_k. The affine-scaling step can be used as the basis of a successful heuristic for choosing σ_k. Roughly speaking, if the affine-scaling step (multiplied by a step length to maintain nonnegativity of x and s) reduces the duality measure significantly, there is not much need for centering, so a smaller value of σ_k is appropriate. Conversely, if not much progress can be made along this direction before reaching the boundary of the nonnegative orthant, a larger value of σ_k will ensure that the next iterate is more centered, so a longer step will be possible from this next point. Specifically, this scheme calculates the maximum allowable step lengths along the affine-scaling direction (14.30) as follows:
\[
\alpha_{\text{aff}}^{\text{pri}} \overset{\text{def}}{=} \min\left(1, \; \min_{i:\,\Delta x_i^{\text{aff}} < 0} -\frac{x_i}{\Delta x_i^{\text{aff}}}\right), \qquad (14.32a)
\]
\[
\alpha_{\text{aff}}^{\text{dual}} \overset{\text{def}}{=} \min\left(1, \; \min_{i:\,\Delta s_i^{\text{aff}} < 0} -\frac{s_i}{\Delta s_i^{\text{aff}}}\right), \qquad (14.32b)
\]
and then defines μ_aff to be the value of μ that would be obtained by using these step lengths, that is,
\[
\mu_{\text{aff}} = (x + \alpha_{\text{aff}}^{\text{pri}}\Delta x^{\text{aff}})^T (s + \alpha_{\text{aff}}^{\text{dual}}\Delta s^{\text{aff}})/n. \qquad (14.33)
\]
The centering parameter σ is chosen according to the following heuristic (which does not have a solid analytical justification, but appears to work well in practice):
\[
\sigma = \left(\frac{\mu_{\text{aff}}}{\mu}\right)^3. \qquad (14.34)
\]
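In code, the ratio tests (14.32) and the heuristic (14.33)-(14.34) take only a few lines. The sketch below (ours, not from the book; Python with NumPy) assumes the affine-scaling direction has already been computed from (14.30):

```python
import numpy as np

def max_alpha(v, dv):
    """Largest alpha in [0, 1] with v + alpha*dv >= 0, as in (14.32)."""
    neg = dv < 0
    return min(1.0, np.min(-v[neg] / dv[neg])) if neg.any() else 1.0

def mehrotra_sigma(x, s, dx_aff, ds_aff):
    """Centering parameter from (14.32)-(14.34)."""
    n = x.size
    mu = x.dot(s) / n
    a_pri = max_alpha(x, dx_aff)
    a_dual = max_alpha(s, ds_aff)
    mu_aff = (x + a_pri * dx_aff).dot(s + a_dual * ds_aff) / n   # (14.33)
    return (mu_aff / mu) ** 3                                    # (14.34)
```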
To summarize, computation of the search direction requires the solution of two linear systems. First, the system (14.30) is solved to obtain the affine-scaling direction, also known as the predictor step. This step is used to define the right-hand side for the corrector step (see (14.31)) and to calculate the centering parameter from (14.33), (14.34). Second, the search direction itself is calculated by solving
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} -r_c \\ -r_b \\ -XSe - \Delta X^{\text{aff}}\Delta S^{\text{aff}} e + \sigma\mu e \end{bmatrix}. \qquad (14.35)
\]
Note that the predictor, corrector, and centering contributions have been aggregated on the right-hand side of this system. The coefficient matrix in both linear systems (14.30) and (14.35) is the same. Thus, the factorization of the matrix needs to be computed only once, and the marginal cost of solving the second system is relatively small.
STEP LENGTHS
Practical implementations typically do not enforce membership of the central path neighborhoods N_2 and N_{−∞} defined in the previous section. Rather, they calculate the maximum step lengths that can be taken in the x and s variables separately without violating nonnegativity, then take a step length of slightly less than this maximum (but no greater than 1). Given an iterate (x^k, λ^k, s^k) with (x^k, s^k) > 0, and a step (Δx^k, Δλ^k, Δs^k), it is easy to show that the quantities α_{k,max}^{pri} and α_{k,max}^{dual} defined as follows:
\[
\alpha_{k,\max}^{\text{pri}} \overset{\text{def}}{=} \min_{i:\,\Delta x_i^k < 0} -\frac{x_i^k}{\Delta x_i^k}, \qquad
\alpha_{k,\max}^{\text{dual}} \overset{\text{def}}{=} \min_{i:\,\Delta s_i^k < 0} -\frac{s_i^k}{\Delta s_i^k}, \qquad (14.36)
\]
are the largest values of α for which x^k + αΔx^k ≥ 0 and s^k + αΔs^k ≥ 0, respectively. (Note that these formulae are similar to the ratio test used in the simplex method to determine the index that enters the basis.) Practical algorithms then choose the step lengths to lie in the open intervals defined by these maxima, that is,
\[
\alpha_k^{\text{pri}} \in (0, \alpha_{k,\max}^{\text{pri}}), \qquad \alpha_k^{\text{dual}} \in (0, \alpha_{k,\max}^{\text{dual}}),
\]
and then obtain a new iterate by setting
\[
x^{k+1} = x^k + \alpha_k^{\text{pri}}\Delta x^k, \qquad
(\lambda^{k+1}, s^{k+1}) = (\lambda^k, s^k) + \alpha_k^{\text{dual}}(\Delta\lambda^k, \Delta s^k).
\]
If the step (Δx^k, Δλ^k, Δs^k) rectifies the infeasibility in the KKT conditions (14.3a) and (14.3b), that is,
\[
A\,\Delta x^k = -r_b^k = -(Ax^k - b), \qquad
A^T\Delta\lambda^k + \Delta s^k = -r_c^k = -(A^T\lambda^k + s^k - c),
\]
it is easy to show that the infeasibilities at the new iterate satisfy
\[
r_b^{k+1} = (1 - \alpha_k^{\text{pri}})\, r_b^k, \qquad r_c^{k+1} = (1 - \alpha_k^{\text{dual}})\, r_c^k. \qquad (14.37)
\]
The following formula is used to calculate step lengths in many practical implementations:
\[
\alpha_k^{\text{pri}} = \min(1, \; \eta_k\,\alpha_{k,\max}^{\text{pri}}), \qquad
\alpha_k^{\text{dual}} = \min(1, \; \eta_k\,\alpha_{k,\max}^{\text{dual}}), \qquad (14.38)
\]
where η_k ∈ [0.9, 1.0) is chosen so that η_k → 1 as the iterates approach the primal-dual solution, to accelerate the asymptotic convergence.
STARTING POINT
Choice of starting point is an important practical issue with a significant effect on the robustness of the algorithm. A poor choice (x^0, λ^0, s^0) satisfying only the minimal conditions x^0 > 0 and s^0 > 0 often leads to failure of convergence. We describe here a heuristic that finds a starting point that satisfies the equality constraints in the primal and dual problems reasonably well, while maintaining positivity of the x and s components and avoiding excessively large values of these components.
First, we find a vector x̃ of minimum norm satisfying the primal constraint Ax = b, and a vector (λ̃, s̃) satisfying the dual constraint A^Tλ + s = c such that s̃ has minimum norm. That is, we solve the problems
\[
\min_x\; \tfrac{1}{2}x^T x \quad \text{subject to } Ax = b, \qquad (14.39a)
\]
\[
\min_{(\lambda, s)}\; \tfrac{1}{2}s^T s \quad \text{subject to } A^T\lambda + s = c. \qquad (14.39b)
\]
It is not difficult to show that x̃ and (λ̃, s̃) can be written explicitly as follows:
\[
\tilde{x} = A^T(AA^T)^{-1}b, \qquad \tilde{\lambda} = (AA^T)^{-1}Ac, \qquad \tilde{s} = c - A^T\tilde{\lambda}. \qquad (14.40)
\]
In general, x̃ and s̃ will have nonpositive components, so are not suitable for use as a starting point. We define
\[
\delta_x = \max\left(-\tfrac{3}{2}\min_i \tilde{x}_i, \; 0\right), \qquad
\delta_s = \max\left(-\tfrac{3}{2}\min_i \tilde{s}_i, \; 0\right),
\]
and adjust the x̃ and s̃ vectors as follows:
\[
\hat{x} = \tilde{x} + \delta_x e, \qquad \hat{s} = \tilde{s} + \delta_s e,
\]
where, as usual, e = (1, 1, …, 1)^T. Clearly, we have x̂ ≥ 0 and ŝ ≥ 0. To ensure that the components of x^0 and s^0 are not too close to zero and not too dissimilar, we add two more scalars defined as follows:
\[
\hat{\delta}_x = \frac{1}{2}\,\frac{\hat{x}^T\hat{s}}{e^T\hat{s}}, \qquad
\hat{\delta}_s = \frac{1}{2}\,\frac{\hat{x}^T\hat{s}}{e^T\hat{x}}.
\]
Note that δ̂_x is the average size of the components of x̂, weighted by the corresponding components of ŝ; similarly for δ̂_s. Finally, we define the starting point as follows:
\[
x^0 = \hat{x} + \hat{\delta}_x e, \qquad \lambda^0 = \tilde{\lambda}, \qquad s^0 = \hat{s} + \hat{\delta}_s e.
\]
The computational cost of finding (x^0, λ^0, s^0) by this scheme is about the same as one step of the primal-dual method.
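A direct transcription of this heuristic (ours, not from the book; Python with NumPy, using dense solves with AAᵀ where a real code would factor sparse matrices) might read:

```python
import numpy as np

def starting_point(A, b, c):
    """Heuristic starting point from (14.39)-(14.40) plus the shifts above."""
    AAT = A @ A.T
    x_t = A.T @ np.linalg.solve(AAT, b)        # min-norm solution of Ax = b
    lam = np.linalg.solve(AAT, A @ c)          # least-squares multipliers
    s_t = c - A.T @ lam
    dx = max(-1.5 * x_t.min(), 0.0)            # shift x to nonnegativity
    ds = max(-1.5 * s_t.min(), 0.0)            # shift s to nonnegativity
    x_h, s_h = x_t + dx, s_t + ds
    xs = x_h.dot(s_h)
    x0 = x_h + 0.5 * xs / s_h.sum()            # keep components away from zero
    s0 = s_h + 0.5 * xs / x_h.sum()
    return x0, lam, s0
```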
In some cases, we have prior knowledge about the solution, possibly in the form of a solution of a similar linear program. The use of such warm-start information in constructing a starting point is discussed in Section 14.4.
A PRACTICAL ALGORITHM
We now give a formal specification of a practical algorithm.
Algorithm 14.3 (Predictor-Corrector Algorithm (Mehrotra [207])).
Calculate (x^0, λ^0, s^0) as described above;
for k = 0, 1, 2, …
  Set (x, λ, s) = (x^k, λ^k, s^k) and solve (14.30) for (Δx^aff, Δλ^aff, Δs^aff);
  Calculate α_aff^{pri}, α_aff^{dual}, and μ_aff as in (14.32) and (14.33);
  Set centering parameter to σ = (μ_aff/μ)^3;
  Solve (14.35) for (Δx, Δλ, Δs);
  Calculate α_k^{pri} and α_k^{dual} from (14.38);
  Set
    x^{k+1} = x^k + α_k^{pri} Δx,
    (λ^{k+1}, s^{k+1}) = (λ^k, s^k) + α_k^{dual} (Δλ, Δs);
end (for).
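For illustration, here is a compact dense-matrix transcription of Algorithm 14.3 (ours, not the book's; Python with NumPy/SciPy). It factors the common coefficient matrix of (14.30) and (14.35) once per iteration and reuses the factorization for both solves, and it calls the `starting_point` helper sketched above. A production code would instead use the sparse normal-equations form (14.44) discussed below.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mehrotra(A, b, c, tol=1e-8, max_iter=50, eta=0.99):
    """Dense sketch of Algorithm 14.3 (Mehrotra predictor-corrector)."""
    m, n = A.shape
    x, lam, s = starting_point(A, b, c)        # heuristic from (14.39)-(14.40)

    def ratio(v, dv):                          # ratio test (14.36), uncapped
        neg = dv < 0
        return np.min(-v[neg] / dv[neg]) if neg.any() else np.inf

    for _ in range(max_iter):
        r_b, r_c = A @ x - b, A.T @ lam + s - c
        mu = x.dot(s) / n
        if mu < tol and np.linalg.norm(r_b) < tol and np.linalg.norm(r_c) < tol:
            break
        # One factorization of the common coefficient matrix of (14.30)/(14.35).
        K = np.zeros((2 * n + m, 2 * n + m))
        K[:n, n:n + m], K[:n, n + m:] = A.T, np.eye(n)
        K[n:n + m, :n] = A
        K[n + m:, :n], K[n + m:, n + m:] = np.diag(s), np.diag(x)
        fac = lu_factor(K)
        # Predictor (affine-scaling) step (14.30).
        d = lu_solve(fac, np.concatenate([-r_c, -r_b, -x * s]))
        dxa, dsa = d[:n], d[n + m:]
        a_pri, a_dual = min(1.0, ratio(x, dxa)), min(1.0, ratio(s, dsa))
        mu_aff = (x + a_pri * dxa).dot(s + a_dual * dsa) / n   # (14.33)
        sigma = (mu_aff / mu) ** 3                             # (14.34)
        # Combined corrector-and-centering step (14.35).
        rhs3 = -x * s - dxa * dsa + sigma * mu * np.ones(n)
        d = lu_solve(fac, np.concatenate([-r_c, -r_b, rhs3]))
        dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
        a_pri = min(1.0, eta * ratio(x, dx))                   # (14.38)
        a_dual = min(1.0, eta * ratio(s, ds))
        x, lam, s = x + a_pri * dx, lam + a_dual * dlam, s + a_dual * ds
    return x, lam, s
```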
No convergence theory is available for Mehrotra's algorithm, at least in the form in which it is described above. In fact, there are examples for which the algorithm diverges. Simple safeguards could be incorporated into the method to force it into the convergence framework of existing methods or to improve its robustness, but many practical codes do not implement these safeguards, because failures are rare.
When presented with a linear program that is infeasible or unbounded, the algorithm above typically diverges, with the infeasibilities r_b^k and r_c^k and/or the duality measure μ_k going to ∞. Since the symptoms of infeasibility and unboundedness are fairly easy to recognize, interior-point codes contain heuristics to detect and report these conditions. More rigorous approaches for detecting infeasibility and unboundedness make use of the homogeneous self-dual formulation; see Wright [316, Chapter 9] and the references therein for a discussion. A more recent approach that applies directly to infeasible-interior-point methods is described by Todd [286].
SOLVING THE LINEAR SYSTEMS
Most of the computational effort in primal-dual methods is taken up in solving linear systems such as (14.9), (14.30), and (14.35). The coefficient matrix in these systems is usually large and sparse, since the constraint matrix A is itself large and sparse in most applications. The special structure in the step equations allows us to reformulate them as systems with more compact symmetric coefficient matrices, which are easier and cheaper to factor than the original sparse form.

We apply the reformulation procedures to the following general form of the linear system:
\[
\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S & 0 & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} -r_c \\ -r_b \\ -r_{xs} \end{bmatrix}. \qquad (14.41)
\]
Since x and s are strictly positive, the diagonal matrices X and S are nonsingular. Hence, we can eliminate Δs from (14.41) by adding X^{-1} times the third equation in this system to the first equation, to obtain
\[
\begin{bmatrix} -D^{-2} & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta\lambda \end{bmatrix}
=
\begin{bmatrix} -r_c + X^{-1}r_{xs} \\ -r_b \end{bmatrix}, \qquad (14.42a)
\]
\[
\Delta s = -X^{-1}r_{xs} - X^{-1}S\,\Delta x, \qquad (14.42b)
\]
where we have introduced the notation
\[
D = S^{-1/2}X^{1/2}. \qquad (14.43)
\]
This form of the step equations usually is known as the augmented system. We can go further and eliminate Δx by adding AD² times the first equation to the second equation in (14.42a), to obtain
\[
AD^2A^T\,\Delta\lambda = -r_b - AXS^{-1}r_c + AS^{-1}r_{xs}, \qquad (14.44a)
\]
\[
\Delta s = -r_c - A^T\Delta\lambda, \qquad (14.44b)
\]
\[
\Delta x = -S^{-1}r_{xs} - XS^{-1}\Delta s, \qquad (14.44c)
\]
where the expressions for Δs and Δx are obtained from the original system (14.41). The form (14.44a) often is called the normal-equations form, because the system (14.44a) can be viewed as the normal equations (10.14) for a certain linear least-squares problem with coefficient matrix DA^T.
Most implementations of primal-dual methods are based on formulations like (14.44). They use direct sparse Cholesky algorithms to factor the matrix AD²Aᵀ, and then perform triangular solves with the resulting sparse factors to obtain the step Δλ from (14.44a). The steps Δs and Δx are recovered from (14.44b) and (14.44c). General-purpose sparse Cholesky software can be applied to AD²Aᵀ, but modifications are needed because AD²Aᵀ may be ill-conditioned or singular. Ill-conditioning of this system is often observed during the final stages of a primal-dual algorithm, when the elements of the diagonal weighting matrix D² take on both huge and tiny values. The Cholesky technique may encounter diagonal elements that are very small, zero, or (because of roundoff error) slightly negative. One approach for handling this eventuality is to skip a step of the factorization, setting the component of Δλ that corresponds to the faulty diagonal element to zero. We refer to Wright [317] for details of this and other approaches.
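As an illustration of the normal-equations route, the following sketch (ours; dense Cholesky via SciPy, whereas real codes use sparse Cholesky with the safeguards just described) solves (14.41) through the steps (14.44a)-(14.44c):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_normal_equations(A, x, s, r_c, r_b, r_xs):
    """Solve the step equations (14.41) via the normal-equations form (14.44)."""
    d2 = x / s                                    # diagonal of D^2 = X S^{-1}
    M = (A * d2) @ A.T                            # A D^2 A^T
    rhs = -r_b - A @ (d2 * r_c) + A @ (r_xs / s)  # right-hand side of (14.44a)
    dlam = cho_solve(cho_factor(M), rhs)
    ds = -r_c - A.T @ dlam                        # (14.44b)
    dx = -(r_xs + x * ds) / s                     # (14.44c)
    return dx, dlam, ds
```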
A disadvantage of the normal-equations formulation is that if A contains any dense columns, the entire matrix AD²Aᵀ is also dense. Hence, practical software identifies dense and nearly-dense columns, excludes them from the matrix product AD²Aᵀ, and performs the Cholesky factorization of the resulting sparse matrix. Then, a device such as a Sherman-Morrison-Woodbury update is applied to account for the excluded columns. We refer the reader to Wright [316, Chapter 11] for further details.
The formulation (14.42) has received less attention than (14.44), mainly because algorithms and software for factoring sparse symmetric indefinite matrices are more complicated, slower, and less prevalent than sparse Cholesky algorithms. Nevertheless, the formulation (14.42) is cleaner and more flexible than (14.44) in a number of respects. It normally avoids the fill-in in the matrix product AD²Aᵀ that is caused by dense columns in A. Moreover, it allows free variables (components of x with no explicit lower or upper bounds) to be handled directly in the formulation. The normal-equations form must resort to various artificial devices to express such variables; otherwise it is not possible to perform the block elimination that leads to the system (14.44a).
14.3 OTHER PRIMAL-DUAL ALGORITHMS AND EXTENSIONS
OTHER PATH-FOLLOWING METHODS
Framework 14.1 is the basis of a number of other algorithms of the path-following variety. They are less important from a practical viewpoint, but we mention them here because of their elegance and their strong theoretical properties.
Some path-following methods choose conservative values for the centering parameter σ (that is, σ only slightly less than 1) so that unit steps (that is, a step length of α = 1) can be taken along the resulting direction from (14.16) without leaving the chosen neighborhood. These methods, which are known as short-step path-following methods, make only slow progress toward the solution because they require the iterates to stay inside a restrictive N_2 neighborhood (14.17). From a theoretical point of view, however, they have the advantage of better complexity: a result similar to Theorem 14.4 holds with n replaced by n^{1/2} in the complexity estimate.
Better results are obtained with the predictor-corrector method, due to Mizuno, Todd, and Ye [208], which uses two N_2 neighborhoods, nested one inside the other. (Despite the similar terminology, this algorithm is quite distinct from Algorithm 14.3 of Section 14.2.) Every second step of this method is a predictor step, which starts in the inner neighborhood and moves along the affine-scaling direction (computed by setting σ = 0 in (14.16)) to the boundary of the outer neighborhood. The gap between neighborhood boundaries is wide enough to allow this step to make significant progress in reducing μ. Alternating with the predictor steps are corrector steps (computed with σ = 1 and α = 1), which take the next iterate back inside the inner neighborhood in preparation for the next predictor step. The predictor-corrector algorithm produces a sequence of duality measures μ_k that converge superlinearly to zero, in contrast to the linear convergence that characterizes most methods.
POTENTIAL-REDUCTION METHODS

Potential-reduction methods take steps of the same form as path-following methods, but they do not explicitly follow the central path C and can be motivated independently of it. They use a logarithmic potential function to measure the worth of each point in F° and aim to achieve a certain fixed reduction in this function at each iteration. The primal-dual potential function, which we denote generically by Φ, usually has two important properties:
\[
\Phi \to \infty \quad \text{if } x_i s_i \to 0 \text{ for some } i, \text{ while } \mu = x^T s/n \not\to 0, \qquad (14.45a)
\]
\[
\Phi \to -\infty \quad \text{if and only if } (x, \lambda, s) \to \Omega. \qquad (14.45b)
\]
The first property (14.45a) prevents any one of the pairwise products x_i s_i from approaching zero independently of the others, and therefore keeps the iterates away from the boundary of the nonnegative orthant. The second property (14.45b) relates to the solution set Ω. If our algorithm forces Φ to −∞, then (14.45b) ensures that the sequence approaches the solution set.

An interesting primal-dual potential function is defined by
\[
\Phi_\rho(x, s) = \rho\log x^T s - \sum_{i=1}^{n}\log x_i s_i, \qquad (14.46)
\]
for some parameter ρ > n (see Tanabe [283] and Todd and Ye [287]). Like all algorithms based on Framework 14.1, potential-reduction algorithms obtain their search directions by solving (14.10), for some σ_k ∈ (0, 1), and they take steps of length α_k along these directions. For instance, the step length α_k may be chosen to approximately minimize Φ_ρ along the computed direction. By fixing σ_k = n/(n + √n) for all k, one can guarantee constant reduction in Φ_ρ at every iteration. Hence, Φ_ρ will approach −∞, forcing convergence. Adaptive and heuristic choices of σ_k and α_k are also covered by the theory, provided that they at least match the reduction in Φ_ρ obtained from the conservative theoretical values of these parameters.
EXTENSIONS
Primal-dual methods for linear programming can be extended to wider classes of problems. There are simple extensions of the algorithm to the monotone linear complementarity problem (LCP) and convex quadratic programming problems for which the convergence and polynomial complexity properties of the linear programming algorithms are retained. The monotone LCP is the problem of finding vectors x and s in IR^n that satisfy the following conditions:
\[
s = Mx + q, \qquad (x, s) \ge 0, \qquad x^T s = 0, \qquad (14.47)
\]
where M is a positive semidefinite n × n matrix and q ∈ IR^n. The similarity between (14.47) and the KKT conditions (14.3) is obvious: the last two conditions in (14.47) correspond to (14.3d) and (14.3c), respectively, while the condition s = Mx + q is similar to the equations (14.3a) and (14.3b). For practical instances of the problem (14.47), see Cottle, Pang, and Stone [80]. Interior-point methods for monotone LCP have a close correspondence to algorithms for linear programming. The duality measure (14.6) is redefined to be the complementarity measure (with the same definition μ = x^T s/n), and the conditions that must be satisfied by the solution can be stated similarly to (14.4) as follows:
\[
\begin{bmatrix} Mx + q - s \\ XSe \end{bmatrix} = 0, \qquad (x, s) \ge 0.
\]
The general formula for a path-following step is defined analogously to (14.9) as follows:
\[
\begin{bmatrix} M & -I \\ S & X \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta s \end{bmatrix}
=
\begin{bmatrix} -(Mx + q - s) \\ -XSe + \sigma\mu e \end{bmatrix},
\]
where σ ∈ [0, 1]. Using these and similar adaptations, an extension of the practical method of Section 14.2 can also be derived.
Extensions to convex quadratic programs are discussed in Section 16.6. Their adaptation to nonlinear programming problems is the subject of Chapter 19.
Interior-point methods are highly effective in solving semidefinite programming problems, a class of problems involving symmetric matrix variables that are constrained to be positive semidefinite. Semidefinite programming, which has been the topic of concentrated research since the early 1990s, has applications in many areas, including control theory and combinatorial optimization. Further information on this increasingly important topic can be found in the survey papers of Todd [285] and Vandenberghe and Boyd [292] and the books of Nesterov and Nemirovskii [226], Boyd et al. [37], and Boyd and Vandenberghe [38].
14.4 PERSPECTIVES AND SOFTWARE
The appearance of interior-point methods in the 1980s presented the first serious challenge to the dominance of the simplex method as a practical means of solving linear programming problems. By about 1990, interior-point codes had emerged that incorporated the techniques described in Section 14.2 and that were superior on many large problems to the simplex codes available at that time. The years that followed saw significant improvements in simplex software, evidenced by the appearance of packages such as CPLEX and XPRESS-MP. These improvements were due to algorithmic advances such as steepest-edge pivoting (see Goldfarb and Forrest [133]) and improved pricing heuristics, and also to close attention to the nuts and bolts of efficient implementation. The efficiency of interior-point codes also continued to improve, through improvements in the linear algebra for solving the step equations and through the use of higher-order correctors in the step calculation (see Gondzio [138]). During this period, a number of good interior-point codes became freely available (such as PCx [84], HOPDM [137], BPMPD, and LIPSOL [321]) and found their way into many applications.
In general, simplex codes are faster on problems of small-to-medium dimensions, while interior-point codes are competitive, and often faster, on large problems. However, this rule is certainly not hard-and-fast; it depends strongly on the structure of the particular application. Interior-point methods are generally not able to take full advantage of prior knowledge about the solution, such as an estimate of the solution itself or an estimate of the optimal basis. Hence, interior-point methods are less useful than simplex approaches in situations in which warm-start information is readily available. One situation of this type involves branch-and-bound algorithms for solving integer programs, where each node in the branch-and-bound tree requires the solution of a linear program that differs only slightly from one already solved in the parent node. In other situations, we may wish to solve a sequence of linear programs in which the data is perturbed slightly to investigate sensitivity of the solutions to various perturbations, or in which we approximate a nonlinear optimization problem by a sequence of linear programs. Yıldırım and Wright [319] describe how a given point (such as an approximate solution) can be modified to obtain a starting point that is theoretically valid, in that it allows complexity results to be proved that depend on the quality of the given point. In practice, however, these techniques can be expected to provide only a modest improvement in algorithmic performance (perhaps a factor of between 2 and 5) over a cold starting point such as the one described in Section 14.2.
Interior-point software has the advantage that it is easy to program, relative to the simplex method. The most complex operation is the solution of the large linear systems at each iteration to compute the step; software to perform this linear algebra operation is readily available. The interior-point code LIPSOL [321] is written entirely in the Matlab language, apart from a small amount of FORTRAN code that interfaces to the linear algebra software. The code PCx [84] is written in C, but also is easy for the interested user to comprehend and modify. It is even possible for a non-expert in optimization to write an efficient interior-point implementation from scratch that is customized to a particular application.
NOTES AND REFERENCES
For more details on the material of this chapter, see the book by Wright [316].

As noted in the text, Karmarkar's method arose from a search for linear programming algorithms with better worst-case behavior than the simplex method. The first algorithm with polynomial complexity, Khachiyan's ellipsoid algorithm [180], was a computational disappointment. In contrast, the execution times required by Karmarkar's method were not too much greater than simplex codes at the time of its introduction, particularly for large linear programs. Karmarkar's is a primal algorithm; that is, it is described, motivated, and implemented purely in terms of the primal problem (14.1) without reference to the dual. At each iteration, Karmarkar's algorithm performs a projective transformation on the primal feasible set that maps the current iterate x^k to the center of the set and takes a step in the feasible steepest descent direction for the transformed space. Progress toward optimality is measured by a logarithmic potential function. Descriptions of the algorithm can be found in Karmarkar's original paper [175] and in Fletcher [101, Section 8.7].

Karmarkar's method falls outside the scope of this chapter, and in any case, its practical performance does not appear to match the most efficient primal-dual methods. The algorithms we discussed in this chapter have polynomial complexity, like Karmarkar's method.

Many of the algorithmic ideas that have been examined since 1984 actually had their genesis in three works that preceded Karmarkar's paper. The first of these is the book of Fiacco and McCormick [98] on logarithmic barrier functions (originally proposed by Frisch [115]), which proves existence of the central path, among many other results. Further analysis of the central path was carried out by McLinden [205], in the context of nonlinear complementarity problems. Finally, there is Dikin's paper [94], in which an interior-point method known as primal affine-scaling was originally proposed. The outburst of research on primal-dual methods, which culminated in the efficient software packages available today, dates to the seminal paper of Megiddo [206].

Todd gives an excellent survey of potential-reduction methods in [284]. He relates the primal-dual potential-reduction method mentioned above to pure primal potential-reduction methods, including Karmarkar's original algorithm, and discusses extensions to special classes of nonlinear problems.

For an introduction to complexity theory and its relationship to optimization, see the book by Vavasis [297].

Andersen et al. [6] cover many of the practical issues relating to implementation of interior-point methods. In particular, they describe an alternative scheme for choosing the initial point, for the case in which upper bounds are also present on the variables.
EXERCISES
14.1 This exercise illustrates the fact that the bounds (x, s) ≥ 0 are essential in relating solutions of the system (14.4a) to solutions of the linear program (14.1) and its dual. Consider the following linear program in IR²:
\[
\min\; x_1, \quad \text{subject to } x_1 + x_2 = 1, \; (x_1, x_2) \ge 0.
\]
Show that the primal-dual solution is
\[
x^* = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad \lambda^* = 0, \qquad s^* = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
Also verify that the system F(x, λ, s) = 0 has the spurious solution
\[
x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad \lambda = 1, \qquad s = \begin{bmatrix} 0 \\ -1 \end{bmatrix},
\]
which has no relation to the solution of the linear program.
14.2
(i) Show that N_2(θ_1) ⊂ N_2(θ_2) when 0 ≤ θ_1 < θ_2 < 1, and that N_{−∞}(γ_1) ⊂ N_{−∞}(γ_2) for 0 < γ_2 ≤ γ_1 ≤ 1.
(ii) Show that N_2(θ) ⊂ N_{−∞}(γ) if γ ≤ 1 − θ.
14.3 Given an arbitrary point (x, λ, s) ∈ F°, find the range of γ values for which (x, λ, s) ∈ N_{−∞}(γ). (The range depends on x and s.)
14.4 For n = 2, find a point (x, s) > 0 for which the condition
\[
\|XSe - \mu e\|_2 \le \theta\mu
\]
is not satisfied for any θ ∈ [0, 1).
14.5 Prove that the neighborhoods N_{−∞}(1) (see (14.18)) and N_2(0) (see (14.17)) coincide with the central path C.
14.6 In the long-step path-following method (Algorithm 14.2), give a procedure for calculating the maximum value of α such that (14.20) is satisfied.
14.7 Show that Φ_ρ defined by (14.46) has the property (14.45a).
14.8 Prove that the coefficient matrix in (14.16) is nonsingular if and only if A has full row rank.
14.9 Given (Δx, Δλ, Δs) satisfying (14.10), prove (14.23).
14.10 Given an iterate (x^k, λ^k, s^k) with (x^k, s^k) > 0, show that the quantities α_{k,max}^{pri} and α_{k,max}^{dual} defined by (14.36) are the largest values of α such that x^k + αΔx^k ≥ 0 and s^k + αΔs^k ≥ 0, respectively.
14.11 Verify (14.37).
14.12 Given that X and S are diagonal with positive diagonal elements, show that the coefficient matrix in (14.44a) is symmetric and positive definite if and only if A has full row rank. Does this result continue to hold if we replace D by a diagonal matrix in which exactly m of the diagonal elements are positive and the remainder are zero? (Here m is the number of rows of A.)
14.13 Given a point (x, λ, s) with (x, s) > 0, consider the trajectory H defined by
\[
F(x(\tau), \lambda(\tau), s(\tau)) =
\begin{bmatrix} (1 - \tau)(A^T\lambda + s - c) \\ (1 - \tau)(Ax - b) \\ (1 - \tau)XSe \end{bmatrix},
\qquad (x(\tau), s(\tau)) > 0,
\]
for τ ∈ [0, 1], and note that (x(0), λ(0), s(0)) = (x, λ, s), while the limit of (x(τ), λ(τ), s(τ)) as τ ↑ 1 will lie in the primal-dual solution set of the linear program. Find equations for the first, second, and third derivatives of H with respect to τ at τ = 0. Hence, write down a Taylor series approximation to H near the point (x, λ, s).
14.14 Consider the following linear program, which contains free variables denoted by y:
\[
\min\; c^T x + d^T y, \quad \text{subject to } A_1 x + A_2 y = b, \; x \ge 0.
\]
By introducing Lagrange multipliers λ for the equality constraints and s for the bounds x ≥ 0, write down optimality conditions for this problem in an analogous fashion to (14.3). Following (14.4) and (14.16), use these conditions to derive the general step equations for a primal-dual interior-point method. Express these equations in augmented system form analogously to (14.42) and explain why it is not possible to reduce further to a formulation like (14.44) in which the coefficient matrix is symmetric and positive definite.
14.15 Program Algorithm 14.3 in Matlab. Choose η = 0.99 uniformly in (14.38). Test your code on a linear programming problem (14.1) generated by choosing A randomly, and then setting x, s, λ, b, and c as follows:
\[
x_i = \begin{cases} \text{random positive number}, & i = 1, 2, \dots, m, \\ 0, & i = m+1, m+2, \dots, n, \end{cases}
\]
\[
s_i = \begin{cases} 0, & i = 1, 2, \dots, m, \\ \text{random positive number}, & i = m+1, m+2, \dots, n, \end{cases}
\]
\[
\lambda = \text{random vector}, \qquad c = A^T\lambda + s, \qquad b = Ax.
\]
Choose the starting point (x^0, λ^0, s^0) with the components of x^0 and s^0 set to large positive values.
14.17 Show that the solutions of the problems (14.39) are given explicitly by (14.40).
CHAPTER 15
Fundamentals of Algorithms for Nonlinear Constrained Optimization
In this chapter, we begin our discussion of algorithms for solving the general constrained optimization problem
\[
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to}\quad
c_i(x) = 0, \; i \in \mathcal{E}, \qquad
c_i(x) \ge 0, \; i \in \mathcal{I}, \qquad (15.1)
\]
where the objective function f and the constraint functions c_i are all smooth, real-valued functions on a subset of IR^n, and I and E are finite index sets of inequality and equality constraints, respectively. In Chapter 12, we used this general statement of the problem
to derive optimality conditions that characterize its solutions. This theory is useful for motivating the various algorithms discussed in the remainder of the book, which differ from each other in fundamental ways but are all iterative in nature. They generate a sequence of estimates of the solution x* that, we hope, tend toward a solution. In some cases, they also generate a sequence of guesses for the Lagrange multipliers associated with the constraints. As in the chapters on unconstrained optimization, we study only algorithms for finding local solutions of (15.1); the problem of finding a global solution is outside the scope of this book.
We note that this chapter is not concerned with individual algorithms themselves, but rather with fundamental concepts and building blocks that are common to more than one algorithm. After reading Sections 15.1 and 15.2, the reader may wish to glance at the material in Sections 15.3, 15.4, 15.5, and 15.6, and return to these sections as needed during study of subsequent chapters.
15.1 CATEGORIZING OPTIMIZATION ALGORITHMS
We now catalog the algorithmic approaches presented in the rest of the book. No standard taxonomy exists for nonlinear optimization algorithms; in the remaining chapters we have grouped the various approaches as follows.
I. In Chapter 16 we study algorithms for solving quadratic programming problems. We consider this category separately because of its intrinsic importance, because its particular characteristics can be exploited by efficient algorithms, and because quadratic programming subproblems need to be solved by sequential quadratic programming methods and certain interior-point methods for nonlinear programming. We discuss active-set, interior-point, and gradient projection methods.
II. In Chapter 17 we discuss penalty and augmented Lagrangian methods. By combining the objective function and constraints into a penalty function, we can attack problem (15.1) by solving a sequence of unconstrained problems. For example, if only equality constraints are present in (15.1), we can define the quadratic penalty function as
\[
Q(x; \mu) = f(x) + \frac{\mu}{2}\sum_{i \in \mathcal{E}} c_i^2(x), \qquad (15.2)
\]
where μ > 0 is referred to as a penalty parameter. We minimize this unconstrained function, for a series of increasing values of μ, until the solution of the constrained optimization problem is identified to sufficient accuracy.
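As a concrete (and deliberately naive) illustration of this strategy, the following sketch (ours, not from the book; Python with SciPy) minimizes the quadratic penalty function (15.2) for an increasing sequence of penalty parameters, warm-starting each minimization at the previous solution. The helper name, the update factor 10, and the iteration counts are our own illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, cons, x0, mu0=1.0, factor=10.0, n_outer=8):
    """Minimize f subject to cons(x) = 0 via the quadratic penalty (15.2).
    `cons` returns the vector of equality-constraint values c_i(x)."""
    x, mu = np.asarray(x0, float), mu0
    for _ in range(n_outer):
        q = lambda z, mu=mu: f(z) + 0.5 * mu * np.sum(cons(z) ** 2)
        x = minimize(q, x).x            # unconstrained subproblem
        mu *= factor                    # tighten the penalty
    return x

# Example: min x1 + x2 subject to x1^2 + x2^2 - 2 = 0 (solution (-1, -1)).
sol = quadratic_penalty(lambda z: z[0] + z[1],
                        lambda z: np.array([z[0] ** 2 + z[1] ** 2 - 2.0]),
                        x0=[2.0, 1.0])
```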
If we use an exact penalty function, it may be possible to find a local solution of (15.1) by solving a single unconstrained optimization problem. For the equality-constrained problem, the function defined by
\[
\phi(x; \mu) = f(x) + \mu\sum_{i \in \mathcal{E}} |c_i(x)|
\]
is usually an exact penalty function, for a sufficiently large value of μ > 0. Although they often are nondifferentiable, exact penalty functions can be minimized by solving a sequence of smooth subproblems.
In augmented Lagrangian methods, we define a function that combines the properties of the Lagrangian function (12.33) and the quadratic penalty function (15.2). This so-called augmented Lagrangian function has the following form for equality-constrained problems:
\[
\mathcal{L}_A(x, \lambda; \mu) = f(x) - \sum_{i \in \mathcal{E}} \lambda_i c_i(x) + \frac{\mu}{2}\sum_{i \in \mathcal{E}} c_i^2(x).
\]
Methods based on this function fix λ to some estimate of the optimal Lagrange multiplier vector and fix μ to some positive value, then find a value of x that approximately minimizes L_A(·, λ; μ). At this new x-iterate, λ and μ may be updated; then the process is repeated. This approach avoids certain drawbacks associated with the minimization of the quadratic penalty function (15.2).
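A minimal method-of-multipliers loop built on this function might look as follows (our sketch; the first-order multiplier update λ_i ← λ_i − μ c_i(x) is the standard one, discussed in Chapter 17, and the fixed μ and iteration count are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, cons, x0, mu=10.0, n_outer=10):
    """Method of multipliers for equality constraints, based on L_A above."""
    x = np.asarray(x0, float)
    lam = np.zeros_like(cons(x))
    for _ in range(n_outer):
        LA = lambda z: (f(z) - lam.dot(cons(z))
                        + 0.5 * mu * np.sum(cons(z) ** 2))
        x = minimize(LA, x).x           # approximate minimizer in x
        lam = lam - mu * cons(x)        # first-order multiplier update
    return x, lam
```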
III. In Chapter 18 we describe sequential quadratic programming (SQP) methods, which model (15.1) by a quadratic programming subproblem at each iterate and define the search direction to be the solution of this subproblem. In the basic SQP method, we define the search direction p_k at the iterate (x_k, λ_k) to be the solution of
\[
\min_p\; \tfrac{1}{2}\,p^T \nabla_{xx}^2 \mathcal{L}(x_k, \lambda_k)\, p + \nabla f(x_k)^T p \qquad (15.3a)
\]
\[
\text{subject to}\quad \nabla c_i(x_k)^T p + c_i(x_k) = 0, \quad i \in \mathcal{E}, \qquad (15.3b)
\]
\[
\phantom{\text{subject to}\quad} \nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \quad i \in \mathcal{I}, \qquad (15.3c)
\]
where L is the Lagrangian function defined in (12.33). The objective in this subproblem is an approximation to the change in the Lagrangian function in moving from x_k to x_k + p, while the constraints are linearizations of the constraints in (15.1). A trust-region constraint may be added to (15.3) to control the length and quality of the step, and quasi-Newton approximate Hessians can be used in place of ∇²_{xx} L(x_k, λ_k). In a variant called sequential linear-quadratic programming, the step p_k is computed in two stages. First, we solve a linear program that is defined by omitting the first (quadratic) term from the objective (15.3a) and adding a trust-region constraint to (15.3). Next, we obtain the step p_k by solving an equality-constrained subproblem in which the constraints active at the solution of the linear program are imposed as equalities, while all other constraints are ignored.
IV. In Chapter 19 we study interior-point methods for nonlinear programming. These methods can be viewed as extensions of the primal-dual interior-point methods for linear programming discussed in Chapter 14. We can also view them as barrier methods that generate steps by solving the problem
\[
\min_{x, s}\; f(x) - \mu\sum_{i=1}^{m}\log s_i \qquad (15.4a)
\]
\[
\text{subject to}\quad c_i(x) = 0, \quad i \in \mathcal{E}, \qquad (15.4b)
\]
\[
\phantom{\text{subject to}\quad} c_i(x) - s_i = 0, \quad i \in \mathcal{I}, \qquad (15.4c)
\]
for some positive value of the barrier parameter μ, where the variables s_i > 0 are slacks.
Interior-point methods constitute the newest class of methods for nonlinear programming and have already proved to be formidable competitors of sequential quadratic programming methods.
The algorithms in categories I, III, and IV make use of elimination techniques, in which the constraints are used to eliminate some of the degrees of freedom in the problem. As a background to those algorithms, we discuss elimination in Section 15.3. In later sections we discuss merit functions and filters, which are important mechanisms for promoting convergence of nonlinear programming algorithms from remote starting points.
15.2 THE COMBINATORIAL DIFFICULTY OF INEQUALITY-CONSTRAINED PROBLEMS
One of the main challenges in solving nonlinear programming problems lies in dealing with inequality constraints, in particular, in deciding which of these constraints are active at the solution and which are not. One approach, which is the essence of active-set methods, starts by making a guess of the optimal active set A*, that is, the set of constraints that are satisfied as equalities at a solution. We call our guess the working set and denote it by W. We then solve a problem in which the constraints in the working set are imposed as equalities and the constraints not in W are ignored. We then check to see if there is a choice of Lagrange multipliers such that the solution x* obtained for this W satisfies the KKT conditions (12.34). If so, we accept x* as a local solution of (15.1). Otherwise, we make a different choice of W and repeat the process. This approach is based on the observation that, in general, it is much simpler to solve equality-constrained problems than to solve nonlinear programs.
The number of choices for the working set W may be very large, up to 2^{|I|}, where |I| is the number of inequality constraints. We arrive at this estimate by observing that we can make one of two choices for each i ∈ I: to include it in W or leave it out. Since the number of possible working sets grows exponentially with the number of inequalities (a phenomenon we refer to as the combinatorial difficulty of nonlinear programming), we cannot hope to design a practical algorithm by considering all possible choices for W.
The following example suggests that even for a small number of inequality constraints, determination of the optimal active set is not a simple task.
EXAMPLE 15.1
Consider the problem
\[
\min_{x, y}\; f(x, y) \overset{\text{def}}{=} \tfrac{1}{2}(x - 2)^2 + \tfrac{1}{2}\left(y - \tfrac{1}{2}\right)^2 \qquad (15.5)
\]
\[
\text{subject to}\quad (x + 1)^{-1} - y - \tfrac{1}{4} \ge 0, \qquad x \ge 0, \qquad y \ge 0.
\]
We label the constraints, in order, with the indices 1 through 3. Figure 15.1 illustrates the contours of the objective function (dashed circles). The feasible region is the region enclosed by the curve and the two axes. We see that only the first constraint is active at the solution, which is (x*, y*)^T ≈ (1.953, 0.089)^T.
Let us now apply the workingset approach described above to 15.5, considering all 23 8 possible choices of W.
We consider first the possibility that no constraints are active at the solution, that is, W = ∅. Since ∇f(x, y) = (x − 2, y − ½)ᵀ, we see that the unconstrained minimum of f lies outside the feasible region. Hence, the optimal active set cannot be empty.
There are seven further possibilities. First, all three constraints could be active, that is, W = {1, 2, 3}. A glance at Figure 15.1 shows that this does not happen for our problem; the three constraints do not share a common point of intersection. Three further possibilities are obtained by making a single constraint active, that is, W = {1}, W = {2}, and W = {3},
[Figure 15.1: Graphical illustration of problem (15.5), showing the contours of f (centered at the unconstrained minimizer (2, 0.5)), the feasible region, and the solution (x*, y*).]
while the final three possibilities are obtained by making exactly two constraints active, that is, W = {1, 2}, W = {1, 3}, and W = {2, 3}. We consider three of these cases in detail.
W = {2}: that is, only the constraint x ≥ 0 is active. If we minimize f enforcing only this constraint, we obtain the point (0, ½)ᵀ. A check of the KKT conditions (12.34) shows that no matter how we choose the Lagrange multipliers, we cannot satisfy all these conditions at (0, ½)ᵀ. We must have λ₁ = λ₃ = 0 to satisfy (12.34e), which implies that we must set λ₂ = −2 to satisfy (12.34a); but this value of λ₂ violates the condition (12.34d).
W = {1, 3}: this working set yields the single feasible point (3, 0)ᵀ. Since constraint 2 is inactive at this point, we have λ₂ = 0, so by solving (12.34a) for the other Lagrange multipliers, we obtain λ₁ = −16 and λ₃ = −16.5. These values are negative, so they violate (12.34d), and x = (3, 0)ᵀ cannot be a solution of (15.1).
W = {1}: solving the equality-constrained problem in which the first constraint is active, we obtain (x*, y*)ᵀ ≈ (1.953, 0.089)ᵀ with Lagrange multiplier λ₁* ≈ 0.411. It is easy to see that by setting λ₂ = λ₃ = 0, the remaining KKT conditions (12.34) are satisfied, so we conclude that this is a KKT point. Furthermore, it is easy to show that the second-order sufficient conditions are satisfied, as the Hessian of the Lagrangian is positive definite.
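To make the case analysis concrete, here is a minimal brute-force sketch of the working-set enumeration (assuming NumPy and SciPy are available; the starting point and tolerances are arbitrary choices, not part of the text). It loops over all eight working sets for problem (15.5), solves each equality-constrained subproblem, and tests feasibility and the signs of the recovered multipliers:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

f = lambda z: 0.5 * (z[0] - 2.0) ** 2 + 0.5 * (z[1] - 0.5) ** 2
grad_f = lambda z: np.array([z[0] - 2.0, z[1] - 0.5])

# The constraints c_i(z) >= 0 of (15.5), i = 1, 2, 3, and their gradients.
c = [lambda z: 1.0 / (z[0] + 1.0) - z[1] - 0.25,
     lambda z: z[0],
     lambda z: z[1]]
dc = [lambda z: np.array([-1.0 / (z[0] + 1.0) ** 2, -1.0]),
      lambda z: np.array([1.0, 0.0]),
      lambda z: np.array([0.0, 1.0])]

for k in range(4):
    for W in itertools.combinations(range(3), k):
        # Impose the constraints in W as equalities; ignore the rest.
        eqs = [{'type': 'eq', 'fun': c[i]} for i in W]
        sol = minimize(f, x0=np.array([1.0, 1.0]), constraints=eqs)
        if not sol.success:          # e.g. W = {1,2,3}: no common point
            continue
        z = sol.x
        # Multipliers for i not in W are zero; recover the others from
        # grad f(z) = sum_{i in W} lambda_i * grad c_i(z) by least squares.
        if W:
            A = np.column_stack([dc[i](z) for i in W])
            lam = np.linalg.lstsq(A, grad_f(z), rcond=None)[0]
            stationary = np.linalg.norm(A @ lam - grad_f(z)) < 1e-6
        else:
            lam = np.array([])
            stationary = np.linalg.norm(grad_f(z)) < 1e-6
        feasible = all(c[i](z) >= -1e-6 for i in range(3))
        if feasible and stationary and all(lam >= -1e-6):
            print('KKT point for W =', {i + 1 for i in W}, ':', z.round(3))
```

Only W = {1} survives all three tests, printing the solution (1.953, 0.089) found above.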
Even for this small example, we see that it is exhausting to consider all possible choices for W. Figure 15.1 suggests, however, that some choices of W can be eliminated from consideration if we make use of knowledge of the functions that define the problem, and their derivatives. In fact, the active set methods described in Chapter 16 use this kind of information to make a series of educated guesses for the working set, avoiding choices of W that obviously will not lead to a solution of 15.1.
A different approach is followed by the interior-point (or barrier) methods discussed in Chapter 19. These methods generate iterates that stay away from the boundary of the feasible region defined by the inequality constraints. As the solution of the nonlinear program is approached, the barrier effects are weakened to permit an increasingly accurate estimate of the solution. In this manner, interior-point methods avoid the combinatorial difficulty of nonlinear programming.
15.3 ELIMINATION OF VARIABLES
When dealing with constrained optimization problems, it is natural to try to use the constraints to eliminate some of the variables from the problem, to obtain a simpler problem with fewer degrees of freedom. Elimination techniques must be used with care, however, as they may alter the problem or introduce ill conditioning.
We begin with an example in which it is safe and convenient to eliminate variables. In the problem
\[
\min f(x) = f(x_1, x_2, x_3, x_4) \quad \text{subject to} \quad x_1 + x_3^2 - x_4 x_3 = 0, \quad -x_2 + x_4 + x_3^2 = 0,
\]
there is no risk in setting
\[
x_1 = x_4 x_3 - x_3^2, \qquad x_2 = x_4 + x_3^2,
\]
to obtain a function of two variables
\[
h(x_3, x_4) = f(x_4 x_3 - x_3^2,\; x_4 + x_3^2,\; x_3,\; x_4),
\]
which we can minimize using the unconstrained optimization techniques described in earlier chapters.

The dangers of nonlinear elimination are illustrated in the following example.

[Figure 15.2: The danger of nonlinear elimination, showing the circles x² + y² = 1 and x² + y² = 4 (contours of the objective), the constraint curve y² = (x − 1)³, and the solution (1, 0).]
EXAMPLE 15.2 (FLETCHER [101])
Consider the problem
\[
\min\; x^2 + y^2 \quad \text{subject to} \quad (x - 1)^3 = y^2.
\]
The contours of the objective function and the constraint are illustrated in Figure 15.2, which shows that the solution is (x, y) = (1, 0).
We attempt to solve this problem by eliminating y. By doing so, we obtain
\[
h(x) = x^2 + (x - 1)^3.
\]
Clearly, h(x) → −∞ as x → −∞. By blindly applying this transformation we may conclude that the problem is unbounded, but this view ignores the fact that the constraint (x − 1)³ = y² implicitly imposes the bound x ≥ 1, which is active at the solution. Hence, if we wish to eliminate y, we should explicitly introduce the bound x ≥ 1 into the problem.
This example shows that the use of nonlinear equations to eliminate variables may result in errors that can be difficult to trace. For this reason, nonlinear elimination is not used by most optimization algorithms. Instead, many algorithms linearize the constraints and apply elimination techniques to the simplified problem. We now describe systematic procedures for performing variable elimination using linear constraints.
SIMPLE ELIMINATION USING LINEAR CONSTRAINTS
We consider the minimization of a nonlinear function subject to a set of linear equality constraints,
\[
\min f(x) \quad \text{subject to} \quad Ax = b, \tag{15.6}
\]
where A is an m × n matrix with m ≤ n. Suppose for simplicity that A has full row rank (rank m). If such is not the case, we find either that the problem is inconsistent or that some of the constraints are redundant and can be deleted without affecting the solution of the problem. Under this assumption, we can find a subset of m columns of A that is linearly independent. If we gather these columns into an m × m matrix B and define an n × n permutation matrix P that swaps these columns to the first m column positions in A, we can write
\[
AP = [\, B \;\; N \,], \tag{15.7}
\]
where N denotes the n − m remaining columns of A. The notation here is consistent with that of Chapter 13, where we discussed similar concepts in the context of linear programming. We define the subvectors x_B ∈ ℝᵐ and x_N ∈ ℝ^{n−m} as follows:
\[
\begin{bmatrix} x_B \\ x_N \end{bmatrix} = P^T x, \tag{15.8}
\]
and call x_B the basic variables and B the basis matrix. Noting that P Pᵀ = I, we can rewrite the constraint Ax = b as
\[
b = Ax = A P P^T x = B x_B + N x_N.
\]
By rearranging this formula, we deduce that the basic variables can be expressed as follows:
\[
x_B = B^{-1} b - B^{-1} N x_N. \tag{15.9}
\]
We can therefore compute a feasible point for the constraints Ax = b by choosing any value of x_N and then setting x_B according to the formula (15.9). The problem (15.6) is therefore equivalent to the unconstrained problem
\[
\min_{x_N}\; h(x_N) \stackrel{\text{def}}{=} f\left( P \begin{bmatrix} B^{-1} b - B^{-1} N x_N \\ x_N \end{bmatrix} \right). \tag{15.10}
\]
We refer to the substitution in (15.9) as simple elimination of variables.
This discussion shows that a nonlinear optimization problem with linear equality constraints is, from a mathematical point of view, the same as an unconstrained problem.
EXAMPLE 15.3 Consider the problem
\[
\min\; \sin(x_1 + x_2) + x_3^2 + \tfrac13 \bigl( x_4 + x_5^4 + \tfrac{x_6}{2} \bigr) \tag{15.11a}
\]
subject to
\[
8x_1 - 6x_2 + x_3 + 9x_4 + 4x_5 = 6, \qquad 3x_1 + 2x_2 - x_4 + 6x_5 + 4x_6 = -4. \tag{15.11b}
\]
By defining the permutation matrix P so as to reorder the components of x as xᵀ = (x₃, x₆, x₁, x₂, x₄, x₅)ᵀ, we find that the coefficient matrix AP is
\[
AP = \begin{bmatrix} 1 & 0 & 8 & -6 & 9 & 4 \\ 0 & 4 & 3 & 2 & -1 & 6 \end{bmatrix}.
\]
The basis matrix B is diagonal and therefore easy to invert. We obtain from (15.9) that
\[
\begin{bmatrix} x_3 \\ x_6 \end{bmatrix} = \begin{bmatrix} 6 \\ -1 \end{bmatrix} - \begin{bmatrix} 8 & -6 & 9 & 4 \\ \tfrac34 & \tfrac12 & -\tfrac14 & \tfrac32 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_4 \\ x_5 \end{bmatrix}. \tag{15.12}
\]
By substituting for x₃ and x₆ in (15.11a), the problem becomes
\[
\min_{x_1, x_2, x_4, x_5}\; \sin(x_1 + x_2) + (8x_1 - 6x_2 + 9x_4 + 4x_5 - 6)^2 + \tfrac13 \bigl( x_4 + x_5^4 - \tfrac12 - \tfrac38 x_1 - \tfrac14 x_2 + \tfrac18 x_4 - \tfrac34 x_5 \bigr). \tag{15.13}
\]
We could have chosen two other columns of the coefficient matrix A (that is, two variables other than x₃ and x₆) as the basis for elimination in the system (15.11b), but the matrix B⁻¹N would not have been so simple.
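The algebra in (15.12) is easy to check numerically. The following sketch (assuming NumPy; the choice x_N = 0 is arbitrary) forms B and N from AP, applies formula (15.9), and verifies feasibility:

```python
import numpy as np

# Coefficient matrix AP and right-hand side of (15.11b), with the
# components of x reordered as (x3, x6, x1, x2, x4, x5).
AP = np.array([[1.0, 0.0, 8.0, -6.0, 9.0, 4.0],
               [0.0, 4.0, 3.0, 2.0, -1.0, 6.0]])
b = np.array([6.0, -4.0])

B, N = AP[:, :2], AP[:, 2:]            # basis and non-basis columns

x_N = np.zeros(4)                      # any value of (x1, x2, x4, x5) will do
x_B = np.linalg.solve(B, b - N @ x_N)  # simple elimination, formula (15.9)

# Reassemble x in the original ordering (x1, ..., x6) and check Ax = b.
x = np.array([x_N[0], x_N[1], x_B[0], x_N[2], x_N[3], x_B[1]])
A = np.array([[8.0, -6.0, 1.0, 9.0, 4.0, 0.0],
              [3.0, 2.0, 0.0, -1.0, 6.0, 4.0]])
print(x_B)          # [ 6. -1.]: the constant terms in (15.12)
print(A @ x - b)    # ~[0. 0.]: the computed point is feasible
```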
A set of m independent columns can be selected, in general, by means of Gaussian elimination. In the parlance of linear algebra, we can compute the row echelon form of the matrix and choose the pivot columns as the columns of the basis B. Ideally, we would like B to be easy to factor and well conditioned. A technique that suits these purposes is a sparse Gaussian elimination approach that attempts to preserve sparsity while keeping rounding errors under control. A well-known implementation of this algorithm is MA48 from the HSL library [96]. As we discuss below, however, there is no guarantee that the Gaussian elimination process will identify the best choice of basis matrix.
There is an interesting interpretation of the simple elimination-of-variables approach that we have just described. To simplify the notation, we will assume from now on that the coefficient matrix is already given to us so that the basic columns appear in the first m positions, that is, P = I.
From (15.8) and (15.9) we see that any feasible point x for the linear constraints in (15.6) can be written as
\[
x = \begin{bmatrix} x_B \\ x_N \end{bmatrix} = Y b + Z x_N, \tag{15.14}
\]
where
\[
Y = \begin{bmatrix} B^{-1} \\ 0 \end{bmatrix}, \qquad Z = \begin{bmatrix} -B^{-1} N \\ I \end{bmatrix}. \tag{15.15}
\]
Note that Z has n − m linearly independent columns (because of the presence of the identity matrix in the lower block) and that it satisfies AZ = 0. Therefore, Z is a basis for the null space of A. In addition, the columns of Y and the columns of Z form a linearly independent set. We note also from (15.15), (15.7) that Yb is a particular solution of the linear constraints Ax = b.
In other words, the simple elimination technique expresses feasible points as the sum of a particular solution of Ax = b (the first term in (15.14)) plus a displacement along the null space of the constraints (the second term in (15.14)). The relations (15.14), (15.15) indicate that the particular solution Yb is obtained by holding n − m components of x at zero while relaxing the other m components (the ones in x_B) until they reach the constraints. The particular solution Yb is sometimes known as the coordinate relaxation step. In Figure 15.3, we see the coordinate relaxation step Yb obtained by choosing the basis matrix B to be the first column of A. If we were to choose B to be the second column of A, the coordinate relaxation step would lie along the x₂ axis.

[Figure 15.3: Simple elimination, showing the coordinate relaxation step obtained by choosing the basis to be the first column of A.]
Simple elimination is inexpensive but can give rise to numerical instabilities. If the feasible set in Figure 15.3 consisted of a line that was almost parallel to the x₁ axis, the coordinate relaxation along this axis would be very large in magnitude. We would then be computing x as the difference of very large vectors, giving rise to numerical cancellation. In that situation it would be preferable to choose a particular solution along the x₂ axis, that is, to select a different basis. Selection of the best basis is, therefore, not a straightforward task in general. To overcome the dangers of an excessively large coordinate relaxation step, we could define the particular solution Yb as the minimum-norm step to the constraints. This approach is a special case of more general elimination strategies, which we now describe.
GENERAL REDUCTION STRATEGIES FOR LINEAR CONSTRAINTS
To generalize (15.14) and (15.15), we choose matrices Y ∈ ℝ^{n×m} and Z ∈ ℝ^{n×(n−m)} with the following properties:
\[
[\, Y \;|\; Z \,] \in \mathbb{R}^{n \times n} \text{ is nonsingular}, \qquad AZ = 0. \tag{15.16}
\]
These properties indicate that, as in (15.15), the columns of Z are a basis for the null space of A. Since A has full row rank, so does A[Y | Z] = [AY | 0], so it follows that the m × m matrix AY is nonsingular. We now express any solution of the linear constraints Ax = b as
\[
x = Y x_Y + Z x_Z, \tag{15.17}
\]
for some vectors x_Y ∈ ℝᵐ and x_Z ∈ ℝ^{n−m}. By substituting (15.17) into the constraints Ax = b, we obtain
\[
Ax = (AY) x_Y = b;
\]
hence, by nonsingularity of AY, x_Y can be written explicitly as
\[
x_Y = (AY)^{-1} b. \tag{15.18}
\]
By substituting this expression into (15.17), we conclude that any vector x of the form
\[
x = Y (AY)^{-1} b + Z x_Z \tag{15.19}
\]
satisfies the constraints Ax = b for any choice of x_Z ∈ ℝ^{n−m}. Therefore, the problem (15.6) can be restated equivalently as the following unconstrained problem:
\[
\min_{x_Z}\; f\bigl( Y (AY)^{-1} b + Z x_Z \bigr). \tag{15.20}
\]
Ideally, we would like to choose Y in such a way that the matrix AY is as well conditioned as possible, since it needs to be factorized to give the particular solution Y(AY)⁻¹b. We can do this by computing Y and Z by means of a QR factorization of Aᵀ, which has the form
\[
A^T \Pi = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R \\ 0 \end{bmatrix}, \tag{15.21}
\]
where [Q₁ Q₂] is orthogonal. The submatrices Q₁ and Q₂ have orthonormal columns and are of dimension n × m and n × (n − m), while R is m × m upper triangular and nonsingular, and Π is an m × m permutation matrix. (See the discussion following (A.24) in the Appendix for further details.) We now define
\[
Y = Q_1, \qquad Z = Q_2, \tag{15.22}
\]
so that the columns of Y and Z form an orthonormal basis of ℝⁿ. If we expand (15.21) and do a little rearrangement, we obtain
\[
AY = \Pi R^T, \qquad AZ = 0.
\]
Therefore, Y and Z have the desired properties, and the condition number of AY is the same as that of R, which in turn is the same as that of A itself. From (15.19) we see that any solution of Ax = b can be expressed as
\[
x = Q_1 R^{-T} \Pi^T b + Q_2 x_Z,
\]
for some vector x_Z. The computation R^{−T} Πᵀ b can be carried out inexpensively, at the cost of a single triangular substitution.
A simple computation shows that the particular solution Q₁R^{−T}Πᵀb can also be written as Aᵀ(AAᵀ)⁻¹b. This vector is the solution of the following problem:
\[
\min \|x\|_2 \quad \text{subject to} \quad Ax = b;
\]
that is, it is the minimum-norm solution of Ax = b. See Figure 15.5 for an illustration of this step.
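As an illustration, the following sketch (assuming NumPy and SciPy; the matrix A and vector b are made-up random data) computes Y and Z from a QR factorization of Aᵀ as in (15.21)-(15.22) and confirms that Y(AY)⁻¹b coincides with the minimum-norm solution Aᵀ(AAᵀ)⁻¹b:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
m, n = 3, 7
A = rng.standard_normal((m, n))      # full row rank with probability 1
b = rng.standard_normal(m)

# QR factorization of A^T; column pivoting supplies the permutation Pi.
Q, R, piv = qr(A.T, pivoting=True)
Y, Z = Q[:, :m], Q[:, m:]            # definition (15.22)

print(np.linalg.norm(A @ Z))         # ~0: Z is a basis for null(A)

x_qr = Y @ np.linalg.solve(A @ Y, b)       # particular solution Y (AY)^{-1} b
x_mn = A.T @ np.linalg.solve(A @ A.T, b)   # minimum-norm solution
print(np.linalg.norm(x_qr - x_mn))   # ~0: the two expressions agree
```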
Elimination via the orthogonal basis (15.22) is ideal from the point of view of numerical stability. The main cost associated with this reduction strategy is in computing the QR factorization (15.21). Unfortunately, for problems in which A is large and sparse, a sparse QR factorization can be much more costly to compute than the sparse Gaussian elimination strategy used in simple elimination. Therefore, other elimination strategies have been developed that seek a compromise between these two techniques; see Exercise 15.7.
[Figure 15.4: General elimination for the case A ∈ ℝ^{1×3}, showing the particular solution Y x_Y and a step Z x_Z in the null space of A, within the plane Ax = b.]
[Figure 15.5: The minimum-norm step Aᵀ(AAᵀ)⁻¹b onto the set Ax = b.]
EFFECT OF INEQUALITY CONSTRAINTS
Elimination of variables is not always beneficial if inequality constraints are present alongside the equalities. For instance, if problem (15.11) had the additional constraint x ≥ 0, then after eliminating the variables x₃ and x₆, we would be left with the problem of minimizing the function in (15.13) subject to the constraints
\[
x_1, x_2, x_4, x_5 \ge 0, \qquad 8x_1 - 6x_2 + 9x_4 + 4x_5 \le 6, \qquad \tfrac34 x_1 + \tfrac12 x_2 - \tfrac14 x_4 + \tfrac32 x_5 \le -1.
\]
Hence, the cost of eliminating the equality constraints (15.11b) is to make the inequalities more complicated than the simple bounds x ≥ 0. For many algorithms, this transformation will not yield any benefit.
If, however, problem (15.11) included the general inequality constraint 3x₁ + 2x₃ ≥ 1, the elimination (15.12) would transform the problem into one of minimizing the function in (15.13) subject to the inequality constraint
13×1 12×2 18×4 8×5 11. 15.23
In this case, the inequality constraint would not become much more complicated after elimination of the equality constraints, so it is probably worthwhile to perform the elimination.
15.4 MERIT FUNCTIONS AND FILTERS
Suppose that an algorithm for solving the nonlinear programming problem 15.1 generates a step that reduces the objective function but increases the violation of the constraints. Should we accept this step?
This question is not easy to answer. We must look for a way to balance the twin (often competing) goals of reducing the objective function and satisfying the constraints. Merit functions and filters are two approaches for achieving this balance. In a typical constrained optimization algorithm, a step p will be accepted only if it leads to a sufficient reduction in the merit function or if it is acceptable to the filter. These concepts are explained in the rest of the section.
MERIT FUNCTIONS
In unconstrained optimization, the objective function f is the natural choice for the merit function. All the unconstrained optimization methods described in this book require that f be decreased at each step (or at least within a certain number of iterations). In feasible methods for constrained optimization, in which the starting point and all subsequent iterates satisfy all the constraints in the problem, the objective function is still an appropriate merit function. On the other hand, algorithms that allow iterates to violate the constraints require some means to assess the quality of the steps and iterates. The merit function in this case combines the objective with measures of constraint violation.
A popular choice of merit function for the nonlinear programming problem (15.1) is the ℓ₁ penalty function defined by
\[
\phi_1(x; \mu) = f(x) + \mu \sum_{i \in \mathcal{E}} |c_i(x)| + \mu \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{15.24}
\]
where we use the notation [z]⁻ = max{0, −z}. The positive scalar μ is the penalty parameter, which determines the weight that we assign to constraint satisfaction relative to minimization of the objective. The ℓ₁ merit function φ₁ is not differentiable because of the presence of the absolute value and [·]⁻ functions, but it has the important property of being exact.
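In code, (15.24) is a one-liner. A minimal sketch (assuming NumPy; the function names and the convention that inequalities are written as c_i(x) ≥ 0 are ours):

```python
import numpy as np

def phi_1(x, mu, f, c_E, c_I):
    """l1 merit function (15.24): f plus mu times the constraint violation.

    c_E(x) and c_I(x) return arrays of equality and inequality constraint
    values, with inequalities written as c_i(x) >= 0.
    """
    viol = np.sum(np.abs(c_E(x))) + np.sum(np.maximum(0.0, -c_I(x)))
    return f(x) + mu * viol

# Evaluate the merit function for problem (15.5) at the (infeasible)
# unconstrained minimizer (2, 0.5):
f = lambda z: 0.5 * (z[0] - 2.0) ** 2 + 0.5 * (z[1] - 0.5) ** 2
c_E = lambda z: np.array([])
c_I = lambda z: np.array([1.0 / (z[0] + 1.0) - z[1] - 0.25, z[0], z[1]])
print(phi_1(np.array([2.0, 0.5]), 10.0, f, c_E, c_I))   # 10 * 5/12 = 4.1666...
```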
Definition 15.1 (Exact Merit Function).
A merit function φ(x; μ) is exact if there is a positive scalar μ* such that for any μ > μ*, any local solution of the nonlinear programming problem (15.1) is a local minimizer of φ(x; μ).

We show in Theorem 17.3 that, under certain assumptions, the ℓ₁ merit function φ₁(x; μ) is exact and that the threshold value μ* is given by max{|λᵢ*|, i ∈ E ∪ I},
where the λᵢ* denote the Lagrange multipliers associated with an optimal solution x*. Since the optimal Lagrange multipliers are, however, not known in advance, algorithms based on the ℓ₁ merit function contain rules for adjusting the penalty parameter whenever there is reason to believe that it is not large enough (or is excessively large). These rules depend on the choice of optimization algorithm and are discussed in the next chapters.
Another useful merit function is the exact ℓ₂ function, which for equality-constrained problems takes the form
\[
\phi_2(x; \mu) = f(x) + \mu \|c(x)\|_2. \tag{15.25}
\]
This function is nondifferentiable because the ℓ₂-norm term is not squared; its derivative is not defined at x for which c(x) = 0.
Some merit functions are both smooth and exact. To ensure that both properties hold, we must include additional terms in the merit function. For equality-constrained problems, Fletcher's augmented Lagrangian is given by
\[
\phi_F(x; \mu) = f(x) - \lambda(x)^T c(x) + \tfrac{\mu}{2} \sum_{i \in \mathcal{E}} c_i(x)^2, \tag{15.26}
\]
where μ > 0 is the penalty parameter and
\[
\lambda(x) = \bigl[ A(x) A(x)^T \bigr]^{-1} A(x) \nabla f(x). \tag{15.27}
\]
Here A(x) denotes the Jacobian of c(x). Although this merit function has some interesting theoretical properties, it has practical limitations, including the expense of solving for λ(x) in (15.27).
A quite different merit function is the standard augmented Lagrangian in x and λ, which for equality-constrained problems has the form
\[
\mathcal{L}_A(x, \lambda; \mu) = f(x) - \lambda^T c(x) + \tfrac{\mu}{2} \|c(x)\|_2^2. \tag{15.28}
\]
We assess the acceptability of a trial point (x⁺, λ⁺) by comparing the value of L_A(x⁺, λ⁺; μ) with the value at the current iterate, (x, λ). Strictly speaking, L_A is not a merit function in the sense that a solution (x*, λ*) of the nonlinear programming problem is not in general a minimizer of L_A(x, λ; μ), but only a stationary point. Although some sequential quadratic programming methods use L_A successfully as a merit function by adaptively modifying λ and μ, we will not consider its use as a merit function further. Instead, we will focus primarily on the nonsmooth exact penalty functions φ₁ and φ₂.
A trial step x⁺ = x + αp generated by a line search algorithm will be accepted if it produces a sufficient decrease in the merit function φ(x; μ). One way to define this concept is analogous to the condition (3.4) used in unconstrained optimization, where the amount
of decrease is not too small relative to the predicted change in the function over the step. The ℓ₁ and ℓ₂ merit functions are not differentiable, but they have a directional derivative. (See (A.51) for background on directional derivatives.) We write the directional derivative of φ(x; μ) in the direction p as
\[
D(\phi(x; \mu); p).
\]
In a line search method, the sufficient decrease condition requires the step-length parameter α > 0 to be small enough that the inequality
\[
\phi(x + \alpha p; \mu) \le \phi(x; \mu) + \eta\, \alpha\, D(\phi(x; \mu); p) \tag{15.29}
\]
is satisfied for some η ∈ (0, 1).
Trust-region methods typically use a quadratic model q(p) to estimate the value of the merit function after a step p; see Section 18.5. The sufficient decrease condition can be stated in terms of a decrease in this model, as follows:
\[
\phi(x + p; \mu) \le \phi(x; \mu) - \eta\, [\, q(0) - q(p) \,], \tag{15.30}
\]
for some η ∈ (0, 1). The final term in (15.30) is positive, because the step p is computed to decrease the model q.

FILTERS
Filter techniques are step acceptance mechanisms based on ideas from multiobjective optimization. Our derivation starts with the observation that nonlinear programming has two goals: minimization of the objective function and the satisfaction of the constraints. If we define a measure of infeasibility as
\[
h(x) = \sum_{i \in \mathcal{E}} |c_i(x)| + \sum_{i \in \mathcal{I}} [c_i(x)]^-, \tag{15.31}
\]
we can write these two goals as
\[
\min_x f(x) \qquad \text{and} \qquad \min_x h(x). \tag{15.32}
\]
Unlike merit functions, which combine both problems into a single minimization problem, filter methods keep the two goals in (15.32) separate. Filter methods accept a trial step x⁺ as a new iterate if the pair (f(x⁺), h(x⁺)) is not dominated by a previous pair (f_l, h_l) = (f(x_l), h(x_l)) generated by the algorithm. These concepts are defined as follows.
[Figure 15.6: Graphical illustration of a filter with four pairs (f_l, h_l), plotted in the (f(x), h(x)) plane.]

Definition 15.2.
(a) A pair (f_k, h_k) is said to dominate another pair (f_l, h_l) if both f_k ≤ f_l and h_k ≤ h_l.
(b) A filter is a list of pairs (f_l, h_l) such that no pair dominates any other.
(c) An iterate x_k is said to be acceptable to the filter if (f_k, h_k) is not dominated by any pair in the filter.
When an iterate x_k is acceptable to the filter, we normally add (f_k, h_k) to the filter and remove any pairs that are dominated by (f_k, h_k). Figure 15.6 shows a filter where each pair (f_l, h_l) in the filter is represented as a black dot. Every point in the filter creates an infinite rectangular region, and their union defines the set of pairs not acceptable to the filter. More specifically, a trial point x⁺ is acceptable to the filter if (f⁺, h⁺) lies below or to the left of the solid line in Figure 15.6.
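The dominance test and the filter update just described translate directly into code. A minimal sketch (the list-of-pairs representation and the sample data are implementation choices, not from the text):

```python
def acceptable(filter_pairs, f_new, h_new):
    """True if (f_new, h_new) is not dominated by any pair in the filter."""
    return not any(f_l <= f_new and h_l <= h_new
                   for (f_l, h_l) in filter_pairs)

def add_to_filter(filter_pairs, f_new, h_new):
    """Add the new pair and remove any pairs that it dominates."""
    kept = [(f_l, h_l) for (f_l, h_l) in filter_pairs
            if not (f_new <= f_l and h_new <= h_l)]
    kept.append((f_new, h_new))
    return kept

# A filter with four pairs, arranged as in Figure 15.6:
F = [(1.0, 4.0), (2.0, 2.5), (3.0, 1.0), (5.0, 0.2)]
print(acceptable(F, 2.5, 1.5))   # True: below/left of the filter envelope
print(acceptable(F, 4.0, 2.0))   # False: dominated by (3.0, 1.0)
F = add_to_filter(F, 2.5, 1.5)   # the accepted pair joins the filter
```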
To compare the filter and merit function approaches, we plot in Figure 15.7 the contour line of the set of pairs (f, h) such that f + μh = f_k + μh_k, where x_k is the current iterate. The region to the left of this line corresponds to the set of pairs that reduce the merit function φ(x; μ) = f(x) + μh(x); clearly this set is quite different from the set of points acceptable to the filter.
If a trial step x⁺ = x_k + α_k p_k generated by a line search method gives a pair (f⁺, h⁺) that is acceptable to the filter, we set x_{k+1} = x⁺; otherwise, a backtracking line search is performed. In a trust-region method, if the step is not acceptable to the filter, the trust region is reduced, and a new step is computed.
Several enhancements to this filter technique are needed to obtain global convergence and good practical performance. We need to ensure, first of all, that we do not accept a point whose (f, h) pair is very close to the current pair (f_k, h_k) or to another pair in the filter.
[Figure 15.7: Comparing the filter and merit function techniques; an isovalue line of the merit function is shown together with the filter pairs in the (f(x), h(x)) plane.]
We do so by modifying the acceptability criterion and imposing a sufficient decrease condition.
A trial iterate x⁺ is acceptable to the filter if, for all pairs (f_j, h_j) in the filter, we have that
\[
f(x^+) \le f_j - \beta h_j \qquad \text{or} \qquad h(x^+) \le h_j - \beta h_j, \tag{15.33}
\]
for β ∈ (0, 1). Although this condition is effective in practice (using, say, β = 10⁻⁵), for purposes of analysis it may be advantageous to replace the first inequality by
\[
f(x^+) \le f_j - \beta h(x^+).
\]
A second enhancement addresses some problematic aspects of the filter mechanism. Under certain circumstances, the search directions generated by line search methods may require arbitrarily small step lengths α_k to be acceptable to the filter. This phenomenon can cause the algorithm to stall and fail. To guard against this situation, if the backtracking line search generates a step length that is smaller than a given threshold α_min, the algorithm switches to a feasibility restoration phase, which we describe below. Similarly, in a trust-region method, if a sequence of trial steps is rejected by the filter, the trust-region radius may be decreased so much that the trust-region subproblem becomes infeasible (see Section 18.5). In this case, too, the feasibility restoration phase is invoked. Other mechanisms could be employed to handle this situation, but as we discuss below, the feasibility restoration phase can help the algorithm achieve other useful goals.
The feasibility restoration phase aims exclusively to reduce the constraint violation, that is, to find an approximate solution to the problem
\[
\min_x h(x).
\]
Although h(x) defined by (15.31) is not smooth, we show in Chapter 17 how to minimize it using a smooth constrained optimization subproblem. This phase terminates at an iterate that has a sufficiently small value of h and is compatible with the filter.
We now present a framework for filter methods that assumes that iterates are generated by a trust-region method; see Section 18.5 for a discussion of trust-region methods for constrained optimization.
Algorithm 15.1 (General Filter Method).
Choose a starting point x₀ and an initial trust-region radius Δ₀;
Set k ← 0;
repeat until a convergence test is satisfied
    if the step-generation subproblem is infeasible
        Compute x_{k+1} using the feasibility restoration phase;
    else
        Compute a trial iterate x⁺ = x_k + p_k;
        if (f⁺, h⁺) is acceptable to the filter
            Set x_{k+1} ← x⁺ and add (f_{k+1}, h_{k+1}) to the filter;
            Remove all pairs from the filter that are dominated by (f_{k+1}, h_{k+1});
            Choose Δ_{k+1} such that Δ_{k+1} ≥ Δ_k;
        else
            Reject the step, set x_{k+1} ← x_k;
            Choose Δ_{k+1} < Δ_k;
        end (if)
    end (if)
    k ← k + 1;
end (repeat)
Other enhancements of this simple filter framework are used in practice; they depend on the choice of algorithm and will be discussed in subsequent chapters.
15.5 THE MARATOS EFFECT
Some algorithms based on merit functions or filters may fail to converge rapidly because they reject steps that make good progress toward a solution. This undesirable phenomenon is often called the Maratos effect, because it was first observed by Maratos [199]. It is illustrated by the following example, in which steps p_k, which would yield quadratic convergence if accepted, cause an increase both in the objective function value and the constraint violation.
[Figure 15.8: Maratos effect in Example 15.4: contours of f and the constraint x₁² + x₂² = 1. The constraint is no longer satisfied after the step from x_k to x_k + p_k, and the objective value has increased.]
EXAMPLE 15.4 (POWELL [255])
Consider the problem
\[
\min\; f(x_1, x_2) = 2(x_1^2 + x_2^2 - 1) - x_1, \quad \text{subject to} \quad x_1^2 + x_2^2 - 1 = 0. \tag{15.34}
\]
One can verify (see Figure 15.8) that the optimal solution is x* = (1, 0)ᵀ, that the corresponding Lagrange multiplier is λ* = 3/2, and that ∇²ₓₓL(x*, λ*) = I.
Let us consider an iterate x_k of the form x_k = (cos θ, sin θ)ᵀ, which is feasible for any value of θ. Suppose that our algorithm computes the following step:
\[
p_k = \begin{pmatrix} \sin^2\theta \\ -\sin\theta \cos\theta \end{pmatrix}, \tag{15.35}
\]
which yields a trial point
\[
x_k + p_k = \begin{pmatrix} \cos\theta + \sin^2\theta \\ \sin\theta (1 - \cos\theta) \end{pmatrix}.
\]
By using elementary trigonometric identities, we have that
\[
\|x_k + p_k - x^*\|_2 = 2 \sin^2(\theta/2), \qquad \|x_k - x^*\|_2 = 2 \left| \sin(\theta/2) \right|,
\]
and therefore
\[
\frac{\|x_k + p_k - x^*\|_2}{\|x_k - x^*\|_2^2} = \frac{1}{2}.
\]
Hence, this step approaches the solution at a rate consistent with Q-quadratic convergence. However, we have that
\[
f(x_k + p_k) = \sin^2\theta - \cos\theta > -\cos\theta = f(x_k), \qquad c(x_k + p_k) = \sin^2\theta > c(x_k) = 0,
\]
so that, as can be seen in Figure 15.8, both the objective function value and the constraint violation increase over this step. This behavior occurs for any nonzero value of θ, even if the
initial point is arbitrarily close to the solution.
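The claims of this example are easy to confirm numerically. A small sketch (assuming NumPy; θ = 0.1 is an arbitrary test value):

```python
import numpy as np

f = lambda x: 2.0 * (x[0] ** 2 + x[1] ** 2 - 1.0) - x[0]
c = lambda x: x[0] ** 2 + x[1] ** 2 - 1.0
x_star = np.array([1.0, 0.0])

theta = 0.1
x_k = np.array([np.cos(theta), np.sin(theta)])   # feasible iterate
p_k = np.array([np.sin(theta) ** 2,              # the step (15.35)
                -np.sin(theta) * np.cos(theta)])
x_trial = x_k + p_k

# The ratio derived in the text: exactly 1/2, i.e. a Q-quadratic step.
print(np.linalg.norm(x_trial - x_star) / np.linalg.norm(x_k - x_star) ** 2)

# Yet both the objective and the constraint violation increase:
print(f(x_trial) > f(x_k), abs(c(x_trial)) > abs(c(x_k)))   # True True
```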
In the example above, any algorithm that requires reduction of a merit function of the form
\[
\phi(x; \mu) = f(x) + \mu\, h(c(x)),
\]
where h is a nonnegative function satisfying h(0) = 0, will reject the good step (15.35). Examples of such merit functions include the ℓ₁ and ℓ₂ penalty functions. The step (15.35) will also be rejected by the filter mechanism described above because the pair (f(x_k + p_k), h(x_k + p_k)) is dominated by (f_k, h_k). Therefore, all these approaches will suffer from the Maratos effect.
If no remedial measures are taken, the Maratos effect can slow optimization methods by interfering with good steps away from the solution and by preventing superlinear convergence. Strategies for avoiding the Maratos effect include the following.
1. We can use a merit function that does not suffer from the Maratos effect. An example is Fletcher's augmented Lagrangian function (15.26).
2. We can use a second-order correction, in which we add to p_k a step p̂_k, which is computed at c(x_k + p_k) and which decreases the constraint violation.
3. We can allow the merit function to increase on certain iterations; that is, we can use a nonmonotone strategy.
We discuss the last two approaches in the next section.
15.6 SECOND-ORDER CORRECTION AND NONMONOTONE TECHNIQUES
By adding a correction term that decreases the constraint violation, various algorithms are able to overcome the difficulties associated with the Maratos effect. We describe this technique with respect to the equality-constrained problem, in which the constraint is c(x) = 0, where c: ℝⁿ → ℝ^{|E|}.
Given a step p_k, the second-order correction step p̂_k is defined to be
\[
\hat{p}_k = -A_k^T (A_k A_k^T)^{-1} c(x_k + p_k), \tag{15.36}
\]
where A_k = A(x_k) is the Jacobian of c at x_k. Note that p̂_k has the property that it satisfies a linearization of the constraint c at the point x_k + p_k, that is,
\[
A_k \hat{p}_k + c(x_k + p_k) = 0.
\]
In fact, p̂_k is the minimum-norm solution of this equation. A different interpretation of the second-order correction is given in Section 18.3.
The effect of the correction step p̂_k is to decrease the quantity ‖c(x)‖ to the order of ‖x_k − x*‖³, provided the primary step p_k satisfies A_k p_k + c(x_k) = 0. This estimate indicates that the step from x_k to x_k + p_k + p̂_k will decrease the merit function, at least near the solution. The cost of this enhancement includes the additional evaluation of the constraint function c at x_k + p_k and the linear algebra required to calculate the step p̂_k from (15.36).
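A direct transcription of (15.36), tested on the data of Example 15.4 (a sketch assuming NumPy; a production code would reuse a factorization of A_k A_kᵀ rather than the dense solve used here):

```python
import numpy as np

def second_order_correction(A_k, c_trial):
    """Step p_hat of (15.36): minimum-norm solution of A_k p + c_trial = 0."""
    return -A_k.T @ np.linalg.solve(A_k @ A_k.T, c_trial)

# Example 15.4: c(x) = x1^2 + x2^2 - 1, with Jacobian A(x) = 2 x^T.
c = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 1.0])
theta = 0.1
x_k = np.array([np.cos(theta), np.sin(theta)])
p_k = np.array([np.sin(theta) ** 2, -np.sin(theta) * np.cos(theta)])

A_k = 2.0 * x_k.reshape(1, 2)                   # Jacobian of c at x_k
p_hat = second_order_correction(A_k, c(x_k + p_k))
print(abs(c(x_k + p_k)[0]))                     # violation after p_k alone
print(abs(c(x_k + p_k + p_hat)[0]))             # far smaller after correction
```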
We now describe an algorithm that uses a merit function together with a line search strategy and a second-order correction step. We assume that the search direction p_k and the penalty parameter μ_k are computed so that p_k is a descent direction for the merit function, that is, D(φ(x_k; μ); p_k) < 0. In Chapters 18 and 19, we discuss how to accomplish these goals. The key feature of the algorithm is that, if the full step α_k = 1 does not produce satisfactory descent in the merit function, we try the second-order correction step before backtracking along the original direction p_k.
Algorithm 15.2 (Generic Algorithm with Second-Order Correction).
Choose parameters η ∈ (0, 0.5) and τ₁, τ₂ with 0 < τ₁ < τ₂ < 1;
Choose initial point x₀; set k ← 0;
repeat until a convergence test is satisfied:
    Compute a search direction p_k;
    Set α_k ← 1, newpoint ← false;
    while newpoint = false
        if φ(x_k + α_k p_k; μ) ≤ φ(x_k; μ) + η α_k D(φ(x_k; μ); p_k)
            Set x_{k+1} ← x_k + α_k p_k;
            Set newpoint ← true;
        else if α_k = 1
            Compute p̂_k from (15.36);
            if φ(x_k + p_k + p̂_k; μ) ≤ φ(x_k; μ) + η D(φ(x_k; μ); p_k)
                Set x_{k+1} ← x_k + p_k + p̂_k;
                Set newpoint ← true;
            else
                Choose new α_k in [τ₁ α_k, τ₂ α_k];
            end (if)
        else
            Choose new α_k in [τ₁ α_k, τ₂ α_k];
        end (if)
    end (while)
    k ← k + 1;
end (repeat)
In this algorithm, the full second-order correction step p̂_k is discarded if it does not produce a reduction in the merit function. We do not backtrack along the direction p_k + p̂_k because it is not guaranteed to be a descent direction for the merit function. A variation of this algorithm applies the second-order correction step only if the sufficient decrease condition (15.29) is violated as a result of an increase in the norm of the constraints.
The second-order correction strategy is effective in practice. The cost of performing the extra constraint function evaluation and an additional backsolve in (15.36) is outweighed by added robustness and efficiency.
NONMONOTONE WATCHDOG STRATEGY
The inefficiencies caused by the Maratos effect can also be avoided by occasionally accepting steps that increase the merit function; such steps are called relaxed steps. There is a limit to our tolerance, however. If a sufficient reduction of the merit function has not been obtained within a certain number of iterates of the relaxed step (t iterates, say), then we return to the iterate before the relaxed step and perform a normal iteration, using a line search or some other technique to force a reduction in the merit function.
In contrast with the secondorder correction, which aims only to improve satisfaction of the constraints, this nonmonotone strategy always takes regular steps pk of the algorithm that aim both for improved feasibility and optimality. The hope is that any increase in the merit function over a single step will be temporary, and that subsequent steps will more than compensate for it.
We now describe a particular instance of the nonmonotone approach called the watchdog strategy. We set t = 1, so that we allow the merit function to increase on just a single step before insisting on a sufficient decrease in the merit function. As above, we focus our discussion on a line search algorithm that uses a nonsmooth merit function φ. We assume that the penalty parameter μ is not changed until a successful cycle has been
completed. To simplify the notation, we omit the dependence of φ on μ and write the merit function as φ(x) and the directional derivative as D(φ(x); p_k).
Algorithm 15.3 (Watchdog).
Choose a constant η ∈ (0, 0.5) and an initial point x₀;
Set k ← 0, S ← {0};
repeat until a termination test is satisfied
    Compute a step p_k;
    Set x_{k+1} ← x_k + p_k;
    if φ(x_{k+1}) ≤ φ(x_k) + η D(φ(x_k); p_k)
        k ← k + 1, S ← S ∪ {k};
    else
        Compute a search direction p_{k+1} from x_{k+1};
        Find α_{k+1} such that φ(x_{k+2}) ≤ φ(x_{k+1}) + η α_{k+1} D(φ(x_{k+1}); p_{k+1});
        Set x_{k+2} ← x_{k+1} + α_{k+1} p_{k+1};
        if φ(x_{k+1}) ≤ φ(x_k) or φ(x_{k+2}) ≤ φ(x_k) + η D(φ(x_k); p_k)
            k ← k + 2, S ← S ∪ {k};
        else if φ(x_{k+2}) > φ(x_k)
            (* return to x_k and search along p_k *)
            Find α_k such that φ(x_{k+3}) ≤ φ(x_k) + η α_k D(φ(x_k); p_k);
            Compute x_{k+3} ← x_k + α_k p_k;
            k ← k + 3, S ← S ∪ {k};
        else
            Compute a direction p_{k+2} from x_{k+2};
            Find α_{k+2} such that φ(x_{k+3}) ≤ φ(x_{k+2}) + η α_{k+2} D(φ(x_{k+2}); p_{k+2});
            Set x_{k+3} ← x_{k+2} + α_{k+2} p_{k+2};
            k ← k + 3, S ← S ∪ {k};
        end (if)
    end (if)
end (repeat)
The set S is not required by the algorithm and is introduced only to identify the iterates for which a sufficient merit function reduction was obtained. Note that at least a third of the iterates have their indices in S. By using this fact, one can show that various constrained optimization methods that use the watchdog technique are globally convergent. One can also show that for all sufficiently large k, the step length is α_k = 1 and the convergence rate is superlinear.
In practice, it may be advantageous to allow increases in the merit function for more than one iteration. Values of t such as 5 or 8 are typical. As this discussion indicates, careful implementations of the watchdog technique have a certain degree of complexity, but
the added complexity is worthwhile because the approach has good practical performance. A potential advantage of the watchdog technique over the second-order correction strategy is that it may require fewer evaluations of the constraint functions. In the best case, most of the steps will be full steps, and there will rarely be a need to return to an earlier point.
NOTES AND REFERENCES
Techniques for eliminating linear constraints are described, for example, in Fletcher [101] and Gill, Murray, and Wright [131]. For a thorough discussion of merit functions, see Boggs and Tolle [33] and Conn, Gould, and Toint [74]. Some of the earliest references on nonmonotone methods include Grippo, Lampariello, and Lucidi [158] and Chamberlain et al. [57]; see [74] for a review of nonmonotone techniques and an extensive list of references. The concept of a filter was introduced by Fletcher and Leyffer [105]; our discussion of filters is based on that paper. Second-order correction steps are motivated and discussed in Fletcher [101].
EXERCISES
15.1 In Example 15.1, consider these three choices of the working set: W = {3}, W = {1, 2}, W = {2, 3}. Show that none of these working sets is the optimal active set for (15.5).
15.2 For the problem in Example 15.3, perform simple elimination of the variables x₂ and x₅ to obtain an unconstrained problem in the remaining variables x₁, x₃, x₄, and x₆. Similarly to (15.12), express the eliminated variables explicitly in terms of the retained variables.
15.3 Do the following problems have solutions? Explain.
\[
\min\; x_1 + x_2 \quad \text{subject to} \quad x_1^2 + x_2^2 = 2, \; 0 \le x_1 \le 1, \; 0 \le x_2 \le 1;
\]
\[
\min\; x_1 + x_2 \quad \text{subject to} \quad x_1^2 + x_2^2 \le 1, \; x_1 + x_2 = 3;
\]
\[
\min\; x_1 x_2 \quad \text{subject to} \quad x_1 + x_2 = 2.
\]
15.4 Show that if in Example 15.2 we eliminate x in terms of y, then the correct solution of the problem is obtained by performing unconstrained minimization.
15.5 Show that the columns of the basis matrices Y and Z in (15.15) form a linearly independent set.
15.6 Show that the particular solution Q₁R^{−T}Πᵀb of Ax = b is identical to Aᵀ(AAᵀ)⁻¹b.
15.7 In this exercise we compute basis matrices that attempt to compromise between the orthonormal basis (15.22) and simple elimination (15.15). We assume that the basis matrix is given by the first m columns of A, so that P = I in (15.7), and define
\[
Y = \begin{bmatrix} I \\ (B^{-1} N)^T \end{bmatrix}, \qquad Z = \begin{bmatrix} -B^{-1} N \\ I \end{bmatrix}. \tag{15.37}
\]
(a) Show that the columns of Y and Z are no longer of norm 1 and that the relations AZ = 0 and YᵀZ = 0 hold. Therefore, the columns of Y and Z form a linearly independent set, showing that (15.37) is a valid choice of the basis matrices.
(b) Show that the particular solution Y(AY)⁻¹b defined by this choice of Y is, as in the orthogonal factorization approach, the minimum-norm solution of Ax = b. More specifically, show that
\[
Y (AY)^{-1} = A^T (A A^T)^{-1}.
\]
It follows that the matrix Y(AY)⁻¹ is independent of the choice of basis matrix B in (15.7), and its conditioning is determined by that of A alone. Note, however, that the matrix Z still depends explicitly on B, so a careful choice of B is needed to ensure well conditioning in this part of the computation.
15.8 Verify that by adding the inequality constraint 3x₁ + 2x₃ ≥ 1 to the problem (15.11), the elimination (15.12) transforms the problem into one of minimizing the function (15.13) subject to the inequality constraint (15.23).
CHAPTER 16
Quadratic Programming
An optimization problem with a quadratic objective function and linear constraints is called a quadratic program. Problems of this type are important in their own right, and they also arise as subproblems in methods for general constrained optimization, such as sequential quadratic programming (Chapter 18), augmented Lagrangian methods (Chapter 17), and interior-point methods (Chapter 19).
The general quadratic program (QP) can be stated as
\[
\min_x\; q(x) = \tfrac12 x^T G x + x^T c \tag{16.1a}
\]
subject to
\[
a_i^T x = b_i, \quad i \in \mathcal{E}, \tag{16.1b}
\]
\[
a_i^T x \ge b_i, \quad i \in \mathcal{I}, \tag{16.1c}
\]
where G is a symmetric n × n matrix, E and I are finite sets of indices, and c, x, and aᵢ, i ∈ E ∪ I, are vectors in ℝⁿ. Quadratic programs can always be solved (or shown to be infeasible) in a finite amount of computation, but the effort required to find a solution depends strongly on the characteristics of the objective function and the number of inequality constraints. If the Hessian matrix G is positive semidefinite, we say that (16.1) is a convex QP, and in this case the problem is often similar in difficulty to a linear program. (Strictly convex QPs are those in which G is positive definite.) Nonconvex QPs, in which G is an indefinite matrix, can be more challenging because they can have several stationary points and local minima.
In this chapter we focus primarily on convex quadratic programs. We start by considering an interesting application of quadratic programming.
EXAMPLE 16.1 PORTFOLIO OPTIMIZATION
Every investor knows that there is a tradeoff between risk and return: To increase the expected return on investment, an investor must be willing to tolerate greater risks. Portfolio theory studies how to model this tradeoff given a collection of n possible investments with returns rᵢ, i = 1, 2, ..., n. The returns rᵢ are usually not known in advance and are often assumed to be random variables that follow a normal distribution. We can characterize these variables by their expected value μᵢ = E[rᵢ] and their variance σᵢ² = E[(rᵢ − μᵢ)²]. The variance measures the fluctuations of the variable rᵢ about its mean, so that larger values of σᵢ indicate riskier investments. The returns are not in general independent, and we can define correlations between pairs of returns as follows:
\[
\rho_{ij} = \frac{E[(r_i - \mu_i)(r_j - \mu_j)]}{\sigma_i \sigma_j}, \quad \text{for } i, j = 1, 2, \ldots, n.
\]
The correlation measures the tendency of the return on investments i and j to move in the same direction. Two investments whose returns tend to rise and fall together have a positive correlation; the nearer ρᵢⱼ is to 1, the more closely the two investments track each other. Investments whose returns tend to move in opposite directions have a negative correlation.
An investor constructs a portfolio by putting a fraction xᵢ of the available funds into investment i, for i = 1, 2, ..., n. Assuming that all available funds are invested and that short-selling is not allowed, the constraints are Σᵢ₌₁ⁿ xᵢ = 1 and x ≥ 0. The return on the portfolio is given by
\[
R = \sum_{i=1}^{n} x_i r_i. \tag{16.2}
\]
To measure the desirability of the portfolio, we need to obtain measures of its expected return and variance. The expected return is simply
\[
E[R] = E\Bigl[ \sum_{i=1}^{n} x_i r_i \Bigr] = \sum_{i=1}^{n} x_i E[r_i] = x^T \mu,
\]
while the variance is given by
\[
\mathrm{Var}(R) = E[(R - E[R])^2] = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j \sigma_i \sigma_j \rho_{ij} = x^T G x,
\]
where the n × n symmetric positive semidefinite matrix G defined by
\[
G_{ij} = \rho_{ij} \sigma_i \sigma_j
\]
is called the covariance matrix.

Ideally, we would like to find a portfolio for which the expected return xᵀμ is large while the variance xᵀGx is small. In the model proposed by Markowitz [201], we combine these two aims into a single objective function with the aid of a risk tolerance parameter denoted by κ, and we solve the following problem to find the optimal portfolio:
\[
\max\; x^T \mu - \kappa\, x^T G x, \qquad \text{subject to} \qquad \sum_{i=1}^{n} x_i = 1, \; x \ge 0.
\]
The value chosen for the nonnegative parameter κ depends on the preferences of the individual investor. Conservative investors, who place more emphasis on minimizing risk in their portfolio, would choose a large value of κ to increase the weight of the variance measure in the objective function. More daring investors, who are prepared to take on more risk in the hope of a higher expected return, would choose a smaller value of κ.
The difficulty in applying this portfolio optimization technique to reallife investing lies in defining the expected returns, variances, and correlations for the investments in question. Financial professionals often combine historical data with their own insights and expectations to produce values of these quantities.
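As a concrete (entirely hypothetical) instance, the sketch below builds the covariance matrix G_ij = ρ_ij σ_i σ_j for three made-up investments and solves the Markowitz problem with SciPy's general-purpose solver; a dedicated QP code would normally be used instead:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data for n = 3 investments: expected returns mu_i,
# volatilities sigma_i, and correlations rho_ij.
mu = np.array([0.08, 0.12, 0.05])
sigma = np.array([0.15, 0.25, 0.05])
rho = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, -0.1],
                [0.0, -0.1, 1.0]])
G = rho * np.outer(sigma, sigma)     # covariance matrix G_ij = rho_ij s_i s_j

kappa = 2.0                          # risk-tolerance parameter
neg_utility = lambda x: -(mu @ x) + kappa * (x @ G @ x)

res = minimize(neg_utility, x0=np.full(3, 1.0 / 3.0),
               constraints=[{'type': 'eq', 'fun': lambda x: x.sum() - 1.0}],
               bounds=[(0.0, None)] * 3)   # fully invested, no short-selling
x = res.x
print(x.round(3), 'return:', round(mu @ x, 4), 'variance:', round(x @ G @ x, 5))
```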
16.1 EQUALITY-CONSTRAINED QUADRATIC PROGRAMS
We begin our discussion of algorithms for quadratic programming by considering the case in which only equality constraints are present. Techniques for this special case are applicable also to problems with inequality constraints since, as we see later in this chapter, some algorithms for general QP require the solution of an equality-constrained QP at each iteration.
PROPERTIES OF EQUALITY-CONSTRAINED QPs
For simplicity, we write the equality constraints in matrix form and state the equality-constrained QP as follows:
\[
\min_x\; q(x) \stackrel{\text{def}}{=} \tfrac12 x^T G x + x^T c \tag{16.3a}
\]
\[
\text{subject to} \quad Ax = b, \tag{16.3b}
\]
where A is the m × n Jacobian of constraints (with m ≤ n) whose rows are aᵢᵀ, i ∈ E, and b is the vector in ℝᵐ whose components are bᵢ, i ∈ E. For the present, we assume that A has full row rank (rank m) so that the constraints (16.3b) are consistent. (In Section 16.8 we discuss the case in which A is rank deficient.)
The first-order necessary conditions for x* to be a solution of (16.3) state that there is a vector λ* such that the following system of equations is satisfied:
\[
\begin{bmatrix} G & -A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x^* \\ \lambda^* \end{bmatrix} = \begin{bmatrix} -c \\ b \end{bmatrix}. \tag{16.4}
\]
These conditions are a consequence of the general result for first-order optimality conditions, Theorem 12.1. As in Chapter 12, we call λ* the vector of Lagrange multipliers. The system (16.4) can be rewritten in a form that is useful for computation by expressing x* as x* = x + p, where x is some estimate of the solution and p is the desired step. By introducing this notation and rearranging the equations, we obtain
\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} -p \\ \lambda^* \end{bmatrix} = \begin{bmatrix} g \\ h \end{bmatrix}, \tag{16.5}
\]
where
\[
h = Ax - b, \qquad g = c + Gx, \qquad p = x^* - x. \tag{16.6}
\]
The matrix in (16.5) is called the Karush-Kuhn-Tucker (KKT) matrix, and the following result gives conditions under which it is nonsingular. As in Chapter 15, we use Z to
denote the n × (n − m) matrix whose columns are a basis for the null space of A. That is, Z has full rank and satisfies AZ = 0.
Lemma 16.1.
Let A have full row rank, and assume that the reduced-Hessian matrix ZᵀGZ is positive definite. Then the KKT matrix
\[
K = \begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix} \tag{16.7}
\]
is nonsingular, and hence there is a unique vector pair (x*, λ*) satisfying (16.4).

PROOF. Suppose there are vectors w and v such that
\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} w \\ v \end{bmatrix} = 0. \tag{16.8}
\]
Since Aw = 0, we have from (16.8) that
\[
0 = \begin{bmatrix} w \\ v \end{bmatrix}^T \begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} w \\ v \end{bmatrix} = w^T G w.
\]
Since w lies in the null space of A, it can be written as w = Zu for some vector u ∈ ℝ^{n−m}. Therefore, we have
\[
0 = w^T G w = u^T Z^T G Z u,
\]
which by positive definiteness of ZᵀGZ implies that u = 0. Therefore, w = 0, and by (16.8), Aᵀv = 0. Full row rank of A then implies that v = 0. We conclude that equation (16.8) is satisfied only if w = 0 and v = 0, so the matrix is nonsingular, as claimed.
EXAMPLE 16.2
Consider the quadratic programming problem
\[
\min\; q(x) = 3x_1^2 + 2x_1 x_2 + x_1 x_3 + 2.5 x_2^2 + 2 x_2 x_3 + 2 x_3^2 - 8x_1 - 3x_2 - 3x_3, \tag{16.9}
\]
subject to
\[
x_1 + x_3 = 3, \qquad x_2 + x_3 = 0.
\]
We can write this problem in the form (16.3) by defining
\[
G = \begin{bmatrix} 6 & 2 & 1 \\ 2 & 5 & 2 \\ 1 & 2 & 4 \end{bmatrix}, \quad c = \begin{bmatrix} -8 \\ -3 \\ -3 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 3 \\ 0 \end{bmatrix}.
\]
The solution x* and optimal Lagrange multiplier vector λ* are given by
\[
x^* = (2, -1, 1)^T, \qquad \lambda^* = (3, -2)^T.
\]
In this example, the matrix G is positive definite, and the null-space basis matrix can be defined as in (15.15), giving
\[
Z = (-1, -1, 1)^T. \tag{16.10}
\]
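The solution claimed in this example can be checked by assembling and solving the system (16.4) directly (a minimal sketch, assuming NumPy):

```python
import numpy as np

G = np.array([[6.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

# Assemble (16.4): [G -A^T; A 0] [x; lambda] = [-c; b].
K = np.block([[G, -A.T], [A, np.zeros((2, 2))]])
sol = np.linalg.solve(K, np.concatenate([-c, b]))
print(sol[:3])   # x*      = [ 2. -1.  1.]
print(sol[3:])   # lambda* = [ 3. -2.]
```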
We have seen that when the conditions of Lemma 16.1 are satisfied, there is a unique vector pair (x*, λ*) that satisfies the first-order necessary conditions for (16.3). In fact, the second-order sufficient conditions (see Theorem 12.6) are also satisfied at (x*, λ*), so x* is a strict local minimizer of (16.3). In fact, we can use a direct argument to show that x* is a global solution of (16.3).
Theorem 16.2.
Let A have full row rank and assume that the reduced-Hessian matrix ZᵀGZ is positive definite. Then the vector x* satisfying (16.4) is the unique global solution of (16.3).
PROOF. Let x be any other feasible point satisfying Ax = b, and as before, let p denote the difference x* − x. Since Ax = Ax* = b, we have that Ap = 0. By substituting into the objective function (16.3a), we obtain
\[
q(x) = \tfrac12 (x^* - p)^T G (x^* - p) + c^T (x^* - p) = \tfrac12 p^T G p - p^T G x^* - c^T p + q(x^*). \tag{16.11}
\]
From (16.4) we have that Gx* = −c + Aᵀλ*, so from Ap = 0 we have that
\[
p^T G x^* = p^T (-c + A^T \lambda^*) = -p^T c.
\]
By substituting this relation into (16.11), we obtain
\[
q(x) = \tfrac12 p^T G p + q(x^*).
\]
Since p lies in the null space of A, we can write p = Zu for some vector u ∈ ℝ^{n−m}, so that
\[
q(x) = \tfrac12 u^T Z^T G Z u + q(x^*).
\]
By positive definiteness of ZᵀGZ, we conclude that q(x) > q(x*) except when u = 0, that is, when x = x*. Therefore, x* is the unique global solution of (16.3).
When the reduced Hessian matrix ZᵀGZ is positive semidefinite with zero eigenvalues, the vector x* satisfying (16.4) is a local minimizer but not a strict local minimizer. If the reduced Hessian has negative eigenvalues, then x* is only a stationary point, not a local minimizer.
16.2 DIRECT SOLUTION OF THE KKT SYSTEM
In this section we discuss efficient methods for solving the KKT system (16.5). The first important observation is that if m ≥ 1, the KKT matrix is always indefinite. We define the inertia of a symmetric matrix K to be the scalar triple that indicates the numbers n₊, n₋, and n₀ of positive, negative, and zero eigenvalues, respectively, that is,
\[
\mathrm{inertia}(K) = (n_+, n_-, n_0).
\]
The following result characterizes the inertia of the KKT matrix.
Theorem 16.3.
Let K be defined by (16.7), and suppose that A has rank m. Then
\[
\mathrm{inertia}(K) = \mathrm{inertia}(Z^T G Z) + (m, m, 0).
\]
Therefore, if ZᵀGZ is positive definite, inertia(K) = (n, m, 0).
The proof of this result is given in [111], for example. Note that the assumptions of this theorem are satisfied by Example 16.2. Hence, if we construct the 5 × 5 matrix K using the data of this example, we obtain inertia(K) = (3, 2, 0).
Knowing that the KKT system is indefinite, we now describe the main direct techniques used to solve (16.5).
FACTORING THE FULL KKT SYSTEM
One option for solving (16.5) is to perform a triangular factorization on the full KKT matrix and then perform backward and forward substitution with the triangular factors. Because of indefiniteness, we cannot use the Cholesky factorization. We could use Gaussian
elimination with partial pivoting or a sparse variant thereof to obtain the L and U factors, but this approach has the disadvantage that it ignores the symmetry.
The most effective strategy in this case is to use a symmetric indefinite factorization, which we have discussed in Chapter 3 and the Appendix. For a general symmetric matrix K, this factorization has the form
\[
P^T K P = L B L^T, \tag{16.12}
\]
where P is a permutation matrix, L is unit lower triangular, and B is block-diagonal with either 1 × 1 or 2 × 2 blocks. The symmetric permutations defined by the matrix P are introduced for numerical stability of the computation and, in the case of large sparse K, for maintaining sparsity. The computational cost of the symmetric indefinite factorization (16.12) is typically about half the cost of sparse Gaussian elimination.
To solve (16.5), we first compute the factorization (16.12) of the coefficient matrix. We then perform the following sequence of operations to arrive at the solution:
\[
\begin{aligned}
&\text{solve } L z = P^T \begin{bmatrix} g \\ h \end{bmatrix} && \text{to obtain } z; \\
&\text{solve } B \hat{z} = z && \text{to obtain } \hat{z}; \\
&\text{solve } L^T \bar{z} = \hat{z} && \text{to obtain } \bar{z}; \\
&\text{set } \begin{bmatrix} -p \\ \lambda^* \end{bmatrix} = P \bar{z}.
\end{aligned}
\]
Since multiplications with the permutation matrices P and Pᵀ can be performed by simply rearranging vector components, they are inexpensive. Solution of the system Bẑ = z entails solving a number of small 1 × 1 and 2 × 2 systems, so the number of operations is a small multiple of the system dimension (m + n), again inexpensive. Triangular substitutions with L and Lᵀ are more costly. Their precise cost depends on the amount of sparsity, but is usually significantly less than the cost of performing the factorization (16.12).
This approach of factoring the full (n + m) × (n + m) KKT matrix (16.7) is quite effective on many problems. It may be expensive, however, when the heuristics for choosing the permutation matrix P are not able to maintain sparsity in the L factor, so that L becomes much more dense than the original coefficient matrix.
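The sketch below (assuming SciPy's scipy.linalg.ldl and reusing the data of Example 16.2) computes a symmetric indefinite factorization of the KKT matrix K of (16.7) and reads off its inertia, which matches the value (3, 2, 0) given after Theorem 16.3:

```python
import numpy as np
from scipy.linalg import ldl

G = np.array([[6.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 4.0]])
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
K = np.block([[G, A.T], [A, np.zeros((2, 2))]])   # KKT matrix (16.7)

# Symmetric indefinite factorization P^T K P = L B L^T, as in (16.12).
L, B, perm = ldl(K)    # B is block diagonal with 1x1 and 2x2 blocks

# By Sylvester's law of inertia, inertia(K) = inertia(B), and B's
# eigenvalues are cheap to compute because its blocks are tiny.
eigs = np.linalg.eigvalsh(B)
pos, neg = int(np.sum(eigs > 1e-12)), int(np.sum(eigs < -1e-12))
print((pos, neg, len(eigs) - pos - neg))   # (3, 2, 0)
```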
SCHUR-COMPLEMENT METHOD
Assuming that G is positive definite, we can multiply the first equation in (16.5) by AG⁻¹ and then subtract the second equation to obtain a linear system in the vector λ* alone:
\[
(A G^{-1} A^T) \lambda^* = A G^{-1} g - h. \tag{16.13}
\]
We solve this symmetric positive definite system for λ* and then recover p from the first equation in (16.5) by solving
\[
G p = A^T \lambda^* - g. \tag{16.14}
\]
This approach requires us to perform operations with G⁻¹, as well as to compute the factorization of the m × m matrix AG⁻¹Aᵀ. Therefore, it is most useful when:
- G is well conditioned and easy to invert (for instance, when G is diagonal or block-diagonal); or
- G⁻¹ is known explicitly through a quasi-Newton updating formula; or
- the number of equality constraints m is small, so that the number of backsolves needed to form the matrix AG⁻¹Aᵀ is not too large.
The name Schur-complement method derives from the fact that, by applying block Gaussian elimination to (16.7) using G as the pivot, we obtain the block upper triangular system
\[
\begin{bmatrix} G & A^T \\ 0 & -A G^{-1} A^T \end{bmatrix}. \tag{16.15}
\]
In linear algebra terminology, the matrix AG⁻¹Aᵀ is the Schur complement of G in the matrix K of (16.7). By applying this block elimination technique to the system (16.5), and performing a block backsolve, we obtain (16.13), (16.14).
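A dense-matrix transcription of (16.13)-(16.14) (a sketch assuming NumPy; it reuses Example 16.2 with the estimate x = 0, so that g = c and h = −b in (16.6)):

```python
import numpy as np

G = np.array([[6.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

x = np.zeros(3)                      # current estimate of the solution
g, h = c + G @ x, A @ x - b          # as in (16.6)

Ginv_AT = np.linalg.solve(G, A.T)    # G^{-1} A^T, one backsolve per column
Ginv_g = np.linalg.solve(G, g)

# Schur-complement system (16.13): (A G^{-1} A^T) lambda = A G^{-1} g - h.
lam = np.linalg.solve(A @ Ginv_AT, A @ Ginv_g - h)
# Recover the step from (16.14): G p = A^T lambda - g.
p = np.linalg.solve(G, A.T @ lam - g)
print(x + p, lam)                    # [ 2. -1.  1.] [ 3. -2.]
```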
We can use an approach like the Schur-complement method to derive an explicit inverse formula for the KKT matrix in (16.5). This formula is
\[
\begin{bmatrix} G & A^T \\ A & 0 \end{bmatrix}^{-1} = \begin{bmatrix} C & E \\ E^T & F \end{bmatrix}, \tag{16.16}
\]
with
\[
\begin{aligned}
C &= G^{-1} - G^{-1} A^T (A G^{-1} A^T)^{-1} A G^{-1}, \\
E &= G^{-1} A^T (A G^{-1} A^T)^{-1}, \\
F &= -(A G^{-1} A^T)^{-1}.
\end{aligned}
\]
The solution of (16.5) can be obtained by multiplying its right-hand side by this inverse matrix. If we take advantage of common expressions, and group the terms appropriately, we recover the approach (16.13), (16.14).
NULL-SPACE METHOD
The null-space method does not require nonsingularity of G and therefore has wider applicability than the Schur-complement method. It assumes only that the conditions of Lemma 16.1 hold, namely, that A has full row rank and that ZᵀGZ is positive definite. However, it requires knowledge of the null-space basis matrix Z. Like the Schur-complement method, it exploits the block structure in the KKT system to decouple (16.5) into two smaller systems.
Suppose that we partition the vector p in (16.5) into two components, as follows:
\[
p = Y p_Y + Z p_Z, \tag{16.17}
\]
where Z is the n × (n − m) null-space matrix, Y is any n × m matrix such that [Y | Z] is nonsingular, p_Y is an m-vector, and p_Z is an (n − m)-vector. The matrices Y and Z were discussed in Section 15.3, where Figure 15.4 shows that Y x_Y is a particular solution of Ax = b, while Z x_Z is a displacement along these constraints.
By substituting p into the second equation of (16.5) and recalling that AZ = 0, we obtain
\[
(A Y) p_Y = -h. \tag{16.18}
\]
Since A has rank m and [Y | Z] is n × n nonsingular, the product A[Y | Z] = [AY | 0] has rank m. Therefore, AY is a nonsingular m × m matrix, and p_Y is well determined by the equations (16.18). Meanwhile, we can substitute (16.17) into the first equation of (16.5) to obtain
\[
-G Y p_Y - G Z p_Z + A^T \lambda^* = g
\]
and multiply by Zᵀ to obtain
\[
(Z^T G Z) p_Z = -Z^T G Y p_Y - Z^T g. \tag{16.19}
\]
This system can be solved by performing a Cholesky factorization of the reduced-Hessian matrix ZᵀGZ to determine p_Z. We therefore can compute the total step p = Y p_Y + Z p_Z. To obtain the Lagrange multiplier λ*, we multiply the first block row in (16.5) by Yᵀ to obtain the linear system
\[
(A Y)^T \lambda^* = Y^T (g + G p), \tag{16.20}
\]
which can be solved for λ*.
EXAMPLE 16.3
Consider the problem (16.9) given in Example 16.2. We can choose
\[
Y = \begin{bmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \\ 1/3 & 1/3 \end{bmatrix}
\]
and set Z as in (16.10). Note that AY = I.
Suppose we have x = (0, 0, 0)ᵀ in (16.6). Then
\[
h = Ax - b = -b, \qquad g = c + Gx = c = \begin{bmatrix} -8 \\ -3 \\ -3 \end{bmatrix}.
\]
Simple calculation shows that
\[
p_Y = \begin{bmatrix} 3 \\ 0 \end{bmatrix}, \qquad p_Z = 0,
\]
so that
\[
p = x^* - x = Y p_Y + Z p_Z = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}.
\]
After recovering λ* from (16.20), we conclude that
\[
x^* = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}, \qquad \lambda^* = \begin{bmatrix} 3 \\ -2 \end{bmatrix}.
\]
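The computations of this example can be reproduced in a few lines (a sketch assuming NumPy, using the Y and Z of the example):

```python
import numpy as np

G = np.array([[6.0, 2.0, 1.0], [2.0, 5.0, 2.0], [1.0, 2.0, 4.0]])
c = np.array([-8.0, -3.0, -3.0])
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
b = np.array([3.0, 0.0])

Y = np.array([[2.0, -1.0], [-1.0, 2.0], [1.0, 1.0]]) / 3.0
Z = np.array([[-1.0], [-1.0], [1.0]])    # as in (16.10); note A @ Z = 0

x = np.zeros(3)
g, h = c + G @ x, A @ x - b              # from (16.6)

p_Y = np.linalg.solve(A @ Y, -h)                         # (16.18)
p_Z = np.linalg.solve(Z.T @ G @ Z,                       # (16.19)
                      -Z.T @ G @ Y @ p_Y - Z.T @ g)
p = Y @ p_Y + Z @ p_Z                                    # total step (16.17)
lam = np.linalg.solve((A @ Y).T, Y.T @ (g + G @ p))      # (16.20)
print(x + p, lam)    # [ 2. -1.  1.] [ 3. -2.]
```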
The null-space approach can be very effective when the number of degrees of freedom n − m is small. Its main limitation lies in the need for the null-space matrix Z which, as we have seen in Chapter 15, can be expensive to compute in some large problems. The matrix Z is not uniquely defined and, if it is poorly chosen, the reduced system (16.19) may become ill conditioned. If we choose Z to have orthonormal columns, as is normally done in software for small and medium-sized problems, then the conditioning of ZᵀGZ is at least as good as that of G itself. When A is large and sparse, however, an orthonormal Z is expensive to
compute, so for practical reasons we are often forced to use one of the less reliable choices of Z described in Chapter 15.
It is difficult to give hard and fast rules about the relative effectiveness of null-space and Schur-complement methods, because factors such as fill-in during computation of Z vary significantly even among problems of the same dimension. In general, we can recommend the Schur-complement method if G is positive definite and AG⁻¹Aᵀ can be computed relatively cheaply (because G is easy to invert or because m is small relative to n). Otherwise, the null-space method is often preferable, in particular when it is much more expensive to compute factors of G than to compute the null-space matrix Z and the factors of ZᵀGZ.
16.3 ITERATIVE SOLUTION OF THE KKT SYSTEM
An alternative to the direct factorization techniques discussed in the previous section is to use an iterative method to solve the KKT system (16.5). Iterative methods are suitable for solving very large systems and often lend themselves well to parallelization. The conjugate gradient (CG) method is not recommended for solving the full system (16.5) as written, because it can be unstable on systems that are not positive definite. Better options are Krylov methods for general linear or symmetric indefinite systems. Candidates include the GMRES, QMR, and LSQR methods; see the Notes and References at the end of the chapter. Other iterative methods can be derived from the null-space approach by applying the conjugate gradient method to the reduced system (16.19). Methods of this type are key to the algorithms of Chapters 18 and 19, and are discussed in the remainder of this section. We assume throughout that ZᵀGZ is positive definite.
CG APPLIED TO THE REDUCED SYSTEM
We begin our discussion of iterative null-space methods by deriving the underlying equations in the notation of the equality-constrained QP (16.3). Expressing the solution of the quadratic program (16.3) as
\[
x^* = Y x_Y + Z x_Z, \tag{16.21}
\]
for some vectors x_Z ∈ ℝ^{n−m} and x_Y ∈ ℝᵐ, the constraints Ax = b yield
\[
(A Y) x_Y = b, \tag{16.22}
\]
which determines the vector x_Y. In Chapter 15, various practical choices of Y are described, some of which allow (16.22) to be solved economically. Substituting (16.21) into (16.3), we see that x_Z solves the unconstrained reduced problem
min 1 x T Z T G Z x x T c , xZ2Z ZZZ
where
c_Z = Z^T G Y x_Y + Z^T c.    (16.23)

The solution x_Z satisfies the linear system

Z^T G Z x_Z = -c_Z.    (16.24)
Since Z^T G Z is positive definite, we can apply the CG method to this linear system and substitute x_Z into (16.21) to obtain a solution of (16.3).
As discussed in Chapter 5, preconditioning can improve the rate of convergence of the CG iteration, so we assume that a preconditioner W_ZZ is given. The preconditioned CG method (Algorithm 5.3) applied to the (n - m)-dimensional reduced system (16.24) is as follows. We denote the steps produced by the CG iteration by d_Z.
Algorithm 16.1 (Preconditioned CG for Reduced Systems).
Choose an initial point x_Z;
Compute r_Z = Z^T G Z x_Z + c_Z, g_Z = W_ZZ^{-1} r_Z, and d_Z = -g_Z;
repeat
    α ← r_Z^T g_Z / d_Z^T Z^T G Z d_Z;    (16.25a)
    x_Z ← x_Z + α d_Z;    (16.25b)
    r_Z^+ ← r_Z + α Z^T G Z d_Z;    (16.25c)
    g_Z^+ ← W_ZZ^{-1} r_Z^+;    (16.25d)
    β ← (r_Z^+)^T g_Z^+ / r_Z^T g_Z;    (16.25e)
    d_Z ← -g_Z^+ + β d_Z;    (16.25f)
    g_Z ← g_Z^+;  r_Z ← r_Z^+;    (16.25g)
until a termination test is satisfied.

This iteration may be terminated when, for example, r_Z^T W_ZZ^{-1} r_Z is sufficiently small.
In this approach, it is not necessary to form the reduced Hessian Z^T G Z explicitly, because the CG method requires only that we compute matrix-vector products involving this matrix. In fact, it is not even necessary to form Z explicitly, as long as we are able to compute products of Z and Z^T with arbitrary vectors. For some choices of Z, these products are much cheaper to compute than Z itself, as we have seen in Chapter 15.
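A matrix-free sketch of Algorithm 16.1 along these lines might look as follows in NumPy; the reduced Hessian is touched only through the callables Gmul, Zmul, ZTmul (products with G, Z, and Z^T) and Wsolve (application of W_ZZ^{-1}). All names here are illustrative assumptions.

import numpy as np

def reduced_cg(Gmul, Zmul, ZTmul, Wsolve, cZ, xZ, tol=1e-10, maxiter=500):
    """Solve (16.24), Z^T G Z x_Z = -c_Z, by preconditioned CG as in (16.25)."""
    def ZGZmul(v):                      # matrix-vector product with Z^T G Z
        return ZTmul(Gmul(Zmul(v)))
    rZ = ZGZmul(xZ) + cZ                # residual of (16.24)
    gZ = Wsolve(rZ)                     # preconditioned residual
    dZ = -gZ
    for _ in range(maxiter):
        if rZ @ gZ <= tol:              # terminate when r_Z^T W_ZZ^{-1} r_Z is small
            break
        ZGZd = ZGZmul(dZ)
        alpha = (rZ @ gZ) / (dZ @ ZGZd)           # (16.25a)
        xZ = xZ + alpha * dZ                      # (16.25b)
        rZ_new = rZ + alpha * ZGZd                # (16.25c)
        gZ_new = Wsolve(rZ_new)                   # (16.25d)
        beta = (rZ_new @ gZ_new) / (rZ @ gZ)      # (16.25e)
        dZ = -gZ_new + beta * dZ                  # (16.25f)
        rZ, gZ = rZ_new, gZ_new                   # (16.25g)
    return xZ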
The preconditioner W_ZZ is a symmetric, positive definite matrix of dimension n - m, which might be chosen to cluster the eigenvalues of W_ZZ^{-1/2} Z^T G Z W_ZZ^{-1/2} and to reduce the span between the smallest and largest eigenvalues. An ideal choice of preconditioner is one for which W_ZZ^{-1/2} Z^T G Z W_ZZ^{-1/2} = I, that is, W_ZZ = Z^T G Z. Motivated by this ideal,
we consider preconditioners of the form
W_ZZ = Z^T H Z,    (16.26)
where H is a symmetric matrix such that Z^T H Z is positive definite. Some choices of H are discussed below. Preconditioners of the form (16.26) allow us to apply the CG method in n-dimensional space, as we discuss next.
THE PROJECTED CG METHOD
It is possible to design a modification of Algorithm 16.1 that avoids operating with the null-space basis Z, provided we use a preconditioner of the form (16.26) and a particular solution of the equation Ax = b. This approach works implicitly with an orthogonal matrix Z and is not affected by ill conditioning in A or by a poor choice of Z.
After the solution x_Z of (16.24) has been computed by using Algorithm 16.1, it must be multiplied by Z and substituted in (16.21) to give the solution of the quadratic program (16.3). Alternatively, we may rewrite Algorithm 16.1 to work directly with the vector x = Z x_Z + Y x_Y, where the Y x_Y term is fixed at the start and the Z x_Z term is updated implicitly within each iteration. To specify this form of the CG algorithm, we introduce the n-vectors x, r, g, and d, which satisfy x = Z x_Z + Y x_Y, Z^T r = r_Z, g = Z g_Z, and d = Z d_Z, respectively. We also define the scaled n × n projection matrix P as follows:
P = Z (Z^T H Z)^{-1} Z^T,    (16.27)

where H is the preconditioning matrix from (16.26). The CG iteration in n-dimensional space can be specified as follows.
Algorithm 16.2 (Projected CG Method).
Choose an initial point x satisfying Ax = b;
Compute r = Gx + c, g = Pr, and d = -g;
repeat
    α ← r^T g / d^T G d;    (16.28a)
    x ← x + α d;    (16.28b)
    r^+ ← r + α G d;    (16.28c)
    g^+ ← P r^+;    (16.28d)
    β ← (r^+)^T g^+ / r^T g;    (16.28e)
    d ← -g^+ + β d;    (16.28f)
    g ← g^+;  r ← r^+;    (16.28g)
until a convergence test is satisfied.
A practical stop test is to terminate when r^T g = r^T P r is smaller than a prescribed tolerance.
Note that the vector g, which we call the preconditioned residual, has been defined to be in the null space of A. As a result, in exact arithmetic, all the search directions d generated by Algorithm 16.2 also lie in the null space of A, and thus the iterates x all satisfy Ax = b. It is not difficult to verify (see Exercise 16.14) that the iteration is well defined if Z^T G Z and Z^T H Z are positive definite. The reader can also verify that the iterates x generated by Algorithm 16.2 are related to the iterates x_Z of Algorithm 16.1 via (16.21).
Two simple choices of the preconditioning matrix H are H = diag(|G_ii|) and H = I. In some applications, it is effective to define H as a block-diagonal submatrix of G.
Algorithm 16.2 makes use of the null-space basis Z only through the operator (16.27). It is possible, however, to compute Pr without knowing a representation of the null-space basis Z. For simplicity, we first consider the case in which H = I, so that P is the orthogonal projection operator onto the null space of A. We use P_I to denote this special case of P, that is,

P_I = Z (Z^T Z)^{-1} Z^T.    (16.29)

The computation of the preconditioned residual g = P_I r in (16.28d) can be performed in two ways. The first is to express P_I by the equivalent formula

P_I = I - A^T (A A^T)^{-1} A    (16.30)

and thus compute g = P_I r. We can then write g = r - A^T v, where v is the solution of the system

(A A^T) v = A r.    (16.31)
This approach for computing the projection g = P_I r is called the normal-equations approach; the system (16.31) can be solved by using a Cholesky factorization of A A^T.
The second approach is to express the projection (16.28d) as the solution of the augmented system

[ I   A^T ] [ g ]     [ r ]
[ A   0   ] [ v ]  =  [ 0 ],    (16.32)

which can be solved by means of a symmetric indefinite factorization, as discussed earlier. We call this approach the augmented-system approach.
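The following sketch contrasts the two approaches for a dense A of full row rank. Here cho_factor stands in for the Cholesky factorization, and scipy's general lu_factor is used only as a placeholder for a symmetric indefinite factorization of the matrix in (16.32); the helper names are ours.

import numpy as np
from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

def make_projectors(A):
    m, n = A.shape
    chol_AAT = cho_factor(A @ A.T)                 # Cholesky of A A^T for (16.31)
    K = np.block([[np.eye(n), A.T],
                  [A, np.zeros((m, m))]])          # augmented matrix of (16.32)
    lu_K = lu_factor(K)                            # placeholder for a symmetric
                                                   # indefinite factorization
    def project_normal(r):
        v = cho_solve(chol_AAT, A @ r)             # solve A A^T v = A r  (16.31)
        return r - A.T @ v                         # g = r - A^T v

    def project_augmented(r):
        sol = lu_solve(lu_K, np.concatenate([r, np.zeros(m)]))
        return sol[:n]                             # g-component of (16.32)

    # Either factorization also yields a feasible start with A x = b:
    def initial_point(b):
        return A.T @ cho_solve(chol_AAT, b)        # x = A^T (A A^T)^{-1} b
    return project_normal, project_augmented, initial_point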
We suppose now that the preconditioner has the general form (16.27) and consider the computation of g = Pr in (16.28d). When H is nonsingular, we can compute g as follows:

g = P r,  where  P = H^{-1} ( I - A^T (A H^{-1} A^T)^{-1} A H^{-1} ).    (16.33)

Otherwise, when z^T H z ≠ 0 for all nonzero z with Az = 0, we can find g as the solution of the system

[ H   A^T ] [ g ]     [ r ]
[ A   0   ] [ v ]  =  [ 0 ].    (16.34)
While (16.33) is unappealing when H^{-1} does not have a simple form, (16.34) is a useful generalization of (16.32). A perfect preconditioner is obtained by taking H = G, but other choices for H are also possible, provided that Z^T H Z is positive definite. The matrix in (16.34) is often called a constraint preconditioner.
None of these procedures for computing the projection makes use of a null-space basis Z; only the factorization of matrices involving A is required. Significantly, all these forms allow us to compute an initial point satisfying Ax = b. The operator g = P_I r relies on a factorization of A A^T from which we can compute x = A^T (A A^T)^{-1} b, while factorizations of the system matrices in (16.32) and (16.34) allow us to find a suitable x by solving

[ I   A^T ] [ x ]     [ 0 ]         [ H   A^T ] [ x ]     [ 0 ]
[ A   0   ] [ y ]  =  [ b ]   or    [ A   0   ] [ y ]  =  [ b ].

Therefore we can compute an initial point for Algorithm 16.2 at the cost of one backsolve, using the factorization of the system needed to perform the projection operators.
We point out that these approaches for computing g can give rise to significant roundoff errors, so the use of iterative refinement is recommended to improve accuracy.
16.4 INEQUALITY-CONSTRAINED PROBLEMS
In the remainder of the chapter we discuss several classes of algorithms for solving convex quadratic programs that contain both inequality and equality constraints. Active-set methods have been widely used since the 1970s and are effective for small- and medium-sized problems. They allow for efficient detection of unboundedness and infeasibility and typically return an accurate estimate of the optimal active set. Interior-point methods are more recent, having become popular in the 1990s. They are well suited for large problems but may not be the most effective when a series of related QPs must be solved. We also study a special
type of active-set method called the gradient projection method, which is most effective when the only constraints in the problem are bounds on the variables.
OPTIMALITY CONDITIONS FOR INEQUALITY-CONSTRAINED PROBLEMS
We begin our discussion with a brief review of the optimality conditions for inequality constrained quadratic programming, then discuss some of the less obvious properties of the solutions.
Theorem 12.1 can be applied to (16.1) by noting that the Lagrangian for this problem is

L(x, λ) = (1/2) x^T G x + x^T c - Σ_{i ∈ I ∪ E} λ_i (a_i^T x - b_i).    (16.35)

As in Definition 12.1, the active set A(x*) consists of the indices of the constraints for which equality holds at x*:

A(x*) = { i ∈ E ∪ I | a_i^T x* = b_i }.    (16.36)

By specializing the KKT conditions (12.34) to this problem, we find that any solution x* of (16.1) satisfies the following first-order conditions, for some Lagrange multipliers λ_i*, i ∈ A(x*):

G x* + c - Σ_{i ∈ A(x*)} λ_i* a_i = 0,    (16.37a)
a_i^T x* = b_i,    for all i ∈ A(x*),    (16.37b)
a_i^T x* ≥ b_i,    for all i ∈ I \ A(x*),    (16.37c)
λ_i* ≥ 0,    for all i ∈ I ∩ A(x*).    (16.37d)
A technical point: In Theorem 12.1 we assumed that the linear independence constraint qualification (LICQ) was satisfied. As mentioned in Section 12.6, this theorem still holds if we replace LICQ by other constraint qualifications, such as linearity of the constraints, which is certainly satisfied for quadratic programming. Hence, in the optimality conditions for quadratic programming given above, we need not assume that the active constraints are linearly independent at the solution.
For convex QP, when G is positive semidefinite, the conditions (16.37) are in fact sufficient for x* to be a global solution, as we now prove.
Theorem 16.4.
If x* satisfies the conditions (16.37) for some λ_i*, i ∈ A(x*), and G is positive semidefinite, then x* is a global solution of (16.1).
PROOF. If x is any other feasible point for (16.1), we have that a_i^T x = b_i for all i ∈ E and a_i^T x ≥ b_i for all i ∈ A(x*) ∩ I. Hence, a_i^T (x - x*) = 0 for all i ∈ E and a_i^T (x - x*) ≥ 0 for all i ∈ A(x*) ∩ I. Using these relationships, together with (16.37a) and (16.37d), we have that

(x - x*)^T (G x* + c) = Σ_{i ∈ E} λ_i* a_i^T (x - x*) + Σ_{i ∈ A(x*) ∩ I} λ_i* a_i^T (x - x*) ≥ 0.    (16.38)

By elementary manipulation, we find that

q(x) = q(x*) + (x - x*)^T (G x* + c) + (1/2) (x - x*)^T G (x - x*)
     ≥ q(x*) + (1/2) (x - x*)^T G (x - x*)
     ≥ q(x*),

where the first inequality follows from (16.38) and the second inequality follows from positive semidefiniteness of G. We have shown that q(x) ≥ q(x*) for any feasible x, so x* is a global solution.
By a trivial modification of this proof, we see that x* is actually the unique global solution when G is positive definite.
We can also apply the theory from Section 12.5 to derive second-order optimality conditions for (16.1). Second-order sufficient conditions for x* to be a local minimizer are satisfied if Z^T G Z is positive definite, where Z is defined to be a null-space basis matrix for the active constraint Jacobian matrix, which is the matrix whose rows are a_i^T for all i ∈ A(x*). In this case, x* is a strict local solution, according to Theorem 12.6.
When G is not positive definite, the general problem (16.1) may have more than one strict local solution. As mentioned above, such problems are called nonconvex QPs or indefinite QPs, and they cause some complications for algorithms. Examples of indefinite QPs are illustrated in Figure 16.1. On the left we have plotted the feasible region and the contours of a quadratic objective q(x) in which G has one positive and one negative eigenvalue, with labels indicating the directions in which the function tends toward plus or minus infinity. One of the labeled points is a local maximizer, another a local minimizer, and the center of the box is a stationary point. The picture on the right in Figure 16.1, in which both eigenvalues of G are negative, shows a global maximizer and two further local minimizers.
DEGENERACY
A second property that causes difficulties for some algorithms is degeneracy. Confusingly, this term has been given a variety of meanings. It refers to situations in which
Figure 16.1 Nonconvex quadratic programs.

Figure 16.2 Degenerate solutions of quadratic programs.
(a) the active constraint gradients a_i, i ∈ A(x*), are linearly dependent at the solution x*, and/or

(b) the strict complementarity condition of Definition 12.5 fails to hold, that is, there is some index i ∈ A(x*) such that all Lagrange multipliers satisfying (16.37) have λ_i* = 0. (Such constraints are weakly active according to Definition 12.8.)
Two examples of degeneracy are shown in Figure 16.2. In the left-hand picture, there is a single active constraint at the solution x*, which is also an unconstrained minimizer of the objective function. In the notation of (16.37a), we have that G x* + c = 0, so that
the lone Lagrange multiplier must be zero. In the right-hand picture, three constraints are active at the solution x*. Since each of the three constraint gradients is a vector in IR^2, they must be linearly dependent.
Lack of strict complementarity is also illustrated by the problem

min  x_1^2 + (x_2 + 1)^2    subject to  x ≥ 0,

which has a solution at x* = 0 at which both constraints are active. Strict complementarity does not hold at x* because the Lagrange multiplier associated with the active constraint x_1 ≥ 0 is zero.
Degeneracy can cause problems for algorithms for two main reasons. First, linear dependence of the active constraint gradients can cause numerical difficulties in the step computation because certain matrices that we need to factor become rank deficient. Second, when the problem contains weakly active constraints, it is difficult for the algorithm to determine whether these constraints are active at the solution. In the case of the active-set methods and gradient projection methods described below, this indecisiveness can cause the algorithm to zigzag as the iterates move on and off the weakly active constraints on successive iterations. Safeguards must be used to prevent such behavior.
16.5 ACTIVE-SET METHODS FOR CONVEX QPs
We now describe active-set methods for solving quadratic programs of the form (16.1) containing equality and inequality constraints. We consider only the convex case, in which the matrix G in (16.1a) is positive semidefinite. The case in which G is an indefinite matrix raises complications in the algorithms and is outside the scope of this book. We refer to Gould [147] for a discussion of nonconvex QPs.
If the contents of the optimal active set (16.36) were known in advance, we could find the solution x* by applying one of the techniques for equality-constrained QP of Sections 16.2 and 16.3 to the problem

min_x  q(x) = (1/2) x^T G x + x^T c    subject to  a_i^T x = b_i,  i ∈ A(x*).
Of course, we usually do not have prior knowledge of A(x*) and, as we now see, determination of this set is the main challenge facing algorithms for inequality-constrained QP.
We have already encountered an active-set approach for linear programming in Chapter 13, namely, the simplex method. In essence, the simplex method starts by making a guess of the optimal active set, then repeatedly uses gradient and Lagrange multiplier information to drop one index from the current estimate of A(x*) and add a new index, until optimality
is detected. Active-set methods for QP differ from the simplex method in that the iterates (and the solution x*) are not necessarily vertices of the feasible region.
Active-set methods for QP come in three varieties: primal, dual, and primal-dual. We restrict our discussion to primal methods, which generate iterates that remain feasible with respect to the primal problem (16.1) while steadily decreasing the objective function q(x).
Primal active-set methods find a step from one iterate to the next by solving a quadratic subproblem in which some of the inequality constraints (16.1c), and all the equality constraints (16.1b), are imposed as equalities. This subset is referred to as the working set and is denoted at the kth iterate x_k by W_k. An important requirement we impose on W_k is that the gradients a_i of the constraints in the working set be linearly independent, even when the full set of active constraints at that point has linearly dependent gradients.
Given an iterate x_k and the working set W_k, we first check whether x_k minimizes the quadratic q in the subspace defined by the working set. If not, we compute a step p by solving an equality-constrained QP subproblem in which the constraints corresponding to the working set W_k are regarded as equalities and all other constraints are temporarily disregarded. To express this subproblem in terms of the step p, we define

p = x - x_k,    g_k = G x_k + c.

By substituting for x into the objective function (16.1a), we find that

q(x) = q(x_k + p) = (1/2) p^T G p + g_k^T p + ρ_k,

where ρ_k = (1/2) x_k^T G x_k + c^T x_k is independent of p. Since we can drop ρ_k from the objective without changing the solution of the problem, we can write the QP subproblem to be solved at the kth iteration as follows:

min_p  (1/2) p^T G p + g_k^T p    (16.39a)
subject to  a_i^T p = 0,  i ∈ W_k.    (16.39b)
We denote the solution of this subproblem by p_k. Note that for each i ∈ W_k, the value of a_i^T x does not change as we move along p_k, since we have a_i^T (x_k + α p_k) = a_i^T x_k = b_i for all α. Since the constraints in W_k were satisfied at x_k, they are also satisfied at x_k + α p_k, for any value of α. Since G is positive definite, the solution of (16.39) can be computed by any of the techniques described in Section 16.2.
Supposing for the moment that the optimal p_k from (16.39) is nonzero, we need to decide how far to move along this direction. If x_k + p_k is feasible with respect to all the constraints, we set x_{k+1} = x_k + p_k. Otherwise, we set

x_{k+1} = x_k + α_k p_k,    (16.40)
where the step-length parameter α_k is chosen to be the largest value in the range [0, 1] for which all constraints are satisfied. We can derive an explicit definition of α_k by considering what happens to the constraints i ∉ W_k, since the constraints i ∈ W_k will certainly be satisfied regardless of the choice of α_k. If a_i^T p_k ≥ 0 for some i ∉ W_k, then for all α_k ≥ 0 we have a_i^T (x_k + α_k p_k) ≥ a_i^T x_k ≥ b_i. Hence, constraint i will be satisfied for all nonnegative choices of the step-length parameter. Whenever a_i^T p_k < 0 for some i ∉ W_k, however, we have that a_i^T (x_k + α_k p_k) ≥ b_i only if

α_k ≤ (b_i - a_i^T x_k) / (a_i^T p_k).
To maximize the decrease in q, we want α_k to be as large as possible in [0, 1] subject to retaining feasibility, so we obtain the following definition:

α_k  =def  min( 1,  min_{i ∉ W_k, a_i^T p_k < 0}  (b_i - a_i^T x_k) / (a_i^T p_k) ).    (16.41)
We call the constraints i for which the minimum in (16.41) is achieved the blocking constraints. (If α_k = 1 and no new constraints are active at x_k + α_k p_k, then there are no blocking constraints on this iteration.) Note that it is quite possible for α_k to be zero, because we could have a_i^T p_k < 0 for some constraint i that is active at x_k but not a member of the current working set W_k.
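A small NumPy sketch of this ratio test follows; the rows of A and entries of b hold all the constraint data, W is the current working set, and the function name and tolerance are illustrative choices, not from the text.

import numpy as np

def ratio_test(A, b, x, p, W, tol=1e-12):
    """Step length (16.41) and the blocking constraints for the step p from x."""
    alpha, blocking = 1.0, []
    for i in range(A.shape[0]):
        aip = A[i] @ p
        if i not in W and aip < -tol:            # only i not in W_k with a_i^T p < 0
            ratio = (b[i] - A[i] @ x) / aip      # largest step keeping a_i^T x >= b_i
            if ratio < alpha - tol:
                alpha, blocking = ratio, [i]     # strictly smaller: new minimum
            elif ratio <= alpha + tol:
                blocking.append(i)               # ties: several blocking constraints
    return max(alpha, 0.0), blocking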
If α_k < 1, that is, the step along p_k was blocked by some constraint not in W_k, a new working set W_{k+1} is constructed by adding one of the blocking constraints to W_k.
We continue to iterate in this manner, adding constraints to the working set until we reach a point x̂ that minimizes the quadratic objective function over its current working set Ŵ. It is easy to recognize such a point because the subproblem (16.39) has solution p = 0. Since p = 0 satisfies the optimality conditions (16.5) for (16.39), we have that

Σ_{i ∈ Ŵ} a_i λ̂_i = g = G x̂ + c,    (16.42)

for some Lagrange multipliers λ̂_i, i ∈ Ŵ. It follows that x̂ and λ̂ satisfy the first KKT condition (16.37a), if we define the multipliers corresponding to the inequality constraints that are not in the working set to be zero. Because of the control imposed on the step length, x̂ is also feasible with respect to all the constraints, so the second and third KKT conditions (16.37b) and (16.37c) are satisfied at this point.
We now examine the signs of the multipliers corresponding to the inequality constraints in the working set, that is, the indices i ∈ Ŵ ∩ I. If these multipliers are all nonnegative, the fourth KKT condition (16.37d) is also satisfied, so we conclude that x̂ is a KKT point for the original problem (16.1). In fact, since G is positive semidefinite, we have
from Theorem 16.4 that x̂ is a global solution of (16.1). As noted after Theorem 16.4, x̂ is a strict local minimizer and the unique global solution if G is positive definite.
If, on the other hand, one or more of the multipliers λ̂_j, j ∈ Ŵ ∩ I, is negative, the condition (16.37d) is not satisfied and the objective function q may be decreased by dropping one of these constraints, as shown in Section 12.3. Thus, we remove an index j corresponding to one of the negative multipliers from the working set and solve a new subproblem (16.39) for the new step. We show in the following theorem that this strategy produces a direction p at the next iteration that is feasible with respect to the dropped constraint. We continue to assume that the constraint gradients a_i for i in the working set are linearly independent. After the algorithm has been fully stated, we discuss how this property can be maintained.
Theorem 16.5.
Suppose that the point x̂ satisfies first-order conditions for the equality-constrained subproblem with working set Ŵ; that is, equation (16.42) is satisfied along with a_i^T x̂ = b_i for all i ∈ Ŵ. Suppose, too, that the constraint gradients a_i, i ∈ Ŵ, are linearly independent and that there is an index j ∈ Ŵ such that λ̂_j < 0. Let p be the solution obtained by dropping the constraint j and solving the following subproblem:
min_p  (1/2) p^T G p + (G x̂ + c)^T p,    (16.43a)
subject to  a_i^T p = 0,  for all i ∈ Ŵ with i ≠ j.    (16.43b)

Then p is a feasible direction for constraint j, that is, a_j^T p ≥ 0. Moreover, if p satisfies second-order sufficient conditions for (16.43), then we have that a_j^T p > 0, and that p is a descent direction for q.
PROOF. Since p solves (16.43), we have from the results of Section 16.1 that there are multipliers λ̃_i, for all i ∈ Ŵ with i ≠ j, such that

Σ_{i ∈ Ŵ, i ≠ j} λ̃_i a_i = G p + G x̂ + c.    (16.44)

In addition, we have by second-order necessary conditions that if Z is a null-space basis matrix for the matrix

[ a_i^T ]_{i ∈ Ŵ, i ≠ j},

then Z^T G Z is positive semidefinite. Clearly, p has the form p = Z p_Z for some vector p_Z, so it follows that p^T G p ≥ 0.
We have made the assumption that x̂ and Ŵ satisfy the relation (16.42). By subtracting (16.42) from (16.44), we obtain

Σ_{i ∈ Ŵ, i ≠ j} (λ̃_i - λ̂_i) a_i - λ̂_j a_j = G p.    (16.45)

By taking inner products of both sides with p and using the fact that a_i^T p = 0 for all i ∈ Ŵ with i ≠ j, we have that

-λ̂_j a_j^T p = p^T G p.    (16.46)

Since p^T G p ≥ 0 and λ̂_j < 0 by assumption, it follows that a_j^T p ≥ 0.

If the second-order sufficient conditions of Section 12.5 are satisfied, we have that Z^T G Z defined above is positive definite. From (16.46), we can have a_j^T p = 0 only if p^T G p = p_Z^T Z^T G Z p_Z = 0, which happens only if p_Z = 0 and p = 0. But if p = 0, then by substituting into (16.45) and using linear independence of a_i for i ∈ Ŵ, we must have that λ̂_j = 0, which contradicts our choice of j. We conclude that p^T G p > 0 in (16.46), and therefore a_j^T p > 0 whenever p satisfies the second-order sufficient conditions for (16.43). The claim that p is a descent direction for q is proved in Theorem 16.6 below.
While any index j for which λ̂_j < 0 usually will yield a direction p along which the algorithm can make progress, the most negative multiplier is often chosen in practice (and in the algorithm specified below). This choice is motivated by the sensitivity analysis given in Chapter 12, which shows that the rate of decrease in the objective function when one constraint is removed is proportional to the magnitude of the Lagrange multiplier for that constraint. As in linear programming, however, the step along the resulting direction may be short (as when it is blocked by a new constraint), so the amount of decrease in q is not guaranteed to be greater than for other possible choices of j.
We conclude with a result that shows that whenever p_k obtained from (16.39) is nonzero and satisfies second-order sufficient optimality conditions for the current working set, it is a direction of strict descent for q.
Theorem 16.6.
Suppose that the solution p_k of (16.39) is nonzero and satisfies the second-order sufficient conditions for optimality for that problem. Then the function q is strictly decreasing along the direction p_k.
PROOF. Since p_k satisfies the second-order conditions, that is, Z^T G Z is positive definite for the matrix Z whose columns are a basis of the null space of the constraints (16.39b), we have by applying Theorem 16.2 to (16.39) that p_k is the unique global solution of (16.39). Since p = 0 is also a feasible point for (16.39), its objective value in (16.39a) must be larger
than that of p_k, so we have

(1/2) p_k^T G p_k + g_k^T p_k < 0.

Since p_k^T G p_k ≥ 0 by convexity, this inequality implies that g_k^T p_k < 0. Therefore, we have

q(x_k + α p_k) = q(x_k) + α g_k^T p_k + (1/2) α^2 p_k^T G p_k < q(x_k),

for all α > 0 sufficiently small.

When G is positive definite (the strictly convex case), the second-order sufficient conditions are satisfied for all feasible subproblems of the form (16.39). Hence, it follows from the result above that we obtain a strict decrease in q whenever p_k ≠ 0. This fact is significant when we discuss finite termination of the algorithm.
SPECIFICATION OF THE ACTIVE-SET METHOD FOR CONVEX QP
Having described the active-set algorithm for convex QP, we now present the following formal specification. We assume that the objective function q is bounded in the feasible set (16.1b), (16.1c).
Algorithm 16.3 (Active-Set Method for Convex QP).
Compute a feasible starting point x_0;
Set W_0 to be a subset of the active constraints at x_0;
for k = 0, 1, 2, ...
    Solve (16.39) to find p_k;
    if p_k = 0
        Compute Lagrange multipliers λ̂_i that satisfy (16.42), with Ŵ = W_k;
        if λ̂_i ≥ 0 for all i ∈ W_k ∩ I
            stop with solution x* = x_k;
        else
            j ← argmin_{j ∈ W_k ∩ I} λ̂_j;
            x_{k+1} ← x_k;  W_{k+1} ← W_k \ {j};
    else (* p_k ≠ 0 *)
        Compute α_k from (16.41);
        x_{k+1} ← x_k + α_k p_k;
        if there are blocking constraints
            Obtain W_{k+1} by adding one of the blocking constraints to W_k;
        else
            W_{k+1} ← W_k;
end (for)
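As an illustration, the following dense NumPy sketch implements the main loop for a problem with inequality constraints A x ≥ b only, solving subproblem (16.39) through its KKT system. It assumes the KKT matrix is nonsingular (G positive definite, working-set rows independent) and reuses the ratio_test helper sketched after (16.41); it is a teaching sketch, not a robust implementation.

import numpy as np

def active_set_qp(G, c, A, b, x, W):
    # W: list of row indices of A forming the initial working set.
    W = list(W)
    n = len(x)
    for _ in range(100):
        g = G @ x + c
        Aw = A[W] if W else np.zeros((0, n))
        # KKT system of (16.39): G p - Aw^T lam = -g, Aw p = 0.
        K = np.block([[G, Aw.T], [Aw, np.zeros((len(W), len(W)))]])
        sol = np.linalg.solve(K, np.concatenate([-g, np.zeros(len(W))]))
        p, lam = sol[:n], -sol[n:]
        if np.linalg.norm(p) < 1e-9:              # x minimizes q on the working set
            if len(W) == 0 or lam.min() >= -1e-9:
                return x, dict(zip(W, lam))       # KKT conditions (16.37) hold
            W.pop(int(np.argmin(lam)))            # drop the most negative multiplier
        else:
            alpha, blocking = ratio_test(A, b, x, p, W)
            x = x + alpha * p                     # step (16.40) with length (16.41)
            if blocking:
                W.append(blocking[0])             # add one blocking constraint
    return x, dict(zip(W, lam))

Applied to the data of Example 16.4 below (G = 2I, c = (-2, -5)^T) with starting point (2, 0)^T and the zero-based working set {2, 4} (constraints 3 and 5), this sketch reproduces the iterates listed there.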
Various techniques can be used to determine an initial feasible point. One such is to use the Phase I approach for linear programming described in Chapter 13. Though no significant modifications are needed to generalize this method from linear programming to quadratic programming, we describe a variant here that allows the user to supply an initial estimate x̃ of the vector x. This estimate need not be feasible, but a good choice based on knowledge of the QP may reduce the work needed in the Phase I step.
Given x̃, we define the following feasibility linear program:

min_{(x,z)}  e^T z
subject to  a_i^T x + γ_i z_i = b_i,  i ∈ E,
            a_i^T x + γ_i z_i ≥ b_i,  i ∈ I,
            z ≥ 0,

where e = (1, 1, ..., 1)^T, γ_i = -sign(a_i^T x̃ - b_i) for i ∈ E, and γ_i = 1 for i ∈ I. A feasible initial point for this problem is then

x = x̃,    z_i = |a_i^T x̃ - b_i|  (i ∈ E),    z_i = max(b_i - a_i^T x̃, 0)  (i ∈ I).

It is easy to verify that if x̃ is feasible for the original problem (16.1), then (x̃, 0) is optimal for the feasibility subproblem. In general, if the original problem has feasible points, then the optimal objective value in the subproblem is zero, and any solution of the subproblem yields a feasible point for the original problem. The initial working set W_0 for Algorithm 16.3 can be found by taking a linearly independent subset of the active constraints at the solution of the feasibility problem.
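For the purely inequality-constrained case (γ_i = 1), this Phase-I problem can be passed almost verbatim to an LP solver. A sketch using scipy.optimize.linprog follows; the variable vector is (x, z), and since linprog expects ≤ rows the constraint signs are flipped. The helper name is ours.

import numpy as np
from scipy.optimize import linprog

def phase1_point(A, b):
    """Minimize e^T z subject to a_i^T x + z_i >= b_i, z >= 0."""
    m, n = A.shape
    cost = np.concatenate([np.zeros(n), np.ones(m)])      # e^T z
    A_ub = np.hstack([-A, -np.eye(m)])                    # -(a_i^T x + z_i) <= -b_i
    bounds = [(None, None)] * n + [(0, None)] * m         # x free, z >= 0
    res = linprog(cost, A_ub=A_ub, b_ub=-b, bounds=bounds, method="highs")
    return res.x[:n], res.fun      # feasible point iff optimal value is (near) zero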
An alternative approach is a penalty (or "big M") method, which does away with the Phase I and instead includes a measure of infeasibility in the objective that is guaranteed to be zero at the solution. That is, we introduce a scalar artificial variable η into (16.1) to measure the constraint violation, and we solve the problem

min_{(x,η)}  (1/2) x^T G x + x^T c + M η,    (16.47)
subject to  a_i^T x - b_i ≤ η,  i ∈ E,
            -(a_i^T x - b_i) ≤ η,  i ∈ E,
            b_i - a_i^T x ≤ η,  i ∈ I,
            0 ≤ η,

for some large positive value of M. It can be shown by applying the theory of exact penalty functions (see Chapter 17) that whenever there exist feasible points for the original problem (16.1), then for all M sufficiently large, the solution of (16.47) will have η = 0, with an x component that is a solution for (16.1).
Our strategy is to use some heuristic to choose a value of M and solve (16.47) by the usual means. If the solution we obtain has a positive value of η, we increase M and try again. Note that a feasible point is easy to obtain for the subproblem (16.47): We set x = x̃ (where, as before, x̃ is the user-supplied initial guess) and choose η large enough that all the constraints in (16.47) are satisfied. This approach is, in fact, an exact penalty method using the ℓ∞ norm; see Chapter 17.
A variant of (16.47) that penalizes the ℓ1 norm of the constraint violation rather than the ℓ∞ norm is as follows:

min_{(x,s,t,v)}  (1/2) x^T G x + x^T c + M e_E^T (s + t) + M e_I^T v    (16.48)
subject to  a_i^T x - b_i + s_i - t_i = 0,  i ∈ E,
            a_i^T x - b_i + v_i ≥ 0,  i ∈ I,
            s ≥ 0,  t ≥ 0,  v ≥ 0.

Here, e_E is the vector (1, 1, ..., 1)^T of length |E|; similarly for e_I. The slack variables s_i, t_i, and v_i soak up any infeasibility in the constraints.
In the following example we use subscripts on the vectors x and p to denote their components, and we use superscripts to indicate the iteration index. For example, x_1 denotes the first component of x, while x^4 denotes the fourth iterate of the vector x.
Figure 16.3 Iterates of the active-set method.
EXAMPLE 16.4
We apply Algorithm 16.3 to the following simple 2-dimensional problem, illustrated in Figure 16.3:
min_x  q(x) = (x_1 - 1)^2 + (x_2 - 2.5)^2    (16.49a)
subject to
    x_1 - 2 x_2 + 2 ≥ 0,    (16.49b)
    -x_1 - 2 x_2 + 6 ≥ 0,    (16.49c)
    -x_1 + 2 x_2 + 2 ≥ 0,    (16.49d)
    x_1 ≥ 0,    (16.49e)
    x_2 ≥ 0.    (16.49f)
We refer to the constraints, in order, by indices 1 through 5. For this problem it is easy to determine a feasible initial point; say x^0 = (2, 0)^T. Constraints 3 and 5 are active at this point, and we set W_0 = {3, 5}. (Note that we could just as validly have chosen W_0 = {5} or W_0 = {3} or even W_0 = ∅; each choice would lead the algorithm to perform somewhat differently.)
Since x^0 lies on a vertex of the feasible region, it is obviously a minimizer of the objective function q with respect to the working set W_0; that is, the solution of (16.39) with k = 0 is p = 0. We can then use (16.42) to find the multipliers λ̂_3 and λ̂_5 associated with the active constraints. Substitution of the data from our problem into (16.42) yields

λ̂_3 (-1, 2)^T + λ̂_5 (0, 1)^T = (2, -5)^T,

which has the solution (λ̂_3, λ̂_5) = (-2, -1).
We now remove constraint 3 from the working set, because it has the most negative multiplier, and set W_1 = {5}. We begin iteration 1 by finding the solution of (16.39) for k = 1, which is p^1 = (-1, 0)^T. The step-length formula (16.41) yields α_1 = 1, and the new iterate is x^2 = (1, 0)^T.
There are no blocking constraints, so that W_2 = W_1 = {5}, and we find at the start of iteration 2 that the solution of (16.39) is p^2 = 0. From (16.42) we deduce that the Lagrange multiplier for the lone working constraint is λ̂_5 = -5, so we drop 5 from the working set to obtain W_3 = ∅.
Iteration 3 starts by solving the unconstrained problem, to obtain the solution p^3 = (0, 2.5)^T. The formula (16.41) yields a step length of α_3 = 0.6 and a new iterate x^4 = (1, 1.5)^T. There is a single blocking constraint (constraint 1), so we obtain W_4 = {1}. The solution of (16.39) for k = 4 is then p^4 = (0.4, 0.2)^T, and the new step length is 1. There are no blocking constraints on this step, so the next working set is unchanged: W_5 = {1}. The new iterate is x^5 = (1.4, 1.7)^T.
Finally, we solve (16.39) for k = 5 to obtain a solution p^5 = 0. The formula (16.42) yields a multiplier λ̂_1 = 0.8, so we have found the solution. We set x* = (1.4, 1.7)^T and terminate.
FURTHER REMARKS ON THE ACTIVE-SET METHOD
We noted above that there is flexibility in the choice of the initial working set and that each initial choice leads to a different iteration sequence. When the initial active constraints have independent gradients, as above, we can include them all in W_0. Alternatively, we can select a subset. For instance, if in the example above we had chosen W_0 = {3}, the first iterate would have yielded p^0 = (0.2, 0.1)^T and a new iterate of x^1 = (2.2, 0.1)^T. If we had chosen W_0 = {5}, we would have moved immediately to the new iterate x^1 = (1, 0)^T, without first performing the operation of dropping the index 3, as is done in the example. If we had selected W_0 = ∅, we would have obtained p^0 = (-1, 2.5)^T, α_0 = 2/3, a new iterate of x^1 = (4/3, 5/3)^T, and a new working set of W_1 = {1}. The solution x* would have been found on the next iteration.
Even if the initial working set W_0 coincides with the initial active set, the sets W_k and A(x^k) may differ at later iterations. For instance, when a particular step encounters more than one blocking constraint, just one of them is added to the working set, so the identification between W_k and A(x^k) is broken. Moreover, subsequent iterates differ in general according to what choice is made.
We require the constraint gradients in W_0 to be linearly independent, and our strategy for modifying the working set ensures that this same property holds for all subsequent working sets W_k. When we encounter a blocking constraint on a particular step, its constraint normal cannot be a linear combination of the normals a_i in the current working set (see Exercise 16.18). Hence, linear independence is maintained after the blocking constraint is added to the working set. On the other hand, deletion of an index from the working set cannot introduce linear dependence.
The strategy of removing the constraint corresponding to the most negative Lagrange multiplier often works well in practice but has the disadvantage that it is susceptible to the scaling of the constraints. By multiplying constraint i by some factor β > 0 we do not change the geometry of the optimization problem, but we introduce a scaling of 1/β to the corresponding multiplier λ_i. Choice of the most negative multiplier is analogous to Dantzig's original pivot rule for the simplex method in linear programming (see Chapter 13) and, as we noted there, strategies that are less sensitive to scaling often give better results. We do not discuss this advanced topic further.
We note that the strategy of adding or deleting at most one constraint at each iteration of Algorithm 16.3 places a natural lower bound on the number of iterations needed to reach optimality. Suppose, for instance, that we have a problem in which m inequality constraints are active at the solution x* but that we start from a point x^0 that is strictly
feasible with respect to all the inequality constraints. In this case, the algorithm will need at least m iterations to move from x^0 to x*. Even more iterations will be required if the algorithm adds some constraint j to the working set at some iteration, only to remove it at a later step.
FINITE TERMINATION OF THE ACTIVE-SET ALGORITHM ON STRICTLY CONVEX QPs
It is not difficult to show that, under certain assumptions, Algorithm 16.3 converges for strictly convex QPs, that is, it identifies the solution x* in a finite number of iterations. This claim is certainly true if we assume that the method always takes a nonzero step length α_k whenever the direction p_k computed from (16.39) is nonzero. Our argument proceeds as follows:
If the solution of (16.39) is p_k = 0, the current point x_k is the unique global minimizer of q for the working set W_k; see Theorem 16.6. If it is not the solution of the original problem (16.1) (that is, at least one of the Lagrange multipliers is negative), Theorems 16.5 and 16.6 together show that the step p_{k+1} computed after a constraint is dropped will be a strict decrease direction for q. Therefore, because of our assumption α_k > 0, we have that the value of q is lower than q(x_k) at all subsequent iterations. It follows that the algorithm can never return to the working set W_k, because subsequent iterates have values of q that are lower than the global minimizer for this working set.
The algorithm encounters an iterate k for which p_k = 0 solves (16.39) at least once every n iterations. To demonstrate this claim, we note that for any k at which p_k ≠ 0, either we have α_k = 1 (in which case we reach the minimizer of q on the current working set W_k, so that the next iteration will yield p_{k+1} = 0), or else a constraint is added to the working set W_k. If the latter situation occurs repeatedly, then after at most n iterations the working set will contain n indices, which correspond to n linearly independent vectors. The solution of (16.39) will then be p_k = 0, since only the zero vector will satisfy the constraints (16.39b).
Taken together, the two statements above indicate that the algorithm finds the global minimum of q on its current working set periodically (at least once every n iterations) and that, having done so, it never visits this particular working set again. It follows that, since there are only a finite number of possible working sets, the algorithm cannot iterate forever. Eventually, it encounters a minimizer for a current working set that satisfies optimality conditions for (16.1), and it terminates with a solution.
The assumption that we can always take a nonzero step along a nonzero descent direction p_k calculated from (16.39) guarantees that the algorithm does not undergo cycling. This term refers to the situation in which a sequence of consecutive iterations results in no movement in the iterate x, while the working set W_k undergoes deletions and additions of indices
and eventually repeats itself. That is, for some integers k and l ≥ 1, we have that x^k = x^{k+l} and W_k = W_{k+l}. At each iterate in the cycle, a constraint is dropped (as in Theorem 16.5), but a new constraint i ∉ W_k is encountered immediately without any movement along the computed direction p. Procedures for handling degeneracy and cycling in quadratic programming are similar to those for linear programming discussed in Chapter 13; we do not discuss them here. Most QP implementations simply ignore the possibility of cycling.
UPDATING FACTORIZATIONS
We have seen that the step computation in the active-set method given in Algorithm 16.3 requires the solution of the equality-constrained subproblem (16.39). As mentioned at the beginning of this chapter, this computation amounts to solving the KKT system (16.5). Since the working set can change by just one index at every iteration, the KKT matrix differs in at most one row and one column from the previous iteration's KKT matrix. Indeed, G remains fixed, whereas the matrix A of constraint gradients corresponding to the current working set may change through addition and/or deletion of a single row.
It follows from this observation that we can compute the matrix factors needed to solve (16.39) at the current iteration by updating the factors computed at the previous iteration, rather than recomputing them from scratch. These updating techniques are crucial to the efficiency of active-set methods.
We limit our discussion to the case in which the step is computed with the null-space method (16.17)-(16.20). Suppose that A has m linearly independent rows and assume that the bases Y and Z are defined by means of a QR factorization of A (see Section 15.3 for details). Thus
A^T Π = Q [ R ]  =  [ Q1  Q2 ] [ R ]    (16.50)
          [ 0 ]                [ 0 ]

(see (15.21)), where Π is a permutation matrix; R is square, upper triangular, and nonsingular; Q = [Q1 Q2] is n × n orthogonal; and Q1 and R both have m columns, while Q2 has n - m columns. As noted in Chapter 15, we can choose Z to be simply the orthonormal matrix Q2.

Suppose that one constraint is added to the working set at the next iteration, so that the new constraint matrix is Ā^T = [A^T  a], where a is a column vector of length n such that Ā^T retains full column rank. As we now show, there is an economical way to update the Q and R factors in (16.50) to obtain new factors (and hence a new null-space basis matrix Z̄, with n - m - 1 columns) for the expanded matrix Ā. Note first that, since Q1 Q1^T + Q2 Q2^T = I, we have

Ā^T [ Π  0 ]  =  [ A^T Π   a ]  =  Q [ R   Q1^T a ]    (16.51)
    [ 0  1 ]                         [ 0   Q2^T a ].

We can now define an orthogonal matrix Q̂ that transforms the vector Q2^T a to a vector in which all elements except the first are zero. That is, we have

Q̂ (Q2^T a) = [ γ ]
             [ 0 ],

where γ is a scalar. From (16.51) we now have

Ā^T [ Π  0 ]  =  Q [ R   Q1^T a       ]  =  Q [ I  0    ] [ R   Q1^T a ]
    [ 0  1 ]       [ 0   Q̂^T (γ, 0)^T ]       [ 0  Q̂^T ] [ 0   (γ, 0)^T ].

This factorization has the form

Ā^T Π̄ = Q̄ [ R̄ ]
           [ 0 ],

where

Π̄ = [ Π  0 ],   Q̄ = Q [ I  0    ] = [ Q1   Q2 Q̂^T ],   R̄ = [ R   Q1^T a ]
    [ 0  1 ]           [ 0  Q̂^T ]                          [ 0   γ      ].

We can therefore choose Z̄ to be the last n - m - 1 columns of Q2 Q̂^T. If we know Z explicitly and need an explicit representation of Z̄, we need to account for the cost of obtaining Q̂ and the cost of forming the product Q2 Q̂^T = Z Q̂^T. Because of the special structure of Q̂, this cost is of order n(n - m), compared to the cost of computing (16.50) from scratch, which is of order n^2 m. The updating strategy is less expensive, especially when the null space is small (that is, when n - m ≪ n).
An updating technique can also be designed for the case in which a row is removed from A. This operation has the effect of deleting a column from R in (16.50), thus disturbing the upper triangular property of this matrix by introducing a number of nonzeros on the diagonal immediately below the main diagonal of the matrix. Upper triangularity can be restored by applying a sequence of plane rotations. These rotations introduce a number of inexpensive transformations into the first m columns of Q, and the updated null-space matrix is obtained by selecting the last n - m + 1 columns from this matrix after the transformations are complete. The new null-space basis in this case has the form

Z̄ = [ z  Z ],    (16.52)

that is, the current matrix Z is augmented by a single column. The total cost of this operation varies with the location of the removed column in A but is in general cheaper
than recomputing a QR factorization from scratch. For details of these procedures, see Gill et al. [124], Section 5.
We now consider the reduced Hessian. Because of the special form of (16.39), we have h = 0 in (16.5), and the step p_Y given in (16.18) is zero. Thus from (16.19), the null-space component p_Z is the solution of

Z^T G Z p_Z = -Z^T g.    (16.53)
We can sometimes find ways of updating the factorization of the reduced Hessian Z^T G Z after Z has changed. Suppose that we have the Cholesky factorization of the current reduced Hessian, written as

Z^T G Z = L L^T,

and that at the next step Z changes as in (16.52), gaining a column after deletion of a constraint. A series of inexpensive, elementary operations can be used to transform the Cholesky factor L into the new factor L̄ for the new reduced Hessian Z̄^T G Z̄.
A variety of other simplifications are possible. For example, as discussed in Section 16.7, we can update the reduced gradient Z^T g at the same time as we update Z to Z̄.
16.6 INTERIOR-POINT METHODS
The interior-point approach can be applied to convex quadratic programs through a simple extension of the linear-programming algorithms described in Chapter 14. The resulting primal-dual algorithms are easy to describe and are quite efficient on many types of problems. Extensions of interior-point methods to nonconvex problems are discussed in Chapter 19.
For simplicity, we restrict our attention to convex quadratic programs with inequality constraints, which we write as follows:

min_x  q(x) = (1/2) x^T G x + x^T c    (16.54a)
subject to  A x ≥ b,    (16.54b)

where G is symmetric and positive semidefinite and where the m × n matrix A and right-hand side b are defined by

A = [a_i^T]_{i ∈ I},    b = (b_i)_{i ∈ I},    I = {1, 2, ..., m}.
If equality constraints are also present, they can be accommodated with simple extensions to the approaches described below. Rewriting the KKT conditions (16.37) in this notation, we obtain

G x - A^T λ + c = 0,
A x - b ≥ 0,
(A x - b)_i λ_i = 0,    i = 1, 2, ..., m,
λ ≥ 0.

By introducing the slack vector y ≥ 0, we can rewrite these conditions as

G x - A^T λ + c = 0,    (16.55a)
A x - y - b = 0,    (16.55b)
y_i λ_i = 0,    i = 1, 2, ..., m,    (16.55c)
(y, λ) ≥ 0.    (16.55d)
Since we assume that G is positive semidefinite, these KKT conditions are not only necessary but also sufficient (see Theorem 16.4), so we can solve the convex quadratic program (16.54) by finding solutions of the system (16.55).
Given a current iterate (x, y, λ) that satisfies (y, λ) > 0, we can define a complementarity measure μ by

μ = y^T λ / m.    (16.56)

As in Chapter 14, we derive path-following, primal-dual methods by considering the perturbed KKT conditions given by

F(x, y, λ; σμ) = [ G x - A^T λ + c ]
                 [ A x - y - b     ]  =  0,    (16.57)
                 [ Y Λ e - σ μ e   ]

where

Y = diag(y_1, y_2, ..., y_m),    Λ = diag(λ_1, λ_2, ..., λ_m),    e = (1, 1, ..., 1)^T,

and σ ∈ [0, 1]. The solutions of (16.57) for all positive values of σ and μ define the central path, which is a trajectory that leads to the solution of the quadratic program as σμ tends to zero.
By fixing μ and applying Newton's method to (16.57), we obtain the linear system

[ G   0   -A^T ] [ Δx ]     [ -r_d           ]
[ A   -I   0   ] [ Δy ]  =  [ -r_p           ],    (16.58)
[ 0   Λ    Y   ] [ Δλ ]     [ -Λ Y e + σ μ e ]
where
r_d = G x - A^T λ + c,    r_p = A x - y - b.    (16.59)

We obtain the next iterate by setting

(x^+, y^+, λ^+) = (x, y, λ) + α (Δx, Δy, Δλ),    (16.60)

where α is chosen to retain the inequality (y^+, λ^+) > 0 and possibly to satisfy various other conditions.
In the rest of the chapter we discuss several enhancements of this primal-dual iteration that make it effective in practice.
SOLVING THE PRIMAL-DUAL SYSTEM
The major computational operation in the interior-point method is the solution of the system (16.58). The coefficient matrix in this system can be much more costly to factor than the matrix (14.9) arising in linear programming because of the presence of the Hessian matrix G. It is therefore important to exploit the structure of (16.58) by choosing a suitable direct factorization algorithm, or by choosing an appropriate preconditioner for an iterative solver.
As in Chapter 14, the system (16.58) may be restated in more compact forms. The augmented system form is

[ G   -A^T     ] [ Δx ]     [ -r_d                    ]
[ A   Λ^{-1} Y ] [ Δλ ]  =  [ -r_p - y + σ μ Λ^{-1} e ].    (16.61)

After a simple transformation to symmetric form, a symmetric indefinite factorization scheme can be applied to the coefficient matrix in this system. The normal equations form (cf. (14.44a)) is

( G + A^T Y^{-1} Λ A ) Δx = -r_d + A^T Y^{-1} Λ ( -r_p - y + σ μ Λ^{-1} e ),    (16.62)
which can be solved by means of a modified Cholesky algorithm. This approach is effective if the term A^T Y^{-1} Λ A is not too dense compared with G, and it has the advantage of being much smaller than (16.61) if there are many inequality constraints.
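As a sketch, one Newton step computed through the normal-equations form (16.62) might read as follows, assuming dense data and (y, λ) > 0 so that Y^{-1} Λ is well defined; the function name and organization are illustrative, and np.linalg.solve stands in for a modified Cholesky solve.

import numpy as np

def normal_equations_step(G, A, rd, rp, y, lam, sigma, mu):
    D = lam / y                                   # diagonal of Y^{-1} Lambda
    t = -rp - y + sigma * mu / lam                # -r_p - y + sigma*mu*Lambda^{-1} e
    # (16.62): (G + A^T Y^{-1} Lambda A) dx = -r_d + A^T Y^{-1} Lambda t
    dx = np.linalg.solve(G + A.T @ (D[:, None] * A), -rd + A.T @ (D * t))
    dlam = D * (t - A @ dx)                       # recovered from (16.61)
    dy = A @ dx + rp                              # second block row of (16.58)
    return dx, dy, dlam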
The projected CG method of Algorithm 16.2 can also be effective for solving the primal-dual system. We can rewrite (16.58) in the form

[ G   0          -A^T ] [ Δx ]     [ -r_d               ]
[ 0   Λ Y^{-1}    I   ] [ Δy ]  =  [ -λ + σ μ Y^{-1} e  ],    (16.63)
[ A   -I          0   ] [ Δλ ]     [ -r_p               ]

and observe that these are the optimality conditions for an equality-constrained convex quadratic program of the form (16.3), in which the variable is (Δx, Δy). Hence, we can make appropriate substitutions and solve this system using Algorithm 16.2. This approach may be useful for problems in which the direct factorization cannot be performed due to excessive memory demands. The projected CG method does not require that the matrix G be formed or factored; it requires only matrix-vector products.
STEP LENGTH SELECTION
We mentioned in Chapter 14 that interior-point methods for linear programming are more efficient if different step lengths α^pri, α^dual are used for the primal and dual variables. Equation (14.37) indicates that the greatest reduction in the residuals r_b and r_c is obtained by choosing the largest admissible primal and dual step lengths. The situation is different in quadratic programming. Suppose that we define the new iterate as

(x^+, y^+) = (x, y) + α^pri (Δx, Δy),    λ^+ = λ + α^dual Δλ,    (16.64)

where α^pri and α^dual are step lengths that ensure the positivity of (y, λ). By using (16.58) and (16.59), we see that the new residuals satisfy the following relations:

r_p^+ = (1 - α^pri) r_p,    (16.65a)
r_d^+ = (1 - α^dual) r_d + (α^pri - α^dual) G Δx.    (16.65b)

If α^pri = α^dual = α, then both residuals decrease linearly for all α ∈ (0, 1). For different step lengths, however, the dual residual r_d may increase for certain choices of α^pri, α^dual, possibly causing divergence of the interior-point iteration.
One option is to use equal step lengths, as in (16.60), and to set α = min(α^pri_max, α^dual_max), where

α^pri_max = max{ α ∈ (0, 1] : y + α Δy ≥ (1 - τ) y },    (16.66a)
α^dual_max = max{ α ∈ (0, 1] : λ + α Δλ ≥ (1 - τ) λ };    (16.66b)

the parameter τ ∈ (0, 1) controls how far we back off from the maximum step for which the conditions y + α Δy ≥ 0 and λ + α Δλ ≥ 0 are satisfied. Numerical experience has shown, however, that using different step lengths in the primal and dual variables often leads to faster convergence. One way to choose unequal step lengths is to select (α^pri, α^dual) so as to approximately minimize the optimality measure

‖G x^+ - A^T λ^+ + c‖^2 + ‖A x^+ - y^+ - b‖^2 + (y^+)^T λ^+,

subject to 0 ≤ α^pri ≤ α^pri_max and 0 ≤ α^dual ≤ α^dual_max, where (x^+, y^+, λ^+) are defined as a function of the step lengths through (16.64).
A PRACTICAL PRIMAL-DUAL METHOD
The most popular interior-point method for convex QP is based on Mehrotra's predictor-corrector method, originally developed for linear programming (see Section 14.2). The extension to quadratic programming is straightforward, as we now show.
First, we compute an affine scaling step (Δx^aff, Δy^aff, Δλ^aff) by setting σ = 0 in (16.58). We improve upon this step by computing a corrector step, which is defined following the same reasoning that leads to (14.31). Next, we compute the centering parameter σ using (14.34). The total step is obtained by solving the following system (cf. (14.35)):

[ G   0   -A^T ] [ Δx ]     [ -r_d                                ]
[ A   -I   0   ] [ Δy ]  =  [ -r_p                                ],    (16.67)
[ 0   Λ    Y   ] [ Δλ ]     [ -Λ Y e - ΔΛ^aff ΔY^aff e + σ μ e    ]

where ΔY^aff = diag(Δy^aff) and ΔΛ^aff = diag(Δλ^aff).
We now specify the algorithm. For simplicity, we will assume in our description that equal step lengths are used in the primal and dual variables (though, as noted above, unequal step lengths can give slightly faster convergence).
Algorithm 16.4 (Predictor-Corrector Algorithm for QP).
Compute (x_0, y_0, λ_0) with (y_0, λ_0) > 0;
for k = 0, 1, 2, ...
    Set (x, y, λ) = (x_k, y_k, λ_k) and solve (16.58) with σ = 0 for (Δx^aff, Δy^aff, Δλ^aff);
    Calculate μ = y^T λ / m;
    Calculate α̂_aff = max{ α ∈ (0, 1] : (y, λ) + α (Δy^aff, Δλ^aff) ≥ 0 };
    Calculate μ_aff = (y + α̂_aff Δy^aff)^T (λ + α̂_aff Δλ^aff) / m;
    Set centering parameter to σ = (μ_aff / μ)^3;
    Solve (16.67) for (Δx, Δy, Δλ);
    Choose τ_k ∈ (0, 1) and set α̂ = min(α^pri_max, α^dual_max) computed with τ = τ_k (see (16.66));
    Set (x_{k+1}, y_{k+1}, λ_{k+1}) = (x_k, y_k, λ_k) + α̂ (Δx, Δy, Δλ);
end (for)
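A compact dense NumPy sketch of this algorithm follows. It forms and solves the full block system (16.58)/(16.67) directly (a production code would use (16.61) or (16.62) instead), and it replaces the back-off rule (16.66) by the simpler damping α = τ α_max with a fixed τ. All names and parameter values are illustrative assumptions.

import numpy as np

def max_step(v, dv):
    """Largest alpha in (0, 1] with v + alpha*dv >= 0, given v > 0."""
    neg = dv < 0
    return min(1.0, (-v[neg] / dv[neg]).min()) if neg.any() else 1.0

def predictor_corrector_qp(G, A, c, b, x, y, lam, tau=0.995, maxiter=50, tol=1e-8):
    m, n = A.shape
    for _ in range(maxiter):
        rd = G @ x - A.T @ lam + c                # dual residual (16.59)
        rp = A @ x - y - b                        # primal residual (16.59)
        mu = y @ lam / m                          # complementarity measure (16.56)
        if max(np.linalg.norm(rd), np.linalg.norm(rp), mu) < tol:
            break
        K = np.block([[G, np.zeros((n, m)), -A.T],
                      [A, -np.eye(m), np.zeros((m, m))],
                      [np.zeros((m, n)), np.diag(lam), np.diag(y)]])
        # Predictor (affine-scaling) step: sigma = 0 in (16.58).
        rhs = np.concatenate([-rd, -rp, -lam * y])
        d = np.linalg.solve(K, rhs)
        dy_a, dlam_a = d[n:n + m], d[n + m:]
        a_aff = min(max_step(y, dy_a), max_step(lam, dlam_a))
        mu_aff = (y + a_aff * dy_a) @ (lam + a_aff * dlam_a) / m
        sigma = (mu_aff / mu) ** 3                # centering parameter
        # Corrector step: system (16.67) with the same matrix K.
        rhs[n + m:] = -lam * y - dlam_a * dy_a + sigma * mu
        d = np.linalg.solve(K, rhs)
        dx, dy, dlam = d[:n], d[n:n + m], d[n + m:]
        alpha = tau * min(max_step(y, dy), max_step(lam, dlam))
        x, y, lam = x + alpha * dx, y + alpha * dy, lam + alpha * dlam
    return x, y, lam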
We can choose τ_k to approach 1 as the iterates approach the solution, to accelerate the convergence.
As for linear programming, the efficiency and robustness of this approach are greatly enhanced if we choose a good starting point. This selection can be done in several ways. The following simple heuristic accepts an initial point (x̃, ỹ, λ̃) from the user and moves it far enough away from the boundary of the region (y, λ) ≥ 0 to permit the algorithm to take long steps on early iterations. First, we compute the affine scaling step (Δx^aff, Δy^aff, Δλ^aff)
from the user-supplied initial point (x̃, ỹ, λ̃), then set

y_0 = max(1, |ỹ + Δy^aff|),    λ_0 = max(1, |λ̃ + Δλ^aff|),    x_0 = x̃,

where the max and absolute values are applied componentwise.
We conclude this section by contrasting some of the properties of active-set and interior-point methods for convex quadratic programming. Active-set methods generally require a large number of steps in which each search direction is relatively inexpensive to compute, while interior-point methods take a smaller number of more expensive steps. Active-set methods are more complicated to implement, particularly if the procedures for updating matrix factorizations try to take advantage of sparsity or structure in G and A. By contrast, the nonzero structure of the matrix to be factored at each interior-point iteration remains the same at all iterations (though the numerical values change), so standard sparse factorization software can be used to obtain the steps. For particular sparsity structures (for example, bandedness in the matrices A and G), efficient customized solvers for the linear system arising at each interior-point iteration can be devised.
For very large problems, interior-point methods are often more efficient. However, when an estimate of the solution is available (a warm start), the active-set approach may converge rapidly in just a few iterations, particularly if the initial value of x is feasible. Interior-point methods are less able to exploit a warm start, though research efforts to improve their performance in this regard are ongoing.
16.7 THE GRADIENT PROJECTION METHOD
In the active-set method described in Section 16.5, the active set and working set change slowly, usually by a single index at each iteration. This method may thus require many iterations to converge on large-scale problems. For instance, if the starting point x^0 has no active constraints, while 200 constraints are active at the (nondegenerate) solution, then at least 200 iterations of the active-set method will be required to reach the solution.
The gradient projection method allows the active set to change rapidly from iteration to iteration. It is most efficient when the constraints are simple in form, in particular, when there are only bounds on the variables. Accordingly, we restrict our attention to the following bound-constrained problem:
min_x  q(x) = (1/2) x^T G x + x^T c    (16.68a)
subject to  l ≤ x ≤ u,    (16.68b)
where G is symmetric and l and u are vectors of lower and upper bounds on the components of x. We do not make any positive definiteness assumptions on G in this section, because the gradient projection approach can be applied to both convex and nonconvex problems. The
feasible region defined by (16.68b) is sometimes called a box because of its rectangular shape. Some components of x may lack an upper or a lower bound; we handle these cases formally by setting the appropriate components of l and u to -∞ and +∞, respectively.
Each iteration of the gradient projection algorithm consists of two stages. In the first stage, we search along the steepest descent direction from the current point x, that is, the direction -g, where g = Gx + c; see (16.6). Whenever a bound is encountered, the search direction is bent so that it stays feasible. We search along the resulting piecewise-linear path and locate the first local minimizer of q, which we denote by x^c and refer to as the Cauchy point, by analogy with our terminology of Chapter 4. The working set is now defined to be the set of bound constraints that are active at the Cauchy point, denoted by A(x^c). In the second stage of each gradient projection iteration, we explore the face of the feasible box on which the Cauchy point lies by solving a subproblem in which the active components x_i for i ∈ A(x^c) are fixed at the values x_i^c.
We describe the gradient projection method in detail in the rest of this section. Our convention in this section is to denote the iteration number by a superscript (that is, x^k) and use subscripts to denote the elements of a vector.
CAUCHY POINT COMPUTATION
We now derive an explicit expression for the piecewise-linear path obtained by projecting the steepest descent direction onto the feasible box, and outline the search procedure for identifying the first local minimum of q along this path.
The projection of an arbitrary point x onto the feasible region (16.68b) is defined as follows. The ith component is given by

                 | l_i    if x_i < l_i,
P(x, l, u)_i  =  | x_i    if x_i ∈ [l_i, u_i],    (16.69)
                 | u_i    if x_i > u_i.

We assume, without loss of generality, that l_i < u_i for all i. The piecewise-linear path x(t) starting at the reference point x and obtained by projecting the steepest descent direction at x onto the feasible region (16.68b) is thus given by

x(t) = P(x - t g, l, u),    (16.70)

where g = Gx + c; see Figure 16.4.
Figure 16.4 The piecewise-linear path x(t), for an example in IR^3.

The Cauchy point x^c is defined as the first local minimizer of the univariate, piecewise-quadratic function q(x(t)), for t ≥ 0. This minimizer is obtained by examining each of the line segments that make up x(t). To perform this search, we need to determine the values of t at which the kinks in x(t), or breakpoints, occur. We first identify the values of t for which each component reaches its bound along the chosen direction -g. These values t̄_i are given by the following explicit formulae:

        | (x_i - u_i) / g_i    if g_i < 0 and u_i < +∞,
t̄_i  =  | (x_i - l_i) / g_i    if g_i > 0 and l_i > -∞,    (16.71)
        | ∞                    otherwise.
The components of x(t) for any t are therefore

x_i(t) = { x_i - t g_i      if t ≤ t̄_i,
         { x_i - t̄_i g_i    otherwise.

To search for the first local minimizer along P(x - t g, l, u), we eliminate the duplicate values and zero values of t̄_i from the set {t̄_1, t̄_2, ..., t̄_n}, to obtain a sorted, reduced set of breakpoints {t_1, t_2, ..., t_l} with 0 < t_1 < t_2 < ···. We now examine the intervals [0, t_1], [t_1, t_2], [t_2, t_3], ... in turn. Suppose we have examined up to t_{j-1} and have not yet found a local minimizer. For the interval [t_{j-1}, t_j], we have that

x(t) = x(t_{j-1}) + Δt p^{j-1},

where

Δt = t - t_{j-1} ∈ [0, t_j - t_{j-1}],

and

p_i^{j-1} = { -g_i    if t_{j-1} < t̄_i,    (16.72)
            { 0       otherwise.

We can then write the quadratic (16.68a) on the line segment [x(t_{j-1}), x(t_j)] as follows:

q(x(t)) = c^T (x(t_{j-1}) + Δt p^{j-1}) + (1/2) (x(t_{j-1}) + Δt p^{j-1})^T G (x(t_{j-1}) + Δt p^{j-1}).

Expanding and grouping the coefficients of 1, Δt, and (Δt)^2, we find that

q(x(t)) = f_{j-1} + f'_{j-1} Δt + (1/2) f''_{j-1} (Δt)^2,    Δt ∈ [0, t_j - t_{j-1}],    (16.73)

where the coefficients f_{j-1}, f'_{j-1}, and f''_{j-1} are defined by

f_{j-1}   = c^T x(t_{j-1}) + (1/2) x(t_{j-1})^T G x(t_{j-1}),
f'_{j-1}  = c^T p^{j-1} + x(t_{j-1})^T G p^{j-1},
f''_{j-1} = (p^{j-1})^T G p^{j-1}.

Differentiating (16.73) with respect to Δt and equating to zero, we obtain Δt* = -f'_{j-1} / f''_{j-1}. The following cases can occur. (i) If f'_{j-1} > 0 there is a local minimizer of q(x(t)) at t = t_{j-1}; else (ii) if Δt* ∈ [0, t_j - t_{j-1}) there is a minimizer at t = t_{j-1} + Δt*; (iii) in all other cases we move on to the next interval [t_j, t_{j+1}] and continue the search.

For the next search interval, we need to calculate the new direction p^j from (16.72), and we use this new value to calculate f_j, f'_j, and f''_j. Since p^j differs from p^{j-1} typically in just one component, computational savings can be made by updating these coefficients rather than computing them from scratch.
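A NumPy sketch of this Cauchy-point search follows. For clarity it recomputes the coefficients of (16.73) on each segment from scratch (forgoing the updating economies just mentioned), uses np.clip as the projection (16.69), and does not handle directions of unbounded decrease; all names are illustrative.

import numpy as np

def cauchy_point(G, c, x, l, u):
    g = G @ x + c
    # Breakpoints (16.71) for each component.
    tbar = np.full(len(x), np.inf)
    neg, pos = g < 0, g > 0
    tbar[neg] = (x[neg] - u[neg]) / g[neg]
    tbar[pos] = (x[pos] - l[pos]) / g[pos]
    breakpoints = np.unique(tbar[(tbar > 0) & np.isfinite(tbar)])

    def segment(t_prev):
        xj = np.clip(x - t_prev * g, l, u)        # x(t_{j-1}); clip is P of (16.69)
        p = np.where(t_prev < tbar, -g, 0.0)      # segment direction p^{j-1}, (16.72)
        f1 = c @ p + xj @ (G @ p)                 # f'_{j-1} of (16.73)
        f2 = p @ (G @ p)                          # f''_{j-1} of (16.73)
        return xj, f1, f2

    t_prev = 0.0
    for t in breakpoints:                         # examine [t_{j-1}, t_j] in turn
        xj, f1, f2 = segment(t_prev)
        if f1 > 0:                                # case (i): minimizer at t_{j-1}
            return xj
        if f2 > 0 and -f1 / f2 < t - t_prev:      # case (ii): interior minimizer
            return np.clip(x - (t_prev - f1 / f2) * g, l, u)
        t_prev = t                                # case (iii): next interval
    # Final segment: with all bounds finite p = 0 here and xj is the answer.
    xj, f1, f2 = segment(t_prev)
    if f1 < 0 and f2 > 0:
        return np.clip(x - (t_prev - f1 / f2) * g, l, u)
    return xj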
SUBSPACE MINIMIZATION
After the Cauchy point x^c has been computed, the components of x^c that are at their lower or upper bounds define the active set

A(x^c) = { i | x_i^c = l_i or x_i^c = u_i }.

In the second stage of the gradient projection iteration, we approximately solve the QP obtained by fixing the components x_i for i ∈ A(x^c) at the values x_i^c. The remaining components are determined from the subproblem

min_x  q(x) = (1/2) x^T G x + x^T c    (16.74a)
subject to  x_i = x_i^c,  i ∈ A(x^c),    (16.74b)
            l_i ≤ x_i ≤ u_i,  i ∉ A(x^c).    (16.74c)
It is not necessary to solve this problem exactly. Nor is it desirable in the large-dimensional case, because the subproblem may be almost as difficult as the original problem (16.68). In fact, to obtain global convergence of the gradient projection procedure, we require only that the approximate solution x^+ of (16.74) is feasible with respect to (16.68b) and has an objective function value no worse than that of x^c, that is, q(x^+) ≤ q(x^c). A strategy that is intermediate between choosing x^+ = x^c as the approximate solution on the one hand and solving (16.74) exactly on the other hand is to compute an approximate solution of (16.74) by using the conjugate gradient iteration described in Algorithm 16.1 or Algorithm 16.2. Note that for the equality constraints (16.74b), the Jacobian A and the null-space basis matrix Z have particularly simple forms. We could therefore apply conjugate gradient to the problem (16.74a), (16.74b) and terminate as soon as a bound l ≤ x ≤ u is encountered. Alternatively, we could continue to iterate, temporarily ignoring the bounds, and project the solution back onto the box constraints. The negative-curvature case can be handled as in Algorithm 7.2, the method for approximately solving (possibly indefinite) trust-region subproblems in unconstrained optimization.
We summarize the gradient projection algorithm for quadratic programming as follows.
Algorithm 16.5 (Gradient Projection Method for QP).
Compute a feasible starting point x_0;
for k = 0, 1, 2, ...
    if x_k satisfies the KKT conditions for (16.68)
        stop with solution x* = x_k;
    Set x = x_k and find the Cauchy point x^c;
    Find an approximate solution x⁺ of (16.74) such that q(x⁺) ≤ q(x^c)
        and x⁺ is feasible;
    x_{k+1} = x⁺;
end (for)
If the algorithm approaches a solution x* at which the Lagrange multipliers associated with all the active bounds are nonzero (that is, strict complementarity holds), the active sets A(x^c) generated by the gradient projection algorithm are equal to the optimal active set for all k sufficiently large. That is, constraint indices do not repeatedly enter and leave the active set on successive iterations. When the problem is degenerate, the active set may not settle down at its optimal value. Various devices have been proposed to prevent this undesirable behavior from taking place.
While gradient projection methods can be applied in principle to problems with general linear constraints, significant computation may be required to perform the projection onto the feasible set in such cases. For example, if the constraint set is defined as {x | a_i^T x ≥ b_i, i ∈ I}, we must solve the following convex quadratic program to compute the projection of a given point x̄ onto this set:

min_x ‖x - x̄‖²   subject to   a_i^T x ≥ b_i for all i ∈ I.

The expense of solving this projection subproblem may approach the cost of solving the original quadratic program, so it is usually not economical to apply gradient projection to this case.
When we use duality to replace a strictly convex quadratic program with its dual (see Example 12.12), the gradient projection method may be useful in solving the bound-constrained dual problem, which is formulated in terms of the Lagrange multipliers λ as follows:

max_λ  q̃(λ) = -½ (A^T λ - c)^T G^{-1} (A^T λ - c) + b^T λ,   subject to λ ≥ 0.

Note that the dual is conventionally written as a maximization problem; we can equivalently minimize -q̃(λ) and note that this transformed problem is convex. This approach is most useful when G has a simple form, for example, a diagonal or block-diagonal matrix.
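For illustration, the following sketch (our own, not from the book) applies projected gradient ascent to this dual under the assumption that G is diagonal with positive entries, so that G^{-1} is trivial; the primal convention assumed is min ½xᵀGx + cᵀx subject to Ax ≥ b, as in Example 12.12:

```python
import numpy as np

def dual_gradient_projection(g_diag, A, b, c, iters=1000):
    # Projected gradient ascent on the dual, assuming G = diag(g_diag) > 0,
    # for the primal  min 0.5 x^T G x + c^T x  subject to  A x >= b.
    g_inv = 1.0 / g_diag
    lam = np.zeros(A.shape[0])
    # Fixed step length from a Lipschitz bound: L = ||A G^{-1} A^T||_2.
    L = np.linalg.norm((A * g_inv) @ A.T, 2)
    for _ in range(iters):
        x = g_inv * (A.T @ lam - c)            # primal point x(lam) = G^{-1}(A^T lam - c)
        grad = b - A @ x                       # gradient of the dual objective
        lam = np.maximum(0.0, lam + grad / L)  # ascent step, then project onto lam >= 0
    return x, lam
```

The projection onto {λ ≥ 0} costs only a componentwise maximum, which is what makes the dual formulation attractive for gradient projection.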
16.8 PERSPECTIVES AND SOFTWARE
Active-set methods for convex quadratic programming are implemented in QPOPT [126], VE09 [142], BQPD [103], and QPA [148]. Several commercial interior-point solvers for QP are available, including CPLEX [172], XPRESS-MP [159], and MOSEK [5]. The code QPB [146] uses a two-phase interior-point method that can handle convex and nonconvex problems. OOPS [139] and OOQP [121] are object-oriented interior-point codes that allow the user to customize the linear algebra techniques to the particular structure of the data for an application. Some nonlinear programming interior-point packages, such as LOQO [294] and KNITRO [46], are also effective for convex and nonconvex quadratic programming.
The numerical comparison of active-set and interior-point methods for convex quadratic programming reported in [149] indicates that interior-point methods are generally much faster on large problems. If a warm start is required, however, active-set methods may be generally preferable. Although considerable research has been focused on improving the warm-start capabilities of interior-point methods, the full potential of such techniques is not yet known.
We have assumed in this chapter that all equality-constrained quadratic programs have linearly independent constraints, that is, the m × n constraint Jacobian matrix A has rank m.
If redundant constraints are present, they can be detected by forming an SVD or a rank-revealing QR factorization of A^T, and then removed from the formulation. When A is large, sparse Gaussian elimination techniques can be applied to A^T instead, but they are less reliable.
The KNITRO and OOPS software packages provide the option of solving the primal-dual equations (16.63) by means of the projected CG iteration of Algorithm 16.2.
We have not considered active-set methods for the case in which the Hessian matrix G is indefinite, because these methods can be quite complicated to describe and it is not well understood how to adapt them to the large-dimensional case. We make some comments here on the principal techniques.
Algorithm 16.3, the active-set method for convex QP, can be adapted to this indefinite case by modifying the computation of the search direction and step length in certain situations. To explain the need for the modification, we consider the computation of a step by a null-space method, that is, p = Z p_Z, where p_Z is given by (16.53). If the reduced Hessian Z^T G Z is positive definite, then this step p points to the minimizer of the subproblem (16.39), and the logic of the iteration need not be changed. If Z^T G Z has negative eigenvalues, however, p points only to a saddle point of (16.39) and is therefore not always a suitable step. Instead, we seek an alternative direction s_Z that is a direction of negative curvature for Z^T G Z. We then have that

q(x + α Z s_Z) → -∞   as α → ∞.      (16.75)

Additionally, we change the sign of s_Z if necessary to ensure that Z s_Z is a non-ascent direction for q at the current point x, that is, ∇q(x)^T Z s_Z ≤ 0. By moving along the direction Z s_Z, we will encounter a constraint that can be added to the working set for the next iteration. If we don't find such a constraint, the problem is unbounded. If the reduced Hessian for the new working set is not positive definite, we repeat this process until enough constraints have been added to make the reduced Hessian positive definite. A difficulty with this general approach, however, is that if we allow the reduced Hessian to have several negative eigenvalues, it is difficult to make these methods efficient when the reduced Hessian changes from one working set to the next.
Inertia-controlling methods are a practical class of algorithms for indefinite QP that never allow the reduced Hessian to have more than one negative eigenvalue. As in the convex case, there is a preliminary phase in which a feasible starting point x_0 is found. We place the additional demand on x_0 that it be either a vertex (in which case the reduced Hessian is the null matrix) or a constrained stationary point at which the reduced Hessian is positive definite. At each iteration, the algorithm will either add or remove a constraint from the working set. If a constraint is added, the reduced Hessian is of smaller dimension and must remain positive definite or be the null matrix. Therefore, an indefinite reduced Hessian can arise only when one of the constraints is removed from the working set, which happens only when the current point is a minimizer with respect to the current working set. In this case, we will choose the new search direction to be a direction of negative curvature for the reduced Hessian.
Various algorithms for indefinite QP differ in the way that indefiniteness is detected, in the computation of the negative curvature direction, and in the handling of the working set; see Fletcher [99] and Gill and Murray [126].
NOTES AND REFERENCES
The problem of determining whether a feasible point for a nonconvex QP (16.1) is a global minimizer is NP-hard (Murty and Kabadi [219]); so is the problem of determining whether a given point is a local minimizer (Vavasis [296], Theorem 5.1). Various algorithms for convex QP with polynomial complexity are discussed in Nesterov and Nemirovskii [226].

The portfolio optimization problem was formulated by Markowitz [201].

For a discussion of the QMR, LSQR, and GMRES methods see, for example, [136], [272], [290]. The idea of using the projection (16.30) in the CG method dates back at least to Polyak [238]. The alternative (16.34), and its special case (16.32), are proposed in Coleman [64]. Although the projection can give rise to substantial rounding errors, these can be corrected by iterative refinement; see Gould et al. [143]. More recent studies on preconditioning of the projected CG method include Keller et al. [176] and Luksan and Vlcek [196].

For further discussion of the gradient projection method see, for example, Conn, Gould, and Toint [70] and Burke and Moré [44].

In some areas of application, the KKT matrix (16.7) not only is sparse but also contains special structure. For instance, the quadratic programs that arise in many control problems have banded matrices G and A (see Wright [315]), which can be exploited by interior-point methods via a suitable symmetric reordering of K. When active-set methods are applied to this problem, however, the advantages of bandedness and sparsity are lost after just a few updates of the factorization.

Further details of interior-point methods for convex quadratic programming can be found in Wright [316] and Vanderbei [293]. The first inertia-controlling method for indefinite quadratic programming was proposed by Fletcher [99]. See also Gill et al. [129] and Gould [142] for a discussion of methods for general quadratic programming.
EXERCISES

16.1
(a) Solve the following quadratic program and illustrate it geometrically:

min f(x) = 2x_1 + 3x_2 + 4x_1² + 2x_1x_2 + x_2²,
subject to x_1 - x_2 ≥ 0, x_1 + x_2 ≤ 4, x_1 ≤ 3.

(b) If the objective function is redefined as q(x) = -f(x), does the problem have a finite minimum? Are there local minimizers?
16.2 The problem of finding the shortest distance from a point x_0 to the hyperplane {x | Ax = b}, where A has full row rank, can be formulated as the quadratic program

min ½ (x - x_0)^T (x - x_0)   subject to   Ax = b.

Show that the optimal multiplier is

λ* = (AA^T)^{-1} (b - Ax_0),

and that the solution is

x* = x_0 + A^T (AA^T)^{-1} (b - Ax_0).

Show that in the special case in which A is a row vector, the shortest distance from x_0 to the solution set of Ax = b is |b - Ax_0| / ‖A‖₂.
16.3 Use Theorem 12.1 to verify that the first-order necessary conditions for (16.3) are given by (16.4).
16.4 Suppose that G is positive semidefinite in (16.1) and that x* satisfies the KKT conditions (16.37) for some λ_i*, i ∈ A(x*). Suppose in addition that second-order sufficient conditions are satisfied, that is, Z^T G Z is positive definite, where the columns of Z span the null space of the active constraint Jacobian matrix. Show that x* is in fact the unique global solution for (16.1), that is, q(x) > q(x*) for all feasible x with x ≠ x*.
16.5 Verify that the inverse of the KKT matrix is given by (16.16).

16.6 Use Theorem 12.6 to show that if the conditions of Lemma 16.1 hold, then the second-order sufficient conditions for (16.3) are satisfied by the vector pair (x*, λ*) that satisfies (16.4).

16.7 Consider (16.3), and suppose that the projected Hessian matrix Z^T G Z has a negative eigenvalue; that is, u^T Z^T G Z u < 0 for some vector u. Show that if there exists any vector pair (x*, λ*) that satisfies (16.4), then the point x* is only a stationary point of (16.3) and not a local minimizer. (Hint: Consider the function q(x* + αZu) for α ≠ 0, and use an expansion like that in the proof of Theorem 16.2.)
16.8 By using the QR factorization and a permutation matrix, show that for a full-rank m × n matrix A with m ≤ n, one can find an orthogonal matrix Q and an m × m upper triangular matrix U such that AQ = [0 U]. (Hint: Start by applying the standard QR factorization to A^T.)
16.9 Verify that the first-order conditions for optimality of (16.1) are equivalent to (16.37) when we make use of the active-set definition (16.36).
16.10 For each of the alternative choices of initial working set W_0 in the example (16.49) (that is, W_0 = {3}, W_0 = {5}, and W_0 = ∅), work through the first two iterations of Algorithm 16.3.
16.11 Program Algorithm 16.3, and use it to solve the problem

min x_1² + 2x_2² - 2x_1 - 6x_2 - 2x_1x_2
subject to ½x_1 + ½x_2 ≤ 1, -x_1 + 2x_2 ≤ 2, x_1, x_2 ≥ 0.

Choose three initial starting points: one in the interior of the feasible region, one at a vertex, and one at a non-vertex point on the boundary of the feasible region.
16.12 Show that the operator P defined by (16.27) is independent of the choice of null-space basis Z. (Hint: First show that any null-space basis Z can be written as Z = QB, where Q is a matrix whose orthonormal columns span the null space and B is a nonsingular matrix.)
16.13
(a) Show that the computation of the preconditioned residual g⁺ in (16.28d) can be performed with (16.29) or (16.30).
(b) Show that we can also perform this computation by solving the system (16.32).
(c) Verify (16.33).
16.14
(a) Show that if Z^T G Z is positive definite, then the denominator in (16.28a) is nonzero.
(b) Show that if Z^T r = r_Z ≠ 0 and Z^T H Z is positive definite, then the denominator in (16.28e) is nonzero.
16.15 Consider problem (16.3), and assume that A has full row rank and that Z is a basis for the null space of A. Prove that there are no finite solutions if Z^T G Z has negative eigenvalues.

16.16
(a) Assume that A ≠ 0. Show that the KKT matrix (16.7) is indefinite.
(b) Prove that if the KKT matrix (16.7) is nonsingular, then A must have full rank.

16.17 Consider the quadratic program

max 6x_1 + 4x_2 - 13 - x_1² - x_2²,
subject to x_1 + x_2 ≤ 3, x_1 ≥ 0, x_2 ≥ 0.      (16.76)
First solve it graphically, and then use your program implementing the active-set method given in Algorithm 16.3.
16.18 Using (16.39) and (16.41), explain briefly why the gradient of each blocking constraint cannot be a linear combination of the constraint gradients in the current working set W_k.
16.19 Let W be an n × n symmetric matrix, and suppose that Z is of dimension n × t. Suppose that Z^T W Z is positive definite and that Z̄ is obtained by removing a column from Z. Show that Z̄^T W Z̄ is positive definite.
16.20 Find a null-space basis matrix Z for the equality-constrained problem defined by (16.74a), (16.74b).
16.21 Write down the KKT conditions for the following convex quadratic program with mixed equality and inequality constraints:

min q(x) = ½ x^T G x + x^T c   subject to   Ax ≥ b,  Āx = b̄,

where G is symmetric and positive semidefinite. Use these conditions to derive an analogue of the generic primal-dual step (16.58) for this problem.
16.22 Explain why, for a bound-constrained problem, the number of possible active sets is at most 3^n.
16.23
(a) Show that the primal-dual system (16.58) can be solved using the augmented system (16.61) or the normal equations (16.62). Describe in detail how all the components Δx, Δy, Δλ are computed.
(b) Verify (16.65).
16.24 Program Algorithm 16.4 and use it to solve problem (16.76). Set all initial variables to be the vector e = (1, 1, ..., 1)^T.
16.25 Let x̂ ∈ IR^n be given, and let x* be the solution of the projection problem

min ‖x - x̂‖²   subject to   l ≤ x ≤ u.      (16.77)

For simplicity, assume that l_i < u_i for all i = 1, 2, ..., n. Show that the solution of this problem coincides with the projection formula given by (16.69), that is, show that x* = P(x̂, l, u). (Hint: Note that the problem is separable.)
16.26 Consider the bound-constrained quadratic problem (16.68) with

G = [ 4  1 ]    c = ( 1 )    l = ( 0 )    u = ( 5 )
    [ 1  2 ],       ( 1 ),       ( 0 ),       ( 3 ).      (16.78)

Suppose x_0 = (0, 2)^T. Find t̄_1, t̄_2, t_1, t_2, p^1, p^2, and x(t_1), x(t_2). Find the minimizer of q(x(t)).
16.27 Consider the search for the one-dimensional minimizer of the function q(x(t)) defined by (16.73). There are 9 possible cases, since f'_{j-1} and f''_{j-1} can each be positive, negative, or zero. For each case, determine the location of the minimizer. Verify that the rules described in Section 16.7 hold.
CHAPTER 17
Penalty and Augmented Lagrangian Methods

Some important methods for constrained optimization replace the original problem by a sequence of subproblems in which the constraints are represented by terms added to the objective. In this chapter we describe three approaches of this type. The quadratic penalty method adds a multiple of the square of the violation of each constraint to the objective. Because of its simplicity and intuitive appeal, this approach is used often in practice, although it has some important disadvantages. In nonsmooth exact penalty methods, a single unconstrained problem (rather than a sequence) takes the place of the original constrained problem. Using these penalty functions, we can often find a solution by performing a single
unconstrained minimization, but the nonsmoothness may create complications. A popular function of this type is the ℓ1 penalty function. A different kind of exact penalty approach is the method of multipliers or augmented Lagrangian method, in which explicit Lagrange multiplier estimates are used to avoid the ill-conditioning that is inherent in the quadratic penalty function.
A somewhat related approach is used in the log-barrier method, in which logarithmic terms prevent feasible iterates from moving too close to the boundary of the feasible region. This approach forms part of the foundation for interior-point methods for nonlinear programming, and we discuss it further in Chapter 19.
17.1 THE QUADRATIC PENALTY METHOD
MOTIVATION
Let us consider replacing a constrained optimization problem by a single function consisting of
the original objective of the constrained optimization problem, plus
one additional term for each constraint, which is positive when the current point x
violates that constraint and zero otherwise.
Most approaches define a sequence of such penalty functions, in which the penalty terms for the constraint violations are multiplied by a positive coefficient. By making this coefficient larger, we penalize constraint violations more severely, thereby forcing the minimizer of the penalty function closer to the feasible region for the constrained problem.
The simplest penalty function of this type is the quadratic penalty function, in which the penalty terms are the squares of the constraint violations. We describe this approach first in the context of the equality-constrained problem

min_x f(x)   subject to   c_i(x) = 0,  i ∈ E,      (17.1)
which is a special case of 12.1. The quadratic penalty function Qx; for this formulation is
def 2
cix, 17.2
where 0 is the penalty parameter. By driving to , we penalize the constraint violations with increasing severity. It makes good intuitive sense to consider a sequence of valueskwithk ask ,andtoseektheapproximateminimizerxk ofQx;k for each k. Because the penalty terms in 17.2 are smooth, we can use techniques from
Qx; fx2
iE
unconstrained optimization to search for x_k. In searching for x_k, we can use the minimizers x_{k-1}, x_{k-2}, etc., of Q(·; μ) for smaller values of μ to construct an initial guess. For suitable choices of the sequence {μ_k} and the initial guesses, just a few steps of unconstrained minimization may be needed for each μ_k.
EXAMPLE 17.1
Consider the problem (12.9) from Chapter 12, that is,

min x_1 + x_2   subject to   x_1² + x_2² - 2 = 0,      (17.3)

for which the solution is (-1, -1)^T, and the quadratic penalty function is

Q(x; μ) = x_1 + x_2 + (μ/2)(x_1² + x_2² - 2)².      (17.4)

We plot the contours of this function in Figures 17.1 and 17.2. In Figure 17.1 we have μ = 1, and we observe a minimizer of Q near the point (-1.1, -1.1)^T. There is also a local maximizer near x = (0.3, 0.3)^T. In Figure 17.2 we have μ = 10, so points that do not lie on the feasible circle defined by x_1² + x_2² = 2 suffer a much greater penalty than in the first figure; the trough of low values of Q is clearly evident. The minimizer in this figure is much closer to the solution (-1, -1)^T of the problem (17.3). A local maximum lies near (0, 0)^T, and Q goes rapidly to ∞ outside the circle x_1² + x_2² = 2.
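The behavior described in this example is easy to reproduce numerically; the following is a minimal sketch of our own using scipy's BFGS routine, with the two penalty values used in the figures:

```python
import numpy as np
from scipy.optimize import minimize

def Q(x, mu):
    # Quadratic penalty function (17.4) for problem (17.3).
    return x[0] + x[1] + 0.5 * mu * (x[0]**2 + x[1]**2 - 2.0)**2

for mu in (1.0, 10.0):
    res = minimize(Q, x0=np.array([-1.5, -1.5]), args=(mu,), method="BFGS")
    print(mu, res.x)   # minimizer approaches (-1, -1) as mu grows
```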
[Figure 17.1 (contour plot): Contours of Q(x; μ) from (17.4) for μ = 1, contour spacing 0.5.]
[Figure 17.2 (contour plot): Contours of Q(x; μ) from (17.4) for μ = 10, contour spacing 2.]
The situation is not always so benign as in Example 17.1. For a given value of the penalty parameter μ, the penalty function may be unbounded below even if the original constrained problem has a unique solution. Consider for example

min -5x_1² + x_2²   subject to   x_1 = 1,      (17.5)

whose solution is (1, 0)^T. The penalty function is unbounded below for any μ < 10. For such values of μ, the iterates generated by an unconstrained minimization method would usually diverge. This deficiency is, unfortunately, common to all the penalty functions discussed in this chapter.
For the general constrained optimization problem

min_x f(x)   subject to   c_i(x) = 0, i ∈ E,   c_i(x) ≥ 0, i ∈ I,      (17.6)

which contains inequality constraints as well as equality constraints, we can define the quadratic penalty function as

Q(x; μ) = f(x) + (μ/2) Σ_{i∈E} c_i²(x) + (μ/2) Σ_{i∈I} ([c_i(x)]⁻)²,      (17.7)

where [y]⁻ denotes max(-y, 0). In this case, Q may be less smooth than the objective and constraint functions. For instance, if one of the inequality constraints is x_1 ≥ 0, then the function (min(0, x_1))² has a discontinuous second derivative, so that Q is no longer twice continuously differentiable.
ALGORITHMIC FRAMEWORK
A general framework for algorithms based on the quadratic penalty function (17.2) can be specified as follows.

Framework 17.1 (Quadratic Penalty Method).
Given μ_0 > 0, a nonnegative sequence {τ_k} with τ_k → 0, and a starting point x_0^s;
for k = 0, 1, 2, ...
    Find an approximate minimizer x_k of Q(·; μ_k), starting at x_k^s,
        and terminating when ‖∇_x Q(x; μ_k)‖ ≤ τ_k;
    if final convergence test satisfied
        stop with approximate solution x_k;
    end (if)
    Choose new penalty parameter μ_{k+1} > μ_k;
    Choose new starting point x_{k+1}^s;
end (for)
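A bare-bones implementation of Framework 17.1 can be assembled from any unconstrained solver. The following Python sketch is our own construction (the termination test and the update rules are simple illustrative choices, not the book's):

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, cons, x0, mu0=1.0, tau0=1.0, k_max=20):
    # Minimal sketch of Framework 17.1 for equality constraints c(x) = 0.
    # `cons` maps x to the vector of constraint values; gradients are left
    # to scipy's finite differences for simplicity.
    x, mu, tau = np.asarray(x0, float), mu0, tau0
    for _ in range(k_max):
        Q = lambda z: f(z) + 0.5 * mu * np.sum(cons(z)**2)
        x = minimize(Q, x, method="BFGS", options={"gtol": tau}).x
        if np.linalg.norm(cons(x)) < 1e-8:   # a simple final convergence test
            break
        mu *= 10.0                           # ambitious increase mu_{k+1} = 10 mu_k
        tau *= 0.1                           # tighten the subproblem tolerance
    return x

# Example: problem (17.3); the iterates approach (-1, -1).
print(quadratic_penalty(lambda x: x[0] + x[1],
                        lambda x: np.array([x[0]**2 + x[1]**2 - 2.0]),
                        x0=[-1.5, -1.0]))
```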
The parameter sequence {μ_k} can be chosen adaptively, based on the difficulty of minimizing the penalty function at each iteration. When minimization of Q(x; μ_k) proves to be expensive for some k, we choose μ_{k+1} to be only modestly larger than μ_k; for instance μ_{k+1} = 1.5μ_k. If we find the approximate minimizer of Q(x; μ_k) cheaply, we could try a more ambitious increase, for instance μ_{k+1} = 10μ_k. The convergence theory for Framework 17.1 allows wide latitude in the choice of nonnegative tolerances τ_k; it requires only that τ_k → 0, to ensure that the minimization is carried out more accurately as the iterations progress.
There is no guarantee that the stop test ‖∇_x Q(x; μ_k)‖ ≤ τ_k will be satisfied because, as discussed above, the iterates may move away from the feasible region when the penalty parameter is not large enough. A practical implementation must include safeguards that increase the penalty parameter (and possibly restore the initial point) when the constraint violation is not decreasing rapidly enough, or when the iterates appear to be diverging.
When only equality constraints are present, Q(x; μ_k) is smooth, so the algorithms for unconstrained minimization described in the first chapters of the book can be used to identify the approximate solution x_k. However, the minimization of Q(x; μ_k) becomes more difficult to perform as μ_k becomes large, unless we use special techniques to calculate the search directions. For one thing, the Hessian ∇²_{xx} Q(x; μ_k) becomes arbitrarily ill-conditioned near the minimizer. This property alone is enough to make many unconstrained minimization algorithms (such as quasi-Newton and conjugate gradient) perform poorly. Newton's method, on the other hand, is not sensitive to ill conditioning of the Hessian, but it, too, may encounter difficulties for large μ_k, for two other reasons. First, ill conditioning of ∇²_{xx} Q(x; μ_k) might be expected to cause numerical problems when we solve the linear equations to calculate the Newton step. We discuss this issue below, and show that these effects are not severe and
that a reformulation of the Newton equations is possible. Second, even when x is close to the minimizer of Q(·; μ_k), the quadratic Taylor series approximation to Q(x; μ_k) about x is a reasonable approximation of the true function only in a small neighborhood of x. This property can be seen in Figure 17.2, where the contours of Q near the minimizer have a banana shape, rather than the elliptical shape that characterizes quadratic functions. Since Newton's method is based on the quadratic model, the steps that it generates may not make rapid progress toward the minimizer of Q(x; μ_k). This difficulty can be lessened by a judicious choice of the starting point x_{k+1}^s, or by setting x_{k+1}^s = x_k and choosing μ_{k+1} to be only modestly larger than μ_k.
CONVERGENCE OF THE QUADRATIC PENALTY METHOD
We describe some convergence properties of the quadratic penalty method in the following two theorems. We restrict our attention to the equality-constrained problem (17.1), for which the quadratic penalty function is defined by (17.2).

For the first result we assume that the penalty function Q(x; μ_k) has a finite minimizer for each value of μ_k.
Theorem 17.1.
Suppose that each x_k is the exact global minimizer of Q(x; μ_k) defined by (17.2) in Framework 17.1 above, and that μ_k → ∞. Then every limit point x* of the sequence {x_k} is a global solution of the problem (17.1).

PROOF. Let x̄ be a global solution of (17.1), that is,

f(x̄) ≤ f(x) for all x with c_i(x) = 0, i ∈ E.

Since x_k minimizes Q(·; μ_k) for each k, we have that Q(x_k; μ_k) ≤ Q(x̄; μ_k), which leads to the inequality

f(x_k) + (μ_k/2) Σ_{i∈E} c_i²(x_k) ≤ f(x̄) + (μ_k/2) Σ_{i∈E} c_i²(x̄) = f(x̄).      (17.8)

By rearranging this expression, we obtain

Σ_{i∈E} c_i²(x_k) ≤ (2/μ_k) [f(x̄) - f(x_k)].      (17.9)

Suppose that x* is a limit point of {x_k}, so that there is an infinite subsequence K such that

lim_{k∈K} x_k = x*.

By taking the limit as k → ∞, k ∈ K, on both sides of (17.9), we obtain

Σ_{i∈E} c_i²(x*) = lim_{k∈K} Σ_{i∈E} c_i²(x_k) ≤ lim_{k∈K} (2/μ_k) [f(x̄) - f(x_k)] = 0,

where the last equality follows from μ_k → ∞. Therefore, we have that c_i(x*) = 0 for all i ∈ E, so that x* is feasible. Moreover, by taking the limit as k → ∞ for k ∈ K in (17.8), we have by nonnegativity of μ_k and of each c_i²(x_k) that

f(x*) ≤ f(x*) + lim_{k∈K} (μ_k/2) Σ_{i∈E} c_i²(x_k) ≤ f(x̄).

Since x* is a feasible point whose objective value is no larger than that of the global solution x̄, we conclude that x*, too, is a global solution, as claimed. □
Since this result requires us to find the global minimizer for each subproblem, this desirable property of convergence to the global solution of (17.1) cannot be attained in general. The next result concerns convergence properties of the sequence {x_k} when we allow inexact (but increasingly accurate) minimizations of Q(·; μ_k). In contrast to Theorem 17.1, it shows that the sequence may be attracted to infeasible points, or to any KKT point (that is, a point satisfying first-order necessary conditions; see (12.34)), rather than to a minimizer. It also shows that the quantities -μ_k c_i(x_k) may be used as estimates of the Lagrange multipliers λ_i* in certain circumstances. This observation is important for the analysis of augmented Lagrangian methods in Section 17.3.

To establish the result we will make the optimistic assumption that the stop test ‖∇_x Q(x; μ_k)‖ ≤ τ_k is satisfied for all k.
Theorem 17.2.
Suppose that the tolerances and penalty parameters in Framework 17.1 satisfy τ_k → 0 and μ_k → ∞. Then if a limit point x* of the sequence {x_k} is infeasible, it is a stationary point of the function ‖c(x)‖². On the other hand, if a limit point x* is feasible and the constraint gradients ∇c_i(x*) are linearly independent, then x* is a KKT point for the problem (17.1). For such points, we have for any infinite subsequence K such that lim_{k∈K} x_k = x* that

lim_{k∈K} -μ_k c_i(x_k) = λ_i*,  for all i ∈ E,      (17.10)

where λ* is the multiplier vector that satisfies the KKT conditions (12.34) for the equality-constrained problem (17.1).

PROOF. By differentiating Q(x; μ_k) in (17.2), we obtain

∇_x Q(x_k; μ_k) = ∇f(x_k) + Σ_{i∈E} μ_k c_i(x_k) ∇c_i(x_k),      (17.11)

so from the termination criterion for Framework 17.1, we have that

‖∇f(x_k) + Σ_{i∈E} μ_k c_i(x_k) ∇c_i(x_k)‖ ≤ τ_k.      (17.12)

By rearranging this expression, and in particular using the inequality ‖a‖ - ‖b‖ ≤ ‖a + b‖, we obtain

‖Σ_{i∈E} c_i(x_k) ∇c_i(x_k)‖ ≤ (1/μ_k) [τ_k + ‖∇f(x_k)‖].      (17.13)

Let x* be a limit point of the sequence of iterates. Then there is a subsequence K such that lim_{k∈K} x_k = x*. When we take limits as k → ∞ for k ∈ K, the bracketed term on the right-hand side approaches ‖∇f(x*)‖, so because μ_k → ∞, the right-hand side approaches zero. From the corresponding limit on the left-hand side, we obtain

Σ_{i∈E} c_i(x*) ∇c_i(x*) = 0.      (17.14)

We can have c_i(x*) ≠ 0 if the constraint gradients ∇c_i(x*) are dependent, but in this case (17.14) implies that x* is a stationary point of the function ‖c(x)‖².

If, on the other hand, the constraint gradients ∇c_i(x*) are linearly independent at a limit point x*, we have from (17.14) that c_i(x*) = 0 for all i ∈ E, so x* is feasible. Hence, the second KKT condition (12.34b) is satisfied. We need to check the first KKT condition (12.34a) as well, and to show that the limit (17.10) holds.

By using A(x) to denote the matrix of constraint gradients (also known as the Jacobian), that is,

A(x)^T = [∇c_i(x)]_{i∈E},      (17.15)

and λ^k to denote the vector -μ_k c(x_k), we have as in (17.12) that

A(x_k)^T λ^k = ∇f(x_k) - ∇_x Q(x_k; μ_k),   ‖∇_x Q(x_k; μ_k)‖ ≤ τ_k.      (17.16)

For all k ∈ K sufficiently large, the matrix A(x_k) has full row rank, so that A(x_k) A(x_k)^T is nonsingular. By multiplying (17.16) by A(x_k) and rearranging, we have that

λ^k = [A(x_k) A(x_k)^T]^{-1} A(x_k) [∇f(x_k) - ∇_x Q(x_k; μ_k)].

Hence, by taking the limit as k ∈ K goes to ∞, we find that

lim_{k∈K} λ^k = λ* = [A(x*) A(x*)^T]^{-1} A(x*) ∇f(x*).

By taking limits in (17.12), we conclude that

∇f(x*) - A(x*)^T λ* = 0,      (17.17)

so that λ* satisfies the first KKT condition (12.34a) for (17.1). Hence, x* is a KKT point for (17.1), with unique Lagrange multiplier vector λ*. □
It is reassuring that, if a limit point x* is not feasible, it is at least a stationary point for the function ‖c(x)‖². Newton-type algorithms can always be attracted to infeasible points of this type. We see the same effect in Chapter 11, in our discussion of methods for nonlinear equations that use the sum-of-squares merit function ‖r(x)‖². Such methods cannot be guaranteed to find a root, and can be attracted to a stationary point or minimizer of the merit function. In the case in which the nonlinear program (17.1) is infeasible, we often observe convergence of the quadratic-penalty method to stationary points or minimizers of ‖c(x)‖².
ILL CONDITIONING AND REFORMULATIONS
We now examine the nature of the ill conditioning in the Hessian ∇²_{xx} Q(x; μ_k). An understanding of the properties of this matrix, and of the similar Hessians that arise in other penalty and barrier methods, is essential in choosing effective algorithms for the minimization problem and for the linear algebra calculations at each iteration.
The Hessian is given by the formula

∇²_{xx} Q(x; μ_k) = ∇²f(x) + Σ_{i∈E} μ_k c_i(x) ∇²c_i(x) + μ_k A(x)^T A(x),      (17.18)

where we have used the definition (17.15) of A(x). When x is close to the minimizer of Q(·; μ_k) and the conditions of Theorem 17.2 are satisfied, we have from (17.10) that the sum of the first two terms on the right-hand side of (17.18) is approximately equal to the Hessian of the Lagrangian function defined in (12.33). To be specific, we have
∇²_{xx} Q(x; μ_k) ≈ ∇²_{xx} L(x, λ*) + μ_k A(x)^T A(x),      (17.19)
when x is close to the minimizer of Q(·; μ_k). We see from this expression that ∇²_{xx} Q(x; μ_k) is approximately equal to the sum of
a matrix whose elements are independent of μ_k (the Lagrangian term), and
a matrix of rank |E| whose nonzero eigenvalues are of order μ_k (the second term on the right-hand side of (17.19)).
The number of constraints |E| is usually smaller than n. In this case, the last term in (17.19) is singular. The overall matrix has some of its eigenvalues approaching a constant, while others
are of order μ_k. Since μ_k is approaching ∞, the increasing ill conditioning of ∇²_{xx} Q(x; μ_k) is apparent.
One consequence of the ill conditioning is possible inaccuracy in the calculation of the Newton step for Q(x; μ_k), which is obtained by solving the following system:

∇²_{xx} Q(x; μ_k) p = -∇_x Q(x; μ_k).      (17.20)

In general, the poor conditioning of this system will lead to significant errors in the computed value of p, regardless of the computational technique used to solve (17.20). For the same reason, iterative methods can be expected to perform poorly unless accompanied by a preconditioning strategy that removes the systematic ill conditioning.
There is an alternative formulation of the equations (17.20) that avoids the ill conditioning due to the final term in (17.18). By introducing a new variable vector ζ defined by ζ = μ_k A(x) p, we see that the vector p that solves (17.20) also satisfies the following system:

[ ∇²f(x) + Σ_{i∈E} μ_k c_i(x) ∇²c_i(x)    A(x)^T     ] [ p ]   [ -∇_x Q(x; μ_k) ]
[ A(x)                                    -(1/μ_k) I ] [ ζ ] = [ 0              ].      (17.21)

When x is not too far from the solution x*, the coefficient matrix in this system does not have large singular values of order μ_k, so the system (17.21) can be viewed as a well-conditioned reformulation of (17.20). We note, however, that neither system may yield a good search direction p, because the coefficients μ_k c_i(x) in the summation term of the upper left block of (17.21) may be poor approximations to the optimal Lagrange multipliers, even when x is quite close to the minimizer x_k of Q(x; μ_k). This fact may cause the quadratic model on which p is based to be an inadequate model of Q(·; μ_k), so the Newton step may be intrinsically an unsuitable search direction. We discussed possible remedies for this difficulty above, in our comments following Framework 17.1.

To compute the step via (17.21) involves the solution of a linear system of dimension n + |E|, rather than the system of dimension n given by (17.20). A similar system must be solved to calculate the sequential quadratic programming (SQP) step (18.6), which is derived in Chapter 18. In fact, when μ_k is large, (17.21) can be viewed as a regularization of the SQP step (18.6), in which the term -(1/μ_k) I helps to ensure that the iteration matrix is nonsingular even when the Jacobian A(x) is rank deficient. On the other hand, when μ_k is small, (17.21) shows that the step computed by the quadratic penalty method does not closely satisfy the linearization of the constraints. This situation is undesirable because the steps may not make significant progress toward the feasible region, resulting in inefficient global behavior. Moreover, if μ_k does not approach ∞ rapidly enough, we lose the possibility of the superlinear rate that occurs when the linearization is exact; see Chapter 18.
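The growth in conditioning, and the effectiveness of the reformulation (17.21), can be observed numerically. In the following sketch (our own construction) we evaluate both coefficient matrices for problem (17.3) at points emulating the minimizers x_k, for which μ c(x_k) ≈ -λ* = 0.5 by (17.10):

```python
import numpy as np

# Conditioning of the Newton system (17.20) versus the reformulation (17.21)
# for problem (17.3), where c(x) = x1^2 + x2^2 - 2 and A(x) = [2x1, 2x2].
for mu in (1e2, 1e4, 1e8):
    t = np.sqrt(1.0 + 0.25 / mu)                  # x = -t*(1,1) gives c(x) = 0.5/mu
    x = np.array([-t, -t])
    cx = x @ x - 2.0
    A = 2.0 * x.reshape(1, 2)                     # constraint Jacobian A(x)
    B = mu * cx * 2.0 * np.eye(2)                 # nabla^2 f + mu*c*nabla^2 c (here nabla^2 f = 0)
    H = B + mu * A.T @ A                          # Hessian (17.18) of Q(x; mu)
    K = np.block([[B, A.T], [A, -np.eye(1)/mu]])  # coefficient matrix in (17.21)
    print(f"mu={mu:.0e}  cond(17.20)={np.linalg.cond(H):.1e}  cond(17.21)={np.linalg.cond(K):.1e}")
```

The condition number of the Hessian in (17.20) grows linearly with μ, while that of the matrix in (17.21) stays bounded, consistent with the discussion above.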
To conclude, the formulation (17.21) allows us to view the quadratic penalty method either as the application of unconstrained minimization to the penalty function Q(·; μ_k) or as a variation on the SQP methods discussed in Chapter 18.
17.2 NONSMOOTH PENALTY FUNCTIONS
Some penalty functions are exact, which means that, for certain choices of their penalty parameters, a single minimization with respect to x can yield the exact solution of the nonlinear programming problem. This property is desirable because it makes the performance of penalty methods less dependent on the strategy for updating the penalty parameter. The quadratic penalty function of Section 17.1 is not exact, because its minimizer is generally not the same as the solution of the nonlinear program for any positive value of μ. In this section we discuss nonsmooth exact penalty functions, which have proved to be useful in a number of practical contexts.
A popular nonsmooth penalty function for the general nonlinear programming problem (17.6) is the ℓ1 penalty function defined by

φ_1(x; μ) = f(x) + μ Σ_{i∈E} |c_i(x)| + μ Σ_{i∈I} [c_i(x)]⁻,      (17.22)

where we use again the notation [y]⁻ = max(0, -y). Its name derives from the fact that the penalty term is μ times the ℓ1 norm of the constraint violation. Note that φ_1(x; μ) is not differentiable at some x, because of the presence of the absolute value and [·]⁻ functions.
The following result establishes the exactness of the ℓ1 penalty function. For a proof see [165], Theorem 4.4.
Theorem 17.3.
Suppose that x* is a strict local solution of the nonlinear programming problem (17.6) at which the first-order necessary conditions of Theorem 12.1 are satisfied, with Lagrange multipliers λ_i*, i ∈ E ∪ I. Then x* is a local minimizer of φ_1(x; μ) for all μ > μ*, where

μ* = max_{i∈E∪I} |λ_i*|.      (17.23)

If, in addition, the second-order sufficient conditions of Theorem 12.6 hold and μ > μ*, then x* is a strict local minimizer of φ_1(x; μ).
Loosely speaking, at a solution x* of the nonlinear program, any move into the infeasible region is penalized sharply enough that it produces an increase in the penalty function to a value greater than φ_1(x*; μ) = f(x*), thereby forcing the minimizer of φ_1(·; μ) to lie at x*.
EXAMPLE 17.2
Consider the following problem in one variable:

min x   subject to   x ≥ 1,      (17.24)

whose solution is x* = 1. We have that

φ_1(x; μ) = x + μ[x - 1]⁻ = (1 - μ)x + μ   if x ≤ 1,
                            x              if x ≥ 1.      (17.25)

As can be seen in Figure 17.3, the penalty function has a minimizer at x* = 1 when μ > 1, but is a monotone increasing function when μ < 1.
Since penalty methods work by minimizing the penalty function directly, we need to characterize stationary points of φ_1. Even though φ_1 is not differentiable, it has a directional derivative D(φ_1(x; μ); p) along any direction; see (A.51) and the example following this definition.
Definition 17.1.
A point x̂ ∈ IR^n is a stationary point for the penalty function φ_1(x; μ) if

D(φ_1(x̂; μ); p) ≥ 0,      (17.26)

for all p ∈ IR^n. Similarly, x̂ is a stationary point of the measure of infeasibility

h(x) = Σ_{i∈E} |c_i(x)| + Σ_{i∈I} [c_i(x)]⁻      (17.27)

if D(h(x̂); p) ≥ 0 for all p ∈ IR^n. If a point is infeasible for (17.6) but stationary with respect to the infeasibility measure h, we say that it is an infeasible stationary point.

[Figure 17.3: Penalty function φ_1(x; μ) for problem (17.24) with μ < 1 (left) and μ > 1 (right).]

For the function in Example 17.2, we have for x = 1 that

D(φ_1(x; μ); p) = p           if p ≥ 0,
                  (1 - μ)p    if p < 0;

it follows that when μ ≥ 1, we have D(φ_1(x; μ); p) ≥ 0 for all p ∈ IR.
The following result complements Theorem 17.3 by showing that stationary points of φ_1(x; μ) correspond to KKT points of the constrained optimization problem (17.6) under certain assumptions.
Theorem 17.4.
Suppose that x̂ is a stationary point of the penalty function φ_1(x; μ) for all μ greater than a certain threshold μ̂ > 0. Then, if x̂ is feasible for the nonlinear program (17.6), it satisfies the KKT conditions (12.34) for (17.6). If x̂ is not feasible for (17.6), it is an infeasible stationary point.

PROOF. Suppose first that x̂ is feasible. We have from (A.51) and the definition (17.22) of φ_1 that

D(φ_1(x̂; μ); p) = ∇f(x̂)^T p + μ Σ_{i∈E} |∇c_i(x̂)^T p| + μ Σ_{i∈I∩A(x̂)} [∇c_i(x̂)^T p]⁻,      (17.28)

where the active set A(x̂) is defined in Definition 12.1. We leave verification of (17.28) as an exercise. Consider any direction p in the linearized feasible direction set F(x̂) of Definition 12.3. By the properties of F(x̂), we have

|∇c_i(x̂)^T p| = 0, i ∈ E;   [∇c_i(x̂)^T p]⁻ = 0, i ∈ I ∩ A(x̂),

so that by the stationarity assumption on φ_1(x̂; μ), we have

0 ≤ D(φ_1(x̂; μ); p) = ∇f(x̂)^T p,   for all p ∈ F(x̂).
[Figure 17.4 (contour plot): Contours of φ_1(x; μ) for problem (17.3) with μ = 2, contour spacing 0.5.]

We can now apply Farkas' Lemma (Lemma 12.4) to deduce that
∇f(x̂) = Σ_{i∈A(x̂)} λ_i ∇c_i(x̂),

for some coefficients λ_i with λ_i ≥ 0 for all i ∈ I ∩ A(x̂). As we noted earlier (see Theorem 12.1 and (12.35)), this expression implies that the KKT conditions (12.34) hold, as claimed.

We leave the second part of the proof (concerning infeasible x̂) as an exercise. □
EXAMPLE 17.3
Consider again problem (17.3), for which the ℓ1 penalty function is

φ_1(x; μ) = x_1 + x_2 + μ |x_1² + x_2² - 2|.      (17.29)

Figure 17.4 plots the function φ_1(x; 2), whose minimizer is the solution x* = (-1, -1)^T of (17.3). In fact, following Theorem 17.3, we find that for all μ > 0.5, the minimizer of φ_1(x; μ) coincides with x*. The sharp corners on the contours indicate nonsmoothness along the boundary of the circle defined by x_1² + x_2² = 2.
These results provide the motivation for an algorithmic framework based on the ℓ1 penalty function, which we now present.
Framework 17.2 (Classical ℓ1 Penalty Method).
Given μ_0 > 0, tolerance τ > 0, starting point x_0^s;
for k = 0, 1, 2, ...
    Find an approximate minimizer x_k of φ_1(x; μ_k), starting at x_k^s;
    if h(x_k) ≤ τ
        stop with approximate solution x_k;
    end (if)
    Choose new penalty parameter μ_{k+1} > μ_k;
    Choose new starting point x_{k+1}^s;
end (for)
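A naive realization of Framework 17.2 is sketched below (our own construction). Because φ_1 is nonsmooth, the subproblems are handled here by the derivative-free Nelder-Mead method, whereas practical codes use the smooth QP model discussed later in this section:

```python
import numpy as np
from scipy.optimize import minimize

def l1_penalty(f, c_eq, c_ineq, x0, mu0=1.0, tol=1e-6, k_max=15):
    # Minimal sketch of Framework 17.2 with the infeasibility measure (17.27).
    h = lambda z: np.sum(np.abs(c_eq(z))) + np.sum(np.maximum(0.0, -c_ineq(z)))
    x, mu = np.asarray(x0, float), mu0
    for _ in range(k_max):
        phi = lambda z: f(z) + mu * h(z)     # the l1 penalty function (17.22)
        x = minimize(phi, x, method="Nelder-Mead").x
        if h(x) <= tol:
            return x                         # approximately feasible: accept
        mu *= 10.0                           # increase by a constant multiple
    return x

# Example 17.2: min x subject to x >= 1 (solution x* = 1).
print(l1_penalty(lambda x: x[0],
                 lambda x: np.array([]),             # no equality constraints
                 lambda x: np.array([x[0] - 1.0]),   # x - 1 >= 0
                 x0=[3.0]))
```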
The minimization of φ_1(x; μ_k) is made difficult by the nonsmoothness of the function. Nevertheless, as we discuss below, it is well understood how to compute minimization steps using a smooth model of φ_1(x; μ_k), in a way that resembles SQP methods.
The simplest scheme for updating the penalty parameter μ_k is to increase it by a constant multiple (say 5 or 10) if the current value produces a minimizer that is not feasible to within the tolerance τ. This scheme sometimes works well in practice, but can also be inefficient. If the initial penalty parameter μ_0 is too small, many cycles of Framework 17.2 may be needed to determine an appropriate value. In addition, the iterates may move away from the solution x* in these initial cycles, in which case the minimization of φ_1(x; μ_k) should be terminated early and x_k^s should possibly be reset to a previous iterate. If, on the other hand, μ_k is excessively large, the penalty function will be difficult to minimize, possibly requiring a large number of iterations. We return to the issue of selecting the penalty parameter below.
A PRACTICAL ℓ1 PENALTY METHOD
As noted already, φ_1(x; μ) is nonsmooth: its gradient is not defined at any x for which c_i(x) = 0 for some i ∈ E ∪ I. Rather than using techniques for nondifferentiable optimization, such as bundle methods [170], we prefer techniques that take account of the special nature of the nondifferentiabilities in this function. As in the algorithms for unconstrained optimization discussed in the first part of this book, we obtain a step toward the minimizer of φ_1(x; μ) by forming a simplified model of this function and seeking the minimizer of this model. Here, the model can be defined by linearizing the constraints c_i and replacing the nonlinear programming objective f by a quadratic
function, as follows:

q(p; μ) = f(x) + ∇f(x)^T p + ½ p^T W p + μ Σ_{i∈E} |c_i(x) + ∇c_i(x)^T p| + μ Σ_{i∈I} [c_i(x) + ∇c_i(x)^T p]⁻,      (17.30)

where W is a symmetric matrix that usually contains second derivative information about f and c_i, i ∈ E ∪ I. The model q(p; μ) is not smooth, but we can formulate the problem of minimizing q as a smooth quadratic programming problem by introducing artificial variables r_i, s_i, and t_i, as follows:

min_{p,r,s,t}  f(x) + ½ p^T W p + ∇f(x)^T p + μ Σ_{i∈E} (r_i + s_i) + μ Σ_{i∈I} t_i      (17.31)
subject to  ∇c_i(x)^T p + c_i(x) = r_i - s_i,  i ∈ E,
            ∇c_i(x)^T p + c_i(x) ≥ -t_i,  i ∈ I,
            r, s, t ≥ 0.
This subproblem can be solved with a standard quadratic programming solver. Even after the addition of a box-shaped trust-region constraint of the form ‖p‖_∞ ≤ Δ, it remains a quadratic program. This approach to minimizing φ_1 is closely related to sequential quadratic programming (SQP) and will be discussed further in Chapter 18.
The strategy for choosing and updating the penalty parameter μ_k is crucial to the practical success of the iteration. We mentioned that a simple (but not always effective) approach is to choose an initial value and increase it repeatedly until feasibility is attained. In some variants of the approach, the penalty parameter is chosen at every iteration so that μ_k > ‖λ_k‖_∞, where λ_k is an estimate of the Lagrange multipliers computed at x_k. We base this strategy on Theorem 17.3, which suggests that in a neighborhood of a solution x*, a good choice would be to set μ_k modestly larger than μ*. This strategy is not always successful, as the multiplier estimates may be inaccurate and may in any case not provide an appropriate value of μ_k far from the solution.
The difficulties of choosing appropriate values of μ_k caused nonsmooth penalty methods to fall out of favor during the 1990s and stimulated the development of filter methods, which do not require the choice of a penalty parameter (see Section 15.4). In recent years, however, there has been a resurgence of interest in penalty methods, in part because of their ability to handle degenerate problems. New approaches for updating the penalty parameter appear to have largely overcome the difficulties associated with choosing μ_k, at least for some particular implementations (see Algorithm 18.5).
Careful consideration should also be given to the choice of starting point x_{k+1}^s for the minimization of φ_1(x; μ_{k+1}). If the penalty parameter μ_k for the present cycle is appropriate, in the sense that the algorithm made progress toward feasibility, then we can set x_{k+1}^s to be the minimizer x_k of φ_1(x; μ_k) obtained on this cycle. Otherwise, we may want to restore the initial point from an earlier cycle.
A GENERAL CLASS OF NONSMOOTH PENALTY METHODS
Exact nonsmooth penalty functions can be defined in terms of norms other than the ℓ1 norm. We can write

φ(x; μ) = f(x) + μ ‖c_E(x)‖ + μ ‖[c_I(x)]⁻‖,      (17.32)

where ‖·‖ is any vector norm, and all the equality and inequality constraints have been grouped in the vector functions c_E and c_I, respectively. Framework 17.2 applies to any of these penalty functions; we simply redefine the measure of infeasibility as h(x) = ‖c_E(x)‖ + ‖[c_I(x)]⁻‖. The most common norms used in practice are the ℓ1, ℓ∞, and ℓ2 (not squared) norms. It is easy to find a reformulation similar to (17.31) for the ℓ∞ norm.
The theoretical properties described for the ℓ1 function extend to the general class (17.32). In Theorem 17.3, we replace the inequality (17.23) by

μ* = ‖λ*‖_D,      (17.33)

where ‖·‖_D is the dual norm of ‖·‖, defined in (A.6). Theorem 17.4 applies without modification.
We show now that penalty functions of the type considered so far in this chapter must be nonsmooth to be exact. For simplicity, we restrict our attention to the case in which there is a single equality constraint c_1(x) = 0, and consider a penalty function of the form

φ(x; μ) = f(x) + μ h(c_1(x)),      (17.34)

where h : IR → IR is a function satisfying h(y) ≥ 0 for all y ∈ IR and h(0) = 0. Suppose for contradiction that h is continuously differentiable. Since h has a minimizer at zero, we have from Theorem 2.2 that h'(0) = 0. If x* is a local solution of the problem, we have c_1(x*) = 0 and therefore h'(c_1(x*)) = 0. If x* is a local minimizer of φ(x; μ), we therefore have

0 = ∇φ(x*; μ) = ∇f(x*) + μ ∇c_1(x*) h'(c_1(x*)) = ∇f(x*).

However, it is not generally true that the gradient of f vanishes at the solution of a constrained optimization problem, so our original assumption that h is continuously differentiable must be incorrect, and φ(·; μ) cannot be smooth.
Nonsmooth penalty functions are also used as merit functions in methods that compute steps by some other mechanism. For further details see the general discussion of Section 15.4 and the concrete implementations given in Chapters 18 and 19.
17.3 AUGMENTED LAGRANGIAN METHOD: EQUALITY CONSTRAINTS
We now discuss an approach known as the method of multipliers or the augmented Lagrangian method. This algorithm is related to the quadratic penalty algorithm of Section 17.1, but it reduces the possibility of ill conditioning by introducing explicit Lagrange multiplier estimates into the function to be minimized, which is known as the augmented Lagrangian function. In contrast to the penalty functions discussed in Section 17.2, the augmented Lagrangian function largely preserves smoothness, and implementations can be constructed from standard software for unconstrained or bound-constrained optimization.
In this section we use superscripts (usually k and k + 1) on the Lagrange multiplier estimates to denote the iteration index, and subscripts (usually i) to denote the component indices of the vector λ. For all other variables we use subscripts for the iteration index, as usual.
MOTIVATION AND ALGORITHMIC FRAMEWORK
We consider first the equality-constrained problem (17.1). The quadratic penalty function Q(x; μ) defined by (17.2) penalizes constraint violations by squaring the infeasibilities and scaling them by μ/2. As we see from Theorem 17.2, however, the approximate minimizers x_k of Q(x; μ_k) do not quite satisfy the feasibility conditions c_i(x) = 0, i ∈ E. Instead, they are perturbed (see (17.10)) so that

c_i(x_k) ≈ -λ_i*/μ_k,  for all i ∈ E.      (17.35)

To be sure, we have c_i(x_k) → 0 as μ_k → ∞, but one may ask whether we can alter the function Q(x; μ_k) to avoid this systematic perturbation, that is, to make the approximate minimizers more nearly satisfy the equality constraints c_i(x) = 0, even for moderate values of μ_k.
The augmented Lagrangian function L_A(x, λ; μ) achieves this goal by including an explicit estimate of the Lagrange multipliers λ, based on the estimate (17.35), in the objective. From the definition

L_A(x, λ; μ) = f(x) - Σ_{i∈E} λ_i c_i(x) + (μ/2) Σ_{i∈E} c_i²(x),      (17.36)

we see that the augmented Lagrangian differs from the (standard) Lagrangian (12.33) for (17.1) by the presence of the squared terms, while it differs from the quadratic penalty function (17.2) in the presence of the summation term involving λ. In this sense, it is a combination of the Lagrangian function and the quadratic penalty function.
We now design an algorithm that fixes the penalty parameter μ to some value μ_k > 0 at its kth iteration (as in Frameworks 17.1 and 17.2), fixes λ at the current estimate λ^k, and performs minimization with respect to x. Using x_k to denote the approximate minimizer of L_A(x, λ^k; μ_k), we have by the optimality conditions for unconstrained minimization (Theorem 2.2) that

0 ≈ ∇_x L_A(x_k, λ^k; μ_k) = ∇f(x_k) - Σ_{i∈E} [λ_i^k - μ_k c_i(x_k)] ∇c_i(x_k).      (17.37)

By comparing with the optimality condition (17.17) for (17.1), we can deduce that

λ_i* ≈ λ_i^k - μ_k c_i(x_k),  for all i ∈ E.      (17.38)

By rearranging this expression, we have that

c_i(x_k) ≈ -(1/μ_k) (λ_i* - λ_i^k),  for all i ∈ E,

so we conclude that if λ^k is close to the optimal multiplier vector λ*, the infeasibility in x_k will be much smaller than 1/μ_k, rather than being proportional to 1/μ_k as in (17.35). The relation (17.38) immediately suggests a formula for improving our current estimate λ^k of the Lagrange multiplier vector, using the approximate minimizer x_k just calculated: We can set

λ_i^{k+1} = λ_i^k - μ_k c_i(x_k),  for all i ∈ E.      (17.39)
This discussion motivates the following algorithmic framework.
Framework 17.3 (Augmented Lagrangian Method, Equality Constraints).
Given μ_0 > 0, tolerance τ_0 > 0, starting points x_0^s and λ^0;
for k = 0, 1, 2, ...
    Find an approximate minimizer x_k of L_A(·, λ^k; μ_k), starting at x_k^s,
        and terminating when ‖∇_x L_A(x_k, λ^k; μ_k)‖ ≤ τ_k;
    if a convergence test for (17.1) is satisfied
        stop with approximate solution x_k;
    end (if)
    Update Lagrange multipliers using (17.39) to obtain λ^{k+1};
    Choose new penalty parameter μ_{k+1} ≥ μ_k;
    Set starting point for the next iteration to x_{k+1}^s = x_k;
    Select tolerance τ_{k+1};
end (for)
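Framework 17.3 is equally simple to prototype. In the sketch below (our own construction; the convergence test and penalty update are illustrative choices), the multiplier update (17.39) is applied after each approximate minimization:

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, cons, x0, lam0, mu0=10.0, k_max=20):
    # Minimal sketch of Framework 17.3 for equality constraints c(x) = 0,
    # using the augmented Lagrangian (17.36).
    x, lam, mu = np.asarray(x0, float), np.asarray(lam0, float), mu0
    for _ in range(k_max):
        LA = lambda z: f(z) - lam @ cons(z) + 0.5 * mu * np.sum(cons(z)**2)
        x = minimize(LA, x, method="BFGS").x
        if np.linalg.norm(cons(x)) < 1e-10:
            break
        lam = lam - mu * cons(x)   # first-order multiplier update (17.39)
        mu *= 2.0                  # modest increase; need not grow without bound
    return x, lam

# Example 17.4: problem (17.3); the optimal multiplier is -0.5.
x, lam = augmented_lagrangian(lambda x: x[0] + x[1],
                              lambda x: np.array([x[0]**2 + x[1]**2 - 2.0]),
                              x0=[-1.5, -1.0], lam0=[-0.4])
print(x, lam)   # approximately (-1, -1) and -0.5
```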
We show below that convergence of this method can be assured without increasing μ indefinitely. Ill conditioning is therefore less of a problem than in Framework 17.1, so the choice of starting point x_{k+1}^s in Framework 17.3 is less critical. In Framework 17.3 we
[Figure 17.5 (contour plot): Contours of L_A(x, λ; μ) from (17.40) for λ = -0.4 and μ = 1, contour spacing 0.5.]
simply start the search at iteration k + 1 from the previous approximate minimizer x_k. The tolerance τ_k could be chosen to depend on the infeasibility ‖c(x_k)‖, and the penalty parameter μ may be increased if the reduction in this infeasibility measure is insufficient at the present iteration.
EXAMPLE 17.4
Consider again problem (17.3), for which the augmented Lagrangian is

L_A(x, λ; μ) = x_1 + x_2 - λ(x_1² + x_2² - 2) + (μ/2)(x_1² + x_2² - 2)².      (17.40)

The solution of (17.3) is x* = (-1, -1)^T, and the optimal Lagrange multiplier is λ* = -0.5. Suppose that at iterate k we have μ_k = 1 (as in Figure 17.1), while the current multiplier estimate is λ^k = -0.4. Figure 17.5 plots the function L_A(x, -0.4; 1). Note that the spacing of the contours indicates that the conditioning of this problem is similar to that of the quadratic penalty function Q(x; 1) illustrated in Figure 17.1. However, the minimizing value of x_k ≈ (-1.02, -1.02)^T is much closer to the solution x* = (-1, -1)^T than is the minimizing value of Q(x; 1), which is approximately (-1.1, -1.1)^T. This example shows that the inclusion of the Lagrange multiplier term in the function L_A(x, λ; μ) can result in a significant improvement over the quadratic penalty method, as a way to reformulate the constrained optimization problem (17.1).
PROPERTIES OF THE AUGMENTED LAGRANGIAN
We now prove two results that justify the use of the augmented Lagrangian function and the method of multipliers for equalityconstrained problems.
The first result validates the approach of Framework 17.3 by showing that when we have knowledge of the exact Lagrange multiplier vector λ*, the solution x* of (17.1) is a strict minimizer of L_A(x, λ*; μ) for all μ sufficiently large. Although we do not know λ* exactly in practice, the result and its proof suggest that we can obtain a good estimate of x* by minimizing L_A(x, λ; μ) even when μ is not particularly large, provided that λ is a reasonably good estimate of λ*.
Theorem 17.5.
Let x* be a local solution of (17.1) at which the LICQ is satisfied (that is, the gradients ∇c_i(x*), i ∈ E, are linearly independent vectors), and the second-order sufficient conditions specified in Theorem 12.6 are satisfied for λ = λ*. Then there is a threshold value μ̄ such that for all μ ≥ μ̄, x* is a strict local minimizer of L_A(x, λ*; μ).
PROOF. We prove the result by showing that x* satisfies the second-order sufficient conditions to be a strict local minimizer of L_A(x, λ*; μ) (see Theorem 2.4) for all μ sufficiently large; that is,

∇_x L_A(x*, λ*; μ) = 0,   ∇²_{xx} L_A(x*, λ*; μ) positive definite.      (17.41)

Because x* is a local solution for (17.1) at which LICQ is satisfied, we can apply Theorem 12.1 to deduce that ∇_x L(x*, λ*) = 0 and c_i(x*) = 0 for all i ∈ E, so that

∇_x L_A(x*, λ*; μ) = ∇f(x*) - Σ_{i∈E} [λ_i* - μ c_i(x*)] ∇c_i(x*)
                   = ∇f(x*) - Σ_{i∈E} λ_i* ∇c_i(x*) = ∇_x L(x*, λ*) = 0,

verifying the first part of (17.41), independently of μ.

For the second part of (17.41), we define A to be the constraint gradient matrix in (17.15) evaluated at x = x*, and write

∇²_{xx} L_A(x*, λ*; μ) = ∇²_{xx} L(x*, λ*) + μ A^T A.

If the claim in (17.41) were not true, then for each integer k ≥ 1, we could choose a vector w_k with ‖w_k‖ = 1 such that

0 ≥ w_k^T ∇²_{xx} L_A(x*, λ*; k) w_k = w_k^T ∇²_{xx} L(x*, λ*) w_k + k ‖A w_k‖²,      (17.42)

and therefore

‖A w_k‖² ≤ -(1/k) w_k^T ∇²_{xx} L(x*, λ*) w_k → 0,   as k → ∞.      (17.43)

Since the vectors w_k lie in a compact set (the surface of the unit sphere), they have an accumulation point w. The limit (17.43) implies that Aw = 0. Moreover, by rearranging (17.42), we have that

w_k^T ∇²_{xx} L(x*, λ*) w_k ≤ -k ‖A w_k‖² ≤ 0,

so by taking limits we have w^T ∇²_{xx} L(x*, λ*) w ≤ 0. However, this inequality contradicts the second-order conditions in Theorem 12.6 which, when applied to (17.1), state that we must have w^T ∇²_{xx} L(x*, λ*) w > 0 for all nonzero vectors w with Aw = 0. Hence, the second part of (17.41) holds for all μ sufficiently large. □
The second result, given by Bertsekas [19], Proposition 4.2.3, describes the more realistic situation of λ ≠ λ*. It gives conditions under which there is a minimizer of L_A(x, λ; μ) that lies close to x*, and gives error bounds on both x_k and the updated multiplier estimate λ^{k+1} obtained from solving the subproblem at iteration k.

Theorem 17.6.
Suppose that the assumptions of Theorem 17.5 are satisfied at x* and λ*, and let μ̄ be chosen as in that theorem. Then there exist positive scalars δ, ε, and M such that the following claims hold:

(a) For all λ^k and μ_k satisfying

‖λ^k - λ*‖ ≤ μ_k δ,   μ_k ≥ μ̄,      (17.44)

the problem

min_x L_A(x, λ^k; μ_k)   subject to   ‖x - x*‖ ≤ ε

has a unique solution x_k. Moreover, we have

‖x_k - x*‖ ≤ M ‖λ^k - λ*‖ / μ_k.      (17.45)

(b) For all λ^k and μ_k that satisfy (17.44), we have

‖λ^{k+1} - λ*‖ ≤ M ‖λ^k - λ*‖ / μ_k,      (17.46)

where λ^{k+1} is given by the formula (17.39).

(c) For all λ^k and μ_k that satisfy (17.44), the matrix ∇²_{xx} L_A(x_k, λ^k; μ_k) is positive definite and the constraint gradients ∇c_i(x_k), i ∈ E, are linearly independent.
This theorem illustrates some salient properties of the augmented Lagrangian approach. The bound (17.45) shows that x_k will be close to x* if λ^k is accurate or if the penalty parameter μ_k is large. Hence, this approach gives us two ways of improving the accuracy of x_k, whereas the quadratic penalty approach gives us only one option: increasing μ_k. The bound (17.46) states that, locally, we can ensure an improvement in the accuracy of the multipliers by choosing a sufficiently large value of μ_k. The final observation of the theorem shows that second-order sufficient conditions for unconstrained minimization (see Theorem 2.4) are also satisfied for the kth subproblem under the given conditions, so one can expect good performance by applying standard unconstrained minimization techniques.
17.4 PRACTICAL AUGMENTED LAGRANGIAN METHODS
In this section we discuss practical augmented Lagrangian procedures, in particular, procedures for handling inequality constraints. We discuss three approaches based, respectively, on bound-constrained, linearly constrained, and unconstrained formulations. The first two are the basis of the successful nonlinear programming codes LANCELOT [72] and MINOS [218].
BOUND-CONSTRAINED FORMULATION
Given the general nonlinear program (17.6), we can convert it to a problem with equality constraints and bound constraints by introducing slack variables s_i and replacing the general inequalities c_i(x) ≥ 0, i ∈ I, by

c_i(x) - s_i = 0,   s_i ≥ 0,   for all i ∈ I.      (17.47)

Bound constraints, l ≤ x ≤ u, need not be transformed. By reformulating in this way, we can write the nonlinear program as follows:

min_{x∈IR^n} f(x)   subject to   c_i(x) = 0, i = 1, 2, ..., m,   l ≤ x ≤ u.      (17.48)

The slacks s_i have been incorporated into the vector x, and the constraint functions c_i have been redefined accordingly. We have numbered the constraints consecutively with i = 1, 2, ..., m, and in the discussion below we gather them into the vector function c : IR^n → IR^m. Some of the components of the lower bound vector l may be set to -∞, signifying that there is no lower bound on the components of x in question; similarly for u.
The bound-constrained Lagrangian (BCL) approach incorporates only the equality constraints from (17.48) into the augmented Lagrangian, that is,

L_A(x, λ; μ) = f(x) - Σ_{i=1}^m λ_i c_i(x) + (μ/2) Σ_{i=1}^m c_i²(x).      (17.49)

The bound constraints are enforced explicitly in the subproblem, which has the form

min_x L_A(x, λ; μ)   subject to   l ≤ x ≤ u.      (17.50)

After this problem has been solved approximately, the multipliers and the penalty parameter are updated and the process is repeated.

An efficient technique for solving the nonlinear program with bound constraints (17.50), for fixed λ and μ, is the (nonlinear) gradient projection method discussed in Section 18.6. By specializing the KKT conditions (12.34) to the problem (17.50), we find that the first-order necessary condition for x to be a solution of (17.50) is that

x - P(x - ∇_x L_A(x, λ; μ), l, u) = 0,      (17.51)

where P(g, l, u) is the projection of the vector g ∈ IR^n onto the rectangular box [l, u], defined as follows:

P(g, l, u)_i = l_i   if g_i ≤ l_i,
               g_i   if g_i ∈ (l_i, u_i),   for all i = 1, 2, ..., n.      (17.52)
               u_i   if g_i ≥ u_i,
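The projection (17.52) and the optimality measure on the left-hand side of (17.51) take only a few lines; the following sketch (with our own helper names) is one way to write them:

```python
import numpy as np

def project(g, l, u):
    # Componentwise projection (17.52) onto the box [l, u].
    return np.minimum(np.maximum(g, l), u)

def kkt_measure(grad_LA, x, l, u):
    # Norm of the left-hand side of (17.51): zero exactly at a
    # first-order point of the bound-constrained subproblem (17.50).
    return np.linalg.norm(x - project(x - grad_LA, l, u))
```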
We are now ready to describe the algorithm implemented in the LANCELOT software package.
Algorithm 17.4 (Bound-Constrained Lagrangian Method).
Choose an initial point $x_0$ and initial multipliers $\lambda^0$;
Choose convergence tolerances $\eta_*$ and $\omega_*$;
Set $\mu_0 = 10$, $\omega_0 = 1/\mu_0$, and $\eta_0 = 1/\mu_0^{0.1}$;
for $k = 0, 1, 2, \dots$
    Find an approximate solution $x_k$ of the subproblem (17.50) such that
        $\| x_k - P(x_k - \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k), l, u) \| \le \omega_k$;
    if $\|c(x_k)\| \le \eta_k$
        (* test for convergence *)
        if $\|c(x_k)\| \le \eta_*$ and $\| x_k - P(x_k - \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k), l, u) \| \le \omega_*$
            stop with approximate solution $x_k$;
        end (if)
        (* update multipliers, tighten tolerances *)
        $\lambda^{k+1} = \lambda^k - \mu_k c(x_k)$;
        $\mu_{k+1} = \mu_k$;
        $\eta_{k+1} = \eta_k / \mu_{k+1}^{0.9}$;
        $\omega_{k+1} = \omega_k / \mu_{k+1}$;
    else
        (* increase penalty parameter, tighten tolerances *)
        $\lambda^{k+1} = \lambda^k$;
        $\mu_{k+1} = 100 \mu_k$;
        $\eta_{k+1} = 1/\mu_{k+1}^{0.1}$;
        $\omega_{k+1} = 1/\mu_{k+1}$;
    end (if)
end (for)
The main branch in the algorithm occurs after problem (17.50) has been solved approximately, when the algorithm tests to see whether the constraints have decreased sufficiently, as measured by the condition
$$\|c(x_k)\| \le \eta_k. \qquad (17.53)$$
If this condition holds, the penalty parameter is not changed for the next iteration because the current value of $\mu_k$ is producing an acceptable level of constraint violation. The Lagrange multiplier estimates are updated according to the formula (17.39) and the tolerances $\omega_k$ and $\eta_k$ are tightened in advance of the next iteration. If, on the other hand, (17.53) does not hold, then we increase the penalty parameter to ensure that the next subproblem will place more emphasis on decreasing the constraint violations. The Lagrange multiplier estimates are not updated in this case; the focus is on improving feasibility.
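The outer loop of Algorithm 17.4 is easy to express in code. The sketch below is our own schematic rendering (not LANCELOT source); `solve_subproblem` is an assumed stand-in for an inner solver of (17.50) that returns a point whose projected-gradient residual is below `tol`:

    import numpy as np

    def bound_constrained_lagrangian(x, lam, c, solve_subproblem,
                                     eta_star=1e-8, max_outer=100):
        # Initializations from Algorithm 17.4.
        mu = 10.0
        omega, eta = 1.0 / mu, 1.0 / mu**0.1
        for _ in range(max_outer):
            x = solve_subproblem(x, lam, mu, tol=omega)   # approximate solve of (17.50)
            if np.linalg.norm(c(x)) <= eta:
                if np.linalg.norm(c(x)) <= eta_star:      # convergence test (feasibility part)
                    break
                lam = lam - mu * c(x)                     # multiplier update (17.39)
                eta, omega = eta / mu**0.9, omega / mu    # tighten tolerances
            else:
                mu *= 100.0                               # stress feasibility next time
                eta, omega = 1.0 / mu**0.1, 1.0 / mu
        return x, lam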
The constants 0.1, 0.9, and 100 appearing in Algorithm 17.4 are to some extent arbitrary; other values can be used without compromising theoretical convergence properties. LANCELOT uses the gradient projection method with trust regions (see (18.61)) to solve the bound-constrained nonlinear subproblem (17.50). In this context, the gradient projection method constructs a quadratic model of the augmented Lagrangian $\mathcal{L}_A$ and computes a step $d$ by approximately solving the trust-region problem
$$\min_d \; \tfrac12 d^T \left[ \nabla^2_{xx} \mathcal{L}(x_k, \lambda^k) + \mu_k A_k^T A_k \right] d + \nabla_x \mathcal{L}_A(x_k, \lambda^k; \mu_k)^T d \qquad (17.54)$$
$$\text{subject to} \quad l \le x_k + d \le u, \quad \|d\|_\infty \le \Delta,$$
where $A_k = A(x_k)$ and $\Delta$ is a trust-region radius. We can formulate the trust-region constraint by means of the bounds $-\Delta e \le d \le \Delta e$, where $e = (1, 1, \dots, 1)^T$. Each iteration of the algorithm for solving this subproblem proceeds in two stages. First, a
projected gradient line search is performed to determine which components of $d$ should be set at one of their bounds. Second, a conjugate gradient iteration minimizes (17.54) with respect to the free components of $d$ (those not at one of their bounds). Importantly, this algorithm does not require the factorization of a KKT matrix or of the constraint Jacobian $A_k$. The conjugate gradient iteration requires only matrix-vector products, a feature that makes LANCELOT suitable for large problems.
The Hessian of the Lagrangian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda^k)$ in (17.54) can be replaced by a quasi-Newton approximation based on the BFGS or SR1 updating formulas. LANCELOT is designed to take advantage of partially separable structure in the objective function and constraints, either in the evaluation of the Hessian of the Lagrangian or in the quasi-Newton updates (see Section 7.4).
LINEARLY CONSTRAINED FORMULATION
The principal idea behind linearly constrained Lagrangian (LCL) methods is to generate a step by minimizing the Lagrangian (or augmented Lagrangian) subject to linearizations of the constraints. If we use the formulation (17.48) of the nonlinear programming problem, the subproblem used in the LCL approach takes the form
$$\min_x \; F_k(x) \qquad (17.55a)$$
$$\text{subject to} \quad c(x_k) + A_k (x - x_k) = 0, \quad l \le x \le u. \qquad (17.55b)$$
There are several possible choices for $F_k(x)$. Early LCL methods defined
$$F_k(x) = f(x) - \sum_{i=1}^m \lambda_i^k \bar{c}_{ik}(x), \qquad (17.56)$$
where $\lambda^k$ is the current Lagrange multiplier estimate and $\bar{c}_{ik}(x)$ is the difference between $c_i(x)$ and its linearization at $x_k$, that is,
$$\bar{c}_{ik}(x) = c_i(x) - c_i(x_k) - \nabla c_i(x_k)^T (x - x_k). \qquad (17.57)$$
One can show that as $x_k$ converges to a solution $x^*$, the Lagrange multiplier associated with the equality constraint in (17.55b) converges to the optimal multiplier. Therefore, one can set $\lambda^k$ in (17.56) to be the Lagrange multiplier for the equality constraint in (17.55b) from the previous iteration.

Current LCL methods define $F_k$ to be the augmented Lagrangian function
$$F_k(x) = f(x) - \sum_{i=1}^m \lambda_i^k \bar{c}_{ik}(x) + \frac{\mu}{2} \sum_{i=1}^m \bar{c}_{ik}(x)^2. \qquad (17.58)$$
This definition of $F_k$ appears to yield more reliable convergence from remote starting points than does (17.56), in practice.

There is a notable similarity between (17.58) and the augmented Lagrangian (17.36), the difference being that the original constraints $c_i(x)$ have been replaced by the functions $\bar{c}_{ik}(x)$, which capture only the second-order and above terms of $c_i$. The subproblem (17.55) differs from the augmented Lagrangian subproblem in that the new $x$ is required to satisfy exactly a linearization of the equality constraints, while the linear part of each constraint is factored out of the objective via the use of $\bar{c}_{ik}$ in place of $c_i$. A procedure similar to the one in Algorithm 17.4 can be used for updating the penalty parameter and for adjusting the tolerances that govern the accuracy of the solution of the subproblem.

Since $\bar{c}_{ik}(x)$ has zero gradient at $x = x_k$, we have that $\nabla F_k(x_k) = \nabla f(x_k)$, where $F_k$ is defined by either (17.56) or (17.58). We can also show that the Hessian of $F_k$ is closely related to the Hessians of the Lagrangian or augmented Lagrangian functions for (17.1). Because of these properties, the subproblem (17.55) is similar to the SQP subproblems described in Chapter 18, with the quadratic objective in SQP being replaced by a nonlinear objective in LCL.

The well-known code MINOS [218] uses the nonlinear model function (17.58) and solves the subproblem via a reduced gradient method that employs quasi-Newton approximations to the reduced Hessian of $F_k$. A fairly accurate solution of the subproblem is computed in MINOS to try to ensure that the Lagrange multiplier estimates for the equality constraint in (17.55b), subsequently used in (17.58), are of good quality. As a result, MINOS typically requires more evaluations of the objective $f$ and constraint functions $c_i$ (and their gradients) in total than SQP methods or interior-point methods. The total number of subproblems (17.55) that are solved in the course of the algorithm is, however, sometimes smaller than in other approaches.
UNCONSTRAINED FORMULATION
We can obtain an unconstrained form of the augmented Lagrangian subproblem for inequality-constrained problems by using a derivation based on the proximal point approach. Supposing for simplicity that the problem has no equality constraints ($\mathcal{E} = \emptyset$), we can write the problem (17.6) equivalently as an unconstrained optimization problem:
$$\min_{x \in \mathbb{R}^n} F(x), \qquad (17.59)$$
where
$$F(x) = \max_{\lambda \ge 0} \; f(x) - \sum_{i \in \mathcal{I}} \lambda_i c_i(x) = \begin{cases} f(x) & \text{if } x \text{ is feasible,} \\ \infty & \text{otherwise.} \end{cases} \qquad (17.60)$$
To verify these expressions for $F$, consider first the case of $x$ infeasible, that is, $c_i(x) < 0$ for some $i$. We can then choose $\lambda_i$ arbitrarily large and positive while setting $\lambda_j = 0$ for all $j \ne i$, to verify that $F(x)$ is infinite in this case. If $x$ is feasible, we have $c_i(x) \ge 0$ for all $i \in \mathcal{I}$, so the maximum is attained at $\lambda = 0$, and $F(x) = f(x)$ in this case. By combining (17.59) with (17.60), we have
$$\min_{x \in \mathbb{R}^n} F(x) = \min_{x \text{ feasible}} f(x), \qquad (17.61)$$
which is simply the original inequality-constrained problem. It is not practical to minimize $F$ directly, however, since this function is not smooth: it jumps from a finite value to an infinite value as $x$ crosses the boundary of the feasible set.
We can make this approach more practical by replacing $F$ by a smooth approximation $\hat{F}(x; \lambda^k, \mu_k)$, which depends on the penalty parameter $\mu_k$ and Lagrange multiplier estimate $\lambda^k$. This approximation is defined as follows:
$$\hat{F}(x; \lambda^k, \mu_k) = \max_{\lambda \ge 0} \; f(x) - \sum_{i \in \mathcal{I}} \lambda_i c_i(x) - \frac{1}{2\mu_k} \sum_{i \in \mathcal{I}} \bigl( \lambda_i - \lambda_i^k \bigr)^2. \qquad (17.62)$$
The final term in this expression applies a penalty for any move of $\lambda$ away from the previous estimate $\lambda^k$; it encourages the new maximizer to stay proximal to the previous estimate $\lambda^k$. Since (17.62) represents a bound-constrained quadratic problem in $\lambda$, separable in the individual components $\lambda_i$, we can perform the maximization explicitly, to obtain
$$\lambda_i = \begin{cases} 0 & \text{if } -c_i(x)\mu_k + \lambda_i^k \le 0, \\ \lambda_i^k - \mu_k c_i(x) & \text{otherwise.} \end{cases} \qquad (17.63)$$
By substituting these values in (17.62), we find that
$$\hat{F}(x; \lambda^k, \mu_k) = f(x) + \sum_{i \in \mathcal{I}} \psi\bigl(c_i(x), \lambda_i^k; \mu_k\bigr), \qquad (17.64)$$
where the function $\psi$ of three scalar arguments is defined as follows:
$$\psi(t, \lambda; \mu) \stackrel{\mathrm{def}}{=} \begin{cases} -\lambda t + \dfrac{\mu t^2}{2} & \text{if } t - \lambda/\mu \le 0, \\[4pt] -\dfrac{\lambda^2}{2\mu} & \text{otherwise.} \end{cases} \qquad (17.65)$$
Hence, we can obtain the new iterate $x_k$ by minimizing $\hat{F}(x; \lambda^k, \mu_k)$ with respect to $x$, and use the formula (17.63) to obtain the updated Lagrange multiplier estimates $\lambda^{k+1}$. By comparing with Framework 17.3, we see that $\hat{F}$ plays the role of $\mathcal{L}_A$ and that the scheme just described extends the augmented Lagrangian methods for equality constraints neatly to the inequality-constrained case. Unlike the bound-constrained and linearly constrained formulations, however, this unconstrained formulation is not the basis of any widely used software packages, so its practical properties have not been tested.
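The following short Python sketch (our own illustration; `f` and `c` are assumed callables returning the objective value and the vector of inequality-constraint values) shows how cheaply (17.63)-(17.65) can be evaluated:

    import numpy as np

    def psi(t, lam, mu):
        # The function of three scalar arguments in (17.65).
        if t <= lam / mu:
            return -lam * t + 0.5 * mu * t**2
        return -lam**2 / (2.0 * mu)

    def F_hat(x, lam_k, mu_k, f, c):
        # Smooth approximation (17.64) of F.
        return f(x) + sum(psi(t, lam, mu_k) for t, lam in zip(c(x), lam_k))

    def multiplier_update(x, lam_k, mu_k, c):
        # The explicit maximizers (17.63), used to form lambda^{k+1}.
        return np.maximum(lam_k - mu_k * c(x), 0.0)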
17.5 PERSPECTIVES AND SOFTWARE
The quadratic penalty approach is often used by practitioners when the number of constraints is small. In fact, minimization of $Q(x; \mu)$ is sometimes performed for just one large value of $\mu$. Unless $\mu$ is chosen wisely (with the benefit of experience with the underlying application), the resulting solution may not be very accurate. Since the main software packages for constrained optimization do not implement a quadratic penalty approach, little attention has been paid to techniques for updating the penalty parameter, adjusting the tolerances $\tau_k$, and choosing the starting points $x_k^s$ for each iteration. See Gould [141] for a discussion of these issues.
Despite the intuitive appeal and simplicity of the quadratic penalty method of Framework 17.1, the augmented Lagrangian method of Sections 17.3 and 17.4 is generally preferred. The subproblems are in general no more difficult to solve, and the introduction of multiplier estimates reduces the likelihood that large values of $\mu$ will be needed to obtain good feasibility and accuracy, thereby avoiding ill conditioning of the subproblem. The quadratic penalty approach remains, however, an important mechanism for regularizing other algorithms, such as sequential quadratic programming (SQP) methods, as we mention at the end of Section 17.1.
A general-purpose $\ell_1$ penalty method was developed by Fletcher in the 1980s. It is known as the S$\ell_1$QP method because it has features in common with SQP methods. More recently, an $\ell_1$ penalty method that uses linear programming subproblems has been implemented as part of the KNITRO [46] software package. These two methods are discussed in Section 18.5.
The $\ell_1$ penalty function has received significant attention in recent years. It has been successfully used to treat difficult problems, such as mathematical programs with complementarity constraints (MPCCs), in which the constraints do not satisfy standard constraint qualifications [274]. By including these problematic constraints as a penalty term, rather than linearizing them exactly, and treating the remaining constraints using other techniques such as SQP or interior-point methods, it is possible to extend the range of applicability of these other approaches. See [8] for an active-set method and [16, 191] for interior-point methods for MPCCs. The SNOPT software package uses an $\ell_1$ penalty approach within an SQP method as a safeguard strategy, in case the quadratic model appears to be infeasible or unbounded or to have unbounded multipliers.
Augmented Lagrangian methods have been popular for many years because, in part, of their simplicity. The MINOS and LANCELOT packages rank among the best implementations of augmented Lagrangian methods. Both are suitable for large-scale nonlinear programming problems. At a general level, the linearly constrained Lagrangian (LCL) method of MINOS and the bound-constrained Lagrangian (BCL) method of LANCELOT have important features in common. They differ significantly, however, in the formulation of the step-computation subproblems and in the techniques used to solve these subproblems. MINOS follows a reduced-space approach to handle linearized constraints and employs a dense quasi-Newton approximation to the Hessian of the Lagrangian. As a result, MINOS
is most successful for problems with relatively few degrees of freedom. LANCELOT, on the other hand, is more effective when there are relatively few constraints. As indicated in Section 17.4, LANCELOT does not require a factorization of the constraint Jacobian matrix $A$ (again enhancing its suitability for very large problems) and provides a variety of Hessian approximation options and preconditioners. The PENNON software package [184] is based on an augmented Lagrangian approach and has the advantage of permitting semidefinite matrix constraints.
A weakness of both the bound-constrained and unconstrained Lagrangian methods is that they complicate the constraints by squaring them in (17.49); progress in feasibility is achieved only through the minimization of the augmented Lagrangian. In contrast, the LCL formulation (17.55) promotes steady progress toward feasibility by performing a Newton-like step on the constraints. Not surprisingly, numerical experience has shown an advantage of MINOS over LANCELOT for problems with linear constraints.
Smooth exact penalty functions have been constructed from the augmented Lagrangian functions of Section 17.3, but these are considerably more complicated. As an example, we mention the function of Fletcher for equality-constrained problems, defined as follows:
$$\phi_{\mathrm{F}}(x; \mu) = f(x) - \lambda(x)^T c(x) + \frac{\mu}{2} \sum_{i \in \mathcal{E}} c_i(x)^2. \qquad (17.66)$$
The Lagrange multiplier estimates $\lambda(x)$ are defined explicitly in terms of $x$ via the least-squares estimate, defined as
$$\lambda(x) = \bigl[ A(x) A(x)^T \bigr]^{-1} A(x) \nabla f(x). \qquad (17.67)$$
The function $\phi_{\mathrm{F}}$ is differentiable and exact, though the threshold value of $\mu$ defining the exactness property is not as easy to specify as for the nonsmooth $\ell_1$ penalty function. Drawbacks of the penalty function $\phi_{\mathrm{F}}$ include the cost of evaluating $\lambda(x)$ via (17.67), the fact that $\lambda(x)$ is not uniquely defined when $A(x)$ does not have full rank, and the observation that estimates of $\lambda^*$ may be poor when $A(x)$ is nearly singular.
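As an illustration, $\lambda(x)$ in (17.67) can be computed without forming $A(x)A(x)^T$ explicitly, by solving the least-squares problem whose normal equations it represents (a NumPy sketch of our own, assuming $A$ has full row rank):

    import numpy as np

    def fletcher_multipliers(A, grad_f):
        # lambda(x) = [A A^T]^{-1} A grad f(x): the least-squares solution
        # of A^T lam ~ grad f(x), i.e., formula (17.67).
        lam, *_ = np.linalg.lstsq(A.T, grad_f, rcond=None)
        return lam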
NOTES AND REFERENCES
The quadratic penalty function was first proposed by Courant [81]. Gould [140] addresses the issue of stable determination of the Newton step for $Q(x; \mu_k)$. His formula (2.2) differs from our formula (17.20) in the right-hand side, but both systems give rise to the same $p$ component.
The augmented Lagrangian method was proposed by Hestenes [167] and Powell [240]. In the early days it was known as the method of multipliers. A key reference in this area is Bertsekas [18]. Chapters 1-3 of that book contain a thorough motivation of the method that outlines its connections to other approaches. Other introductory discussions
are given by Fletcher [101], Section 12.2, and Polak [236], Section 2.8. The extension to inequality constraints in the unconstrained formulation was described by Rockafellar [269] and Powell [243].
Linearly constrained Lagrangian methods were proposed by Robinson [266] and Rosen and Kreuser [271]. The MINOS implementation is due to Murtagh and Saunders [218] and the LANCELOT implementation due to Conn, Gould and Toint [72]. We have followed Friedlander and Saunders [114] in our use of the terms linearly constrained Lagrangian and bound-constrained Lagrangian.
EXERCISES

17.1
(a) Write an equality-constrained problem which has a local solution and for which the quadratic penalty function $Q$ is unbounded for any value of the penalty parameter.
(b) Write a problem with a single inequality constraint that has the same unboundedness property.
17.2 Draw the contour lines of the quadratic penalty function $Q$ for problem (17.5) corresponding to $\mu = 1$. Find the stationary points of $Q$.
17.3 Minimize the quadratic penalty function for problem (17.3) for $\mu_k = 1, 10, 100, 1000$ using an unconstrained minimization algorithm. Set $\tau_k = 1/\mu_k$ in Framework 17.1, and choose the starting point $x_{k+1}^s$ for each minimization to be the solution for the previous value of the penalty parameter. Report the approximate solution of each penalty function.
17.4 For $z \in \mathbb{R}$, show that the function $\bigl(\min(0, z)\bigr)^2$ has a discontinuous second derivative at $z = 0$. It follows that the quadratic penalty function (17.7) may not have continuous second derivatives even when $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, in (17.6) are all twice continuously differentiable.
17.5 Write a quadratic program similar to (17.31) for the case when the norm in (17.32) is the infinity norm.
17.6 Suppose that a nonlinear program has a minimizer $x^*$ with Lagrange multiplier vector $\lambda^*$. One can show (Fletcher [101], Theorem 14.3.2) that the function $\phi_1(x; \mu)$ does not have a local minimizer at $x^*$ unless $\mu \ge \|\lambda^*\|_\infty$. Verify that this observation holds for Example 17.1.
17.7 Verify (17.28).
17.8 Prove the second part of Theorem 17.4. That is, if $\hat{x}$ is a stationary point of $\phi_1(x; \mu)$ for all $\mu$ sufficiently large, but $\hat{x}$ is infeasible for problem (17.6), then $\hat{x}$ is an infeasible stationary point. Hint: Use the fact that $D\bigl(\phi_1(\hat{x}; \mu); p\bigr) = \nabla f(\hat{x})^T p + \mu\, D\bigl(h(\hat{x}); p\bigr)$, where $h$ is defined in (17.27).
17.9 Verify that the KKT conditions for the bound-constrained problem
$$\min_{x \in \mathbb{R}^n} \phi(x) \quad \text{subject to} \quad l \le x \le u$$
are equivalent to the compactly stated condition
$$x - P\bigl(x - \nabla \phi(x), l, u\bigr) = 0,$$
where the projection operator $P$ onto the rectangular box $[l, u]$ is defined in (17.52).
17.10 Calculate the gradient and Hessian of the LCL objective functions $F_k(x)$ defined by (17.56) and (17.58). Evaluate these quantities at $x = x_k$.
17.11 Show that the function $\psi(t, \lambda; \mu)$ defined in (17.65) has a discontinuity in its second derivative with respect to $t$ when $t = \lambda/\mu$. Assuming that $c_i : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable, write down the second partial derivative matrix of $\psi(c_i(x), \lambda_i; \mu)$ with respect to $x$ for the two cases $c_i(x) \le \lambda_i/\mu$ and $c_i(x) > \lambda_i/\mu$.
17.12 Verify that the multipliers $\lambda_i$, $i \in \mathcal{I}$, defined in (17.63) are indeed those that attain the maximum in (17.62), and that the equality (17.64) holds. Hint: Use the fact that the KKT conditions for the problem
$$\max_x \phi(x) \quad \text{subject to} \quad x \ge 0$$
indicate that at a stationary point we either have $x_i = 0$ and $\partial \phi(x)/\partial x_i \le 0$, or $x_i > 0$ and $\partial \phi(x)/\partial x_i = 0$.
CHAPTER 18
Sequential Quadratic Programming
One of the most effective methods for nonlinearly constrained optimization generates steps by solving quadratic subproblems. This sequential quadratic programming (SQP) approach can be used both in line search and trust-region frameworks, and is appropriate for small or large problems. Unlike linearly constrained Lagrangian methods (Chapter 17), which are effective when most of the constraints are linear, SQP methods show their strength when solving problems with significant nonlinearities in the constraints.
All the methods considered in this chapter are active-set methods; a more descriptive title for this chapter would perhaps be "Active-Set Methods for Nonlinear Programming."
In Chapter 19 we study interior-point methods for nonlinear programming, a competing approach for handling inequality-constrained problems.
There are two types of active-set SQP methods. In the IQP approach, a general inequality-constrained quadratic program is solved at each iteration, with the twin goals of computing a step and generating an estimate of the optimal active set. EQP methods decouple these computations. They first compute an estimate of the optimal active set, then solve an equality-constrained quadratic program to find the step. In this chapter we study both IQP and EQP methods.
Our development of SQP methods proceeds in two stages. First, we consider local methods that motivate the SQP approach and allow us to introduce the step computation techniques in a simple setting. Second, we consider practical line search and trust-region methods that achieve convergence from remote starting points. Throughout the chapter we give consideration to the algorithmic demands of solving large problems.
18.1 LOCAL SQP METHOD
We begin by considering the equality-constrained problem
$$\min_x f(x) \qquad (18.1a)$$
$$\text{subject to} \quad c(x) = 0, \qquad (18.1b)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ and $c : \mathbb{R}^n \to \mathbb{R}^m$ are smooth functions. The idea behind the SQP approach is to model (18.1) at the current iterate $x_k$ by a quadratic programming subproblem, then use the minimizer of this subproblem to define a new iterate $x_{k+1}$. The challenge is to design the quadratic subproblem so that it yields a good step for the nonlinear optimization problem. Perhaps the simplest derivation of SQP methods, which we present now, views them as an application of Newton's method to the KKT optimality conditions for (18.1).
From (12.33), we know that the Lagrangian function for this problem is $\mathcal{L}(x, \lambda) = f(x) - \lambda^T c(x)$. We use $A(x)$ to denote the Jacobian matrix of the constraints, that is,
$$A(x)^T = \bigl[ \nabla c_1(x), \nabla c_2(x), \dots, \nabla c_m(x) \bigr], \qquad (18.2)$$
where $c_i(x)$ is the $i$th component of the vector $c(x)$. The first-order KKT conditions (12.34) of the equality-constrained problem (18.1) can be written as a system of $n + m$ equations in the $n + m$ unknowns $x$ and $\lambda$:
$$F(x, \lambda) = \begin{bmatrix} \nabla f(x) - A(x)^T \lambda \\ c(x) \end{bmatrix} = 0. \qquad (18.3)$$
Any solution $(x^*, \lambda^*)$ of the equality-constrained problem (18.1) for which $A(x^*)$ has full rank satisfies (18.3). One approach that suggests itself is to solve the nonlinear equations (18.3) by using Newton's method, as described in Chapter 11.
The Jacobian of (18.3) with respect to $x$ and $\lambda$ is given by
$$F'(x, \lambda) = \begin{bmatrix} \nabla^2_{xx} \mathcal{L}(x, \lambda) & -A(x)^T \\ A(x) & 0 \end{bmatrix}. \qquad (18.4)$$
The Newton step from the iterate $(x_k, \lambda_k)$ is thus given by
$$\begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \end{bmatrix} = \begin{bmatrix} x_k \\ \lambda_k \end{bmatrix} + \begin{bmatrix} p_k \\ p_\lambda \end{bmatrix}, \qquad (18.5)$$
where $p_k$ and $p_\lambda$ solve the Newton-KKT system
$$\begin{bmatrix} \nabla^2_{xx} \mathcal{L}_k & -A_k^T \\ A_k & 0 \end{bmatrix} \begin{bmatrix} p_k \\ p_\lambda \end{bmatrix} = \begin{bmatrix} -\nabla f_k + A_k^T \lambda_k \\ -c_k \end{bmatrix}. \qquad (18.6)$$
This Newton iteration is well defined when the KKT matrix in (18.6) is nonsingular. We saw in Chapter 16 that this matrix is nonsingular if the following assumptions hold at $(x, \lambda) = (x_k, \lambda_k)$.
Assumptions 18.1.
(a) The constraint Jacobian $A(x)$ has full row rank;
(b) The matrix $\nabla^2_{xx} \mathcal{L}(x, \lambda)$ is positive definite on the tangent space of the constraints, that is, $d^T \nabla^2_{xx} \mathcal{L}(x, \lambda) d > 0$ for all $d \ne 0$ such that $A(x) d = 0$.
The first assumption is the linear independence constraint qualification discussed in Chapter 12 (see Definition 12.4), which we assume throughout this chapter. The second condition holds whenever $(x, \lambda)$ is close to the optimum $(x^*, \lambda^*)$ and the second-order sufficient condition is satisfied at the solution (see Theorem 12.6). The Newton iteration (18.5), (18.6) can be shown to be quadratically convergent under these assumptions (see Theorem 18.4) and constitutes an excellent algorithm for solving equality-constrained problems, provided that the starting point is close enough to $x^*$.
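For small problems, one step of the iteration (18.5), (18.6) can be coded directly with dense linear algebra. The following NumPy sketch is our own illustration (all quantities are arrays evaluated at the current pair), not a production implementation:

    import numpy as np

    def newton_kkt_step(x, lam, grad_f, c, A, hess_L):
        # Assemble and solve the Newton-KKT system (18.6) for (p_k, p_lambda).
        n, m = x.size, lam.size
        K = np.block([[hess_L, -A.T],
                      [A, np.zeros((m, m))]])
        rhs = np.concatenate([-grad_f + A.T @ lam, -c])
        p = np.linalg.solve(K, rhs)
        # Apply the update (18.5).
        return x + p[:n], lam + p[n:]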
SQP FRAMEWORK
There is an alternative way to view the iteration (18.5), (18.6). Suppose that at the iterate $(x_k, \lambda_k)$ we model problem (18.1) using the quadratic program
$$\min_p \; f_k + \nabla f_k^T p + \tfrac12 p^T \nabla^2_{xx} \mathcal{L}_k p \qquad (18.7a)$$
$$\text{subject to} \quad A_k p + c_k = 0. \qquad (18.7b)$$
If Assumptions 18.1 hold, this problem has a unique solution $(p_k, l_k)$ that satisfies
$$\nabla^2_{xx} \mathcal{L}_k p_k + \nabla f_k - A_k^T l_k = 0, \qquad (18.8a)$$
$$A_k p_k + c_k = 0. \qquad (18.8b)$$
The vectors $p_k$ and $l_k$ can be identified with the solution of the Newton equations (18.6). If we subtract $A_k^T \lambda_k$ from both sides of the first equation in (18.6), we obtain
$$\begin{bmatrix} \nabla^2_{xx} \mathcal{L}_k & -A_k^T \\ A_k & 0 \end{bmatrix} \begin{bmatrix} p_k \\ \lambda_{k+1} \end{bmatrix} = \begin{bmatrix} -\nabla f_k \\ -c_k \end{bmatrix}. \qquad (18.9)$$
Hence, by nonsingularity of the coefficient matrix, we have that $\lambda_{k+1} = l_k$ and that $p_k$ solves (18.7) and (18.6).
The new iterate $(x_{k+1}, \lambda_{k+1})$ can therefore be defined either as the solution of the quadratic program (18.7) or as the iterate generated by Newton's method (18.5), (18.6) applied to the optimality conditions of the problem. Both viewpoints are useful. The Newton point of view facilitates the analysis, whereas the SQP framework enables us to derive practical algorithms and to extend the technique to the inequality-constrained case.
We now state the SQP method in its simplest form.
Algorithm 18.1 (Local SQP Algorithm for solving (18.1)).
Choose an initial pair $(x_0, \lambda_0)$; set $k \leftarrow 0$;
repeat until a convergence test is satisfied
    Evaluate $f_k$, $\nabla f_k$, $\nabla^2_{xx} \mathcal{L}_k$, $c_k$, and $A_k$;
    Solve (18.7) to obtain $p_k$ and $l_k$;
    Set $x_{k+1} \leftarrow x_k + p_k$ and $\lambda_{k+1} \leftarrow l_k$;
end (repeat)
We note in passing that, in the objective (18.7a) of the quadratic program, we could replace the linear term $\nabla f_k^T p$ by $\nabla_x \mathcal{L}(x_k, \lambda_k)^T p$, since the constraint (18.7b) makes the two choices equivalent. In this case, (18.7a) is a quadratic approximation of the Lagrangian function. This fact provides a motivation for our choice of the quadratic model (18.7): We first replace the nonlinear program (18.1) by the problem of minimizing the Lagrangian subject to the equality constraints (18.1b), then make a quadratic approximation to the Lagrangian and a linear approximation to the constraints to obtain (18.7).
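Algorithm 18.1 translates into a few lines of Python. The sketch below is our own schematic version; `qp_solve` is an assumed routine returning the solution $p_k$ and multiplier $l_k$ of the equality-constrained QP (18.7), for example by solving the system (18.9):

    import numpy as np

    def local_sqp(x, lam, prob, qp_solve, tol=1e-8, max_iter=50):
        # prob supplies callables grad_f, c, A (Jacobian), and hess_L.
        for _ in range(max_iter):
            g, cv = prob.grad_f(x), prob.c(x)
            A, H = prob.A(x), prob.hess_L(x, lam)
            # Convergence test on the KKT residual (18.3).
            if max(np.linalg.norm(g - A.T @ lam, np.inf),
                   np.linalg.norm(cv, np.inf)) < tol:
                break
            p, l = qp_solve(H, g, A, cv)  # solve (18.7)
            x, lam = x + p, l             # full step; no globalization
        return x, lam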
INEQUALITY CONSTRAINTS
The SQP framework can be extended easily to the general nonlinear programming problem
$$\min_x f(x) \qquad (18.10a)$$
$$\text{subject to} \quad c_i(x) = 0, \; i \in \mathcal{E}, \qquad (18.10b)$$
$$c_i(x) \ge 0, \; i \in \mathcal{I}. \qquad (18.10c)$$
To model this problem we now linearize both the inequality and equality constraints to obtain
$$\min_p \; f_k + \nabla f_k^T p + \tfrac12 p^T \nabla^2_{xx} \mathcal{L}_k p \qquad (18.11a)$$
$$\text{subject to} \quad \nabla c_i(x_k)^T p + c_i(x_k) = 0, \; i \in \mathcal{E}, \qquad (18.11b)$$
$$\nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \; i \in \mathcal{I}. \qquad (18.11c)$$
We can use one of the algorithms for quadratic programming described in Chapter 16 to solve this problem. The new iterate is given by $(x_k + p_k, \lambda_{k+1})$, where $p_k$ and $\lambda_{k+1}$ are the solution and the corresponding Lagrange multiplier of (18.11). A local SQP method for (18.10) is thus given by Algorithm 18.1, with the modification that the step is computed from (18.11).
In this IQP approach the set of active constraints $\mathcal{A}_k$ at the solution of (18.11) constitutes our guess of the active set at the solution of the nonlinear program. If the SQP method is able to correctly identify this optimal active set (and not change its guess at a subsequent iteration), then it will act like a Newton method for equality-constrained optimization and will converge rapidly. The following result gives conditions under which this desirable behavior takes place. Recall that strict complementarity is said to hold at a solution pair $(x^*, \lambda^*)$ if there is no index $i \in \mathcal{I}$ such that $\lambda_i^* = c_i(x^*) = 0$.
Theorem 18.1 (Robinson [267]).
Suppose that $x^*$ is a local solution of (18.10) at which the KKT conditions are satisfied for some $\lambda^*$. Suppose, too, that the linear independence constraint qualification (LICQ) (Definition 12.4), the strict complementarity condition (Definition 12.5), and the second-order sufficient conditions (Theorem 12.6) hold at $(x^*, \lambda^*)$. Then if $(x_k, \lambda_k)$ is sufficiently close to $(x^*, \lambda^*)$, there is a local solution of the subproblem (18.11) whose active set $\mathcal{A}_k$ is the same as the active set $\mathcal{A}(x^*)$ of the nonlinear program (18.10) at $x^*$.
It is also remarkable that, far from the solution, the SQP approach is usually able to improve the estimate of the active set and guide the iterates toward a solution; see Section 18.7.
18.2 PREVIEW OF PRACTICAL SQP METHODS
IQP AND EQP
There are two ways of designing SQP methods for solving the general nonlinear programming problem (18.10). The first is the approach just described, which solves at
every iteration the quadratic subprogram (18.11), taking the active set at the solution of this subproblem as a guess of the optimal active set. This approach is referred to as the IQP (inequality-constrained QP) approach; it has proved to be quite successful in practice. Its main drawback is the expense of solving the general quadratic program (18.11), which can be high when the problem is large. As the iterates of the SQP method converge to the solution, however, solving the quadratic subproblem becomes economical if we use information from the previous iteration to make a good guess of the optimal solution of the current subproblem. This warm-start strategy is described below.
The second approach selects a subset of constraints at each iteration to be the so-called working set, and solves only equality-constrained subproblems of the form (18.7), where the constraints in the working set are imposed as equalities and all other constraints are ignored. The working set is updated at every iteration by rules based on Lagrange multiplier estimates, or by solving an auxiliary subproblem. This EQP (equality-constrained QP) approach has the advantage that the equality-constrained quadratic subproblems are less expensive to solve than (18.11) in the large-scale case.
An example of an EQP method is the sequential linear-quadratic programming (SLQP) method discussed in Section 18.5. This approach constructs a linear program by omitting the quadratic term $p^T \nabla^2_{xx} \mathcal{L}_k p$ from (18.11a) and adding a trust-region constraint $\|p\|_\infty \le \Delta_k$ to the subproblem. The active set of the resulting linear programming subproblem is taken to be the working set for the current iteration. The method then fixes the constraints in the working set and solves an equality-constrained quadratic program (with the term $p^T \nabla^2_{xx} \mathcal{L}_k p$ reinserted) to obtain the SQP step. Another successful EQP method is the gradient projection method described in Section 16.7 in the context of bound-constrained quadratic programs. In this method, the working set is determined by minimizing a quadratic model along the path obtained by projecting the steepest descent direction onto the feasible region.
ENFORCING CONVERGENCE
To be practical, an SQP method must be able to converge from remote starting points and on nonconvex problems. We now outline how the local SQP strategy can be adapted to meet these goals.
We begin by drawing an analogy with unconstrained optimization. In its simplest form, the Newton iteration for minimizing a function f takes a step to the minimizer of the quadratic model
$$m_k(p) = f_k + \nabla f_k^T p + \tfrac12 p^T \nabla^2 f_k p.$$
This framework is useful near the solution, where the Hessian $\nabla^2 f(x_k)$ is normally positive definite and the quadratic model has a well defined minimizer. When $x_k$ is not close to the solution, however, the model function $m_k$ may not be convex. Trust-region methods ensure that the new iterate is always well defined and useful by restricting the candidate step $p_k$
to some neighborhood of the origin. Line search methods modify the Hessian in $m_k(p)$ to make it positive definite (possibly replacing it by a quasi-Newton approximation $B_k$), to ensure that $p_k$ is a descent direction for the objective function $f$.
Similar strategies are used to globalize SQP methods. If $\nabla^2_{xx} \mathcal{L}_k$ is positive definite on the tangent space of the active constraints, the quadratic subproblem (18.7) has a unique solution. When $\nabla^2_{xx} \mathcal{L}_k$ does not have this property, line search methods either replace it by a positive definite approximation $B_k$ or modify $\nabla^2_{xx} \mathcal{L}_k$ directly during the process of matrix factorization. In all these cases, the subproblem (18.7) becomes well defined, but the modifications may introduce unwanted distortions in the model.
Trust-region SQP methods add a constraint to the subproblem, limiting the step to a region within which the model (18.7) is considered reliable. These methods are able to handle indefinite Hessians $\nabla^2_{xx} \mathcal{L}_k$. The inclusion of the trust region may, however, cause the subproblem to become infeasible, and the procedures for handling this situation complicate the algorithms and increase their computational cost. Due to these tradeoffs, neither of the two SQP approaches (line search or trust-region) is currently regarded as clearly superior to the other.
The technique used to accept or reject steps also impacts the efficiency of SQP methods. In unconstrained optimization, the merit function is simply the objective $f$, and it remains fixed throughout the minimization procedure. For constrained problems, we use devices such as a merit function or a filter (see Section 15.4). The parameters or entries used in these devices must be updated in a way that is compatible with the step produced by the SQP method.
18.3 ALGORITHMIC DEVELOPMENT
In this section we expand on the ideas of the previous section and describe various ingredients needed to produce practical SQP algorithms. We focus on techniques for ensuring that the subproblems are always feasible, on alternative choices for the Hessian of the quadratic model, and on step-acceptance mechanisms.
HANDLING INCONSISTENT LINEARIZATIONS
A possible difficulty with SQP methods is that the linearizations (18.11b), (18.11c) of the nonlinear constraints may give rise to an infeasible subproblem. Consider, for example, the case in which $n = 1$ and the constraints are $x \le 1$ and $x^2 \ge 4$. When we linearize these constraints at $x_k = 1$, we obtain the inequalities
$$-p \ge 0 \quad \text{and} \quad 2p - 3 \ge 0,$$
which are inconsistent.
To overcome this difficulty, we can reformulate the nonlinear program (18.10) as the $\ell_1$ penalty problem
$$\min_{x, v, w, t} \; f(x) + \mu \sum_{i \in \mathcal{E}} (v_i + w_i) + \mu \sum_{i \in \mathcal{I}} t_i \qquad (18.12a)$$
$$\text{subject to} \quad c_i(x) = v_i - w_i, \; i \in \mathcal{E}, \qquad (18.12b)$$
$$c_i(x) \ge -t_i, \; i \in \mathcal{I}, \qquad (18.12c)$$
$$v, w, t \ge 0, \qquad (18.12d)$$
for some positive choice of the penalty parameter $\mu$. The quadratic subproblem (18.11) associated with (18.12) is always feasible. As discussed in Chapter 17, if the nonlinear problem (18.10) has a solution $x^*$ that satisfies certain regularity assumptions, and if the penalty parameter $\mu$ is sufficiently large, then $x^*$ (along with $v_i^* = w_i^* = 0$, $i \in \mathcal{E}$, and $t_i^* = 0$, $i \in \mathcal{I}$) is a solution of the penalty problem (18.12). If, on the other hand, there is no feasible solution to the nonlinear problem and $\mu$ is large enough, then the penalty problem (18.12) usually determines a stationary point of the infeasibility measure. The choice of $\mu$ has been discussed in Chapter 17 and is considered again in Section 18.5. The SNOPT software package [127] uses the formulation (18.12), which is sometimes called the elastic mode, to deal with inconsistencies of the linearized constraints.
Other procedures for relaxing the constraints are presented in Section 18.5 in the context of trust-region methods.
FULL QUASI-NEWTON APPROXIMATIONS
The Hessian of the Lagrangian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda_k)$ is made up of second derivatives of the objective function and constraints. In some applications, this information is not easy to compute, so it is useful to consider replacing the Hessian $\nabla^2_{xx} \mathcal{L}(x_k, \lambda_k)$ in (18.11a) by a quasi-Newton approximation. Since the BFGS and SR1 formulae have proved to be successful in the context of unconstrained optimization, we can employ them here as well.
The update for $B_k$ that results from the step from iterate $k$ to iterate $k+1$ makes use of the vectors $s_k$ and $y_k$ defined as follows:
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}). \qquad (18.13)$$
We compute the new approximation $B_{k+1}$ using the BFGS or SR1 formulae given, respectively, by (6.19) and (6.24). We can view this process as the application of quasi-Newton updating to the case in which the objective function is given by the Lagrangian $\mathcal{L}(x, \lambda)$ (with $\lambda$ fixed). This viewpoint immediately reveals the strengths and weaknesses of this approach.
If $\nabla^2_{xx} \mathcal{L}$ is positive definite in the region where the minimization takes place, then BFGS quasi-Newton approximations $B_k$ will reflect some of the curvature information of the problem, and the iteration will converge robustly and rapidly, just as in the unconstrained BFGS method. If, however, $\nabla^2_{xx} \mathcal{L}$ contains negative eigenvalues, then the BFGS approach of approximating it with a positive definite matrix may be problematic. BFGS updating requires that $s_k$ and $y_k$ satisfy the curvature condition $s_k^T y_k > 0$, which may not hold when $s_k$ and $y_k$ are defined by (18.13), even when the iterates are close to the solution.

To overcome this difficulty, we could skip the BFGS update if the condition
$$s_k^T y_k \ge \theta s_k^T B_k s_k \qquad (18.14)$$
is not satisfied, where $\theta$ is a positive parameter ($10^{-2}$, say). This strategy may, on occasion, yield poor performance or even failure, so it cannot be regarded as adequate for general-purpose algorithms.

A more effective modification ensures that the update is always well defined by modifying the definition of $y_k$.

Procedure 18.2 (Damped BFGS Updating).
Given: symmetric and positive definite matrix $B_k$;
Define $s_k$ and $y_k$ as in (18.13) and set
$$r_k = \theta_k y_k + (1 - \theta_k) B_k s_k,$$
where the scalar $\theta_k$ is defined as
$$\theta_k = \begin{cases} 1 & \text{if } s_k^T y_k \ge 0.2\, s_k^T B_k s_k, \\[4pt] \dfrac{0.8\, s_k^T B_k s_k}{s_k^T B_k s_k - s_k^T y_k} & \text{if } s_k^T y_k < 0.2\, s_k^T B_k s_k; \end{cases} \qquad (18.15)$$
Update $B_k$ as follows:
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{r_k r_k^T}{s_k^T r_k}. \qquad (18.16)$$

The formula (18.16) is simply the standard BFGS update formula, with $y_k$ replaced by $r_k$. It guarantees that $B_{k+1}$ is positive definite, since it is easy to show that when $\theta_k \ne 1$ we have
$$s_k^T r_k = 0.2\, s_k^T B_k s_k > 0. \qquad (18.17)$$
To gain more insight into this strategy, note that the choice $\theta_k = 0$ gives $B_{k+1} = B_k$, while $\theta_k = 1$ gives the (possibly indefinite) matrix produced by the unmodified BFGS update. A value $\theta_k \in (0, 1)$ thus produces a matrix that interpolates the current approximation $B_k$ and the one produced by the unmodified BFGS formula. The choice of $\theta_k$ ensures that the new approximation stays close enough to the current approximation $B_k$ to ensure positive definiteness.
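Procedure 18.2 is straightforward to implement; a NumPy sketch (our own):

    import numpy as np

    def damped_bfgs_update(B, s, y):
        # Damped BFGS update (18.15)-(18.16); preserves positive definiteness.
        Bs = B @ s
        sBs = s @ Bs
        sy = s @ y
        theta = 1.0 if sy >= 0.2 * sBs else 0.8 * sBs / (sBs - sy)   # (18.15)
        r = theta * y + (1.0 - theta) * Bs                           # damped y_k
        return B - np.outer(Bs, Bs) / sBs + np.outer(r, r) / (s @ r) # (18.16)

Note that $s_k^T r_k = 0.2\, s_k^T B_k s_k$ whenever the damping is active, consistent with (18.17).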
Damped BFGS updating often works well, but it, too, can behave poorly on difficult problems. It still fails to address the underlying problem that the Lagrangian Hessian may not be positive definite. For this reason, SR1 updating may be more appropriate, and is indeed a good choice for trust-region SQP methods. An SR1 approximation to the Hessian of the Lagrangian is obtained by applying formula (6.24) with $s_k$ and $y_k$ defined by (18.13), using the safeguards described in Chapter 6. Line search methods cannot, however, accept indefinite Hessian approximations, and would therefore need to modify the SR1 formula, possibly by adding a sufficiently large multiple of the identity matrix; see the discussion around (19.25).
All quasi-Newton approximations $B_k$ discussed above are dense $n \times n$ matrices that can be expensive to store and manipulate in the large-scale case. Limited-memory updating is useful in this context and is often implemented in software packages. See (19.29) for an implementation of limited-memory BFGS in a constrained optimization algorithm.
REDUCED-HESSIAN QUASI-NEWTON APPROXIMATIONS
When we examine the KKT system (18.9) for the equality-constrained problem (18.1), we see that the part of the step $p_k$ in the range space of $A_k^T$ is completely determined by the second block row $A_k p_k = -c_k$. The Lagrangian Hessian $\nabla^2_{xx} \mathcal{L}_k$ affects only the part of $p_k$ in the orthogonal subspace, namely, the null space of $A_k$. It is reasonable, therefore, to consider quasi-Newton methods that find approximations to only that part of $\nabla^2_{xx} \mathcal{L}_k$ that affects the component of $p_k$ in the null space of $A_k$. In this section, we consider quasi-Newton methods based on these reduced-Hessian approximations. Our focus is on equality-constrained problems in this section, as existing SQP methods for the full problem (18.10) use reduced-Hessian approaches only after an equality-constrained subproblem has been generated.
To derive reduced-Hessian methods, we consider solution of the step equations (18.9) by means of the null-space approach of Section 16.2. In that section, we defined matrices $Y_k$ and $Z_k$ whose columns span the range space of $A_k^T$ and the null space of $A_k$, respectively. By writing
$$p_k = Y_k p_Y + Z_k p_Z \qquad (18.18)$$
and substituting into (18.9), we obtain the following system to be solved for $p_Y$ and $p_Z$:
$$(A_k Y_k) p_Y = -c_k, \qquad (18.19a)$$
$$\bigl(Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k\bigr) p_Z = -Z_k^T \nabla^2_{xx} \mathcal{L}_k Y_k p_Y - Z_k^T \nabla f_k. \qquad (18.19b)$$
From the first block of equations in (18.9) we see that the Lagrange multipliers $\lambda_{k+1}$, which are sometimes called QP multipliers, can be obtained by solving
$$(A_k Y_k)^T \lambda_{k+1} = Y_k^T \bigl( \nabla f_k + \nabla^2_{xx} \mathcal{L}_k p_k \bigr). \qquad (18.20)$$
We can avoid computation of the Hessian $\nabla^2_{xx} \mathcal{L}_k$ by introducing several approximations in the null-space approach. First, we delete the term involving $p_k$ from the right-hand side of (18.20), thereby decoupling the computations of $p_k$ and $\lambda_{k+1}$ and eliminating the need for $\nabla^2_{xx} \mathcal{L}_k$ in this term. This simplification can be justified by observing that $p_k$ converges to zero as we approach the solution, whereas $\nabla f_k$ normally does not. Therefore, the multipliers computed in this manner will be good estimates of the QP multipliers near the solution. More specifically, if we choose $Y_k = A_k^T$ (which is a valid choice for $Y_k$ when $A_k$ has full row rank; see (15.16)), we obtain
$$\lambda_{k+1} = \bigl(A_k A_k^T\bigr)^{-1} A_k \nabla f_k. \qquad (18.21)$$
These are called the least-squares multipliers because they can also be derived by solving the problem
$$\min_\lambda \; \| \nabla_x \mathcal{L}(x_k, \lambda) \|_2 = \| \nabla f_k - A_k^T \lambda \|_2. \qquad (18.22)$$
This observation shows that the least-squares multipliers are useful even when the current iterate is far from the solution, because they seek to satisfy the first-order optimality condition in (18.3) as closely as possible. Conceptually, the use of least-squares multipliers transforms the SQP method from a primal-dual iteration in $x$ and $\lambda$ to a purely primal iteration in the $x$ variable alone.
Our second simplification of the null-space approach is to remove the cross term $Z_k^T \nabla^2_{xx} \mathcal{L}_k Y_k p_Y$ in (18.19b), thereby yielding the simpler system
$$\bigl(Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k\bigr) p_Z = -Z_k^T \nabla f_k. \qquad (18.23)$$
This approach has the advantage that it needs to approximate only the matrix $Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$, not the $(n - m) \times m$ cross-term matrix $Z_k^T \nabla^2_{xx} \mathcal{L}_k Y_k$, which is a relatively large matrix when $m \gg n - m$. Dropping the cross term is justified when $Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$ is replaced by a quasi-Newton approximation because the normal component $p_Y$ usually converges to zero faster than the tangential component $p_Z$, thereby making (18.23) a good approximation of (18.19b).
Having dispensed with the partial Hessian $Z_k^T \nabla^2_{xx} \mathcal{L}_k Y_k$, we discuss how to approximate the remaining part $Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$. Suppose we have just taken a step $\alpha_k p_k = x_{k+1} - x_k = \alpha_k Z_k p_Z + \alpha_k Y_k p_Y$. By Taylor's theorem, writing $\nabla^2_{xx} \mathcal{L}_{k+1} = \nabla^2_{xx} \mathcal{L}(x_{k+1}, \lambda_{k+1})$, we have
$$\nabla^2_{xx} \mathcal{L}_{k+1} \alpha_k p_k \approx \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}).$$
By premultiplying by $Z_k^T$, we have
$$Z_k^T \nabla^2_{xx} \mathcal{L}_{k+1} Z_k \alpha_k p_Z \approx -Z_k^T \nabla^2_{xx} \mathcal{L}_{k+1} Y_k \alpha_k p_Y + Z_k^T \bigl[ \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}) \bigr]. \qquad (18.24)$$
If we drop the cross term $Z_k^T \nabla^2_{xx} \mathcal{L}_{k+1} Y_k \alpha_k p_Y$ (using the rationale discussed earlier), we see that the secant equation for a reduced-Hessian approximation $M_k \approx Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$ can be defined by
$$M_{k+1} s_k = y_k, \qquad (18.25)$$
where $s_k$ and $y_k$ are given by
$$s_k = \alpha_k p_Z, \qquad y_k = Z_k^T \bigl[ \nabla_x \mathcal{L}(x_k + \alpha_k p_k, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}) \bigr]. \qquad (18.26)$$
We then apply the BFGS or SR1 formulae, using these definitions for the correction vectors $s_k$ and $y_k$, to define the new approximation $M_{k+1}$. An advantage of this reduced-Hessian approach, compared to full-Hessian quasi-Newton approximations, is that the reduced Hessian is much more likely to be positive definite, even when the current iterate is some distance from the solution. When using the BFGS formula, the safeguarding mechanism discussed above will be required less often in line search implementations.
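As a concrete illustration (our own, not the book's prescription), bases $Y_k$ and $Z_k$ can be obtained from a full QR factorization of $A_k^T$, after which the secant pair (18.26) is cheap to form:

    import numpy as np

    def range_null_bases(A):
        # Full QR of A^T: the first m columns of Q span range(A^T),
        # the remaining n - m columns span null(A).
        m = A.shape[0]
        Q, _ = np.linalg.qr(A.T, mode='complete')
        return Q[:, :m], Q[:, m:]          # Y_k, Z_k

    def reduced_secant_pair(Z, grad_lag, x, alpha, p, pZ, lam_next):
        # s_k and y_k of (18.26); grad_lag(x, lam) = grad_x L(x, lam).
        s = alpha * pZ
        y = Z.T @ (grad_lag(x + alpha * p, lam_next)
                   - grad_lag(x, lam_next))
        return s, y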
MERIT FUNCTIONS
SQP methods often use a merit function to decide whether a trial step should be accepted. In line search methods, the merit function controls the size of the step; in trust-region methods it determines whether the step is accepted or rejected and whether the trust-region radius should be adjusted. A variety of merit functions have been used in SQP methods, including nonsmooth penalty functions and augmented Lagrangians. We limit our discussion to exact, nonsmooth merit functions typified by the $\ell_1$ merit function discussed in Chapters 15 and 17.
For the purpose of step computation and evaluation of a merit function, inequality constraints $c(x) \ge 0$ are often converted to the form
$$\bar{c}(x, s) = c(x) - s = 0,$$
where $s \ge 0$ is a vector of slacks. The condition $s \ge 0$ is typically not monitored by the merit function. Therefore, in the discussion that follows we assume that all constraints are in the form of equalities, and we focus our attention on problem (18.1).
The $\ell_1$ merit function for (18.1) takes the form
$$\phi_1(x; \mu) = f(x) + \mu \|c(x)\|_1. \qquad (18.27)$$
In a line search method, a step $\alpha_k p_k$ will be accepted if the following sufficient decrease condition holds:
$$\phi_1(x_k + \alpha_k p_k; \mu) \le \phi_1(x_k; \mu) + \eta \alpha_k D\bigl(\phi_1(x_k; \mu); p_k\bigr), \qquad \eta \in (0, 1), \qquad (18.28)$$
where $D(\phi_1(x_k; \mu); p_k)$ denotes the directional derivative of $\phi_1$ in the direction $p_k$. This requirement is analogous to the Armijo condition (3.4) for unconstrained optimization, provided that $p_k$ is a descent direction, that is, $D(\phi_1(x_k; \mu); p_k) < 0$. This descent condition holds if the penalty parameter $\mu$ is chosen sufficiently large, as we show in the following result.
Theorem 18.2.
Let $p_k$ and $\lambda_{k+1}$ be generated by the SQP iteration (18.9). Then the directional derivative of $\phi_1$ in the direction $p_k$ satisfies
$$D\bigl(\phi_1(x_k; \mu); p_k\bigr) = \nabla f_k^T p_k - \mu \|c_k\|_1. \qquad (18.29)$$
Moreover, we have that
$$D\bigl(\phi_1(x_k; \mu); p_k\bigr) \le -p_k^T \nabla^2_{xx} \mathcal{L}_k p_k - \bigl( \mu - \|\lambda_{k+1}\|_\infty \bigr) \|c_k\|_1. \qquad (18.30)$$
PROOF. By applying Taylor's theorem (see (2.5)) to $f$ and $c_i$, $i = 1, 2, \dots, m$, we obtain
$$\phi_1(x_k + \alpha p; \mu) - \phi_1(x_k; \mu) = f(x_k + \alpha p) - f_k + \mu \|c(x_k + \alpha p)\|_1 - \mu \|c_k\|_1 \le \alpha \nabla f_k^T p + \gamma \alpha^2 \|p\|^2 + \mu \|c_k + \alpha A_k p\|_1 - \mu \|c_k\|_1,$$
where the positive constant $\gamma$ bounds the second-derivative terms in $f$ and $c$. If $p = p_k$ is given by (18.9), we have that $A_k p_k = -c_k$, so for $\alpha \le 1$ we have that
$$\phi_1(x_k + \alpha p_k; \mu) - \phi_1(x_k; \mu) \le \alpha \bigl[ \nabla f_k^T p_k - \mu \|c_k\|_1 \bigr] + \alpha^2 \gamma \|p_k\|^2.$$
By arguing similarly, we also obtain the following lower bound:
$$\phi_1(x_k + \alpha p_k; \mu) - \phi_1(x_k; \mu) \ge \alpha \bigl[ \nabla f_k^T p_k - \mu \|c_k\|_1 \bigr] - \alpha^2 \gamma \|p_k\|^2.$$
Taking limits, we conclude that the directional derivative of $\phi_1$ in the direction $p_k$ is given by
$$D\bigl(\phi_1(x_k; \mu); p_k\bigr) = \nabla f_k^T p_k - \mu \|c_k\|_1, \qquad (18.31)$$
which proves (18.29). The fact that $p_k$ satisfies the first equation in (18.9) implies that
$$D\bigl(\phi_1(x_k; \mu); p_k\bigr) = -p_k^T \nabla^2_{xx} \mathcal{L}_k p_k + p_k^T A_k^T \lambda_{k+1} - \mu \|c_k\|_1.$$
From the second equation in (18.9), we can replace the term $p_k^T A_k^T \lambda_{k+1}$ in this expression by $-c_k^T \lambda_{k+1}$. By making this substitution and invoking the inequality
$$-c_k^T \lambda_{k+1} \le \|c_k\|_1 \|\lambda_{k+1}\|_\infty,$$
we obtain (18.30).
It follows from (18.30) that $p_k$ will be a descent direction for $\phi_1$ if $p_k \ne 0$, $\nabla^2_{xx} \mathcal{L}_k$ is positive definite, and
$$\mu > \|\lambda_{k+1}\|_\infty. \qquad (18.32)$$
A more detailed analysis shows that this assumption on $\nabla^2_{xx} \mathcal{L}_k$ can be relaxed; we need only the reduced Hessian $Z_k^T \nabla^2_{xx} \mathcal{L}_k Z_k$ to be positive definite.
One strategy for choosing the new value of the penalty parameter $\mu$ in $\phi_1(x; \mu)$ at every iteration is to increase the previous value, if necessary, so as to satisfy (18.32) with some margin. It has been observed, however, that this strategy may select inappropriate values of $\mu$ and often interferes with the progress of the iteration.
An alternative approach, based on (18.29), is to require that the directional derivative be sufficiently negative in the sense that
$$D\bigl(\phi_1(x_k; \mu); p_k\bigr) = \nabla f_k^T p_k - \mu \|c_k\|_1 \le -\rho \mu \|c_k\|_1, \quad \text{for some } \rho \in (0, 1).$$
This inequality holds if
$$\mu \ge \frac{\nabla f_k^T p_k}{(1 - \rho) \|c_k\|_1}. \qquad (18.33)$$
This choice of $\mu$ is not dependent on the Lagrange multipliers and performs adequately in practice.
A more effective strategy for choosing $\mu$, which is appropriate both in the line search and trust-region contexts, considers the effect of the step on a model of the merit function. We define a piecewise quadratic model of $\phi_1$ by
$$q_\mu(p) = f_k + \nabla f_k^T p + \frac{\sigma}{2} p^T \nabla^2_{xx} \mathcal{L}_k p + \mu m(p), \qquad (18.34)$$
where
$$m(p) = \|c_k + A_k p\|_1$$
and $\sigma$ is a parameter to be defined below. After computing a step $p_k$, we choose the penalty parameter $\mu$ large enough that
$$q_\mu(0) - q_\mu(p_k) \ge \rho \mu \bigl[ m(0) - m(p_k) \bigr], \qquad (18.35)$$
for some parameter $\rho \in (0, 1)$. It follows from (18.34) and (18.7b) that inequality (18.35) is satisfied for
$$\mu \ge \frac{\nabla f_k^T p_k + (\sigma/2) p_k^T \nabla^2_{xx} \mathcal{L}_k p_k}{(1 - \rho) \|c_k\|_1}. \qquad (18.36)$$
If the value of $\mu$ from the previous iteration of the SQP method satisfies (18.36), it is left unchanged. Otherwise, $\mu$ is increased so that it satisfies this inequality with some margin.
The constant $\sigma$ is used to handle the case in which the Hessian $\nabla^2_{xx} \mathcal{L}_k$ is not positive definite. We define $\sigma$ as
$$\sigma = \begin{cases} 1 & \text{if } p_k^T \nabla^2_{xx} \mathcal{L}_k p_k > 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (18.37)$$
It is easy to verify that, if $\mu$ satisfies (18.36), this choice of $\sigma$ ensures that $D(\phi_1(x_k; \mu); p_k) \le -\rho \mu \|c_k\|_1$, so that $p_k$ is a descent direction for the merit function $\phi_1$. This conclusion is not always valid if $\sigma = 1$ and $p_k^T \nabla^2_{xx} \mathcal{L}_k p_k < 0$. By comparing (18.33) and (18.36) we see that, when $\sigma \ne 0$, the strategy based on (18.35) selects a larger penalty parameter, thus placing more weight on the reduction of the constraints. This property is advantageous if the step $p_k$ decreases the constraints but increases the objective, for in this case the step has a better chance of being accepted by the merit function.
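In code, the rule based on (18.36)-(18.37) amounts to a few lines. The sketch below is our own illustration; the margin constant is illustrative, not prescribed, and we assume $\|c_k\|_1 > 0$:

    import numpy as np

    def update_penalty(mu, grad_f, p, H, c, rho=0.5, margin=1.0):
        # Enforce (18.36), with sigma chosen by (18.37).
        pHp = p @ H @ p
        sigma = 1.0 if pHp > 0 else 0.0
        mu_min = (grad_f @ p + 0.5 * sigma * pHp) / ((1.0 - rho) * np.linalg.norm(c, 1))
        return mu if mu >= mu_min else mu_min + margin   # increase with some margin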
SECOND-ORDER CORRECTION
In Chapter 15, we showed by means of Example 15.4 that many merit functions can impede progress of an optimization algorithm, a phenomenon known as the Maratos effect. We now show that the step analyzed in that example is, in fact, produced by an SQP method.
EXAMPLE 18.1 (EXAMPLE 15.4, REVISITED)
Consider problem (15.34). At the iterate $x_k = (\cos\theta, \sin\theta)^T$, let us compute a search direction $p_k$ by solving the SQP subproblem (18.7) with $\nabla^2_{xx} \mathcal{L}_k$ replaced by $\nabla^2_{xx} \mathcal{L}(x^*, \lambda^*) = I$. Since
$$f_k = -\cos\theta, \qquad \nabla f_k = \begin{bmatrix} 4\cos\theta - 1 \\ 4\sin\theta \end{bmatrix}, \qquad A_k^T = \begin{bmatrix} 2\cos\theta \\ 2\sin\theta \end{bmatrix},$$
the quadratic subproblem (18.7) takes the form
$$\min_p \; (4\cos\theta - 1) p_1 + 4\sin\theta\, p_2 + \tfrac12 p_1^2 + \tfrac12 p_2^2$$
$$\text{subject to} \quad p_2 = -(\cot\theta)\, p_1.$$
By solving this subproblem, we obtain the direction
$$p_k = \begin{bmatrix} \sin^2\theta \\ -\sin\theta \cos\theta \end{bmatrix}, \qquad (18.38)$$
which coincides with (15.35).
We mentioned in Section 15.4 that the difficulties associated with the Maratos effect can be overcome by means of a second-order correction. There are various ways of applying this technique; we describe one possible implementation next.
Suppose that the SQP method has computed a step $p_k$ from (18.11). If this step yields an increase in the merit function $\phi_1$, a possible cause is that our linear approximations to the constraints are not sufficiently accurate. To overcome this deficiency, we could re-solve (18.11) with the linear terms $c_i(x_k) + \nabla c_i(x_k)^T p$ replaced by quadratic approximations,
$$c_i(x_k) + \nabla c_i(x_k)^T p + \tfrac12 p^T \nabla^2 c_i(x_k) p. \qquad (18.39)$$
However, even if the Hessians of the constraints are individually available, the resulting quadratically constrained subproblem may be too difficult to solve. Instead, we evaluate the constraint values at the new point $x_k + p_k$ and make use of the following approximations. By Taylor's theorem, we have
$$c_i(x_k + p_k) \approx c_i(x_k) + \nabla c_i(x_k)^T p_k + \tfrac12 p_k^T \nabla^2 c_i(x_k) p_k. \qquad (18.40)$$
Assuming that the still-unknown second-order step $p$ will not be too different from $p_k$, we can approximate the last term in (18.39) as follows:
$$p^T \nabla^2 c_i(x_k) p \approx p_k^T \nabla^2 c_i(x_k) p_k. \qquad (18.41)$$
By making this substitution in (18.39) and using (18.40), we obtain the second-order correction subproblem
$$\min_p \; \nabla f_k^T p + \tfrac12 p^T \nabla^2_{xx} \mathcal{L}_k p$$
$$\text{subject to} \quad \nabla c_i(x_k)^T p + d_i = 0, \; i \in \mathcal{E}, \qquad \nabla c_i(x_k)^T p + d_i \ge 0, \; i \in \mathcal{I},$$
where
$$d_i = c_i(x_k + p_k) - \nabla c_i(x_k)^T p_k, \quad i \in \mathcal{E} \cup \mathcal{I}.$$
The second-order correction step requires evaluation of the constraints $c_i(x_k + p_k)$ for $i \in \mathcal{E} \cup \mathcal{I}$, and therefore it is preferable not to apply it every time the merit function increases. One strategy is to use it only if the increase in the merit function is accompanied by an increase in the constraint norm.
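In vector form, the corrected right-hand sides require only one additional constraint evaluation; a sketch (our own, with `A` the constraint Jacobian at $x_k$ and `c` a callable returning the constraint values):

    def correction_rhs(c, A, x, p):
        # d = c(x_k + p_k) - A(x_k) p_k, the right-hand sides of the
        # second-order correction subproblem.
        return c(x + p) - A @ p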
It can be shown that when the step $p_k$ is generated by the SQP method (18.11) then, near a solution satisfying second-order sufficient conditions, the algorithm above takes either the full step $p_k$ or the corrected step $p_k + \hat{p}_k$, where $\hat{p}_k$ denotes the second-order correction. The merit function does not interfere with the iteration, so superlinear convergence is attained, as in the local algorithm.
18.4 A PRACTICAL LINE SEARCH SQP METHOD
From the discussion in the previous section, we can see that there is a wide variety of line search SQP methods that differ in the way the Hessian approximation is computed, in the step acceptance mechanism, and in other algorithmic features. We now incorporate some of these ideas into a concrete, practical SQP algorithm for solving the nonlinear programming problem (18.10). To keep the description simple, we will not include a mechanism such as (18.12) to ensure the feasibility of the subproblem, or a second-order correction step. Rather, the search direction is obtained simply by solving the subproblem (18.11). We also assume that the quadratic program (18.11) is convex, so that we can solve it by means of the active-set method for quadratic programming (Algorithm 16.3) described in Chapter 16.
Algorithm 18.3 (Line Search SQP Algorithm).
Choose parameters $\eta \in (0, 0.5)$, $\tau \in (0, 1)$, and an initial pair $(x_0, \lambda_0)$;
Evaluate $f_0$, $\nabla f_0$, $c_0$, $A_0$;
If a quasi-Newton approximation is used, choose an initial $n \times n$ symmetric positive definite Hessian approximation $B_0$, otherwise compute $\nabla^2_{xx} \mathcal{L}_0$;
repeat until a convergence test is satisfied
    Compute $p_k$ by solving (18.11); let $\hat{\lambda}$ be the corresponding multiplier;
    Set $p_\lambda \leftarrow \hat{\lambda} - \lambda_k$;
    Choose $\mu_k$ to satisfy (18.36) with $\sigma = 1$;
    Set $\alpha_k \leftarrow 1$;
    while $\phi_1(x_k + \alpha_k p_k; \mu_k) > \phi_1(x_k; \mu_k) + \eta \alpha_k D\bigl(\phi_1(x_k; \mu_k); p_k\bigr)$
        Reset $\alpha_k \leftarrow \tau_\alpha \alpha_k$ for some $\tau_\alpha \in (0, \tau]$;
    end (while)
    Set $x_{k+1} \leftarrow x_k + \alpha_k p_k$ and $\lambda_{k+1} \leftarrow \lambda_k + \alpha_k p_\lambda$;
    Evaluate $f_{k+1}$, $\nabla f_{k+1}$, $c_{k+1}$, $A_{k+1}$ (and possibly $\nabla^2_{xx} \mathcal{L}_{k+1}$);
    If a quasi-Newton approximation is used, set
        $s_k \leftarrow \alpha_k p_k$ and $y_k \leftarrow \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1})$,
        and obtain $B_{k+1}$ by updating $B_k$ using a quasi-Newton formula;
end (repeat)
We can achieve significant savings in the solution of the quadratic subproblem by warm-start procedures. For example, we can initialize the working set for each QP subproblem to be the final active set from the previous SQP iteration.
We have not given particulars of the quasi-Newton approximation in Algorithm 18.3. We could use, for example, a limited-memory BFGS approach that is suitable for large-scale problems. If we use an exact Hessian $\nabla^2_{xx} \mathcal{L}_k$, we assume that it is modified as necessary to be positive definite on the null space of the equality constraints.
Instead of a merit function, we could employ a filter (see Section 15.4) in the inner "while" loop to determine the steplength $\alpha_k$. As discussed in Section 15.4, a feasibility restoration phase is invoked if a trial steplength generated by the backtracking line search is
smaller than a given threshold. Regardless of whether a merit function or a filter is used, a mechanism such as the second-order correction can be incorporated to overcome the Maratos effect.
18.5 TRUSTREGION SQP METHODS
Trust-region SQP methods have several attractive properties. Among them are the facts that they do not require the Hessian matrix $\nabla^2_{xx} \mathcal{L}_k$ in (18.11) to be positive definite, they control the quality of the steps even in the presence of Hessian and Jacobian singularities, and they provide a mechanism for enforcing global convergence. Some implementations follow an IQP approach and solve an inequality-constrained subproblem, while others follow an EQP approach.
The simplest way to formulate a trust-region SQP method is to add a trust-region constraint to subproblem (18.11), as follows:
$$\min_p \; f_k + \nabla f_k^T p + \tfrac12 p^T \nabla^2_{xx} \mathcal{L}_k p \qquad (18.43a)$$
$$\text{subject to} \quad \nabla c_i(x_k)^T p + c_i(x_k) = 0, \; i \in \mathcal{E}, \qquad (18.43b)$$
$$\nabla c_i(x_k)^T p + c_i(x_k) \ge 0, \; i \in \mathcal{I}, \qquad (18.43c)$$
$$\|p\| \le \Delta_k. \qquad (18.43d)$$
Even if the constraints (18.43b), (18.43c) are compatible, this problem may not always have a solution because of the trust-region constraint (18.43d). We illustrate this fact in Figure 18.1 for a problem that contains only one equality constraint whose linearization is represented by the solid line. In this example, any step $p$ that satisfies the linearized constraint must lie outside the trust region, which is indicated by the circle of radius $\Delta_k$. As we see from this example, a consistent system of equalities and inequalities may not have a solution if we restrict the norm of the solution.
To resolve the possible conflict between the linear constraints (18.43b), (18.43c) and the trust-region constraint (18.43d), it is not appropriate simply to increase $\Delta_k$ until the set of steps $p$ satisfying the linear constraints intersects the trust region. This approach would defeat the purpose of using the trust region in the first place as a way to define a region within which we trust the model (18.43a)-(18.43c) to accurately reflect the behavior of the objective and constraint functions. Analytically, it would harm the convergence properties of the algorithm.
A more appropriate viewpoint is that there is no reason to satisfy the linearized constraints exactly at every step; rather, we should aim to improve the feasibility of these constraints at each step and to satisfy them exactly only if the trust-region constraint permits it. This point of view is the basis of the three classes of methods discussed in this section: relaxation methods, penalty methods, and filter methods.
[Figure 18.1 shows, in the $(p_1, p_2)$ plane, the linearized constraint $A_k p + c_k = 0$ lying entirely outside the trust-region circle of radius $\Delta_k$.]

Figure 18.1 Inconsistent constraints in trust-region model.
A RELAXATION METHOD FOR EQUALITY-CONSTRAINED OPTIMIZATION
We describe this method in the context of the equality-constrained optimization problem (18.1); its extension to general nonlinear programs is deferred to Chapter 19 because it makes use of interior-point techniques. Active-set extensions of the relaxation approach have been proposed, but have not been fully explored.
At the iterate $x_k$, we compute the SQP step by solving the subproblem
$$\min_p \; f_k + \nabla f_k^T p + \tfrac12 p^T \nabla^2_{xx} \mathcal{L}_k p \qquad (18.44a)$$
$$\text{subject to} \quad A_k p + c_k = r_k, \qquad (18.44b)$$
$$\|p\|_2 \le \Delta_k. \qquad (18.44c)$$
The choice of the relaxation vector $r_k$ requires careful consideration, as it impacts the efficiency of the method. Our goal is to choose $r_k$ as the smallest vector such that (18.44b), (18.44c) are consistent for some reduced value of the trust-region radius $\Delta_k$. To do so, we first solve the subproblem
$$\min_v \; \|A_k v + c_k\|_2^2 \qquad (18.45a)$$
$$\text{subject to} \quad \|v\|_2 \le 0.8 \Delta_k. \qquad (18.45b)$$
Denoting the solution of this subproblem by $v_k$, we define
$$r_k = A_k v_k + c_k. \qquad (18.46)$$
We now compute the step $p_k$ by solving (18.44), define the new iterate $x_{k+1} = x_k + p_k$, and obtain new multiplier estimates $\lambda_{k+1}$ using the least-squares formula (18.21). Note that the constraints (18.44b), (18.44c) are consistent because they are satisfied by the vector $p = v_k$.
At first glance, this approach appears to be impractical because problems (18.44) and (18.45) are not particularly easy to solve, especially when $\nabla^2_{xx} \mathcal{L}_k$ is indefinite. Fortunately, we can design efficient procedures for computing useful inexact solutions of these problems.
We solve the auxiliary subproblem (18.45) by the dogleg method described in Chapter 4. This method requires a Cauchy step $p^{\mathrm{U}}$, which is the minimizer of the objective (18.45a) along the direction $-A_k^T c_k$, and a "Newton step" $p^{\mathrm{B}}$, which is the unconstrained minimizer of (18.45a). Since the Hessian in (18.45a) is singular, there are infinitely many possible choices of $p^{\mathrm{B}}$, all of which satisfy $A_k p^{\mathrm{B}} + c_k = 0$. We choose the one with smallest Euclidean norm by setting
$$p^{\mathrm{B}} = -A_k^T \bigl( A_k A_k^T \bigr)^{-1} c_k.$$
We now take $v_k$ to be the minimizer of (18.45a) along the path defined by $p^{\mathrm{U}}$, $p^{\mathrm{B}}$, and the formula (4.16).
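A dense-linear-algebra sketch of this normal-step computation (our own illustration; it assumes $A_k$ has full row rank and $c_k \ne 0$):

    import numpy as np

    def normal_step(A, c, delta):
        # Approximately minimize ||A v + c||^2 over ||v||_2 <= 0.8*delta, as in (18.45).
        radius = 0.8 * delta
        g = A.T @ c                                   # gradient of 0.5*||A v + c||^2 at v = 0
        pU = -(g @ g) / np.linalg.norm(A @ g)**2 * g  # Cauchy step along -A^T c
        pB = -A.T @ np.linalg.solve(A @ A.T, c)       # least-norm solution of A v + c = 0
        if np.linalg.norm(pB) <= radius:
            return pB
        if np.linalg.norm(pU) >= radius:
            return (radius / np.linalg.norm(pU)) * pU
        # Dogleg: move from pU toward pB until the boundary is reached (cf. (4.16)).
        d = pB - pU
        a, b, q = d @ d, 2.0 * (pU @ d), pU @ pU - radius**2
        tau = (-b + np.sqrt(b * b - 4.0 * a * q)) / (2.0 * a)
        return pU + tau * d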
The preferred technique for computing an approximate solution $p_k$ of (18.44) is the projected conjugate gradient method of Algorithm 16.2. We apply this algorithm to the equality-constrained quadratic program (18.44a)–(18.44b), monitoring satisfaction of the trust-region constraint (18.44c) and stopping if the boundary of this region is reached or if negative curvature is detected; see Section 7.1. Algorithm 16.2 requires a feasible starting point, which may be chosen as $v_k$.
A merit function that fits well with this approach is the nonsmooth $\ell_2$ function $\phi_2(x;\mu) = f(x) + \mu \|c(x)\|_2$. We model it by means of the function

$$q_\mu(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu\, m(p), \tag{18.47}$$

where

$$m(p) = \|c_k + A_k p\|_2;$$

see (18.34). We choose the penalty parameter $\mu$ large enough that inequality (18.35) is satisfied. To judge the acceptability of a step $p_k$, we monitor the ratio

$$\rho_k = \frac{\mathrm{ared}_k}{\mathrm{pred}_k} = \frac{\phi_2(x_k, \mu) - \phi_2(x_k + p_k, \mu)}{q_\mu(0) - q_\mu(p_k)}. \tag{18.48}$$
We can now give a description of this trust-region SQP method for the equality-constrained optimization problem (18.1).
Algorithm 18.4 (Byrd–Omojokun Trust-Region SQP Method).
Choose constants $\epsilon > 0$ and $\eta, \gamma \in (0, 1)$;
Choose starting point $x_0$, initial trust-region radius $\Delta_0 > 0$;
for $k = 0, 1, 2, \ldots$
  Compute $f_k$, $c_k$, $\nabla f_k$, $A_k$;
  Compute multiplier estimates $\lambda_k$ by (18.21);
  if $\|\nabla f_k - A_k^T \lambda_k\|_\infty < \epsilon$ and $\|c_k\|_\infty < \epsilon$
    stop with approximate solution $x_k$;
  Solve normal subproblem (18.45) for $v_k$ and compute $r_k$ from (18.46);
  Compute $\nabla^2_{xx}\mathcal{L}_k$ or a quasi-Newton approximation;
  Compute $p_k$ by applying the projected CG method to (18.44);
  Choose $\mu_k$ to satisfy (18.35);
  Compute $\rho_k = \mathrm{ared}_k / \mathrm{pred}_k$;
  if $\rho_k > \eta$
    Set $x_{k+1} = x_k + p_k$;
    Choose $\Delta_{k+1}$ to satisfy $\Delta_{k+1} \ge \Delta_k$;
  else
    Set $x_{k+1} = x_k$;
    Choose $\Delta_{k+1}$ to satisfy $\Delta_{k+1} \le \gamma \|p_k\|$;
end (for)
A second-order correction can be added to avoid the Maratos effect. Beyond the cost of evaluating the objective function f and constraints c, the main costs of this algorithm lie in the projected CG iteration, which requires products of the Hessian $\nabla^2_{xx}\mathcal{L}_k$ with vectors, and in the factorization and backsolves with the projection matrix (16.32); see Section 16.3.
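To make the structure of Algorithm 18.4 concrete, here is a hedged Python skeleton. The callbacks f, c, grad_f, jac_c, hess_lag, and solve_tangential (a stand-in for the projected CG solver applied to (18.44)) are hypothetical, the penalty update is a crude stand-in for condition (18.35), and normal_step_dogleg refers to the sketch above; this is an illustration, not a robust solver.

```python
import numpy as np

def byrd_omojokun_sqp(x, f, c, grad_f, jac_c, hess_lag, solve_tangential,
                      radius=1.0, eps=1e-6, eta=1e-4, gamma=0.5, max_iter=100):
    """Skeleton of Algorithm 18.4 under stated assumptions."""
    def phi2(x, mu):                      # nonsmooth l2 merit function (see (18.47))
        return f(x) + mu * np.linalg.norm(c(x))

    for _ in range(max_iter):
        g, A, ck = grad_f(x), jac_c(x), c(x)
        lam = np.linalg.lstsq(A.T, g, rcond=None)[0]    # least-squares multipliers (18.21)
        if max(np.linalg.norm(g - A.T @ lam, np.inf),
               np.linalg.norm(ck, np.inf)) < eps:
            return x                                    # approximate KKT point
        v = normal_step_dogleg(A, ck, radius)           # normal step from (18.45)
        W = hess_lag(x, lam)
        p = solve_tangential(W, g, A, v, radius)        # step of (18.44), feasible start v
        # Crude penalty heuristic standing in for the (18.35)-based update:
        infeas_drop = max(np.linalg.norm(ck) - np.linalg.norm(ck + A @ p), 1e-12)
        mu = max(1.0, abs(g @ p + 0.5 * p @ W @ p) / infeas_drop)
        pred = -(g @ p + 0.5 * p @ W @ p) + mu * infeas_drop
        ared = phi2(x, mu) - phi2(x + p, mu)
        if pred > 0 and ared / pred > eta:
            x, radius = x + p, max(radius, 2 * np.linalg.norm(p))  # accept; maybe enlarge
        else:
            radius = gamma * np.linalg.norm(p)                     # reject; shrink
    return x
```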
S$\ell_1$QP (SEQUENTIAL $\ell_1$ QUADRATIC PROGRAMMING)
In this approach we move the linearized constraints (18.43b), (18.43c) into the objective of the quadratic program, in the form of an $\ell_1$ penalty term, to obtain the following subproblem:

$$\min_p \; q_\mu(p) \stackrel{\mathrm{def}}{=} f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu \sum_{i\in\mathcal{E}} |c_i(x_k) + \nabla c_i(x_k)^T p| + \mu \sum_{i\in\mathcal{I}} [c_i(x_k) + \nabla c_i(x_k)^T p]^- \tag{18.49}$$
$$\text{subject to} \quad \|p\|_\infty \le \Delta_k,$$
for some penalty parameter $\mu > 0$, where we use the notation $[y]^- = \max\{0, -y\}$. Introducing slack variables v, w, t, we can reformulate this problem as follows:

$$\min_{p,v,w,t} \; f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu \sum_{i\in\mathcal{E}} (v_i + w_i) + \mu \sum_{i\in\mathcal{I}} t_i \tag{18.50a}$$
$$\text{s.t.} \quad \nabla c_i(x_k)^T p + c_i(x_k) = v_i - w_i, \quad i \in \mathcal{E}, \tag{18.50b}$$
$$\nabla c_i(x_k)^T p + c_i(x_k) \ge -t_i, \quad i \in \mathcal{I}, \tag{18.50c}$$
$$v, w, t \ge 0, \tag{18.50d}$$
$$\|p\|_\infty \le \Delta_k. \tag{18.50e}$$
This formulation is simply a linearization of the elastic-mode formulation (18.12) with the addition of a trust-region constraint.
The constraints of this problem are always consistent. Since the trust region has been defined using the $\ell_\infty$ norm, (18.50) is a smooth quadratic program that can be solved by means of a quadratic programming algorithm. Warm-start strategies can significantly reduce the solution time of (18.50) and are invariably used in practical implementations.
It is natural to use the $\ell_1$ merit function

$$\phi_1(x;\mu) = f(x) + \mu \sum_{i\in\mathcal{E}} |c_i(x)| + \mu \sum_{i\in\mathcal{I}} [c_i(x)]^- \tag{18.51}$$
to determine step acceptance. In fact, the function $q_\mu$ defined in (18.49) can be viewed as a model of $\phi_1(x;\mu)$ at $x_k$ in which we approximate each constraint function $c_i$ by its linearization, and replace f by a quadratic function whose curvature term includes information from both objective and constraints.
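Since (18.51) and (18.49) are just sums of absolute values and one-sided violations, they are straightforward to evaluate. A small illustrative sketch (the names are ours, not the book's):

```python
import numpy as np

def phi1(fx, cE, cI, mu):
    """l1 merit function (18.51): f + mu*sum|c_E| + mu*sum max(0, -c_I)."""
    return fx + mu * (np.sum(np.abs(cE)) + np.sum(np.maximum(0.0, -cI)))

def q_model(p, fk, gk, W, cE, JE, cI, JI, mu):
    """Model q_mu(p) of (18.49): quadratic in f plus the linearized l1 penalty."""
    penalty = (np.sum(np.abs(cE + JE @ p)) +
               np.sum(np.maximum(0.0, -(cI + JI @ p))))
    return fk + gk @ p + 0.5 * p @ W @ p + mu * penalty
```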
After computing the step $p_k$ from (18.50), we determine the ratio $\rho_k$ via (18.48), using the merit function $\phi_1$ and defining $q_\mu$ by (18.49). The step is accepted or rejected according to standard trust-region rules, as implemented in Algorithm 18.4. A second-order correction step can be added to prevent the occurrence of the Maratos effect.
The S$\ell_1$QP approach has several attractive properties. Not only does the formulation (18.49) overcome the possible inconsistency among the linearized constraints, but it also ensures that the trust-region constraint can always be satisfied. Further, the matrix $\nabla^2_{xx}\mathcal{L}_k$ can be used without modification in subproblem (18.50), or else can be replaced by a quasi-Newton approximation. There is no requirement for it to be positive definite.
The choice of the penalty parameter $\mu$ plays an important role in the efficiency of this method. Unlike the SQP methods described above, which use a penalty function only to determine the acceptability of a trial point, the step $p_k$ of the S$\ell_1$QP algorithm depends on $\mu$. Values of $\mu$ that are too small can lead the algorithm away from the solution (Section 17.2), while excessively large values can result in slow progress. To obtain good practical performance over a range of applications, the value of $\mu$ must be chosen carefully at each iteration; see Algorithm 18.5 below.
SEQUENTIAL LINEAR-QUADRATIC PROGRAMMING (SLQP)
The SQP methods discussed above require the solution of a general inequality-constrained quadratic problem at each iteration. The cost of solving this subproblem imposes a limit on the size of problems that can be solved in practice. In addition, the incorporation of indefinite second derivative information in SQP methods has proved to be difficult [147].
The sequential linear-quadratic programming (SLQP) method attempts to overcome these concerns by computing the step in two stages, each of which scales well with the number of variables. First, a linear program (LP) is solved to identify a working set $\mathcal{W}$. Second, there is an equality-constrained quadratic programming (EQP) phase in which the constraints in the working set $\mathcal{W}$ are imposed as equalities. The total step of the algorithm is a combination of the steps obtained in the linear programming and equality-constrained phases, as we now discuss.
In the LP phase, we would like to solve the problem

$$\min_p \; f_k + \nabla f_k^T p \tag{18.52a}$$
$$\text{subject to} \quad c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{E}, \tag{18.52b}$$
$$c_i(x_k) + \nabla c_i(x_k)^T p \ge 0, \quad i \in \mathcal{I}, \tag{18.52c}$$
$$\|p\|_\infty \le \Delta_k^{\mathrm{LP}}, \tag{18.52d}$$

which differs from the standard SQP subproblem (18.43) only in that the second-order term in the objective has been omitted and that an $\ell_\infty$ norm is used to define the trust region. Since the constraints of (18.52) may be inconsistent, we solve instead the $\ell_1$ penalty reformulation of (18.52), defined by

$$\min_p \; l_\mu(p) \stackrel{\mathrm{def}}{=} f_k + \nabla f_k^T p + \mu \sum_{i\in\mathcal{E}} |c_i(x_k) + \nabla c_i(x_k)^T p| + \mu \sum_{i\in\mathcal{I}} [c_i(x_k) + \nabla c_i(x_k)^T p]^- \tag{18.53a}$$
$$\text{subject to} \quad \|p\|_\infty \le \Delta_k^{\mathrm{LP}}. \tag{18.53b}$$

By introducing slack variables as in (18.50), we can reformulate (18.53) as an LP. The solution of (18.53), which we denote by $p^{\mathrm{LP}}$, is computed by the simplex method (Chapter 13). From this solution we obtain the following explicit estimate of the optimal active set:

$$\mathcal{A}_k(p^{\mathrm{LP}}) = \{ i \in \mathcal{E} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{\mathrm{LP}} = 0 \} \cup \{ i \in \mathcal{I} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{\mathrm{LP}} = 0 \}.$$
Likewise, we define the set $\mathcal{V}_k$ of violated constraints as

$$\mathcal{V}_k(p^{\mathrm{LP}}) = \{ i \in \mathcal{E} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{\mathrm{LP}} \ne 0 \} \cup \{ i \in \mathcal{I} \mid c_i(x_k) + \nabla c_i(x_k)^T p^{\mathrm{LP}} < 0 \}.$$
We define the working set $\mathcal{W}_k$ as some linearly independent subset of the active set $\mathcal{A}_k(p^{\mathrm{LP}})$. To ensure that the algorithm makes progress on the penalty function $\phi_1$, we define the Cauchy step

$$p^C = \alpha^{\mathrm{LP}} p^{\mathrm{LP}}, \tag{18.54}$$

where $\alpha^{\mathrm{LP}} \in (0, 1]$ is a step length that provides sufficient decrease in the model $q_\mu$ defined in (18.49).
Given the working set $\mathcal{W}_k$, we now solve an equality-constrained quadratic program (EQP), treating the constraints in $\mathcal{W}_k$ as equalities and ignoring all others. We thus obtain the subproblem

$$\min_p \; f_k + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \nabla f_k^T p + \mu_k \sum_{i\in\mathcal{V}_k} \gamma_i \nabla c_i(x_k)^T p \tag{18.55a}$$
$$\text{subject to} \quad c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{E} \cap \mathcal{W}_k, \tag{18.55b}$$
$$c_i(x_k) + \nabla c_i(x_k)^T p = 0, \quad i \in \mathcal{I} \cap \mathcal{W}_k, \tag{18.55c}$$
$$\|p\|_2 \le \Delta_k, \tag{18.55d}$$

where $\gamma_i$ is the algebraic sign of the i-th violated constraint. Note that the trust region (18.55d) is spherical, and that $\Delta_k$ is distinct from the trust-region radius $\Delta_k^{\mathrm{LP}}$ used in (18.53b). Problem (18.55) is solved for the vector $p^Q$ by applying the projected conjugate gradient procedure of Algorithm 16.2, handling the trust-region constraint by Steihaug's strategy (Algorithm 7.2). The total step $p_k$ of the SLQP method is given by

$$p_k = p^C + \alpha^Q (p^Q - p^C),$$

where $\alpha^Q \in [0, 1]$ is a step length that approximately minimizes the model $q_\mu$ defined in (18.49).
The trust-region radius $\Delta_k$ for the EQP phase is updated using standard trust-region update strategies. The choice of the radius $\Delta_{k+1}^{\mathrm{LP}}$ for the LP phase is more delicate, since it influences our guess of the optimal active set. The value of $\Delta_{k+1}^{\mathrm{LP}}$ should be set to be a little larger than the total step $p_k$, subject to some other restrictions [49]. The multiplier estimates $\lambda_k$ used in the Hessian $\nabla^2_{xx}\mathcal{L}_k$ are least squares estimates (18.21) using the working set $\mathcal{W}_k$, and modified so that $\lambda_i \ge 0$ for $i \in \mathcal{I}$.
An appealing feature of the SLQP algorithm is that established techniques for solving large-scale versions of the LP and EQP subproblems are readily available. High-quality LP software is capable of solving problems with very large numbers of variables and constraints, while the solution of the EQP subproblem can be performed efficiently using the projected conjugate gradient method.
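As an illustration of how the LP phase (18.53) can be posed in the slack form of (18.50) and handed to off-the-shelf LP software, here is a hedged sketch using scipy.optimize.linprog; the function name, the tolerance, and the active-set test are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def slqp_lp_phase(g, cE, JE, cI, JI, mu, radius):
    """LP phase (18.53) as a sketch: variables z = (p, v, w, t), with the
    equality linearizations split as v - w and inequality violations t >= 0.
    Returns p_LP and index sets estimating the optimal active set."""
    n, nE, nI = g.size, cE.size, cI.size
    cost = np.concatenate([g, mu * np.ones(2 * nE + nI)])
    # Equalities: JE p - v + w = -c_E.
    A_eq = np.hstack([JE, -np.eye(nE), np.eye(nE), np.zeros((nE, nI))])
    # Inequalities: -(JI p) - t <= c_I   (from JI p + c_I >= -t).
    A_ub = np.hstack([-JI, np.zeros((nI, 2 * nE)), -np.eye(nI)])
    bounds = [(-radius, radius)] * n + [(0, None)] * (2 * nE + nI)
    res = linprog(cost, A_ub=A_ub, b_ub=cI, A_eq=A_eq, b_eq=-cE,
                  bounds=bounds, method="highs")
    p = res.x[:n]
    active_E = np.where(np.abs(cE + JE @ p) < 1e-8)[0]
    active_I = np.where(np.abs(cI + JI @ p) < 1e-8)[0]
    return p, active_E, active_I
```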
A TECHNIQUE FOR UPDATING THE PENALTY PARAMETER

We have mentioned that penalty methods such as S$\ell_1$QP and SLQP can be sensitive to the choice of the penalty parameter $\mu$. We now discuss a procedure for choosing $\mu$ that has proved to be effective in practice and is supported by global convergence guarantees. The goal is to choose $\mu$ small enough to avoid an unnecessary imbalance in the merit function, but large enough to cause the step to make sufficient progress in linearized feasibility at each iteration. We present this procedure in the context of the S$\ell_1$QP method and then describe its extension to the SLQP approach.

We define a piecewise linear model of constraint violation at a point $x_k$ by

$$m_k(p) = \sum_{i\in\mathcal{E}} |c_i(x_k) + \nabla c_i(x_k)^T p| + \sum_{i\in\mathcal{I}} [c_i(x_k) + \nabla c_i(x_k)^T p]^-, \tag{18.56}$$

so that the objective of the SQP subproblem (18.49) can be written as

$$q_\mu(p) = f_k + \nabla f_k^T p + \tfrac{1}{2} p^T \nabla^2_{xx}\mathcal{L}_k p + \mu\, m_k(p). \tag{18.57}$$

We begin by solving the QP subproblem (18.49) (or, equivalently, (18.50)) using the previous value $\mu_{k-1}$ of the penalty parameter. If the constraints (18.50b), (18.50c) are satisfied with the slack variables $v_i, w_i, t_i$ all equal to zero (that is, $m_k(p_k) = 0$), then the current value of $\mu$ is adequate, and we set $\mu_k \leftarrow \mu_{k-1}$. This is the felicitous case in which we can achieve linearized feasibility with a step $p_k$ that is no larger in norm than the trust-region radius.

If $m_k(p_k) > 0$, on the other hand, it may be appropriate to increase the penalty parameter. The question is: by how much? To obtain a reference value, we re-solve the QP (18.49) using an infinite value of $\mu$, by which we mean that the objective function in (18.49) is replaced by $m_k(p)$. After computing the new step, which we denote by $p_\infty$, two outcomes are possible. If $m_k(p_\infty) = 0$, meaning that the linearized constraints are feasible within the trust region, we choose $\mu_k > \mu_{k-1}$ such that $m_k(p_k) = 0$. Otherwise, if $m_k(p_\infty) > 0$, we choose $\mu_k \ge \mu_{k-1}$ such that the reduction in $m_k$ caused by the step $p_k$ is at least a fraction of the optimal reduction given by $p_\infty$.

The selection of $\mu_k \ge \mu_{k-1}$ is achieved in all cases by successively increasing the current trial value of $\mu$ (by a factor of 10, say) and re-solving the quadratic program (18.49). To describe this strategy more precisely, we write the solution of the QP problem (18.49) as $p(\mu)$ to stress its dependence on the penalty parameter. Likewise, $p_\infty$ denotes the minimizer of $m_k(p)$ subject to the trust-region constraint (18.50e). The following algorithm describes the selection of the penalty parameter $\mu_k$ and the computation of the S$\ell_1$QP step $p_k$.
Algorithm 18.5 (Penalty Update and Step Computation).
Initial data: $x_k$, $\mu_{k-1} > 0$, $\Delta_k > 0$, and parameters $\epsilon_1, \epsilon_2 \in (0, 1)$.
Solve the subproblem (18.50) with $\mu = \mu_{k-1}$ to obtain $p(\mu_{k-1})$;
if $m_k(p(\mu_{k-1})) = 0$
  Set $\mu^+ \leftarrow \mu_{k-1}$;
else
  Compute $p_\infty$;
  if $m_k(p_\infty) = 0$
    Find $\mu^+ > \mu_{k-1}$ such that $m_k(p(\mu^+)) = 0$;
  else
    Find $\mu^+ \ge \mu_{k-1}$ such that
      $m_k(0) - m_k(p(\mu^+)) \ge \epsilon_1 [m_k(0) - m_k(p_\infty)]$;
  end (if)
end (if)
Increase $\mu^+$ if necessary to satisfy
  $q_{\mu^+}(0) - q_{\mu^+}(p(\mu^+)) \ge \epsilon_2 \mu^+ [m_k(0) - m_k(p(\mu^+))]$;
Set $\mu_k \leftarrow \mu^+$ and $p_k \leftarrow p(\mu^+)$.
Note that the inequality in the penultimate line is the same as condition (18.35). Although Algorithm 18.5 requires the solution of some additional quadratic programs, we hope to reduce the total number of iterations, and the total number of QP solves, by identifying an appropriate penalty parameter value more quickly than rules based on feasibility monitoring (see Framework 17.2).
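A compact sketch of Algorithm 18.5 in Python follows. All four callbacks are our assumptions standing in for pieces defined in the text: solve_qp(mu) returns $p(\mu)$ from (18.50), solve_qp(np.inf) minimizes $m_k$ alone over the trust region, m_k evaluates (18.56), and q_decrease(mu, p) returns $q_\mu(0) - q_\mu(p)$.

```python
import numpy as np

def update_penalty(solve_qp, m_k, q_decrease, mu_prev,
                   eps1=0.1, eps2=0.1, factor=10.0, mu_max=1e12):
    """Hedged sketch of Algorithm 18.5 under the stated callback conventions."""
    mu = mu_prev
    p = solve_qp(mu)
    if m_k(p) > 0.0:
        p_inf = solve_qp(np.inf)                 # best achievable linearized feasibility
        m0, m_best = m_k(np.zeros_like(p)), m_k(p_inf)
        while mu < mu_max:
            mu *= factor
            p = solve_qp(mu)
            if m_best == 0.0 and m_k(p) == 0.0:
                break                            # linearized feasibility achieved
            if m_best > 0.0 and m0 - m_k(p) >= eps1 * (m0 - m_best):
                break                            # sufficient fraction of best reduction
    # Penultimate line of Algorithm 18.5: same inequality as condition (18.35).
    while (q_decrease(mu, p) < eps2 * mu * (m_k(np.zeros_like(p)) - m_k(p))
           and mu < mu_max):
        mu *= factor
        p = solve_qp(mu)
    return mu, p
```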
Numerical experience indicates that these savings occur when an adaptation of Algorithm 18.5 is used in the SLQP method. This adaptation is obtained simply by setting $\nabla^2_{xx}\mathcal{L}_k = 0$ in the definition (18.49) of $q_\mu$ and applying Algorithm 18.5 to determine $\mu$ and to compute the LP step $p^{\mathrm{LP}}$. The extra LP solves required by Algorithm 18.5 in this case are typically inexpensive, requiring relatively few simplex iterations, because we can use warm-start information from LPs solved earlier, with different values of the penalty parameter.
18.6 NONLINEAR GRADIENT PROJECTION
In Section 16.7, we discussed the gradient projection method for bound-constrained quadratic programming. It is not difficult to extend this method to the problem

$$\min f(x) \quad \text{subject to} \quad l \le x \le u, \tag{18.58}$$

where f is a nonlinear function and l and u are vectors of lower and upper bounds, respectively.
We begin by describing a line search approach. At the current iterate $x_k$, we form the quadratic model

$$q_k(x) = f_k + \nabla f_k^T (x - x_k) + \tfrac{1}{2} (x - x_k)^T B_k (x - x_k), \tag{18.59}$$

where $B_k$ is a positive definite approximation to $\nabla^2 f(x_k)$. We then use the gradient projection method for quadratic programming (Algorithm 16.5) to find an approximate solution $\bar{x}$ of the subproblem

$$\min q_k(x) \quad \text{subject to} \quad l \le x \le u. \tag{18.60}$$

The search direction is defined as $p_k = \bar{x} - x_k$, and the new iterate is given by $x_{k+1} = x_k + \alpha_k p_k$, where the step length $\alpha_k$ is chosen to satisfy

$$f(x_k + \alpha_k p_k) \le f(x_k) + \eta \alpha_k \nabla f_k^T p_k$$

for some parameter $\eta \in (0, 1)$.
To see that the search direction $p_k$ is indeed a descent direction for the objective function, we use the properties of Algorithm 16.5, as discussed in Section 16.7. Recall that this method searches along a piecewise linear path (the projected steepest descent path) for the Cauchy point $x^c$, which minimizes $q_k$ along this path. It then identifies the components of x that are at their bounds and holds these components constant while performing an unconstrained minimization of $q_k$ over the remaining components to obtain the approximate solution $\bar{x}$ of the subproblem (18.60).
The Cauchy point $x^c$ satisfies $q_k(x^c) < q_k(x_k)$ if the projected gradient is nonzero. Since Algorithm 16.5 produces a subproblem solution $\bar{x}$ with $q_k(\bar{x}) \le q_k(x^c)$, we have

$$f_k = q_k(x_k) > q_k(x^c) \ge q_k(\bar{x}) = f_k + \nabla f_k^T p_k + \tfrac{1}{2} p_k^T B_k p_k.$$

This inequality implies that $\nabla f_k^T p_k < 0$, since $B_k$ is assumed to be positive definite.
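The line search variant just analyzed can be sketched as follows; solve_qp_box is a hypothetical stand-in for Algorithm 16.5 applied to (18.59)–(18.60), and the backtracking loop enforces the sufficient decrease condition stated above.

```python
import numpy as np

def gradient_projection_step(x, f, grad_f, B, l, u, solve_qp_box, eta=1e-4):
    """One iteration of the line-search nonlinear gradient projection
    method; a hedged sketch with an assumed box-QP solver callback."""
    g = grad_f(x)
    x_bar = solve_qp_box(g, B, l, u, x)      # approximate solution of (18.60)
    p = x_bar - x                            # descent direction, by the argument above
    alpha = 1.0
    for _ in range(50):                      # backtrack for sufficient decrease
        if f(x + alpha * p) <= f(x) + eta * alpha * (g @ p):
            break
        alpha *= 0.5
    return np.clip(x + alpha * p, l, u)      # guard against rounding outside the box
```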
We now consider a trust-region gradient projection method for solving (18.58). We begin by forming the quadratic model (18.59), but since there is no requirement for $q_k$ to be convex, we can define $B_k$ to be the Hessian $\nabla^2 f(x_k)$ or a quasi-Newton approximation obtained from the BFGS or SR1 formulas. The step $p_k$ is obtained by solving the subproblem

$$\min q_k(x) \quad \text{subject to} \quad l \le x \le u, \quad \|x - x_k\|_\infty \le \Delta_k, \tag{18.61}$$

for some $\Delta_k > 0$. This problem can be posed as a bound-constrained quadratic program as follows:

$$\min q_k(x) \quad \text{subject to} \quad \max(l, x_k - \Delta_k e) \le x \le \min(u, x_k + \Delta_k e),$$
where $e = (1, 1, \ldots, 1)^T$. Algorithm 16.5 can be used to solve this subproblem. The step $p_k$ is accepted or rejected following standard trust-region strategies, and the radius $\Delta_k$ is updated according to the agreement between the change in f and the change in $q_k$ produced by the step $p_k$; see Chapter 4.
The two gradient projection methods just outlined require solution of an inequality-constrained quadratic subproblem at each iteration, and so are formally IQP methods. They can, however, also be viewed as EQP methods because of their use of Algorithm 16.5 in solving the subproblem. This algorithm first identifies a working set by finding the Cauchy point and then solves an equality-constrained subproblem by fixing the working-set constraints at their bounds. For large problems, it is efficient to perform the subspace minimization (16.74) by using the conjugate gradient method. A preconditioner is sometimes needed to make this approach practical; the most popular choice is the incomplete (and modified) Cholesky factorization outlined in Algorithm 7.3.
The gradient projection approach can be extended in principle to more general linear or convex constraints. Practical implementations are, however, limited to the bound-constrained problem (18.58) because of the high cost of computing projections onto general constraint sets.
18.7 CONVERGENCE ANALYSIS
Numerical experience has shown that the SQP and SLQP methods discussed in this chapter often converge to a solution from remote starting points. Hence, there has been considerable interest in understanding what drives the iterates toward a solution and what can cause the algorithms to fail. These global convergence studies have been valuable in improving the design and implementation of algorithms.
Some early results make strong assumptions, such as boundedness of multipliers, well-posedness of the subproblem (18.11), and regularity of constraint Jacobians. More recent studies relax many of these assumptions with the goal of understanding both the successful and unsuccessful outcomes of the iteration. We now state a classical global convergence result that gives conditions under which a standard SQP algorithm always identifies a KKT point of the nonlinear program.
Consider an SQP method that computes a search direction $p_k$ by solving the quadratic program (18.11). We assume that the Hessian $\nabla^2_{xx}\mathcal{L}_k$ is replaced in (18.11a) by some symmetric and positive definite approximation $B_k$. The new iterate is defined as $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k$ is computed by a backtracking line search, starting from the unit step length, and terminating when

$$\phi_1(x_k + \alpha_k p_k; \mu) \le \phi_1(x_k; \mu) - \eta \alpha_k [q_\mu(0) - q_\mu(p_k)],$$

where $\eta \in (0, 1)$, with $\phi_1$ defined as in (18.51) and $q_\mu$ defined as in (18.49). To establish the convergence result, we assume that each quadratic program (18.11) is feasible and determines a bounded solution $p_k$. We also assume that the penalty parameter $\mu$ is fixed for all k and sufficiently large.
Theorem 18.3.
Suppose that the SQP algorithm just described is applied to the nonlinear program (18.10). Suppose that the sequences $\{x_k\}$ and $\{x_k + p_k\}$ are contained in a closed, bounded, convex region of $\mathbb{R}^n$ in which f and $c_i$ have continuous first derivatives. Suppose that the matrices $B_k$ and multipliers $\lambda_k$ are bounded and that $\mu$ satisfies $\mu \ge \|\lambda_k\|_\infty + \rho$ for all k, where $\rho$ is a positive constant. Then all limit points of the sequence $\{x_k\}$ are KKT points of the nonlinear program (18.10).
The conclusions of the theorem are quite satisfactory, but the assumptions are somewhat restrictive. For example, the condition that the sequence $\{x_k + p_k\}$ stays within a bounded set rules out the case in which the Hessians $B_k$ or constraint Jacobians become ill conditioned. Global convergence results that are established under more realistic conditions are surveyed by Conn, Gould, and Toint [74]. An example of a result of this type is Theorem 19.2. Although this theorem is established for a nonlinear interior-point method, similar results can be established for trust-region SQP methods.
RATE OF CONVERGENCE
We now derive conditions that guarantee the local convergence of SQP methods, as well as conditions that ensure a superlinear rate of convergence. For simplicity, we limit our discussion to Algorithm 18.1 for equality-constrained optimization, and consider both exact Hessian and quasi-Newton versions. The results presented here can be applied to algorithms for inequality-constrained problems once the active set has settled at its final optimal value (see Theorem 18.1).
We begin by listing a set of assumptions on the problem that will be useful in this section.
Assumptions 18.2.
The point $x^*$ is a local solution of problem (18.1) at which the following conditions hold.

(a) The functions f and c are twice differentiable in a neighborhood of $x^*$ with Lipschitz continuous second derivatives.

(b) The linear independence constraint qualification (Definition 12.4) holds at $x^*$. This condition implies that the KKT conditions (12.34) are satisfied for some vector of multipliers $\lambda^*$.

(c) The second-order sufficient conditions (Theorem 12.6) hold at $(x^*, \lambda^*)$.

We consider first an SQP method that uses exact second derivatives.
Theorem 18.4.
Suppose that Assumptions 18.2 hold. Then, if $(x_0, \lambda_0)$ is sufficiently close to $(x^*, \lambda^*)$, the pairs $(x_k, \lambda_k)$ generated by Algorithm 18.1 converge quadratically to $(x^*, \lambda^*)$.
The proof follows directly from Theorem 11.2, since we know that Algorithm 18.1 is equivalent to Newton's method applied to the nonlinear system $F(x, \lambda) = 0$, where F is defined by (18.3).
We turn now to quasi-Newton variants of Algorithm 18.1, in which the Lagrangian Hessian $\nabla^2_{xx}\mathcal{L}(x_k, \lambda_k)$ is replaced by a quasi-Newton approximation $B_k$. We discussed in Section 18.3 algorithms that used approximations to the full Hessian, and also reduced-Hessian methods that maintained approximations to the projected Hessian $Z_k^T \nabla^2_{xx}\mathcal{L}(x_k, \lambda_k) Z_k$. As in the earlier discussion, we take $Z_k$ to be the $n \times (n - m)$ matrix whose columns span the null space of $A_k$, assuming in addition that the columns of $Z_k$ are orthonormal; see (15.22).
If we multiply the first block row of the KKT system (18.9) by $Z_k^T$, we obtain

$$Z_k^T \nabla^2_{xx}\mathcal{L}_k p_k = -Z_k^T \nabla f_k. \tag{18.62}$$

This equation, together with the second block row $A_k p_k = -c_k$ of (18.9), is sufficient to determine fully the value of $p_k$ when $x_k$ and $\lambda_k$ are not too far from their optimal values. In other words, only the projection $Z_k^T \nabla^2_{xx}\mathcal{L}_k$ of the Hessian is significant; the remainder of $\nabla^2_{xx}\mathcal{L}_k$ (its projection onto the range space of $A_k^T$) does not play a role in determining $p_k$.

By multiplying (18.62) by $Z_k$, and defining the following matrix $P_k$, which projects onto the null space of $A_k$:

$$P_k = I - A_k^T (A_k A_k^T)^{-1} A_k = Z_k Z_k^T,$$

we can rewrite (18.62) equivalently as follows:

$$P_k \nabla^2_{xx}\mathcal{L}_k p_k = -P_k \nabla f_k.$$
The discussion above, together with Theorem 18.4, suggests that a quasi-Newton method will be locally convergent if the quasi-Newton matrix $B_k$ is chosen so that $P_k B_k$ is a reasonable approximation of $P_k \nabla^2_{xx}\mathcal{L}_k$, and that it will be superlinearly convergent if $P_k B_k$ approximates $P_k \nabla^2_{xx}\mathcal{L}_k$ well. To make the second statement more precise, we present a result that can be viewed as an extension of the characterization of superlinear convergence (Theorem 3.6) to the equality-constrained case. In the following discussion, $\nabla^2_{xx}\mathcal{L}_*$ denotes $\nabla^2_{xx}\mathcal{L}(x^*, \lambda^*)$.
Theorem 18.5.
Suppose that Assumptions 18.2 hold and that the iterates $x_k$ generated by Algorithm 18.1 with quasi-Newton approximate Hessians $B_k$ converge to $x^*$. Then $x_k$ converges superlinearly if and only if the Hessian approximation $B_k$ satisfies

$$\lim_{k\to\infty} \frac{\|P_k (B_k - \nabla^2_{xx}\mathcal{L}_*)(x_{k+1} - x_k)\|}{\|x_{k+1} - x_k\|} = 0. \tag{18.63}$$
We can apply this result to the quasi-Newton updating schemes discussed earlier in this chapter, beginning with the full BFGS approximation based on (18.13). To guarantee that the BFGS approximation is always well defined, we make the strong assumption that the Hessian of the Lagrangian is positive definite at the solution.
Theorem 18.6.
Suppose that Assumptions 18.2 hold. Assume also that $\nabla^2_{xx}\mathcal{L}_*$ and $B_0$ are symmetric and positive definite. If $\|x_0 - x^*\|$ and $\|B_0 - \nabla^2_{xx}\mathcal{L}_*\|$ are sufficiently small, the iterates $x_k$ generated by Algorithm 18.1 with BFGS Hessian approximations $B_k$ defined by (18.13) and (18.16) (with $r_k = s_k$) satisfy the limit (18.63). Therefore, the iterates $x_k$ converge superlinearly to $x^*$.
For the damped BFGS updating strategy given in Procedure 18.2, we can show that the rate of convergence is R-superlinear (not the usual Q-superlinear rate); see the Appendix.
We now consider reduced-Hessian SQP methods that update an approximation $M_k$ to $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$. From the definition of $P_k$, we see that $Z_k M_k Z_k^T$ can be considered as an approximation to the two-sided projection $P_k \nabla^2_{xx}\mathcal{L}_k P_k$. Since reduced-Hessian methods do not approximate the one-sided projection $P_k \nabla^2_{xx}\mathcal{L}_k$, we cannot expect (18.63) to hold. For these methods, we can state a condition for superlinear convergence by writing (18.63) as

$$\lim_{k\to\infty} \left[ \frac{\|P_k (B_k - \nabla^2_{xx}\mathcal{L}_*) P_k (x_{k+1} - x_k)\|}{\|x_{k+1} - x_k\|} + \frac{\|P_k (B_k - \nabla^2_{xx}\mathcal{L}_*)(I - P_k)(x_{k+1} - x_k)\|}{\|x_{k+1} - x_k\|} \right] = 0, \tag{18.64}$$

and defining $B_k = Z_k M_k Z_k^T$. The following result shows that it is necessary only for the first term in (18.64) to go to zero to obtain a weaker form of superlinear convergence, namely, two-step superlinear convergence.
Theorem 18.7.
Suppose that Assumption 18.2(a) holds and that the matrices $B_k$ are bounded. Assume also that the iterates $x_k$ generated by Algorithm 18.1 with approximate Hessians $B_k$ converge to $x^*$, and that

$$\lim_{k\to\infty} \frac{\|P_k (B_k - \nabla^2_{xx}\mathcal{L}_*) P_k (x_{k+1} - x_k)\|}{\|x_{k+1} - x_k\|} = 0. \tag{18.65}$$

Then the sequence $\{x_k\}$ converges to $x^*$ two-step superlinearly, that is,

$$\lim_{k\to\infty} \frac{\|x_{k+2} - x^*\|}{\|x_k - x^*\|} = 0.$$
In a reduced-Hessian method that uses BFGS updating, the iteration is $x_{k+1} = x_k + Y_k p_Y + Z_k p_Z$, where $p_Y$ and $p_Z$ are given by (18.19a), (18.23) with $Z_k^T \nabla^2_{xx}\mathcal{L}_k Z_k$ replaced by $M_k$. The reduced-Hessian approximation $M_k$ is updated by the BFGS formula using
the correction vectors (18.26), and the initial approximation $M_0$ is symmetric and positive definite. If we make the assumption that the null space bases $Z_k$ used to define the correction vectors (18.26) vary smoothly, then we can apply Theorem 18.7 to show that $x_k$ converges two-step superlinearly.
18.8 PERSPECTIVES AND SOFTWARE
SQP methods are most efficient if the number of active constraints is nearly as large as the number of variables, that is, if the number of free variables is relatively small. They require few evaluations of the functions, in comparison with augmented Lagrangian methods, and can be more robust on badly scaled problems than the nonlinear interior-point methods described in the next chapter. It is not known at present whether the IQP or EQP approach will prove to be more effective for large problems. Current research focuses on widening the class of problems that can be solved with SQP and SLQP approaches.
Two established SQP software packages are SNOPT [128] and FILTERSQP [105]. The former code follows a line search approach, while the latter implements a trust-region strategy using a filter for step acceptance. The SLQP approach of Section 18.5 is implemented in KNITRO-ACTIVE [49]. All three packages include mechanisms to ensure that the subproblems are always feasible and to guard against rank-deficient constraint Jacobians. SNOPT uses the penalty (or elastic) mode (18.12), which is invoked if the SQP subproblem is infeasible or if the Lagrange multiplier estimates become very large in norm. FILTERSQP includes a feasibility restoration phase that, in addition to promoting convergence, provides rapid identification of convergence to infeasible points. KNITRO-ACTIVE implements a penalty method using the update strategy of Algorithm 18.5.
There is no established implementation of the S$\ell_1$QP approach, but prototype implementations have shown promise. The CONOPT [9] package implements a generalized reduced gradient method as well as an SQP method.
Quasi-Newton approximations to the Hessian of the Lagrangian $\nabla^2_{xx}\mathcal{L}_k$ are often used in practice. BFGS updating is generally less effective for constrained problems than in the unconstrained case because of the requirement of maintaining a positive definite approximation to an underlying matrix that often does not have this property. Nevertheless, the BFGS and limited-memory BFGS approximations implemented in SNOPT and KNITRO perform adequately in practice. KNITRO also offers an SR1 option that may be more effective than the BFGS option, but the question of how best to implement full quasi-Newton approximations for constrained optimization requires further investigation. The RSQP package [13] implements an SQP method that maintains a quasi-Newton approximation to the reduced Hessian.
The Maratos effect, if left unattended, can significantly slow optimization algorithms that use nonsmooth merit functions or filters. However, selective application of second-order correction steps adequately resolves the difficulties in practice.
Trust-region implementations of the gradient projection method include TRON [192] and LANCELOT [72]. Both codes use a conjugate gradient iteration to perform the subspace minimization and apply an incomplete Cholesky preconditioner. Gradient projection methods in which the Hessian approximation is defined by limited-memory BFGS updating are implemented in LBFGSB [322] and BLMVM [17]. The properties of limited-memory BFGS matrices can be exploited to perform the projected gradient search and subspace minimization efficiently. SPG [23] implements the gradient projection method using a nonmonotone line search.
NOTES AND REFERENCES
SQP methods were first proposed in 1963 by Wilson [306] and were developed in the 1970s by Garcia-Palomares and Mangasarian [117], Han [163, 164], and Powell [247, 250, 249], among others. Trust-region variants are studied by Vardi [295], Celis, Dennis, and Tapia [56], and Byrd, Schnabel, and Shultz [55]. See Boggs and Tolle [33] and Gould, Orban, and Toint [147] for literature surveys.

The SLQP approach was proposed by Fletcher and Sainz de la Maza [108] and was further developed by Chin and Fletcher [59] and Byrd et al. [49]. The latter paper discusses how to update the LP trust region and many other details of implementation. The technique for updating the penalty parameter implemented in Algorithm 18.5 is discussed in [49, 47]. The S$\ell_1$QP method was proposed by Fletcher; see [101] for a complete discussion of this method.

Some analysis shows that several (but not all) of the good properties of BFGS updating are preserved by damped BFGS updating. Numerical experiments exposing the weakness of the approach are reported by Powell [254]. Second-order correction strategies were proposed by Coleman and Conn [65], Fletcher [100], Gabay [116], and Mayne and Polak [204]. The watchdog technique was proposed by Chamberlain et al. [57], and other nonmonotone strategies are described by Bonnans et al. [36]. For a comprehensive discussion of second-order correction and nonmonotone techniques, see the book by Conn, Gould, and Toint [74].

Two filter SQP algorithms are described by Fletcher and Leyffer [105] and Fletcher, Leyffer, and Toint [106]. It is not yet known whether the filter strategy has advantages over merit functions. Both approaches are undergoing development, and improved implementations can be expected in the future. Theorem 18.3 is proved by Powell [252] and Theorem 18.5 by Boggs, Tolle, and Wang [34].
EXERCISES
18.1 Show that in the quadratic program (18.7) we can replace the linear term $\nabla f_k^T p$ by $\nabla_x \mathcal{L}(x_k, \lambda_k)^T p$ without changing the solution.
18.2 Prove Theorem 18.4.
18.3 Write a program that implements Algorithm 18.1. Use it to solve the problem

$$\min \; e^{x_1 x_2 x_3 x_4 x_5} - \tfrac{1}{2}(x_1^3 + x_2^3 + 1)^2 \tag{18.66}$$
$$\text{subject to} \quad x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 - 10 = 0, \tag{18.67}$$
$$x_2 x_3 - 5 x_4 x_5 = 0, \tag{18.68}$$
$$x_1^3 + x_2^3 + 1 = 0. \tag{18.69}$$

Use the starting point $x_0 = (-1.8, 1.7, 1.9, -0.8, -0.8)^T$. The solution is $x^* = (-1.71, 1.59, 1.82, -0.763, -0.763)^T$.
18.4 Show that the damped BFGS updating satisfies (18.17).

18.5 Consider the constraint $x_1^2 + x_2^2 = 1$. Write the linearized constraints (18.7b) at the following points: $(0, 0)^T$, $(0, 1)^T$, $(0.1, 0.02)^T$, $-(0.1, 0.02)^T$.

18.6 Prove Theorem 18.2 for the case in which the merit function is given by $\phi(x;\mu) = f(x) + \mu \|c(x)\|_q$, where $q > 0$. Use this lemma to show that the condition that ensures descent is given by $\mu > \|\lambda_{k+1}\|_r$, where $r > 0$ satisfies $r^{-1} + q^{-1} = 1$.

18.7 Write a program that implements the reduced-Hessian method given by (18.18), (18.19a), (18.21), (18.23). Use your program to solve the problem given in Exercise 18.3.

18.8 Show that the constraints (18.50b)–(18.50e) are always consistent.

18.9 Show that the feasibility problem (18.45a)–(18.45b) always has a solution $v_k$ lying in the range space of $A_k^T$. (Hint: First show that if the trust-region constraint (18.45b) is active, $v_k$ lies in the range space of $A_k^T$. Next, show that if the trust region is inactive, the minimum-norm solution of (18.45a) lies in the range space of $A_k^T$.)
CHAPTER 19

Interior-Point Methods for Nonlinear Programming
Interior-point (or barrier) methods have proved to be as successful for nonlinear optimization as for linear programming, and together with active-set SQP methods, they are currently considered the most powerful algorithms for large-scale nonlinear programming. Some of the key ideas, such as primal-dual steps, carry over directly from the linear programming case, but several important new challenges arise. These include the treatment of nonconvexity, the strategy for updating the barrier parameter in the presence of nonlinearities, and the need to ensure progress toward the solution. In this chapter we describe two classes of interior-point methods that have proved effective in practice.
The methods in the first class can be viewed as direct extensions of interior-point methods for linear and quadratic programming. They use line searches to enforce convergence and employ direct linear algebra (that is, matrix factorizations) to compute steps. The methods in the second class use a quadratic model to define the step and incorporate a trust-region constraint to provide stability. These two approaches, which coincide asymptotically, have similarities with line search and trust-region SQP methods.
Barrier methods for nonlinear optimization were developed in the 1960s but fell out of favor for almost two decades. The success of interior-point methods for linear programming stimulated renewed interest in them for the nonlinear case. By the late 1990s, a new generation of methods and software for nonlinear programming had emerged. Numerical experience indicates that interior-point methods are often faster than active-set SQP methods on large problems, particularly when the number of free variables is large. They may not yet be as robust, but significant advances are still being made in their design and implementation. The terms "interior-point methods" and "barrier methods" are now used interchangeably.
In Chapters 14 and 16 we discussed interior-point methods for linear and quadratic programming. It is not essential that the reader study those chapters before reading this one, although doing so will give a better perspective. The first part of this chapter assumes familiarity primarily with the KKT conditions and Newton's method, and the second part of the chapter relies on concepts from sequential quadratic programming presented in Chapter 18.
The problem under consideration in this chapter is written as follows:

$$\min_{x,s} \; f(x) \tag{19.1a}$$
$$\text{subject to} \quad c_E(x) = 0, \tag{19.1b}$$
$$c_I(x) - s = 0, \tag{19.1c}$$
$$s \ge 0. \tag{19.1d}$$

The vector $c_I(x)$ is formed from the scalar functions $c_i(x)$, $i \in \mathcal{I}$, and similarly for $c_E(x)$. Note that we have transformed the inequalities $c_I(x) \ge 0$ into equalities by the introduction of a vector s of slack variables. We use l to denote the number of equality constraints (that is, the dimension of the vector $c_E$) and m to denote the number of inequality constraints (the dimension of $c_I$).
19.1 TWO INTERPRETATIONS
Interior-point methods can be seen as continuation methods or as barrier methods. We discuss both derivations, starting with the continuation approach.
The KKT conditions (12.1) for the nonlinear program (19.1) can be written as

$$\nabla f(x) - A_E^T(x) y - A_I^T(x) z = 0, \tag{19.2a}$$
$$Sz - \mu e = 0, \tag{19.2b}$$
$$c_E(x) = 0, \tag{19.2c}$$
$$c_I(x) - s = 0, \tag{19.2d}$$

with $\mu = 0$, together with

$$s \ge 0, \quad z \ge 0. \tag{19.3}$$
Here $A_E(x)$ and $A_I(x)$ are the Jacobian matrices of the functions $c_E$ and $c_I$, respectively, and y and z are their Lagrange multipliers. We define S and Z to be the diagonal matrices whose diagonal entries are given by the vectors s and z, respectively, and let $e = (1, 1, \ldots, 1)^T$.
Equation (19.2b), with $\mu = 0$, and the bounds (19.3) introduce into the problem the combinatorial aspect of determining the optimal active set, illustrated in Example 15.1. We circumvent this difficulty by letting $\mu$ be strictly positive, thus forcing the variables s and z to take positive values. The homotopy (or continuation) approach consists of approximately solving the perturbed KKT conditions (19.2) for a sequence of positive parameters $\{\mu_k\}$ that converges to zero, while maintaining $s, z > 0$. The hope is that, in the limit, we will obtain a point that satisfies the KKT conditions for the nonlinear program (19.1). Furthermore, by requiring the iterates to decrease a merit function or to be acceptable to a filter, the iteration is likely to converge to a minimizer, not simply a KKT point.
The homotopy approach is justified locally. In a neighborhood of a solution $(x^*, s^*, y^*, z^*)$ that satisfies the linear independence constraint qualification (LICQ) (Definition 12.4), the strict complementarity condition (Definition 12.5), and the second-order sufficient conditions (Theorem 12.6), we have that for all sufficiently small positive values of $\mu$, the system (19.2) has a locally unique solution, which we denote by $(x(\mu), s(\mu), y(\mu), z(\mu))$. The trajectory described by these points is called the primal-dual central path, and it converges to $(x^*, s^*, y^*, z^*)$ as $\mu \to 0$.
The second derivation of interior-point methods associates with (19.1) the barrier problem

$$\min_{x,s} \; f(x) - \mu \sum_{i=1}^m \log s_i \tag{19.4a}$$
$$\text{subject to} \quad c_E(x) = 0, \tag{19.4b}$$
$$c_I(x) - s = 0, \tag{19.4c}$$

where $\mu$ is a positive parameter and log denotes the natural logarithm function. One
need not include the inequality $s \ge 0$ in (19.4) because minimization of the barrier term $-\mu \sum_{i=1}^m \log s_i$ in (19.4a) prevents the components of s from becoming too close to zero. Recall that $-\log t \to \infty$ as $t \downarrow 0$. Problem (19.4) also avoids the combinatorial aspect of nonlinear programs, but its solution does not coincide with that of (19.1) for $\mu > 0$. The barrier approach consists of finding approximate solutions of the barrier problem (19.4) for a sequence of positive barrier parameters $\{\mu_k\}$ that converges to zero.
To compare the homotopy and barrier approaches, we write the KKT conditions for (19.4) as follows:

$$\nabla f(x) - A_E^T(x) y - A_I^T(x) z = 0, \tag{19.5a}$$
$$-\mu S^{-1} e + z = 0, \tag{19.5b}$$
$$c_E(x) = 0, \tag{19.5c}$$
$$c_I(x) - s = 0. \tag{19.5d}$$

Note that they differ from (19.2) only in the second equation, which becomes quite nonlinear near the solution as $s \to 0$. It is advantageous for Newton's method to transform the rational equation (19.5b) into a quadratic equation. We do so by multiplying this equation by S, a procedure that does not change the solution of (19.5) because the diagonal elements of S are positive. After this transformation, the KKT conditions for the barrier problem coincide with the perturbed KKT system (19.2).
The term "interior point" derives from the fact that early barrier methods [98] did not use slacks and assumed that the initial point $x_0$ is feasible with respect to the inequality constraints $c_i(x) \ge 0$, $i \in \mathcal{I}$. These methods used the barrier function

$$f(x) - \mu \sum_{i\in\mathcal{I}} \log c_i(x)$$

to prevent the iterates from leaving the feasible region defined by the inequalities. We discuss this barrier function further in Section 19.6. Most modern interior-point methods are infeasible (they can start from any initial point $x_0$) and remain interior only with respect to the constraints $s > 0$, $z > 0$. However, they can be designed so that once they generate a feasible iterate, all subsequent iterates remain feasible with respect to the inequalities.
In the next sections we will see that the homotopy and barrier interpretations are both useful. The homotopy view gives rise to the definition of the primal-dual direction, whereas the barrier view is vital in the design of globally convergent iterations.
19.2 A BASIC INTERIOR-POINT ALGORITHM
Applying Newton's method to the nonlinear system (19.2), in the variables x, s, y, z, we obtain

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} & 0 & -A_E^T(x) & -A_I^T(x) \\ 0 & Z & 0 & S \\ A_E(x) & 0 & 0 & 0 \\ A_I(x) & -I & 0 & 0 \end{bmatrix} \begin{bmatrix} p_x \\ p_s \\ p_y \\ p_z \end{bmatrix} = - \begin{bmatrix} \nabla f(x) - A_E^T(x) y - A_I^T(x) z \\ Sz - \mu e \\ c_E(x) \\ c_I(x) - s \end{bmatrix}, \tag{19.6}$$
where $\mathcal{L}$ denotes the Lagrangian for (19.1a)–(19.1c):

$$\mathcal{L}(x, s, y, z) = f(x) - y^T c_E(x) - z^T (c_I(x) - s). \tag{19.7}$$

The system (19.6) is called the primal-dual system (in contrast with the primal system discussed in Section 19.3). After the step $p = (p_x, p_s, p_y, p_z)$ has been determined, we compute the new iterate $(x^+, s^+, y^+, z^+)$ as

$$x^+ = x + \alpha_s^{\max} p_x, \quad s^+ = s + \alpha_s^{\max} p_s, \tag{19.8a}$$
$$y^+ = y + \alpha_z^{\max} p_y, \quad z^+ = z + \alpha_z^{\max} p_z, \tag{19.8b}$$

where

$$\alpha_s^{\max} = \max\{ \alpha \in (0, 1] : s + \alpha p_s \ge (1 - \tau) s \}, \tag{19.9a}$$
$$\alpha_z^{\max} = \max\{ \alpha \in (0, 1] : z + \alpha p_z \ge (1 - \tau) z \}, \tag{19.9b}$$

with $\tau \in (0, 1)$. A typical value of $\tau$ is 0.995. The condition (19.9), called the fraction to the boundary rule, prevents the variables s and z from approaching their lower bounds of 0 too quickly.
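The fraction-to-the-boundary computation (19.9) reduces to a componentwise ratio test. A minimal sketch, assuming $v > 0$ componentwise:

```python
import numpy as np

def max_step(v, dv, tau=0.995):
    """Fraction-to-the-boundary rule (19.9): largest alpha in (0, 1]
    with v + alpha*dv >= (1 - tau)*v, assuming v > 0 componentwise."""
    neg = dv < 0
    if not np.any(neg):
        return 1.0
    return min(1.0, np.min(-tau * v[neg] / dv[neg]))
```

With this helper, $\alpha_s^{\max} = $ max_step(s, p_s) and $\alpha_z^{\max} = $ max_step(z, p_z).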
This simple iteration provides the basis of modern interior-point methods, though various modifications are needed to cope with nonconvexities and nonlinearities. The other major ingredient is the procedure for choosing the sequence of parameters $\{\mu_k\}$, which from now on we will call the barrier parameters. In the approach studied by Fiacco and McCormick [98], the barrier parameter is held fixed for a series of iterations until the KKT conditions (19.2) are satisfied to some accuracy. An alternative approach is to update the barrier parameter at each iteration. Both approaches have their merits and are discussed in Section 19.3.
The primal-dual matrix in (19.6) remains nonsingular as the iteration converges to a solution that satisfies the second-order sufficiency conditions and strict complementarity. More specifically, if $x^*$ is a solution point for which strict complementarity holds, then for every index i either $s_i$ or $z_i$ remains bounded away from zero as the iterates approach $x^*$, ensuring that the second block row of the primal-dual matrix (19.6) has full row rank. Therefore, the interior-point approach does not, in itself, give rise to ill conditioning or singularity. This fact allows us to establish a fast (superlinear) rate of convergence; see Section 19.8.
We summarize the discussion by describing a concrete implementation of this basic interior-point method. We use the following error function, which is based on the perturbed KKT system (19.2):

$$E(x, s, y, z; \mu) = \max\{ \|\nabla f(x) - A_E^T(x) y - A_I^T(x) z\|, \; \|Sz - \mu e\|, \; \|c_E(x)\|, \; \|c_I(x) - s\| \}, \tag{19.10}$$

for some vector norm $\|\cdot\|$.
Algorithm 19.1 (Basic Interior-Point Algorithm).
Choose $x_0$ and $s_0 > 0$, and compute initial values for the multipliers $y_0$ and $z_0 > 0$.
Select an initial barrier parameter $\mu_0 > 0$ and parameters $\sigma, \epsilon \in (0, 1)$. Set $k \leftarrow 0$.
repeat until a stopping test for the nonlinear program (19.1) is satisfied
  repeat until $E(x_k, s_k, y_k, z_k; \mu_k) \le \epsilon \mu_k$
    Solve (19.6) to obtain the search direction $p = (p_x, p_s, p_y, p_z)$;
    Compute $\alpha_s^{\max}$, $\alpha_z^{\max}$ using (19.9);
    Compute $(x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1})$ using (19.8);
    Set $\mu_{k+1} \leftarrow \mu_k$ and $k \leftarrow k + 1$;
  end
  Choose $\mu_k \in (0, \sigma \mu_k)$;
end
An algorithm that updates the barrier parameter $\mu_k$ at every iteration is easily obtained from Algorithm 19.1 by removing the requirement that the KKT conditions be satisfied for each $\mu_k$ (the inner repeat loop) and by using a dynamic rule for updating $\mu_k$ in the penultimate line.
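For concreteness, one inner iteration of Algorithm 19.1 (assemble (19.6), solve it by a dense factorization, and return the step components) might be sketched as follows. This is an illustration under stated assumptions, not a robust implementation; hess_lag, grad_f, and the Jacobians are user-supplied callbacks.

```python
import numpy as np

def primal_dual_step(x, s, y, z, mu, grad_f, hess_lag, cE, AE, cI, AI):
    """Assemble and solve the primal-dual system (19.6) densely;
    a minimal sketch for small problems."""
    n, l, m = x.size, y.size, z.size
    S, Z, I = np.diag(s), np.diag(z), np.eye(m)
    K = np.block([
        [hess_lag(x, y, z), np.zeros((n, m)), -AE.T,             -AI.T],
        [np.zeros((m, n)),  Z,                np.zeros((m, l)),   S],
        [AE,                np.zeros((l, m)), np.zeros((l, l)),   np.zeros((l, m))],
        [AI,               -I,                np.zeros((m, l)),   np.zeros((m, m))],
    ])
    rhs = -np.concatenate([grad_f(x) - AE.T @ y - AI.T @ z,
                           s * z - mu,      # componentwise Sz - mu*e
                           cE, cI - s])
    p = np.linalg.solve(K, rhs)
    return np.split(p, [n, n + m, n + m + l])   # p_x, p_s, p_y, p_z
```

Combined with max_step above and the update (19.8), this provides the body of the inner repeat loop.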
The following theorem provides a theoretical foundation for interior-point methods that compute only approximate solutions of the barrier problem.
Theorem 19.1.
Suppose that Algorithm 19.1 generates an infinite sequence of iterates $\{x_k\}$ and that $\mu_k \to 0$ (that is, that the algorithm does not loop infinitely in the inner repeat statement). Suppose that f and c are continuously differentiable functions. Then all limit points $\hat{x}$ of $\{x_k\}$ are feasible. Furthermore, if any limit point $\hat{x}$ of $\{x_k\}$ satisfies the linear independence constraint qualification (LICQ), then the first-order optimality conditions of the problem (19.1) hold at $\hat{x}$.
PROOF. For simplicity, we prove the result for the case in which the nonlinear program (19.1) contains only inequality constraints, leaving the extension of the result as an exercise. For ease of notation, we denote the inequality constraints $c_I$ by c. Let $\hat{x}$ be a limit point of the sequence $\{x_k\}$, and let $\{x_{k_l}\}$ be a convergent subsequence, namely, $x_{k_l} \to \hat{x}$. Since $\mu_k \to 0$, the error E given by (19.10) converges to zero, so we have $c(x_{k_l}) - s_{k_l} \to 0$. By continuity of c, this fact implies that $\hat{c} \stackrel{\mathrm{def}}{=} c(\hat{x}) \ge 0$; that is, $\hat{x}$ is feasible, and $s_{k_l} \to \hat{s} = \hat{c}$.

Now suppose that the linear independence constraint qualification holds at $\hat{x}$, and consider the set of active indices

$$\mathcal{A} = \{ i : \hat{c}_i = 0 \}.$$

For $i \notin \mathcal{A}$, we have $\hat{c}_i > 0$ and $\hat{s}_i > 0$, and thus by the complementarity condition (19.2b), we have that $[z_{k_l}]_i \to 0$. From this fact and $\nabla f(x_{k_l}) - A^T(x_{k_l}) z_{k_l} \to 0$, we deduce that

$$\nabla f(x_{k_l}) - \sum_{i\in\mathcal{A}} [z_{k_l}]_i \nabla c_i(x_{k_l}) \to 0. \tag{19.11}$$

By the constraint qualification hypothesis, the vectors $\{\nabla c_i(\hat{x}) : i \in \mathcal{A}\}$ are linearly independent. Hence, by (19.11) and continuity of $\nabla f$ and $\nabla c_i$, $i \in \mathcal{A}$, the positive sequence $\{z_{k_l}\}$ converges to some value $\hat{z} \ge 0$. Taking the limit in (19.11), we have that

$$\nabla f(\hat{x}) = \sum_{i\in\mathcal{A}} \hat{z}_i \nabla c_i(\hat{x}).$$

We also have that $\hat{c}^T \hat{z} = 0$, completing the proof.
Practical interior-point algorithms fall into two categories. The first builds on Algorithm 19.1, adding a line search and features to control the rate of decrease in the slacks s and multipliers z, and introducing modifications in the primal-dual system when negative curvature is encountered. The second category of algorithms, presented in Section 19.5, computes steps by minimizing a quadratic model of (19.4), subject to a trust-region constraint. The two approaches share many features described in the next section.

19.3 ALGORITHMIC DEVELOPMENT

We now discuss a series of modifications and extensions of Algorithm 19.1 that enable it to solve nonconvex nonlinear problems, starting from any initial estimate.

Often, the primal-dual system (19.6) is rewritten in the symmetric form

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} & 0 & A_E^T(x) & A_I^T(x) \\ 0 & \Sigma & 0 & -I \\ A_E(x) & 0 & 0 & 0 \\ A_I(x) & -I & 0 & 0 \end{bmatrix} \begin{bmatrix} p_x \\ p_s \\ -p_y \\ -p_z \end{bmatrix} = - \begin{bmatrix} \nabla f(x) - A_E^T(x) y - A_I^T(x) z \\ z - \mu S^{-1} e \\ c_E(x) \\ c_I(x) - s \end{bmatrix}, \tag{19.12}$$

where

$$\Sigma = S^{-1} Z. \tag{19.13}$$

This formulation permits the use of a symmetric linear equations solver, which reduces the computational work of each iteration.
PRIMAL VS. PRIMAL-DUAL SYSTEM
If we apply Newton's method directly to the optimality conditions (19.5) of the barrier problem (instead of first transforming (19.5b) as described above) and then symmetrize the iteration matrix, we obtain the system (19.12) but with $\Sigma$ given by

$$\Sigma = \mu S^{-2}. \tag{19.14}$$

This is often called the primal system, in contrast with the primal-dual system arising from (19.13). This nomenclature owes more to the historical development of interior-point methods than to the concept of primal-dual iterations. Whereas in the primal-dual choice (19.13) the vector z can be seen as a general multiplier estimate, the primal term (19.14) is obtained by making the specific selection $Z = \mu S^{-1}$; we return to this choice of multipliers in Section 19.6.
Even though the systems (19.2) and (19.5) are equivalent, Newton's method applied to them will generally produce different iterates, and there are reasons for preferring the primal-dual system. Note that (19.2b) has the advantage that its derivatives are bounded as any slack variables approach zero; such is not the case with (19.5b). Moreover, analysis of the primal step, as well as computational experience, has shown that, under some circumstances, the primal step (19.12), (19.14) tends to produce poor steps that violate the bounds $s > 0$ and $z > 0$ significantly, resulting in slow progress; see Section 19.6.
SOLVING THE PRIMAL-DUAL SYSTEM

Apart from the cost of evaluating the problem functions and their derivatives, the work of the interior-point iteration is dominated by the solution of the primal-dual system (19.12), (19.13). An efficient linear solver, using either sparse factorization or iterative techniques, is therefore essential for fast solution of large problems.
The symmetric matrix in (19.12) has the familiar form of a KKT matrix (cf. (16.7), (18.6)), and the linear system can be solved by the approaches described in Chapter 16. We can first reduce the system by eliminating $p_s$ using the second equation in (19.6), giving

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} & A_E^T(x) & A_I^T(x) \\ A_E(x) & 0 & 0 \\ A_I(x) & 0 & -\Sigma^{-1} \end{bmatrix} \begin{bmatrix} p_x \\ -p_y \\ -p_z \end{bmatrix} = - \begin{bmatrix} \nabla f(x) - A_E^T(x) y - A_I^T(x) z \\ c_E(x) \\ c_I(x) - \mu Z^{-1} e \end{bmatrix}. \tag{19.15}$$

This system can be factored by using a symmetric indefinite factorization; see (16.12). If we denote the coefficient matrix in (19.15) by K, this factorization computes $P^T K P = L B L^T$, where L is lower triangular and B is block diagonal, with blocks of size $1 \times 1$ or $2 \times 2$. P is a matrix of row and column permutations that seeks a compromise between the goals of preserving sparsity and ensuring numerical stability; see (3.51) and the discussion that follows.
The system (19.15) can be reduced further by eliminating $p_z$ using the last equation, to obtain the condensed coefficient matrix

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} + A_I^T \Sigma A_I & A_E^T(x) \\ A_E(x) & 0 \end{bmatrix}, \tag{19.16}$$

which is much smaller than (19.12) when the number of inequality constraints is large. Although significant fill-in can arise from the term $A_I^T \Sigma A_I$, it is tolerable in many applications. A particularly favorable case, in which $A_I^T \Sigma A_I$ is diagonal, arises when the inequality constraints are simple bounds.
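A sketch of the condensed step computation follows: eliminate $p_z$ as in the text, solve the system with coefficient matrix (19.16), and recover $p_z$. The names and the residual conventions (r_d, r_E, r_I matching the right-hand side of (19.15), with r_I = $c_I(x) - \mu Z^{-1} e$) are our assumptions.

```python
import numpy as np

def condensed_step(W, AE, AI, Sigma, r_d, r_E, r_I):
    """Solve the condensed system built from (19.16); a hedged sketch.
    W is the (modified) Hessian of the Lagrangian, Sigma = S^{-1} Z."""
    n, l = W.shape[0], AE.shape[0]
    K = np.block([[W + AI.T @ Sigma @ AI, AE.T],
                  [AE, np.zeros((l, l))]])
    rhs = -np.concatenate([r_d + AI.T @ Sigma @ r_I, r_E])
    sol = np.linalg.solve(K, rhs)
    p_x, p_y = sol[:n], -sol[n:]            # sign convention of (19.15)
    p_z = -Sigma @ (r_I + AI @ p_x)         # recover p_z from the eliminated row
    return p_x, p_y, p_z
```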
The primal-dual system in any of the symmetric forms (19.12), (19.15), (19.16) is ill conditioned because, by (19.13), some of the elements of $\Sigma$ diverge to $\infty$, while others converge to zero as $\mu \to 0$. Nevertheless, because of the special form in which this ill conditioning arises, the direction computed by a stable direct factorization method is usually accurate. Damaging errors result only when the slacks s or multipliers z become very close to zero or when the Hessian $\nabla^2_{xx}\mathcal{L}$ or the Jacobian matrix $A_E$ is almost rank deficient. For this reason, direct factorization techniques are considered the most reliable techniques for computing steps in interior-point methods.
Iterative linear algebra techniques can also be used for the step computation. Ill conditioning is a grave concern in this context, and preconditioners that cluster the eigenvalues of $\Sigma$ must be used. Fortunately, such preconditioners are easy to construct. For example, let us introduce the change of variables $\tilde{p}_s = S^{-1} p_s$ in the system (19.12), and multiply the second equation in (19.12) by S, transforming the term $\Sigma$ into $S \Sigma S = SZ$. As $\mu \to 0$, and assuming that $SZ \approx \mu I$, we have from (19.13) that all the elements of $S \Sigma S$ cluster around $\mu$. Other scalings can be used as well: the change of variables $\tilde{p}_s = \Sigma^{1/2} p_s$ provides the perfect preconditioner, while $\tilde{p}_s = \mu^{1/2} S^{-1} p_s$ transforms $\Sigma$ to $\mu^{-1} S \Sigma S$, which converges to I as $\mu \to 0$.
We can apply an iterative method to one of the symmetric indefinite systems (19.12), (19.15), or (19.16). The conjugate gradient method is not appropriate (except as explained below) because it is designed for positive definite systems, but we can use GMRES, QMR, or LSQR (see [136]). In addition to employing preconditioning that removes the ill conditioning caused by the barrier approach, as discussed above, we need to deal with possible ill conditioning caused by the Hessian $\nabla^2_{xx}\mathcal{L}$ or the Jacobian matrices $A_E$ and $A_I$. General-purpose preconditioners are difficult to find in this context, and the success of an iterative method hinges on the use of problem-specific or structured preconditioners.
An effective alternative is to use a null-space approach to solve the primal-dual system and apply the CG method in the positive definite reduced space. As explained in Section 16.3, we can do this by applying the projected CG iteration of Algorithm 16.2 using a so-called constraint preconditioner. In the context of the system (19.12) the preconditioner has the form

$$\begin{bmatrix} G & 0 & A_E^T(x) & A_I^T(x) \\ 0 & T & 0 & -I \\ A_E(x) & 0 & 0 & 0 \\ A_I(x) & -I & 0 & 0 \end{bmatrix}, \tag{19.17}$$

where G is a sparse matrix that is positive definite on the null space of the constraints and T is a diagonal matrix that equals or approximates $\Sigma$. This preconditioner keeps the Jacobian information of $A_E$ and $A_I$ intact and thereby removes any ill conditioning present in these matrices.
UPDATING THE BARRIER PARAMETER
The sequence of barrier parameters $\{\mu_k\}$ must converge to zero so that, in the limit, we recover the solution of the nonlinear programming problem (19.1). If $\mu_k$ is decreased too slowly, a large number of iterations will be required for convergence; but if it is decreased too quickly, some of the slacks s or multipliers z may approach zero prematurely, slowing progress of the iteration. We now describe several techniques for updating $\mu_k$ that have proved to be effective in practice.
The strategy implemented in Algorithm 19.1, which we call the Fiacco–McCormick approach, fixes the barrier parameter until the perturbed KKT conditions (19.2) are satisfied to some accuracy. Then the barrier parameter is decreased by the rule

$$\mu_{k+1} = \sigma_k \mu_k, \quad \text{with } \sigma_k \in (0, 1). \tag{19.18}$$

Some early implementations of interior-point methods chose $\sigma_k$ to be a constant (for example, $\sigma_k = 0.2$). It is, however, preferable to let $\sigma_k$ take on two or more values (for example, 0.2 and 0.1), choosing smaller values when the most recent iterations make significant progress toward the solution. Furthermore, by letting $\sigma_k \to 0$ near the solution, and letting the parameter $\tau$ in (19.9) converge to 1, a superlinear rate of convergence can be obtained.

The Fiacco–McCormick approach works well on many problems, but it can be sensitive to the choice of the initial point, the initial barrier parameter value, and the scaling of the problem.
Adaptive strategies for updating the barrier parameter are more robust in difficult situations. These strategies, unlike the Fiacco–McCormick approach, vary $\mu$ at every iteration depending on the progress of the algorithm. Most such strategies are based on complementarity, as in the linear programming case (see Framework 14.1), and have the form

$$\mu_{k+1} = \sigma_k \frac{s_k^T z_k}{m}, \tag{19.19}$$
which allows $\mu_k$ to reflect the scale of the problem. One choice of $\sigma_k$, implemented in the LOQO package [294], is based on the deviation of the smallest complementarity product $[s_k]_i [z_k]_i$ from the average:

$$\sigma_k = 0.1 \min\left\{ 0.05 \frac{1 - \xi_k}{\xi_k}, \; 2 \right\}^3, \quad \text{where} \quad \xi_k = \frac{\min_i [s_k]_i [z_k]_i}{s_k^T z_k / m}. \tag{19.20}$$

Here $[s_k]_i$ denotes the i-th component of the iterate $s_k$, and similarly for $[z_k]_i$. When $\xi_k \approx 1$ (all the individual products are near to their average), the barrier parameter is decreased aggressively.

Predictor (or probing) strategies (see Section 14.2) can also be used to determine the parameter $\sigma_k$ in (19.19). We calculate a predictor (affine scaling) direction

$$(\Delta x^{\mathrm{aff}}, \Delta s^{\mathrm{aff}}, \Delta y^{\mathrm{aff}}, \Delta z^{\mathrm{aff}})$$

by setting $\mu = 0$ in (19.12). We probe this direction by finding $\alpha_s^{\mathrm{aff}}$ and $\alpha_z^{\mathrm{aff}}$ to be the longest step lengths that can be taken along the affine scaling direction before violating the nonnegativity conditions $(s, z) \ge 0$. Explicit formulas for these step lengths are given by (19.9) with $\tau = 1$. We then define $\mu^{\mathrm{aff}}$ to be the value of complementarity along the (shortened) affine scaling step, that is,

$$\mu^{\mathrm{aff}} = (s_k + \alpha_s^{\mathrm{aff}} \Delta s^{\mathrm{aff}})^T (z_k + \alpha_z^{\mathrm{aff}} \Delta z^{\mathrm{aff}}) / m, \tag{19.21}$$

and define $\sigma_k$ as follows:

$$\sigma_k = \left( \frac{\mu^{\mathrm{aff}}}{\mu_k} \right)^3, \quad \text{where } \mu_k = s_k^T z_k / m. \tag{19.22}$$

This heuristic choice of $\sigma_k$ was proposed for linear programming problems (see (14.34)) and also works well for nonlinear programs.
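Both centering heuristics are one-liners given the iterates; the sketch below reuses max_step from the earlier fragment and is an illustration, not library code.

```python
import numpy as np

def sigma_loqo(s, z):
    """LOQO centering rule (19.20)."""
    m = s.size
    xi = np.min(s * z) / (s @ z / m)
    return 0.1 * min(0.05 * (1.0 - xi) / xi, 2.0) ** 3

def sigma_predictor(s, z, ds_aff, dz_aff):
    """Probing rule (19.21)-(19.22), using tau = 1 in the ratio test (19.9)."""
    a_s = max_step(s, ds_aff, tau=1.0)   # reuses max_step defined earlier
    a_z = max_step(z, dz_aff, tau=1.0)
    m = s.size
    mu_aff = (s + a_s * ds_aff) @ (z + a_z * dz_aff) / m
    mu = s @ z / m
    return (mu_aff / mu) ** 3
```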
HANDLING NONCONVEXITY AND SINGULARITY
The direction defined by the primal-dual system (19.12) is not always productive because it seeks to locate only KKT points; it can move toward a maximizer or other stationary points. In Chapter 18 we have seen that the Newton step (18.9) for the equality-constrained problem (18.1) can be guaranteed to be a descent direction for a large class of merit functions (and to be a productive direction for a filter) if the Hessian W is positive definite on the tangent space of the constraints. The reason is that, in this case, the step can be interpreted as the minimization of a convex model in the reduced space obtained by eliminating the linearized constraints.
For the primal-dual system (19.12), the step p is a descent direction if the matrix

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} & 0 \\ 0 & \Sigma \end{bmatrix}$$

is positive definite on the null space of the constraint matrix

$$\begin{bmatrix} A_E(x) & 0 \\ A_I(x) & -I \end{bmatrix}. \tag{19.23}$$

Lemma 16.3 states that this positive definiteness condition holds if the inertia of the primal-dual matrix in (19.12) is given by

$$(n + m, \; l + m, \; 0), \tag{19.24}$$

in other words, if this matrix has exactly $n + m$ positive, $l + m$ negative, and no zero eigenvalues. Recall that l and m denote the number of equality and inequality constraints, respectively. As discussed in Section 3.4, the inertia can be obtained from the symmetric indefinite factorization of (19.12).
If the primal-dual matrix does not have the desired inertia, we can modify it as follows. Note that the diagonal matrix $\Sigma$ is positive definite by construction, but $\nabla^2_{xx}\mathcal{L}$ can be indefinite. Therefore, we can replace the latter matrix by $\nabla^2_{xx}\mathcal{L} + \delta I$, where $\delta > 0$ is sufficiently large to ensure that the inertia is given by (19.24). The size of this modification is not known beforehand, but we can try successively larger values of $\delta$ until the desired inertia is obtained.
We must also guard against singularity of the primal-dual matrix caused by rank deficiency of $A_E$ (the matrix $[A_I \;\; -I]$ always has full rank). We do so by including a regularization parameter $\gamma \ge 0$, in addition to the modification term $\delta I$, and work with the modified primal-dual matrix

$$\begin{bmatrix} \nabla^2_{xx}\mathcal{L} + \delta I & 0 & A_E^T(x) & A_I^T(x) \\ 0 & \Sigma & 0 & -I \\ A_E(x) & 0 & -\gamma I & 0 \\ A_I(x) & -I & 0 & 0 \end{bmatrix}. \tag{19.25}$$
A procedure for selecting $\delta$ and $\gamma$ is given in Algorithm B.1 in Appendix B. It is invoked at every iteration of the interior-point method to enforce the inertia condition (19.24) and to guarantee nonsingularity. Other matrix modifications to ensure positive definiteness have been discussed in Chapter 3 in the context of unconstrained minimization.
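The $\delta$-selection loop can be sketched as follows. Computing the inertia from dense eigenvalues is purely illustrative (production codes read the inertia off the symmetric indefinite factorization, in the manner of Algorithm B.1); K_builder(delta) is an assumed callback returning the matrix (19.25) for a fixed $\gamma$.

```python
import numpy as np

def correct_inertia(K_builder, n, l, m, delta0=1e-4, factor=10.0, delta_max=1e40):
    """Increase delta until the modified primal-dual matrix has the
    inertia (n + m, l + m, 0) required by (19.24); a hedged sketch."""
    delta = 0.0
    while delta < delta_max:
        eigs = np.linalg.eigvalsh(K_builder(delta))   # symmetric matrix assumed
        pos, neg = int(np.sum(eigs > 0)), int(np.sum(eigs < 0))
        if pos == n + m and neg == l + m:
            return delta
        delta = delta0 if delta == 0.0 else factor * delta
    raise RuntimeError("inertia correction failed")
```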
STEP ACCEPTANCE: MERIT FUNCTIONS AND FILTERS
The role of the merit function or filter is to determine whether a step is productive and should be accepted. Since interiorpoint methods can be seen as methods for solv ing the barrier problem 19.4, it is appropriate to define the merit function or filter in terms of barrier functions. We may use, for example, an exact merit function of the form
m i1
where the norm is chosen, say, to be the 1 or the 2 norm unsquared. The penalty parameter 0 can be updated by using the strategies described in Chapter 18.
In a line search method, after the step p has been computed and the maximum step lengths 19.9 have been determined, we perform a backtracking line search that computes the step lengths
0,max, 0,max, 19.27 sszz
providing sufficient decrease of the merit function or ensuring acceptability by the filter. The new iterate is then defined as
    x⁺ = x + α_s p_x,   s⁺ = s + α_s p_s,                                (19.28a)
    y⁺ = y + α_z p_y,   z⁺ = z + α_z p_z.                                (19.28b)
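In outline, the fraction-to-the-boundary computation and the backtracking search on the merit function look as follows (a Python sketch under simplifying assumptions; phi and dphi stand for the merit function and its directional derivative, which the caller must supply):

    import numpy as np

    def max_step(v, p, tau=0.995):
        # largest alpha in (0, 1] with v + alpha*p >= (1 - tau)*v, for v > 0
        neg = p < 0
        if not np.any(neg):
            return 1.0
        return min(1.0, float(np.min(-tau * v[neg] / p[neg])))

    def backtrack(phi, dphi, x, s, p_x, p_s, alpha_max, eta=1e-4, factor=0.5):
        # Armijo-style sufficient decrease, starting from the boundary limit
        alpha, phi0 = alpha_max, phi(x, s)
        while phi(x + alpha * p_x, s + alpha * p_s) > phi0 + eta * alpha * dphi:
            alpha *= factor
            if alpha < 1e-16:
                break
        return alpha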
When defining a filter (see Section 15.4), the pairs of the filter are formed, on the one hand, by the values of the barrier function f(x) − μ Σ_{i=1}^m log s_i and, on the other hand, by the constraint violations ‖(c_E(x), c_I(x) − s)‖. A step will be accepted if it is not dominated by
any element in the filter. Under certain circumstances, if the step is not accepted by the filter, instead of reducing the step length α_s in (19.8a), a feasibility restoration phase is invoked; see the Notes and References at the end of the chapter.
QUASI-NEWTON APPROXIMATIONS
A quasi-Newton version of the primal-dual step is obtained by replacing ∇²_{xx}L in (19.12) by a quasi-Newton approximation B. We can use the BFGS (6.19) or SR1 (6.24) update formulas described in Chapter 6 to define B, or we can follow a limited-memory BFGS approach (see Chapter 7). It is important to approximate the Hessian of the Lagrangian of the nonlinear program, not the Hessian of the barrier function, which is highly ill conditioned and changes rapidly.
The correction pairs used by the quasi-Newton updating formula are denoted here by (Δx, Δl), replacing the notation (s, y) of Chapter 6. After computing a step from (x, s, y, z)
to (x⁺, s⁺, y⁺, z⁺), we define

    Δl = ∇_x L(x⁺, s⁺, y⁺, z⁺) − ∇_x L(x, s⁺, y⁺, z⁺),   Δx = x⁺ − x.
To ensure that the BFGS method generates a positive definite matrix, one can skip or damp the update; see (18.14) and (18.15). SR1 updating must be safeguarded to avoid unboundedness, as discussed in Section 6.2, and may also need to be modified so that the inertia of the primal-dual matrix is given by (19.24). This modification can be performed by means of Algorithm B.1.
The quasi-Newton matrices B generated in this manner are dense n × n matrices. For large problems, limited-memory updating is desirable. One option is to implement a limited-memory BFGS method by using the compact representations described in Section 7.2. Here B has the form
    B = ξI + W M W^T,                                                    (19.29)
where ξ > 0 is a scaling factor, W is an n × 2m̂ matrix, M is a 2m̂ × 2m̂ symmetric and nonsingular matrix, and m̂ denotes the number of correction pairs saved in the limited-memory updating procedure. The matrices W and M are formed by using the vectors {Δl_k} and {Δx_k} accumulated in the last m̂ iterations. Since the limited-memory matrix B is positive definite, and assuming A_E has full rank, the primal-dual matrix is nonsingular, and we can compute the solution to (19.12) by inverting the coefficient matrix using the Sherman-Morrison-Woodbury formula (see Exercise 19.14).
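For instance, a solve with B alone can be sketched as follows (Python; this shows only the Sherman-Morrison-Woodbury step for the compact form (19.29), not the full block elimination needed for (19.12)):

    import numpy as np

    def solve_with_B(xi, W, M, b):
        # solve (xi*I + W M W^T) x = b at O(n * mhat^2) cost;
        # W is n x 2*mhat, M is 2*mhat x 2*mhat symmetric nonsingular
        C = np.linalg.inv(M) + (W.T @ W) / xi        # small "capacitance" matrix
        y = np.linalg.solve(C, (W.T @ b) / xi)
        return b / xi - (W @ y) / xi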
FEASIBLE INTERIOR-POINT METHODS
In many applications, it is desirable for all of the iterates generated by an optimization algorithm to be feasible with respect to some or all of the inequality constraints. For example, the objective function may be defined only when some of the constraints are satisfied, making this feature essential.
Interior-point methods provide a natural framework for deriving feasible algorithms. If the current iterate x satisfies c_I(x) > 0, then it is easy to adapt the primal-dual iteration (19.12) so that feasibility is preserved. After computing the step p, we let x⁺ = x + α_s p_x, redefine the slacks as
    s⁺ = c_I(x⁺),                                                        (19.30)
and test whether the point (x⁺, s⁺) is acceptable for the merit function φ_ν. If so, we define this point to be the new iterate; otherwise we reject the step p and compute a new, shorter trial step. In a line search algorithm we backtrack, and in a trust-region method we compute a new step with a reduced trust-region bound. This strategy is justified by the fact that if at a trial point we have c_i(x⁺) ≤ 0 for some inequality constraint, the value of the merit function is +∞, and we reject the trial point. We will also reject steps x + α_s p_x that are too close to the boundary of the feasible region, because such steps increase the barrier term −μ Σ_{i=1}^m log s_i in the merit function (19.26).

Making the substitution (19.30) has the effect of replacing log s_i by log c_i(x) in the merit function, a technique reminiscent of the classical primal log-barrier approach discussed in Section 19.6.
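A single trial of this feasible-mode strategy can be sketched as follows (Python; c_I and phi are problem-supplied callables, and all names are illustrative rather than from the text):

    import numpy as np

    def feasible_trial(x, p_x, alpha, c_I, phi, phi_current):
        x_new = x + alpha * p_x
        s_new = c_I(x_new)              # slack reset (19.30)
        if np.any(s_new <= 0.0):        # merit value would be +infinity
            return None                 # caller backtracks or shrinks the radius
        if phi(x_new, s_new) >= phi_current:
            return None
        return x_new, s_new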
19.4 A LINE SEARCH INTERIOR-POINT METHOD
We now give a more detailed description of a line search interior-point method. We denote by Dφ_ν(x, s; p) the directional derivative of the merit function φ_ν at (x, s) in the direction p. The stopping conditions are based on the error function (19.10).
Algorithm 19.2 (Line Search Interior-Point Algorithm).
Choose x₀ and s₀ > 0, and compute initial values for the multipliers y₀ and z₀ > 0. If a quasi-Newton approach is used, choose an n × n symmetric and positive definite initial matrix B₀. Select an initial barrier parameter μ > 0, parameters η, σ ∈ (0, 1), and tolerances ε_μ and TOL. Set k ← 0.
repeat until E(x_k, s_k, y_k, z_k; 0) ≤ TOL
    repeat until E(x_k, s_k, y_k, z_k; μ) ≤ ε_μ
        Compute the primal-dual direction p = (p_x, p_s, p_y, p_z) from (19.12),
            where the coefficient matrix is modified as in (19.25), if necessary;
        Compute α_s^max, α_z^max using (19.9); set p_w = (p_x, p_s);
        Compute step lengths α_s, α_z satisfying both (19.27) and
            φ_ν(x_k + α_s p_x, s_k + α_s p_s) ≤ φ_ν(x_k, s_k) + η α_s Dφ_ν(x_k, s_k; p_w);
        Compute (x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1}) using (19.28);
        if a quasi-Newton approach is used
            update the approximation B_k;
        Set k ← k + 1;
    end
    Set μ ← σμ and update ε_μ;
end
The barrier tolerance can be defined, for example, as ε_μ = μ, as in Algorithm 19.1. An adaptive strategy that updates the barrier parameter at every step is easily implemented in this framework. If the merit function can cause the Maratos effect (see Section 15.4), a second-order correction or a nonmonotone strategy should be implemented. An alternative to using a merit function is to employ a filter mechanism to perform the line search.
We will see in Section 19.7 that Algorithm 19.2 must be safeguarded to ensure global convergence.
19.5 A TRUST-REGION INTERIOR-POINT METHOD

We now consider an interior-point method that uses trust regions to promote convergence. As in the unconstrained case, the trust-region formulation allows great freedom in the choice of the Hessian and provides a mechanism for coping with Jacobian and Hessian singularities. The price to pay for this flexibility is a more complex iteration than in the line search approach.

The interior-point method described below is asymptotically equivalent to the line search method discussed in Section 19.4, but differs significantly in two respects. First, it is not fully a primal-dual method, in the sense that it first computes a step in the variables (x, s) and then updates the estimates for the multipliers, as opposed to the approach of Algorithm 19.1, in which primal and dual variables are computed simultaneously. Second, the trust-region method uses a scaling of the variables that discourages moves toward the boundary of the feasible region. This causes the algorithm to generate steps that can be different from, and enjoy more favorable convergence properties than, those produced by a line search method.

We first describe a trust-region algorithm for finding approximate solutions of a fixed barrier problem. We then present a complete interior-point method in which the barrier parameter is driven to zero.
AN ALGORITHM FOR SOLVING THE BARRIER PROBLEM
The barrier problem (19.4) is an equality-constrained optimization problem and can be solved by using a sequential quadratic programming method with trust regions. A straightforward application of SQP techniques to the barrier problem leads, however, to inefficient steps that tend to violate the positivity of the slack variables and are frequently cut short by the trust-region constraint. To overcome this problem, we design an SQP method tailored to the structure of barrier problems.

At the iterate (x, s), and for a given barrier parameter μ, we first compute Lagrange multiplier estimates (y, z) and then compute a step p = (p_x, p_s) that approximately solves the subproblem
    min_{p_x, p_s}  ∇f^T p_x + ½ p_x^T ∇²_{xx}L p_x − μ e^T S⁻¹ p_s + ½ p_s^T Σ p_s      (19.31a)
    subject to   A_E(x) p_x + c_E(x) = r_E,                                              (19.31b)
                 A_I(x) p_x − p_s + c_I(x) − s = r_I,                                    (19.31c)
                 ‖( p_x, S⁻¹ p_s )‖₂ ≤ Δ,                                                (19.31d)
                 p_s ≥ −τs.                                                              (19.31e)
Here Σ is the primal-dual matrix (19.13), and the scalar τ ∈ (0, 1) is chosen close to 1 (for example, τ = 0.995). The inequality (19.31e) plays the same role as the fraction to the boundary rule (19.9). Ideally, we would like to set r = (r_E, r_I) = 0, but since this can cause the constraints (19.31b)-(19.31d) to be incompatible, or to give a step p that makes little progress toward feasibility, we choose the parameter r by an auxiliary computation, as in Algorithm 18.4.

We motivate the choice of the objective (19.31a) by noting that the first-order optimality conditions of (19.31a)-(19.31c) are given by (19.2) with the second block of equations scaled by S⁻¹. Thus the step computed from the subproblem (19.31) is related to the primal-dual line search step in the same way as the SQP and Newton-Lagrange steps of Section 18.1.

The trust-region constraint (19.31d) guarantees that the problem (19.31) has a finite solution even when ∇²_{xx}L(x, s, y, z) is not positive definite, and therefore this Hessian need never be modified. In addition, the trust-region formulation ensures that adequate progress is made at every iteration. To justify the scaling S⁻¹ used in (19.31d), we note that the shape of the trust region must take into account the requirement that the slacks not approach zero prematurely. The scaling S⁻¹ serves this purpose because it restricts those components i of the step vector p_s for which s_i is close to its lower bound of zero. As we see below, it also plays an important role in the choice of the relaxation vectors r_E and r_I.

We outline this SQP trust-region approach as follows. The stopping condition is defined in terms of the error function E given by (19.10), and the merit function φ_ν can be defined as in (19.26) using the ℓ₂ norm, ‖·‖ = ‖·‖₂.
Algorithm 19.3 (Trust-Region Algorithm for Barrier Problems).
Input parameters: μ > 0, x₀, s₀ > 0, τ ∈ (0, 1), and Δ₀ > 0. Compute Lagrange multiplier estimates y₀ and z₀ > 0. Set k ← 0.
repeat until E(x_k, s_k, y_k, z_k; μ) ≤ ε_μ
    Compute p = (p_x, p_s) by approximately solving (19.31);
    if p provides sufficient decrease in the merit function φ_ν
        Set x_{k+1} ← x_k + p_x, s_{k+1} ← s_k + p_s;
        Compute new multiplier estimates y_{k+1}, z_{k+1} > 0 and set Δ_{k+1} ≥ Δ_k;
    else
        Define x_{k+1} ← x_k, s_{k+1} ← s_k, and set Δ_{k+1} < Δ_k;
    end
    Set k ← k + 1;
end (repeat)
Algorithm 19.3 is applied for a fixed value of the barrier parameter μ. A complete interior-point algorithm, driven by a sequence μ_k → 0, is described below. First, we discuss how to find an approximate solution of the subproblem (19.31), along with Lagrange multiplier estimates (y_{k+1}, z_{k+1}).
STEP COMPUTATION
The subproblem (19.31a)-(19.31e) is difficult to minimize exactly because of the presence of the nonlinear constraint (19.31d) and the bounds (19.31e). An important observation is that we can compute useful inexact solutions at moderate cost. Since this approach scales up well with the number of variables and constraints, it provides a framework for developing practical interior-point methods for large-scale optimization.
The first step in the solution process is to make a change of variables that transforms the trust-region constraint (19.31d) into a ball. By defining

    p̃ = ( p̃_x, p̃_s ) = ( p_x, S⁻¹ p_s ),                                                (19.32)

we can write problem (19.31) as

    min_{p̃_x, p̃_s}  ∇f^T p̃_x + ½ p̃_x^T ∇²_{xx}L p̃_x − μ e^T p̃_s + ½ p̃_s^T SΣS p̃_s        (19.33a)
    subject to   A_E(x) p̃_x + c_E(x) = r_E,                                              (19.33b)
                 A_I(x) p̃_x − S p̃_s + c_I(x) − s = r_I,                                  (19.33c)
                 ‖( p̃_x, p̃_s )‖₂ ≤ Δ,                                                    (19.33d)
                 p̃_s ≥ −τe.                                                              (19.33e)
To compute the vectors r_E and r_I, we proceed as in Section 18.5 and formulate the following normal subproblem in the variables v = (v_x, v_s):

    min_v  ‖ A_E(x) v_x + c_E(x) ‖₂² + ‖ A_I(x) v_x − S v_s + c_I(x) − s ‖₂²             (19.34a)
    subject to   ‖( v_x, v_s )‖₂ ≤ 0.8Δ,                                                 (19.34b)
                 v_s ≥ −(τ/2) e.                                                         (19.34c)
If we ignore (19.34c), this problem has the standard form of a trust-region problem, and we can compute an approximate solution by using the techniques discussed in Chapter 4, such as the dogleg method. If the solution violates the bounds (19.34c), we can backtrack so that these bounds are satisfied.
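For reference, here is a standard dogleg computation for a trust-region model min g^T p + ½ p^T B p subject to ‖p‖₂ ≤ Δ, with B symmetric positive definite, as arises from the Gauss-Newton model of (19.34) when the constraint Jacobian has full rank (a Python sketch that ignores the bounds (19.34c)):

    import numpy as np

    def dogleg(g, B, delta):
        p_newton = -np.linalg.solve(B, g)              # unconstrained minimizer
        if np.linalg.norm(p_newton) <= delta:
            return p_newton
        p_cauchy = -(g.dot(g) / g.dot(B @ g)) * g      # Cauchy point
        if np.linalg.norm(p_cauchy) >= delta:
            return -(delta / np.linalg.norm(g)) * g    # scaled steepest descent
        # intersect the segment from p_cauchy toward p_newton with the boundary
        d = p_newton - p_cauchy
        a, b = d.dot(d), 2.0 * p_cauchy.dot(d)
        c = p_cauchy.dot(p_cauchy) - delta ** 2
        t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
        return p_cauchy + t * d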
Having solved (19.34), we define the vectors r_E and r_I in (19.33b)-(19.33c) to be the residuals in the normal step computation, namely,

    r_E = A_E(x) v_x + c_E(x),   r_I = A_I(x) v_x − S v_s + c_I(x) − s.                  (19.35)

We are now ready to compute an approximate solution p̃ of the subproblem (19.33). By (19.35), the vector v is a particular solution of the linear constraints (19.33b)-(19.33c). We
can then solve the equality-constrained quadratic program (19.33a)-(19.33c) by using the projected conjugate gradient iteration given in Algorithm 16.2. We terminate the projected CG iteration by Steihaug's rules: during the solution by CG, we monitor the satisfaction of the trust-region constraint (19.33d) and stop if the boundary of this region is reached, if negative curvature is detected, or if an approximate solution is obtained. If the solution given by the projected CG iteration does not satisfy the bounds (19.33e), we backtrack so that they are satisfied. After the step p̃ = ( p̃_x, p̃_s ) has been computed, we recover p from (19.32).

As discussed in Section 16.3, every iteration of the projected CG iteration requires the solution of a linear system in order to perform the projection operation. For the quadratic program (19.33a)-(19.33c) this projection matrix is given by

    [ I    Ã^T ]                    [ A_E(x)    0 ]
    [ Ã    0   ],   with   Ã   =    [ A_I(x)   −S ].                                     (19.36)
Thus, although this trust-region approach still requires the solution of an augmented system, the matrix (19.36) is simpler than the primal-dual matrix (19.12). In particular, the Hessian ∇²_{xx}L need never be factored, because the CG approach requires only products of this matrix with vectors.

We mentioned in Section 19.3 that the term SΣS in (19.33a) has a much tighter distribution of eigenvalues than Σ. Therefore the CG method will normally not be adversely affected by ill conditioning, and it is a viable approach for solving the quadratic program (19.33a)-(19.33c).
LAGRANGE MULTIPLIER ESTIMATES AND STEP ACCEPTANCE
At an iterate (x, s), we choose (y, z) to be the least-squares multipliers (see (18.21)) corresponding to (19.33a)-(19.33c). We obtain the formula

    ( y, z ) = ( Ã Ã^T )⁻¹ Ã ( ∇f(x), −μe ),                                             (19.37)

where Ã is given by (19.36). The multiplier estimates z obtained in this manner may not always be positive; to enforce positivity, we may redefine them as

    z_i ← min( 10³, μ/s_i ),   i = 1, 2, ..., m.                                         (19.38)

The quantity μ/s_i is called the ith primal multiplier estimate, because if all components of z were defined by (19.38), then Σ would reduce to the primal choice (19.14).
As is standard in trust-region methods, the step p̃ is accepted if

    ared( p̃ ) ≥ η pred( p̃ ),                                                            (19.39)
where

    ared( p̃ ) = φ_ν(x, s) − φ_ν(x + p_x, s + p_s),                                       (19.40)

and where η is a constant in (0, 1) (say, η = 10⁻⁸). The predicted reduction is defined as

    pred( p̃ ) = q_ν(0) − q_ν( p̃ ),                                                      (19.41)

where q_ν is defined as

    q_ν(p) = ∇f^T p_x + ½ p_x^T ∇²_{xx}L p_x − μ e^T S⁻¹ p_s + ½ p_s^T Σ p_s + ν m(p),

and

    m(p) = ‖ ( A_E(x) p_x + c_E(x),  A_I(x) p_x − p_s + c_I(x) − s ) ‖₂.
To determine an appropriate value of the penalty parameter ν, we require that ν be large enough that

    pred(p) ≥ ρν [ m(0) − m(p) ],                                                        (19.42)

for some parameter ρ ∈ (0, 1). This is the same as condition (18.35) used in Section 18.5,
and the value of ν can be computed by the procedure described in that section.
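The acceptance test and the penalty update fit together roughly as follows (a Python outline; quad denotes the quadratic part of q_ν, m_model the linearized infeasibility m(·), and phi the merit function, all caller-supplied stand-ins; the simple ν loop below is only a crude substitute for the procedure of Section 18.5):

    def accept_step(phi, quad, m_model, x, s, p_x, p_s, nu, eta=1e-8, rho=0.3):
        # m(0) - m(p): decrease in the linearized infeasibility
        decrease_m = m_model((0.0 * p_x, 0.0 * p_s)) - m_model((p_x, p_s))
        # enlarge nu until (19.42) holds: pred >= rho * nu * [m(0) - m(p)]
        pred = -quad((p_x, p_s)) + nu * decrease_m
        for _ in range(60):
            if pred >= rho * nu * decrease_m:
                break
            nu *= 10.0
            pred = -quad((p_x, p_s)) + nu * decrease_m
        ared = phi(x, s, nu) - phi(x + p_x, s + p_s, nu)
        return ared >= eta * pred, nu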
DESCRIPTION OF A TRUST-REGION INTERIOR-POINT METHOD

We now present a more detailed description of the trust-region interior-point algorithm for solving the nonlinear programming problem (19.1). For concreteness we follow the Fiacco-McCormick strategy for updating the barrier parameter. The stopping conditions are stated, once more, in terms of the error function E defined by (19.10). In a quasi-Newton approach, the Hessian ∇²_{xx}L is replaced by a symmetric approximation.
Algorithm 19.4 (Trust-Region Interior-Point Algorithm).
Choose values for the parameters η > 0, τ ∈ (0, 1), σ ∈ (0, 1), and ρ ∈ (0, 1), and select the stopping tolerances ε_μ and TOL. If a quasi-Newton approach is used, select an n × n symmetric initial matrix B₀. Choose initial values for μ > 0, x₀, s₀ > 0, and Δ₀. Set k ← 0.
repeat until E(x_k, s_k, y_k, z_k; 0) ≤ TOL
    repeat until E(x_k, s_k, y_k, z_k; μ) ≤ ε_μ
        Compute Lagrange multipliers from (19.37)-(19.38);
        Compute ∇²_{xx}L(x_k, s_k, y_k, z_k) or update a quasi-Newton approximation B_k,
            and define Σ_k by (19.13);
        Compute the normal step v_k = (v_x, v_s);
        Compute p̃_k by applying the projected CG method to (19.33);
        Obtain the total step p_k from (19.32);
        Update ν_k to satisfy (19.42);
        Compute pred_k(p_k) by (19.41) and ared_k(p_k) by (19.40);
        if ared_k(p_k) ≥ η pred_k(p_k)
            Set x_{k+1} ← x_k + p_x, s_{k+1} ← s_k + p_s;
            Choose Δ_{k+1} ≥ Δ_k;
        else
            Set x_{k+1} ← x_k, s_{k+1} ← s_k, and choose Δ_{k+1} < Δ_k;
        end (if)
        Set k ← k + 1;
    end
    Set μ ← σμ and update ε_μ;
end
The merit function (19.26) can reject steps that make good progress toward a solution: the Maratos effect discussed in Chapter 18. This deficiency can be overcome by selective application of a second-order correction step; see Section 15.4.

Algorithm 19.4 can easily be modified to implement an adaptive barrier update strategy. The barrier stop tolerance can be defined as ε_μ = μ. Algorithm 19.4 is the basis of the KNITRO-CG method [50], which implements both exact Hessian and quasi-Newton options.
19.6 THE PRIMAL LOG-BARRIER METHOD
Prior to the introduction of primal-dual interior methods, barrier methods worked in the space of primal variables x. As in the quadratic penalty function approach of Chapter 17, the goal was to solve nonlinear programming problems by unconstrained minimization applied to a parametric sequence of functions.
Primal barrier methods are more easily described in the context of inequality-constrained problems of the form

    min_x  f(x)   subject to   c(x) ≥ 0.                                                 (19.43)

The log-barrier function is defined by

    P(x; μ) = f(x) − μ Σ_{i∈I} log c_i(x),                                               (19.44)
where μ > 0. One can show that the minimizers of P(x; μ), which we denote by x(μ), approach a solution of (19.43) as μ ↓ 0, under certain conditions; see, for example, [111]. The trajectory C_p defined by

    C_p  ≝  { x(μ) | μ > 0 }                                                             (19.45)

is often referred to as the primal central path.
Since the minimizer x(μ) of P(x; μ) lies in the strictly feasible set { x | c(x) > 0 }, where no constraints are active, we can in principle search for it by using any of the unconstrained minimization algorithms described in the first part of this book. These methods need to be modified, as explained in the discussion following equation (19.30), so that they reject steps that leave the feasible region or are too close to the constraint boundaries.
One way to obtain an estimate of the Lagrange multipliers is based on differentiating P to obtain
    ∇_x P(x; μ) = ∇f(x) − Σ_{i∈I} ( μ / c_i(x) ) ∇c_i(x).                                (19.46)
When x is close to the minimizer x(μ) and μ is small, we see from Theorem 12.1 that the optimal Lagrange multipliers z_i*, i ∈ I, can be estimated as follows:

    z_i(μ)  ≝  μ / c_i(x),   i ∈ I.                                                      (19.47)

A general framework for algorithms based on the primal log-barrier function (19.44)
can be specified as follows.
Framework 19.5 (Unconstrained Primal Barrier Method).
Given μ₀ > 0, a sequence {τ_k} with τ_k ↓ 0, and a starting point x₀ˢ;
for k = 0, 1, 2, ...
    Find an approximate minimizer x_k of P(· ; μ_k), starting at x_kˢ,
        and terminating when ‖∇P(x_k; μ_k)‖ ≤ τ_k;
    Compute Lagrange multipliers z_k by (19.47);
    if final convergence test satisfied
        stop with approximate solution x_k;
    Choose new penalty parameter μ_{k+1} < μ_k;
    Choose new starting point x_{k+1}ˢ;
end (for)
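A minimal runnable sketch of this framework in Python follows (using a derivative-free inner solver from scipy for robustness near the boundary; the stopping rules are crude stand-ins for those in the framework):

    import numpy as np
    from scipy.optimize import minimize

    def primal_barrier(f, c, x0, mu0=1.0, shrink=0.2, tol=1e-8, max_outer=30):
        def P(x, mu):                        # barrier function (19.44)
            cx = c(x)
            if np.any(cx <= 0.0):
                return np.inf                # reject infeasible trial points
            return f(x) - mu * np.sum(np.log(cx))

        x, mu = np.asarray(x0, float), mu0
        for _ in range(max_outer):
            x = minimize(lambda v: P(v, mu), x, method="Nelder-Mead").x
            z = mu / c(x)                    # multiplier estimates (19.47)
            if mu < tol:                     # crude final convergence test
                break
            mu *= shrink                     # mu_{k+1} < mu_k
        return x, z

For Example 19.1 below, f = lambda x: (x[0]-0.5)**2 + (x[1]-0.5)**2 together with c = lambda x: np.array([x[0], 1-x[0], x[1], 1-x[1]]) recovers the minimizer (0.5, 0.5).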
The primal barrier approach was first proposed by Frisch [115] in the 1950s and was analyzed and popularized by Fiacco and McCormick [98] in the late 1960s. It fell out of favor after the introduction of SQP methods and has not regained its popularity, because it suffers from several drawbacks compared to primal-dual interior-point methods. The most
important drawback is that the minimizer x(μ) becomes more and more difficult to find as μ ↓ 0 because of the nonlinearity of the function P(x; μ).
EXAMPLE 19.1 Consider the problem
    min  (x₁ − 0.5)² + (x₂ − 0.5)²   subject to   x₁ ∈ [0, 1],  x₂ ∈ [0, 1],             (19.48)

for which the primal barrier function is

    P(x; μ) = (x₁ − 0.5)² + (x₂ − 0.5)²
              − μ [ log x₁ + log(1 − x₁) + log x₂ + log(1 − x₂) ].                       (19.49)
Contours of this function for the value μ = 0.01 are plotted in Figure 19.1. The elongated nature of the contours indicates bad scaling, which causes poor performance of unconstrained optimization methods such as quasi-Newton, steepest descent, and conjugate gradient. Newton's method is insensitive to the poor scaling, but the nonelliptical property (the contours in Figure 19.1 are almost straight along the left edge while being circular along the right edge) indicates that the quadratic approximation on which Newton's method is based does not capture well the behavior of the barrier function. Hence, Newton's method, too, may not show rapid convergence to the minimizer of (19.49) except in a small neighborhood of this point.
Figure 19.1  Contours of P(x; μ) from (19.49) for μ = 0.01 (figure not reproduced; the horizontal axis spans roughly x₁ ∈ [0.05, 0.25] and the vertical axis x₂ ∈ [0.35, 0.65]).
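The figure can be reproduced qualitatively with a few lines of Python (the plotting window is chosen to match the reported axis ranges):

    import numpy as np
    import matplotlib.pyplot as plt

    mu = 0.01
    x1, x2 = np.meshgrid(np.linspace(0.02, 0.28, 400),
                         np.linspace(0.32, 0.68, 400))
    P = ((x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
         - mu * (np.log(x1) + np.log(1 - x1) + np.log(x2) + np.log(1 - x2)))
    plt.contour(x1, x2, P, levels=30)
    plt.xlabel("x1"); plt.ylabel("x2")
    plt.show()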
To lessen this nonlinearity, we can proceed as in (17.21) and introduce additional variables. Defining z_i = μ/c_i(x), we rewrite the stationarity condition (19.46) as

    ∇f(x) − Σ_{i∈I} z_i ∇c_i(x) = 0,                                                     (19.50a)
    C(x) z − μe = 0,                                                                     (19.50b)

where C(x) = diag( c₁(x), c₂(x), ..., c_m(x) ). Note that this system is equivalent to the perturbed KKT conditions (19.2) for problem (19.43) if, in addition, we introduce slacks as in (19.2d). Finally, if we apply Newton's method in the variables (x, s, z) and temporarily ignore the bounds s, z ≥ 0, we arrive at the primal-dual formulation. Thus, with hindsight, we can transform the primal log-barrier approach into the primal-dual line search approach of Section 19.4 or into the trust-region algorithm of Section 19.5.
Other drawbacks of the classical primal barrier approach are that it requires a feasible initial point, which can be difficult to find in many cases, and that the incorporation of equality constraints in a primal function is problematic. A formulation in which the equality constraints are replaced by quadratic penalties suffers from the shortcomings of quadratic penalty functions discussed in Section 17.1.
The shortcomings of the primal barrier approach were attributed for many years to the ill conditioning of the Hessian of the barrier function P. Note that

    ∇²_{xx} P(x; μ) = ∇²f(x) − Σ_{i∈I} ( μ / c_i(x) ) ∇²c_i(x)
                      + Σ_{i∈I} ( μ / c_i²(x) ) ∇c_i(x) ∇c_i(x)^T.                       (19.51)
By substituting (19.47) into (19.51) and using the definition (12.33) of the Lagrangian L(x, z), we find that

    ∇²_{xx} P(x; μ) ≈ ∇²_{xx} L(x, z(μ)) + (1/μ) Σ_{i∈I} z_i(μ)² ∇c_i(x) ∇c_i(x)^T.      (19.52)
Note the similarity of this expression to the Hessian of the quadratic penalty function (17.19). Analysis of the matrix ∇²_{xx}P(x; μ) shows that it becomes increasingly ill conditioned near the minimizer x(μ) as μ approaches zero.

This ill conditioning will be detrimental to the performance of the steepest descent, conjugate gradient, and quasi-Newton methods. It is therefore correct to identify ill conditioning as a source of the difficulties of unconstrained primal barrier functions that use these unconstrained methods. Newton's method is, however, not affected by ill conditioning, but its performance is still not satisfactory. As explained above, it is the high nonlinearity of the primal barrier function P that poses significant difficulties to Newton's method.
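The blow-up predicted by (19.52) is easy to observe numerically. The following Python fragment uses a toy problem with one active constraint (min x₁ + x₂² subject to x₁ ≥ 0, which is not from the text); its barrier function P = x₁ + x₂² − μ log x₁ has minimizer x(μ) = (μ, 0), where the Hessian is diag(1/μ, 2):

    import numpy as np

    for mu in [1e-1, 1e-2, 1e-4, 1e-6]:
        H = np.diag([1.0 / mu, 2.0])     # Hessian of P at the minimizer x(mu)
        print(f"mu = {mu:7.0e}   cond(H) = {np.linalg.cond(H):9.2e}")

The condition number grows like 1/(2μ), confirming the ill conditioning as μ → 0.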
19.7 GLOBAL CONVERGENCE PROPERTIES
We now study some global convergence properties of the primal-dual interior-point methods described in Sections 19.4 and 19.5. Theorem 19.1 provides the starting point for the analysis. It gives conditions under which limit points of the iterates generated by the interior-point methods are KKT points for the nonlinear problem. Theorem 19.1 relies on the assumption that the perturbed KKT conditions (19.2) can be satisfied to a certain accuracy for every value of μ_k. In this section we study conditions under which this assumption holds, that is, conditions that guarantee that our algorithms can find stationary points of the barrier problem (19.4).
We begin with a surprising observation. Whereas the line search primal-dual approach is the basis of globally convergent interior-point algorithms for linear and quadratic programming, it is not guaranteed to be successful for nonlinear programming, even for nondegenerate problems.
FAILURE OF THE LINE SEARCH APPROACH
We have seen in Chapter 11 that line search Newton iterations for nonlinear equations can fail when the Jacobian loses rank. We now discuss a different kind of failure, specific to interior-point methods. It is caused by the lack of coordination between the step computation and the imposition of the bounds.
EXAMPLE 19.2  (WÄCHTER AND BIEGLER [299])
Consider the problem

    min  x                                                                               (19.53a)
    subject to   c₁(x, s) ≝ x² − s₁ − 1 = 0,                                             (19.53b)
                 c₂(x, s) ≝ x − s₂ − ½ = 0,                                              (19.53c)
                 s₁ ≥ 0,  s₂ ≥ 0.                                                        (19.53d)
Note that the Jacobian of the equality constraints (19.53b)-(19.53c) with respect to (x, s) has full rank everywhere. Let us apply a line search interior-point method of the form (19.6)-(19.9), starting from an initial point (x⁰, s⁰) such that s₁⁰ > 0, s₂⁰ > 0, and c₁(x⁰, s⁰) < 0. In this example, we use superscripts to denote iteration indices. Figure 19.2 illustrates the feasible region (the dotted segment of the parabola) and the initial point, all projected onto the x-s₁ plane. The primal-dual step, which satisfies the linearization of the constraints (19.53b)-(19.53c), leads from x⁰ to the tangent to the parabola. Here p¹ and p² are examples of possible steps satisfying the linearization of (19.53b)-(19.53c).
Figure 19.2  Problem (19.53) projected onto the x-s₁ plane (figure not reproduced; it shows the initial point (x⁰, s₁⁰), two candidate steps p¹ and p², and the feasible region on the parabola).
The new iterate x¹ therefore lies between x⁰ and this tangent, but since s₁ must remain positive, x¹ will lie above the horizontal axis. Thus, from any starting point above the x-axis and to the left of the parabola, namely, in the region

    { (x, s₁, s₂) : x² − s₁ − 1 < 0,  s₁ ≥ 0 },                                          (19.54)

the new iterate will remain in this region. The argument can now be repeated to show that the iterates x^k never leave the region (19.54) and therefore never become feasible.
This convergence failure affects any method that generates directions satisfying the linearization of the constraints (19.53b)-(19.53c) and that enforces the bounds (19.53d) by the fraction to the boundary rule (19.8). The merit function can only restrict the step length further and is therefore incapable of resolving the difficulties. The strategy for updating μ is also irrelevant, because the argument given above makes use only of the linearizations of the constraints.

These difficulties can be observed when practical line-search codes are applied to the problem (19.53). For a wide range of starting points in the region (19.54), the interior-point iteration converges to points of the form (x̄, 0, 0), with x̄ < 0. In other words, the iterates can converge to an infeasible, nonoptimal point on the boundary of the set { (x, s₁, s₂) : s₁ ≥ 0, s₂ ≥ 0 }, a situation that barrier methods are supposed to prevent. Furthermore, such limit points are not stationary for a feasibility measure (see Definition 17.1).
Failures of this type are rare in practice, but they highlight a theoretical deficiency of the algorithmic class (19.6)-(19.9) that may manifest itself more often as inefficient behavior than as outright convergence failure.
MODIFIED LINE SEARCH METHODS
To remedy this problem, as well as the inefficiencies caused by Hessian and constraint Jacobian singularities, we must modify the search direction of the line search interior-point iteration in some circumstances. One option is to use penalizations of the constraints [147]. Such penalty-barrier methods have been investigated only recently, and mature implementations have not yet emerged.
An approach that has been successful in practice is to monitor the step lengths α_s, α_z in (19.28); if they are smaller than a given threshold, then we replace the primal-dual step by a step that guarantees progress in feasibility and, preferably, improvement in optimality, too. In a filter method, when the step lengths are very small, we can invoke the feasibility restoration phase (see Section 15.4), which is designed to generate a new iterate that reduces the infeasibility. A different approach, which assumes that a trust-region algorithm is at hand, is to replace the primal-dual step by a trust-region step, such as that produced by Algorithm 19.4.

Safeguarding the primal-dual step when the step lengths are very small is justified theoretically because, when line search iterations converge to nonstationary points, the step lengths α_s, α_z converge to zero. From a practical perspective, however, this strategy is not totally satisfactory, because it attempts to react when bad steps are generated rather than trying to prevent them. It also requires the choice of a heuristic to determine when a step length is too small. As we discuss next, the trust-region approach always generates productive steps and needs no safeguarding.
GLOBAL CONVERGENCE OF THE TRUSTREGION APPROACH
The interior-point trust-region method specified in Algorithm 19.4 has favorable global convergence properties, which we now discuss. For simplicity, we present the analysis in the context of inequality-constrained problems of the form (19.43). We first study the solution of the barrier problem (19.4) for a fixed value of μ, and then consider the complete algorithm.
In the result that follows, B_k denotes the Hessian ∇²_{xx}L_k or a quasi-Newton approximation to it. We use the measure of infeasibility h(x) = ‖c(x)⁻‖, where y⁻ = max{0, −y} (componentwise). This measure vanishes if and only if x is feasible for problem (19.43). Note that h(x)² is differentiable, and its gradient is

    ∇[ h(x)² ] = −2 A(x)^T c(x)⁻.
We say that a sequence {x_k} is asymptotically feasible if c(x_k)⁻ → 0. To apply Algorithm 19.4 to a fixed barrier problem, we dispense with the outer repeat loop.
Theorem 19.2.
Suppose that Algorithm 19.4 is applied to the barrier problem (19.4), that is, μ is fixed and the inner repeat loop is executed with ε_μ = 0. Suppose that the sequence {f_k} is bounded below and the sequences {∇f_k}, {c_k}, {A_k}, and {B_k} are bounded. Then one of the following three situations occurs:

(i) The sequence {x_k} is not asymptotically feasible. In this case, the iterates approach stationarity of the measure of infeasibility h(x) = ‖c(x)⁻‖, meaning that A_k^T c_k⁻ → 0, and the penalty parameters ν_k tend to infinity.

(ii) The sequence {x_k} is asymptotically feasible, but the sequence {(c_k, A_k)} has a limit point (c̄, Ā) failing the linear independence constraint qualification. In this situation also, the penalty parameters ν_k tend to infinity.

(iii) The sequence {x_k} is asymptotically feasible, and all limit points of the sequence {(c_k, A_k)} satisfy the linear independence constraint qualification. In this case, the penalty parameter ν_k is constant and c_k⁻ = 0 for all large indices k, and the stationarity conditions of problem (19.4) are satisfied in the limit.
This theorem is proved in [48], where it is assumed, for simplicity, that Σ is given by the primal choice (19.14). The theorem accounts for two situations in which the KKT conditions may not be satisfied in the limit, both of which are of interest. Outcome (i) is a case in which, in the limit, there is no direction that improves feasibility to first order. This outcome cannot be ruled out, because finding a feasible point is a problem that a local method cannot always solve without a good starting point. Note that we do not assume that the constraint Jacobian A_k has full rank.
In considering outcome (ii), we must keep in mind that in some cases the solution to problem (19.43) is a point where the linear independence constraint qualification fails and that is not a KKT point. Outcome (iii) is the most desirable one, and it can be monitored in practice by observing, for example, the behavior of the penalty parameter ν_k.
We now study the complete interior-point method given in Algorithm 19.4, applied to the nonlinear programming problem (19.43). By combining Theorems 19.1 and 19.2, we see that the following outcomes can occur:

• For some barrier parameter μ generated by the algorithm, either the inequality ‖c_k − s_k‖ ≤ ε_μ is never satisfied, in which case the stationarity condition for minimizing h(x) is satisfied in the limit, or else c_k − s_k → 0, in which case the sequence {(c_k, A_k)} has a limit point (c̄, Ā) failing the linear independence constraint qualification;

• At each outer iteration of Algorithm 19.4 the inner stop test E(x_k, s_k, y_k, z_k; μ) ≤ ε_μ is satisfied. Then all limit points of the iteration sequence are feasible. Furthermore,
if any limit point x̄ satisfies the linear independence constraint qualification, the first-order necessary conditions for problem (19.43) hold at x̄.
19.8 SUPERLINEAR CONVERGENCE
We can implement primal-dual interior-point methods so that they converge quickly near the solution. All that is needed is to control carefully the decrease in the barrier parameter μ and in the inner convergence tolerance ε_μ, and to let the parameter τ in (19.9) converge to 1 sufficiently rapidly. We now describe strategies for updating these parameters in the context of the line search iteration discussed in Section 19.4; these strategies extend easily to the trust-region method of Section 19.5.

In the discussion that follows, we assume that the merit function or filter is inactive. This assumption is realistic because, with a careful implementation (which may include second-order correction steps or other features), we can ensure that, near a solution, all the steps generated by the primal-dual method are acceptable to the merit function or filter.
We denote the primal-dual iterates by

    v = (x, s, y, z)                                                                     (19.55)

and define the full primal-dual step (without backtracking) by

    v⁺ = v + p,                                                                          (19.56)

where p is the solution of (19.12). To establish local convergence results, we assume that the iterates converge to a solution point satisfying certain regularity assumptions.
Assumptions 19.1.
(a) v* is a solution of the nonlinear program (19.1) for which the first-order KKT conditions are satisfied.
(b) The Hessian matrices ∇²f(x) and ∇²c_i(x), i ∈ E ∪ I, are locally Lipschitz continuous at v*.
(c) The linear independence constraint qualification (LICQ) (Definition 12.4), the strict complementarity condition (Definition 12.5), and the second-order sufficient conditions (Theorem 12.6) hold at v*.
We assume that v is an iterate at which the inner stop test E(v; μ) ≤ ε_μ is satisfied, so that the barrier parameter is decreased from μ to μ⁺. We now study how to control the parameters in Algorithm 19.2 so that the following three properties hold in a neighborhood of v*:
1. The iterate v⁺ satisfies the fraction to the boundary rule (19.9), that is, α_s^max = α_z^max = 1.
2. The inner stop test is satisfied at v⁺, that is, E(v⁺; μ⁺) ≤ ε_{μ⁺}.
3. The sequence of iterates (19.56) converges superlinearly to v*.

We can achieve these three goals by letting
    ε_μ = μ^{1+σ}   and   ε_{μ⁺} = (μ⁺)^{1+σ},   for some σ > 0,                         (19.57)

and setting the other parameters as follows:

    μ⁺ = μ^{1+δ},  δ ∈ (0, 1);      τ⁺ = 1 − μ⁺.                                         (19.58)
There are other practical ways of controlling the parameters of the algorithm. For example, we may prefer to determine the change in μ from the reduction achieved in the KKT conditions of the nonlinear program, as measured by the function E. The three results mentioned above can be established if the convergence tolerance is defined as in (19.57) and if we replace μ by E(v; 0) in the right-hand sides of the definitions (19.58) of μ⁺ and τ⁺.

There is a limit to how fast we can decrease μ and still be able to satisfy the inner stop test after just one iteration (condition 2). One can show that there is no point in decreasing μ at a faster than quadratic rate, since the overall convergence cannot be faster than quadratic. Not surprisingly, if τ is constant and μ⁺ = δμ, with δ ∈ (0, 1), then the interior-point algorithm is only linearly convergent.

Although it is desirable to implement interior-point methods so that they achieve a superlinear rate of convergence, this rate is typically observed only in the last few iterations in practice.
19.9 PERSPECTIVES AND SOFTWARE
Software packages that implement nonlinear interior-point methods are widely available. Line search implementations include LOQO [294], KNITRO-DIRECT [303], IPOPT [301], and BARNLP [21], and, for convex problems, MOSEK [5]. The trust-region algorithm discussed in Section 19.5 has been implemented in KNITRO-CG [50]. These interior-point packages have proved to be strong competitors of the leading active-set and augmented Lagrangian packages, such as MINOS [218], SNOPT [128], LANCELOT [72], FILTERSQP [105], and KNITRO-ACTIVE [49]. At present, interior-point and active-set methods appear to be the most promising approaches, while augmented Lagrangian methods seem to be less efficient. The KNITRO package provides crossover from interior-point to active-set modes [46].

Interior-point methods show their strength in large-scale applications, where they often (but not always) outperform active-set methods. In interior-point methods, the linear
system to be solved at every iteration has the same block structure, so effort can be focused on exploiting this structure. Both direct factorization techniques and projected CG methods are available, allowing the user to solve many types of applications efficiently. On the other hand, interior-point methods, unlike active-set methods, consider all the constraints at each iteration, even if they are irrelevant to the solution. As a result, the cost of the primal-dual iteration can be excessive in some applications.
One of the main weaknesses of interior-point methods is their sensitivity to the choice of the initial point, the scaling of the problem, and the update strategy for the barrier parameter μ. If the iterates approach the boundary of the feasible region prematurely, interior-point methods may have difficulty escaping it, and convergence can be slow. The availability of adaptive strategies for updating μ is, however, beginning to lessen this sensitivity, and more robust implementations can be expected in the coming years.

Although the description of the line search algorithm in Section 19.4 is fairly complete, various details of implementation (such as second-order corrections, iterative refinement, and resetting of parameters) are needed to obtain a robust code. Our description of the trust-region method of Algorithm 19.4 leaves some important details unspecified, particularly concerning the procedure for computing approximate solutions of the normal and tangential subproblems; see [50] for further discussion. The KNITRO-CG implementation of this trust-region algorithm uses a projected CG iteration in the computation of the step, which allows the method to work even when only Hessian-vector products are available, not the Hessian itself.

Filters and merit functions have each been used to globalize interior-point methods. Although some studies have shown that merit functions unduly restrict the progress of the iteration [298], recent developments in penalty update procedures (see Chapter 18) have altered the picture, and it is currently unclear whether filter globalization approaches are preferable.
NOTES AND REFERENCES
The development of modern nonlinear interior-point methods was influenced by the success of interior-point methods for linear and quadratic programming. The concept of primal-dual steps arises from the homotopy formulation given in Section 19.1, which is an extension of the systems (14.13) and (16.57) for linear and quadratic programming. Although the primal barrier methods of Section 19.6 predate primal-dual methods by at least 15 years, they played a limited role in their development.

There is a vast literature on nonlinear interior-point methods. We refer the reader to the surveys by Forsgren, Gill, and Wright [111] and Gould, Orban, and Toint [147] for a comprehensive list of references. The latter paper also compares and contrasts interior-point methods with other nonlinear optimization methods. For an analysis of interior-point methods that use filter globalization see, for example, Ulbrich, Ulbrich, and Vicente [291] and Wächter and Biegler [300]. The book by Conn, Gould, and Toint [74] gives a thorough presentation of several interior-point methods.
Primal barrier methods were originally proposed by Frisch [115] and were analyzed in an authoritative book by Fiacco and McCormick [98]. The term "interior-point method" and the concept of the primal central path C_p appear to have originated in this book. Nesterov and Nemirovskii [226] propose and analyze several families of barrier methods and establish polynomial-time complexity results for very general classes of problems, such as semidefinite and second-order cone programming. For a discussion of the history of barrier function methods, see Nash [221].
EXERCISES
19.1 Consider the nonlinear program

    min  f(x)   subject to   c_E(x) = 0,   c_I(x) ≥ 0.                                   (19.59)

(a) Write down the KKT conditions of (19.1) and (19.59), and establish a one-to-one correspondence between KKT points of these problems, despite the different numbers of variables and multipliers.

(b) The multipliers z correspond to the equality constraints (19.1c) and should therefore be unsigned. Nonetheless, argue that (19.2) with μ = 0, together with (19.3), can be seen as the KKT conditions of problem (19.1). Moreover, argue that the multipliers z in (19.2) can be seen as the multipliers of the inequalities c_I in (19.59).

(c) Suppose x is feasible for (19.59). Show that LICQ holds at x for (19.59) if and only if LICQ holds at (x, s) for (19.1), with s = c_I(x).

(d) Repeat part (c), assuming that the MFCQ condition holds (see Definition 12.6) instead of LICQ.
19.2 This question concerns Algorithm 19.1.

(a) Extend the proof of Theorem 19.1 to the general nonlinear program (19.1).

(b) Show that the theorem still holds if the condition E(x_k, s_k, y_k, z_k; μ_k) ≤ μ_k is replaced by E(x_k, s_k, y_k, z_k; μ_k) ≤ ε_k, for any sequence {ε_k} that converges to 0 as μ_k → 0.

(c) Suppose that in Algorithm 19.1 the new iterate (x_{k+1}, s_{k+1}, y_{k+1}, z_{k+1}) is obtained by any means. What conditions are required on this iterate so that Theorem 19.1 holds?
19.3 Consider the nonlinear system of equations (11.1). Show that Newton's method (11.6) is invariant to scalings of the equations. More precisely, show that the Newton step p does not change if each component of r is multiplied by a nonzero constant.
19.4 Consider the system

    x₁ + x₂ − 2 = 0,    x₁x₂ − 2x₂ + 1 = 0.

Find all the solutions to this system. Show that if the first equation is multiplied by x₂, the solutions do not change, but the Newton step taken from (−1, 1) will not be the same as that for the original system.
19.5 Let (x*, s*, y*, z*) be a primal-dual solution that satisfies the LICQ and strict complementarity conditions.

(a) Give conditions on ∇²_{xx}L(x*, s*, y*, z*) that ensure that the primal-dual matrix in (19.6) is nonsingular.

(b) Show that some diagonal elements of Σ tend to infinity and others tend to zero as μ → 0. Can you characterize each case? Consider the cases in which Σ is defined by (19.13) and by (19.14).

(c) Argue that the matrix in (19.6) is not ill conditioned under the assumptions of this problem.
19.6

(a) Introduce the change of variables p̃_s = S⁻¹ p_s in (19.12), and show that the (2, 2) block of the primal-dual matrix has a cluster of eigenvalues around 0 when μ → 0.

(b) Analyze the eigenvalue distribution of the (2, 2) block if the change of variables is given by p̃_s = Σ^{1/2} p_s or p̃_s = S⁻¹ p_s.

(c) Let λ > 0 be the smallest eigenvalue of ∇²_{xx}L. Describe a change of variables for which all the eigenvalues of the (2, 2) block converge to λ as μ → 0.
19.7 Program the simple interior-point method, Algorithm 19.1, and apply it to the problem (18.69). Use the same starting point as in that problem, and try different values for the parameter σ.
19.8

(a) Compute the minimum-norm solution of the system of equations defined by (19.35). This system defines the Newton component in the dogleg method used to find an approximate solution to (19.34). Show that the computation of the Newton component can use the factorization of the augmented matrix defined in (19.36).

(b) Compute the unconstrained minimizer of the quadratic in (19.34a) along the steepest descent direction, starting from v = 0. This minimizer defines the Cauchy component in the dogleg method used to find an approximate solution to (19.34).

(c) The dogleg step is a combination of the Newton and Cauchy steps from parts (a) and (b). Show that the dogleg step is in the range space of Ã^T.
19.9

(a) If the normal subproblem (19.34a)-(19.34c) is solved by using the dogleg method, show that the solution v is in the range space of the matrix Ã^T defined in (19.36).

(b) After the normal step v is obtained, we define the residual vectors r_E and r_I as in (19.35) and set w = p̃ − v. Show that (19.33) becomes a quadratic program with a circular trust-region constraint and bound constraints in the variables w.

(c) Show that the solution w of the problem derived in part (b) is orthogonal to the normal step v, that is, that w^T v = 0.
19.10 Verify that the least-squares multiplier formula (18.21), applied to (19.33a)-(19.33c), yields (19.37).
19.11

(a) Write the primal-dual system (19.6) for problem (19.53), considering s₁, s₂ as slacks and denoting the multipliers of (19.53b), (19.53c) by z₁, z₂. You should get a system of five equations in five unknowns. Show that the matrix of the system is singular at any iterate of the form (x, 0, 0).

(b) Show that if the starting point for problem (19.53) lies in the region (19.54), the interior-point step leads to a point on the tangent line to the parabola, as illustrated in Figure 19.2. More specifically, show that the tangent line never lies to the left of the parabola.

(c) Let x⁰ = −2, s₁⁰ = 1, s₂⁰ = 1, let z₁⁰ = z₂⁰ = 1, and let μ = 0. Compute the full Newton step based on the system in part (a). Truncate it, if necessary, to satisfy a fraction to the boundary rule with τ = 1, and verify that the new iterate is still in the region (19.54).

(d) Let us now consider the behavior of an SQP method. For the initial point in (c), show that the linearized constraints of problem (18.56) (do not forget the constraints s₁ ≥ 0, s₂ ≥ 0) are inconsistent. Therefore the SQP subproblem (18.11) is inconsistent, and a relaxation of the constraints of the SQP subproblem must be performed.
19.12 Consider the following problem in a single variable x:

    min  x   subject to   x ≥ 0,   1 − x ≥ 0.

(a) Write the primal barrier function P(x; μ) associated with this problem.

(b) Plot the barrier function for different values of μ.

(c) Characterize the minimizers of the barrier function as a function of μ, and consider the limit as μ goes to 0.
19.13 Consider the scalar minimization problem

    min_x  1 / (1 + x²),   subject to   x ≥ 1.

Write down P(x; μ) for this problem, and show that P(x; μ) is unbounded below for any positive value of μ. (See Powell [242] and M. Wright [313].)
19.14 The goal of this exercise is to describe an efficient implementation of the limited-memory BFGS version of the interior-point method using the compact representation (19.29). First we decompose the primal-dual matrix as

    [ ξI    0    A_E^T   A_I^T ]     [ W ]
    [ 0     Σ    0       −I    ]  +  [ 0 ]  M  [ W^T  0  0  0 ].                         (19.60)
    [ A_E   0    0       0     ]     [ 0 ]
    [ A_I   −I   0       0     ]     [ 0 ]

Use the Sherman-Morrison-Woodbury formula to express the inverse of this matrix. Then show that the primal-dual step (19.12) requires the solution of systems of the form Cv = b, where C is the first matrix in (19.60) and v and b are certain vectors.
APPENDIX A

Background Material
A.1 ELEMENTS OF LINEAR ALGEBRA
VECTORS AND MATRICES
In this book we work exclusively with vectors and matrices whose components are real numbers. Vectors are usually denoted by lowercase roman characters, and matrices by uppercase roman characters. The space of real vectors of length n is denoted by IR^n, while the space of real m × n matrices is denoted by IR^{m×n}.
Given a vector x ∈ IR^n, we use x_i to denote its ith component. We invariably assume that x is a column vector, that is,

    x = [ x₁ ]
        [ x₂ ]
        [ ⋮  ]
        [ x_n ].

The transpose of x, denoted by x^T, is the row vector

    x^T = [ x₁  x₂  ⋯  x_n ],

and is often also written with parentheses as x = (x₁, x₂, ..., x_n). We write x ≥ 0 to indicate componentwise nonnegativity, that is, x_i ≥ 0 for all i = 1, 2, ..., n, while x > 0 indicates that x_i > 0 for all i = 1, 2, ..., n.
Given x ∈ IR^n and y ∈ IR^n, the standard inner product is x^T y = Σ_{i=1}^n x_i y_i.
Given a matrix A ∈ IR^{m×n}, we specify its components by double subscripts as A_{ij}, for i = 1, 2, ..., m and j = 1, 2, ..., n. The transpose of A, denoted by A^T, is the n × m matrix whose components are A_{ji}. The matrix A is said to be square if m = n. A square matrix is symmetric if A = A^T.
A square matrix A is positive definite if there is a positive scalar α such that

    x^T A x ≥ α x^T x,   for all x ∈ IR^n.                                               (A.1)

It is positive semidefinite if

    x^T A x ≥ 0,   for all x ∈ IR^n.
We can recognize that a symmetric matrix is positive definite by computing its eigenvalues and verifying that they are all positive, or by performing a Cholesky factorization. Both techniques are discussed further in later sections.
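In floating-point practice, attempting the Cholesky factorization is the cheaper and more common test; for example, in Python:

    import numpy as np

    def is_positive_definite(A):
        # Cholesky succeeds exactly when the symmetric matrix A is
        # positive definite (all eigenvalues positive)
        try:
            np.linalg.cholesky(A)
            return True
        except np.linalg.LinAlgError:
            return False

    print(is_positive_definite(np.array([[2.0, 1.0], [1.0, 2.0]])))  # True
    print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))  # False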
The diagonal of the matrix A ∈ IR^{m×n} consists of the elements A_{ii}, for i = 1, 2, ..., min(m, n). The matrix A ∈ IR^{m×n} is lower triangular if A_{ij} = 0 whenever i < j; that is, all elements above the diagonal are zero. It is upper triangular if A_{ij} = 0 whenever i > j; that is, all elements below the diagonal are zero. A is diagonal if A_{ij} = 0 whenever i ≠ j.
The identity matrix, denoted by I, is the square diagonal matrix whose diagonal elements are all 1.
A square n × n matrix A is nonsingular if for any vector b ∈ IR^n, there exists x ∈ IR^n such that Ax = b. For nonsingular matrices A, there exists a unique n × n matrix B such that AB = BA = I. We denote B by A⁻¹ and call it the inverse of A. It is not hard to show that the inverse of A^T is the transpose of A⁻¹.
A square matrix Q is orthogonal if it has the property that QQ^T = Q^T Q = I. In other words, the inverse of an orthogonal matrix is its transpose.
NORMS

For a vector x ∈ IR^n, we define the following norms:

    ‖x‖₁  ≝  Σ_{i=1}^n |x_i|,                                                            (A.2a)
    ‖x‖₂  ≝  ( Σ_{i=1}^n x_i² )^{1/2} = ( x^T x )^{1/2},                                 (A.2b)
    ‖x‖_∞  ≝  max_{i=1,...,n} |x_i|.                                                     (A.2c)

The norm ‖·‖₂ is often called the Euclidean norm. We sometimes refer to ‖·‖₁ as the ℓ₁ norm and to ‖·‖_∞ as the ℓ∞ norm. All these norms measure the length of the vector in some sense, and they are equivalent in the sense that each one is bounded above and below by a multiple of the other. To be precise, we have for all x ∈ IR^n that

    ‖x‖_∞ ≤ ‖x‖₂ ≤ √n ‖x‖_∞,    ‖x‖_∞ ≤ ‖x‖₁ ≤ n ‖x‖_∞,                                  (A.3)

and so on. In general, a norm is any mapping ‖·‖ from IR^n to the nonnegative real numbers
that satisfies the following properties:
    ‖x + z‖ ≤ ‖x‖ + ‖z‖,   for all x, z ∈ IR^n;                                          (A.4a)
    ‖x‖ = 0  ⇒  x = 0;                                                                   (A.4b)
    ‖αx‖ = |α| ‖x‖,   for all α ∈ IR and x ∈ IR^n.                                       (A.4c)
Equality holds in (A.4a) if and only if one of the vectors x and z is a nonnegative scalar multiple of the other.
Another interesting property that holds for the Euclidean norm ‖·‖₂ is the Cauchy-Schwarz inequality, which states that

    |x^T z| ≤ ‖x‖ ‖z‖,                                                                   (A.5)

with equality if and only if one of these vectors is a nonnegative multiple of the other. We can prove this result as follows:

    0 ≤ ‖x + αz‖² = ‖x‖² + 2α x^T z + α² ‖z‖².

The right-hand side is a convex function of α, and it satisfies the required nonnegativity property only if there exist fewer than 2 distinct real roots, that is,

    ( 2 x^T z )² ≤ 4 ‖x‖² ‖z‖²,
proving (A.5). Equality occurs when the quadratic has exactly one real root (that is, |x^T z| = ‖x‖ ‖z‖) and when x + αz = 0 for some α, as claimed.
Any norm ‖·‖ has a dual norm ‖·‖_D defined by

    ‖x‖_D  ≝  max_{‖y‖=1} x^T y.                                                         (A.6)

It is easy to show that the norms ‖·‖₁ and ‖·‖_∞ are duals of each other, and that the Euclidean norm is its own dual.
We can derive definitions for certain matrix norms from these vector norm definitions. If we let ‖·‖ be generic notation for the three norms listed in (A.2), we define the corresponding matrix norm as

    ‖A‖  ≝  sup_{x≠0}  ‖Ax‖ / ‖x‖.                                                       (A.7)
The matrix norms defined in this way are said to be consistent with the vector norms (A.2). Explicit formulae for these norms are as follows:

    ‖A‖₁ = max_{j=1,...,n} Σ_{i=1}^m |A_{ij}|,                                           (A.8a)
    ‖A‖₂ = ( largest eigenvalue of A^T A )^{1/2},                                        (A.8b)
    ‖A‖_∞ = max_{i=1,...,m} Σ_{j=1}^n |A_{ij}|.                                          (A.8c)

The Frobenius norm ‖A‖_F of the matrix A is defined by

    ‖A‖_F = ( Σ_{i=1}^m Σ_{j=1}^n A_{ij}² )^{1/2}.                                       (A.9)
This norm is useful for many purposes, but it is not consistent with any vector norm. Once again, these various matrix norms are equivalent with each other in a sense similar to (A.3).
For the Euclidean norm ‖·‖₂, the following property holds:

    ‖AB‖ ≤ ‖A‖ ‖B‖,                                                                      (A.10)
for all matrices A and B with consistent dimensions.
The condition number of a nonsingular matrix is defined as

    κ(A) = ‖A‖ ‖A⁻¹‖,                                                                    (A.11)
where any matrix norm can be used in the definition. Different norms give rise to different condition numbers, distinguished by the use of a subscript (κ₁, κ₂, and κ_∞, respectively), with κ denoting κ₂ by default.
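Numerically, for instance (Python):

    import numpy as np

    A = np.array([[1.0, 1.0], [1.0, 1.0001]])       # nearly singular
    for p in (1, 2, np.inf):
        kappa = np.linalg.norm(A, p) * np.linalg.norm(np.linalg.inv(A), p)
        print(p, kappa)
    print(np.linalg.cond(A))                        # kappa_2, the default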
Norms also have a meaning for scalar-, vector-, and matrix-valued functions that are defined on a particular domain. In these cases, we can define Hilbert spaces of functions for which the inner product and norm are defined in terms of an integral over the domain. We omit details, since all the development of this book takes place in the space IR^n, though many of the algorithms can be extended to more general Hilbert spaces. However, we mention for purposes of the analysis of Newton-like methods that the following inequality holds for functions of the type that we consider in this book:
    ‖ ∫_a^b F(t) dt ‖  ≤  ∫_a^b ‖F(t)‖ dt,                                               (A.12)

where F is a continuous scalar-, vector-, or matrix-valued function on the interval [a, b].

SUBSPACES
Given the Euclidean space IR^n, the subset S ⊂ IR^n is a subspace of IR^n if the following property holds: if x and y are any two elements of S, then

    αx + βy ∈ S,   for all α, β ∈ IR.
For instance, S is a subspace of IR² if it consists of (i) the whole space IR²; (ii) any line passing through the origin; (iii) the origin alone; or (iv) the empty set.
Given any set of vectors a_i ∈ IR^n, i = 1, 2, ..., m, the set

    S = { w ∈ IR^n | a_i^T w = 0,  i = 1, 2, ..., m }                                    (A.13)

is a subspace. However, the set

    { w ∈ IR^n | a_i^T w ≥ 0,  i = 1, 2, ..., m }                                        (A.14)

is not in general a subspace. For example, if we have n = 2, m = 1, and a₁ = (1, 0)^T, this set would consist of all vectors (w₁, w₂)^T with w₁ ≥ 0, but then, given two vectors x = (1, 0)^T and y = (2, 3)^T in this set, it is easy to choose multiples α and β such that αx + βy has a negative first component, and so lies outside the set.
Sets of the forms (A.13) and (A.14) arise in the discussion of second-order optimality conditions for constrained optimization.
A set of vectors {s₁, s₂, ..., s_m} in IR^n is called a linearly independent set if there are no real numbers α₁, α₂, ..., α_m such that

    α₁s₁ + α₂s₂ + ⋯ + α_m s_m = 0,

unless we make the trivial choice α₁ = α₂ = ⋯ = α_m = 0. Another way to define linear independence is to say that none of the vectors s₁, s₂, ..., s_m can be written as a linear combination of the other vectors in this set. If in fact we have s_i ∈ S for all i = 1, 2, ..., m, we say that {s₁, s₂, ..., s_m} is a spanning set for S if any vector s ∈ S can be written as

    s = α₁s₁ + α₂s₂ + ⋯ + α_m s_m,

for some particular choice of the coefficients α₁, α₂, ..., α_m.
If the vectors s₁, s₂, ..., s_m are both linearly independent and a spanning set for S, we call them a basis of S. In this case, m (the number of elements in the basis) is referred to as the dimension of S, and denoted by dim(S). Note that there are many ways to choose a basis of S in general, but that all bases contain the same number of vectors.
If A is any real matrix, the null space is the subspace

    Null(A) = { w | Aw = 0 },

while the range space is

    Range(A) = { w | w = Av for some vector v }.

The fundamental theorem of linear algebra states that

    Null(A) ⊕ Range(A^T) = IR^n,

where n is the number of columns in A. Here ⊕ denotes the direct sum of two sets: A ⊕ B = { x + y | x ∈ A, y ∈ B }.

When A is square (n × n) and nonsingular, we have Null(A) = Null(A^T) = {0} and Range(A) = Range(A^T) = IR^n. In this case, the columns of A form a basis of IR^n, as do the columns of A^T.
EIGENVALUES, EIGENVECTORS, AND THE SINGULAR-VALUE DECOMPOSITION
A scalar value λ is an eigenvalue of the n × n matrix A if there is a nonzero vector q such that

    Aq = λq.
The vector q is called an eigenvector of A. The matrix A is nonsingular if none of its eigenvalues are zero. The eigenvalues of symmetric matrices are all real numbers, while nonsymmetric matrices may have imaginary eigenvalues. If the matrix is positive definite as well as symmetric, its eigenvalues are all positive real numbers.
All matrices A (not necessarily square) can be decomposed as a product of three matrices with special properties. When A ∈ IR^{m×n} with m > n, that is, A has more rows than columns, this singular-value decomposition (SVD) has the form

    A = U [ S ] V^T,                                                                     (A.15)
          [ 0 ]

where U and V are orthogonal matrices of dimension m × m and n × n, respectively, and S is an n × n diagonal matrix with diagonal elements σ_i, i = 1, 2, ..., n, that satisfy

    σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_n ≥ 0.

These diagonal values are called the singular values of A. We can define the condition number (A.11) of the m × n (possibly nonsquare) matrix A to be σ₁/σ_n. This definition is identical to κ₂(A) when A happens to be square and nonsingular.

When m ≤ n (the number of columns is at least equal to the number of rows), the SVD has the form

    A = U [ S   0 ] V^T,

where again U and V are orthogonal of dimension m × m and n × n, respectively, while S is m × m diagonal with nonnegative diagonal elements σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_m.
When A is symmetric, its n real eigenvalues λ₁, λ₂, ..., λ_n and their associated eigenvectors q₁, q₂, ..., q_n can be used to write a spectral decomposition of A as follows:

    A = Σ_{i=1}^n λ_i q_i q_i^T.

This decomposition can be restated in matrix form by defining

    Λ = diag( λ₁, λ₂, ..., λ_n ),    Q = [ q₁ | q₂ | ⋯ | q_n ],

and writing

    A = Q Λ Q^T.                                                                         (A.16)

In fact, when A is positive definite as well as symmetric, this decomposition is identical to the singular-value decomposition (A.15), where we define U = V = Q and S = Λ. Note that the singular values σ_i and the eigenvalues λ_i coincide in this case.
In the case of the Euclidean norm (A.8b), we have for symmetric positive definite matrices A that the singular values and eigenvalues of A coincide, and that

    ‖A‖ = σ₁(A) = largest eigenvalue of A,
    ‖A⁻¹‖ = 1/σ_n(A) = inverse of smallest eigenvalue of A.

Hence, we have for all x ∈ IR^n that

    σ_n(A) ‖x‖² = ‖x‖² / ‖A⁻¹‖ ≤ x^T A x ≤ ‖A‖ ‖x‖² = σ₁(A) ‖x‖².

For an orthogonal matrix Q, we have for the Euclidean norm that

    ‖Qx‖ = ‖x‖,

and that all the singular values of this matrix are equal to 1.

DETERMINANT AND TRACE

The trace of an n × n matrix A is defined by

    trace(A) = Σ_{i=1}^n A_{ii}.                                                         (A.17)

If the eigenvalues of A are denoted by λ₁, λ₂, ..., λ_n, it can be shown that

    trace(A) = Σ_{i=1}^n λ_i;                                                            (A.18)

that is, the trace of the matrix is the sum of its eigenvalues.

The determinant of an n × n matrix A, denoted by det A, is the product of its eigenvalues; that is,

    det A = Π_{i=1}^n λ_i.                                                               (A.19)
The determinant has several appealing and revealing properties. For instance,

    det A = 0 if and only if A is singular;
    det(AB) = det(A) det(B);
    det(A⁻¹) = 1 / det(A).
Recall that any orthogonal matrix Q has the property that Q Q^T = Q^T Q = I, so that Q^{-1} = Q^T. It follows from the properties of the determinant that det Q = det Q^T = ±1.
The properties above are used in the analysis of Chapter 6.
MATRIX FACTORIZATIONS: CHOLESKY, LU, QR
Matrix factorizations are important both in the design of algorithms and in their analysis. One such factorization is the singular-value decomposition (A.15) defined above. Here we define the other important factorizations.
All the factorization algorithms described below make use of permutation matrices. Suppose that we wish to exchange the first and fourth rows of a matrix A. We can perform this operation by premultiplying A by a permutation matrix P, which is constructed by interchanging the first and fourth rows of an identity matrix that contains the same number of rows as A. Suppose, for example, that A is a 5 × 5 matrix. The appropriate choice of P would be

P = [ 0 0 0 1 0
      0 1 0 0 0
      0 0 1 0 0
      1 0 0 0 0
      0 0 0 0 1 ].
A similar technique is used to find a permutation matrix P that exchanges columns of a matrix.
The LU factorization of a matrix A ∈ R^{n×n} is defined as

P A = L U,   (A.20)

where
P is an n × n permutation matrix (that is, it is obtained by rearranging the rows of the n × n identity matrix),
L is unit lower triangular (that is, lower triangular with diagonal elements equal to 1), and
U is upper triangular.
This factorization can be used to solve a linear system of the form Ax = b efficiently by the following three-step process:

form b̃ = Pb by permuting the elements of b;
solve Lz = b̃ by performing triangular forward-substitution, to obtain the vector z;
solve Ux = z by performing triangular back-substitution, to obtain the solution vector x.
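As a concrete illustration, the three-step process is exactly what standard library routines perform internally. The following is a minimal sketch using SciPy's LAPACK-based helpers (the matrix and right-hand side are arbitrary examples of ours, not from the text):

```python
# Minimal sketch: solve Ax = b via the LU factorization PA = LU.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 3.0], [6.0, 3.0]])
b = np.array([10.0, 12.0])

lu, piv = lu_factor(A)        # Gaussian elimination with partial pivoting
x = lu_solve((lu, piv), b)    # permute b, forward-substitute, back-substitute
assert np.allclose(A @ x, b)
```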
The factorization (A.20) can be found by using Gaussian elimination with row partial pivoting, an algorithm that requires approximately 2n³/3 floating-point operations when A is dense. Standard software that implements this algorithm (notably, LAPACK [7]) is readily available. The method can be stated as follows.
Algorithm A.1 (Gaussian Elimination with Row Partial Pivoting).
Given A ∈ R^{n×n};
Set P ← I, L ← 0;
for i = 1, 2, ..., n
    find the index j ∈ {i, i+1, ..., n} such that |A_{ji}| = max_{k=i,i+1,...,n} |A_{ki}|;
    if A_{ji} = 0
        stop; (* matrix A is singular *)
    if i ≠ j
        swap rows i and j of matrices A and L;
    (* elimination step *)
    L_{ii} ← 1;
    for k = i+1, i+2, ..., n
        L_{ki} ← A_{ki}/A_{ii};
        for l = i+1, i+2, ..., n
            A_{kl} ← A_{kl} − L_{ki} A_{il};
        end (for)
    end (for)
end (for)
U ← upper triangular part of A.
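For readers who wish to experiment, here is a direct NumPy transcription of Algorithm A.1. It is a sketch for small dense matrices only, not a substitute for the LAPACK routines mentioned above:

```python
# NumPy transcription of Algorithm A.1 (illustrative only).
import numpy as np

def lu_partial_pivoting(A):
    """Return P, L, U with PA = LU, following Algorithm A.1."""
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.eye(n)
    L = np.zeros((n, n))
    for i in range(n):
        j = i + np.argmax(np.abs(A[i:, i]))   # pivot row index
        if A[j, i] == 0.0:
            raise ValueError("matrix is singular")
        if i != j:                            # swap rows of A, L, and P
            A[[i, j], :] = A[[j, i], :]
            L[[i, j], :] = L[[j, i], :]
            P[[i, j], :] = P[[j, i], :]
        L[i, i] = 1.0
        for k in range(i + 1, n):             # elimination step
            L[k, i] = A[k, i] / A[i, i]
            A[k, i + 1:] -= L[k, i] * A[i, i + 1:]
    return P, L, np.triu(A)

A = np.array([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
P, L, U = lu_partial_pivoting(A)
assert np.allclose(P @ A, L @ U)
```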
Variants of the basic algorithm allow for rearrangement of the columns as well as the rows during the factorization, but these do not add to the practical stability properties of the algorithm. Column pivoting may, however, improve the performance of Gaussian elimination when the matrix A is sparse, by ensuring that the factors L and U are also reasonably sparse.
Gaussian elimination can be applied also to the case in which A is not square. When A is m × n, with m > n, the standard row pivoting algorithm produces a factorization of the form (A.20), where L ∈ R^{m×n} is unit lower triangular and U ∈ R^{n×n} is upper triangular. When m < n, we can find an LU factorization of A^T rather than A; that is, we obtain

P A^T = [ L_1 ; L_2 ] U,   (A.21)

where L_1 is m × m (square) unit lower triangular, U is m × m upper triangular, and L_2 is a general (n − m) × m matrix. If A has full row rank, we can use this factorization to calculate its null space explicitly as the space spanned by the columns of the matrix

M = P^T [ −L_1^{-T} L_2^T ; I ].   (A.22)

It is easy to check that M has dimensions n × (n − m) and that AM = 0.
When A ∈ R^{n×n} is symmetric positive definite, it is possible to compute a similar but more specialized factorization at about half the cost (about n³/3 operations). This factorization, known as the Cholesky factorization, produces a matrix L such that

A = L L^T.   (A.23)

If we require L to have positive diagonal elements, it is uniquely defined by this formula. The algorithm can be specified as follows.
Algorithm A.2 (Cholesky Factorization).
Given A ∈ R^{n×n} symmetric positive definite;
for i = 1, 2, ..., n
    L_{ii} ← √(A_{ii});
    for j = i+1, i+2, ..., n
        L_{ji} ← A_{ji}/L_{ii};
        for k = i+1, i+2, ..., j
            A_{jk} ← A_{jk} − L_{ji} L_{ki};
        end (for)
    end (for)
end (for)
Note that this algorithm references only the lower triangular elements of A; in fact, it is only necessary to store these elements in any case, since by symmetry they are simply duplicated in the upper triangular positions.
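A short NumPy sketch of Algorithm A.2 follows; as the note above says, it touches only the lower triangle of A (the example matrix is an arbitrary illustration):

```python
# NumPy sketch of Algorithm A.2 (Cholesky factorization).
import numpy as np

def cholesky_lower(A):
    """Return lower triangular L with A = L L^T, A symmetric positive definite."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.zeros((n, n))
    for i in range(n):
        if A[i, i] <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[i, i] = np.sqrt(A[i, i])
        for j in range(i + 1, n):
            L[j, i] = A[j, i] / L[i, i]
        for j in range(i + 1, n):
            # update the remaining lower triangle (Schur complement)
            A[j, i + 1:j + 1] -= L[j, i] * L[i + 1:j + 1, i]
    return L

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = cholesky_lower(A)
assert np.allclose(L @ L.T, A)
```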
Unlike the case of Gaussian elimination, the Cholesky algorithm can produce a valid factorization of a symmetric positive definite matrix without swapping any rows or columns. However, a symmetric permutation (that is, reordering the rows and columns in the same way) can be used to improve the sparsity of the factor L. In this case, the algorithm produces a factorization of the form

P^T A P = L L^T

for some permutation matrix P.
The Cholesky factorization can be used to compute solutions of the system Ax = b by performing triangular forward- and back-substitutions with L and L^T, respectively, as in the case of the L and U factors produced by Gaussian elimination.
The Cholesky factorization can also be used to verify positive definiteness of a symmetric matrix A. If Algorithm A.2 runs to completion with all L_{ii} values well defined and positive, then A is positive definite.

Another useful factorization of rectangular matrices A ∈ R^{m×n} has the form

A P = Q R,   (A.24)

where
P is an n × n permutation matrix,
Q is m × m orthogonal, and
R is m × n upper triangular.
In the case of a square matrix (m = n), this factorization can be used to compute solutions of linear systems of the form Ax = b via the following procedure:

set b̃ = Q^T b;
solve Rz = b̃ for z by performing back-substitution;
set x = Pz by rearranging the elements of z.
For a dense matrix A, the cost of computing the QR factorization is about 4m²n/3 operations. In the case of a square matrix, the operation count is about twice as high as for an LU factorization via Gaussian elimination. Moreover, it is more difficult to maintain sparsity in a QR factorization than in an LU factorization.
Algorithms to perform QR factorization are almost as simple as algorithms for Gaussian elimination and for Cholesky factorization. The most widely used algorithms work by applying a sequence of special orthogonal matrices to A, known either as Householder transformations or Givens rotations, depending on the algorithm. We omit the details, and refer instead to Golub and Van Loan [136, Chapter 5] for a complete description.
In the case of a rectangular matrix A with m < n, we can use the QR factorization of A^T to find a matrix whose columns span the null space of A. To be specific, we write

A^T P = Q R = [ Q_1  Q_2 ] R,

where Q_1 consists of the first m columns of Q, and Q_2 contains the last n − m columns. It is easy to show that the columns of the matrix Q_2 span the null space of A. This procedure yields a more satisfactory basis matrix for the null space than the Gaussian elimination procedure (A.22), because the columns of Q_2 are orthogonal to each other and have unit length. It may be more expensive to compute, however, particularly in the case in which A is sparse.
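The construction is easy to reproduce numerically. The following sketch omits the column permutation P for simplicity (an assumption on our part; NumPy's qr does not pivot):

```python
# Null-space basis of a full-row-rank A (m < n) from the QR factorization of A^T.
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])            # m = 2, n = 3
m, n = A.shape
Q, R = np.linalg.qr(A.T, mode="complete")  # A^T = QR with Q an n-by-n orthogonal matrix
Q2 = Q[:, m:]                              # last n - m columns
assert np.allclose(A @ Q2, 0.0)            # columns of Q2 span Null(A)
```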
When A has full column rank, we can make an identification between the R factor in (A.24) and the Cholesky factorization. By multiplying the formula (A.24) by its transpose, we obtain

P^T A^T A P = R^T Q^T Q R = R^T R,

and by comparison with (A.23), we see that R^T is simply the Cholesky factor of the symmetric positive definite matrix P^T A^T A P. Recalling that L is uniquely defined when we restrict its diagonal elements to be positive, this observation implies that R is also uniquely defined for a given choice of permutation matrix P, provided that we enforce positivity of the diagonals of R. Note, too, that since we can rearrange (A.24) to read A P R^{-1} = Q, we can conclude that Q is also uniquely defined under these conditions.
Note that by definition of the Euclidean norm and the property (A.10), and the fact that the Euclidean norms of the matrices P and Q in (A.24) are both 1, we have that

‖A‖ = ‖Q R P^T‖ ≤ ‖Q‖ ‖R‖ ‖P^T‖ = ‖R‖,

while

‖R‖ = ‖Q^T A P‖ ≤ ‖Q^T‖ ‖A‖ ‖P‖ = ‖A‖.

We conclude from these two inequalities that ‖A‖ = ‖R‖. When A is square, we have by a similar argument that ‖A^{-1}‖ = ‖R^{-1}‖. Hence the Euclidean-norm condition number of A can be estimated by substituting R for A in the expression (A.11). This observation is significant because various techniques are available for estimating the condition number of triangular matrices R; see Golub and Van Loan [136, pp. 128–130] for a discussion.
SYMMETRIC INDEFINITE FACTORIZATION
When the matrix A is symmetric but indefinite, Algorithm A.2 will break down by trying to take the square root of a negative number. We can, however, produce a factorization, similar to the Cholesky factorization, of the form

P A P^T = L B L^T,   (A.25)

where L is unit lower triangular, B is a block diagonal matrix with blocks of dimension 1 or 2, and P is a permutation matrix. The first step of this symmetric indefinite factorization proceeds as follows. We identify a submatrix E of A that is suitable to be used as a pivot block. The precise criteria that can be used to choose E are described below, but we note here that E is either a single diagonal element of A (a 1 × 1 pivot block), or else the 2 × 2 block consisting of two diagonal elements of A (say, a_ii and a_jj) along with the corresponding off-diagonal elements (that is, a_ij and a_ji). In either case, E must be nonsingular. We then
find a permutation matrix P_1 that makes E a leading principal submatrix of A, that is,

P_1 A P_1^T = [ E  C^T ; C  H ],   (A.26)

and then perform a block factorization on this rearranged matrix, using E as the pivot block, to obtain

P_1 A P_1^T = [ I  0 ; C E^{-1}  I ] [ E  0 ; 0  H − C E^{-1} C^T ] [ I  E^{-1} C^T ; 0  I ].

The next step of the factorization consists in applying exactly the same process to H − C E^{-1} C^T, known as the remaining matrix or the Schur complement, which has dimension either (n − 1) × (n − 1) or (n − 2) × (n − 2). We now apply the same procedure recursively, terminating with the factorization (A.25). Here P is defined as a product of the permutation matrices from each step of the factorization, and B contains the pivot blocks E on its diagonal.
The symmetric indefinite factorization requires approximately n³/3 floating-point operations (the same as the cost of the Cholesky factorization of a positive definite matrix), but to this count we must add the cost of identifying suitable pivot blocks E and of performing the permutations, which can be considerable. There are various strategies for determining the pivot blocks, which have an important effect on both the cost of the factorization and its numerical properties. Ideally, our strategy for choosing E at each step of the factorization procedure should be inexpensive, should lead to at most modest growth in the elements of the remaining matrix at each step of the factorization, and should avoid excessive fill-in (that is, L should not be too much more dense than A).
A well-known strategy, due to Bunch and Parlett [43], searches the whole remaining matrix and identifies the largest-magnitude diagonal and largest-magnitude off-diagonal elements, denoting their respective magnitudes by ξ_dia and ξ_off. If the diagonal element whose magnitude is ξ_dia is selected to be a 1 × 1 pivot block, the element growth in the remaining matrix is bounded by the ratio ξ_dia/ξ_off. If this growth rate is acceptable, we choose this diagonal element to be the pivot block. Otherwise, we select the off-diagonal element whose magnitude is ξ_off (a_ij, say), and choose E to be the 2 × 2 submatrix that includes this element, that is,

E = [ a_ii  a_ij ; a_ij  a_jj ].
This pivoting strategy of Bunch and Parlett is numerically stable and is guaranteed to yield a matrix L whose maximum element is bounded by 2.781. Its drawback is that the evaluation of ξ_dia and ξ_off at each iteration requires many comparisons between floating-point numbers to be performed: O(n³) in total during the overall factorization. Since each comparison costs roughly the same as an arithmetic operation, this overhead is not insignificant.
The more economical pivoting strategy of Bunch and Kaufman [42] searches at most two columns of the working matrix at each stage and requires just O(n²) comparisons in total. Its rationale and details are somewhat tricky, and we refer the interested reader to the original paper [42] or to Golub and Van Loan [136, Section 4.4] for details. Unfortunately, this algorithm can give rise to arbitrarily large elements in the lower triangular factor L, making it unsuitable for use with a modified Cholesky strategy.
The bounded Bunch-Kaufman strategy is essentially a compromise between the Bunch-Parlett and Bunch-Kaufman strategies. It monitors the sizes of elements in L, accepting the inexpensive Bunch-Kaufman choice of pivot block when it yields only modest element growth, but searching further for an acceptable pivot when this growth is excessive. Its total cost is usually similar to that of Bunch-Kaufman, but in the worst case it can approach the cost of Bunch-Parlett.
So far, we have ignored the effect of the choice of pivot block E on the sparsity of the final L factor. This consideration is important when the matrix to be factored is large and sparse, since it greatly affects both the CPU time and the amount of storage required by the algorithm. Algorithms that modify the strategies above to take account of sparsity have been proposed by Duff et al. [97], Duff and Reid [95], and Fourer and Mehrotra [113].
SHERMAN-MORRISON-WOODBURY FORMULA
If the square nonsingular matrix A undergoes a rank-one update to become

Ā = A + a b^T,

where a, b ∈ R^n, then if Ā is nonsingular, we have

Ā^{-1} = A^{-1} − (A^{-1} a b^T A^{-1}) / (1 + b^T A^{-1} a).   (A.27)

It is easy to verify this formula: Simply multiply the definitions of Ā and Ā^{-1} together and check that they produce the identity.

This formula can be extended to higher-rank updates. Let U and V be matrices in R^{n×p} for some p between 1 and n. If we define

Ā = A + U V^T,

then Ā is nonsingular if and only if (I + V^T A^{-1} U) is nonsingular, and in this case we have

Ā^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}.   (A.28)

We can use this formula to solve linear systems of the form Ā x = d. Since

x = Ā^{-1} d = A^{-1} d − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1} d,

we see that x can be found by solving p + 1 linear systems with the matrix A (to obtain A^{-1} d and A^{-1} U), inverting the p × p matrix I + V^T A^{-1} U, and performing some elementary matrix algebra. Inversion of the p × p matrix I + V^T A^{-1} U is inexpensive when p ≪ n.
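A minimal sketch of this update-based solve follows; the dimensions and random data are arbitrary illustrations, and a single LU factorization of A is reused for all p + 1 solves:

```python
# Sherman-Morrison-Woodbury solve for (A + U V^T) x = d, reusing one factorization of A.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n, p = 6, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)) + n * np.eye(n)   # safely nonsingular example
U = rng.standard_normal((n, p))
V = rng.standard_normal((n, p))
d = rng.standard_normal(n)

fac = lu_factor(A)
Ainv_d = lu_solve(fac, d)                 # A^{-1} d
Ainv_U = lu_solve(fac, U)                 # A^{-1} U (p extra solves)
small = np.eye(p) + V.T @ Ainv_U          # the p-by-p matrix I + V^T A^{-1} U
x = Ainv_d - Ainv_U @ np.linalg.solve(small, V.T @ Ainv_d)
assert np.allclose((A + U @ V.T) @ x, d)
```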
INTERLACING EIGENVALUE THEOREM

The following result is proved, for example, in Golub and Van Loan [136, Theorem 8.1.8].

Theorem A.1 (Interlacing Eigenvalue Theorem).
Let A ∈ R^{n×n} be a symmetric matrix with eigenvalues λ_1, λ_2, ..., λ_n satisfying

λ_1 ≥ λ_2 ≥ ... ≥ λ_n,

and let z ∈ R^n be a vector with ‖z‖ = 1, and α ∈ R be a scalar. Then if we denote the eigenvalues of A + α z z^T by ξ_1, ξ_2, ..., ξ_n (in decreasing order), we have for α > 0 that

ξ_1 ≥ λ_1 ≥ ξ_2 ≥ λ_2 ≥ ξ_3 ≥ ... ≥ ξ_n ≥ λ_n,

with

Σ_{i=1}^{n} (ξ_i − λ_i) = α.   (A.29)

If α < 0, we have that

λ_1 ≥ ξ_1 ≥ λ_2 ≥ ξ_2 ≥ λ_3 ≥ ... ≥ λ_n ≥ ξ_n,

where the relationship (A.29) is again satisfied.

Informally stated, the eigenvalues of the modified matrix "interlace" the eigenvalues of the original matrix, with nonnegative adjustments if the coefficient α is positive, and nonpositive adjustments if α is negative. The total magnitude of the adjustments equals α, whose magnitude is identical to the Euclidean norm ‖α z z^T‖_2 of the modification.
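The theorem is easy to observe numerically. The following sketch (with an arbitrary random symmetric matrix of our choosing) checks the interlacing inequalities and the adjustment identity (A.29) for a positive α:

```python
# Numerical illustration of the interlacing eigenvalue theorem.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                        # symmetric test matrix
z = rng.standard_normal(5)
z /= np.linalg.norm(z)                   # unit vector
alpha = 0.9

lam = np.sort(np.linalg.eigvalsh(A))[::-1]
xi = np.sort(np.linalg.eigvalsh(A + alpha * np.outer(z, z)))[::-1]

assert np.all(xi >= lam)                    # nonnegative adjustments (alpha > 0)
assert np.all(xi[1:] <= lam[:-1])           # interlacing
assert np.isclose((xi - lam).sum(), alpha)  # total adjustment equals alpha
```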
ERROR ANALYSIS AND FLOATINGPOINT ARITHMETIC
In most of this book our algorithms and analysis deal with real numbers. Modern digital computers, however, cannot store or compute with general real numbers. Instead,
they work with a subset known as floatingpoint numbers. Any quantities that are stored on the computer, whether they are read directly from a file or program or arise as the intermediate result of a computation, must be approximated by a floatingpoint number. In general, then, the numbers that are produced by practical computation differ from those that would be produced if the arithmetic were exact. Of course, we try to perform our computations in such a way that these differences are as tiny as possible.
Discussion of errors requires us to distinguish between absolute error and relative error. If x is some exact quantity (scalar, vector, or matrix) and x̄ is its approximate value, the absolute error is the norm of the difference, namely, ‖x − x̄‖. In general, any of the norms (A.2a), (A.2b), and (A.2c) can be used in this definition. The relative error is the ratio of the absolute error to the size of the exact quantity, that is,

‖x − x̄‖ / ‖x‖.

When this ratio is significantly less than one, we can replace the denominator by the size of the approximate quantity, ‖x̄‖, without affecting its value very much.
Most computations associated with optimization algorithms are performed in double-precision arithmetic. Double-precision numbers are stored in words of length 64 bits. Most of these bits (say t) are devoted to storing the fractional part, while the remainder encode the exponent e and other information, such as the sign of the number, or an indication of whether it is zero or undefined. Typically, the fractional part has the form

.d_1 d_2 ... d_t,

where each d_i, i = 1, 2, ..., t, is either zero or one. (In some systems d_1 is implicitly assumed to be 1 and is not stored.) The value of the floating-point number is then

( Σ_{i=1}^{t} d_i 2^{-i} ) × 2^e.
The value 2^{-t-1} is known as unit roundoff and is denoted by u. Any real number whose absolute value lies in the range [2^L, 2^U], where L and U are lower and upper bounds on the value of the exponent e, can be approximated to within a relative accuracy of u by a floating-point number; that is,

fl(x) = x(1 + ε), where |ε| ≤ u,   (A.30)

where fl(·) denotes floating-point approximation. The value of u for double-precision IEEE arithmetic is about 1.1 × 10^{-16}. In other words, if the real number x and its floating-point approximation are both written as base-10 numbers (the usual fashion), they agree to at least 15 digits.
For further information on floating-point computations, see Overton [233], Golub and Van Loan [136, Section 2.4], and Higham [169].
When an arithmetic operation is performed with one or two floating-point numbers, the result must also be stored as a floating-point number. This process introduces a small roundoff error, whose size can be quantified in terms of the size of the arguments. If x and y are two floating-point numbers, we have that

|fl(x ⊛ y) − (x ⊛ y)| ≤ u |x ⊛ y|,   (A.31)

where ⊛ denotes any of the operations +, −, ×, ÷.
Although the error in a single floating-point operation appears benign, more significant errors may occur when the arguments x and y are floating-point approximations of two real numbers, or when a sequence of computations are performed in succession. Suppose, for instance, that x and y are large real numbers whose values are very similar. When we store them in a computer, we approximate them with floating-point numbers fl(x) and fl(y) that satisfy

fl(x) = x + ε_x,  fl(y) = y + ε_y,  where |ε_x| ≤ u|x|, |ε_y| ≤ u|y|.

If we take the difference of the two stored numbers, we obtain a final result fl(fl(x) − fl(y)) that satisfies

fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + ε_xy), where |ε_xy| ≤ u.

By combining these expressions, we find that the difference between this result and the true value x − y may be as large as

ε_x + ε_y + ε_xy (fl(x) − fl(y)),

which is bounded by u(|x| + |y| + |x − y|). Hence, since x and y are large and close together, the relative error is approximately 2u|x|/|x − y|, which may be quite large, since |x| ≫ |x − y|.
This phenomenon is known as cancellation. It can also be explained (less formally) by noting that if both x and y are accurate to k digits, and if they agree in the first k̄ digits, then their difference will contain only about k − k̄ significant digits: the first k̄ digits cancel each other out. This observation is the reason for the well-known adage of numerical computing that one should avoid taking the difference of two similar numbers if at all possible.
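Both unit roundoff and cancellation are easy to see in a few lines of Python (the particular numbers below are arbitrary illustrations):

```python
# Unit roundoff and cancellation in IEEE double precision.
import numpy as np

u = np.finfo(np.float64).eps / 2   # unit roundoff, about 1.1e-16
print(u)

x = 1.000000001e8                  # two large, nearly equal numbers
y = 1.000000000e8
# The exact difference is 0.1, but roughly half the stored digits cancel:
print(x - y)                       # prints 0.100000001..., accurate to ~8 digits only
```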
CONDITIONING AND STABILITY
Conditioning and stability are two terms that are used frequently in connection with numerical computations. Unfortunately, their meaning sometimes varies from author to author, but the general definitions below are widely accepted, and we adhere to them in this book.
Conditioning is a property of the numerical problem at hand whether it is a linear algebra problem, an optimization problem, a differential equations problem, or whatever. A problem is said to be well conditioned if its solution is not affected greatly by small perturbations to the data that define the problem. Otherwise, it is said to be ill conditioned.
A simple example is given by the following 2 × 2 system of linear equations:

[ 1  2 ; 1  1 ] [ x_1 ; x_2 ] = [ 3 ; 2 ].

By computing the inverse of the coefficient matrix, we find that the solution is simply

[ x_1 ; x_2 ] = [ −1  2 ; 1  −1 ] [ 3 ; 2 ] = [ 1 ; 1 ].
If we replace the first right-hand-side element by 3.00001, the solution becomes (x_1, x_2)^T = (0.99999, 1.00001)^T, which is only slightly different from its exact value (1, 1)^T. We would note similar insensitivity if we were to perturb the other elements of the right-hand side or elements of the coefficient matrix. We conclude that this problem is well conditioned. On the other hand, the problem
[ 1.00001  1 ; 1  1 ] [ x_1 ; x_2 ] = [ 2.00001 ; 2 ]

is ill conditioned. Its exact solution is x = (1, 1)^T, but if we change the first element of the right-hand side from 2.00001 to 2, the solution would change drastically to x = (0, 2)^T.
For general square linear systems Ax = b where A ∈ R^{n×n}, the condition number κ(A) of the matrix defined in (A.11) can be used to quantify the conditioning. Specifically, if we perturb A to Ā and b to b̄ and take x̄ to be the solution of the perturbed system Ā x̄ = b̄, it can be shown that

‖x − x̄‖/‖x‖ ≈ κ(A) [ ‖A − Ā‖/‖A‖ + ‖b − b̄‖/‖b‖ ]

(see, for instance, Golub and Van Loan [136, Section 2.7]). Hence, a large condition number κ(A) indicates that the problem Ax = b is ill conditioned, while a modest value indicates well conditioning.
Note that the concept of conditioning has nothing to do with the particular algorithm that is used to solve the problem, only with the numerical problem itself.
Stability, on the other hand, is a property of the algorithm. An algorithm is stable if it is guaranteed to produce accurate answers to all wellconditioned problems in its class, even when floatingpoint arithmetic is used.
As an example, consider again the linear equations Ax = b. We can show that Algorithm A.1, in combination with triangular substitution, yields a computed solution x̄ whose relative error is approximately

‖x − x̄‖/‖x‖ ≈ κ(A) (growth(A)/‖A‖) u,   (A.32)

where growth(A) is the size of the largest element that arises in A during execution of Algorithm A.1. In the worst case, we can show that growth(A)/‖A‖ may be around 2^{n−1}, which indicates that Algorithm A.1 is an unstable algorithm, since even for modest n (say, n = 200), the right-hand side of (A.32) may be large even when κ(A) is modest. In practice, however, large growth factors are rarely observed, so we conclude that Algorithm A.1 is stable for all practical purposes.
Gaussian elimination without pivoting, on the other hand, is definitely unstable. If we omit the possible exchange of rows in Algorithm A.1, the algorithm will fail to produce a factorization even of some well-conditioned matrices, such as

A = [ 0  1 ; 1  2 ].
For systems Ax = b in which A is symmetric positive definite, the Cholesky factorization in combination with triangular substitution constitutes a stable algorithm for producing a solution x.
A.2 ELEMENTS OF ANALYSIS, GEOMETRY, TOPOLOGY
SEQUENCES
Suppose that {x_k} is a sequence of points belonging to R^n. We say that a sequence {x_k} converges to some point x, written lim_{k→∞} x_k = x, if for any ε > 0, there is an index K such that

‖x_k − x‖ ≤ ε, for all k ≥ K.

For example, the sequence x_k = (1 + 2^{-k}, 1/k²)^T converges to (1, 0)^T.
Given an index set S ⊂ {1, 2, 3, ...}, we can define a subsequence of {t_k} corresponding to S, and denote it by {t_k}_{k∈S}.
We say that x ∈ R^n is an accumulation point or limit point for {x_k} if there is an infinite set of indices k_1, k_2, k_3, ... such that the subsequence {x_{k_i}}_{i=1,2,3,...} converges to x; that is,

lim_{i→∞} x_{k_i} = x.

Alternatively, x is an accumulation point if for any ε > 0 and all positive integers K, we have

‖x_k − x‖ ≤ ε, for some k ≥ K.
An example is given by the sequence

(1, 1)^T, (1/2, 1/2)^T, (1, 1)^T, (1/4, 1/4)^T, (1, 1)^T, (1/8, 1/8)^T, ...,   (A.33)

which has exactly two limit points: x = (0, 0)^T and x = (1, 1)^T. A sequence can even have an infinite number of limit points. An example is the sequence x_k = sin k, for which every point in the interval [−1, 1] is a limit point. A bounded sequence converges if and only if it has exactly one limit point.
A sequence is said to be a Cauchy sequence if for any ε > 0, there exists an integer K such that ‖x_k − x_l‖ ≤ ε for all indices k ≥ K and l ≥ K. A sequence converges if and only if it is a Cauchy sequence.
We now consider scalar sequences {t_k}, that is, t_k ∈ R for all k. This sequence is said to be bounded above if there exists a scalar u such that t_k ≤ u for all k, and bounded below if there is a scalar v with t_k ≥ v for all k. The sequence {t_k} is said to be nondecreasing if t_{k+1} ≥ t_k for all k, and nonincreasing if t_{k+1} ≤ t_k for all k. If {t_k} is nondecreasing and bounded above, then it converges, that is, lim_{k→∞} t_k = t for some scalar t. Similarly, if {t_k} is nonincreasing and bounded below, it converges.
We define the supremum of the scalar sequence {t_k} as the smallest real number u such that t_k ≤ u for all k = 1, 2, 3, ..., and denote it by sup{t_k}. The infimum, denoted by inf{t_k}, is the largest real number v such that v ≤ t_k for all k = 1, 2, 3, .... We can now define the sequence of suprema {u_i}, where

u_i := sup{t_k | k ≥ i}.

Clearly, {u_i} is a nonincreasing sequence. If it is bounded below, it converges to a finite number ū, which we call the lim sup of {t_k}, denoted by lim sup t_k. Similarly, we can define the sequence of infima {v_i}, where

v_i := inf{t_k | k ≥ i},

which is nondecreasing. If {v_i} is bounded above, it converges to a point v̄, which we call the lim inf of {t_k}, denoted by lim inf t_k. As an example, the sequence 1, 1/2, 1, 1/4, 1, 1/8, ... has a lim inf of 0 and a lim sup of 1.
RATES OF CONVERGENCE
One of the key measures of performance of an algorithm is its rate of convergence. Here, we define the terminology associated with different types of convergence.
Let {x_k} be a sequence in R^n that converges to x*. We say that the convergence is Q-linear if there is a constant r ∈ (0, 1) such that

‖x_{k+1} − x*‖ / ‖x_k − x*‖ ≤ r, for all k sufficiently large.   (A.34)

This means that the distance to the solution x* decreases at each iteration by at least a constant factor bounded away from 1. For example, the sequence 1 + (0.5)^k converges Q-linearly to 1, with rate r = 0.5. The prefix Q stands for "quotient," because this type of convergence is defined in terms of the quotient of successive errors.
The convergence is said to be Q-superlinear if

lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 0.

For example, the sequence 1 + k^{-k} converges superlinearly to 1. (Prove this statement!)
Q-quadratic convergence, an even more rapid convergence rate, is obtained if

‖x_{k+1} − x*‖ / ‖x_k − x*‖² ≤ M, for all k sufficiently large,

where M is a positive constant, not necessarily less than 1. An example is the sequence 1 + (0.5)^{2^k}.
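The three definitions are easy to visualize by printing the error ratios for the example sequences just given (a small sketch of ours):

```python
# Error ratios for the Q-linear, Q-superlinear, and Q-quadratic examples above.
for k in range(1, 6):
    lin = 0.5 ** (k + 1) / 0.5 ** k                        # -> constant r = 0.5
    sup = (k + 1) ** -(k + 1) / k ** -k                    # -> 0
    quad = 0.5 ** (2 ** (k + 1)) / (0.5 ** (2 ** k)) ** 2  # == 1, so M = 1 works
    print(f"k={k}: linear {lin:.3f}, superlinear {sup:.3e}, quadratic {quad:.3f}")
```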
The speed of convergence depends on r (and, more weakly, on M), whose values depend not only on the algorithm but also on the properties of the particular problem. Regardless of these values, however, a quadratically convergent sequence will always eventually converge faster than a linearly convergent sequence.
Obviously, any sequence that converges Q-quadratically also converges Q-superlinearly, and any sequence that converges Q-superlinearly also converges Q-linearly. We can also define higher rates of convergence (cubic, quartic, and so on), but these are less interesting in practical terms. In general, we say that the Q-order of convergence is p (with p > 1) if there is a positive constant M such that

‖x_{k+1} − x*‖ / ‖x_k − x*‖^p ≤ M, for all k sufficiently large.
Quasi-Newton methods for unconstrained optimization typically converge Q-superlinearly, whereas Newton's method converges Q-quadratically under appropriate assumptions. In contrast, steepest descent algorithms converge only at a Q-linear rate, and when the problem is ill-conditioned the convergence constant r in (A.34) is close to 1.
In the book, we omit the letter Q and simply talk about superlinear convergence, quadratic convergence, and so on.
A slightly weaker form of convergence, characterized by the prefix R (for "root"), is concerned with the overall rate of decrease in the error, rather than the decrease over each individual step of the algorithm. We say that convergence is R-linear if there is a sequence of nonnegative scalars {ν_k} such that

‖x_k − x*‖ ≤ ν_k for all k, and {ν_k} converges Q-linearly to zero.

The sequence {‖x_k − x*‖} is said to be dominated by {ν_k}. For instance, the sequence

x_k = { 1 + (0.5)^k, k even; 1, k odd }   (A.35)

(the first few iterates are 2, 1, 1.25, 1, 1.0625, 1, ...) converges R-linearly to 1, because we have |x_k − 1| ≤ (0.5)^k, and the sequence {(0.5)^k} converges Q-linearly to zero. Likewise, we say that {x_k} converges R-superlinearly to x* if {‖x_k − x*‖} is dominated by a sequence of scalars converging Q-superlinearly to zero, and {x_k} converges R-quadratically to x* if {‖x_k − x*‖} is dominated by a sequence converging Q-quadratically to zero.
Note that in the Rlinear sequence A.35, the error actually increases at every second iteration! Such behavior occurs even in sequences whose Rrate of convergence is arbitrarily high, but it cannot occur for Qlinear sequences, which insist on a decrease at every step k, for k sufficiently large.
For an extensive discussion of convergence rates see Ortega and Rheinboldt [230].

TOPOLOGY OF THE EUCLIDEAN SPACE R^n
The set F is bounded if there is some real number M > 0 such that

‖x‖ ≤ M, for all x ∈ F.

A subset F ⊂ R^n is open if for every x ∈ F, we can find a positive number ε > 0 such that the ball of radius ε around x is contained in F; that is,

{y ∈ R^n | ‖y − x‖ ≤ ε} ⊂ F.
The set F is closed if for all possible sequences of points {x_k} in F, all limit points of {x_k} are elements of F. For instance, the set F = (0, 1) ∪ (2, 10) is an open subset of R, while F = [0, 1] ∪ [2, 5] is a closed subset of R. The set F = (0, 1] is a subset of R that is neither open nor closed.
The interior of a set F, denoted by int F, is the largest open set contained in F. The closure of F, denoted by cl F, is the smallest closed set containing F. In other words, we have x ∈ cl F if lim_{k→∞} x_k = x for some sequence {x_k} of points in F. If F = (−1, 1] ∪ [2, 4), then

cl F = [−1, 1] ∪ [2, 4],  int F = (−1, 1) ∪ (2, 4).
Note that if F is open, then int F = F, while if F is closed, then cl F = F.
We note the following facts about open and closed sets. The union of finitely many closed sets is closed, while any intersection of closed sets is closed. The intersection of finitely
many open sets is open, while any union of open sets is open.
The set F is compact if every sequence {x_k} of points in F has at least one limit point, and all such limit points are in F. (This definition is equivalent to the more formal one involving covers of F.) The following is a central result in topology:

F ⊂ R^n is closed and bounded ⇔ F is compact.
Given a point x ∈ R^n, we call N ⊂ R^n a neighborhood of x if it is an open set containing x. An especially useful neighborhood is the open ball of radius ε around x, which is denoted by IB(x, ε); that is,

IB(x, ε) = {y | ‖y − x‖ < ε}.

Given a set F ⊂ R^n, we say that N is a neighborhood of F if there is ε > 0 such that

∪_{x∈F} IB(x, ε) ⊂ N.
CONVEX SETS IN R^n

A convex combination of a finite set of vectors {x_1, x_2, ..., x_m} in R^n is any vector x of the form

x = Σ_{i=1}^{m} α_i x_i, where Σ_{i=1}^{m} α_i = 1, and α_i ≥ 0 for all i = 1, 2, ..., m.

The convex hull of {x_1, x_2, ..., x_m} is the set of all convex combinations of these vectors.

A cone is a set F with the property that for all x ∈ F we have

x ∈ F ⇒ αx ∈ F, for all α > 0.   (A.36)
For instance, the set F ⊂ R² defined by

{(x_1, x_2)^T | x_1 ≥ 0, x_2 ≥ 0}

is a cone in R². Note that cones are not necessarily convex. For example, the set {(x_1, x_2)^T | x_1 ≥ 0 or x_2 ≥ 0}, which encompasses three quarters of the two-dimensional plane, is a cone.
The cone generated by {x_1, x_2, ..., x_m} is the set of all vectors x of the form

x = Σ_{i=1}^{m} α_i x_i, where α_i ≥ 0 for all i = 1, 2, ..., m.

Note that all cones of this form are convex.

Finally, we define the affine hull and relative interior of a set. An affine set in R^n is the set of all vectors {x} ⊕ S, where x ∈ R^n and S is a subspace of R^n. Given F ⊂ R^n, the affine hull of F (denoted by aff F) is the smallest affine set containing F. For instance, when F is the ice-cream cone defined in three dimensions as

F = {x ∈ R³ | x_3 ≥ 2√(x_1² + x_2²)}   (A.37)

(see Figure A.1), we have aff F = R³. If F is the set of two isolated points F = {(1, 0, 0)^T, (0, 2, 0)^T}, we have

aff F = {(1, 0, 0)^T + α(−1, 2, 0)^T | for all α ∈ R}.
Figure A.1 Ice-cream cone set.
The relative interior ri F of the set F is its interior relative to aff F. That is, if x ∈ F, then x ∈ ri F if there is an ε > 0 such that

IB(x, ε) ∩ aff F ⊂ F.

Referring again to the ice-cream cone (A.37), we have that

ri F = {x ∈ R³ | x_3 > 2√(x_1² + x_2²)}.

For the set of two isolated points F = {(1, 0, 0)^T, (0, 2, 0)^T}, we have ri F = ∅. For the set F defined by

F := {x ∈ R³ | x_1 ∈ [0, 1], x_2 ∈ [0, 1], x_3 = 0},

we have that

aff F = R × R × {0},  ri F = {x ∈ R³ | x_1 ∈ (0, 1), x_2 ∈ (0, 1), x_3 = 0}.
CONTINUITY AND LIMITS
Let f be a function that maps some domain D ⊂ R^n to the space R^m. For some point x_0 ∈ cl D, we write

lim_{x→x_0} f(x) = f_0   (A.38)

(spoken "the limit of f(x) as x approaches x_0 is f_0") if for all ε > 0, there is a value δ > 0 such that

‖x − x_0‖ < δ and x ∈ D ⇒ ‖f(x) − f_0‖ < ε.

We say that f is continuous at x_0 if x_0 ∈ D and the expression (A.38) holds with f_0 = f(x_0). We say that f is continuous on its domain D if f is continuous for all x_0 ∈ D.
An example is provided by the function

f(x) = { x, if x ∈ [−1, 1] and x ≠ 0; 5, for all other x ∈ [−10, 10].   (A.39)

This function is defined on the domain [−10, 10] and is continuous at all points of the domain except the points x = 0, x = 1, and x = −1. At x = 0, the expression (A.38) holds with f_0 = 0, but the function is not continuous at this point because f_0 ≠ f(0) = 5. At x = 1, the limit (A.38) is not defined, because the function values in the neighborhood of this point are close to both 5 and 1, depending on whether x is slightly smaller or slightly larger than 1. Hence, the function is certainly not continuous at this point. The same comments apply to the point x = −1.
In the special case of n = 1 (that is, the argument of f is a real scalar), we can also define the one-sided limit. Given x_0 ∈ cl D, we write

lim_{x↓x_0} f(x) = f_0   (A.40)

(spoken "the limit of f(x) as x approaches x_0 from above is f_0") if for all ε > 0, there is a value δ > 0 such that

x_0 < x < x_0 + δ and x ∈ D ⇒ |f(x) − f_0| < ε.

Similarly, we write

lim_{x↑x_0} f(x) = f_0   (A.41)

(spoken "the limit of f(x) as x approaches x_0 from below is f_0") if for all ε > 0, there is a value δ > 0 such that

x_0 − δ < x < x_0 and x ∈ D ⇒ |f(x) − f_0| < ε.

For the function defined in (A.39), we have that

lim_{x↓1} f(x) = 5,  lim_{x↑1} f(x) = 1.
Consider again the general case of f : D → R^m, where D ⊂ R^n, for general m and n. The function f is said to be Lipschitz continuous on some set N ⊂ D if there is a constant L > 0 such that

‖f(x_1) − f(x_0)‖ ≤ L ‖x_1 − x_0‖, for all x_0, x_1 ∈ N.   (A.42)

(L is called the Lipschitz constant.) The function f is locally Lipschitz continuous at a point x ∈ int D if there is some neighborhood N of x with N ⊂ D such that the property (A.42) holds for some L > 0.
If g and h are two functions mapping D ⊂ R^n to R^m, both Lipschitz continuous on a set N ⊂ D, their sum g + h is also Lipschitz continuous, with Lipschitz constant equal to the sum of the Lipschitz constants for g and h individually. If g and h are two functions mapping D ⊂ R^n to R, the product gh is Lipschitz continuous on a set N ⊂ D if both g and h are Lipschitz continuous on N and both are bounded on N (that is, there is M > 0 such that |g(x)| ≤ M and |h(x)| ≤ M for all x ∈ N). We prove this claim via a sequence of elementary inequalities, for arbitrary x_0, x_1 ∈ N:

|g(x_0)h(x_0) − g(x_1)h(x_1)|
  ≤ |g(x_0)h(x_0) − g(x_1)h(x_0)| + |g(x_1)h(x_0) − g(x_1)h(x_1)|
  = |h(x_0)| |g(x_0) − g(x_1)| + |g(x_1)| |h(x_0) − h(x_1)|
  ≤ 2ML ‖x_0 − x_1‖,   (A.43)

where L is an upper bound on the Lipschitz constant for both g and h.

DERIVATIVES
Let φ : R → R be a real-valued function of a real variable (sometimes known as a univariate function). The first derivative φ'(α) is defined by

dφ/dα = φ'(α) := lim_{ε→0} [φ(α + ε) − φ(α)] / ε.   (A.44)

The second derivative is obtained by substituting φ by φ' in this same formula; that is,

d²φ/dα² = φ''(α) := lim_{ε→0} [φ'(α + ε) − φ'(α)] / ε.   (A.45)

Suppose now that α in turn depends on another quantity β (we denote this dependence by writing α = α(β)). We can use the chain rule to calculate the derivative of φ with respect to β:

dφ(α(β))/dβ = (dφ/dα)(dα/dβ).   (A.46)
Consider now the function f : R^n → R, which is a real-valued function of n independent variables. We typically gather the variables into a vector x = (x_1, x_2, ..., x_n)^T. We say that f is differentiable at x if there exists a vector g ∈ R^n such that

lim_{‖y‖→0} [f(x + y) − f(x) − g^T y] / ‖y‖ = 0,   (A.47)

where ‖·‖ is any vector norm of y. (This type of differentiability is known as Frechet differentiability.) If g satisfying (A.47) exists, we call it the gradient of f at x, and denote it by ∇f(x), written componentwise as

∇f(x) = [ ∂f/∂x_1 ; ∂f/∂x_2 ; ... ; ∂f/∂x_n ].   (A.48)

Here, ∂f/∂x_i represents the partial derivative of f with respect to x_i. By setting y = εe_i in (A.47), where e_i is the vector in R^n consisting of all zeros, except for a 1 in position i, we obtain

∂f/∂x_i := lim_{ε→0} [f(x_1, ..., x_{i−1}, x_i + ε, x_{i+1}, ..., x_n) − f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n)] / ε
         = lim_{ε→0} [f(x + εe_i) − f(x)] / ε.

A gradient with respect to only a subset of the unknowns can be expressed by means of a subscript on the symbol ∇. Thus for the function of two vector variables f(z, t), we use ∇_z f(z, t) to denote the gradient with respect to z (holding t constant).

The matrix of second partial derivatives of f is known as the Hessian, and is defined as

∇²f(x) = [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2  ...  ∂²f/∂x_1∂x_n
           ∂²f/∂x_2∂x_1    ∂²f/∂x_2²     ...  ∂²f/∂x_2∂x_n
           ...
           ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2  ...  ∂²f/∂x_n² ].
We say that f is differentiable on a domain D if ∇f(x) exists for all x ∈ D, and continuously differentiable if ∇f(x) is a continuous function of x. Similarly, f is twice differentiable on D if ∇²f(x) exists for all x ∈ D, and twice continuously differentiable if ∇²f(x) is continuous on D. Note that when f is twice continuously differentiable, the Hessian is a symmetric matrix, since

∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, for all i, j = 1, 2, ..., n.
When f is a vector-valued function (that is, f : R^n → R^m; see Chapters 10 and 11), we define ∇f(x) to be the n × m matrix whose ith column is ∇f_i(x), that is, the gradient of f_i with respect to x. Often, for notational convenience, we prefer to work with the transpose of this matrix, which has dimensions m × n. This matrix is called the Jacobian and is often denoted by J(x). Specifically, the (i, j) element of J(x) is ∂f_i/∂x_j.
When the vector x in turn depends on another vector t (that is, x = x(t)), we can extend the chain rule (A.46) for the univariate function. Defining

h(t) = f(x(t)),   (A.49)

we have

∇h(t) = Σ_{i=1}^{n} (∂f/∂x_i) ∇x_i(t) = ∇x(t) ∇f(x(t)).   (A.50)
EXAMPLE A.1

Let f : R² → R be defined by f(x_1, x_2) = x_1² + x_1 x_2, where x_1 = sin(t_1 t_2) and x_2 = t_1 + t_2². Defining h(t) as in (A.49), the chain rule (A.50) yields

∇h(t) = Σ_{i=1}^{2} (∂f/∂x_i) ∇x_i(t)
      = (2x_1 + x_2) [ t_2 cos(t_1 t_2) ; t_1 cos(t_1 t_2) ] + x_1 [ 1 ; 2t_2 ]
      = (2 sin(t_1 t_2) + t_1 + t_2²) [ t_2 cos(t_1 t_2) ; t_1 cos(t_1 t_2) ] + sin(t_1 t_2) [ 1 ; 2t_2 ].

If, on the other hand, we substitute directly for x into the definition of f, we obtain

h(t) = f(x(t)) = [sin(t_1 t_2)]² + sin(t_1 t_2)(t_1 + t_2²).

The reader should verify that the gradient of this expression is identical to the one obtained above by applying the chain rule.
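One quick way to perform that verification is numerically, by comparing the chain-rule gradient with a central finite difference at an arbitrary test point (a sketch of ours, with loose tolerances):

```python
# Finite-difference check of the chain-rule gradient in Example A.1.
import numpy as np

def h(t):
    x1 = np.sin(t[0] * t[1])
    x2 = t[0] + t[1] ** 2
    return x1 ** 2 + x1 * x2

def grad_h(t):                        # chain-rule expression derived above
    t1, t2 = t
    x1, x2 = np.sin(t1 * t2), t1 + t2 ** 2
    g1 = (2 * x1 + x2) * t2 * np.cos(t1 * t2) + x1 * 1.0
    g2 = (2 * x1 + x2) * t1 * np.cos(t1 * t2) + x1 * 2 * t2
    return np.array([g1, g2])

t = np.array([0.7, -0.3])
eps = 1e-7
fd = np.array([(h(t + eps * e) - h(t - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(grad_h(t), fd, atol=1e-6)
```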
Special cases of the chain rule can be derived when x(t) in (A.50) is a linear function of t, say x(t) = Ct. We then have ∇x(t) = C^T, so that

∇h(t) = C^T ∇f(Ct).

In the case in which f is a scalar function, we can differentiate twice using the chain rule to obtain

∇²h(t) = C^T ∇²f(Ct) C.

(The proof of this statement is left as an exercise.)
DIRECTIONAL DERIVATIVES
The directional derivative of a function f : R^n → R in the direction p is given by

D(f(x); p) := lim_{ε→0} [f(x + εp) − f(x)] / ε.   (A.51)

The directional derivative may be well defined even when f is not continuously differentiable; in fact, it is most useful in such situations. Consider for instance the ℓ_1 norm function f(x) = ‖x‖_1. We have from the definition (A.51) that

D(‖x‖_1; p) = lim_{ε→0} [‖x + εp‖_1 − ‖x‖_1] / ε = lim_{ε→0} [Σ_{i=1}^{n} |x_i + εp_i| − Σ_{i=1}^{n} |x_i|] / ε.

If x_i > 0, we have |x_i + εp_i| = |x_i| + εp_i for all ε sufficiently small. If x_i < 0, we have |x_i + εp_i| = |x_i| − εp_i, while if x_i = 0, we have |x_i + εp_i| = ε|p_i|. Therefore, we have

D(‖x‖_1; p) = Σ_{i | x_i > 0} p_i − Σ_{i | x_i < 0} p_i + Σ_{i | x_i = 0} |p_i|,

so the directional derivative of this function exists for any x and p. The first derivative ∇f(x) does not exist, however, whenever any of the components of x are zero.
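This formula is easy to sanity-check against the one-sided difference quotient in (A.51) (the test vectors below are arbitrary choices of ours):

```python
# Numerical check of the directional derivative of f(x) = ||x||_1.
import numpy as np

def dir_deriv_l1(x, p):
    return p[x > 0].sum() - p[x < 0].sum() + np.abs(p[x == 0]).sum()

x = np.array([1.0, -2.0, 0.0])
p = np.array([0.3, 0.7, -1.1])
eps = 1e-8
fd = (np.abs(x + eps * p).sum() - np.abs(x).sum()) / eps
print(dir_deriv_l1(x, p), fd)   # both print about 0.7
```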
When f is in fact continuously differentiable in a neighborhood of x, we have

D(f(x); p) = ∇f(x)^T p.   (A.52)

To verify this formula, we define the function

φ(α) = f(x + αp) = f(y),

where y = x + αp. Note that

lim_{α→0} [f(x + αp) − f(x)] / α = lim_{α→0} [φ(α) − φ(0)] / α = φ'(0).

By applying the chain rule (A.50) to f(y), we obtain

φ'(α) = Σ_{i=1}^{n} (∂f/∂y_i)(∂y_i/∂α) = Σ_{i=1}^{n} (∂f/∂y_i) p_i = ∇f(y)^T p = ∇f(x + αp)^T p.   (A.53)

We obtain (A.52) by setting α = 0 and comparing the last two expressions.

MEAN VALUE THEOREM
We now recall the mean value theorem for univariate functions. Given a continuously differentiable function φ : R → R and two real numbers α_0 and α_1 that satisfy α_1 > α_0, we have that

φ(α_1) = φ(α_0) + φ'(ξ)(α_1 − α_0)   (A.54)

for some ξ ∈ (α_0, α_1). An extension of this result to a multivariate function f : R^n → R is that for any vector p we have

f(x + p) = f(x) + ∇f(x + αp)^T p,   (A.55)

for some α ∈ (0, 1). (This result can be proved by defining φ(α) = f(x + αp), α_0 = 0, and α_1 = 1 and applying the chain rule, as above.)
EXAMPLE A.2

Consider f : R² → R defined by f(x) = x_1³ + 3x_1 x_2², and let x = (0, 0)^T and p = (1, 2)^T. It is easy to verify that f(x) = 0 and f(x + p) = 13. Since

∇f(x + αp) = [ 3(x_1 + αp_1)² + 3(x_2 + αp_2)² ; 6(x_1 + αp_1)(x_2 + αp_2) ] = [ 15α² ; 12α² ],

we have that ∇f(x + αp)^T p = 39α². Hence the relation (A.55) holds when we set α = 1/√3, which lies in the open interval (0, 1), as claimed.
An alternative expression to (A.55) can be stated for twice differentiable functions: We have

f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x + αp) p,   (A.56)

for some α ∈ (0, 1). In fact, this expression is one form of Taylor's theorem, Theorem 2.1 in Chapter 2, to which we refer throughout the book.
The extension of (A.55) to a vector-valued function r : R^n → R^m for m > 1 is not immediate. There is in general no scalar α such that the natural extension of (A.55) is satisfied. However, the following result is often a useful analog. As in (10.3), we denote the Jacobian of r(x) by J(x), where J(x) is the m × n matrix whose (j, i) entry is ∂r_j/∂x_i, for j = 1, 2, ..., m and i = 1, 2, ..., n, and assume that J(x) is defined and continuous on the domain of interest. Given x and p, we then have

r(x + p) − r(x) = ∫_0^1 J(x + αp) p dα.   (A.57)

When p is sufficiently small in norm, we can approximate the right-hand side of this expression adequately by J(x)p, that is,

r(x + p) ≈ r(x) + J(x)p.

If J is Lipschitz continuous in the vicinity of x and x + p, with Lipschitz constant L, we can use (A.12) to estimate the error in this approximation as follows:

‖r(x + p) − r(x) − J(x)p‖ = ‖∫_0^1 [J(x + αp) − J(x)] p dα‖
  ≤ ∫_0^1 ‖J(x + αp) − J(x)‖ ‖p‖ dα ≤ ∫_0^1 L α ‖p‖² dα = (1/2) L ‖p‖².
IMPLICIT FUNCTION THEOREM
The implicit function theorem lies behind a number of important results in the local convergence theory of optimization algorithms and in the characterization of optimality (see Chapter 12). Our statement of this result is based on Lang [187, p. 131] and Bertsekas [19, Proposition A.25].

Theorem A.2 (Implicit Function Theorem).
Let h : R^n × R^m → R^n be a function such that
(i) h(z*, 0) = 0 for some z* ∈ R^n,
(ii) the function h(·, ·) is continuously differentiable in some neighborhood of (z*, 0), and
(iii) ∇_z h(z, t) is nonsingular at the point (z, t) = (z*, 0).
Then there exist open sets N_z ⊂ R^n and N_t ⊂ R^m containing z* and 0, respectively, and a continuous function z : N_t → N_z such that z* = z(0) and h(z(t), t) = 0 for all t ∈ N_t. Further, z(t) is uniquely defined. Finally, if h is p times continuously differentiable with respect to both its arguments for some p > 0, then z(t) is also p times continuously differentiable with respect to t, and we have

∇z(t) = −∇_t h(z(t), t) [∇_z h(z(t), t)]^{-1}, for all t ∈ N_t.

This theorem is frequently applied to parametrized systems of linear equations, in which z is obtained as the solution of

M(t) z = g(t),

where M(t) ∈ R^{n×n} has M(0) nonsingular, and g(t) ∈ R^n. To apply the theorem, we define

h(z, t) = M(t) z − g(t).

If M and g are continuously differentiable in some neighborhood of 0, the theorem implies that z(t) = M(t)^{-1} g(t) is a continuous function of t in some neighborhood of 0.
ORDER NOTATION
In much of our analysis we are concerned with how the members of a sequence behave eventually, that is, when we get far enough along in the sequence. For instance, we might ask whether the elements of the sequence are bounded, or whether they are similar in size to the elements of a corresponding sequence, or whether they are decreasing and, if so, how rapidly. Order notation is useful shorthand to use when questions like these are being examined. It saves us defining many constants that clutter up the argument and the analysis.
We will use three varieties of order notation: O(·), o(·), and Θ(·). Given two nonnegative infinite sequences of scalars {η_k} and {ν_k}, we write

η_k = O(ν_k)

if there is a positive constant C such that

|η_k| ≤ C |ν_k|

for all k sufficiently large. We write

η_k = o(ν_k)

if the sequence of ratios {η_k/ν_k} approaches zero, that is,

lim_{k→∞} η_k/ν_k = 0.

Finally, we write

η_k = Θ(ν_k)

if there are two constants C_0 and C_1 with 0 < C_0 ≤ C_1 < ∞ such that

C_0 ν_k ≤ η_k ≤ C_1 ν_k,

that is, the corresponding elements of both sequences stay in the same "ballpark" for all k. This definition is equivalent to saying that η_k = O(ν_k) and ν_k = O(η_k).
The same notation is often used in the context of quantities that depend continuously on each other as well. For instance, if η(·) is a function that maps R to R, we write

η(ν) = O(ν)

if there is a constant C such that |η(ν)| ≤ C|ν| for all ν ∈ R. (Typically, we are interested only in values of ν that are either very large or very close to zero; this should be clear from the context.) Similarly, we use

η(ν) = o(ν)   (A.58)

to indicate that the ratio η(ν)/ν approaches zero either as ν → 0 or ν → ∞. (Again, the precise meaning should be clear from the context.)
As a slight variant on the definitions above, we write

η_k = O(1)

to indicate that there is a constant C such that |η_k| ≤ C for all k, while

η_k = o(1)

indicates that lim_{k→∞} η_k = 0. We sometimes use vector and matrix quantities as arguments, and in these cases the definitions above are intended to apply to the norms of these quantities. For instance, if f : R^n → R^n, we write f(x) = O(‖x‖) if there is a constant C > 0 such that ‖f(x)‖ ≤ C‖x‖ for all x in the domain of f. Typically, as above, we are interested only in some subdomain of f, usually a small neighborhood of 0. As before, the precise meaning should be clear from the context.
ROOT-FINDING FOR SCALAR EQUATIONS
In Chapter 11 we discussed methods for finding solutions of nonlinear systems of equations F(x) = 0, where F : R^n → R^n. Here we discuss briefly the case of scalar equations (n = 1), for which the algorithm is easy to illustrate. Scalar root-finding is needed in the trust-region algorithms of Chapter 4, for instance. Of course, the general theorems of Chapter 11 can be applied to derive rigorous convergence results for this special case.
The basic step of Newton's method (Algorithm Newton of Chapter 11) in the scalar case is simply

p_k = −F(x_k)/F'(x_k),  x_{k+1} = x_k + p_k   (A.59)

(cf. (11.6)). Graphically, such a step involves taking the tangent to the graph of F at the point x_k and taking the next iterate to be the intersection of this tangent with the x axis (see Figure A.2). Clearly, if the function F is nearly linear, the tangent will be quite a good approximation to F itself, so the Newton iterate will be quite close to the true root of F.
Figure A.2 One step of Newton's method for a scalar equation.
Figure A.3 One step of the secant method for a scalar equation.
The secant method for scalar equations can be viewed as the specialization of Broyden's method to the case of n = 1. The issues are simpler in this case, however, since the secant equation (11.27) completely determines the value of the 1 × 1 approximate Jacobian B_k. That is, we do not need to apply extra conditions to ensure that B_k is fully determined. By combining (11.24) with (11.27), we find that the secant method for the case of n = 1 is defined by

B_k = [F(x_k) − F(x_{k−1})] / (x_k − x_{k−1}),   (A.60a)
p_k = −F(x_k)/B_k,  x_{k+1} = x_k + p_k.   (A.60b)
By illustrating this algorithm, we see the origin of the term "secant": B_k approximates the slope of the function at x_k by taking the secant through the points (x_{k−1}, F(x_{k−1})) and (x_k, F(x_k)), and x_{k+1} is obtained by finding the intersection of this secant with the x axis. The method is illustrated in Figure A.3.
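Both iterations fit in a few lines. The following sketch applies (A.59) and (A.60) to the arbitrary test equation F(x) = x² − 2, whose positive root is √2; iteration counts and starting points are illustrative choices:

```python
# Scalar Newton (A.59) and secant (A.60) iterations on F(x) = x^2 - 2.
def newton(F, dF, x, iters=8):
    for _ in range(iters):
        x = x - F(x) / dF(x)          # tangent step (A.59)
    return x

def secant(F, x_prev, x, iters=12):
    for _ in range(iters):
        B = (F(x) - F(x_prev)) / (x - x_prev)   # secant slope (A.60a)
        x_prev, x = x, x - F(x) / B             # step (A.60b)
    return x

F = lambda x: x * x - 2.0
dF = lambda x: 2.0 * x
print(newton(F, dF, 1.5), secant(F, 1.0, 1.5))  # both approach 1.41421356...
```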
APPENDIX B
A Regularization Procedure
The following algorithm chooses parameters δ and γ that guarantee that the regularized primal-dual matrix (19.25) is nonsingular and satisfies the inertia condition (19.24). The algorithm assumes that, at the beginning of the interior-point iteration, δ_old has been initialized to zero.
Algorithm B.1 Inertia Correction and Regularization.
Given the current barrier parameter μ and the perturbation δ_old used in the previous interior-point iteration (the numerical constants below are typical choices).
Factor (19.25) with δ = 0 and γ = 0.
if (19.25) is nonsingular and its inertia is (n + m, l + m, 0)
    compute the primal-dual step; stop;
if (19.25) has zero eigenvalues
    set γ ← 10^{-8};
if δ_old = 0
    set δ ← 10^{-4};
else
    set δ ← δ_old/2;
repeat
    Factor the modified matrix (19.25);
    if the inertia is (n + m, l + m, 0)
        set δ_old ← δ;
        compute the primal-dual step (19.12) using the coefficient matrix (19.25);
        stop;
    else
        set δ ← 10δ;
end (repeat)
This algorithm has been adapted from a more elaborate procedure described by Wächter and Biegler [301]. All constants used in the algorithm are arbitrary; we have provided typical choices. The algorithm aims to avoid unnecessarily large modifications δI of ∇²_{xx} L while trying to minimize the number of matrix factorizations. Excessive modifications degrade the performance of the algorithm because they erase the second derivative information contained in ∇²_{xx} L, and cause the step to take on steepest-descent-like characteristics. The first trial value δ_old/2 is based on the previous modification δ_old because the minimum perturbation required to achieve the desired inertia will often not vary much from one interior-point iteration to the next.
The heuristics implemented in Algorithm B.1 provide an alternative to those employed in Algorithm 7.3, which were presented in the context of unconstrained optimization. We emphasize, however, that all of these are indeed heuristics and may not always provide adequate safeguards.
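The control flow of Algorithm B.1 is straightforward to prototype. In the sketch below, the inertia is read off the eigenvalues for clarity only (a real implementation would obtain it from a symmetric indefinite factorization); the callable K(delta, gamma), standing for the regularized matrix (19.25), and target_inertia, standing for (n + m, l + m, 0), are assumptions supplied by the caller:

```python
# Rough Python sketch of the inertia-correction loop of Algorithm B.1.
import numpy as np

def inertia(M, tol=1e-12):
    w = np.linalg.eigvalsh(M)   # eigenvalue-based inertia (illustrative only)
    return ((w > tol).sum(), (w < -tol).sum(), (np.abs(w) <= tol).sum())

def inertia_correction(K, target_inertia, delta_old=0.0):
    """Return (delta, gamma) so that inertia(K(delta, gamma)) == target_inertia."""
    if inertia(K(0.0, 0.0)) == target_inertia:
        return 0.0, 0.0
    gamma = 1e-8 if inertia(K(0.0, 0.0))[2] > 0 else 0.0   # zero eigenvalues
    delta = 1e-4 if delta_old == 0.0 else delta_old / 2.0
    while inertia(K(delta, gamma)) != target_inertia:
        delta *= 10.0                                      # enlarge perturbation
        if delta > 1e40:
            raise RuntimeError("inertia correction failed")
    return delta, gamma
```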
References
[1] R. K. AHUJA, T. L. MAGNANTI, AND J. B. ORLIN, Network Flows: Theory, Algorithms, and Applications, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] H. AKAIKE, On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method, Annals of the Institute of Statistical Mathematics, 11 (1959), pp. 1–17.
[3] M. AL-BAALI, Descent property and global convergence of the Fletcher–Reeves method with inexact line search, IMA Journal on Numerical Analysis, 5 (1985), pp. 121–124.
[4] E. D. ANDERSEN AND K. D. ANDERSEN, Presolving in linear programming, Mathematical Programming, 71 (1995), pp. 221–245.
[5] E. D. ANDERSEN AND K. D. ANDERSEN, The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm, in High Performance Optimization, H. Frenk, K. Roos, T. Terlaky, and S. Zhang, eds., Kluwer Academic Publishers, 2000, pp. 197–232.
[6] E. D. ANDERSEN, J. GONDZIO, C. MÉSZÁROS, AND X. XU, Implementation of interior-point methods for large scale linear programming, in Interior Point Methods in Mathematical Programming, T. Terlaky, ed., Kluwer, 1996, ch. 6, pp. 189–252.
[7] E. ANDERSON, Z. BAI, C. BISCHOF, J. DEMMEL, J. DONGARRA, J. DU CROZ, A. GREENBAUM, S. HAMMARLING, A. MCKENNEY, S. OSTROUCHOV, AND D. SORENSEN, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[8] M. ANITESCU, On solving mathematical programs with complementarity constraints as nonlinear programs, SIAM Journal on Optimization, 15 (2005), pp. 1203–1236.
[9] ARKI CONSULTING AND DEVELOPMENT A/S, CONOPT version 3, 2004.
[10] B. M. AVERICK, R. G. CARTER, J. J. MORÉ, AND G. XUE, The MINPACK-2 test problem collection, Preprint MCS-P153-0692, Argonne National Laboratory, 1992.
[11] P. BAPTIST AND J. STOER, On the relation between quadratic termination and convergence properties of minimization algorithms, Part II: Applications, Numerische Mathematik, 28 (1977), pp. 367–392.
[12] R. H. BARTELS AND G. H. GOLUB, The simplex method of linear programming using LU decomposition, Communications of the ACM, 12 (1969), pp. 266–268.
[13] R. BARTLETT AND L. BIEGLER, rSQP++: An object-oriented framework for successive quadratic programming, in Large-Scale PDE-Constrained Optimization, L. T. Biegler, O. Ghattas, M. Heinkenschloss, and B. van Bloemen Waanders, eds., vol. 30 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, New York, 2003, pp. 316–330.
[14] M. BAZARAA, H. SHERALI, AND C. SHETTY, Nonlinear Programming, Theory and Applications, John Wiley & Sons, New York, second ed., 1993.
[15] A. BEN-TAL AND A. NEMIROVSKI, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, MPS-SIAM Series on Optimization, SIAM, 2001.
[16] H. Y. BENSON, A. SEN, D. F. SHANNO, AND R. J. VANDERBEI, Interior-point algorithms, penalty methods and equilibrium problems, Technical Report ORFE-03-02, Operations Research and Financial Engineering, Princeton University, 2003.
[17] S. BENSON AND J. MORÉ, A limited-memory variable-metric algorithm for bound constrained minimization, Numerical Analysis Report P9090901, ANL, Argonne, IL, USA, 2001.
[18] D. P. BERTSEKAS, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1982.
[19] D. P. BERTSEKAS, Nonlinear Programming, Athena Scientific, Belmont, MA, second ed., 1999.
[20] M. BERZ, C. BISCHOF, C. F. CORLISS, AND A. GRIEWANK, eds., Computational Differentiation: Techniques, Applications, and Tools, SIAM Publications, Philadelphia, PA, 1996.
[21] J. BETTS, S. K. ELDERSVELD, P. D. FRANK, AND J. G. LEWIS, An interior-point nonlinear programming algorithm for large scale optimization, Technical report MCT TECH-003, Mathematics and Computing Technology, The Boeing Company, P.O. Box 3707, Seattle, WA 98124-2207, 2000.
[22] J. R. BIRGE AND F. LOUVEAUX, Introduction to Stochastic Programming, Springer-Verlag, New York, 1997.
[23] E. G. BIRGIN, J. M. MARTÍNEZ, AND M. RAYDAN, Algorithm 813: SPG software for convex-constrained optimization, ACM Transactions on Mathematical Software, 27 (2001), pp. 340–349.
[24] C. BISCHOF, A. BOUARICHA, P. KHADEMI, AND J. J. MORÉ, Computing gradients in large-scale optimization using automatic differentiation, INFORMS Journal on Computing, 9 (1997), pp. 185–194.
[25] C. BISCHOF, A. CARLE, P. KHADEMI, AND A. MAUER, ADIFOR 2.0: Automatic differentiation of FORTRAN 77 programs, IEEE Computational Science & Engineering, 3 (1996), pp. 18–32.
[26] C. BISCHOF, G. CORLISS, AND A. GRIEWANK, Structured second- and higher-order derivatives through univariate Taylor series, Optimization Methods and Software, 2 (1993), pp. 211–232.
[27] C. BISCHOF, P. KHADEMI, A. BOUARICHA, AND A. CARLE, Efficient computation of gradients and Jacobians by transparent exploitation of sparsity in automatic differentiation, Optimization Methods and Software, 7 (1996), pp. 1–39.
[28] C. BISCHOF, L. ROH, AND A. MAUER, ADIC: An extensible automatic differentiation tool for ANSI-C, Software: Practice and Experience, 27 (1997), pp. 1427–1456.
[29] Å. BJÖRCK, Numerical Methods for Least Squares Problems, SIAM Publications, Philadelphia, PA, 1996.
[30] P. T. BOGGS, R. H. BYRD, AND R. B. SCHNABEL, A stable and efficient algorithm for nonlinear orthogonal distance regression, SIAM Journal on Scientific and Statistical Computing, 8 (1987), pp. 1052–1078.
[31] P. T. BOGGS, J. R. DONALDSON, R. H. BYRD, AND R. B. SCHNABEL, ODRPACK: Software for weighted orthogonal distance regression, ACM Transactions on Mathematical Software, 15 (1981), pp. 348–364.
[32] P. T. BOGGS AND J. W. TOLLE, Convergence properties of a class of rank-two updates, SIAM Journal on Optimization, 4 (1994), pp. 262–287.
[33] P. T. BOGGS AND J. W. TOLLE, Sequential quadratic programming, Acta Numerica, 4 (1996), pp. 1–51.
[34] P. T. BOGGS, J. W. TOLLE, AND P. WANG, On the local convergence of quasi-Newton methods for constrained optimization, SIAM Journal on Control and Optimization, 20 (1982), pp. 161–171.
[35] I. BONGARTZ, A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, CUTE: Constrained and unconstrained testing environment, Research Report, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1993.
[36] J. F. BONNANS, E. R. PANIER, A. L. TITS, AND J. L. ZHOU, Avoiding the Maratos effect by means of a nonmonotone line search. II. Inequality constrained problems: feasible iterates, SIAM Journal on Numerical Analysis, 29 (1992), pp. 1187–1202.
[37] S. BOYD, L. EL GHAOUI, E. FERON, AND V. BALAKRISHNAN, Linear Matrix Inequalities in Systems and Control Theory, SIAM Publications, Philadelphia, 1994.
[38] S. BOYD AND L. VANDENBERGHE, Convex Optimization, Cambridge University Press, Cambridge, 2003.
[39] R. P. BRENT, Algorithms for Minimization Without Derivatives, Prentice Hall, Englewood Cliffs, NJ, 1973.
[40] H. M. BÜCKER, G. F. CORLISS, P. D. HOVLAND, U. NAUMANN, AND B. NORRIS, eds., Automatic Differentiation: Applications, Theory, and Implementations, vol. 50 of Lecture Notes in Computational Science and Engineering, Springer, New York, 2005.
[41] R. BULIRSCH AND J. STOER, Introduction to Numerical Analysis, Springer-Verlag, New York, 1980.
[42] J. R. BUNCH AND L. KAUFMAN, Some stable methods for calculating inertia and solving symmetric linear systems, Mathematics of Computation, 31 (1977), pp. 163–179.
[43] J. R. BUNCH AND B. N. PARLETT, Direct methods for solving symmetric indefinite systems of linear equations, SIAM Journal on Numerical Analysis, 8 (1971), pp. 639–655.
[44] J. V. BURKE AND J. J. MORÉ, Exposing constraints, SIAM Journal on Optimization, 4 (1994), pp. 573–595.
[45] W. BURMEISTER, Die Konvergenzordnung des Fletcher–Powell-Algorithmus, Zeitschrift für Angewandte Mathematik und Mechanik, 53 (1973), pp. 693–699.
[46] R. BYRD, J. NOCEDAL, AND R. WALTZ, Knitro: An integrated package for nonlinear optimization, Technical Report 18, Optimization Technology Center, Evanston, IL, June 2005.
[47] R. BYRD, J. NOCEDAL, AND R. A. WALTZ, Steering exact penalty methods, Technical Report OTC 2004-07, Optimization Technology Center, Northwestern University, Evanston, IL, USA, April 2004.
[48] R. H. BYRD, J. C. GILBERT, AND J. NOCEDAL, A trust region method based on interior point techniques for nonlinear programming, Mathematical Programming, 89 (2000), pp. 149–185.
[49] R. H. BYRD, N. I. M. GOULD, J. NOCEDAL, AND R. A. WALTZ, An algorithm for nonlinear optimization using linear programming and equality constrained subproblems, Mathematical Programming, Series B, 100 (2004), pp. 27–48.
[50] R. H. BYRD, M. E. HRIBAR, AND J. NOCEDAL, An interior point method for large scale nonlinear programming, SIAM Journal on Optimization, 9 (1999), pp. 877–900.
[51] R. H. BYRD, H. F. KHALFAN, AND R. B. SCHNABEL, Analysis of a symmetric rank-one trust region method, SIAM Journal on Optimization, 6 (1996), pp. 1025–1039.
[52] R. H. BYRD, J. NOCEDAL, AND R. B. SCHNABEL, Representations of quasi-Newton matrices and their use in limited-memory methods, Mathematical Programming, Series A, 63 (1994), pp. 129–156.
[53] R. H. BYRD, J. NOCEDAL, AND Y. YUAN, Global convergence of a class of quasi-Newton methods on convex problems, SIAM Journal on Numerical Analysis, 24 (1987), pp. 1171–1190.
[54] R. H. BYRD, R. B. SCHNABEL, AND G. A. SCHULTZ, Approximate solution of the trust region problem by minimization over two-dimensional subspaces, Mathematical Programming, 40 (1988), pp. 247–263.
[55] R. H. BYRD, R. B. SCHNABEL, AND G. A. SHULTZ, A trust region algorithm for nonlinearly constrained optimization, SIAM Journal on Numerical Analysis, 24 (1987), pp. 1152–1170.
[56] M. R. CELIS, J. E. DENNIS, AND R. A. TAPIA, A trust region strategy for nonlinear equality constrained optimization, in Numerical Optimization, P. T. Boggs, R. H. Byrd, and R. B. Schnabel, eds., SIAM, 1985, pp. 71–82.
[57] R. CHAMBERLAIN, C. LEMARÉCHAL, H. C. PEDERSEN, AND M. J. D. POWELL, The watchdog technique for forcing convergence in algorithms for constrained optimization, Mathematical Programming, 16 (1982), pp. 1–17.
[58] S. H. CHENG AND N. J. HIGHAM, A modified Cholesky algorithm based on a symmetric indefinite factorization, SIAM Journal on Matrix Analysis and Applications, 19 (1998), pp. 1097–1100.
[59] C. M. CHIN AND R. FLETCHER, On the global convergence of an SLP-filter algorithm that takes EQP steps, Mathematical Programming, Series A, 96 (2003), pp. 161–177.
[60] T. D. CHOI AND C. T. KELLEY, Superlinear convergence and implicit filtering, SIAM Journal on Optimization, 10 (2000), pp. 1149–1162.
[61] V. CHVÁTAL, Linear Programming, W. H. Freeman and Company, New York, 1983.
[62] F. H. CLARKE, Optimization and Nonsmooth Analysis, John Wiley & Sons, New York, 1983. Reprinted by SIAM Publications, 1990.
[63] A. COHEN, Rate of convergence of several conjugate gradient algorithms, SIAM Journal on Numerical Analysis, 9 (1972), pp. 248–259.
[64] T. F. COLEMAN, Linearly constrained optimization and projected preconditioned conjugate gradients, in Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, J. Lewis, ed., Philadelphia, USA, 1994, SIAM, pp. 118–122.
[65] T. F. COLEMAN AND A. R. CONN, Nonlinear programming via an exact penalty function: Asymptotic analysis, Mathematical Programming, 24 (1982), pp. 123–136.
[66] T. F. COLEMAN, B. GARBOW, AND J. J. MORÉ, Software for estimating sparse Jacobian matrices, ACM Transactions on Mathematical Software, 10 (1984), pp. 329–345.
[67] T. F. COLEMAN, B. GARBOW, AND J. J. MORÉ, Software for estimating sparse Hessian matrices, ACM Transactions on Mathematical Software, 11 (1985), pp. 363–377.
[68] T. F. COLEMAN AND J. J. MORÉ, Estimation of sparse Jacobian matrices and graph coloring problems, SIAM Journal on Numerical Analysis, 20 (1983), pp. 187–209.
[69] T. F. COLEMAN AND J. J. MORÉ, Estimation of sparse Hessian matrices and graph coloring problems, Mathematical Programming, 28 (1984), pp. 243–270.
[70] A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, Testing a class of algorithms for solving minimization problems with simple bounds on the variables, Mathematics of Computation, 50 (1988), pp. 399–430.
[71] A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, Convergence of quasi-Newton matrices generated by the symmetric rank one update, Mathematical Programming, 50 (1991), pp. 177–195.
[72] A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, LANCELOT: a FORTRAN package for large-scale nonlinear optimization (Release A), no. 17 in Springer Series in Computational Mathematics, Springer-Verlag, New York, 1992.
[73] A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, Numerical experiments with the LANCELOT package (Release A) for large-scale nonlinear optimization, Report 92/16, Department of Mathematics, University of Namur, Belgium, 1992.
[74] A. R. CONN, N. I. M. GOULD, AND P. L. TOINT, Trust-Region Methods, MPS-SIAM Series on Optimization, SIAM, 2000.
[75] A. R. CONN, K. SCHEINBERG, AND P. L. TOINT, On the convergence of derivative-free methods for unconstrained optimization, in Approximation Theory and Optimization: Tributes to M. J. D. Powell, A. Iserles and M. Buhmann, eds., Cambridge University Press, Cambridge, UK, 1997, pp. 83–108.
[76] A. R. CONN, K. SCHEINBERG, AND P. L. TOINT, Recent progress in unconstrained nonlinear optimization without derivatives, Mathematical Programming, Series B, 79 (1997), pp. 397–414.
[77] W. J. COOK, W. H. CUNNINGHAM, W. R. PULLEYBLANK, AND A. SCHRIJVER, Combinatorial Optimization, John Wiley & Sons, New York, 1997.
[78] G. F. CORLISS AND L. B. RALL, An introduction to automatic differentiation, in Computational Differentiation: Techniques, Applications, and Tools, M. Berz, C. Bischof, G. F. Corliss, and A. Griewank, eds., SIAM Publications, Philadelphia, PA, 1996, ch. 1.
[79] T. H. CORMEN, C. E. LEISERSON, AND R. L. RIVEST, Introduction to Algorithms, MIT Press, 1990.
[80] R. W. COTTLE, J.-S. PANG, AND R. E. STONE, The Linear Complementarity Problem, Academic Press, San Diego, 1992.
[81] R. COURANT, Variational methods for the solution of problems with equilibrium and vibration, Bulletin of the American Mathematical Society, 49 (1943), pp. 1–23.
[82] H. P. CROWDER AND P. WOLFE, Linear convergence of the conjugate gradient method, IBM Journal of Research and Development, 16 (1972), pp. 431–433.
[83] A. CURTIS, M. J. D. POWELL, AND J. REID, On the estimation of sparse Jacobian matrices, Journal of the Institute of Mathematics and its Applications, 13 (1974), pp. 117–120.
[84] J. CZYZYK, S. MEHROTRA, M. WAGNER, AND S. J. WRIGHT, PCx: An interior-point code for linear programming, Optimization Methods and Software, 11/12 (1999), pp. 397–430.
[85] Y. DAI AND Y. YUAN, A nonlinear conjugate gradient method with a strong global convergence property, SIAM Journal on Optimization, 10 (1999), pp. 177–182.
[86] G. B. DANTZIG, Linear Programming and Extensions, Princeton University Press, Princeton, NJ, 1963.
[87] W. C. DAVIDON, Variable metric method for minimization, Technical Report ANL-5990 (revised), Argonne National Laboratory, Argonne, IL, 1959.
[88] W. C. DAVIDON, Variable metric method for minimization, SIAM Journal on Optimization, 1 (1991), pp. 1–17.
[89] R. S. DEMBO, S. C. EISENSTAT, AND T. STEIHAUG, Inexact Newton methods, SIAM Journal on Numerical Analysis, 19 (1982), pp. 400–408.
[90] J. E. DENNIS, D. M. GAY, AND R. E. WELSCH, Algorithm 573: NL2SOL, an adaptive nonlinear least-squares algorithm, ACM Transactions on Mathematical Software, 7 (1981), pp. 348–368.
[91] J. E. DENNIS AND J. J. MORÉ, Quasi-Newton methods, motivation and theory, SIAM Review, 19 (1977), pp. 46–89.
[92] J. E. DENNIS AND R. B. SCHNABEL, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983. Reprinted by SIAM Publications, 1993.
[93] J. E. DENNIS AND R. B. SCHNABEL, A view of unconstrained optimization, in Optimization, vol. 1 of Handbooks in Operations Research and Management, Elsevier Science Publishers, Amsterdam, The Netherlands, 1989, pp. 1–72.
[94] I. I. DIKIN, Iterative solution of problems of linear and quadratic programming, Soviet Mathematics Doklady, 8 (1967), pp. 674–675.
[95] I. S. DUFF AND J. K. REID, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Transactions on Mathematical Software, 9 (1983), pp. 302–325.
[96] I. S. DUFF AND J. K. REID, The design of MA48: A code for the direct solution of sparse unsymmetric linear systems of equations, ACM Transactions on Mathematical Software, 22 (1996), pp. 187–226.
[97] I. S. DUFF, J. K. REID, N. MUNKSGAARD, AND H. B. NIELSEN, Direct solution of sets of linear equations whose matrix is sparse symmetric and indefinite, Journal of the Institute of Mathematics and its Applications, 23 (1979), pp. 235–250.
[98] A. V. FIACCO AND G. P. MCCORMICK, Nonlinear Programming: Sequential Unconstrained Minimization Techniques, John Wiley & Sons, New York, NY, 1968. Reprinted by SIAM Publications, 1990.
[99] R. FLETCHER, A general quadratic programming algorithm, Journal of the Institute of Mathematics and its Applications, 7 (1971), pp. 76–91.
[100] R. FLETCHER, Second order corrections for nondifferentiable optimization, in Numerical Analysis, D. Griffiths, ed., Springer Verlag, 1982, pp. 85–114. Proceedings Dundee 1981.
[101] R. FLETCHER, Practical Methods of Optimization, John Wiley & Sons, New York, second ed., 1987.
[102] R. FLETCHER, An optimal positive definite update for sparse Hessian matrices, SIAM Journal on Optimization, 5 (1995), pp. 192–218.
[103] R. FLETCHER, Stable reduced Hessian updates for indefinite quadratic programming, Mathematical Programming, 87 (2000), pp. 251–264.
[104] R. FLETCHER, A. GROTHEY, AND S. LEYFFER, Computing sparse Hessian and Jacobian approximations with optimal hereditary properties, Technical Report, Department of Mathematics, University of Dundee, 1996.
[105] R. FLETCHER AND S. LEYFFER, Nonlinear programming without a penalty function, Mathematical Programming, Series A, 91 (2002), pp. 239–269.
[106] R. FLETCHER, S. LEYFFER, AND P. L. TOINT, On the global convergence of an SLP-filter algorithm, Numerical Analysis Report NA/183, Dundee University, Dundee, Scotland, UK, 1999.
[107] R. FLETCHER AND C. M. REEVES, Function minimization by conjugate gradients, Computer Journal, 7 (1964), pp. 149–154.
[108] R. FLETCHER AND E. SAINZ DE LA MAZA, Nonlinear programming and nonsmooth optimization by successive linear programming, Mathematical Programming, 43 (1989), pp. 235–256.
[109] C. FLOUDAS AND P. PARDALOS, eds., Recent Advances in Global Optimization, Princeton University Press, Princeton, NJ, 1992.
[110] J. J. H. FORREST AND J. A. TOMLIN, Updated triangular factors of the basis to maintain sparsity in the product form simplex method, Mathematical Programming, 2 (1972), pp. 263–278.
[111] A. FORSGREN, P. E. GILL, AND M. H. WRIGHT, Interior methods for nonlinear optimization, SIAM Review, 44 (2003), pp. 525–597.
[112] R. FOURER, D. M. GAY, AND B. W. KERNIGHAN, AMPL: A Modeling Language for Mathematical Programming, The Scientific Press, South San Francisco, CA, 1993.
[113] R. FOURER AND S. MEHROTRA, Solving symmetric indefinite systems in an interior-point method for linear programming, Mathematical Programming, 62 (1993), pp. 15–39.
[114] M. P. FRIEDLANDER AND M. A. SAUNDERS, A globally convergent linearly constrained Lagrangian method for nonlinear optimization, SIAM Journal on Optimization, 15 (2005), pp. 863–897.
[115] K. R. FRISCH, The logarithmic potential method of convex programming, Technical Report, University Institute of Economics, Oslo, Norway, 1955.
[116] D. GABAY, Reduced quasi-Newton methods with feasibility improvement for nonlinearly constrained optimization, Mathematical Programming Studies, 16 (1982), pp. 18–44.
[117] U. M. GARCIA-PALOMARES AND O. L. MANGASARIAN, Superlinearly convergent quasi-Newton methods for nonlinearly constrained optimization problems, Mathematical Programming, 11 (1976), pp. 1–13.
[118] D. M. GAY, More AD of nonlinear AMPL models: computing Hessian information and exploiting partial separability, in Computational Differentiation: Techniques, Applications, and Tools, M. Berz, C. Bischof, G. F. Corliss, and A. Griewank, eds., SIAM Publications, Philadelphia, PA, 1996, pp. 173–184.
[119] R.-P. GE AND M. J. D. POWELL, The convergence of variable metric matrices in unconstrained optimization, Mathematical Programming, 27 (1983), pp. 123–143.
[120] A. H. GEBREMEDHIN, F. MANNE, AND A. POTHEN, What color is your Jacobian? Graph coloring for computing derivatives, SIAM Review, 47 (2005), pp. 629–705.
[121] E. M. GERTZ AND S. J. WRIGHT, Object-oriented software for quadratic programming, ACM Transactions on Mathematical Software, 29 (2003), pp. 58–81.
[122] J. GILBERT AND C. LEMARÉCHAL, Some numerical experiments with variable storage quasi-Newton algorithms, Mathematical Programming, Series B, 45 (1989), pp. 407–435.
[123] J. GILBERT AND J. NOCEDAL, Global convergence properties of conjugate gradient methods for optimization, SIAM Journal on Optimization, 2 (1992), pp. 21–42.
[124] P. E. GILL, G. H. GOLUB, W. MURRAY, AND M. A. SAUNDERS, Methods for modifying matrix factorizations, Mathematics of Computation, 28 (1974), pp. 505–535.
[125] P. E. GILL AND M. W. LEONARD, Limited-memory reduced-Hessian methods for unconstrained optimization, SIAM Journal on Optimization, 14 (2003), pp. 380–401.
[126] P. E. GILL AND W. MURRAY, Numerically stable methods for quadratic programming, Mathematical Programming, 14 (1978), pp. 349–372.
[127] P. E. GILL, W. MURRAY, AND M. A. SAUNDERS, User's guide for SNOPT (Version 5.3): A FORTRAN package for large-scale nonlinear programming, Technical Report NA 97-4, Department of Mathematics, University of California, San Diego, 1997.
[128] P. E. GILL, W. MURRAY, AND M. A. SAUNDERS, SNOPT: An SQP algorithm for large-scale constrained optimization, SIAM Journal on Optimization, 12 (2002), pp. 979–1006.
[129] P. E. GILL, W. MURRAY, M. A. SAUNDERS, AND M. H. WRIGHT, User's guide for SOL/QPSOL, Technical Report SOL 84-6, Department of Operations Research, Stanford University, Stanford, California, 1984.
[130] P. E. GILL, W. MURRAY, AND M. H. WRIGHT, Practical Optimization, Academic Press, 1981.
[131] P. E. GILL, W. MURRAY, AND M. H. WRIGHT, Numerical Linear Algebra and Optimization, Vol. 1, Addison Wesley, Redwood City, California, 1991.
[132] D. GOLDFARB, Curvilinear path steplength algorithms for minimization which use directions of negative curvature, Mathematical Programming, 18 (1980), pp. 31–40.
[133] D. GOLDFARB AND J. FORREST, Steepest edge simplex algorithms for linear programming, Mathematical Programming, 57 (1992), pp. 341–374.
[134] D. GOLDFARB AND J. K. REID, A practicable steepest-edge simplex algorithm, Mathematical Programming, 12 (1977), pp. 361–373.
[135] G. GOLUB AND D. O'LEARY, Some history of the conjugate gradient methods and the Lanczos algorithms: 1948–1976, SIAM Review, 31 (1989), pp. 50–100.
[136] G. H. GOLUB AND C. F. VAN LOAN, Matrix Computations, The Johns Hopkins University Press, Baltimore, third ed., 1996.
[137] J. GONDZIO, HOPDM (version 2.12): A fast LP solver based on a primal-dual interior point method, European Journal of Operational Research, 85 (1995), pp. 221–225.
[138] J. GONDZIO, Multiple centrality corrections in a primal-dual method for linear programming, Computational Optimization and Applications, 6 (1996), pp. 137–156.
[139] J. GONDZIO AND A. GROTHEY, Parallel interior point solver for structured quadratic programs: Application to financial planning problems, Technical Report MS-03-001, School of Mathematics, University of Edinburgh, Scotland, 2003.
[140] N. I. M. GOULD, On the accurate determination of search directions for simple differentiable penalty functions, I.M.A. Journal on Numerical Analysis, 6 (1986), pp. 357–372.
[141] N. I. M. GOULD, On the convergence of a sequential penalty function method for constrained minimization, SIAM Journal on Numerical Analysis, 26 (1989), pp. 107–128.
[142] N. I. M. GOULD, An algorithm for large scale quadratic programming, I.M.A. Journal on Numerical Analysis, 11 (1991), pp. 299–324.
[143] N. I. M. GOULD, M. E. HRIBAR, AND J. NOCEDAL, On the solution of equality constrained quadratic problems arising in optimization, SIAM Journal on Scientific Computing, 23 (2001), pp. 1375–1394.
[144] N. I. M. GOULD, S. LEYFFER, AND P. L. TOINT, A multidimensional filter algorithm for nonlinear equations and nonlinear least squares, SIAM Journal on Optimization, 15 (2004), pp. 17–38.
[145] N. I. M. GOULD, S. LUCIDI, M. ROMA, AND P. L. TOINT, Solving the trust-region subproblem using the Lanczos method, SIAM Journal on Optimization, 9 (1999), pp. 504–525.
[146] N. I. M. GOULD, D. ORBAN, AND P. L. TOINT, GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization, ACM Transactions on Mathematical Software, 29 (2003), pp. 353–372.
[147] N. I. M. GOULD, D. ORBAN, AND P. L. TOINT, Numerical methods for large-scale nonlinear optimization, Acta Numerica, 14 (2005), pp. 299–361.
[148] N. I. M. GOULD AND P. L. TOINT, An iterative working-set method for large-scale nonconvex quadratic programming, Applied Numerical Mathematics, 43 (2002), pp. 109–128.
[149] N. I. M. GOULD AND P. L. TOINT, Numerical methods for large-scale nonconvex quadratic programming, in Trends in Industrial and Applied Mathematics, A. H. Siddiqi and M. Kocvara, eds., Dordrecht, The Netherlands, 2002, Kluwer Academic Publishers, pp. 149–179.
[150] A. GRIEWANK, Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation, Optimization Methods and Software, 1 (1992), pp. 35–54.
[151] A. GRIEWANK, Automatic directional differentiation of nonsmooth composite functions, in Seventh French–German Conference on Optimization, 1994.
[152] A. GRIEWANK, Evaluating Derivatives: Principles and Techniques of Automatic Differentiation, vol. 19 of Frontiers in Applied Mathematics, SIAM, 2000.
[153] A. GRIEWANK AND G. F. CORLISS, eds., Automatic Differentiation of Algorithms, SIAM Publications, Philadelphia, Penn., 1991.
[154] A. GRIEWANK, D. JUEDES, AND J. UTKE, ADOL-C, a package for the automatic differentiation of algorithms written in C/C++, ACM Transactions on Mathematical Software, 22 (1996), pp. 131–167.
[155] A. GRIEWANK AND P. L. TOINT, Local convergence analysis of partitioned quasi-Newton updates, Numerische Mathematik, 39 (1982), pp. 429–448.
[156] A. GRIEWANK AND P. L. TOINT, Partitioned variable metric updates for large structured optimization problems, Numerische Mathematik, 39 (1982), pp. 119–137.
[157] J. GRIMM, L. POTTIER, AND N. ROSTAING-SCHMIDT, Optimal time and minimum space-time product for reversing a certain class of programs, in Computational Differentiation, Techniques, Applications, and Tools, M. Berz, C. Bischof, G. Corliss, and A. Griewank, eds., SIAM, Philadelphia, 1996, pp. 95–106.
[158] L. GRIPPO, F. LAMPARIELLO, AND S. LUCIDI, A nonmonotone line search technique for Newton's method, SIAM Journal on Numerical Analysis, 23 (1986), pp. 707–716.
[159] C. GUÉRET, C. PRINS, AND M. SEVAUX, Applications of optimization with Xpress-MP, Dash Optimization, 2002.
[160] W. W. HAGER, Minimizing a quadratic over a sphere, SIAM Journal on Optimization, 12 (2001), pp. 188–208.
[161] W. W. HAGER AND H. ZHANG, A new conjugate gradient method with guaranteed descent and an efficient line search, SIAM Journal on Optimization, 16 (2005), pp. 170–192.
[162] W. W. HAGER AND H. ZHANG, A survey of nonlinear conjugate gradient methods. To appear in the Pacific Journal of Optimization, 2005.
[163] S. P. HAN, Superlinearly convergent variable metric algorithms for general nonlinear programming problems, Mathematical Programming, 11 (1976), pp. 263–282.
[164] S. P. HAN, A globally convergent method for nonlinear programming, Journal of Optimization Theory and Applications, 22 (1977), pp. 297–309.
[165] S. P. HAN AND O. L. MANGASARIAN, Exact penalty functions in nonlinear programming, Mathematical Programming, 17 (1979), pp. 251–269.
[166] HARWELL SUBROUTINE LIBRARY, A catalogue of subroutines (release 13), AERE Harwell Laboratory, Harwell, Oxfordshire, England, 1998.
[167] M. R. HESTENES, Multiplier and gradient methods, Journal of Optimization Theory and Applications, 4 (1969), pp. 303–320.
[168] M. R. HESTENES AND E. STIEFEL, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards, 49 (1952), pp. 409–436.
[169] N. J. HIGHAM, Accuracy and Stability of Numerical Algorithms, SIAM Publications, Philadelphia, 1996.
[170] J.-B. HIRIART-URRUTY AND C. LEMARÉCHAL, Convex Analysis and Minimization Algorithms, Springer-Verlag, Berlin, New York, 1993.
[171] P. HOUGH, T. KOLDA, AND V. TORCZON, Asynchronous parallel pattern search for nonlinear optimization, SIAM Journal on Optimization, 23 (2001), pp. 134–156.
[172] ILOG CPLEX 8.0, User's Manual, ILOG SA, Gentilly, France, 2002.
[173] D. JONES, C. PERTTUNEN, AND B. STUCKMAN, Lipschitzian optimization without the Lipschitz constant, Journal of Optimization Theory and Applications, 79 (1993), pp. 157–181.
[174] P. KALL AND S. W. WALLACE, Stochastic Programming, John Wiley & Sons, New York, 1994.
[175] N. KARMARKAR, A new polynomial-time algorithm for linear programming, Combinatorica, 4 (1984), pp. 373–395.
[176] C. KELLER, N. I. M. GOULD, AND A. J. WATHEN, Constraint preconditioning for indefinite linear systems, SIAM Journal on Matrix Analysis and Applications, 21 (2000), pp. 1300–1317.
[177] C. T. KELLEY, Iterative Methods for Linear and Nonlinear Equations, SIAM Publications, Philadelphia, PA, 1995.
[178] C. T. KELLEY, Detection and remediation of stagnation in the Nelder–Mead algorithm using a sufficient decrease condition, SIAM Journal on Optimization, 10 (1999), pp. 43–55.
[179] C. T. KELLEY, Iterative Methods for Optimization, no. 18 in Frontiers in Applied Mathematics, SIAM Publications, Philadelphia, PA, 1999.
[180] L. G. KHACHIYAN, A polynomial algorithm in linear programming, Soviet Mathematics Doklady, 20 (1979), pp. 191–194.
[181] H. F. KHALFAN, R. H. BYRD, AND R. B. SCHNABEL, A theoretical and experimental study of the symmetric rank one update, SIAM Journal on Optimization, 3 (1993), pp. 1–24.
[182] V. KLEE AND G. J. MINTY, How good is the simplex algorithm?, in Inequalities, O. Shisha, ed., Academic Press, New York, 1972, pp. 159–175.
[183] T. G. KOLDA, R. M. LEWIS, AND V. TORCZON, Optimization by direct search: New perspectives on some classical and modern methods, SIAM Review, 45 (2003), pp. 385–482.
[184] M. KOCVARA AND M. STINGL, PENNON, a code for nonconvex nonlinear and semidefinite programming, Optimization Methods and Software, 18 (2003), pp. 317–333.
[185] H. W. KUHN AND A. W. TUCKER, Nonlinear programming, in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman, ed., Berkeley, CA, 1951, University of California Press, pp. 481–492.
[186] J. C. LAGARIAS, J. A. REEDS, M. H. WRIGHT, AND P. E. WRIGHT, Convergence properties of the Nelder–Mead simplex algorithm in low dimensions, SIAM Journal on Optimization, 9 (1998), pp. 112–147.
[187] S. LANG, Real Analysis, Addison-Wesley, Reading, MA, second ed., 1983.
[188] C. L. LAWSON AND R. J. HANSON, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ, 1974.
[189] C. LEMARÉCHAL, A view of line searches, in Optimization and Optimal Control, W. Oettli and J. Stoer, eds., no. 30 in Lecture Notes in Control and Information Science, Springer-Verlag, 1981, pp. 59–78.
[190] K. LEVENBERG, A method for the solution of certain nonlinear problems in least squares, Quarterly of Applied Mathematics, 2 (1944), pp. 164–168.
[191] S. LEYFFER, G. LOPEZ-CALVA, AND J. NOCEDAL, Interior methods for mathematical programs with complementarity constraints, Technical Report 8, Optimization Technology Center, Northwestern University, Evanston, IL, 2004.
[192] C. LIN AND J. MORÉ, Newton's method for large bound-constrained optimization problems, SIAM Journal on Optimization, 9 (1999), pp. 1100–1127.
[193] C. LIN AND J. J. MORÉ, Incomplete Cholesky factorizations with limited memory, SIAM Journal on Scientific Computing, 21 (1999), pp. 24–45.
[194] D. C. LIU AND J. NOCEDAL, On the limited-memory BFGS method for large scale optimization, Mathematical Programming, 45 (1989), pp. 503–528.
[195] D. LUENBERGER, Introduction to Linear and Nonlinear Programming, Addison-Wesley, second ed., 1984.
[196] L. LUKŠAN AND J. VLČEK, Indefinitely preconditioned inexact Newton method for large sparse equality constrained nonlinear programming problems, Numerical Linear Algebra with Applications, 5 (1998), pp. 219–247.
[197] Macsyma User's Guide, second ed., 1996.
[198] O. L. MANGASARIAN, Nonlinear Programming, McGraw-Hill, New York, 1969. Reprinted by SIAM Publications, 1995.
[199] N. MARATOS, Exact penalty function algorithms for finite dimensional and control optimization problems, PhD thesis, University of London, 1978.
[200] M. MARAZZI AND J. NOCEDAL, Wedge trust region methods for derivative free optimization, Mathematical Programming, Series A, 91 (2002), pp. 289–305.
[201] H. M. MARKOWITZ, Portfolio selection, Journal of Finance, 8 (1952), pp. 77–91.
[202] H. M. MARKOWITZ, The elimination form of the inverse and its application to linear programming, Management Science, 3 (1957), pp. 255–269.
[203] D. W. MARQUARDT, An algorithm for least squares estimation of nonlinear parameters, SIAM Journal, 11 (1963), pp. 431–441.
[204] D. Q. MAYNE AND E. POLAK, A superlinearly convergent algorithm for constrained optimization problems, Mathematical Programming Studies, 16 (1982), pp. 45–61.
[205] L. MCLINDEN, An analogue of Moreau's proximation theorem, with applications to the nonlinear complementarity problem, Pacific Journal of Mathematics, 88 (1980), pp. 101–161.
[206] N. MEGIDDO, Pathways to the optimal set in linear programming, in Progress in Mathematical Programming: Interior-Point and Related Methods, N. Megiddo, ed., Springer-Verlag, New York, NY, 1989, ch. 8, pp. 131–158.
[207] S. MEHROTRA, On the implementation of a primal-dual interior point method, SIAM Journal on Optimization, 2 (1992), pp. 575–601.
[208] S. MIZUNO, M. TODD, AND Y. YE, On adaptive-step primal-dual interior-point algorithms for linear programming, Mathematics of Operations Research, 18 (1993), pp. 964–981.
[209] J. L. MORALES AND J. NOCEDAL, Automatic preconditioning by limited memory quasi-Newton updating, SIAM Journal on Optimization, 10 (2000), pp. 1079–1096.
[210] J. J. MORÉ, The Levenberg–Marquardt algorithm: Implementation and theory, in Lecture Notes in Mathematics, No. 630 (Numerical Analysis), G. Watson, ed., Springer-Verlag, 1978, pp. 105–116.
[211] J. J. MORÉ, Recent developments in algorithms and software for trust region methods, in Mathematical Programming: The State of the Art, Springer-Verlag, Berlin, 1983, pp. 258–287.
[212] J. J. MORÉ, A collection of nonlinear model problems, in Computational Solution of Nonlinear Systems of Equations, vol. 26 of Lectures in Applied Mathematics, American Mathematical Society, Providence, RI, 1990, pp. 723–762.
[213] J. J. MORÉ AND D. C. SORENSEN, On the use of directions of negative curvature in a modified Newton method, Mathematical Programming, 16 (1979), pp. 1–20.
[214] J. J. MORÉ AND D. C. SORENSEN, Computing a trust region step, SIAM Journal on Scientific and Statistical Computing, 4 (1983), pp. 553–572.
[215] J. J. MORÉ AND D. C. SORENSEN, Newton's method, in Studies in Numerical Analysis, vol. 24 of MAA Studies in Mathematics, The Mathematical Association of America, 1984, pp. 29–82.
[216] J. J. MORÉ AND D. J. THUENTE, Line search algorithms with guaranteed sufficient decrease, ACM Transactions on Mathematical Software, 20 (1994), pp. 286–307.
[217] J. J. MORÉ AND S. J. WRIGHT, Optimization Software Guide, SIAM Publications, Philadelphia, 1993.
[218] B. A. MURTAGH AND M. A. SAUNDERS, MINOS 5.1 user's guide, Technical Report SOL 83-20R, Stanford University, 1987.
[219] K. G. MURTY AND S. N. KABADI, Some NP-complete problems in quadratic and nonlinear programming, Mathematical Programming, 19 (1987), pp. 200–212.
[220] S. G. NASH, Newton-type minimization via the Lanczos method, SIAM Journal on Numerical Analysis, 21 (1984), pp. 553–572.
[221] S. G. NASH, SUMT (Revisited), Operations Research, 46 (1998), pp. 763–775.
[222] U. NAUMANN, Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph, Mathematical Programming, 99 (2004), pp. 399–421.
[223] J. A. NELDER AND R. MEAD, A simplex method for function minimization, The Computer Journal, 8 (1965), pp. 308–313.
[224] G. L. NEMHAUSER AND L. A. WOLSEY, Integer and Combinatorial Optimization, John Wiley & Sons, New York, 1988.
[225] A. S. NEMIROVSKII AND D. B. YUDIN, Problem Complexity and Method Efficiency, John Wiley & Sons, New York, 1983.
[226] Y. E. NESTEROV AND A. S. NEMIROVSKII, Interior-Point Polynomial Methods in Convex Programming, SIAM Publications, Philadelphia, 1994.
[227] G. N. NEWSAM AND J. D. RAMSDELL, Estimation of sparse Jacobian matrices, SIAM Journal on Algebraic and Discrete Methods, 4 (1983), pp. 404–418.
[228] J. NOCEDAL, Updating quasi-Newton matrices with limited storage, Mathematics of Computation, 35 (1980), pp. 773–782.
[229] J. NOCEDAL, Theory of algorithms for unconstrained optimization, Acta Numerica, 1 (1992), pp. 199–242.
[230] J. M. ORTEGA AND W. C. RHEINBOLDT, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York and London, 1970.
[231] M. R. OSBORNE, Nonlinear least squares: the Levenberg algorithm revisited, Journal of the Australian Mathematical Society, Series B, 19 (1976), pp. 343–357.
[232] M. R. OSBORNE, Finite Algorithms in Optimization and Data Analysis, John Wiley & Sons, New York, 1985.
[233] M. L. OVERTON, Numerical Computing with IEEE Floating Point Arithmetic, SIAM, Philadelphia, PA, 2001.
[234] C. C. PAIGE AND M. A. SAUNDERS, LSQR: An algorithm for sparse linear equations and sparse least squares, ACM Transactions on Mathematical Software, 8 (1982), pp. 43–71.
[235] C. H. PAPADIMITRIOU AND K. STEIGLITZ, Combinatorial Optimization: Algorithms and Complexity, Prentice Hall, Englewood Cliffs, NJ, 1982.
[236] E. POLAK, Optimization: Algorithms and Consistent Approximations, no. 124 in Applied Mathematical Sciences, Springer, 1997.
[237] E. POLAK AND G. RIBIÈRE, Note sur la convergence de méthodes de directions conjuguées, Revue Française d'Informatique et de Recherche Opérationnelle, 16 (1969), pp. 35–43.
[238] B. T. POLYAK, The conjugate gradient method in extremal problems, U.S.S.R. Computational Mathematics and Mathematical Physics, 9 (1969), pp. 94–112.
[239] M. J. D. POWELL, An efficient method for finding the minimum of a function of several variables without calculating derivatives, Computer Journal, 7 (1964), pp. 155–162.
[240] M. J. D. POWELL, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed., Academic Press, New York, NY, 1969, pp. 283–298.
[241] M. J. D. POWELL, A hybrid method for nonlinear equations, in Numerical Methods for Nonlinear Algebraic Equations, P. Rabinowitz, ed., Gordon & Breach, London, 1970, pp. 87–114.
[242] M. J. D. POWELL, Problems related to unconstrained optimization, in Numerical Methods for Unconstrained Optimization, W. Murray, ed., Academic Press, 1972, pp. 29–55.
[243] M. J. D. POWELL, On search directions for minimization algorithms, Mathematical Programming, 4 (1973), pp. 193–201.
[244] M. J. D. POWELL, Convergence properties of a class of minimization algorithms, in Nonlinear Programming 2, O. L. Mangasarian, R. R. Meyer, and S. M. Robinson, eds., Academic Press, New York, 1975, pp. 1–27.
[245] M. J. D. POWELL, Some convergence properties of the conjugate gradient method, Mathematical Programming, 11 (1976), pp. 42–49.
[246] M. J. D. POWELL, Some global convergence properties of a variable metric algorithm for minimization without exact line searches, in Nonlinear Programming, SIAM–AMS Proceedings, Vol. IX, R. W. Cottle and C. E. Lemke, eds., SIAM Publications, 1976, pp. 53–72.
[247] M. J. D. POWELL, A fast algorithm for nonlinearly constrained optimization calculations, in Numerical Analysis Dundee 1977, G. A. Watson, ed., Springer Verlag, Berlin, 1977, pp. 144–157.
[248] M. J. D. POWELL, Restart procedures for the conjugate gradient method, Mathematical Programming, 12 (1977), pp. 241–254.
[249] M. J. D. POWELL, Algorithms for nonlinear constraints that use Lagrangian functions, Mathematical Programming, 14 (1978), pp. 224–248.
[250] M. J. D. POWELL, The convergence of variable metric methods for nonlinearly constrained optimization calculations, in Nonlinear Programming 3, Academic Press, New York and London, 1978, pp. 27–63.
[251] M. J. D. POWELL, On the rate of convergence of variable metric algorithms for unconstrained optimization, Technical Report DAMTP 1983/NA7, Department of Applied Mathematics and Theoretical Physics, Cambridge University, 1983.
[252] M. J. D. POWELL, Variable metric methods for constrained optimization, in Mathematical Programming: The State of the Art, Bonn, 1982, Springer-Verlag, Berlin, 1983, pp. 288–311.
[253] M. J. D. POWELL, Nonconvex minimization calculations and the conjugate gradient method, Lecture Notes in Mathematics, 1066 (1984), pp. 122–141.
[254] M. J. D. POWELL, The performance of two subroutines for constrained optimization on some difficult test problems, in Numerical Optimization, P. T. Boggs, R. H. Byrd, and R. B. Schnabel, eds., SIAM Publications, Philadelphia, 1984.
[255] M. J. D. POWELL, Convergence properties of algorithms for nonlinear optimization, SIAM Review, 28 (1986), pp. 487–500.
[256] M. J. D. POWELL, Direct search algorithms for optimization calculations, Acta Numerica, 7 (1998), pp. 287–336.
[257] M. J. D. POWELL, UOBYQA: unconstrained optimization by quadratic approximation, Mathematical Programming, Series B, 92 (2002), pp. 555–582.
[258] M. J. D. POWELL, On trust-region methods for unconstrained minimization without derivatives, Mathematical Programming, 97 (2003), pp. 605–623.
[259] M. J. D. POWELL, Least Frobenius norm updating of quadratic models that satisfy interpolation conditions, Mathematical Programming, 100 (2004), pp. 183–215.
[260] M. J. D. POWELL, The NEWUOA software for unconstrained optimization without derivatives, Numerical Analysis Report DAMTP 2004/NA05, University of Cambridge, Cambridge, UK, 2004.
[261] M. J. D. POWELL AND P. L. TOINT, On the estimation of sparse Hessian matrices, SIAM Journal on Numerical Analysis, 16 (1979), pp. 1060–1074.
[262] R. L. RARDIN, Optimization in Operations Research, Prentice-Hall, Englewood Cliffs, NJ, 1998.
[263] F. RENDL AND H. WOLKOWICZ, A semidefinite framework for trust region subproblems with applications to large scale minimization, Mathematical Programming, 77 (1997), pp. 273–299.
[264] J. M. RESTREPO, G. K. LEAF, AND A. GRIEWANK, Circumventing storage limitations in variational data assimilation studies, SIAM Journal on Scientific Computing, 19 (1998), pp. 1586–1605.
[265] K. RITTER, On the rate of superlinear convergence of a class of variable metric methods, Numerische Mathematik, 35 (1980), pp. 293–313.
[266] S. M. ROBINSON, A quadratically convergent algorithm for general nonlinear programming problems, Mathematical Programming, 3 (1972), pp. 145–156.
[267] S. M. ROBINSON, Perturbed Kuhn–Tucker points and rates of convergence for a class of nonlinear programming algorithms, Mathematical Programming, 7 (1974), pp. 1–16.
[268] S. M. ROBINSON, Generalized equations and their solutions. Part II: Applications to nonlinear programming, Mathematical Programming Study, 19 (1982), pp. 200–221.
[269] R. T. ROCKAFELLAR, The multiplier method of Hestenes and Powell applied to convex programming, Journal of Optimization Theory and Applications, 12 (1973), pp. 555–562.
[270] R. T. ROCKAFELLAR, Lagrange multipliers and optimality, SIAM Review, 35 (1993), pp. 183–238.
[271] J. B. ROSEN AND J. KREUSER, A gradient projection algorithm for nonlinear constraints, in Numerical Methods for Non-Linear Optimization, F. A. Lootsma, ed., Academic Press, London and New York, 1972, pp. 297–300.
[272] Y. SAAD, Iterative Methods for Sparse Linear Systems, SIAM Publications, Philadelphia, PA, second ed., 2003.
[273] Y. SAAD AND M. SCHULTZ, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing, 7 (1986), pp. 856–869.
[274] H. SCHEEL AND S. SCHOLTES, Mathematical programs with complementarity constraints: Stationarity, optimality and sensitivity, Mathematics of Operations Research, 25 (2000), pp. 1–22.
[275] T. SCHLICK, Modified Cholesky factorizations for sparse preconditioners, SIAM Journal on Scientific Computing, 14 (1993), pp. 424–445.
[276] R. B. SCHNABEL AND E. ESKOW, A new modified Cholesky factorization, SIAM Journal on Scientific Computing, 11 (1991), pp. 1136–1158.
[277] R. B. SCHNABEL AND P. D. FRANK, Tensor methods for nonlinear equations, SIAM Journal on Numerical Analysis, 21 (1984), pp. 815–843.
[278] G. SCHULLER, On the order of convergence of certain quasi-Newton methods, Numerische Mathematik, 23 (1974), pp. 181–192.
[279] G. A. SCHULTZ, R. B. SCHNABEL, AND R. H. BYRD, A family of trust-region-based algorithms for unconstrained minimization with strong global convergence properties, SIAM Journal on Numerical Analysis, 22 (1985), pp. 47–67.
[280] G. A. F. SEBER AND C. J. WILD, Nonlinear Regression, John Wiley & Sons, New York, 1989.
[281] T. STEIHAUG, The conjugate gradient method and trust regions in large scale optimization, SIAM Journal on Numerical Analysis, 20 (1983), pp. 626–637.
[282] J. STOER, On the relation between quadratic termination and convergence properties of minimization algorithms. Part I: Theory, Numerische Mathematik, 28 (1977), pp. 343–366.
[283] K. TANABE, Centered Newton method for mathematical programming, in System Modeling and Optimization: Proceedings of the 13th IFIP Conference, vol. 113 of Lecture Notes in Control and Information Systems, Berlin, 1988, Springer-Verlag, pp. 197–206.
[284] M. J. TODD, Potential-reduction methods in mathematical programming, Mathematical Programming, Series B, 76 (1997), pp. 3–45.
[285] M. J. TODD, Semidefinite optimization, Acta Numerica, 10 (2001), pp. 515–560.
[286] M. J. TODD, Detecting infeasibility in infeasible-interior-point methods for optimization, in Foundations of Computational Mathematics, Minneapolis, 2002, F. Cucker, R. DeVore, P. Olver, and E. Suli, eds., Cambridge University Press, Cambridge, 2004, pp. 157–192.
[287] M. J. TODD AND Y. YE, A centered projective algorithm for linear programming, Mathematics of Operations Research, 15 (1990), pp. 508–529.
[288] P. L. TOINT, On sparse and symmetric matrix updating subject to a linear equation, Mathematics of Computation, 31 (1977), pp. 954–961.
[289] P. L. TOINT, Towards an efficient sparsity exploiting Newton method for minimization, in Sparse Matrices and Their Uses, Academic Press, New York, 1981, pp. 57–87.
[290] L. TREFETHEN AND D. BAU, Numerical Linear Algebra, SIAM, Philadelphia, PA, 1997.
[291] M. ULBRICH, S. ULBRICH, AND L. N. VICENTE, A globally convergent primal-dual interior-point filter method for nonlinear programming, Mathematical Programming, Series B, 100 (2004), pp. 379–410.
[292] L. VANDENBERGHE AND S. BOYD, Semidefinite programming, SIAM Review, 38 (1996), pp. 49–95.
[293] R. J. VANDERBEI, Linear Programming: Foundations and Extensions, Springer-Verlag, New York, second ed., 2001.
[294] R. J. VANDERBEI AND D. F. SHANNO, An interior point algorithm for nonconvex nonlinear programming, Computational Optimization and Applications, 13 (1999), pp. 231–252.
[295] A. VARDI, A trust region algorithm for equality constrained minimization: convergence properties and implementation, SIAM Journal on Numerical Analysis, 22 (1985), pp. 575–591.
[296] S. A. VAVASIS, Quadratic programming is in NP, Information Processing Letters, 36 (1990), pp. 73–77.
[297] S. A. VAVASIS, Nonlinear Optimization, Oxford University Press, New York and Oxford, 1991.
[298] A. WÄCHTER, An interior point algorithm for large-scale nonlinear optimization with applications in process engineering, PhD thesis, Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA, 2002.
[299] A. WÄCHTER AND L. T. BIEGLER, Failure of global convergence for a class of interior point methods for nonlinear programming, Mathematical Programming, 88 (2000), pp. 565–574.
[300] A. WÄCHTER AND L. T. BIEGLER, Line search filter methods for nonlinear programming: Motivation and global convergence, SIAM Journal on Optimization, 16 (2005), pp. 1–31.
[301] A. WÄCHTER AND L. T. BIEGLER, On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming, Mathematical Programming, 106 (2006), pp. 25–57.
[302] H. WALKER, Implementation of the GMRES method using Householder transformations, SIAM Journal on Scientific and Statistical Computing, 9 (1989), pp. 815–825.
[303] R. A. WALTZ, J. L. MORALES, J. NOCEDAL, AND D. ORBAN, An interior algorithm for nonlinear optimization that combines line search and trust region steps, Technical Report 2003-6, Optimization Technology Center, Northwestern University, Evanston, IL, USA, June 2003.
[304] WATERLOO MAPLE SOFTWARE, INC., Maple V software package, 1994.
[305] L. T. WATSON, Numerical linear algebra aspects of globally convergent homotopy methods, SIAM Review, 28 (1986), pp. 529–545.
[306] R. B. WILSON, A simplicial algorithm for concave programming, PhD thesis, Graduate School of Business Administration, Harvard University, 1963.
[307] D. WINFIELD, Function and functional optimization by interpolation in data tables, PhD thesis, Harvard University, Cambridge, USA, 1969.
[308] W. L. WINSTON, Operations Research, Wadsworth Publishing Co., third ed., 1997.
[309] P. WOLFE, A duality theorem for nonlinear programming, Quarterly of Applied Mathematics, 19 (1961), pp. 239–244.
[310] P. WOLFE, The composite simplex algorithm, SIAM Review, 7 (1965), pp. 42–54.
[311] S. WOLFRAM, The Mathematica Book, Cambridge University Press and Wolfram Media, Inc., third ed., 1996.
[312] L. A. WOLSEY, Integer Programming, Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley & Sons, New York, NY, 1998.
[313] M. H. WRIGHT, Interior methods for constrained optimization, in Acta Numerica 1992, Cambridge University Press, Cambridge, 1992, pp. 341–407.
[314] M. H. WRIGHT, Direct search methods: Once scorned, now respectable, in Numerical Analysis 1995 (Proceedings of the 1995 Dundee Biennial Conference in Numerical Analysis), Addison Wesley Longman, 1996, pp. 191–208.
[315] S. J. WRIGHT, Applying new optimization algorithms to model predictive control, in Chemical Process Control-V, J. C. Kantor, ed., CACHE, 1997.
[316] S. J. WRIGHT, Primal-Dual Interior-Point Methods, SIAM Publications, Philadelphia, PA, 1997.
[317] S. J. WRIGHT, Modified Cholesky factorizations in interior-point algorithms for linear programming, SIAM Journal on Optimization, 9 (1999), pp. 1159–1191.
[318] S. J. WRIGHT AND J. N. HOLT, An inexact Levenberg–Marquardt method for large sparse nonlinear least squares problems, Journal of the Australian Mathematical Society, Series B, 26 (1985), pp. 387–403.
[319] E. A. YILDIRIM AND S. J. WRIGHT, Warm-start strategies in interior-point methods for linear programming, SIAM Journal on Optimization, 12 (2002), pp. 782–810.
[320] Y. YUAN, On the truncated conjugate gradient method, Mathematical Programming, Series A, 87 (2000), pp. 561–573.
[321] Y. ZHANG, Solving large-scale linear programs with interior-point methods under the Matlab environment, Optimization Methods and Software, 10 (1998), pp. 1–31.
[322] C. ZHU, R. H. BYRD, P. LU, AND J. NOCEDAL, Algorithm 778: L-BFGS-B, FORTRAN subroutines for large scale bound constrained optimization, ACM Transactions on Mathematical Software, 23 (1997), pp. 550–560.
Index
Accumulation point, see Limit point
Active set, 308, 323, 336, 342
Affine scaling
  direction, 395, 398, 414
  method, 417
Alternating variables method, see also Coordinate search method, 104, 230
Angle test, 41
Applications
  design optimization, 1
  finance, 7
  portfolio optimization, 1, 449–450, 492
  transportation, 4
Armijo line search, see Line search, Armijo
Augmented Lagrangian function, 423
  as merit function, 436
  definition, 514
  exactness of, 517–518
  example, 516
Augmented Lagrangian method, 422, 498, 514–526
  convergence, 518–519
  framework for, 515
  implementation, 519–523
  LANCELOT, 175, 519–522
  motivation, 514–515
Automatic differentiation, 170, 194
  adjoint variables, 208, 209
  and graph-coloring algorithms, 212, 216–218
  checkpointing, 210
  common expressions, 211
  computational graph, 205–206, 208, 210, 211, 213, 215
  computational requirements, 206–207, 210, 214, 216, 219
  forward mode, 206–207, 278
  forward sweep, 206, 208, 210, 213–215, 219
  foundations in elementary arithmetic, 194, 204
  Hessian calculation
    forward mode, 213–215
    interpolation formulae, 214–215
    reverse mode, 215–216
  intermediate variables, 205–209, 211, 212, 218
  Jacobian calculation, 210–213
    forward mode, 212
    reverse mode, 212–213
  limitations of, 216–217
  reverse mode, 207–210
  reverse sweep, 208–210, 218
  seed vectors, 206, 207, 212, 213, 216
  software, 194, 210, 217
Backtracking, 37, 240
Barrier functions, 566, 583
Barrier method, 563–566
  primal, 583
Basic variables, 429
Basis matrix, 429–431
BFGS method, 24, 29, 136–143
  damping, 537
  implementation, 142–143
  properties, 141–142, 161
  self-correction, 142
  skipping, 143, 537
Bound-constrained optimization, 97, 485–490
BQPD, 490
Broyden class, see Quasi-Newton method, Broyden class
Broyden's method, 273, 274, 284, 285, 302, 634
  derivation of, 279–281
  limited-memory variants, 283
  rate of convergence, 281–283
  statement of algorithm, 281
Byrd–Omojokun method, 547, 579
Calculus of variations, 9
Cancellation error, see Floating-point arithmetic, cancellation
Cauchy point, 71–73, 76, 77, 93, 100, 170, 172, 262, 486
  calculation of, 71–72, 96
  for nonlinear equations, 291–292
  role in global convergence, 77–79
Cauchy sequence, 618
Cauchy–Schwarz inequality, 75, 99, 151, 600
Central path, 397–399, 417
  for nonlinear problems, 565, 584, 594
  neighborhoods of, 399–401, 403, 406, 413
Chain rule, 29, 194, 204, 206–208, 213, 625, 627, 629
Cholesky factorization, 87, 141, 143, 161, 251, 259, 289, 292, 454, 599, 608–609, 617
  incomplete, 174
  modified, 48, 51–54, 63, 64, 76
    bounded modified factorization property, 48
  sparse, 412–413
  stability of, 53, 617
Classification of algorithms, 422
Combinatorial difficulty, 424
Complementarity condition, 70, 313, 321, 333, 397
  strict, 321, 337, 342, 533, 565, 591
Complementarity problems
  linear (LCP), 415
  nonlinear (NCP), 417
Complexity of algorithms, 388–389, 393, 406, 415, 417
Conditioning, see also Matrix, condition number, 426, 430–432, 616–617
  ill conditioned, 29, 502, 514, 586, 616
  well conditioned, 616
Cone, 621
Cone of feasible directions, see Tangent cone
Conjugacy, 25, 102
Conjugate direction method, 103
  expanding subspace minimization, 106, 172, 173
  termination of, 103
Conjugate gradient method, 71, 101–132, 166, 170–173, 253, 278
  n-step quadratic convergence, 133
  clustering of eigenvalues, 116
  effect of condition number, 117
  expanding subspace minimization, 112
  Fletcher–Reeves, see Fletcher–Reeves method
  for reduced system, 459–461
  global convergence, 40
  Hestenes–Stiefel, 123
  Krylov subspace, 113
  modified for indefiniteness, 169–170
  nonlinear, 25, 121–131
  numerical performance, 131
  optimal polynomial, 113
  optimal process, 112
  Polak–Ribière, see Polak–Ribière method
  practical version, 111
  preconditioned, 118–119, 170, 460
  projected, 461–463, 548, 571, 581, 593
  rate of convergence, 112
  relation to limited-memory, 180
  restarts, 124
  superlinear convergence, 132
  superquadratic, 133
  termination, 115, 124
Constrained optimization, 6
  nonlinear, 4, 6, 211, 293, 356, 421, 498, 500
Constraint qualifications, 315–320, 333, 338–340, 350
  linear independence (LICQ), 320, 321, 323, 339, 341, 358, 464, 503, 517, 533, 557, 565, 591
  Mangasarian–Fromovitz (MFCQ), 339–340
Constraints, 2, 307
  bounds, 434, 519, 520
  equality, 305
  inequality, 305
Continuation methods for nonlinear equations, 274, 303
  application to KKT conditions for nonlinear optimization, 565
  convergence of, 300–301
  formulation as initial-value ODE, 297–299
  motivation, 296–297
  predictor–corrector method, 299–300
  zero path, 296–301, 303
    divergence of, 300–301
    tangent, 297–300
    turning point, 296, 297, 300
Convergence, rate of, 619–620
  n-step quadratic, 133
  linear, 262, 619, 620
  quadratic, 23, 29, 49, 168, 257, 619, 620
  sublinear, 29
  superlinear, 23, 29, 73, 132, 140, 142, 160, 161, 168, 262–265, 414, 619, 620
  superquadratic, 133
Convex combination, 621
Convex hull, 621
Convex programming, 7, 8, 335
Convexity, 7–8
  of functions, 8, 16–17, 28, 250
  of sets, 8, 28, 352
  strict, 8
Coordinate descent method, see Alternating variables method, 233
Coordinate relaxation step, 431
Coordinate search method, 135, 230–231
CPLEX, 490
Critical cone, 330
Data-fitting problems, 11–12, 248
Degeneracy, 465
  of basis, 366, 369, 372, 382
  of linear program, 366
Dennis and Moré characterization, 47
Descent direction, 21, 29, 30
DFP method, 139
Differential equations
  ordinary, 299
  partial, 216, 302
Direct sum, 603
Directional derivative, 206, 207, 437, 628–629
Discrete optimization, 5–6
Dual slack variables, 359
Dual variables, see also Lagrange multipliers, 359
Duality, 350
  in linear programming, 359–362
  in nonlinear programming, 343–349
  weak, 345, 361
Eigenvalues, 84, 252, 337, 599, 603, 613
  negative, 77, 92
  of symmetric matrix, 604
Eigenvectors, 84, 252, 603
Element function, 186
Elimination of variables, 424
  linear equality constraints, 428–433
  nonlinear, 426–428
  when inequality constraints are present, 434
Ellipsoid algorithm, 389, 393, 417
Error
  absolute, 614
  relative, 196, 251, 252, 614, 617
  truncation, 216
Errors-in-variables models, 265
Feasibility restoration, 439–440
Feasible sequences, 316–325, 332–333, 336
  limiting directions of, 316–325, 329, 333
Feasible set, 3, 305, 306, 338
  geometric properties of, 340–341
  primal, 358
  primal-dual, 397, 399, 405, 414
Filter method, 437–440
Filters, 424, 437–440, 575, 589
  for interior-point methods, 575
Finite differencing, 170, 193–204, 216, 268, 278
  and graph-coloring algorithms, 202–204
  and noise, 221
  central-difference formula, 194, 196–197, 202, 217
  forward-difference formula, 195, 196, 202, 217
  gradient approximation, 195–197
  graph-coloring algorithms and, 200–201
  Hessian approximation, 201–204
  Jacobian approximation, 197–201, 283
First-order feasible descent direction, 310–315
First-order optimality conditions, see also Karush–Kuhn–Tucker (KKT) conditions, 90, 275, 307–329, 340, 352
  derivation of, 315–329
  examples, 308–315, 317–319, 321–322
  fundamental principle of, 325–326
  unconstrained optimization, 14–15, 513
Fixed-regressor model, 248
Fletcher–Reeves method, 102, 121–131
  convergence of, 125
  numerical performance, 131
Floating-point arithmetic, 216, 614–615, 617
  cancellation, 431, 615
  double-precision, 614
  roundoff error, 195, 217, 251, 615
  unit roundoff, 196, 217, 614
Floating-point numbers, 614
  exponent, 614
  fractional part, 614
Forcing sequence, see Newton's method, inexact, forcing sequence
Function
  continuous, 623–624
  continuously differentiable, 626, 631
  derivatives of, 625–630
  differentiable, 626
  Lipschitz continuous, 624, 630
  locally Lipschitz continuous, 624
  one-sided limit, 624
  univariate, 625
Functions
  smooth, 10, 14, 306–307, 330
Fundamental theorem of algebra, 603
Gauss–Newton method, 254–258, 263, 266, 275
  connection to linear least squares, 255
  line search in, 254
  performance on large-residual problems, 262
Gaussian elimination, 51, 430, 455, 609
  sparse, 430, 433
  stability of, 617
  with row partial pivoting, 607, 617
Global convergence, 77–92, 261, 274
Global minimizer, 12–13, 16, 17, 502, 503
Global optimization, 6–8, 422
Global solution, see also Global minimizer, 6, 69–70, 89–91, 305, 335, 352
GMRES, 278, 459, 492, 571
Goldstein condition, 36, 48
Gradient, 625
  generalized, 18
Gradient projection method, 464, 485–490, 492, 521
Group partial separability, see Partially separable function, group partially separable
Hessian, 14, 19, 20, 23, 26, 626
  average, 138, 140
Homotopy map, 296
Homotopy methods, see Continuation methods for nonlinear equations
Implicit filtering, 240–242
Implicit function theorem, 324, 630–631
Inexact Newton method, see Newton's method, inexact
Infeasibility measure, 437
Inner product, 599
Integer programming, 5, 416
  branch-and-bound algorithm, 6
Integral equations, 302
Interior-point methods, see Primal-dual interior-point methods
  nonlinear, see Nonlinear interior-point method
Interlacing eigenvalue theorem, 613
Interpolation conditions, 223
Invariant subspace, see Partially separable optimization, invariant subspace
Iterative refinement, 463
Jacobian, 246, 254, 256, 269, 274, 324, 395, 504, 627, 630
Karmarkar's algorithm, 389, 393, 417
Karush–Kuhn–Tucker (KKT) conditions, 330, 332, 333, 335–337, 339, 350, 354, 503, 517, 520, 528
  for general constrained problem, 321
  for linear programming, 358–360, 367, 368, 394–415
KNITRO, 490, 525, 583, 592
Krylov subspace, 108
  method, 459
L-BFGS algorithm, 177–180, 183
Lagrange multipliers, 310, 330, 333, 337, 339, 341–343, 353, 358, 360, 419, 422
  estimates of, 503, 514, 515, 518, 521, 522, 584
Lagrangian function, 90, 310, 313, 320, 329, 330, 336
  for linear program, 358, 360
  Hessian of, 330, 332, 333, 335, 337, 358
LANCELOT, 520, 525, 592
Lanczos method, 77, 166, 175–176
LAPACK, 607
Least-squares multipliers, 581
Least-squares problems, linear, 250–254
  normal equations, 250–251, 255, 259, 412
  sensitivity of solutions, 252
  solution via QR factorization, 251–252
  solution via SVD, 252–253
Least-squares problems, nonlinear, 12, 210
  applications of, 246–248
  Dennis–Gay–Welsch algorithm, 263–265
  Fletcher–Xu algorithm, 263
  large-residual problems, 262–265
  large-scale problems, 257
  scaling of, 260–261
  software for, 263, 268
  statistical justification of, 249–250
  structure, 247, 254
Least-squares problems, total, 265
Level set, 92, 261
Levenberg–Marquardt method, 258–262, 266, 289
  as trust-region method, 258–259, 292
  for nonlinear equations, 292
  implementation via orthogonal transformations, 259–260
  inexact, 268
  local convergence of, 262
  performance on large-residual problems, 262
lim inf, lim sup, 618–619
Limit point, 28, 79, 92, 99, 502, 503, 618, 620
Limited-memory method, 25, 176–185, 190
  compact representation, 181–184
  for interior-point method, 575, 597
  L-BFGS, 176–180, 538
  memoryless BFGS method, 180
  performance of, 179
  relation to CG, 180
  scaling, 178
  SR1, 183
  two-loop recursion, 178
Line search, see also Step length selection
  Armijo, 33, 48, 240
  backtracking, 37
  curvature condition, 33
  Goldstein, 36
  inexact, 31
  Newton's method with, 22–23
  quasi-Newton methods with, 23–25
  search directions, 20–25
  strong Wolfe conditions, see Wolfe conditions, strong
  sufficient decrease, 33
  Wolfe conditions, see Wolfe conditions
Line search method, 19–20, 30–48, 66, 67, 71, 230–231, 247
  for nonlinear equations, 271, 285, 287–290
    global convergence of, 287–288
    poor performance of, 288–289
Linear programming, 4, 6, 7, 9, 293
  artificial variables, 362, 378–380
  basic feasible points, 362–366
  basis B, 362–368, 378
  basis matrix, 363
  dual problem, 359–362
  feasible polytope, 356
    vertices of, 365–366
  fundamental theorem of, 363–364
  infeasible, 356, 357
  nonbasic matrix, 367
  primal solution set, 356
  slack and surplus variables, 356, 357, 362, 379, 380
  splitting variables, 357
  standard form, 356–357
  unbounded, 356, 357, 369
  warm start, 410, 416
Linearly constrained Lagrangian methods, 522–523, 527
  MINOS, 523, 527
Linearly dependent, 337
Linearly independent, 339, 503, 504, 517, 519, 602
Lipschitz continuity, see also Function, Lipschitz continuous, 80, 93, 256, 257, 261, 269, 276–278, 287, 294
Local minimizer, 12, 14, 273
  isolated, 13, 28
  strict, 13, 14, 16, 28, 517
  weak, 12
Local solution, see also Local minimizer, 6, 305–306, 316, 325, 329, 332, 340, 342, 352, 513
  isolated, 306
  strict, 306, 333, 335, 336
  strong, 306
Log-barrier function, 417, 597
  definition, 583–584
  difficulty of minimizing, 584–585
  example, 586
  ill conditioned Hessian of, 586
Log-barrier method, 498, 584
LOQO, 490, 592
LSQR method, 254, 268, 459, 492, 571
LU factorization, 606–608
Maratos effect, 440–446, 543, 550
  example of, 440, 543
  remedies, 442
Matlab, 416
Matrix
  condition number, 251, 601–602, 604, 610, 616
  determinant, 154, 605–606
  diagonal, 252, 412, 429, 599
  full-rank, 298, 300, 504, 609
  identity, 599
  indefinite, 76
  inertia, 55, 454
  lower triangular, 599, 606, 607
  modification, 574
  nonsingular, 325, 337, 601, 612
  null space, 298, 324, 337, 430, 432, 603, 608, 609
  orthogonal, 251, 252, 337, 432, 599, 604, 609
  permutation, 251, 429, 606
  positive definite, 15, 16, 23, 28, 68, 76, 337, 599, 603, 609
  positive semidefinite, 8, 15, 70, 415, 599
  projection, 462
  range space, 430, 603
  rank-deficient, 253
  rank-one, 24
  rank-two, 24
  singular, 337
  sparse, 411, 413, 607
    Cholesky factorization, 413
  symmetric, 24, 68, 412, 599, 603
  symmetric indefinite, 413
  symmetric positive definite, 608
  trace, 154, 605
  transpose, 599
  upper triangular, 251, 337, 599, 606, 607, 609
Maximum likelihood estimate, 249
Mean value theorem, 629–630
Merit function, see also Penalty function, 435–437, 446
  ℓ1, 293, 435–436, 513, 540–543, 550
    choice of parameter, 543
  exact, 435–436
    definition of, 435
    nonsmoothness of, 513
  Fletcher's augmented Lagrangian, 436, 540
  for feasible methods, 435
  for nonlinear equations, 273, 285–287, 289, 290, 293, 296, 301–303, 505
  for SQP, 540–543
Merit functions, 424, 575
Method of multipliers, see Augmented Lagrangian method
MINOS, see also Linearly constrained Lagrangian methods, 523, 525, 592
Model-based methods for derivative-free optimization, 223–229
  minimum Frobenius change, 228
Modeling, 2, 9, 11, 247–249
Monomial basis, 227
MOSEK, 490
Multiobjective optimization, 437
Negative curvature direction, 49, 50, 63, 76, 169–172, 175, 489, 491
Neighborhood, 13, 14, 28, 256, 621
Network optimization, 358
Newton's method, 25, 247, 254, 257, 263
  for log-barrier function, 585
  for nonlinear equations, 271, 274–277, 281, 283, 285, 287–290, 294, 296, 299, 302
    cycling, 285
    inexact, 277–279, 288
  for quadratic penalty function, 501, 506
  global convergence, 40
  Hessian-free, 165, 170
  in one variable, 84–87, 91, 633
  inexact, 165–168, 171, 213
    forcing sequence, 166–169, 171, 277
  large scale
    LANCELOT, 175
    line search method, 49
    TRON, 175
  modified, 48–49
    adding a multiple of I, 51
    eigenvalue modification, 49–51
  Newton–CG, 202
    line search, 168–170
    preconditioned, 174–175
    trust-region, 170–175
  Newton–Lanczos, 175–176, 190
  rate of convergence, 44, 76, 92, 166–168, 275–277, 281–282, 620
  scale invariance, 27
Noise in function evaluation, 221–222
Nondifferentiable optimization, 511
Nonlinear equations, 197, 210, 213, 633
  degenerate solution, 274, 275, 283, 302
  examples of, 271–272, 288–289, 300–301
merit function, see Merit function, for
nonlinear equations
multiple solutions, 273274 nondegenerate solution, 274 quasiNewton methods, see Broydens
method
relationship to least squares, 271272,
275, 292293, 302 relationship to optimization, 271 relationship to primaldual
interiorpoint methods, 395 solution, 271
statement of problem, 270271 Nonlinear interiorpoint method, 423,
563593
barrier formulation, 565 feasible version, 576
global convergence, 589 homotopy formulation, 565 superlinear convergence, 591 trustregion approach, 578
Nonlinear leastsquares, see Leastsquares problems, nonlinear
Nonlinear programming, see Constrained optimization, nonlinear
Nonmonotone strategy, 18, 444446 relaxed steps, 444
Nonnegative orthant, 97
Nonsmooth functions, 6, 1718, 306, 307,
352
Nonsmooth penalty function, see Penalty
function, nonsmooth Norm
dual, 601
Euclidean, 25, 51, 251, 280, 302, 600,
601, 605, 610 Frobenius, 50, 138, 140, 601 matrix, 601602
vector, 600601
Normal cone, 340–341
Normal distribution, 249
Normal subproblem, 580
Null space, see Matrix, null space
Numerical analysis, 355
Objective function, 2, 10, 304
One-dimensional minimization, 19, 56
OOPS, 490
OOQP, 490
Optimality conditions, see also First-order optimality conditions, Second-order optimality conditions, 2, 9, 305
  for unconstrained local minimizer, 14–17
Order notation, 631–633
Orthogonal distance regression, 265–267
  contrast with least squares, 265–266
  structure, 266–267
Orthogonal transformations, 251, 259–260
  Givens, 259, 609
  Householder, 259, 609
Partially separable function, 25, 186–189, 211
  automatic detection, 211
  definition, 211
Partially separable optimization, 165
  BFGS, 189
  compactifying matrix, 188
  element variables, 187
  quasi-Newton method, 188
  SR1, 189
Penalty function, see also Merit function, 498
  ℓ1, 507–513
  exact, 422–423, 507–513
  nonsmooth, 497, 507–513
  quadratic, see also Quadratic penalty method, 422, 498–507, 525–527, 586
    difficulty of minimizing, 501–502
    Hessian of, 505–506
    relationship to augmented Lagrangian, 514
    unbounded, 500
Penalty parameter, 435, 436, 498, 500, 501, 507, 514, 521, 525
  update, 511, 512
PENNON, 526
Pivoting, 251, 617
Polak-Ribière method, 122
  convergence of, 130
  numerical performance, 131
Polynomial bases, 226
  monomials, 227
Portfolio optimization, see Applications, portfolio optimization
Preconditioners, 118–120
  banded, 120
  constraint, 463
  for constrained problems, 462
  for primal-dual system, 571
  for reduced system, 460
  incomplete Cholesky, 120
  SSOR, 120
Preprocessing, see Presolving
Presolving, 385–388
Primal interior-point method, 570
Primal-dual interior-point methods, 389, 597
  centering parameter, 396, 398, 401, 413
  complexity of, 393, 406, 415
  contrasts with simplex method, 356, 393
  convex quadratic programs, 415
  corrector step, 414
  duality measure, 395, 398
  infeasibility detection, 411
  linear algebra issues, 411–413
  Mehrotra's predictor-corrector algorithm, 393, 407–411
  path-following algorithms, 399–414
    long-step, 399–406
    predictor-corrector (Mizuno-Todd-Ye) algorithm, 413
    short-step, 413
  potential function, 414
    Tanabe-Todd-Ye, 414
  potential-reduction algorithms, 414
  predictor step, 413
  quadratic programming, 480–485
  relationship to Newton's method, 394, 395
  starting point, 410–411
Primal-dual system, 567
Probability density function, 249
Projected conjugate gradient method, see Conjugate gradient method, projected
Projected Hessian, 558
  two-sided, 559
Proximal point method, 523
QMR method, 459, 492, 571
QPA, 490
QPOPT, 490
QR factorization, 251, 259, 290, 292, 298, 337, 432, 433, 609–610
  cost of, 609
  relationship to Cholesky factorization, 610
Quadratic penalty method, see also Penalty function, quadratic, 497, 501–502, 514
  convergence of, 502–507
Quadratic programming, 422, 448–492
  active-set methods, 467–480
  big M method, 473
  blocking constraint, 469
  convex, 449
  cycling, 477
  duality, 349, 490
  indefinite, 449, 467, 491–492
  inertia controlling methods, 491, 492
  initial working set, 476
  interior-point method, 480–485
  nonconvex, see Quadratic programming, indefinite
  null-space method, 457–459
  optimal active set, 467
  optimality conditions, 464
  phase I, 473
  Schur-complement method, 455–456
  software, 490
  strictly convex, 349, 449, 472, 477–478
  termination, 477–478
  updating factorizations, 478
  working set, 468–478
Quasi-Newton approximate Hessian, 23, 24, 73, 242, 634
Quasi-Newton method, 25, 165, 247, 263, 501, 585
  BFGS, see BFGS method, 263
  bounded deterioration, 161
  Broyden class, 149–152
  curvature condition, 137
  DFP, see DFP method, 190, 264
  for interior-point method, 575
  for nonlinear equations, see Broyden's method
  for partially separable functions, 25
  global convergence, 40
  large-scale, 165–189
  limited memory, see Limited memory method
  rate of convergence, 46, 620
  secant equation, 24, 137, 139, 263–264, 280, 634
  sparse, see Sparse quasi-Newton method
Range space, see Matrix, range space
Regularization, 574
Residuals, 11, 245, 262–265, 269
  preconditioned, 462
  vector of, 18, 197, 246
Restoration phase, 439
Robust optimization, 7
Root, see Nonlinear equations, solution
Root-finding algorithm, see also Newton's method, in one variable, 259, 260, 633
  for trust-region subproblem, 84–87
Rosenbrock function
  extended, 191
Roundoff error, see Floating-point arithmetic, roundoff error
Row echelon form, 430
Sℓ1QP method, 293, 549
Saddle point, 28, 92
Scale invariance, 27, 138, 141
  of Newton's method, see Newton's method, scale invariance
Scaling, 26–27, 95–97, 342–343, 585
  example of poor scaling, 26–27
  matrix, 96
Schur complement, 456, 611
Secant method, see also Quasi-Newton method, 280, 633, 634
Second-order correction, 442–444, 550
Second-order optimality conditions, 330–337, 342, 602
  for unconstrained optimization, 15–16
  necessary, 92, 331
  sufficient, 333–336, 517, 557
Semidefinite programming, 415
Sensitivity, 252, 616
Sensitivity analysis, 2, 194, 341–343, 350, 361
Separable function, 186
Separating hyperplane, 327
Sequential linear-quadratic programming (SLQP), 293, 423, 534
Sequential quadratic programming, 423, 512, 523, 529–560
  Byrd-Omojokun method, 547
  derivation, 530–533
  full quasi-Newton Hessian, 536
  identification of optimal active set, 533
  IQP vs. EQP, 533
  KKT system, 275
  least-squares multipliers, 539
  line search algorithm, 545
  local algorithm, 532
  Newton-KKT system, 531
  null-space, 538
  QP multipliers, 538
  rate of convergence, 557–560
  reduced-Hessian approximation, 538–540
  relaxation constraints, 547
  Sℓ1QP method, see Sℓ1QP method
  step computation, 545
  trust-region method, 546–549
  warm start, 545
Set
  affine, 622
  affine hull of, 622
  bounded, 620
  closed, 620
  closure of, 621
  compact, 621
  interior of, 621
  open, 620
  relative interior of, 622, 623
Sherman-Morrison-Woodbury formula, 139, 140, 144, 162, 283, 377, 612–613
Simplex method
  as active-set method, 388
  basis B, 365
  complexity of, 388–389
  cycling, 381–382
    lexicographic strategy, 382
    perturbation strategy, 381–382
  degenerate steps, 372, 381
  description of single iteration, 366–372
  discovery of, 355
  dual simplex, 366, 382–385
  entering index, 368, 370, 372, 375–378
  finite termination of, 368–370
  initialization, 378–380
  leaving index, 368, 370
  linear algebra issues, 372–375
  Phase I/Phase II, 378–380
  pivoting, 368
  pricing, 368, 370, 375–376
    multiple, 376
    partial, 376
  reduced costs, 368
  revised, 366
  steepest-edge rule, 376–378
Simulated annealing, 221
Singular values, 255, 604
Singular-value decomposition (SVD), 252, 269, 303, 603–604
Slack variables, see also Linear programming, slack/surplus variables, 424, 519
SNOPT, 536, 592
Software
  BQPD, 490
  CPLEX, 490
  for quadratic programming, 490
  IPOPT, 183, 592
  KNITRO, 183, 490, 525, 592
  L-BFGS-B, 183
  LANCELOT, 520, 525, 592
  LOQO, 490, 592
  MINOS, 523, 525, 592
  MOSEK, 490
  OOPS, 490
  OOQP, 490
  PENNON, 526
  QPA, 490
  QPOPT, 490
  SNOPT, 592
  TRON, 175
  VE09, 490
  XPRESS-MP, 490
Sparse quasi-Newton method, 185–186, 190
SR1 method, 24, 144, 161
  algorithm, 146
  for constrained problems, 538, 540
  limited-memory version, 177, 181, 183
  properties, 147
  safeguarding, 145
  skipping, 145, 160
Stability, 616–617
Starting point, 18
Stationary point, 15, 28, 289, 436, 505
Steepest descent direction, 20, 21, 71, 74
Steepest descent method, 21, 25–27, 31, 73, 95, 585
  rate of convergence, 42, 44, 620
Step length, 19, 30
  unit, 23, 29
Step length selection, see also Line search, 56–62
  bracketing phase, 57
  cubic interpolation, 59
  for Wolfe conditions, 60
  initial step length, 59
  interpolation in, 57
  selection phase, 57
Stochastic optimization, 7
Stochastic simulation, 221
Strict complementarity, see Complementarity condition, strict
Subgradient, 17
Subspace, 602
  basis, 430, 603
    orthonormal, 432
  dimension, 603
  spanning set, 603
Sufficient reduction, 71, 73, 79
Sum of absolute values, 249
Sum of squares, see Least-squares problems, nonlinear
Symbolic differentiation, 194
Symmetric indefinite factorization, 455, 570, 610–612
  Bunch-Kaufman, 612
  Bunch-Parlett, 611
  modified, 54–56, 63
  sparse, 612
Symmetric rank-one update, see SR1 method
Tangent, 315–325
Tangent cone, 319, 340–341
Taylor series, 15, 22, 28, 29, 67, 274, 309, 330, 332, 334, 502
Taylor's theorem, 15, 21–23, 80, 123, 138, 167, 193–195, 197, 198, 202, 274, 280, 294, 323, 325, 332, 334, 630
  statement of, 14
Tensor methods, 274
  derivation, 283–284
Termination criterion, 92
Triangular substitution, 433, 606, 609, 617
Truncated Newton method, see Newton's method, Newton-CG, line search
Trust region
  boundary, 69, 75, 95, 171–173
  box-shaped, 19, 293
  choice of size for, 67, 81
  elliptical, 19, 67, 95, 96, 100
  radius, 20, 26, 68, 69, 73, 258, 294
  spherical, 95, 258
Trust-region method, 19–20, 69, 77, 79, 80, 82, 87, 91, 247, 258, 633
  contrast with line search method, 20, 66–67
  dogleg method, 71, 73–77, 79, 84, 91, 95, 99, 173, 291–293, 548
  double-dogleg method, 99
  for derivative-free optimization, 225
  for nonlinear equations, 271, 273, 285, 290–296
    global convergence of, 292–293
    local convergence of, 293–296
  global convergence, 71, 73, 76–92, 172
  local convergence, 92–95
  Newton variant, 26, 68, 92
  software, 98
  Steihaug's approach, 77, 170–173, 489
  strategy for adjusting radius, 69
  subproblem, 19, 25–26, 68, 69, 72, 73, 76, 77, 91, 95–97, 258
    approximate solution of, 68, 71
    exact solution of, 71, 77, 79, 83–92
    hard case, 87–88
    nearly exact solution of, 95, 292–293
  two-dimensional subspace minimization, 71, 76–77, 79, 84, 95, 98, 100
Unconstrained optimization, 6, 352, 427, 432, 499, 501
  of barrier function, 584
Unit ball, 91
Unit roundoff, see Floating-point arithmetic, unit roundoff
Variable metric method, see Quasi-Newton method
Variable storage method, see Limited memory method
VE09, 490
Watchdog technique, 444–446
Weakly active constraints, 342
Wolfe conditions, 33–36, 48, 78, 131, 137, 138, 140–143, 146, 160, 179, 255, 287, 290
  scale invariance of, 36
  strong, 34, 35, 122, 125, 126, 128, 131, 138, 142, 162, 179
XPRESS-MP, 490
Zoutendijk condition, 38–41, 128, 156, 287