Chapter 1

An Improved Algorithm for Parallel Sparse LU Decomposition on a Distributed-Memory Multiprocessor

Jacko Koster*   Rob H. Bisseling†

Abstract

In this paper we present a new parallel algorithm for the LU decomposition of a general sparse matrix. Among its features are matrix redistribution at regular intervals and a dynamic pivot search strategy that adapts itself to the number of pivots produced. Experimental results obtained on a network of 400 transputers show that these features considerably improve the performance.

1 Introduction

This paper presents an improved version of the parallel algorithm for the LU decomposition of a general sparse matrix developed by van der Stappen, Bisseling, and van de Vorst [9]. The LU decomposition of a matrix A = (A_ij, 0 ≤ i, j < n) produces a unit lower triangular matrix L, an upper triangular matrix U, a row permutation vector π and a column permutation vector ρ, such that

    A_{π_i, ρ_j} = (LU)_{ij},   for 0 ≤ i, j < n.    (1)

We assume that A is sparse and nonsingular and that it has an arbitrary pattern of nonzeros, with all elements having the same (small) probability of being nonzero. A review of parallel algorithms for sparse LU decomposition can be found in [9].

We use the following notations. A submatrix of a matrix A is the intersection of several rows and columns of A. The submatrix A[I, J], I, J ⊆ {0, ..., n−1}, has domain I × J. If I = {i}, we use A[i, J] as shorthand for A[{i}, J]. The concurrent assignment operator c, d := a, b denotes the simultaneous assignment of a to c and b to d. For any (sub)matrix A, nz(A) denotes the number of nonzeros in A. For any set I, |I| is the cardinality of I.

Our algorithm is aimed at a distributed-memory message-passing MIMD multiprocessor with an M × N mesh communication network. We identify each processor in the mesh with a pair (s, t), 0 ≤ s < M, 0 ≤ t < N. A Cartesian distribution [1] of A is a pair of mappings (φ, ψ) that assigns matrix element A_ij to processor (φ_i, ψ_j), with 0 ≤ φ_i < M and 0 ≤ ψ_j < N.
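To make the distribution concrete, here is a small Python sketch (ours, not part of the paper; the original implementation was in ANSI C). It encodes a pair of mappings (φ, ψ) as functions and recovers the owning processor of an element and a local row index set; the cyclic choice shown is the grid distribution (2) used by PARPACK-LU in Section 2. The function names are our own.

```python
# Sketch (not the authors' code): a Cartesian distribution of an n x n matrix
# over an M x N processor mesh, using the cyclic "grid" mappings
# phi_i = i mod M, psi_j = j mod N.

def make_grid_distribution(M, N):
    phi = lambda i: i % M          # row mapping: row i lives on processor row i mod M
    psi = lambda j: j % N          # column mapping: column j lives on processor column j mod N
    return phi, psi

def owner(i, j, phi, psi):
    """Processor (s, t) that stores matrix element A_ij."""
    return (phi(i), psi(j))

def local_rows(s, I, phi):
    """I(s) = {i in I : phi_i = s}, the row indices local to processor row s."""
    return {i for i in I if phi(i) == s}

n, M, N = 8, 2, 3
phi, psi = make_grid_distribution(M, N)
print(owner(5, 4, phi, psi))                 # element A_{5,4} -> processor (1, 1)
print(sorted(local_rows(1, range(n), phi)))  # I(1) = [1, 3, 5, 7]
```

Any pair of mappings works here, which is what lets the algorithm later redistribute rows and columns without being tied to the cyclic pattern.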
For processor (s, t), the set I(s) denotes the local set of row indices I(s) = {i : i ∈ I ∧ φ_i = s}. Similarly, J(t) = {j : j ∈ J ∧ ψ_j = t}.

* CERFACS, 42 Ave G. Coriolis, 31057 Toulouse Cedex, France (Jacko.Koster@cerfacs.fr). Part of this work was done while this author was employed at Eindhoven University of Technology.
† Department of Mathematics, Utrecht University, P.O. Box 80010, 3508 TA Utrecht, the Netherlands (bisseling@math.ruu.nl). Part of this work was done while this author was employed at Koninklijke/Shell-Laboratorium, Amsterdam.

2 The PARPACK-LU algorithm

In this section we briefly describe the previous algorithm [9], which we refer to as PARPACK-LU. This algorithm assigns the nonzero matrix elements to the processors according to the grid distribution defined by

    φ_i = i mod M  ∧  ψ_i = i mod N,   for 0 ≤ i < n.    (2)

Each step of the algorithm consists of a pivot search, row and column permutations, and a multiple-rank update of the reduced matrix A[I, I]. At the beginning of a step, I = {k, ..., n−1}, where k is the number of pivots processed so far.

The pivot search determines a set S of m pivots from the reduced matrix with the following three properties. First, each element (i, j) from S satisfies the threshold criterion

    |A_ij| ≥ u · max_{l∈I} |A_lj|,    (3)

where u is a user-defined parameter, 0 < u ≤ 1 [4, Ch. 9]. This ensures numerical stability. Second, the elements of S have low Markowitz cost, to preserve sparsity.
The Markowitz cost of a nonzero element A_ij in a submatrix A[I, J] equals (nz(A[I, j]) − 1)(nz(A[i, J]) − 1). Third, the elements of S are mutually compatible [2, 3], i.e.,

    A_{i,j'} = 0 ∧ A_{i',j} = 0,   for (i, j), (i', j') ∈ S ∧ (i, j) ≠ (i', j').    (4)

The compatibility of the pivots enables the algorithm to process them in one step and to perform a single rank-m update of the reduced matrix.

After the pivot search, the rows and columns of the matrix are permuted such that the m × m submatrix A[I_S, I_S], I_S = {k, ..., k+m−1}, turns into a diagonal submatrix with the m pivots positioned on the diagonal. This is followed by the rank-m update, which contains the bulk of the floating-point operations. In this part, the set I_S is subtracted from I, each multiplier column j in A[I, I_S] is divided by the corresponding pivot value A_jj, and the matrix product A[I, I_S] · A[I_S, I] is subtracted from A[I, I].

3 The new algorithm

An outline of the new algorithm is given below; a detailed description and a program text can be found in [7].

Algorithm 1 (Parallel LU decomposition).
    I(s), J(t) := {i : 0 ≤ i < n ∧ φ_i = s}, {j : 0 ≤ j < n ∧ ψ_j = t};
    L, U, k := 0, 0, 0;
    while k < n do begin
        find pivot set S = {(i_r, j_r) : k ≤ r < k + m};
        I_S, J_S := {i_r : k ≤ r < k + m}, {j_r : k ≤ r < k + m};
        I(s), J(t) := I(s) \ I_S, J(t) \ J_S;
        register pivots in π and ρ;
        perform multiple-rank update;
        store multipliers in L and pivots and update rows in U;
        k := k + m;
        row-redistribute(A, L);
        col-redistribute(A, U)
    end;
    row-permute(L, π);
    col-permute(U, ρ).

In the search for pivot candidates, we vary the number ncol of matrix columns searched per processor column, on the basis of the estimated density of the reduced matrix. In the first steps of the algorithm, the reduced matrix is still sparse and the candidate pivots will most likely be compatible, and hence most of them will become pivots. As the computation proceeds, fill-in causes the reduced matrix to become gradually denser.
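To illustrate how a compatible pivot set in the sense of (4) can be built, here is a greedy sequential Python sketch (ours, not the paper's parallel implementation; the threshold test (3) is omitted for brevity). Candidates are scanned in order of increasing Markowitz cost, and a candidate is accepted only if the cross elements with every previously accepted pivot are zero, so the pivot block becomes diagonal after permutation.

```python
# Sketch (not the paper's code): greedily build a set S of mutually compatible
# pivots from a sparse matrix given as a dict {(i, j): value}.
# Compatibility, as in Eq. (4): for pivots (i, j) and (i', j'),
# A[i, j'] = 0 and A[i', j] = 0.

def markowitz_cost(i, j, rows, cols):
    """(nz(row i) - 1) * (nz(column j) - 1)."""
    return (rows[i] - 1) * (cols[j] - 1)

def compatible_pivot_set(A):
    rows, cols = {}, {}
    for (i, j) in A:                       # count nonzeros per row and column
        rows[i] = rows.get(i, 0) + 1
        cols[j] = cols.get(j, 0) + 1
    # Candidates in order of increasing Markowitz cost (threshold test omitted).
    candidates = sorted(A, key=lambda ij: markowitz_cost(ij[0], ij[1], rows, cols))
    S = []
    for (i, j) in candidates:
        # Accept only if cross elements with all accepted pivots are zero;
        # this also rejects candidates sharing a row or column with S.
        if all((i, j2) not in A and (i2, j) not in A for (i2, j2) in S):
            S.append((i, j))
    return S

A = {(0, 0): 4.0, (0, 1): 1.0, (1, 1): 3.0, (2, 2): 5.0}
print(compatible_pivot_set(A))   # (1, 1) is rejected because A[0, 1] != 0
```

In the dense limit almost every candidate collides with an earlier pivot, which is exactly the behaviour that the adaptive choice of ncol described here responds to.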
Consequently, the probability of nonzero elements being compatible decreases, and expensive searches for large sets of candidate pivots will yield only relatively few compatible pivots. Therefore, for higher densities of the reduced matrix, we decrease ncol. When the reduced matrix becomes so dense that the pivot search repeatedly produces only one pivot, the algorithm switches to a simple search for only one pivot.

The PARPACK-LU algorithm distributes the work load by maintaining the grid distribution and performing explicit row and column permutations [1, 6] at every step. Therefore, it often requires the exchange of rows between processor rows, and similarly for columns, because many of the explicit permutations cannot be performed locally. This induces a certain amount of added communication time. When implicit permutations [8] are used, the rows and columns of the matrix remain in place. The matrix elements are addressed indirectly and no communication for row or column movements is required. This, however, may lead to a poor load balance, depending on the sequence of pivot choices.

We distribute the load by redistributing the reduced matrix at regular intervals such that |J(t)| ≤ ⌈(n−k)/N⌉ for all t and |I(s)| ≤ ⌈(n−k)/M⌉ for all s. It suffices to move columns from processor columns (∗, t) for which |J(t)| > ⌈(n−k)/N⌉ to processor columns for which |J(t)| < ⌈(n−k)/N⌉, and similarly for the rows of the matrix. The number of columns to be sent left and right is determined using the following criterion: a column to be disposed of is sent in the direction that has the least average surplus per processor column. The first processor column that has space available accepts a passing column. This heuristic reduces the distance over which the rows and columns are communicated. The column
The columnredistribution by processor column (�; t) is done as follows:Algorithm 2 (Column redistribution).if redistribution needed then beginceiling := d(n� k)=Ne;surplus := jJ(t)j � ceiling;if surplus > 0 then determine nl; nr � 0 such that nl + nr = surpluselse nl; nr := 0; 0;redistribute (nr, right);redistribute (nl, left)end;The theoretical analysis of [9] shows that the unordered two-dimensional doubly linkedlist is the best local data structure for parallel sparse LU decomposition, among severalplausible candidates. We use this data structure for the implementation of the newalgorithm; the previous algorithm was implemented using the ordered two-dimensionalsingly linked list. The present data structure facilitates insertion and deletion of nonzeros.4 Experimental resultsThe parallel computer used for our experiments (and those of [9]) is a Parsytec SuperClusterFT-400 consisting of a square mesh of 400 INMOS T800-20 transputers. The new algorithmis implemented in ANSI C with extensions for communication and parallelism. This

implementation allows us to use the parallel program also for experiments on an ordinary sequential computer. We did not optimise the parallel program for these sequential runs. The sequential experiments were performed on a SUN SPARC10 model 30 workstation. The test set of sparse matrices that we use for our experiments is taken from [9]. It consists of eleven unsymmetric matrices from the Harwell-Boeing sparse matrix collection [5].

The user of the program has to specify six input parameters. The parameter ncol0 is the initial number of matrix columns that is searched for pivot candidates by each processor column. The parameter u is used for threshold criterion (3). Pivot candidates with a Markowitz cost higher than α·M_min + β are discarded. Here α and β are input parameters, and M_min is the lowest Markowitz cost of a pivot candidate. After nlast successive pivot searches produce only one pivot, a switch is made to a single-pivot strategy. Each time a multiple of f pivots has been processed, the matrix is redistributed.

Table 1. Time (in s) of LU decomposition on p processors using a static pivot search strategy.

    Matrix      p = 1   p = 4   p = 9   p = 16  p = 25  p = 49  p = 100  p = 400
    IMPCOL B     0.25    0.19    0.16    0.15    0.13    0.13    0.12     0.13
    WEST0067     0.32    0.25    0.24    0.19    0.19    0.18    0.16     0.18
    FS 541 1     8.85    3.97    2.65    2.24    1.89    1.53    1.28     1.14
    STEAM2      48.9    16.1     9.75    6.22    5.27    3.83    3.07     2.51
    SHL 400      2.51    1.68    1.28    1.09    0.94    0.79    0.71     0.61
    BP 1600      4.78    3.38    2.69    2.45    2.20    1.96    1.79     1.63
    JPWH 991   125.     43.6    20.6    12.6    11.2     7.06    5.64     4.28
    SHERMAN1    17.1    11.5     7.18    5.87    4.35    2.97    2.79     2.42
    SHERMAN2  1294.      -       -     108.     64.7    46.1    30.1     13.6
    LNS 3937  1430.      -       -     128.     82.9    55.7    34.6     20.5
    GEMAT11     41.9    21.8    15.6    12.6    10.9     8.78    7.40     5.71

First, we repeated the experiments from [9, Table 4] to compare the performance of the new algorithm to that of the PARPACK-LU algorithm. The standard static pivot search procedure in [9] is most closely approximated by fixing ncol = 1, and using u = 0.1, α = 4, β = 0, and nlast = ∞.
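The surplus-splitting decision of Algorithm 2, whose frequency the parameter f controls, can be sketched sequentially in Python (our illustration; the paper's version runs per processor column and communicates). Each surplus column is assigned to the direction with the least average surplus per processor column; the function name and the all-at-once decision are our own simplifications.

```python
# Sketch (ours, not the paper's distributed code): decide, for one overfull
# processor column t, how many surplus matrix columns to send left (nl) and
# right (nr). Each surplus column goes in the direction with the least
# average surplus per processor column, as in the heuristic of Section 3.

def split_surplus(counts, t, ceiling):
    """counts[t'] = |J(t')| for each processor column t'; returns (nl, nr)."""
    surplus = counts[t] - ceiling
    if surplus <= 0:
        return 0, 0
    left = [counts[u] - ceiling for u in range(t)]           # surpluses to the left
    right = [counts[u] - ceiling for u in range(t + 1, len(counts))]
    avg = lambda xs: sum(xs) / len(xs) if xs else float("inf")
    nl = nr = 0
    for _ in range(surplus):
        if avg(left) <= avg(right):
            nl += 1
        else:
            nr += 1
    return nl, nr

# Example: ceiling 3; processor column 2 holds 5 matrix columns (surplus 2).
# The right side has the lower average surplus, so both columns go right.
counts = [3, 2, 5, 3, 1]
print(split_surplus(counts, 2, 3))
```
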
The matrix is redistributed at every step, i.e., f = 1, to mimic the explicit row and column permutations of the PARPACK-LU algorithm. This experiment allows us to examine the effect of choosing a different data structure, without regard to other improvements. Table 1 shows the time needed for parallel LU decomposition on a square mesh of p transputers. A comparison with the results for the PARPACK-LU algorithm [9] shows that the new algorithm is superior. It outperforms the previous one for small p, with a gain of up to a factor of 2.4 (for SHERMAN1 and p = 1). For large p, the two algorithms perform equally well, because communication time, which dominates for such p, is similar.

To investigate the influence of the pivot strategy, we replaced the standard pivot strategy of [9] by a dynamic one: ncol0 = 10, and at the end of each step ncol is adjusted according to

    if m > ncol · N/2 then ncol := ncol + 1 else ncol := ncol − 1

This strategy strives for generating twice as many pivot candidates as pivots. This way, a relatively large set of pivot candidates, each with a reasonable probability of becoming a pivot, is used in constructing the pivot set, while the time required for the compatibility checks is kept within bounds. (The other parameters in this experiment are u = 0.1, α = 4,
β = 10, nlast = 25, and f = 100.)

Table 2. Time (in s) of LU decomposition on a sequential computer and on p processors of a parallel computer, using a dynamic pivot search strategy.

    Matrix     SPARC10   p = 1   p = 16  p = 100  p = 400
    IMPCOL B      -       0.19    0.11    0.10     0.11
    WEST0067      -       0.27    0.18    0.15     0.14
    FS 541 1     1.25     7.99    1.86    1.08     0.96
    STEAM2       7.05    45.6     6.88    2.55     1.95
    SHL 400      0.12     0.77    0.46    0.39     0.36
    BP 1600      0.48     3.01    1.27    1.06     0.96
    JPWH 991    13.9     82.2    13.0     4.34     2.57
    SHERMAN1     2.98    18.8     4.53    2.00     1.54
    SHERMAN2   206.    1145.    127.     25.9     10.6
    LNS 3937   277.    1405.    120.     33.2     16.4
    GEMAT11      4.13   26.6     6.68    4.54     3.83

Table 2 shows that the dynamic pivot search strategy is superior. We attribute this mainly to the higher rank m of the matrix updates, which leads to less frequent synchronisation of the processors. Even for p = 1 there is a gain: for example, the computing time is reduced by a factor of three for the matrix SHL 400. This confirms the theoretical analysis from [9] that multiple-rank updates are beneficial also in sequential sparse LU decomposition algorithms.

References

[1] R. H. Bisseling and J. G. G. van de Vorst, Parallel LU decomposition on a transputer network, Lecture Notes in Computer Science 384, Springer-Verlag, New York, 1989, pp. 61-77.
[2] D. A. Calahan, Parallel solution of sparse simultaneous linear equations, Proc. 11th Annual Allerton Conf. on Circuits and System Theory, 1973, pp. 729-735.
[3] T. A. Davis and P.-C. Yew, A nondeterministic parallel algorithm for general unsymmetric sparse LU factorization, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 383-402.
[4] I. S. Duff, A. M. Erisman, and J. K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, Oxford, UK, 1986.
[5] I. S. Duff, R. G. Grimes, and J. G. Lewis, Sparse matrix test problems, ACM Trans. Math. Software, 15 (1989), pp. 1-14.
[6] G. A. Geist and C. H. Romine, LU factorization algorithms on distributed-memory multiprocessor architectures, SIAM J. Sci. Stat. Comput., 9 (1988), pp.
639-649.
[7] J. Koster, Parallel solution of sparse systems of linear equations on a mesh network of transputers, Final Report, Institute for Continuing Education, Eindhoven University of Technology, Eindhoven, The Netherlands, July 1993.
[8] A. Skjellum, Concurrent dynamic simulation: multicomputer algorithms research applied to ordinary differential-algebraic process systems in chemical engineering, Ph.D. Thesis, California Institute of Technology, Pasadena, CA, May 1990.
[9] A. F. van der Stappen, R. H. Bisseling, and J. G. G. van de Vorst, Parallel sparse LU decomposition on a mesh network of transputers, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 853-879.