FIT3143 Tutorial Week 9
PARALLEL ALGORITHM DESIGN –
PARTITIONING
OBJECTIVES
• Optimize algorithm design for improved efficiency
• Design efficient partitioning strategies for parallel problems
Note: Tutorials are not assessed. Nevertheless, please attempt the questions to improve your unit comprehension in preparation for the labs, assignments, and final assessments.
1. The following code file implements a serial matrix multiplication. Two text files (i.e., MA.txt and MB.txt) were preloaded with 2000 × 2000 matrices. Therefore, the multiplication of the matrices in MA.txt and MB.txt would also produce a 2000 × 2000 matrix C. When the code is compiled and executed on a computer with six cores running a native Linux operating system, the following outcome is observed:
Figure 1: Serial matrix multiplication performance (2000 × 2000 matrix). Note that performance figures will vary on different hardware platforms.
Rev 01. Updated by VMB Page 1
In an attempt to optimise the serial matrix multiplication, lines 90 and 91 in the provided code file were swapped based on the following screenshot:
Figure 2: Lines 90 and 91 were swapped. Prior to swapping the code, the k-th for loop represents the third nested loop. After swapping, the j-th for loop now represents the third nested loop.
When the revised code is compiled and executed on a computer with the same specifications as before, the following outcome is observed:
Figure 3: Serial matrix multiplication performance (2000 × 2000 matrix) based on a revised code of swapping lines 90 & 91 as seen in Figure 2.
a) By comparing Figures 1 and 3, we can observe an approximately 50% improvement in the performance of the serial matrix multiplication algorithm. Explain how the process of swapping lines 90 and 91 in the serial matrix multiplication code file resulted in such a substantial performance improvement.
b) Apart from the optimization as discussed in (a), propose another method which would improve the performance of a serial matrix multiplication algorithm.
Note: It is not compulsory to recompile the provided code here. Focus on analysing the code to identify potential optimisation techniques without having to parallelize the code.
a) In the initially provided serial matrix multiplication code (prior to swapping Lines 90 & 91), the multiplication algorithm can be expanded as follows:
i = 0, j = 0, k = 0:3
c00 = (a00 × b00) + (a01 × b10) + (a02 × b20) + (a03 × b30)
i = 0, j = 1, k = 0:3
c01 = (a00 × b01) + (a01 × b11) + (a02 × b21) + (a03 × b31)
i = 0, j = 2, k = 0:3
c02 = (a00 × b02) + (a01 × b12) + (a02 × b22) + (a03 × b32)
i = 0, j = 3, k = 0:3
c03 = (a00 × b03) + (a01 × b13) + (a02 × b23) + (a03 × b33)
After swapping Lines 90 & 91, the algorithm can be expanded as follows:
i = 0, k = 0, j = 0:3
c00 = (a00 × b00);  c01 = (a00 × b01);  c02 = (a00 × b02);  c03 = (a00 × b03)
i = 0, k = 1, j = 0:3
c00 = c00 + (a01 × b10);  c01 = c01 + (a01 × b11);  c02 = c02 + (a01 × b12);  c03 = c03 + (a01 × b13)
i = 0, k = 2, j = 0:3
c00 = c00 + (a02 × b20);  c01 = c01 + (a02 × b21);  c02 = c02 + (a02 × b22);  c03 = c03 + (a02 × b23)
i = 0, k = 3, j = 0:3
c00 = c00 + (a03 × b30);  c01 = c01 + (a03 × b31);  c02 = c02 + (a03 × b32);  c03 = c03 + (a03 × b33)
When comparing the two methods, the second allows a much more efficient read of matrix B from memory: B is accessed row by row, in the same order in which its elements are laid out in (row-major) memory.
Referring to the expansion above, it can be observed that in the initial approach, computing c00 requires reading b00, b10, b20 and b30, i.e. a strided walk down a column of B that touches a different memory region on every step. Accessing these scattered regions consumes more time, since each read is likely to miss the cache. By swapping lines 90 & 91, matrix B is instead read along its rows, which is much more cache-friendly and therefore much more efficient.
b) Flatten the 2D arrays into 1D arrays, so that each matrix occupies a single contiguous block of memory:

int commonPoint = colA;
for(i = 0; i < rowC; i++){
    for(j = 0; j < colC; j++){
        for(k = 0; k < commonPoint; k++){
            pMatrixC[i * colC + j] += (pMatrixA[i * colA + k] *
                                       pMatrixB[k * colB + j]);
        }
    }
}
2. Modify the serial code in Question 1 to parallelize the matrix multiplication algorithm using OpenMP. Identify and implement an appropriate partitioning strategy for the matrix multiplication process.
Note: It is not compulsory to compile and execute the parallel code. Focus your solution on applying a logically correct OpenMP code construct to the for loops which are used for the matrix multiplication operation.
int commonPoint = colA;
#pragma omp parallel for shared(ppMatrixA, ppMatrixB, ppMatrixC, commonPoint, rowC, colC) private(i, j, k) schedule(dynamic)
for(i = 0; i < rowC; i++){
    for(j = 0; j < colC; j++){
        for(k = 0; k < commonPoint; k++){
            ppMatrixC[i][j] += (ppMatrixA[i][k] * ppMatrixB[k][j]);
        }
    }
}
3. The following code file implements a parallel matrix multiplication algorithm using MPI with row-based partitioning. Two text files (i.e., MA.txt and MB.txt) were preloaded with 2000 × 2000 matrices. Therefore, the multiplication of the matrices in MA.txt and MB.txt would also produce a 2000 × 2000 matrix. When the code is compiled and executed on a computer with six cores running a native Linux operating system, the following outcome is observed:
Figure 4: Parallel matrix multiplication performance using MPI with row partitioning (2000 × 2000 matrix). Note that performance figures will vary on different hardware platforms.
In this code, blocking send, receive and broadcast MPI functions were used to distribute matrices A and B among the MPI processes.
Modify the blocking communication functions in this code to non-blocking communication. Discuss the possible outcomes in terms of performance.
// 3. Root process distributes rows of matrix A to the other processes
if(my_rank == 0){
    row_offset = rows_per_procs;
    for(i = 1; i < processors; i++){
        // reset the tagSendCounter for each destination process
        tagSendCounter = 0;
        // the last process also takes the remainder rows
        if(i == processors - 1){
            loopCondition = (i * rows_per_procs) + rows_per_procs + rows_remain;
        }else{
            loopCondition = (i * rows_per_procs) + rows_per_procs;
        }
        for(j = row_offset; j < loopCondition; j++){
            // Blocking version, now replaced:
            // MPI_Send(ppMatrixA[j], ma_col, MPI_INT, i, tagSendCounter++, MPI_COMM_WORLD);
            MPI_Isend(ppMatrixA[j], ma_col, MPI_INT, i, tagSendCounter++,
                      MPI_COMM_WORLD, &messageReq[messageCounter++]);
        }
        row_offset += rows_per_procs;
    }
}else{
    // Non-root processes post the matching receives (the source file is
    // truncated here; the loop body is assumed to mirror the sends, with
    // MPI_Recv replaced by MPI_Irecv)
    rowsToSendOrReceive = endRow - startRow;
    for(i = 0; i < rowsToSendOrReceive; i++){
        MPI_Irecv(ppMatrixA[i], ma_col, MPI_INT, 0, i, MPI_COMM_WORLD,
                  &messageReq[messageCounter++]);
    }
}
// Every MPI_Isend/MPI_Irecv request must later be completed, e.g. with
// MPI_Waitall(messageCounter, messageReq, MPI_STATUSES_IGNORE), before the
// received rows are read or the send buffers are reused.