2.5. Matrix multiplication (with weak dependences)

This benchmark runs a matrix multiplication operation C = AB, where A has size NxM, B has size MxP, and the resulting matrix C has size NxP. The kernel uses a blocking version of the matrix multiplication algorithm with three nested loops chasing the matrix multiplication index pattern; the loop body implements the inner/block matrix multiplication:

for (long k = 0; k < NB_K; k++) {
   for (long i = 0; i < NB_I; i++) {
      #pragma oss task label(matmul block)
      for (long j = 0; j < NB_J; j++) {
         double (* restrict A_block)[TS] = A[i][k];
         double (* restrict B_block)[TS] = B[k][j];
         double (* restrict C_block)[TS] = C[i][j];

         for (long ii = 0; ii < TS; ii++) {
            for (long jj = 0; jj < TS; jj++) {
               for (long kk = 0; kk < TS; kk++) {
                  C_block[ii][jj] += A_block[ii][kk] * B_block[kk][jj];
               }
            }
         }
      }
   }
   #pragma oss taskwait
}

The parallelization creates a task containing the ‘j’ loop (i.e. the ‘i’ loop body). After all the ‘i’ iterations the code executes a taskwait directive which guarantees the execution of all the previous instantiated tasks before continuing. The algorithm opens and closes the parallelism as many times as ‘k’ iterations.

2.5.1. Goals of this exercise

  • Code is completely annotated: you DON’T need to modify it. You only need to compile and execute.
  • Review source code and check the different directives and their clauses. Try to understand what they mean.
  • Pay special attention to data-sharing attributes, including those that do not appear (i.e., the implicit ones).
  • Check several runtime options when executing the program (versions, schedulers, etc.).
  • Check scalability. Execute the program using different numbers of cpus and compute the speed-up.
  • Change program arguments that may have an impact on task granularity (block size, tile size, etc.).
  • Change program arguments that may have an impact on the number of tasks (matrix sizes and/or block/tile sizes).
  • Get different paraver traces using different runtime options or program arguments and compare them.

2.5.2. Execution instructions

You should run the program as:

./matmul N M P TS

where:

  • N is the number of rows of the matrix A.
  • M is the number of columns of the matrix A and the number of rows of the matrix B.
  • P is the number of columns of the matrix B.
  • The matrix multiplication operation will be applied in blocks that contain TSxTS elements.