02-Matrix multiplication

This example multiplies two matrices (A and B) and stores the result in a third one (C). The code has been shown several times during the lecture and has been used as the reference for hybrid parallelization (i.e., MPI + OpenMP). It implements a blocked matrix multiplication algorithm in which each block becomes a task; the task dependencies enforce the correct synchronization between them. Think about how the parallelization has been done, and pay special attention to the following code from the matmul() function:

for (node = 0; node < nodes; node++) {
   #pragma omp task depend(in: a[0][0], B[0][0]) depend(inout: C[i][0])
   mxm(m, n, a, B, (double (*)[n]) &C[i][0]);

   #pragma omp task depend(in: a[0][0]) depend(out: rbuf[0][0]) //depend(inout: serial)
   call_send_recv(m, n, a, down, rbuf, up);

   i = (i+n) % m;                     //next C block circular
   ptmp = a; a = rbuf; rbuf = ptmp;   //swap pointers
}
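
To make the role of the depend clauses easier to see in isolation, here is a minimal sketch that is not part of the exercise code. The names scale() and pipeline() are hypothetical. As in the loop above, the first element of each buffer is used as the dependence handle for the whole buffer, so the second task cannot start until the first one has finished writing tmp.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the exercise): dst[k] = factor * src[k]. */
static void scale(size_t n, const double *src, double *dst, double factor)
{
    for (size_t k = 0; k < n; k++)
        dst[k] = factor * src[k];
}

/* Two-stage pipeline: tmp = 2 * src, then out = 3 * tmp.
 * The first element of each buffer stands in for the whole buffer
 * in the depend clauses, mirroring a[0][0] and rbuf[0][0] above. */
void pipeline(size_t n, const double *src, double *tmp, double *out)
{
    assert(n > 0);

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: reads src, writes tmp. */
        #pragma omp task depend(in: src[0]) depend(out: tmp[0])
        scale(n, src, tmp, 2.0);

        /* Consumer: ordered after the producer through tmp[0]. */
        #pragma omp task depend(in: tmp[0]) depend(out: out[0])
        scale(n, tmp, out, 3.0);
    }   /* implicit barrier: both tasks have completed here */
}
```

The sketch is meant to be compiled with an OpenMP flag such as -fopenmp (GCC/Clang). Without that flag the pragmas are ignored and the two calls simply run in program order, producing the same result.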

Goals of this exercise

  • The code is completely annotated: you DON’T need to modify it.
  • Review the source code and check the parallelization. This includes MPI services and OpenMP directives/services. Try to understand what they mean.
  • Review the Makefile rules. Check how each source code file is compiled/linked.
  • Check the scalability: execute the program with different numbers of ranks and threads and compute the speed-up.
  • Get different Paraver traces and visualize them: thread state, task name, etc.
  • Why is the depend(inout: serial) clause commented out? Is it needed?
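
For the scalability goal, the speed-up and the parallel efficiency can be computed from the measured execution times. The following is a small sketch under assumptions: the function names are hypothetical, and the baseline time is assumed to come from a run with 1 rank and 1 thread (any other baseline works as long as it is used consistently).

```c
#include <assert.h>

/* Speed-up of a parallel run relative to a baseline run.
 * t_base: time of the baseline run (assumed: 1 rank, 1 thread).
 * t_par:  time of the parallel run, in the same units. */
double speedup(double t_base, double t_par)
{
    assert(t_base > 0.0 && t_par > 0.0);
    return t_base / t_par;
}

/* Parallel efficiency: speed-up divided by the total number of
 * cores used, taken here as ranks * threads per rank. */
double efficiency(double t_base, double t_par, int ranks, int threads)
{
    assert(ranks > 0 && threads > 0);
    return speedup(t_base, t_par) / (double)(ranks * threads);
}
```

An efficiency close to 1 means the additional ranks and threads are being used well; values that drop as the core count grows indicate overheads such as communication or idle threads, which the Paraver traces can help locate.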

Execution instructions

The first step is to compile the program and generate the auxiliary scripts. You can do it by executing the following command inside this folder:

$ make

The generated binary accepts three options. The most relevant ones are the size of the matrix (-s) and the number of timesteps (-t); the -h option prints the full usage. An example execution could be:

$ mpirun -n 4 ./matmul -t 4 -s 512

Here the application performs 4 timesteps on 4 MPI processes, and each process gets the same number of cores (# of available cores / 4). The total size of the matrix is 512x512, which means that each process holds a 128x512 block of matrix A and a 512x128 block of each of the matrices B and C.
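
The arithmetic behind these numbers can be sketched as follows. The function names are hypothetical, and the sketch assumes that the matrix size and the core count divide evenly by the number of ranks; how the actual program handles other cases is not covered here.

```c
#include <assert.h>

/* Short side of the block held by each rank, assuming the matrix
 * size is an exact multiple of the number of ranks. */
int block_extent(int size, int ranks)
{
    assert(ranks > 0 && size > 0 && size % ranks == 0);
    return size / ranks;
}

/* Cores available to each rank, assuming the cores are split
 * evenly among the ranks (# of available cores / ranks). */
int cores_per_rank(int total_cores, int ranks)
{
    assert(ranks > 0 && total_cores >= ranks);
    return total_cores / ranks;
}
```

For the example above, block_extent(512, 4) gives 128, which is the short side of the 128x512 and 512x128 blocks.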

We have prepared two scripts, run-once.sh and multirun.sh, for submitting jobs to the execution queues of this cluster. The first one, run-once.sh, runs a single execution with the given parameters and, if tracing is enabled, also extracts a Paraver trace of the execution.

You can submit the job script to the queues with:

$ bsub < run-once.sh

Please feel free to change the job parameters (processes, processes per node, etc.) and the execution parameters of the program. Additionally, you can generate a trace of the execution by uncommenting the following line of the run-once.sh script:

INSTRUMENT=./trace.sh

Finally, multirun.sh performs multiple executions of the program with increasing parameters.