02-Matrix multiplication
========================

.. highlight:: c

This example performs the multiplication of two matrices (A and B) into a third one (C). The code has been shown multiple times during the lecture and has been used as a reference for the hybrid parallelization (i.e., MPI + OpenMP). It consists of a blocked matrix multiplication algorithm, in which each block becomes a task. The dependencies provide the correct synchronization between the tasks.

Think about how the parallelization has been done, and pay special attention to the following code included in the ``matmul()`` function::

  for (node = 0; node < nodes; node++) {
     #pragma omp task depend(in: a[0][0], B[0][0]) depend(inout: C[i][0])
     mxm(m, n, a, B, (double (*)[n]) &C[i][0]);

     #pragma omp task depend(in: a[0][0]) depend(out: rbuf[0][0]) //depend(inout: serial)
     call_send_recv(m, n, a, down, rbuf, up);

     i = (i+n) % m;                    //next C block circular
     ptmp = a; a = rbuf; rbuf = ptmp;  //swap pointers
  }

**Goals of this exercise**

* The code is completely annotated: you DON'T need to modify it.
* Review the source code and check the parallelization. This includes the MPI services and the OpenMP directives/services. Try to understand what they mean.
* Review the ``Makefile`` rules. Check how each source code file is compiled/linked.
* Check scalability: execute the program using different numbers of ranks and threads and compute the speed-up.
* Get different Paraver traces and visualize them: thread state, task name, etc.
* Why is the ``depend(inout: serial)`` clause commented out? Is it needed?

**Execution instructions**

The first step is to compile the program and generate the auxiliary scripts. You can do it by executing the following command inside this folder::

  $ make

The generated binary accepts several options. The most relevant ones are the size of the matrix (``-s``) and the number of timesteps (``-t``). More options can be seen by passing the ``-h`` option. An example of execution could be::

  $ mpirun -n 4 ./matmul -t 4 -s 512

in which the application will perform 4 timesteps on 4 MPI processes, where each process will have the same number of cores (# of available cores / 4). The total size of the matrix will be 512x512, which means that each process will hold a 128x512 block of matrix A, and a 512x128 block of each of the matrices B and C.

We have prepared two scripts, ``run-once.sh`` and ``multirun.sh``, for submitting jobs to the execution queues of this cluster. Firstly, ``run-once.sh`` runs a single execution with the given parameters and, if enabled, it can also extract a Paraver trace of the execution. You can submit the job script to the queues by doing::

  $ bsub < run-once.sh

Please feel free to change the job parameters (processes, processes per node, etc.) and the execution parameters of the program. Additionally, you can generate a trace of the execution by uncommenting the following line of the ``run-once.sh`` script::

  INSTRUMENT=./trace.sh

Finally, ``multirun.sh`` performs multiple executions of the program with increasing parameters.
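
**A sketch of the communication task**

To reason about the dependencies in the ``matmul()`` loop shown above, it can help to picture what ``call_send_recv()`` roughly does. The following is a minimal, hypothetical sketch: the function name and arguments are taken from the loop, but the body is an assumption made for illustration, not the exercise's actual implementation. The idea is that each rank forwards its current block of A to its ``down`` neighbour while receiving the next block from its ``up`` neighbour into ``rbuf``::

  #include <mpi.h>

  /* Hypothetical sketch, not the exercise's actual code: exchange the current
   * block of A with the neighbouring ranks in one combined send/receive. */
  void call_send_recv(int m, int n, double (*a)[n],
                      int down, double (*rbuf)[n], int up)
  {
      /* Assumes the block holds m rows of n doubles; the real block shape
       * used by the sources may differ. */
      MPI_Sendrecv(&a[0][0],    m * n, MPI_DOUBLE, down, 0,
                   &rbuf[0][0], m * n, MPI_DOUBLE, up,   0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

Seen this way, the ``depend(out: rbuf[0][0])`` clause is what orders the next iteration's tasks (which, after the pointer swap, access the same storage through ``a``) after the reception of the new block.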