02-Matrix multiplication
========================

.. highlight:: c

This example performs the multiplication of two matrices (A and B) into a third one (C). The code has been shown multiple times during the lecture and has been used as a reference for the hybrid parallelization (i.e., MPI + OpenMP). It consists of a blocked matrix multiplication algorithm, in which each block becomes a task. The dependencies provide the correct synchronization between the tasks.

Think about how the parallelization has been done, and pay special attention to the following code included in the ``matmul()`` function::

  for (node = 0; node < nodes; node++) {
     #pragma omp task depend(in: a[0][0], B[0][0]) depend(inout: C[i][0])
     mxm(m, n, a, B, (double (*)[n]) &C[i][0]);

     #pragma omp task depend(in: a[0][0]) depend(out: rbuf[0][0]) //depend(inout: serial)
     call_send_recv(m, n, a, down, rbuf, up);

     i = (i+n) % m;                    //next C block circular
     ptmp = a; a = rbuf; rbuf = ptmp;  //swap pointers
  }

**Goals of this exercise**

* The code is completely annotated: you DON'T need to modify it.
* Review the source code and check the parallelization. This includes the MPI services and the OpenMP directives/services. Try to understand what they mean.
* Review the ``Makefile`` rules. Check how each source code file is compiled/linked.
* Check scalability: execute the program using different numbers of ranks and threads and compute the speed-up.
* Get different Paraver traces and visualize them: thread state, task name, etc.
* Why is the ``depend(inout: serial)`` clause commented out? Is it needed?

**Execution instructions**

The first step is to compile the program and generate the auxiliary scripts. You can do it by executing the following command inside this folder::

  $ make

The generated binary accepts several options. The most relevant ones are the size of the matrix (``-s``) and the number of timesteps (``-t``). More options can be seen by passing the ``-h`` option. An example of execution could be::

  $ mpirun -n 4 ./matmul -t 4 -s 512

in which the application will perform 4 timesteps on 4 MPI processes, where each process will have the same number of cores (# of available cores / 4). The total size of the matrix will be 512x512, which means that each process will hold a 128x512 block of matrix A, and a 512x128 block of each of the matrices B and C.

We have prepared two scripts, ``run-once.sh`` and ``multirun.sh``, for submitting jobs to the execution queues of this cluster. Firstly, ``run-once.sh`` runs a single execution with the given parameters and, if enabled, it can also extract a Paraver trace of the execution. You can submit the job script to the queues by doing::

  $ bsub < run-once.sh

Please feel free to change the job parameters (processes, processes per node, etc.) and the execution parameters of the program. Additionally, you can generate a trace of the execution by uncommenting the following line of the ``run-once.sh`` script::

  INSTRUMENT=./trace.sh

Finally, ``multirun.sh`` performs multiple executions of the program with increasing parameters.
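
**A sketch of the communication task**

To reason about the dependencies in the ``matmul()`` loop shown above, it can help to picture what ``call_send_recv()`` roughly does. The following is a minimal, hypothetical sketch: the function name and arguments are taken from the loop, but the body is an assumption made for illustration, not the exercise's actual implementation. The idea is that each rank forwards its current block of A to its ``down`` neighbour while receiving the next block from its ``up`` neighbour into ``rbuf``::

  #include <mpi.h>

  /* Hypothetical sketch, not the exercise's actual code: exchange the current
   * block of A with the neighbouring ranks in one combined send/receive. */
  void call_send_recv(int m, int n, double (*a)[n],
                      int down, double (*rbuf)[n], int up)
  {
      /* Assumes the block holds m rows of n doubles; the real block shape
       * used by the sources may differ. */
      MPI_Sendrecv(&a[0][0],    m * n, MPI_DOUBLE, down, 0,
                   &rbuf[0][0], m * n, MPI_DOUBLE, up,   0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

Seen this way, the ``depend(out: rbuf[0][0])`` clause is what orders the next iteration's tasks (which, after the pointer swap, access the same storage through ``a``) after the reception of the new block.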