03-TAMPI
This folder contains two guided exercises to experiment with the Task-Aware MPI (TAMPI) library: the N-body and the Heat diffusion benchmarks. Both contain comments preceded by the keyword TODO, which give hints about what you should add or modify in order to leverage the TAMPI features.
Before proceeding to these exercises, execute the following command in the current directory to set up the necessary environment:
$ source configure.sh
Once the environment is loaded, you can follow the instructions of each exercise.
N-body Benchmark
An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. A familiar example is an astrophysical simulation in which each body represents a galaxy or an individual star, and the bodies attract each other through the gravitational force.
N-body simulation arises in many other computational science problems as well. For example, protein folding is studied using N-body simulation to calculate electrostatic and van der Waals forces. Turbulent fluid flow simulation and global illumination computation in computer graphics are other examples of problems that use N-body simulation.
This application has been parallelized using MPI and OpenMP. The MPI part requires an MPI library supporting the multi-threading mode (MPI_THREAD_MULTIPLE). It mainly targets the Intel MPI implementation; however, it should work with other libraries by adding the needed implementation-specific flags to the Makefile.
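For reference, requesting that level usually means initializing MPI with MPI_Init_thread instead of MPI_Init and checking the level actually provided. The fragment below is a generic, minimal sketch of that check; it is not the benchmark's own initialization code, which may request a different level:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    int provided;
    /* Ask for full multi-threading support, since several OpenMP threads
       may issue MPI calls concurrently from inside tasks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "Error: MPI_THREAD_MULTIPLE is not supported\n");
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    /* ... rest of the application ... */
    MPI_Finalize();
    return 0;
}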
A second level of parallelism is exploited following the OpenMP approach. The program uses tasks to express the parallelism among the different blocks. Communication services have also been taskified, but their execution has been serialized by means of a sentinel dependence (named serial in the code) that guarantees the proper order of execution. The following code centralizes all the communication operations used to exchange data among processes:
for (int i = 0; i < num_blocks; i++) {
    // Both tasks depend on the sentinel 'serial', which serializes all
    // communication tasks and guarantees the order of the MPI calls.
    #pragma omp task depend(in: sendbuf[i]) depend(inout: serial)
    send_particles_block(sendbuf+i, i, dst);

    #pragma omp task depend(out: recvbuf[i]) depend(inout: serial)
    recv_particles_block(recvbuf+i, i, src);
}
...
void send_particles_block(const particles_block_t *sendbuf, int block_id, int dst)
{
    MPI_Send(sendbuf, sizeof(particles_block_t), MPI_BYTE, dst, block_id+10, MPI_COMM_WORLD);
}
void recv_particles_block(particles_block_t *recvbuf, int block_id, int src)
{
    MPI_Recv(recvbuf, sizeof(particles_block_t), MPI_BYTE, src, block_id+10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
Taking this version of the program as a baseline, we want to exploit the capabilities of the TAMPI interoperability layer. The Makefile already includes the compilation and linking against TAMPI. The first step is to relax the task execution order by removing the artificial dependencies of the communication tasks. Then, we can go one step further and transform the blocking communication calls into their non-blocking counterparts. Do not forget to asynchronously wait for the completion of the communication before finalizing the communication tasks.
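As an orientation, the sketch below shows roughly how the communication tasks above could look after both steps: the serial sentinel is gone, and the blocking calls are replaced by MPI_Isend/MPI_Irecv followed by an asynchronous TAMPI wait. It assumes TAMPI's non-blocking primitives (the TAMPI.h header and TAMPI_Iwait, which binds the completion of an MPI request to the completion of the calling task); the TODO comments in the source and the TAMPI documentation remain the authoritative guide for the exact calls and for the threading level to request at initialization.
for (int i = 0; i < num_blocks; i++) {
    // Only the real data dependencies remain: the tasks may now run concurrently.
    #pragma omp task depend(in: sendbuf[i])
    {
        MPI_Request request;
        MPI_Isend(sendbuf+i, sizeof(particles_block_t), MPI_BYTE, dst, i+10, MPI_COMM_WORLD, &request);
        // The task body finishes here, but TAMPI delays the task's completion
        // (and the release of its dependencies) until the request completes.
        TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
    }

    #pragma omp task depend(out: recvbuf[i])
    {
        MPI_Request request;
        MPI_Irecv(recvbuf+i, sizeof(particles_block_t), MPI_BYTE, src, i+10, MPI_COMM_WORLD, &request);
        TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
    }
}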
Goals of this exercise
- Transform the code in order to exploit TAMPI capabilities for non-blocking communications. Look for TODO comments.
- Compare performance results between the initial version and your candidate version.
- Check several runtime options when executing the program (versions, schedulers, etc.).
- Check scalability. Execute the program using different numbers of CPUs or MPI processes and compute the speed-up.
- Change program arguments that may have an impact on task granularity (block size, tile size, etc.).
- Change program arguments that may have an impact on the number of tasks (matrix sizes and/or block/tile sizes).
- Obtain different Paraver traces using different runtime options or program arguments and compare them.
Execution instructions
The first step is to compile the program and generate the auxiliary scripts. You can do it by executing the following command inside this folder:
$ make
The generated binary accepts several options. The most relevant ones are the total number of particles (-p) and the number of timesteps (-t). More options can be listed with the -h option. An example of execution could be:
$ mpirun -n 4 ./nbody -t 100 -p 8192
in which the application performs 100 timesteps on 4 MPI processes, each process getting the same number of cores (the number of available cores divided by 4). The total number of particles is 8192, which means that each process will hold 2048 particles (2 blocks per process).
We have prepared two scripts, run-once.sh and multirun.sh, for submitting jobs to the execution queues of this cluster. The run-once.sh script runs a single execution with the given parameters and, if enabled, also extracts a Paraver trace of the execution.
You can submit the job script into the queues by doing:
$ bsub < run-once.sh
Please feel free to change the job parameters (processes, processes per node, etc.) and the execution parameters of the program. Additionally, you can generate a trace of the execution by uncommenting the following line of the run-once.sh script:
INSTRUMENT=./trace.sh
Finally, multirun.sh performs multiple executions of the program with increasing parameters.
Heat Diffusion
The Heat simulation uses an iterative Gauss-Seidel method to solve the heat equation, a parabolic partial differential equation that describes the distribution of heat (or the variation in temperature) in a given region over time. The heat equation is of fundamental importance in a wide range of scientific fields. In mathematics, it is the parabolic partial differential equation par excellence. In statistics, it is related to the study of Brownian motion. The diffusion equation, a generic version of the heat equation, is also related to the study of chemical diffusion processes.
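For reference, a Gauss-Seidel solver for this kind of problem sweeps the grid and updates each interior point in place with the average of its four neighbours. The fragment below is a generic sketch of one sweep over a plain N x N array (the names are hypothetical; the benchmark itself works on a blocked matrix distributed across processes):
for (int i = 1; i < N-1; i++) {
    for (int j = 1; j < N-1; j++) {
        // In-place update: neighbours already updated in this sweep are reused,
        // which is what distinguishes Gauss-Seidel from the Jacobi method.
        u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
    }
}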
This application has been parallelized using MPI and OpenMP. The MPI part requires an MPI library supporting the multi-threading mode (MPI_THREAD_MULTIPLE). It mainly targets the Intel MPI implementation; however, it should work with other libraries by adding the needed implementation-specific flags to the Makefile.
A second level of parallelism is exploited following the OpenMP approach. The program uses tasks to express the parallelism among the different blocks. Communication services have also been taskified, but their execution has been serialized by means of a sentinel dependence (named serial in the code) that guarantees the proper order of execution. The following functions are examples of this serialization:
inline void sendFirstComputeRow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
{
    for (int by = 1; by < nby-1; ++by) {
        // The sentinel 'serial' forces the communication tasks to run in order.
        #pragma omp task depend(in: matrix[nby+by]) depend(inout: serial)
        MPI_Send(&matrix[nby+by][0], BSY, MPI_DOUBLE, rank - 1, by, MPI_COMM_WORLD);
    }
}
inline void sendLastComputeRow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
{
    for (int by = 1; by < nby-1; ++by) {
        #pragma omp task depend(in: matrix[(nbx-2)*nby+by]) depend(inout: serial)
        MPI_Send(&matrix[(nbx-2)*nby+by][BSX-1], BSY, MPI_DOUBLE, rank + 1, by, MPI_COMM_WORLD);
    }
}
Taking this version of the program as a baseline, we want to exploit the capabilities of the TAMPI interoperability layer. The Makefile already includes the compilation and linking against TAMPI. The first step is to relax the task execution order by removing the artificial dependencies of the communication tasks. Then, we can go one step further and transform the blocking communication calls into their non-blocking counterparts. Do not forget to asynchronously wait for the completion of the communication before finalizing the communication tasks.
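As with the N-body exercise, the sketch below gives a rough idea of what sendFirstComputeRow could look like after both steps, again assuming TAMPI's non-blocking primitives (TAMPI.h and TAMPI_Iwait). Treat it as an orientation rather than the exact expected solution, and follow the TODO hints in the source:
inline void sendFirstComputeRow(block_t *matrix, int nbx, int nby, int rank, int rank_size)
{
    for (int by = 1; by < nby-1; ++by) {
        // The sentinel dependence is gone; each task depends only on its own block.
        #pragma omp task depend(in: matrix[nby+by])
        {
            MPI_Request request;
            MPI_Isend(&matrix[nby+by][0], BSY, MPI_DOUBLE, rank - 1, by, MPI_COMM_WORLD, &request);
            // TAMPI postpones the task's completion until the send has finished.
            TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
        }
    }
}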
Goals of this exercise
- Transform the code in order to exploit TAMPI capabilities for non-blocking communications. Look for TODO comments.
- Compare performance results between the initial version and your candidate version.
- Check several runtime options when executing the program (versions, schedulers, etc.).
- Check scalability. Execute the program using different numbers of CPUs or MPI processes and compute the speed-up.
- Change program arguments that may have an impact on task granularity (block size, tile size, etc.).
- Change program arguments that may have an impact on the number of tasks (matrix sizes and/or block/tile sizes).
- Obtain different Paraver traces using different runtime options or program arguments and compare them.
Execution instructions
The first step is to compile the program and generate the auxiliary scripts. You can do it by executing the following command inside this folder:
$ make
The generated binary accepts several options. The most relevant ones are the size of the matrix in each dimension (-s) and the number of timesteps (-t). More options can be listed with the -h option. An example of execution could be:
$ mpirun -n 4 ./heat -t 150 -s 8192
in which the application performs 150 timesteps on 4 MPI processes, each process getting the same number of cores (the number of available cores divided by 4). The size of the matrix in each dimension will be 8192 (8192 x 8192 elements in total), which means that each process will hold 2048 x 8192 elements (16 blocks per process).
We have prepared two scripts, run-once.sh and multirun.sh, for submitting jobs to the execution queues of this cluster. The run-once.sh script runs a single execution with the given parameters and, if enabled, also extracts a Paraver trace of the execution.
You can submit the job script into the queues by doing:
$ bsub < run-once.sh
Please feel free to change the job parameters (processes, processes per node, etc.) and the execution parameters of the program. Additionally, you can generate a trace of the execution by uncommenting the following line of the run-once.sh script:
INSTRUMENT=./trace.sh
Finally, multirun.sh performs multiple executions of the program with increasing parameters.