Performance guidelines and troubleshooting advice for OmpSs applications that use offloaded code are described here.

In order to offload to remote nodes, the application itself must allocate those nodes by using the provided API calls.

From the application’s source code, the allocation can be done by calling the API function deep_booster_alloc(MPI_Comm spawners, int number_of_hosts, int process_per_host, MPI_Comm *intercomm). This function creates number_of_hosts*process_per_host remote workers and returns, in intercomm, a communicator with the offloaded processes.

The masters that spawn the processes are the ones that belong to the communicator “spawners”. This call is a collective operation and must be called by every process that belongs to this communicator. Two values have been tested:

• MPI_COMM_WORLD: All the masters in the communicator will together spawn number_of_hosts*process_per_host remote workers. All these workers can communicate with each other using MPI (they are inside the same MPI_COMM_WORLD).
• MPI_COMM_SELF: Each master that calls the spawn will spawn number_of_hosts*process_per_host remote workers, visible only to itself. Workers spawned by the same master will be able to communicate using MPI, but they will not be able to communicate with the workers created by a different master.

This routine has several variants (all of them work in C and also in Fortran; in Fortran, MPI_Comm, int and MPI_Comm* are INTEGER, and int* is an INTEGER array):

• deep_booster_alloc (MPI_Comm spawners, int number_of_hosts, int process_per_host, MPI_Comm *intercomm): Allocates process_per_host processes in each host. If there are not enough hosts, the call will fail and the returned intercomm will be MPI_COMM_NULL.
• deep_booster_alloc_list (MPI_Comm spawners, int pph_list_length, int* pph_list, MPI_Comm *intercomm): Takes a list with the number of processes which will be spawned in each node. For example, the list {0,2,4,0,1,3} will spawn 10 processes, split as indicated between hosts 1, 2, 4 and 5, skipping hosts 0 and 3. If there are not enough hosts, the call will fail and the returned intercomm will be MPI_COMM_NULL.
• deep_booster_alloc_nonstrict (MPI_Comm spawners, int number_of_hosts, int process_per_host, MPI_Comm *intercomm, int* provided): Allocates process_per_host processes in each host. If there are not enough hosts, the call will allocate as many as are available and return the number of processes allocated (available_hosts*process_per_host) in “provided”.
• deep_booster_alloc_list_nonstrict (MPI_Comm spawners, int pph_list_length, int* pph_list, MPI_Comm *intercomm, int* provided): Takes a list with the number of processes which will be spawned in each node. For example, the list {0,2,4,0,1,3} will spawn 10 processes, split as indicated between hosts 1, 2, 4 and 5, skipping hosts 0 and 3. If there are not enough hosts, the call will allocate as many as are available and return the number of processes allocated in “provided”.

Deallocation of the nodes is performed automatically by the runtime at the end of the execution; however, it is strongly recommended to free them explicitly (this can be done at any point during the execution) by calling the API function deep_booster_free(MPI_Comm *intercomm).

Communication between the master and the spawned processes in user code (while offloaded tasks are executing) works, but it is not recommended.

Example in C:

    #include <mpi.h>
    #include <stdlib.h>

    /* init_arr() and print_arr() are user-provided helpers (not shown). */

    int main (int argc, char** argv)
    {
        int my_rank;
        int mpi_size;
        nanos_mpi_init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);
        MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
        MPI_Comm boosters;
        int num_arr[2000];
        init_arr(num_arr,0,1000);
        //Spawn one worker per master
        deep_booster_alloc(MPI_COMM_WORLD, mpi_size, 1, &boosters);
        if (boosters==MPI_COMM_NULL) exit(-1);

        //Each master offloads to the worker with the same rank as its own
        //(task clauses mirror the Fortran example)
        #pragma omp task inout(num_arr) onto(boosters,my_rank)
        {
            //Workers SUM all their num_arr[0,1000] into num_arr[1000,2000]
            MPI_Allreduce(&num_arr[0],&num_arr[1000],1000,MPI_INT,MPI_SUM,MPI_COMM_WORLD);
        }
        #pragma omp taskwait

        print_arr(num_arr,1000,2000);

        deep_booster_free(&boosters);
        nanos_mpi_finalize();
        return 0;
    }


Example in Fortran:

    PROGRAM MAIN
        IMPLICIT NONE
        INCLUDE "mpif.h"
        EXTERNAL :: SLEEP
        INTEGER :: BOOSTERS
        INTEGER :: RANK, IERROR, MPISIZE, PROVIDED
        INTEGER, DIMENSION(2000) :: INPUTS
        INTEGER, DIMENSION(2000) :: OUTPUTS

        CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERROR)
        CALL MPI_COMM_SIZE(MPI_COMM_WORLD, MPISIZE, IERROR)

        !!Spawn one worker per master
        CALL DEEP_BOOSTER_ALLOC(MPI_COMM_WORLD, MPISIZE, 1, BOOSTERS)
        IF (BOOSTERS == MPI_COMM_NULL) THEN
            CALL EXIT(-1)
        END IF
        INPUTS = 5
        OUTPUTS = 999

        !!Each master offloads to the worker with the same rank as its own
        !$OMP TASK IN(INPUTS) OUT(OUTPUTS) ONTO(BOOSTERS,RANK)
            CALL MPI_ALLREDUCE(INPUTS, OUTPUTS, 2000, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, IERROR)
        !$OMP END TASK
        !$OMP TASKWAIT

        PRINT *, OUTPUTS(2)
        CALL DEEP_BOOSTER_FREE(BOOSTERS)
        CALL MPI_FINALIZE(IERROR)
    END PROGRAM

## 3.2.2.2. Compiling OmpSs offload applications

OmpSs offload applications can be compiled with the Mercurium compilers “mpimcc/mpimcxx” and “mpimfc” using the --ompss flag (e.g. mpimcc --ompss test.c). These compilers are wrappers for the MPI implementation currently available in the user’s PATH.

Users MUST provide and use a multithreaded MPI implementation when offloading (otherwise Nanox will give an explanatory warning and then crash at start or hang) and must link all the libraries of their program against the same one. -mt_mpi is set automatically for Intel MPI by our compiler.

If no OFFL_CC/OFFL_CXX/OFFL_FC environment variables are defined, OmpSs will use the Intel compilers (mpiicc, mpiicpc or mpiifort) by default and, if they are not available, it will fall back to the OMPI/GNU ones (mpicc, mpicxx and mpicpc).

WARNING: When setting OFFL_CC/OFFL_FC, make sure that OFFL_CXX points to the same MPI implementation as OFFL_CC and OFFL_FC (because one C++ MPI-dependent part of Nanox is automatically compiled and linked with your application).

### 3.2.2.2.1. Naming Convention

MPI offload executables have to follow a naming convention which identifies the executable and the architecture. The executables for every architecture have to be generated from the same source code. The convention is XX.YY, where XX is your executable name, which must be the same for every architecture, and YY is the architecture name (as given by ‘uname -p’ or ‘uname -m’; both upper-case and lower-case names are supported). For x86_64 and k1om (MIC), two default aliases are provided: intel64/INTEL64 and mic/MIC.

Example dual-architecture (MIC+Intel64) compilation:

    mpimcc --ompss test.c -o test.intel64
    mpimcc --ompss --mmic test.c -o test.mic

## 3.2.2.3. Offload hosts

On machines where there is no job manager integration with our offload, the hosts to which each process is offloaded have to be specified manually in a host-file.
The path of this file can be passed to Nanox through the environment variable NX_OFFL_HOSTFILE, as in NX_OFFL_HOSTFILE=./offload_hosts

The syntax of this file is the following:

    hostA:n<env_var1,env_var2…
    hostB:n<env_var4,env_var1…

Example hostfile:

    knights4-mic0:30<NX_SMP_WORKERS=1,NX_BINDING_STRIDE=4
    knights3-mic0
    knights2-mic0
    knights1-mic0:3
    knights0-mic0

## 3.2.2.4. Debugging your application

If you want to see where your offload workers are executed and more information about them, use the environment variable NX_OFFL_DEBUG=<level>, where level is a number in the range 1-5; higher numbers show more debug information.

Sometimes the offloading environment can be hard to debug and prints are not powerful enough; below you can find some techniques which will help. You should be able to debug it with any MPI debugging tool if you point it to the right offload process ID. In order to do this you have two options.

1. Obtaining a backtrace:

1- Compile your application with -k and -g and with the --debug flag.
2- Set "ulimit -c unlimited"; after doing so, when the application crashes it will dump the core.
3- Open it with "gdb ./exec corefile" and you will be able to see some information about the crash (for example, see the backtrace with the "bt" command).

2. Execution-time debugging:

1- Compile your application with -k and -g and with the --debug flag.
2- Add a sleep(X) at the start of the offload section which gives you enough time to do the next steps.
3- Execute your application and make sure it reaches the offload section/sleep.
4- ssh to the remote machine, or execute everything in localhost (for debugging purposes).
5- Look for the allocated/offloaded processes (NX_OFFL_DEBUG=3 will help you to identify their hostname and PID).
6- Attach to one of the processes with "gdb -pid YOURPID" (use your preferred debugger). Your offload tasks will be executed by the main thread of the application.
7- You are now debugging the offload process/thread; you can place breakpoints and then type "continue" in gdb.

## 3.2.2.5. Offloading variables

When offloading, a few rules are used in order to offload variables (also arrays/data-regions):

• Local variables*: They will be copied as in regular OmpSs/OMP tasks.
• Non-const global variables*: They will be copied to and from the offload worker (visible inside the task body and also in external functions).
• Const global variables*: As they are non-writable, they can only be in/firstprivate (the copy of the host value will be visible inside the task body; external functions will see the initial value).
• Global variables which are never specified in copy/firstprivate: They will not be modified by OmpSs, so users may use them to store “node-private” values under their own responsibility.

Global variables are C/C++ global variables and Fortran module/common variables. The task region is the code between the task brackets/definition; external functions are functions which are called from inside the task region.

*OmpSs offload does not provide any guarantee about the value or the allocated space of these variables after the task which copies/uses them has finished.

## 3.2.2.6. Obtaining traces of offload applications

In order to trace an offload application users must follow these steps:

• Compile with --instrument. An installation using the Nanox instrumentation version (--with-extrae) is required.
• Set “export EXTRAE_HOME=/my/extrae/path” and “export EXTRAE_HOME_MIC=/my/extrae/path/mic”.
• Optional: if you want to use a configuration file, export EXTRAE_CONFIG_FILE=./extrae.xml. Do not enable merging of traces inside the XML file, as it is not supported. The EXTRAE_DIR environment variable has to be set if this file points Extrae intermediate files to a folder other than the current directory ($PWD).

The “offload_instrumentation.sh” script can be found in $NANOX_HOME/bin in case it is not available through the PATH. It will set up the environment and then call the real application. Three configuration environment variables are available for this script:

• EXTRAE_DIR: Directory where intermediate files will be generated (if it is also specified in the .xml file, this variable has to point to the same folder). [Default value: $PWD]
• AUTO_MERGE_EXTRAE_TRACES: If enabled, after the application finishes, the script will check the directory pointed to by EXTRAE_DIR and merge the intermediate files into a final trace (.prv) [Default value: YES]. If disabled, users can merge these files manually by calling mpi2prv with the right parameters or by executing the script with no arguments (it will merge the intermediate files located in EXTRAE_DIR).
• CLEAR_EXTRAE_TMP_FILES: If enabled, intermediate files will be removed after the script finishes [Default value: NO].

## 3.2.2.7. Offload troubleshooting

• If you are offloading tasks which use MPI communications on the worker side and they hang, make sure that you launched as many tasks as there are nodes in the communicator (so that every rank of the remote communicator is executing one of the tasks). Otherwise, if the tasks perform collective operations, all the others will be waiting for the missing one.
• If you are trying to offload to Intel MICs, make sure the variable I_MPI_MIC is set to yes/1 (export I_MPI_MIC=1).
• If your application behaves differently simply because it is compiled with OmpSs offload, take into account that OmpSs offload initializes MPI with MPI_THREAD_MULTIPLE.
• If your application hangs inside DEEP Booster alloc calls, check that the hosts provided in the hostfile exist.
• When deep_booster_alloc is called with MPI_COMM_SELF as its first parameter from multiple MPI processes, your application has to be executed in a folder where you have write permissions. This is needed because the Intel MPI implementation of MPI_Comm_spawn is not inter-process safe, so a filesystem lock is used to serialize spawns and avoid a crash. As a consequence, spawns using MPI_COMM_SELF are serialized. This behaviour can be disabled by using the variable NX_OFFL_DISABLE_SPAWN_LOCK=1, but then either the MPI implementation has to support concurrent spawns or your application code has to guarantee that multiple calls to deep_booster_alloc with MPI_COMM_SELF as first parameter are not performed at the same time. The lock is disabled when using OpenMPI, as concurrent spawns work correctly in that implementation.
• If you are trying to offload code which is compiled/configured with CMake, you have to point the compilers to the ones provided by the offload; you can do this with “FC=mpimfc CXX=mpimcxx CC=mpimcc cmake .”. If you are using FindMPI, you should disable it (recommended). Otherwise, pointing both the MPI and non-MPI compilers to our compiler with “FC=mpimfc CXX=mpimcxx CC=mpimcc cmake . -DMPI_C_COMPILER=mpimcc -DMPI_CXX_COMPILER=mpimcxx -DMPI_Fortran_COMPILER=mpimfc” should be enough (the CXX compiler is mandatory even if you are not using C++).

Apart from offload troubleshooting, there are some known problems with the Intel implementation of MPI_COMM_SPAWN_MULTIPLE on MICs:

• (Fixed when using Intel MPI 5.0.0.016 or higher) If the same host (only on MICs) appears twice, the call to DEEP_BOOSTER_ALLOC/MPI_COMM_SPAWN may hang. In this case, try specifying thread binding manually and interleaving those hosts with other machines.

Example crashing hostfile:

    knights3-mic0
    knights3-mic0
    knights4-mic0
    knights4-mic0

Example hostfile with the workaround applied:

    knights3-mic0
    knights4-mic0
    knights3-mic0
    knights4-mic0

• More than 80 different hosts cannot be allocated by using a single file; in this case DEEP_BOOSTER_ALLOC/MPI_COMM_SPAWN will fail with a segmentation fault. As a workaround, an entry in the offload host-file can name another host-file (which contains the hosts) instead of a single host.

Example hostfile using sub-groups in files:

    #Hostfile1 contains 64 hosts, and all of them will have the same offload variables
    #(path should be absolute or relative to working directory of your app)
    ./hostfile1:64<NX_SMP_WORKERS=1,NX_BINDING_STRIDE=4
    knights3-mic0
    ./hostfile2:128


## 3.2.2.8. Tuning OmpSs Applications Using Offload for Performance

In general, the best performance is obtained when the write-back cache policy is used (default).

When offloading, one thread takes care of sending tasks from the host to the device. This thread only waits for a few operations (for most operations, it just sends orders to the remote nodes). This is enough for most applications, but if you feel that it is not enough to offload all your independent tasks, you can increase the number of threads which handle each alloc by using the variable NX_OFFL_MAX_WORKERS, as in NX_OFFL_MAX_WORKERS=12; its default value is 1.

If you know that your application is not uniform and that it will have transfers from one remote node to another remote node while one of them is executing tasks, you may get a small performance improvement by setting NX_OFFL_CACHE_THREADS=1, which starts an extra cache thread on the worker nodes. If this option is not enabled by the user, it will be enabled automatically only when this behaviour is detected.

Each worker node can receive tasks from different masters, but it will execute only one at a time. Try to avoid this kind of behaviour if your application needs to be balanced.

Other Nanox options for performance tuning can be found in the section “My application does not run as fast as I think it could”.

## 3.2.2.9. Setting environment variables (i.e. number of threads) for offload workers

In regular applications, the number of threads is defined by the regular variables, for example NX_SMP_WORKERS=X, which sets the number of SMP workers/OMP threads in the master. In offload workers, the default value is the number of cores of the machine on which the worker is running. As you may have more than one architecture, the number of threads can be different for each node/architecture, so you may want to specify it manually. We provide a few ways to do this; in case of conflict, the last one in this list is used:

• Specify it globally using the variable “OFFL_VAR”. This means that once the remote worker starts, VAR will be defined for every offload worker with the value specified in OFFL_VAR.
• Specify it per architecture using the variable “YY_VAR”, where YY is the suffix of the executable for that architecture (explained in Naming Convention) and VAR is the variable you want to define in that architecture. For example, for the Intel MIC architecture, you can use “MIC_NX_SMP_WORKERS=240” in order to set the number of threads to 240 for this architecture. This value overrides the global variable for that architecture.
• Specify it individually in the hostfile, on each node, as an environment variable (NX_SMP_WORKERS or OFFL_NX_SMP_WORKERS). This value overrides the global and per-architecture variables for that node/group of nodes.

The MPI implementation usually exports all these variables automatically; if this is not the case, users must configure it so that the variables which are needed are exported.

Other offload configuration variables can be obtained by using “nanox --help”:

Environment variables
NX_OFFL_HOSTS = <string>
Defines hosts file where secondary process can spawn in DEEP_Booster_Alloc Same format than NX_OFFLHOSTFILE but in a single line and separated with ‘;’ Example: hostZ hostA<env_vars hostB:2<env_vars hostC:3 hostD:4
NX_OFFL_ALIGNTHRESHOLD = <positive integer>
Defines minimum size (bytes) which determines if offloaded variables (copy_in/out) will be aligned (default value: 128), arrays with size bigger or equal than this value will be aligned when offloaded
NX_OFFL_CACHE_THREADS = <0/1>
Defines if offload processes will have an extra cache thread, this is good for applications which need data from other tasks so they don’t have to wait until task in owner node finishes. (Default: False, but if this kind of behaviour is detected, the thread will be created)
NX_OFFL_ALIGNMENT = <positive integer>
Defines the alignment (bytes) applied to offloaded variables (copy_in/out) (default value: 4096)
NX_OFFL_HOSTFILE = <string>
Defines hosts file where secondary process can spawn in DEEP_Booster_Alloc The format of the file is: One host per line with blank lines and lines beginning with # ignored Multiple processes per host can be specified by specifying the host name as follows: hostA:n Environment variables for the host can be specified separated by comma using hostA:n<env_var1,envar2… or hostA<env_var1,envar2…
NX_OFFL_EXEC = <string>
Defines executable path (in child nodes) used in DEEP_Booster_Alloc
NX_OFFL_MAX_WORKERS = <positive integer>
Defines the maximum number of worker threads created per alloc (Default: 1)
NX_OFFL_LAUNCHER = <string>
Defines launcher script path (in child nodes) used in DEEP_Booster_Alloc
NX_OFFL_ENABLE_ALLOCWIDE = <0/1>
Alloc full objects in the cache. This way if you only copy half of the array, the whole array will be allocated. This is good/useful when OmpSs copies share data between offload nodes (Default: False)
NX_OFFL_CONTROLFILE = <string>
Defines a shared (GPFS or similar) file which will be used to automatically manage offload hosts (round robin). This means that each alloc will consume hosts, so future allocs do not oversubscribe on the same host.