3. Nanos6 Runtime Options

This section describes how to run OmpSs-2 applications and which runtime options are available.

3.1. Executing and controlling number of CPUs

Nanos6 applications can be compiled and executed in this way:

# Compile OmpSs-2 program with LLVM/Clang
$ clang -fompss-2 app.c -o app

# Execute on all available cores of the current session
$ ./app

The number of cores that are used is controlled by running the application through the taskset command. For instance:

# Execute on cores 0, 1, 2 and 4
$ taskset -c 0-2,4 ./app
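
To double-check how many cores a given taskset list selects, the list can be expanded by hand. The following is a minimal sketch: the helper name count_cpus is hypothetical, and it only handles plain N and N-M entries, not the N-M:S stride syntax that taskset also accepts.

```shell
# Hypothetical helper: count the CPUs named in a taskset-style list.
# Runs in a subshell so that changing IFS does not leak to the caller.
count_cpus() (
    total=0
    IFS=,
    for range in $1; do
        first=${range%-*}   # "0-2" -> 0, "4" -> 4
        last=${range#*-}    # "0-2" -> 2, "4" -> 4
        total=$(( total + last - first + 1 ))
    done
    echo "$total"
)

count_cpus 0-2,4   # prints 4
```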

3.2. Runtime configuration options

The behaviour of the Nanos6 runtime can be tuned after compilation by means of a configuration file. All former Nanos6 environment variables are obsolete and will be ignored by the runtime system. Currently, the supported configuration file format is TOML v1.0.0-rc1. The default configuration file is named nanos6.toml and can be found in the share directory of the Nanos6 installation:

$INSTALLATION_PREFIX/share

Note

The nanos6.toml file was recently moved from $INSTALLATION_PREFIX/share/doc/nanos6/scripts to the $INSTALLATION_PREFIX/share directory.

To override the default configuration, it is recommended to copy the default file and change the relevant options. The first configuration file found will be interpreted, according to the following order:

  1. The file pointed to by the NANOS6_CONFIG environment variable.

  2. The file nanos6.toml found in the current working directory.

  3. The file nanos6.toml found in the installation path (default file).
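
The lookup order above can be sketched as a small shell function that reports which file would be expected to take effect. This is only an illustration of the documented order, not the runtime's own logic: the function name nanos6_config_path is hypothetical, INSTALLATION_PREFIX is treated as an ordinary shell variable, and the sketch does not check whether the file named by NANOS6_CONFIG actually exists.

```shell
# Hypothetical helper: print the configuration file expected to be used,
# following the documented search order.
nanos6_config_path() {
    if [ -n "${NANOS6_CONFIG:-}" ]; then
        # 1. File pointed to by NANOS6_CONFIG
        printf '%s\n' "$NANOS6_CONFIG"
    elif [ -f "./nanos6.toml" ]; then
        # 2. nanos6.toml in the current working directory
        printf '%s\n' "$PWD/nanos6.toml"
    else
        # 3. Default file shipped with the installation
        printf '%s\n' "$INSTALLATION_PREFIX/share/nanos6.toml"
    fi
}
```

Calling nanos6_config_path from the directory where the application is launched shows whether a local nanos6.toml or the NANOS6_CONFIG variable shadows the default file.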

The configuration file is organized into different sections and subsections. Option names have the format <section>[.<subsection>].<option_name>. For instance, the dependency system that the runtime should load is specified by the version.dependencies option. Inside the TOML configuration file, that option is placed inside the version section as follows:

[version]
    dependencies = "discrete"

Alternatively, if the configuration has to be changed programmatically and creating new files is not practical, the configuration options can be overridden using the NANOS6_CONFIG_OVERRIDE environment variable. The content of this variable has to be in the format option1=value1,option2=value2,option3=value3,..., that is, a comma-separated list of assignments.

For example, you can run the following command to change the dependency implementation and use CTF instrumentation:

$ NANOS6_CONFIG_OVERRIDE="version.dependencies=discrete,version.instrument=ctf" ./ompss-program
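
When the override string is assembled by a script, joining the assignments with commas by hand is error-prone. A minimal sketch of a helper that does the joining is shown next. The name nanos6_override is hypothetical, and the sketch assumes that no option value contains a comma.

```shell
# Hypothetical helper: join "option=value" arguments into the
# comma-separated format expected by NANOS6_CONFIG_OVERRIDE.
# The body runs in a subshell so the IFS change stays local.
nanos6_override() (
    IFS=,
    printf '%s\n' "$*"
)

# Example use (ompss-program is the placeholder binary used above):
#   NANOS6_CONFIG_OVERRIDE="$(nanos6_override version.dependencies=discrete version.instrument=ctf)" ./ompss-program
```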

By default, the runtime system emits a warning during initialization when it detects irrelevant environment variables that start with the NANOS6 prefix. The exceptions are the two variables described above, NANOS6_CONFIG and NANOS6_CONFIG_OVERRIDE, and the NANOS6_HOME variable. The latter is not used by the Nanos6 runtime system, but users often define it and it is relevant for the OmpSs-2 LLVM-based compiler. The warning can be disabled by setting the loader.warn_envars configuration option to false.

3.3. Runtime variants

There are several Nanos6 runtime variants, each one focusing on different aspects of parallel executions: performance, debugging, instrumentation, etc. OmpSs-2 applications do not require recompiling their code to run with instrumentation, e.g., to extract Extrae traces or to generate additional information. This is instead controlled through configuration options, at run-time. Users can select a specific Nanos6 variant when running an application by setting the version.instrument, version.debug and version.dependencies configuration variables. This section explains the different values for these variables.

The instrumentation is specified by the version.instrument configuration variable and can take the following values:

version.instrument = none

This is the default value and does not enable any kind of instrumentation. This is the variant that should be used when executing performance experiments since it is the one that adds no instrumentation overhead. Continue reading this section for more information about performance runs with Nanos6.

version.instrument = ovni

Instrumented with ovni to generate Paraver traces. See Generating ovni traces.

version.instrument = extrae

Instrumented to produce Paraver traces. See: Generating Extrae traces.

version.instrument = ctf

Instrumented to produce CTF traces and convert them to the Paraver format. See: Generating CTF traces.

version.instrument = verbose

Instrumented to emit a log of the execution. See: Verbose instrumentation.

version.instrument = lint

Instrumented to support the OmpSs-2@Linter tool.

Note

The graph and stats instrumentations have been recently removed and are no longer available.

By default, Nanos6 loads runtime variants that are compiled with high optimization flags and with most of the internal assertions turned off. This is the configuration that should be used along with the default none instrumentation for benchmarking experiments. However, other options also affect performance. See Benchmarking for more information.

Sometimes it is useful to run an OmpSs-2 program with debug information (e.g., when a program crashes). The runtime system provides the option version.debug to load a runtime variant that has been compiled without optimizations and with all internal assertions turned on. The default value for this option is false (no debug), but it can be set to true to enable the debug information. Please note that enabling this option significantly decreases the performance of the runtime system. Additionally, every instrumentation variant is available in both an optimized and a debug version.

Finally, the last configuration variable used to specify a runtime variant is version.dependencies, which is explained in the next section.

3.4. Task data dependencies

The Nanos6 runtime has support for different dependency implementations. The discrete dependencies are the default dependency implementation. This is the most optimized implementation but it does not fully support the OmpSs-2 dependency model since it does not support region dependencies. If the user program requires region dependencies (e.g., to detect dependencies among partially overlapping dependency regions), Nanos6 provides the regions implementation, which is completely spec-compliant. The latter is also the only implementation that supports OmpSs-2@Cluster.

The dependency implementation can be selected at run-time through the version.dependencies configuration variable. The available implementations are:

version.dependencies = "discrete"

Default and optimized implementation not supporting region dependencies. Region syntax is supported but will behave as a discrete dependency to the first address. Scales better than the regions implementation thanks to its simpler logic and is functionally similar to the traditional OpenMP model.

version.dependencies = "regions"

Supporting all dependency features. Default implementation in OmpSs-2@Cluster installations.

In cases where an OmpSs-2 program requires region dependency support, we recommend adding the declarative directive assert in any of the program source files, as shown below. Then, before the program is started, the runtime system will check whether the loaded dependency implementation is regions and will abort the execution if it is not.

#pragma oss assert("version.dependencies==regions")

int main() {
    // ...
}

Notice that the assert directive could also check whether the runtime is using discrete dependencies.

3.5. Task scheduler

The scheduling infrastructure provides the following configuration options to modify the behavior of the task scheduler:

scheduler.policy (default: fifo)

Specifies whether ready tasks are added to the ready queue using a FIFO (fifo) or a LIFO (lifo) policy.

scheduler.immediate_successor (default: 0.75)

Probability of enabling the immediate successor feature to improve cache data reuse between successor tasks. If enabled, when a CPU finishes a task it starts executing the successor task (computed through their data dependencies).

scheduler.priority (default: true)

Boolean indicating whether the scheduler should consider the task priorities defined by the user in the task’s priority clause.

3.6. Task worksharings

Important

Task worksharings, which were implemented by the task for clause, are no longer part of the OmpSs-2 specification.

3.7. Stack size

By default, Nanos6 allocates stacks of 8 MB for its worker threads. In some codes this may not be enough. For instance, when converting Fortran codes, some global variables may need to be converted into local variables. This may increase substantially the amount of stack required to run the code and may surpass the space that is available.

To solve that problem, the stack size can be set through the misc.stack_size configuration variable. Its value is expressed in bytes but it also accepts the K, M, G, T and E suffixes, which are interpreted as power-of-2 multipliers. For instance:

[misc]
    stack_size = "16M"
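
To see what a suffixed value amounts to in bytes, the power-of-2 interpretation can be worked out in the shell. The sketch below is not the runtime's parser: the function name size_to_bytes is hypothetical, it assumes the usual binary meanings (K = 2^10, M = 2^20, G = 2^30, T = 2^40, E = 2^60), and it does not validate its input.

```shell
# Hypothetical helper: convert a size such as "16M" into bytes, treating
# the suffixes as power-of-2 multipliers. Runs in a subshell so its
# variables stay local.
size_to_bytes() (
    value=${1%[KMGTE]}      # numeric part, e.g. "16"
    suffix=${1#"$value"}    # trailing suffix, possibly empty
    case $suffix in
        '') shift_bits=0 ;;
        K)  shift_bits=10 ;;
        M)  shift_bits=20 ;;
        G)  shift_bits=30 ;;
        T)  shift_bits=40 ;;
        E)  shift_bits=60 ;;
    esac
    echo $(( value << shift_bits ))
)

size_to_bytes 16M   # prints 16777216
```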

3.8. Dynamic Load Balancing (DLB)

DLB is a library devoted to speed up hybrid parallel applications and maximize the utilization of computational resources. More information about this library can be found at the following link: https://pm.bsc.es/dlb.

To enable DLB support for Nanos6, a working DLB installation must be present in your environment. Configuring Nanos6 with DLB support is done through the --with-dlb flag, specifying the root directory of the DLB installation.

Once Nanos6 has been configured with DLB support, DLB can be enabled or disabled at run-time through the dlb.enabled configuration variable. DLB is disabled by default, so this variable must be set to true (dlb.enabled=true) to run Nanos6 with DLB support.

Once DLB is enabled for Nanos6, OmpSs-2 applications will benefit from dynamic resource sharing automatically. Assuming that DLB has been explicitly enabled in the configuration file, the following example showcases the execution of two applications that share the available CPUs between them:

# Run the first application using 10 CPUs (0, 1, ..., 9)
$ taskset -c 0-9   ./merge-sort.test &

# Run the second application using 10 CPUs (10, 11, ..., 19)
$ taskset -c 10-19 ./cholesky-fact.test &

# Now those applications should be running while sharing resources
# ...

3.9. CPU manager policies

Nanos6 offers different policies when handling CPUs through the cpumanager.policy configuration variable:

cpumanager.policy = idle

Activates the idle policy, in which idle threads halt on a blocking condition, while not consuming CPU cycles.

cpumanager.policy = hybrid

Activates the hybrid policy, in which idle threads continue spinning for a specific number of iterations while consuming CPU cycles and then halt on a blocking condition.

cpumanager.policy = busy

Activates the busy policy, in which idle threads continue spinning and never halt, consuming CPU cycles.

cpumanager.policy = lewi

If DLB is enabled, activates the LEnd When Idle policy, which is similar to the idle policy but in DLB mode. In this policy, idle threads lend their CPU to other runtimes or processes.

cpumanager.policy = greedy

If DLB is enabled, activates the greedy policy, which disables lending CPUs from the process’ mask, but allows acquiring and lending external CPUs.

cpumanager.policy = default

Fallback to the default implementation. If DLB is disabled, this policy falls back to the hybrid policy, while if DLB is enabled it falls back to the lewi policy.
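
The fallback rule for the default policy can be summarized as a small decision function. This is only a restatement of the description above in shell form. The name resolve_cpu_policy and its two arguments (the configured policy and whether DLB is enabled) are hypothetical.

```shell
# Hypothetical helper: report the policy that "default" stands for,
# given whether DLB is enabled ("true" or "false").
resolve_cpu_policy() (
    policy=$1
    dlb_enabled=$2
    if [ "$policy" = "default" ]; then
        if [ "$dlb_enabled" = "true" ]; then
            echo lewi      # DLB enabled: default means lewi
        else
            echo hybrid    # DLB disabled: default means hybrid
        fi
    else
        echo "$policy"     # explicit policies are reported unchanged
    fi
)

resolve_cpu_policy default false   # prints hybrid
```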

Furthermore, the number of busy-wait iterations performed in the hybrid policy can be controlled by the following configuration variable:

cpumanager.busy_iters = X

This variable indicates the collective number of busy iterations performed by all CPUs. Thus, when setting a value of X, idle threads will spin for X/numCPUs iterations before halting. By default this number is set to 240K iterations collectively.
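
As a worked example of the X/numCPUs rule, the per-CPU spin count can be computed directly. The sketch assumes integer division and reads 240K as 240000, neither of which is stated above. The function name busy_iters_per_cpu is hypothetical.

```shell
# Hypothetical helper: iterations each idle thread spins before halting,
# given the collective cpumanager.busy_iters value and the number of CPUs.
busy_iters_per_cpu() {
    echo $(( $1 / $2 ))
}

busy_iters_per_cpu 240000 48   # prints 5000
```

Under these assumptions, running on fewer CPUs makes each idle thread spin longer before halting, since the collective budget is split among fewer threads.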

3.10. Benchmarking, instrumenting and debugging

As previously stated, there are several Nanos6 runtime variants, each one focusing on different aspects of parallel executions: performance, debugging, instrumentation, etc. OmpSs-2 applications do not require recompiling their code to run with instrumentation, e.g., to extract Paraver traces or to generate additional information. This is instead controlled through configuration options, at run-time. Users can select a specific Nanos6 variant when running an application by setting the version.instrument, version.debug and version.dependencies configuration variables. The next subsections explain the details of the different variants and options of Nanos6.

Important

The ovni instrumentation is the recommended variant to extract Paraver traces. Our support for the Extrae and CTF instrumentations is deprecated, so they will not receive new features.

3.11. Throttle

There are some cases where user programs are designed to run for a very long time, instantiating in the order of tens of millions of tasks or more. These programs can demand a huge amount of memory in small intervals when they rely only on data dependencies to achieve task synchronization. In these cases, the runtime system could run out of memory when allocating internal structures for task-related information if the number of instantiated tasks is not kept under control.

To prevent this issue, the runtime system offers a throttle mechanism that monitors memory usage and stops task creators while there is high memory pressure. This mechanism does not incur too much overhead because the stopped threads execute other ready tasks (already instantiated) until the memory pressure decreases. The main idea of this mechanism is to prevent the runtime system from exceeding the memory budget during execution. Furthermore, the execution time when enabling this feature should be similar to the time in a system with infinite memory.

The throttle mechanism requires a valid installation of Jemalloc, which is a scalable multithreaded memory allocator. Hence, the runtime system must be configured with the --with-jemalloc option. Although the throttle feature is disabled by default, it can be enabled and tuned at runtime through the following configuration variables:

throttle.enabled (default: false)

Boolean variable that enables the throttle mechanism.

throttle.tasks (default: 5,000,000)

Maximum absolute number of alive children that any task can have. It is divided by 10 at each nesting level.

throttle.pressure (default: 70)

Percentage of memory budget used at which point the number of tasks allowed to exist will be decreased linearly until reaching 1 at 100% memory pressure.

throttle.max_memory (default: available physical memory / 2)

Maximum used memory or memory budget. Note that this variable can be set in terms of bytes or in memory units. For example: throttle.max_memory=50GB.
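
The two rules described for throttle.tasks and throttle.pressure can be illustrated with some arithmetic. The sketch below is one plausible reading of those descriptions, not the runtime's implementation: the function names are hypothetical, integer arithmetic is assumed, and how the runtime rounds or combines the two limits is not specified here.

```shell
# Hypothetical helper: task limit at a given nesting level, dividing the
# top-level limit by 10 once per level.
throttle_level_limit() (
    limit=$1
    level=$2
    while [ "$level" -gt 0 ]; do
        limit=$(( limit / 10 ))
        level=$(( level - 1 ))
    done
    echo "$limit"
)

# Hypothetical helper: tasks allowed at a memory pressure (in percent).
# Below the threshold the full limit applies; from the threshold up to
# 100% the limit decreases linearly until it reaches 1.
throttle_allowed_tasks() (
    max_tasks=$1
    threshold=$2
    pressure=$3
    if [ "$pressure" -le "$threshold" ]; then
        echo "$max_tasks"
    else
        echo $(( max_tasks - (max_tasks - 1) * (pressure - threshold) / (100 - threshold) ))
    fi
)

throttle_level_limit 5000000 2        # prints 50000
throttle_allowed_tasks 5000000 70 85  # prints 2500001
```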

3.12. Hardware counters

Nanos6 offers an infrastructure to obtain hardware counter statistics of tasks with various backends. The usage of this API is controlled through the Nanos6 configuration file. Currently, Nanos6 supports the PAPI, RAPL and PQoS backends. All the available hardware counter backends are listed in the default configuration file, found in the share directory of the Nanos6 installation. To enable any of them, change the enabled field of the corresponding backend from false to true. Specific counters can be enabled or disabled by adding or removing their names from the list of counters inside each backend subsection.

Next we showcase a simplified version of the hardware counter section of the configuration file, where the PAPI backend is enabled with counters that monitor the total number of instructions and cycles, and the RAPL backend is enabled as well:

[hardware_counters]
    [hardware_counters.papi]
        enabled = true
        counters = ["PAPI_TOT_INS", "PAPI_TOT_CYC"]
    [hardware_counters.rapl]
        enabled = true

3.13. OmpSs-2@Cluster

In order to enable OmpSs-2@Cluster support, you need a working MPI installation in your environment that supports multithreading, i.e. MPI_THREAD_MULTIPLE. Nanos6 needs to be configured with the --enable-cluster flag.

For more information on how to write and run cluster applications see Cluster.md.