4. Library Routines

Warning

If you find any missing or outdated information regarding library routines or others, we would greatly appreciate your feedback. Please contact us at: star + at + bsc + . + es.

This chapter describes the set of OmpSs-2 run-time library routines available while executing an OmpSs-2 program. Programmers must ensure that programs calling these routines still compile and link correctly without OmpSs-2 support by using the conditional compilation mechanism (see Conditional compilation section).
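As a minimal sketch of that guard, the wrapper below calls a library routine only when OmpSs-2 support is enabled and falls back to standard C++ otherwise. It assumes the `_OMPSS_2` macro described in the Conditional compilation section and the nanos6.h header (a NODES build would include nodes.h instead); the `pause_us` name is hypothetical.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

#ifdef _OMPSS_2
#include <nanos6.h> // assumption: Nanos6 runtime; NODES would use nodes.h
#endif

// Hypothetical wrapper: pauses the caller for roughly `time_us` microseconds
// and returns the time actually spent paused, in microseconds.
std::uint64_t pause_us(std::uint64_t time_us)
{
#ifdef _OMPSS_2
  // With OmpSs-2 support, let the runtime run other tasks meanwhile
  return nanos6_wait_for(time_us);
#else
  // Without OmpSs-2 support, fall back to a plain thread sleep
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::microseconds(time_us));
  auto elapsed = std::chrono::steady_clock::now() - start;
  return static_cast<std::uint64_t>(
      std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
#endif
}
```

Keeping the guard inside one small wrapper means the rest of the program never mentions the run-time library directly, so it builds unchanged with any C++ compiler.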

In the following sections, we provide a short description of all public library routines. These routines are declared in public header files. These may differ between runtimes, but at the time of writing these can be found in the nanos6.h header file for the Nanos6 runtime, or in the nodes.h header file for the NODES runtime.

Although nOS-V and NODES are the successor libraries to Nanos6, for compatibility reasons the API has remained with as few changes as possible. Hence, the naming conventions for the public library routines have adopted the nanos6_ prefix even in the API exposed by the NODES runtime. Next, we list the most notable public library routines.

Warning

The run-time system library does not provide the Fortran header yet.

4.1. Spawning functions

The nanos6_spawn_function routine asynchronously spawns a new function that will leverage the run-time system resources:

void nanos6_spawn_function(
  void (*function)(void *),
  void *args,
  void (*completion_callback)(void *),
  void *completion_args,
  char const *label
);

The parameters of this routine are:

function the function to be spawned.
args a parameter that is passed to the function.
completion_callback an optional function that will be called when the function finishes.
completion_args a parameter that is passed to the completion callback.
label an optional name for the function.

The routine creates a new OmpSs-2 task that executes the function code and receives the args parameter. This task has an independent namespace of data dependencies and has no relationship with any other existing task (i.e., no taskwait will wait for it). Once the task finishes, the run-time system invokes the registered callback (i.e., completion_callback with the completion_args parameter). Since no taskwait covers the spawned task, this callback is the synchronization mechanism provided for detecting its completion.

The label string will be used for debugging/instrumentation purposes.

4.2. Task blocking

This API is composed of three functions and provides the functionality of pausing and resuming OmpSs-2 tasks at run-time. Firstly, the nanos6_get_current_blocking_context returns an opaque pointer that is used for blocking and unblocking the current task:

void *nanos6_get_current_blocking_context(void);

The underlying implementation may or may not return the same value for repeated calls to this function. Once a blocking context has been used in one call to nanos6_block_current_task and one call to nanos6_unblock_task, it is discarded and a new one must be obtained to perform another cycle of blocking and unblocking.

Secondly, the nanos6_block_current_task function blocks the execution of the current task:

void nanos6_block_current_task(void *blocking_context);

The current task will block at least until another thread calls nanos6_unblock_task with the blocking context of that task. The run-time system may choose to execute other tasks within the execution scope of this call.

Finally, the nanos6_unblock_task routine marks a previously blocked (or about to be blocked) task as unblocked:

void nanos6_unblock_task(void *blocking_context);

This function can be called even before the target task actually blocks. In that case, the target task will not block, as it has already been unblocked. However, only one call to nanos6_unblock_task may precede its matching call to nanos6_block_current_task. The return of this function does not guarantee that the blocked task has already resumed its execution; it only guarantees that eventually it will be resumed.

Note that both nanos6_block_current_task and nanos6_unblock_task may be task scheduling points.
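The following sketch shows one way the three calls could fit together: a task hands its blocking context to an external thread, blocks, and is unblocked once the external work is done. The `wait_for_external_result` helper is hypothetical. So that the sketch also builds without the runtime, the `#else` branch supplies toy stand-ins for the three routines; they are an assumption of this example and not the runtime's implementation.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

#ifdef _OMPSS_2
#include <nanos6.h> // assumption: Nanos6 runtime; NODES would use nodes.h
#else
// Toy stand-ins (NOT the runtime's implementation): one context per thread
struct ToyBlockingContext {
  std::mutex mutex;
  std::condition_variable cond;
  bool unblocked = false;
};

static void *nanos6_get_current_blocking_context()
{
  thread_local ToyBlockingContext context;
  return &context;
}

static void nanos6_block_current_task(void *blocking_context)
{
  auto *context = static_cast<ToyBlockingContext *>(blocking_context);
  std::unique_lock<std::mutex> lock(context->mutex);
  // Returns immediately if the matching unblock call already happened
  context->cond.wait(lock, [context] { return context->unblocked; });
  context->unblocked = false; // one block/unblock cycle consumes the context
}

static void nanos6_unblock_task(void *blocking_context)
{
  auto *context = static_cast<ToyBlockingContext *>(blocking_context);
  std::lock_guard<std::mutex> lock(context->mutex);
  context->unblocked = true;
  context->cond.notify_one();
}
#endif

// Hypothetical helper: blocks the calling task until an external thread
// has produced a value, then returns that value.
int wait_for_external_result()
{
  int result = 0;
  void *context = nanos6_get_current_blocking_context();

  std::thread worker([&result, context] {
    result = 42;                  // stands in for an asynchronous operation
    nanos6_unblock_task(context); // may run before or after the block call
  });

  nanos6_block_current_task(context);
  worker.join(); // also orders the write to `result` before the read below
  return result;
}
```

The context is obtained before the external thread starts, so the unblock call is valid whether it arrives before or after the task actually blocks.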

4.3. Time-based task blocking

OmpSs-2 provides an API function named nanos6_wait_for that pauses the calling task for a given time in microseconds. The task is suspended for approximately that time and yields the CPU so that other tasks can execute in the meantime:

uint64_t nanos6_wait_for(uint64_t time_us);

The function returns the actual time slept so that the caller task can make decisions based on that, for instance, to change the next sleep time.

4.4. Polling services

Warning

The original polling service API has been removed. This section explains a more flexible, usable and performant approach.

The OmpSs-2 programming model offers a mechanism to perform dynamic and periodic polling, which can be useful for third-party libraries. In this way, libraries do not need an additional thread (potentially oversubscribing CPU resources) to, for instance, check asynchronous operations periodically.

We recommend spawning an independent task by calling the nanos6_spawn_function (see Spawning functions) and passing the desired polling function and arguments as parameters. Then, the polling function can execute a loop (until the polling must finish) where each iteration checks the corresponding asynchronous operations and then pauses the task for a specific amount of microseconds with nanos6_wait_for (see Time-based task blocking). During the pause time, the CPU can be leveraged by the run-time system to execute other ready tasks.

This flexible mechanism allows changing the polling frequency dynamically. Moreover, the user may take into account the actual slept time (returned by nanos6_wait_for) or the workload to recompute the required frequency.

We show an example in C++ below:

std::atomic<bool> _mustFinish(false);
std::atomic<bool> _finished(false);

void polling_function(void *args)
{
  uint64_t time_us = 500; // 500 microseconds

  while (!_mustFinish.load()) {
    // Call the actual polling user function
    check_operations_completion();

    // Pause the polling task
    uint64_t actual_us = nanos6_wait_for(time_us);

    // Change time_us if needed...
  }
}

void polling_complete(void *args)
{
  // Notify the spawner when the polling task has completed
  _finished.store(true);
}

int main()
{
  nanos6_spawn_function(polling_function, NULL,
                        polling_complete, NULL,
                        "polling task");

  // ...

  // Notify the polling task to stop
  _mustFinish.store(true);

  while (!_finished.load()) {
    // Yield the CPU while we wait
    nanos6_wait_for(100);
  }
}
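The "Change time_us if needed" step in the example above could be filled in many ways. The helper below is one hypothetical policy, not part of the OmpSs-2 API: it halves the period when the last check found completed operations, doubles it otherwise, subtracts any oversleep reported by nanos6_wait_for, and clamps the result. The name `next_period_us` and the 50 to 5000 microsecond bounds are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical policy for recomputing the polling period.
//   requested_us: period passed to the last nanos6_wait_for call
//   actual_us:    value returned by that call (time actually slept)
//   found_work:   whether the last check saw any completed operation
std::uint64_t next_period_us(std::uint64_t requested_us,
                             std::uint64_t actual_us,
                             bool found_work)
{
  const std::uint64_t min_us = 50;   // assumed lower bound
  const std::uint64_t max_us = 5000; // assumed upper bound

  // Poll faster while operations are completing, slower when idle
  std::uint64_t next = found_work ? requested_us / 2 : requested_us * 2;

  // Compensate for oversleeping so the average period stays near the target
  if (actual_us > requested_us) {
    std::uint64_t overshoot = actual_us - requested_us;
    next = (next > overshoot) ? next - overshoot : 0;
  }

  return std::min(std::max(next, min_us), max_us);
}
```

Inside the polling loop this would be used as `time_us = next_period_us(time_us, actual_us, found_work);`, where `found_work` would have to be reported by the user's own checking function.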

4.5. Task external events

The task external events API allows binding the release of task data dependencies to the completion of external events, in addition to the completion of the task body. A task fully completes (i.e., releases its dependencies) once it finishes the execution of its task body and all external events bound during its execution are fulfilled. This allows binding the completion of a task to the finalization of any asynchronous operation.

This API is composed of three functions, which follow the style of the blocking API. Firstly, the nanos6_get_current_event_counter returns an opaque pointer that is used to increase/decrease the number of events:

void *nanos6_get_current_event_counter(void);

This function provides implementation-specific data that we name event counter throughout the rest of this text. A task can bind its completion to new external events by calling the following function:

void nanos6_increase_current_task_event_counter(
  void *event_counter,
  unsigned int increment
);

This function atomically increases the number of pending external events of the calling task. The first parameter event_counter must be the event counter of the invoking task, while the second parameter increment is the number of external events to be bound. The presence of pending events in a task prevents the release of its dependencies, even if the task has finished its execution. Note that only the task itself can bind its external events.

Then, the task itself, or another task or external thread, can fulfill the events of the task by calling the following function:

void nanos6_decrease_task_event_counter(
  void *event_counter,
  unsigned int decrement
);

This function atomically decreases the number of pending external events of a given task. The first parameter event_counter is the event counter of the target task, while the second parameter decrement is the number of completed external events to be decreased. Once the number of external events of a task becomes zero and it finishes its execution, the task can release its dependencies.

Note that all external events of a task can complete before it actually finishes its execution. In this case, the task will release its dependencies as soon as it finishes its execution. Otherwise, the last call that makes the counter reach zero will trigger the release of the dependencies.

Notice that the user is responsible for not fulfilling events that the target task has not yet bound.
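The two release orderings described above can be illustrated with a toy model written in standard C++. This is only a sketch of the semantics, not how the runtime implements event counters: the `ToyTask` class, its method names, and the choice of counting the unfinished task body as one pending unit are all assumptions of this example.

```cpp
#include <atomic>

// Toy model of a task with external events (NOT the runtime's implementation)
class ToyTask {
public:
  // Plays the role of nanos6_increase_current_task_event_counter
  void bind_events(unsigned int increment) { pending_.fetch_add(increment); }

  // Plays the role of nanos6_decrease_task_event_counter
  void fulfill_events(unsigned int decrement) { drop(decrement); }

  // Called when the task body returns
  void finish_body() { drop(1); }

  bool released() const { return released_.load(); }

private:
  void drop(unsigned int n)
  {
    // fetch_sub returns the previous value; reaching zero triggers the release
    if (pending_.fetch_sub(n) == n)
      released_.store(true);
  }

  std::atomic<unsigned int> pending_{1}; // 1 stands for the unfinished body
  std::atomic<bool> released_{false};
};

// Case 1: all events are fulfilled before the body finishes
bool releases_when_body_finishes()
{
  ToyTask task;
  task.bind_events(2);
  task.fulfill_events(2);
  bool before = task.released(); // false: the body is still running
  task.finish_body();
  return !before && task.released();
}

// Case 2: the body finishes first; the last fulfilled event triggers release
bool releases_on_last_event()
{
  ToyTask task;
  task.bind_events(2);
  task.finish_body();
  task.fulfill_events(1);
  bool before = task.released(); // false: one event is still pending
  task.fulfill_events(1);
  return !before && task.released();
}
```

In both cases the dependencies are released exactly once, by whichever call brings the combined count to zero.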

4.6. Coroutines

Warning

Coroutines are only supported through the NODES + nOS-V runtime systems.

C++20 introduced coroutines, which enable asynchronous programming and lazy computations. Coroutines can suspend and resume their execution similar to normal tasks. However, they leverage compiler-assisted hints to achieve an optimal memory footprint and optimize context switching. Coroutines do not support Thread-Local Storage (TLS).

OmpSs-2 provides support for coroutines. To leverage it, programmers must modify their applications to incorporate C++20 coroutines as follows:

A task’s return type must be changed to the coroutine type offered by OmpSs-2.

The respective blocking API calls must be replaced with their awaitable counterparts.

To leverage coroutines, OmpSs-2 exposes the following library routines:

oss_taskwait_awaitable oss_co_taskwait(); - Awaitable version of nanos6_taskwait.

std::suspend_always oss_co_suspend_current_task(); - Awaitable version of nanos6_block_current_task.

oss_user_lock_awaitable oss_co_user_lock(void **handlerPointer); - Awaitable version of nanos6_user_lock.

oss_waitfor_awaitable oss_co_wait_for(uint64_t timeout, uint64_t *waitTime); - Awaitable version of nanos6_wait_for.

oss_yield_awaitable oss_co_yield(); - Awaitable version of nanos6_yield.

With the coroutine API, tasks are not blocked as they are with the counterpart functions. Instead, the coroutines inside the tasks are suspended when these functions are called. Next, we showcase how an OmpSs-2 Fibonacci example can leverage coroutines:

#pragma oss task label("fibonacci")
oss_coroutine fibonacci(int index, int *result) {
   if (index <= 1) {
      *result = index;
   } else {
      int r1, r2;

      fibonacci(index-1, &r1);
      fibonacci(index-2, &r2);

      co_await oss_co_taskwait();

      *result = r1 + r2;
   }
}

Finally, the example listed below showcases what the coroutine counterpart of a call to nanos6_wait_for may look like:

#pragma oss task label("example timeout")
oss_coroutine timeout_task(uint64_t *timeoutP, int timeout)
{
   co_await oss_co_wait_for(timeout, timeoutP);
}

For further information on OmpSs-2 coroutine-related library routines and their counterparts, refer to the documentation found within the public library headers.

4.7. NUMA Awareness

Warning

The NUMA Aware API is currently only available through the Nanos6 runtime system.

OmpSs-2 offers a simple API to mitigate NUMA effects, that is, the performance penalties caused by accessing data that resides in remote NUMA nodes. By leveraging this API, users can allocate memory in NUMA systems using various policies. In this way, the runtime library can apply NUMA-aware scheduling decisions which may benefit performance.

The NUMA API comprises the functions explained below, which allow users to allocate and free memory, and decide how the data is distributed across NUMA nodes.

The nanos6_numa_alloc_block_interleave function is used to allocate data. Users should replace their regular allocation methods (malloc, mmap, new, etc.) with this function. It allocates size bytes, interleaving the allocation in blocks of block_size bytes among the NUMA nodes specified by the mask:

void *nanos6_numa_alloc_block_interleave(uint64_t size, const nanos6_bitmask_t *mask, uint64_t block_size);
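To make the block interleaving concrete, the toy helper below computes which NUMA node would hold a given byte offset of such an allocation. It is a sketch under stated assumptions and not the runtime's placement code: blocks are assumed to be dealt round-robin to the enabled nodes in ascending order, page-size rounding is ignored, and a plain 64-bit integer stands in for nanos6_bitmask_t. The `node_of_offset` name is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Toy model: returns the NUMA node that would hold byte `offset` of an
// allocation interleaved in blocks of `block_size` bytes among the nodes
// whose bits are set in `mask`. Returns -1 for an empty mask or zero block.
int node_of_offset(std::uint64_t offset, std::uint64_t block_size,
                   std::uint64_t mask)
{
  // Collect the enabled node identifiers in ascending order
  std::vector<int> nodes;
  for (int bit = 0; bit < 64; ++bit) {
    if (mask & (std::uint64_t{1} << bit))
      nodes.push_back(bit);
  }

  if (nodes.empty() || block_size == 0)
    return -1;

  // Assumed round-robin distribution of consecutive blocks
  return nodes[(offset / block_size) % nodes.size()];
}
```

For example, with nodes 0, 1 and 3 enabled and 4096-byte blocks, this model places the first three blocks on nodes 0, 1 and 3 and then wraps around to node 0.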

The nanos6_numa_free is used to release previously allocated memory. The parameter ptr is the pointer to the memory that will be freed, which should be the one returned by a previous allocation call:

void *nanos6_numa_free(void *ptr);

The nanos6_bitmask_set_wildcard function sets a bitmask according to a wildcard. The available wildcards are: NUMA_ALL, which sets all the NUMA nodes available in the system; NUMA_ALL_ACTIVE, which sets the NUMA nodes on which the current process has all cores available (i.e., the process can run on all of them); and NUMA_ANY_ACTIVE, which sets the NUMA nodes on which the current process has at least one core available:

void nanos6_bitmask_set_wildcard(nanos6_bitmask_t *bitmask, wildcard);

There are several functions to query and manipulate the bitmask, which are shown below. The nanos6_count_setbits function returns the number of enabled bits in a bitmask, which is useful to retrieve the number of NUMA nodes available to the application. The remaining functions are not discussed, as their behavior is self-explanatory:

uint64_t nanos6_count_setbits(const nanos6_bitmask_t *bitmask);

uint64_t nanos6_bitmask_isbitset(const nanos6_bitmask_t *bitmask, uint64_t n);

void nanos6_bitmask_clearall(nanos6_bitmask_t *bitmask);

void nanos6_bitmask_clearbit(nanos6_bitmask_t *bitmask, uint64_t n);

void nanos6_bitmask_setall(nanos6_bitmask_t *bitmask);

void nanos6_bitmask_setbit(nanos6_bitmask_t *bitmask, uint64_t n);

Next, we showcase the pseudo-code of a usage example of the OmpSs-2 NUMA API, where some arrays are allocated and interleaved between NUMA nodes:

nanos6_bitmask_t bitmask;
nanos6_bitmask_set_wildcard(&bitmask, NUMA_ALL);
size_t numa_nodes = nanos6_count_setbits(&bitmask);
size_t size = N*sizeof(double);
size_t block_size = size/numa_nodes;

// Allocate vectors
double *a = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *b = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *c = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);

// Initialize
for (int i = 0; i < NUM_BLOCKS; i++) {
  #pragma oss task out(a[i*TS ;TS], b[i*TS ;TS], c[i*TS ;TS])
  {
    // init a, b, c
  }
}

// Execution
for (int step = 0; step < timesteps; step++) {
  for (int i = 0; i < NUM_BLOCKS; i++) {
    #pragma oss task in(a[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // copy kernel
    }

    #pragma oss task in(c[i*TS ;TS]) out(b[i*TS ;TS])
    {
      // scale kernel
    }

    #pragma oss task in(a[i*TS ;TS], b[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // add kernel
    }

    #pragma oss task in(b[i*TS ;TS], c[i*TS ;TS]) out(a[i*TS ;TS])
    {
      // triad kernel
    }
  }
}

// Release memory
nanos6_numa_free(a);
nanos6_numa_free(b);
nanos6_numa_free(c);

Further information on the NUMA Aware API can be found in the following doctoral thesis.