4. Library Routines

This chapter describes the OmpSs-2 run-time library routines available while executing an OmpSs-2 program. Programmers must ensure that their programs still compile and link correctly without OmpSs-2 support by using the conditional compilation mechanism (see the Conditional compilation section).

In the following sections, we provide a short description of all public library routines. These routines are declared in the public header file named nanos6.h.
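As a minimal sketch of such a guard, the helper below only includes nanos6.h and calls a library routine when OmpSs-2 support is enabled. It assumes that the compiler defines the _OMPSS_2 macro in that case, as described in the Conditional compilation section. The helper name pause_us is hypothetical, and nanos6_wait_for is described in the Time-based task blocking section.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

#ifdef _OMPSS_2 // assumption: defined only when OmpSs-2 support is enabled
#include <nanos6.h>
#endif

// Hypothetical helper: pause for about time_us microseconds and return the
// time actually spent paused, with or without OmpSs-2 support.
uint64_t pause_us(uint64_t time_us)
{
#ifdef _OMPSS_2
  // OmpSs-2 build: pause the calling task so the CPU can run other tasks
  return nanos6_wait_for(time_us);
#else
  // Plain build: fall back to a regular thread sleep and measure it
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::microseconds(time_us));
  auto elapsed = std::chrono::steady_clock::now() - start;
  return static_cast<uint64_t>(
    std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
#endif
}
```

Keeping the guard inside a small wrapper means the rest of the program does not need any preprocessor conditionals.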

Warning

The run-time system library does not provide the Fortran header yet.

4.1. Spawning functions

The nanos6_spawn_function routine asynchronously spawns a new function that leverages the run-time system resources:

void nanos6_spawn_function(
  void (*function)(void *),
  void *args,
  void (*completion_callback)(void *),
  void *completion_args,
  char const *label
);

The parameters of this routine are:

  • function: the function to be spawned.

  • args: a parameter that is passed to the function.

  • completion_callback: an optional function that is called when the spawned function finishes.

  • completion_args: a parameter that is passed to the completion callback.

  • label: an optional name for the function.

The routine creates a new OmpSs-2 task that executes the function code and receives args as its parameter. This task has an independent namespace of data dependencies and has no relationship with any other existing task (i.e., no taskwait will wait for it). Once the task finishes, the run-time system invokes the registered callback (i.e., completion_callback with completion_args as its parameter). Since no taskwait covers the spawned task, this callback is the synchronization mechanism provided to detect its completion.

The label string will be used for debugging/instrumentation purposes.
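The following sketch shows one possible way to pass data to a spawned function and to wait for its completion callback. The Job structure and the job_body, job_completed and run_job names are hypothetical. The stand-in definitions are only there so that the sketch also builds without OmpSs-2 support (assuming the _OMPSS_2 macro identifies an OmpSs-2 build); they are not the run-time system implementation.

```cpp
#include <atomic>
#include <cstdint>

#ifdef _OMPSS_2 // assumption: defined only when OmpSs-2 support is enabled
#include <nanos6.h>
#else
// Stand-ins for builds without OmpSs-2 (not the real implementation):
// the function and its callback simply run inline.
static void nanos6_spawn_function(void (*function)(void *), void *args,
    void (*completion_callback)(void *), void *completion_args,
    char const *label)
{
  (void) label;
  function(args);
  if (completion_callback != nullptr)
    completion_callback(completion_args);
}
static uint64_t nanos6_wait_for(uint64_t time_us) { return time_us; }
#endif

// Hypothetical job descriptor shared by the function and its callback
struct Job {
  int input = 0;
  int result = 0;
  std::atomic<bool> done{false};
};

// Body of the spawned function: receives the job through args
static void job_body(void *args)
{
  Job *job = static_cast<Job *>(args);
  job->result = 2 * job->input;
}

// Completion callback: receives the job through completion_args
static void job_completed(void *args)
{
  static_cast<Job *>(args)->done.store(true);
}

// Spawn the job and wait until its completion callback has run
int run_job(int input)
{
  Job job;
  job.input = input;
  nanos6_spawn_function(job_body, &job, job_completed, &job, "double job");

  while (!job.done.load()) {
    nanos6_wait_for(100); // yield the CPU while waiting
  }
  return job.result;
}
```

The atomic flag is needed because a taskwait in the spawner would not wait for the spawned task.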

4.2. Task blocking

This API is composed of three functions and provides the functionality of pausing and resuming OmpSs-2 tasks at run-time. Firstly, the nanos6_get_current_blocking_context function returns an opaque pointer that is used for blocking and unblocking the current task:

void *nanos6_get_current_blocking_context(void);

The underlying implementation may or may not return the same value for repeated calls to this function. Once a blocking context has been used in one call to nanos6_block_current_task and one call to nanos6_unblock_task, it is discarded and a new one must be obtained to perform another cycle of blocking and unblocking.

Secondly, the nanos6_block_current_task function blocks the execution of the current task:

void nanos6_block_current_task(void *blocking_context);

The current task will block at least until another thread calls nanos6_unblock_task with the blocking context of that task. The run-time system may choose to execute other tasks within the execution scope of this call.

Finally, the nanos6_unblock_task routine marks as unblocked a task that was previously blocked or is about to block:

void nanos6_unblock_task(void *blocking_context);

This function can be called even before the target task actually blocks. In that case, the target task will not block because it has already been unblocked. However, only one call to nanos6_unblock_task may precede its matching call to nanos6_block_current_task. The return of this function does not guarantee that the blocked task has already resumed its execution; it only guarantees that it will be resumed.

Note that both nanos6_block_current_task and nanos6_unblock_task may be task scheduling points.
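A full cycle of the three functions could look like the sketch below, where a task blocks until a helper thread has produced a value. The wait_for_helper name is hypothetical, and the sketch assumes that a thread created with std::thread may call nanos6_unblock_task. The stand-in definitions only make the sketch buildable without OmpSs-2 support (assuming the _OMPSS_2 macro identifies an OmpSs-2 build); they are not the run-time system implementation.

```cpp
#include <thread>

#ifdef _OMPSS_2 // assumption: defined only when OmpSs-2 support is enabled
#include <nanos6.h>
#else
#include <condition_variable>
#include <mutex>
// Stand-ins for builds without OmpSs-2 (not the real implementation):
// a single context backed by a mutex and a condition variable.
struct StandInContext {
  std::mutex mutex;
  std::condition_variable cond;
  bool unblocked = false;
};
static StandInContext standInContext;

static void *nanos6_get_current_blocking_context()
{
  standInContext.unblocked = false;
  return &standInContext;
}

static void nanos6_block_current_task(void *blocking_context)
{
  auto *context = static_cast<StandInContext *>(blocking_context);
  std::unique_lock<std::mutex> lock(context->mutex);
  // Returns immediately if the unblock call already happened
  context->cond.wait(lock, [context] { return context->unblocked; });
}

static void nanos6_unblock_task(void *blocking_context)
{
  auto *context = static_cast<StandInContext *>(blocking_context);
  std::lock_guard<std::mutex> lock(context->mutex);
  context->unblocked = true;
  context->cond.notify_one();
}
#endif

// Block the current task until a helper thread has produced a value
int wait_for_helper()
{
  // A fresh context is obtained for each block/unblock cycle
  void *context = nanos6_get_current_blocking_context();
  int value = 0;

  std::thread helper([&value, context] {
    value = 42;                   // stands for some asynchronous work
    nanos6_unblock_task(context); // may run before or after the block call
  });

  nanos6_block_current_task(context);
  helper.join(); // also orders the helper's write before the read below
  return value;
}
```

The sketch works regardless of whether the helper calls nanos6_unblock_task before or after the task blocks, which is the ordering freedom described above.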

4.3. Time-based task blocking

OmpSs-2 defines an API function named nanos6_wait_for that pauses the calling task for a given time in microseconds. The task stops for approximately that time and yields the CPU so that other tasks can execute in the meantime:

uint64_t nanos6_wait_for(uint64_t time_us);

The function returns the actual time slept so that the calling task can make decisions based on it, for instance, to change the next sleep time.
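One possible use of the returned value is to compensate for pauses that last longer than requested. The helper below is a hypothetical policy, not part of the API: it shortens the next request by the last overshoot and never goes below a quarter of the desired period. Both the name next_sleep_us and the one-quarter floor are assumptions chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical policy: compute the next time to request from nanos6_wait_for.
//   period_us:    desired time between two consecutive wake-ups
//   requested_us: time passed to the previous nanos6_wait_for call
//   actual_us:    time returned by the previous nanos6_wait_for call
uint64_t next_sleep_us(uint64_t period_us, uint64_t requested_us,
                       uint64_t actual_us)
{
  // How much longer than requested the previous pause lasted
  uint64_t overshoot = (actual_us > requested_us) ? actual_us - requested_us : 0;

  // Never request less than a quarter of the period
  uint64_t floor_us = period_us / 4;

  if (period_us > overshoot + floor_us)
    return period_us - overshoot;
  return floor_us;
}
```

A task would call it as `time_us = next_sleep_us(period_us, time_us, nanos6_wait_for(time_us));` at each iteration.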

4.4. Polling services

Warning

The original polling service API has been removed. This section explains a more flexible, usable and performant approach.

The OmpSs-2 programming model offers a mechanism to perform dynamic and periodic polling, which can be useful for third-party libraries. In this way, libraries do not need an additional thread (which could oversubscribe CPU resources) to, for instance, periodically check asynchronous operations.

We recommend spawning an independent task by calling nanos6_spawn_function (see Spawning functions) and passing the desired polling function and arguments as parameters. Then, the polling function can execute a loop (until the polling has to finish) where each iteration checks the corresponding asynchronous operations and then pauses the task for a specific number of microseconds with nanos6_wait_for (see Time-based task blocking). During the pause time, the CPU can be leveraged by the run-time system to execute other ready tasks.

This flexible mechanism allows changing the polling frequency dynamically. Moreover, the user may take into account the actual time slept (returned by nanos6_wait_for) or the workload to recompute the required frequency.

We show an example in C++ below:

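#include <atomic>
#include <cstddef>
#include <cstdint>
#include <nanos6.h>

// User-provided function that checks the pending asynchronous operations
void check_operations_completion();

std::atomic<bool> _mustFinish(false);
std::atomic<bool> _finished(false);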

void polling_function(void *args)
{
  uint64_t time_us = 500; // 500 microseconds

  while (!_mustFinish.load()) {
    // Call the actual polling user function
    check_operations_completion();

    // Pause the polling task
    uint64_t actual_us = nanos6_wait_for(time_us);

    // Change time_us if needed...
  }
}

void polling_complete(void *args)
{
  // Notify the spawner when the polling task has completed
  _finished.store(true);
}

int main()
{
  nanos6_spawn_function(polling_function, NULL,
                        polling_complete, NULL,
                        "polling task");

  // ...

  // Notify the polling task to stop
  _mustFinish.store(true);

  while (!_finished.load()) {
    // Yield the CPU while we wait
    nanos6_wait_for(100);
  }
}

4.5. Task external events

The task external events API provides the functionality of decoupling the release of task data dependencies from the end of the task body and binding it to the completion of external events. A task fully completes (i.e., releases its dependencies) once it finishes the execution of its task body and all external events bound during its execution are fulfilled. This allows binding the completion of a task to the finalization of any asynchronous operation.

This API is composed of three functions, which follow the style of the blocking API. Firstly, the nanos6_get_current_event_counter function returns an opaque pointer that is used to increase and decrease the number of events:

void *nanos6_get_current_event_counter(void);

This function returns implementation-specific data that we call the event counter throughout the rest of this text. A task can bind its completion to new external events by calling the following function:

void nanos6_increase_current_task_event_counter(
  void *event_counter,
  unsigned int increment
);

This function atomically increases the number of pending external events of the calling task. The first parameter event_counter must be the event counter of the invoking task, while the second parameter increment is the number of external events to be bound. The presence of pending events in a task prevents the release of its dependencies, even if the task has finished its execution. Note that only the task itself can bind its external events.

Then, the task itself, or another task or external thread, can fulfill the events of the task by calling the following function:

void nanos6_decrease_task_event_counter(
  void *event_counter,
  unsigned int decrement
);

This function atomically decreases the number of pending external events of a given task. The first parameter event_counter is the event counter of the target task, while the second parameter decrement is the number of completed external events. Once the number of pending external events of a task becomes zero and the task has finished its execution, the task can release its dependencies.

Note that all external events of a task can complete before the task actually finishes its execution. In this case, the task will release its dependencies as soon as it finishes its execution. Otherwise, the last call that makes the counter become zero will trigger the release of the dependencies.

Notice that the user is responsible for not fulfilling events that the target task has not yet bound.
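The release rule can be illustrated with a small model written in standard C++. This is only a sketch of the semantics described in this section, not the run-time system implementation. The TaskEventModel class and its member names are hypothetical, and starting the counter at one (a unit that stands for the task body) is an assumption of the model.

```cpp
#include <atomic>

// Model of the release rule (not the run-time system implementation).
// The counter starts at one: that unit stands for the task body itself.
class TaskEventModel {
public:
  // Bind new external events (only the task itself may do this)
  void increase(unsigned int increment) { _pending.fetch_add(increment); }

  // Fulfill events; the call that brings the counter to zero releases
  void decrease(unsigned int decrement)
  {
    if (_pending.fetch_sub(decrement) == decrement)
      _released.store(true);
  }

  // The task body finished: drop the initial unit
  void finishBody() { decrease(1); }

  bool released() const { return _released.load(); }

private:
  std::atomic<unsigned int> _pending{1};
  std::atomic<bool> _released{false};
};

// All events are fulfilled first: the release happens when the body ends
bool releasedByBodyEnd()
{
  TaskEventModel task;
  task.increase(2);
  task.decrease(2);
  bool before = task.released(); // false: the body is still running
  task.finishBody();
  return !before && task.released();
}

// The body ends first: the last decrease triggers the release
bool releasedByLastEvent()
{
  TaskEventModel task;
  task.increase(2);
  task.finishBody();
  task.decrease(1);
  bool before = task.released(); // false: one event is still pending
  task.decrease(1);
  return !before && task.released();
}
```

In both orderings the dependencies are released exactly once, by whichever call observes that nothing is pending anymore.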

4.6. NUMA Awareness

OmpSs-2 offers a simple API to mitigate NUMA effects, that is, the adverse effects caused by accessing data located in remote NUMA nodes. By leveraging this API, users can allocate memory in NUMA systems using various policies. In this way, the run-time library can apply NUMA-aware scheduling decisions, which may benefit performance.

The NUMA API comprises the functions explained below, which allow users to allocate and free memory, and decide how the data is distributed across NUMA nodes.

The nanos6_numa_alloc_block_interleave function is used to allocate data. Users should replace their regular allocation methods (malloc, mmap, new, etc.) with this function. It allocates size bytes, interleaving the allocation in blocks of block_size bytes among the NUMA nodes specified by the mask:

void *nanos6_numa_alloc_block_interleave(uint64_t size, const nanos6_bitmask_t *mask, uint64_t block_size);

The nanos6_numa_free function releases previously allocated memory. The parameter ptr is the pointer to the memory that will be freed, which should be the one returned by a previous allocation call:

void *nanos6_numa_free(void *ptr);

The nanos6_bitmask_set_wildcard function sets a bitmask depending on a wildcard. The available wildcards are: NUMA_ALL to set all the available NUMA nodes in the system, NUMA_ALL_ACTIVE to set the NUMA nodes in which the current process has all the cores available (i.e., the process can run on them), and NUMA_ANY_ACTIVE to set the NUMA nodes in which the current process has at least one core available:

void nanos6_bitmask_set_wildcard(nanos6_bitmask_t *bitmask, wildcard);

There are several functions to query and manipulate the bitmask, which are shown below. The nanos6_count_setbits function returns the number of enabled bits in a bitmask, which is useful to retrieve the number of NUMA nodes available to the application. The rest of the functions are not discussed, as their behavior is self-explanatory:

uint64_t nanos6_count_setbits(const nanos6_bitmask_t *bitmask);

uint64_t nanos6_bitmask_isbitset(const nanos6_bitmask_t *bitmask, uint64_t n);

void nanos6_bitmask_clearall(nanos6_bitmask_t *bitmask);

void nanos6_bitmask_clearbit(nanos6_bitmask_t *bitmask, uint64_t n);

void nanos6_bitmask_setall(nanos6_bitmask_t *bitmask);

void nanos6_bitmask_setbit(nanos6_bitmask_t *bitmask, uint64_t n);

Next, we show the pseudo-code of a usage example of the OmpSs-2 NUMA API, where some arrays are allocated and interleaved among NUMA nodes:

nanos6_bitmask_t bitmask;
nanos6_bitmask_set_wildcard(&bitmask, NUMA_ALL);
size_t numa_nodes = nanos6_count_setbits(&bitmask);
size_t size = N*sizeof(double);
size_t block_size = size/numa_nodes;

// Allocate vectors
double *a = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *b = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *c = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);

// Initialize
for (int i = 0; i < NUM_BLOCKS; i++) {
  #pragma oss task out(a[i*TS ;TS], b[i*TS ;TS], c[i*TS ;TS])
  {
    // init a, b, c
  }
}

// Execution
for (int step = 0; step < timesteps; step++) {
  for (int i = 0; i < NUM_BLOCKS; i++) {
    #pragma oss task in(a[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // copy kernel
    }

    #pragma oss task in(c[i*TS ;TS]) out(b[i*TS ;TS])
    {
      // scale kernel
    }

    #pragma oss task in(a[i*TS ;TS], b[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // add kernel
    }

    #pragma oss task in(b[i*TS ;TS], c[i*TS ;TS]) out(a[i*TS ;TS])
    {
      // triad kernel
    }
  }
}

// Release memory
nanos6_numa_free(a);
nanos6_numa_free(b);
nanos6_numa_free(c);
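The example above uses size/numa_nodes as the block size, which leaves a small trailing block when size is not a multiple of the number of nodes. The helpers below sketch the arithmetic involved. Both names are hypothetical, and block_owner assumes that blocks are handed out to the enabled nodes in round-robin order, which this text does not guarantee. The sketch does not account for any alignment that the run-time system may apply.

```cpp
#include <cstdint>

// Hypothetical helper: block size that splits size bytes into at most
// numa_nodes blocks (rounds up, so there is no extra trailing block)
uint64_t interleave_block_size(uint64_t size, uint64_t numa_nodes)
{
  return (size + numa_nodes - 1) / numa_nodes;
}

// Hypothetical helper: position, among the nodes enabled in the mask, of the
// node that would hold the byte at offset, ASSUMING round-robin distribution
uint64_t block_owner(uint64_t offset, uint64_t block_size, uint64_t numa_nodes)
{
  return (offset / block_size) % numa_nodes;
}
```

With one block per node, choosing the task size TS so that each task touches a single block keeps all of its accesses within one NUMA node under that assumption.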