4. Library Routines

This chapter describes the OmpSs-2 run-time library routines available while executing an OmpSs-2 program. Programmers must ensure that their programs still compile and link correctly without OmpSs-2 support by using the conditional compilation mechanism (see the Conditional compilation section).

In the following sections, we provide a short description of all public library routines. These routines are declared in the public header file named nanos6.h.

Warning

The run-time system library does not provide the Fortran header yet.

4.1. Spawning functions

The nanos6_spawn_function routine asynchronously spawns a new function that uses the run-time system resources:

void nanos6_spawn_function(
  void (*function)(void *),
  void *args,
  void (*completion_callback)(void *),
  void *completion_args,
  char const *label
);

The parameters of this routine are:

  • function: the function to be spawned.

  • args: a parameter that is passed to the function.

  • completion_callback: an optional function that will be called when the function finishes.

  • completion_args: a parameter that is passed to the completion callback.

  • label: an optional name for the function.

The routine creates a new OmpSs-2 task that executes the function code and receives the args parameter. This task has an independent namespace of data dependencies and has no relationship with any other existing task (i.e., no taskwait will wait for it). Once the task finishes, the run-time system invokes the registered callback (i.e., completion_callback with the completion_args parameter). This callback is the synchronization mechanism provided for spawned functions.

The label string will be used for debugging/instrumentation purposes.
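
As a minimal sketch of this pattern, the code below spawns a function and uses the completion callback to learn when it has finished. The Job type, the compute, on_complete and run_job functions, and the HAVE_NANOS6 macro are hypothetical names introduced only for illustration. When the macro is not defined, simple stand-ins replace the run-time routines so that the sketch builds without OmpSs-2; they run everything synchronously and do not reflect the real asynchronous behavior.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

#ifdef HAVE_NANOS6  // hypothetical macro: define it when building with OmpSs-2
#include <nanos6.h>
#else
// Stand-ins so the sketch builds without the run-time system. They run the
// function and its callback synchronously; the real routine is asynchronous.
static void nanos6_spawn_function(void (*function)(void *), void *args,
                                  void (*completion_callback)(void *),
                                  void *completion_args, char const *label)
{
  (void) label;
  function(args);
  if (completion_callback != nullptr)
    completion_callback(completion_args);
}

static uint64_t nanos6_wait_for(uint64_t time_us)
{
  std::this_thread::sleep_for(std::chrono::microseconds(time_us));
  return time_us;
}
#endif

// Hypothetical job descriptor shared by the spawned function and its callback
struct Job {
  int input = 0;
  int result = 0;
  std::atomic<bool> done{false};
};

static void compute(void *args)
{
  Job *job = static_cast<Job *>(args);
  job->result = 2 * job->input;
}

static void on_complete(void *args)
{
  // Signal the spawner that the spawned function has finished
  static_cast<Job *>(args)->done.store(true);
}

int run_job(int input)
{
  Job job;
  job.input = input;
  nanos6_spawn_function(compute, &job, on_complete, &job, "double job");

  // No taskwait waits for the spawned task, so poll the flag set by the callback
  while (!job.done.load())
    nanos6_wait_for(100);

  return job.result;
}
```

The same object is passed as args and as completion_args here only for brevity; the two parameters are independent.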

4.2. Task blocking

This API is composed of three functions and provides the functionality of pausing and resuming OmpSs-2 tasks at run-time. First, the nanos6_get_current_blocking_context function returns an opaque pointer, the blocking context, that is used for blocking and unblocking the current task:

void *nanos6_get_current_blocking_context(void);

The underlying implementation may or may not return the same value for repeated calls to this function. Once a blocking context has been used in one call to nanos6_block_current_task and one call to nanos6_unblock_task, it is discarded, and a new one must be obtained to perform another cycle of blocking and unblocking.

Second, the nanos6_block_current_task function blocks the execution of the current task:

void nanos6_block_current_task(void *blocking_context);

The current task will block at least until another thread calls nanos6_unblock_task with the blocking context of that task. The run-time system may choose to execute other tasks within the execution scope of this call.

Finally, the nanos6_unblock_task routine marks as unblocked a task that was previously blocked or is about to be blocked:

void nanos6_unblock_task(void *blocking_context);

This function can be called even before the target task actually blocks. In that case, the target task will not block because it has already been unblocked. However, only one call to nanos6_unblock_task may precede its matching call to nanos6_block_current_task. The return of this function does not guarantee that the blocked task has already resumed its execution; it only guarantees that it will be resumed.

Note that both nanos6_block_current_task and nanos6_unblock_task may be task scheduling points.
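
The following sketch illustrates one cycle of blocking and unblocking around an asynchronous request. The Request type, the start_request, on_request_done and wait_for_request functions, and the HAVE_NANOS6 macro are hypothetical. Without the macro, stand-ins model the blocking context as a per-thread flag, which is an assumption made only so that the sketch builds without the run-time system. The request completes immediately in this sketch, so it exercises the case in which nanos6_unblock_task precedes the matching nanos6_block_current_task.

```cpp
#include <atomic>

#ifdef HAVE_NANOS6  // hypothetical macro: define it when building with OmpSs-2
#include <nanos6.h>
#else
// Stand-ins: the blocking context is modelled as a per-thread flag and
// blocking as a spin on that flag. The real run-time system works differently.
static thread_local std::atomic<bool> context_unblocked{false};

static void *nanos6_get_current_blocking_context(void)
{
  context_unblocked.store(false);
  return &context_unblocked;
}

static void nanos6_block_current_task(void *blocking_context)
{
  auto *flag = static_cast<std::atomic<bool> *>(blocking_context);
  while (!flag->load()) { }
}

static void nanos6_unblock_task(void *blocking_context)
{
  static_cast<std::atomic<bool> *>(blocking_context)->store(true);
}
#endif

// Hypothetical asynchronous request that carries the context of its waiter
struct Request {
  void *blocking_context;
  int value;
};

// Hypothetical completion handler: publishes the result, then unblocks the task
static void on_request_done(Request *request)
{
  request->value = 7;
  nanos6_unblock_task(request->blocking_context);
}

// Hypothetical asynchronous operation; it completes immediately in this sketch
static void start_request(Request *request)
{
  on_request_done(request);
}

int wait_for_request()
{
  // Obtain a fresh context for this block/unblock cycle
  Request request{nanos6_get_current_blocking_context(), 0};
  start_request(&request);

  // Returns without blocking if the request was already completed
  nanos6_block_current_task(request.blocking_context);
  return request.value;
}
```

Obtaining the context before starting the request matters: the completion handler may run at any time after start_request, and it needs a valid context to unblock.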

4.3. Time-based task blocking

OmpSs-2 defines an API function named nanos6_wait_for that pauses the calling task for a given time in microseconds. The task stops for approximately that time and yields the CPU so that other tasks can execute in the meantime:

uint64_t nanos6_wait_for(uint64_t time_us);

The function returns the actual time slept, in microseconds, so that the calling task can make decisions based on it, for instance, to change the next sleep time.
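
As an illustration of using the returned value, the sketch below shortens the next request by the amount the previous pause overslept, so that the average period stays close to a target. The policy itself, the next_sleep_us and paced_wait functions, and the HAVE_NANOS6 macro are hypothetical and not part of OmpSs-2. Without the macro, a stand-in based on the standard library replaces nanos6_wait_for.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

#ifdef HAVE_NANOS6  // hypothetical macro: define it when building with OmpSs-2
#include <nanos6.h>
#else
// Stand-in: sleeps the calling thread and reports the measured time
static uint64_t nanos6_wait_for(uint64_t time_us)
{
  auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(std::chrono::microseconds(time_us));
  auto elapsed = std::chrono::steady_clock::now() - start;
  return static_cast<uint64_t>(
    std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count());
}
#endif

// Hypothetical policy: subtract the time overslept in the last pause from the
// target period, never going below one microsecond
uint64_t next_sleep_us(uint64_t target_us, uint64_t requested_us, uint64_t actual_us)
{
  const uint64_t min_us = 1;
  uint64_t overshoot = (actual_us > requested_us) ? actual_us - requested_us : 0;
  return (overshoot < target_us) ? std::max(min_us, target_us - overshoot) : min_us;
}

// Pauses `rounds` times around `target_us` and returns the total time slept
uint64_t paced_wait(uint64_t target_us, unsigned int rounds)
{
  uint64_t request_us = target_us;
  uint64_t total_us = 0;

  for (unsigned int i = 0; i < rounds; i++) {
    uint64_t actual_us = nanos6_wait_for(request_us);
    total_us += actual_us;
    request_us = next_sleep_us(target_us, request_us, actual_us);
  }
  return total_us;
}
```

Other policies are equally valid; the point is only that the return value gives the task the information needed to adjust its next request.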

4.4. Polling services

Warning

The original polling service API has been removed. This section explains a more flexible, usable and performant approach.

The OmpSs-2 programming model offers a mechanism to perform dynamic and periodic polling, which can be useful for third-party libraries. With it, libraries do not need an additional thread (potentially oversubscribing CPU resources) to, for instance, check asynchronous operations periodically.

We recommend spawning an independent task by calling nanos6_spawn_function (see Spawning functions) and passing the desired polling function and arguments as parameters. The polling function can then execute a loop (until the polling has to finish) in which each iteration checks the corresponding asynchronous operations and then pauses the task for a specific amount of microseconds with nanos6_wait_for (see Time-based task blocking). During the pause, the run-time system can use the CPU to execute other ready tasks.

This flexible mechanism allows changing the polling frequency dynamically. Moreover, the user may take into account the actual time slept (returned by nanos6_wait_for) or the workload to recompute the required frequency.

We show an example in C++ below:

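#include <atomic>
#include <cstdint>
#include <nanos6.h>

// User-provided function that checks the pending asynchronous operations
void check_operations_completion();

std::atomic<bool> _mustFinish(false);
std::atomic<bool> _finished(false);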

void polling_function(void *args)
{
  uint64_t time_us = 500; // 500 microseconds

  while (!_mustFinish.load()) {
    // Call the actual polling user function
    check_operations_completion();

    // Pause the polling task
    uint64_t actual_us = nanos6_wait_for(time_us);

    // Change time_us if needed...
  }
}

void polling_complete(void *args)
{
  // Notify the spawner when the polling task has completed
  _finished.store(true);
}

int main()
{
  nanos6_spawn_function(polling_function, NULL,
                        polling_complete, NULL,
                        "polling task");

  // ...

  // Notify the polling task to stop
  _mustFinish.store(true);

  while (!_finished.load()) {
    // Yield the CPU while we wait
    nanos6_wait_for(100);
  }
}

4.5. Task external events

The task external events API provides the functionality of decoupling the release of task data dependencies from the end of the task body and binding it to the completion of external events. A task fully completes (i.e., releases its dependencies) once it finishes the execution of its task body and all external events bound during its execution are fulfilled. This allows binding the completion of a task to the finalization of any asynchronous operation.

This API is composed of three functions, which follow the style of the blocking API. First, the nanos6_get_current_event_counter function returns an opaque pointer that is used to increase/decrease the number of events:

void *nanos6_get_current_event_counter(void);

This function provides implementation-specific data that we call the event counter throughout the rest of this text. A task can bind its completion to new external events by calling the following function:

void nanos6_increase_current_task_event_counter(
  void *event_counter,
  unsigned int increment
);

This function atomically increases the number of pending external events of the calling task. The first parameter event_counter must be the event counter of the invoking task, while the second parameter increment is the number of external events to be bound. The presence of pending events in a task prevents the release of its dependencies, even if the task has finished its execution. Note that only the task itself can bind its external events.

Then, the task itself, or another task or external thread, can fulfill the events of the task by calling the following function:

void nanos6_decrease_task_event_counter(
  void *event_counter,
  unsigned int decrement
);

This function atomically decreases the number of pending external events of a given task. The first parameter event_counter is the event counter of the target task, while the second parameter decrement is the number of completed external events to subtract. Once the number of pending external events of a task becomes zero and the task has finished its execution, the task can release its dependencies.

Note that all external events of a task can complete before it actually finishes its execution. In this case, the task will release its dependencies as soon as it finishes its execution. Otherwise, the last call to nanos6_decrease_task_event_counter that makes the counter become zero will trigger the release of the dependencies.

Notice that the user is responsible for not fulfilling events that the target task has not yet bound.
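
The sketch below shows one way to respect this rule: the task binds all its events before starting any asynchronous operation, so no completion handler can fulfill an event that is not yet bound. The AsyncOp type, the start_op, on_op_done and start_ops_and_bind functions, and the HAVE_NANOS6 macro are hypothetical. Without the macro, stand-ins model the event counter as a per-thread atomic integer, which is an assumption made only so that the sketch builds without the run-time system.

```cpp
#include <atomic>

#ifdef HAVE_NANOS6  // hypothetical macro: define it when building with OmpSs-2
#include <nanos6.h>
#else
// Stand-ins: the event counter is modelled as a per-thread atomic integer
static thread_local std::atomic<unsigned int> pending_events{0};

static void *nanos6_get_current_event_counter(void)
{
  return &pending_events;
}

static void nanos6_increase_current_task_event_counter(void *event_counter,
                                                       unsigned int increment)
{
  static_cast<std::atomic<unsigned int> *>(event_counter)->fetch_add(increment);
}

static void nanos6_decrease_task_event_counter(void *event_counter,
                                               unsigned int decrement)
{
  static_cast<std::atomic<unsigned int> *>(event_counter)->fetch_sub(decrement);
}
#endif

// Hypothetical asynchronous operation that remembers which task to notify
struct AsyncOp {
  void *event_counter;
};

// Hypothetical completion handler: fulfills one event of the owning task
static void on_op_done(AsyncOp *op)
{
  nanos6_decrease_task_event_counter(op->event_counter, 1);
}

// Hypothetical asynchronous start; it completes immediately in this sketch
static void start_op(AsyncOp *op)
{
  on_op_done(op);
}

// Intended to be called from inside a task body
unsigned int start_ops_and_bind(AsyncOp *ops, unsigned int count)
{
  void *counter = nanos6_get_current_event_counter();

  // Bind every event first, so none can be fulfilled before it is bound
  nanos6_increase_current_task_event_counter(counter, count);

  for (unsigned int i = 0; i < count; i++) {
    ops[i].event_counter = counter;
    start_op(&ops[i]);
  }
  return count;
}
```

With the real run-time system, the task body may return right after start_ops_and_bind; its dependencies would be released once the last operation completes.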

4.6. NUMA Awareness

OmpSs-2 offers a simple API to mitigate “NUMA effects” in NUMA systems (adverse effects caused by accessing data in remote NUMA nodes). With this API, users can allocate memory in NUMA systems using various policies. By leveraging this API, the runtime library can apply NUMA-aware scheduling techniques, which benefit performance.

The API is composed of the following functions, which allow users to allocate and free memory, and decide how the data is distributed across NUMA nodes:

  • nanos6_numa_alloc_block_interleave is used to allocate data. Users should replace their regular allocation methods (malloc, mmap, new…) with this method. It allocates size bytes, interleaving the allocation in blocks of size block_size among the NUMA nodes specified by the mask:

void *nanos6_numa_alloc_block_interleave(uint64_t size, const nanos6_bitmask_t *mask, uint64_t block_size);

  • nanos6_numa_free is used to release previously allocated memory. ptr is the pointer to free, which must have been returned by a previous allocation call:

void *nanos6_numa_free(void *ptr);

  • nanos6_bitmask_set_wildcard(bitmask, wildcard) sets a bitmask depending on the wildcard, which can be NUMA_ALL, to represent all the available nodes in the system, NUMA_ALL_ACTIVE, to represent the nodes in which all the cores are assigned to the process mask, and NUMA_ANY_ACTIVE, to represent the nodes where at least one core is assigned:

void nanos6_bitmask_set_wildcard(nanos6_bitmask_t *bitmask, wildcard);

  • nanos6_count_setbits(const nanos6_bitmask_t *bitmask) returns the number of enabled bits in a bitmask. This is useful to retrieve the number of NUMA nodes available to the application.

There are other bitmask manipulation methods such as nanos6_bitmask_isbitset(bitmask, n), nanos6_bitmask_[clear/set]all(bitmask), or nanos6_bitmask_[clear/set]bit(bitmask, n), which are not discussed as their behaviour is self-explanatory.
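
To make the block-interleaved layout of nanos6_numa_alloc_block_interleave more concrete, the sketch below computes which enabled node would hold a given byte of an allocation. It assumes that blocks are assigned to the enabled nodes in round-robin order, which the text above does not state explicitly; block_count and node_slot_for_offset are hypothetical helpers and not part of the API.

```cpp
#include <cstddef>
#include <cstdint>

// Number of blocks needed to cover `size` bytes (the last one may be partial)
uint64_t block_count(uint64_t size, uint64_t block_size)
{
  return (size + block_size - 1) / block_size;
}

// Position, within the list of enabled nodes, of the node that would hold the
// byte at `offset`. Assumption: blocks are distributed in round-robin order.
std::size_t node_slot_for_offset(uint64_t offset, uint64_t block_size,
                                 std::size_t enabled_nodes)
{
  return static_cast<std::size_t>((offset / block_size) % enabled_nodes);
}
```

Under this assumption, an allocation with block_size equal to size divided by the number of enabled nodes places one contiguous chunk on each node, while a smaller block_size spreads the data more finely across them.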

Next, we show the pseudo-code of a usage example of the OmpSs-2 NUMA API, where some arrays are allocated and interleaved among NUMA nodes:

nanos6_bitmask_t bitmask;
nanos6_bitmask_set_wildcard(&bitmask, NUMA_ALL);
size_t numa_nodes = nanos6_count_setbits(&bitmask);
size_t size = N*sizeof(double);
size_t block_size = size/numa_nodes;

// Allocate vectors
double *a = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *b = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);
double *c = nanos6_numa_alloc_block_interleave(size, &bitmask, block_size);

// Initialize
for (int i = 0; i < NUM_BLOCKS; i++) {
  #pragma oss task out(a[i*TS ;TS], b[i*TS ;TS], c[i*TS ;TS])
  {
    // init a, b, c
  }
}

// Execution
for (int step = 0; step < timesteps; step++) {
  for (int i = 0; i < NUM_BLOCKS; i++) {
    #pragma oss task in(a[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // copy kernel
    }

    #pragma oss task in(c[i*TS ;TS]) out(b[i*TS ;TS])
    {
      // scale kernel
    }

    #pragma oss task in(a[i*TS ;TS], b[i*TS ;TS]) out(c[i*TS ;TS])
    {
      // add kernel
    }

    #pragma oss task in(b[i*TS ;TS], c[i*TS ;TS]) out(a[i*TS ;TS])
    {
      // triad kernel
    }
  }
}

// Release memory
nanos6_numa_free(a);
nanos6_numa_free(b);
nanos6_numa_free(c);