5.4. How to exploit NUMA (socket) aware scheduling policy using Nanos++¶

In order to use OmpSs in NUMA system we have developed a special scheduling policy, which also supports other OmpSs features as task priorities.

We have tested this policy in machines with up to 8 NUMA nodes (48 cores), where we get about 70% of the peak performance in the Cholesky factorisation. We would appreciate if you shared with us the details of the machine where you plan to use OmpSs.

Important

This scheduling policy works best with the Portable Hardware Locality (hwloc) library. Make sure you enabled it when compiling Nanos++. Check Nanos++ configure flags.

Important

Memory is asigned to nodes in pages. A whole page can only belong to a single NUMA node, thus, you must make sure memory is aligned to the page size. You can use the aligned attribute or other allocation functions.

This policy assigns tasks to threads based on either data copy information or programmer hints indicating in which NUMA node that task should run.

You must select the NUMA scheduling policy when running your application. You can do so by defining NX_SCHEDULE=socket or by using –schedule=socket in NX_ARGS. Example:

$ NX_SCHEDULE=socket ./my_application
$ NX_ARGS="--schedule=socket" ./my_application

5.4.1. Automatic NUMA node discovery¶

The NUMA scheduling policy has the ability to detect initialisation tasks and track where your data is located. This option is disabled by default and can be selected by supplying –socket-auto-detect in NX_ARGS. Like this:

$ NX_ARGS="--schedule=socket --socket-auto-detect" ./my_application

This automatic feature has a few requirements and assumptions:

First-touch NUMA policy is in effect (default in most systems).
Data is initialised by OmpSs.
Copies are enabled (either manually, using copy_in|out|inout; or copy_deps, enabled automatically by the compiler).
Initialisation tasks can only be detected if they are SMP tasks with at least one output whose produced version will be one.

Important

Initialisation tasks will be assigned to NUMA nodes so that your data is initialised in round-robin. If this does not suit you, head over to the programming hints section below.

Some examples:

// Default OmpSs configuration: copy_deps enabled by the compiler
#pragma omp task out( [N][N]block )
void init_task( float * block )
{
}

// Alternate OmpSs configuration: copy_deps manually activated
#pragma omp target device(smp) copy_deps
#pragma omp task out( [N][N]block )
void init_task( float * block )
{
}

// No dependencies defined, copies manually defined
#pragma omp target device(smp) copy_out( [N][N]block )
#pragma omp task
void init_task( float * block )
{
}

// This one will not be detected
#pragma omp task
void not_valid_init_task( float * block )
{
}

// Nor will be this one
#pragma omp task inout( [N][N]block )
void not_init_task( float * block )
{
}

The rest of the application would be like a normal OmpSs one:

#pragma omp task inout( [N][N] block )
void compute_task( float* block )
{
    for( /* loop here */ )
    {
        // Do something
    }
}

int main(int argc, char* argv[])
{

  // Allocate matrix A

  // Call init tasks
  for( int i = 0; i < nb*nb; ++i )
  {
      init_task( A+i );
  }

  // Now compute
  for( int i = 0; i < nb*nb; ++i )
  {
      compute_task( A+i );
  }

  #pragma omp taskwait
  return 0;
}

5.4.2. Using programmer hints¶

This approach provides a more direct control on where to run tasks. Data copies are not required although it has the same first touch policy requirement, and your data must be also initialised by an OmpSs task.

For instance, you probably have initialisation tasks and want to spread your data over all the NUMA nodes. You must use the function nanos_current_socket to specify in which node the following task should be executed. Example:

#pragma omp task out( [N][N]block )
void init_task( float * block )
{
  // First touch NUMA policy will assign pages in the node of the current
  // thread when they are written for the first time.
}

int main(int argc, char* argv[])
{
  int numa_nodes;
  nanos_get_num_sockets( &numa_nodes );

  // Allocate matrix A

  // Call init tasks
  for( int i = 0; i < nb*nb; ++i )
  {
      // Round-robin assignment
      nanos_current_socket( i % numa_nodes );
      init_task( A+i );
  }

#pragma omp taskwait
  return 0;
}

Now you have the data where you want it to be. Your computation tasks must be also told where to run, to minimise access to out of node memory. You can do this the same way you do for init tasks:

int main( int argc, char *argv[] )
{
  // Allocation and initialisation goes above this
  for( /* loop here */ )
  {
      // Once again you must set the socket before calling the task
      nanos_current_socket( i % numa_nodes );
      compute_task( A+i );
  }
#pragma omp taskwait
}

5.4.3. Nesting¶

If you want to use nested tasks, you don’t need to (and you should not) call nanos_current_socket() when creating the child tasks. Tasks created by another one will be run in the same NUMA node as their parent. For instance, let’s say that compute_task() is indeed nested:

#pragma omp task inout( [N][N] block )
void compute_task( float* block )
{
  for( /* loop here */ )
  {
      small_calculation( block[i] );
  }
#pragma omp taskwait
}

#pragma omp task inout( [N]data )
void small_calculation( float* data )
{
  // Do something
}

In this case the scheduling policy will run small_calculation in the same NUMA node as the parent compute_task.

5.4.4. Other¶

Deepening a bit into the internals, this scheduling policy has task queues depending on the number of nodes. Threads are expected to work only on tasks that were assigned to the NUMA node they belong to, but as there might be imbalance, we have implemented work stealing.

When using stealing (enabled by default), you must consider that:

Threads will only steal child tasks. If you have an application without nested tasks, you must change a parameter to steal from first level tasks (–socket-steal-parents).

Threads will steal only from adjacent nodes. If you have 4 NUMA nodes in your system, a thread in NUMA node #0 will only steal from nodes 1 and 2 if the distance to 1 and 2 is (for instance) 22, and the distance to node 3 is 23. This requires your system to have a valid distance matrix (you can use numactl –hardware to check it).

In order to prevent stealing from the same node, it will be performed in round robin. In the above example, the first time a thread in node 0 will steal from node 1, and the next time it will use node 2.

You can get the full list of options of the NUMA scheduling policy here Or using nanox --help.