The OmpSs Programming Model

OmpSs is an effort to integrate features from the StarSs programming model, developed at BSC, into a single programming model. In particular, our objective is to extend OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can also be understood as a set of new directives extending other accelerator-based APIs like CUDA or OpenCL. Our OmpSs environment is built on top of our Mercurium compiler and the Nanos++ runtime system.

Data dependency synchronization

Main data dependency synchronization

Asynchronous parallelism is enabled in OmpSs by the use of data dependencies between the different tasks of the program. To this end, the OpenMP task construct is extended with the in (standing for input), out (standing for output) and inout (standing for input/output) clauses. They allow the programmer to specify, for each task in the program, what data the task waits for and what data it produces, signaling its readiness to successor tasks. Note that ensuring the task really uses that data in the specified way is the programmer's responsibility.

  • If a created task has an in clause that evaluates to a given lvalue, then the task will not be eligible to run as long as a previously created task with an out clause applying to the same lvalue has not finished its execution.
  • If a created task has an out clause that evaluates to a given lvalue, then the task will not be eligible to run as long as a previously created task with an in or out clause applying to the same lvalue has not finished its execution.
  • If a created task has an inout clause that evaluates to a given lvalue, then it is considered as if it had an in clause and an out clause that evaluated to that same lvalue.

For compatibility with earlier versions of OmpSs, you can use the clauses input and output with exactly the same semantics as the clauses in and out, respectively.

Each time a new task is created, its in and out dependencies are matched against those of the existing tasks. If a dependency, either RaW, WaW or WaR, is found, the task becomes a successor of the corresponding tasks. This process creates a task dependency graph at runtime. Tasks are scheduled for execution as soon as all their predecessors in the graph have finished (which does not mean they are executed immediately), or at creation if they have no predecessors.

The following example illustrates the dependency clauses:

void foo ( int *a, int *b )
{
   for ( int i = 1; i < N; i++ ) {   /* N: problem size, assumed defined elsewhere */
      #pragma omp task in(a[i-1]) inout(a[i]) out(b[i])
         propagate(&a[i-1],&a[i],&b[i]);

      #pragma omp task in(b[i-1]) inout(b[i])
         correct(&b[i-1],&b[i]); 
   }
}

This code generates the following task graph as the loop unfolds:

The use of the data dependency clauses allows the execution of tasks from multiple loop iterations at the same time. All dependency clauses accept lvalues extended beyond those of C/C++. Two different extensions are allowed:

  • Array sections allow referring to multiple elements of an array (or of pointee data) in a single expression. There are two forms of array sections:
    • [ lower : upper ]  In this case all elements in the range of lower to upper (both included) are referenced. If no lower is specified it is assumed to be 0. If the section is applied to an array and upper is omitted it is assumed to be the last element of the array.
    • [ lower ; size ] In this case all elements in the range of lower to lower+(size-1) (both included) are referenced.
  • Shaping expressions allow pointers to be recast in order to recover the size of dimensions that may have been lost across function calls. A shaping expression is one or more [ size ] expressions before a pointer.

The following example illustrates the use of array sections (a sketch using a shaping expression and the size-based section form appears after the note below):

void sort ( int n, int *a )
{
    if ( n < small ) {        /* small: sequential cutoff, assumed defined elsewhere */
       seq_sort ( n, a );
       return;
    }

    #pragma omp task inout( a[ 0 : n/2-1 ] )
         sort ( n/2, a );
    #pragma omp task inout( a[ n/2 : n-1 ] )
         sort ( n - n/2, &a[n/2] );
    #pragma omp task inout( a[ 0 : n/2-1 ], a[ n/2 : n-1 ] )
         merge ( n/2, a, a, &a[n/2] );

    #pragma omp taskwait   /* ensure the array is fully sorted before returning */
}

Our current implementation only supports array sections that completely overlap. Implementation support for partially overlapped sections is under development.
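
The sort example uses only the [ lower : upper ] form of array sections. The following sketch (not one of the original examples; the axpy computation is merely illustrative) shows the [ lower ; size ] form together with a shaping expression:

void axpy ( int n, double alpha, double *x, double *y )
{
   /* [n] x is a shaping expression: it recovers the fact that x points to
      n contiguous elements, information the pointer type has lost.
      y[ 0 ; n ] is the size-based section form: n elements starting at 0. */
   #pragma omp task in( [n] x ) inout( y[ 0 ; n ] )
   for ( int i = 0; i < n; i++ )
      y[i] += alpha * x[i];
}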

Other data dependency synchronization

The taskwait construct is also extended with the on clause. This clause allows the current task to wait only for the tasks that produce some data, in the same way the in clause does: it suspends the current task until all previously created tasks with an out clause over the listed expressions have completed. The following example illustrates its use:

#include <stdio.h>

int compute1 ( void );

int compute2 ( void );

int main ( void )
{
   int result1,result2;

   #pragma omp task out(result1)
       result1 = compute1();
   #pragma omp task out(result2)
       result2 = compute2();

   #pragma omp taskwait on(result1)
   printf("result1 = %d\n",result1);

   #pragma omp taskwait on(result2)
   printf("result2 = %d\n",result2);

   return 0;
}

The task construct also supports the concurrent clause. The concurrent clause is a special version of the inout clause where the dependencies are computed with respect to the in, out and inout clauses of other tasks, but not with respect to other concurrent clauses. As it relaxes the synchronization between tasks, the programmer must ensure that either the tasks can be executed concurrently or additional synchronization is used. The following example shows a possible use of this clause:

void generate ( int n, int *a );

int reduce ( int n, int *a )
{
   int result = 0;
   for ( int i = 0 ; i < n ; i++ )
   {
      #pragma omp task concurrent(result)
      {
          #pragma omp atomic
              result += a[i];
      }
   }

   #pragma omp taskwait on(result)   /* wait for all the reduction tasks */
   return result;
}

int main ( void )
{
   int a[100];

   #pragma omp task out(a)
   generate(100, a);
   reduce(100, a);

   return 0;
}

While the reduce tasks can be executed concurrently with each other, they are still ordered with respect to the generate task. Because the reduce tasks can run simultaneously, the atomic construct is needed to ensure proper synchronization when updating result. Note that this example is not the best way to do an array reduction; it is shown here for illustrative purposes.
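
A more scalable alternative is to have each task reduce its own chunk of the array into a private slot, so no atomic is needed. The following is a sketch under the assumption that n is a multiple of the chunk size; reduce_blocked and CHUNK are illustrative names not taken from the text above:

#define CHUNK 25

int reduce_blocked ( int n, int *a )
{
   int partial[ n / CHUNK ];

   for ( int j = 0; j < n / CHUNK; j++ ) {
      /* Each task reads only its own chunk and writes only its own slot,
         so the tasks are independent and need no atomic update. */
      #pragma omp task in( a[ j*CHUNK ; CHUNK ] ) out( partial[j] )
      {
         int s = 0;
         for ( int i = j * CHUNK; i < ( j + 1 ) * CHUNK; i++ )
            s += a[i];
         partial[j] = s;
      }
   }

   #pragma omp taskwait   /* wait for all partial sums to be produced */

   int result = 0;
   for ( int j = 0; j < n / CHUNK; j++ )
      result += partial[j];
   return result;
}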

Asynchronous calls

The task construct is extended to allow the annotation of function declarations or definitions, in addition to structured blocks. When a function is annotated with the task construct, each invocation of that function becomes a task creation point. For example, in the following code a task is created in each iteration of the loop:

#pragma omp task
void foo ( int i );

void bar ( )
{
   for ( int i = 0; i < 10; i ++ )
       foo(i);
}

Note that only the execution of the function itself is part of the task, not the evaluation of the task arguments (which are firstprivatized). Another restriction is that the task is not allowed to have any return value; that is, the return type must be void.
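
Since the return type must be void, a result is typically communicated through a pointer parameter combined with an out dependence. The following sketch illustrates the pattern (the compute function and its body are illustrative, not from the text above):

#include <stdio.h>

#pragma omp task out( *result )
void compute ( int i, int *result )
{
   *result = i * i;   /* any computation producing the value */
}

void use ( void )
{
   int r;
   compute ( 7, &r );             /* each invocation creates a task */
   #pragma omp taskwait on( r )   /* wait until the result is ready */
   printf ( "r = %d\n", r );
}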

The target construct

To support heterogeneity, a new construct is introduced: the target construct. The intent of the target construct is to specify that a given element can be run on a set of devices. The target construct can be applied to either a task construct or a function definition. In the future we plan to allow it to work on worksharing constructs as well.

The syntax of the construct is the following:

 #pragma omp target [clauses]
  task construct  | function definition | function header

The valid clauses for the target construct are the following:

  • device - It allows the programmer to specify the devices the construct should be targeted to. If no device clause is specified, the SMP device is assumed. Currently we also support the cuda device, which allows the execution of native CUDA kernels on GPGPUs.
  • copy_in - It specifies that a set of shared data may need to be transferred to the device before the associated code can be executed.
  • copy_out - It specifies that a set of shared data may need to be transferred from the device after the associated code is executed.
  • copy_inout - This clause is a combination of copy_in and copy_out.
  • copy_deps - It specifies that if the attached construct has any dependence clauses, they will also have copy semantics (i.e., in will also be considered copy_in, out will also be considered copy_out, and inout will also be considered copy_inout).
  • implements - It specifies that the code is an alternative implementation for the target devices of the function whose name is given in the clause. This alternative can be used instead of the original when the implementation considers it appropriate (see the sketch after this list).
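
As an illustrative sketch (the saxpy function, its CUDA counterpart and this particular clause combination are assumptions, not taken from the text above), the following annotates a function task for the SMP device and provides a CUDA alternative through implements, with copy_deps giving the dependence clauses copy semantics:

#pragma omp target device(smp) copy_deps
#pragma omp task in( [n] x ) inout( [n] y )
void saxpy ( int n, float a, float *x, float *y )
{
   for ( int i = 0; i < n; i++ )
      y[i] += a * x[i];
}

/* Alternative implementation: the runtime may select this version
   instead of the SMP one when a cuda device is available. */
#pragma omp target device(cuda) copy_deps implements( saxpy )
#pragma omp task in( [n] x ) inout( [n] y )
void saxpy_cuda ( int n, float a, float *x, float *y );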

Additionally, both SMP and CUDA tasks annotated with the target construct are eligible for execution in a cluster environment in an experimental implementation. Please contact us if you are interested in using it.