3. Language description

This section describes the OmpSs language, that is, all the elements necessary to understand how an application programmed in OmpSs executes and/or behaves on a parallel architecture. OmpSs provides a simple path for users already familiar with the OpenMP programming model to write (or port) their programs to OmpSs.

This description is guided by the list of OmpSs directive constructs. Each of the following sections contains a short description of a directive, its specific syntax, and the list of its clauses (including the valid parameters for each clause and a short description of them). In addition, each section ends with a simple example showing how the directive can be used in a valid OmpSs program.

As in OpenMP, OmpSs directives in C and C++ are specified using the #pragma mechanism (provided by the base language), and in Fortran they are specified using special comments identified by a unique sentinel. The sentinel used in OmpSs (as in OpenMP) is omp. Compilers will typically ignore OmpSs directives if support is disabled or not provided.

3.1. Task construct

The programmer can specify a task using the task construct. This construct can appear inside any code block of the program and marks the following statement as a task.

The syntax of the task construct is the following:

#pragma omp task [clauses]
structured-block

The valid clauses for the task construct are:

  • private(<list>)
  • firstprivate(<list>)
  • shared(<list>)
  • depend(<type>: <memory-reference-list>)
  • <depend-type>(<memory-reference-list>)
  • reduction(<operator>: <memory-reference-list>)
  • priority(<value>)
  • if(<scalar-expression>)
  • final(<scalar-expression>)
  • label(<string>)
  • tied

The private and firstprivate clauses declare one or more list items to be private to a task (i.e. the task receives a new list item). All internal references to the original list item are replaced by references to this new list item.

List items privatized using the private clause are uninitialized when the execution of the task begins. List items privatized using the firstprivate clause are initialized with the value of the corresponding original item at task creation.

The shared clause declares one or more list items to be shared by the task (i.e. the task receives a reference to the original list item). The programmer must ensure that shared storage does not reach the end of its lifetime before all tasks referencing this storage have finished.
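
The data-sharing clauses above can be sketched as follows. This is a minimal illustration (the accumulate helper is made up for the example), in which a taskwait ensures the task has finished before the shared result is read:

```c
/* Sketch: x is captured by value at task creation (firstprivate),
   while sum is shared, so the task updates the original storage. */
int accumulate(int x) {
    int sum = 0;

    #pragma omp task firstprivate(x) shared(sum)
    {
        sum += x * 2;  /* operates on the original sum and a copy of x */
    }
    #pragma omp taskwait  /* sum is only safe to read after this point */

    return sum;
}
```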

The depend clause allows the runtime to infer additional task scheduling restrictions from the parameters it defines. These restrictions are known as dependences. The syntax of the depend clause includes a dependence type, followed by a colon and its associated list items. The valid dependence types are defined in the section “Dependence flow” in the previous chapter. In addition to this syntax, OmpSs allows the dependence type itself to be used as the name of the clause. Thus, the following code:

#pragma omp task depend(in: a,b,c) depend(out: d)

is equivalent to this one:

#pragma omp task in(a,b,c) out(d)

The reduction clause allows the task to be defined as a participant in a reduction operation. The first occurrence of a participating task defines the beginning of the reduction scope. The scope is implicitly ended by a taskwait or by a dependence over the memory reference item.
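
As a sketch (the sum_array helper is illustrative), each task below participates in a “+” reduction over res, and the taskwait implicitly ends the reduction scope:

```c
/* Sketch: every created task is a participant of the reduction over
   res; after the taskwait the combined value is available. */
int sum_array(const int *v, int n) {
    int res = 0;
    for (int i = 0; i < n; i++) {
        #pragma omp task reduction(+: res) firstprivate(i)
        res += v[i];
    }
    #pragma omp taskwait  /* implicitly ends the reduction scope */
    return res;
}
```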

If the expression of the if clause evaluates to true, the execution of the newly created task can be deferred; otherwise the current task must suspend its execution until the newly created task has completed.
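
The if clause is typically used to avoid tasking overhead for small workloads; in the following sketch the helper name and the threshold value are illustrative:

```c
#define THRESHOLD 1024  /* illustrative cut-off, not part of OmpSs */

/* Sketch: the task is only worth deferring for large inputs; for
   small ones the creating task suspends until the new task has
   completed, which behaves like an immediate execution. */
int array_sum(const int *v, int n) {
    int sum = 0;
    #pragma omp task if(n > THRESHOLD) shared(sum) firstprivate(v, n)
    {
        for (int i = 0; i < n; i++) sum += v[i];
    }
    #pragma omp taskwait
    return sum;
}
```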

If the expression of the final clause evaluates to true, the newly created task will be a final task, and all task-generating code encountered during the execution of its dynamic extent will also generate final tasks. In addition, when executing within a final task, all encountered task-generating code will execute the corresponding tasks immediately after their creation, as if they were plain routine calls. Finally, tasks created within a final task can use the data environment of their parent task.
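
A common use of final is to cut off task creation below a recursion threshold; a minimal sketch (the cut-off value is illustrative):

```c
/* Sketch: below the cut-off, final(...) stops further task creation,
   so deep recursion levels run as plain routine calls. */
int fib(int n) {
    if (n < 2) return n;
    int a, b;

    #pragma omp task shared(a) final(n < 10)
    a = fib(n - 1);
    #pragma omp task shared(b) final(n < 10)
    b = fib(n - 2);
    #pragma omp taskwait

    return a + b;
}
```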

The tied clause defines a new task scheduling restriction for the newly created task: once a thread begins the execution of the task, the task becomes tied to that thread. If the task suspends its execution at a task scheduling point, only the same thread (i.e. the thread to which the task is tied) may resume its execution.

The label clause defines a string literal that performance and debugger tools can use to identify the task in a more human-readable way.
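
A label might be attached as in the following sketch (the task body and the label text are illustrative); the clause has no effect on execution:

```c
/* Sketch: "square_task" identifies this task in traces and debugger
   output instead of an anonymous identifier. */
int square(int x) {
    int r;
    #pragma omp task label("square_task") shared(r) firstprivate(x)
    r = x * x;
    #pragma omp taskwait
    return r;
}
```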

The following C code shows an example of creating tasks using the task construct:

float x = 0.0;
float y = 0.0;
float z = 0.0;

int main() {

  #pragma omp task
  do_computation(x);

  #pragma omp task
  {
    do_computation(y);
    do_computation(z);
  }

  return 0;
}

When the control flow reaches the #pragma omp task construct, a new task instance is created. However, when the program reaches return 0, the previously created tasks may not have been executed yet by the OmpSs run-time.

The task construct is extended to allow the annotation of function declarations or definitions in addition to structured blocks. When a function is annotated with the task construct, each invocation of that function becomes a task creation point. The following C code is an example of how task functions are used:

extern void do_computation(float a);
#pragma omp task
extern void do_computation_task(float a);

float x = 0.0;
int main() {
   do_computation(x); //regular function call
   do_computation_task(x); //this will create a task
   return 0;
}

The invocation of do_computation_task inside the main function creates an instance of a task. As in the previous example, we cannot guarantee that the task has been executed before the execution of main finishes.

Note that only the execution of the function itself is part of the task, not the evaluation of the task arguments. Another restriction is that the task is not allowed to have any return value; that is, the return type must be void.
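
Since the return type must be void, results are typically passed through output parameters. The following sketch (the helper names are made up) also shows a taskwait guaranteeing completion before the result is read:

```c
#pragma omp task
void add_task(int a, int b, int *res);

/* Sketch: every call to add_task creates a task; the result is
   written through the output parameter res. */
void add_task(int a, int b, int *res) {
    *res = a + b;
}

int add(int a, int b) {
    int r;
    add_task(a, b, &r);   /* creates a task */
    #pragma omp taskwait  /* ensures the task has completed */
    return r;
}
```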

3.2. Target construct

To support heterogeneity a new construct is introduced: the target construct. The intent of the target construct is to specify that a given element can be run on a set of devices. The target construct can be applied to either a task construct or a function definition. In the future we plan to allow it on worksharing constructs as well.

The syntax of the target construct is the following:

#pragma omp target [clauses]
task-construct  | function-definition | function-header

The valid clauses for the target construct are the following:

  • device(target-device) - It specifies the devices on which the construct should be targeted. If no device clause is specified, the SMP device is assumed. Currently the CUDA device is also supported, which allows the execution of native CUDA kernels on GPGPUs.
  • copy_in(list-of-variables) - It specifies that a set of shared data may need to be transferred to the device before the associated code can be executed.
  • copy_out(list-of-variables) - It specifies that a set of shared data may need to be transferred from the device after the associated code is executed.
  • copy_inout(list-of-variables) - This clause is a combination of copy_in and copy_out.
  • copy_deps - It specifies that if the attached construct has any dependence clauses, these will also have copy semantics (i.e., in will also be considered copy_in, out will also be considered copy_out, and inout copy_inout).
  • implements(function-name) - It specifies that the code is an alternate implementation, for the given target devices, of the function named in the clause. This alternate may be used instead of the original if the runtime system considers it appropriate.
  • shmem(size_t) - It specifies the amount of shared memory the runtime will allocate for CUDA and OpenCL kernels. This shared memory must be handled by the user in the kernel code.
  • ndrange(<syntax>) - It specifies the thread hierarchy used to execute the kernel in the device. It may be expressed by means of two alternative syntaxes:
    • ndrange(n, G1,…, Gn, L1,…,Ln ) - The ‘n’ parameter determines the number of dimensions (i.e., 1, 2 or 3), and the ‘Gx’ and ‘Lx’ are sequences of scalars determining the global and local sizes respectively. There are as many ‘G’ and ‘L’ values as dimensions (e.g., ndrange(2, 1024, 1024, 128, 128) will create a thread hierarchy of 1024x1024 elements, grouped in blocks of 128x128 elements).
    • ndrange(n, G[n], L[n] ) - The ‘n’ parameter determines the number of dimensions (i.e., 1, 2 or 3), and the ‘G’ and ‘L’ vectors contain as many elements as dimensions (e.g., ndrange(2, Global, Local), where ‘Global’ is an int[ ] = {1024, 1024} and ‘Local’ is an int[ ] = {128, 128}, will create a thread hierarchy of 1024x1024 elements, grouped in blocks of 128x128 elements).
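
As a sketch of how these clauses combine, a task can be targeted at the SMP device with copy semantics derived from its dependences. The array-section syntax is assumed to follow the “Dependence flow” chapter, and the function names and problem size are illustrative:

```c
#define N 4  /* illustrative problem size */

/* Sketch: copy_deps gives the in/out dependences copy semantics, so
   a is transferred to the device before execution and b back after. */
#pragma omp target device(smp) copy_deps
#pragma omp task in(a[0;N]) out(b[0;N])
void scale_task(const int *a, int *b, int factor);

void scale_task(const int *a, int *b, int factor) {
    for (int i = 0; i < N; i++)
        b[i] = a[i] * factor;
}
```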

Additionally, in an experimental implementation, both SMP and CUDA tasks annotated with the target construct are eligible for execution in a cluster environment. Please contact us if you are interested in using it.

Restrictions:

  • At most one device clause may appear in the target construct.
  • At most one shmem clause may appear in the target construct.
  • At most one implements clause may appear in the target construct.
  • At most one ndrange clause may appear in the target construct.

3.3. Loop construct

When a task encounters a loop construct it starts creating tasks for each of the chunks into which the loop’s iteration space is divided. Programmers can choose among different schedule policies to divide the iteration space.

The syntax of the loop construct is the following:

#pragma omp for [clauses]
loop-block

The valid clauses for the loop construct are the following:

  • schedule(schedule-policy[, chunk-size]) - It specifies one of the valid partition policies and, optionally, the chunk-size used to divide the iteration space. Valid schedule policies are the following:

    • dynamic - the loop is divided among the team’s threads in tasks of chunk-size granularity. Tasks are assigned as threads request them: once a thread finishes the execution of one of these tasks, it requests another one. The default chunk-size is 1.
    • guided - the loop is divided as the executing threads request chunks. The chunk size is proportional to the number of unassigned iterations, so chunks start out larger and become smaller as the loop execution progresses. A chunk will never be smaller than the chunk-size parameter (except for the last chunk).
    • static - the loop is divided into chunks of size chunk-size. Chunks are assigned to the team’s threads in a round-robin fashion. If no chunk-size is provided, the whole iteration space is divided into number-of-threads chunks of the same size (or approximately the same size if number-of-threads does not evenly divide the number of iterations).
  • nowait - When this option is specified, the encountering task can immediately proceed with the code following the loop without waiting for all the tasks created to execute the loop’s chunks.

The following C code shows an example of loop distribution:

float x[10];
int main() {
#pragma omp for schedule(static)
   for (int i = 0; i < 10; i++) {
      do_computation(x[i]);
   }
   return 0;
}

3.4. Taskwait construct

Apart from implicit synchronization (task dependences), OmpSs also offers mechanisms that allow users to synchronize task execution. The taskwait construct is a stand-alone directive (with no associated code block) and specifies a wait on the completion of all direct descendant tasks.

The syntax of the taskwait construct is the following:

#pragma omp taskwait [clauses]

The valid clauses for the taskwait construct are the following:

  • on(list-of-variables) - It specifies waiting only for a subset (not all) of the direct descendant tasks. A taskwait with an on clause only waits for those tasks that refer to any of the variables appearing in the list.

The on clause allows waiting only on the tasks that produce some data, in the same way as the in clause. It suspends the current task until all previous tasks with an out dependence over the expression have completed. The following example illustrates its use:

#include <stdio.h>

int compute1 (void);

int compute2 (void);

int main()
{
  int result1, result2;

#pragma omp task out(result1)
  result1 = compute1();

#pragma omp task out(result2)
  result2 = compute2();

#pragma omp taskwait on(result1)
  printf("result1 = %d\n",result1);

#pragma omp taskwait on(result2)
  printf("result2 = %d\n",result2);

  return 0;
}

3.5. Taskyield directive

The taskyield directive specifies that the current task can be suspended and the runtime scheduler is allowed to schedule a different task. The taskyield directive explicitly includes a task scheduling point.

The syntax of the taskyield directive is the following:

#pragma omp taskyield

The taskyield directive has no related clauses.

In the following example we can see how to use the taskyield directive:

void compute ( void ) {
   int a=0,b=0;

   #pragma omp task shared(a)
   { a++;}

   #pragma omp taskyield

   #pragma omp task shared(b)
   { b++; }

   #pragma omp taskwait
}

int main() {
   #pragma omp task
   compute();

   #pragma omp taskwait
   return 0;
}

When encountering the taskyield directive, the runtime system may decide between continuing the execution of the compute task (i.e. the current task) or beginning the execution of the a++ task (if it has not been executed yet).

3.6. Atomic construct

The atomic construct ensures that the following expression is executed atomically. The runtime system will use native machine mechanisms to guarantee atomic execution. If there is no native mechanism that guarantees atomicity for a given expression (e.g. a function call), a regular critical section will be used to implement the atomic construct.

The syntax of the atomic construct is the following:

#pragma omp atomic
expression-statement

The atomic construct has no related clauses.
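
A minimal sketch (the counting helper is illustrative): concurrent tasks increment a shared counter, and the atomic construct prevents lost updates:

```c
/* Sketch: without the atomic construct, concurrent increments of the
   shared counter could be lost. */
int count_up(int n) {
    int counter = 0;
    for (int i = 0; i < n; i++) {
        #pragma omp task shared(counter)
        {
            #pragma omp atomic
            counter++;
        }
    }
    #pragma omp taskwait
    return counter;
}
```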

3.7. Critical construct

The critical construct allows programmers to specify regions of code that will be executed in mutual exclusion. The associated region is executed by a single thread at a time; other threads wait at the beginning of the critical section until no thread in the team is executing it.

The syntax of the critical construct is the following:

#pragma omp critical
structured-block

The critical construct has no related clauses.
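
A minimal sketch (the stack-filling helper is illustrative): the critical region serializes an update to shared state that involves more than one statement:

```c
/* Sketch: the critical region protects the two-step update (store
   plus increment) so concurrent tasks cannot corrupt it. */
int push_all(int *stack, int n) {
    int top = 0;
    for (int i = 0; i < n; i++) {
        #pragma omp task shared(stack, top) firstprivate(i)
        {
            #pragma omp critical
            {
                stack[top] = i;
                top++;
            }
        }
    }
    #pragma omp taskwait
    return top;
}
```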

3.8. Declare reduction construct

Users can define their own reduction-identifier using the declare reduction directive. After declaring the UDR, the reduction-identifier can be used in a reduction clause. The syntax of this directive is the following:

#pragma omp declare reduction(reduction-identifier : type-list : combiner-expr) [initializer(init-expr)]
where:
  • reduction-identifier is the identifier of the reduction being declared
  • type-list is a list of types
  • combiner-expr is the expression that specifies how to combine the partial results. Two predefined identifiers can be used in this expression: omp_out and omp_in. The omp_out identifier represents the result of the combination, whereas omp_in represents the input data.
  • init-expr is the expression that specifies how to initialize the private copies. Two predefined identifiers can also be used in this expression: omp_priv and omp_orig. The omp_priv identifier represents a private copy, whereas omp_orig represents the original variable involved in the reduction.

The following example shows how to declare a UDR and how to use it:

#define N 100

struct C {
    int x;
};

void reducer(struct C* out, struct C* in) {
    (*out).x += (*in).x;
}

void init(struct C v[N]); //initializes the input array

#pragma omp declare reduction(my_UDR : struct C : reducer(&omp_out,&omp_in)) initializer(omp_priv = {0})

int main() {
    struct C res = { 0 };
    struct C v[N];

    init(v);

    for (int i = 0; i < N; ++i) {
#pragma omp task reduction(my_UDR : res) in(v) firstprivate(i)
        {
           res.x += v[i].x;
        }
    }
#pragma omp taskwait
    return 0;
}