The OmpSs Programming Model
OmpSs is an effort to integrate features from the StarSs programming model developed by BSC into a single programming model. In particular, our objective is to extend OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can also be understood as new directives extending other accelerator based APIs like CUDA or OpenCL. Our OmpSs environment is built on top of our Mercurium compiler and Nanos++ runtime system.
Asynchronous parallelism is enabled in OmpSs by the use data-dependencies between the different tasks of the program. To support heterogeneity a new construct is introduced: the target construct.
In OmpSs the task construct also allows the annotation of function declarations or definitions in addition to structured-blocks. When a function is annotated with the task construct each invocation of that function becomes a task creation point. Note that only the execution of the function itself is part of the task not the evaluation of the task arguments (which are firstprivatized). Another restriction is that the task is not allowed to have any return value, that is, the return must be void.
The task construct allows to express data-dependencies using in (standing for input, equivalent to OpenMP depend(in:var-list) ), out (standing for output, equivalent to OpenMP depend(out:var-list) ) and inout (standing for input/output, equivalent to OpenMP depend(inout:var-list) ) clauses to this end. They allow to specify for each task in the program what data a task is waiting for and signaling is readiness. Note, that whether the task really uses that data in the specified way its the programmer responsibility. Each time a new task is created its in and out dependencies are matched against of those of existing tasks. If a dependency, either RaW, WaW or WaR, is found the task becomes a successor of the corresponding tasks. This process creates a task dependency graph at runtime. Tasks are scheduled for execution as soon as all their predecessor in the graph have finished (which does not mean they are executed immediately) or at creation if they have no predecessors.
The task construct is extended with the concurrent clause. The concurrent clause is an special version of the inout clause where the dependencies are computed with respect to in, out and inout but not with respect to other concurrent clauses. As it relaxes the synchronization between tasks the programmer must ensure that either the task can executed concurrently or that additional synchronization is used. While the reduce tasks can be executed concurrently between them they are still ordered with respect to the generate task but as the reduce tasks can run simultaneously the atomic construct needs to be used to ensure proper synchronization.
The use of the data dependency clauses allow the execution of tasks from the multiple iterations at the same time. All dependency clauses allow extended lvalues from those of C/C++. Two different extensions are allowed: array sections allow to refer to multiple elements of an array (or pointee data) in single expression and shaping expressions allow to recast pointers into recover the size of dimensions that could have been lost across calls.
The taskwait construct is also extended with the on clause. This clause allows to wait only on the tasks that produces some data in the same way as in clause. It suspends the current task until all previous tasks with an out over the expression are completed.
The intent of the target construct is to specify that a given element can be run in a set of devices. The target construct can be applied to either a task construct or a function definition. In the future we will allow to allow it to work on worksharing constructs.