1. Introduction

OmpSs-2 is a programming model composed of a set of directives and library routines that can be used in conjunction with a high level programming language in order to develop concurrent applications. This programming model is an effort to integrate features from the StarSs programming model family, developed by the Programming Models group of the Computer Sciences department at Barcelona Supercomputing Center (BSC), into a single programming model.

OmpSs-2 extends the tasking model of OmpSs/OpenMP to support both task nesting and fine-grained dependences across different nesting levels, which enables the effective parallelization of applications using a top-down methodology.

Tasks are the elementary unit of work which represents a specific instance of an executable code. Dependences let the user annotate the data flow of the program, this way at run-time this information can be used to determine if the parallel execution of two tasks may cause data races.

The goal of OmpSs-2 is to provide a productive environment to develop applications for modern High-Performance Computing (HPC) systems. Two concepts add to make OmpSs-2 a productive programming model: performance and ease of use. Programs developed in OmpSs-2 must be able to deliver a reasonable performance when compared to other programming models targeting the same architecture(s). Ease of use is a concept difficult to quantify but OmpSs-2 has been designed using principles that have been praised by their effectiveness in that area.

In particular, one of our most ambitious objectives is to extend the OpenMP programming model with new directives, clauses and/or API services or general features to better support asynchronous data-flow parallelism and heterogeneity. These are, currently, the most prominent features introduced in the OmpSs-2 programming model:

  • Lifetime of task data environment: A task is completed once the last statement of its body is executed. It becomes deeply completed when also all its children become deeply completed. The data environment of a task, which includes all the variables captured when the task is created, is preserved until the task is deeply completed. Notice that the stack of the thread that is executing the task is NOT part of the task data environment.

  • Nested dependency domain connection: Incoming dependences of a task propagate to its childrean as if the task did not exist. When a task finishes, its outgoing dependences are replaced by those generated by its children

  • Early release of dependences: By default, once a task is completed it will release all the dependences that are not included on any unfinished descendant task. If the wait clause is specified in the task construct, all its dependences will be released at once when the task becomes deeply completed.

  • Weak dependences: The weakin/weakout clauses specify potential dependences only required by descendants tasks. These annotations do not delay the execution of the task.

  • Native offload API: A new asynchronous API to execute OmpSs-2 kernels on a specified set of CPUs from any kind of application, including Java, Python, R, etc.

  • Task Pause/Resume API: A new API that can be used to programmatically suspend and resume the execution of a task. It is currently used to improve the interoperability and performance of hybrid MPI and OmpSs-2 applications.

1.1. Reference implementation

The reference implementation of OmpSs-2 is based on the Mercurium source-to-source compiler and the Nanos6 Runtime Library:

  • The Mercurium source-to-source compiler provides the necessary support for transforming the high-level directives into a parallelized version of the application.

  • The Nanos6 run-time system library provides the services to manage all the parallelism in the user-application, including task creation, synchronization and data movement, and provide support for resource heterogeneity.

1.2. A bit of history

OmpSs-2 is the second generation of the OmpSs programming model. The name originally comes from two other programming models: OpenMP and StarSs. The design principles of these two programming models constitutes the fundamental ideas used to conceive the OmpSs philosophy.

OmpSs takes from OpenMP its viewpoint of providing a way to, starting from a sequential program, produce a parallel version of the same by introducing annotations in the source code. This annotations do not have an explicit effect in the semantics of the program, instead, they allow the compiler to produce a parallel version of it. This characteristic feature allows the users to parallelize applications incrementally. Starting from the sequential version, new directives can be added to specify the parallelism of different parts of the application. This has an important impact on the productivity that can be achieved by this philosophy. Generally when using more explicit programming models the applications need to be redesigned in order to implement a parallel version of the application, the user is responsible of how the parallelism is implemented. A direct consequence of this is the fact that the maintenance effort of the source code increases when using a explicit programming model, tasks like debugging or testing become more complex.

StarSs, or Star SuperScalar, is a family of programming models that also offer implicit parallelism through a set of compiler annotations. It differs from OpenMP in some important areas. StarSs uses a different execution model, thread-pool where OpenMP implements fork-join parallelism. StarSs also includes features to target heterogeneous architectures through leveraging native kernels implementation while OpenMP targets accelerator support through direct compiler code generation. Finally StarSs offers asynchronous parallelism as the main mechanism of expressing parallelism whereas OpenMP only started to implement it since its version 3.0. Also StarSs offers task synchronization by means of dependences as the main mechanism of expressing the task execution order, enabling the look-ahead instantiation of task whereas OpenMP included it since its version 4.0.

StarSs raises the bar on how much implicitness is offered by the programming model. When programming using OpenMP, the developer first has to define which regions of the program will be executed on parallel, then he or she has to express how the inner code has to be executed by the threads forming the parallel region, and finally it may be required to add directives to synchronize the different parts of the parallel execution. StarSs simplifies part of this process by providing an environment where parallelism is implicitly created from the beginning of the execution, thus the developer can omit the declaration of parallel regions. The definition of parallel code is used using the concept of tasks, which are pieces of code which can be executed asynchronously in parallel. When it comes to synchronizing the different parallel regions of a StarSs applications, the programming model also offers a dependency mechanism which allows to express the correct order in which individual tasks must be executed to guarantee a proper execution. This mechanism enables a much richer expression of parallelism by StarSs than the one achieved by OpenMP, this makes StarSs applications to exploit the parallel resources more efficiently.

OmpSs tries to be the evolution that OpenMP needs in order to be able to target newer architectures. For this, OmpSs takes key features from OpenMP but also new ideas that have been developed in the StarSs family of programming models.

1.3. Influence in OpenMP

Many OmpSs and StarSs ideas have been introduced into the OpenMP programming model. The next figure summarizes our contributions in the standard:

../_images/omp_contributions.png

Starting from the version 3.0, released on May 2008, OpenMP included the support for asynchronous tasks. The reference implementation, which was used to measure the benefits that tasks provided to the programming model, was developed at BSC and consisted on the Nanos4 run-time library and the Mercurium source-to-source compiler.

Our next contribution, which was included in OpenMP 4.0 (released on July 2013), was the extension of the tasking model to support data dependences, one of the strongest points of OmpSs that allows to define fine-grain synchronization among tasks. This feature was tested using Mercurium source-to-source compiler and the Nanos++ RTL.

In OpenMP 4.5 (released on November 2015, and current version), the tasking model was extended with the taskloop construct using Nanos++ as the reference implementation to validate these ideas. BSC also contributed to version 4.5 adding the priority clause to task and taskloop constructs.

Upcopming versions of OpenMP will include: task reductions, as a way to use tasks as the participants of a reduction operation; iterators and multidependences, mechanism to specify a variable number of dependences for a task instance; taskwait dependences, to relax the set of descendants we need to wait for before proceeding with the taskwait construct; and the mutexinoutset (a.k.a. commutative), a new type of dependence to support mutually exclusive tasks. The Nanos++ RTL was also used to validate the new interface included in the OpenMP standard with respect third party tools.

1.4. Glossary of terms

ancestor tasks

The set of tasks formed by your parent task and all of its ancestor tasks.

base language

The base language is the programming language in which the program is written.

child task

A task is a child of the task which encounters its task generating code.

construct

A construct is an executable directive and its associated statement. Unlike the OpenMP terminology, we will explicitly refer to the lexical scope of a constructor or the dynamic extent of a construct when needed.

data environment

The data environment is formed by the set of variables associated with a given task.

declarative directive

A directive that annotates a declarative statement.

dependence

Is the relationship existing between a predecessor task and one of its successor tasks.

descendant tasks

The descendant tasks of a given task is the set of all its child tasks and the descendant tasks of them.

directive

In C/C++ a #pragma preprocessor entity.

In Fortran a comment which follows a given syntax.

dynamic extent

The dynamic extent is the interval between establishment of the execution entity and its explicit disestablishment. Dynamic extent always obey to a stack-like discipline while running the code and it includes any code in called routines as well as any implicit code introduced by the OmpSs-2 implementation.

executable directive

A directive that annotates an executable statement.

expression

Is a combination of one or more data components and operators that the base program language may understand.

function task

In C, an task declared by a task directive at file-scope that comes before a declaration that declares a single function or comes before a function-definition. In both cases the declarator should include a parameter type list.

In C++, a task declared by a task directive at namespace-scope or class-scope that comes before a function-definition or comes before a declaration or member-declaration that declares a single function.

In Fortran, a task declared by a task directive that comes before a the SUBROUTINE statement of an external-subprogram, internal-subprogram or an interface-body.

inline task

In C/C++ an explicit task created by a task directive in a statement inside a function-definition.

In Fortran, an explicit task created by a task directive in the executable part of a program unit.

lexical scope

The lexical scope is the portion of code which is lexically (i.e. textually) contained within the establishing construct including any implicit code lexically introduced by the OmpSs-2 implementation. The lexical scope does not include any code in called routines.

outline tasks

An outlined task is also know as a function tasks.

predecessor task

A task becomes predecessor of another task(s) when there are dependence(s) between this task and the other ones (i.e. its successor tasks. That is, there is a restriction in the order the run-time system must execute them: all predecessor tasks must complete before a successor task can be executed.

parent task

The task that encountered a task generating code is the parent task of the new created task(s).

ready task pool

Is the set of tasks ready to be executed (i.e. they are not blocked by any condition).

structured block

An executable statement with a single entry point (at the top) and a single exit point (at the bottom).

successor task

A task becomes successor of another task(s) when there are dependence(s) between these tasks (i.e. its predecessors tasks) and itself. That is, there is a restriction in the order the run-time system must execute them: all the predecessor task must complete before a successor task can be executed.

task

A task is the minimum execution entity that can be managed independently by the run-time scheduler (although a single task may be executed at different phases according with its task scheduling points). Tasks in OmpSs-2 can be created by any task generating code.

task dependency graph

The set of tasks and its relationships (successor / predecessor) with respect the correspondant scheduling restrictions.

task generating code

The code which execution create a new task. In OmpSs-2 it can occurs when encountering a task construct, a loop construct or when calling a routine annotated with a task declarative directive.

task scheduling point

The different points in wich the run-time system may suspend the execution of the current task and execute a different one.