<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
	<channel>
		<title>Programming Models @ BSC</title>
		<link>https://pm.bsc.es</link>
		<atom:link href="https://pm.bsc.es/feed.xml" rel="self" type="application/rss+xml" />
		
			<item>
				<title>Tutorial at HiPEAC2019: Heterogeneous Parallel Programming with OmpSs</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Place&lt;/em&gt;: Valencia, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: January 23rd, 2019 (associated with the HiPEAC 2019 Conference)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell and Xavier Teruel&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;abstract&quot;&gt;Abstract&lt;/h4&gt;

&lt;p&gt;OmpSs is a task-based programming model developed at BSC that we use as a forerunner for OpenMP.
Like OpenMP, it is based on compiler directives.
It is the base platform on which we have developed OpenMP tasking, support for dependences, priorities and task reductions, and support for heterogeneous devices; our latest addition is support for application acceleration on FPGAs.&lt;/p&gt;

&lt;p&gt;In this tutorial attendees will learn how to program with OmpSs and how to use its support for heterogeneous architectures.
We will introduce the basic OmpSs concepts of task-based parallelism on SMP cores and then quickly move to the support for heterogeneous devices.
OmpSs supports offloading tasks to a variety of accelerators, including CUDA and OpenCL GPUs, as well as FPGAs through vendor High-Level Synthesis (HLS) tools.
OmpSs eases programming because it leverages existing OpenCL and CUDA kernels without the burden of dealing with data copies to/from the devices.
Data copies are triggered automatically by the OmpSs runtime, based on the task data dependence annotations.
In the FPGA environment with HLS, plain C/C++ applications can offload kernels to the FPGA.&lt;/p&gt;

&lt;p&gt;OmpSs for FPGA devices is the result of our work in the AXIOM, EuroEXA and LEGaTO European projects.
We will also show how the same directives are used to outline code that can be compiled for and run on FPGA devices, and analyzed with the BSC analysis tool Paraver thanks to the internal FPGA tracing facilities.&lt;/p&gt;

&lt;p&gt;The tutorial will include two laboratory sessions.
Attendees will receive student accounts on our Minotauro machine (Intel-based, with NVIDIA GPUs) and several exercises to complete online (Cholesky, matrix multiplication, n-body, 3D stencil, merge sort, histogram…), so as to learn in detail the OmpSs support for both SMP and heterogeneous architectures.&lt;/p&gt;

&lt;h4 id=&quot;agenda&quot;&gt;Agenda&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;08:30h – Introduction to the OmpSs Programming Model: basic directives and support for heterogeneous systems&lt;/li&gt;
  &lt;li&gt;10:00h – Coffee Break&lt;/li&gt;
  &lt;li&gt;10:30h – Hands-on: OmpSs with CUDA and OpenCL support&lt;/li&gt;
  &lt;li&gt;12:00h – Lunch&lt;/li&gt;
  &lt;li&gt;13:30h – OmpSs with support for FPGA devices&lt;/li&gt;
  &lt;li&gt;15:00h – Coffee Break&lt;/li&gt;
  &lt;li&gt;15:30h – Hands-on: OmpSs@FPGA&lt;/li&gt;
  &lt;li&gt;17:00h – End of the tutorial&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Tue, 09 Oct 2018 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2018/10/09/tutorial-hipeac2019.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2018/10/09/tutorial-hipeac2019.html</guid>
			</item>
		
			<item>
				<title>Parallel Programming Workshop</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: October 18-20th, 2018&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Teruel &amp;amp; Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;description&quot;&gt;Description&lt;/h2&gt;

&lt;p&gt;The objectives of this course are to understand the fundamental concepts
underpinning the message-passing and shared-memory programming models. The course
covers the two most widely used programming models: MPI for distributed-memory
environments and OpenMP for shared-memory architectures. It also presents
the main tools developed at BSC to collect information on and analyze the execution of
parallel applications: Paraver and Extrae. Moreover, it lays the basic
foundations of task decomposition and parallelization inhibitors,
using Tareador, a tool to analyze potential parallelism and dependences.&lt;/p&gt;

&lt;h2 id=&quot;agenda&quot;&gt;Agenda&lt;/h2&gt;

&lt;p&gt;Day 1 (Wednesday) 2:00 pm – 5:30 pm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Shared-memory programming models, OpenMP fundamentals&lt;/li&gt;
  &lt;li&gt;Parallel regions and work sharing constructs&lt;/li&gt;
  &lt;li&gt;Synchronization mechanisms in OpenMP&lt;/li&gt;
  &lt;li&gt;Practical: heat diffusion in OpenMP&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Day 2 (Thursday) 9:30 am – 1:00 pm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Tasking in OpenMP 3.0/4.0/4.5&lt;/li&gt;
  &lt;li&gt;Programming using a hybrid MPI/OpenMP approach&lt;/li&gt;
  &lt;li&gt;Practical: multisort in OpenMP and hybrid MPI/OpenMP&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Day 2 (Thursday) 2:00 pm – 5:30 pm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Parallware: guided parallelization&lt;/li&gt;
  &lt;li&gt;Practical session with Parallware examples&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Day 3 (Friday) 9:30 am – 1:00 pm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Introduction to the OmpSs programming model&lt;/li&gt;
  &lt;li&gt;Practical: heat equation example and divide-and-conquer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Day 3 (Friday) 2:00 pm – 5:30 pm:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Programming using a hybrid MPI/OmpSs approach&lt;/li&gt;
  &lt;li&gt;Practical: heat equation example and divide-and-conquer&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;external-references&quot;&gt;External references&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.bsc.es/education/training/patc-courses&quot;&gt;https://www.bsc.es/education/training/patc-courses&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Mon, 27 Aug 2018 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2018/08/27/parallel-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2018/08/27/parallel-programming.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Parallel Programming with OmpSs (at PACT 2018, Cyprus)</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Place&lt;/em&gt;: Limassol, CYPRUS&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: November 4th, 2018 (associated with PACT 2018)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;abstract&quot;&gt;Abstract&lt;/h4&gt;

&lt;p&gt;OmpSs is a task-based programming model developed at BSC that we use as a forerunner for OpenMP.
Like OpenMP, it is based on compiler directives.
It is the base platform on which we have developed OpenMP tasking, support for dependences, priorities and task reductions, and support for heterogeneous devices; our latest addition is support for application acceleration on FPGAs.&lt;/p&gt;

&lt;p&gt;In this tutorial attendees will learn how to program with OmpSs and how to use its support for heterogeneous architectures.
We will introduce the basic OmpSs concepts of task-based parallelism on SMP cores and then quickly move to the support for heterogeneous devices.
OmpSs supports offloading tasks to a variety of accelerators, including CUDA and OpenCL GPUs, as well as FPGAs through vendor High-Level Synthesis (HLS) tools.
OmpSs eases programming because it leverages existing OpenCL and CUDA kernels without the burden of dealing with data copies to/from the devices.
Data copies are triggered automatically by the OmpSs runtime, based on the task data dependence annotations.
In the FPGA environment with HLS, plain C/C++ applications with OmpSs annotations can offload kernels to the FPGA.&lt;/p&gt;

&lt;p&gt;OmpSs for FPGA devices is the result of our work in the AXIOM, EuroEXA and LEGaTO European projects.
We will also show how the same directives are used to outline code that can be compiled for and run on FPGA devices, and analyzed with the BSC analysis tool Paraver thanks to the internal FPGA tracing facilities.&lt;/p&gt;

&lt;p&gt;The tutorial will include two laboratory sessions.
Attendees will receive student accounts on our Minotauro machine (Intel-based, with NVIDIA GPUs) and several exercises to complete online (Cholesky, matrix multiplication, n-body, 3D stencil, merge sort, histogram…), so as to learn in detail the OmpSs support for both SMP and heterogeneous architectures.&lt;/p&gt;
</description>
				<pubDate>Thu, 21 Jun 2018 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2018/06/21/heterogeneous-pact-2018.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2018/06/21/heterogeneous-pact-2018.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS 2018</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Place&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: July 20th, 2018&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell and Xavier Teruel&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The ninth edition of the Programming and Tuning Massively Parallel Systems + Artificial Intelligence summer school (PUMPS+AI) is aimed at enriching the skills of researchers, graduate students and teachers with cutting-edge techniques and hands-on experience in developing applications for many-core processors with massively parallel computing resources, such as GPU accelerators.&lt;/p&gt;

&lt;h4 id=&quot;important-dates&quot;&gt;Important dates&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Applications due: May 31, 2018&lt;/li&gt;
  &lt;li&gt;Notification of acceptance: June 12, 2018&lt;/li&gt;
  &lt;li&gt;Hackathon day: July 15, 2018 (only for selected applicants)&lt;/li&gt;
  &lt;li&gt;Summer school: July 16-20, 2018&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More info&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://pumps.bsc.es/2018/&quot;&gt;PUMPS Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Thu, 10 May 2018 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2018/05/10/tutorial-pumps.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2018/05/10/tutorial-pumps.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Programming on GPUs with MPI + OmpSs</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: BSC, Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: May 9-10th, 2018&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell and Xavier Teruel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tutorial will motivate the need for portable, efficient programming models that put less pressure on program developers while still delivering good performance on clusters, both with and without GPUs.&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;More specifically, the tutorial will introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems, and it will demonstrate how to use MPI/OmpSs to incrementally parallelize and optimize: 1) MPI applications on clusters of SMPs, and 2) CUDA kernels leveraged with OmpSs on clusters of GPUs.&lt;/p&gt;

&lt;h4 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Good knowledge of C/C++&lt;/li&gt;
  &lt;li&gt;Basic knowledge of CUDA/OpenCL&lt;/li&gt;
  &lt;li&gt;Basic knowledge of Paraver/Extrae&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;learning-outcomes&quot;&gt;Learning Outcomes&lt;/h4&gt;

&lt;p&gt;The students who finish this course will be able to develop benchmarks and simple applications with the MPI/OmpSs programming model to be executed in clusters of GPUs.&lt;/p&gt;

&lt;h4 id=&quot;agenda&quot;&gt;Agenda&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Day 1
    &lt;ul&gt;
      &lt;li&gt;09.00h – Introduction to OmpSs&lt;/li&gt;
      &lt;li&gt;11.30h – OmpSs single node programming hands-on&lt;/li&gt;
      &lt;li&gt;13.00h – Lunch Break&lt;/li&gt;
      &lt;li&gt;14.00h – More on OmpSs: GPU/CUDA programming&lt;/li&gt;
      &lt;li&gt;15.00h – OmpSs single node programming hands-on with GPUs&lt;/li&gt;
      &lt;li&gt;17.30h – Adjourn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Day 2
    &lt;ul&gt;
      &lt;li&gt;09.00h – Introduction to MPI/OmpSs&lt;/li&gt;
      &lt;li&gt;10.00h – MPI/OmpSs hands-on&lt;/li&gt;
      &lt;li&gt;13.00h – Lunch Break&lt;/li&gt;
      &lt;li&gt;14.00h – Free hands-on session&lt;/li&gt;
      &lt;li&gt;17.30h – Adjourn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;external-links&quot;&gt;External links&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.bsc.es/education/training/patc-courses/patc-heterogeneous-programming-gpus-mpi-ompss&quot;&gt;BSC Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Tue, 03 Apr 2018 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2018/04/03/heterogeneous-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2018/04/03/heterogeneous-programming.html</guid>
			</item>
		
			<item>
				<title>Release DLB 2.0</title>
				<description>&lt;p&gt;New release of the DLB library (2.0) offering a new asynchronous API for runtime systems and the novel DROM module that provides a public API to manage the computational resources assigned to running processes&lt;/p&gt;

&lt;figure class=&quot;post-figure-small post-figure-centered&quot;&gt;
  &lt;img src=&quot;/files/dlb/logo.png&quot; alt=&quot;DLB logo&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The Computer Sciences department at BSC is proud to announce the release of DLB 2.0. The Dynamic Load Balance (DLB) library improves the load balance of parallel applications at runtime. “Fixing load imbalance in applications is not only important to improve a single application’s performance; it is also key to boosting the utilization of supercomputing systems,” says Marta Garcia, lead scientist of the DLB tool.&lt;/p&gt;

&lt;p&gt;“DLB has a tremendous potential to address for free the imbalance issues in hybrid applications that would otherwise require significant refactoring efforts,” says Jesús Labarta, Computer Sciences department director.&lt;/p&gt;

&lt;p&gt;DLB is already helping to improve load balance in several European projects, such as the Human Brain Project, HPC-Europa3 and Mont-Blanc 3, and it is used by applications from a wide range of domains, including neuroscience, computational mechanics, molecular dynamics, cosmological simulations and climate modeling.&lt;/p&gt;

&lt;p&gt;“DLB is our preferred tool to mitigate the imbalances that occur in Alya executions. These imbalances appear spontaneously or come from inaccurate load distributions. DLB solves both problems at runtime, acting only when necessary, making our code much more resilient on modern HPC systems. We save millions of CPU hours every year by using DLB,” says Ricard Borrell, senior researcher at BSC’s CASE department.&lt;/p&gt;

&lt;p&gt;The newest version of DLB (2.0) offers a new module, DROM, which allows external entities (e.g. a job scheduler or resource manager) to request a change of resources for a running process. The DLB library is thus now organized in two modules, LeWI and DROM, which are independent of each other but can work in coordination.&lt;/p&gt;

&lt;p&gt;More specifically, in this new version, apart from several bug fixes, we have introduced the following new features:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;DROM (Dynamic Resource Ownership Management) module.&lt;/strong&gt;&lt;br /&gt;
    DROM offers an API for external entities (e.g. a job scheduler or resource manager…) that allows removing CPUs from a running process in order to assign them to a new or an existing process.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Asynchronous version of LeWI (Lend When Idle) load balancing algorithm.&lt;/strong&gt;&lt;br /&gt;
    The LeWI load balancing algorithm can now work in either synchronous or asynchronous mode. The new asynchronous mode allows the runtime and DLB to interact without polling.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;New DLB public API&lt;/strong&gt;&lt;br /&gt;
    - Clearer, with unified naming&lt;br /&gt;
    - More exhaustive, supporting more use cases&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Callback system for parallel runtimes&lt;/strong&gt;&lt;br /&gt;
    The callback system allows registering functions as callbacks for DLB actions, providing a friendly interface for integrating new parallel runtimes with DLB.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Support for interoperability of multiple runtimes&lt;/strong&gt;&lt;br /&gt;
    DLB provides support for several parallel runtimes within the same process sharing computational resources.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;New mechanism to set DLB options through the DLB_ARGS environment variable&lt;/strong&gt;&lt;br /&gt;
    All options passed to DLB are now contained in a single environment variable, which simplifies the configuration of DLB and the detection of errors when setting options.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
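&lt;p&gt;As an illustrative sketch of the last point (the option names --lewi and --drom follow the DLB documentation, but the exact set available in your installation should be checked against the DLB docs; the preload path and application name below are placeholders), configuring DLB now amounts to setting a single variable before launching the application:&lt;/p&gt;

```shell
# As of DLB 2.0, all DLB options travel in one environment variable.
# Option names here (--lewi, --drom) follow the DLB documentation;
# the preload path and application name are placeholders.
export DLB_ARGS="--lewi --drom"

# DLB reads DLB_ARGS at startup, e.g. when the library is preloaded:
# LD_PRELOAD=/path/to/libdlb.so mpirun -np 4 ./my_hybrid_app
echo "$DLB_ARGS"
```

&lt;p&gt;Keeping every option in one variable also means a typo is reported once, at startup, instead of surfacing as scattered per-option failures.&lt;/p&gt;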

&lt;p&gt;You can freely download DLB (distributed as open source under the LGPL-3.0 license) and get more information on &lt;a href=&quot;/dlb&quot;&gt;DLB’s website&lt;/a&gt;.&lt;/p&gt;
</description>
				<pubDate>Thu, 21 Dec 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/12/21/release-dlb.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/12/21/release-dlb.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS 2017</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; June 30th, 2017&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Xavier Martorell &amp;amp; Xavier Teruel&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;CONTENTS&lt;/h4&gt;
&lt;p&gt;The eighth edition of the Programming and Tuning Massively Parallel Systems summer school (PUMPS) is aimed at enriching the skills of researchers, graduate students and teachers with cutting-edge techniques and hands-on experience in developing applications for many-core processors with massively parallel computing resources, such as GPU accelerators.&lt;/p&gt;

&lt;h4 id=&quot;important-dates&quot;&gt;IMPORTANT DATES&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Applications due: April 30, 2017 (Due to space limitations, early application is strongly recommended. You may also be asked to attend an online prerequisite training on basic CUDA programming before joining PUMPS.)&lt;/li&gt;
  &lt;li&gt;Notification of acceptance: May 15, 2017&lt;/li&gt;
  &lt;li&gt;Summer school: June 26-30, 2017&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;more-info&quot;&gt;MORE INFO&lt;/h4&gt;
&lt;p&gt;Summer School &lt;a href=&quot;http://bcw.ac.upc.edu/PUMPS2017/&quot;&gt;website&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Wed, 19 Apr 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/04/19/ompss-tutorial-at-pumps-2017.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/04/19/ompss-tutorial-at-pumps-2017.html</guid>
			</item>
		
			<item>
				<title>Basic Programming of Multicore and Many-Core Processors for Image and Video Processing</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; June 22-23th, 2017&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Juan Gómez Luna&lt;/p&gt;

&lt;h4 id=&quot;objectives&quot;&gt;OBJECTIVES&lt;/h4&gt;
&lt;p&gt;This course is delivered by the GPU Center of Excellence (GCOE) awarded by NVIDIA to the Barcelona Supercomputing Center (BSC) in association with the Universitat Politecnica de Catalunya (UPC) as a Severo Ochoa workshop.&lt;br /&gt;
The course will present the parallelization of several widely known image and video processing algorithms, such as color space conversion, Gaussian filtering and histogram calculation.&lt;br /&gt;
Current processors can be classified into multicore and many-core processors, depending on the number of available cores. Among many-core processors, GPUs are the most popular. Both multicores and many-cores are suitable for exploiting the parallelism inherent in many applications, and can thereby speed them up to meet requirements such as real-time performance in image and video processing. The aim of this course is to serve as an initial approach to parallel programming for those who may be interested in parallelizing the applications they work with. By taking advantage of the data parallelism available in the case-study algorithms, we will introduce OpenMP for programming multicore processors and CUDA for GPUs. The course will be eminently practical, with seven hands-on labs.&lt;/p&gt;

&lt;h4 id=&quot;agenda&quot;&gt;AGENDA&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Day 1
    &lt;ul&gt;
      &lt;li&gt;09:00 Parallel computing: OpenMP and CUDA&lt;/li&gt;
      &lt;li&gt;10:45 Coffee break&lt;/li&gt;
      &lt;li&gt;11:15 Hands-on lab 1: Brightness adjustment&lt;/li&gt;
      &lt;li&gt;13:00 Lunch break&lt;/li&gt;
      &lt;li&gt;14:00 Hands-on lab 2: RGB to YUV conversion&lt;/li&gt;
      &lt;li&gt;15:45 Coffee break&lt;/li&gt;
      &lt;li&gt;16:15 Hands-on lab 3: Gaussian filter&lt;/li&gt;
      &lt;li&gt;18:00 Adjourn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Day 2
    &lt;ul&gt;
      &lt;li&gt;09:00 Hands-on lab 4: Your own filter&lt;/li&gt;
      &lt;li&gt;10:45 Coffee break&lt;/li&gt;
      &lt;li&gt;11:15 Hands-on lab 5: Histogram calculation&lt;/li&gt;
      &lt;li&gt;13:00 Lunch break&lt;/li&gt;
      &lt;li&gt;14:00 Hands-on lab 6: Edge Detection&lt;/li&gt;
      &lt;li&gt;15:45 Coffee break&lt;/li&gt;
      &lt;li&gt;16:15 Hands-on lab 7: Asynchronous transfers&lt;/li&gt;
      &lt;li&gt;18:00 Adjourn&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;more-info&quot;&gt;MORE INFO&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://www.bsc.es/education/training/cuda-training/basic-programming-multicore-and-many-core-processors-image-and-video-processing&quot;&gt;https://www.bsc.es/education/training/cuda-training/basic-programming-multicore-and-many-core-processors-image-and-video-processing&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 18 Apr 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/04/18/basic-programming-of-multicore-and-many-core-processors-for-image-and-video-processing.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/04/18/basic-programming-of-multicore-and-many-core-processors-for-image-and-video-processing.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Programming on GPUs with MPI + OmpSs</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; May 10-11th, 2017&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Xavier Martorell &amp;amp; Xavier Teruel&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;CONTENTS&lt;/h4&gt;
&lt;p&gt;The tutorial will motivate the need for portable, efficient programming models that put less pressure on program developers while still delivering good performance on clusters, both with and without GPUs.&lt;br /&gt;
More specifically, the tutorial will:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems&lt;/li&gt;
  &lt;li&gt;Demonstrate how to use MPI/OmpSs to incrementally parallelize/optimize:
    &lt;ul&gt;
      &lt;li&gt;MPI applications on clusters of SMPs, and&lt;/li&gt;
      &lt;li&gt;CUDA kernels leveraged with OmpSs on clusters of GPUs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;more-info&quot;&gt;MORE INFO&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://events.prace-ri.eu/event/540/&quot;&gt;https://events.prace-ri.eu/event/540/&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Mon, 10 Apr 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/04/10/heterogeneous-programming-on-gpus-with-mpi-ompss.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/04/10/heterogeneous-programming-on-gpus-with-mpi-ompss.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Programming on GPUs with MPI + OmpSs</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: May 10-11th, 2017&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The tutorial will motivate the need for portable, efficient programming models that put less pressure on program developers while still delivering good performance on clusters, both with and without GPUs.&lt;/p&gt;

&lt;p&gt;More specifically, the tutorial will:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems&lt;/li&gt;
  &lt;li&gt;Demonstrate how to use MPI/OmpSs to incrementally parallelize/optimize:
    &lt;ul&gt;
      &lt;li&gt;MPI applications on clusters of SMPs, and&lt;/li&gt;
      &lt;li&gt;CUDA kernels leveraged with OmpSs on clusters of GPUs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;agenda&quot;&gt;Agenda&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Day 1
    &lt;ul&gt;
      &lt;li&gt;09.00h – Introduction to OmpSs&lt;/li&gt;
      &lt;li&gt;11.30h – OmpSs single node programming hands-on&lt;/li&gt;
      &lt;li&gt;13.00h – Lunch Break&lt;/li&gt;
      &lt;li&gt;14.00h – More on OmpSs: GPU/CUDA programming&lt;/li&gt;
      &lt;li&gt;15.00h – OmpSs single node programming hands-on with GPUs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Day 2
    &lt;ul&gt;
      &lt;li&gt;09.00h – Introduction to MPI/OmpSs&lt;/li&gt;
      &lt;li&gt;10.00h – MPI/OmpSs hands-on&lt;/li&gt;
      &lt;li&gt;13.00h – Lunch Break&lt;/li&gt;
      &lt;li&gt;14.00h – Free hands-on session&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;external-references&quot;&gt;External references&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.bsc.es/education/training/patc-courses/patc-course-heterogeneous-programming-gpus-mpi-ompss-2&quot;&gt;BSC Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 10 Mar 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/03/10/heterogeneous-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/03/10/heterogeneous-programming.html</guid>
			</item>
		
			<item>
				<title>PATC Introduction to OpenACC</title>
				<description>&lt;p&gt;&lt;strong&gt;Venue:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; April 27-28, 2017&lt;br /&gt;
&lt;strong&gt;Speaker:&lt;/strong&gt; Antonio J. Peña&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;p&gt;This is an expansion of the topic “OpenACC and other approaches to GPU computing” covered in this year’s and last year’s editions of the Introduction to CUDA Programming course. This course is delivered by the GPU Center of Excellence (GCOE) awarded by NVIDIA to the Barcelona Supercomputing Center (BSC) in association with the Universitat Politecnica de Catalunya (UPC). It provides a very good introduction to the PUMPS Summer School, run jointly with NVIDIA on June 26-30, also at Campus Nord, Barcelona; for further information visit the school website. You may also be interested in our new course: Basic Programming of Multicore and Many-Core Processors for Image and Video Processing.&lt;/p&gt;

&lt;h4 id=&quot;agenda&quot;&gt;Agenda&lt;/h4&gt;
&lt;p&gt;Agenda to be announced shortly: &lt;a href=&quot;https://www.bsc.es/education/training/patc-courses/patc-introduction-openacc/agenda&quot;&gt;https://www.bsc.es/education/training/patc-courses/patc-introduction-openacc/agenda&lt;/a&gt;&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://www.bsc.es/education/training/patc-courses/patc-introduction-openacc&quot;&gt;https://www.bsc.es/education/training/patc-courses/patc-introduction-openacc&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Tue, 07 Mar 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/03/07/patc-introduction-to-openacc.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/03/07/patc-introduction-to-openacc.html</guid>
			</item>
		
			<item>
				<title>Introduction to CUDA Programming</title>
				<description>&lt;div&gt;&lt;strong&gt;Venue&lt;/strong&gt;: Barcelona, SPAIN&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: April 18-21, 2017&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speaker&lt;/strong&gt;: Manuel Ujaldon&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;The aim of this course is to provide students with knowledge and hands-on experience in developing applications software for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it has the ability to complete more than 64 arithmetic operations per clock cycle. Many commercial offerings from NVIDIA, AMD, and Intel already offer such levels of concurrency. Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;AGENDA&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Day 1 (April 18th)&lt;/div&gt;
&lt;div&gt;09:00 The GPU hardware: many-core NVIDIA developments&lt;/div&gt;
&lt;div&gt;10:45 Coffee break&lt;/div&gt;
&lt;div&gt;11:15 CUDA Programming: Threads, blocks, kernels, grids&lt;/div&gt;
&lt;div&gt;13:00 Lunch break&lt;/div&gt;
&lt;div&gt;14:00 CUDA Tools: Compiling, debugging, profiling, occupancy calculator&lt;/div&gt;
&lt;div&gt;15:45 Coffee break&lt;/div&gt;
&lt;div&gt;16:15 CUDA Examples (1): VectorAdd, Stencil&lt;/div&gt;
&lt;div&gt;18:00 Adjourn&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Day 2 (April 19th)&lt;/div&gt;
&lt;div&gt;09:00 CUDA Examples (2): Matrix Multiply. Assorted optimizations&lt;/div&gt;
&lt;div&gt;10:45 Coffee break&lt;/div&gt;
&lt;div&gt;11:15 CUDA Examples (3): Dynamic parallelism, Hyper-Q, unified memory&lt;/div&gt;
&lt;div&gt;13:00 Lunch break&lt;/div&gt;
&lt;div&gt;14:00 Hands-on Lab 1&lt;/div&gt;
&lt;div&gt;15:45 Coffee break&lt;/div&gt;
&lt;div&gt;16:15 Hands-on Lab 2&lt;/div&gt;
&lt;div&gt;18:00 Adjourn&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Day 3 (April 20th)&lt;/div&gt;
&lt;div&gt;09:00 Inside Pascal: Multiprocessors, stacked memory, NV-link&lt;/div&gt;
&lt;div&gt;10:45 Coffee break&lt;/div&gt;
&lt;div&gt;11:15 OpenACC and other approaches to GPU computing&lt;/div&gt;
&lt;div&gt;13:00 Lunch break&lt;/div&gt;
&lt;div&gt;14:00 Hands-on Lab 3&lt;/div&gt;
&lt;div&gt;15:45 Coffee break&lt;/div&gt;
&lt;div&gt;16:15 Hands-on Lab 4&lt;/div&gt;
&lt;div&gt;18:00 Adjourn&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Day 4 (April 21st)&lt;/div&gt;
&lt;div&gt;09:00 Hands-on Lab 5&lt;/div&gt;
&lt;div&gt;10:45 Coffee break&lt;/div&gt;
&lt;div&gt;11:15 Hands-on Lab 6&lt;/div&gt;
&lt;div&gt;13:00 Lunch break&lt;/div&gt;
&lt;div&gt;14:00 Hands-on Lab 7&lt;/div&gt;
&lt;div&gt;15:45 Coffee break&lt;/div&gt;
&lt;div&gt;16:15 Free Hands-on Lab&lt;/div&gt;
&lt;div&gt;18:00 Adjourn&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;https://www.bsc.es/education/training/patc-courses/introduction-cuda-programming&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Tue, 07 Mar 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/03/07/introduction-to-cuda-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/03/07/introduction-to-cuda-programming.html</guid>
			</item>
		
			<item>
				<title>GPU Programming Models and their Combinations</title>
				<description>&lt;div&gt;&lt;strong style=&quot;font-size: 16.26px;&quot;&gt;Venue&lt;/strong&gt;&lt;span style=&quot;font-size: 16.26px;&quot;&gt;: Cordoba, SPAIN&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: April 21, 2017&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speaker&lt;/strong&gt;: Antonio J. Peña&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;The aim of this course is to provide students with knowledge of, and hands-on experience in, developing application software for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it can complete more than 64 arithmetic operations per clock cycle. Many commercial offerings from NVIDIA, AMD, and Intel already provide such levels of concurrency. Programming these processors effectively requires in-depth knowledge of parallel programming principles, as well as of the parallelism models, communication models, and resource limitations of these processors. The target audience of the course is students who want to develop exciting applications for these processors, as well as those who want to develop programming tools and future implementations for them.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;While OpenACC focuses on coding productivity and portability, CUDA enables extracting the maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware task-based programming model that can be combined with CUDA and, recently, with OpenACC as well. Using OpenACC we can start benefiting from GPU computing, obtaining high coding productivity and solid performance improvements. We can then fine-tune the critical application parts by developing CUDA kernels to hand-optimize the problem. OmpSs combined with either OpenACC or CUDA enables seamless task parallelism that leverages all the devices in the system.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Web site: http://www.uco.es/~el1goluj/cuda_teaching_center.html&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Tue, 07 Mar 2017 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2017/03/07/gpu-programming-models-and-their-combinations.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2017/03/07/gpu-programming-models-and-their-combinations.html</guid>
			</item>
		
			<item>
				<title>UAM Course: Parallel Programming</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Madrid, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: November 2-4th, 2016&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Teruel &amp;amp; Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;CONTENTS&lt;/h4&gt;

&lt;p&gt;The objectives of the course are to understand and practice fundamental
concepts of parallel and distributed programming with message passing (MPI) and
shared memory (OpenMP). Some tools useful for debugging, such as Valgrind,
Paraver, and Tareador, will also be presented.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;MORE INFO&lt;/h4&gt;
</description>
				<pubDate>Sun, 02 Oct 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/10/02/uam-tutorial.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/10/02/uam-tutorial.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at Splash 2016</title>
				<description>&lt;div&gt;&lt;span style=&quot;font-size: 16.26px;&quot;&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: Amsterdam, NETHERLANDS&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: November 1, 2016&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speaker&lt;/strong&gt;: Jaume Bosch&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;This tutorial presents task-based programming models such as OmpSs and OpenMP 4.0. OmpSs is a programming model developed at the Barcelona Supercomputing Center (BSC). Like OpenMP, it is based on compiler directives. It is the base platform where BSC has developed OpenMP tasking, support for dependences, priorities, and task reductions, and it also includes support for heterogeneous devices.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;http://2016.splashcon.org/event/seps2016-tutorial-task-based-programming-for-embedded-multicore-systems&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Sat, 01 Oct 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/10/01/ompss-tutorial-at-splash-2016.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/10/01/ompss-tutorial-at-splash-2016.html</guid>
			</item>
		
			<item>
				<title>Parallel Programming Workshop</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: October 26-28th, 2016&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Teruel &amp;amp; Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The objectives of this course are to understand the fundamental concepts
supporting message-passing and shared-memory programming models. The course
covers the two widely used programming models: MPI for distributed-memory
environments and OpenMP for shared-memory architectures. It also presents
the main tools developed at BSC to obtain information about and analyze the
execution of parallel applications, Paraver and Extrae. Moreover, it sets the
basic foundations of task decomposition and parallelization inhibitors, using
Tareador, a tool to analyze potential parallelism and dependences.&lt;/p&gt;

&lt;h4 id=&quot;external-references&quot;&gt;External references&lt;/h4&gt;
</description>
				<pubDate>Fri, 26 Aug 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/08/26/parallel-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/08/26/parallel-programming.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Parallel Programming with OmpSs</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Haifa, ISRAEL&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: September 15th, 2016&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell &amp;amp; Xavier Teruel&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;This tutorial will present the OmpSs programming model, based on both
teaching and laboratory sessions. OmpSs is a programming model developed at BSC
that we use as a forerunner for OpenMP. Like OpenMP, it is based on compiler
directives. It is the base platform where we have developed OpenMP tasking,
support for dependences, priorities, and task reductions, and it also includes
support for heterogeneous devices.&lt;/p&gt;

&lt;p&gt;We will introduce the OmpSs basic concepts related to task-based parallelism
on the SMP cores and then quickly move to the support for heterogeneous
devices. OmpSs leverages existing OpenCL and CUDA kernels without the burden of
dealing with data copies to/from the devices. Data copies are triggered
automatically by the OmpSs runtime, based on the task dependence
annotations.&lt;/p&gt;

&lt;p&gt;OmpSs is currently being extended for FPGA devices, in the context of the AXIOM
European Project. We will also show how the same directives are being used to
outline code that can be compiled and run on FPGA devices.&lt;/p&gt;

&lt;p&gt;The tutorial will include two laboratory sessions. We will provide attendees
with student accounts on our MinoTauro machine (Intel-based with NVIDIA GPUs)
and several exercises to be completed online (Cholesky, matrix multiplication,
n-body, 3D stencil, merge sort, histogram…), to better learn the details of the
OmpSs support for both SMP and heterogeneous architectures.&lt;/p&gt;

&lt;h4 id=&quot;agenda&quot;&gt;Agenda&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;Session 1. Introduction to OmpSs (8.00 - 10:00)
    &lt;ul&gt;
      &lt;li&gt;OmpSs tasking (fundamentals of OmpSs)&lt;/li&gt;
      &lt;li&gt;Task dependences (execution model)&lt;/li&gt;
      &lt;li&gt;Additional concurrent, commutative clauses&lt;/li&gt;
      &lt;li&gt;Development environment: Mercurium compiler and Nanos++&lt;/li&gt;
      &lt;li&gt;Hands-on on simple benchmarks: how to compile and execute applications&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Session 2. OmpSs support for heterogeneous architectures (10:30 - 12:15)
    &lt;ul&gt;
      &lt;li&gt;OmpSs target extensions&lt;/li&gt;
      &lt;li&gt;Automatic data transfers, software cache&lt;/li&gt;
      &lt;li&gt;Leveraging CUDA and OpenCL kernels&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Session 3. Hands-on (13:50 - 15:35)
    &lt;ul&gt;
      &lt;li&gt;Parallelizing applications on heterogeneous architectures with OmpSs&lt;/li&gt;
      &lt;li&gt;Cholesky, matrix multiplication, nbody, 3d-stencil, merge-sort, histogram…&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Session 4. FPGA support in OmpSs (15:50 - 17:00)
    &lt;ul&gt;
      &lt;li&gt;Exploiting parallelism on FPGA devices&lt;/li&gt;
      &lt;li&gt;Integrating the development environment with support for FPGAs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;external-references&quot;&gt;External references&lt;/h4&gt;
</description>
				<pubDate>Fri, 15 Jul 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/07/15/heterogeneous-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/07/15/heterogeneous-programming.html</guid>
			</item>
		
			<item>
				<title>New OmpSs release 16.06</title>
				<description>&lt;p&gt;&lt;strong&gt;Mercurium compiler:&lt;/strong&gt; 2.0.0&lt;br /&gt;
&lt;strong&gt;Nanos++ RT Library:&lt;/strong&gt; 0.10&lt;br /&gt;
&lt;strong&gt;Download this version &lt;a href=&quot;/ompss-downloads&quot;&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Programming Models team is glad to announce the release of the new stable version of OmpSs, based on the latest Mercurium and on Nanos++ 0.10.&lt;/p&gt;

&lt;p&gt;Apart from several bug fixes in both tools, the main features introduced in this version are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;New cluster support
    &lt;ul&gt;
      &lt;li&gt;Execute OmpSs programs transparently on top of a distributed memory system (CUDA &amp;amp; OpenCL devices are also supported).&lt;/li&gt;
      &lt;li&gt;Network communication is implemented using GASNet, which provides support for modern High Performance networking technologies.&lt;/li&gt;
      &lt;li&gt;Several optimization mechanisms maximize the performance of applications running on a cluster: a data-affinity scheduling policy distributes work so as to minimize network activity, and task pre-send allows the OmpSs runtime to overlap communication with computation.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Support for non-contiguous data
    &lt;ul&gt;
      &lt;li&gt;Tasks can reference non-contiguous, multi-dimensional data, which eases the implementation of some applications.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Thread manager
    &lt;ul&gt;
      &lt;li&gt;The Thread Manager module dynamically controls the number of worker threads needed for a given workload &lt;a href=&quot;https://pm.bsc.es/ftp/ompss/doc/user-guide/run-programs-plugin-threadmanager.html&quot;&gt;(info)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Task reductions
    &lt;ul&gt;
      &lt;li&gt;The task construct has been extended with support for the reduction clause &lt;a href=&quot;https://pm.bsc.es/ftp/ompss/doc/spec/programming_model.html#task-reductions&quot;&gt;(info)&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;Improved support for user-defined reductions &lt;a href=&quot;https://pm.bsc.es/ftp/ompss/doc/spec/language.html#declare-reduction-construct&quot;&gt;(info)&lt;/a&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Enhanced support for &lt;a href=&quot;/dlb&quot;&gt;Dynamic Load Balancing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have any doubt or question, feel free to contact us at: pm-tools [at] bsc.es&lt;/p&gt;
</description>
				<pubDate>Thu, 16 Jun 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/06/16/new-ompss-release-16-06.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/06/16/new-ompss-release-16-06.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: July 15th, 2016&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Martorell &amp;amp; Xavier Teruel&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The seventh edition of the Programming and Tuning Massively Parallel Systems
summer school (PUMPS) is aimed at enriching the skills of researchers, graduate
students and teachers with cutting-edge techniques and hands-on experience in
developing applications for many-core processors with massively parallel
computing resources, such as GPU accelerators.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More info&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://bcw.ac.upc.edu/PUMPS2016/&quot;&gt;PUMPS Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Sun, 15 May 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/05/15/pumps.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/05/15/pumps.html</guid>
			</item>
		
			<item>
				<title>OmpSs Workshop at AXIOM project</title>
				<description>&lt;div&gt;&lt;strong style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Workshop&lt;/strong&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;: Siena, ITALY&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: May 31st, 2016&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speaker&lt;/strong&gt;: Javier Bueno&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;DESCRIPTION&lt;/span&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Build your own supercomputer with OmpSs, UDOO and Arduino.&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Makers are revolutionary people who consider the physical world just another brick of a house always under construction. With this workshop we aim to guide these hackers-at-heart through the setup and configuration of a cluster of UDOO QUAD boards powered by AXIOM’s OmpSs, a programming model developed to build clusters in a simple way and take them to the supercomputer level.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Project&apos;s &lt;a href=&quot;http://www.axiom-project.eu/workshop-build-your-own-supercomputer-with-ompss-udoo-and-arduino/&quot;&gt;website&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Fri, 29 Apr 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/04/29/ompss-workshop-at-axiom-project.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/04/29/ompss-workshop-at-axiom-project.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Programming on GPUs with MPI + OmpSs</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; May 11-12th, 2016&lt;br /&gt;
&lt;strong&gt;Speaker:&lt;/strong&gt; Xavier Martorell&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;p&gt;The tutorial will motivate the audience on the need for portable, efficient programming models that put less pressure on program developers while still getting good performance for clusters and clusters with GPUs.&lt;/p&gt;

&lt;p&gt;More specifically, the tutorial will:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems&lt;/li&gt;
  &lt;li&gt;Demonstrate how to use MPI/OmpSs to incrementally parallelize/optimize:
    &lt;ul&gt;
      &lt;li&gt;MPI applications on clusters of SMPs, and&lt;/li&gt;
      &lt;li&gt;MPI applications leveraging CUDA kernels with OmpSs on clusters of GPUs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://events.prace-ri.eu/event/424&quot;&gt;https://events.prace-ri.eu/event/424&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 18 Apr 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/04/18/heterogeneous-programming-on-gpus-with-mpi-ompss.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/04/18/heterogeneous-programming-on-gpus-with-mpi-ompss.html</guid>
			</item>
		
			<item>
				<title>BigStorage Initial Training School</title>
				<description>&lt;div&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: Barcelona, SPAIN&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: March 3-9, 2016&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speaker&lt;/strong&gt;: Xavier Martorell&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;BigStorage is a European Training Network (ETN) whose main goal is to train future data scientists, enabling them to apply holistic and interdisciplinary approaches to take advantage of a data-overwhelmed world. This requires HPC and Cloud infrastructures, and a redefinition of the storage architectures underpinning them, focused on meeting highly ambitious performance and energy-usage objectives.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://www.bigstorage-project.eu/index.php/events&quot;&gt;http://www.bigstorage-project.eu/index.php/events&lt;/a&gt;&lt;/div&gt;
</description>
				<pubDate>Thu, 11 Feb 2016 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2016/02/11/bigstorage-initial-training-school.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2016/02/11/bigstorage-initial-training-school.html</guid>
			</item>
		
			<item>
				<title>ITN Course: Parallel Programming Workshop</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Barcelona, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: January 18th-22nd, 2016&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Teruel &amp;amp; Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The objectives of this course are to understand the fundamental concepts
supporting message-passing and shared-memory programming models. The course
covers the two widely used programming models: MPI for distributed-memory
environments and OpenMP for shared-memory architectures. It also presents
the main tools developed at BSC to obtain information about and analyze the
execution of parallel applications, Paraver and Extrae. Moreover, it sets the
basic foundations of task decomposition and parallelization inhibitors, using
Tareador, a tool to analyze potential parallelism and dependences.&lt;/p&gt;

&lt;h4 id=&quot;external-references&quot;&gt;External references&lt;/h4&gt;

&lt;p&gt;https://www.bsc.es/marenostrum-support-services/hpc-education-and-training/2015-16-workshops-and-seasonal-schools/tccm&lt;/p&gt;
</description>
				<pubDate>Wed, 18 Nov 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/11/18/parallel-programming.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/11/18/parallel-programming.html</guid>
			</item>
		
			<item>
				<title>Parallel Programming Workshop</title>
				<description>&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;strong style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Tutorial&lt;/strong&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;: Barcelona, SPAIN&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;strong&gt;Event date&lt;/strong&gt;: November 23-27, 2015&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;strong style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;Speaker&lt;/strong&gt;&lt;span style=&quot;font-size: 16.26px; line-height: 1.538em;&quot;&gt;: Xavier Martorell&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;The objectives of this course are to understand the fundamental concepts supporting message-passing and shared-memory programming models. The course covers the two widely used programming models: MPI for distributed-memory environments and OpenMP for shared-memory architectures. It also presents the main tools developed at BSC to obtain information about and analyze the execution of parallel applications, Paraver and Extrae. Moreover, it sets the basic foundations of task decomposition and parallelization inhibitors, using Tareador, a tool to analyze potential parallelism and dependences.&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.26px; line-height: 25.0079px;&quot;&gt;&lt;a href=&quot;http://www.bsc.es/marenostrum-support-services/hpc-education-and-training/patc-training/2014-13-17-oct-patc-parallel&quot;&gt;PATC Course: Parallel Programming Workshop&lt;/a&gt;&lt;/div&gt;
</description>
				<pubDate>Thu, 15 Oct 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/10/15/parallel-programming-workshop.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/10/15/parallel-programming-workshop.html</guid>
			</item>
		
			<item>
				<title>UAM Course: Parallel Programming</title>
				<description>&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Venue&lt;/em&gt;: Madrid, SPAIN&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Event date&lt;/em&gt;: November 30th, 2015 - December 4th, 2015&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Speakers&lt;/em&gt;: Xavier Teruel &amp;amp; Xavier Martorell&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;

&lt;p&gt;The objectives of the course are to understand and practice fundamental
concepts of parallel and distributed programming with message passing (MPI) and
shared memory (OpenMP). Some tools useful for debugging, such as Valgrind,
Paraver, and Tareador, will also be presented.&lt;/p&gt;

&lt;h4 id=&quot;external-references&quot;&gt;External References&lt;/h4&gt;

&lt;p&gt;http://www.uam.es/ss/Satellite/es/1234886350331/1242691057233/evento/evento/1242691057233.htm&lt;/p&gt;
</description>
				<pubDate>Wed, 30 Sep 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/09/30/uam-tutorial.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/09/30/uam-tutorial.html</guid>
			</item>
		
			<item>
				<title>New OmpSs release 15.06</title>
				<description>&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;Me&lt;/strong&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;rcurium compiler:&lt;/strong&gt;&amp;nbsp;1.99.7&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;Nanos++ RT Library:&lt;/strong&gt;&amp;nbsp;0.7.10&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;&lt;strong&gt;Download this version&lt;/strong&gt;&amp;nbsp;&lt;strong&gt;&lt;a href=&quot;/ompss-downloads&quot;&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
</description>
				<pubDate>Thu, 04 Jun 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/06/04/new-ompss-release-15-06.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/06/04/new-ompss-release-15-06.html</guid>
			</item>
		
			<item>
				<title>New OmpSs release 15.04</title>
				<description>&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;Me&lt;/strong&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;rcurium compiler:&lt;/strong&gt; 1.99.7&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;Nanos++ RT Library:&lt;/strong&gt; 0.7.9&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;&lt;strong&gt;Download this version&lt;/strong&gt;&amp;nbsp;&lt;strong&gt;&lt;a href=&quot;ompss-downloads&quot;&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;!--break--&gt;&lt;/div&gt;
</description>
				<pubDate>Wed, 15 Apr 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/04/15/new-ompss-release-15-04.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/04/15/new-ompss-release-15-04.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS 2015</title>
				<description>&lt;div&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;&lt;strong&gt;Tutorial&lt;/strong&gt;: Barcelona, SPAIN&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: July 6-10th, 2015&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Speakers&lt;/strong&gt;: TBD&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;The sixth edition of the Programming and Tuning Massively Parallel Systems summer school (PUMPS) is aimed at enriching the skills of researchers, graduate students and teachers with cutting-edge techniques and hands-on experience in developing applications for many-core processors with massively parallel computing resources, such as GPU accelerators.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://bcw.ac.upc.edu/PUMPS2015/start&quot;&gt;PUMPS course website&lt;/a&gt;&lt;/div&gt;
</description>
				<pubDate>Tue, 24 Mar 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/03/24/ompss-tutorial-at-pumps-2015.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/03/24/ompss-tutorial-at-pumps-2015.html</guid>
			</item>
		
			<item>
				<title>Course on OmpSs PATC</title>
				<description>&lt;div&gt;&lt;strong style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;Tutorial&lt;/strong&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;: Barcelona, SPAIN&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;Event date&lt;/strong&gt;: May 13-14th, 2015&lt;/div&gt;
&lt;div&gt;&lt;strong style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;Speaker&lt;/strong&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;: Xavier Martorell&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;The tutorial will motivate the audience on the need for portable, efficient programming models that put less pressure on program developers while still getting good performance for clusters and clusters with GPUs.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;MORE INFO&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://www.bsc.es/patc-programming-2015&quot;&gt;BSC PATC courses website&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Tue, 24 Mar 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/03/24/course-on-ompss-patc.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/03/24/course-on-ompss-patc.html</guid>
			</item>
		
			<item>
				<title>New OmpSs release 15.02</title>
				<description>&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;Me&lt;/strong&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;rcurium compiler:&lt;/strong&gt;&amp;nbsp;1.99.6&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;strong&gt;Nanos++ RT Library:&lt;/strong&gt;&amp;nbsp;0.7.6&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;span style=&quot;font-size: 16.2600002288818px; line-height: 1.538em;&quot;&gt;&lt;strong&gt;Download this version&lt;/strong&gt;&amp;nbsp;&lt;strong&gt;&lt;a href=&quot;/ompss-downloads&quot;&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/div&gt;
&lt;div style=&quot;font-size: 16.2600002288818px; line-height: 25.0078811645508px;&quot;&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
</description>
				<pubDate>Sun, 15 Feb 2015 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2015/02/15/new-ompss-release-15-02.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2015/02/15/new-ompss-release-15-02.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS 2014</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; July 7-11th, 2014&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Rosa M. Badia and Xavier Martorell&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;p&gt;The fifth edition of the Programming and Tuning Massively Parallel Systems summer school (PUMPS) is aimed at enriching the skills of researchers, graduate students and teachers with cutting-edge techniques and hands-on experience in developing applications for many-core processors with massively parallel computing resources such as GPU accelerators.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;p&gt;Summer School website&lt;/p&gt;
</description>
				<pubDate>Wed, 02 Jul 2014 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2014/07/02/ompss-tutorial-at-pumps-2014.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2014/07/02/ompss-tutorial-at-pumps-2014.html</guid>
			</item>
		
			<item>
				<title>Contributions at Joint laboratory for Petascale computing</title>
				<description>&lt;p&gt;&lt;strong&gt;Workshop:&lt;/strong&gt; Sophia Antipolis, FRANCE&lt;br /&gt;
&lt;strong&gt;Event Date:&lt;/strong&gt; June 9-11, 2014&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Jesus Labarta, Victor Lopez and Florentino Sainz&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Presentation of BSC activities (Jesus Labarta)&lt;/li&gt;
  &lt;li&gt;DLB: Dynamic Load Balancing Library (Victor Lopez)&lt;/li&gt;
  &lt;li&gt;DEEP Collective offload (Florentino Sainz)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;abstracts&quot;&gt;Abstracts&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Dynamic Load Balancing Library:&lt;/strong&gt; DLB is a dynamic library designed to speed up hybrid applications by improving their load balance with little or no intervention from the user. The idea behind the library is to redistribute the computational resources of the second level of parallelism (OpenMP, OmpSs) to improve the load balance of the outer level of parallelism (MPI). The DLB library uses an interposition technique at run time, so no prior analysis or modification of the application is necessary, although finer control is also supported through an API. Finally, we also present a case study with CESM (Community Earth System Model), a global climate model that provides computer simulations of the Earth&amp;#39;s climate states. The application already uses a hybrid parallel programming model (MPI+OpenMP), so with few modifications to the source code we have compiled it to use the OmpSs programming model, where DLB benefits from its high malleability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DEEP Collective offload:&lt;/strong&gt; We present a new extension of the OmpSs programming model which allows users to dynamically offload C/C++ or Fortran code from one or many nodes to a group of remote nodes. Communication between remote nodes executing offloaded code is possible through MPI. It aims to improve the programmability of exascale and current supercomputers, which combine different types of processors and interconnection networks that have to work together in order to obtain the best performance. A good example of these architectures is the DEEP project, which has two separate clusters (CPUs and Xeon Phis). With our technology, which works on any architecture that fully supports MPI, users can easily offload work from the CPU cluster to the accelerator cluster without the constraint of falling back to the CPU cluster in order to perform MPI communications.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://jointlab.ncsa.illinois.edu/events/workshop11&quot;&gt;Event information&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 06 Jun 2014 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2014/06/06/contributions-at-joint-laboratory-for-petascale-computing.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2014/06/06/contributions-at-joint-laboratory-for-petascale-computing.html</guid>
			</item>
		
			<item>
				<title>Course on programming models using OmpSs</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Bucaramanga, COLOMBIA&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; June 3-6th, 2014&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Vicenç Beltran and Florentino Sainz&lt;/p&gt;

&lt;h4 id=&quot;abstract&quot;&gt;Abstract&lt;/h4&gt;
&lt;p&gt;OmpSs is an effort to integrate features from the StarSs programming model developed by BSC into a single programming model. In particular, our objective is to extend OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can also be understood as new directives extending other accelerator based APIs like CUDA or OpenCL. Our OmpSs environment is built on top of our Mercurium compiler and Nanos++ runtime system.&lt;/p&gt;

&lt;h4 id=&quot;place&quot;&gt;Place&lt;/h4&gt;
&lt;p&gt;Universidad Industrial de Santander - Campus Principal&lt;br /&gt;
Bucaramanga, Santander, Colombia&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://sc3.uis.edu.co/ompss14&quot;&gt;News at Universidad Industrial de Santander&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.redclara.net/index.php?option=com_wrapper&amp;amp;view=wrapper&amp;amp;Itemid=669&amp;amp;url=eventos.redclara.net/indico/events.py?tag=311&amp;amp;lang=es&quot;&gt;Course web site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 30 May 2014 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2014/05/30/course-on-programming-models-using-ompss.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2014/05/30/course-on-programming-models-using-ompss.html</guid>
			</item>
		
			<item>
				<title>PATC Course: Heterogeneous Programming on GPUs with MPI + OmpSs</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN&lt;br /&gt;
&lt;strong&gt;Event Date:&lt;/strong&gt; May 14-15, 2014&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Rosa M. Badia and Xavier Martorell&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;p&gt;The tutorial will motivate the audience on the need for portable, efficient programming models that put less pressure on program developers while still getting good performance for clusters and clusters with GPUs. More specifically, the tutorial will introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems. It will also demonstrate how to use MPI/OmpSs to incrementally parallelize/optimize: first MPI applications on clusters of SMPs, and then leveraging CUDA kernels with OmpSs on clusters of GPUs.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bsc.es/marenostrum-support-services/hpc-events-trainings/prace-trainings/clone-patc-course-23-24-may12&quot;&gt;Course web site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 14 Apr 2014 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2014/04/14/patc-course-heterogeneous-programming-on-gpus-with-mpi-ompss.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2014/04/14/patc-course-heterogeneous-programming-on-gpus-with-mpi-ompss.html</guid>
			</item>
		
			<item>
				<title>Public Release DLB library version 1.0</title>
				<description>&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;/files/dlb/logo.png&quot; height=&quot;50&quot; width=&quot;162&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The Programming Models group at &lt;a href=&quot;https://www.bsc.es&quot;&gt;Barcelona Supercomputing Center&lt;/a&gt; is proud to announce the first official release of the Dynamic Load Balancing Library &lt;a href=&quot;/dlb&quot;&gt;DLB&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DLB is a dynamic library that aims to improve the performance of hybrid applications by decreasing the load imbalance of the outer level of parallelism (usually MPI) by redistributing the computational resources in the inner level (shared-memory parallelism).&lt;/p&gt;

&lt;p&gt;For more information please visit our webpage: &lt;a href=&quot;/dlb&quot;&gt;https://pm.bsc.es/dlb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The latest releases of DLB, detailed documentation and source code are available on our &lt;a href=&quot;//github.com/bsc-pm/dlb/&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
</description>
				<pubDate>Wed, 11 Dec 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/12/11/public-release-dlb-library-version-1-0.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/12/11/public-release-dlb-library-version-1-0.html</guid>
			</item>
		
			<item>
				<title>BSC @ SC13: Tutorial and HPC Educators Session</title>
				<description>&lt;p&gt;BSC is contributing to SC13 with a tutorial and an HPC Educators session on the OmpSs task-based programming model and its use.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://sc13.supercomputing.org/schedule/event_detail.php?evid=tut117&quot;&gt;OmpSs tutorial&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://sc13.supercomputing.org/schedule/event_detail.php?evid=eps104&quot;&gt;HPC Educators session&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Tue, 29 Oct 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/10/29/bsc-sc13-tutorial-and-hpc-educators-session.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/10/29/bsc-sc13-tutorial-and-hpc-educators-session.html</guid>
			</item>
		
			<item>
				<title>Heterogeneous Programming on GPUs with MPI + OmpSs (SBAC-PAD 2013)</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Porto de Galinhas, BRAZIL&lt;br /&gt;
&lt;strong&gt;Event date:&lt;/strong&gt; October 23, 2013&lt;br /&gt;
&lt;strong&gt;Speaker:&lt;/strong&gt; Rosa Maria Badia&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;p&gt;The course starts with the objective of setting up the basic foundations related to task decomposition and parallelization inhibitors, using a tool to analyze potential parallelism and dependences. The course follows with the objective of understanding the fundamental concepts supporting shared-memory programming and efficient programming using GPUs.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.cin.ufpe.br/~sbac2013/sbac/overall_program_new.php&quot;&gt;Course web site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 30 Sep 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/09/30/heterogeneous-programming-on-gpus-with-mpi-ompss-sbac-pad-2013.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/09/30/heterogeneous-programming-on-gpus-with-mpi-ompss-sbac-pad-2013.html</guid>
			</item>
		
			<item>
				<title>Parallel Programming Workshop (PATC Course)</title>
				<description>&lt;p&gt;&lt;strong&gt;Tutorial:&lt;/strong&gt; Barcelona, SPAIN - October 14-18, 2013&lt;br /&gt;
&lt;strong&gt;Speakers:&lt;/strong&gt; Rosa M. Badia and Xavier Martorell&lt;/p&gt;

&lt;h4 id=&quot;objectives&quot;&gt;Objectives&lt;/h4&gt;
&lt;p&gt;The course starts with the objective of setting up the basic foundations related to task decomposition and parallelization inhibitors, using a tool to analyze potential parallelism and dependences. The course follows with the objective of understanding the fundamental concepts supporting shared-memory and message-passing programming models. The course is taught using formal lectures and practical/programming sessions to reinforce the key concepts and set up the compilation/execution environment. The course covers the two widely used programming models: OpenMP for the shared-memory architectures and MPI for the distributed-memory counterparts. The use of OpenMP in conjunction with MPI to better exploit the shared-memory capabilities of current compute nodes in clustered architectures is also considered. Paraver will be used along the course as the tool to understand the behavior and performance of parallelized codes.&lt;/p&gt;

&lt;h4 id=&quot;more-info&quot;&gt;More Info&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bsc.es/marenostrum-support-services/hpc-education-and-training/patc-training/about-patc-bsc/patc-parallel&quot;&gt;Course web site&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 20 Sep 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/09/20/parallel-programming-workshop-patc-course.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/09/20/parallel-programming-workshop-patc-course.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PUMPS 2013</title>
				<description>&lt;div&gt;&lt;strong style=&quot;line-height: 1.538em;&quot;&gt;OmpSs: Leveraging GPU/CUDA Programming&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;Barcelona, SPAIN -- July 8-12, 2013&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;CONTENTS&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;The fourth edition of the Programming and Tuning Massively Parallel Systems summer school (PUMPS) is aimed at enriching the skills of researchers, graduate students and teachers with cutting-edge techniques and hands-on experience in developing applications for many-core processors with massively parallel computing resources such as GPU accelerators.&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;More information at:&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://bcw.ac.upc.edu/PUMPS2013/start&quot;&gt;Summer School website&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Sat, 15 Jun 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/06/15/ompss-tutorial-at-pumps-2013.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/06/15/ompss-tutorial-at-pumps-2013.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at Colombia</title>
				<description>&lt;div&gt;&lt;strong&gt;&lt;span style=&quot;line-height: 1.538em;&quot;&gt;Asynchronous Hybrid and Heterogeneous Parallel Programming with MPI/OmpSs Course&lt;/span&gt;&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;July 1-5, 2013&lt;/div&gt;
&lt;div&gt;&lt;!--break--&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;ABSTRACT&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;span style=&quot;line-height: 1.538em;&quot;&gt;OmpSs is an effort to integrate features from the StarSs programming model developed by BSC into a single programming model. In particular, our objective is to extend OpenMP with new directives to support asynchronous parallelism and heterogeneity (devices like GPUs). However, it can also be understood as new directives extending other accelerator based APIs like CUDA or OpenCL. Our OmpSs environment is built on top of our Mercurium compiler and Nanos++ runtime system.&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;PLACE&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;span style=&quot;line-height: 1.538em;&quot;&gt;Universidad Industrial de Santander - Campus Principal&lt;/span&gt;&lt;/div&gt;
&lt;div&gt;Bucaramanga, Santander, Colombia&lt;/div&gt;
&lt;div&gt;Room: Sala de conferencias EISI, Facultad de Ingenierías Fisicomecánicas&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;&lt;strong&gt;More information at:&lt;/strong&gt;&lt;/div&gt;
&lt;div&gt;&lt;a href=&quot;http://grid.uis.edu.co/index.php/Asynchronous_Hybrid_and_Heterogeneous_Parallel_Programming_with_MPI/OmpSs_Course&quot;&gt;Course website&lt;/a&gt;&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
</description>
				<pubDate>Sat, 01 Jun 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/06/01/ompss-tutorial-at-colombia.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/06/01/ompss-tutorial-at-colombia.html</guid>
			</item>
		
			<item>
				<title>Ompss tutorial at XSEDE project</title>
				<description>&lt;h3 id=&quot;parallel-cpu-programming-ompss-at-the-university-of-new-york&quot;&gt;Parallel CPU programming (OmpSs) at the University of New York&lt;/h3&gt;
&lt;p&gt;June 24, 2013&lt;/p&gt;

&lt;h4 id=&quot;contents&quot;&gt;Contents&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;BLOCK 1: OmpSs Quick Overview
    &lt;ul&gt;
      &lt;li&gt;High Performance Computing
        &lt;ul&gt;
          &lt;li&gt;Some basic concepts&lt;/li&gt;
          &lt;li&gt;Supercomputers nowadays&lt;/li&gt;
          &lt;li&gt;Parallel programming models&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;OmpSs Introduction
        &lt;ul&gt;
          &lt;li&gt;OmpSs main features&lt;/li&gt;
          &lt;li&gt;A Practical Example: Cholesky factorization&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;BSC’s Implementation
        &lt;ul&gt;
          &lt;li&gt;Mercurium Compiler&lt;/li&gt;
          &lt;li&gt;Nanos++ Runtime Library&lt;/li&gt;
          &lt;li&gt;Visualization Tools&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;BLOCK 2: Basics of OmpSs
    &lt;ul&gt;
      &lt;li&gt;Tasking and Synchronization
        &lt;ul&gt;
          &lt;li&gt;Data Sharing Attributes&lt;/li&gt;
          &lt;li&gt;Dependence Model&lt;/li&gt;
          &lt;li&gt;Other Tasking Directive Clauses&lt;/li&gt;
          &lt;li&gt;Taskwait&lt;/li&gt;
          &lt;li&gt;Synchronization&lt;/li&gt;
          &lt;li&gt;Outlined Task Syntax&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Memory Regions
        &lt;ul&gt;
          &lt;li&gt;Introduction&lt;/li&gt;
          &lt;li&gt;Syntax&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Nesting and Dependences
        &lt;ul&gt;
          &lt;li&gt;Memory regions and dependences&lt;/li&gt;
          &lt;li&gt;Nested tasks and dependences&lt;/li&gt;
          &lt;li&gt;Using dependence sentinels&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
      &lt;li&gt;Programming Methodology&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More information at: &lt;a href=&quot;https://www.xsede.org/web/international-hpc-summer-school/2013-wiki/-/wikid/0qDR/2013+Wiki/Abstracts&quot;&gt;Course website&lt;/a&gt;&lt;/p&gt;
</description>
				<pubDate>Wed, 29 May 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/05/29/ompss-tutorial-at-xsede-project.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/05/29/ompss-tutorial-at-xsede-project.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at ISCA 2013</title>
				<description>&lt;h3 id=&quot;hybrid-and-heterogeneous-parallel-programming-with-mpiompss-for-exascale-systems&quot;&gt;Hybrid and Heterogeneous Parallel Programming with MPI/OmpSs for Exascale Systems&lt;/h3&gt;

&lt;p&gt;June 24, 2013&lt;/p&gt;

&lt;h4 id=&quot;abstract&quot;&gt;Abstract&lt;/h4&gt;
&lt;p&gt;Due to its asynchronous nature and look-ahead capabilities, MPI/OmpSs is a promising programming model approach for future exascale systems, with the potential to exploit unprecedented amounts of parallelism, while coping with memory latency, network latency and load imbalance. Many large-scale applications are already seeing very positive results from their ports to MPI/OmpSs (see EU projects Montblanc, TEXT). We will first cover the basic concepts of the programming model. OmpSs can be seen as an extension of the OpenMP model. Unlike OpenMP, however, task dependencies are determined at runtime thanks to the directionality of data arguments. The OmpSs runtime supports asynchronous execution of tasks on heterogeneous systems such as SMPs, GPUs and clusters thereof. The integration of OmpSs with MPI facilitates the migration of current MPI applications and improves, automatically, the performance of these applications by overlapping computation with communication between tasks on remote nodes. The tutorial will also cover the constellation of development and performance tools available for the MPI/OmpSs programming model: the methodology to determine OmpSs tasks, the Tareador tool, and the Paraver performance analysis tools. The tutorial will also include practical sessions on application development and analysis on single many-core nodes, heterogeneous environments with GPUs, and cluster environments with MPI/OmpSs.&lt;/p&gt;

&lt;p&gt;More information at:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://pm.bsc.es/content/ompss-tutorial-isca-2013&quot;&gt;Tutorial website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://isca2013.eew.technion.ac.il/&quot;&gt;ISCA 2013 website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 24 May 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/05/24/ompss-tutorial-at-isca-2013.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/05/24/ompss-tutorial-at-isca-2013.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at PRACE project</title>
				<description>&lt;h3 id=&quot;patc-course-heterogeneous-programming-on-gpus-with-mpi--ompss&quot;&gt;PATC Course: Heterogeneous Programming on GPUs with MPI + OmpSs&lt;/h3&gt;

&lt;p&gt;May 15-16, 2013&lt;/p&gt;

&lt;h3 id=&quot;objectives&quot;&gt;Objectives&lt;/h3&gt;
&lt;p&gt;The tutorial will motivate the audience on the need for portable, efficient programming models that put less pressure on program developers while still getting good performance for clusters and clusters with GPUs. More specifically, the tutorial will:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Introduce the hybrid MPI/OmpSs parallel programming model for future exascale systems&lt;/li&gt;
  &lt;li&gt;Demonstrate how to use MPI/OmpSs to incrementally parallelize/optimize:
    &lt;ul&gt;
      &lt;li&gt;MPI applications on clusters of SMPs, and&lt;/li&gt;
      &lt;li&gt;Leverage CUDA kernels with OmpSs on clusters of GPUs&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More information at:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.bsc.es/marenostrum-support-services/hpc-events-trainings/prace-trainings/clone-patc-course-23-24-may12&quot;&gt;Course website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Mon, 15 Apr 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/04/15/ompss-tutorial-at-prace-project.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/04/15/ompss-tutorial-at-prace-project.html</guid>
			</item>
		
			<item>
				<title>Ompss tutorial at CAPAP-H</title>
				<description>&lt;h3 id=&quot;programación-de-aplicaciones-con-mpi--ompss&quot;&gt;Application Programming with MPI + OmpSs&lt;/h3&gt;

&lt;p&gt;March 15, 2013&lt;/p&gt;

&lt;h4 id=&quot;abstract&quot;&gt;Abstract&lt;/h4&gt;
&lt;p&gt;Due to its asynchronous nature and its ability to look ahead at the tasks to be executed, MPI/OmpSs is a very promising parallel programming model for exascale systems. The model has great potential to exploit the inherent parallelism of applications, while hiding memory and network latency and improving the load balance among processes. A significant number of applications see an important performance improvement when adapted to the MPI/OmpSs model (for example, applications from the Montblanc and TEXT projects). The course will first describe the basic concepts of the programming model. OmpSs can be regarded as an extension of the OpenMP standard. However, unlike OpenMP, data dependencies between tasks are determined at run time, taking into account the directionality of the task arguments. The OmpSs runtime supports heterogeneous systems composed of general-purpose processors (multicores), GPUs, and clusters. The integration of OmpSs with MPI eases the migration of existing applications and improves their behavior by overlapping computation tasks with communication.&lt;/p&gt;

&lt;p&gt;Speakers: Xavier Martorell and Rosa M. Badía.&lt;/p&gt;

&lt;p&gt;More information at:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://capap-h.uji.es/?q=node/142&quot;&gt;Tutorial website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
				<pubDate>Fri, 15 Feb 2013 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2013/02/15/ompss-tutorial-at-capap-h.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2013/02/15/ompss-tutorial-at-capap-h.html</guid>
			</item>
		
			<item>
				<title>OmpSs tutorial at Supercomputing 2012</title>
				<description>&lt;p&gt;&lt;strong&gt;Asynchronous Hybrid and Heterogeneous Parallel Programming with MPI/OmpSs for Exascale Systems&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Nov 12, 2012&lt;/p&gt;
&lt;p&gt;&lt;!--break--&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;TIME: 1:30PM - 5:00PM&lt;/li&gt;
  &lt;li&gt;PRESENTERS: Jesus Labarta, Xavier Martorell, Christoph Niethammer and Costas Bekas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ABSTRACT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Due to its asynchronous nature and look-ahead capabilities, MPI/OmpSs is a promising programming model approach for future exascale systems, with the potential to exploit unprecedented amounts of parallelism, while coping with memory latency, network latency and load imbalance. Many large-scale applications are already seeing very positive results from their ports to MPI/OmpSs (see EU projects Montblanc, TEXT). We will first cover the basic concepts of the programming model. OmpSs can be seen as an extension of the OpenMP model. Unlike OpenMP, however, task dependencies are determined at runtime thanks to the directionality of data arguments. The OmpSs runtime supports asynchronous execution of tasks on heterogeneous systems such as SMPs, GPUs and clusters thereof. The integration of OmpSs with MPI facilitates the migration of current MPI applications and improves, automatically, the performance of these applications by overlapping computation with communication between tasks on remote nodes. The tutorial will also cover the constellation of development and performance tools available for the MPI/OmpSs programming model: the methodology to determine OmpSs tasks, the Ayudame/Temanejo debugging toolset, and the Paraver performance analysis tools. Experiences on the parallelization of real applications using MPI/OmpSs will also be presented. The tutorial will also include a demo.&lt;/p&gt;
</description>
				<pubDate>Tue, 02 Oct 2012 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2012/10/02/ompss-tutorial-at-supercomputing-2012.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2012/10/02/ompss-tutorial-at-supercomputing-2012.html</guid>
			</item>
		
			<item>
				<title>NANOS++ for Clusters</title>
				<description>&lt;p&gt;A few weeks ago we started the development of NANOS++ for Clusters. As one would expect from the name, the main goal is to support the execution of parallel applications on cluster environments using the programming models currently available in NANOS++. The starting design has been driven by the following ideas:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Low cohesion with the rest of the runtime&lt;/li&gt;
  &lt;li&gt;Minimal impact on applications code&lt;/li&gt;
  &lt;li&gt;Independent of the underlying network technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far the initial design has been quite successful in meeting these three objectives. The cluster support has been added as a plugin which can be enabled at application run time. It has been developed on MareNostrum, a cluster of PowerPC nodes, but it should work on other platforms as the code is architecture-independent. Network support is provided by &lt;a href=&quot;http://gasnet.cs.berkeley.edu&quot;&gt;GASNet&lt;/a&gt;, a low-level networking layer designed for building runtime libraries on top of it, with support for several network technologies.&lt;/p&gt;

&lt;p&gt;We have succeeded in executing applications coded using the OmpSs pragmas. A dense matrix multiply has been run on up to 64 nodes; however, the current design limits the scalability of the system to 4 nodes.&lt;/p&gt;

&lt;p&gt;The next stages of the development will be focused on allowing the system to scale while using a higher number of nodes, enabling instrumentation features and adding more applications.&lt;/p&gt;
</description>
				<pubDate>Thu, 09 Sep 2010 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2010/09/09/nanos-for-clusters.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2010/09/09/nanos-for-clusters.html</guid>
			</item>
		
			<item>
				<title>Instrumenting Nanos++ (3rd part)</title>
				<description>&lt;p&gt;We conclude in this article the series of posts on instrumenting Nanos++. In the &lt;a href=&quot;/2010/06/30/instrumenting-nanos-1st-part.html&quot;&gt;first&lt;/a&gt; article we discussed the different components that take part in the instrumentation process, the different types of events which can be generated by the runtime, and how the instrumentation output can be adapted to different formats (using plugins). In the &lt;a href=&quot;/2010/07/23/instrumenting-nanos-2nd-part.html&quot;&gt;second&lt;/a&gt; one we covered internal implementation details and how programmers can use instrumentation services to generate events. In this one we will talk about instrumentation modules, which help programmers when instrumenting the code, and we will show some practical examples using these modules.&lt;/p&gt;

&lt;h4 id=&quot;instrumentation-modules&quot;&gt;Instrumentation Modules&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Instrumentation modules&lt;/em&gt; help programmers in the instrumentation process by automatically taking care of some of the steps required for correct instrumentation. So far their main utility is to handle multiple exits in a given piece of code. As a module is a C++ object, we can use the constructor to open an instrumentation burst, leaving the responsibility of closing it to the corresponding destructor. The simplest instrumentation module is InstrumentState:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;nl&quot;&gt;private:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;Instrumentation&lt;/span&gt;     &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;_closed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;nl&quot;&gt;public:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nanos_event_state_value_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
         &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getInstrumentor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_closed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;_inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raiseOpenStateEvent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_closed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_closed&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raiseCloseStateEvent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Creating a new InstrumentState object opens a &lt;em&gt;State&lt;/em&gt; event (the value is specified in the object constructor). Once the object goes out of the scope where it is declared, the destructor closes the event (if the programmer has not closed it before). As most instrumentation phases affect a whole function, the programmer just has to create an object of an Instrumentation module at the beginning of the function.&lt;/p&gt;
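&lt;p&gt;The RAII idea behind these modules can also be illustrated outside C++. The following Python sketch (not part of Nanos++; all names are hypothetical) models InstrumentState as a context manager: the event opens on entry and is guaranteed to close on every exit path, including early returns and exceptions.&lt;/p&gt;

```python
# Illustrative sketch (not the Nanos++ API): the RAII idea behind
# InstrumentState, modelled with a Python context manager.
class StateGuard:
    """Opens a state event on entry and guarantees it is closed on exit."""
    def __init__(self, trace, state):
        self.trace = trace
        self.state = state
        self.closed = False

    def close(self):
        # Idempotent: a second close (e.g. from __exit__) is a no-op,
        # mirroring the _closed flag in InstrumentState.
        if not self.closed:
            self.closed = True
            self.trace.append(("close", self.state))

    def __enter__(self):
        self.trace.append(("open", self.state))
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit, early return, or exception: the event is
        # always closed, just as the C++ destructor does.
        self.close()
        return False

trace = []
with StateGuard(trace, "SYNCHRONIZATION"):
    pass  # instrumented region; it may return early or raise
```

Whatever happens inside the guarded region, the trace ends with a matching close event.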

&lt;h4 id=&quot;example-instrumenting-the-api&quot;&gt;Example: Instrumenting the API&lt;/h4&gt;
&lt;p&gt;API functions generally share a common behaviour. They open a &lt;em&gt;Burst&lt;/em&gt; event with a pair &amp;lt;key,value&amp;gt;. The key is the internal code “api” and the value is a specific identifier of the function we are instrumenting. API functions also open a &lt;em&gt;State&lt;/em&gt; event with a value according to the function’s duty. Both events are closed once the function execution finishes. Here is an example using the nanos_yield() implementation:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;nanos_err_t&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;nanos_yield&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InstrumentStateAndBurst&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;api&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;yield&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCHEDULING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;Scheduler&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yield&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NANOS_UNKNOWN_ERR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NANOS_OK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The yield function wraps its execution between the &amp;lt;“api”,“yield”&amp;gt; &lt;em&gt;Burst&lt;/em&gt; and the SCHEDULING &lt;em&gt;State&lt;/em&gt; events. Although the function may have other exit points (apart from the final return), the &lt;em&gt;InstrumentStateAndBurst&lt;/em&gt; destructor raises the closing events automatically.&lt;/p&gt;

&lt;h4 id=&quot;example-instrumenting-runtime-internal-functions&quot;&gt;Example: Instrumenting Runtime Internal Functions&lt;/h4&gt;
&lt;p&gt;Different Nanos++ functions take different instrumentation approaches. In this section we have chosen a scheduling-related function: &lt;em&gt;Scheduler::waitOnCondition()&lt;/em&gt;. Due to space limitations we have abridged the code, focusing on the instrumentation parts.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Scheduler&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;waitOnCondition&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GenericSyncCond&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SYNCHRONIZATION&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

   &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nspins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getSchedulerConf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getNumSpins&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
   &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nspins&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

   &lt;span class=&quot;n&quot;&gt;WD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;myThread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getCurrentWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

   &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;BaseThread&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMyThreadSafe&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;spins&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;--&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
         &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;check&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addWaiter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SCHEDULING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;WD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_schedulePolicy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;atBlock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;next&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;RUNTIME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;switchTo&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;next&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InstrumentState&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inst3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;YIELD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
               &lt;span class=&quot;kr&quot;&gt;thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;yield&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
         &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unlock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
         &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;spins&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nspins&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this function the instrumentation changes the thread state in several parts of the code. First, the whole function body is surrounded by a SYNCHRONIZATION state (&lt;em&gt;inst&lt;/em&gt;). An opening state event is raised at the very beginning of the function and the corresponding close event is thrown once the execution flow leaves the function scope. During the function execution the thread state may change to SCHEDULING when calling &lt;em&gt;_schedulePolicy.atBlock()&lt;/em&gt;, to RUNTIME when we are context switching &lt;em&gt;WorkDescriptors&lt;/em&gt;, and to YIELD when we are forcing a thread yield. In this case the SCHEDULING state change is the only one we have to close explicitly before leaving its scope. Note that if a C++ exception is raised by any of the lower layers, the states open at that point are closed automatically. Thus, using the Instrumentation modules improves the general exception safety of the code.&lt;/p&gt;
</description>
				<pubDate>Fri, 13 Aug 2010 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2010/08/13/instrumenting-nanos-3rd-part.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2010/08/13/instrumenting-nanos-3rd-part.html</guid>
			</item>
		
			<item>
				<title>Instrumenting Nanos++ (2nd part)</title>
				<description>&lt;p&gt;We continue in this article the Nanos++ instrumentation overview started &lt;a href=&quot;/2010/06/30/instrumenting-nanos-1st-part.html&quot;&gt;previously&lt;/a&gt;. In the previous article we discussed the different components that take part in the instrumentation process, the different types of events the runtime can generate, and how the instrumentation output can be adapted to different formats (using plugins). In this article we focus on the internal implementation and on how programmers can use instrumentation services to generate events.&lt;/p&gt;

&lt;h4 id=&quot;instrumentation-class&quot;&gt;Instrumentation class&lt;/h4&gt;
&lt;p&gt;The main component of the instrumentation is the &lt;em&gt;Instrumentation&lt;/em&gt; class. This class offers several services which can be grouped in:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Create event services:&lt;/strong&gt; these services focus on creating specific event objects. They are usually not called by external agents but are used by the raise event services (explained below).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Raise event services:&lt;/strong&gt; these services effectively produce an event (or list of events) which will be visible to the user. They usually call one or several create event service(s) and finally produce the actual output by calling the plugin’s addEventList() service.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Context switch services:&lt;/strong&gt; they are used to back up/restore the instrumentation history of the current &lt;em&gt;WorkDescriptor&lt;/em&gt; (see InstrumentationContext class).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;Instrumentation&lt;/em&gt; class also offers two more services to enable/disable state instrumentation. Once the user calls &lt;em&gt;disableStateInstrumentation()&lt;/em&gt;, the runtime produces no more state events until the user enables them again by calling &lt;em&gt;enableStateInstrumentation()&lt;/em&gt;. Although no state events are produced during this interval, the Instrumentation class keeps track of all potential state changes by creating a special event object: the substate event.&lt;/p&gt;
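&lt;p&gt;The buffering behaviour described above can be sketched as follows. This is an illustrative Python model, not the Nanos++ implementation; the method and field names are hypothetical.&lt;/p&gt;

```python
# Illustrative sketch (not the Nanos++ implementation): while state
# instrumentation is disabled, state changes are recorded as "substate"
# events instead of being emitted, so no transition is lost.
class StateEmitter:
    def __init__(self):
        self.enabled = True
        self.emitted = []     # events visible to the user
        self.substates = []   # deferred state changes

    def disable_state_instrumentation(self):
        self.enabled = False

    def enable_state_instrumentation(self):
        self.enabled = True

    def change_state(self, state):
        if self.enabled:
            self.emitted.append(("state", state))
        else:
            self.substates.append(("substate", state))

em = StateEmitter()
em.change_state("RUNNING")
em.disable_state_instrumentation()
em.change_state("SCHEDULING")   # kept as a substate, not emitted
em.enable_state_instrumentation()
em.change_state("RUNTIME")
```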

&lt;h4 id=&quot;instrumentationcontext-class&quot;&gt;InstrumentationContext class&lt;/h4&gt;
&lt;p&gt;In order to reproduce the history of events across WorkDescriptor context switches (and taking into account that we are producing a thread-centered trace), the Instrumentation class needs a per-WorkDescriptor repository for this kind of information. The &lt;em&gt;InstrumentationContext&lt;/em&gt; is responsible for keeping the history of state transitions, the still-open bursts, and the delayed event list. &lt;em&gt;InstrumentationContext&lt;/em&gt; is implemented through two different classes: &lt;em&gt;InstrumentationContext&lt;/em&gt; (which defines the behavior of this component) and &lt;em&gt;InstrumentationContextData&lt;/em&gt; (which actually keeps the information and is embedded in the WorkDescriptor class).&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;InstrumentationContext&lt;/em&gt; behaviour is defined by the plugin itself and has several implementations according to the State and Burst generation scheme. These two elements can behave differently on a context switch: in one case we only want to regenerate the last event of each type (this is the usual implementation), while in other cases we want to regenerate the complete sequence of events of the same type in the order they occurred (this is the stacked behavior). Currently there are four &lt;em&gt;InstrumentationContext&lt;/em&gt; implementations:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;InstrumentationContext&lt;/li&gt;
  &lt;li&gt;InstrumentationContextStackedStates&lt;/li&gt;
  &lt;li&gt;InstrumentationContextStackedBursts&lt;/li&gt;
  &lt;li&gt;InstrumentationContextStackedStatesAndBursts&lt;/li&gt;
&lt;/ul&gt;
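&lt;p&gt;The difference between the plain and the stacked behaviour can be sketched as follows (an illustrative Python model with hypothetical names, not the Nanos++ API):&lt;/p&gt;

```python
# Illustrative sketch (not the Nanos++ API): on a context switch, the
# plain context variant replays only the innermost open event, while a
# "stacked" variant replays the whole stack of open events in order.
def replay_on_switch(open_stack, stacked):
    if not open_stack:
        return []
    if stacked:
        return list(open_stack)   # replay the full sequence
    return [open_stack[-1]]       # replay only the last open event

stack = ["RUNNING", "SYNCHRONIZATION", "SCHEDULING"]
plain = replay_on_switch(stack, stacked=False)
full = replay_on_switch(stack, stacked=True)
```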

&lt;p&gt;The plugin itself is responsible for defining the InstrumentationContext behaviour by defining an object of this class and initializing the field _instrumentationContext with a reference to it. Example:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InstrumentationExample&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Instrumentation&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
   &lt;span class=&quot;nl&quot;&gt;private:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;InstrumentationContextStackedBursts&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_icLocal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;nl&quot;&gt;public:&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;InstrumentationExample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Instrumentation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_icLocal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
         &lt;span class=&quot;n&quot;&gt;_instrumentationContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_icLocal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;instrumentation-examples&quot;&gt;Instrumentation examples&lt;/h4&gt;
&lt;p&gt;In this section we focus on the runtime instrumentation code. We discuss two different examples: a critical piece of runtime code and a work descriptor’s context switch.&lt;/p&gt;

&lt;p&gt;Some chunks of runtime code are bracketed by instrumentation events in order to measure their duration. An example is a cache allocation. This function is bracketed by a state event and a burst event. The State event changes the current thread’s state to CACHE and the Burst event records the memory allocation size of the specific call. Here is the example:&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;allocate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nanos_event_key_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Instrumentor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getInstrumentorDictionary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getEventKey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cache-malloc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Instrumentor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raiseOpenStateAndBurst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CACHE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nanos_event_value_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;allocate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Instrumentor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;raiseCloseStateAndBurst&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;A WorkDescriptor’s context switch uses two instrumentation services: wdLeaveCPU() and wdEnterCPU(). wdLeaveCPU() is called from the leaving task’s execution context and wdEnterCPU() is called once we are executing the new task.&lt;/p&gt;

&lt;div class=&quot;language-c++ highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getInstrumentor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wdLeaveCPU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oldWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;myThread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;switchHelperDependent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oldWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

   &lt;span class=&quot;n&quot;&gt;myThread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setCurrentWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newWD&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;NANOS_INSTRUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getInstrumentor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wdEnterCPU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newWD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
   &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the next article we will conclude this series about the instrumentation module of Nanos++ by giving an overview of the external runtime instrumentation API and showing some mechanisms that make the programmer’s job easier.&lt;/p&gt;
</description>
				<pubDate>Fri, 23 Jul 2010 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2010/07/23/instrumenting-nanos-2nd-part.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2010/07/23/instrumenting-nanos-2nd-part.html</guid>
			</item>
		
			<item>
				<title>Instrumenting Nanos++ (1st part)</title>
				<description>&lt;p&gt;Continuing with the series of Nanos++ articles, we want to briefly describe the instrumentation mechanism. In this post we give an overview of the main Instrumentation components and concepts. In a future post we will show how to use them to instrument the runtime.&lt;/p&gt;

&lt;p&gt;The main goal of instrumentation is to obtain information about the program execution. In other words, we want to know “&lt;em&gt;What&lt;/em&gt; happens in this &lt;em&gt;WorkDescriptor&lt;/em&gt; running on this &lt;em&gt;Thread&lt;/em&gt;?”. These are the three main components involved in the instrumentation process: the What (which we also call the &lt;em&gt;Event&lt;/em&gt;), the &lt;em&gt;WorkDescriptor&lt;/em&gt;, and the &lt;em&gt;Thread&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Events&lt;/strong&gt; are something that happen at a given time or at a given interval of time.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;WorkDescriptors&lt;/strong&gt; are the runtime basic unit of work. They offer a context to execute a piece of code.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Threads&lt;/strong&gt; are logical (or virtual) processors that execute WorkDescriptors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instrumentation is driven through &lt;em&gt;Key&lt;/em&gt;/&lt;em&gt;Value&lt;/em&gt; pairs in which the Key identifies the semantics of the associated Value (e.g., &lt;em&gt;WorkDescriptor ID&lt;/em&gt; as a Key and a numerical identifier as the associated Value). Keys and Values can be registered in a global dictionary (InstrumentationDictionary) which serves as their repository.&lt;/p&gt;
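&lt;p&gt;As a rough illustration of the dictionary idea (a Python sketch with hypothetical names, not the Nanos++ API), registering the same key twice should yield the same identifier:&lt;/p&gt;

```python
# Illustrative sketch (not the Nanos++ API): a global dictionary that
# registers instrumentation keys, acting as a repository of key ids.
class InstrumentationDictionary:
    def __init__(self):
        self.keys = {}        # key name to numeric key id
        self.next_key = 1

    def register_key(self, name):
        # Registering the same name twice returns the same id.
        if name not in self.keys:
            self.keys[name] = self.next_key
            self.next_key += 1
        return self.keys[name]

d = InstrumentationDictionary()
wd_id_key = d.register_key("wd-id")
```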

&lt;p&gt;Nanos++ defines four different types of events:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Point:&lt;/strong&gt; Punctual events. They have a list of KV pairs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Bursts:&lt;/strong&gt; Interval events. They have a single KV pair which identify the type of burst that we are creating. The runtime automatically manages a stack of burst of with the same key.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;State:&lt;/strong&gt; Thread state events. They have no KV pair, they have just a numerical code which identifies a runtime state.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Point-to-Point&lt;/strong&gt; (PtP): Two connected punctual events. They have a domain and identifier in order to match origin and destination and also, they have a list of KV pairs.&lt;/li&gt;
&lt;/ul&gt;
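
&lt;p&gt;As a rough illustration, the four event kinds and the KV pairs they carry can be modeled as follows. This is a hypothetical Python sketch for exposition only, not the actual C++ API; all names are made up.&lt;/p&gt;

```python
# Hypothetical, simplified model of the Nanos++ event kinds described
# above; names are illustrative, not the real C++ API.

# A global Key/Value dictionary acting as the event repository.
instrumentation_dictionary = {"wd-id": "WorkDescriptor ID"}

def point_event(kv_pairs):
    """A punctual event: carries a list of Key/Value pairs."""
    return {"kind": "point", "kv": list(kv_pairs)}

def burst_event(key, value):
    """An interval event: a single KV pair identifying the burst type."""
    return {"kind": "burst", "kv": [(key, value)]}

def state_event(state_code):
    """A thread-state event: no KV pairs, only a numerical state code."""
    return {"kind": "state", "code": state_code}

def ptp_event(domain, event_id, kv_pairs):
    """A point-to-point event: domain and id match origin and destination."""
    return {"kind": "ptp", "domain": domain, "id": event_id,
            "kv": list(kv_pairs)}

e = ptp_event("task-wait", 42, [("wd-id", 7)])
```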

&lt;p&gt;The core of the instrumentation behavior is specified through the Instrumentation class. This class implements several types of methods: methods to create events, methods to raise events, WorkDescriptor context switch methods and, finally, specific Instrumentation methods, which are defined in each derived class (plugin). The specific Instrumentation methods are (ideally) the only ones that have to be implemented in each derived Instrumentation class. They are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;initialize():&lt;/strong&gt; this method is executed at runtime startup and can be used to create buffers, auxiliary structures, initialize values (e.g. time stamp), etc.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;finalize():&lt;/strong&gt; this method is executed at runtime shutdown and can be used to dump remaining data into a file or standard output, post-process trace information, delete buffers and auxiliary structures, etc.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;addEventList():&lt;/strong&gt; this method is executed each time the runtime raises an event. It receives a list of events (EventList) and the specific instrumentation class has to deal with each event in this list in order to generate (or not) a valid output.&lt;/li&gt;
&lt;/ul&gt;
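
&lt;p&gt;A minimal plugin following this three-method contract might look like the sketch below. This is a hypothetical Python model, assuming a runtime that calls the three hooks at startup, on each raised event list, and at shutdown; the real Nanos++ plugins are C++ classes.&lt;/p&gt;

```python
# Hypothetical sketch of an Instrumentation plugin implementing the
# initialize / addEventList / finalize contract described above.
import time

class TextTracePlugin:
    """Collects raised events and dumps a plain-text trace at shutdown."""

    def initialize(self):
        # Runtime startup: create the buffer and record a base time stamp.
        self.start = time.time()
        self.buffer = []

    def add_event_list(self, events):
        # Called each time the runtime raises events: record each one
        # together with a time stamp relative to startup.
        for event in events:
            self.buffer.append((time.time() - self.start, event))

    def finalize(self):
        # Runtime shutdown: dump the remaining data and free the buffer.
        lines = ["%.6f %s" % (t, e) for (t, e) in self.buffer]
        self.buffer = []
        return "\n".join(lines)
```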

&lt;p&gt;Plugin developers can also override other base methods in order to get specific behavior when the plugin is invoked.&lt;/p&gt;
</description>
				<pubDate>Wed, 30 Jun 2010 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2010/06/30/instrumenting-nanos-1st-part.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2010/06/30/instrumenting-nanos-1st-part.html</guid>
			</item>
		
			<item>
				<title>Into Nanos++</title>
				<description>&lt;p&gt;For the last several months we have been working on &lt;a href=&quot;http://nanos.ac.upc.edu/projects/nanox&quot;&gt;NANOS++&lt;/a&gt;, the replacement for our old OpenMP runtime, nanos4. The design objectives driving the development are:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Extensibility&lt;/li&gt;
  &lt;li&gt;Heterogeneity support&lt;/li&gt;
  &lt;li&gt;Multiple programming model support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our aim has been to enable easy development of different parts of the runtime, so researchers have a platform that allows them to try different mechanisms. So far, several parts of the runtime are quite extensible: the scheduling policy, the throttling policy, the barrier implementations, the slicer implementations, the instrumentation layer and the architectural level. This extensibility does not come for free. The runtime overheads are slightly increased, but they should be low enough for results to be meaningful, except for extremely fine-grained applications.&lt;/p&gt;

&lt;p&gt;The execution model of the runtime is asynchronous task parallelism. Tasks can be spawned and then synchronized among themselves based on point-to-point dependencies. This execution model allows us to implement several programming models on top. Currently we support the &lt;a href=&quot;http://www.bsc.es/smpsuperscalar&quot;&gt;StarSs&lt;/a&gt; model, partially the &lt;a href=&quot;http://www.openmp.org&quot;&gt;OpenMP&lt;/a&gt; model, and the &lt;a href=&quot;http://chapel.cray.com&quot;&gt;Chapel&lt;/a&gt; language model. This model also simplifies support for heterogeneity, as individual tasks are easy to offload to accelerators.&lt;/p&gt;
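
&lt;p&gt;To give a feel for how point-to-point dependencies arise from task spawns, here is a tiny illustrative Python model. The names and structure are made up for this post, assuming dependencies derive from each input&amp;#8217;s last writer; this is not the Nanos++ API.&lt;/p&gt;

```python
# Illustrative model of asynchronous tasks synchronized through
# point-to-point dependencies on the data they read and write.

class TaskGraph:
    def __init__(self):
        self.last_writer = {}   # data name mapped to the task that wrote it
        self.edges = []         # (producer, consumer) dependency pairs

    def spawn(self, name, inputs=(), outputs=()):
        # A task depends point-to-point on the last writer of each input.
        for data in inputs:
            if data in self.last_writer:
                self.edges.append((self.last_writer[data], name))
        for data in outputs:
            self.last_writer[data] = name
        return name

g = TaskGraph()
g.spawn("t1", outputs=["x"])
g.spawn("t2", inputs=["x"], outputs=["y"])
g.spawn("t3", inputs=["x", "y"])
# t2 depends on t1; t3 depends on both t1 and t2.
```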

&lt;p&gt;One of our current focuses is bringing our OpenMP support up to the level we had with the previous runtime (note that you also need our &lt;a href=&quot;http://nanos.ac.upc.edu/projects/mcxx&quot;&gt;Mercurium&lt;/a&gt; compiler for this). Another is the management of data transfers for heterogeneous platforms such as GPUs, or for other environments where explicit data management makes sense (such as NUMA architectures). The current git version already has some minimal support for GPUs using CUDA, where the runtime performs all the data transfers.&lt;/p&gt;

&lt;p&gt;Of course, the runtime also involves a lot of internal development to support features such as instrumentation, debugging, synchronization mechanisms, etc.&lt;/p&gt;

&lt;p&gt;We’ll probably write more posts on specific aspects of the runtime in the future.&lt;/p&gt;
</description>
				<pubDate>Tue, 25 May 2010 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2010/05/25/into-nanos.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2010/05/25/into-nanos.html</guid>
			</item>
		
			<item>
				<title>Where is OpenMP going?</title>
				<description>&lt;p&gt;Christian Terboven, from Aachen University, wrote a nice &lt;a href=&quot;http://terboven.wordpress.com/2009/10/04/how-openmp-is-moving-towards-version-3-1-4-0/&quot;&gt;summary&lt;/a&gt; of recent discussions in the OpenMP language committee which I recommend for people interested in the development of OpenMP.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://nanos.ac.upc.edu/git?p=mcxx.git;a=summary&quot;&gt;git version&lt;/a&gt; of our Mercurium compiler already implements a prototype of the future user-defined reductions (UDRs) for OpenMP. I’ll try to write a bit more about them in the near future so people can give them a try :-)&lt;/p&gt;
</description>
				<pubDate>Wed, 07 Oct 2009 00:00:00 +0000</pubDate>
				<link>https://pm.bsc.es/2009/10/07/where-is-openmp-going.html</link>
				<guid isPermaLink="true">https://pm.bsc.es/2009/10/07/where-is-openmp-going.html</guid>
			</item>
		
	</channel>
</rss>
