Dynamic Load Balancing

DLB is a library devoted to speeding up hybrid parallel applications (see fig. 1). At the same time, DLB improves the efficient use of the computational resources inside a computing node.

 

The DLB library improves the load balance of the outer level of parallelism by redistributing the computational resources at the inner level of parallelism. This readjustment of resources is done dynamically at runtime.

This dynamism allows DLB to react to different sources of imbalance: algorithm, data, hardware architecture, and resource availability, among others.

Fig. 2 shows a typical life cycle when developing an HPC application. This process is time-consuming for the programmer or expert and consumes computational resources. Moreover, for highly tuned applications this cycle must be repeated for each architecture the application will run on.

To use DLB we do not need a previous performance analysis of the application (see fig. 3). Therefore, DLB significantly reduces the time programmers spend modifying the application, debugging, and analyzing the sources of performance loss, as well as the computational resources needed to run the performance tests.

The DLB library uses an interposition technique at runtime; therefore, in most cases it is not necessary to modify or recompile the application (see fig. 4).
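The general idea of interposition can be illustrated with a small analogy. The sketch below is not DLB's mechanism and uses no DLB or MPI API: `Comm`, `barrier`, `interpose` and the hooks are hypothetical names. It only shows how a library entry point can be replaced by a wrapper that runs extra logic around a blocking call while the calling code stays unchanged.

```python
import functools

events = []  # what the hypothetical balancer observed, in order


def interpose(blocking_call, on_enter, on_exit):
    """Return a wrapper that runs hooks before and after blocking_call."""
    @functools.wraps(blocking_call)
    def wrapper(*args, **kwargs):
        on_enter(blocking_call.__name__)
        try:
            return blocking_call(*args, **kwargs)
        finally:
            on_exit(blocking_call.__name__)
    return wrapper


class Comm:
    """Stand-in for a communication library (hypothetical, not MPI)."""

    def barrier(self):
        events.append("barrier body")
        return "done"


# Swap the library entry point once; the application is left untouched.
Comm.barrier = interpose(
    Comm.barrier,
    on_enter=lambda name: events.append(f"enter {name}"),
    on_exit=lambda name: events.append(f"exit {name}"),
)


def application_step(comm):
    # Unmodified application code: it only knows about comm.barrier().
    return comm.barrier()


result = application_step(Comm())
```

In this analogy the `on_enter` hook marks the point where a process could give up resources it will not use while blocked, and `on_exit` the point where it could take them back.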

DLB's approach of redistributing computational resources at runtime according to the instantaneous demand can improve performance in different situations:

  • Hybrid applications with an imbalance problem at the outer level of parallelism.
  • Hybrid applications with an imbalance problem at the inner level of parallelism.
  • Hybrid applications with serialized parts of the code.
  • Multiple applications with different parallelism patterns.

Who can use DLB?

Any application written in C, C++ or Fortran in any of the supported parallel programming models.

The current supported parallel programming models are the following:

  • MPI+OpenMP
  • MPI+OmpSs
  • OmpSs (Multiple Applications)

We are open to adding support for more programming models at both the inner and outer levels of parallelism.

How does DLB work?

DLB uses the malleability of the inner level of parallelism to change the number of threads of the different processes running in the same node. Several load balancing algorithms are implemented within DLB. They all rely on this main idea, but they target different types of applications or situations.

Fig. 5 shows an example of a DLB load balancing algorithm. In this case the application runs two MPI processes in a computing node, each with two OpenMP threads. When MPI process 1 reaches a blocking MPI call, it lends its assigned CPUs (numbers 1 and 2) to the second MPI process running in the same node. This allows MPI process 2 to finish its computation faster.
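The lending step in this example can be modelled with a toy bookkeeping sketch. It is a simulation under stated assumptions, not DLB code: the `Process` and `Node` classes and their method names are hypothetical, CPU numbers 3 and 4 for the second process are assumed, and the reclaim step on leaving the blocking call is an assumption added to close the cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    owned: frozenset                  # CPUs assigned at start
    active: set = field(init=False)   # CPUs currently in use

    def __post_init__(self):
        self.active = set(self.owned)


class Node:
    """Toy model of one computing node shared by several processes."""

    def __init__(self, processes):
        self.processes = {p.name: p for p in processes}
        self.idle = set()  # lent CPUs that nobody is using yet

    def enter_blocking_call(self, name):
        # The blocked process lends every CPU it is using.
        p = self.processes[name]
        self.idle |= p.active
        p.active = set()
        self._hand_out_idle()

    def leave_blocking_call(self, name):
        # Assumption: the owner takes its CPUs back when it resumes.
        p = self.processes[name]
        for other in self.processes.values():
            if other is not p:
                other.active -= p.owned
        self.idle -= p.owned
        p.active = set(p.owned)

    def _hand_out_idle(self):
        # Give idle CPUs to the first process that is still computing.
        for p in self.processes.values():
            if p.active and self.idle:
                p.active |= self.idle
                self.idle = set()


node = Node([
    Process("mpi1", frozenset({1, 2})),
    Process("mpi2", frozenset({3, 4})),  # CPU numbers 3 and 4 are assumed
])

node.enter_blocking_call("mpi1")
during = sorted(node.processes["mpi2"].active)

node.leave_blocking_call("mpi1")
after = {n: sorted(p.active) for n, p in node.processes.items()}
```

While process 1 is blocked, `during` holds all four CPUs for process 2; once process 1 resumes, `after` shows each process back on its own two CPUs.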

What does DLB need?

DLB needs more than one process running in the same computing node (with shared memory). These processes can be MPI processes of the same application or processes of different applications.

The processes must use a shared memory parallel programming model (the current version supports OpenMP and OmpSs).

The shared memory programming model must be malleable, both from the point of view of the programming model (OpenMP and OmpSs are highly malleable) and from the point of view of the application (it must not depend on a particular number of threads).
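What malleability means from the application's side can be sketched with a small serial simulation. It is an illustration only, with hypothetical names (`run_malleable`, `workers_per_step`): the work is cut into small independent items, and each round processes as many items as there are workers available at that moment, so the result does not depend on the team size.

```python
def run_malleable(items, workers_per_step, work=lambda x: x * x):
    """Process items in rounds, using the team size reported for each round.

    workers_per_step is a hypothetical stand-in for the number of threads
    the runtime makes available; its last value is reused once exhausted.
    """
    pending = list(items)
    results = []
    step = 0
    while pending:
        n = workers_per_step[min(step, len(workers_per_step) - 1)]
        if n < 1:
            raise ValueError("each round needs at least one worker")
        batch, pending = pending[:n], pending[n:]
        results.extend(work(x) for x in batch)  # one item per worker
        step += 1
    return results, step


# Fixed team of 2 workers versus a team that grows to 4 in the third round.
fixed_results, fixed_rounds = run_malleable(range(8), [2])
grown_results, grown_rounds = run_malleable(range(8), [2, 2, 4])
```

Both runs produce the same results, but the second needs fewer rounds because it can use the extra workers as soon as they appear. An application that instead partitions its data once according to an initial thread count could not take advantage of them.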

fig. 1: Hybrid application with nested parallelism

fig. 2: Typical HPC application life cycle

fig. 3: HPC application life cycle with DLB

fig. 4: DLB interposition mechanism used with MPI applications

fig. 5: Example of DLB load balancing algorithm