04-DLB

The DLB library aims to improve the load balance of hybrid HPC applications (i.e., applications with two levels of parallelism). DLB improves the load balance of the outer level of parallelism (e.g., MPI) by redistributing computational resources at the inner level of parallelism (e.g., OpenMP). This readjustment is done dynamically at run time, which allows DLB to react to different sources of imbalance: algorithm, data, hardware architecture, and resource availability, among others. For more information, see: https://pm.bsc.es/ftp/dlb/doc/user-guide/

Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) is a highly simplified application, hard-coded to solve only a simple Sedov blast problem with analytic answers, but it represents the numerical algorithms, data motion, and programming style typical of scientific C or C++ applications. LULESH represents a typical hydrocode, like ALE3D. It approximates the hydrodynamics equations discretely by partitioning the spatial problem domain into a collection of volumetric elements defined by a mesh. A node on the mesh is a point where mesh lines intersect. LULESH is built on the concept of an unstructured hex mesh: instead of relying on a regular grid layout, it uses indirection arrays that define the mesh relationships.

LULESH has been parallelized using MPI and OpenMP. Additionally, we can easily create load imbalance between MPI processes by passing a couple of flags to the program. Because of that, LULESH is a good example for understanding how DLB works and learning to use it.

Goals of this exercise

  • Analyze an execution with Paraver and determine the sources of load imbalance.
  • Modify an MPI+OpenMP application to properly use the DLB library.
  • Compare performance results between the initial version and your candidate version.
  • Compare Paraver traces between the initial version and your candidate version.

Exercise instructions

First of all, we'll start by compiling the program as it is:

source configure.sh
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-g3 -O3" -DMPI_CXX_COMPILER=`which mpicxx` ..
make

This should generate the binary file lulesh2.0. Let's execute it and obtain our first Paraver trace:

cd ../runs
bsub < run.sh

This will send the job to the execution queue. You can check the execution progress with:

bjobs

Congratulations! Once the job finishes, you will have lulesh_nodlb.prv, your first LULESH trace.

Open the trace and analyze it to find the main sources of load imbalance, especially which parallel regions are executed after some threads have started to wait at an MPI call:

wxparaver lulesh_nodlb.prv

You can find the most relevant Paraver config files inside the ./cfgs folder. Additionally, you can find all config files in the ./all_cfgs folder.

To use DLB with an MPI+OpenMP program, we need to make some modifications to the source code, placing calls to the DLB API at specific points. In particular, before a parallel region it's necessary to call DLB_Borrow to borrow CPUs from the system. These are the calls we'll be using:

int DLB_Init(int ncpus, const_dlb_cpu_set_t mask, const char *dlb_args)
int DLB_Finalize(void)
int DLB_Borrow(void)

You can find more information about these calls, and others from the DLB API, at https://pm.bsc.es/ftp/dlb/doc/user-guide/api.html

Got them? Fine, time to edit lulesh.cc to add some DLB calls. First, make a copy of lulesh.cc in case you want to keep the original source file, and then open lulesh.cc:

cp lulesh.cc lulesh.cc_org
vim lulesh.cc

First of all, include the DLB header file "dlb.h". Next, call DLB_Init right after MPI has been initialized and DLB_Finalize right before MPI is finalized. Finally, put a DLB_Borrow call before each parallel region you have identified in the Paraver trace.
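The placement described above can be sketched as follows. This is a minimal illustration, not the actual LULESH changes: start_dlb, stop_dlb, and sum_of_squares are hypothetical helper names, the kernel stands in for one of the parallel regions you identified in the trace, and the arguments passed to DLB_Init (0 CPUs, a null mask, and the "--lewi" option string) are assumptions that should be checked against the DLB user guide and your run.sh. So that the sketch also compiles on a machine without DLB, it falls back to do-nothing stand-ins when dlb.h is not found (this needs a C++17 compiler).

```cpp
// Sketch only: where the three DLB calls could go in an MPI+OpenMP code.
#if __has_include(<dlb.h>)
#  include <dlb.h>   // real DLB API
#else
// Do-nothing stand-ins (NOT part of DLB) so the sketch builds without it.
typedef const void *const_dlb_cpu_set_t;
enum { DLB_SUCCESS = 0 };
static int DLB_Init(int, const_dlb_cpu_set_t, const char *) { return DLB_SUCCESS; }
static int DLB_Finalize(void) { return DLB_SUCCESS; }
static int DLB_Borrow(void) { return DLB_SUCCESS; }
#endif

// Hypothetical helper: call it right after MPI has been initialized.
// Assumption: ncpus = 0 and a null mask leave CPU detection to DLB, and
// "--lewi" selects the lend-when-idle policy; verify both in the user guide.
bool start_dlb() {
    return DLB_Init(0, nullptr, "--lewi") == DLB_SUCCESS;
}

// Hypothetical helper: call it right before MPI is finalized.
bool stop_dlb() {
    return DLB_Finalize() == DLB_SUCCESS;
}

// Hypothetical kernel standing in for one of the OpenMP loops in lulesh.cc.
double sum_of_squares(int n) {
    // Ask DLB for CPUs just before the parallel region starts. The return
    // code is ignored here; the loop is correct with or without extra CPUs.
    DLB_Borrow();

    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<double>(i) * i;
    }
    return total;
}
```

In lulesh.cc the same pattern would mean one DLB_Init call after the MPI initialization in main, one DLB_Finalize call before the MPI finalization, and a DLB_Borrow call in front of each parallel region that runs while other processes are already waiting at an MPI call.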

Nice! You now have an MPI+OpenMP application annotated with DLB, so it's time to compile and execute it. We'll start by modifying the build configuration to link with the DLB library. Go to the build directory and open the CMake cache:

ccmake .

Here, toggle advanced mode (press t) and add these lines to CMAKE_CXX_FLAGS_RELEASE and CMAKE_EXE_LINKER_FLAGS_RELEASE, respectively:

-O3 -DNDEBUG -I"/apps/PM/dlb/git/impi/include"
-L"/apps/PM/dlb/git/impi/lib" -ldlb_mpi -Wl,-rpath,"/apps/PM/dlb/git/impi/lib"

Now configure, generate, and exit CMake (press c, then g). The Makefile is ready to go, so we can proceed to compile our program and execute it. Before submitting, it's important to open run.sh, uncomment the "run with DLB" lines, and comment out the "run without DLB" lines.

Finally, submit the job, compare the execution times of the two versions, and open both traces with Paraver to compare them:

bsub < run.sh
wxparaver lulesh_nodlb.prv lulesh_dlb.prv