dlb_talp.h

Date:

Wed Apr 3 2024

NAME

dlb_talp.h

SYNOPSIS


#include <time.h>
#include <stdint.h>

Data Structures

struct dlb_monitor_t
struct dlb_pop_metrics_t
struct dlb_node_metrics_t
struct dlb_node_times_t

Macros

#define DLB_MPI_REGION NULL

Enumerations

enum { DLB_MONITOR_NAME_MAX = 128 }

Functions

int DLB_TALP_Attach (void)
Attach current process to DLB system as TALP administrator.
int DLB_TALP_Detach (void)
Detach current process from DLB system.
int DLB_TALP_GetNumCPUs (int *ncpus)
Get the number of CPUs in the node.
int DLB_TALP_GetPidList (int *pidlist, int *nelems, int max_len)
Get the list of running processes registered in the DLB system.
int DLB_TALP_GetTimes (int pid, double *mpi_time, double *useful_time)
Get the CPU time spent on MPI and useful computation for the given process.
int DLB_TALP_GetNodeTimes (const char *name, dlb_node_times_t *node_times_list, int *nelems, int max_len)
Get the list of raw times for the specified region.
int DLB_TALP_QueryPOPNodeMetrics (const char *name, dlb_node_metrics_t *node_metrics)
From either 1st or 3rd party, query node metrics for one region.
const dlb_monitor_t * DLB_MonitoringRegionGetMPIRegion (void)
Get the pointer of the implicit MPI Monitoring Region.
dlb_monitor_t * DLB_MonitoringRegionRegister (const char *name)
Register a new Monitoring Region.
int DLB_MonitoringRegionReset (dlb_monitor_t *handle)
Reset monitoring region.
int DLB_MonitoringRegionStart (dlb_monitor_t *handle)
Start (or unpause) monitoring region.
int DLB_MonitoringRegionStop (dlb_monitor_t *handle)
Stop (or pause) monitoring region.
int DLB_MonitoringRegionReport (const dlb_monitor_t *handle)
Print a report of the monitoring region to stderr.
int DLB_MonitoringRegionsUpdate (void)
Update all monitoring regions.
int DLB_TALP_CollectPOPMetrics (dlb_monitor_t *monitor, dlb_pop_metrics_t *pop_metrics)
Perform an MPI collective communication to collect POP metrics.
int DLB_TALP_CollectPOPNodeMetrics (dlb_monitor_t *monitor, dlb_node_metrics_t *node_metrics)
Perform a node collective communication to collect TALP node metrics.
int DLB_TALP_CollectNodeMetrics (dlb_monitor_t *monitor, dlb_node_metrics_t *node_metrics) __attribute__((deprecated("DLB_TALP_CollectPOPNodeMetrics")))

Macro Definition Documentation

#define DLB_MPI_REGION NULL

Enumeration Type Documentation

anonymous enum

Enumerator

DLB_MONITOR_NAME_MAX

Function Documentation

int DLB_TALP_Attach (void)

Attach current process to DLB system as TALP administrator.

Returns

DLB_SUCCESS on success

Once the process is attached to DLB as TALP administrator, it may call the functions described in this file. This lets the process obtain TALP values, such as the time spent in computation or in MPI, for each process running under DLB.

int DLB_TALP_Detach (void)

Detach current process from DLB system.

Returns

DLB_SUCCESS on success

DLB_ERR_NOSHMEM if no shared memory is found to detach from

If previously attached, a process must call this function to correctly close file descriptors and clean data.

int DLB_TALP_GetNumCPUs (int * ncpus)

Get the number of CPUs in the node.

Parameters

ncpus the number of CPUs

Returns

DLB_SUCCESS on success

int DLB_TALP_GetPidList (int * pidlist, int * nelems, int max_len)

Get the list of running processes registered in the DLB system.

Parameters

pidlist The output list
nelems Number of elements in the list
max_len Max capacity of the list

Returns

DLB_SUCCESS on success

DLB_ERR_NOSHMEM if the shared memory cannot be found

int DLB_TALP_GetTimes (int pid, double * mpi_time, double * useful_time)

Get the CPU time spent on MPI and useful computation for the given process.

Parameters

pid target Process ID
mpi_time CPU time spent on MPI in seconds
useful_time CPU time spent on useful computation in seconds

Returns

DLB_SUCCESS on success

DLB_ERR_NOPROC if target pid is not registered in the DLB system
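
The attach, list, query and detach calls documented above can be combined into a small third-party polling routine. The following is a minimal sketch, not DLB's own code: the DLB functions are replaced by stand-ins that return made-up values so that the sketch compiles without the library, and report_all_processes, useful_fraction and MAX_PIDS are hypothetical names introduced only for illustration. The ratio computed here is a plain share of useful time and is not claimed to match any POP metric definition.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins so that this sketch compiles without DLB installed; the values
 * they return are made up. When building against DLB, delete this block and
 * include <dlb_talp.h> plus the DLB header that defines the error codes. */
enum { DLB_SUCCESS = 0 };
static int DLB_TALP_Attach(void) { return DLB_SUCCESS; }
static int DLB_TALP_Detach(void) { return DLB_SUCCESS; }
static int DLB_TALP_GetPidList(int *pidlist, int *nelems, int max_len) {
    int n = max_len < 2 ? max_len : 2;          /* pretend two processes exist */
    for (int i = 0; i < n; ++i) pidlist[i] = 1000 + i;
    *nelems = n;
    return DLB_SUCCESS;
}
static int DLB_TALP_GetTimes(int pid, double *mpi_time, double *useful_time) {
    (void)pid;
    *mpi_time = 1.0;
    *useful_time = 3.0;
    return DLB_SUCCESS;
}

/* Hypothetical helper: share of the measured CPU time that was useful
 * computation, or 0 when nothing has been measured yet. */
double useful_fraction(double mpi_time, double useful_time) {
    double total = mpi_time + useful_time;
    return total > 0.0 ? useful_time / total : 0.0;
}

/* Hypothetical helper: attach, print one line per registered process, detach.
 * Returns the number of processes reported, or -1 if attaching failed. */
int report_all_processes(void) {
    enum { MAX_PIDS = 64 };                     /* arbitrary buffer capacity */
    int pids[MAX_PIDS];
    int npids = 0;
    int reported = 0;

    if (DLB_TALP_Attach() != DLB_SUCCESS) return -1;

    if (DLB_TALP_GetPidList(pids, &npids, MAX_PIDS) == DLB_SUCCESS) {
        /* Assumption: nelems never exceeds the max_len that was passed in. */
        assert(npids <= MAX_PIDS);
        for (int i = 0; i < npids; ++i) {
            double mpi = 0.0, useful = 0.0;
            /* Skip any pid for which the times cannot be read. */
            if (DLB_TALP_GetTimes(pids[i], &mpi, &useful) != DLB_SUCCESS) continue;
            printf("pid %d: mpi %.3f s, useful %.3f s, useful share %.2f\n",
                   pids[i], mpi, useful, useful_fraction(mpi, useful));
            ++reported;
        }
    }

    DLB_TALP_Detach();                          /* always detach once attached */
    return reported;
}
```

A profiler would typically call report_all_processes() periodically from its own main loop. Detaching on every path after a successful attach follows the requirement stated for DLB_TALP_Detach.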

int DLB_TALP_GetNodeTimes (const char * name, dlb_node_times_t* node_times_list, int * nelems, int max_len)

Get the list of raw times for the specified region.

Parameters

name Name to identify the region
node_times_list The output list
nelems Number of elements in the list
max_len Max capacity of the list

Returns

DLB_SUCCESS on success

DLB_ERR_NOSHMEM if the shared memory cannot be found

int DLB_TALP_QueryPOPNodeMetrics (const char * name, dlb_node_metrics_t* node_metrics)

From either 1st or 3rd party, query node metrics for one region.

Parameters

name Name to identify the region
node_metrics Allocated structure where the collected metrics will be stored

Returns

DLB_SUCCESS on success

DLB_ERR_NOENT if no data for the given name

If called from a third party, this function requires the application to run with DLB_ARGS+=" --talp-external-profiler".

const dlb_monitor_t* DLB_MonitoringRegionGetMPIRegion (void)

Get the pointer of the implicit MPI Monitoring Region.

DISCLAIMER: The functions declared above may be called from either 1st-party or 3rd-party programs; that is, from DLB applications or from external profilers, as long as they invoke DLB_TALP_Attach.

The functions declared below are intended to be called only from 1st-party programs, and they should return an error if they are called from external profilers.

This header file may be split in two in the next major release.

Returns

monitor handle to be used on queries, or NULL if TALP is not enabled

dlb_monitor_t* DLB_MonitoringRegionRegister (const char * name)

Register a new Monitoring Region.

Parameters

name Name to identify the new region

Returns

monitor handle to be used on subsequent calls, or NULL if TALP is not enabled

int DLB_MonitoringRegionReset (dlb_monitor_t* handle)

Reset monitoring region.

Parameters

handle Monitoring handle that identifies the region, or DLB_MPI_REGION

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled

int DLB_MonitoringRegionStart (dlb_monitor_t* handle)

Start (or unpause) monitoring region.

Parameters

handle Monitoring handle that identifies the region, or DLB_MPI_REGION

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled

int DLB_MonitoringRegionStop (dlb_monitor_t* handle)

Stop (or pause) monitoring region.

Parameters

handle Monitoring handle that identifies the region, or DLB_MPI_REGION

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled

int DLB_MonitoringRegionReport (const dlb_monitor_t* handle)

Print a report of the monitoring region to stderr.

Parameters

handle Monitoring handle that identifies the region, or DLB_MPI_REGION

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled
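
The register, start, stop and report calls documented above are meant to be used together from a 1st-party program. The following is a minimal sketch of that lifecycle, not DLB's own code: the DLB functions and the dlb_monitor_t type are replaced by stand-ins so that the sketch compiles without the library, the numeric value given to DLB_ERR_NOTALP is a placeholder, and monitor_loop, sample_work and work_calls are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so that this sketch compiles without DLB installed. The struct
 * fields and the error value are made up. When building against DLB, delete
 * this block and include <dlb_talp.h> plus the DLB header that defines the
 * error codes. */
enum { DLB_SUCCESS = 0, DLB_ERR_NOTALP = -1 /* placeholder value */ };
typedef struct { int stub_running; int stub_completed; } dlb_monitor_t;
static dlb_monitor_t stub_region;
static dlb_monitor_t *DLB_MonitoringRegionRegister(const char *name) {
    (void)name;
    return &stub_region;
}
static int DLB_MonitoringRegionStart(dlb_monitor_t *handle) {
    handle->stub_running = 1;
    return DLB_SUCCESS;
}
static int DLB_MonitoringRegionStop(dlb_monitor_t *handle) {
    handle->stub_running = 0;
    handle->stub_completed += 1;
    return DLB_SUCCESS;
}
static int DLB_MonitoringRegionReport(const dlb_monitor_t *handle) {
    (void)handle;
    return DLB_SUCCESS;
}

/* Hypothetical workload: stands for the code that should be measured. */
int work_calls = 0;
void sample_work(void) { ++work_calls; }

/* Hypothetical helper: run `work` `iterations` times, each inside the named
 * region, then print the region report. Returns DLB_SUCCESS or the first
 * error code encountered. */
int monitor_loop(const char *name, int iterations, void (*work)(void)) {
    assert(work != NULL);

    dlb_monitor_t *region = DLB_MonitoringRegionRegister(name);
    if (region == NULL) return DLB_ERR_NOTALP;  /* NULL means TALP is not enabled */

    for (int i = 0; i < iterations; ++i) {
        int err = DLB_MonitoringRegionStart(region);   /* start or unpause */
        if (err != DLB_SUCCESS) return err;

        work();                                        /* measured section */

        err = DLB_MonitoringRegionStop(region);        /* stop or pause */
        if (err != DLB_SUCCESS) return err;
    }

    return DLB_MonitoringRegionReport(region);         /* report goes to stderr */
}
```

A call such as monitor_loop("solver", 3, sample_work) registers the region once and pauses it between iterations, so only the time spent inside work() is attributed to it.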

int DLB_MonitoringRegionsUpdate (void)

Update all monitoring regions.

Returns

DLB_SUCCESS on success

Monitoring regions are only updated in certain situations, such as when starting or stopping a region, or when finalizing MPI. This routine forces an update of all started monitoring regions.

int DLB_TALP_CollectPOPMetrics (dlb_monitor_t* monitor, dlb_pop_metrics_t* pop_metrics)

Perform an MPI collective communication to collect POP metrics.

Parameters

monitor Monitoring handle that identifies the region, or DLB_MPI_REGION macro (NULL) if implicit MPI region
pop_metrics Allocated structure where the collected metrics will be stored

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled

int DLB_TALP_CollectPOPNodeMetrics (dlb_monitor_t* monitor, dlb_node_metrics_t* node_metrics)

Perform a node collective communication to collect TALP node metrics.

Parameters

monitor Monitoring handle that identifies the region, or DLB_MPI_REGION macro (NULL) if implicit MPI region
node_metrics Allocated structure where the collected metrics will be stored

Returns

DLB_SUCCESS on success

DLB_ERR_NOTALP if TALP is not enabled

DLB_ERR_NOCOMP if support for barrier is disabled, i.e., --no-barrier

This function performs a node barrier to collect the data. All processes that are running in the node must invoke this function.

int DLB_TALP_CollectNodeMetrics (dlb_monitor_t* monitor, dlb_node_metrics_t* node_metrics)

Deprecated. Use DLB_TALP_CollectPOPNodeMetrics instead.

Author

Generated automatically by Doxygen for Dynamic Load Balance from the source code.