AMD: slurmstepd: error: mpi/pmi2: value not properly terminated in client request
It looks like Intel MPI is having trouble communicating with SLURM via PMI2. The experiment in question, launched by the new siege stage /nix/store/6n4g4hapqs3768wiyqdaz3smj8q0hwsg-siege, uses the default slurm from nixpkgs, /nix/store/4kaywdsrqzfx31zxww3c6lgvfnf2sg0g-slurm-20.02.6.1, which doesn't match the version running in the AMD cluster (20.02.4).
Additional debug information is enabled using PMI_DEBUG=1. The error is:
cpu-bind=MASK - amd15, task 0 0 [85339]: mask 0x10000000000000001 set
cpu-bind=MASK - amd16, task 1 0 [85478]: mask 0x10000000000000001 set
slurmstepd: error: mpi/pmi2: value not properly terminated in client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd: error: mpi/pmi2: full request is: 000000000000000000000000000000000000000000000
cmd=put kvsname=18137.0 key=bc-1-seg-2/3 value=0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
slurmstepd: error: mpi/pmi2: invalid client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd: error: mpi/pmi2: value not properly terminated in client request
slurmstepd: error: mpi/pmi2: full request is: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
slurmstepd: error: mpi/pmi2: invalid client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd: error: mpi/pmi2: full request is: 000000000000000000000000000000000000000000000
cmd=put kvsname=18137.0 key=bc-0-seg-2/3 value=0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
slurmstepd: error: mpi/pmi2: invalid client request
slurmstepd: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd: error: mpi/pmi2: full request is: 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
slurmstepd: error: mpi/pmi2: invalid client request
More detail can be seen with strace (the slurmstepd error messages are interleaved with the trace output):
[pid 11863] write(13, "cmd=put kvsname=18138.0 key=bc-0 value=segments=3\n", 50) = 50
[pid 11863] read(13, "cmd=put_result rc=0\n", 1023) = 20
[pid 11863] write(13, "cmd=put kvsname=18138.0 key=bc-0-seg-1/3 value=mpi#5355CA994A1F0B6F4008394714C8035B020EA7D377CC2B32004C3E5077CCAB33004F130088572E0040040000C041083947
14C8035B020E8D5377CC2B32004C3E5077CCAB33004F13008802000000000000002200637577CC2B3200F8D74F00000000004F0300886A9A7596AC06426F2403211100278AB00FA133EF48B8D1B00F213532DF0A0000
7AD477CC2B33EF48B8D1B00F213532DF0A00010031D15C7CE133EF48B8D1E1BE233532030F000362B800B30F77CCAB33EF48B8D1E1BE233532030F008363B8002688394714C8035B020E478B95BFD63400242E5077CC
AB330092010084572E0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000\n", 1070) = 1070
[pid 11863] read(13, slurmstepd: error: mpi/pmi2: value not properly terminated in client request
slurmstepd: error: mpi/pmi2: request not begin with "cmd=put_result rc=0\n", 1023) = 20
'cmd='
slurmstepd: error: mpi/pmi2: full request is: 000000000000000000000000000000000000000000000
[pid 11863] write(13, "cmd=put kvsname=18138.0 key=bc-0-seg-2/3 value=000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\n", 1070) = 1070
slurmstepd: error: mpi/pmi2: invalid client request
[pid 11863] read(13, slurmstepd: error: mpi/pmi2: value not properly terminated in client request
slurmstepd: error: mpi/pmi2: "cmd=put_result rc=0\n", 1023) = 20
request not begin with 'cmd='
[pid 11863] write(13, "cmd=put kvsname=18138.0 key=bc-0-seg-3/3 value=00000000$\n", 57slurmstepd: error: mpi/pmi2: full request is: 000000000000000000000000000000000000000000000
) = 57
slurmstepd: error: mpi/pmi2: invalid client request
[pid 11863] read(13, "cmd=put_result rc=0\n", 1023) = 20
[pid 11863] write(13, "cmd=barrier_in\n", 15) = 15
[pid 11863] read(13, "cmd=barrier_out rc=0\n", 1023) = 21
[pid 11863] write(13, "cmd=get kvsname=18138.0 key=bc-0\n", 33) = 33
[pid 11863] read(13, "cmd=get_result rc=0 value=segments=3\n", 1023) = 37
[pid 11863] write(13, "cmd=get kvsname=18138.0 key=bc-0-seg-1/3\n", 41) = 41
[pid 11863] read(13, "cmd=get_result rc=1\n", 1023) = 20
[pid 11863] write(2, "Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:\nMPIR_Init_thread(138)........: \nMPID_Init(1139)..............: \nMPIDI_OFI_mpi_init_hook(1647): \nMPIDU_bc_table_create(333)...: \n", 229Abort(2664079) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1139)..............:
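It looks like PMI2 is only reading 1024 bytes at a time, which truncates the value of the 1070-byte put requests. It then tries to parse the leftover bytes as a new request, which causes an error because they do not begin with a recognized cmd. The segment does not seem to be stored: the later get for bc-0-seg-1/3 returns rc=1 and PMPI_Init aborts.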
Let's see if this is fixed when using the same slurm version as the cluster.
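
To make the truncation hypothesis concrete, here is a minimal Python sketch of a reader that handles a request in chunks of at most 1024 bytes. It is a hypothetical model, not Slurm's actual code: BUF_SIZE, split_into_reads, classify and build_put are names invented for illustration, and the 1024-byte chunk size is only the assumption stated above. It rebuilds a 1070-byte put request like the one in the strace output and shows how it would be seen by such a reader.

```python
# Hypothetical model of a fixed-size request reader; NOT Slurm's actual code.
BUF_SIZE = 1024  # assumed chunk size, taken from the hypothesis above


def split_into_reads(request: bytes, buf_size: int = BUF_SIZE) -> list:
    """Return the chunks seen if each read yields at most buf_size bytes."""
    return [request[i:i + buf_size] for i in range(0, len(request), buf_size)]


def classify(chunk: bytes) -> str:
    """Mimic the two complaints seen in the log for a single chunk."""
    if not chunk.startswith(b"cmd="):
        return "request not begin with 'cmd='"
    if not chunk.endswith(b"\n"):
        return "value not properly terminated"
    return "ok"


def build_put(kvsname: str, key: str, total_len: int) -> bytes:
    """Build a 'cmd=put' line zero-padded to total_len bytes, newline included."""
    head = f"cmd=put kvsname={kvsname} key={key} value=".encode()
    return head + b"0" * (total_len - len(head) - 1) + b"\n"


if __name__ == "__main__":
    # Same size as the write(13, ..., 1070) calls in the strace output.
    request = build_put("18138.0", "bc-0-seg-2/3", 1070)
    for chunk in split_into_reads(request):
        print(len(chunk), classify(chunk))
```

Under these assumptions the first chunk is 1024 bytes long and lacks the terminating newline, and the remaining 46 bytes (45 zeros plus the newline) no longer start with cmd=, which matches the order of the two error messages in the log.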