Tool attaches after PALS barrier #19

@kent-cheung-arm

In moderately sized PALS jobs, the tool sometimes attaches after the PALS start barrier, even though cti_releaseAppBarrier has not yet been called.

Attaching at barrier as expected:

[57.938218490]info stack
&"info stack\n"
~"#0  0x00002b99d2b92707 in kill () from /lib64/libc.so.6\n"
~"#1  0x00002b99d4971123 in pals_start_barrier (state=state@entry=0x2b99d3a0db00) at /workspace/rpmbuild/BUILD/cray-pals-1.1.3/src/libpals/libpals.c:843\n"
~"#2  0x00002b99d37ff64e in _pmi_pals_sync () at /workspace/src/pals/pals_utils.c:408\n"
~"#3  0x00002b99d37f6ab4 in _pmi_init (spawned=spawned@entry=0x7ffd06de0c1c) at /workspace/src/pmi_core/_pmi_init.c:1431\n"
~"#4  0x00002b99d37f74f4 in _pmi_constructor () at /workspace/src/pmi_core/_pmi_init.c:366\n"
~"#5  0x00002b99cf708aba in call_init.part () from /lib64/ld-linux-x86-64.so.2\n"
~"#6  0x00002b99cf708bc6 in _dl_init () from /lib64/ld-linux-x86-64.so.2\n"
~"#7  0x00002b99cf6f9eda in _dl_start_user () from /lib64/ld-linux-x86-64.so.2\n"
~"#8  0x0000000000000002 in ?? ()\n"
~"#9  0x00007ffd06de261e in ?? ()\n"
~"#10 0x00007ffd06de263e in ?? ()\n"
~"#11 0x0000000000000000 in ?? ()\n"

Attaching at MPI_Init after barrier:

[58.538222668]info stack
&"info stack\n"
~"#0  0x00002b10b6ff64eb in _pmi_smp_barrier_join (smp_bar=0x2b10b7218310, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/smp_barrier.c:81\n"
~"#1  0x00002b10b6fee137 in _pmi_barrier (bar_tag=bar_tag@entry=BARRIER_PACKET, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/_pmi_barrier.c:50\n"
~"#2  0x00002b10b6ff90d1 in PMI_Barrier () at /workspace/src/api/coll/pmi_barrier.c:27\n"
~"#3  0x00002b10b6ff9977 in PMI2_Init (spawned=0x7ffc4ecea6a0, size=0x7ffc4ecea6a8, rank=0x7ffc4ecea6a4, appnum=0x7ffc4ecea6ac) at /workspace/src/api/misc/pmi_init.c:182\n"
~"#4  0x00002b10b577bd41 in MPIR_pmi_init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#5  0x00002b10b5780f76 in MPID_Init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#6  0x00002b10b3cec96d in MPIR_Init_thread () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#7  0x00002b10b3cec744 in PMPI_Init () from /opt/cray/pe/lib64/libmpi_gnu_91.so.12\n"
~"#8  0x000000000040138d in main (argc=2, argv=0x7ffc4ecea908)\n"

The call to cti_releaseAppBarrier occurred at 59.749728688 in this run, about 1.2 seconds after the `info stack` at 58.538222668 above.
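
To make the two cases easier to tell apart in an automated test, the check can be sketched roughly as follows. This is only an illustration, not part of CTI or the tool: the helper names (`frame_functions`, `held_at_pals_barrier`, `log_timestamp`, `attached_past_barrier_early`) are hypothetical, and it assumes the log format shown above, namely a `[seconds]` prefix on the command line and GDB/MI console lines of the form `~"#N  0x... in func (...)`. It treats a backtrace as "held at the barrier" if any frame is `pals_start_barrier`.

```python
import re

# Leading "[seconds]" timestamp on a log line, e.g. "[57.938218490]info stack".
TIMESTAMP_RE = re.compile(r"^\[(\d+\.\d+)\]")
# Function name in a GDB/MI console frame line, e.g. '~"#1  0x... in func (...'.
FRAME_RE = re.compile(r'^~"#\d+\s+(?:0x[0-9a-f]+ in )?(\S+) \(')


def frame_functions(log_lines):
    """Return the function names of all backtrace frames, innermost first."""
    funcs = []
    for line in log_lines:
        match = FRAME_RE.match(line)
        if match:
            funcs.append(match.group(1))
    return funcs


def held_at_pals_barrier(log_lines):
    """True if the backtrace contains a pals_start_barrier frame."""
    return "pals_start_barrier" in frame_functions(log_lines)


def log_timestamp(line):
    """Parse the "[seconds]" prefix of a log line, or return None if absent."""
    match = TIMESTAMP_RE.match(line)
    return float(match.group(1)) if match else None


def attached_past_barrier_early(log_lines, release_time):
    """True if the backtrace was taken before release_time but is already past the barrier."""
    attach_time = log_timestamp(log_lines[0])
    return attach_time < release_time and not held_at_pals_barrier(log_lines)


# First lines of the two logs from this report (raw strings keep the literal \n).
AT_BARRIER = [
    "[57.938218490]info stack",
    r'~"#0  0x00002b99d2b92707 in kill () from /lib64/libc.so.6\n"',
    r'~"#1  0x00002b99d4971123 in pals_start_barrier (state=state@entry=0x2b99d3a0db00) at /workspace/rpmbuild/BUILD/cray-pals-1.1.3/src/libpals/libpals.c:843\n"',
]
AFTER_BARRIER = [
    "[58.538222668]info stack",
    r'~"#0  0x00002b10b6ff64eb in _pmi_smp_barrier_join (smp_bar=0x2b10b7218310, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/smp_barrier.c:81\n"',
    r'~"#1  0x00002b10b6fee137 in _pmi_barrier (bar_tag=bar_tag@entry=BARRIER_PACKET, restrict_to_app=restrict_to_app@entry=0) at /workspace/src/pmi_core/_pmi_barrier.c:50\n"',
]

# Time of the cti_releaseAppBarrier call in the failing run.
RELEASE_TIME = 59.749728688

if __name__ == "__main__":
    print("failing run past barrier early:",
          attached_past_barrier_early(AFTER_BARRIER, RELEASE_TIME))
```

Matching on the frame name rather than on the innermost frame is deliberate: in the expected case the process is stopped in `kill ()`, with `pals_start_barrier` one frame up.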
