PerfMap / CodeFragmentHeap lock-ordering deadlock during GC suspension #128401

@leculver

Description

A .NET 10 process on Linux can hang permanently when the DOTNET_PerfMapEnabled code path is active (env var or runtime-enabled via the Diagnostics Server IPC, e.g. dotnet-trace --enable-perfmap). The hang is a three-way deadlock between the GC suspension machinery, CodeFragmentHeap::m_Lock, and PerfMap::s_csPerfMap, triggered while a virtual-call-stub resolve worker is generating a new resolve stub.

The two locks are taken in the wrong order with respect to GC-mode handling:

  • CodeFragmentHeap::m_Lock is constructed with CRST_UNSAFE_ANYMODE, so acquiring it does not toggle the calling thread to preemptive GC.
  • PerfMap::s_csPerfMap is constructed with CRST_DEFAULT (no flags), so acquiring it does toggle a cooperative thread to preemptive for the acquisition. The switch back to cooperative goes through Thread::RareDisablePreemptiveGC, which can block while a GC suspension is in progress.

The result: a cooperative thread holding m_Lock calls into PerfMap and gets stuck inside RareDisablePreemptiveGC waiting for the GC to finish. The GC, in turn, cannot finish suspending the runtime, because every other cooperative thread that needs to allocate a stub is blocked on the held m_Lock and never reaches a safe point.
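The interaction between the two lock flavors can be illustrated with a toy model. This is a minimal sketch, not CoreCLR code: `ToyThread`, `toy_acquire`, and the `kDefault` / `kUnsafeAnyMode` constants are hypothetical stand-ins for the behavior described above, and the "hazard" condition is simply the shape reported in this issue (a mode toggle that may wait on the GC while an any-mode lock is already held).

```cpp
#include <cassert>

// Toy model only: names mirror the issue text, not the real CoreCLR types.
enum class GcMode { Cooperative, Preemptive };

enum CrstFlags : unsigned {
    kDefault       = 0x0,  // hypothetical stand-in for CRST_DEFAULT
    kUnsafeAnyMode = 0x1,  // hypothetical stand-in for CRST_UNSAFE_ANYMODE
};

struct ToyThread {
    GcMode mode = GcMode::Cooperative;
    int anymode_locks_held = 0;  // any-mode locks currently held by this thread
};

struct AcquireResult {
    bool toggled;  // did the acquire toggle the thread's GC mode?
    bool hazard;   // could the toggle wait on a GC while an any-mode lock is held?
};

// Assumed behavior, per the issue: a default lock toggles a cooperative
// thread to preemptive around the acquire; an any-mode lock never toggles.
AcquireResult toy_acquire(ToyThread& t, unsigned flags, bool gc_suspension_pending) {
    const bool anymode = (flags & kUnsafeAnyMode) != 0;
    const bool toggled = !anymode && t.mode == GcMode::Cooperative;
    const bool hazard  = toggled && gc_suspension_pending && t.anymode_locks_held > 0;
    if (anymode) {
        ++t.anymode_locks_held;
    }
    return {toggled, hazard};
}
```

Acquiring the any-mode lock first and the default lock second, with a suspension pending, is the only sequence in this model that reports a hazard. Taking the default lock alone does not.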

Reproduction Steps

Likely synthetic repro recipe:

  1. Run any .NET 10 Linux x64 workload with PerfMap enabled (DOTNET_PerfMapEnabled=1, or attach with dotnet-trace collect -p <pid> --enable-perfmap).
  2. Drive heavy virtual-call-stub creation (large numbers of polymorphic interface call sites being warmed up concurrently) while also driving allocation that forces frequent GCs.
  3. Eventually a cooperative thread is suspended inside PerfMap::LogStubs -> s_csPerfMap while holding CodeFragmentHeap::m_Lock, and the process deadlocks.

Crash dumps are available to debug this directly, though.

Expected behavior

PerfMap logging on a stub-allocation path must not be able to deadlock with the GC.

I'm not sure what the actual fix is. CodeFragmentHeap::m_Lock was correctly set up to leave us in coop mode because this was meant to be a quick call, whereas the work behind PerfMap::s_csPerfMap is heavyweight, meaning we should let the GC run. This change makes ResolveWorkerAsmStub a more heavyweight function, which may need to swap to preemptive mode, or possibly the calls into PerfMap need to be lighter weight. Or maybe I'm overthinking it.

Actual behavior

The GC thread is suspending the world and waiting on a cooperative thread to reach a safe point.

Thread 60 is preemptive and is acquiring s_csPerfMap from inside PerfMap::LogStubs. The default-flagged Crst toggled it to preemptive for the acquisition, and it now sits indefinitely in RareDisablePreemptiveGC because a GC suspension is already in progress:

02  libcoreclr!GCEvent::Impl::Wait+0xd2                 unix/events.cpp:179
03  libcoreclr!Thread::RareDisablePreemptiveGC+0x14e    threadsuspend.cpp:2223
04  libcoreclr!CrstBase::AcquireLock+0xc                crst.h:174
05  libcoreclr!CrstBase::CrstHolder::CrstHolder+0xc     crosscomp.h:349
06  libcoreclr!PerfMap::LogStubs+0x19a                  perfmap.cpp:462
07  libcoreclr!CodeFragmentHeap::RealAllocAlignedMem+0x122
08  libcoreclr!VirtualCallStubManager::GenerateResolveStub+0xb6
09  libcoreclr!VirtualCallStubManager::ResolveWorker+0x8a8
0a  libcoreclr!VSD_ResolveWorker+0x2e7
0b  libcoreclr!ResolveWorkerAsmStub+0x71

Thread 60 is still holding CodeFragmentHeap::m_Lock from frame 07.

Thread 61 is cooperative, blocked trying to acquire m_Lock (held by thread 60):

00  libc_so!__lll_lock_wait_private+0x90
01  libc_so!pthread_mutex_lock+0x167
02  libcoreclr!CrstBase::Enter+0x94                     crst.cpp:265
03  libcoreclr!CrstBase::AcquireLock+0x5                crst.h:174
04  libcoreclr!CrstBase::CrstHolder::CrstHolder+0x5     crosscomp.h:349
05  libcoreclr!CodeFragmentHeap::RealAllocAlignedMem+0x2a
06  libcoreclr!VirtualCallStubManager::GenerateResolveStub+0xb6
07  libcoreclr!VirtualCallStubManager::ResolveWorker+0x8a8
08  libcoreclr!VSD_ResolveWorker+0x2e7
09  libcoreclr!ResolveWorkerAsmStub+0x71

Because thread 61 is cooperative and the GC is trying to suspend the runtime, the GC thread also waits on this thread to either reach a safe point or go preemptive, which it cannot do because it is parked inside pthread_mutex_lock.

Cycle:

GC  -> waits for cooperative threads to suspend
T60 -> preemptive, in RareDisablePreemptiveGC waiting for GC to finish
       (holds CodeFragmentHeap::m_Lock)
T61 -> cooperative, blocked on CodeFragmentHeap::m_Lock held by T60
       (and the GC is waiting on T61 to suspend)
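
The cycle above can be written down as a wait-for graph and checked mechanically. This is a minimal sketch for illustration: the node names are taken from the cycle listing, while `WaitForGraph`, `has_cycle_from`, and `cycle_from_issue` are hypothetical helpers, and the model assumes each participant waits on at most one other participant.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// waiter -> the participant it is waiting on (at most one edge per waiter).
using WaitForGraph = std::map<std::string, std::string>;

// Follows wait-for edges starting at `start`; returns true if the walk
// revisits a node, i.e. a cycle is reachable from `start`.
bool has_cycle_from(const WaitForGraph& g, const std::string& start) {
    std::set<std::string> seen;
    std::string cur = start;
    while (true) {
        if (!seen.insert(cur).second) {
            return true;   // already visited: cycle
        }
        auto it = g.find(cur);
        if (it == g.end()) {
            return false;  // cur waits on nothing: chain ends
        }
        cur = it->second;
    }
}

// The three edges reported in this issue.
WaitForGraph cycle_from_issue() {
    return {
        {"GC",  "T61"},  // GC waits for cooperative T61 to suspend
        {"T61", "T60"},  // T61 waits for CodeFragmentHeap::m_Lock, held by T60
        {"T60", "GC"},   // T60 waits in RareDisablePreemptiveGC for the GC to finish
    };
}
```

Removing any one edge (for example, T60 no longer waiting on the GC while it holds m_Lock) breaks the cycle in this model.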

Regression?

Likely a regression from #113943.

Known Workarounds

Two environment knobs that, set together, prevent the bad path from being exercised:

DOTNET_EnableDiagnostics_IPC=0
DOTNET_PerfMapEnabled=0
  • DOTNET_PerfMapEnabled=0 ensures PerfMap::s_enabled stays false at startup, so PerfMap::LogStubs early-outs and never touches s_csPerfMap.
  • DOTNET_EnableDiagnostics_IPC=0 shuts off the Diagnostics Server, which is otherwise able to enable PerfMap at runtime via the ds_rt_enable_perfmap IPC handler regardless of the env var (dotnet-trace --enable-perfmap, dotnet-monitor, third-party APM agents).

Both are needed: setting only one leaves the other path open.
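
A launcher or health check could verify that both knobs are set before starting the process. The following is a minimal sketch under that assumption: `workaround_in_effect` and `workaround_in_effect_from_env` are hypothetical helpers, and the check is deliberately strict, since it only accepts the literal value "0" for both variables and treats an unset variable as "not applied".

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// True only if the value is present and is exactly "0".
bool is_zero(const char* value) {
    return value != nullptr && std::strcmp(value, "0") == 0;
}

// Both knobs must be "0" for the workaround described above to count as applied.
bool workaround_in_effect(const char* diagnostics_ipc, const char* perfmap_enabled) {
    return is_zero(diagnostics_ipc) && is_zero(perfmap_enabled);
}

// Convenience wrapper that reads the current process environment.
bool workaround_in_effect_from_env() {
    return workaround_in_effect(std::getenv("DOTNET_EnableDiagnostics_IPC"),
                                std::getenv("DOTNET_PerfMapEnabled"));
}
```

Setting only one of the two variables makes the check return false, matching the note that either path alone can still enable PerfMap.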

Configuration

Likely not Linux specific, but that's where I debugged it.

Metadata

Labels

area-VM-coreclr, untriaged (New issue has not been triaged by the area owner)
