@dmitry-monakhov

Works in tandem with https://github.com/poolsideai/nccl-telemetry-py

from nccl_telemetry import NCCLProfiler, Phase

profiler = NCCLProfiler()

for step in range(num_steps):
    with profiler.phase(step, Phase.FORWARD):
        output = model(input)
        loss = criterion(output, target)

    with profiler.phase(step, Phase.BACKWARD):
        loss.backward()

    with profiler.phase(step, Phase.OPTIMIZER):
        optimizer.step()
  • titan integration: TBD

dmonakhov and others added 6 commits December 6, 2025 17:57
Use get_or_init to create Profiler only on first init, reuse on subsequent cycles.
Always respawn daemon when INIT_FLAG=0 (daemon stops on finalize).
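The init/finalize lifecycle described above can be sketched in Python as follows. This is only an illustration of the pattern, not the plugin's Rust code: `Profiler`, `Daemon`, `init`, and `finalize` are hypothetical stand-ins, and the meaning of `INIT_FLAG` (0 = no live daemon) is assumed from the commit message.

```python
import threading


class Profiler:
    """Hypothetical stand-in for the long-lived profiler state."""
    instances_created = 0

    def __init__(self):
        Profiler.instances_created += 1


class Daemon:
    """Hypothetical stand-in for the telemetry daemon, which stops on finalize."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


_lock = threading.Lock()
_profiler = None   # filled once, in the spirit of get_or_init
_daemon = None
_init_flag = 0     # assumed meaning: 0 = no live daemon, 1 = daemon running


def init():
    """Create the Profiler only on first init; respawn the daemon if it is down."""
    global _profiler, _daemon, _init_flag
    with _lock:
        if _profiler is None:
            _profiler = Profiler()
        if _init_flag == 0:
            _daemon = Daemon()
            _init_flag = 1
        return _profiler, _daemon


def finalize():
    """Stop the daemon and clear the flag; the Profiler itself is kept."""
    global _init_flag
    with _lock:
        if _daemon is not None:
            _daemon.stop()
        _init_flag = 0
```

Across an init, finalize, init cycle, the second `init()` returns the same `Profiler` object but a fresh `Daemon`.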
Implement phase-aware telemetry to track NCCL operations by training phase
(FORWARD/BACKWARD/OPTIMIZER). Uses reference counting for correct attribution,
replacing unreliable PhaseFlush control messages.

Architecture:
- PhaseScope tracking with atomic reference counting
- TelemetryPool: shared memory FIFO for Python client access
- Phase API: ncclProfilerBeginPhase/EndPhase with phase IDs
- NcclOp-based accounting: capture phase on creation, account on completion
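A minimal Python sketch of the "capture phase on creation, account on completion" idea, assuming one scope object per phase. The class and method names here (`PhaseScope`, `PhaseTracker`, `op_created`, `op_completed`) are illustrative and do not mirror the Rust sources; a plain integer stands in for the atomic reference count, so this version is single-threaded only.

```python
from enum import IntEnum


class Phase(IntEnum):
    FORWARD = 0
    BACKWARD = 1
    OPTIMIZER = 2


class PhaseScope:
    """Per-phase counters plus a count of ops still in flight."""

    def __init__(self, step, phase):
        self.step = step
        self.phase = phase
        self.ended = False
        self.refs = 0          # ops created in this phase that have not completed
        self.op_count = 0
        self.total_bytes = 0


class PhaseTracker:
    def __init__(self):
        self.current = None
        self.published = []    # scopes whose numbers are final

    def begin_phase(self, step, phase):
        self.current = PhaseScope(step, phase)

    def end_phase(self):
        scope, self.current = self.current, None
        if scope is None:
            return
        scope.ended = True
        self._maybe_publish(scope)

    def op_created(self):
        """Capture the active scope when the op is created."""
        scope = self.current
        if scope is not None:
            scope.refs += 1
        return scope           # the op keeps this handle until it completes

    def op_completed(self, scope, nbytes):
        """Account the op to the scope captured at creation, not the current one."""
        if scope is None:
            return
        scope.op_count += 1
        scope.total_bytes += nbytes
        scope.refs -= 1
        self._maybe_publish(scope)

    def _maybe_publish(self, scope):
        # Publish only after the phase has ended AND its last op has completed.
        if scope.ended and scope.refs == 0:
            self.published.append(scope)
```

An op launched in FORWARD that completes during BACKWARD is still counted under FORWARD, and the FORWARD scope is published only once that op finishes. No explicit flush message is needed for that attribution.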

Core components:
- src/phase_scope.rs: Phase tracking with reference counting
- src/phase_api.rs: C FFI for phase control
- src/telemetry_pool.rs: Shared memory export mechanism
- src/telemetry_ffi.rs: C FFI for telemetry access

Changes since v12:
- Fixes OPTIMIZER phases showing 0 operations
- Fixes first phase duration race condition
- Verified in production with 8-rank MoE training (11.2GB model, 10 steps)
1. Increase CTRL_FIFO_SZ from 256 to 4096
   - Supports more concurrent thread registrations
   - 256 was too small for large-scale deployments
     (e.g., 64 GPUs x 4 process groups x 3 threads = 768)

2. Handle ctrl_fifo overflow gracefully
   - Previously: .unwrap() panics if queue full
   - Now: log error and continue (thread loses telemetry)
   - Training continues instead of crashing

3. Improve PROFILER.get() error message
   - Previously: generic unwrap panic
   - Now: clear message explaining the race condition
   - Helps debugging if NCCL calls handlers before init completes
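The sizing arithmetic and the non-panicking overflow path from items 1 and 2 can be illustrated with a bounded queue in Python. This is a sketch under assumptions: `required_slots` and `register_thread` are hypothetical helpers, and `queue.Queue` stands in for the real control FIFO, which a daemon would be draining concurrently.

```python
import logging
import queue

log = logging.getLogger("ctrl_fifo_sketch")

CTRL_FIFO_SZ = 4096  # new capacity from this change (previously 256)


def required_slots(gpus, process_groups, threads_per_group):
    """Worst-case number of simultaneous thread registrations."""
    return gpus * process_groups * threads_per_group


def register_thread(fifo, thread_id):
    """Try to enqueue a registration; on overflow, log and carry on.

    Returns True if the registration was queued, False if it was dropped.
    A dropped registration means this thread loses telemetry, but the
    caller (the training job) keeps running instead of crashing.
    """
    try:
        fifo.put_nowait(thread_id)
        return True
    except queue.Full:
        log.error("ctrl_fifo full; thread %s will not report telemetry", thread_id)
        return False
```

With the example from item 1, `required_slots(64, 4, 3)` is 768, which exceeds the old limit of 256 but fits comfortably in 4096.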