compilAR is a distributed GPU communication compiler for straggler-resilient AllReduce. Inspired by the paper Efficient AllReduce with Stragglers (Devraj et al.), it takes a schedule produced by the StragglAR algorithm as input and emits a complete, standalone CUDA + MPI + NCCL implementation of that schedule for any number of GPUs.
The core problem it solves is that standard AllReduce algorithms (ring, tree, recursive-halving-doubling) stall every healthy rank until the slowest GPU catches up. StragglAR lets the N-1 healthy ranks make progress among themselves while the straggler is still computing, then merges the straggler's contribution with a minimal number of additional communication rounds once it is ready.
Given N GPUs with one designated straggler (rank N-1):
-
Reduce-scatter phase: While the straggler is delayed, the N-1 healthy ranks run
ncclReduceScatteramong themselves over N-1 equal chunks of the buffer. After this, rank r holds the partial sum of chunk r across all healthy ranks. -
Straggler merge phase: The schedule synthesizer produces a sequence of rounds, each containing a batch of pairwise exchanges. Three exchange types are used:
- StragglerMatching: a healthy rank and the straggler both hold a partial sum of the same chunk. They swap into a scratch buffer, then both call
reduce_addto finalize. - OneWayMatching: a rank holding a fully-reduced chunk pushes it to a rank that does not. Plain copy.
- TwoWayMatching: two ranks each hold a fully-reduced chunk the other lacks. They swap simultaneously.
- StragglerMatching: a healthy rank and the straggler both hold a partial sum of the same chunk. They swap into a scratch buffer, then both call
-
After the last round, every rank holds every chunk fully reduced.
compilAR.py takes a schedule file (from synthesizer_pow2.py or synthesizer_nonpow2.py) and generates a complete .cu source file. It uses allreduce_multinode.cu.template as its skeleton and substitutes:
NUM_RANKS: number of GPUs, inferred from the schedulekStragglerRank: straggler rank, inferred from the schedule- Body of
stragglar_allreduce_helper: per-round NCCL group blocks, generated from the schedule matchings
Everything else in the template (MPI bootstrap, NCCL communicator init, reduce-scatter sub-communicator, benchmark loop, correctness check, cleanup) is agnostic of the number of GPUs.
All builds and runs happen inside an Apptainer container defined by stragglar/compilar.def. The container ships a coherent CUDA + NCCL + OpenMPI + PyTorch toolchain (NGC pytorch:23.10-py3), so the host only needs:
- An NVIDIA driver (≥ 535 for our base image)
- Apptainer ≥ 1.3
- An MPI launcher (host
mpirunon a TCP cluster, or Slurmsrunon an HPC cluster)
No host-side CUDA, NCCL, MPI, Python, or PyTorch installation is required.
cd stragglar
sudo apptainer build compilAR.sif compilar.def
# or, if sudo is unavailable:
apptainer build --fakeroot compilAR.sif compilar.defPlace compilAR.sif somewhere visible from every node (e.g. NFS-shared $HOME).
cd stragglar/schedules
python synthesizer_pow2.py 8 > 8gpusched.txtPre-generated schedules for N=2, 4, 8 are already in schedules/.
cd stragglar
apptainer exec --nv compilAR.sif python3 compilAR.py schedules/8gpusched.txt generated_8gpu.cuThe container provides stragglar-build, which produces a fat binary that runs on every shipping NVIDIA GPU from Pascal (sm_61) through Hopper (sm_90), with PTX fallback for newer cards:
apptainer exec --nv compilAR.sif stragglar-build generated_8gpu.cu stragglar_8gpuFor a faster, GPU-specific build (host's GPU only):
apptainer exec --nv compilAR.sif stragglar-build --native generated_8gpu.cu stragglar_8gpumpirun -np 8 -H host1:4,host2:4 \
--mca btl_tcp_if_include <iface> \
-x NCCL_P2P_DISABLE=1 \
-x PMIX_MCA_psec=native \
-x PMIX_MCA_gds=hash \
apptainer exec --nv compilAR.sif ./stragglar_8gpu 117440512 stragglar 10 100.0The PMIX_MCA_* flags reconcile the host PMIx with the container's PMIx; drop them if your host and container OpenMPI versions match. NCCL_P2P_DISABLE=1 is needed only on consumer (GeForce) GPUs where PCIe P2P is disabled in the driver.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=00:30:00
srun --mpi=pmix \
apptainer exec --nv \
--bind /dev/infiniband --bind /sys/class/infiniband \
compilAR.sif ./stragglar_8gpu 117440512 stragglar 10 100.0Add -x NCCL_DEBUG=INFO (TCP) or --export=ALL,NCCL_DEBUG=INFO (Slurm) on first run to confirm Using network IB (or Using network Socket on TCP) and reasonable bandwidth.
launch.sh runs a smoketester to identify the physically slow GPU and binds it to rank N-1, useful when you don't want to simulate a delay. It currently assumes a single-node, host-toolchain layout — if you use it, run it inside the container:
apptainer exec --nv compilAR.sif ./stragglar/launch.sh 8 ./stragglar_8gpu 117440512 stragglar 10 -1<buffer_bytes> <algorithm> <num_iters> <sleep_ms>
| Argument | Description |
|---|---|
buffer_bytes |
Total AllReduce buffer size in bytes. Must satisfy (buffer_bytes / sizeof(float)) % (N - 1) == 0; the binary errors out otherwise with suggested valid sizes. The examples above use 117440512 (112 MiB, valid for N=8) and 50331648 (48 MiB, valid for N=4) — both pick a 4 MiB chunk per rank |
algorithm |
Must be stragglar |
num_iters |
Timed iterations; first is discarded as warmup |
sleep_ms |
Milliseconds to delay rank N-1. -1 skips reduce-scatter and runs the merge schedule only |
allreduce_multinode.cu (and all generated files) use one MPI rank per GPU. Each process initializes its NCCL communicator via ncclCommInitRank with a token distributed by MPI_Bcast. This scales to multi-node configurations.
The single-process variants in reference_code/ and allreduce_4GPU_rewrite.cu use ncclCommInitAll and only work on one host. They are kept for reference.
Two NCCL communicators are maintained per process:
comm: all N ranks; used during the straggler merge schedulesubComm: ranks 0 through N-2; used for the reduce-scatter phase while the straggler is delayed (built viaMPI_Comm_split)
Each MPI process binds to its GPU via LOCAL_RANK (set by mpirun, torchrun, or Slurm). Without it, the process falls back to myRank % cudaDeviceCount. For accurate straggler behavior, rank N-1 must be bound to the physically slow GPU. launch.sh does this automatically.
launch.shassumes single-node. The smoketester enumerates GPUs viatorch.cuda.device_count()on one host. Multi-node runs need either a per-node aggregation step or manualLOCAL_RANKassignment.- Data type is hardcoded to
float32. Supporting fp16 / bf16 requires changes to both the template and the generator. - Buffer size must be divisible by
(N-1) * sizeof(float). The binary checks at startup and exits with an error and suggested valid sizes if not. Note that round-power-of-2 sizes (1 MiB, 1 GiB) are not valid for non-power-of-2 values of N-1 — for N=4 use multiples of 12 bytes (e.g. 48 MiB, 192 MiB), for N=8 use multiples of 28 bytes (e.g. 112 MiB, 896 MiB). The helper atstragglar/smoketest/pad_buffers.py Nprints a list of valid sizes for a given N. - Straggler is always rank N-1. Baked into the generated code;
launch.shmaps the physical straggler GPU to that rank at launch. - Correctness check assumes the built-in fill pattern.
kExpectedSum = 6.0fonly holds when the straggler fills its buffer with the arbitrary3.0fand each non-straggler fills its own chunk. Replace this check when wiring real input data. - Clock-based straggler delay is calibrated to device 0. On heterogeneous-clock GPUs,
sleep_mswon't match wall-clock ms on other devices. Doesn't affect correctness.
We achieved ~1.3x speedup compared to standard Ring AllReduce on our in-house CDS Compute Cluster. This is with a deliberate 50 ms simulated delay to showcase how stragglers are handled, and across multiple nodes with heterogeneous and hardware (RTX 2080 Ti and GTX 1070). We reasonably project that on production or research grade homogeneous hardware on a single host device, we can approach a 2x speedup compared to traditional AllReduce methods.
Our straggler detection system also correctly identifies the slowest GPU. In our in-house tests, one of our nodes had older GPUs and so they were always identified as the stragglers at a statistically significant rate.
The StragglAR algorithm and the schedule synthesizer in stragglar/schedules/ are the work of Devraj et al., Efficient AllReduce with Stragglers. This project is not the original algorithm, but a compiler and launch harness built around it.