Describe the Request
APPFL currently supports `run_*` entry points under `/src/appfl` (MPI/gRPC/Globus). While `run_serial.py` exists, it will be deprecated in the future, even though it seems useful for laptop-scale simulation experiments.
- Branching simulation modes
  - I suggest reviving `run_serial.py` with four internal modes:
    - `run_serial`: runs on CPU/GPU in a serial manner (for-loop style)
    - `run_gloo`: runs in parallel on CPUs (`torch.distributed.init_process_group(backend='gloo')`)
    - `run_nccl`: runs in parallel on multiple GPUs (`torch.distributed.init_process_group(backend='nccl')`)
    - `run_mpi`: runs in parallel on multiple Intel GPUs (i.e., XPU)
  - While `gloo` can be regarded as an alternative to `mpi`, it is a thread-safe, PyTorch-native backend and thus has no dependency on `mpiexec`.
  - Draft example:

    def run_distributed(config, backend: str) -> None:
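Building on the draft signature above, a minimal sketch of how the four modes could dispatch internally. Function and key names (`resolve_backend`, `config["clients"]`) are illustrative assumptions, not the actual APPFL API:

```python
# Sketch only: illustrative names, not the actual APPFL API.
from typing import Optional


def resolve_backend(mode: str) -> Optional[str]:
    """Map a simulation mode to a torch.distributed backend (None = serial)."""
    backends = {
        "serial": None,    # plain for-loop, no process group needed
        "gloo": "gloo",    # CPU-parallel, PyTorch-native, no mpiexec dependency
        "nccl": "nccl",    # multi-GPU (NVIDIA)
        "mpi": "mpi",      # MPI path, e.g., for Intel XPU
    }
    if mode not in backends:
        raise ValueError(f"unknown simulation mode: {mode}")
    return backends[mode]


def run_distributed(config: dict, backend: str) -> None:
    """Draft entry point: initialize a process group only for parallel modes."""
    resolved = resolve_backend(backend)
    if resolved is None:
        # run_serial: iterate over clients in a single process
        for client in config.get("clients", []):
            pass  # local training for one client goes here
    else:
        # run_gloo / run_nccl / run_mpi: one process per rank
        import torch.distributed as dist
        dist.init_process_group(backend=resolved)
        # per-rank client training goes here
```

Keeping the backend lookup in one table makes adding or deprecating a mode a one-line change.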
- Branching entry points

  appfl
    run
      mode=real
        comm=grpc
        comm=globus
        comm=ray
      mode=sim
        backend=serial
        backend=gloo
        backend=nccl
        backend=mpi
    commit
      interface=chat    // vibe-coding interface to auto-generate essential files for algorithm implementation
      interface=manual  // sanity-check interface to confirm manually coded essential modules by users (e.g., aggregator, scheduler, trainer)
- This is currently just a simple suggestion; it needs further discussion to decide whether it is acceptable.
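As one way to realize the key=value command style above, a hedged sketch of a tiny parser whose option tables mirror the entry-point tree. All names here (`VALID_OPTIONS`, `parse_cli`) are illustrative assumptions, not an existing appfl CLI:

```python
# Sketch of parsing the proposed `appfl <command> key=value ...` style.
# All names are illustrative assumptions, not the actual APPFL CLI.

VALID_OPTIONS = {
    "run": {
        "mode": {"real", "sim"},
        "comm": {"grpc", "globus", "ray"},             # deployment (mode=real)
        "backend": {"serial", "gloo", "nccl", "mpi"},  # simulation (mode=sim)
        "config": None,                                # free-form path
    },
    "commit": {
        "interface": {"chat", "manual"},
        "config": None,
    },
}


def parse_cli(argv):
    """Return (command, options) for e.g. ['run', 'mode=sim', 'backend=gloo']."""
    command, *pairs = argv
    if command not in VALID_OPTIONS:
        raise ValueError(f"unknown command: {command}")
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or key not in VALID_OPTIONS[command]:
            raise ValueError(f"unknown option for {command}: {pair}")
        allowed = VALID_OPTIONS[command][key]
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value for {key}: {value}")
        options[key] = value
    return command, options
```

A declarative option table like this keeps the real/sim and chat/manual branches in one place, so the help text and validation cannot drift apart.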
Sample Code
// simulation modes
appfl run mode=sim backend=serial config=...
appfl run mode=sim backend=gloo config=...
appfl run mode=sim backend=nccl config=...
// deployment modes
appfl run mode=real comm=grpc config=...
appfl run mode=real comm=globus config=...
...
// commit modes
appfl commit interface=chat config=...
appfl commit interface=manual config=...
Additional Code or Information
Draft code: APPFL/src/appfl/sim/runner.py, line 580 in 4a26e03.
To-Do
- `torch.distributed` compatibility with Intel XPU (`run_mpi` design)