26 changes: 21 additions & 5 deletions README.md
@@ -16,6 +16,21 @@ torchada is an adapter that makes [torch_musa](https://github.com/MooreThreads/t

Many PyTorch projects are written for NVIDIA GPUs using `torch.cuda.*` APIs. To run these on Moore Threads GPUs, you would normally need to change every `cuda` reference to `musa`. torchada eliminates this by automatically translating CUDA API calls to MUSA equivalents at runtime.

## Architecture

torchada sits between user code and the PyTorch MUSA backend. User applications import `torchada` once and continue calling standard `torch.cuda.*` APIs. The compatibility layer patches CUDA entry points, translates CUDA device references to MUSA, installs CUDA-shaped module shims, and ports CUDA extension sources and symbols for the MUSA toolchain. On MUSA platforms, calls are redirected to `torch.musa`, custom operators use the `PrivateUse1` dispatch key, distributed NCCL requests map to MCCL, and runtime calls target the MUSA runtime.

Import-time setup follows a small sequence:

- Detect the active platform (`MUSA`, native `CUDA`, or `CPU`).
- Load optional C++/MUSA operator overrides on MUSA platforms.
- Apply PyTorch compatibility patches through the `_patch.py` registry.
- Configure the bundled Triton/MoE defaults for SGLang and vLLM.
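
The registry step in the sequence above can be pictured with a small sketch. This is only an illustration of the general pattern (register patch functions, then apply them once, in order); the names `register_patch`, `apply_patches`, and the two example patches are hypothetical and are not taken from torchada's `_patch.py`.

```python
from typing import Callable, List

# Hypothetical registry; torchada's real `_patch.py` may be organized differently.
_PATCHES: List[Callable[[], None]] = []
_applied = False


def register_patch(func: Callable[[], None]) -> Callable[[], None]:
    """Record a patch function so it can be run later, at import time."""
    _PATCHES.append(func)
    return func


def apply_patches() -> int:
    """Run every registered patch in registration order, at most once.

    Returns the number of patches run by this call (0 on repeat calls).
    """
    global _applied
    if _applied:
        return 0
    for patch in _PATCHES:
        patch()
    _applied = True
    return len(_PATCHES)


applied_log: List[str] = []


@register_patch
def patch_device_strings() -> None:
    # Placeholder body: a real patch would rewrite "cuda" device references.
    applied_log.append("device")


@register_patch
def patch_cuda_module() -> None:
    # Placeholder body: a real patch would install CUDA-shaped module shims.
    applied_log.append("cuda")
```

Making `apply_patches()` idempotent means a second `import` of the package, or an explicit repeat call, does not patch twice.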

The main compatibility modules are `_device_compat.py`, `_cuda_compat.py`, `_runtime.py`, `_ctypes_compat.py`, `_accelerator_compat.py`, `utils/cpp_extension.py`, `_mapping.py`, `_cpp_ops.py`, `csrc/`, `cuda/`, and `triton/`.

For downstream compatibility, `torch.cuda.is_available()` and `torch.version.cuda` are intentionally left unpatched so projects can still distinguish native CUDA environments from MUSA environments.
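
Because those two values stay unpatched, a downstream project can still tell the platforms apart. The sketch below shows one way such a check might look. `detect_platform` is a hypothetical helper, not a torchada API, and it takes the `torch` module as an argument so the logic can be exercised with stand-in objects. The stand-in values (for example `version.cuda` being `None` on a MUSA build) are assumptions for illustration.

```python
from types import SimpleNamespace


def detect_platform(torch_module) -> str:
    """Classify the environment as 'MUSA', 'CUDA', or 'CPU' (hypothetical helper)."""
    musa = getattr(torch_module, "musa", None)
    if musa is not None and musa.is_available():
        return "MUSA"
    # These two values are left unpatched, so they only describe native CUDA.
    if torch_module.version.cuda is not None and torch_module.cuda.is_available():
        return "CUDA"
    return "CPU"


# Stand-in objects that mimic the shape of the `torch` module (assumed values).
fake_musa_torch = SimpleNamespace(
    musa=SimpleNamespace(is_available=lambda: True),
    cuda=SimpleNamespace(is_available=lambda: False),
    version=SimpleNamespace(cuda=None),
)
fake_cuda_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: True),
    version=SimpleNamespace(cuda="12.4"),
)
fake_cpu_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    version=SimpleNamespace(cuda=None),
)
```

In real code the call would be `detect_platform(torch)` after `import torch`.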

## Prerequisites

- **torch_musa**: You must have [torch_musa](https://github.com/MooreThreads/torch_musa) installed (this provides MUSA support for PyTorch)
@@ -53,9 +68,11 @@ That's it! All `torch.cuda.*` APIs are automatically redirected to `torch.musa.*
| Device operations | `tensor.cuda()`, `model.cuda()`, `torch.device("cuda")` |
| Memory management | `torch.cuda.memory_allocated()`, `empty_cache()` |
| Synchronization | `torch.cuda.synchronize()`, `Stream`, `Event` |
| Compatibility aliases | `memory_cached()`, `torch.cuda.streams`, `torch.cuda.sparse` |
| Mixed precision | `torch.cuda.amp.autocast()`, `GradScaler()` |
| CUDA Graphs | `torch.cuda.CUDAGraph`, `torch.cuda.graph()` |
| CUDA Runtime | `torch.cuda.cudart()` → uses MUSA runtime |
| CUDA Introspection | `get_gencode_flags()`, `get_sync_debug_mode()`, `set_sync_debug_mode()` |
| Profiler | `ProfilerActivity.CUDA` → uses PrivateUse1 |
| Custom Ops | `Library.impl(..., "CUDA")` → uses PrivateUse1 |
| Distributed | `dist.init_process_group(backend='nccl')` → uses MCCL |
@@ -199,11 +216,10 @@ with torch.accelerator.stream(torch.musa.Stream()):
...
```

**Forward compatibility:** The wrapper always prefers the real
`torch.accelerator` implementation and only falls back to `torch.musa` when an
attribute is missing, so upgrading to a future PyTorch release that ships
official implementations requires no changes on your side — you will
automatically get the upstream version.
**Forward compatibility:** The wrapper first applies torchada overrides for
MUSA-specific fixes such as synchronization and memory APIs, then prefers the
real `torch.accelerator` implementation, and finally falls back to `torch.musa`
when an attribute is missing.
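
The three-step lookup order described above (torchada overrides, then the real `torch.accelerator`, then `torch.musa`) can be sketched with a plain `__getattr__` wrapper. This is a minimal illustration of the lookup order only. The class name `AcceleratorWrapper` and the stand-in namespaces are hypothetical, and torchada's actual wrapper may be implemented differently.

```python
from types import SimpleNamespace


class AcceleratorWrapper:
    """Resolve attributes as: overrides, then upstream, then fallback."""

    def __init__(self, overrides, upstream, fallback):
        self._overrides = overrides
        self._upstream = upstream
        self._fallback = fallback

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so the three
        # private fields set in __init__ never reach this method.
        if name in self._overrides:
            return self._overrides[name]
        if hasattr(self._upstream, name):
            return getattr(self._upstream, name)
        # Raises AttributeError if the fallback lacks the name as well.
        return getattr(self._fallback, name)


# Stand-ins for torch.accelerator and torch.musa (assumed shapes).
upstream = SimpleNamespace(synchronize=lambda: "upstream", device_count=lambda: 1)
fallback = SimpleNamespace(empty_cache=lambda: "musa", device_count=lambda: 8)
overrides = {"synchronize": lambda: "override"}

accel = AcceleratorWrapper(overrides, upstream, fallback)
```

Here `accel.synchronize` resolves to the override, `accel.device_count` to the upstream stand-in, and `accel.empty_cache` to the fallback.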

## Platform Detection

21 changes: 19 additions & 2 deletions README_CN.md
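@@ -16,6 +16,21 @@ torchada is an adapter that makes [torch_musa](https://github.com/MooreThreads/t

Many PyTorch projects are written for NVIDIA GPUs using the `torch.cuda.*` APIs. To run them on Moore Threads GPUs, you would normally need to change every `cuda` reference to `musa`. torchada removes this problem by automatically translating CUDA API calls to their MUSA equivalents at runtime.

## Architecture

torchada sits between user code and the PyTorch MUSA backend. User applications import `torchada` once and then keep using the standard `torch.cuda.*` APIs. The compatibility layer patches CUDA entry points, translates CUDA device references to MUSA, installs CUDA-shaped compatibility modules, and converts CUDA extension sources and symbols for the MUSA toolchain. On MUSA platforms, calls are redirected to `torch.musa`, custom operators use the `PrivateUse1` dispatch key, distributed NCCL requests map to MCCL, and runtime calls land on the MUSA runtime.

The import-time initialization flow is short:

- Detect the current platform (`MUSA`, native `CUDA`, or `CPU`).
- Load optional C++/MUSA operator overrides on MUSA platforms.
- Apply PyTorch compatibility patches through the `_patch.py` registry.
- Set the built-in Triton/MoE default configuration for SGLang and vLLM.

The main compatibility modules are `_device_compat.py`, `_cuda_compat.py`, `_runtime.py`, `_ctypes_compat.py`, `_accelerator_compat.py`, `utils/cpp_extension.py`, `_mapping.py`, `_cpp_ops.py`, `csrc/`, `cuda/`, and `triton/`.

To preserve the platform-detection logic of downstream projects, `torch.cuda.is_available()` and `torch.version.cuda` are intentionally left unpatched, so projects can still distinguish native CUDA environments from MUSA environments.

## Prerequisites

- **torch_musa**: You must install [torch_musa](https://github.com/MooreThreads/torch_musa) (it provides MUSA support for PyTorch)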
@@ -53,9 +68,11 @@ torch.cuda.synchronize()
| Device operations | `tensor.cuda()`, `model.cuda()`, `torch.device("cuda")` |
| Memory management | `torch.cuda.memory_allocated()`, `empty_cache()` |
| Synchronization | `torch.cuda.synchronize()`, `Stream`, `Event` |
| Compatibility aliases | `memory_cached()`, `torch.cuda.streams`, `torch.cuda.sparse` |
| Mixed precision | `torch.cuda.amp.autocast()`, `GradScaler()` |
| CUDA Graphs | `torch.cuda.CUDAGraph`, `torch.cuda.graph()` |
| CUDA runtime | `torch.cuda.cudart()` → uses MUSA runtime |
| CUDA introspection/debugging | `get_gencode_flags()`, `get_sync_debug_mode()`, `set_sync_debug_mode()` |
| Profiling | `ProfilerActivity.CUDA` → uses PrivateUse1 |
| Custom operators | `Library.impl(..., "CUDA")` → uses PrivateUse1 |
| Distributed training | `dist.init_process_group(backend='nccl')` → uses MCCL |
@@ -197,8 +214,8 @@ with torch.accelerator.stream(torch.musa.Stream()):
...
```

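**Forward compatibility:** The wrapper always prefers the real `torch.accelerator` implementation and only falls back to
`torch.musa` when an attribute is missing, so upgrading to a future PyTorch release that provides official implementations requires no changes; you will automatically get the upstream version.
**Forward compatibility:** The wrapper first applies torchada's MUSA-specific fixes (for example, synchronization and memory APIs), then prefers the real
`torch.accelerator` implementation, and finally falls back to `torch.musa` when an attribute is missing.

## Platform Detection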

38 changes: 38 additions & 0 deletions docs/compat_gap_cuda_introspection.md
@@ -0,0 +1,38 @@
# CUDA Introspection Compatibility Gap

## Status

- Fixed in `src/torchada/_patch.py`
- Covered by `tests/test_cuda_patching.py::TestCudaBuildAndDebugIntrospection`

## Gap

In the `yeahdongcn1` torch_musa 2.7.1 container, these top-level CUDA APIs exist
on `torch.cuda` but are absent from `torch.musa`:

- `torch.cuda.get_gencode_flags`
- `torch.cuda.get_sync_debug_mode`
- `torch.cuda.set_sync_debug_mode`

After torchada redirects `torch.cuda` to `torch.musa`, those calls raised
`AttributeError` instead of preserving CUDA-compatible API access.

## Fix

torchada now installs MUSA-safe shims when torch_musa does not provide these
attributes:

- `get_gencode_flags()` returns `""` because NVCC gencode flags are CUDA-specific
and should not be passed to the MUSA toolchain.
- `get_sync_debug_mode()` and `set_sync_debug_mode()` maintain a process-local
debug mode value so CUDA-oriented code can call the public API without
requiring unavailable CUDA C++ hooks.
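
A minimal sketch of the shim approach described above, assuming the backend module may or may not already define each function. `install_introspection_shims` is a hypothetical name and the code runs against a stand-in module rather than `torch.musa`. It handles integer modes only, treats `0` as the default, and omits the string modes that the real CUDA API also accepts. All of these are simplifications made for the example, not a description of the code in `_patch.py`.

```python
import types


def install_introspection_shims(backend: types.ModuleType) -> None:
    """Add CUDA-shaped introspection helpers that `backend` lacks."""
    state = {"sync_debug_mode": 0}  # process-local value; 0 is an assumed default

    def get_gencode_flags() -> str:
        # NVCC gencode flags are CUDA-specific, so report none.
        return ""

    def get_sync_debug_mode() -> int:
        return state["sync_debug_mode"]

    def set_sync_debug_mode(mode: int) -> None:
        state["sync_debug_mode"] = int(mode)

    for shim in (get_gencode_flags, get_sync_debug_mode, set_sync_debug_mode):
        # Only fill gaps; never replace something the backend already provides.
        if not hasattr(backend, shim.__name__):
            setattr(backend, shim.__name__, shim)


fake_musa = types.ModuleType("fake_musa")  # stand-in for torch.musa
install_introspection_shims(fake_musa)
```

The `hasattr` guard matters: if a later torch_musa build ships a native implementation, the shim steps aside.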

## Verification

Run in the MUSA test container:

```bash
docker exec -w /ws yeahdongcn1 python -m pytest \
tests/test_cuda_patching.py::TestCudaBuildAndDebugIntrospection -v
```
27 changes: 27 additions & 0 deletions docs/compat_gap_cuda_nccl_attr.md
@@ -0,0 +1,27 @@
# CUDA NCCL Module Attribute Compatibility Gap

## Status

- Fixed in `src/torchada/_patch.py`
- Covered by `tests/test_cuda_patching.py::TestNCCLModule::test_nccl_module_alias_available`

## Gap

CUDA exposes `torch.cuda.nccl` as both an importable module and a module
attribute. torchada already registered `torch.cuda.nccl` in `sys.modules`, but
plain attribute access still failed on MUSA because the CUDA wrapper redirected
`torch.cuda.nccl` to missing `torch.musa.nccl` instead of `torch.musa.mccl`.

## Fix

torchada now aliases `torch.musa.nccl` to `torch.musa.mccl` when MCCL is
available and also remaps `torch.cuda.nccl` attribute access to `mccl`.
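
The two halves of the fix (attribute alias plus `sys.modules` registration) can be illustrated with stand-in modules. The `fake_cuda` and `fake_musa` modules and the `alias_nccl_to_mccl` helper are hypothetical and exist only for this sketch; the real patch operates on `torch.cuda` and `torch.musa`.

```python
import sys
import types

# Stand-in modules so the sketch does not touch torch.
fake_cuda = types.ModuleType("fake_cuda")
fake_musa = types.ModuleType("fake_musa")
fake_musa.mccl = types.ModuleType("fake_musa.mccl")


def alias_nccl_to_mccl(cuda_mod, musa_mod) -> bool:
    """Expose MCCL under the NCCL name; return False if MCCL is unavailable."""
    mccl = getattr(musa_mod, "mccl", None)
    if mccl is None:
        return False
    musa_mod.nccl = mccl  # attribute access on the MUSA side
    cuda_mod.nccl = mccl  # attribute access on the CUDA-shaped side
    sys.modules[cuda_mod.__name__ + ".nccl"] = mccl  # lookup by dotted module name
    return True


aliased = alias_nccl_to_mccl(fake_cuda, fake_musa)
```

Both access paths need covering because `import x.nccl` consults `sys.modules`, while `x.nccl` is an ordinary attribute lookup on the parent module.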

## Verification

Run in the MUSA test container:

```bash
docker exec -w /ws yeahdongcn1 python -m pytest \
tests/test_cuda_patching.py::TestNCCLModule::test_nccl_module_alias_available -v
```
44 changes: 44 additions & 0 deletions docs/compat_gap_cuda_public_aliases.md
@@ -0,0 +1,44 @@
# CUDA Public API Alias Compatibility Gap

## Status

- Fixed in `src/torchada/_patch.py`
- Covered by `tests/test_cuda_patching.py::TestCudaPublicApiAliases`

## Gap

A broader `dir(torch.cuda)` versus `dir(torch.musa)` comparison in the
`yeahdongcn1` torch_musa 2.7.1 container found additional CUDA public attributes
that are commonly used as imports or compatibility aliases but were missing
after torchada redirected `torch.cuda` to `torch.musa`.

## Fix

torchada now provides MUSA-backed or safe compatibility aliases for:

- Deprecated memory aliases: `memory_cached`, `max_memory_cached`
- Host-memory stat APIs with no MUSA counters: `host_memory_stats`,
`host_memory_stats_as_nested_dict`, `reset_accumulated_host_memory_stats`,
`reset_peak_host_memory_stats`
- Static CUDA build flags: `has_half`, `has_magma`
- Top-level `CUDAPluggableAllocator`
- `torch.cuda.streams`
- `torch.cuda.sparse`
- `torch.cuda.init`
- `torch.cuda.default_generators`
- `torch.cuda.get_stream_from_external`
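
For the deprecated memory aliases in the list above, one plausible shape is a thin forwarding function. In upstream PyTorch, `memory_cached` is the deprecated name for `memory_reserved`. The helper `add_deprecated_alias`, the stand-in module, and the value `4096` are hypothetical, and the source does not say whether torchada's alias emits a warning.

```python
import types
import warnings


def add_deprecated_alias(module, old_name: str, new_name: str) -> None:
    """Expose `old_name` as a warning-emitting forwarder to `new_name`."""
    if hasattr(module, old_name) or not hasattr(module, new_name):
        return  # nothing to do, or nothing to forward to
    target = getattr(module, new_name)

    def alias(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name}",
            FutureWarning,
            stacklevel=2,
        )
        return target(*args, **kwargs)

    alias.__name__ = old_name
    setattr(module, old_name, alias)


fake_backend = types.ModuleType("fake_backend")  # stand-in for torch.musa
fake_backend.memory_reserved = lambda device=None: 4096  # made-up value
add_deprecated_alias(fake_backend, "memory_cached", "memory_reserved")
```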

## Deferred

Allocator-control APIs such as `caching_allocator_enable` and telemetry APIs
such as `utilization` remain deferred because torch_musa has no equivalent
behavior in the tested build.

## Verification

Run in the MUSA test container:

```bash
docker exec -w /ws yeahdongcn1 python -m pytest \
tests/test_cuda_patching.py::TestCudaPublicApiAliases -v
```
62 changes: 62 additions & 0 deletions docs/compat_gap_inventory.md
@@ -0,0 +1,62 @@
# CUDA to MUSA Compatibility Gap Inventory

## Method

Compared selected `torch.cuda` attributes against `torch.musa` in the
`yeahdongcn1` torch_musa 2.7.1 container, then verified behavior after
`import torchada`.
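
The comparison step can be approximated with a set difference over public names. This sketch uses stand-in modules and a hypothetical helper, `missing_public_names`. In the container the same idea would be applied to `torch.cuda` and `torch.musa`.

```python
import types


def missing_public_names(reference, candidate) -> list:
    """Return public names present on `reference` but absent from `candidate`."""
    ref_names = {n for n in dir(reference) if not n.startswith("_")}
    cand_names = {n for n in dir(candidate) if not n.startswith("_")}
    return sorted(ref_names - cand_names)


# Stand-ins with a deliberately small attribute surface.
fake_cuda = types.ModuleType("fake_cuda")
fake_cuda.synchronize = lambda: None
fake_cuda.get_gencode_flags = lambda: ""
fake_cuda.utilization = lambda: 0

fake_musa = types.ModuleType("fake_musa")
fake_musa.synchronize = lambda: None

gaps = missing_public_names(fake_cuda, fake_musa)
```

A name-level diff like this only finds missing attributes; behavior still has to be checked separately, which is why the method above re-verifies after `import torchada`.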

## Fixed

- `torch.cuda.get_gencode_flags`
- `torch.cuda.get_sync_debug_mode`
- `torch.cuda.set_sync_debug_mode`
- `torch.cuda.nccl`
- Deprecated memory aliases and host-memory stat APIs
- `torch.cuda.streams`
- `torch.cuda.sparse`
- `torch.cuda.init`
- `torch.cuda.default_generators`
- `torch.cuda.get_stream_from_external`
- `torch.cuda.CUDAPluggableAllocator`

See:

- `docs/compat_gap_cuda_introspection.md`
- `docs/compat_gap_cuda_nccl_attr.md`
- `docs/compat_gap_cuda_public_aliases.md`

## Deferred

These remaining names are not patched as part of CUDA-to-MUSA runtime
compatibility:

- Imported helper symbols from the CUDA Python module: `Any`, `Callable`,
`Optional`, `Union`, `cast`, `classproperty`, `importlib`, `lru_cache`,
`threading`, `traceback`
- CUDA-only internal classes or APIs with no MUSA object model equivalent in the
tested build: `CudaError`, `cudaStatus`, `DeferredCudaCallError`, `Device`,
`ComplexFloatStorage`, `ComplexDoubleStorage`, `jiterator`

The following CUDA APIs are NVIDIA/NVML telemetry helpers, allocator controls, or
tunable/GDS modules, and still have no `torch.musa` equivalent in the tested
torch_musa build:

- `torch.cuda.list_gpu_processes`
- `torch.cuda.utilization`
- `torch.cuda.memory_usage`
- `torch.cuda.temperature`
- `torch.cuda.power_draw`
- `torch.cuda.clock_rate`
- `torch.cuda.device_memory_used`
- `torch.cuda.caching_allocator_alloc`
- `torch.cuda.caching_allocator_delete`
- `torch.cuda.caching_allocator_enable`
- `torch.cuda.get_per_process_memory_fraction`
- `torch.cuda.gds`
- `torch.cuda.tunable`

In the same container, the original CUDA implementations of the telemetry
helpers depend on `pynvml` and do not provide real values without NVIDIA NVML
support. They are not patched in this pass to avoid returning misleading MUSA
telemetry. Allocator-control and tunable/GDS APIs are also left unpatched
because torch_musa does not expose equivalent behavior in the tested build.