MooreThreads · yeahdongcn · May 17, 2026
diff --git a/README.md b/README.md
@@ -16,6 +16,21 @@ torchada is an adapter that makes [torch_musa](https://github.com/MooreThreads/t
 
 Many PyTorch projects are written for NVIDIA GPUs using `torch.cuda.*` APIs. To run these on Moore Threads GPUs, you would normally need to change every `cuda` reference to `musa`. torchada eliminates this by automatically translating CUDA API calls to MUSA equivalents at runtime.
 
+## Architecture
+
+torchada sits between user code and the PyTorch MUSA backend. User applications import `torchada` once and continue calling standard `torch.cuda.*` APIs. The compatibility layer patches CUDA entry points, translates CUDA device references to MUSA, installs CUDA-shaped module shims, and ports CUDA extension sources and symbols for the MUSA toolchain. On MUSA platforms, calls are redirected to `torch.musa`, custom operators use the `PrivateUse1` dispatch key, distributed NCCL requests map to MCCL, and runtime calls target the MUSA runtime.
+
+Import-time setup runs a short sequence of steps:
+
+- Detect the active platform (`MUSA`, native `CUDA`, or `CPU`).
+- Load optional C++/MUSA operator overrides on MUSA platforms.
+- Apply PyTorch compatibility patches through the `_patch.py` registry.
+- Configure the bundled Triton/MoE defaults for SGLang and vLLM.
+
+The main compatibility modules are `_device_compat.py`, `_cuda_compat.py`, `_runtime.py`, `_ctypes_compat.py`, `_accelerator_compat.py`, `utils/cpp_extension.py`, `_mapping.py`, `_cpp_ops.py`, `csrc/`, `cuda/`, and `triton/`.
+
+For downstream compatibility, `torch.cuda.is_available()` and `torch.version.cuda` are intentionally left unpatched so projects can still distinguish native CUDA environments from MUSA environments.
+
 ## Prerequisites
 
 - **torch_musa**: You must have [torch_musa](https://github.com/MooreThreads/torch_musa) installed (this provides MUSA support for PyTorch)
@@ -53,9 +68,11 @@ That's it! All `torch.cuda.*` APIs are automatically redirected to `torch.musa.*
 | Device operations | `tensor.cuda()`, `model.cuda()`, `torch.device("cuda")` |
 | Memory management | `torch.cuda.memory_allocated()`, `empty_cache()` |
 | Synchronization | `torch.cuda.synchronize()`, `Stream`, `Event` |
+| Compatibility aliases | `memory_cached()`, `torch.cuda.streams`, `torch.cuda.sparse` |
 | Mixed precision | `torch.cuda.amp.autocast()`, `GradScaler()` |
 | CUDA Graphs | `torch.cuda.CUDAGraph`, `torch.cuda.graph()` |
 | CUDA Runtime | `torch.cuda.cudart()` → uses MUSA runtime |
+| CUDA Introspection | `get_gencode_flags()`, `get_sync_debug_mode()`, `set_sync_debug_mode()` |
 | Profiler | `ProfilerActivity.CUDA` → uses PrivateUse1 |
 | Custom Ops | `Library.impl(..., "CUDA")` → uses PrivateUse1 |
 | Distributed | `dist.init_process_group(backend='nccl')` → uses MCCL |
@@ -199,11 +216,10 @@ with torch.accelerator.stream(torch.musa.Stream()):
     ...
 ```
 
-**Forward compatibility:** The wrapper always prefers the real
-`torch.accelerator` implementation and only falls back to `torch.musa` when an
-attribute is missing, so upgrading to a future PyTorch release that ships
-official implementations requires no changes on your side — you will
-automatically get the upstream version.
+**Forward compatibility:** The wrapper first applies torchada overrides for
+MUSA-specific fixes such as synchronization and memory APIs, then prefers the
+real `torch.accelerator` implementation, and finally falls back to `torch.musa`
+when an attribute is missing.
 
 ## Platform Detection
 

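The revised forward-compatibility paragraph in this hunk describes a three-tier lookup: torchada overrides first, then the real `torch.accelerator`, then `torch.musa`. A minimal sketch of that ordering follows. The class name `AcceleratorWrapper`, its constructor arguments, and the stand-in namespaces are hypothetical and only illustrate the resolution order; they are not taken from torchada's source.

```python
from types import SimpleNamespace


class AcceleratorWrapper:
    """Hypothetical three-tier attribute resolver (not torchada's actual class)."""

    def __init__(self, overrides, accelerator, fallback):
        self._overrides = overrides      # name -> replacement; stands in for torchada fixes
        self._accelerator = accelerator  # stands in for the real torch.accelerator
        self._fallback = fallback        # stands in for torch.musa

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails, so the three
        # private fields assigned in __init__ never reach this method.
        if name in self._overrides:
            return self._overrides[name]
        try:
            return getattr(self._accelerator, name)
        except AttributeError:
            # Last tier: raises AttributeError itself if the name is unknown.
            return getattr(self._fallback, name)


# Stand-in objects so the sketch runs without torch or torch_musa installed.
accelerator = SimpleNamespace(synchronize=lambda: "upstream", device_count=lambda: 1)
musa = SimpleNamespace(synchronize=lambda: "musa", memory_allocated=lambda: 0)
wrapper = AcceleratorWrapper({"synchronize": lambda: "override"}, accelerator, musa)
```

With these stand-ins, `wrapper.synchronize()` resolves to the override, `wrapper.device_count()` to the upstream tier, and `wrapper.memory_allocated()` to the fallback tier.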
diff --git a/README_CN.md b/README_CN.md
@@ -16,6 +16,21 @@ torchada 是一个适配器，让 [torch_musa](https://github.com/MooreThreads/t
 
 许多 PyTorch 项目使用 `torch.cuda.*` API 为 NVIDIA GPU 编写。要在摩尔线程 GPU 上运行这些项目，通常需要把每个 `cuda` 引用改成 `musa`。torchada 通过在运行时自动将 CUDA API 调用转换为 MUSA 等效调用来消除这一问题。
 
+## 架构
+
+torchada 位于用户代码和 PyTorch MUSA 后端之间。用户应用只需导入一次 `torchada`，之后继续使用标准 `torch.cuda.*` API。兼容层会修补 CUDA 入口点，将 CUDA 设备引用转换为 MUSA，安装 CUDA 形态的兼容模块，并为 MUSA 工具链转换 CUDA 扩展源码和符号。在 MUSA 平台上，调用会重定向到 `torch.musa`，自定义算子使用 `PrivateUse1` 调度键，分布式 NCCL 请求映射到 MCCL，运行时调用则落到 MUSA 运行时。
+
+导入时的初始化流程很短：
+
+- 检测当前平台（`MUSA`、原生 `CUDA` 或 `CPU`）。
+- 在 MUSA 平台上加载可选的 C++/MUSA 算子覆盖。
+- 通过 `_patch.py` 注册表应用 PyTorch 兼容性补丁。
+- 为 SGLang 和 vLLM 设置内置 Triton/MoE 默认配置。
+
+主要兼容模块包括 `_device_compat.py`、`_cuda_compat.py`、`_runtime.py`、`_ctypes_compat.py`、`_accelerator_compat.py`、`utils/cpp_extension.py`、`_mapping.py`、`_cpp_ops.py`、`csrc/`、`cuda/` 和 `triton/`。
+
+为了保持下游项目的平台检测逻辑，`torch.cuda.is_available()` 和 `torch.version.cuda` 会有意保持不修补，这样项目仍然可以区分原生 CUDA 环境和 MUSA 环境。
+
 ## 前置条件
 
 - **torch_musa**：必须安装 [torch_musa](https://github.com/MooreThreads/torch_musa)（提供 PyTorch 的 MUSA 支持）
@@ -53,9 +68,11 @@ torch.cuda.synchronize()
 | 设备操作 | `tensor.cuda()`, `model.cuda()`, `torch.device("cuda")` |
 | 显存管理 | `torch.cuda.memory_allocated()`, `empty_cache()` |
 | 同步 | `torch.cuda.synchronize()`, `Stream`, `Event` |
+| 兼容别名 | `memory_cached()`、`torch.cuda.streams`、`torch.cuda.sparse` |
 | 混合精度 | `torch.cuda.amp.autocast()`, `GradScaler()` |
 | CUDA Graphs | `torch.cuda.CUDAGraph`, `torch.cuda.graph()` |
 | CUDA 运行时 | `torch.cuda.cudart()` → 使用 MUSA 运行时 |
+| CUDA 自省/调试 | `get_gencode_flags()`、`get_sync_debug_mode()`、`set_sync_debug_mode()` |
 | 性能分析 | `ProfilerActivity.CUDA` → 使用 PrivateUse1 |
 | 自定义算子 | `Library.impl(..., "CUDA")` → 使用 PrivateUse1 |
 | 分布式训练 | `dist.init_process_group(backend='nccl')` → 使用 MCCL |
@@ -197,8 +214,8 @@ with torch.accelerator.stream(torch.musa.Stream()):
     ...
 ```
 
-**前向兼容性：** 包装器始终优先使用真正的 `torch.accelerator` 实现，只有在缺少属性时才回退到
-`torch.musa`，因此升级到提供官方实现的未来 PyTorch 版本时无需任何更改 —— 您将自动获得上游版本。
+**前向兼容性：** 包装器会先应用 torchada 针对 MUSA 的修复（例如同步和显存 API），然后优先使用真正的
+`torch.accelerator` 实现，最后在属性缺失时回退到 `torch.musa`。
 
 ## 平台检测
 

diff --git a/docs/compat_gap_cuda_introspection.md b/docs/compat_gap_cuda_introspection.md
@@ -0,0 +1,38 @@
+# CUDA Introspection Compatibility Gap
+
+## Status
+
+- Fixed in `src/torchada/_patch.py`
+- Covered by `tests/test_cuda_patching.py::TestCudaBuildAndDebugIntrospection`
+
+## Gap
+
+In the `yeahdongcn1` torch_musa 2.7.1 container, these top-level CUDA APIs exist
+on `torch.cuda` but are absent from `torch.musa`:
+
+- `torch.cuda.get_gencode_flags`
+- `torch.cuda.get_sync_debug_mode`
+- `torch.cuda.set_sync_debug_mode`
+
+After torchada redirected `torch.cuda` to `torch.musa`, those calls raised
+`AttributeError` instead of preserving CUDA-compatible API access.
+
+## Fix
+
+torchada now installs MUSA-safe shims when torch_musa does not provide these
+attributes:
+
+- `get_gencode_flags()` returns `""` because NVCC gencode flags are CUDA-specific
+  and should not be passed to the MUSA toolchain.
+- `get_sync_debug_mode()` and `set_sync_debug_mode()` maintain a process-local
+  debug mode value so CUDA-oriented code can call the public API without
+  requiring unavailable CUDA C++ hooks.
+
+## Verification
+
+Run in the MUSA test container:
+
+```bash
+docker exec -w /ws yeahdongcn1 python -m pytest \
+  tests/test_cuda_patching.py::TestCudaBuildAndDebugIntrospection -v
+```
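The fix described in `docs/compat_gap_cuda_introspection.md` can be sketched as below. This is an illustration under stated assumptions, not the code in `src/torchada/_patch.py`: the helper `install_missing`, the `_MODES` table, and the `fake_musa` namespace are hypothetical. The string-to-integer mapping follows the values the upstream `torch.cuda.set_sync_debug_mode` accepts (`"default"`, `"warn"`, `"error"`); the `ValueError` on other strings is a choice made for this sketch.

```python
from types import SimpleNamespace

# Assumed mapping, mirroring the names accepted by the upstream CUDA API.
_MODES = {"default": 0, "warn": 1, "error": 2}
_sync_debug_mode = 0  # process-local state, never forwarded to a C++ hook


def get_gencode_flags():
    # NVCC gencode flags are CUDA-specific, so the shim reports none.
    return ""


def get_sync_debug_mode():
    return _sync_debug_mode


def set_sync_debug_mode(debug_mode):
    global _sync_debug_mode
    if isinstance(debug_mode, str):
        if debug_mode not in _MODES:
            raise ValueError(f"invalid debug_mode: {debug_mode!r}")
        debug_mode = _MODES[debug_mode]
    _sync_debug_mode = int(debug_mode)


def install_missing(module, shims):
    """Attach each shim only when the backend lacks that attribute."""
    installed = []
    for name, shim in shims.items():
        if not hasattr(module, name):
            setattr(module, name, shim)
            installed.append(name)
    return installed


# Stand-in backend that already provides one of the three names.
fake_musa = SimpleNamespace(get_gencode_flags=lambda: "native")
installed = install_missing(
    fake_musa,
    {
        "get_gencode_flags": get_gencode_flags,
        "get_sync_debug_mode": get_sync_debug_mode,
        "set_sync_debug_mode": set_sync_debug_mode,
    },
)
```

Checking `hasattr` before installing means a future torch_musa release that ships these functions would take precedence over the shims.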
diff --git a/docs/compat_gap_cuda_nccl_attr.md b/docs/compat_gap_cuda_nccl_attr.md
@@ -0,0 +1,27 @@
+# CUDA NCCL Module Attribute Compatibility Gap
+
+## Status
+
+- Fixed in `src/torchada/_patch.py`
+- Covered by `tests/test_cuda_patching.py::TestNCCLModule::test_nccl_module_alias_available`
+
+## Gap
+
+CUDA exposes `torch.cuda.nccl` as both an importable module and a module
+attribute. torchada already registered `torch.cuda.nccl` in `sys.modules`, but
+plain attribute access still failed on MUSA because the CUDA wrapper redirected
+`torch.cuda.nccl` to missing `torch.musa.nccl` instead of `torch.musa.mccl`.
+
+## Fix
+
+torchada now aliases `torch.musa.nccl` to `torch.musa.mccl` when MCCL is
+available and also remaps `torch.cuda.nccl` attribute access to `mccl`.
+
+## Verification
+
+Run in the MUSA test container:
+
+```bash
+docker exec -w /ws yeahdongcn1 python -m pytest \
+  tests/test_cuda_patching.py::TestNCCLModule::test_nccl_module_alias_available -v
+```
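The two halves of the NCCL fix in `docs/compat_gap_cuda_nccl_attr.md` (aliasing `nccl` on the backend and remapping attribute access on the CUDA side) can be sketched as follows. `CudaRedirect`, `ATTR_REMAP`, and `alias_nccl` are hypothetical names for this illustration, and the namespaces stand in for `torch.musa` so the block runs without torch_musa.

```python
from types import SimpleNamespace

# Hypothetical remap table: CUDA attribute name -> MUSA attribute name.
ATTR_REMAP = {"nccl": "mccl"}


class CudaRedirect:
    """Stand-in for a torch.cuda wrapper that forwards to a MUSA backend."""

    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        # Translate the name before forwarding, so cuda.nccl reaches mccl
        # rather than a nonexistent backend.nccl.
        return getattr(self._backend, ATTR_REMAP.get(name, name))


def alias_nccl(backend):
    """Expose backend.nccl as an alias of backend.mccl when MCCL exists."""
    if hasattr(backend, "mccl") and not hasattr(backend, "nccl"):
        backend.nccl = backend.mccl
        return True
    return False


musa = SimpleNamespace(mccl=SimpleNamespace(name="mccl"))
cuda = CudaRedirect(musa)
aliased = alias_nccl(musa)

# A backend built without MCCL is left untouched.
musa_without_mccl = SimpleNamespace()
aliased_without_mccl = alias_nccl(musa_without_mccl)
```

Both paths return the same object, so code that does `torch.cuda.nccl` and code that reaches `torch.musa.nccl` directly would see identical state.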
diff --git a/docs/compat_gap_cuda_public_aliases.md b/docs/compat_gap_cuda_public_aliases.md
@@ -0,0 +1,44 @@
+# CUDA Public API Alias Compatibility Gap
+
+## Status
+
+- Fixed in `src/torchada/_patch.py`
+- Covered by `tests/test_cuda_patching.py::TestCudaPublicApiAliases`
+
+## Gap
+
+A broader `dir(torch.cuda)` versus `dir(torch.musa)` comparison in the
+`yeahdongcn1` torch_musa 2.7.1 container found additional CUDA public attributes
+that are commonly used as imports or compatibility aliases but were missing
+after torchada redirected `torch.cuda` to `torch.musa`.
+
+## Fix
+
+torchada now provides MUSA-backed or safe compatibility aliases for:
+
+- Deprecated memory aliases: `memory_cached`, `max_memory_cached`
+- Host-memory stat APIs with no MUSA counters: `host_memory_stats`,
+  `host_memory_stats_as_nested_dict`, `reset_accumulated_host_memory_stats`,
+  `reset_peak_host_memory_stats`
+- Static CUDA build flags: `has_half`, `has_magma`
+- Top-level `CUDAPluggableAllocator`
+- `torch.cuda.streams`
+- `torch.cuda.sparse`
+- `torch.cuda.init`
+- `torch.cuda.default_generators`
+- `torch.cuda.get_stream_from_external`
+
+## Deferred
+
+Allocator-control APIs such as `caching_allocator_enable` and telemetry APIs
+such as `utilization` remain deferred because torch_musa has no equivalent
+behavior in the tested build.
+
+## Verification
+
+Run in the MUSA test container:
+
+```bash
+docker exec -w /ws yeahdongcn1 python -m pytest \
+  tests/test_cuda_patching.py::TestCudaPublicApiAliases -v
+```
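For the deprecated memory aliases listed in `docs/compat_gap_cuda_public_aliases.md`, one plausible shape is a thin forwarding wrapper. In upstream PyTorch, `memory_cached` is a deprecated alias of `memory_reserved`; whether torchada forwards the same way, and whether it emits a warning, is an assumption here. `make_deprecated_alias` and the `backend` namespace are hypothetical.

```python
import warnings
from types import SimpleNamespace


def make_deprecated_alias(module, old_name, new_name):
    """Build a function that warns, then forwards to module.<new_name>."""
    target = getattr(module, new_name)

    def alias(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead",
            FutureWarning,
            stacklevel=2,
        )
        return target(*args, **kwargs)

    alias.__name__ = old_name
    return alias


# Stand-in backend; 4096 is an arbitrary illustrative byte count.
backend = SimpleNamespace(memory_reserved=lambda device=None: 4096)
backend.memory_cached = make_deprecated_alias(
    backend, "memory_cached", "memory_reserved"
)
```

The alias captures the target once at creation time, so it should be installed after any patch that replaces `memory_reserved` itself.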
diff --git a/docs/compat_gap_inventory.md b/docs/compat_gap_inventory.md
@@ -0,0 +1,62 @@
+# CUDA to MUSA Compatibility Gap Inventory
+
+## Method
+
+Compared selected `torch.cuda` attributes against `torch.musa` in the
+`yeahdongcn1` torch_musa 2.7.1 container, then verified behavior after
+`import torchada`.
+
+## Fixed
+
+- `torch.cuda.get_gencode_flags`
+- `torch.cuda.get_sync_debug_mode`
+- `torch.cuda.set_sync_debug_mode`
+- `torch.cuda.nccl`
+- Deprecated memory aliases and host-memory stat APIs
+- `torch.cuda.streams`
+- `torch.cuda.sparse`
+- `torch.cuda.init`
+- `torch.cuda.default_generators`
+- `torch.cuda.get_stream_from_external`
+- `torch.cuda.CUDAPluggableAllocator`
+
+See:
+
+- `docs/compat_gap_cuda_introspection.md`
+- `docs/compat_gap_cuda_nccl_attr.md`
+- `docs/compat_gap_cuda_public_aliases.md`
+
+## Deferred
+
+These remaining names are not patched as part of CUDA-to-MUSA runtime
+compatibility:
+
+- Imported helper symbols from the CUDA Python module: `Any`, `Callable`,
+  `Optional`, `Union`, `cast`, `classproperty`, `importlib`, `lru_cache`,
+  `threading`, `traceback`
+- CUDA-only internal classes or APIs with no MUSA object model equivalent in the
+  tested build: `CudaError`, `cudaStatus`, `DeferredCudaCallError`, `Device`,
+  `ComplexFloatStorage`, `ComplexDoubleStorage`, `jiterator`
+
+The following CUDA APIs are NVIDIA/NVML telemetry helpers, allocator-control APIs,
+or GDS/tunable modules with no `torch.musa` equivalent in the tested torch_musa build:
+
+- `torch.cuda.list_gpu_processes`
+- `torch.cuda.utilization`
+- `torch.cuda.memory_usage`
+- `torch.cuda.temperature`
+- `torch.cuda.power_draw`
+- `torch.cuda.clock_rate`
+- `torch.cuda.device_memory_used`
+- `torch.cuda.caching_allocator_alloc`
+- `torch.cuda.caching_allocator_delete`
+- `torch.cuda.caching_allocator_enable`
+- `torch.cuda.get_per_process_memory_fraction`
+- `torch.cuda.gds`
+- `torch.cuda.tunable`
+
+In the same container, the original CUDA telemetry implementations depend on
+`pynvml` and return no real values without NVIDIA NVML support. They are not
+patched in this pass to avoid reporting misleading MUSA telemetry. Allocator-control
+and tunable/GDS APIs are also left unpatched because torch_musa exposes no
+equivalent behavior in the tested build.
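The Method section of `docs/compat_gap_inventory.md` compares `torch.cuda` attributes against `torch.musa`. A minimal sketch of such a comparison is shown below; `missing_public_attrs` is a hypothetical helper, and the namespaces stand in for the real modules so the block runs anywhere.

```python
from types import SimpleNamespace


def missing_public_attrs(reference, candidate):
    """Return public names present on `reference` but absent on `candidate`."""
    return sorted(
        name
        for name in dir(reference)
        if not name.startswith("_") and not hasattr(candidate, name)
    )


# Stand-ins for torch.cuda and torch.musa.
reference = SimpleNamespace(synchronize=1, utilization=2, streams=3, _private=4)
candidate = SimpleNamespace(synchronize=1)
gaps = missing_public_attrs(reference, candidate)
```

On a machine with torch_musa installed, the same helper could be called as `missing_public_attrs(torch.cuda, torch.musa)` before and after `import torchada` to see which names the compatibility layer fills in. The result also includes imported helper symbols such as `Any` or `Optional`, which is why the inventory lists those separately as deferred.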