Port kvcached to ROCM #1

Open
hieple-moreh wants to merge 1 commit into main from feat/rocm-port

hieple-moreh commented on May 19, 2026

Ticket: MV-4347

Unit Tests:

  • Only the test files that exercise GPU-related logic were run:
root@mv-mi250-07:/app/kvcached# python3 -m pytest tests/test_kvcache_manager.py     
================================ test session starts ================================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /app/kvcached
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 5 items                                                                   

tests/test_kvcache_manager.py ..s..                                           [100%]

=========================== 4 passed, 1 skipped in 2.14s ============================
root@mv-mi250-07:/app/kvcached# python3 tests/test_paged_allocator_aliasing.py  
Init C++ PageAllocator: num_layers=2, mem_size_per_layer=64MB, total_mem_size=256MB, page_size=2MB, world_size=1, pp_rank=0, async_sched=0, contiguous_layout=1, enable_prealloc=1, num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10
Setup: 65536 tokens, page_size=16, available blocks=4095

[PASS] test_without_alloc_data_corrupted
       Only 28/60 tokens retained their value — zero_page aliasing confirmed.
       First 10 expected : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
       First 10 actual   : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

[PASS] test_with_alloc_data_correct
       All 30/30 tokens correct — unique physical pages confirmed.

==================================================
Results: 2 passed, 0 failed
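
The sizes in the `Init C++ PageAllocator` line above can be cross-checked by hand. The sketch below parses that line and compares `total_mem_size` with `num_layers × mem_size_per_layer × num_kv_buffers`. That relation is an assumption inferred from the logged numbers, not taken from the kvcached source, and `parse_init_line` is a hypothetical helper written only for this check.

```python
import re


def parse_init_line(line: str) -> dict:
    """Extract the integer key=value fields from a PageAllocator init log line.

    Hypothetical helper for illustration; sizes are kept in MB as logged.
    """
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)(?:MB)?", line)}


# Log line copied from the test_paged_allocator_aliasing.py run above.
line = (
    "Init C++ PageAllocator: num_layers=2, mem_size_per_layer=64MB, "
    "total_mem_size=256MB, page_size=2MB, world_size=1, pp_rank=0, "
    "async_sched=0, contiguous_layout=1, enable_prealloc=1, "
    "num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10"
)
fields = parse_init_line(line)

# Assumed relation: one K buffer and one V buffer per layer.
expected_total = (
    fields["num_layers"] * fields["mem_size_per_layer"] * fields["num_kv_buffers"]
)

# Number of 2 MB pages that fit in one layer's buffer.
pages_per_layer_buffer = fields["mem_size_per_layer"] // fields["page_size"]

print(f"expected total: {expected_total} MB, logged: {fields['total_mem_size']} MB")
print(f"pages per layer buffer: {pages_per_layer_buffer}")
```

For this run the assumed relation matches the logged total exactly. The same log reports 4095 available blocks for 65536 tokens at 16 tokens per block, one fewer than 65536 / 16 = 4096; the log does not say why the remaining block is unavailable.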

End-to-End tests:

  • Prerequisites:
python3 tools/dev_copy_pth.py
export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
  • Offline vLLM serving:
root@mv-mi250-07:/app/kvcached# python3 tests/test_offline_serving.py 
[kvcached][INFO][2026-05-19 09:24:10][patch_base.py:98] Applying 6 patches for vllm
[kvcached][INFO][2026-05-19 09:24:14][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:14][version_utils.py:189] Detected vllm version: 0.17.0
WARNING 05-19 09:24:15 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][patch_base.py:178] Successfully patched vllm: elastic_block_pool, engine_core, gpu_model_runner, gpu_worker, kv_cache_coordinator
INFO 05-19 09:24:16 [utils.py:238] non-default args: {'enable_prefix_caching': False, 'disable_log_stats': True, 'model': 'gpt2'}
INFO 05-19 09:24:18 [model.py:531] Resolved architecture: GPT2LMHeadModel
INFO 05-19 09:24:19 [model.py:1889] Downcasting torch.float32 to torch.bfloat16.
INFO 05-19 09:24:19 [model.py:1554] Using max model len 1024
INFO 05-19 09:24:19 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-19 09:24:19 [vllm.py:747] Asynchronous scheduling is enabled.
WARNING 05-19 09:24:21 [system_utils.py:152] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
[kvcached][INFO][2026-05-19 09:24:23][patch_base.py:98] Applying 6 patches for vllm
[kvcached][INFO][2026-05-19 09:24:26][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:26][version_utils.py:189] Detected vllm version: 0.17.0
WARNING 05-19 09:24:28 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][patch_base.py:178] Successfully patched vllm: elastic_block_pool, engine_core, gpu_model_runner, gpu_worker, kv_cache_coordinator
(EngineCore_DP0 pid=6113) [kvcached][INFO][2026-05-19 09:24:29][interfaces.py:68] kvcached async scheduler enabled
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gpt2, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': 
False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False, 'fuse_rope_kvcache': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.192.17:33471 backend=nccl
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=6113) Worker 0 IPC listener started at /tmp/kvcached-tp-kvcached_vLLM_5876-4645f457/w0.sock
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [gpu_model_runner.py:4255] Starting to load model gpt2...
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [rocm.py:463] Using Triton Attention backend.
(EngineCore_DP0 pid=6113) WARNING 05-19 09:24:29 [compilation.py:1131] Op 'sparse_attn_indexer' doesn't exist (or wasn't imported/registered), enabling with '+sparse_attn_indexer' has no effect
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [weight_utils.py:601] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.46it/s]
(EngineCore_DP0 pid=6113) 
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [default_loader.py:293] Loading weights took 0.31 seconds
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [gpu_model_runner.py:4338] Model loading took 0.32 GiB memory and 2.636792 seconds
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:34 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/7b597994e1/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:34 [backends.py:976] Dynamo bytecode transform time: 1.70 s
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [backends.py:266] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.044 s
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [monitor.py:35] torch.compile takes 3.14 s in total
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [gpu_worker.py:424] Available KV cache memory: 56.72 GiB
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [kv_cache_utils.py:1314] GPU KV cache size: 1,652,016 tokens
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [kv_cache_utils.py:1319] Maximum concurrency for 1,024 tokens per request: 1613.30x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█| 51/51 [00:01<00:00, 
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 24.13it/s]
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:40 [gpu_model_runner.py:5360] Graph capturing finished in 4 secs, took 0.23 GiB
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:40 [core.py:282] init engine (profile, create kv cache, warmup model) took 7.72 seconds
Init C++ PageAllocator: num_layers=12, mem_size_per_layer=2419MB, total_mem_size=58078MB, page_size=2MB, world_size=1, pp_rank=0, async_sched=1, contiguous_layout=1, enable_prealloc=1, num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:42 [vllm.py:747] Asynchronous scheduling is enabled.
INFO 05-19 09:24:42 [llm.py:388] Supported tasks: ['generate']
Rendering prompts: 100%|██████████████████████████████| 4/4 [00:00<00:00, 159.49it/s]
Processed prompts: 100%|█| 4/4 [00:00<00:00, 11.81it/s, est. speed input: 65.02 toks/

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Scott! I'm a board game designer, but you may know me from my"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' sitting right behind them. And he\'s ignoring us. He\'s ignoring us."'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' two days away from the French capital, the capital of other countries, and the'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' dependent on how Google intends to integrate AI into Android and iOS products. Given the'
------------------------------------------------------------
[rank0]:[W519 09:24:43.459989042 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
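
The KV cache figures in the log above are mutually consistent, which can be checked with a little arithmetic. The sketch below assumes the standard GPT-2 small shape (12 layers, hidden size 768) and 2 bytes per element for bfloat16; the per-token formula is the usual one for a dense K/V cache and is an assumption here, not something read from vLLM or kvcached internals.

```python
# Values taken from the log above.
kv_cache_tokens = 1_652_016   # "GPU KV cache size: 1,652,016 tokens"
max_model_len = 1024          # "Using max model len 1024"

# Assumed model shape: GPT-2 small, run in bfloat16.
num_layers = 12               # matches num_layers=12 in the PageAllocator line
hidden_size = 768
bytes_per_elem = 2            # bfloat16

# One K vector and one V vector per layer for every token.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem

# Memory needed to hold the reported number of tokens, in GiB.
kv_gib = kv_cache_tokens * kv_bytes_per_token / 2**30

# How many max-length requests fit at once.
concurrency = kv_cache_tokens / max_model_len

print(f"KV bytes per token: {kv_bytes_per_token}")
print(f"KV cache size: {kv_gib:.2f} GiB")
print(f"Max concurrency at {max_model_len} tokens: {concurrency:.2f}x")
```

Under these assumptions the result rounds to the logged 56.72 GiB of available KV cache memory and the logged 1613.30x maximum concurrency. The `total_mem_size=58078MB` in the PageAllocator line is close to, but not exactly, 12 × 2419 MB × 2 = 58056 MB; the small gap may come from rounding of the per-layer figure in the log, which is a guess.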

hieple-moreh changed the title from "Modify setup.py, csrc/ to build extension on a ROCm box instead of on…" to "Port kvcached to ROCM" on May 19, 2026
hieple-moreh requested a review from loctxmoreh on May 19, 2026 09:15