Port kvcached to ROCM by hieple-moreh · Pull Request #1 · moreh-dev/kvcached

hieple-moreh · 2026-05-19T09:14:58Z

Ticket: MV-4347

Unit Tests:

Only the test files that exercise GPU-related logic were run:

root@mv-mi250-07:/app/kvcached# python3 -m pytest tests/test_kvcache_manager.py     
================================ test session starts ================================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /app/kvcached
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 5 items                                                                   

tests/test_kvcache_manager.py ..s..                                           [100%]

=========================== 4 passed, 1 skipped in 2.14s ============================

root@mv-mi250-07:/app/kvcached# python3 tests/test_paged_allocator_aliasing.py  
Init C++ PageAllocator: num_layers=2, mem_size_per_layer=64MB, total_mem_size=256MB, page_size=2MB, world_size=1, pp_rank=0, async_sched=0, contiguous_layout=1, enable_prealloc=1, num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10
Setup: 65536 tokens, page_size=16, available blocks=4095

[PASS] test_without_alloc_data_corrupted
       Only 28/60 tokens retained their value — zero_page aliasing confirmed.
       First 10 expected : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
       First 10 actual   : [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

[PASS] test_with_alloc_data_correct
       All 30/30 tokens correct — unique physical pages confirmed.

==================================================
Results: 2 passed, 0 failed
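
The allocator figures in the log above can be cross-checked with a little arithmetic. The sketch below parses the `Init C++ PageAllocator` line and recomputes the totals. The relation `total = num_layers * num_kv_buffers * mem_size_per_layer` is inferred from the logged numbers, not taken from the kvcached source, and `parse_allocator_line` is a hypothetical helper written for this illustration.

```python
import re


def parse_allocator_line(line: str) -> dict[str, int]:
    """Collect the integer key=value pairs from the log line (sizes are in MB)."""
    fields = {}
    for key, value in re.findall(r"(\w+)=(\d+)(?:MB)?", line):
        fields[key] = int(value)
    return fields


# Log line copied from the unit-test output above.
LINE = (
    "Init C++ PageAllocator: num_layers=2, mem_size_per_layer=64MB, "
    "total_mem_size=256MB, page_size=2MB, world_size=1, pp_rank=0, "
    "async_sched=0, contiguous_layout=1, enable_prealloc=1, "
    "num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10"
)

cfg = parse_allocator_line(LINE)

# Assumed relation, inferred from the log: layers x KV buffers x per-layer size.
expected_total = cfg["num_layers"] * cfg["num_kv_buffers"] * cfg["mem_size_per_layer"]

# Number of 2MB physical pages backing the whole reservation.
pages_total = cfg["total_mem_size"] // cfg["page_size"]

# "Setup: 65536 tokens, page_size=16" gives the number of token blocks.
blocks = 65536 // 16

print(expected_total, pages_total, blocks)
```

With the logged values this gives 256 MB, 128 physical pages, and 4096 token blocks. The log reports 4095 available blocks, one fewer than 4096; the log itself does not say why.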

End-to-End tests:

Pre-requisites:

python3 tools/dev_copy_pth.py
export ENABLE_KVCACHED=true
export KVCACHED_AUTOPATCH=1
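
When the tests are driven from a script rather than an interactive shell, the same two flags can be passed through the child process environment. This is a minimal sketch: `kvcached_env` and `run_offline_test` are hypothetical helpers, not part of kvcached, and `python3 tools/dev_copy_pth.py` still has to be run once beforehand.

```python
import os
import subprocess
import sys
from collections.abc import Mapping


def kvcached_env(base: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of `base` with the two flags from the pre-requisites set."""
    env = dict(base)
    env["ENABLE_KVCACHED"] = "true"
    env["KVCACHED_AUTOPATCH"] = "1"
    return env


def run_offline_test() -> int:
    """Run the offline serving test with the flags applied; returns the exit code."""
    result = subprocess.run(
        [sys.executable, "tests/test_offline_serving.py"],
        env=kvcached_env(os.environ),
        check=False,
    )
    return result.returncode
```

Building the environment as a copy keeps the parent process untouched, so the flags only affect the test run.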

Offline vLLM serving:

root@mv-mi250-07:/app/kvcached# python3 tests/test_offline_serving.py 
[kvcached][INFO][2026-05-19 09:24:10][patch_base.py:98] Applying 6 patches for vllm
[kvcached][INFO][2026-05-19 09:24:14][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:14][version_utils.py:189] Detected vllm version: 0.17.0
WARNING 05-19 09:24:15 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:15][patch_base.py:178] Successfully patched vllm: elastic_block_pool, engine_core, gpu_model_runner, gpu_worker, kv_cache_coordinator
INFO 05-19 09:24:16 [utils.py:238] non-default args: {'enable_prefix_caching': False, 'disable_log_stats': True, 'model': 'gpt2'}
INFO 05-19 09:24:18 [model.py:531] Resolved architecture: GPT2LMHeadModel
INFO 05-19 09:24:19 [model.py:1889] Downcasting torch.float32 to torch.bfloat16.
INFO 05-19 09:24:19 [model.py:1554] Using max model len 1024
INFO 05-19 09:24:19 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 05-19 09:24:19 [vllm.py:747] Asynchronous scheduling is enabled.
WARNING 05-19 09:24:21 [system_utils.py:152] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
[kvcached][INFO][2026-05-19 09:24:23][patch_base.py:98] Applying 6 patches for vllm
[kvcached][INFO][2026-05-19 09:24:26][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:26][version_utils.py:189] Detected vllm version: 0.17.0
WARNING 05-19 09:24:28 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][version_utils.py:189] Detected vllm version: 0.17.0
[kvcached][INFO][2026-05-19 09:24:28][patch_base.py:178] Successfully patched vllm: elastic_block_pool, engine_core, gpu_model_runner, gpu_worker, kv_cache_coordinator
(EngineCore_DP0 pid=6113) [kvcached][INFO][2026-05-19 09:24:29][interfaces.py:68] kvcached async scheduler enabled
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gpt2, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+sparse_attn_indexer', 'none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': 
False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False, 'fuse_rope_kvcache': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.192.17:33471 backend=nccl
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=6113) Worker 0 IPC listener started at /tmp/kvcached-tp-kvcached_vLLM_5876-4645f457/w0.sock
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [gpu_model_runner.py:4255] Starting to load model gpt2...
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:29 [rocm.py:463] Using Triton Attention backend.
(EngineCore_DP0 pid=6113) WARNING 05-19 09:24:29 [compilation.py:1131] Op 'sparse_attn_indexer' doesn't exist (or wasn't imported/registered), enabling with '+sparse_attn_indexer' has no effect
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [weight_utils.py:601] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.46it/s]
(EngineCore_DP0 pid=6113) 
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [default_loader.py:293] Loading weights took 0.31 seconds
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:32 [gpu_model_runner.py:4338] Model loading took 0.32 GiB memory and 2.636792 seconds
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:34 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/7b597994e1/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:34 [backends.py:976] Dynamo bytecode transform time: 1.70 s
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [backends.py:266] Directly load the compiled graph(s) for compile range (1, 8192) from the cache, took 1.044 s
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [monitor.py:35] torch.compile takes 3.14 s in total
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [gpu_worker.py:424] Available KV cache memory: 56.72 GiB
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [kv_cache_utils.py:1314] GPU KV cache size: 1,652,016 tokens
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:36 [kv_cache_utils.py:1319] Maximum concurrency for 1,024 tokens per request: 1613.30x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█| 51/51 [00:01<00:00, 
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 24.13it/s]
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:40 [gpu_model_runner.py:5360] Graph capturing finished in 4 secs, took 0.23 GiB
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:40 [core.py:282] init engine (profile, create kv cache, warmup model) took 7.72 seconds
Init C++ PageAllocator: num_layers=12, mem_size_per_layer=2419MB, total_mem_size=58078MB, page_size=2MB, world_size=1, pp_rank=0, async_sched=1, contiguous_layout=1, enable_prealloc=1, num_kv_buffers=2, group_id=0, min_reserved_pages=5, max_reserved_pages=10
(EngineCore_DP0 pid=6113) INFO 05-19 09:24:42 [vllm.py:747] Asynchronous scheduling is enabled.
INFO 05-19 09:24:42 [llm.py:388] Supported tasks: ['generate']
Rendering prompts: 100%|██████████████████████████████| 4/4 [00:00<00:00, 159.49it/s]
Processed prompts: 100%|█| 4/4 [00:00<00:00, 11.81it/s, est. speed input: 65.02 toks/

Generated Outputs:
------------------------------------------------------------
Prompt:    'Hello, my name is'
Output:    " Scott! I'm a board game designer, but you may know me from my"
------------------------------------------------------------
Prompt:    'The president of the United States is'
Output:    ' sitting right behind them. And he\'s ignoring us. He\'s ignoring us."'
------------------------------------------------------------
Prompt:    'The capital of France is'
Output:    ' two days away from the French capital, the capital of other countries, and the'
------------------------------------------------------------
Prompt:    'The future of AI is'
Output:    ' dependent on how Google intends to integrate AI into Android and iOS products. Given the'
------------------------------------------------------------
[rank0]:[W519 09:24:43.459989042 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
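
The KV cache figures in the log above are internally consistent, which can be checked by hand. The sketch below recomputes the reported maximum concurrency and estimates the token capacity from the available memory. The layer count (12) and bfloat16 dtype come from the log; the hidden size of 768 for `gpt2` is an assumption taken from the standard GPT-2 small configuration, and the per-token formula ignores any block-granularity rounding.

```python
GIB = 1024 ** 3

# Values reported in the log.
kv_cache_tokens = 1_652_016      # "GPU KV cache size: 1,652,016 tokens"
max_model_len = 1_024            # "Using max model len 1024"
available_gib = 56.72            # "Available KV cache memory: 56.72 GiB"
num_layers = 12                  # "num_layers=12" in the PageAllocator line

# Assumptions, not stated in the log.
hidden_size = 768                # GPT-2 small hidden size (assumed)
bytes_per_elem = 2               # bfloat16

# Maximum concurrency is the token capacity divided by tokens per request.
concurrency = kv_cache_tokens / max_model_len

# One key and one value vector per layer for every cached token.
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_elem
estimated_tokens = available_gib * GIB / bytes_per_token

print(f"{concurrency:.2f}x")     # prints "1613.30x", matching the log
print(bytes_per_token)
print(round(estimated_tokens))
```

The estimate lands within a fraction of a percent of the logged 1,652,016 tokens; the small gap is expected because 56.72 GiB is itself a rounded figure.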

Commit 2d1f7c6: Modify setup.py, csrc/ to build extension on a ROCm box instead of only CUDA

gitgod-bot assigned hieple-moreh May 19, 2026

hieple-moreh changed the title ~~Modify setup.py, csrc/ to build extension on a ROCm box instead of on…~~ Port kvcached to ROCM May 19, 2026

hieple-moreh requested a review from loctxmoreh May 19, 2026 09:15

hieple-moreh wants to merge 1 commit into main from feat/rocm-port
