Commit 64fbece

Update documentation from main repository
1 parent 4ecc96b commit 64fbece

4 files changed: +4 -4 lines changed

docs/api/sllm-store-cli.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -231,7 +231,7 @@ sllm-store load --model facebook/opt-1.3b --backend transformers --adapter --ada
 
 #### Note: loading vLLM models
 
-To load models with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.9.0.1`.
+To load models with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.11.2`.
 
 ```bash
 ./sllm_store/vllm_patch/patch.sh
````
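Since the patch is only validated against a single vLLM release, it can help to confirm the installed version before patching. A minimal sketch, assuming vLLM is importable in the active environment:

```bash
# Check that the installed vLLM matches the tested version (0.11.2)
# before applying the compatibility patch.
python -c "import vllm; print(vllm.__version__)"
./sllm_store/vllm_patch/patch.sh
```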

docs/deployment/single_machine.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -92,7 +92,7 @@ To install ServerlessLLM from source, follow these steps:
 
 ### vLLM Patch
 
-To use vLLM with ServerlessLLM, you must apply a patch. The patch file is located at `sllm_store/vllm_patch/sllm_load.patch` within the ServerlessLLM repository. This patch has been tested with vLLM version `0.9.0.1`.
+To use vLLM with ServerlessLLM, you must apply a patch. The patch file is located at `sllm_store/vllm_patch/sllm_load.patch` within the ServerlessLLM repository. This patch has been tested with vLLM version `0.11.2`.
 
 Apply the patch using the following script:
 
````
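The quoted docs point at `sllm_store/vllm_patch/sllm_load.patch` and a wrapper script. As a rough illustration only, a hypothetical manual application of that patch file onto an installed vLLM package might look like this (the `-p` strip level is a guess and depends on the patch headers; prefer the provided script):

```bash
# Locate the pip-installed vLLM package, then apply the ServerlessLLM
# patch file directly. Illustrative sketch only; patch.sh is the
# supported path.
VLLM_DIR=$(python -c "import vllm, os; print(os.path.dirname(vllm.__file__))")
patch -p2 -d "$VLLM_DIR" < sllm_store/vllm_patch/sllm_load.patch
```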
docs/store/quickstart.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -114,7 +114,7 @@ ServerlessLLM integrates with vLLM to provide fast model loading capabilities. F
 
 ### Prerequisites
 
-Before using ServerlessLLM with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.9.0.1`.
+Before using ServerlessLLM with vLLM, you need to apply a compatibility patch to your vLLM installation. This patch has been tested with vLLM version `0.11.2`.
 
 ### Apply the vLLM Patch
 
````
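Because the patch targets one specific release, pinning vLLM at install time keeps the environment on the tested version. A minimal sketch, assuming vLLM is installed from PyPI:

```bash
# Install the exact release the compatibility patch was tested against.
pip install vllm==0.11.2
```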
docs/store/rocm_quickstart.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -106,7 +106,7 @@ INFO 06-05 13:02:52 [__init__.py:36] All plugins in this group will be loaded. S
 INFO 06-05 13:03:00 [config.py:793] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
 INFO 06-05 13:03:00 [arg_utils.py:1594] rocm is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
 INFO 06-05 13:03:04 [config.py:1910] Disabled the custom all-reduce kernel because it is not supported on current platform.
-INFO 06-05 13:03:04 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0.1) with config: model='/models/facebook/opt-125m', speculative_config=None, tokenizer='/models/facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.SERVERLESS_LLM, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/models/facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
+INFO 06-05 13:03:04 [llm_engine.py:230] Initializing a V0 LLM engine (v0.11.2) with config: model='/models/facebook/opt-125m', speculative_config=None, tokenizer='/models/facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.SERVERLESS_LLM, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/models/facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
 INFO 06-05 13:03:04 [rocm.py:208] None is not supported in AMD GPUs.
 INFO 06-05 13:03:04 [rocm.py:209] Using ROCmFlashAttention backend.
 INFO 06-05 13:03:05 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
````
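The `load_format=LoadFormat.SERVERLESS_LLM` entry in the config line above is what the compatibility patch wires in. A hypothetical invocation that would produce a similar initialization, assuming the patch registers `serverless_llm` as a `--load-format` choice (it is not a stock vLLM option):

```bash
# Serve a locally stored checkpoint through the patched loader.
# Model path taken from the log output above.
vllm serve /models/facebook/opt-125m --load-format serverless_llm
```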
