
Commit 7a153ab

Author: 钮圣虓 (committed)
feat: remove export_fp8kv_calibration code
1 parent 69d2e34 commit 7a153ab

12 files changed

Lines changed: 38 additions & 280 deletions


docs/CN/source/tutorial/fp8_kv_quantization.rst

Lines changed: 15 additions & 66 deletions
@@ -3,20 +3,19 @@
 FP8 KV Quantization and Calibration Guide
 ==========================================
 
-This section describes the complete FP8 KV quantization workflow in LightLLM, including:
+This section describes how to use FP8 KV inference in LightLLM, including:
 
-- Exporting a calibration file (``--export_fp8kv_calibration``)
 - Running inference with a calibration file (``fp8kv``)
 - Quantization granularity differences between the FA3 and FlashInfer backends
 - Common errors and troubleshooting tips
 
 Overview
 --------
 
-LightLLM's FP8 KV quantization uses an offline calibration scheme:
-
-1. First run export mode to collect the maximum absolute values of KV and export ``kv_cache_calib.json``.
-2. Then load that file in inference mode and quantize KV by scale into ``float8_e4m3fn`` storage.
+LightLLM's FP8 KV inference requires a prepared calibration file (``kv_cache_calib.json``),
+loaded via ``--kv_quant_calibration_config_path``.
+You can use the calibration files already provided under ``test/advanced_config/``,
+export one with the `LightCompress <https://github.com/ModelTC/LightCompress>`_ tool, or use your own compatible file.
 
 Backend and Quantization Granularity
 ------------------------------------
@@ -28,59 +27,13 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 
 Therefore, calibration files are strongly tied to the backend:
 
-- ``per_head`` calibration files generated with ``fa3`` are used for ``fa3`` inference.
-- ``per_tensor`` calibration files generated with ``flashinfer`` are used for ``flashinfer`` inference.
-
-Mixing calibration files exported with different backends is not recommended.
-
-Step 1: Export the Calibration File
------------------------------------
-
-Export mode example (FA3):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend fa3 \
-        --llm_decode_att_backend fa3 \
-        --disable_cudagraph
-
-Export mode example (FlashInfer):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend flashinfer \
-        --llm_decode_att_backend flashinfer \
-        --disable_cudagraph
-
-Notes:
-
-- With ``--export_fp8kv_calibration`` set, KV statistics are collected while the server runs.
-- Once calibration completes, ``kv_cache_calib.json`` is written to the current working directory.
-- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` must stay ``None``.
-- The ``test/advanced_config/`` directory in the repository already contains calibration files for common models; use them directly or as references.
-
-Random-data calibration with benchmark_qps.py
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- ``fa3`` corresponds to ``per_head`` calibration files and should be paired with ``fa3`` inference.
+- ``flashinfer`` corresponds to ``per_tensor`` calibration files and should be paired with ``flashinfer`` inference.
 
-Besides live traffic, ``test/benchmark/service/benchmark_qps.py`` can also be used to generate random requests for calibration.
+Mixing calibration files from different backends is not recommended.
 
-- By default, one calibration result is written after roughly 4000 inferences have accumulated.
-- In practice, run the command below twice to cover the statistics range more stably.
-
-Example command:
-
-.. code-block:: console
-
-    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
-
-Step 2: Start FP8 Inference with the Calibration File
-------------------------------------------------------
+Start FP8 Inference with the Calibration File
+---------------------------------------------
 
 Inference mode example (FA3):
 
@@ -107,12 +60,12 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 Notes:
 
 - ``fp8kv`` mode requires ``--kv_quant_calibration_config_path``.
-- The attention backend used at inference time should match the one used when the calibration file was exported.
+- The attention backend used at inference time should match the one the calibration file expects.
 
 Calibration File Format
 -----------------------
 
-The main fields of the exported ``kv_cache_calib.json`` are:
+The main fields of ``kv_cache_calib.json`` are:
 
 - ``quant_type``: ``per_head`` or ``per_tensor``
 - ``num_layers``: number of layers
@@ -136,14 +89,10 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 
    This means you used ``--llm_kv_type fp8kv`` without passing a calibration file path.
 
-2. Startup error asking for ``--disable_cudagraph``
-
-   This means you used ``--export_fp8kv_calibration``, which must run with cudagraph disabled.
-
-3. ``quant_type not match`` error
+2. ``quant_type not match`` error
 
   Usually the backend and the calibration file type do not match, e.g. running ``flashinfer`` with a ``per_head`` file.
 
-4. Abnormal results after switching backends
+3. Abnormal results after switching backends
 
-   Re-export the calibration file for the target backend; do not reuse files across backends.
+   Use a calibration file that matches the target backend; do not reuse incompatible files across backends.
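
The inference example the documentation refers to is not visible in this hunk. As a rough orientation only, the sketch below assembles a launch command from the flags this page names (``--llm_kv_type fp8kv``, ``--kv_quant_calibration_config_path``, and the attention-backend options); the model and calibration paths are placeholders, and the exact flag set should be checked against the LightLLM server documentation.

# Sketch only: an FP8 KV inference launch assembled from flags named in the docs above.
# Paths are placeholders; verify flag names and defaults against the LightLLM docs.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "lightllm.server.api_server",
    "--model_dir", "/path/to/model",                                # placeholder model path
    "--llm_kv_type", "fp8kv",                                       # store KV as FP8 using calibration scales
    "--kv_quant_calibration_config_path", "kv_cache_calib.json",    # prepared calibration file
    "--llm_prefill_att_backend", "fa3",                             # fa3 expects a per_head calibration file
    "--llm_decode_att_backend", "fa3",
]
subprocess.run(cmd, check=True)  # starts the API server in the foreground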

docs/EN/source/tutorial/fp8_kv_quantization.rst

Lines changed: 13 additions & 64 deletions
@@ -3,20 +3,19 @@
 FP8 KV Quantization and Calibration Guide
 =========================================
 
-This chapter describes the end-to-end FP8 KV quantization workflow in LightLLM, including:
+This chapter describes FP8 KV inference in LightLLM, including:
 
-- Exporting calibration data (``--export_fp8kv_calibration``)
 - Running inference with calibration data (``fp8kv``)
 - Quantization granularity differences between FA3 and FlashInfer
 - Common errors and troubleshooting
 
 Overview
 --------
 
-LightLLM uses an offline calibration flow for FP8 KV quantization:
-
-1. Run export mode to collect KV statistics and produce ``kv_cache_calib.json``.
-2. Run inference mode with that file, and quantize KV into ``float8_e4m3fn`` storage.
+LightLLM FP8 KV inference requires a prepared calibration file (``kv_cache_calib.json``),
+which is loaded by ``--kv_quant_calibration_config_path``.
+You can use calibration files provided in ``test/advanced_config/``,
+export one with `LightCompress <https://github.com/ModelTC/LightCompress>`_, or use your own compatible file.
 
 Backend and Quantization Granularity
 ------------------------------------
@@ -28,59 +27,13 @@ Current behavior:
 
 Calibration files are backend-dependent:
 
-- ``per_head`` files exported with ``fa3`` should be used with ``fa3`` inference.
-- ``per_tensor`` files exported with ``flashinfer`` should be used with ``flashinfer`` inference.
+- ``per_head`` files for ``fa3`` should be used with ``fa3`` inference.
+- ``per_tensor`` files for ``flashinfer`` should be used with ``flashinfer`` inference.
 
 Avoid mixing calibration files across different backends.
 
-Step 1: Export Calibration File
---------------------------------
-
-Export mode example (FA3):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend fa3 \
-        --llm_decode_att_backend fa3 \
-        --disable_cudagraph
-
-Export mode example (FlashInfer):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend flashinfer \
-        --llm_decode_att_backend flashinfer \
-        --disable_cudagraph
-
-Notes:
-
-- Setting ``--export_fp8kv_calibration`` collects KV statistics during runtime.
-- After calibration is completed, ``kv_cache_calib.json`` is written to the current working directory.
-- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` should remain ``None``.
-- The repository already provides calibration files for common models under ``test/advanced_config/``, which can be used directly or as references.
-
-Use benchmark_qps.py for random-data calibration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Besides online traffic, you can use ``test/benchmark/service/benchmark_qps.py`` to generate random requests for calibration.
-
-- By default, one calibration result is exported after around 4000 inferences are accumulated.
-- In practice, you can run the following command twice to improve coverage stability.
-
-Example command:
-
-.. code-block:: console
-
-    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
-
-Step 2: Start FP8 Inference with Calibration
----------------------------------------------
+Start FP8 Inference with Calibration
+------------------------------------
 
 Inference mode example (FA3):
 
@@ -107,7 +60,7 @@ Inference mode example (FlashInfer):
 Notes:
 
 - ``fp8kv`` requires ``--kv_quant_calibration_config_path``.
-- Keep the inference backend consistent with the backend used during calibration export.
+- Keep the inference backend consistent with the backend expected by the calibration file.
 
 Calibration File Schema
 -----------------------
@@ -136,14 +89,10 @@ Common Issues
 
   You are using ``--llm_kv_type fp8kv`` without a calibration file path.
 
-2. Error says ``--disable_cudagraph`` is required
-
-   You are using ``--export_fp8kv_calibration``; this mode requires cudagraph disabled.
-
-3. ``quant_type not match`` error
+2. ``quant_type not match`` error
 
   Usually caused by backend/file mismatch (for example, using a ``per_head`` file with ``flashinfer``).
 
-4. Abnormal quality after backend switch
+3. Abnormal quality after backend switch
 
-   Re-export calibration using the target backend instead of reusing files across backends.
+   Use a calibration file that matches the target backend instead of reusing an incompatible file.
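
The ``quant_type not match`` guidance above can be turned into a quick pre-flight check before launching the server. The sketch below is illustrative only and assumes nothing beyond the fields this page documents (``quant_type`` with values ``per_head``/``per_tensor``, and ``num_layers``); the example path under ``test/advanced_config/`` is hypothetical.

# Illustrative pre-flight check for the "quant_type not match" issue described above.
# Assumes only the documented fields: quant_type ("per_head"/"per_tensor") and num_layers.
import json

# Documented pairing: fa3 -> per_head, flashinfer -> per_tensor.
EXPECTED_QUANT_TYPE = {"fa3": "per_head", "flashinfer": "per_tensor"}

def check_calibration_file(path: str, backend: str) -> None:
    with open(path, "r", encoding="utf-8") as f:
        calib = json.load(f)
    expected = EXPECTED_QUANT_TYPE[backend]
    actual = calib.get("quant_type")
    if actual != expected:
        raise ValueError(
            f"quant_type not match: file has {actual!r}, backend {backend!r} expects {expected!r}"
        )
    print(f"OK: {path} -> {actual}, {calib.get('num_layers')} layers")

if __name__ == "__main__":
    # Hypothetical path; point this at the file you pass via --kv_quant_calibration_config_path.
    check_calibration_file("test/advanced_config/kv_cache_calib.json", "fa3")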

lightllm/common/kv_cache_mem_manager/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -1,6 +1,5 @@
 from .mem_manager import MemoryManager, ReadOnlyStaticsMemoryManager
 from .calibration_fp8kv_mem_manager import CalibrationFP8KVMemoryManager
-from .export_calibration_mem_manager import ExportCalibrationMemoryManager
 from .ppl_int8kv_mem_manager import PPLINT8KVMemoryManager
 from .ppl_int4kv_mem_manager import PPLINT4KVMemoryManager
 from .deepseek2_mem_manager import Deepseek2MemoryManager
@@ -10,7 +9,6 @@
     "MemoryManager",
     "ReadOnlyStaticsMemoryManager",
     "CalibrationFP8KVMemoryManager",
-    "ExportCalibrationMemoryManager",
     "PPLINT4KVMemoryManager",
     "PPLINT8KVMemoryManager",
     "Deepseek2MemoryManager",

lightllm/common/kv_cache_mem_manager/calibration_fp8kv_mem_manager.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 
 class CalibrationFP8KVMemoryManager(OfflineFP8QuantMemManager):
     def __init__(self, size, dtype, head_num, head_dim, layer_num, always_copy=False, mem_fraction=0.9):
-        super().__init__(size, dtype, head_num, head_dim, layer_num, always_copy, mem_fraction, is_export_mode=False)
+        super().__init__(size, dtype, head_num, head_dim, layer_num, always_copy, mem_fraction)
 
     def copy_kv_to_mem_manager(self, layer_index: int, mem_index: torch.Tensor, kv: torch.Tensor):
         """

lightllm/common/kv_cache_mem_manager/export_calibration_mem_manager.py

Lines changed: 0 additions & 23 deletions
This file was deleted.

lightllm/common/kv_cache_mem_manager/mem_utils.py

Lines changed: 0 additions & 4 deletions
@@ -1,7 +1,6 @@
 from . import (
     MemoryManager,
     CalibrationFP8KVMemoryManager,
-    ExportCalibrationMemoryManager,
     PPLINT8KVMemoryManager,
     PPLINT4KVMemoryManager,
     Deepseek2MemoryManager,
@@ -46,9 +45,6 @@ def select_mem_manager_class():
     elif get_env_start_args().llm_kv_type == "None":
         memory_manager_class = MemoryManager
 
-    if get_env_start_args().export_fp8kv_calibration:
-        memory_manager_class = ExportCalibrationMemoryManager
-
     logger.info(f"Model kv cache using mem_manager class: {memory_manager_class}")
     return memory_manager_class
 
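
After this change, the KV memory-manager choice in ``select_mem_manager_class`` is driven by ``--llm_kv_type`` alone. Below is a minimal sketch of that dispatch; only the ``"None"`` branch appears in this diff, and the mapping of ``fp8kv`` to ``CalibrationFP8KVMemoryManager`` is an assumption based on the class and mode names in this commit, not code shown here.

# Sketch of the post-change selection logic, not the actual function body.
# Only the "None" branch is shown in the diff; the fp8kv mapping below is assumed
# from the class/mode names used elsewhere in this commit.
from lightllm.common.kv_cache_mem_manager import (
    CalibrationFP8KVMemoryManager,
    MemoryManager,
)

def pick_mem_manager_class(llm_kv_type: str):
    if llm_kv_type == "None":
        # Default unquantized KV cache.
        return MemoryManager
    if llm_kv_type == "fp8kv":
        # FP8 KV cache driven by an offline calibration file
        # (--kv_quant_calibration_config_path).
        return CalibrationFP8KVMemoryManager
    raise ValueError(f"unsupported llm_kv_type: {llm_kv_type}")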
