
Commit 7a153ab

Author: 钮圣虓 (committed)
feat: remove export_fp8kv_calibration code
1 parent 69d2e34 commit 7a153ab

12 files changed

Lines changed: 38 additions & 280 deletions


docs/CN/source/tutorial/fp8_kv_quantization.rst

Lines changed: 15 additions & 66 deletions
@@ -3,20 +3,19 @@
 FP8 KV Quantization and Calibration Guide
 ==========================================
 
-This section describes the complete FP8 KV quantization workflow in LightLLM, including:
+This section describes how to use FP8 KV inference in LightLLM, including:
 
-- Exporting a calibration file (``--export_fp8kv_calibration``)
 - Running inference with a calibration file (``fp8kv``)
 - Quantization granularity differences between the FA3 and FlashInfer backends
 - Common errors and troubleshooting tips
 
 Overview
 --------
 
-LightLLM's FP8 KV quantization uses an offline calibration scheme:
-
-1. First run export mode to collect the maximum absolute values of KV and export ``kv_cache_calib.json``.
-2. Then load that file in inference mode and quantize KV by scale into ``float8_e4m3fn`` storage.
+LightLLM's FP8 KV inference requires a prepared calibration file (``kv_cache_calib.json``),
+loaded via ``--kv_quant_calibration_config_path``.
+You can use the calibration files already provided under ``test/advanced_config/``,
+export one with the `LightCompress <https://github.com/ModelTC/LightCompress>`_ tool, or use your own compatible file.
 
 Backend and Quantization Granularity
 ------------------------------------
@@ -28,59 +27,13 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 
 Therefore, calibration files are strongly tied to the backend:
 
-- ``per_head`` calibration files generated with ``fa3`` are used for ``fa3`` inference.
-- ``per_tensor`` calibration files generated with ``flashinfer`` are used for ``flashinfer`` inference.
-
-Mixing calibration files exported with different backends is not recommended.
-
-Step 1: Export the Calibration File
------------------------------------
-
-Export mode example (FA3):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend fa3 \
-        --llm_decode_att_backend fa3 \
-        --disable_cudagraph
-
-Export mode example (FlashInfer):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend flashinfer \
-        --llm_decode_att_backend flashinfer \
-        --disable_cudagraph
-
-Notes:
-
-- With ``--export_fp8kv_calibration`` set, KV statistics are collected while the server runs.
-- Once calibration completes, ``kv_cache_calib.json`` is written to the current working directory.
-- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` must stay ``None``.
-- The ``test/advanced_config/`` directory in the repository already contains calibration files for common models; use them directly or as references.
-
-Random-data calibration with benchmark_qps.py
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+- ``fa3`` corresponds to ``per_head`` calibration files and should be paired with ``fa3`` inference.
+- ``flashinfer`` corresponds to ``per_tensor`` calibration files and should be paired with ``flashinfer`` inference.
 
-Besides live traffic, ``test/benchmark/service/benchmark_qps.py`` can also be used to generate random requests for calibration.
+Mixing calibration files from different backends is not recommended.
 
-- By default, one calibration result is written after roughly 4000 inferences have accumulated.
-- In practice, run the command below twice to cover the statistics range more stably.
-
-Example command:
-
-.. code-block:: console
-
-    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
-
-Step 2: Start FP8 Inference with the Calibration File
-------------------------------------------------------
+Start FP8 Inference with the Calibration File
+---------------------------------------------
 
 Inference mode example (FA3):
 
@@ -107,12 +60,12 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 Notes:
 
 - ``fp8kv`` mode requires ``--kv_quant_calibration_config_path``.
-- The attention backend used at inference time should match the one used when the calibration file was exported.
+- The attention backend used at inference time should match the one the calibration file expects.
 
 Calibration File Format
 -----------------------
 
-The main fields of the exported ``kv_cache_calib.json`` are:
+The main fields of ``kv_cache_calib.json`` are:
 
 - ``quant_type``: ``per_head`` or ``per_tensor``
 - ``num_layers``: number of layers
@@ -136,14 +89,10 @@ LightLLM's FP8 KV quantization uses an offline calibration scheme:
 
    This means you used ``--llm_kv_type fp8kv`` without passing a calibration file path.
 
-2. Startup error asking for ``--disable_cudagraph``
-
-   This means you used ``--export_fp8kv_calibration``, which must run with cudagraph disabled.
-
-3. ``quant_type not match`` error
+2. ``quant_type not match`` error
 
   Usually the backend and the calibration file type do not match, e.g. running ``flashinfer`` with a ``per_head`` file.
 
-4. Abnormal results after switching backends
+3. Abnormal results after switching backends
 
-   Re-export the calibration file for the target backend; do not reuse files across backends.
+   Use a calibration file that matches the target backend; do not reuse incompatible files across backends.
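
The inference example the documentation refers to is not visible in this hunk. As a rough orientation only, the sketch below assembles a launch command from the flags this page names (``--llm_kv_type fp8kv``, ``--kv_quant_calibration_config_path``, and the attention-backend options); the model and calibration paths are placeholders, and the exact flag set should be checked against the LightLLM server documentation.

# Sketch only: an FP8 KV inference launch assembled from flags named in the docs above.
# Paths are placeholders; verify flag names and defaults against the LightLLM docs.
import subprocess
import sys

cmd = [
    sys.executable, "-m", "lightllm.server.api_server",
    "--model_dir", "/path/to/model",                                # placeholder model path
    "--llm_kv_type", "fp8kv",                                       # store KV as FP8 using calibration scales
    "--kv_quant_calibration_config_path", "kv_cache_calib.json",    # prepared calibration file
    "--llm_prefill_att_backend", "fa3",                             # fa3 expects a per_head calibration file
    "--llm_decode_att_backend", "fa3",
]
subprocess.run(cmd, check=True)  # starts the API server in the foreground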

docs/EN/source/tutorial/fp8_kv_quantization.rst

Lines changed: 13 additions & 64 deletions
@@ -3,20 +3,19 @@
 FP8 KV Quantization and Calibration Guide
 =========================================
 
-This chapter describes the end-to-end FP8 KV quantization workflow in LightLLM, including:
+This chapter describes FP8 KV inference in LightLLM, including:
 
-- Exporting calibration data (``--export_fp8kv_calibration``)
 - Running inference with calibration data (``fp8kv``)
 - Quantization granularity differences between FA3 and FlashInfer
 - Common errors and troubleshooting
 
 Overview
 --------
 
-LightLLM uses an offline calibration flow for FP8 KV quantization:
-
-1. Run export mode to collect KV statistics and produce ``kv_cache_calib.json``.
-2. Run inference mode with that file, and quantize KV into ``float8_e4m3fn`` storage.
+LightLLM FP8 KV inference requires a prepared calibration file (``kv_cache_calib.json``),
+which is loaded by ``--kv_quant_calibration_config_path``.
+You can use calibration files provided in ``test/advanced_config/``,
+export one with `LightCompress <https://github.com/ModelTC/LightCompress>`_, or use your own compatible file.
 
 Backend and Quantization Granularity
 ------------------------------------
@@ -28,59 +27,13 @@ Current behavior:
 
 Calibration files are backend-dependent:
 
-- ``per_head`` files exported with ``fa3`` should be used with ``fa3`` inference.
-- ``per_tensor`` files exported with ``flashinfer`` should be used with ``flashinfer`` inference.
+- ``per_head`` files for ``fa3`` should be used with ``fa3`` inference.
+- ``per_tensor`` files for ``flashinfer`` should be used with ``flashinfer`` inference.
 
 Avoid mixing calibration files across different backends.
 
-Step 1: Export Calibration File
---------------------------------
-
-Export mode example (FA3):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend fa3 \
-        --llm_decode_att_backend fa3 \
-        --disable_cudagraph
-
-Export mode example (FlashInfer):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend flashinfer \
-        --llm_decode_att_backend flashinfer \
-        --disable_cudagraph
-
-Notes:
-
-- Setting ``--export_fp8kv_calibration`` collects KV statistics during runtime.
-- After calibration is completed, ``kv_cache_calib.json`` is written to the current working directory.
-- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` should remain ``None``.
-- The repository already provides calibration files for common models under ``test/advanced_config/``, which can be used directly or as references.
-
-Use benchmark_qps.py for random-data calibration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Besides online traffic, you can use ``test/benchmark/service/benchmark_qps.py`` to generate random requests for calibration.
-
-- By default, one calibration result is exported after around 4000 inferences are accumulated.
-- In practice, you can run the following command twice to improve coverage stability.
-
-Example command:
-
-.. code-block:: console
-
-    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
-
-Step 2: Start FP8 Inference with Calibration
----------------------------------------------
+Start FP8 Inference with Calibration
+------------------------------------
 
 Inference mode example (FA3):
 
@@ -107,7 +60,7 @@ Inference mode example (FlashInfer):
 Notes:
 
 - ``fp8kv`` requires ``--kv_quant_calibration_config_path``.
-- Keep the inference backend consistent with the backend used during calibration export.
+- Keep the inference backend consistent with the backend expected by the calibration file.
 
 Calibration File Schema
 -----------------------
@@ -136,14 +89,10 @@ Common Issues
 
   You are using ``--llm_kv_type fp8kv`` without a calibration file path.
 
-2. Error says ``--disable_cudagraph`` is required
-
-   You are using ``--export_fp8kv_calibration``; this mode requires cudagraph disabled.
-
-3. ``quant_type not match`` error
+2. ``quant_type not match`` error
 
   Usually caused by backend/file mismatch (for example, using a ``per_head`` file with ``flashinfer``).
 
-4. Abnormal quality after backend switch
+3. Abnormal quality after backend switch
 
-   Re-export calibration using the target backend instead of reusing files across backends.
+   Use a calibration file that matches the target backend instead of reusing an incompatible file.
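
The ``quant_type not match`` guidance above can be turned into a quick pre-flight check before launching the server. The sketch below is illustrative only and assumes nothing beyond the fields this page documents (``quant_type`` with values ``per_head``/``per_tensor``, and ``num_layers``); the example path under ``test/advanced_config/`` is hypothetical.

# Illustrative pre-flight check for the "quant_type not match" issue described above.
# Assumes only the documented fields: quant_type ("per_head"/"per_tensor") and num_layers.
import json

# Documented pairing: fa3 -> per_head, flashinfer -> per_tensor.
EXPECTED_QUANT_TYPE = {"fa3": "per_head", "flashinfer": "per_tensor"}

def check_calibration_file(path: str, backend: str) -> None:
    with open(path, "r", encoding="utf-8") as f:
        calib = json.load(f)
    expected = EXPECTED_QUANT_TYPE[backend]
    actual = calib.get("quant_type")
    if actual != expected:
        raise ValueError(
            f"quant_type not match: file has {actual!r}, backend {backend!r} expects {expected!r}"
        )
    print(f"OK: {path} -> {actual}, {calib.get('num_layers')} layers")

if __name__ == "__main__":
    # Hypothetical path; point this at the file you pass via --kv_quant_calibration_config_path.
    check_calibration_file("test/advanced_config/kv_cache_calib.json", "fa3")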

lightllm/common/kv_cache_mem_manager/__init__.py

Lines changed: 0 additions & 2 deletions
@@ -1,6 +1,5 @@
 from .mem_manager import MemoryManager, ReadOnlyStaticsMemoryManager
 from .calibration_fp8kv_mem_manager import CalibrationFP8KVMemoryManager
-from .export_calibration_mem_manager import ExportCalibrationMemoryManager
 from .ppl_int8kv_mem_manager import PPLINT8KVMemoryManager
 from .ppl_int4kv_mem_manager import PPLINT4KVMemoryManager
 from .deepseek2_mem_manager import Deepseek2MemoryManager
@@ -10,7 +9,6 @@
     "MemoryManager",
     "ReadOnlyStaticsMemoryManager",
     "CalibrationFP8KVMemoryManager",
-    "ExportCalibrationMemoryManager",
     "PPLINT4KVMemoryManager",
     "PPLINT8KVMemoryManager",
     "Deepseek2MemoryManager",

lightllm/common/kv_cache_mem_manager/calibration_fp8kv_mem_manager.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 
 class CalibrationFP8KVMemoryManager(OfflineFP8QuantMemManager):
     def __init__(self, size, dtype, head_num, head_dim, layer_num, always_copy=False, mem_fraction=0.9):
-        super().__init__(size, dtype, head_num, head_dim, layer_num, always_copy, mem_fraction, is_export_mode=False)
+        super().__init__(size, dtype, head_num, head_dim, layer_num, always_copy, mem_fraction)
 
     def copy_kv_to_mem_manager(self, layer_index: int, mem_index: torch.Tensor, kv: torch.Tensor):
         """

lightllm/common/kv_cache_mem_manager/export_calibration_mem_manager.py

Lines changed: 0 additions & 23 deletions
This file was deleted.

lightllm/common/kv_cache_mem_manager/mem_utils.py

Lines changed: 0 additions & 4 deletions
@@ -1,7 +1,6 @@
 from . import (
     MemoryManager,
     CalibrationFP8KVMemoryManager,
-    ExportCalibrationMemoryManager,
     PPLINT8KVMemoryManager,
     PPLINT4KVMemoryManager,
     Deepseek2MemoryManager,
@@ -46,9 +45,6 @@ def select_mem_manager_class():
     elif get_env_start_args().llm_kv_type == "None":
         memory_manager_class = MemoryManager
 
-    if get_env_start_args().export_fp8kv_calibration:
-        memory_manager_class = ExportCalibrationMemoryManager
-
     logger.info(f"Model kv cache using mem_manager class: {memory_manager_class}")
     return memory_manager_class
 
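
After this change, the KV memory-manager choice in ``select_mem_manager_class`` is driven by ``--llm_kv_type`` alone. Below is a minimal sketch of that dispatch; only the ``"None"`` branch appears in this diff, and the mapping of ``fp8kv`` to ``CalibrationFP8KVMemoryManager`` is an assumption based on the class and mode names in this commit, not code shown here.

# Sketch of the post-change selection logic, not the actual function body.
# Only the "None" branch is shown in the diff; the fp8kv mapping below is assumed
# from the class/mode names used elsewhere in this commit.
from lightllm.common.kv_cache_mem_manager import (
    CalibrationFP8KVMemoryManager,
    MemoryManager,
)

def pick_mem_manager_class(llm_kv_type: str):
    if llm_kv_type == "None":
        # Default unquantized KV cache.
        return MemoryManager
    if llm_kv_type == "fp8kv":
        # FP8 KV cache driven by an offline calibration file
        # (--kv_quant_calibration_config_path).
        return CalibrationFP8KVMemoryManager
    raise ValueError(f"unsupported llm_kv_type: {llm_kv_type}")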
