@@ -3,20 +3,19 @@
 FP8 KV Quantization and Calibration Guide
 =========================================
 
-This chapter describes the end-to-end FP8 KV quantization workflow in LightLLM, including:
+This chapter describes FP8 KV inference in LightLLM, including:
 
-- Exporting calibration data (``--export_fp8kv_calibration``)
 - Running inference with calibration data (``fp8kv``)
 - Quantization granularity differences between FA3 and FlashInfer
 - Common errors and troubleshooting
 
 Overview
 --------
 
-LightLLM uses an offline calibration flow for FP8 KV quantization:
-
-1. Run export mode to collect KV statistics and produce ``kv_cache_calib.json``.
-2. Run inference mode with that file, and quantize KV into ``float8_e4m3fn`` storage.
+LightLLM FP8 KV inference requires a prepared calibration file (``kv_cache_calib.json``),
+which is loaded by ``--kv_quant_calibration_config_path``.
+You can use calibration files provided in ``test/advanced_config/``,
+export one with `LightCompress <https://github.com/ModelTC/LightCompress>`_, or use your own compatible file.
 
 Backend and Quantization Granularity
 ------------------------------------
@@ -28,59 +27,13 @@ Current behavior:
 
 Calibration files are backend-dependent:
 
-- ``per_head`` files exported with ``fa3`` should be used with ``fa3`` inference.
-- ``per_tensor`` files exported with ``flashinfer`` should be used with ``flashinfer`` inference.
+- ``per_head`` files for ``fa3`` should be used with ``fa3`` inference.
+- ``per_tensor`` files for ``flashinfer`` should be used with ``flashinfer`` inference.
 
 Avoid mixing calibration files across different backends.
 
-Step 1: Export Calibration File
---------------------------------
-
-Export mode example (FA3):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend fa3 \
-        --llm_decode_att_backend fa3 \
-        --disable_cudagraph
-
-Export mode example (FlashInfer):
-
-.. code-block:: console
-
-    $ python -m lightllm.server.api_server \
-        --model_dir /path/to/model \
-        --export_fp8kv_calibration \
-        --llm_prefill_att_backend flashinfer \
-        --llm_decode_att_backend flashinfer \
-        --disable_cudagraph
-
-Notes:
-
-- Setting ``--export_fp8kv_calibration`` collects KV statistics during runtime.
-- After calibration is completed, ``kv_cache_calib.json`` is written to the current working directory.
-- Export mode requires ``--disable_cudagraph``, and ``--llm_kv_type`` should remain ``None``.
-- The repository already provides calibration files for common models under ``test/advanced_config/``, which can be used directly or as references.
-
-Use benchmark_qps.py for random-data calibration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Besides online traffic, you can use ``test/benchmark/service/benchmark_qps.py`` to generate random requests for calibration.
-
-- By default, one calibration result is exported after around 4000 inferences are accumulated.
-- In practice, you can run the following command twice to improve coverage stability.
-
-Example command:
-
-.. code-block:: console
-
-    $ python test/benchmark/service/benchmark_qps.py --url http://127.0.0.1:8000/generate_stream --tokenizer_path ../Qwen3-30B-A3B --input_len 1000 --output_len 2000 --input_qps 10 --input_num 200 --range_ratio 0.9
-
-Step 2: Start FP8 Inference with Calibration
----------------------------------------------
+Start FP8 Inference with Calibration
+------------------------------------
 
 Inference mode example (FA3):
 
@@ -107,7 +60,7 @@ Inference mode example (FlashInfer):
 Notes:
 
 - ``fp8kv`` requires ``--kv_quant_calibration_config_path``.
-- Keep the inference backend consistent with the backend used during calibration export.
+- Keep the inference backend consistent with the backend expected by the calibration file.
 
 Calibration File Schema
 -----------------------
@@ -136,14 +89,10 @@ Common Issues
 
    You are using ``--llm_kv_type fp8kv`` without a calibration file path.
 
-2. Error says ``--disable_cudagraph`` is required
-
-   You are using ``--export_fp8kv_calibration``; this mode requires cudagraph disabled.
-
-3. ``quant_type not match`` error
+2. ``quant_type not match`` error
 
    Usually caused by backend/file mismatch (for example, using a ``per_head`` file with ``flashinfer``).
 
-4. Abnormal quality after backend switch
+3. Abnormal quality after backend switch
 
-   Re-export calibration using the target backend instead of reusing files across backends.
+   Use a calibration file that matches the target backend instead of reusing an incompatible file.
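As a reference for this change, the FP8 KV inference launch described by the updated text can be sketched as follows. This sketch is not part of the diff itself: the command shape mirrors the removed export examples, the flags (``--llm_kv_type fp8kv``, ``--kv_quant_calibration_config_path``, ``--llm_prefill_att_backend``, ``--llm_decode_att_backend``) are those named in the guide, and both paths are placeholders.

```shell
# Sketch of an FA3 FP8 KV inference launch; paths are placeholders.
# The calibration file must match the backend (per_head for fa3).
python -m lightllm.server.api_server \
    --model_dir /path/to/model \
    --llm_kv_type fp8kv \
    --kv_quant_calibration_config_path /path/to/kv_cache_calib.json \
    --llm_prefill_att_backend fa3 \
    --llm_decode_att_backend fa3
```

For the FlashInfer backend, swap both attention-backend flags to ``flashinfer`` and supply a ``per_tensor`` calibration file instead.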