[XPU][Docs] Update Release2.5 Note #7187
iosmers wants to merge 3 commits into PaddlePaddle:develop from
Conversation
Thanks for your contribution!
Pull request overview
This PR updates the Kunlunxin XPU documentation to Release 2.5 and adds/adjusts the supported-model tables and deployment-command notes.
Changes:
- Update the Docker image and pip package versions in the XPU installation docs to 2.5.0, and bump the PaddlePaddle XPU version to 3.3.1
- Rewrite/extend the XPU supported-model table, adding separate "quick" and "optimal" deployment-command columns
- Adjust several launch-command examples in the English deployment doc (including removing some parameters)
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| docs/zh/usage/kunlunxin_xpu_deployment.md | Update the supported-model table to 2.5.0 and add quick/optimal deployment-command columns |
| docs/usage/kunlunxin_xpu_deployment.md | Sync the English supported-model table and deployment examples to 2.5.0 |
| docs/zh/get_started/installation/kunlunxin_xpu.md | Update version numbers in the Chinese installation guide (Docker/pip/paddlepaddle-xpu) |
| docs/get_started/installation/kunlunxin_xpu.md | Update version numbers in the English installation guide (Docker/pip/paddlepaddle-xpu) |
|ERNIE-4.5-VL-424B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-424B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --quantization "wint8" \ <br> --max-model-len 32768 \ <br> --max-num-seqs 8 \ <br> --enable-mm \ <br> --mm-processor-kwargs '{"video_max_frames": 30}' \ <br> --limit-mm-per-prompt '{"image": 10, "video": 3}' \ <br> --reasoning-parser ernie-45-vl \ <br> --gpu-memory-utilization 0.7|2.4.0|
|PaddleOCR-VL-0.9B|32K|BF16|1|export FD_ENABLE_MAX_PREFILL=1 <br>export XPU_VISIBLE_DEVICES="0" # pick any single card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/PaddleOCR-VL \ <br> --port 8188 \ <br> --metrics-port 8181 \ <br> --engine-worker-queue-port 8182 \ <br> --max-model-len 16384 \ <br> --max-num-batched-tokens 16384 \ <br> --gpu-memory-utilization 0.8 \ <br> --max-num-seqs 256|2.4.0|
|ERNIE-4.5-VL-28B-A3B-Thinking|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --quantization "wint8" \ <br> --max-model-len 131072 \ <br> --max-num-seqs 32 \ <br> --engine-worker-queue-port 8189 \ <br> --metrics-port 8190 \ <br> --cache-queue-port 8191 \ <br> --reasoning-parser ernie-45-vl-thinking \ <br> --tool-call-parser ernie-45-vl-thinking \ <br> --mm-processor-kwargs '{"image_max_pixels": 12845056 }'|2.4.0|
|模型名|上下文长度|量化|所需卡数|(快速)部署命令|(最优)部署命令|适用版本|
The "(最优)部署命令" header mixes bracket widths: the opening parenthesis is half-width "(" while the closing one is full-width "）". Suggest using full-width Chinese parentheses consistently ("（最优）") to avoid inconsistent rendering and layout.
|模型名|上下文长度|量化|所需卡数|(快速)部署命令|(最优)部署命令|适用版本|
|模型名|上下文长度|量化|所需卡数|(快速)部署命令|(最优)部署命令|适用版本|
|ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|4 |export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the actual NICs <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0|
|ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8123 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br>--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0|
This row is labeled with a 128K context, but the (optimal) deployment command still sets --max-model-len to 32768, contradicting the context-length column and the quick command (131072); in practice the server would only support 32K. Suggest syncing --max-model-len in the optimal command to 131072 (or explaining why it differs).
|ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8123 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br>--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' |2.5.0|
|ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
This model/configuration (ERNIE-4.5-21B-A3B, 128K, BF16) appears as a duplicated row, and the first row's (optimal) deployment command includes --quantization "wint4" and sets --max-model-len to 32768, contradicting the "BF16 / 128K" column values. Suggest removing the duplicate and fixing the quantization and max-model-len parameters for this configuration so they match the table columns.
|ERNIE-4.5-0.3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|-|2.5.0|
|ERNIE-4.5-0.3B|32K|WINT8|1 |export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|-|2.5.0|
|ERNIE-4.5-0.3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|-|2.5.0|
|ERNIE-4.5-0.3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-0.3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|-|2.5.0|
The ERNIE-4.5-0.3B / 128K / BF16 row appears twice in this table, which will confuse readers. Suggest deleting one of them or replacing it with a missing configuration.
|ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|4 |export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
|ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the actual NICs <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0|
|ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # match the NIC names in your environment <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8123 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br>--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0|
|ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' |2.5.0|
|ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
|ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # pick any single card<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0|
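The max-model-len mismatches called out in this review are easy to catch mechanically. A minimal sketch of such a check (the function name and the 32K/128K token mapping are illustrative, not part of the docs):

```shell
# Verify every --max-model-len in a deployment command matches the table's
# context-length column (32K -> 32768, 128K -> 131072).
check_max_model_len() {
  label="$1"; cmd="$2"
  case "$label" in
    32K)  expected=32768 ;;
    128K) expected=131072 ;;
    *)    echo "unknown context label: $label" >&2; return 2 ;;
  esac
  # Extract each --max-model-len value and compare it with the expected size.
  for n in $(printf '%s\n' "$cmd" | grep -o -- '--max-model-len [0-9]*' | grep -o '[0-9]*'); do
    [ "$n" = "$expected" ] || return 1
  done
  return 0
}
```

Running it over each row's quick and optimal commands would have flagged the 128K rows whose optimal command still says 32768.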
The --speculative-config JSON in the (optimal) deployment commands writes "model": "'${mtp_model_path}'" with extra single quotes, which risks passing the quotes as part of the path; the variable name also differs from the ${path_to_mtp_model} placeholder used elsewhere in the repo (e.g. docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md:77). Suggest unifying the placeholder name and removing the stray quotes so the actual model path string is passed.
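A minimal sketch of the safer quoting, assuming `mtp_model_path` holds the MTP draft-model directory (the path below is a placeholder): end the single-quoted JSON string before the variable and resume it after, so the shell expands the variable without leaking quotes into the value.

```shell
# Placeholder path; substitute the real MTP draft-model directory.
mtp_model_path=/path/to/mtp_model

# '...' ends before the expansion, "${...}" keeps the path whole even with
# spaces, then '...' resumes the literal JSON tail.
spec_config='{"method": "mtp", "num_speculative_tokens": 1, "model": "'"${mtp_model_path}"'"}'

echo "$spec_config"
```

The resulting string is valid JSON with the expanded path as the `model` value, suitable for `--speculative-config "$spec_config"`.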
| |ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0| | |
| |ERNIE-4.5-300B-A47B|32K|WINT4|4 |export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0| | |
| |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与实际网卡保持一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0| | |
| |ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8123 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br>--speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}' |2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' |2.5.0| |
| |ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
| |ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-300B-A47B|32K|WINT4|4 |export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| | |
| |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与实际网卡保持一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' |2.5.0| | |
| |ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # 与线上网卡名一致 <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model /home/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8123 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization wint4 \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' |2.5.0| |
| |ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' |2.5.0| |
| |ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # 指定任意一张卡<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
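The `--speculative-config` values in the table above embed a shell variable inside single-quoted JSON. A minimal sketch of how that quoting expands; `/path/to/mtp_model` is a hypothetical path, not a real checkpoint:

```shell
# The outer single quotes are closed before the variable and reopened after it,
# so the shell expands ${mtp_model_path} while the JSON braces stay literal.
mtp_model_path=/path/to/mtp_model
spec_config='{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'
echo "$spec_config"
```

If the path may contain spaces, double-quote the expansion instead: `'{"model": "'"${mtp_model_path}"'"}'`.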
| --quantization "wint4" \ | ||
| --gpu-memory-utilization 0.9 \ | ||
| --load-choices "default" | ||
| --gpu-memory-utilization 0.9 |
The Quick start example no longer passes --load-choices explicitly. In the code, however, the default for --load-choices is default_v1 (fastdeploy/engine/args_utils.py:493), so if the example previously relied on --load-choices "default" (the v0 loader), dropping it changes the loader actually used. Suggestion: keep --load-choices "default" if the v0 loader is really required; otherwise add a sentence noting that the example uses the default_v1 loader by default.
| --gpu-memory-utilization 0.9 | |
| --gpu-memory-utilization 0.9 \ | |
| --load-choices "default_v1" |
| @@ -113,8 +114,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ | |||
| --enable-mm \ | |||
| --mm-processor-kwargs '{"video_max_frames": 30}' \ | |||
| --limit-mm-per-prompt '{"image": 10, "video": 3}' \ | |||
This launch command likewise omits --load-choices. Given that the default is default_v1 (fastdeploy/engine/args_utils.py:493), please state which loader this model recommends/requires on XPU (keep the --load-choices flag where necessary, or note that default_v1 is used by default).
| --limit-mm-per-prompt '{"image": 10, "video": 3}' \ | |
| --limit-mm-per-prompt '{"image": 10, "video": 3}' \ | |
| --load-choices default_v1 \ |
| @@ -258,8 +258,7 @@ python -m fastdeploy.entrypoints.openai.api_server \ | |||
| --cache-queue-port 8191 \ | |||
| --reasoning-parser ernie-45-vl-thinking \ | |||
| --tool-call-parser ernie-45-vl-thinking \ | |||
This launch command block also drops the explicit --load-choices. Since --load-choices defaults to default_v1 (fastdeploy/engine/args_utils.py:493), omitting it changes behavior if this model/scenario needs a non-default loader. Please make the required loader explicit in the docs (keep the flag if a non-default loader is needed).
| --tool-call-parser ernie-45-vl-thinking \ | |
| --tool-call-parser ernie-45-vl-thinking \ | |
| --load-choices default_v1 \ |
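The review comments above hinge on the `--load-choices` default. A sketch of pinning the loader explicitly; `LOAD_CHOICE` is our placeholder variable, not a FastDeploy name, and the commented launch command mirrors the ones in this document:

```shell
# Pin the weight loader instead of relying on the default
# (reported as default_v1 in fastdeploy/engine/args_utils.py:493).
LOAD_CHOICE="default_v1"   # or "default" for the v0 loader
echo "--load-choices ${LOAD_CHOICE}"
# python -m fastdeploy.entrypoints.openai.api_server \
#     --model PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking \
#     --load-choices "${LOAD_CHOICE}" \
#     --port 8188
```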
| |ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | ||
| |ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' |2.5.0| | ||
| |ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | ||
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | ||
| |ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | ||
| |ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| |
The --speculative-config in the "(Best) Deployment Commands" column writes "model": "'${mtp_model_path}'", adding single quotes inside the JSON string; these quotes may end up being passed as part of the path. The placeholder name is also inconsistent with other docs in the repo (e.g. docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md:77 uses ${path_to_mtp_model}). Suggest unifying the placeholder name and removing the extra quotes so that the actual model path string is passed.
| |ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}' |2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "'${mtp_model_path}'"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}' |2.5.0| | |
| |ERNIE-4.5-21B-A3B|32K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|BF16|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|WINT8|1|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| | |
| |ERNIE-4.5-21B-A3B|128K|WINT4|1 |export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0" # Specify any card <br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-21B-A3B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 1 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 128 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${path_to_mtp_model}"}'|2.5.0| |
| |ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|32K|WINT4|4|export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --enable-prefix-caching \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
The optimal deployment command on this row enables --enable-prefix-caching while also configuring MTP (--speculative-config). However, the repository's best-practices doc states that MTP does not currently support running together with Prefix Caching (docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md:79-83). Please either clarify the compatibility here (e.g. whether Prefix Caching is disabled automatically), or remove/adjust the conflicting flags, so that readers who follow the doc do not get behavior that differs from what is described.
| |ERNIE-4.5-300B-A47B|32K|WINT8|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint8" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|32K|WINT4|4|export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3" # or "4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 4 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|32K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 32768 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.95 \ <br> --enable-expert-parallel \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
| |ERNIE-4.5-300B-A47B|128K|WINT4|8|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"<br>python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9|export XPU_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" <br> export BKCL_ENABLE_XDR=1 <br> export BKCL_RDMA_NICS=eth1,eth1,eth3,eth4 # Consistent with your network card names <br> export BKCL_TRACE_TOPO=1 <br> export BKCL_PCIE_RING=1 <br> export XSHMEM_MODE=1 <br> export XSHMEM_QP_NUM_PER_RANK=32 <br> export BKCL_RDMA_VERBS=1 <br> python -m fastdeploy.entrypoints.openai.api_server \ <br> --model PaddlePaddle/ERNIE-4.5-300B-A47B-Paddle \ <br> --port 8188 \ <br> --engine-worker-queue-port 8124 \ <br> --metrics-port 8125 \ <br> --cache-queue-port 55996 \ <br> --tensor-parallel-size 8 \ <br> --max-model-len 131072 \ <br> --max-num-seqs 64 \ <br> --quantization "wint4" \ <br> --gpu-memory-utilization 0.9 \ <br> --enable-expert-parallel \ <br> --data-parallel-size 1 \ <br> --speculative-config '{"method": "mtp", "num_speculative_tokens": 1, "model": "${mtp_model_path}"}'|2.5.0|
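Note that the value passed to `--speculative-config` in the commands above is a single-line JSON object, and `${mtp_model_path}` / `${path_to_mtp_model}` are placeholders that must be replaced with the local path of the MTP draft model before launching. A minimal sketch of building and validating that flag value (the path below is illustrative, not a real model location):

```python
import json
import shlex

# Hypothetical path: replace with the real MTP draft-model directory.
mtp_model_path = "/path/to/mtp_model"

spec_config = {
    "method": "mtp",              # multi-token-prediction draft method, as in the table
    "num_speculative_tokens": 1,  # draft tokens proposed per decode step
    "model": mtp_model_path,
}

# Serialize and shell-quote it exactly as it appears after --speculative-config.
flag_value = json.dumps(spec_config)
print("--speculative-config", shlex.quote(flag_value))

# Round-trip check: the server must be able to parse the value back.
parsed = json.loads(flag_value)
assert parsed["method"] == "mtp" and parsed["num_speculative_tokens"] == 1
```

Generating the flag this way avoids the most common launch failure with these commands: a JSON string that the shell splits or mangles because it was not quoted as one argument.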