diff --git a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md
index 9d1496fc..2f2bda3e 100644
--- a/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md
+++ b/Feature_Guide/Speculative_Decoding/TRT-LLM/README.md
@@ -35,7 +35,7 @@ ## About Speculative Decoding
-This tutorial shows how to build and serve speculative decoding models in Triton Inference Server with [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend) on a single node with one GPU. Please go to [Speculative Decoding](../README.md) main page to learn more about other supported backends.
+This tutorial shows how to serve speculative decoding models in Triton Inference Server with the [TensorRT-LLM Backend](https://github.com/triton-inference-server/tensorrtllm_backend), using TensorRT-LLM's PyTorch backend and LLMAPI. The LLMAPI backend provides a simplified deployment approach: **no engine building required**. Please go to the [Speculative Decoding](../README.md) main page to learn more about other supported backends.

According to [Spec-Bench](https://sites.google.com/view/spec-bench), EAGLE is currently the top-performing approach for speeding up LLM inference across different tasks. In this tutorial, we'll focus on [EAGLE](#eagle) and demonstrate how to make it work with Triton Inference Server. However, we'll also cover [MEDUSA](#medusa) and [Draft Model-Based Speculative Decoding](#draft-model-based-speculative-decoding) for those interested in exploring alternative methods. This way, you can choose the best fit for your needs.

@@ -46,122 +46,58 @@ EAGLE ([paper](https://arxiv.org/pdf/2401.15077) | [github](https://github.com/S

*NOTE: EAGLE-2 is not supported via Triton Inference Server using TensorRT-LLM backend yet.*

-### Acquiring EAGLE Model and its Base Model
+### EAGLE Model Information

-In this example, we will be using the [EAGLE-Vicuna-7B-v1.3](https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3) model.
-More types of EAGLE models can be found [here](https://huggingface.co/yuhuili). The base model [Vicuna-7B-v1.3](https://huggingface.co/lmsys/vicuna-7b-v1.3) is also needed for EAGLE to work.
-
-To download both models, run the following command:
-```bash
-# Install git-lfs if needed
-apt-get update && apt-get install git-lfs -y --no-install-recommends
-git lfs install
-git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
-git clone https://huggingface.co/yuhuili/EAGLE-Vicuna-7B-v1.3
-```
+In this example, we will be using [EAGLE3-LLaMA3.1-Instruct-8B](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B) with the [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) base model. More EAGLE3 models can be found [here](https://huggingface.co/yuhuili). With the LLMAPI backend, models are downloaded automatically from Hugging Face when first used.

### Launch Triton TensorRT-LLM container

Launch Triton docker container with TensorRT-LLM backend.
-Note that we're mounting the downloaded EAGLE and base models to `/hf-models` in the docker container.
-Make an `engines` folder outside docker to reuse engines for future runs.
Please make sure to replace `<xx.yy>` with the version of Triton that you want to use (must be >= 25.01). The latest Triton Server container is recommended and can be found [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags).
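+Note that the Llama-3.1 checkpoints on Hugging Face are gated, so automatic download only works if the container can use an access token that has been granted access to the repository. A minimal sketch of one way to provide it (the `HF_TOKEN` pass-through and the placeholder token value are assumptions to adapt to your setup, not part of the tutorial's command):
+
+```bash
+# Export an access token that can read meta-llama/Llama-3.1-8B-Instruct ...
+export HF_TOKEN=<your-huggingface-access-token>
+# ... and forward it into the container by appending this flag to the `docker run` command below:
+#   -e HF_TOKEN=$HF_TOKEN
+```
+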
```bash docker run --rm -it --net host --shm-size=2g \ --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ - -v :/hf-models \ - -v :/engines \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ nvcr.io/nvidia/tritonserver:-trtllm-python-py3 ``` -### Create Engines for Each Model [skip this step if you already have an engine] - -TensorRT-LLM requires each model to be compiled for the configuration -you need before running. To do so, before you run your model for the first time -on Triton Server you will need to create a TensorRT-LLM engine. +### Prepare Model Repository -Starting with [24.04 release](https://github.com/triton-inference-server/server/releases/tag/v2.45.0), -Triton Server TensrRT-LLM container comes with -pre-installed TensorRT-LLM package, which allows users to build engines inside -the Triton container. Simply follow the next steps in the container: - -```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -EAGLE_MODEL=/hf-models/EAGLE-Vicuna-7B-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna/7b/ -ENGINE_DIR=/engines/eagle-vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/eagle/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --eagle_model_dir ${EAGLE_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 \ - --max_draft_len 63 \ - --num_eagle_layers 4 \ - --max_non_leaves_per_layer 10 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --gemm_plugin float16 \ - --use_paged_context_fmha enable \ - --speculative_decoding_mode eagle \ - --max_batch_size 4 -``` +Copy the LLMAPI model template and configure it for EAGLE speculative decoding: -To verify that the engine is built correctly, run the following command: ```bash -python3 /app/examples/run.py --engine_dir ${ENGINE_DIR} \ - --tokenizer_dir ${BASE_MODEL} \ - --max_output_len=100 \ - --input_text "Once upon" -``` -Sample output: -``` -> Input [Text 0]: " Once upon" -> Output [Text 0 Beam 0]: "a time, there was a young girl who loved to read. She would spend hours in the library, devouring books of all genres. She had a special love for fairy tales, and would often dream of living in a magical world where she could meet princes and princesses, and have adventures with talking animals. -> One day, while she was reading a book, she came across a passage that spoke to her heart. It said, "You are the author of" -> [TensorRT-LLM][INFO] Refreshed the MPI local session +cp -R /app/all_models/llmapi/ llmapi_repo/ ``` -### Serving with Triton +Edit `llmapi_repo/tensorrt_llm/1/model.yaml` with your EAGLE configuration: -The last step is to create a Triton readable model and serve it. You can find a template of a model that uses inflight batching in -[tensorrtllm_backend/all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm). To run EAGLE model, you will need to: +```yaml +model: meta-llama/Llama-3.1-8B-Instruct +backend: pytorch -1. Copy over the inflight batcher models repository -```bash -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/. -``` +tensor_parallel_size: 1 +pipeline_parallel_size: 1 -2. Modify config.pbtxt for the preprocessing, postprocessing and processing steps. 
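+# The speculative_config block below enables EAGLE speculative decoding:
+# decoding_type selects the speculation algorithm, speculative_model is the
+# Hugging Face ID of the draft model, and max_draft_len caps the number of
+# draft tokens proposed per verification step (larger values can improve
+# acceptance at the cost of extra draft-model work).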
+speculative_config: + decoding_type: Eagle + speculative_model: yuhuili/EAGLE3-LLaMA3.1-Instruct-8B + max_draft_len: 4 -```bash -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/eagle-vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} +triton_config: + max_batch_size: 0 + decoupled: False ``` -*NOTE: you can specify `eagle_choices` by manually changing tensorrt_llm/config.pbtxt. If you do not specify any choices, the default, [mc_sim_7b_63](https://github.com/FasterDecoding/Medusa/blob/main/medusa/model/medusa_choices.py#L1) choices are used. For more information regarding choices tree, refer to [Medusa Tree](https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html#medusa-tree).* +*NOTE: On the PyTorch backend, `decoding_type: Eagle` is treated as `Eagle3`. EAGLE (v1/v2) draft checkpoints are not compatible - you must use an Eagle3 draft model. See the [TensorRT-LLM speculative decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/speculative-decoding.md) for available Eagle3 models.* -3. Launch Tritonserver +### Serving with Triton -Launch Tritonserver with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/release/0.5.0/scripts/launch_triton_server.py) script. Here, we launch a single instance of `tritonserver` with MPI by setting `--world_size=1`. +Launch Tritonserver with the [launch_triton_server.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/scripts/launch_triton_server.py) script: ```bash -python3 /app/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm +python3 /app/scripts/launch_triton_server.py --model_repo=llmapi_repo/ ``` > You should expect the following response: @@ -180,251 +116,133 @@ pkill tritonserver ### Send Inference Requests -You can test the results of the run with: -1. The [inflight_batcher_llm_client.py](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/inflight_batcher_llm_client.py) script. 
Run below in another terminal: +You can test the results of the run with the [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html): ```bash -# Using the SDK container as an example. is the version of Triton Server you are using. -docker run --rm -it --net host --shm-size=2g \ - --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \ - -v :/hf-models \ - nvcr.io/nvidia/tritonserver:-py3-sdk -# Install extra dependencies for the script -pip3 install transformers sentencepiece -python3 /tensorrtllm_client/inflight_batcher_llm_client.py --request-output-len 50 --tokenizer-dir /hf-models/vicuna-7b-v1.3 --text "What is ML?" -``` -> You should expect the following response: -> ``` -> ... -> Input: What is ML? -> Output beam 0: -> ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation. -> ... -> ``` - -2. The [generate endpoint](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/protocol/extension_generate.html). - -```bash -curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is ML?", "max_tokens": 50, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}' +curl -X POST localhost:8000/v2/models/tensorrt_llm/generate \ + -d '{"text_input": "What is ML?", "sampling_param_max_tokens": 50}' | jq ``` > You should expect the following response: -> ``` -> {"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"What is ML?\nML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation."} +> ```json +> { +> "model_name": "tensorrt_llm", +> "model_version": "1", +> "text_output": "ML is a branch of AI that allows computers to learn from data, identify patterns, and make predictions. It is a powerful tool that can be used in a variety of industries, including healthcare, finance, and transportation." +> } > ``` -### Evaluating Performance with Gen-AI Perf +### Evaluating Performance -Gen-AI Perf is a command line tool for measuring the throughput and latency of generative AI models as served through an inference server. -You can read more about Gen-AI Perf [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html). We will use Gen-AI Perf to evaluate the performance gain of EAGLE model over the base model. +You can benchmark the performance gain of EAGLE over the base model using [AIPerf](https://github.com/ai-dynamo/aiperf), NVIDIA's comprehensive benchmarking tool for generative AI models. -*NOTE: below experiment is done on a single node with one GPU - RTX 5880 (48GB GPU memory). The number below is only for reference. The actual number may vary due to the different hardware and environment.* +*NOTE: The experiments below are done on a single node with one GPU - RTX 5880 (48GB GPU memory). The numbers below are for reference only. Actual performance may vary due to different hardware and environment.* -1. Prepare Dataset +1. Install AIPerf -We will be using the HumanEval dataset for our evaluation, which is used in the original EAGLE paper. 
The HumanEval dataset has been converted to the format required by EAGLE and is available [here](https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/humaneval/question.jsonl). To make it compatible for Gen-AI Perf, we need to do another conversion. You may use other datasets besides HumanEval as well, as long as it could be converted to the -format required by Gen-AI Perf. Note that MT-bench could not be used since Gen-AI Perf does not support multiturn dataset as input yet. Follow the steps below to download and convert the dataset. +Install AIPerf in the container (or run from a separate client machine): ```bash -wget https://raw.githubusercontent.com/SafeAILab/EAGLE/main/eagle/data/humaneval/question.jsonl - -# dataset-converter.py file can be found in the parent folder of this README. -python3 dataset-converter.py --input_file question.jsonl --output_file converted_humaneval.jsonl +pip install aiperf ``` -2. Install GenAI-Perf (Ubuntu 24.04, Python 3.10+) +2. Run Benchmark on EAGLE Model +Run AIPerf against the EAGLE model. AIPerf will generate synthetic prompts automatically: ```bash -pip install genai-perf +aiperf profile \ + --model tensorrt_llm \ + --url http://localhost:8000/v2/models/tensorrt_llm/generate \ + --endpoint-type template \ + --extra-inputs payload_template:'{"text_input": {{ text|tojson }}, "sampling_param_max_tokens": {{ max_tokens }}}' \ + --extra-inputs response_field:'text_output' \ + --synthetic-input-tokens-mean 128 \ + --output-tokens-mean 256 \ + --request-count 50 \ + --concurrency 1 ``` -*NOTE: you must already have CUDA 12 installed.* - -3. Run Gen-AI Perf -Run the following command in the SDK container: -```bash -genai-perf \ - profile \ - -m ensemble \ - --service-kind triton \ - --backend tensorrtllm \ - --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ - --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 \ - --concurrency 1 -``` -*NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits. This approach ensures that the benchmark reflects the true performance gains of speculative decoding in real-world, low-concurrency scenarios.* +*NOTE: When benchmarking the speedup of speculative decoding versus the base model, use `--concurrency 1`. This setting is crucial because speculative decoding is designed to trade extra computation for reduced token generation latency. 
By limiting concurrency, we avoid saturating hardware resources with multiple requests, allowing for a more accurate assessment of the technique's latency benefits.* -A sample output that looks like this: -``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩ -│ Request Latency (ms) │ 1,355.35 │ 387.84 │ 2,002.81 │ 2,000.44 │ 1,868.83 │ 1,756.85 │ -│ Output Sequence Length (tokens) │ 348.27 │ 153.00 │ 534.00 │ 517.25 │ 444.50 │ 426.75 │ -│ Input Sequence Length (tokens) │ 156.54 │ 63.00 │ 278.00 │ 265.75 │ 203.00 │ 185.75 │ -│ Output Token Throughput (per sec) │ 256.94 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Throughput (per sec) │ 0.74 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Count (count) │ 26.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ -└───────────────────────────────────┴──────────┴────────┴──────────┴──────────┴──────────┴──────────┘ -``` +AIPerf will output comprehensive metrics including: +- **Output Token Throughput (tokens/sec)**: Key metric for comparing EAGLE vs base model +- **Time to First Token (TTFT)**: Latency to receive the first token +- **Inter Token Latency (ITL)**: Average time between tokens +- **Request Latency**: End-to-end request latency -4. Run Gen-AI Perf on Base Model +3. Run Benchmark on Base Model -To compare performance between EAGLE and base model (i.e. vanilla LLM w/o speculative decoding), we need to run Gen-AI Perf Tool on the base model as well. To do so, we need to repeat the steps above for the base model with minor changes. +To compare performance between EAGLE and the base model (i.e., vanilla LLM without speculative decoding), repeat the steps for the base model. -Kill the existing Triton Server and run the following command in the Triton Server container: +Kill the existing Triton Server: ```bash pkill tritonserver ``` -Build the TRT-LLM engine for the base model: +Create a model repository for the base model (without speculative decoding): ```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna-base/7b/ -ENGINE_DIR=/engines/vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/llama/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --remove_input_padding enable \ - --gpt_attention_plugin float16 \ - --context_fmha enable \ - --gemm_plugin float16 \ - --paged_kv_cache enable \ - --max_batch_size 4 +cp -R /app/all_models/llmapi/ llmapi_base_repo/ ``` -Create a Triton readable model for the base model: -```bash -mkdir -p /opt/tritonserver/vicuna_base -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/vicuna_base/. 
- -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/vicuna_base/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} +Edit `llmapi_base_repo/tensorrt_llm/1/model.yaml` for the base model (no speculative_config): + +```yaml +model: meta-llama/Llama-3.1-8B-Instruct +backend: pytorch + +tensor_parallel_size: 1 +pipeline_parallel_size: 1 + +triton_config: + max_batch_size: 0 + decoupled: False ``` Launch Triton Server with the base model: ```bash -python3 /app/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/vicuna_base/inflight_batcher_llm +python3 /app/scripts/launch_triton_server.py --model_repo=llmapi_base_repo/ ``` -Run Gen-AI Perf Tool on Base Model: +Run AIPerf on the base model: ```bash -genai-perf \ - profile \ - -m ensemble \ - --service-kind triton \ - --backend tensorrtllm \ - --input-file /path/to/converted/dataset/converted_humaneval.jsonl \ - --tokenizer /path/to/hf-models/vicuna-7b-v1.3/ \ - --profile-export-file my_profile_export.json \ - --url localhost:8001 \ - --concurrency 1 +aiperf profile \ + --model tensorrt_llm \ + --url http://localhost:8000/v2/models/tensorrt_llm/generate \ + --endpoint-type template \ + --extra-inputs payload_template:'{"text_input": {{ text|tojson }}, "sampling_param_max_tokens": {{ max_tokens }}}' \ + --extra-inputs response_field:'text_output' \ + --synthetic-input-tokens-mean 128 \ + --output-tokens-mean 256 \ + --request-count 50 \ + --concurrency 1 ``` -Sample performance output for base model: -``` - NVIDIA GenAI-Perf | LLM Metrics -┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓ -┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃ -┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩ -│ Request Latency (ms) │ 2,663.13 │ 1,017.15 │ 4,197.72 │ 4,186.59 │ 4,096.25 │ 4,090.93 │ -│ Output Sequence Length (tokens) │ 310.75 │ 153.00 │ 441.00 │ 440.12 │ 431.70 │ 415.50 │ -│ Input Sequence Length (tokens) │ 145.67 │ 63.00 │ 195.00 │ 194.12 │ 186.90 │ 185.25 │ -│ Output Token Throughput 
(per sec) │ 116.68 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Throughput (per sec) │ 0.38 │ N/A │ N/A │ N/A │ N/A │ N/A │ -│ Request Count (count) │ 12.00 │ N/A │ N/A │ N/A │ N/A │ N/A │ -└───────────────────────────────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘ -``` +4. Compare Performance + +From the sample runs above, we can see that the EAGLE model has a lower latency and higher throughput than the base model. In our tests, EAGLE achieved approximately 2.2x speedup in output token throughput compared to the base model. -5. Compare Performance +For more advanced benchmarking options, refer to the [AIPerf documentation](https://github.com/ai-dynamo/aiperf), including: +- Request rate testing with `--request-rate` +- Goodput measurement with `--goodput` +- GPU telemetry with `--enable-gpu-telemetry` -From the sample runs above, we can see that the EAGLE model has a lower latency and higher throughput than the base model. Specifically, the EAGLE model can generate 256.94 tokens per second, while the base model can only generate 116.68 tokens per second with a speed up of 2.2x. +As stated above, the numbers are gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to different hardware and environment. -As stated above, the number above is gathered from a single node with one GPU - RTX 5880 (48GB GPU memory). The actual number may vary due to the different hardware and environment. -## Medusa +## MEDUSA -MEDUSA ([paper](https://arxiv.org/pdf/2401.10774) | [github](https://github.com/FasterDecoding/Medusa) | [blog](https://sites.google.com/view/medusa-llm)) is a speculative decoding framework that, like EAGLE, aims to accelerate LLM inference. However, there are several key differences between the two approaches: +MEDUSA ([paper](https://arxiv.org/pdf/2401.10774)) is a speculative decoding technique that adds extra decoding heads to LLMs to predict multiple subsequent tokens in parallel. Here are the key differences between MEDUSA and EAGLE: - Architecture: MEDUSA adds extra decoding heads to LLMs to predict multiple subsequent tokens in parallel, while EAGLE extrapolates second-top-layer contextual feature vectors of LLMs. - - Generation structure: MEDUSA generates a fully connected tree across adjacent layers through the Cartesian product, often resulting in nonsensical combinations. In contrast, EAGLE creates a sparser, more selective tree structure that is more context-aware1. + - Generation structure: MEDUSA generates a fully connected tree across adjacent layers through the Cartesian product, often resulting in nonsensical combinations. In contrast, EAGLE creates a sparser, more selective tree structure that is more context-aware. - Consistency: MEDUSA's non-greedy generation does not guarantee lossless performance, while EAGLE provably maintains consistency with vanilla decoding in the distribution of generated texts. - Accuracy: MEDUSA achieves an accuracy of about 0.6 in generating drafts, whereas EAGLE attains a higher accuracy of approximately 0.8 as claimed in the EAGLE paper. - - Speed: EAGLE is reported to be 1.6x faster than MEDUSA for certained models as claimed in the EAGLE paper. - -To run MEDUSA with Triton Inference Server, it is very similar to the steps above for EAGLE with only a few simple configuration changes. We only list the changes below. The rest steps not listed below are the same as the steps for EAGLE above, e.g. launch docker, launch triton server, send requests, evalaution. 
- -### Download the MEDUSA model - -We will be using [medusa-vicuna-7b-v1.3](https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3), same model family as what we used for EAGLE above: + - Speed: EAGLE is reported to be 1.6x faster than MEDUSA for certain models as claimed in the EAGLE paper. -```bash -git clone https://huggingface.co/FasterDecoding/medusa-vicuna-7b-v1.3 -``` +**NOTE:** MEDUSA is **not supported** on the PyTorch/LLMAPI backend. To use MEDUSA with Triton Inference Server, you must use the TensorRT engine-based workflow with `trtllm-build`. Please refer to the [TensorRT-LLM MEDUSA documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/advanced/speculative-decoding.md#medusa) for detailed instructions on building and deploying MEDUSA models. -### Build the TRT-LLM engine for MEDUSA: -```bash -BASE_MODEL=/hf-models/vicuna-7b-v1.3 -MEDUSA_MODEL=/hf-models/medusa-vicuna-7b-v1.3 -CKPT_PATH=/tmp/ckpt/vicuna-medusa/7b/ -ENGINE_DIR=/engines/medusa-vicuna-7b/1-gpu/ -CONVERT_CHKPT_SCRIPT=/app/examples/medusa/convert_checkpoint.py -python3 ${CONVERT_CHKPT_SCRIPT} --model_dir ${BASE_MODEL} \ - --medusa_model_dir ${MEDUSA_MODEL} \ - --output_dir ${CKPT_PATH} \ - --dtype float16 \ - --num_medusa_heads 4 -trtllm-build --checkpoint_dir ${CKPT_PATH} \ - --output_dir ${ENGINE_DIR} \ - --gemm_plugin float16 \ - --speculative_decoding_mode medusa \ - --max_batch_size 4 -``` - -### Create a Triton readable model for MEDUSA: -```bash -mkdir -p /opt/tritonserver/vicuna_medusa -cp -R /app/all_models/inflight_batcher_llm /opt/tritonserver/vicuna_medusa/. - -TOKENIZER_DIR=/hf-models/vicuna-7b-v1.3 -TOKENIZER_TYPE=auto -ENGINE_DIR=/engines/medusa-vicuna-7b/1-gpu/ -DECOUPLED_MODE=false -MODEL_FOLDER=/opt/tritonserver/vicuna_medusa/inflight_batcher_llm -MAX_BATCH_SIZE=4 -INSTANCE_COUNT=1 -MAX_QUEUE_DELAY_MS=10000 -TRITON_BACKEND=tensorrtllm -LOGITS_DATATYPE="TYPE_FP32" -FILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},tokenizer_type:${TOKENIZER_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE} -python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:${TRITON_BACKEND},triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE} -``` ## Draft Model-Based Speculative Decoding @@ -442,4 +260,25 @@ Draft Model-Based Speculative Decoding ([paper](https://arxiv.org/pdf/2302.01318 - Accuracy: its draft accuracy can vary depending on the draft model used, while EAGLE achieves a higher draft accuracy (about 0.8) compared to MEDUSA (about 0.6). 
- Please follow the steps [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md#using-draft-target-model-approach-with-triton-inference-server) to run Draft Model-Based Speculative Decoding with Triton Inference Server. \ No newline at end of file +### Draft Model Configuration + +Edit `llmapi_repo/tensorrt_llm/1/model.yaml` with your draft model configuration: + +```yaml +model: meta-llama/Llama-3.1-70B-Instruct +backend: pytorch + +tensor_parallel_size: 4 +pipeline_parallel_size: 1 + +speculative_config: + decoding_type: DraftTarget + speculative_model: meta-llama/Llama-3.1-8B-Instruct + max_draft_len: 5 + +triton_config: + max_batch_size: 0 + decoupled: False +``` + +For more details on draft model speculative decoding, please refer to the [TensorRT-LLM speculative decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/speculative-decoding.md).
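+
+Serving and querying the draft-target configuration works the same way as the EAGLE setup above. A minimal sketch, assuming the edited `llmapi_repo/` from this section and enough GPU memory for the Llama-3.1-70B-Instruct target with `tensor_parallel_size: 4`:
+
+```bash
+# Stop any previously running server, then relaunch against the draft-target repository.
+pkill tritonserver
+python3 /app/scripts/launch_triton_server.py --model_repo=llmapi_repo/
+
+# Smoke-test the deployment through the generate endpoint.
+curl -X POST localhost:8000/v2/models/tensorrt_llm/generate \
+  -d '{"text_input": "What is ML?", "sampling_param_max_tokens": 50}' | jq
+```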