From e915df80a8a3ee50cd8799c15e89a5756d00a3d5 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz Date: Wed, 28 Jan 2026 14:06:18 -0500 Subject: [PATCH 01/10] Add vllm docs for mbridge ckpt Signed-off-by: Onur Yilmaz --- docs/llm/mbridge/optimized/index.md | 11 +- docs/llm/mbridge/optimized/vllm.md | 189 ++++++++++++++++++++++++++++ 2 files changed, 199 insertions(+), 1 deletion(-) create mode 100644 docs/llm/mbridge/optimized/vllm.md diff --git a/docs/llm/mbridge/optimized/index.md b/docs/llm/mbridge/optimized/index.md index 88630a37d5..90429b532d 100644 --- a/docs/llm/mbridge/optimized/index.md +++ b/docs/llm/mbridge/optimized/index.md @@ -1,4 +1,13 @@ # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries -**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM and vLLM is coming soon. Please check back for updates. +Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM. + +```{toctree} +:maxdepth: 1 +:titlesonly: + +vLLM +``` + +**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates. diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md new file mode 100644 index 0000000000..d3105db4e9 --- /dev/null +++ b/docs/llm/mbridge/optimized/vllm.md @@ -0,0 +1,189 @@ +# Deploy Megatron-Bridge LLMs with vLLM and Triton Inference Server + +This section shows how to use scripts and APIs to export a Megatron-Bridge LLM to vLLM and deploy it with the NVIDIA Triton Inference Server. + +## Quick Example + +1. Follow the steps in the [Generate a Megatron-Bridge Checkpoint](../gen_mbridge_ckpt.md) to generate a Megatron-Bridge Llama checkpoint. + +2. In a terminal, go to the folder where the ``hf_llama31_8B_mbridge`` checkpoint is located. Pull down and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use: + + ```shell + docker pull nvcr.io/nvidia/nemo:vr + + docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ + -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge \ + -w /opt/Export-Deploy \ + --name nemo-fw \ + nvcr.io/nvidia/nemo:vr + ``` + +3. Install vLLM by executing the following command inside the container if it is not available in the container: + + ```shell + cd /opt/Export-Deploy + uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt) + + ``` + +4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama Megatron-Bridge checkpoint to vLLM and subsequently serves it on the Triton server: + + ```shell + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ + --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \ + --triton_model_name llama \ + --tensor_parallelism_size 1 + ``` + +5. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example). + +6. In a separate terminal, access the running container as follows: + + ```shell + docker exec -it nemo-fw bash + ``` + +7. 
To send a query to the Triton server, run the following script:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
    ```

## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server

You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the provided script.

### Export and Deploy a Megatron-Bridge LLM

The script exports the model to vLLM and then starts the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
        --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
        --triton_model_name llama \
        --tensor_parallelism_size 1
    ```

    The following parameters are defined in the ``deploy_vllm_triton.py`` script:

    - ``--model_path_id``: Path of a Megatron-Bridge checkpoint, or Hugging Face model ID or path. (Required)
    - ``--tokenizer``: Tokenizer file if it is not provided in the checkpoint. (Optional)
    - ``--lora_ckpt``: List of LoRA checkpoints in HF format. (Optional, can specify multiple)
    - ``--tensor_parallelism_size``: Number of GPUs to use for tensor parallelism. Default is 1.
    - ``--dtype``: Data type for the model in vLLM. Choices: "auto", "bfloat16", "float16", "float32". Default is "auto".
    - ``--quantization``: Quantization method for vLLM. Choices: "awq", "gptq", "fp8". Default is None.
    - ``--seed``: Random seed for reproducibility. Default is 0.
    - ``--gpu_memory_utilization``: Fraction of GPU memory vLLM may use. Default is 0.9.
    - ``--swap_space``: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
    - ``--cpu_offload_gb``: Size (GiB) of CPU memory to use for offloading model weights. Default is 0.
    - ``--enforce_eager``: Whether to enforce eager execution. Default is False.
    - ``--max_seq_len_to_capture``: Maximum sequence length covered by CUDA graphs. Default is 8192.
    - ``--triton_model_name``: Name for the service/model on Triton. (Required)
    - ``--triton_model_version``: Version for the service/model. Default is 1.
    - ``--triton_port``: Port for the Triton server to listen for requests. Default is 8000.
    - ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0.
    - ``--max_batch_size``: Maximum batch size of the model. Default is 8.
    - ``--debug_mode``: Enable debug/verbose output. Default is False.

3. Access the models with a Hugging Face token.

    If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you need to generate a Hugging Face token that has access to these models. Visit [Hugging Face](https://huggingface.co) for more information. After you have the token, perform one of the following steps.

    - Log in to Hugging Face:

      ```shell
      huggingface-cli login
      ```

    - Or, set the HF_TOKEN environment variable:

      ```shell
      export HF_TOKEN=your_token_here
      ```

## Supported LLMs

Megatron-Bridge models are supported for export and deployment if they are listed as compatible in the [vLLM supported models list](https://docs.vllm.ai/en/v0.9.2/models/supported_models.html).


## Use NeMo Export and Deploy APIs to Export

Up until now, we have used scripts for exporting and deploying LLM models.
However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Megatron-Bridge checkpoints to vLLM.


### Export Megatron-Bridge LLMs

You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path.

```python
from nemo_export.vllm_exporter import vLLMExporter

checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge"

exporter = vLLMExporter()
exporter.export(
    model_path_id=checkpoint_file,
    tensor_parallel_size=1,
)

# The correct argument for output length is 'max_tokens', not 'max_output_len'
output = exporter.forward(
    ["What is the best city in the world?"],
    max_tokens=50,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print("output: ", output)
```

Check the ``vLLMExporter`` class docstrings for details on the available parameters.


## How To Send a Query

### Send a Query using the Script

You can send queries to your deployed Megatron-Bridge LLM using the provided query script. The script interacts with the model over HTTP, sending prompts to the Triton server and returning the generated responses.

The example below shows how to use the query script to send a prompt to your deployed model. You can customize the request with various parameters that control generation behavior, such as output length and sampling strategy. For a full list of supported parameters, see the list that follows.

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of the United States?"
```

**Additional parameters:**
- `--prompt_file`: Read the prompt from a file instead of the command line
- `--max_tokens`: Maximum number of tokens to generate (default: 16)
- `--min_tokens`: Minimum number of tokens to generate (default: 0)
- `--n_log_probs`: Number of log probabilities to return per output token
- `--n_prompt_log_probs`: Number of log probabilities to return per prompt token
- `--seed`: Random seed for generation
- `--top_k`: Top-k sampling (default: 1)
- `--top_p`: Top-p sampling (default: 0.1)
- `--temperature`: Sampling temperature (default: 1.0)
- `--lora_task_uids`: List of LoRA task UIDs for LoRA-enabled models (use -1 to disable)
- `--init_timeout`: Initialization timeout for the Triton server in seconds (default: 60.0)


### Send a Query using the NeMo APIs

The example below shows how to send a query using the NeMo deploy APIs.

```python
from nemo_deploy.nlp import NemoQueryvLLM

nq = NemoQueryvLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?
"], + max_tokens=100, + top_k=1, + top_p=0.8, + temperature=1.0, +) +print("output: ", output) +``` From 14b5c47c78c59ce6f56b7dce6e57c2b3c8569b2b Mon Sep 17 00:00:00 2001 From: Onur Yilmaz Date: Wed, 28 Jan 2026 14:34:09 -0500 Subject: [PATCH 02/10] Add params Signed-off-by: Onur Yilmaz --- docs/llm/mbridge/optimized/vllm.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index d3105db4e9..78c8645f14 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -12,7 +12,7 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t docker pull nvcr.io/nvidia/nemo:vr docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ - -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge \ + -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \ -w /opt/Export-Deploy \ --name nemo-fw \ nvcr.io/nvidia/nemo:vr @@ -30,7 +30,8 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ - --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \ + --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ + --model_format megatron_bridge \ --triton_model_name llama \ --tensor_parallelism_size 1 ``` From 1a624a02cb9002b8833dfee238289fb35948b69d Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 14:25:29 -0500 Subject: [PATCH 03/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index 78c8645f14..fae2f109bb 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -123,7 +123,7 @@ You can use the APIs in the export module to export a Megatron-Bridge checkpoint ```python from nemo_export.vllm_exporter import vLLMExporter -checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge" +checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" exporter = vLLMExporter() exporter.export( From 1574682e824b20d17026e62b73b8ea7809e45772 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:17:05 -0500 Subject: [PATCH 04/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 38 +++++++++++++++++------------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index fae2f109bb..e377e97c91 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -121,25 +121,31 @@ Up until now, we have used scripts for exporting and deploying LLM models. Howev You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path. 
```python -from nemo_export.vllm_exporter import vLLMExporter +def run_test(): + from nemo_export.vllm_exporter import vLLMExporter -checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" + checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" -exporter = vLLMExporter() -exporter.export( - model_path_id=checkpoint_file, - tensor_parallel_size=1, -) + exporter = vLLMExporter() + exporter.export( + model_path_id=checkpoint_file, + tensor_parallel_size=1, + ) -# The correct argument for output length is 'max_tokens', not 'max_output_len' -output = exporter.forward( - ["What is the best city in the world?"], - max_tokens=50, - top_k=1, - top_p=0.0, - temperature=1.0, -) -print("output: ", output) + # The correct argument for output length is 'max_tokens', not 'max_output_len' + output = exporter.forward( + ["What is the best city in the world?"], + max_tokens=50, + top_k=1, + top_p=0.1, + temperature=1.0, + ) + print("output: ", output) + + + +if __name__ == "__main__": + run_test() ``` Be sure to check the vLLMExporter class docstrings for details. From d1e3883f0e705bc6f20026f67d76e51e3cfa19d1 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:29:37 -0500 Subject: [PATCH 05/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index d63fd402da..259a6ad714 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,7 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama --model_format megatron + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -47,7 +47,6 @@ You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the prov The following instructions are very similar to those for [deploying NeMo 2.0 models](../nemo_2/in-framework.md), with only a few key differences specific to Megatron-Bridge highlighted below. - Use the `--megatron_checkpoint` argument to specify your Megatron-Bridge checkpoint file. -- Set `--model_format megatron` to indicate the model type. Executing the script will directly deploy the Megatron-Bridge LLM model and start the service on Triton. @@ -57,13 +56,12 @@ Executing the script will directly deploy the Megatron-Bridge LLM model and star 2. 
To begin serving the downloaded model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama --model_format megatron + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama ``` The following parameters are defined in the ``deploy_inframework_triton.py`` script: - ``-nc``, ``--megatron_checkpoint``: Path to the Megatron-Bridge checkpoint file to deploy. (Required) - - ``-mf``, ``--model_format``: Whether to load megatron-bridge or nemo 2 model. This should be set to megatron. - ``-tmn``, ``--triton_model_name``: Name to register the model under in Triton. (Required) - ``-tmv``, ``--triton_model_version``: Version number for the model in Triton. Default: 1 - ``-sp``, ``--server_port``: Port for the REST server to listen for requests. Default: 8080 @@ -137,4 +135,4 @@ output = nq.query_llm( repetition_penalty=1.0 ) print(output) -``` \ No newline at end of file +``` From 70ffb6d8625c49ef95b67980792a85f485cbad03 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:30:09 -0500 Subject: [PATCH 06/10] Update in-framework-ray.md --- docs/llm/mbridge/in-framework-ray.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/llm/mbridge/in-framework-ray.md b/docs/llm/mbridge/in-framework-ray.md index d52026d287..d90badf1e9 100644 --- a/docs/llm/mbridge/in-framework-ray.md +++ b/docs/llm/mbridge/in-framework-ray.md @@ -27,7 +27,6 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \ - --model_format megatron \ --model_id llama \ --num_replicas 1 \ --num_gpus 1 \ @@ -56,6 +55,5 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI Deploying Megatron-Bridge models with Ray Serve closely follows the same process as deploying NeMo 2.0 models. The primary differences are: - Use the `--megatron_checkpoint` argument to specify your Megatron-Bridge checkpoint file. -- Set `--model_format megatron` to indicate the model type. -All other deployment steps, parameters, and Ray Serve features remain the same as for NeMo 2.0 LLMs. For a comprehensive walkthrough of advanced options, scaling, and troubleshooting, refer to the [Deploy NeMo 2.0 LLMs with Ray Serve](../nemo_2/in-framework-ray.md) documentation. \ No newline at end of file +All other deployment steps, parameters, and Ray Serve features remain the same as for NeMo 2.0 LLMs. For a comprehensive walkthrough of advanced options, scaling, and troubleshooting, refer to the [Deploy NeMo 2.0 LLMs with Ray Serve](../nemo_2/in-framework-ray.md) documentation. 
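As a quick smoke test of a Ray Serve deployment started with the script above, you can send a completion request over HTTP. The sketch below is a minimal example, not the definitive client: it assumes the deployment exposes an OpenAI-style ``/v1/completions/`` route on ``localhost:8000`` and that the model was registered as ``llama`` via ``--model_id``. Adjust the host, port, and route to match the flags you pass to ``deploy_ray_inframework.py`` and the endpoint layout described in the NeMo 2.0 Ray Serve guide linked above.

```python
# Minimal smoke test for a Ray Serve deployment of a Megatron-Bridge model.
# Assumptions (adjust to your setup): the server listens on localhost:8000 and
# serves an OpenAI-style completions route at /v1/completions/; the model was
# registered with --model_id llama as in the example above.
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port; match your deploy flags

payload = {
    "model": "llama",                      # value passed to --model_id
    "prompt": "The capital of Canada is",
    "max_tokens": 32,
    "temperature": 1.0,
}

response = requests.post(f"{BASE_URL}/v1/completions/", json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```

If the request returns a 404, consult the Ray Serve guide linked above for the exact endpoint paths in your NeMo version.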
From db795741f2114989e95bd911089a3acb86f5e79b Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:30:41 -0500 Subject: [PATCH 07/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index e377e97c91..da3e904c2d 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -31,7 +31,6 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ - --model_format megatron_bridge \ --triton_model_name llama \ --tensor_parallelism_size 1 ``` From 27760b52c1f7a1a3dd4a731fc3f6ee1839a0da72 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:33:07 -0500 Subject: [PATCH 08/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index 259a6ad714..bc171fe035 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,7 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -56,7 +56,7 @@ Executing the script will directly deploy the Megatron-Bridge LLM model and star 2. 
To begin serving the downloaded model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama ``` The following parameters are defined in the ``deploy_inframework_triton.py`` script: From b318a05267d711e8d3c53d99f02c2367a6be01eb Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:33:32 -0500 Subject: [PATCH 09/10] Update in-framework-ray.md --- docs/llm/mbridge/in-framework-ray.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llm/mbridge/in-framework-ray.md b/docs/llm/mbridge/in-framework-ray.md index d90badf1e9..8a7c80a804 100644 --- a/docs/llm/mbridge/in-framework-ray.md +++ b/docs/llm/mbridge/in-framework-ray.md @@ -26,7 +26,7 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ - --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \ + --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ --model_id llama \ --num_replicas 1 \ --num_gpus 1 \ From 478a53667e388aecd6d95454e129bdd3a13faca6 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:34:58 -0500 Subject: [PATCH 10/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index bc171fe035..83a6009e07 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,9 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \ + --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ + --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -35,7 +37,10 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 6. To send a query to the Triton server, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py -mn llama -p "What is the color of a banana?" -mol 5 + python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py \ + -mn llama \ + -p "What is the color of a banana?" \ + -mol 5 ``` ## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server
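Whichever deployment path you choose, it helps to confirm that the Triton server has finished loading the model before you send queries. Triton exposes standard KServe v2 health endpoints over HTTP, and the sketch below polls the readiness endpoint. It is a minimal sketch that assumes the Triton HTTP port is ``8000``, as in the examples above; change ``TRITON_URL`` if your deployment listens elsewhere.

```python
# Poll Triton's standard KServe v2 readiness endpoint until the server reports
# ready. Assumes the Triton HTTP port is 8000 (the default used in these
# examples); change TRITON_URL if your deployment listens elsewhere.
import time

import requests

TRITON_URL = "http://localhost:8000"  # assumed; match your Triton port and docker -p mapping
TIMEOUT_S = 300

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    try:
        # /v2/health/ready returns HTTP 200 once all loaded models are servable.
        if requests.get(f"{TRITON_URL}/v2/health/ready", timeout=5).status_code == 200:
            print("Triton is ready to accept queries.")
            break
    except requests.ConnectionError:
        pass  # server process is still starting up
    time.sleep(5)
else:
    raise RuntimeError(f"Triton did not become ready within {TIMEOUT_S} seconds.")
```

Once the probe succeeds, the query scripts shown earlier (``query_inframework.py`` and ``query_vllm.py``) should connect without hitting their initialization timeouts.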