From e915df80a8a3ee50cd8799c15e89a5756d00a3d5 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz Date: Wed, 28 Jan 2026 14:06:18 -0500 Subject: [PATCH 01/10] Add vllm docs for mbridge ckpt Signed-off-by: Onur Yilmaz --- docs/llm/mbridge/optimized/index.md | 11 +- docs/llm/mbridge/optimized/vllm.md | 189 ++++++++++++++++++++++++++++ 2 files changed, 199 insertions(+), 1 deletion(-) create mode 100644 docs/llm/mbridge/optimized/vllm.md diff --git a/docs/llm/mbridge/optimized/index.md b/docs/llm/mbridge/optimized/index.md index 88630a37d5..90429b532d 100644 --- a/docs/llm/mbridge/optimized/index.md +++ b/docs/llm/mbridge/optimized/index.md @@ -1,4 +1,13 @@ # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries -**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM and vLLM is coming soon. Please check back for updates. +Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM. + +```{toctree} +:maxdepth: 1 +:titlesonly: + +vLLM +``` + +**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates. diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md new file mode 100644 index 0000000000..d3105db4e9 --- /dev/null +++ b/docs/llm/mbridge/optimized/vllm.md @@ -0,0 +1,189 @@ +# Deploy Megatron-Bridge LLMs with vLLM and Triton Inference Server + +This section shows how to use scripts and APIs to export a Megatron-Bridge LLM to vLLM and deploy it with the NVIDIA Triton Inference Server. + +## Quick Example + +1. Follow the steps in the [Generate a Megatron-Bridge Checkpoint](../gen_mbridge_ckpt.md) to generate a Megatron-Bridge Llama checkpoint. + +2. In a terminal, go to the folder where the ``hf_llama31_8B_mbridge`` checkpoint is located. Pull down and run the Docker container image using the command shown below. Change the ``:vr`` tag to the version of the container you want to use: + + ```shell + docker pull nvcr.io/nvidia/nemo:vr + + docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ + -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge \ + -w /opt/Export-Deploy \ + --name nemo-fw \ + nvcr.io/nvidia/nemo:vr + ``` + +3. Install vLLM by executing the following command inside the container if it is not available in the container: + + ```shell + cd /opt/Export-Deploy + uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt) + + ``` + +4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama Megatron-Bridge checkpoint to vLLM and subsequently serves it on the Triton server: + + ```shell + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ + --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \ + --triton_model_name llama \ + --tensor_parallelism_size 1 + ``` + +5. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (gradually by 50%, for example). + +6. In a separate terminal, access the running container as follows: + + ```shell + docker exec -it nemo-fw bash + ``` + +7. 
To send a query to the Triton server, run the following script:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
    ```

## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server

You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the provided script.

### Export and Deploy a Megatron-Bridge LLM

The script exports the model to vLLM and then starts the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the downloaded model, run the following script:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
        --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
        --triton_model_name llama \
        --tensor_parallelism_size 1
    ```

    The following parameters are defined in the ``deploy_vllm_triton.py`` script:

    - ``--model_path_id``: Path of a Megatron-Bridge checkpoint, or Hugging Face model ID or path. (Required)
    - ``--tokenizer``: Tokenizer file if it is not provided in the checkpoint. (Optional)
    - ``--lora_ckpt``: List of LoRA checkpoints in HF format. (Optional, can specify multiple)
    - ``--tensor_parallelism_size``: Number of GPUs to use for tensor parallelism. Default is 1.
    - ``--dtype``: Data type for the model in vLLM. Choices: "auto", "bfloat16", "float16", "float32". Default is "auto".
    - ``--quantization``: Quantization method for vLLM. Choices: "awq", "gptq", "fp8". Default is None.
    - ``--seed``: Random seed for reproducibility. Default is 0.
    - ``--gpu_memory_utilization``: Fraction of GPU memory vLLM may use. Default is 0.9.
    - ``--swap_space``: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
    - ``--cpu_offload_gb``: Size (GiB) of CPU memory to use for offloading model weights. Default is 0.
    - ``--enforce_eager``: Whether to enforce eager execution. Default is False.
    - ``--max_seq_len_to_capture``: Maximum sequence length covered by CUDA graphs. Default is 8192.
    - ``--triton_model_name``: Name for the service/model on Triton. (Required)
    - ``--triton_model_version``: Version for the service/model. Default is 1.
    - ``--triton_port``: Port for the Triton server to listen for requests. Default is 8000.
    - ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0.
    - ``--max_batch_size``: Maximum batch size of the model. Default is 8.
    - ``--debug_mode``: Enable debug/verbose output. Default is False.

3. Access the models with a Hugging Face token.

    If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you need to generate a Hugging Face token that has access to these models. Visit [Hugging Face](https://huggingface.co) for more information. After you have the token, perform one of the following steps.

    - Log in to Hugging Face:

      ```shell
      huggingface-cli login
      ```

    - Or, set the HF_TOKEN environment variable:

      ```shell
      export HF_TOKEN=your_token_here
      ```

## Supported LLMs

Megatron-Bridge models are supported for export and deployment if they are listed as compatible in the [vLLM supported models list](https://docs.vllm.ai/en/v0.9.2/models/supported_models.html).


## Use NeMo Export and Deploy APIs to Export

Up until now, we have used scripts for exporting and deploying LLM models.
However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and exporting Megatron-Bridge checkpoints to vLLM.


### Export Megatron-Bridge LLMs

You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path.

```python
from nemo_export.vllm_exporter import vLLMExporter

checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge"

exporter = vLLMExporter()
exporter.export(
    model_path_id=checkpoint_file,
    tensor_parallel_size=1,
)

# The correct argument for output length is 'max_tokens', not 'max_output_len'
output = exporter.forward(
    ["What is the best city in the world?"],
    max_tokens=50,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
print("output: ", output)
```

Check the ``vLLMExporter`` class docstrings for details on the available parameters.


## How To Send a Query

### Send a Query using the Script

You can send queries to your deployed Megatron-Bridge LLM using the provided query script. The script interacts with the model over HTTP, sending prompts to the Triton server and returning the generated responses.

The example below shows how to use the query script to send a prompt to your deployed model. You can customize the request with various parameters that control generation behavior, such as output length and sampling strategy. For a full list of supported parameters, see the list that follows.

```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of the United States?"
```

**Additional parameters:**
- `--prompt_file`: Read the prompt from a file instead of the command line
- `--max_tokens`: Maximum number of tokens to generate (default: 16)
- `--min_tokens`: Minimum number of tokens to generate (default: 0)
- `--n_log_probs`: Number of log probabilities to return per output token
- `--n_prompt_log_probs`: Number of log probabilities to return per prompt token
- `--seed`: Random seed for generation
- `--top_k`: Top-k sampling (default: 1)
- `--top_p`: Top-p sampling (default: 0.1)
- `--temperature`: Sampling temperature (default: 1.0)
- `--lora_task_uids`: List of LoRA task UIDs for LoRA-enabled models (use -1 to disable)
- `--init_timeout`: Initialization timeout for the Triton server in seconds (default: 60.0)


### Send a Query using the NeMo APIs

The example below shows how to send a query using the NeMo deploy APIs.

```python
from nemo_deploy.nlp import NemoQueryvLLM

nq = NemoQueryvLLM(url="localhost:8000", model_name="llama")
output = nq.query_llm(
    prompts=["What is the capital of United States?
"], + max_tokens=100, + top_k=1, + top_p=0.8, + temperature=1.0, +) +print("output: ", output) +``` From 14b5c47c78c59ce6f56b7dce6e57c2b3c8569b2b Mon Sep 17 00:00:00 2001 From: Onur Yilmaz Date: Wed, 28 Jan 2026 14:34:09 -0500 Subject: [PATCH 02/10] Add params Signed-off-by: Onur Yilmaz --- docs/llm/mbridge/optimized/vllm.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index d3105db4e9..78c8645f14 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -12,7 +12,7 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t docker pull nvcr.io/nvidia/nemo:vr docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \ - -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge \ + -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \ -w /opt/Export-Deploy \ --name nemo-fw \ nvcr.io/nvidia/nemo:vr @@ -30,7 +30,8 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ - --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \ + --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ + --model_format megatron_bridge \ --triton_model_name llama \ --tensor_parallelism_size 1 ``` From 1a624a02cb9002b8833dfee238289fb35948b69d Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 14:25:29 -0500 Subject: [PATCH 03/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index 78c8645f14..fae2f109bb 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -123,7 +123,7 @@ You can use the APIs in the export module to export a Megatron-Bridge checkpoint ```python from nemo_export.vllm_exporter import vLLMExporter -checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge" +checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" exporter = vLLMExporter() exporter.export( From 1574682e824b20d17026e62b73b8ea7809e45772 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:17:05 -0500 Subject: [PATCH 04/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 38 +++++++++++++++++------------- 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index fae2f109bb..e377e97c91 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -121,25 +121,31 @@ Up until now, we have used scripts for exporting and deploying LLM models. Howev You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The following code example assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been downloaded and generated at the ``/opt/checkpoints/`` path. 
```python -from nemo_export.vllm_exporter import vLLMExporter +def run_test(): + from nemo_export.vllm_exporter import vLLMExporter -checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" + checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/" -exporter = vLLMExporter() -exporter.export( - model_path_id=checkpoint_file, - tensor_parallel_size=1, -) + exporter = vLLMExporter() + exporter.export( + model_path_id=checkpoint_file, + tensor_parallel_size=1, + ) -# The correct argument for output length is 'max_tokens', not 'max_output_len' -output = exporter.forward( - ["What is the best city in the world?"], - max_tokens=50, - top_k=1, - top_p=0.0, - temperature=1.0, -) -print("output: ", output) + # The correct argument for output length is 'max_tokens', not 'max_output_len' + output = exporter.forward( + ["What is the best city in the world?"], + max_tokens=50, + top_k=1, + top_p=0.1, + temperature=1.0, + ) + print("output: ", output) + + + +if __name__ == "__main__": + run_test() ``` Be sure to check the vLLMExporter class docstrings for details. From d1e3883f0e705bc6f20026f67d76e51e3cfa19d1 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:29:37 -0500 Subject: [PATCH 05/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index d63fd402da..259a6ad714 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,7 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama --model_format megatron + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -47,7 +47,6 @@ You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the prov The following instructions are very similar to those for [deploying NeMo 2.0 models](../nemo_2/in-framework.md), with only a few key differences specific to Megatron-Bridge highlighted below. - Use the `--megatron_checkpoint` argument to specify your Megatron-Bridge checkpoint file. -- Set `--model_format megatron` to indicate the model type. Executing the script will directly deploy the Megatron-Bridge LLM model and start the service on Triton. @@ -57,13 +56,12 @@ Executing the script will directly deploy the Megatron-Bridge LLM model and star 2. 
To begin serving the downloaded model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama --model_format megatron + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama ``` The following parameters are defined in the ``deploy_inframework_triton.py`` script: - ``-nc``, ``--megatron_checkpoint``: Path to the Megatron-Bridge checkpoint file to deploy. (Required) - - ``-mf``, ``--model_format``: Whether to load megatron-bridge or nemo 2 model. This should be set to megatron. - ``-tmn``, ``--triton_model_name``: Name to register the model under in Triton. (Required) - ``-tmv``, ``--triton_model_version``: Version number for the model in Triton. Default: 1 - ``-sp``, ``--server_port``: Port for the REST server to listen for requests. Default: 8080 @@ -137,4 +135,4 @@ output = nq.query_llm( repetition_penalty=1.0 ) print(output) -``` \ No newline at end of file +``` From 70ffb6d8625c49ef95b67980792a85f485cbad03 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:30:09 -0500 Subject: [PATCH 06/10] Update in-framework-ray.md --- docs/llm/mbridge/in-framework-ray.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/docs/llm/mbridge/in-framework-ray.md b/docs/llm/mbridge/in-framework-ray.md index d52026d287..d90badf1e9 100644 --- a/docs/llm/mbridge/in-framework-ray.md +++ b/docs/llm/mbridge/in-framework-ray.md @@ -27,7 +27,6 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \ - --model_format megatron \ --model_id llama \ --num_replicas 1 \ --num_gpus 1 \ @@ -56,6 +55,5 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI Deploying Megatron-Bridge models with Ray Serve closely follows the same process as deploying NeMo 2.0 models. The primary differences are: - Use the `--megatron_checkpoint` argument to specify your Megatron-Bridge checkpoint file. -- Set `--model_format megatron` to indicate the model type. -All other deployment steps, parameters, and Ray Serve features remain the same as for NeMo 2.0 LLMs. For a comprehensive walkthrough of advanced options, scaling, and troubleshooting, refer to the [Deploy NeMo 2.0 LLMs with Ray Serve](../nemo_2/in-framework-ray.md) documentation. \ No newline at end of file +All other deployment steps, parameters, and Ray Serve features remain the same as for NeMo 2.0 LLMs. For a comprehensive walkthrough of advanced options, scaling, and troubleshooting, refer to the [Deploy NeMo 2.0 LLMs with Ray Serve](../nemo_2/in-framework-ray.md) documentation. 
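As a quick smoke test of a Ray Serve deployment started with the script above, you can send a completion request over HTTP. The sketch below is a minimal example, not the definitive client: it assumes the deployment exposes an OpenAI-style ``/v1/completions/`` route on ``localhost:8000`` and that the model was registered as ``llama`` via ``--model_id``. Adjust the host, port, and route to match the flags you pass to ``deploy_ray_inframework.py`` and the endpoint layout described in the NeMo 2.0 Ray Serve guide linked above.

```python
# Minimal smoke test for a Ray Serve deployment of a Megatron-Bridge model.
# Assumptions (adjust to your setup): the server listens on localhost:8000 and
# serves an OpenAI-style completions route at /v1/completions/; the model was
# registered with --model_id llama as in the example above.
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port; match your deploy flags

payload = {
    "model": "llama",                      # value passed to --model_id
    "prompt": "The capital of Canada is",
    "max_tokens": 32,
    "temperature": 1.0,
}

response = requests.post(f"{BASE_URL}/v1/completions/", json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```

If the request returns a 404, consult the Ray Serve guide linked above for the exact endpoint paths in your NeMo version.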
From db795741f2114989e95bd911089a3acb86f5e79b Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:30:41 -0500 Subject: [PATCH 07/10] Update vllm.md --- docs/llm/mbridge/optimized/vllm.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md index e377e97c91..da3e904c2d 100644 --- a/docs/llm/mbridge/optimized/vllm.md +++ b/docs/llm/mbridge/optimized/vllm.md @@ -31,7 +31,6 @@ This section shows how to use scripts and APIs to export a Megatron-Bridge LLM t ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \ --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ - --model_format megatron_bridge \ --triton_model_name llama \ --tensor_parallelism_size 1 ``` From 27760b52c1f7a1a3dd4a731fc3f6ee1839a0da72 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:33:07 -0500 Subject: [PATCH 08/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index 259a6ad714..bc171fe035 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,7 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -56,7 +56,7 @@ Executing the script will directly deploy the Megatron-Bridge LLM model and star 2. 
To begin serving the downloaded model, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama ``` The following parameters are defined in the ``deploy_inframework_triton.py`` script: From b318a05267d711e8d3c53d99f02c2367a6be01eb Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:33:32 -0500 Subject: [PATCH 09/10] Update in-framework-ray.md --- docs/llm/mbridge/in-framework-ray.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/llm/mbridge/in-framework-ray.md b/docs/llm/mbridge/in-framework-ray.md index d90badf1e9..8a7c80a804 100644 --- a/docs/llm/mbridge/in-framework-ray.md +++ b/docs/llm/mbridge/in-framework-ray.md @@ -26,7 +26,7 @@ This section demonstrates how to deploy [Megatron-Bridge](https://github.com/NVI ```shell python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ - --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge \ + --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ --model_id llama \ --num_replicas 1 \ --num_gpus 1 \ From 478a53667e388aecd6d95454e129bdd3a13faca6 Mon Sep 17 00:00:00 2001 From: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com> Date: Wed, 11 Feb 2026 15:34:58 -0500 Subject: [PATCH 10/10] Update in-framework.md --- docs/llm/mbridge/in-framework.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/docs/llm/mbridge/in-framework.md b/docs/llm/mbridge/in-framework.md index bc171fe035..83a6009e07 100644 --- a/docs/llm/mbridge/in-framework.md +++ b/docs/llm/mbridge/in-framework.md @@ -21,7 +21,9 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 3. Using a Megatron-Bridge model, run the following deployment script to verify that everything is working correctly. The script directly serves the Megatron-Bridge model on the Triton server: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --triton_model_name llama + python /opt/Export-Deploy/scripts/deploy/nlp/deploy_inframework_triton.py \ + --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \ + --triton_model_name llama ``` 4. If the test yields a shared memory-related error, increase the shared memory size using ``--shm-size`` (for example, gradually by 50%). @@ -35,7 +37,10 @@ This section explains how to deploy [Megatron-Bridge](https://github.com/NVIDIA- 6. To send a query to the Triton server, run the following script: ```shell - python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py -mn llama -p "What is the color of a banana?" -mol 5 + python /opt/Export-Deploy/scripts/deploy/nlp/query_inframework.py \ + -mn llama \ + -p "What is the color of a banana?" \ + -mol 5 ``` ## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server
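Whichever deployment path you choose, it helps to confirm that the Triton server has finished loading the model before you send queries. Triton exposes standard KServe v2 health endpoints over HTTP, and the sketch below polls the readiness endpoint. It is a minimal sketch that assumes the Triton HTTP port is ``8000``, as in the examples above; change ``TRITON_URL`` if your deployment listens elsewhere.

```python
# Poll Triton's standard KServe v2 readiness endpoint until the server reports
# ready. Assumes the Triton HTTP port is 8000 (the default used in these
# examples); change TRITON_URL if your deployment listens elsewhere.
import time

import requests

TRITON_URL = "http://localhost:8000"  # assumed; match your Triton port and docker -p mapping
TIMEOUT_S = 300

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    try:
        # /v2/health/ready returns HTTP 200 once all loaded models are servable.
        if requests.get(f"{TRITON_URL}/v2/health/ready", timeout=5).status_code == 200:
            print("Triton is ready to accept queries.")
            break
    except requests.ConnectionError:
        pass  # server process is still starting up
    time.sleep(5)
else:
    raise RuntimeError(f"Triton did not become ready within {TIMEOUT_S} seconds.")
```

Once the probe succeeds, the query scripts shown earlier (``query_inframework.py`` and ``query_vllm.py``) should connect without hitting their initialization timeouts.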