There are currently five workflow transactions that can be initiated using HTTP or WebSocket when the NeMo Agent Toolkit server is running: generate non-streaming, generate async, generate streaming, chat non-streaming, and chat streaming. The following are types of interfaces you can use to interact with your running workflows.
- Generate Interface: Uses the transaction schema defined by your workflow. The interface documentation is accessible
using Swagger while the server is running
http://localhost:8000/docs. - Chat Interface: OpenAI API Documentation provides details on chat formats compatible with the NeMo Agent Toolkit server.
The default endpoint paths use a versioned URL scheme.
:::{note} Versioned paths are currently experimental due to the added support of HTTP Human-in-the-loop (HITL) and OAuth. They are 1:1 compatible with the legacy endpoints for workflows not relying on those behaviors. :::
Legacy paths are registered by default for backward compatibility unless explicitly disabled.
| Endpoint | Versioned Path | Legacy Path |
|---|---|---|
| Generate (non-streaming) | /v1/workflow |
/generate |
| Generate (streaming) | /v1/workflow/stream |
/generate/stream |
| Generate (full) | /v1/workflow/full |
/generate/full |
| Generate (async) | /v1/workflow/async |
/generate/async |
| Chat (non-streaming) | /v1/chat |
/chat |
| Chat (streaming) | /v1/chat/stream |
/chat/stream |
| OpenAI v1 Completions | /v1/chat/completions |
(none) |
By default, both the versioned and legacy paths are active. You can control this behavior through the front-end configuration:
general:
front_end:
_type: fastapi
workflow:
path: /v1/workflow
openai_api_path: /v1/chat
openai_api_v1_path: /v1/chat/completions
legacy_path: /generate # Optional legacy generate path
legacy_openai_api_path: /chat # Optional legacy chat path
disable_legacy_routes: false # Set to true to disable legacy pathsSetting disable_legacy_routes to true removes the legacy paths entirely. Set legacy_path
or legacy_openai_api_path to null on individual endpoints to disable specific legacy routes
while keeping others.
Interactive workflows (Human-in-the-Loop and OAuth) can be used over plain HTTP without WebSockets. For details, see HTTP Interactive Execution.
This section describes how to start the NeMo Agent Toolkit server.
If you have not already done so, follow the Obtaining API Keys instructions to obtain an NVIDIA API key.
export NVIDIA_API_KEY=<YOUR_API_KEY>Before you use the following examples, ensure that the simple calculator workflow has been installed and is running on http://localhost:8000 by running the following commands:
uv pip install -e examples/getting_started/simple_calculator
nat serve --config_file examples/getting_started/simple_calculator/configs/config.yml- Route:
/v1/workflow(legacy:/generate) - Description: A non-streaming transaction that waits until all workflow data is available before sending the result back to the client. The transaction schema is defined by the workflow.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/workflow \ --header 'Content-Type: application/json' \ --data '{ "input_message": "Is 4 + 4 greater than the current hour of the day" }'
- HTTP Response Example: (actual response will vary based on the time of day)
{ "value":"No, 4 + 4 is not greater than the current hour of the day." }
The asynchronous generate endpoint allows clients to submit a workflow to run in the background and return a response immediately with a unique identifier for the workflow. This can be used to query the status and results of the workflow at a later time. This is useful for long-running workflows, which would otherwise cause the client to time out.
This endpoint is only available when the async_endpoints optional dependency extra is installed. For users installing from source, this can be done by running uv pip install -e ".[async_endpoints]" from the root directory of the NeMo Agent Toolkit library. Similarly, for users installing from PyPI, this can be done by running pip install "nvidia-nat[async_endpoints]".
Asynchronous jobs are managed using Dask. By default, a local Dask cluster is created at start time, however you can also configure the server to connect to an existing Dask scheduler by setting the scheduler_address configuration parameter. The Dask scheduler is used to manage the execution of asynchronous jobs, and can be configured to run on a single machine or across a cluster of machines. Job history and metadata is stored in a SQL database using SQLAlchemy. By default, a temporary SQLite database is created at start time, however you can also configure the server to use a persistent database by setting the db_url configuration parameter. Refer to the SQLAlchemy documentation for the format of the db_url parameter. Any database supported by SQLAlchemy's Asynchronous I/O extension can be used. Refer to SQLAlchemy's Dialects for a complete list (many but not all of these support Asynchronous I/O).
The following CLI flags are available to configure the asynchronous generate endpoint when using nat serve:
--dask_log_level: The logging level for Dask. Default isWARNING.--dask_threads_per_worker: The number of threads to use per Dask worker. Default is1. When set to0the value uses the Dask default. This is only used whenscheduler_addressis not set.--dask_workers: The type of Dask workers to use. Options arethreadsfor Threaded Dask workers orprocessesfor Process based Dask workers. Default isprocesses. This is only used whenscheduler_addressis not set.--dask_worker_memory_limit: The memory limit for each Dask worker. Can be 'auto', a memory string like '4GB', or a float representing a fraction of the system memory. The default value is '0', which means there is no memory limit. Refer to https://docs.dask.org/en/stable/deploying-python.html#reference for details.--db_url: The SQLAlchemy database URL to use for storing job history and metadata. If not set, a temporary SQLite database will be created.--max_concurrent_jobs: Maximum number of Dask workers to create for running async jobs. The name of this parameter is misleading as the actual number of concurrent async jobs is:max_running_async_jobs * dask_threads_per_worker. Default is 10. This is only used whenscheduler_addressis not set.--scheduler_address: The address of an existing Dask scheduler to connect to. If not set, a local Dask cluster will be created.
- Route:
/v1/workflow/async(legacy:/generate/async) - Description: A non-streaming transaction that submits a workflow to run in the background.
- Optional Fields:
job_id: A unique identifier for the job. If not provided, a UUID will be generated. It can be any string value. However, it is the caller's responsibility to ensure uniqueness. Ifjob_idalready exists, the server will return the latest status for that job.sync_timeout: The maximum time in seconds to wait for the job to complete before returning a response. If the job completes in less thansync_timeoutseconds then the response will include the job result, otherwise thejob_idandstatusis returned. The default is0, which causes the request to return immediately, and maximum value for this field is300.expiry_seconds: The amount of time in seconds after the job completes (either successfully or unsuccessfully), which any output files will be preserved before being deleted. Default is3600(1 hours), minimum value is600(10 minutes) and maximum value for this field is86400(24 hours). The text output in the response is not affected by this field.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/workflow/async \ --header 'Content-Type: application/json' \ --data '{ "input_message": "Is 4 + 4 greater than the current hour of the day" }'
- HTTP Response Example:
{ "job_id": "8548a0e6-ecdc-44b0-a253-695cd746594c", "status": "submitted" }
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/workflow/async \ --header 'Content-Type: application/json' \ --data '{ "input_message": "Is 4 + 4 greater than the current hour of the day", "job_id": "example-job-123", "sync_timeout": 10 }'
- HTTP Response Example:
{ "created_at": "2025-09-10T20:52:24.768066", "error": null, "expires_at": "2025-09-10T21:52:30.734659Z", "job_id": "example-job-123", "output": { "value": "No, 4 + 4 is not greater than the current hour of the day." }, "status": "success", "updated_at": "2025-09-10T20:52:30.734659" }
- Route:
/v1/workflow/stream(legacy:/generate/stream) - Description: A streaming transaction that allows data to be sent in chunks as it becomes available from the workflow, rather than waiting for the complete response to be available.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/workflow/stream \ --header 'Content-Type: application/json' \ --data '{ "input_message": "Is 4 + 4 greater than the current hour of the day" }'
- HTTP Intermediate Step Stream Example:
"intermediate_data": { "id": "ba5191e6-b818-4206-ac14-863112e597fe", "parent_id": "5db32854-d9b2-4e75-9001-543da6a55dd0", "type": "markdown", "name": "meta/llama-3.1-70b-instruct", "payload": "**Input:**\n```python\n[SystemMessage(content='\\nAnswer the following questions as best you can. You may ask the human to use the following tools:\\n\\ncalculator_multiply: This is a mathematical tool used to multiply two numbers together. It takes 2 numbers as an input and computes their numeric product as the output.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\ncalculator_inequality: This is a mathematical tool used to perform an inequality comparison between two numbers. It takes two numbers as an input and determines if one is greater or are equal.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\ncurrent_datetime: Returns the current date and time in human readable format.. . Arguments must be provided as a valid JSON object following this format: {\\'unused\\': FieldInfo(annotation=str, required=True)} \\ncalculator_divide: This is a mathematical tool used to divide one number by another. It takes 2 numbers as an input and computes their numeric quotient as the output.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\n\\nYou may respond in one of two formats.\\nUse the following format exactly to ask the human to use a tool:\\n\\nQuestion: the input question you must answer\\nThought: you should always think about what to do\\nAction: the action to take, should be one of [calculator_multiply,calculator_inequality,current_datetime,calculator_divide]\\nAction Input: the input to the action (if there is no required input, include \"Action Input: None\") \\nObservation: wait for the human to respond with the result from the tool, do not assume the response\\n\\n... (this Thought/Action/Action Input/Observation can repeat N times. If you do not need to use a tool, or after asking the human to use any tools and waiting for the human to respond, you might know the final answer.)\\nUse the following format once you have the final answer:\\n\\nThought: I now know the final answer\\nFinal Answer: the final answer to the original input question\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='\\nQuestion: Is 4 + 4 greater than the current hour of the day\\n', additional_kwargs={}, response_metadata={}), AIMessage(content='Thought: To answer this question, I need to know the current hour of the day and compare it to 4 + 4.\\n\\nAction: current_datetime\\nAction Input: None\\n\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='The current time of day is 2025-03-11 16:05:11', additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the current time, I can extract the hour and compare it to 4 + 4. \\n\\nAction: calculator_multiply\\nAction Input: {'text': '4 + 4'}\", additional_kwargs={}, response_metadata={}), HumanMessage(content='The product of 4 * 4 is 16', additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the result of 4 + 4, which is 8, I can compare it to the current hour.\\n\\nAction: calculator_inequality\\nAction Input: {'text': '8 > 16'}\", additional_kwargs={}, response_metadata={}), HumanMessage(content='First number 8 is less than the second number 16', additional_kwargs={}, response_metadata={})]\n```\n\n**Output:**\nThought: I now know the final answer\n\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16)." }
- HTTP Response Example:
"data": { "value": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 15)." }
-
Route:
/v1/workflow/full(legacy:/generate/full) -
Description: Same as
/v1/workflow/streambut provides rawIntermediateStepobjects without any step adaptor translations. Use thefilter_stepsquery parameter to filter steps by type (comma-separated list) or set to 'none' to suppress all intermediate steps. -
HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/workflow/full \ --header 'Content-Type: application/json' \ --data '{ "input_message": "Is 4 + 4 greater than the current hour of the day" }'
-
HTTP Intermediate Step Stream Example:
"intermediate_data": {"id":"dda55b33-edd1-4dde-b938-182676a42a19","parent_id":"8282eb42-01dd-4db6-9fd5-915ed4a2a032","type":"LLM_END","name":"meta/llama-3.1-70b-instruct","payload":"{\"event_type\":\"LLM_END\",\"event_timestamp\":1744051441.449566,\"span_event_timestamp\":1744051440.5072863,\"framework\":\"langchain\",\"name\":\"meta/llama-3.1-70b-instruct\",\"tags\":null,\"metadata\":{\"chat_responses\":[{\"text\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"generation_info\":null,\"type\":\"ChatGenerationChunk\",\"message\":{\"content\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"additional_kwargs\":{},\"response_metadata\":{\"finish_reason\":\"stop\",\"model_name\":\"meta/llama-3.1-70b-instruct\"},\"type\":\"AIMessageChunk\",\"name\":null,\"id\":\"run-dda55b33-edd1-4dde-b938-182676a42a19\"}}],\"chat_inputs\":null,\"tool_inputs\":null,\"tool_outputs\":null,\"tool_info\":null},\"data\":{\"input\":\"First number 8 is less than the second number 11\",\"output\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"chunk\":null},\"usage_info\":{\"token_usage\":{\"prompt_tokens\":37109,\"completion_tokens\":902,\"total_tokens\":38011},\"num_llm_calls\":0,\"seconds_between_calls\":0},\"UUID\":\"dda55b33-edd1-4dde-b938-182676a42a19\"}"} -
HTTP Response Example:
"data": {"value":"No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11)."}
-
HTTP Request Example with Filter: By default, all intermediate steps are streamed. Use the
filter_stepsquery parameter to filter steps by type (comma-separated list) or set tononeto suppress all intermediate steps.Suppress all intermediate steps (only get final output):
curl --request POST \ --url 'http://localhost:8000/v1/workflow/full?filter_steps=none' \ --header 'Content-Type: application/json' \ --data '{"input_message": "Is 4 + 4 greater than the current hour of the day"}'
Get only specific step types:
curl --request POST \ --url 'http://localhost:8000/v1/workflow/full?filter_steps=LLM_END,TOOL_END' \ --header 'Content-Type: application/json' \ --data '{"input_message": "Is 4 + 4 greater than the current hour of the day"}'
- Route:
/v1/chat(legacy:/chat) - Description: An OpenAI compatible non-streaming chat transaction.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/chat \ --header 'Content-Type: application/json' \ --data '{ "messages": [ { "role": "user", "content": "Is 4 + 4 greater than the current hour of the day" } ] }'
- HTTP Response Example:
{ "id": "b92d1f05-200a-4540-a9f1-c1487bfb3685", "object": "chat.completion", "model": "", "created": "2025-03-11T21:12:43.671665Z", "choices": [ { "message": { "content": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16).", "role": null }, "finish_reason": "stop", "index": 0 } ], "usage": { "prompt_tokens": 0, "completion_tokens": 20, "total_tokens": 20 } }
- Route:
/v1/chat/stream(legacy:/chat/stream) - Description: An OpenAI compatible streaming chat transaction.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/v1/chat/stream \ --header 'Content-Type: application/json' \ --data '{ "messages": [ { "role": "user", "content": "Is 4 + 4 greater than the current hour of the day" } ] }'
- HTTP Intermediate Step Example:
"intermediate_data": { "id": "9ed4bce7-191c-41cb-be08-7a72d30166cc", "parent_id": "136edafb-797b-42cd-bd11-29153359b193", "type": "markdown", "name": "meta/llama-3.1-70b-instruct", "payload": "**Input:**\n```python\n[SystemMessage(content='\\nAnswer the following questions as best you can. You may ask the human to use the following tools:\\n\\ncalculator_multiply: This is a mathematical tool used to multiply two numbers together. It takes 2 numbers as an input and computes their numeric product as the output.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\ncalculator_inequality: This is a mathematical tool used to perform an inequality comparison between two numbers. It takes two numbers as an input and determines if one is greater or are equal.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\ncurrent_datetime: Returns the current date and time in human readable format.. . Arguments must be provided as a valid JSON object following this format: {\\'unused\\': FieldInfo(annotation=str, required=True)}\\ncalculator_divide: This is a mathematical tool used to divide one number by another. It takes 2 numbers as an input and computes their numeric quotient as the output.. . Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\n\\nYou may respond in one of two formats.\\nUse the following format exactly to ask the human to use a tool:\\n\\nQuestion: the input question you must answer\\nThought: you should always think about what to do\\nAction: the action to take, should be one of [calculator_multiply,calculator_inequality,current_datetime,calculator_divide]\\nAction Input: the input to the action (if there is no required input, include \"Action Input: None\") \\nObservation: wait for the human to respond with the result from the tool, do not assume the response\\n\\n... (this Thought/Action/Action Input/Observation can repeat N times. If you do not need to use a tool, or after asking the human to use any tools and waiting for the human to respond, you might know the final answer.)\\nUse the following format once you have the final answer:\\n\\nThought: I now know the final answer\\nFinal Answer: the final answer to the original input question\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='\\nQuestion: Is 4 + 4 greater than the current hour of the day\\n', additional_kwargs={}, response_metadata={}), AIMessage(content='Thought: To answer this question, I need to know the current hour of the day and compare it to 4 + 4.\\n\\nAction: current_datetime\\nAction Input: None\\n\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='The current time of day is 2025-03-11 16:24:52', additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the current time, I can extract the hour and compare it to 4 + 4.\\n\\nAction: calculator_multiply\\nAction Input: {'text': '4 + 4'}\", additional_kwargs={}, response_metadata={}), HumanMessage(content='The product of 4 * 4 is 16', additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the result of 4 + 4, which is 8, I can compare it to the current hour.\\n\\nAction: calculator_inequality\\nAction Input: {'text': '8 > 16'}\", additional_kwargs={}, response_metadata={}), HumanMessage(content='First number 8 is less than the second number 16', additional_kwargs={}, response_metadata={})]\n```\n\n**Output:**\nThought: I now know the final answer\n\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16)." }
- HTTP Response Example:
"data": { "id": "194d22dc-6c1b-44ee-a8d7-bf2b59c1cb6b", "choices": [ { "message": { "content": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16).", "role": null }, "finish_reason": "stop", "index": 0 } ], "created": "2025-03-11T21:24:56.961939Z", "model": "", "object": "chat.completion.chunk" }
The NeMo Agent Toolkit provides full OpenAI Chat Completions API compatibility through a dedicated endpoint that enables seamless integration with existing OpenAI-compatible client libraries and workflows.
When the OpenAI v1 compatible endpoint is configured, the toolkit creates a single endpoint that fully implements the OpenAI Chat Completions API specification. This endpoint handles both streaming and non-streaming requests based on the stream parameter, exactly like the official OpenAI API.
- Drop-in Replacement: Works with existing OpenAI client libraries without code changes
- Full API Compatibility: Supports all OpenAI Chat Completions API parameters
- Industry Standard: Familiar interface for developers already using OpenAI
- Future-Proof: Aligned with established API patterns and ecosystem tools
To enable the OpenAI v1 compatible endpoint, set openai_api_v1_path in your FastAPI front-end configuration:
general:
front_end:
_type: fastapi
workflow:
method: POST
openai_api_v1_path: /v1/chat/completions| Parameter | Type | Default | Description |
|---|---|---|---|
path |
string | /v1/workflow |
Path for the generate endpoint |
openai_api_path |
string | /v1/chat |
Path for the OpenAI chat endpoint |
openai_api_v1_path |
string | /v1/chat/completions |
Path for the OpenAI v1 compatible endpoint |
legacy_path |
string or null | /generate |
Legacy path for the generate endpoint. Set to null to disable |
legacy_openai_api_path |
string or null | /chat |
Legacy path for the chat endpoint. Set to null to disable |
method |
string | POST |
HTTP method for the endpoint |
disable_legacy_routes |
boolean | false |
Disable all legacy routes globally |
enable_interactive_extensions |
boolean | false |
Enable HTTP interactive execution on OpenAI Chat Completions endpoint |
Creates a single endpoint that handles both streaming and non-streaming requests:
- Route:
/v1/chat/completions(configurable viaopenai_api_v1_path) - Method: POST
- Content-Type:
application/json - Behavior: Routes to streaming or non-streaming based on
streamparameter
Creates separate endpoints for different request types:
- Non-streaming:
/<openai_api_path> - Streaming:
<openai_api_path>/stream
The endpoint accepts all standard OpenAI Chat Completions API parameters:
| Parameter | Type | Description | Validation |
|---|---|---|---|
messages |
array | Required. List of messages in conversation format | min 1 item |
model |
string | Model identifier | - |
frequency_penalty |
number | Decreases likelihood of repeating tokens | -2.0 to 2.0 |
logit_bias |
object | Modify likelihood of specific tokens | token ID → bias |
logprobs |
boolean | Return log probabilities | - |
top_logprobs |
integer | Number of most likely tokens to return | 0 to 20 |
max_tokens |
integer | Maximum tokens to generate | ≥ 1 |
n |
integer | Number of completions to generate | 1 to 128 |
presence_penalty |
number | Increases likelihood of new topics | -2.0 to 2.0 |
response_format |
object | Specify response format | - |
seed |
integer | Random seed for deterministic outputs | - |
service_tier |
string | Service tier selection | "auto" or "default" |
stop |
string/array | Stop sequences | - |
stream |
boolean | Enable streaming responses | default: false |
stream_options |
object | Streaming configuration options | - |
temperature |
number | Sampling temperature | 0.0 to 2.0 |
top_p |
number | Nucleus sampling parameter | 0.0 to 1.0 |
tools |
array | Available function tools | - |
tool_choice |
string/object | Tool selection strategy | - |
parallel_tool_calls |
boolean | Enable parallel tool execution | default: true |
user |
string | End-user identifier | - |
Non-Streaming Request:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/llama-3.1-8b-instruct",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"stream": false,
"temperature": 0.7,
"max_tokens": 100
}'Streaming Request:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/llama-3.1-8b-instruct",
"messages": [
{"role": "user", "content": "Tell me a short story"}
],
"stream": true,
"temperature": 0.7
}'OpenAI Python Client:
from openai import OpenAI
# Initialize client pointing to your NeMo Agent Toolkit server
client = OpenAI(
api_key="not-needed", # API key not required for local deployment
base_url="http://localhost:8000/v1"
)
# Non-streaming chat completion
response = client.chat.completions.create(
model="nvidia/llama-3.1-8b-instruct",
messages=[
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
stream=False,
temperature=0.7,
max_tokens=150
)
print(response.choices[0].message.content)AI SDK (JavaScript/TypeScript):
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';
// Configure custom OpenAI provider
const customOpenAI = openai({
baseURL: 'http://localhost:8000/v1',
apiKey: 'not-needed'
});
// Non-streaming generation
const { text } = await generateText({
model: customOpenAI('nvidia/llama-3.1-8b-instruct'),
prompt: 'Explain the benefits of renewable energy',
temperature: 0.7,
maxTokens: 200
});
console.log(text);The default endpoint paths have been updated to use a versioned URL scheme. Legacy paths
(/generate, /chat) continue to work by default. To migrate:
- Update client URLs: Replace
/generatewith/v1/workflowand/chatwith/v1/chat - Update streaming URLs: Replace
/generate/streamwith/v1/workflow/streamand/chat/streamwith/v1/chat/stream - Test thoroughly: Verify all endpoints work with the new paths
- Disable legacy routes (optional): Set
disable_legacy_routes: truein your configuration once all clients have been migrated
If you are currently using legacy mode with separate endpoints:
- Update Configuration: Set
openai_api_v1_path: /v1/chat/completions - Update Client Code: Use the single endpoint with a
streamparameter - Test Thoroughly: Verify both streaming and non-streaming functionality
If you are migrating from the OpenAI API:
- Update Base URL: Point to your NeMo Agent Toolkit server
- Update Model Names: Use your configured model identifiers
- Test Compatibility: Verify all features work as expected
- Route:
/feedback - Description: Add reaction feedback for an assistant message through observability trace ID. This endpoint is available when using the Weave FastAPI plugin worker. For setup instructions, see the Weave observability guide.
- HTTP Request Example:
curl --request POST \ --url http://localhost:8000/feedback \ --header 'Content-Type: application/json' \ --data '{ "observability_trace_id": "01933b2e-1234-5678-9abc-def012345678", "reaction_type": "👍" }'
- HTTP Response Example:
{ "message": "Added reaction '👍' to call 01933b2e-1234-5678-9abc-def012345678" }
The NeMo Agent Toolkit provides a built-in monitoring endpoint for per-user workflows that exposes real-time resource usage metrics. This is useful for debugging, capacity planning, and operational monitoring of multi-user deployments.
To enable the monitoring endpoint, set enable_per_user_monitoring to true in your workflow configuration:
general:
enable_per_user_monitoring: true:::{note} This endpoint is only available when:
- The workflow is registered as a per-user workflow (using
@register_per_user_function) - The
enable_per_user_monitoringconfiguration option is set totrue:::
- Route:
/monitor/users - Method: GET
- Description: Returns resource usage metrics for all active per-user workflow sessions, or for a specific user if a
user_idquery parameter is provided.
| Parameter | Type | Required | Description |
|---|---|---|---|
user_id |
string | No | Filter results to a specific user. If omitted, returns metrics for all active users. |
The response includes the following metrics for each user:
| Field | Description |
|---|---|
total_active_users |
Count of users with active per-user sessions (builders still in memory), regardless of in-flight requests |
user_id |
The user identifier (from nat-session cookie or JWT in Authorization header; see User Identification) |
session.created_at |
When the per-user workflow was first created |
session.last_activity |
Timestamp of the most recent request |
session.ref_count |
Number of active concurrent requests for this user |
session.is_active |
Whether the user session is currently active |
requests.total_requests |
Total number of requests processed |
requests.active_requests |
Number of requests currently in progress |
requests.avg_latency_ms |
Average request latency in milliseconds |
requests.error_count |
Number of failed requests |
memory.per_user_functions_count |
Number of per-user functions built for this user |
memory.per_user_function_groups_count |
Number of per-user function groups built |
memory.exit_stack_size |
Number of resources held in the async exit stack |
Get metrics for all users:
curl http://localhost:8000/monitor/users | jqResponse:
{
"timestamp": "2025-12-17T10:30:00.000000",
"total_active_users": 2,
"users": [
{
"user_id": "alice",
"session": {
"created_at": "2025-12-17T10:00:00.000000",
"last_activity": "2025-12-17T10:29:55.000000",
"ref_count": 1,
"is_active": true
},
"requests": {
"total_requests": 42,
"active_requests": 1,
"avg_latency_ms": 1250.5,
"error_count": 2
},
"memory": {
"per_user_functions_count": 3,
"per_user_function_groups_count": 1,
"exit_stack_size": 2
}
},
{
"user_id": "bob",
"session": {
"created_at": "2025-12-17T10:15:00.000000",
"last_activity": "2025-12-17T10:28:00.000000",
"ref_count": 0,
"is_active": false
},
"requests": {
"total_requests": 10,
"active_requests": 0,
"avg_latency_ms": 980.0,
"error_count": 0
},
"memory": {
"per_user_functions_count": 2,
"per_user_function_groups_count": 1,
"exit_stack_size": 1
}
}
]
}Get metrics for a specific user:
curl "http://localhost:8000/monitor/users?user_id=alice"Response:
{
"timestamp": "2025-12-17T10:30:00.000000",
"total_active_users": 1,
"users": [
{
"user_id": "alice",
"session": {...},
"requests": {...},
"memory": {...}
}
]
}
The monitoring endpoint is useful for:
- Debugging: Identify users with high error counts or unusual latency patterns
- Capacity Planning: Monitor resource usage across users to plan scaling
- Usage Analytics: Track LLM token consumption and request volumes per user
- Session Management: Identify inactive sessions for cleanup or investigate active sessions
For more information about per-user workflows, refer to:
You can also evaluate workflows via the NeMo Agent Toolkit evaluate endpoint. The endpoint is registered by the core FastAPI worker and enabled only when nvidia-nat-eval is installed (plus async_endpoints support for async job handling). For more information, refer to the NeMo Agent Toolkit Evaluation Endpoint documentation.
Use streaming if you need real-time updates or live communication where users expect immediate feedback. Use non-streaming if your workflow responds with simple updates and less feedback is needed.
A custom user interface can communicate with the API server using both HTTP requests and WebSocket connections.
- For details on proper WebSocket messaging integration, refer to the WebSocket Messaging Interface documentation.
- For HTTP-based interactive workflows (Human-in-the-Loop and OAuth without WebSockets), refer to the HTTP Interactive Execution documentation.