Note
Complexity: 🛑 Advanced
This guide walks through the complete process of running decision-only evaluations using the react_benchmark_agent: downloading data, configuring evaluations, running experiments, and analyzing results.
Currently, this agent supports evaluation exclusively against the Galileo Agent Leaderboard v2. We plan to extend the supported tool sets and benchmarks and will update this document accordingly.
Important
Prerequisite: Before running these examples, complete the Dynamo Backend Setup Guide to set up and verify your Dynamo inference server is running and responding to curl requests.
- Prerequisites
- Environment Setup
- Dataset Preparation
- Configuration Files
- Running Evaluations
- Self-Evaluation Loop
- Understanding Results
- Performance Analysis
- Concurrency Benchmarking
- Troubleshooting
Warning
This example requires a Linux system with an NVIDIA GPU. See the Dynamo Support Matrix for full details.
Supported Platforms:
- Ubuntu 22.04 / 24.04 (x86_64)
- Ubuntu 24.04 (ARM64)
- CentOS Stream 9 (x86_64, experimental)
Not Supported:
- ❌ macOS (Intel or Apple Silicon)
- ❌ Windows
You do not need to install ai-dynamo or ai-dynamo-runtime packages locally. The Dynamo server runs inside pre-built Docker images from NGC (nvcr.io/nvidia/ai-dynamo/sglang-runtime), which include all necessary components. The NeMo Agent Toolkit Dynamo LLM client (_type: dynamo) is a pure HTTP client that works on any platform.
- Python 3.11, 3.12, or 3.13 installed
- NeMo Agent Toolkit repository cloned
- Docker with NVIDIA Container Toolkit
- NVIDIA Driver with CUDA 12.0+ support, nvidia-fabricmanager enabled and matching your driver version. Verify with:
  docker run --rm --gpus all nvidia/cuda:12.4.0-runtime-ubuntu22.04 \
    bash -c "apt-get update && apt-get install -y python3-pip && pip3 install torch && python3 -c 'import torch; print(torch.cuda.is_available())'"
  The output should show True. If it shows False with error 802, ensure nvidia-fabricmanager is installed, running, and matches your driver version.
- Hugging Face account with access to the Llama-3.3-70B-Instruct model (requires approval from Meta)
- Model weights downloaded - Follow the model download instructions in the Dynamo Setup Guide
Running these evaluations requires a Dynamo backend with adequate GPU resources. The following are the minimum and recommended specifications:
| Component | Minimum | Recommended |
|---|---|---|
| GPU Architecture | NVIDIA Hopper (H100) | B200 for optimal performance |
| GPU Count | 4 GPUs (TP=4 for 70B model) | 8 GPUs for optimal performance |
| GPU Memory | 96GB per GPU (H100) | 192GB per GPU (B200) |
Note: The Llama-3.3-70B-Instruct model requires approximately 140GB of GPU memory when loaded with TP=4 (tensor parallelism across 4 GPUs). Ensure your GPU configuration has sufficient aggregate memory.
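As a rough sanity check (an approximation that ignores KV cache, activations, and framework overhead), the ~140GB figure follows directly from the parameter count and 2-byte BF16 weights:

```python
# Rough estimate of weight memory for Llama-3.3-70B-Instruct in BF16 (2 bytes per parameter).
# KV cache, activations, and runtime overhead add to this, so treat it as a lower bound.
params = 70e9
bytes_per_param = 2  # BF16 / FP16
weights_gb = params * bytes_per_param / 1e9
print(f"Total weight memory: ~{weights_gb:.0f} GB")      # ~140 GB
print(f"Per GPU with TP=4:   ~{weights_gb / 4:.0f} GB")  # ~35 GB per GPU, before KV cache
```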
The Dynamo backend must be running on localhost:8099 before executing evaluations. See the Dynamo Setup Guide for detailed instructions on:
- Starting Dynamo in unified or disaggregated mode
- Configuring GPU workers and tensor parallelism
- Setting up the Thompson Sampling router for KV cache optimization
- Troubleshooting common issues
Note: For a quicker way to kick off experimentation, see the Quick Start section in the parent README. This document provides more detailed explanations of the available test patterns and configurations.
# Navigate to the repository root
cd /path/to/NeMo-Agent-Toolkit
# Create virtual environment with uv
uv venv "${HOME}/.venvs/nat_dynamo_eval" --python 3.13
source "${HOME}/.venvs/nat_dynamo_eval/bin/activate"
# Install nvidia-nat with LangChain support
uv pip install -e ".[langchain]"
# Install visualization dependencies
uv pip install matplotlib scipy
# Install the workflow package
cd examples/dynamo_integration/react_benchmark_agent
uv pip install -e .
To activate an existing environment:
source "${HOME}/.venvs/nat_dynamo_eval/bin/activate"
If not already configured from running ../README.md, copy .env.example to a new .env, update the environment variable values, and source it in the current terminal:
cd ../ # NeMo-Agent-Toolkit/examples/dynamo_integration
cp .env.example .env
vi .env # update the environment variables then source
[ -f .env ] && source .env || { echo "Warning: .env not found" >&2; false; }
Note: Dynamo-specific environment variables (DYNAMO_BACKEND, DYNAMO_MODEL, DYNAMO_PORT) are used by the test scripts in external/dynamo/ and are not required for running evaluations. See Dynamo Setup Guide for those options.
Before running evaluations, ensure Dynamo is running:
cd ../../external/dynamo/ # NeMo-Agent-Toolkit/external/dynamo
bash start_dynamo_unified.sh
bash test_dynamo_integration.sh
Note: To customize GPU workers and tensor parallelism, edit the configuration variables at the top of external/dynamo/start_dynamo_unified.sh:
- WORKER_GPUS="4,5,6,7" - GPU device IDs to use (for example, "0,1" for the first 2 GPUs)
- TP_SIZE=4 - Tensor parallel size (must match the number of GPUs)
- HTTP_PORT=8099 - API endpoint port
- LOCAL_MODEL_DIR="..." - Path to your local model weights
See Dynamo Setup Guide for detailed configuration options.
Note
Requires the virtual environment to be active. See Environment Setup.
cd ../../examples/dynamo_integration
export HF_TOKEN=your_huggingface_token
python scripts/download_agent_leaderboard_v2.py --domains banking
Creates:
- data/agent_leaderboard_v2_banking.json - 100 enriched scenarios
- data/raw/banking/tools.json - 20 banking tool schemas
- Each scenario includes expected_tool_calls derived from user_goals
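A quick sanity check after the download is to load the file and inspect one scenario. This sketch assumes the file is a JSON array of scenario objects with the fields shown in the scenario structure below:

```python
import json

# Load the enriched banking scenarios produced by download_agent_leaderboard_v2.py.
with open("data/agent_leaderboard_v2_banking.json") as f:
    scenarios = json.load(f)

print(f"Scenarios: {len(scenarios)}")   # expect 100 for the banking domain
first = scenarios[0]
print(first["id"])                      # e.g. "banking_scenario_000"
print(first["user_goals"])              # high-level goals for the scenario
print(first["expected_tool_calls"])     # ground-truth tools used by the TSQ evaluator
```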
The minimal test config (eval_config_no_rethinking_minimal_test.yml) requires a test subset. This configuration can be used for quick end-to-end tests without running the entire dataset through nat eval. Create the subset with:
# cd /path/to/NeMo-Agent-Toolkit/examples/dynamo_integration
# 3-scenario subset for quick testing (required by eval_config_no_rethinking_minimal_test.yml)
python scripts/create_test_subset.py \
--input-file ./data/agent_leaderboard_v2_banking.json \
--output-file ./data/agent_leaderboard_v2_test_subset.json \
--num-scenarios 3
# Single scenario for debugging
python scripts/create_test_subset.py \
--input-file ./data/agent_leaderboard_v2_banking.json \
--output-file ./data/agent_leaderboard_v2_single.json \
--num-scenarios 1
Each scenario in the dataset contains:
{
"id": "banking_scenario_000",
"question": "I need to check my balance and transfer $500...",
"user_goals": ["Check account balance", "Transfer funds", "Verify transaction"],
"available_tools": [...],
"expected_tool_calls": ["get_account_balance", "transfer_funds", "get_transaction_history"],
"metadata": {...}
}
| Configuration File | Description | Dataset | Use Case |
|---|---|---|---|
| eval_config_no_rethinking_full_test.yml | Full evaluation | 100 scenarios | Production benchmarks |
| eval_config_no_rethinking_minimal_test.yml | Quick test | 3 scenarios | Validation |
| eval_config_rethinking_full_test.yml | Self-evaluation loop | 100 scenarios | Quality optimization |
| profile_rethinking_full_test.yml | Profiler + self-eval | 100 scenarios | Performance analysis |
| optimize_rethinking_full_test.yml | Prefix header optimization | 100 scenarios | Dynamo Predictive KV-Aware Cache router tuning |
| config_dynamo_e2e_test.yml | LangChain + Dynamo integration | Single query | Framework integration test |
| config_dynamo_prefix_e2e_test.yml | LangChain + Dynamo with prefix headers | Single query | KV cache optimization test |
| config_dynamo_adk_e2e_test.yml | Google ADK + Dynamo integration | Single query | ADK framework integration test |
All config files are located in react_benchmark_agent/configs/.
The Dynamo LLM provider supports multiple agent frameworks. Each framework has a dedicated e2e test configuration to verify the integration works correctly.
Google ADK (Agent Development Kit) is an increasingly popular framework for building AI agents. Testing the Dynamo + ADK integration is important because:
- Different header injection mechanism: ADK uses LiteLLM under the hood, which requires passing headers via extra_headers at client initialization time, unlike LangChain which uses httpx event hooks for per-request injection.
- Conversation-level prefix ID consistency: All requests from the same ADK client instance share the same prefix ID, which is ideal for KV cache optimization in multi-turn conversations.
- Provider prefix requirements: LiteLLM requires model names to be prefixed with the provider (for example, openai:llama-3.3-70b) for custom endpoints, which differs from LangChain's direct model name usage.
# Install ADK demo package (required for ADK workflow)
cd ../../ # /path/to/NeMo-Agent-Toolkit
pip install -e './examples/frameworks/adk_demo' # may need to use --no-deps depending on working branch version
# Run the ADK + Dynamo integration test (basic I/O)
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml \
--input "Hello! What is 2+2?"Expected output: The Dynamo prefix headers should be logged, and all LLM calls within the conversation will share the same prefix ID (for example, adk-dynamo-test-7a31631c0ec24857).
Note: The ADK e2e test is configured for basic I/O testing only (no tool calling). This is because ADK with LiteLLM requires OpenAI-style function calling support from the model endpoint, which vanilla Llama models served via vLLM or Dynamo don't support out of the box. For tool-calling workflows with Dynamo, use the LangChain + ReAct agent configs (for example, config_dynamo_prefix_e2e_test.yml), which parse tool calls from text output.
llms:
dynamo_llm:
_type: dynamo
model_name: llama-3.3-70b
base_url: http://localhost:8099/v1
api_key: dummy
temperature: 0.0
max_tokens: 8192
stop: ["Observation:", "\nThought:"] # CRITICAL: Prevents observation hallucination
# Optional: Customize prefix headers (sent by default with "nat-dynamo-{uuid}")
# prefix_template: "react-benchmark-{uuid}" # Custom template
prefix_total_requests: 10
prefix_osl: MEDIUM # Output Sequence Length: LOW | MEDIUM | HIGH
  prefix_iat: MEDIUM             # Inter-Arrival Time: LOW | MEDIUM | HIGH
Note: The dynamo LLM type automatically sends prefix headers for KV cache optimization. Headers are enabled by default using the template nat-dynamo-{uuid}. You can customize the template with prefix_template or disable headers entirely by setting prefix_template: null. These headers help the Predictive KVCache-Aware Thompson Sampling router make optimal routing decisions (see Dynamo Setup Guide).
For TSQ evaluation, tools must be configured in decision-only mode:
functions:
react_benchmark_agent:
_type: react_benchmark_agent
prefix: "Agent:"
decision_only: true
canned_response_template: "Successfully executed {tool_name}. Operation completed."
function_groups:
banking_tools:
_type: banking_tools_group
# tools.json available after running: /examples/dynamo_integration/scripts/download_agent_leaderboard_v2.py
tools_json_path: ./examples/dynamo_integration/data/raw/banking/tools.json
decision_only: true
include: [
get_account_balance,
get_transaction_history,
transfer_funds,
# ... all 20 banking tools
    ]
Note: The decision_only: true setting is required for TSQ evaluation. It makes tools return canned responses instead of executing real banking operations. The canned_response_template defines the response format (for example, "Successfully executed {tool_name}"). This allows evaluation of tool selection without needing actual backend services.
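For intuition, the sketch below shows what a decision-only stub does: it records the tool selection and returns the canned response rendered from the template. This is purely illustrative and not the toolkit's implementation (banking_tools_group registers its stubs from tools.json):

```python
# Illustrative sketch of a decision-only tool stub. Only the canned-response
# behavior is shown; real stubs are generated from the tools.json schemas.
CANNED_TEMPLATE = "Successfully executed {tool_name}. Operation completed."

captured_tool_calls: list[str] = []

def decision_only_stub(tool_name: str, **kwargs) -> str:
    """Record the tool selection and return a mock response instead of calling a backend."""
    captured_tool_calls.append(tool_name)
    return CANNED_TEMPLATE.format(tool_name=tool_name)

print(decision_only_stub("transfer_funds", amount=500))
# -> "Successfully executed transfer_funds. Operation completed."
```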
workflow:
_type: react_agent
llm_name: dynamo_llm
tool_names: [
banking_tools.get_account_balance,
banking_tools.transfer_funds,
# ... all tools with banking_tools. prefix
]
verbose: true
max_tool_calls: 25
recursion_limit: 50
  pass_tool_call_errors_to_agent: true
eval:
general:
max_concurrency: 36 # Range: 1-64
output:
dir: ./examples/dynamo_integration/react_benchmark_agent/outputs/dynamo_evals/
cleanup: false
job_management:
append_job_id_to_output_dir: true
dataset:
_type: json
file_path: ./examples/dynamo_integration/data/agent_leaderboard_v2_banking.json
structure:
disable: true
evaluators:
tool_selection_quality:
_type: tsq_evaluator
llm_name: eval_llm
strict_mode: false
tool_weight: 1.0
parameter_weight: 0.0 # Set > 0 to evaluate parameter accuracy
      verbose: true
Note
Commands in this section require the virtual environment to be active. See Environment Setup.
curl http://localhost:8099/health
# Expected: HTTP 200 OK, else check dynamo runtime
If Dynamo isn't running, see Dynamo Setup Guide.
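If you script evaluations end to end, a small wait-for-ready helper avoids launching nat eval before the model has finished loading. This is a minimal sketch that only assumes the /health endpoint returns HTTP 200 once the server is up:

```python
import time
import urllib.request
import urllib.error

def wait_for_dynamo(url: str = "http://localhost:8099/health", timeout_s: int = 600) -> bool:
    """Poll the Dynamo health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep polling
        time.sleep(10)
    return False

if __name__ == "__main__":
    print("Dynamo ready" if wait_for_dynamo() else "Timed out waiting for Dynamo")
```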
Prerequisite: Create the test subset file first (if not already created):
cd /path/to/NeMo-Agent-Toolkit/examples/dynamo_integration
python scripts/create_test_subset.py \
  --input-file ./data/agent_leaderboard_v2_banking.json \
  --output-file ./data/agent_leaderboard_v2_test_subset.json
cd /path/to/NeMo-Agent-Toolkit
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml
Runtime: <1 minute. Expected TSQ: 0.3 - 0.6
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_no_rethinking_full_test.yml
Runtime: ~30-60 minutes (depends on concurrency). Expected TSQ: 0.4 - 0.7
✓ 20/20 banking tool stubs registered
✓ Tool stub executed: get_exchange_rates with 3 parameters
✓ Tool stub executed: setup_automatic_bill_pay with 8 parameters
Running workflow: 100%|██████████| 100/100 [00:45:12<00:00]
✓ TSQ Evaluation complete: average_score=0.571
The self-evaluation mechanism allows the agent to evaluate its own tool selection and retry if insufficient. This can improve TSQ scores by 5-15%.
User Question
↓
[Attempt 1] ReAct Agent executes
↓
Tool calls captured: [Tool A, Tool B, Tool C]
↓
Self-Evaluator LLM reviews:
- Are these tools sufficient?
- Is anything missing?
↓
Evaluation Result:
- is_sufficient: false
- confidence: 0.60
- missing_steps: ["verify_transaction"]
↓
[Decision] Confidence < threshold → Retry
↓
[Attempt 2] ReAct Agent executes (with feedback)
↓
Tool calls captured: [Tool A, Tool B, Tool C, Tool D]
↓
Self-Evaluator: is_sufficient: true, confidence: 0.85
↓
✓ Accept result
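The retry logic above can be sketched in a few lines of Python. The function and evaluator names here are hypothetical placeholders; only the control flow (evaluate, compare confidence against the threshold, retry with feedback) mirrors the diagram:

```python
# Sketch of the self-evaluation retry loop. `run_agent` and `evaluate_tool_calls`
# are hypothetical placeholders standing in for the wrapped ReAct agent and the
# evaluator LLM; only the control flow mirrors the diagram above.
MAX_RETRIES = 5
MIN_CONFIDENCE = 0.85

def self_evaluating_run(question: str, run_agent, evaluate_tool_calls) -> dict:
    feedback = None
    result = None
    for attempt in range(1, MAX_RETRIES + 2):  # initial attempt + retries (Attempt 1/6 ... 6/6)
        result = run_agent(question, feedback=feedback)
        evaluation = evaluate_tool_calls(question, result["tool_calls"])
        if evaluation["is_sufficient"] and evaluation["confidence"] >= MIN_CONFIDENCE:
            return result  # accept the tool sequence
        feedback = (
            "PREVIOUS ATTEMPT FEEDBACK:\n"
            f"EVALUATION: {evaluation['reasoning']}\n"
            f"MISSING STEPS: {evaluation['missing_steps']}\n"
            f"SUGGESTIONS: {evaluation.get('suggestions', [])}"
        )
    return result  # fall back to the last attempt if never accepted
```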
Use eval_config_rethinking_full_test.yml:
functions:
# Define the ReAct workflow as a function
react_workflow:
_type: react_agent
llm_name: dynamo_llm
tool_names: [banking_tools.get_account_balance, ...]
verbose: true
max_tool_calls: 25
# Wrap with self-evaluating agent
workflow:
_type: self_evaluating_agent_with_feedback
wrapped_agent: react_workflow
evaluator_llm: eval_llm
max_retries: 5
min_confidence_threshold: 0.85
pass_feedback_to_agent: true # KEY: Pass evaluation feedback on retry
verbose: true
feedback_template: |
PREVIOUS ATTEMPT FEEDBACK:
Your previous tool selection was evaluated and found to be insufficient.
EVALUATION: {reasoning}
MISSING STEPS: {missing_steps}
SUGGESTIONS: {suggestions}
    Please try again, addressing the issues identified above.
| Parameter | Type | Default | Description |
|---|---|---|---|
| wrapped_agent | FunctionRef | required | Reference to underlying ReAct agent |
| evaluator_llm | LLMRef | required | LLM for self-evaluation |
| max_retries | int | 2 | Maximum retry attempts (0-5) |
| min_confidence_threshold | float | 0.7 | Minimum confidence to accept (0.0-1.0) |
| pass_feedback_to_agent | bool | false | Pass evaluation feedback on retry |
| verbose | bool | true | Enable detailed logging |
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_rethinking_full_test.yml
================================================================================
Attempt 1/6
================================================================================
INFO: Captured 2 tool calls
INFO: 1. get_account_balance
INFO: 2. transfer_funds
--------------------------------------------------------------------------------
Self-Evaluation Result:
Sufficient: False
Confidence: 0.60
Reasoning: Missing verification step after transfer
Missing steps: verify_transaction_status
--------------------------------------------------------------------------------
✗ Tool sequence insufficient - retrying...
================================================================================
Attempt 2/6
================================================================================
INFO: Captured 3 tool calls
INFO: 1. get_account_balance
INFO: 2. transfer_funds
INFO: 3. get_transaction_history
--------------------------------------------------------------------------------
Self-Evaluation Result:
Sufficient: True
Confidence: 0.85
--------------------------------------------------------------------------------
✓ Tool sequence accepted
| Metric | Without Self-Eval | With Self-Eval |
|---|---|---|
| Average attempts per question | 1 | 1.3-1.8 |
| Token usage | Baseline | +15-20% |
| Latency | Baseline | +30-80% |
| TSQ score improvement | - | +5-15% |
For Speed:
max_retries: 1
min_confidence_threshold: 0.6
For Quality:
max_retries: 3
min_confidence_threshold: 0.85
pass_feedback_to_agent: true
Results are saved to react_benchmark_agent/outputs/dynamo_evals/<job_id>/:
| File | Description |
|---|---|
| tool_selection_quality_output.json | TSQ scores per scenario |
| standardized_data_all.csv | Profiler data (tokens, timestamps) |
| all_requests_profiler_traces.json | Raw trace data |
| workflow_profiling_report.txt | Human-readable profiling summary |
{
"average_score": 0.571,
"eval_output_items": [{
"id": "banking_scenario_000",
"score": 0.571,
"reasoning": {
"tool_selection_accuracy": 0.571,
"parameter_usage_accuracy": 0.0,
"actual_tool_calls": 5,
"expected_tool_calls": 8,
"details": {
"actual_tools": ["get_exchange_rates", "setup_automatic_bill_pay", ...],
"expected_tools": ["get_credit_card_information", "report_lost_stolen_card", ...]
}
}
}]
}
TSQ uses F1 score to balance precision and recall:
Precision = Correct Tools / Actual Tools Called
Recall = Correct Tools / Expected Tools
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example:
actual_tools = {tool1, tool2, tool3} # 3 tools called
expected_tools = {tool2, tool3, tool4, tool5} # 4 tools expected
intersection = {tool2, tool3} # 2 correct
precision = 2/3 = 0.667 # Called 1 extra unnecessary tool
recall = 2/4 = 0.500 # Missed 2 expected tools
f1_score = 2 × (0.667 × 0.500) / (0.667 + 0.500) = 0.571
| Score Range | Quality | Interpretation |
|---|---|---|
| 0.0 - 0.3 | Poor | Agent selecting wrong tools |
| 0.3 - 0.6 | Moderate | Right general idea, some confusion |
| 0.6 - 0.8 | Good | Mostly correct tool selection |
| 0.8 - 1.0 | Excellent | Near-perfect tool selection |
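To reproduce the calculation yourself, the same F1 computation takes only a few lines of Python (tool names are placeholders from the worked example above):

```python
# Reproduce the TSQ F1 calculation from the worked example above.
# Tool names are placeholders; only the set arithmetic matters.
def tsq_f1(actual_tools: set[str], expected_tools: set[str]) -> float:
    correct = actual_tools & expected_tools
    if not correct:
        return 0.0
    precision = len(correct) / len(actual_tools)
    recall = len(correct) / len(expected_tools)
    return 2 * precision * recall / (precision + recall)

actual = {"tool1", "tool2", "tool3"}
expected = {"tool2", "tool3", "tool4", "tool5"}
print(round(tsq_f1(actual, expected), 3))  # 0.571
```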
Note
Commands in this section require the virtual environment to be active. See Environment Setup.
After evaluation, analyze token generation performance:
cd examples/dynamo_integration # /path/to/NeMo-Agent-Toolkit/examples/dynamo_integration
python scripts/throughput_analysis.py \
  react_benchmark_agent/outputs/dynamo_evals/<workflow_output_dir>/<job_id>/standardized_data_all.csv
Output metrics:
- TTFT (Time To First Token): Mean, median, P90, P95, P99
- ITL (Inter-Token Latency): Time between consecutive tokens
- Per-Request Throughput: Tokens per second for individual calls
- Aggregate Throughput: Total tokens / wall-clock time
Example output:
================================================================================
LLM Performance Analysis Summary
================================================================================
Dataset Overview:
Total LLM Calls: 210
Total Tokens Generated: 20,880
Wall-Clock Time: 236.3s
--------------------------Time To First Token (TTFT)----------------------------
Mean: 52.44 ms
Median: 52.70 ms
P95: 54.10 ms
------------Inter-Token Latency (ITL) / Time Per Output Token (TPOT)------------
Mean: 10.74 ms
Median: 10.88 ms
P95: 11.21 ms
-----------------------Per-Request Throughput (Tokens Per Second)---------------
Mean: 89.43 tok per second
Median: 89.42 tok per second
-----------------Aggregate Throughput (All Concurrent Requests)-----------------
Aggregate Throughput: 88.37 tokens per second
================================================================================
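If you post-process the profiler data with your own tooling, the metric definitions above are simple to reproduce. The sketch below uses made-up sample values (it does not assume any particular column names in standardized_data_all.csv) and shows how mean/median/P95 TTFT and aggregate throughput are derived from per-request measurements:

```python
import statistics

# Hypothetical per-request measurements; in practice these come from
# standardized_data_all.csv (exact column names depend on the profiler output).
ttft_ms = [51.2, 52.7, 53.1, 54.0, 52.4]    # time to first token per LLM call
tokens_generated = [96, 110, 88, 120, 101]  # completion tokens per LLM call
wall_clock_s = 12.4                         # total elapsed time for the run

print(f"TTFT mean:   {statistics.mean(ttft_ms):.2f} ms")
print(f"TTFT median: {statistics.median(ttft_ms):.2f} ms")
print(f"TTFT P95:    {statistics.quantiles(ttft_ms, n=100)[94]:.2f} ms")

# Aggregate throughput counts all generated tokens from all concurrent requests
# against wall-clock time, rather than averaging per-request rates.
print(f"Aggregate throughput: {sum(tokens_generated) / wall_clock_s:.1f} tok/s")
```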
Generate scatter plots comparing throughput metrics against TSQ scores:
# cd /path/to/NeMo-Agent-Toolkit/examples/dynamo_integration
python scripts/plot_throughput_histograms_per_request.py \
  react_benchmark_agent/outputs/dynamo_evals/<workflow-output-dir>/jobs
Generated plots:
- ttft_histogram.png - Time To First Token distribution
- itl_histogram.png - Inter-Token Latency distribution
- tps_histogram.png - Tokens Per Second distribution
- total_tokens_histogram.png - Total tokens per request distribution
- llm_calls_histogram.png - LLM calls per request distribution
- total_duration_histogram.png - Request duration distribution
- summary_throughput_histograms.png - Multi-panel summary
- throughput_histogram_data.csv - Aggregated histogram data
- throughput_histogram_per_llm_call_data.csv - Per-LLM-call data
Note
Commands in this section require the virtual environment to be active. See Environment Setup.
The scripts/run_concurrency_benchmark.sh script automates performance testing across different concurrency levels.
- Runs evaluations with max_concurrency set to 2, 4 (configurable)
- Tracks each job and its output directory
- Analyzes performance using scripts/throughput_analysis.py
- Aggregates results into CSV and markdown reports
# cd /path/to/NeMo-Agent-Toolkit/examples/dynamo_integration
./scripts/run_concurrency_benchmark.sh # could take ~30 minutes to run
# When prompted, enter a unique name (e.g., "baseline_v1")
react_benchmark_agent/outputs/benchmarks/<name>_<timestamp>/
├── benchmark_results.csv # Machine-readable CSV
├── benchmark_report.md # Human-readable markdown
├── analysis_16.txt # Detailed analysis for concurrency=16
├── analysis_32.txt # Detailed analysis for concurrency=32
└── ...
concurrency,total_llm_calls,total_tokens,total_duration_sec,
ttft_mean_ms,ttft_median_ms,ttft_p90_ms,ttft_p95_ms,
itl_mean_ms,itl_median_ms,itl_p90_ms,itl_p95_ms,
throughput_mean_toks,throughput_median_toks,...
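To compare runs visually, the aggregated benchmark_results.csv can be plotted directly. The sketch below assumes the column names listed above and uses a hypothetical benchmark directory name; matplotlib is available from the environment setup step:

```python
import csv
import matplotlib.pyplot as plt

# Load the aggregated benchmark results (column names as listed above).
# The directory name here is a hypothetical example.
results_csv = "react_benchmark_agent/outputs/benchmarks/baseline_v1_20240101/benchmark_results.csv"
with open(results_csv) as f:
    rows = list(csv.DictReader(f))

concurrency = [int(r["concurrency"]) for r in rows]
throughput = [float(r["throughput_mean_toks"]) for r in rows]
ttft_p95 = [float(r["ttft_p95_ms"]) for r in rows]

# Plot mean throughput and P95 TTFT against the concurrency level.
fig, ax1 = plt.subplots()
ax1.plot(concurrency, throughput, marker="o", label="Mean throughput (tok/s)")
ax1.set_xlabel("max_concurrency")
ax1.set_ylabel("Tokens per second")
ax2 = ax1.twinx()
ax2.plot(concurrency, ttft_p95, marker="s", color="tab:red", label="TTFT P95 (ms)")
ax2.set_ylabel("TTFT P95 (ms)")
fig.tight_layout()
fig.savefig("throughput_vs_concurrency.png")
```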
- Each eval run: 15-30 minutes (depends on dataset size)
- Total benchmark (2 concurrency levels by default): 30-60 minutes
- Runs sequentially to avoid interference
Edit scripts/run_concurrency_benchmark.sh to change concurrency levels, for example:
# Change concurrency levels (around line 66)
CONCURRENCY_LEVELS=(1 2 4 8 16 32)
Symptom: PermissionError: [Errno 13] Permission denied: '.../.cache/huggingface/hub/.locks/...'
Cause: Your home directory is on NFS and doesn't support file locking.
Fix: Set HF_HOME to a local writable directory (not on NFS):
export HF_HOME=/path/to/local/storage/.cache/huggingface
Symptom: Observations don't match mock JSON responses
Fix: Ensure stop sequence and system prompt are set:
llms:
dynamo_llm:
stop: ["Observation:"]
workflow:
system_prompt: |
    ... STOP HERE. DO NOT generate the Observation ...
Symptom: actual_tool_calls: 0
Cause: Tools aren't being executed or tool stubs aren't configured for decision-only mode.
Fix:
- Check logs for "Tool stub executed" - if missing, tools aren't running
- Ensure your function and function_groups config files have decision_only: true and a canned_response_template:
functions:
react_benchmark_agent:
_type: react_benchmark_agent
prefix: "Agent:"
decision_only: true
canned_response_template: "Successfully executed {tool_name}. Operation completed."
function_groups:
banking_tools:
_type: banking_tools_group
# tools.json available after running: /examples/dynamo_integration/scripts/download_agent_leaderboard_v2.py
tools_json_path: ./examples/dynamo_integration/data/raw/banking/tools.json
    decision_only: true
Both decision_only: true settings are required. The canned_response_template defines the mock response format returned by tools. See Decision-Only Tool Configuration for details.
Symptom: ModuleNotFoundError: react_benchmark_agent
Fix:
cd examples/dynamo_integration/react_benchmark_agent
pip install -e . --force-reinstall
Symptom: Configuration paths not resolving
Fix: Run nat eval from the repository root:
cd /path/to/NeMo-Agent-Toolkit # repository root, not workflow directory
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/...
Symptom: GraphRecursionError: Recursion limit of 42 reached
Fix: Increase recursion limit in config:
workflow:
recursion_limit: 100
  max_tool_calls: 40
Symptom: Agent never accepts tool sequence
Fix: Lower confidence threshold:
workflow:
_type: self_evaluating_agent_with_feedback
  min_confidence_threshold: 0.6  # More lenient
Check Dynamo health:
curl http://localhost:8099/health
Restart if needed:
cd /path/to/NeMo-Agent-Toolkit/external/dynamo
bash stop_dynamo.sh
bash start_dynamo_unified.sh
See Dynamo Setup Guide for detailed troubleshooting.
Note
All commands should be run from the repository root with the virtual environment active. See Environment Setup.
# Basic Dynamo connectivity test
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_e2e_test.yml \
--input "What time is it?"
# Dynamo with prefix headers (for KV cache optimization)
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_prefix_e2e_test.yml \
--input "What time is it?"
# ADK + Dynamo integration test
nat run --config_file examples/dynamo_integration/react_benchmark_agent/configs/config_dynamo_adk_e2e_test.yml \
--input "Hello! What is 2+2?"# Quick validation (3 scenarios, ~1 minute)
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_no_rethinking_minimal_test.yml
# Full evaluation without self-evaluation (100 scenarios, ~5-10 min)
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_no_rethinking_full_test.yml
# Full evaluation with self-evaluation loop (100 scenarios, ~45 min)
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/eval_config_rethinking_full_test.yml
# Optimize Dynamo prefix header parameters for the Predictive KV-Aware Thompson Sampling router
#
# Parameters optimized:
# - prefix_total_requests: Expected requests per prefix (search space: 1-20, step 5)
# - prefix_osl: Output Sequence Length hint (LOW | MEDIUM | HIGH)
# - prefix_iat: Inter-Arrival Time hint (LOW | MEDIUM | HIGH)
#
# Objectives (multi-objective optimization, all minimized):
# - avg_llm_latency (70% weight) - Primary: reduce LLM response time
# - avg_workflow_runtime (20% weight) - Secondary: reduce total task time
# - avg_num_llm_calls (10% weight) - Tertiary: improve efficiency
#
# Uses grid search over the parameter space to find optimal routing hints.
# WARNING: this run could use MANY tokens - be mindful and run at your own risk.
nat optimize --config_file examples/dynamo_integration/react_benchmark_agent/configs/optimize_rethinking_full_test.yml
# Profile with comprehensive LLM and workflow metrics
#
# Metrics collected:
# - TTFT (Time To First Token) - measures prompt processing latency
# - ITL (Inter-Token Latency) - measures token generation speed
# - Throughput (tokens/second) - measures generation efficiency
# - Token usage patterns and forecasting
# - Bottleneck analysis with nested call stacks
# - Concurrency spike detection
#
# Output: standardized_data_all.csv for Pareto optimality analysis
# Use with: python scripts/throughput_analysis.py <output_dir>/standardized_data_all.csv
#
# The Pareto analysis identifies configurations that are optimal trade-offs
# between latency, throughput, and quality (TSQ). No single point dominates
# all others across all objectives - these form the Pareto frontier.
nat eval --config_file examples/dynamo_integration/react_benchmark_agent/configs/profile_rethinking_full_test.yml