Fix alert triage agent: switch to nemotron-3-nano model and improve prompts #1750

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:release/1.5 from hsin-c:fix/alert-triage-nemotron-config-and-prompt-update
Mar 5, 2026

Conversation

@hsin-c (Contributor) commented Mar 5, 2026

Summary

  • Switch LLM model from nvidia/llama-3.3-nemotron-super-49b-v1.5 to nvidia/nemotron-3-nano-30b-a3b across all LLM configs in config_offline_mode.yml, with temperature: 0 and max_tokens: 16384 for deterministic and more complete outputs.
  • Improve triage agent prompts to produce more accurate diagnoses, especially for broad alerts like InstanceDown where the root cause could span hardware, network, or software — the agent is now instructed to use all available tools before drawing conclusions.
  • Clarify categorizer prompt definitions for false_positive and need_investigation categories to reduce misclassification — false_positive now requires all diagnostic checks to indicate a healthy system, and need_investigation is reserved for genuinely incomplete or conflicting data.
  • Fix warning in categorizer.py: replace result.text() (method call) with result.text (property access).
  • Fix typo: "CPR" → "CPU" in host_performance_check tool description.
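
The `result.text()` to `result.text` change can be illustrated with a minimal sketch. It assumes only that the result object exposes `text` as a property; `FakeResult` and `extract_report_content` are hypothetical stand-ins for illustration, not the actual classes or functions in categorizer.py.

```python
class FakeResult:
    """Hypothetical stand-in for the LLM result object (not the real class)."""

    def __init__(self, content: str) -> None:
        self._content = content

    @property
    def text(self) -> str:
        # Exposed as a property, so it is read without parentheses.
        return self._content


def extract_report_content(result: FakeResult) -> str:
    # Property access, mirroring the fix: result.text, not result.text().
    return result.text
```

With a plain property like this one, `result.text()` would try to call the returned string and raise `TypeError`. The real library may behave differently (the PR describes a warning rather than an error), so treat this only as an illustration of the access pattern.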

Test plan

  • Ran pytest -v --run_slow --run_integration examples/advanced_agents/alert_triage_agent/tests/test_alert_triage_agent_workflow.py::test_full_workflow 10 times — all 10 runs passed with no intermittent failures.
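
A repeated run like the one in the test plan could be scripted roughly as follows. This is a sketch, not the procedure actually used for this PR: `build_command`, `run_once`, and `count_failures` are hypothetical helper names, and only the pytest flags and test path come from the test plan.

```python
import subprocess
from typing import Callable, List

# Test selector copied from the test plan.
TEST_ID = (
    "examples/advanced_agents/alert_triage_agent/tests/"
    "test_alert_triage_agent_workflow.py::test_full_workflow"
)


def build_command() -> List[str]:
    """Assemble the pytest invocation quoted in the test plan."""
    return ["pytest", "-v", "--run_slow", "--run_integration", TEST_ID]


def run_once(command: List[str]) -> int:
    """Run the command once and return its exit code."""
    return subprocess.run(command).returncode


def count_failures(
    command: List[str],
    runs: int,
    runner: Callable[[List[str]], int] = run_once,
) -> int:
    """Run the same command `runs` times and count non-zero exit codes."""
    failures = 0
    for _ in range(runs):
        if runner(command) != 0:
            failures += 1
    return failures


# Example usage (left as a comment because it starts the slow integration test):
#   failed = count_failures(build_command(), runs=10)
#   print(f"{failed} of 10 runs failed")
```

The `runner` parameter exists only so the loop can be exercised without launching pytest.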

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced alert categorization guidance with clearer instructions on diagnostic tool usage and deeper root cause analysis
    • Improved categorization definitions to reduce false positives and better identify alerts requiring further investigation
  • Improvements

    • Updated LLM model configurations with optimized parameters for enhanced performance

…e and a high max_token count; Prompt updates for the agent to work better with nemotron2-nano

Signed-off-by: Hsin Chen <hsinc@nvidia.com>
@hsin-c requested a review from a team as a code owner on March 5, 2026 06:44
copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Mar 5, 2026

Walkthrough

The pull request updates an alert triage agent system by correcting a method call to property access in the categorizer logic, migrating LLM models to a different variant with adjusted hyperparameters in the offline configuration, and refining prompt guidance for improved diagnostic tool usage and classification accuracy.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Code Logic Fix: examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/categorizer.py | Changed result.text() method call to result.text property access when computing report_content. |
| LLM Configuration Updates: examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/configs/config_offline_mode.yml | Updated 6 LLM configurations: migrated model from nvidia/llama-3.3-nemotron-super-49b-v1.5 to nvidia/nemotron-3-nano-30b-a3b; set temperature to 0; increased max_tokens to 16384; removed top_p setting where applicable. Affects ata_agent_llm, tool_reasoning_llm, telemetry_metrics_analysis_agent_llm, maintenance_check_llm, categorizer_llm, and nim_rag_eval_llm. |
| Prompt Refinements: examples/advanced_agents/alert_triage_agent/src/nat_alert_triage_agent/prompts.py | Expanded diagnostic tool guidance emphasizing comprehensive tool usage for broad alerts; clarified that network unreachability is a symptom, not a root cause; refined CategorizerPrompts definitions with more explicit, evidence-based explanations for false_positive and need_investigation categories. |
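
The configuration change summarized above can be sanity-checked with a small sketch. The six LLM names and the target values come from this PR; the key names (`model_name`, `temperature`, `max_tokens`, `top_p`) and the dictionary layout are assumptions about how config_offline_mode.yml is structured once loaded, and `build_llm_section` and `find_mismatches` are hypothetical helpers.

```python
from typing import Dict, List

# Target values taken from the PR description; key names are assumed.
EXPECTED = {
    "model_name": "nvidia/nemotron-3-nano-30b-a3b",
    "temperature": 0,
    "max_tokens": 16384,
}

# The six LLM entries listed in the change summary.
LLM_NAMES = [
    "ata_agent_llm",
    "tool_reasoning_llm",
    "telemetry_metrics_analysis_agent_llm",
    "maintenance_check_llm",
    "categorizer_llm",
    "nim_rag_eval_llm",
]


def build_llm_section() -> Dict[str, Dict[str, object]]:
    """Build an in-memory stand-in for the (assumed) llms section."""
    return {name: dict(EXPECTED) for name in LLM_NAMES}


def find_mismatches(llms: Dict[str, Dict[str, object]]) -> List[str]:
    """Report entries that are missing, differ from EXPECTED, or still set top_p."""
    problems = []
    for name in LLM_NAMES:
        entry = llms.get(name)
        if entry is None:
            problems.append(f"{name}: missing")
            continue
        for key, expected in EXPECTED.items():
            if entry.get(key) != expected:
                problems.append(
                    f"{name}.{key}: expected {expected!r}, got {entry.get(key)!r}"
                )
        if "top_p" in entry:
            problems.append(f"{name}.top_p: should be removed")
    return problems
```

Calling `find_mismatches` on a dictionary loaded from the real YAML file would flag any LLM entry that was missed during the migration, provided the assumed key names match the file.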

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main changes: switching to nemotron-3-nano model and improving prompts, with all three file modifications reflected. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@hsin-c added the bug and non-breaking labels on Mar 5, 2026
@mnajafian-nv (Contributor) left a comment

LGTM!

@mnajafian-nv (Contributor)

/merge

@willkill07 (Member)

/ok to test 27489a2

@dagardner-nv reopened this on Mar 5, 2026

@dagardner-nv (Contributor) commented Mar 5, 2026

/ok to test 27489a2

@rapids-bot bot merged commit 4d5a991 into NVIDIA:release/1.5 on Mar 5, 2026
15 checks passed