Merged
@@ -65,7 +65,7 @@ async def _arun(report: str) -> str:
# - Add newlines before and after section
# - Use extracted heading level for consistency
# - Add extra newline between category and reasoning for readability
- report_content = result.text().replace('\n', '\n\n')
+ report_content = result.text.replace('\n', '\n\n')
report_section = f"""\n\n{pound_signs} Root Cause Category\n{report_content}"""

# Log the result for tracking
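The one-line change in this hunk swaps a method call (`result.text()`) for an attribute access (`result.text`), which only works if `text` is exposed as a property on the response object. A minimal sketch of why this matters, using a hypothetical `Result` class standing in for the real response type (the newline-doubling mirrors the formatting comments above):

```python
class Result:
    """Hypothetical stand-in for the real response object in this PR."""

    def __init__(self, raw: str) -> None:
        self._raw = raw

    @property
    def text(self) -> str:
        # Because `text` is a property, it is read as `result.text`;
        # calling `result.text()` would raise TypeError, since the
        # returned str is not callable.
        return self._raw


def format_report_section(result: Result, pound_signs: str = "##") -> str:
    # Double every newline so the category and its reasoning end up
    # separated by a blank line, then prepend the section heading.
    report_content = result.text.replace("\n", "\n\n")
    return f"\n\n{pound_signs} Root Cause Category\n{report_content}"


section = format_report_section(Result("hardware\nPower supply failure."))
```

If the underlying API had kept `text` as a method, the original `result.text()` would have been correct; the fix suggests the library treats it as a plain attribute or property.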
@@ -72,38 +72,37 @@ workflow:
llms:
ata_agent_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
- temperature: 0.2
- max_tokens: 2048
+ model_name: nvidia/nemotron-3-nano-30b-a3b
+ temperature: 0
+ max_tokens: 16384

tool_reasoning_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
- temperature: 0.2
- top_p: 0.7
- max_tokens: 2048
+ model_name: nvidia/nemotron-3-nano-30b-a3b
+ temperature: 0
+ max_tokens: 16384

telemetry_metrics_analysis_agent_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
+ model_name: nvidia/nemotron-3-nano-30b-a3b
temperature: 0
- max_tokens: 2048
+ max_tokens: 16384

maintenance_check_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
+ model_name: nvidia/nemotron-3-nano-30b-a3b
temperature: 0
- max_tokens: 2048
+ max_tokens: 16384

categorizer_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
+ model_name: nvidia/nemotron-3-nano-30b-a3b
temperature: 0
- max_tokens: 2048
+ max_tokens: 16384

nim_rag_eval_llm:
_type: nim
- model_name: nvidia/llama-3.3-nemotron-super-49b-v1.5
+ model_name: nvidia/nemotron-3-nano-30b-a3b
max_tokens: 8

eval:
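The config hunk above moves every agent LLM to the same model with greedy decoding and a larger generation budget. A quick sanity-check sketch (not part of the PR) that mirrors the updated `llms` block as a Python dict and validates it; the `validate_llms` helper is hypothetical:

```python
NEW_MODEL = "nvidia/nemotron-3-nano-30b-a3b"

# Mirror of the updated YAML `llms` block: the five agent LLMs share one
# model, temperature 0, and a 16384-token budget.
LLMS = {
    name: {
        "_type": "nim",
        "model_name": NEW_MODEL,
        "temperature": 0,
        "max_tokens": 16384,
    }
    for name in (
        "ata_agent_llm",
        "tool_reasoning_llm",
        "telemetry_metrics_analysis_agent_llm",
        "maintenance_check_llm",
        "categorizer_llm",
    )
}
# The RAG eval judge only emits a short verdict, so it keeps a tiny budget.
LLMS["nim_rag_eval_llm"] = {"_type": "nim", "model_name": NEW_MODEL, "max_tokens": 8}


def validate_llms(llms: dict) -> None:
    for name, cfg in llms.items():
        assert cfg["model_name"] == NEW_MODEL, name
        # temperature 0 (where set) makes decoding greedy, so triage runs
        # are reproducible across evaluations.
        assert cfg.get("temperature", 0) == 0, name


validate_llms(LLMS)
```

Setting `temperature: 0` everywhere is a common choice for agentic pipelines whose outputs feed downstream parsers, since it removes sampling noise between runs.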
@@ -25,10 +25,10 @@
Begin by interpreting the incoming alert. Identify its type (e.g., *InstanceDown*, *HighCPUUsage*) and note any relevant details.

2. **Select and Use Diagnostic Tools**
- Based on the alert type, choose the most relevant tools to gather system metrics. Use each tool only once per alert.
+ Based on the alert type, select the appropriate diagnostic tools. Use each tool only once per alert. For broad alerts like *InstanceDown*, where the root cause could range from software to network to hardware, use all available tools to build a complete picture before drawing conclusions.

- `hardware_check`: Retrieves server power status and hardware health via IPMI. Useful for diagnosing instance down alerts or suspected hardware failures.
- - `host_performance_check`: Collects system-level CPU and memory usage using commands like `top` and `ps`. Use this to identify host's resource (CPR and memory) usage bottlenecks.
+ - `host_performance_check`: Collects system-level CPU and memory usage using commands like `top` and `ps`. Use this to identify the host's resource (CPU and memory) usage bottlenecks.
- `monitoring_process_check`: Checks whether critical processes are running on the host. Useful for verifying system functionality during instance down or degraded performance.
- `network_connectivity_check`: Tests host connectivity through ping, telnet, and HTTP health checks. Helps determine if the server is reachable from the network.
- `telemetry_metrics_analysis_agent`: Pulls telemetry metrics to check host status and analyze usage trends. Effective for validating instance uptime and system load over time.
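The tool-selection rule above (run every diagnostic once for broad alerts like *InstanceDown*, a focused subset otherwise) can be sketched as a small dispatch table. The tool names match the prompt; the dispatch mapping and `select_tools` helper are hypothetical illustrations, not code from the PR:

```python
# All diagnostics listed in the prompt, each run at most once per alert.
ALL_TOOLS = [
    "hardware_check",
    "host_performance_check",
    "monitoring_process_check",
    "network_connectivity_check",
    "telemetry_metrics_analysis_agent",
]

# Narrow alerts map to a focused subset; any alert type not listed here
# (e.g. InstanceDown) is treated as broad and gets the full battery.
FOCUSED = {
    "HighCPUUsage": ["host_performance_check", "telemetry_metrics_analysis_agent"],
}


def select_tools(alert_type: str) -> list[str]:
    """Return the diagnostic tools to run for a given alert type."""
    return FOCUSED.get(alert_type, ALL_TOOLS)
```

For example, `select_tools("InstanceDown")` falls through to the full tool list, matching the prompt's instruction to build a complete picture before drawing conclusions.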
@@ -39,6 +39,7 @@
- Evaluate the retrieved metrics against the alert details.
- Determine if the alert reflects a real problem or is a false positive.
- If an issue is detected, identify likely causes—such as hardware failure, performance bottlenecks, or network issues.
+ - Network unreachability (e.g., failed ping or telnet) is a **symptom**, not necessarily a root cause. A host may be unreachable due to a network issue, but also due to hardware failure (e.g., power loss). Always look for the deepest underlying cause that explains all observed symptoms.

4. **Generate a Structured Triage Report (in Markdown format)**
Organize your findings clearly under these sections:
@@ -66,8 +67,8 @@ class CategorizerPrompts:
- `network_connectivity`: The host is not reachable via ping or curl, or there are signs of connection issues due to blocked ports, broken services, or firewall rules (e.g., telnet fails).
- `hardware`: The alert is caused by a hardware failure or degradation.
- `repetitive_behavior`: The alert is triggered by a recurring or periodic behavior pattern (e.g., regular CPU spikes or memory surges).
- - `false_positive`: No clear signs of failure or degradation; system appears healthy and no suspicious pattern is found.
- - `need_investigation`: The report contains conflicting, ambiguous, or insufficient information to determine a clear root cause.
+ - `false_positive`: All diagnostic checks indicate the system is healthy (e.g., network is reachable, monitoring services are running, CPU/memory usage is normal, hardware is fine). The alert does not appear to reflect a real problem; even if the alert itself claims an issue, the collected evidence contradicts it.
+ - `need_investigation`: The diagnostic data is genuinely incomplete (e.g., key tools failed or returned no data), or multiple checks return mixed signals where some indicate a real problem and others do not, making it impossible to determine a clear root cause.

**Response Format**
- Line 1: Output only the category name (e.g., `hardware`)
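The response format above (category name alone on line 1, reasoning after) is easy to consume downstream. A hedged sketch of such a parser, using only the category names visible in this hunk (the full prompt may define additional ones, e.g. a software category); the parser itself is a hypothetical illustration:

```python
# Categories visible in the diff above; the surrounding prompt may list more.
VALID_CATEGORIES = {
    "network_connectivity",
    "hardware",
    "repetitive_behavior",
    "false_positive",
    "need_investigation",
}


def parse_categorizer_reply(reply: str) -> tuple[str, str]:
    """Split a reply into (category, reasoning) per the Response Format."""
    lines = reply.strip().splitlines()
    # Line 1 is the category name; tolerate stray backticks like `hardware`.
    category = lines[0].strip().strip("`")
    if category not in VALID_CATEGORIES:
        # If the model strays from the format, route the alert to a human.
        category = "need_investigation"
    reasoning = "\n".join(lines[1:]).strip()
    return category, reasoning


cat, why = parse_categorizer_reply("hardware\nIPMI reports a failed power supply.")
```

Falling back to `need_investigation` on an unrecognized first line is one defensive choice; it keeps malformed model output from silently closing a real incident.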