feat: implement production usage monitoring with data-driven thresholds #38
Open
Daniel Beer (DanielB945) wants to merge 14 commits into main from
Conversation
Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.
Key Changes:
1. Autonomous Monitoring Across All Agents
- Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
- Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
- Auto-analyze ALL metrics, segments, and time windows without user prompting
- Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)
2. Production Usage Monitoring Implementation
- Data-driven thresholds per segment based on 60-day volatility analysis
- 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
- Segment-specific thresholds:
* Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
* Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
* Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
* Free: -25% DAU, -35% Image Gens, -20% Tokens
- Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
- Weekend suppression for Enterprise (weekday-only alerting)
- Skip Enterprise video generation alerts (CV > 100%, too volatile)
3. Root Cause Investigation Workflow
- Enterprise segments: Drill down to organization level to identify which clients drove drops
- Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
- Alert format includes current vs baseline, drop %, threshold, and recommended actions
4. Full Segmentation Enforcement
- Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
- Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
- Consistent segmentation across all usage monitoring queries
Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
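The baseline and alert formulas above can be sketched in plain Python (the production script runs the equivalent logic in BigQuery SQL via the window function shown). `same_dow_baseline` and `is_alert` are hypothetical helper names used only for illustration:

```python
from datetime import date, timedelta

def same_dow_baseline(history, target_day, window_days=14):
    """Average of the metric on the same day-of-week within the trailing window.

    history: dict mapping date -> metric value for one segment.
    Mirrors AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
    restricted to the last 14 days.
    """
    values = [
        v for d, v in history.items()
        if d.weekday() == target_day.weekday()
        and target_day - timedelta(days=window_days) <= d < target_day
    ]
    return sum(values) / len(values) if values else None

def is_alert(today_value, baseline, threshold_pct):
    # Alert rule from this PR: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
    return baseline is not None and today_value < baseline * (1 - threshold_pct)
```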
Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration
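The two-tier severity rule can be sketched as below. `classify_drop` and the `THRESHOLDS` dict are illustrative names, with threshold values taken from the segment list above (only the DAU thresholds are shown; the Enterprise weekday-only suppression is omitted for brevity):

```python
# WARNING thresholds as fractional drops, per (segment, metric) - values from this PR
THRESHOLDS = {
    ("enterprise", "dau"): 0.50,
    ("heavy", "dau"): 0.25,
    ("paying", "dau"): 0.20,
    ("free", "dau"): 0.25,
}

def classify_drop(segment, metric, today, baseline):
    """Return None, 'WARNING', or 'CRITICAL' for today's value vs the 14-day baseline."""
    if baseline <= 0:
        return None
    drop = (baseline - today) / baseline
    threshold = THRESHOLDS.get((segment, metric))
    if threshold is None or drop <= threshold:
        return None
    # CRITICAL when the drop exceeds 1.5x the WARNING threshold
    return "CRITICAL" if drop > 1.5 * threshold else "WARNING"
```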
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday. Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
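The 3-day finalization lag from this commit can be expressed as a small helper; `analysis_date` is a hypothetical name, mirroring the `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` expression used in the BigQuery queries:

```python
from datetime import date, timedelta

def analysis_date(today=None, lag_days=3):
    """Date to analyze: cost data takes a few days to finalize, so look back.

    Mirrors DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) from the cost queries.
    """
    today = today or date.today()
    return today - timedelta(days=lag_days)
```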
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat

Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring

All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script

Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
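The statistical 3σ method from this commit could look roughly like the sketch below; `zscore_anomaly` is a hypothetical helper, not the actual `usage_monitor.py` code, and it flags deviations in either direction:

```python
from statistics import mean, stdev

def zscore_anomaly(history, today_value, sigma=3.0):
    """Return today's z-score if it deviates more than `sigma` standard
    deviations from the rolling history, else None.

    Detects spikes in both directions (increases and decreases).
    """
    if len(history) < 2:
        return None  # not enough data to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return None  # flat history: z-score undefined
    z = (today_value - mu) / sd
    return z if abs(z) > sigma else None
```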
Change focus from 'detecting drops' to 'detecting data spikes' to reflect that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)

Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details. Keep Overview focused on the solution (statistical anomaly detection) rather than the detailed problem context.
agents/monitoring/usage/SKILL.md (outdated)

> ### 2. Read Shared Knowledge
> - [ ] DAU spikes (increases or decreases) by segment (Enterprise Contract/Pilot, Heavy, Paying, Free)
> - [ ] Image generation volume changes (both increases and decreases)
> - [ ] Video generation volume changes (skip for Enterprise - too volatile)
Collaborator (Author)
remove it (skip for Enterprise - too volatile)
agents/monitoring/usage/SKILL.md (outdated)

> **For anomaly detection:**
> - Flag days with Z-score > 2 (metric deviates > 2 std devs from rolling avg)
> - Investigate root cause: product issue, marketing campaign, seasonal pattern
> Before writing SQL, read:
Collaborator (Author)
bring all the shared knowledge files
agents/monitoring/usage/SKILL.md (outdated)

> - LT team already excluded (no `is_lt_team` filter needed)
> - Enterprise patterns: Strong weekday/weekend differences, use same-DOW comparisons
>
> ### Phase 3: Write Monitoring SQL
Collaborator (Author)
put it in the file and read it from there
agents/monitoring/usage/SKILL.md (outdated)

> 3. Configure notification threshold
> 4. Route alerts to Slack channel (#product-alerts, #engineering-alerts)
> # Monitor specific date
> python3 usage_monitor.py --date 2026-03-05
Collaborator (Author)
run on yesterday
…isclosure

Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details

Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
agents/monitoring/usage/SKILL.md (outdated)

> - Calculate % change and flag significant shifts (>15-20%)
> - Check if changes are consistent across all tiers or specific to one segment
> # Monitor specific date
> python3 usage_monitor.py --date 2026-03-09
Collaborator (Author)
change it to yesterday
1. Change date example from a specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from the video generation line

The exception details are still preserved in Phase 1 where they belong.
… exceptions

- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
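The lowered 2σ threshold with the NOTICE tier could be mapped as below. The commit only names the NOTICE band explicitly, so the label for |z| > 3 is an assumption here; `severity` is an illustrative helper name:

```python
def severity(z):
    """Map a z-score to an alert tier.

    NOTICE covers 2 < |z| <= 3 per this commit; "ALERT" for |z| > 3
    is an assumed label (the commit does not name that tier).
    """
    az = abs(z)
    if az > 3:
        return "ALERT"   # assumed label for the stronger tier
    if az > 2:
        return "NOTICE"  # new tier introduced in this commit
    return None          # within normal variation
```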
This was referenced Mar 10, 2026
Daniel Beer (DanielB945) added a commit that referenced this pull request on Mar 19, 2026
Summary
- Keep only the production monitoring implementation (usage_monitor.py)

Key Changes

Usage Monitor (Statistical Anomaly Detection)
- Combined SQL + alerting in a single usage_monitor.py file

All Monitoring Skills Restructured

Files Changed (7 files)
- agents/monitoring/usage/SKILL.md (+407 lines) - Statistical method, 6-part structure
- agents/monitoring/usage/usage_monitor.py (+303 lines, NEW) - Combined SQL + alerting
- agents/monitoring/usage/investigate_root_cause.sql (+85 lines, NEW) - Root cause drill-down
- agents/monitoring/be-cost/SKILL.md (+211 lines) - Production GPU thresholds
- agents/monitoring/revenue/SKILL.md (+250 lines) - All shared files
- agents/monitoring/enterprise/SKILL.md (+310 lines) - Exact segmentation
- agents/monitoring/api-runtime/SKILL.md (+335 lines) - All shared files

Test Plan
- Run python3 usage_monitor.py --date 2026-03-05 to verify statistical alerting
- Run investigate_root_cause.sql for the Enterprise segment drill-down

Benefits

🤖 Generated with Claude Code