feat: implement production usage monitoring with data-driven thresholds#38

Open
Daniel Beer (DanielB945) wants to merge 14 commits into main from update-usage-segmentation

Conversation

Daniel Beer (DanielB945) commented Mar 8, 2026

Summary

  • Transform all monitoring agents to autonomous problem detectors following the 6-part Agent Skills specification
  • Implement statistical anomaly detection for usage monitoring using 3 standard deviations (3σ)
  • Combine SQL + Python into single self-contained monitoring script (usage_monitor.py)
  • Use last 10 same-day-of-week data points to calculate mean (μ) and standard deviation (σ)
  • Alert when |today_value - μ| > 3σ (±3σ spans 99.7% of values under a normal distribution, so only the rare 0.3% trigger alerts)
  • Two-tier severity: WARNING (|z| > 3), CRITICAL (|z| > 4.5)
  • Enterprise weekend suppression and video gen skip logic preserved
  • All monitoring skills restructured to 6-part format: Overview, Requirements, Progress Tracker, Implementation Plan, Context & References, Constraints & Done

Key Changes

Usage Monitor (Statistical Anomaly Detection)

  • Method: 3 standard deviations from last 10 same-day-of-week mean
  • Alert Logic: |z-score| > 3 triggers alert (z = (current - μ) / σ)
  • Auto-adaptive: Each segment's variance determines its own thresholds
  • Combined Script: SQL query + alerting logic in single usage_monitor.py file
  • No manual tuning: Statistical method adapts to natural variance patterns
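
The 3σ alert logic above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual `usage_monitor.py` implementation; the function name, history list, and threshold default are hypothetical.

```python
# Sketch of the 3-sigma z-score alert described above.
# Names are illustrative, not taken from usage_monitor.py.
from statistics import mean, stdev

def z_score_alert(history, today_value, z_threshold=3.0):
    """Return the z-score when today's value deviates more than
    z_threshold standard deviations from the same-day-of-week mean,
    otherwise None (no alert)."""
    mu = mean(history)       # mean of last 10 same-DOW data points
    sigma = stdev(history)   # sample standard deviation
    if sigma == 0:
        return None          # no variance -> cannot score
    z = (today_value - mu) / sigma
    return z if abs(z) > z_threshold else None

# Example: a stable series (mean 100, std dev 2) with a sudden spike
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
print(z_score_alert(history, 140))  # -> 20.0 (well beyond 3 sigma, alert)
print(z_score_alert(history, 101))  # -> None (within normal variance)
```

Because each segment's own history supplies μ and σ, volatile segments automatically get wider alert bands, which is the "no manual tuning" property claimed above.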

All Monitoring Skills Restructured

  • ✅ Usage Monitor (3σ statistical thresholds)
  • ✅ BE Cost Monitor (production GPU thresholds, 3-day lag)
  • ✅ Revenue Monitor (all shared knowledge files)
  • ✅ Enterprise Monitor (exact segmentation CTEs)
  • ✅ API Runtime Monitor (all shared files, latency/errors/throughput)

Files Changed (7 files)

  1. agents/monitoring/usage/SKILL.md (+407 lines) - Statistical method, 6-part structure
  2. agents/monitoring/usage/usage_monitor.py (+303 lines, NEW) - Combined SQL + alerting
  3. agents/monitoring/usage/investigate_root_cause.sql (+85 lines, NEW) - Root cause drill-down
  4. agents/monitoring/be-cost/SKILL.md (+211 lines) - Production GPU thresholds
  5. agents/monitoring/revenue/SKILL.md (+250 lines) - All shared files
  6. agents/monitoring/enterprise/SKILL.md (+310 lines) - Exact segmentation
  7. agents/monitoring/api-runtime/SKILL.md (+335 lines) - All shared files

Test Plan

  • Run python3 usage_monitor.py --date 2026-03-05 to verify statistical alerting
  • Verify alerts show: Current | Mean (μ) | Std Dev (σ) | Z-score | % change
  • Confirm Enterprise alerts suppressed on weekends
  • Confirm Enterprise video generation alerts skipped (CV > 100%)
  • Test investigate_root_cause.sql for Enterprise segment drill-down
  • Verify severity: WARNING (3 < |z| ≤ 4.5), CRITICAL (|z| > 4.5)

Benefits

  • Auto-adaptive thresholds: No manual tuning needed, adapts to each segment's variance
  • Cleaner codebase: Combined script (1 file vs 2 files, -58 lines)
  • Self-contained: SQL + Python in one file, no intermediate CSVs
  • Statistically sound: 3σ = 99.7% confidence, reduces false positives
  • Consistent structure: All monitoring skills follow 6-part Agent Skills spec

🤖 Generated with Claude Code

Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.

Key Changes:

1. Autonomous Monitoring Across All Agents
   - Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
   - Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
   - Auto-analyze ALL metrics, segments, and time windows without user prompting
   - Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)

2. Production Usage Monitoring Implementation
   - Data-driven thresholds per segment based on 60-day volatility analysis
   - 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
   - Segment-specific thresholds:
     * Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
     * Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
     * Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
     * Free: -25% DAU, -35% Image Gens, -20% Tokens
   - Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
   - Weekend suppression for Enterprise (weekday-only alerting)
   - Skip Enterprise video generation alerts (CV > 100%, too volatile)

3. Root Cause Investigation Workflow
   - Enterprise segments: Drill down to organization level to identify which clients drove drops
   - Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
   - Alert format includes current vs baseline, drop %, threshold, and recommended actions

4. Full Segmentation Enforcement
   - Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
   - Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
   - Consistent segmentation across all usage monitoring queries

Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
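
The same-DOW baseline and threshold check described in the technical details can be mirrored in plain Python. This is a hedged sketch of the logic (the production version computes it in BigQuery with a window function); the function names and the example dates are illustrative.

```python
# Illustrative Python equivalent of the alert logic above:
# today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
from datetime import date, timedelta

def same_dow_baseline(daily_values, target_day, lookback_days=14):
    """Average of values on the same day of week within the lookback window."""
    same_dow = [
        v for d, v in daily_values.items()
        if d.weekday() == target_day.weekday()
        and target_day - timedelta(days=lookback_days) <= d < target_day
    ]
    return sum(same_dow) / len(same_dow) if same_dow else None

def is_drop_alert(today_value, baseline, threshold_pct):
    """True when today's value falls below the baseline by more than the threshold."""
    return baseline is not None and today_value < baseline * (1 - threshold_pct)

# Two prior Mondays average 100; a Monday at 70 breaches a 25% threshold.
values = {date(2026, 2, 23): 90, date(2026, 3, 2): 110}
baseline = same_dow_baseline(values, date(2026, 3, 9))
print(baseline)                             # -> 100.0
print(is_drop_alert(70, baseline, 0.25))    # -> True (70 < 75)
```

Comparing only against the same day of week is what keeps weekday/weekend cycles from tripping the threshold, per the anti-patterns section below.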

Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Daniel Beer (DanielB945) and others added 10 commits March 8, 2026 23:14
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday.
Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
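A minimal Python counterpart of the `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` lag above might look like this; the helper name is hypothetical, not part of the actual scripts.

```python
# Hypothetical helper mirroring DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY):
# cost data needs time to finalize, so analyze the date lag_days ago.
from datetime import date, timedelta

def analysis_date(today=None, lag_days=3):
    """Return the target analysis date, lag_days before today."""
    return (today or date.today()) - timedelta(days=lag_days)

print(analysis_date(date(2026, 3, 8)))  # -> 2026-03-05
```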
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat

Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring

All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script

Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
Change focus from 'detecting drops' to 'detecting data spikes' to reflect
that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)

Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details.
Keep Overview focused on the solution (statistical anomaly detection)
rather than the detailed problem context.
### 2. Read Shared Knowledge
- [ ] DAU spikes (increases or decreases) by segment (Enterprise Contract/Pilot, Heavy, Paying, Free)
- [ ] Image generation volume changes (both increases and decreases)
- [ ] Video generation volume changes (skip for Enterprise - too volatile)
remove it (skip for Enterprise - too volatile)

**For anomaly detection:**
- Flag days with Z-score > 2 (metric deviates > 2 std devs from rolling avg)
- Investigate root cause: product issue, marketing campaign, seasonal pattern
Before writing SQL, read:
bring in all the shared files

- LT team already excluded (no `is_lt_team` filter needed)
- Enterprise patterns: Strong weekday/weekend differences, use same-DOW comparisons

### Phase 3: Write Monitoring SQL
put it in the file and read it from there

3. Configure notification threshold
4. Route alerts to Slack channel (#product-alerts, #engineering-alerts)
# Monitor specific date
python3 usage_monitor.py --date 2026-03-05
run on yesterday

…isclosure

Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details

Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
- Calculate % change and flag significant shifts (>15-20%)
- Check if changes are consistent across all tiers or specific to one segment
# Monitor specific date
python3 usage_monitor.py --date 2026-03-09
change it to yesterday

1. Change date example from specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from video generation line

The exception details are still preserved in Phase 1 where they belong.
… exceptions

- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
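
Under the revised tiers in this commit (NOTICE added at 2 < |z| ≤ 3, with the earlier WARNING/CRITICAL bands from the test plan assumed unchanged), the severity mapping could be sketched as follows; the function name is illustrative.

```python
# Sketch of the three-tier severity mapping after lowering the
# alert threshold from 3 sigma to 2 sigma. Bands assumed:
# NOTICE: 2 < |z| <= 3, WARNING: 3 < |z| <= 4.5, CRITICAL: |z| > 4.5.
def classify_severity(z):
    az = abs(z)  # anomalies in either direction count
    if az > 4.5:
        return "CRITICAL"
    if az > 3:
        return "WARNING"
    if az > 2:
        return "NOTICE"
    return None  # within normal variance, no alert

print(classify_severity(2.5))   # -> NOTICE
print(classify_severity(-3.2))  # -> WARNING
print(classify_severity(5.0))   # -> CRITICAL
```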