feat: implement production usage monitoring with data-driven thresholds#38

Open
Daniel Beer (DanielB945) wants to merge 14 commits into main from update-usage-segmentation

Conversation

Daniel Beer (DanielB945) commented Mar 8, 2026

Summary

  • Transform all monitoring agents to autonomous problem detectors following the 6-part Agent Skills specification
  • Implement statistical anomaly detection for usage monitoring using 3 standard deviations (3σ)
  • Combine SQL + Python into single self-contained monitoring script (usage_monitor.py)
  • Use last 10 same-day-of-week data points to calculate mean (μ) and standard deviation (σ)
  • Alert when |today_value - μ| > 3σ (±3σ spans 99.7% of values under a normal distribution, so only the rare 0.3% trigger alerts)
  • Two-tier severity: WARNING (|z| > 3), CRITICAL (|z| > 4.5)
  • Enterprise weekend suppression and video gen skip logic preserved
  • All monitoring skills restructured to 6-part format: Overview, Requirements, Progress Tracker, Implementation Plan, Context & References, Constraints & Done

Key Changes

Usage Monitor (Statistical Anomaly Detection)

  • Method: 3 standard deviations from last 10 same-day-of-week mean
  • Alert Logic: |z-score| > 3 triggers alert (z = (current - μ) / σ)
  • Auto-adaptive: Each segment's variance determines its own thresholds
  • Combined Script: SQL query + alerting logic in single usage_monitor.py file
  • No manual tuning: Statistical method adapts to natural variance patterns
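
The 3σ alert logic above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual `usage_monitor.py` implementation; the function name, history list, and threshold default are hypothetical.

```python
# Sketch of the 3-sigma z-score alert described above.
# Names are illustrative, not taken from usage_monitor.py.
from statistics import mean, stdev

def z_score_alert(history, today_value, z_threshold=3.0):
    """Return the z-score when today's value deviates more than
    z_threshold standard deviations from the same-day-of-week mean,
    otherwise None (no alert)."""
    mu = mean(history)       # mean of last 10 same-DOW data points
    sigma = stdev(history)   # sample standard deviation
    if sigma == 0:
        return None          # no variance -> cannot score
    z = (today_value - mu) / sigma
    return z if abs(z) > z_threshold else None

# Example: a stable series (mean 100, std dev 2) with a sudden spike
history = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98]
print(z_score_alert(history, 140))  # -> 20.0 (well beyond 3 sigma, alert)
print(z_score_alert(history, 101))  # -> None (within normal variance)
```

Because each segment's own history supplies μ and σ, volatile segments automatically get wider alert bands, which is the "no manual tuning" property claimed above.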

All Monitoring Skills Restructured

  • ✅ Usage Monitor (3σ statistical thresholds)
  • ✅ BE Cost Monitor (production GPU thresholds, 3-day lag)
  • ✅ Revenue Monitor (all shared knowledge files)
  • ✅ Enterprise Monitor (exact segmentation CTEs)
  • ✅ API Runtime Monitor (all shared files, latency/errors/throughput)

Files Changed (7 files)

  1. agents/monitoring/usage/SKILL.md (+407 lines) - Statistical method, 6-part structure
  2. agents/monitoring/usage/usage_monitor.py (+303 lines, NEW) - Combined SQL + alerting
  3. agents/monitoring/usage/investigate_root_cause.sql (+85 lines, NEW) - Root cause drill-down
  4. agents/monitoring/be-cost/SKILL.md (+211 lines) - Production GPU thresholds
  5. agents/monitoring/revenue/SKILL.md (+250 lines) - All shared files
  6. agents/monitoring/enterprise/SKILL.md (+310 lines) - Exact segmentation
  7. agents/monitoring/api-runtime/SKILL.md (+335 lines) - All shared files

Test Plan

  • Run python3 usage_monitor.py --date 2026-03-05 to verify statistical alerting
  • Verify alerts show: Current | Mean (μ) | Std Dev (σ) | Z-score | % change
  • Confirm Enterprise alerts suppressed on weekends
  • Confirm Enterprise video generation alerts skipped (CV > 100%)
  • Test investigate_root_cause.sql for Enterprise segment drill-down
  • Verify severity: WARNING (3 < |z| ≤ 4.5), CRITICAL (|z| > 4.5)

Benefits

  • Auto-adaptive thresholds: No manual tuning needed, adapts to each segment's variance
  • Cleaner codebase: Combined script (1 file vs 2 files, -58 lines)
  • Self-contained: SQL + Python in one file, no intermediate CSVs
  • Statistically sound: 3σ = 99.7% confidence, reduces false positives
  • Consistent structure: All monitoring skills follow 6-part Agent Skills spec

🤖 Generated with Claude Code

Transform all monitoring agents to autonomous problem detectors and implement
a production-ready usage monitoring system with statistically-derived alerting
thresholds based on 60-day analysis.

Key Changes:

1. Autonomous Monitoring Across All Agents
   - Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
   - Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
   - Auto-analyze ALL metrics, segments, and time windows without user prompting
   - Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)

2. Production Usage Monitoring Implementation
   - Data-driven thresholds per segment based on 60-day volatility analysis
   - 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
   - Segment-specific thresholds:
     * Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
     * Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
     * Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
     * Free: -25% DAU, -35% Image Gens, -20% Tokens
   - Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
   - Weekend suppression for Enterprise (weekday-only alerting)
   - Skip Enterprise video generation alerts (CV > 100%, too volatile)

3. Root Cause Investigation Workflow
   - Enterprise segments: Drill down to organization level to identify which clients drove drops
   - Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
   - Alert format includes current vs baseline, drop %, threshold, and recommended actions

4. Full Segmentation Enforcement
   - Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
   - Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
   - Consistent segmentation across all usage monitoring queries

Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance
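
The same-DOW baseline and threshold check described in the technical details can be mirrored in plain Python. This is a hedged sketch of the logic (the production version computes it in BigQuery with a window function); the function names and the example dates are illustrative.

```python
# Illustrative Python equivalent of the alert logic above:
# today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
from datetime import date, timedelta

def same_dow_baseline(daily_values, target_day, lookback_days=14):
    """Average of values on the same day of week within the lookback window."""
    same_dow = [
        v for d, v in daily_values.items()
        if d.weekday() == target_day.weekday()
        and target_day - timedelta(days=lookback_days) <= d < target_day
    ]
    return sum(same_dow) / len(same_dow) if same_dow else None

def is_drop_alert(today_value, baseline, threshold_pct):
    """True when today's value falls below the baseline by more than the threshold."""
    return baseline is not None and today_value < baseline * (1 - threshold_pct)

# Two prior Mondays average 100; a Monday at 70 breaches a 25% threshold.
values = {date(2026, 2, 23): 90, date(2026, 3, 2): 110}
baseline = same_dow_baseline(values, date(2026, 3, 9))
print(baseline)                             # -> 100.0
print(is_drop_alert(70, baseline, 0.25))    # -> True (70 < 75)
```

Comparing only against the same day of week is what keeps weekday/weekend cycles from tripping the threshold, per the anti-patterns section below.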

Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Daniel Beer (DanielB945) and others added 10 commits March 8, 2026 23:14
Keep only production monitoring implementation (usage_monitoring_v2.py)
Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Cost data needs time to finalize, so analyze data from 3 days ago instead of yesterday.
Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in queries.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
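A minimal Python counterpart of the `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` lag above might look like this; the helper name is hypothetical, not part of the actual scripts.

```python
# Hypothetical helper mirroring DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY):
# cost data needs time to finalize, so analyze the date lag_days ago.
from datetime import date, timedelta

def analysis_date(today=None, lag_days=3):
    """Return the target analysis date, lag_days before today."""
    return (today or date.today()) - timedelta(days=lag_days)

print(analysis_date(date(2026, 3, 8)))  # -> 2026-03-05
```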
Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…rmat

Convert all monitoring agents to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Updated skills:
- be-cost-monitoring (314 lines) - GPU cost with production thresholds
- revenue-monitor (272 lines) - Revenue/subscription monitoring
- enterprise-monitor (341 lines) - Enterprise account health
- api-runtime-monitor (359 lines) - API performance monitoring

All skills:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules into Constraints section (DO/DO NOT)
- Moved Reference Files to Context section
- Added completion criteria
- Kept under 500 lines per Agent Skills spec

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Replace separate SQL + Python files with combined usage_monitor.py
- Update from percentage thresholds to statistical 3σ method
- Embed SQL query as string in Python script
- Execute BigQuery directly (no CSV intermediate)
- Add --date parameter for flexible date selection
- Update SKILL.md to reference combined script

Benefits:
- Single file to maintain (353 lines vs 413 lines)
- No intermediate CSV files needed
- Easier to run and schedule
- Self-contained SQL + alerting logic
Change focus from 'detecting drops' to 'detecting data spikes' to reflect
that the statistical method detects anomalies in both directions:
- Increases (e.g., +46.6% token spike on 2026-03-09)
- Decreases (e.g., churn, engagement drops)

Updated:
- Overview: Problem solved now mentions both increases and decreases
- Requirements: Changed 'drops' to 'spikes (increases or decreases)'
- Description: Changed 'detecting usage drops' to 'detecting usage anomalies'
Drop paragraph about segment/day-of-week variance details.
Keep Overview focused on the solution (statistical anomaly detection)
rather than the detailed problem context.
### 2. Read Shared Knowledge
- [ ] DAU spikes (increases or decreases) by segment (Enterprise Contract/Pilot, Heavy, Paying, Free)
- [ ] Image generation volume changes (both increases and decreases)
- [ ] Video generation volume changes (skip for Enterprise - too volatile)
remove it (skip for Enterprise - too volatile)

**For anomaly detection:**
- Flag days with Z-score > 2 (metric deviates > 2 std devs from rolling avg)
- Investigate root cause: product issue, marketing campaign, seasonal pattern
Before writing SQL, read:
bring in all the shared files

- LT team already excluded (no `is_lt_team` filter needed)
- Enterprise patterns: Strong weekday/weekend differences, use same-DOW comparisons

### Phase 3: Write Monitoring SQL
put it in the file and read it from there

3. Configure notification threshold
4. Route alerts to Slack channel (#product-alerts, #engineering-alerts)
# Monitor specific date
python3 usage_monitor.py --date 2026-03-05
run on yesterday

…isclosure

Major changes:
- Reduced from 335 lines to 182 lines (-45%)
- Removed duplicate SQL query (already in usage_monitor.py)
- Removed duplicate alert format examples
- Consolidated overlapping phases (Phase 4 + 5 → Phase 4)
- Simplified DO/DO NOT section (removed repetitive rules)
- Applied progressive disclosure (method → run → analyze → present)
- Kept only essential information, reference scripts for details

Benefits:
- Clearer, more focused instructions
- Less maintenance (single source of truth in Python script)
- Easier to scan and understand
- Follows Agent Skills spec better (<500 lines, minimal duplication)
- Calculate % change and flag significant shifts (>15-20%)
- Check if changes are consistent across all tiers or specific to one segment
# Monitor specific date
python3 usage_monitor.py --date 2026-03-09
change it to yesterday

1. Change date example from specific date to 'yesterday'
2. Remove '(skip for Enterprise - too volatile)' from video generation line

The exception details are still preserved in Phase 1 where they belong.
… exceptions

- Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3)
- Compare yesterday's metrics instead of today's
- Remove enterprise video gen exceptions
- Add event-registry.yaml to shared knowledge
- Remove Phase 6, Context & References section
- Add Free segment to tier distribution checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
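
Under the revised tiers in this commit (NOTICE added at 2 < |z| ≤ 3, with the earlier WARNING/CRITICAL bands from the test plan assumed unchanged), the severity mapping could be sketched as follows; the function name is illustrative.

```python
# Sketch of the three-tier severity mapping after lowering the
# alert threshold from 3 sigma to 2 sigma. Bands assumed:
# NOTICE: 2 < |z| <= 3, WARNING: 3 < |z| <= 4.5, CRITICAL: |z| > 4.5.
def classify_severity(z):
    az = abs(z)  # anomalies in either direction count
    if az > 4.5:
        return "CRITICAL"
    if az > 3:
        return "WARNING"
    if az > 2:
        return "NOTICE"
    return None  # within normal variance, no alert

print(classify_severity(2.5))   # -> NOTICE
print(classify_severity(-3.2))  # -> WARNING
print(classify_severity(5.0))   # -> CRITICAL
```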