Intelligence Pipeline
This document explains how SentraCore transforms raw system telemetry into actionable, explainable intelligence across its processing stages.
Raw telemetry alone is insufficient for meaningful system monitoring. A CPU reading of 95% could be normal during a scheduled batch job, or it could indicate a runaway process. SentraCore resolves this ambiguity through a sequential intelligence pipeline that layers context, statistics, and forecasting on top of raw data.
SystemCollector
→ Normalizer (EMA smoothing, spike detection)
→ TimeSeriesBuffer (ring buffers for historical context)
→ BaselineModel (per-segment adaptive learning)
→ TrendAnalyzer (linear regression, slope, volatility)
→ AnomalyDetector (Z-score deviation from baseline)
→ StressEngine (multi-state composite score)
→ PredictionEngine (ETA forecasting, risk scoring)
→ StabilityCalculator (global health index)
→ CorrelationEngine (root cause analysis, on alert)
→ AlertManager (threshold evaluation, RCA attachment)
Module: engine/collector/system_collector.py
The SystemCollector uses psutil to sample system telemetry at a configurable interval (default: 2 seconds). Each sample is captured as a SystemSnapshot containing:
- CPU percent (per-core and total)
- Memory usage (used, available, total, swap)
- Disk I/O rates (read/write bytes per second, read/write operations per second)
- A UNIX timestamp
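One detail worth illustrating: psutil's `disk_io_counters()` reports cumulative totals, so per-second disk rates have to be derived by differencing two consecutive samples. The sketch below shows that step only. `DiskCounters`, `disk_rates`, and `read_counters` are hypothetical names chosen for illustration, not SentraCore's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskCounters:
    """Cumulative disk totals, mirroring fields of psutil.disk_io_counters()."""
    read_bytes: int
    write_bytes: int
    read_count: int
    write_count: int


def disk_rates(prev: DiskCounters, curr: DiskCounters, elapsed_s: float) -> dict:
    """Convert two cumulative readings into per-second rates."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        "read_bytes_per_sec": (curr.read_bytes - prev.read_bytes) / elapsed_s,
        "write_bytes_per_sec": (curr.write_bytes - prev.write_bytes) / elapsed_s,
        "read_ops_per_sec": (curr.read_count - prev.read_count) / elapsed_s,
        "write_ops_per_sec": (curr.write_count - prev.write_count) / elapsed_s,
    }


def read_counters() -> DiskCounters:
    """Read live counters. psutil is imported lazily so the rate math runs without it."""
    import psutil  # third-party dependency named in the text above

    io = psutil.disk_io_counters()
    if io is None:  # psutil returns None when no disks are reported
        raise RuntimeError("no disk counters available")
    return DiskCounters(io.read_bytes, io.write_bytes, io.read_count, io.write_count)
```

With the default 2-second interval, `elapsed_s` would normally be close to 2.0, but measuring it from the sample timestamps is safer than assuming it.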
Module: engine/normalization/normalizer.py
Raw telemetry can be noisy and inconsistent. The Normalizer applies an Exponential Moving Average (EMA) to each metric, significantly reducing the impact of instantaneous spikes on downstream analysis. It also independently detects spikes by comparing the raw value against a rolling average.
Key outputs per cycle:
- cpu_percent_smoothed, memory_percent_smoothed, disk_total_ops_per_sec
- cpu_is_spiking, memory_is_spiking, disk_is_spiking
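The smoothing and spike check described above can be sketched for a single metric as follows. The class name, the smoothing factor `alpha`, the rolling window size, and the `spike_ratio` rule are all assumptions made for illustration. The text does not state the values or the exact comparison SentraCore uses.

```python
from collections import deque


class MetricNormalizer:
    """EMA smoothing plus spike detection for one metric (illustrative only)."""

    def __init__(self, alpha: float = 0.3, window: int = 10, spike_ratio: float = 1.5):
        self.alpha = alpha              # assumed smoothing factor
        self.spike_ratio = spike_ratio  # assumed rule: raw > ratio * rolling mean
        self.recent = deque(maxlen=window)
        self.ema = None

    def update(self, raw: float) -> tuple[float, bool]:
        # Compare the raw value against the rolling average of earlier raw values.
        if self.recent:
            rolling = sum(self.recent) / len(self.recent)
            is_spiking = rolling > 0 and raw > self.spike_ratio * rolling
        else:
            is_spiking = False
        self.recent.append(raw)
        # Seed the EMA with the first sample, then blend new samples in.
        if self.ema is None:
            self.ema = raw
        else:
            self.ema = self.alpha * raw + (1 - self.alpha) * self.ema
        return self.ema, is_spiking
```

Note that the spike flag is computed from the raw value, independently of the EMA, so a spike is still reported even though its effect on the smoothed series is damped.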
Module: engine/buffer/time_series_buffer.py
Normalized snapshots are pushed into two ring buffers:
- Short-window buffer: Last ~60 seconds of data, used for trend analysis.
- Long-window buffer: Last ~30 minutes of data, used for baseline learning.
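A minimal sketch of the two ring buffers, assuming they are sized from the default 2-second sampling interval and backed by `collections.deque`. The helper name and the choice of `deque` are illustrative assumptions.

```python
from collections import deque

SAMPLE_INTERVAL_S = 2  # default collection interval from the text


def make_ring_buffer(window_seconds: int, interval_s: int = SAMPLE_INTERVAL_S) -> deque:
    """Bounded deque: once full, each append drops the oldest sample."""
    return deque(maxlen=window_seconds // interval_s)


short_window = make_ring_buffer(60)       # 30 samples, about 60 seconds
long_window = make_ring_buffer(30 * 60)   # 900 samples, about 30 minutes

# Simulate 1000 collection cycles; both buffers stay bounded.
for i in range(1000):
    short_window.append(i)
    long_window.append(i)
```

A bounded deque keeps memory use constant regardless of uptime, which is the main reason to prefer a ring buffer over an ever-growing list here.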
Module: engine/baseline/baseline_model.py
SentraCore does not use static, hardcoded alert thresholds. Instead, the BaselineModel continuously learns what is normal for the specific machine it is running on. It segments the day into four time-of-day windows (Night, Morning, Afternoon, Evening) and maintains a running mean and standard deviation per metric per segment.
The model requires a minimum number of samples before it is considered "ready" (configurable via BASELINE_MIN_SAMPLES).
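One way to maintain a running mean and standard deviation without storing every sample is Welford's online algorithm, sketched below. This is an assumed technique, not confirmed as what BaselineModel uses. The six-hour segment boundaries and the `BASELINE_MIN_SAMPLES` value are placeholders, since the text does not state them.

```python
import math
from collections import defaultdict

BASELINE_MIN_SAMPLES = 30  # placeholder; the real default is not stated


def segment_for_hour(hour: int) -> str:
    """Map an hour (0-23) to a segment, assuming equal six-hour windows."""
    return ("night", "morning", "afternoon", "evening")[hour // 6]


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    @property
    def ready(self) -> bool:
        return self.n >= BASELINE_MIN_SAMPLES


# One RunningStats instance per (segment, metric) pair.
baseline = defaultdict(RunningStats)
baseline[(segment_for_hour(14), "cpu")].update(42.0)
```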
Module: engine/intelligence/trend_analyzer.py
The TrendAnalyzer performs linear regression over the short-window buffer to compute:
- CPU Slope: Rate of CPU growth in percent per second.
- Memory Slope: Rate of memory growth in percent per second — a positive, sustained slope can indicate a memory leak.
- Volatility: Short-term standard deviation of CPU and Memory, measuring system instability.
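The slope and volatility computations above can be sketched with an ordinary least-squares fit over (timestamp, value) pairs. The function names are hypothetical, and the use of the sample standard deviation for volatility is an assumption.

```python
from statistics import stdev


def slope_per_second(timestamps, values) -> float:
    """Ordinary least-squares slope of values against time (units per second)."""
    n = len(values)
    if n < 2:
        return 0.0
    t_mean = sum(timestamps) / n
    v_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in timestamps)
    if denom == 0:  # all timestamps identical; slope is undefined
        return 0.0
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(timestamps, values))
    return num / denom


def volatility(values) -> float:
    """Short-term standard deviation of a metric over the window."""
    return stdev(values) if len(values) > 1 else 0.0
```

For example, memory readings of 50, 51, 52, 53, 54 percent taken 2 seconds apart give a slope of 0.5 percent per second.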
Module: engine/intelligence/anomaly_detector.py
The AnomalyDetector calculates a Z-Score for each metric by comparing the current smoothed value against the baseline mean and standard deviation for the active time-of-day segment:
Z = (Current Value - Baseline Mean) / Baseline Standard Deviation
A high Z-Score indicates that the system is behaving significantly differently from its learned normal. Anomalies must be sustained over multiple consecutive cycles before being classified as elevated or severe, preventing transient micro-spikes from creating false alerts.
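The sustained-anomaly rule can be sketched with a streak counter. The thresholds (2.0 and 3.0), the three-cycle requirement, the use of the absolute Z-Score, and the class name are assumptions for illustration. The text does not specify them.

```python
class SustainedAnomalyDetector:
    """Classify a metric only after its Z-Score stays high for several cycles."""

    def __init__(self, elevated_z: float = 2.0, severe_z: float = 3.0, min_cycles: int = 3):
        self.elevated_z = elevated_z  # assumed threshold
        self.severe_z = severe_z      # assumed threshold
        self.min_cycles = min_cycles  # assumed number of consecutive cycles
        self._streak = 0

    def classify(self, value: float, mean: float, std: float) -> str:
        # Guard against a zero standard deviation (e.g. a flat baseline).
        z = 0.0 if std == 0 else (value - mean) / std
        if abs(z) >= self.elevated_z:
            self._streak += 1
        else:
            self._streak = 0  # any normal cycle resets the streak
        if self._streak < self.min_cycles:
            return "normal"
        return "severe" if abs(z) >= self.severe_z else "elevated"
```

In this sketch a single outlier followed by a normal reading never leaves the "normal" state, which is the false-alert suppression the paragraph describes.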
Module: engine/stress/stress_engine.py
The StressEngine consolidates all upstream analysis into a single Stress Score (0–100). Unlike naive threshold approaches, the score accounts for three dimensions simultaneously:
- Instantaneous Pressure: The raw resource utilization of CPU, Memory, and Disk, weighted by adaptive per-machine weights.
- Trend Modifier: Growing slopes (e.g., memory expanding at 0.5% per second) add a forward-looking penalty.
- Anomaly Modifier: A sustained, statistically significant anomaly score multiplies the base pressure.
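To make the three dimensions concrete, here is one possible way to combine them. The formula, the gain constants, and the assumption that the anomaly score is normalised to 0..1 are all illustrative. The text states only that the trend adds a penalty and the anomaly multiplies the base pressure.

```python
def stress_score(cpu: float, memory: float, disk: float, weights: dict,
                 cpu_slope: float = 0.0, memory_slope: float = 0.0,
                 anomaly_score: float = 0.0,
                 trend_gain: float = 10.0, anomaly_gain: float = 0.5) -> float:
    """Composite 0-100 stress score (hypothetical formula).

    cpu, memory, disk: utilisation in percent.
    weights: per-machine weights keyed by "cpu", "memory", "disk".
    anomaly_score: assumed to be normalised to the range 0..1.
    """
    total = sum(weights.values())
    # Instantaneous pressure: weighted average of raw utilisation.
    pressure = (weights["cpu"] * cpu + weights["memory"] * memory
                + weights["disk"] * disk) / total
    # Trend modifier: only growing slopes add a forward-looking penalty.
    trend_penalty = trend_gain * (max(cpu_slope, 0.0) + max(memory_slope, 0.0))
    # Anomaly modifier: multiplies the base pressure.
    multiplier = 1.0 + anomaly_gain * anomaly_score
    return max(0.0, min(100.0, pressure * multiplier + trend_penalty))
```

With weights of 0.5, 0.3, and 0.2 and utilisation of 60, 50, and 20 percent, the base pressure is 49. A memory slope of 0.5 percent per second raises the score to 54 under these assumed gains.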
Module: engine/intelligence/prediction_engine.py
The PredictionEngine uses the smoothed trend slopes from Stage 5 to forecast when critical resource thresholds will be breached:
- Time-to-Exhaustion (ETA): Calculates how many seconds until Memory hits 98% or CPU hits 95%, given current trend slopes.
- EMA Smoothing: Slopes are themselves smoothed via an EMA to prevent chaotic, rapidly-changing ETA values.
- Risk Score (0–100%): A probabilistic assessment of the likelihood of severe degradation within the next 5 minutes, calculated from ETA proximity and raw usage levels.
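The ETA calculation reduces to dividing the remaining headroom by the smoothed slope. The sketch below uses the 98% and 95% limits from the text. The function and class names and the EMA factor are assumptions, and the risk score is left out because its formula is not given.

```python
import math

MEMORY_LIMIT = 98.0  # percent, from the text
CPU_LIMIT = 95.0     # percent, from the text


def eta_seconds(current: float, slope_per_s: float, limit: float) -> float:
    """Seconds until `current` reaches `limit` at the given slope."""
    if current >= limit:
        return 0.0       # already at or past the threshold
    if slope_per_s <= 0:
        return math.inf  # flat or falling: no breach forecast
    return (limit - current) / slope_per_s


class SlopeSmoother:
    """EMA over successive slope estimates to keep ETA values stable."""

    def __init__(self, alpha: float = 0.2):  # assumed smoothing factor
        self.alpha = alpha
        self.value = None

    def update(self, slope: float) -> float:
        if self.value is None:
            self.value = slope
        else:
            self.value = self.alpha * slope + (1 - self.alpha) * self.value
        return self.value
```

For example, memory at 88% growing at 0.5 percent per second gives an ETA of 20 seconds to the 98% limit.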
Module: engine/intelligence/stability_index.py
The StabilityCalculator synthesises all upstream signals into a single, holistic System Stability Index (1–100), where 100 represents perfect health.
The index is a weighted composite:
- 50% Instantaneous Stress Score
- 30% Predictive Risk Score
- 20% Anomaly Score
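Because 100 represents perfect health while the three inputs grow with degradation, one plausible reading is that the weighted composite is subtracted from 100. The sketch below follows that interpretation. The inversion, the 0..100 scale of each input, and the clamping to 1..100 are assumptions.

```python
def stability_index(stress: float, risk: float, anomaly: float) -> float:
    """System Stability Index on a 1-100 scale (assumed formula).

    All three inputs are assumed to be on a 0-100 scale where higher is worse.
    """
    penalty = 0.5 * stress + 0.3 * risk + 0.2 * anomaly
    # Invert so that 100 is perfect health, then clamp to the documented range.
    return max(1.0, min(100.0, 100.0 - penalty))
```

Under this reading, a stress score of 40, a risk score of 20, and an anomaly score of 10 yield an index of 72.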
Module: engine/intelligence/correlation_engine.py
The CorrelationEngine is invoked lazily — only when the AlertManager determines that an alert should fire. It performs a three-dimensional cross-reference to generate a RootCauseAnalysis report:
- Bottleneck Identification: Determines whether CPU, Memory, or Disk is the primary stressor.
- Suspect Identification: Cross-references the bottleneck against the ProcessTracker to identify the top-impact process.
- Trigger Identification: Cross-references the timeline against the EventLogger to find the system event that most plausibly caused the degradation.
The resulting RCA is attached directly to the fired Alert and broadcast via the WebSocket to the dashboard.
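The three cross-references can be sketched as a single function. The data shapes (plain dicts for processes and events), the `lookback_s` window, and the "most recent event before the alert" heuristic are stand-ins chosen for illustration. The real ProcessTracker and EventLogger interfaces are not described in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RootCauseAnalysis:
    bottleneck: str
    suspect: Optional[str]
    trigger: Optional[str]


def analyze(pressures: dict, processes: list, events: list,
            alert_ts: float, lookback_s: float = 120.0) -> RootCauseAnalysis:
    """Build an RCA report from hypothetical tracker and logger data."""
    # 1. Bottleneck: the resource under the highest pressure.
    bottleneck = max(pressures, key=pressures.get)
    # 2. Suspect: the process consuming the most of that resource.
    suspect = max(processes, key=lambda p: p.get(bottleneck, 0.0), default=None)
    # 3. Trigger: the most recent event inside the lookback window
    #    (a simple stand-in for "most plausible cause").
    recent = [e for e in events if alert_ts - lookback_s <= e["ts"] <= alert_ts]
    trigger = max(recent, key=lambda e: e["ts"], default=None)
    return RootCauseAnalysis(
        bottleneck=bottleneck,
        suspect=suspect["name"] if suspect else None,
        trigger=trigger["name"] if trigger else None,
    )
```

Running this only when an alert fires, as the text describes, keeps the process and event lookups out of the regular collection cycle.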