Intelligence Pipeline

Asiedu Minta Kwaku edited this page May 3, 2026 · 1 revision

This document explains how SentraCore transforms raw system telemetry into actionable, explainable intelligence across its ten processing stages.


Overview

Raw telemetry alone is insufficient for meaningful system monitoring. A CPU reading of 95% could be normal during a scheduled batch job, or it could indicate a runaway process. SentraCore resolves this ambiguity through a sequential intelligence pipeline that layers context, statistics, and forecasting on top of raw data.

SystemCollector
    → Normalizer (EMA smoothing, spike detection)
    → TimeSeriesBuffer (ring buffers for historical context)
    → BaselineModel (per-segment adaptive learning)
    → TrendAnalyzer (linear regression, slope, volatility)
    → AnomalyDetector (Z-score deviation from baseline)
    → StressEngine (multi-state composite score)
    → PredictionEngine (ETA forecasting, risk scoring)
    → StabilityCalculator (global health index)
    → CorrelationEngine (root cause analysis, on alert)
    → AlertManager (threshold evaluation, RCA attachment)

Stage 1: System Collection

Module: engine/collector/system_collector.py

The SystemCollector uses psutil to sample system telemetry at a configurable interval (default: 2 seconds). Each sample is captured as a SystemSnapshot containing:

  • CPU percent (per-core and total)
  • Memory usage (used, available, total, swap)
  • Disk I/O rates (read/write bytes per second, read/write operations per second)
  • A UNIX timestamp
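psutil reports disk I/O as cumulative counters (for example `read_bytes` and `write_count` from `psutil.disk_io_counters()`), so per-second rates have to be derived from two successive samples. The following is a minimal sketch of that step using only the standard library; `DiskCounters` and `disk_rates` are hypothetical names for illustration and are not taken from SentraCore's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskCounters:
    """Cumulative counters, mirroring fields of psutil.disk_io_counters()."""
    read_bytes: int
    write_bytes: int
    read_count: int
    write_count: int


def disk_rates(prev: DiskCounters, curr: DiskCounters, elapsed_s: float) -> dict:
    """Turn two cumulative readings into per-second rates (hypothetical helper)."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        "read_bytes_per_sec": (curr.read_bytes - prev.read_bytes) / elapsed_s,
        "write_bytes_per_sec": (curr.write_bytes - prev.write_bytes) / elapsed_s,
        "read_ops_per_sec": (curr.read_count - prev.read_count) / elapsed_s,
        "write_ops_per_sec": (curr.write_count - prev.write_count) / elapsed_s,
    }


# Two readings taken 2 seconds apart (the default sampling interval).
prev = DiskCounters(read_bytes=1_000_000, write_bytes=500_000, read_count=100, write_count=50)
curr = DiskCounters(read_bytes=1_400_000, write_bytes=600_000, read_count=140, write_count=60)
rates = disk_rates(prev, curr, elapsed_s=2.0)
```

In a live collector the two readings would come from consecutive psutil calls, with `elapsed_s` taken from the difference between their timestamps.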

Stage 2: Normalization

Module: engine/normalization/normalizer.py

Raw telemetry can be noisy and inconsistent. The Normalizer applies an Exponential Moving Average (EMA) to each metric, significantly reducing the impact of instantaneous spikes on downstream analysis. It also independently detects spikes by comparing the raw value against a rolling average.

Key outputs per cycle:

  • cpu_percent_smoothed, memory_percent_smoothed
  • disk_total_ops_per_sec
  • cpu_is_spiking, memory_is_spiking, disk_is_spiking
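The smoothing and spike check described above can be sketched as follows. This is an illustration under stated assumptions, not the Normalizer's actual code: the class name `MetricNormalizer`, the `alpha`, `window`, and `spike_ratio` parameters, and the rule "raw value exceeds the rolling average by a fixed ratio" are all hypothetical.

```python
from collections import deque


class MetricNormalizer:
    """EMA smoothing plus spike detection for one metric (illustrative sketch)."""

    def __init__(self, alpha=0.3, window=10, spike_ratio=1.5):
        self.alpha = alpha              # EMA weight given to the newest sample
        self.spike_ratio = spike_ratio  # assumed spike rule: raw > ratio * rolling average
        self.ema = None
        self.recent = deque(maxlen=window)  # raw samples for the rolling average

    def update(self, raw):
        # Spike check uses the rolling average of the samples seen before this one.
        is_spiking = False
        if self.recent:
            rolling_avg = sum(self.recent) / len(self.recent)
            is_spiking = rolling_avg > 0 and raw > rolling_avg * self.spike_ratio
        self.recent.append(raw)

        # Standard EMA update; the first sample seeds the average.
        if self.ema is None:
            self.ema = raw
        else:
            self.ema = self.alpha * raw + (1 - self.alpha) * self.ema
        return self.ema, is_spiking


norm = MetricNormalizer(alpha=0.5, window=3, spike_ratio=1.5)
results = [norm.update(v) for v in [20.0, 20.0, 20.0, 80.0]]
```

With these inputs the final reading of 80 is flagged as a spike while the smoothed value only rises to 50, which shows how the two outputs carry different information downstream.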

Stage 3: Time-Series Buffering

Module: engine/buffer/time_series_buffer.py

Normalized snapshots are pushed into two ring buffers:

  • Short-window buffer: Last ~60 seconds of data, used for trend analysis.
  • Long-window buffer: Last ~30 minutes of data, used for baseline learning.
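A ring buffer of this kind can be sketched with `collections.deque` and a `maxlen`. The sizes below assume the default 2-second sampling interval, which gives roughly 30 samples for the short window and 900 for the long one; the class body is a hypothetical illustration rather than the project's implementation.

```python
from collections import deque

SAMPLE_INTERVAL_S = 2        # default collector interval from Stage 1
SHORT_WINDOW_S = 60          # ~60 seconds for trend analysis
LONG_WINDOW_S = 30 * 60      # ~30 minutes for baseline learning


class TimeSeriesBufferSketch:
    """Two fixed-size ring buffers fed by the same stream (illustrative)."""

    def __init__(self, interval_s=SAMPLE_INTERVAL_S):
        # deque(maxlen=n) silently discards the oldest item once full.
        self.short = deque(maxlen=SHORT_WINDOW_S // interval_s)
        self.long = deque(maxlen=LONG_WINDOW_S // interval_s)

    def push(self, sample):
        self.short.append(sample)
        self.long.append(sample)


buf = TimeSeriesBufferSketch()
for i in range(1000):
    buf.push(i)
```

After 1000 pushes, the short buffer holds only the 30 most recent samples and the long buffer the 900 most recent, with no explicit eviction logic.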

Stage 4: Baseline Learning

Module: engine/baseline/baseline_model.py

SentraCore does not use static, hardcoded alert thresholds. Instead, the BaselineModel continuously learns what is normal for the specific machine it is running on. It segments the day into four time-of-day windows (Night, Morning, Afternoon, Evening) and maintains a running mean and standard deviation per metric per segment.

The model requires a minimum number of samples before it is considered "ready" (configurable via BASELINE_MIN_SAMPLES).
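One way to keep a running mean and standard deviation per segment without storing every sample is Welford's online algorithm, sketched below. The hour boundaries for the four segments and the value of `BASELINE_MIN_SAMPLES` are assumptions made for this example; the source does not state them, and the BaselineModel may use a different update rule.

```python
import math
from collections import defaultdict

BASELINE_MIN_SAMPLES = 30  # assumed value; the real default is not documented here


def segment_for_hour(hour):
    """Map an hour (0-23) to a segment. These boundaries are assumptions."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    @property
    def ready(self):
        return self.n >= BASELINE_MIN_SAMPLES


# One RunningStats per (segment, metric) pair.
baseline = defaultdict(RunningStats)
for value in [10.0, 12.0, 14.0]:
    baseline[(segment_for_hour(9), "cpu")].add(value)
stats = baseline[("morning", "cpu")]
```

Here `stats.ready` stays false until the segment has collected enough samples, which is the point at which downstream stages could start trusting its mean and standard deviation.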


Stage 5: Trend Analysis

Module: engine/intelligence/trend_analyzer.py

The TrendAnalyzer performs linear regression over the short-window buffer to compute:

  • CPU Slope: Rate of CPU growth in percent per second.
  • Memory Slope: Rate of memory growth in percent per second — a positive, sustained slope can indicate a memory leak.
  • Volatility: Short-term standard deviation of CPU and Memory, measuring system instability.
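The slope and volatility metrics above can be sketched with an ordinary least-squares fit over (timestamp, value) pairs. The function names are hypothetical, and using the population standard deviation for volatility is an assumption of this example.

```python
import statistics


def slope_per_second(timestamps, values):
    """Ordinary least-squares slope of values against time, in units per second."""
    n = len(values)
    if n < 2:
        return 0.0
    t_mean = sum(timestamps) / n
    v_mean = sum(values) / n
    denom = sum((t - t_mean) ** 2 for t in timestamps)
    if denom == 0:
        return 0.0  # all timestamps identical, slope undefined
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(timestamps, values))
    return num / denom


def volatility(values):
    """Short-term standard deviation (population form, an assumption here)."""
    return statistics.pstdev(values) if len(values) > 1 else 0.0


# Memory climbing one percentage point every 2-second sample.
ts = [0, 2, 4, 6, 8]
mem = [40.0, 41.0, 42.0, 43.0, 44.0]
mem_slope = slope_per_second(ts, mem)
mem_volatility = volatility(mem)
```

For this series the slope works out to 0.5 percent per second, the kind of sustained positive value that would point toward a possible memory leak.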

Stage 6: Anomaly Detection

Module: engine/intelligence/anomaly_detector.py

The AnomalyDetector calculates a Z-Score for each metric by comparing the current smoothed value against the baseline mean and standard deviation for the active time-of-day segment:

Z = (Current Value - Baseline Mean) / Baseline Standard Deviation

A high Z-Score indicates that the system is behaving significantly differently from its learned normal. Anomalies must be sustained over multiple consecutive cycles before being classified as elevated or severe, preventing transient micro-spikes from creating false alerts.
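The Z-score formula and the "sustained over consecutive cycles" rule can be sketched as below. The thresholds (2.0 and 3.0), the three-cycle requirement, and the use of the absolute Z-score are assumptions chosen for illustration; the source does not give the values the AnomalyDetector actually uses.

```python
class SustainedAnomaly:
    """Z-score classification that requires consecutive anomalous cycles (sketch)."""

    def __init__(self, elevated_z=2.0, severe_z=3.0, min_cycles=3):
        self.elevated_z = elevated_z  # assumed threshold
        self.severe_z = severe_z      # assumed threshold
        self.min_cycles = min_cycles  # assumed number of consecutive cycles
        self.streak = 0

    def classify(self, current, mean, std):
        # Z = (current - baseline mean) / baseline standard deviation
        z = 0.0 if std == 0 else (current - mean) / std

        # Count consecutive cycles at or above the elevated threshold.
        if abs(z) >= self.elevated_z:
            self.streak += 1
        else:
            self.streak = 0

        if self.streak < self.min_cycles:
            label = "normal"
        elif abs(z) >= self.severe_z:
            label = "severe"
        else:
            label = "elevated"
        return z, label


detector = SustainedAnomaly()
readings = [50.0, 75.0, 75.0, 75.0]
labels = [detector.classify(r, mean=50.0, std=10.0)[1] for r in readings]
```

The first two readings at Z = 2.5 are still reported as normal; only the third consecutive one is classified as elevated, which is how a transient micro-spike avoids raising an alert.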


Stage 7: Multi-State Stress Engine

Module: engine/stress/stress_engine.py

The StressEngine consolidates all upstream analysis into a single Stress Score (0–100). Unlike naive threshold approaches, the score accounts for three dimensions simultaneously:

  1. Instantaneous Pressure: The raw resource utilization of CPU, Memory, and Disk, weighted by adaptive per-machine weights.
  2. Trend Modifier: Growing slopes (e.g., memory expanding at 0.5% per second) add a forward-looking penalty.
  3. Anomaly Modifier: A sustained, statistically significant anomaly score multiplies the base pressure.
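The three dimensions above could combine roughly as in the following sketch. Only the overall shape follows the description (weighted pressure, an additive trend penalty, a multiplicative anomaly factor); the function name, the penalty scale of 20 points per percent-per-second, and the cap of +50% on the anomaly factor are hypothetical values picked for the example.

```python
def stress_score(cpu, mem, disk, weights, mem_slope, anomaly_score):
    """Combine pressure, trend, and anomaly into a 0-100 score (illustrative)."""
    # 1. Instantaneous pressure: weighted mean of utilisation percentages.
    total_weight = sum(weights.values())
    pressure = (
        weights["cpu"] * cpu + weights["mem"] * mem + weights["disk"] * disk
    ) / total_weight

    # 2. Trend modifier: additive penalty, applied to growth only.
    #    The factor 20.0 is an assumed scale, not a documented constant.
    trend_penalty = max(0.0, mem_slope) * 20.0

    # 3. Anomaly modifier: multiplies the base pressure.
    #    anomaly_score is assumed to be in [0, 1]; the +50% cap is assumed.
    anomaly_factor = 1.0 + 0.5 * min(max(anomaly_score, 0.0), 1.0)

    return min(100.0, max(0.0, pressure * anomaly_factor + trend_penalty))


weights = {"cpu": 2, "mem": 2, "disk": 1}  # stand-in for adaptive per-machine weights
calm = stress_score(60.0, 80.0, 20.0, weights, mem_slope=0.5, anomaly_score=0.0)
anomalous = stress_score(60.0, 80.0, 20.0, weights, mem_slope=0.5, anomaly_score=1.0)
```

The same utilisation yields a higher score when a sustained anomaly is present, which is the behaviour that separates this approach from a single static threshold.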

Stage 8: Prediction & Risk Engine

Module: engine/intelligence/prediction_engine.py

The PredictionEngine uses the smoothed trend slopes from Stage 5 to forecast when critical resource thresholds will be breached:

  • Time-to-Exhaustion (ETA): Calculates how many seconds until Memory hits 98% or CPU hits 95%, given current trend slopes.
  • EMA Smoothing: Slopes are themselves smoothed via an EMA to prevent chaotic, rapidly-changing ETA values.
  • Risk Score (0–100%): A probabilistic assessment of the likelihood of severe degradation within the next 5 minutes, calculated from ETA proximity and raw usage levels.
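The ETA calculation reduces to dividing the remaining headroom by the smoothed slope. A minimal sketch follows, using the 98% and 95% limits stated above; the function and class names and the smoothing factor `alpha` are assumptions, and the Risk Score is left out because its formula is not given.

```python
import math

MEMORY_LIMIT = 98.0  # percent, from the description above
CPU_LIMIT = 95.0     # percent, from the description above


def eta_seconds(current, slope_per_s, limit):
    """Seconds until `current` reaches `limit` at the given slope."""
    if current >= limit:
        return 0.0       # already at or past the threshold
    if slope_per_s <= 0:
        return math.inf  # flat or falling: no breach forecast
    return (limit - current) / slope_per_s


class SlopeSmoother:
    """EMA over successive slope estimates to keep ETA values stable."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # assumed smoothing factor
        self.value = None

    def update(self, slope):
        if self.value is None:
            self.value = slope
        else:
            self.value = self.alpha * slope + (1 - self.alpha) * self.value
        return self.value


memory_eta = eta_seconds(current=88.0, slope_per_s=0.5, limit=MEMORY_LIMIT)
cpu_eta = eta_seconds(current=50.0, slope_per_s=-0.1, limit=CPU_LIMIT)

smoother = SlopeSmoother(alpha=0.5)
smoother.update(0.4)
smoothed = smoother.update(0.8)
```

Memory at 88% growing by 0.5% per second gives an ETA of 20 seconds, while a falling CPU slope yields an infinite ETA rather than a negative one.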

Stage 9: System Stability Index

Module: engine/intelligence/stability_index.py

The StabilityCalculator synthesises all upstream signals into a single, holistic System Stability Index (1–100), where 100 represents perfect health.

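The index is derived from a weighted composite of three upstream scores. Each input rises as conditions worsen while the index rises with health, so the composite is presumably inverted before it is reported: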

  • 50% Instantaneous Stress Score
  • 30% Predictive Risk Score
  • 20% Anomaly Score
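The weighting above can be sketched as follows. Two things here are assumptions rather than documented behaviour: that all three inputs are on a 0-100 scale where higher means worse, and that the weighted sum is subtracted from 100 and clamped to the 1-100 range so that 100 means perfect health.

```python
def stability_index(stress, risk, anomaly):
    """Weighted composite of three 0-100 'higher is worse' scores (sketch)."""
    degradation = 0.5 * stress + 0.3 * risk + 0.2 * anomaly
    # Assumed inversion and clamp so the result lies in 1-100.
    return max(1.0, min(100.0, 100.0 - degradation))


healthy = stability_index(stress=0.0, risk=0.0, anomaly=0.0)
typical = stability_index(stress=40.0, risk=20.0, anomaly=10.0)
critical = stability_index(stress=100.0, risk=100.0, anomaly=100.0)
```

Under these assumptions a moderately loaded machine (stress 40, risk 20, anomaly 10) scores about 72, and the index bottoms out at 1 when every input is at its maximum.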

Stage 10: Correlation & Root Cause Analysis

Module: engine/intelligence/correlation_engine.py

The CorrelationEngine is invoked lazily: it runs only when the AlertManager determines that an alert should fire. It then performs a three-step cross-reference to generate a RootCauseAnalysis report:

  1. Bottleneck Identification: Determines whether CPU, Memory, or Disk is the primary stressor.
  2. Suspect Identification: Cross-references the bottleneck against the ProcessTracker to identify the top-impact process.
  3. Trigger Identification: Cross-references the timeline against the EventLogger to find the system event that most plausibly caused the degradation.

The resulting RCA is attached directly to the fired Alert and broadcast via the WebSocket to the dashboard.
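The three identification steps can be sketched as a single function. Everything about the data shapes is hypothetical: the dictionaries standing in for ProcessTracker and EventLogger output, the `analyze` function, the fields of the `RootCauseAnalysisSketch` dataclass, and the rule "most recent event within a lookback window" as a stand-in for "most plausible trigger".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RootCauseAnalysisSketch:
    bottleneck: str
    suspect: Optional[str]
    trigger: Optional[str]


def analyze(pressures, processes, events, alert_ts, lookback_s=120):
    """Three-step RCA over hypothetical tracker and logger data."""
    # 1. Bottleneck: the resource under the highest pressure.
    bottleneck = max(pressures, key=pressures.get)

    # 2. Suspect: the process using the most of that resource.
    suspect = max(processes, key=lambda p: p[bottleneck], default=None)

    # 3. Trigger: the latest event inside the lookback window before the alert
    #    (an assumed heuristic for "most plausible cause").
    candidates = [e for e in events if alert_ts - lookback_s <= e["ts"] <= alert_ts]
    trigger = max(candidates, key=lambda e: e["ts"], default=None)

    return RootCauseAnalysisSketch(
        bottleneck=bottleneck,
        suspect=suspect["name"] if suspect else None,
        trigger=trigger["name"] if trigger else None,
    )


rca = analyze(
    pressures={"cpu": 35.0, "memory": 92.0, "disk": 10.0},
    processes=[
        {"name": "chrome", "cpu": 10.0, "memory": 55.0, "disk": 1.0},
        {"name": "backup", "cpu": 20.0, "memory": 5.0, "disk": 8.0},
    ],
    events=[
        {"ts": 500, "name": "user login"},
        {"ts": 900, "name": "service restarted"},
        {"ts": 980, "name": "app launched"},
    ],
    alert_ts=1000,
)
```

Because the function only runs when an alert fires, the cost of scanning processes and events is paid on demand rather than on every sampling cycle.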