
kandera37/predictive-cloud-alerting


Predictive Cloud Alerting

A baseline predictive alerting system for cloud-service metrics using a sliding-window formulation and logistic regression.

Project goal

The goal of this project is to predict whether an incident will occur within the next H time steps based on the previous W steps of service metrics.

In this prototype:

  • W (window size) is the number of past time steps used as input
  • H (horizon) is the number of future time steps over which an incident is predicted

The task is formulated as a binary classification problem:

  • 1 — an incident will occur within the next H steps
  • 0 — no incident will occur within the next H steps
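
One way to write this formulation down is sketched below. The indexing convention (the window ends at the current step t, and the horizon starts at t + 1) is an assumption, as are the symbols x_t for the metric vector at step t and d for the number of metrics.

```latex
X_t = \left(x_{t-W+1}, \dots, x_t\right), \quad x_t \in \mathbb{R}^{d},
\qquad
y_t = \max_{1 \le k \le H} \mathrm{incident}_{t+k} \in \{0, 1\}
```

Under this convention, y_t = 1 exactly when at least one of the steps t + 1, ..., t + H is labeled as an incident.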

Synthetic dataset

To keep the project focused on problem formulation, model design, and evaluation, I used a synthetic multivariate time-series dataset instead of a large real-world dataset.

The generated metrics are:

  • cpu_usage
  • memory_usage
  • request_rate
  • error_rate

The dataset also includes binary incident labels:

  • incident = 1 means the system is in an incident interval
  • incident = 0 means normal operation

Synthetic incident intervals are injected by increasing several metrics for short periods of time, which simulates abnormal system behavior.
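
The injection idea can be sketched with the standard library alone. This is a minimal illustration, not the code in data_generation.py: the function name, the baseline levels, the noise scales, the size of the boosts, and the number and length of incident intervals are all assumptions chosen for readability.

```python
import random


def generate_metrics(num_steps=300, num_incidents=3, incident_len=10, seed=42):
    """Return one dict per time step with four metrics and an incident flag.

    All numeric constants here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Mark a few short intervals as incidents (intervals may overlap).
    incident = [0] * num_steps
    for _ in range(num_incidents):
        start = rng.randrange(0, num_steps - incident_len)
        for t in range(start, start + incident_len):
            incident[t] = 1

    rows = []
    for t in range(num_steps):
        boost = incident[t]  # 1 inside an incident interval, 0 otherwise
        rows.append({
            # Several metrics are raised while an incident is active.
            "cpu_usage": 40 + 35 * boost + rng.gauss(0, 5),
            "memory_usage": 50 + 25 * boost + rng.gauss(0, 4),
            "request_rate": 200 + rng.gauss(0, 20),
            "error_rate": 0.01 + 0.10 * boost + abs(rng.gauss(0, 0.005)),
            "incident": incident[t],
        })
    return rows
```

Passing the same seed reproduces the same series, which mirrors the role of the --random-seed option.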

Sliding-window formulation

The time series is converted into supervised learning examples using sliding windows.

For each sample:

  • input X contains the previous W time steps of all metrics
  • target y is 1 if at least one incident step falls within the next H time steps, and 0 otherwise

In the current baseline:

  • W = 20
  • H = 5
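
A minimal sketch of the window construction follows. It assumes the window ends at the current step t and the horizon covers steps t + 1 through t + H; the function name make_windows and the plain-list data layout are hypothetical and may differ from dataset.py. The windows are flattened here as well, since the baseline model takes flat feature vectors.

```python
def make_windows(features, incident, window_size=20, horizon=5):
    """Turn a multivariate series into supervised (X, y) samples.

    features: list of per-step metric lists, e.g. [[cpu, mem, req, err], ...]
    incident: list of 0/1 labels, one per step
    """
    X, y = [], []
    # t is the last observed step; stop early enough to leave H future steps.
    for t in range(window_size - 1, len(features) - horizon):
        window = features[t - window_size + 1:t + 1]
        # Flatten W rows of metrics into one feature vector.
        X.append([value for row in window for value in row])
        # Positive if any of the next H steps is an incident step.
        y.append(int(any(incident[t + 1:t + 1 + horizon])))
    return X, y
```

With a series of n steps this yields n - W - H + 1 samples, each with W times the number of metrics as features.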

Pipeline

The project pipeline is:

  1. Generate synthetic cloud-service metrics with incident intervals
  2. Create sliding-window samples
  3. Flatten windows into feature vectors for a classical ML model
  4. Split data into train and test sets
  5. Scale features with StandardScaler
  6. Train a LogisticRegression baseline
  7. Predict incident probabilities
  8. Apply a configurable alert threshold
  9. Evaluate the model using classification metrics
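
Steps 4 to 8 can be sketched with scikit-learn as below. Only StandardScaler and LogisticRegression are named in this README; the chronological split, the test_fraction and max_iter values, the use of make_pipeline, and the function name train_and_alert are assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_and_alert(X, y, threshold=0.7, test_fraction=0.25):
    """Fit a scaled logistic-regression baseline and threshold its probabilities."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)

    # Assumed chronological split: earlier samples train, later samples test.
    split = int(len(X) * (1 - test_fraction))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # The pipeline fits the scaler on the training part only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Column 1 of predict_proba holds the probability of class 1 (incident).
    probs = model.predict_proba(X_test)[:, 1]
    alerts = probs > threshold
    return probs, alerts, y_test
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training, which matters when comparing thresholds on a small split.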

Model choice

I used Logistic Regression as a simple and interpretable baseline for binary classification.

This choice makes it easy to:

  • validate the problem formulation
  • establish a baseline before trying more complex models
  • inspect the effect of threshold selection on alert behavior

Why this baseline is useful

The baseline gives a simple, interpretable starting point for predictive alerting.

It helps validate:

  • the sliding-window problem formulation
  • the incident labeling strategy
  • the probability-based alerting setup
  • the effect of threshold selection on alert behavior

Before moving to more complex models, this baseline makes it easier to understand whether the core framing of the task is reasonable.

Evaluation metrics

The model is evaluated using:

  • Precision — how often predicted incidents are correct
  • Recall — how many real incidents are detected
  • F1-score — balance between precision and recall
  • Confusion matrix — summary of correct and incorrect predictions
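
For reference, the metrics above reduce to a few counts. The sketch below computes them from label lists without any library; the function names are hypothetical, and evaluation.py may well use scikit-learn's metric functions instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with y_true = [1, 1, 1, 0, 0, 0, 0, 1] and y_pred = [1, 1, 0, 1, 0, 0, 0, 0], the counts are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives.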

Alerting trade-offs

In predictive alerting, model errors have operational meaning:

  • False positives correspond to unnecessary alerts
  • False negatives correspond to missed incidents

This makes threshold selection especially important. A lower threshold may increase recall but also produce more alert noise, while a higher threshold may reduce false alarms at the cost of missing more real incidents.

Because of this, the project evaluates not only raw model predictions, but also how decision thresholds affect alert quality.
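
The trade-off can be illustrated with a small threshold sweep. The labels and probabilities in the usage note are made-up toy values, not output from this project, and the function name sweep is hypothetical.

```python
def sweep(y_true, probabilities, thresholds):
    """Count false positives and false negatives at each alert threshold."""
    results = []
    for threshold in thresholds:
        # An alert is raised when the probability exceeds the threshold.
        y_pred = [int(p > threshold) for p in probabilities]
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        results.append((threshold, fp, fn))
    return results
```

On the toy inputs y_true = [0, 0, 0, 0, 1, 1, 1, 0] and probabilities = [0.1, 0.35, 0.55, 0.2, 0.9, 0.65, 0.4, 0.75], raising the threshold from 0.3 to 0.5 to 0.7 lowers false positives from 3 to 2 to 1 while false negatives rise from 0 to 1 to 2.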

Threshold comparison

The model predicts incident probabilities, and an alert is raised if the probability exceeds a chosen threshold.

I tested multiple thresholds:

| Threshold | Precision | Recall | F1-score |
|-----------|-----------|--------|----------|
| 0.3       | 0.7037    | 0.7917 | 0.7451   |
| 0.5       | 0.8636    | 0.7917 | 0.8261   |
| 0.7       | 0.9500    | 0.7917 | 0.8636   |
On the current synthetic split, 0.7 gave the highest F1-score of the three thresholds (0.8636): precision rose from 0.7037 at 0.3 to 0.9500 at 0.7, while recall stayed at 0.7917.
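
As a sanity check on the reported figures, the sketch below recomputes them from confusion counts. The counts are inferred, not taken from the repository: 19 true positives and 5 false negatives at every threshold, with 8, 3, and 1 false positives, are the smallest integer counts that reproduce the reported values to four decimals.

```python
def metrics_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) rounded to four decimals."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to 2PR / (P + R)
    return round(precision, 4), round(recall, 4), round(f1, 4)


# Inferred (tp, fp, fn) per threshold; an assumption, see the note above.
inferred_counts = {0.3: (19, 8, 5), 0.5: (19, 3, 5), 0.7: (19, 1, 5)}

# Values as reported for the current synthetic split.
reported = {
    0.3: (0.7037, 0.7917, 0.7451),
    0.5: (0.8636, 0.7917, 0.8261),
    0.7: (0.9500, 0.7917, 0.8636),
}

for threshold, counts in inferred_counts.items():
    assert metrics_from_counts(*counts) == reported[threshold]
```

If these counts are right, the test split holds 24 positive samples, so each missed or recovered incident window moves recall by about 0.04. Differences between thresholds on a split this small should be read with that in mind.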

Current best baseline

Current best baseline configuration on the synthetic test split:

  • Model: Logistic Regression
  • Window size (W): 20
  • Prediction horizon (H): 5
  • Best threshold: 0.7
  • Precision: 0.9500
  • Recall: 0.7917
  • F1-score: 0.8636

Among the three thresholds tested on the current split, 0.7 gave the best trade-off: fewer false positives with no loss of recall.

Visualizations

The project generates plots in the artifacts/ folder:

  • metrics_with_incidents.png — service metrics over time with highlighted incident intervals
  • predicted_probabilities.png — predicted incident probabilities with a decision threshold

These plots help interpret both the synthetic data and the model behavior.

Metrics with incident intervals

[Figure: Metrics with incidents]

Predicted incident probabilities

[Figure: Predicted probabilities]

Project structure

predictive-cloud-alerting/
├── README.md
├── requirements.txt
├── data/
├── artifacts/
└── src/
    ├── main.py
    ├── data_generation.py
    ├── dataset.py
    ├── model.py
    ├── evaluation.py
    └── visualization.py

How to run

Install dependencies:

python3 -m pip install --user --break-system-packages -r requirements.txt

Run the project:

python3 src/main.py

CLI usage

The experiment can be configured from the command line:

python3 src/main.py --num-steps 300 --window-size 20 --horizon 5 --threshold 0.7 --random-seed 42

Key configurable parameters:

  • --num-steps — number of generated time steps
  • --window-size — sliding window size W
  • --horizon — prediction horizon H
  • --threshold — alert decision threshold
  • --random-seed — random seed for reproducible synthetic data
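
A parser for these options could look like the sketch below. The flag names come from this README; the default values are assumptions taken from the example command above, and the actual defaults and help texts in main.py may differ.

```python
import argparse


def build_parser():
    """Build a parser for the documented options (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Predictive cloud alerting baseline"
    )
    parser.add_argument("--num-steps", type=int, default=300,
                        help="number of generated time steps")
    parser.add_argument("--window-size", type=int, default=20,
                        help="sliding window size W")
    parser.add_argument("--horizon", type=int, default=5,
                        help="prediction horizon H")
    parser.add_argument("--threshold", type=float, default=0.7,
                        help="alert decision threshold")
    parser.add_argument("--random-seed", type=int, default=42,
                        help="random seed for reproducible synthetic data")
    return parser


# Example: override two options and keep the (assumed) defaults for the rest.
args = build_parser().parse_args(["--window-size", "30", "--threshold", "0.5"])
```

argparse converts dashes in flag names to underscores, so the values are read as args.window_size, args.num_steps, and so on.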

Current limitations

This is still a simplified prototype.

Main limitations:

  • the dataset is synthetic and does not capture all real production behaviors
  • the baseline model uses flattened windows and does not explicitly model temporal structure
  • incident generation is rule-based and intentionally simplified
  • results may vary depending on synthetic data settings and train/test split

Possible improvements

Potential next steps:

  • add more realistic seasonality and noise patterns
  • include additional metrics such as latency
  • compare Logistic Regression with Random Forest or other baselines
  • save synthetic data and evaluation results automatically
  • tune thresholds based on operational goals
  • test the approach on a public real-world time-series dataset

Real-world adaptation

A similar predictive alerting approach could be adapted to real monitored systems such as:

  • cloud services
  • backend applications
  • infrastructure nodes
  • VPN gateways

Example: VPN or gateway infrastructure

For example, a similar pipeline could be applied to a small self-managed VPN or gateway-like server.

In such a setup, the model could monitor signals such as:

  • CPU usage
  • memory usage
  • active connection count
  • reconnect frequency
  • failed handshake rate
  • packet loss
  • latency

This could help raise alerts before severe overload, abnormal traffic spikes, or broader service degradation become critical.
