Predictive Cloud Alerting

A baseline predictive alerting system for cloud-service metrics using a sliding-window formulation and logistic regression.

Project goal

The goal of this project is to predict whether an incident will occur within the next H time steps based on the previous W steps of service metrics.

In this prototype:

W (window size) is the number of past time steps used as input
H (horizon) is the number of future time steps over which an incident is predicted

The task is formulated as a binary classification problem:

1 — an incident will occur within the next H steps
0 — no incident will occur within the next H steps

Synthetic dataset

To keep the project focused on problem formulation, model design, and evaluation, I used a synthetic multivariate time-series dataset instead of a large real-world dataset.

The generated metrics are:

cpu_usage
memory_usage
request_rate
error_rate

The dataset also includes binary incident labels:

incident = 1 means the system is in an incident interval
incident = 0 means normal operation

Synthetic incident intervals are injected by increasing several metrics for short periods of time, which simulates abnormal system behavior.
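A minimal sketch of how such data could be generated with only the standard library. The function name generate_metrics, the baseline values, noise levels, bump sizes, and the number and length of incidents are illustrative assumptions, not the values used in src/data_generation.py.

```python
import random

METRICS = ["cpu_usage", "memory_usage", "request_rate", "error_rate"]


def generate_metrics(num_steps=300, num_incidents=5, incident_len=8, seed=42):
    """Return (rows, incident): one dict of metric values and one 0/1 label per step.

    All numeric settings below are made-up illustration values.
    """
    rng = random.Random(seed)
    # Hypothetical (baseline mean, noise std, incident bump) per metric.
    params = {
        "cpu_usage": (40.0, 5.0, 35.0),
        "memory_usage": (55.0, 3.0, 20.0),
        "request_rate": (200.0, 20.0, 150.0),
        "error_rate": (1.0, 0.3, 6.0),
    }

    # Mark short incident intervals at random positions (they may overlap).
    incident = [0] * num_steps
    for _ in range(num_incidents):
        start = rng.randrange(0, num_steps - incident_len)
        for t in range(start, start + incident_len):
            incident[t] = 1

    # Noisy baseline for every metric, raised during incident intervals.
    rows = []
    for t in range(num_steps):
        row = {}
        for name in METRICS:
            mean, std, bump = params[name]
            value = rng.gauss(mean, std)
            if incident[t]:
                value += bump
            row[name] = max(value, 0.0)
        rows.append(row)
    return rows, incident


rows, incident = generate_metrics()
print(len(rows), sum(incident))
```

Using a seeded random.Random instance keeps the generated series reproducible for a given seed.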

Sliding-window formulation

The time series is converted into supervised learning examples using sliding windows.

For each sample:

input X contains the previous W time steps of all metrics
target y is 1 if at least one incident step falls within the next H time steps, and 0 otherwise

In the current baseline:

W = 20
H = 5
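The formulation above can be sketched in plain Python as follows. The function name make_windows and the exact index convention (the window ends at step t-1 and the horizon starts at step t) are assumptions made for illustration; src/dataset.py may differ in detail.

```python
def make_windows(features, incident, window_size=20, horizon=5):
    """Build flattened sliding-window samples.

    features: list of per-step metric vectors, features[t] = [m1, m2, ...]
    incident: list of 0/1 labels, one per time step
    """
    X, y = [], []
    # t is the first future step: the window covers steps t-W .. t-1,
    # the label looks at steps t .. t+H-1.
    for t in range(window_size, len(features) - horizon + 1):
        window = features[t - window_size:t]
        future = incident[t:t + horizon]
        # Flatten the W x num_metrics window into one feature vector.
        X.append([value for step in window for value in step])
        y.append(1 if any(future) else 0)
    return X, y


# Toy series: 10 steps, 2 metrics, a single incident step at t = 7.
toy_features = [[float(t), 0.0] for t in range(10)]
toy_incident = [0] * 10
toy_incident[7] = 1

X, y = make_windows(toy_features, toy_incident, window_size=3, horizon=2)
print(len(X), y)
```

A series of n steps yields n - W - H + 1 samples under this convention, and each sample has W times the number of metrics as its feature count.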

Pipeline

The project pipeline is:

Generate synthetic cloud-service metrics with incident intervals
Create sliding-window samples
Flatten windows into feature vectors for a classical ML model
Split data into train and test sets
Scale features with StandardScaler
Train a LogisticRegression baseline
Predict incident probabilities
Apply a configurable alert threshold
Evaluate the model using classification metrics
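The split and scaling steps can be sketched as below. This is a standard-library illustration, not the project's code: the pipeline itself uses StandardScaler, and the helper names here are hypothetical. A chronological split is assumed, since neighboring windows overlap heavily and a shuffled split could place near-identical samples in both sets; the README does not state which split src/main.py uses. The scaler statistics are computed on the training part only and then reused for the test part.

```python
import math


def chronological_split(X, y, test_fraction=0.25):
    """Keep the earliest samples for training and the latest for testing."""
    cut = int(len(X) * (1 - test_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]


def fit_scaler(X_train):
    """Per-feature mean and population standard deviation of the training set."""
    n = len(X_train)
    columns = list(zip(*X_train))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        # Constant features get a scale of 1.0 to avoid division by zero.
        stds.append(std if std > 0 else 1.0)
    return means, stds


def transform(X, means, stds):
    """Standardize each feature with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]]
y = [0, 0, 1, 1]

X_train, X_test, y_train, y_test = chronological_split(X, y)
means, stds = fit_scaler(X_train)
X_train_scaled = transform(X_train, means, stds)
X_test_scaled = transform(X_test, means, stds)
print(means, X_test_scaled)
```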

Model choice

I used Logistic Regression as a simple and interpretable baseline for binary classification.

This choice makes it easy to:

validate the problem formulation
establish a baseline before trying more complex models
inspect the effect of threshold selection on alert behavior
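For reference, the standard logistic regression model combined with a thresholded alert rule can be written as follows. The notation is introduced here for illustration: x is the flattened, scaled window, w and b are the learned weights and bias, and τ is the alert threshold.

```latex
\[
p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
\text{alert} \iff p(y = 1 \mid \mathbf{x}) > \tau
\]
```

Assuming the four metrics are flattened without extra features, x has 4 × W = 80 components for W = 20, and each weight belongs to one metric at one position in the window.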

Why this baseline is useful

The baseline gives predictive alerting a simple, interpretable starting point.

It helps validate:

the sliding-window problem formulation
the incident labeling strategy
the probability-based alerting setup
the effect of threshold selection on alert behavior

Before moving to more complex models, this baseline makes it easier to understand whether the core framing of the task is reasonable.

Evaluation metrics

The model is evaluated using:

Precision — how often predicted incidents are correct
Recall — how many real incidents are detected
F1-score — balance between precision and recall
Confusion matrix — summary of correct and incorrect predictions
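These metrics follow directly from the confusion-matrix counts. The sketch below computes them in plain Python; the function names and the toy labels are illustrative, and src/evaluation.py may rely on library functions instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, made up for illustration.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, f1)
```

As a sanity check, hypothetical counts of 19 true positives, 1 false positive, and 5 false negatives give precision 0.9500, recall 0.7917, and F1 0.8636. The actual counts behind the reported results are not stated here.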

Alerting trade-offs

In predictive alerting, model errors have operational meaning:

False positives correspond to unnecessary alerts
False negatives correspond to missed incidents

This makes threshold selection especially important. A lower threshold may increase recall but also produce more alert noise, while a higher threshold may reduce false alarms at the cost of missing more real incidents.

Because of this, the project evaluates not only raw model predictions, but also how decision thresholds affect alert quality.

Threshold comparison

The model predicts incident probabilities, and an alert is raised if the probability exceeds a chosen threshold.

I tested multiple thresholds:

Threshold	Precision	Recall	F1-score
0.3	0.7037	0.7917	0.7451
0.5	0.8636	0.7917	0.8261
0.7	0.9500	0.7917	0.8636

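On the current synthetic split, a threshold of 0.7 gave the highest F1-score (0.8636). Precision rose from 0.7037 to 0.9500 as the threshold increased, while recall stayed at 0.7917 for all three thresholds. Constant recall means that no incident window in the test split received a predicted probability between 0.3 and 0.7, so raising the threshold within this range removed only false positives.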
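A threshold comparison of this kind can be sketched as a simple sweep over predicted probabilities. The labels and probabilities below are made up for illustration and are not the project's test data; the function name sweep_thresholds is hypothetical.

```python
def sweep_thresholds(y_true, probabilities, thresholds=(0.3, 0.5, 0.7)):
    """Return one (threshold, precision, recall, f1) tuple per threshold."""
    results = []
    for threshold in thresholds:
        # An alert is raised when the probability exceeds the threshold.
        y_pred = [1 if p > threshold else 0 for p in probabilities]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        results.append((threshold, precision, recall, f1))
    return results


# Toy data: no positive sample has a probability between 0.3 and 0.7,
# so recall does not change across the three thresholds.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probabilities = [0.9, 0.8, 0.75, 0.6, 0.4, 0.1, 0.2, 0.2]

results = sweep_thresholds(y_true, probabilities)
for threshold, precision, recall, f1 in results:
    print(f"{threshold:.1f}\t{precision:.4f}\t{recall:.4f}\t{f1:.4f}")
```

In this toy case only negatives fall between the thresholds, which shows the same pattern as the table: precision rises while recall stays flat.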

Current best baseline

Current best baseline configuration on the synthetic test split:

Model: Logistic Regression
Window size (W): 20
Prediction horizon (H): 5
Best threshold: 0.7
Precision: 0.9500
Recall: 0.7917
F1-score: 0.8636

Of the three thresholds tested, 0.7 gave the highest precision and F1-score at unchanged recall. This holds for the current split only.

Visualizations

The project generates plots in the artifacts/ folder:

metrics_with_incidents.png — service metrics over time with highlighted incident intervals
predicted_probabilities.png — predicted incident probabilities with a decision threshold

These plots help interpret both the synthetic data and the model behavior.

Metrics with incident intervals (figure: artifacts/metrics_with_incidents.png)

Predicted incident probabilities (figure: artifacts/predicted_probabilities.png)

Project structure

predictive-cloud-alerting/
├── README.md
├── requirements.txt
├── data/
├── artifacts/
└── src/
    ├── main.py
    ├── data_generation.py
    ├── dataset.py
    ├── model.py
    ├── evaluation.py
    └── visualization.py

How to run

Install dependencies:

python3 -m pip install --user --break-system-packages -r requirements.txt

Run the project:

python3 src/main.py

CLI usage

The experiment can be configured from the command line:

python3 src/main.py --num-steps 300 --window-size 20 --horizon 5 --threshold 0.7 --random-seed 42

Key configurable parameters:

--num-steps — number of generated time steps
--window-size — sliding window size W
--horizon — prediction horizon H
--threshold — alert decision threshold
--random-seed — random seed for reproducible synthetic data
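The options above could be declared with argparse roughly as follows. The flag names come from the list above; the defaults simply mirror the example command and are assumptions, as is the build_parser helper, since the actual defaults in src/main.py are not documented here.

```python
import argparse


def build_parser():
    """Command-line options for the experiment (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Predictive cloud alerting baseline"
    )
    parser.add_argument("--num-steps", type=int, default=300,
                        help="number of generated time steps")
    parser.add_argument("--window-size", type=int, default=20,
                        help="sliding window size W")
    parser.add_argument("--horizon", type=int, default=5,
                        help="prediction horizon H")
    parser.add_argument("--threshold", type=float, default=0.7,
                        help="alert decision threshold")
    parser.add_argument("--random-seed", type=int, default=42,
                        help="random seed for reproducible synthetic data")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--window-size", "30", "--threshold", "0.5"])
print(args.window_size, args.threshold, args.horizon)
```

argparse converts dashes in option names to underscores, so --window-size is read as args.window_size.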

Current limitations

This is still a simplified prototype.

Main limitations:

the dataset is synthetic and does not capture all real production behaviors
the baseline model uses flattened windows and does not explicitly model temporal structure
incident generation is rule-based and intentionally simplified
results may vary depending on synthetic data settings and train/test split

Possible improvements

Potential next steps:

add more realistic seasonality and noise patterns
include additional metrics such as latency
compare Logistic Regression with Random Forest or other baselines
save synthetic data and evaluation results automatically
tune thresholds based on operational goals
test the approach on a public real-world time-series dataset

Real-world adaptation

A similar predictive alerting approach could be adapted to real monitored systems such as:

cloud services
backend applications
infrastructure nodes
VPN gateways

Example: VPN or gateway infrastructure

A similar pipeline could be applied to a small self-managed VPN or gateway-like server.

In such a setup, the input metrics could include:

CPU usage
memory usage
active connection count
reconnect frequency
failed handshake rate
packet loss
latency

This could help raise alerts before severe overload, abnormal traffic spikes, or broader service degradation become critical.
