A production-style uplift modeling pipeline for a direct marketing campaign. The idea is to learn which customers to contact to maximize net profit - not simply who is most likely to buy, but who is caused to buy by receiving an offer - taking into account communication costs, discount size and profit margins.
Built end-to-end: from raw transactional data through causal feature engineering, multi-model Optuna hyperparameter tuning, and a Prefect-orchestrated scoring pipeline with MLflow experiment tracking.
Standard propensity models rank customers by their baseline purchase probability. But the customers most likely to buy anyway are wasted spend - they would have converted without the offer. Thus, the relevant question for targeting is:
By how much does this specific customer's purchase probability increase because they received the campaign?
This is the Conditional Average Treatment Effect (CATE), also called the individual uplift. A customer is worth targeting when:
CATE × margin_per_gram > discount_cost + contact_cost
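As a quick sanity check, the targeting rule reduces to a one-line comparison. All numbers below are hypothetical and only illustrate the inequality above:

```python
# Hypothetical numbers; the rule mirrors the inequality above.
cate = 0.03              # predicted uplift in purchase probability
margin_per_gram = 180.0  # expected margin (currency units) if the customer converts
discount_cost = 4.00
contact_cost = 0.50

incremental_profit = cate * margin_per_gram          # 0.03 * 180.0 = 5.40
should_target = incremental_profit > discount_cost + contact_cost
print(should_target)  # True: 5.40 > 4.50
```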
The campaign dataset contains randomized control group assignments, making it possible to estimate CATE directly from observed outcomes.
Three families of CATE estimators are implemented and benchmarked:
| Model | Description |
|---|---|
| S-Learner | Single model; treatment flag is just another feature. CATE = f(X, T=1) − f(X, T=0). Simple baseline. |
| X-Learner | Two-stage learner. Stage 1 fits outcome models on each arm; Stage 2 fits CATE models on imputed treatment effects. Propensity-weighted combination at prediction time. |
| R-Learner | Robinson decomposition. Residualizes both outcome and treatment against their conditional means, then fits CATE on the residuals. Theoretically efficient. |
| Uplift Tree / RF | Tree models with a KL-divergence splitting criterion that directly optimizes uplift in each node. Note that these have no GPU training option, so they take substantially longer to train than the boosted backends. |
Each meta-learner is paired with three gradient-boosted tree backends: LightGBM, XGBoost, and CatBoost. Many more meta-learner/backend combinations (and other base models) are possible - feel free to experiment.
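To make the S-Learner row concrete, here is a minimal sketch of the idea (not the repo's wrapper), with scikit-learn's `GradientBoostingRegressor` standing in for the boosted backends and fully synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, size=n)  # randomized treatment flag
# Synthetic outcome: treatment helps only when feature 0 is positive
y = X[:, 1] + T * np.maximum(X[:, 0], 0) + rng.normal(scale=0.1, size=n)

# S-Learner: one model, treatment indicator appended as just another feature
model = GradientBoostingRegressor(random_state=0)
model.fit(np.column_stack([X, T]), y)

# CATE = f(X, T=1) - f(X, T=0)
cate = (model.predict(np.column_stack([X, np.ones(n)]))
        - model.predict(np.column_stack([X, np.zeros(n)])))
print(cate[:5].round(2))
```

The X- and R-Learners refine this by fitting separate per-arm models and residualizing, respectively, but the "predict under both treatment states and subtract" pattern is the common core.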

Two metrics guide model selection and Optuna tuning:
- Qini coefficient - area under the Qini curve minus the random baseline. Primary tuning objective.
- Uplift@K - actual ATE in the top-K fraction ranked by predicted uplift. Used for business-level validation.
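Uplift@K is straightforward to compute by hand: rank by predicted uplift, keep the top-K fraction, and take the treated-minus-control mean outcome in that slice. A minimal sketch on synthetic data (not the repo's `metrics.py` implementation):

```python
import numpy as np

def uplift_at_k(y, t, scores, k=0.3):
    """Actual ATE among the top-k fraction ranked by predicted uplift."""
    order = np.argsort(-scores)
    top = order[: int(len(scores) * k)]
    y_top, t_top = y[top], t[top]
    return y_top[t_top == 1].mean() - y_top[t_top == 0].mean()

rng = np.random.default_rng(1)
n = 10_000
t = rng.integers(0, 2, n)                       # randomized assignment
true_uplift = rng.uniform(-0.1, 0.3, n)         # per-customer effect
y = (rng.random(n) < 0.2 + t * np.clip(true_uplift, 0, None)).astype(float)

# Scoring with the true uplift gives the best achievable ranking
print(round(uplift_at_k(y, t, true_uplift, k=0.3), 3))
```

The Qini curve generalizes this by sweeping K from 0 to 1 and accumulating incremental conversions; its area minus the random-targeting baseline gives the Qini coefficient used as the tuning objective.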
| Layer | Tool |
|---|---|
| Causal ML | causalml, custom S/X/R-Learner wrappers |
| Gradient boosting | LightGBM · XGBoost · CatBoost |
| Feature pipeline | scikit-learn |
| Hyperparameter tuning | Optuna |
| Orchestration | Prefect 3 |
| Experiment tracking | MLflow |
| Data layer | pandas · pyarrow |
uplift_modeling_setup/
│
├── configs/
│ ├── campaign.json # scoring run config (date, extract, selection threshold)
│ └── system.json # data paths, artifact root, MLflow URI
│
├── artifacts/
│ └── serving_extract_config.json # feature config used at inference time
│ # (*.pickle files are gitignored; generated by train.py)
│
├── src/
│ ├── datalib/
│ │ ├── __init__.py # Engine - lightweight pandas data store abstraction
│ │ ├── features.py # Feature calculators (receipts, recency, loyalty, …)
│ │ └── transforms.py # sklearn-compatible transformers (FillNa, LocationEncoder)
│ ├── training/
│ │ ├── learners.py # S/X/R-Learner and UpliftModel wrappers
│ │ └── metrics.py # Qini coefficient, Uplift@K, Optuna TrialLogger
│ ├── campaign_flow.py # Prefect flow: load → extract → transform → score → export
│ ├── model_utils.py # ModelKeeper - model + column list bundle with MLflow support
│ └── utils.py # I/O helpers
│
├── notebooks/
│ └── train_model.ipynb # Exploratory training notebook (full model comparison)
│
├── train.py # CLI training script (replicates the notebook end-to-end for programmatic usage)
├── run_campaign.py # CLI scoring script (Prefect flow entry point)
└── requirements.txt
python -m pip install -r requirements.txt

Runs Optuna tuning across all 11 models, selects the best, and saves serving artifacts to artifacts/.
# Full training run (all 11 models)
python train.py
# Fast iteration - train a specific subset
python train.py --models slearner-lgb xlearner-lgb rlearner-lgb
# CPU-only
python train.py --device cpu
# Custom trial budget
python train.py --n-trials-fast 30 --n-trials-medium 20 --n-trials-slow 15

Loads data, extracts features, scores all 2M customers, and outputs a submission CSV with the targeted customer IDs.
# One-shot run with defaults
python run_campaign.py
# Override the feature cut-off date (integer day-number in this dataset)
python run_campaign.py --date-to 95
# Custom output path
python run_campaign.py -o runs/my_submission.csv
# Register as a long-running Prefect deployment
python run_campaign.py --serve

Output: runs/submission.csv - a single-column CSV of customer_id values.
# Start the MLflow UI to browse all training and scoring runs
mlflow ui --port 5000

Then open http://127.0.0.1:5000.
119 features are computed per customer from three raw tables (customers, receipts, campaigns):
| Feature group | Description |
|---|---|
| Receipt aggregates | Sum, mean, max, min, std of purchase amount and value over 7 time windows (7 / 15 / 30 / 60 / 90 / 180 / 365 days). Includes transaction count, mean inter-purchase interval, and recency within each window. |
| Global recency | Days since the customer's last purchase before the feature date. |
| Purchase trend | Ratio of short-window spend to long-window spend across 5 window pairs - captures whether the customer is accelerating or decelerating. |
| Demographics | Customer age and encoded city/location. |
| Campaign history | Number of past campaigns the customer appeared in; binary flag for prior treatment group membership. |
| Day-of-week | Share of purchases on each pseudo day-of-week (date mod 7), mode day, and weekend purchase share. |
| City cheque | Customer's average cheque relative to their city's average. |
| Loyalty | Purchase frequency (unique days / lifespan), spend per active day, composite loyalty score. |
After feature extraction a preprocessing pipeline (FillNaTransformer + LocationEncoder) fills missing values and one-hot encodes location, expanding to 125 features.
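The repo's `FillNaTransformer` and `LocationEncoder` are custom, but the pipeline shape can be sketched with stock scikit-learn parts; the column names and values below are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame mirroring the extract: numeric aggregates + a location code
df = pd.DataFrame({
    "spend_30d": [120.0, None, 85.5],
    "recency": [3, 41, None],
    "location": ["city_a", "city_b", "city_a"],
})

pre = ColumnTransformer([
    # Stand-in for FillNaTransformer: constant-fill missing numerics
    ("num", SimpleImputer(strategy="constant", fill_value=0.0),
     ["spend_30d", "recency"]),
    # Stand-in for LocationEncoder: one-hot the city/location column
    ("loc", OneHotEncoder(handle_unknown="ignore"), ["location"]),
])

X = pre.fit_transform(df)
print(X.shape)  # 2 numeric + 2 one-hot location columns -> (3, 4)
```

One-hot expansion of the location column is what grows the 119 extracted features to the 125 the models consume.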
The results below, obtained on the bundled dataset, demonstrate that the pipeline works end-to-end. The actual numbers will of course differ on your data.
On the holdout evaluation set (campaign date = 101):
| Metric | Value |
|---|---|
| Customers scored | 2,000,000 |
| Customers selected (CATE > 0) | 867,292 (43.4%) |
| Mean CATE of selected customers | +$7.99 |
| Mean CATE overall | −$1.87 |
The negative overall mean CATE confirms that blanket outreach destroys value - targeting only positive-uplift customers is critical for campaign profitability.
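The selection step itself is just a threshold filter on the score column, matching the `selection` block in configs/campaign.json. A sketch with hypothetical scores (not the Prefect task from `campaign_flow.py`):

```python
import io
import pandas as pd

# Hypothetical scored customers
scores = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "uplift_score": [4.2, -1.3, 0.0, 7.9],
})

threshold = 0.0  # matches selection.threshold in configs/campaign.json
selected = scores.loc[scores["uplift_score"] > threshold, "customer_id"]

# Single-column CSV, mirroring the run_campaign.py submission format
buf = io.StringIO()
selected.to_csv(buf, index=False)
print(selected.tolist())  # [101, 104]
```

Note the strict inequality: customers with exactly zero predicted uplift are excluded, consistent with the "CATE > 0" selection in the results table.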
configs/system.json - environment-level settings:
{
"database": { "root_path": "data" },
"artifacts_root_path": "artifacts",
"mlflow": {
"tracking_uri": "http://127.0.0.1:5000",
"experiment_name": "smart-reach-uplift"
}
}

configs/campaign.json - campaign-level settings:
{
"date_to": "101",
"extract": "serving_extract_config.json",
"transform": [...],
"selection": {
"score_column": "uplift_score",
"threshold": 0.0
}
}

Note on dates: The dataset uses integer day-numbers (0, 1, 2, …) rather than calendar dates. The value 101 corresponds to the campaign launch day; features are computed from all history strictly before that day.