TSCFEval is a model-agnostic Python framework for systematic evaluation of counterfactual explanations in Time Series Classification (TSC). Unlike existing libraries that focus on counterfactual generation, TSCFEval is specifically designed for counterfactual evaluation, consolidating fragmented evaluation practices from the TSC counterfactual literature into a unified, extensible toolkit.
Given a time series classifier and a set of counterfactual explanations, TSCFEval provides:
- 11 evaluation metrics organized into six quality dimensions (core quality, distribution alignment, structural properties, model behavior, stability, and computational performance)
- Weighted scalarization for aggregating metrics into composite scores, enabling customizable method ranking
- Confidence-stratified instance selection for benchmarking across the decision boundary
- Three benchmarking scenarios: single dataset with multiple CF methods, single dataset with multiple classifiers, and multiple datasets with a fixed classifier
- 7 built-in CF methods for generating counterfactuals
- Pareto and Friedman analysis for principled multi-criteria comparison
- Installation
- Available Methods and Metrics
- Evaluating Counterfactuals
- Benchmarking Multiple Methods
- Weighted Scalarization
- Adding Your Own CF Method
- Citation
- References
## Installation

```bash
pip install tscf-eval
```

With optional dependencies:

```bash
pip install tscf-eval[dtw]   # DTW distance support
pip install tscf-eval[full]  # All features
```

From source:

```bash
git clone https://github.com/bzamith/tscf-eval.git
cd tscf-eval && pip install -e .
```

## Available Methods and Metrics

Built-in counterfactual generation methods:

| Method | Strategy | Description | Reference |
|---|---|---|---|
| CELS | Saliency map | Learned saliency map blending with nearest unlike neighbor | Li et al., 2023 |
| NativeGuide | Instance-based | Nearest unlike neighbor guidance with four variants (blend, ng, dtw_dba, cam) | Delaney et al., 2021 |
| COMTE | Instance-based | Greedy channel substitution for multivariate TS | Ates et al., 2021 |
| SETS | Shapelet-based | Class-specific shapelet manipulation with contiguous perturbations | Bahri et al., 2022 |
| TSEvo | Evolutionary | Multi-objective optimization via NSGA-II with three mutation operators | Hollig et al., 2022 |
| Glacier | Gradient-based | Gradient optimization with importance-weighted proximity constraints | Wang et al., 2024 |
| LatentCF | Gradient-based | Latent space optimization with local/global importance weighting | Wang et al., 2021 |
TSCFEval implements 11 metrics organized into six quality dimensions:
| Dimension | Metric | Description | Direction | Reference |
|---|---|---|---|---|
| Core Quality | Validity | Fraction of CFs that flip the prediction (hard or soft mode) | maximize | Li et al., 2023 |
| | Proximity | Closeness to original instance (L1, L2, L-inf, DTW) | maximize | Delaney et al., 2021; Bahri et al., 2022 |
| | Sparsity | Fraction of changed features | minimize | Mothilal et al., 2020 |
| Distribution | Plausibility | Whether CFs lie within data distribution (LOF, IF, MP-OCSVM, DTW-LOF) | maximize | Breunig et al., 2000; Liu et al., 2008 |
| | Diversity | Variety among multiple CFs via DPP (Euclidean or DTW) | maximize | Mothilal et al., 2020 |
| Structure | Contiguity | How contiguous the edits are | maximize | Delaney et al., 2021; Ates et al., 2021 |
| | Composition | Number and length of edit segments | minimize | Delaney et al., 2021; Ates et al., 2021 |
| Model Behavior | Confidence | Model confidence on original and CF predictions | maximize | Le et al., 2023 |
| | Controllability | Ease of reverting CF changes via single-feature edits | maximize | Verma et al., 2024 |
| Stability | Robustness | Local Lipschitz-like stability to input perturbations (Euclidean or DTW) | minimize | Ates et al., 2021 |
| Performance | Efficiency | Generation time per instance | minimize | Li et al., 2023 |
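Each metric in the table maps to a class that can be passed to the `Evaluator` used in the next section. As a hedged sketch of configuring dimensions beyond core quality, the snippet below assumes that `Plausibility` and `Diversity` are importable under the names used in the table and that their keyword arguments look roughly as shown; only `Validity`, `Proximity`, and `Sparsity` appear verbatim elsewhere in this README.

```python
from tscf_eval import Evaluator, Validity, Proximity, Sparsity
# Assumed exports: class names taken from the metrics table above;
# the constructor arguments below are illustrative, not confirmed API.
from tscf_eval import Plausibility, Diversity

evaluator = Evaluator([
    Validity(),                       # core quality: does the CF flip the prediction?
    Proximity(distance="dtw"),        # core quality: DTW closeness to the original
    Sparsity(),                       # core quality: fraction of changed features
    Plausibility(method="lof"),       # distribution: LOF-based plausibility (assumed argument)
    Diversity(distance="euclidean"),  # distribution: DPP diversity (assumed argument)
])
```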
## Evaluating Counterfactuals

```python
from sklearn.neighbors import KNeighborsClassifier
from tscf_eval import (
    Evaluator, Validity, Proximity, Sparsity,
    UCRLoader, NativeGuide,
)
# Load data
loader = UCRLoader("ItalyPowerDemand")
train, test = loader.load("train"), loader.load("test")
# Train classifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(train.X, train.y)
# Generate counterfactuals using NativeGuide
explainer = NativeGuide(clf, (train.X, train.y), method="blend")
X, X_cf, y, y_cf = [], [], [], []
for x in test.X[:10]:
    cf, cf_label, _ = explainer.explain(x)
    X.append(x)
    X_cf.append(cf)
    y.append(clf.predict(x.reshape(1, -1))[0])
    y_cf.append(cf_label)
# Evaluate counterfactual quality
evaluator = Evaluator([
    Validity(),
    Proximity(p=2, distance="lp"),
    Proximity(distance="dtw"),
    Sparsity(),
])
results = evaluator.evaluate(X, X_cf, y=y, y_cf=y_cf)
print(f"Validity: {results['validity_soft']:.2f}")
print(f"Proximity (L2): {results['proximity_l2']:.2f}")
print(f"Proximity (DTW): {results['proximity_dtw']:.2f}")
print(f"Sparsity: {results['sparsity']:.2f}")from tscf_eval import COMTE, NativeGuide
from tscf_eval.data_loader import UCRLoader
from sklearn.neighbors import KNeighborsClassifier
# Load dataset
loader = UCRLoader("ItalyPowerDemand")
train = loader.load("train")
test = loader.load("test")
# Train classifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(train.X, train.y)
# Generate counterfactual with CoMTE
comte = COMTE(
    model=clf,
    data=(train.X, train.y),
    distance="dtw",
)
cf, cf_label, meta = comte.explain(test.X[0])
print(f"CF label: {cf_label}, Edits: {meta['edits_variables']}")TSCFEval supports three benchmarking scenarios for systematic evaluation:
- Single dataset, multiple CF methods -- compare explainer algorithms on a fixed dataset and model
- Single dataset, multiple classifiers -- study how the classifier affects CF quality
- Multiple datasets, fixed classifier -- assess generalization across datasets
```python
from tscf_eval import Evaluator, Validity, Proximity, Sparsity, COMTE, NativeGuide, TSEvo
from tscf_eval.benchmark import BenchmarkRunner, DatasetConfig, ModelConfig, ExplainerConfig
from tscf_eval.data_loader import UCRLoader
# Load data
loader = UCRLoader("ArrowHead")
train, test = loader.load("train"), loader.load("test")
# Train a classifier
from aeon.classification.convolution_based import RocketClassifier
clf = RocketClassifier(n_kernels=500, random_state=42)
clf.fit(train.X, train.y)
# Configure benchmark
runner = BenchmarkRunner(
    datasets=[DatasetConfig("ArrowHead", train.X, train.y, test.X, test.y)],
    models=[ModelConfig("rocket", clf)],
    explainers=[
        ExplainerConfig("comte_dtw", COMTE, {"distance": "dtw"}),
        ExplainerConfig("ng_blend", NativeGuide, {"method": "blend"}),
        ExplainerConfig("tsevo", TSEvo, {"n_generations": 50}),
    ],
    evaluator=Evaluator([Validity(), Proximity(p=2, distance="lp"), Proximity(distance="dtw"), Sparsity()]),
    n_instances=12,
    instance_selection="stratified_confidence",  # Confidence-stratified selection
)
results = runner.run()
# Aggregate results by explainer
df = results.aggregate(by="explainer")
print(df)
```
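Beyond descriptive aggregation, the feature list above mentions Friedman analysis for multi-criteria comparison. A minimal sketch of such a test on per-dataset scores, using `scipy.stats.friedmanchisquare` directly rather than a TSCFEval API, is shown below; the score values and the dictionary layout are illustrative assumptions.

```python
from scipy.stats import friedmanchisquare

# Illustrative per-dataset composite scores for three explainers (assumed values);
# in practice these would be collected from benchmark runs over several datasets.
scores = {
    "comte_dtw": [0.81, 0.74, 0.69, 0.77],
    "ng_blend":  [0.78, 0.70, 0.72, 0.75],
    "tsevo":     [0.66, 0.61, 0.70, 0.68],
}

# Friedman test: do the explainers' rankings differ significantly across datasets?
stat, p_value = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.3f}")
```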
Find Pareto-optimal methods balancing multiple objectives:

```python
from tscf_eval.benchmark import ParetoAnalyzer
analyzer = ParetoAnalyzer(metrics=[
"validity_soft", "proximity_dtw", "sparsity",
])
# Dominance ranking
ranking = analyzer.dominance_ranking(results)
print(ranking)
# Visualize trade-offs
analyzer.plot_front(results, x_metric="validity_soft", y_metric="proximity_dtw")
```

## Weighted Scalarization

Since evaluation metrics often conflict (e.g., maximizing validity may reduce proximity), TSCFEval implements weighted sum scalarization for aggregating metrics into a composite score. Each metric is min-max normalized to [0, 1] with direction awareness, then combined via weighted sum:
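Written out, with notation introduced here only for illustration ($g_j(m)$ is the raw value of metric $j$ for method $m$ over the compared candidates, $w_j$ its weight), the normalization and composite score are:

$$
\tilde g_j(m) = \frac{g_j(m) - \min_k g_j(k)}{\max_k g_j(k) - \min_k g_j(k)},
\qquad
S(m) = \sum_j w_j \, \tilde g_j(m),
$$

where $\tilde g_j$ is flipped to $1 - \tilde g_j$ for metrics whose direction is *minimize*, so that a higher composite score is always better.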
```python
from tscf_eval.benchmark import WeightedScalarizer
# Equal-weight composite across all metrics
scalarizer = WeightedScalarizer(metrics=[
"validity_soft", "proximity_dtw", "sparsity",
])
scores = scalarizer.score(results)
print(scores) # Ranked by composite score
# Custom weights emphasizing validity
scalarizer = WeightedScalarizer(
metrics=["validity_soft", "proximity_dtw", "sparsity"],
weights={"validity_soft": 3.0, "proximity_dtw": 1.0, "sparsity": 1.0},
)
# Sensitivity analysis: how ranking changes as one metric's weight varies
sens_df = scalarizer.sensitivity(results, vary_metric="validity_soft", n_steps=11)
scalarizer.plot_sensitivity(sens_df)
```

## Adding Your Own CF Method

To integrate your own counterfactual method with TSCFEval, inherit from the `Counterfactual` base class and implement the `explain` method:
```python
from tscf_eval.counterfactuals.base import Counterfactual
import numpy as np


class MyCFMethod(Counterfactual):
    """Custom counterfactual generator."""

    def __init__(self, model, data, my_param=1.0):
        self.model = model
        self.X_ref, self.y_ref = data
        self.my_param = my_param

    def explain(self, x, y_pred=None):
        """Generate counterfactual for instance x.

        Parameters
        ----------
        x : np.ndarray
            Single instance, shape (T,) or (C, T).
        y_pred : int, optional
            Predicted label for x (computed if None).

        Returns
        -------
        cf : np.ndarray
            Counterfactual instance.
        cf_label : int
            Predicted label for the counterfactual.
        meta : dict
            Metadata about the generation process.
        """
        # Your CF generation logic here
        cf = x.copy()
        # ... modify cf ...
        cf_label = int(self.model.predict(cf[None, ...])[0])
        meta = {
            "method": "my_cf_method",
            "param_used": self.my_param,
        }
        return cf, cf_label, meta
```

Use your method with the evaluator and benchmarking tools:
```python
# Direct evaluation
my_method = MyCFMethod(model=clf, data=(train.X, train.y))
cf, label, meta = my_method.explain(test.X[0])
# In benchmarks
configs = [
ExplainerConfig("my_method", MyCFMethod, {"my_param": 2.0}),
ExplainerConfig("comte", COMTE, {}),
]If you use TSCFEval in your research, please cite our paper:
Zamith Santos, B., Andrade Lira, M. F., Cerri, R., & Cavalcante Prudêncio, R. B. (2026). TSCFEval: A Model-Agnostic Framework for Evaluating Time Series Classification Counterfactuals. In Explainable Artificial Intelligence. xAI 2026. Communications in Computer and Information Science. Springer, Cham.
```bibtex
@inproceedings{santos2026tscfeval,
  title     = {{TSCFEval}: A Model-Agnostic Framework for Evaluating Time Series Classification Counterfactuals},
  author    = {Zamith Santos, Bruna and Andrade Lira, Maira Farias and Cerri, Ricardo and Cavalcante Prud{\^{e}}ncio, Ricardo Bastos},
  booktitle = {Explainable Artificial Intelligence. xAI 2026},
  series    = {Communications in Computer and Information Science},
  publisher = {Springer, Cham},
  year      = {2026}
}
```

## References
- CELS: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). CELS: Counterfactual Explanations for Time Series Data via Learned Saliency Maps. IEEE International Conference on Big Data 2023, pp. 1952-1957. [Paper] [Code]
- CoMTE: Ates, E., Aksar, B., Leung, V. J., & Coskun, A. K. (2021). Counterfactual Explanations for Multivariate Time Series. ICAPAI 2021. [Paper] [Code]
- NativeGuide: Delaney, E., Greene, D., & Keane, M. T. (2021). Instance-Based Counterfactual Explanations for Time Series Classification. ICCBR 2021. [Paper] [Code]
- SETS: Bahri, O., Filali Boubrahimi, S., & Hamdi, S. M. (2022). Shapelet-Based Counterfactual Explanations for Multivariate Time Series. KDD-MiLeTS 2022. [Paper] [Code]
- TSEvo: Hollig, J., Kulbach, C., & Thoma, S. (2022). TSEvo: Evolutionary Counterfactual Explanations for Time Series Classification. ICMLA 2022. [Paper] [Code]
- Glacier: Wang, Z., Samsten, I., Miliou, I., Mochaourab, R., & Papapetrou, P. (2024). Glacier: Guided Locally Constrained Counterfactual Explanations for Time Series Classification. Machine Learning, 113(3). [Paper] [Code]
- LatentCF++: Wang, Z., Samsten, I., Mochaourab, R., & Papapetrou, P. (2021). Learning Time Series Counterfactuals via Latent Space Representations. DS 2021. [Paper] [Code]
- Validity: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). CELS: Counterfactual Explanations for Time Series Data via Learned Saliency Maps. IEEE BigData 2023. [Paper]
- Proximity: Delaney, E., Greene, D., & Keane, M. T. (2021). ICCBR 2021; Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2022). Shapelet-Based Counterfactual Explanations for Multivariate Time Series. KDD-MiLeTS 2022.
- Sparsity: Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations. FAT* 2020. [Paper]
- Plausibility (LOF): Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. ACM SIGMOD 2000. [Paper]
- Plausibility (Isolation Forest): Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation Forest. ICDM 2008. [Paper]
- Diversity: Mothilal, R. K., Sharma, A., & Tan, C. (2020). FAT* 2020; Kulesza, A., & Taskar, B. (2012). Determinantal Point Processes for Machine Learning. [Paper]
- Composition, Contiguity: Delaney et al. (2021). ICCBR 2021; Ates et al. (2021). ICAPAI 2021.
- Confidence: Le, T., Miller, T., Singh, R., & Sonenberg, L. (2023). Explaining Model Confidence Using Counterfactuals. AAAI 2023. [Paper]
- Controllability: Verma, S., Boonsanong, V., Hoang, M., Hines, K. E., Dickerson, J. P., & Shah, C. (2024). Counterfactual Explanations for Machine Learning: Challenges Revisited. ACM Computing Surveys, 56(12), Article 304. [Paper]
- Robustness: Ates et al. (2021). ICAPAI 2021.
- Efficiency: Li, P., Bahri, O., Boubrahimi, S. F., & Hamdi, S. M. (2023). Attention-Based Counterfactual Explanation for Multivariate Time Series. DaWaK 2023. [Paper]
- UCR Archive: Dau, H. A., et al. (2019). The UCR Time Series Archive. IEEE/CAA Journal of Automatica Sinica, 6(6), 1293-1305. [Archive]
- aeon: Middlehurst, M., et al. (2024). aeon: a Python toolkit for learning from time series. JMLR. [Code]
MIT License - see LICENSE.