From @borchero:
This is still a very open idea. When comparing pipeline outputs, we are often interested in how model predictions or scores change. To see this, we usually generate scatter plots or confusion matrices.
Potentially, we could support this to some extent via diffly?
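
For illustration, the comparison described above could look roughly like the sketch below. It is plain Python and does not use diffly's API; the function name `prediction_confusion`, the `id` key, and the `label` column are all hypothetical. It matches rows from two pipeline outputs on a shared key and counts how predicted labels moved between them, which is the data behind a confusion matrix.

```python
from collections import Counter


def prediction_confusion(left, right, key, column):
    """Count (left value, right value) pairs for rows whose key is in both outputs.

    Hypothetical helper, not part of diffly. Rows present in only one
    output are skipped here.
    """
    right_by_key = {row[key]: row[column] for row in right}
    counts = Counter()
    for row in left:
        if row[key] in right_by_key:
            counts[(row[column], right_by_key[row[key]])] += 1
    return counts


# Made-up example data: predictions from two pipeline runs.
old_run = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": "dog"},
    {"id": 3, "label": "dog"},
    {"id": 4, "label": "cat"},  # only in the old run
]
new_run = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": "cat"},  # prediction changed from "dog" to "cat"
    {"id": 3, "label": "dog"},
    {"id": 5, "label": "dog"},  # only in the new run
]

confusion = prediction_confusion(old_run, new_run, key="id", column="label")
for (old_label, new_label), n in sorted(confusion.items()):
    print(f"{old_label} -> {new_label}: {n}")
```

For continuous scores, the same key match would give (old, new) pairs that can be fed into a scatter plot.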