
Conversation


@adriangb adriangb commented Jan 9, 2026

I've now seen this pattern a couple of times, in our own codebase, working on apache/datafusion-comet#3047.

I was going to add an example but I think adding an API to handle it for users is a better experience.

This should also make it a bit easier to migrate from SchemaAdapter. In fact, I think it's possible to implement a SchemaAdapter using this as the foundation + some shim code. This won't be available in DF 51 to ease migration but it's easy enough to backport (just copy the code in this PR) for users that would find that helpful.

@adriangb adriangb requested a review from alamb January 9, 2026 17:03

alamb commented Jan 9, 2026

(should we undeprecate the schema adapter?)


alamb commented Jan 9, 2026

we can also do a 52.1.0 release too


adriangb commented Jan 9, 2026

I actually considered that and decided against it for a couple of reasons:

  1. There are extra methods on it that we won't be implementing, e.g. map_column_index. There are also API idiosyncrasies that I think are good to deprecate / replace (create_with_projected_schema and create each have their issues; there is no notion of a file / table / projected schema anymore, just an input and output schema).
  2. We don't need a trait (or nested traits); the dynamic behavior is entirely determined by PhysicalExprAdapter. So at most I'd consider un-deprecating DefaultSchemaAdapter.
  3. We can't really un-deprecate it from ParquetSource/FileScanConfig, so at most we'd un-deprecate it as a standalone thing for other ad-hoc uses. Supporting both SchemaAdapter and PhysicalExprAdapter at the same time in ParquetOpener would add a lot of complexity.

So given that we don't really want the trait, that we'd be deprecating half of the methods anyway, and that the other half could use a refactor / breaking changes to simplify the APIs, and that it wouldn't really help most of the use cases (FileScanConfig), I think it's best to make a new, simpler API.
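The "just an input and output schema" shape described in point 1 could be sketched roughly like this. All names here (Schema, Batch, BatchAdapterSketch, DefaultAdapter) are illustrative stand-ins using plain Rust types, not the actual API added in this PR or DataFusion's arrow types:

```rust
// Hypothetical sketch of an adapter that knows only an output schema,
// with no separate file / table / projected schema notions.
// `Schema` and `Batch` are toy stand-ins for illustration only.

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    fields: Vec<String>, // field names only, for illustration
}

// A "batch" is just named columns of i64 here.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<i64>)>,
}

trait BatchAdapterSketch {
    fn adapt(&self, batch: Batch) -> Batch;
}

/// Reorders columns to match the output schema and fills columns that
/// are missing from the input with a default value (0 here).
struct DefaultAdapter {
    output: Schema,
}

impl BatchAdapterSketch for DefaultAdapter {
    fn adapt(&self, batch: Batch) -> Batch {
        let num_rows = batch.columns.first().map(|(_, v)| v.len()).unwrap_or(0);
        let columns = self
            .output
            .fields
            .iter()
            .map(|name| {
                let values = batch
                    .columns
                    .iter()
                    .find(|(n, _)| n == name)
                    .map(|(_, v)| v.clone())
                    .unwrap_or_else(|| vec![0; num_rows]); // default-fill missing column
                (name.clone(), values)
            })
            .collect();
        Batch { columns }
    }
}

fn main() {
    let adapter = DefaultAdapter {
        output: Schema { fields: vec!["b".into(), "a".into(), "c".into()] },
    };
    let input = Batch {
        columns: vec![("a".into(), vec![1, 2]), ("b".into(), vec![3, 4])],
    };
    // "c" is missing from the input, so it is filled with defaults.
    let out = adapter.adapt(input);
    println!("{:?}", out.columns);
}
```

The point of the shape is that the adapter is constructed from the two schemas alone; anything like map_column_index or a projected-schema variant simply has no place to live.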


adriangb commented Jan 9, 2026

> we can also do a 52.1.0 release too

I see no reason why this can't be included in a 52.1.0 release :)


alamb commented Jan 13, 2026

> > we can also do a 52.1.0 release too
>
> I see no reason why this can't be included in a 52.1.0 release :)

Speaking of which, here it is:

@alamb left a comment

Thanks for this @adriangb -- I think it is a useful API for sure. However I wonder if we can make it easier for downstream users (see comment inline)

}
}

/// Factory for creating [`BatchAdapter`] instances to adapt record batches

This looks like a useful API for sure (and we hit the same thing when upgrading to 52.0.0 internally)

It looks almost, but not quite, the same as SchemaAdapterFactory / Mapper. In other words, the API is different from SchemaAdapterFactory, but it looks to me like we could have it do the same thing.

To assist with people upgrading, is there any way we can un-deprecate SchemaAdapterFactory? For example, maybe we could move SchemaAdapterFactory into physical-expr-adapter and leave a reference back to the old location?

That way, people who have code that uses the old interface could continue to use it with minimal disruption.

@comphead commented Jan 13, 2026

The deprecated API was not fully equivalent. The mapper provided map_schema, which can be replaced by PhysicalExprAdapter, and also map_batch, which is challenging (if even possible) to support with the new API.

@adriangb commented Jan 13, 2026

Yes, agreed. I give my reasoning in #19716 (comment), but the TL;DR is that it would be hard / impossible to replicate the exact semantics and APIs of SchemaAdapter, so any un-deprecated version would only be half functional and would probably cause more headache than it's worth.

@adriangb commented

@alamb I'm curious why you need this / why implementing a custom PhysicalExprAdapter isn't enough?

@mbutrovich mbutrovich requested review from comphead and removed request for comphead January 13, 2026 16:07
@mbutrovich commented

@comphead FYI for Comet stuff.


comphead commented Jan 13, 2026

Thanks @adriangb, I missed this PR somehow. Actually, in Comet we were experimenting with two main directions:

  • apply a casted stream on top of the scan stream so we can still manage batch-to-batch mapping (potentially it could affect filter pushdown)
  • implement a custom DataFusion RB->RB callback mapper, which looks like this PR

I'll check this thoroughly today, thanks for thinking of it ahead of us ))


adriangb commented Jan 13, 2026

> apply casted stream on top of scan stream so we still can manage batch to batch mapping (potentially it could affect filter pushdown)

I'm a bit confused, isn't it pretty much the same thing? What we do in the Parquet opener is `stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })`, which essentially builds a "casted" stream.

It also doesn't really seem like something you should have to do: if you provide the casting rules you want via PhysicalExprAdapter, the Parquet opener will take care of essentially what this PR is doing and apply it to the stream.
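The map-over-the-stream pattern described here can be sketched with a plain iterator standing in for the async stream. This is a simplified, hypothetical sketch: Projector and the Vec&lt;i64&gt; "batches" are stand-ins for DataFusion's real Stream of Result&lt;RecordBatch&gt;:

```rust
// Simplified sketch of the Parquet opener's `stream.map(...)` pattern.
// A plain Iterator of Result<Vec<i64>, String> stands in for an async
// stream of Result<RecordBatch>; `Projector` is a toy stand-in.

struct Projector {
    scale: i64, // stands in for the cast / default-value expressions
}

impl Projector {
    // Projects one "batch"; in DataFusion this would evaluate the
    // adapted physical expressions against the RecordBatch.
    fn project(&self, batch: Vec<i64>) -> Result<Vec<i64>, String> {
        Ok(batch.into_iter().map(|v| v * self.scale).collect())
    }
}

fn adapt_stream(
    stream: impl Iterator<Item = Result<Vec<i64>, String>>,
    projector: Projector,
) -> impl Iterator<Item = Result<Vec<i64>, String>> {
    // Same shape as:
    //   stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })
    stream.map(move |maybe_batch| {
        let batch = maybe_batch?;
        projector.project(batch)
    })
}

fn main() {
    // Errors pass through untouched; successful batches are projected.
    let scan = vec![Ok(vec![1, 2]), Err("io error".to_string()), Ok(vec![3])].into_iter();
    let adapted: Vec<_> = adapt_stream(scan, Projector { scale: 10 }).collect();
    println!("{:?}", adapted);
}
```

Whether the projection is applied by the opener or by a wrapper stream, the shape of the transformation is the same, which is why the two approaches look nearly equivalent.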

@adriangb adriangb force-pushed the batch-mapping-adapter branch from 76aec5b to d39a948 Compare January 13, 2026 16:29
@comphead commented

> > apply casted stream on top of scan stream so we still can manage batch to batch mapping (potentially it could affect filter pushdown)
>
> I'm a bit confused, isn't it pretty much the same thing? What we do in the Parquet opener is `stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })`, which essentially builds a "casted" stream.
>
> It also doesn't really seem like something you should have to do: if you provide the casting rules you want via PhysicalExprAdapter, the Parquet opener will take care of essentially what this PR is doing and apply it to the stream.

Unfortunately it is slightly more than just casting (applying default values, unifying schemas, etc.); we're doing some RB->RB modification just after the scan. Hopefully we can do this better in the future, as this part is expensive.


adriangb commented Jan 13, 2026

> unfortunately it is slightly more than just casting (applying default values, unifying schemas, etc); we're doing some RB->RB modification just after the scan, hopefully we can do this better in future as this part is expensive.

From my investigation in apache/datafusion-comet#3047, it seemed that something like BatchAdapter might reduce some LOC in iceberg_scan.rs but isn't strictly necessary. Default value injection, unifying schemas, etc. should all be handled by PhysicalExprAdapter (SparkPhysicalExprAdapterFactory for you), and you don't need to do any stream mapping with that: you pass it into FileScanConfigBuilder::with_expr_adapter and it applies it to the stream.
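The expression-level approach being advocated here (default value injection and schema unification planned once, rather than rewritten per batch) can be sketched as follows. Expr and plan_columns are hypothetical stand-ins, not DataFusion's actual PhysicalExprAdapter API:

```rust
// Hypothetical sketch of expression-level adaptation: instead of
// rewriting each RecordBatch, build one expression per output column
// up front -- a column reference when the file has the column, a
// literal default when it does not. Every batch is then evaluated
// against this fixed plan.

#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Column(usize), // index into the file schema
    Literal(i64),  // injected default value for a missing column
}

fn plan_columns(file_schema: &[&str], table_schema: &[&str]) -> Vec<Expr> {
    table_schema
        .iter()
        .map(|name| match file_schema.iter().position(|f| f == name) {
            Some(idx) => Expr::Column(idx),
            None => Expr::Literal(0), // default for a column the file lacks
        })
        .collect()
}

fn main() {
    // File is missing column "c"; the plan injects a literal default once,
    // so no per-batch RB->RB rewriting logic is needed.
    let plan = plan_columns(&["a", "b"], &["a", "b", "c"]);
    println!("{:?}", plan);
}
```

Doing the work once at planning time is what makes the expression-level approach cheaper than RB->RB modification after the scan.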

@comphead commented

> that something like BatchAdapter might reduce some LOC

Correct, I'll try to do a migration today based on this PR; with the adapter it is more promising.

And you are right, we should also do more at the expression level than at the RB level, but this would require quite some investigation.


adriangb commented Jan 13, 2026

> And you are right, we also should do more on expression level than on RB level, but this would be quite some investigation to be done

I may be missing something, but I think the SparkPhysicalExprAdapterFactory I made in apache/datafusion-comet#3047 covers all of the functionality of your old SchemaAdapter. Anyway, we should probably continue that discussion in apache/datafusion-comet#3046

@comphead commented

> Anyway we should probably continue that discussion in ....

Sounds good, just started a discussion in apache/datafusion-comet#3047 (comment)
