
Conversation


@adriangb adriangb commented Jan 9, 2026

I've now seen this pattern a couple of times, in our own codebase, working on apache/datafusion-comet#3047.

I was going to add an example but I think adding an API to handle it for users is a better experience.

This should also make it a bit easier to migrate from SchemaAdapter. In fact, I think it's possible to implement a SchemaAdapter using this as the foundation + some shim code. This won't be available in DF 51 to ease migration but it's easy enough to backport (just copy the code in this PR) for users that would find that helpful.

@adriangb adriangb requested a review from alamb January 9, 2026 17:03

alamb commented Jan 9, 2026

(should we undeprecate the schema adapter?)


alamb commented Jan 9, 2026

we can also do a 52.1.0 release too


adriangb commented Jan 9, 2026

I actually considered that and decided against it for a couple of reasons:

  1. There are extra methods on it that we won't be implementing, e.g. map_column_index. There are also API idiosyncrasies that I think are good to deprecate / replace (create_with_projected_schema and create each have their issues; there is no notion of a file / table / projected schema anymore, just an input and output schema).
  2. We don't need a trait (or nested traits); the dynamic behavior is entirely determined by PhysicalExprAdapter. So at most I'd consider un-deprecating DefaultSchemaAdapter.
  3. We can't really un-deprecate it from ParquetSource/FileScanConfig, so at most we'd un-deprecate it as a standalone thing for other ad-hoc uses. Supporting both SchemaAdapter and PhysicalExprAdapter at the same time in ParquetOpener would add a lot of complexity.

So given that we don't really want the trait, that we'd be deprecating half of the methods anyway, and that the other half could use a refactor / breaking changes to simplify the APIs, and that it wouldn't really help most of the use cases (FileScanConfig), I think it's best to make a new, simpler API.
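The "just an input and output schema" shape described in point 1 could be sketched roughly like this. All names here (Schema, Batch, BatchAdapterSketch, DefaultAdapter) are illustrative stand-ins using plain Rust types, not the actual API added in this PR or DataFusion's arrow types:

```rust
// Hypothetical sketch of an adapter that knows only an output schema,
// with no separate file / table / projected schema notions.
// `Schema` and `Batch` are toy stand-ins for illustration only.

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    fields: Vec<String>, // field names only, for illustration
}

// A "batch" is just named columns of i64 here.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<i64>)>,
}

trait BatchAdapterSketch {
    fn adapt(&self, batch: Batch) -> Batch;
}

/// Reorders columns to match the output schema and fills columns that
/// are missing from the input with a default value (0 here).
struct DefaultAdapter {
    output: Schema,
}

impl BatchAdapterSketch for DefaultAdapter {
    fn adapt(&self, batch: Batch) -> Batch {
        let num_rows = batch.columns.first().map(|(_, v)| v.len()).unwrap_or(0);
        let columns = self
            .output
            .fields
            .iter()
            .map(|name| {
                let values = batch
                    .columns
                    .iter()
                    .find(|(n, _)| n == name)
                    .map(|(_, v)| v.clone())
                    .unwrap_or_else(|| vec![0; num_rows]); // default-fill missing column
                (name.clone(), values)
            })
            .collect();
        Batch { columns }
    }
}

fn main() {
    let adapter = DefaultAdapter {
        output: Schema { fields: vec!["b".into(), "a".into(), "c".into()] },
    };
    let input = Batch {
        columns: vec![("a".into(), vec![1, 2]), ("b".into(), vec![3, 4])],
    };
    // "c" is missing from the input, so it is filled with defaults.
    let out = adapter.adapt(input);
    println!("{:?}", out.columns);
}
```

The point of the shape is that the adapter is constructed from the two schemas alone; anything like map_column_index or a projected-schema variant simply has no place to live.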


adriangb commented Jan 9, 2026

> we can also do a 52.1.0 release too

I see no reason why this can't be included in a 52.1.0 release :)


alamb commented Jan 13, 2026

> > we can also do a 52.1.0 release too
>
> I see no reason why this can't be included in a 52.1.0 release :)

Speaking of which, here it is:

@alamb left a comment

Thanks for this @adriangb -- I think it is a useful API for sure. However I wonder if we can make it easier for downstream users (see comment inline)

}
}

/// Factory for creating [`BatchAdapter`] instances to adapt record batches

This looks like a useful API for sure (and we hit the same thing when upgrading to 52.0.0 internally)

It looks almost, but not quite, the same as SchemaAdapterFactory / Mapper. In other words, the API is different from SchemaAdapterFactory, but it looks to me like we could have it do the same thing.

To assist with people upgrading, is there any way we can un-deprecate SchemaAdapterFactory? For example, maybe we could move SchemaAdapterFactory into physical-expr-adapter and leave a reference back to the old location?

That way, people who have code that uses the old interface could continue to use it with minimal disruption.

@comphead commented Jan 13, 2026

The deprecated API was not fully equivalent. The mapper provided map_schema, which can be replaced by PhysicalExprAdapter, and also map_batch, which is challenging (if even possible) to support with the new API.

@adriangb commented Jan 13, 2026

Yes, agreed. I give my reasoning in #19716 (comment), but the TL;DR is that it would be hard / impossible to replicate the exact semantics and APIs of SchemaAdapter, so any un-deprecated version would only be half functional and would probably cause more headache than it's worth.

@adriangb commented

@alamb I'm curious why you need this / why implementing a custom PhysicalExprAdapter isn't enough?

@mbutrovich mbutrovich requested review from comphead and removed request for comphead January 13, 2026 16:07
@mbutrovich commented

@comphead FYI for Comet stuff.


comphead commented Jan 13, 2026

Thanks @adriangb, I missed this PR somehow. Actually, in Comet we were experimenting with two main directions:

  • apply a casted stream on top of the scan stream so we can still manage batch-to-batch mapping (potentially it could affect filter pushdown)
  • implement a custom DataFusion RB->RB callback mapper, which looks like this PR

I'll check this thoroughly today, thanks for thinking of it ahead of us ))


adriangb commented Jan 13, 2026

> apply casted stream on top of scan stream so we still can manage batch to batch mapping (potentially it could affect filter pushdown)

I'm a bit confused, isn't it pretty much the same thing? What we do in the Parquet opener is `stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })`, which essentially builds a "casted" stream.

It also doesn't really seem like something you should have to do: if you provide the casting rules you want via PhysicalExprAdapter, the Parquet opener will take care of essentially what this PR is doing and apply it to the stream.
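The map-over-the-stream pattern described here can be sketched with a plain iterator standing in for the async stream. This is a simplified, hypothetical sketch: Projector and the Vec&lt;i64&gt; "batches" are stand-ins for DataFusion's real Stream of Result&lt;RecordBatch&gt;:

```rust
// Simplified sketch of the Parquet opener's `stream.map(...)` pattern.
// A plain Iterator of Result<Vec<i64>, String> stands in for an async
// stream of Result<RecordBatch>; `Projector` is a toy stand-in.

struct Projector {
    scale: i64, // stands in for the cast / default-value expressions
}

impl Projector {
    // Projects one "batch"; in DataFusion this would evaluate the
    // adapted physical expressions against the RecordBatch.
    fn project(&self, batch: Vec<i64>) -> Result<Vec<i64>, String> {
        Ok(batch.into_iter().map(|v| v * self.scale).collect())
    }
}

fn adapt_stream(
    stream: impl Iterator<Item = Result<Vec<i64>, String>>,
    projector: Projector,
) -> impl Iterator<Item = Result<Vec<i64>, String>> {
    // Same shape as:
    //   stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })
    stream.map(move |maybe_batch| {
        let batch = maybe_batch?;
        projector.project(batch)
    })
}

fn main() {
    // Errors pass through untouched; successful batches are projected.
    let scan = vec![Ok(vec![1, 2]), Err("io error".to_string()), Ok(vec![3])].into_iter();
    let adapted: Vec<_> = adapt_stream(scan, Projector { scale: 10 }).collect();
    println!("{:?}", adapted);
}
```

Whether the projection is applied by the opener or by a wrapper stream, the shape of the transformation is the same, which is why the two approaches look nearly equivalent.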

@adriangb adriangb force-pushed the batch-mapping-adapter branch from 76aec5b to d39a948 Compare January 13, 2026 16:29
@comphead commented

> > apply casted stream on top of scan stream so we still can manage batch to batch mapping (potentially it could affect filter pushdown)
>
> I'm a bit confused, isn't it pretty much the same thing? What we do in the Parquet opener is `stream.map(|maybe_batch| { let batch = maybe_batch?; projector.project(batch) })`, which essentially builds a "casted" stream.
>
> It also doesn't really seem like something you should have to do: if you provide the casting rules you want via PhysicalExprAdapter, the Parquet opener will take care of essentially what this PR is doing and apply it to the stream.

Unfortunately it is slightly more than just casting (applying default values, unifying schemas, etc.); we're doing some RB->RB modification just after the scan. Hopefully we can do this better in the future, as this part is expensive.


adriangb commented Jan 13, 2026

> unfortunately it is slightly more than just casting (applying default values, unifying schemas, etc); we're doing some RB->RB modification just after the scan, hopefully we can do this better in future as this part is expensive.

From my investigation in apache/datafusion-comet#3047, it seemed that something like BatchAdapter might reduce some LOC in iceberg_scan.rs but isn't strictly necessary. Default value injection, unifying schemas, etc. should all be handled by PhysicalExprAdapter (SparkPhysicalExprAdapterFactory for you), and you don't need to do any stream mapping with that: you pass it into FileScanConfigBuilder::with_expr_adapter and it applies it to the stream.
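The expression-level approach being advocated here (default value injection and schema unification planned once, rather than rewritten per batch) can be sketched as follows. Expr and plan_columns are hypothetical stand-ins, not DataFusion's actual PhysicalExprAdapter API:

```rust
// Hypothetical sketch of expression-level adaptation: instead of
// rewriting each RecordBatch, build one expression per output column
// up front -- a column reference when the file has the column, a
// literal default when it does not. Every batch is then evaluated
// against this fixed plan.

#[derive(Debug, PartialEq, Clone)]
enum Expr {
    Column(usize), // index into the file schema
    Literal(i64),  // injected default value for a missing column
}

fn plan_columns(file_schema: &[&str], table_schema: &[&str]) -> Vec<Expr> {
    table_schema
        .iter()
        .map(|name| match file_schema.iter().position(|f| f == name) {
            Some(idx) => Expr::Column(idx),
            None => Expr::Literal(0), // default for a column the file lacks
        })
        .collect()
}

fn main() {
    // File is missing column "c"; the plan injects a literal default once,
    // so no per-batch RB->RB rewriting logic is needed.
    let plan = plan_columns(&["a", "b"], &["a", "b", "c"]);
    println!("{:?}", plan);
}
```

Doing the work once at planning time is what makes the expression-level approach cheaper than RB->RB modification after the scan.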

@comphead commented

> that something like BatchAdapter might reduce some LOC

Correct, I'll try to do a migration today based on this PR; with the adapter it is more promising.

And you are right, we should also do more at the expression level than at the RB level, but this would require quite some investigation.


adriangb commented Jan 13, 2026

> And you are right, we also should do more on expression level than on RB level, but this would be quite some investigation to be done

I may be missing something, but I think the SparkPhysicalExprAdapterFactory I made in apache/datafusion-comet#3047 covers all of the functionality of your old SchemaAdapter. Anyway, we should probably continue that discussion in apache/datafusion-comet#3046

@comphead commented

> Anyway we should probably continue that discussion in ....

Sounds good, just started a discussion in apache/datafusion-comet#3047 (comment)
