Skip to content

Split proto serialization to encapsulate private state (#21835)#21929

Open
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:split-proto-serialization
Open

Split proto serialization to encapsulate private state (#21835)#21929
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:split-proto-serialization

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented Apr 29, 2026

Which issue does this PR close?

Rationale for this change

datafusion-proto serializes every built-in PhysicalExpr through a single
~300-line downcast_ref chain, with a mirror match on the decode side. That
chain lives outside the crate where each expression is defined, so every field
an expression wants to round-trip has to be made pub. #21807 is the
cautionary tale: it had to add five pub "proto-only, not stable" items to
DynamicFilterPhysicalExpr just to serialize an RwLock-wrapped inner.

This PR adds the infrastructure so a PhysicalExpr can serialize itself and
keep its state private.

What changes are included in this PR?

A PhysicalExpr can now opt into serializing itself, in both directions:

fn try_to_proto(&self, ctx: &PhysicalExprEncodeCtx) -> Result<Option<PhysicalExprNode>>
fn try_from_proto(node: &PhysicalExprNode, ctx: &PhysicalExprDecodeCtx) -> Result<Arc<dyn PhysicalExpr>>

try_to_proto returning Ok(None) (the default) means "fall through to the
old downcast chain", so the change is purely additive — nothing is forced to
migrate.

Column and BinaryExpr are migrated as working demos; everything
else stays on the old path and migrates later, one expression at a time, with
no wire-format change.

Five stacked commits, each builds green on its own and is independently
reviewable (or splittable into its own PR):

  1. Extract datafusion-proto-models crate — move the .proto file and
    prost-generated types into a lightweight crate (mirrors the existing
    datafusion-proto-common split).
  2. Add the try_to_proto hook — feature-gated, off by default.
  3. Migrate Column encode.
  4. Add the decode side and migrate Column decode.
  5. Migrate BinaryExpr (both directions).

A few design decisions worth flagging

  • FromProto / TryFromProto traits instead of plain From / TryFrom.
    Once the prost types move into their own crate they are foreign to
    datafusion-proto, and the orphan rule forbids impl From<&protobuf::X> for Y
    when both X and Y are foreign. So those conversions become FromProto /
    TryFromProto traits in datafusion_proto::convert, and callers go from
    (&x).into() to Y::from_proto(&x). This is a known workaround, not the end
    state — see Future work.
  • The ctx is a concrete struct, not &dyn. PhysicalExprEncodeCtx /
    PhysicalExprDecodeCtx wrap a sealed dispatch trait. Keeping them concrete
    keeps &dyn out of every expression's signature and gives a stable place to
    add helpers (UDF encoding, registry hooks) later without churning a public
    trait.
  • try_from_proto takes the whole PhysicalExprNode, not the
    pre-unwrapped variant payload, so every expression's decoder has the same
    signature and can still see outer-node fields like expr_id.

Are these changes tested?

No new behavior, so no new tests. Column and BinaryExpr produce and consume
the same wire format as before; the existing roundtrip_physical_plan /
roundtrip_physical_expr tests already cover both directions and now exercise
the new path.

Are there any user-facing changes?

Small API breaks in datafusion-proto:

  • try_from_physical_plan_with_converter / try_into_physical_plan_with_converter
    move to a PhysicalPlanNodeExt trait — callers add
    use datafusion_proto::physical_plan::PhysicalPlanNodeExt;.
  • Foreign-foreign From / TryFrom conversions become FromProto /
    TryFromProto (see Design decisions above).
  • datafusion_proto::generated::* is deprecated in favor of
    datafusion_proto::protobuf; it still works.

The new proto feature on datafusion-physical-expr(-common) is off by
default, so crates that don't serialize plans pay nothing.

Future work

  • Migrate the remaining built-in expressions — including
    DynamicFilterPhysicalExpr, the original motivation — one per follow-up PR.
  • Apply the same pattern to ExecutionPlan serialization.
  • Drop the FromProto / TryFromProto workaround: collapse
    datafusion-proto-common into datafusion-proto-models and push the
    conversion impls down to the target-type crates so callers use plain From /
    TryFrom again. Full dep-graph analysis and a step-by-step plan are in
    #21835 (comment).

🤖 Generated with Claude Code

@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates proto Related to proto crate development-process Related to development process of DataFusion labels Apr 29, 2026
@adriangb adriangb force-pushed the split-proto-serialization branch from 9302fa9 to 5360d7d Compare April 29, 2026 16:33
Comment thread .github/workflows/rust.yml
Comment thread datafusion/proto-models/LICENSE.txt
Comment thread datafusion/proto/src/logical_plan/file_formats.rs
@adriangb
Copy link
Copy Markdown
Contributor Author

@timsaucer wonder if you'd be willing to do an initial review on this change since we discussed it in the call yesterday?

@adriangb adriangb force-pushed the split-proto-serialization branch from 2a29583 to 0e6e8b0 Compare May 14, 2026 05:58
@adriangb adriangb marked this pull request as ready for review May 14, 2026 18:54
@adriangb adriangb force-pushed the split-proto-serialization branch from 0e6e8b0 to 32deeb5 Compare May 17, 2026 07:32
@github-actions github-actions Bot added the logical-expr Logical plan and expressions label May 17, 2026
@timsaucer timsaucer self-requested a review May 17, 2026 13:43
@timsaucer
Copy link
Copy Markdown
Member

I've started reviewing but this is a big one and it will take a bit

@adriangb adriangb force-pushed the split-proto-serialization branch from 222d25d to 8094825 Compare May 17, 2026 16:06
@adriangb
Copy link
Copy Markdown
Contributor Author

I've started reviewing but this is a big one and it will take a bit

I've tried to restack it to clean it up. But yes I know it's big, if you have any ideas for how to split it up or make it easier to review I'm all ears.

Mirror the existing datafusion-proto-common split for the physical/logical
plan schemas. The new crate contains only the .proto file and the prost-
generated Rust types, with no datafusion deps beyond datafusion-proto-common.

datafusion-proto re-exports the proto types via its existing protobuf module
(and keeps the legacy datafusion_proto::generated::* paths working behind a
deprecation notice), so downstream consumers (datafusion-ffi,
datafusion-examples, benchmarks) continue to work without Cargo.toml changes.

Because the prost-generated types are now foreign to datafusion-proto, the
orphan rule forbids:
  - inherent impl blocks on protobuf::PhysicalPlanNode
  - From/TryFrom impls between proto types and types from datafusion-common,
    datafusion-expr, datafusion-datasource (foreign + foreign)

To work around this:
  - Inherent impls on protobuf::PhysicalPlanNode are converted to a new
    PhysicalPlanNodeExt trait (callers must `use` it to call the methods).
  - Foreign-foreign From/TryFrom impls are converted to FromProto/TryFromProto
    traits defined in a new datafusion_proto::convert module.

CI configuration (rust.yml, licenserc.toml, rat_exclude_files.txt) is updated
for the new crate in the same commit so it builds green standalone.

This is groundwork for a follow-up that adds PhysicalExpr::try_to_proto so
expressions can serialize private state without exposing pub-for-proto
scaffolding (issue apache#21835).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the split-proto-serialization branch from 8094825 to c7d42ab Compare May 17, 2026 16:18
Copy link
Copy Markdown
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the description to have the answers I was looking for eventually, but the llm generated text had so much content I didn't want to read it all just to understand what is actually not a very complex refactor. This held me up more than anything. It would have been more helpful to have a more pithy description that got me right to the useful pieces.

I'm finding for my own PRs that I need to make a lot of edits to these generated descriptions because they're not always great at isolating what a reviewer really needs to understand vs cataloging all the work that was done.

Comment on lines +665 to +673
/// Takes the whole [`PhysicalExprNode`] — the exact inverse of what
/// [`PhysicalExpr::try_to_proto`] produces — so every expression's
/// `try_from_proto` shares one signature. The operator string is parsed
/// via the canonical [`Operator::from_proto_name`] mapping, so no `op`
/// argument needs to be threaded in by the caller.
///
/// [`PhysicalExprNode`]: datafusion_proto_models::protobuf::PhysicalExprNode
/// [`PhysicalExpr::try_to_proto`]: datafusion_physical_expr_common::physical_expr::PhysicalExpr::try_to_proto
/// [`PhysicalExprDecodeCtx::decode`]: datafusion_physical_expr_common::physical_expr::proto_decode::PhysicalExprDecodeCtx::decode
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like some generic text that doesn't necessarily need to be in each implementation

Comment on lines -82 to -83
impl From<&CsvOptionsProto> for CsvOptions {
fn from(proto: &CsvOptionsProto) -> Self {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and also from_factory probably needs a line in the upgrade doc

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same argument holds for json options, parquet, etc

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, from_factory wasn't pub so maybe this argument is moot.

}
}

impl TryFrom<protobuf::WindowFrame> for WindowFrame {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH Since the methods are so similar it feels like keeping impl TryFrom is better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I spent way more time than I wanted to understand the problem with the dependencies and why you had to introduce this.

@timsaucer
Copy link
Copy Markdown
Member

LMK if you want me to take another look since you did another push since I reviewed

adriangb and others added 2 commits May 17, 2026 09:25
Adds an opt-in `proto` feature on `datafusion-physical-expr-common` that
exposes `PhysicalExpr::try_to_proto`:

```rust
fn try_to_proto(
    &self,
    ctx: &PhysicalExprEncodeCtx<'_>,
) -> Result<Option<PhysicalExprNode>> { Ok(None) }
```

`Ok(None)` (the default) means "fall through" — `datafusion-proto`'s
existing downcast chain handles the expression. Returning `Ok(Some(node))`
lets the expression serialize itself, so types with private state
(`DynamicFilterPhysicalExpr`'s `RwLock`-wrapped inner, etc.) no longer
need pub-for-proto accessors.

The `try_` prefix is deliberate: the method is fallible, and it pairs with
the fallible `try_from_proto` decode constructors added in later commits
(and the `TryFromProto` trait in `datafusion-proto`).

Expression authors receive a concrete `PhysicalExprEncodeCtx` struct, not a
`&dyn` trait object. It wraps a sealed `PhysicalExprEncode` dispatch trait
(implemented by `ConverterEncoder` in `datafusion-proto`, which routes
through the existing `PhysicalExtensionCodec` + converter and preserves
dedup). Keeping the context a struct keeps `&dyn` out of every expression's
signature and gives a stable place to add helpers (UDF/UDAF/UDWF encoding,
registry hooks) without expanding a public trait surface — mirroring how
`PhysicalExprDecodeCtx` is shaped on the decode side.

`serialize_physical_expr_with_converter` calls `expr.try_to_proto(...)`
first, falling back to the existing chain on `Ok(None)`. No expression has
migrated yet, so behavior is unchanged. The feature is off by default on
`physical-expr-common`; `datafusion-proto` flips it on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First expression to use the new `try_to_proto` hook, demonstrating the
migration pattern. The downcast arm in `serialize_physical_expr_with_converter`
is removed; the trait dispatch path now produces the same `PhysicalExprNode`.

`datafusion-physical-expr` gains a passthrough `proto` feature that turns on
`datafusion-physical-expr-common/proto` and pulls in `datafusion-proto-models`
for the proto types. `datafusion-proto` enables it.

Other built-in expressions (`UnKnownColumn`, `Literal`, `BinaryExpr`,
`DynamicFilterPhysicalExpr`, etc.) still go through the downcast chain and
will migrate one at a time in follow-ups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb
Copy link
Copy Markdown
Contributor Author

Thanks so much @timsaucer !

LMK if you want me to take another look since you did another push since I reviewed

It's the same content, I was just re-organizing the commits to make it easier to review 😄. I think it would be good for @alamb to take a look at this, especially since we can wait to ship it until after 54 has been released (so no big rush).

I found the description to have the answers I was looking for eventually, but the llm generated text had so much content I didn't want to read it all just to understand what is actually not a very complex refactor. This held me up more than anything. It would have been more helpful to have a more pithy description that got me right to the useful pieces.

I'm finding for my own PRs that I need to make a lot of edits to these generated descriptions because they're not always great at isolating what a reviewer really needs to understand vs cataloging all the work that was done.

Sorry about this. I agree with you. I usually leave the LLM generated descriptions up for draft PRs or where they are "fine" and better than "I'm a lazy human so really didn't write much at all" but I do try to rewrite them where it would be helpful (e.g. incidentally I was just rewriting the one in #22300 by hand while you were reviewing this). Just bad timing / judgment on this one. I'll update it for the next reviewer and note the feedback.

@adriangb adriangb force-pushed the split-proto-serialization branch from c7d42ab to 1d5a354 Compare May 17, 2026 16:27
Adds the decode-side counterpart to `PhysicalExprEncodeCtx`. The public
surface is a single `decode(node)` method on the context — the central
match (and, in future, third-party registry lookups) live behind it.

Per-expression contract:

```rust
impl Column {
    pub fn try_from_proto(
        node: &PhysicalExprNode,
        ctx: &PhysicalExprDecodeCtx,
    ) -> Result<Arc<dyn PhysicalExpr>>;
}
```

`try_from_proto` takes the whole `PhysicalExprNode` — the exact inverse of
what `try_to_proto` returns — and unwraps its own `ExprType` variant. That
keeps one signature for every expression (trait-ready) and gives decoders
access to outer-node fields such as `expr_id`. The central match in
`parse_physical_expr_with_converter` only routes to the right constructor.

`Column` is the first expression to migrate. Other variants stay inline;
they migrate in follow-ups, same shape, same trick (one expression per
commit, central match keeps shrinking).

Removes the temporary `FromProto<&PhysicalColumn> for Column` impl
introduced in the proto-models extraction — `Column` now owns both
directions of its own serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second expression to use the new hooks, following the same pattern as
Column. The BinaryExpr downcast arm in
`serialize_physical_expr_with_converter` and the inline match arm in
`parse_physical_expr_with_converter` are removed; both directions now
live on `BinaryExpr` itself.

Linearization of nested same-op chains stays bit-for-bit identical with
the previous behavior, just expressed in terms of
`PhysicalExprEncodeCtx::encode_child` / `PhysicalExprDecodeCtx::decode`
instead of the proto converter.

`BinaryExpr::try_from_proto` takes the whole `PhysicalExprNode` and parses
the operator string itself. The proto-string-to-enum mapping moves from
`from_proto_binary_op` in `datafusion-proto` down to
`Operator::from_proto_name` in `datafusion-expr-common`, so the logical
plan path and the physical `BinaryExpr` decoder share one source of truth
without `physical-expr` depending on `datafusion-proto`; `from_proto_binary_op`
now delegates to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the split-proto-serialization branch from 1d5a354 to 89d6cf7 Compare May 17, 2026 16:33
fn try_to_proto(
&self,
_ctx: &proto_encode::PhysicalExprEncodeCtx<'_>,
) -> Result<Option<datafusion_proto_models::protobuf::PhysicalExprNode>> {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timsaucer do you think we should make this Result<...> instead of Result<Option<...>>? If we ever plan to remove the default impl, remove the codecs and force ourselves to implement this method on all expressions we produce then the default returning an error would make sense IMO. The price we pay is that in the meantime it would be hard to differentiate between "I called an expression with an implementation and it error, I should bubble that up to users so they can fix it instead of getting an unknown expression Column type error (by falling back to the match -> codec, etc.)" vs. "this expression hasn't been updated yet".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Split proto serialization to encapsulate private state

2 participants