feat: expose variety of features from DF54 update #1554
Open
timsaucer wants to merge 9 commits into
DataFusion 53 deprecated `TableFunctionImpl::call(args: &[Expr])` in favor of `call_with_args(args: TableFunctionArgs)`. `PyTableFunction` was migrated in 5a64b0d; this brings the FFI example along so it no longer relies on the deprecated entry point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR apache#1541 introduced `with_logical_extension_codec` / `with_physical_extension_codec` setters typed as `codec: Any`. The Rust extractors accept either a raw `PyCapsule` or any object exposing `__datafusion_logical_extension_codec__` / `__datafusion_physical_extension_codec__`. Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` Protocols in `python/datafusion/user_defined.py` (matching the existing `ScalarUDFExportable` pattern) and tighten both setter signatures to `Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
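The structural-typing pattern described in this commit can be sketched with `typing.Protocol`. This is a minimal illustration, not the code from `python/datafusion/user_defined.py`: the hook's signature and return type are assumptions, `MyCodec` and the stand-in setter are hypothetical, and `runtime_checkable` is added only so the sketch can be checked at runtime (the commit itself is a pure typing change).

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LogicalExtensionCodecExportable(Protocol):
    """Structural type: any object exposing the codec export hook."""

    # Assumption: the hook takes no arguments and returns a capsule-like object.
    def __datafusion_logical_extension_codec__(self) -> Any: ...


class MyCodec:
    """Hypothetical user class; it never inherits from the Protocol."""

    def __datafusion_logical_extension_codec__(self) -> Any:
        return object()  # placeholder for a real PyCapsule


def with_logical_extension_codec(codec: LogicalExtensionCodecExportable) -> str:
    # Stand-in setter: only reports which type it was handed.
    return type(codec).__name__


accepted = with_logical_extension_codec(MyCodec())
is_exportable = isinstance(MyCodec(), LogicalExtensionCodecExportable)
is_plain_object_exportable = isinstance(object(), LogicalExtensionCodecExportable)
```

A static type checker would accept `MyCodec()` here because it has the required method, and reject a plain `object()`, without either class importing the Protocol.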
Upstream exposes both `get_field(expr, name)` and `get_field_path(expr, [names...])`, but both ultimately call the same scalar UDF with a base expression plus one or more name args. Collapse the Python surface into a single variadic `get_field(expr, *names)` that accepts either a one-step lookup or a path of names, dispatching through a single Rust binding. Note in `.ai/skills/check-upstream/SKILL.md` that `get_field_path` is covered by the variadic form so future audits do not flag it as a gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
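The shape of the variadic signature can be sketched as follows. This is a toy stand-in that walks plain nested dicts rather than DataFusion expressions, so it shows only how one function covers both the one-step and the path form; the `ValueError` for an empty path is an assumption about the error type.

```python
from typing import Any, Mapping


def get_field(value: Mapping[str, Any], *names: str) -> Any:
    """Toy stand-in: walk nested mappings by one name or a path of names."""
    if not names:
        # Assumption: the error type for an empty path is illustrative only.
        raise ValueError("get_field requires at least one field name")
    current: Any = value
    for name in names:
        current = current[name]
    return current


row = {"address": {"geo": {"lat": 52.5}}, "id": 7}
single = get_field(row, "id")                      # one-step lookup
nested = get_field(row, "address", "geo", "lat")   # path of names
```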
Wrap upstream `SessionContext::read_batches`, which materializes a DataFrame directly from a sequence of `RecordBatch`es without registering a named table. The single-batch convenience `SessionContext.read_batch` is implemented in pure Python by calling `read_batches([batch])`, so the Rust side only needs the one binding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
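The delegation described above might look like the following sketch. `ToyContext` is a hypothetical stand-in, not the real `SessionContext`, and its `read_batches` simply returns a list where the real binding would return a DataFrame.

```python
from typing import Any, List, Sequence


class ToyContext:
    """Hypothetical stand-in for a session context; not the real class."""

    def read_batches(self, batches: Sequence[Any]) -> List[Any]:
        # Stand-in for the single native binding: just materialize the input.
        return list(batches)

    def read_batch(self, batch: Any) -> List[Any]:
        # Convenience wrapper: a single batch is a one-element sequence.
        return self.read_batches([batch])


ctx = ToyContext()
one = ctx.read_batch({"a": [1, 2, 3]})
many = ctx.read_batches([{"a": [1]}, {"a": [2]}])
```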
Expose `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with the existing `register_udf` / `register_udaf` / `register_udwf` setters, plus `udfs()` / `udafs()` / `udwfs()` for enumerating registered function names. Looked-up functions come back as the same `ScalarUDF` / `AggregateUDF` / `WindowUDF` wrappers users already get from registration, so they can be called as expressions or re-registered into a different session. The list helpers return a sorted `Vec<String>` rather than the raw `HashSet` that upstream returns, so calling code gets a stable ordering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
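The register / lookup / list symmetry and the sorted listing can be illustrated with a small sketch. `ToyRegistry` is hypothetical and stores plain Python callables; it is not the session's function registry.

```python
from typing import Callable, Dict, List


class ToyRegistry:
    """Hypothetical registry; mirrors the register/lookup/list symmetry."""

    def __init__(self) -> None:
        self._udfs: Dict[str, Callable[..., object]] = {}

    def register_udf(self, name: str, func: Callable[..., object]) -> None:
        self._udfs[name] = func

    def udf(self, name: str) -> Callable[..., object]:
        # Lookup returns the same object that was registered.
        return self._udfs[name]

    def udfs(self) -> List[str]:
        # Sort so callers see a stable order regardless of insertion order.
        return sorted(self._udfs)


source = ToyRegistry()
source.register_udf("to_upper", str.upper)
source.register_udf("add_one", lambda x: x + 1)

names = source.udfs()

# Re-register a looked-up function into a different registry.
target = ToyRegistry()
target.register_udf("to_upper", source.udf("to_upper"))
result = target.udf("to_upper")("abc")
```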
pyarrow.parquet promotes timestamp[s] to timestamp[ms] on write (apache/arrow#41382), so the read array never matched the input. Cast the expected array to timestamp[ms] in test_simple_select to assert DataFusion reads what Arrow actually stored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFrameHtmlFormatter(repr_rows=..., max_rows=...) fires the deprecation warning before raising ValueError, but pytest.raises does not catch warnings. The escaping warning surfaced in every pytest run. Wrap the call in both pytest.raises and pytest.warns so the warning is asserted, not leaked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
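The ordering problem can be reproduced with the standard library alone. In this sketch `make_formatter` is a hypothetical stand-in for the real constructor, and which parameter triggers the deprecation is an assumption; the point is only that the warning is emitted before the exception, so a handler that catches the exception alone lets the warning through. In pytest the equivalent is nesting `pytest.warns(DeprecationWarning)` and `pytest.raises(ValueError)` around the same call.

```python
import warnings


def make_formatter(repr_rows=None, max_rows=None):
    """Hypothetical stand-in that warns first, then rejects the combination."""
    if repr_rows is not None:
        warnings.warn("repr_rows is deprecated", DeprecationWarning, stacklevel=2)
    if repr_rows is not None and max_rows is not None:
        raise ValueError("pass only one of repr_rows / max_rows")
    return {"max_rows": max_rows if max_rows is not None else repr_rows}


raised = False
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    try:
        make_formatter(repr_rows=5, max_rows=10)
    except ValueError:
        raised = True

# Both effects are observed: the exception and the earlier warning.
warning_categories = [w.category for w in caught]
```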
Add Examples docstrings (doctest) for `udf` / `udaf` / `udwf` / `udfs` / `udafs` / `udwfs` that demonstrate the lookup pattern, including a late-binding example where the function name comes from configuration. Add tests covering config-driven dispatch and built-in UDAF / UDWF lookup so the documented patterns are exercised end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
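The late-binding pattern mentioned here, where the function name comes from configuration, can be sketched as below. `REGISTRY`, `lookup`, and the `config` keys are all hypothetical; in the bindings the lookup would go through the session's `udf(name)` instead of a module-level dict.

```python
from typing import Callable, Dict

# Hypothetical registry of named functions, standing in for a session.
REGISTRY: Dict[str, Callable[[float], float]] = {
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}


def lookup(name: str) -> Callable[[float], float]:
    """Resolve a function by name at call time (late binding)."""
    try:
        return REGISTRY[name]
    except KeyError:
        raise KeyError(f"no function registered as {name!r}") from None


# The name comes from configuration, not from the source code.
config = {"transform": "double"}
transform = lookup(config["transform"])
doubled = transform(21.0)

# Changing the configuration changes which function runs, with no code edit.
config["transform"] = "negate"
negated = lookup(config["transform"])(21.0)
```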
Pull request overview
This PR updates the Python bindings and examples to expose additional DataFusion 54-era functionality (notably UDF/UDAF/UDWF discovery + lookup helpers and Arrow RecordBatch ingestion conveniences), and adjusts tests/tooling accordingly.
Changes:
- Add `SessionContext.read_batch` / `read_batches` plus UDF/UDAF/UDWF lookup & listing helpers (`udf`/`udaf`/`udwf`, `udfs`/`udafs`/`udwfs`).
- Extend `functions.get_field` to support multi-segment nested field paths (and update the Rust binding accordingly).
- Update tests to cover the new API surface and adjust timestamp/parquet and deprecation-warning expectations; bump pre-commit hook version.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/datafusion/context.py | Adds batch-reading helpers and UDF/UDAF/UDWF discovery + lookup methods; improves codec type hints. |
| python/datafusion/functions.py | Updates get_field to accept nested paths. |
| python/datafusion/user_defined.py | Introduces Protocol type hints for logical/physical extension codec exportables. |
| crates/core/src/context.rs | Exposes read_batches and function-registry lookup/listing to Python via PyO3. |
| crates/core/src/functions.rs | Updates internal get_field binding to accept a vector of path segments. |
| examples/datafusion-ffi-example/src/table_function.rs | Updates example for upstream TableFunctionImpl API changes (call_with_args). |
| python/tests/test_context.py | Adds coverage for read_batch/read_batches. |
| python/tests/test_dataframe.py | Adjusts test to assert both DeprecationWarning and ValueError. |
| python/tests/test_functions.py | Adds coverage for nested-path get_field and empty-arg error behavior. |
| python/tests/test_sql.py | Removes timestamp[s] xfail and compensates for parquet timestamp unit promotion. |
| python/tests/test_udf.py | Adds coverage for UDF/UDAF/UDWF lookup + late-binding dispatch. |
| .pre-commit-config.yaml | Bumps actionlint hook version to fix CI failures. |
| .ai/skills/check-upstream/SKILL.md | Documents that get_field_path is covered by variadic get_field. |
Which issue does this PR close?
No single issue — this is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.
Rationale for this change
DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps.
What changes are included in this PR?
- Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` to make type-hinted signatures more understandable
- Do not expose `get_field_path` separately; instead fold it into `get_field` to be more Pythonic
- Add `SessionContext.read_batches` / `read_batch`
Are there any user-facing changes?
Yes, but they are all additions. No breaking changes to existing public APIs.