Fix add_files with non-identity transforms #1925
Merged · Fokko merged 1 commit into apache:main on Apr 16, 2025
Conversation
kevinjqliu approved these changes · Apr 16, 2025
Comment on lines -2264 to -2266:
```python
source_field = schema.find_field(partition_field.source_id)
transform = partition_field.transform.transform(source_field.field_type)
return transform(lower_value)
```
Contributor
ah, the bug was introduced here: the values need to be transformed first before comparison
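To make the comment concrete, here is a minimal sketch of what "transformed first before comparison" means. This is an illustration with assumed names, not code from the PR; it uses pyiceberg's `MonthTransform` as the non-identity transform:

```python
# Minimal sketch (not from this PR): with a non-identity transform such as
# MonthTransform, partition values are transformed ordinals (months since
# epoch), so raw source values must be transformed before any comparison
# against them.
from datetime import datetime

from pyiceberg.transforms import MonthTransform
from pyiceberg.types import TimestampType
from pyiceberg.utils.datetime import datetime_to_micros

# transform() returns a callable that maps raw values to partition values.
to_month = MonthTransform().transform(TimestampType())

lower = datetime_to_micros(datetime(2025, 4, 1))   # a file's lower bound
upper = datetime_to_micros(datetime(2025, 4, 30))  # a file's upper bound

# Both bounds land in the same month, so the file maps to one partition.
assert to_month(lower) == to_month(upper)
```

With an identity transform the raw value and the partition value coincide, so skipping the transform happens to work; that would explain why the breakage only shows up for non-identity transforms.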
Fokko added a commit that referenced this pull request on Apr 17, 2025:
Found out I broke this myself after doing a `git bisect`:
````
36d383d is the first bad commit
commit 36d383d
Author: Fokko Driesprong <fokko@apache.org>
Date: Thu Jan 23 07:50:54 2025 +0100
PyArrow: Avoid buffer-overflow by avoid doing a sort (#1555)
Second attempt of #1539
This was already being discussed back here:
#208 (comment)
This PR changes from doing a sort, and then a single pass over the table
to the approach where we determine the unique partition tuples filter on
them individually.
Fixes #1491
Because the sort caused buffers to be joined where it would overflow in
Arrow. I think this is an issue on the Arrow side, and it should
automatically break up into smaller buffers. The `combine_chunks` method
does this correctly.
Now:
```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```
Before:
```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```
So it comes with a nice speedup as well :)
---------
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
pyiceberg/io/pyarrow.py | 129 ++-
pyiceberg/partitioning.py | 39 +-
pyiceberg/table/__init__.py | 6 +-
pyproject.toml | 1 +
tests/benchmark/test_benchmark.py | 72 ++
tests/integration/test_partitioning_key.py | 1299 ++++++++++++++--------------
tests/table/test_locations.py | 2 +-
7 files changed, 805 insertions(+), 743 deletions(-)
create mode 100644 tests/benchmark/test_benchmark.py
````
Closes #1917
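To make the quoted commit description concrete, here is a rough sketch of the "determine the unique partition tuples, then filter on them individually" approach. This is my own illustration in stock PyArrow, not PyIceberg's actual implementation; the table and column names are invented:

```python
# Sketch of the "unique partition tuples, then filter" approach described
# above; not PyIceberg's actual code. Table and column names are made up.
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"part": [1, 2, 1, 2, 3], "val": [10, 20, 30, 40, 50]})

# Determine the unique partition values, then filter on each one
# individually, instead of sorting the whole table and slicing it
# in a single pass.
for part in pc.unique(table["part"]).to_pylist():
    group = table.filter(pc.equal(table["part"], part))
    # combine_chunks() consolidates the filtered chunks; per the commit
    # message above, it handles large buffers correctly where the
    # sort-based path overflowed.
    group = group.combine_chunks()
    print(part, group.num_rows)
```

The trade-off is one filter pass per distinct partition value instead of one global sort, which the benchmark numbers above show ending up faster in this case.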
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request on Aug 13, 2025.