Skip to content

[python] Add conflict detection in shard update#7630

Merged
JingsongLi merged 4 commits intoapache:masterfrom
littlecoder04:conflict_check
Apr 12, 2026
Merged

[python] Add conflict detection in shard update#7630
JingsongLi merged 4 commits intoapache:masterfrom
littlecoder04:conflict_check

Conversation

@littlecoder04
Copy link
Copy Markdown
Contributor

@littlecoder04 littlecoder04 commented Apr 11, 2026

Purpose

This PR is a follow-up to #7323. PR #7323 introduced conflict detection for Python data evolution updates, but the shard-update path was not covered.

As a result, when shard update and compact run concurrently, the shard update may commit successfully against a stale scan snapshot instead of failing fast. The problem only shows up later during read, with the error:
All files in a field merge split should have the same row count.This PR extends the same conflict-detection coverage to the shard-update path.

Tests

run_compact_conflict_test in run_mixed_tests.sh

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! This is a real-world concurrency issue that would be hard to catch without the E2E test.

Overall the approach looks solid — passing snapshot_id through Plan → ShardTableUpdator → CommitMessage.check_from_snapshot → ConflictDetection._row_id_check_from_snapshot leverages the existing Java conflict detection mechanism correctly.

A few minor comments:

  1. FileScanner lambda return type change: The scanner lambdas now return (List, Snapshot) tuples instead of just List. The all_manifests() lambda also returns (manifests, snapshot) now, but it is only consumed in the FileScanner ctor where the new signature is expected — just want to double-check there is no other caller of FileScanner that still expects a plain list.

  2. Incremental scan edge case: In incremental_manifest(), end_snapshot is set inside the for loop. If start_id + 1 > end_id (empty range), end_snapshot stays None. The caller would get (manifests, None). Is this expected — i.e., should the Plan still carry snapshot_id = None when there are no incremental snapshots to scan?

  3. Plan dataclass: Adding snapshot_id: Optional[int] = None is backward-compatible. But the FileScanner.scan() path always sets it from snapshot_manager.get_latest_snapshot().id — might be worth confirming there is no code path where get_latest_snapshot() could return None at plan time.

The E2E test design (Java write base → Python scan → Java compact → Python commit conflict) is well thought out and clearly demonstrates the race condition.

Once the above questions are addressed, +1 from me! 🚀

@littlecoder04
Copy link
Copy Markdown
Contributor Author

Nice PR! This is a real-world concurrency issue that would be hard to catch without the E2E test.

Overall the approach looks solid — passing snapshot_id through Plan → ShardTableUpdator → CommitMessage.check_from_snapshot → ConflictDetection._row_id_check_from_snapshot leverages the existing Java conflict detection mechanism correctly.

A few minor comments:

  1. FileScanner lambda return type change: The scanner lambdas now return (List, Snapshot) tuples instead of just List. The all_manifests() lambda also returns (manifests, snapshot) now, but it is only consumed in the FileScanner ctor where the new signature is expected — just want to double-check there is no other caller of FileScanner that still expects a plain list.
  2. Incremental scan edge case: In incremental_manifest(), end_snapshot is set inside the for loop. If start_id + 1 > end_id (empty range), end_snapshot stays None. The caller would get (manifests, None). Is this expected — i.e., should the Plan still carry snapshot_id = None when there are no incremental snapshots to scan?
  3. Plan dataclass: Adding snapshot_id: Optional[int] = None is backward-compatible. But the FileScanner.scan() path always sets it from snapshot_manager.get_latest_snapshot().id — might be worth confirming there is no code path where get_latest_snapshot() could return None at plan time.

The E2E test design (Java write base → Python scan → Java compact → Python commit conflict) is well thought out and clearly demonstrates the race condition.

Once the above questions are addressed, +1 from me! 🚀

Thanks!

  1. I checked all FileScanner(...) call sites and updated them to the new (manifest_files, snapshot) contract.
  2. Good catch. I updated incremental_manifest() so that end_snapshot is initialized from end_id before the loop. I also added a test to cover it.
  3. Plan.snapshot_id is intentionally kept as Optional[int], which matches the Java semantics of SnapshotReader.Plan#snapshotId() being nullable for cases such as an empty table.

@leaves12138
Copy link
Copy Markdown
Contributor

All points addressed, thanks for the quick responses!

  1. ✅ Confirmed all call sites updated
  2. end_snapshot edge case fixed
  3. Optional[int] matches Java semantics — makes sense

LGTM, approving now! 🚀

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments addressed. LGTM! 🚀

@JingsongLi JingsongLi merged commit 82c3d83 into apache:master Apr 12, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants