@jerrytqz jerrytqz commented Jan 17, 2026

What changes were proposed in this pull request?

This PR adds validation when accessing the checkpoint. It detects the inconsistent state in which the offset and commit logs are non-empty but the metadata file is missing, and throws an error before the query can start with a new query ID.
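
To illustrate the shape of such a check, here is a minimal sketch in Python against a local directory. It is not the code in this PR. The names `validate_checkpoint`, `has_entries`, and `MissingMetadataFileError` are hypothetical, and the sketch assumes the checkpoint layout of a `metadata` file next to `offsets` and `commits` directories. It treats either log being non-empty as inconsistent, following the JIRA summary; that is an assumption about the exact condition.

```python
from pathlib import Path


class MissingMetadataFileError(Exception):
    """Hypothetical stand-in for the MISSING_METADATA_FILE error condition."""


def has_entries(directory: Path) -> bool:
    """Return True if the directory exists and holds at least one entry."""
    return directory.is_dir() and any(directory.iterdir())


def validate_checkpoint(checkpoint_dir: str) -> None:
    """Raise if batch logs exist but the metadata file does not.

    Assumed layout (not taken from the PR diff):
      <checkpoint_dir>/metadata   file holding the streaming query ID
      <checkpoint_dir>/offsets/   one entry per planned batch
      <checkpoint_dir>/commits/   one entry per completed batch
    """
    root = Path(checkpoint_dir)
    metadata_present = (root / "metadata").is_file()
    logs_non_empty = has_entries(root / "offsets") or has_entries(root / "commits")

    # A fresh checkpoint (no metadata, no logs) is fine: the query may
    # legitimately generate a new ID. Only logs without metadata are rejected.
    if not metadata_present and logs_non_empty:
        raise MissingMetadataFileError(
            f"Checkpoint {checkpoint_dir} has offset or commit entries "
            "but no metadata file; refusing to start with a new query ID."
        )
```

The point of the ordering is that the check runs before any new ID would be written, so a brand-new checkpoint directory still starts normally while a partially deleted one fails fast.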

Why are the changes needed?

When a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file (which contains the streaming query ID), the query generates a new UUID on restart. This breaks the deduplication mechanism of exactly-once sinks such as DeltaSink, which rely on the streaming query ID to skip already-processed batches, and leads to data duplication.
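
The failure mode can be shown with a toy model. The sketch below is not DeltaSink's implementation. `IdempotentSink` and `add_batch` are hypothetical names, and the sink is assumed to remember the last committed batch ID per query ID and skip anything at or below it.

```python
import uuid


class IdempotentSink:
    """Toy sink that deduplicates batches by (query ID, batch ID)."""

    def __init__(self):
        self.last_committed = {}  # query ID -> highest batch ID written
        self.rows = []

    def add_batch(self, query_id: str, batch_id: int, rows: list) -> bool:
        """Write the batch unless this query already committed it."""
        if batch_id <= self.last_committed.get(query_id, -1):
            return False  # already processed: skip
        self.rows.extend(rows)
        self.last_committed[query_id] = batch_id
        return True


sink = IdempotentSink()
original_id = str(uuid.uuid4())

# First run writes batch 0; a replay under the same ID is skipped.
first_write = sink.add_batch(original_id, 0, ["a", "b"])
same_id_replay = sink.add_batch(original_id, 0, ["a", "b"])

# Metadata file lost: the restarted query gets a fresh ID, so the sink
# no longer recognizes batch 0 and writes it a second time.
regenerated_id = str(uuid.uuid4())
new_id_replay = sink.add_batch(regenerated_id, 0, ["a", "b"])
```

In this model the replay under the regenerated ID is accepted, so `sink.rows` ends up holding the batch twice, which is the duplication described above.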

Does this PR introduce any user-facing change?

Yes. There is a new error condition MISSING_METADATA_FILE that occurs when a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file.

How was this patch tested?

Unit tests and integration tests are added.

Was this patch authored or co-authored using generative AI tooling?

Yes. Generative AI tooling was used to assist in writing the test suite.
Generated-by: Claude Sonnet 4.5

github-actions bot commented Jan 17, 2026

JIRA Issue Information

=== Task SPARK-55058 ===
Summary: Throw an error if the /metadata file is not present, but offset or commit directories are non-empty
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@jerrytqz jerrytqz changed the title Throw error on inconsistent checkpoint metadata [SPARK-55058][SS] Throw error on inconsistent checkpoint metadata Jan 17, 2026
@jerrytqz jerrytqz force-pushed the jerry-zheng_data/SPARK-55058 branch from ee791d2 to 75d4a2c Compare January 19, 2026 09:41