@Shekharrajak
Contributor

Which issue does this PR close?

Closes #3015.

Rationale for this change

The native Parquet writer needed a fix to use output_path as the base directory for file writes when work_dir is not set. Without this fix, files were being written to root (/) instead of the intended output directory.

What changes are included in this PR?

  1. Protobuf: Added staging_file_path field to ParquetWriter message for future 2PC support
  2. Native Rust: Fixed parquet_writer.rs to use output_path as fallback when work_dir is empty
  3. Scala/JVM: Simplified CometNativeWriteExec to write directly to output path
  4. Tests: Added CometParquetWriter2PCSuite with basic write functionality tests
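The fallback described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the actual code in parquet_writer.rs: the helper names `resolve_base_dir` and `part_file_path` and the `part-NNNNN.parquet` naming pattern are assumptions made for the example.

```rust
use std::path::PathBuf;

/// Hypothetical helper: choose the directory that part files are written under.
/// An empty (or whitespace-only) work_dir falls back to output_path.
fn resolve_base_dir(work_dir: &str, output_path: &str) -> PathBuf {
    if work_dir.trim().is_empty() {
        PathBuf::from(output_path)
    } else {
        PathBuf::from(work_dir)
    }
}

/// Hypothetical helper: build the full path of one part file.
/// The file naming pattern is illustrative only.
fn part_file_path(work_dir: &str, output_path: &str, partition_id: u32) -> PathBuf {
    resolve_base_dir(work_dir, output_path).join(format!("part-{:05}.parquet", partition_id))
}
```

Without a fallback of this kind, joining an empty base with a file name via string formatting, for example `format!("{}/part-{:05}.parquet", "", 0)`, yields `/part-00000.parquet`. That is one plausible way files could end up under root, as described in the rationale.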

How are these changes tested?

Added CometParquetWriter2PCSuite with 5 tests, including:

  • Basic successful write creates files in output directory
  • Multiple concurrent tasks write without file conflicts
  • Various data types write correctly
  • Overwrite mode replaces existing files

@Shekharrajak Shekharrajak changed the title Feature/issue 3015 2PC and staging output Jan 11, 2026
outputPath: String,
committer: Option[FileCommitProtocol] = None,
jobTrackerID: String = Utils.createTempDir().getName)
case class CometNativeWriteExec(nativeOp: Operator, child: SparkPlan, outputPath: String)
Contributor Author

Basic execution that delegates to the native writer.

@wForget
Member

wForget commented Jan 12, 2026

@Shekharrajak Thank you for your work. The file commit protocol has already been implemented in #2828, and work_dir is the staging dir. Is my understanding correct? cc @comphead @andygrove

@Shekharrajak
Contributor Author

> @Shekharrajak Thank you for your work. The file commit protocol has already been implemented in #2828, and work_dir is the staging dir. Is my understanding correct? cc @comphead @andygrove

I think the original implementation duplicated what InsertIntoHadoopFsRelationCommand already does. With the changes in this PR, we no longer manage FileCommitProtocol ourselves; that responsibility is delegated to Spark.
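For readers unfamiliar with the staging-directory idea discussed above, here is a minimal sketch of a write-then-commit flow. It is neither Spark's FileCommitProtocol API nor Comet's code: the function names `write_staged`, `commit_staged`, and `abort_staged` are hypothetical, and it assumes a local filesystem where a rename within one volume is a cheap move.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Phase 1: a task writes its output under the staging directory only.
fn write_staged(staging_dir: &Path, file_name: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    fs::create_dir_all(staging_dir)?;
    let staged = staging_dir.join(file_name);
    fs::write(&staged, bytes)?;
    Ok(staged)
}

/// Phase 2 (commit): move the staged file into the final output directory.
fn commit_staged(staged: &Path, output_dir: &Path) -> io::Result<PathBuf> {
    fs::create_dir_all(output_dir)?;
    let file_name = staged.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "staged path has no file name")
    })?;
    let dest = output_dir.join(file_name);
    fs::rename(staged, &dest)?;
    Ok(dest)
}

/// Abort: discard a staged file so nothing partial reaches the output directory.
fn abort_staged(staged: &Path) -> io::Result<()> {
    fs::remove_file(staged)
}
```

In the design described in this comment, the commit and abort steps are owned by Spark, so the native writer only needs to write to whichever path it is handed.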

@codecov-commenter

codecov-commenter commented Jan 12, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.54%. Comparing base (f09f8af) to head (4ad285e).
⚠️ Report is 845 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../apache/spark/sql/comet/CometNativeWriteExec.scala | 44.44% | 2 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3068      +/-   ##
============================================
+ Coverage     56.12%   59.54%   +3.41%     
- Complexity      976     1374     +398     
============================================
  Files           119      167      +48     
  Lines         11743    15461    +3718     
  Branches       2251     2570     +319     
============================================
+ Hits           6591     9206    +2615     
- Misses         4012     4961     +949     
- Partials       1140     1294     +154     

Development

Successfully merging this pull request may close these issues.

Comet writer should support 2PC and staging output
