Skip to content

spark: pass format options into reader#7160

Merged
a10y merged 6 commits intodevelopfrom
aduffy/spark-props
Mar 25, 2026
Merged

spark: pass format options into reader#7160
a10y merged 6 commits intodevelopfrom
aduffy/spark-props

Conversation

@a10y
Copy link
Contributor

@a10y a10y commented Mar 25, 2026

Summary

In Spark, pass format properties into reader.

This mirrors the existing implementation of ParquetDataSourceV2 which rips the relevant S3/Azure/etc. properties off of the active SparkSession's hadoopConf. This is also similar to what we do in Vortex iceberg already.

Testing

Added a new test suite for reading/writing to S3 using the Adobe S3Mock testcontainer.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y added the changelog/fix A bug fix label Mar 25, 2026
@codspeed-hq
Copy link

codspeed-hq bot commented Mar 25, 2026

Merging this PR will degrade performance by 15.51%

❌ 3 regressed benchmarks
✅ 1103 untouched benchmarks
⏩ 1522 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation bitwise_not_vortex_buffer_mut[1024] 477.2 ns 535.6 ns -10.89%
Simulation bitwise_not_vortex_buffer_mut[128] 317.8 ns 376.1 ns -15.51%
Simulation map_each[BufferMut<i32>, 128] 770.6 ns 858.1 ns -10.2%

Comparing aduffy/spark-props (64c3336) with develop (8aae4ba)

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y marked this pull request as ready for review March 25, 2026 17:27
@a10y a10y requested a review from robert3005 March 25, 2026 17:27
@a10y a10y marked this pull request as draft March 25, 2026 17:27
*/
void initNativeResources() {
file = Files.open(partition.getPath());
file = Files.open(partition.getPath(), partition.getFormatOptions());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the thing we needed. everything else is just plumbing

import java.util.Map;
import java.util.Optional;

public final class VortexS3Properties {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i ripped this straight out of our iceberg fork

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y marked this pull request as ready for review March 25, 2026 17:37
@a10y
Copy link
Contributor Author

a10y commented Mar 25, 2026

@claude please review. make no mistakes. be extremely javapilled

@claude

This comment was marked as resolved.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y
Copy link
Contributor Author

a10y commented Mar 25, 2026

@claude please re-review. I should have addressed all of the RED comments, and most of the green ones. I don't think the YELLOW ones are relevant.

as always, stay javapilled.

@claude

This comment was marked as resolved.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y
Copy link
Contributor Author

a10y commented Mar 25, 2026

@claude ok i addressed the props in cache key thing. happy now?

@claude
Copy link
Contributor

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 30s —— View job


Re-Review (Cache Key Fix)

  • Read vortex-jni/src/object_store.rs — check cache key fix
  • Post results

✅ Yes, happy now.

url_cache_key now accepts properties and folds the sorted key-value pairs into the key:

fn url_cache_key(url: &Url, properties: &HashMap<String, String>) -> String {
    let mut sorted_props: Vec<_> = properties.iter().collect();
    sorted_props.sort_by_key(|(k, _)| *k);
    let props_str: String = sorted_props
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("{}://{};{}", url.scheme(), &url[...], props_str)
}

The sort guarantees determinism regardless of insertion order, so two calls with the same credentials always resolve to the same cache entry, and two calls with different credentials (or no credentials vs. credentials) get distinct entries. The fix is correct.

One non-blocking observation: credentials end up as plaintext in an in-memory HashMap key. That's fine — it's never serialized or logged — but worth being aware of if this cache ever grows legs (e.g., persisted to disk, exposed via metrics). Not asking for a change, just flagging it.

All previously raised issues are now resolved. LGTM. 🚀

@a10y a10y enabled auto-merge (squash) March 25, 2026 18:58
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y merged commit b66cfcb into develop Mar 25, 2026
97 of 106 checks passed
@a10y a10y deleted the aduffy/spark-props branch March 25, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants