
spark: pass format options into reader#7160

Open
a10y wants to merge 5 commits into develop from aduffy/spark-props

Conversation

a10y (Contributor) commented Mar 25, 2026

Summary

In Spark, pass format properties into reader.

This mirrors the existing implementation of ParquetDataSourceV2 which rips the relevant S3/Azure/etc. properties off of the active SparkSession's hadoopConf. This is also similar to what we do in Vortex iceberg already.

Testing

Added a new test suite for reading/writing to S3 using the Adobe S3Mock testcontainer.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y added the changelog/fix (A bug fix) label Mar 25, 2026
codspeed-hq bot commented Mar 25, 2026

Merging this PR will degrade performance by 15.51%

❌ 3 regressed benchmarks
✅ 1103 untouched benchmarks
⏩ 1522 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 317.8 ns | 376.1 ns | -15.51% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 477.2 ns | 535.6 ns | -10.89% |
| Simulation | map_each[BufferMut&lt;i32&gt;, 128] | 770.6 ns | 858.1 ns | -10.2% |

Comparing aduffy/spark-props (24b6064) with develop (8aae4ba)


Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived on CodSpeed to remove them from the performance reports.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y marked this pull request as ready for review March 25, 2026 17:27
a10y requested a review from robert3005 March 25, 2026 17:27
a10y marked this pull request as draft March 25, 2026 17:27
```diff
 void initNativeResources() {
-  file = Files.open(partition.getPath());
+  file = Files.open(partition.getPath(), partition.getFormatOptions());
```
Contributor Author

this is the thing we needed. everything else is just plumbing

```java
import java.util.Map;
import java.util.Optional;

public final class VortexS3Properties {
```
Contributor Author

i ripped this straight out of our iceberg fork

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y marked this pull request as ready for review March 25, 2026 17:37
a10y (Contributor Author) commented Mar 25, 2026

@claude please review. make no mistakes. be extremely javapilled

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 2m 8s


Code Review

  • Gather context and read changed files
  • Review new config files (VortexS3Properties, VortexAzureProperties, HadoopUtils)
  • Review plumbing changes (VortexDataSourceV2, VortexScan, VortexBatchExec, VortexPartitionReader)
  • Review JNI changes (Files.java, object_store.rs)
  • Review test file (VortexDataSourceS3MockTest)
  • Post review feedback

Overall the approach is clean and mirrors the established pattern from Iceberg. The plumbing chain from VortexDataSourceV2 → VortexScan → VortexBatchExec → VortexFilePartition → VortexPartitionReader is well-executed. A few issues worth addressing before merge:

🔴 Critical: inferSchema doesn't pass credentials

VortexDataSourceV2.inferSchema still opens files without credentials (line 91):

```java
try (File file = Files.open(pathToInfer)) {  // no properties!
```

Schema inference for S3/Azure files will fail because credentials are never passed in this path. The options map passed in via CaseInsensitiveStringMap is accessible here and could be enriched with hadoop conf properties the same way getTable does.

🔴 Critical: Object store cache doesn't include credentials in the cache key

In vortex-jni/src/object_store.rs, the cache key is only scheme://host:port (line 142-148):

```rust
fn url_cache_key(url: &Url) -> String {
    format!("{}://{}", url.scheme(), &url[url::Position::BeforeHost..url::Position::AfterPort])
}
```

If the first call to a given S3 endpoint happens with no credentials (anonymous/env-var path) and a later call includes explicit credentials, the cached unauthenticated store is returned. This is a silent correctness bug. The cache either needs credentials included in the key, or the caching approach should be reconsidered (e.g., cache only when properties are empty, or disable caching for credentialed stores).
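The failure mode is easy to reproduce with a toy version of the cache. `Store` and `get_or_create` here are stand-ins, not the real vortex-jni types; only the keyed-by-endpoint-alone behavior is mirrored:

```rust
use std::collections::HashMap;

// Stand-in for the real object store; only the credentials it was
// constructed with matter for this demonstration.
#[derive(Clone, PartialEq, Debug)]
struct Store {
    access_key: Option<String>,
}

// Mirrors the buggy cache: keyed by endpoint alone, credentials ignored.
fn get_or_create<'a>(
    cache: &'a mut HashMap<String, Store>,
    endpoint: &str,
    access_key: Option<&str>,
) -> &'a Store {
    cache.entry(endpoint.to_string()).or_insert_with(|| Store {
        access_key: access_key.map(String::from),
    })
}

fn main() {
    let mut cache = HashMap::new();
    // First open: anonymous (no credentials supplied).
    let first = get_or_create(&mut cache, "s3://localhost:9090", None).clone();
    // Second open: explicit credentials for the same endpoint...
    let second = get_or_create(&mut cache, "s3://localhost:9090", Some("AKIAEXAMPLE")).clone();
    // ...but the cached anonymous store is silently returned.
    assert_eq!(first, second);
    assert_eq!(second.access_key, None);
}
```

The second caller believes it opened an authenticated store but got the anonymous one, which is exactly the test-then-production hazard described above.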

🟡 Bug: env-var fallback check uses wrong key format

In object_store.rs lines 66-80, the env-var fallback guards check for "access_key_id":

```rust
if !properties.contains_key("access_key_id") && let Ok(access_key) = ...
```

But VortexS3Properties emits the key as "aws_access_key_id". So the guard never fires when Java passes credentials, meaning the env var is applied alongside explicit credentials (the later with_config loop will override it, so it's not a security issue, just wasted work and a misleading guard). The check should use "aws_access_key_id" to match what Java sends, or the Rust side should normalize to the canonical object_store key names.
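A minimal sketch of a robust guard, checking both spellings rather than assuming which one arrives (`has_access_key` is a hypothetical helper, not a function in the PR):

```rust
use std::collections::HashMap;

/// True if the properties map already carries an S3 access key, under either
/// the prefixed name Java sends ("aws_access_key_id") or the bare form the
/// original guard checked ("access_key_id"). Sketch only: the real code needs
/// just whichever spelling is actually emitted.
fn has_access_key(properties: &HashMap<String, String>) -> bool {
    ["aws_access_key_id", "access_key_id"]
        .iter()
        .any(|k| properties.contains_key(*k))
}

fn main() {
    let mut props = HashMap::new();
    props.insert("aws_access_key_id".to_string(), "AKIAEXAMPLE".to_string());
    // Env-var fallback should be skipped: Java already supplied credentials.
    assert!(has_access_key(&props));
    // No credentials anywhere: fallback is allowed to fire.
    assert!(!has_access_key(&HashMap::new()));
}
```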

🟡 Logic: Azure skip_signature set when no explicit key (may break managed identity)

In HadoopUtils.azurePropertiesFromHadoopConf (line 66-68):

```java
if (properties.accessKey().isEmpty()) {
    properties.setSkipSignature(true);
}
```

This unconditionally sets skip_signature=true whenever no fs.azure.account.key.* or SAS token is in the Hadoop conf. If the cluster is running with a managed identity or service principal (configured separately, not via these Hadoop keys), we'll tell the object store to skip auth entirely — which will likely result in 403 errors on private containers.

🟡 VortexAzureProperties.skipSignature() return type inconsistency

skipSignature() on VortexAzureProperties returns Optional&lt;String&gt; while the equivalent on VortexS3Properties returns boolean. The two classes have inconsistent API shapes overall (S3 uses void setters; Azure uses fluent builders). Not a blocker, but might cause confusion if these are ever used together.

🟡 Multi-account Azure support silently broken

The azurePropertiesFromHadoopConf loop overwrites accessKey/sasKey on each matching entry (TODO comment acknowledges this). With two storage accounts, only the last matched key survives. The TODO is fine for now, but worth a comment that cross-account reads within a single Spark session are currently unsupported.

🟢 Nit: HadoopUtils() constructor should be private

The constructor is package-private (HadoopUtils() {}), which means subclasses in the same package could technically instantiate it. Since this is a pure utility class, it should be private.

🟢 Nit: Unused Spark config in test

VortexDataSourceS3MockTest line 55 sets spark.hadoop.fs.s3a.impl — this is needed for the S3AFileSystem Hadoop driver but doesn't affect Vortex's own ObjectStore. It's harmless but slightly misleading in a test that's supposed to verify Vortex's S3 integration specifically.

🟢 Nit: testWriteAndReadWithFormatOptionsFromS3Mock doesn't really test option forwarding

The test passes row_group_size on write and projection_pushdown on read, but neither of these is a validated Vortex option — they're silently ignored. If the goal is to test that format options actually reach the object store, consider asserting that omitting credentials causes a failure (and providing them causes success), rather than passing no-op options.


The core mechanic (VortexPartitionReader → Files.open(path, formatOptions)) is correct. The critical issues to fix before merge are the inferSchema credentials gap and the object store cache key problem.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y (Contributor Author) commented Mar 25, 2026

@claude please re-review. I should have addressed all of the RED comments, and most of the green ones. I don't think the YELLOW ones are relevant.

as always, stay javapilled.

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 1m 36s


Re-Review

  • Read VortexDataSourceV2.java — check inferSchema credentials fix (🔴)
  • Read object_store.rs — check cache key fix (🔴)
  • Read HadoopUtils.java / VortexAzureProperties.java — check green nits
  • Post re-review

✅ Fixed: inferSchema now passes credentials

VortexDataSourceV2.inferSchema now calls buildDataSourceOptions and passes the result to both NativeFileMethods.listVortexFiles and Files.open. The credentials gap is closed. ✅

🔴 Still open: Object store cache key doesn't include credentials

url_cache_key in vortex-jni/src/object_store.rs (lines 142–148) is unchanged:

```rust
fn url_cache_key(url: &Url) -> String {
    format!(
        "{}://{}",
        url.scheme(),
        &url[url::Position::BeforeHost..url::Position::AfterPort],
    )
}
```

The properties map is still not part of the cache key. This means: if a store is first constructed for s3://my-bucket/foo with credentials A, and later opened with credentials B for the same endpoint, it silently returns the store built with credentials A. This is a correctness bug that will surface in real multi-tenant or test-then-production scenarios.

The simplest fix is to include a sorted, hashed representation of the properties in the key.
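That suggestion could look roughly like the following, using only the standard library; `cache_key` and `endpoint_part` are illustrative names, with the existing scheme/host/port string standing in for the `Url` handling:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Fold a sorted view of the properties into a stable hash so the cache key
// distinguishes credentialed from uncredentialed stores without embedding
// plaintext secrets in the key string itself.
fn cache_key(endpoint_part: &str, properties: &HashMap<String, String>) -> String {
    let mut sorted: Vec<_> = properties.iter().collect();
    sorted.sort_by_key(|(k, _)| *k);
    let mut hasher = DefaultHasher::new();
    for (k, v) in sorted {
        k.hash(&mut hasher);
        v.hash(&mut hasher);
    }
    format!("{endpoint_part};{:016x}", hasher.finish())
}

fn main() {
    let mut creds = HashMap::new();
    creds.insert("aws_access_key_id".to_string(), "AKIAEXAMPLE".to_string());
    // Credentialed and anonymous opens of the same endpoint get distinct entries.
    assert_ne!(
        cache_key("s3://localhost:9090", &creds),
        cache_key("s3://localhost:9090", &HashMap::new())
    );
}
```

Sorting before hashing makes the key independent of HashMap iteration order, and hashing sidesteps ever materializing credentials inside the key.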

✅ Fixed: HadoopUtils() private constructor

Line 10: private HadoopUtils() {} — done. ✅

✅ Fixed: VortexAzureProperties.skipSignature() return type

Now returns boolean (and both VortexS3Properties and VortexAzureProperties use the same style). ✅

✅ Fixed: Unused fs.s3a.impl config in test

Removed from VortexDataSourceS3MockTest. ✅

✅ Improved: testWriteAndReadWithFormatOptionsFromS3Mock

The test no longer passes the meaningless row_group_size/projection_pushdown no-op options. The first test (testWriteAndReadVortexFilesFromS3Mock) effectively validates the credential-forwarding path end-to-end by actually reading and writing from S3Mock. ✅

🟡 Acknowledged as not-relevant (author's call)

  • env-var guard key mismatch ("access_key_id" vs "aws_access_key_id"): safe at runtime because the explicit with_config loop runs after and overrides any spuriously-applied env-var value. Wasted work but not a bug.
  • Azure skip_signature managed-identity false positive: acknowledged as a limitation.
  • Multi-account Azure: tracked as a TODO comment.

Summary: One RED issue remains — the object store cache key needs to incorporate credentials. Everything else from the previous review has been addressed. The Java-side plumbing looks solid.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y (Contributor Author) commented Mar 25, 2026

@claude ok i addressed the props in cache key thing. happy now?

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 30s


Re-Review (Cache Key Fix)

  • Read vortex-jni/src/object_store.rs — check cache key fix
  • Post results

✅ Yes, happy now.

url_cache_key now accepts properties and folds the sorted key-value pairs into the key:

```rust
fn url_cache_key(url: &Url, properties: &HashMap<String, String>) -> String {
    let mut sorted_props: Vec<_> = properties.iter().collect();
    sorted_props.sort_by_key(|(k, _)| *k);
    let props_str: String = sorted_props
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("{}://{};{}", url.scheme(), &url[...], props_str)
}
```

The sort guarantees determinism regardless of insertion order, so two calls with the same credentials always resolve to the same cache entry, and two calls with different credentials (or no credentials vs. credentials) get distinct entries. The fix is correct.
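The determinism claim can be checked directly against a simplified mirror of the function, with the `Url` portion reduced to a plain string since only the property handling matters (`props_key` is an illustrative name, not the real function):

```rust
use std::collections::HashMap;

// Simplified mirror of url_cache_key: the same sort-then-join over
// properties, with the scheme/host/port portion as a plain string.
fn props_key(endpoint: &str, properties: &HashMap<String, String>) -> String {
    let mut sorted: Vec<_> = properties.iter().collect();
    sorted.sort_by_key(|(k, _)| *k);
    let props_str = sorted
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("{endpoint};{props_str}")
}

fn main() {
    let mut a = HashMap::new();
    a.insert("secret".to_string(), "s".to_string());
    a.insert("aws_access_key_id".to_string(), "k".to_string());

    let mut b = HashMap::new();
    b.insert("aws_access_key_id".to_string(), "k".to_string());
    b.insert("secret".to_string(), "s".to_string());

    // Same properties, different insertion order: identical key.
    assert_eq!(props_key("s3://host", &a), props_key("s3://host", &b));
    // No credentials vs. credentials: distinct keys.
    assert_ne!(props_key("s3://host", &a), props_key("s3://host", &HashMap::new()));
    println!("{}", props_key("s3://host", &a));
    // → s3://host;aws_access_key_id=k,secret=s
}
```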

One non-blocking observation: credentials end up as plaintext in an in-memory HashMap key. That's fine — it's never serialized or logged — but worth being aware of if this cache ever grows legs (e.g., persisted to disk, exposed via metrics). Not asking for a change, just flagging it.

All previously raised issues are now resolved. LGTM. 🚀

a10y enabled auto-merge (squash) March 25, 2026 18:58