
spark: pass format options into reader#7160

Open
a10y wants to merge 5 commits into develop from aduffy/spark-props

Conversation

a10y (Contributor) commented Mar 25, 2026

Summary

In Spark, pass format properties into reader.

This mirrors the existing implementation of ParquetDataSourceV2 which rips the relevant S3/Azure/etc. properties off of the active SparkSession's hadoopConf. This is also similar to what we do in Vortex iceberg already.

Testing

Added a new test suite for reading/writing to S3 using the Adobe S3Mock testcontainer.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y added the changelog/fix (A bug fix) label Mar 25, 2026
codspeed-hq bot commented Mar 25, 2026

Merging this PR will degrade performance by 15.51%

❌ 3 regressed benchmarks
✅ 1103 untouched benchmarks
⏩ 1522 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 317.8 ns | 376.1 ns | -15.51% |
| Simulation | bitwise_not_vortex_buffer_mut[1024] | 477.2 ns | 535.6 ns | -10.89% |
| Simulation | map_each[BufferMut&lt;i32&gt;, 128] | 770.6 ns | 858.1 ns | -10.2% |

Comparing aduffy/spark-props (24b6064) with develop (8aae4ba)


Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived on CodSpeed to remove them from the performance reports.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y marked this pull request as ready for review March 25, 2026 17:27
a10y requested a review from robert3005 March 25, 2026 17:27
a10y marked this pull request as draft March 25, 2026 17:27
```diff
 void initNativeResources() {
-  file = Files.open(partition.getPath());
+  file = Files.open(partition.getPath(), partition.getFormatOptions());
```
Contributor Author

this is the thing we needed. everything else is just plumbing

```java
import java.util.Map;
import java.util.Optional;

public final class VortexS3Properties {
```
Contributor Author

i ripped this straight out of our iceberg fork

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y marked this pull request as ready for review March 25, 2026 17:37
a10y (Contributor Author) commented Mar 25, 2026

@claude please review. make no mistakes. be extremely javapilled

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 2m 8s


Code Review

  • Gather context and read changed files
  • Review new config files (VortexS3Properties, VortexAzureProperties, HadoopUtils)
  • Review plumbing changes (VortexDataSourceV2, VortexScan, VortexBatchExec, VortexPartitionReader)
  • Review JNI changes (Files.java, object_store.rs)
  • Review test file (VortexDataSourceS3MockTest)
  • Post review feedback

Overall the approach is clean and mirrors the established pattern from Iceberg. The plumbing chain from VortexDataSourceV2 → VortexScan → VortexBatchExec → VortexFilePartition → VortexPartitionReader is well-executed. A few issues worth addressing before merge:

🔴 Critical: inferSchema doesn't pass credentials

VortexDataSourceV2.inferSchema still opens files without credentials (line 91):

```java
try (File file = Files.open(pathToInfer)) {  // no properties!
```

Schema inference for S3/Azure files will fail because credentials are never passed in this path. The options map passed in via CaseInsensitiveStringMap is accessible here and could be enriched with hadoop conf properties the same way getTable does.

🔴 Critical: Object store cache doesn't include credentials in the cache key

In vortex-jni/src/object_store.rs, the cache key is only scheme://host:port (line 142-148):

```rust
fn url_cache_key(url: &Url) -> String {
    format!("{}://{}", url.scheme(), &url[url::Position::BeforeHost..url::Position::AfterPort])
}
```

If the first call to a given S3 endpoint happens with no credentials (anonymous/env-var path) and a later call includes explicit credentials, the cached unauthenticated store is returned. This is a silent correctness bug. The cache either needs credentials included in the key, or the caching approach should be reconsidered (e.g., cache only when properties are empty, or disable caching for credentialed stores).
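The failure mode is easy to reproduce with a toy version of the cache. `Store` and `get_or_create` here are stand-ins, not the real vortex-jni types; only the keyed-by-endpoint-alone behavior is mirrored:

```rust
use std::collections::HashMap;

// Stand-in for the real object store; only the credentials it was
// constructed with matter for this demonstration.
#[derive(Clone, PartialEq, Debug)]
struct Store {
    access_key: Option<String>,
}

// Mirrors the buggy cache: keyed by endpoint alone, credentials ignored.
fn get_or_create<'a>(
    cache: &'a mut HashMap<String, Store>,
    endpoint: &str,
    access_key: Option<&str>,
) -> &'a Store {
    cache.entry(endpoint.to_string()).or_insert_with(|| Store {
        access_key: access_key.map(String::from),
    })
}

fn main() {
    let mut cache = HashMap::new();
    // First open: anonymous (no credentials supplied).
    let first = get_or_create(&mut cache, "s3://localhost:9090", None).clone();
    // Second open: explicit credentials for the same endpoint...
    let second = get_or_create(&mut cache, "s3://localhost:9090", Some("AKIAEXAMPLE")).clone();
    // ...but the cached anonymous store is silently returned.
    assert_eq!(first, second);
    assert_eq!(second.access_key, None);
}
```

The second caller believes it opened an authenticated store but got the anonymous one, which is exactly the test-then-production hazard described above.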

🟡 Bug: env-var fallback check uses wrong key format

In object_store.rs lines 66-80, the env-var fallback guards check for "access_key_id":

```rust
if !properties.contains_key("access_key_id") && let Ok(access_key) = ...
```

But VortexS3Properties emits the key as "aws_access_key_id". So the guard never fires when Java passes credentials, meaning the env var is applied alongside explicit credentials (the later with_config loop will override it, so it's not a security issue, just wasted work and a misleading guard). The check should use "aws_access_key_id" to match what Java sends, or the Rust side should normalize to the canonical object_store key names.
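A minimal sketch of a robust guard, checking both spellings rather than assuming which one arrives (`has_access_key` is a hypothetical helper, not a function in the PR):

```rust
use std::collections::HashMap;

/// True if the properties map already carries an S3 access key, under either
/// the prefixed name Java sends ("aws_access_key_id") or the bare form the
/// original guard checked ("access_key_id"). Sketch only: the real code needs
/// just whichever spelling is actually emitted.
fn has_access_key(properties: &HashMap<String, String>) -> bool {
    ["aws_access_key_id", "access_key_id"]
        .iter()
        .any(|k| properties.contains_key(*k))
}

fn main() {
    let mut props = HashMap::new();
    props.insert("aws_access_key_id".to_string(), "AKIAEXAMPLE".to_string());
    // Env-var fallback should be skipped: Java already supplied credentials.
    assert!(has_access_key(&props));
    // No credentials anywhere: fallback is allowed to fire.
    assert!(!has_access_key(&HashMap::new()));
}
```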

🟡 Logic: Azure skip_signature set when no explicit key (may break managed identity)

In HadoopUtils.azurePropertiesFromHadoopConf (line 66-68):

```java
if (properties.accessKey().isEmpty()) {
    properties.setSkipSignature(true);
}
```

This unconditionally sets skip_signature=true whenever no fs.azure.account.key.* or SAS token is in the Hadoop conf. If the cluster is running with a managed identity or service principal (configured separately, not via these Hadoop keys), we'll tell the object store to skip auth entirely — which will likely result in 403 errors on private containers.

🟡 VortexAzureProperties.skipSignature() return type inconsistency

skipSignature() on VortexAzureProperties returns Optional&lt;String&gt; while the equivalent on VortexS3Properties returns boolean. The two classes have inconsistent API shapes overall (S3 uses void setters; Azure uses fluent builders). Not a blocker, but might cause confusion if these are ever used together.

🟡 Multi-account Azure support silently broken

The azurePropertiesFromHadoopConf loop overwrites accessKey/sasKey on each matching entry (TODO comment acknowledges this). With two storage accounts, only the last matched key survives. The TODO is fine for now, but worth a comment that cross-account reads within a single Spark session are currently unsupported.

🟢 Nit: HadoopUtils() constructor should be private

The constructor is package-private (HadoopUtils() {}), which means subclasses in the same package could technically instantiate it. Since this is a pure utility class, it should be private.

🟢 Nit: Unused Spark config in test

VortexDataSourceS3MockTest line 55 sets spark.hadoop.fs.s3a.impl — this is needed for the S3AFileSystem Hadoop driver but doesn't affect Vortex's own ObjectStore. It's harmless but slightly misleading in a test that's supposed to verify Vortex's S3 integration specifically.

🟢 Nit: testWriteAndReadWithFormatOptionsFromS3Mock doesn't really test option forwarding

The test passes row_group_size on write and projection_pushdown on read, but neither of these is a validated Vortex option — they're silently ignored. If the goal is to test that format options actually reach the object store, consider asserting that omitting credentials causes a failure (and providing them causes success), rather than passing no-op options.


The core mechanic (VortexPartitionReader → Files.open(path, formatOptions)) is correct. The critical issues to fix before merge are the inferSchema credentials gap and the object store cache key problem.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y (Contributor Author) commented Mar 25, 2026

@claude please re-review. I should have addressed all of the RED comments, and most of the green ones. I don't think the YELLOW ones are relevant.

as always, stay javapilled.

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 1m 36s


Re-Review

  • Read VortexDataSourceV2.java — check inferSchema credentials fix (🔴)
  • Read object_store.rs — check cache key fix (🔴)
  • Read HadoopUtils.java / VortexAzureProperties.java — check green nits
  • Post re-review

✅ Fixed: inferSchema now passes credentials

VortexDataSourceV2.inferSchema now calls buildDataSourceOptions and passes the result to both NativeFileMethods.listVortexFiles and Files.open. The credentials gap is closed. ✅

🔴 Still open: Object store cache key doesn't include credentials

url_cache_key in vortex-jni/src/object_store.rs (lines 142–148) is unchanged:

```rust
fn url_cache_key(url: &Url) -> String {
    format!(
        "{}://{}",
        url.scheme(),
        &url[url::Position::BeforeHost..url::Position::AfterPort],
    )
}
```

The properties map is still not part of the cache key. This means: if a store is first constructed for s3://my-bucket/foo with credentials A, and later opened with credentials B for the same endpoint, it silently returns the store built with credentials A. This is a correctness bug that will surface in real multi-tenant or test-then-production scenarios.

The simplest fix is to include a sorted, hashed representation of the properties in the key.
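That suggestion could look roughly like the following, using only the standard library; `cache_key` and `endpoint_part` are illustrative names, with the existing scheme/host/port string standing in for the `Url` handling:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Fold a sorted view of the properties into a stable hash so the cache key
// distinguishes credentialed from uncredentialed stores without embedding
// plaintext secrets in the key string itself.
fn cache_key(endpoint_part: &str, properties: &HashMap<String, String>) -> String {
    let mut sorted: Vec<_> = properties.iter().collect();
    sorted.sort_by_key(|(k, _)| *k);
    let mut hasher = DefaultHasher::new();
    for (k, v) in sorted {
        k.hash(&mut hasher);
        v.hash(&mut hasher);
    }
    format!("{endpoint_part};{:016x}", hasher.finish())
}

fn main() {
    let mut creds = HashMap::new();
    creds.insert("aws_access_key_id".to_string(), "AKIAEXAMPLE".to_string());
    // Credentialed and anonymous opens of the same endpoint get distinct entries.
    assert_ne!(
        cache_key("s3://localhost:9090", &creds),
        cache_key("s3://localhost:9090", &HashMap::new())
    );
}
```

Sorting before hashing makes the key independent of HashMap iteration order, and hashing sidesteps ever materializing credentials inside the key.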

✅ Fixed: HadoopUtils() private constructor

Line 10: private HadoopUtils() {} — done. ✅

✅ Fixed: VortexAzureProperties.skipSignature() return type

Now returns boolean (and both VortexS3Properties and VortexAzureProperties use the same style). ✅

✅ Fixed: Unused fs.s3a.impl config in test

Removed from VortexDataSourceS3MockTest. ✅

✅ Improved: testWriteAndReadWithFormatOptionsFromS3Mock

The test no longer passes the meaningless row_group_size/projection_pushdown no-op options. The first test (testWriteAndReadVortexFilesFromS3Mock) effectively validates the credential-forwarding path end-to-end by actually reading and writing from S3Mock. ✅

🟡 Acknowledged as not-relevant (author's call)

  • env-var guard key mismatch ("access_key_id" vs "aws_access_key_id"): safe at runtime because the explicit with_config loop runs after and overrides any spuriously-applied env-var value. Wasted work but not a bug.
  • Azure skip_signature managed-identity false positive: acknowledged as a limitation.
  • Multi-account Azure: tracked as a TODO comment.

Summary: One RED issue remains — the object store cache key needs to incorporate credentials. Everything else from the previous review has been addressed. The Java-side plumbing looks solid.

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y (Contributor Author) commented Mar 25, 2026

@claude ok i addressed the props in cache key thing. happy now?

claude bot commented Mar 25, 2026

Claude finished @a10y's task in 30s


Re-Review (Cache Key Fix)

  • Read vortex-jni/src/object_store.rs — check cache key fix
  • Post results

✅ Yes, happy now.

url_cache_key now accepts properties and folds the sorted key-value pairs into the key:

```rust
fn url_cache_key(url: &Url, properties: &HashMap<String, String>) -> String {
    let mut sorted_props: Vec<_> = properties.iter().collect();
    sorted_props.sort_by_key(|(k, _)| *k);
    let props_str: String = sorted_props
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("{}://{};{}", url.scheme(), &url[...], props_str)
}
```

The sort guarantees determinism regardless of insertion order, so two calls with the same credentials always resolve to the same cache entry, and two calls with different credentials (or no credentials vs. credentials) get distinct entries. The fix is correct.
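The determinism claim can be checked directly against a simplified mirror of the function, with the `Url` portion reduced to a plain string since only the property handling matters (`props_key` is an illustrative name, not the real function):

```rust
use std::collections::HashMap;

// Simplified mirror of url_cache_key: the same sort-then-join over
// properties, with the scheme/host/port portion as a plain string.
fn props_key(endpoint: &str, properties: &HashMap<String, String>) -> String {
    let mut sorted: Vec<_> = properties.iter().collect();
    sorted.sort_by_key(|(k, _)| *k);
    let props_str = sorted
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("{endpoint};{props_str}")
}

fn main() {
    let mut a = HashMap::new();
    a.insert("secret".to_string(), "s".to_string());
    a.insert("aws_access_key_id".to_string(), "k".to_string());

    let mut b = HashMap::new();
    b.insert("aws_access_key_id".to_string(), "k".to_string());
    b.insert("secret".to_string(), "s".to_string());

    // Same properties, different insertion order: identical key.
    assert_eq!(props_key("s3://host", &a), props_key("s3://host", &b));
    // No credentials vs. credentials: distinct keys.
    assert_ne!(props_key("s3://host", &a), props_key("s3://host", &HashMap::new()));
    println!("{}", props_key("s3://host", &a));
    // → s3://host;aws_access_key_id=k,secret=s
}
```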

One non-blocking observation: credentials end up as plaintext in an in-memory HashMap key. That's fine — it's never serialized or logged — but worth being aware of if this cache ever grows legs (e.g., persisted to disk, exposed via metrics). Not asking for a change, just flagging it.

All previously raised issues are now resolved. LGTM. 🚀

a10y enabled auto-merge (squash) March 25, 2026 18:58