[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths by yaooqinn · Pull Request #12130 · apache/gluten

yaooqinn · 2026-05-22T19:12:02Z

What's in this PR

Removes the dead Arrow-CSV / Arrow-Dataset JVM code path on the Velox backend.

Three commits:

1. [MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path: drops 12 .scala files that form a closed reference cluster (ArrowCSVFileFormat, ArrowCSVOptionConverter, ArrowCSV{PartitionReaderFactory,Scan,ScanBuilder,Table}, ArrowBatchScanExec, Arrow{Convertor,ScanReplace}Rule, ArrowFileSourceScanExec, BaseArrowScanExec, ArrowCsvScanSuite, GlutenRuntimeConfigSuite) plus the now-empty ArrowBatchScanExecShim in shims/spark33..41.
2. [MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil: drops residual Arrow dataset reader helpers no longer reachable after step 1.
3. [MINOR][VL] Remove dead spark.gluten.sql.native.arrow.reader.enabled config: drops the config and its plumbing (GlutenConfig.NATIVE_ARROW_READER_ENABLED / enableNativeArrowReader, BackendSettingsApi.enableNativeArrowReadFiles default, VeloxBackend.enableNativeArrowReadFiles override), the leftover .set(NATIVE_ARROW_READER_ENABLED.key, "true") calls in 7 test suites (now no-ops), and the corresponding row in docs/Configuration.md.
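
For reviewers who want to confirm that the config commit left no residue, here is a minimal sketch. The helper name `config_residue` is hypothetical and not part of this PR; the four search strings are the identifiers named in the commit description above, and the set of file extensions searched is an assumption.

```shell
# Hypothetical helper (not part of this PR): print every line that still
# mentions the removed config key or its plumbing. Empty output means clean.
config_residue() {
  root="${1:-.}"
  # -F: match fixed strings, so the dots in the config key are literal.
  # -e: one pattern per identifier named in the commit description.
  grep -rnF \
      -e 'spark.gluten.sql.native.arrow.reader.enabled' \
      -e 'NATIVE_ARROW_READER_ENABLED' \
      -e 'enableNativeArrowReader' \
      -e 'enableNativeArrowReadFiles' \
      --include='*.scala' --include='*.java' --include='*.md' "$root"
  # grep exits 1 when nothing matches; treat "no residue" as success.
  return 0
}
```

Running `config_residue .` from the repo root on this branch should print nothing if the removal is complete.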

Why this is safe — three-layer evidence

1. The cluster is unreachable from any active rule pipeline

The two rules that wired the cluster into Spark's planner — ArrowConvertorRule (post-hoc resolution) and ArrowScanReplaceRule (pre-transform) — were unwired from VeloxRuleApi by:

#11190 [GLUTEN-11088][VL] Fall back CSV reader (merged 2026-01-19)

That PR deleted the injector.injectPostHocResolutionRule(ArrowConvertorRule.apply) and injector.injectPreTransform(c => ArrowScanReplaceRule.apply(c.session)) lines from both injectLegacy and the RAS injection path, and re-enabled GlutenCSVv1Suite/v2Suite precisely because falling back to Spark's native CSV reader was now the chosen path.

After that PR, the cluster removed here has no references anywhere in the repo other than references between its own files. Verified by:

$ grep -rn 'ArrowConvertorRule\|ArrowScanReplaceRule\|ArrowCSVFileFormat\|ArrowCSVTable\|ArrowBatchScanExec\|ArrowFileSourceScanExec' \
    --include='*.scala' --include='*.java' --include='*.conf' --include='*.xml' --include='*.properties' .
# (only intra-cluster references — no outside caller, no META-INF service, no config-based wiring)

This is dead-by-isolation, not dead-by-flag-off: the classes exist in the jar, but Spark's planner has no path to dispatch into them.
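
The isolation claim can also be checked mechanically. A minimal sketch follows, assuming each cluster class lives in a file named `<ClassName>.scala` (an assumption about the repo layout); the helper name `outside_refs` is hypothetical. It lists every file that mentions one of the class names searched above, then drops the cluster's own files. Empty output means there is no outside caller.

```shell
# Class names searched for (same set as the grep command above).
REFS='ArrowConvertorRule|ArrowScanReplaceRule|ArrowCSVFileFormat|ArrowCSVTable|ArrowBatchScanExec|ArrowFileSourceScanExec'

# Files treated as part of the cluster, taken from the commit description.
# Assumes one file per class, named after the class.
CLUSTER_FILES='ArrowCSVFileFormat|ArrowCSVOptionConverter|ArrowCSVPartitionReaderFactory|ArrowCSVScan|ArrowCSVScanBuilder|ArrowCSVTable|ArrowBatchScanExec|ArrowBatchScanExecShim|ArrowConvertorRule|ArrowScanReplaceRule|ArrowFileSourceScanExec|BaseArrowScanExec|ArrowCsvScanSuite'

# Hypothetical helper: print files outside the cluster that reference it.
outside_refs() {
  root="${1:-.}"
  # -l: file names only; -E: extended regex so '|' is alternation.
  grep -rlE "$REFS" \
      --include='*.scala' --include='*.java' --include='*.conf' \
      --include='*.xml' --include='*.properties' "$root" \
    | grep -vE "/(${CLUSTER_FILES})\.scala$" \
    | sort
  # An empty result makes the filter grep exit 1; that is the success case.
  return 0
}
```

On a checkout taken just before this PR, `outside_refs .` would be expected to print nothing. Any path it does print is a file outside the cluster that still names one of its classes.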

2. CSV functionality is fully covered by the Spark-native path

The active CSV test surface — GlutenCSVSuite, GlutenCSVv1Suite, GlutenCSVv2Suite, GlutenCSVLegacyTimeParserSuite, GlutenCSVParsingOptionsSuite, GlutenCsvExpressionsSuite, GlutenCsvFunctionsSuite across spark35/40/41 — exercises CSV via the standard Spark FileSourceScanExec / V2 BatchScanExec path that this PR does not touch. Those suites pass today because ArrowConvertorRule no longer intercepts the plan; their previous .set(NATIVE_ARROW_READER_ENABLED.key, "true") lines have been no-ops since #11190 and are removed in commit 3.

The only Arrow-CSV test suite, ArrowCsvScanSuite, had all 8 of its cases @Ignore'd (class-level and method-level), with the V2 variant disabled by #9380 and the flaky cases ignored by #8906 — root cause tracked in #8905 (Velox SIGSEGV in G1BarrierSet on the JNI↔native Arrow vector GC boundary).

3. History timeline

| Date | PR / Issue | What happened |
| --- | --- | --- |
| 2024-05-08 | #5447 / #5414 | Added the Arrow-CSV cluster because "Velox does not support CSV/TEXT yet, instead via arrow dataset" |
| 2025-03-05 | #8906 | Ignored flaky `ArrowCsvScanSuite` cases (#8905 root cause: G1BarrierSet SIGSEGV) |
| 2025-04-21 | #9380 | Disabled `ArrowCsvScanSuiteV2` entirely |
| 2026-01-19 | #11190 | Unwired `ArrowConvertorRule` / `ArrowScanReplaceRule` from `VeloxRuleApi`; re-enabled `GlutenCSVv1Suite/v2Suite` via Spark-native CSV path |
| This PR | - | Physical removal of the now-unreachable cluster + zombie config |

This is not "giving up CSV"

Velox upstream is bringing CSV in-house via the native TextReader (facebookincubator/velox#13053, ongoing; recent landings in velox#14677, prestodb/presto#25995). When Gluten wires that path through Substrait ReadRel(text), the implementation will live in gluten-substrait/ and cpp/, rather than in a re-introduced JVM-side Arrow Dataset bypass.

A side benefit: with the JVM caller gone, the 883-line CsvFragmentScanOptions.from(Map) patch carried by dev/build-arrow.sh becomes a candidate for removal in a follow-up.