[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths #12130

Open
yaooqinn wants to merge 3 commits into apache:main from yaooqinn:users/kentyao/spike-drop-arrow-csv-dataset

yaooqinn (Member) commented May 22, 2026

What's in this PR

Removes the dead Arrow-CSV / Arrow-Dataset JVM code path on the Velox backend.

Three commits:

  1. [MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path — drops 12 .scala files that form a closed reference cluster (ArrowCSVFileFormat, ArrowCSVOptionConverter, ArrowCSV{PartitionReaderFactory,Scan,ScanBuilder,Table}, ArrowBatchScanExec, Arrow{Convertor,ScanReplace}Rule, ArrowFileSourceScanExec, BaseArrowScanExec, ArrowCsvScanSuite, GlutenRuntimeConfigSuite) plus the now-empty ArrowBatchScanExecShim in shims/spark33..41.
  2. [MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil — drops residual Arrow dataset reader helpers no longer reachable after step 1.
  3. [MINOR][VL] Remove dead spark.gluten.sql.native.arrow.reader.enabled config — drops the config + its plumbing (GlutenConfig.NATIVE_ARROW_READER_ENABLED / enableNativeArrowReader, BackendSettingsApi.enableNativeArrowReadFiles default, VeloxBackend.enableNativeArrowReadFiles override), the leftover .set(NATIVE_ARROW_READER_ENABLED.key, "true") calls in 7 test suites (now no-ops), and the corresponding row in docs/Configuration.md.
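The test-suite cleanup in commit 3 is mechanical. As a rough sketch only (Python, not part of this PR), assuming the leftover calls are written as a chained `.set(NATIVE_ARROW_READER_ENABLED.key, "true")`, optionally qualified with `GlutenConfig.`, a regex pass like the hypothetical `drop_dead_set_calls` below would strip them. The actual edits in the 7 suites may have been made differently.

```python
import re

# Assumption: the dead call appears as one chained `.set(...)` on its own line,
# with the key optionally qualified as `GlutenConfig.NATIVE_ARROW_READER_ENABLED`.
# The leading `\s*` also consumes the newline and indentation before the call,
# so the surrounding builder chain stays contiguous after removal.
_DEAD_SET_CALL = re.compile(
    r'\s*\.set\(\s*(?:GlutenConfig\.)?NATIVE_ARROW_READER_ENABLED\.key\s*,\s*"true"\s*\)'
)


def drop_dead_set_calls(source: str) -> str:
    """Return `source` with every matching no-op `.set(...)` call removed.

    `drop_dead_set_calls` is a hypothetical helper for illustration only.
    """
    return _DEAD_SET_CALL.sub("", source)
```

Other `.set(...)` calls in the same chain are left untouched because the pattern matches only this one key.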

Why this is safe — three-layer evidence

1. The cluster is unreachable from any active rule pipeline

The two rules that wired the cluster into Spark's planner — ArrowConvertorRule (post-hoc resolution) and ArrowScanReplaceRule (pre-transform) — were unwired from VeloxRuleApi by:

#11190 [GLUTEN-11088][VL] Fall back CSV reader (merged 2026-01-19)

That PR deleted the injector.injectPostHocResolutionRule(ArrowConvertorRule.apply) and injector.injectPreTransform(c => ArrowScanReplaceRule.apply(c.session)) lines from both injectLegacy and the RAS injection path, and re-enabled GlutenCSVv1Suite/v2Suite precisely because falling back to Spark's native CSV reader was now the chosen path.
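Why removing the two injection lines makes the rules unreachable can be shown with a toy model. This is a Python sketch under loud assumptions: `ToyInjector`, `arrow_convertor_rule`, and `identity_rule` are hypothetical stand-ins, not Gluten's or Spark's API, and a plan is reduced to a plain string.

```python
from typing import Callable, List

Plan = str
Rule = Callable[[Plan], Plan]


class ToyInjector:
    """Hypothetical stand-in for a planner rule injector (not Gluten's API)."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def inject(self, rule: Rule) -> None:
        # Only rules registered here can ever run.
        self.rules.append(rule)

    def run(self, plan: Plan) -> Plan:
        for rule in self.rules:
            plan = rule(plan)
        return plan


def arrow_convertor_rule(plan: Plan) -> Plan:
    # Toy rewrite: swap the CSV scan for an Arrow-backed one.
    return plan.replace("CSVScan", "ArrowCSVScan")


def identity_rule(plan: Plan) -> Plan:
    return plan


injector = ToyInjector()
injector.inject(identity_rule)
# arrow_convertor_rule still exists, but it is never injected,
# so the planner pipeline has no way to invoke it.
result = injector.run("Project <- CSVScan")
# `result` is the unchanged input plan: the rewrite never fired.
```

The rule function remains callable in isolation, which mirrors the situation described here: the code is present but no pipeline dispatches into it.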

After that PR, the deleted cluster has zero references anywhere in the repo outside of self-references — verified by:

$ grep -rn 'ArrowConvertorRule\|ArrowScanReplaceRule\|ArrowCSVFileFormat\|ArrowCSVTable\|ArrowBatchScanExec\|ArrowFileSourceScanExec' \
    --include='*.scala' --include='*.java' --include='*.conf' --include='*.xml' --include='*.properties' .
# (only intra-cluster references — no outside caller, no META-INF service, no config-based wiring)

This is dead-by-isolation, not dead-by-flag-off: the classes exist in the jar, but Spark's planner has no path to dispatch into them.
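The same isolation check can be scripted so that intra-cluster hits are filtered out automatically. The following is a minimal Python sketch, not the procedure used for this PR. It assumes a file belongs to the cluster when its file name (without extension) equals one of the cluster class names, and `external_references` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Dict, List, Set, Tuple


def external_references(
    root: Path,
    cluster: Set[str],
    suffixes: Tuple[str, ...] = (".scala", ".java"),
) -> Dict[str, List[str]]:
    """Map each cluster class name to the files outside the cluster that mention it.

    Assumption: a file is "inside" the cluster when its stem is a cluster name.
    Unlike the plain grep above, matching is on whole words, so a name such as
    ArrowBatchScanExecShim does not count as a hit for ArrowBatchScanExec.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in sorted(cluster)) + r")\b"
    )
    hits: Dict[str, List[str]] = {name: [] for name in cluster}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if path.stem in cluster:
            continue  # self-references inside the cluster are expected
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name in set(pattern.findall(text)):
            hits[name].append(str(path.relative_to(root)))
    return hits
```

Calling `external_references(Path("."), {"ArrowConvertorRule", "ArrowScanReplaceRule"})` from the repository root would return an empty list for every name if the cluster is isolated in the sense described above. Service files and config-based wiring would still need a separate check, as in the grep.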

2. CSV functionality is fully covered by the Spark-native path

The active CSV test surface — GlutenCSVSuite, GlutenCSVv1Suite, GlutenCSVv2Suite, GlutenCSVLegacyTimeParserSuite, GlutenCSVParsingOptionsSuite, GlutenCsvExpressionsSuite, GlutenCsvFunctionsSuite across spark35/40/41 — exercises CSV via the standard Spark FileSourceScanExec / V2 BatchScanExec path that this PR does not touch. Those suites pass today because ArrowConvertorRule no longer intercepts the plan; their previous .set(NATIVE_ARROW_READER_ENABLED.key, "true") lines have been no-ops since #11190 and are removed in commit 3.

The only Arrow-CSV test suite, ArrowCsvScanSuite, had all 8 of its cases @Ignore'd (class-level and method-level), with the V2 variant disabled by #9380 and the flaky cases ignored by #8906 — root cause tracked in #8905 (Velox SIGSEGV in G1BarrierSet on the JNI↔native Arrow vector GC boundary).

3. History timeline

| Date | PR / Issue | What happened |
|---|---|---|
| 2024-05-08 | #5447 / #5414 | Added the Arrow-CSV cluster because "Velox does not support CSV/TEXT yet, instead via arrow dataset" |
| 2025-03-05 | #8906 | Ignored flaky ArrowCsvScanSuite cases (#8905 root cause: G1BarrierSet SIGSEGV) |
| 2025-04-21 | #9380 | Disabled ArrowCsvScanSuiteV2 entirely |
| 2026-01-19 | #11190 | Unwired ArrowConvertorRule / ArrowScanReplaceRule from VeloxRuleApi; re-enabled GlutenCSVv1Suite/v2Suite via Spark-native CSV path |
| | This PR | Physical removal of the now-unreachable cluster + zombie config |

This is not "giving up CSV"

Velox upstream is bringing CSV in-house via the native TextReader (facebookincubator/velox#13053, ongoing; recent landings in velox#14677, prestodb/presto#25995). When Gluten wires that path through Substrait ReadRel(text), the implementation will live in gluten-substrait/ + cpp/, not in a re-introduced JVM-side Arrow Dataset bypass.

A side benefit: with the JVM caller gone, the 883-line CsvFragmentScanOptions.from(Map) patch carried by dev/build-arrow.sh becomes a candidate for removal in a follow-up.

How was this patch tested?

  • All references statically resolved (no compile-time breakage introduced).
  • CSV coverage continues via the existing Spark-native CSV path through GlutenCSV{,v1,v2,LegacyTimeParser}Suite, GlutenCSVParsingOptionsSuite, GlutenCsvExpressionsSuite, GlutenCsvFunctionsSuite across spark35/40/41.
  • The only deleted test suites (ArrowCsvScanSuite, GlutenRuntimeConfigSuite) were fully @Ignore'd / scoped to the removed config.
  • CI is the remaining verification: deleting code that has no callers cannot change the behavior of any path CI exercises, so a green run confirms that nothing else depended on it.

Generated-by: Claude claude-opus-4.7

github-actions bot added the CORE and VELOX labels May 22, 2026
@github-actions

Run Gluten Clickhouse CI on x86

github-actions bot added the DOCS label May 23, 2026

yaooqinn added 2 commits May 23, 2026 06:32
The ArrowCSV file format and ArrowBatchScanExec chain are unreachable:
no injection in VeloxRuleApi, no META-INF/services entry, and all
ArrowCsvScanSuite cases are @Ignore'd. They were introduced as a
squash-merge byproduct in apache#11776 and never wired up.

Verified by compiling:
  * spark-3.5 + scala-2.12 + arrow 15.0.0-gluten (install)
  * spark-4.0 + scala-2.13 + arrow 18.1.0 (compile)

Generated-by: claude-opus-4.7
makeArrowDiscovery / readArrowSchema / readArrowFileColumnNames /
readSchema(FragmentScanOptions) overloads / loadMissingColumns /
loadPartitionColumns / loadBatch in ArrowUtil have zero callers across
the repo after the previous removal of the ArrowCSV chain. Drop them
together with the now-unused imports (arrow.dataset.*, FileStatus,
URI/URLDecoder, ArrowRecordBatch, Optional, Logging, etc.).

Verified by compiling:
  * spark-3.5 + scala-2.12 (test-compile, patched arrow 15.0.0-gluten)
  * spark-4.0 + scala-2.13 (compile, pure arrow 18.1.0)

Generated-by: claude-opus-4.7
yaooqinn force-pushed the users/kentyao/spike-drop-arrow-csv-dataset branch from e016ea6 to c117bcf May 23, 2026 06:34

yaooqinn force-pushed the users/kentyao/spike-drop-arrow-csv-dataset branch from c117bcf to efc24d9 May 23, 2026 10:21

yaooqinn force-pushed the users/kentyao/spike-drop-arrow-csv-dataset branch from efc24d9 to c80b45b May 23, 2026 11:35

…config

Following the removal of ArrowConvertorRule/ArrowScanReplaceRule (already
unwired from VeloxRuleApi by PR apache#11190 "[GLUTEN-11088][VL] Fall back CSV
reader" merged 2026-01-19), the spark.gluten.sql.native.arrow.reader.enabled
config and its plumbing have no consumers:

  * GlutenConfig.enableNativeArrowReader / NATIVE_ARROW_READER_ENABLED
  * BackendSettingsApi.enableNativeArrowReadFiles (default)
  * VeloxBackend.enableNativeArrowReadFiles (override)

Test suites still set this flag (MiscOperatorSuite, GlutenCSVSuite,
GlutenReadSchemaSuite across spark35/40/41) but it has been a no-op since
PR apache#11190; CSV continues to be covered by these suites via the Spark
native CSV path. The corresponding entry in docs/Configuration.md is
also removed.

Generated-by: Claude claude-opus-4.7
yaooqinn force-pushed the users/kentyao/spike-drop-arrow-csv-dataset branch from c80b45b to 9ea8290 May 23, 2026 14:16