[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths#12130
Open
yaooqinn wants to merge 3 commits into
Open
[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths#12130yaooqinn wants to merge 3 commits into
yaooqinn wants to merge 3 commits into
Conversation
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
The ArrowCSV file format and ArrowBatchScanExec chain are unreachable: no injection in VeloxRuleApi, no META-INF/services entry, and all ArrowCsvScanSuite cases are @ignore'd. They were introduced as a squash-merge byproduct in apache#11776 and never wired up. Verified by compiling: * spark-3.5 + scala-2.12 + arrow 15.0.0-gluten (install) * spark-4.0 + scala-2.13 + arrow 18.1.0 (compile) Generated-by: claude-opus-4.7
makeArrowDiscovery / readArrowSchema / readArrowFileColumnNames / readSchema(FragmentScanOptions) overloads / loadMissingColumns / loadPartitionColumns / loadBatch in ArrowUtil have zero callers across the repo after the previous removal of the ArrowCSV chain. Drop them together with the now-unused imports (arrow.dataset.*, FileStatus, URI/URLDecoder, ArrowRecordBatch, Optional, Logging, etc.). Verified by compiling: * spark-3.5 + scala-2.12 (test-compile, patched arrow 15.0.0-gluten) * spark-4.0 + scala-2.13 (compile, pure arrow 18.1.0) Generated-by: claude-opus-4.7
e016ea6 to
c117bcf
Compare
|
Run Gluten Clickhouse CI on x86 |
c117bcf to
efc24d9
Compare
|
Run Gluten Clickhouse CI on x86 |
efc24d9 to
c80b45b
Compare
|
Run Gluten Clickhouse CI on x86 |
…config Following the removal of ArrowConvertorRule/ArrowScanReplaceRule (already unwired from VeloxRuleApi by PR apache#11190 "[GLUTEN-11088][VL] Fall back CSV reader" merged 2026-01-19), the spark.gluten.sql.native.arrow.reader.enabled config and its plumbing have no consumers: * GlutenConfig.enableNativeArrowReader / NATIVE_ARROW_READER_ENABLED * BackendSettingsApi.enableNativeArrowReadFiles (default) * VeloxBackend.enableNativeArrowReadFiles (override) Test suites still set this flag (MiscOperatorSuite, GlutenCSVSuite, GlutenReadSchemaSuite across spark35/40/41) but it has been a no-op since PR apache#11190; CSV continues to be covered by these suites via the Spark native CSV path. The corresponding entry in docs/Configuration.md is also removed. Generated-by: Claude claude-opus-4.7
c80b45b to
9ea8290
Compare
|
Run Gluten Clickhouse CI on x86 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What's in this PR
Removes the dead Arrow-CSV / Arrow-Dataset JVM code path on the Velox backend.
Three commits:
[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path— drops 12.scalafiles that form a closed reference cluster (ArrowCSVFileFormat,ArrowCSVOptionConverter,ArrowCSV{PartitionReaderFactory,Scan,ScanBuilder,Table},ArrowBatchScanExec,Arrow{Convertor,ScanReplace}Rule,ArrowFileSourceScanExec,BaseArrowScanExec,ArrowCsvScanSuite,GlutenRuntimeConfigSuite) plus the now-emptyArrowBatchScanExecShiminshims/spark33..41.[MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil— drops residual Arrow dataset reader helpers no longer reachable after step 1.[MINOR][VL] Remove dead spark.gluten.sql.native.arrow.reader.enabled config— drops the config + its plumbing (GlutenConfig.NATIVE_ARROW_READER_ENABLED/enableNativeArrowReader,BackendSettingsApi.enableNativeArrowReadFilesdefault,VeloxBackend.enableNativeArrowReadFilesoverride), the leftover.set(NATIVE_ARROW_READER_ENABLED.key, "true")calls in 7 test suites (now no-ops), and the corresponding row indocs/Configuration.md.Why this is safe — three-layer evidence
1. The cluster is unreachable from any active rule pipeline
The two rules that wired the cluster into Spark's planner —
ArrowConvertorRule(post-hoc resolution) andArrowScanReplaceRule(pre-transform) — were unwired fromVeloxRuleApiby:That PR deleted the
injector.injectPostHocResolutionRule(ArrowConvertorRule.apply)andinjector.injectPreTransform(c => ArrowScanReplaceRule.apply(c.session))lines from bothinjectLegacyand the RAS injection path, and re-enabledGlutenCSVv1Suite/v2Suiteprecisely because falling back to Spark's native CSV reader was now the chosen path.After that PR, the deleted cluster has zero references anywhere in the repo outside of self-references — verified by:
This is dead-by-isolation, not dead-by-flag-off: the classes exist in the jar, but Spark's planner has no path to dispatch into them.
2. CSV functionality is fully covered by the Spark-native path
The active CSV test surface —
GlutenCSVSuite,GlutenCSVv1Suite,GlutenCSVv2Suite,GlutenCSVLegacyTimeParserSuite,GlutenCSVParsingOptionsSuite,GlutenCsvExpressionsSuite,GlutenCsvFunctionsSuiteacrossspark35/40/41— exercises CSV via the standard SparkFileSourceScanExec/ V2BatchScanExecpath that this PR does not touch. Those suites pass today becauseArrowConvertorRuleno longer intercepts the plan; their previous.set(NATIVE_ARROW_READER_ENABLED.key, "true")lines have been no-ops since #11190 and are removed in commit 3.The only Arrow-CSV test suite,
ArrowCsvScanSuite, had all 8 of its cases@Ignore'd (class-level and method-level), with the V2 variant disabled by #9380 and the flaky cases ignored by #8906 — root cause tracked in #8905 (Velox SIGSEGV in G1BarrierSet on the JNI↔native Arrow vector GC boundary).3. History timeline
ArrowCsvScanSuitecases (#8905 root cause: G1BarrierSet SIGSEGV)ArrowCsvScanSuiteV2entirelyArrowConvertorRule/ArrowScanReplaceRulefromVeloxRuleApi; re-enabledGlutenCSVv1Suite/v2Suitevia Spark-native CSV pathThis is not "giving up CSV"
Velox upstream is bringing CSV in-house via the native
TextReader(facebookincubator/velox#13053, ongoing; recent landings in velox#14677, prestodb/presto#25995). When Gluten wires that path through SubstraitReadRel(text), the implementation will live ingluten-substrait/+cpp/— not by re-introducing a JVM-side Arrow Dataset bypass.A side benefit: with the JVM caller gone, the 883-line
CsvFragmentScanOptions.from(Map)patch carried bydev/build-arrow.shbecomes a candidate for removal in a follow-up.How was this patch tested?
GlutenCSV{,v1,v2,LegacyTimeParser}Suite,GlutenCSVParsingOptionsSuite,GlutenCsvExpressionsSuite,GlutenCsvFunctionsSuiteacross spark35/40/41.ArrowCsvScanSuite,GlutenRuntimeConfigSuite) were fully@Ignore'd / scoped to the removed config.Generated-by: Claude claude-opus-4.7