[VL] Support native Parquet write for complex types (Struct/Array/Map) #11788

Open
Zouxxyy wants to merge 2 commits into apache:main from Zouxxyy:dev/native-write

Zouxxyy commented Mar 19, 2026

What changes are proposed in this pull request?

Enable native Parquet write for complex types (Struct/Array/Map) in Velox backend.

Velox's parquet writer converts vectors to Arrow then writes via Arrow's Parquet writer, which natively supports nested types. The previous Scala-side type restrictions were unnecessary.

Changes:

  • Remove supportNativeWrite gate — no longer needed since supportWriteFilesExec handles validation
  • Remove Struct/Array/Map restrictions from validateDataTypes for Parquet (only YearMonthIntervalType remains blocked, as Arrow has no mapping for it)
  • Make validateDataTypes recursively check nested types for YearMonthIntervalType
  • Add tests for struct, array, map, and nested struct writes
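
The recursive check described in the third bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the types below are toy stand-ins for Spark's `org.apache.spark.sql.types` hierarchy (so the snippet runs without Spark on the classpath), and `containsYearMonthInterval` is a hypothetical helper name.

```scala
// Toy stand-ins for Spark's DataType hierarchy (hypothetical, illustration only).
// The real validateDataTypes works on org.apache.spark.sql.types.DataType.
object NestedTypeCheck {
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  case object YearMonthIntervalType extends DataType
  final case class ArrayType(elementType: DataType) extends DataType
  final case class MapType(keyType: DataType, valueType: DataType) extends DataType
  final case class StructField(name: String, dataType: DataType)
  final case class StructType(fields: Seq[StructField]) extends DataType

  /** True if `dt`, or any type nested inside it, is YearMonthIntervalType. */
  def containsYearMonthInterval(dt: DataType): Boolean = dt match {
    case YearMonthIntervalType => true
    case ArrayType(elem)       => containsYearMonthInterval(elem)
    case MapType(k, v)         => containsYearMonthInterval(k) || containsYearMonthInterval(v)
    case StructType(fields)    => fields.exists(f => containsYearMonthInterval(f.dataType))
    case _                     => false
  }

  /** Some(reason) for the first offending top-level field, None if all fields pass. */
  def validateDataTypes(fields: Seq[StructField]): Option[String] =
    fields.collectFirst {
      case f if containsYearMonthInterval(f.dataType) =>
        s"Field '${f.name}' contains unsupported YearMonthIntervalType"
    }
}
```

With this shape, a column such as `map<string, array<interval year to month>>` is rejected even though its top-level type is a map, while arbitrarily nested struct/array/map columns without the interval type pass.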

How was this patch tested?

New tests in VeloxParquetWriteSuite. Existing tests pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Kiro (Claude Opus 4.6)

github-actions bot added the CORE and VELOX labels Mar 19, 2026
@github-actions

Run Gluten Clickhouse CI on x86

Zouxxyy commented Mar 19, 2026

Generated-by: Kiro (Claude Opus 4.6)

 +-------------------------------------------------------------+
 |              Spark SQL Write Statement                        |
 |  (INSERT INTO / CTAS / INSERT OVERWRITE DIR / Hive INSERT)   |
 +-----------------------------+-------------------------------+
                               |
                               v
 +-----------------------------+-------------------------------+
 |                  DataWritingCommandExec(cmd, child)          |
 +-----------------------------+-------------------------------+
                               |
 ==============================|====================================
  Gate 1: NativeWritePostRule  (GlutenWriterColumnarRules.scala)
 ==============================|====================================
                               |
                               v
              +-------------------------------+
              |  1a. enableNativeWriteFiles   |
              |  spark.gluten.sql.native      |
              |  .writer.enabled              |
              |  default true since Spark 3.4 |
              +------+----------------+-------+
                  NO |                | YES
                     v                v
               +--------+   +-------------------------------+
               |FALLBACK|   |  1b. getNativeFormat(cmd)     |
               +--------+   |                               |
                             |  Match write cmd + format:    |
                             |  OK InsertIntoHadoopFs       |
                             |  OK CreateDataSourceTable    |
                             |  OK InsertIntoHiveDir/Table  |
                             |  X  Other commands            |
                             +------+----------------+------+
                                 NO |                | Some("parquet")
                                    v                v
                              +--------+   +--------------------+
                              |FALLBACK|   | Inject FakeRowAdapt|
                              +--------+   +---------+----------+
                                                     |
 ====================================================|=================
  Gate 2: doValidateInternal  (WriteFilesExecTransformer.scala)
 ====================================================|=================
                                                     |
                                                     v
                              +--------------------------------------+
                              |  2a. Constant complex type check     |
                              |                                      |
                              |  Parquet + Project contains           |
                              |  Literal(ArrayType|MapType)?         |
                              +------+----------------+--------------+
                                 YES |                | NO
                                     v                v
                               +--------+   +-------------------------------+
                               |FALLBACK|   | 2b. supportWriteFilesExec()   |
                               +--------+   +-------------------------------+
                                                     |
                                                     v
                    +---------------------------------------------------+
                    |   Validation chain (in order, any fail = fallback) |
                    |                                                   |
                    |  (1) validateFileFormat                            |
                    |      OK ParquetFileFormat                         |
                    |      OK HiveFileFormat (Parquet SerDe)            |
                    |      X  Other formats                             |
                    |                    |                               |
                    |                    v                               |
                    |  (2) validateCompressionCodec                      |
                    |      X  brotli / lzo / lz4raw / lz4_raw           |
                    |                    |                               |
                    |                    v                               |
                    |  (3) validateFieldMetadata                         |
                    |      X  StructField.metadata != Metadata.empty    |
                    |                    |                               |
                    |                    v                               |
                    |  (4) validateDataTypes                             |
                    |      Parquet: X YearMonthIntervalType             |
                    |      Non-Parquet: X Struct/Array/Map/YearMonth    |
                    |                    |                               |
                    |                    v                               |
                    |  (5) validateWriteFilesOptions                     |
                    |      X  maxRecordsPerFile > 0                     |
                    |                    |                               |
                    |                    v                               |
                    |  (6) validateBucketSpec                            |
                    |      X  Non Hive-compatible bucket write          |
                    +------------------------+--------------------------+
                                             |
                                      +------+------+
                                      | All passed? |
                                      +--+-------+--+
                                      NO |       | YES
                                         v       v
                                   +--------+   +------------------------------+
                                   |FALLBACK|   | 2c. doNativeValidation()     |
                                   +--------+   |  Substrait -> C++ validation |
                                                +---------------+--------------+
                                                                |
 ===============================================================|=========
  Gate 3: C++ Validator  (SubstraitToVeloxPlanValidator.cc)
 ===============================================================|=========
                                                                |
                                                                v
                              +--------------------------------------+
                              |  validate(WriteRel)                  |
                              |                                      |
                              |  3a. Recursively validate input plan |
                              |  3b. Parse input row type            |
                              |  3c. Validate partition column types:|
                              |      OK BOOLEAN / TINYINT / SMALLINT |
                              |      OK INTEGER / BIGINT             |
                              |      OK VARCHAR / VARBINARY          |
                              |      X  Other types as partition col |
                              |                                      |
                              |  Data columns: no type restriction   |
                              +------+----------------+--------------+
                                  NO |                | YES
                                     v                v
                               +--------+   +---------------------+
                               |FALLBACK|   | OK: Native Write    |
                               +--------+   +---------+-----------+
                                                      |
                                                      v
                              +--------------------------------------+
                              |           Execution Layer             |
                              |                                      |
                              |  VeloxColumnarWriteFilesExec          |
                              |       |                              |
                              |       v                              |
                              |  VeloxParquetDataSource               |
                              |       |                              |
                              |       v                              |
                              |  velox::parquet::Writer              |
                              |  (backed by Arrow Parquet Writer)    |
                              +--------------------------------------+
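
The "in order, any fail = fallback" behaviour of the Gate 2 validation chain can be sketched as a list of checks where the first failure wins. This is a simplified illustration under stated assumptions: `WriteRequest`, `Check`, and the function bodies are hypothetical, only checks (1), (2), and (5) from the diagram are modelled, and the Hive Parquet SerDe case is omitted.

```scala
object WriteValidationChain {
  // Hypothetical, simplified description of a write request (not a Gluten type).
  final case class WriteRequest(
      format: String,
      compressionCodec: String,
      maxRecordsPerFile: Long)

  // A check returns Some(reason) to request fallback, or None to pass.
  type Check = WriteRequest => Option[String]

  // (1) Only Parquet is accepted in this sketch.
  val validateFileFormat: Check = r =>
    if (r.format == "parquet") None else Some(s"Unsupported format: ${r.format}")

  // (2) Codecs the diagram marks as unsupported.
  private val unsupportedCodecs = Set("brotli", "lzo", "lz4raw", "lz4_raw")
  val validateCompressionCodec: Check = r =>
    if (unsupportedCodecs.contains(r.compressionCodec.toLowerCase)) {
      Some(s"Unsupported compression codec: ${r.compressionCodec}")
    } else {
      None
    }

  // (5) A positive maxRecordsPerFile forces fallback.
  val validateWriteFilesOptions: Check = r =>
    if (r.maxRecordsPerFile > 0) Some("maxRecordsPerFile is not supported") else None

  val chain: Seq[Check] =
    Seq(validateFileFormat, validateCompressionCodec, validateWriteFilesOptions)

  /** Runs the checks lazily, in order, and returns the first failure reason if any. */
  def validate(r: WriteRequest): Option[String] =
    chain.iterator.map(check => check(r)).collectFirst { case Some(reason) => reason }
}
```

Because the checks run through an iterator, later checks are skipped once one fails, so the reported fallback reason always comes from the earliest failing step.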

jackylee-ch requested a review from Copilot on March 19, 2026 04:02
Copilot AI left a comment

Pull request overview

This PR enables Velox’s native Parquet write path to support complex Spark SQL types (Struct/Array/Map) by removing earlier type-gating and adjusting the Velox write validation to allow nested types for Parquet.

Changes:

  • Removes the schema-based “native write supported” gate (supportNativeWrite) and relies on the WriteFiles validation path instead.
  • Updates Velox write validation to allow Parquet StructType (still blocks YearMonthIntervalType) and reorders validation checks.
  • Simplifies Delta Parquet native-writability checks and adds new Velox Parquet write tests for complex/nested types.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File | Description
--- | ---
gluten-substrait/src/main/scala/org/apache/spark/sql/execution/datasources/GlutenWriterColumnarRules.scala | Removes schema gate before enabling native write properties/adaptor injection.
gluten-substrait/src/main/scala/org/apache/gluten/backendsapi/BackendSettingsApi.scala | Deletes supportNativeWrite from the backend settings API.
backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxBackend.scala | Allows Parquet struct types; refactors/reorders native write validation chain.
backends-velox/src/test/scala/org/apache/spark/sql/execution/VeloxParquetWriteSuite.scala | Adds native Parquet write coverage for struct/array/map and nested struct.
backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/files/GlutenDeltaFileFormatWriter.scala | Removes dependency on deleted Parquet companion helper; forces native-writable flag.
backends-velox/src-delta33/main/scala/org/apache/spark/sql/delta/GlutenParquetFileFormat.scala | Removes fallback branch + companion object; always uses Gluten Parquet OutputWriterFactory.

Comment on lines 101 to 106
      case rc @ DataWritingCommandExec(cmd, child) =>
        // The same thread can set these properties in the last query submission.
        val format =
-         if (
-           BackendsApiManager.getSettings.supportNativeWrite(child.schema.fields) &&
-           BackendsApiManager.getSettings.enableNativeWriteFiles()
-         ) {
+         if (BackendsApiManager.getSettings.enableNativeWriteFiles()) {
            getNativeFormat(cmd)
          } else {

Comment on lines 33 to 42
  import org.slf4j.LoggerFactory

  class GlutenParquetFileFormat
    extends ParquetFileFormat
    with DataSourceRegister
    with Logging
    with Serializable {
    import GlutenParquetFileFormat._

    private val logger = LoggerFactory.getLogger(classOf[GlutenParquetFileFormat])
