feat: Variant Support #2188
| let table_creation = TableCreation::builder() | ||
| .name(name.clone()) | ||
| .schema(iceberg_schema) | ||
| .format_version(format_version) |
Before this change, existing tests were rightfully failing because we used to create a V2 table with a nanosecond timestamp column:
https://github.com/apache/iceberg-rust/actions/runs/22522306915/job/65248930667
The new logic determines the minimum format version the schema requires and uses that, but never less than V2. Tables with ns timestamps are therefore now created as V3.
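A minimal sketch of that rule, not the PR's actual code: `ColumnType`, `min_format_version`, and `required_format_version` are hypothetical names, and the enum is a simplified stand-in for the real schema types. It only illustrates "take the highest minimum version across all columns, with V2 as the floor".

```rust
/// Table format versions, declared in ascending order so that the derived
/// `Ord` treats V3 as greater than V2.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum FormatVersion {
    V1,
    V2,
    V3,
}

/// Simplified, hypothetical stand-in for the column types of a schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnType {
    Long,
    String,
    Timestamp,
    TimestampNs,
    Variant,
}

impl ColumnType {
    /// Lowest format version in which this type may appear.
    pub fn min_format_version(self) -> FormatVersion {
        match self {
            // Nanosecond timestamps and variant require v3.
            ColumnType::TimestampNs | ColumnType::Variant => FormatVersion::V3,
            _ => FormatVersion::V1,
        }
    }
}

/// Highest minimum version over all columns, but never below V2.
pub fn required_format_version(columns: &[ColumnType]) -> FormatVersion {
    columns
        .iter()
        .map(|c| c.min_format_version())
        .fold(FormatVersion::V2, |acc, v| acc.max(v))
}
```

With this shape, a schema of plain `Long` and `String` columns stays on V2, while adding a single `TimestampNs` or `Variant` column moves the table to V3.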
@CTTY @liurenjie1024 @Xuanwo this would be ready for review!
CTTY
left a comment
Thanks for the feature! Just took a look.
Also the test seems to be failing
| custom_attributes: Default::default(), | ||
| }, | ||
| ]; | ||
| let mut schema = avro_record_schema(VARIANT_LOGICAL_TYPE, fields)?; |
I don't think we can use a static string for every record. What if there are multiple variant columns? Would the record names conflict?
The record name is now properly built from the field ID.
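To illustrate why a per-field name avoids the conflict, here is a small sketch. The function name `variant_record_name` and the `variant_r{id}` pattern are assumptions made for illustration; the naming scheme the PR actually uses is not shown in this thread.

```rust
use std::collections::HashSet;

/// Hypothetical helper: derive the Avro record name from the Iceberg field ID
/// so that two variant columns in one schema never share a name.
pub fn variant_record_name(field_id: i32) -> String {
    format!("variant_r{field_id}")
}

/// Returns true if every field ID yields a distinct record name.
pub fn names_are_unique(field_ids: &[i32]) -> bool {
    let names: HashSet<String> = field_ids.iter().map(|id| variant_record_name(*id)).collect();
    names.len() == field_ids.len()
}
```

Because Iceberg field IDs are unique within a schema, any name derived one-to-one from the ID is unique as well.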
| // field ID resolves to Type::Variant and record all their sub-fields so | ||
| // the second filter_leaves can include them directly. | ||
| let mut variant_sub_fields: HashMap<FieldRef, i32> = HashMap::new(); | ||
| for top_field in fields.iter() { |
What would happen if variant is nested within another type?
It will fail to find the metadata/value leaf nodes in the variant. The following errors can occur if the variant is nested in a Map:
Map field must have exactly 2 fields
partial projection of MapArray is not supported
Fixed in the latest commit.
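For readers following the nesting question, one way to find variant fields at any depth is to recurse through the type tree instead of inspecting only top-level fields. The sketch below uses a hypothetical `Ty` enum and `collect_variant_ids` function; it is not the reader code from this PR and says nothing about how the fix was actually implemented.

```rust
/// Hypothetical, simplified type tree. Each nested field carries its field ID.
#[derive(Debug, Clone, PartialEq)]
pub enum Ty {
    Primitive,
    Variant,
    Struct(Vec<(i32, Ty)>),
    List(Box<(i32, Ty)>),
    Map {
        key: Box<(i32, Ty)>,
        value: Box<(i32, Ty)>,
    },
}

/// Depth-first walk that records the field ID of every variant,
/// including those nested inside structs, lists, and maps.
pub fn collect_variant_ids(field_id: i32, ty: &Ty, out: &mut Vec<i32>) {
    match ty {
        Ty::Primitive => {}
        Ty::Variant => out.push(field_id),
        Ty::Struct(fields) => {
            for (id, child) in fields {
                collect_variant_ids(*id, child, out);
            }
        }
        Ty::List(element) => collect_variant_ids(element.0, &element.1, out),
        Ty::Map { key, value } => {
            collect_variant_ids(key.0, &key.1, out);
            collect_variant_ids(value.0, &value.1, out);
        }
    }
}
```

A top-level-only loop would miss a variant stored as a map value, which matches the kind of failure described above.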
| default: None, | ||
| custom_attributes: Default::default(), | ||
| }, | ||
| ]; |
Would you use LazyLock<Vec<AvroRecordField>> plus clone? avro_record_schema consumes the Vec by value (it builds a RecordSchema), so each call needs its own owned copy regardless. Not sure if this would gain much.
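A sketch of the `LazyLock` plus `clone` pattern under discussion. `RecordField` and `build_record` are hypothetical stand-ins for `AvroRecordField` and `avro_record_schema`; only `std::sync::LazyLock` is a real API here (stable since Rust 1.80). The `metadata` and `value` field names follow the variant layout mentioned earlier in the thread.

```rust
use std::sync::LazyLock;

/// Hypothetical stand-in for an Avro record field.
#[derive(Debug, Clone, PartialEq)]
pub struct RecordField {
    pub name: String,
    pub required: bool,
}

/// Built once on first access and shared afterwards.
static VARIANT_FIELDS: LazyLock<Vec<RecordField>> = LazyLock::new(|| {
    vec![
        RecordField { name: "metadata".to_string(), required: true },
        RecordField { name: "value".to_string(), required: true },
    ]
});

/// Hypothetical stand-in for a builder that takes the fields by value.
pub fn build_record(name: &str, fields: Vec<RecordField>) -> (String, Vec<RecordField>) {
    (name.to_string(), fields)
}

/// Each call still needs an owned Vec, so the static is cloned every time.
pub fn variant_record(name: &str) -> (String, Vec<RecordField>) {
    build_record(name, VARIANT_FIELDS.clone())
}
```

As the comment above notes, the clone allocates a fresh `Vec` and fresh `String`s on every call, so the saving over constructing the fields inline is likely small.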
| Err(Error::new( | ||
| ErrorKind::FeatureUnsupported, | ||
| "Conversion from VariantType is not supported for Glue", | ||
| )) |
We can just return "variant".to_string(). On Glue it would look like this:
{
"data": {
"unknown": "variant"
}
}
Added this for HMS too.
However I see this is still open:
apache/iceberg#15220
@CTTY ready for another round!
CTTY
left a comment
Mostly LGTM! Left some minor comments
| Err(Error::new( | ||
| ErrorKind::DataInvalid, | ||
| format!( | ||
| "Invalid schema for v{format_version}:\n- {}", |
| "Invalid schema for v{format_version}:\n- {}", | |
| "Invalid schema for {format_version}:\n- {}", |
nit: v seems redundant
| /// Minimum format version required for nanosecond-precision timestamp types (v3). | ||
| pub const MIN_FORMAT_VERSION_TIMESTAMP_NS: FormatVersion = FormatVersion::V3; | ||
| /// Minimum format version required for the variant type (v3). | ||
| pub const MIN_FORMAT_VERSION_VARIANT: FormatVersion = FormatVersion::V3; |
nit: These are not really needed, since the min versions of these types should be hardcoded to FormatVersion::V3 anyway
| /// Validates that every type used across all schemas is supported by the | ||
| /// table's format version. Delegates to [`Schema::check_format_compatibility`]. | ||
| fn validate_schema_format_compatibility(&self) -> Result<()> { | ||
| for schema in self.schemas.values() { |
Could you help me understand why we need to validate all schemas rather than just the current schema?
Which issue does this PR close?
Variant Support.
Arrow value support is currently missing as I am unsure how we want to extend Literal.

What changes are included in this PR?
Core: Variant Type
- crates/iceberg/src/spec/datatypes.rs: new Variant type
- crates/iceberg/src/spec/values/literal.rs: Variant literal value
- crates/iceberg/src/spec/schema/: visitor, index, pruning, mod, id reassigner all handle Variant
- crates/iceberg/src/spec/table_metadata.rs: metadata support
Avro
- crates/iceberg/src/avro/schema.rs: read/write Variant in Avro
Arrow
- crates/iceberg/src/arrow/schema.rs: map Variant to Arrow type
- crates/iceberg/src/arrow/reader.rs: read Variant from Arrow
- crates/iceberg/src/arrow/value.rs: Arrow value conversion
- caching_delete_file_loader.rs and nan_val_cnt_visitor.rs
Parquet
- crates/iceberg/src/writer/file_writer/parquet_writer.rs: write Variant columns
Tests & Dev
- crates/integration_tests/tests/read_variant.rs: new integration test for reading Variant data
- dev/spark/provision.py: Spark provisioning to generate Variant test data

Are these changes tested?
Sure! Even integration tested :)