feat(table): Adding geometry and geography type + schema plumbing#984
feat(table): Adding geometry and geography type + schema plumbing#984happydave1 wants to merge 5 commits intoapache:mainfrom
Conversation
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
laskoviymishka
left a comment
There was a problem hiding this comment.
Direction looks good overall: JSON parsing, partition-spec rejection, null-only defaults, v3 gating, and Substrait pass-through all seem aligned with the spec.
I’d hold this before merge on two things:
-
SchemaVisitorPerPrimitiveTypenow hasVisitGeometry/VisitGeography, and the visitors implement them, butvisitFielddoes not seem to dispatch to them yet. So geo columns still fall through tovisitor.Primitive(...), which panics for both Arrow and Substrait conversion. That meansSchemaToArrowSchemaor Substrait conversion can panic on a geo column today. -
The geography default algorithm canonicalization seems to differ from Java. The spec default is
spherical; Java treats that as the implicit/default form and emits canonicalgeographyforgeography(OGC:CRS84, spherical). This PR appears to re-emit the full form, which may cause cross-client schema fingerprint drift for the same logical type.
The rest looks like smaller cleanups. Once these are addressed, happy to take another pass.
| VisitBinary() T | ||
| VisitUUID() T | ||
| VisitUnknown() T | ||
| VisitGeometry(GeometryType) T |
There was a problem hiding this comment.
These methods get added to the interface but visitField (~671–720) doesn't dispatch them. GeometryType/GeographyType are PrimitiveTypes, so they fall through to visitor.Primitive(...), which both visitors panic from. SchemaToArrowSchema on any geo column will panic at runtime today. I'd add the two cases in the dispatcher and a regression test that runs iceberg.Visit(geoSchema, convertToArrow{}).
| } | ||
|
|
||
| func (GeographyType) primitive() {} | ||
| func (g GeographyType) Type() string { |
There was a problem hiding this comment.
Spec default algorithm is spherical. Java treats it as null internally and emits canonical "geography" for geography(OGC:CRS84, spherical). Here we re-emit the full string, so GeographyTypeOf("OGC:CRS84", EdgeAlgorithmSpherical) doesn't equal GeographyType{} and emits different JSON than Java for the same logical type — schema fingerprints will diverge. I'd canonicalize at construction (CRS84 + spherical → zero-value) and have Algorithm() return EdgeAlgorithmSpherical for the empty internal state, mirroring CRS().
| normalizedCRS = "" | ||
| } | ||
|
|
||
| return GeographyType{crs: normalizedCRS, algorithm: algorithm}, nil |
There was a problem hiding this comment.
Doesn't validate algorithm — GeographyTypeOf("srid:4326", EdgeAlgorithm("garbage")) succeeds and produces metadata Java/PyIceberg will reject. JSON parse goes through ParseEdgeAlgorithm; the constructor doesn't. I'd run it here too when algorithm != "".
|
|
||
| func GeometryTypeOf(crs string) (GeometryType, error) { | ||
| if crs == "" { | ||
| return GeometryType{}, errors.New("invalid CRS: (empty string)") |
There was a problem hiding this comment.
Plain errors.New here. Rest of this file wraps ErrInvalidTypeString (lines 161/169/198) so callers can errors.Is. Same applies to GeographyTypeOf (line 912) and ParseEdgeAlgorithm (line 850).
| decimalRegex = regexp.MustCompile(`decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)`) | ||
| geometryRegex = regexp.MustCompile(`(?i)^geometry\s*(?:\(\s*([^)]+?)\s*\))?$`) | ||
| geographyRegex = regexp.MustCompile(`(?i)^geography\s*(?:\(\s*([^,]+?)\s*(?:,\s*(\w+)\s*)?\))?$`) | ||
| ) |
There was a problem hiding this comment.
Both regexes over-accept: geometry(srid:4326, extra) parses with the comma absorbed into the CRS group; geography(srid:4269 karney) (missing comma) parses with the space absorbed. I'd tighten the CRS group (e.g. [^),]+? for geometry) and add a couple of negative tests.
| // Returning nil indicates this type cannot be converted to Substrait | ||
| return nil | ||
| } | ||
| func (convertToSubstrait) VisitGeometry(iceberg.GeometryType) types.Type { return &types.BinaryType{} } |
There was a problem hiding this comment.
Same dispatch issue as schema.go — these are dead code today because visitField doesn't route to them, and convertToSubstrait.Primitive panics on the fall-through. Minor: VisitGeometry is one-line but VisitGeography is multi-line; surrounding methods are uniformly one-liners.
|
|
||
| func (IdentityTransform) CanTransform(t Type) bool { | ||
| _, ok := t.(PrimitiveType) | ||
| switch t.(type) { |
There was a problem hiding this comment.
Style nit — the nested switch only adds a level to special-case geo. I'd flatten:
switch t.(type) {
case GeometryType, GeographyType:
return false
}
_, ok := t.(PrimitiveType)
return ok| }, | ||
| notAllowed: []iceberg.Type{ | ||
| &iceberg.StructType{}, &iceberg.ListType{}, &iceberg.MapType{}, | ||
| iceberg.GeometryType{}, |
There was a problem hiding this comment.
Identity and Bucket get geo coverage but Void/Truncate/Year/Month/Day/Hour aren't pinned. Void particularly — per spec it's the only transform geo is allowed for, so it's the one most likely to silently regress.
| }) | ||
|
|
||
| t.Run("test update schema with add geometry and geography columns", func(t *testing.T) { | ||
| table := New([]string{"id"}, testMetadata, "", nil, nil) |
There was a problem hiding this comment.
This builds against testMetadata (v2) and only calls Apply() — that path skips checkSchemaCompatibility, so the test passes despite geo being illegal in v2. Reads as if v2 add works. Build a v3 metadata for this case (mirror the geoMeta construction in the error test below) and ideally run Commit()/BuildUpdates() so the realistic path is covered.
| }) | ||
| } | ||
|
|
||
| func TestGeometryGeographyNullOnlyDefaults(t *testing.T) { |
There was a problem hiding this comment.
The v2 with non-null initial default subtest asserts "is not supported until v3". That path actually fires both the v2-unsupported error and the must-default-to-null error, and ErrorContains happens to match the first. If geo ever becomes allowed in v2, this silently changes meaning. Worth splitting: one v2 subtest asserting the type-unsupported message, one v3 subtest with non-null default asserting must-default-to-null.
| type EdgeAlgorithm string | ||
|
|
||
| const ( | ||
| EdgeAlgorithmSpherical EdgeAlgorithm = "spherical" | ||
| EdgeAlgorithmVincenty EdgeAlgorithm = "vincenty" | ||
| EdgeAlgorithmThomas EdgeAlgorithm = "thomas" | ||
| EdgeAlgorithmAndoyer EdgeAlgorithm = "andoyer" | ||
| EdgeAlgorithmKarney EdgeAlgorithm = "karney" | ||
| ) |
There was a problem hiding this comment.
these exist in geoarrow/geoarrow-go, we should use the constants from there instead of defining them here. See https://github.com/geoarrow/geoarrow-go/blob/main/metadata.go
| case UUIDType: | ||
| return &boundRef[uuid.UUID]{field: field, acc: acc} | ||
| case GeographyType, GeometryType: | ||
| return &boundRef[[]byte]{field: field, acc: acc} |
There was a problem hiding this comment.
should this be https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L20 instead of []byte?
| } | ||
| } | ||
|
|
||
| func (c convertToArrow) VisitGeometry(iceberg.GeometryType) arrow.Field { |
There was a problem hiding this comment.
I don't see why we don't just use https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L15 off the bat right now instead of using binary/large binary as the intermediate.
https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L84 can be used for the large type case
| // Passthrough binary for now, adding geoarrow-go support later | ||
| if c.useLargeTypes { | ||
| return arrow.Field{Type: arrow.BinaryTypes.LargeBinary} | ||
| } | ||
|
|
||
| return arrow.Field{Type: arrow.BinaryTypes.Binary} |
There was a problem hiding this comment.
same point as above. why not just use https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L20 right now instead of the passthrough?
Based on #628 and addresses #990
Note that this PR sets up a PR to address #991