
feat(table): Adding geometry and geography type + schema plumbing #984

Open
happydave1 wants to merge 5 commits into apache:main from happydave1:feat/geo-type

Conversation

Contributor

@happydave1 happydave1 commented May 5, 2026

Based on #628 and addresses #990

Note that this PR sets up a PR to address #991

twuebi and others added 5 commits May 6, 2026 16:14
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
Signed-off-by: happydave1 <dzhao2004@gmail.com>
@happydave1 happydave1 changed the title from "Feat/geo type" to "feat(table): Adding geometry and geography type + schema plumbing" May 6, 2026
@happydave1 happydave1 marked this pull request as ready for review May 6, 2026 20:30
@happydave1 happydave1 requested a review from zeroshade as a code owner May 6, 2026 20:30
Contributor

@laskoviymishka laskoviymishka left a comment

Direction looks good overall: JSON parsing, partition-spec rejection, null-only defaults, v3 gating, and Substrait pass-through all seem aligned with the spec.

I’d hold this before merge on two things:

  1. SchemaVisitorPerPrimitiveType now has VisitGeometry / VisitGeography, and the visitors implement them, but visitField does not seem to dispatch to them yet. So geo columns still fall through to visitor.Primitive(...), which panics for both Arrow and Substrait conversion. That means SchemaToArrowSchema or Substrait conversion can panic on a geo column today.

  2. The geography default algorithm canonicalization seems to differ from Java. The spec default is spherical; Java treats that as the implicit/default form and emits canonical geography for geography(OGC:CRS84, spherical). This PR appears to re-emit the full form, which may cause cross-client schema fingerprint drift for the same logical type.

The rest looks like smaller cleanups. Once these are addressed, happy to take another pass.

Comment thread schema.go
VisitBinary() T
VisitUUID() T
VisitUnknown() T
VisitGeometry(GeometryType) T
Contributor

These methods are added to the interface, but visitField (~lines 671–720) doesn't dispatch to them. GeometryType/GeographyType are PrimitiveTypes, so they fall through to visitor.Primitive(...), where both visitors panic. SchemaToArrowSchema on any geo column will panic at runtime today. I'd add the two cases in the dispatcher and a regression test that runs iceberg.Visit(geoSchema, convertToArrow{}).
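
A minimal sketch of the dispatch fix described above, using local stand-in types rather than the real iceberg-go definitions (the names `perTypeVisitor`, `dispatch`, and `toArrowName` are hypothetical). It only illustrates why each concrete type needs its own `case` before the generic fallback is reached:

```go
package main

import "fmt"

// Type and the structs below are simplified stand-ins, not the iceberg-go API.
type Type interface{ String() string }

type Int32Type struct{}
type GeometryType struct{}
type GeographyType struct{}

func (Int32Type) String() string     { return "int" }
func (GeometryType) String() string  { return "geometry" }
func (GeographyType) String() string { return "geography" }

// perTypeVisitor mirrors the idea of a per-primitive-type visitor interface.
type perTypeVisitor interface {
	Primitive(Type) string // generic fallback; may panic for unsupported types
	VisitInt32() string
	VisitGeometry(GeometryType) string
	VisitGeography(GeographyType) string
}

// dispatch routes each concrete type to its dedicated method. Without the
// two geo cases, geo types would hit the default branch and panic.
func dispatch(t Type, v perTypeVisitor) string {
	switch typ := t.(type) {
	case Int32Type:
		return v.VisitInt32()
	case GeometryType:
		return v.VisitGeometry(typ)
	case GeographyType:
		return v.VisitGeography(typ)
	default:
		return v.Primitive(t)
	}
}

// toArrowName is a toy visitor whose fallback panics, like the real converters.
type toArrowName struct{}

func (toArrowName) Primitive(t Type) string             { panic("unsupported type: " + t.String()) }
func (toArrowName) VisitInt32() string                  { return "int32" }
func (toArrowName) VisitGeometry(GeometryType) string   { return "binary" }
func (toArrowName) VisitGeography(GeographyType) string { return "binary" }

func main() {
	for _, t := range []Type{Int32Type{}, GeometryType{}, GeographyType{}} {
		fmt.Println(t, "->", dispatch(t, toArrowName{}))
	}
}
```

A regression test in the same spirit would visit a schema containing one geometry and one geography column and assert that the conversion returns without panicking.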

Comment thread types.go
}

func (GeographyType) primitive() {}
func (g GeographyType) Type() string {
Contributor

Spec default algorithm is spherical. Java treats it as null internally and emits canonical "geography" for geography(OGC:CRS84, spherical). Here we re-emit the full string, so GeographyTypeOf("OGC:CRS84", EdgeAlgorithmSpherical) doesn't equal GeographyType{} and emits different JSON than Java for the same logical type — schema fingerprints will diverge. I'd canonicalize at construction (CRS84 + spherical → zero-value) and have Algorithm() return EdgeAlgorithmSpherical for the empty internal state, mirroring CRS().
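
One way the suggested canonicalization could look, as a sketch only. The types here are local stand-ins, and two details are assumptions not confirmed by the PR: the default CRS is compared by exact string match, and the spherical algorithm is collapsed to the empty internal state regardless of CRS.

```go
package main

import "fmt"

// Local stand-ins; not the iceberg-go definitions.
type EdgeAlgorithm string

const (
	EdgeAlgorithmSpherical EdgeAlgorithm = "spherical"
	EdgeAlgorithmKarney    EdgeAlgorithm = "karney"
)

const defaultCRS = "OGC:CRS84"

type GeographyType struct {
	crs       string        // "" stands for defaultCRS
	algorithm EdgeAlgorithm // "" stands for spherical
}

// GeographyTypeOf canonicalizes defaults at construction so that the
// explicit default form equals the zero value.
func GeographyTypeOf(crs string, algorithm EdgeAlgorithm) GeographyType {
	if crs == defaultCRS { // assumption: exact match
		crs = ""
	}
	if algorithm == EdgeAlgorithmSpherical {
		algorithm = ""
	}
	return GeographyType{crs: crs, algorithm: algorithm}
}

// CRS returns the effective CRS, filling in the default.
func (g GeographyType) CRS() string {
	if g.crs == "" {
		return defaultCRS
	}
	return g.crs
}

// Algorithm returns the effective algorithm, filling in the default.
func (g GeographyType) Algorithm() EdgeAlgorithm {
	if g.algorithm == "" {
		return EdgeAlgorithmSpherical
	}
	return g.algorithm
}

// String emits the shortest form that still identifies the type.
func (g GeographyType) String() string {
	switch {
	case g.algorithm != "":
		return fmt.Sprintf("geography(%s, %s)", g.CRS(), g.algorithm)
	case g.crs != "":
		return fmt.Sprintf("geography(%s)", g.crs)
	default:
		return "geography"
	}
}

func main() {
	explicit := GeographyTypeOf("OGC:CRS84", EdgeAlgorithmSpherical)
	fmt.Println(explicit == GeographyType{}, explicit)
	fmt.Println(GeographyTypeOf("srid:4326", EdgeAlgorithmKarney))
}
```

With this shape, equality and the serialized string agree for the implicit and explicit default forms, which is the property the fingerprint concern depends on.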

Comment thread types.go
normalizedCRS = ""
}

return GeographyType{crs: normalizedCRS, algorithm: algorithm}, nil
Contributor

Doesn't validate algorithm: GeographyTypeOf("srid:4326", EdgeAlgorithm("garbage")) succeeds and produces metadata Java/PyIceberg will reject. JSON parse goes through ParseEdgeAlgorithm; the constructor doesn't. I'd run it here too when algorithm != "".
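
A sketch of what validating in the constructor might look like. The signature of `ParseEdgeAlgorithm` is an assumption (the PR's actual signature is not shown here), matching is exact and case-sensitive for simplicity, and the types are local stand-ins:

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins; not the iceberg-go definitions.
type EdgeAlgorithm string

const (
	EdgeAlgorithmSpherical EdgeAlgorithm = "spherical"
	EdgeAlgorithmVincenty  EdgeAlgorithm = "vincenty"
	EdgeAlgorithmThomas    EdgeAlgorithm = "thomas"
	EdgeAlgorithmAndoyer   EdgeAlgorithm = "andoyer"
	EdgeAlgorithmKarney    EdgeAlgorithm = "karney"
)

type GeographyType struct {
	crs       string
	algorithm EdgeAlgorithm
}

// ParseEdgeAlgorithm accepts only the known algorithm names.
// Hypothetical signature, assumed for illustration.
func ParseEdgeAlgorithm(s string) (EdgeAlgorithm, error) {
	switch alg := EdgeAlgorithm(s); alg {
	case EdgeAlgorithmSpherical, EdgeAlgorithmVincenty, EdgeAlgorithmThomas,
		EdgeAlgorithmAndoyer, EdgeAlgorithmKarney:
		return alg, nil
	default:
		return "", fmt.Errorf("invalid edge algorithm: %q", s)
	}
}

// GeographyTypeOf validates a non-empty algorithm through the same parser
// the JSON path uses, so both entry points reject the same inputs.
func GeographyTypeOf(crs string, algorithm EdgeAlgorithm) (GeographyType, error) {
	if crs == "" {
		return GeographyType{}, errors.New("invalid CRS: (empty string)")
	}
	if algorithm != "" {
		parsed, err := ParseEdgeAlgorithm(string(algorithm))
		if err != nil {
			return GeographyType{}, err
		}
		algorithm = parsed
	}
	return GeographyType{crs: crs, algorithm: algorithm}, nil
}

func main() {
	_, err := GeographyTypeOf("srid:4326", EdgeAlgorithm("garbage"))
	fmt.Println("garbage:", err)
	g, err := GeographyTypeOf("srid:4326", EdgeAlgorithmKarney)
	fmt.Println("karney:", g, err)
}
```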

Comment thread types.go

func GeometryTypeOf(crs string) (GeometryType, error) {
if crs == "" {
return GeometryType{}, errors.New("invalid CRS: (empty string)")
Contributor

Plain errors.New here. The rest of this file wraps ErrInvalidTypeString (lines 161/169/198) so callers can match with errors.Is. Same applies to GeographyTypeOf (line 912) and ParseEdgeAlgorithm (line 850).

Comment thread types.go
decimalRegex = regexp.MustCompile(`decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)`)
geometryRegex = regexp.MustCompile(`(?i)^geometry\s*(?:\(\s*([^)]+?)\s*\))?$`)
geographyRegex = regexp.MustCompile(`(?i)^geography\s*(?:\(\s*([^,]+?)\s*(?:,\s*(\w+)\s*)?\))?$`)
)
Contributor

Both regexes over-accept: geometry(srid:4326, extra) parses with the comma absorbed into the CRS group; geography(srid:4269 karney) (missing comma) parses with the space absorbed. I'd tighten the CRS group (e.g. [^),]+? for geometry) and add a couple of negative tests.
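
A possible tightening, sketched under one assumption that goes beyond the suggestion above: CRS identifiers never contain whitespace, so the CRS group excludes `)`, `,`, and whitespace. If CRS strings with spaces must be supported, the class would need to be relaxed. The variable names are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The CRS group stops at ')', ',' or whitespace, so stray tokens cannot be absorbed.
	geometryRe  = regexp.MustCompile(`(?i)^geometry\s*(?:\(\s*([^),\s]+)\s*\))?$`)
	geographyRe = regexp.MustCompile(`(?i)^geography\s*(?:\(\s*([^),\s]+)\s*(?:,\s*(\w+)\s*)?\))?$`)
)

func main() {
	inputs := []string{
		"geometry",
		"geometry(srid:4326)",
		"geometry(srid:4326, extra)", // extra argument: should be rejected
	}
	for _, s := range inputs {
		fmt.Printf("%-30q geometry match: %v\n", s, geometryRe.MatchString(s))
	}

	inputs = []string{
		"geography(srid:4269, karney)",
		"geography(srid:4269 karney)", // missing comma: should be rejected
	}
	for _, s := range inputs {
		fmt.Printf("%-30q geography match: %v\n", s, geographyRe.MatchString(s))
	}
}
```

The two rejected inputs are the negative cases named in the comment and would make natural additions to the parser's test table.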

// Returning nil indicates this type cannot be converted to Substrait
return nil
}
func (convertToSubstrait) VisitGeometry(iceberg.GeometryType) types.Type { return &types.BinaryType{} }
Contributor

Same dispatch issue as schema.go — these are dead code today because visitField doesn't route to them, and convertToSubstrait.Primitive panics on the fall-through. Minor: VisitGeometry is one-line but VisitGeography is multi-line; surrounding methods are uniformly one-liners.

Comment thread transforms.go

func (IdentityTransform) CanTransform(t Type) bool {
_, ok := t.(PrimitiveType)
switch t.(type) {
Contributor

Style nit — the nested switch only adds a level to special-case geo. I'd flatten:

switch t.(type) {
case GeometryType, GeographyType:
    return false
}
_, ok := t.(PrimitiveType)
return ok

Comment thread transforms_test.go
},
notAllowed: []iceberg.Type{
&iceberg.StructType{}, &iceberg.ListType{}, &iceberg.MapType{},
iceberg.GeometryType{},
Contributor

Identity and Bucket get geo coverage but Void/Truncate/Year/Month/Day/Hour aren't pinned. Void particularly — per spec it's the only transform geo is allowed for, so it's the one most likely to silently regress.

})

t.Run("test update schema with add geometry and geography columns", func(t *testing.T) {
table := New([]string{"id"}, testMetadata, "", nil, nil)
Contributor

This builds against testMetadata (v2) and only calls Apply() — that path skips checkSchemaCompatibility, so the test passes despite geo being illegal in v2. Reads as if v2 add works. Build a v3 metadata for this case (mirror the geoMeta construction in the error test below) and ideally run Commit()/BuildUpdates() so the realistic path is covered.

})
}

func TestGeometryGeographyNullOnlyDefaults(t *testing.T) {
Contributor

The v2 with non-null initial default subtest asserts "is not supported until v3". That path actually fires both the v2-unsupported error and the must-default-to-null error, and ErrorContains happens to match the first. If geo ever becomes allowed in v2, this silently changes meaning. Worth splitting: one v2 subtest asserting the type-unsupported message, one v3 subtest with non-null default asserting must-default-to-null.

Comment thread types.go
Comment on lines +827 to +835
type EdgeAlgorithm string

const (
EdgeAlgorithmSpherical EdgeAlgorithm = "spherical"
EdgeAlgorithmVincenty EdgeAlgorithm = "vincenty"
EdgeAlgorithmThomas EdgeAlgorithm = "thomas"
EdgeAlgorithmAndoyer EdgeAlgorithm = "andoyer"
EdgeAlgorithmKarney EdgeAlgorithm = "karney"
)
Member

these exist in geoarrow/geoarrow-go, we should use the constants from there instead of defining them here. See https://github.com/geoarrow/geoarrow-go/blob/main/metadata.go

Comment thread exprs.go
case UUIDType:
return &boundRef[uuid.UUID]{field: field, acc: acc}
case GeographyType, GeometryType:
return &boundRef[[]byte]{field: field, acc: acc}
Member

Comment thread table/arrow_utils.go
}
}

func (c convertToArrow) VisitGeometry(iceberg.GeometryType) arrow.Field {
Member

I don't see why we don't just use https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L15 off the bat right now instead of using binary/large binary as the intermediate.

https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L84 can be used for the large type case

Comment thread table/arrow_utils.go
Comment on lines +643 to +648
// Passthrough binary for now, adding geoarrow-go support later
if c.useLargeTypes {
return arrow.Field{Type: arrow.BinaryTypes.LargeBinary}
}

return arrow.Field{Type: arrow.BinaryTypes.Binary}
Member

same point as above. why not just use https://github.com/geoarrow/geoarrow-go/blob/main/wkb.go#L20 right now instead of the passthrough?

4 participants