
feat: propagate GEOMETRY CRS as PostGIS SRID via EWKB #418

Draft
jatorre wants to merge 1 commit into duckdb:main from jatorre:fix/geometry-srid-ewkb


@jatorre jatorre commented Mar 20, 2026

Summary

DuckDB 1.5 introduced CRS metadata on the GEOMETRY type (e.g., GEOMETRY('EPSG:4326')). When writing to PostGIS via binary COPY, the SRID is currently lost because the binary writer sends plain WKB. PostGIS accepts the geometry but sets SRID=0.

This PR detects CRS on the GEOMETRY column's LogicalType and writes EWKB (WKB with SRID header) instead, so PostGIS receives the correct SRID.

The Fix

In postgres_binary_writer.hpp, the GEOMETRY case now:

  1. Checks if the column type has CRS via LogicalType::GetCRS(type)
  2. Extracts the SRID from the CRS identifier (e.g., EPSG:4326 → 4326)
  3. Writes EWKB by setting the SRID flag bit (0x20000000) on the WKB type field and inserting the 4-byte SRID

WKB:  [byte_order:1] [type:4]                    [payload...]
EWKB: [byte_order:1] [type|0x20000000:4] [srid:4] [payload...]

If no CRS is set on the GEOMETRY type, plain WKB is sent (preserving current behavior).
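The header rewrite described above can be sketched as a standalone function. This is a minimal illustration, not the code in this PR: `AddSridToWkb`, `ReadU32`, `AppendU32`, and `kEwkbSridFlag` are hypothetical names, and the sketch assumes the input bytes are plain WKB with a 1-byte byte-order marker followed by a 4-byte type field.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for illustration; not taken from postgres_binary_writer.hpp.
constexpr uint32_t kEwkbSridFlag = 0x20000000;

// Reads a 32-bit unsigned integer in the byte order declared by the WKB header.
uint32_t ReadU32(const uint8_t *p, bool little_endian) {
    return little_endian
               ? (uint32_t(p[0]) | uint32_t(p[1]) << 8 | uint32_t(p[2]) << 16 | uint32_t(p[3]) << 24)
               : (uint32_t(p[3]) | uint32_t(p[2]) << 8 | uint32_t(p[1]) << 16 | uint32_t(p[0]) << 24);
}

// Appends a 32-bit unsigned integer in the requested byte order.
void AppendU32(std::vector<uint8_t> &out, uint32_t value, bool little_endian) {
    for (int i = 0; i < 4; i++) {
        const int shift = little_endian ? 8 * i : 8 * (3 - i);
        out.push_back(static_cast<uint8_t>((value >> shift) & 0xFF));
    }
}

// Returns an EWKB copy of `wkb`: same byte order, type field with the SRID flag
// set, the 4-byte SRID, then the untouched payload. Input that is too short to
// hold a WKB header, or that already has the SRID flag, is returned unchanged.
std::vector<uint8_t> AddSridToWkb(const std::vector<uint8_t> &wkb, uint32_t srid) {
    constexpr size_t kHeaderSize = 5; // byte_order:1 + type:4
    if (wkb.size() < kHeaderSize) {
        return wkb;
    }
    const bool little_endian = (wkb[0] == 1);
    const uint32_t type = ReadU32(wkb.data() + 1, little_endian);
    if (type & kEwkbSridFlag) {
        return wkb;
    }
    std::vector<uint8_t> ewkb;
    ewkb.reserve(wkb.size() + 4);
    ewkb.push_back(wkb[0]);
    AppendU32(ewkb, type | kEwkbSridFlag, little_endian);
    AppendU32(ewkb, srid, little_endian);
    ewkb.insert(ewkb.end(), wkb.begin() + kHeaderSize, wkb.end());
    return ewkb;
}
```

For a little-endian WKB point and SRID 4326, the result is 4 bytes longer than the input, the type field becomes `01 00 00 20`, and the SRID bytes are `E6 10 00 00`.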

Before:

-- Source has CRS
CREATE TABLE t AS SELECT * FROM ST_READ('airports.shp');
-- type: GEOMETRY('EPSG:4326')

-- Write to PostGIS
CREATE TABLE pg.public.airports AS SELECT * FROM t;
-- PostGIS: ST_SRID(geom) = 0  ← SRID lost

After:

-- Same source, same write
CREATE TABLE pg.public.airports AS SELECT * FROM t;
-- PostGIS: ST_SRID(geom) = 4326  ← SRID preserved

Context

This was the main remaining gap for transparent geospatial data transfer between DuckDB and PostGIS. With DuckDB 1.5's GEOMETRY('EPSG:4326') type and this fix, the full pipeline works without any workarounds:

ST_READ('file.shp') → GEOMETRY('EPSG:4326') → postgres_scanner → PostGIS (SRID=4326, GIST indexed)

Previously users had to run UPDATE table SET geom = ST_SetSRID(geom, 4326) after every import.

Related: duckdb/duckdb-spatial#587 (EWKB support discussion)

@jatorre jatorre marked this pull request as draft March 20, 2026 04:43
@jatorre jatorre force-pushed the fix/geometry-srid-ewkb branch from 3468deb to e3d8ea5 Compare March 20, 2026 04:47
DuckDB 1.5 introduced CRS metadata on the GEOMETRY type (e.g.,
GEOMETRY('EPSG:4326')). When writing to PostGIS via binary COPY,
the SRID was lost because the writer sent plain WKB.

This change detects CRS on the GEOMETRY LogicalType and writes
EWKB (WKB with SRID header) instead:

  WKB:  [byte_order:1] [type:4]                    [payload...]
  EWKB: [byte_order:1] [type|0x20000000:4] [srid:4] [payload...]

The SRID is extracted from the CRS identifier (e.g., "EPSG:4326" → 4326).
If no CRS is set, plain WKB is sent (preserving current behavior).

Before: DuckDB GEOMETRY('EPSG:4326') → PostGIS geometry(SRID=0)
After:  DuckDB GEOMETRY('EPSG:4326') → PostGIS geometry(SRID=4326)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jatorre jatorre force-pushed the fix/geometry-srid-ewkb branch from e3d8ea5 to 774ea01 Compare March 20, 2026 04:52
Author

jatorre commented Mar 20, 2026

Hey @Maxxen — this PR adds EWKB writing to postgres_scanner so that GEOMETRY('EPSG:4326') columns arrive in PostGIS with the correct SRID instead of 0.

A couple of things I'd appreciate your input on:

  1. SRID extraction approach: Currently we parse the CRS identifier (e.g., "EPSG:4326" → 4326) per-row via string::find + stoi. It's cheap but not elegant. Would you prefer this to happen once per-column during init, maybe storing the resolved SRID in the copy state? Or is per-row fine given it's a short string?

  2. AuxInfo() guard: We check type.AuxInfo() before calling GetCRS() since it throws on bare GEOMETRY without CRS. Is there a more idiomatic way to check if a GEOMETRY type has CRS attached? Something like GeoType::HasCRS() that I might have missed?

  3. Internal format assumption: We treat the GEOMETRY binary representation as WKB and insert the SRID header directly. Verified byte-for-byte identical to ST_AsWKB() output for all 7 geometry types. Is this a safe assumption going forward, or should we go through an explicit conversion?
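To make the first question concrete, here is a rough sketch of the "resolve once per column" alternative. It is not code from this PR: `GeometryColumnState` and `ResolveGeometryColumn` are hypothetical names, and the sketch assumes the CRS identifier is available as a plain string such as "EPSG:4326" when the copy is initialized. It uses the same string::find + stoi approach, with the extra requirement that the whole suffix after the colon parses as a positive integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical per-column state, filled in once when the copy starts so the
// per-row writer only has to check a flag.
struct GeometryColumnState {
    bool has_srid = false;
    int32_t srid = 0;
};

// Resolves the SRID from a CRS identifier using find + stoi. Anything that
// does not parse cleanly leaves has_srid == false, which the writer would
// treat as "send plain WKB".
GeometryColumnState ResolveGeometryColumn(const std::string &crs_identifier) {
    GeometryColumnState state;
    const std::size_t colon = crs_identifier.find(':');
    if (colon == std::string::npos) {
        return state;
    }
    const std::string code = crs_identifier.substr(colon + 1);
    try {
        std::size_t consumed = 0;
        const int value = std::stoi(code, &consumed);
        // Require the whole suffix to be numeric, e.g. reject "43x6".
        if (consumed == code.size() && value > 0) {
            state.has_srid = true;
            state.srid = value;
        }
    } catch (const std::exception &) {
        // std::stoi throws on non-numeric or out-of-range input, e.g. "CRS84".
    }
    return state;
}
```

The per-row path would then branch on `state.has_srid` and never touch the string again.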

Also — maybe you're already looking at completing this work somewhere else, or have a different approach in mind. Let me know if for topics like this you'd prefer an issue first rather than a PR. I thought this was small enough that a PR would be more useful, but happy to adjust to whatever workflow you prefer.

Member

Maxxen commented Mar 20, 2026

Unfortunately this isn't safe to do: the Postgres SRID is not guaranteed to be the same as the numeric part of the auth code. It also doesn't work for non-integer auth codes, like OGC:CRS84.

We need to do a lookup in the Postgres database's spatial_ref_sys table and find the best match (which I think we need PROJ for, which only spatial bundles) to get the SRID. This is not really easy to do in the type conversion code, and we don't want to add PROJ as a dependency here right now.

Member

Maxxen commented Mar 20, 2026

I have some thoughts on how to eventually solve this in the future. That would involve breaking PROJ out from spatial into its own separate extension, but we don't want to do that until we have a stable extension C API, which we are working on.

Author

jatorre commented Mar 20, 2026

Thanks for the context @Maxxen — that makes total sense about the SRID ≠ EPSG code edge case and OGC:CRS84.

That said, I think there's a pragmatic middle ground here: only handle the EPSG:NNNN case (where SRID = integer part), and fall back to plain WKB (SRID=0) for everything else. The code already does this — it looks for a colon, tries stoi, and falls back to WriteRawBlob if parsing fails.

One of the key values of DuckDB in the geospatial world is being the intermediary between data warehouses — BigQuery, Snowflake, Databricks, Redshift — and PostGIS. Those cloud warehouses don't have a spatial_ref_sys table or the CRS flexibility that PostGIS has — they work with well-known EPSG codes directly. So in practice, the vast majority of data flowing through DuckDB into PostGIS will be GEOMETRY('EPSG:NNNN') where the PostGIS SRID matches the integer value.

Covering that 99% case with a few lines of code would already be extremely valuable — right now every postgres_scanner write loses the SRID, and users have to UPDATE ... SET geom = ST_SetSRID(geom, 4326) after every import.

For the remaining cases (OGC:CRS84, custom SRIDs, non-integer auth codes), falling back to SRID=0 preserves current behavior and ensures correctness — no worse than today, and the user can handle it with a post-import ST_SetSRID just as they do now.

Would you be open to that scoped approach? The contract would be:

  • GEOMETRY('EPSG:4326') → EWKB with SRID=4326 ✓
  • GEOMETRY('OGC:CRS84') → plain WKB, SRID=0 (same as today)
  • GEOMETRY (no CRS) → plain WKB, SRID=0 (same as today)
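The contract above could be enforced with a strict check along these lines. This is a sketch under assumptions, not the PR's code: `TryParseEpsgSrid` is a hypothetical helper, the match on "EPSG:" is assumed to be case-sensitive, and the 9-digit cap is only there to keep the result inside int32 range.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: succeeds only for identifiers of the form "EPSG:<digits>"
// with a positive code. On success the SRID is written to `srid` and the caller
// would emit EWKB; on failure `srid` is left untouched and the caller would
// fall back to plain WKB (SRID=0 in PostGIS).
bool TryParseEpsgSrid(const std::string &crs, int32_t &srid) {
    static const std::string kPrefix = "EPSG:";
    if (crs.compare(0, kPrefix.size(), kPrefix) != 0) {
        return false; // other authority (e.g. OGC:CRS84) or no CRS string at all
    }
    const std::string code = crs.substr(kPrefix.size());
    if (code.empty() || code.size() > 9) {
        return false; // nothing to parse, or too long to fit in int32
    }
    int32_t value = 0;
    for (const char c : code) {
        if (c < '0' || c > '9') {
            return false; // reject signs, whitespace, and trailing garbage
        }
        value = value * 10 + (c - '0');
    }
    if (value <= 0) {
        return false;
    }
    srid = value;
    return true;
}
```

Unlike a bare stoi call, this rejects inputs such as "EPSG:43x6" or "EPSG: 4326" instead of accepting a numeric prefix, so only exact EPSG codes are promoted to an SRID.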

Happy to adjust the code or close this if you'd rather wait for the PROJ-based solution — just wanted to make the case that the simple version already covers almost all real-world usage.

Member

Maxxen commented Mar 20, 2026

Alright, maybe we can do it for EPSG only, I just worry that the SRIDs in PostGIS spatial_ref_sys table don't actually match with the EPSG codes, even for EPSG defined projections, not sure how easy that would be to verify... Either way we still lose the CRS on import - but I guess we can deal with that separately.

Let me get back to this later :)

Author

jatorre commented Mar 20, 2026

Good news — I checked a default PostGIS spatial_ref_sys table and all 6,184 EPSG entries have srid = auth_srid with zero mismatches:

SELECT COUNT(*) as total,
       SUM(CASE WHEN srid = auth_srid::INT THEN 1 ELSE 0 END) as matching,
       SUM(CASE WHEN srid != auth_srid::INT THEN 1 ELSE 0 END) as mismatched
FROM spatial_ref_sys WHERE auth_name = 'EPSG';

-- total: 6184, matching: 6184, mismatched: 0

So for the default table, EPSG code = PostGIS SRID is a safe assumption. A user could theoretically insert custom rows that break this, but that's an edge case they'd need to handle themselves anyway.

Happy to wait until you have time to look at it properly — no rush.
