Conversation

@2010YOUY01 (Contributor) commented Jan 29, 2026

Closes #530

Motivation

Today, converting legacy Parquet files that store geometry as raw WKB payloads inside BINARY columns into GeoParquet requires a full SQL rewrite pipeline. Users must explicitly parse WKB, assign CRS, and reconstruct the geometry column before writing:

# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

df = sd.read_parquet("/data/geo_legacy.parquet")

df = df.to_view("t", overwrite=True)

df = sd.sql("""
  SELECT
    ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
    * EXCLUDE (geo_bin)
  FROM t
""")

df.to_parquet("geo_geoparquet.parquet")

This works, but it would be easier to have a Python API that simply says:

“Treat this binary column as a geometry column with encoding=WKB and CRS=EPSG:4326.”

This PR introduces a geometry_columns option on read_parquet() so legacy Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.


Proposed Python API

Demo

df = sd.read_parquet(
    "/data/geo_legacy.parquet",
    geometry_columns={
        "geo_bin": {
            "encoding": "WKB",
            "crs": 4326,
        }
    },
)

df.to_parquet("geo_geoparquet.parquet")

Specification

            geometry_columns: Optional mapping of column name to GeoParquet column
                metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
                binary WKB columns as geometry columns. Supported keys:
                - encoding: "WKB" (required)
                - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
                  If not provided, the default CRS is OGC:CRS84
                  (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
                  the data in this column must be stored in longitude/latitude
                  based on the WGS84 datum.
                - edges: "planar" (default) or "spherical"
                Useful for:
                - Legacy Parquet files with Binary columns containing WKB payloads.
                - Overriding GeoParquet metadata when fields like `crs` are missing.
                Precedence:
                - If a column appears in both GeoParquet metadata and this option,
                  the geometry_columns entry takes precedence.
                Example:
                - For `geo.parquet(geo1: geometry, geo2: geometry, geo3: binary)`,
                  `read_parquet("geo.parquet", geometry_columns={"geo2": {...}, "geo3": ...})`
                  will override `geo2` metadata and treat `geo3` as a geometry column.
                Safety:
                - Columns specified here are not validated for WKB correctness.
                  Invalid WKB payloads may cause undefined behavior.

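The precedence and override rules in the docstring can be illustrated with a small sketch (the column names and CRS values below are hypothetical, chosen to match the `geo2`/`geo3` example above):

```python
import json

# Hypothetical mapping matching the docstring's example: override the CRS
# recorded for geo2 and promote the plain binary column geo3 to geometry.
geometry_columns = {
    "geo2": {"encoding": "WKB", "crs": "EPSG:3857"},
    "geo3": {"encoding": "WKB", "crs": 4326, "edges": "planar"},
}

# A mapping like this round-trips cleanly through JSON, which is the
# representation the review discussion converges on for the bindings boundary.
payload = json.dumps(geometry_columns)
```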
The key points:

  • The geometry columns specified in the option override what is already in the metadata. This can be useful when the metadata is missing some configuration such as `crs`; this API can supply the missing details.
  • No validation for now; this can be done in a follow-up PR

Key Changes

  1. Parse the Python option fields into the Rust GeoParquetColumnMetadata struct
  2. In the schema inference step, first infer the schema from the GeoParquet metadata as before, then consult the option to add or override geometry columns
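
Step 2 amounts to a merge where the option wins. A minimal sketch (hypothetical helper; the real implementation lives in the Rust schema-inference code):

```python
from typing import Any, Dict, Optional

def merge_geometry_columns(
    from_metadata: Dict[str, Dict[str, Any]],
    from_options: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Dict[str, Any]]:
    """Columns inferred from GeoParquet metadata are kept, but any column
    also named in the geometry_columns option takes precedence."""
    merged = dict(from_metadata)
    merged.update(from_options or {})
    return merged

# geo2 is overridden, geo3 is added, geo1 is untouched.
merged = merge_geometry_columns(
    {"geo1": {"encoding": "WKB"}, "geo2": {"encoding": "WKB", "crs": "OGC:CRS84"}},
    {"geo2": {"encoding": "WKB", "crs": 4326}, "geo3": {"encoding": "WKB"}},
)
```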

@2010YOUY01 (Contributor, Author) commented Jan 29, 2026

I would love to hear feedback on the design and specification (in the PR writeup).

Once we reach agreement on that, I will:

  • Add comprehensive tests
  • Polish the implementation — so far I have only validated the high-level structure; I haven’t yet reviewed the details (e.g. metadata parsing) carefully

@2010YOUY01 2010YOUY01 marked this pull request as draft January 29, 2026 11:49
@paleolimbot (Member) left a comment

Thank you for this!

I took a look at the whole thing but I know you're still working so feel free to ignore comments that aren't in scope.

Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.

/// Metadata about geometry columns. Each key is the name of a geometry column in the table.
pub columns: HashMap<String, GeoParquetColumnMetadata>,

Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK, too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.

self,
table_paths: Union[str, Path, Iterable[str]],
options: Optional[Dict[str, Any]] = None,
geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,
@paleolimbot (Member):

I would probably make this just Optional[Mapping[str, Any]]. In Python a user can rather easily decode that from JSON if they need to.

Comment on lines +141 to +147
- encoding: "WKB" (required)
- crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326).
If not provided, the default CRS is OGC:CRS84
(https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means
the data in this column must be stored in longitude/latitude
based on the WGS84 datum.
- edges: "planar" (default) or "spherical"
@paleolimbot (Member):

Perhaps just "See the GeoParquet specification for the required fields for each column".

https://geoparquet.org/releases/v1.1.0/

Comment on lines +139 to +140
metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
binary WKB columns as geometry columns. Supported keys:
@paleolimbot (Member):

Suggested change
metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
binary WKB columns as geometry columns. Supported keys:
metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark
binary WKB columns as geometry columns or correct metadata such
as the column CRS. Supported keys:

Comment on lines +37 to +40
fn parse_geometry_columns<'py>(
py: Python<'py>,
geometry_columns: HashMap<String, PyObject>,
) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> {
@paleolimbot (Member):

I think this bit can be avoided by just passing a string at this point (i.e., in Python, use json.dumps() before passing to Rust).
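
On the Python side, that suggestion amounts to something like the following sketch (variable names are assumptions, not the actual binding signature):

```python
import json

geometry_columns = {"geo_bin": {"encoding": "WKB", "crs": 4326}}

# Serialize once in Python; the Rust binding then receives a single str
# and can deserialize it with serde_json.
geometry_columns_json = json.dumps(geometry_columns)
```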

py: Python<'py>,
table_paths: Vec<String>,
options: HashMap<String, PyObject>,
geometry_columns: Option<HashMap<String, PyObject>>,
@paleolimbot (Member):

Suggested change
geometry_columns: Option<HashMap<String, PyObject>>,
geometry_columns: Option<String>,

...I think JSON is the right format for this particular step (reduces bindings code considerably!)

src = tmp_path / "plain.parquet"
pq.write_table(table, src)

# Check metadata: geoparquet meatadata should not be available
@paleolimbot (Member):

Suggested change
# Check metadata: geoparquet meatadata should not be available
# Check metadata: geoparquet metadata should not be available

Comment on lines +128 to +136
out = tmp_path / "geo.parquet"
df.to_parquet(out)
metadata = pq.read_metadata(out).metadata
assert metadata is not None
geo = metadata.get(b"geo")
assert geo is not None
geo_metadata = json.loads(geo.decode("utf-8"))
print(json.dumps(geo_metadata, indent=2, sort_keys=True))
assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326"
@paleolimbot (Member):

I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).

pub struct GeoParquetReadOptions<'a> {
inner: ParquetReadOptions<'a>,
table_options: Option<HashMap<String, String>>,
geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
@paleolimbot (Member):

Suggested change
geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
geometry_columns: Option<String>,

...just keeping this as a String will make this work for SQL, too (easy to import from the HashMap<String, String> that DataFusion gives us)


// Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0"
let crs = if LngLat::is_str_lnglat(crs_str) {
let crs = if crs_str == "OGC:CRS84" {
@paleolimbot (Member):

These changes should be reverted (there is >1 string that can represent lon/lat)

}
}

if let Some(number) = crs_value.as_number() {
@paleolimbot (Member):

This part is OK (but perhaps add a test)

@2010YOUY01 (Contributor, Author) commented Jan 30, 2026

Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.

/// Metadata about geometry columns. Each key is the name of a geometry column in the table.
pub columns: HashMap<String, GeoParquetColumnMetadata>,

Exposing a HashMap<GeoParquetColumnMetadata> in the options is OK, too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.

It's a great idea to make the Rust internal API easier to use from different frontend bindings. WDYT:

  • The builder of the Rust struct GeoParquetReadOptions takes a JSON string for the options, to make it easier to use
  • Internally, GeoParquetReadOptions keeps a typed/parsed field for geometry_columns, to keep the Rust backend implementation cleaner

Now the API looks like:

pub struct GeoParquetReadOptions<'a> {
    inner: ParquetReadOptions<'a>,
    table_options: Option<HashMap<String, String>>,
    // Keep it typed to make backend impl cleaner
    geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
}

impl<'a> GeoParquetReadOptions<'a> {
    // ...
    pub fn with_geometry_columns(
        mut self,
        // JSON config string, e.g. r#"{"geo": {"encoding": "wkb"}}"#
        geometry_columns: String,
    ) -> Self {
        let parsed = parse_geometry_columns(geometry_columns);
        self.geometry_columns = Some(parsed);
        self
    }
}

@paleolimbot (Member):

Thank you for considering...sounds good to me!


Successfully merging this pull request may close: Python API to cast binary columns to WKB columns