feat(rust/sedona-geoparquet): Support geometry_columns option in read_parquet(..) to mark additional geometry columns
#560
base: main
Conversation
I would love to hear feedback on the design and specification (in the PR write-up). Once we reach agreement on that, I will:
paleolimbot left a comment
Thank you for this!
I took a look at the whole thing but I know you're still working so feel free to ignore comments that aren't in scope.
Mostly I think you can avoid exposing GeoParquetMetadata via the options and just accept a string for that parameter. I believe serde_json can automatically deserialize that for you to avoid the parsing code here.
sedona-db/rust/sedona-geoparquet/src/metadata.rs
Lines 292 to 293 in 3f91e26
| /// Metadata about geometry columns. Each key is the name of a geometry column in the table. | |
| pub columns: HashMap<String, GeoParquetColumnMetadata>, |
Exposing a HashMap<String, GeoParquetColumnMetadata> in the options is OK too, if you feel strongly about it (probably helpful if this is being used from Rust), but for our built-in frontends (Python, R, SQL) a String is easier to deal with.
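For illustration, parsing the String variant could be as small as the sketch below; the import path and the assumption that GeoParquetColumnMetadata derives serde's Deserialize are not verified here.

use std::collections::HashMap;
use sedona_geoparquet::metadata::GeoParquetColumnMetadata; // path is an assumption

// Deserialize the user-supplied JSON option straight into the typed map,
// e.g. {"geom": {"encoding": "WKB", "crs": "EPSG:4326"}}.
fn parse_geometry_columns(
    json: &str,
) -> Result<HashMap<String, GeoParquetColumnMetadata>, serde_json::Error> {
    serde_json::from_str(json)
}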
| self, | ||
| table_paths: Union[str, Path, Iterable[str]], | ||
| options: Optional[Dict[str, Any]] = None, | ||
| geometry_columns: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, |
I would probably make this just Optional[Mapping[str, Any]]. In Python a user can rather easily decode that from JSON if they need to.
| - encoding: "WKB" (required) | ||
| - crs: string (e.g., "EPSG:4326") or integer SRID (e.g., 4326). | ||
| If not provided, the default CRS is OGC:CRS84 | ||
| (https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means | ||
| the data in this column must be stored in longitude/latitude | ||
| based on the WGS84 datum. | ||
| - edges: "planar" (default) or "spherical" |
Perhaps just "See the GeoParquet specification for the required fields for each column".
| metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark | ||
| binary WKB columns as geometry columns. Supported keys: |
Suggested change:
| metadata (e.g., {"geom": {"encoding": "WKB"}}). Use this to mark |
| binary WKB columns as geometry columns or correct metadata such |
| as the column CRS. Supported keys: |
| fn parse_geometry_columns<'py>( | ||
| py: Python<'py>, | ||
| geometry_columns: HashMap<String, PyObject>, | ||
| ) -> Result<HashMap<String, GeoParquetColumnMetadata>, PySedonaError> { |
I think this bit can be avoided by just passing a string at this point (i.e., in Python, use json.dumps() before passing to Rust).
| py: Python<'py>, | ||
| table_paths: Vec<String>, | ||
| options: HashMap<String, PyObject>, | ||
| geometry_columns: Option<HashMap<String, PyObject>>, |
Suggested change:
| geometry_columns: Option<String>, |
...I think JSON is the right format for this particular step (it reduces bindings code considerably!)
| src = tmp_path / "plain.parquet" | ||
| pq.write_table(table, src) | ||
| # Check metadata: geoparquet meatadata should not be available |
Suggested change:
| # Check metadata: geoparquet metadata should not be available |
| out = tmp_path / "geo.parquet" | ||
| df.to_parquet(out) | ||
| metadata = pq.read_metadata(out).metadata | ||
| assert metadata is not None | ||
| geo = metadata.get(b"geo") | ||
| assert geo is not None | ||
| geo_metadata = json.loads(geo.decode("utf-8")) | ||
| print(json.dumps(geo_metadata, indent=2, sort_keys=True)) | ||
| assert geo_metadata["columns"]["geom"]["crs"] == "EPSG:4326" |
I think you can probably skip this bit of the test (verifying the geometry-ness and CRS of the input seems reasonable to me).
| pub struct GeoParquetReadOptions<'a> { | ||
| inner: ParquetReadOptions<'a>, | ||
| table_options: Option<HashMap<String, String>>, | ||
| geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>, |
Suggested change:
| geometry_columns: Option<String>, |
...just keeping this as a String will make this work for SQL, too (easy to import from the HashMap<String, String> that DataFusion gives us)
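As a rough sketch of that SQL path (the "geometry_columns" option key name is an assumption), the raw JSON can stay an opaque String until the GeoParquet reader deserializes it:

use std::collections::HashMap;

// Pull the raw JSON out of the string-typed options DataFusion provides,
// without parsing it at this layer.
fn geometry_columns_from_table_options(
    table_options: &HashMap<String, String>,
) -> Option<String> {
    table_options.get("geometry_columns").cloned()
}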
| // Handle JSON strings "OGC:CRS84", "EPSG:4326", "{AUTH}:{CODE}" and "0" | ||
| let crs = if LngLat::is_str_lnglat(crs_str) { | ||
| let crs = if crs_str == "OGC:CRS84" { |
These changes should be reverted (there is >1 string that can represent lon/lat)
| } | ||
| } | ||
| if let Some(number) = crs_value.as_number() { |
This part is OK (but perhaps add a test)
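For example, a test along these lines could pin down the numeric case; crs_from_value is a stand-in for the code under discussion, not the crate's actual API:

use serde_json::{json, Value};

// Stand-in for the parsing above: a bare JSON number is treated as an EPSG
// SRID, while a string CRS is passed through unchanged.
fn crs_from_value(crs_value: &Value) -> Option<String> {
    if let Some(number) = crs_value.as_number() {
        Some(format!("EPSG:{number}"))
    } else {
        crs_value.as_str().map(str::to_string)
    }
}

#[test]
fn numeric_crs_is_interpreted_as_srid() {
    assert_eq!(crs_from_value(&json!(4326)), Some("EPSG:4326".to_string()));
    assert_eq!(crs_from_value(&json!("OGC:CRS84")), Some("OGC:CRS84".to_string()));
}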
It's a great idea to make the Rust internal API easier to use for different frontend bindings. WDYT:
Now the API looks like:
pub struct GeoParquetReadOptions<'a> {
inner: ParquetReadOptions<'a>,
table_options: Option<HashMap<String, String>>,
// Keep it typed to make backend impl cleaner
geometry_columns: Option<HashMap<String, GeoParquetColumnMetadata>>,
}
impl<'a> GeoParquetReadOptions<'a> {
    // ...
    pub fn with_geometry_columns(
        mut self,
        // JSON config string like {"geom": {"encoding": "WKB"}}
        geometry_columns: String,
    ) -> Result<Self> {
        // parse_geometry_columns deserializes the JSON into the typed map
        let parsed = parse_geometry_columns(&geometry_columns)?;
        self.geometry_columns = Some(parsed);
        Ok(self)
    }
}
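If that direction holds, a call site could look roughly like the sketch below; GeoParquetReadOptions::default(), the import path, and the error handling are assumptions, only the with_geometry_columns call mirrors the proposal above.

use sedona_geoparquet::options::GeoParquetReadOptions; // path is an assumption

fn geo_options_for_legacy_file() -> GeoParquetReadOptions<'static> {
    // Mark the plain BINARY column "geom" as WKB geometry with an explicit CRS.
    GeoParquetReadOptions::default() // assumes a Default impl exists
        .with_geometry_columns(r#"{"geom": {"encoding": "WKB", "crs": "EPSG:4326"}}"#.to_string())
        .expect("geometry_columns JSON should be valid")
}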
Thank you for considering... sounds good to me!
Closes #530
Motivation
Today, converting legacy Parquet files that store geometry as raw WKB payloads inside BINARY columns into GeoParquet requires a full SQL rewrite pipeline. Users must explicitly parse WKB, assign a CRS, and reconstruct the geometry column before writing. This works, but an easier-to-use Python API would be preferable.
This PR introduces a geometry_columns option on read_parquet() so legacy Parquet files can be interpreted as GeoParquet directly, without SQL rewriting.
Proposed Python API
Demo
Specification
The key points:
- crs, and we can use this API to provide more details
Key Changes
- GeoParquetColumnMetadata struct
- GeoParquet metadata is read as before; next, look at the options to add/override additional geometry columns (see the sketch below)
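A sketch of that precedence rule (the helper name, import path, and exact types are illustrative, not the crate's API):

use std::collections::HashMap;
use sedona_geoparquet::metadata::GeoParquetColumnMetadata; // path is an assumption

// Start from whatever the file-level "geo" metadata declared, then let the
// geometry_columns option add new columns or override existing entries.
fn apply_geometry_columns_option(
    mut from_file_metadata: HashMap<String, GeoParquetColumnMetadata>,
    from_options: HashMap<String, GeoParquetColumnMetadata>,
) -> HashMap<String, GeoParquetColumnMetadata> {
    from_file_metadata.extend(from_options);
    from_file_metadata
}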