[GH-2768] Replace len(self)==0 with cheaper _is_empty() check in GeoSeries #2770

Merged
jiayuasu merged 2 commits into master from fix/geoseries-empty-check-perf
Mar 20, 2026

@jiayuasu
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Several GeoSeries methods use len(self) == 0 as an early-return guard for empty input. Under the hood, len() on a Pandas-on-Spark Series calls DataFrame.count(), which triggers a full Spark scan of all rows.

This PR adds a private _is_empty() helper method that uses self._internal.spark_frame.take(1) instead, which short-circuits after finding a single row rather than counting all rows.
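The difference between the two checks can be illustrated without a Spark cluster. This is a minimal sketch, not the code from this PR: FakeSparkFrame and is_empty are hypothetical names, and the stand-in only mimics the row-visiting behaviour of count() and take(1) as described above.

```python
from typing import Any, Iterable, List


class FakeSparkFrame:
    """Hypothetical stand-in for a Spark DataFrame that records rows read."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self._rows = list(rows)
        self.rows_read = 0  # how many rows the checks have touched so far

    def count(self) -> int:
        # Visits every row, like a full scan.
        n = 0
        for _ in self._rows:
            self.rows_read += 1
            n += 1
        return n

    def take(self, num: int) -> List[Any]:
        # Stops as soon as `num` rows have been collected.
        out: List[Any] = []
        for row in self._rows:
            if len(out) >= num:
                break
            self.rows_read += 1
            out.append(row)
        return out


def is_empty(frame: FakeSparkFrame) -> bool:
    """Emptiness check in the style described above: take(1) instead of count()."""
    return len(frame.take(1)) == 0
```

On a non-empty frame, is_empty() touches a single row, whereas comparing count() to zero touches all of them. The real saving in Spark depends on the query plan, so treat the row counter here as an illustration only.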

All 6 occurrences of len(self) == 0 in geoseries.py are replaced:

  • crs (getter)
  • build_area()
  • polygonize()
  • union_all()
  • intersection_all()
  • total_bounds

How was this patch tested?

Existing tests for all affected methods (test_build_area, test_polygonize, test_union_all, test_intersection_all, test_total_bounds, test_crs, test_empty_list) were run and pass.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

…eries

Replace all 6 occurrences of len(self) == 0 in GeoSeries with a new
_is_empty() helper that uses spark_frame.take(1) instead of
DataFrame.count(). This avoids triggering a full Spark scan just to
check if the series is empty.

Affected methods: crs (getter), build_area(), polygonize(),
union_all(), intersection_all(), total_bounds.
Contributor

Copilot AI left a comment

Pull request overview

Optimizes empty-input guards in GeoSeries to avoid triggering expensive Spark count() scans when checking for emptiness in Pandas-on-Spark-backed series.

Changes:

  • Added a private GeoSeries._is_empty() helper that uses spark_frame.take(1) for short-circuit emptiness checks.
  • Replaced len(self) == 0 guards with self._is_empty() in 6 GeoSeries methods/properties (crs, build_area, polygonize, union_all, intersection_all, total_bounds).

…mpty returns

In build_area() and polygonize(), the early-return for empty input
used crs=self.crs, which triggers the crs getter and calls _is_empty()
again. Use crs=None instead since the series is known to be empty.
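The redundant check described in this commit can be sketched as follows. TinySeries is a hypothetical stand-in, not Sedona's GeoSeries; it assumes the crs getter returns None for an empty series, and it counts emptiness checks in place of Spark jobs.

```python
from typing import List, Optional


class TinySeries:
    """Hypothetical stand-in for GeoSeries; not Sedona's implementation."""

    def __init__(self, data: List[str], crs: Optional[str] = None) -> None:
        self._data = list(data)
        self._crs = crs
        self.empty_checks = 0  # each check stands in for one Spark job

    def _is_empty(self) -> bool:
        self.empty_checks += 1
        return len(self._data) == 0

    @property
    def crs(self) -> Optional[str]:
        # Assumption: the getter guards on emptiness and returns None.
        if self._is_empty():
            return None
        return self._crs

    def build_area_before(self) -> "TinySeries":
        if self._is_empty():
            # self.crs runs the emptiness check a second time.
            return TinySeries([], crs=self.crs)
        return TinySeries(self._data, crs=self._crs)

    def build_area_after(self) -> "TinySeries":
        if self._is_empty():
            # The series is known to be empty, so pass None directly.
            return TinySeries([], crs=None)
        return TinySeries(self._data, crs=self._crs)
```

In this sketch the "before" variant performs two emptiness checks on an empty series and the "after" variant performs one, with the same result.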

@jiayuasu added this to the sedona-1.9.0 milestone Mar 20, 2026
@jiayuasu merged commit 501df70 into master Mar 20, 2026
34 checks passed

Development

Successfully merging this pull request may close these issues.

GeoSeries: len(self) == 0 checks trigger full Spark scans

2 participants