
[GH-2760] Extend OSM PBF reader to support additional metadata fields #2776

Draft

jiayuasu wants to merge 1 commit into master from feature/osm-pbf-metadata-fields

Conversation

@jiayuasu (Member)

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Extend the OSM PBF reader to extract the following metadata fields from Info (for Node/Way/Relation) and DenseInfo (for DenseNodes) protobuf messages:

  • changeset (BIGINT) — the changeset ID the entity belongs to
  • timestamp (TIMESTAMP) — when the entity was last modified
  • uid (INT) — the user ID of the last editor
  • user (STRING) — the username of the last editor (resolved from the string table via user_sid)
  • version (INT) — the entity version number
  • visible (BOOLEAN) — whether the entity is visible (relevant for history files)

These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.
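For context on the timestamp column: in the PBF format, Info/DenseInfo timestamps are stored in units of the block's date_granularity (milliseconds per unit, default 1000), so the reader multiplies the stored value out to epoch milliseconds. A minimal standalone sketch of that conversion, using plain Java rather than the generated protobuf classes:

```java
public class TimestampConversion {
    // Per the OSM PBF spec, date_granularity is milliseconds per unit and
    // defaults to 1000, so stored timestamps are usually epoch seconds.
    static long toEpochMillis(int storedTimestamp, int dateGranularity) {
        // Widen before multiplying to avoid int overflow for large dates.
        return (long) storedTimestamp * dateGranularity;
    }

    public static void main(String[] args) {
        // 1_700_000_000 stored units with the default granularity of 1000 ms/unit
        System.out.println(toEpochMillis(1_700_000_000, 1000)); // prints 1700000000000
    }
}
```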

Key implementation details

  • Added InfoResolver utility class to extract Info fields from Osmformat.Info for nodes, ways, and relations
  • Extended DenseNodeExtractor to decode DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
  • Added metadata fields with setters to OSMEntity to avoid constructor bloat
  • Updated SchemaProvider to include the 6 new columns in the output schema
  • Updated OsmPartitionReader to map the new fields into Spark InternalRow
  • Passed PrimitiveBlock (instead of just StringTable) to WayIterator and RelationIterator so they can access both the string table and date_granularity for timestamp conversion
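The delta encoding mentioned for DenseInfo means each stored value is the difference from the previous node's value, starting from zero. A standalone sketch of the decoding loop, with a plain array standing in for the generated protobuf list:

```java
import java.util.Arrays;

public class DeltaDecode {
    // Decode a delta-coded list: each stored value is the difference from
    // the previous decoded value, with an implicit initial value of 0.
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        long acc = 0;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. changesets 100, 100, 105 are stored as 100, 0, 5
        System.out.println(Arrays.toString(decode(new long[] {100, 0, 5})));
        // prints [100, 100, 105]
    }
}
```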

How was this patch tested?

Added 3 new tests to OsmReaderTest:

  1. Metadata fields populated for all entities — verifies that version, timestamp, and changeset are non-null for all nodes/ways/relations in the Monaco PBF dataset, that timestamps fall within a plausible range, and that version >= 1 and changeset >= 0
  2. Schema includes metadata for dense nodes — verifies the 6 new fields appear in the schema when reading dense node PBF files
  3. Schema includes metadata for normal nodes — verifies the 6 new fields appear in the schema when reading normal node PBF files

All 10 existing tests continue to pass.

Did this PR include necessary documentation updates?

  • No. This PR does not change any public API, so no documentation updates are needed. The new fields are automatically available in the output schema when reading OSM PBF files.

Extract changeset, timestamp, uid, user, version, and visible fields
from the Info/DenseInfo protobuf messages that were previously ignored
by the OSM PBF reader. These fields are part of the standard OSM PBF
format and provide useful provenance metadata for each entity.
Copilot AI (Contributor) left a comment

Pull request overview

Extends Sedona’s Spark OSM PBF datasource to surface standard OSM metadata (changeset, timestamp, uid/user, version, visible) for Nodes/Ways/Relations, including DenseNodes, and wires these fields through to the Spark output schema and rows.

Changes:

  • Add 6 metadata columns to the Spark schema and map them into InternalRow.
  • Populate OSMEntity metadata from Osmformat.Info (nodes/ways/relations) via new InfoResolver.
  • Decode DenseNodes DenseInfo metadata (including delta-encoded fields) in DenseNodeExtractor, and update iterators to pass PrimitiveBlock/date_granularity.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala Adds tests for metadata presence and schema columns.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/SchemaProvider.scala Adds metadata columns to datasource schema.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/OsmPartitionReader.scala Maps new metadata fields into Spark InternalRow (incl. timestamp conversion).
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/model/OSMEntity.java Adds nullable metadata fields + getters/setters.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/WayIterator.java Populates Way metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/RelationIterator.java Populates Relation metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/NodeIterator.java Populates Node metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/BlobIterator.java Passes PrimitiveBlock/date_granularity through to iterators/extractor.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java New utility to extract Info fields onto entities.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java Decodes DenseInfo metadata (delta-coded fields) for DenseNodes.


Comment on lines +91 to +110
node.setVersion(denseInfo.getVersion(idx));

// timestamp, changeset, uid, user_sid are delta-encoded
long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
long changeset = denseInfo.getChangeset(idx) + firstChangeset;
int uid = denseInfo.getUid(idx) + firstUid;
int userSid = denseInfo.getUserSid(idx) + firstUserSid;

firstTimestamp = timestamp;
firstChangeset = changeset;
firstUid = uid;
firstUserSid = userSid;

node.setTimestamp(timestamp * dateGranularity);
node.setChangeset(changeset);
node.setUid(uid);
if (userSid > 0) {
  node.setUser(stringTable.getS(userSid).toStringUtf8());
}

Copilot AI Mar 21, 2026

Inside the hasDenseInfo block, denseInfo.getTimestamp/ getChangeset/ getUid/ getUserSid are indexed unconditionally. These are repeated fields and may be absent or shorter than idCount in some files, causing IndexOutOfBoundsException during parsing. Guard each access with the corresponding get*Count() > idx (and only advance the delta accumulator for fields that are actually present).

Suggested change (replace the unguarded accesses above with per-field presence checks):

if (denseInfo.getVersionCount() > idx) {
  node.setVersion(denseInfo.getVersion(idx));
}
// timestamp, changeset, uid, user_sid are delta-encoded
if (denseInfo.getTimestampCount() > idx) {
  long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
  firstTimestamp = timestamp;
  node.setTimestamp(timestamp * dateGranularity);
}
if (denseInfo.getChangesetCount() > idx) {
  long changeset = denseInfo.getChangeset(idx) + firstChangeset;
  firstChangeset = changeset;
  node.setChangeset(changeset);
}
if (denseInfo.getUidCount() > idx) {
  int uid = denseInfo.getUid(idx) + firstUid;
  firstUid = uid;
  node.setUid(uid);
}
if (denseInfo.getUserSidCount() > idx) {
  int userSid = denseInfo.getUserSid(idx) + firstUserSid;
  firstUserSid = userSid;
  if (userSid > 0) {
    node.setUser(stringTable.getS(userSid).toStringUtf8());
  }
}

Copilot uses AI. Check for mistakes.
entity.setUser(stringTable.getS(info.getUserSid()).toStringUtf8());
}
if (info.hasVisible()) {
  entity.setVisible(info.getVisible());
Copilot AI Mar 21, 2026

visible handling is incomplete per the PBF spec: when the file has required_features HistoricalInformation, objects that lack the visible flag must be assumed true (see generated Osmformat docs). Currently, if info.hasVisible() is false the column stays null, which is incorrect for history files. Consider parsing the HeaderBlock required_features and defaulting visible to true when appropriate (or explicitly documenting the limitation).

Suggested change:

entity.setVisible(info.getVisible());
} else {
  // Per OSM PBF spec, when HistoricalInformation is a required_feature,
  // missing "visible" must be treated as true. Default to true when absent.
  entity.setVisible(true);
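A hedged sketch of the header-based defaulting the review suggests. The helper name `resolveVisible` and the List<String> standing in for HeaderBlock required_features are illustrative, not the actual Sedona API:

```java
import java.util.List;

public class VisibleDefault {
    // Illustrative helper: per the OSM PBF spec, when the header lists
    // "HistoricalInformation" in required_features, an absent "visible"
    // flag must be interpreted as true.
    static Boolean resolveVisible(boolean hasVisible, boolean visible,
                                  List<String> requiredFeatures) {
        if (hasVisible) {
            return visible;
        }
        if (requiredFeatures.contains("HistoricalInformation")) {
            return true; // history file: absent means visible
        }
        return null; // non-history file: leave the column null
    }

    public static void main(String[] args) {
        System.out.println(resolveVisible(false, false, List.of("HistoricalInformation"))); // prints true
        System.out.println(resolveVisible(false, false, List.of("DenseNodes")));            // prints null
    }
}
```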

// visible is NOT delta-encoded, and may not be present
if (denseInfo.getVisibleCount() > idx) {
  node.setVisible(denseInfo.getVisible(idx));
Copilot AI Mar 21, 2026

For DenseNodes, visible is only set when visibleCount > idx, otherwise it remains null. The OSM PBF spec notes that in history files (required_features HistoricalInformation) missing visible values must be treated as true, so returning null can be incorrect. Consider defaulting to true based on header required_features (or leaving null but documenting/encoding that behavior explicitly).

Suggested change:

node.setVisible(denseInfo.getVisible(idx));
} else {
  // Per OSM PBF spec, missing 'visible' must be treated as true (especially in history files).
  node.setVisible(true);
fieldNames should contain("uid")
fieldNames should contain("user")
fieldNames should contain("version")
fieldNames should contain("visible")
Copilot AI Mar 21, 2026

The new dense-node test only asserts that the schema contains the metadata columns; it doesn’t validate that DenseNodeExtractor actually populates/decodes them (including the delta-decoding logic for timestamp/changeset/uid/user_sid). Adding at least one assertion on non-null/range-correct values for a dense PBF fixture would prevent regressions where the columns exist but are always null/incorrect.

Suggested change:

fieldNames should contain("visible")

// Verify that at least one dense node has populated metadata values
val nodeWithMetadata = denseData
  .where(col("kind") === "node")
  .select("changeset", "timestamp", "uid", "user", "version", "visible")
  .head()
val changeset = nodeWithMetadata.getAs[Long]("changeset")
val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
val uid = nodeWithMetadata.getAs[Long]("uid")
val user = nodeWithMetadata.getAs[String]("user")
val version = nodeWithMetadata.getAs[Long]("version")
val visible = nodeWithMetadata.getAs[Boolean]("visible")

// Basic range/non-null checks to ensure delta-decoded metadata is populated
changeset should be > 0L
timestamp should be > 0L
uid should be >= 0L
version should be > 0L
user should not be null
user.trim.isEmpty shouldBe false
// Accessing 'visible' as Boolean already guarantees it is non-null
(visible == true || visible == false) shouldBe true
Comment on lines +34 to +37
entity.setVersion(info.getVersion());
entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
entity.setChangeset(info.getChangeset());
entity.setUid(info.getUid());
Copilot AI Mar 21, 2026

populateInfo writes default protobuf values into the entity even when the corresponding fields are not present (e.g., version defaults to -1, timestamp/changeset/uid default to 0). This makes missing metadata look like real values and can produce incorrect timestamps (epoch) or version -1. Use info.hasVersion/hasTimestamp/hasChangeset/hasUid (and optionally treat version==-1 as unset) to only set these fields when present; otherwise leave them null.

Suggested change (replace the unconditional setters above):

if (info.hasVersion()) {
  int version = info.getVersion();
  if (version != -1) {
    entity.setVersion(version);
  }
}
if (info.hasTimestamp()) {
  entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
}
if (info.hasChangeset()) {
  entity.setChangeset(info.getChangeset());
}
if (info.hasUid()) {
  entity.setUid(info.getUid());
}
this.nodes = nodes;
this.keyIndex = 0;

this.hasDenseInfo = nodes.hasDenseinfo() && nodes.getDenseinfo().getVersionCount() > 0;
Copilot AI Mar 21, 2026

hasDenseInfo is derived only from getVersionCount() > 0, but the code later assumes timestamp/changeset/uid/user_sid lists are also present for every node. If a PBF includes DenseInfo with only versions populated (other lists empty), getTimestamp(idx) etc will throw. Consider computing presence per-field (using get*Count() > idx) instead of a single boolean gate.

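A standalone illustration of the per-field gate the review suggests, with arrays standing in for the repeated protobuf fields and their get*Count() accessors (names and shapes are illustrative):

```java
public class PerFieldGuard {
    // Repeated DenseInfo fields may be shorter than the id list, so guard
    // each access by its own length instead of one shared hasDenseInfo flag.
    // This mirrors: denseInfo.getTimestampCount() > idx ? getTimestamp(idx) : null
    static Long valueAt(long[] field, int idx) {
        return field.length > idx ? field[idx] : null;
    }

    public static void main(String[] args) {
        long[] timestamps = {1, 2};   // only two entries present
        long[] versions = {1, 2, 3};  // three entries present
        System.out.println(valueAt(versions, 2));   // prints 3
        System.out.println(valueAt(timestamps, 2)); // prints null
    }
}
```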
@jiayuasu jiayuasu marked this pull request as draft March 22, 2026 22:17


Development

Successfully merging this pull request may close these issues.

Extend OSM PBF reader to support additional fields (changeset, timestamp, uid, user, version, visible)
