
[GH-2760] Extend OSM PBF reader to support additional metadata fields #2776

Draft

jiayuasu wants to merge 1 commit into master from feature/osm-pbf-metadata-fields

Conversation

@jiayuasu (Member)

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Extend the OSM PBF reader to extract the following metadata fields from Info (for Node/Way/Relation) and DenseInfo (for DenseNodes) protobuf messages:

  • changeset (BIGINT) — the changeset ID the entity belongs to
  • timestamp (TIMESTAMP) — when the entity was last modified
  • uid (INT) — the user ID of the last editor
  • user (STRING) — the username of the last editor (resolved from the string table via user_sid)
  • version (INT) — the entity version number
  • visible (BOOLEAN) — whether the entity is visible (relevant for history files)

These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.
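For context on the timestamp column: in the PBF format, Info/DenseInfo timestamps are stored in units of the block's date_granularity (milliseconds per unit, default 1000), so the reader multiplies the stored value out to epoch milliseconds. A minimal standalone sketch of that conversion, using plain Java rather than the generated protobuf classes:

```java
public class TimestampConversion {
    // Per the OSM PBF spec, date_granularity is milliseconds per unit and
    // defaults to 1000, so stored timestamps are usually epoch seconds.
    static long toEpochMillis(int storedTimestamp, int dateGranularity) {
        // Widen before multiplying to avoid int overflow for large dates.
        return (long) storedTimestamp * dateGranularity;
    }

    public static void main(String[] args) {
        // 1_700_000_000 stored units with the default granularity of 1000 ms/unit
        System.out.println(toEpochMillis(1_700_000_000, 1000)); // prints 1700000000000
    }
}
```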

Key implementation details

  • Added InfoResolver utility class to extract Info fields from Osmformat.Info for nodes, ways, and relations
  • Extended DenseNodeExtractor to decode DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
  • Added metadata fields with setters to OSMEntity to avoid constructor bloat
  • Updated SchemaProvider to include the 6 new columns in the output schema
  • Updated OsmPartitionReader to map the new fields into Spark InternalRow
  • Passed PrimitiveBlock (instead of just StringTable) to WayIterator and RelationIterator so they can access both the string table and date_granularity for timestamp conversion
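The delta encoding mentioned for DenseInfo means each stored value is the difference from the previous node's value, starting from zero. A standalone sketch of the decoding loop, with a plain array standing in for the generated protobuf list:

```java
import java.util.Arrays;

public class DeltaDecode {
    // Decode a delta-coded list: each stored value is the difference from
    // the previous decoded value, with an implicit initial value of 0.
    static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        long acc = 0;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. changesets 100, 100, 105 are stored as 100, 0, 5
        System.out.println(Arrays.toString(decode(new long[] {100, 0, 5})));
        // prints [100, 100, 105]
    }
}
```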

How was this patch tested?

Added 3 new tests to OsmReaderTest:

  1. Metadata fields populated for all entities — verifies that version, timestamp, and changeset are non-null for all nodes/ways/relations in the Monaco PBF dataset, that timestamps fall within a plausible range, and that version >= 1 and changeset >= 0
  2. Schema includes metadata for dense nodes — verifies the 6 new fields appear in the schema when reading dense node PBF files
  3. Schema includes metadata for normal nodes — verifies the 6 new fields appear in the schema when reading normal node PBF files

All 10 existing tests continue to pass.

Did this PR include necessary documentation updates?

  • No. This PR does not change any public API, so no documentation updates are needed. The new fields are automatically available in the output schema when reading OSM PBF files.

Extract changeset, timestamp, uid, user, version, and visible fields
from the Info/DenseInfo protobuf messages that were previously ignored
by the OSM PBF reader. These fields are part of the standard OSM PBF
format and provide useful provenance metadata for each entity.
Copilot AI (Contributor) left a comment

Pull request overview

Extends Sedona’s Spark OSM PBF datasource to surface standard OSM metadata (changeset, timestamp, uid/user, version, visible) for Nodes/Ways/Relations, including DenseNodes, and wires these fields through to the Spark output schema and rows.

Changes:

  • Add 6 metadata columns to the Spark schema and map them into InternalRow.
  • Populate OSMEntity metadata from Osmformat.Info (nodes/ways/relations) via new InfoResolver.
  • Decode DenseNodes DenseInfo metadata (including delta-encoded fields) in DenseNodeExtractor, and update iterators to pass PrimitiveBlock/date_granularity.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala Adds tests for metadata presence and schema columns.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/SchemaProvider.scala Adds metadata columns to datasource schema.
spark/common/src/main/scala/org/apache/sedona/sql/datasources/osm/OsmPartitionReader.scala Maps new metadata fields into Spark InternalRow (incl. timestamp conversion).
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/model/OSMEntity.java Adds nullable metadata fields + getters/setters.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/WayIterator.java Populates Way metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/RelationIterator.java Populates Relation metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/NodeIterator.java Populates Node metadata from Info using InfoResolver.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/iterators/BlobIterator.java Passes PrimitiveBlock/date_granularity through to iterators/extractor.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java New utility to extract Info fields onto entities.
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java Decodes DenseInfo metadata (delta-coded fields) for DenseNodes.


Comment on lines +91 to +110
node.setVersion(denseInfo.getVersion(idx));

// timestamp, changeset, uid, user_sid are delta-encoded
long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
long changeset = denseInfo.getChangeset(idx) + firstChangeset;
int uid = denseInfo.getUid(idx) + firstUid;
int userSid = denseInfo.getUserSid(idx) + firstUserSid;

firstTimestamp = timestamp;
firstChangeset = changeset;
firstUid = uid;
firstUserSid = userSid;

node.setTimestamp(timestamp * dateGranularity);
node.setChangeset(changeset);
node.setUid(uid);
if (userSid > 0) {
  node.setUser(stringTable.getS(userSid).toStringUtf8());
}

Copilot AI Mar 21, 2026

Inside the hasDenseInfo block, denseInfo.getTimestamp/ getChangeset/ getUid/ getUserSid are indexed unconditionally. These are repeated fields and may be absent or shorter than idCount in some files, causing IndexOutOfBoundsException during parsing. Guard each access with the corresponding get*Count() > idx (and only advance the delta accumulator for fields that are actually present).

Suggested change (replace the unguarded accesses above with per-field presence checks):

if (denseInfo.getVersionCount() > idx) {
  node.setVersion(denseInfo.getVersion(idx));
}
// timestamp, changeset, uid, user_sid are delta-encoded
if (denseInfo.getTimestampCount() > idx) {
  long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
  firstTimestamp = timestamp;
  node.setTimestamp(timestamp * dateGranularity);
}
if (denseInfo.getChangesetCount() > idx) {
  long changeset = denseInfo.getChangeset(idx) + firstChangeset;
  firstChangeset = changeset;
  node.setChangeset(changeset);
}
if (denseInfo.getUidCount() > idx) {
  int uid = denseInfo.getUid(idx) + firstUid;
  firstUid = uid;
  node.setUid(uid);
}
if (denseInfo.getUserSidCount() > idx) {
  int userSid = denseInfo.getUserSid(idx) + firstUserSid;
  firstUserSid = userSid;
  if (userSid > 0) {
    node.setUser(stringTable.getS(userSid).toStringUtf8());
  }
}

Copilot uses AI. Check for mistakes.
entity.setUser(stringTable.getS(info.getUserSid()).toStringUtf8());
}
if (info.hasVisible()) {
  entity.setVisible(info.getVisible());
Copilot AI Mar 21, 2026

visible handling is incomplete per the PBF spec: when the file has required_features HistoricalInformation, objects that lack the visible flag must be assumed true (see generated Osmformat docs). Currently, if info.hasVisible() is false the column stays null, which is incorrect for history files. Consider parsing the HeaderBlock required_features and defaulting visible to true when appropriate (or explicitly documenting the limitation).

Suggested change:

entity.setVisible(info.getVisible());
} else {
  // Per OSM PBF spec, when HistoricalInformation is a required_feature,
  // missing "visible" must be treated as true. Default to true when absent.
  entity.setVisible(true);
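A hedged sketch of the header-based defaulting the review suggests. The helper name `resolveVisible` and the List<String> standing in for HeaderBlock required_features are illustrative, not the actual Sedona API:

```java
import java.util.List;

public class VisibleDefault {
    // Illustrative helper: per the OSM PBF spec, when the header lists
    // "HistoricalInformation" in required_features, an absent "visible"
    // flag must be interpreted as true.
    static Boolean resolveVisible(boolean hasVisible, boolean visible,
                                  List<String> requiredFeatures) {
        if (hasVisible) {
            return visible;
        }
        if (requiredFeatures.contains("HistoricalInformation")) {
            return true; // history file: absent means visible
        }
        return null; // non-history file: leave the column null
    }

    public static void main(String[] args) {
        System.out.println(resolveVisible(false, false, List.of("HistoricalInformation"))); // prints true
        System.out.println(resolveVisible(false, false, List.of("DenseNodes")));            // prints null
    }
}
```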

// visible is NOT delta-encoded, and may not be present
if (denseInfo.getVisibleCount() > idx) {
  node.setVisible(denseInfo.getVisible(idx));
Copilot AI Mar 21, 2026

For DenseNodes, visible is only set when visibleCount > idx, otherwise it remains null. The OSM PBF spec notes that in history files (required_features HistoricalInformation) missing visible values must be treated as true, so returning null can be incorrect. Consider defaulting to true based on header required_features (or leaving null but documenting/encoding that behavior explicitly).

Suggested change:

node.setVisible(denseInfo.getVisible(idx));
} else {
  // Per OSM PBF spec, missing 'visible' must be treated as true (especially in history files).
  node.setVisible(true);
fieldNames should contain("uid")
fieldNames should contain("user")
fieldNames should contain("version")
fieldNames should contain("visible")
Copilot AI Mar 21, 2026

The new dense-node test only asserts that the schema contains the metadata columns; it doesn’t validate that DenseNodeExtractor actually populates/decodes them (including the delta-decoding logic for timestamp/changeset/uid/user_sid). Adding at least one assertion on non-null/range-correct values for a dense PBF fixture would prevent regressions where the columns exist but are always null/incorrect.

Suggested change:

fieldNames should contain("visible")

// Verify that at least one dense node has populated metadata values
val nodeWithMetadata = denseData
  .where(col("kind") === "node")
  .select("changeset", "timestamp", "uid", "user", "version", "visible")
  .head()
val changeset = nodeWithMetadata.getAs[Long]("changeset")
val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
val uid = nodeWithMetadata.getAs[Long]("uid")
val user = nodeWithMetadata.getAs[String]("user")
val version = nodeWithMetadata.getAs[Long]("version")
val visible = nodeWithMetadata.getAs[Boolean]("visible")

// Basic range/non-null checks to ensure delta-decoded metadata is populated
changeset should be > 0L
timestamp should be > 0L
uid should be >= 0L
version should be > 0L
user should not be null
user.trim.isEmpty shouldBe false
// Accessing 'visible' as Boolean already guarantees it is non-null
(visible == true || visible == false) shouldBe true
Comment on lines +34 to +37
entity.setVersion(info.getVersion());
entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
entity.setChangeset(info.getChangeset());
entity.setUid(info.getUid());
Copilot AI Mar 21, 2026

populateInfo writes default protobuf values into the entity even when the corresponding fields are not present (e.g., version defaults to -1, timestamp/changeset/uid default to 0). This makes missing metadata look like real values and can produce incorrect timestamps (epoch) or version -1. Use info.hasVersion/hasTimestamp/hasChangeset/hasUid (and optionally treat version==-1 as unset) to only set these fields when present; otherwise leave them null.

Suggested change (replace the unconditional setters above):

if (info.hasVersion()) {
  int version = info.getVersion();
  if (version != -1) {
    entity.setVersion(version);
  }
}
if (info.hasTimestamp()) {
  entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
}
if (info.hasChangeset()) {
  entity.setChangeset(info.getChangeset());
}
if (info.hasUid()) {
  entity.setUid(info.getUid());
}
this.nodes = nodes;
this.keyIndex = 0;

this.hasDenseInfo = nodes.hasDenseinfo() && nodes.getDenseinfo().getVersionCount() > 0;
Copilot AI Mar 21, 2026

hasDenseInfo is derived only from getVersionCount() > 0, but the code later assumes timestamp/changeset/uid/user_sid lists are also present for every node. If a PBF includes DenseInfo with only versions populated (other lists empty), getTimestamp(idx) etc will throw. Consider computing presence per-field (using get*Count() > idx) instead of a single boolean gate.

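A standalone illustration of the per-field gate the review suggests, with arrays standing in for the repeated protobuf fields and their get*Count() accessors (names and shapes are illustrative):

```java
public class PerFieldGuard {
    // Repeated DenseInfo fields may be shorter than the id list, so guard
    // each access by its own length instead of one shared hasDenseInfo flag.
    // This mirrors: denseInfo.getTimestampCount() > idx ? getTimestamp(idx) : null
    static Long valueAt(long[] field, int idx) {
        return field.length > idx ? field[idx] : null;
    }

    public static void main(String[] args) {
        long[] timestamps = {1, 2};   // only two entries present
        long[] versions = {1, 2, 3};  // three entries present
        System.out.println(valueAt(versions, 2));   // prints 3
        System.out.println(valueAt(timestamps, 2)); // prints null
    }
}
```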
@jiayuasu jiayuasu marked this pull request as draft March 22, 2026 22:17


Development

Successfully merging this pull request may close these issues.

Extend OSM PBF reader to support additional fields (changeset, timestamp, uid, user, version, visible)
