In #1 (comment) we realized that some providers generate new ids for each version of a dataset.
When we think about it, it makes sense that ids are only meaningful within the dataset itself. We are not guaranteed that they will make sense in the context of another dataset, even of the same feed.
We looked at mdb-2014 and saw that the shape_ids in shapes.txt are regenerated for each dataset, e.g. from mdb-2014-202603110034:
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,
But for mdb-2014-202603090029 the "same" line is:
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,
We need to establish how prevalent this is in the feeds we host.
And find a way to evaluate the diff that takes these generated ids into account.
Here is Copilot's take on it:
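As a starting point for measuring prevalence, here is a minimal sketch that computes how much of an id column survives between two versions of the same file. The function names (`id_set`, `id_overlap`) and the choice of a Jaccard ratio are assumptions for illustration, not something gtfs-diff-engine provides today. An overlap near 0 on a file whose content barely changed would suggest regenerated ids.

```python
import csv
import io


def id_set(csv_text, column):
    """Collect the distinct non-empty values of one id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row.get(column)}


def id_overlap(old_text, new_text, column):
    """Jaccard overlap of two id sets: 1.0 = same ids, 0.0 = none shared."""
    old_ids = id_set(old_text, column)
    new_ids = id_set(new_text, column)
    union = old_ids | new_ids
    if not union:
        return 1.0  # assumption: two empty files count as fully stable
    return len(old_ids & new_ids) / len(union)


HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
# The two rows quoted above from mdb-2014.
OLD_SHAPES = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
NEW_SHAPES = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

print(id_overlap(OLD_SHAPES, NEW_SHAPES, "shape_id"))  # prints 0.0
```

Running this per file and per id column across consecutive datasets of each hosted feed would give a rough picture of which feeds are affected.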
A rough breakdown by field
| Field | Stability | Reason |
| --- | --- | --- |
| shape_id | 🔴 Very unstable | Almost universally regenerated on export by scheduling tools |
| trip_id | 🟠 Often unstable | HASTUS, Trapeze, and other tools generate these internally |
| service_id | 🟠 Often unstable | Date-based generation is common (e.g. 20240115_WD) |
| block_id | 🟠 Often unstable | Operational scheduling artifact |
| route_id | 🟡 Moderately stable | Often matches public route numbers, but not always |
| fare_id | 🟡 Moderately stable | Fare structures change slowly |
| stop_id | 🟢 Usually stable | Stops are physical; agencies maintain these |
| agency_id | 🟢 Very stable | Rarely changes |
So realistically stop_id and agency_id are the only keys you can reliably trust. Everything else is potentially a surrogate.
This means that for a large fraction of feeds, the diff engine's output for shapes.txt, trips.txt, and calendar.txt is likely dominated by key churn rather than real changes, which significantly undermines the value of the tool for those files without content-based matching.
This is arguably the most important design problem for gtfs-diff-engine to solve.
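To make "content-based matching" concrete for shapes.txt, here is a rough sketch: fingerprint each shape by hashing its ordered points while ignoring the shape_id, then pair old and new ids that share a fingerprint. The helper names (`shape_fingerprints`, `match_shapes`) are hypothetical, and the sketch assumes coordinates are serialized identically in both versions (it compares the raw strings, with no rounding or tolerance).

```python
import csv
import hashlib
import io
from collections import defaultdict


def shape_fingerprints(shapes_text):
    """Map each shape_id to a hash of its ordered points, ignoring the id."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(shapes_text)):
        points[row["shape_id"]].append(
            (int(row["shape_pt_sequence"]), row["shape_pt_lat"], row["shape_pt_lon"])
        )
    fingerprints = {}
    for shape_id, pts in points.items():
        # Sort by sequence so row order in the file does not matter.
        canonical = "|".join(f"{seq},{lat},{lon}" for seq, lat, lon in sorted(pts))
        fingerprints[shape_id] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return fingerprints


def match_shapes(old_text, new_text):
    """Return {old_shape_id: new_shape_id} for shapes with identical points."""
    old_by_hash = defaultdict(list)
    for shape_id, fp in shape_fingerprints(old_text).items():
        old_by_hash[fp].append(shape_id)
    matches = {}
    for new_id, fp in shape_fingerprints(new_text).items():
        candidates = old_by_hash.get(fp)
        if candidates:
            # Assumption: duplicate geometries are paired in file order.
            matches[candidates.pop(0)] = new_id
    return matches


HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
OLD = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
NEW = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

print(match_shapes(OLD, NEW))
```

Shapes left unmatched on either side would then be the real additions, removals, or geometry changes. The same idea could plausibly extend to trips.txt and calendar.txt by hashing the non-id columns, though exact-match hashing will miss shapes that changed only slightly.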