spike: Tackle the problem of generated ids

In https://github.com/MobilityData/gtfs-diff-engine/pull/1#issuecomment-4300008238 we realized that some providers generate new ids for each version of their datasets.
When we think about it, it makes sense that ids are only meaningful within the dataset itself. We are not guaranteed that they will make sense in the context of another dataset, even of the same feed.

We looked at mdb-2014 and saw that the shape_ids in shapes.txt are regenerated for each dataset, e.g. from mdb-2014-202603110034:
```
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,
```
But for mdb-2014-202603090029 the "same" line is:
```
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,
```

We need to establish how prevalent this is in the feeds we host,
and find a way to evaluate the diff that accounts for these generated ids.
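
As a starting point for measuring prevalence, one option is to compare the set of id values between two versions of the same feed. The sketch below is only an illustration: `id_values` and `id_overlap` are hypothetical helper names, not part of gtfs-diff-engine, and the sample rows are the two shapes.txt lines quoted above. An overlap close to 0 for a file whose content barely changed would suggest regenerated ids.

```python
import csv
import io


def id_values(csv_text: str, id_column: str) -> set[str]:
    """Collect the distinct non-empty values of one id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column] for row in reader if row.get(id_column)}


def id_overlap(old_csv: str, new_csv: str, id_column: str) -> float:
    """Jaccard overlap of id values between two versions (1.0 = same id set)."""
    old_ids = id_values(old_csv, id_column)
    new_ids = id_values(new_csv, id_column)
    union = old_ids | new_ids
    if not union:
        return 1.0  # nothing to compare, treat as unchanged
    return len(old_ids & new_ids) / len(union)


HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
# Rows copied from mdb-2014-202603090029 (old) and mdb-2014-202603110034 (new).
old = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
new = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

print(id_overlap(old, new, "shape_id"))  # 0.0: no shape_id survives
```

On its own a low overlap does not prove the ids are generated, since the content may really have changed, so it would need to be read alongside a content comparison.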

Here is Copilot's take on it:
> A rough breakdown by field:
>
>  | Field | Stability | Reason |
>  |---|---|---|
>  | `shape_id` | 🔴 Very unstable | Almost universally regenerated on export by scheduling tools |
>  | `trip_id` | 🟠 Often unstable | HASTUS, Trapeze, and other tools generate these internally |
>  | `service_id` | 🟠 Often unstable | Date-based generation is common (e.g. `20240115_WD`) |
>  | `block_id` | 🟠 Often unstable | Operational scheduling artifact |
>  | `route_id` | 🟡 Moderately stable | Often matches public route numbers, but not always |
>  | `fare_id` | 🟡 Moderately stable | Fare structures change slowly |
>  | `stop_id` | 🟢 Usually stable | Stops are physical — agencies maintain these |
>  | `agency_id` | 🟢 Very stable | Rarely changes |
> 
> So realistically stop_id and agency_id are the only keys you can reliably trust. Everything else is potentially a surrogate.
> 
> This means for a large fraction of feeds, the diff engine's output for shapes.txt, trips.txt, and calendar.txt is likely dominated by key churn rather than real changes — which
> significantly undermines the value of the tool for those files without content-based matching.
> 
> This is arguably the most important design problem for gtfs-diff-engine to solve
> 
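
To make the idea of content-based matching concrete for shapes.txt, here is a minimal sketch. It assumes a shape can be identified by a hash of its ordered points, with the coordinates compared as exact strings; `shape_fingerprints` and `match_shapes` are hypothetical names used for illustration, not existing gtfs-diff-engine functions.

```python
import csv
import hashlib
import io
from collections import defaultdict


def shape_fingerprints(shapes_csv: str) -> dict[str, str]:
    """Map each shape_id to a hash of its ordered points, ignoring the id."""
    points = defaultdict(list)
    for row in csv.DictReader(io.StringIO(shapes_csv)):
        points[row["shape_id"]].append(
            (int(row["shape_pt_sequence"]), row["shape_pt_lat"], row["shape_pt_lon"])
        )
    fingerprints = {}
    for shape_id, pts in points.items():
        digest = hashlib.sha256()
        for seq, lat, lon in sorted(pts):  # order by shape_pt_sequence
            digest.update(f"{seq},{lat},{lon}\n".encode())
        fingerprints[shape_id] = digest.hexdigest()
    return fingerprints


def match_shapes(old_csv: str, new_csv: str) -> dict[str, str]:
    """Pair each old shape_id with a new shape_id that has identical points."""
    # If several new shapes share one geometry, the last one seen wins.
    new_by_fp = {fp: sid for sid, fp in shape_fingerprints(new_csv).items()}
    return {
        old_id: new_by_fp[fp]
        for old_id, fp in shape_fingerprints(old_csv).items()
        if fp in new_by_fp
    }


HEADER = "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled\n"
# Rows copied from mdb-2014-202603090029 (old) and mdb-2014-202603110034 (new).
old = HEADER + "3d88c650-6c5b-4f14-996b-e5b01b4eec33,50.118257595,-5.540823891,229,\n"
new = HEADER + "0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,\n"

print(match_shapes(old, new))
```

With the two sample rows, the old id `3d88c650-…` is paired with the new id `0000abe0-…`, so the diff could treat them as the same shape. Exact string comparison is a deliberate simplification: a feed that only changes coordinate precision would not match, and whether to round or tolerate small differences is an open design question for the spike.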
