gtfs diff engine by cka-y · Pull Request #1 · MobilityData/gtfs-diff-engine

cka-y · 2026-04-12T14:00:05Z

Closes #2

This pull request introduces the initial release of the GTFS Diff Engine, a memory-efficient Python library and CLI for comparing two GTFS feeds and producing a structured diff conforming to the GTFS Diff v2 schema. The changes include a robust implementation of the core diff logic, a clear public API, a command-line interface, detailed documentation, and supporting scripts for end-to-end usage.

The most important changes are:

Core Functionality and API:

Implements the core diff logic in engine.py, exposing a single diff_feeds() function that returns a typed Pydantic model representing the diff result.
Defines the GTFS file schema, supported files, and primary key columns in gtfs_definitions.py, with a helper for primary key lookup.

Command-Line Interface and Tooling:

Adds a Click-based CLI (gtfs-diff) in cli.py, supporting options for output file, row change cap, pretty-printing, and feed download timestamps.
Provides a Bash script compare_feeds.sh to automate downloading two GTFS feeds by URL and running the diff tool, with argument parsing and error handling.

Documentation and Examples:

Expands README.md with a comprehensive overview, installation instructions, usage examples, API reference, supported files table, output schema example, and implementation notes on memory efficiency.
Adds docs/architecture.md detailing design goals, module structure, streaming diff algorithm, edge case handling, and future improvements.

Packaging and Project Setup:

Configures packaging with pyproject.toml for installation, development, and test dependencies, and sets up the CLI entry point.
Adds package versioning and module entry points in __init__.py and __main__.py.

These changes collectively deliver a ready-to-use, well-documented GTFS diff engine suitable for both programmatic and CLI-based workflows.
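
For readers new to the approach, here is a minimal sketch of primary-key-based row comparison, the general idea behind the description above. It is not the code in engine.py: the `PRIMARY_KEYS` table, `index_rows`, and `diff_rows` are hypothetical names, and this version loads both files fully into memory rather than streaming.

```python
import csv
import io

# Hypothetical subset of a primary-key table; the real gtfs_definitions.py may differ.
PRIMARY_KEYS = {
    "stops.txt": ("stop_id",),
    "shapes.txt": ("shape_id", "shape_pt_sequence"),
}


def index_rows(filename: str, text: str) -> dict:
    """Map each row's primary-key tuple to the row as a dict."""
    key_cols = PRIMARY_KEYS[filename]
    reader = csv.DictReader(io.StringIO(text))
    return {tuple(row[c] for c in key_cols): row for row in reader}


def diff_rows(filename: str, base_text: str, new_text: str) -> dict:
    """Classify rows as added, deleted, or modified by primary key."""
    base = index_rows(filename, base_text)
    new = index_rows(filename, new_text)
    added = sorted(k for k in new if k not in base)
    deleted = sorted(k for k in base if k not in new)
    modified = {}
    for key in base.keys() & new.keys():
        # Record (base_value, new_value) for every column that changed.
        changes = {
            col: (base[key].get(col), new[key].get(col))
            for col in new[key]
            if base[key].get(col) != new[key].get(col)
        }
        if changes:
            modified[key] = changes
    return {"added": added, "deleted": deleted, "modified": modified}
```

Because rows are matched by key rather than by position, reordering rows alone produces no differences in this sketch.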

jcpitre · 2026-04-22T21:13:43Z

Some testing:

Went to the extreme and tested it on the biggest feed we have:
- mdb-2014-202603090029.zip
- mdb-2014-202603110034.zip

Each is about 1.5 GB zipped, with over 38 million entries in shapes.txt and 50 million entries in stop_times.txt.
I had to stop it while it was working on shapes.txt, at which point it was using about 4 GB of memory:

[gtfs-diff 16:31:57 4321MB]   [shapes.txt] scan done in 0.0s — added=50,118,266 deleted=49,537,043 modified=0

This indicates more than 50 million additions and about as many deletions.

Examining shapes.txt, we can see that it uses a UUID for the shape_id, e.g.:

0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,

These IDs seem to be completely regenerated for each version, which means a meaningful diff of this kind of dataset is not possible.

We should survey the datasets we have to see whether regenerated IDs are common.
We should also consider finding heuristics that tell us quickly whether a meaningful comparison between two datasets is possible.
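
One possible heuristic, sketched here purely as an illustration: measure how much the sets of distinct IDs in the two versions overlap before attempting a full diff. The function names and the 0.1 threshold are assumptions, not something decided in this PR, and a real check would likely stream from the zip instead of taking the file content as a string.

```python
import csv
import io


def distinct_ids(text: str, id_column: str) -> set:
    """Collect the distinct values of id_column from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[id_column] for row in reader}


def id_overlap(base_ids: set, new_ids: set) -> float:
    """Jaccard similarity of two ID sets (1.0 = identical, 0.0 = disjoint)."""
    union = base_ids | new_ids
    if not union:
        return 1.0  # two empty files trivially match
    return len(base_ids & new_ids) / len(union)


def looks_comparable(base_ids: set, new_ids: set, threshold: float = 0.1) -> bool:
    """Guess whether IDs are stable enough for a row-level diff to be useful."""
    return id_overlap(base_ids, new_ids) >= threshold
```

An overlap near zero, as would be expected when every shape_id is a fresh UUID, suggests the IDs were regenerated and the row-level diff will mostly report additions and deletions.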

jcpitre · 2026-04-27T19:40:18Z

Tested with STM: mdb-2126-202511111837.zip vs mdb-2126-202511130041.zip.
I spot-checked some of the reported differences, and they seem OK, except:

Problems:

In shapes.txt and stops.txt, a significant number (thousands?) of lines were reported as modified only because coordinates had extra trailing zeros:

          {
            "identifier": {
              "shape_id": "11071",
              "shape_pt_sequence": "150001"
            },
            "raw_value": "11071,45.518332,-73.556250,150001",
            "base_line_number": 35,
            "new_line_number": 35,
            "field_changes": [
              {
                "field": "shape_pt_lon",
                "base_value": "-73.55625",
                "new_value": "-73.556250"
              }
            ]
          }

In feed_info.txt, the base file had an empty line at the end, which translated to this JSON entry:

       "deleted": [
          {
            "identifier": {
              "feed_publisher_name": "",
              "feed_publisher_url": "",
              "feed_lang": "",
              "feed_start_date": "",
              "feed_end_date": "",
              "feed_version": ""
            },
            "raw_value": ",,,,,",
            "base_line_number": 3
          },
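
One way to avoid this kind of entry, sketched under the assumption that rows with no content should simply be ignored, is to skip them while reading. `iter_data_rows` is a hypothetical helper, not a function from this PR; it covers both a truly empty line and a row of bare commas.

```python
import csv
import io


def iter_data_rows(text: str):
    """Yield (line_number, row_dict) pairs, skipping rows with no content."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader, None)
    if headers is None:
        return  # empty file: nothing to yield
    for row in reader:
        # An empty line parses as []; a line of bare commas parses as all-empty cells.
        if not any(cell.strip() for cell in row):
            continue
        yield reader.line_num, dict(zip(headers, row))
```

Line numbers come from the reader itself, so they still refer to positions in the original file even when blank rows are skipped.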

jcpitre · 2026-04-28T17:08:46Z

Modified code to ignore extra zeros in coordinates.
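
For illustration, a numeric comparison that ignores trailing zeros could look roughly like the sketch below. This is not the code from the commit; `values_differ_numeric` is a hypothetical name, and the sketch assumes it is applied only to columns known to be numeric, such as coordinates.

```python
from decimal import Decimal, InvalidOperation


def values_differ_numeric(a: str, b: str) -> bool:
    """Return True if two field values differ, comparing numerically when possible."""
    if a == b:
        return False
    try:
        da, db = Decimal(a), Decimal(b)
    except InvalidOperation:
        # At least one value is not a number: fall back to the string result.
        return True
    if not (da.is_finite() and db.is_finite()):
        # NaN or infinity: treat differing strings as a real difference.
        return True
    return da != db
```

Applying this to ID columns would be wrong, since "007" and "7" would compare as equal, which is why the sketch assumes the caller restricts it to numeric fields.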

emmambd · 2026-04-28T17:15:22Z

@jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120MB zipped.

jcpitre · 2026-05-01T13:07:59Z

> @jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120MB zipped.

mdb-2014 is the UK aggregate feed. Size is about 1.5GB zipped. Here are the datasets I used:

davidgamez · 2026-05-04T18:45:08Z

+# compare_feeds.sh — Download two GTFS feeds by URL and diff them.
+#
+# Usage:
+#   ./scripts/compare_feeds.sh <BASE_URL> <NEW_URL> [OPTIONS]


[suggestion]: It would be nice if the shell script also supported local zip and folder options. This can be done outside this PR.

davidgamez · 2026-05-04T18:51:53Z

+    return dict(zip(headers, row))
+
+
+def _values_differ(a: str, b: str) -> bool:


[question]: Do we need to trim the parameters to make sure only the "actual" content is compared?
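
To make the question concrete, here is a minimal sketch of what trimming before comparison could look like. The names are hypothetical and this is not the PR's `_values_differ`; it only strips leading and trailing whitespace and leaves interior whitespace untouched.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Strip leading and trailing whitespace; treat a missing value as empty."""
    return value.strip() if value is not None else ""


def values_differ_trimmed(a: Optional[str], b: Optional[str]) -> bool:
    """Compare two field values after trimming surrounding whitespace."""
    return normalize(a) != normalize(b)
```

Whether a missing value and an empty string should count as equal is a design choice; this sketch treats them as the same.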

davidgamez · 2026-05-04T18:55:00Z

+# Per-file diff
+# ---------------------------------------------------------------------------
+
+def _diff_file(


[suggestion]: I find this function long. To improve readability, I suggest splitting it into multiple functions. We can naturally split it "per section", as it's already commented within the function.

cka-y and others added 4 commits April 12, 2026 09:58

wip: gtfs diff engine

87c53d4

Fix unchanged files reported as modified, omit null fields from JSON output, and update tests

75226ff

Add tests and comments for row/column reorder silent-ignore design question

914c545

Rename gtfs_diff_engine to gtfs-diff-engine in docs and comments

4692253

jcpitre mentioned this pull request Apr 22, 2026

spike: Tackle the problem of generated ids #4

Open

jcpitre added 3 commits April 27, 2026 16:17

Add timestamped progress tracing with memory usage to diff engine

7e84d4d

Switch trace memory metric to psutil live RSS; fix test stdout parsing

44897e1

Ignore trailing-zero coordinate diffs by comparing fields numerically

fa9acb4

jcpitre marked this pull request as ready for review April 28, 2026 17:09

jcpitre changed the title ~~wip: gtfs diff engine~~ gtfs diff engine Apr 28, 2026

davidgamez self-requested a review May 4, 2026 14:47

davidgamez reviewed May 4, 2026

jcpitre mentioned this pull request May 4, 2026

Add a heuristic that tells if the standard diff is possible #5

Open

cka-y wants to merge 7 commits into main from feat/1637

4 participants
