gtfs diff engine#1

Open
cka-y wants to merge 7 commits into main from feat/1637

Conversation

cka-y (Collaborator) commented Apr 12, 2026

Closes #2

This pull request introduces the initial release of the GTFS Diff Engine, a memory-efficient Python library and CLI for comparing two GTFS feeds and producing a structured diff conforming to the GTFS Diff v2 schema. The changes include a robust implementation of the core diff logic, a clear public API, a command-line interface, detailed documentation, and supporting scripts for end-to-end usage.

The most important changes are:

Core Functionality and API:

  • Implements the core diff logic in engine.py, exposing a single diff_feeds() function that returns a typed Pydantic model representing the diff result.
  • Defines the GTFS file schema, supported files, and primary key columns in gtfs_definitions.py, with a helper for primary key lookup.

Command-Line Interface and Tooling:

  • Adds a Click-based CLI (gtfs-diff) in cli.py, supporting options for output file, row change cap, pretty-printing, and feed download timestamps.
  • Provides a Bash script compare_feeds.sh to automate downloading two GTFS feeds by URL and running the diff tool, with argument parsing and error handling.

Documentation and Examples:

  • Expands README.md with a comprehensive overview, installation instructions, usage examples, API reference, supported files table, output schema example, and implementation notes on memory efficiency.
  • Adds docs/architecture.md detailing design goals, module structure, streaming diff algorithm, edge case handling, and future improvements.

Packaging and Project Setup:

  • Configures packaging with pyproject.toml for installation, development, and test dependencies, and sets up the CLI entry point.
  • Adds package versioning and module entry points in __init__.py and __main__.py.

These changes collectively deliver a ready-to-use, well-documented GTFS diff engine suitable for both programmatic and CLI-based workflows.


jcpitre commented Apr 22, 2026

Some testing:

  • Went to the extreme and tested it on the biggest feed we have:
    • mdb-2014-202603090029.zip
    • mdb-2014-202603110034.zip

Each is about 1.5GB zipped, with over 38 million entries in shapes.txt and 50 million entries in stop_times.txt.
I had to stop it while it was working on shapes.txt, at which point it was using 4GB of memory:

[gtfs-diff 16:31:57 4321MB]   [shapes.txt] scan done in 0.0s — added=50,118,266 deleted=49,537,043 modified=0

This indicates more than 50 million additions and about as many deletions.

Examining shapes.txt, we can see that it uses UUIDs for shape_id, e.g.:

0000abe0-5266-475b-808a-5cf929120a80,50.118257595,-5.540823891,229,

These IDs seem to be completely regenerated for each version, which means a meaningful diff on that kind of dataset is not possible.

  • We should survey the datasets we have to see whether regenerated IDs are common.
  • We should consider finding heuristics that tell us quickly whether a meaningful comparison between two datasets is possible.
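
One possible heuristic, sketched here as an assumption and not as something this PR implements: sample primary keys from the start of the base file, stream the new file once, and measure how many sampled keys reappear. A ratio near zero would suggest regenerated IDs. The function name `key_overlap_ratio` and the `sample_size` default are hypothetical.

```python
import csv
import io
from itertools import islice


def key_overlap_ratio(base_rows, new_rows, key_columns, sample_size=10_000):
    """Return the fraction of sampled base keys that also appear in the new file."""
    # Sample keys from the start of the base file only, so memory stays bounded.
    sample = {
        tuple(row[col] for col in key_columns)
        for row in islice(base_rows, sample_size)
    }
    if not sample:
        return 0.0
    found = set()
    # Stream the new file once and stop early when every sampled key is seen.
    for row in new_rows:
        key = tuple(row[col] for col in key_columns)
        if key in sample:
            found.add(key)
            if len(found) == len(sample):
                break
    return len(found) / len(sample)


# Tiny in-memory stand-ins for two versions of shapes.txt (made-up data).
base = io.StringIO("shape_id,shape_pt_sequence\na,1\nb,1\nc,1\nd,1\n")
new = io.StringIO("shape_id,shape_pt_sequence\na,1\nb,1\nx,1\ny,1\n")
ratio = key_overlap_ratio(
    csv.DictReader(base), csv.DictReader(new), ["shape_id", "shape_pt_sequence"]
)
```

A caller could skip the row-level diff for a file, or report it as "not comparable", when the ratio falls below some threshold; the right threshold would have to come from the dataset survey suggested above.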


jcpitre commented Apr 27, 2026

Tested with STM: mdb-2126-202511111837.zip vs mdb-2126-202511130041.zip.
I spot-checked some of the reported differences, and they seem OK, except:

Problems:

  • In shapes.txt and stops.txt, a significant number (thousands?) of lines were reported as modified only because coordinates had extra trailing zeros:
          {
            "identifier": {
              "shape_id": "11071",
              "shape_pt_sequence": "150001"
            },
            "raw_value": "11071,45.518332,-73.556250,150001",
            "base_line_number": 35,
            "new_line_number": 35,
            "field_changes": [
              {
                "field": "shape_pt_lon",
                "base_value": "-73.55625",
                "new_value": "-73.556250"
              }
            ]
          }
  • In feed_info.txt, the base file had an empty line at the end, which was reported as this JSON entry:
       "deleted": [
          {
            "identifier": {
              "feed_publisher_name": "",
              "feed_publisher_url": "",
              "feed_lang": "",
              "feed_start_date": "",
              "feed_end_date": "",
              "feed_version": ""
            },
            "raw_value": ",,,,,",
            "base_line_number": 3
          },
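
A minimal sketch of one way to avoid the spurious `deleted` entry, assuming rows are read with the standard `csv` module: skip any row whose cells are all empty before it reaches the diff. The helper name `read_rows` is hypothetical and is not taken from engine.py.

```python
import csv
import io


def read_rows(lines):
    """Yield (line_number, row_dict) for each non-blank data row of a CSV source."""
    reader = csv.reader(lines)
    headers = next(reader, [])
    for row in reader:
        # csv.reader yields [] for an empty line and ['', '', ...] for a
        # line made only of delimiters; both carry no data, so skip them.
        if not any(cell.strip() for cell in row):
            continue
        yield reader.line_num, dict(zip(headers, row))


# Made-up feed_info.txt content with a trailing blank line and a commas-only line.
sample = io.StringIO("feed_lang,feed_version\nen,1\n\n,\n")
rows = list(read_rows(sample))
```

Line numbers still refer to the physical file because they come from `reader.line_num`, so skipped lines do not shift the numbering of later rows.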


jcpitre commented Apr 28, 2026

Modified code to ignore extra zeros in coordinates.
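
The change itself is not shown in this conversation. As a rough sketch of the idea, and not the code in this PR, coordinate fields could be compared as decimal numbers while every other field stays a plain string comparison. The names `NUMERIC_FIELDS` and `field_differs` are hypothetical.

```python
from decimal import Decimal, InvalidOperation

# Assumption: only coordinate columns are compared numerically, so that
# identifiers such as "01" and "1" are still treated as different values.
NUMERIC_FIELDS = {"shape_pt_lat", "shape_pt_lon", "stop_lat", "stop_lon"}


def field_differs(field, a, b):
    """Return True when two raw CSV values differ for the given field."""
    if a == b:
        return False
    if field not in NUMERIC_FIELDS:
        return True
    try:
        da, db = Decimal(a), Decimal(b)
    except InvalidOperation:
        # Empty or malformed values fall back to the string comparison above.
        return True
    if not (da.is_finite() and db.is_finite()):
        return True
    # Decimal equality ignores trailing zeros: Decimal("1.50") == Decimal("1.5").
    return da != db


same = field_differs("shape_pt_lon", "-73.55625", "-73.556250")
```

`Decimal` is used instead of `float` so that values are compared exactly as written, without binary rounding.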

@jcpitre jcpitre marked this pull request as ready for review April 28, 2026 17:09
@jcpitre jcpitre changed the title from "wip: gtfs diff engine" to "gtfs diff engine" Apr 28, 2026

emmambd commented Apr 28, 2026

@jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120MB zipped.


jcpitre commented May 1, 2026

> @jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120MB zipped.

mdb-2014 is the UK aggregate feed. Size is about 1.5GB zipped. Here are the datasets I used:

@davidgamez davidgamez self-requested a review May 4, 2026 14:47
Comment thread scripts/compare_feeds.sh
# compare_feeds.sh — Download two GTFS feeds by URL and diff them.
#
# Usage:
# ./scripts/compare_feeds.sh <BASE_URL> <NEW_URL> [OPTIONS]
Member

[suggestion]: It would be nice if the shell script also supported local zip and folder options. This can be done outside this PR.

Comment thread src/gtfs_diff/engine.py
return dict(zip(headers, row))


def _values_differ(a: str, b: str) -> bool:
Member

[question]: Do we need to trim the parameters to make sure only the "actual" content is compared?
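
If trimming is wanted, one option is to normalize at read time so that every later comparison sees trimmed content. The sketch below is a hypothetical variant of the `dict(zip(headers, row))` line shown in the excerpt above; the name `normalize_row` is an assumption.

```python
def normalize_row(headers, row):
    """Build a row dict with surrounding whitespace removed from names and values."""
    return {header.strip(): value.strip() for header, value in zip(headers, row)}


row = normalize_row([" stop_id", "stop_name "], ["S1 ", " Main St"])
```

Trimming on read keeps `_values_differ` a plain comparison, at the cost of the diff no longer reporting whitespace-only edits.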

Comment thread src/gtfs_diff/engine.py
# Per-file diff
# ---------------------------------------------------------------------------

def _diff_file(
Member

[suggestion]: I find this function long. To improve readability, I suggest splitting it into multiple functions. We can naturally split it "per section", as it's already commented within the function.

Development

Successfully merging this pull request may close these issues.

Implementation: Generic GTFS Diff Engine

4 participants