Some testing:
Each is about 1.5 GB zipped, with over 38 million entries in shapes.txt and over 50 million entries in stop_times.txt, which indicates there were 50+ million additions and about as many deletions. Examining shapes.txt, we can see that they use a UUID for the shape_id, e.g.: That seems to be completely regenerated for each version. That means that trying to do a diff on that kind of dataset is not possible.
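To illustrate the problem: when the primary key is a freshly generated UUID in every export, a key-based diff classifies every row as one deletion plus one addition, even if the underlying data is identical. A minimal sketch (`keyed_diff` and the row data are hypothetical stand-ins, not the actual engine code):

```python
import uuid

def keyed_diff(old: dict, new: dict):
    """Diff two {primary_key: row} mappings the way a key-based differ would."""
    added = {k: v for k, v in new.items() if k not in old}
    deleted = {k: v for k, v in old.items() if k not in new}
    return added, deleted

# The same shape point, but shape_id is a fresh UUID in each export.
point = {"shape_pt_lat": "45.5", "shape_pt_lon": "-73.6", "shape_pt_sequence": "1"}
old = {str(uuid.uuid4()): point}
new = {str(uuid.uuid4()): point}

added, deleted = keyed_diff(old, new)
# Nothing actually changed, yet the diff reports one addition and one deletion.
assert len(added) == 1 and len(deleted) == 1
```

This is why regenerated keys inflate the diff to the full size of the dataset.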
Tested with STM: mdb-2126-202511111837.zip vs mdb-2126-202511130041.zip. Problems:
Modified code to ignore extra zeros in coordinates.
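The actual change isn't shown in this thread; one way to ignore extra trailing zeros is to compare coordinate values numerically rather than as raw strings, so that "45.50" and "45.5" are treated as equal. A hedged sketch (`coords_equal` is a hypothetical helper, not the merged code):

```python
from decimal import Decimal, InvalidOperation

def coords_equal(a: str, b: str) -> bool:
    """Compare coordinate strings by decimal value, so '45.50' == '45.5'."""
    try:
        return Decimal(a) == Decimal(b)
    except InvalidOperation:
        # Fall back to exact string comparison for non-numeric values.
        return a == b

assert coords_equal("45.50", "45.5")
assert not coords_equal("45.50", "45.51")
```

Using `Decimal` avoids binary floating-point surprises while still comparing values, not representations.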
@jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120 MB zipped.
mdb-2014 is the UK aggregate feed. Size is about 1.5 GB zipped. Here are the datasets I used:
```bash
# compare_feeds.sh — Download two GTFS feeds by URL and diff them.
#
# Usage:
#   ./scripts/compare_feeds.sh <BASE_URL> <NEW_URL> [OPTIONS]
```
[suggestion]: It would be nice if the shell script also supported local zip files and folders as inputs. This can be done outside this PR.
```python
return dict(zip(headers, row))
```
```python
def _values_differ(a: str, b: str) -> bool:
```
[question]: Do we need to trim the parameters to make sure only the "actual" content is compared?
```python
# ---------------------------------------------------------------------------
# Per-file diff
# ---------------------------------------------------------------------------


def _diff_file(
```
[suggestion]: I find this function long. To improve readability, I suggest splitting it into multiple functions. We can naturally split it "per section", as it's already commented within the function.
Closes #2
This pull request introduces the initial release of the GTFS Diff Engine, a memory-efficient Python library and CLI for comparing two GTFS feeds and producing a structured diff conforming to the GTFS Diff v2 schema. The changes include a robust implementation of the core diff logic, a clear public API, a command-line interface, detailed documentation, and supporting scripts for end-to-end usage.
The most important changes are:
Core Functionality and API:

- `engine.py`, exposing a single `diff_feeds()` function that returns a typed Pydantic model representing the diff result. [1] [2]
- `gtfs_definitions.py`, with a helper for primary key lookup.

Command-Line Interface and Tooling:

- A CLI (`gtfs-diff`) in `cli.py`, supporting options for output file, row change cap, pretty-printing, and feed download timestamps.
- `compare_feeds.sh` to automate downloading two GTFS feeds by URL and running the diff tool, with argument parsing and error handling.

Documentation and Examples:

- `README.md` with a comprehensive overview, installation instructions, usage examples, API reference, supported files table, output schema example, and implementation notes on memory efficiency.
- `docs/architecture.md` detailing design goals, module structure, the streaming diff algorithm, edge case handling, and future improvements.

Packaging and Project Setup:

- `pyproject.toml` for installation, development, and test dependencies, and the CLI entry point.
- `__init__.py` and `__main__.py`. [1] [2]

These changes collectively deliver a ready-to-use, well-documented GTFS diff engine suitable for both programmatic and CLI-based workflows.