Pin dataset revisions for reproducibility #39
alexrs-cohere wants to merge 4 commits into OpenEuroLLM:main from
Conversation
Adds a single source of truth (`judgearena/dataset_revisions.py`) for every HuggingFace dataset/space and raw URL the library downloads, and forwards `revision=` through every `snapshot_download` / `load_dataset` / `urlretrieve` call, so a run that succeeds today will see the exact same bytes after upstream rewrites.

The pinned table currently records the existing ComparIA pin and leaves the remaining repos as `None` until we choose a SHA, but every site is now plumbed so bumping a pin is a one-line change. The FastChat MT-Bench GPT-4 reference URL is rewritten through `RAW_URL_REVISIONS` so it can be pinned the same way.

The `judgearena.utils` import of `judgearena.instruction_dataset.arena_hard` moves to the bottom of the module to break the import cycle introduced by `m_arenahard.py` importing `data_root`.

Made-with: Cursor
geoalgo
left a comment
LGTM overall, I left some comments.
```python
"lmarena-ai/arena-human-preference-100k": None,
"lmarena-ai/arena-human-preference-140k": None,
"lmarena-ai/arena-human-preference-55k": None,
# ComparIA (already pinned via the legacy comparia_revision argument).
"ministere-culture/comparia-votes": "7a40bce496c1f2aa3be4001da85a49cb4743042b",
# m-ArenaHard (Cohere release)
"CohereLabs/m-ArenaHard": None,
# AlpacaEval instructions / model_outputs (geoalgo redistribution).
"geoalgo/llmjudge": None,
# MT-Bench questions (LMSYS Space).
"lmsys/mt-bench": None,
# Multilingual fluency contexts.
"geoalgo/multilingual-contexts-to-be-completed": None,
# Arena-Hard official source (used via datasets.load_dataset).
"lmarena-ai/arena-hard-auto": None,
```
Can we tie all of those to the current commit as well?
I agree it's probably better than having those updated silently.
What SHAs were used for ablations?
Fixed! Got the latest SHAs for each repo
> What SHAs were used for ablations?
We tagged ComparIA as this dataset was changing all the time (which is great). The others have not received any commits, as they were static dumps as far as I know.
```python
else:
    download_hf(name=dataset, local_path=local_path_tables)
```

```python
contexts_repo = "geoalgo/multilingual-contexts-to-be-completed"
```
```python
return pd.read_csv(cache_file)
```

```python
# Imported at the bottom so that ``data_root``, ``download_hf`` and ``read_df``
```
Can we rather move these into a util module? It is better to import at the top of files.
@alexrs-cohere Thanks for the PR, Alex. There is a bit of an issue: I have been working for some time on this big PR, and it would be better if you could rebase your PRs on it until we merge it, since otherwise I would have to redo most things because a lot has changed.
This PR should be rather independent from your changes @ErlisLushtaku, right? In any case, I do not think we should rebase on the other PR as it's 7K lines added; we would want to trim it down before.
- Pin every HuggingFace dataset/space to its current main SHA in dataset_revisions.py (and the FastChat raw URL), so unpinned repos no longer track upstream silently.
- Drop the redundant `or "<sha>"` fallback on `_DEFAULT_COMPARIA_REVISION` now that the SHA is centralized in `HF_DATASET_REVISIONS`; widen `comparia_revision` to `str | None`.
- Move `data_root`, `download_hf` and `read_df` out of `judgearena/utils.py` into a new leaf module `judgearena/paths.py` so that `judgearena.instruction_dataset` can resolve them without going through `judgearena.utils`. This removes the bottom-of-file lazy import for `arena_hard` in `utils.py`; both functions are still re-exported from `judgearena.utils` for backward compatibility (incl. tests that monkeypatch `judgearena.utils.download_arena_hard` / `download_hf`).

Made-with: Cursor
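A minimal single-file sketch of the leaf-module idea, with assumed contents: the module name `judgearena/paths.py` comes from the commit message above, but the body of `data_root` (returning `~/judgearena-data`, as mentioned in the test plan) is an assumption, not the library's actual implementation:

```python
# judgearena/paths.py (sketch): a leaf module that imports nothing from the
# rest of judgearena, so instruction_dataset modules can depend on it without
# creating an import cycle through judgearena.utils.
from pathlib import Path


def data_root() -> Path:
    # Assumed default location; the real library may compute this differently.
    return Path.home() / "judgearena-data"


# judgearena/utils.py would then re-export for backward compatibility:
# from judgearena.paths import data_root  # noqa: F401
```

The re-export keeps existing imports and test monkeypatches working while letting new code depend only on the leaf module.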
Lock files are part of the reproducibility story: pinning dataset revisions is only half useful if the Python dependency tree resolved against pyproject.toml can still drift between machines / CI runs. Committing uv.lock makes `uv sync` deterministic.

The lock file is ~1.1 MB, so it would trip the default 500 KB limit on `check-added-large-files`; exclude uv.lock from that hook (lockfile growth is expected) while keeping the safety net for everything else.

Made-with: Cursor
The feature is independent in intent, but it no longer applies cleanly on top of my branch. There would be some conflicts. However, that is fine if you want to merge this first, until I clean that branch up.
I don't think we should commit this given that this is a library published on PyPI; we have a pyproject.toml to specify our dependencies.
The Astral docs specifically recommend tracking the uv.lock file -- https://docs.astral.sh/uv/concepts/projects/layout/#the-lockfile

> This file should be checked into version control, allowing for consistent and reproducible installations across machines.
> A lockfile ensures that developers working on the project are using a consistent set of package versions. Additionally, it ensures when deploying the project as an application that the exact set of used package versions is known.
This file should not interfere with installations from PyPI. When we run `pip install judgearena`, pip ignores the lock file entirely: it resolves transitively from pyproject.toml's `dependencies = [...]` against the user's environment, so the lock file has zero effect on what downstream users get.
Adding the lockfile helps during development so we all use the same dependencies, and CI runs are consistent.
If you still disagree, I'm happy to remove it!
I am not sure after reading about this; for instance see astral-sh/uv#10730. Let's not do that in this PR as it is not its intent; we can revisit this discussion in an issue.
astral-sh/uv#10730 (comment), however, confirms that the uv.lock does not have an impact on libraries.
LGTM, can you just remove the
This reverts commit e76764e.
Summary

- Adds `judgearena/dataset_revisions.py` as the single source of truth for every HuggingFace dataset/space and raw URL the library downloads. `HF_DATASET_REVISIONS` maps repo IDs to commit SHAs and `RAW_URL_REVISIONS` does the same for raw GitHub URLs (currently only the FastChat MT-Bench reference answers).
- Forwards `revision=` through every `snapshot_download` / `load_dataset` / `urlretrieve` call:
  - `judgearena/arenas_utils.py`: LMArena 55k / 100k / 140k; the existing ComparIA pin is moved into the new module so the default revision is now derived from `hf_revision("ministere-culture/comparia-votes")`.
  - `judgearena/instruction_dataset/arena_hard.py`: `lmarena-ai/arena-hard-auto`.
  - `judgearena/instruction_dataset/m_arenahard.py`: `CohereLabs/m-ArenaHard`.
  - `judgearena/instruction_dataset/mt_bench.py`: the `lmsys/mt-bench` space and the FastChat raw URL via a new `_fastchat_reference_url()` helper.
  - `judgearena/utils.py`: `geoalgo/llmjudge` (AlpacaEval) and `geoalgo/multilingual-contexts-to-be-completed`.
- The `judgearena.utils` import of `judgearena.instruction_dataset.arena_hard` moves to the bottom of the module to break the import cycle introduced by `m_arenahard.py` now importing `data_root`.

The pinned table currently keeps the existing ComparIA pin and leaves the remaining repos as `None` until we choose SHAs, but every download site is already plumbed so bumping a pin is a one-line change in `dataset_revisions.py`. When a value is `None` the call passes `revision=None`, i.e. unchanged behaviour.

A follow-up PR will record the resolved revisions under `metadata["dataset_revisions"]` so each run captures which version of the data it actually saw.

Why

Today only ComparIA is pinned (via a magic string in `arenas_utils.py`). The other six download sites silently track upstream `main`, so a run that succeeds today can produce different bytes, and therefore different results, after any upstream rewrite, contradicting the paper's reproducibility claims. This PR is the smallest piece of plumbing that lets us pin once we choose SHAs.

Test plan

- `uv run pytest -q` passes (74 tests on this branch in isolation).
- Ran `download_all()` end-to-end on a clean `~/judgearena-data` to confirm `revision=None` keeps existing behaviour untouched.