Pin dataset revisions for reproducibility #39
alexrs-cohere wants to merge 4 commits into OpenEuroLLM:main from
Conversation
Adds a single source of truth (`judgearena/dataset_revisions.py`) for every HuggingFace dataset/space and raw URL the library downloads, and forwards `revision=` through every `snapshot_download` / `load_dataset` / `urlretrieve` call, so a run that succeeds today will see the exact same bytes after upstream rewrites.

The pinned table currently records the existing ComparIA pin and leaves the remaining repos as `None` until we choose a SHA, but every site is now plumbed so bumping a pin is a one-line change. The FastChat MT-Bench GPT-4 reference URL is rewritten through `RAW_URL_REVISIONS` so it can be pinned the same way.

The `judgearena.utils` import of `judgearena.instruction_dataset.arena_hard` moves to the bottom of the module to break the import cycle introduced by `m_arenahard.py` importing `data_root`.

Made-with: Cursor
geoalgo
left a comment
LGTM overall, I left some comments.
```python
"lmarena-ai/arena-human-preference-100k": None,
"lmarena-ai/arena-human-preference-140k": None,
"lmarena-ai/arena-human-preference-55k": None,
# ComparIA (already pinned via the legacy comparia_revision argument).
"ministere-culture/comparia-votes": "7a40bce496c1f2aa3be4001da85a49cb4743042b",
# m-ArenaHard (Cohere release)
"CohereLabs/m-ArenaHard": None,
# AlpacaEval instructions / model_outputs (geoalgo redistribution).
"geoalgo/llmjudge": None,
# MT-Bench questions (LMSYS Space).
"lmsys/mt-bench": None,
# Multilingual fluency contexts.
"geoalgo/multilingual-contexts-to-be-completed": None,
# Arena-Hard official source (used via datasets.load_dataset).
"lmarena-ai/arena-hard-auto": None,
```
Can we tie all of those to the current commit as well?
I agree it's probably better than having those updated silently.
What SHAs were used for ablations?
Fixed! Got the latest SHAs for each repo
> What SHAs were used for ablations?
We tagged ComparIA as this dataset was changing all the time (which is great). The others have not received any commits, as they were static dumps as far as I know.
```python
else:
    download_hf(name=dataset, local_path=local_path_tables)
```

```python
contexts_repo = "geoalgo/multilingual-contexts-to-be-completed"
```
```python
return pd.read_csv(cache_file)
```

```python
# Imported at the bottom so that ``data_root``, ``download_hf`` and ``read_df``
```
Can we rather move these into a util module? It is better to import at the top of files.
@alexrs-cohere Thanks for the PR, Alex. There is a bit of an issue: I have been working for some time on this big PR, and it would be better if you could rebase your PRs on it until we merge it, since otherwise I would have to redo most things because a lot has changed.
This PR should be rather independent from your changes @ErlisLushtaku, right? In any case, I do not think we should rebase on the other PR as it's 7K lines added; we would want to trim it down before.
- Pin every HuggingFace dataset/space to its current main SHA in dataset_revisions.py (and the FastChat raw URL), so unpinned repos no longer track upstream silently.
- Drop the redundant `or "<sha>"` fallback on `_DEFAULT_COMPARIA_REVISION` now that the SHA is centralized in `HF_DATASET_REVISIONS`; widen `comparia_revision` to `str | None`.
- Move `data_root`, `download_hf` and `read_df` out of `judgearena/utils.py` into a new leaf module `judgearena/paths.py` so that `judgearena.instruction_dataset` can resolve them without going through `judgearena.utils`. This removes the bottom-of-file lazy import for `arena_hard` in `utils.py`; both functions are still re-exported from `judgearena.utils` for backward compatibility (incl. tests that monkeypatch `judgearena.utils.download_arena_hard` / `download_hf`).

Made-with: Cursor
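A minimal single-file sketch of the leaf-module idea, with assumed contents: the module name `judgearena/paths.py` comes from the commit message above, but the body of `data_root` (returning `~/judgearena-data`, as mentioned in the test plan) is an assumption, not the library's actual implementation:

```python
# judgearena/paths.py (sketch): a leaf module that imports nothing from the
# rest of judgearena, so instruction_dataset modules can depend on it without
# creating an import cycle through judgearena.utils.
from pathlib import Path


def data_root() -> Path:
    # Assumed default location; the real library may compute this differently.
    return Path.home() / "judgearena-data"


# judgearena/utils.py would then re-export for backward compatibility:
# from judgearena.paths import data_root  # noqa: F401
```

The re-export keeps existing imports and test monkeypatches working while letting new code depend only on the leaf module.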
Lock files are part of the reproducibility story: pinning dataset revisions is only half useful if the Python dependency tree resolved against pyproject.toml can still drift between machines / CI runs. Committing uv.lock makes `uv sync` deterministic.

The lock file is ~1.1 MB, so it would trip the default 500 KB limit on `check-added-large-files`; exclude uv.lock from that hook (lockfile growth is expected) while keeping the safety net for everything else.

Made-with: Cursor
The feature is independent in intent, but it no longer applies cleanly on top of my branch. There would be some conflicts. However, that is fine if you want to merge this first, until I clean that branch up.
I don't think we should commit this given that this is a library published on PyPI; we have a pyproject.toml to specify our dependencies.
The Astral docs specifically recommend tracking the uv.lock file -- https://docs.astral.sh/uv/concepts/projects/layout/#the-lockfile

> This file should be checked into version control, allowing for consistent and reproducible installations across machines.
> A lockfile ensures that developers working on the project are using a consistent set of package versions. Additionally, it ensures when deploying the project as an application that the exact set of used package versions is known.
This file should not interfere with installations from PyPI. When we run `pip install judgearena`, pip ignores the lock file entirely: it resolves transitively from pyproject.toml's `dependencies = [...]` against the user's environment, so the lock file has zero effect on what downstream users get.
Adding the lockfile helps during development so we all use the same dependencies, and CI runs are consistent.
If you still disagree, I'm happy to remove it!
I am not sure after reading about this; for instance see astral-sh/uv#10730. Let's not do that in this PR as it is not its intent; we can revisit this discussion in an issue.
astral-sh/uv#10730 (comment), however, confirms that the uv.lock does not have an impact on libraries.
LGTM, can you just remove the
This reverts commit e76764e.
Summary

- Adds `judgearena/dataset_revisions.py` as the single source of truth for every HuggingFace dataset/space and raw URL the library downloads. `HF_DATASET_REVISIONS` maps repo IDs to commit SHAs and `RAW_URL_REVISIONS` does the same for raw GitHub URLs (currently only the FastChat MT-Bench reference answers).
- Forwards `revision=` through every `snapshot_download` / `load_dataset` / `urlretrieve` call:
  - `judgearena/arenas_utils.py`: LMArena 55k / 100k / 140k; the existing ComparIA pin is moved into the new module so the default revision is now derived from `hf_revision("ministere-culture/comparia-votes")`.
  - `judgearena/instruction_dataset/arena_hard.py`: `lmarena-ai/arena-hard-auto`.
  - `judgearena/instruction_dataset/m_arenahard.py`: `CohereLabs/m-ArenaHard`.
  - `judgearena/instruction_dataset/mt_bench.py`: the `lmsys/mt-bench` space and the FastChat raw URL via a new `_fastchat_reference_url()` helper.
  - `judgearena/utils.py`: `geoalgo/llmjudge` (AlpacaEval) and `geoalgo/multilingual-contexts-to-be-completed`.
- The `judgearena.utils` import of `judgearena.instruction_dataset.arena_hard` moves to the bottom of the module to break the import cycle introduced by `m_arenahard.py` now importing `data_root`.

The pinned table currently keeps the existing ComparIA pin and leaves the remaining repos as `None` until we choose SHAs, but every download site is already plumbed so bumping a pin is a one-line change in `dataset_revisions.py`. When a value is `None` the call passes `revision=None`, i.e. unchanged behaviour.

A follow-up PR will record the resolved revisions under `metadata["dataset_revisions"]` so each run captures which version of the data it actually saw.

Why

Today only ComparIA is pinned (via a magic string in `arenas_utils.py`). The other six download sites silently track upstream `main`, so a run that succeeds today can produce different bytes, and therefore different results, after any upstream rewrite, contradicting the paper's reproducibility claims. This PR is the smallest piece of plumbing that lets us pin once we choose SHAs.

Test plan

- `uv run pytest -q` passes (74 tests on this branch in isolation).
- Ran `download_all()` end-to-end on a clean `~/judgearena-data` to confirm `revision=None` keeps existing behaviour untouched.