feat: structured logging#37

Merged
ErlisLushtaku merged 6 commits into main from feat/structured-logging
Apr 24, 2026

Conversation

@kargibora
Collaborator

Summary

Replace ad-hoc print() calls across the codebase with Python's logging module under a unified judgearena logger namespace. This gives users control over verbosity and enables persistent debug logs without code changes.

I think this change is needed, and it's better to switch to the logging package sooner rather than later, since it's common practice (and development-friendly). My only concern is the log files: logs are hard to read directly from the console, so it's better to persist them to files (ideally with rotation) for debugging and for keeping results.

Changes

New: judgearena/log.py

  • get_logger(name) — returns a child logger under the judgearena namespace; auto-prefixes bare module names.
  • configure_logging(verbosity, log_file) — sets up console + optional file handlers. Supports JUDGEARENA_LOG_LEVEL env-var override.
  • attach_file_handler(path) — adds a DEBUG-level file handler (always captures full trace regardless of console verbosity).
  • make_run_log_path(folder) — generates a timestamped run-YYYYMMDD_HHMMSS.log path.
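A minimal sketch of how these four helpers could fit together. The function names and signatures come from the list above; the internals here are assumptions, not the merged implementation:

```python
from __future__ import annotations

import logging
import os
import sys
from datetime import datetime
from pathlib import Path

_ROOT_LOGGER_NAME = "judgearena"


def get_logger(name: str) -> logging.Logger:
    """Return a child logger under the judgearena namespace.

    Bare module names like "utils" are auto-prefixed to "judgearena.utils".
    """
    if not name.startswith(_ROOT_LOGGER_NAME):
        name = f"{_ROOT_LOGGER_NAME}.{name}"
    return logging.getLogger(name)


def configure_logging(verbosity: int = 0, log_file: str | None = None) -> None:
    """Set up a stderr console handler and an optional file handler.

    JUDGEARENA_LOG_LEVEL, if set, overrides the verbosity-derived level.
    """
    level = {-1: logging.WARNING, 0: logging.INFO}.get(verbosity, logging.DEBUG)
    env_level = os.environ.get("JUDGEARENA_LOG_LEVEL")
    if env_level:
        level = getattr(logging, env_level.upper(), level)
    root = logging.getLogger(_ROOT_LOGGER_NAME)
    root.setLevel(logging.DEBUG)  # logger passes everything; handlers filter
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(level)
    root.addHandler(console)
    if log_file:
        attach_file_handler(log_file)


def attach_file_handler(path: str | Path) -> logging.Handler:
    """Add a DEBUG-level file handler that captures the full trace."""
    handler = logging.FileHandler(path)
    handler.setLevel(logging.DEBUG)
    logging.getLogger(_ROOT_LOGGER_NAME).addHandler(handler)
    return handler


def make_run_log_path(folder: str | Path) -> Path:
    """Generate a timestamped run-YYYYMMDD_HHMMSS.log path inside folder."""
    return Path(folder) / f"run-{datetime.now():%Y%m%d_%H%M%S}.log"
```

The key design point is that the `judgearena` root logger stays at DEBUG and each handler filters independently, which is what lets the file handler capture a full trace regardless of console verbosity.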

New CLI flags (judgearena/cli_common.py)

| Flag | Effect |
| --- | --- |
| `-v` / `--verbose` | Set console to DEBUG |
| `-q` / `--quiet` | Suppress everything below WARNING |
| `--log-file PATH` | Explicit log file location |
| `--no-log-file` | Disable automatic `run-*.log` in the result folder |

Added resolve_verbosity(args) helper — -q takes precedence over -v.
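The flag wiring might look roughly like this (a sketch: `add_logging_args` is a hypothetical helper name, and the actual `cli_common.py` may differ):

```python
import argparse


def add_logging_args(parser: argparse.ArgumentParser) -> None:
    """Attach the shared logging flags to a CLI parser."""
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set console logging to DEBUG")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress everything below WARNING")
    parser.add_argument("--log-file", metavar="PATH",
                        help="explicit log file location")
    parser.add_argument("--no-log-file", action="store_true",
                        help="disable the automatic run-*.log in the result folder")


def resolve_verbosity(args: argparse.Namespace) -> int:
    """Map flags to a verbosity level: -1 quiet, 0 default, 1 verbose.

    -q takes precedence over -v, so passing both silences the console.
    """
    if args.quiet:
        return -1
    if args.verbose:
        return 1
    return 0
```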

Codebase migration

Replaced print() with logger.info() / logger.debug() in:

  • evaluate.py, generate_and_evaluate.py, estimate_elo_ratings.py
  • arenas_utils.py, eval_utils.py, utils.py
  • instruction_dataset/__init__.py, mt_bench/mt_bench_utils.py
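The migration pattern in each module is roughly the following (illustrative; a plain stdlib logger stands in for `get_logger` so the snippet is self-contained):

```python
import logging

# In the real code this would be: logger = get_logger(__name__)
logger = logging.getLogger("judgearena.utils")


def load_cache(path: str) -> None:
    # Before: print(f"Loading cache {path}")
    logger.info("Loading cache %s", path)      # visible at default verbosity
    logger.debug("Cache path repr: %r", path)  # only shown with -v / DEBUG
```

Using `%s`-style lazy formatting (rather than f-strings) means the message is only rendered when a handler actually accepts the record.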

Tests (tests/test_logging.py)

Behaviour

Default behaviour (-v 0) matches existing output — INFO messages print to stderr just like the old print() calls. No visible change for users who don't pass new flags.
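A test for that invariant might look like this (a sketch of the idea, not necessarily the merged `tests/test_logging.py`):

```python
import io
import logging


def test_default_verbosity_prints_info_not_debug():
    """At -v 0, INFO reaches the console stream and DEBUG is filtered."""
    stream = io.StringIO()                 # stands in for stderr
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.INFO)         # the default console level
    logger = logging.getLogger("judgearena._test")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    logger.info("visible message")
    logger.debug("hidden message")

    output = stream.getvalue()
    assert "visible message" in output
    assert "hidden message" not in output
```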

Example Log

2026-04-14 10:00:37 [INFO] judgearena.__main__: Using dataset alpaca-eval and evaluating models VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct.
2026-04-14 10:00:37 [INFO] judgearena.instruction_dataset: Loaded 805 instructions for alpaca-eval.
2026-04-14 10:00:37 [INFO] judgearena.__main__: Generating completions for dataset alpaca-eval with model VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct (or loading them directly if present)
2026-04-14 10:00:40 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-0.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-1.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First instruction/context: "I am trying to win over a new client for my writing services and skinny brown dog media to as as a ghost writer for their book Unbreakable Confidence. Can you help me write a persuasive proposal that highlights the benefits and value of having a editor/publisher"
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First completion of VLLM/Qwen/Qwen2.5-0.5B-Instruct:

Collaborator

I think the new logging flags won’t actually take effect for judgearena-elo yet, because the package entrypoint still points to estimate_elo_ratings:main while configure_logging() only runs in cli(). Could we make both paths go through the same logging setup?
Maybe we can point pyproject.toml at judgearena.estimate_elo_ratings:cli.
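The suggested fix would be a one-line entrypoint change along these lines (a sketch; the script name and the existing `[project.scripts]` table are assumptions):

```toml
[project.scripts]
# before: judgearena-elo = "judgearena.estimate_elo_ratings:main"
judgearena-elo = "judgearena.estimate_elo_ratings:cli"
```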

Comment thread judgearena/generate_and_evaluate.py
Comment thread judgearena/generate_and_evaluate.py Outdated
Comment on lines 182 to 206
Collaborator

Looks like the MT-Bench path resolves a timestamped res_folder and attaches a file handler before dispatch in generate_and_evaluate.py, but _save_mt_bench_results() later creates a different res_folder for the actual artifacts. Could we resolve the output directory once and reuse it for the whole MT-Bench run so logs and results stay together?
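One way to keep logs and artifacts together, as suggested here (a sketch; the function and folder names are assumptions about code not shown in this diff):

```python
from pathlib import Path


def resolve_res_folder(base_folder: str) -> Path:
    """Resolve the MT-Bench output directory exactly once.

    The caller attaches the run log handler to this folder and later
    passes the same folder to the result-saving step, instead of each
    step computing its own timestamped path.
    """
    res_folder = Path(base_folder) / "mt_bench_run"
    res_folder.mkdir(parents=True, exist_ok=True)
    return res_folder
```

Dispatch would then do something like `res_folder = resolve_res_folder(args.output)`, attach the file handler there, and hand the same `res_folder` to `_save_mt_bench_results()`.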

Comment thread judgearena/log.py

Returns the handler so the caller can remove it if needed.
"""
root = logging.getLogger(_ROOT_LOGGER_NAME)
Collaborator

attach_file_handler() always adds a new FileHandler with no deduping. In multi-step flows like MT-Bench, this can accumulate multiple file handlers on the same logger. Should we make this idempotent, or centralize file-handler setup in one place?

@kargibora
Collaborator Author

Hi @ErlisLushtaku, thanks for the review. I have fixed the remaining problems.

@ErlisLushtaku ErlisLushtaku merged commit c34b350 into main Apr 24, 2026
1 check passed
2 participants