feat: structured logging#37

Merged
ErlisLushtaku merged 6 commits into main from feat/structured-logging
Apr 24, 2026

Conversation

@kargibora
Collaborator

Summary

Replace ad-hoc print() calls across the codebase with Python's logging module under a unified judgearena logger namespace. This gives users control over verbosity and enables persistent debug logs without code changes.

I think this change is needed, and it's better to switch to the logging package sooner rather than later, since it's common practice (and development-friendly). My only concern is the log files: logs are hard to read directly from the console, so it's better to persist them to files (ideally with rotation) for debugging and for keeping results.

Changes

New: judgearena/log.py

  • get_logger(name) — returns a child logger under the judgearena namespace; auto-prefixes bare module names.
  • configure_logging(verbosity, log_file) — sets up console + optional file handlers. Supports JUDGEARENA_LOG_LEVEL env-var override.
  • attach_file_handler(path) — adds a DEBUG-level file handler (always captures full trace regardless of console verbosity).
  • make_run_log_path(folder) — generates a timestamped run-YYYYMMDD_HHMMSS.log path.
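A minimal sketch of how these four helpers could fit together. The function names and signatures come from the list above; the internals here are assumptions, not the merged implementation:

```python
from __future__ import annotations

import logging
import os
import sys
from datetime import datetime
from pathlib import Path

_ROOT_LOGGER_NAME = "judgearena"


def get_logger(name: str) -> logging.Logger:
    """Return a child logger under the judgearena namespace.

    Bare module names like "utils" are auto-prefixed to "judgearena.utils".
    """
    if not name.startswith(_ROOT_LOGGER_NAME):
        name = f"{_ROOT_LOGGER_NAME}.{name}"
    return logging.getLogger(name)


def configure_logging(verbosity: int = 0, log_file: str | None = None) -> None:
    """Set up a stderr console handler and an optional file handler.

    JUDGEARENA_LOG_LEVEL, if set, overrides the verbosity-derived level.
    """
    level = {-1: logging.WARNING, 0: logging.INFO}.get(verbosity, logging.DEBUG)
    env_level = os.environ.get("JUDGEARENA_LOG_LEVEL")
    if env_level:
        level = getattr(logging, env_level.upper(), level)
    root = logging.getLogger(_ROOT_LOGGER_NAME)
    root.setLevel(logging.DEBUG)  # logger passes everything; handlers filter
    console = logging.StreamHandler(sys.stderr)
    console.setLevel(level)
    root.addHandler(console)
    if log_file:
        attach_file_handler(log_file)


def attach_file_handler(path: str | Path) -> logging.Handler:
    """Add a DEBUG-level file handler that captures the full trace."""
    handler = logging.FileHandler(path)
    handler.setLevel(logging.DEBUG)
    logging.getLogger(_ROOT_LOGGER_NAME).addHandler(handler)
    return handler


def make_run_log_path(folder: str | Path) -> Path:
    """Generate a timestamped run-YYYYMMDD_HHMMSS.log path inside folder."""
    return Path(folder) / f"run-{datetime.now():%Y%m%d_%H%M%S}.log"
```

The key design point is that the `judgearena` root logger stays at DEBUG and each handler filters independently, which is what lets the file handler capture a full trace regardless of console verbosity.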

New CLI flags (judgearena/cli_common.py)

| Flag | Effect |
| --- | --- |
| `-v` / `--verbose` | Set console to DEBUG |
| `-q` / `--quiet` | Suppress everything below WARNING |
| `--log-file PATH` | Explicit log file location |
| `--no-log-file` | Disable automatic `run-*.log` in the result folder |

Added resolve_verbosity(args) helper — -q takes precedence over -v.
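The flag wiring might look roughly like this (a sketch: `add_logging_args` is a hypothetical helper name, and the actual `cli_common.py` may differ):

```python
import argparse


def add_logging_args(parser: argparse.ArgumentParser) -> None:
    """Attach the shared logging flags to a CLI parser."""
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="set console logging to DEBUG")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="suppress everything below WARNING")
    parser.add_argument("--log-file", metavar="PATH",
                        help="explicit log file location")
    parser.add_argument("--no-log-file", action="store_true",
                        help="disable the automatic run-*.log in the result folder")


def resolve_verbosity(args: argparse.Namespace) -> int:
    """Map flags to a verbosity level: -1 quiet, 0 default, 1 verbose.

    -q takes precedence over -v, so passing both silences the console.
    """
    if args.quiet:
        return -1
    if args.verbose:
        return 1
    return 0
```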

Codebase migration

Replaced print() with logger.info() / logger.debug() in:

  • evaluate.py, generate_and_evaluate.py, estimate_elo_ratings.py
  • arenas_utils.py, eval_utils.py, utils.py
  • instruction_dataset/__init__.py, mt_bench/mt_bench_utils.py
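The migration pattern in each module is roughly the following (illustrative; a plain stdlib logger stands in for `get_logger` so the snippet is self-contained):

```python
import logging

# In the real code this would be: logger = get_logger(__name__)
logger = logging.getLogger("judgearena.utils")


def load_cache(path: str) -> None:
    # Before: print(f"Loading cache {path}")
    logger.info("Loading cache %s", path)      # visible at default verbosity
    logger.debug("Cache path repr: %r", path)  # only shown with -v / DEBUG
```

Using `%s`-style lazy formatting (rather than f-strings) means the message is only rendered when a handler actually accepts the record.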

Tests (tests/test_logging.py)

Behaviour

Default behaviour (-v 0) matches existing output — INFO messages print to stderr just like the old print() calls. No visible change for users who don't pass new flags.
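A test for that invariant might look like this (a sketch of the idea, not necessarily the merged `tests/test_logging.py`):

```python
import io
import logging


def test_default_verbosity_prints_info_not_debug():
    """At -v 0, INFO reaches the console stream and DEBUG is filtered."""
    stream = io.StringIO()                 # stands in for stderr
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.INFO)         # the default console level
    logger = logging.getLogger("judgearena._test")
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    logger.info("visible message")
    logger.debug("hidden message")

    output = stream.getvalue()
    assert "visible message" in output
    assert "hidden message" not in output
```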

Example Log

2026-04-14 10:00:37 [INFO] judgearena.__main__: Using dataset alpaca-eval and evaluating models VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct.
2026-04-14 10:00:37 [INFO] judgearena.instruction_dataset: Loaded 805 instructions for alpaca-eval.
2026-04-14 10:00:37 [INFO] judgearena.__main__: Generating completions for dataset alpaca-eval with model VLLM/Qwen/Qwen2.5-0.5B-Instruct and VLLM/Qwen/Qwen2.5-1.5B-Instruct (or loading them directly if present)
2026-04-14 10:00:40 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-0.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [INFO] judgearena.utils: Loading cache /leonardo_work/OELLM_prod2026/users/bkargi00/openjury-eval-data/cache/alpaca-eval_VLLM/Qwen/Qwen2.5-1.5B-Instruct_25.csv.zip
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First instruction/context: "I am trying to win over a new client for my writing services and skinny brown dog media to as as a ghost writer for their book Unbreakable Confidence. Can you help me write a persuasive proposal that highlights the benefits and value of having a editor/publisher"
2026-04-14 10:00:43 [DEBUG] judgearena.__main__: First completion of VLLM/Qwen/Qwen2.5-0.5B-Instruct:

Collaborator

I think the new logging flags won’t actually take effect for judgearena-elo yet, because the package entrypoint still points to estimate_elo_ratings:main while configure_logging() only runs in cli(). Could we make both paths go through the same logging setup?
Maybe we can point pyproject.toml at judgearena.estimate_elo_ratings:cli.
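The suggested fix would be a one-line entrypoint change along these lines (a sketch; the script name and the existing `[project.scripts]` table are assumptions):

```toml
[project.scripts]
# before: judgearena-elo = "judgearena.estimate_elo_ratings:main"
judgearena-elo = "judgearena.estimate_elo_ratings:cli"
```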

Comment thread judgearena/generate_and_evaluate.py
Comment thread judgearena/generate_and_evaluate.py Outdated
Comment on lines 182 to 206
Collaborator

Looks like the MT-Bench path resolves a timestamped res_folder and attaches a file handler before dispatch in generate_and_evaluate.py, but _save_mt_bench_results() later creates a different res_folder for the actual artifacts. Could we resolve the output directory once and reuse it for the whole MT-Bench run so logs and results stay together?
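One way to keep logs and artifacts together, as suggested here (a sketch; the function and folder names are assumptions about code not shown in this diff):

```python
from pathlib import Path


def resolve_res_folder(base_folder: str) -> Path:
    """Resolve the MT-Bench output directory exactly once.

    The caller attaches the run log handler to this folder and later
    passes the same folder to the result-saving step, instead of each
    step computing its own timestamped path.
    """
    res_folder = Path(base_folder) / "mt_bench_run"
    res_folder.mkdir(parents=True, exist_ok=True)
    return res_folder
```

Dispatch would then do something like `res_folder = resolve_res_folder(args.output)`, attach the file handler there, and hand the same `res_folder` to `_save_mt_bench_results()`.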

Comment thread judgearena/log.py

Returns the handler so the caller can remove it if needed.
"""
root = logging.getLogger(_ROOT_LOGGER_NAME)
Collaborator

attach_file_handler() always adds a new FileHandler with no deduping. In multi-step flows like MT-Bench, this can accumulate multiple file handlers on the same logger. Should we make this idempotent, or centralize file-handler setup in one place?

@kargibora
Collaborator Author

Hi @ErlisLushtaku, thanks for the review. I have fixed the remaining problems.

@ErlisLushtaku ErlisLushtaku merged commit c34b350 into main Apr 24, 2026
1 check passed
2 participants