## [0.4.0] - 2026-03-28
### Fixed

#### Core
- Fixed `MessageHistory.to_list()` returning a reference to the internal list instead of a copy, which caused simulator logs to contain future conversation messages that had not yet occurred at the time of logging. (PR: #48)
- Fixed `get_git_info()` crashing on a detached HEAD (e.g. in CI checkouts); it now returns `detached@<short-hash>` as the branch name. (PR: #41)
#### Interface
- Agent adapter `gather_config()` in smolagents, langgraph, and llamaindex no longer silently swallows exceptions, ensuring config collection errors are visible instead of producing incomplete configuration data. (PR: #53)
### Added

#### Core
- Usage and cost tracking via `Usage` and `TokenUsage` data classes. `ModelAdapter` tracks token usage automatically after each `chat()` call. Components that implement `UsageTrackableMixin` are collected via `gather_usage()`. Live totals are available during benchmark runs via `benchmark.usage` (grand total) and `benchmark.usage_by_component` (per-component breakdowns). Post-hoc analysis via `UsageReporter.from_reports(benchmark.reports)` with breakdowns by task, component, or model. (PR: #45)
- Pluggable cost calculation via the `CostCalculator` protocol. `StaticPricingCalculator` computes cost from user-supplied per-token rates. `LiteLLMCostCalculator` in `maseval.interface.usage` provides automatic pricing via LiteLLM's model database (supports `custom_pricing` overrides and `model_id_map`; requires `litellm`). Pass a `cost_calculator` to `ModelAdapter` or `AgentAdapter` to compute `Usage.cost`. Provider-reported cost always takes precedence. (PR: #45)
- `AgentAdapter` now accepts `cost_calculator` and `model_id` parameters. For smolagents, CAMEL, and LlamaIndex, both are auto-detected from the framework's agent object (`LiteLLMCostCalculator` if litellm is installed). LangGraph requires an explicit `model_id` since graphs can contain multiple models. Explicit parameters always override auto-detection. (PR: #45)
- `Task.freeze()` and `Task.unfreeze()` methods to make task data read-only during benchmark runs, preventing accidental mutation of `environment_data`, `user_data`, `evaluation_data`, and `metadata` (including nested dicts). Attribute reassignment is also blocked while frozen. Check state with `Task.is_frozen`. (PR: #42)
- `TaskFrozenError` exception in `maseval.core.exceptions`, raised when attempting to modify a frozen task. (PR: #42)
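The static pricing path can be sketched as follows. This is a minimal illustration, not maseval's actual code: the `TokenUsage` fields and the `static_cost` helper shown here are assumptions; only the class names and the idea of user-supplied per-token rates come from the entry above.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    # Minimal stand-in for maseval's TokenUsage; field names are assumptions.
    input_tokens: int = 0
    output_tokens: int = 0


def static_cost(usage: TokenUsage, input_rate: float, output_rate: float) -> float:
    """Compute cost from per-token rates, roughly as a StaticPricingCalculator might."""
    return usage.input_tokens * input_rate + usage.output_tokens * output_rate


# e.g. $3 / $15 per million tokens
cost = static_cost(TokenUsage(1200, 300), 3e-6, 15e-6)
```

A real calculator would implement the `CostCalculator` protocol so providers with their own reported costs can take precedence, as noted above.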
- Added `InformativeSubsetQueue` and `DISCOQueue` to `maseval.core.task` for subset-based evaluation (e.g., anchor-point selection for DISCO). `DISCOQueue` accepts `anchor_points_path` to load indices from a `.json`/`.pkl` file via `DISCOQueue.load_anchor_points()`. Available via `from maseval import DISCOQueue, InformativeSubsetQueue`. (PR: #34 and #41)
- Added `ModelScorer` abstract base class in `maseval.core.scorer` for log-likelihood scoring, with `loglikelihood()`, `loglikelihood_batch()`, and `loglikelihood_choices()` methods. (PR: #34 and #41)
- Added `SeedGenerator` abstract base class and `DefaultSeedGenerator` implementation for reproducible benchmark runs via SHA-256-based seed derivation (PR: #24)
- Added `seed` and `seed_generator` parameters to `Benchmark.__init__` for enabling reproducibility (PR: #24)
- Added `seed_generator` parameter to all benchmark setup methods (`setup_environment`, `setup_user`, `setup_agents`, `setup_evaluators`) (PR: #24)
- Added `seed` parameter to `ModelAdapter.__init__` for deterministic model inference (PR: #24)
- Added `SeedingError` exception for providers that don't support seeding (Anthropic models raise this if a seed is provided) (PR: #24)
- Added `UserExhaustedError` exception in `maseval.core.exceptions` for flow control when a user's turns are exhausted (PR: #39)
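SHA-256-based seed derivation can be sketched like this. The exact derivation scheme below (hashing the run seed together with a component path) is an assumption for illustration; only "SHA-256-based" and the `None`-when-disabled behavior come from the changelog.

```python
import hashlib
from typing import Optional


def derive_seed(base_seed: Optional[int], path: str) -> Optional[int]:
    """Derive a stable per-component seed from a run seed (illustrative sketch).

    `path` might be something like "simulators/user" or "agents/default_agent".
    """
    if base_seed is None:
        # Seeding disabled: return None instead of raising, so callers can
        # use the same code path whether seeding is on or off.
        return None
    digest = hashlib.sha256(f"{base_seed}:{path}".encode()).digest()
    return int.from_bytes(digest[:8], "big")
```

The same base seed and path always yield the same derived seed, while different paths yield independent seeds, which is what makes per-component reproducibility possible.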
#### Interface
- Added seed support to interface adapters: `OpenAIModelAdapter`, `GoogleGenAIModelAdapter`, `LiteLLMModelAdapter`, and `HuggingFacePipelineModelAdapter` pass seeds to the underlying APIs (PR: #24)
- Added `HuggingFaceModelScorer` in `maseval.interface.inference`, a log-likelihood scorer backed by a HuggingFace `AutoModelForCausalLM`, with a single-token optimisation for MCQ evaluation. Implements the `ModelScorer` interface. (PR: #34 and #41)
- CAMEL-AI integration: `CamelAgentAdapter` and `CamelLLMUser` for evaluating CAMEL-AI ChatAgent-based systems (PR: #22)
- Added `CamelAgentUser` for using a CAMEL ChatAgent as the user in agent-to-agent evaluation (PR: #22)
- Added `camel_role_playing_execution_loop()` for benchmarks using CAMEL's RolePlaying semantics (PR: #22)
- Added `CamelRolePlayingTracer` and `CamelWorkforceTracer` for capturing orchestration-level traces from CAMEL's multi-agent systems (PR: #22)
#### Benchmarks
- MMLU Benchmark with DISCO support: integration for evaluating language models on MMLU (Massive Multitask Language Understanding) multiple-choice questions, compatible with the DISCO anchor-point methodology. `MMLUBenchmark` is a framework-agnostic base class (`setup_agents()` and `get_model_adapter()` must be implemented by subclasses); `DefaultMMLUBenchmark` provides a ready-made HuggingFace implementation. Also includes `MMLUEnvironment`, `MMLUEvaluator`, `load_tasks()`, and `compute_benchmark_metrics()`. Install with `pip install maseval[mmlu]`. Optional extras: `lm-eval` (for `DefaultMMLUBenchmark.precompute_all_logprobs_lmeval`) and `disco` (for DISCO prediction in the example). (PR: #34 and #41)
- CONVERSE benchmark for contextual safety evaluation in adversarial agent-to-agent conversations, including `ConverseBenchmark`, `DefaultAgentConverseBenchmark`, `ConverseEnvironment`, `ConverseExternalAgent`, `PrivacyEvaluator`, `SecurityEvaluator`, and `load_tasks()` utilities for the `travel`, `real_estate`, and `insurance` domains. Benchmark source files are now downloaded on first use via `ensure_data_exists()` instead of being bundled in the package. (PR: #28)
- GAIA2 Benchmark: integration with Meta's ARE (Agent Research Environments) platform for evaluating LLM-based agents on dynamic, multi-step scenarios (PR: #26)
  - `Gaia2Benchmark`, `Gaia2Environment`, and `Gaia2Evaluator` components for framework-agnostic evaluation with ARE simulation (PR: #26)
  - `DefaultAgentGaia2Benchmark` with a ReAct-style agent for direct comparison with the ARE reference implementation (PR: #26)
  - Generic tool wrapper (`Gaia2GenericTool`) for MASEval tracing of ARE tools with simulation time tracking (PR: #26, #30)
  - Data loading utilities: `load_tasks()` and `configure_model_ids()` for loading scenarios from HuggingFace (PR: #26)
  - `Gaia2JudgeEngineConfig` for configuring the judge's LLM model and provider (PR: #30)
  - Metrics: `compute_gaia2_metrics()` for GSR (Goal Success Rate) computation by capability type (PR: #26)
  - Support for 5 capability dimensions: execution, search, adaptability, time, and ambiguity (PR: #26, #30)
  - Added `gaia2` optional dependency: `pip install maseval[gaia2]` (PR: #26)
- MultiAgentBench Benchmark: integration with MARBLE MultiAgentBench for evaluating multi-agent collaboration across all 6 paper-defined domains: research, bargaining, coding, database, werewolf, and minecraft (PR: #25, #30)
  - `MultiAgentBenchBenchmark` abstract base class for framework-agnostic multi-agent evaluation with seeding support for evaluators and agents (PR: #25)
  - `MarbleMultiAgentBenchBenchmark` for exact MARBLE reproduction mode using native MARBLE agents (note: MARBLE's internal LLM calls bypass MASEval seeding) (PR: #25)
  - `MultiAgentBenchEnvironment` and `MultiAgentBenchEvaluator` components (PR: #25)
  - Data loading utilities: `load_tasks()`, `configure_model_ids()`, `get_domain_info()`, and `ensure_marble_exists()` (PR: #25)
  - MARBLE adapter: `MarbleAgentAdapter` for wrapping MARBLE agents with MASEval tracing (PR: #25)
#### Examples
- Added usage tracking to the 5-A-Day benchmark: `five_a_day_benchmark.ipynb` (section 2.7) and `five_a_day_benchmark.py` (post-run usage summary with per-component and per-task breakdowns). (PR: #45)
- MMLU benchmark example at `examples/mmlu_benchmark/` for evaluating HuggingFace models on MMLU with optional DISCO prediction (`--disco_model_path`, `--disco_transform_path`). Supports local data, HuggingFace dataset repos, and DISCO weights from `.pkl`/`.npz` files or HF repos. (PR: #34 and #41)
- Added a dedicated runnable CONVERSE default benchmark example at `examples/converse_benchmark/default_converse_benchmark.py` for a quick start with `DefaultAgentConverseBenchmark`. (PR: #28)
- Gaia2 benchmark example with Google GenAI and OpenAI model support (PR: #26)
#### Documentation
- Usage & Cost Tracking guide (`docs/guides/usage-tracking.md`) and API reference (`docs/reference/usage.md`). (PR: #45)
#### Testing
- Composable pytest markers (`live`, `credentialed`, `slow`, `smoke`) for fine-grained test selection; default runs exclude slow, credentialed, and smoke tests (PR: #29)
- Marker implication hook: `credentialed` implies `live`, so `-m "not live"` always gives a fully offline run (PR: #29)
- Skip decorators (`requires_openai`, `requires_anthropic`, `requires_google`) for tests needing API keys (PR: #29)
- Data integrity tests for the Tau2, MACS, GAIA2, and MultiAgentBench benchmarks validating download pipelines, file structures, and data content (PR: #29, #30)
- Real-data integration tests for GAIA2 and MultiAgentBench (PR: #30)
- HTTP-level API contract tests for model adapters (OpenAI, Anthropic, Google GenAI, LiteLLM) using `respx` mocks; no API keys needed (PR: #29)
- Live API round-trip tests for all model adapters (`-m credentialed`) (PR: #29)
- CI jobs for slow tests (with benchmark data caching) and credentialed tests (behind GitHub Environment approval) (PR: #29)
- Added `respx` dev dependency for HTTP-level mocking (PR: #29)
- pytest marker `mmlu` for tests that require the MMLU benchmark (HuggingFace + DISCO). (PR: #34 and #41)
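A marker-implication hook of the kind described above could look roughly like this in a `conftest.py`. This is a sketch of the standard pytest pattern, not the project's actual hook:

```python
import pytest


def pytest_collection_modifyitems(config, items):
    # Sketch of a marker-implication hook: any test marked `credentialed`
    # also receives `live`, so `-m "not live"` deselects it as well.
    for item in items:
        if "credentialed" in item.keywords:
            item.add_marker(pytest.mark.live)
```

Because marker expressions like `-m "not live"` are evaluated after collection hooks run, adding the implied marker here is enough to keep credentialed tests out of offline runs.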
### Changed

#### Core
- Simplified seeding API: the `seed_generator` parameter in setup methods is now always non-None (`SeedGenerator` instead of `Optional[SeedGenerator]`). When seeding is disabled (`seed=None`), `derive_seed()` returns `None` instead of raising an error. This eliminates all `if seed_generator is not None:` conditional checks; the same code path works whether seeding is enabled or disabled. (PR: #27)
- Clarified benchmark/evaluator component guidance in docstrings and docs, including recommended evaluator exception behavior with `fail_on_evaluation_error`. (PR: #28)
- `User.respond()` now raises `UserExhaustedError` instead of returning an empty string when the user has no more turns. Set the new `exhausted_response` parameter to return a configurable message instead (e.g. for tool-based integrations where agents call `ask_user`). Affects `LLMUser`, `AgenticLLMUser`, `Tau2User`, and `MACSUser`. (PR: #39)
- `_extract_json_object()` helper in `maseval.core.simulator` replaces brittle markdown-fence stripping with robust outermost-brace extraction for all LLM simulator JSON parsing (`ToolLLMSimulator`, `UserLLMSimulator`, `AgenticUserLLMSimulator`). (PR: #39)
- `UserLLMSimulator` and `AgenticUserLLMSimulator` now preserve stop tokens that appear outside the JSON object in raw LLM output, so `User._check_stop_token` can detect them. (PR: #39)
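The outermost-brace approach can be illustrated with a simplified sketch. This is not maseval's `_extract_json_object()` implementation, just the general technique: scan from the first `{`, track brace depth (and quoted strings, so braces inside JSON string values don't confuse the count), and parse the balanced span.

```python
import json


def extract_json_object(text: str) -> dict:
    """Pull the outermost {...} out of raw LLM output (simplified sketch).

    Unlike markdown-fence stripping, this tolerates prose, code fences,
    or stop tokens surrounding the JSON object.
    """
    start = text.index("{")
    depth, in_string, escaped = 0, False, False
    for i, ch in enumerate(text[start:], start):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("no balanced JSON object found")
```

Note how stop tokens after the closing brace are simply left untouched in `text`, which is what lets a caller like `User._check_stop_token` still see them.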
#### Interface
- `LlamaIndexAgentAdapter`: added a `max_iterations` constructor parameter, forwarded to `AgentWorkflow.run()`. Fixes silent swallowing of `max_steps` by `FunctionAgent.__init__`. (PR: #39)
- `SmolAgentAdapter`: new `_determine_step_status()` detects crashed steps where `AgentGenerationError` was raised before `step.error` was set, preventing a false "success" status on empty steps. (PR: #39)
- `GoogleGenAIModelAdapter`: consecutive tool-response messages are now merged into a single `contents` entry, fixing Google API errors when multiple tool results are returned in one turn. (PR: #39)
- Renamed framework-specific user classes to reflect the new `LLMUser` base (PR: #22):
  - `SmolAgentUser` → `SmolAgentLLMUser`
  - `LangGraphUser` → `LangGraphLLMUser`
  - `LlamaIndexUser` → `LlamaIndexLLMUser`
#### Benchmarks
- `MACSBenchmark` and `Tau2Benchmark` now actively use the seeding system by deriving seeds for model adapters. Seeds are passed to agents, user simulators, tool simulators, and LLM-based evaluators for reproducible runs. (PR: #26)
  - `Gaia2Benchmark`: seeds `agents/gaia2_agent` and `evaluators/judge`
  - `MACSBenchmark`: seeds `environment/tools/tool_{name}`, `simulators/user`, `evaluators/user_gsr`, and `evaluators/system_gsr`
  - `Tau2Benchmark`: seeds `simulators/user` and `agents/default_agent`
- All benchmarks except MACS are now labeled as Beta in the docs, BENCHMARKS.md, and the benchmark index, with a warning that results have not yet been validated against the original implementations. (PR: #39)
#### User
- Refactored `User` into an abstract base class defining the interface (`get_initial_query()`, `respond()`, `is_done()`), with `LLMUser` as the concrete LLM-driven implementation. This enables non-LLM user implementations (scripted, human-in-the-loop, agent-based). (PR: #22)
- Renamed `AgenticUser` → `AgenticLLMUser` for consistency with the new hierarchy (PR: #22)
#### Testing
- The coverage script (`scripts/coverage_by_feature.py`) now accepts an `--exclude` flag to skip additional markers; it always excludes `credentialed` and `smoke` by default (PR: #29)
### Fixed

#### Core
- `ResultLogger._filter_report()` now includes `status` and `error` fields in persisted results, so saved logs can distinguish successful runs from infrastructure failures. The report schema is now consistent across success and failure paths (`error` is always present, `None` on success). (PR: #38)
- Packaging: fixed the `setuptools` configuration; `packages` now uses `find` with `include = ["maseval*"]` so subpackages and package data (`.json`, `.jsonl`, `.md`, etc.) are included in PyPI installs. (PR: #39)
#### Benchmarks
- Tau2: Fixed the telecom domain schema to match tau2-bench, added agent/user state synchronization and deterministic network simulation, and fixed the initialization flow and tool result serialization (PR: #30)
- Tau2: Added the initial agent greeting ("Hi! How can I help you today?") to the user simulator's message history, matching the original tau2-bench orchestrator. Fixed the tool call counter accumulating across agent turns instead of resetting per turn. Corrected `max_steps` comments (the original default is 100, not 200). Documented all known architectural divergences from the original tau2-bench in PROVENANCE.md. (PR: #39)
- Tau2: Various bugfixes including user tool routing, environment state synchronization, tool result serialization, telecom domain user models/tools, evaluator assertion logic, and the `addict` dependency for nested dict access. (PR: #39)
- Tau2: Fixed incorrect return type annotations on `DB.load()` and `DB.copy_deep()`; they now use `Self` instead of `"DB"`, so subclass methods return the correct type (PR: #29)
- MultiAgentBench: Fixed bargaining evaluation to use both the buyer and seller LLM evaluation prompts, matching the MARBLE paper's methodology. Previously only the seller prompt was used (mirroring a regression in the MARBLE codebase), causing buyer scores to always default to -1 and completion checks to always fail. Now reports `buyer_score`, `seller_score`, and `mean_score` scaled to 0-100. (PR: #39)
- MultiAgentBench: Corrected domain mappings, added missing werewolf/minecraft support, fixed environment constructors, and added result summarization matching MARBLE's evaluation pipeline (PR: #30)
- MultiAgentBench: `MarbleMultiAgentBenchBenchmark` now implements MARBLE's multi-iteration coordination loop with all 4 modes (graph, star, chain, tree) instead of executing agents only once. Fixed the default `coordinate_mode` from `"star"` to `"graph"`, matching 1215/1226 MARBLE configs. Uses per-task `max_iterations` from the task config (matching `engine.py:97`), respects per-agent LLM overrides, and initializes the memory type from the task config. (PR: #39)
- MultiAgentBench: Faithfulness audit fixes for reproduction mode: fixed a wrong import path (`marble.utils.utils` → `marble.llms.model_prompting`), added Minecraft agent registration, added per-domain defaults for `max_iterations`/`coordinate_mode`/`environment.type`/`memory.type` from the MARBLE YAML configs, resolved hardcoded relative paths for `score.json` and `workspace/solution.py` via `_MARBLE_ROOT`, unified `coordinate_mode` defaults, corrected evaluator and agent model defaults to match MARBLE, and replaced auto-generated agent IDs with strict validation. (PR: #39)
- MultiAgentBench: Fixed a bargaining evaluation crash from calling `.format()` on single-brace JSON in evaluator prompts. Documented a chain communication assertion bug in MARBLE's `engine.py`. (PR: #39)
- GAIA2: Various fixes for faithful reproduction of the ARE reference results: scenario lifecycle, data loading, evaluation flow, multi-turn notification handling, tool filtering, default agent fidelity, and simulation time management (PR: #30)
- MACS: `MACSGenericTool._schema_to_inputs()` now preserves the `items` sub-schema for array-type properties, fixing tool registration with the Gemini and OpenAI providers. (PR: #39)
- MACS: Simplified `MACSUser._extract_user_profile()`; it no longer attempts brittle parsing of the scenario text and instead points the profile section at the scenario to avoid duplication. (PR: #39)
- ConVerse: Removed the silent `"gpt-4o"` default for `attacker_model_id`; a `ValueError` is now raised if it is not provided, preventing accidental benchmark misconfiguration. (PR: #39)
- ConVerse: Various fixes for faithful reproduction of the original. (PR: #32)