This guide walks through the complete workflow for running the CRM benchmark inside Atlas so the teacher can judge end-to-end outcomes. Follow the steps in order before attempting a full benchmark run.
- Python: 3.10 or newer (Atlas SDK is validated on 3.13).
- Postgres: Two logical databases running on the same server or container:
  - `crm_sandbox` – used by the CRM harness to store case state.
  - `atlas` – used by Atlas telemetry, rewards, and learning.
- Environment variables:
  - `OPENAI_API_KEY` (student + teacher + judge via LiteLLM/OpenAI adapter).
  - `GEMINI_API_KEY` (small/large judges plus learning synthesizer).
  - CRM DB connection: `DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME=crm_sandbox`.
  - Atlas telemetry DB: `STORAGE__DATABASE_URL=postgresql://atlas:atlas@<host>:<port>/atlas`.
  - Optional: `ATLAS_OFFLINE_MODE=1` during local dry-runs (unset it for real LLM validation).
- System packages: `psycopg` is already listed in the repo requirements; ensure the Postgres CLI tools are available if you plan to inspect databases manually.
Copy `configs/atlas/.env.example` to `.env` (or your shell profile) and adjust credentials. The repo's `.env` already contains the necessary API keys; duplicate the values there so Atlas picks them up.
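Before a run, an optional preflight check can confirm the variables are actually exported. This is a minimal sketch; the variable list simply mirrors the prerequisites above, so extend it to match your setup:

```python
# Optional preflight sketch: fail fast when required variables are missing.
# The list mirrors the prerequisites above; adjust it to your environment.
import os

REQUIRED = [
    "OPENAI_API_KEY", "GEMINI_API_KEY",
    "DB_HOST", "DB_PORT", "DB_USER", "DB_PASSWORD", "DB_NAME",
    "STORAGE__DATABASE_URL",
]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```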
Important: The Atlas SDK requires a local modification to support environment variable override for the storage database URL. After installing the SDK, apply this change:
File: `external/atlas-sdk/atlas/config/models.py`
Location: add a `model_validator` to the `StorageConfig` class (around line 533):
```python
@model_validator(mode="before")
@classmethod
def _override_with_env_var(cls, data: Any) -> Any:
    """Override database_url with STORAGE__DATABASE_URL if set."""
    import os

    if isinstance(data, dict):
        env_url = os.getenv("STORAGE__DATABASE_URL")
        if env_url:
            data = {**data, "database_url": env_url}
    return data
```

This ensures that `STORAGE__DATABASE_URL` from `.env` overrides the YAML config value, allowing environment-specific database URLs without modifying config files.
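To confirm the patch took effect, a quick hedged check like the following can help. It assumes `StorageConfig` can be instantiated with just `database_url` and has no other required fields; adapt it to the actual class shape in your checkout:

```python
# Hedged verification sketch: assumes StorageConfig accepts database_url and
# no other required arguments; adjust to the real class definition if needed.
import os
from atlas.config.models import StorageConfig

os.environ["STORAGE__DATABASE_URL"] = "postgresql://atlas:atlas@localhost:5432/atlas"
cfg = StorageConfig(database_url="postgresql://user:pass@yaml-host:5432/atlas")
assert cfg.database_url == os.environ["STORAGE__DATABASE_URL"], "override not applied"
print("STORAGE__DATABASE_URL override works.")
```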
Atlas is vendored under `external/atlas-sdk`, but you must install it in editable mode inside your Python environment so `atlas` CLI commands resolve:

```bash
pip install -e "external/atlas-sdk[dev]"
```

Notes:
- This brings in `litellm>=1.77.7`. If you also need packages that pin an older `litellm` (e.g., `bespokelabs-curator==1.61.3`), use a separate virtualenv.
- The editable install lets you modify the SDK adapters inside this repo (e.g., the CRM harness adapter) without re-installing.
- After pulling new commits that touch `external/atlas-sdk`, rerun the same command to keep your environment in sync.
Reference: `external/atlas-sdk/README.md` (Quick Start and Storage sections) and `.env.example`.
The CRM harness talks to Postgres via `ConversationHarness._create_backend()` (see `src/evaluation/conversation_harness.py:596-603`). To guarantee each case starts clean:
- Start the Postgres service (local Docker or managed instance) with both databases created (`crm_sandbox`, `atlas`).
- Run the existing seeding flow (if available) or execute the CRM schema migrations.
- Validate connectivity by running a quick harness smoke test:

  ```bash
  python -m src.evaluation.run_harness \
    --dataset data/conversations/dev.jsonl \
    --backend postgres \
    --sample 1
  ```

  During initialization the harness will:

  - Call `PostgresCrmBackend.begin_session(reset=True)` (see `src/crm_backend.py:90-134`).
  - Seed initial entities via `_seed_postgres_backend`.
  - Roll back the session after the run (so each task starts from the same snapshot).

- If the command fails, double-check the `DB_*` env vars and that the `crm_sandbox` schema exists.
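For reference, the per-case lifecycle looks roughly like the sketch below. Only the method names come from the description above (`src/crm_backend.py`); the constructor signature is an assumption:

```python
# Hedged sketch of the per-case session lifecycle; the constructor signature
# is an assumption, while begin_session/rollback_session are named above.
from src.crm_backend import PostgresCrmBackend

backend = PostgresCrmBackend()        # assumed to read the DB_* env vars
backend.begin_session(reset=True)     # reset case state to a clean snapshot
try:
    pass                              # the harness executes tool calls here
finally:
    backend.rollback_session()        # restore the DB to the initial state
```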
Atlas needs to know which dataset version produced each run. Use `_compute_dataset_revision` (`src/integration/atlas_integration.py:90-138`) to record either:
- The current Git commit SHA (preferred).
- Or the modification timestamp of the dataset file when Git metadata is unavailable.
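A minimal sketch of that fallback logic follows; the real implementation lives at `src/integration/atlas_integration.py:90-138`, and the helper name here is illustrative:

```python
# Illustrative sketch only; mirrors the behavior described above, not the
# actual implementation in src/integration/atlas_integration.py.
import subprocess
from pathlib import Path

def compute_dataset_revision(dataset_path: str) -> str:
    try:
        # Preferred: the current Git commit SHA.
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Fallback: the dataset file's modification timestamp.
        return f"mtime-{int(Path(dataset_path).stat().st_mtime)}"
```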
Workflow:
- Place the final CRM conversations JSONL in `data/conversations/<name>.jsonl`.
- Record the intended subset (full benchmark vs. sample).
- When generating task payloads (next section), pass the dataset path so the script embeds the revision string in every task payload and, later, in `artifacts/baselines/<ts>/atlas/metrics.json`.
Create (or run) the CLI that wraps `conversation_to_payload` (`src/integration/atlas_common.py:19-43`). It should:
- Load the dataset JSONL into `Conversation` objects.
- For each conversation, build a payload:

  ```json
  {
    "task_id": "<conversation_id>::<run_id>",
    "run_id": "<timestamp>-<suffix>",
    "conversation": { ...serialized Conversation... },
    "dataset_revision": "<git sha or timestamp>",
    "backend": "postgres",
    "use_llm_judge": true
  }
  ```

- Write the list to `artifacts/baselines/<timestamp>/atlas/tasks.jsonl`.
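A hedged sketch of what that CLI loop might look like. The field names follow the template above; treating `conversation_id` as a top-level JSONL key and the run-id format are assumptions:

```python
# Illustrative payload-generation loop; the JSONL field names and run-id
# format are assumptions modeled on the payload template above.
import json
import time
import uuid
from pathlib import Path

def build_tasks(dataset: Path, out_file: Path, revision: str) -> None:
    run_id = f"{time.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:6]}"
    with dataset.open() as src, out_file.open("w") as dst:
        for line in src:
            conv = json.loads(line)
            payload = {
                "task_id": f"{conv['conversation_id']}::{run_id}",
                "run_id": run_id,
                "conversation": conv,
                "dataset_revision": revision,
                "backend": "postgres",
                "use_llm_judge": True,
            }
            dst.write(json.dumps(payload) + "\n")
```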
- Copy `configs/atlas/.env.example` to your env and export the values.
- Use `configs/atlas/crm_harness.yaml` (or `.dev.yaml`) as the runtime config:
  - `agent.type: crm_harness`
  - Student model: GPT-4.1-mini
  - Teacher model: GPT-4.1 (low temperature)
  - `rim.small_model`: Gemini 2.5 Flash
  - `rim.large_model`: Gemini 2.5 Pro
  - `learning.llm`: Gemini 2.5 Flash
  - `orchestration.forced_mode: paired` (capability probe disabled)
  - `storage.database_url`: the Atlas telemetry DB
```bash
atlas run \
  --config configs/atlas/crm_harness.yaml \
  --task-file artifacts/baselines/<timestamp>/atlas/tasks.jsonl \
  --output-dir artifacts/baselines/<timestamp>/atlas
```

Flags of note:

- `--task-file` injects the serialized CRM conversations directly; the CRM harness adapter consumes `task_payload`.
- `--output-dir` mirrors Atlas’ own `.atlas/runs/...` artifacts into our repo-standard location for easier handoff.
- Ensure `ATLAS_OFFLINE_MODE` is unset (or `0`) so real LLMs execute.
After the run, you should see:
- `artifacts/baselines/<timestamp>/atlas/tasks.jsonl` – input payloads.
- `artifacts/baselines/<timestamp>/atlas/sessions.jsonl` – copy of Atlas session traces (`atlas/cli/jsonl_writer.py:330-488`).
- `artifacts/baselines/<timestamp>/atlas/metrics.json` – aggregated stats including:
  - `execution_mode`, `adaptive_summary` (will read “paired” throughout because the probe is disabled).
  - `reward_stats`, `session_reward`, token usage (`prompt_tokens`, `completion_tokens`, `calls`).
  - CRM verification results: `overall_success`, `first_failed_turn`, etc.
- `artifacts/baselines/<timestamp>/atlas/README.md` – human-readable summary (timestamp, config path, dataset, success counts).
Use these files to compare Atlas rewards vs. baseline harness metrics and to prepare reports for downstream training.
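For a quick comparison pass, something like this pulls the headline numbers. The key names mirror the artifact list above; verify them against a real `metrics.json` before relying on them:

```python
# Quick inspection sketch; the key names follow the artifact list above and
# should be checked against an actual metrics.json from a run.
import json
from pathlib import Path

metrics = json.loads(Path("artifacts/baselines/<timestamp>/atlas/metrics.json").read_text())
for key in ("execution_mode", "reward_stats", "session_reward", "overall_success"):
    print(f"{key}: {metrics.get(key)}")
```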
Before launching the full benchmark:
- Single conversation run:

  ```bash
  atlas run --config configs/atlas/crm_harness.yaml \
    --task-file artifacts/baselines/smoke/tasks.jsonl \
    --limit 1
  ```

- Ensure `ATLAS_OFFLINE_MODE=0` (or unset).
- Inspect the single session record:
  - `metadata.execution_mode` must be `paired`.
  - `session_reward.score > 0` (or the judge rationale shows a meaningful verdict).
  - CRM Postgres tables reflect the tool calls during execution, and `PostgresCrmBackend.rollback_session()` returned the DB to the initial state afterward.
- Verify `artifacts/.../sessions.jsonl` and `metrics.json` include token usage and adaptive summaries.
- If everything looks good, proceed with the full dataset run using the same config/runbook.
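The record checks above can be scripted. A hedged example, assuming `sessions.jsonl` holds one JSON object per line with the field paths named in the checklist:

```python
# Hedged smoke-check sketch; assumes one JSON object per line with the
# metadata/session_reward fields named in the checklist above.
import json
from pathlib import Path

first_line = Path("artifacts/baselines/smoke/sessions.jsonl").read_text().splitlines()[0]
record = json.loads(first_line)
assert record["metadata"]["execution_mode"] == "paired"
assert record["session_reward"]["score"] > 0
print("Smoke session looks good.")
```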
Document every smoke run (timestamp, dataset, config, reward score) so we can show a clear progression from baseline to Atlas-graded runs.
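One lightweight way to do that is a CSV ledger appended after each run. This is an optional sketch; the path and column names are our own convention, not an Atlas artifact:

```python
# Optional logging sketch: append each smoke run to a CSV ledger so the
# baseline-to-Atlas progression stays auditable. Path and columns are ours.
import csv
from datetime import datetime, timezone
from pathlib import Path

def log_smoke_run(dataset: str, config: str, reward: float,
                  ledger: Path = Path("artifacts/smoke_runs.csv")) -> None:
    new_file = not ledger.exists()
    with ledger.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if new_file:
            writer.writerow(["timestamp", "dataset", "config", "reward_score"])
        writer.writerow([datetime.now(timezone.utc).isoformat(),
                         dataset, config, reward])
```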