13 changes: 8 additions & 5 deletions .github/workflows/deploy-staging.yml
@@ -146,12 +146,15 @@ jobs:
 set -euo pipefail
 cd "${{ secrets.STAGING_EC2_DEPLOY_PATH }}"

-PR_BRANCH="${{ github.head_ref || github.event.inputs.ref || 'develop' }}"
-echo "── Deploying branch: $PR_BRANCH ──"
+DEPLOY_REF="${{ github.head_ref || github.event.inputs.ref || 'develop' }}"
+echo "── Deploying ref: $DEPLOY_REF ──"

-git fetch origin "$PR_BRANCH"
-git checkout "$PR_BRANCH"
-git pull origin "$PR_BRANCH"
+git fetch origin "$DEPLOY_REF"
+if git show-ref --verify --quiet "refs/remotes/origin/$DEPLOY_REF"; then
+  git checkout -B "$DEPLOY_REF" "origin/$DEPLOY_REF"
+else
+  git checkout --detach FETCH_HEAD
+fi

 echo "── Restarting XMem staging service ──"
 sudo systemctl restart xmem-staging
9 changes: 9 additions & 0 deletions .gitignore
@@ -56,7 +56,16 @@ tests/
 !tests/
 !tests/**/*.py
 benchmarks/
+!benchmarks/
+!benchmarks/README.md
 LongMemEval/
+!benchmarks/longmemeval/
+!benchmarks/longmemeval/**
+benchmarks/longmemeval/**/__pycache__/
+benchmarks/longmemeval/**/*.pyc
+benchmarks/longmemeval/data/
+benchmarks/longmemeval/results/
+benchmarks/longmemeval/outputs/
 backboard/
 rust/

9 changes: 9 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,9 @@
# XMem Benchmarks

This directory contains benchmark harnesses for XMem.

- `longmemeval/`: Python-only LongMemEval benchmark runner targeting the XMem HTTP API.

Benchmark runs can create large dataset and result artifacts. Keep those files under
`benchmarks/longmemeval/data`, `benchmarks/longmemeval/results`, or
`benchmarks/longmemeval/outputs`; those paths are intentionally ignored by git.
166 changes: 166 additions & 0 deletions benchmarks/longmemeval/README.md
@@ -0,0 +1,166 @@
# LongMemEval Benchmark for XMem Python

This harness benchmarks the Python XMem service only. It targets the deployed
Python API at `https://api.xmem.in` by default and does not run or compare the
Go implementation.

LongMemEval evaluates long-term conversational memory across multi-session
recall, temporal reasoning, single-session recall, knowledge updates, and
preference tracking. The harness follows the same broad structure used by
open-source memory-layer benchmarks: load dataset records, ingest the haystack
conversation history into an isolated user namespace, retrieve an answer for
the benchmark question, write predictions, and compute lightweight local
metrics for quick iteration.

## Files

- `dataset.py`: Loads JSON/JSONL LongMemEval records and converts sessions to
XMem conversation-turn ingest payloads.
- `client.py`: Async HTTP client for the Python XMem API.
- `runner.py`: Benchmark orchestration, batching, polling, resume support, and
output writing.
- `metrics.py`: Local exact-match, contains, and token-F1 metrics plus summary
aggregation.
- `run.py`: CLI entrypoint.
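
The local metrics named for `metrics.py` can be sketched roughly as follows. This is an illustrative version only: the function names and the normalization rules (lowercasing, stripping ASCII punctuation, whitespace tokenization) are assumptions, and the actual module may normalize answers differently.

```python
from collections import Counter
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split on whitespace (assumed rules)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 when both strings normalize to the same token sequence."""
    return float(normalize(prediction) == normalize(reference))


def contains(prediction: str, reference: str) -> float:
    """1.0 when the reference tokens appear as a contiguous run in the prediction."""
    pred, ref = normalize(prediction), normalize(reference)
    if not ref:
        return 0.0
    windows = range(len(pred) - len(ref) + 1)
    return float(any(pred[i : i + len(ref)] == ref for i in windows))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the multiset overlap."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Metrics of this kind are cheap to compute but reward surface overlap, which is why they suit quick iteration rather than final reporting.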

## Secrets

Do not commit API keys or provider credentials.

To generate XMem predictions, set an XMem API key:

```bash
export XMEM_API_KEY="..."
```

Use `--api-key-env` if your local environment uses a different variable name.
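
A minimal sketch of how a configurable key variable could be resolved, assuming the key is read from the process environment. The helper name `load_api_key` and the exact error wording are hypothetical and are not taken from `run.py`.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def load_api_key(
    env_name: str = "XMEM_API_KEY",
    environ: Mapping[str, str] | None = None,
) -> str:
    """Return the API key stored in env_name, or fail with an actionable message."""
    env = os.environ if environ is None else environ
    value = env.get(env_name, "").strip()
    if not value:
        raise RuntimeError(
            f"Missing API key: set {env_name}, or pass --api-key-env "
            "for a custom variable name."
        )
    return value
```

Passing the variable name rather than the key itself keeps the secret out of shell history and process listings.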

To score predictions with the official LongMemEval LLM-as-judge evaluator, set
an OpenAI API key before running the evaluator:

```bash
export OPENAI_API_KEY="..."
```

## Run a Smoke Check

Validate dataset parsing and payload construction without calling the service:

```bash
python -m benchmarks.longmemeval.run \
--download \
--dry-run \
--limit 2
```

Validate all six official categories without requiring an API key:

```bash
python -m benchmarks.longmemeval.run_all_categories \
--download \
--dry-run
```

If the dataset is already available locally:

```bash
python -m benchmarks.longmemeval.run \
--dataset-path benchmarks/longmemeval/data/longmemeval_s_cleaned.json \
--dry-run \
--limit 2
```

## Run Against the Python API

```bash
export XMEM_API_KEY="..."

python -m benchmarks.longmemeval.run \
--download \
--api-base-url https://api.xmem.in \
--limit 10 \
--batch-size 25 \
--output-dir benchmarks/longmemeval/results/run-001
```

The runner writes:

- `results.jsonl`: Full per-example benchmark records.
- `predictions.jsonl`: Official prediction file with only `question_id` and
`hypothesis`.
- `summary.json`: Aggregate local metrics and latency.
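
The relationship between the first two files can be sketched as a simple projection. The example assumes each line of `results.jsonl` is a JSON object that carries `question_id` and `hypothesis` among its fields; the helper name `write_predictions` is hypothetical.

```python
import json
from pathlib import Path


def write_predictions(results_path: Path, predictions_path: Path) -> int:
    """Copy only question_id and hypothesis from each result record."""
    count = 0
    with results_path.open(encoding="utf-8") as src, predictions_path.open(
        "w", encoding="utf-8"
    ) as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            row = {
                "question_id": record["question_id"],
                "hypothesis": record["hypothesis"],
            }
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```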

The local metrics are intended for fast development feedback. For
publication-quality reporting, run the generated `predictions.jsonl` through the
official LongMemEval evaluation flow or an agreed LLM-as-judge rubric, using the
same model and settings across all systems being compared.

The benchmark runner itself only needs `XMEM_API_KEY` because it generates XMem
answers. The official/equivalent evaluator is a separate scoring step and needs
`OPENAI_API_KEY` when using an OpenAI judge model.

## Run All Official Categories

The dataset has six `question_type` categories. Each example has a unique
`question_id` and its own haystack sessions, and this runner isolates each
question into a separate XMem user namespace. Running categories in parallel
therefore cannot leak memories between questions; the only practical
constraints are API throughput and rate limiting.

```bash
export XMEM_API_KEY="..."

python -m benchmarks.longmemeval.run_all_categories \
--dataset-path benchmarks/longmemeval/data/longmemeval_s_cleaned.json \
--api-base-url https://api.xmem.in \
--output-root benchmarks/longmemeval/results/full-six-categories \
--max-parallel-categories 6
```

The all-category runner prints live processed/left/ETA status and writes one
official merged prediction file at:

```text
benchmarks/longmemeval/results/full-six-categories/predictions.jsonl
```

Each category also gets a `runner.log` file under its output directory. If a
category process fails, the launcher prints the failing category, exit code, log
path, and the most recent child-process output.
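
One way such a merge could work is sketched below. It assumes each category writes its own `predictions.jsonl` into a direct subdirectory of the output root, which this README does not spell out; the helper name `merge_predictions` is hypothetical.

```python
import json
from pathlib import Path


def merge_predictions(output_root: Path) -> int:
    """Merge <output_root>/<category>/predictions.jsonl into one root-level file."""
    merged: dict[str, str] = {}
    for path in sorted(output_root.glob("*/predictions.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                row = json.loads(line)
                question_id = row["question_id"]
                if question_id in merged:
                    # question_id values are unique, so a repeat signals a bad run
                    raise ValueError(f"Duplicate question_id {question_id!r} in {path}")
                merged[question_id] = row["hypothesis"]

    target = output_root / "predictions.jsonl"
    with target.open("w", encoding="utf-8") as handle:
        for question_id, hypothesis in merged.items():
            row = {"question_id": question_id, "hypothesis": hypothesis}
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(merged)
```

Failing on a duplicate `question_id` is a deliberate choice in this sketch: silently keeping one of two answers would make the merged file hard to trust.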

## Useful Options

- `--limit N`: Run a small subset first.
- `--offset N`: Skip the first N selected examples.
- `--question-type TYPE`: Filter to one LongMemEval category.
- `--skip-ingest`: Reuse already-ingested user namespaces and only retrieve.
- `--no-resume`: Re-run examples even if they already exist in `results.jsonl`.
- `--ingest-api-version v1`: Use synchronous batch ingestion instead of the
default durable `/v2/memory/batch-ingest` path.
- `--effort-level high`: Use high-effort XMem ingestion for long records.
- `--dry-run`: Validate dataset/category setup without API calls.
- `--verbose`: Print child runner output while the all-category launcher runs.
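
The `--no-resume` flag implies that, by default, examples already recorded in `results.jsonl` are skipped. A minimal sketch of that kind of resume check follows, assuming each result line is a JSON object with a `question_id` field and each example is a dict with the same key. Both helper names are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def completed_question_ids(results_path: Path) -> set[str]:
    """Collect the question ids that already have a record in results.jsonl."""
    if not results_path.exists():
        return set()
    done: set[str] = set()
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written final line
            if isinstance(record, dict) and "question_id" in record:
                done.add(str(record["question_id"]))
    return done


def pending_examples(
    examples: list[dict], results_path: Path, resume: bool = True
) -> list[dict]:
    """Return the examples that still need to run; resume=False re-runs everything."""
    if not resume:
        return list(examples)
    done = completed_question_ids(results_path)
    return [example for example in examples if str(example["question_id"]) not in done]
```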

## Expected Failures

These errors are intentional and should be actionable:

- `Dataset file not found`: run with `--download`, or pass `--dataset-path`.
- `Missing API key`: set `XMEM_API_KEY`, or pass `--api-key-env` for a custom
variable name.
- Official evaluator authentication errors: set `OPENAI_API_KEY` before running
the LongMemEval scoring step.
- `Failed to download the LongMemEval dataset`: check network access, then retry
or download the dataset manually.
- `<category> failed with exit code ...`: inspect that category's `runner.log`.

## Isolation Model

Each example is ingested into a user id derived from:

```text
<user-prefix>-<question-id>
```

This prevents facts from one benchmark question from leaking into another. Use a
new `--user-prefix` for fully fresh runs.
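
The naming scheme above can be sketched as follows. Only the `<user-prefix>-<question-id>` format comes from this README; the helper names and the collision check are illustrative assumptions.

```python
def user_id_for(user_prefix: str, question_id: str) -> str:
    """Build the per-question user id in the <user-prefix>-<question-id> format."""
    return f"{user_prefix}-{question_id}"


def assign_user_ids(user_prefix: str, question_ids: list[str]) -> dict[str, str]:
    """Map each question id to its user id, rejecting repeated question ids."""
    mapping: dict[str, str] = {}
    for question_id in question_ids:
        if question_id in mapping:
            raise ValueError(f"Duplicate question id: {question_id!r}")
        mapping[question_id] = user_id_for(user_prefix, question_id)
    return mapping
```

Because the prefix is part of every id, changing `--user-prefix` gives every question a namespace that no earlier run has written to.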
1 change: 1 addition & 0 deletions benchmarks/longmemeval/__init__.py
@@ -0,0 +1 @@
"""LongMemEval benchmark harness for the Python XMem API."""
129 changes: 129 additions & 0 deletions benchmarks/longmemeval/client.py
@@ -0,0 +1,129 @@
"""HTTP client for the Python XMem API used by the benchmark."""

from __future__ import annotations

import asyncio
import time
from dataclasses import dataclass
from typing import Any

import httpx


TERMINAL_JOB_STATUSES = {"succeeded", "dead_letter"}


@dataclass(frozen=True)
class ApiCallResult:
    data: dict[str, Any]
    elapsed_ms: float


class XMemApiClient:
    """Small async client around the deployed Python XMem API."""

    def __init__(
        self,
        *,
        base_url: str,
        api_key: str,
        timeout_seconds: float = 120.0,
        max_retries: int = 3,
        retry_backoff_seconds: float = 2.0,
    ) -> None:
        self.base_url = base_url.rstrip("/")
        self.max_retries = max_retries
        self.retry_backoff_seconds = retry_backoff_seconds
        self._client = httpx.AsyncClient(
            base_url=self.base_url,
            timeout=httpx.Timeout(timeout_seconds),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
                "User-Agent": "xmem-longmemeval-benchmark/1.0",
            },
        )

    async def __aenter__(self) -> "XMemApiClient":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await self.close()

    async def close(self) -> None:
        await self._client.aclose()

    async def ingest(self, payload: dict[str, Any]) -> ApiCallResult:
        return await self._post("/v1/memory/ingest", payload)

    async def batch_ingest_v1(self, items: list[dict[str, Any]]) -> ApiCallResult:
        return await self._post("/v1/memory/batch-ingest", {"items": items})

    async def batch_ingest_v2(self, items: list[dict[str, Any]]) -> ApiCallResult:
        return await self._post("/v2/memory/batch-ingest", {"items": items})

    async def retrieve(self, payload: dict[str, Any]) -> ApiCallResult:
        return await self._post("/v1/memory/retrieve", payload)

    async def job_status(self, status_url: str) -> ApiCallResult:
        return await self._get(status_url)

    async def poll_job(
        self,
        status_url: str,
        *,
        interval_seconds: float,
        timeout_seconds: float,
    ) -> ApiCallResult:
        deadline = time.monotonic() + timeout_seconds
        last_result: ApiCallResult | None = None
        while time.monotonic() < deadline:
            last_result = await self.job_status(status_url)
            status = str(last_result.data.get("status") or "").lower()
            if status in TERMINAL_JOB_STATUSES:
                return last_result
            await asyncio.sleep(interval_seconds)
        status = last_result.data.get("status") if last_result else "unknown"
        raise TimeoutError(f"Timed out polling job {status_url}; last status={status}")

    async def _get(self, path: str) -> ApiCallResult:
        return await self._request("GET", path)

    async def _post(self, path: str, payload: dict[str, Any]) -> ApiCallResult:
        return await self._request("POST", path, json=payload)

    async def _request(self, method: str, path: str, **kwargs: Any) -> ApiCallResult:
        request_path = self._request_path(path)
        start = time.perf_counter()
        response: httpx.Response | None = None
        for attempt in range(self.max_retries + 1):
            try:
                response = await self._client.request(method, request_path, **kwargs)
                if response.status_code < 500 and response.status_code != 429:
                    break
            except httpx.HTTPError:
                if attempt >= self.max_retries:
                    raise
            if attempt < self.max_retries:
                await asyncio.sleep(self.retry_backoff_seconds * (attempt + 1))

        if response is None:
            raise RuntimeError(f"No response from {method} {request_path}")
        elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
        response.raise_for_status()
        body = response.json()
        if body.get("status") == "error":
            error = body.get("error") or f"XMem API error from {request_path}"
            raise RuntimeError(error)
        data = body.get("data")
        if data is None:
            data = {}
        if not isinstance(data, dict):
            data = {"value": data}
        return ApiCallResult(data=data, elapsed_ms=elapsed_ms)

    @staticmethod
    def _request_path(path: str) -> str:
        if path.startswith(("http://", "https://", "/")):
            return path
        return f"/{path}"