Merged
39 changes: 39 additions & 0 deletions README.md
@@ -441,6 +441,45 @@ Template files and setup scripts for the test workspace:
| `template` | Local directory uploaded to `/workspace/` in the sandbox |
| `setupScript` | Script file uploaded and executed during scaffolding |

### Executor plugins

Install Claude Code, Codex, or Gemini plugins into the executor sandbox before the agent runs. This is useful for A/B-testing whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use your SDK.

Plugins are installed **only** in the executor sandbox — the judge sandbox stays plugin-free so its scoring is independent of the executor's tooling. Run the same suite twice (once without `executorPlugins`, once with) and compare per-test-case judge scores in the inspect UI.

```json
{
  "executorPlugins": [
    { "type": "local", "name": "my-sdk-skills", "path": "/abs/path/to/plugin-dir" },
    {
      "type": "git",
      "name": "shared-skills",
      "url": "https://github.com/example/skills.git",
      "branch": "main",
      "subpath": "plugins/shared-skills"
    }
  ]
}
```
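
The two runs of an A/B comparison should differ only in `executorPlugins`. A minimal sketch of deriving both configs from one base config is shown here; `SuiteConfig` and `makeAbConfigs` are hypothetical names used for illustration and are not part of this project's API.

```typescript
// Hypothetical shapes: only the `executorPlugins` field and its entry
// fields come from the README; everything else is illustrative.
type ExecutorPlugin =
  | { type: 'local'; name: string; path: string }
  | { type: 'git'; name: string; url: string; branch?: string; subpath?: string; sparse?: string[] };

interface SuiteConfig {
  executorPlugins?: ExecutorPlugin[];
  [key: string]: unknown;
}

// Build the two configs for an A/B run: a baseline without plugins and a
// treatment that is identical except for `executorPlugins`.
function makeAbConfigs(
  base: SuiteConfig,
  plugins: ExecutorPlugin[],
): { baseline: SuiteConfig; treatment: SuiteConfig } {
  // Shallow copies, so the caller's config object is not mutated.
  const baseline: SuiteConfig = { ...base };
  delete baseline.executorPlugins;
  const treatment: SuiteConfig = { ...base, executorPlugins: plugins };
  return { baseline, treatment };
}
```

Writing each result to its own config file and running the suite once per file keeps every other setting identical, so differences in judge scores can be attributed to the plugins.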

| Field | Description |
|---|---|
| `type` | `"local"` or `"git"` |
| `name` | Plugin slug (letters, digits, `.`, `_`, `-`). Must match the plugin manifest's name and must be unique across `executorPlugins`. |
| `path` | For `type: "local"`. Directory on the host containing the adapter-specific manifest. |
| `url` / `branch` / `subpath` / `sparse` | For `type: "git"`. Same semantics as `GitSource` under `privateInfo`. |

What each adapter expects inside the plugin directory:

| Adapter | Required file(s) | Where it lands in the sandbox |
|---|---|---|
| `claude` | `.claude-plugin/plugin.json` | Plugin dir extracted to `$HOME/.claude/plugins/<name>/`, then loaded via the documented `--plugin-dir <path>` CLI flag at each invocation. (Marketplace registration is intentionally skipped — Claude Code's marketplace flow prompts for trust, which can't be answered in `--print` mode.) |
| `codex` | `.codex-plugin/plugin.json` plus one or more `skills/<skill-name>/SKILL.md` | Each `skills/<skill-name>/` extracted to `$CODEX_HOME/skills/<skill-name>/`. Codex auto-discovers skills from that directory. |
| `gemini` | `gemini-extension.json` at the plugin root | The whole plugin dir extracted to `$HOME/.gemini/extensions/<name>/`. |
| custom | — | Not supported. The adapter throws an error at install time if `executorPlugins` is non-empty; an empty list is a no-op. |

Adapters fail fast at install time if the required manifest is missing, so an A/B run cannot silently no-op against the wrong CLI.
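
The fail-fast check can be pictured roughly as follows. This is a sketch rather than the adapters' actual code: `REQUIRED_MANIFEST`, `assertManifest`, and the injected `fileExists` predicate are hypothetical names, and the Codex adapter's additional `skills/<skill-name>/SKILL.md` requirement is not covered here.

```typescript
type AdapterName = 'claude' | 'codex' | 'gemini';

// Required manifest per adapter, relative to the plugin root (from the table above).
const REQUIRED_MANIFEST: Record<AdapterName, string> = {
  claude: '.claude-plugin/plugin.json',
  codex: '.codex-plugin/plugin.json',
  gemini: 'gemini-extension.json',
};

// Throw before anything is uploaded if the manifest is missing on the host.
// `fileExists` is injected so the check stays independent of the filesystem API.
function assertManifest(
  adapter: AdapterName,
  pluginName: string,
  hostDir: string,
  fileExists: (path: string) => boolean,
): void {
  const relative = REQUIRED_MANIFEST[adapter];
  if (!fileExists(`${hostDir}/${relative}`)) {
    throw new Error(
      `Plugin '${pluginName}' is missing ${relative} (required by the ${adapter} adapter)`,
    );
  }
}
```

In a Node.js context, `existsSync` from `node:fs` would be one possible `fileExists` implementation.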

### Sandbox

Resource limits, secrets, and environment variables for sandbox VMs:
45 changes: 45 additions & 0 deletions skills/_reference/config-schema.md
@@ -9,6 +9,7 @@
| `agents` | `object` | No | Per-role agent configuration. |
| `targets` | `TargetConfig[]` | **Yes** | Non-empty array. Docker images for sandboxed execution. |
| `workspace` | `WorkspaceConfig` | No | Workspace template and setup. |
| `executorPlugins` | `ExecutorPlugin[]` | No | Plugin directories installed into the executor's agent CLI inside the sandbox (Claude plugins, Codex skills, Gemini extensions). Not installed in the judge sandbox, so the judge stays independent of the executor's tooling. |
| `sandbox` | `SandboxConfig` | **Yes** | Must be an object (can be `{}`). Resource limits, secrets, env vars. |

## SourceConfig (discriminated union on `type`)
@@ -142,6 +143,49 @@ Custom agents (any command not in the table above) **must** provide `envVar` and
| `template` | `string` | No — local directory to copy into sandbox workspace |
| `setupScript` | `string` | No — path to script run during workspace setup |

## ExecutorPlugin (discriminated union on `type`)

A plugin tree installed into the executor's agent CLI. Use these to A/B test
whether shipping skills/plugins to the executor improves judge scores. Plugins
are installed **only** in the executor sandbox; the judge sandbox is kept
plugin-free so its scoring is independent of the executor's tooling.

Each entry has a `name` (a slug: letters, digits, `.`, `_`, `-` only) plus the
`type` discriminator, which selects one of the two shapes below:

### LocalExecutorPlugin (`type: "local"`)

| Field | Type | Required |
|-------|------|----------|
| `type` | `"local"` | Yes |
| `name` | `string` | Yes — plugin slug |
| `path` | `string` | Yes — absolute or relative directory on the host |

### GitExecutorPlugin (`type: "git"`)

| Field | Type | Required |
|-------|------|----------|
| `type` | `"git"` | Yes |
| `name` | `string` | Yes — plugin slug |
| `url` | `string` | Yes — git repository URL |
| `branch` | `string` | No |
| `subpath` | `string` | No — path within the repo to the plugin dir |
| `sparse` | `string[]` | No — sparse checkout paths |
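
The two tables can be read as a TypeScript discriminated union. The sketch below mirrors the documented field names; the interface names follow the headings above, while `describePluginSource` is a hypothetical helper that only illustrates narrowing on `type`.

```typescript
interface LocalExecutorPlugin {
  type: 'local';
  name: string;
  path: string;
}

interface GitExecutorPlugin {
  type: 'git';
  name: string;
  url: string;
  branch?: string;
  subpath?: string;
  sparse?: string[];
}

type ExecutorPlugin = LocalExecutorPlugin | GitExecutorPlugin;

// Switching on `type` narrows the union, so each branch sees only its own fields.
function describePluginSource(plugin: ExecutorPlugin): string {
  switch (plugin.type) {
    case 'local':
      return `local:${plugin.path}`;
    case 'git': {
      const ref = plugin.branch ? `#${plugin.branch}` : '';
      const sub = plugin.subpath ? `:${plugin.subpath}` : '';
      return `git:${plugin.url}${ref}${sub}`;
    }
  }
}
```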

### Per-adapter expectations

What an adapter requires inside the plugin directory:

| Adapter | Required file(s) | Sandbox destination |
|---|---|---|
| `claude` | `.claude-plugin/plugin.json` at plugin root | Plugin dir extracted to `$HOME/.claude/plugins/<name>/`; loaded for each session via the `--plugin-dir <path>` CLI flag. |
| `codex` | `.codex-plugin/plugin.json` at plugin root, and one or more `skills/<skill-name>/SKILL.md` files | Each `skills/<skill-name>/` dir extracted to `$CODEX_HOME/skills/<skill-name>/` (auto-discovered). |
| `gemini` | `gemini-extension.json` at plugin root | Entire plugin dir extracted to `$HOME/.gemini/extensions/<name>/`. |
| custom | — | Not supported. Adapter throws a clear error if `executorPlugins` is non-empty. |

Each adapter fails fast at install time if its required file is missing — the
A/B comparison won't silently no-op.
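
For the `claude` adapter, loading via `--plugin-dir` amounts to appending one quoted flag per installed plugin to the CLI invocation. A minimal sketch of that flag construction, assuming `/root` as the sandbox `$HOME`; `shellQuote` and `pluginDirFlags` are hypothetical helpers, not the adapter's actual functions.

```typescript
// POSIX single-quote escaping: close the quote, emit an escaped quote, reopen.
function shellQuote(value: string): string {
  return `'${value.replace(/'/g, "'\\''")}'`;
}

// One `--plugin-dir <path>` flag per installed plugin name.
// The default root is an assumption for a sandbox running as root.
function pluginDirFlags(names: string[], pluginsRoot = '/root/.claude/plugins'): string {
  return names
    .map((name) => `--plugin-dir ${shellQuote(`${pluginsRoot}/${name}`)}`)
    .join(' ');
}
```

With no plugins installed the result is an empty string, so the command line is unchanged for the baseline run.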

## Validation Rules

1. Root must be a JSON object
@@ -153,6 +197,7 @@ Custom agents (any command not in the table above) **must** provide `envVar` and
7. `agents.executor` and `agents.judge` must have `secret.value` (non-empty string)
8. Custom agents must provide `envVar` and `baseUrl` in their secret
9. `baseUrl` must be a parseable URL
10. `executorPlugins`, if present, must be an array; each entry needs a `name` (slug-safe) and a valid `type` (`local` or `git`); names must be unique
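
Rule 10 could be checked along these lines. This is a sketch, not the project's validator: `validateExecutorPlugins` is a hypothetical name, and the slug pattern assumes "letters" means ASCII letters.

```typescript
// Assumption: slug characters are ASCII letters, digits, '.', '_' and '-'.
const SLUG = /^[A-Za-z0-9._-]+$/;

// Returns a list of human-readable problems; an empty list means valid.
function validateExecutorPlugins(value: unknown): string[] {
  if (value === undefined) return []; // the field is optional
  if (!Array.isArray(value)) return ['executorPlugins must be an array'];

  const errors: string[] = [];
  const seen = new Set<string>();
  value.forEach((entry: unknown, i: number) => {
    const where = `executorPlugins[${i}]`;
    if (typeof entry !== 'object' || entry === null) {
      errors.push(`${where} must be an object`);
      return;
    }
    const { name, type } = entry as { name?: unknown; type?: unknown };
    if (typeof name !== 'string' || !SLUG.test(name)) {
      errors.push(`${where}.name must be a slug (letters, digits, '.', '_', '-')`);
    } else if (seen.has(name)) {
      errors.push(`${where}.name '${name}' is not unique`);
    } else {
      seen.add(name);
    }
    if (type !== 'local' && type !== 'git') {
      errors.push(`${where}.type must be "local" or "git"`);
    }
  });
  return errors;
}
```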

## Minimal Examples

56 changes: 56 additions & 0 deletions src/agents/__tests__/claude.test.ts
@@ -1,15 +1,28 @@
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { access } from 'node:fs/promises';
import { spawnAgent, spawnInteractive } from '../spawn.js';
import { uploadDirToSandbox } from '../../sandbox/scaffolding.js';
import { ClaudeAdapter } from '../claude.js';
import { makeAgentResult } from '../../__tests__/helpers/fixtures.js';
import { makeMockSandboxClient } from '../../__tests__/helpers/mock-sandbox-client.js';

vi.mock('../spawn.js', () => ({
  spawnAgent: vi.fn(),
  spawnInteractive: vi.fn(),
}));

vi.mock('node:fs/promises', () => ({
  access: vi.fn(),
}));

vi.mock('../../sandbox/scaffolding.js', () => ({
  uploadDirToSandbox: vi.fn(),
}));

const mockSpawnAgent = vi.mocked(spawnAgent);
const mockSpawnInteractive = vi.mocked(spawnInteractive);
const mockAccess = vi.mocked(access);
const mockUploadDir = vi.mocked(uploadDirToSandbox);

describe('ClaudeAdapter', () => {
  let adapter: ClaudeAdapter;
@@ -131,4 +144,47 @@ describe('ClaudeAdapter', () => {
      expect(adapter.installCommand).toBe('npm i -g @anthropic-ai/claude-code');
    });
  });

  describe('installPluginsInSandbox', () => {
    it('is a no-op when given an empty plugin list', async () => {
      const client = makeMockSandboxClient();
      await adapter.installPluginsInSandbox(client as any, []);
      expect(client.runCommand).not.toHaveBeenCalled();
      expect(client.uploadFiles).not.toHaveBeenCalled();
    });

    it('throws clearly when a plugin is missing its Claude manifest', async () => {
      mockAccess.mockRejectedValueOnce(new Error('ENOENT'));
      const client = makeMockSandboxClient();
      await expect(adapter.installPluginsInSandbox(client as any, [
        { name: 'broken', hostDir: '/tmp/broken' },
      ])).rejects.toThrow(/\.claude-plugin\/plugin\.json/);
      expect(client.runCommand).not.toHaveBeenCalled();
    });

    it('extracts each plugin into /root/.claude/plugins/<name> and records the paths', async () => {
      mockAccess.mockResolvedValue(undefined);
      const client = makeMockSandboxClient();

      await adapter.installPluginsInSandbox(client as any, [
        { name: 'plugin-a', hostDir: '/tmp/a' },
        { name: 'plugin-b', hostDir: '/tmp/b' },
      ]);

      expect(mockUploadDir).toHaveBeenCalledTimes(2);
      expect(mockUploadDir).toHaveBeenCalledWith(client, '/tmp/a', '/root/.claude/plugins/plugin-a', 'plugin_plugin-a');
      expect(mockUploadDir).toHaveBeenCalledWith(client, '/tmp/b', '/root/.claude/plugins/plugin-b', 'plugin_plugin-b');

      // sandboxCommand should now emit --plugin-dir for each plugin.
      const cmd = adapter.sandboxCommand('do the thing');
      expect(cmd).toContain("--plugin-dir '/root/.claude/plugins/plugin-a'");
      expect(cmd).toContain("--plugin-dir '/root/.claude/plugins/plugin-b'");
    });

    it('sandboxCommand does not include --plugin-dir flags when no plugins have been installed', () => {
      const fresh = new ClaudeAdapter({ command: 'claude' });
      const cmd = fresh.sandboxCommand('do the thing');
      expect(cmd).not.toContain('--plugin-dir');
    });
  });
});
112 changes: 111 additions & 1 deletion src/agents/__tests__/codex.test.ts
@@ -1,8 +1,10 @@
import { describe, it, expect, vi, beforeEach } from 'vitest';
-import { writeFile, readFile, rm } from 'node:fs/promises';
+import { writeFile, readFile, rm, access, readdir, stat } from 'node:fs/promises';
import { spawnAgent, spawnInteractive } from '../spawn.js';
import { uploadDirToSandbox } from '../../sandbox/scaffolding.js';
import { CodexAdapter } from '../codex.js';
import { makeAgentResult } from '../../__tests__/helpers/fixtures.js';
import { makeMockSandboxClient } from '../../__tests__/helpers/mock-sandbox-client.js';

vi.mock('../spawn.js', () => ({
  spawnAgent: vi.fn(),
@@ -13,13 +15,24 @@ vi.mock('node:fs/promises', () => ({
  writeFile: vi.fn().mockResolvedValue(undefined),
  readFile: vi.fn(),
  rm: vi.fn().mockResolvedValue(undefined),
  access: vi.fn(),
  readdir: vi.fn(),
  stat: vi.fn(),
}));

vi.mock('../../sandbox/scaffolding.js', () => ({
  uploadDirToSandbox: vi.fn(),
}));

const mockSpawnAgent = vi.mocked(spawnAgent);
const mockSpawnInteractive = vi.mocked(spawnInteractive);
const mockWriteFile = vi.mocked(writeFile);
const mockReadFile = vi.mocked(readFile);
const mockRm = vi.mocked(rm);
const mockAccess = vi.mocked(access);
const mockReaddir = vi.mocked(readdir);
const mockStat = vi.mocked(stat);
const mockUploadDir = vi.mocked(uploadDirToSandbox);

describe('CodexAdapter', () => {
  let adapter: CodexAdapter;
@@ -120,4 +133,101 @@ describe('CodexAdapter', () => {
      expect(adapter.installCommand).toBe('npm i -g @openai/codex@0.93.0');
    });
  });

  describe('installPluginsInSandbox', () => {
    function makeDirent(name: string, isDir: boolean) {
      return {
        name,
        isDirectory: () => isDir,
        isFile: () => !isDir,
      } as any;
    }

    it('is a no-op when given an empty plugin list', async () => {
      const client = makeMockSandboxClient();
      await adapter.installPluginsInSandbox(client as any, []);
      expect(client.runCommand).not.toHaveBeenCalled();
    });

    it('throws when a plugin is missing its Codex manifest', async () => {
      mockAccess.mockRejectedValueOnce(new Error('ENOENT'));
      const client = makeMockSandboxClient();
      await expect(adapter.installPluginsInSandbox(client as any, [
        { name: 'broken', hostDir: '/tmp/broken' },
      ])).rejects.toThrow(/\.codex-plugin\/plugin\.json/);
      expect(mockUploadDir).not.toHaveBeenCalled();
    });

    it('throws when a plugin has no skills/ directory', async () => {
      mockAccess.mockResolvedValueOnce(undefined);
      mockReaddir.mockRejectedValueOnce(new Error('ENOENT'));
      const client = makeMockSandboxClient();
      await expect(adapter.installPluginsInSandbox(client as any, [
        { name: 'empty', hostDir: '/tmp/empty' },
      ])).rejects.toThrow(/no 'skills\/' directory/);
    });

    it('throws when a plugin contributes no SKILL.md-bearing dirs', async () => {
      mockAccess.mockResolvedValueOnce(undefined);
      mockReaddir.mockResolvedValueOnce([
        makeDirent('not-a-skill', true),
      ]);
      mockStat.mockRejectedValueOnce(new Error('ENOENT'));
      const client = makeMockSandboxClient();
      await expect(adapter.installPluginsInSandbox(client as any, [
        { name: 'shell', hostDir: '/tmp/shell' },
      ])).rejects.toThrow(/no usable Codex skills/);
    });

    it('extracts each plugin skill into $CODEX_HOME/skills/<name>', async () => {
      mockAccess.mockResolvedValue(undefined);
      // One plugin with two skills.
      mockReaddir.mockResolvedValueOnce([
        makeDirent('skill-one', true),
        makeDirent('skill-two', true),
        makeDirent('not-a-dir', false),
      ]);
      mockStat.mockResolvedValue({ isFile: () => true } as any);

      const client = makeMockSandboxClient();
      client.runCommand
        .mockResolvedValueOnce({ stdout: '/root/.codex', stderr: '', exitCode: 0 }) // printf CODEX_HOME
        .mockResolvedValue({ stdout: '', stderr: '', exitCode: 0 }); // mkdir, etc.

      await adapter.installPluginsInSandbox(client as any, [
        { name: 'bundle', hostDir: '/tmp/bundle' },
      ]);

      expect(mockUploadDir).toHaveBeenCalledTimes(2);
      expect(mockUploadDir).toHaveBeenCalledWith(
        client,
        expect.stringContaining('skills/skill-one'),
        '/root/.codex/skills/skill-one',
        'codex_skill_skill-one',
      );
      expect(mockUploadDir).toHaveBeenCalledWith(
        client,
        expect.stringContaining('skills/skill-two'),
        '/root/.codex/skills/skill-two',
        'codex_skill_skill-two',
      );
    });

    it('throws when two plugins contribute the same skill name, naming both', async () => {
      mockAccess.mockResolvedValue(undefined);
      // Two plugins, each contributing a skill called 'shared'.
      mockReaddir
        .mockResolvedValueOnce([makeDirent('shared', true)])
        .mockResolvedValueOnce([makeDirent('shared', true)]);
      mockStat.mockResolvedValue({ isFile: () => true } as any);

      const client = makeMockSandboxClient();
      client.runCommand.mockResolvedValue({ stdout: '/root/.codex', stderr: '', exitCode: 0 });

      await expect(adapter.installPluginsInSandbox(client as any, [
        { name: 'plugin-a', hostDir: '/tmp/a' },
        { name: 'plugin-b', hostDir: '/tmp/b' },
      ])).rejects.toThrow(/'plugin-a'.*'plugin-b'/);
    });
  });
});
14 changes: 14 additions & 0 deletions src/agents/__tests__/custom.test.ts
@@ -182,4 +182,18 @@ describe('CustomAdapter', () => {
    });
  });

  describe('installPluginsInSandbox', () => {
    it('is a no-op when given an empty plugin list', async () => {
      const adapter = new CustomAdapter({ command: 'my-tool' });
      await expect(adapter.installPluginsInSandbox({} as any, [])).resolves.toBeUndefined();
    });

    it('throws a clear error when given plugins (custom CLIs have no documented plugin layout)', async () => {
      const adapter = new CustomAdapter({ command: 'my-tool' });
      await expect(adapter.installPluginsInSandbox({} as any, [
        { name: 'x', hostDir: '/tmp/x' },
      ])).rejects.toThrow(/does not support executorPlugins/);
    });
  });

});