PSPDFKit-labs · nickwinder · May 18, 2026 · May 14, 2026 · May 15, 2026 · May 15, 2026
diff --git a/README.md b/README.md
@@ -441,6 +441,45 @@ Template files and setup scripts for the test workspace:
 | `template` | Local directory uploaded to `/workspace/` in the sandbox |
 | `setupScript` | Script file uploaded and executed during scaffolding |
 
+### Executor plugins
+
+Install Claude Code / Codex / Gemini plugins into the executor sandbox before the agent runs. Useful for A/B-testing whether shipping a plugin (skills, slash commands, marketplace bundles) measurably improves an agent's ability to use your SDK.
+
+Plugins are installed **only** in the executor sandbox — the judge sandbox stays plugin-free so its scoring is independent of the executor's tooling. Run the same suite twice (once without `executorPlugins`, once with) and compare per-test-case judge scores in the inspect UI.
+
+```json
+{
+  "executorPlugins": [
+    { "type": "local", "name": "my-sdk-skills", "path": "/abs/path/to/plugin-dir" },
+    {
+      "type": "git",
+      "name": "shared-skills",
+      "url": "https://github.com/example/skills.git",
+      "branch": "main",
+      "subpath": "plugins/shared-skills"
+    }
+  ]
+}
+```
+
+| Field | Description |
+|---|---|
+| `type` | `"local"` or `"git"` |
+| `name` | Plugin slug (letters, digits, `.`, `_`, `-`). Must match the plugin manifest's name and must be unique across `executorPlugins`. |
+| `path` | For `type: "local"`. Directory on the host containing the adapter-specific manifest. |
+| `url` / `branch` / `subpath` / `sparse` | For `type: "git"`. Same semantics as `GitSource` under `privateInfo`. |
+
+What each adapter expects inside the plugin directory:
+
+| Adapter | Required file(s) | Where it lands in the sandbox |
+|---|---|---|
+| `claude` | `.claude-plugin/plugin.json` | Plugin dir extracted to `$HOME/.claude/plugins/<name>/`, then loaded via the documented `--plugin-dir <path>` CLI flag at each invocation. (Marketplace registration is intentionally skipped — Claude Code's marketplace flow prompts for trust, which can't be answered in `--print` mode.) |
+| `codex` | `.codex-plugin/plugin.json` plus one or more `skills/<skill-name>/SKILL.md` | Each `skills/<skill-name>/` extracted to `$CODEX_HOME/skills/<skill-name>/`. Codex auto-discovers skills from that directory. |
+| `gemini` | `gemini-extension.json` at the plugin root | The whole plugin dir extracted to `$HOME/.gemini/extensions/<name>/`. |
+| custom | — | Not supported. The adapter raises a clear error at install time if `executorPlugins` is non-empty. |
+
+Adapters fail fast at install time if the required manifest is missing, so an A/B run cannot silently no-op against the wrong CLI.
+
 ### Sandbox
 
 Resource limits, secrets, and environment variables for sandbox VMs:

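Note on the README hunk above: the `claude` row says each plugin is extracted to `$HOME/.claude/plugins/<name>/` and then passed through `--plugin-dir <path>` on every invocation. A minimal sketch of how such flags could be assembled is shown below. The helper names (`shellQuote`, `claudePluginDir`, `pluginDirFlags`) are hypothetical and are not taken from the adapter; only the path layout and the flag name come from the diff.

```typescript
// Hypothetical helpers for illustration; not the adapter's actual code.

/** Quote a string for a POSIX shell using single quotes. */
function shellQuote(value: string): string {
  // Close the quote, emit an escaped single quote, then reopen the quote.
  return `'${value.replace(/'/g, "'\\''")}'`;
}

/** Sandbox directory a plugin is extracted to, following the README table. */
function claudePluginDir(home: string, name: string): string {
  return `${home}/.claude/plugins/${name}`;
}

/** Build one `--plugin-dir <path>` flag per installed plugin. */
function pluginDirFlags(home: string, names: string[]): string {
  return names
    .map((name) => `--plugin-dir ${shellQuote(claudePluginDir(home, name))}`)
    .join(' ');
}

console.log(pluginDirFlags('/root', ['plugin-a', 'plugin-b']));
```

With no plugins the helper returns an empty string, which lines up with the test case asserting that `sandboxCommand` emits no `--plugin-dir` flag before anything is installed.
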
diff --git a/skills/_reference/config-schema.md b/skills/_reference/config-schema.md
@@ -9,6 +9,7 @@
 | `agents` | `object` | No | Per-role agent configuration. |
 | `targets` | `TargetConfig[]` | **Yes** | Non-empty array. Docker images for sandboxed execution. |
 | `workspace` | `WorkspaceConfig` | No | Workspace template and setup. |
+| `executorPlugins` | `ExecutorPlugin[]` | No | Plugin directories installed into the executor's agent CLI inside the sandbox (Claude plugins loaded via `--plugin-dir`, Codex skills, Gemini extensions). Not installed in the judge sandbox, so the judge stays independent of the executor's tooling. |
 | `sandbox` | `SandboxConfig` | **Yes** | Must be an object (can be `{}`). Resource limits, secrets, env vars. |
 
 ## SourceConfig (discriminated union on `type`)
@@ -142,6 +143,49 @@ Custom agents (any command not in the table above) **must** provide `envVar` and
 | `template` | `string` | No — local directory to copy into sandbox workspace |
 | `setupScript` | `string` | No — path to script run during workspace setup |
 
+## ExecutorPlugin (discriminated union on `type`)
+
+A plugin tree installed into the executor's agent CLI. Use these to A/B test
+whether shipping skills/plugins to the executor improves judge scores. Plugins
+are installed **only** in the executor sandbox; the judge sandbox is kept
+plugin-free so its scoring is independent of the executor's tooling.
+
+Each entry has a `name` (a slug: letters, digits, `.`, `_`, `-` only) plus the
+`type` discriminator, which selects one of the two variants below.
+
+### LocalExecutorPlugin (`type: "local"`)
+
+| Field | Type | Required |
+|-------|------|----------|
+| `type` | `"local"` | Yes |
+| `name` | `string` | Yes — plugin slug |
+| `path` | `string` | Yes — absolute or relative directory on the host |
+
+### GitExecutorPlugin (`type: "git"`)
+
+| Field | Type | Required |
+|-------|------|----------|
+| `type` | `"git"` | Yes |
+| `name` | `string` | Yes — plugin slug |
+| `url` | `string` | Yes — git repository URL |
+| `branch` | `string` | No |
+| `subpath` | `string` | No — path within the repo to the plugin dir |
+| `sparse` | `string[]` | No — sparse checkout paths |
+
+### Per-adapter expectations
+
+What an adapter requires inside the plugin directory:
+
+| Adapter | Required file(s) | Sandbox destination |
+|---|---|---|
+| `claude` | `.claude-plugin/plugin.json` at plugin root | Plugin dir extracted to `$HOME/.claude/plugins/<name>/`; loaded for each session via the `--plugin-dir <path>` CLI flag. |
+| `codex` | `.codex-plugin/plugin.json` at plugin root, and one or more `skills/<skill-name>/SKILL.md` files | Each `skills/<skill-name>/` dir extracted to `$CODEX_HOME/skills/<skill-name>/` (auto-discovered). |
+| `gemini` | `gemini-extension.json` at plugin root | Entire plugin dir extracted to `$HOME/.gemini/extensions/<name>/`. |
+| custom | — | Not supported. Adapter throws a clear error if `executorPlugins` is non-empty. |
+
+Each adapter fails fast at install time if its required file is missing — the
+A/B comparison won't silently no-op.
+
 ## Validation Rules
 
 1. Root must be a JSON object
@@ -153,6 +197,7 @@ Custom agents (any command not in the table above) **must** provide `envVar` and
 7. `agents.executor` and `agents.judge` must have `secret.value` (non-empty string)
 8. Custom agents must provide `envVar` and `baseUrl` in their secret
 9. `baseUrl` must be a parseable URL
+10. `executorPlugins`, if present, must be an array; each entry needs a `name` (slug-safe) and a valid `type` (`local` or `git`); names must be unique
 
 ## Minimal Examples
 

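Note on the schema hunk above: the two `ExecutorPlugin` variants and validation rule 10 can be sketched in TypeScript as follows. The type shapes mirror the field tables in the diff; the `validateExecutorPlugins` function, its error strings, and the `SLUG` pattern are assumptions made for illustration and may differ from the project's validator. The sketch checks only what rule 10 lists (array, slug-safe `name`, valid `type`, unique names).

```typescript
// Type shapes follow the tables in config-schema.md. The validator is a
// hypothetical sketch, not the project's implementation.
type LocalExecutorPlugin = { type: 'local'; name: string; path: string };
type GitExecutorPlugin = {
  type: 'git';
  name: string;
  url: string;
  branch?: string;
  subpath?: string;
  sparse?: string[];
};
type ExecutorPlugin = LocalExecutorPlugin | GitExecutorPlugin;

// Letters, digits, '.', '_' and '-' only, as the schema describes.
const SLUG = /^[A-Za-z0-9._-]+$/;

function validateExecutorPlugins(value: unknown): string[] {
  if (value === undefined) return []; // the field is optional
  if (!Array.isArray(value)) return ['executorPlugins must be an array'];

  const errors: string[] = [];
  const seen = new Set<string>();
  value.forEach((entry, i) => {
    const at = `executorPlugins[${i}]`;
    if (typeof entry !== 'object' || entry === null) {
      errors.push(`${at} must be an object`);
      return;
    }
    const { name, type } = entry as Record<string, unknown>;
    if (typeof name !== 'string' || !SLUG.test(name)) {
      errors.push(`${at}.name must be a slug`);
    } else if (seen.has(name)) {
      errors.push(`${at}.name '${name}' is not unique`);
    } else {
      seen.add(name);
    }
    if (type !== 'local' && type !== 'git') {
      errors.push(`${at}.type must be "local" or "git"`);
    }
  });
  return errors;
}

// Two entries sharing a name: the second one is reported as a duplicate.
const plugins: ExecutorPlugin[] = [
  { type: 'local', name: 'my-sdk-skills', path: '/abs/path/to/plugin-dir' },
  { type: 'git', name: 'my-sdk-skills', url: 'https://github.com/example/skills.git' },
];
console.log(validateExecutorPlugins(plugins));
```

Returning a list of messages rather than throwing on the first problem is a design choice of this sketch; it lets a config author see every invalid entry in one pass.
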
diff --git a/src/agents/__tests__/claude.test.ts b/src/agents/__tests__/claude.test.ts
@@ -1,15 +1,28 @@
 import { describe, it, expect, vi, beforeEach } from 'vitest';
+import { access } from 'node:fs/promises';
 import { spawnAgent, spawnInteractive } from '../spawn.js';
+import { uploadDirToSandbox } from '../../sandbox/scaffolding.js';
 import { ClaudeAdapter } from '../claude.js';
 import { makeAgentResult } from '../../__tests__/helpers/fixtures.js';
+import { makeMockSandboxClient } from '../../__tests__/helpers/mock-sandbox-client.js';
 
 vi.mock('../spawn.js', () => ({
   spawnAgent: vi.fn(),
   spawnInteractive: vi.fn(),
 }));
 
+vi.mock('node:fs/promises', () => ({
+  access: vi.fn(),
+}));
+
+vi.mock('../../sandbox/scaffolding.js', () => ({
+  uploadDirToSandbox: vi.fn(),
+}));
+
 const mockSpawnAgent = vi.mocked(spawnAgent);
 const mockSpawnInteractive = vi.mocked(spawnInteractive);
+const mockAccess = vi.mocked(access);
+const mockUploadDir = vi.mocked(uploadDirToSandbox);
 
 describe('ClaudeAdapter', () => {
   let adapter: ClaudeAdapter;
@@ -131,4 +144,47 @@ describe('ClaudeAdapter', () => {
       expect(adapter.installCommand).toBe('npm i -g @anthropic-ai/claude-code');
     });
   });
+
+  describe('installPluginsInSandbox', () => {
+    it('is a no-op when given an empty plugin list', async () => {
+      const client = makeMockSandboxClient();
+      await adapter.installPluginsInSandbox(client as any, []);
+      expect(client.runCommand).not.toHaveBeenCalled();
+      expect(client.uploadFiles).not.toHaveBeenCalled();
+    });
+
+    it('throws clearly when a plugin is missing its Claude manifest', async () => {
+      mockAccess.mockRejectedValueOnce(new Error('ENOENT'));
+      const client = makeMockSandboxClient();
+      await expect(adapter.installPluginsInSandbox(client as any, [
+        { name: 'broken', hostDir: '/tmp/broken' },
+      ])).rejects.toThrow(/\.claude-plugin\/plugin\.json/);
+      expect(client.runCommand).not.toHaveBeenCalled();
+    });
+
+    it('extracts each plugin into /root/.claude/plugins/<name> and records the paths', async () => {
+      mockAccess.mockResolvedValue(undefined);
+      const client = makeMockSandboxClient();
+
+      await adapter.installPluginsInSandbox(client as any, [
+        { name: 'plugin-a', hostDir: '/tmp/a' },
+        { name: 'plugin-b', hostDir: '/tmp/b' },
+      ]);
+
+      expect(mockUploadDir).toHaveBeenCalledTimes(2);
+      expect(mockUploadDir).toHaveBeenCalledWith(client, '/tmp/a', '/root/.claude/plugins/plugin-a', 'plugin_plugin-a');
+      expect(mockUploadDir).toHaveBeenCalledWith(client, '/tmp/b', '/root/.claude/plugins/plugin-b', 'plugin_plugin-b');
+
+      // sandboxCommand should now emit --plugin-dir for each plugin.
+      const cmd = adapter.sandboxCommand('do the thing');
+      expect(cmd).toContain("--plugin-dir '/root/.claude/plugins/plugin-a'");
+      expect(cmd).toContain("--plugin-dir '/root/.claude/plugins/plugin-b'");
+    });
+
+    it('sandboxCommand does not include --plugin-dir flags when no plugins have been installed', () => {
+      const fresh = new ClaudeAdapter({ command: 'claude' });
+      const cmd = fresh.sandboxCommand('do the thing');
+      expect(cmd).not.toContain('--plugin-dir');
+    });
+  });
 });
diff --git a/src/agents/__tests__/codex.test.ts b/src/agents/__tests__/codex.test.ts
@@ -1,8 +1,10 @@
 import { describe, it, expect, vi, beforeEach } from 'vitest';
-import { writeFile, readFile, rm } from 'node:fs/promises';
+import { writeFile, readFile, rm, access, readdir, stat } from 'node:fs/promises';
 import { spawnAgent, spawnInteractive } from '../spawn.js';
+import { uploadDirToSandbox } from '../../sandbox/scaffolding.js';
 import { CodexAdapter } from '../codex.js';
 import { makeAgentResult } from '../../__tests__/helpers/fixtures.js';
+import { makeMockSandboxClient } from '../../__tests__/helpers/mock-sandbox-client.js';
 
 vi.mock('../spawn.js', () => ({
   spawnAgent: vi.fn(),
@@ -13,13 +15,24 @@ vi.mock('node:fs/promises', () => ({
   writeFile: vi.fn().mockResolvedValue(undefined),
   readFile: vi.fn(),
   rm: vi.fn().mockResolvedValue(undefined),
+  access: vi.fn(),
+  readdir: vi.fn(),
+  stat: vi.fn(),
+}));
+
+vi.mock('../../sandbox/scaffolding.js', () => ({
+  uploadDirToSandbox: vi.fn(),
 }));
 
 const mockSpawnAgent = vi.mocked(spawnAgent);
 const mockSpawnInteractive = vi.mocked(spawnInteractive);
 const mockWriteFile = vi.mocked(writeFile);
 const mockReadFile = vi.mocked(readFile);
 const mockRm = vi.mocked(rm);
+const mockAccess = vi.mocked(access);
+const mockReaddir = vi.mocked(readdir);
+const mockStat = vi.mocked(stat);
+const mockUploadDir = vi.mocked(uploadDirToSandbox);
 
 describe('CodexAdapter', () => {
   let adapter: CodexAdapter;
@@ -120,4 +133,101 @@ describe('CodexAdapter', () => {
       expect(adapter.installCommand).toBe('npm i -g @openai/codex@0.93.0');
     });
   });
+
+  describe('installPluginsInSandbox', () => {
+    function makeDirent(name: string, isDir: boolean) {
+      return {
+        name,
+        isDirectory: () => isDir,
+        isFile: () => !isDir,
+      } as any;
+    }
+
+    it('is a no-op when given an empty plugin list', async () => {
+      const client = makeMockSandboxClient();
+      await adapter.installPluginsInSandbox(client as any, []);
+      expect(client.runCommand).not.toHaveBeenCalled();
+    });
+
+    it('throws when a plugin is missing its Codex manifest', async () => {
+      mockAccess.mockRejectedValueOnce(new Error('ENOENT'));
+      const client = makeMockSandboxClient();
+      await expect(adapter.installPluginsInSandbox(client as any, [
+        { name: 'broken', hostDir: '/tmp/broken' },
+      ])).rejects.toThrow(/\.codex-plugin\/plugin\.json/);
+      expect(mockUploadDir).not.toHaveBeenCalled();
+    });
+
+    it('throws when a plugin has no skills/ directory', async () => {
+      mockAccess.mockResolvedValueOnce(undefined);
+      mockReaddir.mockRejectedValueOnce(new Error('ENOENT'));
+      const client = makeMockSandboxClient();
+      await expect(adapter.installPluginsInSandbox(client as any, [
+        { name: 'empty', hostDir: '/tmp/empty' },
+      ])).rejects.toThrow(/no 'skills\/' directory/);
+    });
+
+    it('throws when a plugin contributes no SKILL.md-bearing dirs', async () => {
+      mockAccess.mockResolvedValueOnce(undefined);
+      mockReaddir.mockResolvedValueOnce([
+        makeDirent('not-a-skill', true),
+      ]);
+      mockStat.mockRejectedValueOnce(new Error('ENOENT'));
+      const client = makeMockSandboxClient();
+      await expect(adapter.installPluginsInSandbox(client as any, [
+        { name: 'shell', hostDir: '/tmp/shell' },
+      ])).rejects.toThrow(/no usable Codex skills/);
+    });
+
+    it('extracts each plugin skill into $CODEX_HOME/skills/<name>', async () => {
+      mockAccess.mockResolvedValue(undefined);
+      // One plugin with two skills.
+      mockReaddir.mockResolvedValueOnce([
+        makeDirent('skill-one', true),
+        makeDirent('skill-two', true),
+        makeDirent('not-a-dir', false),
+      ]);
+      mockStat.mockResolvedValue({ isFile: () => true } as any);
+
+      const client = makeMockSandboxClient();
+      client.runCommand
+        .mockResolvedValueOnce({ stdout: '/root/.codex', stderr: '', exitCode: 0 }) // printf CODEX_HOME
+        .mockResolvedValue({ stdout: '', stderr: '', exitCode: 0 });                  // mkdir, etc.
+
+      await adapter.installPluginsInSandbox(client as any, [
+        { name: 'bundle', hostDir: '/tmp/bundle' },
+      ]);
+
+      expect(mockUploadDir).toHaveBeenCalledTimes(2);
+      expect(mockUploadDir).toHaveBeenCalledWith(
+        client,
+        expect.stringContaining('skills/skill-one'),
+        '/root/.codex/skills/skill-one',
+        'codex_skill_skill-one',
+      );
+      expect(mockUploadDir).toHaveBeenCalledWith(
+        client,
+        expect.stringContaining('skills/skill-two'),
+        '/root/.codex/skills/skill-two',
+        'codex_skill_skill-two',
+      );
+    });
+
+    it('throws when two plugins contribute the same skill name, naming both', async () => {
+      mockAccess.mockResolvedValue(undefined);
+      // Two plugins, each contributing a skill called 'shared'.
+      mockReaddir
+        .mockResolvedValueOnce([makeDirent('shared', true)])
+        .mockResolvedValueOnce([makeDirent('shared', true)]);
+      mockStat.mockResolvedValue({ isFile: () => true } as any);
+
+      const client = makeMockSandboxClient();
+      client.runCommand.mockResolvedValue({ stdout: '/root/.codex', stderr: '', exitCode: 0 });
+
+      await expect(adapter.installPluginsInSandbox(client as any, [
+        { name: 'plugin-a', hostDir: '/tmp/a' },
+        { name: 'plugin-b', hostDir: '/tmp/b' },
+      ])).rejects.toThrow(/'plugin-a'.*'plugin-b'/);
+    });
+  });
 });
diff --git a/src/agents/__tests__/custom.test.ts b/src/agents/__tests__/custom.test.ts
@@ -182,4 +182,18 @@ describe('CustomAdapter', () => {
     });
   });
 
+  describe('installPluginsInSandbox', () => {
+    it('is a no-op when given an empty plugin list', async () => {
+      const adapter = new CustomAdapter({ command: 'my-tool' });
+      await expect(adapter.installPluginsInSandbox({} as any, [])).resolves.toBeUndefined();
+    });
+
+    it('throws a clear error when given plugins (custom CLIs have no documented plugin layout)', async () => {
+      const adapter = new CustomAdapter({ command: 'my-tool' });
+      await expect(adapter.installPluginsInSandbox({} as any, [
+        { name: 'x', hostDir: '/tmp/x' },
+      ])).rejects.toThrow(/does not support executorPlugins/);
+    });
+  });
+
 });
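
Note on the Codex tests above: because every skill is extracted into the flat `$CODEX_HOME/skills/<skill-name>/` directory, two plugins that ship a skill with the same name would presumably overwrite each other, which is why one test expects an error naming both plugins. A minimal sketch of such a check is shown below. The `PluginSkills` shape, the function name, and the exact error wording are hypothetical; the test only requires that both plugin names appear, quoted, in the message.

```typescript
// Hypothetical sketch: the input shape and error wording are assumptions.
type PluginSkills = { plugin: string; skills: string[] };

/** Map each skill name to its owning plugin, throwing on the first collision. */
function assertNoSkillCollisions(plugins: PluginSkills[]): Map<string, string> {
  const owner = new Map<string, string>(); // skill name -> plugin providing it
  for (const { plugin, skills } of plugins) {
    for (const skill of skills) {
      const first = owner.get(skill);
      if (first !== undefined) {
        throw new Error(
          `Codex skill '${skill}' is provided by both '${first}' and '${plugin}'`,
        );
      }
      owner.set(skill, plugin);
    }
  }
  return owner;
}

try {
  assertNoSkillCollisions([
    { plugin: 'plugin-a', skills: ['shared'] },
    { plugin: 'plugin-b', skills: ['shared'] },
  ]);
} catch (err) {
  console.log((err as Error).message);
}
```

Running the check before any upload keeps the failure at install time, in line with the fail-fast behaviour the README describes.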