4-stage evaluation framework for testing Claude Code plugin component triggering. Validates skills, agents, and commands activate correctly via programmatic detection and LLM judgment.
Updated May 12, 2026 · TypeScript
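The 4-stage flow described above (set up, invoke, programmatically detect, LLM-judge) can be sketched as follows. This is a minimal illustration, not the framework's API: the stage names, the transcript markers, and the stubbed judge are all assumptions.

```typescript
// Hypothetical sketch of a 4-stage trigger check for one plugin component.
// Stage names, transcript markers, and the judge stub are assumptions.
type Stage = "setup" | "invoke" | "detect" | "judge";

interface TriggerResult {
  component: string;
  detected: boolean; // stage 3: programmatic detection
  judged: boolean;   // stage 4: LLM judgment (stubbed here)
  stages: Stage[];
}

// Stage 3: scan the session transcript for an activation marker.
function detectActivation(transcript: string, component: string): boolean {
  return (
    transcript.includes(`<skill:${component}>`) ||
    transcript.includes(`Task(${component})`)
  );
}

// Stage 4 stub: a real harness would ask an LLM judge whether the
// activation was appropriate; here we only sanity-check the detection.
function judgeActivation(transcript: string, detected: boolean): boolean {
  return detected && transcript.length > 0;
}

function evaluateTrigger(transcript: string, component: string): TriggerResult {
  const detected = detectActivation(transcript, component);
  return {
    component,
    detected,
    judged: judgeActivation(transcript, detected),
    stages: ["setup", "invoke", "detect", "judge"],
  };
}

const r = evaluateTrigger("user: lint this\n<skill:linter> running", "linter");
```

A real harness would replace the string scan with hook-event inspection and the judge stub with a model call; the shape of the result record is the part that matters.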
Benchmark harness for A/B testing Claude Code plugins against OOLONG long-context reasoning tasks. Compare truncation vs RLM-RS recursive chunking strategies. Features Claude Code hooks integration, SQLite persistence, and comprehensive scoring aligned with the OOLONG paper methodology.
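The two context strategies being A/B tested above can be contrasted in a few lines. This is a sketch under stated assumptions: `truncate` and `recursiveChunk` are illustrative names, and the summarizer is stubbed with a head-taking function standing in for a model call.

```typescript
// Baseline strategy: keep only the head of the document.
function truncate(doc: string, budget: number): string {
  return doc.slice(0, budget);
}

// Recursive-chunking strategy (RLM-RS-style, simplified): split the document,
// compress each half (stubbed summarizer), and recurse until it fits the budget.
function recursiveChunk(
  doc: string,
  budget: number,
  summarize: (s: string) => string = (s) => s.slice(0, Math.ceil(s.length / 2)),
): string {
  if (doc.length <= budget) return doc;
  const mid = Math.ceil(doc.length / 2);
  const left = summarize(doc.slice(0, mid));
  const right = summarize(doc.slice(mid));
  // Each pass roughly halves the text, so the recursion terminates.
  return recursiveChunk(left + right, budget, summarize);
}
```

The stub summarizer makes the halving behavior visible; the actual harness would delegate summarization to the model and persist each round's intermediate output (e.g. to SQLite) for scoring.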
🚀 Automate the evaluation of Claude Code plugin components to ensure accurate triggering of skills, agents, commands, and hooks.
Testing harness for AI agent skills and plugins — contract-driven YAML specs with mocked environments
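A contract-driven spec of the kind described above pairs a mocked environment ("given"), a prompt ("when"), and an expected outcome ("then"). The field names and runner below are hypothetical, chosen only to show the shape; a real harness would load the contract from YAML.

```typescript
// Hypothetical contract shape; field names are illustrative, not the harness API.
interface SkillContract {
  name: string;
  given: { files: Record<string, string> }; // mocked environment
  when: { prompt: string };
  then: { mustTrigger: boolean };
}

// Run one contract against a pluggable trigger predicate and report pass/fail.
function runContract(
  c: SkillContract,
  trigger: (prompt: string, files: Record<string, string>) => boolean,
): boolean {
  const fired = trigger(c.when.prompt, c.given.files);
  return fired === c.then.mustTrigger;
}

// Example: a skill assumed to trigger whenever the prompt mentions "test".
const contract: SkillContract = {
  name: "unit-test-writer",
  given: { files: { "src/app.ts": "export const x = 1;" } },
  when: { prompt: "write tests for app.ts" },
  then: { mustTrigger: true },
};
const pass = runContract(contract, (p) => /test/i.test(p));
```

Keeping the environment inside the contract makes each spec reproducible without touching the real filesystem.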
Testing framework for Codify plugins with complete lifecycle testing, IPC communication, and cross-platform support
Performed end-to-end manual testing and source code analysis on the easy.jobs WordPress Plugin (v2.7.1), identifying 8 bugs across 22 test cases covering authentication, job management, candidate management, security, and edge cases.
Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score every change yes/no across 7 layers — package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, rollout safety. Never gradients.
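The binary, no-gradients scoring described above reduces to a conjunction over the seven layers. The layer keys below come from the description itself; the aggregation logic is an assumption.

```typescript
// The seven layers named in the description; order here is arbitrary.
const LAYERS = [
  "packageIntegrity",
  "triggerQuality",
  "functionalQuality",
  "regressionProtection",
  "baselineValue",
  "modelVariance",
  "rolloutSafety",
] as const;
type Layer = (typeof LAYERS)[number];
type Scorecard = Record<Layer, boolean>;

// A change ships only if every layer is a yes: no partial credit, no gradients.
function passes(card: Scorecard): boolean {
  return LAYERS.every((l) => card[l]);
}
```

The point of the all-or-nothing aggregate is that a single failing layer (say, rollout safety) blocks the change outright rather than being averaged away.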
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).