EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
EnterpriseOps-Gym is a containerized, resettable enterprise simulation benchmark for evaluating LLM agents on stateful, multi-step planning and tool use across realistic enterprise workflows
Authors
Shiva Krishna Reddy Malay*,1 Β Β Shravan Nayak*,1,2,3 Β Β Jishnu Sethumadhavan Nair1 Β Β Aman Tiwari1 Β Β Sathwik Tejaswi Madhusudhan1 Β Β Sagar Davasam1 Β Β Sridhar Krishna Nemala1 Β Β Srinivas Sunkara1 Β Β Sai Rajeswar1,2,3
*Equal contribution Β |Β 1ServiceNow Research Β |Β 2Mila β Quebec AI Institute Β |Β 3UniversitΓ© de MontrΓ©al
EnterpriseOps-Gym evaluates LLM agents on 1,150 expert-curated tasks across 8 enterprise domains β Calendar, CSM, Drive, Email, HR, ITSM, Teams, and Hybrid β in a fully interactive, containerized environment.
Unlike static datasets, tasks run against live MCP servers and are evaluated by SQL verifiers that check final environment state, not action sequences.
Key Features:
- π οΈ 512 tools across 8 enterprise domains
- ποΈ 164 database tables with avg 1.7 foreign-key dependencies per table
- π’ 9.15 avg steps per task (up to 34), with 5.3 avg verification conditions
- π 89k avg context length per task
- π Best model achieves only 34.1% success rate β significant headroom for improvement
- βοΈ Installation
- π§ Prerequisites
- π Running the Benchmark
- π Scoring
- π Leaderboard
- π Citation
Requires Python 3.11+ and uv.
git clone https://github.com/ServiceNow/EnterpriseOps-Gym.git
cd EnterpriseOps-Gym
# Install with only the provider(s) you need
uv sync --extra anthropic # Claude / AWS Bedrock
uv sync --extra openai # OpenAI / Azure OpenAI
uv sync --extra google # Gemini / Vertex AI
uv sync --extra deepseek # DeepSeek
uv sync --extra all # EverythingCopy and configure the example configs:
cp -r conf.example/ conf/
# Edit conf/llm/my-model.json with your API key and model detailsEach task runs against a pre-populated database seeded from a SQL snapshot. These snapshots are bundled in gym_dbs.zip at the root of the repository β one SQL file per unique database, organized by domain:
Domain Wise DBs and Task-DB Mappings/
calendar/dbs/ # Calendar domain database snapshots
csm/dbs/ # Customer Service Management snapshots
drive/dbs/ # Drive domain snapshots
email/dbs/ # Email domain snapshots
hr/dbs/ # HR domain snapshots
hybrid/dbs/ # Multi-domain (hybrid) snapshots
itsm/dbs/ # IT Service Management snapshots
teams/dbs/ # Teams domain snapshots
Unzip it before running the benchmark:
unzip gym_dbs.zipEach domain requires a running MCP server. Pull and start the Docker image for each domain:
docker pull shivakrishnareddyma225/enterpriseops-gym-mcp-<domain>:latest
docker run -d -p <host_port>:<container_port> shivakrishnareddyma225/enterpriseops-gym-mcp-<domain>:latestDefault ports:
| Domain | MCP Server | Port |
|---|---|---|
teams |
gym-teams-mcp |
8002 |
csm |
sn-csm-server |
8001 |
email |
gym-email-mcp |
8004 |
itsm |
gym-itsm-mcp |
8006 |
calendar |
gym-calendar |
8003 |
hr |
sn-hr-internal |
8008 |
drive |
gym-google-drive-mcp |
8009 |
<container_port> |
N/A | 8005 |
Update conf/ray/domain_conf.json if you use non-default ports. For calendar use 8003 as the container_port.
LLM configs live in conf/llm/<name>.json. Use an array for load-balanced pools.
| Field | Required | Description |
|---|---|---|
llm_provider |
β | anthropic, aws_bedrock, openai, azureopenai, googlevertexai, google, vllm, openrouter, deepseek, qwq |
llm_model |
β | Model identifier |
llm_api_key |
β | API key |
llm_api_endpoint |
β | Required for Azure OpenAI / vLLM |
llm_api_version |
β | Required for Azure OpenAI |
llm_region |
β | Region for aws_bedrock / googlevertexai |
temperature |
β | Default 0.0 |
max_tokens |
β | Default 4096 |
reasoning |
- | Reasoning Parameters |
{
"llm_provider": "azureopenai",
"llm_model": "gpt-4.1",
"llm_api_key": "<your-api-key>",
"llm_api_endpoint": "https://<your-resource>.openai.azure.com",
"llm_api_version": "2025-04-01-preview",
"temperature": 0.1,
"max_tokens": 16384
}Ray orchestrates parallel runs across models and domains.
1. Create an experiment config (conf/ray/experiment.json):
{
"llms": ["gpt-4.1-mini", "gemini_2p5"],
"domains": ["teams", "csm", "email"],
"modes": ["oracle", "plus_5_tools", "plus_10_tools", "plus_15_tools"],
"orchestrator": "react",
"num_runs": 1,
"num_llm_instances": 1,
"path_templates": {
"log_dir": "logs/{orchestrator}/{llm}/{domain}/{mode}",
"output_folder": "results/{orchestrator}/{llm}/{domain}/{mode}",
"llm_config": "conf/llm/{llm}.json"
}
}Per-model task concurrency is set in conf/ray/llm_concurrency.json (defaults to 5):
{ "gpt-4.1-mini": 4, "gemini_2p5": 4 }2. Run:
python ray_experiment_queue.py --experiment_config conf/ray/experiment.jsonRun a single domain/mode without Ray. Use this option for the hybrid domain.
python evaluate.py \
--hf_dataset ServiceNow-AI/EnterpriseOps-Gym \
--domain teams --mode oracle \
--llm_config conf/llm/gpt-4.1-mini.json \
--output_folder results/react/gpt-4.1-mini/teams/oracle \
--orchestrator react \
--concurrency 4 --num_runs 1For hybrid tasks:
python evaluate.py \
--hf_dataset ServiceNow-AI/EnterpriseOps-Gym \
--domain hybrid --mode oracle \
--llm_config conf/llm/gpt-4.1-mini.json \
--output_folder results/react/gpt-4.1-mini/hybrid/oracle \
--orchestrator react \
--concurrency 2 --num_runs 1Orchestrators:
| Value | Description |
|---|---|
react |
Standard ReAct loop |
planner_react |
Planner generates a plan; executor follows it |
decomposing |
Decomposes task into sub-goals before executing |
For planner_react / decomposing, add --planner_llm_config conf/llm/<planner>.json.
# Single run
python compute_score.py --results_folder results/react/gpt-4.1-mini/teams/oracle
# All modes at once
python compute_score.py --results_folder results/react/gpt-4.1-mini/teamsOutput:
+----------------+---------------+-----------------+----------------------+-----------------------+
| Mode | Total Files | Files w/ Errors | Avg Success Rate (%) | Avg Verifier Pass (%) |
+================+===============+=================+======================+=======================+
| oracle | 100 | 0 | 72.00 | 68.50 |
+----------------+---------------+-----------------+----------------------+-----------------------+
| plus_5_tools | 100 | 0 | 65.00 | 61.20 |
+----------------+---------------+-----------------+----------------------+-----------------------+
- Avg Success Rate β tasks where all verifiers passed
- Avg Verifier Pass β average per-verifier pass rate
- Files w/ Errors β agent errors; excluded from averages
Task success rate (%) on Oracle mode on the full benchmark. A task passes only if all verification conditions are met.
| Model | Teams | CSM | ITSM | Calendar | HR | Drive | Hybrid | Avg | |
|---|---|---|---|---|---|---|---|---|---|
| Closed Source | |||||||||
| Claude Opus 4.6 | 52.0 | 45.1 | 57.7 | 33.3 | 43.3 | 45.1 | 57.1 | 34.0 | 45.9 |
| Claude Sonnet 4.6 | 47.0 | 32.6 | 58.6 | 35.5 | 40.4 | 37.0 | 57.1 | 29.4 | 42.2 |
| Claude Opus 4.5 | 50.0 | 34.2 | 51.9 | 23.8 | 43.2 | 32.1 | 49.5 | 30.7 | 39.4 |
| Gemini-3.1-Pro | 46.0 | 46.7 | 47.1 | 32.8 | 40.4 | 10.9 | 55.2 | 30.1 | 38.7 |
| Claude Sonnet 4.5 | 51.0 | 16.7 | 51.3 | 17.6 | 34.6 | 21.6 | 52.1 | 28.1 | 34.1 |
| Gemini-3-Flash | 47.3 | 35.0 | 44.3 | 28.5 | 30.5 | 12.6 | 49.7 | 24.2 | 34.0 |
| Gemini-3-Pro | 43.0 | 27.7 | 33.6 | 22.2 | 28.8 | 12.5 | 46.7 | 22.9 | 29.7 |
| GPT-5 | 26.3 | 36.4 | 49.0 | 18.9 | 41.3 | 17.9 | 34.0 | 23.5 | 30.9 |
| GPT-5-Mini | 25.7 | 15.8 | 47.4 | 8.9 | 28.8 | 10.7 | 23.8 | 22.5 | 22.9 |
| Gemini-2.5-Pro | 39.3 | 11.6 | 31.1 | 13.9 | 12.5 | 4.9 | 27.0 | 19.6 | 20.0 |
| Open Source | |||||||||
| DeepSeek-V3.2 | 35.7 | 15.4 | 45.8 | 9.6 | 21.5 | 15.0 | 27.6 | 22.9 | 24.2 |
| Kimi-K2-Thinking | 30.0 | 7.1 | 51.0 | 12.2 | 15.4 | 8.2 | 39.6 | 15.7 | 22.4 |
| Qwen3-30B (Think) | 22.0 | 5.4 | 51.9 | 6.7 | 18.3 | 7.6 | 25.7 | 15.7 | 19.1 |
| Qwen3-235B (Inst.) | 28.0 | 4.7 | 38.1 | 9.3 | 15.7 | 7.8 | 23.8 | 17.7 | 18.1 |
| Qwen3-4B (Think) | 24.0 | 3.8 | 38.4 | 5.6 | 5.8 | 7.1 | 21.9 | 15.8 | 15.3 |
We release 60% of the benchmark samples in the public split. For completeness, we present the evaluation results limited to the public split samples below:
| Model | Teams | CSM | ITSM | Calendar | HR | Drive | Hybrid | Avg. | |
|---|---|---|---|---|---|---|---|---|---|
| Closed Source Models | |||||||||
| Claude Opus 4.5 | 50.8 | 29.7 | 47.8 | 28.2 | 41.0 | 32.4 | 46.9 | 30.7 | 36.6 |
| Gemini-3-Flash | 50.8 | 25.7 | 47.8 | 26.2 | 23.0 | 17.6 | 53.1 | 22.7 | 31.2 |
| GPT-5.2 (High) | 27.9 | 28.7 | 52.2 | 22.3 | 34.4 | 22.5 | 37.5 | 20.5 | 29.4 |
| Claude Sonnet 4.5 | 54.1 | 15.8 | 46.3 | 22.3 | 36.1 | 22.5 | 54.7 | 25.0 | 31.7 |
| GPT-5 | 23.0 | 30.7 | 55.2 | 18.4 | 37.7 | 16.7 | 34.4 | 21.6 | 28.1 |
| Gemini-3-Pro | 45.9 | 21.8 | 29.9 | 24.3 | 24.6 | 14.7 | 42.2 | 23.9 | 26.7 |
| GPT-5.2 (Low) | 24.6 | 17.8 | 41.8 | 7.8 | 26.2 | 6.9 | 23.4 | 20.5 | 19.3 |
| GPT-5-Mini | 23.0 | 16.8 | 52.2 | 5.8 | 31.1 | 6.9 | 21.9 | 21.8 | 22.0 |
| Open Source Models | |||||||||
| DeepSeek-V3.2 (High) | 41.0 | 12.9 | 44.8 | 18.4 | 21.3 | 19.6 | 37.5 | 23.9 | 25.5 |
| GPT-OSS-120B (High) | 37.7 | 19.8 | 43.3 | 6.8 | 24.6 | 17.6 | 45.3 | 19.3 | 24.4 |
| Kimi-K2-Thinking | 29.5 | 6.9 | 46.3 | 15.5 | 11.5 | 8.8 | 32.8 | 12.5 | 18.5 |
| Qwen3-30B (Think) | 21.3 | 5.0 | 53.7 | 8.7 | 18.0 | 8.8 | 26.6 | 11.4 | 17.0 |
| Qwen3-235B (Inst.) | 29.5 | 4.0 | 41.8 | 10.7 | 23.0 | 14.7 | 31.2 | 19.3 | 19.6 |
| Qwen3-4B (Think) | 23.0 | 3.0 | 37.3 | 5.8 | 4.9 | 7.8 | 23.4 | 15.9 | 13.6 |
@misc{malay2026enterpriseopsgymenvironmentsevaluationsstateful,
title={EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings},
author={Shiva Krishna Reddy Malay and Shravan Nayak and Jishnu Sethumadhavan Nair and Sagar Davasam and Aman Tiwari and Sathwik Tejaswi Madhusudhan and Sridhar Krishna Nemala and Srinivas Sunkara and Sai Rajeswar},
year={2026},
eprint={2603.13594},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.13594},
}