Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.
This README provides instructions on how to run and evaluate ScaleCUA on the Windows Agent Arena platform. Building on the original README, we add more detailed guidance on environment setup, highlight previously undocumented engineering issues during deployment, and remove content related to Azure cloud deployment to streamline the evaluation process of ScaleCUA. Additionally, this script includes a simple implementation for evaluating multiple Docker containers simultaneously and removes certain problematic tasks that previously caused exceptions in Chrome.
*(Demo video: WAA.Intro.mp4)*
- 2025-08-06: Resolved the original WindowsAgentArena issue of insufficient inter-task isolation. At the beginning of each task we now overwrite the golden image, boot a fresh Windows VM, run the evaluation, and then immediately shut the VM down. This guarantees that every task is executed in a completely clean, identically initialized environment.
- 2025-07-21: We release the evaluation script for ScaleCUA on Windows Agent Arena. Based on the original version, this script adds a simple implementation for evaluating multiple Docker containers simultaneously, removes some problematic tasks that caused exceptions in Chrome, and provides a detailed explanation of potential pitfalls during the installation process.
- Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
- An OpenAI or Azure OpenAI API Key.
- Python 3.9 - we recommend using Conda and creating an ad hoc Python environment for running the scripts. To create a new environment, run
conda create -n winarena python=3.9.
Clone the repository and install dependencies:
git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
Create a new `config.json` at the root of the project with the necessary keys. Only the endpoint you actually use (OpenAI or Azure) needs real values; the remaining keys can be filled with arbitrary placeholder strings:
{
    "OPENAI_API_KEY": "<OPENAI_API_KEY>", // if you are using the OpenAI endpoint
    "AZURE_API_KEY": "<AZURE_API_KEY>", // if you are using the Azure endpoint
    "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/" // if you are using the Azure endpoint
}
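The WAA scripts read these keys at startup. As a minimal sketch of that lookup (the helper name is hypothetical, and note that the `//` annotations above are for illustration only — they are not valid JSON and must be omitted from the real file):

```python
import json

def load_api_key(path="config.json"):
    # Hypothetical helper: read the endpoint key from the project-root
    # config.json; unused endpoint keys may hold placeholder strings.
    with open(path) as f:
        cfg = json.load(f)
    key = cfg.get("OPENAI_API_KEY") or cfg.get("AZURE_API_KEY")
    if not key:
        raise KeyError("config.json must define OPENAI_API_KEY or AZURE_API_KEY")
    return key
```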
To get started, pull the base image from Docker Hub:
docker pull windowsarena/winarena-base:latest
This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.
Next, build the WinArena image locally:
cd scripts
./build-container-image.sh
# If there are any changes in 'Dockerfile-WinArena-Base', use the --build-base-image flag to also build the base image locally
# ./build-container-image.sh --build-base-image true
# For other build options:
# ./build-container-image.sh --help
This will create the windowsarena/winarena:latest image with the latest code from the src directory.
*(Demo video: WAA.Prepare.Golden.Image.mp4)*
- Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB].
- After downloading, rename the file to `setup.iso` and copy it to the directory `./src/win-arena-container/vm/image`.
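Before kicking off the (~20 minute) preparation step, it can save time to confirm the ISO is in place and not truncated. A hypothetical sanity check, assuming the path above and the ~6 GB download size:

```python
from pathlib import Path

def check_iso(path="./src/win-arena-container/vm/image/setup.iso", min_gb=5.0):
    # Hypothetical check: the renamed Windows 11 ISO should exist and be
    # roughly 6 GB before preparing the golden image.
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"missing ISO: {p}")
    size_gb = p.stat().st_size / 1e9
    if size_gb < min_gb:
        raise ValueError(f"ISO looks truncated: {size_gb:.1f} GB")
    return size_gb
```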
Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30 GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.
To prepare the gold snapshot, run once:
cd ./scripts
./run-local.sh --prepare-image true
You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.
Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
At the end, you should expect the Docker container named `winarena` to terminate gracefully, as shown in the logs below.
You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:
- Once the Windows system is ready, check whether any evaluation software, such as LibreOffice, is missing from the installed Windows environment. If anything is missing, manually install the required applications (refer to the Development-Tips doc for more details). Then back up the fully prepared 30 GB WAA golden image by saving it to another location. After each evaluation, overwrite the evaluation golden image with this backup, because WAA does not support snapshots and cannot be restored to its initial state.
- It's better to place multiple golden images under the `./src/win-arena-container/vm` directory, allowing you to launch multiple Docker containers for parallel evaluation, with each Docker container corresponding to one golden image. For the specific setup, please refer to Section 5.
- We configure the proxy settings in `./scripts/build-container-image.sh` and `./src/win-arena-container/client/run.py`. If you don't need to use a proxy, simply set them to `None`.
- During development, if you want to include any changes made in the `src/win-arena-container` directory in the WAA golden image, make sure to pass the flag `--skip-build false` to the `run-local.sh` script (defaults to `true`). This ensures that a new container image is built instead of using the prebuilt `windowsarena/winarena:latest` image.
- If you have previously run an installation process and want to start again from scratch, make sure to delete the contents of `storage`.
- We recommend copying this `storage` folder to a safe location outside of the repository, in case you or the agent accidentally corrupt the VM at some point and you want to avoid a fresh setup.
- Depending on your Docker settings, you might have to run the above command with `sudo`.
- Running on WSL2? If you encounter the error `/bin/bash: bad interpreter: No such file or directory`, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
To enable multiple Docker-based evaluations to run in parallel while guaranteeing a clean environment after each task, save the storage snapshot produced during the installation phase as a golden image, then replicate it n times, where n is the desired level of parallelism. For example:
# example
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance1
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance2
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance3
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance4
...
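The replication above can also be scripted. A minimal sketch (the helper name and the `storage_instance<i>` naming are assumptions to match the example; adjust paths to your layout):

```python
import shutil
from pathlib import Path

def replicate_golden_image(src, n):
    # Copy the golden-image folder n times as storage_instance1..n,
    # one copy per Docker container (sketch; not part of the WAA scripts).
    src = Path(src)
    copies = []
    for i in range(1, n + 1):
        dst = src.with_name(f"storage_instance{i}")
        if not dst.exists():
            shutil.copytree(src, dst)
        copies.append(dst)
    return copies
```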
For evaluating ScaleCUA on WAA, we rely solely on visual inputs and access the models through their APIs. Specifically, the ScaleCUA model family is served with vLLM; you can follow the vLLM repository to complete the installation. For more development details, see evaluation/WindowsAgentArena/README.md.
You're now ready to launch the evaluation for ScaleCUA. To run the default setting, first set the ScaleCUA model URL (`url_set`) in `./src/win-arena-container/client/run.py`:
# example
parser.add_argument("--url_set", type=str, default="http://10.140.66.44:8003/v1")
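Since vLLM exposes an OpenAI-compatible API, the `url_set` value should be an http(s) base URL ending in `/v1`. A small, hypothetical validator you could run before launching an evaluation:

```python
from urllib.parse import urlparse

def check_url_set(url):
    # Sanity-check the --url_set value: vLLM's OpenAI-compatible server
    # is reached via an http(s) .../v1 base URL (hypothetical helper).
    p = urlparse(url)
    if p.scheme not in ("http", "https") or not p.netloc:
        raise ValueError(f"not a valid URL: {url}")
    if not p.path.rstrip("/").endswith("/v1"):
        raise ValueError(f"expected an OpenAI-compatible /v1 base URL: {url}")
    return url
```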
Then run the evaluation script:
sh eval.sh
# You may need to run this
# sudo sh eval.sh
There are three modes for running ScaleCUA on WAA: CoT, Navigation, and Planner+Grounder.
The CoT mode means that the model first generates a chain of thought and then produces the corresponding actions. Models that reason in this way often output more accurate actions. The configuration is as follows:
# In ./src/win-arena-container/client/run.py
# Default is CoT Mode
parser.add_argument("--enable_thinking", type=bool, default=True)
In navigation mode, the model produces the operation and action directly. The setup is as follows:
# In ./src/win-arena-container/client/run.py
parser.add_argument("--enable_thinking", type=bool, default=False)
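A caveat when toggling these modes: `argparse` with `type=bool` converts any non-empty string to `True`, so passing `--enable_thinking False` on the command line would still enable thinking; only editing the default in the file behaves as expected. If you prefer a command-line toggle, an explicit converter avoids the pitfall (a sketch, not the repository's code):

```python
import argparse

def str2bool(v):
    # argparse's type=bool maps any non-empty string (even "False") to
    # True; parse the string explicitly instead.
    if isinstance(v, bool):
        return v
    if v.lower() in ("true", "1", "yes"):
        return True
    if v.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--enable_thinking", type=str2bool, default=True)
```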
The planner+grounder mode typically uses GPT-4o as the planner model and ScaleCUA as the grounder. The configuration is as follows:
# In ./src/win-arena-container/client/run.py
parser.add_argument("--planner_model", type=str, default="gpt-4o-2024-11-20")
# In ./src/win-arena-container/client/mm_agents/navi/scalecua_agent.py
# Set your openai key
os.environ['OPENAI_KEY'] = 'your_openai_key'
You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, do:
cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help
Open http://localhost:8006 to see the Windows VM with the agent running. If you have a beefy PC, you can instead run the strongest agent configuration in our paper by doing:
./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia
At the end of the run you can display the results using the command:
cd src/win-arena-container/client
python show_results.py --result_dir <path_to_results_folder>
Below is a comparison of various combinations of hyperparameters used by the Navi agent in our study, which can be overridden by specifying --som-origin <som_origin> --a11y-backend <a11y_backend> when running the run-local.sh script:
| Command | Description | Notes |
|---|---|---|
| `./run-local.sh --som-origin mixed-omni --a11y-backend uia` | Combines Omniparser with accessibility tree information | ⭐ Recommended for best results |
| `./run-local.sh --som-origin omni` | Uses Omniparser for screen understanding | |
| `./run-local.sh --som-origin oss` | Uses webparse, groundingdino, and OCR (TesseractOCR) | 🌲 Baseline |
| `./run-local.sh --som-origin a11y --a11y-backend uia` | Uses slower, more accurate accessibility tree | |
| `./run-local.sh --som-origin a11y --a11y-backend win32` | Uses faster, less accurate accessibility tree | 🐇 Fastest |
| `./run-local.sh --som-origin mixed-oss --a11y-backend uia` | Combines oss detections with accessibility tree | |
`--som-origin` determines how the Navi agent detects screen elements. `--a11y-backend` specifies the accessibility backend type (when using `a11y` or mixed modes).
At first sight it might seem challenging to develop/debug code running inside the docker container. However, we provide a few tips to make this process easier. Check the Development-Tips Doc for more details such as:
- How to attach a VSCode window (with debugger) to the running container
- How to change the agent and Windows server code from your local machine and see the changes reflected in real time in the container
Thanks to Windows Agent Arena, which provides an ideal evaluation platform for ScaleCUA and makes brilliant contributions to GUI agent development.





