Windows Agent Arena (WAA) 🪟 is a scalable Windows AI agent platform for testing and benchmarking multi-modal, desktop AI agents. WAA provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, where agentic AI workflows can be tested across a diverse range of tasks.
This README provides instructions on how to run and evaluate ScaleCUA on the Windows Agent Arena platform. Building on the original README, we add more detailed guidance on environment setup, highlight previously undocumented engineering issues during deployment, and remove content related to Azure cloud deployment to streamline the evaluation process of ScaleCUA. Additionally, this script includes a simple implementation for evaluating multiple Docker containers simultaneously and removes certain problematic tasks that previously caused exceptions in Chrome.
*(Demo video: WAA.Intro.mp4)*
- 2025-08-06: Resolved the original WindowsAgentArena issue of insufficient inter-task isolation. At the beginning of each task we now overwrite the golden image, boot a fresh Windows VM, run the evaluation, and then immediately shut the VM down. This guarantees that every task is executed in a completely clean, identically initialized environment.
- 2025-07-21: We release the evaluation script for ScaleCUA on Windows Agent Arena. Based on the original version, this script adds a simple implementation for evaluating multiple Docker containers simultaneously, removes some problematic tasks that caused exceptions in Chrome, and provides a detailed explanation of potential pitfalls during the installation process.
- Docker daemon installed and running. On Windows, we recommend using Docker with WSL 2.
- An OpenAI or Azure OpenAI API Key.
- Python 3.9 - we recommend using Conda and creating an ad hoc Python environment for running the scripts. To create a new environment, run
conda create -n winarena python=3.9.
Clone the repository and install dependencies:
git clone https://github.com/microsoft/WindowsAgentArena.git
cd WindowsAgentArena
# Install the required dependencies in your python environment
# conda activate winarena
pip install -r requirements.txt
Create a new `config.json` at the root of the project with the necessary keys. Only the endpoint you actually use (OpenAI or Azure) needs real values; the remaining keys can be filled with arbitrary placeholder strings:
{
    "OPENAI_API_KEY": "<OPENAI_API_KEY>", // if you are using the OpenAI endpoint
    "AZURE_API_KEY": "<AZURE_API_KEY>", // if you are using the Azure endpoint
    "AZURE_ENDPOINT": "https://yourendpoint.openai.azure.com/" // if you are using the Azure endpoint
}
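The WAA scripts read these keys at startup. As a minimal sketch of that lookup (the helper name is hypothetical, and note that the `//` annotations above are for illustration only — they are not valid JSON and must be omitted from the real file):

```python
import json

def load_api_key(path="config.json"):
    # Hypothetical helper: read the endpoint key from the project-root
    # config.json; unused endpoint keys may hold placeholder strings.
    with open(path) as f:
        cfg = json.load(f)
    key = cfg.get("OPENAI_API_KEY") or cfg.get("AZURE_API_KEY")
    if not key:
        raise KeyError("config.json must define OPENAI_API_KEY or AZURE_API_KEY")
    return key
```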
To get started, pull the base image from Docker Hub:
docker pull windowsarena/winarena-base:latest
This image includes all the necessary dependencies (such as packages and models) required to run the code in the src directory.
Next, build the WinArena image locally:
cd scripts
./build-container-image.sh
# If there are any changes in 'Dockerfile-WinArena-Base', use the --build-base-image flag to also build the base image locally
# ./build-container-image.sh --build-base-image true
# For other build options:
# ./build-container-image.sh --help
This will create the windowsarena/winarena:latest image with the latest code from the src directory.
*(Demo video: WAA.Prepare.Golden.Image.mp4)*
- Visit Microsoft Evaluation Center, accept the Terms of Service, and download a Windows 11 Enterprise Evaluation (90-day trial, English, United States) ISO file [~6GB].
- After downloading, rename the file to `setup.iso` and copy it to the directory `./src/win-arena-container/vm/image`.
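Before kicking off the (~20 minute) preparation step, it can save time to confirm the ISO is in place and not truncated. A hypothetical sanity check, assuming the path above and the ~6 GB download size:

```python
from pathlib import Path

def check_iso(path="./src/win-arena-container/vm/image/setup.iso", min_gb=5.0):
    # Hypothetical check: the renamed Windows 11 ISO should exist and be
    # roughly 6 GB before preparing the golden image.
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"missing ISO: {p}")
    size_gb = p.stat().st_size / 1e9
    if size_gb < min_gb:
        raise ValueError(f"ISO looks truncated: {size_gb:.1f} GB")
    return size_gb
```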
Before running the arena, you need to prepare a new WAA snapshot (also referred to as the WAA golden image). This 30 GB snapshot represents a fully functional Windows 11 VM with all the programs needed to run the benchmark. This VM additionally hosts a Python server which receives and executes agent commands. To learn more about the components at play, see our local and cloud components diagrams.
To prepare the gold snapshot, run once:
cd ./scripts
./run-local.sh --prepare-image true
You can monitor progress at http://localhost:8006. The preparation process is fully automated and will take ~20 minutes.
Please do not interfere with the VM while it is being prepared. It will automatically shut down when the provisioning process is complete.
At the end, you should expect the Docker container named `winarena` to terminate gracefully, as shown in the logs below.
You will find the 30GB WAA golden image in WindowsAgentArena/src/win-arena-container/vm/storage, consisting of the following files:
- Once the Windows system is ready, check whether any evaluation software, such as LibreOffice, is missing from the installed Windows environment. If anything is missing, manually install the required applications (refer to the Development-Tips doc for more details). Then back up the fully prepared 30 GB WAA golden image by saving it to another location. After each evaluation, overwrite the evaluation golden image with this backup, because WAA does not support snapshots and cannot be restored to its initial state.
- It's better to place multiple golden images under the `./src/win-arena-container/vm` directory, allowing you to launch multiple Docker containers for parallel evaluation, with each Docker container corresponding to one golden image. For the specific setup, please refer to Section 5.
- We configure the proxy settings in `./scripts/build-container-image.sh` and `./src/win-arena-container/client/run.py`. If you don't need to use a proxy, simply set them to `None`.
- During development, if you want to include any changes made in the `src/win-arena-container` directory in the WAA golden image, make sure to pass the flag `--skip-build false` to the `run-local.sh` script (defaults to `true`). This ensures that a new container image is built instead of using the prebuilt `windowsarena/winarena:latest` image.
- If you have previously run an installation process and want to start again from scratch, make sure to delete the contents of `storage`.
- We recommend copying this `storage` folder to a safe location outside of the repository, in case you or the agent accidentally corrupt the VM at some point and you want to avoid a fresh setup.
- Depending on your Docker settings, you might have to run the above command with `sudo`.
- Running on WSL2? If you encounter the error `/bin/bash: bad interpreter: No such file or directory`, we recommend converting the bash scripts from DOS/Windows format to Unix format:
cd ./scripts
find . -maxdepth 1 -type f -exec dos2unix {} +
To enable multiple Docker-based evaluations to run in parallel while guaranteeing a clean environment after each task, save the storage snapshot produced during the installation phase as a golden image, then replicate it n times, where n is the desired level of parallelism. For example:
# example
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance1
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance2
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance3
cp -r ./src/win-arena-container/vm/storage ./src/win-arena-container/vm/storage_instance4
...
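The replication above can also be scripted. A minimal sketch (the helper name and the `storage_instance<i>` naming are assumptions to match the example; adjust paths to your layout):

```python
import shutil
from pathlib import Path

def replicate_golden_image(src, n):
    # Copy the golden-image folder n times as storage_instance1..n,
    # one copy per Docker container (sketch; not part of the WAA scripts).
    src = Path(src)
    copies = []
    for i in range(1, n + 1):
        dst = src.with_name(f"storage_instance{i}")
        if not dst.exists():
            shutil.copytree(src, dst)
        copies.append(dst)
    return copies
```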
For evaluating ScaleCUA on WAA, we rely solely on visual inputs and access the models through their APIs. Specifically, the ScaleCUA model family is served with vLLM; you can follow the vLLM repository to complete the installation. For more development details, see evaluation/WindowsAgentArena/README.md.
You're now ready to launch the evaluation for ScaleCUA. To run the default setting, first set the ScaleCUA model URL (`url_set`) in `./src/win-arena-container/client/run.py`:
# example
parser.add_argument("--url_set", type=str, default="http://10.140.66.44:8003/v1")
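Since vLLM exposes an OpenAI-compatible API, the `url_set` value should be an http(s) base URL ending in `/v1`. A small, hypothetical validator you could run before launching an evaluation:

```python
from urllib.parse import urlparse

def check_url_set(url):
    # Sanity-check the --url_set value: vLLM's OpenAI-compatible server
    # is reached via an http(s) .../v1 base URL (hypothetical helper).
    p = urlparse(url)
    if p.scheme not in ("http", "https") or not p.netloc:
        raise ValueError(f"not a valid URL: {url}")
    if not p.path.rstrip("/").endswith("/v1"):
        raise ValueError(f"expected an OpenAI-compatible /v1 base URL: {url}")
    return url
```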
Then run the evaluation script:
sh eval.sh
# You may need to run this
# sudo sh eval.sh
There are three modes for running ScaleCUA on WAA: CoT, Navigation, and Planner+Grounder.
The CoT mode means that the model first generates a chain of thought and then produces the corresponding actions. Models that reason in this way often output more accurate actions. The configuration is as follows:
# In ./src/win-arena-container/client/run.py
# Default is CoT Mode
parser.add_argument("--enable_thinking", type=bool, default=True)
In navigation mode, the model produces the operation and action directly. The setup is as follows:
# In ./src/win-arena-container/client/run.py
parser.add_argument("--enable_thinking", type=bool, default=False)
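A caveat when toggling these modes: `argparse` with `type=bool` converts any non-empty string to `True`, so passing `--enable_thinking False` on the command line would still enable thinking; only editing the default in the file behaves as expected. If you prefer a command-line toggle, an explicit converter avoids the pitfall (a sketch, not the repository's code):

```python
import argparse

def str2bool(v):
    # argparse's type=bool maps any non-empty string (even "False") to
    # True; parse the string explicitly instead.
    if isinstance(v, bool):
        return v
    if v.lower() in ("true", "1", "yes"):
        return True
    if v.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--enable_thinking", type=str2bool, default=True)
```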
The planner+grounder mode typically uses GPT-4o as the planner model and ScaleCUA as the grounder. The configuration is as follows:
# In ./src/win-arena-container/client/run.py
parser.add_argument("--planner_model", type=str, default="gpt-4o-2024-11-20")
# In ./src/win-arena-container/client/mm_agents/navi/scalecua_agent.py
# Set your openai key
os.environ['OPENAI_KEY'] = 'your_openai_key'
You're now ready to launch the evaluation. To run the baseline agent on all benchmark tasks, do:
cd scripts
./run-local.sh
# For client/agent options:
# ./run-local.sh --help
Open http://localhost:8006 to see the Windows VM with the agent running. If you have a beefy PC, you can instead run the strongest agent configuration in our paper by doing:
./run-local.sh --gpu-enabled true --som-origin mixed-omni --a11y-backend uia
At the end of the run you can display the results using the command:
cd src/win-arena-container/client
python show_results.py --result_dir <path_to_results_folder>
Below is a comparison of various combinations of hyperparameters used by the Navi agent in our study, which can be overridden by specifying --som-origin <som_origin> --a11y-backend <a11y_backend> when running the run-local.sh script:
| Command | Description | Notes |
|---|---|---|
| `./run-local.sh --som-origin mixed-omni --a11y-backend uia` | Combines Omniparser with accessibility tree information | ⭐ Recommended for best results |
| `./run-local.sh --som-origin omni` | Uses Omniparser for screen understanding | |
| `./run-local.sh --som-origin oss` | Uses webparse, groundingdino, and OCR (TesseractOCR) | 🌲 Baseline |
| `./run-local.sh --som-origin a11y --a11y-backend uia` | Uses slower, more accurate accessibility tree | |
| `./run-local.sh --som-origin a11y --a11y-backend win32` | Uses faster, less accurate accessibility tree | 🐇 Fastest |
| `./run-local.sh --som-origin mixed-oss --a11y-backend uia` | Combines oss detections with accessibility tree | |
`--som-origin` determines how the Navi agent detects screen elements. `--a11y-backend` specifies the accessibility backend type (when using `a11y` or mixed modes).
At first sight it might seem challenging to develop/debug code running inside the docker container. However, we provide a few tips to make this process easier. Check the Development-Tips Doc for more details such as:
- How to attach a VSCode window (with debugger) to the running container
- How to change the agent and Windows server code from your local machine and see the changes reflected in real time in the container
Thanks to Windows Agent Arena, which provides an ideal evaluation platform for ScaleCUA and makes brilliant contributions to GUI agent development.





