
Stream bench pr #28

Open

sruan2 wants to merge 3 commits into ace-agent:main from sruan2:stream-bench-pr

Conversation


sruan2 commented on Feb 19, 2026

Summary

  • Adds a full benchmarking framework (eval/stream-bench/) for evaluating ACE on text-to-SQL tasks using the StreamBench suite, supporting the BIRD, CoSQL, and Spider datasets
  • Extends ACE core (ace.py, reflector.py, llm.py, utils.py) with improvements needed for text-to-SQL evaluation: per-split data processors, SQL execution result passing to the reflector, more robust LLM error handling (HTTP 400 / invalid prompt recovery), and enriched run folder naming
  • Adds data download, preprocessing, training, playbook evaluation, and plotting scripts with pre-built configuration files for common task sizes and difficulty balancing strategies

Changes

New: eval/stream-bench/

| File | Purpose |
| --- | --- |
| `run.py` | Main ACE training/evaluation runner for text-to-SQL |
| `run_playbook.py` | Evaluate a saved playbook on any data split |
| `data_processor.py` | Data loading, schema formatting, and execution-based SQL accuracy |
| `download_text2sql_data.py` | Download raw BIRD/CoSQL/Spider SQLite databases |
| `preprocess_streambench_{bird,cosql,spider}.py` | Convert HuggingFace StreamBench data to `.jsonl` with schema |
| `dataset_stats.py` | Print dataset statistics |
| `plot.py` | Generate accuracy-over-time plots |
| `analyze_logs.py` | Analyze terminal output logs |
| `data/bird_config.json` | Pre-built BIRD task configurations |
| `data/cosql_config.json` | Pre-built CoSQL task configurations |
| `data/spider_config.json` | Pre-built Spider task configurations |
| `README.md` | Full setup and usage documentation |
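Execution-based SQL accuracy, as used by `data_processor.py`, is conventionally computed by running both the predicted and gold queries against the task's SQLite database and comparing result sets regardless of row order. A minimal sketch of that convention (function and parameter names are illustrative, not the PR's actual API):

```python
import sqlite3


def execution_accuracy(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted query yields the same rows as the gold query.

    Rows are compared as unordered multisets, the usual convention for
    execution-based text-to-SQL evaluation; invalid SQL counts as incorrect.
    """
    conn = sqlite3.connect(db_path)
    try:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # predicted query failed to execute
        gold_rows = conn.execute(gold_sql).fetchall()
        return sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows))
    finally:
        conn.close()
```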

Modified: ACE core

  • ace/ace.py: Support separate train_processor/val_processor/test_processor (backward-compatible with data_processor); pass metadata (db_name and curriculum) into run folder name; print correct/total counts alongside accuracy
  • ace/core/reflector.py: Accept and format SQL execution results (predicted_result, ground_truth_result) and pass them into the reflector prompt for richer feedback
  • ace/prompts/reflector.py: Add SQL execution results placeholder to both ground-truth and no-ground-truth reflector prompt templates
  • llm.py: Detect and gracefully handle HTTP 400 / invalid-prompt / policy-violation errors without crashing; skip the affected sample and continue execution
  • utils.py: Minor utilities supporting stream-bench evaluation
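The backward-compatible split handling in `ace/ace.py` presumably resolves each of the three splits to its dedicated processor when one is given, falling back to the legacy single `data_processor` otherwise. A hypothetical sketch of that resolution logic (names and signature are assumptions, not the PR's actual code):

```python
def resolve_processors(data_processor=None, train_processor=None,
                       val_processor=None, test_processor=None):
    """Prefer per-split processors; fall back to the legacy single processor.

    Returns a (train, val, test) tuple so existing callers that pass only
    data_processor keep working unchanged.
    """
    if data_processor is None and not (train_processor or val_processor or test_processor):
        raise ValueError("Provide data_processor or at least one per-split processor")
    return (
        train_processor or data_processor,
        val_processor or data_processor,
        test_processor or data_processor,
    )
```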
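The `llm.py` hardening can be approximated as inspecting a provider error for markers of a non-retryable request (HTTP 400, invalid prompt, policy violation) and returning a sentinel so the runner skips that sample instead of crashing. A hedged sketch, where the marker strings and helper name are assumptions rather than the PR's implementation:

```python
# Substrings that signal a request-level rejection (assumed, not exhaustive).
NON_RETRYABLE_MARKERS = ("400", "invalid prompt", "policy violation", "content_filter")


def safe_generate(call_llm, prompt):
    """Call the LLM; on a non-retryable request error, skip the sample.

    Returns the model output, or None when the prompt itself is rejected,
    so the caller can log the failure and continue the evaluation run.
    """
    try:
        return call_llm(prompt)
    except Exception as exc:  # provider SDKs raise varied exception types
        message = str(exc).lower()
        if any(marker in message for marker in NON_RETRYABLE_MARKERS):
            return None  # skip this sample; the run continues
        raise  # unrelated errors (timeouts, auth) still surface
```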

@Alex-q-z Alex-q-z self-requested a review February 19, 2026 06:18
@Alex-q-z Alex-q-z added the dataset New datasets or benchmarks label Feb 23, 2026