
Stream bench pr #28

Open

sruan2 wants to merge 3 commits into ace-agent:main from sruan2:stream-bench-pr

Conversation


sruan2 commented on Feb 19, 2026

Summary

  • Adds a full benchmarking framework (eval/stream-bench/) for evaluating ACE on text-to-SQL tasks using the StreamBench suite, supporting the BIRD, CoSQL, and Spider datasets
  • Extends ACE core (ace.py, reflector.py, llm.py, utils.py) with improvements needed for text-to-SQL evaluation: per-split data processors, SQL execution result passing to the reflector, more robust LLM error handling (HTTP 400 / invalid prompt recovery), and enriched run folder naming
  • Adds data download, preprocessing, training, playbook evaluation, and plotting scripts with pre-built configuration files for common task sizes and difficulty balancing strategies

Changes

New: eval/stream-bench/

| File | Purpose |
| --- | --- |
| `run.py` | Main ACE training/evaluation runner for text-to-SQL |
| `run_playbook.py` | Evaluate a saved playbook on any data split |
| `data_processor.py` | Data loading, schema formatting, and execution-based SQL accuracy |
| `download_text2sql_data.py` | Download raw BIRD/CoSQL/Spider SQLite databases |
| `preprocess_streambench_{bird,cosql,spider}.py` | Convert HuggingFace StreamBench data to `.jsonl` with schema |
| `dataset_stats.py` | Print dataset statistics |
| `plot.py` | Generate accuracy-over-time plots |
| `analyze_logs.py` | Analyze terminal output logs |
| `data/bird_config.json` | Pre-built BIRD task configurations |
| `data/cosql_config.json` | Pre-built CoSQL task configurations |
| `data/spider_config.json` | Pre-built Spider task configurations |
| `README.md` | Full setup and usage documentation |
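Execution-based SQL accuracy, as used by `data_processor.py`, is conventionally computed by running both the predicted and gold queries against the task's SQLite database and comparing result sets regardless of row order. A minimal sketch of that convention (function and parameter names are illustrative, not the PR's actual API):

```python
import sqlite3


def execution_accuracy(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted query yields the same rows as the gold query.

    Rows are compared as unordered multisets, the usual convention for
    execution-based text-to-SQL evaluation; invalid SQL counts as incorrect.
    """
    conn = sqlite3.connect(db_path)
    try:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # predicted query failed to execute
        gold_rows = conn.execute(gold_sql).fetchall()
        return sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows))
    finally:
        conn.close()
```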

Modified: ACE core

  • ace/ace.py: Support separate train_processor/val_processor/test_processor (backward-compatible with data_processor); pass metadata (db_name and curriculum) into run folder name; print correct/total counts alongside accuracy
  • ace/core/reflector.py: Accept and format SQL execution results (predicted_result, ground_truth_result) and pass them into the reflector prompt for richer feedback
  • ace/prompts/reflector.py: Add SQL execution results placeholder to both ground-truth and no-ground-truth reflector prompt templates
  • llm.py: Detect and gracefully handle HTTP 400 / invalid-prompt / policy-violation errors without crashing; skip the affected sample and continue execution
  • utils.py: Minor utilities supporting stream-bench evaluation
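The backward-compatible split handling in `ace/ace.py` presumably resolves each of the three splits to its dedicated processor when one is given, falling back to the legacy single `data_processor` otherwise. A hypothetical sketch of that resolution logic (names and signature are assumptions, not the PR's actual code):

```python
def resolve_processors(data_processor=None, train_processor=None,
                       val_processor=None, test_processor=None):
    """Prefer per-split processors; fall back to the legacy single processor.

    Returns a (train, val, test) tuple so existing callers that pass only
    data_processor keep working unchanged.
    """
    if data_processor is None and not (train_processor or val_processor or test_processor):
        raise ValueError("Provide data_processor or at least one per-split processor")
    return (
        train_processor or data_processor,
        val_processor or data_processor,
        test_processor or data_processor,
    )
```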
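The `llm.py` hardening can be approximated as inspecting a provider error for markers of a non-retryable request (HTTP 400, invalid prompt, policy violation) and returning a sentinel so the runner skips that sample instead of crashing. A hedged sketch, where the marker strings and helper name are assumptions rather than the PR's implementation:

```python
# Substrings that signal a request-level rejection (assumed, not exhaustive).
NON_RETRYABLE_MARKERS = ("400", "invalid prompt", "policy violation", "content_filter")


def safe_generate(call_llm, prompt):
    """Call the LLM; on a non-retryable request error, skip the sample.

    Returns the model output, or None when the prompt itself is rejected,
    so the caller can log the failure and continue the evaluation run.
    """
    try:
        return call_llm(prompt)
    except Exception as exc:  # provider SDKs raise varied exception types
        message = str(exc).lower()
        if any(marker in message for marker in NON_RETRYABLE_MARKERS):
            return None  # skip this sample; the run continues
        raise  # unrelated errors (timeouts, auth) still surface
```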

@Alex-q-z Alex-q-z self-requested a review February 19, 2026 06:18
@Alex-q-z Alex-q-z added the dataset New datasets or benchmarks label Feb 23, 2026