A growing number of emerging database management systems (such as time-series and streaming databases) have been developed to support specialized workloads with enhanced performance and functionality. However, these systems are often less mature than traditional relational databases, making them more prone to logic bugs and internal errors that affect correctness and reliability.
ACME is a research tool that addresses this through LLM-assisted differential testing. The key insight is that many emerging database systems are conceptually extensions of relational databases, so bugs can be uncovered by comparing query results against a more mature relational database. To bridge the syntax and semantic differences between the two systems, ACME uses Large Language Models to automatically discover supported clauses and generate clause mappings that translate system-specific SQL features into equivalent standard SQL.
ACME proceeds in three steps:
- On-the-fly clause mapping: whenever differential testing encounters a query that fails to translate or execute on the relational side, ACME sends a clause-mapping request to an LLM to repair the translation.
- Mapping verification: generated mappings are validated by executing test queries; verification failures trigger re-prompting, and verified mappings are saved as reusable Python functions.
- Diverse query generation: semantically equivalent but syntactically diverse queries are generated to broaden differential testing coverage.
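The three steps above form a generate-execute-translate-compare loop. A minimal sketch of that loop is shown below; the function names and callback signatures are illustrative, not ACME's actual API:

```python
# Illustrative sketch of ACME's differential-testing loop.
# All names here are hypothetical, not ACME's real interfaces.

def run_differential_test(generate_query, run_emerging, run_relational,
                          translate, request_mapping_from_llm):
    sql = generate_query()                  # emerging-DB-specific SQL
    expected = run_emerging(sql)            # execute on the emerging DB
    try:
        translated = translate(sql)         # apply saved clause mappings
        actual = run_relational(translated) # execute on the relational DB
    except Exception as err:
        # Step 1: on failure, ask the LLM for a new clause mapping;
        # the mapping is then verified (step 2) and persisted for reuse.
        request_mapping_from_llm(sql, err)
        return None
    # A mismatch between the two result sets is a candidate logic bug.
    return expected == actual
```

The comparison result drives bug triage: `False` flags a potential logic bug, while an exception on the relational side feeds the mapping-generation loop instead.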
ACME was applied to four widely used emerging database systems, uncovering 59 previously unknown bugs (17 logic bugs, 42 internal errors), of which 52 have been fixed and 5 confirmed by vendors. Evaluation shows that ACME enhances LLM reliability through query verification, and remains effective under tight constraints, whether using local models or a limited online token budget.
This repository demonstrates ACME on QuestDB (emerging, time-series) vs PostgreSQL (mature, relational) as a concrete example.
```
QueryGenerator ──→ DifferentialTester ──→ QuestDB (emerging DB)
 (emerging DB            │          │
  SQL + schema)      CodeMapper   Results ──→ Comparator
                         │          │             ↑
                  (mappings/*.py)  PostgreSQL ────┘
                         ↑
                 LLM (on mapping failure)
```
- QueryGenerator (`questdb_generator.py`): Generates QuestDB-specific DDL/DML/SELECT covering a broad range of clauses and edge cases.
- CodeMapper (`differential_tester.py`): Applies a chain of Python `transform(sql)` functions from `mappings/` to translate QuestDB SQL into valid PostgreSQL SQL.
- DifferentialTester (`differential_tester.py`): Orchestrates the pipeline: generate → execute on QuestDB → translate via CodeMapper → execute on PostgreSQL → compare result sets.
- LLM Integration: On PostgreSQL execution failure, an LLM generates a new mapping function. Verified functions are saved to `mappings/` for reuse across all future queries.
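Each file in `mappings/` exposes a `transform(sql)` function. A hypothetical example in that style, rewriting QuestDB DDL into PostgreSQL-compatible DDL (illustrative only; the rules ACME actually learns will differ):

```python
# Hypothetical mappings/*.py rule: rewrite one QuestDB construct into
# PostgreSQL-compatible SQL. Illustrative, not an ACME-learned mapping.
import re

# QuestDB column types with no direct PostgreSQL counterpart.
_TYPE_MAP = {"LONG": "BIGINT", "STRING": "TEXT", "SYMBOL": "TEXT"}

def transform(sql: str) -> str:
    # Drop QuestDB's designated-timestamp clause, e.g. "... ) timestamp(ts)".
    sql = re.sub(r"\)\s*timestamp\s*\(\s*\w+\s*\)", ")", sql, flags=re.I)
    # Map QuestDB column types to their PostgreSQL equivalents.
    for qdb, pg in _TYPE_MAP.items():
        sql = re.sub(rf"\b{qdb}\b", pg, sql, flags=re.I)
    return sql
```

Because every rule is just a pure string-to-string function, CodeMapper can chain verified rules in sequence and apply them to any future query without further LLM calls.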
- Docker & Docker Compose
- Python 3.10+
- Ollama (for local LLM inference)
```bash
docker-compose up -d
```

- PostgreSQL: port `5432`
- QuestDB: port `8812` (PG wire protocol), `9000` (web console)
```bash
# Install Ollama from https://ollama.com/
ollama pull qwen2.5-coder:7b-instruct
```

```bash
pip install -r requirements.txt
```

ACME works best in two phases. The mapping library starts empty and needs to be built up before large-scale parallel testing is effective.
The LLM learns to translate QuestDB constructs into PostgreSQL by observing execution errors and generating Python transform rules. Run single-threaded so rule generation is not disrupted by concurrent execution.
```bash
python3 main.py test --count 0 --seed 42
```

With a specific LLM provider:

```bash
# Local (Ollama)
python3 main.py test --offline

# Remote (OpenAI)
python3 main.py test --online
```

Stop when the LLM call rate drops noticeably; this indicates that most common constructs are already covered by saved mappings. All verified mappings are persisted to `mappings/*.py` automatically.
Once a sufficient mapping library exists, disable the LLM and run with many parallel threads for high-throughput bug finding.
```bash
python3 main.py test --count 0 --seed 42 --no-llm --threads 16
```

Any query that fails to map cleanly is logged to `test_logs/results.txt` rather than retried with the LLM, keeping throughput high.
```bash
python3 main.py test [options]
```
| Flag | Default | Description |
|---|---|---|
| `--count N` | 50 | Queries to run; 0 = infinite |
| `--seed N` | 42 | Random seed for reproducibility |
| `--threads N` | 1 | Parallel worker threads |
| `--offline` | — | Use Ollama (`qwen2.5-coder:7b-instruct`) |
| `--online` | — | Use OpenAI (`gpt-4o`) |
| `--llm PROVIDER` | `ollama` | LLM provider: `ollama` or `openai` |
| `--llm-model MODEL` | `qwen2.5-coder:7b-instruct` | Model name |
| `--no-llm` | — | Disable LLM; use saved mappings only |
| `--max-prompts N` | — | Stop after N LLM mapping requests |
| `--verbose` | — | Enable debug-level logging |
The following flags alter testing behaviour and are intended for controlled experiments, not for regular bug-finding runs.
| Flag | Description |
|---|---|
| `--static-data` | Replace random inserts with a fixed 100-row dataset. Use when you need deterministic, reproducible results (e.g. comparing two mapping strategies or measuring LLM effectiveness). Random data generally gives better coverage. |
| `--ablation` | Skip LLM-generated rule verification before applying rules. Use only to measure the impact of the verification step in ablation studies. |
| `--no-strict` | Relax result-set comparison during rule verification. Use in controlled experiments where approximate equivalence is sufficient. |
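The `--no-strict` flag implies the comparator supports two strictness levels. A minimal sketch of what such a comparison might look like (`rows_match` and its tolerance are assumptions for illustration, not ACME's actual comparator):

```python
# Illustrative two-level result-set comparison; not ACME's implementation.
# "Strict" demands exact multiset equality; "relaxed" (cf. --no-strict)
# tolerates small floating-point drift between the two engines.
import math
from collections import Counter

def rows_match(a, b, strict=True, rel_tol=1e-9):
    if strict:
        # Exact, order-insensitive multiset equality over whole rows.
        return Counter(map(tuple, a)) == Counter(map(tuple, b))
    if len(a) != len(b):
        return False
    def cell_eq(x, y):
        # Floats computed by two different engines may differ in the
        # last bits; compare them with a relative tolerance instead.
        if isinstance(x, float) or isinstance(y, float):
            return math.isclose(float(x), float(y), rel_tol=rel_tol)
        return x == y
    return all(len(r) == len(s) and all(cell_eq(x, y) for x, y in zip(r, s))
               for r, s in zip(sorted(a), sorted(b)))
```

Under this sketch, `0.1 + 0.2` vs `0.3` would fail the strict check but pass the relaxed one, which is exactly the class of spurious mismatch `--no-strict` is meant to suppress.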
| Path | Contents |
|---|---|
| `mappings/*.py` | Learned translation rules (persistent across runs) |
| `test_logs/results.txt` | All mismatches and mapping failures |
| `test_logs/token_stats.txt` | Per-call and session LLM token usage and cost estimates |
| `repro_scripts/` | Standalone Python scripts to reproduce each finding |
| File | Role |
|---|---|
| `main.py` | CLI entry point; manages Docker containers |
| `differential_tester.py` | Core engine: CodeMapper + DifferentialTester |
| `questdb_generator.py` | QuestDB SQL generator (emerging DB example) |
| `mappings/*.py` | Learned QDB→PG translation rules |
| `docker-compose.yml` | PostgreSQL and QuestDB service definitions |
| `tests/` | Unit and pipeline integration test suite |
Find more details in our FSE 2026 research paper: ACME: Automated Clause Mapping Engine for Differential Testing of Emerging Database Systems