Deterministic synthetic dataset generation for LLM tool-calling fine-tuning
Quick Start · Architecture · Examples · CLI Reference · Training
Fine-tuning LLMs to use tools reliably requires datasets that are diverse, structurally correct, and reproducible. Building these by hand is slow, error-prone, and impossible to audit at scale. Existing approaches — prompting a stronger model to generate training data, or manually writing hundreds of conversations — suffer from three fundamental problems:
- **Template explosion.** Without active detection, generators produce superficially varied examples that share identical conversation skeletons, trigram distributions, and response structures. The model memorizes patterns instead of learning to generalize.
- **Non-determinism.** Python's `hash()` is randomized per-process via `PYTHONHASHSEED`. Datasets generated with `hash()`-based seeding produce different outputs across runs, machines, and CI environments, making regression testing impossible.
- **Unbounded memory.** Naive approaches load the full dataset into memory for validation and splitting. At 100K+ examples, this becomes a bottleneck.
DataForge solves all three. It provides a streaming pipeline with constant RAM usage, SHA-256-based deterministic generation, and multi-layered anti-template detection using probabilistic data structures — Bloom filters and space-saving top-K counters — that consume a fixed ~8 MB regardless of dataset size.
| Feature | Description |
|---|---|
| Deterministic generation | SHA-256-based RNG. Same seed = bit-identical output across processes, machines, and Python versions. Not `hash()`. |
| Streaming pipeline | Generators yield examples one at a time. Validation, statistics, train/val splitting, and JSONL writing happen inline. Memory footprint is constant whether you generate 500 or 5 million examples. |
| SFT + DPO | Generate supervised fine-tuning conversations (single-tool, multi-turn, parallel tool calls) and direct preference optimization pairs from ranked contrastive sets. |
| Anti-template detection | Four detection layers: structural dedup (Bloom filter), conversation flow pattern dedup (Bloom filter), trigram overuse (top-K counter), and response length clustering (histogram). All fixed-size — ~8 MB total. |
| Quality gates | Seven configurable thresholds: minimum total, multi-turn ratio, no-tool restraint, parallel calls, tool coverage, closure ratio, error handling. Fail-fast with clear diagnostics. |
| Response style variation | Four built-in styles (professional, friendly, technical, concise) with weighted structural variation (full, no-closure, direct, monophrase). Custom styles via config. Prevents the model from learning a single surface pattern. |
| Error injection | Configurable base rate with non-linear burst zones. Five error types (timeout, empty, partial, permission, not_found). Models learn graceful recovery, not just the happy path. |
| Multi-format export | OpenAI (default), ShareGPT, ChatML. Conversion happens at write time — zero additional memory. |
| Content-hash splitting | Train/val assignment is based on SHA-256 of example content, not index. Adding or removing generators does not change which existing examples land in which split. |
| Dataset linting | `dataforge inspect` works on any JSONL dataset. Tool distribution, conversation patterns, response length distribution, template similarity — all in one command. |
| Dataset regression | `dataforge diff` compares two dataset versions. Detects drift in tool distribution, conversation patterns, and length distribution. Useful for CI pipelines. |
| Plugin system | Generators discoverable via local directories and Python entry_points. Third-party packages can be pip-installed and auto-register their generators. |
| Training scripts | QLoRA SFT, DPO training, and adapter merge included. Works with any HuggingFace-compatible model. |
```bash
# Install (Python 3.10+)
pip install -e .

# Generate the restaurant example dataset
cd examples/restaurant
dataforge generate --config config.yaml

# Inspect output quality
dataforge inspect output/restaurant-sft-train.jsonl

# Validate structural correctness against tool schema
dataforge validate output/restaurant-sft-train.jsonl --tools tools.json
```

```bash
# Core — generation, validation, inspection
pip install -e .

# With training dependencies (torch, transformers, peft, trl, bitsandbytes)
pip install -e ".[train]"

# With development dependencies (pytest)
pip install -e ".[dev]"

# Everything
pip install -e ".[all]"
```

Requires Python 3.10 or later. Core has only two dependencies: `pyyaml` and `pydantic`.
DataForge follows a four-step workflow: define tools, write generators, configure, generate.
Standard OpenAI function-calling format. DataForge uses these for validation — every tool call in the generated dataset is checked against this schema.
```json
[
  {
    "type": "function",
    "function": {
      "name": "search_menu",
      "description": "Search the restaurant menu by query, category, or dietary restriction",
      "parameters": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "Search query" },
          "category": { "type": "string", "enum": ["appetizers", "mains", "desserts", "drinks"] },
          "dietary": { "type": "string", "enum": ["vegetarian", "vegan", "gluten-free"] }
        },
        "required": ["query"]
      }
    }
  }
]
```

Generators are Python classes that yield training examples. Each generator owns a `category` string that isolates its RNG — the same index in different generators produces completely unrelated data.
```python
from dataforge import SFTGenerator, Example, make_rng
from dataforge import user_msg, tool_call_msg, tool_result_msg, assistant_msg


class MenuSearchGenerator(SFTGenerator):
    @property
    def category(self) -> str:
        return "menu_search"

    @property
    def name(self) -> str:
        return "Menu Search"

    def expected_count(self) -> int:
        return 100

    def generate(self):
        seed = self.config.get("seed", 42)
        for i in range(100):
            rng = make_rng(self.category, i, seed)
            query = rng.choice(["vegetarian options", "desserts", "daily specials"])
            msgs = [user_msg(f"What {query} do you have?")]
            msgs.append(tool_call_msg(
                "search_menu", {"query": query},
                prefix=self.category, rng=rng,
            ))
            call_id = msgs[-1]["tool_calls"][0]["id"]
            msgs.append(tool_result_msg(call_id, '{"results": ["Margherita Pizza", "Caesar Salad"]}'))
            msgs.append(assistant_msg(f"Here are our {query}: Margherita Pizza and Caesar Salad."))
            yield Example(messages=msgs)
```

Key design choices:
- `make_rng(category, idx, seed)` uses `hashlib.sha256`, not Python's `hash()`. The output is deterministic across processes, machines, and Python versions regardless of `PYTHONHASHSEED`.
- `tool_call_msg` generates unique call IDs with the pattern `call_{prefix}_{counter}_{hex4}`, where `prefix` is the generator category and `hex4` is random from the generator's RNG. This prevents ID collisions when concatenating independently generated datasets.
- Generators `yield` examples instead of returning lists. The pipeline processes each example inline — validation, statistics, JSONL writing — without ever holding the full dataset in memory.
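The ID scheme can be sketched in a few lines. This is an illustration of the pattern described above, not DataForge's actual helper:

```python
import random


def make_call_id(prefix: str, counter: int, rng: random.Random) -> str:
    # call_{prefix}_{counter}_{hex4}; hex4 comes from the generator's own
    # deterministic RNG, so IDs from different categories cannot collide
    # when independently generated datasets are concatenated.
    hex4 = f"{rng.randrange(16 ** 4):04x}"
    return f"call_{prefix}_{counter}_{hex4}"
```

Because the suffix is drawn from the generator's deterministic RNG, the same seed always yields the same IDs.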
```yaml
project_name: "restaurant"
seed: 42
tools_file: "tools.json"
system_prompt: "You are a helpful restaurant assistant."
generators_dir: "generators"
output_dir: "output"

dataset:
  train_split: 0.95

quality_gates:
  min_total: 500
  min_multi_turn: 30
  min_no_tool: 50
  min_parallel: 20
  max_closure_ratio: 0.65
  require_all_tools: true
  min_error_handling: 10

error_injection:
  enabled: true
  base_rate: 0.10
```

```console
$ dataforge generate --config config.yaml

DataForge v0.1.0 — Generating dataset: restaurant
Seed: 42 | Format: openai | Tools: 6

Discovered 5 SFT generator(s), 1 DPO generator(s)

Running: Menu Search (menu_search) — 120 examples (0.0s)
Running: Reservations (reservations) — 120 examples (0.0s)
Running: Order Management (order_management) — 120 examples (0.0s)
Running: Reviews & Restraint (reviews) — 100 examples (0.0s)
Running: Complex Scenarios (complex_scenarios) — 130 examples (0.0s)
Running: Restaurant DPO Pairs (restaurant_dpo) — 60 pairs (0.0s)

SFT: 590 examples (train: 555, val: 35)
DPO: 60 pairs

Quality Gates:
  [+] min_total: Total examples: 590 (minimum: 500)
  [+] min_multi_turn: Multi-turn examples: 172 (minimum: 30)
  [+] min_no_tool: No-tool examples: 60 (minimum: 50)
  [+] min_parallel: Parallel tool call examples: 100 (minimum: 20)
  [+] max_closure_ratio: Max structure ratio: 43.73% (maximum: 65%)
  [+] require_all_tools: All tools represented: 6/6
  [+] min_error_handling: Error handling examples: 40 (minimum: 10)

All quality gates passed.
```

Output:
```
output/
├── restaurant-sft-train.jsonl   # 555 training examples
├── restaurant-sft-val.jsonl     # 35 validation examples
├── restaurant-dpo-train.jsonl   # 60 preference pairs
└── restaurant-sft.meta.json     # Sidecar metadata
```
The `.meta.json` sidecar stores generation metadata (seed, timestamp, config hash, per-generator counts, quality gate results) alongside the JSONL. The JSONL itself is pure — one example per line, directly compatible with `datasets.load_dataset("json", ...)`.
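Because the file is plain JSONL, any consumer can also stream it with nothing but the standard library. A minimal sketch:

```python
import json


def iter_jsonl(path: str):
    # Yield one example per line without loading the whole file into memory —
    # the same streaming discipline the generation pipeline uses.
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```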
```
dataforge/
├── core/
│   ├── rng.py                   # SHA-256 deterministic RNG
│   ├── messages.py              # Message builders (user, assistant, tool_call, tool_result)
│   ├── styles.py                # Response style variation (4 styles × 4 structures)
│   ├── errors.py                # Error injection (5 types, burst zones)
│   └── types.py                 # Example, DPOPair, ContrastiveSet, DatasetStats
├── generation/
│   ├── base.py                  # SFTGenerator and DPOGenerator abstract base classes
│   ├── pipeline.py              # StreamingWriter + 5-phase pipeline orchestrator
│   ├── discovery.py             # Generator auto-discovery (directory + entry_points)
│   └── pools.py                 # Data pool generation helpers
├── validation/
│   ├── structural.py            # Per-example validation (role sequence, tool call matching)
│   ├── template_detection.py    # Bloom filters + TopK counter (fixed 8 MB RAM)
│   ├── quality_gates.py         # 7 configurable quality gates
│   └── stats.py                 # Incremental statistics tracker (capped dictionaries)
├── training/
│   ├── sft.py                   # QLoRA SFT training
│   ├── dpo.py                   # DPO training + contrastive-to-DPO conversion
│   └── merge.py                 # LoRA adapter merge
├── config.py                    # YAML config + Pydantic validation
└── cli.py                       # 8 CLI commands
```
Deterministic RNG. `make_rng(category, idx, seed)` derives a per-example seed from `SHA-256(f"{category}:{idx}:{seed}")`. This is immune to `PYTHONHASHSEED` randomization, produces zero cross-category correlation, and is trivially parallelizable — each example's RNG depends only on its own coordinates, not on the order of generation.
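The idea can be sketched in a few lines — a minimal illustration of the technique, not DataForge's exact implementation:

```python
import hashlib
import random


def make_rng(category: str, idx: int, seed: int) -> random.Random:
    # Derive a 64-bit per-example seed from SHA-256 of the coordinates.
    # Unlike hash(), this is stable across processes, machines, and
    # Python versions regardless of PYTHONHASHSEED.
    digest = hashlib.sha256(f"{category}:{idx}:{seed}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))
```

Same coordinates always give the same stream; changing any coordinate (category, index, or seed) gives an unrelated one.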
Streaming pipeline. The pipeline runs five phases in a single pass:
1. **Discover**: find generators in the `generators/` directory and installed `entry_points`
2. **Generate + validate + write**: for each yielded example, run structural validation, update the `TemplateChecker` and `StatsTracker`, and write to the appropriate split file
3. **DPO generation**: the same streaming pattern for preference pairs
4. **Quality gates**: evaluate accumulated statistics against the configured thresholds
5. **Metadata**: write the `.meta.json` sidecar
The `StreamingWriter` maintains two open file handles (train and val) and assigns each example based on a SHA-256 hash of its content. This content-hash split is stable: adding or removing generators, reordering them, or changing example counts does not reassign existing examples to different splits.
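A content-hash split can be sketched as follows (field names and canonicalization are illustrative; DataForge's internals may differ):

```python
import hashlib
import json


def split_of(example: dict, train_ratio: float = 0.95) -> str:
    # The bucket depends only on the example's own content, so adding,
    # removing, or reordering other examples never moves this one
    # across splits.
    payload = json.dumps(example, sort_keys=True).encode()
    bucket = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") / 2 ** 64
    return "train" if bucket < train_ratio else "val"
```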
Anti-template detection. The `TemplateChecker` operates on fixed-size data structures:
| Layer | Data Structure | Size | What It Detects |
|---|---|---|---|
| Structural dedup | Bloom filter (3 hashes) | 1 MB | Identical normalized responses |
| Flow pattern dedup | Bloom filter (3 hashes) | 1 MB | Identical conversation skeletons (e.g., USER→TOOL(search)→RESULT→ASSISTANT) |
| Trigram overuse | Space-saving top-K counter | ~5 MB | Overrepresented 3-word phrases (>15% threshold) |
| Length clustering | Fixed histogram (20 buckets) | ~160 B | Response length concentration (>40% in single bucket) |
Total: ~8 MB regardless of dataset size. The Bloom filters have a false positive rate of ~0.1% at 1M items and ~1% at 10M — acceptable for advisory warnings.
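The dedup layers rest on a standard Bloom filter: a fixed bit array with no false negatives and a tunable false positive rate. A simplified sketch of the structure (parameters are illustrative, not DataForge's exact sizing):

```python
import hashlib


class BloomFilter:
    """Fixed-size membership sketch: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 8_000_000, num_hashes: int = 3):
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)  # 8M bits = 1 MB, forever
        self.k = num_hashes

    def _positions(self, item: str):
        # Derive k independent bit positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> bool:
        """Add item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen
```

A duplicate conversation skeleton trips the filter on its second appearance, at constant memory no matter how many examples have streamed through.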
Quality gates. Seven gates with sensible defaults:
| Gate | Default | What It Checks |
|---|---|---|
| `min_total` | 500 | Minimum total examples |
| `min_multi_turn` | 30 | Minimum multi-turn conversations |
| `min_no_tool` | 50 | Minimum no-tool restraint examples (model must learn to refuse) |
| `min_parallel` | 20 | Minimum parallel tool call examples |
| `max_closure_ratio` | 0.65 | No single response structure exceeds 65% of total |
| `require_all_tools` | true | Every defined tool appears at least once |
| `min_error_handling` | 10 | Minimum error handling examples |
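The gate check itself reduces to comparing accumulated statistics against thresholds. A simplified sketch (stat field names are illustrative, not DataForge's actual schema):

```python
def check_gates(stats: dict, gates: dict) -> list[str]:
    """Return a list of failure messages; an empty list means all gates pass."""
    failures = []
    for gate, threshold in gates.items():
        if gate.startswith("min_"):
            # e.g. min_total -> stats["total"], min_multi_turn -> stats["multi_turn"]
            value = stats.get(gate[4:], 0)
            if value < threshold:
                failures.append(f"{gate}: {value} < {threshold}")
        elif gate == "max_closure_ratio":
            ratio = stats.get("closure_ratio", 0.0)
            if ratio > threshold:
                failures.append(f"{gate}: {ratio:.2%} > {threshold:.0%}")
    return failures
```

Fail-fast then means: if the list is non-empty, abort with the messages as diagnostics.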
Response style variation. Four built-in styles × four structural patterns = 16 combinations. Each example randomly selects a style and structure:
- Styles: professional, friendly, technical, concise (each with distinct greeting, closing, and error phrasing)
- Structures: full (greeting + body + closing, 35%), no_closure (greeting + body, 25%), direct (body only, 30%), monophrase (single sentence, 10%)
Custom styles can be defined in config.yaml. This prevents the model from learning a fixed response template.
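The weighted structural choice can be sketched with `random.choices`, using the weights listed above (the real selection logic may differ):

```python
import random

# Structural patterns and weights from the list above.
STRUCTURES = ["full", "no_closure", "direct", "monophrase"]
WEIGHTS = [0.35, 0.25, 0.30, 0.10]


def pick_structure(rng: random.Random) -> str:
    # Deterministic given the example's own RNG: same seed, same structure.
    return rng.choices(STRUCTURES, weights=WEIGHTS, k=1)[0]
```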
Error injection. Five error types (timeout, empty, partial, permission, not_found) are injected at a configurable base rate (default 10%) with non-linear burst zones at ~13%, ~42%, ~71%, and ~93% of the dataset. Burst zones elevate the error rate to 3x base, creating realistic clusters of failures. Error injection is deterministic — same seed, same errors, same positions.
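The burst-zone rate function might look like this (the zone width is an assumption; only the centers and the 3x multiplier come from the description above):

```python
import random

BURST_CENTERS = [0.13, 0.42, 0.71, 0.93]  # fractions of the dataset
BURST_WIDTH = 0.03                         # assumed half-width of each zone


def error_rate(position: float, base_rate: float = 0.10) -> float:
    # Inside a burst zone the rate triples; elsewhere it stays at base.
    in_burst = any(abs(position - c) < BURST_WIDTH for c in BURST_CENTERS)
    return base_rate * 3 if in_burst else base_rate


def should_inject(rng: random.Random, idx: int, total: int) -> bool:
    # Drawn from the example's deterministic RNG: same seed, same errors,
    # same positions.
    return rng.random() < error_rate(idx / total)
```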
Two complete, production-ready examples are included. Both generate structurally valid datasets with all quality gates passing.
6 tools: `search_menu`, `get_dish_details`, `make_reservation`, `check_availability`, `get_order_status`, `submit_review`
5 SFT generators + 1 DPO generator:
| Generator | Examples | Pattern |
|---|---|---|
| Menu Search | 120 | Single tool calls with dietary/category filtering |
| Reservations | 120 | Multi-turn: check availability → book → confirm |
| Order Management | 120 | Parallel tool calls: check multiple orders simultaneously |
| Reviews & Restraint | 100 | No-tool restraint: politely refuse out-of-scope requests |
| Complex Scenarios | 130 | Multi-tool chains: search → detail → reserve → review |
| DPO Pairs | 60 | Preference pairs: good vs. bad tool usage and response quality |
Total: 590 SFT examples + 60 DPO pairs.
```bash
cd examples/restaurant
dataforge generate --config config.yaml
```

6 tools: `search_tickets`, `create_ticket`, `update_ticket`, `search_knowledge_base`, `get_customer_info`, `escalate_ticket`
5 SFT generators + 1 DPO generator:
| Generator | Examples | Pattern |
|---|---|---|
| Ticket Search | 120 | Single tool calls with status/priority filtering |
| Ticket Creation | 120 | Multi-turn intake: gather info → create → confirm |
| Knowledge Base | 100 | KB search with follow-up recommendations |
| Escalation & Restraint | 110 | Complex decision: escalate vs. resolve + no-tool refusal |
| Analytics & Parallel | 120 | Parallel: customer info + ticket search simultaneously |
| DPO Pairs | 60 | Preference pairs for response quality and tool usage |
Total: 570 SFT examples + 60 DPO pairs.
```bash
cd examples/customer_support
dataforge generate --config config.yaml
```

```bash
dataforge init my-assistant
cd my-assistant
# Edit tools.json, create generators, then:
dataforge generate --config config.yaml
```

Generate a dataset from a config file.
```bash
dataforge generate --config config.yaml
dataforge generate --config config.yaml --format sharegpt
dataforge generate --config config.yaml --dry-run
```

| Flag | Description |
|---|---|
| `--config`, `-c` | Path to the config file (default: `config.yaml`) |
| `--format` | Export format: `openai` (default), `sharegpt`, `chatml` |
| `--dry-run` | Discover generators and print expected counts without generating |
Validate the structural correctness of any JSONL dataset.
```bash
dataforge validate dataset.jsonl
dataforge validate dataset.jsonl --tools tools.json
```

Checks: JSON parse errors, role sequence validity (user/assistant/tool ordering), tool call ID matching (every call has a result, every result has a call), and tool name validation against the schema.
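The call/result matching check reduces to a set comparison over IDs. A simplified sketch against OpenAI-format messages (not DataForge's actual validator):

```python
def check_tool_call_ids(messages: list[dict]) -> list[str]:
    # Every assistant tool call needs a tool result, and vice versa.
    calls = {tc["id"] for m in messages for tc in m.get("tool_calls") or []}
    results = {m["tool_call_id"] for m in messages if m.get("role") == "tool"}
    errors = [f"call {i} has no result" for i in sorted(calls - results)]
    errors += [f"result {i} has no matching call" for i in sorted(results - calls)]
    return errors
```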
Dataset quality report — works on any JSONL dataset, not just DataForge-generated ones.
```bash
dataforge inspect output/restaurant-sft-train.jsonl
```

Reports: total examples, average messages and tokens per example, tool usage distribution, conversation pattern breakdown (single-turn, multi-turn, no-tool, parallel, error handling), template similarity analysis, and quality gate status.
Compare two dataset versions for quality drift.
```bash
dataforge diff v1-sft-train.jsonl v2-sft-train.jsonl
```

Reports: total example delta, average token delta, per-tool distribution changes (including new/removed tools), and pattern changes (multi-turn, no-tool, parallel, error handling ratios).
Preview random examples in human-readable format.
```bash
dataforge sample output/restaurant-sft-train.jsonl --n 5 --seed 42
```

Scaffold a new project with `config.yaml`, `tools.json`, `generators/`, and `data_pools.py`.

```bash
dataforge init my-project
```

Train a QLoRA SFT adapter.
```bash
dataforge train sft --config config.yaml --dataset output/restaurant-sft-train.jsonl
dataforge train sft --config config.yaml --dataset output/restaurant-sft-train.jsonl --dry-run
```

Requires `pip install "dataforge[train]"`. Targets attention layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and MLP layers by default. Configurable in `config.yaml` under `training:`.
Train a DPO adapter on top of an SFT adapter.
```bash
dataforge train dpo --config config.yaml \
  --adapter output/sft-adapter \
  --dataset output/restaurant-dpo-train.jsonl
```

Merge a LoRA adapter into the base model.
```bash
dataforge merge --base Qwen/Qwen2.5-7B-Instruct --adapter output/sft-adapter
```

DataForge includes training scripts that work with any HuggingFace-compatible model. Configure training parameters in `config.yaml`:
```yaml
training:
  model: "Qwen/Qwen2.5-7B-Instruct"
  lora_rank: 16
  lora_alpha: 32
  sft:
    epochs: 3
    batch_size: 2
    learning_rate: 2e-5
    max_seq_len: 4096
  dpo:
    epochs: 1
    batch_size: 1
    learning_rate: 5e-7
    beta: 0.1
```

The training pipeline:
1. **SFT**: QLoRA with 4-bit quantization via `bitsandbytes`. Targets attention + MLP layers. Produces a LoRA adapter.
2. **DPO** (optional): Trains on preference pairs using the SFT adapter as base. Supports direct DPO pairs and automatic conversion from ranked contrastive sets.
3. **Merge**: Merges the LoRA adapter into the base model for deployment.
External packages can register generators via Python entry points:
```toml
# In your package's pyproject.toml
[project.entry-points."dataforge.generators"]
medical = "my_package.generators:MedicalSearchGenerator"
```

Installed generators are auto-discovered alongside local generators. The discovery system enforces unique category strings — duplicate categories within the same source raise a `ValueError` to protect RNG determinism.
Generator discovery is lazy: modules are imported only when the generator runs, not at CLI startup. This keeps `dataforge inspect` and `dataforge validate` fast even in projects with GPU-dependent generators.
DataForge guarantees bit-identical output for the same configuration:
```console
$ dataforge generate --config config.yaml
$ md5sum output/restaurant-sft-train.jsonl
864b5057db22ec5d56a50c76bdf1b3a4  output/restaurant-sft-train.jsonl

$ dataforge generate --config config.yaml
$ md5sum output/restaurant-sft-train.jsonl
864b5057db22ec5d56a50c76bdf1b3a4  output/restaurant-sft-train.jsonl
```

This holds across processes, machines, and Python versions (3.10+). The determinism derives from three properties:

1. `make_rng` uses `hashlib.sha256`, not `hash()` (which is randomized per-process)
2. The train/val split uses `hashlib.sha256` of example content, not example index
3. Call IDs include a deterministic hex suffix from the generator's own RNG
Contributions are welcome. Please open an issue before starting significant work.
```bash
# Development setup
git clone https://github.com/Kemb6163/dataforge.git
cd dataforge
pip install -e ".[all]"
pytest tests/ -v
```

Apache 2.0. See LICENSE.
Built by Nicola Cucurachi