Built on AgentField (async-parallel multi-agent runtime) and Deep Lake (multimodal data substrate). Open source.
Foundation models for robotics (OpenVLA, Octo, NVIDIA GR00T, π-0) learn from teleoperation demonstrations paired with language. Most production VLAs today train on flat single-string instructions inherited from Open X-Embodiment. The frontier (π-0.5, ECoT, Hi Robot) is moving toward hierarchical sub-task annotation because flat labels cap generalization. Annotation is a known bottleneck for the field: human labeling is accurate but doesn't scale to millions of episodes, single-pass VLM annotation hallucinates, and research-grade multi-pass pipelines like ECoT (+28% on OpenVLA) exist but aren't productized.
Roboscribe is the productized version of that third category: a deeply-cascaded multi-agent architecture that gives you ECoT-style hierarchical annotation with built-in cross-modal verification, audit trails, and versioned storage. Every episode runs through a 5-phase cognitive cascade that fans out into roughly 24 specialized agents per episode, scaling to 200 to 300 working concurrently across a corpus pass: per-keyframe object detectors, per-modality reasoners, meta-spawned segment narrators, cross-modal verifiers, embedding workers, and substrate writers, all dispatched in parallel through an async control plane. One complete annotation (episode goal, sub-task segments, per-modality phase tags, cross-modal verification, semantic embedding) finishes in about 16 seconds at roughly $0.002. Output lands as versioned, hybrid-searchable Deep Lake branches ready to feed any LeRobot, OpenVLA, Octo, or GR00T-compatible training pipeline.
Dataset-agnostic. PushT and Aloha bimanual adapters ship with the repo; adding any new LeRobot-format robot is about 30 lines.
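For a sense of what those ~30 lines could look like, here is a rough sketch of an adapter. The `TaskAdapter` name comes from the feature list further down, but the specific methods and fields are illustrative assumptions, not the repo's actual interface.

```python
# Hypothetical adapter sketch. The protocol name `TaskAdapter` comes from the
# feature list in this README, but these methods and fields are illustrative
# assumptions, not the repo's actual interface.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EpisodeBundle:
    """Raw material the annotation cascade needs for one episode (assumed shape)."""
    episode_id: int
    keyframes: list = field(default_factory=list)   # sampled frames from the episode video
    actions: list = field(default_factory=list)     # per-step action vectors
    states: list = field(default_factory=list)      # per-step proprioceptive states


class TaskAdapter(Protocol):
    repo_id: str                                     # HuggingFace LeRobot dataset id

    def load_episode(self, episode_id: int) -> EpisodeBundle: ...
    def canonical_object_hints(self) -> list[str]: ...


class MyRobotAdapter:
    """Adapter for a hypothetical LeRobot-format dataset."""
    repo_id = "my-org/my-robot-dataset"

    def load_episode(self, episode_id: int) -> EpisodeBundle:
        # Pull frames / actions / states for one episode out of the LeRobot
        # dataset and hand them to the cascade in the bundle shape above.
        return EpisodeBundle(episode_id=episode_id)

    def canonical_object_hints(self) -> list[str]:
        # Task-specific priors that help the per-keyframe object detectors.
        return ["gripper", "target object"]
```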
Per episode:
- Episode-level grain: one-sentence goal, outcome tag, canonical objects
- Sub-task segmentation: temporal boundaries, verb-phrase labels, per-segment phase tag, objects involved
- Independent modality analysis: separate visual and action expert chains, each producing its own motion phase classification
- Cross-modal verification: a third agent reconciles the two modality stories and emits a consistency score and anomaly flag
- Quality signals: auto-set `human_review_recommended` flag for ambiguous episodes
- Semantic embedding: 1024-dim vector of the scene description, queryable via TQL `cosine_similarity`
Per corpus:
- A versioned Deep Lake branch with the above attached as sibling columns on the same row as the raw episode data (frames, actions, states)
- Hybrid TQL queries combining vector similarity, text `CONTAINS`, numeric filters, and structured equality in one statement (sketched right after this list)
- Branch comparison between annotation passes: diff-able, rollback-able, A/B-testable
- PyTorch streaming via `ds.pytorch()` straight into a training loop, no intermediate format
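To make the query and streaming surface concrete, here is a rough sketch roughly following the Deep Lake 3.x Python client. The dataset path, branch name, and column names are assumptions based on the row example below, the query embedding assumes the BGE-large-en model named in the credits, and exact TQL syntax may differ between Deep Lake versions.

```python
# Sketch only: dataset path, branch, and column names are assumptions based on
# the annotated-row example below; TQL details may vary by Deep Lake version.
import deeplake
from sentence_transformers import SentenceTransformer

ds = deeplake.load("hub://my-org/pusht-annotated")   # hypothetical dataset path
ds.checkout("annotation-pass-2")                     # hypothetical annotation branch

# 1024-dim query vector, assuming the same BGE-large-en model used for annotation.
embedder = SentenceTransformer("BAAI/bge-large-en")
query_vec = embedder.encode("robot pushes a block toward a target").tolist()
vec_literal = ", ".join(str(x) for x in query_vec)

# One statement mixing text CONTAINS, a numeric filter, and vector similarity.
results = ds.query(f"""
    SELECT * WHERE
        CONTAINS(scene_summary, 'target') AND
        consistency_score > 0.8
    ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{vec_literal}]) DESC
    LIMIT 5
""")

# Stream straight into a training loop, no intermediate format.
for batch in ds.pytorch(batch_size=8, shuffle=True):
    pass  # frames, actions, states, and annotation columns arrive together
```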
Here's what one annotated row looks like:
```json
{
"episode_id": 0,
"grain_episode": {
"goal": "Push the gray T-shaped block into alignment with the green target outline",
"outcome": "successful"
},
"grain_segments": [
{ "start_frame": 0, "end_frame": 21, "phase": "approach", "label": "approach the block", "objects_involved": ["gray t-shaped block", "blue circular agent"] },
{ "start_frame": 21, "end_frame": 40, "phase": "manipulate", "label": "push toward target", "objects_involved": ["gray t-shaped block", "green target outline"] },
{ "start_frame": 40, "end_frame": 49, "phase": "manipulate", "label": "align with target outline", "objects_involved": ["gray t-shaped block", "green target outline"] }
],
"visual_motion": { "phase": "manipulate", "rationale": "agent maintains contact as the block moves toward the target" },
"action_motion": { "phase": "approach", "rationale": "trajectory shows directed motion without sustained contact signatures" },
"consistency": { "consistency_score": 0.2, "anomaly_flag": "phase_disagreement" },
"quality_signals": { "human_review_recommended": true },
"scene": { "canonical_objects": ["gray t-shaped block", "green target outline", "blue circular agent"], "scene_summary": "…" }
}
```

That entire record is one row in a Deep Lake branch, joined to the raw episode by `episode_id`.
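The stack table below notes structured Pydantic outputs at every leaf. As a rough illustration, the row above maps onto models like the following sketch, with field names taken from the JSON rather than from the repo's actual classes.

```python
# Sketch of structured-output models matching the row above. Field names come
# from the JSON example; the repo's actual Pydantic classes may be named and
# organized differently.
from pydantic import BaseModel


class Segment(BaseModel):
    start_frame: int
    end_frame: int
    phase: str                       # e.g. "approach" / "manipulate"
    label: str                       # verb-phrase sub-task label
    objects_involved: list[str]


class ModalityMotion(BaseModel):
    phase: str
    rationale: str


class Consistency(BaseModel):
    consistency_score: float         # 0..1 agreement between the two modalities
    anomaly_flag: str | None = None  # e.g. "phase_disagreement"


class EpisodeAnnotation(BaseModel):
    episode_id: int
    goal: str
    outcome: str
    grain_segments: list[Segment]
    visual_motion: ModalityMotion
    action_motion: ModalityMotion
    consistency: Consistency
    human_review_recommended: bool
```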
demo.mp4
The AgentField control plane UI rendering the multi-agent cascade live. Each node is one agent invocation with its prompt, output, latency, and cost. A 10-episode annotate_corpus run produces roughly 240 nodes across the 5 phases of the cascade.
| Layer | Tool | What it brings to this build |
|---|---|---|
| Multi-agent runtime | AgentField | Async-parallel agent orchestration at scale. Each episode fans out into roughly 24 cooperating agents, 200 to 300 across a 10-episode corpus pass, all dispatched through the control-plane queue. Agents run as microservices with `asyncio.gather` at every layer where work is independent: a 5-phase, depth-5 cognitive cascade with per-keyframe vision fan-out, per-segment narrator spawning, and dual-modality thread parallelism, all in one ~16-second pipeline. Per-request model overrides, structured Pydantic outputs at every leaf. |
| Multimodal substrate | Deep Lake | Image bytes + Float32 tensors + text + embeddings + scalars in one row, joined by episode_id. Versioned branches per annotation pass. Hybrid TQL combining vector similarity, text CONTAINS, numeric filters, and structured equality in a single statement. PyTorch streaming dataloader directly into training loops. |
| Dataset format | HuggingFace LeRobot | The de-facto modern robotics dataset format, the same shape that NVIDIA GR00T / Isaac Lab / Cosmos, OpenVLA, Octo, and π-0 all consume. |
| Vision + reasoning models | OpenRouter → Qwen3-VL family (default) or NVIDIA Nemotron Nano VL (drop-in) | One env-var swap routes the entire stack through NVIDIA Nemotron instead of Qwen3-VL. |
A 25-frame benchmark on real LeRobot/PushT data, four modes running in parallel, ground truth = expert (235B) predictions:
| Mode | Accuracy | Cost per frame | Cost per 1,000 frames |
|---|---|---|---|
| Scout-only (Qwen3-VL-8B) | 93% | $0.00010 | $0.10 |
| Mid-tier (Qwen3-VL-32B) | 97% | $0.00010 | $0.10 |
| Smart-routed (8B scout → 235B expert on doubt) | 98% | $0.00013 | $0.13 |
| Expert-only (Qwen3-VL-235B) | 100% | $0.00021 | $0.21 |
Total parallel wall time for 100 jobs across 4 modes: 16.6 seconds.
The Smart-routed mode is the production sweet spot. It runs every frame through the cheap 8B scout first, then escalates to the 235B flagship only when confidence signals fire (low confidence reported by the scout, or unexpectedly few objects detected, which is a robust proxy for occlusion). It escalates 16% of frames and closes 10 of the 13 errors the 8B-only mode would have made, at 62% of the cost of running everything through the flagship.
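As a sketch, that routing decision reduces to a couple of cheap checks on the scout's own output before paying for the flagship. The model ids, thresholds, and output fields below are illustrative assumptions, not the repo's actual configuration.

```python
# Illustrative sketch of confidence-gated scout -> expert escalation.
# Model ids are hypothetical OpenRouter slugs; thresholds and the output
# fields (confidence, objects) are assumptions, not the repo's actual values.
SCOUT = "qwen/qwen3-vl-8b"           # hypothetical id for the cheap scout
EXPERT = "qwen/qwen3-vl-235b"        # hypothetical id for the flagship expert

CONFIDENCE_FLOOR = 0.8
MIN_EXPECTED_OBJECTS = 2             # unexpectedly few detections is a proxy for occlusion


def annotate_frame(frame, call_model):
    """Run the cheap scout first, escalate to the expert only on doubt."""
    scout_out = call_model(SCOUT, frame)

    doubtful = (
        scout_out["confidence"] < CONFIDENCE_FLOOR
        or len(scout_out["objects"]) < MIN_EXPECTED_OBJECTS
    )
    if not doubtful:
        return scout_out              # most frames stop here at scout cost

    return call_model(EXPERT, frame)  # ~16% of frames escalate to the flagship
```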
A separate audit-grade benchmark (4 hand-labelled frames × 7 objective questions × 3 models = 84 predictions) gave the same architecture 89% / 89% / 100% across the three models in 4.1 seconds parallel wall time.
A deep, 5-phase cognitive cascade. Every episode goes through:
- Extract: keyframes and action trajectories are loaded and prepared in parallel.
- Analyze: two structurally independent modality threads run concurrently. The visual thread fans out per-keyframe object detectors, scene composers, and motion classifiers. The action thread runs trajectory phasing, velocity profiling, and per-modality phase classification.
- Segment: a temporal segmenter detects sub-task boundaries, then meta-spawns one narrator agent per detected segment at runtime. The cascade re-shapes itself based on what each episode actually contains.
- Verify: a cross-modal verifier reconciles the two modality stories. When they disagree, the discrepancy becomes a signal and the episode is automatically flagged for human review.
- Synthesize: the episode goal is composed, the scene description is embedded (1024-dim), and the structured annotation row is committed to a versioned Deep Lake branch joined to the raw episode by `episode_id`.
Across these five phases the system dispatches roughly 24 cooperating agents per episode, 200 to 300 across a 10-episode corpus pass, all driven through AgentField's async control plane. The architectural choice that earns the accuracy: two structurally separate modality experts that cannot conspire to hallucinate the same wrong story. A single multimodal call sees vision and action as one prompt and glosses over disagreements. Roboscribe analyzes them in parallel, then asks a third agent whether they cohere.
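A stripped-down sketch of that fan-out pattern, with plain asyncio coroutines standing in for what are actually AgentField agent invocations; the phase functions here are placeholder stubs, not the repo's real implementations.

```python
# Minimal sketch of the per-episode fan-out pattern behind the 5-phase cascade.
# The phase functions are placeholder stubs; in the actual system each step is
# an AgentField agent invocation dispatched through the control plane.
import asyncio


async def detect_objects(keyframe):          # per-keyframe vision fan-out (stub)
    return {"objects": ["block", "agent"]}

async def analyze_actions(trajectory):       # action-modality thread (stub)
    return {"phase": "approach"}

async def segment(visual, action):           # temporal segmenter (stub)
    return [{"label": "approach the block"}, {"label": "push toward target"}]

async def narrate(seg):                      # one narrator spawned per segment (stub)
    return f"narration for {seg['label']}"

async def verify(visual, action):            # cross-modal verifier (stub)
    return {"consistency_score": 0.9}


async def annotate_episode(keyframes, trajectory):
    # Analyze: two independent modality threads run concurrently; the visual
    # thread fans out one detector per keyframe.
    visual, action = await asyncio.gather(
        asyncio.gather(*[detect_objects(k) for k in keyframes]),
        analyze_actions(trajectory),
    )

    # Segment, then meta-spawn one narrator per detected segment at runtime.
    segments = await segment(visual, action)
    narrations = await asyncio.gather(*[narrate(s) for s in segments])

    # Verify: reconcile the two modality stories into a consistency score.
    consistency = await verify(visual, action)
    return {"narrations": narrations, "consistency": consistency}


if __name__ == "__main__":
    print(asyncio.run(annotate_episode(keyframes=[0, 1, 2], trajectory=[])))
```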
```bash
git clone https://github.com/Agent-Field/roboscribe-af
cd roboscribe-af
cp .env.example .env        # paste OPENROUTER_API_KEY
docker compose up --build   # ~1s rebuild via uv cache after first build
```

Once the stack is up, open http://localhost:8080/ui/ in your browser. That's the AgentField control plane UI: watch the multi-agent cascade execute live as runs come in. Every agent invocation shows up as a node in the workflow graph with its prompt, inputs, outputs, latency, and cost.
```bash
# Annotate a corpus in parallel (10 episodes, ~17 seconds wall, ~$0.02 total)
curl -X POST http://localhost:8080/api/v1/execute/async/roboscribe-af.annotate_corpus \
  -H 'Content-Type: application/json' \
  -d '{"input": {"episode_ids": [0,1,2,3,4,5,6,7,8,9], "concurrency": 4}}'

# Search the annotated corpus by meaning
curl -X POST http://localhost:8080/api/v1/execute/async/roboscribe-af.vector_search_episodes \
  -H 'Content-Type: application/json' \
  -d '{"input": {"query": "robot pushes a block toward a target", "top_k": 5}}'
```

Open the AgentField UI alongside these curls and you'll see the agents lighting up across the 5 phases of the cascade in real time. A 10-episode corpus pass produces 200 to 300 nodes in the workflow graph as the cascade fans out.
Reproduce the benchmark numbers above:
```bash
python3 scripts/bench25.py
```

Swap the entire stack to NVIDIA Nemotron Nano VL with one env-var:

```bash
AI_MODEL=openrouter/nvidia/nemotron-nano-12b-v2-vl:free docker compose up --build
```

Shipped:
- Hierarchical annotations (episode → segments → motion phases → objects)
- Independent visual + action analysis with cross-modal reconciliation
- Cost-aware scout/expert routing with confidence-gated escalation
- Deep Lake substrate with versioned annotation branches
- Hybrid TQL queries (text + structured + vector cosine similarity)
- Semantic search via 1024-dim text embeddings
- Parallel agent dispatch through async control-plane queue (~24 agents/episode, 200+ in a corpus pass)
- Branch-vs-branch annotation diff
- Pluggable `TaskAdapter` protocol with bundled PushT and Aloha bimanual adapters
- NVIDIA Nemotron Nano VL drop-in compatibility
- Open-source vision and reasoning models via OpenRouter
In progress:
- PyTorch streaming into LeRobot's ACT / Diffusion Policy training loop
- Real video loading for any LeRobot dataset (AV1 decoded full episodes)
- NVIDIA Cosmos Tokenizer integration for video-frame embeddings
- Multi-step manipulation tasks (pick-place, pour, fold)
- Long-horizon hierarchical segmentation (multi-level grain)
- Modal scale-out for million-episode runs
Built on the open-source work of:
- AgentField: async-parallel multi-agent runtime
- Activeloop Deep Lake: multimodal data substrate with versioned branches and hybrid TQL
- HuggingFace LeRobot: universal robotics dataset format
- Qwen team: Qwen3-VL open-source vision models
- NVIDIA: Nemotron Nano VL drop-in compatibility; LeRobot-format alignment with GR00T / Isaac Lab / Cosmos Physical AI stack
- BAAI: BGE-large-en embeddings for semantic search
- SEC-AF: AI-native security auditor
- PR-AF: agentic PR reviewer
- Contract-AF: legal contract risk analyzer
- Reactive-Atlas: MongoDB to AI enrichment pipeline

