
# LingBot-World

**Image-to-Video World Model with Camera Pose Control**


[Architecture](#architecture) • [Installation](#installation) • [Quick Start](#quick-start) • [API Reference](#api-reference) • [Configuration](#configuration) • [Gradio UI](#gradio-ui)


## Overview

LingBot-World transforms static images into cinematic videos using a state-of-the-art diffusion model with optional camera trajectory control. Built on the NF4-quantized LingBot-World model, this service provides serverless GPU inference via Modal with automatic scaling.

### Key Features

- **Image-to-Video Generation** – Generate 5-60 second videos from a single image
- **Camera Control** – Optional camera trajectory for cinematic motion (pan, zoom, orbit)
- **NF4 Quantization** – Optimized for ~32GB VRAM (A100-80GB)
- **Serverless Deployment** – Auto-scaling GPU inference via Modal
- **REST API** – FastAPI endpoint with OpenAPI documentation
- **Gradio UI** – Interactive web interface for easy experimentation

## Architecture

```mermaid
sequenceDiagram
    participant C as Client
    participant API as FastAPI (Modal)
    participant GPU as A100-80GB
    participant HF as HuggingFace Hub
    participant VOL as Modal Volume

    Note over C: Upload image + prompt
    C->>API: POST /generate
    API->>VOL: Check model cache
    alt Model not cached
        VOL->>HF: Download NF4 weights (~30GB)
        HF-->>VOL: Store in persistent volume
    end
    API->>GPU: Load WanI2V_PreQuant pipeline
    GPU->>GPU: Diffusion sampling (40-70 steps)
    Note over GPU: High/Low noise model swapping
    GPU->>GPU: VAE decode → video frames
    GPU-->>API: Return MP4 bytes
    API-->>C: video/mp4 response

    Note over C,API: ~2-10 min depending on resolution/frames
```

### Key Optimizations

| Optimization | Benefit |
| --- | --- |
| GPU Memory Snapshots | Cold starts reduced from ~90s to ~10s |
| Persistent Volume Caching | Model weights cached across restarts |
| 15-min Scaledown Window | Containers stay warm between requests |
| Concurrent Processing | 2 requests per GPU container |
| NF4 Quantization | ~3.9x compression of the quantized diffusion models (total weights ~85GB → ~30GB) |
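These behaviors map onto a handful of Modal settings. The sketch below shows roughly how, but it is illustrative rather than a copy of the repository's `inference.py`: the app, volume, and class names are made up, and GPU memory snapshotting is an experimental Modal feature, so verify the exact flags against the current Modal docs.

```python
# Illustrative sketch of the optimizations above as Modal settings.
# App/volume/class names are hypothetical; check the Modal docs for
# the current (experimental) GPU snapshot API before relying on this.
import modal

app = modal.App("lingbot-world")
weights = modal.Volume.from_name("lingbot-weights", create_if_missing=True)

@app.cls(
    gpu="A100-80GB",
    volumes={"/models": weights},       # persistent weight cache across restarts
    secrets=[modal.Secret.from_name("hf-secret")],
    scaledown_window=15 * 60,           # keep containers warm for 15 minutes
    enable_memory_snapshot=True,        # snapshot loaded state for fast cold starts
)
@modal.concurrent(max_inputs=2)         # two requests per GPU container
class LingBotWorld:
    @modal.enter(snap=True)
    def load(self):
        # Runs once before the snapshot is taken; the loaded pipeline is
        # restored in ~10s on later cold starts instead of ~90s reloads.
        self.pipeline = load_wan_i2v_prequant("/models")  # placeholder loader
```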

### Model Components

| Component | Size | Description |
| --- | --- | --- |
| `high_noise_model_bnb_nf4` | ~9.6GB | High-noise diffusion model (NF4) |
| `low_noise_model_bnb_nf4` | ~9.6GB | Low-noise diffusion model (NF4) |
| `models_t5_umt5-xxl-enc-bf16.pth` | ~10.6GB | T5-XXL text encoder |
| `Wan2.1_VAE.pth` | ~485MB | VAE encoder/decoder |

## Installation

### Prerequisites

```text
# Required
python >= 3.10
uv (recommended) or pip

# Accounts needed
# - Modal account with API key
# - HuggingFace account with read access token
```

### Setup

```bash
# Clone repository
git clone https://github.com/DreamFlux-Workspace/lingbot-world.git
cd lingbot-world

# Install uv (if not installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Configure Modal
uv run modal setup

# Create HuggingFace secret in Modal
uv run modal secret create hf-secret HF_TOKEN=<your-hf-token>
```

## Quick Start

### 1. Download Model Weights

```bash
uv run modal run src/lingbot_world/inference.py --action setup
```

> ⏱️ This downloads ~30GB of model weights and takes 30-60 minutes.

### 2. Deploy to Modal

```bash
uv run modal deploy src/lingbot_world/inference.py
```

### 3. Generate Your First Video

```bash
# Using the CLI
uv run lingbot generate input.jpg "A cinematic video with gentle camera movement" -o output.mp4

# Or via curl
curl -X POST "https://YOUR_WORKSPACE--lingbot-world-api.modal.run/generate" \
    -F "image=@input.jpg" \
    -F "prompt=A cinematic video with gentle camera movement" \
    -o output.mp4
```

## API Reference

### Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/` | API information |
| GET | `/health` | Health check & GPU metrics |
| POST | `/generate` | Generate video from image |
| GET | `/docs` | OpenAPI documentation |

### Generate Video

```http
POST /generate
Content-Type: multipart/form-data
```

**Parameters:**

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `image` | file | required | Input image (JPEG/PNG) |
| `prompt` | string | required | Text description of desired motion |
| `size` | string | `"480*832"` | Resolution: `480*832`, `832*480`, `720*1280`, `1280*720` |
| `frame_num` | int | `81` | Frame count (must be 4n+1) |
| `sampling_steps` | int | `40` | Diffusion steps (30-70) |
| `guide_scale` | float | `5.0` | Guidance scale (3.0-7.0) |
| `seed` | int | `-1` | Random seed (-1 for random) |
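These constraints are easy to violate client-side (the 4n+1 frame rule in particular), so a pre-flight check can fail fast before uploading an image. The helper below simply mirrors the table; it is illustrative and separate from whatever validation the server performs:

```python
# Client-side sanity checks mirroring the parameter table above.
# Illustrative helper, not part of the lingbot_world package.
VALID_SIZES = {"480*832", "832*480", "720*1280", "1280*720"}

def check_params(size: str, frame_num: int, sampling_steps: int, guide_scale: float) -> None:
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {sorted(VALID_SIZES)}, got {size!r}")
    if frame_num % 4 != 1:
        raise ValueError(f"frame_num must be 4n+1 (e.g. 81, 161, 241), got {frame_num}")
    if not 30 <= sampling_steps <= 70:
        raise ValueError(f"sampling_steps must be in [30, 70], got {sampling_steps}")
    if not 3.0 <= guide_scale <= 7.0:
        raise ValueError(f"guide_scale must be in [3.0, 7.0], got {guide_scale}")

check_params(size="480*832", frame_num=81, sampling_steps=40, guide_scale=5.0)
```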

**Response:**

```http
Content-Type: video/mp4
Content-Disposition: attachment; filename=lingbot_<id>.mp4
```

### Example Request

```python
import httpx

with open("cityscape.jpg", "rb") as f:
    response = httpx.post(
        "https://your-workspace--lingbot-world-api.modal.run/generate",
        files={"image": f},
        data={
            "prompt": "A cinematic first-person exploration through the urban environment",
            "size": "480*832",
            "frame_num": 81,
            "sampling_steps": 40,
        },
        timeout=600.0,  # generation can take several minutes
    )
response.raise_for_status()  # avoid writing an error body to output.mp4

with open("output.mp4", "wb") as f:
    f.write(response.content)
```

### Health Check Response

```json
{
    "status": "healthy",
    "model_loaded": true,
    "pipeline_ready": true,
    "gpu": "NVIDIA A100-SXM4-80GB",
    "gpu_memory_gb": 79.4,
    "gpu_memory_used_gb": 28.3
}
```
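Before submitting work, a client can confirm the pipeline reports ready. A minimal probe against the response shape above (with a placeholder URL for your deployment):

```python
# Minimal readiness probe against the /health endpoint shown above.
import httpx

API_URL = "https://your-workspace--lingbot-world-api.modal.run"  # your deployment
health = httpx.get(f"{API_URL}/health", timeout=30.0).json()
if not (health["model_loaded"] and health["pipeline_ready"]):
    raise RuntimeError(f"service not ready: {health}")
print(f"{health['gpu']}: {health['gpu_memory_used_gb']}/{health['gpu_memory_gb']} GB used")
```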

## Configuration

### Frame Presets

| Preset | Frames | Duration | Use Case |
| --- | --- | --- | --- |
| Short | 81 | ~5 sec | Quick previews |
| Medium | 161 | ~10 sec | Standard videos |
| Long | 241 | ~15 sec | Extended scenes |
| Very Long | 481 | ~30 sec | Cinematic shots |
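The presets imply roughly 16 fps output (81 frames ≈ 5 s). For durations between the presets, a helper like the following, shown as an illustration rather than anything shipped in the CLI, rounds a target duration to the nearest valid 4n+1 frame count:

```python
# Convert a target duration to a valid frame count, assuming the ~16 fps
# implied by the preset table (81 frames ≈ 5 s). Illustrative helper only.
FPS = 16

def frames_for_duration(seconds: float) -> int:
    n = round((seconds * FPS - 1) / 4)   # solve frames = 4n + 1 for n
    return max(4 * n + 1, 5)             # clamp to a minimal valid count

assert frames_for_duration(5) == 81
assert frames_for_duration(10) == 161
assert frames_for_duration(30) == 481
```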

### Resolution Options

| Size | Aspect Ratio | VRAM Usage | Speed |
| --- | --- | --- | --- |
| `480*832` | 9:16 portrait | ~28GB | Fast |
| `832*480` | 16:9 landscape | ~28GB | Fast |
| `720*1280` | 9:16 portrait | ~45GB | Slower |
| `1280*720` | 16:9 landscape | ~45GB | Slower |
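Both tiers come in portrait and landscape variants, so a client can match the generation size to the input image's orientation. The helper below is an illustrative sketch using Pillow, not part of the package:

```python
# Pick a generation size matching the input image's orientation.
# Illustrative only; the trade-offs above still apply (~28GB VRAM and
# fast at the 480p tier, ~45GB and slower at the 720p tier).
from PIL import Image

def pick_size(image_path: str, high_res: bool = False) -> str:
    with Image.open(image_path) as im:
        portrait = im.height >= im.width
    if high_res:
        return "720*1280" if portrait else "1280*720"
    return "480*832" if portrait else "832*480"

print(pick_size("cityscape.jpg"))  # e.g. "832*480" for a landscape photo
```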

### Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| `HF_TOKEN` | HuggingFace access token | Yes |
| `LINGBOT_API_URL` | API URL for UI client | No |

## Gradio UI

Launch the interactive web interface:

```bash
# Set your API URL
export LINGBOT_API_URL="https://your-workspace--lingbot-world-api.modal.run"

# Launch UI
uv run python -m lingbot_world.ui
```

Then open http://localhost:7860 in your browser.

### UI Features

- 📷 **Image Upload** – Drag & drop or click to upload
- 📝 **Prompt Editor** – Describe the motion you want
- 🎬 **Camera Presets** – Pan, zoom, orbit, dolly controls
- ⚙️ **Advanced Settings** – Fine-tune generation parameters
- 🎥 **Video Preview** – Watch results directly in the browser

## CLI Reference

```bash
# Show help
uv run lingbot --help

# Setup (download model)
uv run lingbot setup

# Deploy to Modal
uv run lingbot deploy

# Check health
uv run lingbot health --api-url https://...

# Generate video
uv run lingbot generate input.jpg "Your prompt" -o output.mp4 \
    --size "480*832" \
    --frames 81 \
    --steps 40 \
    --guidance 5.0

# Launch UI
uv run lingbot ui --api-url https://...
```

## Performance

### Generation Times (A100-80GB)

| Resolution | Frames | Steps | Time |
| --- | --- | --- | --- |
| `480*832` | 81 | 40 | ~2-3 min |
| `480*832` | 161 | 40 | ~4-5 min |
| `720*1280` | 81 | 40 | ~5-6 min |
| `720*1280` | 161 | 70 | ~8-10 min |

### Cold Start (with GPU Memory Snapshots)

| Scenario | Time | Description |
| --- | --- | --- |
| First ever | ~90s | Creates GPU memory snapshot |
| Snapshot restore | ~10s | Restores from cached GPU state |
| Warm container | Instant | 15-min keepalive window |

> **Optimization:** GPU memory snapshots capture the loaded model state, cutting cold starts from ~90s to ~10s (a ~90% reduction).

## Error Handling

```python
from lingbot_world.exceptions import (
    LingBotError,
    ModelNotFoundError,
    GenerationError,
    InvalidParameterError,
)

try:
    result = await generate_video(...)
except ModelNotFoundError:
    print("Run 'lingbot setup' first to download model weights")
except InvalidParameterError as e:
    print(f"Invalid parameter: {e}")
except GenerationError as e:
    print(f"Generation failed: {e}")
```

## Development

```bash
# Install dev dependencies
uv sync --group dev

# Run linter
uv run ruff check src/

# Run formatter
uv run ruff format src/

# Run type checker
uv run mypy src/

# Run pre-commit hooks
uv run pre-commit run --all-files
```

## License

Creative Rail v1.0 – See LICENSE for details.

## Acknowledgments


Built with ❤️ by DreamFlux
