Generates vertical short-form MP4 videos (1080×1920, 9:16, 30 fps) for TikTok, YouTube Shorts, and Instagram Reels from simple markdown script files. Characters talk over gameplay footage with karaoke-style subtitles.
- You write a script (`.md` file) with dialogue lines and optional image overlays
- You provide TTS audio files (one MP3 per dialogue line, pre-generated)
- The pipeline parses the script → transcribes audio with Whisper → composites everything → exports H.264
```
Script (.md) + Audio (.mp3) + Background (.mp4) + Character sprites (.png)
└──────────────────────────────────┬──────────────────────────────────────┘
                                    │
                                render.py
                                    │
                              output/*.mp4
```
```bash
pip install moviepy openai-whisper Pillow numpy imageio-ffmpeg

python render.py scripts/episode1.md               # output → output/episode1.mp4
python render.py scripts/episode1.md custom.mp4    # custom output path
```

You need to populate the `assets/` folder before rendering. Here is everything required:
```
assets/
├── backgrounds/          ← gameplay footage (landscape MP4s, any resolution)
│   ├── minecraft.mp4
│   └── subway_surfers.mp4
│
├── characters/           ← character sprites (PNG with transparency)
│   ├── peter.png
│   └── stewie.png
│
├── audio/                ← TTS audio, one MP3 per dialogue line
│   ├── peter_001.mp3     ← Peter's 1st line
│   ├── peter_002.mp3     ← Peter's 2nd line
│   ├── stewie_001.mp3    ← Stewie's 1st line
│   ├── stewie_002.mp3
│   └── ...
│
├── fonts/
│   └── bold.ttf          ← subtitle font (any bold TTF; fallback = low-quality bitmap)
│
└── pictures/             ← graph/image overlays referenced in scripts
    ├── intro_chart.png
    └── inflation_graph.png
```
- Any landscape MP4 — the renderer crops it to fill 9:16 automatically (scale-to-cover, never letterboxed; see the crop sketch below)
- One background is picked at random per render
- You can have as many as you want in `assets/backgrounds/`
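The exact crop logic lives in `render.py`; the following is only a sketch of scale-to-cover cropping for a 1080×1920 target, and `cover_crop_box` is a hypothetical helper, not part of the codebase:

```python
def cover_crop_box(src_w, src_h, dst_w=1080, dst_h=1920):
    """Return (x1, y1, x2, y2): the centered region of a src_w×src_h frame that,
    once scaled, fills a dst_w×dst_h (9:16) frame with no letterboxing."""
    dst_ratio = dst_w / dst_h
    if src_w / src_h > dst_ratio:
        # source is wider than 9:16 → keep full height, crop the sides
        crop_w = round(src_h * dst_ratio)
        x1 = (src_w - crop_w) // 2
        return x1, 0, x1 + crop_w, src_h
    # source is narrower than 9:16 → keep full width, crop top and bottom
    crop_h = round(src_w / dst_ratio)
    y1 = (src_h - crop_h) // 2
    return 0, y1, src_w, y1 + crop_h
```

For a standard 1920×1080 source this returns `(656, 0, 1264, 1080)`: a 608 px wide centered strip that scales to fill 1080×1920.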
- PNG files with RGBA transparency (transparent background)
- Named exactly as they appear in the script: `[peter]` → `assets/characters/peter.png`
- They are scaled to ~45% of frame width by default (configurable in `config.py`; see the Pillow sketch below)
- First character encountered in the script goes on the left, second on the right
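For illustration only (the renderer's actual sizing code is in `render.py`), scaling a sprite to `CHARACTER_SCALE` of the frame width with Pillow might look like this; `load_sprite` is a hypothetical helper:

```python
from PIL import Image

FRAME_W = 1080
CHARACTER_SCALE = 0.45          # default from config.py

def load_sprite(path, scale=CHARACTER_SCALE):
    """Load an RGBA sprite and scale it to `scale` × frame width, keeping aspect ratio."""
    sprite = Image.open(path).convert("RGBA")      # preserve transparency
    target_w = int(FRAME_W * scale)                # 486 px at the default 0.45
    target_h = round(sprite.height * target_w / sprite.width)
    return sprite.resize((target_w, target_h), Image.LANCZOS)

peter = load_sprite("assets/characters/peter.png")   # later placed at CHARACTER_X_LEFT or _RIGHT
```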
- One MP3 per dialogue line, named `<character>_<NNN>.mp3` (zero-padded 3-digit index)
- The index counts that character's lines in order from the top of the script
- Generate these with any TTS tool (ElevenLabs, edge-tts, etc.) before rendering; a generation sketch follows this list
- Missing audio files produce a warning but don't crash — that clip is skipped
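One way to produce correctly named clips is sketched below with edge-tts. The voice names and the hard-coded line list are assumptions; substitute your own script's dialogue:

```python
import asyncio
import edge_tts   # pip install edge-tts

VOICES = {"peter": "en-US-GuyNeural", "stewie": "en-GB-RyanNeural"}  # assumed voice mapping

LINES = [                          # dialogue in script order
    ("peter",  "Hey Stewie have you seen this"),
    ("stewie", "Yes it is remarkable"),
    ("peter",  "I know right"),
]

async def main():
    counts = {}
    for character, text in LINES:
        counts[character] = counts.get(character, 0) + 1
        out = f"assets/audio/{character}_{counts[character]:03d}.mp3"   # e.g. peter_001.mp3
        await edge_tts.Communicate(text, VOICES[character]).save(out)
        print("wrote", out)

asyncio.run(main())
```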
- Put any bold TTF at `assets/fonts/bold.ttf`
- If missing, Pillow's built-in bitmap font is used (tiny and ugly — always supply a real font)
- Any image format Pillow supports (PNG, JPG, WebP, etc.)
- Referenced in the script by path relative to `assets/` or the project root
- Scaled to fill the frame while preserving aspect ratio
Scripts live in `scripts/`. See `scripts/episode1.md` for a working example.
```
# This is a comment — ignored by the parser

[img: "pictures/intro_chart.png" 4s]        ← show image for 4 seconds
[peter] : Hey Stewie have you seen this     ← Peter speaks
[stewie] : Yes it is remarkable             ← Stewie speaks
[peter] : I know right                      ← Peter's 2nd line → peter_002.mp3
```
Rules:
- `[character]` names are case-insensitive and map to sprite + audio files
- Audio files are auto-matched by order: first `[peter]` line → `peter_001.mp3`, second → `peter_002.mp3`, etc. (a checker sketch follows these rules)
- Image overlays display for the specified duration with no audio
- Order in the file = strict order in the video
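To see how the auto-matching plays out, here is a small checker sketch (not part of the pipeline; the dialogue regex is an assumption based on the format above) that lists the audio files a script expects and whether they exist:

```python
import re
from collections import defaultdict
from pathlib import Path

DIALOGUE = re.compile(r"^\[(\w+)\]\s*:\s*(.+)$")    # matches lines like `[peter] : Hey ...`

def expected_audio(script_path):
    """Yield (audio_path, dialogue_text) pairs in the order the renderer expects them."""
    counts = defaultdict(int)
    for line in Path(script_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("[img:"):
            continue                                  # comments and image overlays carry no audio
        m = DIALOGUE.match(line)
        if m:
            character = m.group(1).lower()            # [Peter] and [peter] map to the same files
            counts[character] += 1
            yield f"assets/audio/{character}_{counts[character]:03d}.mp3", m.group(2)

for path, text in expected_audio("scripts/episode1.md"):
    status = "ok     " if Path(path).exists() else "MISSING"
    print(f"{status} {path}  ←  {text}")
```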
Whisper transcription (used for word-level subtitle timing) runs once per audio file and caches the result in `cache/`. Subsequent renders are instant.
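The real logic lives in `align.py`; below is a minimal sketch of the same idea with openai-whisper, assuming the cache simply stores one JSON file per clip:

```python
import json
from pathlib import Path
import whisper   # pip install openai-whisper

model = whisper.load_model("base")     # WHISPER_MODEL in config.py

def word_timestamps(audio_path, cache_dir="cache"):
    """Return per-word timings for one clip, caching the result as JSON."""
    cache_file = Path(cache_dir) / (Path(audio_path).stem + ".json")
    if cache_file.exists():                                   # cache hit → no transcription
        return json.loads(cache_file.read_text())
    result = model.transcribe(audio_path, word_timestamps=True)
    words = [w for seg in result["segments"] for w in seg["words"]]
    cache_file.parent.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(words))
    return words    # each entry has "word", "start", "end"

print(word_timestamps("assets/audio/peter_001.mp3")[:3])
```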
To re-transcribe a file, delete its cache entry:
```bash
rm cache/peter_001.json
```

While a render is running, a live status file is written at `output/<job>.status.json`.
```bash
# watch live in another terminal
watch -n1 "jq '.stage, .step_label, (.percent|tostring)+\"%\"' output/episode1.status.json"

# or use the built-in viewer
python progress.py
```

Pipeline stages in order: PARSING → ALIGNING → BACKGROUND → COMPOSITING → EXPORTING → DONE
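If you prefer to poll from Python instead of jq, here is a minimal sketch; it assumes only the fields the jq command above reads (`stage`, `step_label`, `percent`):

```python
import json
import time
from pathlib import Path

status_file = Path("output/episode1.status.json")

while True:
    if status_file.exists():
        s = json.loads(status_file.read_text())
        print(f"\r{s['stage']:<12} {s.get('step_label', ''):<30} {s['percent']}%", end="", flush=True)
        if s["stage"] == "DONE":
            break
    time.sleep(1)
print()
```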
The final frame is built by stacking four z-ordered layers. These rules are always enforced regardless of script order:
| Layer | What | Can cover |
|---|---|---|
| 1 — Background | Gameplay footage | Nothing |
| 2 — Characters | Sprite of the speaking character | Background only |
| 3 — Image overlays | Charts / pictures | Background, characters |
| 4 — Subtitles | Karaoke word highlight | Everything — always visible on top |
One character at a time — each character sprite clip is active only for the exact duration of that character's audio line. Two character sprites are never visible simultaneously.
Subtitles avoid images — subtitles are always the topmost layer and are never hidden. When an image is active, the subtitle is repositioned to sit below the image (preferred) or above it if there isn't enough space below. They visually avoid the image rather than covering it.
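For intuition only (render.py's actual compositing code may differ), this stacking maps naturally onto moviepy's `CompositeVideoClip`, where list order is z-order. The snippet below uses the moviepy 1.x `moviepy.editor` API (newer versions rename `set_*` to `with_*`) and hard-coded timings and positions as placeholders:

```python
from moviepy.editor import CompositeVideoClip, ImageClip, VideoFileClip

background = VideoFileClip("assets/backgrounds/minecraft.mp4")    # layer 1 — bottom
peter = (ImageClip("assets/characters/peter.png")                 # layer 2 — speaking character
         .set_start(0.0)
         .set_duration(2.4)                                       # active only while his line plays
         .set_position((80, 980)))                                # x = CHARACTER_X_LEFT; y is a placeholder
chart = (ImageClip("assets/pictures/intro_chart.png")             # layer 3 — image overlay
         .set_start(0.0)
         .set_duration(4.0)
         .set_position(("center", 600)))                          # GRAPH_Y_CENTER
# layer 4 (subtitles) is appended last so it always draws on top
video = CompositeVideoClip([background, peter, chart], size=(1080, 1920))
```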
All tuneable values are in `config.py`. Key ones:
| Setting | Default | What it does |
|---|---|---|
| `WHISPER_MODEL` | `"base"` | Whisper model size — tiny is fast, large is accurate |
| `CHARACTER_SCALE` | `0.45` | Sprite size (fraction of frame width) |
| `CHARACTER_X_LEFT` | `80` | X position of left speaker |
| `CHARACTER_X_RIGHT` | `600` | X position of right speaker |
| `SUBTITLE_Y` | `1150` | Vertical position of subtitles |
| `SUBTITLE_HIGHLIGHT_COLOR` | `(255,220,0)` | Active word color (yellow) |
| `GRAPH_Y_CENTER` | `600` | Vertical center of image overlays |
| File | Role |
|---|---|
| `render.py` | Entry point — composes and exports the video |
| `config.py` | All settings, sizes, colors, paths |
| `parse.py` | Converts the `.md` script into a timeline |
| `align.py` | Runs Whisper to get per-word timestamps |
| `progress.py` | Progress tracking and status file viewer |