Translate PDF documents while preserving the original layout — images, borders, fonts, and positioning stay intact. Only the text changes.
Built for medical laboratory reports but designed to work with any PDF. Currently ships with a comprehensive Ukrainian → Spanish medical dictionary (blood tests, hemostasis, CBC, tumor markers, microelements).
Every free PDF translation service is either:
- Low quality (garbled layout, missing text)
- A bait-and-switch (upload → translate → paywall before download)
This tool runs locally, is free, and produces clean output.
# 1. Set up
python -m venv .venv
source .venv/Scripts/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# 2. Configure (optional — defaults to Ukrainian → Spanish blood test)
python -m pdf_translator.cli init-config
# Edit pdf_translate_config.yaml to set languages, document type, exclusions
# 3. Drop PDFs into pdfs/ folder and translate
cp my_lab_results.pdf pdfs/
python -m pdf_translator.cli pipeline # translates all PDFs in pdfs/
# or translate one file:
python -m pdf_translator.cli pipeline my_lab_results.pdfOutput lands in the same pdfs/ folder with a 2-letter language suffix: pdfs/my_lab_results_es.pdf
┌─────────┐ ┌───────────┐ ┌──────────┐ ┌──────────┐
│ PDF │────▶│ Extract │────▶│ Translate │────▶│ Rebuild │
│ (input) │ │ (layout) │ │ (rules) │ │ (overlay)│
└─────────┘ └───────────┘ └──────────┘ └──────────┘
│ │ │
▼ ▼ ▼
spans + fonts config + medical white rect +
+ positions dictionaries new text at
+ zones same position
-
Extract — PyMuPDF reads every text span with its exact position, font, size, color. Spans are classified into zones (header vs body) using logo detection.
-
Translate — Rule-based engine applies medical dictionaries. Custom translations and exclusions from config override defaults. Unknown terms are reported with context for review.
-
Rebuild — For each translated span, a white rectangle covers the original text, then the translated text is inserted at the same position with matching font and size.
pdf_translate_config.yaml controls everything:
source_language: Ukrainian
target_language: Spanish
document_type: blood_test # blood_test | urine_test | general_medical
instructions: | # Free-text guidance for LLM-assisted translation
Do not translate clinic headers...
do_not_translate: # Exact strings to preserve
- "Коваленко Софія Олегівна"
custom_translations: # Override or extend built-in dictionaries
"Новий термін": "Nuevo término"
preserve_zones: # Zones to skip
- headerRun python -m pdf_translator.cli init-config to generate a starter config.
All commands accept -c / --config <path> for explicit config file.
| Command | Description |
|---|---|
pipeline [pdf] |
Full extract → translate → rebuild. Omit path to translate all in pdfs/ |
extract <pdf> |
Extract text spans with layout info → JSON |
translate <json> |
Apply translations to extracted JSON |
rebuild <pdf> <json> |
Rebuild PDF from original + translated JSON |
show <json> |
Preview which spans will be translated |
check <pdf> |
Report unknown terms with surrounding context |
init-config |
Generate default config file |
All commands accept -c / --config <path> for explicit config file. PDF arguments can be just a filename — it will be looked up in pdfs/ automatically.
If you use Claude Code, the /pdf-babel skill automates the full workflow:
/pdf-babel my_lab_results.pdf
The skill will:
- Run the pipeline
- Check for unknown/untranslated terms
- Ask you for clarification with context if any are found
- Add your answers and rebuild
The built-in dictionaries cover Ukrainian → Spanish, but the architecture is language-agnostic. To add, say, German → English:
All translations live in pdf_translator/translate.py as plain Python dicts. The engine tries matches in this order:
custom_translationsfrom your config file (highest priority)LABEL_TRANSLATIONS— structural labels ("Date of birth:", "Gender:", column headers)TEST_TRANSLATIONS— domain-specific terms (medical test names, procedures)UNIT_TRANSLATIONS— measurement units (Cyrillic → Latin)REFERENCE_WORD_TRANSLATIONS— words inside mixed text ("Adults", "Women", "years")EQUIPMENT_DESCRIPTION_WORDS— equipment type names (not brand names)
The fastest way: run the extractor on your PDF and look at what comes out.
python -m pdf_translator.cli extract pdfs/german_report.pdf -o extracted.json
python -m pdf_translator.cli show extracted.jsonThis shows every text span classified as TRANSLATE or KEEP. Use the output to build your dictionaries. You can either:
Option A — Config-only (small vocabularies, quick setup):
Add all your translations to custom_translations in the config file:
source_language: German
target_language: English
custom_translations:
"Blutzucker": "Blood sugar"
"Gesamteiweiß": "Total protein"
"Ergebnis": "Result"This works well for small documents or one-off translations. No code changes needed.
Option B — Code dictionaries (large vocabularies, reusable):
For a full language pair you plan to reuse, add dicts to translate.py. The existing Ukrainian → Spanish dicts are a template. Create parallel dicts for your pair:
# At the top of translate.py, organize by language pair:
LABEL_TRANSLATIONS_DE_EN = {
"Geburtsdatum:": "Date of birth:",
"Geschlecht:": "Gender:",
"Ergebnis": "Result",
"Referenzbereich": "Reference range",
...
}Then update translate_span() to select the right dictionary based on config.source_language / config.target_language.
source_language: German
target_language: English
document_type: blood_testThe 2-letter ISO code is derived automatically (German → de, English → en), so output files will be named report_en.pdf. Supported codes are listed in config.py:LANGUAGE_CODES. Add yours if it's not there.
The rebuilder picks a system font that supports the target language characters:
- Latin scripts (Spanish, French, German, etc.): Tahoma/Arial work out of the box
- CJK (Chinese, Japanese, Korean): you'll need to update
rebuilder.py:find_system_font()to point to a CJK font (e.g.,msyh.ttcon Windows,Hiragino Sanson macOS) - RTL (Arabic, Hebrew): will also need font + layout adjustments in the rebuilder
After your first translation, run:
python -m pdf_translator.cli check pdfs/german_report.pdfThis reports every untranslated term with surrounding context. Add missing terms to your dictionaries or config, re-run, repeat until clean.
The built-in dictionaries cover blood tests, CBC, hemostasis, tumor markers, and microelements. To add a new domain — say, PET scan reports, urine analysis, or pathology results:
python -m pdf_translator.cli extract pdfs/pet_scan.pdf -o extracted.json
python -m pdf_translator.cli show extracted.jsonLook at the TRANSLATE spans. You'll see domain-specific terms that aren't in the current dictionaries.
python -m pdf_translator.cli check pdfs/pet_scan.pdfOutput shows each unknown term with context:
[1] Page 2 — p1_s42
Unknown: 'Стандартизоване значення накопичення (SUV)'
Before: ...Resultado | PET/CT
After: 4.2 | Norma: < 2.5...
The context helps you understand what the term means even if you don't speak the source language.
For a few terms — add to config:
document_type: pet_scan
custom_translations:
"Стандартизоване значення накопичення (SUV)": "Valor de captación estandarizado (SUV)"
"Метаболічна активність": "Actividad metabólica"
"Вогнище накопичення": "Foco de captación"For a full domain — add a new section in translate.py:
# PET/CT scan terms
PET_SCAN_TRANSLATIONS = {
"Стандартизоване значення накопичення (SUV)": "Valor de captación estandarizado (SUV)",
"Метаболічна активність": "Actividad metabólica",
"Вогнище накопичення": "Foco de captación",
"Радіофармпрепарат": "Radiofármaco",
"Період напіврозпаду": "Vida media",
...
}Then add this dict to the lookup chain in translate_span().
Different document types may have different layouts:
- Header detection: if the document doesn't have a logo image as the second image block, set
header_fixed_yin the config or adjustextractor.py:find_header_bottom() - Zone classification: some documents have sidebars, footnotes, or multi-column layouts that may need custom zone logic
- Preserve zones: if the document has sections that should never be translated (like a "Methods" section with protocol codes), add them to
preserve_zonesin the config
If you use /pdf-babel with Claude Code and there are unknown terms, the skill will pause and ask you:
I found an untranslated term on Page 2: Unknown:
Метаболічна активністьContext: ...PET/CT → Метаболічна активність → 4.2 | SUV...What does this mean? Should I translate it or keep it as-is?
Your answers get saved to the config file automatically for future runs.
pdfs/ # Drop PDFs here — input and output in one place
├── report.pdf # source file
└── report_es.pdf # translated output (auto-generated)
pdf_translator/
├── config.py # YAML config loader + language codes
├── extractor.py # PDF → JSON (spans with layout metadata)
├── translate.py # Rule-based translation engine
├── rebuilder.py # JSON + original PDF → translated PDF
└── cli.py # Click CLI entry point
.claude/skills/pdf-babel/
└── SKILL.md # Claude Code /pdf-babel skill
pdf_translate_config.yaml # Translation settings
PRs are welcome — especially new language pairs and document domains. Here's how to contribute:
- Fork the repo and create a branch:
git checkout -b lang/german-english - Add your dictionaries to
pdf_translator/translate.py(see Adding a New Language Pair above) - Update
translate_span()to select dictionaries based onconfig.source_language/config.target_language - Add the language to
LANGUAGE_CODESinconfig.pyif missing - Test with a real PDF — run
pipelineand thencheckto verify coverage - Open a PR with:
- Which language pair you're adding
- How many terms your dictionaries cover
- A screenshot or diff showing before/after translation (redact any personal data)
- Fork the repo and create a branch:
git checkout -b domain/urine-test - Add a new translation dict in
translate.py(e.g.,URINE_TEST_TRANSLATIONS) - Wire it into the lookup chain in
translate_span(), gated byconfig.document_type - Open a PR with:
- Which domain you're adding (urine test, pathology, radiology, etc.)
- Term count and source language pair
- Output from
checkshowing zero unknowns on a test document
- Keep dictionaries sorted alphabetically for easier review
- One PR per language pair or domain — don't bundle unrelated changes
- Do not commit real patient PDFs or personal data — use anonymized test files
- Run
python -m pdf_translator.cli checkon your test PDF and confirm clean output before submitting - If you're fixing a bug, include the before/after behavior in the PR description
- Python 3.10+
- PyMuPDF, Click, PyYAML (see requirements.txt)
- System font: Tahoma (Windows), Arial (macOS), DejaVu Sans (Linux)
MIT