Local-first Python CLI for extracting datasheet PDFs into structured JSON, table exports, figure artifacts, and processing rollup reports.
```bash
poetry install
poetry shell  # activate the virtual environment
```

```bash
# Install from local directory
pip install .

# Install directly from GitHub
pip install git+https://github.com/amahpour/datasheet-extractor.git
```

The pipeline uses Ollama to classify and describe extracted figures locally. Install it, then pull a vision model:
```bash
brew install ollama
ollama serve          # or open the Ollama macOS app
ollama pull moondream # fast, ~1.7GB
```

Use the walkthrough notebook to demo the "before vs after" pipeline:
```bash
poetry run pip install jupyter ipykernel
poetry run jupyter notebook notebooks/pipeline_walkthrough.ipynb
```

The notebook shows:
- raw Docling extraction output
- schema-normalized objects and enrichment
- end-to-end `process_pdf()` run with rollups and reports
Use `--file` to process a single PDF, or `--dir` to process every PDF in a directory.

Note: If using Poetry, prefix commands with `poetry run` (e.g., `poetry run datasheet-extract`).
```bash
# Single file
datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out

# Single file, first 3 pages only
datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out --pages "1-3"

# All PDFs in a directory
datasheet-extract --dir ./examples --out ./out

# Specify Ollama model explicitly
datasheet-extract --dir ./examples --out ./out --ollama-model moondream

# Skip images or tables
datasheet-extract --dir ./examples --out ./out --no-images
datasheet-extract --dir ./examples --out ./out --no-tables
```

With Poetry:
```bash
poetry run datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out
poetry run datasheet-extract --dir ./examples --out ./out
poetry run datasheet-smoke-test
```

With pip or as a module:
```bash
python -m cli.ds_extract --file ./examples/dac7578/dac5578.pdf --out ./out
python -m cli.ds_extract --dir ./examples --out ./out
python -m cli.smoke_test
```

- `--file` — path to a single PDF file (mutually exclusive with `--dir`)
- `--dir` — directory containing PDF files (mutually exclusive with `--file`)
- `--out` (default `./out`)
- `--glob` (default `*.pdf`) — filename pattern when using `--dir`
- `--pages "1-3,7"`
- `--force` — reprocess even if status files exist
- `--no-images`
- `--no-tables`
- `--max-figures N`
- `--ollama-model NAME` — Ollama vision model (auto-detected if omitted)
- `--max-tokens N` (default `256`) — max tokens per text block chunk (aligned to embedding model tokenizer)
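The `--pages` flag accepts comma-separated single pages and ranges, e.g. `"1-3,7"`. As a rough illustration of how such a spec expands (this `parse_pages` helper is a hypothetical sketch, not the CLI's actual parser):

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec like "1-3,7" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range like "1-3" covers both endpoints.
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        elif part:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1-3,7"))  # → [1, 2, 3, 7]
```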
The pipeline runs two phases in one command:
- Docling extraction — text is chunked using Docling's `HybridChunker`, which provides structure-aware, token-aligned chunks with section heading context. Each chunk carries its parent heading hierarchy and an `enriched_text` field ready for embedding. The chunker uses the `sentence-transformers/all-MiniLM-L6-v2` tokenizer (configurable via `--max-tokens`, default 256). Tables are exported as CSV/MD/JSON, and figure images as PNG.
- Local figure processing — each figure is classified and described by Ollama. Results are written to per-figure status files.
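Heading-context enrichment amounts to prefixing each chunk with its parent heading hierarchy. A minimal sketch of the idea — the `enrich_text` helper and the `" > "` separator are illustrative assumptions, not Docling's actual serialization:

```python
def enrich_text(headings: list[str], text: str) -> str:
    """Prefix a chunk with its heading hierarchy so the embedding
    carries section context (illustrative; Docling's format may differ)."""
    context = " > ".join(headings)
    return f"{context}\n{text}" if context else text

chunk = enrich_text(
    ["Electrical Characteristics", "DC Performance"],
    "INL: ±1 LSB max over the full code range.",
)
print(chunk)
```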
Figures are classified and routed:
- Resolved locally — logos, icons, simple photos
- Needs external LLM — plots, pinouts, schematics, block diagrams, timing diagrams
Open `processing_rollup.md` to see what still needs processing. For each figure flagged as `needs_external`, use the prompt template in `prompts/figure_analysis.md` with a vision-capable LLM (GPT-4o, Claude, etc.).
After processing, update the figure's status file in `processing/` to `"status": "resolved_external"` and fill in `external_llm_result`. Also record which external model produced that result by setting `external_llm_provider` and `external_llm_model`.
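That update can be scripted. A sketch using only the documented status-file field names — the `mark_resolved_external` helper itself is hypothetical, not part of the CLI:

```python
import json
from pathlib import Path


def mark_resolved_external(status_path: Path, provider: str,
                           model: str, result: str) -> None:
    """Record an external LLM result in a per-figure status file."""
    record = json.loads(status_path.read_text())
    record["status"] = "resolved_external"
    record["needs_external"] = False
    record["external_llm_provider"] = provider
    record["external_llm_model"] = model
    record["external_llm_result"] = result
    status_path.write_text(json.dumps(record, indent=2))
```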
Per PDF (`out/<pdf_stem>/`):

- `document.json` — full extraction: structure-aware chunked text blocks (with headings and enriched text), tables, figures
- `index.json` — paths to all artifacts
- `figures/fig_0001.png` — extracted figure images
- `tables/table_0001.{json,csv,md}` — extracted tables
- `processing/fig_0001.json` — per-figure processing status
- `processing_rollup.{json,md}` — summary of what's resolved vs pending
- `derived/figures/fig_0001/meta.json`
- `derived/figures/fig_0001/description.md`
- `manual_processing_report.{md,json}`
Global (`out/`):

- `index.json`
- `processing_rollup.json` — global rollup across all PDFs
- `manual_processing_report.{md,json}`
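Given this layout, the figures still awaiting external processing can be found by scanning each PDF's `processing/` directory. A sketch — the `pending_figures` helper is hypothetical, not a CLI command:

```python
import json
from pathlib import Path


def pending_figures(out_dir: Path) -> list[Path]:
    """List per-figure status files still marked needs_external,
    across every <pdf_stem>/processing/ directory under out_dir."""
    pending = []
    for status_file in sorted(out_dir.glob("*/processing/fig_*.json")):
        record = json.loads(status_file.read_text())
        if record.get("status") == "needs_external":
            pending.append(status_file)
    return pending
```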
Each `processing/fig_XXXX.json` tracks:

```json
{
  "figure_id": "fig_0017",
  "image_path": "out/dac5578/figures/fig_0017.png",
  "status": "needs_external",
  "stage": "local_llm",
  "local_llm_provider": "ollama",
  "local_llm_model": "moondream",
  "local_llm_description": "Graph showing linearity error vs digital input code...",
  "local_llm_classification": "plot",
  "external_llm_provider": "",
  "external_llm_model": "",
  "external_llm_result": null,
  "needs_external": true,
  "confidence": 0.3,
  "processed_at": "2026-02-12T00:33:06Z"
}
```

Status values: `resolved_local`, `needs_external`, `resolved_external`
Stage 2 skips any figure where `status` is `resolved_local` or `resolved_external`.
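That skip rule reduces to a one-line membership check. A sketch (the `should_process` helper is illustrative, not the pipeline's actual code):

```python
# Statuses that stage 2 treats as already done.
DONE_STATUSES = {"resolved_local", "resolved_external"}


def should_process(record: dict) -> bool:
    """Return True if stage 2 should (re)process this figure."""
    return record.get("status") not in DONE_STATUSES
```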
- Docling is required. There is no fallback PDF extractor — if Docling fails, the error propagates.
- Figures that Docling cannot extract an image for are skipped (logged as warnings), not replaced with placeholders.
- Local vision models (moondream, llava) hallucinate on complex figures — they're used for classification/triage only, not precision extraction. Garbled or non-text output (control characters, `<unk>` tokens) is automatically detected and treated as a failed description.
- Complex figures (plots, pinouts, schematics) are intentionally deferred to an external LLM via the rollup report.
- Docling does not extract figure captions or bounding boxes for many PDF layouts. Figures will have empty `caption` fields and zeroed `bbox` values. This limits rule-based classification (which falls back to page context or LLM inference).
- Table header OCR artifacts (e.g., `THERMAL.THERMAL`, `PACKAGE- LEAD`) are passed through as-is from Docling's table extraction. These are upstream OCR/layout issues, not pipeline bugs.
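A garbled-output check of the kind described (control characters, `<unk>` tokens) might look like the sketch below; the pipeline's actual heuristic may differ, and `looks_garbled` is a hypothetical name:

```python
import unicodedata


def looks_garbled(text: str) -> bool:
    """Heuristic: treat a vision-model description as failed if it
    contains <unk> tokens or control characters (other than \n and \t)."""
    if "<unk>" in text:
        return True
    return any(
        unicodedata.category(ch) == "Cc" and ch not in "\n\t"
        for ch in text
    )
```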