Datasheet Extractor

Local-first Python CLI for extracting datasheet PDFs into structured JSON, table exports, figure artifacts, and processing rollup reports.

Install

Using Poetry (recommended)

poetry install
poetry shell  # activate the virtual environment

Using pip

# Install from local directory
pip install .

# Install directly from GitHub
pip install git+https://github.com/amahpour/datasheet-extractor.git

Ollama (local vision LLM)

The pipeline uses Ollama to classify and describe extracted figures locally. Install it, then pull a vision model:

brew install ollama
ollama serve            # or open the Ollama macOS app
ollama pull moondream   # fast, ~1.7GB

Interactive notebook demo

Use the walkthrough notebook to demo the "before vs after" pipeline:

poetry run pip install jupyter ipykernel
poetry run jupyter notebook notebooks/pipeline_walkthrough.ipynb

The notebook shows:

raw Docling extraction output
schema-normalized objects and enrichment
end-to-end process_pdf() run with rollups and reports

CLI usage

Use --file to process a single PDF, or --dir to process every PDF in a directory.

Note: If using Poetry, prefix commands with poetry run (e.g., poetry run datasheet-extract).

# Single file
datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out

# Single file, first 3 pages only
datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out --pages "1-3"

# All PDFs in a directory
datasheet-extract --dir ./examples --out ./out

# Specify Ollama model explicitly
datasheet-extract --dir ./examples --out ./out --ollama-model moondream

# Skip images or tables
datasheet-extract --dir ./examples --out ./out --no-images
datasheet-extract --dir ./examples --out ./out --no-tables

With Poetry:

poetry run datasheet-extract --file ./examples/dac7578/dac5578.pdf --out ./out
poetry run datasheet-extract --dir ./examples --out ./out
poetry run datasheet-smoke-test

With pip or as a module:

python -m cli.ds_extract --file ./examples/dac7578/dac5578.pdf --out ./out
python -m cli.ds_extract --dir ./examples --out ./out
python -m cli.smoke_test

Options

--file — path to a single PDF file (mutually exclusive with --dir)
--dir — directory containing PDF files (mutually exclusive with --file)
--out (default ./out)
--glob (default *.pdf) — filename pattern when using --dir
--pages "1-3,7"
--force — reprocess even if status files exist
--no-images
--no-tables
--max-figures N
--ollama-model NAME — Ollama vision model (auto-detected if omitted)
--max-tokens N (default 256) — max tokens per text block chunk (aligned to embedding model tokenizer)

Two-stage workflow

Stage 1: Extract + local processing (automatic)

The pipeline runs two phases in one command:

Docling extraction — text is chunked using Docling's HybridChunker, which provides structure-aware, token-aligned chunks with section heading context. Each chunk carries its parent heading hierarchy and an enriched_text field ready for embedding. The chunker uses the sentence-transformers/all-MiniLM-L6-v2 tokenizer (configurable via --max-tokens, default 256). Tables are exported as CSV/MD/JSON, and figure images as PNG.
Local figure processing — each figure is classified and described by Ollama. Results are written to per-figure status files.

Figures are classified and routed:

Resolved locally — logos, icons, simple photos
Needs external LLM — plots, pinouts, schematics, block diagrams, timing diagrams

Stage 2: External LLM (manual)

Open processing_rollup.md to see what still needs processing. For each figure flagged as needs_external, use the prompt template in prompts/figure_analysis.md with a vision-capable LLM (GPT-4o, Claude, etc.).

After processing, update the figure's status file in processing/ to "status": "resolved_external" and fill in external_llm_result. Also record which external model produced that result by setting external_llm_provider and external_llm_model.

Output layout

Per PDF (out/<pdf_stem>/):

document.json — full extraction: structure-aware chunked text blocks (with headings and enriched text), tables, figures
index.json — paths to all artifacts
figures/fig_0001.png — extracted figure images
tables/table_0001.{json,csv,md} — extracted tables
processing/fig_0001.json — per-figure processing status
processing_rollup.{json,md} — summary of what's resolved vs pending
derived/figures/fig_0001/meta.json
derived/figures/fig_0001/description.md
manual_processing_report.{md,json}

Global (out/):

index.json
processing_rollup.json — global rollup across all PDFs
manual_processing_report.{md,json}

Per-figure status file

Each processing/fig_XXXX.json tracks:

{
  "figure_id": "fig_0017",
  "image_path": "out/dac5578/figures/fig_0017.png",
  "status": "needs_external",
  "stage": "local_llm",
  "local_llm_provider": "ollama",
  "local_llm_model": "moondream",
  "local_llm_description": "Graph showing linearity error vs digital input code...",
  "local_llm_classification": "plot",
  "external_llm_provider": "",
  "external_llm_model": "",
  "external_llm_result": null,
  "needs_external": true,
  "confidence": 0.3,
  "processed_at": "2026-02-12T00:33:06Z"
}

Status values: resolved_local, needs_external, resolved_external

Stage 2 skips any figure where status is resolved_local or resolved_external.

Caveats

Docling is required. There is no fallback PDF extractor — if Docling fails, the error propagates.
Figures that Docling cannot extract an image for are skipped (logged as warnings), not replaced with placeholders.
Local vision models (moondream, llava) hallucinate on complex figures — they're used for classification/triage only, not precision extraction. Garbled or non-text output (control characters, <unk> tokens) is automatically detected and treated as a failed description.
Complex figures (plots, pinouts, schematics) are intentionally deferred to external LLM via the rollup report.
Docling does not extract figure captions or bounding boxes for many PDF layouts. Figures will have empty caption fields and zeroed bbox values. This limits rule-based classification (which falls back to page context or LLM inference).
Table header OCR artifacts (e.g., THERMAL.THERMAL, PACKAGE- LEAD) are passed through as-is from Docling's table extraction. These are upstream OCR/layout issues, not pipeline bugs.

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
.github/workflows		.github/workflows
analysis		analysis
cli		cli
docs		docs
examples/dac7578		examples/dac7578
notebooks		notebooks
out_example		out_example
prompts		prompts
src		src
tests		tests
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Datasheet Extractor

Install

Using Poetry (recommended)

Using pip

Ollama (local vision LLM)

Interactive notebook demo

CLI usage

Options

Two-stage workflow

Stage 1: Extract + local processing (automatic)

Stage 2: External LLM (manual)

Output layout

Per-figure status file

Caveats

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Datasheet Extractor

Install

Using Poetry (recommended)

Using pip

Ollama (local vision LLM)

Interactive notebook demo

CLI usage

Options

Two-stage workflow

Stage 1: Extract + local processing (automatic)

Stage 2: External LLM (manual)

Output layout

Per-figure status file

Caveats

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages