
feat(ingest): describe PDF figures with vision model #55

Open

scaleborg wants to merge 7 commits into main from feat/pdf-figure-descriptions
Conversation

@scaleborg
Owner

Summary

  • Enable include_image_base64=True in Mistral OCR and describe significant figures using Gemini 2.5 Flash
  • Figures (>2% page area) get text descriptions appended as [Figure description – page N] blocks before chunking
  • Parallel vision calls (4 workers), graceful degradation if no Gemini key, non-PDF paths unaffected
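The PR's actual implementation isn't shown on this page, but the two mechanisms the summary names (size-based figure filtering and a 4-worker parallel describe step) can be sketched as follows. The function names, figure dict shape (`w`, `h` keys), and `describe_fn` callback are assumptions for illustration, not the PR's real API:

```python
from concurrent.futures import ThreadPoolExecutor

MIN_AREA_RATIO = 0.02  # skip icons/logos below 2% of page area


def significant_figures(figures, page_w, page_h):
    """Keep only figures covering more than 2% of the page area."""
    page_area = page_w * page_h
    return [f for f in figures if (f["w"] * f["h"]) / page_area > MIN_AREA_RATIO]


def describe_all(images, describe_fn, workers=4):
    """Describe figures in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(describe_fn, images))
```

`pool.map` preserves input order, which keeps descriptions aligned with their source pages even though the vision calls complete out of order.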

Verified on

| Test | Result |
| --- | --- |
| PDF with figures (IRPAPERS, 24pp, 7 figures) | 91 chunks, 6 chunks contain figure descriptions, 8 concepts + 13 tools extracted |
| PDF without figures (SVD_Notes) | 27 chunks, zero vision calls, no overhead |
| Markdown file (AGENTS.md) | 8 chunks, no regression |
| Before/after text size | 91,064 → 95,339 chars (+4.7%) |
| Vision cost | ~$0.01 per PDF (~1.3s/image, Gemini Flash) |

Example stored chunk

[Source: 2602.17687v1]

[Figure description – page 2]
This diagram illustrates a multi-modal approach for "Retrieval and QA Analysis"
of "IRPAPERS". It processes input papers through three distinct pathways:
"Page Image" leading to "Multi-Vector Image Embeddings", "OCR Transcription"
leading to "BM25" and "Single-Vector Text Embeddings". All these embedding and
retrieval methods converge into the final "Retrieval and QA Analysis" stage.

Test plan

  • API ingest of PDF with figures — chunks include descriptions
  • API ingest of PDF without figures — no vision calls, no overhead
  • API ingest of non-PDF file — no regression
  • Learnings extraction works with figure-enriched text
  • Ruff lint passes (no new violations)
  • Pre-commit hooks pass

…ction

Enable include_image_base64=True in Mistral OCR calls and describe
significant figures (>2% page area) using Gemini 2.5 Flash. Descriptions
are appended as [Figure description – page N] blocks before chunking,
making diagram/chart content available for downstream atom extraction.

- Parallel vision calls (4 workers, 10s timeout per image)
- Size-based filtering skips icons/logos
- Graceful degradation: no GEMINI_API_KEY = silent no-op
- Non-PDF paths completely unaffected
- ~$0.01 additional cost per PDF, ~4-5% text increase
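The "graceful degradation" and "non-PDF paths unaffected" behaviors above amount to an early return before any vision call is made. A minimal sketch, with a hypothetical `maybe_describe_figures` helper and figure dict shape (`page`, `image_base64`) assumed for illustration:

```python
import os


def maybe_describe_figures(text, figures, describe_fn):
    """Enrich OCR text with figure descriptions; silent no-op without a key."""
    if not os.environ.get("GEMINI_API_KEY") or not figures:
        return text  # keyless or figure-free paths pay zero overhead
    blocks = []
    for fig in figures:
        desc = describe_fn(fig["image_base64"])
        if desc:
            blocks.append(f"[Figure description – page {fig['page']}]\n{desc}")
    if not blocks:
        return text
    return text + "\n\n" + "\n\n".join(blocks)
```

Appending the blocks to the OCR text before chunking is what makes figure content visible to the downstream atom extraction without any changes to the chunker itself.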
… extraction

Update vision prompt to explicitly request method names, numeric values,
metric names, and key comparisons. Suppress noise (raw vector values,
index numbers). Add title line parsing so descriptions render as
"Figure (page N): <title>" followed by a dense paragraph.
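The title-line parsing described above can be sketched as splitting the model output's first line off as a title. The function name and exact rendering are assumptions; the PR's real parser isn't shown on this page:

```python
def render_figure_block(page, description):
    """Render a vision description as 'Figure (page N): <title>' plus body."""
    first, _, rest = description.partition("\n")
    title = first.strip()
    body = rest.strip()
    if body:  # first line acts as the title for the dense paragraph
        return f"Figure (page {page}): {title}\n{body}"
    return f"Figure (page {page}): {title}"
```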

Tested on IRPAPERS (24pp, 7 figures): chart descriptions now capture
individual method scores (e.g. Recall@1 = 0.49) instead of vague
"achieves highest recall". Diagram descriptions stay concise.

Add scripts/eval_figure_retrieval.py - runs 5 figure-dependent queries
and measures Hit@5 and MRR for chunks containing figure descriptions.

Results on IRPAPERS (7 figures, 93 chunks):
  Hit@5: 1.00 (5/5 queries)
  MRR:   0.690
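The two metrics the eval script reports have standard definitions, shown here as a sketch (the script's real internals aren't included on this page):

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, r in enumerate(ranked_ids, start=1):
        if r in relevant_ids:
            return 1.0 / rank
    return 0.0
```

An MRR of 0.690 over 5 queries with Hit@5 = 1.00 means every query found a relevant figure chunk in the top 5, with the first hit averaging between rank 1 and 2.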

Baseline RAGAS eval (15 cases) shows no regression:
  Gate Average: 0.845 (+0.032 vs previous)

Add 3 cases (tc_fig_001-003) that query figure description content
from the IRPAPERS paper. Tagged with "figure" for filtering.

- tc_fig_001: numeric lookup (recall@20 value from bar chart)
- tc_fig_002: method comparison (best at Recall@1)
- tc_fig_003: pipeline architecture (diagram components)

Results: Faithfulness 0.95-1.0, Context Recall 1.0 on all three.
Figure chunks rank #1-3 in retrieval for all cases.
Dataset version bumped to 2.1.0 (15 → 18 cases).

Update ingestion, critical-path, and engineering docs to reflect:
- Mistral OCR with include_image_base64=True
- Gemini 2.5 Flash figure descriptions (>2% page area filter)
- Figure description format in chunks
- Gemini and Mistral OCR in LLM provider map
- Figure-dependent eval cases (tc_fig_*) and retrieval eval script

Add mermaid flowchart showing the figure detection and vision
description step in the PDF ingest path.

Add scripts/eval_figure_chat_answers.py and `make eval-figure-chat`.
Verifies the full pipeline: PDF figure descriptions -> retrieval ->
chat answers contain figure-derived content (numeric values, method
names, pipeline components). Includes sanity check that non-figure
queries don't surface figure chunks. Runtime: ~13 seconds.
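The sanity check mentioned above (non-figure queries should not surface figure chunks) can be sketched as follows; the helper names and the `retrieve` callback are illustrative assumptions, not the script's actual interface:

```python
def figure_chunks(chunks):
    """Select chunks that carry an appended figure description block."""
    return [c for c in chunks if "[Figure description" in c]


def check_no_figure_leakage(retrieve, non_figure_queries, k=5):
    """Fail if any non-figure query retrieves a figure-description chunk."""
    leaked = {}
    for q in non_figure_queries:
        hits = figure_chunks(retrieve(q)[:k])
        if hits:
            leaked[q] = hits
    return leaked  # empty dict means the sanity check passed
```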