
feat(ingest): describe PDF figures with vision model #55

Open

scaleborg wants to merge 7 commits into main from feat/pdf-figure-descriptions
Conversation

@scaleborg
Owner

Summary

  • Enable include_image_base64=True in Mistral OCR and describe significant figures using Gemini 2.5 Flash
  • Figures (>2% page area) get text descriptions appended as [Figure description – page N] blocks before chunking
  • Parallel vision calls (4 workers), graceful degradation if no Gemini key, non-PDF paths unaffected
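The PR's actual implementation isn't shown on this page, but the two mechanisms the summary names (size-based figure filtering and a 4-worker parallel describe step) can be sketched as follows. The function names, figure dict shape (`w`, `h` keys), and `describe_fn` callback are assumptions for illustration, not the PR's real API:

```python
from concurrent.futures import ThreadPoolExecutor

MIN_AREA_RATIO = 0.02  # skip icons/logos below 2% of page area


def significant_figures(figures, page_w, page_h):
    """Keep only figures covering more than 2% of the page area."""
    page_area = page_w * page_h
    return [f for f in figures if (f["w"] * f["h"]) / page_area > MIN_AREA_RATIO]


def describe_all(images, describe_fn, workers=4):
    """Describe figures in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(describe_fn, images))
```

`pool.map` preserves input order, which keeps descriptions aligned with their source pages even though the vision calls complete out of order.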

Verified on

| Test | Result |
| --- | --- |
| PDF with figures (IRPAPERS, 24pp, 7 figures) | 91 chunks, 6 chunks contain figure descriptions, 8 concepts + 13 tools extracted |
| PDF without figures (SVD_Notes) | 27 chunks, zero vision calls, no overhead |
| Markdown file (AGENTS.md) | 8 chunks, no regression |
| Before/after text size | 91,064 → 95,339 chars (+4.7%) |
| Vision cost | ~$0.01 per PDF (~1.3s/image, Gemini Flash) |

Example stored chunk

[Source: 2602.17687v1]

[Figure description – page 2]
This diagram illustrates a multi-modal approach for "Retrieval and QA Analysis"
of "IRPAPERS". It processes input papers through three distinct pathways:
"Page Image" leading to "Multi-Vector Image Embeddings", "OCR Transcription"
leading to "BM25" and "Single-Vector Text Embeddings". All these embedding and
retrieval methods converge into the final "Retrieval and QA Analysis" stage.

Test plan

  • API ingest of PDF with figures — chunks include descriptions
  • API ingest of PDF without figures — no vision calls, no overhead
  • API ingest of non-PDF file — no regression
  • Learnings extraction works with figure-enriched text
  • Ruff lint passes (no new violations)
  • Pre-commit hooks pass

…ction

Enable include_image_base64=True in Mistral OCR calls and describe
significant figures (>2% page area) using Gemini 2.5 Flash. Descriptions
are appended as [Figure description – page N] blocks before chunking,
making diagram/chart content available for downstream atom extraction.

- Parallel vision calls (4 workers, 10s timeout per image)
- Size-based filtering skips icons/logos
- Graceful degradation: no GEMINI_API_KEY = silent no-op
- Non-PDF paths completely unaffected
- ~$0.01 additional cost per PDF, ~4-5% text increase
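The "graceful degradation" and "non-PDF paths unaffected" behaviors above amount to an early return before any vision call is made. A minimal sketch, with a hypothetical `maybe_describe_figures` helper and figure dict shape (`page`, `image_base64`) assumed for illustration:

```python
import os


def maybe_describe_figures(text, figures, describe_fn):
    """Enrich OCR text with figure descriptions; silent no-op without a key."""
    if not os.environ.get("GEMINI_API_KEY") or not figures:
        return text  # keyless or figure-free paths pay zero overhead
    blocks = []
    for fig in figures:
        desc = describe_fn(fig["image_base64"])
        if desc:
            blocks.append(f"[Figure description – page {fig['page']}]\n{desc}")
    if not blocks:
        return text
    return text + "\n\n" + "\n\n".join(blocks)
```

Appending the blocks to the OCR text before chunking is what makes figure content visible to the downstream atom extraction without any changes to the chunker itself.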
… extraction

Update vision prompt to explicitly request method names, numeric values,
metric names, and key comparisons. Suppress noise (raw vector values,
index numbers). Add title line parsing so descriptions render as
"Figure (page N): <title>" followed by a dense paragraph.
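The title-line parsing described above can be sketched as splitting the model output's first line off as a title. The function name and exact rendering are assumptions; the PR's real parser isn't shown on this page:

```python
def render_figure_block(page, description):
    """Render a vision description as 'Figure (page N): <title>' plus body."""
    first, _, rest = description.partition("\n")
    title = first.strip()
    body = rest.strip()
    if body:  # first line acts as the title for the dense paragraph
        return f"Figure (page {page}): {title}\n{body}"
    return f"Figure (page {page}): {title}"
```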

Tested on IRPAPERS (24pp, 7 figures): chart descriptions now capture
individual method scores (e.g. Recall@1 = 0.49) instead of vague
"achieves highest recall". Diagram descriptions stay concise.

Add scripts/eval_figure_retrieval.py - runs 5 figure-dependent queries
and measures Hit@5 and MRR for chunks containing figure descriptions.

Results on IRPAPERS (7 figures, 93 chunks):
  Hit@5: 1.00 (5/5 queries)
  MRR:   0.690
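The two metrics the eval script reports have standard definitions, shown here as a sketch (the script's real internals aren't included on this page):

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(r in relevant_ids for r in ranked_ids[:k]) else 0.0


def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, r in enumerate(ranked_ids, start=1):
        if r in relevant_ids:
            return 1.0 / rank
    return 0.0
```

An MRR of 0.690 over 5 queries with Hit@5 = 1.00 means every query found a relevant figure chunk in the top 5, with the first hit averaging between rank 1 and 2.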

Baseline RAGAS eval (15 cases) shows no regression:
  Gate Average: 0.845 (+0.032 vs previous)

Add 3 cases (tc_fig_001-003) that query figure description content
from the IRPAPERS paper. Tagged with "figure" for filtering.

- tc_fig_001: numeric lookup (recall@20 value from bar chart)
- tc_fig_002: method comparison (best at Recall@1)
- tc_fig_003: pipeline architecture (diagram components)

Results: Faithfulness 0.95-1.0, Context Recall 1.0 on all three.
Figure chunks rank #1-3 in retrieval for all cases.
Dataset version bumped to 2.1.0 (15 → 18 cases).

Update ingestion, critical-path, and engineering docs to reflect:
- Mistral OCR with include_image_base64=True
- Gemini 2.5 Flash figure descriptions (>2% page area filter)
- Figure description format in chunks
- Gemini and Mistral OCR in LLM provider map
- Figure-dependent eval cases (tc_fig_*) and retrieval eval script

Add mermaid flowchart showing the figure detection and vision
description step in the PDF ingest path.

Add scripts/eval_figure_chat_answers.py and `make eval-figure-chat`.
Verifies the full pipeline: PDF figure descriptions -> retrieval ->
chat answers contain figure-derived content (numeric values, method
names, pipeline components). Includes sanity check that non-figure
queries don't surface figure chunks. Runtime: ~13 seconds.
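The sanity check mentioned above (non-figure queries should not surface figure chunks) can be sketched as follows; the helper names and the `retrieve` callback are illustrative assumptions, not the script's actual interface:

```python
def figure_chunks(chunks):
    """Select chunks that carry an appended figure description block."""
    return [c for c in chunks if "[Figure description" in c]


def check_no_figure_leakage(retrieve, non_figure_queries, k=5):
    """Fail if any non-figure query retrieves a figure-description chunk."""
    leaked = {}
    for q in non_figure_queries:
        hits = figure_chunks(retrieve(q)[:k])
        if hits:
            leaked[q] = hits
    return leaked  # empty dict means the sanity check passed
```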