feat(ingest): describe PDF figures with vision model#55
Conversation
…ction

Enable `include_image_base64=True` in Mistral OCR calls and describe significant figures (>2% of page area) using Gemini 2.5 Flash. Descriptions are appended as `[Figure description – page N]` blocks before chunking, making diagram/chart content available for downstream atom extraction.

- Parallel vision calls (4 workers, 10 s timeout per image)
- Size-based filtering skips icons and logos
- Graceful degradation: no GEMINI_API_KEY means a silent no-op
- Non-PDF paths are completely unaffected
- ~$0.01 additional cost per PDF, ~4–5% text increase
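The filtering and parallel-call behavior described above might be sketched as follows. This is an illustrative outline, not the PR's actual code: the function names, figure dict fields, and `describe_fn` callback are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_AREA_RATIO = 0.02   # skip figures under 2% of page area (icons, logos)
MAX_WORKERS = 4         # parallel vision calls
PER_IMAGE_TIMEOUT = 10  # seconds per image

def significant_figures(figures, page_w, page_h):
    """Keep only figures covering more than 2% of the page area."""
    page_area = page_w * page_h
    return [f for f in figures
            if (f["width"] * f["height"]) / page_area > MIN_AREA_RATIO]

def describe_all(figures, describe_fn):
    """Describe figures in parallel; drop any call that fails or times out."""
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(describe_fn, f["image_base64"]): i
                   for i, f in enumerate(figures)}
        for fut, i in futures.items():
            try:
                results[i] = fut.result(timeout=PER_IMAGE_TIMEOUT)
            except Exception:
                pass  # graceful degradation: a missing description is a no-op
    return results
```

A failed or slow vision call simply leaves that figure undescribed, which matches the PR's no-op behavior when GEMINI_API_KEY is absent.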
… extraction

Update the vision prompt to explicitly request method names, numeric values, metric names, and key comparisons, and to suppress noise (raw vector values, index numbers). Add title-line parsing so descriptions render as "Figure (page N): <title>" followed by a dense paragraph. Tested on IRPAPERS (24 pp, 7 figures): chart descriptions now capture individual method scores (e.g. Recall@1 = 0.49) instead of vague statements like "achieves highest recall". Diagram descriptions stay concise.
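The title-line parsing could look roughly like this, assuming the model emits the title on its first output line (the function name is hypothetical):

```python
def format_figure_description(raw: str, page: int) -> str:
    """Render a vision-model description as
    'Figure (page N): <title>' followed by a dense paragraph.
    Assumes the model puts the title on the first non-empty line."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return ""
    title, body = lines[0], " ".join(lines[1:])
    header = f"Figure (page {page}): {title}"
    return f"{header}\n{body}" if body else header
```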
Add `scripts/eval_figure_retrieval.py`, which runs 5 figure-dependent queries and measures Hit@5 and MRR for chunks containing figure descriptions.

Results on IRPAPERS (7 figures, 93 chunks):
- Hit@5: 1.00 (5/5 queries)
- MRR: 0.690

Baseline RAGAS eval (15 cases) shows no regression: Gate Average 0.845 (+0.032 vs. previous).
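The two metrics are standard; a minimal sketch of how the eval script might compute them (the `evaluate` signature is illustrative, not the script's actual API):

```python
def hit_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return float(any(cid in relevant_ids for cid in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    Returns (mean Hit@5, MRR) over all queries."""
    hits = [hit_at_k(ranked, rel) for ranked, rel in runs]
    rrs = [reciprocal_rank(ranked, rel) for ranked, rel in runs]
    n = len(runs)
    return sum(hits) / n, sum(rrs) / n
```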
Add 3 cases (tc_fig_001–003) that query figure-description content from the IRPAPERS paper, tagged with "figure" for filtering.

- tc_fig_001: numeric lookup (Recall@20 value from a bar chart)
- tc_fig_002: method comparison (best at Recall@1)
- tc_fig_003: pipeline architecture (diagram components)

Results: Faithfulness 0.95–1.0 and Context Recall 1.0 on all three; figure chunks rank #1–3 in retrieval for all cases. Dataset version bumped to 2.1.0 (15 → 18 cases).
Update the ingestion, critical-path, and engineering docs to reflect:
- Mistral OCR with `include_image_base64=True`
- Gemini 2.5 Flash figure descriptions (>2% page-area filter)
- The figure-description format in chunks
- Gemini and Mistral OCR in the LLM provider map
- Figure-dependent eval cases (tc_fig_*) and the retrieval eval script
Add a Mermaid flowchart showing the figure-detection and vision-description step in the PDF ingest path.
Add `scripts/eval_figure_chat_answers.py` and `make eval-figure-chat`. Verifies the full pipeline: PDF figure descriptions → retrieval → chat answers containing figure-derived content (numeric values, method names, pipeline components). Includes a sanity check that non-figure queries don't surface figure chunks. Runtime: ~13 seconds.
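The sanity check can be as simple as scanning retrieved chunks for the figure-description marker. A sketch, assuming figure chunks carry the `[Figure description – page N]` prefix described above (the helper names are illustrative):

```python
FIGURE_MARKER = "[Figure description – page"

def is_figure_chunk(text: str) -> bool:
    """A chunk counts as figure-derived if it carries the description marker."""
    return FIGURE_MARKER in text

def sanity_check(non_figure_results):
    """non_figure_results: {query: [chunk_text, ...]} for queries that should
    not need figure content. Returns only the offending queries and chunks."""
    offenders = {q: [c for c in chunks if is_figure_chunk(c)]
                 for q, chunks in non_figure_results.items()}
    return {q: cs for q, cs in offenders.items() if cs}
```

An empty return value means no non-figure query leaked figure chunks into its context.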
Summary
- Enable `include_image_base64=True` in Mistral OCR and describe significant figures using Gemini 2.5 Flash
- Append `[Figure description – page N]` blocks before chunking
- Verified on
Example stored chunk
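The actual example did not survive extraction; a hypothetical chunk following the `[Figure description – page N]` / `Figure (page N): <title>` format described in this PR might look like:

```text
[Figure description – page 6]
Figure (page 6): Recall comparison across retrieval methods
Bar chart comparing retrieval methods by Recall@1; the best-performing
method reaches Recall@1 = 0.49, ahead of the remaining baselines.
```

The page number, title, and wording here are illustrative; only the block format and the Recall@1 = 0.49 score come from the PR description.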
Test plan