Add full page context expansion with neighboring page support#338
shubhadeepd merged 11 commits into `develop` from `dev/pranjald/page-context-expansion`
| Cohort / File(s) | Summary |
|---|---|
| **Configuration & Environment**<br>`deploy/compose/docker-compose-rag-server.yaml`, `src/nvidia_rag/utils/configuration.py` | Added `APP_FETCH_FULL_PAGE_CONTEXT` and `APP_FETCH_NEIGHBORING_PAGES` environment variables with field validators enforcing that neighboring-pages expansion requires full-page context. |
| **RAG Pipeline Core**<br>`src/nvidia_rag/rag_server/main.py` | Major additions to support page-context expansion: new parameters propagated through `generate()` and `_rag_chain()`, helper methods for page extraction/expansion/formatting, VDB re-fetching of missing chunks, page-aware context formatting, and extensive logging for retrieved pages and context structure. |
| **API & Server Layer**<br>`src/nvidia_rag/rag_server/server.py` | Added `Prompt` model fields `fetch_full_page_context` and `fetch_neighboring_pages` with CONFIG defaults, threaded through the `/generate` endpoint to the backend `RAG.generate()` call. |
| **VLM Integration**<br>`src/nvidia_rag/rag_server/vlm.py` | Added an `organize_by_page` mode to `extract_and_process_messages` and `stream_with_messages` to interleave text/images per page; new helpers for building page-organized content parts, extracting images, and logging content structure. |
| **Vector Database Layer**<br>`src/nvidia_rag/utils/vdb/elasticsearch/es_queries.py`, `src/nvidia_rag/utils/vdb/elasticsearch/elastic_vdb.py`, `src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py`, `src/nvidia_rag/utils/vdb/vdb_base.py` | Added an abstract `retrieve_chunks_by_filter()` method with Elasticsearch and Milvus implementations to fetch chunks by source name and page numbers; added an Elasticsearch query builder for filtering by source and pages. |
| **Unit Tests**<br>`tests/unit/test_rag_server/test_page_context_organization.py`, `tests/unit/test_utils/test_configuration.py` | Comprehensive test coverage for page extraction, expansion, and formatting helpers; validation tests for configuration cross-field rules. |
Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant Server
    participant RAG as RAG Pipeline
    participant VDB as Vector Database
    participant VLM as Vision Language Model
    Client->>Server: POST /generate (fetch_full_page_context, fetch_neighboring_pages)
    Server->>RAG: generate(prompt, fetch_full_page_context, fetch_neighboring_pages)
    RAG->>VDB: semantic_search (initial retrieval)
    VDB-->>RAG: initial_documents
    RAG->>RAG: _extract_page_set_from_docs()
    RAG->>RAG: _expand_page_set_with_neighbors()
    RAG->>VDB: retrieve_chunks_by_filter(source, expanded_pages)
    VDB-->>RAG: full_page_chunks
    RAG->>RAG: _expand_and_organize_context()
    RAG->>RAG: _format_context_by_page() or existing formatter
    RAG->>VLM: stream_with_messages(docs, organize_by_page)
    VLM->>VLM: _build_content_parts_by_page()
    VLM->>VLM: _extract_images_from_docs()
    VLM-->>Client: streaming response
```
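The page-expansion steps named in the diagram can be sketched roughly as below. The helper names and the dict-based document shape are simplifications for illustration; the real helpers in `src/nvidia_rag/rag_server/main.py` operate on LangChain `Document` objects and may differ.

```python
# Hypothetical sketch of _extract_page_set_from_docs / _expand_page_set_with_neighbors.
def extract_page_set(docs: list[dict]) -> set[tuple[str, int]]:
    """Collect (source_name, page_number) pairs from retrieved chunk metadata."""
    pages: set[tuple[str, int]] = set()
    for doc in docs:
        meta = doc.get("metadata", {}) or {}
        source = (meta.get("source") or {}).get("source_name", "")
        page = (meta.get("content_metadata") or {}).get("page_number")
        if source and isinstance(page, int):
            pages.add((source, page))
    return pages


def expand_with_neighbors(pages: set[tuple[str, int]], n: int) -> set[tuple[str, int]]:
    """Add up to n neighboring pages on each side of every hit, clamped at page 1."""
    expanded = set(pages)
    for source, page in pages:
        for offset in range(1, n + 1):
            if page - offset >= 1:
                expanded.add((source, page - offset))
            expanded.add((source, page + offset))
    return expanded
```

The expanded set then drives the `retrieve_chunks_by_filter(source, expanded_pages)` call shown in the diagram.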
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~65 minutes
Possibly related PRs
- Integrate nemotron-nano-12b-v2-vl VLM with RAG #73: Modifies VLM message-streaming integration (extract_and_process_messages, stream_with_messages signatures) used by the RAG generate/_rag_chain paths, creating a code-level dependency with this PR's VLM changes.
Suggested reviewers
- smasurekar
- nv-nikkulkarni
- shubhadeepd
Poem
🐰 A rabbit's ode to expanding pages:
Hops through chapters, page by page,
Neighboring neighbors join the stage,
Full context gathered, no chunk amiss,
VLM vision arranged like this—
Long-eared logic, wisely sage! 📖✨
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Merge Conflict Detection | ❌ Failed | Merge conflicts detected (34 files, all content conflicts): `.github/workflows/ci-pipeline.yml`, `README.md`, `deploy/compose/docker-compose-rag-server.yaml`, `docs/api-ingestor.md`, `docs/api-rag.md`, `docs/assets/arch_diagram.png`, `docs/change-model.md`, `docs/deploy-docker-self-hosted.md`, `docs/deploy-helm.md`, `docs/mig-deployment.md`, `docs/multi-collection-retrieval.md`, `docs/observability.md`, `docs/python-client.md`, `docs/release-notes.md`, `docs/support-matrix.md`, `docs/text_only_ingest.md`, `docs/troubleshooting.md`, `docs/user-interface.md`, `examples/nvidia_rag_mcp/mcp_server.py`, `examples/rag_react_agent/pyproject.toml`, `examples/rag_react_agent/uv.lock`, `notebooks/launchable.ipynb`, `notebooks/nat_mcp_integration.ipynb`, `src/nvidia_rag/rag_server/main.py`, `src/nvidia_rag/rag_server/server.py`, `src/nvidia_rag/rag_server/vlm.py`, `src/nvidia_rag/utils/configuration.py`, `src/nvidia_rag/utils/vdb/elasticsearch/elastic_vdb.py`, `src/nvidia_rag/utils/vdb/elasticsearch/es_queries.py`, `src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py`, `src/nvidia_rag/utils/vdb/vdb_base.py`, `tests/integration/README.md`, `tests/unit/test_utils/test_configuration.py`, `uv.lock`. These conflicts must be resolved before merging into `develop`. | Resolve conflicts locally and push changes to this branch. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.27%, which is sufficient; the required threshold is 80.00%. |
| Title Check | ✅ Passed | The title accurately and specifically describes the main change: adding full page context expansion with neighboring page support, which is the primary objective reflected across all modified files. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/nvidia_rag/rag_server/vlm.py (1)
706-717: ⚠️ Potential issue | 🟠 Major

`analyze_with_messages` lacks the `organize_by_page` parameter present in `stream_with_messages`.

`stream_with_messages` (line 753) accepts and passes `organize_by_page` to `extract_and_process_messages`, but `analyze_with_messages` (line 657) does not have this parameter at all. Since `extract_and_process_messages` supports `organize_by_page`, the non-streaming VLM path cannot utilize per-page organization regardless of caller intent. Should `analyze_with_messages` also accept and forward the `organize_by_page` parameter for API consistency?

src/nvidia_rag/rag_server/server.py (1)
1355-1393: ⚠️ Potential issue | 🟡 Minor

New fields missing from the `request_data` logging dict.

All other `Prompt` fields are logged in `request_data`, but `fetch_full_page_context` and `fetch_neighboring_pages` are omitted. This makes debugging page-context issues harder.

Proposed fix:

```diff
     "filter_expr": prompt.filter_expr,
     "confidence_threshold": prompt.confidence_threshold,
+    "fetch_full_page_context": prompt.fetch_full_page_context,
+    "fetch_neighboring_pages": prompt.fetch_neighboring_pages,
 }
```
🤖 Fix all issues with AI agents
In `@src/nvidia_rag/rag_server/main.py`:
- Around line 3468-3492: The loop computing grouped keys contains an unused
`filename` variable; replace that variable with `_` (or remove its assignment)
so only the intended loop variables `(filename, source_path, page_num, doc)` are
used for grouping, and remove the redundant filename recomputation earlier to
avoid confusion. In the `keys_sorted` iteration use the computed `filename` only
once when creating `marker = f"=== Page {page_num} ({filename}) ===\n"` (keep
the existing os.path.splitext/os.path.basename expression there) and delete the
earlier unused `filename` assignment. For the `no_page` branch, append a single
combined string (e.g., `"=== Additional context ===\n" +
"\n\n".join(format_fn(d) for d in no_page)`) instead of two separate parts so
formatting matches the `has_page` path and prevents an extra "\n\n" from
appearing when joining `parts`. Ensure references to `grouped`, `keys_sorted`,
`format_fn`, `marker`, and `no_page` are updated accordingly.
- Around line 3365-3440: The _expand_and_organize_context method currently
performs synchronous VDB network calls (vdb_op.retrieve_chunks_by_filter) inside
the async RAG path which will block the event loop; update this by removing the
unused collection_names parameter from the signature, and change the fetch loop
to run retrieve_chunks_by_filter on a worker thread (use asyncio.to_thread or
submit to ThreadPoolExecutor) and schedule the per-(coll,source) fetches in
parallel (gather the tasks and then merge results), and replace the misleading
hasattr(vdb_op, "retrieve_chunks_by_filter") guard with a direct call wrapped in
try/except that explicitly catches NotImplementedError and logs warnings; keep
the existing dedup logic (doc_key, seen, merged) but perform deduping after
collected parallel fetch results before returning merged.
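A rough sketch of the parallelization the comment asks for, under the assumption that `retrieve_chunks_by_filter` is a blocking call and that work is grouped per `(collection, source)`. The function name, chunk shape, and grouping are illustrative, not the actual code.

```python
import asyncio


async def fetch_all_pages(vdb_op, groups):
    """Fetch page chunks for each (collection, source, pages) group in parallel.

    Runs the blocking VDB call on worker threads so the async RAG path's
    event loop is not blocked, then dedups the merged results.
    """

    def fetch_one(coll, source, pages):
        try:
            # Direct call instead of a hasattr() guard; backends that don't
            # support filtered retrieval raise NotImplementedError.
            return vdb_op.retrieve_chunks_by_filter(
                collection_name=coll, source_name=source, page_numbers=pages
            )
        except NotImplementedError:
            return []

    tasks = [
        asyncio.to_thread(fetch_one, coll, source, pages)
        for coll, source, pages in groups
    ]
    results = await asyncio.gather(*tasks)

    # Dedup only after all parallel fetches have completed.
    merged, seen = [], set()
    for chunks in results:
        for chunk in chunks:
            key = (chunk.get("source"), chunk.get("page"), chunk.get("text", "")[:300])
            if key not in seen:
                seen.add(key)
                merged.append(chunk)
    return merged
```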
In `@src/nvidia_rag/rag_server/server.py`:
- Around line 583-593: Add a cross-field `@model_validator` to the Prompt pydantic
model that enforces the same rule as validate_page_context_options: if
fetch_neighboring_pages > 0 then fetch_full_page_context must be True; otherwise
raise a ValidationError (or ValueError) with a clear message. Place the
validator inside the Prompt class near the existing field definitions for
fetch_full_page_context and fetch_neighboring_pages and name/reference it
similarly (e.g., validate_page_context_options) so API inputs like
fetch_neighboring_pages=5, fetch_full_page_context=False are rejected.
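A minimal sketch of such a cross-field validator, assuming pydantic v2; only the two field names come from the PR, the rest of the model is illustrative.

```python
from pydantic import BaseModel, model_validator


class Prompt(BaseModel):
    # Field names mirror the PR; defaults here are placeholders.
    fetch_full_page_context: bool = False
    fetch_neighboring_pages: int = 0

    @model_validator(mode="after")
    def validate_page_context_options(self) -> "Prompt":
        # Same rule as the server-side config validator: neighboring-page
        # expansion only makes sense when full-page context is enabled.
        if self.fetch_neighboring_pages > 0 and not self.fetch_full_page_context:
            raise ValueError(
                "fetch_neighboring_pages > 0 requires fetch_full_page_context=True"
            )
        return self
```

With this in place, an API input like `fetch_neighboring_pages=5, fetch_full_page_context=False` fails model validation instead of reaching the RAG pipeline.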
In `@src/nvidia_rag/rag_server/vlm.py`:
- Around line 400-414: The method _build_content_parts_by_page currently ignores
the textual_context parameter and assigns q but never uses it; change the
human_template.format call (and any later question_text usage inside
_build_content_parts_by_page) to use the provided textual_context and the
trimmed q variable (q = (question_text or "").strip()) instead of empty strings
or raw question_text, so the preformatted context and normalized question are
respected; if you decide not to use textual_context here, remove the unused
parameter and q, otherwise update the caller to supply the textual_context
argument so the method can include it when building the intro and any question
formatting.
In `@src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py`:
- Around line 1206-1221: The Milvus query builds filter_expr with substring
matching (like "%{source_name}%") and interpolates source_name raw, causing
inconsistent behavior with the Elasticsearch implementation (which uses exact
term match) and risk of syntax-breaking characters; update the code that
constructs filter_expr in milvus_vdb.py (the filter_expr variable used with
MilvusClient.query) to perform an exact equality check on source["source_name"]
(to mirror es_queries.py's term on metadata.source.source_name.keyword) and
properly escape/quote source_name before interpolation (e.g., ensure internal
quotes/backslashes are escaped and the value is wrapped in quotes) so the filter
string is safe and semantics match across backends.
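One possible shape for the escaped, exact-match filter expression; the field paths (`source["source_name"]`, `content_metadata["page_number"]`) are assumptions based on the metadata layout quoted elsewhere in this review, and the real expression in `milvus_vdb.py` may differ.

```python
def build_page_filter(source_name: str, page_numbers: list[int]) -> str:
    """Build a Milvus filter expression with exact source match and escaping."""
    # Escape backslashes first, then double quotes, so the interpolated
    # value cannot break out of the quoted string literal.
    escaped = source_name.replace("\\", "\\\\").replace('"', '\\"')
    pages = ", ".join(str(int(p)) for p in page_numbers)
    return (
        f'source["source_name"] == "{escaped}" '
        f'and content_metadata["page_number"] in [{pages}]'
    )
```

Exact equality here mirrors the Elasticsearch `term` query on `metadata.source.source_name.keyword`, keeping semantics consistent across backends.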
In `@tests/unit/test_rag_server/test_page_context_organization.py`:
- Around line 198-227: The test test_fetch_full_page_context_calls_vdb should
use call_args.kwargs for clarity and strengthen the final assertion: replace
accessing mock_vdb.retrieve_chunks_by_filter.call_args[1] with
mock_vdb.retrieve_chunks_by_filter.call_args.kwargs to explicitly read keyword
arguments, and change the weak assertion assert len(result) >= 1 to assert
len(result) == 1 because the mock returns an empty list and deduplication in
rag._expand_and_organize_context should leave only the original document.
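For illustration, `call_args.kwargs` is the explicit accessor for a mock call's keyword arguments, equivalent to `call_args[1]` but clearer to read:

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the mocked VDB in the test under discussion.
mock_vdb = MagicMock()
mock_vdb.retrieve_chunks_by_filter(source_name="a.pdf", page_numbers=[1, 2])

# .kwargs reads the keyword arguments by name; [1] does the same positionally.
kwargs = mock_vdb.retrieve_chunks_by_filter.call_args.kwargs
assert kwargs["source_name"] == "a.pdf"
assert kwargs == mock_vdb.retrieve_chunks_by_filter.call_args[1]
```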
🧹 Nitpick comments (9)
src/nvidia_rag/utils/configuration.py (1)
773-801: Consider an upper bound for `fetch_neighboring_pages`.

The validator ensures non-negative values, but there's no upper bound. A very large value (e.g., 1000) would cause the system to attempt fetching thousands of pages per retrieved chunk, which could degrade performance or overwhelm the VDB. A reasonable cap (e.g., 10 or 20) would provide a safety net.
💡 Optional: Add an upper bound
```diff
 @field_validator("fetch_neighboring_pages")
 @classmethod
 def validate_fetch_neighboring_pages(cls, v: int) -> int:
     if not isinstance(v, int) or isinstance(v, bool):
         raise TypeError(
             f"fetch_neighboring_pages must be an integer, got {type(v).__name__}"
         )
     if v < 0:
         raise ValueError(
             f"fetch_neighboring_pages must be >= 0, got {v}"
         )
+    if v > 20:
+        raise ValueError(
+            f"fetch_neighboring_pages must be <= 20, got {v}"
+        )
     return v
```

src/nvidia_rag/utils/vdb/milvus/milvus_vdb.py (1)
1226-1235: Entity-to-Document conversion is duplicated with `retrieval_image_langchain`.

Lines 1226-1233 are nearly identical to the conversion loop in `retrieval_image_langchain` (lines 1180-1187). Consider extracting a shared helper method.

♻️ Optional: Extract helper

```diff
+    @staticmethod
+    def _entities_to_documents(entities: list[dict]) -> list[Document]:
+        """Convert Milvus entities to LangChain Document objects."""
+        docs: list[Document] = []
+        for item in entities:
+            page_content = item.get("text") or item.get("chunk") or ""
+            metadata = {
+                "source": item.get("source"),
+                "content_metadata": item.get("content_metadata", {}),
+            }
+            docs.append(Document(page_content=page_content, metadata=metadata))
+        return docs
```

Then use it in both methods:

```python
docs = self._entities_to_documents(entities)
return self._add_collection_name_to_retreived_docs(docs, collection_name)
```

src/nvidia_rag/rag_server/vlm.py (2)
376-397: Silent exception swallowing hinders debugging.

The bare `except Exception: continue` at line 396 silently discards all errors during image extraction (e.g., MinIO connection failures, malformed payloads). Adding a debug-level log would help diagnose issues in production without cluttering normal output.

💡 Suggested improvement

```diff
-        except Exception:
+        except Exception as e:
+            logger.debug("Skipping image extraction for doc: %s", e)
             continue
```
365-370: Nested ternary expression is hard to read.

The `source_id` extraction has a confusing double `isinstance` check:

```python
source_id = (
    source_meta.get("source_id", "")
    or (source_meta.get("source_name", "") if isinstance(source_meta, dict) else "")
    if isinstance(source_meta, dict)
    else ""
)
```

The inner `isinstance(source_meta, dict)` check is redundant since the outer one already guards the entire expression.

♻️ Simplified version

```diff
-        source_id = (
-            source_meta.get("source_id", "")
-            or (source_meta.get("source_name", "") if isinstance(source_meta, dict) else "")
-            if isinstance(source_meta, dict)
-            else ""
-        )
+        source_id = (
+            (source_meta.get("source_id", "") or source_meta.get("source_name", ""))
+            if isinstance(source_meta, dict)
+            else ""
+        )
```

tests/unit/test_utils/test_configuration.py (1)
847-862: Good validation coverage. Consider adding a positive test case.

The negative validation paths are well tested. Consider adding a test for a valid configuration to ensure it doesn't raise:

💡 Suggested additional test

```python
def test_fetch_full_page_context_with_neighboring_pages_valid(self):
    """Test that fetch_neighboring_pages > 0 with fetch_full_page_context=True is valid."""
    config = RetrieverConfig(
        fetch_full_page_context=True,
        fetch_neighboring_pages=2,
    )
    assert config.fetch_full_page_context is True
    assert config.fetch_neighboring_pages == 2
```

src/nvidia_rag/utils/vdb/elasticsearch/elastic_vdb.py (1)
952-996: LGTM: solid implementation matching the base class contract.

The method correctly handles empty `page_numbers`, caps the result size, and gracefully returns an empty list on errors. The document construction mirrors existing patterns in the class.

Two observations worth noting:

1. Source matching inconsistency across backends: this ES implementation uses exact keyword term matching (`term: {"metadata.source.source_name.keyword": source_name}`), while the Milvus implementation uses substring matching (`like "%{source_name}%"`). These have different semantics: ES will only match exact source names, while Milvus will match any source name containing the string. If source naming conventions differ across deployments, this could yield different retrieval results. Consider documenting whether this behavioral difference is intentional.

2. Static analysis: `except Exception` (line 974) is broad; consider catching specific Elasticsearch exceptions if feasible. Also, `logger.error` on line 975 could be `logger.exception` to include the traceback automatically.

src/nvidia_rag/rag_server/main.py (3)
3389-3401: Dedup key truncates content at 300 chars, risking false merges.

If two distinct chunks on the same page share the first 300 characters (e.g., repeated headers, table rows, or structured data), one will be silently dropped. Consider using a hash of the full content instead:

♻️ Safer dedup key

```diff
+import hashlib
+
 def doc_key(d: Document) -> tuple[str, str, int, str]:
     meta = getattr(d, "metadata", {}) or {}
     content_md = meta.get("content_metadata", {}) or {}
     source = meta.get("source", {})
     source_path = (
         source.get("source_name", "") if isinstance(source, dict) else source
     )
     coll = meta.get("collection_name", "")
     page_num = content_md.get("page_number", 0)
-    content_preview = (getattr(d, "page_content", "") or "")[:300]
-    return (str(coll), str(source_path), int(page_num), content_preview)
+    content_hash = hashlib.md5(
+        (getattr(d, "page_content", "") or "").encode()
+    ).hexdigest()
+    return (str(coll), str(source_path), int(page_num), content_hash)
```
3293-3319: Unused `max_chars` parameter.

`max_chars` (line 3297) is declared but never referenced in the method body. Remove it to keep the signature clean.

♻️ Proposed fix

```diff
 def _log_context_structure(
     self,
     context_str: str,
     prefix: str = "Context structure",
-    max_chars: int = 60,
 ) -> None:
```
3217-3291: Consider extracting shared metadata extraction logic.

Both `_log_retrieved_pages` and `_log_expanded_context_layout` (as well as `_extract_page_set_from_docs`, `_format_context_by_page`, and `doc_key` inside `_expand_and_organize_context`) repeat the same metadata extraction pattern: pulling `content_metadata.page_number` and `source.source_name` from nested `doc.metadata`. A small shared helper (e.g., `_extract_doc_page_info(doc) -> tuple[str, str | None, int | None]`) would reduce duplication across these five call sites.
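A possible shape for that helper, assuming the nested metadata layout quoted in this review; the function itself is hypothetical, not the PR's code.

```python
from __future__ import annotations


def extract_doc_page_info(doc) -> tuple[str, str | None, int | None]:
    """Return (collection_name, source_name, page_number) from a doc's metadata.

    Tolerates missing keys and a string-valued `source`, mirroring the
    defensive patterns quoted in the review.
    """
    meta = getattr(doc, "metadata", {}) or {}
    source = meta.get("source", {})
    source_name = source.get("source_name") if isinstance(source, dict) else source
    content_md = meta.get("content_metadata", {}) or {}
    page = content_md.get("page_number")
    coll = meta.get("collection_name", "")
    return (str(coll), source_name, page if isinstance(page, int) else None)
```

Each of the five call sites could then unpack this tuple instead of re-implementing the nested lookups.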
Context expansion support for PDF files: when enabled, this fetches the content of the entire page and sends it to the LLM for generation.
Checklist

- Commits are signed off (`git commit -s`) and GPG signed (`git commit -S`).