Skip to content

docs: strengthen RAG schema and guidance#374

Merged
tsivaprasad merged 1 commit into
mainfrom
PLAT-584-rag-service-returns-no-relevant-information-found-on-nodes-added-via-database-update
May 4, 2026
Merged

docs: strengthen RAG schema and guidance#374
tsivaprasad merged 1 commit into
mainfrom
PLAT-584-rag-service-returns-no-relevant-information-found-on-nodes-added-via-database-update

Conversation

@tsivaprasad
Copy link
Copy Markdown
Contributor

Summary

Strengthens the RAG service documentation with clearer schema requirements and guidance, based on issues observed during testing.

Changes

  • Set embedding vector(N) NOT NULL in all table schema examples to prevent NULL embeddings at the database level
  • Document that non-superuser connect_as users require GRANT SELECT on document tables — omitting this causes the service to silently return "No relevant information found" for all queries
  • Replace the \dt table check with has_table_privilege() in the "Service Fails to Start" troubleshooting section
  • Update the "Poor Query Results" section to explain that NULL embeddings cause silent vector search failures and show how to find and remove them on existing tables

Testing

  • Documentation-only change; no code changes.

Checklist

  • Documentation updated (if needed)

Notes for Reviewers

Two silent failure modes documented here were discovered during RAG service testing:

  1. NULL embeddings — inserting rows without the embedding column causes the vector search to crash internally and return empty results with no visible error. The NOT NULL constraint on the schema prevents this. The corresponding RAG server fix (filtering NULL rows in the SQL query) is in a separate PR in the pgedge-rag-server repo.

  2. Missing SELECT privilege — if connect_as is a non-superuser, the RAG container connects successfully but gets permission denied on the table, which also surfaces as "No relevant information found" with no indication of the cause.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

Updated RAG documentation to enforce database permissions and schema constraints. Changed the embedding column from nullable to NOT NULL, added guidance on granting SELECT permissions to connect_as users, and enhanced troubleshooting steps with has_table_privilege verification and NULL embedding detection.

Changes

Cohort / File(s) Summary
RAG Documentation Updates
docs/services/rag.md
Enforced database permissions by adding explicit GRANT SELECT requirements for non-superuser connect_as users; changed documents_content_chunks.embedding column to NOT NULL across all schema examples; enhanced troubleshooting with has_table_privilege verification, conditional grant remediation, and checks for existing NULL embeddings; refined prose and formatting.

Poem

🐰 With permissions now declared so clear,
And embeddings no longer we fear,
Schema constraints keep vectors true,
Troubleshooting knows just what to do,
No silent "No relevant" refrain—
Your RAG now runs like a rabbit's brain! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: strengthening RAG schema and guidance documentation with clearer requirements and operational improvements.
Description check ✅ Passed The description includes all required sections from the template (Summary, Changes, Testing, Checklist) with comprehensive details about schema changes, permission requirements, and silent failure modes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch PLAT-584-rag-service-returns-no-relevant-information-found-on-nodes-added-via-database-update

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/services/rag.md (1)

1025-1049: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

"Service Fails to Start" may be misleading for SELECT-privilege issues.

Based on the provided server preflight behavior (role-existence only), lacking GRANT SELECT on document tables is more likely to cause empty/“No relevant information found” query results than an outright failure to start. Since you also added a separate “Poor Query Results” section for NULL embeddings, it would help to either:

  • rename/scope this section to “connection/user/privilege issues affecting queries”, or
  • add a short note that SELECT-privilege problems typically surface during querying.
Suggested edit
-### Service Fails to Start
+### Service Fails to Start (or Returns Empty Results)
 
 To diagnose a service that fails to start, check database
 connectivity and user permissions.
+
+Note: If the `connect_as` role exists but lacks `GRANT SELECT` on
+the document table(s), the service may still start—queries can
+return `"No relevant information found"` instead.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/services/rag.md around lines 1025 - 1049, The "Service Fails to Start"
section heading is misleading because the preflight only checks role existence
and missing SELECT privileges usually cause empty/"No relevant information
found" query results rather than startup failure; update the section title
and/or content in docs/services/rag.md (the "Service Fails to Start" heading and
its paragraphs) to either rename it to something like "Connection / User /
Privilege Issues Affecting Queries" or add a short clarifying sentence under the
existing heading stating that missing SELECT privileges on
documents_content_chunks typically surface as poor/empty query results during
runtime rather than preventing the service from starting, and keep the existing
diagnostic commands (\du+ example and has_table_privilege example) as-is.


</details>

</blockquote></details>

</blockquote></details>
🧹 Nitpick comments (4)
docs/services/rag.md (4)

77-87: ⚡ Quick win

Clarify GRANT SELECT guidance: apply to each configured document table and qualify schema/table names.

The new “connect_as user must have SELECT privilege on each document table” section is good, but the example SQL hard-codes documents_content_chunks without a schema and doesn’t explicitly tie the GRANT target to pipelines[].tables[].table (which is what users will actually configure).

Consider updating the snippet/prose to:

  • mention you must grant on every table listed under pipelines[].tables[], and
  • use a schema-qualified table name (or at least instruct users to adjust based on their schema).
Suggested edit
 If the `connect_as` user is not a database owner or superuser, it must
 have `SELECT` privilege on each document table. Grant this privilege in
 the `scripts.post_database_create` array:
 
 ```sql
-GRANT SELECT ON documents_content_chunks TO app_read_only;
+-- Grant SELECT on EVERY table referenced by `pipelines[].tables[]`.
+-- If your tables are not in `public`, qualify the schema accordingly.
+GRANT SELECT ON public.documents_content_chunks TO app_read_only;
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/services/rag.md around lines 77 - 87, Update the guidance to state that
the GRANT must be applied to every table listed under pipelines[].tables[].table
and to use schema-qualified names; replace the hard-coded example by explaining
to run a GRANT SELECT on each configured table (e.g., GRANT SELECT ON
.<table_name> TO <connect_as_user>) and add a short note suggesting to
replace with public if using the default schema.


</details>

---

`1050-1073`: _⚡ Quick win_

**DELETE remediation is destructive: add a short safety/capacity note.**

The “Poor Query Results” section proposes:
- `DELETE FROM documents_content_chunks WHERE embedding IS NULL;`

That’s correct for remediation, but it can be disruptive on large tables (locks) and is irreversible without backups. A brief note like “back up / test in staging / consider batching for large tables” would reduce operator risk.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/services/rag.md around lines 1050 - 1073, Update the "Poor Query
Results" remediation that suggests DELETE FROM documents_content_chunks WHERE
embedding IS NULL to include a short safety/capacity note: advise taking a
backup or testing in staging first, consider running a SELECT COUNT(*) check
and/or deleting in small batched transactions (or using LIMIT with repeated
deletes) to avoid long locks, and remind operators that deletes are irreversible
without backups. Reference the existing section title "Poor Query Results" and
the DELETE statement so the note is inserted immediately after that remediation.


</details>

---

`1036-1048`: _⚡ Quick win_

**Make `has_table_privilege()` example robust: schema-qualify (or cast to regclass).**

`SELECT has_table_privilege('app_read_only', 'documents_content_chunks', 'SELECT');` assumes the table is discoverable via the database search_path. If the document tables live outside `public`, this example can yield `false` (or error), confusing users.

Prefer `public.documents_content_chunks` (or instruct users to use their schema), or use `::regclass`.

<details>
<summary>Suggested edit</summary>

```diff
-SELECT has_table_privilege('app_read_only', 'documents_content_chunks', 'SELECT');
+-- Use schema-qualified table name to avoid search_path issues
+SELECT has_table_privilege('app_read_only', 'public.documents_content_chunks', 'SELECT');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/services/rag.md` around lines 1036 - 1048, Update the
has_table_privilege example to avoid search_path issues by schema-qualifying the
table name or casting it to regclass: change the call to use either
'public.documents_content_chunks' (or the user's schema) or
'documents_content_chunks'::regclass when calling has_table_privilege for the
user 'app_read_only' (referencing has_table_privilege, documents_content_chunks,
app_read_only) so the check reliably returns the correct result regardless of
search_path.

193-206: ⚡ Quick win

Embedding NOT NULL change: add an explicit “ALTER TABLE ... SET NOT NULL” migration step.

You’ve already documented how to find and delete existing NULL embeddings in “Poor Query Results”, and the examples set embedding vector(N) NOT NULL going forward. What’s missing is the final step for existing deployments: after cleanup, users should be guided to enforce the constraint (otherwise NULLs can reappear later).

Suggested addition
 -- Remove rows missing embeddings
 DELETE FROM documents_content_chunks WHERE embedding IS NULL;
+
+-- Then enforce the constraint to prevent future NULL embeddings
+ALTER TABLE documents_content_chunks
+  ALTER COLUMN embedding SET NOT NULL;
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/services/rag.md around lines 193 - 206, Add a final migration step that
enforces the non-null constraint on the embedding column after users have
cleaned up any NULL embeddings: create a migration that alters the
documents_content_chunks table to set the embedding column (embedding) to NOT
NULL, and run it only after confirming there are no rows with NULL embeddings so
the constraint will succeed.


</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @docs/services/rag.md:

  • Around line 1025-1049: The "Service Fails to Start" section heading is
    misleading because the preflight only checks role existence and missing SELECT
    privileges usually cause empty/"No relevant information found" query results
    rather than startup failure; update the section title and/or content in
    docs/services/rag.md (the "Service Fails to Start" heading and its paragraphs)
    to either rename it to something like "Connection / User / Privilege Issues
    Affecting Queries" or add a short clarifying sentence under the existing heading
    stating that missing SELECT privileges on documents_content_chunks typically
    surface as poor/empty query results during runtime rather than preventing the
    service from starting, and keep the existing diagnostic commands (\du+ example
    and has_table_privilege example) as-is.

Nitpick comments:
In @docs/services/rag.md:

  • Around line 77-87: Update the guidance to state that the GRANT must be applied
    to every table listed under pipelines[].tables[].table and to use
    schema-qualified names; replace the hard-coded example by explaining to run a
    GRANT SELECT on each configured table (e.g., GRANT SELECT ON
    .<table_name> TO <connect_as_user>) and add a short note suggesting to
    replace with public if using the default schema.
  • Around line 1050-1073: Update the "Poor Query Results" remediation that
    suggests DELETE FROM documents_content_chunks WHERE embedding IS NULL to include
    a short safety/capacity note: advise taking a backup or testing in staging
    first, consider running a SELECT COUNT(*) check and/or deleting in small batched
    transactions (or using LIMIT with repeated deletes) to avoid long locks, and
    remind operators that deletes are irreversible without backups. Reference the
    existing section title "Poor Query Results" and the DELETE statement so the note
    is inserted immediately after that remediation.
  • Around line 1036-1048: Update the has_table_privilege example to avoid
    search_path issues by schema-qualifying the table name or casting it to
    regclass: change the call to use either 'public.documents_content_chunks' (or
    the user's schema) or 'documents_content_chunks'::regclass when calling
    has_table_privilege for the user 'app_read_only' (referencing
    has_table_privilege, documents_content_chunks, app_read_only) so the check
    reliably returns the correct result regardless of search_path.
  • Around line 193-206: Add a final migration step that enforces the non-null
    constraint on the embedding column after users have cleaned up any NULL
    embeddings: create a migration that alters the documents_content_chunks table to
    set the embedding column (embedding) to NOT NULL, and run it only after
    confirming there are no rows with NULL embeddings so the constraint will
    succeed.

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `a4ac85c8-6380-4b58-9a55-f1f9c07d79ed`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 1e081a892622594a3142c7d093ce1fbd30bc7a7c and 1d2d32e1c2d184826d7ecf92bb7e9235c997b3b5.

</details>

<details>
<summary>📒 Files selected for processing (1)</summary>

* `docs/services/rag.md`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@tsivaprasad tsivaprasad merged commit 76f2be6 into main May 4, 2026
3 checks passed
@tsivaprasad tsivaprasad deleted the PLAT-584-rag-service-returns-no-relevant-information-found-on-nodes-added-via-database-update branch May 4, 2026 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants