
CAMEL-21540: Add PGVector component for PostgreSQL vector database#22207

Open
gnodet wants to merge 17 commits into main from hungry-quark


gnodet (Contributor) commented Mar 23, 2026

Summary

New camel-pgvector component for vector similarity search using the PostgreSQL pgvector extension. This provides a lightweight, SQL-native vector database option — no separate infrastructure needed beyond PostgreSQL.

Example Routes

Store embeddings with OpenAI:

- route:
    from:
      uri: direct:index
    steps:
      - setVariable:
          name: text
          simple: "${body}"
      - to:
          uri: openai:embeddings
          parameters:
            embeddingModel: nomic-embed-text
      - setHeader:
          name: CamelPgVectorAction
          constant: UPSERT
      - setHeader:
          name: CamelPgVectorTextContent
          simple: "${variable.text}"
      - to: pgvector:documents

Similarity search:

- route:
    from:
      uri: direct:search
    steps:
      - to:
          uri: openai:embeddings
          parameters:
            embeddingModel: nomic-embed-text
      - setHeader:
          name: CamelPgVectorAction
          constant: SIMILARITY_SEARCH
      - setHeader:
          name: CamelPgVectorQueryTopK
          constant: 5
      - to: pgvector:documents

LangChain4j RAG pipeline:

from("direct:search")
    .to("langchain4j-embeddings:embed")
    .transformDataType(new DataType("pgvector:embeddings"))
    .setHeader(PgVectorHeaders.ACTION, constant(PgVectorAction.SIMILARITY_SEARCH))
    .to("pgvector:myCollection")
    .transformDataType(new DataType("pgvector:rag"));

Features

  • Actions: CREATE_TABLE, CREATE_INDEX (HNSW), DROP_TABLE, UPSERT, DELETE, SIMILARITY_SEARCH
  • Distance types: cosine (default), euclidean, inner product
  • Filtering: SQL WHERE clause on text_content and metadata columns
  • Auto-ID: UUID auto-generated when no record ID is provided
  • Upsert: ON CONFLICT DO UPDATE for idempotent writes
  • Data type transformers: pgvector:embeddings and pgvector:rag for LangChain4j integration
  • Cross-documentation: OpenAI and LangChain4j embeddings docs updated with pgvector examples
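
As a rough illustration of how the distance types and the WHERE-clause filter could translate to SQL, here is a minimal sketch. It relies on the pgvector operators `<=>` (cosine distance), `<->` (Euclidean distance) and `<#>` (negative inner product). The `Distance` enum, the `id` and `embedding` column names, and the query shape are assumptions made for illustration, not the component's actual implementation.

```java
public class PgVectorSearchSqlSketch {

    // Hypothetical enum; the component's real distance type names may differ.
    public enum Distance { COSINE, EUCLIDEAN, INNER_PRODUCT }

    // Maps a distance type to the corresponding pgvector operator.
    public static String operatorFor(Distance distance) {
        switch (distance) {
            case COSINE:        return "<=>"; // cosine distance
            case EUCLIDEAN:     return "<->"; // Euclidean (L2) distance
            case INNER_PRODUCT: return "<#>"; // negative inner product, so ascending order still ranks best first
            default: throw new IllegalArgumentException("Unknown distance: " + distance);
        }
    }

    // Builds a top-K query; the query vector is bound to the single '?' placeholder.
    // The filter is appended verbatim here purely for illustration.
    public static String searchSql(String table, Distance distance, String filter, int topK) {
        StringBuilder sql = new StringBuilder("SELECT id, text_content, metadata FROM ").append(table);
        if (filter != null && !filter.isBlank()) {
            sql.append(" WHERE ").append(filter);
        }
        sql.append(" ORDER BY embedding ").append(operatorFor(distance)).append(" ? LIMIT ").append(topK);
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(searchSql("documents", Distance.COSINE, "metadata LIKE '%camel%'", 5));
    }
}
```

Ordering directly by the operator expression, rather than by a computed alias, mirrors the form used in the pgvector documentation for index-assisted nearest-neighbor queries.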

Also included

  • Docs gulp race condition fix (ENOENT on target/ directories during parallel builds)

Test plan

  • PgVectorComponentIT (11 tests) — all actions, upsert update, auto-ID, filtered search
  • LangChain4jEmbeddingsComponentPgVectorTargetIT (4 tests) — end-to-end embeddings pipeline
  • Code formatted and imports sorted
  • All generated files committed (catalog, endpoint DSL, component DSL)

github-actions bot (Contributor) commented

🌟 Thank you for your contribution to the Apache Camel project! 🌟
🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run
  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot although they are normally detected and executed by CI.
  • You can label PRs using build-all, build-dependents, skip-tests and test-dependents to fine-tune the checks executed by this PR.
  • Build and test logs are available in the summary page. Only Apache Camel committers have access to the summary.

⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

github-actions bot (Contributor) commented Mar 24, 2026

🧪 CI tested the following changed modules:

  • bom/camel-bom
  • catalog/camel-allcomponents
  • catalog/camel-catalog
  • components/camel-ai
  • components/camel-ai/camel-langchain4j-embeddings
  • components/camel-ai/camel-openai
  • components/camel-ai/camel-pgvector
  • core/camel-main
  • docs
  • dsl/camel-componentdsl
  • dsl/camel-endpointdsl
  • dsl/camel-kamelet-main
  • parent
  • tooling/maven/camel-package-maven-plugin

ℹ️ Dependent modules were not tested because the total number of affected modules exceeded the threshold (50). Use the test-dependents label to force testing all dependents.

Build reactor — dependencies compiled but only changed modules were tested (28 modules)
  • Camel :: AI :: LangChain4j :: Embedding
  • Camel :: AI :: LangChain4j :: Embedding [jar]
  • Camel :: AI :: OpenAI
  • Camel :: AI :: OpenAI [jar]
  • Camel :: AI :: PGVector
  • Camel :: AI :: PGVector [jar]
  • Camel :: AI :: Parent
  • Camel :: AI :: Parent [pom]
  • Camel :: All Components Sync point
  • Camel :: All Components Sync point [pom]
  • Camel :: BOM
  • Camel :: BOM [pom]
  • Camel :: Catalog :: Camel Catalog
  • Camel :: Catalog :: Camel Catalog [jar]
  • Camel :: Component DSL
  • Camel :: Component DSL [jar]
  • Camel :: Docs
  • Camel :: Docs [pom]
  • Camel :: Endpoint DSL
  • Camel :: Endpoint DSL [jar]
  • Camel :: Kamelet Main
  • Camel :: Kamelet Main [jar]
  • Camel :: Main
  • Camel :: Main [jar]
  • Camel :: Maven Plugins :: Camel Maven Package
  • Camel :: Maven Plugins :: Camel Maven Package [maven-plugin]
  • Camel :: Parent
  • Camel :: Parent [pom]

gnodet marked this pull request as ready for review March 24, 2026 05:53

apupier (Contributor) left a comment

What is the reason for a separate PostgreSQL component specific to vectors? Why not include these capabilities directly in the existing Postgres component?

gnodet (Contributor, Author) left a comment

Claude Code on behalf of Guillaume Nodet

Good question! There are several reasons for keeping pgvector as a separate component under camel-ai/ rather than adding it to the existing PostgreSQL components:

  1. Consistency with other vector DB components. All vector database components live under camel-ai/: Milvus, Qdrant, Pinecone, Weaviate, Neo4j. pgvector serves the same purpose — vector similarity search for AI/RAG pipelines — it just happens to use PostgreSQL as its storage engine. Users looking for vector database components would naturally look in the AI group.

  2. Different use case and audience. The existing PG components (camel-pgevent for LISTEN/NOTIFY, camel-pg-replication-slot for WAL replication) are PostgreSQL-specific integration utilities. pgvector targets a completely different use case: embeddings storage, similarity search, and RAG pipelines. These serve different personas.

  3. Additional dependencies. pgvector brings in com.pgvector:pgvector (for vector type support) and LangChain4j data type transformers — AI-oriented dependencies that would be out of place in a general PostgreSQL component.

  4. Separate lifecycle. Keeping it as its own component allows it to evolve independently without affecting the existing PostgreSQL components.

This is the same approach taken by other frameworks — for example, LangChain and LlamaIndex treat pgvector as a vector store alongside Pinecone/Qdrant/etc., not as a PostgreSQL utility.

gnodet added a commit that referenced this pull request Mar 25, 2026
When a new dependency is added to parent/pom.xml, the diff contains
structural XML elements like <groupId>, <artifactId>, <version> which
were incorrectly extracted as "changed properties" by detectChangedProperties.

This caused the script to search for modules using ${artifactId} or
${groupId} as property references, which either matched nothing useful
or caused spurious failures.

Fix: filter out known structural POM element names (groupId, artifactId,
version, scope, type, etc.) so only actual property names like
"pgvector-version" or "openai-java-version" are detected.

Fixes the CI script bug seen in PR #22207 where adding a new component
to parent/pom.xml caused the dependency detection to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
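
The filtering described in this commit message can be sketched as follows. The real logic lives in a shell script; this Java version only illustrates the idea. The regular expression, the class name, and the exact contents of the structural-name set (the commit message lists groupId, artifactId, version, scope, type and "etc.") are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChangedPropertiesSketch {

    // Structural POM elements that must not be treated as property names (partial, assumed list).
    private static final Set<String> STRUCTURAL =
            Set.of("groupId", "artifactId", "version", "scope", "type");

    // Matches added diff lines of the form "+   <name>value</name>".
    private static final Pattern ADDED_ELEMENT =
            Pattern.compile("^\\+\\s*<([A-Za-z0-9_.-]+)>[^<]*</\\1>");

    // Returns the element names on added lines that look like real properties.
    public static List<String> detectChangedProperties(List<String> diffLines) {
        List<String> properties = new ArrayList<>();
        for (String line : diffLines) {
            Matcher m = ADDED_ELEMENT.matcher(line);
            if (m.find()) {
                String name = m.group(1);
                if (!STRUCTURAL.contains(name) && !properties.contains(name)) {
                    properties.add(name);
                }
            }
        }
        return properties;
    }
}
```

Given a diff that adds a `<pgvector-version>` property plus a `<dependency>` block, only `pgvector-version` survives the filter, so no module search is triggered for `${groupId}` or `${artifactId}`.
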
gnodet (Contributor, Author) commented Mar 25, 2026

The CI failures on the "mvn test parent pom dependencies changed" step are caused by a bug in the detect-test.sh script: it incorrectly extracts structural XML elements (<groupId>, <artifactId>, <version>) as "changed properties" when a new dependency is added to parent/pom.xml. This is fixed in #22247.

All actual tests pass on JDK 17 and 21. The "Fail if there are uncommitted changes" check also passes on all JDKs.

Claude Code on behalf of Guillaume Nodet

gnodet marked this pull request as draft March 25, 2026 14:53

Croway (Contributor) commented Mar 25, 2026

Hi @gnodet, could you check how this component integrates with the camel-openai component? The openai documentation at https://camel.apache.org/components/4.18.x/openai-component.html#_postgresql_pgvector_recommended has an example that uses openai + sql. I was wondering whether the pgvector component could improve on it.

gnodet (Contributor, Author) commented Mar 25, 2026

Good point! I've updated the camel-openai docs in this PR to show how the pgvector component integrates with openai embeddings. The pgvector component simplifies the flow — no need for raw SQL, manual ::vector casting, or intermediate variables for the embedding. It also adds similarity search, HNSW indexing, and metadata filtering out of the box.

The raw SQL approach via camel-sql is still documented as a fallback for custom table schemas or complex queries (joins, CTEs, etc.).

Claude Code on behalf of Guillaume Nodet

gnodet (Contributor, Author) commented Mar 25, 2026

CI status: all actual tests pass (JDK 21 fully green, JDK 17 and 25 cancelled by fail-fast). The only failing step is "mvn test parent pom dependencies changed" — the CI script bug that will be fixed by #22247.

Claude Code on behalf of Guillaume Nodet

gnodet and others added 5 commits March 26, 2026 23:35
- New camel-pgvector component under components/camel-ai/
- Supports CREATE_TABLE, DROP_TABLE, UPSERT, DELETE, SIMILARITY_SEARCH actions
- Uses PostgreSQL pgvector extension via JDBC with com.pgvector library
- Supports cosine, euclidean, and inner product distance types
- LangChain4j data type transformers: pgvector:embeddings and pgvector:rag
- Integration tests with testcontainers pgvector image
- LangChain4j embeddings integration test with AllMiniLmL6V2 model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add BOM, catalog, DSL, and documentation generated files
- Remove Spring Boot starter reference (no starter yet)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…in and add pgvector entries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gnodet and others added 10 commits March 26, 2026 23:35
- Remove pgvector from alphabetical nav listing (grouped under AI only)
- Fix trailing spaces in javadoc blank comment lines
- Update EventEndpointBuilderFactory to CamelEventEndpointBuilderFactory
- Update rest-openapi description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add CREATE_INDEX action to create HNSW indexes for faster approximate
  nearest neighbor search, using the configured distance type
- Add CamelPgVectorFilter header to apply SQL WHERE clause filtering
  on similarity search results (e.g., filter by metadata or text content)
- Add integration tests for both features

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
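
For context, an HNSW index in pgvector is tied to an operator class that must match the distance used at query time: `vector_cosine_ops`, `vector_l2_ops` or `vector_ip_ops`. The sketch below shows how a CREATE_INDEX action might pick the operator class from the configured distance type. The `Distance` enum, the `embedding` column, and the index naming scheme are assumptions for illustration only.

```java
public class PgVectorIndexSqlSketch {

    // Hypothetical enum; the component's real distance type names may differ.
    public enum Distance { COSINE, EUCLIDEAN, INNER_PRODUCT }

    // Maps a distance type to the pgvector operator class used by the index.
    public static String opsClassFor(Distance distance) {
        switch (distance) {
            case COSINE:        return "vector_cosine_ops";
            case EUCLIDEAN:     return "vector_l2_ops";
            case INNER_PRODUCT: return "vector_ip_ops";
            default: throw new IllegalArgumentException("Unknown distance: " + distance);
        }
    }

    // Builds the DDL for an HNSW index on the (assumed) embedding column.
    public static String createIndexSql(String table, Distance distance) {
        return "CREATE INDEX IF NOT EXISTS " + table + "_embedding_hnsw_idx ON " + table
                + " USING hnsw (embedding " + opsClassFor(distance) + ")";
    }

    public static void main(String[] args) {
        System.out.println(createIndexSql("documents", Distance.COSINE));
    }
}
```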
…index action

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace ** glob with {*,*/*} for dsl source pattern to prevent
scandir of target/ directories created during parallel builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dsl.adoc lives at dsl/src/main/docs/ (depth 0), which is not
matched by {*,*/*}. Add explicit pattern for it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the raw SQL example with the pgvector component for
PostgreSQL vector database integration. Add a similarity search
example showing the full index + search workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add pgvector examples to langchain4j-embeddings docs (alongside Qdrant)
- Add OpenAI and LangChain4j integration examples to pgvector docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Make PgVectorHeaders final with private constructor
- Add null check for embedding header in data type transformer
- Add upsertUpdate test to verify ON CONFLICT DO UPDATE behavior
- Add upsertAutoId test to verify auto-generated UUID
- Fix test ordering and assertions after adding new tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
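
The behavior these two tests verify could look roughly like the sketch below: an ID is generated when none is supplied, and the write uses PostgreSQL's `INSERT ... ON CONFLICT ... DO UPDATE` so repeating it with the same ID overwrites the row instead of failing. The column names `id` and `embedding`, the helper names, and the statement shape are assumptions; the component's real SQL is not shown in this PR conversation.

```java
import java.util.UUID;

public class PgVectorUpsertSketch {

    // Uses the caller's ID when present, otherwise generates a random UUID.
    public static String resolveId(String providedId) {
        return (providedId == null || providedId.isBlank())
                ? UUID.randomUUID().toString()
                : providedId;
    }

    // Builds an idempotent write; assumes 'id' is the primary key of the table.
    public static String upsertSql(String table) {
        return "INSERT INTO " + table + " (id, text_content, metadata, embedding) VALUES (?, ?, ?, ?)"
                + " ON CONFLICT (id) DO UPDATE SET text_content = EXCLUDED.text_content,"
                + " metadata = EXCLUDED.metadata, embedding = EXCLUDED.embedding";
    }

    public static void main(String[] args) {
        System.out.println(resolveId(null)); // a freshly generated UUID
        System.out.println(upsertSql("documents"));
    }
}
```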
- Add log4j2.properties for test logging (consistent with other vector DB components)
- Remove redundant camel-pgvector from openai "Alternative" list since it's already recommended above

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
gnodet marked this pull request as ready for review March 26, 2026 22:58
gnodet marked this pull request as draft March 26, 2026 22:58
The {*,*/*} pattern causes the glob library to scan into target/
directories at the second level, which can race with concurrent test
execution deleting target/surefire. Use !(target) extglob to exclude
target directories from the scan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
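
The two problems with the `{*,*/*}` pattern can be reproduced with Java's built-in glob matcher. Treat this as a sketch: the full pattern used in the gulp file is not shown in these commits, so the `dsl/.../src/main/docs/*.adoc` shape and the module file names below are assumptions, and Java's glob dialect is not identical to the Node glob library (it has no `!(target)` extglob, for instance).

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class DslGlobSketch {

    // Brace alternation: exactly one or exactly two directory levels below dsl/.
    private static final PathMatcher BRACE_PATTERN = FileSystems.getDefault()
            .getPathMatcher("glob:dsl/{*,*/*}/src/main/docs/*.adoc");

    public static boolean matches(String relativePath) {
        return BRACE_PATTERN.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        // Depth 0: no directory between dsl/ and src/, so the brace pattern misses it.
        System.out.println(matches("dsl/src/main/docs/dsl.adoc"));
        // Depth 1: a module directly under dsl/ is matched.
        System.out.println(matches("dsl/camel-endpointdsl/src/main/docs/example.adoc"));
        // Depth 2: the second level also admits target/, which is why build output gets scanned.
        System.out.println(matches("dsl/camel-endpointdsl/target/src/main/docs/example.adoc"));
    }
}
```

The first case corresponds to the explicit extra pattern added for dsl.adoc, and the third corresponds to the target/ exclusion introduced in this commit.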
gnodet marked this pull request as ready for review March 26, 2026 23:50