Enable Kubernetes-based Antalya swarms to accelerate data lake queries for #1289

@hodgesrm

Is your feature request related to a problem? Please describe.
It's hard to use swarms to accelerate data lake queries for ClickHouse clusters running outside of Kubernetes. There are two reasons:

  1. Swarms run best in Kubernetes, which offers fast, simple scaling of stateless clusters.
  2. It's difficult for processes outside the swarm (say, running in AWS EC2) to access swarms, because pod DNS names must be exposed so that the initiating servers can dispatch queries to them.

Here's the communication pattern. In addition to invoking the swarm via exposed DNS names and IP addresses, the Initiator must have access to S3 and, optionally, to an Iceberg REST catalog.

                 Kubernetes
+-----------+     +-------+
| Initiator | ==> | Swarm | ==> Object Storage
+-----------+     +-------+       ^
      \                           |
       \-------------------------/
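To make the DNS requirement in reason 2 concrete, here is a minimal sketch of the per-pod names an outside Initiator would need to resolve. It assumes the swarm pods belong to a StatefulSet behind a headless Service, in which case Kubernetes gives each pod a name of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. The StatefulSet, Service, and namespace names are hypothetical, and an actual swarm deployment may lay out its pods differently.

```python
def swarm_pod_dns_names(statefulset, service, namespace, replicas,
                        cluster_domain="cluster.local"):
    """Return the per-pod DNS names for a StatefulSet behind a headless Service.

    StatefulSet pods are named <statefulset>-<ordinal>, with ordinals
    starting at 0.
    """
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{cluster_domain}"
        for i in range(replicas)
    ]


# Hypothetical names, for illustration only.
names = swarm_pod_dns_names("swarm", "swarm-headless", "antalya", replicas=3)
for name in names:
    print(name)
```

Every name in this list has to be resolvable and reachable from the Initiator, and the list changes whenever the swarm scales.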

Describe the solution you'd like

Enable Kubernetes swarms to act as accelerators for data lake queries issued by ClickHouse clusters outside Kubernetes. The swarm implementation should fully encapsulate access to object storage and Iceberg REST (if present). Initiators should see the data lake tables as tables in a remote server. Here's a diagram of a possible implementation:

                                 Kubernetes
         +-----------+      +----------------------------+ 
 query-> | Initiator |  ==> | [Query Server] --> [Swarm] | ==> Object Storage
         +-----------+      +----------------------------+ 

The implementation includes the following features:

  1. The initiator sees tables, which means that the target data must be in table format, e.g., Iceberg tables or S3 tables.
  2. It would be reasonable to set up distributed tables pointing to the Query Server, which would allow the Initiator to see the schema to plan queries.

This approach will look like "distributed in distributed", that is, nested distributed tables.
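As a rough illustration of feature 2, the sketch below builds the DDL an Initiator might run to create a distributed table that points at the Query Server. It uses the ClickHouse `Distributed(cluster, database, table)` engine. The cluster name `query_server`, the database and table names, and the column list are all hypothetical. The cluster name would also have to be defined in the Initiator's `remote_servers` configuration. This is one possible shape, not a settled design.

```python
def distributed_table_ddl(local_db, local_table, columns,
                          cluster, remote_db, remote_table):
    """Build a CREATE TABLE statement using the ClickHouse Distributed engine.

    columns is a list of (name, clickhouse_type) pairs that mirrors the
    schema of the remote data lake table.
    """
    cols = ",\n    ".join(f"`{name}` {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {local_db}.{local_table}\n"
        f"(\n    {cols}\n)\n"
        f"ENGINE = Distributed('{cluster}', '{remote_db}', '{remote_table}')"
    )


# Hypothetical schema and names, for illustration only.
ddl = distributed_table_ddl(
    local_db="default",
    local_table="events_remote",
    columns=[("event_time", "DateTime"), ("user_id", "UInt64")],
    cluster="query_server",
    remote_db="lake",
    remote_table="events",
)
print(ddl)
```

With a table like this in place, the Initiator plans queries against a local schema. Access to the swarm, object storage, and the Iceberg REST catalog stays behind the Query Server.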

Describe alternatives you've considered

  1. Direct access to swarm nodes. This requires careful setup so that the swarm's pod DNS names are visible to the Initiator.

Additional context
