Description
Is your feature request related to a problem? Please describe.
It's hard to use swarms to accelerate data lake queries for ClickHouse clusters running outside of Kubernetes. There are two reasons:
- Swarms run best in Kubernetes, which offers fast, simple scaling of stateless clusters.
- It's difficult for processes outside Kubernetes (say, running in AWS EC2) to access swarms, because pod DNS names must be exposed so that the initiating servers can dispatch queries to the swarm nodes.
Here's the current communication pattern. Besides invoking the swarm through exposed DNS names and IP addresses, the Initiator must have access to S3 and, if one is in use, to the Iceberg REST catalog.
                  Kubernetes
+-----------+     +-------+
| Initiator | ==> | Swarm | ==> Object Storage
+-----------+     +-------+           ^
           \                          |
            \-------------------------/
Describe the solution you'd like
Enable Kubernetes swarms to act as accelerators for data lake queries issued from ClickHouse clusters outside Kubernetes. The swarm implementation should fully encapsulate access to object storage and the Iceberg REST catalog (if present). Initiators should see the data lake tables as tables on a remote server. Here's a diagram of a possible implementation:
                          Kubernetes
        +-----------+     +----------------------------+
query-> | Initiator | ==> | [Query Server] --> [Swarm] | ==> Object Storage
        +-----------+     +----------------------------+
The implementation includes the following features:
- The initiator sees tables, which means that the target data must be in table format, e.g., Iceberg tables or S3 tables.
- It would be reasonable to set up distributed tables pointing to the Query Server, which would allow the Initiator to see the schema to plan queries.
This approach will look like "distributed in distributed": a distributed table on the Initiator points to a table on the Query Server that is itself distributed across the swarm.
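As a rough illustration of the nested layout described above, here is a minimal sketch that generates the two layers of DDL. It only assumes the standard `Distributed(cluster, database, table)` engine signature. Everything else is hypothetical and not part of this proposal: the cluster names `query_server` and `swarm`, the `lake` database, the table names, and the two-column schema. How the innermost table on the swarm actually reads Iceberg data is deliberately left out.

```python
# Sketch only: builds ClickHouse DDL strings for a "distributed in distributed" layout.
# All cluster, database, table, and column names below are hypothetical placeholders.

def distributed_ddl(local_name, columns, cluster, remote_db, remote_table):
    """Return a CREATE TABLE statement using the Distributed table engine."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {local_name} ({cols}) "
        f"ENGINE = Distributed('{cluster}', '{remote_db}', '{remote_table}')"
    )

# Hypothetical schema shared by both layers.
COLUMNS = [("event_time", "DateTime"), ("user_id", "UInt64")]

# Outer layer, created on the Initiator (outside Kubernetes).
# 'query_server' is an assumed remote_servers entry that resolves to the Query Server.
initiator_ddl = distributed_ddl(
    "lake.events", COLUMNS, "query_server", "lake", "events"
)

# Inner layer, created on the Query Server (inside Kubernetes).
# 'swarm' is an assumed cluster name; 'events_iceberg' stands in for whatever
# table the swarm nodes expose over the data lake.
query_server_ddl = distributed_ddl(
    "lake.events", COLUMNS, "swarm", "lake", "events_iceberg"
)

print(initiator_ddl)
print(query_server_ddl)
```

With this shape, the Initiator only needs network access to the Query Server; S3 and catalog credentials would stay inside Kubernetes.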
Describe alternatives you've considered
- Direct access to swarm nodes. This requires careful setup to make the swarm's pod DNS names visible to the Initiator.
Additional context