Skip to content

feat(clickhouse): support configurable storage engine for events table#4271

Open
pdeaudney wants to merge 4 commits into
openmeterio:mainfrom
pdeaudney:feat/clickhouse-replicated-engine
Open

feat(clickhouse): support configurable storage engine for events table#4271
pdeaudney wants to merge 4 commits into
openmeterio:mainfrom
pdeaudney:feat/clickhouse-replicated-engine

Conversation

@pdeaudney
Copy link
Copy Markdown

@pdeaudney pdeaudney commented May 3, 2026

Adds aggregation.clickhouse.eventsTableEngine config block so self-hosted multi-node ClickHouse deployments can create the events table with ReplicatedMergeTree (with optional ON CLUSTER) instead of the hardcoded MergeTree. Default behavior is unchanged: ClickHouse Cloud users continue to rely on Cloud's transparent rewrite to SharedMergeTree.

The connector now plumbs an EventsTableEngine value through to the SQL builder. Cluster names are backtick-quoted in the emitted DDL so any identifier ClickHouse permits in <remote_servers> is accepted (including hyphens) without re-implementing identifier validation. ZooKeeper path and replica name are escaped against single-quote injection and validated for non-whitespace content.

Coverage:

  • Unit tests on rendered SQL for legacy default, explicit MergeTree, ReplicatedMergeTree with macros, ON CLUSTER rendering, hyphenated cluster names, embedded backtick escaping, and single-quote escaping in zk path
  • Validation tests for required fields, whitespace-only inputs, and unknown engine types in both the config and connector layers
  • YAML unmarshal coverage in TestComplete pinning the camelCase Viper key naming for all four engine fields
  • Live single-node integration tests (gated by TEST_CLICKHOUSE_DSN) exercising explicit MergeTree, default-zero-value, and invalid-config paths through the connector constructor
  • Live 2-node replicated integration test (gated by TEST_CLICKHOUSE_REPLICATED=1) that inserts on node-01 and reads back from node-02 to prove Keeper substituted {shard}/{replica} and replication is wired end-to-end

Adds docker-compose.replicated.yaml + make up-replicated/down-replicated targets bringing up a Keeper-backed 2-replica cluster on ports 38123/39000 and 38124/39001 (namespaced clear of the default stack), with server logs streaming to stdout via docker logs.

Adds docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md with the config reference, an offline RENAME + INSERT SELECT recipe for converting an existing MergeTree table, and the
system.distributed_ddl_queue / system.replicas checks operators should run between steps.

Refs #3050

Summary by CodeRabbit

  • New Features

    • Configure ClickHouse events table engine (MergeTree or ReplicatedMergeTree); startup validates required fields and defaults to MergeTree.
    • New Make targets to start/stop a local 2-node replicated ClickHouse test cluster.
  • Documentation

    • Migration guide detailing switching om_events to ReplicatedMergeTree, offline migration steps, single-shard alternative, and test stack usage.
  • Tests

    • Added unit and integration tests covering engine validation, SQL rendering, and replicated end-to-end behavior.

Review Change Stack

Adds aggregation.clickhouse.eventsTableEngine config block so self-hosted
multi-node ClickHouse deployments can create the events table with
ReplicatedMergeTree (with optional ON CLUSTER) instead of the hardcoded
MergeTree. Default behavior is unchanged: ClickHouse Cloud users continue
to rely on Cloud's transparent rewrite to SharedMergeTree.

The connector now plumbs an EventsTableEngine value through to the SQL
builder. Cluster names are backtick-quoted in the emitted DDL so any
identifier ClickHouse permits in <remote_servers> is accepted (including
hyphens) without re-implementing identifier validation. ZooKeeper path
and replica name are escaped against single-quote injection and validated
for non-whitespace content.

Coverage:

- Unit tests on rendered SQL for legacy default, explicit MergeTree,
  ReplicatedMergeTree with macros, ON CLUSTER rendering, hyphenated cluster
  names, embedded backtick escaping, and single-quote escaping in zk path
- Validation tests for required fields, whitespace-only inputs, and
  unknown engine types in both the config and connector layers
- YAML unmarshal coverage in TestComplete pinning the camelCase Viper key
  naming for all four engine fields
- Live single-node integration tests (gated by TEST_CLICKHOUSE_DSN)
  exercising explicit MergeTree, default-zero-value, and invalid-config
  paths through the connector constructor
- Live 2-node replicated integration test (gated by
  TEST_CLICKHOUSE_REPLICATED=1) that inserts on node-01 and reads back
  from node-02 to prove Keeper substituted {shard}/{replica} and
  replication is wired end-to-end

Adds docker-compose.replicated.yaml + make up-replicated/down-replicated
targets bringing up a Keeper-backed 2-replica cluster on ports
38123/39000 and 38124/39001 (namespaced clear of the default stack), with
server logs streaming to stdout via docker logs.

Adds docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md
with the config reference, an offline RENAME + INSERT SELECT recipe for
converting an existing MergeTree table, and the
system.distributed_ddl_queue / system.replicas checks operators should
run between steps.

Refs openmeterio#3050

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pdeaudney pdeaudney requested a review from a team as a code owner May 3, 2026 00:36
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4ddd0245-afee-455b-a716-4ca055cfeecc

📥 Commits

Reviewing files that changed from the base of the PR and between 1c7f061 and 180150a.

📒 Files selected for processing (5)
  • app/config/aggregation.go
  • app/config/aggregation_test.go
  • openmeter/streaming/clickhouse/event_query.go
  • openmeter/streaming/clickhouse/event_query_test.go
  • openmeter/streaming/clickhouse/events_table_engine_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • app/config/aggregation.go
  • app/config/aggregation_test.go
  • openmeter/streaming/clickhouse/events_table_engine_test.go
  • openmeter/streaming/clickhouse/event_query.go
  • openmeter/streaming/clickhouse/event_query_test.go

📝 Walkthrough

Walkthrough

Adds configurable ClickHouse events table engine support (MergeTree or ReplicatedMergeTree) with validation and SQL/ON CLUSTER rendering, wires the engine through the streaming connector, adds unit and integration tests (gated replicated E2E), provides a 2-node replicated Docker Compose stack plus Make targets, and publishes a migration guide.

Changes

ClickHouse ReplicatedMergeTree Engine Support

Layer / File(s) Summary
Config Schema & Validation
app/config/aggregation.go
Adds ClickhouseEventsTableEngineConfig, engine type constants, Validate() and ResolvedType(), and imports strings for trimming.
Config Defaults & Fixtures
app/config/aggregation.go, app/config/testdata/complete.yaml, app/config/config_test.go
Defaults aggregation.clickhouse.eventsTableEngine.type to MergeTree; testdata and expected config updated with a ReplicatedMergeTree example.
SQL Types & CREATE TABLE Wiring
openmeter/streaming/clickhouse/event_query.go
Introduces EventsTableEngine type, Validate(), engineClause(), escaping/backtick helpers, and threads Engine into createEventsTable/toSQL() with optional ON CLUSTER.
Connector Integration
openmeter/streaming/clickhouse/connector.go, app/common/streaming.go
Connector Config gains EventsTableEngine; Config.Validate() validates it; NewStreamingConnector forwards resolved engine fields into the ClickHouse connector config.
Unit Tests (engine + SQL)
openmeter/streaming/clickhouse/event_query_test.go, app/config/aggregation_test.go
Adds table-driven tests for SQL rendering and for engine config validation/resolution, including escaping/quoting cases and defaulting behavior.
Integration Tests
openmeter/streaming/clickhouse/events_table_engine_test.go
New suite asserting table engine in system.tables, validation-failure behavior, and a gated replicated E2E test that creates the DB/table across replicas and verifies replication.
Local Replicated Stack & Make targets
docker-compose.replicated.yaml, Makefile
Adds 2-node replicated ClickHouse compose stack (Keeper + 2 nodes) and make up-replicated / make down-replicated targets.
Documentation / Migration Guide
docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md
New migration guide with config schema, new-deploy and offline migration procedures, backfill/verification steps, local test stack usage, and notes on defaults/Cloud behavior.

Sequence Diagram(s)

sequenceDiagram
    participant OM as OpenMeter
    participant Keeper as ClickHouse Keeper
    participant CH1 as ClickHouse Node 1
    participant CH2 as ClickHouse Node 2
    OM->>OM: Validate EventsTableEngine config
    OM->>Keeper: Connect / coordinate (ZooKeeper path)
    OM->>CH1: CREATE TABLE (ENGINE clause / ON CLUSTER if set)
    CH1->>CH2: Distributed DDL replication
    CH1->>Keeper: ReplicatedMergeTree metadata registration
    CH2->>Keeper: Replica registration
    OM->>CH1: Insert event (streaming.BatchInsert)
    CH1->>CH2: Replicate inserted row
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • tothandras
  • gergely-kurucz-konghq
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding configurable storage engine support for ClickHouse events tables, which is the core feature across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
docker-compose.replicated.yaml (1)

1-220: 💤 Low value

Well-designed test cluster!

A few nice touches: the imok healthcheck comment explaining BusyBox nc quirks, stdout logging so docker logs is actually useful, and internal_replication: true which is the right setting for this topology. The distributed_ddl path is also correctly wired so CREATE TABLE ... ON CLUSTER propagates properly.

One small note: the Keeper image on line 27 (clickhouse/clickhouse-keeper:25.12.3-alpine) isn't SHA-pinned, while both server images are. Totally fine for a dev stack, but pinning it would make the setup fully reproducible.

📌 Pin keeper image for full reproducibility
-    image: clickhouse/clickhouse-keeper:25.12.3-alpine
+    image: clickhouse/clickhouse-keeper:25.12.3-alpine@sha256:<keeper-image-sha>

Run docker pull clickhouse/clickhouse-keeper:25.12.3-alpine and grab the digest from docker inspect to fill in the SHA.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker-compose.replicated.yaml` around lines 1 - 220, The clickhouse-keeper
image in the clickhouse-keeper-01 service is not SHA-pinned (image:
clickhouse/clickhouse-keeper:25.12.3-alpine); update that image field to a
digest-pinned form (e.g.,
clickhouse/clickhouse-keeper:25.12.3-alpine@sha256:<digest>) to make the stack
fully reproducible—obtain the digest by running docker pull
clickhouse/clickhouse-keeper:25.12.3-alpine and docker inspect to copy the
sha256 digest, then replace the image value under the clickhouse-keeper-01
service.
openmeter/streaming/clickhouse/events_table_engine_test.go (1)

100-196: 💤 Low value

Clever defer trick and thorough end-to-end coverage. A couple of optional things to consider:

  1. Hardcoded node-02 address (line 176): The DSN 127.0.0.1:39001 is fine given the skip gate, but you could read it from an env var (e.g., TEST_CLICKHOUSE_REPLICA2_DSN) to make the test stack more flexible without breaking anything.

  2. Polling loop ceiling (lines 184–194): 3 s max wait is tight on a slow VM. A TEST_CLICKHOUSE_REPLICATION_TIMEOUT env var or bumping to 50 iterations wouldn't hurt, though it's probably fine for local use.

Neither is blocking — just flagging for future ergonomics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openmeter/streaming/clickhouse/events_table_engine_test.go` around lines 100
- 196, The test TestReplicatedMergeTree hardcodes the secondary ClickHouse DSN
and uses a fixed 30-iteration poll loop; change it to read node-02 DSN from an
env var (e.g., TEST_CLICKHOUSE_REPLICA2_DSN) when creating node02Opts and
fallback to the current "clickhouse://default:default@127.0.0.1:39001/openmeter"
if unset, and make the replication polling ceiling configurable via an env var
(e.g., TEST_CLICKHOUSE_REPLICATION_TIMEOUT or
TEST_CLICKHOUSE_REPLICATION_ATTEMPTS) used in the for-loop that scans into seen
(the loop around row := node02.QueryRow(...)); ensure default behavior remains
the same when env vars are absent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md`:
- Around line 86-146: Update the migration steps that use the ClickHouse ON
CLUSTER macro (specifically the statements "RENAME TABLE openmeter.om_events TO
openmeter.om_events_legacy ON CLUSTER {cluster};", the "CREATE TABLE ... ON
CLUSTER {cluster} (...)" statement, and "DROP TABLE openmeter.om_events_legacy
ON CLUSTER {cluster} SYNC;") to include a short clarifying note that {cluster}
is a ClickHouse server-side <macros> value (not an auto-substituted placeholder)
and advise operators to either define the cluster macro in their ClickHouse
config or replace {cluster} with their literal cluster name (e.g.
openmeter_cluster) when running the DDL; add the note inline where each ON
CLUSTER usage appears so readers see the guidance for steps 2, 3 and 7.

---

Nitpick comments:
In `@docker-compose.replicated.yaml`:
- Around line 1-220: The clickhouse-keeper image in the clickhouse-keeper-01
service is not SHA-pinned (image: clickhouse/clickhouse-keeper:25.12.3-alpine);
update that image field to a digest-pinned form (e.g.,
clickhouse/clickhouse-keeper:25.12.3-alpine@sha256:<digest>) to make the stack
fully reproducible—obtain the digest by running docker pull
clickhouse/clickhouse-keeper:25.12.3-alpine and docker inspect to copy the
sha256 digest, then replace the image value under the clickhouse-keeper-01
service.

In `@openmeter/streaming/clickhouse/events_table_engine_test.go`:
- Around line 100-196: The test TestReplicatedMergeTree hardcodes the secondary
ClickHouse DSN and uses a fixed 30-iteration poll loop; change it to read
node-02 DSN from an env var (e.g., TEST_CLICKHOUSE_REPLICA2_DSN) when creating
node02Opts and fallback to the current
"clickhouse://default:default@127.0.0.1:39001/openmeter" if unset, and make the
replication polling ceiling configurable via an env var (e.g.,
TEST_CLICKHOUSE_REPLICATION_TIMEOUT or TEST_CLICKHOUSE_REPLICATION_ATTEMPTS)
used in the for-loop that scans into seen (the loop around row :=
node02.QueryRow(...)); ensure default behavior remains the same when env vars
are absent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f5546f08-5193-4e71-84f2-156111b962a8

📥 Commits

Reviewing files that changed from the base of the PR and between ff379af and 1bbb53a.

📒 Files selected for processing (12)
  • Makefile
  • app/common/streaming.go
  • app/config/aggregation.go
  • app/config/aggregation_test.go
  • app/config/config_test.go
  • app/config/testdata/complete.yaml
  • docker-compose.replicated.yaml
  • docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md
  • openmeter/streaming/clickhouse/connector.go
  • openmeter/streaming/clickhouse/event_query.go
  • openmeter/streaming/clickhouse/event_query_test.go
  • openmeter/streaming/clickhouse/events_table_engine_test.go

Comment thread docs/migration-guides/2026-05-02-clickhouse-replicated-engine.md
pdeaudney and others added 2 commits May 3, 2026 10:47
Note added at the first ON CLUSTER {cluster} occurrence in the migration
guide, telling operators that {cluster} is a ClickHouse <macros> value
they may need to define (or replace with the literal cluster name).
The note extends to every subsequent ON CLUSTER {cluster} in the steps,
so we don't repeat it for each.

Addresses CodeRabbit feedback on PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The replicated compose stack was pinning every image to tag@sha256:...
except clickhouse-keeper, which was tag-only. Match the convention used
for clickhouse-server, kafka, postgres, redis, and svix in
docker-compose.base.yaml so this stack is reproducible across hosts and
not silently affected if the upstream :25.12.3-alpine tag is rebuilt.

Digest is the multi-platform manifest index, so the same line works on
both arm64 (local Mac dev) and amd64 (CI / Linux), matching the existing
clickhouse-server pin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@GAlexIHU GAlexIHU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pdeaudney for your contrib, it's likely something that will be well received by users!

I only have a few comments, once those are resolved I'd be happy to merge this :)

Comment on lines +80 to +91
{
name: "cluster name with embedded backtick is escaped",
data: createEventsTable{
Database: "openmeter",
EventsTableName: "om_events",
Engine: EventsTableEngine{
Type: EventsTableEngineMergeTree,
Cluster: "weird`name",
},
},
want: "CREATE TABLE IF NOT EXISTS openmeter.om_events ON CLUSTER `weird``name` (namespace String, id String, type LowCardinality(String), subject String, source String, time DateTime, data String, ingested_at DateTime, stored_at DateTime, INDEX om_events_stored_at stored_at TYPE minmax GRANULARITY 4, store_row_id String) ENGINE = MergeTree PARTITION BY toYYYYMM(time) ORDER BY (namespace, type, subject, toStartOfHour(time))",
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string escaping test is good, but I think this exact case (ON CLUSTER + ENGINE = MergeTree) should not be a valid configuration.

Copy link
Copy Markdown

@georgechieng-sc georgechieng-sc May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 180150a. Both event_query.go and app/config/aggregation.go now reject Cluster set when the resolved engine is MergeTree (returns: "cluster requires ReplicatedMergeTree engine; MergeTree with ON CLUSTER produces independent (non-replicated) tables per node"). The two render tests that previously combined MergeTree + Cluster were reworked to use ReplicatedMergeTree, and the "valid" cluster-name cases in the validate suite (hyphen, leading digit, embedded backtick, whitespace-only) all now sit under a ReplicatedMergeTree config.

StoreRowID: "1",
}}))

node02Opts, err := clickhouse.ParseDSN("clickhouse://default:default@127.0.0.1:39001/openmeter")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please make all replicated CH configuration used in the test ENV based (similar to TEST_CLICKHOUSE_REPLICATED flag)? it would be nice to decouple + would make it easier to run this not just locally

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 180150a. The replicated test now reads its full topology from env vars, following the same skip-on-missing pattern as the existing TEST_CLICKHOUSE_DSN gate in suite_test.go:

  • TEST_CLICKHOUSE_REPLICATED (=1, top-level gate)
  • TEST_CLICKHOUSE_REPLICATED_CLUSTER
  • TEST_CLICKHOUSE_REPLICATED_ZK_PATH
  • TEST_CLICKHOUSE_REPLICATED_REPLICA_NAME
  • TEST_CLICKHOUSE_REPLICATED_NODE2_DSN
  • TEST_CLICKHOUSE_REPLICATED_REPLICA_COUNT

Each is read inline at the top of TestReplicatedMergeTreeEngine with a named s.T().Skip(...) if missing. No defaults — fully decoupled from make up-replicated, runnable against any externally-managed cluster (or CI) by setting the env vars.


// escapeSingleQuotes doubles single quotes so the value can be embedded in a
// single-quoted SQL string literal without breaking out.
func escapeSingleQuotes(s string) string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its good you added escaping but

  1. i don't think it matches the CH spec
  2. i think it would be a safer bet if you didn't handroll it

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 180150a. Replaced the handrolled strings.ReplaceAll(s, "'", "''") with a package-level strings.NewReplacer that matches the ClickHouse string-literal grammar — backslash as the escape character, with \\ and \' as the canonical escapes:

var stringLiteralEscaper = strings.NewReplacer(`\`, `\\`, `'`, `\'`)

This mirrors clickhouse-go's internal stringQuoteReplacer (see bind.go:275), so we're using the same escape policy as the driver rather than reinventing it. Test coverage extended to also exercise an embedded backslash in the ZooKeeper path.

… env-drive replicated tests

- Reject `Cluster` set with non-Replicated engine in both event_query.go
  and app/config/aggregation.go validation. MergeTree + ON CLUSTER
  produces independent (non-replicated) tables per node, which is
  almost never intended for the events table.
- Replace handrolled single-quote doubling with a NewReplacer that
  follows the ClickHouse string-literal grammar (mirrors clickhouse-go's
  stringQuoteReplacer): `\` -> `\\`, `'` -> `\'`.
- Drive the ReplicatedMergeTree integration test entirely from env vars
  (cluster, zk path, replica name, node-02 DSN, expected replica count)
  using the same skip-on-missing pattern as the existing
  TEST_CLICKHOUSE_DSN gate, so the test decouples from the local
  docker-compose stack.

Addresses PR review comments from @GAlexIHU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note/feature Release note: Exciting New Features

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants