feat: add index for dedup copy pipe (CM-1000) #3900
Pull request overview
This PR improves incremental copying from activities into activities_deduplicated_ds by adding updatedAt-based skipping indexes and tightening the watermark subquery to a recent time window, aiming to reduce ClickHouse granule scans during scheduled COPY runs.
Changes:
- Added a minmax skipping index on updatedAt to activities.datasource.
- Added a minmax skipping index on updatedAt to activities_deduplicated_ds.datasource (and updated its sorting key).
- Updated activities_deduplicated_copy_pipe_append_mode to compute the watermark from only the last 3 hours of activities_deduplicated_ds.
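For reference, the minmax skipping index described above corresponds roughly to the following raw ClickHouse DDL. This is a sketch only: the index name idx_updatedAt and the GRANULARITY value are assumptions, and in Tinybird the index is declared in the .datasource file rather than applied with ALTER TABLE.

```sql
-- Sketch only: index name and granularity are assumed, not taken from the PR.
-- A minmax index stores the min and max of updatedAt per block of granules,
-- so range filters on updatedAt can skip blocks that cannot match.
ALTER TABLE activities
    ADD INDEX idx_updatedAt updatedAt TYPE minmax GRANULARITY 4;
```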
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| services/libs/tinybird/pipes/activities_deduplicated_copy_pipe_append_mode.pipe | Uses a range-limited watermark subquery to avoid full scans during COPY. |
| services/libs/tinybird/datasources/activities_deduplicated_ds.datasource | Adds updatedAt minmax skipping index; also changes sorting key to include updatedAt. |
| services/libs/tinybird/datasources/activities.datasource | Adds updatedAt minmax skipping index to accelerate range filters on updatedAt. |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: Umberto Sgueglia <ulemons92@gmail.com>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Changes
- Added a minmax skipping index on updatedAt to activities.datasource, allowing ClickHouse to skip irrelevant granules when filtering by updatedAt
- Added a minmax skipping index on updatedAt to activities_deduplicated_ds.datasource, enabling efficient range filtering on the watermark subquery
- Updated activities_deduplicated_copy_pipe_append_mode to replace the full table scan subquery with a range-limited one:
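As an illustration only, not the exact SQL from the pipe, a range-limited watermark subquery could look like the sketch below. It assumes the copy filters activities on updatedAt and uses a 3-hour lookback window; the SELECT * projection is a placeholder.

```sql
-- Sketch only: the real pipe's columns, window, and filters may differ.
SELECT *
FROM activities
WHERE updatedAt > (
    -- Watermark: the newest updatedAt already copied. Restricting the lookup
    -- to a recent window lets the minmax index on updatedAt prune older
    -- granules instead of scanning the whole target table.
    SELECT max(updatedAt)
    FROM activities_deduplicated_ds
    WHERE updatedAt > now() - INTERVAL 3 HOUR
)
```

One point to keep in mind with this shape: in ClickHouse, max() over an empty set returns the type's default value rather than NULL for non-Nullable columns, so a window with no rows would likely need an explicit fallback.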
Known limitation
Since activities uses ReplacingMergeTree, deduplication happens in background merges. Intermediate versions of the
same activity captured between merges may result in multiple rows with the same id in activities_deduplicated_ds.
Downstream pipes consuming this datasource are expected to handle this via argMax or equivalent deduplication
patterns.
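
A downstream pipe might collapse duplicate ids along these lines. This is a minimal sketch: platform and timestamp are placeholder column names chosen for illustration, and it assumes updatedAt identifies the latest version of each activity.

```sql
-- Sketch only: `platform` and `timestamp` are placeholder columns.
-- For each id, keep the column values from the row with the greatest updatedAt.
SELECT
    id,
    argMax(platform, updatedAt) AS latestPlatform,
    argMax(timestamp, updatedAt) AS latestTimestamp,
    max(updatedAt) AS latestUpdatedAt
FROM activities_deduplicated_ds
GROUP BY id
```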