feat: add index for dedup copy pipe (CM-1000) #3900

Merged
ulemons merged 9 commits into main from feat/update-activities-sorting-key
Mar 11, 2026

@ulemons (Contributor) commented Mar 9, 2026

Changes

  • Added minmax skipping index on updatedAt to activities.datasource, allowing ClickHouse to skip irrelevant
    granules when filtering by updatedAt
  • Added minmax skipping index on updatedAt to activities_deduplicated_ds.datasource, enabling efficient range
    filtering on the watermark subquery
  • Updated activities_deduplicated_copy_pipe_append_mode to replace the full table scan subquery with a
    range-limited one:
    WHERE a.updatedAt > (
        SELECT max(updatedAt)
        FROM activities_deduplicated_ds
        WHERE updatedAt > now() - INTERVAL 3 HOUR
    )
    This reduces the subquery from a full scan of 150k+ granules to a scan of recent data only.
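
To illustrate why a minmax skipping index helps the range filter above, here is a toy Python model, not ClickHouse code. It assumes each granule keeps only the minimum and maximum `updatedAt` it contains; the `Granule` class and `granules_to_read` helper are hypothetical names, and real indexes summarize blocks of granules according to the configured granularity.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Granule:
    """Toy stand-in for a ClickHouse granule: only its updatedAt values."""

    updated_at: list[datetime]

    @property
    def min_max(self) -> tuple[datetime, datetime]:
        # This pair is the summary a minmax skipping index would store.
        return min(self.updated_at), max(self.updated_at)


def granules_to_read(granules: list[Granule], threshold: datetime) -> list[int]:
    """Indices of granules that may hold rows with updatedAt > threshold."""
    # A granule whose maximum is not above the threshold cannot match,
    # so it is skipped without reading its rows.
    return [i for i, g in enumerate(granules) if g.min_max[1] > threshold]


now = datetime(2026, 3, 11, 12, 0)
granules = [
    Granule([now - timedelta(days=30), now - timedelta(days=29)]),
    Granule([now - timedelta(days=2), now - timedelta(hours=5)]),
    Granule([now - timedelta(hours=2), now - timedelta(minutes=10)]),
]
print(granules_to_read(granules, now - timedelta(hours=3)))  # [2]
```

Only the last granule survives the `updatedAt > now() - INTERVAL 3 HOUR` style filter in this model; the older two are ruled out from their stored maximum alone.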
    

Known limitation

Since activities uses ReplacingMergeTree, deduplication happens in background merges. Intermediate versions of the
same activity captured between merges may result in multiple rows with the same id in activities_deduplicated_ds.
Downstream pipes consuming this datasource are expected to handle this via argMax or equivalent deduplication
patterns.
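
As a rough illustration of the argMax-style pattern mentioned above, the following Python sketch keeps, for each `id`, the row with the greatest `updatedAt`. The `latest_per_id` helper and the `score` field are hypothetical and used only for illustration; on ties this sketch keeps the first row seen, which is a choice of the sketch, not a documented guarantee of any downstream pipe.

```python
from datetime import datetime


def latest_per_id(rows):
    """Keep, for each id, the row with the greatest updatedAt."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        # Replace the stored row only when a strictly newer version arrives.
        if current is None or row["updatedAt"] > current["updatedAt"]:
            latest[row["id"]] = row
    return list(latest.values())


# Two intermediate versions of activity "a1" were copied between merges.
rows = [
    {"id": "a1", "updatedAt": datetime(2026, 3, 11, 9, 0), "score": 1},
    {"id": "a1", "updatedAt": datetime(2026, 3, 11, 10, 0), "score": 5},
    {"id": "b2", "updatedAt": datetime(2026, 3, 11, 9, 30), "score": 3},
]
deduped = latest_per_id(rows)
print([(r["id"], r["score"]) for r in deduped])  # [('a1', 5), ('b2', 3)]
```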


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk performance change: adds ClickHouse skipping indexes and narrows a watermark subquery window; primary risk is missing older late-arriving updates outside the 3-hour range.
> 
> **Overview**
> Improves performance of the `activities` → `activities_deduplicated_ds` append copy by adding `minmax` skipping indexes on `updatedAt` to both datasources.
> 
> Updates `activities_deduplicated_copy_pipe_append_mode` to compute its `updatedAt` watermark using only the last 3 hours (and clamps it with `greatest(max(updatedAt), now() - INTERVAL 3 HOUR)`), avoiding a full-table scan of `activities_deduplicated_ds` during incremental loads.
> 
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit c00322091bb5f27ce7edafd4c293700d6ddb50d3. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
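
The clamped watermark described in the summary above can be modelled in a few lines of Python. This is a sketch under assumptions, not the pipe itself: the `watermark` and `rows_to_copy` helpers are hypothetical, and timestamps stand in for whole rows. It also shows the risk the summary names, where an update older than the 3-hour window is never copied.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=3)


def watermark(dest_updated_at, now):
    """Model of greatest(max(updatedAt), now - 3h) over the recent window."""
    floor = now - WINDOW
    recent = [ts for ts in dest_updated_at if ts > floor]
    # With no recent destination rows, the floor itself is the watermark.
    return max([floor, *recent])


def rows_to_copy(source_updated_at, dest_updated_at, now):
    """Source timestamps strictly newer than the watermark."""
    wm = watermark(dest_updated_at, now)
    return [ts for ts in source_updated_at if ts > wm]


now = datetime(2026, 3, 11, 12, 0)
source = [datetime(2026, 3, 11, h, m) for h, m in [(8, 0), (10, 30), (11, 30)]]

# Destination is fresh: only rows newer than its latest updatedAt are copied.
fresh_dest = [datetime(2026, 3, 11, 6, 0), datetime(2026, 3, 11, 11, 0)]
print(rows_to_copy(source, fresh_dest, now))

# Destination is stale: the watermark falls back to 09:00, so the 08:00
# update is outside the window and is not copied.
stale_dest = [datetime(2026, 3, 11, 6, 0)]
print(rows_to_copy(source, stale_dest, now))
```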

@ulemons ulemons self-assigned this Mar 11, 2026
@ulemons ulemons changed the title feat: add index for dedup copy pipe feat: add index for dedup copy pipe (CM-1000) Mar 11, 2026
@ulemons ulemons added the Bug Created by Linear-GitHub Sync label Mar 11, 2026
@ulemons ulemons marked this pull request as ready for review March 11, 2026 09:55
Copilot AI review requested due to automatic review settings March 11, 2026 09:55
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves incremental copying from activities into activities_deduplicated_ds by adding updatedAt-based skipping indexes and tightening the watermark subquery to a recent time window, aiming to reduce ClickHouse granule scans during scheduled COPY runs.

Changes:

  • Added a minmax skipping index on updatedAt to activities.datasource.
  • Added a minmax skipping index on updatedAt to activities_deduplicated_ds.datasource (and updated its sorting key).
  • Updated activities_deduplicated_copy_pipe_append_mode to compute the watermark from only the last 3 hours of activities_deduplicated_ds.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| services/libs/tinybird/pipes/activities_deduplicated_copy_pipe_append_mode.pipe | Uses a range-limited watermark subquery to avoid full scans during COPY. |
| services/libs/tinybird/datasources/activities_deduplicated_ds.datasource | Adds updatedAt minmax skipping index; also changes sorting key to include updatedAt. |
| services/libs/tinybird/datasources/activities.datasource | Adds updatedAt minmax skipping index to accelerate range filters on updatedAt. |

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

ulemons and others added 4 commits March 11, 2026 11:51
Signed-off-by: Umberto Sgueglia <ulemons92@gmail.com>
Signed-off-by: Umberto Sgueglia <ulemons92@gmail.com>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
@ulemons ulemons merged commit 83ba5d3 into main Mar 11, 2026
9 of 10 checks passed
@ulemons ulemons deleted the feat/update-activities-sorting-key branch March 11, 2026 11:33

Labels

Bug Created by Linear-GitHub Sync

4 participants