feat(eap): Add v2 co-occurring attributes storage with count column#7801
Draft
Add a new SummingMergeTree-based storage for co-occurring attributes that includes a count column for proper deduplication via key_hash. The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature flag. Also simplify result row parsing in the attribute names endpoint.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw
This PR has a migration; here is the generated SQL:
-- start migrations
-- forward migration events_analytics_platform : 0054_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool))), attributes_string Array(String), attributes_float Array(String), attributes_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_bool)))), count UInt64) AS
SELECT
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toMonday(timestamp) AS date,
retention_days as retention_days,
arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
mapKeys(attributes_bool) AS attributes_bool,
arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
1 AS count
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0054_add_count_to_co_occurring_attrs
-- backward migration events_analytics_platform : 0054_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0054_add_count_to_co_occurring_attrs
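To make the `key_hash` materialized column easier to reason about, here is a minimal Python sketch of the same pipeline (`arrayConcat`, then `arrayDistinct`, then `arraySort`, then a 64-bit hash). It is an illustration only: the function name `key_hash` is borrowed from the column, and an 8-byte BLAKE2b digest stands in for `cityHash64`, so the numeric values will not match what ClickHouse computes.

```python
import hashlib


def key_hash(attributes_string, attributes_float, attributes_bool):
    """Order-insensitive 64-bit key for a set of attribute names.

    Mirrors the shape of the migration's expression:
    cityHash64(arraySort(arrayDistinct(arrayConcat(...)))).
    ASSUMPTION: BLAKE2b (8-byte digest) is a stand-in for cityHash64.
    """
    # arrayConcat + arrayDistinct + arraySort
    names = sorted(set(attributes_string + attributes_float + attributes_bool))
    # Join with a separator that should not appear in attribute names.
    payload = "\x00".join(names).encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big")


# Same attribute set, different order and duplicates: same key.
same_a = key_hash(["http.method", "user.id"], ["duration"], [])
same_b = key_hash(["user.id", "http.method", "user.id"], ["duration"], [])
# A different attribute set: different key (barring a hash collision).
other = key_hash(["http.method"], ["duration"], [])
```

Because the names are deduplicated and sorted before hashing, two items that carry the same attribute names end up with the same key regardless of the order in which the names were written, which is what lets the sorting key group them together.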
Add a new SummingMergeTree-based storage (`eap_item_co_occurring_attrs_v2`) for co-occurring attributes that includes a `count` column and `key_hash` for proper row deduplication during merges. This replaces the existing ReplacingMergeTree approach with a more efficient summing strategy.
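As a rough illustration of the summing strategy, the sketch below simulates what a SummingMergeTree merge does to rows that share the table's sorting key: they collapse into one row and the numeric `count` column is summed. The column names come from the migration's `ORDER BY` clause; the helper `summing_merge` and the sample rows are hypothetical and exist only for this example.

```python
# Sorting key taken from the migration's ORDER BY clause.
SORTING_KEY = (
    "organization_id",
    "project_id",
    "date",
    "item_type",
    "key_hash",
    "retention_days",
)


def summing_merge(rows):
    """Collapse rows with an equal sorting key, summing `count`.

    Hypothetical helper: a simplified model of a merge, not ClickHouse code.
    Non-key, non-numeric columns keep the value from the first row seen.
    """
    merged = {}
    for row in rows:
        key = tuple(row[col] for col in SORTING_KEY)
        if key in merged:
            merged[key]["count"] += row["count"]
        else:
            merged[key] = dict(row)  # copy so the input is not mutated
    return list(merged.values())


base = {
    "organization_id": 1,
    "project_id": 10,
    "date": "2024-01-01",
    "item_type": 1,
    "retention_days": 90,
}
rows = [
    {**base, "key_hash": 111, "count": 1},
    {**base, "key_hash": 111, "count": 1},  # same attribute set again
    {**base, "key_hash": 222, "count": 1},  # a different attribute set
]
merged_rows = summing_merge(rows)
```

Since ClickHouse merges parts in the background at unspecified times, readers of such a table generally still aggregate with `sum(count)` and a `GROUP BY` rather than assuming every key has already been collapsed to one row.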
The v2 storage is gated behind a `use_co_occurring_attrs_v2` runtime config flag, allowing gradual rollout. The migration (0054) creates the new local/dist tables, bloom filter index, and materialized view from `eap_items_1_local`.

Also simplifies `convert_co_occurring_results_to_attributes` to access the `attr_key` column by name instead of asserting on row length.
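The difference between the two parsing styles can be sketched as follows. This is not the body of `convert_co_occurring_results_to_attributes`; the helper `attr_keys_from_rows`, the assumption that rows arrive as dicts keyed by column name, and the sample values are all hypothetical.

```python
def attr_keys_from_rows(rows):
    """Extract the `attr_key` column from result rows by name.

    ASSUMPTION: each row is a mapping of column name to value.
    Extra columns (for example an aggregated count) are ignored,
    whereas a check like `assert len(row) == 1` would fail on them.
    """
    keys = []
    for index, row in enumerate(rows):
        if "attr_key" not in row:
            raise ValueError(f"row {index} has no 'attr_key' column")
        keys.append(row["attr_key"])
    return keys


# Rows with different numbers of columns parse the same way.
result_rows = [
    {"attr_key": "http.method"},
    {"attr_key": "user.id", "count": 42},
]
parsed = attr_keys_from_rows(result_rows)
```

Looking the column up by name keeps the parser working if a query later selects additional columns, at the cost of an explicit error when the expected column is absent.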
Agent transcript: https://claudescope.sentry.dev/share/jjGnsb7JWH13GyrGe-wbHapP5rwLIJPOJyGwWJKv-70