Description
Describe the bug
I have a table that has data ingested into it daily, all rows using the same ingestion timestamp (for the sake of this issue, let's call it ts). I'm trying to implement an elementary test based on a given dimension called row_type, so that I can evaluate if the volume for each dimension is still accurate or if it has drifted with time.
I found the dimension_anomalies test, but when I tried to use it with timestamp_column=ts and dimension=row_type, it returned inconsistent training statistics. In my case specifically, latest_metric_value in table elementary_db.metrics_anomaly_score correctly reflects what I have in the source when I group by row_type; however, the reported training_avg is far below that dimension's historical values.
The main problem is that one particular row_type value, X, has about 30x more data than the reported avg. Historically it shows no anomalies, since its values have been closely aligned for as long as the table has existed, yet it is still reported as an anomaly, which triggers an error. Could you please check?
P.S.: I have 40+ distinct row_type values. Some of them have just a few records, while the main ones have millions of rows every single day. I believe elementary dimension_anomalies should take the avg grouped by dimension, but if that's not the case by design, I'm not sure how this test can be useful. I'm not sure whether this is due to this issue reported and closed in the past, or something else.
To Reproduce
N/A
Expected behavior
Assuming I have dimension_anomalies in my sources.yml set as:
    config:
      elementary:
        timestamp_column: ts
    ...
    - elementary.dimension_anomalies:
        arguments:
          dimensions:
            - row_type
          anomaly_direction: both
          time_bucket:
            period: day
            count: 1
Suppose the dimension has distinct values X, Y and Z, and their row counts on day 1 through day 4 (D1 to D4) are:
| | D1 | D2 | D3 | D4 |
|---|---|---|---|---|
| X | 1000 | 1001 | 1010 | 1002 |
| Y | 200 | 200 | 201 | 7000 |
| Z | 10 | 10 | 9 | 10 |
Assuming the anomaly detection is already trained and configured to trigger in both directions, we should expect an anomaly alert on D4 for dimension value Y, because it has deviated from Y's historical avg/std (or whatever algorithm it uses).
But in my case, X has been flagged as anomalous since D1.
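To make the expectation concrete, here is a minimal Python sketch using the counts from the table above. It is not Elementary's implementation: the z_score helper, the THRESHOLD value and the choice of D1 to D3 as the training window are assumptions made purely for illustration. It scores D4 against each dimension value's own history, and also computes an average pooled across all dimension values, to show how such a pooled figure would sit far below X's real counts (similar to the training_avg symptom described above, though I have not confirmed that this is the cause).

```python
from statistics import mean, pstdev

# Daily row counts per row_type value, copied from the table above (D1..D4).
counts = {
    "X": [1000, 1001, 1010, 1002],
    "Y": [200, 200, 201, 7000],
    "Z": [10, 10, 9, 10],
}

THRESHOLD = 3.0  # hypothetical sensitivity, chosen only for this example


def z_score(value, training):
    """Standard score of value against a training series (0.0 if no spread)."""
    avg = mean(training)
    std = pstdev(training)
    return 0.0 if std == 0 else (value - avg) / std


# Expected behaviour: train on D1..D3 of the same dimension value, score D4.
per_dimension = {
    dim: z_score(series[-1], series[:-1]) for dim, series in counts.items()
}
flagged_per_dimension = sorted(
    dim for dim, z in per_dimension.items() if abs(z) > THRESHOLD
)

# For comparison only: an average pooled across all dimension values (D1..D3).
pooled_training = [v for series in counts.values() for v in series[:-1]]
pooled_avg = mean(pooled_training)

print(flagged_per_dimension)  # ['Y']
print(pooled_avg < min(counts["X"]))  # True
```

With per-dimension training, only Y is flagged on D4, which is the behaviour I expected from the test. The pooled average, by contrast, is lower than every daily count of X.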
Screenshots
N/A
Environment (please complete the following information):
- Elementary CLI (edr) version: [e.g. 0.5.3], can be found by running pip show elementary-data
- Elementary dbt package version: [e.g. 0.4.1], can be found in packages.yml file: 0.20.1 (tested with 0.23.0, no success)
- dbt version you're using [e.g. 1.8.1]: 1.11.2
- Data warehouse [e.g. snowflake]: AWS Athena
- Infrastructure details (e.g. operating system, prod / dev / staging, deployment infra, CI system, etc)
Additional context
Would you be willing to contribute a fix for this issue?