dimension_anomalies is failing for a given dimension when it shouldn't

**Describe the bug**
I have a table that has data ingested into it daily, with all rows in a load sharing the same ingestion timestamp (for the sake of this issue, let's call it `ts`). I'm trying to implement an elementary test based on a dimension called `row_type`, so that I can check whether the volume for each dimension value is still in line with its history or has drifted over time.

I found the `dimension_anomalies` test, but when I used it with `timestamp_column=ts` and `dimension=row_type`, it returned inconsistent training statistics. Specifically, `latest_metric_value` in table `elementary_db.metrics_anomaly_score` correctly reflects what I have in the source when I group by `row_type`, but the reported `training_avg` is far below that dimension's historical values.

The main issue is that one particular `row_type` value, X, has 30x more data than the reported avg. Historically it shows no anomalies, since its volume has been closely aligned for as long as the table has existed, yet it is still reported as an anomaly and consequently triggers an error. Could you please check?

P.S.: I have 40+ distinct `row_type` values. Some of them have just a few records, while the main ones have millions of rows every single day. I believe elementary `dimension_anomalies` should compute the avg per dimension value, but if that's not the case by design, I'm not sure how the test can be useful. I'm not sure if this is due to this [issue](https://github.com/elementary-data/elementary/issues/1729), which was reported and closed in the past, or if it's something else.


**To Reproduce**
N/A


**Expected behavior**

Assuming I have `dimension_anomalies` in my sources.yml set as:
```yaml
config:
  elementary:
    timestamp_column: ts
...
- elementary.dimension_anomalies:
    arguments:
      dimensions:
        - row_type
      anomaly_direction: both
      time_bucket:
        period: day
        count: 1
```
Suppose the dimension has distinct values X, Y and Z, with the following daily row counts on days D1 to D4:

|   | D1   | D2   | D3   | D4   |
|---|------|------|------|------|
| X | 1000 | 1001 | 1010 | 1002 |
| Y | 200  | 200  | 201  | **7000** |
| Z | 10   | 10   | 9    | 10   |

Assuming the anomaly model is already trained and is configured to trigger in both directions, I would expect an anomaly alert on D4 for dimension value Y, because it has deviated from Y's own historical avg/std (or whatever statistic the algorithm uses).

But in my case, X has been flagged as anomalous since D1.
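
To make the expectation concrete, here is a minimal Python sketch using the example counts from the table above. It is not Elementary's implementation: the z-score rule, the `THRESHOLD` value, and the function names are assumptions made only for illustration. It scores each dimension value against its own history (D1 to D3), and also prints a hypothetical baseline pooled across all dimension values to show how such a baseline would sit far below X's own average.

```python
from statistics import mean, pstdev

# Daily row counts per dimension value, D1..D4 (from the table above).
counts = {
    "X": [1000, 1001, 1010, 1002],
    "Y": [200, 200, 201, 7000],
    "Z": [10, 10, 9, 10],
}

THRESHOLD = 3.0  # assumed sensitivity, for illustration only


def z_score(history, latest):
    """Z-score of `latest` against the mean/std of `history`."""
    avg = mean(history)
    std = pstdev(history)
    if std == 0:
        return 0.0 if latest == avg else float("inf")
    return (latest - avg) / std


# Expected behavior: each dimension value is compared with its OWN history.
per_dimension = {
    dim: z_score(values[:-1], values[-1]) for dim, values in counts.items()
}
flagged = sorted(dim for dim, z in per_dimension.items() if abs(z) > THRESHOLD)
print("flagged on D4 (per-dimension baseline):", flagged)

# Hypothetical contrast: one baseline pooled across all dimension values.
pooled_history = [c for values in counts.values() for c in values[:-1]]
pooled_avg = mean(pooled_history)
x_avg = mean(counts["X"][:-1])
print(f"X's own training avg: {x_avg:.1f}, pooled training avg: {pooled_avg:.1f}")
```

With a per-dimension baseline only Y is flagged on D4, which is the behavior I expected. Whether a pooled baseline is what actually produces the low `training_avg` in my case is not something I have confirmed.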

**Screenshots**
N/A

**Environment (please complete the following information):**

- Elementary CLI (edr) version: [e.g. 0.5.3], can be found by running `pip show elementary-data`
- Elementary dbt package version: [e.g. 0.4.1], can be found in `packages.yml` file: 0.20.1 (tested with 0.23.0, no success)
- dbt version you're using [e.g. 1.8.1]: 1.11.2
- Data warehouse [e.g. snowflake]: AWS Athena
- Infrastructure details (e.g. operating system, prod / dev / staging, deployment infra, CI system, etc)

**Additional context**

**Would you be willing to contribute a fix for this issue?**


dimension_anomalies is failing for a given dimension when it shouldn't #2172
