
dimension_anomalies is failing for a given dimension when it shouldn't #2172

@cabral1888


Describe the bug
I have a table that is loaded daily, and every row in a given load shares the same ingestion timestamp (for the sake of this issue, let's call it ts). I'm trying to set up an Elementary test on a dimension called row_type, so that I can check whether the row volume for each dimension value stays consistent or drifts over time.

I found the dimension_anomalies test, but with timestamp_column=ts and dimensions set to row_type it returns inconsistent training statistics. In my case, latest_metric_value in the table elementary_db.metrics_anomaly_score correctly matches what I get from the source when I group by row_type; however, the reported training_avg is far below that dimension value's historical values.

Specifically, for one row_type value, X, the actual daily volume is about 30x the reported avg. Historically X shows no anomalies, since its volume has been stable for as long as the table has existed, yet it is still reported as an anomaly, which triggers an error. Could you please check?

P.S.: I have 40+ distinct row_type values. Some of them have only a few records, while the main ones have millions of rows every single day. I believe Elementary's dimension_anomalies should compute the avg grouped by dimension value; if that's not the case by design, I'm not sure how the test can be useful. I'm also not sure whether this is due to this issue reported and closed in the past, or something else.
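
To make the distinction between a grouped and a pooled average concrete, here is a minimal Python sketch. The data and the names (`daily_counts`, `grouped_avg`, `pooled_avg`) are hypothetical and only illustrate the arithmetic; this is not a claim about how Elementary actually computes training_avg.

```python
from statistics import mean

# Hypothetical daily row counts per row_type value (made-up numbers,
# not taken from my table).
daily_counts = {
    "big": [3_000_000, 3_010_000, 2_990_000],
    "small_a": [100, 110, 90],
    "small_b": [10, 10, 10],
}

# Average grouped by dimension value: one baseline per row_type.
grouped_avg = {dim: mean(vals) for dim, vals in daily_counts.items()}

# Average pooled over every dimension value: a single shared baseline.
pooled_avg = mean(v for vals in daily_counts.values() for v in vals)

# How far the large value sits above the shared baseline.
ratio = grouped_avg["big"] / pooled_avg
print(grouped_avg["big"], round(pooled_avg), round(ratio, 2))
```

With a shared baseline, a high-volume value looks far above "average" even when it is perfectly stable, which would resemble the symptom I'm seeing if something like that were happening.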

To Reproduce
N/A

Expected behavior

Assuming I have dimension_anomalies in my sources.yml set as:

config:
  elementary:
    timestamp_column: ts
...
- elementary.dimension_anomalies:
    arguments:
      dimensions: 
        - row_type
      anomaly_direction: both
      time_bucket: 
        period: day
        count: 1

If you have a dimension with distinct values X, Y and Z, and their row counts on day 1 through day 4 (D1 to D4) are:

row_type    D1    D2    D3    D4
X         1000  1001  1010  1002
Y          200   200   201  7000
Z           10    10     9    10

Assuming the anomaly model is already trained and configured to trigger in both directions, we should expect an anomaly alert on D4 for dimension value Y, because it has deviated from Y's own historical avg/std (or whatever statistic the algorithm uses).

But in my case, X has been flagged as anomalous since D1.
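
For reference, the per-dimension behavior I expected can be sketched in Python as follows. This is only an illustration under my own assumptions: a plain z-score of the latest day against the earlier days, with a threshold of 3. The function name `z_score_latest` and the `THRESHOLD` value are hypothetical and not taken from Elementary's code.

```python
from statistics import mean, stdev

# Daily row counts per row_type value, D1..D4, from the example above.
counts = {
    "X": [1000, 1001, 1010, 1002],
    "Y": [200, 200, 201, 7000],
    "Z": [10, 10, 9, 10],
}

def z_score_latest(series):
    """Z-score of the last value against the mean/stdev of the earlier values."""
    training, latest = series[:-1], series[-1]
    avg, sd = mean(training), stdev(training)
    if sd == 0:
        # Flat history: any change at all is infinitely far from the baseline.
        return 0.0 if latest == avg else float("inf")
    return (latest - avg) / sd

THRESHOLD = 3.0  # assumed sensitivity, checked in both directions

# Each dimension value is scored only against its own history.
scores = {dim: z_score_latest(vals) for dim, vals in counts.items()}
anomalous = [dim for dim, z in scores.items() if abs(z) > THRESHOLD]
print(anomalous)
```

Under these assumptions only Y exceeds the threshold on D4, while X and Z stay well within their own historical range.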

Screenshots
N/A

Environment (please complete the following information):

  • Elementary CLI (edr) version: [e.g. 0.5.3], can be found by running pip show elementary-data
  • Elementary dbt package version: 0.20.1 (also tested with 0.23.0, no success)
  • dbt version: 1.11.2
  • Data warehouse: AWS Athena
  • Infrastructure details (e.g. operating system, prod / dev / staging, deployment infra, CI system, etc)

Additional context

Would you be willing to contribute a fix for this issue?
