Conversation
9433224 to
f39c372
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
yessss. Where is the setting for how often it runs? |
It's configured to run in bqetl_accounts_derived DAG, which is scheduled to run daily. DAG generation magic will make sure that this runs after MySQL tables are synced to BQ (expand |
| Note: because its source tables are overwritten daily, query used to | ||
| populate this table is not idempotent so this table should not be backfilled. |
There was a problem hiding this comment.
As it is, if someone re-runs a historical DAG run then historical data would be incorrectly overwritten with current data (which is highly probable since other normal daily-incremental ETLs exist in the bqetl_accounts_derived DAG).
I'd suggest using BigQuery's time travel feature in this ETL to select the data as it existed at data_interval_end so that any DAG run within the last ~7 days could potentially be backfilled, but trying to backfill further than time travel allows will fail outright rather than incorrectly overwriting data.
Here's an example of an ETL using time travel like that (though this doesn't need to be a script, and can remain a normal query ETL; a script was needed in that case because of the discrepancy between the monthly scheduling and the date partitioning setup).
There was a problem hiding this comment.
Cool, thanks for this idea!
| ( | ||
| SELECT | ||
| "accounts_linked_to_google" AS table_name, | ||
| COUNT(uid) AS total_rows |
There was a problem hiding this comment.
Are there cases where uid is null in this table?
There was a problem hiding this comment.
Unlikely, IIUC this query has been running for a while in a custom environment.
There was a problem hiding this comment.
I've run it by hand a few times but not "awhile". A NULL there would be a bug in the system. There shouldn't be any.
There was a problem hiding this comment.
Usually the only reason to use COUNT(expression) rather than COUNT(*) is if the expression might be null and shouldn't be counted in that case.
Or if it was intended to count distinct uid values then it should be COUNT(DISTINCT uid).
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
8aa5367 to
c98a15f
Compare
This comment has been minimized.
This comment has been minimized.
c98a15f to
ba2bd7c
Compare
This comment has been minimized.
This comment has been minimized.
| SELECT | ||
| "accounts_with_secondary_emails" AS table_name, | ||
| COUNT( | ||
| DISTINCT `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1`.uid | ||
| ) AS total_rows | ||
| FROM | ||
| `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1` FOR SYSTEM_TIME AS OF TIMESTAMP( | ||
| @as_of_date, | ||
| 'UTC' | ||
| ) | ||
| JOIN | ||
| `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1` FOR SYSTEM_TIME AS OF TIMESTAMP( | ||
| @as_of_date, | ||
| 'UTC' | ||
| ) | ||
| ON `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1`.uid = `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1`.uid | ||
| WHERE | ||
| `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1`.isPrimary = FALSE |
There was a problem hiding this comment.
Using table aliases would make this less verbose/more readable.
| SELECT | |
| "accounts_with_secondary_emails" AS table_name, | |
| COUNT( | |
| DISTINCT `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1`.uid | |
| ) AS total_rows | |
| FROM | |
| `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1` FOR SYSTEM_TIME AS OF TIMESTAMP( | |
| @as_of_date, | |
| 'UTC' | |
| ) | |
| JOIN | |
| `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1` FOR SYSTEM_TIME AS OF TIMESTAMP( | |
| @as_of_date, | |
| 'UTC' | |
| ) | |
| ON `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1`.uid = `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1`.uid | |
| WHERE | |
| `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1`.isPrimary = FALSE | |
| SELECT | |
| "accounts_with_secondary_emails" AS table_name, | |
| COUNT(DISTINCT fxa_accounts_v1.uid) AS total_rows | |
| FROM | |
| `moz-fx-data-shared-prod.accounts_db_external.fxa_accounts_v1` AS fxa_accounts_v1 FOR SYSTEM_TIME AS OF TIMESTAMP( | |
| @as_of_date, | |
| 'UTC' | |
| ) | |
| JOIN | |
| `moz-fx-data-shared-prod.accounts_db_external.fxa_emails_v1` AS fxa_emails_v1 FOR SYSTEM_TIME AS OF TIMESTAMP( | |
| @as_of_date, | |
| 'UTC' | |
| ) | |
| ON fxa_accounts_v1.uid = fxa_emails_v1.uid | |
| WHERE | |
| fxa_emails_v1.isPrimary = FALSE |
(same goes for the similar subquery below)
There was a problem hiding this comment.
This would unfortunately fail on formatting. Sqlgot seems not to be able to handle table aliases combined with another expression following them (FOR SYSTEM_TIME here).
There was a problem hiding this comment.
Dang. I'd suggest reporting that SQLGlot issue, as they're very responsive and have fixed every issue I've reported to them (often the same day on main, to be included in the next release).
There was a problem hiding this comment.
Another option is if you don't quote the whole table identifier (e.g. `moz-fx-data-shared-prod`.accounts_db_external.fxa_accounts_v1) then BigQuery allows you to refer to just the final table name segment, like an automatic alias. (though I do generally prefer to quote the entire table identifier)
ba2bd7c to
cb9478e
Compare
This comment has been minimized.
This comment has been minimized.
…g_db_counts_v1/metadata.yaml Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
…g_db_counts_v1/query.sql Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
| COUNT(*) AS total_rows | ||
| FROM | ||
| `moz-fx-data-shared-prod.accounts_db_external.fxa_account_groups_v1` FOR SYSTEM_TIME AS OF TIMESTAMP( | ||
| @as_of_date, |
There was a problem hiding this comment.
All of the other time travel expressions in this query should also use @as_of_date + 1 like the first one does, otherwise the data will be inconsistent.
Integration report for "s/@as_of_date,/@as_of_date+1,/g"
|
|
Thanks for landing. I see data in the table. I created mozilla/lookml-generator#1116 to be able to access it in Looker |
Supersedes mozilla/docker-etl#295
/cc @clouserw