Add --deep quality checks and previewable cleanup SQL #73
Batches per-column null and numeric scans in build_quality_overview into one query per table. Adds a --deep flag that runs additional detectors: whole-row duplicates, PK-shaped duplicates, text whitespace and empty-string flags, IQR-based numeric outliers, and orphan foreign-key-shaped values. Each finding carries dialect-aware, previewable cleanup SQL emitted by a new cleanup module, rendered in the CLI table and markdown outputs alongside the existing sections. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   85.86%   86.07%   +0.20%
==========================================
  Files          64       65       +1
  Lines       12317    12639     +322
==========================================
+ Hits        10576    10879     +303
- Misses       1741     1760      +19
Replaces the value != value idiom (flagged by CodeQL as comparison of identical values) with math.isnan, which removes both the bare except block and the need for an inline comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
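As a minimal sketch of the check this commit describes, the snippet below uses `math.isnan` behind an `isinstance` guard instead of the self-comparison idiom. The helper name `is_nan_scalar` is hypothetical and only for illustration; it is not taken from the PR.

```python
import math
from typing import Any


def is_nan_scalar(value: Any) -> bool:
    """Return True only for a float NaN (hypothetical helper)."""
    # math.isnan raises TypeError for non-numeric input such as str or None,
    # so the isinstance guard removes the need for a try/except around it.
    return isinstance(value, float) and math.isnan(value)
```

The guard is what makes a bare `except` unnecessary: values that are not floats never reach `math.isnan`.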
Pull request overview
Adds an opt-in “deep” mode for data-quality checks and attaches dialect-aware, preview-only cleanup SQL to findings, while also reducing per-column scan roundtrips by batching column stats into a single per-table query.
Changes:
- Refactors build_quality_overview to use a single batched stats query per table and introduces deep + sql_dialect parameters.
- Implements datasight quality --deep with additional detectors (duplicates, text cleanliness, outliers, orphan FK-shaped values) and renders “Suggested Cleanup” SQL in CLI/markdown outputs.
- Adds datasight.cleanup SQL preview builders and threads sql_dialect through CLI/web/audit report call sites with tests covering the new behavior.
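The batching idea in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `quote_identifier`, `build_batched_stats_sql`, and the `c{i}_...` alias scheme are all hypothetical names.

```python
def quote_identifier(name: str) -> str:
    """Double-quote an identifier, doubling embedded quotes (ANSI style)."""
    return '"' + name.replace('"', '""') + '"'


def build_batched_stats_sql(table: str, columns: list[str], numeric: set[str]) -> str:
    """Build one SELECT that covers null and numeric stats for every column."""
    parts = ["COUNT(*) AS row_count"]
    for i, col in enumerate(columns):
        qc = quote_identifier(col)
        # COUNT(col) skips NULLs, so null count = row_count - non-null count.
        parts.append(f"COUNT({qc}) AS c{i}_non_null")
        if col in numeric:
            # Numeric aggregates are only requested for numeric columns.
            parts.append(f"MIN({qc}) AS c{i}_min")
            parts.append(f"MAX({qc}) AS c{i}_max")
            parts.append(f"AVG({qc}) AS c{i}_avg")
    return f"SELECT {', '.join(parts)} FROM {quote_identifier(table)}"
```

With positional aliases, the single result row can be unpacked per column without a second roundtrip, which is the point of replacing one query per column with one query per table.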
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_web_app.py | Updates test fakes to accept new keyword arguments passed to build_quality_overview. |
| tests/test_data_profile_extra.py | Adds deep-mode fixtures/tests and asserts batched-scan behavior + cleanup SQL dialect branching. |
| src/datasight/web/app.py | Passes state.sql_dialect into build_quality_overview for web quality overview. |
| src/datasight/data_profile.py | Implements batched scan, deep detectors, and threads sql_dialect/deep through quality overview. |
| src/datasight/cli.py | Extends markdown renderer with deep finding sections and suggested cleanup SQL blocks. |
| src/datasight/cli_commands/quality.py | Adds --deep flag, passes dialect/deep to quality overview, and prints suggested cleanup SQL panels. |
| src/datasight/cli_commands/inspect.py | Passes SQL dialect into build_quality_overview. |
| src/datasight/cli_commands/audit_report.py | Passes SQL dialect into build_audit_report. |
| src/datasight/cleanup.py | New module generating previewable, dialect-aware cleanup SQL for findings. |
| src/datasight/audit_report.py | Threads sql_dialect/deep into audit report quality section. |
| docs/reference/cli.md | Documents the new datasight quality --deep option. |
Comments suppressed due to low confidence (2)
src/datasight/cleanup.py:74
pk_dedup_preview claims to show “one canonical row per duplicate … value”, but the current queries return one row for every distinct PK value (including non-duplicates). This makes the preview misleading and can produce a huge result set; it should filter down to only PK values with COUNT(*) > 1 (e.g., via a windowed COUNT(*) OVER (PARTITION BY ...) > 1 condition or a join to a subquery of duplicate keys).
def pk_dedup_preview(table: str, pk_column: str, dialect: str) -> str:
    """Show one canonical row per duplicate PK value.

    DuckDB uses ``QUALIFY`` for a one-liner; Postgres uses a CTE with
    ``ROW_NUMBER``; SQLite falls back to ``MIN(rowid)``.
    """
    qt = _quote_identifier(table)
    qc = _quote_identifier(pk_column)
    if dialect == "duckdb":
        return (
            f"-- One canonical row per duplicate {pk_column!r} value.\n"
            f"SELECT * FROM {qt} "
            f"QUALIFY ROW_NUMBER() OVER (PARTITION BY {qc} ORDER BY {qc}) = 1;"
        )
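One way the suggested filter could look is sketched below, using the windowed COUNT(*) approach from the comment in a CTE form rather than QUALIFY so it can be exercised with the standard-library sqlite3 module (this assumes a SQLite build with window-function support). The names `quote_identifier`, `pk_dedup_preview_duplicates_only`, `demo_duplicate_ids`, and the `orders` table are hypothetical and are not the PR's implementation.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Stand-in for the module's identifier quoting (assumed behavior)."""
    return '"' + name.replace('"', '""') + '"'


def pk_dedup_preview_duplicates_only(table: str, pk_column: str) -> str:
    """Preview one canonical row for each PK value that occurs more than once."""
    qt = quote_identifier(table)
    qc = quote_identifier(pk_column)
    return (
        "WITH ranked AS (\n"
        "  SELECT *,\n"
        f"    ROW_NUMBER() OVER (PARTITION BY {qc} ORDER BY {qc}) AS rn,\n"
        f"    COUNT(*) OVER (PARTITION BY {qc}) AS dup_count\n"
        f"  FROM {qt}\n"
        ")\n"
        # dup_count > 1 drops PK values that are already unique.
        "SELECT * FROM ranked WHERE rn = 1 AND dup_count > 1;"
    )


def demo_duplicate_ids() -> list[int]:
    """Run the preview against a tiny in-memory table and return the PKs."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "orders" ("id" INTEGER, "note" TEXT)')
    conn.executemany(
        'INSERT INTO "orders" VALUES (?, ?)',
        [(1, "a"), (1, "b"), (2, "c")],
    )
    rows = conn.execute(pk_dedup_preview_duplicates_only("orders", "id")).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

In the demo, id 1 appears twice and id 2 once, so only id 1 should come back; without the `dup_count > 1` condition both ids would be returned.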
src/datasight/data_profile.py:1093
_scalar_or_none currently coerces all non-null scalars to str. For aggregates used in downstream logic (min/max/avg/q1/q3), this loses type information and makes later comparisons/formatting brittle. Consider returning the original value (or a float/Decimal) and handling NaN/None without converting to string here; stringify only where the value is embedded into human-facing messages or SQL text.
def _scalar_or_none(value: Any) -> str | None:
    """Convert a SQL scalar to a stringified value or None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(value)
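A rough sketch of the type-preserving variant the comment proposes might look like this. The split into `scalar_or_none` and `format_scalar`, the Decimal NaN handling, and the "NULL" placeholder are illustrative assumptions, not the PR's code.

```python
import math
from decimal import Decimal
from typing import Any


def scalar_or_none(value: Any) -> Any:
    """Return the scalar unchanged, mapping None and NaN to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, Decimal) and value.is_nan():
        return None
    # Keep the original type so later min/max/avg comparisons stay numeric.
    return value


def format_scalar(value: Any) -> str:
    """Stringify only at the point where the value is shown to a person."""
    cleaned = scalar_or_none(value)
    return "NULL" if cleaned is None else str(cleaned)
```

Keeping formatting in a separate function means comparisons such as min == max operate on numbers rather than on their string forms.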
if dialect == "duckdb":
    materialize = (
        f"-- To materialize: CREATE OR REPLACE TABLE {qt} AS SELECT DISTINCT * FROM {qt};"
    )
else:
    materialize = (
        f"-- To materialize: BEGIN; DROP TABLE IF EXISTS {qt}_deduped; "
        f"CREATE TABLE {qt}_deduped AS SELECT DISTINCT * FROM {qt}; COMMIT;"
if _is_numeric_dtype(dtype) and not _looks_like_identifier(column_name):
    stats = await _get_numeric_stats(run_sql, table_name, column_name)
    if stats:
        min_value = stats.get("min")
        max_value = stats.get("max")
        avg_value = stats.get("avg")
        if min_value == max_value and min_value is not None:
            numeric_flags.append(
                {
                    "table": table_name,
                    "column": column_name,
                    "issue": f"constant numeric value ({min_value})",
                }
            )
        elif avg_value in {min_value, max_value} and min_value != max_value:
            numeric_flags.append(
                {
                    "table": table_name,
                    "column": column_name,
                    "issue": f"average sits on boundary ({avg_value})",
                }
            )
The first cut left the markdown and Rich-table renderers for deep findings uncovered, and the detector exception/skip paths only had indirect coverage. Adds direct tests:
- render_quality_markdown with synthesized deep data, asserting every new section and the Suggested Cleanup block render
- end-to-end CLI tests for --deep against both the Rich table and markdown output, monkey-patching build_quality_overview to return a populated deep result
- cleanup.py builder coverage for text, outlier (literal and fallback), and orphan-FK previews
- detector tests confirming every detector returns [] when SQL raises, the outlier detector short-circuits on SQLite, and orphan detection skips self-references and unmatched parents

Lifts patch coverage on cleanup.py to 100%, cli_commands/quality.py from 68% to 94%, and data_profile.py from 94% to 95%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
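The "returns [] when SQL raises" behavior covered by these tests can be illustrated with a small self-contained sketch. The detector `detect_empty_strings`, the `RunSql` type, and both fake runners are hypothetical stand-ins; the PR's real detectors and their signatures may differ.

```python
import asyncio
from typing import Any, Awaitable, Callable

RunSql = Callable[[str], Awaitable[list[dict[str, Any]]]]


async def detect_empty_strings(run_sql: RunSql, table: str, column: str) -> list[dict[str, Any]]:
    """Hypothetical detector: flag a text column that contains empty strings."""
    sql = f'SELECT COUNT(*) AS n FROM "{table}" WHERE "{column}" = ' + "''"
    try:
        rows = await run_sql(sql)
    except Exception:
        # Matches the tested behavior: a failing query yields no findings.
        return []
    count = rows[0]["n"] if rows else 0
    if not count:
        return []
    return [{"table": table, "column": column, "issue": f"{count} empty strings"}]


async def failing_run_sql(sql: str) -> list[dict[str, Any]]:
    """Fake runner that always fails, standing in for a broken connection."""
    raise RuntimeError("simulated SQL failure")


async def ok_run_sql(sql: str) -> list[dict[str, Any]]:
    """Fake runner that reports three matching rows."""
    return [{"n": 3}]


def run_detector(run_sql: RunSql) -> list[dict[str, Any]]:
    """Drive the async detector from synchronous test code."""
    return asyncio.run(detect_empty_strings(run_sql, "t", "c"))
```

Injecting the SQL runner as a parameter is what makes the failure path easy to test without a database.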
Adds a new section to the audit-data-quality how-to describing the --deep flag, each new detector, and the previewable cleanup SQL emitted alongside findings. The auto-generated CLI reference already covered the flag itself; this is the user-facing guidance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Batches per-column null and numeric scans in build_quality_overview into one query per table, replacing O(columns) roundtrips with one batched SELECT COUNT(*), COUNT(col), MIN(col), MAX(col), AVG(col), ... per table.
- Adds datasight quality --deep to run expensive detectors: whole-row duplicates, PK-shaped duplicates on id / *_id / id_* columns, text whitespace and empty-string-as-NULL flags, IQR-based numeric outliers, and orphan foreign-key-shaped values matched against parent tables with a single ID-shaped column.
- New module src/datasight/cleanup.py emits dialect-aware previewable SQL per finding (DuckDB QUALIFY, Postgres ROW_NUMBER CTE, SQLite rowid); destructive operations appear only as comments inside the preview, never auto-executed.
- Threads sql_dialect through build_quality_overview and build_audit_report from all four call sites (cli_commands/quality.py, cli_commands/inspect.py, cli_commands/audit_report.py, web/app.py). Defaults preserve legacy behavior for any external caller.
- cli_commands/quality.py and render_quality_markdown in cli.py add sections for each new finding type plus a Suggested Cleanup panel with the emitted SQL.

Test plan
- pytest -m "not integration": 1545 passed
- ruff check, ruff format --check, ty check clean
- datasight demo time-validation then datasight quality --deep correctly flags 95 planted whole-row duplicates, IQR outliers across measure columns, and emits previewable cleanup SQL

🤖 Generated with Claude Code
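For readers unfamiliar with IQR-based outlier detection, here is a minimal sketch of how fences and a preview query could be derived from the quartiles. The 1.5 multiplier (Tukey's conventional choice), the function names, and the LIMIT are assumptions; the PR text only says the detector is IQR-based.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (low, high) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


def outlier_preview_sql(table: str, column: str, q1: float, q3: float, k: float = 1.5) -> str:
    """Build a read-only preview of rows outside the fences (hypothetical)."""
    low, high = iqr_fences(q1, q3, k)
    # The fences are embedded as numeric literals; nothing is modified.
    return (
        f'SELECT * FROM "{table}" '
        f'WHERE "{column}" < {low} OR "{column}" > {high} LIMIT 100;'
    )
```

For example, with q1 = 10 and q3 = 20 the IQR is 10, so the fences are 10 - 15 = -5 and 20 + 15 = 35.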