Add --deep quality checks and previewable cleanup SQL by daniel-thom · Pull Request #73 · dsgrid/datasight

daniel-thom · 2026-05-16T02:48:05Z

Summary

Batches per-column null and numeric scans in build_quality_overview into one query per table, replacing O(columns) roundtrips with one batched SELECT COUNT(*), COUNT(col), MIN(col), MAX(col), AVG(col), ... per table.
Adds datasight quality --deep to run expensive detectors: whole-row duplicates, PK-shaped duplicates on id/*_id/id_* columns, text whitespace and empty-string-as-NULL flags, IQR-based numeric outliers, and orphan foreign-key-shaped values matched against parent tables with a single ID-shaped column.
New src/datasight/cleanup.py emits dialect-aware previewable SQL per finding (DuckDB QUALIFY, Postgres ROW_NUMBER CTE, SQLite rowid); destructive operations appear only as comments inside the preview, never auto-executed.
Threads sql_dialect through build_quality_overview and build_audit_report from all four call sites (cli_commands/quality.py, cli_commands/inspect.py, cli_commands/audit_report.py, web/app.py). Defaults preserve legacy behavior for any external caller.
CLI table renderer in cli_commands/quality.py and render_quality_markdown in cli.py add sections for each new finding type plus a Suggested Cleanup panel with the emitted SQL.
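
The batching idea in the first summary item can be sketched as follows. This is a minimal illustration, not the code in build_quality_overview: the helper names `quote_identifier` and `build_batched_stats_sql`, the `c{i}_...` alias scheme, and the `readings` table are all assumptions made for the example, and SQLite stands in for whichever backend the project targets.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def build_batched_stats_sql(table: str, columns: list[str], numeric: set[str]) -> str:
    """Build one SELECT that gathers the row count, per-column non-null counts,
    and MIN/MAX/AVG for the columns listed in `numeric`."""
    parts = ["COUNT(*) AS row_count"]
    for i, col in enumerate(columns):
        qc = quote_identifier(col)
        parts.append(f"COUNT({qc}) AS c{i}_non_null")
        if col in numeric:
            parts.append(f"MIN({qc}) AS c{i}_min")
            parts.append(f"MAX({qc}) AS c{i}_max")
            parts.append(f"AVG({qc}) AS c{i}_avg")
    return f"SELECT {', '.join(parts)} FROM {quote_identifier(table)}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, site TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, "a", 1.0), (2, None, 3.0), (3, "c", None)],
)

sql = build_batched_stats_sql("readings", ["id", "site", "value"], numeric={"value"})
row = conn.execute(sql).fetchone()  # a single roundtrip for the whole table

# Column order: row_count, id non-null, site non-null, value non-null, min, max, avg
null_counts = {"id": row[0] - row[1], "site": row[0] - row[2], "value": row[0] - row[3]}
```

Null counts fall out as `COUNT(*) - COUNT(col)`, so no per-column query is needed.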

Test plan

pytest -m "not integration" — 1545 passed
ruff check, ruff format --check, ty check clean
Smoke: datasight demo time-validation then datasight quality --deep correctly flags 95 planted whole-row duplicates, IQR outliers across measure columns, and emits previewable cleanup SQL
New tests cover: batched-scan single-query-per-table behavior, each deep detector firing on a planted-error fixture, and dialect branching in cleanup SQL builders
Manual verification on a SQLite project (outlier check skipped as designed)
Manual verification on a Postgres project

🤖 Generated with Claude Code

Batches per-column null and numeric scans in build_quality_overview into one query per table. Adds a --deep flag that runs additional detectors: whole-row duplicates, PK-shaped duplicates, text whitespace and empty-string flags, IQR-based numeric outliers, and orphan foreign-key-shaped values. Each finding carries a dialect-aware previewable cleanup SQL emitted by a new cleanup module, rendered in the CLI table and markdown outputs alongside the existing sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
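
For readers unfamiliar with IQR-based outlier detection, the rule can be sketched in plain Python. This is an illustration only: the PR computes its quartiles in SQL, and the 1.5 multiplier, the function name `iqr_fences`, and the sample data are assumptions (1.5 is the conventional Tukey fence, not a value confirmed by the PR).

```python
import statistics


def iqr_fences(values: list[float], k: float = 1.5) -> tuple[float, float]:
    """Return (low, high) fences; values outside them are candidate outliers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [10, 12, 11, 13, 12, 11, 100]
low, high = iqr_fences(data)
outliers = [v for v in data if v < low or v > high]
```

Here `outliers` contains only the planted value 100; the remaining values sit inside the fences.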

codecov · 2026-05-16T02:57:28Z

Codecov Report

❌ Patch coverage is 92.60355% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.07%. Comparing base (6f79f8a) to head (0e1e3bc).

Files with missing lines	Patch %	Lines
src/datasight/data_profile.py	88.83%	25 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   85.86%   86.07%   +0.20%     
==========================================
  Files          64       65       +1     
  Lines       12317    12639     +322     
==========================================
+ Hits        10576    10879     +303     
- Misses       1741     1760      +19


Replaces the value != value idiom (flagged by CodeQL as comparison of identical values) with math.isnan, which removes both the bare except block and the need for an inline comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
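
A minimal sketch of the replacement idiom, assuming the check only needs to handle Python floats (the helper name `is_nan` is hypothetical). Guarding with `isinstance` keeps `math.isnan` from raising `TypeError` on strings or `None`, so no try/except is required.

```python
import math
from typing import Any


def is_nan(value: Any) -> bool:
    """Return True only for a float NaN; any other input is treated as not-NaN."""
    return isinstance(value, float) and math.isnan(value)
```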

Copilot

Pull request overview

Adds an opt-in “deep” mode for data-quality checks and attaches dialect-aware, preview-only cleanup SQL to findings, while also reducing per-column scan roundtrips by batching column stats into a single per-table query.

Changes:

Refactors build_quality_overview to use a single batched stats query per table and introduces deep + sql_dialect parameters.
Implements datasight quality --deep with additional detectors (duplicates, text cleanliness, outliers, orphan FK-shaped values) and renders “Suggested Cleanup” SQL in CLI/markdown outputs.
Adds datasight.cleanup SQL preview builders and threads sql_dialect through CLI/web/audit report call sites with tests covering the new behavior.
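
The orphan check mentioned above amounts to an anti-join between a child column and its presumed parent. The sketch below shows the general shape using SQLite; the `plants` and `generation` tables are invented for the example, and the query is an assumption about the technique rather than the SQL that datasight emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (id INTEGER);
    CREATE TABLE generation (plant_id INTEGER, mwh REAL);
    INSERT INTO plants VALUES (1), (2);
    INSERT INTO generation VALUES (1, 5.0), (2, 7.5), (9, 1.0), (NULL, 2.0);
    """
)

# Anti-join: child values with no matching parent row. NULL child values are
# excluded because a missing reference is not the same thing as an orphan.
ORPHAN_SQL = """
SELECT c.plant_id, COUNT(*) AS row_count
FROM generation AS c
LEFT JOIN plants AS p ON p.id = c.plant_id
WHERE c.plant_id IS NOT NULL AND p.id IS NULL
GROUP BY c.plant_id
ORDER BY c.plant_id
"""
orphans = conn.execute(ORPHAN_SQL).fetchall()
```

With this data, only `plant_id = 9` is reported, with a count of one row.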

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File	Description
tests/test_web_app.py	Updates test fakes to accept new keyword arguments passed to `build_quality_overview`.
tests/test_data_profile_extra.py	Adds deep-mode fixtures/tests and asserts batched-scan behavior + cleanup SQL dialect branching.
src/datasight/web/app.py	Passes `state.sql_dialect` into `build_quality_overview` for web quality overview.
src/datasight/data_profile.py	Implements batched scan, deep detectors, and threads `sql_dialect/deep` through quality overview.
src/datasight/cli.py	Extends markdown renderer with deep finding sections and suggested cleanup SQL blocks.
src/datasight/cli_commands/quality.py	Adds `--deep` flag, passes dialect/deep to quality overview, and prints suggested cleanup SQL panels.
src/datasight/cli_commands/inspect.py	Passes SQL dialect into `build_quality_overview`.
src/datasight/cli_commands/audit_report.py	Passes SQL dialect into `build_audit_report`.
src/datasight/cleanup.py	New module generating previewable, dialect-aware cleanup SQL for findings.
src/datasight/audit_report.py	Threads `sql_dialect`/`deep` into audit report quality section.
docs/reference/cli.md	Documents the new `datasight quality --deep` option.

Comments suppressed due to low confidence (2)

src/datasight/cleanup.py:74

pk_dedup_preview claims to show “one canonical row per duplicate … value”, but the current queries return one row for every distinct PK value (including non-duplicates). This makes the preview misleading and can produce a huge result set; it should filter down to only PK values with COUNT(*) > 1 (e.g., via a windowed COUNT(*) OVER (PARTITION BY ...) > 1 condition or a join to a subquery of duplicate keys).

def pk_dedup_preview(table: str, pk_column: str, dialect: str) -> str:
    """Show one canonical row per duplicate PK value.

    DuckDB uses ``QUALIFY`` for a one-liner; Postgres uses a CTE with
    ``ROW_NUMBER``; SQLite falls back to ``MIN(rowid)``.
    """
    qt = _quote_identifier(table)
    qc = _quote_identifier(pk_column)
    if dialect == "duckdb":
        return (
            f"-- One canonical row per duplicate {pk_column!r} value.\n"
            f"SELECT * FROM {qt} "
            f"QUALIFY ROW_NUMBER() OVER (PARTITION BY {qc} ORDER BY {qc}) = 1;"
        )
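
One way to address the review comment above is to restrict the preview to keys that actually repeat. The sketch below uses a join against a `GROUP BY ... HAVING COUNT(*) > 1` subquery, which is portable across the three dialects; it lists every row in each duplicate group rather than a single canonical row. The names `quote_identifier` and `pk_duplicates_only_preview` and the `users` table are hypothetical, and this is not the fix adopted in the PR.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def pk_duplicates_only_preview(table: str, pk_column: str) -> str:
    """Preview only rows whose PK-shaped value occurs more than once."""
    qt = quote_identifier(table)
    qc = quote_identifier(pk_column)
    return (
        f"-- Rows whose {pk_column!r} value appears more than once.\n"
        f"SELECT t.* FROM {qt} AS t "
        f"JOIN (SELECT {qc} FROM {qt} GROUP BY {qc} HAVING COUNT(*) > 1) AS d "
        f"ON t.{qc} = d.{qc} "
        f"ORDER BY t.{qc};"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e")],
)
rows = sorted(conn.execute(pk_duplicates_only_preview("users", "id")).fetchall())
```

The non-duplicated key `2` is absent from `rows`, so the result size scales with the number of duplicated rows rather than with the table.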

src/datasight/data_profile.py:1093

_scalar_or_none currently coerces all non-null scalars to str. For aggregates used in downstream logic (min/max/avg/q1/q3), this loses type information and makes later comparisons/formatting brittle. Consider returning the original value (or a float/Decimal) and handling NaN/None without converting to string here; stringify only where the value is embedded into human-facing messages or SQL text.

def _scalar_or_none(value: Any) -> str | None:
    """Convert a SQL scalar to a stringified value or None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(value)
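
A sketch of the alternative the comment describes: keep the original type and stringify only at the display boundary. The names `scalar_or_none` and `format_scalar` and the `"n/a"` placeholder are assumptions for illustration, not the project's API.

```python
import math
from decimal import Decimal
from typing import Any


def scalar_or_none(value: Any) -> Any:
    """Return the SQL scalar unchanged, mapping NULL and NaN to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, Decimal) and value.is_nan():
        return None
    return value


def format_scalar(value: Any) -> str:
    """Stringify only where the value is shown to a person."""
    return "n/a" if value is None else str(value)
```

The type matters for comparisons: as strings, `"10" < "9"` is true, whereas the numeric values compare in the expected order.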

+    if dialect == "duckdb":
+        materialize = (
+            f"-- To materialize: CREATE OR REPLACE TABLE {qt} AS SELECT DISTINCT * FROM {qt};"
+        )
+    else:
+        materialize = (
+            f"-- To materialize: BEGIN; DROP TABLE IF EXISTS {qt}_deduped; "
+            f"CREATE TABLE {qt}_deduped AS SELECT DISTINCT * FROM {qt}; COMMIT;"


            if _is_numeric_dtype(dtype) and not _looks_like_identifier(column_name):
-                stats = await _get_numeric_stats(run_sql, table_name, column_name)
-                if stats:
-                    min_value = stats.get("min")
-                    max_value = stats.get("max")
-                    avg_value = stats.get("avg")
-                    if min_value == max_value and min_value is not None:
-                        numeric_flags.append(
-                            {
-                                "table": table_name,
-                                "column": column_name,
-                                "issue": f"constant numeric value ({min_value})",
-                            }
-                        )
-                    elif avg_value in {min_value, max_value} and min_value != max_value:
-                        numeric_flags.append(
-                            {
-                                "table": table_name,
-                                "column": column_name,
-                                "issue": f"average sits on boundary ({avg_value})",
-                            }
-                        )
+                min_value = stats.get("min")
+                max_value = stats.get("max")
+                avg_value = stats.get("avg")
+                if min_value == max_value and min_value is not None:
+                    numeric_flags.append(
+                        {
+                            "table": table_name,
+                            "column": column_name,
+                            "issue": f"constant numeric value ({min_value})",
+                        }
+                    )
+                elif avg_value in {min_value, max_value} and min_value != max_value:
+                    numeric_flags.append(


The first cut left the markdown and Rich-table renderers for deep findings uncovered, and the detector exception/skip paths only had indirect coverage. Adds direct tests:

- render_quality_markdown with synthesized deep data, asserting every new section and the Suggested Cleanup block render
- end-to-end CLI tests for --deep against both the Rich table and markdown output, monkey-patching build_quality_overview to return a populated deep result
- cleanup.py builder coverage for text, outlier (literal and fallback), and orphan-FK previews
- detector tests confirming every detector returns [] when SQL raises, the outlier detector short-circuits on SQLite, and orphan detection skips self-references and unmatched parents

Lifts patch coverage on cleanup.py to 100%, cli_commands/quality.py from 68% to 94%, and data_profile.py from 94% to 95%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
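
The "returns [] when SQL raises" tests described in this commit message might look roughly like the sketch below. Everything here is hypothetical: `detect_whole_row_duplicates` is a stand-in with an assumed signature and a placeholder query, not the detector in data_profile.py, and `failing_run_sql` is a test double invented for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable

RunSql = Callable[[str], Awaitable[list[dict[str, Any]]]]


async def detect_whole_row_duplicates(run_sql: RunSql, table: str) -> list[dict[str, Any]]:
    """Hypothetical detector stand-in: runs one query and returns [] if it fails."""
    sql = f'SELECT COUNT(*) AS n FROM "{table}"'  # placeholder query for the sketch
    try:
        rows = await run_sql(sql)
    except Exception:
        # A failing detector should not abort the rest of the quality report.
        return []
    return [{"table": table, "rows": rows}]


async def failing_run_sql(sql: str) -> list[dict[str, Any]]:
    """Test double that simulates a database error on every call."""
    raise RuntimeError("simulated database error")


def test_detector_returns_empty_when_sql_raises() -> None:
    result = asyncio.run(detect_whole_row_duplicates(failing_run_sql, "readings"))
    assert result == []
```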

Adds a new section to the audit-data-quality how-to describing the --deep flag, each new detector, and the previewable cleanup SQL emitted alongside findings. The auto-generated CLI reference already covered the flag itself; this is the user-facing guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems May 16, 2026

Comment thread src/datasight/data_profile.py Fixed

Comment thread src/datasight/data_profile.py Fixed

daniel-thom requested a review from Copilot May 16, 2026 18:39

Copilot started reviewing on behalf of daniel-thom May 16, 2026 18:39 View session

Copilot AI reviewed May 16, 2026

daniel-thom and others added 2 commits May 16, 2026 12:47

daniel-thom wants to merge 4 commits into main from feat/quality-deep-checks
