Add --deep quality checks and previewable cleanup SQL #73

Open
daniel-thom wants to merge 4 commits into main from feat/quality-deep-checks

@daniel-thom
Contributor

Summary

  • Batches the per-column null and numeric scans in build_quality_overview into one query per table: O(columns) roundtrips become a single SELECT COUNT(*), COUNT(col), MIN(col), MAX(col), AVG(col), ... per table.
  • Adds datasight quality --deep to run expensive detectors: whole-row duplicates, PK-shaped duplicates on id/*_id/id_* columns, text whitespace and empty-string-as-NULL flags, IQR-based numeric outliers, and orphan foreign-key-shaped values matched against parent tables with a single ID-shaped column.
  • New src/datasight/cleanup.py emits dialect-aware previewable SQL per finding (DuckDB QUALIFY, Postgres ROW_NUMBER CTE, SQLite rowid); destructive operations appear only as comments inside the preview, never auto-executed.
  • Threads sql_dialect through build_quality_overview and build_audit_report from all four call sites (cli_commands/quality.py, cli_commands/inspect.py, cli_commands/audit_report.py, web/app.py). Defaults preserve legacy behavior for any external caller.
  • CLI table renderer in cli_commands/quality.py and render_quality_markdown in cli.py add sections for each new finding type plus a Suggested Cleanup panel with the emitted SQL.
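
To illustrate the batching idea from the first bullet, here is a minimal sketch run against an in-memory SQLite database. The names build_batched_stats_sql, run_batched_stats, quote_identifier, and the readings table are hypothetical and are not taken from the PR; the actual query in build_quality_overview may be shaped differently.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes (illustrative helper)."""
    return '"' + name.replace('"', '""') + '"'


def build_batched_stats_sql(table: str, columns: list[str]) -> str:
    """Return one SELECT that gathers the row count plus per-column aggregates."""
    parts = ["COUNT(*) AS row_count"]
    for i, col in enumerate(columns):
        qc = quote_identifier(col)
        parts += [
            f"COUNT({qc}) AS c{i}_non_null",  # COUNT(col) skips NULLs
            f"MIN({qc}) AS c{i}_min",
            f"MAX({qc}) AS c{i}_max",
            f"AVG({qc}) AS c{i}_avg",
        ]
    return f"SELECT {', '.join(parts)} FROM {quote_identifier(table)}"


def run_batched_stats(conn: sqlite3.Connection, table: str, columns: list[str]):
    """Run the single batched query and unpack it into per-column stats."""
    row = conn.execute(build_batched_stats_sql(table, columns)).fetchone()
    row_count = row[0]
    stats = {}
    for i, col in enumerate(columns):
        # Each column contributes four aggregates after the leading row count.
        non_null, lo, hi, avg = row[1 + 4 * i : 5 + 4 * i]
        stats[col] = {
            "null_count": row_count - non_null,
            "min": lo,
            "max": hi,
            "avg": avg,
        }
    return row_count, stats


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site_id INTEGER, kwh REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)", [(1, 2.0), (2, None), (3, 4.0)]
)
row_count, stats = run_batched_stats(conn, "readings", ["site_id", "kwh"])
```

The null count falls out of COUNT(*) minus COUNT(col), so one roundtrip covers both the null scan and the numeric scan for every column in the table.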

Test plan

  • pytest -m "not integration" — 1545 passed
  • ruff check, ruff format --check, ty check clean
  • Smoke: datasight demo time-validation then datasight quality --deep correctly flags 95 planted whole-row duplicates, IQR outliers across measure columns, and emits previewable cleanup SQL
  • New tests cover: batched-scan single-query-per-table behavior, each deep detector firing on a planted-error fixture, and dialect branching in cleanup SQL builders
  • Manual verification on a SQLite project (outlier check skipped as designed)
  • Manual verification on a Postgres project

🤖 Generated with Claude Code

Batches per-column null and numeric scans in build_quality_overview into
one query per table. Adds a --deep flag that runs additional detectors:
whole-row duplicates, PK-shaped duplicates, text whitespace and empty-
string flags, IQR-based numeric outliers, and orphan foreign-key-shaped
values. Each finding carries a dialect-aware previewable cleanup SQL
emitted by a new cleanup module, rendered in the CLI table and markdown
outputs alongside the existing sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
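
As a rough illustration of the IQR-based outlier rule named in this commit, the sketch below computes the fences in plain Python. It is not the PR's detector, which works through SQL; the function names are hypothetical, and the 1.5 multiplier is the conventional Tukey value, assumed here because the PR text does not state one.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the non-null values that fall outside the IQR fences."""
    clean = [v for v in values if v is not None]
    if len(clean) < 4:
        # Too few points for quartiles to be meaningful; report nothing.
        return []
    low, high = iqr_fences(clean, k)
    return [v for v in clean if v < low or v > high]


sample = [10, 11, 12, 12, 13, 14, 15, 100, None]
fences = iqr_fences([v for v in sample if v is not None])
outliers = iqr_outliers(sample)
```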
Comment thread src/datasight/data_profile.py Fixed
Comment thread src/datasight/data_profile.py Fixed
@codecov

codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 92.60355% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.07%. Comparing base (6f79f8a) to head (0e1e3bc).

Files with missing lines | Patch % | Lines
src/datasight/data_profile.py | 88.83% | 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
+ Coverage   85.86%   86.07%   +0.20%     
==========================================
  Files          64       65       +1     
  Lines       12317    12639     +322     
==========================================
+ Hits        10576    10879     +303     
- Misses       1741     1760      +19     

Replaces the value != value idiom (flagged by CodeQL as comparison of
identical values) with math.isnan, which removes both the bare except
block and the need for an inline comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
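
For context on the change this commit describes, the sketch below contrasts the two NaN checks. The function names are hypothetical and the bodies are illustrative, not the code in data_profile.py.

```python
import math


def is_nan_self_compare(value):
    """Old idiom: NaN is the only float that does not equal itself."""
    return value != value


def is_nan_explicit(value):
    """Explicit check: only floats can be NaN, so guard the type first."""
    return isinstance(value, float) and math.isnan(value)


nan = float("nan")
checks = [
    (is_nan_self_compare(nan), is_nan_explicit(nan)),
    (is_nan_self_compare(1.5), is_nan_explicit(1.5)),
    (is_nan_self_compare("text"), is_nan_explicit("text")),
]
```

The isinstance guard means math.isnan is never called on a non-numeric value, so no exception handling is needed around it.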

Copilot AI left a comment

Pull request overview

Adds an opt-in “deep” mode for data-quality checks and attaches dialect-aware, preview-only cleanup SQL to findings, while also reducing per-column scan roundtrips by batching column stats into a single per-table query.

Changes:

  • Refactors build_quality_overview to use a single batched stats query per table and introduces deep + sql_dialect parameters.
  • Implements datasight quality --deep with additional detectors (duplicates, text cleanliness, outliers, orphan FK-shaped values) and renders “Suggested Cleanup” SQL in CLI/markdown outputs.
  • Adds datasight.cleanup SQL preview builders and threads sql_dialect through CLI/web/audit report call sites with tests covering the new behavior.
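
One common way to find orphan foreign-key-shaped values is an anti-join against the candidate parent table. The sketch below shows that pattern on SQLite. It is an assumption about the general technique, not the PR's detector; orphan_values_sql and the plants/readings tables are hypothetical.

```python
import sqlite3


def orphan_values_sql(child: str, fk_col: str, parent: str, parent_col: str) -> str:
    """Build an anti-join listing child values with no matching parent row."""
    return (
        f'SELECT c."{fk_col}" AS orphan_value, COUNT(*) AS row_count '
        f'FROM "{child}" AS c '
        f'LEFT JOIN "{parent}" AS p ON c."{fk_col}" = p."{parent_col}" '
        # NULL child values are not orphans; unmatched rows have a NULL parent key.
        f'WHERE c."{fk_col}" IS NOT NULL AND p."{parent_col}" IS NULL '
        f'GROUP BY c."{fk_col}" ORDER BY c."{fk_col}"'
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (id INTEGER)")
conn.execute("CREATE TABLE readings (plant_id INTEGER)")
conn.executemany("INSERT INTO plants VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO readings VALUES (?)", [(1,), (2,), (3,), (3,), (None,)]
)
orphans = conn.execute(
    orphan_values_sql("readings", "plant_id", "plants", "id")
).fetchall()
```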

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

File | Description
tests/test_web_app.py | Updates test fakes to accept new keyword arguments passed to build_quality_overview.
tests/test_data_profile_extra.py | Adds deep-mode fixtures/tests and asserts batched-scan behavior + cleanup SQL dialect branching.
src/datasight/web/app.py | Passes state.sql_dialect into build_quality_overview for web quality overview.
src/datasight/data_profile.py | Implements batched scan, deep detectors, and threads sql_dialect/deep through quality overview.
src/datasight/cli.py | Extends markdown renderer with deep finding sections and suggested cleanup SQL blocks.
src/datasight/cli_commands/quality.py | Adds --deep flag, passes dialect/deep to quality overview, and prints suggested cleanup SQL panels.
src/datasight/cli_commands/inspect.py | Passes SQL dialect into build_quality_overview.
src/datasight/cli_commands/audit_report.py | Passes SQL dialect into build_audit_report.
src/datasight/cleanup.py | New module generating previewable, dialect-aware cleanup SQL for findings.
src/datasight/audit_report.py | Threads sql_dialect/deep into audit report quality section.
docs/reference/cli.md | Documents the new datasight quality --deep option.
Comments suppressed due to low confidence (2)

src/datasight/cleanup.py:74

  • pk_dedup_preview claims to show “one canonical row per duplicate … value”, but the current queries return one row for every distinct PK value (including non-duplicates). This makes the preview misleading and can produce a huge result set; it should filter down to only PK values with COUNT(*) > 1 (e.g., via a windowed COUNT(*) OVER (PARTITION BY ...) > 1 condition or a join to a subquery of duplicate keys).
def pk_dedup_preview(table: str, pk_column: str, dialect: str) -> str:
    """Show one canonical row per duplicate PK value.

    DuckDB uses ``QUALIFY`` for a one-liner; Postgres uses a CTE with
    ``ROW_NUMBER``; SQLite falls back to ``MIN(rowid)``.
    """
    qt = _quote_identifier(table)
    qc = _quote_identifier(pk_column)
    if dialect == "duckdb":
        return (
            f"-- One canonical row per duplicate {pk_column!r} value.\n"
            f"SELECT * FROM {qt} "
            f"QUALIFY ROW_NUMBER() OVER (PARTITION BY {qc} ORDER BY {qc}) = 1;"
        )
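
The windowed COUNT(*) filter suggested in this comment could look roughly like the sketch below. It uses a subquery instead of QUALIFY so it can be run on SQLite (3.25 or newer, for window functions). The function name, the quote_identifier helper, and the orders table are hypothetical; this is not the fix applied in the PR.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes (illustrative helper)."""
    return '"' + name.replace('"', '""') + '"'


def pk_duplicates_only_preview(table: str, pk_column: str) -> str:
    """Show one row per PK value, restricted to values that occur more than once."""
    qt = quote_identifier(table)
    qc = quote_identifier(pk_column)
    return (
        "SELECT * FROM ("
        f"SELECT t.*, "
        f"ROW_NUMBER() OVER (PARTITION BY {qc} ORDER BY {qc}) AS rn, "
        f"COUNT(*) OVER (PARTITION BY {qc}) AS dup_count "
        f"FROM {qt} AS t"
        # rn = 1 keeps one row per key; dup_count > 1 drops unique keys.
        f") AS ranked WHERE rn = 1 AND dup_count > 1 ORDER BY {qc}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")],
)
rows = conn.execute(pk_duplicates_only_preview("orders", "id")).fetchall()
# Columns are (id, note, rn, dup_count); which note survives is arbitrary.
dup_summary = [(r[0], r[3]) for r in rows]
```

Because the ORDER BY inside the window only repeats the partition key, the surviving row within each group is not deterministic; a real preview would likely need a tiebreaker column.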

src/datasight/data_profile.py:1093

  • _scalar_or_none currently coerces all non-null scalars to str. For aggregates used in downstream logic (min/max/avg/q1/q3), this loses type information and makes later comparisons/formatting brittle. Consider returning the original value (or a float/Decimal) and handling NaN/None without converting to string here; stringify only where the value is embedded into human-facing messages or SQL text.
def _scalar_or_none(value: Any) -> str | None:
    """Convert a SQL scalar to a stringified value or None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return str(value)
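
A type-preserving variant along the lines this comment proposes might look like the sketch below. The names scalar_or_none and format_scalar are hypothetical, and the Decimal handling is an assumption about what a driver might return.

```python
import math
from decimal import Decimal
from typing import Any


def scalar_or_none(value: Any) -> Any:
    """Normalize NULL and NaN to None, otherwise return the value unchanged."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, Decimal) and value.is_nan():
        return None
    return value


def format_scalar(value: Any) -> str:
    """Stringify only at the point where the value is shown to a human."""
    return "NULL" if value is None else str(value)


min_value = scalar_or_none(9)
max_value = scalar_or_none(10)
numeric_order_ok = min_value < max_value  # numeric comparison
string_order_ok = str(min_value) < str(max_value)  # lexicographic comparison
message = f"range {format_scalar(min_value)}..{format_scalar(max_value)}"
```

The last comparison shows the brittleness the comment refers to: as strings, "9" sorts after "10", so min/max logic on stringified aggregates can invert.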

Comment thread src/datasight/cleanup.py
Comment on lines +49 to +56
    if dialect == "duckdb":
        materialize = (
            f"-- To materialize: CREATE OR REPLACE TABLE {qt} AS SELECT DISTINCT * FROM {qt};"
        )
    else:
        materialize = (
            f"-- To materialize: BEGIN; DROP TABLE IF EXISTS {qt}_deduped; "
            f"CREATE TABLE {qt}_deduped AS SELECT DISTINCT * FROM {qt}; COMMIT;"
Comment on lines 904 to +917
  if _is_numeric_dtype(dtype) and not _looks_like_identifier(column_name):
-     stats = await _get_numeric_stats(run_sql, table_name, column_name)
-     if stats:
-         min_value = stats.get("min")
-         max_value = stats.get("max")
-         avg_value = stats.get("avg")
-         if min_value == max_value and min_value is not None:
-             numeric_flags.append(
-                 {
-                     "table": table_name,
-                     "column": column_name,
-                     "issue": f"constant numeric value ({min_value})",
-                 }
-             )
-         elif avg_value in {min_value, max_value} and min_value != max_value:
-             numeric_flags.append(
-                 {
-                     "table": table_name,
-                     "column": column_name,
-                     "issue": f"average sits on boundary ({avg_value})",
-                 }
-             )
+     min_value = stats.get("min")
+     max_value = stats.get("max")
+     avg_value = stats.get("avg")
+     if min_value == max_value and min_value is not None:
+         numeric_flags.append(
+             {
+                 "table": table_name,
+                 "column": column_name,
+                 "issue": f"constant numeric value ({min_value})",
+             }
+         )
+     elif avg_value in {min_value, max_value} and min_value != max_value:
+         numeric_flags.append(
daniel-thom and others added 2 commits May 16, 2026 12:47
The first cut left the markdown and Rich-table renderers for deep
findings uncovered, and the detector exception/skip paths only had
indirect coverage. Adds direct tests:

- render_quality_markdown with synthesized deep data, asserting every
  new section and the Suggested Cleanup block render
- end-to-end CLI tests for --deep against both the Rich table and
  markdown output, monkey-patching build_quality_overview to return a
  populated deep result
- cleanup.py builder coverage for text, outlier (literal and fallback),
  and orphan-FK previews
- detector tests confirming every detector returns [] when SQL raises,
  the outlier detector short-circuits on SQLite, and orphan detection
  skips self-references and unmatched parents

Lifts patch coverage on cleanup.py to 100%, cli_commands/quality.py
from 68% to 94%, and data_profile.py from 94% to 95%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a new section to the audit-data-quality how-to describing the
--deep flag, each new detector, and the previewable cleanup SQL
emitted alongside findings. The auto-generated CLI reference already
covered the flag itself; this is the user-facing guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 participants