21 commits
c079b9f
feat(analytics): scaffold data_analyzer package structure
svij-sc Apr 17, 2026
3988493
feat(analytics): add DataAnalyzerConfig with YAML loading and tests
svij-sc Apr 17, 2026
cf69b38
fix(analytics): remove unused imports in config_test.py
svij-sc Apr 17, 2026
8abae4a
feat(analytics): add result type dataclasses (DegreeStats, GraphAnaly…
svij-sc Apr 17, 2026
f1c7f52
feat(analytics): add 18 SQL query templates for graph structure analysis
svij-sc Apr 17, 2026
21255d0
feat(analytics): add GraphStructureAnalyzer with 4-tier BQ validation
svij-sc Apr 17, 2026
793190c
style(analytics): apply black formatter to test files
svij-sc Apr 17, 2026
0b01b5c
feat(analytics): add report SPEC.md and initial AI-owned HTML/JS/CSS …
svij-sc Apr 17, 2026
28503d9
feat(analytics): add ReportGenerator with snapshot test
svij-sc Apr 17, 2026
018e35e
feat(analytics): add DataAnalyzer orchestrator with CLI entry point
svij-sc Apr 17, 2026
42f8d78
feat(analytics): add FeatureProfiler stub (TFDV/Dataflow integration …
svij-sc Apr 17, 2026
56eb170
fix(analytics): cast OmegaConf.to_object result in config_test
svij-sc Apr 17, 2026
7f387f6
style(analytics): apply isort and mdformat to data_analyzer files
svij-sc Apr 17, 2026
14df2b8
docs(analytics): add PRD.md for HTML report (product intent)
svij-sc Apr 17, 2026
5e166fa
docs(analytics): add BQ Data Analyzer design docs, literature review,…
svij-sc Apr 17, 2026
d3f1eb8
delete plans
svij-sc Apr 18, 2026
c2c05e2
feat(analytics): write the HTML report to disk or GCS from the orches…
svij-sc Apr 20, 2026
e67eeac
docs(analytics): add practitioner README for the analytics module
svij-sc Apr 20, 2026
40f379a
fix(analytics): address code-reviewer feedback on practitioner README
svij-sc Apr 20, 2026
826c893
tfdv
svij-sc Apr 21, 2026
99cbc3b
fix
svij-sc May 6, 2026
225 changes: 225 additions & 0 deletions gigl/analytics/README.md
@@ -0,0 +1,225 @@
# GiGL Analytics

Pre-training graph data validation and analysis tooling. Use this module before committing to a GNN training run to
catch data quality and structural issues that silently degrade model quality.

Two subpackages:

- [`data_analyzer/`](data_analyzer/) — end-to-end `DataAnalyzer` that runs BigQuery checks and produces a single
self-contained HTML report. **Start here.**
- [`graph_validation/`](graph_validation/) — lightweight standalone validators (currently: `BQGraphValidator` for
dangling-edge checks). Use when you only need one check and not the full report.

## Quickstart

**Prerequisites.** Follow the [GiGL installation guide](../../docs/user_guide/getting_started/installation.md) so that
`uv` and GiGL's Python dependencies are available. Then authenticate to BigQuery:

```bash
gcloud auth application-default login
```

**1. Write a YAML config.** Save as `my_analyzer_config.yaml`:

```yaml
node_tables:
- bq_table: "your-project.your_dataset.user_nodes"
node_type: "user"
id_column: "user_id"
feature_columns: ["age", "country"] # optional; omit to auto-infer all non-ID, TFDV-compatible columns from the BQ schema
# label_column: "label" # optional; enables Tier 3 label checks

edge_tables:
- bq_table: "your-project.your_dataset.user_edges"
edge_type: "follows"
src_id_column: "src_user_id"
dst_id_column: "dst_user_id"

# Where to write the HTML report. Local path for quick iteration, or a gs:// URI.
output_gcs_path: "/tmp/my_analysis/"

# Optional: sizing for the neighbor-explosion estimate (fan-out per GNN layer).
fan_out: [15, 10, 5]
```

**2. Run the analyzer.**

```bash
uv run python -m gigl.analytics.data_analyzer \
--analyzer_config_uri my_analyzer_config.yaml
```

**3. Open the report.** When the run completes:

```
[INFO] Report written to /tmp/my_analysis/report.html
```

Open the file in any browser. No server, no external dependencies, fully offline.

## What it checks

The analyzer organizes checks into four tiers. Tiers 1 and 2 always run; Tier 3 auto-enables when your config supports
it; Tier 4 is opt-in.

| Tier | When | What it checks |
| ---------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Hard fails** | Always | Dangling edges (NULL src/dst), referential integrity (edges pointing to nodes not in the node table), duplicate nodes. Raises `DataQualityError` — the report still renders to show partial results. |
| **2. Core metrics** | Always | Node/edge counts, degree distribution (in/out) with percentiles, degree buckets, top-K hubs, super-hub int16 clamp count, cold-start node count, self-loops, duplicate edges, NULL rates per column, feature memory budget estimate, neighbor-explosion estimate (requires `fan_out`). |
| **3. Label + heterogeneous** | Auto when `label_column` is set on any node table, or when multiple edge types exist | Class imbalance, label coverage, edge type distribution, per-edge-type node coverage. |
| **4. Advanced** | Opt-in via config flags | Power-law exponent (implemented as a degree-stats approximation). Reciprocity, homophily, connected components, clustering coefficient are **not yet implemented** — the flags are accepted but currently no-op. |
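The Tier 1 hard fails are plain set logic over the node and edge tables. A minimal in-memory sketch of the three checks — illustrative only, since the real checks run as BigQuery SQL and the variable names here are not the module's API:

```python
# Toy rows: node ids, and (src, dst) edge pairs; None stands in for a BQ NULL.
nodes = ["u1", "u2", "u3", "u2"]  # "u2" appears twice -> duplicate node
edges = [("u1", "u2"), ("u1", None), ("u3", "u9")]

node_set = {n for n in nodes if n is not None}

# Dangling edges: NULL src or dst.
dangling = [e for e in edges if e[0] is None or e[1] is None]

# Referential integrity: both endpoints must exist in the node table.
broken_refs = [
    (s, d)
    for s, d in edges
    if s is not None and d is not None and (s not in node_set or d not in node_set)
]

# Duplicate nodes: any id seen more than once.
duplicates = {n for n in nodes if nodes.count(n) > 1}

print(dangling)     # [('u1', None)]
print(broken_refs)  # [('u3', 'u9')]
print(duplicates)   # {'u2'}
```

Any non-empty result in the real pipeline raises `DataQualityError`, while the report still renders partial results.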

The thresholds below come from a review of production GNN papers (PinSage, BLADE, LiGNN, TwHIN, AliGraph, GraphSMOTE,
Beyond Homophily, Feature Propagation, and others). See the inline citations in the threshold table for what each paper
contributes.

## Feature profiling

In addition to the structural checks above, the analyzer runs
[TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) on every node and edge table and embeds the
resulting Facets HTML report in the final output.

- **Auto-inference.** By default, the profiler reads the BQ table schema and profiles every non-ID column whose type is
TFDV-compatible — scalars `STRING`, `INT64`, `FLOAT64`, `NUMERIC`, `BIGNUMERIC`, `BOOL`. Temporal types (`DATE`,
`DATETIME`, `TIMESTAMP`, `TIME`) and complex types (`RECORD`, `GEOGRAPHY`, `JSON`, `BYTES`) are not supported by TFDV
and are skipped with an info log.
- **Embedding columns.** `REPEATED` `FLOAT64` / `FLOAT` / `NUMERIC` / `BIGNUMERIC` columns are treated as embedding
vectors. Each expands in the Beam `SELECT` into four scalar hygiene companions — `<col>_len`, `<col>_has_nan`,
`<col>_has_inf`, `<col>_is_all_zero` — which are profiled by TFDV like any other scalar. Other REPEATED types
(`STRING` / `INT64` arrays, etc.) are skipped.
- **Embedding diagnostics.** After the TFDV pipelines finish, one BigQuery aggregate per embedding column computes
`total`, `unique_count`, `unique_ratio`, and top-K most-frequent hash clusters (via
`FARM_FINGERPRINT(TO_JSON_STRING(<col>))`). Results land in `FeatureProfileResult.embedding_diagnostics` and render as
a dedicated "Embedding Diagnostics" section in the report.
- **Explicit override.** Setting `feature_columns` in the YAML narrows the projection to those columns (still honoring
embedding expansion for REPEATED FLOAT families). Use this to scope down to a handful of columns, or to exclude PII /
expensive fields.
- **Join keys are excluded.** `id_column` on nodes and `src_id_column` / `dst_id_column` on edges are always dropped
from the auto-inferred list. `label_column` and `timestamp_column` are kept (profiling class balance / temporal
sparsity is useful).
- **Cost.** One Dataflow job is launched per table, so a config with many tables translates to many concurrent Dataflow
runs. During iteration, pass `--only structure` to skip the profiler entirely. Run `--only feature` (or the default
`both`) once the config is stable.
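The four hygiene companions are simple per-row aggregates. A pure-Python sketch of what each expanded scalar contains for one embedding value — the real expansion happens in SQL inside the Beam projection, and the function name here is illustrative, not part of the profiler's API:

```python
import math


def hygiene_companions(col: str, vec: list[float]) -> dict:
    """Mirror the <col>_len / _has_nan / _has_inf / _is_all_zero expansion for one row."""
    return {
        f"{col}_len": len(vec),
        f"{col}_has_nan": any(math.isnan(v) for v in vec),
        f"{col}_has_inf": any(math.isinf(v) for v in vec),
        f"{col}_is_all_zero": all(v == 0.0 for v in vec),
    }


row = hygiene_companions("embedding", [0.0, float("nan"), 1.5])
# row: {'embedding_len': 3, 'embedding_has_nan': True,
#       'embedding_has_inf': False, 'embedding_is_all_zero': False}
```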

## Machine-readable outputs

Alongside `report.html`, each analyzer run writes versioned Pydantic JSON sidecars under `output_gcs_path/`:

- `graph_structure.json` — the `GraphAnalysisResult` payload from `GraphStructureAnalyzer`. Written on success and also
on a Tier 1 `DataQualityError` (partial result) so failures are still recoverable.
- `feature_profile.json` — the `FeatureProfileResult` payload (facets URIs, TFDV stats URIs, embedding diagnostics).

Each sidecar wraps its payload in an envelope: `{schema_version, component, generated_at, data}`. Load one with
`gigl.analytics.data_analyzer.types.load_artifact(path, expected_component=...)`. Schema changes are additive-only at
`schema_version="1"`; breaking changes bump the version.
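When gigl itself isn't importable in the consuming environment, the envelope can be unwrapped by hand. A self-contained sketch mirroring the documented envelope shape — `load_artifact` is the supported path, this helper is not part of the module's API, and the `"graph_structure"` component string is an assumption for illustration:

```python
import json


def read_sidecar(text: str, expected_component: str) -> dict:
    """Unwrap the {schema_version, component, generated_at, data} envelope."""
    envelope = json.loads(text)
    if envelope["component"] != expected_component:
        raise ValueError(
            f"expected component {expected_component!r}, got {envelope['component']!r}"
        )
    if envelope["schema_version"] != "1":
        raise ValueError(f"unknown schema_version {envelope['schema_version']!r}")
    return envelope["data"]


# A fabricated sidecar payload, just to exercise the unwrapping logic:
sidecar = json.dumps({
    "schema_version": "1",
    "component": "graph_structure",
    "generated_at": "2026-04-20T00:00:00Z",
    "data": {"node_count": 42},
})
data = read_sidecar(sidecar, expected_component="graph_structure")
print(data)  # {'node_count': 42}
```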

## Interpreting the report

The report color-codes every numeric finding. Summary of the most important thresholds:

| Metric | Green | Yellow | Red | What to do when yellow/red |
| -------------------------------------------------------- | ----- | ---------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Dangling edges / referential integrity / duplicate nodes | 0 | — | any > 0 | Fix the input tables. Training will fail or silently corrupt otherwise. |
| Feature missing rate | < 10% | 10–50% | > 90% | Plan an imputation strategy; above ~95% the Feature Propagation phase transition (Rossi et al., ICLR 2022) hits and GNNs stop recovering signal reliably. |
| Isolated node fraction | < 1% | 1–5% | > 5% | Filter isolated nodes or densify (LiGNN, KDD 2024) for cold-start cohorts. |
| Cold-start fraction (degree 0–1) | < 5% | 5–10% | > 10% | Candidates for graph densification; also flag for special handling at serving time. |
| Super-hub int16 clamp (degree > 32,767) | 0 | — | any > 0 | GiGL silently truncates super-hub degrees in `gigl/distributed/utils/degree.py`. Either cap the hub's edges upstream or plan to address the clamp. |
| Degree p99 / median | < 50 | 50–100 | > 100 | Use importance sampling (PinSage, KDD 2018) or degree-adaptive neighborhoods (BLADE, WSDM 2023) — degree skew is the single biggest lever in production GNNs. |
| Class imbalance ratio | < 1:5 | 1:5 – 1:10 | > 1:10 | Message passing amplifies label imbalance 2–3× in representation space (GraphSMOTE, WSDM 2021). Consider resampling or GraphSMOTE-style synthetic nodes. |
| Edge homophily (Tier 4, future) | > 0.7 | 0.3 – 0.7 | < 0.3 | Standard GCN/GAT fail at low h (Zhu et al., NeurIPS 2020). Consider H2GCN-style architectures; below h ≈ 0.2 a plain MLP often wins. |
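These thresholds can also be applied programmatically, e.g. against the `graph_structure.json` sidecar in a CI gate. A minimal sketch for two of the rows — the function names are illustrative, not the report's internals:

```python
def degree_skew_status(p99: float, median: float) -> str:
    """Degree p99/median per the table: <50 green, 50-100 yellow, >100 red."""
    ratio = p99 / max(median, 1.0)
    if ratio < 50:
        return "green"
    return "yellow" if ratio <= 100 else "red"


def imbalance_status(majority: int, minority: int) -> str:
    """Class imbalance per the table: <1:5 green, 1:5-1:10 yellow, >1:10 red."""
    ratio = majority / max(minority, 1)
    if ratio < 5:
        return "green"
    return "yellow" if ratio <= 10 else "red"


print(degree_skew_status(p99=480.0, median=3.0))       # red (ratio 160)
print(imbalance_status(majority=9000, minority=1200))  # yellow (ratio 7.5)
```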

## Advanced config

Optional YAML keys beyond the minimal quickstart:

```yaml
# Enable Tier 3 class-imbalance + label-coverage checks for a node type:
node_tables:
- bq_table: ...
label_column: "label"

# Neighbor explosion estimation — the fan-out per GNN layer you plan to train with:
fan_out: [15, 10, 5]

# Tier 4 opt-in flags. Default false.
# NOTE: Only `compute_reciprocity` is wired into the analyzer today and it logs a
# warning rather than computing a result. The other three flags are placeholders
# for future work (see "Scope and limitations" below).
compute_reciprocity: true
compute_homophily: true
compute_connected_components: true
compute_clustering: true

# Per-edge-type timestamp hint. NOTE: accepted by the config schema but not yet
# consumed by any Tier 4 query (temporal freshness check is planned).
edge_tables:
- bq_table: ...
timestamp_column: "created_at"
```
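The `fan_out` key drives the neighbor-explosion estimate: the sampled neighborhood per seed grows multiplicatively per layer. A quick back-of-envelope sketch, assuming full fan-out at every hop (the worst case such an estimate guards against — the analyzer's actual formula may differ):

```python
def neighborhood_size(fan_out: list[int]) -> int:
    """Worst-case sampled nodes per seed: the seed plus cumulative products of fan-outs."""
    total, layer = 1, 1  # start with the seed node itself
    for f in fan_out:
        layer *= f       # nodes newly reached at this hop
        total += layer
    return total


print(neighborhood_size([15, 10, 5]))  # 916 = 1 + 15 + 150 + 750
```

With `fan_out: [15, 10, 5]`, each seed can pull in up to ~916 nodes' worth of features per training example, which is why the estimate is paired with the feature memory budget.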

## Python API

The CLI wraps a regular class. Call from your own code when you want programmatic access to the `GraphAnalysisResult`:

```python
from gigl.analytics.data_analyzer import DataAnalyzer
from gigl.analytics.data_analyzer.config import load_analyzer_config

config = load_analyzer_config("my_analyzer_config.yaml")
analyzer = DataAnalyzer()
report_path = analyzer.run(config=config)
# report_path points to the written report.html (local path or gs:// URI)
```

The underlying `GraphStructureAnalyzer` is also callable directly if you want the raw result dataclass and no HTML:

```python
from gigl.analytics.data_analyzer.graph_structure_analyzer import GraphStructureAnalyzer

result = GraphStructureAnalyzer().analyze(config)
print(result.degree_stats)
```

See a rendered report example at
[`tests/test_assets/analytics/golden_report.html`](../../tests/test_assets/analytics/golden_report.html) to preview the
output format before authenticating to BQ.

## graph_validation

One-off validators for the subset of cases where the full analyzer is overkill. Today the only check is dangling-edge
detection:

```python
from gigl.analytics.graph_validation import BQGraphValidator

has_dangling = BQGraphValidator.does_edge_table_have_dangling_edges(
edge_table="your-project.your_dataset.user_edges",
src_node_column_name="src_user_id",
dst_node_column_name="dst_user_id",
)
```

The `DataAnalyzer` runs this check (and many more) as part of Tier 1, so prefer the full analyzer unless you
specifically need a one-line gate (e.g., inside an Airflow task or a preprocessing job). This subpackage is the intended
home for additional standalone validators in the future.

## Scope and limitations

Current implementation status:

- **Tier 4 checks are partial.** Power-law exponent is computed as a degree-stats approximation. Reciprocity, homophily,
connected components, and clustering coefficient config flags are accepted but currently no-op. The `timestamp_column`
edge field is accepted but no temporal-freshness query runs yet.
- **Heterogeneous graphs: referential integrity caveat.** For each edge table, the referential-integrity check joins
against `config.node_tables[0]`. On heterogeneous graphs where different edges reference different node types, the
current implementation will under-report integrity violations — fix is tracked for a follow-up.
- **GCS upload** works via `GcsUtils.upload_from_string` when `output_gcs_path` is a `gs://` URI, and falls back to
local filesystem write otherwise.

## Related documents

Within this module:

- [`data_analyzer/report/PRD.md`](data_analyzer/report/PRD.md) — product intent for the HTML report (AI-owned)
- [`data_analyzer/report/SPEC.md`](data_analyzer/report/SPEC.md) — technical contract for the AI-owned HTML/JS/CSS
assets
10 changes: 10 additions & 0 deletions gigl/analytics/data_analyzer/__init__.py
@@ -0,0 +1,10 @@
"""
BQ Data Analyzer for pre-training graph data analysis.

Produces a single HTML report covering data quality, feature distributions,
and graph structure metrics from BigQuery node/edge tables.
"""

from gigl.analytics.data_analyzer.data_analyzer import DataAnalyzer

__all__ = ["DataAnalyzer"]
6 changes: 6 additions & 0 deletions gigl/analytics/data_analyzer/__main__.py
@@ -0,0 +1,6 @@
"""Entry point for running the BQ Data Analyzer as a module: python -m gigl.analytics.data_analyzer."""

from gigl.analytics.data_analyzer.data_analyzer import main

if __name__ == "__main__":
main()