hypertopos

A behavioral feature layer for graph and temporal data — turning behavior into coordinates, trajectories, and explanations.

Python 3.12+ · License: BSL 1.1 · DOI · PyArrow · Lance · MCP · Version

hypertopos is not a database, and not a machine learning model. It is a layer that turns relational data into a coordinate system where every entity gets a position derived from its relationships and the population around it.

    typical:    data → features → feature store → ML → decision
    hypertopos: data → representation (hypertopos) → ML / decision

Install:

    pip install hypertopos

[Figure: hypertopos overview]

How it works

You describe your data in YAML — entity types, sources, relationships. hypertopos computes population statistics and produces a sphere: pre-computed geometry stored in Apache Arrow format.

Agents (or Python code) open the sphere and navigate it using twelve primitives that cover movement, clustering, anomaly detection, population comparison, and temporal analysis. Each step is stateful — where you are determines what you see next.

For the full picture: Introduction · Core Concepts · Quick Start

What's different

Each capability below emerges from treating entities as points in a shared, population-calibrated space.

| Capability | What it does | Compared to | Since |
| --- | --- | --- | --- |
| Population-relative coordinates | delta = (shape - mu) / sigma; one coordinate for anomaly, clustering, drift | node2vec/GNN: latent dims, retraining on shift | 0.1.0 |
| Self-calibrating threshold | theta = percentile(norms, 95); no tuning, no labels | PyOD: choose contamination rate | 0.1.0 |
| Named dimension attribution | explain_anomaly → loan_count: +3.2σ (42%). Sums to 100% | SHAP/LIME: approximate, model-dependent | 0.1.0 |
| Temporal deformation | Append-only delta log. Displacement, path length, directionality | Time-series DBs: metric values, not trajectories | 0.1.0 |
| Stateful navigation | 12 typed primitives. Position type constrains valid ops | SQL/GraphQL: stateless queries | 0.1.0 |
| Cross-sphere comparison | ‖delta‖ is dimensionless; 4.2σ means the same in any domain | Requires shared features or joint embeddings | 0.1.0 |
| Counterfactual simulation | simulate_edges recomputes delta against fixed baseline | Causal inference: explicit DAG required | 0.1.0 |
| Regime change detection | Per-bucket centroids, self-calibrating shift threshold | Evidently/NannyML: model prediction drift | 0.1.0 |
| Graph contagion | Mean ‖delta‖ of neighbors. Cohen's d vs control group | PageRank: topology, not behavioral propagation | 0.2.0 |
| Witness cohort | Similar deltas, NOT connected. Validates pattern vs one-off | k-NN on features (includes neighbors) | 0.2.1 |
| FDR-controlled detection | Benjamini-Hochberg on rank p-values. Per-entity q-values | BH not combined with geometric detection | 0.3.1 |
| Diverse anomaly selection | Facility location covers distinct anomaly regions | Top-N returns redundant extremes | 0.3.1 |
| Distribution-aware scoring | Per-dim Bregman divergence (gaussian/poisson/bernoulli). Additive | PyOD/sklearn: uniform metric across all features | 0.4.0 |
| Anomaly confidence | Bootstrap anomaly_confidence: 0-1. min_confidence filter | No equivalent: binary verdict, no stability signal | 0.4.0 |
| Graph algorithm dimensions | PageRank, betweenness, community, clustering as geometry dims | Separate graph DB + manual joins | 0.4.1 |
| Adaptive false-discovery-rate | Storey π₀ estimator + χ² parametric p-values recover BH power loss | BH without adaptive π₀ overcorrects on null-heavy populations | 0.5.0 |
| Drift direction | gradient_alignment + drift_direction ∈ {normalizing, deteriorating, neutral} | Drift magnitude only; no toward/away-from-centre signal | 0.5.0 |
| One-call root cause | trace_root_cause returns bounded evidence DAG: witness, edge-counterparty, contamination, hub | Manual chain of explain_anomaly → find_counterparties → contagion_score → π7 hub | 0.5.0 |
| Geometric edge potential | ‖delta_from − delta_to‖ × (1/pair_tx_count); per-edge layering signature | Node-level delta_norm misses one-off transactions between divergent accounts | 0.5.0 |
| Structural motif scoring | Product of edge_potential across closed-vocab motifs (fan_out, cycle_2, cycle_3, structuring) | Graph DB motif matching has no geometric rarity score | 0.5.0 |
| Extended motif catalog | fan_in (sink-centric concentrator) and chain_k (open directed chain, 3 ≤ k ≤ 8) extend the motif vocabulary; window-filter correctness fix on fan_out/cycle_2/cycle_3 | Prior motifs silently ignored declared time_window_hours in production | 0.5.1 |
| Bipartite motif catalog | split_recombine (diamond scatter-gather S → k intermediaries → D, forward/backward seed anchoring) and bipartite_burst (complete K_{k,m} bipartite subgraph in tight window) cover scatter-gather smurfing, parallel layering, and coordinated-burst atoms | Closed-vocabulary atomic queries; no manual graph inversion or per-side enumeration glue | 0.5.2 |
| Multi-epoch calibration audit | compare_calibrations(v_from, v_to): per-dim μ/σ/θ drift between two retained calibration epochs of one pattern | Drift detectors compare model predictions; nothing compares the underlying coordinate system itself | 0.6.0 |
| Intrinsic vs extrinsic drift decomposition | decompose_drift splits an entity's geometric drift into its own movement vs population recalibration; intrinsic_fraction ∈ [0, 1] | Drift magnitude alone; population shift and entity behaviour change confound | 0.6.0 |
| Hidden-influencer matrix | find_calibration_influencers: 4-cell classification (hidden / distorter / standard_anomaly / normal) via exact leave-one-out impact on calibration | SHAP / counterfactual: explain a prediction, not the coordinate system itself | 0.6.0 |
| Cross-pattern temporal lead-lag | find_lead_lag(pattern_a, pattern_b): cross-correlates differenced population-centroid drift; peak lag, Bonferroni-adjusted significance, per-dim FDR matrix | Granger causality on raw metrics, not on population-relative geometry | 0.6.0 |
| Anomaly by absence | find_density_gaps: joint-density gaps under independence null with BH-corrected q-values; surfaces under-populated cells in named delta-space ranges | Outlier detection finds extremes; gap detection finds missing combinations | 0.6.0 |
| Declarative motif API | find_motif_by_hops(pattern_id, hops, *, seed_keys): caller passes per-hop HopPredicates (amount / time-delta / direction / edge-dim filters); navigator walks chains of length 1..8 with optional total-span cap (time_window_hours) | Closed-vocab motif registry; no escape hatch for ad-hoc structural shapes | 0.6.0 |
| Anchor-pattern aggregation of edge-derived dims | Anchor patterns declare edge_dim_aggregations: to bake per-edge sidecar signals (pair_edge_count, find_motif_structuring, …) into per-anchor _mean / _max columns; surfaces in every anchor primitive (find_anomalies, explain_anomaly, find_clusters) | Hand-rolled SQL roll-up + manual feature engineering | 0.6.1 |
| Richer hop predicates | HopPredicate.amount_ratio_to_prev (decreasing-chain ratio) and require_anomalous_entity (filter chains routing through calibrated-anomalous nodes) extend the declarative motif API | Closed-vocab motif library has fixed amount thresholds and no anomaly-routing filter | 0.6.1 |
| Event-aware motif scoring | find_motif_by_hops(score=True) ranks motifs by the product of event-aware edge_potential across edges (uses both the anchor companion's per-entity geometry and the event pattern's per-transaction polygons); distinct transactions between the same accounts produce distinct scores | Pure node-pair scoring collapses ranks when motifs share a node sequence | 0.6.1 |
| Chain-anchor aggregation | Chain anchor patterns auto-emitted from chain_lines: declare edge_dim_aggregations: to bake per-event sidecar signals into per-chain _mean / _max columns; closes the third anchor_kind after single and pair | Per-chain manual aggregation outside the geometry, or no chain-level edge-dim summary at all | 0.6.2 |
| Expanded edge_dim_aggregations: surface | Three additional canonical aggregates per source dim (_std, _p95, _count_above_threshold with population p95 cutoff persisted in calibration epoch JSON); k>2 composite anchor support (tripartite and beyond); per-source-dim subset selector, where dims: accepts list (sugar = all five aggregates) or mapping {dim: [agg, …]} (per-dim subset); cross-epoch edge_dim_threshold_drift surface on compare_calibrations | Manual rebuild + re-engineering whenever the aggregate vocabulary or the per-anchor breadth changes | 0.6.3 |
| Chain-coherent investigative loop | Four primitives compose into a complete chain investigation: find_chains_with_coherent_anomaly (population sweep: chains where consecutive entity-anchor positions are individually anomalous on the same dominant delta dim), anomaly_propagation_in_chain (per-chain hop-by-hop trace), classify_chain_typology (five-axis label: shape / peak_position / position_in_chain / extension_signals / dominant_top_dim), extend_chain (boundary-extension suggester via the chain reverse index) | Manual SQL over chain pattern + ad-hoc Python scoring + no per-chain typology label | 0.6.4 |
| Anomaly-anchored seed prune for motifs | find_motif_by_hops(anomaly_seed_filter=True) intersects the BFS starting frontier with the anomaly subset of the resolved anchor companion (replaces "all keys" frontier when seed_keys=None, intersects with explicit list otherwise); result dict carries seed_filter_summary ({requested, anomaly, filtered}) | Manual pre-filter on every call, no built-in convergence on anomaly-anchored seeds | 0.6.4 |
| Cross-bank and structuring chain features | Chain anchor patterns gain two derived columns: cross_bank_count (distinct banks the chain transits, textbook jurisdictional layering signal) and amount_monotone_decreasing (boolean, true when amounts strictly decrease at every hop, textbook structuring pattern). Auto-populated when the event line declares from_bank / to_bank columns; surfaces in find_anomalies(<chain_pattern>), the chain-coherent loop, and classify_chain_typology dominant_top_dim. Effect gated on next chain pattern rebuild | Hand-rolled per-chain rollup post-extraction, no chain-level structuring detector | 0.6.5 |
| Strict-prefix chain subsumption | extract_chains post-merge dedup gains a strict ordered prefix pass: chains whose entity sequence is a strict prefix of another chain's are dropped, since the longer chain investigates every entity the shorter one does plus more | Three near-duplicate chain rows in the points table for what is effectively one investigative finding | 0.6.5 |
| Theta sensitivity diagnostic | Calibration epochs gain a theta_sensitivity field: per-percentile sweep of the anomaly threshold at p90..p99 with theta_mean, anomaly_count_mean, and anomaly_rate per percentile. New theta_sensitivity(pattern_id) MCP tool plus sphere_overview summary block surface a stable band (longest contiguous range where adjacent-pair theta ratio stays below 1.30) and cliff list (boundaries where the ratio is 1.50 or higher, signalling heavy-tail regions). Lets investigators see at a glance whether the chosen anomaly_percentile sits in a smooth zone or near a recalibration cliff. Reuses the builder's existing population sort, so there is zero new I/O cost in the build path | Manual percentile sweep + custom analytics to characterise threshold sensitivity per pattern | 0.6.6 |
| Per-dim runtime weights on find_anomalies | dimension_weights={dim: float} scales each dim's contribution to the rank score before computing delta_norm; default 1.0 for missing dims, 0.0 silences a dim. Wires stratified correlation-gate verdicts (NOISE / DIRECTION-INCONSISTENT / VOLUME-MEDIATED / ROBUST) into runtime ranking: discount or silence per-dim signals that fail confounder-controlled gates | Gate verdicts only inform CHANGELOG / skill cheatsheet narrative; no runtime knob to discount NOISE-classified dims when scoring | 0.6.7 |
| Chain-coherent triage + one-shot R9 orchestrator + SAR narrative | Three composable additions close the chain investigation→SAR pipeline: chain_investigation_summary (population-level triage: coherent_run_rate, cross_pattern_overlap.jaccard, recommended_min_hops); investigate_chain (one-shot orchestrator running trace + typology + shape + extension forward + extension backward server-side, returns SAR-ready summary); generate_sar_rationale (template-based composition of a 3-5 paragraph SAR-ready narrative from R9 evidence with structured evidence_anchors per claim, no LLM call) | Manual chain of four MCP round-trips per investigation, manual SAR narrative drafting from a blank page | 0.6.7 |
| Dim-quality warnings on sphere_overview | dim_quality_warnings[] block surfaces two silent build-time failure modes that break z-score / delta_norm semantics: dead_dim (sigma_diag below 1e-10, z-score undefined) and sparse_dim (median == 0 with rare nonzero, gaussian assumption wrong). Each warning carries type, dim_label, reason, and concrete advice. Computed sub-millisecond from cached pattern state | Both classes silently broke the delta vector with no agent-visible signal, requiring calibration-log archaeology to discover | 0.6.7 |
| External-chain ingestion as anchor lines | New cookbook + schema convention documenting how chains discovered outside hypertopos (SAR typology engines, ERP supply-chain workflows, EHR clinical pathways, customer-journey platforms) ingest as anchor lines. Optional chain_keys column (comma-joined member primary_keys in chain order) unlocks the full chain-coherent investigative loop on externally-curated chains: same primitives, no code change | Build-time chain_lines: BFS extraction was the only documented path; external chain identifiers had no documented ingestion route | 0.6.7 |
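
The first three rows of the table (population-relative coordinates, the self-calibrating threshold, and named dimension attribution) can be illustrated numerically. The sketch below uses plain NumPy and does not call hypertopos. The population, the dimension names, and the `attribute` helper are hypothetical, and the attribution rule (each dimension's share of the squared norm) is an assumption chosen so that the shares sum to 100%; the library's own computation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1000 entities described by 3 named dimensions.
dims = ["loan_count", "tx_volume", "counterparty_count"]  # illustrative names
shape = rng.normal(loc=[2.0, 50.0, 8.0], scale=[1.0, 10.0, 3.0], size=(1000, 3))
shape[0] = [9.0, 55.0, 9.0]  # plant one entity with an unusual loan_count

# Population-relative coordinates: delta = (shape - mu) / sigma
mu = shape.mean(axis=0)
sigma = shape.std(axis=0)
delta = (shape - mu) / sigma

# Self-calibrating threshold: theta = percentile(norms, 95)
norms = np.linalg.norm(delta, axis=1)
theta = np.percentile(norms, 95)
is_anomaly = norms > theta


def attribute(row):
    """Rank dimensions by their share of the squared norm (assumed rule)."""
    share = row**2 / np.sum(row**2)
    ranked = sorted(zip(dims, row, share), key=lambda t: -t[2])
    return [(name, float(z), float(s)) for name, z, s in ranked]


# Report the planted entity in the "dimension: z-score (share)" style.
report = attribute(delta[0])
for name, z, share in report:
    print(f"{name}: {z:+.1f}σ ({share:.0%})")
```

Because theta is a percentile of the population's own norms, roughly 5% of entities exceed it by construction, with no contamination rate to choose.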

What changes in practice

The same problems look different when graph, time, and statistics are unified:

| Problem | Typical approach | With hypertopos |
| --- | --- | --- |
| Detect anomalies | Train model, engineer features, choose contamination rate, retrain on shift | hypertopos build from YAML, find_anomalies(); threshold auto-calibrated from population |
| Explain an anomaly | SHAP on trained model: feature importance for latent dimensions | explain_anomaly(entity): ranked real dimensions, e.g. loan_count: +3.2σ (42%) |
| Compare across domains | Align schemas, build shared features, normalize units | Compare ‖delta‖ directly; 4.2σ means the same in any sphere |
| Track behavioral drift | Export to time-series DB, build dashboard, set manual thresholds | attract_drift(window): displacement, path length, directionality per entity |
| Validate anomaly is real | Manual investigation, ask domain expert | find_witness_cohort(entity): similar non-connected entities confirming the pattern |
| Understand propagation | PageRank, manual path tracing, cross-table joins | propagate_influence(source): Cohen's d between connected vs control group |
| Trust an anomaly verdict | Re-run with different thresholds, manual sensitivity analysis | find_anomalies(min_confidence=0.8): only entities stable under population perturbation |
| Understand why anomalous | SHAP on black-box model, approximate feature importance | explain_anomaly: per-dimension Bregman contribution with distribution kind, sums to 100% |
| Root-cause an anomaly | Manual chain: explain → counterparties → contagion → hub check, 4+ tool calls | trace_root_cause(entity): single call returns bounded DAG of evidence |
| Detect relationship layering | Custom rule engine or manual rare-pair SQL queries | edge_potential(A, B): per-edge score combining endpoint distance and pair rarity |
| Match AML typology patterns | Graph DB subgraph queries + separate risk scoring | find_motif(type="structuring", …): structural pattern + geometric rarity product |
| Match ad-hoc structural chains | Custom subgraph queries per shape, no built-in scoring | find_motif_by_hops(hops=[HopPredicate(...)]): declarative per-hop predicates (amount, time, direction, edge-dim, ratio-to-prev, anomaly filter) with event-aware geometric scoring on the same call |
| Account-level transaction layering recall | Hand-rolled SQL roll-ups of per-edge signals into account features | edge_dim_aggregations: on the anchor pattern bakes per-edge structuring / pair-recurrence / chain-depth signals into anchor geometry; read off find_anomalies like any other dim |
| Direction of behavioural drift | Drift magnitude alone; no toward/away-from-centre signal | attract_drift returns drift_direction ∈ {normalizing, deteriorating, neutral} |
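
To make the drift rows concrete, here is a minimal NumPy sketch of trajectory metrics over a sequence of delta vectors. It is not the hypertopos implementation: `drift_metrics` is a hypothetical helper, directionality is assumed to be displacement divided by path length, and the normalizing / deteriorating / neutral label is assumed to come from comparing distance to the population centre at the first and last snapshot, whereas the table above says the library derives it from gradient_alignment.

```python
import numpy as np


def drift_metrics(trajectory, tol=1e-9):
    """Summarise one entity's path through delta-space (illustrative helper)."""
    traj = np.asarray(trajectory, dtype=float)
    steps = np.diff(traj, axis=0)  # movement between consecutive snapshots

    displacement = float(np.linalg.norm(traj[-1] - traj[0]))
    path_length = float(np.linalg.norm(steps, axis=1).sum())

    # Assumed definition: 1.0 means a straight line, values near 0 mean wandering.
    directionality = displacement / path_length if path_length > 0 else 0.0

    # Assumed rule: the population centre is the origin of delta-space, so a
    # growing norm means moving away from typical behaviour.
    change = float(np.linalg.norm(traj[-1]) - np.linalg.norm(traj[0]))
    if change > tol:
        direction = "deteriorating"
    elif change < -tol:
        direction = "normalizing"
    else:
        direction = "neutral"

    return {
        "displacement": displacement,
        "path_length": path_length,
        "directionality": directionality,
        "drift_direction": direction,
    }


# An entity that leaves the centre along two legs, and one that returns directly.
away = drift_metrics([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
back = drift_metrics([[3.0, 4.0], [0.0, 0.0]])
print(away)
print(back)
```

Keeping displacement and path length separate distinguishes an entity that moved far in one direction from one that travelled just as far but ended up close to where it started.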

Benchmarks

Validated on three domains with the same engine, zero domain rules, zero labels:

| Domain | Dataset | Key result |
| --- | --- | --- |
| Banking | Berka (Czech, real data) | 85.5% recall on loan defaults |
| AML | IBM AML (synthetic) | 80.4% recall, zero labels |
| Transport | NYC Yellow Taxi (7.6M trips) | 8/8 anomaly categories detected |

Benchmark scripts and data preparation are included. Results are reproducible. Numbers are from the pre-0.1.0 validation run and have not been re-evaluated against recent releases.

Full results: Benchmarks

Documentation

| Document | Description |
| --- | --- |
| Introduction | The idea and where it stands |
| Quick Start | Install, build, navigate |
| Core Concepts | Mathematical foundation |
| Configuration | Sphere builder YAML reference |
| API Reference | Python API |
| Data Format | On-disk storage format |
| Architecture | Package layers and design |

Status

Research-stage project. Working code, reproducible benchmarks, active development. API may change.

License

Business Source License 1.1. Free for internal use, development, testing, and research. See LICENSE.md for details.
