hypertopos

A behavioral feature layer for graph and temporal data — turning behavior into coordinates, trajectories, and explanations.

hypertopos is not a database, and not a machine learning model. It is a layer that turns relational data into a coordinate system where every entity gets a position derived from its relationships and the population around it.

typical:    data → features → feature store → ML → decision
hypertopos: data → representation (hypertopos) → ML / decision

pip install hypertopos

How it works

You describe your data in YAML — entity types, sources, relationships. hypertopos computes population statistics and produces a sphere: pre-computed geometry stored in Apache Arrow format.
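
As a rough illustration of what "computes population statistics" can mean, the sketch below derives a per-dimension mean and standard deviation, converts each entity to population-relative coordinates with the `delta = (shape - mu) / sigma` formula listed under "What's different", and sets the threshold to the 95th percentile of the norms. This is plain Python written for this README, not hypertopos code: the account keys, the dimension values, and the helper names (`calibrate`, `delta`, `percentile`) are assumptions made for the example, and the real engine stores its geometry in Arrow rather than in dicts.

```python
import math
import statistics

# Hypothetical per-entity "shape" vectors: one raw value per named dimension.
DIMS = ["loan_count", "tx_count"]
shapes = {
    "acct_1": [1.0, 10.0],
    "acct_2": [2.0, 12.0],
    "acct_3": [1.0, 11.0],
    "acct_4": [2.0, 9.0],
    "acct_5": [9.0, 40.0],
}

def calibrate(shapes):
    """Per-dimension population mean and (population) standard deviation."""
    columns = list(zip(*shapes.values()))
    mu = [statistics.fmean(col) for col in columns]
    sigma = [statistics.pstdev(col) for col in columns]
    return mu, sigma

def delta(shape, mu, sigma):
    """Population-relative coordinates: (shape - mu) / sigma per dimension."""
    return [(x - m) / s for x, m, s in zip(shape, mu, sigma)]

def norm(vec):
    """Euclidean length of a delta vector."""
    return math.sqrt(sum(v * v for v in vec))

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

mu, sigma = calibrate(shapes)
deltas = {key: delta(shape, mu, sigma) for key, shape in shapes.items()}
norms = {key: norm(d) for key, d in deltas.items()}

# Threshold calibrated from the population itself, no labels involved.
theta = percentile(list(norms.values()), 95)
anomalies = [key for key, n in norms.items() if n > theta]
print(anomalies)
```

With five entities the 95th percentile falls between the two largest norms, so only `acct_5` exceeds `theta`. A dimension with zero variance would divide by zero in this sketch, which is the situation the `dead_dim` warning on `sphere_overview` is described as reporting.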

Agents (or Python code) open the sphere and navigate it using twelve primitives that cover movement, clustering, anomaly detection, population comparison, and temporal analysis. Each step is stateful — where you are determines what you see next.
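
To make "where you are determines what you see next" concrete, here is a toy state machine in which the current position type limits the operations that may be called. Everything in it is hypothetical: the `Navigator` class, the two position types, and the operation names are invented for illustration and do not reflect the actual twelve primitives or their signatures.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical position types and operation names, for illustration only.
VALID_OPS = {
    "population": {"find_anomalies", "find_clusters", "goto_entity"},
    "entity": {"explain", "neighbors", "to_population"},
}

@dataclass
class Navigator:
    position_type: str = "population"
    position_key: Optional[str] = None

    def available(self):
        """Operations that are valid from the current position."""
        return sorted(VALID_OPS[self.position_type])

    def step(self, op, target=None):
        """Apply an operation, rejecting it if the position type forbids it."""
        if op not in VALID_OPS[self.position_type]:
            raise ValueError(
                f"{op!r} is not valid from a {self.position_type} position"
            )
        if op == "goto_entity":
            self.position_type, self.position_key = "entity", target
        elif op == "to_population":
            self.position_type, self.position_key = "population", None
        return self

nav = Navigator()
nav.step("goto_entity", "acct_5")
print(nav.position_type, nav.available())
```

After the move, population-level operations such as `find_anomalies` are rejected until the navigator steps back to the population position.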

For the full picture: Introduction · Core Concepts · Quick Start

What's different

Each capability below emerges from treating entities as points in a shared, population-calibrated space.

Capability	What it does	Compared to	Since
Population-relative coordinates	`delta = (shape - mu) / sigma` — one coordinate for anomaly, clustering, drift	node2vec/GNN: latent dims, retraining on shift	`0.1.0`
Self-calibrating threshold	`theta = percentile(norms, 95)` — no tuning, no labels	PyOD: choose contamination rate	`0.1.0`
Named dimension attribution	`explain_anomaly` → `loan_count: +3.2σ (42%)`. Sums to 100%	SHAP/LIME: approximate, model-dependent	`0.1.0`
Temporal deformation	Append-only delta log. Displacement, path length, directionality	Time-series DBs: metric values, not trajectories	`0.1.0`
Stateful navigation	12 typed primitives. Position type constrains valid ops	SQL/GraphQL: stateless queries	`0.1.0`
Cross-sphere comparison	`||delta||` is dimensionless: 4.2σ means the same in any domain	Requires shared features or joint embeddings	`0.1.0`
Counterfactual simulation	`simulate_edges` recomputes delta against fixed baseline	Causal inference: explicit DAG required	`0.1.0`
Regime change detection	Per-bucket centroids, self-calibrating shift threshold	Evidently/NannyML: model prediction drift	`0.1.0`
Graph contagion	Mean `||delta||` of neighbors. Cohen's d vs control group	PageRank: topology, not behavioral propagation	`0.2.0`
Witness cohort	Similar deltas, NOT connected. Validates pattern vs one-off	k-NN on features (includes neighbors)	`0.2.1`
FDR-controlled detection	Benjamini-Hochberg on rank p-values. Per-entity q-values	BH not combined with geometric detection	`0.3.1`
Diverse anomaly selection	Facility location covers distinct anomaly regions	Top-N returns redundant extremes	`0.3.1`
Distribution-aware scoring	Per-dim Bregman divergence (gaussian/poisson/bernoulli). Additive	PyOD/sklearn: uniform metric across all features	`0.4.0`
Anomaly confidence	Bootstrap `anomaly_confidence: 0-1`. `min_confidence` filter	No equivalent — binary verdict, no stability signal	`0.4.0`
Graph algorithm dimensions	PageRank, betweenness, community, clustering as geometry dims	Separate graph DB + manual joins	`0.4.1`
Adaptive false-discovery-rate control	Storey π₀ estimator + χ² parametric p-values recover the power that plain BH loses	BH without adaptive π₀ overcorrects on null-heavy populations	`0.5.0`
Drift direction	`gradient_alignment` + `drift_direction ∈ {normalizing, deteriorating, neutral}`	Drift magnitude only — no toward/away-from-centre signal	`0.5.0`
One-call root cause	`trace_root_cause` returns bounded evidence DAG — witness, edge-counterparty, contamination, hub	Manual chain of `explain_anomaly → find_counterparties → contagion_score → π7 hub`	`0.5.0`
Geometric edge potential	`||delta_from − delta_to|| × (1/pair_tx_count)`: per-edge layering signature	Node-level `delta_norm` misses one-off transactions between divergent accounts	`0.5.0`
Structural motif scoring	Product of `edge_potential` across closed-vocab motifs (fan_out, cycle_2, cycle_3, structuring)	Graph DB motif matching has no geometric rarity score	`0.5.0`
Extended motif catalog	`fan_in` (sink-centric concentrator) and `chain_k` (open directed chain, 3 ≤ k ≤ 8) extend the motif vocabulary; window-filter correctness fix on fan_out/cycle_2/cycle_3	Prior motifs silently ignored declared `time_window_hours` in production	`0.5.1`
Bipartite motif catalog	`split_recombine` (diamond scatter-gather S → k intermediaries → D, forward/backward seed anchoring) and `bipartite_burst` (complete K_{k,m} bipartite subgraph in tight window) cover scatter-gather smurfing, parallel layering, and coordinated-burst atoms	Closed-vocabulary atomic queries — no manual graph inversion or per-side enumeration glue	`0.5.2`
Multi-epoch calibration audit	`compare_calibrations(v_from, v_to)` — per-dim μ/σ/θ drift between two retained calibration epochs of one pattern	Drift detectors compare model predictions; nothing compares the underlying coordinate system itself	`0.6.0`
Intrinsic vs extrinsic drift decomposition	`decompose_drift` splits an entity's geometric drift into its own movement vs population recalibration; `intrinsic_fraction ∈ [0, 1]`	Drift magnitude alone — population shift and entity behaviour change confound	`0.6.0`
Hidden-influencer matrix	`find_calibration_influencers` — 4-cell classification (`hidden` / `distorter` / `standard_anomaly` / `normal`) via exact leave-one-out impact on calibration	SHAP / counterfactual: explain a prediction, not the coordinate system itself	`0.6.0`
Cross-pattern temporal lead-lag	`find_lead_lag(pattern_a, pattern_b)` — cross-correlates differenced population-centroid drift; peak lag, Bonferroni-adjusted significance, per-dim FDR matrix	Granger causality on raw metrics — not on population-relative geometry	`0.6.0`
Anomaly by absence	`find_density_gaps` — joint-density gaps under independence null with BH-corrected q-values; surfaces under-populated cells in named delta-space ranges	Outlier detection finds extremes; gap detection finds missing combinations	`0.6.0`
Declarative motif API	`find_motif_by_hops(pattern_id, hops, *, seed_keys)` — caller passes per-hop `HopPredicate`s (amount / time-delta / direction / edge-dim filters); navigator walks chains of length 1..8 with optional total-span cap (`time_window_hours`)	Closed-vocab motif registry — no escape hatch for ad-hoc structural shapes	`0.6.0`
Anchor-pattern aggregation of edge-derived dims	Anchor patterns declare `edge_dim_aggregations:` to bake per-edge sidecar signals (`pair_edge_count`, `find_motif_structuring`, …) into per-anchor `_mean` / `_max` columns; surfaces in every anchor primitive (`find_anomalies`, `explain_anomaly`, `find_clusters`)	Hand-rolled SQL roll-up + manual feature engineering	`0.6.1`
Richer hop predicates	`HopPredicate.amount_ratio_to_prev` (decreasing-chain ratio) and `require_anomalous_entity` (filter chains routing through calibrated-anomalous nodes) extend the declarative motif API	Closed-vocab motif library has fixed amount thresholds and no anomaly-routing filter	`0.6.1`
Event-aware motif scoring	`find_motif_by_hops(score=True)` ranks motifs by the product of event-aware `edge_potential` across edges (uses both the anchor companion's per-entity geometry and the event pattern's per-transaction polygons); distinct transactions between the same accounts produce distinct scores	Pure node-pair scoring collapses ranks when motifs share a node sequence	`0.6.1`
Chain-anchor aggregation	Chain anchor patterns auto-emitted from `chain_lines:` declare `edge_dim_aggregations:` to bake per-event sidecar signals into per-chain `_mean` / `_max` columns; closes the third `anchor_kind` after `single` and `pair`	Per-chain manual aggregation outside the geometry, or no chain-level edge-dim summary at all	`0.6.2`
Expanded `edge_dim_aggregations:` surface	Three additional canonical aggregates per source dim (`_std`, `_p95`, `_count_above_threshold` with population p95 cutoff persisted in calibration epoch JSON); k>2 composite anchor support (tripartite and beyond); per-source-dim subset selector — `dims:` accepts list (sugar = all five aggregates) or mapping `{dim: [agg, …]}` (per-dim subset); cross-epoch `edge_dim_threshold_drift` surface on `compare_calibrations`	Manual rebuild + re-engineering whenever the aggregate vocabulary or the per-anchor breadth changes	`0.6.3`
Chain-coherent investigative loop	Four primitives compose into a complete chain investigation: `find_chains_with_coherent_anomaly` (population sweep — chains where consecutive entity-anchor positions are individually anomalous on the same dominant delta dim), `anomaly_propagation_in_chain` (per-chain hop-by-hop trace), `classify_chain_typology` (five-axis label: shape / peak_position / position_in_chain / extension_signals / dominant_top_dim), `extend_chain` (boundary-extension suggester via the chain reverse index)	Manual SQL over chain pattern + ad-hoc Python scoring + no per-chain typology label	`0.6.4`
Anomaly-anchored seed prune for motifs	`find_motif_by_hops(anomaly_seed_filter=True)` intersects the BFS starting frontier with the anomaly subset of the resolved anchor companion (replaces "all keys" frontier when `seed_keys=None`, intersects with explicit list otherwise); result dict carries `seed_filter_summary` (`{requested, anomaly, filtered}`)	Manual pre-filter on every call, no built-in convergence on anomaly-anchored seeds	`0.6.4`
Cross-bank and structuring chain features	Chain anchor patterns gain two derived columns — `cross_bank_count` (distinct banks the chain transits, textbook jurisdictional layering signal) and `amount_monotone_decreasing` (boolean, true when amounts strictly decrease at every hop, textbook structuring pattern). Auto-populated when the event line declares `from_bank` / `to_bank` columns; surfaces in `find_anomalies(<chain_pattern>)`, the chain-coherent loop, and `classify_chain_typology` `dominant_top_dim`. Effect gated on next chain pattern rebuild	Hand-rolled per-chain rollup post-extraction, no chain-level structuring detector	`0.6.5`
Strict-prefix chain subsumption	`extract_chains` post-merge dedup gains a strict ordered prefix pass — chains whose entity sequence is a strict prefix of another chain's are dropped, since the longer chain investigates every entity the shorter one does plus more	Three near-duplicate chain rows in the points table for what is effectively one investigative finding	`0.6.5`
Theta sensitivity diagnostic	Calibration epochs gain a `theta_sensitivity` field — per-percentile sweep of the anomaly threshold at p90..p99 with `theta_mean`, `anomaly_count_mean`, and `anomaly_rate` per percentile. New `theta_sensitivity(pattern_id)` MCP tool plus `sphere_overview` summary block surface a stable band (longest contiguous range where adjacent-pair theta ratio stays below 1.30) and cliff list (boundaries where the ratio is 1.50 or higher, signalling heavy-tail regions). Lets investigators see at a glance whether the chosen `anomaly_percentile` sits in a smooth zone or near a recalibration cliff. Glues onto the builder's existing population sort — zero new I/O cost in the build path	Manual percentile sweep + custom analytics to characterise threshold sensitivity per pattern	`0.6.6`
Per-dim runtime weights on `find_anomalies`	`dimension_weights={dim: float}` scales each dim's contribution to the rank score before computing `delta_norm`; default `1.0` for missing dims, `0.0` silences a dim. Wires stratified correlation-gate verdicts (NOISE / DIRECTION-INCONSISTENT / VOLUME-MEDIATED / ROBUST) into runtime ranking — discount or silence per-dim signals that fail confounder-controlled gates	Gate verdicts only inform CHANGELOG / skill cheatsheet narrative; no runtime knob to discount NOISE-classified dims when scoring	`0.6.7`
Chain-coherent triage + one-shot R9 orchestrator + SAR narrative	Three composable additions close the chain investigation→SAR pipeline: `chain_investigation_summary` (population-level triage — `coherent_run_rate`, `cross_pattern_overlap.jaccard`, `recommended_min_hops`); `investigate_chain` (one-shot orchestrator running trace + typology + shape + extension forward + extension backward server-side, returns SAR-ready summary); `generate_sar_rationale` (template-based composition of a 3-5 paragraph SAR-ready narrative from R9 evidence with structured `evidence_anchors` per claim, no LLM call)	Manual chain of four MCP round-trips per investigation, manual SAR narrative drafting from a blank page	`0.6.7`
Dim-quality warnings on `sphere_overview`	`dim_quality_warnings[]` block surfaces two silent build-time failure modes that break z-score / `delta_norm` semantics: `dead_dim` (sigma_diag below 1e-10, z-score undefined) and `sparse_dim` (median == 0 with rare nonzero, gaussian assumption wrong). Each warning carries `type`, `dim_label`, `reason`, and concrete `advice`. Computed sub-millisecond from cached pattern state	Both classes silently broke the delta vector with no agent-visible signal, requiring calibration-log archaeology to discover	`0.6.7`
External-chain ingestion as anchor lines	New cookbook + schema convention documenting how chains discovered outside hypertopos (SAR typology engines, ERP supply-chain workflows, EHR clinical pathways, customer-journey platforms) ingest as anchor lines. Optional `chain_keys` column (comma-joined member primary_keys in chain order) unlocks the full chain-coherent investigative loop on externally-curated chains — same primitives, no code change	Build-time `chain_lines:` BFS extraction was the only documented path; external chain identifiers had no documented ingestion route	`0.6.7`
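
Two of the formulas in the table are compact enough to sketch directly. The example below ranks dimensions by their share of the squared delta norm, so the shares sum to 100%, and evaluates the edge potential `||delta_from − delta_to|| × (1/pair_tx_count)` as written in the table. The squared-share rule is an assumption chosen because it is additive and matches the gaussian case; the table describes `explain_anomaly` as using per-dimension Bregman contributions, which this sketch does not implement. The function names mirror the table entries but the signatures are hypothetical.

```python
import math

def attribution(delta_by_dim):
    """Rank dimensions by share of the squared delta norm (assumed rule).

    Returns (dimension, z-score, percent share) rows; shares sum to 100.
    """
    total = sum(v * v for v in delta_by_dim.values())
    rows = [(dim, v, 100.0 * v * v / total) for dim, v in delta_by_dim.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

def edge_potential(delta_from, delta_to, pair_tx_count):
    """Distance between endpoint deltas, scaled by 1 / pair_tx_count."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(delta_from, delta_to)))
    return dist / pair_tx_count

# Hypothetical delta vector for one entity, keyed by dimension name.
rows = attribution({"loan_count": 3.0, "tx_count": -4.0})
for dim, z, share in rows:
    print(f"{dim}: {z:+.1f}σ ({share:.0f}%)")

# The same pair of endpoints scores higher when the pair rarely transacts.
one_off = edge_potential([3.0, 0.0], [0.0, 4.0], pair_tx_count=1)
frequent = edge_potential([3.0, 0.0], [0.0, 4.0], pair_tx_count=10)
print(one_off, frequent)
```

The second half shows why a per-edge score can catch what a node-level norm misses: the endpoints are identical in both calls, and only the rarity of the pair changes the result.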

What changes in practice

The same problems look different when graph, time, and statistics are unified:

Problem	Typical approach	With hypertopos
Detect anomalies	Train model, engineer features, choose contamination rate, retrain on shift	`hypertopos build` from YAML, `find_anomalies()` — threshold auto-calibrated from population
Explain an anomaly	SHAP on a trained model: feature importance for latent dimensions	`explain_anomaly(entity)` returns ranked named dimensions, e.g. `loan_count: +3.2σ (42%)`
Compare across domains	Align schemas, build shared features, normalize units	Compare `||delta||` directly: 4.2σ means the same in any sphere
Track behavioral drift	Export to time-series DB, build dashboard, set manual thresholds	`attract_drift(window)` — displacement, path length, directionality per entity
Validate anomaly is real	Manual investigation, ask domain expert	`find_witness_cohort(entity)` — similar non-connected entities confirming the pattern
Understand propagation	PageRank, manual path tracing, cross-table joins	`propagate_influence(source)` — Cohen's d between connected vs control group
Trust an anomaly verdict	Re-run with different thresholds, manual sensitivity analysis	`find_anomalies(min_confidence=0.8)` — only entities stable under population perturbation
Understand why anomalous	SHAP on black-box model, approximate feature importance	`explain_anomaly` — per-dimension Bregman contribution with distribution kind, sums to 100%
Root-cause an anomaly	Manual chain: explain → counterparties → contagion → hub check, 4+ tool calls	`trace_root_cause(entity)` — single call returns bounded DAG of evidence
Detect relationship layering	Custom rule engine or manual rare-pair SQL queries	`edge_potential(A, B)` — per-edge score combining endpoint distance and pair rarity
Match AML typology patterns	Graph DB subgraph queries + separate risk scoring	`find_motif(type="structuring", …)` — structural pattern + geometric rarity product
Match ad-hoc structural chains	Custom subgraph queries per shape, no built-in scoring	`find_motif_by_hops(hops=[HopPredicate(...)])` — declarative per-hop predicates (amount, time, direction, edge-dim, ratio-to-prev, anomaly filter) with event-aware geometric scoring on the same call
Account-level transaction layering recall	Hand-rolled SQL roll-ups of per-edge signals into account features	`edge_dim_aggregations:` on the anchor pattern bakes per-edge structuring / pair-recurrence / chain-depth signals into anchor geometry — read off `find_anomalies` like any other dim
Direction of behavioural drift	Drift magnitude alone — no toward/away-from-centre signal	`attract_drift` returns `drift_direction ∈ {normalizing, deteriorating, neutral}`
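
The drift rows above mention displacement, path length, directionality, and a drift direction. A minimal sketch of how such trajectory metrics could be computed from a time-ordered list of delta vectors follows. The definitions used here are assumptions: directionality is taken as displacement divided by path length, and the direction label compares the distance from the population centre at the start and end of the window. The `trajectory_metrics` helper is hypothetical and is not the implementation behind `attract_drift`, which the table describes as using `gradient_alignment`.

```python
import math

def _dist(a, b):
    """Euclidean distance between two delta vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def trajectory_metrics(path, tolerance=1e-9):
    """Summarise a time-ordered list of delta vectors for one entity."""
    origin = [0.0] * len(path[0])
    displacement = _dist(path[0], path[-1])
    path_length = sum(_dist(a, b) for a, b in zip(path, path[1:]))
    # 1.0 means a straight line; values near 0 mean wandering in place.
    directionality = displacement / path_length if path_length > 0 else 0.0
    # Assumed rule: moving away from the centre counts as deteriorating.
    change = _dist(path[-1], origin) - _dist(path[0], origin)
    if change > tolerance:
        direction = "deteriorating"
    elif change < -tolerance:
        direction = "normalizing"
    else:
        direction = "neutral"
    return {
        "displacement": displacement,
        "path_length": path_length,
        "directionality": directionality,
        "drift_direction": direction,
    }

# Hypothetical trajectory: starts at the centre, ends 5σ away via a detour.
metrics = trajectory_metrics([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
print(metrics)
```

Reversing the same path would give identical displacement and path length but a `normalizing` label, which is the distinction a magnitude-only drift measure cannot make.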

Benchmarks

Validated on three domains with the same engine, zero domain rules, zero labels:

Domain	Dataset	Key result
Banking	Berka (Czech, real data)	85.5% recall on loan defaults
AML	IBM AML (synthetic)	80.4% recall, zero labels
Transport	NYC Yellow Taxi (7.6M trips)	8/8 anomaly categories detected

Benchmark scripts and data preparation are included. Results are reproducible. Numbers are from the pre-0.1.0 validation run and have not been re-evaluated against recent releases.

Full results: Benchmarks

Documentation


Document	Contents
Introduction	The idea and where it stands
Quick Start	Install, build, navigate
Core Concepts	Mathematical foundation
Configuration	Sphere builder YAML reference
API Reference	Python API
Data Format	On-disk storage format
Architecture	Package layers and design

Status

Research-stage project. Working code, reproducible benchmarks, active development. API may change.

License

Business Source License 1.1. Free for internal use, development, testing, and research. See LICENSE.md for details.
