Date: January 7, 2026
Version: 1.0.0
Category: Security, Knowledge Graph Protection
- Overview
- Threat Scenarios
- Protection Mechanisms
- ThemisDB Implementation
- Configuration
- Best Practices
Recent research shows that knowledge graphs and structured data are increasingly targeted for data theft, particularly for training AI models. Attackers can:
- Exfiltrate Knowledge Graphs - Systematic extraction of graph structures and content
- Extract Embeddings - Theft of pre-computed vector embeddings
- Training Data Extraction - Reconstruction of training data from models
- Model Inversion - Deriving sensitive information from model behavior
Latest studies (January 2026) demonstrate methods to protect knowledge graphs:
- Watermarking: Embedding imperceptible marks in graph structures
- Data Poisoning Defense: Targeted perturbations that make stolen data unusable for AI training
- Fingerprinting: Identification of stolen data in trained models
- Access Pattern Analysis: Detection of systematic data exfiltration
Attack Scenario:
Attacker performs systematic traversal queries:
- BFS/DFS over entire graph
- Incremental extraction of all nodes and edges
- Export of attributes and relationships
Risks for ThemisDB:
- GraphIndexManager allows traversal operations
- PropertyGraph contains structured relationships
- GNN embeddings reveal graph topology
Attack Scenario:
Theft of pre-computed vector embeddings:
- Access to VectorIndexManager
- Download all embedding vectors
- Use for own models
Risks for ThemisDB:
- VectorIndexManager stores embeddings persistently
- HNSW indices are exportable
- No native watermarking support
Attack Scenario:
Reconstruction of training data:
- Query-based attacks on LLM integration
- Model inversion attacks
- Membership inference attacks
Risks for ThemisDB:
- Embedded LLM (llama.cpp) could leak training data
- Response cache could reveal sensitive patterns
- No differential privacy mechanisms
Attack Scenario:
Analysis of temporal patterns:
- TemporalGraph shows evolution over time
- Audit logs reveal access patterns
- Change Data Capture (CDC) shows modifications
Risks for ThemisDB:
- TemporalGraph stores history
- CDC/Changefeed enables timeline reconstruction
Concept: Embedding harmless but identifiable patterns in graph structure:
// Pseudo-code for Graph Watermarking
class GraphWatermarkingManager {
public:
    // Embed watermark in graph structure
    Status embedWatermark(
        std::string_view graph_id,
        const WatermarkConfig& config,
        const std::vector<uint8_t>& watermark_bits
    );

    // Detect watermark in extracted graph
    std::pair<Status, bool> detectWatermark(
        std::string_view graph_id,
        const WatermarkConfig& config
    );

private:
    // Add dummy edges with specific patterns
    Status addDummyEdges(const WatermarkConfig& config);

    // Modify edge weights imperceptibly
    Status perturbWeights(const WatermarkConfig& config);
};
Properties:
- Harmless for graph algorithms
- Robust against modifications
- Provable in stolen copies
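The dummy-edge idea above can be sketched as follows. This is a minimal illustration, not the ThemisDB implementation: the in-memory `EdgeSet`, the integer key, the helper names `splitmix64` and `markerEdges`, and the 80% detection threshold are all assumptions made for this example. The key is expanded into a fixed set of marker edges; detection re-derives the same edges and checks how many survive in a suspect copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical in-memory graph: directed edges between node ids [0, node_count).
using Edge = std::pair<std::uint32_t, std::uint32_t>;
using EdgeSet = std::set<Edge>;

// SplitMix64: a small deterministic generator with identical output on every platform.
inline std::uint64_t splitmix64(std::uint64_t& state) {
    state += 0x9E3779B97F4A7C15ULL;
    std::uint64_t z = state;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Derive `count` distinct marker edges (no self-loops) from the secret key.
inline std::vector<Edge> markerEdges(std::uint64_t key, std::uint32_t node_count,
                                     std::size_t count) {
    assert(node_count >= 2);
    assert(count <= static_cast<std::uint64_t>(node_count) * (node_count - 1));
    std::uint64_t state = key;
    EdgeSet seen;
    std::vector<Edge> edges;
    while (edges.size() < count) {
        const auto u = static_cast<std::uint32_t>(splitmix64(state) % node_count);
        const auto v = static_cast<std::uint32_t>(splitmix64(state) % node_count);
        if (u != v && seen.emplace(u, v).second) {
            edges.emplace_back(u, v);
        }
    }
    return edges;
}

// Embed: add every marker edge; returns how many were newly inserted.
inline std::size_t embedWatermark(EdgeSet& graph, std::uint64_t key,
                                  std::uint32_t node_count, std::size_t count) {
    std::size_t added = 0;
    for (const Edge& e : markerEdges(key, node_count, count)) {
        if (graph.insert(e).second) ++added;
    }
    return added;
}

// Detect: the mark counts as present if enough marker edges exist in the suspect copy.
inline bool detectWatermark(const EdgeSet& suspect, std::uint64_t key,
                            std::uint32_t node_count, std::size_t count,
                            double min_fraction = 0.8) {
    std::size_t present = 0;
    for (const Edge& e : markerEdges(key, node_count, count)) {
        if (suspect.count(e) != 0) ++present;
    }
    return static_cast<double>(present) >= min_fraction * static_cast<double>(count);
}
```

Because detection tolerates a fraction of missing markers, the mark survives small edits to a stolen copy. The "harmless" property only holds while the number of marker edges stays small relative to the graph, so the marker count is a tuning decision rather than a fixed value.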
Concept: Protecting embeddings through controlled perturbations:
// Pseudo-code for Embedding Protection
class EmbeddingProtectionManager {
public:
    // Add imperceptible noise to embeddings
    Status protectEmbeddings(
        std::string_view object_name,
        float noise_magnitude = 0.01f,
        uint64_t seed = 0
    );

    // Verify embedding integrity
    std::pair<Status, bool> verifyEmbedding(
        const std::vector<float>& embedding,
        const std::string& fingerprint
    );

private:
    // Add deterministic noise based on secret key
    std::vector<float> addProtectiveNoise(
        const std::vector<float>& original,
        const ProtectionKey& key
    );
};
Properties:
- Minimal impact on search quality
- Deterministic based on secret key
- Provable in stolen embeddings
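A minimal sketch of key-derived noise, under stated assumptions: the key is a plain 64-bit integer rather than a `ProtectionKey`, the noise is a fixed-magnitude sign pattern, and verification compares a suspect vector against the owner's unprotected original. The helper names `nextRandom`, `signPattern`, and `fingerprintScore` are hypothetical and not part of ThemisDB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64 step: deterministic and platform independent.
inline std::uint64_t nextRandom(std::uint64_t& state) {
    state += 0x9E3779B97F4A7C15ULL;
    std::uint64_t z = state;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Expand the secret key into a +1/-1 pattern, one sign per dimension.
inline std::vector<float> signPattern(std::uint64_t key, std::size_t dim) {
    std::uint64_t state = key;
    std::vector<float> signs(dim);
    for (float& s : signs) {
        s = (nextRandom(state) >> 63) != 0 ? 1.0f : -1.0f;
    }
    return signs;
}

// Shift every dimension by +/- magnitude according to the key pattern.
inline std::vector<float> addProtectiveNoise(const std::vector<float>& original,
                                             std::uint64_t key,
                                             float magnitude = 0.01f) {
    assert(magnitude > 0.0f);
    const std::vector<float> signs = signPattern(key, original.size());
    std::vector<float> out(original.size());
    for (std::size_t i = 0; i < original.size(); ++i) {
        out[i] = original[i] + magnitude * signs[i];
    }
    return out;
}

// Correlate (suspect - original) with the key pattern.
// Close to 1.0: the suspect carries this key's noise. Close to 0.0: it does not.
inline double fingerprintScore(const std::vector<float>& suspect,
                               const std::vector<float>& original,
                               std::uint64_t key, float magnitude) {
    assert(suspect.size() == original.size() && !original.empty());
    const std::vector<float> signs = signPattern(key, original.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        const double residual =
            static_cast<double>(suspect[i]) - static_cast<double>(original[i]);
        sum += residual * static_cast<double>(signs[i]);
    }
    return sum / (static_cast<double>(magnitude) * static_cast<double>(original.size()));
}
```

Using a different key per customer or export would let a leaked copy be traced to its source, at the cost of storing either the originals or the keys securely. Whether a magnitude of 0.01 really has minimal impact on search quality depends on the embedding scale and should be measured on real queries.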
Concept: Detection of suspicious access patterns:
// Pseudo-code for Anomaly Detection
class GraphAccessMonitor {
public:
    // Track graph access patterns
    Status recordAccess(
        const std::string& user_id,
        const std::string& operation,
        const std::vector<std::string>& accessed_nodes
    );

    // Detect suspicious patterns
    std::pair<Status, std::vector<Anomaly>> detectAnomalies(
        std::string_view user_id,
        const DetectionConfig& config
    );

private:
    // Suspicious patterns:
    // - Systematic traversal
    // - High-frequency queries
    // - Broad scope access
    bool isSuspiciousPattern(const AccessPattern& pattern);
};
Detection Criteria:
- Unusually high traversal depth
- Systematic enumeration of nodes
- Large amounts of exported data
- Abnormal timing
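Two of these criteria, systematic enumeration and large export volume, can be sketched with a per-user coverage counter. The class name `GraphAccessMonitorSketch`, the fields of `DetectionConfig`, and the threshold values are assumptions for this example; traversal depth and timing checks are left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical thresholds for one monitoring window.
struct DetectionConfig {
    double max_coverage = 0.5;               // share of all nodes one user may touch
    std::size_t max_nodes_returned = 10000;  // total nodes returned to one user
};

class GraphAccessMonitorSketch {
public:
    explicit GraphAccessMonitorSketch(std::size_t total_nodes)
        : total_nodes_(total_nodes) {
        assert(total_nodes > 0);
    }

    // Accumulate the nodes a query returned to this user.
    void recordAccess(const std::string& user_id,
                      const std::vector<std::string>& accessed_nodes) {
        Profile& profile = profiles_[user_id];
        profile.nodes_returned += accessed_nodes.size();
        profile.distinct_nodes.insert(accessed_nodes.begin(), accessed_nodes.end());
    }

    // Fraction of the whole graph this user has seen in the current window.
    double coverage(const std::string& user_id) const {
        const auto it = profiles_.find(user_id);
        if (it == profiles_.end()) return 0.0;
        return static_cast<double>(it->second.distinct_nodes.size()) /
               static_cast<double>(total_nodes_);
    }

    // Flag users who enumerate a large share of the graph or pull a large volume.
    bool isSuspicious(const std::string& user_id,
                      const DetectionConfig& config = DetectionConfig{}) const {
        const auto it = profiles_.find(user_id);
        if (it == profiles_.end()) return false;
        return coverage(user_id) >= config.max_coverage ||
               it->second.nodes_returned > config.max_nodes_returned;
    }

    // Start a new monitoring window.
    void resetWindow() { profiles_.clear(); }

private:
    struct Profile {
        std::unordered_set<std::string> distinct_nodes;
        std::size_t nodes_returned = 0;
    };

    std::size_t total_nodes_;
    std::unordered_map<std::string, Profile> profiles_;
};
```

Tracking distinct nodes rather than raw query counts separates an analyst who re-reads the same neighbourhood from a client that walks the graph once, node by node. For large graphs the exact set would likely need replacing with an approximate counter to bound memory.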
Concept: Restricting number and speed of graph queries:
// Pseudo-code for Graph Rate Limiting
class GraphRateLimiter {
public:
    struct Limits {
        size_t max_nodes_per_query = 1000;
        size_t max_edges_per_query = 10000;
        size_t max_depth = 5;
        size_t queries_per_minute = 100;
    };

    // Check if operation is allowed
    std::pair<Status, bool> checkLimit(
        const std::string& user_id,
        const GraphOperation& op
    );

    // Update usage counters
    Status recordUsage(
        const std::string& user_id,
        const GraphOperation& op
    );
};
Limits:
- Maximum traversal depth
- Maximum number of returned nodes
- Query frequency per user
- Data volume per time window
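The per-query caps and the per-user query frequency can be combined in one check. The sketch below uses a token bucket per user; the fields of `GraphOperation`, the merged `checkAndRecord` method, and the caller-supplied clock are assumptions made so the example is self-contained and testable. Data volume per time window is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

struct Limits {
    std::size_t max_nodes_per_query = 1000;
    std::size_t max_edges_per_query = 10000;
    std::size_t max_depth = 5;
    std::size_t queries_per_minute = 100;
};

// Hypothetical description of a graph query; the real GraphOperation is not shown in this document.
struct GraphOperation {
    std::size_t nodes;
    std::size_t edges;
    std::size_t depth;
};

class GraphRateLimiterSketch {
public:
    explicit GraphRateLimiterSketch(Limits limits) : limits_(limits) {
        assert(limits.queries_per_minute > 0);
    }

    // Returns true and consumes one token if the operation is allowed.
    // now_seconds comes from the caller so the logic needs no real clock.
    bool checkAndRecord(const std::string& user_id, const GraphOperation& op,
                        double now_seconds) {
        // Static per-query caps are checked first and never consume a token.
        if (op.nodes > limits_.max_nodes_per_query ||
            op.edges > limits_.max_edges_per_query ||
            op.depth > limits_.max_depth) {
            return false;
        }
        const double capacity = static_cast<double>(limits_.queries_per_minute);
        auto it = buckets_.find(user_id);
        if (it == buckets_.end()) {
            it = buckets_.emplace(user_id, Bucket{capacity, now_seconds}).first;
        }
        Bucket& bucket = it->second;
        // Refill in proportion to elapsed time, never above capacity.
        const double elapsed = std::max(0.0, now_seconds - bucket.last_refill);
        bucket.tokens = std::min(capacity, bucket.tokens + elapsed * capacity / 60.0);
        bucket.last_refill = std::max(bucket.last_refill, now_seconds);
        if (bucket.tokens < 1.0) return false;
        bucket.tokens -= 1.0;
        return true;
    }

private:
    struct Bucket {
        double tokens;
        double last_refill;
    };

    Limits limits_;
    std::unordered_map<std::string, Bucket> buckets_;
};
```

A token bucket permits short bursts up to the per-minute capacity while holding the long-run rate at the configured limit, which suits interactive graph exploration better than a hard fixed window.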
ThemisDB already has robust security mechanisms:
- Field-Level Encryption (AES-256-GCM) - src/security/field_encryption.cpp
- Column Encryption - Selective encryption of sensitive fields
- Vector Encryption - Encryption of embeddings
- RBAC (Role-Based Access Control) - Ranger-compatible
- ABAC (Attribute-Based Access Control)
- mTLS - Mutual TLS for client authentication
- Audit Logging - 65+ event types, hash chain, PKI signatures
- SIEM Integration - Syslog RFC 5424, Splunk HEC
- Prometheus Metrics - Real-time monitoring
- Token Bucket Algorithm - Default 100 req/min
- Per-IP and Per-User Limits
- Configurable Thresholds
The following protections are not yet available:
- No native graph watermarking
- No embedding fingerprinting
- No specialized graph anomaly detection
- No graph-specific rate limits
- No DP mechanisms for aggregations
- No noise injection for queries
- No privacy budget management
- No ML-based anomaly detection
- No behavioral analysis
- No automatic threat classification
Files: src/index/graph_index.cpp, src/index/vector_index.cpp
Additional audit events:
enum class GraphAuditEvent {
    GRAPH_TRAVERSAL,    // BFS/DFS operations
    BULK_NODE_ACCESS,   // Large-scale node queries
    BULK_EDGE_ACCESS,   // Large-scale edge queries
    EMBEDDING_EXPORT,   // Vector embedding downloads
    GRAPH_EXPORT,       // Full graph exports
    TEMPORAL_QUERY      // Historical graph queries
};
File: src/security/rate_limiter.cpp (extend)
New configuration options:
graph_limits:
  max_traversal_depth: 5
  max_nodes_per_query: 1000
  max_edges_per_query: 10000
  max_embeddings_per_query: 500
  graph_queries_per_minute: 50
  bulk_export_enabled: false  # Disabled by default
New File: include/security/graph_access_monitor.h
Tracking metrics:
- Number of traversal operations per user
- Access patterns (temporal, spatial)
- Exported data volumes
- Suspicious query sequences
New Files:
- include/security/graph_watermark.h
- src/security/graph_watermark.cpp
Features:
- Deterministic watermark embedding
- Robustness against modifications
- Detection mechanisms
New Files:
- include/security/embedding_fingerprint.h
- src/security/embedding_fingerprint.cpp
Features:
- Imperceptible noise injection
- Fingerprint verification
- Secret key management
New Files:
- include/privacy/differential_privacy.h
- src/privacy/differential_privacy.cpp
Features:
- ε-differential privacy for aggregations
- Privacy budget management
- Laplace/Gaussian noise mechanisms
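As a rough illustration of these features, the sketch below pairs a Laplace mechanism for count queries with a simple privacy budget. The names `PrivacyBudget`, `sampleLaplace`, and `privateCount` are hypothetical, and the sampler relies on the fact that the difference of two independent exponential variables with rate 1/b follows a Laplace distribution with scale b. It assumes a C++17 compiler for `std::optional`.

```cpp
#include <cassert>
#include <optional>
#include <random>

// Tracks how much epsilon is still available for queries.
class PrivacyBudget {
public:
    explicit PrivacyBudget(double total_epsilon) : remaining_(total_epsilon) {}

    // Deducts epsilon if enough budget is left; otherwise leaves the budget untouched.
    bool spend(double epsilon) {
        if (epsilon <= 0.0 || epsilon > remaining_) return false;
        remaining_ -= epsilon;
        return true;
    }

    double remaining() const { return remaining_; }

private:
    double remaining_;
};

// Laplace(0, scale) sample as the difference of two Exp(1/scale) samples.
inline double sampleLaplace(double scale, std::mt19937_64& rng) {
    assert(scale > 0.0);
    std::exponential_distribution<double> expo(1.0 / scale);
    const double a = expo(rng);
    const double b = expo(rng);
    return a - b;
}

// Epsilon-DP count: noise scale is sensitivity / epsilon.
// Returns no value once the budget is exhausted.
inline std::optional<double> privateCount(double true_count, double epsilon,
                                          PrivacyBudget& budget,
                                          std::mt19937_64& rng,
                                          double sensitivity = 1.0) {
    if (!budget.spend(epsilon)) return std::nullopt;
    return true_count + sampleLaplace(sensitivity / epsilon, rng);
}
```

Refusing queries once the budget is spent is what keeps the cumulative privacy loss bounded; adding noise alone does not. A production implementation would also need a cryptographically secure random source and care around floating-point artifacts, neither of which this sketch addresses.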
New Files:
- include/security/ml_anomaly_detector.h
- src/security/ml_anomaly_detector.cpp
Features:
- Online learning for user profiles
- Behavioral anomaly detection
- Automatic alerting
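One simple way to approach online user profiles is a running mean and variance per user (Welford's algorithm) with a z-score test. This is a statistical baseline and not a full ML model; the class names, the z-score threshold of 3, and the minimum of 10 samples are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Running mean/variance of one metric (for example, nodes returned per query).
class OnlineProfile {
public:
    // Welford's update: numerically stable, one pass, constant memory.
    void update(double x) {
        ++count_;
        const double delta = x - mean_;
        mean_ += delta / static_cast<double>(count_);
        m2_ += delta * (x - mean_);
    }

    double mean() const { return mean_; }

    // Sample standard deviation; 0 until at least two values were seen.
    double stddev() const {
        return count_ > 1 ? std::sqrt(m2_ / static_cast<double>(count_ - 1)) : 0.0;
    }

    // A value is anomalous if it lies more than z_threshold deviations from the mean.
    bool isAnomalous(double x, double z_threshold = 3.0,
                     std::size_t min_samples = 10) const {
        if (count_ < min_samples) return false;  // not enough history yet
        const double sd = stddev();
        if (sd == 0.0) return x != mean_;
        return std::fabs(x - mean_) / sd > z_threshold;
    }

private:
    std::size_t count_ = 0;
    double mean_ = 0.0;
    double m2_ = 0.0;
};

class BehaviorAnomalyDetector {
public:
    // Returns true if the value is anomalous for this user.
    // Only normal values update the profile, so outliers do not shift the baseline.
    bool observe(const std::string& user_id, double value) {
        OnlineProfile& profile = profiles_[user_id];
        if (profile.isAnomalous(value)) return true;
        profile.update(value);
        return false;
    }

private:
    std::unordered_map<std::string, OnlineProfile> profiles_;
};
```

A return value of `true` from `observe` is the natural hook for the automatic alerting listed above. Excluding anomalous values from the profile avoids an attacker slowly training the baseline upward in a single burst, although a patient attacker could still drift it gradually.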
File: config/config.yaml (extend)
security:
  # Existing configuration
  encryption:
    enabled: true
    field_encryption: true
  rbac:
    enabled: true
  audit:
    enabled: true
    events: all

  # New graph protection configuration
  graph_protection:
    # Access Monitoring
    access_monitoring:
      enabled: true
      track_traversals: true
      track_exports: true
      anomaly_detection: true

    # Rate Limiting
    rate_limits:
      max_traversal_depth: 5
      max_nodes_per_query: 1000
      max_edges_per_query: 10000
      max_embeddings_per_query: 500
      queries_per_minute: 50

    # Export Controls
    export_controls:
      bulk_export_enabled: false
      require_approval: true
      max_export_size_mb: 100

    # Watermarking (Phase 2)
    watermarking:
      enabled: false  # Optional, after implementation
      strength: medium
      key_rotation_days: 90

    # Differential Privacy (Phase 3)
    differential_privacy:
      enabled: false  # Optional, after implementation
      epsilon: 1.0
      delta: 1e-5
Combine multiple protection layers:
┌─────────────────────────────────────────────┐
│ Layer 1: Authentication (mTLS, JWT) │
├─────────────────────────────────────────────┤
│ Layer 2: Authorization (RBAC, ABAC) │
├─────────────────────────────────────────────┤
│ Layer 3: Rate Limiting (Query Throttling) │
├─────────────────────────────────────────────┤
│ Layer 4: Access Monitoring (Audit Logs) │
├─────────────────────────────────────────────┤
│ Layer 5: Encryption (Field/Column) │
├─────────────────────────────────────────────┤
│ Layer 6: Watermarking (Optional) │
└─────────────────────────────────────────────┘
Grant minimal required permissions:
# Example: Restrictive graph permissions
roles:
  - name: data_analyst
    permissions:
      graph:
        read: true
        traverse: true
        max_depth: 3    # Limited
        export: false   # No export
  - name: data_scientist
    permissions:
      graph:
        read: true
        traverse: true
        max_depth: 5
        export: true
        export_approval_required: true  # Approval workflow
Monitor suspicious activities:
# Example: Prometheus alert for suspicious graph access
groups:
  - name: graph_security
    rules:
      - alert: SuspiciousGraphTraversal
        # Fires when the 95th-percentile traversal depth over 5 minutes exceeds 10
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(themis_graph_traversal_depth_bucket[5m]))) > 10
        annotations:
          summary: "Unusually deep graph traversal detected"
      - alert: BulkGraphExport
        expr: |
          rate(themis_graph_nodes_exported[5m]) > 1000
        annotations:
          summary: "Large-scale graph export detected"
Conduct regular audits:
- Weekly: Review audit logs for anomalies
- Monthly: Analyze access patterns
- Quarterly: Penetration tests
- Annually: Comprehensive security assessments
Prepare for security incidents:
- Detection: Automatic alerting on anomalies
- Analysis: Investigation of incident
- Containment: Immediate blocking of affected accounts
- Eradication: Removal of backdoors/malware
- Recovery: Restoration of normal operations
- Lessons Learned: Post-mortem analysis
ThemisDB has solid baseline security:
- ✅ Encryption (Field/Column/Vector)
- ✅ RBAC/ABAC
- ✅ Audit Logging
- ✅ Rate Limiting
Phase 1 (Immediate):
- Enhanced audit logs for graph operations
- Graph-specific rate limits
- Access pattern monitoring
- Export controls
Phase 2 (Medium-term):
- Graph watermarking
- Embedding fingerprinting
- Enhanced anomaly detection
Phase 3 (Long-term):
- Differential privacy
- ML-based anomaly detection
- Advanced threat intelligence
High (Immediate):
- Configure graph-specific rate limits
- Enable audit logs for graph operations
- Set up monitoring alerts
Medium (3-6 months):
- Implement access pattern monitoring
- Refine export controls
Low (6-12 months):
- Evaluate watermarking/fingerprinting
- Differential privacy for specific use cases
Graph Watermarking:
- "Watermarking Techniques for Knowledge Graphs" (2025)
- "Robust Graph Structure Fingerprinting" (2024)
Embedding Protection:
- "Protecting Vector Embeddings from Theft" (2025)
- "Imperceptible Perturbations for Embedding Security" (2024)
Training Data Protection:
- "Making Stolen Data Unusable for AI Training" (2026)
- "Data Poisoning Defenses for Knowledge Graphs" (2025)
Author: ThemisDB Security Team
Last Updated: April 2026
Review Status: Draft - Implementation pending