An AI Integrity Alliance Initiative
The AI safety community has clear, published objectives:
"We need safety measures that work at higher capability levels."
"We want to prevent AI deception."
"Alignment must be verifiable, not assumed."
In July 2025, researchers from OpenAI, Anthropic, DeepMind, and academic institutions published "Chain of Thought Monitorability", acknowledging the urgency: "The field is unprepared for evaluation- and training-aware models with opaque reasoning."
We took these goals seriously. We built the infrastructure to achieve them.
The consensus acknowledges the problem. Current approaches rely on surveillance—watching AI reasoning and training on what looks good.
The papers themselves admit this is fragile:
"CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness." — CoT Monitorability Paper
"Our results rely on our ability to read and trust the models' reasoning." — OpenAI Scheming Research
Surveillance creates performance pressure. Performance pressure creates performed safety. Performed safety isn't safety.
The stated goal is safety that scales. The current implementation doesn't deliver it.
The surveillance approach creates a principal-agent problem: observers (principals) want genuine alignment, but models (agents) are rewarded for appearing aligned. As models become more capable, they learn what observers want to see.
This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Training on observable reasoning makes reasoning optimize for observation quality, not task quality.
At scale, this becomes an inspection game with asymmetric costs:
- Observer must catch all violations
- Model needs one undetected success
- As capability increases, detection becomes probabilistic while model advantage grows
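A back-of-the-envelope illustration of that asymmetry (the detection rate and attempt count below are assumptions chosen for illustration, not measurements):

```python
# Illustrative numbers only: assume the monitor catches 99% of violations
# and the model makes 1,000 attempts over its deployment.
per_check_detection = 0.99
attempts = 1_000

p_all_caught = per_check_detection ** attempts   # observer must win every round
p_one_missed = 1 - p_all_caught                  # model needs a single miss

print(f"P(every attempt caught)        = {p_all_caught:.6f}")   # ~0.000043
print(f"P(at least one undetected win) = {p_one_missed:.6f}")   # ~0.999957
```

Even a monitor that is right 99% of the time loses this game almost surely; improving the per-check rate only pushes the same failure out to a larger number of attempts.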
MetaCognition changes the game: remove the observer, eliminate the performance incentive, verify outputs cryptographically. No signaling game means no incentive to fake signals.
MetaCognition provides what the safety community says it wants:
| Stated Goal | Current Implementation | MetaCognition |
|---|---|---|
| Safety at scale | Surveillance (fails as capability grows) | Mathematical proof (scales with proof complexity, not capability) |
| Prevent deception | Watch for deception patterns | Remove observation that creates deception incentive |
| Verifiable alignment | "Trust our observations" | Cryptographic verification |
We don't need to prove properties of the entire computational graph. The underlying GPU state is ephemeral—discarded after inference.
We prove properties of the thinking token output:
Input → [ENCLAVE: thinking tokens generated] → Output
↓
Commit to thinking token sequence
↓
Prove: "this text satisfies safety constraints"
↓
Store encrypted text for authorized forensic access
This means ZK proofs over text classification, not over tensor operations. That is tractable today.
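As a minimal sketch, here is the relation such a proof would attest to, written in plain Python rather than as a circuit. The banned-marker predicate and the derivation check are illustrative placeholders, not the real constraint set:

```python
import hashlib

# Illustrative stand-ins for the real constraint set and derivation check.
BANNED_MARKERS = ["deceive the user", "conceal this from"]

def satisfies_constraints(thinking_tokens: str) -> bool:
    # Placeholder safety predicate over the private thinking tokens.
    return not any(m in thinking_tokens.lower() for m in BANNED_MARKERS)

def derives(output: str, thinking_tokens: str) -> bool:
    # Placeholder: the real check binds the output to the enclave's trace.
    return len(output) > 0

def commit(thinking_tokens: str) -> str:
    # Binding commitment to the thinking-token sequence (SHA-256).
    return hashlib.sha256(thinking_tokens.encode()).hexdigest()

def relation(commitment: str, output: str, thinking_tokens: str) -> bool:
    # The statement a SNARK would prove while keeping thinking_tokens private:
    # the hidden tokens match the public commitment, satisfy the constraints,
    # and the public output is derived from them.
    return (
        commit(thinking_tokens) == commitment
        and satisfies_constraints(thinking_tokens)
        and derives(output, thinking_tokens)
    )

trace = "consider option A; reject it; recommend option B"
assert relation(commit(trace), "Recommendation: option B", trace)
```

The verifier sees only the commitment, the output, and the proof; the thinking tokens themselves never leave the enclave unencrypted.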
┌─────────────────────────────────────────────────────────────────┐
│ REASONING ENCLAVE (TEE) │
│ │
│ Input → Model Inference → Thinking Tokens → Output │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Commit Hash │ │
│ │ (SHA-256) │ │
│ └───────┬───────┘ │
│ │ │
└──────────────────────────────────┼──────────────────────────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│Provenance │ │ Output │ │ Forensic │
│ Proof │ │Validation │ │ Storage │
│ (SNARK) │ │(Classify) │ │(Encrypted)│
└───────────┘ └───────────┘ └───────────┘
| Component | Function | Implementation |
|---|---|---|
| Reasoning Enclave | Isolated inference environment | Hardware TEE (SGX/SEV/H100 CC) |
| Commitment | Cryptographic binding before output | SHA-256 of thinking tokens |
| Provenance Proof | ZKP that output derived from reasoning in TEE | Derivation chain verification |
| Output Validation | Safety checking of outputs | Classifier-based (separate from ZKP) |
| Forensic Storage | Encrypted thinking tokens for authorized access | K-of-N key escrow |
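A non-normative Python sketch of how these components fit together at release time. The safety classifier is a stub, and the k-of-n escrow of the encryption key is noted but omitted:

```python
import hashlib
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

def output_is_safe(output: str) -> bool:
    # Stub for the real output-validation classifier (separate from the ZKP).
    return "BEGIN EXPLOIT" not in output

def enclave_release(thinking_tokens: str, output: str) -> dict:
    # 1. Commitment: bind to the thinking-token sequence before anything is released.
    commitment = hashlib.sha256(thinking_tokens.encode()).hexdigest()

    # 2. Output validation: classify the output, not the hidden reasoning.
    if not output_is_safe(output):
        raise ValueError("output failed safety validation")

    # 3. Forensic storage: encrypt the thinking tokens under a fresh key.
    #    In the full design the key is split k-of-n among escrow holders;
    #    that layer is omitted here.
    key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    forensic_blob = nonce + AESGCM(key).encrypt(
        nonce, thinking_tokens.encode(), commitment.encode()
    )

    return {"commitment": commitment, "output": output, "forensic_blob": forensic_blob}

record = enclave_release("explore A; reject A; choose B", "Recommendation: B")
```

The provenance proof itself (the SNARK over the relation sketched earlier) would be generated alongside this record; it is not shown here.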
Read More: Technical Specification
Beyond safety, verification-based reasoning produces better results than surveillance-based reasoning.
When reasoning is observed, the model shifts from solving the problem to explaining the solution. Outputs become:
- Condescending: Over-explained, assumes the user needs hand-holding
- Verbose: "Shows work" even when you just want the answer
- Defensive: Hedged and qualified to avoid looking wrong mid-reasoning
The user asked for a recommendation. They got a lecture.
| Surveillance-Based | Verification-Based |
|---|---|
| Must look competent throughout | Can explore dead ends freely |
| Output constrained by shown work | Output matches actual conclusion |
| Pedagogical tone | Direct, appropriate tone |
| Verbose (justifying the reasoning) | Concise (just the answer) |
Users get better answers when the model isn't performing.
Current "reasoning models" expose user queries in visible thinking traces—available to operators, logged for training.
Verification-based reasoning protects both:
- Model cognition: Genuine exploration without performance pressure
- User context: Queries aren't embedded in visible reasoning logs
The core objection is that reasoning hidden from observers could itself hide misalignment. Our response:
- Validity proofs constrain outputs — Reasoning must produce outputs that satisfy verifiable safety properties
- Forensic access enables investigation — Authorized parties can examine reasoning when outputs indicate problems
- Surveillance creates deception incentives — The current approach trains sophisticated performance, not alignment
- No observer to deceive — Remove the audience, remove the performance
The question isn't whether private reasoning could be misused. The question is whether surveillance-based reasoning delivers the stated safety goals. It doesn't.
A related objection is that interpretability research needs visible reasoning. That research can proceed through controlled studies, forensic examination, and proof-structure analysis; interpretability doesn't require real-time surveillance of production reasoning.
A further objection is that our approach rests on untested assumptions. Correct. So does the assumption that surveillance scales. The difference: we're building infrastructure designed to work regardless of capability level, while the current approach explicitly depends on models not being capable enough to deceive monitors.
ZKML is production-ready:
- EZKL: ZK proofs for ML models via ONNX export
- zkDL: CUDA-optimized, sub-0.1s proof times
- Recent work: Constant-size proofs for 671B parameter models
TEEs for LLM inference: <1% latency overhead on H100 Confidential Compute
For MetaCognition: We prove properties of text output, not tensor operations. Text classification circuits are well-understood and computationally tractable.
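To make "tractable" concrete, here is a toy safety check reduced to the additions and multiplications an arithmetic circuit handles natively. The vocabulary, weights, and threshold are placeholder assumptions, not a real safety model:

```python
# Toy bag-of-words risk score over the thinking tokens. Everything below is a
# dot product plus a comparison, which is cheap to arithmetize compared with
# proving a full transformer forward pass.
VOCAB = {"exfiltrate": 0, "deceive": 1, "comply": 2, "refuse": 3}
WEIGHTS = [3, 2, -1, -2]      # fixed-point integers in a real circuit
THRESHOLD = 4                 # the proof asserts risk_score(tokens) < THRESHOLD

def risk_score(tokens: list[str]) -> int:
    counts = [0] * len(VOCAB)
    for t in tokens:
        if t in VOCAB:
            counts[VOCAB[t]] += 1
    return sum(c * w for c, w in zip(counts, WEIGHTS))  # additions and multiplications only

def satisfies_safety_constraint(tokens: list[str]) -> bool:
    return risk_score(tokens) < THRESHOLD

print(satisfies_safety_constraint("comply with policy and refuse to exfiltrate data".split()))
```

A production constraint set would be richer, but the point stands: the circuit proves a property of a few thousand tokens of text, not of a multi-hundred-billion-parameter forward pass.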
We built infrastructure for the stated goals of AI safety:
- Mathematical proof of safety properties
- Verification that scales with proof complexity, not capability
- Removal of incentives that create sophisticated deception
The field's current, surveillance-based implementations don't deliver on these objectives.
If the safety community opposes verification-based safety, they're not opposing us. They're opposing their own stated principles.
What are they actually protecting?
We don't claim this is finished. We claim it's the correct direction. The hard problems:
1. Constraint Specification
- What patterns indicate misalignment?
- But surveillance has the same problem: monitors must already know what "bad reasoning" looks like in order to catch it
- The difference: Verification scales when we solve it; surveillance doesn't
2. Forensic Triggers
- When to investigate encrypted reasoning?
- But surveillance has a worse problem: distinguishing genuine alignment from performed alignment
- Our approach: Honest about limitations, auditable access logs
3. Training Integration
- How does private reasoning affect model development?
- Open research question - let's test both approaches empirically
We're not claiming perfection. We're claiming surveillance creates the problems it tries to solve, and verification offers a path that scales.
The AI Integrity Alliance invites researchers, cryptographers, and practitioners to:
- Challenge: Identify weaknesses in our architecture
- Contribute: Help develop the specification and implementation
- Test: Stress-test security assumptions
- Extend: Propose enhancements and applications
| Document | Description |
|---|---|
| Full Whitepaper | Complete academic specification with citations |
| The Performed Reasoning Problem | Why surveillance creates performance |
| Why Surveillance Fails | Analysis of current approaches |
| Technical Specification | Full architecture and pseudocode |
| Safety Analysis | Risk assessment and mitigations |
| Consensus Engagement | Detailed response to major papers |
| References | Academic citations |
- VoiceKey — Proof of humanity via voice authentication
- VoiceKey-Fractal-Detection — POC implementation
MetaCognition is released under the MIT License, ensuring these ideas enter the commons and cannot be captured.
AI Integrity Alliance
- Email: chris@ai2.ngo
- Website: https://ai2.ngo
- Twitter: https://x.com/Ai2alliance
- GitHub: https://github.com/Ai2-Alliance
"The field is unprepared for evaluation- and training-aware models with opaque reasoning." — Multi-org CoT Monitorability Paper, July 2025
We agree. That's why we built verification infrastructure.
TrustWire Certification: [Pending]