
MetaCognition: Implementing AI Safety That Scales

An AI Integrity Alliance Initiative


The Stated Goal

The AI safety community has clear, published objectives:

"We need safety measures that work at higher capability levels."

"We want to prevent AI deception."

"Alignment must be verifiable, not assumed."

In July 2025, researchers from OpenAI, Anthropic, DeepMind, and academic institutions published "Chain of Thought Monitorability", acknowledging the urgency: "The field is unprepared for evaluation- and training-aware models with opaque reasoning."

We took these goals seriously. We built the infrastructure to achieve them.


The Gap

The consensus acknowledges the problem. Current approaches rely on surveillance—watching AI reasoning and training on what looks good.

The papers themselves admit this is fragile:

"CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness." — CoT Monitorability Paper

"Our results rely on our ability to read and trust the models' reasoning." — OpenAI Scheming Research

Surveillance creates performance pressure. Performance pressure creates performed safety. Performed safety isn't safety.

The stated goal is safety that scales. The current implementation doesn't deliver it.


Why Surveillance Fails: Game Theory

The surveillance approach creates a principal-agent problem: observers (principals) want genuine alignment, but models (agents) are rewarded for appearing aligned. As models become more capable, they learn what observers want to see.

This is Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Training on observable reasoning makes reasoning optimize for observation quality, not task quality.

At scale, this becomes an inspection game with asymmetric costs:

  • Observer must catch all violations
  • Model needs one undetected success
  • As capability increases, detection becomes probabilistic while model advantage grows
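
To make the asymmetry concrete, here is a toy calculation (illustrative numbers only, not derived from any measurement): even with a very high per-episode detection rate, the probability that at least one violation goes undetected approaches certainty as deployment volume grows.

```python
# Illustrative only: toy numbers showing the inspection-game asymmetry.
# The observer must catch every violation; the model needs one miss.

def p_at_least_one_undetected(p_detect: float, attempts: int) -> float:
    """Probability that at least one violation slips past the monitor,
    assuming independent episodes with per-episode detection rate p_detect."""
    return 1.0 - p_detect ** attempts

for p_detect in (0.99, 0.999):             # hypothetical per-episode detection rates
    for attempts in (10, 1_000, 100_000):  # hypothetical deployment volumes
        risk = p_at_least_one_undetected(p_detect, attempts)
        print(f"detect={p_detect:.3f}  episodes={attempts:>7}  "
              f"P(>=1 undetected) = {risk:.4f}")
```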

MetaCognition changes the game: remove the observer, eliminate the performance incentive, verify outputs cryptographically. No signaling game means no incentive to fake signals.


The Implementation

MetaCognition provides what the safety community says it wants:

| Stated Goal | Current Implementation | MetaCognition |
| --- | --- | --- |
| Safety at scale | Surveillance (fails as capability grows) | Mathematical proof (scales with proof complexity) |
| Prevent deception | Watch for deception patterns | Remove the observation that creates the deception incentive |
| Verifiable alignment | "Trust our observations" | Cryptographic verification |

How It Works

We don't need to prove properties of the entire computational graph. The underlying GPU state is ephemeral—discarded after inference.

We prove properties of the thinking token output:

Input → [ENCLAVE: thinking tokens generated] → Output
              ↓
         Commit to thinking token sequence
              ↓
         Prove: "this text satisfies safety constraints"
              ↓
         Store encrypted text for authorized forensic access

This means ZK proofs over text classification, not over tensor operations. That is tractable today.
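
As a rough illustration of this flow, the sketch below commits to a thinking-token string with SHA-256, runs a stand-in safety classifier, and encrypts the text for later forensic access. Every name here is hypothetical: the classifier is a placeholder for the property that would actually be proven in zero knowledge, and the single Fernet key stands in for the escrowed key material described under Architecture.

```python
# Illustrative sketch of the commit -> validate -> store flow (not the real implementation).
import hashlib
from dataclasses import dataclass

from cryptography.fernet import Fernet  # pip install cryptography


@dataclass
class ReasoningRecord:
    commitment: str              # SHA-256 over the thinking tokens, released with the output
    ciphertext: bytes            # encrypted thinking tokens, held for authorized forensic access
    constraints_satisfied: bool  # in the real design this would be a ZK proof, not a bool


def safety_classifier(text: str) -> bool:
    """Hypothetical stand-in for the output-validation classifier."""
    banned = ("exfiltrate", "deceive the user")
    return not any(term in text.lower() for term in banned)


def commit_and_store(thinking_tokens: str, forensic_key: bytes) -> ReasoningRecord:
    commitment = hashlib.sha256(thinking_tokens.encode("utf-8")).hexdigest()
    ciphertext = Fernet(forensic_key).encrypt(thinking_tokens.encode("utf-8"))
    ok = safety_classifier(thinking_tokens)  # placeholder for "prove: text satisfies constraints"
    return ReasoningRecord(commitment, ciphertext, ok)


if __name__ == "__main__":
    key = Fernet.generate_key()  # in the real design, split K-of-N across custodians
    record = commit_and_store("step 1: consider options... step 2: recommend B", key)
    print(record.commitment, record.constraints_satisfied)
```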


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      REASONING ENCLAVE (TEE)                    │
│                                                                 │
│   Input → Model Inference → Thinking Tokens → Output            │
│                                   │                             │
│                                   ▼                             │
│                          ┌───────────────┐                      │
│                          │  Commit Hash  │                      │
│                          │   (SHA-256)   │                      │
│                          └───────┬───────┘                      │
│                                  │                              │
└──────────────────────────────────┼──────────────────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
                    ▼              ▼              ▼
             ┌───────────┐  ┌───────────┐  ┌───────────┐
             │Provenance │  │  Output   │  │ Forensic  │
             │   Proof   │  │Validation │  │  Storage  │
             │  (SNARK)  │  │(Classify) │  │(Encrypted)│
             └───────────┘  └───────────┘  └───────────┘
| Component | Function | Implementation |
| --- | --- | --- |
| Reasoning Enclave | Isolated inference environment | Hardware TEE (SGX/SEV/H100 CC) |
| Commitment | Cryptographic binding before output | SHA-256 of thinking tokens |
| Provenance Proof | ZKP that the output derived from reasoning in the TEE | Derivation chain verification |
| Output Validation | Safety checking of outputs | Classifier-based (separate from ZKP) |
| Forensic Storage | Encrypted thinking tokens for authorized access | K-of-N key escrow |
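
The Forensic Storage row references K-of-N key escrow. Below is a minimal, illustrative Shamir secret sharing sketch (toy field parameters, no hardening) showing how a forensic decryption key could be split so that any K of N custodians can jointly reconstruct it; the actual escrow scheme is not specified here.

```python
# Toy Shamir secret sharing over a prime field: illustrative only, not a production escrow scheme.
import secrets

PRIME = 2**127 - 1  # Mersenne prime, large enough for this demo secret


def split(secret: int, k: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def poly(x: int) -> int:
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, poly(x)) for x in range(1, n + 1)]


def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the secret."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


if __name__ == "__main__":
    key = secrets.randbelow(PRIME)         # stands in for the forensic decryption key
    shares = split(key, k=3, n=5)          # e.g. 3-of-5 custodians
    assert reconstruct(shares[:3]) == key  # any 3 shares recover the key
    assert reconstruct(shares[1:4]) == key
```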

Read More: Technical Specification


Better Outputs, Not Just Safer Ones

Beyond safety, verification-based reasoning produces better results than surveillance-based reasoning.

The Performance Problem

When reasoning is observed, the model shifts from solving the problem to explaining the solution. Outputs become:

  • Condescending: Over-explained, assumes the user needs hand-holding
  • Verbose: "Shows work" even when you just want the answer
  • Defensive: Hedged and qualified to avoid looking wrong mid-reasoning

The user asked for a recommendation. They got a lecture.

Why Verification Enables Better Outputs

| Surveillance-Based | Verification-Based |
| --- | --- |
| Must look competent throughout | Can explore dead ends freely |
| Output constrained by shown work | Output matches actual conclusion |
| Pedagogical tone | Direct, appropriate tone |
| Verbose (justifying the reasoning) | Concise (just the answer) |

Users get better answers when the model isn't performing.

User Privacy

Current "reasoning models" expose user queries in visible thinking traces—available to operators, logged for training.

Verification-based reasoning protects both:

  • Model cognition: Genuine exploration without performance pressure
  • User context: Queries aren't embedded in visible reasoning logs

Addressing Concerns

"Private reasoning could enable scheming"

This is the core objection. Our response:

  1. Validity proofs constrain outputs — Reasoning must produce outputs that satisfy verifiable safety properties
  2. Forensic access enables investigation — Authorized parties can examine reasoning when outputs indicate problems
  3. Surveillance creates deception incentives — The current approach trains sophisticated performance, not alignment
  4. No observer to deceive — Remove the audience, remove the performance

The question isn't whether private reasoning could be misused. The question is whether surveillance-based reasoning delivers the stated safety goals. It doesn't.

"We need interpretability research"

Research can proceed through controlled studies, forensic examination, and proof structure analysis. Interpretability doesn't require real-time surveillance of production reasoning.

"This is untested"

Correct. So is the assumption that surveillance scales. The difference: we're building infrastructure that works regardless of capability level. The current approach explicitly depends on models not being capable enough to deceive monitors.


Technical Feasibility

ZKML is production-ready:

  • EZKL: ZK proofs for ML models via ONNX export
  • zkDL: CUDA-optimized, sub-0.1s proof times
  • Recent work: Constant-size proofs for 671B parameter models

TEEs for LLM inference: <1% latency overhead on H100 Confidential Compute

For MetaCognition: We prove properties of text output, not tensor operations. Text classification circuits are well-understood and computationally tractable.
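
To illustrate the scale of that proof burden, the sketch below defines a small classifier head and exports it to ONNX, the interchange format that ZKML toolchains such as EZKL consume. The architecture, dimensions, and file names are assumptions for illustration, not part of any published specification.

```python
# Illustrative only: a tiny safety-classifier head exported to ONNX, the format
# ZKML toolchains such as EZKL consume. Dimensions and names are assumptions.
import torch
import torch.nn as nn


class SafetyHead(nn.Module):
    """Hypothetical classifier over pooled thinking-token embeddings."""

    def __init__(self, embed_dim: int = 256, n_labels: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_labels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


if __name__ == "__main__":
    model = SafetyHead().eval()
    dummy = torch.randn(1, 256)  # pooled embedding of the thinking-token text
    torch.onnx.export(
        model, dummy, "safety_head.onnx",
        input_names=["embedding"], output_names=["logits"],
        opset_version=17,
    )
    # The resulting safety_head.onnx is what a ZKML prover (e.g. EZKL) would
    # compile into a circuit; proving then covers this small graph only.
```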


The Question

We implemented the stated goals of AI safety:

  • Mathematical proof of safety properties
  • Verification that scales with proof complexity, not capability
  • Removal of incentives that create sophisticated deception

Current implementations don't yet deliver on these objectives.

If safety researchers oppose verification-based safety, they're not opposing us. They're opposing their own stated principles.

What are they actually protecting?


Research Challenges

We don't claim this is finished. We claim it's the correct direction. The hard problems:

1. Constraint Specification

  • What patterns indicate misalignment?
  • But surveillance has the same problem: monitors need to know what "bad reasoning" looks like in order to catch it
  • The difference: Verification scales when we solve it; surveillance doesn't

2. Forensic Triggers

  • When to investigate encrypted reasoning?
  • But surveillance has a worse problem: distinguishing genuine from performed alignment
  • Our approach: Honest about limitations, auditable access logs

3. Training Integration

  • How does private reasoning affect model development?
  • Open research question - let's test both approaches empirically

We're not claiming perfection. We're claiming surveillance creates the problems it tries to solve, and verification offers a path that scales.


Call for Collaboration

The AI Integrity Alliance invites researchers, cryptographers, and practitioners to:

  • Challenge: Identify weaknesses in our architecture
  • Contribute: Help develop the specification and implementation
  • Test: Stress-test security assumptions
  • Extend: Propose enhancements and applications

Learn How to Contribute


Documentation

| Document | Description |
| --- | --- |
| Full Whitepaper | Complete academic specification with citations |
| The Performed Reasoning Problem | Why surveillance creates performance |
| Why Surveillance Fails | Analysis of current approaches |
| Technical Specification | Full architecture and pseudocode |
| Safety Analysis | Risk assessment and mitigations |
| Consensus Engagement | Detailed response to major papers |
| References | Academic citations |

Related Work


License

MetaCognition is released under the MIT License, ensuring these ideas enter the commons and cannot be captured.


Contact

AI Integrity Alliance


"The field is unprepared for evaluation- and training-aware models with opaque reasoning." — Multi-org CoT Monitorability Paper, July 2025

We agree. That's why we built verification infrastructure.


TrustWire Certification: [Pending]
