
Complexity MoE + PID Dynamics (Token-Routed I64)#224

Closed
Complexity-ML wants to merge 8 commits into openai:main from Complexity-ML:complexity-moe-pid

Conversation

@Complexity-ML

Summary

Novel architecture from Complexity Framework:

  • Token-Routed MoE — 4 experts, deterministic routing (token_id % 4), mask-multiply (fullgraph safe)
  • PID Dynamics — mu traverses all 9 layers, tight clamping for stability
  • SwiGLU activation replacing relu²
  • Cosine Warm Restarts (SGDR) LR schedule

14.7M params, under 16MB cap. Awaiting compute credits for final val_bpb.
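The mask-multiply dispatch described above can be sketched as follows (a minimal numpy stand-in for the PyTorch experts; function and argument names are hypothetical, not from the PR):

```python
import numpy as np

def moe_mask_multiply(token_ids, x, expert_weights, num_experts=4):
    """Deterministic token-routed MoE: expert = token_id % num_experts.

    Instead of gathering tokens per expert (data-dependent shapes),
    every expert processes every token and a 0/1 mask zeroes out the
    tokens it does not own. Shapes stay static, which is what makes
    this path safe for torch.compile with fullgraph=True.
    """
    routes = token_ids % num_experts              # (T,) expert index per token
    out = np.zeros_like(x)
    for e in range(num_experts):
        mask = (routes == e).astype(x.dtype)[:, None]   # (T, 1) ownership mask
        out += mask * (x @ expert_weights[e])           # masked expert output
    return out
```

Each token's output comes from exactly one expert, but all four expert GEMMs run on every token; that redundancy is the price of static shapes (the later scatter-dispatch CUDA kernel in this PR removes it at eval time).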

Status

⏳ Pending training results — compute credits requested.

🤖 Generated with Claude Code

Token-Routed MoE (4 experts, deterministic routing) + PID dynamics
with mu traversing all layers + SwiGLU + Cosine Warm Restarts.
14.7M params, under 16MB cap. Awaiting compute for final val_bpb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
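The PR does not show how the PID dynamics are wired in (in particular, what the error signal for mu is), but a generic discrete PID step with the tight clamping the summary mentions would look roughly like this (gains and clamp range are assumptions for illustration):

```python
def pid_update(mu, error, state, kp=0.1, ki=0.01, kd=0.05,
               clamp=(-1.0, 1.0)):
    """One discrete PID step for a per-layer scalar `mu`.

    `state` carries (integral, previous error) between steps; the
    result is clamped to a tight range, matching the stability note
    in the PR summary. Gains kp/ki/kd here are hypothetical.
    """
    integral, prev_error = state
    integral += error
    derivative = error - prev_error
    mu = mu + kp * error + ki * integral + kd * derivative
    mu = max(clamp[0], min(clamp[1], mu))        # tight clamp for stability
    return mu, (integral, error)
```

With mu threaded through all 9 layers, each layer would apply one such update before passing mu on.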
- LearnedHashRouter: Linear(H, E) micro-router, soft routing, fullgraph safe
- 3 routing modes: modulo | learned | hybrid (modulo base + learned override)
- Hybrid starts deterministic, learns to override when beneficial
- Cosine warm restarts (SGDR): cycles 5k/10k/20k, peak decay 0.7x
- Condensed comments to stay under 1500 lines (1497)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
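The SGDR schedule this commit describes (cycle lengths 5k/10k/20k, each restart's peak scaled by 0.7x) can be sketched as below; the base LR value and the behavior past the last declared cycle are assumptions:

```python
import math

def sgdr_lr(step, base_lr=1e-3, cycles=(5000, 10000, 20000), peak_decay=0.7):
    """Cosine annealing with warm restarts (SGDR).

    LR decays from the cycle's peak toward 0 along a half-cosine;
    at each restart the peak is multiplied by `peak_decay` (0.7x in
    the PR). Steps past the last cycle repeat the final cycle length.
    """
    peak = base_lr
    for length in cycles:
        if step < length:
            return 0.5 * peak * (1 + math.cos(math.pi * step / length))
        step -= length
        peak *= peak_decay
    # past all declared cycles: keep annealing at the last decayed peak
    last = cycles[-1]
    return 0.5 * peak * (1 + math.cos(math.pi * (step % last) / last))
```

At step 0 the LR is `base_lr`; at step 5000 it restarts at `0.7 * base_lr`, and so on.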
@Complexity-ML
Author

Update v2 — Learned Hash Router (Hybrid Mode)

Added context-aware routing on top of deterministic base:

  • LearnedHashRouter: Linear(H, E) micro-router, soft routing, fullgraph safe
  • 3 routing modes: modulo | learned | hybrid
  • Hybrid: starts with modulo stability, learns to override when beneficial
  • Added Cosine Warm Restarts (SGDR) LR schedule
  • 1497 lines, under 1500 cap

Awaiting compute credits to train and produce val_bpb.
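The hybrid mode above (modulo base, learned override) is not fully specified in the PR; one plausible reading, with a confidence-threshold override rule that is purely an assumption, looks like this:

```python
import numpy as np

def hybrid_route(token_ids, h, W_router, num_experts=4, threshold=0.9):
    """Hybrid routing: deterministic modulo base, learned override.

    A Linear(H, E) micro-router scores each token's hidden state; when
    its softmax confidence exceeds `threshold` (hypothetical rule, not
    from the PR), the learned choice replaces token_id % num_experts.
    """
    logits = h @ W_router                        # (T, E) router scores
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    learned = probs.argmax(axis=-1)
    confident = probs.max(axis=-1) > threshold
    base = token_ids % num_experts               # deterministic fallback
    return np.where(confident, learned, base)
```

With an untrained (zero) router the softmax is uniform, nothing clears the threshold, and routing reduces to pure modulo, matching the "starts with modulo stability" description.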

- 11 layers (was 9) + MLP 2x expansion (was 1x)
- 26.5M params, ~14.2MB with int6+zstd (fits 16MB)
- Hybrid learned router + PID dynamics unchanged
- Need int6+zstd quantizer for final artifact

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Complexity-ML
Author

Update v2 — Scaled up to 26.5M params

  • 11 layers (was 9), MLP 2x expansion (was 1x)
  • 26.5M params, estimated ~14.2MB with int6+zstd → fits 16MB
  • Hybrid learned router + PID dynamics unchanged
  • 4 SwiGLU experts with token-routed dispatch

Still awaiting compute credits. Will update with val_bpb once trained.

@Complexity-ML Complexity-ML marked this pull request as ready for review March 20, 2026 16:30
@Complexity-ML Complexity-ML marked this pull request as draft March 20, 2026 16:31
Complexity-ML and others added 5 commits March 20, 2026 19:19
- Int6 quantization (QUANT_BITS=6, range [-31,31]) instead of int8
- zstd-22 compression with zlib fallback
- SWA: fp32 checkpoint averaging during late training
- 1500 lines exactly (at the limit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
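The int6 scheme this commit names (QUANT_BITS=6, range [-31, 31]) can be sketched with symmetric per-tensor scaling; the scale choice is an assumption, and zlib stands in here for the zstd-22 path since the commit lists zlib as the fallback:

```python
import numpy as np
import zlib

def quantize_int6(w):
    """Symmetric int6: map [-max|w|, max|w|] linearly onto [-31, 31]."""
    scale = max(np.abs(w).max() / 31.0, 1e-12)   # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def compress_weights(q):
    """PR uses zstd level 22; zlib level 9 shown as the stated fallback."""
    return zlib.compress(q.tobytes(), 9)
```

Dequantization is just `q * scale`, with worst-case error of half a quantization step; the six-bit values still occupy int8 storage here, so the real artifact would also bit-pack before compressing.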
- JIT-compiled CUDA kernel for true scatter-dispatch MoE (from vllm-i64)
- Route → scatter → cuBLAS expert GEMM → gather pipeline
- 4x less wasted compute vs mask-multiply (only active expert runs)
- Falls back to PyTorch mask multiply if kernel unavailable
- Used at eval time only (training uses torch.compile path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
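The route → scatter → expert GEMM → gather pipeline this commit describes can be illustrated in numpy (a stand-in for the JIT CUDA kernel; names hypothetical):

```python
import numpy as np

def moe_scatter_dispatch(token_ids, x, expert_weights, num_experts=4):
    """Route -> scatter -> per-expert GEMM -> gather.

    Only the tokens an expert owns enter its GEMM, so no compute is
    spent on masked-out rows -- the ~4x waste of the mask-multiply
    path disappears, at the cost of data-dependent shapes (which is
    why the PR uses this path at eval time only).
    """
    routes = token_ids % num_experts
    out = np.empty_like(x)
    for e in range(num_experts):
        idx = np.nonzero(routes == e)[0]             # scatter: rows for expert e
        if idx.size:
            out[idx] = x[idx] @ expert_weights[e]    # dense GEMM on owned rows
    return out                                       # results gathered in token order
```

The output is numerically identical to the mask-multiply path; only the amount of work differs.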
Separate eval script (not counted in 1500 line limit):
- Sliding window with configurable stride (default 64) and window (default 2048)
- Loads quantized model, scores only last stride tokens per window
- Expected ~0.03 BPB improvement at eval time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
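The sliding-window scoring described above (window 2048, stride 64, scoring only the last stride tokens per window) amounts to the following index arithmetic; this sketch only generates the spans, the actual script would run the model over each one:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, ctx_end, score_start) spans.

    Each window sees up to `window` tokens of left context, but only
    the final `stride` tokens (score_start..ctx_end) are scored, so
    every token is scored exactly once with near-maximal context --
    the source of the ~0.03 BPB eval-time gain the commit expects.
    """
    pos = 0
    while pos < n_tokens:
        ctx_end = min(pos + stride, n_tokens)
        ctx_start = max(0, pos + stride - window)
        yield ctx_start, ctx_end, pos
        pos += stride
```

For a 300-token stream with window=128, stride=64, this yields five spans whose scored regions tile 0..300 without gaps or overlap.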
