Complexity MoE + PID Dynamics (Token-Routed I64)#224
Closed
Complexity-ML wants to merge 8 commits into openai:main from
Conversation
Token-Routed MoE (4 experts, deterministic routing) + PID dynamics with mu traversing all layers + SwiGLU + Cosine Warm Restarts. 14.7M params, under 16MB cap. Awaiting compute for final val_bpb. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
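The PR pairs the MoE with "PID dynamics with mu traversing all layers", but the mechanism itself isn't shown in this thread. The sketch below is only the textbook discrete PID update such a scheme would presumably build on; the class name, gains, and scalar state are all illustrative assumptions, not the PR's code:

```python
class PID:
    """Textbook discrete PID (proportional-integral-derivative) controller.

    The PR applies 'PID dynamics' to a quantity mu carried through all
    layers; the exact coupling isn't shown, so this only illustrates the
    standard update rule.
    """

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = None

    def update(self, error, dt=1.0):
        self.integral += error * dt
        deriv = 0.0 if self.prev_err is None else (error - self.prev_err) / dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv


# Purely proportional controller: output is kp * error.
pid = PID(kp=0.5, ki=0.0, kd=0.0)
out = pid.update(2.0)  # -> 1.0
```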
- LearnedHashRouter: Linear(H, E) micro-router, soft routing, fullgraph safe
- 3 routing modes: modulo | learned | hybrid (modulo base + learned override)
- Hybrid starts deterministic, learns to override when beneficial
- Cosine warm restarts (SGDR): cycles 5k/10k/20k, peak decay 0.7x
- Condensed comments to stay under 1500 lines (1497)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
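A minimal NumPy sketch of the hybrid mode, assuming the learned Linear(H, E) logits are softmaxed and blended with the deterministic modulo one-hot through a gate. The gate, initialization, and function names are illustrative assumptions, not the PR's actual API:

```python
import numpy as np

H, E = 8, 4  # toy hidden size; 4 experts as in the PR

rng = np.random.default_rng(0)
W_router = rng.normal(scale=0.01, size=(H, E))  # micro-router Linear(H, E)

def hybrid_route(x, token_ids, gate=0.0):
    """Modulo one-hot base plus a learned soft override.

    gate=0 reproduces pure deterministic routing; gate=1 is fully
    learned. Everything stays dense matrix math with no data-dependent
    control flow, which is what keeps torch.compile fullgraph happy.
    """
    base = np.eye(E)[token_ids % E]                     # (T, E) deterministic
    logits = x @ W_router
    soft = np.exp(logits - logits.max(-1, keepdims=True))
    soft /= soft.sum(-1, keepdims=True)                 # (T, E) learned softmax
    return (1.0 - gate) * base + gate * soft

x = rng.normal(size=(6, H))
token_ids = np.arange(6)
weights = hybrid_route(x, token_ids, gate=0.0)  # pure modulo routing
```

With gate=0 the output is exactly the modulo one-hot, so the hybrid can start deterministic and only drift toward the learned distribution as the gate opens.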
Author
Update v2 — Learned Hash Router (Hybrid Mode)

Added context-aware routing on top of the deterministic base:
Awaiting compute credits to train and produce val_bpb.
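This update also introduces cosine warm restarts (SGDR) with cycle lengths 5k/10k/20k and a 0.7x peak decay per restart. A sketch of that schedule follows; the base learning rate is an assumed placeholder, not a value from the PR:

```python
import math

def sgdr_lr(step, base_lr=3e-4, cycles=(5000, 10000, 20000), peak_decay=0.7):
    """Cosine annealing with warm restarts (SGDR).

    Each cycle decays the LR from its peak toward 0 along a half
    cosine; at every restart the peak is multiplied by peak_decay
    (0.7x, as in the PR). base_lr=3e-4 is an illustrative assumption.
    """
    peak = base_lr
    for length in cycles:
        if step < length:
            return 0.5 * peak * (1.0 + math.cos(math.pi * step / length))
        step -= length
        peak *= peak_decay
    # Past the last scheduled cycle: schedule exhausted.
    return 0.0

lr_start = sgdr_lr(0)       # peak of cycle 1: base_lr
lr_restart = sgdr_lr(5000)  # first restart: peak drops to 0.7 * base_lr
```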
- 11 layers (was 9) + MLP 2x expansion (was 1x)
- 26.5M params, ~14.2MB with int6+zstd (fits 16MB)
- Hybrid learned router + PID dynamics unchanged
- Need int6+zstd quantizer for final artifact

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
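Why the int6+zstd quantizer is needed follows from simple arithmetic, sketched below: at 6 bits per weight, 26.5M parameters already exceed the 16MB cap before entropy coding, so the ~14.2MB figure depends on zstd removing roughly another quarter of the packed stream (back-of-envelope calculation, not a measured result):

```python
params = 26.5e6
raw_int6_mb = params * 6 / 8 / 1e6  # 19.875 MB of packed 6-bit weights
over_cap = raw_int6_mb > 16          # True: compression is mandatory
needed_ratio = 14.2 / raw_int6_mb    # ~0.71: zstd must shave off ~29%
```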
Author
|
Update v2 — Scaled up to 26.5M params
Still awaiting compute credits. Will update with val_bpb once trained.
- Int6 quantization (QUANT_BITS=6, range [-31, 31]) instead of int8
- zstd-22 compression with zlib fallback
- SWA: fp32 checkpoint averaging during late training
- 1500 lines exactly (at the limit)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
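A sketch of the quantize-then-compress step, assuming a symmetric per-tensor scale (the PR's exact scaling scheme isn't shown in this thread). The zstd path mirrors the PR's zstd-22 with the stdlib zlib fallback:

```python
import zlib

import numpy as np

QUANT_BITS = 6
QMAX = 2 ** (QUANT_BITS - 1) - 1  # 31, giving the symmetric range [-31, 31]

def quantize_int6(w):
    """Symmetric per-tensor int6 quantization (illustrative sketch)."""
    scale = np.abs(w).max() / QMAX
    q = np.clip(np.round(w / scale), -QMAX, QMAX).astype(np.int8)
    return q, scale

def compress(q):
    """zstd level 22 when available, stdlib zlib otherwise (as in the PR)."""
    try:
        import zstandard  # optional third-party dependency
        return zstandard.ZstdCompressor(level=22).compress(q.tobytes())
    except ImportError:
        return zlib.compress(q.tobytes(), 9)

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = quantize_int6(w)
blob = compress(q)  # 63-level symbols in 8-bit slots compress well
```

Because only 63 of 256 int8 codes are used, even plain entropy coding recovers a good fraction of the 6-vs-8-bit gap, which is what makes the "int6 stored as int8, then compressed" layout practical.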
- JIT-compiled CUDA kernel for true scatter-dispatch MoE (from vllm-i64)
- Route → scatter → cuBLAS expert GEMM → gather pipeline
- 4x less wasted compute vs mask-multiply (only the active expert runs)
- Falls back to PyTorch mask-multiply if the kernel is unavailable
- Used at eval time only (training uses the torch.compile path)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
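A CPU emulation of the route → scatter → expert GEMM → gather pipeline, assuming hard top-1 assignments. The PR's version is a JIT CUDA kernel driving cuBLAS; this NumPy stand-in only mirrors the data movement, with toy shapes:

```python
import numpy as np

E, H = 4, 8
rng = np.random.default_rng(0)
W_experts = rng.normal(size=(E, H, H))  # one square weight per expert (toy)

def scatter_dispatch(x, expert_ids):
    """Sort tokens by expert, run one dense GEMM per expert on its
    contiguous slice, then scatter results back into token order.
    Only the active expert touches each token, unlike mask-multiply
    where all E experts process every token.
    """
    order = np.argsort(expert_ids, kind="stable")  # scatter permutation
    x_sorted = x[order]
    out_sorted = np.empty_like(x_sorted)
    counts = np.bincount(expert_ids, minlength=E)
    start = 0
    for e in range(E):
        end = start + counts[e]
        out_sorted[start:end] = x_sorted[start:end] @ W_experts[e]
        start = end
    out = np.empty_like(out_sorted)
    out[order] = out_sorted  # gather: undo the permutation
    return out

x = rng.normal(size=(16, H))
expert_ids = np.arange(16) % E  # deterministic modulo routing
y = scatter_dispatch(x, expert_ids)
```

With 4 experts and hard routing, this runs one quarter of the FLOPs of the mask-multiply path, which matches the "4x less wasted compute" claim above.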
Separate eval script (not counted in the 1500-line limit):
- Sliding window with configurable stride (default 64) and window (default 2048)
- Loads the quantized model, scores only the last stride tokens per window
- Expected ~0.03 BPB improvement at eval time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
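The windowing logic can be sketched as an index generator; the function name and exact boundary handling are assumptions, but the defaults match the script's stride 64 / window 2048:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (ctx_start, end, score_from) triples for sliding-window eval.

    Each forward pass sees tokens [ctx_start:end] but only positions
    [score_from:end] contribute to the loss, so every scored token gets
    up to `window` tokens of left context, at roughly window/stride
    times the compute of one non-overlapping pass.
    """
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        ctx_start = max(0, end - window)
        yield ctx_start, end, pos
        pos += stride

spans = list(sliding_windows(200, window=100, stride=30))
```

The BPB gain comes purely from evaluation: tokens near chunk boundaries are rescored with near-full context instead of almost none, with no change to the trained weights.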
Summary
Novel architecture from Complexity Framework:
- Token-Routed MoE: 4 experts, deterministic routing (token_id % 4), mask-multiply dispatch (fullgraph safe)
- 14.7M params, under 16MB cap

Awaiting compute credits for final val_bpb.
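The mask-multiply dispatch mentioned in the summary can be sketched with dense einsums (a NumPy stand-in for the PyTorch path, with toy shapes):

```python
import numpy as np

E, H = 4, 8
rng = np.random.default_rng(0)
W_experts = rng.normal(size=(E, H, H))  # toy expert weights

def masked_moe(x, token_ids):
    """Fullgraph-safe dispatch: all 4 experts process every token, and a
    one-hot mask built from token_id % 4 zeroes the inactive outputs.
    There is no data-dependent indexing, so torch.compile can trace it,
    at the cost of E-fold redundant compute.
    """
    mask = np.eye(E)[token_ids % E]                    # (T, E) one-hot
    all_out = np.einsum('th,ehd->ted', x, W_experts)   # every expert, every token
    return np.einsum('ted,te->td', all_out, mask)      # keep active expert only

x = rng.normal(size=(6, H))
token_ids = np.arange(6)
y = masked_moe(x, token_ids)
```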
Status
⏳ Pending training results — compute credits requested.
🤖 Generated with Claude Code