|
| 1 | +# Qwen3.5 → Claude Opus Reasoning Scaffold Analysis |
| 2 | + |
| 3 | +Generated: 2026-03-30 |
| 4 | +L1 threshold: 1 (Base17 golden-step projection at stride=16) |
| 5 | + |
| 6 | +## Model Matrix |
| 7 | + |
| 8 | +| ID | Repo | Shards | Path | |
| 9 | +|---|---|---|---| |
| 10 | +| qwen35_27b_base | Qwen/Qwen3.5-27B | 11 | safetensors BF16 | |
| 11 | +| qwen35_27b_v1 | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 11 | safetensors BF16 | |
| 12 | +| qwen35_27b_v2 | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 | 11 | safetensors BF16 | |
| 13 | +| qwen35_9b_base | Qwen/Qwen3.5-9B | 4 | safetensors BF16 | |
| 14 | +| qwen35_9b_dist | Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled | 4 | safetensors BF16 | |
| 15 | + |
| 16 | +## Training Data Composition |
| 17 | + |
| 18 | +| Model | Opus 4.6 | Opus 4.5 | Qwen-self | Total | |
| 19 | +|---|---|---|---|---| |
| 20 | +| v1 | 3000× (nohurry) | 250× (TeichAI) | 700× (Jackrong) | ~3950 | |
| 21 | +| v2 | 3000× + 10000× (Roman1111111) | 250× (TeichAI) | 700× (Jackrong) | ~13950 | |
| 22 | +| 9B | Same as v1 + 27B distill cascade | 250× | 700× | ~3950+ | |
| 23 | + |
| 24 | +## 4-Diff Results |
| 25 | + |
| 26 | +### Diff 1: 27B base → v1 (Opus 4.5 + early 4.6, ~3950 samples) |
| 27 | +10,845 / 3,880,960 rows shifted (0.3%) — 9 of 11 shards |
| 28 | + |
| 29 | +| Projection | Shifted | Total | % | |
| 30 | +|---|---|---|---| |
| 31 | +| FfnGate | 6,396 | 1,131,520 | 0.6% | |
| 32 | +| FfnUp | 3,677 | 1,131,520 | 0.3% | |
| 33 | +| Q | 608 | 208,896 | 0.3% | |
| 34 | +| O | 40 | 20,480 | 0.2% | |
| 35 | +| FfnDown | 93 | 332,800 | 0.0% | |
| 36 | +| K | 0 | — | 0.0% | |
| 37 | +| V | 0 | — | 0.0% | |
| 38 | +| Embedding | 0 | 248,320 | 0.0% | |
| 39 | + |
| 40 | +### Diff 2: 27B base → v2 (Opus 4.6 heavy, ~13950 samples) |
| 41 | +1,921 / 5,241,695 rows shifted (0.0%) — 11 shards |
| 42 | + |
| 43 | +| Projection | Shifted | Total | % | |
| 44 | +|---|---|---|---| |
| 45 | +| FfnGate | 982 | 1,131,520 | 0.1% | |
| 46 | +| FfnUp | 707 | 1,131,520 | 0.1% | |
| 47 | +| Q | 142 | 208,896 | 0.1% | |
| 48 | +| O | 20 | 87,040 | 0.0% | |
| 49 | +| K | 3 | 17,408 | 0.0% | |
| 50 | +| V | 7 | 17,408 | 0.0% | |
| 51 | +| Embedding | 0 | 251,777 | 0.0% | |
| 52 | + |
| 53 | +### Diff 3: 27B v1 → v2 (iteration delta) |
| 54 | +11,509 / 5,202,783 rows shifted (0.2%) — 11 shards |
| 55 | + |
| 56 | +| Projection | Shifted | Total | % | |
| 57 | +|---|---|---|---| |
| 58 | +| FfnGate | 6,042 | 1,131,520 | 0.5% | |
| 59 | +| FfnUp | 3,907 | 1,131,520 | 0.3% | |
| 60 | +| Q | 664 | 208,896 | 0.3% | |
| 61 | +| O | 185 | 81,920 | 0.2% | |
| 62 | +| K | 56 | 17,408 | 0.3% | |
| 63 | +| V | 51 | 17,408 | 0.3% | |
| 64 | +| Embedding | 0 | 251,777 | 0.0% | |
| 65 | + |
| 66 | +### Diff 4: 9B base → distilled |
| 67 | +7,577 / 2,451,295 rows shifted (0.3%) — 4 shards |
| 68 | + |
| 69 | +| Projection | Shifted | Total | % | |
| 70 | +|---|---|---|---| |
| 71 | +| FfnGate | 3,857 | 405,504 | 1.0% | |
| 72 | +| FfnUp | 2,437 | 405,504 | 0.6% | |
| 73 | +| Q | 416 | 73,728 | 0.6% | |
| 74 | +| O | 170 | 36,864 | 0.5% | |
| 75 | +| K | 49 | 9,216 | 0.5% | |
| 76 | +| V | 47 | 9,216 | 0.5% | |
| 77 | +| Embedding | 0 | 251,777 | 0.0% | |
| 78 | + |
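The percentage column in the diff tables above is shifted rows divided by total rows, rounded to one decimal. As a quick sanity check, the sketch below recomputes the FfnGate percentages from the counts in the four tables. The dictionary and function names are illustrative and are not part of the analysis tooling.

```python
# (shifted, total) row counts copied from the FfnGate rows of the four diff tables.
FFN_GATE = {
    "27B base -> v1": (6_396, 1_131_520),
    "27B base -> v2": (982, 1_131_520),
    "27B v1 -> v2": (6_042, 1_131_520),
    "9B base -> distilled": (3_857, 405_504),
}


def shift_pct(shifted: int, total: int) -> float:
    """Percentage of rows that shifted, rounded to one decimal as in the tables."""
    return round(100.0 * shifted / total, 1)


for name, (shifted, total) in FFN_GATE.items():
    print(f"{name}: {shift_pct(shifted, total)}%")
```

The same helper can be pointed at any other row (Q, O, K, V) to confirm that a table's rounded value matches its raw counts.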
| 79 | +## Key Findings |
| 80 | + |
| 81 | +### 1. The reasoning scaffold lives in SwiGLU FFN gating |
| 82 | +FfnGate shows the largest shift in all 4 diffs, ahead of every attention projection (Q/K/V/O). |
| 83 | +The LoRA distillation primarily teaches the model HOW to route information |
| 84 | +through its feed-forward network, not how to attend differently. |
| 85 | + |
| 86 | +### 2. v2 is a REVERT, not an upgrade |
| 87 | +- base→v1: 0.6% FfnGate (aggressive modification) |
| 88 | +- base→v2: 0.1% FfnGate (conservative — much closer to base) |
| 89 | +- v1→v2: 0.5% FfnGate (v2 undid most of v1's changes) |
| 90 | + |
| 91 | +v2's 10K additional Opus 4.6 samples (~14K total) didn't amplify v1's changes; they |
| 92 | +**pulled the weights back toward base**. v2 is closer to base than v1. |
| 93 | + |
| 94 | +### 3. K stable at 27B, K shifted at 9B (capacity split) |
| 95 | +- 27B: K=0.0% → knowledge base preserved, only routing changed |
| 96 | +- 9B: K=0.5% → knowledge must also change (insufficient capacity) |
| 97 | + |
| 98 | +At 27B, the model learns new routing without touching its knowledge. |
| 99 | +At 9B, it must rewrite both. This is the capacity-dependent split. |
| 100 | + |
| 101 | +### 4. v1 is the control experiment (not redundant) |
| 102 | +Comparing the rows shifted in v1, v2, and the 9B distill separates them into four categories: |
| 103 | + |
| 104 | +| Category | Definition | Interpretation | |
| 105 | +|---|---|---| |
| 106 | +| GOOD (v1 ∩ v2 ∩ 9B) | All three agree | Scale-invariant reasoning scaffold | |
| 107 | +| BEHAVIOR (v1 \ v2) | v1 only, v2 reverted | Opus 4.5 behavioral traits | |
| 108 | +| REASONING (v2 \ v1) | v2 only, not in v1 | Pure Opus 4.6 signal (but minimal) | |
| 109 | +| UNCERTAIN (v1 ∩ v2 \ 9B) | Both rounds, not 9B | 27B capacity-dependent | |
| 110 | + |
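The four categories in the table are plain set operations over the shifted rows. The sketch below is a minimal illustration. It assumes each diff can be reduced to a set of comparable row identifiers, which is a hypothetical representation: the report does not say how 9B rows are aligned with 27B rows. The function and argument names are illustrative.

```python
def categorize(v1: set, v2: set, nine_b: set) -> dict:
    """Split shifted-row identifiers into the four categories from the table.

    v1     -- rows shifted in 27B base -> v1
    v2     -- rows shifted in 27B base -> v2
    nine_b -- rows shifted in 9B base -> distilled (assumed already mapped
              onto the same identifier space as the 27B rows)
    """
    return {
        "GOOD": v1 & v2 & nine_b,          # all three agree
        "BEHAVIOR": v1 - v2,               # v1 only, v2 reverted
        "REASONING": v2 - v1,              # v2 only, not in v1
        "UNCERTAIN": (v1 & v2) - nine_b,   # both 27B rounds, not 9B
    }


# Toy identifiers, not real data.
example = categorize(v1={1, 2, 3, 4}, v2={2, 3, 5}, nine_b={2, 6})
for category, rows in example.items():
    print(category, sorted(rows))
```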
| 111 | +### 5. The "orchestrator" insight |
| 112 | +Qwen3.5-base had the knowledge. It lacked the orchestration. |
| 113 | +The LoRA taught routing (FfnGate + Q), not knowledge (K + Embedding). |
| 114 | +Claude-style reasoning = different FFN activation patterns. |
| 115 | +"Let me analyze this: 1... 2... 3..." is a routing pattern, not new knowledge. |
| 116 | + |
| 117 | +## Architectural Implications |
| 118 | + |
| 119 | +### Palette3D should prioritize FfnGate |
| 120 | +The HEEL planes should weight FfnGate > FfnUp > Q > O. |
| 121 | +K/V bits are informational at 27B (near-zero), critical at 9B. |
| 122 | + |
| 123 | +### L1-metric palette, not POPCNT bitmask |
| 124 | +Base17 fingerprints are not random — they are structured golden-step projections. |
| 125 | +POPCNT (Hamming distance) assumes randomly distributed bits, so it gives biased results on structured fingerprints. |
| 126 | +Use Base17 L1 distance (PaletteSemiring) for all palette operations. |
| 127 | + |
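To illustrate why the two metrics disagree on structured values, the sketch below compares L1 distance with a POPCNT-style Hamming distance. It assumes a fingerprint is a short sequence of non-negative integers; the actual Base17 layout and the PaletteSemiring API are not described here, so neither is modelled, and the function names are illustrative.

```python
def l1_distance(a, b):
    """Sum of absolute component differences (L1 metric)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of differing bits per component (XOR, then popcount)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Numerically adjacent values: 7 = 0b0111 and 8 = 0b1000 differ in all four bits.
near = ([7, 0], [8, 0])
# Numerically distant values: 0 and 8 differ in a single bit.
far = ([0, 0], [8, 0])

print(l1_distance(*near), hamming_distance(*near))  # 1 4
print(l1_distance(*far), hamming_distance(*far))    # 8 1
```

In this toy case Hamming distance ranks the near pair as farther apart than the far pair, which is the kind of bias that motivates an L1 metric when components carry magnitude.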
| 128 | +### Shallow vs deep thinking maps to HHTL levels |
| 129 | +- HEEL (9B palette, 512 bytes): shallow/fast routing |
| 130 | +- TWIG (27B palette, Sparse256): deep/analytical routing |
| 131 | +- Style ordinal in PAL8 header controls escalation threshold |
| 132 | + |
| 133 | +## Next Steps |
| 134 | +1. Run inference on all 5 models with the same prompts |
| 135 | +2. NARS-score output quality per head (dynamic validation) |
| 136 | +3. Self-reinforcement LoRA guided by quality-scored Palette3D |
| 137 | +4. Validate: Q8_0 + Palette overlay vs BF16 reference |