Commit 7dccaa1

Merge pull request #68 from AdaWorldAPI/claude/qwen-claude-reverse-eng-vHuHv
Claude/qwen claude reverse eng v hu hv
2 parents eb93d0d + a30e5c0 commit 7dccaa1

40 files changed

Lines changed: 3794 additions & 16 deletions
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
# Qwen3.5 → Claude Opus Reasoning Scaffold Analysis

Generated: 2026-03-30
L1 threshold: 1 (Base17 golden-step projection at stride=16)

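As a minimal sketch of how an L1 threshold like the one above could be applied: a row counts as "shifted" when the L1 distance between its fingerprint in the two models reaches the threshold. The fingerprint layout (a short sequence of integers), the `>=` comparison, and both function names are assumptions for illustration; the actual Base17 golden-step projection is not reproduced here.

```python
from typing import Sequence


def l1_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of absolute component differences between two fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def count_shifted(base_rows, tuned_rows, threshold: int = 1) -> int:
    """Count row pairs whose L1 distance is at least `threshold` (assumed rule)."""
    return sum(
        1
        for base, tuned in zip(base_rows, tuned_rows)
        if l1_distance(base, tuned) >= threshold
    )


# Hypothetical 17-component fingerprints for two rows in each model.
base = [[0] * 17, [1] * 17]
tuned = [[0] * 17, [1] * 16 + [3]]
print(count_shifted(base, tuned))  # only the second row differs
```

With a threshold of 1 on integer fingerprints, any nonzero difference marks the row as shifted, which would make the reported counts a measure of how many rows changed at all rather than of how much they changed.
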
## Model Matrix

| ID | Repo | Shards | Path |
|---|---|---|---|
| qwen35_27b_base | Qwen/Qwen3.5-27B | 11 | safetensors BF16 |
| qwen35_27b_v1 | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | 11 | safetensors BF16 |
| qwen35_27b_v2 | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 | 11 | safetensors BF16 |
| qwen35_9b_base | Qwen/Qwen3.5-9B | 4 | safetensors BF16 |
| qwen35_9b_dist | Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled | 4 | safetensors BF16 |

## Training Data Composition

| Model | Opus 4.6 | Opus 4.5 | Qwen-self | Total |
|---|---|---|---|---|
| v1 | 3000× (nohurry) | 250× (TeichAI) | 700× (Jackrong) | ~3950 |
| v2 | 3000× + 10000× (Roman1111111) | 250× (TeichAI) | 700× (Jackrong) | ~13950 |
| 9B | Same as v1 + 27B distill cascade | 250× | 700× | ~3950+ |

## 4-Diff Results

### Diff 1: 27B base → v1 (Opus 4.5 + early 4.6, ~3950 samples)
10,845 / 3,880,960 rows shifted (0.3%), 9 of 11 shards

| Projection | Shifted | Total | % |
|---|---|---|---|
| FfnGate | 6,396 | 1,131,520 | 0.6% |
| FfnUp | 3,677 | 1,131,520 | 0.3% |
| Q | 608 | 208,896 | 0.3% |
| O | 40 | 20,480 | 0.2% |
| FfnDown | 93 | 332,800 | 0.0% |
| K | 0 | | 0.0% |
| V | 0 | | 0.0% |
| Embedding | 0 | 248,320 | 0.0% |

### Diff 2: 27B base → v2 (Opus 4.6 heavy, ~13950 samples)
1,921 / 5,241,695 rows shifted (0.04%), 11 shards

| Projection | Shifted | Total | % |
|---|---|---|---|
| FfnGate | 982 | 1,131,520 | 0.1% |
| FfnUp | 707 | 1,131,520 | 0.1% |
| Q | 142 | 208,896 | 0.1% |
| O | 20 | 87,040 | 0.0% |
| K | 3 | 17,408 | 0.0% |
| V | 7 | 17,408 | 0.0% |
| Embedding | 0 | 251,777 | 0.0% |

### Diff 3: 27B v1 → v2 (iteration delta)
11,509 / 5,202,783 rows shifted (0.2%), 11 shards

| Projection | Shifted | Total | % |
|---|---|---|---|
| FfnGate | 6,042 | 1,131,520 | 0.5% |
| FfnUp | 3,907 | 1,131,520 | 0.3% |
| Q | 664 | 208,896 | 0.3% |
| O | 185 | 81,920 | 0.2% |
| K | 56 | 17,408 | 0.3% |
| V | 51 | 17,408 | 0.3% |
| Embedding | 0 | 251,777 | 0.0% |

### Diff 4: 9B base → distilled
7,577 / 2,451,295 rows shifted (0.3%), 4 shards

| Projection | Shifted | Total | % |
|---|---|---|---|
| FfnGate | 3,857 | 405,504 | 1.0% |
| FfnUp | 2,437 | 405,504 | 0.6% |
| Q | 416 | 73,728 | 0.6% |
| O | 170 | 36,864 | 0.5% |
| K | 49 | 9,216 | 0.5% |
| V | 47 | 9,216 | 0.5% |
| Embedding | 0 | 251,777 | 0.0% |

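The percentage column can be rechecked directly from the shifted and total counts. The sketch below does this for the FfnGate row of each of the four diff tables above; the dictionary layout and the `shift_pct` helper are illustrative names, not part of the analysis tooling.

```python
# (shifted rows, total rows) for FfnGate, copied from the four diff tables.
FFN_GATE = {
    "27B base->v1": (6_396, 1_131_520),
    "27B base->v2": (982, 1_131_520),
    "27B v1->v2": (6_042, 1_131_520),
    "9B base->dist": (3_857, 405_504),
}


def shift_pct(shifted: int, total: int, digits: int = 1) -> float:
    """Shifted rows as a percentage of all rows, rounded to `digits` places."""
    return round(100.0 * shifted / total, digits)


for name, (shifted, total) in FFN_GATE.items():
    print(f"{name}: {shift_pct(shifted, total)}%")
```

The recomputed values (0.6, 0.1, 0.5 and 1.0 percent) agree with the tables at one decimal place.
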
## Key Findings

### 1. The reasoning scaffold lives in SwiGLU FFN gating
FfnGate shows the largest shift in ALL 4 diffs, ahead of every attention projection (Q/K/V/O).
The LoRA distillation primarily teaches the model HOW to route information
through its feed-forward network, not how to attend differently.

### 2. v2 is a REVERT, not an upgrade
- base→v1: 0.6% FfnGate (aggressive modification)
- base→v2: 0.1% FfnGate (conservative, much closer to base)
- v1→v2: 0.5% FfnGate (v2 undid most of v1's changes)

v2's 10,000 additional Opus 4.6 samples (~13,950 in total) didn't amplify v1's changes. Instead, they
**stabilized the optimizer back toward base**. v2 is closer to base than v1.

### 3. K stable at 27B, K shifted at 9B (capacity split)
- 27B: K=0.0% in both base-relative diffs (base→v1, base→v2) → knowledge base preserved, only routing changed
- 9B: K=0.5% → knowledge must also change (insufficient capacity)

At 27B, the model learns new routing without touching its knowledge.
At 9B, it must rewrite both. This is the capacity-dependent split.

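At one decimal place the K percentages hide how far apart the two scales are. A quick recomputation at two decimals, using the K rows of Diffs 2 to 4 (Diff 1 lists no K total, so it is left out); the names below are illustrative:

```python
# (shifted rows, total rows) for the K projection, copied from Diffs 2-4.
K_ROWS = {
    "27B base->v2": (3, 17_408),
    "27B v1->v2": (56, 17_408),
    "9B base->dist": (49, 9_216),
}


def pct(shifted: int, total: int, digits: int = 2) -> float:
    """Shifted rows as a percentage of all rows, rounded to `digits` places."""
    return round(100.0 * shifted / total, digits)


k_pct = {name: pct(*counts) for name, counts in K_ROWS.items()}
for name, value in k_pct.items():
    print(f"{name}: {value}%")
```

Against base, the 27B K rate comes out near 0.02% while the 9B rate is about 0.53%. The v1→v2 rate (about 0.32%) shows that K is not frozen at 27B when the two distilled variants are compared with each other.
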
### 4. v1 is the control experiment (not redundant)
Comparing v1 against v2 separates the shifted rows into four categories:

| Category | Definition | Interpretation |
|---|---|---|
| GOOD (v1 ∩ v2 ∩ 9B) | All three agree | Scale-invariant reasoning scaffold |
| BEHAVIOR (v1 \ v2) | v1 only, v2 reverted | Opus 4.5 behavioral traits |
| REASONING (v2 \ v1) | v2 only, not in v1 | Pure Opus 4.6 signal (but minimal) |
| UNCERTAIN ((v1 ∩ v2) \ 9B) | Both rounds, not 9B | 27B capacity-dependent |

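The four categories are plain set operations over the sets of shifted rows. A minimal sketch, assuming each set holds row keys that are comparable across models; how 9B rows are mapped onto 27B rows is not specified in this report, so the string keys below are purely hypothetical:

```python
def categorize(v1: set, v2: set, nine_b: set) -> dict:
    """Split shifted-row keys into the four categories from the table."""
    return {
        "GOOD": v1 & v2 & nine_b,         # shifted in all three
        "BEHAVIOR": v1 - v2,              # shifted in v1, absent in v2
        "REASONING": v2 - v1,             # shifted in v2 only
        "UNCERTAIN": (v1 & v2) - nine_b,  # both 27B rounds, not 9B
    }


# Hypothetical row keys, for illustration only.
v1_rows = {"a", "b", "c"}
v2_rows = {"a", "c", "d"}
nine_b_rows = {"a"}

categories = categorize(v1_rows, v2_rows, nine_b_rows)
for name, rows in categories.items():
    print(name, sorted(rows))
```

Note that GOOD and UNCERTAIN together partition v1 ∩ v2, so every row shifted in both 27B rounds falls into exactly one of the two.
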
### 5. The "orchestrator" insight
Qwen3.5-base had the knowledge. It lacked the orchestration.
The LoRA taught routing (FfnGate + Q), not knowledge (K + Embedding).
Claude-style reasoning = different FFN activation patterns.
"Let me analyze this: 1... 2... 3..." is a routing pattern, not new knowledge.

## Architectural Implications

### Palette3D should prioritize FfnGate
The HEEL planes should weight FfnGate > FfnUp > Q > O.
K/V bits are informational at 27B (near-zero), critical at 9B.

### L1-metric palette, not POPCNT bitmask
Base17 fingerprints are not random: they are structured golden-step projections.
POPCNT (Hamming distance) assumes randomly distributed bits, so on structured fingerprints it gives biased results.
Use Base17 L1 distance (PaletteSemiring) for all palette operations.

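A small illustration of why a popcount-based distance can disagree with L1 on structured values. It treats a single fingerprint component as a plain unsigned integer, which is an assumption about the Base17 encoding; the helper names are illustrative and this is not the PaletteSemiring implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits (what POPCNT over a XOR b measures)."""
    return bin(a ^ b).count("1")


def l1(a: int, b: int) -> int:
    """Absolute difference of the two values."""
    return abs(a - b)


near = (7, 8)  # numerically adjacent, but 0b0111 vs 0b1000 differ in 4 bits
far = (0, 8)   # numerically distant, but the patterns differ in 1 bit

print("near:", hamming(*near), l1(*near))
print("far: ", hamming(*far), l1(*far))
```

Hamming distance ranks the pair (0, 8) as closer than (7, 8), while L1 ranks them the other way round. For values whose magnitude carries meaning, L1 preserves that ordering and a bit count does not.
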
### Shallow vs deep thinking maps to HHTL levels
- HEEL (9B palette, 512 bytes): shallow/fast routing
- TWIG (27B palette, Sparse256): deep/analytical routing
- Style ordinal in PAL8 header controls escalation threshold

## Next Steps
1. Run inference on all 5 models with the same prompts
2. NARS-score output quality per head (dynamic validation)
3. Self-reinforcement LoRA guided by quality-scored Palette3D
4. Validate: Q8_0 + Palette overlay vs BF16 reference
