Declarative checkpoint config conversion (Llama pilot) #508

Open
jlamypoirier wants to merge 3 commits into main from jlp_simplify_conversion

@jlamypoirier
Collaborator

Summary

First step of the conversion-simplification refactor. Reintroduces declarative config-conversion primitives, applied within the post-#362 modular per-section structure, and migrates Llama as the pilot to validate the design.

Three sequential commits:

  1. Reclassify architecture-impacting fields under FieldHint.architecture: eight entries, 16 fields in total (attention dense_layer and softmax_scale_power, MLP activation, MoE router, four Llama3 and five Yarn rotary scaling fields, StochasticMixer main_mixer_name, vision patch height/width). These drive the new coverage check.
  2. Add declarative ConfigConverter primitives and a section-converter ABC in fast_llm/engine/checkpoint/external.py. Nine primitives (Rename, ConstantExport, ConstantImport, Default, Optional, Ignored, Custom, Nested, Dispatch) plus ConfigSectionConverter. The walker is implicit: NestedConfigConverter and DispatchConfigConverter call the public import_config/export_config so subclass overrides participate. The coverage check fires only when type(config) exactly matches the converter's declared fast_llm_config_class, so unmigrated subclasses (Mixtral on Llama, Qwen2's _check_config override, etc.) keep working through super().
  3. Migrate Llama config converters to declarative primitives. Eight section converters cover normalization/MLP/attention/block/embeddings/head/decoder/base-model; seven become declarative, and LlamaDecoderConverter stays imperative (Fixed/Pattern block-sequence dispatch doesn't fit cleanly). The weight side is unchanged. _check_config is retained as an overridable hook. PEFT non-default values now fail loudly on export instead of being silently dropped.

Notable shape decisions (open to course-correction)

  • Coverage check is type-strict (type(config) is cls.fast_llm_config_class). Strict subclasses defer to a more specific converter. This was needed to keep Mixtral working through super().export_config() on MoEMLPConfig while only Llama is migrated.
  • NestedConfigConverter is flat-merge only. The transformer side is assumed flat. Non-flat HF cases (Apriel2 mixers) will use DispatchConfigConverter with an hf_path, or CustomConfigConverter.
  • No global type-keyed registry. Sub-converter dispatch is local: parents declare NestedConfigConverter(field, converter_class) for fixed types, DispatchConfigConverter(field, registry) for polymorphic ones. Subclasses override sub-converter classes the same way as today's ClassVar[type] pattern.
  • parent_context plumbing is dropped for now (was speculative, unused in Llama). Will re-introduce as an explicit kwarg when Apriel migration needs it for mamba sibling-field defaults.
  • IgnoredConfigConverter is permissive: it silently passes architecture fields through without a check. It is used for ParameterConfig sub-fields (init/lr_scale only, no architecture sub-fields) and for fields where the Llama HF format genuinely has no representation. PEFT (which IS architecture-significant when configured) uses CustomConfigConverter with an explicit Assert.custom(isinstance, config.peft, NoPeftConfig) instead.
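
The type-strict coverage rule from the first bullet can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the Fast-LLM code: the `arch_field` helper, the toy config dataclasses, the `declared` set and `check_coverage` are invented for illustration. Only the `type(config) is cls.fast_llm_config_class` comparison comes from the description above.

```python
import dataclasses


def arch_field(default):
    # Hypothetical stand-in for tagging a field with FieldHint.architecture.
    return dataclasses.field(default=default, metadata={"hint": "architecture"})


@dataclasses.dataclass
class MLPConfig:
    activation: str = arch_field("silu")
    lr_scale: float = 1.0  # untagged, so the coverage check ignores it


@dataclasses.dataclass
class MoEMLPConfig(MLPConfig):
    router: str = arch_field("topk")


class MLPConverter:
    fast_llm_config_class = MLPConfig
    declared = {"activation"}  # fields this converter claims to handle

    @classmethod
    def check_coverage(cls, config) -> bool:
        # Exact type match only: a strict subclass is left to a more specific converter.
        if type(config) is not cls.fast_llm_config_class:
            return False
        required = {
            f.name
            for f in dataclasses.fields(config)
            if f.metadata.get("hint") == "architecture"
        }
        missing = required - cls.declared
        if missing:
            raise ValueError(f"Undeclared architecture fields: {sorted(missing)}")
        return True


ran_for_base = MLPConverter.check_coverage(MLPConfig())
ran_for_subclass = MLPConverter.check_coverage(MoEMLPConfig())
```

With an `isinstance` check instead, the second call would raise on `router`, which the parent converter does not declare. That is the situation the type-strict rule avoids while only Llama is migrated.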

Verification

  • Live round-trip parity for Llama-3, Qwen2, Mistral, Mixtral, MTP-Llama with realistic HF configs.
  • Coverage check fires on missing declarations (verified by removing head_size).
  • Constant assertions fire on non-default softmax_scale_power and on configured PEFT.
  • pytest tests/models/test_checkpoint.py --models gpt: 139 passed, 0 failed across llama / qwen_2 / mistral / mixtral / mtp_llama / apriel2_attn / llava / diffusion_llama.
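
The "fail loudly on configured PEFT" behaviour checked above can be pictured with a small sketch. The class names mirror the ones mentioned in this PR, but the class bodies and the `export_peft` function are hypothetical and only illustrate the idea of asserting instead of silently dropping a value.

```python
class PeftConfig:
    """Hypothetical base class for PEFT configurations."""


class NoPeftConfig(PeftConfig):
    """Hypothetical marker for 'no PEFT configured'."""


class LoRAConfig(PeftConfig):
    """Hypothetical LoRA configuration."""

    def __init__(self, rank: int = 8):
        self.rank = rank


def export_peft(peft: PeftConfig) -> dict:
    # The target format has no way to represent PEFT, so anything other than
    # "no PEFT" is rejected instead of being dropped from the output.
    if not isinstance(peft, NoPeftConfig):
        raise ValueError(
            f"Cannot export {type(peft).__name__}: the target format has no PEFT fields"
        )
    return {}  # nothing to write on the HF side


exported = export_peft(NoPeftConfig())
```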

Test plan

  • pytest -v -n 6 tests/models/test_checkpoint.py 2>&1 | tee /tmp/fast_llm_tests/pytest_out.txt
  • pytest -v -n 6 tests/models/test_hf_roundtrip.py
  • pytest -v -n 6 --models gpt tests/
  • pytest -v -n 6 fast_llm_external_models/tests/ (separate invocation per CLAUDE.md)
  • Manual smoke: fast-llm convert --input.format llama --input.path <ref> --output.format llama --output.path <tmp>; reload both and compare configs.
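
For the "reload both and compare configs" step of the manual smoke test, a small diff helper is one option. This is a sketch under assumptions: `load_config` assumes a Hugging Face style `config.json` at the top of each checkpoint directory, and both function names are made up here.

```python
import json
from pathlib import Path


def load_config(checkpoint_dir: str) -> dict:
    # Assumes a Hugging Face style layout with config.json at the top level.
    return json.loads((Path(checkpoint_dir) / "config.json").read_text())


def diff_configs(left: dict, right: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of keys that differ or exist on one side only."""
    differences = []
    for key in sorted(set(left) | set(right)):
        path = f"{prefix}{key}"
        if key not in left or key not in right:
            differences.append(path)
        elif isinstance(left[key], dict) and isinstance(right[key], dict):
            differences.extend(diff_configs(left[key], right[key], f"{path}."))
        elif left[key] != right[key]:
            differences.append(path)
    return differences


# In-memory example; with real checkpoints, pass load_config(<ref>) and load_config(<tmp>).
reference = {"hidden_size": 4096, "rope_scaling": {"factor": 8.0}}
converted = {"hidden_size": 4096, "rope_scaling": {"factor": 4.0}, "extra": 1}
mismatches = diff_configs(reference, converted)
```

An empty list means the round trip preserved every key. Any returned path points at the field to inspect.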

What's not in this PR

Phase 2 steps 3–8 of the plan (apriel2 / mistral / qwen2 / mtp_llama / mixtral / diffusion / apriel / multimodal migrations + cleanup) and the weight-converter declarative refactor are deferred. The framework is built so they can land incrementally on top of this.

🤖 Generated with Claude Code

jlamypoirier and others added 3 commits May 5, 2026 18:33
Eight groups of config fields whose values directly affect model architecture were
tagged as feature/core/(none). They drive the upcoming declarative-converter
coverage check, which uses FieldHint.architecture as the source of truth
for "must be handled by every checkpoint format".

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor,
  high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow,
  original_context_length
- StochasticMixerConfig.main_mixer_name (selects inference mixer)
- PatchEmbeddingsConfig.patch_height/patch_width (input tokenization)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reintroduces the declarative config-conversion shape that pre-dated PR #362,
applied within the post-#362 modular per-section structure. Replaces the
imperative import_config/export_config bodies with a small set of named
primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges
  HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs
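
To make the semantics of the simpler primitives concrete, here is a minimal sketch over plain dictionaries. It is not the code in fast_llm.engine.checkpoint.external: the class names are shortened, the constructor signatures are invented, and the field-to-key mapping in the example is illustrative only.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Rename:
    """1:1 rename between a Fast-LLM-side name and an HF-side name."""
    fast_llm_name: str
    hf_name: str

    def export(self, config: dict, out: dict) -> None:
        out[self.hf_name] = config[self.fast_llm_name]

    def import_(self, hf: dict, out: dict) -> None:
        out[self.fast_llm_name] = hf[self.hf_name]


@dataclass(frozen=True)
class ConstantExport:
    """Write a constant on export, assert it on import."""
    hf_name: str
    value: Any

    def export(self, config: dict, out: dict) -> None:
        out[self.hf_name] = self.value

    def import_(self, hf: dict, out: dict) -> None:
        if hf.get(self.hf_name) != self.value:
            raise ValueError(f"{self.hf_name} must be {self.value!r}")


@dataclass(frozen=True)
class Default(Rename):
    """Rename that falls back to a default when the HF key is absent."""
    hf_default: Any = None

    def import_(self, hf: dict, out: dict) -> None:
        out[self.fast_llm_name] = hf.get(self.hf_name, self.hf_default)


@dataclass(frozen=True)
class OptionalField(Rename):
    """Rename that is skipped when the value equals the sentinel."""
    sentinel: Any = None

    def export(self, config: dict, out: dict) -> None:
        if config[self.fast_llm_name] != self.sentinel:
            super().export(config, out)

    def import_(self, hf: dict, out: dict) -> None:
        if hf.get(self.hf_name, self.sentinel) != self.sentinel:
            super().import_(hf, out)


def export_config(converters, config: dict) -> dict:
    out: dict = {}
    for converter in converters:
        converter.export(config, out)
    return out


def import_config(converters, hf: dict) -> dict:
    out: dict = {}
    for converter in converters:
        converter.import_(hf, out)
    return out


converters = [
    Rename("epsilon", "rms_norm_eps"),
    ConstantExport("hidden_act", "silu"),
    Default("add_bias", "mlp_bias", hf_default=False),
    OptionalField("window_size", "sliding_window"),
]
hf = export_config(converters, {"epsilon": 1e-5, "add_bias": False, "window_size": None})
restored = import_config(converters, hf)
```

The point of the declarative shape is that one list drives both directions, so an export rule cannot drift away from its import counterpart.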

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses
declare their conversion via _create_config_converters() and inherit
import_config/export_config concretely. The architecture-coverage check fires
only when type(config) exactly matches the converter's declared
fast_llm_config_class — strict subclass types defer to a more specific
converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama)
to call super().export_config() without tripping the parent's check on
fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter
call the public import_config/export_config on the sub-converter class so
subclass overrides participate, rather than a private path that bypasses
them.
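
A minimal sketch of why calling the public method matters, together with the flat-merge behaviour of the nested primitive. All classes below are hypothetical simplifications working on dictionaries, and `example_extra_key` is an invented key used only to make the override visible.

```python
from typing import ClassVar


class NormalizationConverter:
    @classmethod
    def export_config(cls, config: dict) -> dict:
        return {"rms_norm_eps": config["epsilon"]}


class CustomNormalizationConverter(NormalizationConverter):
    # A subclass override; it only takes effect if callers use the public method.
    @classmethod
    def export_config(cls, config: dict) -> dict:
        out = super().export_config(config)
        out["example_extra_key"] = True  # hypothetical key, for illustration only
        return out


class Nested:
    """Recurse into a sub-config and flat-merge its HF output into the parent."""

    def __init__(self, field: str, converter_class: type):
        self.field = field
        self.converter_class = converter_class

    def export(self, config: dict, out: dict) -> None:
        sub = self.converter_class.export_config(config[self.field])
        clash = out.keys() & sub.keys()
        if clash:
            raise ValueError(f"Flat merge would overwrite keys: {sorted(clash)}")
        out.update(sub)


class BlockConverter:
    normalization_converter_class: ClassVar[type] = NormalizationConverter

    @classmethod
    def export_config(cls, config: dict) -> dict:
        out = {"hidden_size": config["hidden_size"]}
        Nested("normalization", cls.normalization_converter_class).export(config, out)
        return out


class CustomBlockConverter(BlockConverter):
    # Swapping the sub-converter class is all a subclass needs to do.
    normalization_converter_class = CustomNormalizationConverter


config = {"hidden_size": 64, "normalization": {"epsilon": 1e-5}}
base_out = BlockConverter.export_config(config)
custom_out = CustomBlockConverter.export_config(config)
```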

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pilot of the new ConfigSectionConverter framework. Each Llama section
converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel)
now declares its conversion via _create_config_converters() instead of
imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because
  Fixed/Pattern block-sequence dispatch doesn't lend itself to the
  declarative shape. LlamaBaseModelConverter wires it in via a small
  CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...)
  continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from
  the linear_layers CustomConfigConverter, so Qwen2 can keep its
  asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no
  architecture-significant content (weight, output_weight, word_embeddings),
  and for prediction_heads (which Llama HF doesn't expose; subclass
  MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama
  HF format cannot represent PEFT, so a configured LoRA now fails loudly
  rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split
  (rope_theta/rope_scaling vs. rope_parameters) and three rope_type
  variants don't fit pure rename primitives.
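
To show why rotary needs more than a rename, here is a sketch of reading both layouts into one normalized form. The key layout is an assumption on my part and should be checked against the transformers version in use: the older layout is taken to have a top-level `rope_theta` plus an optional `rope_scaling` dict, and the newer one a single `rope_parameters` dict that also holds `rope_theta`. The function name and the returned structure are hypothetical.

```python
SUPPORTED_ROPE_TYPES = ("default", "llama3", "yarn")


def read_rotary(hf_config: dict) -> dict:
    """Normalize either rotary layout to {'type', 'theta', 'scaling'}."""
    if "rope_parameters" in hf_config:
        # Assumed newer layout: everything lives in one dict.
        params = dict(hf_config["rope_parameters"])
    else:
        # Assumed older layout: theta at the top level, scaling in its own dict.
        params = dict(hf_config.get("rope_scaling") or {})
        params["rope_theta"] = hf_config["rope_theta"]
    rope_type = params.pop("rope_type", "default")
    if rope_type not in SUPPORTED_ROPE_TYPES:
        raise NotImplementedError(f"Unsupported rope_type: {rope_type}")
    theta = params.pop("rope_theta")
    # Whatever remains is type-specific scaling (e.g. factor, beta_fast).
    return {"type": rope_type, "theta": theta, "scaling": params}


old_style = {"rope_theta": 500000.0, "rope_scaling": {"rope_type": "llama3", "factor": 8.0}}
new_style = {"rope_parameters": {"rope_type": "llama3", "rope_theta": 500000.0, "factor": 8.0}}
from_old = read_rotary(old_style)
from_new = read_rotary(new_style)
```

Both inputs normalize to the same structure. The branch on layout plus the per-type leftover keys is the cross-field logic that a pure rename cannot express.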

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and
MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT
formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jlamypoirier force-pushed the jlp_simplify_conversion branch from 5567a71 to 0c406db on May 5, 2026 22:33