Add OLMo2/3 model support in fairseq2 #1410
Conversation
Force-pushed from 4975b9a to 1d050dd
cirquit
left a comment
Almost at the finish line here! Key things are the @Final implementations (already taken care of afaik) and the KV-caching implementation.
The rest are minor things that need some attention. Happy to merge once KV caching and the tests are passing.
…to ensure outputs align with HF Transformer
Fix all E501 line length violations (>88 chars) in the OLMO module:
- Reformat long docstrings and comments across all files
- Split long dictionary mappings in interop.py for better readability
- Wrap long error messages and function arguments
- All tests still passing after formatting changes
…ding
- Add KV caching to OLMOMultiheadAttention.forward() using existing AttentionState/FullAttentionState infrastructure
- Add GQA head expansion via repeat_interleave for grouped query attention
- Wire LocalAttentionStateFactory for OLMO3 sliding window layers in factory.py
- Add incremental decode test using pretrained OLMO2-1B checkpoint with real tokenized sentences, verifying step-by-step decoding matches full-sequence forward pass
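For context, the GQA head expansion mentioned in the commit above boils down to repeating each KV head so that every query head has a matching key/value head. A minimal sketch (shapes are illustrative, not the fairseq2 API):

```python
import torch

# Illustrative shapes: 2 KV heads shared across 8 query heads (GQA).
batch, num_heads, num_kv_heads, seq_len, head_dim = 1, 8, 2, 16, 64

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand each KV head so every query head has a matching K/V head.
n_rep = num_heads // num_kv_heads
k = k.repeat_interleave(n_rep, dim=1)  # -> (batch, num_heads, seq_len, head_dim)
v = v.repeat_interleave(n_rep, dim=1)

attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```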
…ents, and fix imports
- config.py: Remove unresolved TODOs and dead code (initializer_range, shard_embed_dim)
- interop.py: Remove legacy state_dict["model"] unwrapping (no v0.4 OLMo models)
- interop.py: Remove speculative RoPE permutation comment
- normalization.py: Import Callable, Sequence from collections.abc instead of typing
- tokenizer.py: Remove loose comments and docstring above/below chat template
Force-pushed from fbaa4f9 to 272a989
**OLMO Post-Norm vs Pre-Norm vs Standard Post-Norm**

```mermaid
flowchart LR
subgraph PreNorm["Pre-Norm (e.g. LLaMA)"]
direction LR
P1["Input"] --> P2["Norm"] --> P3["Attn/FFN"] --> P4["⊕ Add"] --> P5["Output"]
P1 -.->|"residual"| P4
end
subgraph PostNorm["Standard Post-Norm"]
direction LR
S1["Input"] --> S2["Attn/FFN"] --> S3["⊕ Add"] --> S4["Norm"] --> S5["Output"]
S1 -.->|"residual"| S3
end
subgraph OLMONorm["OLMO Post-Norm"]
direction LR
O1["Input"] --> O2["Attn/FFN"] --> O3["Norm"] --> O4["⊕ Add"] --> O5["Output"]
O1 -.->|"residual"| O4
end
style P2 fill:#95a5a6,color:#fff
style P3 fill:#95a5a6,color:#fff
style P4 fill:#95a5a6,color:#fff
style S2 fill:#95a5a6,color:#fff
style S3 fill:#95a5a6,color:#fff
style S4 fill:#95a5a6,color:#fff
style O2 fill:#8e44ad,color:#fff
style O3 fill:#c0392b,color:#fff
style O4 fill:#27ae60,color:#fff
```
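To make the ordering difference concrete, here is a minimal PyTorch sketch of the three residual orderings (function and module names are illustrative, not the fairseq2 API):

```python
import torch
from torch import nn

def pre_norm_block(x: torch.Tensor, norm: nn.Module, block: nn.Module) -> torch.Tensor:
    # Pre-Norm (e.g. LLaMA): norm -> sublayer -> add residual
    return x + block(norm(x))

def post_norm_block(x: torch.Tensor, norm: nn.Module, block: nn.Module) -> torch.Tensor:
    # Standard Post-Norm: sublayer -> add residual -> norm
    return norm(x + block(x))

def olmo_post_norm_block(x: torch.Tensor, norm: nn.Module, block: nn.Module) -> torch.Tensor:
    # OLMO Post-Norm: sublayer -> norm -> add residual (residual bypasses the norm)
    return x + norm(block(x))

x = torch.randn(2, 8, 16)
norm, ffn = nn.LayerNorm(16), nn.Linear(16, 16)
print(olmo_post_norm_block(x, norm, ffn).shape)  # torch.Size([2, 8, 16])
```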
**Attention Q/K Norm Order**

```mermaid
flowchart LR
Input["Input"] --> QProj["Q Proj"]
Input --> KProj["K Proj"]
Input --> VProj["V Proj"]
QProj --> QNorm["Q Norm"]
KProj --> KNorm["K Norm"]
QNorm --> QReshape["Reshape"]
KNorm --> KReshape["Reshape"]
VProj --> VReshape["Reshape"]
QReshape --> QRoPE["RoPE"]
KReshape --> KRoPE["RoPE"]
QRoPE --> SDPA["SDPA"]
KRoPE --> SDPA
VReshape --> SDPA
SDPA --> OutProj["Output Proj"]
style QNorm fill:#c0392b,color:#fff
style KNorm fill:#c0392b,color:#fff
style QRoPE fill:#16a085,color:#fff
style KRoPE fill:#16a085,color:#fff
style SDPA fill:#8e44ad,color:#fff
```
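A rough sketch of the Q/K-norm-before-RoPE ordering shown above, using plain PyTorch stand-ins (`nn.LayerNorm` stands in for `OLMORMSNorm`; dimensions are made up):

```python
import torch
from torch import nn

# Toy dimensions; real values come from the OLMo checkpoints.
batch, seq_len, model_dim, num_heads = 2, 8, 64, 4
head_dim = model_dim // num_heads

x = torch.randn(batch, seq_len, model_dim)
q_proj = nn.Linear(model_dim, model_dim, bias=False)
k_proj = nn.Linear(model_dim, model_dim, bias=False)

# Q/K norm is applied to the full projection output, *before* the
# per-head reshape and before RoPE.
q_norm = nn.LayerNorm(model_dim)  # stand-in for OLMORMSNorm
k_norm = nn.LayerNorm(model_dim)

q = q_norm(q_proj(x)).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
k = k_norm(k_proj(x)).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
# ... RoPE would be applied to q and k here, followed by SDPA.
print(q.shape)  # torch.Size([2, 4, 8, 16])
```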
**OLMo3 Hybrid Attention Pattern**

```mermaid
block-beta
columns 8
L0["0<br/>Sliding"]:1
L1["1<br/>Sliding"]:1
L2["2<br/>Sliding"]:1
L3["3<br/>Full"]:1
L4["4<br/>Sliding"]:1
L5["5<br/>Sliding"]:1
L6["6<br/>Sliding"]:1
L7["7<br/>Full"]:1
L8["8<br/>Sliding"]:1
L9["9<br/>Sliding"]:1
L10["10<br/>Sliding"]:1
L11["11<br/>Full"]:1
space:3
LN["N-1<br/>Full ✱"]:1
style L0 fill:#3498db,color:#fff
style L1 fill:#3498db,color:#fff
style L2 fill:#3498db,color:#fff
style L4 fill:#3498db,color:#fff
style L5 fill:#3498db,color:#fff
style L6 fill:#3498db,color:#fff
style L8 fill:#3498db,color:#fff
style L9 fill:#3498db,color:#fff
style L10 fill:#3498db,color:#fff
style L3 fill:#e67e22,color:#fff
style L7 fill:#e67e22,color:#fff
style L11 fill:#e67e22,color:#fff
style LN fill:#e67e22,color:#fff
```
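A small sketch of the layer pattern as read off the diagram. The "every 4th layer is full attention, and the final layer is always full" rule is taken from the picture; the authoritative values live in the OLMo3 configs:

```python
def olmo3_layer_types(num_layers: int) -> list[str]:
    """Hybrid pattern from the diagram: indices 3, 7, 11, ... use full
    attention, all other layers use sliding-window attention, and the
    final layer is always full."""
    types = []
    for idx in range(num_layers):
        if idx % 4 == 3 or idx == num_layers - 1:
            types.append("full")
        else:
            types.append("sliding")
    return types

print(olmo3_layer_types(12))
# ['sliding', 'sliding', 'sliding', 'full', 'sliding', 'sliding', 'sliding',
#  'full', 'sliding', 'sliding', 'sliding', 'full']
```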
**Sliding Window vs Full Attention**

```mermaid
flowchart TB
subgraph Full["Full Causal Attention"]
direction TB
FT["Tokens:   T1   T2   T3   T4   T5   T6   T7   T8"]
FM["
✅ · · · · · · ·
✅ ✅ · · · · · ·
✅ ✅ ✅ · · · · ·
✅ ✅ ✅ ✅ · · · ·
✅ ✅ ✅ ✅ ✅ · · ·
✅ ✅ ✅ ✅ ✅ ✅ · ·
✅ ✅ ✅ ✅ ✅ ✅ ✅ ·
✅ ✅ ✅ ✅ ✅ ✅ ✅ ✅
"]
FD["Each token attends to ALL previous tokens"]
end
subgraph Sliding["Sliding Window Attention (window=4)"]
direction TB
ST["Tokens:   T1   T2   T3   T4   T5   T6   T7   T8"]
SM["
✅ · · · · · · ·
✅ ✅ · · · · · ·
✅ ✅ ✅ · · · · ·
✅ ✅ ✅ ✅ · · · ·
· ✅ ✅ ✅ ✅ · · ·
· · ✅ ✅ ✅ ✅ · ·
· · · ✅ ✅ ✅ ✅ ·
· · · · ✅ ✅ ✅ ✅
"]
SD["Each token attends to only the last W tokens"]
end
style Full fill:#e67e22,color:#fff
style Sliding fill:#3498db,color:#fff
style FT fill:none,stroke:none,color:#fff
style FM fill:none,stroke:none,color:#fff,font-family:monospace
style FD fill:none,stroke:none,color:#fff
style ST fill:none,stroke:none,color:#fff
style SM fill:none,stroke:none,color:#fff,font-family:monospace
style SD fill:none,stroke:none,color:#fff
```
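The window semantics above can be expressed as a small boolean-mask helper. This is purely illustrative; in the PR, sliding-window layers are wired through `LocalAttentionStateFactory` rather than an explicit mask like this:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask matching the diagram: position i may attend to position j
    iff j <= i (causal) and i - j < window (the last `window` tokens,
    including itself)."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (i - j < window)

mask = sliding_window_causal_mask(8, window=4)
print(mask.int())
```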
**OLMo RMSNorm vs Standard RMSNorm**

```mermaid
flowchart LR
subgraph Standard["Standard RMSNorm"]
direction LR
S1["Normalize<br/>(float32)"] --> S2["Cast to<br/>input dtype"] --> S3["× Weight"]
end
subgraph OLMO["OLMO RMSNorm"]
direction LR
O1["Normalize<br/>(float32)"] --> O2["× Weight"] --> O3["Cast to<br/>input dtype"]
end
style S1 fill:#95a5a6,color:#fff
style S2 fill:#95a5a6,color:#fff
style S3 fill:#95a5a6,color:#fff
style O1 fill:#c0392b,color:#fff
style O2 fill:#c0392b,color:#fff
style O3 fill:#c0392b,color:#fff
```
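A minimal numerical sketch of the dtype-ordering difference (the `eps` value and dtypes here are illustrative, not the fairseq2 implementation):

```python
import torch

def standard_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Standard order: normalize in float32, cast back to the input dtype,
    # then multiply by the (possibly low-precision) weight.
    h = x.float()
    h = h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)
    return weight * h.to(x.dtype)

def olmo_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # OLMo order: normalize in float32, multiply by the weight in float32,
    # and only then cast the result back to the input dtype.
    h = x.float()
    h = h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)
    return (weight.float() * h).to(x.dtype)

x = torch.randn(2, 16, dtype=torch.bfloat16)
w = torch.randn(16, dtype=torch.bfloat16)

# The outputs differ only by where the low-precision cast happens.
print((standard_rmsnorm(x, w) - olmo_rmsnorm(x, w)).abs().max())
```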
**OLMo Architecture Diagrams**

**OLMo Transformer Architecture**

```mermaid
flowchart TD
Input["Input Tokens"] --> Embed["Token Embedding"]
Embed --> Stack
subgraph Stack["× N Decoder Layers"]
subgraph DL["Decoder Layer"]
direction TB
SelfAttn["Self-Attention Block<br/><i>Q/K Norm · RoPE · SDPA</i>"]
AttnNorm["OLMORMSNorm"]
AttnRes["⊕ Residual"]
FFNBlock["Feed-Forward Block<br/><i>(SwiGLU)</i>"]
FFNNorm["OLMORMSNorm"]
FFNRes["⊕ Residual"]
SelfAttn --> AttnNorm --> AttnRes
AttnRes --> FFNBlock --> FFNNorm --> FFNRes
end
end
FFNRes --> FinalNorm["OLMORMSNorm<br/><i>(final)</i>"]
FinalNorm --> LMHead["LM Head<br/><i>(Linear → vocab)</i>"]
LMHead --> Output["Logits"]
%% Residual skip connections
Embed -.->|"residual"| AttnRes
AttnRes -.->|"residual"| FFNRes
style Input fill:#34495e,color:#fff
style Embed fill:#2c3e50,color:#fff
style Output fill:#34495e,color:#fff
style LMHead fill:#2c3e50,color:#fff
style FinalNorm fill:#c0392b,color:#fff
style SelfAttn fill:#8e44ad,color:#fff
style AttnNorm fill:#c0392b,color:#fff
style AttnRes fill:#27ae60,color:#fff
style FFNBlock fill:#2980b9,color:#fff
style FFNNorm fill:#c0392b,color:#fff
style FFNRes fill:#27ae60,color:#fff
```
**Self-Attention Block Detail**

```mermaid
flowchart LR
Input["Input"] --> QProj["Q Proj"]
Input --> KProj["K Proj"]
Input --> VProj["V Proj"]
QProj --> QNorm["Q Norm<br/><i>(OLMORMSNorm)</i>"]
KProj --> KNorm["K Norm<br/><i>(OLMORMSNorm)</i>"]
QNorm --> QReshape["Reshape"]
KNorm --> KReshape["Reshape"]
VProj --> VReshape["Reshape"]
QReshape --> QRoPE["RoPE"]
KReshape --> KRoPE["RoPE"]
QRoPE --> SDPA["SDPA<br/><i>(Full or Sliding Window)</i>"]
KRoPE --> SDPA
VReshape --> SDPA
SDPA --> Flatten["Flatten"] --> OutProj["Output Proj"]
style QNorm fill:#c0392b,color:#fff
style KNorm fill:#c0392b,color:#fff
style QRoPE fill:#16a085,color:#fff
style KRoPE fill:#16a085,color:#fff
style SDPA fill:#8e44ad,color:#fff
```
**Feed-Forward Block Detail (SwiGLU)**

```mermaid
flowchart LR
Input["Input"] --> Gate["Gate Proj"]
Input --> Up["Up Proj"]
Gate --> SiLU["SiLU"]
SiLU --> Mul["⊗ Multiply"]
Up --> Mul
Mul --> Down["Down Proj"] --> Output["Output"]
style Gate fill:#2980b9,color:#fff
style Up fill:#2980b9,color:#fff
style SiLU fill:#2980b9,color:#fff
style Mul fill:#2980b9,color:#fff
style Down fill:#2980b9,color:#fff
```
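And a minimal stand-alone SwiGLU block matching the feed-forward diagram (module and parameter names are illustrative, not fairseq2's):

```python
import torch
from torch import nn

class SwiGLUFeedForward(nn.Module):
    """Minimal SwiGLU block from the diagram:
    down_proj(silu(gate_proj(x)) * up_proj(x))."""

    def __init__(self, model_dim: int, inner_dim: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(model_dim, inner_dim, bias=False)
        self.up_proj = nn.Linear(model_dim, inner_dim, bias=False)
        self.down_proj = nn.Linear(inner_dim, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFeedForward(model_dim=64, inner_dim=172)
print(ffn(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```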
cirquit
left a comment
LGTM! Just a few nits on verbose comments.
I've verified that all the tests run on my end, and the training recipe also works:
```
python3 -m recipes.lm.sft --config-file recipes/lm/sft/configs/olmo2_1b_gsm8k.yaml /tmp/olmo2_sft_smoke_test --config regime.num_steps=5 regime.checkpoint_every_n_steps=1000 regime.validate_every_n_steps=1000 common.metric_recorders.wandb.enabled=false common.metric_recorders.tensorboard.enabled=false
```

Exporting to HF format and restarting from the checkpoint also works. Happy to merge.
Force-pushed from 0dbbf0d to be4cc2e
What does this PR do? Please describe:
Add OLMo2 and OLMo3 model architecture support in fairseq2.
Both architectures share a unified `olmo` module. The key architecture features:

- `OLMORMSNorm`: The order of operations is normalize → multiply by weight → cast to original dtype (differs from standard RMSNorm, which casts before multiplying).
- `OLMOTransformerLMDecoderLayer`: Uses a custom Post-Norm order: Attention/FFN → Norm → Add Residual. This differs from both standard Pre-Norm (Norm → Attention/FFN → Add Residual) and standard Post-Norm (Attention/FFN → Add Residual → Norm).
- `OLMOMultiheadAttention`: Inherits directly from the `MultiheadAttention` abstract base and inlines Q/K/V projection setup.
- Rotary Encoding:
  - `ReferenceRotaryEncoder` (standard RoPE) is used directly.
  - `YaRNRotaryEncoder` is used for long-context extension (8K → 65K), implementing frequency-dependent scaling and attention scaling (mscale).
- OLMo3 Hybrid Attention (sliding window + full attention):
  - Sliding window layers use `ReferenceRotaryEncoder`, full attention layers use `YaRNRotaryEncoder`.

Testing:
Fixes #1402
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: