Adding support for the granite multilingual embeddings R2 (ibm-granite/granite-embedding-{97,311}m-multilingual-r2 models) #22716
Conversation
ModernBert derivatives such as IBM Granite Embedding multilingual R2 (97m / 311m) use SiLU/SwiGLU in the FFN instead of the original GELU/GeGLU. Persist the hidden activation in the GGUF and select LLM_FFN_SWIGLU vs LLM_FFN_GEGLU at graph build time. Also register the Granite R2 tokenizers so the converter recognizes them as modern-bert.
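For context, both FFN flavors compute a gated product of two halves of the up-projection and differ only in the nonlinearity applied to the gate half (SiLU vs. GELU). A minimal numpy sketch of the difference (illustrative only; the gate/up split ordering and the GELU approximation vary by implementation):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def gelu(x):
    # tanh approximation of GELU, common in inference code
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_ffn(x, act):
    gate, up = np.split(x, 2, axis=-1)
    return act(gate) * up  # SwiGLU if act=silu, GeGLU if act=gelu

x = np.linspace(-3.0, 3.0, 8)
print("SwiGLU:", gated_ffn(x, silu))
print("GeGLU: ", gated_ffn(x, gelu))
```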
…e-types, and tokenizer configurations (force-pushed from 34541a7 to 4f283cf)
@gabe-l-hart here is the PR.

Thanks @hansolosan! I'll take a first pass review in the next day or two and notify maintainers once we're ready for final review.
gabe-l-hart left a comment
I think it would be good to make the hparam more flexible for future models that need it.
```cpp
// FFN gated activation flavor (used by ModernBert/derivatives that may use
// SwiGLU instead of the default GeGLU). The graph for those archs reads
// this to pick LLM_FFN_SWIGLU vs LLM_FFN_GEGLU.
bool ffn_act_swiglu = false;
```
NIT: Most model-specific hparams live towards the bottom of the field declarations. This affects how structs are initialized, and while this repo never uses direct initialization for hparams, other tools that use this header (yes, that violates encapsulation, but it's the internet) might.
Less-NIT: In the GGUF this looks like it's represented as a string, but here it's a bool, which limits its future usability. I think it would be cleaner to use llm_ffn_op_type (declared in llama-graph.h, so available here). This would also avoid the need for the ternary above.

If we go that route, we could also align the name as ffn_op. Further, we could add a helper in llama-graph.* to do the enum <-> string mapping so it's centralized and reusable; a sketch of that idea follows.
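To make the suggestion concrete, here is a minimal sketch of a centralized string <-> op mapping, written in Python for brevity (the actual helper would be C++ in llama-graph.*; all names below are hypothetical, not existing llama.cpp API):

```python
from enum import Enum

class FfnOp(Enum):
    # values are the strings persisted in the GGUF metadata
    GEGLU  = "geglu"   # GELU-gated FFN: the ModernBert default
    SWIGLU = "swiglu"  # SiLU-gated FFN: Granite R2 97m

def ffn_op_from_str(name: str) -> FfnOp:
    # a single table drives both directions, so GGUF writing and
    # graph building can never disagree on the spelling
    try:
        return FfnOp(name.lower())
    except ValueError:
        raise ValueError(f"unknown FFN activation: {name!r}") from None

def ffn_op_to_str(op: FfnOp) -> str:
    return op.value

assert ffn_op_from_str("SwiGLU") is FfnOp.SWIGLU
assert ffn_op_to_str(FfnOp.GEGLU) == "geglu"
```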
Hi Gabe - looks like llama-graph.h includes this file (llama-hparams.h), so pulling the enum in from there would create an include cycle. We could move the llm_ffn_op_type enum to llama-arch.h and have it included from here.
…pt-4o tokenizer, changed that.
- the pretokenizer regex for gpt-4o had a bug (exposed by Arabic text): added the combining-marks class \p{M} to the lookup regex (see the demonstration below)
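To see why \p{M} matters: Arabic diacritics are combining marks, not letters, so a letter-only class splits a word at every mark. A small check with the third-party regex module, which, unlike the stdlib re, supports \p{...} property classes (this is an illustration, not the actual pretokenizer regex):

```python
import regex  # pip install regex

text = "مُحَمَّد"  # Arabic word with diacritics (combining marks)

# Letters only: every combining mark breaks the run
print(regex.findall(r"\p{L}+", text))        # ['م', 'ح', 'م', 'د']

# Letters plus marks: the word stays intact
print(regex.findall(r"[\p{L}\p{M}]+", text)) # ['مُحَمَّد']
```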
I've confirmed that the inference is working as intended. Here was my process:

Conversion

```sh
cd ~/models && hf download ibm-granite/granite-embedding-97m-multilingual-r2 --local-dir ibm-granite/granite-embedding-97m-multilingual-r2
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-embedding-97m-multilingual-r2/
```

Baseline w/ Sentence Transformers

I used this script (granite_embed.py) to compare the results of running with Sentence Transformers against llama.cpp:

```python
from sentence_transformers import SentenceTransformer
import numpy as np
import subprocess
import shlex
import sys
model_path = "/Users/ghart/models/ibm-granite/granite-embedding-97m-multilingual-r2"
lcpp_model = f"{model_path}/granite-embedding-97M-multilingual-r2-BF16.gguf"
lcpp_exe = "./build/bin/llama-embedding"
if len(sys.argv) > 1:
    model_path = sys.argv[1]
if len(sys.argv) > 2:
    lcpp_model = sys.argv[2]
if len(sys.argv) > 3:
    lcpp_exe = sys.argv[3]
model = SentenceTransformer(model_path)
input_queries = [
    "hello world",
    "tell me a story about a developer and their dog",
    "123sfg this is a r@nd0m t35t",
]
def cosine_similarity(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    vector_a = np.asarray(vector_a)
    vector_b = np.asarray(vector_b)
    numerator = np.dot(vector_a, vector_b)
    denominator_a = np.linalg.norm(vector_a)
    denominator_b = np.linalg.norm(vector_b)
    if denominator_a == 0 or denominator_b == 0:
        return 0.0
    return numerator / (denominator_a * denominator_b)
for query in input_queries:
    print("### BASELINE ###")
    embedding = model.encode([query])
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding[:, :8])

    print("### llama.cpp ###")
    # --embd-normalize -1 disables normalization so the raw embedding can be
    # compared against the (unnormalized) Sentence Transformers output
    cmd = f"{lcpp_exe} -m {lcpp_model} -p \"{query}\" --temp 0 --embd-normalize -1"
    print(f"llama.cpp command: {cmd}")
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    # llama-embedding prints the vector after the final ':' on stdout
    vals = out.decode("utf-8").split(":")[-1]
    vals = [float(v) for v in vals.split() if v.strip()]
    lcpp_emb = np.array(vals)
    print("llama.cpp Embedding shape:", lcpp_emb.shape)
    print("llama.cpp Embedding vector:", lcpp_emb[:8])
    print()

    # model.encode returns shape (1, dim); compare the single row
    cos_sim = cosine_similarity(embedding[0], lcpp_emb)
    print(f"COSINE SIMILARITY: {cos_sim}")
    print("--------------------------------")
    print()
```

Results w/out branch

Results w/ branch
…ith the other model-specific hparams.
Overview
The PR adds support for two just-released Granite multilingual embedding models based on the ModernBERT architecture. Support is added to link the tokenizers properly and to use a different activation function (SiLU/SwiGLU) for the 97m model instead of the regular GeGLU.
Additional information
The models are available here: https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2 and https://huggingface.co/ibm-granite/granite-embedding-311m-multilingual-r2. On retrieval scores, the 97m model is 8 points better than the next model on the MMTEB leaderboard under 100M parameters, and the 311m model ranks second in the <500M-parameter category.
Requirements
I am not an AI agent :).