Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?
New models submission:
Have you added an explanation of why it's important to include this model?
Closes #140. This was previously attempted in #181 but was not completed. intfloat/multilingual-e5-large-instruct is a state-of-the-art multilingual embedding model that supports instruction-based embeddings across 100+ languages. It outperforms multilingual-e5-large on MTEB benchmarks and is widely used for multilingual retrieval tasks. In my own experience, multilingual-e5-large-instruct performs noticeably better than multilingual-e5-large on retrieval tasks, even in the other supported languages.
Have you added tests for the new model? Were canonical values for tests computed via the original model?
Yes, canonical values were computed using fastembed itself (not sentence-transformers).
Have you added the code snippet for how canonical values were computed?
Have you successfully run tests with your changes locally?
Yes, verified via a standalone script that the canonical vector matches within atol=1e-3.
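The standalone script itself is not included in this description. As a rough sketch of what such a check might look like, the snippet below embeds one text with fastembed's `TextEmbedding` and compares the leading components against recorded values using an absolute tolerance. The helper names `embed_prefix` and `within_atol`, the prefix length of 5, and the choice of input text are assumptions made for illustration, not taken from the PR.

```python
import math
from typing import Sequence


def within_atol(
    actual: Sequence[float], expected: Sequence[float], atol: float = 1e-3
) -> bool:
    """Return True if the leading values of `actual` match `expected` within `atol`."""
    if len(actual) < len(expected):
        return False
    # Pure absolute tolerance: rel_tol is disabled so only `atol` matters.
    return all(
        math.isclose(a, e, rel_tol=0.0, abs_tol=atol)
        for a, e in zip(actual, expected)
    )


def embed_prefix(model_name: str, text: str, n: int = 5) -> list[float]:
    """Embed `text` with fastembed and return the first `n` components.

    Hypothetical helper: the import is lazy because constructing the model
    downloads the ONNX weights on first use.
    """
    from fastembed import TextEmbedding

    model = TextEmbedding(model_name=model_name)
    vector = next(iter(model.embed([text])))  # embed() yields one array per input
    return [float(x) for x in vector[:n]]
```

A run would call `embed_prefix("intfloat/multilingual-e5-large-instruct", "hello world")` once to record the values, then pass a fresh result and the recorded values to `within_atol`. Because the reference values come from fastembed itself, a check like this guards against regressions rather than confirming agreement with the original model.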
cc @hh-space-invader @joein
Note: #181 previously attempted to add this model but required a manual ONNX export, as official ONNX support was unavailable at the time (I believe). The ONNX model is now officially available on the model's HuggingFace page, making this a clean addition without any manual export.
This pull request adds support for a new ONNX-based text embedding model, intfloat/multilingual-e5-large-instruct, by extending the supported models registry in the FastEmbed library. The model entry specifies an embedding dimension of 1024, references the Hugging Face model source, and declares the ONNX artifact locations (model file and data file). A corresponding test entry was added to the canonical vector values dictionary to support validation testing of the new model.
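To make the shape of such a registry entry concrete, here is a minimal sketch written as a plain dictionary with a small sanity check. The field names, the ONNX file paths, and the `validate_entry` helper are all assumptions for illustration. The actual diff is not reproduced here, and FastEmbed's real registry may use a different structure.

```python
# Hypothetical registry entry. Field names and file paths are placeholders,
# not copied from the PR diff.
NEW_MODEL_ENTRY = {
    "model": "intfloat/multilingual-e5-large-instruct",
    "dim": 1024,  # embedding dimension stated in the summary
    "sources": {"hf": "intfloat/multilingual-e5-large-instruct"},
    "model_file": "onnx/model.onnx",  # assumed path to the ONNX graph
    "additional_files": ["onnx/model.onnx_data"],  # assumed external weights file
}

REQUIRED_KEYS = {"model", "dim", "sources", "model_file"}


def validate_entry(entry: dict) -> None:
    """Raise ValueError if a registry entry is missing keys or has a bad dimension."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(entry["dim"], int) or entry["dim"] <= 0:
        raise ValueError("dim must be a positive integer")


validate_entry(NEW_MODEL_ENTRY)  # passes silently for a well-formed entry
```

Listing the data file separately reflects the summary's mention of two artifacts: large ONNX models commonly store their weights in an external data file next to the graph file.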
The title directly and clearly describes the main change: adding support for the intfloat/multilingual-e5-large-instruct model.
Linked Issues check
✅ Passed
The PR successfully implements the primary objective from issue #140 by adding the intfloat/multilingual-e5-large-instruct model to the supported models with proper configuration and tests.
Out of Scope Changes check
✅ Passed
All changes are directly related to adding the new model: model entry in onnx_embedding.py and corresponding test canonical vector in test file.
Docstring Coverage
✅ Passed
No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check
✅ Passed
The pull request description directly addresses the changeset by explaining the addition of the intfloat/multilingual-e5-large-instruct model, providing justification, test details, and verification steps.