mem: exclude unused spaCy pipeline components to reduce model memory #4296
Open
KRRT7 wants to merge 3 commits into Unstructured-IO:main from
badGarnet
reviewed
Mar 27, 2026
Collaborator
The trade-off — sentence splitting quality:
- Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
- After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
- This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.
I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.
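To make the edge case above concrete, here is a minimal sketch of purely punctuation-based splitting. It is a toy stand-in written for illustration (the `naive_split` helper is hypothetical), not spaCy's sentencizer, which operates on tokens and may behave differently on the same input.

```python
import re


def naive_split(text: str) -> list[str]:
    """Toy rule-based splitter: break after '.', '?' or '!' followed by whitespace."""
    pieces = re.split(r"(?<=[.?!])\s+", text.strip())
    return [piece for piece in pieces if piece]


text = "Dr. Smith reviewed the chart. The patient was discharged."
for sentence in naive_split(text):
    print(repr(sentence))
# 'Dr.'
# 'Smith reviewed the chart.'
# 'The patient was discharged.'
```

The abbreviation is cut off as its own "sentence", which is the kind of small boundary shift that can change downstream chunking.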
Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when loading en_core_web_sm, and add lightweight sentencizer to replace the dependency parser for sentence boundary detection. Saves ~12 MiB of model weights per process.
Per review feedback, removing parser and using sentencizer causes sentence splitting regressions. Keep parser loaded, only exclude ner, lemmatizer, and attribute_ruler.
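A rough sketch of what the revised loading step could look like, assuming spaCy v3's `exclude` argument to `spacy.load`. The `ASSUMED_PIPELINE` list, the `NEEDED` set, and both helper names are illustrative assumptions and are not taken from the PR diff; check `nlp.pipe_names` locally for the real component order.

```python
# Assumed default component order for en_core_web_sm; verify with nlp.pipe_names.
ASSUMED_PIPELINE = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]

# Components kept after the review: tagging plus parser-based sentence boundaries.
NEEDED = {"tok2vec", "tagger", "parser"}


def components_to_exclude(pipeline, needed):
    """Return the pipeline components that are not needed, in pipeline order."""
    return [name for name in pipeline if name not in needed]


def load_trimmed_model(name="en_core_web_sm"):
    """Load the model without the unused components (hypothetical helper)."""
    import spacy  # imported lazily so the helper above works without spaCy installed

    return spacy.load(name, exclude=components_to_exclude(ASSUMED_PIPELINE, NEEDED))


print(components_to_exclude(ASSUMED_PIPELINE, NEEDED))
# ['attribute_ruler', 'lemmatizer', 'ner']
```

Unlike `disable`, components passed via `exclude` are not loaded at all, which is what makes the memory saving possible.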
2291c23 to 23c4fff
Collaborator
Author
Good call, updated to keep the parser. Also rebased onto main and bumped to 0.22.9.
Only tok2vec, tagger, and sentence splitting are used (pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when loading en_core_web_sm, and add lightweight sentencizer to replace the dependency parser for sentence boundary detection.
Benchmark
Measured with memray (memray run + memray stats --json), 3 rounds × 5 texts through pos_tag() + sent_tokenize() + word_tokenize(), Python 3.12.
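The workload shape described above (3 rounds × 5 texts through three functions) could be driven by a small script like the following sketch, which would then be executed under memray. The `run_workload` helper, the sample texts, and the stand-in callables are assumptions for illustration; the actual benchmark script is not shown in the PR, and the real run would pass pos_tag, sent_tokenize, and word_tokenize instead.

```python
def run_workload(texts, funcs, rounds=3):
    """Call every function on every text, `rounds` times; return the call count."""
    calls = 0
    for _ in range(rounds):
        for text in texts:
            for func in funcs:
                func(text)
                calls += 1
    return calls


# Five placeholder texts; the benchmark's real inputs are not given in the PR.
texts = [f"Sample text number {i}. It has two sentences." for i in range(5)]

# Stand-ins for pos_tag, sent_tokenize and word_tokenize so the sketch runs anywhere.
stand_ins = [str.split, str.lower, len]

print(run_workload(texts, stand_ins))
# 45
```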