mem: exclude unused spaCy pipeline components to reduce model memory#4296

Open
KRRT7 wants to merge 3 commits into Unstructured-IO:main from KRRT7:mem/spacy-exclude-unused

Conversation

@KRRT7 (Collaborator) commented Mar 24, 2026

Only the tok2vec, tagger, and sentence-splitting components are used (by pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, and attribute_ruler when loading en_core_web_sm, and add the lightweight sentencizer to replace the dependency parser for sentence boundary detection.
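A minimal sketch of the loading change described above. The `exclude` argument and the `sentencizer` pipe are real spaCy APIs; `load_nlp` is a hypothetical wrapper name, and the actual load requires spacy and en_core_web_sm to be installed:

```python
# Components unused by pos_tag/sent_tokenize in this PR's first revision.
EXCLUDED = ["ner", "parser", "lemmatizer", "attribute_ruler"]

def load_nlp():
    """Hypothetical wrapper: load the small English model without the
    unused components, then add the rule-based sentencizer so that
    Doc.sents still works without the dependency parser."""
    import spacy  # requires spacy + en_core_web_sm installed

    nlp = spacy.load("en_core_web_sm", exclude=EXCLUDED)
    nlp.add_pipe("sentencizer")
    return nlp
```

Excluded components are never loaded at all (unlike `disable`, which loads them but skips them at runtime), which is what produces the memory saving.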

Benchmark

Measured with memray (memray run + memray stats --json), 3 rounds × 5 texts through pos_tag() + sent_tokenize() + word_tokenize(), Python 3.12.


spaCy en_core_web_sm — component exclusion benchmark
pos_tag + sent_tokenize + word_tokenize  |  3 rounds x 5 texts  |  Python 3.12.12

Configuration                               Peak MB      Saved      %
----------------------------------------------------------------------
All components (default)                    202.1MB      0.0MB   0.0%
Exclude ner/parser/lemma/attr_ruler         189.3MB     12.7MB   6.3%
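The measurement approach can be sketched with the stdlib `tracemalloc` (the benchmark itself used memray, which also tracks native allocations; `tracemalloc` here is a simplified stand-in for illustration):

```python
import tracemalloc

def peak_mb(fn):
    # Measure the peak Python-heap allocation of fn() in MiB.
    # memray (used in the PR) would additionally capture native/C allocations
    # such as spaCy model weights; tracemalloc only sees Python objects.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

# Example: allocating a million-element list peaks at several MiB.
print(round(peak_mb(lambda: [0] * 1_000_000), 1))
```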

@badGarnet (Collaborator) left a comment


The trade-off is sentence-splitting quality:

  • Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
  • After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
  • This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.
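The abbreviation edge case is easy to reproduce with a punctuation-only splitter (an illustrative regex, not spaCy's actual sentencizer implementation):

```python
import re

def naive_split(text):
    # Split after ., ?, or ! followed by whitespace -- roughly the failure
    # mode of rule-based splitting, for illustration only.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

print(naive_split("Dr. Smith arrived. He sat down."))
# → ['Dr.', 'Smith arrived.', 'He sat down.']  ("Dr. Smith" is wrongly split)
```

A dependency-parse-based splitter can use syntactic context to keep "Dr. Smith" together, which is why the parser is the more accurate (but heavier) option.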

I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.

KRRT7 added 3 commits March 27, 2026 13:51
Only tok2vec, tagger, and sentence splitting are used (pos_tag and
sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when
loading en_core_web_sm, and add lightweight sentencizer to replace the
dependency parser for sentence boundary detection.

Saves ~12 MiB of model weights per process.
Per review feedback, removing parser and using sentencizer causes
sentence splitting regressions. Keep parser loaded, only exclude
ner, lemmatizer, and attribute_ruler.
@KRRT7 force-pushed the mem/spacy-exclude-unused branch from 2291c23 to 23c4fff on March 27, 2026 at 18:53
@KRRT7 (Collaborator, Author) commented Mar 27, 2026

Good call — updated to keep parser loaded for accurate sentence boundaries. Now only excluding ner, lemmatizer, and attribute_ruler. Memory savings drop from ~12.7 MiB to ~7 MiB but we avoid the sentence splitting regression.

Also rebased onto main and bumped to 0.22.9.
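The final configuration after review can be sketched as follows (hypothetical wrapper name; the parser stays loaded so `Doc.sents` keeps its dependency-parse-based boundaries):

```python
# Final exclusion list per review feedback: parser is kept for accurate
# sentence boundaries; only genuinely unused components are excluded.
FINAL_EXCLUDE = ["ner", "lemmatizer", "attribute_ruler"]

def load_nlp():
    import spacy  # requires spacy + en_core_web_sm installed

    # No sentencizer needed: the parser provides sentence boundaries.
    return spacy.load("en_core_web_sm", exclude=FINAL_EXCLUDE)
```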
