mem: exclude unused spaCy pipeline components to reduce model memory#4296

Open
KRRT7 wants to merge 3 commits into Unstructured-IO:main from KRRT7:mem/spacy-exclude-unused

Conversation

@KRRT7 (Collaborator) commented Mar 24, 2026

Only the tok2vec, tagger, and sentence-splitting components are used (by pos_tag and sent_tokenize). Exclude ner, parser, lemmatizer, and attribute_ruler when loading en_core_web_sm, and add the lightweight sentencizer to replace the dependency parser for sentence boundary detection.
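A minimal sketch of the loading change described above. The `exclude` argument and the `sentencizer` pipe are real spaCy APIs; `load_nlp` is a hypothetical wrapper name, and the actual load requires spacy and en_core_web_sm to be installed:

```python
# Components unused by pos_tag/sent_tokenize in this PR's first revision.
EXCLUDED = ["ner", "parser", "lemmatizer", "attribute_ruler"]

def load_nlp():
    """Hypothetical wrapper: load the small English model without the
    unused components, then add the rule-based sentencizer so that
    Doc.sents still works without the dependency parser."""
    import spacy  # requires spacy + en_core_web_sm installed

    nlp = spacy.load("en_core_web_sm", exclude=EXCLUDED)
    nlp.add_pipe("sentencizer")
    return nlp
```

Excluded components are never loaded at all (unlike `disable`, which loads them but skips them at runtime), which is what produces the memory saving.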

Benchmark

Measured with memray (memray run + memray stats --json), 3 rounds × 5 texts through pos_tag() + sent_tokenize() + word_tokenize(), Python 3.12.


spaCy en_core_web_sm — component exclusion benchmark
pos_tag + sent_tokenize + word_tokenize  |  3 rounds x 5 texts  |  Python 3.12.12

Configuration                               Peak MB      Saved      %
----------------------------------------------------------------------
All components (default)                    202.1MB      0.0MB   0.0%
Exclude ner/parser/lemma/attr_ruler         189.3MB     12.7MB   6.3%
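The measurement approach can be sketched with the stdlib `tracemalloc` (the benchmark itself used memray, which also tracks native allocations; `tracemalloc` here is a simplified stand-in for illustration):

```python
import tracemalloc

def peak_mb(fn):
    # Measure the peak Python-heap allocation of fn() in MiB.
    # memray (used in the PR) would additionally capture native/C allocations
    # such as spaCy model weights; tracemalloc only sees Python objects.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

# Example: allocating a million-element list peaks at several MiB.
print(round(peak_mb(lambda: [0] * 1_000_000), 1))
```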

@badGarnet (Collaborator) left a comment


The trade-off is sentence-splitting quality:

  • Currently sent_tokenize() (line 173) gets sentence boundaries from the parser (dependency-parse-based, more accurate).
  • After this change, it uses the sentencizer (rule-based, splits on punctuation like .?!).
  • This is less accurate for edge cases (abbreviations like "Dr. Smith", numbered lists, etc.) but faster and lighter.
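The abbreviation edge case is easy to reproduce with a punctuation-only splitter (an illustrative regex, not spaCy's actual sentencizer implementation):

```python
import re

def naive_split(text):
    # Split after ., ?, or ! followed by whitespace -- roughly the failure
    # mode of rule-based splitting, for illustration only.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

print(naive_split("Dr. Smith arrived. He sat down."))
# → ['Dr.', 'Smith arrived.', 'He sat down.']  ("Dr. Smith" is wrongly split)
```

A dependency-parse-based splitter can use syntactic context to keep "Dr. Smith" together, which is why the parser is the more accurate (but heavier) option.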

I think this is why we see the ingest test failure (some minor changes). I would put parser back just to be safe.

KRRT7 added 3 commits March 27, 2026 13:51
Only tok2vec, tagger, and sentence splitting are used (pos_tag and
sent_tokenize). Exclude ner, parser, lemmatizer, attribute_ruler when
loading en_core_web_sm, and add lightweight sentencizer to replace the
dependency parser for sentence boundary detection.

Saves ~12 MiB of model weights per process.
Per review feedback, removing parser and using sentencizer causes
sentence splitting regressions. Keep parser loaded, only exclude
ner, lemmatizer, and attribute_ruler.
@KRRT7 force-pushed the mem/spacy-exclude-unused branch from 2291c23 to 23c4fff on March 27, 2026 at 18:53
@KRRT7 (Collaborator, Author) commented Mar 27, 2026

Good call — updated to keep parser loaded for accurate sentence boundaries. Now only excluding ner, lemmatizer, and attribute_ruler. Memory savings drop from ~12.7 MiB to ~7 MiB but we avoid the sentence splitting regression.

Also rebased onto main and bumped to 0.22.9.
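The final configuration after review can be sketched as follows (hypothetical wrapper name; the parser stays loaded so `Doc.sents` keeps its dependency-parse-based boundaries):

```python
# Final exclusion list per review feedback: parser is kept for accurate
# sentence boundaries; only genuinely unused components are excluded.
FINAL_EXCLUDE = ["ner", "lemmatizer", "attribute_ruler"]

def load_nlp():
    import spacy  # requires spacy + en_core_web_sm installed

    # No sentencizer needed: the parser provides sentence boundaries.
    return spacy.load("en_core_web_sm", exclude=FINAL_EXCLUDE)
```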
