Add lazy spaCy CLI loading and static launcher #13933
Open
* change typer-slim dependency to typer
* set rich_markup_mode to None to preserve behaviour
Replace all pydantic.v1 compat imports with direct pydantic v2 imports. Migrate schemas to v2 API: ConfigDict instead of inner Config class, field_validator instead of validator, RootModel instead of __root__, model_dump() instead of dict(), model_validate() instead of parse_obj(), Annotated[str, StringConstraints()] instead of ConstrainedStr, min_length instead of min_items, populate_by_name instead of allow_population_by_field_name.
- Replace pydantic.v1 compat imports with direct v2 imports
- Replace class Config with model_config = ConfigDict(...)
- Replace @validator with @field_validator
- Replace ConstrainedStr with constr()
- Replace min_items with min_length, allow_population_by_field_name with populate_by_name
- Add model_rebuild() calls in __init__.py for forward ref resolution
- Update test error type assertions for v2
…pe annotation
- Update expected error counts in test_pattern_validation.py for pydantic v2 (v2 reports errors for all union members, increasing counts for OP and nested pattern validation)
- Fix AttributeRulerPatternType to include List[MatcherPatternType] in the union (v2 is strict about nested list-of-list-of-dict types that v1 accepted laxly)
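As a rough illustration of why the expected error counts change, the sketch below validates one bad value against a three-member union under Pydantic v2. `ValueType` is a hypothetical stand-in and not the pattern schema in spaCy. It only shows the v2 side: one error is reported per union member that failed.

```python
from typing import Dict, List, Union

from pydantic import TypeAdapter, ValidationError

# Hypothetical stand-in for a pattern value type; not spaCy's schema.
ValueType = Union[int, str, List[str]]
adapter = TypeAdapter(Dict[str, ValueType])

errors = []
try:
    # None matches none of the three union members.
    adapter.validate_python({"OP": None})
except ValidationError as err:
    errors = err.errors()

for error in errors:
    # Each entry's location names the key and the union member that failed.
    print(error["loc"], error["type"])
print(len(errors))
```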
- requirements.txt: remove black, isort, flake8; add ruff
- pyproject.toml: replace [tool.isort] with [tool.ruff] config
- setup.cfg: remove [flake8] section (rules moved to pyproject.toml)
- .pre-commit-config.yaml: replace black/flake8 hooks with ruff/ruff-format
Use confection v1.3 and Thinc v8.3.13, which implement custom validation logic in place of Pydantic, allowing us to properly adopt Pydantic v2 and provide full Python 3.14 support.
Our dependency tree used Pydantic v1 in unusual ways, and relied on behaviours that Pydantic v2 reformed. In the time since Pydantic v2 was released there were a few attempts to migrate over to it, but the task has been complicated by the fact that the confection library has a fairly tangled implementation and I had reduced availability for open-source work in 2024 and 2025.
Specifically, our library confection provides the extensible configuration system we use in spaCy and Thinc. The config system allows you to refer to values that will be supplied by arbitrary functions, that e.g. define some neural network model or its sublayers. The functionality in confection is complicated because we aggressively prioritised user experience in the specification, even if it required increased implementation complexity.
Confection's original implementation built a dynamic Pydantic v1 schema for function-supplied values ("promises"). We validate the schema before calling any promises, and then validate the schema again after calling all the promises and substituting in their values. The variable-interpolation system adds further difficulties to the implementation, and we have to do it all while subclassing the Python built-in configparser, which ties us to implementation choices I'd do differently if I had a clean slate.
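To make the validate, resolve, validate-again flow concrete, here is a deliberately simplified sketch using only the standard library. It is not confection's implementation: `REGISTRY`, `PROMISE_KEY`, and the helper functions are hypothetical, signature binding stands in for schema validation, and variable interpolation is left out entirely.

```python
import inspect
from typing import Any, Callable, Dict

REGISTRY: Dict[str, Callable[..., Any]] = {}  # hypothetical function registry
PROMISE_KEY = "@layers"  # simplified marker for a function-supplied value


def register(name: str):
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register("linear.v1")
def make_linear(width: int, depth: int = 1) -> Dict[str, int]:
    return {"width": width, "depth": depth}


def is_promise(value: Any) -> bool:
    return isinstance(value, dict) and PROMISE_KEY in value


def promise_args(value: Dict[str, Any]) -> Dict[str, Any]:
    return {k: v for k, v in value.items() if k != PROMISE_KEY}


def validate_unresolved(config: Dict[str, Any]) -> None:
    """Pass 1: check promise arguments without calling any promise."""
    for value in config.values():
        if is_promise(value):
            kwargs = promise_args(value)
            validate_unresolved(kwargs)  # promises may be nested
            func = REGISTRY[value[PROMISE_KEY]]
            inspect.signature(func).bind(**kwargs)  # TypeError on bad names
        elif isinstance(value, dict):
            validate_unresolved(value)


def resolve(config: Dict[str, Any]) -> Dict[str, Any]:
    """Call each promise bottom-up and substitute its return value."""
    resolved: Dict[str, Any] = {}
    for key, value in config.items():
        if is_promise(value):
            kwargs = resolve(promise_args(value))
            resolved[key] = REGISTRY[value[PROMISE_KEY]](**kwargs)
        elif isinstance(value, dict):
            resolved[key] = resolve(value)
        else:
            resolved[key] = value
    return resolved


def validate_resolved(resolved: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Pass 2: check the substituted values against the expected types."""
    for key, expected in schema.items():
        if not isinstance(resolved.get(key), expected):
            raise TypeError(f"{key!r} should be {expected.__name__}")


config = {"model": {"@layers": "linear.v1", "width": 64}, "seed": 0}
validate_unresolved(config)
resolved = resolve(config)
validate_resolved(resolved, {"model": dict, "seed": int})
print(resolved)  # {'model': {'width': 64, 'depth': 1}, 'seed': 0}
```

The point of the first pass is that a misspelled argument is reported before any potentially expensive function runs.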
Here's one summary of the Pydantic v1-specific behaviours that made the migration to v2 particularly difficult for us. This particular summary was produced during a session with Claude Code Opus 4.6, so some details of it might be wrong. The full history of attempts at doing this spans different refactors separated by a few months at a time, so I don't have a full record of all the things that I struggled with.
The core problem we kept hitting: Pydantic v2 compiles validation schemas upfront and has much stricter immutability. The whole session has been a series of workarounds for this:
```
1. Schema mutation — v1 let you mutate __fields__ in place; v2 needs model_rebuild() which loses forward ref namespaces, or create_model subclasses which don't propagate to parent schemas.
2. model_dump vs dict — v2 converts dataclasses to dicts, breaking resolved objects. Needed a custom _model_to_dict helper.
3. model_construct drops extras — v2 silently drops fields with extra="forbid", needed manual workarounds.
4. Strict coercion — v2 coerces ndarray to List[Floats1d] via iteration, needed strict=True.
5. Forward refs — Every schema with TYPE_CHECKING imports needs model_rebuild() with the right namespace, and that breaks when confection re-rebuilds later.
```
In order to adjust for behavioural differences like these, I'd refactored confection to build the different versions of the schema in multiple passes, instead of building all the representations together as we'd been doing. However, this refactor itself had problems, further complicating the migration.
~I've now bitten the bullet and rolled back the refactor I'd been attempting of confection, and instead replaced the Pydantic validation with custom logic. This allows Confection to remove Pydantic as a dependency entirely.~ Update: Actually I went back and got the refactor working. All much nicer now.
I've gone to some lengths to explain this because migrating off a dependency after breaking changes can be a sensitive topic. I want to stress that the changes Pydantic made from v1 to v2 are very good, and I greatly appreciate them as a user of FastAPI in our services. It would be very bad for the ecosystem if Pydantic pinned themselves to exactly matching the behaviours they had in v1 just to avoid breaking support for the sort of thing we'd been doing. Instead, users like us who were relying on those behaviours should just find some way to adapt: either vendor the v1 version we need, or change our behaviours, or implement an alternative. I would have liked to do this sooner, but we've ultimately gone with the third option.
- setup.py: rename loop variable shadowing parameter (B020)
- _util.py: remove unused registry import (F401), use specific except clause (E722, B904)
- test_cli_app.py: use dict literals instead of dict() (C408)
- main.py: extract _try_static_group to reduce complexity (C901)
The `spacy` CLI takes ages to start the first time you run it because it loads everything, and it's still kind of slow subsequently. This has always sucked a bit, but it will suck especially in agentic coding workflows.

This PR tries to address the issue by adding a second package `spacy_cli` that will be bundled into the same PyPI distribution (`spacy`). The `spacy` entrypoint will be provided from the lightweight `spacy_cli` package so that help can run instantly.
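The general idea of a lightweight entrypoint can be sketched as follows. This is not the code in this PR: the `COMMANDS` table, `render_help`, and `main` are hypothetical, and standard-library modules stand in for the heavy command implementations so the sketch runs anywhere. Help text comes from a static table, and the implementing module is imported only once a command is actually selected.

```python
import importlib
from types import ModuleType
from typing import Dict, List, Optional, Tuple

# Hypothetical static table: command name -> (module path, one-line help).
# Stdlib modules stand in for heavy command implementations in this sketch.
COMMANDS: Dict[str, Tuple[str, str]] = {
    "train": ("json", "Train a pipeline."),
    "evaluate": ("csv", "Evaluate a pipeline."),
}


def render_help() -> str:
    """Build help text from the static table, importing nothing heavy."""
    lines = ["Usage: spacy [COMMAND]", "", "Commands:"]
    lines += [f"  {name:<10}{text}" for name, (_, text) in COMMANDS.items()]
    return "\n".join(lines)


def main(argv: List[str]) -> Optional[ModuleType]:
    if not argv or argv[0] in ("-h", "--help"):
        print(render_help())
        return None
    name = argv[0]
    if name not in COMMANDS:
        raise SystemExit(f"Unknown command: {name}")
    module_path, _ = COMMANDS[name]
    # The expensive import happens only on this path.
    return importlib.import_module(module_path)


main(["--help"])              # prints help without loading any command module
loaded = main(["train"])      # imports the module behind "train" on demand
print(loaded.__name__)        # json
```

The trade-off in a design like this is that the static table has to be kept in sync with the real commands.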