feat/Allow PDF partitioning without unstructured_inference #2128
Description
Is your feature request related to a problem? Please describe.
Up until unstructured 0.10.27, it was possible to use the fast and ocr_only strategies without having unstructured_inference installed (which pulls in a lot of transitive dependencies). However, starting with 0.10.28, PDF partitioning has a hard dependency on unstructured_inference in two ways:
A top-level import of unstructured.partition.ocr, which in turn has a top-level import from unstructured_inference:
from unstructured.partition.ocr import (
This makes it impossible to use PDF partitioning without unstructured_inference installed, because importing from unstructured.partition.pdf fails.
For OCR partitioning, there is another explicit check in place to require unstructured_inference:
unstructured/unstructured/partition/pdf.py
Line 324 in 2931cb3
@requires_dependencies("unstructured_inference")
Describe the solution you'd like
Ideally, both fast and ocr_only partitioning are possible without having to install all of unstructured_inference including transitive dependencies, basically the state of 0.10.27. This can be done by guarding all imports with explicit checks in various places.
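To illustrate what "guarding imports with explicit checks" could look like, here is a minimal sketch. The helper names dependency_available and requires_optional, and the function partition_with_layout_model, are hypothetical and exist only for this example; they are not the library's actual API. The idea is to check for the optional package when a function is called rather than when the module is imported.

```python
import functools
import importlib.util


def dependency_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def requires_optional(module_name: str):
    """Hypothetical decorator: raise at call time, not at import time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not dependency_available(module_name):
                raise ImportError(
                    f"{func.__name__} requires the optional package {module_name!r}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_optional("unstructured_inference")
def partition_with_layout_model(filename: str):
    # Hypothetical function. The import is deferred into the body, so merely
    # importing this module never touches unstructured_inference.
    import unstructured_inference  # noqa: F401

    return filename
```

With this shape, code paths that never call the guarded function do not need the optional package at all.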
Describe alternatives you've considered
- Installing unstructured_inference. In my environment, the application using unstructured is packaged in a docker image - adding the unstructured_inference dependency increases the size of the docker image by more than 3GB which makes distribution difficult.
- Restoring fast partitioning by avoiding top-level imports from unstructured.partition.ocr in unstructured.partition.pdf for the code path of the fast strategy. While this restores basic functionality, it reduces the number of parseable PDFs considerably.
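The second alternative, keeping the OCR import off the fast code path, might be sketched as follows. The names partition_pdf_sketch, load_ocr_module and the ocr_loader parameter are assumptions made for illustration, and the returned strings are placeholders rather than real partitioning output.

```python
import importlib
from typing import Callable, List


def load_ocr_module():
    # Deferred import: runs only when an OCR-based strategy is requested.
    return importlib.import_module("unstructured.partition.ocr")


def partition_pdf_sketch(
    filename: str,
    strategy: str = "fast",
    ocr_loader: Callable = load_ocr_module,  # hypothetical hook, eases testing
) -> List[str]:
    if strategy == "fast":
        # Text-extraction path: never imports the OCR module.
        return [f"text extracted from {filename}"]
    if strategy == "ocr_only":
        # Only this branch pays the import cost (and can fail on import).
        ocr = ocr_loader()
        return [f"OCR via {ocr.__name__} on {filename}"]
    raise ValueError(f"unsupported strategy: {strategy!r}")
```

Under this layout, a missing OCR dependency only surfaces when ocr_only is actually requested.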
Additional context
Happy to provide a PR if you agree with this being a useful feature.