feat/Allow PDF partitioning without unstructured_inference

**Is your feature request related to a problem? Please describe.**
Up to and including unstructured `0.10.27`, it was possible to use the `fast` and `ocr_only` strategies without having `unstructured_inference` installed (which pulls in a lot of transitive dependencies). However, starting with `0.10.28` there is a hard dependency on `unstructured_inference` for PDF partitioning in two ways:

First, `unstructured/partition/pdf.py` has a top-level import of `unstructured.partition.ocr`, which in turn has a top-level import from `unstructured_inference`: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L76

This makes it impossible to use PDF partitioning without having `unstructured_inference` installed, because any import from `unstructured.partition.pdf` will fail.

Second, for OCR partitioning there is an additional explicit check that requires `unstructured_inference`: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L324

**Describe the solution you'd like**

Ideally, both `fast` and `ocr_only` partitioning would work without installing `unstructured_inference` and its transitive dependencies, which is basically the state of `0.10.27`. This could be done by guarding the imports of `unstructured_inference` with explicit checks in the places where it is actually needed.
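To illustrate what I mean by guarding with explicit checks, here is a minimal sketch. The helper names `is_installed` and `require_dependency`, as well as `partition_pdf_sketch`, are hypothetical and only stand in for the real code; the assumption is that only the `hi_res` strategy would need `unstructured_inference`.

```python
import importlib.util


def is_installed(name: str) -> bool:
    """Return True if the top-level package `name` can be found (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


def require_dependency(name: str, feature: str) -> None:
    """Raise a descriptive ImportError if an optional package is missing (hypothetical helper)."""
    if not is_installed(name):
        raise ImportError(
            f"{feature} requires the optional package `{name}`, which is not installed."
        )


def partition_pdf_sketch(strategy: str) -> str:
    """Hypothetical stand-in for the real entry point; it only shows where the guard sits."""
    if strategy == "hi_res":
        # Assumption: only this code path needs unstructured_inference.
        require_dependency("unstructured_inference", "the `hi_res` strategy")
    # `fast` and `ocr_only` would continue here without touching unstructured_inference.
    return strategy
```

With a guard like this, a missing package only surfaces as an error when a strategy that needs it is requested, not when the module is imported.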

**Describe alternatives you've considered**

* Installing `unstructured_inference`. In my environment, the application using unstructured is packaged in a Docker image. Adding the `unstructured_inference` dependency increases the size of the image by more than 3 GB, which makes distribution difficult.
* Restoring `fast` partitioning by avoiding top-level imports from `unstructured.partition.ocr` in `unstructured.partition.pdf` for the code path of the `fast` strategy. While this restores basic functionality, it reduces the number of parseable PDFs considerably.
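The second alternative amounts to deferring the OCR import until a strategy actually needs it. A rough sketch of that idea follows; `partition_pdf_sketch` and its `ocr_module` parameter are hypothetical and do not reflect the actual structure of `unstructured.partition.pdf`.

```python
import importlib

# Module named in this issue; imported lazily instead of at the top of the file.
OCR_MODULE = "unstructured.partition.ocr"


def partition_pdf_sketch(strategy: str, ocr_module: str = OCR_MODULE) -> str:
    """Hypothetical stand-in that imports the OCR helpers only when needed."""
    if strategy == "fast":
        # The fast path returns before any OCR-related import is attempted.
        return "fast: text extracted without OCR helpers"
    try:
        importlib.import_module(ocr_module)
    except ImportError as exc:
        raise ImportError(
            f"strategy {strategy!r} needs the OCR helpers from `{ocr_module}`"
        ) from exc
    return f"{strategy}: OCR helpers loaded"
```

The `ocr_module` parameter exists only so the sketch can be exercised without unstructured installed; real code would import the module directly inside the branch.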

**Additional context**

Happy to provide a PR if you agree with this being a useful feature.

feat/Allow PDF partitioning without unstructured_inference #2128
