feat/Allow PDF partitioning without unstructured_inference #2128

@flash1293

Is your feature request related to a problem? Please describe.
Up to and including unstructured 0.10.27, it was possible to use the fast and ocr_only strategies without installing unstructured_inference (which pulls in many transitive dependencies). Starting with 0.10.28, however, PDF partitioning has a hard dependency on unstructured_inference in two ways:

unstructured.partition.pdf has a top-level import of unstructured.partition.ocr, which in turn imports from unstructured_inference at the top level:

from unstructured.partition.ocr import (

As a result, importing from unstructured.partition.pdf fails, so PDF partitioning cannot be used at all unless unstructured_inference is installed.

For OCR partitioning, there is another explicit check in place to require unstructured_inference:

@requires_dependencies("unstructured_inference")
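To illustrate what such a check does, here is a minimal sketch of a decorator that verifies a package is importable when the function is called rather than when the module is imported. This is not the actual requires_dependencies implementation from unstructured; the name requires_modules and the placeholder package name are made up for this example.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires_modules(names: Sequence[str]) -> Callable:
    """Hypothetical decorator: fail at call time, not import time, if a module is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None for a top-level module that cannot be found
            missing = [n for n in names if importlib.util.find_spec(n) is None]
            if missing:
                raise ImportError(
                    f"{func.__name__} needs missing package(s): {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_modules(["json"])  # stdlib module, always present
def parse(text: str):
    import json

    return json.loads(text)


@requires_modules(["package_that_is_not_installed_xyz"])  # placeholder name
def needs_heavy_dependency():
    return "unreachable without the package"
```

With a check like this on the OCR path, calling the function raises ImportError when the package is absent, even if the function body itself never touches that package.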

Describe the solution you'd like

Ideally, both fast and ocr_only partitioning would work without installing unstructured_inference and its transitive dependencies, as was the case in 0.10.27. This could be done by guarding the relevant imports with explicit checks.
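A rough sketch of what a guarded import could look like: the optional package is imported inside try/except, and only strategies that need it are rejected when it is missing. The names HAS_INFERENCE, INFERENCE_STRATEGIES and validate_strategy are hypothetical, and the assumption that only hi_res needs unstructured_inference is mine, not taken from the code base.

```python
# Guarded import: record availability instead of failing at module import time
try:
    import unstructured_inference  # noqa: F401

    HAS_INFERENCE = True
except ImportError:
    HAS_INFERENCE = False

# Assumption for this sketch: only hi_res depends on the optional package
INFERENCE_STRATEGIES = {"hi_res"}


def validate_strategy(strategy: str, has_inference: bool = HAS_INFERENCE) -> str:
    """Return the strategy if it can run, else raise a descriptive ImportError."""
    if strategy in INFERENCE_STRATEGIES and not has_inference:
        raise ImportError(
            f"strategy '{strategy}' requires unstructured_inference, "
            "which is not installed"
        )
    return strategy
```

The has_inference parameter exists only so the behaviour can be exercised in both states; real code would probably read the module-level flag directly.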

Describe alternatives you've considered

  • Installing unstructured_inference. In my environment, the application using unstructured is packaged in a Docker image; adding the unstructured_inference dependency increases the image size by more than 3 GB, which makes distribution difficult.
  • Restoring fast partitioning by avoiding top-level imports from unstructured.partition.ocr in unstructured.partition.pdf for the code path of the fast strategy. While this restores basic functionality, it reduces the number of parseable PDFs considerably.
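The second alternative amounts to deferring the import into the code path that needs it. A minimal sketch with stand-in functions follows; both function names and the placeholder package are hypothetical and do not correspond to anything in unstructured.

```python
def partition_fast(text: str) -> list[str]:
    """Stand-in for a lightweight code path with no optional imports."""
    return [line for line in text.splitlines() if line.strip()]


def partition_with_layout_model(text: str) -> list[str]:
    """Stand-in for a code path that needs a heavy optional package."""
    # Deferred import: evaluated only when this function is called,
    # so importing this module never fails because of it.
    import heavy_layout_package_placeholder  # hypothetical, not a real package

    return heavy_layout_package_placeholder.detect(text)
```

Defining both functions succeeds regardless of what is installed; only a call to partition_with_layout_model raises ImportError when the package is absent.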

Additional context

Happy to provide a PR if you agree with this being a useful feature.


Labels: enhancement
