Feature: Preserve Folder Hierarchy in PDF Processing Workflows #38

@fcong922

Summary
The PDF-to-PDF and PDF-to-HTML solutions currently process only files placed directly in the top-level input folder of the S3 bucket (e.g., /pdf/ or /uploads/). Files uploaded to subfolders are not processed, and the output does not preserve the original folder structure. This creates significant operational overhead for bulk document processing when the source documents are organized in nested folder hierarchies.

Current Behavior
PDF-to-PDF Solution:
✅ Files in root folder (e.g., /pdf/test.pdf) → Output appears in "result/" folder
❌ Files in subfolders (e.g., /pdf/sub-folder/test.pdf) → No output generated

PDF-to-HTML Solution:
✅ Files in root folder (e.g., /uploads/test.pdf) → Output generated in "/remediated/" folder
❌ Files in subfolders (e.g., /uploads/sub-folder/test.pdf) → No output generated

Desired Behavior
When a file is uploaded to a subfolder structure, the solution should:

Detect and process the file regardless of folder depth
Preserve the original folder hierarchy in the output bucket
Example:

Input: s3://input-bucket/pdf/department-a/2024/document.pdf
Output: s3://output-bucket/result/department-a/2024/document.pdf
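
The mapping in the example above could be expressed roughly as follows. This is only a minimal sketch of the desired key translation, not the solution's actual code: the function name `map_output_key` is hypothetical, and the `pdf/` and `result/` prefixes are assumptions taken from the PDF-to-PDF example (the PDF-to-HTML solution would presumably use `uploads/` and `remediated/`).

```python
from typing import Optional


def map_output_key(
    input_key: str,
    input_prefix: str = "pdf/",      # assumed input folder, from the example above
    output_prefix: str = "result/",  # assumed output folder, from the example above
) -> Optional[str]:
    """Translate an input object key into an output key, keeping subfolders.

    Returns None for keys that should not be processed: keys outside the
    input prefix, folder placeholder keys, and non-PDF objects.
    """
    if not input_key.startswith(input_prefix):
        return None
    if not input_key.lower().endswith(".pdf"):
        return None

    # Everything after the input prefix is the relative path, at any depth.
    relative_path = input_key[len(input_prefix):]
    if not relative_path:
        return None

    return f"{output_prefix}{relative_path}"


if __name__ == "__main__":
    print(map_output_key("pdf/department-a/2024/document.pdf"))
    print(map_output_key("pdf/test.pdf"))
```

Matching on the key prefix rather than on an exact folder level is what would let a file be picked up regardless of folder depth, while reusing the relative path keeps the hierarchy intact in the output.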

This enhancement would enable:
Automated bulk processing of documents with existing folder structures
Reduced operational overhead and manual intervention
Better scalability for enterprise document processing workflows
