feat: CRE document processing pipeline with PDF and Excel support #7
Open
MichaelWalker-git wants to merge 2 commits into main
Add a Step Functions pipeline that splits large PDFs into individual pages, OCRs each page via DeepSeek ALB, classifies CRE document types (rent roll, operating statement, tax return, appraisal, offering memorandum) via Bedrock Claude, and extracts structured JSON.

Pipeline architecture:
- SplitPdf (Docker/Python) - splits PDF into single-page PDFs in S3
- Map state (maxConcurrency: 3) per page:
  - ConvertToImage (Docker/Python) - PDF to JPEG via Poppler
  - OcrPage (Node.js, VPC) - sends JPEG to DeepSeek /ocr/image via ALB
  - ClassifyAndExtract (Node.js) - Bedrock Claude classifies + extracts
- SaveResults (Node.js) - consolidates all page results to S3 JSON

Key decisions:
- Single-page splitting (CRE tables are self-contained per page)
- Bedrock Claude Sonnet 4 for classification/extraction (no second GPU)
- classify-extract Lambda outside VPC (Bedrock public endpoint)
- ocr-page Lambda inside VPC (needs internal ALB access)
- maxConcurrency=3 with retry logic to avoid overwhelming single GPU

Tested with 2-page tax return, 24-page Hampton Inn OM, and rent roll.
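For reviewers who want to picture the PDF path, here is a minimal sketch of how it might look in Amazon States Language, written as a plain TypeScript object. The state names come from the commit message. The ARNs, the JSON paths, and the retry numbers are placeholders assumed for illustration and are not taken from `pipeline.stack.ts`.

```typescript
// Hypothetical sketch of the PDF path in Amazon States Language (ASL).
// State names follow the commit message; ARNs, paths, and retry values are assumptions.

type RetryRule = {
  ErrorEquals: string[];
  IntervalSeconds: number;
  MaxAttempts: number;
  BackoffRate: number;
};

// Assumed retry policy: exponential backoff so a single GPU behind the ALB can recover.
const ocrRetry: RetryRule = {
  ErrorEquals: ["States.ALL"],
  IntervalSeconds: 5,
  MaxAttempts: 3,
  BackoffRate: 2,
};

// Placeholder ARN builder; the real region, account, and function names are not in the PR text.
const lambdaArn = (name: string): string =>
  `arn:aws:lambda:us-east-1:123456789012:function:${name}`;

const pdfPipelineDefinition = {
  StartAt: "SplitPdf",
  States: {
    SplitPdf: {
      Type: "Task",
      Resource: lambdaArn("split-pdf"),
      ResultPath: "$.split",
      Next: "ProcessPages",
    },
    ProcessPages: {
      Type: "Map",
      ItemsPath: "$.split.pages", // assumed output shape of SplitPdf
      MaxConcurrency: 3, // at most three pages in flight at once
      ItemProcessor: {
        ProcessorConfig: { Mode: "INLINE" },
        StartAt: "ConvertToImage",
        States: {
          ConvertToImage: { Type: "Task", Resource: lambdaArn("pdf-to-image"), Next: "OcrPage" },
          OcrPage: { Type: "Task", Resource: lambdaArn("ocr-page"), Retry: [ocrRetry], Next: "ClassifyAndExtract" },
          ClassifyAndExtract: { Type: "Task", Resource: lambdaArn("classify-extract"), End: true },
        },
      },
      ResultPath: "$.pageResults",
      Next: "SaveResults",
    },
    SaveResults: { Type: "Task", Resource: lambdaArn("save-results"), End: true },
  },
};

console.log(JSON.stringify(pdfPipelineDefinition, null, 2));
```

Placing the retry on OcrPage alone is one possible reading of "retry logic"; the actual stack may attach retries to other states as well.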
Add Docker Lambda (Python 3.11) to convert .xlsx/.xls files to markdown tables using openpyxl and xlrd. Restructure Step Functions state machine with a Choice state that routes by fileType: PDF path (split -> convert -> OCR -> classify) and Excel path (sheets-to-markdown -> classify). StartPipeline Lambda now detects file extension and passes fileType.
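The extension check and the Choice state described in this commit could look roughly like the sketch below. The `fileType` values (`"pdf"`, `"excel"`), the function name `detectFileType`, and the target state names are assumptions made for illustration; the PR text only says that routing is keyed on `fileType`.

```typescript
// Hypothetical sketch: derive fileType from the file extension, then route on it.
type FileType = "pdf" | "excel";

// Map a file name or S3 key to the fileType value consumed by the Choice state.
function detectFileType(key: string): FileType {
  const dot = key.lastIndexOf(".");
  const ext = dot >= 0 ? key.slice(dot + 1).toLowerCase() : "";
  if (ext === "pdf") return "pdf";
  if (ext === "xlsx" || ext === "xls") return "excel";
  throw new Error(`Unsupported file extension: "${ext}"`);
}

// Choice state in Amazon States Language; the Next targets are assumed state names.
const routeByFileType = {
  Type: "Choice",
  Choices: [
    { Variable: "$.fileType", StringEquals: "pdf", Next: "SplitPdf" },
    { Variable: "$.fileType", StringEquals: "excel", Next: "ExcelToText" },
  ],
  Default: "UnsupportedFileType",
};

console.log(detectFileType("uploads/rent-roll.xlsx"), routeByFileType.Choices.length);
```

Rejecting unknown extensions before the execution starts keeps bad input out of the state machine; the `Default` branch is a second guard in case `fileType` arrives from another caller.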
Summary
- `PipelineStack` with 3 Docker Lambdas (split-pdf, pdf-to-image, excel-to-text), 4 TypeScript Lambdas (ocr-page, classify-extract, save-results, start-pipeline), IAM roles, and state machine
- `POST /pipeline/process` endpoint to trigger pipeline

New files (17)
- `src/stacks/pipeline.stack.ts`: CDK stack with all Lambdas, IAM, Step Functions
- `src/resources/lambda/split-pdf/`: Docker Lambda to split PDF into single pages (pypdf)
- `src/resources/lambda/pdf-to-image/`: Docker Lambda to convert single-page PDF to JPEG (Poppler)
- `src/resources/lambda/excel-to-text/`: Docker Lambda to convert Excel sheets to markdown (openpyxl/xlrd)
- `src/resources/lambda/pipeline/ocr-page/`: TypeScript Lambda, sends JPEG to DeepSeek ALB (VPC-attached)
- `src/resources/lambda/pipeline/classify-extract/`: TypeScript Lambda, Bedrock Claude classifies CRE doc type + extracts structured JSON
- `src/resources/lambda/pipeline/save-results/`: TypeScript Lambda, consolidates results to S3
- `src/resources/lambda/pipeline/start-pipeline/`: TypeScript Lambda, API Gateway handler, starts Step Functions execution

Modified files
- `.projenrc.ts`: Added `@aws-sdk/client-bedrock-runtime` dependency
- `src/lib/stages.ts`: Instantiate PipelineStack, wire dependencies
- `src/stacks/api-gateway.stack.ts`: Add `POST /pipeline/process` route

Tested with
Test plan
- `STAGE=dev npm run build` compiles without errors
- `STAGE=dev npm run synth` generates CloudFormation without cdk-nag failures
- `pipeline/results/{filename}/results.json`
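To make the results location concrete, here is a rough sketch of how per-page results might be consolidated under the `pipeline/results/{filename}/results.json` key. The `PageResult` shape, the function names, and the decision to pass `filename` through unchanged are all assumptions; the actual save-results Lambda may differ.

```typescript
// Hypothetical sketch of result consolidation; field names are assumptions.
interface PageResult {
  page: number; // assumed 1-based page index
  documentType: string; // e.g. "rent_roll"; label values are assumptions
  data: unknown; // structured JSON extracted for the page
}

// Build the S3 key following the pattern listed in the test plan.
function resultsKey(filename: string): string {
  return `pipeline/results/${filename}/results.json`;
}

// Order pages and serialize them into one JSON document ready to upload.
function consolidate(filename: string, pages: PageResult[]): { key: string; body: string } {
  const ordered = [...pages].sort((a, b) => a.page - b.page);
  return {
    key: resultsKey(filename),
    body: JSON.stringify({ filename, pageCount: ordered.length, pages: ordered }, null, 2),
  };
}

const example = consolidate("tax-return.pdf", [
  { page: 2, documentType: "tax_return", data: {} },
  { page: 1, documentType: "tax_return", data: {} },
]);
console.log(example.key);
```

The returned `key` and `body` would then be handed to an S3 put call; that call is left out so the sketch stays free of SDK dependencies.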