feat: CRE document processing pipeline with PDF and Excel support #7

Open
MichaelWalker-git wants to merge 2 commits into main from feat/cre-document-pipeline
Summary

  • Add Step Functions pipeline for automated CRE document processing: split PDF/Excel → OCR (DeepSeek) or convert to markdown → classify document type (Bedrock Claude) → extract structured JSON → save to S3
  • Support both PDF (.pdf) and Excel (.xlsx/.xls) files via Choice state routing in Step Functions
  • PDF path: SplitPdf → Map(ConvertToImage → OcrPage via DeepSeek ALB → ClassifyAndExtract via Bedrock)
  • Excel path: ExcelToSheets (openpyxl/xlrd → markdown tables) → Map(ClassifyAndExtract via Bedrock)
  • New CDK PipelineStack with 3 Docker Lambdas (split-pdf, pdf-to-image, excel-to-text), 4 TypeScript Lambdas (ocr-page, classify-extract, save-results, start-pipeline), IAM roles, and state machine
  • API Gateway POST /pipeline/process endpoint to trigger pipeline
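
The Choice-state routing described above can be sketched in Amazon States Language, built here as a plain object. This is only an illustration, not the definition from `pipeline.stack.ts`: the state names `RouteByFileType`, `ProcessPages`, `ProcessSheets`, and `UnsupportedFileType`, the `fileType` values `"pdf"` and `"excel"`, and the placeholder ARNs are all assumptions.

```typescript
// Hypothetical sketch: top-level routing by fileType in Amazon States Language.
// ARNs are placeholders; the two Map states are stubbed as Pass states to keep this short.

interface ChoiceRule {
  Variable: string;
  StringEquals: string;
  Next: string;
}

interface StateMachineDefinition {
  StartAt: string;
  States: Record<string, Record<string, unknown>>;
}

function lambdaArn(name: string): string {
  // Placeholder ARN, not a real resource.
  return `arn:aws:lambda:REGION:ACCOUNT_ID:function:${name}`;
}

function buildRoutingDefinition(): StateMachineDefinition {
  const choices: ChoiceRule[] = [
    { Variable: "$.fileType", StringEquals: "pdf", Next: "SplitPdf" },
    { Variable: "$.fileType", StringEquals: "excel", Next: "ExcelToSheets" },
  ];
  return {
    StartAt: "RouteByFileType",
    States: {
      RouteByFileType: { Type: "Choice", Choices: choices, Default: "UnsupportedFileType" },
      // PDF path: split, then process each page.
      SplitPdf: { Type: "Task", Resource: lambdaArn("split-pdf"), Next: "ProcessPages" },
      ProcessPages: { Type: "Pass", Comment: "Stands in for the per-page Map state", Next: "SaveResults" },
      // Excel path: sheets to markdown, then classify each sheet.
      ExcelToSheets: { Type: "Task", Resource: lambdaArn("excel-to-text"), Next: "ProcessSheets" },
      ProcessSheets: { Type: "Pass", Comment: "Stands in for the per-sheet Map state", Next: "SaveResults" },
      SaveResults: { Type: "Task", Resource: lambdaArn("save-results"), End: true },
      UnsupportedFileType: { Type: "Fail", Error: "UnsupportedFileType" },
    },
  };
}
```

Both branches converge on `SaveResults`, so the consolidation step does not need to know which path produced the results.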

New files (17)

  • src/stacks/pipeline.stack.ts — CDK stack with all Lambdas, IAM, Step Functions
  • src/resources/lambda/split-pdf/ — Docker Lambda to split PDF into single pages (pypdf)
  • src/resources/lambda/pdf-to-image/ — Docker Lambda to convert single-page PDF to JPEG (Poppler)
  • src/resources/lambda/excel-to-text/ — Docker Lambda to convert Excel sheets to markdown (openpyxl/xlrd)
  • src/resources/lambda/pipeline/ocr-page/ — TypeScript Lambda, sends JPEG to DeepSeek ALB (VPC-attached)
  • src/resources/lambda/pipeline/classify-extract/ — TypeScript Lambda, Bedrock Claude classifies CRE doc type + extracts structured JSON
  • src/resources/lambda/pipeline/save-results/ — TypeScript Lambda, consolidates results to S3
  • src/resources/lambda/pipeline/start-pipeline/ — TypeScript Lambda, API Gateway handler, starts Step Functions execution
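
The `excel-to-text` Lambda listed above is Python (openpyxl/xlrd). Purely to illustrate the sheet-to-markdown step in the same language as the other Lambdas, the sketch below turns an array of rows into a markdown table. The function name `rowsToMarkdownTable`, the treatment of the first row as the header, and the pipe escaping are assumptions, not the PR's actual output format.

```typescript
// Hypothetical sketch: render one sheet's rows as a markdown table.
// The first row is treated as the header; short rows are padded with empty cells.

type Cell = string | number | boolean | null | undefined;

function rowsToMarkdownTable(rows: Cell[][]): string {
  if (rows.length === 0) return "";
  const width = Math.max(...rows.map((r) => r.length));

  // Empty cells become "", pipes are escaped, newlines are flattened.
  const formatCell = (c: Cell): string =>
    c === null || c === undefined
      ? ""
      : String(c).replace(/\|/g, "\\|").replace(/\r?\n/g, " ").trim();

  const renderRow = (r: Cell[]): string =>
    "| " + Array.from({ length: width }, (_, i) => formatCell(r[i])).join(" | ") + " |";

  const header = rows[0] ?? [];
  const separator = "| " + Array.from({ length: width }, () => "---").join(" | ") + " |";
  return [renderRow(header), separator, ...rows.slice(1).map(renderRow)].join("\n");
}
```

Markdown tables keep row and column structure visible to the classifier without requiring it to parse a binary spreadsheet format.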

Modified files

  • .projenrc.ts — Added @aws-sdk/client-bedrock-runtime dependency
  • src/lib/stages.ts — Instantiate PipelineStack, wire dependencies
  • src/stacks/api-gateway.stack.ts — Add POST /pipeline/process route

Tested with

| Document | Type | Result |
| --- | --- | --- |
| 2021 Tax Return (2 pages) | tax_return (95%) | Extracted property address, assessed values, tax amounts |
| Hampton Inn OM (24 pages) | 8 OM, 1 operating_statement, 15 other | Extracted investment highlights, NOI, ADR, RevPAR |
| Rent Roll (2 pages) | rent_roll (70%) | Classification correct |
| Flemington Financials .xlsx (1 sheet) | operating_statement (95%) | Revenue $232K, Expenses $200K, NOI $32K, ADR $115.74 |
| OSAR-NOIWS .xls (10 sheets) | 9 operating_statement, 1 other | CREFC templates (blank forms), correctly processed |

Test plan

  • STAGE=dev npm run build compiles without errors
  • STAGE=dev npm run synth generates CloudFormation without cdk-nag failures
  • Deploy and test with PDF files (single page and multi-page)
  • Deploy and test with Excel files (.xlsx and .xls)
  • Verify results JSON in S3 pipeline/results/{filename}/results.json
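
When checking the results file, a small helper along these lines can derive the expected S3 key and summarize per-page classifications. It is a sketch under assumptions: that `{filename}` is the last path segment of the uploaded key (extension included), and that each page result carries a `documentType` field. Neither is confirmed by the PR.

```typescript
// Hypothetical sketch: derive the results key and count pages per document type.

function resultsKey(sourceKey: string): string {
  // Assumption: {filename} is the basename of the source object key.
  const filename = sourceKey.split("/").pop() ?? sourceKey;
  return `pipeline/results/${filename}/results.json`;
}

function countByType(pages: { documentType: string }[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const page of pages) {
    counts[page.documentType] = (counts[page.documentType] ?? 0) + 1;
  }
  return counts;
}
```

A per-type count of this kind would make it easy to compare a run against summaries such as "9 operating_statement, 1 other".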

Add a Step Functions pipeline that splits large PDFs into individual
pages, OCRs each page via DeepSeek ALB, classifies CRE document types
(rent roll, operating statement, tax return, appraisal, offering
memorandum) via Bedrock Claude, and extracts structured JSON.

Pipeline architecture:
- SplitPdf (Docker/Python) - splits PDF into single-page PDFs in S3
- Map state (maxConcurrency: 3) per page:
  - ConvertToImage (Docker/Python) - PDF to JPEG via Poppler
  - OcrPage (Node.js, VPC) - sends JPEG to DeepSeek /ocr/image via ALB
  - ClassifyAndExtract (Node.js) - Bedrock Claude classifies + extracts
- SaveResults (Node.js) - consolidates all page results to S3 JSON
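
The per-page Map state above could look roughly like the following Amazon States Language fragment, again built as a plain object. The `ItemsPath` of `$.pages`, the retry interval, attempt count, and backoff rate, and the placeholder ARNs are assumptions for illustration; only the state names and `maxConcurrency: 3` come from this commit message.

```typescript
// Hypothetical sketch: per-page Map state with bounded concurrency and OCR retries.

function pageLambdaArn(name: string): string {
  // Placeholder ARN, not a real resource.
  return `arn:aws:lambda:REGION:ACCOUNT_ID:function:${name}`;
}

function buildPageMapState(maxConcurrency = 3) {
  return {
    Type: "Map",
    ItemsPath: "$.pages", // assumed shape of the SplitPdf output
    MaxConcurrency: maxConcurrency, // caps parallel pages so the single GPU is not flooded
    ItemProcessor: {
      ProcessorConfig: { Mode: "INLINE" },
      StartAt: "ConvertToImage",
      States: {
        ConvertToImage: { Type: "Task", Resource: pageLambdaArn("pdf-to-image"), Next: "OcrPage" },
        OcrPage: {
          Type: "Task",
          Resource: pageLambdaArn("ocr-page"),
          // Assumed retry policy: up to 3 retries with exponential backoff (5s, 10s, 20s).
          Retry: [{ ErrorEquals: ["States.ALL"], IntervalSeconds: 5, MaxAttempts: 3, BackoffRate: 2 }],
          Next: "ClassifyAndExtract",
        },
        ClassifyAndExtract: { Type: "Task", Resource: pageLambdaArn("classify-extract"), End: true },
      },
    },
    Next: "SaveResults",
  };
}
```

Placing the retry on `OcrPage` rather than on the whole Map means a transient OCR failure re-runs one page instead of the entire document.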

Key decisions:
- Single-page splitting (CRE tables are self-contained per page)
- Bedrock Claude Sonnet 4 for classification/extraction (avoids provisioning a second GPU)
- classify-extract Lambda outside VPC (Bedrock public endpoint)
- ocr-page Lambda inside VPC (needs internal ALB access)
- maxConcurrency=3 with retry logic to avoid overwhelming single GPU
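
As a rough illustration of the classification step, the sketch below builds a request body in the Anthropic Messages format that Bedrock accepts for Claude models and defensively parses the reply. The prompt wording, the label names `appraisal` and `offering_memorandum`, the 0 to 1 confidence scale, and the reply schema are assumptions. In a Lambda, the body would be serialized and sent with `InvokeModelCommand` from `@aws-sdk/client-bedrock-runtime`; that call is omitted so the example has no dependencies.

```typescript
// Hypothetical sketch: build a Claude-on-Bedrock request and parse a classification reply.

const DOC_TYPES = [
  "rent_roll", "operating_statement", "tax_return",
  "appraisal", "offering_memorandum", "other",
] as const;
type DocType = (typeof DOC_TYPES)[number];

interface Classification {
  documentType: DocType;
  confidence: number; // assumed scale: 0..1
  extracted: Record<string, unknown>;
}

function buildClassifyRequest(pageText: string, maxTokens = 2048) {
  return {
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: maxTokens,
    system:
      "You classify commercial real estate documents. Reply with JSON only: " +
      `{"documentType": one of [${DOC_TYPES.join(", ")}], "confidence": 0..1, "extracted": {...}}`,
    messages: [{ role: "user", content: [{ type: "text", text: pageText }] }],
  };
}

function parseClassification(responseBody: string): Classification {
  const fallback: Classification = { documentType: "other", confidence: 0, extracted: {} };
  try {
    const parsed = JSON.parse(responseBody) as { content?: { type: string; text?: string }[] };
    const text = parsed.content?.find((block) => block.type === "text")?.text ?? "";
    // Tolerate prose around the JSON by taking the outermost braces.
    const start = text.indexOf("{");
    const end = text.lastIndexOf("}");
    if (start < 0 || end <= start) return fallback;
    const raw = JSON.parse(text.slice(start, end + 1)) as Partial<Classification>;
    const documentType = DOC_TYPES.includes(raw.documentType as DocType)
      ? (raw.documentType as DocType)
      : "other";
    const confidence =
      typeof raw.confidence === "number" ? Math.min(1, Math.max(0, raw.confidence)) : 0;
    const extracted = raw.extracted && typeof raw.extracted === "object" ? raw.extracted : {};
    return { documentType, confidence, extracted };
  } catch {
    return fallback; // malformed JSON falls back to "other" rather than failing the page
  }
}
```

Falling back to `other` on an unparseable reply keeps a single bad page from failing the whole Map iteration.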

Tested with 2-page tax return, 24-page Hampton Inn OM, and rent roll.

Add Docker Lambda (Python 3.11) to convert .xlsx/.xls files to markdown
tables using openpyxl and xlrd. Restructure Step Functions state machine
with a Choice state that routes by fileType: PDF path (split -> convert
-> OCR -> classify) and Excel path (sheets-to-markdown -> classify).
StartPipeline Lambda now detects file extension and passes fileType.
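
The extension check in StartPipeline might look something like this. It is a minimal sketch: the function name `detectFileType` and the `"pdf"`/`"excel"` values passed as `fileType` are assumptions, and the real handler may validate differently.

```typescript
// Hypothetical sketch: map an S3 object key to the fileType consumed by the Choice state.

type FileType = "pdf" | "excel";

function detectFileType(key: string): FileType | undefined {
  const dot = key.lastIndexOf(".");
  if (dot < 0) return undefined; // no extension at all
  const ext = key.slice(dot + 1).toLowerCase();
  if (ext === "pdf") return "pdf";
  if (ext === "xlsx" || ext === "xls") return "excel";
  return undefined; // unsupported; the caller can reject the request
}
```

Returning `undefined` for unknown extensions lets the API handler reject the request before a Step Functions execution is started.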