feat: CRE document processing pipeline with PDF and Excel support by MichaelWalker-git · Pull Request #7 · aws-samples/sample-deepseek-ocr-selfhost

MichaelWalker-git · 2026-03-09T16:42:15Z

Summary

Add Step Functions pipeline for automated CRE document processing: split PDF/Excel → OCR (DeepSeek) or convert to markdown → classify document type (Bedrock Claude) → extract structured JSON → save to S3
Support both PDF (.pdf) and Excel (.xlsx/.xls) files via Choice state routing in Step Functions
PDF path: SplitPdf → Map(ConvertToImage → OcrPage via DeepSeek ALB → ClassifyAndExtract via Bedrock)
Excel path: ExcelToSheets (openpyxl/xlrd → markdown tables) → Map(ClassifyAndExtract via Bedrock)
New CDK PipelineStack with 3 Docker Lambdas (split-pdf, pdf-to-image, excel-to-text), 4 TypeScript Lambdas (ocr-page, classify-extract, save-results, start-pipeline), IAM roles, and state machine
API Gateway POST /pipeline/process endpoint to trigger pipeline
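The routing described above can be sketched as a fragment of an Amazon States Language definition, written here as a plain TypeScript object. The state names SplitPdf and ExcelToSheets come from this description; the `RouteByFileType` and `UnsupportedFileType` state names, the `$.fileType` path, and the `"pdf"`/`"excel"` values are assumptions for illustration and may not match the actual stack.

```typescript
// Sketch only: routing portion of the state machine in Amazon States Language.
// "RouteByFileType", "UnsupportedFileType", "$.fileType", and the
// "pdf"/"excel" values are hypothetical names, not taken from the PR code.
type ChoiceRule = { Variable: string; StringEquals: string; Next: string };

const choices: ChoiceRule[] = [
  { Variable: "$.fileType", StringEquals: "pdf", Next: "SplitPdf" },
  { Variable: "$.fileType", StringEquals: "excel", Next: "ExcelToSheets" },
];

const routingSketch = {
  StartAt: "RouteByFileType",
  States: {
    RouteByFileType: {
      Type: "Choice",
      Choices: choices,
      Default: "UnsupportedFileType",
    },
    UnsupportedFileType: { Type: "Fail", Error: "UnsupportedFileType" },
    // SplitPdf, ExcelToSheets, the Map states, and SaveResults are omitted.
  },
};

// Resolve which state an input with the given fileType would be routed to,
// mirroring how a Choice state falls through to Default when no rule matches.
function nextState(fileType: string): string {
  const choice = routingSketch.States.RouteByFileType;
  const match = choice.Choices.find((rule) => rule.StringEquals === fileType);
  return match ? match.Next : choice.Default;
}
```

A `Default` branch that fails fast is one way to keep unsupported uploads from silently entering either path.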

New files (17)

src/stacks/pipeline.stack.ts — CDK stack with all Lambdas, IAM, Step Functions
src/resources/lambda/split-pdf/ — Docker Lambda to split PDF into single pages (pypdf)
src/resources/lambda/pdf-to-image/ — Docker Lambda to convert single-page PDF to JPEG (Poppler)
src/resources/lambda/excel-to-text/ — Docker Lambda to convert Excel sheets to markdown (openpyxl/xlrd)
src/resources/lambda/pipeline/ocr-page/ — TypeScript Lambda, sends JPEG to DeepSeek ALB (VPC-attached)
src/resources/lambda/pipeline/classify-extract/ — TypeScript Lambda, Bedrock Claude classifies CRE doc type + extracts structured JSON
src/resources/lambda/pipeline/save-results/ — TypeScript Lambda, consolidates results to S3
src/resources/lambda/pipeline/start-pipeline/ — TypeScript Lambda, API Gateway handler, starts Step Functions execution

Modified files

.projenrc.ts — Added @aws-sdk/client-bedrock-runtime dependency
src/lib/stages.ts — Instantiate PipelineStack, wire dependencies
src/stacks/api-gateway.stack.ts — Add POST /pipeline/process route

Tested with

Document	Classification (confidence)	Result
2021 Tax Return (2 pages)	tax_return (95%)	Extracted property address, assessed values, tax amounts
Hampton Inn OM (24 pages)	8 OM, 1 operating_statement, 15 other	Extracted investment highlights, NOI, ADR, RevPAR
Rent Roll (2 pages)	rent_roll (70%)	Classification correct
Flemington Financials .xlsx (1 sheet)	operating_statement (95%)	Revenue $232K, Expenses $200K, NOI $32K, ADR $115.74
OSAR-NOIWS .xls (10 sheets)	9 operating_statement, 1 other	CREFC templates (blank forms), correctly processed
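Per-document summaries such as "8 OM, 1 operating_statement, 15 other" are tallies of per-page (or per-sheet) classifications. A minimal sketch of such a tally follows, assuming each result carries a `documentType` string; the `PageResult` shape and its field names are hypothetical, and the sample data is illustrative rather than taken from the test runs.

```typescript
// Hypothetical shape of one page/sheet result; the real pipeline output may differ.
interface PageResult {
  page: number;
  documentType: string;
  confidence: number; // 0..1
}

// Count how many pages were assigned to each document type.
function tallyDocumentTypes(results: PageResult[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const result of results) {
    counts[result.documentType] = (counts[result.documentType] ?? 0) + 1;
  }
  return counts;
}

// Illustrative input only (not data from the PR's test documents).
const sampleResults: PageResult[] = [
  { page: 1, documentType: "offering_memorandum", confidence: 0.9 },
  { page: 2, documentType: "offering_memorandum", confidence: 0.85 },
  { page: 3, documentType: "operating_statement", confidence: 0.8 },
  { page: 4, documentType: "other", confidence: 0.6 },
];

const sampleTally = tallyDocumentTypes(sampleResults);
```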

Test plan

STAGE=dev npm run build compiles without errors
STAGE=dev npm run synth generates CloudFormation without cdk-nag failures
Deploy and test with PDF files (single page and multi-page)
Deploy and test with Excel files (.xlsx and .xls)
Verify results JSON in S3 pipeline/results/{filename}/results.json

Add a Step Functions pipeline that splits large PDFs into individual pages, OCRs each page via DeepSeek ALB, classifies CRE document types (rent roll, operating statement, tax return, appraisal, offering memorandum) via Bedrock Claude, and extracts structured JSON.

Pipeline architecture:
- SplitPdf (Docker/Python) - splits PDF into single-page PDFs in S3
- Map state (maxConcurrency: 3) per page:
  - ConvertToImage (Docker/Python) - PDF to JPEG via Poppler
  - OcrPage (Node.js, VPC) - sends JPEG to DeepSeek /ocr/image via ALB
  - ClassifyAndExtract (Node.js) - Bedrock Claude classifies + extracts
- SaveResults (Node.js) - consolidates all page results to S3 JSON

Key decisions:
- Single-page splitting (CRE tables are self-contained per page)
- Bedrock Claude Sonnet 4 for classification/extraction (no second GPU)
- classify-extract Lambda outside VPC (Bedrock public endpoint)
- ocr-page Lambda inside VPC (needs internal ALB access)
- maxConcurrency=3 with retry logic to avoid overwhelming single GPU

Tested with 2-page tax return, 24-page Hampton Inn OM, and rent roll.
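The per-page Map state with bounded concurrency and retries might look roughly like the following Amazon States Language fragment, expressed as a TypeScript object. `MaxConcurrency: 3` and the state names come from the commit message; the retry values (5 s interval, 3 attempts, backoff rate 2), the `$.pages` items path, and the placeholder resource strings are assumptions, since the commit message does not state them.

```typescript
// Retry settings are hypothetical; the PR only says "retry logic" is used.
const retryPolicy = {
  ErrorEquals: ["States.ALL"],
  IntervalSeconds: 5,
  MaxAttempts: 3,
  BackoffRate: 2,
};

// Sketch of the per-page Map state. Resource values are placeholders, not real ARNs.
const perPageMapSketch = {
  Type: "Map",
  ItemsPath: "$.pages", // assumed location of the page list in the state input
  MaxConcurrency: 3, // at most three pages in flight, to protect the single GPU
  ItemProcessor: {
    StartAt: "ConvertToImage",
    States: {
      ConvertToImage: { Type: "Task", Resource: "PDF_TO_IMAGE_LAMBDA_ARN", Next: "OcrPage" },
      OcrPage: { Type: "Task", Resource: "OCR_PAGE_LAMBDA_ARN", Retry: [retryPolicy], Next: "ClassifyAndExtract" },
      ClassifyAndExtract: { Type: "Task", Resource: "CLASSIFY_EXTRACT_LAMBDA_ARN", Retry: [retryPolicy], End: true },
    },
  },
  Next: "SaveResults",
};

// Wait before each retry attempt: IntervalSeconds * BackoffRate^(attempt - 1).
function retryDelaysSeconds(policy: typeof retryPolicy): number[] {
  return Array.from(
    { length: policy.MaxAttempts },
    (_, attempt) => policy.IntervalSeconds * Math.pow(policy.BackoffRate, attempt),
  );
}
```

With these assumed values, a failing OcrPage task would be retried after 5, 10, and 20 seconds before the error propagates, which spaces out load on the GPU-backed endpoint.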

Add Docker Lambda (Python 3.11) to convert .xlsx/.xls files to markdown tables using openpyxl and xlrd. Restructure Step Functions state machine with a Choice state that routes by fileType: PDF path (split -> convert -> OCR -> classify) and Excel path (sheets-to-markdown -> classify). StartPipeline Lambda now detects file extension and passes fileType.
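The extension check in StartPipeline could be as simple as the sketch below. The function name `detectFileType`, the `"pdf"`/`"excel"` return values, and the choice to throw on unsupported extensions are assumptions; the actual Lambda may name or handle these differently.

```typescript
// Hypothetical fileType values passed into the state machine input.
type FileType = "pdf" | "excel";

// Derive the pipeline route from an S3 object key's extension (case-insensitive).
function detectFileType(key: string): FileType {
  const dot = key.lastIndexOf(".");
  const ext = dot >= 0 ? key.slice(dot + 1).toLowerCase() : "";
  if (ext === "pdf") return "pdf";
  if (ext === "xlsx" || ext === "xls") return "excel";
  throw new Error(`Unsupported file extension: "${ext}"`);
}
```

Rejecting unknown extensions in the API handler, before an execution starts, keeps bad input from consuming a Step Functions run.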

MichaelWalker-git added 2 commits March 5, 2026 09:39

MichaelWalker-git wants to merge 2 commits into main from feat/cre-document-pipeline