Promy - OCR Pipeline for Invoice Documents

An end-to-end OCR pipeline for invoice document processing. The recognition module (CRNN with MobileNetV3 + BiLSTM + CTC) is fine-tuned on a corpus of 1,413 annotated invoices. A working prototype is exposed via FastAPI and Streamlit, packaged with Docker.

Overview

Promy covers the full OCR chain for invoice images: image preprocessing, text detection, character recognition, evaluation, and a deployable prototype. It is structured to go beyond a notebook experiment, with a clear separation between research notebooks, training artifacts, and a self-contained deployment package.

Problem statement

Extracting text from invoice images is not straightforward. Invoices vary in layout, font, resolution, and scanning quality. Off-the-shelf OCR models are not always adapted to this type of document. Promy addresses this by fine-tuning a lightweight OCR recognition model specifically on invoice data, within a complete and reproducible pipeline.

What the project does

The pipeline receives a JPG or PNG image (up to 10 MB) and returns a structured output containing:

recognized text, line by line
per-line confidence scores
preprocessing metadata (deskew angle, original size, processed size)

Output is available as JSON or CSV.

OCR pipeline

Raw image (JPG/PNG)
        |
        v
+---------------------------+
| Preprocessing             |  LAB grayscale, CLAHE, deskew, denoise, resize
| deployment/preprocessing  |
+---------------------------+
        |
        v
+---------------------------+
| Detection: RapidOCR       |  bounding boxes (DBNet ONNX)
| (not fine-tuned)          |
+---------------------------+
        |
        v  (line crops)
+---------------------------+
| Recognition: PaddleOCR    |  CRNN: MobileNetV3 + BiLSTM + CTC
| CRNN, fine-tuned          |  approx. 8M parameters
| deployment/models/rec_infer
+---------------------------+
        |
        v
Structured output (JSON / CSV)
  - lines
  - confidences
  - preprocessing metadata

Only the recognition module (REC) is fine-tuned. Text detection is handled by RapidOCR without retraining. This is a deliberate scope boundary, documented in the notebooks.

Dataset

High Quality Invoice Images for OCR (Kaggle, Osama Hosam Abdellatif)

1,413 annotated invoices (batch_1, used for training)
300 unannotated invoices (batch_2, used for qualitative validation)
2 additional out-of-corpus invoices for A/B tests

Dataset link: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr

The dataset is not redistributed in this repository.

Preprocessing

The preprocessing module (deployment/preprocessing.py, also present in notebooks/preprocessing.py) applies the following steps in sequence:

Grayscale conversion via LAB color space
CLAHE contrast enhancement
Skew correction (deskew)
Light denoising
Resolution normalization

This module is shared between the notebook environment and the deployed API.
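
The resolution normalization step and the metadata it reports can be sketched as follows. This is an illustration only: the target long side of 2000 px, the function names, and the metadata key names are assumptions, not values read from deployment/preprocessing.py. Only the three reported quantities (deskew angle, original size, processed size) come from this README.

```python
def normalized_size(width: int, height: int, target_long_side: int = 2000) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals target_long_side.

    The aspect ratio is preserved; target_long_side is a hypothetical default.
    """
    scale = target_long_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def preprocessing_metadata(
    original_size: tuple[int, int],
    deskew_angle: float,
    target_long_side: int = 2000,
) -> dict:
    """Build a metadata record; key names are assumed for illustration."""
    processed = normalized_size(*original_size, target_long_side)
    return {
        "deskew_angle": round(deskew_angle, 2),
        "original_size": list(original_size),
        "processed_size": list(processed),
    }


# A portrait scan of 3000 x 4000 px, rotated by about -1.2 degrees.
print(preprocessing_metadata((3000, 4000), deskew_angle=-1.234))
```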

Detection model

Text detection uses RapidOCR with DBNet in ONNX format. It localizes text regions on the full invoice image and produces bounding boxes, which are cropped and passed to the recognition module.

RapidOCR is used as-is, without fine-tuning. A benchmark of detection alternatives is documented in NB_DET_Benchmark.ipynb.
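
The hand-off from detection to recognition can be sketched in plain Python. The example assumes each detected box is a list of four (x, y) corner points, which is a common convention for DBNet-style detectors but is not confirmed here for this project. The helper names and the 2-pixel padding are hypothetical.

```python
from typing import Sequence

Point = tuple[float, float]
Rect = tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def quad_to_rect(quad: Sequence[Point], width: int, height: int, pad: int = 2) -> Rect:
    """Return the padded axis-aligned rectangle enclosing a 4-point box.

    The result is clamped to the image bounds so it can be used as a crop.
    """
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    left = max(0, int(min(xs)) - pad)
    top = max(0, int(min(ys)) - pad)
    right = min(width, int(max(xs)) + pad + 1)
    bottom = min(height, int(max(ys)) + pad + 1)
    return left, top, right, bottom


def reading_order(rects: Sequence[Rect]) -> list[Rect]:
    """Sort rectangles top-to-bottom, then left-to-right."""
    return sorted(rects, key=lambda rect: (rect[1], rect[0]))


# Two hypothetical boxes on a 200 x 100 image, listed out of reading order.
quads = [
    [(10, 50), (110, 50), (110, 70), (10, 70)],
    [(5, 5), (60, 5), (60, 20), (5, 20)],
]
crops = reading_order([quad_to_rect(q, width=200, height=100) for q in quads])
print(crops)
```

Sorting by the top edge first is a simplification: on skewed or multi-column invoices, a line-grouping step would be needed before ordering.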

Recognition model

The recognition model is the PaddleOCR CRNN architecture:

Backbone: MobileNetV3
Sequence modeling: BiLSTM
Decoder: CTC
Approximate size: 8M parameters
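
For readers unfamiliar with the CTC decoder, the sketch below shows greedy CTC decoding on per-frame class indices: consecutive repeats are collapsed, then the blank symbol is dropped. It assumes index 0 is the blank and that character indices start at 1; the tiny charset and frame sequences are made up for illustration and are not taken from the trained model.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids: list[int], charset: str, blank: int = 0) -> str:
    """Greedy CTC decoding: collapse repeated frames, then remove blanks.

    frame_ids holds the argmax class index of each time step.
    Index `blank` is the CTC blank; index k > 0 maps to charset[k - 1].
    """
    collapsed = [index for index, _ in groupby(frame_ids)]
    return "".join(charset[index - 1] for index in collapsed if index != blank)


charset = "TOAL"  # hypothetical 4-character alphabet: T=1, O=2, A=3, L=4

# Repeated frames collapse into one character.
print(ctc_greedy_decode([1, 1, 0, 2, 0, 1, 1, 3, 0, 4, 4], charset))  # TOTAL
# A blank between identical characters keeps both of them.
print(ctc_greedy_decode([3, 4, 0, 4], charset))  # ALL
```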

It is fine-tuned on invoice crops generated by pseudo-labelling from the batch_1 annotations. The data is split 75/25 by invoice, so all crops from a given invoice fall on the same side of the split and no invoice leaks from training into validation. Training ran for 40 epochs; the best checkpoint was selected at epoch 34 based on val_norm_edit_dis.
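
A split by invoice can be sketched as below. This is not the code from NB3: the function name, the (invoice_id, crop_path) input format, and the seed are assumptions. The point is that the split is drawn over invoice identifiers, not over individual crops, so crops from one invoice never land on both sides.

```python
import random
from collections import defaultdict


def split_by_invoice(
    crops: list[tuple[str, str]],
    val_fraction: float = 0.25,
    seed: int = 0,
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Split (invoice_id, crop_path) pairs into train and validation sets.

    Whole invoices are assigned to one side, which prevents leakage
    between near-identical crops of the same document.
    """
    by_invoice = defaultdict(list)
    for invoice_id, crop_path in crops:
        by_invoice[invoice_id].append(crop_path)

    invoice_ids = sorted(by_invoice)  # sorted first so the shuffle is reproducible
    random.Random(seed).shuffle(invoice_ids)
    n_val = round(len(invoice_ids) * val_fraction)
    val_ids = set(invoice_ids[:n_val])

    train = [(i, p) for i, p in crops if i not in val_ids]
    val = [(i, p) for i, p in crops if i in val_ids]
    return train, val


# Hypothetical data: 8 invoices with 3 crops each.
crops = [(f"inv{i}", f"inv{i}_crop{j}.png") for i in range(8) for j in range(3)]
train, val = split_by_invoice(crops)
print(len(train), len(val))
```

Because invoices contain different numbers of text lines, a 75/25 split over invoices gives only an approximate 75/25 split over crops.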

Model comparison

The notebook NB_Comparatif.ipynb documents a quantitative comparison between:

TrOCR (Microsoft, Transformer-based)
PaddleOCR CRNN (fine-tuned on invoices)

PaddleOCR CRNN was selected for its lower inference latency, smaller model footprint, and better fit for a prototype deployment context. The TrOCR experiment is preserved in NB_experiment_TrOCR.ipynb.

Evaluation metrics

Fine-tuning is monitored with norm_edit_dis, a normalized edit-distance score (higher is better) that serves as a proxy for 1 - CER at the character level. Epoch-by-epoch metrics are logged in:

workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.csv
workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.png

Key results

Results from the internal validation set (75/25 anti-leakage split by invoice):

CER proxy: 0.19% on the validation set
Inference latency: 3.4 ms per crop at batch size 12 on an NVIDIA L4 GPU

These numbers come from a controlled benchmark on a held-out split of the same corpus used for training. Performance on out-of-corpus invoices or on significantly different invoice formats may be lower.
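
As a back-of-the-envelope reading of the latency figure, the sketch below converts it into per-batch time and an estimated recognition time per invoice. It assumes that 3.4 ms is an amortized per-crop average, that a partially filled batch costs as much as a full one, and that an invoice yields 60 crops; all three are illustrative assumptions, and detection and preprocessing time are not included.

```python
import math


def recognition_time_ms(n_crops: int, ms_per_crop: float = 3.4, batch_size: int = 12) -> float:
    """Estimate recognition time, charging every batch as if it were full."""
    n_batches = math.ceil(n_crops / batch_size)
    return n_batches * batch_size * ms_per_crop


batch_ms = 12 * 3.4             # about 40.8 ms per batch of 12 crops
crops_per_second = 1000 / 3.4   # about 294 crops per second

# Hypothetical invoice with 60 text-line crops: 5 batches, about 204 ms.
print(round(batch_ms, 1), round(crops_per_second), round(recognition_time_ms(60), 1))
```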

API and prototype

The deployment package exposes:

FastAPI (port 8000):

GET /health - service health check
POST /ocr - multipart file upload, returns {lines, confidences, mean_confidence, n_segments, preprocessing}
Swagger docs: http://localhost:8000/docs
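
A client call to POST /ocr can be sketched with the standard library alone. The multipart field name `file` is an assumption (the Swagger page at /docs shows the actual name), and `build_multipart` and `post_invoice` are hypothetical helpers, not part of the deployment package.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def post_invoice(
    image_path: str,
    url: str = "http://localhost:8000/ocr",
    field: str = "file",  # assumed field name, check /docs
) -> dict:
    """Upload an invoice image and return the parsed JSON response."""
    path = Path(image_path)
    body, content_type = build_multipart(field, path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Usage, with the Docker prototype running and a local file invoice.png:
# result = post_invoice("invoice.png")
# print(result["n_segments"], result["mean_confidence"])
```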

Streamlit (port 8501): a web interface to upload an invoice, adjust the confidence threshold, visualize the output table, and download the CSV.

Both services are packaged together with Docker Compose.

Repository structure

Promy/
├── deployment/                     # Docker prototype (API + frontend)
│   ├── api/                        # FastAPI routes (/ocr, /health)
│   ├── front/                      # Streamlit app
│   ├── models/rec_infer/           # Fine-tuned CRNN model (inference)
│   ├── preprocessing.py
│   ├── tests/                      # pytest tests (API + vendor)
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── pyproject.toml
│
├── notebooks/                      # Research and training notebooks
│   ├── NB1_EDA.ipynb
│   ├── NB2_Preprocessing.ipynb
│   ├── NB3_Fine-tuning_DETRapidOCR_RECPaddleOCR.ipynb
│   ├── NB_Comparatif.ipynb         # TrOCR vs PaddleOCR comparison
│   ├── NB_DET_Benchmark.ipynb      # Detection benchmark
│   ├── NB_experiment_TrOCR.ipynb   # TrOCR experiment archive
│   ├── preprocessing.py
│   └── outputs/
│
├── models/                         # Final model and metrics
│   └── PaddleOCR_Invoice_v2/
│       ├── rec_infer/
│       ├── latency_benchmark.json
│       └── README.md
│
├── workspace_paddleocr_invoice/    # Training artifacts (see local README)
│   ├── export/rec_infer/
│   ├── runs/
│   │   ├── rec/
│   │   │   ├── config.yml          # Fine-tuning configuration
│   │   │   └── train.log           # Full training log (40 epochs)
│   │   └── metrics/
│   ├── prepared_data/
│   ├── testsAB_outputs/
│   └── README.md
│
├── Promy_raw/                      # Raw data (see local README)
│   └── datasets/
│
├── .gitignore
├── pyproject.toml
├── uv.lock
├── README.md
└── README.fr.md

How to run

Prerequisites

For the Docker prototype:

Docker 24+ and Docker Compose v2

For local notebook execution:

Python 3.12
uv for environment management
Optional: NVIDIA GPU with CUDA for retraining

Get the dataset

Download from Kaggle:

https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr

Extract to:

Promy_raw/datasets/High-Quality Invoice Images for OCR/
├── batch_1/
│   ├── *.csv
│   └── images...
└── batch_2/
    └── images...

Alternative: in NB3, set ALLOW_KAGGLEHUB_FALLBACK = True (cell 3) to download via kagglehub (requires a configured Kaggle API key).

The two out-of-corpus test images used for A/B tests are already present in Promy_raw/datasets/.

Run the Docker prototype

cd deployment
docker compose up -d --build

Once running:

Streamlit UI: http://localhost:8501
FastAPI docs: http://localhost:8000/docs

Read the notebooks

The notebooks are written primarily in French and document the project methodology in detail. Each notebook opens with an English summary so that non-French readers can follow the workflow.

Retrain the model (optional)

Clone PaddleOCR into the workspace:

cd workspace_paddleocr_invoice
git clone https://github.com/PaddlePaddle/PaddleOCR.git .

Download the pretrained weights referenced in runs/rec/config.yml (section Global.pretrained_model).

In NB3, set:
- FORCE_REBUILD_PREPARED_DATA = True (cell 5) to regenerate pseudo-labels and crops
- RUN_REC_TRAINING = True (cell 5) to start training

Checkpoints will be written to runs/rec/ and the best model exported to export/rec_infer/.

See workspace_paddleocr_invoice/README.md for details.

Limitations

Detection is not fine-tuned. RapidOCR is used as-is. Performance on atypical invoice layouts depends on the DBNet pretrained model.
Word spacing. The CRNN model can miss spaces between words in some configurations.
French-language coverage. The base model and training corpus are English-dominant. Performance on French invoices is not fully characterized.
No structured field extraction. The pipeline outputs raw text lines. It does not extract fields such as amounts, dates, or vendor names.
Template diversity. Results may degrade on invoice formats significantly different from the training corpus.
Prototype scope. The Docker deployment is a demonstrator, not a production-ready system.

Future improvements

Fine-tune the detection module on invoice-specific layouts
Add a structured field extraction layer (KIE)
Expand French-language coverage in the training corpus
Benchmark on a broader range of invoice templates
Add a CI pipeline for automated regression testing

Tech stack

Component	Technology
Language	Python 3.12
OCR recognition	PaddleOCR (CRNN fine-tuning)
OCR detection	RapidOCR (DBNet ONNX)
Image preprocessing	OpenCV, NumPy, Pillow
Model experiments	PyTorch, Hugging Face Transformers (TrOCR)
API	FastAPI
Frontend	Streamlit
Deployment	Docker, Docker Compose
Environment	uv

License

Code: MIT, unless otherwise noted in source files.
Dataset: the Kaggle dataset license applies. The dataset is not redistributed here.
PaddleOCR: Apache 2.0 - https://github.com/PaddlePaddle/PaddleOCR
RapidOCR: Apache 2.0 - https://github.com/RapidAI/RapidOCR
Out-of-corpus test images in Promy_raw/datasets/: anonymized internal scans, for educational use only.

Author

Valentin Valluet

GitHub: github.com/V-Vaal
LinkedIn: linkedin.com/in/valentin-valluet
X: @val2_x
