V-Vaal/Promy

Promy - OCR Pipeline for Invoice Documents

French version (README.fr.md)

An end-to-end OCR pipeline for invoice document processing. The recognition module (CRNN with MobileNetV3 + BiLSTM + CTC) is fine-tuned on a corpus of 1,413 annotated invoices. A working prototype is exposed via FastAPI and Streamlit, packaged with Docker.

Table of contents

  1. Overview
  2. Problem statement
  3. What the project does
  4. OCR pipeline
  5. Dataset
  6. Preprocessing
  7. Detection model
  8. Recognition model
  9. Model comparison
  10. Evaluation metrics
  11. Key results
  12. API and prototype
  13. Repository structure
  14. How to run
  15. Limitations
  16. Future improvements
  17. Tech stack
  18. License
  19. Author

Overview

Promy covers the full OCR chain for invoice images: image preprocessing, text detection, character recognition, evaluation, and a deployable prototype. It is structured to go beyond a notebook experiment, with a clear separation between research notebooks, training artifacts, and a self-contained deployment package.

Problem statement

Extracting text from invoice images is not straightforward: invoices vary in layout, font, resolution, and scanning quality, and off-the-shelf OCR models are not always well suited to this type of document. Promy addresses this by fine-tuning a lightweight OCR recognition model specifically on invoice data, within a complete and reproducible pipeline.

What the project does

The pipeline receives a JPG or PNG image (up to 10 MB) and returns a structured output containing:

  • recognized text, line by line
  • per-line confidence scores
  • preprocessing metadata (deskew angle, original size, processed size)

Output is available as JSON or CSV.
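
As an illustration of how the JSON output relates to the CSV output, here is a minimal sketch that flattens a response into one CSV row per recognized line. The field names `lines` and `confidences` come from the API description in this README; the CSV column names and the sample payload are hypothetical, and the actual CSV layout produced by the prototype may differ.

```python
import csv
import io


def response_to_csv(response: dict) -> str:
    """Flatten an OCR response into CSV text, one row per recognized line."""
    lines = response["lines"]
    confidences = response["confidences"]
    if len(lines) != len(confidences):
        raise ValueError("lines and confidences must have the same length")

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Column names are an assumption made for this sketch.
    writer.writerow(["line_index", "text", "confidence"])
    for index, (text, confidence) in enumerate(zip(lines, confidences)):
        writer.writerow([index, text, f"{confidence:.4f}"])
    return buffer.getvalue()


# Made-up payload, only here to exercise the function.
sample = {
    "lines": ["INVOICE No. 1042", "Total due: 1,250.00"],
    "confidences": [0.98, 0.91],
}
csv_text = response_to_csv(sample)
print(csv_text)
```

The `csv` module quotes any field that contains a comma, which matters for invoice amounts such as `1,250.00`.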

OCR pipeline

Raw image (JPG/PNG)
        |
        v
+---------------------------+
| Preprocessing             |  LAB grayscale, CLAHE, deskew, denoise, resize
| deployment/preprocessing  |
+---------------------------+
        |
        v
+---------------------------+
| Detection: RapidOCR       |  bounding boxes (DBNet ONNX)
| (not fine-tuned)          |
+---------------------------+
        |
        v  (line crops)
+---------------------------+
| Recognition: PaddleOCR    |  CRNN: MobileNetV3 + BiLSTM + CTC
| CRNN, fine-tuned          |  approx. 8M parameters
|deployment/models/rec_infer|
+---------------------------+
        |
        v
Structured output (JSON / CSV)
  - lines
  - confidences
  - preprocessing metadata

Only the recognition module (REC) is fine-tuned. Text detection is handled by RapidOCR without retraining. This is a deliberate scope boundary, documented in the notebooks.

Dataset

High Quality Invoice Images for OCR (Kaggle, Osama Hosam Abdellatif)

  • 1,413 annotated invoices (batch_1, used for training)
  • 300 unannotated invoices (batch_2, used for qualitative validation)
  • 2 additional out-of-corpus invoices for A/B tests

Dataset link: https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr

The dataset is not redistributed in this repository.

Preprocessing

The preprocessing module (deployment/preprocessing.py, also present in notebooks/preprocessing.py) applies the following steps in sequence:

  1. Grayscale conversion via LAB color space
  2. CLAHE contrast enhancement
  3. Skew correction (deskew)
  4. Light denoising
  5. Resolution normalization

This module is shared between the notebook environment and the deployed API.

Detection model

Text detection uses RapidOCR with DBNet in ONNX format. It localizes text regions on the full invoice image and produces bounding boxes, which are cropped and passed to the recognition module.
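
The hand-off from detection to recognition can be sketched as follows, assuming each detected box is a four-point quadrilateral in pixel coordinates. The function names, the axis-aligned cropping, and the simple top-to-bottom ordering are assumptions made for illustration, not the repository's actual cropping logic.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
# (left, top, right, bottom) with right and bottom exclusive
Rect = Tuple[int, int, int, int]


def quad_to_rect(quad: List[Point], width: int, height: int) -> Rect:
    """Axis-aligned rectangle enclosing a quadrilateral, clamped to the image."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    left = max(0, math.floor(min(xs)))
    top = max(0, math.floor(min(ys)))
    right = min(width, math.ceil(max(xs)))
    bottom = min(height, math.ceil(max(ys)))
    return left, top, right, bottom


def reading_order(rects: List[Rect]) -> List[Rect]:
    """Sort rectangles top to bottom, then left to right (a naive ordering)."""
    return sorted(rects, key=lambda rect: (rect[1], rect[0]))


# Made-up boxes on a hypothetical 150 x 100 image.
quads = [
    [(10, 50), (200, 52), (200, 80), (10, 78)],
    [(-5, 5), (120.4, 5), (120.4, 30), (-5, 30)],
]
rects = reading_order([quad_to_rect(quad, width=150, height=100) for quad in quads])
print(rects)
```

Sorting on the top coordinate alone can misorder boxes that sit on the same text line with slightly different vertical offsets, so a real pipeline may need to group boxes into lines first.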

RapidOCR is used as-is, without fine-tuning. A benchmark of detection alternatives is documented in NB_DET_Benchmark.ipynb.

Recognition model

The recognition model is the PaddleOCR CRNN architecture:

  • Backbone: MobileNetV3
  • Sequence modeling: BiLSTM
  • Decoder: CTC
  • Approximate size: 8M parameters
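
To make the CTC decoder step concrete, here is a minimal greedy-decoding sketch in plain Python: take the most probable class at each time step, collapse consecutive repeats, and drop the blank symbol. It assumes the blank is class index 0 and reports confidence as the mean probability of the kept characters; both are conventions chosen for this sketch, and the tiny character set and probabilities are made up.

```python
from typing import List, Sequence, Tuple


def ctc_greedy_decode(
    probs: Sequence[Sequence[float]], charset: str, blank: int = 0
) -> Tuple[str, float]:
    """Greedy CTC decoding: argmax per step, collapse repeats, drop blanks."""
    chars: List[str] = []
    kept_probs: List[float] = []
    previous = blank
    for step in probs:
        best = max(range(len(step)), key=step.__getitem__)
        # Emit only when the class is not blank and differs from the previous step.
        if best != blank and best != previous:
            chars.append(charset[best - 1])  # class 0 is the blank, so shift by one
            kept_probs.append(step[best])
        previous = best
    confidence = sum(kept_probs) / len(kept_probs) if kept_probs else 0.0
    return "".join(chars), confidence


# Classes: 0 = blank, 1 = "A", 2 = "B", 3 = "1" (hypothetical character set).
sample_probs = [
    [0.10, 0.80, 0.05, 0.05],  # A
    [0.10, 0.70, 0.10, 0.10],  # A again, collapsed into the previous one
    [0.90, 0.05, 0.03, 0.02],  # blank separates the two A characters
    [0.05, 0.90, 0.03, 0.02],  # A
    [0.10, 0.10, 0.10, 0.70],  # 1
]
text, confidence = ctc_greedy_decode(sample_probs, charset="AB1")
print(text, round(confidence, 3))
```

The blank between the second and fourth steps is what allows a genuinely doubled character to survive the collapse.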

It is fine-tuned on invoice crops generated by pseudo-labelling from batch_1 annotations. The crops are split 75/25 between training and validation by invoice, so every crop from a given invoice lands on the same side of the split and validation never sees an invoice used in training. Training ran for 40 epochs; the best checkpoint was selected at epoch 34 based on val_norm_edit_dis.
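
A split by invoice can be sketched as below: the invoice identifiers are shuffled and divided, and each crop follows its invoice. The `(invoice_id, crop_path)` record shape, the seed, and the file names are hypothetical; the notebook's actual implementation may differ.

```python
import random
from typing import List, Tuple

Crop = Tuple[str, str]  # (invoice_id, crop_path), a hypothetical record shape


def split_by_invoice(
    crops: List[Crop], train_fraction: float = 0.75, seed: int = 42
) -> Tuple[List[Crop], List[Crop]]:
    """Split crops so that all crops of one invoice end up in the same subset."""
    invoice_ids = sorted({invoice_id for invoice_id, _ in crops})
    random.Random(seed).shuffle(invoice_ids)
    n_train = round(len(invoice_ids) * train_fraction)
    train_ids = set(invoice_ids[:n_train])
    train = [crop for crop in crops if crop[0] in train_ids]
    val = [crop for crop in crops if crop[0] not in train_ids]
    return train, val


# Made-up data: 8 invoices with 3 line crops each.
crops = [
    (f"invoice_{i:03d}", f"invoice_{i:03d}_line_{j}.png")
    for i in range(8)
    for j in range(3)
]
train, val = split_by_invoice(crops)
print(len(train), len(val))
```

Splitting the crops directly, rather than the invoices, would let lines from the same invoice appear in both subsets and inflate the validation score.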

Model comparison

The notebook NB_Comparatif.ipynb documents a quantitative comparison between:

  • TrOCR (Microsoft, Transformer-based)
  • PaddleOCR CRNN (fine-tuned on invoices)

PaddleOCR CRNN was selected for its lower inference latency, smaller model footprint, and better fit for a prototype deployment context. The TrOCR experiment is preserved in NB_experiment_TrOCR.ipynb.

Evaluation metrics

Fine-tuning is monitored with norm_edit_dis, PaddleOCR's normalized edit-distance score (1 minus the normalized edit distance, so higher is better), which is approximately 1 - CER at the character level. Epoch-by-epoch metrics are logged in:

  • workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.csv
  • workspace_paddleocr_invoice/runs/metrics/rec_epoch_metrics.png
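
The relationship between a normalized edit-distance score and CER can be illustrated with a small pure-Python sketch. It assumes the score is 1 minus the mean of per-sample edit distances normalized by the longer string, and that CER is the total edit distance divided by the total reference length; the function names and sample strings are made up, and this is not PaddleOCR's own code.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def norm_edit_score(pairs: List[Tuple[str, str]]) -> float:
    """1 minus the mean per-sample normalized edit distance (higher is better)."""
    total = 0.0
    for prediction, target in pairs:
        longest = max(len(prediction), len(target))
        total += levenshtein(prediction, target) / longest if longest else 0.0
    return 1.0 - total / len(pairs)


def cer(pairs: List[Tuple[str, str]]) -> float:
    """Character error rate: total edit distance over total reference length."""
    errors = sum(levenshtein(prediction, target) for prediction, target in pairs)
    return errors / sum(len(target) for _, target in pairs)


# (prediction, reference) pairs; the second one confuses "I" with "l".
pairs = [("TOTAL 125.00", "TOTAL 125.00"), ("lnvoice", "Invoice")]
print(round(norm_edit_score(pairs), 4), round(cer(pairs), 4))
```

The two quantities are close but not identical, because the score averages per sample while CER pools all characters, which is why the README treats one as a proxy for the other.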

Key results

Results from the internal validation set (75/25 anti-leakage split by invoice):

  • CER proxy: 0.19% on the validation set
  • Inference latency: 3.4 ms per crop at batch size 12 on an NVIDIA L4 GPU

These numbers come from a controlled benchmark on a held-out split of the same corpus used for training. Performance on out-of-corpus or significantly different invoice formats may vary.
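
For a sense of scale, the reported latency can be turned into batch time and throughput with simple arithmetic. The sketch below covers recognition only, assumes the per-crop cost scales linearly, and uses a hypothetical 60-line invoice; preprocessing and detection time are not included.

```python
LATENCY_MS_PER_CROP = 3.4  # reported figure
BATCH_SIZE = 12            # reported figure

batch_latency_ms = LATENCY_MS_PER_CROP * BATCH_SIZE
crops_per_second = 1000.0 / LATENCY_MS_PER_CROP


def recognition_time_ms(n_crops: int) -> float:
    """Recognition-only estimate, assuming cost grows linearly with crop count."""
    return n_crops * LATENCY_MS_PER_CROP


print(f"{batch_latency_ms:.1f} ms per batch of {BATCH_SIZE}")
print(f"{crops_per_second:.0f} crops per second")
print(f"{recognition_time_ms(60):.0f} ms for a hypothetical 60-line invoice")
```

This works out to about 40.8 ms per batch of 12 and roughly 294 crops per second on the benchmarked hardware.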

API and prototype

The deployment package exposes:

FastAPI (port 8000):

  • GET /health - service health check
  • POST /ocr - multipart file upload, returns {lines, confidences, mean_confidence, n_segments, preprocessing}
  • Swagger docs: http://localhost:8000/docs
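
The input constraints stated earlier (JPG or PNG, up to 10 MB) could be enforced on the uploaded bytes along the following lines. This is a sketch under assumptions: the README does not say how the API validates uploads, the function name is hypothetical, and "10 MB" is interpreted here as 10 × 1024 × 1024 bytes.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumes "10 MB" means 10 MiB

# Standard file signatures for the two accepted formats.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def validate_upload(data: bytes) -> str:
    """Return "jpeg" or "png" for an acceptable upload, else raise ValueError."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the 10 MB limit")
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    raise ValueError("unsupported file type: expected JPG or PNG")


# Made-up payloads: only the leading signature bytes are realistic.
print(validate_upload(PNG_MAGIC + b"\x00" * 16))
print(validate_upload(JPEG_MAGIC + b"\xe0" + b"\x00" * 16))
```

Checking the leading bytes rather than the file extension avoids trusting a client-supplied name.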

Streamlit (port 8501): a web interface to upload an invoice, adjust the confidence threshold, visualize the output table, and download the CSV.
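
Applying a confidence threshold to an OCR result might look like the sketch below. The field names follow the `/ocr` response described above; keeping lines whose confidence is at or above the threshold and recomputing `mean_confidence` as an arithmetic mean are assumptions, not confirmed behavior of the Streamlit app.

```python
from typing import Dict, List


def filter_by_confidence(
    lines: List[str], confidences: List[float], threshold: float
) -> Dict[str, object]:
    """Keep lines whose confidence is at or above the threshold."""
    kept = [
        (text, confidence)
        for text, confidence in zip(lines, confidences)
        if confidence >= threshold
    ]
    kept_confidences = [confidence for _, confidence in kept]
    mean = sum(kept_confidences) / len(kept_confidences) if kept_confidences else 0.0
    return {
        "lines": [text for text, _ in kept],
        "confidences": kept_confidences,
        "mean_confidence": mean,
        "n_segments": len(kept),
    }


# Made-up result with one low-confidence line.
filtered = filter_by_confidence(
    lines=["ACME Corp", "~~|l1", "Total: 42.00"],
    confidences=[0.95, 0.40, 0.80],
    threshold=0.5,
)
print(filtered["lines"], filtered["n_segments"])
```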

Both services are packaged together with Docker Compose.

Repository structure

Promy/
├── deployment/                     # Docker prototype (API + frontend)
│   ├── api/                        # FastAPI routes (/ocr, /health)
│   ├── front/                      # Streamlit app
│   ├── models/rec_infer/           # Fine-tuned CRNN model (inference)
│   ├── preprocessing.py
│   ├── tests/                      # pytest tests (API + vendor)
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── pyproject.toml
│
├── notebooks/                      # Research and training notebooks
│   ├── NB1_EDA.ipynb
│   ├── NB2_Preprocessing.ipynb
│   ├── NB3_Fine-tuning_DETRapidOCR_RECPaddleOCR.ipynb
│   ├── NB_Comparatif.ipynb         # TrOCR vs PaddleOCR comparison
│   ├── NB_DET_Benchmark.ipynb      # Detection benchmark
│   ├── NB_experiment_TrOCR.ipynb   # TrOCR experiment archive
│   ├── preprocessing.py
│   └── outputs/
│
├── models/                         # Final model and metrics
│   └── PaddleOCR_Invoice_v2/
│       ├── rec_infer/
│       ├── latency_benchmark.json
│       └── README.md
│
├── workspace_paddleocr_invoice/    # Training artifacts (see local README)
│   ├── export/rec_infer/
│   ├── runs/
│   │   ├── rec/
│   │   │   ├── config.yml          # Fine-tuning configuration
│   │   │   └── train.log           # Full training log (40 epochs)
│   │   └── metrics/
│   ├── prepared_data/
│   ├── testsAB_outputs/
│   └── README.md
│
├── Promy_raw/                      # Raw data (see local README)
│   └── datasets/
│
├── .gitignore
├── pyproject.toml
├── uv.lock
├── README.md
└── README.fr.md

How to run

Prerequisites

For the Docker prototype:

  • Docker 24+ and Docker Compose v2

For local notebook execution:

  • Python 3.12
  • uv for environment management
  • Optional: NVIDIA GPU with CUDA for retraining

Get the dataset

  1. Download from Kaggle:

    https://www.kaggle.com/datasets/osamahosamabdellatif/high-quality-invoice-images-for-ocr

  2. Extract to:

    Promy_raw/datasets/High-Quality Invoice Images for OCR/
    ├── batch_1/
    │   ├── *.csv
    │   └── images...
    └── batch_2/
        └── images...
    
  3. Alternative: in NB3, set ALLOW_KAGGLEHUB_FALLBACK = True (cell 3) to download via kagglehub (requires a configured Kaggle API key).

The two out-of-corpus test images used for A/B tests are already present in Promy_raw/datasets/.

Run the Docker prototype

cd deployment
docker compose up -d --build

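Once running:

  • API: http://localhost:8000 (Swagger docs at http://localhost:8000/docs)
  • Streamlit interface: http://localhost:8501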
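
The containers can take a few seconds to start accepting requests. A small polling helper against GET /health, using only the standard library, might look like this; the retry count, the delay, and the function names are arbitrary choices made for the sketch.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the URL until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_healthy("http://localhost:8000/health")
    print("API ready" if ready else "API did not become healthy in time")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.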

Read the notebooks

The notebooks are primarily written in French because they document the project methodology in detail. Each notebook includes an English summary at the top to make the workflow understandable for non-French readers.

Recommended reading order:

  1. NB1_EDA.ipynb - dataset exploration, annotation inventory, biases
  2. NB2_Preprocessing.ipynb - preprocessing pipeline and design choices
  3. NB3_Fine-tuning_DETRapidOCR_RECPaddleOCR.ipynb - pseudo-labelling, anti-leakage split, fine-tuning, export, A/B tests
  4. NB_Comparatif.ipynb - quantitative TrOCR vs PaddleOCR comparison
  5. NB_DET_Benchmark.ipynb - detection benchmark
  6. NB_experiment_TrOCR.ipynb - TrOCR experiment archive (narrative)

Retrain the model (optional)

  1. Clone PaddleOCR into the workspace:

    cd workspace_paddleocr_invoice
    git clone https://github.com/PaddlePaddle/PaddleOCR.git .

  2. Download the pretrained weights referenced in runs/rec/config.yml (section Global.pretrained_model).

  3. In NB3, set:

    • FORCE_REBUILD_PREPARED_DATA = True (cell 5) to regenerate pseudo-labels and crops
    • RUN_REC_TRAINING = True (cell 5) to start training

  4. Checkpoints will be written to runs/rec/ and the best model exported to export/rec_infer/.

See workspace_paddleocr_invoice/README.md for details.

Limitations

  • Detection is not fine-tuned. RapidOCR is used as-is. Performance on atypical invoice layouts depends on the DBNet pretrained model.
  • Word spacing. The CRNN model can miss spaces between words in some configurations.
  • French-language coverage. The base model and training corpus are English-dominant. Performance on French invoices is not fully characterized.
  • No structured field extraction. The pipeline outputs raw text lines. It does not extract fields such as amounts, dates, or vendor names.
  • Template diversity. Results may degrade on invoice formats significantly different from the training corpus.
  • Prototype scope. The Docker deployment is a demonstrator, not a production-ready system.

Future improvements

  • Fine-tune the detection module on invoice-specific layouts
  • Add a structured field extraction layer (KIE)
  • Expand French-language coverage in the training corpus
  • Benchmark on a broader range of invoice templates
  • Add a CI pipeline for automated regression testing

Tech stack

Component           | Technology
--------------------+-------------------------------------------
Language            | Python 3.12
OCR recognition     | PaddleOCR (CRNN fine-tuning)
OCR detection       | RapidOCR (DBNet ONNX)
Image preprocessing | OpenCV, NumPy, Pillow
Model experiments   | PyTorch, Hugging Face Transformers (TrOCR)
API                 | FastAPI
Frontend            | Streamlit
Deployment          | Docker, Docker Compose
Environment         | uv

License

Author

Valentin Valluet
