🚧 Status: In Progress (under active development)
Full-stack project for extracting key business fields from invoices and receipts (PDF/JPG/PNG), with a built-in review & correction workflow and JSON/CSV export.
📌 Scope (MVP): vendor_name, invoice_date, total_amount, currency, tax_amount
🗂️ Statuses: queued → processing → done / failed
📝 Correction: inline editing + saved as corrected_result
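The status flow above can be read as a small state machine. The following is a minimal Python sketch of that idea, not the backend's actual code (the backend is ASP.NET Core). The names `DocumentStatus`, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical, and treating done and failed as terminal states is an assumption.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"
    FAILED = "failed"


# Allowed moves, mirroring queued → processing → done / failed.
# Assumption: done and failed are terminal (no retry path is modelled here).
ALLOWED_TRANSITIONS = {
    DocumentStatus.QUEUED: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.DONE, DocumentStatus.FAILED},
    DocumentStatus.DONE: set(),
    DocumentStatus.FAILED: set(),
}


def advance(current: DocumentStatus, target: DocumentStatus) -> DocumentStatus:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed moves in one table makes it easy to reject accidental jumps such as done → queued in a single place.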
Build an end-to-end system that:
- uploads documents (PDF/JPG/PNG),
- runs an extraction pipeline (OCR / ML),
- stores results and confidence,
- allows manual corrections in UI,
- exports final structured data to JSON and CSV.
├─ backend/ ASP.NET Core Web API + background worker
├─ frontend/ Angular UI (document list, details, corrections)
├─ ml-service/ Python FastAPI inference (stub → OCR → ML model)
├─ docs/ notes, diagrams, roadmap
├─ data/ datasets (ignored by git)
└─ docker-compose.yml
- Upload document → saved to storage + DB record created (queued)
- Analyze → background worker picks queued docs
- Worker sets processing → calls ml-service/infer
- Store raw_result + confidence summary → status done/failed
- User can correct fields → saved as corrected_result
- Export final result (corrected if exists, else raw) → JSON / CSV
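The last step (corrected if exists, else raw) could look roughly like the sketch below. It is written in Python for illustration only. The function names are hypothetical, and it assumes corrected_result is stored as a full document with the same shape as raw_result rather than as a per-field patch.

```python
import csv
import io
import json

# MVP fields, in the column order used for the CSV export (order is an assumption).
FIELDS = ["vendor_name", "invoice_date", "total_amount", "currency", "tax_amount"]


def final_result(raw_result, corrected_result=None):
    """Prefer the corrected result when one exists, else fall back to raw."""
    return corrected_result if corrected_result is not None else raw_result


def to_json(result):
    """Serialize the result, keeping non-ASCII characters readable."""
    return json.dumps(result, ensure_ascii=False, indent=2)


def to_csv(result):
    """Flatten the field values into one header row and one value row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerow([result["fields"][name]["value"] for name in FIELDS])
    return buffer.getvalue()
```

Confidence values are left out of the CSV here to keep one column per field. Whether the real export includes them is not specified.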
This section will be updated as MVP features land.
- Basic document processing foundations (statuses + DTOs)
- Storage-related changes (local MVP storage)
- Processing pipeline groundwork (worker/pipeline merged)
- Results storage (document_results: raw/corrected JSON, model_version, confidence)
- Analyze endpoint wiring (POST /api/documents/{id}/analyze)
- Export endpoints (JSON/CSV)
- Upload page (drag & drop + progress)
- Documents list + filters + pagination
- Document details + extracted fields table
- Inline correction + “edited” highlighting
- Export buttons (JSON/CSV)
- /infer stub (to unblock end-to-end demo)
- OCR baseline switched to paddle
- ML model integration (layout-aware document understanding)
- /infer returns fixed JSON in the final contract format
- Goal: validate pipeline + UI + DB before real extraction
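A stub of this kind can be as small as a function that ignores its input and returns a constant payload. The sketch below shows only that core. In the ml-service it would presumably sit behind the FastAPI /infer route, and that wiring is omitted here. `STUB_RESPONSE`, `infer_stub`, and the sample values are hypothetical placeholders.

```python
import copy

# Fixed payload in the contract shape; every value is a made-up placeholder.
STUB_RESPONSE = {
    "fields": {
        "vendor_name": {"value": "Example Vendor", "confidence": 0.82},
        "invoice_date": {"value": "2024-01-31", "confidence": 0.77},
        "total_amount": {"value": 123.45, "confidence": 0.91},
        "currency": {"value": "PLN", "confidence": 0.88},
        "tax_amount": {"value": 23.45, "confidence": 0.69},
    },
    "model_version": "stub-v0",
}


def infer_stub(document_bytes: bytes) -> dict:
    """Ignore the uploaded bytes and return a copy of the fixed payload."""
    # deepcopy so callers cannot mutate the shared constant by accident
    return copy.deepcopy(STUB_RESPONSE)
```

Because the shape matches the final contract, the backend, database, and UI can be exercised end to end before any OCR or model exists.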
- PDF → images
- OCR with Polish characters support (UTF-8 end-to-end)
- heuristics for totals / tax / dates / currency
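As an illustration of what such heuristics might look like, here is a rough regex-based sketch that runs over OCR text. It is not the project's implementation. The label lists, currency codes, date formats, and function names are all assumptions, and real invoices will need many more cases.

```python
import re
from datetime import datetime

CURRENCY_CODES = ("PLN", "EUR", "USD")  # assumed shortlist
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d-%m-%Y", "%d/%m/%Y")
DATE_TOKEN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{2}[./-]\d{2}[./-]\d{4}\b")
# An amount with two decimals, optionally with space-separated thousands.
# The lookarounds keep it from matching pieces of dates such as 31.01.2024.
AMOUNT = re.compile(
    r"(?<![\d.,])(?:\d{1,3}(?:[ \u00a0]\d{3})+|\d+)[.,]\d{2}(?![.,]?\d)"
)


def parse_amount(token):
    """Turn '1 234,56' or '1234.56' into a float."""
    cleaned = token.replace(" ", "").replace("\u00a0", "")
    return float(cleaned.replace(",", "."))


def find_date(text):
    """Return the first recognisable date as YYYY-MM-DD, or None."""
    for token in DATE_TOKEN.findall(text):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(token, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
    return None


def find_currency(text):
    """Return an ISO code if one appears as a word, mapping 'zł' to PLN."""
    upper = text.upper()
    for code in CURRENCY_CODES:
        if re.search(rf"\b{code}\b", upper):
            return code
    if "ZŁ" in upper:
        return "PLN"
    return None


def find_labelled_amount(text, labels):
    """Return the last amount on the first line containing one of the labels."""
    for line in text.splitlines():
        lowered = line.lower()
        if any(label in lowered for label in labels):
            amounts = AMOUNT.findall(line)
            if amounts:
                return parse_amount(amounts[-1])
    return None
```

For example, `find_labelled_amount(text, ("do zapłaty", "razem", "total"))` would look for a total and `find_labelled_amount(text, ("vat", "podatek", "tax"))` for tax. Substring label matching is crude ("total" also matches "Subtotal"), so a real version would need stricter rules.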
Train / fine-tune on public datasets (then extend with Polish samples):
- SROIE (receipts)
- CORD (receipts)
- FATURA (Zenodo) (synthetic invoices)
{
"fields": {
"vendor_name": {"value":"...", "confidence":0.82},
"invoice_date": {"value":"YYYY-MM-DD","confidence":0.77},
"total_amount": {"value":123.45, "confidence":0.91},
"currency": {"value":"PLN", "confidence":0.88},
"tax_amount": {"value":23.45, "confidence":0.69}
},
"model_version": "stub-v0"
}
Backend: ASP.NET Core
Frontend: Angular
ML Service: Python + FastAPI
Database: PostgreSQL (dev)
Containers: Docker Compose
Created by Avuii