Developed by Paksa IT Solutions
Lab-PDF-To-Dataset is an intelligent data extraction system that automatically converts medical laboratory test reports (PDF and Word documents) into structured, machine-readable CSV and Excel datasets. This tool eliminates the tedious manual data entry process and provides clean, ready-to-use datasets for analysis and research.
Medical lab reports are typically stored as unstructured PDF or Word documents, making it extremely difficult to:
- Extract data for analysis
- Build datasets for machine learning models
- Conduct statistical research
- Perform comparative studies across multiple patients
Our system automates this entire process, converting hundreds of lab reports into structured datasets in minutes.
- Multi-Format Support: Extracts data from PDF (text-based and scanned) and Word documents (.doc, .docx).
- OCR Capabilities: Automatically detects scanned PDFs and uses OCR (Tesseract) to extract text.
- Multiple Test Types: Supports CBC (Complete Blood Count), LFT (Liver Function Test), RFT (Renal Function Test), and TFT (Thyroid Function Test).
- Dual Output: Generates both CSV and Excel (.xlsx) datasets.
- Batch Processing: Upload ZIP files containing multiple reports for bulk processing.
- Robust Error Handling: Skips problematic files without stopping the batch, logging errors to a dedicated CSV report.
- Web Interface: User-friendly React-based frontend with progress tracking and direct downloads.
- Accurate Extraction: Regex patterns capture the actual test result values and skip reference ranges and dates.
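The idea behind extracting results while avoiding ranges and dates can be illustrated with a minimal sketch. This is not the project's actual pattern set: the function name `extract_result`, the date and range formats, and the sample report lines are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical building block: an integer or decimal number.
_NUMBER = r"\d+(?:\.\d+)?"


def extract_result(line: str, label: str) -> Optional[float]:
    """Return the first standalone number after `label`, skipping dates and ranges."""
    # Word boundaries keep "MCH" from matching inside "MCHC".
    match = re.search(rf"\b{re.escape(label)}\b", line, flags=re.IGNORECASE)
    if match is None:
        return None
    rest = line[match.end():]
    # Blank out dates such as 12/03/2024 or 2024-03-12.
    rest = re.sub(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b", " ", rest)
    # Blank out reference ranges such as "13.0 - 17.0" or "4.0 to 11.0".
    rest = re.sub(rf"{_NUMBER}\s*(?:-|–|to)\s*{_NUMBER}", " ", rest)
    value = re.search(_NUMBER, rest)
    return float(value.group()) if value else None
```

For an assumed line such as `Hemoglobin (HB) 13.5 g/dL 13.0 - 17.0`, calling `extract_result(line, "HB")` picks the measured value and ignores the reference range. Real reports vary widely in layout, so a production pattern set would likely need per-lab adjustments.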
- Quickly build datasets for medical informatics projects
- Analyze health trends across patient demographics
- Complete assignments requiring real-world medical data
- Understand how unstructured data (including scanned images) is converted to structured formats
- Learn about regex patterns, OCR, and text extraction techniques
- Study the architecture of full-stack applications (Flask + React)
- Use generated datasets for machine learning projects
- Build predictive models for disease diagnosis
- Create health analytics dashboards
- Convert thousands of lab reports into analysis-ready datasets in minutes
- Eliminate weeks of manual data entry work
- Focus on analysis rather than data collection
- Excel Export: Get data in a format ready for immediate analysis in Excel, pandas, or other tools.
- Error Logging: A detailed Processing_Errors.csv helps identify data quality issues in source files.
- Consistent Formatting: Standardized columns across all extracted files.
- Build predictive models for disease detection
- Train classification algorithms for abnormal test results
- Develop patient risk assessment systems
- Python 3.8+
- Node.js 14+
- Tesseract OCR (for scanned PDFs):
  - Windows: Download Installer (add to PATH)
  - Linux: sudo apt install tesseract-ocr
  - Mac: brew install tesseract
- Poppler (for PDF-to-image conversion):
  - Windows: Download Release (add bin folder to PATH)
  - Linux: sudo apt install poppler-utils
  - Mac: brew install poppler
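Because both Tesseract and Poppler must be reachable on PATH, a quick check before starting the backend can save debugging time. The sketch below is not part of the project. It assumes the executables are named `tesseract` and `pdftoppm` (a binary shipped with Poppler); which Poppler binary the application actually calls is not stated here.

```python
import shutil
from typing import Dict, Optional

# Assumed executable names; adjust if your installation differs.
REQUIRED_TOOLS = {"Tesseract OCR": "tesseract", "Poppler": "pdftoppm"}


def find_tools(tools: Dict[str, str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```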
cd lab_pdf_to_dataset
pip install -r requirements.txt
python app.py

cd frontend
npm install
npm run dev

- Start the Backend: Run python app.py (Flask server on port 5000)
- Start the Frontend: Run npm run dev (React app on port 5173)
- Upload Files:
  - Single or multiple PDF/Word files (text-based or scanned)
  - ZIP archive containing multiple reports
- Select Test Types: Check boxes for CBC, LFT, RFT, or TFT.
- Download Results:
  - Download CSV or Excel datasets directly from the UI.
  - Download the error log if any files failed.
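The ZIP upload and error log steps above follow a skip-and-log pattern: a failing file is recorded and the batch continues. A minimal sketch of that pattern is shown below. It is an illustration under assumptions, not the project's code: `process_zip`, the `parse_report` callback, and the `File`/`Error` log columns are hypothetical names.

```python
import csv
import io
import zipfile
from typing import Callable, Dict, List, Tuple


def process_zip(
    zip_bytes: bytes,
    parse_report: Callable[[str, bytes], Dict[str, str]],
) -> Tuple[List[Dict[str, str]], str]:
    """Parse every file in a ZIP; return extracted rows and an error log as CSV text."""
    rows: List[Dict[str, str]] = []
    errors: List[Dict[str, str]] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            try:
                row = parse_report(info.filename, archive.read(info))
                row["Source_PDF"] = info.filename
                rows.append(row)
            except Exception as exc:
                # Record the failure and keep processing the remaining files.
                errors.append({"File": info.filename, "Error": str(exc)})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["File", "Error"])
    writer.writeheader()
    writer.writerows(errors)
    return rows, buffer.getvalue()
```

Catching a broad `Exception` is a deliberate choice in this sketch: one corrupt report should not abort a batch of hundreds, and the log preserves the reason for later inspection.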
CBC: Name, Age, Gender, HB, RBC, HCT, MCV, MCH, MCHC, Platelets, WBC, Neutrophils, Lymphocytes, Monocytes, Eosinophils, ESR, Source_PDF
LFT: Name, Age, Gender, Total Bilirubin, Direct Bilirubin, Indirect Bilirubin, ALT, AST, ALP, Albumin, Total Protein, Source_PDF
RFT: Name, Age, Gender, Urea, BUN, Creatinine, GFR, Uric Acid, Source_PDF
TFT: Name, Age, Gender, T3, T4, TSH, Free T3, Free T4, Source_PDF
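One way to guarantee that every output file has the same columns in the same order is to write rows against a fixed column list. The sketch below uses the thyroid panel columns listed above. The helper `to_csv` and its handling of missing or extra fields are assumptions for illustration, not the project's implementation.

```python
import csv
import io
from typing import Dict, Iterable, List

# Column order copied from the thyroid function test schema above.
TFT_COLUMNS = [
    "Name", "Age", "Gender", "T3", "T4", "TSH", "Free T3", "Free T4", "Source_PDF",
]


def to_csv(rows: Iterable[Dict[str, str]], columns: List[str]) -> str:
    """Serialize rows with a fixed header; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",             # fill columns the parser could not find
        extrasaction="ignore",  # drop keys that are not part of the schema
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Leaving a missing value as an empty cell, rather than dropping the row, keeps partially extracted reports in the dataset so they can be reviewed or imputed later.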
© 2026 Paksa IT Solutions. All rights reserved.