Releases: STENS66/Simple-Text-Extractor
Simple Text Extractor v1.2
Release v1.2: "Secure & Industrial Performance Edition"
This major update transforms Simple Text Extractor into a robust tool capable of handling professional workloads while reinforcing its "Security First" philosophy.
1. User Interface & Experience (UI/UX)
Safety Confirmation: Added a new confirmation dialog when clicking "Clear List". This prevents the accidental loss of a long queue of files during heavy processing sessions.
UI Zero-Freeze: Further optimized the asynchronous bridge between the OCR engine and the interface, so the UI stays responsive even on documents exceeding 2300 pages.
2. Key New Features & Robustness
Massive Volume Support: The engine has been re-engineered to process 2300+ pages in a single PDF without memory saturation, making it ideal for legal, medical, or administrative archives.
Corrupted File Handling: Improved detection of non-compliant or damaged PDFs and images. The application now skips problematic files gracefully, writing a clear log entry instead of interrupting the entire batch.
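The skip-and-log behaviour described above can be sketched as follows. This is only an illustration, not the application's actual code: `process_batch` and the `process_one` callback are hypothetical names, and the real tool may classify errors more finely than a blanket `except`.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

log = logging.getLogger("batch")


def process_batch(
    files: Iterable[Path], process_one: Callable[[Path], None]
) -> tuple[list[Path], list[Path]]:
    """Run process_one on every file; skip and log the ones that fail."""
    done: list[Path] = []
    skipped: list[Path] = []
    for path in files:
        try:
            process_one(path)
        except Exception as exc:
            # A damaged file must not stop the rest of the queue.
            log.warning("Skipped %s (%s)", path.name, type(exc).__name__)
            skipped.append(path)
        else:
            done.append(path)
    return done, skipped
```

Returning the skipped files separately lets a caller show a summary at the end of the batch rather than interrupting it.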
3. Performance, Stability & Security
Optimized OCR Strategy:
Windows (MSIX): Uses a dedicated ProcessPoolExecutor to isolate OCR workers, preventing memory leaks and bypassing the Windows file-locking limitations.
Linux (Snap): Implements a specialized ThreadPoolExecutor with smart pointer references to navigate strict Snap confinement while avoiding OOM (Out Of Memory) crashes.
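A minimal sketch of the per-platform choice described above, using the standard `concurrent.futures` module. The helper names (`executor_class`, `run_pages`, `ocr_page`) are hypothetical and `ocr_page` is only a placeholder for the real OCR call; how the application actually selects and configures its pools is not stated in these notes.

```python
import sys
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def executor_class(platform: str = sys.platform) -> type[Executor]:
    """Process pool on Windows, thread pool elsewhere (e.g. inside a Snap)."""
    if platform.startswith("win"):
        return ProcessPoolExecutor
    return ThreadPoolExecutor


def ocr_page(page_number: int) -> str:
    """Placeholder for the real per-page OCR work."""
    return f"text of page {page_number}"


def run_pages(pages, max_workers: int = 2, platform: str = sys.platform):
    """Map ocr_page over the pages with the pool chosen for this platform."""
    with executor_class(platform)(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))
```

Both executor classes share the same interface, so the calling code stays identical whichever one is picked.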
Security Hardening (CWE-377 & CWE-209):
Atomic File Operations: Temporary files are created in secure, restricted "bunkers" (0600 permissions) and deleted atomically upon closure.
Advanced Anonymization: Automatic scrubbing of personal usernames and local paths from system logs to protect user privacy.
Input Validation: Strict Regex-based sanitization of language parameters and DPI bounds (72-2400) to prevent command injection or engine crashes.
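As an illustration of the input validation item, the sketch below checks a language parameter against a regular expression and a DPI value against the 72-2400 bounds quoted above. The pattern is an assumption modelled on Tesseract-style codes such as `eng`, `chi_sim` or `eng+fra`; the expression actually used by the application is not given in these notes, and `validate_lang` and `validate_dpi` are hypothetical names.

```python
import re

# Assumed shape: three-letter codes, optional "_suffix" parts, joined by "+".
LANG_RE = re.compile(r"[a-z]{3}(?:_[a-z]+)*(?:\+[a-z]{3}(?:_[a-z]+)*)*")
DPI_MIN, DPI_MAX = 72, 2400


def validate_lang(value: str) -> str:
    """Accept only values that match the whitelist pattern in full."""
    if not LANG_RE.fullmatch(value):
        raise ValueError(f"invalid language parameter: {value!r}")
    return value


def validate_dpi(value: str) -> int:
    """Parse the DPI field and enforce the allowed range."""
    dpi = int(value)  # raises ValueError for non-numeric input
    if not DPI_MIN <= dpi <= DPI_MAX:
        raise ValueError(f"DPI must be between {DPI_MIN} and {DPI_MAX}")
    return dpi
```

Matching the whole string against a whitelist, rather than trying to strip dangerous characters, means anything unexpected is rejected before it can reach a command line.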
4. System Optimization
Multi-Core Intelligence: The app now dynamically calculates the optimal number of CPU cores to use (up to 8), ensuring maximum speed while leaving enough resources for your OS to remain fluid.
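One plausible way to compute such a worker count is sketched below. The cap of 8 comes from the text above; keeping one core in reserve for the OS is an assumption, and `pick_worker_count` is a hypothetical name.

```python
import os

MAX_WORKERS = 8  # upper bound quoted in the release notes


def pick_worker_count(cpu_count=None, reserve: int = 1) -> int:
    """Use all cores but `reserve`, never fewer than 1 nor more than 8."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(MAX_WORKERS, cores - reserve))
```

For example, a 4-core machine would get 3 workers under these assumptions, while a 16-core machine would be capped at 8.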
How to Install
- Windows
Available on the Microsoft Store: https://apps.microsoft.com/detail/9NVRKF4X80JZ or simply type "Simple Text Extractor" in the Store search bar.
- Linux
Available on the Snap Store, or install it from the command line:
sudo snap install simple-text-extractor
Simple Text Extractor v1.1
1. User Interface & Experience (UI/UX):
The interface has been entirely rebuilt to provide modern and fluid ergonomics.
- New Modern Design: Replaced the classic Tkinter interface with CustomTkinter. The application now features a sleek, professional look and offers better support for high-resolution displays (High DPI).
- Integrated Help System (Tooltips): Added explanatory tooltips when hovering over each option (DPI, Language, PDF/A, etc.) to guide the user without cluttering the interface.
- Visual Queue: Replaced single selection fields with a dynamic list. You can now view all pending files and their details, and remove specific files from the list via a dedicated "X" button or clear everything in one click.
- Instant Feedback: A precise progress bar and detailed status messages (page by page) ensure the user is always informed of the current progress.
2. Key New Features:
Version 1.1 introduced powerful tools for productivity:
- Batch Processing: No more processing files one by one. Add dozens of PDFs or images to the list and let the application process them automatically in sequence.
- Drag & Drop: Simply drag your files (PDF, PNG, JPG, TIFF, BMP) directly into the window to add them to the queue.
- Archiving Format Export (PDF/A): A new option to generate files compliant with the PDF/A-1b standard, ensuring the long-term durability and readability of your documents (ideal for legal or administrative archiving).
- Output Folder Management: Users can now choose a specific destination folder. By default, the application intelligently manages filenames to prevent accidental overwrites (automatic incrementation: _ocr_1.pdf).
- Metadata Analysis: Before processing, the application now displays technical information for each file: size, page count, resolution (DPI), and current PDF/A compliance.
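The automatic incrementation mentioned under Output Folder Management could look like the sketch below. The `_ocr_1.pdf` suffix is taken from the text; using a plain `_ocr.pdf` name for the first output is an assumption, and `unique_output_path` is a hypothetical helper.

```python
from pathlib import Path


def unique_output_path(source: Path, out_dir: Path) -> Path:
    """Return an output path in out_dir that does not exist yet."""
    candidate = out_dir / f"{source.stem}_ocr.pdf"
    counter = 1
    while candidate.exists():
        # name_ocr.pdf is taken: try name_ocr_1.pdf, name_ocr_2.pdf, ...
        candidate = out_dir / f"{source.stem}_ocr_{counter}.pdf"
        counter += 1
    return candidate
```

This check-then-write approach is enough for a single-user desktop tool; it does not protect against another process creating the same file between the check and the write.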
3. Performance, Stability & Security:
- Multiprocessing Architecture: The graphical interface and the OCR engine are now completely separated into distinct processes.
- Result: The application never freezes, even when processing heavy or complex documents.
- Cancel Button: Ability to cleanly interrupt the process at any time.
- In-Memory Processing: Optimized data streams to avoid unnecessary disk writes, ensuring maximum processing speed.
Enhanced Security:
- "Decompression Bomb" Protection: Integrated a pixel limit (PIL) to prevent crashes or attacks via excessively large malicious images.
- Path Sanitization: Strict verification of executables (Tesseract) and file paths to prevent vulnerabilities related to the system PATH.
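To illustrate the path sanitization idea, the sketch below resolves an executable relative to the application directory and refuses anything that ends up outside it, so the system PATH is never consulted. The directory layout and the name `resolve_bundled_executable` are assumptions; the notes do not describe how the application performs this check.

```python
import os
from pathlib import Path


def resolve_bundled_executable(app_dir: Path, relative: str) -> Path:
    """Resolve an executable that must live inside app_dir."""
    root = app_dir.resolve()
    candidate = (root / relative).resolve()
    # After resolving symlinks and "..", the file must still be under root.
    if os.path.commonpath([root, candidate]) != str(root):
        raise ValueError("executable escapes the application directory")
    if not candidate.is_file():
        raise FileNotFoundError(candidate)
    return candidate
```

The resulting absolute path can then be handed to the OCR wrapper instead of a bare command name.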
Technical Robustness (New Fixes):
- JPG Anti-Crash Protection: Automatic normalization of images to RGB before processing. This eliminates "Unsupported image format" errors often occurring with smartphone photos (iPhone/Android) containing exotic metadata.
- DPI Input Validation: Secured the DPI field to prevent the entry of erroneous or aberrant values (strictly limited between 75 and 2400 DPI), ensuring the OCR engine is never launched with invalid parameters.
- Robust Worker (TESSDATA_PREFIX): Improved automatic detection of language files in parallel processes. This ensures OCR functionality even on non-standard or portable installations.
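The language-file detection mentioned in the Robust Worker item might be approximated as follows: look through a list of candidate directories for `<lang>.traineddata` and export the first match as `TESSDATA_PREFIX`. The function names and the candidate list are hypothetical, and which directory the variable should point to depends on the Tesseract version, so treat this as a sketch only.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_tessdata(candidates: Iterable[Path], lang: str = "eng") -> Optional[Path]:
    """Return the first directory that contains <lang>.traineddata."""
    for directory in candidates:
        if (directory / f"{lang}.traineddata").is_file():
            return directory
    return None


def configure_tessdata(candidates: Iterable[Path], lang: str = "eng") -> Path:
    """Export TESSDATA_PREFIX so processes started afterwards inherit it."""
    found = find_tessdata(candidates, lang)
    if found is None:
        raise FileNotFoundError(f"no {lang}.traineddata in candidate directories")
    os.environ["TESSDATA_PREFIX"] = str(found)
    return found
```

Calling such a helper at the start of each worker as well as in the main process is one way to keep parallel workers consistent on portable installations.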
4. Linux Support & System Optimization (Latest Update):
This major update introduces official Linux (Snap) support and fundamental improvements to system resource management.
What's New & Fixed:
- Linux Support (Snap Store): Strictly confined release for maximum security on Ubuntu and compatible distributions.
- Stateless Engine: Complete overhaul of file processing. Tasks are now isolated to prevent memory corruption (Fixed Pdfium Data Format Error).
- Multi-Core Optimization: Stabilized parallel processing, allowing heavy document batches without UI freezing.
- Drag & Drop Fix: Implemented late initialization and desktop-legacy plugs to restore Drag & Drop functionality on X11/Xorg sessions.
- Log Anonymization: Enhanced privacy with automatic masking of personal paths in error logs.
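A rough sketch of log anonymization along these lines: replace the current home directory with a placeholder, then mask the user segment of any remaining home-style path. The regular expression and the placeholders `<home>` and `<user>` are assumptions chosen for illustration, not the application's actual rules.

```python
import re

# Matches /home/<name>, /Users/<name> and X:\Users\<name> (case-insensitive).
USER_DIR_RE = re.compile(r"(?i)(/home/|/Users/|[A-Z]:\\Users\\)[^/\\\s]+")


def anonymize(message: str, home: str) -> str:
    """Mask the home directory and any user name embedded in a path."""
    if home:
        message = message.replace(home, "<home>")
    return USER_DIR_RE.sub(r"\1<user>", message)
```

In practice `home` would typically be `str(pathlib.Path.home())`, and the function could be applied inside a `logging.Filter` so every record is scrubbed before it is written.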
This program is now available via the Microsoft Store: https://apps.microsoft.com/detail/9NVRKF4X80JZ or simply type "Simple Text Extractor" in the Store search bar.
Linux users can find it on the Snap Store.
Simple Text Extractor v1.0
Key Points of v1.0
New OCR Engine: Direct integration of Pytesseract (for OCR) and pypdfium2 (for PDF rendering and manipulation).
Parallelized Performance: Uses multiprocessing.Pool to process PDF pages in parallel, making full use of the available CPU cores and drastically reducing processing time.
Smart Processing: OCR runs only when necessary. The application detects PDFs that already contain a text layer and simply copies them, unless the "Force OCR" option is enabled.
Self-Managed Dependencies: The application automatically locates and configures its bundled dependencies (such as Tesseract and TESSDATA_PREFIX) at startup.
Main Feature
The goal of "Simple Text Extractor" is to ensure that a PDF document or an image has a searchable text layer.
For PDF files: It analyzes each page. If text is present, the page is copied. If there is none (an "image" PDF), it runs OCR and adds an invisible text layer.
For image files (PNG, JPG, etc.): It runs OCR on the image and generates a new single-page PDF containing the image and the corresponding text layer.
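The per-page decision described above (copy pages that already have text, OCR the others, OCR everything when "Force OCR" is on) reduces to a small rule. In the sketch below, plain strings stand in for the text extracted from each page, and `plan_pages` is a hypothetical name; the real application works on PDF page objects.

```python
from typing import Sequence


def plan_pages(page_texts: Sequence[str], force_ocr: bool = False) -> list[str]:
    """Decide, page by page, whether to copy the page or run OCR on it."""
    return [
        "ocr" if force_ocr or not text.strip() else "copy"
        for text in page_texts
    ]
```

Treating whitespace-only text as empty is an assumption; it avoids copying pages whose text layer contains nothing useful.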
Supported Files
Input: .pdf, .png, .jpg, .jpeg, .tiff, .bmp
Output: Always .pdf
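A small sketch of how the supported input list could be enforced when files are added to the queue. The extension set is taken from the list above; the case-insensitive comparison and the name `is_supported` are assumptions.

```python
from pathlib import Path

SUPPORTED_INPUTS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def is_supported(path) -> bool:
    """True when the file extension is one of the accepted input types."""
    return Path(path).suffix.lower() in SUPPORTED_INPUTS
```

An extension check is only a first filter; a file with an accepted extension can still turn out to be damaged once it is opened.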
This program is now available via the Microsoft Store: https://apps.microsoft.com/detail/9NVRKF4X80JZ?hl=fr-be&gl=BE&ocid=pdpshare or simply type "Simple Text Extractor" in the Store search bar.