Dynamic Schema Discovery & Computer Vision Ingestion Engine #28
Open
AlgoFriend wants to merge 2 commits into davidcmoore:master from
System Architecture
This PR implements an end-to-end pipeline that converts unstructured visual inputs into validated inputs for a rigid processing engine. The core idea is to decouple the data requirements from the UI by deriving them automatically, through static analysis of the calculation modules.
Technical Implementation
Static Analysis via AST
To keep the ingestion layer in sync with the logic engine, I implemented a schema discovery routine that inspects the calculation modules with Python’s ast module.
Benefit: This allows the UI to dynamically provision itself based on the underlying logic's requirements, eliminating the need for manual field mapping and reducing technical debt during logic iterations.
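As a rough illustration of the idea (not the code in this PR), schema discovery with ast could look like the sketch below. The module source, the function names `compute_total` and `compute_ratio`, and the convention that public top-level functions define the schema are all assumptions made for the example.

```python
import ast

# Hypothetical calculation module; the real modules are not shown in this PR.
SAMPLE_MODULE = '''
def compute_total(price: float, quantity: float, tax_rate: float = 0.2) -> float:
    return price * quantity * (1 + tax_rate)

def _round(value):
    return round(value, 2)

def compute_ratio(numerator, denominator: float):
    return numerator / denominator
'''


def discover_schema(source: str) -> dict:
    """Map each public top-level function to a list of its positional parameters."""
    tree = ast.parse(source)
    schema = {}
    for node in tree.body:
        # Assumption: names starting with "_" are helpers, not calculations.
        if not isinstance(node, ast.FunctionDef) or node.name.startswith("_"):
            continue
        args = node.args.args  # positional parameters only, for brevity
        # Default values line up with the tail of the positional parameter list.
        first_default = len(args) - len(node.args.defaults)
        schema[node.name] = [
            {
                "name": arg.arg,
                "type": ast.unparse(arg.annotation) if arg.annotation else None,
                "required": index < first_default,
            }
            for index, arg in enumerate(args)
        ]
    return schema


schema = discover_schema(SAMPLE_MODULE)
```

A UI could then render one input per entry in `schema["compute_total"]`, marking `tax_rate` as optional. Because the source is only parsed and never imported, the calculation code is not executed during discovery.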
Computer Vision Pipeline
The ingestion layer uses OpenCV to locate data cells in unstructured document payloads (PDFs and images), filtering detected contours by their position in the contour hierarchy and by aspect-ratio constraints.
Benefit: This allows the system to programmatically isolate data cells within a visual field, transforming a raw image into a structured grid of coordinate-aware "data pods" without hardcoding pixel locations.
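The aspect-ratio filtering step can be sketched in plain Python. The sketch assumes candidate bounding boxes have already been produced (for example from OpenCV contours), and the thresholds `min_area`, `min_aspect`, and `max_aspect` are illustrative values, not the ones used in this PR. The hierarchy check is left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def filter_cells(boxes, min_area=400, min_aspect=1.5, max_aspect=8.0):
    """Keep boxes whose area and width/height ratio look like a data cell."""
    cells = []
    for box in boxes:
        if box.h == 0:
            continue  # degenerate box, aspect ratio undefined
        aspect = box.w / box.h
        if box.w * box.h >= min_area and min_aspect <= aspect <= max_aspect:
            cells.append(box)
    # Return cells in reading order: top to bottom, then left to right.
    return sorted(cells, key=lambda b: (b.y, b.x))


# Hypothetical candidates, as (x, y, w, h).
candidates = [
    Box(0, 0, 1000, 800),   # page border: aspect 1.25, rejected
    Box(300, 50, 120, 30),  # cell: aspect 4.0, kept
    Box(40, 50, 120, 30),   # cell: aspect 4.0, kept
    Box(10, 10, 4, 5),      # speck: area 20, rejected
    Box(40, 100, 20, 200),  # vertical rule: aspect 0.1, rejected
]
cells = filter_cells(candidates)
```

Because the filter works on relative shape and a minimum area rather than on absolute positions, the same thresholds can apply to documents whose cells sit at different pixel coordinates.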
Image Optimization Suite
Captured regions undergo Lanczos4 interpolation for high-fidelity upscaling and Otsu’s binarization to normalize the visual input before it reaches the OCR engine.
Benefit: This significantly increases the signal-to-noise ratio in low-fidelity or "noisy" captures, ensuring that the downstream OCR maintains high accuracy even when processing degraded source documents.
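To show what the binarization step computes, here is a pure-Python stand-in for Otsu's method. It is a sketch for illustration only. In OpenCV the same threshold is selected by passing the `cv2.THRESH_OTSU` flag to `cv2.threshold`, and Lanczos upscaling is requested with `interpolation=cv2.INTER_LANCZOS4` in `cv2.resize`. The pixel values below are made up.

```python
def otsu_threshold(pixels):
    """Return the 0..255 threshold that maximizes between-class variance.

    Pixels with value <= threshold form one class, the rest the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low = 0   # number of pixels at or below the candidate threshold
    sum_low = 0      # sum of their intensities
    for t in range(256):
        weight_low += hist[t]
        sum_low += t * hist[t]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break     # every pixel is in the low class; nothing left to split
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Hypothetical grayscale samples: dark ink plus a bright background.
pixels = [20, 22, 25, 30, 200, 210, 220, 230]
t = otsu_threshold(pixels)
binary = [255 if p > t else 0 for p in pixels]
```

The threshold is derived from the histogram of each captured region, so a dim scan and a bright scan each get their own cut-off without a hardcoded value.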
Regex Normalization & Sanitization
I implemented a robust regex-based cleaning layer that intercepts varied OCR string outputs and transforms them into standardized floating-point values.
Benefit: This prevents "Engine Bust" errors by ensuring the core calculation logic only receives sanitized, type-validated numbers, effectively creating a computational buffer.
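A minimal sketch of such a cleaning layer follows. The confusion table, the stripped symbols, and the choice to treat commas as thousands separators are assumptions for the example and may differ from the rules in this PR.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
# Assumed, non-exhaustive table of letter/digit confusions seen in OCR output.
_CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})


def to_float(raw: str):
    """Return a float parsed from an OCR string, or None if it is not numeric."""
    # Drop whitespace, thousands separators, and common currency symbols.
    text = re.sub(r"[\s,$€£]", "", raw)
    # Only repair look-alike letters when the token already contains a digit,
    # so a word such as "Sol" is not misread as a number.
    if not any(ch.isdigit() for ch in text):
        return None
    text = text.translate(_CONFUSIONS)
    # Require the whole token to be a number; partial matches are rejected.
    return float(text) if _NUMBER.fullmatch(text) else None


samples = [" 1,234.5O ", "$ 12.3", "-l7", "N/A", "Sol", "12 kg", ""]
cleaned = [to_float(s) for s in samples]
```

Returning `None` instead of raising lets the caller decide whether to flag the field for manual review or reject the payload.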
Contextual Harvesting Logic
The system employs "Contextual Anchors" to map identified visual regions to the discovered schema, using size-thresholding to filter out non-target artifacts.
Benefit: This ensures that the engine only ingests relevant data, automatically discarding noise and visual artifacts that would otherwise corrupt the telemetry manifest.
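The PR text does not define a "Contextual Anchor" precisely. One plausible reading, sketched here purely as an assumption, is a recognized label whose position identifies the neighboring value cell. The `Region` type and the `min_area` and `row_tolerance` parameters are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    x: int
    y: int
    w: int
    h: int
    text: str = ""


def map_to_schema(schema_fields, anchors, cells, min_area=500, row_tolerance=10):
    """Pair each schema field with the nearest value cell to the right of its anchor."""
    # Size threshold: drop specks and other small artifacts.
    targets = [c for c in cells if c.w * c.h >= min_area]
    mapping = {}
    for field in schema_fields:
        anchor = next(
            (a for a in anchors if a.text.strip().lower() == field.lower()), None
        )
        if anchor is None:
            mapping[field] = None  # no label found for this field
            continue
        same_row = [
            c for c in targets
            if abs(c.y - anchor.y) <= row_tolerance and c.x >= anchor.x + anchor.w
        ]
        mapping[field] = min(same_row, key=lambda c: c.x, default=None)
    return mapping


# Hypothetical layout: two labels, three real cells, one speck.
anchors = [Region(10, 50, 80, 30, "Price"), Region(10, 100, 80, 30, "Quantity")]
cells = [
    Region(100, 52, 120, 30),
    Region(100, 98, 120, 30),
    Region(95, 55, 5, 5),
    Region(300, 50, 120, 30),
]
mapping = map_to_schema(["price", "quantity", "tax_rate"], anchors, cells)
```

Fields without a matching anchor map to `None`, which gives the verification UI an explicit gap to surface rather than a silently misassigned value.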
Operational Interface (Mission Control)
The UI serves as a real-time verification stack, providing immediate visual feedback upon payload verification and data integrity checks.
Benefit: This enables "Human-in-the-Loop" validation, allowing an operator to review and override the automated ingestion results, which is critical for catching extraction errors in high-stakes environments.
Dependency Synchronization
I included a pre-initialization routine that verifies critical system-level dependencies such as Tesseract and Poppler can be found on $PATH.
Benefit: This check ensures that the entire ground stack is synchronized and operational before the system attempts a mission, preventing runtime failures due to missing host-system components.
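A check of this kind can be sketched with `shutil.which` from the standard library. The executable names are assumptions: `tesseract` for Tesseract and `pdftoppm` as a representative Poppler utility. The `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil

# Assumed executable names; adjust to whichever binaries the pipeline calls.
REQUIRED_BINARIES = {"Tesseract": "tesseract", "Poppler": "pdftoppm"}


def find_missing(required=REQUIRED_BINARIES, which=shutil.which):
    """Return the display names of dependencies not found on PATH."""
    return [name for name, exe in required.items() if which(exe) is None]


def preflight():
    """Abort startup with a readable message if a dependency is missing."""
    missing = find_missing()
    if missing:
        raise SystemExit("Missing system dependencies: " + ", ".join(missing))
```

Calling `preflight()` once at startup, before any document is loaded, turns a mid-run OCR failure into an immediate and explicit error.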