Commit faee5d5

feat: add advanced analysis, conversation, and synthesis capabilities
This commit introduces sophisticated analysis modules, conversation management, exploration engine, vision/document processors, QA validation, and synthesis capabilities for comprehensive document intelligence.

## Analysis Components (src/analyzers/)
- semantic_analyzer.py:
  * Semantic similarity analysis
  * Vector-based document comparison
  * Clustering and topic modeling
  * FAISS integration for efficient search
- dependency_analyzer.py:
  * Requirement dependency detection
  * Dependency graph construction
  * Circular dependency detection
  * Impact analysis
- consistency_checker.py:
  * Cross-document consistency validation
  * Contradiction detection
  * Terminology alignment
  * Quality scoring

## Conversation Management (src/conversation/)
- conversation_manager.py:
  * Multi-turn conversation handling
  * Context preservation across sessions
  * Provider-agnostic conversation API
  * Message history management
- context_tracker.py:
  * Conversation context tracking
  * Relevance scoring
  * Context window management
  * Smart context pruning

## Exploration Engine (src/exploration/)
- exploration_engine.py:
  * Interactive document exploration
  * Query-based navigation
  * Related content discovery
  * Insight generation

## Document Processors (src/processors/)
- vision_processor.py:
  * Image and diagram analysis
  * OCR integration
  * Visual element extraction
  * Layout understanding
- ai_document_processor.py:
  * AI-powered document enhancement
  * Smart content extraction
  * Multi-modal processing
  * Quality improvement

## QA and Validation (src/qa/)
- qa_validator.py:
  * Automated quality assurance
  * Requirement completeness checking
  * Validation rule engine
  * Quality metrics calculation
- test_generator.py:
  * Automatic test case generation
  * Requirement-to-test mapping
  * Coverage analysis
  * Test suite optimization

## Synthesis Capabilities (src/synthesis/)
- requirement_synthesizer.py:
  * Multi-document requirement synthesis
  * Duplicate detection and merging
  * Hierarchical organization
  * Consolidated output generation
- summary_generator.py:
  * Intelligent document summarization
  * Key point extraction
  * Executive summary creation
  * Configurable summary levels

## Key Features
1. **Semantic Analysis**: Vector-based similarity and clustering
2. **Dependency Tracking**: Automatic dependency graph construction
3. **Conversation AI**: Multi-turn context-aware interactions
4. **Vision Processing**: Image and diagram understanding
5. **Quality Assurance**: Automated validation and testing
6. **Smart Synthesis**: Multi-source requirement consolidation
7. **Exploration**: Interactive document navigation

## Integration Points
These components provide advanced capabilities for:
- Document understanding (analyzers + processors)
- Interactive workflows (conversation + exploration)
- Quality improvement (QA + validation)
- Content synthesis (synthesizers + summarizers)

Implements Phase 2 advanced intelligence and interaction capabilities.
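
The commit message lists circular dependency detection among the features of dependency_analyzer.py, but that file's source is not shown on this page. As an illustration only, the sketch below shows one common way to find a cycle in a requirement dependency graph: a depth-first search with three-state marking. The function name `find_cycle`, the dict-of-lists graph shape, and the `REQ-n` identifiers are assumptions made for this example, not the module's actual API.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of node ids, or None if acyclic.

    `graph` maps a requirement id to the ids it depends on (hypothetical shape).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so slice out the cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    requirements = {"REQ-1": ["REQ-2"], "REQ-2": ["REQ-3"], "REQ-3": ["REQ-1"]}
    print(find_cycle(requirements))
```

The recursive form keeps the sketch short; a very deep dependency chain could hit Python's recursion limit, in which case an explicit stack would be the safer choice.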
1 parent 08bd644 commit faee5d5

12 files changed, +1631 -70 lines changed

.env.example

Lines changed: 468 additions & 0 deletions
Large diffs are not rendered by default.

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -338,3 +338,7 @@ doc/codeDocs/parsers.rst
 doc/codeDocs/parsers.database.rst
 doc/codeDocs/utils.rst
 documentation-output/
+
+# External dependencies (now managed as pip packages, reference in oss/)
+requirements_agent/docling/
+.env

requirements-ai-processing.txt

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# Phase 2: AI/ML Processing Requirements
+# Advanced machine learning capabilities for document processing
+
+# Core ML/AI dependencies
+torch>=2.0.0
+transformers>=4.30.0
+sentence-transformers>=2.2.0
+datasets>=2.14.0
+
+# Computer Vision
+torchvision>=0.15.0
+Pillow>=9.5.0
+opencv-python>=4.8.0
+
+# NLP and Language Processing
+spacy>=3.6.0
+nltk>=3.8.0
+textblob>=0.17.1
+
+# Vector Operations and Embeddings
+numpy>=1.24.0
+faiss-cpu>=1.7.4 # For similarity search
+scikit-learn>=1.3.0
+
+# Advanced Document Understanding
+layoutparser>=0.3.4
+detectron2>=0.6 # For layout analysis
+
+# Optional GPU support (user can upgrade)
+# torch[cuda] - user should install manually if needed
+
+# Development and Testing (already in requirements-dev.txt)
+# pytest>=7.4.0
+# pytest-cov>=4.1.0
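
The file above pulls in faiss-cpu "For similarity search". The sketch below is not FAISS and makes no claim about how semantic_analyzer.py uses it. It is a brute-force cosine-similarity search in plain Python, meant only to show what a vector similarity query returns: the indices of the stored vectors ranked by closeness to the query. The names `cosine_similarity` and `top_k`, and the toy two-dimensional vectors, are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k stored vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings; real ones would come from an embedding model.
    stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.0], stored, k=2))
```

This linear scan compares the query against every stored vector, which is fine for small collections; an indexed library becomes worthwhile as the collection grows.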

requirements-dev.txt

Lines changed: 6 additions & 0 deletions
@@ -8,3 +8,9 @@ python-dotenv==1.0.1
 PyYAML==6.0.1
 types-PyYAML==6.0.12.20240917
 pre-commit==3.7.1
+google-generativeai>=0.3.0 # For Gemini LLM support
+
+# ML and monitoring dependencies
+scikit-learn>=1.3.0 # For ML-based tagging
+numpy>=1.24.0 # For numerical operations
+pandas>=2.0.0 # For data analysis (optional)
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Document Processing Dependencies - Phase 1 (Core)
+# Essential dependencies for basic document processing functionality
+
+# Core document processing
+docling>=1.0.0
+docling-core>=1.0.0
+
+# PDF processing
+PyPDF2>=3.0.0
+pdfplumber>=0.9.0
+
+# Office document support
+python-docx>=0.8.11
+python-pptx>=0.6.21
+
+# HTML/XML processing
+beautifulsoup4>=4.12.0
+lxml>=4.9.0
+
+# Image processing (for OCR)
+Pillow>=9.0.0
+
+# Text processing utilities
+markdown>=3.4.0
+chardet>=5.0.0
+
+# Optional: Advanced ML features (Phase 2)
+# Uncomment these for AI-enhanced document processing:
+# torch>=2.0.0
+# transformers>=4.30.0
+# sentence-transformers>=2.2.0
+# easyocr>=1.7.0
+# layoutparser>=0.3.4
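
The file above separates active pins from optional ones by commenting the optional lines out. As a rough illustration of that convention, the sketch below lists only the active requirements in such a file. It handles just the `>=` and `==` forms that appear in this commit, not the full pip requirement syntax, and the name `parse_requirements` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches "name>=1.2.3" or "name==1.2.3"; other specifier forms are ignored.
_REQ = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*(>=|==)\s*([0-9][0-9A-Za-z.]*)")


def parse_requirements(text: str) -> List[Tuple[str, str, str]]:
    """Return (name, operator, version) for each active requirement line."""
    parsed: List[Tuple[str, str, str]] = []
    for raw in text.splitlines():
        # Drop full-line and inline comments, so commented-out pins are skipped.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = _REQ.match(line)
        if match:
            parsed.append((match.group(1), match.group(2), match.group(3)))
    return parsed


if __name__ == "__main__":
    sample = (
        "# Core document processing\n"
        "docling>=1.0.0\n"
        "PyPDF2>=3.0.0\n"
        "# torch>=2.0.0\n"
    )
    for name, op, version in parse_requirements(sample):
        print(name, op, version)
```

For anything beyond a quick listing, a dedicated requirements parser is the more robust choice, since real files can contain extras, environment markers, and URLs.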

requirements-streamlit.txt

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+# Streamlit UI Dependencies
+# Install with: pip install -r requirements-streamlit.txt
+
+streamlit>=1.28.0
+markdown>=3.5.0
+pandas>=2.0.0
+pyyaml>=6.0.0
+
+# Optional for enhanced features
+plotly>=5.17.0 # For interactive charts

requirements.txt

Lines changed: 26 additions & 0 deletions
@@ -7,3 +7,29 @@ fastapi==0.117.1
 uvicorn==0.37.0
 psycopg2-binary==2.9.10
 PyYAML==6.0.2
+
+# Basic text processing
+requests>=2.28.0
+
+# Optional document processing (install with: pip install -r requirements-document-processing.txt)
+# docling>=1.0.0
+# PyPDF2>=3.0.0
+
+# Phase 3: Advanced LLM Integration (Optional)
+# Install these for full Phase 3 capabilities:
+
+# LLM Client Libraries (uncomment to enable)
+# openai>=1.0.0
+# anthropic>=0.3.0
+
+# Advanced NLP and ML (uncomment to enable)
+# sentence-transformers>=2.2.0
+# scikit-learn>=1.3.0
+# numpy>=1.24.0
+
+# Graph and Network Analysis (uncomment to enable)
+# networkx>=3.0
+
+# Note: Phase 3 components include graceful degradation
+# and will work with limited functionality if these
+# optional dependencies are not installed.
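
The closing note says the Phase 3 components degrade gracefully when optional dependencies are missing, but the code that does so is not part of this diff. One common pattern for this is sketched below, assuming a guarded import plus a plain-Python fallback. The helpers `optional_import` and `count_dependency_edges` are hypothetical names chosen for illustration; `networkx` is used because it appears in the commented-out list above.

```python
import importlib
from types import ModuleType
from typing import Iterable, Optional, Tuple


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# networkx is one of the commented-out optional dependencies in requirements.txt.
nx = optional_import("networkx")


def count_dependency_edges(edges: Iterable[Tuple[str, str]]) -> int:
    """Count unique directed edges, using networkx when it is available."""
    edge_list = list(edges)
    if nx is not None:
        graph = nx.DiGraph()
        graph.add_edges_from(edge_list)
        return graph.number_of_edges()
    # Fallback path: plain set arithmetic, no third-party packages needed.
    return len(set(edge_list))


if __name__ == "__main__":
    sample_edges = [("REQ-1", "REQ-2"), ("REQ-2", "REQ-3"), ("REQ-1", "REQ-2")]
    backend = "networkx" if nx is not None else "fallback"
    print(backend, count_dependency_edges(sample_edges))
```

Resolving the optional module once at import time keeps the feature check in a single place, so callers get the same result whether or not the extra package is installed.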

0 commit comments
