The Multi-Modal Academic Research System is a modular, scalable platform for collecting, processing, indexing, and querying academic content from multiple sources. It uses RAG (Retrieval-Augmented Generation) to answer research queries with citations to the retrieved sources.
┌─────────────────────────────────────────────────────────────────┐
│                         User Interfaces                         │
│  ┌─────────────────────┐        ┌──────────────────────────┐    │
│  │    Gradio Web UI    │        │  FastAPI Visualization   │    │
│  │     (Port 7860)     │        │  Dashboard (Port 8000)   │    │
│  └──────────┬──────────┘        └──────────┬───────────────┘    │
└─────────────┼──────────────────────────────┼────────────────────┘
              │                              │
              ▼                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Orchestration Layer                       │
│  ┌──────────────────────────┐  ┌──────────────────────────┐     │
│  │  Research Orchestrator   │  │     Citation Tracker     │     │
│  │   (LangChain + Gemini)   │  │  (Bibliography Export)   │     │
│  └────────────┬─────────────┘  └──────────────────────────┘     │
└───────────────┼─────────────────────────────────────────────────┘
                │
    ┌───────────┴───────────┬─────────────────┬──────────────┐
    ▼                       ▼                 ▼              ▼
┌──────────┐          ┌──────────┐      ┌──────────┐   ┌──────────┐
│OpenSearch│          │ Database │      │Collectors│   │Processors│
│  Index   │◄─────────│  SQLite  │◄─────│  Layer   │◄──│  Layer   │
│ (Vector  │          │(Tracking)│      │          │   │          │
│ Search)  │          │          │      │          │   │          │
└──────────┘          └──────────┘      └────┬─────┘   └──────────┘
                                             │
                          ┌──────────────────┼──────────────────┐
                          ▼                  ▼                  ▼
                    ┌──────────┐       ┌──────────┐       ┌──────────┐
                    │  ArXiv   │       │ YouTube  │       │ Podcasts │
                    │   API    │       │   API    │       │   RSS    │
                    └──────────┘       └──────────┘       └──────────┘
Purpose: Fetch raw academic content from external sources
Components:
- AcademicPaperCollector: Collects papers from ArXiv, PubMed Central, Semantic Scholar
- YouTubeLectureCollector: Collects educational videos with transcripts using yt-dlp
- PodcastCollector: Collects podcast episodes via RSS feeds
Key Features:
- Rate limiting to respect API quotas
- Automatic retry logic
- Metadata extraction
- Local file storage
Dependencies:
- External APIs (ArXiv, YouTube, RSS)
- File system for storage
- Database for tracking
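The rate limiting and retry behaviour listed above could look roughly like the following. This is a minimal sketch, not the collectors' actual code: `RateLimiter`, `fetch_with_retry`, and their parameters are hypothetical names, and exponential backoff is an assumed retry policy.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        # Sleep only for the part of the interval that has not yet elapsed.
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def fetch_with_retry(
    fetch: Callable[[], T],
    limiter: RateLimiter,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() under the rate limiter, retrying with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        limiter.wait()
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A collector would wrap each outbound request in `fetch_with_retry`, sharing one `RateLimiter` per external API so that retries also respect the quota.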
Purpose: Transform raw content into indexed, searchable format
Components:
- PDFProcessor: Extracts text and images from PDFs
  - Uses PyMuPDF for text extraction
  - Gemini Vision API for diagram analysis
- VideoProcessor: Analyzes video content
  - Uses Gemini for content analysis
  - Processes transcripts
Key Features:
- Multi-modal content extraction (text + visual)
- AI-powered diagram description
- Structured data output
Dependencies:
- Google Gemini API
- PyMuPDF, PyPDF libraries
- Sentence transformers
Purpose: Store and retrieve content efficiently
Component: OpenSearchManager
Key Features:
- Hybrid Search: Combines keyword (BM25) + semantic (kNN vector) search
- Embedding Generation: Uses SentenceTransformer (all-MiniLM-L6-v2, 384 dimensions)
- Schema Management: Creates and manages indices
- Bulk Operations: Efficient batch indexing
Search Strategy:
- Convert query to embedding vector
- Perform multi-match text search with field boosting
- Combine scores for ranking
- Return top-k results
Dependencies:
- OpenSearch (Docker or local instance)
- sentence-transformers library
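To make the search strategy above concrete, here is a sketch of a request body that pairs a boosted multi-match clause with a kNN clause. The function name, the field names (`title`, `abstract`, `content`, `embedding`), and the boost values are assumptions, not taken from OpenSearchManager. Placing both clauses in a `bool`/`should` makes OpenSearch add their scores, which is only one possible way to combine them.

```python
from typing import Any


def build_hybrid_query(query_text: str, query_vector: list[float], k: int = 10) -> dict[str, Any]:
    """Build a search body that combines BM25 text matching with kNN vector search."""
    return {
        "size": k,  # Return the top-k hits.
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": query_text,
                            # Hypothetical fields; "^n" boosts matches in that field.
                            "fields": ["title^3", "abstract^2", "content"],
                        }
                    },
                    {
                        # Hypothetical vector field holding the 384-dimensional embedding.
                        "knn": {"embedding": {"vector": query_vector, "k": k}}
                    },
                ]
            }
        },
    }
```

The query vector would come from the same SentenceTransformer model used at indexing time, and the returned dictionary would be passed as the `body` of a search request.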
Purpose: Track all collected data with metadata
Component: CollectionDatabaseManager
Schema:
- collections: Main table (id, type, title, source, url, status, indexed)
- papers: Paper-specific details (arxiv_id, abstract, authors, categories)
- videos: Video-specific details (video_id, channel, views, duration)
- podcasts: Podcast-specific details (episode_id, audio_url)
- collection_stats: Collection operation statistics
Key Features:
- Automatic tracking on collection
- Indexing status management
- Query history and statistics
- Metadata preservation
Dependencies:
- sqlite3 (part of the Python standard library)
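As an illustration of the schema and indexing-status tracking described above, the sketch below creates two of the tables and records a paper. Column types, defaults, the `collection_id` link, the status values, and the helper names `add_paper` and `mark_indexed` are all assumptions; only the table and column names listed in the schema come from this document.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS collections (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    type    TEXT NOT NULL,
    title   TEXT NOT NULL,
    source  TEXT,
    url     TEXT,
    status  TEXT DEFAULT 'collected',
    indexed INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS papers (
    collection_id INTEGER REFERENCES collections(id),
    arxiv_id      TEXT,
    abstract      TEXT,
    authors       TEXT,
    categories    TEXT
);
"""


def add_paper(conn: sqlite3.Connection, title: str, url: str, arxiv_id: str, abstract: str) -> int:
    """Insert a paper into both tables and return its collection id."""
    cur = conn.execute(
        "INSERT INTO collections (type, title, source, url) VALUES (?, ?, ?, ?)",
        ("paper", title, "arxiv", url),
    )
    collection_id = cur.lastrowid
    conn.execute(
        "INSERT INTO papers (collection_id, arxiv_id, abstract) VALUES (?, ?, ?)",
        (collection_id, arxiv_id, abstract),
    )
    conn.commit()
    return collection_id


def mark_indexed(conn: sqlite3.Connection, collection_id: int) -> None:
    """Flag a collection as indexed once it has been written to the search index."""
    conn.execute(
        "UPDATE collections SET indexed = 1, status = 'indexed' WHERE id = ?",
        (collection_id,),
    )
    conn.commit()
```

Every statement uses `?` placeholders rather than string formatting, so values are always passed to SQLite as parameters.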
Purpose: Coordinate research queries and generate responses
Components:
a) ResearchOrchestrator:
- Processes user queries
- Retrieves relevant documents from OpenSearch
- Uses LangChain + Gemini to generate responses
- Maintains conversation memory
- Provides source citations
Query Flow:
User Query → Hybrid Search → Document Retrieval → Context Formation
↓
Gemini LLM Processing → Response Generation → Citation Extraction
↓
Formatted Response + Sources
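The "Context Formation" step in the flow above can be sketched as follows. The helper names, the document keys (`title`, `content`, `url`), the character limit, and the prompt wording are all assumptions made for illustration. The real orchestrator builds its prompt through LangChain.

```python
def build_context(documents: list[dict], max_chars: int = 1000) -> tuple[str, list[dict]]:
    """Number each retrieved document and join them into one context string."""
    blocks, sources = [], []
    for number, doc in enumerate(documents, start=1):
        snippet = doc["content"][:max_chars]  # Truncate long documents.
        blocks.append(f"[{number}] {doc['title']}\n{snippet}")
        sources.append({"number": number, "title": doc["title"], "url": doc.get("url", "")})
    return "\n\n".join(blocks), sources


def build_prompt(question: str, context: str) -> str:
    """Wrap the numbered context and the user question in a citation-aware prompt."""
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

Numbering the sources in the prompt gives the model a stable handle to cite, which is what makes the later citation extraction step possible.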
b) CitationTracker:
- Extracts citations from LLM responses
- Matches citations to source documents
- Tracks citation usage statistics
- Exports bibliographies (BibTeX, APA, JSON)
Dependencies:
- LangChain framework
- Google Gemini API
- OpenSearchManager
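A rough sketch of the two CitationTracker responsibilities that are easiest to show in code, extraction and BibTeX export, is given below. It assumes the LLM cites sources with bracketed numbers such as `[2]`. The function names, the source dictionary keys, and the choice of the `@misc` entry type are hypothetical.

```python
import re


def extract_citation_numbers(response: str) -> list[int]:
    """Return cited source numbers in order of first appearance, without repeats."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", response):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def to_bibtex(key: str, source: dict) -> str:
    """Format one source as a BibTeX @misc entry, skipping empty fields."""
    fields = {
        "title": source["title"],
        "author": " and ".join(source.get("authors", [])),
        "year": str(source.get("year", "")),
        "url": source.get("url", ""),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@misc{{{key},\n{body}\n}}"
```

The extracted numbers can then be matched against the numbered source list that was used to build the prompt.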
Purpose: Provide REST API and visualization interface
Component: FastAPI Server
Endpoints:
- GET /api/collections: List all collections
- GET /api/collections/{id}: Get collection details
- GET /api/statistics: Database statistics
- GET /api/search: Search collections
- GET /viz: Interactive visualization dashboard
- GET /docs: Auto-generated API docs
Key Features:
- CORS-enabled for web access
- Pagination support
- Real-time statistics
- Interactive HTML dashboard
Dependencies:
- FastAPI framework
- Uvicorn ASGI server
- Database layer
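The pagination behind an endpoint such as `GET /api/collections` could be backed by a helper like the one below. This is a sketch under assumptions: the function name, the `page`/`page_size` parameters, the upper limit of 100, and the response shape are hypothetical, and the FastAPI route that would call it is omitted.

```python
import sqlite3


def list_collections(conn: sqlite3.Connection, page: int = 1, page_size: int = 20) -> dict:
    """Return one page of collections plus the total row count."""
    # Clamp inputs so a bad query string cannot request a negative or huge page.
    page = max(page, 1)
    page_size = min(max(page_size, 1), 100)
    total = conn.execute("SELECT COUNT(*) FROM collections").fetchone()[0]
    rows = conn.execute(
        "SELECT id, type, title FROM collections ORDER BY id LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    ).fetchall()
    items = [{"id": row[0], "type": row[1], "title": row[2]} for row in rows]
    return {"page": page, "page_size": page_size, "total": total, "items": items}
```

Returning `total` alongside the items lets a dashboard render page controls without issuing a second request.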
Purpose: Provide user-friendly web interface
Component: ResearchAssistantUI (Gradio)
Tabs:
- Research: Query system, view answers with citations
- Data Collection: Collect papers/videos/podcasts
- Citation Manager: View and export citations
- Settings: Configure OpenSearch, API keys
- Data Visualization: View collection statistics
Key Features:
- Real-time collection status
- Citation formatting
- Statistics dashboard
- Index management
Dependencies:
- Gradio framework
- All backend modules
1. User initiates collection (via Gradio UI)
↓
2. Collector fetches data from external API
↓
3. Raw data saved to local storage (data/papers, data/videos, data/podcasts)
↓
4. Metadata recorded in SQLite database
↓
5. Processor extracts and structures content
↓
6. Document indexed in OpenSearch with embeddings
↓
7. Collection marked as "indexed" in database
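Steps 3 to 7 of the collection flow above can be sketched as a single function that receives each stage as a callable. The name `ingest` and its keyword parameters are hypothetical. The error handling is also an assumption: it shows one way to leave an item recorded but not indexed when processing or indexing fails.

```python
from typing import Any, Callable


def ingest(
    item: dict,
    *,
    save_raw: Callable[[dict], str],
    record: Callable[[dict, str], int],
    process: Callable[[str], Any],
    index: Callable[[int, Any], None],
    mark_indexed: Callable[[int], None],
) -> tuple[int, bool]:
    """Run one collected item through storage, tracking, processing, and indexing."""
    path = save_raw(item)               # Step 3: raw data to local storage.
    collection_id = record(item, path)  # Step 4: metadata into the database.
    try:
        document = process(path)        # Step 5: extract and structure content.
        index(collection_id, document)  # Step 6: index with embeddings.
    except Exception:
        # Metadata is already recorded, so the item can be re-indexed later.
        return collection_id, False
    mark_indexed(collection_id)         # Step 7: flag as indexed.
    return collection_id, True
```

Recording metadata before processing means a failed indexing attempt still leaves a trace in the database.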
1. User submits research query (via Gradio UI)
↓
2. ResearchOrchestrator receives query
↓
3. Hybrid search performed on OpenSearch
↓
4. Top-k relevant documents retrieved
↓
5. Documents formatted as context for LLM
↓
6. Gemini generates response with citations
↓
7. Citations extracted and tracked
↓
8. Response displayed to user with sources
1. User accesses visualization dashboard
↓
2. FastAPI endpoint queries SQLite database
↓
3. Statistics aggregated and formatted
↓
4. JSON response sent to frontend
↓
5. JavaScript renders interactive charts/tables
Each component has a single, well-defined responsibility and can operate independently.
- Collection ≠ Processing ≠ Indexing
- Each layer has distinct interfaces
- Bulk operations for efficiency
- Stateless API design
- Database-backed persistence
- Plugin architecture for new collectors
- Configurable search strategies
- Customizable UI tabs
- Graceful degradation (e.g., works without OpenSearch)
- Comprehensive error handling
- Detailed logging
- Local OpenSearch (Docker)
- SQLite (built-in)
- Free APIs (ArXiv, YouTube)
- Open-source libraries
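One common way to realise the plugin architecture for new collectors listed above is a small registry with a shared base class. The sketch below is illustrative only: `BaseCollector`, `register_collector`, `COLLECTORS`, and `ExampleCollector` are hypothetical names, and the existing collectors may be wired together differently.

```python
from abc import ABC, abstractmethod

# Maps a source name to the collector class that handles it.
COLLECTORS: dict[str, type] = {}


def register_collector(name: str):
    """Class decorator that adds a collector to the registry under the given name."""
    def decorator(cls: type) -> type:
        COLLECTORS[name] = cls
        return cls
    return decorator


class BaseCollector(ABC):
    """Interface every collector plugin is expected to implement."""

    @abstractmethod
    def collect(self, query: str, max_results: int) -> list[dict]:
        """Fetch up to max_results items matching the query."""


@register_collector("example")
class ExampleCollector(BaseCollector):
    """Stand-in collector that fabricates results instead of calling an API."""

    def collect(self, query: str, max_results: int) -> list[dict]:
        return [
            {"title": f"{query} result {i}", "source": "example"}
            for i in range(1, max_results + 1)
        ]
```

With this layout, adding a source means writing one decorated class. Calling code looks the collector up by name in `COLLECTORS` and never needs to import it directly.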
| Component | Technology | Rationale |
|---|---|---|
| Search Engine | OpenSearch | Open-source, supports vector search, BM25 |
| Vector Embeddings | SentenceTransformers | Fast, local, no API costs |
| LLM | Google Gemini | Free tier, multimodal, good quality |
| Database | SQLite | Zero-config, serverless, embedded |
| API Framework | FastAPI | Fast, auto-docs, type hints |
| UI Framework | Gradio | Rapid prototyping, Python-native |
| PDF Processing | PyMuPDF | Fast, accurate text extraction |
| Video Metadata | yt-dlp | Robust, actively maintained |
- API Keys: Stored in a .env file, never committed to git
- OpenSearch: Local-only deployment by default
- CORS: Configured for localhost only in production
- Input Validation: All API endpoints validate inputs
- SQL Injection: Prevented via parameterized queries
- Indexing Speed: ~10-50 documents/second (bulk)
- Query Latency: ~1-3 seconds (including LLM)
- Embedding Generation: ~50ms per document
- Database Queries: <10ms for most operations
- Storage: ~1MB per paper (PDF + metadata + embeddings)
- Distributed Search: Multi-node OpenSearch cluster
- Caching Layer: Redis for frequently accessed data
- Background Workers: Celery for async processing
- Advanced Analytics: Time-series analysis of collection trends
- Collaborative Features: Shared collections, annotations
- Mobile Interface: Responsive UI for mobile devices