A document-based Retrieval-Augmented Generation (RAG) system built with ChromaDB, Google's Gemini 2.0 Flash, and Streamlit. Upload PDF documents and ask questions that are answered strictly from their content: the model is restricted to the retrieved passages and says "I don't know" when the document does not cover the question.
- 📁 PDF Document Upload: Extract text from PDF files automatically
- 🔍 Smart Text Chunking: Intelligently splits documents for optimal retrieval
- 🎯 Semantic Search: Find relevant document sections using vector similarity
- 🤖 Anti-Hallucination AI: Responses based only on document content
- 📖 Source Attribution: See exactly which document sections were used
- 🗑️ Database Management: Clear and manage your document collection
- ⚡ Real-time Processing: Fast document processing and query responses
- Python 3.8+
- Gemini API key from Google AI Studio
- Clone the repository:
git clone <your-repo-url>
cd "RAG DEMO"
- Install required packages:
pip install -r requirements.txt
- Set up environment variables:
Create a .env file in the root directory:
GEMINI_API_KEY=your_gemini_api_key_here
CHROMA_API_KEY=your_chroma_api_key_here
CHROMA_TENANT=your_chroma_tenant_id
CHROMA_DATABASE=your_database_name
- Run the application:
streamlit run app.py
- Click "Choose a PDF file" to upload your document
- Wait for processing (text extraction + chunking)
- See confirmation with chunk count and preview
- Type your question in the text input field
- Click "🔍 Get Answer" to search and generate response
- Review the AI answer and source document sections
- Use the sidebar to see document chunk count
- Click "🗑️ Clear Database" to remove all documents
- Upload multiple documents for cross-document queries
- AI responds only based on uploaded document content
- Says "I don't know" when information isn't available
- No external knowledge or made-up information
- Shows relevant document sections used for answers
- Displays filename and chunk information
- Enables fact-checking and verification
- Uses semantic similarity for relevant content discovery
- Configurable chunk size (1000 chars) with overlap (200 chars)
- Retrieves top 5 most relevant sections per query
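The chunk size and overlap settings above can be illustrated with a minimal sketch. This is not the LangChain splitter the app uses; it is a plain fixed-window splitter, and the function name `chunk_text` is hypothetical. It only shows how a 1000-character window with a 200-character overlap advances through a document.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # each new window starts this far after the previous one
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(2500))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Because consecutive windows share 200 characters, a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the reason the overlap is described as preventing information loss.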
PDF Upload → Text Extraction → Text Chunking → Vector Embeddings → ChromaDB Storage
↓
User Query → Semantic Search → Relevant Chunks → Context Building → Gemini 2.0 → Response
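The query side of the flow above can be sketched without any external services. In the app, the similarity search is delegated to ChromaDB and the answer comes from Gemini; the sketch below replaces both with plain Python so the steps are visible. The names `cosine_similarity`, `top_k`, and `build_prompt`, the toy vectors, and the prompt wording are all assumptions for illustration, and cosine similarity is only one common choice of metric.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]], k: int = 5) -> List[int]:
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, sources: List[Tuple[str, int, str]]) -> str:
    """Assemble a grounded prompt from (filename, chunk index, text) tuples."""
    context = "\n\n".join(
        f"[{filename}, chunk {idx}]\n{text}" for filename, idx, text in sources
    )
    return (
        "Answer the question using only the context below. "
        'If the context does not contain the answer, reply "I don\'t know".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
    chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    chunk_texts = ["Refunds take 14 days.", "Office hours are 9 to 5.", "Refunds need a receipt."]
    best = top_k([1.0, 0.0], chunk_vecs, k=2)
    sources = [("policy.pdf", i, chunk_texts[i]) for i in best]
    print(build_prompt("How long do refunds take?", sources))
```

Labelling each passage with its filename and chunk index in the prompt is also what makes source attribution possible: the same tuples can be shown to the user next to the answer.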
- Streamlit: Web interface and user experience
- ChromaDB: Vector database for document storage
- Google Gemini 2.0 Flash: Large language model for response generation
- PyPDF2: PDF text extraction
- LangChain: Text splitting and chunking utilities
RAG DEMO/
├── app.py # Main Streamlit application
├── requirements.txt # Python dependencies
├── .env # Environment variables (not in git)
├── .gitignore # Git ignore rules
└── README.md # This file
- GEMINI_API_KEY: Your Google AI Studio API key
- CHROMA_API_KEY: ChromaDB cloud API key (optional)
- CHROMA_TENANT: ChromaDB tenant ID (optional)
- CHROMA_DATABASE: ChromaDB database name (optional)
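A small sketch of how these variables might be validated at startup. It assumes the values are already present in the process environment (for example, loaded from the .env file by a package such as python-dotenv); the function name `load_settings` and the error message are hypothetical and not taken from app.py.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED = ("GEMINI_API_KEY",)
OPTIONAL = ("CHROMA_API_KEY", "CHROMA_TENANT", "CHROMA_DATABASE")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Read the variables listed above; fail early if a required one is missing."""
    env = os.environ if env is None else env

    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    settings: Dict[str, Optional[str]] = {name: env[name] for name in REQUIRED}
    # Optional values are normalised to None when unset or empty.
    settings.update({name: env.get(name) or None for name in OPTIONAL})
    return settings


if __name__ == "__main__":
    try:
        print(sorted(load_settings()))
    except RuntimeError as exc:
        print(exc)
```

Failing at startup with the name of the missing variable tends to be easier to diagnose than an authentication error raised later by the first API call.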
- Chunk Size: 1000 characters (adjustable in code)
- Chunk Overlap: 200 characters (prevents information loss)
- Retrieval Count: 5 most relevant chunks per query
- API keys stored in environment variables
- .env file excluded from version control
- No sensitive data in code repository
- Local ChromaDB instance for privacy
- PDF Only: Currently supports PDF documents only
- Text-based: Cannot process images, tables, or complex layouts
- Memory Storage: Uses in-memory ChromaDB (data lost on restart)
- Single Session: No persistent user sessions or multi-user support
- Support for Word documents, text files, and web pages
- Persistent ChromaDB storage
- Multi-user authentication and sessions
- Advanced document preprocessing (tables, images)
- Conversation history and follow-up questions
- Document comparison and analysis features
This project is open source and available under the MIT License.
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
If you encounter any issues or have questions:
- Check the troubleshooting section in the app
- Review error messages in the Streamlit interface
- Ensure your API keys are correctly configured
- Verify PDF files contain extractable text
Happy Document Querying! 🎉
