A production-ready Retrieval-Augmented Generation (RAG) system that combines vector search with LLM capabilities to answer questions from your documents.
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances LLM responses by retrieving relevant context from a knowledge base before generating an answer. Grounding responses in retrieved documents reduces hallucinations and lets the model answer questions about your private documents.
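The core loop can be sketched in a few lines of JavaScript. This is a toy illustration only: word-overlap scoring stands in for real embeddings, and `score`, `retrieve`, and `buildPrompt` are hypothetical names, not part of this project.

```javascript
// Toy sketch of the RAG loop: retrieve relevant text, then prepend it to the prompt.
// Real systems use learned embeddings; here naive word overlap stands in for similarity.
const docs = [
  "Transformers use self-attention to process sequences.",
  "Gradient descent minimizes a loss function iteratively.",
  "Pinecone stores vectors for fast similarity search.",
];

// Score a document by how many query words it contains (stand-in for cosine similarity).
function score(query, doc) {
  const words = new Set(query.toLowerCase().split(/\W+/));
  return doc.toLowerCase().split(/\W+/).filter((w) => words.has(w)).length;
}

// Retrieve the top-k documents for a query.
function retrieve(query, k = 2) {
  return docs
    .map((doc) => ({ doc, score: score(query, doc) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Build the augmented prompt an LLM would receive.
function buildPrompt(query) {
  const context = retrieve(query).map((r) => r.doc).join("\n");
  return `Context:\n${context}\n\nQuestion: ${query}`;
}
```

In a real system, `retrieve` becomes a vector-database query and `buildPrompt` becomes the system/context message sent to the LLM.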
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAG ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
📄 Document 🔍 Query 💬 Response
│ │ ▲
▼ ▼ │
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Chunking │ │ Embed │ │ LLM │
│ & Embed │ │ Query │ │ Generate │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
▼ ▼ │
┌─────────────────────────────────────────────┐ │
│ 🗄️ PINECONE VECTOR DATABASE │◄───────────────────┘
│ │ Retrieved
│ [████] doc-1 similarity: 0.92 │ Context
│ [████] doc-2 similarity: 0.87 │
│ [████] doc-3 similarity: 0.81 │
└─────────────────────────────────────────────┘
| Feature | Description |
|---|---|
| 📤 Document Upload | Upload PDFs and text files via web UI |
| 🔍 Semantic Search | Find relevant content using vector similarity |
| 🎯 Reranking | Improve search accuracy with BGE reranker |
| 💬 Chat Interface | Modern, responsive chat UI |
| 🧠 LLM Integration | Groq's Llama 3.3 70B for fast responses |
| 📊 Context Display | View retrieved sources for transparency |
| 🔄 Session Memory | Multi-turn conversations with context |
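The session-memory feature can be illustrated with a minimal in-memory store. This is a sketch only; `appendTurn`, `MAX_TURNS`, and the Map-based layout are assumptions, not this project's actual implementation.

```javascript
// Minimal in-memory session store for multi-turn chat (hypothetical sketch;
// the actual server may structure history differently).
const MAX_TURNS = 10; // keep only the most recent messages

const sessions = new Map(); // sessionId -> [{ role, content }, ...]

function appendTurn(sessionId, role, content) {
  const history = sessions.get(sessionId) ?? [];
  history.push({ role, content });
  // Trim the oldest messages so the LLM context window stays bounded.
  while (history.length > MAX_TURNS) history.shift();
  sessions.set(sessionId, history);
  return history;
}
```

A bounded history like this is what lets the `sessionId` field in `/api/chat` carry context across turns without the prompt growing indefinitely.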
┌──────────────────────────────────────────────────────────────────────────────┐
│ SYSTEM OVERVIEW │
└──────────────────────────────────────────────────────────────────────────────┘
┌─────────────┐
│ Browser │
│ (Chat UI) │
└──────┬──────┘
│
HTTP POST /api/chat, /api/ingest
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ EXPRESS.JS SERVER │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ /api/chat │ │ /api/ingest │ │ /api/health │ │
│ │ │ │ │ │ │ │
│ │ • Receive msg │ │ • Upload file │ │ • Index stats │ │
│ │ • RAG search │ │ • Chunk text │ │ • Health check │ │
│ │ • LLM call │ │ • Store embeds │ │ │ │
│ └────────┬────────┘ └────────┬────────┘ └─────────────────┘ │
└───────────┼─────────────────────┼────────────────────────────────────────────┘
│ │
▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ PINECONE │ │ PDF LOADER │ │ GROQ │
│ Vector Store │ │ Text Splitter │ │ LLM (Llama 3) │
│ │ │ │ │ │
│ • Store vectors │ │ • Parse PDFs │ │ • Generate answer │
│ • Semantic search │ │ • Chunk @ 500 │ │ • Tool calling │
│ • BGE reranking │ │ • 100 overlap │ │ • Fast inference │
└───────────────────┘ └───────────────────┘ └───────────────────┘
Prerequisites:

- Node.js 18+ installed
- Pinecone account (free tier works)
- Groq account (free tier works)
```bash
git clone <your-repo-url>
cd RAG
npm install
```

Create a `.env` file:

```bash
# Pinecone - Get from https://console.pinecone.io
PINECONE_API_KEY=pcsk_xxxxxxxxxxxxx

# Groq - Get from https://console.groq.com
GROQ_API_KEY=gsk_xxxxxxxxxxxxx

# Optional
OPENAI_API_KEY=sk-xxxxxxxxxxxxx
```

The index uses integrated embeddings (Pinecone generates embeddings automatically):

```bash
# First time only - creates index with llama-text-embed-v2 model
npm run dev
```

Or create it manually via the Pinecone CLI:

```bash
pc index create -n rag-embedded-index -m cosine -c aws -r us-east-1 \
  --model llama-text-embed-v2 --field_map text=content
```

Start the server:

```bash
npm start
# or for development with hot reload:
npm run server
```

Navigate to http://localhost:3000 and start chatting!
```
RAG/
├── 📄 server.js          # Express server with API endpoints
├── 📄 index.js           # Document ingestion utilities
├── 📁 public/
│   └── 📄 index.html     # Chat interface (single-page app)
├── 📁 data/              # Sample documents
├── 📁 uploads/           # Temporary upload storage
├── 📄 package.json       # Dependencies
├── 📄 .env               # Environment variables
└── 📄 README.md          # You are here!
```
```http
POST /api/chat
Content-Type: application/json

{
  "message": "What are the key ML concepts?",
  "sessionId": "optional-session-id"
}
```

Response:

```json
{
  "response": "Based on the knowledge base, key ML concepts include...",
  "toolsUsed": ["rag_search"],
  "context": "[Source 1] (Score: 0.92)\nML basics include..."
}
```

```http
POST /api/ingest
Content-Type: multipart/form-data

file: <PDF or TXT file>
```

Response:

```json
{
  "success": true,
  "message": "Successfully ingested document.pdf",
  "chunksCreated": 15,
  "totalRecords": 24
}
```

```http
GET /api/health
```

Response:

```json
{
  "status": "ok",
  "index": "rag-embedded-index",
  "records": 24
}
```

┌─────────────────────────────────────────────────────────────────────────────┐
│ DOCUMENT INGESTION PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
📄 PDF/TXT File
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │───▶│ Parse │───▶│ Chunk │───▶│ Store │
│ (Multer) │ │ (PDFLoader) │ │ (500 chars) │ │ (Pinecone) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────────┐
│ Chunk 1: "..." │
│ Chunk 2: "..." │
│ Chunk 3: "..." │
│ ... │
└─────────────────┘
│
▼
┌─────────────────┐
│ Pinecone │
│ Auto-Embeds │
│ (llama-text) │
└─────────────────┘
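The 500-character chunks with 100-character overlap can be approximated without LangChain. The sketch below is a naive character slicer; the project's actual `RecursiveCharacterTextSplitter` additionally prefers breaking on paragraph and sentence boundaries.

```javascript
// Naive fixed-size chunker: 500-char windows advancing by 400 chars,
// giving 100 chars of overlap between consecutive chunks.
function chunkText(text, chunkSize = 500, overlap = 100) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which keeps retrieval from missing context that straddles two chunks.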
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUERY PROCESSING FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
👤 User: "What is transformer architecture?"
│
▼
┌─────────────┐
│ Groq LLM │──── Decides to call rag_search tool
└──────┬──────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PINECONE SEARCH │
│ │
│ Query: "transformer architecture" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SEMANTIC SEARCH (Top 6) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ BGE RERANKER (Top 3) │ │
│ │ │ │
│ │ #1 [0.92] "Transformer architecture uses attention..." │ │
│ │ #2 [0.87] "Attention mechanism allows the model..." │ │
│ │ #3 [0.81] "Tokenization is the process of..." │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────┐
│ Groq LLM │──── Generates answer using retrieved context
└──────┬──────┘
│
▼
💬 "Transformer architecture is a neural network design that uses
self-attention mechanisms to process sequences..."
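After reranking, the server flattens the top hits into the context string shown in the `/api/chat` response. A sketch of that formatting step follows; the flat hit shape `{ score, content }` is an assumption, since the real Pinecone response nests these fields.

```javascript
// Format reranked hits into the "[Source N] (Score: 0.92)" context string
// returned by /api/chat. The { score, content } shape is assumed for illustration.
function formatContext(hits) {
  return hits
    .map((hit, i) => `[Source ${i + 1}] (Score: ${hit.score.toFixed(2)})\n${hit.content}`)
    .join("\n\n");
}
```

Exposing scores alongside each source is what powers the Context Display feature: users can see exactly which chunks, and how confidently, backed the answer.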
| Category | Technology | Purpose |
|---|---|---|
| Runtime | Node.js 18+ | JavaScript runtime |
| Server | Express 5.x | HTTP server & routing |
| Vector DB | Pinecone | Vector storage & search |
| Embeddings | llama-text-embed-v2 | Text to vectors (integrated) |
| Reranker | bge-reranker-v2-m3 | Result reranking |
| LLM | Groq (Llama 3.3 70B) | Response generation |
| PDF Parsing | LangChain PDFLoader | Document extraction |
| Chunking | RecursiveCharacterTextSplitter | Text segmentation |
| File Upload | Multer | Multipart form handling |
| Metric | Value | Notes |
|---|---|---|
| Embedding Dimension | 1024 | llama-text-embed-v2 |
| Chunk Size | 500 chars | With 100 char overlap |
| Search + Rerank | ~200ms | Pinecone serverless |
| LLM Response | ~1-3s | Groq inference |
| Max Upload | ~10MB | PDF/TXT files |
Modify in `server.js`:

```javascript
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,    // Characters per chunk
  chunkOverlap: 100  // Overlap between chunks
});
```

Search and reranking parameters:

```javascript
const results = await index.namespace(NAMESPACE).searchRecords({
  query: {
    topK: 6,                 // Initial candidates
    inputs: { text: query }
  },
  rerank: {
    model: "bge-reranker-v2-m3",
    topN: 3,                 // Final results after reranking
    rankFields: ["content"]
  }
});
```

Planned enhancements:

- Multi-file batch upload - Upload multiple documents at once
- Document management - Delete/update specific documents
- Namespace support - Separate knowledge bases per user/topic
- Streaming responses - Real-time token streaming
- Authentication - User login and access control
- Analytics dashboard - Query logs and usage metrics
- Hybrid search - Combine semantic + keyword search
To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing`)
- Open a Pull Request
MIT License - feel free to use this project for learning or production!
- Pinecone for vector database infrastructure
- Groq for blazing-fast LLM inference
- LangChain for document processing utilities
Built with ❤️ for the AI Engineering community
