A Retrieval-Augmented Generation (RAG) chatbot designed to help students study Java and Object-Oriented Programming (OOP) topics. The application provides intelligent document processing, visual content analysis, and contextual question-answering using course PDFs and generated Q&A pairs.
- RAG-powered Knowledge Base: Index PDFs and generated Q&A into a persistent ChromaDB vector store
- Interactive Streamlit UI: Clean, modern chat interface with quick action buttons and styled chat bubbles
- Advanced PDF Processing:
  - Extracts text and structured tables
  - Detects and analyzes visual elements (diagrams, tables, images)
  - Uses GPT-4 Vision to describe visual content
  - Smart content chunking for optimal retrieval
- Conversation Memory: Maintains chat history and context through LangChain's `ConversationBufferMemory`
- Session-based Document Upload: Upload and process PDFs during chat sessions with deduplication
- Vector Search: Semantic search over embedded documents using OpenAI embeddings
- LangSmith Tracing: Trace RAG calls for debugging and evaluation
- Prerequisites
- Installation
- Configuration
- Project Structure
- Usage
- Architecture
- Troubleshooting
- Security
- Python: 3.10 or higher
- OpenAI API Key: Required for embeddings and LLM responses
- LangSmith Account (Optional): For tracing and monitoring RAG calls
```bash
git clone <repository-url>
cd Java-TA-Chatbot
python3 -m venv .venv
```

Activate the virtual environment:

- On macOS/Linux:

  ```bash
  source .venv/bin/activate
  ```

- On Windows:

  ```bash
  .venv\Scripts\activate
  ```
```bash
python3 -m pip install --upgrade pip
python3 -m pip install pymupdf openai chromadb tqdm nltk tiktoken python-docx langchain langsmith langchain_openai moviepy streamlit python-dotenv pillow
```

Note: If you encounter SSL errors on macOS (common with NLTK), the notebook includes SSL workarounds and will download the required NLTK data automatically.
Create a .env file in the project root directory:
```
# Required: OpenAI API Key
OPENAI_API_KEY=your_openai_api_key_here

# Optional: LangSmith/LangChain Tracing
LANGCHAIN_API_KEY=your_langsmith_api_key_here
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_PROJECT=Java TA Chatbot
LANGCHAIN_TRACING_V2=true
```

Important Notes:
- The application uses `python-dotenv` to load environment variables via `load_dotenv()`
- Never commit your `.env` file or API keys to version control
- LangSmith variables are optional and can be omitted if you don't need tracing
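As a sketch of how this loading pattern works (the `require_key` helper below is illustrative, not part of the codebase):

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv
    load_dotenv()  # reads .env from the current working directory
except ImportError:
    pass  # fall back to variables already exported in the shell

def require_key(name: str = "OPENAI_API_KEY") -> str:
    """Fetch a required key, failing fast with a clear message (hypothetical helper)."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} not found; add it to your .env file")
    return value
```

Failing fast like this turns a missing key into a clear startup error instead of an opaque API failure later.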
```
Java-TA-Chatbot/
├── app.py                  # Main Streamlit application (chat interface + PDF processing)
├── frontend.py             # UI styling and sidebar components
├── helper.py               # PDF processing utilities (visual detection, chunking)
├── rag_pipeline.py         # RAG pipeline implementation (embedding, retrieval, generation)
├── java_ta.ipynb           # Jupyter notebook for knowledge base preparation
├── knowledge_base_data/    # Directory for course materials
│   ├── course_info/        # Course information PDFs
│   ├── exercises/          # Exercise sheets and solutions
│   ├── lecture_notes/      # Lecture slides and notes
│   ├── past_papers/        # Past exam papers
│   └── videos/             # Video materials (if any)
├── chroma_db/              # Persistent ChromaDB vector store (auto-created)
├── .env                    # Environment variables (not committed)
└── README.md               # This file
```
The knowledge base must be populated before using the chat interface. Use the `java_ta.ipynb` notebook to ingest documents into ChromaDB.
- Place PDFs: Copy past exam papers into `knowledge_base_data/past_papers/`
- Run Notebook Cells: Execute the relevant cells in `java_ta.ipynb` to:
  - Extract exam-style questions from PDFs (detects pages with marks or numbered lists)
  - Optionally attach page images for visual context
  - Generate clear answers using GPT-4o
  - Chunk question/answer text using an overlap-aware sentence tokenizer
  - Embed chunks with `text-embedding-3-small`
  - Store in the persistent `knowledge-base6` collection in ChromaDB
Relevant cells (may vary):
- Text/image extraction, question parsing: Cells ~10-16
- Answer generation and embedding: Cells ~14, 20
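The overlap-aware chunking step can be sketched roughly as follows. This is a simplified stand-in: the notebook uses NLTK's sentence tokenizer and token-based sizing, while this version splits on punctuation and counts characters.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (stand-in for nltk.sent_tokenize)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_with_overlap(text: str, max_chars: int = 500, overlap_sents: int = 1) -> list[str]:
    """Group sentences into chunks up to max_chars, repeating the last
    overlap_sents sentences at the start of the next chunk for context."""
    sents = split_sentences(text)
    chunks, current, size = [], [], 0
    for sent in sents:
        if current and size + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # carry overlap into next chunk
            size = sum(len(s) for s in current)
        current.append(sent)
        size += len(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The overlap means a sentence at a chunk boundary still appears with its neighbors in at least one chunk, which helps retrieval return coherent context.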
- Place PDFs: Copy PDFs to index in your working directory (or adjust paths in the notebook)
- Run Generic Ingestion Cells: Execute cells around Cell 22 to:
  - Extract raw text from PDFs
  - Chunk text intelligently
  - Embed with `text-embedding-3-small`
  - Store in `knowledge-base6`
Note: ChromaDB storage persists under `./chroma_db` and can be reused across application runs.
Start the Streamlit application from the project root:
```bash
streamlit run app.py
```

The application will automatically open in your default web browser, typically at http://localhost:8501.
- Sidebar:
  - Upload PDF files for session-based processing
  - Quick action buttons for common Java/OOP queries:
    - "Explain Inheritance"
    - "Show UML Example"
    - "List OOP Principles"
- Main Chat Interface:
  - Enter questions in the chat input at the bottom
  - The application retrieves relevant chunks from the knowledge base
  - Responses include contextual information from course materials
- PDF Processing:
  - Uploaded PDFs are processed with:
    - Text extraction
    - Table detection and extraction
    - Visual element analysis (diagrams, images)
    - Smart chunking for optimal context retrieval
  - Processed content is available as additional context for the current session
  - Duplicate uploads are detected and prevented
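Duplicate detection can be done by hashing the uploaded bytes; a minimal sketch (function names are illustrative, not the app's actual API):

```python
import hashlib

def file_fingerprint(data: bytes) -> str:
    """Stable content hash for an uploaded file."""
    return hashlib.sha256(data).hexdigest()

def register_upload(seen: set, data: bytes) -> bool:
    """Return True if the file is new (and record it), False if it is a duplicate."""
    digest = file_fingerprint(data)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

In Streamlit, `seen` would typically live in `st.session_state` so it survives reruns within a session.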
The application follows this flow for each query:
- Embedding: Create an embedding for the user query using `text-embedding-3-small`
- Retrieval: Retrieve the top-k (default: 5) relevant chunks from ChromaDB
- Context Assembly: Build a prompt including:
  - Retrieved chunks from the knowledge base
  - Conversation history from memory
  - Optional: session-uploaded PDF content
- Generation: Generate the answer using GPT-4o with context-aware prompting
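In code, one query follows roughly this shape. This is a sketch under assumptions: the real pipeline lives in `rag_pipeline.py` and calls the OpenAI and ChromaDB clients; here only the context-assembly step is shown as a pure function, and `embed`/`retrieve`/`generate` are hypothetical wrapper names.

```python
# Outline of one query (embed/retrieve/generate are hypothetical wrappers
# around the OpenAI and ChromaDB clients, shown for shape only):
#   vector = embed(question)               # text-embedding-3-small
#   chunks = retrieve(vector, k=5)         # ChromaDB similarity search
#   prompt = build_prompt(chunks, history, question)
#   answer = generate(prompt)              # gpt-4o

def build_prompt(chunks: list[str], history: list[str], question: str) -> str:
    """Assemble a context-aware prompt from retrieved chunks and chat history."""
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    past = "\n".join(history)
    return (
        "You are a Java/OOP teaching assistant. Answer using the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Conversation so far:\n{past}\n\n"
        f"Question: {question}"
    )
```

Keeping prompt assembly as a pure function makes it easy to inspect (and trace in LangSmith) exactly what context the model saw for a given answer.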
- UI Layer (`app.py`, `frontend.py`):
  - Streamlit-based web interface
  - File upload handling
  - Chat history management
  - Session state management
- Processing Layer (`helper.py`):
  - PDF parsing using PyMuPDF (fitz)
  - Visual element detection (tables, images, drawings)
  - GPT-4 Vision integration for visual analysis
  - Smart content chunking
- RAG Pipeline (`rag_pipeline.py`):
  - Query embedding generation
  - Vector similarity search in ChromaDB
  - Prompt construction with context
  - LLM response generation
- Vector Database:
  - ChromaDB persistent client at `./chroma_db`
  - Collection: `knowledge-base6`
  - Uses OpenAI `text-embedding-3-small` for embeddings
- Embeddings: `text-embedding-3-small` (OpenAI)
- Text Generation: `gpt-4o` (OpenAI ChatOpenAI)
- Vision Analysis: `gpt-4o` with vision capabilities
- Uses LangChain's `ConversationBufferMemory` for conversation history
- Stores chat history in Streamlit session state
- Maintains context across conversation turns
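Conceptually, the buffer memory behaves like this simplified stand-in (not the LangChain API, just the idea of a capped conversation buffer):

```python
from collections import deque

class SimpleBufferMemory:
    """Keeps the last max_turns (user, assistant) pairs, like a capped buffer."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off automatically

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_context(self) -> str:
        """Flatten the buffer into text that can be prepended to the prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```

Capping the buffer keeps the prompt from growing without bound while still giving the model the most recent turns as context.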
"OpenAI key not found"
- Ensure the `.env` file exists in the project root
- Verify `OPENAI_API_KEY` is set correctly
- Restart your shell or IDE after creating `.env`
LangSmith not collecting traces
- Set `LANGCHAIN_API_KEY` in `.env`
- Enable `LANGCHAIN_TRACING_V2=true`
- Verify your LangSmith credentials are valid
ChromaDB permission errors
- Ensure the `./chroma_db` directory is writable
- Check file system permissions
- Create the directory manually if it doesn't exist: `mkdir chroma_db`
NLTK errors (macOS SSL)
- The notebook includes SSL workarounds
- NLTK will automatically download the required data (`punkt`, `punkt_tab`)
- If issues persist, manually download the NLTK data:

  ```python
  import nltk
  nltk.download('punkt')
  nltk.download('punkt_tab')
  ```
Empty retrievals
- Verify documents were embedded successfully
- Check that documents exist in the `knowledge-base6` collection
- Re-run the notebook ingestion process if needed
PDF processing errors
- Ensure PDFs are not corrupted or password-protected
- Check that PyMuPDF (pymupdf) is installed correctly
- Verify sufficient disk space for temporary file processing
- API Keys: Never commit `.env` files or API keys to version control
- Local Storage: The vector database is stored locally by default (`./chroma_db`)
- Session Data: Chat history is stored in Streamlit session state (cleared on refresh)
- Use environment variables for all sensitive information
- Regularly back up your `chroma_db` directory if it contains valuable data
- Review uploaded PDFs before processing (ensure no sensitive data)
- Keep dependencies up to date for security patches
To back up your knowledge base:
```bash
# Copy the entire chroma_db directory
cp -r chroma_db chroma_db_backup

# Or create a zip archive
zip -r chroma_db_backup.zip chroma_db/
```

This project is built with:
- Streamlit - Web application framework
- ChromaDB - Vector database
- LangChain - LLM application framework
- LangSmith - LLM observability platform
- OpenAI - Embedding and language models
- PyMuPDF - PDF processing library
Note: This chatbot is designed specifically for Java and OOP course materials. Adapt the prompts and processing logic in `rag_pipeline.py` and `helper.py` if you want to use it for other subjects.