A Streamlit app that ingests a public GitHub repository and lets you ask questions about its codebase.
Demo dataset: the psf/requests repository. The app builds a local ChromaDB vector store from the repo's Python files and uses a Retrieval-Augmented Generation pipeline with Google's Gemini for answers.
- Python 3.9+
- Streamlit (UI)
- GitPython (clone repo)
- LangChain (RAG, text splitting, orchestration)
- sentence-transformers (all-MiniLM-L6-v2 embeddings)
- ChromaDB (persistent vector store)
- Google Gemini via langchain-google-genai
- Create and activate a virtual environment
    python -m venv .venv
    . .venv\Scripts\Activate.ps1
- Install dependencies
    pip install -r requirements.txt
- Set your Google API key (Gemini)
    # Replace with your key
    $env:GOOGLE_API_KEY = "YOUR_GOOGLE_API_KEY"
- Run the app
    streamlit run app.py
- Build the knowledge base
  - In the app, expand "Knowledge base setup" and click "Build knowledge base from 'requests' repo".
  - The first run will clone the repo, split code with a Python-aware splitter, embed chunks, and persist them to data/chroma_db_requests.
- Ingestion (src/ingest.py):
  - Clones or updates the psf/requests repo.
  - Loads all *.py files.
  - Splits code with PythonCodeTextSplitter to preserve semantic boundaries (functions/classes).
  - Embeds chunks using sentence-transformers/all-MiniLM-L6-v2.
  - Stores in a persistent ChromaDB directory.
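To illustrate what splitting on semantic boundaries means in the ingestion step, here is a minimal standard-library sketch that cuts a Python file into one chunk per top-level function or class. It is not how PythonCodeTextSplitter works internally and is not taken from src/ingest.py; the helper name split_python_source and the sample source are made up for this example.

```python
import ast


def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function/class, plus one chunk
    holding the remaining module-level statements (imports, constants).

    Hypothetical helper for illustration only. Comments and blank lines
    that sit between top-level statements are dropped.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    module_level = []

    for node in tree.body:
        start = node.lineno
        is_def = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_def and node.decorator_list:
            # Decorators appear above the def/class line; keep them in the chunk.
            start = min(d.lineno for d in node.decorator_list)
        text = "\n".join(lines[start - 1 : node.end_lineno])
        (chunks if is_def else module_level).append(text)

    if module_level:
        # Put imports and constants first so they read like a file header.
        chunks.insert(0, "\n".join(module_level))
    return chunks


SAMPLE = '''import os

def get(url):
    return url

class Session:
    def close(self):
        pass
'''

chunks = split_python_source(SAMPLE)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

Keeping each function or class whole means a retrieved chunk is more likely to be readable on its own than a chunk cut at a fixed character count.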
- Q&A (src/qa.py):
  - Creates a Chroma retriever over the persisted store.
  - Uses Gemini (via LangChain) with a prompt instructing it to answer strictly from the provided context.
  - Returns the answer and the source file paths.
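The Q&A flow above (retrieve similar chunks, build a context-only prompt, report sources) can be sketched without any external services. The code below is an illustration under stated assumptions, not the contents of src/qa.py: the in-memory STORE, its three-dimensional toy embeddings, the file paths, and the function names retrieve and build_prompt are all hypothetical stand-ins for the Chroma retriever and the Gemini call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        store, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy stand-in for the persisted vector store (made-up paths and vectors).
STORE = [
    {"source": "requests/api.py", "text": "def get(url, **kwargs): ...",
     "embedding": [1.0, 0.0, 0.0]},
    {"source": "requests/sessions.py", "text": "class Session: ...",
     "embedding": [0.0, 1.0, 0.0]},
    {"source": "requests/adapters.py", "text": "class HTTPAdapter: ...",
     "embedding": [0.0, 0.0, 1.0]},
]

top = retrieve([0.9, 0.1, 0.0], STORE, k=2)
prompt = build_prompt("How does get() work?", top)
sources = [c["source"] for c in top]

print(prompt)
print("Sources:", sources)
```

In the real app the prompt would be sent to Gemini and the answer returned alongside the source list; here the prompt is only printed.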
- UI (app.py):
  - Shows knowledge base status and ingestion controls.
  - Lets you ask questions about the requests codebase.
- The architecture is modular so you can later add multiple repos by parameterizing the repo URL, clone path, and persist dir.
- If you change the embedding model or chunking, rebuild the vector store.
- Ensure your environment can download the HuggingFace embedding model the first time (it will cache afterward).
- Missing GOOGLE_API_KEY: set $env:GOOGLE_API_KEY and restart the app.
- CUDA/torch issues for embeddings: the default model runs on CPU; check that torch was installed as a dependency of sentence-transformers. If problems persist, install torch explicitly for your platform.
- Proxy/network errors when cloning: re-run ingestion after fixing connectivity.
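The note about supporting multiple repos by parameterizing the repo URL, clone path, and persist dir could look roughly like the sketch below. The RepoConfig class, the config_for helper, and the data/repos clone location are assumptions made for illustration; only the data/chroma_db_requests persist directory comes from this README.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RepoConfig:
    """The three values that differ per repository (hypothetical type)."""
    repo_url: str
    clone_path: Path
    persist_dir: Path


def config_for(repo_url: str, base_dir: Path = Path("data")) -> RepoConfig:
    """Derive per-repo paths from the URL so each repo gets its own store."""
    # "https://github.com/psf/requests" -> "requests"
    name = repo_url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return RepoConfig(
        repo_url=repo_url,
        clone_path=base_dir / "repos" / name,          # assumed layout
        persist_dir=base_dir / f"chroma_db_{name}",    # matches data/chroma_db_requests
    )


cfg = config_for("https://github.com/psf/requests")
print(cfg.clone_path)
print(cfg.persist_dir)
```

Giving each repo its own persist directory also keeps stores built with different chunking or embedding settings from mixing.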