GitHub Open Source Repo Code Explainer (RAG)

A Streamlit app that ingests a public GitHub repository and lets you ask questions about its codebase.

Demo dataset: the psf/requests repository. The app builds a local ChromaDB vector store from the repo's Python files and uses a Retrieval-Augmented Generation pipeline with Google's Gemini for answers.

Tech stack

Python 3.9+
Streamlit (UI)
GitPython (clone repo)
LangChain (RAG, text splitting, orchestration)
sentence-transformers (all-MiniLM-L6-v2 embeddings)
ChromaDB (persistent vector store)
Google Gemini via langchain-google-genai

Quickstart (Windows PowerShell)

Create and activate a virtual environment

python -m venv .venv
. .venv\Scripts\Activate.ps1

Install dependencies

pip install -r requirements.txt

Set your Google API key (Gemini)

# Replace with your key
$env:GOOGLE_API_KEY = "YOUR_GOOGLE_API_KEY"

Run the app

streamlit run app.py

Build the knowledge base

In the app, expand "Knowledge base setup" and click "Build knowledge base from 'requests' repo".
The first run will clone the repo, split code with a Python-aware splitter, embed chunks, and persist them to data/chroma_db_requests.
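To illustrate what "Python-aware" splitting means, here is a minimal stdlib-only sketch that cuts a source file at top-level function and class boundaries. It is a stand-in for the idea only: the app uses LangChain's PythonCodeTextSplitter, whose splitting rules differ, and the name split_python_source and the SAMPLE text are hypothetical.

```python
import ast


def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class.

    Illustrative stand-in only; not how PythonCodeTextSplitter works.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            first = min([d.lineno for d in node.decorator_list] + [node.lineno])
            # lineno and end_lineno are 1-based and inclusive.
            chunks.append("\n".join(lines[first - 1:node.end_lineno]))
    return chunks


# Hypothetical sample input, not taken from the requests repo.
SAMPLE = '''import os

def get(url):
    return url

class Session:
    def close(self):
        pass
'''

chunks = split_python_source(SAMPLE)
for chunk in chunks:
    print(chunk)
    print("---")
```

Keeping each function or class whole means a retrieved chunk is more likely to be readable on its own than a chunk cut at a fixed character count.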

How it works

Ingestion (src/ingest.py):
- Clones or updates the psf/requests repo.
- Loads all *.py files.
- Splits code with PythonCodeTextSplitter to preserve semantic boundaries (functions/classes).
- Embeds chunks using sentence-transformers/all-MiniLM-L6-v2.
- Stores in a persistent ChromaDB directory.
Q&A (src/qa.py):
- Creates a Chroma retriever over the persisted store.
- Uses Gemini (via LangChain) with a prompt instructing it to answer strictly from the provided context.
- Returns the answer and the source file paths.
UI (app.py):
- Shows knowledge base status and ingestion controls.
- Lets you ask questions about the requests codebase.
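
The Q&A step above can be sketched without any third-party libraries: rank stored chunks by similarity to the question, build a context-only prompt, and keep the source paths. Everything here is an assumption made for illustration: the toy vectors and paths, the function names, the prompt wording, and the use of cosine similarity (the real Chroma store may be configured with a different distance metric). The call to Gemini is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k stored chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"# {h['path']}\n{h['text']}" for h in hits)
    return (
        "Answer strictly from the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy store: hypothetical paths and hand-written 3-d vectors, not real embeddings.
STORE = [
    {"path": "toy/sessions.py", "text": "class Session: ...", "vector": [0.9, 0.1, 0.0]},
    {"path": "toy/adapters.py", "text": "class Adapter: ...", "vector": [0.1, 0.9, 0.0]},
    {"path": "toy/utils.py", "text": "def helper(): ...", "vector": [0.0, 0.2, 0.9]},
]

hits = retrieve([1.0, 0.2, 0.0], STORE, k=2)
prompt = build_prompt("What does Session do?", hits)
sources = [h["path"] for h in hits]
print(prompt)
print(sources)
```

Returning the paths alongside the answer lets the UI show which files the answer was drawn from.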

Notes

The architecture is modular so you can later add multiple repos by parameterizing the repo URL, clone path, and persist dir.
If you change the embedding model or chunking, rebuild the vector store.
The first run downloads the embedding model from Hugging Face, so it needs network access; later runs use the local cache.
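
One possible way to parameterize the repo URL, clone path, and persist dir is a small config object per repository. This is a sketch under assumptions: RepoConfig, config_for, and the data/repos clone location are hypothetical names; only the data/chroma_db_requests persist directory comes from this README.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RepoConfig:
    """The three values that differ from one repository to the next."""
    repo_url: str
    clone_path: Path
    persist_dir: Path


def config_for(owner: str, name: str, data_root: Path = Path("data")) -> RepoConfig:
    """Derive per-repo paths from the GitHub owner and repository name."""
    return RepoConfig(
        repo_url=f"https://github.com/{owner}/{name}.git",
        clone_path=data_root / "repos" / name,        # assumed layout
        persist_dir=data_root / f"chroma_db_{name}",  # matches data/chroma_db_requests
    )


cfg = config_for("psf", "requests")
print(cfg)
```

Giving each repo its own persist directory keeps vector stores separate, so rebuilding one does not affect another.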

Troubleshooting

Missing GOOGLE_API_KEY: set $env:GOOGLE_API_KEY and restart the app.
CUDA/torch issues for embeddings: the default model runs on CPU, and torch is normally installed as a dependency of sentence-transformers. If problems persist, install torch explicitly for your platform.
Proxy/network errors when cloning: re-run ingestion after fixing connectivity.
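
For the missing-key case, a fail-fast check at startup gives a clearer error than a failure deep inside the first model call. This is a hedged sketch, not the app's actual code: require_api_key is a hypothetical helper, and the env parameter exists only so the check can be exercised without touching the real environment.

```python
import os


def require_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return the API key, or raise with an actionable message if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it in your shell and restart the app.")
    return key


# Usage with an explicit mapping (placeholder value, not a real key):
print(require_api_key({"GOOGLE_API_KEY": "YOUR_GOOGLE_API_KEY"}))
```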
