ak1606/Codexa

GitHub Open Source Repo Code Explainer (RAG)

A Streamlit app that ingests a public GitHub repository and lets you ask questions about its codebase.

Demo dataset: the psf/requests repository. The app builds a local ChromaDB vector store from the repo's Python files and uses a Retrieval-Augmented Generation pipeline with Google's Gemini for answers.

Tech stack

  • Python 3.9+
  • Streamlit (UI)
  • GitPython (clone repo)
  • LangChain (RAG, text splitting, orchestration)
  • sentence-transformers (all-MiniLM-L6-v2 embeddings)
  • ChromaDB (persistent vector store)
  • Google Gemini via langchain-google-genai

Quickstart (Windows PowerShell)

  1. Create and activate a virtual environment
       python -m venv .venv
       . .venv\Scripts\Activate.ps1
  2. Install dependencies
       pip install -r requirements.txt
  3. Set your Google API key (Gemini)
       # Replace with your key
       $env:GOOGLE_API_KEY = "YOUR_GOOGLE_API_KEY"
  4. Run the app
       streamlit run app.py
  5. Build the knowledge base
  • In the app, expand "Knowledge base setup" and click "Build knowledge base from 'requests' repo".
  • The first run will clone the repo, split code with a Python-aware splitter, embed chunks, and persist them to data/chroma_db_requests.
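
To give a feel for what "Python-aware" splitting means, here is a minimal sketch that cuts a source file at top-level function and class boundaries using only the standard library `ast` module. It is a stand-in for illustration, not LangChain's splitter and not the code in src/ingest.py; the function name `split_top_level` is hypothetical, and module-level statements outside functions and classes are simply skipped here.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Return one text chunk per top-level function or class in `source`.

    Illustrative only: module-level statements (imports, constants) are
    ignored, and large classes are not split further.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[start - 1 : node.end_lineno]))
    return chunks


sample = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
chunks = split_top_level(sample)
```

Each chunk keeps a whole definition together, which is the property the ingestion step relies on when it embeds code.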

How it works

  • Ingestion (src/ingest.py):

    • Clones or updates the psf/requests repo.
    • Loads all *.py files.
    • Splits code with PythonCodeTextSplitter to preserve semantic boundaries (functions/classes).
    • Embeds chunks using sentence-transformers/all-MiniLM-L6-v2.
    • Stores in a persistent ChromaDB directory.
  • Q&A (src/qa.py):

    • Creates a Chroma retriever over the persisted store.
    • Uses Gemini (via LangChain) with a prompt instructing it to answer strictly from the provided context.
    • Returns the answer and the source file paths.
  • UI (app.py):

    • Shows knowledge base status and ingestion controls.
    • Lets you ask questions about the requests codebase.
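
The Q&A flow above (retrieve similar chunks, answer strictly from them, report source paths) can be sketched without any external libraries. Everything here is an assumption made for illustration: the two-dimensional vectors stand in for real all-MiniLM-L6-v2 embeddings, the in-memory list stands in for the Chroma retriever, the file paths are made up, and the prompt wording is not the one used in src/qa.py.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k stored chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"# {c['path']}\n{c['text']}" for c in chunks)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def source_paths(chunks):
    """Unique, sorted file paths of the chunks used for the answer."""
    return sorted({c["path"] for c in chunks})


# Hypothetical store: paths, text, and vectors are invented for this example.
EXAMPLE_STORE = [
    {"path": "pkg/sessions.py", "text": "class Session: ...", "vector": [1.0, 0.0]},
    {"path": "pkg/adapters.py", "text": "class HTTPAdapter: ...", "vector": [0.0, 1.0]},
    {"path": "pkg/sessions.py", "text": "def request(): ...", "vector": [0.9, 0.1]},
]

hits = retrieve([1.0, 0.0], EXAMPLE_STORE, k=2)
prompt = build_prompt("What does Session do?", hits)
```

In the real app the prompt would be sent to Gemini through LangChain; the sketch stops at the prompt so it stays runnable offline.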

Notes

  • The architecture is modular so you can later add multiple repos by parameterizing the repo URL, clone path, and persist dir.
  • If you change the embedding model or chunking, rebuild the vector store.
  • Ensure your environment can download the HuggingFace embedding model the first time (it will cache afterward).
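
One possible way to parameterize the repo URL, clone path, and persist dir is a small config object derived from the URL. This is a sketch under assumptions: `RepoConfig` and `config_for` are hypothetical names, and the `data/repos/<name>` clone location is invented for the example. Only the `data/chroma_db_<name>` pattern mirrors the `data/chroma_db_requests` directory mentioned above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RepoConfig:
    repo_url: str
    clone_path: PurePosixPath   # assumed layout, not taken from the project
    persist_dir: PurePosixPath  # follows the data/chroma_db_<name> pattern


def config_for(repo_url: str, data_root: str = "data") -> RepoConfig:
    """Derive per-repo paths from the last segment of the repository URL."""
    name = repo_url.rstrip("/").split("/")[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    root = PurePosixPath(data_root)
    return RepoConfig(
        repo_url=repo_url,
        clone_path=root / "repos" / name,
        persist_dir=root / f"chroma_db_{name}",
    )


cfg = config_for("https://github.com/psf/requests.git")
```

Keeping one persist dir per repo also makes the rebuild rule easy to follow: changing the embedding model or chunking means deleting and rebuilding that one directory.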

Troubleshooting

  • Missing GOOGLE_API_KEY: set $env:GOOGLE_API_KEY and restart the app.
  • CUDA/torch issues for embeddings: the default model runs on CPU, and torch is normally installed as a dependency of sentence-transformers. If problems persist, install torch explicitly for your platform.
  • Proxy/network errors when cloning: re-run ingestion after fixing connectivity.
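
For the missing-key case, a fail-fast check at startup can give a clearer message than an error deep inside the model call. The helper below is a hypothetical sketch, not code from this repo; it only reads the GOOGLE_API_KEY environment variable named above.

```python
import os


def require_google_api_key(env=None) -> str:
    """Return the Gemini API key, or raise with a hint on how to set it."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set. In PowerShell run "
            '$env:GOOGLE_API_KEY = "..." and restart the app.'
        )
    return key
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.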
