Cheeserag Studio is an end-to-end, fully offline AI workspace. Upload PDFs, CSVs, and meeting transcripts, then chat with them through a rich 3-panel web interface. Every answer is strictly grounded in your documents — no hallucinations, no data leakage, zero cloud calls.
| Layer | Component | Language | Role |
|---|---|---|---|
| LLM inference | Cheesebrain (`third_party/cheesebrain`) | C++20 | OpenAI-compatible server (`/v1/chat/completions`, `/v1/embeddings`) |
| Vector database | PomaiDB (`third_party/pomaidb`) | C++20 + Python | Multi-membrane edge vector DB; zero-OOM guarantee |
| API orchestrator | `cheese_api/` | Python / FastAPI | Workspace CRUD, async ingest, citation metadata, closed-book chat, audio overviews |
| Autonomous agent | `cmd/cheeserag-agent/` (uses Cheesepath) | Go | CLI agent with ReAct, planning, multi-role panel, tool registry |
| Web UI | `studio/` | TypeScript / Next.js 14 | 3-panel workspace: sources, chat, notes |
Install these once on your host machine before anything else.
Ubuntu / Debian:
```bash
sudo apt-get update
sudo apt-get install -y \
  build-essential cmake ninja-build git \
  g++-13 pkg-config libssl-dev \
  python3 python3-venv python3-pip \
  golang-go \
  nodejs npm \
  tesseract-ocr   # OCR fallback for scanned PDFs
```

macOS (Homebrew):

```bash
brew install cmake ninja git openssl python go node tesseract
```

Windows: Use WSL2 (Ubuntu 24.04) and follow the Ubuntu steps above.
| Tool | Minimum | Check |
|---|---|---|
| CMake | 3.20 | cmake --version |
| g++ / clang++ | C++20 capable (GCC 11+, Clang 14+) | g++ --version |
| Go | 1.23 | go version |
| Python | 3.10 | python3 --version |
| Node.js | 18 | node --version |
| npm | 9 | npm --version |
```bash
git clone https://github.com/pomagrenate/cheeserag.git
cd cheeserag

# Pull all three submodules (Cheesebrain, PomaiDB, Cheesepath)
git submodule update --init --recursive
```

This populates:

- `third_party/cheesebrain/` — C++ LLM inference engine
- `third_party/pomaidb/` — C++ vector database (+ its own `third_party/palloc` sub-submodule)
- `third_party/cheesepath/` — Go agent framework
PomaiDB must be compiled first because its shared library (libpomai_c.so) is loaded by the Python API at runtime.
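For orientation: the Python side loads this library dynamically at startup. A minimal sketch of the idea, assuming the `POMAI_C_LIB` convention described below (the real loading code lives in `cheese_api/pomaidb_extra.py` and may differ):

```python
import ctypes
import os

# Sketch only: resolve the library path from POMAI_C_LIB, falling back
# to the default build location, then load it with ctypes.
lib_path = os.environ.get(
    "POMAI_C_LIB", "third_party/pomaidb/build/libpomai_c.so"
)
pomai = ctypes.CDLL(lib_path)  # raises OSError if the library is missing
```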
```bash
cd third_party/pomaidb

# PomaiDB has its own sub-submodule (palloc allocator)
git submodule update --init third_party/palloc

# Release build — produces libpomai_c.so + pomaidb_server
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_COMPILER=g++ \
  -DPOMAI_BUILD_TESTS=OFF
cmake --build build -j$(nproc)

# Confirm the shared library was built
ls build/libpomai_c.so        # Linux
# ls build/libpomai_c.dylib   # macOS

cd ../..
```

Optional: edge-optimised build (smaller binary, lower RAM footprint):

```bash
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DPOMAI_EDGE_BUILD=ON \
  -DPOMAI_BUILD_TESTS=OFF
cmake --build build -j$(nproc)
```

Cheesebrain is the LLM inference server. The release build produces `cheese-server` and `cheese-cli`.
```bash
cd third_party/cheesebrain
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify
./build/bin/cheese-server --version

cd ../..
```

GPU acceleration (optional):

```bash
# CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
# Metal (Apple Silicon)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON
cmake --build build --config Release -j$(nproc)
```

After building, download a GGUF model into `models/`. A good default is `qwen2.5-0.5b-instruct-q4_k_m.gguf` (~400 MB, runs on 1 GB RAM).
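For example (substitute a real download URL for your chosen GGUF file):

```bash
mkdir -p models
curl -L -o models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  "<direct download URL for the GGUF file>"
```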
The Go agent links Cheesepath from the local submodule via a `replace` directive in `go.mod` — no separate build step is needed for the library itself.
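For reference, the stanza looks roughly like this (a sketch; the module path and version are illustrative, not copied from the actual `go.mod`):

```
require github.com/pomagrenate/cheesepath v0.0.0

replace github.com/pomagrenate/cheesepath => ./third_party/cheesepath
```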
```bash
# From the repo root
go build -o build/cheeserag-agent ./cmd/cheeserag-agent/

# Run once to verify
./build/cheeserag-agent --help
```

If you also want the standalone ingestion CLI:

```bash
go build -o build/cheeserag-ingest ./cmd/cheeserag-ingest/
```

Next, set up the Python environment for the API:

```bash
# From the repo root — create an isolated virtualenv
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
# Export the path to the compiled PomaiDB C library
export POMAI_C_LIB=$(pwd)/third_party/pomaidb/build/libpomai_c.so
# macOS: export POMAI_C_LIB=$(pwd)/third_party/pomaidb/build/libpomai_c.dylib
# PomaiDB Python module path
export PYTHONPATH=$(pwd)/third_party/pomaidb/python:$PYTHONPATH
```

Then install the Studio frontend dependencies:

```bash
cd studio
npm install
cd ..
```

Open four terminal tabs from the repo root.

Terminal 1 (Cheesebrain):

```bash
./third_party/cheesebrain/build/bin/cheese-server \
--embeddings \
--pooling mean \
-m models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080
```

Terminal 2 (Cheese API):

```bash
source .venv/bin/activate
export POMAI_C_LIB=$(pwd)/third_party/pomaidb/build/libpomai_c.so
export PYTHONPATH=$(pwd)/third_party/pomaidb/python:$PYTHONPATH
export CHEESEBRAIN_URL=http://127.0.0.1:8080
export RAG_DB_PATH=$(pwd)/rag_db
uvicorn cheese_api.server:app --host 0.0.0.0 --port 9090 --reload
```

Terminal 3 (Studio):

```bash
cd studio
NEXT_PUBLIC_API_URL=http://localhost:9090 npm run dev
```

Open http://localhost:3000 in your browser.

Terminal 4 (agent):

```bash
export CHEESEBRAIN_URL=http://127.0.0.1:8080
export RAG_FACADE_URL=http://127.0.0.1:9090
./build/cheeserag-agent
```

Alternatively, Docker is the easiest path: it builds all three C++ submodules automatically inside containers. You'll need:
- Docker 24+
- Docker Compose v2
```bash
# Build images + start all services (Cheesebrain, Cheese API, Studio)
docker-compose up --build

# Or in the background
docker-compose up --build -d
```

Service URLs once running:
| Service | URL | Description |
|---|---|---|
| Studio | http://localhost:3000 | Web workspace UI |
| Cheese API | http://localhost:9090 | FastAPI + Swagger at /docs |
| Cheesebrain | http://localhost:8080 | LLM inference (internal) |
```bash
docker-compose down      # stop + remove containers
docker-compose down -v   # also remove persisted DB volume
```

A legacy profile is also available:

```bash
docker-compose --profile legacy up --build
# Available at http://localhost:8501
```

Set these in your shell, a `.env` file, or `docker-compose.yml`:
| Variable | Default | Description |
|---|---|---|
| `CHEESEBRAIN_URL` | `http://127.0.0.1:8080` | Cheesebrain server URL |
| `RAG_DB_PATH` | (required) | Directory where PomaiDB stores its files |
| `POMAI_C_LIB` | (auto-detected) | Path to `libpomai_c.so` / `.dylib` |
| `PYTHONPATH` | (set manually) | Must include `third_party/pomaidb/python` |
| `CHEESE_API_KEY` | `cheese-admin-key` | API key for all `X-API-Key` requests |
| `CHEESE_EMBEDDING_MODEL` | (empty — auto) | Model name to pass to `/v1/embeddings` |
| `CHEESE_CHAT_MODEL` | (empty — auto) | Model name to pass to `/v1/chat/completions` |
| `CHEESE_CLOSED_BOOK_THRESHOLD` | `0.35` | Similarity below which "not found" is returned |
| `RAG_MEMBRANE` | `rag` | Legacy default membrane name |
| `RAG_SHARDS` | `1` | Number of PomaiDB shards |
| `RAG_EF_SEARCH` | `128` | HNSW search `ef` parameter |
| `NEXT_PUBLIC_API_URL` | `http://localhost:9090` | API base URL used by the Studio frontend |
| `NEXT_PUBLIC_API_KEY` | `cheese-admin-key` | API key used by the Studio frontend |
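For example, a minimal `.env` for a local run might look like this (values are the defaults from the table above, except `RAG_DB_PATH`, which has no default; the path shown is illustrative):

```
CHEESEBRAIN_URL=http://127.0.0.1:8080
RAG_DB_PATH=/home/you/cheeserag/rag_db
CHEESE_API_KEY=cheese-admin-key
CHEESE_CLOSED_BOOK_THRESHOLD=0.35
NEXT_PUBLIC_API_URL=http://localhost:9090
NEXT_PUBLIC_API_KEY=cheese-admin-key
```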
- Open http://localhost:3000
- Create a Workspace — give it a name (e.g. "Thesis", "Meeting Notes")
- Upload documents — drag & drop PDFs, CSVs, or text files onto the Sources panel. A progress bar tracks each file's ingestion chunk by chunk.
- Chat — type a question. Answers are strictly grounded: if the answer is not in your documents, you'll see a yellow "cannot find" badge instead of a hallucination.
- Click a citation — answers include `[1]`, `[2]` footnote markers. Clicking one opens the PDF at the exact cited page.
- Pin to notes — click "Pin to notes" under any AI reply to send it to the right-hand scratchpad. Write your own analysis there and export as `.md`.
```bash
./build/cheeserag-agent [flags]
```

Key flags:
| Flag | Default | Description |
|---|---|---|
| `--strategy` | `react` | Agent strategy: `react`, `reflect`, `planexec`, `architect`, `fnagent`, `panel` |
| `--memory` | `buffer` | Memory type: `buffer`, `vector`, `summary`, `sliding` |
| `--max-history` | `40` | Max LLM history messages (sliding window) |
| `--max-obs-bytes` | `16384` | Max bytes per tool observation stored in history |
| `--panel-synth` | `llm` | Panel synthesis mode: `concat`, `llm`, `first`, `vote` |
| `--auto-approve` | `false` | Skip confirmation for dangerous tools |
Inside a session, these slash commands are available:

| Command | Description |
|---|---|
| `/ingest <file>` | Ingest a file into the RAG database |
| `/pin <file>` | Pin a file's content into session context (8 KB cap) |
| `/unpin <file>` | Remove a pinned file |
| `/strategy <name>` | Switch agent strategy mid-session |
| `/panel <goal>` | Run a multi-role panel (researcher + critic + planner) |
| `/memory` | Show current memory state |
| `/history` | Show conversation turns |
| `/clear` | Clear conversation history |
| `/help` | List all commands |
The FastAPI server runs at :9090. Full interactive docs at http://localhost:9090/docs.
```
POST   /v1/workspaces               Create workspace
GET    /v1/workspaces               List all workspaces
DELETE /v1/workspaces/{id}          Delete workspace
GET    /v1/workspaces/{id}/docs     List documents in workspace

POST   /v1/ingest
       Form fields: file (multipart), doc_id (int), workspace_id (str),
                    max_chunk_bytes, overlap_bytes
       → { job_id, doc_id }

GET    /v1/jobs/{job_id}/stream     SSE stream: { status, progress, total }
GET    /v1/jobs/{job_id}            Poll job status

POST   /v1/retrieve
       Body: { query, top_k, workspace_id, min_score }
       → { context, hits: [{ text, score, citation: { file, page, byte_offset, line } }] }

POST   /v1/chat
       Body: { workspace_id, message, history }
       → SSE stream: citations event, then token events, then [DONE]

POST   /v1/audio_overview
       Body: { workspace_id, top_k }
       → { job_id }
GET    /v1/audio_overview/{job_id}/status
GET    /v1/audio_overview/{job_id}/download   → audio/wav
```
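As a quick smoke test, here is a sketch of calling `/v1/retrieve` from Python, assuming the server is on `:9090` with the default API key (the `workspace_id` and `min_score` values are illustrative):

```python
import requests

API = "http://localhost:9090"
HEADERS = {"X-API-Key": "cheese-admin-key"}  # default CHEESE_API_KEY

# Ask for the top 3 chunks matching a query in one workspace.
resp = requests.post(
    f"{API}/v1/retrieve",
    headers=HEADERS,
    json={
        "query": "What is the zero-OOM guarantee?",
        "top_k": 3,
        "workspace_id": "thesis",
        "min_score": 0.2,
    },
)
resp.raise_for_status()

for hit in resp.json()["hits"]:
    cite = hit["citation"]
    print(f"{hit['score']:.2f}  {cite['file']} p.{cite['page']}  {hit['text'][:60]}")
```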
Repository layout:

```
cheeserag/
├── third_party/
│ ├── cheesebrain/ # C++ LLM inference engine (submodule)
│ ├── pomaidb/ # C++ vector database (submodule)
│ │ └── third_party/palloc/ # Memory allocator (sub-submodule)
│ └── cheesepath/ # Go agent framework (submodule)
│
├── cheese_api/ # Python FastAPI orchestrator
│ ├── server.py # All API endpoints
│ ├── ingestion.py # PDF/CSV/text chunking + OCR
│ ├── pomaidb_extra.py # PomaiDB ctypes wrappers + KV metadata
│ ├── embeddings.py # Cheesebrain embedding client
│ ├── audio_overview.py # TTS dialogue generation
│ └── workspace_indexer.py # AST-based code indexing
│
├── cmd/
│ ├── cheeserag-agent/ # Go CLI agent (31 tools, 6 strategies)
│ └── cheeserag-ingest/ # Standalone ingestion CLI
│
├── studio/ # Next.js 14 web workspace
│ ├── app/ # App Router pages
│ ├── components/ # SourcePanel, ChatPanel, NotesPanel, CitationModal
│ └── lib/ # API client, Zustand stores
│
├── models/ # GGUF model files (not in git)
├── rag_db/ # PomaiDB database files (not in git)
├── docker-compose.yml # Production stack
├── Dockerfile # cheese-api container
├── requirements.txt # Python dependencies
└── go.mod                   # Go module
```
"I could have easily integrated the OpenAI API, but my goal was to engineer a highly secure, air-gapped, local-first RAG workspace capable of running on resource-constrained hardware — a standard laptop or Raspberry Pi. By designing a micro-agent pipeline architecture and integrating it tightly with PomaiDB — a custom-built C++ vector database — I successfully mitigated the reasoning limitations of a 0.5B model. This kept the total memory footprint under 1 GB while maintaining high extraction accuracy and zero data leakage."
A 0.5B parameter model has real constraints: it hallucinates with long contexts, loses formatting under complex instructions, and cannot reliably track citation markers. Cheeserag Studio works around every one of these with backend engineering rather than a bigger model.
The LLM is never asked to place [1], [2] markers. Instead:
- The model receives a short, completion-style extractive prompt (`max_tokens=150`):

  ```
  Context: <chunk text>
  Question: <user question>
  Answer (one short sentence, use exact words from the context above):
  ```

- The backend (`citation_engine.py`) runs TF-IDF cosine similarity between each sentence of the answer and the retrieved chunks.
- `[N]` markers are programmatically inserted by the Python backend after sentences that match a chunk above a confidence threshold.
The frontend always receives citations that are guaranteed to map to real source chunks — because the code assigned them, not the model.
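A simplified sketch of that post-hoc matching, using scikit-learn's TF-IDF (illustrative only; the actual `citation_engine.py` may differ in sentence splitting and threshold handling):

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def insert_citations(answer: str, chunks: list[str], threshold: float = 0.3) -> str:
    """Append an [N] marker after each answer sentence that matches a chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    # One shared vocabulary for chunks and answer sentences.
    vec = TfidfVectorizer().fit(chunks + sentences)
    chunk_matrix = vec.transform(chunks)
    cited = []
    for sent in sentences:
        sims = cosine_similarity(vec.transform([sent]), chunk_matrix)[0]
        best = int(sims.argmax())
        # Cite only when the best chunk clears the confidence threshold.
        cited.append(f"{sent} [{best + 1}]" if sims[best] >= threshold else sent)
    return " ".join(cited)
```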
Instead of a single large prompt ("write a podcast about all these pages"), the audio overview pipeline runs three sequential micro-steps:
| Step | Task | Who does it | max_tokens |
|---|---|---|---|
| 1 — Extract | Summarise each chunk into one bullet | LLM (loop) | 80 |
| 2 — Aggregate | Dedup + rank bullets | Pure Python | — |
| 3 — Dialogue | Convert each bullet into a Host A / B exchange | LLM (loop) | 120 + 80 |
Each LLM call is ≤ 512 tokens total — well within the reliable range of a 0.5B model.
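The shape of the pipeline in Python (a sketch; `llm` is a stand-in for the completion client, and these function names are hypothetical, not the actual `audio_overview.py` API):

```python
def audio_overview(chunks: list[str], llm) -> list[tuple[str, str]]:
    # Step 1 (Extract): one tiny LLM call per chunk (max_tokens=80).
    bullets = [llm(f"Summarise in one bullet:\n{chunk}\nBullet:", max_tokens=80)
               for chunk in chunks]

    # Step 2 (Aggregate): dedup + rank in pure Python, no LLM involved.
    ranked = sorted({b.strip() for b in bullets})

    # Step 3 (Dialogue): two short LLM calls per bullet (120 + 80 tokens).
    dialogue = []
    for bullet in ranked:
        host_a = llm(f"Host A explains: {bullet}\nHost A:", max_tokens=120)
        host_b = llm(f"Host B reacts to: {host_a}\nHost B:", max_tokens=80)
        dialogue.append((host_a, host_b))
    return dialogue
```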
All LLM calls enforce:
- `max_tokens: 150` (or lower per task) — forces conciseness, prevents rambling
- `temperature: 0.2–0.3` — reduces hallucination without making output robotic
- `repeat_penalty: 1.1` — suppresses repetition loops common in small models
- Completion-style prompts ending with a colon force the model to fill a blank rather than generate freely
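Concretely, the knobs on a single call look roughly like this (a sketch; the field spelling assumes Cheesebrain's OpenAI-compatible, llama.cpp-style server):

```python
payload = {
    "model": "qwen2.5-0.5b-instruct",  # or empty to let the server pick
    "prompt": (
        "Context: ...\n"
        "Question: ...\n"
        "Answer (one short sentence, use exact words from the context above):"
    ),
    "max_tokens": 150,       # forces conciseness
    "temperature": 0.2,      # low randomness, fewer hallucinations
    "repeat_penalty": 1.1,   # damps repetition loops in small models
}
```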
Two environment variables tune this behaviour:

| Variable | Default | Description |
|---|---|---|
| `CHEESE_CHAT_TOP_K` | `3` | Max chunks fed to the 0.5B model per chat turn (keep low) |
| `CHEESE_CLOSED_BOOK_THRESHOLD` | `0.35` | Similarity below which "not found" is returned |
PomaiDB's shared library was not found. Set the path explicitly:
```bash
export POMAI_C_LIB=$(pwd)/third_party/pomaidb/build/libpomai_c.so
```

Or add it to `LD_LIBRARY_PATH`:
```bash
export LD_LIBRARY_PATH=$(pwd)/third_party/pomaidb/build:$LD_LIBRARY_PATH
```

PomaiDB's sub-submodule (`third_party/palloc`) must be initialised from inside the PomaiDB directory:

```bash
cd third_party/pomaidb
git submodule update --init third_party/palloc
```

PomaiDB requires CMake 3.20+. Install from cmake.org or via pip:
```bash
pip install cmake --upgrade
```

Export the environment variable before starting the server:

```bash
export RAG_DB_PATH=$(pwd)/rag_db
mkdir -p rag_db
```

If a port is already in use:

```bash
# Check what is using port 8080 / 9090
lsof -i :8080
lsof -i :9090
# Kill the process or change the port in docker-compose.yml
```

Ensure the API is running and `NEXT_PUBLIC_API_URL` points to it:
```bash
curl http://localhost:9090/health
```

Ingestion flow:

```
Browser
  │ drag-drop PDF
  ▼
Studio (Next.js :3000)
  │ POST /api/v1/ingest (multipart)
  ▼
Cheese API (FastAPI :9090)
  │ 1. process_file_with_meta()  — chunk PDF pages with byte offsets
  │ 2. fetch_embedding()         — POST /v1/embeddings → Cheesebrain
  │ 3. put_chunk_with_text()     — write vector to ws_{id}_rag membrane
  │ 4. store_chunk_meta()        — write {file,page,offset} to ws_{id}_meta KV
  │ SSE progress → browser
  ▼
PomaiDB (libpomai_c.so — in-process)
```
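To follow that SSE progress stream outside the browser, a sketch in Python (the `data:` framing is standard SSE; the event payload is documented under `/v1/jobs/{job_id}/stream` above):

```python
import json
import requests

job_id = "REPLACE_ME"  # returned by POST /v1/ingest
url = f"http://localhost:9090/v1/jobs/{job_id}/stream"

with requests.get(url, headers={"X-API-Key": "cheese-admin-key"}, stream=True) as r:
    for line in r.iter_lines():
        # Each SSE event arrives as: data: {"status": ..., "progress": ..., "total": ...}
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(f"{event['status']}: {event['progress']}/{event['total']}")
```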
Chat query:

```
Browser ──────────────────────────────► Cheese API
                                          │ embed query
                                          │ search_rag_membrane()
                                          │ if max_score < 0.35 → "not found"
                                          │ else: build grounded system prompt
                                          │ stream /v1/chat/completions
                                          ▼
                                        Cheesebrain (:8080)
```
MIT — see LICENSE.

