CoreRag

Python 3.12+ | License: MIT


A local-first, privacy-preserving knowledge engine with semantic search, exposed through MCP (for Claude Desktop) and a REST API. Optimized for Apple Silicon.

Features

Search

  • Hybrid Search: Vector (BAAI/bge-m3, 1024d) + BM25 full-text with RRF fusion
  • Cross-Encoder Reranking: ms-marco-MiniLM-L-6-v2
  • HyDE Expansion: Hypothetical document embeddings for better recall
  • Multi-Query Fusion: Parallel query variants merged via RRF
  • Time-Decay Scoring: Recent documents weighted higher
  • Collection Tags: Filter searches by tagged document groups
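
The hybrid and multi-query features above both merge ranked lists with Reciprocal Rank Fusion (RRF). A minimal sketch of the technique, assuming each retriever returns an ordered list of document IDs; the function name `rrf_fuse` and the constant `k=60` (a commonly used default) are illustrative and are not taken from CoreRag's code:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Hypothetical helper, not CoreRag's API.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative inputs: one ranking from vector search, one from BM25.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, bm25_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise above documents that appear in only one, without needing the two retrievers' raw scores to be comparable.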

Ingestion Pipeline

  • Inbox Workflow: Drop files, auto-process via watchdog or dashboard batch
  • Human-in-the-Loop: Web dashboard with skip/error management, quality report banner, per-detection redaction editor
  • Dual RAG Databases: Main (redacted for cloud-safe search) + Restricted (unredacted for local-only access)
  • Document Catalog: SQLite catalog tracking every file across all destinations (RAG, Obsidian, archive)
  • Three-Layer PII Detection: Presidio NER + custom dictionary + LLM advisory, with per-detection Keep/Redact toggles
  • Smart Filing: Archive to ~/Documents/PKM/, export markdown with LLM tags + summaries to Obsidian vault
  • Per-Agent Access Control: Settings tab with per-action permission toggles per agent, API key management
  • Archive Manager: Browse, search, filter cataloged documents; cold storage migration with folder structure replication
  • Parent-Child Chunking: Context-preserving hierarchical chunks with quality scoring
  • Corrective RAG: Post-retrieval relevance filtering (correct/ambiguous/incorrect)

Multi-Format Support

  • Documents: PDF (with OCR fallback), DOCX, TXT, Markdown, JSON, YAML, CSV
  • Spreadsheets: XLSX, XLS, XLSM (markdown table output per sheet)
  • Code: Python, JavaScript, TypeScript, JSX, TSX, Go, Rust, Java, Ruby (AST + line-based chunking)
  • Images: PNG, JPG, WebP, HEIC (Vision.framework OCR + VLM captioning)
  • Audio: MP3, WAV, M4A (mlx-whisper transcription + topic segmentation)
  • Video: MP4, MOV (keyframe extraction + scene detection + audio)

Quality Assurance

  • LLM-Powered Tagging: Purpose-driven collection tags (replaces the keyword auto-tagger), with the document year added as a tag
  • Duplicate Detection: Content hash + MinHash/LSH + semantic similarity
  • Link Checker: Async URL validation with caching
  • Freshness Indicators: Age classification + staleness warnings
  • Conflict Detection: Find contradictions across documents
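
The first two duplicate-detection layers above can be illustrated with the standard library alone. This sketch uses a SHA-256 content hash for exact duplicates and exact Jaccard similarity over word shingles for near duplicates; Jaccard is the quantity that MinHash/LSH approximates at scale, so this is a simplified stand-in and not CoreRag's implementation. All function names are hypothetical:

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of lowercased, whitespace-normalized text (exact-duplicate check)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Exact Jaccard similarity of the two texts' shingle sets."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "the quick brown fox jumps over the lazy dog"
reformatted = "The  quick brown fox jumps over the lazy dog"
edited = "the quick brown fox leaps over the lazy dog"

print(content_hash(original) == content_hash(reformatted))  # True
print(jaccard(original, edited))  # 0.4
```

A single changed word touches three of the seven shingles, which is why the near-duplicate score drops to 0.4 even though the texts are almost identical; smaller shingle sizes are more forgiving.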

Advanced

  • GraphRAG: Bitemporal knowledge graph with confidence decay
  • Episodic Memory: User context and search pattern tracking
  • Rate-Limited REST API: Authenticated v1 endpoints with slowapi
  • MCP Server: Full tool suite for Claude Desktop integration
  • Memory Safety: Auto-pause at high RAM usage, GC between files
  • Security Hardening: CSRF protection, XSS escaping, PII redaction fail-safe, asyncio.to_thread for blocking I/O, LanceDB connection caching, thread-safe embedding singleton, OrderedDict LRU cache
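
The OrderedDict LRU cache mentioned above follows a well-known pattern. A minimal single-threaded sketch, with a hypothetical `LRUCache` class that is not CoreRag's own; a thread-safe version would additionally guard `get` and `put` with a lock:

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """Least-recently-used cache backed by an OrderedDict (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: OrderedDict[Hashable, Any] = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touching "a" makes "b" the oldest entry
cache.put("c", 3)   # exceeds capacity, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))  # None 1 3
```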

Quick Start

# Clone and setup
git clone https://github.com/TJ-Neary/CoreRag.git
cd CoreRag
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_lg

# Copy and configure environment
cp .env.example .env
# Edit .env with your paths

# Install menu bar app (auto-starts server at login)
./scripts/install_menubar.sh

# Or start manually
./scripts/run_system.sh

See StartHere.md for detailed setup instructions.

Usage

CLI

python -m src.cli.main status                          # System status
python -m src.cli.main search "your query"             # Search knowledge base
python -m src.cli.main ingest /path/to/folder -r -t mytag  # Ingest with tags
python -m src.cli.main health                          # System health checks
python -m src.cli.main check-links /path               # Find broken links
python -m src.cli.main duplicates /path                # Find duplicates
python -m src.cli.main stale /path --days 365          # Find stale content
python -m src.cli.main tag /path                       # Auto-tag files
python -m src.cli.main pii list                        # Manage PII dictionary
python -m src.cli.main optimize-db                     # Optimize LanceDB
python -m src.cli.main backup create                   # Create backup
python -m src.cli.main graph stats                     # Knowledge graph stats
python -m src.cli.main memory list                     # Episodic memory

REST API (v1)

# Capability manifest (no auth required)
curl http://localhost:8000/api/v1/manifest

# Search (with optional tag filtering)
curl -X POST http://localhost:8000/api/v1/search \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $CORERAG_API_KEY" \
  -d '{"query": "authentication setup", "k": 5, "tags": ["sphr-study"]}'

# Ingest content
curl -X POST http://localhost:8000/api/v1/ingest \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $CORERAG_API_KEY" \
  -d '{"content": "...", "source": "my-app", "metadata": {}}'

# Stats
curl -H "X-API-Key: $CORERAG_API_KEY" http://localhost:8000/api/v1/stats

# Delete a document (replace {id} with the document's ID)
curl -X DELETE -H "X-API-Key: $CORERAG_API_KEY" http://localhost:8000/api/v1/documents/{id}

MCP (Claude Desktop)

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "corerag": {
      "command": "/path/to/CoreRag/venv/bin/python",
      "args": ["-m", "src.mcp_server.server"],
      "cwd": "/path/to/CoreRag"
    }
  }
}

Dashboard

python -m src.server    # http://localhost:8000

Web UI for reviewing AI-proposed metadata, editing tags, marking sensitivity, and committing documents through the pipeline.

Configuration

Create .env from the example:

cp .env.example .env

Key variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| INBOX_PATH | ~/Desktop/Inbox | Watched folder for new documents |
| VAULT_PATH | ~/Documents/ObsidianVault | Obsidian vault for markdown exports |
| ARCHIVE_PATH | ~/Documents | Long-term storage for originals |
| CORERAG_DB_PATH | ~/.corerag/lancedb | LanceDB vector database |
| CORERAG_API_KEY | (unset) | API key for v1 endpoints (omit for open access) |
| OLLAMA_MODEL | qwen3:32b | Local LLM for document analysis |
| CORERAG_EMBEDDING_MODEL | BAAI/bge-m3 | Embedding model (1024d) |

Technology Stack

| Component | Technology |
| --- | --- |
| Vector Database | LanceDB (embedded, Lance format) |
| Embeddings | BAAI/bge-m3 (1024d, MPS-optimized) |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| LLM | Ollama (qwen3:32b, local) |
| Audio | mlx-whisper (Apple Silicon) |
| Video | OpenCV (keyframe + scene detection) |
| OCR | Vision.framework (native macOS) |
| VLM | LLaVA (optional image captioning) |
| PII | Presidio + spaCy + custom dictionary |
| MCP | FastMCP (stdio transport) |
| Web | FastAPI + Jinja2 |
| Rate Limiting | slowapi |

Testing

pytest                           # Full suite with coverage
pytest -m "not slow"             # Skip slow tests
pytest -m "not integration"      # Skip integration tests
pytest -k "test_name"            # Single test

Development

black src/ tests/ --line-length 100    # Format
ruff check src/ tests/                 # Lint
mypy src/                              # Type check
./scripts/security_scan.sh --staged    # Security scan before commit

See CONVENTIONS.md for coding standards and CLAUDE.md for AI agent instructions.

License

MIT
