GitHub Repository Ingestion

Java Chat supports GitHub source-code ingestion into dedicated hybrid Qdrant collections, with canonical repository identity and incremental updates.

Goals

Use one canonical repository identity for all ingestion entrypoints.
Prevent local-path and URL ingestion from creating duplicate variants of the same repo.
Reindex only changed files on reruns.
Provide batch sync for all existing github-* collections.

Canonical repository identity

Every repository is identified by its canonical, lowercased owner/repository pair.

Derived values:

repoKey: owner/repository
repoUrl: https://github.com/owner/repository
canonical collection name:
- default: github-owner-repository (when owner/repo use only [a-z0-9-])
- otherwise: a collision-safe encoded form with a hash suffix, used when the owner or repository name contains other punctuation (for example underscores or dots)

The ingestion pipeline stores this identity in payload metadata:

repoKey
repoUrl
repoOwner
repoName
repoBranch
commitHash

docSet for GitHub source files is github/owner/repository.
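
The derivation above can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: the function name `canonical_identity`, the `collection` dictionary key, the character replacement, and the 8-character SHA-256 suffix are all assumptions, since the exact encoding of the collision-safe form is not specified here.

```python
import hashlib
import re
from urllib.parse import urlparse

# Names made only of these characters use the default collection-name form.
_SAFE = re.compile(r"[a-z0-9-]+")


def canonical_identity(repo_url: str) -> dict:
    """Derive the canonical identity values from an https GitHub URL."""
    path = urlparse(repo_url).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    parts = path.lower().split("/")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"not an owner/repository URL: {repo_url}")
    owner, name = parts[0], parts[1]
    repo_key = f"{owner}/{name}"

    if _SAFE.fullmatch(owner) and _SAFE.fullmatch(name):
        collection = f"github-{owner}-{name}"
    else:
        # ASSUMPTION: the real encoding and hash suffix may differ. Here,
        # unsafe characters become "-" and a short hash of repoKey keeps
        # punctuation variants (my_repo vs my.repo) apart.
        encoded = re.sub(r"[^a-z0-9-]", "-", f"{owner}-{name}")
        suffix = hashlib.sha256(repo_key.encode("utf-8")).hexdigest()[:8]
        collection = f"github-{encoded}-{suffix}"

    return {
        "repoKey": repo_key,
        "repoUrl": f"https://github.com/{repo_key}",
        "repoOwner": owner,
        "repoName": name,
        "collection": collection,
        "docSet": f"github/{repo_key}",
    }
```

Because the key is lowercased before anything else is derived, differently cased URLs for the same repository resolve to the same collection and docSet.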

Commands

All GitHub ingestion commands run in headless CLI mode (spring.main.web-application-type=none), so they do not bind an HTTP port and can run concurrently with the main app or other ingestion jobs. Each CLI invocation runs in its own short-lived JVM process and exits automatically after completion.

Runtime configuration precedence for this pipeline (highest priority first):

CLI flags / inline Make variables
Exported shell environment variables
.env values (when .env exists)
Script defaults
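
As a rough model of this lookup, the sketch below walks the four sources in the listed order and returns the first value it finds. It assumes the list is ordered from highest to lowest priority; `resolve_setting` and its parameters are hypothetical names, not part of the scripts.

```python
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli: Mapping[str, str],
    shell_env: Mapping[str, str],
    dotenv: Mapping[str, str],
    defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the value from the first source that defines `name`."""
    # ASSUMPTION: earlier sources win over later ones.
    for source in (cli, shell_env, dotenv, defaults):
        if name in source:
            return source[name]
    return None
```

Under this model, an inline `REPO_CACHE_DIR=... make process-github-repo` would win over a value of the same name in .env.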

Ingest from a local clone

REPO_PATH=/absolute/path/to/repository make process-github-repo

When REPO_PATH mode is used, repository identity is resolved from the local clone's origin remote (.git/config). REPO_URL is optional and only needed to override metadata explicitly.
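
To make the origin lookup concrete, here is a minimal sketch that reads the origin URL out of the text of a .git/config file and reduces it to the canonical owner/repository key. The helper names are hypothetical, and support for SSH-style remotes (`git@github.com:owner/repository.git`) is an assumption; the actual pipeline may resolve the remote differently, for example by calling git.

```python
import re
from typing import Optional

_SECTION = re.compile(r'^\[\s*(\w+)(?:\s+"([^"]*)")?\s*\]$')
_GITHUB = re.compile(r"github\.com[:/]([^/\s]+)/([^/\s]+?)(?:\.git)?/?$")


def origin_url(git_config_text: str) -> Optional[str]:
    """Return the url of [remote "origin"], or None if it is not configured."""
    in_origin = False
    for raw in git_config_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            header = _SECTION.match(line)
            in_origin = (
                header is not None
                and header.group(1).lower() == "remote"
                and header.group(2) == "origin"
            )
            continue
        if in_origin and "=" in line:
            key, value = line.split("=", 1)
            if key.strip().lower() == "url":
                return value.strip()
    return None


def repo_key_from_remote(url: str) -> str:
    """Reduce an https or SSH-style GitHub remote to lowercase owner/repository."""
    match = _GITHUB.search(url.strip())
    if match is None:
        raise ValueError(f"not a GitHub remote: {url}")
    return f"{match.group(1)}/{match.group(2)}".lower()
```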

Ingest from a GitHub URL

REPO_URL=https://github.com/owner/repository make process-github-repo

URL mode clones the repository into a local cache, or refreshes the existing clone there. The default cache root is the project-local path data/repos/github.

Optional cache override:

REPO_URL=https://github.com/owner/repository REPO_CACHE_DIR=/tmp/repo-cache make process-github-repo

Per-repo exact cache path override:

REPO_URL=https://github.com/owner/repository REPO_CACHE_PATH=/tmp/repos/owner/repository make process-github-repo
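
How the two overrides might interact is sketched below. The function name is hypothetical, and two points are assumptions rather than documented behavior: that REPO_CACHE_PATH takes priority over REPO_CACHE_DIR, and that clones sit under the cache root as owner/repository.

```python
from pathlib import Path
from typing import Optional

DEFAULT_CACHE_ROOT = Path("data/repos/github")


def resolve_clone_path(
    repo_key: str,
    cache_path: Optional[str] = None,
    cache_dir: Optional[str] = None,
) -> Path:
    """Pick the directory where a repository is cloned or refreshed."""
    if cache_path:
        # An exact per-repo path is used as given.
        return Path(cache_path)
    root = Path(cache_dir) if cache_dir else DEFAULT_CACHE_ROOT
    owner, name = repo_key.split("/", 1)
    # ASSUMPTION: the directory layout under the cache root is owner/name.
    return root / owner / name
```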

Batch sync existing GitHub collections

SYNC_EXISTING=1 make process-github-repo

Batch sync flow:

Discover all Qdrant collections prefixed with github-.
Read repoUrl/repoKey and commitHash payload metadata from each collection.
Resolve remote HEAD commit (git ls-remote <repoUrl> HEAD).
Reingest only collections where remote HEAD differs from stored commitHash.
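
The comparison in the last two steps can be sketched as follows. `git ls-remote <repoUrl> HEAD` prints the commit hash and the ref name separated by a tab; the function names and the dictionary shapes are hypothetical, and skipping a collection whose remote HEAD could not be resolved is an assumption.

```python
from typing import List, Mapping, Optional


def parse_ls_remote_head(output: str) -> Optional[str]:
    """Extract the commit hash from `git ls-remote <repoUrl> HEAD` output."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "HEAD":
            return fields[0]
    return None


def collections_to_reingest(
    stored_commits: Mapping[str, str],
    remote_heads: Mapping[str, Optional[str]],
) -> List[str]:
    """Return github-* collections whose remote HEAD differs from commitHash."""
    stale = []
    for name in sorted(stored_commits):
        if not name.startswith("github-"):
            continue
        head = remote_heads.get(name)
        # ASSUMPTION: an unresolved remote HEAD leaves the collection untouched.
        if head is not None and head != stored_commits[name]:
            stale.append(name)
    return stale
```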

Incremental update behavior

Per file, ingestion stores marker metadata in data/index/file_*.marker:

file size
mtime
content fingerprint (SHA-256)
ingested chunk hashes

On rerun:

unchanged file + sufficient Qdrant points: skipped
changed file: strict prune and reindex

Strict prune removes:

stale points for that file URL in the target collection
old chunk hash markers
old parsed chunk text files
old file marker

Then the file is chunked and upserted again.
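
The skip-or-reindex decision can be modeled roughly as below. The `FileMarker` class and `plan_file` function are hypothetical; the on-disk marker format is not described here. The sketch also assumes that "sufficient Qdrant points" means at least one point per recorded chunk hash, and that an unchanged file with too few points is reindexed. The stored mtime is carried in the marker but not consulted.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileMarker:
    size: int
    mtime: float
    sha256: str
    chunk_hashes: List[str]


def fingerprint(path: str) -> str:
    """SHA-256 of the file content, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def plan_file(path: str, marker: Optional[FileMarker], point_count: int) -> str:
    """Return "skip" for an unchanged, fully indexed file, else "reindex"."""
    if marker is None:
        return "reindex"  # never ingested before
    # Cheap size check first, then the content fingerprint.
    if os.stat(path).st_size != marker.size or fingerprint(path) != marker.sha256:
        return "reindex"
    # ASSUMPTION: one Qdrant point is expected per recorded chunk hash.
    if point_count < len(marker.chunk_hashes):
        return "reindex"
    return "skip"
```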

Failure diagnostics and retry behavior

GitHub ingestion fails fast when embedding or vector writes fail.
Preflight validates remote embedding payload quality before ingestion starts:
- plain text probe
- code-like multiline probe
- probe failures include explicit endpoint/model context and payload anomaly details
- plain-text failures stop immediately
- code-like failures stop by default; set EMBEDDING_CODE_PROBE_MODE=warn to continue explicitly
When null or invalid vectors are detected, diagnostics state the likely causes explicitly:
- wrong endpoint (must resolve to /v1/embeddings)
- non-embedding model
- provider payload bug
Remote embedding calls use exponential backoff retries for transient HTTP errors (for example 429, 5xx, and provider-side 400: null gateway responses).
Remote embedding response-shape failures that are typically transient (for example null/missing embedding entries) also retry with exponential backoff before failing terminally.
On terminal failure, scripts/process_github_repo.sh prints a failure summary extracted from process_github_repo.log, including:
- failure source classification (AI Embedding API, Qdrant API, GitHub API, or Application/Unknown),
- explicit rate-limit diagnosis (No rate limit detected or detected API + evidence),
- the last root-cause headline (for example, file URL and failing embedding batch),
- the final exception cause chain (Caused by),
- recent retry/error trace lines,
- a compact tail of significant error lines.
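
The retry behavior described above can be illustrated with a small sketch that treats both transient HTTP errors and null or invalid embedding entries as retryable, then fails terminally once attempts run out. All names and numbers here (`embed_with_retry`, `TransientEmbeddingError`, four attempts, a 0.5 s base delay) are assumptions for illustration and do not reflect the pipeline's actual settings.

```python
import math
import time
from typing import Callable, List, Optional


class TransientEmbeddingError(Exception):
    """Raised for failures worth retrying (for example 429, 5xx, null vectors)."""


def is_valid_vector(vector: Optional[List[float]]) -> bool:
    """A usable embedding is a non-empty list of finite numbers."""
    return bool(vector) and all(
        isinstance(v, (int, float)) and math.isfinite(v) for v in vector
    )


def embed_with_retry(
    call: Callable[[], Optional[List[List[float]]]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            vectors = call()
            if not vectors or not all(is_valid_vector(v) for v in vectors):
                raise TransientEmbeddingError("null or invalid embedding entry")
            return vectors
        except TransientEmbeddingError:
            if attempt == max_attempts:
                raise  # terminal failure: the caller fails fast
            # The delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the backoff schedule easy to test without waiting.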

Collection and payload indexes

GitHub collections are created (if missing) with a hybrid vector schema copied from a reference collection (java-docs by default). Payload indexes are then ensured for:

core fields: url, hash, chunkIndex, docSet, docPath, sourceName, sourceKind, docVersion, docType
GitHub fields: filePath, language, repoUrl, repoOwner, repoName, repoKey, repoBranch, commitHash, license, repoDescription
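
"Ensuring" indexes amounts to creating only the ones that are missing. The sketch below shows that check in isolation, using the field names listed above; `missing_payload_indexes` is a hypothetical helper, and the actual Qdrant calls and per-field index types are left out because they are not specified here.

```python
from typing import Iterable, List

CORE_FIELDS = [
    "url", "hash", "chunkIndex", "docSet", "docPath",
    "sourceName", "sourceKind", "docVersion", "docType",
]
GITHUB_FIELDS = [
    "filePath", "language", "repoUrl", "repoOwner", "repoName",
    "repoKey", "repoBranch", "commitHash", "license", "repoDescription",
]


def missing_payload_indexes(existing: Iterable[str]) -> List[str]:
    """Return required index fields that the collection does not have yet."""
    have = set(existing)
    return [field for field in CORE_FIELDS + GITHUB_FIELDS if field not in have]
```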

Environment variables

Common optional variables for GitHub ingestion:

REPO_PATH
REPO_URL
REPO_CACHE_DIR
REPO_CACHE_PATH (single-repo URL mode only)
SYNC_EXISTING
QDRANT_REFERENCE_COLLECTION (default java-docs)

Qdrant connectivity and embedding provider variables follow the existing pipeline conventions in docs/configuration.md.
