Version: 1.0 (WebEncode v2025)
Role: System Administrator / DevOps
WebEncode runs either as a minimal single-node stack or in a High Availability (HA) topology.

Single node (development / small deployments):
- 1x Kernel Node: manages state and the API.
- 1x Worker Node: processes transcoding jobs (CPU/GPU).
- 1x NATS Server: message bus (JetStream enabled).
- 1x PostgreSQL: persistent metadata storage.
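The single-node topology above can be sketched as a Compose file. This is a hypothetical sketch, not a shipped artifact: the `webencode/*` image names match the CI build targets below, but the passwords and hostnames are placeholders.

```yaml
# Hypothetical single-node stack; credentials and hostnames are placeholders
services:
  nats:
    image: nats:latest
    command: ["-js"]            # JetStream must be enabled
  postgres:
    image: postgres:17
    environment:
      POSTGRES_PASSWORD: webencode
  kernel:
    image: webencode/kernel:latest
    environment:
      NATS_URL: nats://nats:4222
      DATABASE_URL: postgres://postgres:webencode@postgres:5432/postgres
    ports:
      - "8080:8080"
    depends_on: [nats, postgres]
  worker:
    image: webencode/worker:latest
    environment:
      NATS_URL: nats://nats:4222
    depends_on: [nats]
```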
High Availability (production):
- 3x Kernel Nodes: load balanced (stateless API).
- N+1 Worker Nodes: auto-scaling based on queue depth.
- 3x NATS Servers: clustered for resilient messaging.
- Postgres Primary + Replica: data safety.
Configuration is handled via Environment Variables or .env file.
| Variable | Default | Description |
|---|---|---|
| `NATS_URL` | `nats://localhost:4222` | Connection string for NATS JetStream. |
| `DATABASE_URL` | `postgres://...` | Postgres connection (Kernel only). |
| `PLUGIN_DIR` | `./plugins` | Directory containing compiled plugin binaries. |
| `PORT` | `8080` | Kernel API bind port. |
| `WORKER_ID` | hostname | Unique identifier for worker nodes (Worker only). |
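A minimal `.env` for a Kernel node might look like this; the hostnames and credentials are illustrative values, not defaults:

```
# Example .env for a Kernel node (illustrative values)
NATS_URL=nats://nats.internal:4222
DATABASE_URL=postgres://webencode:changeme@db.internal:5432/webencode?sslmode=require
PLUGIN_DIR=/opt/webencode/plugins
PORT=8080
```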
Run migrations before starting the Kernel:

```shell
# Using golang-migrate
migrate -path pkg/db/migrations -database "$DATABASE_URL" up
```

Ensure the plugins directory is populated with binaries (`make build`), then start the Kernel:

```shell
./bin/kernel
```

Health check: `curl http://localhost:8080/v1/health`
Workers auto-register upon connection to NATS.
```shell
export WORKER_ID=worker-gpu-01
./bin/worker
```

Verification: check the Kernel logs or `GET /v1/workers`.
Plugins are subprocesses managed by the Kernel/Worker.
- Install: place the binary in `PLUGIN_DIR` and update `plugins.toml` (if using a manifest) or register it via the API.
- Upgrade: replace the binary and restart the Kernel/Worker (hot reload planned).
All components emit structured JSON logs to stderr.
- Level: Info by default.
- Fields: `service`, `level`, `msg`, `error`.
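Because logs are JSON on stderr, they can be filtered with standard tools. The log lines below are illustrative samples following the field list above; in a real deployment you would pipe captured stderr instead of a heredoc.

```shell
# Extract error-level entries from captured stderr (sample log lines inlined here)
cat <<'EOF' | grep '"level":"error"'
{"service":"kernel","level":"info","msg":"worker registered"}
{"service":"worker","level":"error","msg":"task failed","error":"plugin exited with code 1"}
EOF
```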
1. "NATS Connection Failed"
- Ensure NATS JetStream is enabled (`nats-server -js`).
- Check firewall rules between Kernel/Worker and NATS.
2. "Plugin Mismatch"
- Error: `Incompatible API version`
- Cause: Kernel and plugin built with different SDK versions.
- Fix: rebuild both against the same `pkg/api` version.
3. "Job Stuck in Pending"
- Cause 1: No healthy workers connected.
- Cause 2: Workers lack capabilities (e.g., the job requires `nvidia`, the worker has `cpu`).
- Fix: check `GET /v1/workers` for capabilities and status.
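To see which workers advertise a given capability, filter the `GET /v1/workers` output. The payload shape below is an assumption for illustration; real output comes from `curl http://localhost:8080/v1/workers`.

```shell
# Hypothetical /v1/workers payload, one worker per line for grep-ability
cat <<'EOF' | grep '"nvidia"'
{"id":"worker-gpu-01","status":"healthy","capabilities":["nvidia","cpu"]}
{"id":"worker-cpu-02","status":"healthy","capabilities":["cpu"]}
EOF
```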
4. "Database Migration Failed"
- Cause: Dirty state from failed previous migration.
- Fix: `migrate force <version>` (use with caution).
Recommended upgrade sequence:
1. Stop the Kernel (users cannot submit new jobs).
2. Upgrade the DB schema (run migrations).
3. Upgrade plugins (replace binaries).
4. Start the Kernel.
5. Rolling-upgrade workers (wait for idle, or drain).
- Postgres: regular `pg_dump`.
- NATS: JetStream retention limits are configured to 90 days for audit logs; back up if critical.
- Storage: External (S3/FS) - Managed separately.
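The hourly `pg_dump` tier from the backup strategy can be a cron entry. This is a sketch: the S3 bucket name is a placeholder, and it assumes the AWS CLI and `DATABASE_URL` are available in the cron environment.

```
# Illustrative crontab entry: hourly compressed pg_dump shipped to S3 (bucket is a placeholder)
0 * * * * pg_dump "$DATABASE_URL" | gzip | aws s3 cp - s3://webencode-backups/pg/backup-$(date +\%Y\%m\%d\%H).sql.gz
```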
```sql
-- Create indexes for common queries
CREATE INDEX idx_jobs_user_created ON jobs(user_id, created_at DESC) WHERE status != 'completed';
CREATE INDEX idx_tasks_job_status ON tasks(job_id, status);
CREATE INDEX idx_streams_user_live ON streams(user_id) WHERE is_live = true;

-- Analyze tables for the query planner
VACUUM ANALYZE jobs;
VACUUM ANALYZE tasks;
VACUUM ANALYZE streams;

-- Tune PostgreSQL for heavy workloads
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET shared_buffers = '1GB';        -- takes effect only after a server restart
ALTER SYSTEM SET effective_cache_size = '4GB';
SELECT pg_reload_conf();                        -- applies the reloadable settings above
```

```
# nats-server.conf
max_connections = 100000
max_subscriptions = 100000
jetstream {
  max_mem_store = 2GB
  max_file_store = 50GB
}
```

```shell
# Environment variables for the Go runtime
export GOGC=75         # GOGC below the default 100 runs GC more often, trading CPU for a smaller heap
export GOMEMLIMIT=4GiB # Soft memory limit (Go 1.19+)
export GOMAXPROCS=0    # Non-positive values are ignored; the runtime uses all available cores
```

- TLS Everywhere: enable TLS between all services.
- Firewall Rules: Only allow necessary ports (8080 for API, 4222 for NATS).
- Network Segmentation: Isolate workers in separate VLAN.
- Use the `auth-oidc` plugin in production (Keycloak, Auth0, Okta).
- Set `OIDC_ISSUER_URL`, `OIDC_CLIENT_ID`, `OIDC_CLIENT_SECRET`.
- Never use dev-mode tokens in production.
- Store credentials in environment variables or secrets manager.
- Never commit `.env` files with production secrets.
- Rotate access tokens for publisher plugins regularly.
```yaml
name: Build & Deploy
on:
  push:
    branches: [main]
    tags: [v*]
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:17
        env:
          POSTGRES_PASSWORD: test
      nats:
        image: nats:latest
        options: --js
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.24'
      - run: go test ./... -v -coverprofile=coverage.out
      - uses: codecov/codecov-action@v4
  build:
    needs: test
    runs-on: ubuntu-latest
    strategy:
      matrix:
        target: [kernel, worker]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -f docker/${{ matrix.target }}.Dockerfile -t webencode/${{ matrix.target }}:${{ github.ref_name }} .
      - run: docker push webencode/${{ matrix.target }}:${{ github.ref_name }}
```

- RTO (Recovery Time Objective): 5 minutes (restart all services)
- RPO (Recovery Point Objective): 1 minute (NATS replication lag)
| Tier | Method | Frequency | Retention |
|---|---|---|---|
| 1 | PostgreSQL streaming replication | Continuous | Real-time |
| 2 | pg_dump to S3 | Hourly | 30 days |
| 3 | Full snapshot | Weekly | 90 days |
Scenario: Database Failure
- Promote PostgreSQL replica to primary.
- Update `DATABASE_URL` in all Kernel instances.
- Restart Kernel services.
- Verify connectivity: `curl /v1/system/health`.
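Promoting the streaming replica can be done from SQL on the replica itself (PostgreSQL 12+); `pg_ctl promote` is the equivalent from the shell. This is a sketch of the standard PostgreSQL mechanism, not a WebEncode-specific command:

```sql
-- Run on the replica: ends recovery and starts accepting writes (PostgreSQL 12+)
SELECT pg_promote();
```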
Scenario: NATS Cluster Failure
- If quorum lost, restore from latest snapshot.
- Restart NATS cluster with clean data directory.
- Workers will automatically reconnect and resubscribe.
- In-flight tasks will be retried (idempotent design).
Scenario: Complete Site Failure
- Activate standby Kubernetes cluster in DR region.
- Restore PostgreSQL from S3 backup.
- Update DNS to point to DR site.
- Start all services with restored configuration.
WebEncode exposes metrics at /metrics:
- `webencode_jobs_total{status}`: total jobs by status.
- `webencode_tasks_processing`: currently processing tasks.
- `webencode_workers_healthy`: number of healthy workers.
- `webencode_streams_live`: number of live streams.
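Gauge values can be pulled out of the Prometheus text exposition with standard tools. The sample scrape below is illustrative; real output comes from `curl http://localhost:8080/metrics`.

```shell
# Extract the healthy-worker gauge from a sample /metrics scrape
cat <<'EOF' | awk '$1 == "webencode_workers_healthy" {print $2}'
webencode_jobs_total{status="completed"} 1523
webencode_jobs_total{status="queued"} 12
webencode_workers_healthy 4
webencode_streams_live 2
EOF
```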
| Alert | Condition | Severity |
|---|---|---|
| No Healthy Workers | `webencode_workers_healthy == 0` | Critical |
| Job Queue Backlog | `webencode_jobs_total{status="queued"} > 100` | Warning |
| High Error Rate | `rate(webencode_errors_total[5m]) > 0.1` | Warning |
| Database Connection Pool | `pg_stat_activity_count > 80%` | Warning |
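The alert table translates directly into Prometheus alerting rules. A sketch for the first three rows follows; the `for` durations are illustrative choices, not values from the source.

```yaml
# Sketch of Prometheus alerting rules for the table above ("for" durations are assumptions)
groups:
  - name: webencode
    rules:
      - alert: NoHealthyWorkers
        expr: webencode_workers_healthy == 0
        for: 2m
        labels:
          severity: critical
      - alert: JobQueueBacklog
        expr: webencode_jobs_total{status="queued"} > 100
        for: 10m
        labels:
          severity: warning
      - alert: HighErrorRate
        expr: rate(webencode_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
```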
Import the provided dashboard JSON from docs/grafana-dashboard.json (if available) or create custom panels using the above metrics.