After imaging and booting the appliance for the first time:
sudo /usr/libexec/secure-ai/first-boot-check.shThis validates:
- All core services are running
- Health endpoints respond
- Attestation state is verified
- Integrity monitor baseline is established
- No open incidents
- Service token is present
- No services exposed on public interfaces
-
Pre-upgrade snapshot (if supported by hardware):
rpm-ostree status # Record current deployment sudo cp /var/lib/secure-ai/data/incidents.jsonl /tmp/incidents-backup.jsonl -
Pull update:
rpm-ostree upgrade
-
Reboot and verify:
sudo systemctl reboot # After reboot: sudo /usr/libexec/secure-ai/first-boot-check.sh -
Rollback if needed:
rpm-ostree rollback sudo systemctl reboot
Audit logs are rotated automatically via /etc/logrotate.d/secure-ai:
- Audit JSONL (
/var/lib/secure-ai/logs/*.jsonl): daily, 30 days retained, max 50MB per file - Incident store (
/var/lib/secure-ai/data/*.jsonl): weekly, 12 weeks retained, max 100MB
To manually trigger rotation:
sudo logrotate -f /etc/logrotate.d/secure-aiThe inter-service bearer token at /run/secure-ai/service-token:
-
Generate new token:
openssl rand -hex 32 > /tmp/new-token -
Replace atomically:
sudo mv /tmp/new-token /run/secure-ai/service-token sudo chmod 0640 /run/secure-ai/service-token
-
Restart all services (they read token at startup):
sudo systemctl restart secure-ai-*.service
-
Generate new key:
openssl rand 32 > /tmp/new-hmac-key -
Replace and restart attestor:
sudo mv /tmp/new-hmac-key /run/secure-ai/attestation-hmac-key sudo chmod 0640 /run/secure-ai/attestation-hmac-key sudo systemctl restart secure-ai-runtime-attestor
The cosign signing key is used to sign:
- OCI container images (
ghcr.io/secai-hub/secai_os:*) - SBOM attestations (CycloneDX per-service)
- Release checksums (
SHA256SUMS.sig) - SLSA provenance attestations
# Generate a new cosign key pair (interactive passphrase prompt)
cosign generate-key-pair
# This creates cosign.key (private) and cosign.pub (public)
# Store cosign.key in a password manager or HSM — never commit to git| Trigger | Action |
|---|---|
| Annual (recommended) | Proactive rotation, even with no incident |
| Key compromise | Immediate emergency rotation |
| Personnel change | Rotate if key holder leaves the project |
| CI provider breach | Rotate if GitHub Actions secrets may be exposed |
-
Generate new key pair (on an air-gapped machine if possible):
cosign generate-key-pair
-
Update GitHub repository secret:
- Go to: Settings → Secrets and variables → Actions
- Update
SIGNING_SECRETwith the newcosign.keycontents - Verify the secret is updated (name shows "Updated just now")
-
Update the public key in deployed appliances:
# Copy the new cosign.pub to the appliance scp cosign.pub admin@appliance:/tmp/cosign.pub # On the appliance (requires local admin access): sudo cp /tmp/cosign.pub /etc/secure-ai/cosign.pub sudo chmod 0644 /etc/secure-ai/cosign.pub
-
Tag a new release to produce signed artifacts with the new key:
git tag -s vX.Y.Z -m "Release vX.Y.Z (key rotation)" git push origin vX.Y.Z -
Verify the new signature:
cosign verify --key cosign.pub ghcr.io/secai-hub/secai_os:vX.Y.Z
If the signing key is compromised:
- Immediately rotate the key (steps above)
- Revoke trust in old images: update all deployed appliances'
cosign.pub - Re-sign the latest stable release with the new key:
cosign sign --key cosign.key ghcr.io/secai-hub/secai_os:latest
- Announce the compromise via GitHub Security Advisory
- Audit CI logs for any unauthorized release activity during the exposure window
- Private key stored in encrypted vault or HSM (never on disk in plaintext)
- Only CI (
SIGNING_SECRET) and key custodian have access to private key - Public key shipped in OS image at
/etc/secure-ai/cosign.pub -
verify-release.shuses the shipped public key, not a remote fetch - Key rotation date recorded in
CHANGELOG.md - Previous public key archived (for verifying older releases)
When HSM support is implemented (planned milestone):
- Private key will be generated inside the HSM and never exported
- Cosign will use
--keywith a PKCS#11 URI instead of a file path - Rotation becomes: generate new key in HSM → update PKCS#11 URI in CI
- See
docs/security-status.mdfor HSM milestone tracking
All services expose /health on their localhost ports:
| Service | Port | Endpoint |
|---|---|---|
| Policy Engine | 8500 | /health |
| Registry | 8470 | /health |
| Tool Firewall | 8475 | /health |
| Runtime Attestor | 8505 | /health |
| Integrity Monitor | 8510 | /health |
| Incident Recorder | 8515 | /health |
| MCP Firewall | 8496 | /health |
| GPU Integrity Watch | 8495 | /health |
# Open incidents
curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.tool
# Attestation state
curl -s http://127.0.0.1:8505/api/v1/verify | python3 -m json.tool
# Integrity status
curl -s http://127.0.0.1:8510/api/v1/status | python3 -m json.tool# All Secure AI services
journalctl -u 'secure-ai-*' --since "1 hour ago"
# Specific service
journalctl -u secure-ai-incident-recorder -f
# Security events only
journalctl -u 'secure-ai-*' -g 'FAIL\|DENIED\|degraded\|violation' --since today| Resource | Service | Default Limit | Notes |
|---|---|---|---|
| Memory | Agent | 512MB | Increase for large context windows |
| Memory | Registry | 128MB | Scales with model count |
| Memory | Policy Engine | 128MB | Scales with rule count |
| CPU | Agent | 50% | Primary workload |
| CPU | GPU Integrity | 15% | Background monitoring |
| Incidents | Recorder | 1000 max | Oldest trimmed; persisted to disk |
| Models | Registry | Unlimited | Bounded by vault size |
All Go services handle SIGTERM for clean shutdown:
- In-flight HTTP requests complete (up to 10s drain)
- Incident recorder flushes persistence file
- Audit log files are closed cleanly
- systemd
TimeoutStopSec=15enforces hard deadline
| Category | Paths | Criticality |
|---|---|---|
| Policy + config | /etc/secure-ai/policy/*.yaml, /etc/secure-ai/config/appliance.yaml, /etc/secure-ai/model-catalog.yaml |
High |
| Incidents | /var/lib/secure-ai/data/incidents.jsonl |
High — audit trail |
| Audit logs | /var/lib/secure-ai/logs/*.jsonl |
High — hash-chained evidence |
| Registry manifest | /var/lib/secure-ai/registry/manifest.json |
Medium — model inventory |
| Signing keys | /var/lib/secure-ai/keys/ (cosign, TPM2) |
Critical |
| LUKS header | Vault partition header | Critical — data recovery |
Note: Model files (GGUF binaries) are NOT included in backups due to their size (4–70 GB each). The registry manifest is backed up, so you know exactly which models to re-download after a restore.
# Full backup (config + logs + keys + manifest + LUKS header)
sudo secai-backup.sh full
# Config only (policy, appliance config, model catalog)
sudo secai-backup.sh config
# Logs and incidents only
sudo secai-backup.sh logs
# Keys and LUKS header only (most sensitive)
sudo secai-backup.sh keys --encrypt
# Full encrypted backup to external USB
sudo secai-backup.sh full --encrypt --output /media/usb/backupssudo secai-backup.sh verify /var/lib/secure-ai/backups/secai-backup-full-20260314-120000.tar.gz# Inspect backup contents before restoring
sudo secai-restore.sh inspect <backup-file>
# Full restore
sudo secai-restore.sh full <backup-file>
# Selective restore (config, logs, or keys)
sudo secai-restore.sh config <backup-file>
sudo secai-restore.sh logs <backup-file>
sudo secai-restore.sh keys <backup-file> # Requires YES confirmation for LUKS headerAfter restore, the script automatically restarts services and runs the health check.
| Environment | Frequency | Retention | Encryption |
|---|---|---|---|
| Production | Daily (config + logs), weekly (full) | 30 days | Required |
| Development | Weekly (full) | 14 days | Optional |
| Pre-upgrade | Immediately before upgrade | Until next successful upgrade verified | Required |
- Store backups on a separate physical device (USB, NAS, air-gapped machine).
- Never store unencrypted key backups on network-attached storage.
- The LUKS header backup + vault passphrase can decrypt the entire vault — treat as highly sensitive.
- Verify backup integrity periodically:
secai-backup.sh verify <file>.
Greenboot triggers automatic rpm-ostree rollback when health checks fail
after boot. The checks are defined in
/etc/greenboot/check/required.d/01-secure-ai-health.sh:
| Check | Failure Condition | Auto-Rollback? |
|---|---|---|
| nftables service | Not active | Yes |
| Registry / Tool Firewall / UI | Enabled but failed to start within 60s | Yes |
| Registry API | /health unreachable after 30s |
Yes |
| Model integrity | SHA256 mismatch against manifest | Yes |
| nftables rules | secure_ai table not loaded |
Yes |
| Critical scripts | securectl / verify-boot-chain / canary-check missing | Yes |
Maximum 2 automatic rollback attempts. After exhaustion, the system halts on the broken deployment for manual intervention (see Break-Glass Scenario 5 below).
| Symptom | Severity | Action | Rationale |
|---|---|---|---|
| Inference quality degraded, all services healthy | Low | Fix forward | Not a security regression |
| Single non-critical service failing (e.g., GPU watch) | Low | Fix forward | Other services compensate |
| Policy engine or tool firewall failing | Critical | Rollback | Security enforcement compromised |
| Attestation stuck in failed state | Critical | Rollback | Trust root broken |
| Multiple services crash-looping | High | Rollback | Systemic regression |
| Disk full preventing log writes | Medium | Fix forward | Clear space, not a code issue |
| Network rules missing or wrong | Critical | Rollback | Default-deny bypassed |
| Incident recorder down | Critical | Rollback | Audit trail broken |
| UI unreachable but API services healthy | Low | Fix forward | Non-critical for security |
# Manual rollback
sudo rpm-ostree rollback
sudo systemctl reboot
# Or via the update verification tool
sudo /usr/libexec/secure-ai/update-verify.sh rollback# Health check
sudo /usr/libexec/secure-ai/first-boot-check.sh
# Check deployment status
rpm-ostree status
# Review journal for the failed deployment
journalctl -u 'secure-ai-*' --since "1 hour ago" -g 'FAIL\|ERROR\|panic'These procedures are for exceptional situations where normal operational tools are unavailable. Each requires physical or console access to the appliance.
Symptoms: All inter-service calls fail with 401. Services are running but cannot communicate.
Diagnosis:
ls -la /run/secure-ai/service-token # Missing or empty?
curl -s http://127.0.0.1:8470/health # Registry returns 401?Recovery:
# Generate a new token
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
sudo chmod 0640 /run/secure-ai/service-token
# Restart all services so they pick up the new token
sudo systemctl restart secure-ai-*.service
# Verify
sudo /usr/libexec/secure-ai/first-boot-check.shSymptoms: Services frozen in degraded mode. Incident recorder shows latched attestation_failure or integrity_violation. Normal recovery ceremony fails because the incident recorder is unreachable.
Diagnosis:
curl -sf http://127.0.0.1:8505/health # Attestor healthy?
curl -sf http://127.0.0.1:8515/health # Incident recorder healthy?
curl -s http://127.0.0.1:8515/api/v1/stats | python3 -m json.toolRecovery Option A (if incident recorder is reachable): Run the recovery ceremony — acknowledge → re-attest → resolve.
Recovery Option B (if incident recorder is unreachable):
# Stop all services
sudo systemctl stop secure-ai-*.service
# Clear panic state
sudo rm -f /run/secure-ai/panic-state.json
# Regenerate service token
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
sudo chmod 0640 /run/secure-ai/service-token
# Restart services
sudo systemctl start secure-ai-*.service
# Run full health check and recovery ceremony
sudo /usr/libexec/secure-ai/first-boot-check.shContext: securectl panic 1 was triggered — vault locked, services stopped,
sessions invalidated. This is fully reversible.
Recovery:
# Unlock the vault (you will be prompted for the passphrase)
sudo cryptsetup open /dev/<vault-partition> secure-ai-vault
sudo mount /dev/mapper/secure-ai-vault /var/lib/secure-ai
# Regenerate service token (invalidated by panic)
openssl rand -hex 32 | sudo tee /run/secure-ai/service-token > /dev/null
sudo chmod 0640 /run/secure-ai/service-token
# Start all services
sudo systemctl start secure-ai-*.service
# Verify
sudo /usr/libexec/secure-ai/first-boot-check.shFind the vault partition: grep secure-ai-vault /etc/crypttab.
Context: rpm-ostree upgrade fails with signature verification errors.
The signing policy (policy.json or cosign public key) is corrupted.
Diagnosis:
cat /etc/containers/policy.json | python3 -m json.tool # Valid JSON?
cat /etc/containers/registries.d/secai-os.yaml # Present?
ls /etc/pki/containers/secai-cosign.pub # Present?Recovery: Re-run the bootstrap script in dry-run mode first, then for real:
curl -sSfL https://raw.githubusercontent.com/SecAI-Hub/SecAI_OS/main/files/scripts/secai-bootstrap.sh \
-o /tmp/secai-bootstrap.sh
sudo bash /tmp/secai-bootstrap.sh --dry-run # verify
sudo bash /tmp/secai-bootstrap.sh # applySee recovery-bootstrap.md for the full manual fallback procedure.
Context: Greenboot hit MAX_ROLLBACKS=2. The system is halted on the
broken deployment. Automatic rollback has stopped to prevent an infinite
reboot loop.
Recovery (requires USB boot media):
- Boot from a Fedora Silverblue USB drive (Live session, not install).
- Mount the system partition:
sudo mount /dev/sda3 /mnt # Adjust device as needed - Reset the rollback counter:
sudo rm -f /mnt/run/secure-ai/rollback-count
- Pin the last known-good deployment:
sudo chroot /mnt rpm-ostree rollback
- Reboot into the system (remove USB):
sudo reboot
See recover-failed-update.md for additional boot loop recovery scenarios.
| Data Class | Minimum Retention | Maximum Retention | Rotation | Notes |
|---|---|---|---|---|
Audit logs (*.jsonl) |
30 days | 90 days | logrotate (daily, 30 rotations) | Hash-chained; broken chains are snapshotted |
| Incident store | 12 weeks | 6 months | logrotate (weekly, 12 rotations) | Latched incidents retained until resolution |
| Forensic bundles | 1 year | Indefinite | Manual export | Export via /api/v1/forensic/export before pruning |
| Backup archives | 30 days (prod) | 90 days | Operator-managed | Encrypted, stored on external media |
| Model files (GGUF) | While promoted | N/A | Manual prune | Quarantined models auto-expire in 30 days |
| LUKS header backup | Indefinite | Indefinite | Manual | Critical for recovery; store offline |
| Panic audit log | 1 year | Indefinite | Not rotated | Emergency event record |
When /var/lib/secure-ai usage exceeds thresholds (check with
df -h /var/lib/secure-ai):
| Usage | Action |
|---|---|
| > 70% | Review and prune quarantined models in /var/lib/secure-ai/quarantine/ |
| > 80% | Archive oldest audit logs to external media, remove unpromoted models from staging |
| > 90% | Emergency: force logrotate (sudo logrotate -f /etc/logrotate.d/secure-ai), remove all quarantined models |
| > 95% | Critical: Services may fail to write logs. Immediate operator intervention required |
# List models by size
du -sh /var/lib/secure-ai/registry/*.gguf 2>/dev/null | sort -rh
# List quarantined models (safe to remove)
ls -lh /var/lib/secure-ai/quarantine/incoming/
ls -lh /var/lib/secure-ai/quarantine/tampered/
# Remove all quarantined models
sudo rm -rf /var/lib/secure-ai/quarantine/incoming/*
sudo rm -rf /var/lib/secure-ai/quarantine/tampered/*Before allowing logrotate to trim old data:
- Export forensic bundle (preserves incident + audit evidence with HMAC signature):
curl -s http://127.0.0.1:8515/api/v1/forensic/export > forensic-$(date +%Y%m%d).json
- Create a log-only backup to external media:
sudo secai-backup.sh logs --encrypt --output /media/usb/archives
- Verify the archive before allowing rotation:
sudo secai-backup.sh verify /media/usb/archives/secai-backup-logs-*.tar.gz