31 changes: 28 additions & 3 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -64,7 +64,7 @@ nix = { version = "0.29", features = ["signal", "process", "user", "fs", "term"]
# Serialization
serde = { version = "1", features = ["derive"] }
serde_json = "1"
-serde_yaml = "0.9"
+serde_yml = "0.0.12"

# HTTP client
reqwest = { version = "0.12", default-features = false, features = ["json", "rustls-tls"] }
2 changes: 1 addition & 1 deletion architecture/gateway-single-node.md
@@ -245,7 +245,7 @@ For the target daemon (local or remote):

After the container starts:

-1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
+1. **Clean stale nodes**: `clean_stale_nodes()` finds nodes whose name does not match the deterministic k3s `--node-name` and deletes them. That node name is derived from the gateway name but normalized to a Kubernetes-safe lowercase form so existing gateway names that contain `_`, `.`, or uppercase characters still produce a valid node identity. This cleanup is needed when a container is recreated but reuses the persistent volume -- old node entries can persist in etcd. Non-fatal on error; returns the count of removed nodes.
2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
- Container exits during polling: error includes recent log lines.
2 changes: 1 addition & 1 deletion architecture/gateway.md
@@ -501,7 +501,7 @@ The Helm chart template is at `deploy/helm/openshell/templates/statefulset.yaml`

`SandboxClient` (`crates/openshell-server/src/sandbox/mod.rs`) manages `agents.x-k8s.io/v1alpha1/Sandbox` CRDs.

-- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.).
+- **Create**: Translates a `Sandbox` proto into a Kubernetes `DynamicObject` with labels (`openshell.ai/sandbox-id`, `openshell.ai/managed-by: openshell`) and a spec that includes the pod template, environment variables, and gateway-required env vars (`OPENSHELL_SANDBOX_ID`, `OPENSHELL_ENDPOINT`, `OPENSHELL_SSH_LISTEN_ADDR`, etc.). When callers do not provide custom `volumeClaimTemplates`, the server injects a default `workspace` PVC and mounts it at `/sandbox` so the default sandbox home/workdir survives pod rescheduling.
- **Delete**: Calls the Kubernetes API to delete the CRD by name. Returns `false` if already gone (404).
- **Pod IP resolution**: `agent_pod_ip()` fetches the agent pod and reads `status.podIP`.
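
The default-PVC rule in the **Create** step can be sketched as follows. This is a minimal illustration, not the server's code: `VolumeClaimTemplate` and `effective_volume_claims` are hypothetical names, and the struct folds the mount path into the template for brevity, whereas a real pod spec declares claims and mounts separately.

```rust
/// Hypothetical, simplified stand-in for a volume claim template plus its mount.
#[derive(Debug, Clone, PartialEq)]
struct VolumeClaimTemplate {
    name: String,
    mount_path: String,
}

/// Return the caller's templates unchanged when any were provided; otherwise
/// fall back to a single default `workspace` claim mounted at `/sandbox`.
fn effective_volume_claims(custom: Vec<VolumeClaimTemplate>) -> Vec<VolumeClaimTemplate> {
    if custom.is_empty() {
        vec![VolumeClaimTemplate {
            name: "workspace".to_string(),
            mount_path: "/sandbox".to_string(),
        }]
    } else {
        custom
    }
}
```

Calling `effective_volume_claims(vec![])` yields the single `workspace` entry, while a non-empty list is passed through untouched, so callers who manage their own storage are not overridden.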

46 changes: 38 additions & 8 deletions architecture/sandbox.md
@@ -24,7 +24,7 @@ All paths are relative to `crates/openshell-sandbox/src/`.
| `sandbox/mod.rs` | Platform abstraction -- dispatches to Linux or no-op |
| `sandbox/linux/mod.rs` | Linux composition: Landlock then seccomp |
| `sandbox/linux/landlock.rs` | Filesystem isolation via Landlock LSM (ABI V1) |
-| `sandbox/linux/seccomp.rs` | Syscall filtering via BPF on `SYS_socket` |
+| `sandbox/linux/seccomp.rs` | Syscall filtering via BPF: socket domain blocks, dangerous syscall blocks, conditional flag blocks |
| `bypass_monitor.rs` | Background `/dev/kmsg` reader for iptables bypass detection events |
| `sandbox/linux/netns.rs` | Network namespace creation, veth pair setup, bypass detection iptables rules, cleanup on drop |
| `l7/mod.rs` | L7 types (`L7Protocol`, `TlsMode`, `EnforcementMode`, `L7EndpointConfig`), config parsing, validation, access preset expansion, deprecated `tls` value handling |
@@ -451,22 +451,52 @@ Kernel-level error behavior (e.g., Landlock ABI unavailable) depends on `Landloc

**File:** `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`

-Seccomp blocks socket creation for specific address families. The filter targets a single syscall (`SYS_socket`) and inspects argument 0 (the domain).
-
-**Always blocked** (regardless of network mode):
-- `AF_NETLINK`, `AF_PACKET`, `AF_BLUETOOTH`, `AF_VSOCK`
-
-**Additionally blocked in `Block` mode** (no proxy):
-- `AF_INET`, `AF_INET6`
+Seccomp provides three layers of syscall restriction: socket domain blocks, unconditional syscall blocks, and conditional syscall blocks. The filter uses a default-allow policy (`SeccompAction::Allow`) with targeted rules that return `Errno(EPERM)`.

**Skipped entirely** in `Allow` mode.

Setup:
1. `prctl(PR_SET_NO_NEW_PRIVS, 1)` -- required before seccomp
2. `seccompiler::apply_filter()` with default action `Allow` and per-rule action `Errno(EPERM)`

#### Socket domain blocks

| Domain | Always blocked | Additionally blocked in Block mode |
|--------|:-:|:-:|
| `AF_PACKET` | Yes | |
| `AF_BLUETOOTH` | Yes | |
| `AF_VSOCK` | Yes | |
| `AF_INET` | | Yes |
| `AF_INET6` | | Yes |
| `AF_NETLINK` | | Yes |

In `Proxy` mode, `AF_INET`/`AF_INET6` are allowed because the sandboxed process needs to connect to the proxy over the veth pair. The network namespace ensures it can only reach the proxy's IP (`10.200.0.1`).
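
The table above can be read as a function from network mode to a set of blocked domains. The sketch below is illustrative only: `NetworkMode`, `Domain`, and `blocked_domains` are hypothetical names, the domains are symbolic rather than numeric `AF_*` values, and it assumes that `Allow` mode installs no socket rules at all.

```rust
/// Network modes named in this section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NetworkMode {
    Allow,
    Proxy,
    Block,
}

/// Socket domains covered by the table (symbolic stand-ins for `AF_*`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Domain {
    Packet,
    Bluetooth,
    Vsock,
    Inet,
    Inet6,
    Netlink,
}

/// Domains whose `socket()` calls would be rejected with EPERM in `mode`.
fn blocked_domains(mode: NetworkMode) -> Vec<Domain> {
    // Assumption: `Allow` mode skips the filter, so nothing is blocked.
    if mode == NetworkMode::Allow {
        return Vec::new();
    }
    // "Always blocked" column of the table.
    let mut blocked = vec![Domain::Packet, Domain::Bluetooth, Domain::Vsock];
    // "Additionally blocked in Block mode" column.
    if mode == NetworkMode::Block {
        blocked.extend([Domain::Inet, Domain::Inet6, Domain::Netlink]);
    }
    blocked
}
```

In this model `Proxy` mode leaves `Inet` and `Inet6` out of the blocked set, which matches the requirement that the sandboxed process can still open a socket to the proxy.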

#### Unconditional syscall blocks

These syscalls are blocked entirely (EPERM for any invocation):

| Syscall | Reason |
|---------|--------|
| `memfd_create` | Fileless binary execution bypasses Landlock filesystem restrictions |
| `ptrace` | Cross-process memory inspection and code injection |
| `bpf` | Kernel BPF program loading |
| `process_vm_readv` | Cross-process memory read |
| `io_uring_setup` | Async I/O subsystem with extensive CVE history |
| `mount` | Filesystem mount could subvert Landlock or overlay writable paths |

#### Conditional syscall blocks

These syscalls are only blocked when specific flag patterns are present:

| Syscall | Condition | Reason |
|---------|-----------|--------|
| `execveat` | `AT_EMPTY_PATH` flag set (arg4) | Fileless execution from an anonymous fd |
| `unshare` | `CLONE_NEWUSER` flag set (arg0) | User namespace creation enables privilege escalation |
| `seccomp` | operation == `SECCOMP_SET_MODE_FILTER` (arg0) | Prevents sandboxed code from replacing the active filter |

Conditional blocks use `MaskedEq` for flag checks (bit-test) and `Eq` for exact-value matches. This allows normal use of these syscalls while blocking the dangerous flag combinations.
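
The difference between the two comparison styles can be shown in plain Rust, without any seccomp library. This is a sketch of the semantics only: `masked_eq` and `exact_eq` are hypothetical helper names, the bit-test follows the `(arg & mask) == mask` reading, and the constant values are the standard Linux ones.

```rust
const AT_EMPTY_PATH: u64 = 0x1000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const SECCOMP_SET_MODE_FILTER: u64 = 1;

/// Bit-test: true when every bit of `mask` is set in `arg`,
/// regardless of which other bits are also set.
fn masked_eq(arg: u64, mask: u64) -> bool {
    arg & mask == mask
}

/// Exact match: true only when `arg` equals `value` bit for bit.
fn exact_eq(arg: u64, value: u64) -> bool {
    arg == value
}

/// Would `execveat` be rejected for this flags argument (arg4)?
fn execveat_blocked(flags: u64) -> bool {
    masked_eq(flags, AT_EMPTY_PATH)
}

/// Would `unshare` be rejected for this flags argument (arg0)?
fn unshare_blocked(flags: u64) -> bool {
    masked_eq(flags, CLONE_NEWUSER)
}

/// Would `seccomp` be rejected for this operation argument (arg0)?
fn seccomp_blocked(operation: u64) -> bool {
    exact_eq(operation, SECCOMP_SET_MODE_FILTER)
}
```

A bit-test is the right tool for flag words because callers routinely OR several flags together; an exact match is appropriate for `seccomp`, whose first argument is an operation code rather than a bitmask.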

### Network namespace isolation

**File:** `crates/openshell-sandbox/src/sandbox/linux/netns.rs`
29 changes: 28 additions & 1 deletion architecture/security-policy.md
@@ -850,6 +850,10 @@ The response includes an `X-OpenShell-Policy` header and `Connection: close`. Se

## Seccomp Filter Details

The seccomp filter uses a default-allow policy (`SeccompAction::Allow`) with targeted rules that return `EPERM`. It provides three layers of protection: socket domain blocks, unconditional syscall blocks, and conditional syscall blocks. See `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`.

### Blocked socket domains

Regardless of network mode, certain socket domains are always blocked:

| Domain | Constant | Reason |
@@ -861,7 +865,30 @@ Regardless of network mode, certain socket domains are always blocked:

In proxy mode (which is always active), `AF_INET` (2) and `AF_INET6` (10) are allowed so the sandbox process can reach the proxy.

-The seccomp filter uses a default-allow policy (`SeccompAction::Allow`) with specific `socket()` syscall rules that return `EPERM` when the first argument (domain) matches a blocked value. See `crates/openshell-sandbox/src/sandbox/linux/seccomp.rs`.
### Blocked syscalls

These syscalls are blocked unconditionally (EPERM for any invocation):

| Syscall | NR (x86-64) | Reason |
|---------|-------------|--------|
| `memfd_create` | 319 | Fileless binary execution bypasses Landlock filesystem restrictions |
| `ptrace` | 101 | Cross-process memory inspection and code injection |
| `bpf` | 321 | Kernel BPF program loading |
| `process_vm_readv` | 310 | Cross-process memory read |
| `io_uring_setup` | 425 | Async I/O subsystem with extensive CVE history |
| `mount` | 165 | Filesystem mount could subvert Landlock or overlay writable paths |

### Conditionally blocked syscalls

These syscalls are blocked only when specific flag patterns are present in their arguments:

| Syscall | NR (x86-64) | Condition | Reason |
|---------|-------------|-----------|--------|
| `execveat` | 322 | `AT_EMPTY_PATH` (0x1000) set in flags (arg4) | Fileless execution from an anonymous fd |
| `unshare` | 272 | `CLONE_NEWUSER` (0x10000000) set in flags (arg0) | User namespace creation enables privilege escalation |
| `seccomp` | 317 | operation == `SECCOMP_SET_MODE_FILTER` (1) in arg0 | Prevents sandboxed code from replacing the active filter |

Flag checks use `MaskedEq` (`(arg & mask) == mask`) to detect the flag bit regardless of other bits. The `seccomp` syscall check uses `Eq` for exact value comparison on the operation argument.
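
Taken together, the two syscall tables describe a default-allow decision procedure. The following user-space model is a sketch for reasoning about that procedure, not the BPF program itself: `Verdict` and `evaluate` are hypothetical names, the syscall numbers are the x86-64 values from the tables, and the socket-domain rules are left out for brevity.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Allow,
    Eperm,
}

// x86-64 syscall numbers, as listed in the tables above.
const NR_PTRACE: u64 = 101;
const NR_MOUNT: u64 = 165;
const NR_UNSHARE: u64 = 272;
const NR_PROCESS_VM_READV: u64 = 310;
const NR_SECCOMP: u64 = 317;
const NR_MEMFD_CREATE: u64 = 319;
const NR_BPF: u64 = 321;
const NR_EXECVEAT: u64 = 322;
const NR_IO_URING_SETUP: u64 = 425;

const AT_EMPTY_PATH: u64 = 0x1000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const SECCOMP_SET_MODE_FILTER: u64 = 1;

/// Model the filter: unconditional blocks first, then flag-dependent
/// blocks, and finally the default-allow fallthrough.
fn evaluate(nr: u64, args: [u64; 6]) -> Verdict {
    match nr {
        NR_MEMFD_CREATE | NR_PTRACE | NR_BPF | NR_PROCESS_VM_READV
        | NR_IO_URING_SETUP | NR_MOUNT => Verdict::Eperm,
        // Bit-tests: blocked only when the dangerous flag is present.
        NR_EXECVEAT if args[4] & AT_EMPTY_PATH == AT_EMPTY_PATH => Verdict::Eperm,
        NR_UNSHARE if args[0] & CLONE_NEWUSER == CLONE_NEWUSER => Verdict::Eperm,
        // Exact match on the operation code.
        NR_SECCOMP if args[0] == SECCOMP_SET_MODE_FILTER => Verdict::Eperm,
        _ => Verdict::Allow,
    }
}
```

Because the fallthrough arm is `Allow`, an `execveat` call without `AT_EMPTY_PATH` or an `unshare` call without `CLONE_NEWUSER` passes through unchanged in this model.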

---

85 changes: 85 additions & 0 deletions crates/openshell-bootstrap/src/constants.rs
@@ -13,15 +13,100 @@ pub const SERVER_CLIENT_CA_SECRET_NAME: &str = "openshell-server-client-ca";
pub const CLIENT_TLS_SECRET_NAME: &str = "openshell-client-tls";
/// K8s secret holding the SSH handshake HMAC secret (shared by gateway and sandbox pods).
pub const SSH_HANDSHAKE_SECRET_NAME: &str = "openshell-ssh-handshake";
const NODE_NAME_PREFIX: &str = "openshell-";
const NODE_NAME_FALLBACK_SUFFIX: &str = "gateway";
const KUBERNETES_MAX_NAME_LEN: usize = 253;

pub fn container_name(name: &str) -> String {
    format!("openshell-cluster-{name}")
}

/// Deterministic k3s node name derived from the gateway name.
///
/// k3s defaults to using the container hostname (= Docker container ID) as
/// the node name. When the container is recreated (e.g. after an image
/// upgrade), the container ID changes, creating a new k3s node. The
/// `clean_stale_nodes` function then deletes PVCs whose backing PVs have
/// node affinity for the old node — wiping the server database and any
/// sandbox persistent volumes.
///
/// By passing a deterministic `--node-name` to k3s, the node identity
/// survives container recreation, and PVCs are never orphaned.
///
/// Gateway names allow Docker-friendly separators and uppercase characters,
/// but Kubernetes node names must be DNS-safe. Normalize the gateway name into
/// a single lowercase RFC 1123 label so previously accepted names such as
/// `prod_us` or `Prod.US` still deploy successfully.
pub fn node_name(name: &str) -> String {
    format!("{NODE_NAME_PREFIX}{}", normalize_node_name_suffix(name))
}

fn normalize_node_name_suffix(name: &str) -> String {
    let mut normalized = String::with_capacity(name.len());
    let mut last_was_separator = false;

    for ch in name.chars() {
        if ch.is_ascii_alphanumeric() {
            normalized.push(ch.to_ascii_lowercase());
            last_was_separator = false;
        } else if !last_was_separator {
            normalized.push('-');
            last_was_separator = true;
        }
    }

    let mut normalized = normalized.trim_matches('-').to_string();
    if normalized.is_empty() {
        normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
    }

    let max_suffix_len = KUBERNETES_MAX_NAME_LEN.saturating_sub(NODE_NAME_PREFIX.len());
    if normalized.len() > max_suffix_len {
        normalized.truncate(max_suffix_len);
        normalized.truncate(normalized.trim_end_matches('-').len());
    }

    if normalized.is_empty() {
        normalized.push_str(NODE_NAME_FALLBACK_SUFFIX);
    }

    normalized
}

pub fn volume_name(name: &str) -> String {
    format!("openshell-cluster-{name}")
}

pub fn network_name(name: &str) -> String {
    format!("openshell-cluster-{name}")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn node_name_normalizes_uppercase_and_underscores() {
        assert_eq!(node_name("Prod_US"), "openshell-prod-us");
    }

    #[test]
    fn node_name_collapses_and_trims_separator_runs() {
        assert_eq!(node_name("._Prod..__-Gateway-."), "openshell-prod-gateway");
    }

    #[test]
    fn node_name_falls_back_when_gateway_name_has_no_alphanumerics() {
        assert_eq!(node_name("...___---"), "openshell-gateway");
    }

    #[test]
    fn node_name_truncates_to_kubernetes_name_limit() {
        let gateway_name = "A".repeat(400);
        let node_name = node_name(&gateway_name);

        assert!(node_name.len() <= KUBERNETES_MAX_NAME_LEN);
        assert!(node_name.starts_with(NODE_NAME_PREFIX));
        assert!(node_name.ends_with('a'));
    }
}
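
To see why a deterministic node name matters for cleanup, the selection rule can be sketched in isolation: any registered node whose name differs from the expected name is treated as stale. `stale_node_names` is a hypothetical helper written for illustration; how `clean_stale_nodes` actually lists and deletes nodes is not shown in this diff.

```rust
/// Return the names of nodes considered stale: every registered node
/// whose name differs from the expected deterministic node name.
fn stale_node_names<'a>(registered: &[&'a str], expected: &str) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| *name != expected)
        .collect()
}
```

With a registered list of `["3f9c2a1b7d4e", "openshell-prod-us"]` and an expected name of `openshell-prod-us`, only the leftover container-ID node is selected, and the node that backs existing volumes is kept.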
10 changes: 10 additions & 0 deletions crates/openshell-bootstrap/src/container_runtime.rs
@@ -203,6 +203,16 @@ fn podman_rootless_socket_path() -> Option<String> {
    Some(format!("{runtime_dir}/podman/podman.sock"))
}

/// Check whether the current process is running as a non-root user.
///
/// Returns `true` when the effective UID is non-zero (rootless mode).
/// Used to decide container configuration — for example, rootless Podman
/// needs a private cgroup namespace while rootful Podman (and Docker) can
/// use the host cgroup namespace.
pub(crate) fn is_rootless() -> bool {
    current_uid().map_or(false, |uid| uid != 0)
}
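
The rootless check can be exercised without touching the real process by parsing the text of a `/proc/self/status` file. The sketch below is an assumption about how such parsing could look, since the body of `current_uid` is not shown here: `parse_effective_uid` and `needs_private_cgroupns` are hypothetical names, and the effective UID is taken to be the second number on the `Uid:` line (the order is real, effective, saved, filesystem).

```rust
/// Extract the effective UID from the contents of a `/proc/<pid>/status` file.
/// Returns `None` when the `Uid:` line is missing or malformed.
fn parse_effective_uid(status: &str) -> Option<u32> {
    let line = status.lines().find(|l| l.starts_with("Uid:"))?;
    // Tokens: "Uid:", real, effective, saved, filesystem.
    line.split_whitespace().nth(2)?.parse().ok()
}

/// Hypothetical decision helper: only rootless Podman gets a private
/// cgroup namespace; rootful Podman and Docker keep the host namespace.
fn needs_private_cgroupns(is_podman: bool, effective_uid: Option<u32>) -> bool {
    // An unknown UID is treated as not rootless, mirroring `map_or(false, ..)`.
    let rootless = effective_uid.map_or(false, |uid| uid != 0);
    is_podman && rootless
}
```

Treating an unreadable UID as "not rootless" keeps behavior unchanged on non-Linux systems, where the status file does not exist.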

/// Get the current user's UID by reading `/proc/self/status`.
///
/// Returns `None` on non-Linux systems or if the file cannot be parsed.