
Add Helm chart for GPU Dev Server deployment#30

Open
wdvr wants to merge 8 commits into dev from feat/helm-chart

wdvr (Owner) commented Feb 4, 2026

Summary

  • Creates comprehensive Helm chart for deploying GPU Dev Server infrastructure
  • Supports AWS (EKS with IRSA) and GCP (GKE with Workload Identity)
  • Includes all K8s resources currently managed by OpenTofu

What's Included

Chart Components

  • PostgreSQL with PGMQ extension (primary/replica StatefulSets)
  • API Service (Deployment + LoadBalancer Service)
  • Reservation Processor (Deployment with ClusterRole RBAC)
  • Availability Updater (CronJob)
  • Reservation Expiry (CronJob)
  • Registry caches (ghcr.io and native)
  • Database migration Job (runs as Helm hook)
  • Storage class definitions (AWS gp3, GCP pd-ssd)

Cloud Provider Support

  • AWS: values-aws.yaml with IRSA service account annotations
  • GCP: values-gcp.yaml with Workload Identity annotations
  • Custom: Base values.yaml for other providers
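As a rough illustration of what the provider-specific overrides contribute, the sketch below builds the ServiceAccount annotation that each identity mechanism expects. The annotation keys are the ones documented for IRSA on EKS and Workload Identity on GKE; the function name and the example identities are placeholders, not values taken from this chart.

```python
def service_account_annotations(provider: str, identity: str) -> dict:
    """Return the ServiceAccount annotation for a cloud identity binding.

    `identity` is an IAM role ARN on AWS, or a GCP service account email on GCP.
    Any other provider gets no annotation (the "custom" case).
    """
    if provider == "aws":
        # IRSA: the pod's service account is mapped to an IAM role.
        return {"eks.amazonaws.com/role-arn": identity}
    if provider == "gcp":
        # Workload Identity: the Kubernetes SA impersonates a GCP SA.
        return {"iam.gke.io/gcp-service-account": identity}
    return {}


# Placeholder identities, for illustration only.
aws_annotations = service_account_annotations(
    "aws", "arn:aws:iam::123456789012:role/example-role"
)
gcp_annotations = service_account_annotations(
    "gcp", "example-sa@example-project.iam.gserviceaccount.com"
)
```

In a values file these would typically sit under a service account annotations key; the exact key path used by this chart is not shown in the description.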

Key Features

  • Fully templated with Helm best practices
  • Configurable replicas, resources, and tolerations
  • Automatic database schema migration via Helm hooks
  • Supports external secrets for passwords
  • Cloud-agnostic base values with provider-specific overrides

Usage

# AWS
helm install gpu-dev ./charts/gpu-dev-server \
  -f charts/gpu-dev-server/values-aws.yaml \
  -f my-values.yaml

# GCP
helm install gpu-dev ./charts/gpu-dev-server \
  -f charts/gpu-dev-server/values-gcp.yaml \
  -f my-values.yaml
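
The order of the `-f` flags matters: Helm starts from the chart's `values.yaml`, then applies each file left to right, so `my-values.yaml` wins over the provider file. Maps are merged key by key, while scalars and lists are replaced outright. A minimal Python sketch of that layering, using hypothetical keys (`api.replicas`, `api.service.type`) rather than the chart's real schema:

```python
import copy


def merge_values(base: dict, override: dict) -> dict:
    """Deep-merge `override` into a copy of `base`, Helm-style.

    Nested dicts are merged recursively; any other value
    (scalar or list) from `override` replaces the base value.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def effective_values(*layers: dict) -> dict:
    """Apply layers left to right; later layers take precedence."""
    result: dict = {}
    for layer in layers:
        result = merge_values(result, layer)
    return result


# Hypothetical contents of the three layers.
chart_defaults = {"api": {"replicas": 1, "service": {"type": "LoadBalancer"}}}
provider_values = {"api": {"service": {"annotations": {"example": "aws"}}}}
my_values = {"api": {"replicas": 3}}

values = effective_values(chart_defaults, provider_values, my_values)
```

Here `values["api"]` ends up with `replicas` set to 3, the default service `type`, and the provider's annotations.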

What Stays in OpenTofu

The Helm chart manages K8s resources only. The following still require OpenTofu:

  • EKS/GKE cluster creation
  • VPC/networking
  • IAM roles and policies
  • Node groups / ASGs
  • EFS/Filestore
  • CloudFront/Load Balancer (external)

Test plan

  • Verify helm template renders correctly
  • Test installation on k3d/kind cluster
  • Test on AWS EKS with IRSA
  • Test on GCP GKE with Workload Identity

🤖 Generated with Claude Code

wdvr and others added 8 commits February 4, 2026 14:08

Phase 2 of cloud-agnostic migration:

- Add providers/ module with CloudProvider and AuthProvider interfaces
- Implement AWSProvider wrapping existing boto3 code
- Add GCP and Custom provider stubs with documentation
- Refactor snapshot_utils.py to use provider interface:
  - Replace direct boto3 ec2_client with _get_provider()
  - Replace s3_client with provider.upload/download_from_object_storage()
  - Update all snapshot operations to use provider methods
  - Support pagination in list_snapshots for large result sets
  - Add volume_id and status filtering to snapshot queries

Provider interface supports:
- Block storage (create/delete/attach/detach volumes)
- Snapshots (create/delete/list/wait)
- Object storage (upload/download)
- Compute node queries

Next steps: Refactor disk_reconciler.py (requires adding tag operations
to provider interface)
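
A minimal sketch of how such a provider interface could look, assuming an abstract base class with a paginated snapshot listing and client-side `volume_id`/`status` filtering. Only the names `CloudProvider` and `list_snapshots` come from this commit message; `Snapshot`, `list_snapshot_page`, and `InMemoryProvider` are hypothetical stand-ins, and the real module may be shaped differently.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Snapshot:
    snapshot_id: str
    volume_id: str
    status: str


class CloudProvider(ABC):
    """Cloud-agnostic surface; each cloud supplies the abstract methods."""

    @abstractmethod
    def create_snapshot(self, volume_id: str) -> Snapshot: ...

    @abstractmethod
    def list_snapshot_page(
        self, page_token: Optional[str]
    ) -> Tuple[List[Snapshot], Optional[str]]:
        """Return one page of snapshots and the next token (None when done)."""

    def list_snapshots(
        self, volume_id: Optional[str] = None, status: Optional[str] = None
    ) -> Iterator[Snapshot]:
        """Follow page tokens until exhausted, applying optional filters."""
        token: Optional[str] = None
        while True:
            page, token = self.list_snapshot_page(token)
            for snap in page:
                if volume_id is not None and snap.volume_id != volume_id:
                    continue
                if status is not None and snap.status != status:
                    continue
                yield snap
            if token is None:
                return


class InMemoryProvider(CloudProvider):
    """Toy provider used only to exercise the interface."""

    def __init__(self, page_size: int = 2) -> None:
        self._snapshots: List[Snapshot] = []
        self._page_size = page_size

    def create_snapshot(self, volume_id: str) -> Snapshot:
        snap = Snapshot(f"snap-{len(self._snapshots) + 1}", volume_id, "completed")
        self._snapshots.append(snap)
        return snap

    def list_snapshot_page(self, page_token):
        start = int(page_token or 0)
        end = start + self._page_size
        next_token = str(end) if end < len(self._snapshots) else None
        return self._snapshots[start:end], next_token


provider = InMemoryProvider(page_size=2)
for vol in ("vol-a", "vol-b", "vol-a"):
    provider.create_snapshot(vol)
vol_a_ids = [s.snapshot_id for s in provider.list_snapshots(volume_id="vol-a")]
```

Keeping pagination in the base class means callers such as `snapshot_utils.py` never see page tokens, whichever cloud is behind the interface.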

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Creates a comprehensive Helm chart that packages all Kubernetes resources
for deploying the GPU development server infrastructure.

Chart components:
- PostgreSQL with PGMQ extension (primary/replica StatefulSets)
- API Service (Deployment + LoadBalancer Service)
- Reservation Processor (Deployment with ClusterRole RBAC)
- Availability Updater (CronJob)
- Reservation Expiry (CronJob)
- Registry caches (ghcr.io and native)
- Database migration Job (Helm hook)
- Storage class definitions (AWS gp3, GCP pd-ssd)

Cloud provider support:
- AWS (EKS) with IRSA service account annotations
- GCP (GKE) with Workload Identity annotations
- Configurable via values-aws.yaml and values-gcp.yaml

Key features:
- Fully templated with Helm best practices
- Configurable replicas, resources, and tolerations
- Automatic database schema migration via Helm hooks
- Supports external secrets for passwords
- Cloud-agnostic base values with provider-specific overrides

Usage:
  helm install gpu-dev ./charts/gpu-dev-server \
    -f charts/gpu-dev-server/values-aws.yaml \
    -f my-values.yaml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- BUG-001: Fix race condition in reservation status update
  Capture cur.rowcount inside context manager before cursor is closed

- BUG-002: Fix connection pool leak on health check failure
  Use putconn(conn, close=True) instead of conn.close() directly

- HIGH-002: Add authorization checks on job action endpoints
  Verify user owns job before allowing cancel/extend/jupyter/add_user

- MEDIUM-006: Fix SSH proxy domain validation security issue
  Use endswith() instead of 'in' to prevent hostname spoofing

- BUG-010: Fix SQL field name injection in disk_db.py
  Add whitelist validation for field names in UPDATE queries

- BUG-023: Remove unused ssl import
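
Two of the fixes above (MEDIUM-006 and BUG-010) are input-validation patterns that are easy to get subtly wrong, so here is a hedged sketch of each. The domain, table name, and field names are hypothetical; the actual checks in the SSH proxy and `disk_db.py` are not shown in this PR description. Note that a bare `endswith(domain)` would still accept `evil` + domain, so the sketch matches on a leading dot.

```python
ALLOWED_DOMAIN = "devservers.example"  # hypothetical proxy domain


def is_allowed_host(host: str, domain: str = ALLOWED_DOMAIN) -> bool:
    """Accept the domain itself or a true subdomain of it.

    A substring test (`domain in host`) would accept
    'devservers.example.attacker.test'; a suffix match anchored on a
    dot boundary does not.
    """
    host = host.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)


ALLOWED_FIELDS = frozenset({"size_gb", "status", "last_used"})  # hypothetical


def build_update(disk_id: str, fields: dict):
    """Build a parameterized UPDATE, rejecting unknown column names.

    Values go through placeholders; column names cannot, so they are
    checked against a fixed whitelist before being interpolated.
    """
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    names = sorted(fields)
    assignments = ", ".join(f"{name} = %s" for name in names)
    sql = f"UPDATE disks SET {assignments} WHERE disk_id = %s"
    return sql, [fields[name] for name in names] + [disk_id]


sql, params = build_update("disk-1", {"status": "in_use", "size_gb": 100})
```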

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- security.md: 21 findings (2 critical, 5 high, 8 medium, 6 low)
- bugs.md: 23 bugs identified across CLI, API, and shared code
- cleanup.md: duplicate code patterns and refactoring opportunities
- feature_parity.md: Helm chart vs OpenTofu comparison (~85% parity)
- progress.md: tracking completed fixes and remaining work
- multicloud.md: multi-cloud architecture documentation

Generated by automated code review agents.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add minimum 1 hour validation in CLI (--extend 0.5 now gives clear error)
- Fix float-to-int truncation that caused 0.5 to become 0
- Use round() instead of int() for extension hours
- Update type hints to reflect float parameter

Fixes E2E test finding: "extension_hours Input should be >= 1" error
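
The truncation is easy to reproduce: `int(0.5)` is `0`, which then fails the API's `>= 1` check with an unhelpful message. A minimal sketch of validating first and rounding second; the function and constant names are hypothetical, not the CLI's actual code.

```python
MIN_EXTENSION_HOURS = 1


def normalize_extension_hours(hours: float) -> int:
    """Validate the requested extension, then round to whole hours.

    Checking the minimum before rounding lets the CLI report a clear
    error instead of silently sending 0 to the API.
    """
    if hours < MIN_EXTENSION_HOURS:
        raise ValueError(
            f"extension must be at least {MIN_EXTENSION_HOURS} hour, got {hours}"
        )
    return round(hours)


truncated = int(0.5)                         # the old behavior
rounded = normalize_extension_hours(1.6)     # the new behavior
```

One caveat with `round()`: Python rounds exact halves to the nearest even integer, so `round(2.5)` is `2` while `round(3.5)` is `4`.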

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- BUG-003: Fix WebSocket cleanup race in SSH proxy (asyncio.wait + cancellation)
- BUG-004: Fix silent exception swallowing in disk operations
- BUG-008: Prevent infinite auth retry loop in API client
- BUG-009: Fix variable scope issue with explicit_no_disk
- BUG-011: Improve job name validation in poller recovery
- BUG-012: Fix duplicate work from message deletion failure
- BUG-013: Add deadlock error handling in disk_db NOWAIT
- BUG-014: Ensure timezone-aware datetime comparisons in CLI
- BUG-017: Document single-threaded safety of active_jobs iteration
- BUG-018: Remove internal detail leaks from API error responses
- BUG-019: Replace deprecated asyncio.get_event_loop()
- BUG-020: Replace magic numbers with named constants in poller
- BUG-021: Guard against division by zero for CPU instance multinode
- Dead code: Remove _REMOVED DynamoDB functions
- Cleanup: Consolidate duplicate _extract_ip_from_reservation
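
For BUG-003, the general `asyncio.wait` plus cancellation pattern looks roughly like the sketch below: run both directions of the proxy as tasks, return when either finishes, then cancel the other and await it so cleanup has completed before the session ends. The `pump` coroutine is a hypothetical stand-in for the real forwarding loops, which are not shown here.

```python
import asyncio


async def pump(name: str, delay: float, log: list) -> None:
    """Stand-in for one forwarding direction; sleeps instead of relaying."""
    try:
        await asyncio.sleep(delay)
        log.append(f"{name} finished")
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")  # release resources here
        raise


async def proxy_session(log: list):
    tasks = {
        asyncio.create_task(pump("client_to_server", 0.01, log)),
        asyncio.create_task(pump("server_to_client", 10, log)),
    }
    # Return as soon as either direction ends.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Wait for the cancelled task to unwind so cleanup cannot race teardown.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


def run_demo():
    log: list = []
    counts = asyncio.run(proxy_session(log))
    return counts, log


counts, log = run_demo()
```

`asyncio.run` also sidesteps the deprecated `asyncio.get_event_loop()` mentioned in BUG-019; inside a coroutine, `asyncio.get_running_loop()` is the usual replacement.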

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove unnecessary backslashes before $ signs in f-string bash heredocs
that produced Python SyntaxWarning about invalid escape sequences.
Update progress.md with session 2 fixes and E2E test results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Git Mirror:
- In-cluster bare mirror of pytorch/pytorch updated every 15 min
- Served via git daemon (port 9418) as ClusterIP service
- User pods auto-configured with url.insteadOf for transparent use
- 20Gi PVC for mirror storage on CPU nodes

Disk Warming:
- New init container runs between ssh-setup and main container
- Only activates for existing persistent disks (not new/empty)
- Three-stage warming: metadata → critical dirs → remaining files
- Pre-warms EBS volumes lazily restored from S3 snapshots
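
The three-stage order can be sketched in Python as below. This is an assumption-laden illustration, not the init container's actual script: `warm_disk`, its arguments, and the choice of "critical" directories are hypothetical. The idea is that reading a block forces it to be fetched from the snapshot, so touching metadata first and the most-used directories second makes the disk usable sooner.

```python
import os
from pathlib import Path


def _read_through(path: Path, chunk_size: int) -> int:
    """Read a file to the end, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


def warm_disk(root, critical_dirs, chunk_size: int = 1 << 20) -> dict:
    """Warm a mounted volume in three stages."""
    root = Path(root)

    # Stage 1: metadata. Walk the tree and stat every regular file.
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file() and not path.is_symlink():
                path.stat()
                files.append(path)

    critical_roots = [root / d for d in critical_dirs]

    def is_critical(path: Path) -> bool:
        return any(c in path.parents for c in critical_roots)

    # Stage 2: critical directories. Stage 3: everything else.
    critical = sorted(p for p in files if is_critical(p))
    remaining = sorted(p for p in files if not is_critical(p))
    bytes_read = sum(_read_through(p, chunk_size) for p in critical + remaining)

    return {"files": len(files), "critical": len(critical), "bytes_read": bytes_read}
```

A call such as `warm_disk("/mnt/disk", [".ssh", "work"])` would use placeholder paths; the real mount point and directory list depend on the pod spec.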
