Add Helm chart for GPU Dev Server deployment #30
Open
wdvr wants to merge 8 commits into
Phase 2 of cloud-agnostic migration:
- Add providers/ module with CloudProvider and AuthProvider interfaces
- Implement AWSProvider wrapping existing boto3 code
- Add GCP and Custom provider stubs with documentation
- Refactor snapshot_utils.py to use provider interface:
  - Replace direct boto3 ec2_client with _get_provider()
  - Replace s3_client with provider.upload/download_from_object_storage()
  - Update all snapshot operations to use provider methods
  - Support pagination in list_snapshots for large result sets
  - Add volume_id and status filtering to snapshot queries
Provider interface supports:
- Block storage (create/delete/attach/detach volumes)
- Snapshots (create/delete/list/wait)
- Object storage (upload/download)
- Compute node queries
Next steps: Refactor disk_reconciler.py (requires adding tag operations to provider interface)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
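The provider abstraction described in this commit might look roughly like the sketch below. Only the names CloudProvider, list_snapshots, and the volume_id/status filters come from the commit message; the Snapshot dataclass, the exact method signatures, and the InMemoryProvider toy backend are assumptions made for illustration and may not match the code in providers/.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Snapshot:
    # Hypothetical record type; the real module may return dicts instead.
    snapshot_id: str
    volume_id: str
    status: str


class CloudProvider(ABC):
    """Subset of the interface: snapshots and object storage only."""

    @abstractmethod
    def create_snapshot(self, volume_id: str) -> Snapshot: ...

    @abstractmethod
    def list_snapshots(
        self, volume_id: Optional[str] = None, status: Optional[str] = None
    ) -> Iterator[Snapshot]: ...

    @abstractmethod
    def upload_to_object_storage(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def download_from_object_storage(self, key: str) -> bytes: ...


class InMemoryProvider(CloudProvider):
    """Toy backend used only to exercise the interface; not a real cloud."""

    def __init__(self) -> None:
        self._snapshots: List[Snapshot] = []
        self._objects: Dict[str, bytes] = {}

    def create_snapshot(self, volume_id: str) -> Snapshot:
        snap = Snapshot(f"snap-{len(self._snapshots) + 1}", volume_id, "completed")
        self._snapshots.append(snap)
        return snap

    def list_snapshots(self, volume_id=None, status=None):
        # A generator lets a real backend fetch result pages lazily.
        for snap in self._snapshots:
            if volume_id is not None and snap.volume_id != volume_id:
                continue
            if status is not None and snap.status != status:
                continue
            yield snap

    def upload_to_object_storage(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def download_from_object_storage(self, key: str) -> bytes:
        return self._objects[key]
```

Callers such as snapshot_utils.py would then depend only on the abstract methods, so swapping AWS for another backend does not touch the snapshot logic.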
Creates a comprehensive Helm chart that packages all Kubernetes resources
for deploying the GPU development server infrastructure.
Chart components:
- PostgreSQL with PGMQ extension (primary/replica StatefulSets)
- API Service (Deployment + LoadBalancer Service)
- Reservation Processor (Deployment with ClusterRole RBAC)
- Availability Updater (CronJob)
- Reservation Expiry (CronJob)
- Registry caches (ghcr.io and native)
- Database migration Job (Helm hook)
- Storage class definitions (AWS gp3, GCP pd-ssd)
Cloud provider support:
- AWS (EKS) with IRSA service account annotations
- GCP (GKE) with Workload Identity annotations
- Configurable via values-aws.yaml and values-gcp.yaml
Key features:
- Fully templated with Helm best practices
- Configurable replicas, resources, and tolerations
- Automatic database schema migration via Helm hooks
- Supports external secrets for passwords
- Cloud-agnostic base values with provider-specific overrides
Usage:
  helm install gpu-dev ./charts/gpu-dev-server \
    -f charts/gpu-dev-server/values-aws.yaml \
    -f my-values.yaml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- BUG-001: Fix race condition in reservation status update
  Capture cur.rowcount inside context manager before cursor is closed
- BUG-002: Fix connection pool leak on health check failure
  Use putconn(conn, close=True) instead of conn.close() directly
- HIGH-002: Add authorization checks on job action endpoints
  Verify user owns job before allowing cancel/extend/jupyter/add_user
- MEDIUM-006: Fix SSH proxy domain validation security issue
  Use endswith() instead of 'in' to prevent hostname spoofing
- BUG-010: Fix SQL field name injection in disk_db.py
  Add whitelist validation for field names in UPDATE queries
- BUG-023: Remove unused ssl import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
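Two of the fixes above (MEDIUM-006 and BUG-010) follow well-known patterns that a short sketch can illustrate. This is not the project's code: the function names, the `disks` table, the `disk_id` column, and the allowed field set are all hypothetical. The leading-dot check is an extra precaution added here, since a bare endswith() on the domain would still accept a host like `evilexample.com`.

```python
from typing import Any, Dict, List, Tuple


def is_allowed_host(host: str, base_domain: str) -> bool:
    """Accept base_domain itself or a true subdomain of it.

    A substring test (`base_domain in host`) would also accept
    "example.com.evil.net"; a suffix match anchored on a dot does not.
    """
    host = host.lower().rstrip(".")
    base = base_domain.lower().rstrip(".")
    return host == base or host.endswith("." + base)


# Hypothetical allowlist; the real column names live in disk_db.py.
ALLOWED_DISK_FIELDS = frozenset({"size_gb", "status", "last_used"})


def build_disk_update(disk_id: str, updates: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Build a parameterized UPDATE; field names must be on the allowlist.

    Values go through driver placeholders (%s), but column names cannot be
    parameterized, so they are validated before being interpolated.
    """
    if not updates:
        raise ValueError("no fields to update")
    unknown = set(updates) - ALLOWED_DISK_FIELDS
    if unknown:
        raise ValueError(f"unsupported field(s): {sorted(unknown)}")
    assignments = ", ".join(f"{name} = %s" for name in updates)
    sql = f"UPDATE disks SET {assignments} WHERE disk_id = %s"
    return sql, [*updates.values(), disk_id]
```

The returned pair is meant to be passed to a DB-API cursor as `cur.execute(sql, params)`.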
- security.md: 21 findings (2 critical, 5 high, 8 medium, 6 low)
- bugs.md: 23 bugs identified across CLI, API, and shared code
- cleanup.md: duplicate code patterns and refactoring opportunities
- feature_parity.md: Helm chart vs OpenTofu comparison (~85% parity)
- progress.md: tracking completed fixes and remaining work
- multicloud.md: multi-cloud architecture documentation
Generated by automated code review agents.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add minimum 1 hour validation in CLI (--extend 0.5 now gives clear error)
- Fix float-to-int truncation that caused 0.5 to become 0
- Use round() instead of int() for extension hours
- Update type hints to reflect float parameter
Fixes E2E test finding: "extension_hours Input should be >= 1" error
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
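The truncation problem and its fix can be reproduced in a few lines. The helper name and constant below are hypothetical, and the real CLI may validate differently. Note that Python's round() rounds exact halves to the nearest even integer, so 2.5 becomes 2; the commit does not say whether that behavior matters here.

```python
MIN_EXTENSION_HOURS = 1  # assumed to mirror the API's ">= 1" constraint


def normalize_extension_hours(hours: float) -> int:
    """Validate first, then round, so 0.5 is rejected instead of becoming 0."""
    if hours < MIN_EXTENSION_HOURS:
        raise ValueError(
            f"extension must be at least {MIN_EXTENSION_HOURS} hour (got {hours})"
        )
    # int() would truncate 1.6 to 1; round() picks the nearest integer.
    return round(hours)


# The original bug: truncation silently turns a half hour into zero.
assert int(0.5) == 0
```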
- BUG-003: Fix WebSocket cleanup race in SSH proxy (asyncio.wait + cancellation)
- BUG-004: Fix silent exception swallowing in disk operations
- BUG-008: Prevent infinite auth retry loop in API client
- BUG-009: Fix variable scope issue with explicit_no_disk
- BUG-011: Improve job name validation in poller recovery
- BUG-012: Fix duplicate work from message deletion failure
- BUG-013: Add deadlock error handling in disk_db NOWAIT
- BUG-014: Ensure timezone-aware datetime comparisons in CLI
- BUG-017: Document single-threaded safety of active_jobs iteration
- BUG-018: Remove internal detail leaks from API error responses
- BUG-019: Replace deprecated asyncio.get_event_loop()
- BUG-020: Replace magic numbers with named constants in poller
- BUG-021: Guard against division by zero for CPU instance multinode
- Dead code: Remove _REMOVED DynamoDB functions
- Cleanup: Consolidate duplicate _extract_ip_from_reservation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unnecessary backslashes before $ signs in f-string bash heredocs that produced Python SyntaxWarning about invalid escape sequences.
Update progress.md with session 2 fixes and E2E test results.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Git Mirror:
- In-cluster bare mirror of pytorch/pytorch updated every 15 min
- Served via git daemon (port 9418) as ClusterIP service
- User pods auto-configured with url.insteadOf for transparent use
- 20Gi PVC for mirror storage on CPU nodes
Disk Warming:
- New init container runs between ssh-setup and main container
- Only activates for existing persistent disks (not new/empty)
- Three-stage warming: metadata → critical dirs → remaining files
- Pre-warms EBS volumes lazily restored from S3 snapshots
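The three-stage warming order can be sketched as follows. The commit does not say how the init container is implemented, so everything here is an assumption: the function names, the critical_dirs parameter, and the choice of Python. The idea is that reading a file forces its blocks to be fetched, so blocks of a lazily restored volume are loaded before the user needs them.

```python
import os
from pathlib import Path
from typing import Iterable, List


def _read_through(path: Path, chunk_size: int) -> None:
    """Read a file to the end and discard the data, touching every block."""
    try:
        with open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except OSError:
        pass  # unreadable files are skipped; warming is best effort


def warm_disk(root, critical_dirs: Iterable[str] = (), chunk_size: int = 1 << 20) -> List[Path]:
    """Warm a mounted volume and return the files in the order they were read."""
    root = Path(root)

    # Stage 1: metadata. Walking the tree loads directory entries and inodes.
    files: List[Path] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink() and path.is_file():
                files.append(path)

    critical = [root / d for d in critical_dirs]

    def is_critical(path: Path) -> bool:
        return any(c in path.parents for c in critical)

    # Stage 2: critical directories first. Stage 3: everything else.
    order = [p for p in files if is_critical(p)] + [p for p in files if not is_critical(p)]
    for path in order:
        _read_through(path, chunk_size)
    return order
```

A call such as `warm_disk("/mnt/disk", critical_dirs=(".venv",))` uses a hypothetical mount point and directory name; which directories count as critical is not specified in the commit.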
Summary
What's Included
Chart Components
Cloud Provider Support
- values-aws.yaml with IRSA service account annotations
- values-gcp.yaml with Workload Identity annotations
- values.yaml for other providers
Key Features
Usage
What Stays in OpenTofu
The Helm chart manages K8s resources only. The following still require OpenTofu:
Test plan
- helm template renders correctly
🤖 Generated with Claude Code