Enhance job processing by increasing batch size and adding database f…#7
Enhance job processing by increasing batch size and adding database f…#7mpatrikios wants to merge 5 commits into
Conversation
…allback for job status retrieval
There was a problem hiding this comment.
Pull request overview
This pull request enhances job processing capabilities by increasing the batch size for photo processing from 1 to 3 and adding a robust database fallback mechanism for job status retrieval when jobs are not found in memory.
Key changes:
- Increased batch processing from 1 to 3 photos per batch for improved throughput
- Added database fallback in get_processing_status to reconstruct job state from the database when not found in memory
- Enhanced sync_jobs_from_database to accurately recompute job progress and status from database records
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| backend/app/api/process_tasks.py | Increased BATCH_SIZE from 1 to 3, added database fallback for job status retrieval with job reconstruction logic, and enhanced the sync_jobs_from_database function to recompute progress and status accurately |
| .firebase/hosting.ZnJvbnRlbmQ.cache | Auto-generated Firebase hosting cache with updated file timestamps and hashes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| status_map = { | ||
| "completed": ProcessingStatus.COMPLETED, | ||
| "failed": ProcessingStatus.FAILED, | ||
| "processing": ProcessingStatus.PROCESSING, | ||
| } |
There was a problem hiding this comment.
The status_map dictionary is duplicated between get_processing_status (lines 545-549) and sync_jobs_from_database (lines 810-815). This code duplication can lead to maintenance issues if the status mappings need to be updated. Consider extracting this mapping into a module-level constant or a shared function to ensure consistency and reduce duplication.
| progress = int((completed_photos / total_photos) * 100) if total_photos else 0 | ||
|
|
||
| # Map database status to processing status safely | ||
| status_map = { | ||
| "completed": ProcessingStatus.COMPLETED, | ||
| "failed": ProcessingStatus.FAILED, | ||
| "processing": ProcessingStatus.PROCESSING, | ||
| } | ||
| current_status = status_map.get(db_job.status, ProcessingStatus.PROCESSING) | ||
| if progress >= 100 and current_status == ProcessingStatus.PROCESSING: | ||
| current_status = ProcessingStatus.COMPLETED |
There was a problem hiding this comment.
The progress calculation and status update logic (lines 542, 550-552) is duplicated in sync_jobs_from_database (lines 826-833). Consider extracting this logic into a shared helper function to improve maintainability and ensure consistent behavior across both functions.
| total_photos = db.query(PhotoDB).filter(PhotoDB.processing_job_id == db_job.id).count() | ||
| completed_photos = db.query(PhotoDB).filter( | ||
| PhotoDB.processing_job_id == db_job.id, | ||
| PhotoDB.processing_status == DBProcessingStatus.COMPLETED | ||
| ).count() |
There was a problem hiding this comment.
The database fallback performs two separate COUNT queries (lines 536 and 537-540) when reconstructing the job status. This could be optimized into a single query using conditional aggregation (COUNT with CASE WHEN) to reduce database round trips, especially when this endpoint might be called frequently for status polling.
| completed_photos = db.query(PhotoDB).filter( | ||
| PhotoDB.processing_job_id == db_job.id, | ||
| PhotoDB.processing_status == DBProcessingStatus.COMPLETED | ||
| ).count() |
There was a problem hiding this comment.
The completed_photos count query is executed once per job in the loop, which creates an N+1 query problem. If there are many active jobs, this will result in excessive database queries. Consider using a single query with GROUP BY to fetch all completed photo counts at once, or use a JOIN with aggregation to retrieve all job data in one query.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Reduce PostgreSQL pool_size from 20 to 3 (match Cloud Tasks concurrency) - Set max_overflow=0 to prevent connection exhaustion - Set GEMINI_CONCURRENCY default from 1 to 2 for balanced throughput - Add comprehensive timing logs (T1-T4) for performance diagnosis - Add in-process file cache to eliminate redundant GCS downloads 🚀 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
…allback for job status retrieval