Koredotcom · vivekj-kore · Feb 28, 2025 · Mar 11, 2025 · Mar 11, 2025 · Mar 12, 2025
diff --git a/Utilities/image_answering_utility/README.md b/Utilities/image_answering_utility/README.md
@@ -0,0 +1,52 @@
+PDF Processor
+==============
+
+Overview:
+---------
+PDF Processor is a service for processing PDF files. It supports both individual PDF processing and batch processing of directories containing multiple PDFs. The service can also extract PDFs from archives (e.g., ZIP files) and integrates with MongoDB and AWS services.
+
+Features:
+---------
+- Process single PDF files or entire directories.
+- Automatically extract PDFs from ZIP or other archive formats (.tar, .gz, .tgz).
+- Asynchronous processing with a configurable maximum number of concurrent tasks (set via the MAX_CONCURRENT_PROCESSING environment variable).
+- MongoDB integration for tracking processing status.
+- Logging for monitoring and error handling.
+- Convert PDF pages into markdown representations and corresponding image snapshots.
+- Upload processed results and generated images to AWS S3 for URL generation.
+
+Tech Stack:
+-----------
+- Node.js, Express
+- MongoDB & Mongoose
+- AWS SDK (S3, etc.)
+- Utilities: axios, bull (Redis-backed job queue), pdf2pic, extract-zip, and others
+
+Setup:
+------
+1. Clone the repository.
+2. (Optional) Initialize or install MongoDB using:
+   - `npm run init-mongodb`
+   - `npm run install-mongodb`
+3. Make the setup scripts executable: run `chmod +x setup/*.sh`
+4. Start the API service by running `./setup/start_api_service.sh`
+
+Example cURL Commands:
+------------------------
+To test the API endpoint for processing a PDF from a URL without base64 encoding, run:
+
+    curl -X POST http://localhost:3000/process-directory-from-url \
+      -H "Content-Type: application/json" \
+      -d '{"downloadUrl": "https://censusindia.gov.in/nada/index.php/catalog/28594/download/31776/50187_1961_DEM.pdf", "include_base64": false}'
+
+Directory Structure:
+--------------------
+- src/
+  - api/            : API layer (server.js, etc.)
+  - services/       : PDF processing and image service implementations
+  - scripts/        : Utility and setup scripts
+  - config/         : Configuration files (including environment variables)
+- utils/             : Helper utilities
+- storage/           : Temporary directories and file storage
+
+
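Review note on the README's "configurable maximum number of concurrent tasks": the README does not show how the limit is enforced, so the following is only a minimal sketch of one common approach, a fixed pool of workers pulling from a shared task list. `runWithLimit` and `concurrencyFromEnv` are hypothetical names, not functions from this repository, and the fallback of 3 simply mirrors the value in `config/.env`.

```javascript
// Hypothetical helpers for illustration; not taken from the repository.

// Runs async task functions with at most `limit` of them in flight at once.
// Results are returned in the same order as the input tasks.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Reads MAX_CONCURRENT_PROCESSING, falling back to 3 when it is unset or invalid.
function concurrencyFromEnv(env = process.env) {
  const parsed = Number.parseInt(env.MAX_CONCURRENT_PROCESSING, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 3;
}
```

A caller would wrap each PDF in a function returning a promise and pass the list as `runWithLimit(tasks, concurrencyFromEnv())`. If the service already relies on bull for queuing, its own concurrency setting may make a helper like this unnecessary.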
diff --git a/Utilities/image_answering_utility/config/.env b/Utilities/image_answering_utility/config/.env
@@ -0,0 +1,46 @@
+# Server Configuration
+HOST=0.0.0.0
+PORT=3000
+MAX_CONCURRENT_PROCESSING=3
+
+# AWS Credentials
+AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID
+AWS_SECRET_ACCESS_KEY=AWS_SECRET_ACCESS_KEY
+AWS_REGION=AWS_REGION
+
+# S3 Configuration
+S3_BUCKET_NAME=S3_BUCKET_NAME
+S3_FOLDER_PATH=S3_FOLDER_PATH
+S3_URL_EXPIRY=S3_URL_EXPIRY
+
+# Service URLs
+MARKDOWN_SERVICE_URL=http://localhost:8000/process-pdf-markdown/
+CALLBACK_SERVICE_URL=http://localhost:8081/callback
+
+# Processing Configuration
+MAX_CONCURRENT_FILES=3
+MAX_RETRIES=5
+RETRY_DELAY=5000
+REQUEST_TIMEOUT=300000
+CALLBACK_RETRIES=3
+CALLBACK_RETRY_DELAY=5000
+
+# Directory & Logging Configuration
+LOGS_DIR=logs
+LOG_LEVEL=info
+
+# Redis Configuration
+REDIS_HOST=localhost
+REDIS_PORT=6379
+QUEUE_CONCURRENCY=3
+
+# PDF & Image Processing Configuration
+PDF_IMAGE_DENSITY=150
+PDF_IMAGE_FORMAT=png
+PDF_IMAGE_WIDTH=1200
+PDF_IMAGE_HEIGHT=1600
+PDF_IMAGE_QUALITY=80
+PDF_PRESERVE_ASPECT_RATIO=true
+
+# MongoDB Configuration
+MONGODB_URI=mongodb://localhost:27017/pdf-processor
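
Review note on `config/.env`: every value arrives as a string, and several AWS and S3 entries are placeholders equal to their own key names. As a hedged sketch of how the service might convert and sanity-check these settings at startup, the hypothetical `parseConfig` below (not part of this PR) turns numeric and boolean values into typed fields and lists required keys that are still unfilled. Treating `RETRY_DELAY` and `REQUEST_TIMEOUT` as milliseconds is an assumption; the file does not state units.

```javascript
// Hypothetical helper for illustration; not taken from the repository.
// Converts string values from config/.env into typed settings.
function parseConfig(env) {
  // Parse an integer setting, using the fallback when it is missing or not numeric.
  const int = (key, fallback) => {
    const parsed = Number.parseInt(env[key], 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  // A value identical to its own key (e.g. AWS_REGION=AWS_REGION) is an unfilled placeholder.
  const required = ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_REGION', 'S3_BUCKET_NAME'];
  const unfilled = required.filter((key) => !env[key] || env[key] === key);

  return {
    port: int('PORT', 3000),
    maxRetries: int('MAX_RETRIES', 5),
    retryDelayMs: int('RETRY_DELAY', 5000), // assumed to be milliseconds
    requestTimeoutMs: int('REQUEST_TIMEOUT', 300000), // assumed to be milliseconds
    preserveAspectRatio: env.PDF_PRESERVE_ASPECT_RATIO === 'true',
    unfilled,
  };
}
```

Calling `parseConfig(process.env)` once at startup and refusing to start when `unfilled` is non-empty would surface placeholder credentials before the first S3 upload fails.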