52 changes: 52 additions & 0 deletions Utilities/image_answering_utility/README.md
@@ -0,0 +1,52 @@
PDF Processor
==============

Overview:
---------
PDF Processor is a service for processing PDF files. It supports both individual PDF processing and batch processing of directories containing multiple PDFs. The service can also extract PDFs from archives (e.g., ZIP files) and integrates with MongoDB and AWS services.

Features:
---------
- Process single PDF files or entire directories.
- Automatically extract PDFs from ZIP or other archive formats (.tar, .gz, .tgz).
- Asynchronous processing with a configurable maximum number of concurrent tasks (set via the MAX_CONCURRENT_PROCESSING environment variable).
- MongoDB integration for tracking processing status.
- Logging for monitoring and error handling.
- Convert PDF pages into markdown representations and corresponding image snapshots.
- Upload processed results and generated images to AWS S3 for URL generation.
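The concurrency cap mentioned above could be enforced with a small worker-pool helper. The following is a minimal sketch, not the service's actual implementation: `runWithLimit` is a hypothetical name, and the fallback of 3 simply mirrors the value in `config/.env`.

```javascript
// Read the cap from the environment variable named in the feature list.
// Falls back to 3 (the value in config/.env) when unset or not a number.
const parsedLimit = Number.parseInt(process.env.MAX_CONCURRENT_PROCESSING ?? '', 10);
const limit = Number.isNaN(parsedLimit) ? 3 : parsedLimit;

// Hypothetical helper: run async task functions with at most
// `maxConcurrent` of them in flight at any time. Results keep input order.
async function runWithLimit(tasks, maxConcurrent) {
  const cap = Number.isInteger(maxConcurrent) && maxConcurrent > 0 ? maxConcurrent : 1;
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker claims the next unclaimed task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(cap, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Usage: each task stands in for processing one PDF.
const files = ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf'];
const demoTasks = files.map((name) => async () => `processed ${name}`);
runWithLimit(demoTasks, limit).then((out) => console.log(out));
```

A rejected task rejects the whole batch in this sketch; a real service would more likely record the failure per file and continue.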

Tech Stack:
-----------
- Node.js, Express
- MongoDB & Mongoose
- AWS SDK (S3, etc.)
- Various utilities: axios, bull, pdf2pic, extract-zip, and more

Setup:
------
1. Clone the repository.
2. (Optional) Initialize or install MongoDB using:
   - `npm run init-mongodb`
   - `npm run install-mongodb`
3. Make the setup scripts executable: run `chmod +x setup/*.sh`
4. Start the API service by running `./setup/start_api_service.sh`

Example cURL Commands:
------------------------
To test the API endpoint for processing a PDF from a URL without base64 encoding, run:

    curl -X POST http://localhost:3000/process-directory-from-url \
      -H "Content-Type: application/json" \
      -d '{"downloadUrl": "https://censusindia.gov.in/nada/index.php/catalog/28594/download/31776/50187_1961_DEM.pdf", "include_base64": false}'
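
The same request can be issued from Node.js. This is a minimal sketch assuming Node.js 18+ (for the global `fetch`) and the default host and port shown in the cURL command. The helper names `buildProcessRequest` and `submitPdf` are hypothetical, and because the response format is not documented here, the raw response text is returned.

```javascript
// Base URL assumes the host and port used in the cURL example.
const BASE_URL = 'http://localhost:3000';

// Hypothetical helper: build the same request the cURL example sends.
function buildProcessRequest(downloadUrl, includeBase64 = false) {
  return {
    url: `${BASE_URL}/process-directory-from-url`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ downloadUrl, include_base64: includeBase64 }),
    },
  };
}

// Hypothetical helper: send the request and return the raw response body.
async function submitPdf(downloadUrl, includeBase64 = false) {
  const { url, options } = buildProcessRequest(downloadUrl, includeBase64);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}
```

With the API service running, `submitPdf('https://example.com/file.pdf').then(console.log)` would send the request; the URL here is a placeholder.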

Directory Structure:
--------------------
- src/
- api/ : API layer (server.js, etc.)
- services/ : PDF processing and image service implementations
- scripts/ : Utility and setup scripts
- config/ : Configuration files (including environment variables)
- utils/ : Helper utilities
- storage/ : Temporary directories and file storage
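
As an illustration of how the settings in `config/` might be consumed, the sketch below reads a few of the variables defined in `config/.env` and applies the same defaults. `loadConfig` and `intFromEnv` are hypothetical names, and treating `RETRY_DELAY` as milliseconds is an assumption; the service's real configuration module may differ.

```javascript
// Read an integer setting, falling back when the variable is unset or invalid.
function intFromEnv(env, name, fallback) {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Hypothetical loader: variable names and defaults are taken from config/.env.
function loadConfig(env = process.env) {
  return {
    host: env.HOST ?? '0.0.0.0',
    port: intFromEnv(env, 'PORT', 3000),
    maxConcurrentProcessing: intFromEnv(env, 'MAX_CONCURRENT_PROCESSING', 3),
    maxRetries: intFromEnv(env, 'MAX_RETRIES', 5),
    // Assumed to be milliseconds; the .env file does not state the unit.
    retryDelayMs: intFromEnv(env, 'RETRY_DELAY', 5000),
    mongodbUri: env.MONGODB_URI ?? 'mongodb://localhost:27017/pdf-processor',
  };
}

console.log(loadConfig());
```

Passing the environment as a parameter keeps the loader easy to exercise with a plain object instead of the real process environment.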


46 changes: 46 additions & 0 deletions Utilities/image_answering_utility/config/.env
@@ -0,0 +1,46 @@
# Server Configuration
HOST=0.0.0.0
PORT=3000
MAX_CONCURRENT_PROCESSING=3

# AWS Credentials
AWS_ACCESS_KEY_ID=AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=AWS_SECRET_ACCESS_KEY
AWS_REGION=AWS_REGION

# S3 Configuration
S3_BUCKET_NAME=S3_BUCKET_NAME
S3_FOLDER_PATH=S3_FOLDER_PATH
S3_URL_EXPIRY=S3_URL_EXPIRY

# Service URLs
MARKDOWN_SERVICE_URL=http://0.0.0.0:8000/process-pdf-markdown/
CALLBACK_SERVICE_URL=http://localhost:8081/callback

# Processing Configuration
MAX_CONCURRENT_FILES=3
MAX_RETRIES=5
RETRY_DELAY=5000
REQUEST_TIMEOUT=300000
CALLBACK_RETRIES=3
CALLBACK_RETRY_DELAY=5000

# Directory & Logging Configuration
LOGS_DIR=logs
LOG_LEVEL=info

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
QUEUE_CONCURRENCY=3

# PDF & Image Processing Configuration
PDF_IMAGE_DENSITY=150
PDF_IMAGE_FORMAT=png
PDF_IMAGE_WIDTH=1200
PDF_IMAGE_HEIGHT=1600
PDF_IMAGE_QUALITY=80
PDF_PRESERVE_ASPECT_RATIO=true

# MongoDB Configuration
MONGODB_URI=mongodb://localhost:27017/pdf-processor