
Solidigm NVMe Optimized RAG

Solidigm SSDs enhance scalable retrieval-augmented generation (RAG) databases and AI inference workloads by offering speed, reliability, and cost efficiency. They enable faster, more accurate AI outputs and allow enterprises to handle larger datasets while reducing the need for expensive GPU and system memory upgrades, optimizing the total cost of ownership (TCO).

By offloading RAG workloads (i.e., vector database indexing and large language model generation) to SSDs, businesses achieve faster results and improved energy efficiency compared to memory-bound solutions. These high-performance SSDs are engineered for AI inference, delivering exceptional performance and minimizing power consumption for optimal speed and cost.

Prerequisites

Software Requirements

  • Ubuntu version 22.04.5 LTS or higher
  • Docker version 25.0.3 or higher
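
You can confirm the installed versions before proceeding:

lsb_release -a      # Ubuntu release information
docker --version    # Docker version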

Hardware Requirements

Note

This tool has been validated with the following configuration:

  • AMD EPYC 9554 64-Core with 4x NVIDIA L40S GPUs
  • Ubuntu version 22.04.5 LTS
  • Docker version 25.0.3
  • SOLIDIGM D7-PS1010 (SB5PH27X153T) NVMe SSD (15.3TB)
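
To confirm that the GPUs and the NVMe SSD are visible to the system, you can run the following checks (the nvme command assumes the nvme-cli package is installed):

nvidia-smi        # lists NVIDIA GPUs, driver, and CUDA versions
sudo nvme list    # lists NVMe devices and their capacities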

Building the Solution Image

The solution uses vLLM for model serving.

  1. Download the required models from HuggingFace.

    i. Install the HuggingFace CLI

    pip3 install -U "huggingface_hub[cli]"

    ii. Download the Phi-3-vision vision language model

    huggingface-cli download microsoft/Phi-3-vision-128k-instruct

    iii. Download the GTE-Qwen2 embedding model

    huggingface-cli download Alibaba-NLP/gte-Qwen2-1.5B-instruct

    iv. Download the Llama 3.3 70B large language model for generation

    huggingface-cli download meta-llama/Llama-3.3-70b
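
The Llama model is gated, so you may need to authenticate first with huggingface-cli login. By default the CLI caches models under ~/.cache/huggingface; if you want downloads to land directly on the SSD used for HF_CACHE_DIR (configured below), you can pass an explicit cache directory. The path here assumes the mount point used later in this guide:

huggingface-cli download meta-llama/Llama-3.3-70b --cache-dir /mnt/ssd/cache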

Setting up the Environment Variables

Set appropriate values for the environment variables in ./solidigm/.env, referring to the table below:

| Variable Name | Value | Description |
|---------------|-------|-------------|
| HF_CACHE_DIR | /mnt/ssd/cache | Storage location for model cache files on the SSD under test |
| VECTOR_DB_OFFLOAD | /mnt/ssd/offload/vector_db | Swap space for vector database operations on the SSD under test |
| OFFLOAD_ROOT | /mnt/ssd/offload/inference | Directory for inference offload operations on the SSD under test |
| STACK_ROOT | /mnt/ssd/stack | Directory for storing video and processing data on the SSD under test |
| HF_TOKEN | See HuggingFace User Access Tokens | Token for accessing Hugging Face models |
| IMAGE_SERVER_URL | http://localhost:8100/backend/output/images | Report server image endpoint URL; change localhost to the hosted endpoint |
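
A populated ./solidigm/.env might look like the following sketch (the token value is a placeholder):

HF_CACHE_DIR=/mnt/ssd/cache
VECTOR_DB_OFFLOAD=/mnt/ssd/offload/vector_db
OFFLOAD_ROOT=/mnt/ssd/offload/inference
STACK_ROOT=/mnt/ssd/stack
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
IMAGE_SERVER_URL=http://localhost:8100/backend/output/images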

Creating Required Directories

Before starting the solution, you need to create the required directories for model cache, vector database, inference offload, and video processing:

# Create directories for model cache and offload
sudo mkdir -p /mnt/ssd/cache
sudo mkdir -p /mnt/ssd/offload/vector_db
sudo mkdir -p /mnt/ssd/offload/inference
sudo mkdir -p /mnt/ssd/stack

# Set appropriate permissions
sudo chmod -R 777 /mnt/ssd/cache
sudo chmod -R 777 /mnt/ssd/offload
sudo chmod -R 777 /mnt/ssd/stack

Note

Make sure the SSD is properly mounted at /mnt/ssd before creating these directories. If your SSD is mounted at a different location, update the paths in the .env file accordingly.
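
If the drive is not yet mounted, a minimal sketch follows; the device name is an assumption, so identify your drive with lsblk first, and note that mkfs.ext4 erases all existing data:

lsblk                          # identify the NVMe device
sudo mkfs.ext4 /dev/nvme0n1    # WARNING: destroys existing data on the drive
sudo mkdir -p /mnt/ssd
sudo mount /dev/nvme0n1 /mnt/ssd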

Advanced Configuration

The following environment variables can be configured for fine-tuning the system's performance and behavior:

| Variable Name | Value | Description |
|---------------|-------|-------------|
| DEFAULT_FPS | 1.0 | Default frames per second for video processing |
| DEFAULT_WIDTH | 640 | Default video width (pixels) |
| DEFAULT_HEIGHT | 480 | Default video height (pixels) |
| WORKER_PROCESSES | 4 | Number of worker processes |
| VISION_SERVER_URL | http://vllm-serving:8000 | Vision server endpoint |
| VISION_BATCH_SIZE | 128 | Batch size for vision processing |
| VISION_MAX_CONCURRENT | 8 | Maximum concurrent vision processes |
| EMBEDDING_ENDPOINT | http://embedding-service:80 | Endpoint for embedding service |
| MILVUS_ENDPOINT | http://standalone:19530 | Milvus vector database endpoint |
| SECRET_KEY | 1234567890 | Authentication secret key |
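
After editing the .env file, you can confirm that Docker Compose resolves the variables as expected; the grep filter is just illustrative:

cd solidigm
sudo docker compose config | grep -i endpoint    # inspect resolved endpoint values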

Running the Solution

Note

Make sure all Setup and Configuration steps are completed.

  1. First, build all the services under the solidigm directory by running the command below.

    cd solidigm
    sudo docker compose build
  2. Once the build is complete, start all the services by running the command below.

    sudo docker compose up -d
  3. Make sure that all the services are running by checking their status.

    sudo docker compose ps

Note

Refer back to the environment variable configuration if you face issues running the solution.
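
Service logs usually point to a misconfigured variable. For example, to tail the logs of the vision serving container (the service name is taken from the VISION_SERVER_URL default above):

sudo docker compose logs --tail=100 -f vllm-serving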

Solution Interface

Important

Please be aware that after deployment, the service may take several minutes to fully start. During this initialization period, you might encounter:

  • Partially Functional Webpage: The webpage may not fully load or display all components correctly.
  • Temporary Errors: Error messages or "Service Unavailable" notices might appear.

These issues are expected and should resolve automatically as the service completes its startup process. If problems persist beyond the expected startup time, please check the service logs for any errors or misconfigurations.

Note

Access to port 8100 is necessary for the interface to work properly. If needed, port forwarding can be used.
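
For example, an SSH tunnel can forward the port from a remote host (user and hostname are placeholders):

ssh -L 8100:localhost:8100 user@remote-host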

Navigate to localhost:8100/login in a browser to open the Solution Interface.

Dashboard Overview

The Solution Interface provides an intuitive dashboard for video analysis and processing. Here's how to use it:

1. Video Upload and Analysis

  • Upload Area: Drop your videos in the designated area on the left panel
    • Supported formats: MP4, AVI, MOV, MKV
  • Upload Process:
    • Click "UPLOAD VIDEOS" button to initiate upload
    • Once uploaded, click "BEGIN ANALYSIS" to start processing
    • A unique job ID is created for tracking your analysis

Note

For testing purposes, you can use sample video datasets from the Urban Tracker Dataset. The dataset includes various urban traffic videos with different resolutions and frame rates. Please refer to the website for detailed information about video formats, annotations, and licensing terms.

2. Processing Pipeline

The dashboard shows a 5-step pipeline with real-time status:

  1. Processing Videos (Step 1)

    • Initial video processing and frame extraction
    • Green color indicates completion
  2. Summarizing Content (Step 2)

    • Content analysis and summarization
    • Purple color indicates active processing
  3. Generating Embeddings (Step 3)

    • Vector embedding generation
    • Gray indicates pending status
  4. Querying Insights (Step 4)

    • Information retrieval and analysis
    • Gray indicates pending status
  5. Generating Report (Step 5)

    • Final report compilation
    • Last step in the pipeline

3. Real-time Performance Monitoring

  • Bandwidth Metrics

    • Data Read Bandwidth: Real-time monitoring of read operations for both SSDs
    • Data Write Bandwidth: Real-time monitoring of write operations for both SSDs
  • Storage Metrics

    • SSD Usage: Percentage utilization monitoring for both SSDs
    • SSD Temperature: Operational temperature monitoring

4. System Resource Metrics

  • System Usage

    • CPU Usage: Real-time CPU utilization monitoring
    • Memory Usage: System memory utilization tracking
  • Power Consumption

    • System Power: Total system power consumption
    • CPU Power: Processor power usage monitoring
    • GPU Power: Graphics processor power usage
  • Temperature Monitoring

    • CPU Temperature: Processor temperature monitoring
    • GPU Temperature: Graphics processor temperature monitoring

5. VectorDB Performance

  • Query Performance
    • Queries per second
    • Search latency

6. Configuration Controls

  • Toggle Options
    • VectorDB with SSD Offload: Enable/disable SSD-based vector storage
    • LLM with GPU Offload: Toggle GPU acceleration for language models
    • Additional processing parameters available for customization

Stopping the Solution

To stop and remove all the containers, networks, and volumes created by docker compose up, run:

sudo docker compose down -v
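
The -v flag also removes named volumes. If you want to preserve ingested data and the vector database between runs (assuming the services store their state in named volumes), omit it:

sudo docker compose down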

Note

The generation server may take several minutes to shut down completely due to its large model size and cleanup processes. This is normal behavior and you should wait for the process to complete before proceeding.
