A web application that provides an AI-powered chatbot interface for dataset discovery, using Google Gemini API on the backend and a React-based frontend.
- Prerequisites
- Setup
- Database Setup
- Running the Application
- Data Processing Pipeline
- Deployment
- API Documentation
- Environment Configuration
https://chat.knowledge-space.org/
- Python: 3.11 or higher
- Node.js: 18.x or higher (for frontend development)
- Google API Key for Gemini
- Google Cloud Platform Account (for BigQuery and Vertex AI)
- UV package manager (for backend environment & dependencies)
- Docker & Docker Compose (optional, for containerized deployment)
```bash
git clone https://github.com/INCF/knowledge-space-agent.git
cd knowledge-space-agent
```
- Windows:
  ```bash
  pip install uv
  ```
- macOS/Linux: Follow the official guide: https://docs.astral.sh/uv/getting-started/installation/
Create a file named .env in the project root based on .env.template. You can choose between two authentication modes:
Option 1: Google API Key (Recommended for development)
- Set `GOOGLE_API_KEY` in your `.env` file.

Option 2: Vertex AI (Recommended for production)
- Configure Google Cloud credentials and Vertex AI settings as shown in `.env.template`.

Note: Do not commit `.env` to version control.
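Which mode is active only changes how the Gemini client is constructed. The snippet below is a minimal sketch, assuming the backend uses the `google-genai` Python client and the variable names from `.env.template`; `GCP_LOCATION` and the helper name are illustrative, and the actual wiring in `main.py` may differ.

```python
import os

from google import genai  # assumes the google-genai client library is installed


def make_gemini_client() -> genai.Client:
    """Build a Gemini client from environment variables (hypothetical helper)."""
    if os.getenv("GEMINI_USE_VERTEX", "false").lower() == "true":
        # Option 2: Vertex AI, authenticated via Application Default Credentials
        return genai.Client(
            vertexai=True,
            project=os.environ["GCP_PROJECT_ID"],
            location=os.getenv("GCP_LOCATION", "us-central1"),  # illustrative region fallback
        )
    # Option 1: plain Gemini API key generated in Google AI Studio
    return genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
```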
```bash
# Create a virtual environment using UV
uv venv
```
**Windows (Command Prompt):**
```
.venv\Scripts\activate
```
**Windows (PowerShell):**
```
.venv\Scripts\Activate.ps1
```
### 5. Install backend dependencies
With the virtual environment activated:
```bash
uv sync
```

### 6. Install frontend dependencies

```bash
cd frontend
npm install
```
### 7. Set up Google Cloud CLI

```bash
# Install Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

# Initialize and authenticate
gcloud init
gcloud auth application-default login
```

The backend requires specific environment variables to connect to Google Cloud services, including BigQuery and Vertex AI. Configure the following variables in your `.env` file:
| Variable | Description | How to Get It |
|---|---|---|
| `GOOGLE_API_KEY` | API key for Gemini models | Generate from Google AI Studio |
| `GEMINI_USE_VERTEX` | Toggle for Vertex AI vs. standard API | Set to `false` for local development |
| `GCP_PROJECT_ID` | Google Cloud Project ID | Required for Vertex AI and BigQuery |
| `BQ_DATASET_ID` | BigQuery dataset ID | Dataset containing KnowledgeSpace metadata |
| `INDEX_ENDPOINT_ID` | Vertex AI Vector Search endpoint | ID of the deployed vector index for RAG |
| `ELASTIC_BASE_URL` | Elasticsearch base URL | URL of the text search engine |
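Before starting the backend, it can help to confirm that all of these are set. The following is a minimal sketch, assuming the `python-dotenv` package is available; the variable names match the table above.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Variables from the table above; GOOGLE_API_KEY is only needed when GEMINI_USE_VERTEX is false
REQUIRED = ["GCP_PROJECT_ID", "BQ_DATASET_ID", "INDEX_ENDPOINT_ID", "ELASTIC_BASE_URL"]

load_dotenv()  # reads .env from the current directory into the process environment

missing = [name for name in REQUIRED if not os.getenv(name)]
if os.getenv("GEMINI_USE_VERTEX", "false").lower() != "true" and not os.getenv("GOOGLE_API_KEY"):
    missing.append("GOOGLE_API_KEY")

if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Environment configuration looks complete.")
```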
In one terminal, from the project root with the virtual environment active:
```bash
uv run main.py
```
- By default, this will start the backend server on port 8000. Adjust configuration if you need a different port.
In another terminal:
cd frontend
```bash
npm start
```
- This will start the React development server, typically on http://localhost:5000.
Open your browser to:
http://localhost:5000
The frontend will communicate with the backend at port 8000.
- Docker and Docker Compose installed
- `.env` file configured with required environment variables
To build and start both the backend and frontend in containers:
```bash
docker-compose up --build
```
Frontend → http://localhost:3000

Backend health → http://localhost:8000/api/health
Backend only:
```bash
docker build -t knowledge-space-backend ./backend
docker run -p 8000:8000 --env-file .env knowledge-space-backend
```
Frontend only:
```bash
docker build -t knowledge-space-frontend ./frontend
docker run -p 3000:3000 knowledge-space-frontend
```
This repository provides a set of Python scripts and modules to ingest, clean, and enrich neuroscience metadata from Google Cloud Storage, as well as scrape identifiers and references from linked resources.
The following diagram shows the high-level request and data flow through the system:
```mermaid
flowchart LR
A[User Query] --> B[Parsing and Classification using LLM]
B --> C[Identify Filters]
B --> D[Keywords for Ontology]
C --> E{Decision}
E -->|No Retrieval Needed| F[No Retrieval - General Output]
E -->|Retrieval Needed| G[Query Enhancement]
G --> H[Hybrid Retrieval]
H --> I[Vectorstore Search]
I --> J[Retrieved Chunks]
J --> K[LLM with Prompt Template]
K --> L[User Output]
%% Ontology path
D --> M[Search using Neo4j]
M --> N[Retrieved Doc IDs]
N --> O[API Call and Reranking based on Filters]
O --> H
%% Feedback loop
L --> P[Update Memory based on Query]
    P --> B
```
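Read as code, the diagram corresponds roughly to the loop below. This is only an illustrative Python sketch with hypothetical, stubbed helpers (`classify_query`, `vector_search`, `ontology_search`, and so on); it is not the backend implementation.

```python
"""Illustrative orchestration of the flow above (hypothetical helper names throughout)."""

def classify_query(query, memory):
    # LLM-based parsing: returns (intent, filters, ontology keywords); stubbed here
    return "retrieval", {"species": "mouse"}, ["motor cortex"]

def enhance_query(query, filters): return f"{query} {filters}"
def vector_search(enhanced_query): return ["chunk about motor cortex recordings"]
def ontology_search(keywords): return ["doc-123"]
def fetch_records(doc_ids): return [f"record for {d}" for d in doc_ids]
def rerank_by_filters(records, filters): return records
def generate_answer(query, context): return f"Answer to '{query}' using {len(context)} chunks"

def answer_query(query: str, memory: list[str]) -> str:
    intent, filters, keywords = classify_query(query, memory)

    if intent == "general":
        # No retrieval needed: answer directly from the LLM
        response = generate_answer(query, context=[])
    else:
        # Ontology path: Neo4j keyword lookup -> doc IDs -> rerank by filters
        ontology_hits = rerank_by_filters(fetch_records(ontology_search(keywords)), filters)
        # Hybrid retrieval: enhanced query against the vector store, merged with ontology hits
        chunks = vector_search(enhance_query(query, filters)) + ontology_hits
        response = generate_answer(query, context=chunks)

    # Feedback loop: remember this turn so the next classification sees it
    memory.append(f"user: {query} | assistant: {response}")
    return response

if __name__ == "__main__":
    memory: list[str] = []
    print(answer_query("Find datasets about motor cortex recordings", memory))
```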
- **Elasticsearch Scraping:** The `ksdata_scraping.py` script harvests raw dataset records directly from our Elasticsearch cluster and writes them to GCS. It uses a Point-In-Time (PIT) scroll to page through each index safely, authenticating via credentials stored in your environment.
- **GCS I/O:** Download raw JSON lists from `gs://ks_datasets/raw_dataset/...` and upload preprocessed outputs to `gs://ks_datasets/preprocessed_data/...`.
- **HTML Cleaning:** Strip or convert embedded HTML (e.g. `<a>` tags) into plain text or Markdown.
- **URL Extraction:** Find and dedupe all links in descriptions and metadata for later retrieval.
- **Chunk Construction:** Build semantic "chunks" by concatenating fields (title, description, context labels, etc.) for downstream vectorization (see the sketch after this list).
- **Metadata Filters:** Assemble structured metadata dictionaries (`species`, `region`, `keywords`, `identifier1…n`, etc.) for each record.
- **Per-Datasource Preprocessing:** Each data source has its own preprocessing script (e.g. `scr_017041_dandi.py`, `scr_006274_neuroelectro_ephys.py`) whose output is saved in `gs://ks_datasets/preprocessed_data/`.
- **Extensible Configs:** Easily add new data sources by updating GCS paths and field mappings.
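To make the chunk-construction and metadata-filter steps concrete, here is a small sketch of how one record might be turned into a chunk plus a filter dictionary. The field names and the `build_chunk` helper are illustrative assumptions, not the actual preprocessing code.

```python
import re

def build_chunk(record: dict) -> dict:
    """Turn one raw dataset record into a text chunk plus metadata filters (illustrative)."""
    # Strip embedded HTML tags such as <a> down to plain text
    description = re.sub(r"<[^>]+>", "", record.get("description", ""))

    # Semantic chunk: concatenate the fields used for vectorization
    chunk_text = " | ".join(
        part for part in (record.get("title"), description, record.get("context_label")) if part
    )

    # Structured metadata used later for filtering and reranking
    metadata = {
        "species": record.get("species"),
        "region": record.get("region"),
        "keywords": record.get("keywords", []),
        "identifiers": record.get("identifiers", []),
    }
    return {"chunk": chunk_text, "metadata": metadata}

example = {
    "title": "Motor cortex recordings",
    "description": 'Extracellular recordings, see <a href="https://example.org">project page</a>.',
    "species": "Mus musculus",
    "region": "primary motor cortex",
}
print(build_chunk(example))
```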
To update the vector store with new datasets from Knowledge Space, run:
```bash
python data_processing/full_pipeline.py
```
The script performs a complete data processing workflow:
- Scrapes all data - Runs preprocessing scripts to collect data from Knowledge Space datasources
- Generates hashes - Creates unique hash-based datapoint IDs for all chunks
- Matches BigQuery datapoint IDs - Queries existing data to find what's already processed
- Selects new/unique data - Identifies only new chunks that need processing
- Creates embeddings - Generates vector embeddings for new chunks only
- Upserts to vector store - Uploads new embeddings to Vertex AI Matching Engine
- Inserts to BigQuery - Stores new chunk metadata and content
This completes the update process with only new data, avoiding reprocessing existing content.
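Steps 2-4 amount to hash-based deduplication. The sketch below assumes a SHA-256 hash of the chunk text serves as the datapoint ID; the real pipeline compares against IDs already stored in BigQuery rather than a local set.

```python
import hashlib

def datapoint_id(chunk_text: str) -> str:
    """Derive a deterministic datapoint ID from the chunk content (assumed scheme)."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def select_new_chunks(chunks: list[str], existing_ids: set[str]) -> list[tuple[str, str]]:
    """Keep only chunks whose IDs are not already present in the vector store/BigQuery."""
    new = []
    for text in chunks:
        dp_id = datapoint_id(text)
        if dp_id not in existing_ids:
            new.append((dp_id, text))
    return new

# Example: only the second chunk would be embedded and upserted
existing = {datapoint_id("already indexed chunk")}
print(select_new_chunks(["already indexed chunk", "brand new chunk"], existing))
```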
- VM: Debian/Ubuntu server with Docker & Docker Compose installed
- Firewall: Open ports 80 and 443 (http-server, https-server tags on GCP)
- DNS: Domain pointing to your server's external IP
- SSL: Caddy will auto-provision Let's Encrypt certificates
- **Clean Previous Deployments:**

  ```bash
  cd ~/knowledge-space-agent || true

  # Stop current stack
  sudo docker compose down || true

  # Clean Docker cache and old images
  sudo docker system prune -af
  sudo docker builder prune -af

  # Optional: Clear HF model cache (will re-download on first use)
  sudo docker volume rm knowledge-space-agent_hf_cache 2>/dev/null || true

  # Stop host nginx if installed
  sudo systemctl stop nginx || true
  sudo systemctl disable nginx || true
  ```
- **Create Required Configuration Files:**

  **Environment file:** Create `.env` based on `.env.template` with your specific values.

  **Caddy configuration** (`Caddyfile`):

  ```
  your-domain.com, www.your-domain.com {
      reverse_proxy frontend:80
      encode gzip
      header {
          Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
      }
  }
  ```

  **Frontend Nginx:** The nginx configuration is already provided in `frontend/nginx.conf`.
Deploy Stack:
cd ~/knowledge-space-agent sudo docker compose up -d --build sudo docker compose ps
- **Verify Deployment:**

  ```bash
  # Check services are running
  sudo docker compose ps

  # Test local endpoints
  curl -I http://127.0.0.1/
  curl -sS http://127.0.0.1/api/health

  # Test public HTTPS
  curl -I https://your-domain.com/
  curl -sS https://your-domain.com/api/health
  ```
View logs:
```bash
sudo docker compose logs -f backend
sudo docker compose logs -f frontend
sudo docker compose logs -f caddy
```
Update and redeploy:
```bash
git pull
sudo docker compose up -d --build
```
Status check:
```bash
sudo docker compose ps
```
Backend unhealthy:
```bash
sudo docker inspect -f '{{json .State.Health}}' knowledge-space-agent-backend-1
```
502/504 errors:
```bash
sudo docker exec -it knowledge-space-agent-frontend-1 sh -c 'wget -S -O- http://backend:8000/health'
```
DNS issues:
```bash
dig +short your-domain.com
curl -s -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip
```
- Development: `http://localhost:8000`
- Production: `https://your-domain.com`
GET /
- Description: Root endpoint, returns service status
- Response:
{ "message": "KnowledgeSpace AI Backend is running", "version": "2.0.0" }
GET /health
- Description: Basic health check for Docker/load balancers
- Response:
{ "status": "healthy", "timestamp": "2024-01-01T12:00:00.000Z", "service": "knowledge-space-agent-backend", "version": "2.0.0" }
GET /api/health
- Description: Detailed health check with component status
- Response:
{ "status": "healthy", "version": "2.0.0", "components": { "vector_search": "enabled|disabled", "llm": "enabled|disabled", "keyword_search": "enabled" }, "timestamp": "2024-01-01T12:00:00.000Z" }
POST /api/chat
- Description: Send a query to the neuroscience assistant
- Request Body:
{ "query": "Find datasets about motor cortex recordings", "session_id": "optional-session-id", "reset": false } - Response:
{ "response": "I found several datasets related to motor cortex recordings...", "metadata": { "process_time": 2.5, "session_id": "default", "timestamp": "2024-01-01T12:00:00.000Z", "reset": false } }
POST /api/session/reset
- Description: Clear conversation history for a session
- Request Body:
{ "session_id": "session-to-reset" } - Response:
{ "status": "ok", "session_id": "session-to-reset", "message": "Session cleared" }
504 Gateway Timeout
```json
{
  "detail": "Request timed out. Please try with a simpler query."
}
```
500 Internal Server Error
```json
{
  "response": "Error: [error description]",
  "metadata": {
    "error": true,
    "session_id": "session-id"
  }
}
```
For required environment variables, see `.env.template` in the project root.
- Environment: Make sure `.env` is present before starting the backend.
- Ports: If ports 5000 or 8000 are in use, adjust scripts/configuration accordingly.
- UV Commands: `uv venv` creates the virtual environment; `uv sync` installs dependencies as defined in your project's config.
- Troubleshooting:
  - Verify Python version (`python --version`) and that dependencies installed correctly.
  - Ensure the `.env` file syntax is correct (no extra quotes).
  - For frontend issues, check the Node.js version (`node --version`) and the logs in the terminal.