A web application that provides an AI-powered chatbot interface for dataset discovery, using Google Gemini API on the backend and a React-based frontend.
- Prerequisites
- Setup
- Database Setup
- Running the Application
- Data Processing Pipeline
- Deployment
- API Documentation
- Environment Configuration
https://chat.knowledge-space.org/
- Python: 3.11 or higher
- Node.js: 18.x or higher (for frontend development)
- Google API Key for Gemini
- Google Cloud Platform Account (for BigQuery and Vertex AI)
- UV package manager (for backend environment & dependencies)
- Docker & Docker Compose (optional, for containerized deployment)
```bash
git clone https://github.com/INCF/knowledge-space-agent.git
cd knowledge-space-agent
```
- Windows:
  ```bash
  pip install uv
  ```
- macOS/Linux: Follow the official guide: https://docs.astral.sh/uv/getting-started/installation/
Create a file named .env in the project root based on .env.template. You can choose between two authentication modes:
Option 1: Google API Key (Recommended for development)
- Set `GOOGLE_API_KEY` in your `.env` file.

Option 2: Vertex AI (Recommended for production)
- Configure Google Cloud credentials and Vertex AI settings as shown in `.env.template`.

Note: Do not commit `.env` to version control.
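Which mode is active only changes how the Gemini client is constructed. The snippet below is a minimal sketch, assuming the backend uses the `google-genai` Python client and the variable names from `.env.template`; `GCP_LOCATION` and the helper name are illustrative, and the actual wiring in `main.py` may differ.

```python
import os

from google import genai  # assumes the google-genai client library is installed


def make_gemini_client() -> genai.Client:
    """Build a Gemini client from environment variables (hypothetical helper)."""
    if os.getenv("GEMINI_USE_VERTEX", "false").lower() == "true":
        # Option 2: Vertex AI, authenticated via Application Default Credentials
        return genai.Client(
            vertexai=True,
            project=os.environ["GCP_PROJECT_ID"],
            location=os.getenv("GCP_LOCATION", "us-central1"),  # illustrative region fallback
        )
    # Option 1: plain Gemini API key generated in Google AI Studio
    return genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
```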
```bash
# Create a virtual environment using UV
uv venv
```
**Windows (Command Prompt):**
```
.venv\Scripts\activate
```
**Windows (PowerShell):**
```
.venv\Scripts\Activate.ps1
```
### 5. Install backend dependencies
With the virtual environment activated:
```bash
uv sync
```

### 6. Install frontend dependencies

```bash
cd frontend
npm install
```
### 7. Set up Google Cloud CLI

```bash
# Install Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

# Initialize and authenticate
gcloud init
gcloud auth application-default login
```

The backend requires specific environment variables to connect to Google Cloud services, including BigQuery and Vertex AI. Configure the following variables in your `.env` file:
| Variable | Description | How to Get It |
|---|---|---|
| `GOOGLE_API_KEY` | API key for Gemini models | Generate from Google AI Studio |
| `GEMINI_USE_VERTEX` | Toggle for Vertex AI vs. standard API | Set to `false` for local development |
| `GCP_PROJECT_ID` | Google Cloud Project ID | Required for Vertex AI and BigQuery |
| `BQ_DATASET_ID` | BigQuery dataset ID | Dataset containing KnowledgeSpace metadata |
| `INDEX_ENDPOINT_ID` | Vertex AI Vector Search endpoint | ID of the deployed vector index for RAG |
| `ELASTIC_BASE_URL` | Elasticsearch base URL | URL of the text search engine |
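Before starting the backend, it can help to confirm that all of these are set. The following is a minimal sketch, assuming the `python-dotenv` package is available; the variable names match the table above.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Variables from the table above; GOOGLE_API_KEY is only needed when GEMINI_USE_VERTEX is false
REQUIRED = ["GCP_PROJECT_ID", "BQ_DATASET_ID", "INDEX_ENDPOINT_ID", "ELASTIC_BASE_URL"]

load_dotenv()  # reads .env from the current directory into the process environment

missing = [name for name in REQUIRED if not os.getenv(name)]
if os.getenv("GEMINI_USE_VERTEX", "false").lower() != "true" and not os.getenv("GOOGLE_API_KEY"):
    missing.append("GOOGLE_API_KEY")

if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Environment configuration looks complete.")
```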
In one terminal, from the project root with the virtual environment active:
```bash
uv run main.py
```
- By default, this will start the backend server on port 8000. Adjust configuration if you need a different port.
In another terminal:
cd frontend
```bash
npm start
```
- This will start the React development server, typically on http://localhost:5000.
Open your browser to:
http://localhost:5000
The frontend will communicate with the backend at port 8000.
- Docker and Docker Compose installed
- `.env` file configured with required environment variables
To build and start both the backend and frontend in containers:
```bash
docker-compose up --build
```
Frontend → http://localhost:3000

Backend health → http://localhost:8000/api/health
Backend only:
```bash
docker build -t knowledge-space-backend ./backend
docker run -p 8000:8000 --env-file .env knowledge-space-backend
```
Frontend only:
```bash
docker build -t knowledge-space-frontend ./frontend
docker run -p 3000:3000 knowledge-space-frontend
```
This repository provides a set of Python scripts and modules to ingest, clean, and enrich neuroscience metadata from Google Cloud Storage, as well as scrape identifiers and references from linked resources.
The following diagram shows the high-level request and data flow through the system:
```mermaid
flowchart LR
A[User Query] --> B[Parsing and Classification using LLM]
B --> C[Identify Filters]
B --> D[Keywords for Ontology]
C --> E{Decision}
E -->|No Retrieval Needed| F[No Retrieval - General Output]
E -->|Retrieval Needed| G[Query Enhancement]
G --> H[Hybrid Retrieval]
H --> I[Vectorstore Search]
I --> J[Retrieved Chunks]
J --> K[LLM with Prompt Template]
K --> L[User Output]
%% Ontology path
D --> M[Search using Neo4j]
M --> N[Retrieved Doc IDs]
N --> O[API Call and Reranking based on Filters]
O --> H
%% Feedback loop
L --> P[Update Memory based on Query]
    P --> B
```
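Read as code, the diagram corresponds roughly to the loop below. This is only an illustrative Python sketch with hypothetical, stubbed helpers (`classify_query`, `vector_search`, `ontology_search`, and so on); it is not the backend implementation.

```python
"""Illustrative orchestration of the flow above (hypothetical helper names throughout)."""

def classify_query(query, memory):
    # LLM-based parsing: returns (intent, filters, ontology keywords); stubbed here
    return "retrieval", {"species": "mouse"}, ["motor cortex"]

def enhance_query(query, filters): return f"{query} {filters}"
def vector_search(enhanced_query): return ["chunk about motor cortex recordings"]
def ontology_search(keywords): return ["doc-123"]
def fetch_records(doc_ids): return [f"record for {d}" for d in doc_ids]
def rerank_by_filters(records, filters): return records
def generate_answer(query, context): return f"Answer to '{query}' using {len(context)} chunks"

def answer_query(query: str, memory: list[str]) -> str:
    intent, filters, keywords = classify_query(query, memory)

    if intent == "general":
        # No retrieval needed: answer directly from the LLM
        response = generate_answer(query, context=[])
    else:
        # Ontology path: Neo4j keyword lookup -> doc IDs -> rerank by filters
        ontology_hits = rerank_by_filters(fetch_records(ontology_search(keywords)), filters)
        # Hybrid retrieval: enhanced query against the vector store, merged with ontology hits
        chunks = vector_search(enhance_query(query, filters)) + ontology_hits
        response = generate_answer(query, context=chunks)

    # Feedback loop: remember this turn so the next classification sees it
    memory.append(f"user: {query} | assistant: {response}")
    return response

if __name__ == "__main__":
    memory: list[str] = []
    print(answer_query("Find datasets about motor cortex recordings", memory))
```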
- **Elasticsearch Scraping:** The `ksdata_scraping.py` script harvests raw dataset records directly from our Elasticsearch cluster and writes them to GCS. It uses a Point-In-Time (PIT) scroll to page through each index safely, authenticating via credentials stored in your environment.
- **GCS I/O:** Download raw JSON lists from `gs://ks_datasets/raw_dataset/...` and upload preprocessed outputs to `gs://ks_datasets/preprocessed_data/...`.
- **HTML Cleaning:** Strip or convert embedded HTML (e.g. `<a>` tags) into plain text or Markdown.
- **URL Extraction:** Find and dedupe all links in descriptions and metadata for later retrieval.
- **Chunk Construction:** Build semantic "chunks" by concatenating fields (title, description, context labels, etc.) for downstream vectorization (see the sketch after this list).
- **Metadata Filters:** Assemble structured metadata dictionaries (`species`, `region`, `keywords`, `identifier1…n`, etc.) for each record.
- **Per-Datasource Preprocessing:** Each data source has its own preprocessing script (e.g. `scr_017041_dandi.py`, `scr_006274_neuroelectro_ephys.py`) whose output is saved in `gs://ks_datasets/preprocessed_data/`.
- **Extensible Configs:** Easily add new data sources by updating GCS paths and field mappings.
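To make the chunk-construction and metadata-filter steps concrete, here is a small sketch of how one record might be turned into a chunk plus a filter dictionary. The field names and the `build_chunk` helper are illustrative assumptions, not the actual preprocessing code.

```python
import re

def build_chunk(record: dict) -> dict:
    """Turn one raw dataset record into a text chunk plus metadata filters (illustrative)."""
    # Strip embedded HTML tags such as <a> down to plain text
    description = re.sub(r"<[^>]+>", "", record.get("description", ""))

    # Semantic chunk: concatenate the fields used for vectorization
    chunk_text = " | ".join(
        part for part in (record.get("title"), description, record.get("context_label")) if part
    )

    # Structured metadata used later for filtering and reranking
    metadata = {
        "species": record.get("species"),
        "region": record.get("region"),
        "keywords": record.get("keywords", []),
        "identifiers": record.get("identifiers", []),
    }
    return {"chunk": chunk_text, "metadata": metadata}

example = {
    "title": "Motor cortex recordings",
    "description": 'Extracellular recordings, see <a href="https://example.org">project page</a>.',
    "species": "Mus musculus",
    "region": "primary motor cortex",
}
print(build_chunk(example))
```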
To update the vector store with new datasets from Knowledge Space, run:
```bash
python data_processing/full_pipeline.py
```
The script performs a complete data processing workflow:
- Scrapes all data - Runs preprocessing scripts to collect data from Knowledge Space datasources
- Generates hashes - Creates unique hash-based datapoint IDs for all chunks
- Matches BigQuery datapoint IDs - Queries existing data to find what's already processed
- Selects new/unique data - Identifies only new chunks that need processing
- Creates embeddings - Generates vector embeddings for new chunks only
- Upserts to vector store - Uploads new embeddings to Vertex AI Matching Engine
- Inserts to BigQuery - Stores new chunk metadata and content
This completes the update process with only new data, avoiding reprocessing existing content.
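Steps 2-4 amount to hash-based deduplication. The sketch below assumes a SHA-256 hash of the chunk text serves as the datapoint ID; the real pipeline compares against IDs already stored in BigQuery rather than a local set.

```python
import hashlib

def datapoint_id(chunk_text: str) -> str:
    """Derive a deterministic datapoint ID from the chunk content (assumed scheme)."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def select_new_chunks(chunks: list[str], existing_ids: set[str]) -> list[tuple[str, str]]:
    """Keep only chunks whose IDs are not already present in the vector store/BigQuery."""
    new = []
    for text in chunks:
        dp_id = datapoint_id(text)
        if dp_id not in existing_ids:
            new.append((dp_id, text))
    return new

# Example: only the second chunk would be embedded and upserted
existing = {datapoint_id("already indexed chunk")}
print(select_new_chunks(["already indexed chunk", "brand new chunk"], existing))
```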
- VM: Debian/Ubuntu server with Docker & Docker Compose installed
- Firewall: Open ports 80 and 443 (http-server, https-server tags on GCP)
- DNS: Domain pointing to your server's external IP
- SSL: Caddy will auto-provision Let's Encrypt certificates
- **Clean Previous Deployments:**

  ```bash
  cd ~/knowledge-space-agent || true

  # Stop current stack
  sudo docker compose down || true

  # Clean Docker cache and old images
  sudo docker system prune -af
  sudo docker builder prune -af

  # Optional: Clear HF model cache (will re-download on first use)
  sudo docker volume rm knowledge-space-agent_hf_cache 2>/dev/null || true

  # Stop host nginx if installed
  sudo systemctl stop nginx || true
  sudo systemctl disable nginx || true
  ```
- **Create Required Configuration Files:**

  **Environment file:** Create `.env` based on `.env.template` with your specific values.

  **Caddy configuration** (`Caddyfile`):

  ```
  your-domain.com, www.your-domain.com {
      reverse_proxy frontend:80
      encode gzip
      header {
          Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
      }
  }
  ```

  **Frontend Nginx:** The nginx configuration is already provided in `frontend/nginx.conf`.
Deploy Stack:
cd ~/knowledge-space-agent sudo docker compose up -d --build sudo docker compose ps
- **Verify Deployment:**

  ```bash
  # Check services are running
  sudo docker compose ps

  # Test local endpoints
  curl -I http://127.0.0.1/
  curl -sS http://127.0.0.1/api/health

  # Test public HTTPS
  curl -I https://your-domain.com/
  curl -sS https://your-domain.com/api/health
  ```
View logs:
```bash
sudo docker compose logs -f backend
sudo docker compose logs -f frontend
sudo docker compose logs -f caddy
```
Update and redeploy:
```bash
git pull
sudo docker compose up -d --build
```
Status check:
```bash
sudo docker compose ps
```
Backend unhealthy:
```bash
sudo docker inspect -f '{{json .State.Health}}' knowledge-space-agent-backend-1
```
502/504 errors:
```bash
sudo docker exec -it knowledge-space-agent-frontend-1 sh -c 'wget -S -O- http://backend:8000/health'
```
DNS issues:
```bash
dig +short your-domain.com
curl -s -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip
```
- Development: `http://localhost:8000`
- Production: `https://your-domain.com`
GET /
- Description: Root endpoint, returns service status
- Response:
{ "message": "KnowledgeSpace AI Backend is running", "version": "2.0.0" }
GET /health
- Description: Basic health check for Docker/load balancers
- Response:
{ "status": "healthy", "timestamp": "2024-01-01T12:00:00.000Z", "service": "knowledge-space-agent-backend", "version": "2.0.0" }
GET /api/health
- Description: Detailed health check with component status
- Response:
{ "status": "healthy", "version": "2.0.0", "components": { "vector_search": "enabled|disabled", "llm": "enabled|disabled", "keyword_search": "enabled" }, "timestamp": "2024-01-01T12:00:00.000Z" }
POST /api/chat
- Description: Send a query to the neuroscience assistant
- Request Body:
{ "query": "Find datasets about motor cortex recordings", "session_id": "optional-session-id", "reset": false } - Response:
{ "response": "I found several datasets related to motor cortex recordings...", "metadata": { "process_time": 2.5, "session_id": "default", "timestamp": "2024-01-01T12:00:00.000Z", "reset": false } }
POST /api/session/reset
- Description: Clear conversation history for a session
- Request Body:
{ "session_id": "session-to-reset" } - Response:
{ "status": "ok", "session_id": "session-to-reset", "message": "Session cleared" }
504 Gateway Timeout
```json
{
  "detail": "Request timed out. Please try with a simpler query."
}
```
500 Internal Server Error
```json
{
  "response": "Error: [error description]",
  "metadata": {
    "error": true,
    "session_id": "session-id"
  }
}
```
For required environment variables, see `.env.template` in the project root.
- Environment: Make sure `.env` is present before starting the backend.
- Ports: If ports 5000 or 8000 are in use, adjust scripts/configuration accordingly.
- UV Commands: `uv venv` creates the virtual environment; `uv sync` installs dependencies as defined in your project's config.
- Troubleshooting:
  - Verify Python version (`python --version`) and that dependencies installed correctly.
  - Ensure the `.env` file syntax is correct (no extra quotes).
  - For frontend issues, check the Node.js version (`node --version`) and the logs in the terminal.