The Common Knowledge Base (CKB) is a comprehensive data platform that collects, processes, and manages knowledge from public sector websites and APIs. The system automatically scrapes content, cleans it for large language model consumption, and provides structured access through REST APIs.
The CKB serves as a critical data pipeline for the Bürokratt AI assistant, ensuring that responses are based on current, accurate information from Estonian public sector sources. The platform handles the complete lifecycle of data from initial collection through processing and storage.
- Multi-Source Data Collection: Web scraping, API integration, and manual file uploads
- Automated Content Updates: Periodic data refresh with change detection
- Content Processing: HTML cleaning and document text extraction for LLM consumption
- Scalable Architecture: Microservices-based design for high availability
- Comprehensive API: REST endpoints for all data operations
- Real-time Monitoring: Processing status tracking and error reporting
- Docker and Docker Compose
- PostgreSQL database
- S3-compatible storage (AWS S3, MinIO, etc.)
# Clone the repository
git clone https://github.com/buerokratt/Common-Knowledge.git
cd Common-Knowledge
# Start all services
docker-compose up -d
# Access the web interface
open http://localhost:3000
# API available at
curl http://localhost:8080/ckb/agency/all

Create a .env file with the required configuration:
# Database
DATABASE_URL=postgresql://user:password@localhost:5432/ckb
# Storage
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
S3_ENDPOINT_URL=your_s3_url
S3_BUCKET_NAME=ckb-storage
# Services
RUUTER_INTERNAL=http://ruuter-internal:8089
RUUTER_EXTERNAL=http://ruuter:8080

The CKB consists of multiple interconnected services:
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Web GUI │ │ External API │ │ Internal API │
│ (React) │───▶│ (Ruuter) │───▶│ (Ruuter) │
└─────────────┘ └──────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Scrapper │ │ Cleaning │ │ File Processing │
│ Service │◀───┤ Service │◀───┤ Service │
└─────────────┘ └──────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌─────────────────┐
│ Scheduler │ │ Data Export │ │ PostgreSQL │
│ Service │ │ Service │ │ Database │
└─────────────┘ └──────────────┘ └─────────────────┘
For detailed architecture information, see ARCHITECTURE.md.
| Service | Purpose | Technology | Port |
|---|---|---|---|
| GUI | Web interface for CKB management | React/TypeScript | 3000 |
| Ruuter External | Public API with authentication | Ruuter YAML configs | 8080 |
| Ruuter Internal | Internal service communication | Ruuter YAML configs | 8089 |
| Resql | SQL query engine and database abstraction | SQL with metadata | - |
| Scrapper | Web scraping and content extraction | Python/Scrapy | 8000 |
| Cleaning | Content cleaning and text extraction | Python/FastAPI | 8001 |
| File Processing | File upload and storage management | Python/FastAPI | 8888 |
| Scheduler | Task scheduling and automation | Python/FastAPI | 8003 |
| Data Export | Database export and archival | Python/FastAPI | 8002 |
- PostgreSQL: Primary database for structured data
- S3 Storage: Blob storage for files and content
- Liquibase: Database schema migrations
- Celery: Background task processing
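Background work such as scrape and cleaning jobs is queued through Celery. The sketch below shows the general wiring only; the broker URL and task name are assumptions, not the CKB's actual configuration.

```python
# Minimal Celery wiring sketch; broker URL and task name are assumptions,
# not the CKB's actual configuration.
from celery import Celery

app = Celery("ckb", broker="redis://localhost:6379/0")

@app.task(name="ckb.scrape_source")
def scrape_source(source_id: str) -> str:
    # In the CKB this kind of task would kick off scraping for one source
    # and record its progress in the monitoring tables.
    print(f"Scraping source {source_id}")
    return source_id

# Enqueue from any process that can reach the broker:
# scrape_source.delay("source1")
```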
- Extract: Collect data from websites, APIs, and uploads
- Transform: Clean content and extract text for LLM consumption
- Load: Store processed data in database and blob storage
For detailed ETL process documentation, see ETL_PROCESSES.md.
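As a rough illustration of those three stages, the following minimal sketch follows the same extract → transform → load shape. The function names and the HTML-to-text step are simplifications for demonstration, not the actual service code.

```python
# Minimal sketch of the extract -> transform -> load shape; function names
# and the HTML-to-text step are simplifications, not the actual CKB services.
import urllib.request
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Very rough HTML-to-text conversion standing in for the Cleaning service."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract(url: str) -> str:
    """Collect raw HTML from a source (the scraping stage)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def transform(raw_html: str) -> str:
    """Clean HTML down to plain text suitable for LLM consumption."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    return "\n".join(parser.chunks)

def load(source_id: str, text: str) -> None:
    """Persist cleaned text; in the CKB this goes to blob storage plus metadata in PostgreSQL."""
    with open(f"{source_id}.txt", "w", encoding="utf-8") as fh:
        fh.write(text)

if __name__ == "__main__":
    load("example-source", transform(extract("https://example.com")))
```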
sequenceDiagram
participant User
participant GUI
participant Scrapper
participant Cleaning
participant Storage
participant DB
User->>GUI: Configure data source
GUI->>Scrapper: Trigger scraping
Scrapper->>Storage: Store raw content
Scrapper->>DB: Store metadata
Scrapper->>Cleaning: Request cleaning
Cleaning->>Storage: Store cleaned text
Cleaning->>DB: Update status
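The same flow can be exercised without the GUI. The sketch below posts a scraping task directly to the Scrapper service using the endpoint shown in the Testing section; the agency/source IDs and URL are placeholders.

```python
# Trigger the scraping step of the flow above directly; uses the Scrapper
# endpoint from the Testing section, with placeholder agency/source IDs.
import json
import urllib.request

payload = {
    "agency_id": "test",
    "source_id": "test",
    "urls": ["https://example.com/page-1"],
}
req = urllib.request.Request(
    "http://localhost:8000/specified-pages-scrapper-task",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))
```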
Each service has detailed documentation in its respective directory:
- Scrapper Service - Web scraping and content collection
- Cleaning Service - Content processing and text extraction
- File Processing Service - File upload and storage management
- Data Export Service - Database export and archival
- Scheduler Service - Task scheduling and automation
- External API Configuration - Public API endpoints
- Internal API Configuration - Service communication
- Resql Query Definitions - SQL query engine and database operations
The database uses a multi-schema design organized by functional areas:
- agency_management: Agency and organizational data
- data_collection: Sources and file metadata
- monitoring: Processing reports and execution logs

Key tables:
- agency: Organization/department information
- source: Data source configurations (websites, APIs)
- source_file: Individual file metadata and processing status
- source_run_report: Processing execution reports
- source_run_page: Detailed scraping logs
For detailed schema documentation and ER diagram, see DATABASE_SCHEMA.md.
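The tables above relate roughly as source → source_file. The sketch below shows how that relationship could be queried; it uses SQLite only to stay self-contained, and the column names are assumptions. See DATABASE_SCHEMA.md for the real schema.

```python
# Hypothetical sketch of querying the source / source_file relationship.
# Table names come from the list above; column names are assumptions, and
# SQLite is used only to keep the demo self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id TEXT PRIMARY KEY, agency_id TEXT, url TEXT);
    CREATE TABLE source_file (
        id TEXT PRIMARY KEY,
        source_id TEXT REFERENCES source(id),
        processing_status TEXT
    );
    INSERT INTO source VALUES ('source1', 'agency1', 'https://example.com');
    INSERT INTO source_file VALUES ('file1', 'source1', 'CLEANED');
""")

# Count files per source, grouped by processing status.
for row in conn.execute("""
    SELECT s.id, sf.processing_status, COUNT(*) AS files
    FROM source s JOIN source_file sf ON sf.source_id = s.id
    GROUP BY s.id, sf.processing_status
"""):
    print(row)  # ('source1', 'CLEANED', 1)
```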
Database schema is managed through Liquibase:
# Create new migration
./create-migration.sh "descriptive-migration-name"
# Run migrations
./migrate.sh
# Load test data
./load-test-data.sh

- create-migration.sh: Creates new Liquibase migration files with proper timestamps
  - Generates SQL migration file (changelog/YYYYMMDDHHMMSS-name.sql)
  - Creates rollback file (changelog/YYYYMMDDHHMMSS-rollback.sql)
  - Generates Liquibase XML configuration (changelog/YYYYMMDDHHMMSS-name.xml)
  - Uses git user.name for author attribution
- migrate.sh: Executes pending database migrations using Docker
- load-test-data.sh: Loads test fixtures for development
- Prerequisites

  # Install Docker and Docker Compose
  # Ensure PostgreSQL is available
  # Configure AWS/S3 access

- Environment Configuration

  # Copy example configuration
  cp .env.example .env
  # Edit configuration file
  vim .env

- Database Setup

  # Run database migrations
  ./migrate.sh
  # Load test data (optional)
  ./load-test-data.sh

- Start Services

  # Start all services
  docker-compose up -d
  # Or start individual services
  docker-compose up gui scrapper cleaning
# Run API tests
curl http://localhost:8080/ckb/agency/all
# Test scraping functionality
curl -X POST http://localhost:8000/specified-pages-scrapper-task \
-H "Content-Type: application/json" \
-d '{"agency_id": "test", "source_id": "test", "urls": []}'
# Check service health
docker-compose ps

The system uses DSL (Domain Specific Language) configurations for:
- API Endpoints: YAML definitions in DSL/Ruuter/ and DSL/Ruuter.internal/
- Database Queries: SQL definitions with metadata in DSL/Resql/
- Data Mapping: Transformation templates in DSL/DMapper/
- Data Exports: Export task definitions in DSL/Export/
- Scheduling: Cron configurations in DSL/CronManager/
Resql provides type-safe database operations:
- SQL Separation: Database logic separated from application code
- Parameter Binding: Safe parameterized queries prevent SQL injection
- Type Validation: Parameter and response type checking
- Self-Documentation: Metadata declarations in SQL files
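Parameter binding is the key safety property in that list. The sketch below illustrates the general idea with a plain parameterized query using Python's sqlite3 module; it is not Resql's own interface or file format.

```python
# Illustration of the parameter-binding idea, using sqlite3 for a
# self-contained demo; this is not Resql's actual interface.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agency (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO agency VALUES ('agency1', 'Example Agency')")

# Bound parameters are treated as data, never as SQL, so an injection
# attempt in user input simply fails to match anything.
user_input = "agency1' OR '1'='1"
rows = conn.execute(
    "SELECT id, name FROM agency WHERE id = :agency_id",
    {"agency_id": user_input},
).fetchall()
print(rows)  # [] -- the malicious input matched no rows
```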
- Container Registry

  # Build and push images
  docker build -t ckb/gui ./GUI
  docker build -t ckb/scrapper ./scrapper
  docker build -t ckb/cleaning ./cleaning
  # ... build other services

- Environment Variables
  - Configure database connections
  - Set up S3 credentials
  - Define service endpoints
  - Set security keys

- Service Orchestration
  - Deploy using Kubernetes or Docker Swarm
  - Configure load balancers
  - Set up monitoring and logging
- Health Checks: Each service exposes health endpoints
- Logging: Centralized logging with structured formats
- Metrics: Performance and usage metrics collection
- Alerts: Automated alerting for critical issues
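A simple poller along these lines can confirm that the Python services respond. The ports follow the services table earlier in this document, but the /health route itself is an assumption about each service.

```python
# Minimal health-poll sketch; ports follow the services table above, while
# the /health route itself is an assumption about each service.
import urllib.request

SERVICES = {
    "scrapper": "http://localhost:8000/health",
    "cleaning": "http://localhost:8001/health",
    "data-export": "http://localhost:8002/health",
    "scheduler": "http://localhost:8003/health",
    "file-processing": "http://localhost:8888/health",
}

for name, url in SERVICES.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = "OK" if resp.status == 200 else f"HTTP {resp.status}"
    except OSError as exc:
        status = f"DOWN ({exc})"
    print(f"{name:16s} {status}")
```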
# Login to get JWT token
curl -X POST http://localhost:8080/ckb/auth/login \
-H "Content-Type: application/json" \
-d '{"username": "user", "password": "pass"}'
# Use token in subsequent requests
curl -H "Authorization: Bearer <token>" \
http://localhost:8080/ckb/agency/all

# List all agencies
curl http://localhost:8080/ckb/agency/all
# Create new source
curl -X POST http://localhost:8080/ckb/source/add \
-H "Content-Type: application/json" \
-d '{"agency_id": "agency1", "name": "Source Name", "url": "https://example.com"}'
# Trigger scraping
curl -X POST http://localhost:8080/ckb/source/refresh \
-d '{"source_id": "source1"}'
# Check processing status
curl http://localhost:8080/ckb/reports/all

- Code Standards: Follow existing code conventions in each service
- Testing: Add tests for new functionality
- Documentation: Update relevant README files
- ADR Compliance: Follow Architectural Decision Records
- Create feature branch from main
- Implement changes with appropriate tests
- Update documentation as needed
- Submit pull request for review
- Address feedback and merge
Issues are refined during grooming sessions in collaboration with developers to achieve optimal results. Please provide detailed requirements and use cases when submitting issues.
- Service Connection Errors
  - Check service health: docker-compose ps
  - Verify network connectivity between services
  - Review environment variable configuration
- Database Issues
  - Check PostgreSQL connection
  - Verify migration status
  - Review database logs
- Scraping Failures
  - Check target website availability
  - Review scraper logs for errors
  - Verify authentication credentials
- Service Logs: docker-compose logs <service_name>
- Scraper Logs: ./scrapped-data/logs/scraper/
- Cleaning Logs: ./scrapped-data/logs/cleaning/
- Database Logs: PostgreSQL container logs
This project is licensed under the terms specified in the LICENSE file.
- ADR Requirements: Data Pipeline ADRs
- Bürokratt Project: Main Bürokratt Repository
- Architecture Documentation: ARCHITECTURE.md
- ETL Process Documentation: ETL_PROCESSES.md
- API Specifications: API_SPECIFICATION.md
- Database Schema: DATABASE_SCHEMA.md