diff --git a/DEVELOPMENT-zh.md b/DEVELOPMENT-zh.md
new file mode 100644
index 000000000..6c096d68f
--- /dev/null
+++ b/DEVELOPMENT-zh.md
@@ -0,0 +1,251 @@
+# 开发指南
+
+本文档为 DataMate 提供全面的本地开发环境搭建和工作流程指南,涵盖 Java、Python、React 三种语言。
+
+## 概述
+
+DataMate 是由多语言(Java 后端、Python 运行时、React 前端)组成的微服务项目,通过 Docker Compose 进行本地开发协调。
+
+## 前置条件
+
+- Git (用于拉取源码)
+- Make (用于构建和安装)
+- Docker (用于构建镜像和部署服务)
+- Docker Compose (用于部署服务 - docker 方式)
+- Kubernetes (用于部署服务 - k8s 方式)
+- Helm (用于部署服务 - k8s 方式)
+
+注意:
+- 确保 Java 和 Python 环境在系统 PATH 中(如适用)
+- Docker Compose 将编排本地开发栈
+
+## 快速开始
+
+### 1. 克隆仓库并安装依赖
+```bash
+git clone git@github.com:ModelEngine-Group/DataMate.git
+cd DataMate
+```
+
+### 2. 启动基础服务
+```bash
+make install
+```
+
+本项目支持 docker-compose 和 helm 两种方式部署,请在执行命令后输入部署方式对应的编号,命令回显如下所示:
+```shell
+Choose a deployment method:
+1. Docker/Docker-Compose
+2. Kubernetes/Helm
+Enter choice:
+```
+
+若您使用的机器没有 make,您也可以执行如下命令部署:
+```bash
+REGISTRY=ghcr.io/modelengine-group/ docker compose -f deployment/docker/datamate/docker-compose.yml --profile milvus up -d
+```
+
+当容器运行后,请在浏览器打开 http://localhost:30000 查看前端界面。
+
+### 3. 本地开发部署
+本地代码修改后,请执行以下命令构建镜像并使用本地镜像部署:
+```bash
+make build
+make install dev=true
+```
+
+### 4. 
卸载服务 +```bash +make uninstall +``` + +在运行 `make uninstall` 时,卸载流程会只询问一次是否删除卷(数据),该选择会应用到所有组件。卸载顺序为:milvus -> label-studio -> datamate,确保在移除 datamate 网络前,所有使用该网络的服务已先停止。 + +## 项目结构 + +``` +DataMate/ +├── backend/ # Java 后端 +│ ├── api-gateway/ # API Gateway +│ ├── services/ # 核心服务 +│ └── shared/ # 共享库 +├── runtime/ # Python 运行时 +│ ├── datamate-python/ # FastAPI 后端 +│ ├── python-executor/ # Ray 执行器 +│ ├── ops/ # 算子生态 +│ ├── datax/ # DataX 框架 +│ └── deer-flow # DeerFlow 服务 +├── frontend/ # React 前端 +├── deployment/ # 部署配置 +└── docs/ # 文档 +``` + +## 开发工作流程 + +### Java 后端开发 +```bash +# 构建 +cd backend +mvn clean install + +# 运行测试 +mvn test + +# 运行特定服务 +cd backend/services/main-application +mvn spring-boot:run +``` + +### Python 运行时开发 +```bash +# 安装依赖 +cd runtime/datamate-python +poetry install + +# 运行服务 +poetry run uvicorn app.main:app --reload --port 18000 + +# 运行测试 +poetry run pytest +``` + +### React 前端开发 +```bash +# 安装依赖 +cd frontend +npm ci + +# 运行开发服务器 +npm run dev + +# 构建生产版本 +npm run build +``` + +### Docker Compose 开发 +```bash +# 启动所有服务 +docker compose up -d + +# 查看日志 +docker compose logs -f [service-name] + +# 停止所有服务 +docker compose down +``` + +## 环境配置 + +每个组件可以有自己的环境变量文件。不要提交包含密钥的 .env 文件。 + +### 后端(Java) +- **路径**: `backend/.env` +- **典型密钥**: + - `DB_URL`: 数据库连接字符串 + - `DB_USER`: 数据库用户名 + - `DB_PASSWORD`: 数据库密码 + - `REDIS_URL`: Redis 连接字符串 + - `REDIS_PASSWORD`: Redis 密码 + - `JWT_SECRET`: JWT 密钥 + +### 运行时(Python) +- **路径**: `runtime/datamate-python/.env` +- **典型密钥**: + - `DATABASE_URL`: PostgreSQL 连接字符串 + - `RAY_ENABLED`: 是否启用 Ray 执行器 + - `RAY_ADDRESS`: Ray 集群地址 + - `LABEL_STUDIO_BASE_URL`: Label Studio 基础 URL + +### 前端(React) +- **路径**: `frontend/.env` +- **典型密钥**: + - `VITE_API_BASE_URL`: API 基础 URL + - `VITE_RUNTIME_API_URL`: 运行时 API 基础 URL + +## 测试 + +### Java(JUnit 5) +```bash +cd backend +mvn test +``` + +### Python(pytest) +```bash +cd runtime/datamate-python +poetry run pytest +``` + +### 前端 +当前未配置测试框架。 + +## 调试 + +### Java 后端 +```bash +# 启用 JDWP 调试端口 
5005 +export JAVA_TOOL_OPTIONS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005' +java -jar backend/main-application/target/*.jar +``` + +### Python 运行时 +```bash +# 启用 debugpy 监听端口 5678 +cd runtime/datamate-python +python -m debugpy --listen 5678 --wait-for-client -m uvicorn app.main:app --reload --port 18000 --host 0.0.0.0 +``` + +### React 前端 +使用浏览器开发者工具或 VS Code 调试器。 + +## 常见问题 + +### 端口冲突 +检查哪个进程正在使用端口: +```bash +lsof -i TCP:8080 +lsof -i TCP:18000 +lsof -i TCP:5173 +``` +停止或重新配置冲突的服务。 + +### 数据库连接失败 +确保 `.env` 包含正确的 `DATABASE_URL` 和凭据;确保数据库服务在 Docker Compose 中已启动。 + +### Ray 集群问题 +确保 Ray 已正确启动;检查 Ray 工作进程日志;确保 `RAY_ADDRESS` 配置正确。 + +## 文档 + +- **核心文档**: + - [ARCHITECTURE.md](./ARCHITECTURE.md) - 系统架构、微服务通信、数据流 + - [DEVELOPMENT.md](./DEVELOPMENT.md) - 本地开发环境搭建和工作流程 + - [AGENTS.md](./AGENTS.md) - AI 助手指南和代码规范 + +- **后端文档**: + - [backend/README.md](./backend/README.md) - 后端架构、服务和技术栈 + - [backend/api-gateway/README.md](./backend/api-gateway/README.md) - API Gateway 配置和路由 + - [backend/services/main-application/README.md](./backend/services/main-application/README.md) - 主应用模块 + - [backend/shared/README.md](./backend/shared/README.md) - 共享库(domain-common, security-common) + +- **运行时文档**: + - [runtime/README.md](./runtime/README.md) - 运行时架构和组件 + - [runtime/datamate-python/README.md](./runtime/datamate-python/README.md) - FastAPI 后端服务 + - [runtime/python-executor/README.md](./runtime/python-executor/README.md) - Ray 执行器框架 + - [runtime/ops/README.md](./runtime/ops/README.md) - 算子生态 + - [runtime/datax/README.md](./runtime/datax/README.md) - DataX 数据框架 + - [runtime/deer-flow/README.md](./runtime/deer-flow/README.md) - DeerFlow LLM 服务 + +- **前端文档**: + - [frontend/README.md](./frontend/README.md) - React 前端应用 + +## 贡献指南 + +感谢您对本项目的关注!我们非常欢迎社区的贡献,无论是提交 Bug 报告、提出功能建议,还是直接参与代码开发,都能帮助项目变得更好。 + +• 📮 [GitHub Issues](../../issues):提交 Bug 或功能建议。 +• 🔧 [GitHub Pull Requests](../../pulls):贡献代码改进。 + +## 许可证 + +DataMate 基于 [MIT](LICENSE) 
开源,您可以在遵守许可证条款的前提下自由使用、修改和分发本项目的代码。
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
new file mode 100644
index 000000000..bf509f71f
--- /dev/null
+++ b/DEVELOPMENT.md
@@ -0,0 +1,181 @@
+# DEVELOPMENT GUIDE for DataMate
+
+This document provides a comprehensive development guide for DataMate, a polyglot, microservices-based project consisting of Java, Python, and React components. It describes how to set up, build, test, run, and contribute in a local Docker Compose-based environment without committing secrets.
+
+## Overview
+
+DataMate is composed of multiple services (Java backend, Python runtime, and React frontend) coordinated via Docker Compose for local development. The guide below covers prerequisites, quick-start steps, project structure, development workflow, environment configuration, testing, debugging, common issues, documentation, contribution workflow, and licensing.
+
+Refer to the component READMEs for detailed implementation notes:
+- Backend: backend/README.md
+- Runtime: runtime/datamate-python/README.md
+- Frontend: frontend/README.md
+
+For code style guidelines, see AGENTS.md in the repository root.
+
+## Prerequisites
+
+- Java Development: JDK 21 and Maven
+- Python: Python 3.12 and Poetry
+- Node.js: Node.js 18
+- Docker and Docker Compose
+- Optional: Make (for convenience)
+
+Notes:
+- Ensure Java and Python environments are on the system PATH where applicable.
+- Docker Compose will orchestrate the local development stack.
+
+## Quick Start
+
+1) Clone the repository and install dependencies:
+- git clone https://github.com/ModelEngine-Group/DataMate.git
+- cd DataMate
+- (Optional) Create and activate a Python virtual environment if not using Poetry-managed envs.
+- Build dependencies per component as described below. 
+ +2) Start the local stack with Docker Compose: +- docker compose up -d +- This brings up the Java backend, Python runtime, and React frontend services along with any required databases and caches as defined in the docker-compose.yml. + +3) Start individual components (if you prefer not to use the Docker stack): +- Java backend + - mvn -f backend/pom.xml -DskipTests package + - Run the main application (path may vary): java -jar backend/main-application/target/*.jar +- Python runtime + - cd runtime/datamate-python + - poetry install + - uvicorn app.main:app --reload --port 18000 --host 0.0.0.0 +- React frontend + - cd frontend + - npm ci + - npm run dev + +4) Stop the stack: +- docker compose down + +> Tip: In a team setting, prefer Docker Compose for consistency across development environments. + +## Project Structure + +- backend/ +- frontend/ +- runtime/ +- deployment/ +- docs/ +- AGENTS.md (code style guidelines) +- docker/ (docker-related tooling) +- .env* files (per-component configurations, see Environment Configuration section) + +This is a polyglot project with the following language footprints: +- Java for the backend services under backend/ +- Python for the runtime under runtime/datamate-python/ +- React/TypeScript for the frontend under frontend/ + +## Development Workflow + +Language-specific workflows: + +- Java (Backend) + - Build: mvn -f backend/pom.xml -DskipTests package + - Test: mvn -f backend/pom.xml test + - Run: mvn -f backend/pom.xml -Dexec.mainClass=... spring-boot:run (or run the packaged jar) +- Python (Runtime) + - Install: cd runtime/datamate-python && poetry install + - Test: pytest + - Run: uvicorn app.main:app --reload --port 18000 --host 0.0.0.0 +- Frontend (React) + - Install: cd frontend && npm ci + - Test: No frontend tests configured + - Build: npm run build + - Run: npm run dev + +General tips: +- Use Docker Compose for a repeatable local stack. +- Run linters and tests before creating PRs. 
+- Keep dependencies in sync across environments. + +## Environment Configuration + +Each component can have its own environment file(s). Do not commit secrets. Use sample/.env.example files as references when available. + +- Backend + - Path: backend/.env (example keys below) + - Typical keys: DB_URL, DB_USER, DB_PASSWORD, JWT_SECRET, REDIS_URL, CLOUD_STORAGE_ENDPOINT +- Runtime (Python) + - Path: runtime/datamate-python/.env + - Typical keys: DATABASE_URL, RAY_ADDRESS, CELERY_BROKER_URL, APP_SETTINGS +- Frontend + - Path: frontend/.env + - Typical keys: VITE_API_BASE_URL, VITE_DEFAULT_LOCALE, NODE_ENV + +Notes: +- Copy the corresponding .env.example to .env and fill in values as needed. +- Do not commit .env files containing secrets. + +## Testing + +- Java: JUnit 5 tests run via Maven (mvn test). +- Python: pytest in runtime/datamate-python/test or relevant tests. +- Frontend: No frontend tests configured in this repo. + +## Code Style + +Code style follows the repository-wide guidelines described in AGENTS.md. See: +- AGENTS.md (root): Code style guidelines for all languages. +- Java: Follow Java conventions in backend/ and accordance with project conventions. +- Python: Follow PEP 8 and project-specific conventions in runtime/datamate-python. +- React: Follow the frontend conventions in frontend/ (TypeScript/TSX). + +Link to guidelines: AGENTS.md + +## Debugging + +- Java (Backend): Enable JPDA debugging by starting the JVM with a debug port and attach a debugger. + - Example (local): export JAVA_TOOL_OPTIONS='-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005' && java -jar path/to/app.jar + - Attach with IDE on port 5005 after launch. +- Python (Runtime): Run with debugpy listening on port 5678 to attach from IDEs. 
+ - Example: cd runtime/datamate-python && poetry install + python -m debugpy --listen 5678 --wait-for-client -m uvicorn app.main:app --reload --port 18000 --host 0.0.0.0 +- Frontend (React): Use Node inspector to debug front-end code in dev server. + - Example: npm run dev -- --inspect-brk=9229 + +Tips: Use your preferred IDEs (IntelliJ/VSCode/WebStorm) to attach to the running processes on their respective ports. + +## Common Issues + +- Port conflicts: Check which process is using a port with lsof -i TCP: or ss -ltnp. Stop or reconfigure conflicting services. +- Database connection errors: Ensure .env contains correct DATABASE_URL and credentials; ensure the database service is up in Docker Compose. +- Ray cluster issues (Python runtime): Ensure Ray is started and accessible at the configured RAY_ADDRESS; check logs for worker failures and bootstrap status. + +## Documentation + +Component READMEs provide detailed usage and design decisions. See: +- backend/README.md +- runtime/datamate-python/README.md +- frontend/README.md +- deployment/README.md + +## Contributing + +Contributions follow a PR workflow: +- Create a feature/bugfix branch from main (e.g., feature/new-action) +- Implement changes with tests where applicable +- Run unit tests for the changed components +- Open a PR with a clear description of the changes and the rationale +- Ensure CI checks pass (build, unit tests, lint) +- Obtain reviews and address feedback +- Merge to main after approval + +## License + +Apache 2.0 + +--- + +References: +- AGENTS.md for code style guidelines: AGENTS.md +- Java dependencies: backend/pom.xml +- Node dependencies: frontend/package.json +- Python dependencies: runtime/datamate-python/pyproject.toml diff --git a/README-zh.md b/README-zh.md index 91e443d3a..1d4987ed0 100644 --- a/README-zh.md +++ b/README-zh.md @@ -110,6 +110,29 @@ make uninstall 在运行 `make uninstall` 时,卸载流程会只询问一次是否删除卷(数据),该选择会应用到所有组件。卸载顺序为:milvus -> label-studio -> datamate,确保在移除 datamate 
网络前,所有使用该网络的服务已先停止。 +## 📚 文档 + +### 核心文档 +- **[DEVELOPMENT.md](./DEVELOPMENT.md)** - 本地开发环境搭建和工作流程 +- **[AGENTS.md](./AGENTS.md)** - AI 助手指南和代码规范 + +### 后端文档 +- **[backend/README-zh.md](./backend/README-zh.md)** - 后端架构、服务和技术栈 +- **[backend/api-gateway/README-zh.md](./backend/api-gateway/README-zh.md)** - API Gateway 配置和路由 +- **[backend/services/main-application/README-zh.md](./backend/services/main-application/README-zh.md)** - 主应用模块 +- **[backend/shared/README-zh.md](./backend/shared/README-zh.md)** - 共享库(domain-common, security-common) + +### 运行时文档 +- **[runtime/README-zh.md](./runtime/README-zh.md)** - 运行时架构和组件 +- **[runtime/datamate-python/README-zh.md](./runtime/datamate-python/README-zh.md)** - FastAPI 后端服务 +- **[runtime/python-executor/README-zh.md](./runtime/python-executor/README-zh.md)** - Ray 执行器框架 +- **[runtime/ops/README.md](./runtime/ops/README.md)** - 算子生态 +- **[runtime/datax/README-zh.md](./runtime/datax/README-zh.md)** - DataX 数据框架 +- **[runtime/deer-flow/README-zh.md](./runtime/deer-flow/README-zh.md)** - DeerFlow LLM 服务 + +### 前端文档 +- **[frontend/README-zh.md](./frontend/README-zh.md)** - React 前端应用 + ## 🤝 贡献指南 感谢您对本项目的关注!我们非常欢迎社区的贡献,无论是提交 Bug 报告、提出功能建议,还是直接参与代码开发,都能帮助项目变得更好。 diff --git a/README.md b/README.md index 8b30c5973..97ee80593 100644 --- a/README.md +++ b/README.md @@ -113,10 +113,33 @@ make uninstall When running make uninstall, the installer will prompt once whether to delete volumes; that single choice is applied to all components. The uninstall order is: milvus -> label-studio -> datamate, which ensures the datamate network is removed cleanly after services that use it have stopped. 
+## 📚 Documentation
+
+### Core Documentation
+- **[DEVELOPMENT.md](./DEVELOPMENT.md)** - Local development environment setup and workflow
+- **[AGENTS.md](./AGENTS.md)** - AI assistant guidelines and code style
+
+### Backend Documentation
+- **[backend/README.md](./backend/README.md)** - Backend architecture, services, and technology stack
+- **[backend/api-gateway/README.md](./backend/api-gateway/README.md)** - API Gateway configuration and routing
+- **[backend/services/main-application/README.md](./backend/services/main-application/README.md)** - Main application modules
+- **[backend/shared/README.md](./backend/shared/README.md)** - Shared libraries (domain-common, security-common)
+
+### Runtime Documentation
+- **[runtime/README.md](./runtime/README.md)** - Runtime architecture and components
+- **[runtime/datamate-python/README.md](./runtime/datamate-python/README.md)** - FastAPI backend service
+- **[runtime/python-executor/README.md](./runtime/python-executor/README.md)** - Ray executor framework
+- **[runtime/ops/README.md](./runtime/ops/README.md)** - Operator ecosystem
+- **[runtime/datax/README.md](./runtime/datax/README.md)** - DataX data framework
+- **[runtime/deer-flow/README.md](./runtime/deer-flow/README.md)** - DeerFlow LLM service
+
+### Frontend Documentation
+- **[frontend/README.md](./frontend/README.md)** - React frontend application
+
 ## 🤝 Contribution Guidelines
 
 Thank you for your interest in this project! We warmly welcome contributions from the community. Whether it's submitting
 bug reports, suggesting new features, or directly participating in code development, all forms of help make the project
 better.
 
 • 📮 [GitHub Issues](../../issues): Submit bugs or feature suggestions. 
diff --git a/backend/README-zh.md b/backend/README-zh.md
new file mode 100644
index 000000000..cdf749b63
--- /dev/null
+++ b/backend/README-zh.md
@@ -0,0 +1,137 @@
+# DataMate 后端
+
+## 概述
+
+DataMate 后端是基于 Spring Boot 3.5 + Java 21 的微服务架构,提供数据管理、RAG 索引、API 网关等核心功能。
+
+## 架构
+
+```
+backend/
+├── api-gateway/          # API Gateway + 认证
+├── services/
+│   ├── data-management-service/  # 数据集管理
+│   ├── rag-indexer-service/      # RAG 索引
+│   └── main-application/         # 主应用入口
+└── shared/
+    ├── domain-common/    # DDD 构建块、异常处理
+    └── security-common/  # JWT 工具
+```
+
+## 服务
+
+| 服务 | 端口 | 描述 |
+|---------|-------|-------------|
+| **main-application** | 8080 | 主应用,包含数据管理、数据清洗、算子市场等模块 |
+| **api-gateway** | 8080 | API Gateway,路由转发和认证 |
+
+## 技术栈
+
+- **框架**: Spring Boot 3.5.6, Spring Cloud 2025.0.0
+- **语言**: Java 21
+- **数据库**: PostgreSQL + MyBatis-Plus 3.5.14
+- **缓存**: Redis
+- **向量数据库**: Milvus (via SDK 2.6.6)
+- **文档**: SpringDoc OpenAPI 2.2.0
+- **构建**: Maven
+
+## 依赖
+
+### 外部服务
+- **PostgreSQL**: `datamate-database:5432`
+- **Redis**: `datamate-redis:6379`
+- **Milvus**: 向量数据库(RAG 索引)
+
+### 共享库
+- **domain-common**: 业务异常、系统参数、领域实体基类
+- **security-common**: JWT 工具、认证辅助
+
+## 快速开始
+
+### 前置条件
+- JDK 21+
+- Maven 3.8+
+- PostgreSQL 12+
+- Redis 6+
+
+### 构建
+```bash
+cd backend
+mvn clean install
+```
+
+### 运行主应用
+```bash
+cd backend/services/main-application
+mvn spring-boot:run
+```
+
+### 运行 API Gateway
+```bash
+cd backend/api-gateway
+mvn spring-boot:run
+```
+
+## 开发
+
+### 模块结构 (DDD)
+```
+com.datamate.{module}/
+├── interfaces/
+│   ├── rest/           # Controllers
+│   ├── dto/            # Request/Response DTOs
+│   ├── converter/      # MapStruct converters
+│   └── validation/     # Custom validators
+├── application/        # Application services
+├── domain/
+│   ├── model/          # Entities
+│   └── repository/     # Repository interfaces
+└── infrastructure/
+    ├── persistence/    # Repository implementations
+    ├── client/         # External API clients
+    └── config/         # Service configuration
+```
+
+### 代码约定
+- **实体**: Extend 
`BaseEntity`, use `@TableName("t_*")` +- **控制器**: `@RestController` + `@RequiredArgsConstructor` +- **服务**: `@Service` + `@Transactional` +- **错误处理**: `throw BusinessException.of(ErrorCode.XXX)` +- **MapStruct**: `@Mapper(componentModel = "spring")` + +## 测试 + +```bash +# 运行所有测试 +mvn test + +# 运行特定测试 +mvn test -Dtest=ClassName#methodName + +# 运行特定模块测试 +mvn -pl services/data-management-service -am test +``` + +## 配置 + +### 环境变量 +- `DB_USERNAME`: 数据库用户名 +- `DB_PASSWORD`: 数据库密码 +- `REDIS_PASSWORD`: Redis 密码 +- `JWT_SECRET`: JWT 密钥 + +### 配置文件 +- `application.yml`: 默认配置 +- `application-dev.yml`: 开发环境覆盖 + +## 文档 + +- **API 文档**: http://localhost:8080/api/swagger-ui.html +- **AGENTS.md**: 见 `backend/shared/AGENTS.md` 获取共享库文档 +- **服务文档**: 见各服务 README + +## 相关链接 + +- [Spring Boot 文档](https://docs.spring.io/spring-boot/) +- [MyBatis-Plus 文档](https://baomidou.com/) +- [PostgreSQL 文档](https://www.postgresql.org/docs/) diff --git a/backend/README.md b/backend/README.md new file mode 100644 index 000000000..fb5bb4727 --- /dev/null +++ b/backend/README.md @@ -0,0 +1,137 @@ +# DataMate Backend + +## Overview + +DataMate Backend is a microservices architecture based on Spring Boot 3.5 + Java 21, providing core functions such as data management, RAG indexing, and API gateway. 
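+
+The environment variables listed in the Configuration section below are normally injected through Spring property placeholders rather than read directly in code. A minimal sketch of what that wiring can look like (the property paths are illustrative assumptions — the authoritative mappings live in each service's `application.yml`):
+
+```yaml
+spring:
+  datasource:
+    username: ${DB_USERNAME}   # supplied by the container environment
+    password: ${DB_PASSWORD}
+  data:
+    redis:
+      password: ${REDIS_PASSWORD}
+
+datamate:
+  jwt:
+    secret: ${JWT_SECRET}      # never hard-code; inject via env
+```
+
+With this pattern the values come from the Docker Compose or Helm environment, so no secret ever needs to be committed to the repository.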
+
+## Architecture
+
+```
+backend/
+├── api-gateway/          # API Gateway + Authentication
+├── services/
+│   ├── data-management-service/  # Dataset management
+│   ├── rag-indexer-service/      # RAG indexing
+│   └── main-application/         # Main application entry
+└── shared/
+    ├── domain-common/    # DDD building blocks, exception handling
+    └── security-common/  # JWT utilities
+```
+
+## Services
+
+| Service | Port | Description |
+|---------|-------|-------------|
+| **main-application** | 8080 | Main application, includes data management, data cleaning, operator marketplace modules |
+| **api-gateway** | 8080 | API Gateway, route forwarding and authentication |
+
+## Technology Stack
+
+- **Framework**: Spring Boot 3.5.6, Spring Cloud 2025.0.0
+- **Language**: Java 21
+- **Database**: PostgreSQL + MyBatis-Plus 3.5.14
+- **Cache**: Redis
+- **Vector DB**: Milvus (via SDK 2.6.6)
+- **Documentation**: SpringDoc OpenAPI 2.2.0
+- **Build**: Maven
+
+## Dependencies
+
+### External Services
+- **PostgreSQL**: `datamate-database:5432`
+- **Redis**: `datamate-redis:6379`
+- **Milvus**: Vector database (RAG indexing)
+
+### Shared Libraries
+- **domain-common**: Business exceptions, system parameters, domain entity base classes
+- **security-common**: JWT utilities, auth helpers
+
+## Quick Start
+
+### Prerequisites
+- JDK 21+
+- Maven 3.8+
+- PostgreSQL 12+
+- Redis 6+
+
+### Build
+```bash
+cd backend
+mvn clean install
+```
+
+### Run Main Application
+```bash
+cd backend/services/main-application
+mvn spring-boot:run
+```
+
+### Run API Gateway
+```bash
+cd backend/api-gateway
+mvn spring-boot:run
+```
+
+## Development
+
+### Module Structure (DDD)
+```
+com.datamate.{module}/
+├── interfaces/
+│   ├── rest/           # Controllers
+│   ├── dto/            # Request/Response DTOs
+│   ├── converter/      # MapStruct converters
+│   └── validation/     # Custom validators
+├── application/        # Application services
+├── domain/
+│   ├── model/          # Entities
+│   └── repository/     # Repository interfaces
+└── 
infrastructure/ + ├── persistence/ # Repository implementations + ├── client/ # External API clients + └── config/ # Service configuration +``` + +### Code Conventions +- **Entities**: Extend `BaseEntity`, use `@TableName("t_*")` +- **Controllers**: `@RestController` + `@RequiredArgsConstructor` +- **Services**: `@Service` + `@Transactional` +- **Error Handling**: `throw BusinessException.of(ErrorCode.XXX)` +- **MapStruct**: `@Mapper(componentModel = "spring")` + +## Testing + +```bash +# Run all tests +mvn test + +# Run specific test +mvn test -Dtest=ClassName#methodName + +# Run specific module tests +mvn -pl services/data-management-service -am test +``` + +## Configuration + +### Environment Variables +- `DB_USERNAME`: Database username +- `DB_PASSWORD`: Database password +- `REDIS_PASSWORD`: Redis password +- `JWT_SECRET`: JWT secret key + +### Profiles +- `application.yml`: Default configuration +- `application-dev.yml`: Development overrides + +## Documentation + +- **API Docs**: http://localhost:8080/api/swagger-ui.html +- **AGENTS.md**: See `backend/shared/AGENTS.md` for shared libraries documentation +- **Service Docs**: See individual service READMEs + +## Related Links + +- [Spring Boot Documentation](https://docs.spring.io/spring-boot/) +- [MyBatis-Plus Documentation](https://baomidou.com/) +- [PostgreSQL Documentation](https://www.postgresql.org/docs/) diff --git a/backend/api-gateway/README-zh.md b/backend/api-gateway/README-zh.md new file mode 100644 index 000000000..a300f7f74 --- /dev/null +++ b/backend/api-gateway/README-zh.md @@ -0,0 +1,130 @@ +# API Gateway + +## 概述 + +API Gateway 是 DataMate 的统一入口,基于 Spring Cloud Gateway 实现,负责路由转发、JWT 认证和限流。 + +## 架构 + +``` +backend/api-gateway/ +├── src/main/java/com/datamate/gateway/ +│ ├── config/ # Gateway 配置 +│ ├── filter/ # JWT 认证过滤器 +│ └── route/ # 路由定义 +└── src/main/resources/ + └── application.yml # Gateway 配置 +``` + +## 配置 + +### 端口 +- **默认**: 8080 +- **Nacos 发现端口**: 30000 + +### 关键配置 +```yaml +spring: 
+ application: + name: datamate-gateway + cloud: + nacos: + discovery: + port: 30000 + server-addr: ${NACOS_ADDR} + username: consul + password: +datamate: + jwt: + secret: ${JWT_SECRET} + expiration-seconds: 3600 +``` + +## 功能 + +### 1. 路由转发 +- 将前端请求转发到对应的后端服务 +- 支持负载均衡 +- 路径重写 + +### 2. JWT 认证 +- 基于 JWT Token 的认证 +- Token 验证和过期检查 +- 用户上下文传递 + +### 3. 限流 +- (如配置)请求频率限制 +- 防止 API 滥用 + +## 快速开始 + +### 前置条件 +- JDK 21+ +- Maven 3.8+ +- Nacos 服务(如果使用服务发现) + +### 构建 +```bash +cd backend/api-gateway +mvn clean install +``` + +### 运行 +```bash +cd backend/api-gateway +mvn spring-boot:run +``` + +## 开发 + +### 添加新路由 +在 `application.yml` 或通过 Nacos 配置路由规则: + +```yaml +spring: + cloud: + gateway: + routes: + - id: data-management + uri: lb://data-management-service + predicates: + - Path=/api/data-management/** + filters: + - StripPrefix=3 +``` + +### 添加自定义过滤器 +创建 `GlobalFilter` 或 `GatewayFilter`: + +```java +@Component +public class AuthFilter implements GlobalFilter { + @Override + public Mono filter(ServerWebExchange exchange, GatewayFilterChain chain) { + // 过滤逻辑 + return chain.filter(exchange); + } +} +``` + +## 测试 + +### 测试路由转发 +```bash +curl http://localhost:8080/api/data-management/datasets +``` + +### 测试 JWT 认证 +```bash +curl -H "Authorization: Bearer " http://localhost:8080/api/protected-endpoint +``` + +## 文档 + +- **Spring Cloud Gateway 文档**: https://docs.spring.io/spring-cloud-gateway/ +- **Nacos 发现**: https://nacos.io/ + +## 相关链接 + +- [后端 README](../README.md) +- [主应用 README](../services/main-application/README.md) diff --git a/backend/api-gateway/README.md b/backend/api-gateway/README.md new file mode 100644 index 000000000..23ef8fbf5 --- /dev/null +++ b/backend/api-gateway/README.md @@ -0,0 +1,130 @@ +# API Gateway + +## Overview + +API Gateway is DataMate's unified entry point, built on Spring Cloud Gateway, responsible for route forwarding, JWT authentication, and rate limiting. 
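+
+The three responsibilities above usually meet in a single route definition. A hedged sketch combining the route style shown later in this README with Spring Cloud Gateway's built-in `RequestRateLimiter` filter (that filter additionally requires the reactive Redis starter and a `KeyResolver` bean; the rate numbers are illustrative, not project defaults):
+
+```yaml
+spring:
+  cloud:
+    gateway:
+      routes:
+        - id: data-management
+          uri: lb://data-management-service
+          predicates:
+            - Path=/api/data-management/**
+          filters:
+            - StripPrefix=3
+            - name: RequestRateLimiter
+              args:
+                redis-rate-limiter.replenishRate: 10   # tokens refilled per second
+                redis-rate-limiter.burstCapacity: 20   # max requests in one burst
+```
+
+Because the limiter state lives in Redis, the limits are shared across all gateway replicas rather than enforced per instance.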
+ +## Architecture + +``` +backend/api-gateway/ +├── src/main/java/com/datamate/gateway/ +│ ├── config/ # Gateway configuration +│ ├── filter/ # JWT authentication filter +│ └── route/ # Route definitions +└── src/main/resources/ + └── application.yml # Gateway configuration +``` + +## Configuration + +### Port +- **Default**: 8080 +- **Nacos Discovery Port**: 30000 + +### Key Configuration +```yaml +spring: + application: + name: datamate-gateway + cloud: + nacos: + discovery: + port: 30000 + server-addr: ${NACOS_ADDR} + username: consul + password: +datamate: + jwt: + secret: ${JWT_SECRET} + expiration-seconds: 3600 +``` + +## Features + +### 1. Route Forwarding +- Forward frontend requests to corresponding backend services +- Support for load balancing +- Path rewriting + +### 2. JWT Authentication +- JWT Token-based authentication +- Token validation and expiration checking +- User context propagation + +### 3. Rate Limiting +- Request rate limiting (if configured) +- Prevent API abuse + +## Quick Start + +### Prerequisites +- JDK 21+ +- Maven 3.8+ +- Nacos service (if using service discovery) + +### Build +```bash +cd backend/api-gateway +mvn clean install +``` + +### Run +```bash +cd backend/api-gateway +mvn spring-boot:run +``` + +## Development + +### Adding New Routes +Configure route rules in `application.yml` or via Nacos: + +```yaml +spring: + cloud: + gateway: + routes: + - id: data-management + uri: lb://data-management-service + predicates: + - Path=/api/data-management/** + filters: + - StripPrefix=3 +``` + +### Adding Custom Filters +Create a `GlobalFilter` or `GatewayFilter`: + +```java +@Component +public class AuthFilter implements GlobalFilter { + @Override + public Mono filter(ServerWebExchange exchange, GatewayFilterChain chain) { + // Filter logic + return chain.filter(exchange); + } +} +``` + +## Testing + +### Test Route Forwarding +```bash +curl http://localhost:8080/api/data-management/datasets +``` + +### Test JWT Authentication 
+```bash +curl -H "Authorization: Bearer " http://localhost:8080/api/protected-endpoint +``` + +## Documentation + +- **Spring Cloud Gateway Docs**: https://docs.spring.io/spring-cloud-gateway/ +- **Nacos Discovery**: https://nacos.io/ + +## Related Links + +- [Backend README](../README.md) +- [Main Application README](../services/main-application/README.md) diff --git a/backend/services/main-application/README-zh.md b/backend/services/main-application/README-zh.md new file mode 100644 index 000000000..1568c5a20 --- /dev/null +++ b/backend/services/main-application/README-zh.md @@ -0,0 +1,112 @@ +# 主应用 + +## 概述 + +主应用是 DataMate 的核心 Spring Boot 服务,包含数据管理、数据清洗、算子市场、数据收集等主要功能模块。 + +## 架构 + +``` +backend/services/main-application/ +├── src/main/java/com/datamate/main/ +│ ├── interfaces/ +│ │ ├── rest/ # Controllers +│ │ ├── dto/ # Request/Response DTOs +│ │ └── converter/ # MapStruct converters +│ ├── application/ # Application services +│ ├── domain/ +│ │ ├── model/ # Entities +│ │ └── repository/ # Repository interfaces +│ └── infrastructure/ +│ ├── persistence/ # Repository implementations +│ ├── client/ # External API clients +│ └── config/ # Service configuration +└── src/main/resources/ + ├── application.yml # 主配置 + ├── config/application-datamanagement.yml # 数据管理配置 + └── config/application-datacollection.yml # 数据收集配置 +``` + +## 模块 + +### 1. 数据管理 +- 数据集 CRUD 操作 +- 文件上传/下载 +- 标签管理 +- 数据集版本控制 + +### 2. 数据收集 +- 数据源配置 +- 定时数据收集任务 +- 数据同步 +- 数据导入/导出 + +## 配置 + +### 端口 +- **默认**: 8080 +- **上下文路径**: `/api` + +### 关键配置 +```yaml +server: + port: 8080 + servlet: + context-path: /api + +datamate: + data-management: + base-path: /dataset +``` + +## 快速开始 + +### 前置条件 +- JDK 21+ +- Maven 3.8+ +- PostgreSQL 12+ +- Redis 6+ + +### 构建 +```bash +cd backend/services/main-application +mvn clean install +``` + +### 运行 +```bash +cd backend/services/main-application +mvn spring-boot:run +``` + +## 开发 + +### 添加新模块 +1. 在 `domain/model/` 创建实体类 +2. 
在 `domain/repository/` 创建 repository 接口 +3. 在 `infrastructure/persistence/` 实现 repository +4. 在 `application/` 创建 application service +5. 在 `interfaces/rest/` 创建 controller + +## 测试 + +### 运行测试 +```bash +cd backend/services/main-application +mvn test +``` + +### 运行特定测试 +```bash +mvn test -Dtest=DatasetControllerTest +``` + +## 文档 + +- **Spring Boot 文档**: https://docs.spring.io/spring-boot/ +- [AGENTS.md](../../shared/AGENTS.md) + +## 相关链接 + +- [后端 README](../../README.md) +- [API Gateway README](../../api-gateway/README.md) diff --git a/backend/services/main-application/README.md b/backend/services/main-application/README.md new file mode 100644 index 000000000..51b4c65c5 --- /dev/null +++ b/backend/services/main-application/README.md @@ -0,0 +1,112 @@ +# Main Application + +## Overview + +The Main Application is DataMate's core Spring Boot service, containing major functional modules including data management, data cleaning, operator marketplace, and data collection. + +## Architecture + +``` +backend/services/main-application/ +├── src/main/java/com/datamate/main/ +│ ├── interfaces/ +│ │ ├── rest/ # Controllers +│ │ ├── dto/ # Request/Response DTOs +│ │ └── converter/ # MapStruct converters +│ ├── application/ # Application services +│ ├── domain/ +│ │ ├── model/ # Entities +│ │ └── repository/ # Repository interfaces +│ └── infrastructure/ +│ ├── persistence/ # Repository implementations +│ ├── client/ # External API clients +│ └── config/ # Service configuration +└── src/main/resources/ + ├── application.yml # Main configuration + ├── config/application-datamanagement.yml # Data management config + └── config/application-datacollection.yml # Data collection config +``` + +## Modules + +### 1. Data Management +- Dataset CRUD operations +- File upload/download +- Tag management +- Dataset versioning + +### 2. 
Data Collection +- Data source configuration +- Scheduled data collection tasks +- Data synchronization +- Data import/export + +## Configuration + +### Port +- **Default**: 8080 +- **Context Path**: `/api` + +### Key Configuration +```yaml +server: + port: 8080 + servlet: + context-path: /api + +datamate: + data-management: + base-path: /dataset +``` + +## Quick Start + +### Prerequisites +- JDK 21+ +- Maven 3.8+ +- PostgreSQL 12+ +- Redis 6+ + +### Build +```bash +cd backend/services/main-application +mvn clean install +``` + +### Run +```bash +cd backend/services/main-application +mvn spring-boot:run +``` + +## Development + +### Adding a New Module +1. Create entity class in `domain/model/` +2. Create repository interface in `domain/repository/` +3. Implement repository in `infrastructure/persistence/` +4. Create application service in `application/` +5. Create controller in `interfaces/rest/` + +## Testing + +### Run Tests +```bash +cd backend/services/main-application +mvn test +``` + +### Run Specific Test +```bash +mvn test -Dtest=DatasetControllerTest +``` + +## Documentation + +- **Spring Boot Docs**: https://docs.spring.io/spring-boot/ +- [AGENTS.md](../../shared/AGENTS.md) + +## Related Links + +- [Backend README](../../README.md) +- [API Gateway README](../../api-gateway/README.md) diff --git a/backend/shared/README-zh.md b/backend/shared/README-zh.md new file mode 100644 index 000000000..d2dc48abf --- /dev/null +++ b/backend/shared/README-zh.md @@ -0,0 +1,144 @@ +# 共享库 + +## 概述 + +共享库包含所有后端服务共用的代码和工具,包括领域构建块、异常处理、JWT 工具等。 + +## 架构 + +``` +backend/shared/ +├── domain-common/ # DDD 构建块、异常处理 +│ └── src/main/java/com/datamate/common/ +│ ├── infrastructure/exception/ # BusinessException, ErrorCode +│ ├── setting/ # 系统参数、模型配置 +│ └── domain/ # Base entities, repositories +└── security-common/ # JWT 工具、认证辅助 + └── src/main/java/com/datamate/security/ +``` + +## 库 + +### 1. 
domain-common + +#### BusinessException +统一的业务异常处理机制: + +```java +// 抛出业务异常 +throw BusinessException.of(ErrorCode.DATASET_NOT_FOUND) + .withDetail("dataset_id", datasetId); + +// 带上下文的异常 +throw BusinessException.of(ErrorCode.VALIDATION_FAILED) + .withDetail("field", "email") + .withDetail("reason", "Invalid format"); +``` + +#### ErrorCode +错误码枚举接口: + +```java +public interface ErrorCode { + String getCode(); + String getMessage(); + HttpStatus getHttpStatus(); +} + +// 示例 +public enum CommonErrorCode implements ErrorCode { + SUCCESS("0000", "Success", HttpStatus.OK), + DATABASE_NOT_FOUND("4001", "Database not found", HttpStatus.NOT_FOUND); +} +``` + +#### BaseEntity +所有实体的基类,包含审计字段: + +```java +@Data +@EqualsAndHashCode(callSuper = true) +public class BaseEntity implements Serializable { + @TableId(type = IdType.ASSIGN_ID) + private String id; + + @TableField(fill = FieldFill.INSERT) + private LocalDateTime createdAt; + + @TableField(fill = FieldFill.INSERT_UPDATE) + private LocalDateTime updatedAt; + + @TableField(fill = FieldFill.INSERT) + private String createdBy; + + @TableField(fill = FieldFill.INSERT_UPDATE) + private String updatedBy; +} +``` + +### 2. 
security-common
+
+#### JWT 工具
+JWT Token 生成和验证:
+
+```java
+// 生成 Token
+String token = JwtUtil.generateToken(userId, secret, expiration);
+
+// 验证 Token
+Claims claims = JwtUtil.validateToken(token, secret);
+String userId = claims.getSubject();
+```
+
+## 使用
+
+### 在服务中使用共享库
+
+#### Maven 依赖
+```xml
+<dependency>
+    <groupId>com.datamate</groupId>
+    <artifactId>domain-common</artifactId>
+    <version>1.0.0-SNAPSHOT</version>
+</dependency>
+<dependency>
+    <groupId>com.datamate</groupId>
+    <artifactId>security-common</artifactId>
+    <version>1.0.0-SNAPSHOT</version>
+</dependency>
+```
+
+#### 使用 BusinessException
+```java
+@RestController
+@RequiredArgsConstructor
+public class DatasetController {
+
+    public ResponseEntity<DatasetResponse> getDataset(String id) {
+        Dataset dataset = datasetService.findById(id);
+        if (dataset == null) {
+            throw BusinessException.of(ErrorCode.DATASET_NOT_FOUND);
+        }
+        return ResponseEntity.ok(DatasetResponse.from(dataset));
+    }
+}
+```
+
+## 快速开始
+
+### 构建共享库
+```bash
+cd backend
+mvn clean install
+```
+
+### 在服务中使用
+共享库会自动被所有后端服务继承。
+
+## 文档
+
+- [AGENTS.md](./AGENTS.md)
+
+## 相关链接
+
+- [后端 README](../README.md)
diff --git a/backend/shared/README.md b/backend/shared/README.md
new file mode 100644
index 000000000..eb8c13630
--- /dev/null
+++ b/backend/shared/README.md
@@ -0,0 +1,144 @@
+# Shared Libraries
+
+## Overview
+
+Shared Libraries contain code and utilities shared across all backend services, including domain building blocks, exception handling, JWT utilities, and more.
+
+## Architecture
+
+```
+backend/shared/
+├── domain-common/                    # DDD building blocks, exception handling
+│   └── src/main/java/com/datamate/common/
+│       ├── infrastructure/exception/ # BusinessException, ErrorCode
+│       ├── setting/                  # System params, model configs
+│       └── domain/                   # Base entities, repositories
+└── security-common/                  # JWT utilities, auth helpers
+    └── src/main/java/com/datamate/security/
+```
+
+## Libraries
+
+### 1.
domain-common + +#### BusinessException +Unified business exception handling mechanism: + +```java +// Throw business exception +throw BusinessException.of(ErrorCode.DATASET_NOT_FOUND) + .withDetail("dataset_id", datasetId); + +// Exception with context +throw BusinessException.of(ErrorCode.VALIDATION_FAILED) + .withDetail("field", "email") + .withDetail("reason", "Invalid format"); +``` + +#### ErrorCode +Error code enumeration interface: + +```java +public interface ErrorCode { + String getCode(); + String getMessage(); + HttpStatus getHttpStatus(); +} + +// Example +public enum CommonErrorCode implements ErrorCode { + SUCCESS("0000", "Success", HttpStatus.OK), + DATABASE_NOT_FOUND("4001", "Database not found", HttpStatus.NOT_FOUND); +} +``` + +#### BaseEntity +Base class for all entities, including audit fields: + +```java +@Data +@EqualsAndHashCode(callSuper = true) +public class BaseEntity implements Serializable { + @TableId(type = IdType.ASSIGN_ID) + private String id; + + @TableField(fill = FieldFill.INSERT) + private LocalDateTime createdAt; + + @TableField(fill = FieldFill.INSERT_UPDATE) + private LocalDateTime updatedAt; + + @TableField(fill = FieldFill.INSERT) + private String createdBy; + + @TableField(fill = FieldFill.INSERT_UPDATE) + private String updatedBy; +} +``` + +### 2. 
security-common
+
+#### JWT Utilities
+JWT Token generation and validation:
+
+```java
+// Generate Token
+String token = JwtUtil.generateToken(userId, secret, expiration);
+
+// Validate Token
+Claims claims = JwtUtil.validateToken(token, secret);
+String userId = claims.getSubject();
+```
+
+## Usage
+
+### Using Shared Libraries in Services
+
+#### Maven Dependencies
+```xml
+<dependency>
+    <groupId>com.datamate</groupId>
+    <artifactId>domain-common</artifactId>
+    <version>1.0.0-SNAPSHOT</version>
+</dependency>
+<dependency>
+    <groupId>com.datamate</groupId>
+    <artifactId>security-common</artifactId>
+    <version>1.0.0-SNAPSHOT</version>
+</dependency>
+```
+
+#### Using BusinessException
+```java
+@RestController
+@RequiredArgsConstructor
+public class DatasetController {
+
+    public ResponseEntity<DatasetResponse> getDataset(String id) {
+        Dataset dataset = datasetService.findById(id);
+        if (dataset == null) {
+            throw BusinessException.of(ErrorCode.DATASET_NOT_FOUND);
+        }
+        return ResponseEntity.ok(DatasetResponse.from(dataset));
+    }
+}
+```
+
+## Quick Start
+
+### Build Shared Libraries
+```bash
+cd backend
+mvn clean install
+```
+
+### Use in Services
+Shared libraries are automatically inherited by all backend services.
+
+## Documentation
+
+- [AGENTS.md](./AGENTS.md)
+
+## Related Links
+
+- [Backend README](../README.md)
diff --git a/runtime/README-zh.md b/runtime/README-zh.md
new file mode 100644
index 000000000..5aa180ddd
--- /dev/null
+++ b/runtime/README-zh.md
@@ -0,0 +1,146 @@
+# DataMate 运行时
+
+## 概述
+
+DataMate 运行时提供数据处理、算子执行、数据收集等核心功能,基于 Python 3.12+ 和 FastAPI 框架。
+
+## 架构
+
+```
+runtime/
+├── datamate-python/   # FastAPI 后端服务(端口 18000)
+├── python-executor/   # Ray 分布式执行器
+├── ops/               # 算子生态
+├── datax/             # DataX 数据读写框架
+└── deer-flow/         # DeerFlow 服务
+```
+
+## 组件
+
+### 1.
datamate-python (FastAPI 后端) +**端口**: 18000 + +核心 Python 服务,提供以下功能: +- **数据合成**: QA 生成、文档处理 +- **数据标注**: Label Studio 集成、自动标注 +- **数据评估**: 模型评估、质量检查 +- **数据清洗**: 数据清洗管道 +- **算子市场**: 算子管理、上传 +- **RAG 索引**: 向量索引、知识库管理 +- **数据收集**: 定时任务、数据源集成 + +**技术栈**: +- FastAPI 0.124+ +- SQLAlchemy 2.0+ (async) +- Pydantic 2.12+ +- PostgreSQL (via asyncpg) +- Milvus (via pymilvus) +- APScheduler (定时任务) + +### 2. python-executor (Ray 执行器) +Ray 分布式执行框架,负责: +- **算子执行**: 执行数据处理算子 +- **任务调度**: 异步任务管理 +- **分布式计算**: 多节点并行处理 + +**技术栈**: +- Ray 2.7.0 +- FastAPI (执行器 API) +- Data-Juicer (数据处理) + +### 3. ops (算子生态) +算子生态,包含: +- **filter**: 数据过滤(去重、敏感内容、质量过滤) +- **mapper**: 数据转换(清洗、归一化) +- **slicer**: 数据切片(文本分割、幻灯片提取) +- **formatter**: 格式转换(PDF → text, slide → JSON) +- **llms**: LLM 算子(质量评估、条件检查) +- **annotation**: 标注算子(目标检测、分割) + +**见**: `runtime/ops/README.md` 获取算子开发指南 + +### 4. datax (DataX 框架) +DataX 数据读写框架,支持多种数据源: +- **Readers**: MySQL, PostgreSQL, Oracle, MongoDB, Elasticsearch, HDFS, S3, NFS, GlusterFS, API, 等 +- **Writers**: 同上,支持写入目标 + +**技术栈**: Java (Maven 构建) + +### 5. 
deer-flow (DeerFlow 服务) +DeerFlow 服务(配置见 `conf.yaml`)。 + +## 快速开始 + +### 前置条件 +- Python 3.12+ +- Poetry (for datamate-python) +- Ray 2.7.0+ (for python-executor) + +### 运行 datamate-python +```bash +cd runtime/datamate-python +poetry install +poetry run uvicorn app.main:app --reload --port 18000 +``` + +### 运行 python-executor +```bash +cd runtime/python-executor +poetry install +ray start --head +``` + +## 开发 + +### datamate-python 模块结构 +``` +app/ +├── core/ # 日志、异常、配置 +├── db/ +│ ├── models/ # SQLAlchemy 模型 +│ └── session.py # 异步会话 +├── module/ +│ ├── annotation/ # Label Studio 集成 +│ ├── collection/ # 数据收集 +│ ├── cleaning/ # 数据清洗 +│ ├── dataset/ # 数据集管理 +│ ├── evaluation/ # 模型评估 +│ ├── generation/ # QA 合成 +│ ├── operator/ # 算子市场 +│ ├── rag/ # RAG 索引 +│ └── shared/ # 共享 schemas +└── main.py # FastAPI 入口 +``` + +### 代码约定 +- **路由**: `APIRouter` 在 `interface/*.py` +- **依赖注入**: `Depends(get_db)` 获取会话 +- **错误**: `raise BusinessError(ErrorCode.XXX, context)` +- **事务**: `async with transaction(db):` +- **模型**: Extend `BaseEntity` (审计字段自动填充) + +## 测试 + +```bash +cd runtime/datamate-python +poetry run pytest +``` + +## 配置 + +### 环境变量 +- `DATABASE_URL`: PostgreSQL 连接字符串 +- `LABEL_STUDIO_BASE_URL`: Label Studio URL +- `RAY_ENABLED`: 启用 Ray 执行器 +- `RAY_ADDRESS`: Ray 集群地址 + +## 文档 + +- **API 文档**: http://localhost:18000/redoc +- **算子指南**: 见 `runtime/ops/README.md` 获取算子开发 + +## 相关链接 + +- [FastAPI 文档](https://fastapi.tiangolo.com/) +- [Ray 文档](https://docs.ray.io/) +- [SQLAlchemy 文档](https://docs.sqlalchemy.org/) diff --git a/runtime/README.md b/runtime/README.md new file mode 100644 index 000000000..8d3a5621c --- /dev/null +++ b/runtime/README.md @@ -0,0 +1,146 @@ +# DataMate Runtime + +## Overview + +DataMate Runtime provides core functionality for data processing, operator execution, and data collection, built on Python 3.12+ and the FastAPI framework. 
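Among this README's code conventions is wrapping writes in `async with transaction(db):`. A minimal sketch of that unit-of-work shape — note that `FakeSession` is a hypothetical stand-in for the runtime's real async SQLAlchemy session, and `transaction` here is an illustrative reimplementation, not the project's actual helper:

```python
import asyncio
from contextlib import asynccontextmanager

class FakeSession:
    """Hypothetical stand-in for an async SQLAlchemy session."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False

    async def commit(self):
        self.committed = True

    async def rollback(self):
        self.rolled_back = True

@asynccontextmanager
async def transaction(db):
    # Unit-of-work: commit when the block succeeds, roll back on any error.
    try:
        yield db
        await db.commit()
    except Exception:
        await db.rollback()
        raise

async def demo():
    db = FakeSession()
    async with transaction(db):
        pass  # do some work with db here
    return db.committed

print(asyncio.run(demo()))  # True
```

The point of the convention is that route handlers never call `commit`/`rollback` directly; the context manager owns the transaction boundary.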
+ +## Architecture + +``` +runtime/ +├── datamate-python/ # FastAPI backend service (port 18000) +├── python-executor/ # Ray distributed executor +├── ops/ # Operator ecosystem +├── datax/ # DataX data read/write framework +└── deer-flow/ # DeerFlow service +``` + +## Components + +### 1. datamate-python (FastAPI Backend) +**Port**: 18000 + +Core Python service providing: +- **Data Synthesis**: QA generation, document processing +- **Data Annotation**: Label Studio integration, auto-annotation +- **Data Evaluation**: Model evaluation, quality checks +- **Data Cleaning**: Data cleaning pipelines +- **Operator Marketplace**: Operator management, upload +- **RAG Indexing**: Vector indexing, knowledge base management +- **Data Collection**: Scheduled tasks, data source integration + +**Technology Stack**: +- FastAPI 0.124+ +- SQLAlchemy 2.0+ (async) +- Pydantic 2.12+ +- PostgreSQL (via asyncpg) +- Milvus (via pymilvus) +- APScheduler (scheduled tasks) + +### 2. python-executor (Ray Executor) +Ray distributed execution framework responsible for: +- **Operator Execution**: Execute data processing operators +- **Task Scheduling**: Async task management +- **Distributed Computing**: Multi-node parallel processing + +**Technology Stack**: +- Ray 2.7.0 +- FastAPI (executor API) +- Data-Juicer (data processing) + +### 3. ops (Operator Ecosystem) +Operator ecosystem including: +- **filter**: Data filtering (deduplication, sensitive content, quality filtering) +- **mapper**: Data transformation (cleaning, normalization) +- **slicer**: Data slicing (text splitting, slide extraction) +- **formatter**: Format conversion (PDF → text, slide → JSON) +- **llms**: LLM operators (quality evaluation, condition checking) +- **annotation**: Annotation operators (object detection, segmentation) + +**See**: `runtime/ops/README.md` for operator development guide + +### 4. 
datax (DataX Framework) +DataX data read/write framework supporting multiple data sources: +- **Readers**: MySQL, PostgreSQL, Oracle, MongoDB, Elasticsearch, HDFS, S3, NFS, GlusterFS, API, etc. +- **Writers**: Same as above, supports writing to targets + +**Technology Stack**: Java (Maven build) + +### 5. deer-flow (DeerFlow Service) +DeerFlow service (see `conf.yaml` for configuration). + +## Quick Start + +### Prerequisites +- Python 3.12+ +- Poetry (for datamate-python) +- Ray 2.7.0+ (for python-executor) + +### Run datamate-python +```bash +cd runtime/datamate-python +poetry install +poetry run uvicorn app.main:app --reload --port 18000 +``` + +### Run python-executor +```bash +cd runtime/python-executor +poetry install +ray start --head +``` + +## Development + +### datamate-python Module Structure +``` +app/ +├── core/ # Logging, exception, config +├── db/ +│ ├── models/ # SQLAlchemy models +│ └── session.py # Async session +├── module/ +│ ├── annotation/ # Label Studio integration +│ ├── collection/ # Data collection +│ ├── cleaning/ # Data cleaning +│ ├── dataset/ # Dataset management +│ ├── evaluation/ # Model evaluation +│ ├── generation/ # QA synthesis +│ ├── operator/ # Operator marketplace +│ ├── rag/ # RAG indexing +│ └── shared/ # Shared schemas +└── main.py # FastAPI entry +``` + +### Code Conventions +- **Routes**: `APIRouter` in `interface/*.py` +- **Dependency Injection**: `Depends(get_db)` for session +- **Error Handling**: `raise BusinessError(ErrorCodes.XXX, context)` +- **Transactions**: `async with transaction(db):` +- **Models**: Extend `BaseEntity` (audit fields auto-filled) + +## Testing + +```bash +cd runtime/datamate-python +poetry run pytest +``` + +## Configuration + +### Environment Variables +- `DATABASE_URL`: PostgreSQL connection string +- `LABEL_STUDIO_BASE_URL`: Label Studio URL +- `RAY_ENABLED`: Enable Ray executor +- `RAY_ADDRESS`: Ray cluster address + +## Documentation + +- **API Docs**: http://localhost:18000/redoc +- 
**Operator Guide**: See `runtime/ops/README.md` for operator development + +## Related Links + +- [FastAPI Documentation](https://fastapi.tiangolo.com/) +- [Ray Documentation](https://docs.ray.io/) +- [SQLAlchemy Documentation](https://docs.sqlalchemy.org/) diff --git a/runtime/datax/README-zh.md b/runtime/datax/README-zh.md new file mode 100644 index 000000000..40d3c8e0a --- /dev/null +++ b/runtime/datax/README-zh.md @@ -0,0 +1,151 @@ +# DataX 框架 + +## 概述 + +DataX 是一个数据传输框架,支持多种数据源和数据目标之间的数据传输,用于数据收集和同步。 + +## 架构 + +``` +runtime/datax/ +├── core/ # DataX 核心组件 +├── transformer/ # 数据转换器 +├── readers/ # 数据读取器 +│ ├── mysqlreader/ +│ ├── postgresqlreader/ +│ ├── oracleReader/ +│ ├── mongodbreader/ +│ ├── hdfsreader/ +│ ├── s3rader/ +│ ├── nfsreader/ +│ ├── glusterfsreader/ +│ └── apireader/ +└── writers/ # 数据写入器 + ├── mysqlwriter/ + ├── postgresqlwriter/ + ├── oraclewriter/ + ├── mongodbwriter/ + ├── hdfswriter/ + ├── s3writer/ + ├── nfswriter/ + ├── glusterfswriter/ + └── txtfilewriter/ +``` + +## 支持的数据源 + +### 关系型数据库 +- MySQL +- PostgreSQL +- Oracle +- SQL Server +- DB2 +- KingbaseES +- GaussDB + +### NoSQL 数据库 +- MongoDB +- Elasticsearch +- Cassandra +- HBase +- Redis + +### 文件系统 +- HDFS +- S3 (AWS S3, MinIO, 阿里云 OSS) +- NFS +- GlusterFS +- 本地文件系统 + +### 其他 +- API 接口 +- Kafka +- Pulsar +- DataHub +- LogHub + +## 使用 + +### 基本配置 +```json +{ + "job": { + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + "username": "root", + "password": "password", + "column": ["id", "name", "email"], + "connection": [ + { + "jdbcUrl": "jdbc:mysql://localhost:3306/database", + "table": ["users"] + } + ] + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/output/users.txt", + "fileName": "users", + "writeMode": "truncate" + } + } + } + ] + } +} +``` + +### 运行 DataX +```bash +# 构建 DataX +cd runtime/datax +mvn clean package + +# 运行 +python datax.py -j job.json +``` + +## 快速开始 + +### 前置条件 +- JDK 8+ +- Maven 3.8+ +- Python 3.6+ + +### 构建 
+```bash +cd runtime/datax +mvn clean package +``` + +### 运行示例 +```bash +python datax.py -j examples/mysql2text.json +``` + +## 开发 + +### 添加新的 Reader +1. 在 `readers/` 创建新模块 +2. 实现 Reader 接口 +3. 配置 reader 参数 +4. 添加到 package.xml + +### 添加新的 Writer +1. 在 `writers/` 创建新模块 +2. 实现 Writer 接口 +3. 配置 writer 参数 +4. 添加到 package.xml + +## 文档 + +- [DataX 官方文档](https://github.com/alibaba/DataX) + +## 相关链接 + +- [运行时 README](../README.md) diff --git a/runtime/datax/README.md b/runtime/datax/README.md new file mode 100644 index 000000000..af2366255 --- /dev/null +++ b/runtime/datax/README.md @@ -0,0 +1,151 @@ +# DataX Framework + +## Overview + +DataX is a data transfer framework that supports data transmission between various data sources and targets, used for data collection and synchronization. + +## Architecture + +``` +runtime/datax/ +├── core/ # DataX core components +├── transformer/ # Data transformers +├── readers/ # Data readers +│ ├── mysqlreader/ +│ ├── postgresqlreader/ +│ ├── oracleReader/ +│ ├── mongodbreader/ +│ ├── hdfsreader/ +│ ├── s3rader/ +│ ├── nfsreader/ +│ ├── glusterfsreader/ +│ └── apireader/ +└── writers/ # Data writers + ├── mysqlwriter/ + ├── postgresqlwriter/ + ├── oraclewriter/ + ├── mongodbwriter/ + ├── hdfswriter/ + ├── s3writer/ + ├── nfswriter/ + ├── glusterfswriter/ + └── txtfilewriter/ +``` + +## Supported Data Sources + +### Relational Databases +- MySQL +- PostgreSQL +- Oracle +- SQL Server +- DB2 +- KingbaseES +- GaussDB + +### NoSQL Databases +- MongoDB +- Elasticsearch +- Cassandra +- HBase +- Redis + +### File Systems +- HDFS +- S3 (AWS S3, MinIO, Alibaba Cloud OSS) +- NFS +- GlusterFS +- Local file system + +### Others +- API interfaces +- Kafka +- Pulsar +- DataHub +- LogHub + +## Usage + +### Basic Configuration +```json +{ + "job": { + "content": [ + { + "reader": { + "name": "mysqlreader", + "parameter": { + "username": "root", + "password": "password", + "column": ["id", "name", "email"], + "connection": [ + { + "jdbcUrl": 
"jdbc:mysql://localhost:3306/database", + "table": ["users"] + } + ] + } + }, + "writer": { + "name": "txtfilewriter", + "parameter": { + "path": "/output/users.txt", + "fileName": "users", + "writeMode": "truncate" + } + } + } + ] + } +} +``` + +### Run DataX +```bash +# Build DataX +cd runtime/datax +mvn clean package + +# Run +python datax.py -j job.json +``` + +## Quick Start + +### Prerequisites +- JDK 8+ +- Maven 3.8+ +- Python 3.6+ + +### Build +```bash +cd runtime/datax +mvn clean package +``` + +### Run Example +```bash +python datax.py -j examples/mysql2text.json +``` + +## Development + +### Adding a New Reader +1. Create new module in `readers/` +2. Implement Reader interface +3. Configure reader parameters +4. Add to package.xml + +### Adding a New Writer +1. Create new module in `writers/` +2. Implement Writer interface +3. Configure writer parameters +4. Add to package.xml + +## Documentation + +- [DataX Official Documentation](https://github.com/alibaba/DataX) + +## Related Links + +- [Runtime README](../README.md) diff --git a/runtime/deer-flow/README-zh.md b/runtime/deer-flow/README-zh.md new file mode 100644 index 000000000..209d436de --- /dev/null +++ b/runtime/deer-flow/README-zh.md @@ -0,0 +1,97 @@ +# DeerFlow 服务 + +## 概述 + +DeerFlow 是一个 LLM 驱动的服务,用于规划和推理任务,支持多种 LLM 提供商。 + +## 架构 + +``` +runtime/deer-flow/ +├── conf.yaml # DeerFlow 配置文件 +├── .env # 环境变量 +└── (其他源代码) +``` + +## 配置 + +### 基本配置 (conf.yaml) + +```yaml +# 基础模型配置 +BASIC_MODEL: + base_url: https://api.example.com/v1 + model: "model-name" + api_key: your_api_key + max_retries: 3 + verify_ssl: false # 如果使用自签名证书,设为 false + +# 推理模型配置(可选) +REASONING_MODEL: + base_url: https://api.example.com/v1 + model: "reasoning-model-name" + api_key: your_api_key + max_retries: 3 + +# 搜索引擎配置(可选) +SEARCH_ENGINE: + engine: tavily + include_domains: + - example.com + - trusted-news.com + exclude_domains: + - spam-site.com + search_depth: "advanced" + include_raw_content: true + include_images: true + 
include_image_descriptions: true
+  min_score_threshold: 0.0
+  max_content_length_per_page: 4000
+```
+
+## 支持的 LLM 提供商
+
+#### OpenAI
+```yaml
+BASIC_MODEL:
+  base_url: https://api.openai.com/v1
+  model: "gpt-4"
+  api_key: sk-...
+```
+
+#### Ollama (本地部署)
+```yaml
+BASIC_MODEL:
+  base_url: "http://localhost:11434/v1"
+  model: "qwen2:7b"
+  api_key: "ollama"
+  verify_ssl: false
+```
+
+#### Google AI Studio
+```yaml
+BASIC_MODEL:
+  platform: "google_aistudio"
+  model: "gemini-2.5-flash"
+  api_key: your_gemini_api_key
+```
+
+## 开发
+
+### 添加新的 LLM 提供商
+1. 在 `conf.yaml` 添加新的模型配置
+2. 实现对应的 API 调用逻辑
+3. 测试连接和推理
+
+### 自定义提示词模板
+1. 创建提示词模板文件
+2. 在 `conf.yaml` 引用模板
+3. 测试提示词效果
+
+## 文档
+
+- [DeerFlow 官方文档](https://github.com/ModelEngine-Group/DeerFlow)
+
+## 相关链接
+
+- [运行时 README](../README.md)
diff --git a/runtime/deer-flow/README.md b/runtime/deer-flow/README.md
new file mode 100644
index 000000000..ee7642ab2
--- /dev/null
+++ b/runtime/deer-flow/README.md
@@ -0,0 +1,97 @@
+# DeerFlow Service
+
+## Overview
+
+DeerFlow is an LLM-driven service for planning and reasoning tasks, supporting multiple LLM providers.
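The provider examples in this README all share the same handful of `BASIC_MODEL` keys. A sketch of loading one such block into a typed config, assuming `conf.yaml` has already been parsed into a dict (e.g. with `yaml.safe_load`) — the `ModelConfig` dataclass and its defaults are illustrative, not DeerFlow's actual internals:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    base_url: str
    model: str
    api_key: str
    max_retries: int = 3       # assumed default, matching the sample conf
    verify_ssl: bool = True    # assumed default

    @classmethod
    def from_conf(cls, conf: dict, section: str = "BASIC_MODEL") -> "ModelConfig":
        block = conf.get(section)
        if not block:
            raise KeyError(f"{section} missing from conf.yaml")
        return cls(
            base_url=block["base_url"],
            model=block["model"],
            api_key=block["api_key"],
            max_retries=block.get("max_retries", 3),
            verify_ssl=block.get("verify_ssl", True),
        )

# As if conf.yaml (Ollama example) had been parsed into a dict:
conf = {
    "BASIC_MODEL": {
        "base_url": "http://localhost:11434/v1",
        "model": "qwen2:7b",
        "api_key": "ollama",
        "verify_ssl": False,
    }
}
cfg = ModelConfig.from_conf(conf)
print(cfg.model, cfg.max_retries)  # qwen2:7b 3
```

Because every provider block has the same shape, switching providers is a config change only — the consuming code reads one `ModelConfig` regardless of backend.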
+ +## Architecture + +``` +runtime/deer-flow/ +├── conf.yaml # DeerFlow configuration file +├── .env # Environment variables +└── (other source code) +``` + +## Configuration + +### Basic Configuration (conf.yaml) + +```yaml +# Basic model configuration +BASIC_MODEL: + base_url: https://api.example.com/v1 + model: "model-name" + api_key: your_api_key + max_retries: 3 + verify_ssl: false # Set to false if using self-signed certificates + +# Reasoning model configuration (optional) +REASONING_MODEL: + base_url: https://api.example.com/v1 + model: "reasoning-model-name" + api_key: your_api_key + max_retries: 3 + +# Search engine configuration (optional) +SEARCH_ENGINE: + engine: tavily + include_domains: + - example.com + - trusted-news.com + exclude_domains: + - spam-site.com + search_depth: "advanced" + include_raw_content: true + include_images: true + include_image_descriptions: true + min_score_threshold: 0.0 + max_content_length_per_page: 4000 +``` + +## Supported LLM Providers + +#### OpenAI +```yaml +BASIC_MODEL: + base_url: https://api.openai.com/v1 + model: "gpt-4" + api_key: sk-... +``` + +#### Ollama (Local Deployment) +```yaml +BASIC_MODEL: + base_url: "http://localhost:11434/v1" + model: "qwen2:7b" + api_key: "ollama" + verify_ssl: false +``` + +#### Google AI Studio +```yaml +BASIC_MODEL: + platform: "google_aistudio" + model: "gemini-2.5-flash" + api_key: your_gemini_api_key +``` + +## Development + +### Adding a New LLM Provider +1. Add new model configuration in `conf.yaml` +2. Implement corresponding API call logic +3. Test connection and inference + +### Customizing Prompt Templates +1. Create a prompt template file +2. Reference the template in `conf.yaml` +3. 
Test prompt effectiveness + +## Documentation + +- [DeerFlow Official Documentation](https://github.com/ModelEngine-Group/DeerFlow) + +## Related Links + +- [Runtime README](../README.md) diff --git a/runtime/python-executor/README-zh.md b/runtime/python-executor/README-zh.md new file mode 100644 index 000000000..b833aece6 --- /dev/null +++ b/runtime/python-executor/README-zh.md @@ -0,0 +1,221 @@ +# Ray 执行器 + +## 概述 + +Ray 执行器是基于 Ray 的分布式执行框架,负责执行数据处理算子、任务调度和分布式计算。 + +## 架构 + +``` +runtime/python-executor/ +└── datamate/ + ├── core/ + │ ├── base_op.py # BaseOp, Mapper, Filter, Slicer, LLM + │ ├── dataset.py # Dataset 处理 + │ └── constant.py # 常量定义 + ├── scheduler/ + │ ├── scheduler.py # TaskScheduler, Task, TaskStatus + │ ├── func_task_scheduler.py # 函数任务调度 + │ └── cmd_task_scheduler.py # 命令任务调度 + ├── wrappers/ + │ ├── executor.py # Ray 执行器入口 + │ ├── datamate_wrapper.py # DataMate 任务包装 + │ └── data_juicer_wrapper.py # DataJuicer 集成 + └── common/utils/ # 工具函数 + ├── bytes_transform.py + ├── file_scanner.py + ├── lazy_loader.py + └── text_splitter.py +``` + +## 组件 + +### 1. Base 类 + +#### BaseOp +所有算子的基类: + +```python +class BaseOp: + def __init__(self, *args, **kwargs): + self.accelerator = kwargs.get('accelerator', "cpu") + self.text_key = kwargs.get('text_key', "text") + # ... 其他配置 + + def execute(self, sample): + raise NotImplementedError +``` + +#### Mapper +数据转换算子基类(1:1): + +```python +class Mapper(BaseOp): + def execute(self, sample: Dict) -> Dict: + # 转换逻辑 + return processed_sample +``` + +#### Filter +数据过滤算子基类(返回 bool): + +```python +class Filter(BaseOp): + def execute(self, sample: Dict) -> bool: + # 过滤逻辑 + return True # 保留或过滤 +``` + +#### Slicer +数据切片算子基类(1:N): + +```python +class Slicer(BaseOp): + def execute(self, sample: Dict) -> List[Dict]: + # 切片逻辑 + return [sample1, sample2, ...] 
+```
+
+#### LLM
+LLM 算子基类:
+
+```python
+class LLM(Mapper):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.llm = self.get_llm(*args, **kwargs)
+
+    def build_llm_prompt(self, *args, **kwargs):
+        raise NotImplementedError
+```
+
+### 2. Task Scheduler
+
+异步任务调度器:
+
+```python
+class TaskScheduler:
+    def __init__(self, max_concurrent: int = 10):
+        self.tasks: Dict[str, Task] = {}
+        self.semaphore = asyncio.Semaphore(max_concurrent)
+
+    async def submit(self, task_id, task, *args, **kwargs):
+        # 提交任务
+        pass
+
+    def get_task_status(self, task_id: str) -> Optional[TaskResult]:
+        # 获取任务状态
+        pass
+
+    def cancel_task(self, task_id: str) -> bool:
+        # 取消任务
+        pass
+```
+
+### 3. 算子执行
+
+#### 算子注册
+```python
+from datamate.core.base_op import OPERATORS
+
+OPERATORS.register_module(
+    module_name='YourOperatorName',
+    module_path="ops.user.operator_package.process"
+)
+```
+
+#### 执行算子
+```python
+from datamate.core.base_op import Mapper
+
+class MyMapper(Mapper):
+    def execute(self, sample):
+        text = sample.get('text', '')
+        processed = text.upper()
+        sample['text'] = processed
+        return sample
+```
+
+## 快速开始
+
+### 前置条件
+- Python 3.11+
+- Ray 2.7.0+
+- Poetry
+
+### 安装
+```bash
+cd runtime/python-executor
+poetry install
+```
+
+### 启动 Ray Head
+```bash
+ray start --head
+```
+
+### 启动 Ray Worker
+```bash
+ray start --address=<head-node-ip>:6379
+```
+
+## 使用
+
+### 提交任务到 Ray
+```python
+import ray
+
+@ray.remote
+def execute_operator(sample, operator_config):
+    # 执行算子逻辑
+    return result
+
+# 提交任务
+result_ref = execute_operator.remote(sample, config)
+result = ray.get(result_ref)
+```
+
+### 使用 Task Scheduler
+```python
+from datamate.scheduler.scheduler import TaskScheduler
+
+scheduler = TaskScheduler(max_concurrent=10)
+task_id = "task-001"
+# submit 是协程,需在 async 上下文中 await
+await scheduler.submit(task_id, my_function, arg1, arg2)
+status = scheduler.get_task_status(task_id)
+```
+
+## 开发
+
+### 添加新算子
+1. 在 `runtime/ops/` 创建算子目录
+2. 实现 `process.py` 和 `__init__.py`
+3.
在 `__init__.py` 注册算子 +4. 测试算子 + +### 调试算子 +```bash +# 本地测试 +python -c "from ops.user.operator_package.process import YourOperatorName; op = YourOperatorName(); print(op.execute({'text': 'test'}))" +``` + +## 性能 + +### 并行执行 +Ray 自动处理并行执行和资源分配。 + +### 容错 +Ray 提供自动任务重试和故障转移。 + +### 资源管理 +Ray 动态分配 CPU、GPU、内存资源。 + +## 文档 + +- [Ray 文档](https://docs.ray.io/) +- [AGENTS.md](./AGENTS.md) + +## 相关链接 + +- [运行时 README](../README.md) +- [算子生态](../ops/README.md) diff --git a/runtime/python-executor/README.md b/runtime/python-executor/README.md new file mode 100644 index 000000000..9cee6c708 --- /dev/null +++ b/runtime/python-executor/README.md @@ -0,0 +1,221 @@ +# Ray Executor + +## Overview + +Ray Executor is a Ray-based distributed execution framework responsible for executing data processing operators, task scheduling, and distributed computing. + +## Architecture + +``` +runtime/python-executor/ +└── datamate/ + ├── core/ + │ ├── base_op.py # BaseOp, Mapper, Filter, Slicer, LLM + │ ├── dataset.py # Dataset processing + │ └── constant.py # Constant definitions + ├── scheduler/ + │ ├── scheduler.py # TaskScheduler, Task, TaskStatus + │ ├── func_task_scheduler.py # Function task scheduling + │ └── cmd_task_scheduler.py # Command task scheduling + ├── wrappers/ + │ ├── executor.py # Ray executor entry point + │ ├── datamate_wrapper.py # DataMate task wrapper + │ └── data_juicer_wrapper.py # DataJuicer integration + └── common/utils/ # Utility functions + ├── bytes_transform.py + ├── file_scanner.py + ├── lazy_loader.py + └── text_splitter.py +``` + +## Components + +### 1. Base Classes + +#### BaseOp +Base class for all operators: + +```python +class BaseOp: + def __init__(self, *args, **kwargs): + self.accelerator = kwargs.get('accelerator', "cpu") + self.text_key = kwargs.get('text_key', "text") + # ... 
other configuration + + def execute(self, sample): + raise NotImplementedError +``` + +#### Mapper +Base class for data transformation operators (1:1): + +```python +class Mapper(BaseOp): + def execute(self, sample: Dict) -> Dict: + # Transformation logic + return processed_sample +``` + +#### Filter +Base class for data filtering operators (returns bool): + +```python +class Filter(BaseOp): + def execute(self, sample: Dict) -> bool: + # Filtering logic + return True # Keep or filter out +``` + +#### Slicer +Base class for data slicing operators (1:N): + +```python +class Slicer(BaseOp): + def execute(self, sample: Dict) -> List[Dict]: + # Slicing logic + return [sample1, sample2, ...] +``` + +#### LLM +Base class for LLM operators: + +```python +class LLM(Mapper): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + self.llm = self.get_llm(*args, **kwargs) + + def build_llm_prompt(self, *args, **kwargs): + raise NotImplementedError +``` + +### 2. Task Scheduler + +Async task scheduler: + +```python +class TaskScheduler: + def __init__(self, max_concurrent: int = 10): + self.tasks: Dict[str, Task] = {} + self.semaphore = asyncio.Semaphore(max_concurrent) + + async def submit(self, task_id, task, *args, **kwargs): + # Submit task + pass + + def get_task_status(self, task_id: str) -> Optional[TaskResult]: + # Get task status + pass + + def cancel_task(self, task_id: str) -> bool: + # Cancel task + pass +``` + +### 3. 
Operator Execution
+
+#### Operator Registration
+```python
+from datamate.core.base_op import OPERATORS
+
+OPERATORS.register_module(
+    module_name='YourOperatorName',
+    module_path="ops.user.operator_package.process"
+)
+```
+
+#### Execute Operator
+```python
+from datamate.core.base_op import Mapper
+
+class MyMapper(Mapper):
+    def execute(self, sample):
+        text = sample.get('text', '')
+        processed = text.upper()
+        sample['text'] = processed
+        return sample
+```
+
+## Quick Start
+
+### Prerequisites
+- Python 3.11+
+- Ray 2.7.0+
+- Poetry
+
+### Installation
+```bash
+cd runtime/python-executor
+poetry install
+```
+
+### Start Ray Head
+```bash
+ray start --head
+```
+
+### Start Ray Worker
+```bash
+ray start --address=<head-node-ip>:6379
+```
+
+## Usage
+
+### Submit Task to Ray
+```python
+import ray
+
+@ray.remote
+def execute_operator(sample, operator_config):
+    # Execute operator logic
+    return result
+
+# Submit task
+result_ref = execute_operator.remote(sample, config)
+result = ray.get(result_ref)
+```
+
+### Use Task Scheduler
+```python
+from datamate.scheduler.scheduler import TaskScheduler
+
+scheduler = TaskScheduler(max_concurrent=10)
+task_id = "task-001"
+# submit is a coroutine; await it from an async context
+await scheduler.submit(task_id, my_function, arg1, arg2)
+status = scheduler.get_task_status(task_id)
+```
+
+## Development
+
+### Adding a New Operator
+1. Create operator directory in `runtime/ops/`
+2. Implement `process.py` and `__init__.py`
+3. Register operator in `__init__.py`
+4. Test the operator
+
+### Debugging Operators
+```bash
+# Local test
+python -c "from ops.user.operator_package.process import YourOperatorName; op = YourOperatorName(); print(op.execute({'text': 'test'}))"
+```
+
+## Performance
+
+### Parallel Execution
+Ray automatically handles parallel execution and resource allocation.
+
+### Fault Tolerance
+Ray provides automatic task retry and failover.
+
+### Resource Management
+Ray dynamically allocates CPU, GPU, and memory resources.
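The Mapper/Filter contracts described in this README make operators easy to unit-test without a Ray cluster. A self-contained sketch of chaining them over a list of samples — the tiny base classes below only mirror the shapes shown above; they are stand-ins, not the real `datamate.core.base_op` imports:

```python
from typing import Dict, List

class Mapper:
    """1:1 transform, mirroring the Mapper shape above."""
    def execute(self, sample: Dict) -> Dict:
        raise NotImplementedError

class Filter:
    """Keep/drop decision, mirroring the Filter shape above."""
    def execute(self, sample: Dict) -> bool:
        raise NotImplementedError

class UpperMapper(Mapper):
    def execute(self, sample):
        sample = dict(sample)  # avoid mutating the caller's dict
        sample["text"] = sample.get("text", "").upper()
        return sample

class NonEmptyFilter(Filter):
    def execute(self, sample):
        return bool(sample.get("text", "").strip())

def run_pipeline(samples: List[Dict], ops) -> List[Dict]:
    # Apply operators in order; a Filter returning False drops the sample.
    out = []
    for sample in samples:
        keep = True
        for op in ops:
            if isinstance(op, Filter):
                if not op.execute(sample):
                    keep = False
                    break
            else:
                sample = op.execute(sample)
        if keep:
            out.append(sample)
    return out

samples = [{"text": "hello"}, {"text": "   "}]
result = run_pipeline(samples, [NonEmptyFilter(), UpperMapper()])
print(result)  # [{'text': 'HELLO'}]
```

In production Ray parallelizes this loop across workers, but the per-sample semantics are the same, so operators verified this way behave identically on the cluster.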
+ +## Documentation + +- [Ray Documentation](https://docs.ray.io/) +- [AGENTS.md](./AGENTS.md) + +## Related Links + +- [Runtime README](../README.md) +- [Operator Ecosystem](../ops/README.md)