🤖 A flexible dev kit for building LLMs, AI agents, and intelligent systems on CtxOS.
BigQuery-to-Parquet pipeline and machine learning dataset utilities.
This project provides a high-performance, scalable pipeline for exporting code repositories from BigQuery to Parquet, along with utilities for preprocessing datasets and for training and fine-tuning ML models (e.g., CodeGen and other Transformer-based models). It is built on Apache Beam, PyArrow, and modern Python ML tooling.
- Features
- Project Structure
- Installation
- Usage
- Pipeline Overview
- Training / Evaluation
- Requirements
- License
- Export code datasets from BigQuery to Parquet efficiently.
- Preprocessing pipeline: deduplication, filtering, comment cleaning, PII redaction.
- Dataset splits: train, validation, test.
- Language-aware processing: Python, Java, C/C++, JS, PHP, Ruby, Rust, Scala, Kotlin, etc.
- Command-line interface (CLI) for quick execution.
- Runs both on GCP and in local development environments.
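
To make the preprocessing features above more concrete, here is a minimal sketch of three of the listed steps: deduplication, PII redaction, and dataset splitting. It is not this project's implementation. The function names, the `content` record field, the `<EMAIL>` placeholder, the email-only regex, and the 80/10/10 split ratios are all assumptions chosen for illustration, and the sketch uses only the Python standard library rather than Apache Beam or PyArrow.

```python
import hashlib
import re

# Hypothetical pattern: catches simple email addresses only. Real PII
# redaction would need to cover many more cases (keys, IPs, names, ...).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def content_hash(text: str) -> str:
    """Return a stable SHA-256 hex digest of the given text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Drop records whose content exactly matches an earlier record."""
    seen = set()
    unique = []
    for record in records:
        digest = content_hash(record["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def assign_split(text: str, train: float = 0.8, validation: float = 0.1) -> str:
    """Assign a split deterministically from the content hash.

    The same text always maps to the same split, so reruns are stable.
    """
    # Map the first 32 bits of the hash to a float in [0, 1).
    bucket = int(content_hash(text)[:8], 16) / 0x100000000
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


def preprocess(records):
    """Deduplicate, redact, and tag each record with a split name."""
    processed = []
    for record in deduplicate(records):
        content = redact_emails(record["content"])
        processed.append(
            {**record, "content": content, "split": assign_split(content)}
        )
    return processed
```

Hashing the content, rather than shuffling with a random seed, keeps split assignment reproducible across runs and across workers, which matters when the same logic is applied in a distributed pipeline. Note that this sketch only removes exact duplicates; near-duplicate detection would need a different technique.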