CtxOS/ctxos-ai-devkit

🤖 A flexible dev kit for building LLMs, AI agents, and intelligent systems on CtxOS.

BigQuery-to-Parquet pipeline and machine learning dataset utilities.

This project provides a high-performance, scalable pipeline for exporting code repositories from BigQuery to Parquet, along with utilities for preprocessing datasets and training/fine-tuning ML models (e.g., CodeGen, Transformers-based models). It leverages Apache Beam, PyArrow, and modern Python ML tooling.



Features

  • Export code datasets from BigQuery to Parquet efficiently.
  • Preprocessing pipeline: deduplication, filtering, comment cleaning, PII redaction.
  • Dataset splits: train, test, validation.
  • Language-aware processing: Python, Java, C/C++, JS, PHP, Ruby, Rust, Scala, Kotlin, etc.
  • CLI interface for quick execution.
  • Fully compatible with GCP and local development.
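
The preprocessing and split steps listed above could look roughly like the following sketch. It is not the project's actual implementation: the function names (`deduplicate`, `redact_emails`, `assign_split`, `preprocess`), the record layout with `path` and `content` keys, the email-only PII pattern, and the 80/10/10 split ratios are all assumptions made for illustration. It uses only the Python standard library.

```python
import hashlib
import re

# Assumption: PII redaction is illustrated with a simple email pattern only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def content_hash(text):
    """Return a stable SHA-256 hex digest of the file content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct content (exact-match dedup)."""
    seen = set()
    unique = []
    for rec in records:
        digest = content_hash(rec["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def assign_split(path, train=0.8, test=0.1):
    """Map a file path to train/test/validation deterministically.

    Hashing the path (instead of random sampling) keeps a file in the
    same split across runs. The ratios are hypothetical defaults.
    """
    bucket = int(content_hash(path), 16) % 1000 / 1000
    if bucket < train:
        return "train"
    if bucket < train + test:
        return "test"
    return "validation"


def preprocess(records):
    """Deduplicate, redact emails, and tag each record with its split."""
    return [
        {
            **rec,
            "content": redact_emails(rec["content"]),
            "split": assign_split(rec["path"]),
        }
        for rec in deduplicate(records)
    ]


if __name__ == "__main__":
    sample = [
        {"path": "a.py", "content": "x = 1  # contact: bob@example.com"},
        {"path": "b.py", "content": "x = 1  # contact: bob@example.com"},
        {"path": "c.py", "content": "y = 2"},
    ]
    for row in preprocess(sample):
        print(row["path"], row["split"], repr(row["content"]))
```

Exact-hash deduplication only catches byte-identical files; near-duplicate detection and broader PII categories would need additional techniques that this sketch does not cover.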

Project Structure
