Native Python AI Orchestrator: Podcast Script Pipeline

An end-to-end automated data pipeline that extracts raw text from remote sources, sanitizes the ingestion stream, and interacts directly with a Large Language Model to synthesize structured podcast scripts.

Core Architectural Principles

Zero-Dependency Policy

The system is written entirely in native Python, with strictly no external library requirements. It adheres to a zero-pip-install policy, relying exclusively on the Python Standard Library for all operations, including network requests, data sanitization, and JSON serialization.
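As an illustration of the stdlib-only approach, a fetch with a spoofed User-Agent needs nothing beyond urllib.request. The function names below are hypothetical, not the repository's actual code:

```python
import urllib.request

def build_request(url: str) -> urllib.request.Request:
    # Spoof a browser-like User-Agent so basic bot protection
    # accepts the request (illustrative value, not the repo's).
    return urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
    )

def fetch_page(url: str, timeout: float = 10.0) -> str:
    # urlopen raises URLError/HTTPError on failure, which the caller
    # can translate into the pipeline's fail-fast behaviour.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```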

Modular Integrity

The architecture is divided into a 4-tier modular structure:

  • text_extractor.py: The ingestion layer. Executes HTTP GET requests, spoofs the User-Agent header to get past basic bot protection, and handles I/O failures robustly when persisting the downloaded data.
  • data_cleaner.py: The sanitization engine. Pre-compiles regular expressions at class level so patterns are compiled once per process, and iterates lazily with finditer to keep memory usage flat and avoid Out-of-Memory (OOM) faults on large inputs.
  • script_generator.py: The AI orchestration layer. Issues raw REST POST requests directly against the Gemini 2.5 Flash endpoint, constructs the JSON payload, deserializes the response, and handles API exceptions.
  • main_pipeline.py: The centralized entry point that orchestrates the sequence of operations. It resolves absolute paths via pathlib for environment-agnostic execution and enforces fail-fast logic, terminating the process immediately (sys.exit(1)) on any component failure.
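The data_cleaner.py techniques described above can be sketched in a few lines. The class name and patterns here are illustrative, not the repository's actual implementation:

```python
import re

class DataCleaner:
    # Class-level pre-compilation: patterns are compiled once per
    # process rather than once per call.
    TAG_RE = re.compile(r"<[^>]+>")
    WS_RE = re.compile(r"\s+")

    @classmethod
    def clean(cls, raw: str) -> str:
        # Strip HTML tags, then collapse runs of whitespace.
        text = cls.TAG_RE.sub(" ", raw)
        return cls.WS_RE.sub(" ", text).strip()

    @classmethod
    def iter_words(cls, raw: str):
        # finditer yields one match at a time instead of building a
        # full list, keeping memory flat on large inputs.
        for match in re.finditer(r"\S+", cls.clean(raw)):
            yield match.group(0)
```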

Technical Topology

The execution lifecycle follows a linear procedural pipeline:

  1. Network Ingestion: The system initializes by acquiring the target URL via CLI arguments. text_extractor.py retrieves the raw payload and writes it to data/input/raw.txt.
  2. Sanitization Stream: data_cleaner.py applies pre-compiled regular expressions to strip HTML tags and normalize whitespace. The optimized output is written to data/input/clean.txt.
  3. AI Generation: script_generator.py transmits the sanitized payload via a POST request to the Google API Gateway.
  4. Output Persistence: The resulting Markdown script is safely persisted to data/output/podcast_script.md.
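Step 3 can be sketched with the standard library alone. The endpoint and payload shape below follow Google's published generateContent REST format, but the exact model path, auth header, and function names are assumptions, not code from this repository:

```python
import json
import urllib.request

# Assumed endpoint; verify the model path against the current Gemini docs.
GEMINI_URL = ("https://generativelanguage.googleapis.com/v1beta/"
              "models/gemini-2.5-flash:generateContent")

def build_gemini_request(api_key: str, prompt: str) -> urllib.request.Request:
    # Construct the JSON payload expected by generateContent.
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        GEMINI_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": api_key,  # assumed auth header
        },
        method="POST",
    )

def generate_script(api_key: str, prompt: str) -> str:
    # Deserialize the response; HTTPError/URLError propagate to the
    # caller, matching the fail-fast policy described above.
    with urllib.request.urlopen(build_gemini_request(api_key, prompt)) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```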

Directory Tree

.
├── data/
│   ├── input/
│   │   ├── .gitkeep
│   │   ├── clean.txt
│   │   └── raw.txt
│   └── output/
│       ├── .gitkeep
│       └── podcast_script.md
├── src/
│   ├── check_models.py
│   ├── data_cleaner.py
│   ├── main_pipeline.py
│   ├── script_generator.py
│   └── text_extractor.py
├── .gitignore
└── README.md

Installation & Execution Protocol

# Clone the repository
git clone https://github.com/idkBsy/native-python-ai-orchestrator.git

# Execute the central orchestrator pipeline
python native-python-ai-orchestrator/src/main_pipeline.py --url "<TARGET_URL>" --api-key "<GEMINI_KEY>"

# Execute the diagnostic probe to verify API access and model authorization
python native-python-ai-orchestrator/src/check_models.py --api-key "<GEMINI_KEY>"

Security Policy

CRITICAL WARNING: CREDENTIAL LEAKAGE

Under no circumstances should production authentication tokens be committed to version control. Gemini API keys grant access to sensitive network resources.

  • API keys must be injected exclusively via Command Line Interface (CLI) arguments.
  • API keys must never be hardcoded into source files or stored in unencrypted environment configurations.
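CLI-only key injection can be sketched with argparse; the flag names mirror the execution examples above, while the parser itself is illustrative rather than the repository's actual code:

```python
import argparse

def parse_args(argv=None) -> argparse.Namespace:
    # Keys arrive only as invocation-time flags, never from source
    # files or committed configuration.
    parser = argparse.ArgumentParser(description="Pipeline entry point")
    parser.add_argument("--url", required=True, help="Target URL to ingest")
    parser.add_argument("--api-key", required=True, dest="api_key",
                        help="Gemini API key, supplied at invocation time")
    return parser.parse_args(argv)
```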
