An end-to-end automated data pipeline engineered to extract raw textual data from remote sources, sanitize the ingestion stream, and interact directly with a Large Language Model to synthesize structured podcast scripts.
This system is written in 100% native Python with strictly no external library requirements. It adheres to a zero-`pip install` policy, relying exclusively on the Python standard library for all operations, including network requests, data sanitization, and JSON serialization.
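As an illustration of the standard-library-only approach, fetching a remote page with a spoofed browser `User-Agent` needs nothing beyond `urllib.request`. This is a minimal sketch; the function name and the exact User-Agent string are assumptions, not the repository's code:

```python
import urllib.request

def fetch_raw(url: str) -> str:
    # Spoof a browser User-Agent so naive bot filters serve the page.
    req = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Decode using the charset the server declares, falling back to UTF-8.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```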
The architecture is divided into a 4-tier modular structure:
- `text_extractor.py`: The ingestion layer, responsible for executing HTTP GET requests, spoofing the HTTP `User-Agent` header to bypass basic bot protection, and providing robust I/O failure handling during data persistence.
- `data_cleaner.py`: The sanitization engine, utilizing class-level regex pre-compilation for optimal CPU execution and lazy-loading iterators (`re.finditer`) for strict RAM optimization to prevent out-of-memory (OOM) faults.
- `script_generator.py`: The AI orchestration layer, executing raw REST API POST requests directly against the Gemini 2.5 Flash endpoint. It constructs the JSON payload, deserializes the response, and enforces robust API exception handling.
- `main_pipeline.py`: The centralized execution entry point orchestrating the sequence of operations. It employs absolute path resolution via `pathlib` for environment-agnostic execution and enforces fail-fast logic, instantly terminating the process (`sys.exit(1)`) upon any component failure.
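The cleaner's two optimizations (class-level regex pre-compilation and lazy iteration via `re.finditer`) can be sketched as follows; the class and pattern names here are illustrative, not the repository's actual API:

```python
import re

class DataCleaner:
    # Class-level pre-compilation: each pattern is compiled once per
    # process, not once per call.
    TAG_RE = re.compile(r"<[^>]+>")
    WS_RE = re.compile(r"\s+")

    @classmethod
    def clean(cls, text: str) -> str:
        # Strip HTML tags, then collapse runs of whitespace.
        stripped = cls.TAG_RE.sub(" ", text)
        return cls.WS_RE.sub(" ", stripped).strip()

    @staticmethod
    def iter_tokens(text: str):
        # finditer yields matches lazily instead of materializing a list,
        # keeping memory flat even on very large inputs.
        for match in re.finditer(r"\S+", text):
            yield match.group()
```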
The execution lifecycle follows a linear procedural pipeline:

1. Network Ingestion: The system initializes by acquiring the target URL via CLI arguments; `text_extractor.py` retrieves the raw payload and writes it to `data/input/raw.txt`.
2. Sanitization Stream: `data_cleaner.py` applies pre-compiled regular expressions to strip HTML tags and normalize whitespace. The optimized output is written to `data/input/clean.txt`.
3. AI Generation: `script_generator.py` transmits the sanitized payload via a POST request to the Google API gateway.
4. Output Persistence: The resulting Markdown script is safely persisted to `data/output/podcast_script.md`.
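The AI generation step can be sketched with the stdlib only, as below. The endpoint path and payload schema are assumed from the public Generative Language REST API; the repository's actual request construction may differ:

```python
import json
import urllib.error
import urllib.request

# Assumed public endpoint format for Gemini 2.5 Flash.
API_URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-2.5-flash:generateContent")

def build_request(clean_text: str, api_key: str) -> urllib.request.Request:
    # Construct the JSON payload by hand: no SDK, stdlib json only.
    payload = json.dumps({
        "contents": [{"parts": [{"text": clean_text}]}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate_script(clean_text: str, api_key: str) -> str:
    try:
        req = build_request(clean_text, api_key)
        with urllib.request.urlopen(req, timeout=120) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        # Robust API exception handling: surface the error body,
        # not just the bare status code.
        raise RuntimeError(f"Gemini API error {exc.code}: {exc.read().decode()}") from exc
    # Deserialize the first candidate's text from the response structure.
    return body["candidates"][0]["content"]["parts"][0]["text"]
```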
```
.
├── data/
│   ├── input/
│   │   ├── .gitkeep
│   │   ├── clean.txt
│   │   └── raw.txt
│   └── output/
│       ├── .gitkeep
│       └── podcast_script.md
├── src/
│   ├── check_models.py
│   ├── data_cleaner.py
│   ├── main_pipeline.py
│   ├── script_generator.py
│   └── text_extractor.py
├── .gitignore
└── README.md
```
```bash
# Clone the repository
git clone https://github.com/idkBsy/native-python-ai-orchestrator.git

# Execute the central orchestrator pipeline
python native-python-ai-orchestrator/src/main_pipeline.py --url "<TARGET_URL>" --api-key "<GEMINI_KEY>"

# Execute the diagnostic probe to verify API access and model authorization
python native-python-ai-orchestrator/src/check_models.py --api-key "<GEMINI_KEY>"
```

CRITICAL WARNING: CREDENTIAL LEAKAGE. Under no circumstances should production authentication tokens be committed to version control. Gemini API keys grant access to sensitive network resources.
- API keys must be injected exclusively via command-line interface (CLI) arguments.
- API keys must never be hardcoded into source files or stored in unencrypted environment configurations.
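A minimal sketch of CLI-only key injection with the stdlib `argparse` module. The flag names mirror the usage shown above, but the actual parser in `main_pipeline.py` may differ:

```python
import argparse

def parse_args(argv=None):
    # The key is accepted only as a runtime CLI argument, never read
    # from a source file or a checked-in configuration.
    parser = argparse.ArgumentParser(description="Native Python AI orchestrator")
    parser.add_argument("--url", required=True, help="Target URL to ingest")
    parser.add_argument("--api-key", required=True,
                        help="Gemini API key, injected at invocation time")
    return parser.parse_args(argv)
```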