@cyclux cyclux commented Dec 5, 2025

This pull request introduces a new, modular data integration layer for loading and preparing the Jaffle Shop dataset in Snowflake for getML Feature Store integration. It provides robust infrastructure bootstrapping, typed configuration, session management, and SQL utilities, all with clear documentation and usage examples. The codebase is organized for maintainability and extensibility, and includes scripts, configuration, and test scaffolding.

The most important changes are:

Infrastructure Bootstrapping and Session Management

  • Added ensure_infrastructure and BootstrapError in data/_bootstrap.py to automatically create Snowflake warehouses and databases if missing, with idempotent operations and clear error handling.
  • Introduced create_session in data/_snowflake_session.py for robust, context-managed Snowflake Snowpark sessions, including error handling for failed connections.
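As a rough illustration of the two bullets above, the sketch below shows one way idempotent bootstrapping and a context-managed session could fit together. It is not the code from this PR: the function signatures, the identifier check, and the `factory` argument are assumptions made for the example. The only Snowpark behaviour relied on is that a session exposes `sql(...).collect()` and `close()`.

```python
import re
from contextlib import contextmanager
from typing import Any, Callable, Iterator, Protocol


class BootstrapError(RuntimeError):
    """Raised when a warehouse or database cannot be created."""


class SqlSession(Protocol):
    """Minimal surface assumed here; a Snowpark Session satisfies it."""

    def sql(self, query: str) -> Any: ...
    def close(self) -> None: ...


_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def ensure_infrastructure(session: SqlSession, warehouse: str, database: str) -> None:
    """Create the warehouse and database if missing; safe to call repeatedly."""
    for name in (warehouse, database):
        # Hypothetical guard: refuse names that are not plain unquoted identifiers.
        if not _IDENTIFIER.fullmatch(name):
            raise BootstrapError(f"Invalid identifier: {name!r}")
    statements = (
        f"CREATE WAREHOUSE IF NOT EXISTS {warehouse}",
        f"CREATE DATABASE IF NOT EXISTS {database}",
    )
    for statement in statements:
        try:
            # IF NOT EXISTS makes each statement a no-op on later runs.
            session.sql(statement).collect()
        except Exception as exc:
            raise BootstrapError(f"Failed to run: {statement}") from exc


@contextmanager
def create_session(factory: Callable[[], SqlSession]) -> Iterator[SqlSession]:
    """Open a session via `factory` and always close it on exit."""
    try:
        # With Snowpark the factory would typically wrap
        # Session.builder.configs({...}).create().
        session = factory()
    except Exception as exc:
        raise ConnectionError("Could not open Snowflake session") from exc
    try:
        yield session
    finally:
        session.close()
```

Passing the session factory in keeps the sketch runnable without Snowflake credentials; any object with `sql` and `close` methods can stand in for a real session.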

Configuration and Environment Management

  • Added SnowflakeSettings in data/_settings.py for typed configuration loaded from SNOWFLAKE_* environment variables, making authentication and connection setup consistent and secure.
  • Provided a mise.toml file with templated environment variable setup for seamless local development and CI configuration.
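To make the SNOWFLAKE_* convention concrete, here is a minimal sketch of typed settings read from environment variables. The file summary in the review says the real SnowflakeSettings uses Pydantic; this example swaps in a standard-library dataclass so it runs without extra dependencies. The field names, the defaults, and the `from_env` helper are assumptions, not the PR's API.

```python
import os
from dataclasses import MISSING, dataclass, field, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class SnowflakeSettings:
    # Field names and defaults are hypothetical, chosen for illustration.
    account: str
    user: str
    password: str = field(repr=False)  # keep the secret out of repr() and logs
    warehouse: str = "COMPUTE_WH"
    database: str = "JAFFLE_SHOP"

    @classmethod
    def from_env(
        cls,
        env: Optional[Mapping[str, str]] = None,
        prefix: str = "SNOWFLAKE_",
    ) -> "SnowflakeSettings":
        """Build settings from variables named <prefix><FIELD_NAME>."""
        env = os.environ if env is None else env
        values = {}
        missing = []
        for f in fields(cls):
            key = prefix + f.name.upper()
            if key in env:
                values[f.name] = env[key]
            elif f.default is MISSING:
                # Required field with no default and no variable set.
                missing.append(key)
        if missing:
            raise ValueError("Missing environment variables: " + ", ".join(missing))
        return cls(**values)
```

Calling `SnowflakeSettings.from_env()` with no arguments reads `os.environ`; passing a plain dict instead makes the loader easy to exercise in tests.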

Data Ingestion and SQL Utilities

  • Created the data/_sql_loader.py utility for loading and formatting SQL files, and added a suite of parameterized SQL templates for schema, stage, and table creation, as well as for data ingestion from Parquet files and cloud storage.
  • Added ingest_jaffle_shop_data.py script to orchestrate end-to-end data ingestion from a public GCS bucket into Snowflake, with automatic infrastructure setup.
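The idea of loading and formatting SQL templates can be sketched as follows. This is an assumption about the general shape only: the function name `load_sql`, the `{placeholder}` syntax, and the example template are hypothetical and may differ from what data/_sql_loader.py actually does.

```python
from pathlib import Path


def load_sql(path: Path, **params: str) -> str:
    """Read a SQL template and fill its {placeholders} with params.

    Uses str.format, so a placeholder with no matching keyword
    argument raises KeyError instead of passing through silently.
    """
    template = path.read_text(encoding="utf-8")
    return template.format(**params).strip()
```

Given a template file containing `CREATE SCHEMA IF NOT EXISTS {database}.{schema};`, a call such as `load_sql(path, database="JAFFLE_SHOP", schema="RAW")` would return the statement with both names filled in. Plain string formatting does no escaping, so a loader like this is only appropriate for trusted values such as identifiers from configuration, not for user input.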

Project Structure and Documentation

  • Established a clear package structure with __init__.py files, public API exports, and docstrings for all modules.
  • Added pyproject.toml with dependencies, development tools, and code style/linting configuration for consistent development and testing.

Copilot AI left a comment

Pull request overview

This PR introduces a data integration layer for loading the Jaffle Shop dataset from cloud storage (GCS or S3) into Snowflake. The implementation provides typed configuration, automatic infrastructure bootstrapping, session management, and SQL utilities, backed by unit and integration tests.

Key changes:

  • Infrastructure auto-provisioning with idempotent warehouse/database creation
  • Cloud storage integration supporting both GCS (via HTTPS download/upload) and S3 (via external staging)
  • Comprehensive test suite with unit tests (mocked) and integration tests (real Snowflake connections)
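For readers unfamiliar with the mocked style of unit test mentioned above, a minimal sketch follows. The helper `table_row_count` and the table name are hypothetical and are not taken from the PR's test suite; the point is only that a `MagicMock` can stand in for a session whose `sql(...).collect()` chain would otherwise need a live Snowflake connection.

```python
from unittest.mock import MagicMock


def table_row_count(session, table: str) -> int:
    """Hypothetical helper under test: count rows with one query."""
    rows = session.sql(f"SELECT COUNT(*) FROM {table}").collect()
    return rows[0][0]


def test_table_row_count_uses_single_query() -> None:
    # The mock replaces the session, so no network access is needed.
    session = MagicMock()
    session.sql.return_value.collect.return_value = [(3,)]

    assert table_row_count(session, "RAW.CUSTOMERS") == 3
    session.sql.assert_called_once_with("SELECT COUNT(*) FROM RAW.CUSTOMERS")
```

An integration test would run the same helper against a real session and therefore needs credentials, which is why the two kinds of test are usually kept separate.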

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated no comments.

File                          Description
data/__init__.py              Public API exports for settings, session management, bootstrapping, and ingestion functions
data/_settings.py             Typed Snowflake configuration loaded from environment variables using Pydantic
data/_snowflake_session.py    Session factory with context manager support and error handling
data/_bootstrap.py            Idempotent warehouse and database creation with privilege checks
data/_sql_loader.py           Utility for loading and parameterizing SQL templates
data/ingestion.py             Main ingestion logic for GCS and S3 with file caching and stage management
data/sql/**/*.sql             SQL templates for schema, stage, table creation and data loading
ingest_jaffle_shop_data.py    CLI script orchestrating end-to-end GCS ingestion
tests/**/*.py                 Comprehensive unit and integration test suite with fixtures
pyproject.toml                Project configuration with dependencies and linting rules
mise.toml                     Environment variable configuration using 1Password integration
.gitignore                    Updated to exclude LLM-related files and fix preparation/ path
