This Databricks Asset Bundle (DAB) implements an ETL pipeline that extracts data from the WorldWideImporters SQL Server database and loads it into Unity Catalog's bronze layer for further analytics processing.
```
SQL Server (WWI) → Databricks Jobs → Unity Catalog (Bronze Layer)
                                            ↓
                                     Silver Layer (Future)
                                            ↓
                                      Gold Layer (Future)
```
```
datalab/
├── databricks.yml                 # Main bundle configuration
├── deploy-databricks.ps1          # Local deployment script
├── notebooks/
│   └── bronze/
│       ├── extract_customers.ipynb   # Customer data extraction
│       ├── extract_orders.py         # Orders data extraction
│       └── extract_stock_items.py    # Stock items extraction
├── resources/
│   ├── jobs.yml                   # Job definitions
│   ├── clusters.yml               # Cluster configurations
│   └── init-scripts/
│       └── install-sql-driver.sh  # SQL Server JDBC driver setup
└── .github/
    └── workflows/
        └── deploy-databricks.yaml # CI/CD pipeline
```
- Customer Data: Extracts customer information with data quality validation
- Orders Data: Extracts sales orders and order lines with business metrics
- Stock Items: Extracts inventory data including holdings and categories
- Comprehensive data validation checks
- Automatic data profiling and statistics
- Email notifications on job success/failure
- Detailed logging and error handling
- Declarative configuration using Databricks Asset Bundles
- Environment-specific deployments (dev/prod)
- Automated cluster provisioning and management
- Version-controlled infrastructure
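The declarative configuration lives in `databricks.yml`; a minimal sketch of what the environment-specific targets might look like (workspace hosts and the `default` flag below are illustrative assumptions, not copied from this repo):

```yaml
bundle:
  name: datalab

variables:
  catalog_name:
    description: Unity Catalog name
    default: don_datalab_catalog

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://your-dev-workspace.cloud.databricks.com
  prod:
    mode: production
    workspace:
      host: https://your-prod-workspace.cloud.databricks.com
```

Resources defined under `resources/` (jobs, clusters) are merged into whichever target is selected at deploy time.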
- **Databricks CLI** (v0.210.0 or later)

  ```powershell
  pip install databricks-cli
  ```

- **Environment Variables**

  ```powershell
  $env:DATABRICKS_HOST = "https://your-databricks-instance.cloud.databricks.com"
  $env:DATABRICKS_TOKEN = "your-databricks-token"
  $env:SQL_SERVER_HOST = "your-sql-server.database.windows.net"
  $env:SQL_USERNAME = "your-sql-username"
  $env:SQL_PASSWORD = "your-sql-password"
  $env:NOTIFICATION_EMAIL = "your-email@company.com"
  ```
- **Unity Catalog Setup**
  - Ensure Unity Catalog is enabled in your Databricks workspace
  - Create the target catalog (`don_datalab_catalog` by default)
- **Validate Configuration**

  ```powershell
  .\deploy-databricks.ps1 -Target dev -ValidateOnly
  ```

- **Deploy to Development**

  ```powershell
  .\deploy-databricks.ps1 -Target dev
  ```

- **Deploy to Production**

  ```powershell
  .\deploy-databricks.ps1 -Target prod
  ```
The bundle includes GitHub Actions workflows for automated deployment:
- Trigger: Push to `develop` (dev) or `main` (prod) branches
- Secrets Required: `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `SQL_SERVER_HOST`, `SQL_USERNAME`, `SQL_PASSWORD`, `NOTIFICATION_EMAIL`
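A hedged sketch of what such a workflow might contain (step layout and the branch-to-target mapping are assumptions; the actual `deploy-databricks.yaml` may differ):

```yaml
name: deploy-databricks
on:
  push:
    branches: [develop, main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      # main deploys to prod, any other matched branch to dev
      - run: databricks bundle deploy -t ${{ github.ref_name == 'main' && 'prod' || 'dev' }}
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```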
| Table | Source | Description | Partitioning |
|---|---|---|---|
| `customers` | `Sales.Customers` | Customer master data | CustomerID |
| `orders` | `Sales.Orders` | Sales order headers | OrderDate |
| `order_lines` | `Sales.OrderLines` | Order line items | None |
| `stock_items` | `Warehouse.StockItems` | Product catalog | None |
| `stock_item_holdings` | `Warehouse.StockItemHoldings` | Current inventory | None |
| `stock_groups` | `Warehouse.StockGroups` | Product categories | None |
| `stock_item_stock_groups` | `Warehouse.StockItemStockGroups` | Product-category mapping | None |
Each extraction job includes:
- Null Value Detection: Identifies missing required fields
- Data Type Validation: Ensures proper data types
- Business Rule Validation: Validates business constraints
- Statistical Profiling: Generates data distribution metrics
- Completeness Metrics: Tracks data completeness percentages
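The null-detection and completeness checks can be prototyped outside Spark; a minimal pure-Python sketch (function and field names are illustrative, not taken from the actual notebooks, which presumably run equivalent checks on DataFrames):

```python
def completeness_report(rows, required_fields):
    """Per-field null counts and completeness percentages for a batch.

    `rows` is a list of dicts -- a plain-Python stand-in for a
    collected DataFrame used here only to illustrate the checks.
    """
    total = len(rows)
    report = {}
    for field in required_fields:
        non_null = sum(1 for r in rows if r.get(field) is not None)
        report[field] = {
            "nulls": total - non_null,
            "completeness_pct": round(100.0 * non_null / total, 1) if total else 0.0,
        }
    return report


# Illustrative rows shaped like Sales.Customers records
rows = [
    {"CustomerID": 1, "CustomerName": "Tailspin Toys"},
    {"CustomerID": 2, "CustomerName": None},
]
report = completeness_report(rows, ["CustomerID", "CustomerName"])
```

A batch whose completeness falls below an agreed threshold can then fail the job or raise an alert before the data lands in bronze.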
All tables include these metadata columns for data lineage:
- `_extract_timestamp`: When the data was extracted
- `_source_system`: Source system identifier
- `_batch_id`: Unique batch identifier
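One way these columns might be generated once per run and stamped onto every table, sketched in plain Python (the helper name and the default source-system value are assumptions; the real notebooks may derive them differently):

```python
import uuid
from datetime import datetime, timezone


def batch_metadata(source_system="WorldWideImporters"):
    """Build the lineage columns attached to every bronze-layer table."""
    return {
        "_extract_timestamp": datetime.now(timezone.utc).isoformat(),
        "_source_system": source_system,
        "_batch_id": uuid.uuid4().hex,  # unique per pipeline run
    }


meta = batch_metadata()
```

Generating the metadata once and reusing it across all seven extractions keeps every table in a run traceable to the same batch.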
| Variable | Description | Default |
|---|---|---|
| `catalog_name` | Unity Catalog name | `don_datalab_catalog` |
| `schema_name` | Bronze layer schema | `bronze` |
| `sql_server_host` | SQL Server hostname | Required |
| `sql_database_name` | Database name | `WorldWideImporters` |
| `sql_username` | SQL Server username | Required |
| `sql_password` | SQL Server password | Required |
| `notification_email` | Alert email address | `admin@example.com` |
- Frequency: Daily at 2:00 AM UTC
- Cron Expression: `0 0 2 * * ?`
- Status: Paused by default (enable after testing)
- Timeout: 2 hours maximum
- Concurrency: 1 (prevents overlapping runs)
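In a bundle, this schedule would be expressed on the job definition roughly as follows (the job name is an illustrative assumption; the actual `resources/jobs.yml` may use a different one):

```yaml
resources:
  jobs:
    wwi_bronze_extract:
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"  # daily at 02:00
        timezone_id: UTC
        pause_status: PAUSED   # enable after testing
      timeout_seconds: 7200    # 2-hour maximum
      max_concurrent_runs: 1   # prevents overlapping runs
```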
- **SQL Server Connection Failed**
  - Verify firewall rules allow Databricks IP ranges
  - Check SQL Server authentication settings
  - Ensure the database exists and the user has permissions

- **Unity Catalog Access Denied**
  - Verify the catalog exists and the user has CREATE TABLE permissions
  - Check workspace Unity Catalog configuration
  - Ensure proper RBAC assignments

- **Job Timeout**
  - Check network connectivity to SQL Server
  - Monitor cluster resource utilization
  - Consider increasing the timeout or cluster size
- Job Logs: Check Databricks job run logs for detailed error messages
- Cluster Logs: Review cluster event logs for infrastructure issues
- SQL Server Logs: Monitor SQL Server for connection and query issues
- Email Alerts: Configure notification email for job failures
- Store sensitive values in Databricks secrets or Azure Key Vault
- Use service principals for production deployments
- Rotate credentials regularly
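As one example of keeping credentials out of plain-text bundle variables, a job cluster can resolve the SQL password from a Databricks secret scope at cluster start (the scope, key, and resource names below are assumptions):

```yaml
resources:
  jobs:
    wwi_bronze_extract:
      job_clusters:
        - job_cluster_key: etl
          new_cluster:
            spark_env_vars:
              # Resolved from the secret scope when the cluster starts
              SQL_PASSWORD: "{{secrets/datalab/sql_password}}"
```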
- Configure VNet peering between Databricks and SQL Server
- Use private endpoints where possible
- Implement network security groups
- Implement row-level security where needed
- Use Unity Catalog's built-in RBAC
- Audit data access regularly
- Use node types appropriate for workload (Standard_DS3_v2 recommended)
- Enable auto-scaling for variable workloads
- Configure appropriate auto-termination
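A rough sketch of how these settings might appear in a cluster definition (values are illustrative starting points, not tuned recommendations; check the bundles reference for the exact schema your CLI version supports):

```yaml
new_cluster:
  node_type_id: Standard_DS3_v2
  spark_version: 15.4.x-scala2.12
  autoscale:
    min_workers: 1
    max_workers: 4
  autotermination_minutes: 30  # interactive clusters only; job clusters terminate after the run
```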
- Use column pruning and predicate pushdown
- Implement incremental loading for large tables
- Consider partitioning strategies for large datasets
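Column pruning and incremental loading can be combined by handing the JDBC reader a subquery instead of a bare table name, so filtering happens on SQL Server. A hedged sketch of building such a query string (the watermark handling is illustrative and deliberately ignores SQL-injection hardening; real code should parameterize the value, e.g. from a control table):

```python
def incremental_query(table, watermark_column, last_value, columns=("*",)):
    """Build a subquery suitable for the JDBC reader's `dbtable` option.

    Only rows newer than `last_value`, and only the named columns,
    are shipped out of SQL Server.
    """
    cols = ", ".join(columns)
    return (
        f"(SELECT {cols} FROM {table} "
        f"WHERE {watermark_column} > '{last_value}') AS src"
    )


query = incremental_query(
    "Sales.Orders", "OrderDate", "2024-01-01", ("OrderID", "OrderDate")
)
```

The resulting string is passed as the `dbtable` option of a Spark JDBC read, so the predicate and projection both execute on the source database.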
- Use Spot instances for non-critical workloads
- Monitor cluster utilization and right-size
- Implement auto-termination policies
- Silver Layer Development: Create transformations for cleaned/enriched data
- Gold Layer Analytics: Build aggregated tables for reporting
- Delta Live Tables: Consider migrating to DLT for complex pipelines
- ML Integration: Add machine learning workflows
- Real-time Processing: Implement streaming for real-time data
For issues or questions:
- Check the troubleshooting section above
- Review Databricks job logs and error messages
- Contact your Databricks administrator
- Refer to Databricks Asset Bundles documentation