BuildItAll, a European consulting firm specializing in scalable data platform solutions for small-sized companies, recently secured €20M in Series A funding. This milestone strengthens its position in delivering enterprise-grade data platform services.
In early 2024, a Belgian client approached BuildItAll with a critical need: a robust data platform to handle their massive daily data generation and enable data-driven decision-making through big data analytics.
Our team at BuildItAll took on this challenge as our first major project post-funding. The goal was to create a production-ready Big Data Processing Platform using Apache Spark, with a focus on cost optimization while maintaining enterprise-grade capabilities.
- Ifeanyi - Analytics Engineer (GitHub)
- Taiwo - Data Engineer (GitHub)
- Chidera - Data Infrastructure Engineer (GitHub)
- Project Overview
- Architecture
- Infrastructure as Code
- AWS Services Used
- Security Implementation
- Data Processing Pipeline
- Setup and Installation
- Monitoring and Logging
- Cost Optimization
- Best Practices
- Troubleshooting
- Contributing
- Key Takeaways
- Resources and Documentation
This project implements a scalable Big Data Processing Platform for BuildItAll's Belgian client. The platform enables big data analytics through Apache Spark workloads, featuring automated cluster management, cost optimization, and robust CI/CD practices.
- Client: BuildItAll Consulting (European Consulting Firm)
- Funding: €20M Series A
- Target: Small-sized companies requiring scalable data platforms
- First Client: Belgian company requiring big data analytics capabilities
- Fully version-controlled codebase
- Automated Apache Spark cluster management
- Cost-optimized infrastructure
- Comprehensive CI/CD pipeline
- Synthetic data generation for testing
- Production-grade security and monitoring
The platform is built on AWS Cloud with a focus on security, scalability, and cost optimization. The architecture follows AWS best practices and implements a multi-AZ design for high availability.
- **Data Ingestion Layer**
- Users connect through AWS Client VPN (10.0.0.0/16)
- Data Engineers push code through GitHub Actions
- Data flows through the VPN Gateway into the VPC
- **Processing Layer**
- EMR Cluster in private subnet (10.100.2.0/24)
- MWAA (Managed Workflows for Apache Airflow) for orchestration, deployed across the two private subnets for high availability
- Auto-scaling based on workload
- NAT Gateways with Elastic IPs for limited internet access (primarily for Airflow email notifications)
- **Storage Layer**
- S3 Data Lake with organized structure:
  - `etl/`: For ETL scripts and configurations
  - `dags/`: For Airflow DAG definitions
  - `logs/`: For application and system logs
- **Security Layer**
- Used VPC Endpoints to allow private subnet resources to securely access AWS services
- NAT Gateways for limited internet access
- IAM roles and policies
- Security groups
- Certificate-based VPN authentication
- **VPC Endpoints**
- **S3 Gateway Endpoint**
- Enables private connectivity to S3 without internet access
- Reduces data transfer costs
- Improves security by keeping traffic within AWS network
- **CloudWatch Logs Interface Endpoint**
- Secure logging without internet access
- Enables private subnet resources to send logs to CloudWatch
- Maintains security boundaries while enabling monitoring
- **Secrets Manager Interface Endpoint**
- Secure access to secrets without internet exposure
- Enables private subnet resources to retrieve credentials securely
- Reduces attack surface by eliminating internet access for secrets
- **SQS Interface Endpoint**
- Enables private communication between services
- Supports message queuing without internet access
- Maintains security for inter-service communication
- **EMR Interface Endpoint**
- Despite EMR being in the VPC, the interface endpoint is crucial for:
- Secure communication with EMR control plane
- Private access to EMR API operations
- Enables cluster management without internet access
- Supports secure job submission and monitoring
- Reduces exposure of management operations to the internet
- Benefits:
- Enhanced security for cluster management operations
- Reduced attack surface for EMR control plane
- Improved reliability by keeping management traffic within AWS network
- Better compliance with security requirements
- Cost optimization by reducing NAT Gateway usage for management operations
The platform uses Apache Airflow (MWAA) to orchestrate data processing workflows. Two main DAGs are implemented:
A basic DAG that demonstrates the platform's workflow capabilities:
- Purpose: Demonstrates basic task execution and dependency management
- Schedule: Runs daily
- Tasks:
  - `start`: Initial dummy task
  - `addition_task`: Performs a simple addition operation (5 + 3)
  - `end`: Final dummy task
- Features:
- Email notifications on success/failure
- Automatic retry on failure
- Configurable retry delay
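The simple DAG described above can be sketched roughly as follows. This is an illustrative sketch assuming Airflow 2.x operators; the `default_args` values (retry count, retry delay) are assumptions, and the Airflow imports are guarded so the callable also runs standalone where Airflow is not installed:

```python
# Hedged sketch of the simple DAG; retry settings are illustrative assumptions.
from datetime import datetime, timedelta

def addition_task():
    """The DAG's only real work: a simple addition (5 + 3)."""
    result = 5 + 3
    print(f"Addition result: {result}")
    return result

try:  # guarded so the callable above also runs where Airflow is not installed
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.operators.python import PythonOperator

    default_args = {
        "retries": 1,                          # assumption: one automatic retry
        "retry_delay": timedelta(minutes=5),   # assumption: configurable delay
        "email_on_failure": True,
    }

    with DAG(
        dag_id="simple_dag",                   # illustrative DAG id
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        start = EmptyOperator(task_id="start")
        add = PythonOperator(task_id="addition_task", python_callable=addition_task)
        end = EmptyOperator(task_id="end")
        start >> add >> end                    # linear dependency chain
except ImportError:
    pass
```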
A production-grade DAG for orchestrating EMR-based data processing:
- Purpose: Manages the complete lifecycle of EMR clusters and Spark jobs
- Schedule: Runs daily
- Tasks:
  - `begin_workflow`: Initial dummy task
  - `create_emr_cluster`: Creates an EMR cluster with specified configuration
  - `check_cluster_ready`: Monitors cluster creation status
  - `submit_spark_application`: Submits Spark job to the cluster
  - `check_submission_status`: Monitors job execution
  - `terminate_emr_cluster`: Terminates the cluster after job completion
  - `end_workflow`: Final dummy task
- EMR Configuration:
- Release: EMR 7.8.0
- Applications: Hadoop, Spark
- Instance Types: m5.xlarge
- Cluster Size: 1 Master + 2 Core nodes
- Spot instances for cost optimization
- Features:
- Automated cluster lifecycle management
- Job monitoring and status tracking
- Email notifications
- Automatic retry on failure
- Secure VPC integration
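The cluster settings listed above map onto an EMR job-flow configuration of roughly the following shape, as accepted by boto3's `run_job_flow` (and by Airflow's EMR operators as job-flow overrides). The cluster name and IAM role names are assumptions; the release, applications, instance types, counts, and Spot market mirror the configuration above:

```python
# Illustrative EMR job-flow configuration; name and role names are assumptions.
JOB_FLOW_OVERRIDES = {
    "Name": "big-data-processing-cluster",  # assumed cluster name
    "ReleaseLabel": "emr-7.8.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "Market": "SPOT",           # Spot instances for cost optimization
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core nodes",
                "Market": "SPOT",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # DAG terminates the cluster itself
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default service roles
    "ServiceRole": "EMR_DefaultRole",
}
```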
- DAGs are stored in S3 and automatically synced to MWAA
- Version control through Git
- Configuration as code
- Secure access through VPC endpoints
- **Amazon EMR**
- Native Spark support
- Cost-effective with spot instances
- Automated cluster management
- Integration with AWS services
- **MWAA (Managed Workflows for Apache Airflow)**
- Fully managed orchestration
- Native AWS integration
- Scalable and reliable
- Reduced operational overhead
- **AWS Client VPN**
- Secure access to resources
- Certificate-based authentication
- Split-tunnel support
- Detailed connection logging
Our infrastructure is managed entirely through Terraform, with modular components for maintainability and reusability.
```
terraform/
├── main.tf        # Main infrastructure definitions
├── variables.tf   # Variable declarations
├── outputs.tf     # Output definitions
├── providers.tf   # Provider configurations
├── backend.tf     # State management configuration
├── certificates/  # VPN certificates
├── policies/      # IAM policies
└── .terraform/    # Terraform plugins and modules
```
- **Networking (VPC)**

  ```hcl
  module "vpc" {
    cidr_block           = "10.100.0.0/16"
    vpc_name             = "big-data-VPC"
    create_igw           = true
    enable_dns_support   = true
    enable_dns_hostnames = true
  }
  ```
- **Subnets**
- VPN Target Subnet: 10.100.0.0/24 (Public)
- Application Subnet: 10.100.1.0/24 (Public)
- Private Subnet A: 10.100.3.0/24 (Private)
- Private Subnet B: 10.100.4.0/24 (Private)
- **NAT Gateways with Elastic IPs**
- One NAT Gateway per private subnet for limited internet access
- Each NAT Gateway associated with an Elastic IP
- Primarily used for Airflow email notifications
- Located in public subnets for high availability
- Provides stable public IP addresses for outbound internet access
- **Security Groups**

  ```hcl
  module "vpc_private_sg" {
    vpc_id              = module.vpc.vpc_id
    security_group_name = "private-sg"
    description         = "Allow all traffic from VPC"
  }
  ```
- **IAM Configuration**
- Service Roles (EMR, MWAA)
- User Management
- Policy Attachments
- **Data Ingestion**
- Secure data upload through VPN
- S3 bucket with organized structure
- Event-driven triggers
- **Processing**
- EMR cluster for Spark processing
- MWAA for workflow orchestration
- Auto-scaling based on workload
- **Storage**
- S3 for raw and processed data
- RDS for relational data storage
- Parquet format for efficiency
- **Network Security**
- VPC isolation
- Private subnets for processing
- Security groups
- VPN access
- Elastic Network Interface (ENI) for secure inter-service communication
- VPC Endpoints for secure AWS service access
- Gateway endpoints for S3
- Interface endpoints for CloudWatch, Secrets Manager, SQS, and EMR
- Private connectivity to AWS services
- Reduced attack surface
- **Access Control**
- IAM roles and policies
- Certificate-based authentication
- Least privilege principle
- **Data Security**
- Encryption at rest
- Encryption in transit
- Secrets management
- **Compute Optimization**
- EMR with spot instances
- Auto-scaling clusters
- Automated shutdown
- **Storage Optimization**
- S3 lifecycle policies
- Parquet compression
- Data archival strategies
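A lifecycle policy along the lines described above might look as follows. The prefix, day thresholds, and bucket name are illustrative assumptions; the dict follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`:

```python
# Hedged sketch of an S3 lifecycle configuration; prefix and day counts
# are assumptions, not the project's actual policy.
LIFECYCLE_RULES = {
    "Rules": [
        {
            "ID": "archive-processed-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "etl/processed/"},       # assumed prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # cold archival
            ],
            "Expiration": {"Days": 365},                  # assumed retention
        }
    ]
}

# Applying it (requires boto3 and AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="big-data-bck", LifecycleConfiguration=LIFECYCLE_RULES)
```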
- **Compute and Processing**
- Amazon EMR (Elastic MapReduce)
- MWAA (Managed Workflows for Apache Airflow)
- **Storage**
- Amazon S3
- **Networking**
- Amazon VPC
- Client VPN
- Internet Gateway
- NAT Gateway with Elastic IPs
- Gateway Endpoints
- Interface Endpoints
- **Security**
- AWS Certificate Manager
- AWS Secrets Manager
- IAM
- Security Groups
- **Monitoring**
- Amazon CloudWatch
- CloudWatch Logs
- **AWS Account**
- Appropriate IAM permissions
- AWS CLI configured
- **Development Tools**
- Terraform >= 1.0.0
- Python >= 3.8
- Apache Spark >= 3.x
- Git
- **Authentication**
- AWS credentials
- VPN certificates
- GitHub access tokens
```
big-data-platform/
├── .github/
│   └── workflows/   # CI/CD pipeline definitions
├── terraform/
│   ├── policies/    # IAM policy components
├── images/
├── dags/            # Airflow DAG definitions
├── etl/             # ETL scripts
├── .gitignore       # Git ignore file
└── README.md        # Project documentation
```
- **Clone the Repository**

  ```shell
  git clone https://github.com/Tee-works/big-data-processing-platform.git
  cd big-data-processing-platform
  ```

- **Install Dependencies**

  ```shell
  pip install -r requirements.txt
  ```

- **Configure AWS Credentials**

  ```shell
  aws configure
  ```

- **Initialize Terraform**

  ```shell
  cd terraform
  terraform init
  ```

- **Terraform Deployment**

  ```shell
  terraform plan -var-file=environments/prod.tfvars
  ```
- **VPN Setup**
- Download VPN configuration
- Install AWS VPN client
- Import certificates
- **Synthetic Data Generation**
- Multiple parquet datasets
- Total records: 5+ million
- Varied record distribution
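Synthetic data generation along these lines can be sketched in plain Python. The record fields and value ranges are invented for illustration (the source specifies only multiple Parquet datasets with 5+ million records in total), and the Parquet write, shown commented out, assumes pandas and pyarrow are available:

```python
# Hedged sketch of synthetic record generation; fields and ranges are
# illustrative assumptions, not the project's actual datasets.
import random
from datetime import datetime, timedelta

def generate_records(n, seed=42):
    """Yield n synthetic transaction-like records with varied distribution."""
    rng = random.Random(seed)  # seeded for reproducible test data
    base = datetime(2024, 1, 1)
    for i in range(n):
        yield {
            "id": i,
            "customer_id": rng.randint(1, 10_000),
            "amount": round(rng.uniform(1.0, 500.0), 2),
            "ts": (base + timedelta(seconds=rng.randint(0, 86_400 * 30))).isoformat(),
        }

records = list(generate_records(1_000))  # scale n up for multi-million-row sets

# Writing out as Parquet (requires pandas + pyarrow):
# import pandas as pd
# pd.DataFrame(records).to_parquet("synthetic.parquet")
```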
- **Spark Job Execution**
- Automated cluster provisioning
- Job submission
- Cluster termination
- **Data Quality Checks**
- Schema validation
- Data integrity checks
- Performance monitoring
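The schema-validation idea above can be illustrated with a minimal, Spark-independent sketch; the expected schema and field names are assumptions, and in the real pipeline equivalent checks would run against Spark DataFrame schemas:

```python
# Hedged sketch of a lightweight schema/integrity check; the schema below
# is an illustrative assumption.
EXPECTED_SCHEMA = {"id": int, "customer_id": int, "amount": float, "ts": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of violations (missing fields or wrong types)."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, "
                            f"got {type(record[field]).__name__}")
    return problems
```

For example, `validate_record({"id": "one", "customer_id": 42, "amount": 9.99})` flags the wrongly typed `id` and the missing `ts` field, while a fully conforming record yields an empty list.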
This documentation describes the Continuous Integration and Continuous Deployment (CI/CD) workflow implemented through GitHub Actions. The workflow runs automatically when changes are pushed to the main branch, specifically when files in the `dags/`, `terraform/`, or `etl/` directories are modified.
The workflow is triggered on:
- Push events to the `main` branch
- Only when changes affect files in these paths:
  - `dags/**`
  - `terraform/**`
  - `etl/**`
This job validates Python code quality in the PySpark files.
Environment:
- Runs on: Ubuntu latest
- Python version: 3.10
Steps:
- Checkout repository code
- Set up Python 3.10 environment
- Install linting tools:
  - `flake8`: Checks for code style and quality issues
  - `isort`: Validates import sorting
- Run `flake8` on the `dags/` and `etl/` directories
- Run `isort` in check-only mode on the `dags/` and `etl/` directories
This job validates Terraform configuration files for best practices and potential errors.
Environment:
- Runs on: Ubuntu latest
Steps:
- Checkout repository code
- Install TFLint using the official installation script
- Run TFLint in the `terraform/` directory to validate configurations
This job uploads Airflow DAG files and ETL scripts to an S3 bucket, but only after the PySpark linting job passes successfully.
Dependencies:
- Requires successful completion of the `lint-pyspark` job
Environment:
- Runs on: Ubuntu latest
- Environment variables:
  - `BUCKET_NAME`: Set to `big-data-bck`
Steps:
- Checkout repository code
- Configure AWS credentials using secrets stored in GitHub:
- AWS Access Key ID
- AWS Secret Access Key
- AWS Region: `eu-north-1`
- Sync files to S3:
  - Copy the `dags/` directory to `s3://big-data-bck/dags/`
  - Copy the `etl/` directory to `s3://big-data-bck/etl/`
- AWS credentials are stored as GitHub secrets
- No sensitive information is exposed in the workflow file
- Ensure AWS secrets (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) are configured in your GitHub repository settings
- The IAM user associated with these credentials should have permissions to write to the specified S3 bucket
- This workflow only triggers on changes to specific directories to avoid unnecessary runs
If the workflow fails:
- Linting failures: Check the job logs for specific code style issues that need to be fixed
- AWS authentication errors: Verify that your GitHub secrets are correctly configured
- S3 sync failures: Ensure the IAM user has sufficient permissions to write to the bucket
- **CloudWatch Metrics**
- Cluster performance
- Job execution stats
- Cost metrics
- **Logging**
- Application logs
- Infrastructure logs
- Audit logs
- **Cluster Management**
- Auto-scaling
- Spot instances
- Automatic termination
- **Storage Optimization**
- Data lifecycle policies
- Storage class selection
- Compression strategies
- **Code Management**
- Version control
- Code review process
- Documentation
- **Security**
- Least privilege access
- Regular security updates
- Encryption standards
- **Operations**
- Automated deployments
- Monitoring and alerting
- Backup and recovery
Common issues and their solutions:
- VPN Connection Issues
- Cluster Provisioning Failures
- Job Execution Errors
- Permission Problems
- Fork the repository
- Create a feature branch
- Submit a pull request
- Follow coding standards
- Use Active Directory or SAML-based authentication for Client VPN in production
- Modular Terraform code structure significantly improves infrastructure maintainability and scalability.
- Use spot instances with auto-termination for EMR to cut costs by up to 70% without sacrificing performance.
- Separate orchestration (MWAA) from processing (EMR) to improve fault tolerance and modularity.
- Enable S3 object versioning and lifecycle rules to manage storage costs and data retention.
- Set up metric alarms on EMR and RDS for proactive monitoring and resource optimization.
- CI/CD gating with linters and integration tests prevents poor-quality code from reaching production.
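As one concrete instance of the metric-alarm takeaway, an alarm on EMR's `IsIdle` metric can flag clusters left running with no work. This is a hedged sketch: the alarm name, evaluation window, and SNS topic ARN are assumptions, while the namespace and metric name are the standard EMR CloudWatch ones:

```python
# Hedged sketch of a CloudWatch alarm on EMR idleness; name, window,
# and the SNS topic are illustrative assumptions.
ALARM = {
    "AlarmName": "emr-cluster-idle",            # assumed name
    "Namespace": "AWS/ElasticMapReduce",
    "MetricName": "IsIdle",                     # 1 when no work is running
    "Statistic": "Average",
    "Period": 300,                              # 5-minute datapoints
    "EvaluationPeriods": 6,                     # idle for 30 minutes straight
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}

# Creating it (requires boto3, credentials, and a real SNS topic ARN):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **ALARM, AlarmActions=["arn:aws:sns:eu-north-1:<account-id>:ops-alerts"])
```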
- AWS Documentation on Client VPN
- AWS Documentation on Amazon EMR
- AWS Documentation on Amazon Managed Workflows for Apache Airflow
- Orchestrating analytics jobs on Amazon EMR Notebooks using Amazon MWAA
- Building and operating data pipelines at scale using CI/CD, Amazon MWAA and Apache Spark on Amazon EMR by Wipro
- Terraform Documentation for AWS Provider
- How do I use Amazon SES as the SMTP host to send emails from Amazon MWAA DAG tasks?
- Create an AWS Secrets Manager secret
- Set up email sending with Amazon SES
For more information, contact the repo owner.
