Big Data Processing Platform

Background

BuildItAll, a European consulting firm specializing in scalable data platform solutions for small-size companies, recently secured €20M in Series A funding. This milestone strengthens their position in delivering enterprise-grade data platform services.

In early 2024, a Belgian client approached BuildItAll with a critical need: they needed a robust data platform to handle their massive daily data generation and enable data-driven decision-making through big data analytics.

Our team at BuildItAll took on this challenge as our first major project post-funding. The goal was to create a production-ready Big Data Processing Platform using Apache Spark, with a focus on cost optimization while maintaining enterprise-grade capabilities.

Team

Core Development Team

Ifeanyi - Analytics Engineer (GitHub)
Taiwo - Data Engineer (GitHub)
Chidera - Data Infrastructure Engineer (GitHub)

Project Overview

This project implements a scalable Big Data Processing Platform for BuildItAll's Belgian client. The platform enables big data analytics through Apache Spark workloads, featuring automated cluster management, cost optimization, and robust CI/CD practices.

Business Context

Client: BuildItAll Consulting (European Consulting Firm)
Funding: €20M Series A
Target: Small-size companies requiring scalable data platforms
First Client: Belgian company requiring big data analytics capabilities

Key Features

Fully versioned, controlled codebase
Automated Apache Spark cluster management
Cost-optimized infrastructure
Comprehensive CI/CD pipeline
Synthetic data generation for testing
Production-grade security and monitoring

Architecture

Infrastructure Overview

The platform is built on AWS Cloud with a focus on security, scalability, and cost optimization. The architecture follows AWS best practices and implements a multi-AZ design for high availability.

Architecture Flow

Data Ingestion Layer
- Users connect through AWS Client VPN (10.0.0.0/16)
- Data Engineers push code through GitHub Actions
- Data flows through the VPN Gateway into the VPC
Processing Layer
- EMR Cluster in private subnet (10.100.2.0/24)
- MWAA (Managed Workflows for Apache Airflow) for orchestration deployed in the two private subnets for high availability
- Auto-scaling based on workload
- NAT Gateways with Elastic IPs for limited internet access (primarily for Airflow email notifications)
Storage Layer
- S3 Data Lake with organized structure:
  - etl/ - For ETL scripts and configurations
  - dags/ - For Airflow DAG definitions
  - logs/ - For application and system logs
Security Layer
- Used VPC Endpoints to allow private subnet resources to securely access AWS services
- NAT Gateways for limited internet access
- IAM roles and policies
- Security groups
- Certificate-based VPN authentication
VPC Endpoints
- S3 Gateway Endpoint
  - Enables private connectivity to S3 without internet access
  - Reduces data transfer costs
  - Improves security by keeping traffic within AWS network
- CloudWatch Logs Interface Endpoint
  - Secure logging without internet access
  - Enables private subnet resources to send logs to CloudWatch
  - Maintains security boundaries while enabling monitoring
- Secrets Manager Interface Endpoint
  - Secure access to secrets without internet exposure
  - Enables private subnet resources to retrieve credentials securely
  - Reduces attack surface by eliminating internet access for secrets
- SQS Interface Endpoint
  - Enables private communication between services
  - Supports message queuing without internet access
  - Maintains security for inter-service communication
- EMR Interface Endpoint
  - Despite EMR being in the VPC, the interface endpoint is crucial for:
    - Secure communication with EMR control plane
    - Private access to EMR API operations
    - Enables cluster management without internet access
    - Supports secure job submission and monitoring
    - Reduces exposure of management operations to the internet
  - Benefits:
    - Enhanced security for cluster management operations
    - Reduced attack surface for EMR control plane
    - Improved reliability by keeping management traffic within AWS network
    - Better compliance with security requirements
    - Cost optimization by reducing NAT Gateway usage for management operations

Airflow DAGs

The platform uses Apache Airflow (MWAA) to orchestrate data processing workflows. Two main DAGs are implemented:

1. Simple Addition DAG (`test.py`)

A basic DAG that demonstrates the platform's workflow capabilities:

Purpose: Demonstrates basic task execution and dependency management
Schedule: Runs daily
Tasks:
- start: Initial dummy task
- addition_task: Performs a simple addition operation (5 + 3)
- end: Final dummy task
Features:
- Email notifications on success/failure
- Automatic retry on failure
- Configurable retry delay

2. Big Data Pipeline DAG (`emr.py`)

A production-grade DAG for orchestrating EMR-based data processing:

Purpose: Manages the complete lifecycle of EMR clusters and Spark jobs
Schedule: Runs daily
Tasks:
- begin_workflow: Initial dummy task
- create_emr_cluster: Creates an EMR cluster with specified configuration
- check_cluster_ready: Monitors cluster creation status
- submit_spark_application: Submits Spark job to the cluster
- check_submission_status: Monitors job execution
- terminate_emr_cluster: Terminates the cluster after job completion
- end_workflow: Final dummy task
EMR Configuration:
- Release: EMR 7.8.0
- Applications: Hadoop, Spark
- Instance Types: m5.xlarge
- Cluster Size: 1 Master + 2 Core nodes
- Spot instances for cost optimization
Features:
- Automated cluster lifecycle management
- Job monitoring and status tracking
- Email notifications
- Automatic retry on failure
- Secure VPC integration

DAG Management

DAGs are stored in S3 and automatically synced to MWAA
Version control through Git
Configuration as code
Secure access through VPC endpoints

Why These Services?

Amazon EMR
- Native Spark support
- Cost-effective with spot instances
- Automated cluster management
- Integration with AWS services
MWAA (Managed Workflows for Apache Airflow)
- Fully managed orchestration
- Native AWS integration
- Scalable and reliable
- Reduced operational overhead
AWS Client VPN
- Secure access to resources
- Certificate-based authentication
- Split-tunnel support
- Detailed connection logging

Infrastructure as Code

Our infrastructure is managed entirely through Terraform, with modular components for maintainability and reusability.

Terraform Structure

terraform/
├── main.tf              # Main infrastructure definitions
├── variables.tf         # Variable declarations
├── outputs.tf          # Output definitions
├── providers.tf        # Provider configurations
├── backend.tf         # State management configuration
├── certificates/      # VPN certificates
├── policies/         # IAM policies
└── .terraform/       # Terraform plugins and modules

Key Infrastructure Components

Networking (VPC)

module "vpc" {
  cidr_block           = "10.100.0.0/16"
  vpc_name             = "big-data-VPC"
  create_igw           = true
  enable_dns_support   = true
  enable_dns_hostnames = true
}

Subnets
- VPN Target Subnet: 10.100.0.0/24 (Public)
- Application Subnet: 10.100.1.0/24 (Public)
- Private Subnet A: 10.100.3.0/24 (Private)
- Private Subnet B: 10.100.4.0/24 (Private)
NAT Gateways with Elastic IPs
- One NAT Gateway per private subnet for limited internet access
- Each NAT Gateway associated with an Elastic IP
- Primarily used for Airflow email notifications
- Located in public subnets for high availability
- Provides stable public IP addresses for outbound internet access

Security Groups

module "vpc_private_sg" {
  vpc_id              = module.vpc.vpc_id
  security_group_name = "private-sg"
  description         = "Allow all traffic from VPC"
}

IAM Configuration
- Service Roles (EMR, MWAA)
- User Management
- Policy Attachments

Data Processing Pipeline

Data Ingestion
- Secure data upload through VPN
- S3 bucket with organized structure
- Event-driven triggers
Processing
- EMR cluster for Spark processing
- MWAA for workflow orchestration
- Auto-scaling based on workload
Storage
- S3 for raw and processed data
- RDS for storage
- Parquet format for efficiency

Security Implementation

Network Security
- VPC isolation
- Private subnets for processing
- Security groups
- VPN access
- Elastic Network Interface (ENI) for secure inter-service communication
- VPC Endpoints for secure AWS service access
  - Gateway endpoints for S3
  - Interface endpoints for CloudWatch, Secrets Manager, SQS, and EMR
  - Private connectivity to AWS services
  - Reduced attack surface
Access Control
- IAM roles and policies
- Certificate-based authentication
- Least privilege principle
Data Security
- Encryption at rest
- Encryption in transit
- Secrets management

Cost Optimization Strategies

Compute Optimization
- EMR with spot instances
- Auto-scaling clusters
- Automated shutdown
Storage Optimization
- S3 lifecycle policies
- Parquet compression
- Data archival strategies

AWS Services Used

Compute and Processing
- Amazon EMR (Elastic MapReduce)
- MWAA (Managed Workflows for Apache Airflow)
Storage
- Amazon S3
Networking
- Amazon VPC
- Client VPN
- Internet Gateway
- NAT Gateway with Elastic IPs
- Gateway Endpoints
- Interface Endpoints
Security
- AWS Certificate Manager
- AWS Secrets Manager
- IAM
- Security Groups
Monitoring
- Amazon CloudWatch
- CloudWatch Logs

Prerequisites

AWS Account
- Appropriate IAM permissions
- AWS CLI configured
Development Tools
- Terraform >= 1.0.0
- Python >= 3.8
- Apache Spark >= 3.x
- Git
Authentication
- AWS credentials
- VPN certificates
- GitHub access tokens

Project Structure

big-data-platform/
├── .github/
│   └── workflows/          # CI/CD pipeline definitions
├── terraform/
│   ├── policies/           # IAM Policies components
├── images/
├── dags/
├── etl/               # Integration tests
├── .gitignore                 # git ignore file
└── README.md            # Project Documentation

Setup and Installation

Clone the Repository

git clone https://github.com/Tee-works/big-data-processing-platform.git
cd big-data-processing-platform

Install Dependencies
```
pip install -r requirements.txt
```
Configure AWS Credentials
```
aws configure
```
Initialize Terraform
```
cd terraform
terraform init
```

Infrastructure Deployment

Terraform Deployment

terraform plan -var-file=environments/prod.tfvars

VPN Setup
- Download VPN configuration
- Install AWS VPN client
- Import certificates

Data Processing

Synthetic Data Generation
- Multiple parquet datasets
- Total records: 5+ million
- Varied record distribution
Spark Job Execution
- Automated cluster provisioning
- Job submission
- Cluster termination
Data Quality Checks
- Schema validation
- Data integrity checks
- Performance monitoring

CI/CD Pipeline

This documenttion describes the Continuous Integration and Continuous Deployment (CI/CD) workflow implemented through GitHub Actions. The workflow automatically runs when changes are pushed to the main branch, specifically when files in the dags/, terraform/, or etl/ directories are modified.

Workflow Triggers

The workflow is triggered on:

Push events to the main branch
Only when changes affect files in these paths:
- dags/**
- terraform/**
- etl/**

Jobs

1. Lint PySpark (`lint-pyspark`)

This job validates Python code quality in the PySpark files.

Environment:

Runs on: Ubuntu latest
Python version: 3.10

Steps:

Checkout repository code
Set up Python 3.10 environment
Install linting tools:
- flake8: Checks for code style and quality issues
- isort: Validates import sorting
Run flake8 on dags/ and etl/ directories
Run isort in check-only mode on dags/ and etl/ directories

2. Terraform Lint (`terraform-lint`)

This job validates Terraform configuration files for best practices and potential errors.

Environment:

Runs on: Ubuntu latest

Steps:

Checkout repository code
Install TFLint using the official installation script
Run TFLint in the terraform directory to validate configurations

3. Sync DAGs to S3 (`sync-dags`)

This job uploads Airflow DAG files and ETL scripts to an S3 bucket, but only after the PySpark linting job passes successfully.

Dependencies:

Requires successful completion of lint-pyspark job

Environment:

Runs on: Ubuntu latest
Environment variables:
- BUCKET_NAME: Set to big-data-bck

Steps:

Checkout repository code
Configure AWS credentials using secrets stored in GitHub:
- AWS Access Key ID
- AWS Secret Access Key
- AWS Region: eu-north-1
Sync files to S3:
- Copy dags/ directory to s3://big-data-bck/dags/
- Copy etl/ directory to s3://big-data-bck/etl/

Security Considerations

AWS credentials are stored as GitHub secrets
No sensitive information is exposed in the workflow file

Usage Notes

Ensure AWS secrets (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) are configured in your GitHub repository settings
The IAM user associated with these credentials should have permissions to write to the specified S3 bucket
This workflow only triggers on changes to specific directories to avoid unnecessary runs

Troubleshooting

If the workflow fails:

Linting failures: Check the job logs for specific code style issues that need to be fixed
AWS authentication errors: Verify that your GitHub secrets are correctly configured
S3 sync failures: Ensure the IAM user has sufficient permissions to write to the bucket

Monitoring and Logging

CloudWatch Metrics
- Cluster performance
- Job execution stats
- Cost metrics
Logging
- Application logs
- Infrastructure logs
- Audit logs

Cost Optimization

Cluster Management
- Auto-scaling
- Spot instances
- Automatic termination
Storage Optimization
- Data lifecycle policies
- Storage class selection
- Compression strategies

Best Practices

Code Management
- Version control
- Code review process
- Documentation
Security
- Least privilege access
- Regular security updates
- Encryption standards
Operations
- Automated deployments
- Monitoring and alerting
- Backup and recovery

Troubleshooting

Common issues and their solutions:

VPN Connection Issues
Cluster Provisioning Failures
Job Execution Errors
Permission Problems

Contributing

Fork the repository
Create a feature branch
Submit a pull request
Follow coding standards

Key Takeaways

Use Active Directory or SAML for authentication for Client VPN in production
Modular Terraform code structure significantly improves infrastructure maintainability and scalability.
Use spot instances with auto-termination for EMR to cut costs by up to 70% without sacrificing performance.
Separate orchestration (MWAA) from processing (EMR) to improve fault tolerance and modularity.
Enable S3 object versioning and lifecycle rules to manage storage costs and data retention.
Set up metric alarms on EMR and RDS for proactive monitoring and resource optimization.
CI/CD gating with linters and integration tests prevents poor-quality code from reaching production.

Resources and Documentations

For more information, contact the repo owner.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
.github/workflows		.github/workflows
dags		dags
etl		etl
images		images
terraform		terraform
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Big Data Processing Platform

Background

Team

Core Development Team

Table of Contents

Project Overview

Business Context

Key Features

Architecture

Infrastructure Overview

Architecture Flow

Airflow DAGs

1. Simple Addition DAG (test.py)

2. Big Data Pipeline DAG (emr.py)

DAG Management

Why These Services?

Infrastructure as Code

Terraform Structure

Key Infrastructure Components

Data Processing Pipeline

Security Implementation

Cost Optimization Strategies

AWS Services Used

Prerequisites

Project Structure

Setup and Installation

Infrastructure Deployment

Data Processing

CI/CD Pipeline

Workflow Triggers

Jobs

1. Lint PySpark (lint-pyspark)

2. Terraform Lint (terraform-lint)

3. Sync DAGs to S3 (sync-dags)

Security Considerations

Usage Notes

Troubleshooting

Monitoring and Logging

Cost Optimization

Best Practices

Troubleshooting

Contributing

Key Takeaways

Resources and Documentations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Simple Addition DAG (`test.py`)

2. Big Data Pipeline DAG (`emr.py`)

1. Lint PySpark (`lint-pyspark`)

2. Terraform Lint (`terraform-lint`)

3. Sync DAGs to S3 (`sync-dags`)

Packages