Welcome to the Data Engineering Projects repository. It contains a collection of data engineering projects implemented with a range of technologies.
- Description: A data pipeline that extracts financial data from the Alpha Vantage API and stores it in a SQLite database. The pipeline includes cleaning and preprocessing steps; a minimal sketch of the extract-and-load step follows this list.
- Technologies Used: Python, pandas, requests, matplotlib, SQLite, python-dotenv, Windows Task Scheduler
- Practices Used: API integration, data extraction, data cleaning and preprocessing, data storage and management, data visualization, ETL pipeline development, task automation and scheduling, error handling and logging.
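To give a feel for the workflow, here is a minimal sketch of the extract-and-load step. The environment variable name, database file, and table name are assumptions for illustration; the actual pipeline in the project folder adds error handling, logging, and scheduled runs.

```python
import os
import sqlite3

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()  # assumes an ALPHAVANTAGE_API_KEY entry in a local .env file

API_URL = "https://www.alphavantage.co/query"

def fetch_daily_series(symbol: str) -> pd.DataFrame:
    """Fetch the daily time series for one symbol as a typed, date-indexed DataFrame."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": os.getenv("ALPHAVANTAGE_API_KEY"),
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    raw = response.json()["Time Series (Daily)"]  # dates mapped to OHLCV strings
    df = pd.DataFrame.from_dict(raw, orient="index", dtype=float)
    df.index = pd.to_datetime(df.index)
    df.columns = [c.split(". ")[1] for c in df.columns]  # "1. open" -> "open"
    return df.sort_index()

if __name__ == "__main__":
    prices = fetch_daily_series("IBM")
    with sqlite3.connect("financial_data.db") as conn:  # illustrative file name
        prices.to_sql("daily_prices", conn, if_exists="replace", index_label="date")
```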
- Description: This project simulates a data engineering workflow using synthetic data. It includes generating data with the Faker library, creating a PostgreSQL database schema, and loading the data into the database. The project also provides an interactive dashboard built with Streamlit to visualize and analyze the data; a sketch of the generate-and-load step follows this list.
- Technologies Used: Python, pandas, SQLAlchemy, psycopg2, Faker, PostgreSQL, Streamlit, pgAdmin (optional)
- Practices Used: Data modeling and schema design, data extraction and generation, data loading and management, database integration, ETL pipeline development, data visualization, user interaction through a web dashboard, data validation, error handling, project structure organization.
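As a rough sketch of the generate-and-load step, the snippet below seeds Faker for reproducibility and bulk-loads a synthetic table through SQLAlchemy. The connection string, table name, and columns are placeholders, not the project's actual schema.

```python
import pandas as pd
from faker import Faker
from sqlalchemy import create_engine

fake = Faker()
Faker.seed(42)  # make the synthetic data reproducible

# Generate a small batch of synthetic customer records.
customers = pd.DataFrame(
    {
        "name": [fake.name() for _ in range(100)],
        "email": [fake.email() for _ in range(100)],
        "signup_date": [fake.date_this_decade() for _ in range(100)],
    }
)

# Placeholder connection string; point it at your own PostgreSQL instance.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/demo_db")
customers.to_sql("customers", engine, if_exists="replace", index=False)
```

The Streamlit dashboard then queries the same database to render its visualizations.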
- Description: This project simulates a data lake architecture using a local file system. It demonstrates key data engineering concepts, including ETL pipelines, data validation, metadata management, and querying with SQL. The project automates data ingestion, processing, aggregation, and analysis workflows using Python; a sketch of the Parquet-and-DuckDB querying layer follows this list.
- Technologies Used: Python, pandas, Faker, PyArrow, DuckDB, Jupyter Notebook, Matplotlib
- Practices Used: ETL pipeline development, data cleaning and transformation, data quality validation, metadata management, batch processing, SQL querying on Parquet files, project automation and orchestration.
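To illustrate the querying layer, here is a minimal sketch that writes a Parquet file into the local "lake" and queries it in place with DuckDB. The paths and columns are invented for the example.

```python
import os

import duckdb
import pandas as pd

# Write a small batch to the local data lake as Parquet (PyArrow does the writing).
os.makedirs("lake/processed", exist_ok=True)  # illustrative layout
events = pd.DataFrame(
    {
        "event_id": range(1, 6),
        "category": ["a", "b", "a", "c", "b"],
        "amount": [10.0, 25.5, 7.25, 40.0, 12.75],
    }
)
events.to_parquet("lake/processed/events.parquet")

# DuckDB queries Parquet files in place; no load step is needed.
result = duckdb.sql(
    "SELECT category, SUM(amount) AS total FROM 'lake/processed/events.parquet' "
    "GROUP BY category ORDER BY total DESC"
).df()
print(result)
```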
- Description: A data pipeline using Apache Spark to process and analyze transactional data for detecting spending trends and potential fraud. The workflow covers data ingestion, cleaning, transformation, analysis, and visualization; a PySpark aggregation sketch follows this list.
- Technologies Used: Apache Spark, PySpark, Python, pandas, Matplotlib, Seaborn, Docker, Parquet
- Practices Used: Batch data processing, data cleaning and transformation, aggregation and analysis, data visualization, scalable data pipelines, Docker-based deployment, efficient data storage with Parquet.
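Below is a minimal PySpark sketch of the aggregation step, assuming an illustrative transactions Parquet file. The spend-spike rule at the end is a simple stand-in heuristic, not the project's actual fraud logic.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("spending-trends-sketch").getOrCreate()

# Path and column names are illustrative, not the project's actual schema.
transactions = spark.read.parquet("data/transactions.parquet")

# Aggregate spend per customer per month to surface trends.
monthly_spend = (
    transactions
    .withColumn("month", F.date_trunc("month", F.col("transaction_date")))
    .groupBy("customer_id", "month")
    .agg(F.sum("amount").alias("total_spent"), F.count("*").alias("txn_count"))
)

# Flag months where a customer spends far above their own average:
# a simple stand-in heuristic, not the project's fraud rule.
per_customer = Window.partitionBy("customer_id")
flagged = (
    monthly_spend
    .withColumn("avg_spent", F.avg("total_spent").over(per_customer))
    .filter(F.col("total_spent") > 3 * F.col("avg_spent"))
)

flagged.show()
spark.stop()
```

Storing intermediate results as Parquet keeps the data compact and columnar, which is the efficient-storage practice the project lists above.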
Each project folder contains its own README with detailed instructions for running that project.
This repository is licensed under the MIT License. See the LICENSE file for more details.