Welcome to the Data Engineering Projects repository. It contains a collection of data engineering projects implemented with a range of technologies.
- Description: A data pipeline that extracts financial data from the Alpha Vantage API and stores it in a SQLite database. The pipeline includes cleaning and preprocessing steps; a minimal sketch of the extract-and-load step follows this list.
- Technologies Used: Python, pandas, requests, matplotlib, SQLite, python-dotenv, Windows Task Scheduler
- Practices Used: API integration, data extraction, data cleaning and preprocessing, data storage and management, data visualization, ETL pipeline development, task automation and scheduling, error handling and logging.
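To give a feel for the workflow, here is a minimal sketch of the extract-and-load step. The environment variable name, database file, and table name are assumptions for illustration; the actual pipeline in the project folder adds error handling, logging, and scheduled runs.

```python
import os
import sqlite3

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()  # assumes an ALPHAVANTAGE_API_KEY entry in a local .env file

API_URL = "https://www.alphavantage.co/query"

def fetch_daily_series(symbol: str) -> pd.DataFrame:
    """Fetch the daily time series for one symbol as a typed, date-indexed DataFrame."""
    params = {
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": os.getenv("ALPHAVANTAGE_API_KEY"),
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    raw = response.json()["Time Series (Daily)"]  # dates mapped to OHLCV strings
    df = pd.DataFrame.from_dict(raw, orient="index", dtype=float)
    df.index = pd.to_datetime(df.index)
    df.columns = [c.split(". ")[1] for c in df.columns]  # "1. open" -> "open"
    return df.sort_index()

if __name__ == "__main__":
    prices = fetch_daily_series("IBM")
    with sqlite3.connect("financial_data.db") as conn:  # illustrative file name
        prices.to_sql("daily_prices", conn, if_exists="replace", index_label="date")
```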
- Description: This project simulates a data engineering workflow using synthetic data. It includes generating data with the Faker library, creating a PostgreSQL database schema, and loading the data into the database. The project also provides an interactive dashboard built with Streamlit to visualize and analyze the data; a sketch of the generate-and-load step follows this list.
- Technologies Used: Python, pandas, SQLAlchemy, psycopg2, Faker, PostgreSQL, Streamlit, pgAdmin (optional)
- Practices Used: Data modeling and schema design, data extraction and generation, data loading and management, database integration, ETL pipeline development, data visualization, user interaction through a web dashboard, data validation, error handling, project structure organization.
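As a rough sketch of the generate-and-load step, the snippet below seeds Faker for reproducibility and bulk-loads a synthetic table through SQLAlchemy. The connection string, table name, and columns are placeholders, not the project's actual schema.

```python
import pandas as pd
from faker import Faker
from sqlalchemy import create_engine

fake = Faker()
Faker.seed(42)  # make the synthetic data reproducible

# Generate a small batch of synthetic customer records.
customers = pd.DataFrame(
    {
        "name": [fake.name() for _ in range(100)],
        "email": [fake.email() for _ in range(100)],
        "signup_date": [fake.date_this_decade() for _ in range(100)],
    }
)

# Placeholder connection string; point it at your own PostgreSQL instance.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/demo_db")
customers.to_sql("customers", engine, if_exists="replace", index=False)
```

The Streamlit dashboard then queries the same database to render its visualizations.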
- Description: This project simulates a data lake architecture using a local file system. It demonstrates key data engineering concepts, including ETL pipelines, data validation, metadata management, and querying with SQL. The project automates data ingestion, processing, aggregation, and analysis workflows using Python; a sketch of the Parquet-and-DuckDB querying layer follows this list.
- Technologies Used: Python, pandas, Faker, PyArrow, DuckDB, Jupyter Notebook, Matplotlib
- Practices Used: ETL pipeline development, data cleaning and transformation, data quality validation, metadata management, batch processing, SQL querying on Parquet files, project automation and orchestration.
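To illustrate the querying layer, here is a minimal sketch that writes a Parquet file into the local "lake" and queries it in place with DuckDB. The paths and columns are invented for the example.

```python
import os

import duckdb
import pandas as pd

# Write a small batch to the local data lake as Parquet (PyArrow does the writing).
os.makedirs("lake/processed", exist_ok=True)  # illustrative layout
events = pd.DataFrame(
    {
        "event_id": range(1, 6),
        "category": ["a", "b", "a", "c", "b"],
        "amount": [10.0, 25.5, 7.25, 40.0, 12.75],
    }
)
events.to_parquet("lake/processed/events.parquet")

# DuckDB queries Parquet files in place; no load step is needed.
result = duckdb.sql(
    "SELECT category, SUM(amount) AS total FROM 'lake/processed/events.parquet' "
    "GROUP BY category ORDER BY total DESC"
).df()
print(result)
```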
- Description: A data pipeline using Apache Spark to process and analyze transactional data for detecting spending trends and potential fraud. The workflow covers data ingestion, cleaning, transformation, analysis, and visualization; a PySpark aggregation sketch follows this list.
- Technologies Used: Apache Spark, PySpark, Python, pandas, Matplotlib, Seaborn, Docker, Parquet
- Practices Used: Batch data processing, data cleaning and transformation, aggregation and analysis, data visualization, scalable data pipelines, Docker-based deployment, efficient data storage with Parquet.
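Below is a minimal PySpark sketch of the aggregation step, assuming an illustrative transactions Parquet file. The spend-spike rule at the end is a simple stand-in heuristic, not the project's actual fraud logic.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("spending-trends-sketch").getOrCreate()

# Path and column names are illustrative, not the project's actual schema.
transactions = spark.read.parquet("data/transactions.parquet")

# Aggregate spend per customer per month to surface trends.
monthly_spend = (
    transactions
    .withColumn("month", F.date_trunc("month", F.col("transaction_date")))
    .groupBy("customer_id", "month")
    .agg(F.sum("amount").alias("total_spent"), F.count("*").alias("txn_count"))
)

# Flag months where a customer spends far above their own average:
# a simple stand-in heuristic, not the project's fraud rule.
per_customer = Window.partitionBy("customer_id")
flagged = (
    monthly_spend
    .withColumn("avg_spent", F.avg("total_spent").over(per_customer))
    .filter(F.col("total_spent") > 3 * F.col("avg_spent"))
)

flagged.show()
spark.stop()
```

Storing intermediate results as Parquet keeps the data compact and columnar, which is the efficient-storage practice the project lists above.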
Each project folder contains its own README with detailed instructions for running that project.
This repository is licensed under the MIT License. See the LICENSE file for more details.