This page contains end-to-end machine learning and data engineering projects.
Streaming with Spark (Scala): In this project, we build a cloud-native, fully Dockerized real-time data pipeline, orchestrated with Kubernetes and powered by Spark.
Serverless Podcast Transcription Pipeline: In this project, we leverage AWS Lambda, along with asynchronous Amazon Transcribe and Amazon Comprehend jobs, to create a fast, event-based podcast transcription pipeline.
BoundaryDM Library: This project consists of an sklearn-style deployment of a novel technique for nonlinear dimensionality reduction, developed by me and my advisor.
Cloud-Based Reddit ETL with Airflow: In this project, we build a data pipeline that extracts data using the Reddit API, transforms it into a structured format, and loads the result into a MySQL database.
Real-Time Streaming Pipeline with Kafka: In this project, we build a real-time data pipeline that streams stock market data from the Twelve Data API and uploads the result to S3.
NBA Podcast Data Pipeline with Airflow: In this project, I use Airflow to create a data pipeline that automatically downloads podcasts and stores podcast metadata in an SQLite database.
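To give a flavor of the metadata-storage step in a pipeline like the last one, here is a minimal sketch using only Python's standard library. It is not the project's actual code: the `episodes` table, its columns, and the `store_episodes` function are hypothetical names chosen for illustration, and the Airflow scheduling around it is left out.

```python
import sqlite3


def store_episodes(conn, episodes):
    """Insert episode metadata and return how many new rows were added.

    Hypothetical schema: the table and column names are assumptions,
    not taken from the project itself.
    """
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS episodes (
            link TEXT PRIMARY KEY,
            title TEXT,
            published TEXT,
            filename TEXT
        )
        """
    )
    before = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
    # INSERT OR IGNORE skips rows whose primary key (link) already exists,
    # so re-running the task does not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO episodes (link, title, published, filename) "
        "VALUES (:link, :title, :published, :filename)",
        episodes,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM episodes").fetchone()[0]
    return after - before
```

Keying the table on the episode link makes the load step idempotent, which matters when a scheduler retries or re-runs a task over the same feed.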