AWS-Blog: Building Data Aggregation Pipelines using Apache Airflow and Athena

Business insights are frequently generated from aggregated data, such as daily sales per market segment over time. In this blog post, we’ll use Apache Airflow to build a data aggregation pipeline that uses Amazon Athena for the heavy lifting. We’ll also cover best practices you should follow to build a production-ready system.

2024-09-23 · 7 min · Maurice Borgmeier

AWS-Blog: Making the TPC-H dataset available in Athena using Airflow

The TPC-H dataset is commonly used to benchmark data warehouses or, more generally, decision support systems. It describes a typical e-commerce workload and includes benchmark queries to enable performance comparison between different data warehouses. I think the dataset is also useful to teach building different kinds of ETL or analytics workflows, so I decided to explore ways of making it available in Amazon Athena.

2024-08-29 · 7 min · Maurice Borgmeier

AWS-Blog: Enabling Apache Airflow to copy large S3 objects

If you’re trying to use Apache Airflow to copy large objects in S3, you might have run into S3 rejecting the request with an InvalidRequest error. We will fix that in this post by writing a custom operator that handles the underlying problem.

2024-08-27 · 3 min · Maurice Borgmeier

Installing Apache Airflow on macOS

I’m currently diving a bit deeper into Apache Airflow and want to further my understanding of the system. I chose to install it locally on my Mac because a managed service like Managed Workflows for Apache Airflow (MWAA) on AWS limits how much I can tinker with the system. For anything remotely production-related, I’d still go with the managed service. I used the Airflow: Getting Started documentation to do exactly that, getting started....

2024-08-16 · 4 min · Maurice Borgmeier