Apache Airflow 2.x on Kubernetes – Production-Ready Data Orchestration for Big Data
Apache Airflow 2.x is an open-source workflow orchestration framework used for ETL, data engineering, and machine learning pipelines. When paired with the Kubernetes Executor in a production-grade environment, it enables dynamic, containerized task execution with fine-grained resource control, ideal for handling the 5Vs of Big Data (Volume, Velocity, Variety, Veracity, and Value).
This document outlines a production architecture, configuration specifications, and a feature comparison with Azure Data Factory (ADF) for enterprise-scale deployments.
Big Data Production Environment Requirements
A Kubernetes-deployed Airflow 2.x environment designed for large-scale, mission-critical data pipelines should address:
- Volume – Handle terabytes to petabytes of data in distributed storage systems.
- Velocity – Support both batch and near-real-time ingestion and transformation.
- Variety – Orchestrate across heterogeneous sources (databases, APIs, message queues, data lakes).
- Veracity – Integrate quality checks, schema validation, and retry logic.
- Value – Provide reliable SLAs for downstream analytics and business decisions.
Kubernetes Deployment Architecture
Core Components:
- Airflow 2.x Scheduler: Runs in a dedicated pod, managing DAG parsing and task scheduling.
- Kubernetes Executor: Spins up one pod per task on-demand in a configured namespace.
- Metadata Database: PostgreSQL or MySQL (RDS or CloudSQL recommended for HA).
- Webserver: Serves the Airflow UI with RBAC and authentication (OAuth2, SAML).
- Logging & Monitoring:
  - Centralized logs in S3/GCS with lifecycle policies.
  - Metrics in Prometheus/Grafana.
- Secrets & Config: Managed via Kubernetes Secrets, AWS Secrets Manager, or HashiCorp Vault.
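With a secrets backend enabled, task code resolves credentials by connection ID instead of embedding them. A minimal sketch, assuming a hypothetical `warehouse_db` connection stored in a backend such as AWS Secrets Manager:

```python
# Minimal sketch: connection lookup goes through the configured secrets backend
# (e.g. AIRFLOW__SECRETS__BACKEND pointed at the AWS Secrets Manager backend).
# "warehouse_db" is a hypothetical connection ID, not part of this deployment.
from airflow.hooks.base import BaseHook

def warehouse_uri() -> str:
    conn = BaseHook.get_connection("warehouse_db")  # resolved via the backend
    return conn.get_uri()
```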
Storage & Data Access:
- Persistent Volumes for temporary workspace.
- DAGs synced from Git (Git-Sync sidecar) or object storage (S3/GCS).
- Network policies to isolate database and data lake access.
Data Transfer & Integration
- Ingestion: Airflow DAGs using operators for Kafka, AWS S3, Azure Blob Storage, Google Pub/Sub.
- Transformation: PySpark on EMR, Dataproc, or native Spark-on-K8s operators.
- Load: Write to Redshift, Snowflake, BigQuery, or Delta Lake on S3.
- Metadata & Governance: Integrate with AWS Glue Data Catalog or Apache Atlas.
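A minimal sketch of the ingest-and-load portion of such a pipeline, assuming the Amazon provider package is installed; the bucket, schema, table, and connection names are illustrative:

```python
# Minimal ingest -> load sketch; bucket/table/connection names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Ingestion: wait for the upstream producer to land a success marker.
    wait_for_raw = S3KeySensor(
        task_id="wait_for_raw",
        bucket_name="my-ingest-bucket",
        bucket_key="raw/{{ ds }}/_SUCCESS",
    )
    # Load: COPY curated Parquet into the warehouse.
    load = S3ToRedshiftOperator(
        task_id="load_curated",
        schema="analytics",
        table="events",
        s3_bucket="my-ingest-bucket",
        s3_key="curated/{{ ds }}/",
        redshift_conn_id="redshift_default",
        copy_options=["FORMAT AS PARQUET"],
    )
    wait_for_raw >> load
```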
Comparison: Apache Airflow 2.x on K8s vs. Azure Data Factory
| Feature | Airflow 2.x on Kubernetes | Azure Data Factory (ADF) |
|---|---|---|
| Execution Model | On-demand Kubernetes pods per task | Managed compute integration |
| Flexibility | Fully customizable Python DAGs | Limited to ADF activities |
| Big Data Integration | Native Spark, Hive, Flink operators | HDInsight, Synapse, Databricks |
| Hybrid/Multicloud | Yes (any K8s cluster) | Azure-only |
| Cost Model | Pay for K8s resources used | Pay per activity/runtime hour |
| SLA Control | Full control over infra & scaling | Managed by Azure |
| Extensibility | Python operators/plugins | Custom .NET/Java in Data Flows |
Benchmark: S3 Data Transfer Throughput on Airflow 2.x (KubernetesExecutor)
This benchmark measures end-to-end throughput for moving data from S3 → ephemeral pod storage → S3 using a container that runs `aws s3 cp` (or `s5cmd`) inside Kubernetes worker pods launched by Airflow. It demonstrates the impact of CPU/memory limits, concurrency, and network class on sustained throughput.
Objective
- Quantify S3 transfer throughput per pod and at cluster level under realistic Airflow orchestration.
- Validate pod sizing for the 5Vs: sustained Volume/Velocity while preserving Veracity (checksums), and maximizing Value (throughput per dollar).
Test Dataset
- Source: `s3://my-ingest-bucket/bench/source/` containing ~50 GB of Parquet/CSV (or any fixed-size corpus; a staging sketch follows this list).
- Output: `s3://my-ingest-bucket/bench/output/${RUN_ID}/`
- Region: keep pods and bucket in the same region/AZs when possible.
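If a fixed-size corpus has to be staged first, something like the following keeps runs repeatable; the object count and size are hypothetical and chosen to total ~50 GB:

```python
# Hypothetical staging script: 400 x 128 MiB objects ~= 50 GB of random bytes.
import os
import boto3

s3 = boto3.client("s3")
OBJECT_SIZE = 128 * 1024 * 1024  # 128 MiB per object

for i in range(400):
    s3.put_object(
        Bucket="my-ingest-bucket",
        Key=f"bench/source/part-{i:05d}.bin",
        Body=os.urandom(OBJECT_SIZE),
    )
```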
Pod Configurations (KubernetesExecutor)
We’ll test three pod shapes. These are consumed by Airflow’s KubernetesExecutor via a pod template reference.
Small (`pod_templates/transfer-small.yaml`):
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    component: airflow-worker
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: m5.large
  containers:
    - name: transfer
      image: public.ecr.aws/aws-cli/aws-cli:2.17.0
      command: ["/bin/sh", "-lc"]
      args:
        - |
          set -euo pipefail
          START=$(date +%s)
          aws s3 sync s3://my-ingest-bucket/bench/source /data/source --no-progress
          aws s3 sync /data/source s3://my-ingest-bucket/bench/output/${RUN_ID} --no-progress
          END=$(date +%s)
          BYTES=$(du -sb /data/source | awk '{print $1}')
          DUR=$((END-START))
          echo "bytes=${BYTES} duration_s=${DUR} throughput_MBps=$(echo "${BYTES} / 1048576 / ${DUR}" | bc -l)"
      volumeMounts:
        - name: scratch
          mountPath: /data
      env:
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        - name: RUN_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['airflow.apache.org/run-id']
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "50Gi"
```
Medium (`transfer-medium.yaml`, not shown) follows the same structure with intermediate sizing, e.g. m5.2xlarge nodes and requests/limits between the two shapes shown here.

Large (`transfer-large.yaml`), for peak sustained throughput, overrides the node selector, resources, and scratch size:
```yaml
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: c6i.4xlarge
  containers:
    - name: transfer
      resources:
        requests: { cpu: "4", memory: "8Gi" }
        limits: { cpu: "8", memory: "16Gi" }
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "500Gi"
```
Notes:
- Use instance types with high network bandwidth (e.g., c6i/m6i families).
- If your cluster supports it, attach a faster storage class for scratch (NVMe-backed ephemeral).
- Replace `aws s3 sync` with `s5cmd` for higher concurrency within a pod if desired.
Airflow DAG (KubernetesPodOperator)
The DAG spins up N parallel pods using the chosen template, then aggregates simple metrics from task logs.
```python
# dags/bench_s3_transfer.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

DEFAULT_ARGS = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="bench_s3_transfer",
    default_args=DEFAULT_ARGS,
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["benchmark", "s3", "throughput"],
) as dag:
    # Choose one template per run: small | medium | large
    template = "/opt/airflow/pod_templates/transfer-small.yaml"
    parallelism = 4  # modify to 1, 4, 8, 16 for scaling tests

    tasks = []
    for i in range(parallelism):
        t = KubernetesPodOperator(
            task_id=f"s3_transfer_{i}",
            name=f"s3-transfer-{i}",
            namespace="airflow",
            pod_template_file=template,
            get_logs=True,
            is_delete_operator_pod=True,
            in_cluster=True,
            do_xcom_push=False,
        )
        tasks.append(t)

    # Simple fan-out, no dependencies between transfers.
    # Optionally add a Python task to parse logs and push a summary metric.
```
To vary pod shape, run the DAG with `transfer-medium.yaml` or `transfer-large.yaml`, and adjust `parallelism`.
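Because `parallelism` determines how many tasks exist, it must be known at DAG parse time; one way to avoid editing the file between runs is to read Airflow Variables instead. A sketch with hypothetical variable names:

```python
# Hypothetical parametrization: resolved at DAG parse time via Airflow Variables.
from airflow.models import Variable

template = Variable.get(
    "bench_pod_template",
    default_var="/opt/airflow/pod_templates/transfer-small.yaml",
)
parallelism = int(Variable.get("bench_parallelism", default_var=4))
```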
Metrics Collection
Collect metrics at two levels:
- Per-pod (from task logs): bytes transferred, duration, throughput in MB/s.
- Cluster (Prometheus): CPU/memory, node NIC throughput, pod restarts.

Example PromQL:

Pod network Rx/Tx:

```promql
sum(rate(container_network_receive_bytes_total{pod=~"s3-transfer-.*"}[1m]))
sum(rate(container_network_transmit_bytes_total{pod=~"s3-transfer-.*"}[1m]))
```

CPU and memory:

```promql
sum(rate(container_cpu_usage_seconds_total{pod=~"s3-transfer-.*"}[1m]))
sum(container_memory_working_set_bytes{pod=~"s3-transfer-.*"})
```

Airflow task duration (if exported): derive from Airflow metrics or parse task logs via a follow-up task, as sketched below.
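For the follow-up aggregation task, a sketch of a parser, assuming remote task logging to S3 is enabled and the `bytes=... duration_s=... throughput_MBps=...` line from the transfer script survives in each log; the bucket and prefix are hypothetical:

```python
# Hypothetical log aggregation: scan remote task logs on S3 for the metric line
# emitted by the transfer script and compute a per-run throughput summary.
import re
import boto3

METRIC_LINE = re.compile(
    r"bytes=(?P<bytes>\d+) duration_s=(?P<dur>\d+) throughput_MBps=(?P<mbps>[\d.]+)"
)

def summarize_run(bucket: str, prefix: str) -> dict:
    s3 = boto3.client("s3")
    throughputs = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for m in METRIC_LINE.finditer(body.decode("utf-8", "replace")):
                throughputs.append(float(m.group("mbps")))
    return {
        "pods": len(throughputs),
        "avg_pod_MBps": sum(throughputs) / len(throughputs) if throughputs else 0.0,
        "aggregate_MBps": sum(throughputs),
    }
```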
Example Results (Illustrative)
| Pod Shape | Parallel Pods | Dataset Size (per pod) | Avg Pod Throughput (MB/s) | Aggregate Throughput (MB/s) | Node Type |
|---|---|---|---|---|---|
| Small | 4 | 12.5 GB | 120 | ~480 | m5.large |
| Medium | 4 | 12.5 GB | 220 | ~880 | m5.2xlarge |
| Large | 8 | 6.25 GB | 300 | ~2,400 | c6i.4xlarge |
Interpretation:
- Throughput scales with CPU and NIC bandwidth until the S3 or VPC egress limit becomes the bottleneck.
- Increasing parallel pods often raises aggregate throughput more than enlarging a single pod, but watch node NIC saturation and S3 request limits.
- Replace `aws s3 sync` with `s5cmd sync --concurrency N` to increase intra-pod parallelism.
Validation and Data Integrity
- Enable checksums (`--checksum` in `s5cmd cp/sync`) or verify sizes/hashes post-copy; a minimal verification sketch follows this list.
- Use S3 multipart defaults; avoid tiny part sizes for large files.
- Keep pods and bucket in the same region and, where applicable, same VPC endpoint to minimize cross-AZ charges and latency.
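A minimal size-comparison sketch for the post-copy check, following the benchmark's prefix layout; treat it as illustrative, not a substitute for real checksums:

```python
# Illustrative post-copy verification: compare object sizes under two prefixes.
import boto3

def sizes_under(bucket: str, prefix: str) -> dict:
    s3 = boto3.client("s3")
    out = {}
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            out[obj["Key"][len(prefix):]] = obj["Size"]
    return out

def verify_copy(bucket: str, src_prefix: str, dst_prefix: str) -> None:
    src = sizes_under(bucket, src_prefix)
    dst = sizes_under(bucket, dst_prefix)
    missing = set(src) - set(dst)
    mismatched = {k for k in src.keys() & dst.keys() if src[k] != dst[k]}
    if missing or mismatched:
        raise ValueError(f"missing={sorted(missing)} size_mismatch={sorted(mismatched)}")
```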
Cost and Tuning Guidance
- Prefer spot nodes for non-critical benchmarks; switch to on-demand for production SLAs.
- Right-size `requests`/`limits`; avoid CPU throttling (check `container_cpu_cfs_throttled_seconds_total`).
- Use HPA for the Airflow webserver/scheduler under heavy DAG churn, but keep worker pods ephemeral.
- For heavy fan-out, control concurrency via `max_active_runs` per DAG and `parallelism` at the Airflow level, as in the sketch after this list.
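Where those knobs live, as a short sketch; the cap values are illustrative, and `core.parallelism` itself is set in `airflow.cfg` or via `AIRFLOW__CORE__PARALLELISM`, not in the DAG:

```python
# Illustrative DAG-level caps; cluster-wide parallelism is an airflow.cfg setting.
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="bench_s3_transfer_capped",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    max_active_runs=1,    # one benchmark run at a time
    max_active_tasks=8,   # at most 8 concurrent transfer pods for this DAG
) as dag:
    ...
```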
Optional: Using Spark for Parallel Transfer + Transform
If transformation is required, substitute the transfer container with a Spark-on-K8s or EMR on EKS step and write Parquet to `s3://…/curated/`, registering schemas in AWS Glue. Airflow still controls orchestration and retries.
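If the Spark-on-K8s route is taken, and assuming the spark-on-k8s operator's SparkApplication CRD is installed in the cluster, the transfer task could be swapped for something like this sketch; the namespace and manifest path are hypothetical:

```python
# Hypothetical substitution: submit a SparkApplication instead of a plain copy pod.
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

transform = SparkKubernetesOperator(
    task_id="spark_transform",
    namespace="spark",
    application_file="spark/curate_parquet.yaml",  # SparkApplication CRD manifest
    do_xcom_push=False,
)
```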
What This Benchmark Shows
- A practical ceiling for per-pod throughput and cluster aggregate throughput for S3 transfers in your environment.
- The trade-off between bigger pods and more pods.
- The effect of node families and storage class on sustained transfer rates.