Apache Airflow 2.x on Kubernetes – Production-Ready Data Orchestration for Big Data

Apache Airflow 2.x is an open-source workflow orchestration framework used for ETL, data engineering, and machine learning pipelines. When paired with the Kubernetes Executor in a production-grade environment, it enables dynamic, containerized task execution with fine-grained resource control, ideal for handling the 5Vs of Big Data (Volume, Velocity, Variety, Veracity, and Value).

This document outlines a production architecture, configuration specifications, and a feature comparison with Azure Data Factory (ADF) for enterprise-scale deployments.


Big Data Production Environment Requirements

A Kubernetes-deployed Airflow 2.x environment designed for large-scale, mission-critical data pipelines should address:

  • Volume – Handle terabytes to petabytes of data in distributed storage systems.
  • Velocity – Support both batch and near-real-time ingestion and transformation.
  • Variety – Orchestrate across heterogeneous sources (databases, APIs, message queues, data lakes).
  • Veracity – Integrate quality checks, schema validation, and retry logic.
  • Value – Provide reliable SLAs for downstream analytics and business decisions.

Kubernetes Deployment Architecture

Core Components:

  • Airflow 2.x Scheduler: Runs in a dedicated pod, managing DAG parsing and task scheduling.
  • Kubernetes Executor: Spins up one pod per task on-demand in a configured namespace; per-task resources can be overridden via executor_config (see the sketch after this list).
  • Metadata Database: PostgreSQL or MySQL (RDS or CloudSQL recommended for HA).
  • Webserver: Serves the Airflow UI with RBAC and authentication (OAuth2, SAML).
  • Logging & Monitoring:
    • Centralized logs in S3/GCS with lifecycle policies.
    • Metrics in Prometheus/Grafana.
  • Secrets & Config: Managed via Kubernetes Secrets, AWS Secrets Manager, or HashiCorp Vault.
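
As a concrete example of the per-task resource control mentioned in the Kubernetes Executor bullet above, the sketch below passes a pod_override through executor_config. It is a minimal sketch, assuming Airflow 2.x with the KubernetesExecutor and the kubernetes Python client installed; the DAG id, task, and resource figures are illustrative.

# dags/resource_override_example.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Override only the resources of the worker pod created for this one task.
RESOURCE_OVERRIDE = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the KubernetesExecutor worker container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="resource_override_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=lambda: print("runs in a pod with the overridden resources"),
        executor_config=RESOURCE_OVERRIDE,
    )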

Storage & Data Access:

  • Persistent Volumes for temporary workspace.
  • DAGs synced from Git (Git-Sync sidecar) or object storage (S3/GCS).
  • Network policies to isolate database and data lake access.

Data Transfer & Integration

  • Ingestion: Airflow DAGs using operators for Kafka, AWS S3, Azure Blob Storage, Google Pub/Sub (a minimal end-to-end DAG sketch follows this list).
  • Transformation: PySpark on EMR, Dataproc, or native Spark-on-K8s operators.
  • Load: Write to Redshift, Snowflake, BigQuery, or Delta Lake on S3.
  • Metadata & Governance: Integrate with AWS Glue Data Catalog or Apache Atlas.
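
To make the list above concrete, here is a minimal ingest → transform → load DAG sketch. It assumes the amazon and cncf.kubernetes provider packages are installed; bucket names, connection ids, and the SparkApplication manifest path are placeholders, and exact import paths and operator arguments vary across provider versions.

# dags/etl_pipeline_sketch.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

with DAG(
    dag_id="etl_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    # Velocity: wait for new files to land in the ingest bucket
    wait_for_raw = S3KeySensor(
        task_id="wait_for_raw",
        bucket_name="my-ingest-bucket",
        bucket_key="landing/{{ ds }}/*.parquet",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    # Volume/Variety: Spark-on-K8s transformation defined in a SparkApplication manifest
    transform = SparkKubernetesOperator(
        task_id="spark_transform",
        namespace="spark",
        application_file="spark/transform_sparkapp.yaml",  # placeholder manifest path
    )

    # Value: load curated Parquet into the warehouse for analytics
    load = S3ToRedshiftOperator(
        task_id="load_redshift",
        schema="analytics",
        table="events",
        s3_bucket="my-curated-bucket",
        s3_key="curated/{{ ds }}/",
        redshift_conn_id="redshift_default",
        copy_options=["FORMAT AS PARQUET"],
    )

    wait_for_raw >> transform >> load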

Comparison: Apache Airflow 2.x on K8s vs. Azure Data Factory

Feature | Airflow 2.x on Kubernetes | Azure Data Factory (ADF)
Execution Model | On-demand Kubernetes pods per task | Managed compute integration
Flexibility | Fully customizable Python DAGs | Limited to ADF activities
Big Data Integration | Native Spark, Hive, Flink operators | HDInsight, Synapse, Databricks
Hybrid/Multicloud | Yes (any K8s cluster) | Azure-only
Cost Model | Pay for K8s resources used | Pay per activity/runtime hour
SLA Control | Full control over infra & scaling | Managed by Azure
Extensibility | Python operators/plugins | Custom .NET/Java in Data Flows

Benchmark: S3 Data Transfer Throughput on Airflow 2.x (KubernetesExecutor)

This benchmark measures end-to-end throughput for moving data from S3 → ephemeral pod storage → S3 using a container that runs aws s3 cp (or s5cmd) inside Kubernetes worker pods launched by Airflow. It demonstrates the impact of CPU/memory limits, concurrency, and network class on sustained throughput.

Objective

  • Quantify S3 transfer throughput per pod and at cluster level under realistic Airflow orchestration.
  • Validate pod sizing for the 5Vs: sustained Volume/Velocity while preserving Veracity (checksums), and maximizing Value (throughput per dollar).

Test Dataset

  • Source: s3://my-ingest-bucket/bench/source/ containing ~50 GB of Parquet/CSV (or any fixed-size corpus).
  • Output: s3://my-ingest-bucket/bench/output/${RUN_ID}/
  • Region: keep pods and bucket in the same region/AZs when possible.

Pod Configurations (KubernetesExecutor)

We’ll test three pod shapes: small, medium, and large (the medium template follows the same pattern as the two shown below). Each shape is supplied per task through a pod_template_file reference, as the DAG further down shows.

Pod template (pod_templates/transfer-small.yaml):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: airflow-worker
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: m5.large
  containers:
  - name: transfer
    image: public.ecr.aws/aws-cli/aws-cli:2.17.0
    command: ["/bin/sh","-lc"]
    args:
      - |
        set -euo pipefail
        START=$(date +%s)
        aws s3 sync s3://my-ingest-bucket/bench/source /data/source --no-progress
        aws s3 sync /data/source s3://my-ingest-bucket/bench/output/${RUN_ID} --no-progress
        END=$(date +%s)
        BYTES=$(du -sb /data/source | awk '{print $1}')
        DUR=$((END-START))
        # compute MB/s with awk (already used above) rather than bc, which may not be present in the aws-cli image
        echo "bytes=${BYTES} duration_s=${DUR} throughput_MBps=$(awk -v b="${BYTES}" -v d="${DUR}" 'BEGIN{printf "%.1f", b/1048576/d}')"
    volumeMounts:
      - name: scratch
        mountPath: /data
    env:
      - name: AWS_DEFAULT_REGION
        value: us-east-1
      - name: RUN_ID
        valueFrom:
          fieldRef:
            # Airflow-provided run-id annotation; verify the exact key for your Airflow/provider version
            fieldPath: metadata.annotations['airflow.apache.org/run-id']
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "50Gi"

Large pod template (pod_templates/transfer-large.yaml), sized for peak sustained throughput; only the fields that differ from the small template are shown:

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: c6i.4xlarge
  containers:
  - name: transfer
    resources:
      requests: { cpu: "4",  memory: "8Gi" }
      limits:   { cpu: "8",  memory: "16Gi" }
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "500Gi"

Notes:

  • Use instance types with high network bandwidth (e.g., the c6i/m6i families).
  • If your cluster supports it, attach a faster storage class for scratch (NVMe-backed ephemeral storage).
  • Replace aws s3 sync with s5cmd for higher concurrency within a pod if desired.


Airflow DAG (KubernetesPodOperator)

The DAG spins N parallel pods using the chosen template, then aggregates simple metrics from task logs.

# dags/bench_s3_transfer.py
from datetime import datetime, timedelta
from airflow import DAG
# note: newer cncf.kubernetes provider releases expose this operator at airflow.providers.cncf.kubernetes.operators.pod
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

DEFAULT_ARGS = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="bench_s3_transfer",
    default_args=DEFAULT_ARGS,
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["benchmark","s3","throughput"],
) as dag:

    # Choose one template per run: small | medium | large
    template = "/opt/airflow/pod_templates/transfer-small.yaml"

    parallelism = 4  # modify to 1, 4, 8, 16 for scaling tests

    tasks = []
    for i in range(parallelism):
        t = KubernetesPodOperator(
            task_id=f"s3_transfer_{i}",
            name=f"s3-transfer-{i}",
            namespace="airflow",
            pod_template_file=template,
            get_logs=True,
            is_delete_operator_pod=True,
            in_cluster=True,
            do_xcom_push=False,
        )
        tasks.append(t)

    # Simple fan-out, no dependencies between transfers
    # Optionally add a Python task to parse logs and push a summary metric.
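
As suggested in the comment above, a summary task can parse the per-pod log line emitted by the transfer container ("bytes=... duration_s=... throughput_MBps=..."). A minimal parsing sketch follows; it assumes the relevant log lines have been collected into local *.log files (for example, pulled from the remote log bucket), and the directory path is hypothetical. Wrapping summarize() in a PythonOperator is then straightforward.

# bench_summary.py (illustrative helper, not wired into the DAG above)
import re
from pathlib import Path

# matches the line echoed by the transfer container in the pod template
LINE_RE = re.compile(r"bytes=(\d+)\s+duration_s=(\d+)\s+throughput_MBps=([\d.]+)")

def summarize(log_dir: str) -> dict:
    """Parse benchmark lines from *.log files and compute aggregate figures."""
    total_bytes, durations, per_pod_mbps = 0, [], []
    for path in Path(log_dir).glob("*.log"):
        for match in LINE_RE.finditer(path.read_text()):
            total_bytes += int(match.group(1))
            durations.append(int(match.group(2)))
            per_pod_mbps.append(float(match.group(3)))
    wall_clock = max(durations) if durations else 0  # pods run in parallel
    return {
        "pods": len(per_pod_mbps),
        "total_gb": round(total_bytes / 1024**3, 2),
        "avg_pod_MBps": round(sum(per_pod_mbps) / len(per_pod_mbps), 1) if per_pod_mbps else 0.0,
        "aggregate_MBps": round(total_bytes / 1048576 / wall_clock, 1) if wall_clock else 0.0,
    }

if __name__ == "__main__":
    print(summarize("/tmp/bench_logs"))  # hypothetical local directory of collected task logs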

To vary pod shape, run the DAG with transfer-medium.yaml or transfer-large.yaml, and adjust parallelism.
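
One low-friction way to switch these between runs is to read them from Airflow Variables at DAG parse time. This is a small sketch; the variable names are hypothetical, and note that parse-time Variable.get adds a metadata-database read on every DAG parse.

# inside dags/bench_s3_transfer.py, replacing the hard-coded values (illustrative)
from airflow.models import Variable

# e.g. set via: airflow variables set bench_pod_template transfer-large.yaml
template_name = Variable.get("bench_pod_template", default_var="transfer-small.yaml")
template = f"/opt/airflow/pod_templates/{template_name}"
parallelism = int(Variable.get("bench_parallelism", default_var=4))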


Metrics Collection

Per-pod metrics come from the task logs: bytes transferred, duration, and throughput in MB/s. Cluster-level metrics come from Prometheus: CPU/memory usage, node NIC throughput, and pod restarts. Example PromQL:

  • Pod network Rx/Tx:
sum(rate(container_network_receive_bytes_total{pod=~"s3-transfer-.*"}[1m]))
sum(rate(container_network_transmit_bytes_total{pod=~"s3-transfer-.*"}[1m]))
  • CPU and memory:
sum(rate(container_cpu_usage_seconds_total{pod=~"s3-transfer-.*"}[1m]))
sum(container_memory_working_set_bytes{pod=~"s3-transfer-.*"})
  • Airflow task duration (if exported): derive from Airflow metrics or parse task logs via a follow-up task.

Example Results (Illustrative)

Pod Shape | Parallel Pods | Dataset Size (per pod) | Avg Pod Throughput (MB/s) | Aggregate Throughput (MB/s) | Node Type
Small  | 4 | 12.5 GB | 120 | ~480   | m5.large
Medium | 4 | 12.5 GB | 220 | ~880   | m5.2xlarge
Large  | 8 | 6.25 GB | 300 | ~2,400 | c6i.4xlarge

Interpretation:

  • Throughput scales with CPU and NIC bandwidth until the S3 or VPC egress limit becomes the bottleneck.
  • Increasing parallel pods often raises aggregate throughput more than enlarging a single pod, but watch node NIC saturation and S3 request limits.
  • Replace aws s3 sync with s5cmd sync --concurrency N to increase intra-pod parallelism.

Validation and Data Integrity

  • Enable checksums (--checksum in s5cmd cp/sync) or verify sizes/hashes post-copy (a boto3 sketch follows this list).
  • Use S3 multipart defaults; avoid tiny part sizes for large files.
  • Keep pods and bucket in the same region and, where applicable, same VPC endpoint to minimize cross-AZ charges and latency.
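
To complement the checksum options in the first bullet, a quick object-count and byte-count comparison between the source and output prefixes can be scripted with boto3. This is a minimal sketch; boto3 and AWS credentials are assumed, and the run id in the output prefix is a placeholder.

# verify_copy.py (illustrative)
import boto3

def prefix_stats(bucket: str, prefix: str) -> tuple:
    """Return (object_count, total_bytes) under an S3 prefix."""
    s3 = boto3.client("s3")
    count, total = 0, 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total

src = prefix_stats("my-ingest-bucket", "bench/source/")
dst = prefix_stats("my-ingest-bucket", "bench/output/my-run-id/")  # placeholder run id
print(f"source={src} output={dst} match={src == dst}")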

Cost and Tuning Guidance

  • Prefer spot nodes for non-critical benchmarks; switch to on-demand for production SLAs.
  • Right-size requests/limits; avoid CPU throttling (check container_cpu_cfs_throttled_seconds_total).
  • Use HPA for the Airflow webserver/scheduler under heavy DAG churn, but keep worker pods ephemeral.
  • For heavy fan-out, control concurrency via max_active_runs per DAG and parallelism at the Airflow level (see the sketch after this list).
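
For the fan-out control in the last bullet, the relevant knobs at the DAG and installation level look like the sketch below; the values are illustrative and the configuration keys shown are the Airflow 2.x names.

# DAG-level caps (illustrative values)
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="bench_s3_transfer_tuned",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    max_active_runs=1,     # only one benchmark run at a time
    max_active_tasks=8,    # cap concurrent transfer pods for this DAG
) as dag:
    ...

# Installation-wide caps via airflow.cfg or environment variables, e.g.:
#   AIRFLOW__CORE__PARALLELISM=32
#   AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=16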

Optional: Using Spark for Parallel Transfer + Transform

If transformation is required, substitute the transfer container with a Spark-on-K8s or EMR on EKS step and write Parquet to s3://…/curated/, registering schemas in AWS Glue. Airflow still controls orchestration and retries.
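
A hedged sketch of the EMR on EKS variant is shown below, using the amazon provider's EmrContainerOperator; the virtual cluster id, execution role ARN, and S3 paths are placeholders, and the operator's module path and arguments depend on the provider version.

# dags/curate_on_emr_eks_sketch.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator

with DAG(
    dag_id="curate_on_emr_eks_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    curate = EmrContainerOperator(
        task_id="spark_curate",
        name="curate-parquet",
        virtual_cluster_id="abc123virtualcluster",  # placeholder EMR on EKS virtual cluster id
        execution_role_arn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder
        release_label="emr-6.15.0-latest",
        job_driver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://my-code-bucket/jobs/curate.py",  # placeholder PySpark job
                "sparkSubmitParameters": "--conf spark.executor.instances=4",
            }
        },
        aws_conn_id="aws_default",
    )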


What This Benchmark Shows

  • A practical ceiling for per-pod throughput and cluster aggregate throughput for S3 transfers in your environment.
  • The trade-off between bigger pods versus more pods.
  • The effect of node families and storage class on sustained transfer rates.
