Apache Airflow 2.x on Kubernetes – Production-Ready Data Orchestration for Big Data

Apache Airflow 2.x is an open-source workflow orchestration framework used for ETL, data engineering, and machine learning pipelines. When paired with the Kubernetes Executor in a production-grade environment, it enables dynamic, containerized task execution with fine-grained resource control, ideal for handling the 5Vs of Big Data (Volume, Velocity, Variety, Veracity, and Value).

This document outlines a production architecture, configuration specifications, and a feature comparison with Azure Data Factory (ADF) for enterprise-scale deployments.


Big Data Production Environment Requirements

A Kubernetes-deployed Airflow 2.x environment designed for large-scale, mission-critical data pipelines should address:

  • Volume – Handle terabytes to petabytes of data in distributed storage systems.
  • Velocity – Support both batch and near-real-time ingestion and transformation.
  • Variety – Orchestrate across heterogeneous sources (databases, APIs, message queues, data lakes).
  • Veracity – Integrate quality checks, schema validation, and retry logic.
  • Value – Provide reliable SLAs for downstream analytics and business decisions.

Kubernetes Deployment Architecture

Core Components:

  • Airflow 2.x Scheduler: Runs in a dedicated pod, managing DAG parsing and task scheduling.
  • Kubernetes Executor: Spins up one pod per task on-demand in a configured namespace; per-task resources can be overridden via executor_config (see the sketch after this list).
  • Metadata Database: PostgreSQL or MySQL (RDS or CloudSQL recommended for HA).
  • Webserver: Serves the Airflow UI with RBAC and authentication (OAuth2, SAML).
  • Logging & Monitoring:
    • Centralized logs in S3/GCS with lifecycle policies.
    • Metrics in Prometheus/Grafana.
  • Secrets & Config: Managed via Kubernetes Secrets, AWS Secrets Manager, or HashiCorp Vault.
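
As a concrete example of the per-task resource control mentioned in the Kubernetes Executor bullet above, the sketch below passes a pod_override through executor_config. It is a minimal sketch, assuming Airflow 2.x with the KubernetesExecutor and the kubernetes Python client installed; the DAG id, task, and resource figures are illustrative.

# dags/resource_override_example.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s

# Override only the resources of the worker pod created for this one task.
RESOURCE_OVERRIDE = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # the KubernetesExecutor worker container is named "base"
                    resources=k8s.V1ResourceRequirements(
                        requests={"cpu": "1", "memory": "2Gi"},
                        limits={"cpu": "2", "memory": "4Gi"},
                    ),
                )
            ]
        )
    )
}

with DAG(
    dag_id="resource_override_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="heavy_transform",
        python_callable=lambda: print("runs in a pod with the overridden resources"),
        executor_config=RESOURCE_OVERRIDE,
    )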

Storage & Data Access:

  • Persistent Volumes for temporary workspace.
  • DAGs synced from Git (Git-Sync sidecar) or object storage (S3/GCS).
  • Network policies to isolate database and data lake access.

Data Transfer & Integration

  • Ingestion: Airflow DAGs using operators for Kafka, AWS S3, Azure Blob Storage, Google Pub/Sub (a minimal end-to-end DAG sketch follows this list).
  • Transformation: PySpark on EMR, Dataproc, or native Spark-on-K8s operators.
  • Load: Write to Redshift, Snowflake, BigQuery, or Delta Lake on S3.
  • Metadata & Governance: Integrate with AWS Glue Data Catalog or Apache Atlas.
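
To make the list above concrete, here is a minimal ingest → transform → load DAG sketch. It assumes the amazon and cncf.kubernetes provider packages are installed; bucket names, connection ids, and the SparkApplication manifest path are placeholders, and exact import paths and operator arguments vary across provider versions.

# dags/etl_pipeline_sketch.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

with DAG(
    dag_id="etl_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    # Velocity: wait for new files to land in the ingest bucket
    wait_for_raw = S3KeySensor(
        task_id="wait_for_raw",
        bucket_name="my-ingest-bucket",
        bucket_key="landing/{{ ds }}/*.parquet",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    # Volume/Variety: Spark-on-K8s transformation defined in a SparkApplication manifest
    transform = SparkKubernetesOperator(
        task_id="spark_transform",
        namespace="spark",
        application_file="spark/transform_sparkapp.yaml",  # placeholder manifest path
    )

    # Value: load curated Parquet into the warehouse for analytics
    load = S3ToRedshiftOperator(
        task_id="load_redshift",
        schema="analytics",
        table="events",
        s3_bucket="my-curated-bucket",
        s3_key="curated/{{ ds }}/",
        redshift_conn_id="redshift_default",
        copy_options=["FORMAT AS PARQUET"],
    )

    wait_for_raw >> transform >> load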

Comparison: Apache Airflow 2.x on K8s vs. Azure Data Factory

Feature | Airflow 2.x on Kubernetes | Azure Data Factory (ADF)
Execution Model | On-demand Kubernetes pods per task | Managed compute integration
Flexibility | Fully customizable Python DAGs | Limited to ADF activities
Big Data Integration | Native Spark, Hive, Flink operators | HDInsight, Synapse, Databricks
Hybrid/Multicloud | Yes (any K8s cluster) | Azure-only
Cost Model | Pay for K8s resources used | Pay per activity/runtime hour
SLA Control | Full control over infra & scaling | Managed by Azure
Extensibility | Python operators/plugins | Custom .NET/Java in Data Flows

Benchmark: S3 Data Transfer Throughput on Airflow 2.x (KubernetesExecutor)

This benchmark measures end-to-end throughput for moving data from S3 → ephemeral pod storage → S3 using a container that runs aws s3 cp (or s5cmd) inside Kubernetes worker pods launched by Airflow. It demonstrates the impact of CPU/memory limits, concurrency, and network class on sustained throughput.

Objective

  • Quantify S3 transfer throughput per pod and at cluster level under realistic Airflow orchestration.
  • Validate pod sizing for the 5Vs: sustained Volume/Velocity while preserving Veracity (checksums), and maximizing Value (throughput per dollar).

Test Dataset

  • Source: s3://my-ingest-bucket/bench/source/ containing ~50 GB of Parquet/CSV (or any fixed-size corpus).
  • Output: s3://my-ingest-bucket/bench/output/${RUN_ID}/
  • Region: keep pods and bucket in the same region/AZs when possible.

Pod Configurations (KubernetesExecutor)

We’ll test three pod shapes: small, medium, and large (the medium template follows the same pattern as the two shown below). Each shape is supplied per task through a pod_template_file reference, as the DAG further down shows.

Pod template (pod_templates/transfer-small.yaml):

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: airflow-worker
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: m5.large
  containers:
  - name: transfer
    image: public.ecr.aws/aws-cli/aws-cli:2.17.0
    command: ["/bin/sh","-lc"]
    args:
      - |
        set -euo pipefail
        START=$(date +%s)
        aws s3 sync s3://my-ingest-bucket/bench/source /data/source --no-progress
        aws s3 sync /data/source s3://my-ingest-bucket/bench/output/${RUN_ID} --no-progress
        END=$(date +%s)
        BYTES=$(du -sb /data/source | awk '{print $1}')
        DUR=$((END-START))
        # compute MB/s with awk (already used above) rather than bc, which may not be present in the aws-cli image
        echo "bytes=${BYTES} duration_s=${DUR} throughput_MBps=$(awk -v b="${BYTES}" -v d="${DUR}" 'BEGIN{printf "%.1f", b/1048576/d}')"
    volumeMounts:
      - name: scratch
        mountPath: /data
    env:
      - name: AWS_DEFAULT_REGION
        value: us-east-1
      - name: RUN_ID
        valueFrom:
          fieldRef:
            # Airflow-provided run-id annotation; verify the exact key for your Airflow/provider version
            fieldPath: metadata.annotations['airflow.apache.org/run-id']
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "50Gi"

Large pod template (pod_templates/transfer-large.yaml), sized for peak sustained throughput; only the fields that differ from the small template are shown:

spec:
  nodeSelector:
    node.kubernetes.io/instance-type: c6i.4xlarge
  containers:
  - name: transfer
    resources:
      requests: { cpu: "4",  memory: "8Gi" }
      limits:   { cpu: "8",  memory: "16Gi" }
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: "500Gi"

Notes:

  • Use instance types with high network bandwidth (e.g., the c6i/m6i families).
  • If your cluster supports it, attach a faster storage class for scratch (NVMe-backed ephemeral storage).
  • Replace aws s3 sync with s5cmd for higher concurrency within a pod if desired.


Airflow DAG (KubernetesPodOperator)

The DAG spins N parallel pods using the chosen template, then aggregates simple metrics from task logs.

# dags/bench_s3_transfer.py
from datetime import datetime, timedelta
from airflow import DAG
# note: newer cncf.kubernetes provider releases expose this operator at airflow.providers.cncf.kubernetes.operators.pod
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

DEFAULT_ARGS = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="bench_s3_transfer",
    default_args=DEFAULT_ARGS,
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["benchmark","s3","throughput"],
) as dag:

    # Choose one template per run: small | medium | large
    template = "/opt/airflow/pod_templates/transfer-small.yaml"

    parallelism = 4  # modify to 1, 4, 8, 16 for scaling tests

    tasks = []
    for i in range(parallelism):
        t = KubernetesPodOperator(
            task_id=f"s3_transfer_{i}",
            name=f"s3-transfer-{i}",
            namespace="airflow",
            pod_template_file=template,
            get_logs=True,
            is_delete_operator_pod=True,
            in_cluster=True,
            do_xcom_push=False,
        )
        tasks.append(t)

    # Simple fan-out, no dependencies between transfers
    # Optionally add a Python task to parse logs and push a summary metric.
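
As suggested in the comment above, a summary task can parse the per-pod log line emitted by the transfer container ("bytes=... duration_s=... throughput_MBps=..."). A minimal parsing sketch follows; it assumes the relevant log lines have been collected into local *.log files (for example, pulled from the remote log bucket), and the directory path is hypothetical. Wrapping summarize() in a PythonOperator is then straightforward.

# bench_summary.py (illustrative helper, not wired into the DAG above)
import re
from pathlib import Path

# matches the line echoed by the transfer container in the pod template
LINE_RE = re.compile(r"bytes=(\d+)\s+duration_s=(\d+)\s+throughput_MBps=([\d.]+)")

def summarize(log_dir: str) -> dict:
    """Parse benchmark lines from *.log files and compute aggregate figures."""
    total_bytes, durations, per_pod_mbps = 0, [], []
    for path in Path(log_dir).glob("*.log"):
        for match in LINE_RE.finditer(path.read_text()):
            total_bytes += int(match.group(1))
            durations.append(int(match.group(2)))
            per_pod_mbps.append(float(match.group(3)))
    wall_clock = max(durations) if durations else 0  # pods run in parallel
    return {
        "pods": len(per_pod_mbps),
        "total_gb": round(total_bytes / 1024**3, 2),
        "avg_pod_MBps": round(sum(per_pod_mbps) / len(per_pod_mbps), 1) if per_pod_mbps else 0.0,
        "aggregate_MBps": round(total_bytes / 1048576 / wall_clock, 1) if wall_clock else 0.0,
    }

if __name__ == "__main__":
    print(summarize("/tmp/bench_logs"))  # hypothetical local directory of collected task logs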

To vary pod shape, run the DAG with transfer-medium.yaml or transfer-large.yaml, and adjust parallelism.
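
One low-friction way to switch these between runs is to read them from Airflow Variables at DAG parse time. This is a small sketch; the variable names are hypothetical, and note that parse-time Variable.get adds a metadata-database read on every DAG parse.

# inside dags/bench_s3_transfer.py, replacing the hard-coded values (illustrative)
from airflow.models import Variable

# e.g. set via: airflow variables set bench_pod_template transfer-large.yaml
template_name = Variable.get("bench_pod_template", default_var="transfer-small.yaml")
template = f"/opt/airflow/pod_templates/{template_name}"
parallelism = int(Variable.get("bench_parallelism", default_var=4))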


Metrics Collection

Per-pod metrics come from the task logs: bytes transferred, duration, and throughput in MB/s. Cluster-level metrics come from Prometheus: CPU/memory usage, node NIC throughput, and pod restarts. Example PromQL:

  • Pod network Rx/Tx:
sum(rate(container_network_receive_bytes_total{pod=~"s3-transfer-.*"}[1m]))
sum(rate(container_network_transmit_bytes_total{pod=~"s3-transfer-.*"}[1m]))
  • CPU and memory:
sum(rate(container_cpu_usage_seconds_total{pod=~"s3-transfer-.*"}[1m]))
sum(container_memory_working_set_bytes{pod=~"s3-transfer-.*"})
  • Airflow task duration (if exported): derive from Airflow metrics or parse task logs via a follow-up task.

Example Results (Illustrative)

Pod Shape | Parallel Pods | Dataset Size (per pod) | Avg Pod Throughput (MB/s) | Aggregate Throughput (MB/s) | Node Type
Small  | 4 | 12.5 GB | 120 | ~480   | m5.large
Medium | 4 | 12.5 GB | 220 | ~880   | m5.2xlarge
Large  | 8 | 6.25 GB | 300 | ~2,400 | c6i.4xlarge

Interpretation:

  • Throughput scales with CPU and NIC bandwidth until the S3 or VPC egress limit becomes the bottleneck.
  • Increasing parallel pods often raises aggregate throughput more than enlarging a single pod, but watch node NIC saturation and S3 request limits.
  • Replace aws s3 sync with s5cmd sync --concurrency N to increase intra-pod parallelism.

Validation and Data Integrity

  • Enable checksums (--checksum in s5cmd cp/sync) or verify sizes/hashes post-copy (a boto3 sketch follows this list).
  • Use S3 multipart defaults; avoid tiny part sizes for large files.
  • Keep pods and bucket in the same region and, where applicable, same VPC endpoint to minimize cross-AZ charges and latency.
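
To complement the checksum options in the first bullet, a quick object-count and byte-count comparison between the source and output prefixes can be scripted with boto3. This is a minimal sketch; boto3 and AWS credentials are assumed, and the run id in the output prefix is a placeholder.

# verify_copy.py (illustrative)
import boto3

def prefix_stats(bucket: str, prefix: str) -> tuple:
    """Return (object_count, total_bytes) under an S3 prefix."""
    s3 = boto3.client("s3")
    count, total = 0, 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total

src = prefix_stats("my-ingest-bucket", "bench/source/")
dst = prefix_stats("my-ingest-bucket", "bench/output/my-run-id/")  # placeholder run id
print(f"source={src} output={dst} match={src == dst}")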

Cost and Tuning Guidance

  • Prefer spot nodes for non-critical benchmarks; switch to on-demand for production SLAs.
  • Right-size requests/limits; avoid CPU throttling (check container_cpu_cfs_throttled_seconds_total).
  • Use HPA for the Airflow webserver/scheduler under heavy DAG churn, but keep worker pods ephemeral.
  • For heavy fan-out, control concurrency via max_active_runs per DAG and parallelism at the Airflow level (see the sketch after this list).
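
For the fan-out control in the last bullet, the relevant knobs at the DAG and installation level look like the sketch below; the values are illustrative and the configuration keys shown are the Airflow 2.x names.

# DAG-level caps (illustrative values)
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="bench_s3_transfer_tuned",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    max_active_runs=1,     # only one benchmark run at a time
    max_active_tasks=8,    # cap concurrent transfer pods for this DAG
) as dag:
    ...

# Installation-wide caps via airflow.cfg or environment variables, e.g.:
#   AIRFLOW__CORE__PARALLELISM=32
#   AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG=16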

Optional: Using Spark for Parallel Transfer + Transform

If transformation is required, substitute the transfer container with a Spark-on-K8s or EMR on EKS step and write Parquet to s3://…/curated/, registering schemas in AWS Glue. Airflow still controls orchestration and retries.
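
A hedged sketch of the EMR on EKS variant is shown below, using the amazon provider's EmrContainerOperator; the virtual cluster id, execution role ARN, and S3 paths are placeholders, and the operator's module path and arguments depend on the provider version.

# dags/curate_on_emr_eks_sketch.py (illustrative)
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrContainerOperator

with DAG(
    dag_id="curate_on_emr_eks_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    curate = EmrContainerOperator(
        task_id="spark_curate",
        name="curate-parquet",
        virtual_cluster_id="abc123virtualcluster",  # placeholder EMR on EKS virtual cluster id
        execution_role_arn="arn:aws:iam::111122223333:role/emr-eks-job-role",  # placeholder
        release_label="emr-6.15.0-latest",
        job_driver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://my-code-bucket/jobs/curate.py",  # placeholder PySpark job
                "sparkSubmitParameters": "--conf spark.executor.instances=4",
            }
        },
        aws_conn_id="aws_default",
    )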


What This Benchmark Shows

  • A practical ceiling for per-pod throughput and cluster aggregate throughput for S3 transfers in your environment.
  • The trade-off between bigger pods versus more pods.
  • The effect of node families and storage class on sustained transfer rates.
