Orchestrating Spark on AWS EMR from Apache Airflow — The Low-Ops Way
Running Apache Spark at scale is a solved problem — but how you run it can have huge implications for cost, stability, and operational overhead. If your data platform already uses Apache Airflow for orchestration, you might be tempted to run Spark directly on Kubernetes pods. That works — but it also means managing Spark runtime images, shuffle services, JVM tuning, and resource isolation inside Kubernetes.
There’s a simpler route: keep Airflow as the conductor, and let AWS EMR handle the Spark heavy lifting.
In this post, we’ll look at why this approach works so well, and walk through a minimal Airflow 2.x configuration to launch Spark jobs on EMR — with zero Kubernetes complexity.
Why EMR + Airflow Works
With EMR, you get a managed Spark runtime that already knows how to talk to S3, integrate with Glue Data Catalog, and scale up or down on demand. Airflow’s job becomes orchestration: scheduling, dependencies, retries, and monitoring job states.
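To make that integration concrete: on EMR, Spark can read S3 and Glue Data Catalog tables without any connector or credential wiring in your code. A minimal sketch (bucket, database, and table names are hypothetical, and it assumes the application is configured to use Glue as its metastore):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GlueAndS3Read").getOrCreate()

# Read directly from S3: the EMR runtime ships with the S3 connector,
# so no extra jars or credentials need to be configured in the job.
events = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path

# Query a Glue Data Catalog table like an ordinary Hive table
# (assumes Glue is set as the metastore for the application/cluster).
customers = spark.sql("SELECT customer_id, segment FROM analytics_db.customers")

events.join(customers, "customer_id").show()
```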
Key benefits:
- No cluster babysitting: EMR handles runtime tuning, Hadoop/Spark upgrades, and node scaling.
- Tight AWS integration: S3, Glue, KMS, and CloudWatch logging out-of-the-box.
- Elastic cost model: Use EMR Serverless to pay only for job runtime, or transient EMR clusters for one-off heavy workloads.
- Stable execution environment: No mismatches between local dev Spark and k8s pod Spark images.
Choosing the Right EMR Pattern
There are two main ways to pair Airflow with EMR:
| Pattern | When to Use | Benefits |
|---|---|---|
| EMR Serverless | Most batch ETL jobs that read/write from S3 | No cluster to manage, per-second billing |
| Transient EMR Cluster | Jobs requiring custom instance types or heavy HDFS usage | Full Spark control, cluster spins up on-demand |
We’ll focus on EMR Serverless for simplicity, but the transient cluster pattern is just as easy to add later.
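Before Airflow can submit anything, you need an EMR Serverless application to target. A minimal one-time setup sketch with boto3 (the application name, region, and release label are placeholders, and the same thing can be done from the console):

```python
import boto3

# One-time setup: create an EMR Serverless Spark application and note its ID.
client = boto3.client("emr-serverless", region_name="us-east-1")  # placeholder region

response = client.create_application(
    name="airflow-spark-app",   # placeholder name
    releaseLabel="emr-7.1.0",   # placeholder: use a current EMR release
    type="SPARK",
)
print(response["applicationId"])  # plug this in as APPLICATION_ID below
```

Creating the application once and reusing it across DAG runs is the usual pattern; with default settings you only pay while jobs are running on it.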
Minimal Airflow DAG for EMR Serverless
Here’s the shortest path from Airflow to EMR Serverless Spark execution.
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
from airflow.providers.amazon.aws.sensors.emr import EmrServerlessJobSensor

APPLICATION_ID = "YOUR-APP-ID"  # EMR Serverless Spark application
S3_SCRIPT = "s3://my-bucket/code/job.py"
S3_LOGS = "s3://my-bucket/logs/"

with DAG(
    dag_id="spark_on_emr_serverless",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Submit the PySpark script to the EMR Serverless application.
    start_job = EmrServerlessStartJobOperator(
        task_id="start_spark",
        application_id=APPLICATION_ID,
        execution_role_arn="arn:aws:iam::123456789012:role/EmrJobRole",
        job_driver={
            "sparkSubmit": {
                "entryPoint": S3_SCRIPT,
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=4 "
                    "--conf spark.executor.memory=4g "
                    "--conf spark.executor.cores=2 "
                ),
            }
        },
        configuration_overrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": S3_LOGS}
            }
        },
        aws_conn_id="aws_default",
        wait_for_completion=False,
    )

    # Poll the job run (its ID comes from the start task's XCom) until it finishes.
    wait_job = EmrServerlessJobSensor(
        task_id="wait_spark",
        application_id=APPLICATION_ID,
        job_run_id=start_job.output,
        aws_conn_id="aws_default",
        poke_interval=30,
        timeout=3600,
    )

    start_job >> wait_job
```
What’s happening here:
- `EmrServerlessStartJobOperator`: Submits your PySpark script to EMR Serverless with runtime parameters.
- `EmrServerlessJobSensor`: Polls until the job run finishes, so you can trigger downstream tasks.
- No cluster YAML, no JVM image maintenance: you only manage your Spark code.
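If you'd rather keep the whole thing to a single task, the start operator can also block until the job run finishes. A sketch of that variant, reusing the same application, role, and script as above:

```python
# Drop-in replacement for start_spark + wait_spark, inside the same DAG block.
run_spark = EmrServerlessStartJobOperator(
    task_id="run_spark_blocking",
    application_id=APPLICATION_ID,
    execution_role_arn="arn:aws:iam::123456789012:role/EmrJobRole",
    job_driver={"sparkSubmit": {"entryPoint": S3_SCRIPT}},
    wait_for_completion=True,  # the operator itself polls until the run finishes
)
```

The trade-off is that the task occupies a worker slot while it waits, so for long-running jobs the operator-plus-sensor split above (or a deferrable operator, if your provider version supports it) is usually the better choice.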
Spark Job Example
A simple PySpark script (`job.py`) stored in S3:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleJob").getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.filter(df.age > 30).show()

spark.stop()
```
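In a real pipeline you would usually persist the result instead of just calling `show()`. A sketch of that write step, with a hypothetical output prefix (it replaces the `show()` call above, before `spark.stop()`):

```python
# Persist the filtered rows to S3 as Parquet so downstream tasks
# (or Athena/Glue) can pick them up from this prefix.
(
    df.filter(df.age > 30)
      .write.mode("overwrite")
      .parquet("s3://my-bucket/output/adults/")  # hypothetical output location
)
```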
Comparing to Spark on Kubernetes
| Factor | EMR from Airflow | Spark on Kubernetes |
|---|---|---|
| Setup time | Minutes | Hours–Days |
| Maintenance | AWS handles runtime | You handle runtime |
| Scaling | Automatic | You configure autoscalers |
| AWS integration | Native | Needs IAM/token plumbing |
| Portability | AWS-only | Any Kubernetes cluster |
Final Thoughts
If your team’s priority is delivering data pipelines, not running infrastructure, then running Spark via EMR from Airflow is a clear win. You can always migrate to Kubernetes later if multi-cloud or custom runtimes become a priority — but EMR will get you to production faster, with fewer moving parts.
Pro tip: Start with EMR Serverless for quick wins. When a job’s performance or runtime cost becomes a concern, consider transient EMR clusters with tuned instance groups.
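And when you do reach for that extra control, the transient-cluster pattern uses the classic EMR operators from the same Amazon provider. A sketch, with placeholder instance types, release label, and IAM roles (it goes inside a `with DAG(...)` block like the one above):

```python
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    "Name": "transient-spark-cluster",
    "ReleaseLabel": "emr-7.1.0",          # placeholder release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # terminated explicitly below
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",     # placeholder instance profile
    "ServiceRole": "EMR_DefaultRole",         # placeholder service role
    "LogUri": S3_LOGS,
}

SPARK_STEPS = [
    {
        "Name": "run_job_py",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", S3_SCRIPT],
        },
    }
]

create_cluster = EmrCreateJobFlowOperator(
    task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES
)
add_steps = EmrAddStepsOperator(
    task_id="add_steps", job_flow_id=create_cluster.output, steps=SPARK_STEPS
)
wait_step = EmrStepSensor(
    task_id="wait_step",
    job_flow_id=create_cluster.output,
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
)
terminate_cluster = EmrTerminateJobFlowOperator(
    task_id="terminate_cluster",
    job_flow_id=create_cluster.output,
    trigger_rule="all_done",  # always clean up the cluster, even on failure
)

create_cluster >> add_steps >> wait_step >> terminate_cluster
```

Same orchestration shape as the Serverless DAG, just with explicit create and terminate tasks bracketing the Spark work.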