Orchestrating Spark on AWS EMR from Apache Airflow — The Low-Ops Way
Running Apache Spark at scale is a solved problem — but how you run it can have huge implications for cost, stability, and operational overhead. If your data platform already uses Apache Airflow for orchestration, you might be tempted to run Spark directly on Kubernetes pods. That works — but it also means managing Spark runtime images, shuffle services, JVM tuning, and resource isolation inside Kubernetes.
There’s a simpler route: keep Airflow as the conductor, and let AWS EMR handle the Spark heavy lifting.
In this post, we’ll look at why this approach works so well, and walk through a minimal Airflow 2.x configuration to launch Spark jobs on EMR — with zero Kubernetes complexity.
Why EMR + Airflow Works
With EMR, you get a managed Spark runtime that already knows how to talk to S3, integrate with Glue Data Catalog, and scale up or down on demand. Airflow’s job becomes orchestration: scheduling, dependencies, retries, and monitoring job states.
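To make that integration concrete: on EMR, Spark can read S3 and Glue Data Catalog tables without any connector or credential wiring in your code. A minimal sketch (bucket, database, and table names are hypothetical, and it assumes the application is configured to use Glue as its metastore):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GlueAndS3Read").getOrCreate()

# Read directly from S3: the EMR runtime ships with the S3 connector,
# so no extra jars or credentials need to be configured in the job.
events = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical path

# Query a Glue Data Catalog table like an ordinary Hive table
# (assumes Glue is set as the metastore for the application/cluster).
customers = spark.sql("SELECT customer_id, segment FROM analytics_db.customers")

events.join(customers, "customer_id").show()
```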
Key benefits:
- No cluster babysitting: EMR handles runtime tuning, Hadoop/Spark upgrades, and node scaling.
- Tight AWS integration: S3, Glue, KMS, and CloudWatch logging out-of-the-box.
- Elastic cost model: Use EMR Serverless to pay only for job runtime, or transient EMR clusters for one-off heavy workloads.
- Stable execution environment: No mismatches between local dev Spark and k8s pod Spark images.
Choosing the Right EMR Pattern
There are two main ways to pair Airflow with EMR:
| Pattern | When to Use | Benefits |
|---|---|---|
| EMR Serverless | Most batch ETL jobs that read/write from S3 | No cluster to manage, per-second billing |
| Transient EMR Cluster | Jobs requiring custom instance types or heavy HDFS usage | Full Spark control, cluster spins up on-demand |
We’ll focus on EMR Serverless for simplicity, but the transient cluster pattern is just as easy to add later.
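Before Airflow can submit anything, you need an EMR Serverless application to target. A minimal one-time setup sketch with boto3 (the application name, region, and release label are placeholders, and the same thing can be done from the console):

```python
import boto3

# One-time setup: create an EMR Serverless Spark application and note its ID.
client = boto3.client("emr-serverless", region_name="us-east-1")  # placeholder region

response = client.create_application(
    name="airflow-spark-app",   # placeholder name
    releaseLabel="emr-7.1.0",   # placeholder: use a current EMR release
    type="SPARK",
)
print(response["applicationId"])  # plug this in as APPLICATION_ID below
```

Creating the application once and reusing it across DAG runs is the usual pattern; with default settings you only pay while jobs are running on it.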
Minimal Airflow DAG for EMR Serverless
Here’s the shortest path from Airflow to EMR Serverless Spark execution.
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator
from airflow.providers.amazon.aws.sensors.emr import EmrServerlessJobSensor

APPLICATION_ID = "YOUR-APP-ID"  # EMR Serverless Spark application
S3_SCRIPT = "s3://my-bucket/code/job.py"
S3_LOGS = "s3://my-bucket/logs/"

with DAG(
    dag_id="spark_on_emr_serverless",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Submit the PySpark script to the EMR Serverless application.
    start_job = EmrServerlessStartJobOperator(
        task_id="start_spark",
        application_id=APPLICATION_ID,
        execution_role_arn="arn:aws:iam::123456789012:role/EmrJobRole",
        job_driver={
            "sparkSubmit": {
                "entryPoint": S3_SCRIPT,
                "sparkSubmitParameters": (
                    "--conf spark.executor.instances=4 "
                    "--conf spark.executor.memory=4g "
                    "--conf spark.executor.cores=2 "
                ),
            }
        },
        configuration_overrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": S3_LOGS}
            }
        },
        aws_conn_id="aws_default",
        wait_for_completion=False,
    )

    # Poll the job run (its ID comes from the start task's XCom) until it finishes.
    wait_job = EmrServerlessJobSensor(
        task_id="wait_spark",
        application_id=APPLICATION_ID,
        job_run_id=start_job.output,
        aws_conn_id="aws_default",
        poke_interval=30,
        timeout=3600,
    )

    start_job >> wait_job
```
What’s happening here:
- `EmrServerlessStartJobOperator`: Submits your PySpark script to EMR Serverless with runtime parameters.
- `EmrServerlessJobSensor`: Polls until the job run finishes, so you can trigger downstream tasks.
- No cluster YAML, no JVM image maintenance: you only manage your Spark code.
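If you'd rather keep the whole thing to a single task, the start operator can also block until the job run finishes. A sketch of that variant, reusing the same application, role, and script as above:

```python
# Drop-in replacement for start_spark + wait_spark, inside the same DAG block.
run_spark = EmrServerlessStartJobOperator(
    task_id="run_spark_blocking",
    application_id=APPLICATION_ID,
    execution_role_arn="arn:aws:iam::123456789012:role/EmrJobRole",
    job_driver={"sparkSubmit": {"entryPoint": S3_SCRIPT}},
    wait_for_completion=True,  # the operator itself polls until the run finishes
)
```

The trade-off is that the task occupies a worker slot while it waits, so for long-running jobs the operator-plus-sensor split above (or a deferrable operator, if your provider version supports it) is usually the better choice.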
Spark Job Example
A simple PySpark script (`job.py`) stored in S3:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExampleJob").getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["name", "age"])
df.filter(df.age > 30).show()

spark.stop()
```
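In a real pipeline you would usually persist the result instead of just calling `show()`. A sketch of that write step, with a hypothetical output prefix (it replaces the `show()` call above, before `spark.stop()`):

```python
# Persist the filtered rows to S3 as Parquet so downstream tasks
# (or Athena/Glue) can pick them up from this prefix.
(
    df.filter(df.age > 30)
      .write.mode("overwrite")
      .parquet("s3://my-bucket/output/adults/")  # hypothetical output location
)
```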
Comparing to Spark on Kubernetes
| Factor | EMR from Airflow | Spark on Kubernetes |
|---|---|---|
| Setup time | Minutes | Hours–Days |
| Maintenance | AWS handles runtime | You handle runtime |
| Scaling | Automatic | You configure autoscalers |
| AWS integration | Native | Needs IAM/token plumbing |
| Portability | AWS-only | Any Kubernetes cluster |
Final Thoughts
If your team’s priority is delivering data pipelines, not running infrastructure, then running Spark via EMR from Airflow is a clear win. You can always migrate to Kubernetes later if multi-cloud or custom runtimes become a priority — but EMR will get you to production faster, with fewer moving parts.
Pro tip: Start with EMR Serverless for quick wins. When a job’s performance or runtime cost becomes a concern, consider transient EMR clusters with tuned instance groups.
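And when you do reach for that extra control, the transient-cluster pattern uses the classic EMR operators from the same Amazon provider. A sketch, with placeholder instance types, release label, and IAM roles (it goes inside a `with DAG(...)` block like the one above):

```python
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

JOB_FLOW_OVERRIDES = {
    "Name": "transient-spark-cluster",
    "ReleaseLabel": "emr-7.1.0",          # placeholder release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # terminated explicitly below
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",     # placeholder instance profile
    "ServiceRole": "EMR_DefaultRole",         # placeholder service role
    "LogUri": S3_LOGS,
}

SPARK_STEPS = [
    {
        "Name": "run_job_py",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", S3_SCRIPT],
        },
    }
]

create_cluster = EmrCreateJobFlowOperator(
    task_id="create_cluster", job_flow_overrides=JOB_FLOW_OVERRIDES
)
add_steps = EmrAddStepsOperator(
    task_id="add_steps", job_flow_id=create_cluster.output, steps=SPARK_STEPS
)
wait_step = EmrStepSensor(
    task_id="wait_step",
    job_flow_id=create_cluster.output,
    step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
)
terminate_cluster = EmrTerminateJobFlowOperator(
    task_id="terminate_cluster",
    job_flow_id=create_cluster.output,
    trigger_rule="all_done",  # always clean up the cluster, even on failure
)

create_cluster >> add_steps >> wait_step >> terminate_cluster
```

Same orchestration shape as the Serverless DAG, just with explicit create and terminate tasks bracketing the Spark work.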