ZhangZhihui's Blog  

The SparkSubmitOperator in Apache Airflow is used to submit Spark applications (via spark-submit) from an Airflow DAG. Below is a clear, practical guide from setup to a working example.


1. When to use SparkSubmitOperator

Use SparkSubmitOperator if:

  • You want Airflow to orchestrate Spark jobs

  • You submit Spark jobs using spark-submit

  • Spark runs on:

    • Standalone cluster

    • YARN

    • Kubernetes

    • Mesos

If you use Databricks, EMR, or Kubernetes-native Spark, there are specialized operators for those instead (see section 11).


2. Prerequisites

A. Install PySpark and the provider package

(airflow-venv) frank@ZZHPC:~/download$ pip install pyspark==3.5.7

 

(airflow-venv) frank@ZZHPC:~/download$ pip install --upgrade pip setuptools wheel

 

(airflow-venv) frank@ZZHPC:~/download$ pip install apache-airflow-providers-apache-spark

 

B. Spark installed & accessible

spark-submit must be available on the machine where the Airflow task runs (the Airflow worker). With the pip-installed pyspark above, it is provided inside the virtual environment.

Verify:

spark-submit --version
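
Since Spark is installed into the virtual environment with pip here, it can also help to confirm that the environment Airflow runs in sees the expected PySpark version. A quick check (assuming the pip install above):

import pyspark

# Should match what spark-submit --version reports (3.5.7 in this setup).
print(pyspark.__version__)
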

3. Airflow Connection (Optional but recommended)

Create a Spark connection in Airflow UI:

  • Conn Id: spark_default

  • Conn Type: Spark

  • Host: Spark master URL
    Examples:

    • spark://spark-master:7077

    • yarn

    • k8s://https://kubernetes.default.svc

4. Minimal SparkSubmitOperator Example

from datetime import datetime

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.sdk import DAG


with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    schedule=None
) as dag:

    spark_task = SparkSubmitOperator(
        task_id="run_spark_app",
        application="/home/frank/zspark.py",
        conn_id="spark_default",
        verbose=True,
    )

 

The Spark standalone master and worker must be started from a shell with the (airflow-venv) environment activated, so they use the same Python and PySpark installation as Airflow.


5. Example Spark Application (zspark.py)

zspark.py:
from pyspark.sql import SparkSession
import sys


def main():
    # Create Spark session
    spark = (
        SparkSession.builder
        .appName("AirflowSparkSubmitExample")
        .getOrCreate()
    )

    # Example data
    data = [
        ("Alice", 100),
        ("Bob", 200),
        ("Charlie", 300),
    ]

    df = spark.createDataFrame(data, ["name", "amount"])

    print("=== Input Data ===")
    df.show()

    # Simple transformation
    result_df = df.withColumn("amount_plus_10", df.amount + 10)

    print("=== Transformed Data ===")
    result_df.show()

    # Optional: write output (example)
    # result_df.write.mode("overwrite").parquet("/tmp/spark_output")

    spark.stop()


if __name__ == "__main__":
    main()

 


6. Commonly Used Parameters

SparkSubmitOperator(
    task_id="spark_job",
    application="/path/to/app.py",
    conn_id="spark_default",

    # Spark configs
    conf={
        "spark.executor.memory": "2g",
        "spark.executor.cores": "2",
    },

    # Application arguments
    application_args=["--date", "{{ ds }}"],

    # Cluster settings
    # (the master URL comes from the Spark connection's Host, e.g. yarn;
    #  SparkSubmitOperator itself has no master argument)
    deploy_mode="cluster",

    # Resource configs
    num_executors=4,
    executor_memory="2g",
    executor_cores=2,
    driver_memory="1g",

    # Other
    name="airflow_spark_job",
    verbose=True,
)

 


7. Passing Airflow Variables & Macros

application_args=[
    "{{ ds }}",
    "{{ var.value.input_path }}",
]

In Spark:

import sys
run_date = sys.argv[1]
input_path = sys.argv[2]
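
If the job takes named flags instead of positional arguments (as with the --date example in section 6), argparse keeps the Spark side readable. A minimal sketch, with illustrative flag names:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--date", required=True)        # rendered from "{{ ds }}"
parser.add_argument("--input-path", required=True)  # e.g. rendered from an Airflow Variable
args = parser.parse_args()

run_date = args.date
input_path = args.input_path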

 


8. Running on Different Environments

The target cluster is selected by the Host of the Spark connection referenced by conn_id (SparkSubmitOperator itself has no master argument), plus a few per-task settings.

A. Spark Standalone

Connection Host: spark://spark-master:7077

B. YARN

Connection Host: yarn

deploy_mode="cluster"

C. Kubernetes

Connection Host: k8s://https://kubernetes.default.svc

conf={
    "spark.kubernetes.container.image": "spark:3.5.0",
}

 


9. Debugging Tips

  • Check Airflow task logs (spark-submit output is captured)

  • Use verbose=True

  • Confirm Spark paths are correct

  • Ensure Airflow worker has access to the application file
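
One way to catch the "File not found" case early is a small pre-check task that runs on the Airflow worker before the Spark submission. A sketch, not part of the original DAG; the task name and chaining are illustrative, and the @task decorator is assumed to be importable from airflow.sdk like DAG above:

import os

from airflow.sdk import task


@task
def check_application_exists(path: str = "/home/frank/zspark.py") -> str:
    # Fail fast if spark-submit would not find the application on this worker.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Spark application not found on this worker: {path}")
    return path

# Inside the DAG body, run it before the Spark task:
# check_application_exists() >> spark_task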


10. Common Errors

Error                                Cause
spark-submit not found               Spark not installed on the worker
Connection spark_default not found   Missing Airflow connection
Job stuck in RUNNING                 Cluster resource issue
File not found                       Application path not accessible to the worker

11. When NOT to use SparkSubmitOperator

Avoid SparkSubmitOperator if:

    • Using Databricks → use DatabricksSubmitRunOperator

    • Using EMR → use EmrAddStepsOperator

    • Using Kubernetes-native Spark → consider KubernetesPodOperator

 

    spark_task = SparkSubmitOperator(
        task_id="run_spark_app",
        application="/home/frank/zspark.py",
        conn_id="spark_default",
        verbose=True,
    )

For the above code, must /home/frank/zspark.py be present on a Spark worker node?

 

Short answer: not necessarily — it depends on how your Spark cluster is set up and how spark-submit is executed.

How SparkSubmitOperator works

SparkSubmitOperator runs spark-submit from the machine where the Airflow task is executed (the Airflow worker).

So the path:

application="/home/frank/zspark.py"

must be accessible on the Airflow worker, not automatically on Spark workers.

What happens after submission

Once spark-submit is launched:

  • Client mode (default in many setups):

    • The driver runs on the Airflow worker

    • Spark automatically ships the application file to executors

    • ❌ The file does not need to exist on Spark worker nodes

  • Cluster mode:

    • The driver runs on a Spark worker / cluster node

    • The application must be:

      • On a shared filesystem, or

      • Distributed via Spark (e.g., HDFS, S3, or --files)

    • ❌ A local path like /home/frank/zspark.py that exists only on the Airflow worker will fail

Typical safe options

To avoid ambiguity:

  1. Put the application on a shared filesystem

    application="hdfs:///apps/zspark.py"
    # or s3://, gs://, NFS, etc.
  2. Or ensure identical paths on all nodes

    /home/frank/zspark.py
  3. Explicitly control deploy mode

    SparkSubmitOperator(
        ...
        application="/home/frank/zspark.py",
        deploy_mode="client",  # safer with local paths
    )

Summary table

Deploy mode   Where the application must exist
client        Airflow worker only
cluster       Shared storage or a Spark-accessible URI

Recommendation

If you are running Airflow → Spark in production, use HDFS/S3/GS paths for the application. This avoids node-specific path issues entirely.
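
Putting this together, a cluster-mode submission with the application on shared storage might look like the sketch below (the HDFS path is illustrative):

spark_task = SparkSubmitOperator(
    task_id="run_spark_app",
    application="hdfs:///apps/zspark.py",  # path visible to every node in the cluster
    conn_id="spark_default",
    deploy_mode="cluster",
    verbose=True,
)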

 

posted on 2025-12-16 21:55 by ZhangZhihuiAAA