ZhangZhihui's Blog  

The SparkSubmitOperator in Apache Airflow is used to submit Spark applications (via spark-submit) from an Airflow DAG. Below is a clear, practical guide from setup to a working example.


1. When to use SparkSubmitOperator

Use SparkSubmitOperator if:

  • You want Airflow to orchestrate Spark jobs

  • You submit Spark jobs using spark-submit

  • Spark runs on:

    • Standalone cluster

    • YARN

    • Kubernetes

    • Mesos

If you use Databricks, EMR, or Kubernetes-native Spark, there are specialized operators for those instead (see section 11).


2. Prerequisites

A. Install PySpark and the provider package

(airflow-venv) frank@ZZHPC:~/download$ pip install pyspark==3.5.7

 

(airflow-venv) frank@ZZHPC:~/download$ pip install --upgrade pip setuptools wheel

 

(airflow-venv) frank@ZZHPC:~/download$ pip install apache-airflow-providers-apache-spark

 

B. Spark installed & accessible

spark-submit must be available on the machine where the Airflow task runs (the Airflow worker). With the pip-installed pyspark above, it is provided inside the virtual environment.

Verify:

spark-submit --version
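
Since Spark is installed into the virtual environment with pip here, it can also help to confirm that the environment Airflow runs in sees the expected PySpark version. A quick check (assuming the pip install above):

import pyspark

# Should match what spark-submit --version reports (3.5.7 in this setup).
print(pyspark.__version__)
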

3. Airflow Connection (Optional but recommended)

Create a Spark connection in Airflow UI:

  • Conn Id: spark_default

  • Conn Type: Spark

  • Host: Spark master URL
    Examples:

    • spark://spark-master:7077

    • yarn

    • k8s://https://kubernetes.default.svc

4. Minimal SparkSubmitOperator Example

from datetime import datetime

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.sdk import DAG


with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    schedule=None
) as dag:

    spark_task = SparkSubmitOperator(
        task_id="run_spark_app",
        application="/home/frank/zspark.py",
        conn_id="spark_default",
        verbose=True,
    )

 

The Spark standalone master and worker must be started from a shell with the (airflow-venv) environment activated, so they use the same Python and PySpark installation as Airflow.


5. Example Spark Application (zspark.py)

zspark.py:
from pyspark.sql import SparkSession
import sys


def main():
    # Create Spark session
    spark = (
        SparkSession.builder
        .appName("AirflowSparkSubmitExample")
        .getOrCreate()
    )

    # Example data
    data = [
        ("Alice", 100),
        ("Bob", 200),
        ("Charlie", 300),
    ]

    df = spark.createDataFrame(data, ["name", "amount"])

    print("=== Input Data ===")
    df.show()

    # Simple transformation
    result_df = df.withColumn("amount_plus_10", df.amount + 10)

    print("=== Transformed Data ===")
    result_df.show()

    # Optional: write output (example)
    # result_df.write.mode("overwrite").parquet("/tmp/spark_output")

    spark.stop()


if __name__ == "__main__":
    main()

 


6. Commonly Used Parameters

SparkSubmitOperator(
    task_id="spark_job",
    application="/path/to/app.py",
    conn_id="spark_default",

    # Spark configs
    conf={
        "spark.executor.memory": "2g",
        "spark.executor.cores": "2",
    },

    # Application arguments
    application_args=["--date", "{{ ds }}"],

    # Cluster settings
    # (the master URL comes from the Spark connection's Host, e.g. yarn;
    #  SparkSubmitOperator itself has no master argument)
    deploy_mode="cluster",

    # Resource configs
    num_executors=4,
    executor_memory="2g",
    executor_cores=2,
    driver_memory="1g",

    # Other
    name="airflow_spark_job",
    verbose=True,
)

 


7. Passing Airflow Variables & Macros

application_args=[
    "{{ ds }}",
    "{{ var.value.input_path }}",
]

In Spark:

import sys
run_date = sys.argv[1]
input_path = sys.argv[2]
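
If the job takes named flags instead of positional arguments (as with the --date example in section 6), argparse keeps the Spark side readable. A minimal sketch, with illustrative flag names:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--date", required=True)        # rendered from "{{ ds }}"
parser.add_argument("--input-path", required=True)  # e.g. rendered from an Airflow Variable
args = parser.parse_args()

run_date = args.date
input_path = args.input_path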

 


8. Running on Different Environments

The target cluster is selected by the Host of the Spark connection referenced by conn_id (SparkSubmitOperator itself has no master argument), plus a few per-task settings.

A. Spark Standalone

Connection Host: spark://spark-master:7077

B. YARN

Connection Host: yarn

deploy_mode="cluster"

C. Kubernetes

Connection Host: k8s://https://kubernetes.default.svc

conf={
    "spark.kubernetes.container.image": "spark:3.5.0",
}

 


9. Debugging Tips

  • Check Airflow task logs (spark-submit output is captured)

  • Use verbose=True

  • Confirm Spark paths are correct

  • Ensure Airflow worker has access to the application file
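
One way to catch the "File not found" case early is a small pre-check task that runs on the Airflow worker before the Spark submission. A sketch, not part of the original DAG; the task name and chaining are illustrative, and the @task decorator is assumed to be importable from airflow.sdk like DAG above:

import os

from airflow.sdk import task


@task
def check_application_exists(path: str = "/home/frank/zspark.py") -> str:
    # Fail fast if spark-submit would not find the application on this worker.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Spark application not found on this worker: {path}")
    return path

# Inside the DAG body, run it before the Spark task:
# check_application_exists() >> spark_task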


10. Common Errors

Error                                Cause
spark-submit not found               Spark not installed on the worker
Connection spark_default not found   Missing Airflow connection
Job stuck in RUNNING                 Cluster resource issue
File not found                       Application path not accessible to the worker

11. When NOT to use SparkSubmitOperator

Avoid SparkSubmitOperator if:

    • Using Databricks → use DatabricksSubmitRunOperator

    • Using EMR → use EmrAddStepsOperator

    • Using Kubernetes-native Spark → consider KubernetesPodOperator

 

    spark_task = SparkSubmitOperator(
        task_id="run_spark_app",
        application="/home/frank/zspark.py",
        conn_id="spark_default",
        verbose=True,
    )

For the above code, must /home/frank/zspark.py be present on a Spark worker node?

 

Short answer: not necessarily — it depends on how your Spark cluster is set up and how spark-submit is executed.

How SparkSubmitOperator works

SparkSubmitOperator runs spark-submit from the machine where the Airflow task is executed (the Airflow worker).

So the path:

application="/home/frank/zspark.py"

must be accessible on the Airflow worker, not automatically on Spark workers.

What happens after submission

Once spark-submit is launched:

  • Client mode (default in many setups):

    • The driver runs on the Airflow worker

    • Spark automatically ships the application file to executors

    • ❌ The file does not need to exist on Spark worker nodes

  • Cluster mode:

    • The driver runs on a Spark worker / cluster node

    • The application must be:

      • On a shared filesystem, or

      • Distributed via Spark (e.g., HDFS, S3, or --files)

    • ❌ A local path like /home/frank/zspark.py that exists only on the Airflow worker will fail

Typical safe options

To avoid ambiguity:

  1. Put the application on a shared filesystem

    application="hdfs:///apps/zspark.py"
    # or s3://, gs://, NFS, etc.
  2. Or ensure identical paths on all nodes

    /home/frank/zspark.py
  3. Explicitly control deploy mode

    SparkSubmitOperator(
        ...
        application="/home/frank/zspark.py",
        deploy_mode="client",  # safer with local paths
    )

Summary table

Deploy mode   Where the application must exist
client        Airflow worker only
cluster       Shared storage or a Spark-accessible URI

Recommendation

If you are running Airflow → Spark in production, use HDFS/S3/GS paths for the application. This avoids node-specific path issues entirely.
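
Putting this together, a cluster-mode submission with the application on shared storage might look like the sketch below (the HDFS path is illustrative):

spark_task = SparkSubmitOperator(
    task_id="run_spark_app",
    application="hdfs:///apps/zspark.py",  # path visible to every node in the cluster
    conn_id="spark_default",
    deploy_mode="cluster",
    verbose=True,
)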

 

posted on 2025-12-16 21:55 by ZhangZhihuiAAA