ZhangZhihui's Blog  

 

from __future__ import annotations

import datetime

from airflow.providers.standard.operators.empty import EmptyOperator
from airflow.providers.standard.operators.latest_only import LatestOnlyOperator
from airflow.sdk import DAG


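with DAG(
    dag_id='latest_only',
    start_date=datetime.datetime(2021, 1, 1),
    catchup=False,  # do not create runs for intervals missed since start_date
    schedule=datetime.timedelta(hours=4),  # one scheduled run every 4 hours
    tags=['example2', 'example3'],
) as dag:
    # Gate task: its directly downstream tasks run only in the latest scheduled run
    latest_only = LatestOnlyOperator(task_id='latest_only')
    task1 = EmptyOperator(task_id='task1')

    # task1 is skipped whenever latest_only decides the run is not the latest
    latest_only >> task1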

 

LatestOnlyOperator is a control-flow operator in Airflow that lets its directly downstream tasks run only when the DAG run is executing within the most recent schedule interval, and automatically skips them for older runs.

It’s mainly used in scheduled DAGs where you don’t want historical or overlapping runs to execute certain tasks.


What LatestOnlyOperator does

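For each DAG run:

  • If the run belongs to the most recent schedule interval at the time the operator executes →
    downstream tasks are allowed to run

  • If the run is not the latest (a newer interval has already begun) →
    downstream tasks are marked skipped

  • If the run was triggered manually →
    downstream tasks are never skipped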

The operator itself succeeds in all cases; it only controls downstream behavior.
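
The decision can be pictured as a window check against the current time. The sketch below is a simplified stand-alone model, not Airflow's actual implementation. The function name allows_downstream is hypothetical, and the code assumes fixed-length intervals where a run with a given logical date covers the interval starting at that date and is expected to execute during the interval that follows. Other timetables may place the window differently.

```python
from datetime import datetime, timedelta, timezone


def allows_downstream(logical_date, interval, now, manual=False):
    """Simplified model: should tasks downstream of the gate run?"""
    if manual:
        # Manually triggered runs are never skipped.
        return True
    # Assumption: the run covers [logical_date, logical_date + interval)
    # and is expected to execute during the interval that follows it.
    left = logical_date + interval
    right = left + interval
    return left < now <= right


interval = timedelta(hours=4)
run_a = datetime(2021, 1, 1, 0, 0, tzinfo=timezone.utc)

on_time = datetime(2021, 1, 1, 4, 5, tzinfo=timezone.utc)
late = datetime(2021, 1, 1, 8, 5, tzinfo=timezone.utc)

print(allows_downstream(run_a, interval, on_time))            # True
print(allows_downstream(run_a, interval, late))               # False
print(allows_downstream(run_a, interval, late, manual=True))  # True
```

The same run is "latest" when its gate executes on time and "not latest" when it executes one interval too late, which is why delayed runs lose their downstream tasks.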


How this applies to the DAG above

Execution behavior

Scenario                                                            Result
Only one scheduled run exists, running on time                      task1 runs
Newer run starts after the older run's latest_only has finished     task1 runs in both runs
Older run's latest_only executes after a newer interval has begun   task1 in the older run is skipped
Manually triggered run                                              task1 always runs (never skipped)

Timeline example (4-hour schedule)

00:00  Run A (older)
04:00  Run B (newer)

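Behavior, assuming Run A is delayed and its latest_only task executes only after Run B's interval has become the most recent one:

Run     latest_only   task1
Run A   success       skipped
Run B   success       success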

Why this operator exists

It solves problems like:

  • Preventing duplicate notifications

  • Avoiding expensive tasks on outdated data

  • Ensuring only fresh data is published

  • Handling long-running DAGs with overlapping runs

Example use cases:

  • Send Slack/email alerts

  • Publish reports

  • Trigger external systems

  • Run data exports


Important characteristics

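1. Schedule-window based

“Latest” is decided by comparing the current time with the run's own schedule interval window when the operator executes. The operator does not inspect which other runs exist or are still active.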

2. Meaningful only for scheduled DAGs

  • It is mostly meaningless for unscheduled DAGs

  • Manually triggered runs still work: their downstream tasks are never skipped

3. Skip propagation

Directly downstream tasks are skipped, not failed, and tasks further downstream inherit the skip under the default trigger rule, so:

  • No retries

  • No alerts (unless configured for skips)
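
How a skip cascades depends on each task's trigger rule. The following is a minimal stand-alone model of that behavior, not Airflow code: the helper resolve and the extra tasks task2, task3, and task4 are hypothetical, and only the all_success and all_done rules are modeled.

```python
def resolve(upstream_states, trigger_rule="all_success"):
    """Simplified model of a task's outcome given its upstream task states."""
    if trigger_rule == "all_done":
        # Runs once every upstream task has finished, whatever its state.
        return "success"
    if trigger_rule == "all_success":
        if "skipped" in upstream_states:
            return "skipped"  # the skip cascades
        if all(state == "success" for state in upstream_states):
            return "success"
        return "upstream_failed"
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# An older (non-latest) run with this hypothetical layout:
#   latest_only >> task1 >> [task3, task4]
#   task2 >> [task3, task4]
states = {"latest_only": "success", "task1": "skipped", "task2": "success"}
states["task3"] = resolve([states["task1"], states["task2"]])
states["task4"] = resolve([states["task1"], states["task2"]], "all_done")

print(states["task3"])  # skipped
print(states["task4"])  # success
```

In this model a cleanup or bookkeeping task can opt out of the cascade by using all_done, while everything left on the default rule is skipped along with task1.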


Common pitfalls

❌ Expecting it to stop old runs immediately

It does not cancel tasks that are already running; it only skips its downstream tasks at the moment the operator itself executes.

❌ Using it as a concurrency control

Use these instead:

  • max_active_runs

  • pools

  • task concurrency

❌ Using it with catchup=True

With catchup enabled, almost every backfilled run falls outside the latest interval, so its downstream tasks are skipped. This is often not what you want.
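
To get a feel for the numbers, the sketch below counts how many due runs would keep their downstream tasks if the scheduler caught up on all of them at one moment. It is a simplified model under the same fixed-interval assumption as the 4-hour schedule above; the function count_backfill_outcomes is hypothetical and does not call Airflow.

```python
from datetime import datetime, timedelta, timezone


def count_backfill_outcomes(start, interval, now):
    """Return (allowed, skipped) over all due runs, if each ran its gate at `now`."""
    allowed = skipped = 0
    logical_date = start
    # Assumption: a run becomes due once the interval it covers has ended.
    while logical_date + interval < now:
        left = logical_date + interval
        right = left + interval
        if left < now <= right:
            allowed += 1
        else:
            skipped += 1
        logical_date += interval
    return allowed, skipped


start = datetime(2021, 1, 1, tzinfo=timezone.utc)
now = datetime(2021, 1, 2, 0, 30, tzinfo=timezone.utc)
print(count_backfill_outcomes(start, timedelta(hours=4), now))  # (1, 5)
```

After only one day of missed intervals, five of the six runs would have their gated tasks skipped, and the ratio grows with the length of the backfill.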


LatestOnlyOperator vs alternatives

Goal                           Better option
Only one DAG run at a time     max_active_runs=1
Skip old runs entirely         catchup=False
Run only latest tasks          LatestOnlyOperator
Fine-grained control           ShortCircuitOperator

Typical real-world pattern

latest_only = LatestOnlyOperator(task_id="latest_only")

latest_only >> notify_users
latest_only >> publish_results
latest_only >> trigger_external_system

Older runs silently skip these tasks; the newest run proceeds.


Summary

    • LatestOnlyOperator allows downstream tasks only in the newest DAG run

    • Older runs get downstream tasks skipped

    • Useful for notifications, publishing, side effects

    • Decided per run from its schedule interval and the current time, not by comparing active runs

    • Complements scheduling, not concurrency control

 

posted on 2025-12-16 15:58 by ZhangZhihuiAAA