
In Apache Airflow, an asset is a logical representation of a piece of data that your workflows produce or consume, independent of where that data lives or how it’s computed.

Assets are part of Airflow’s data-aware scheduling model, introduced as Datasets in Airflow 2.4 and renamed to Assets (with expanded capabilities) in Airflow 3.


High-level definition

An Airflow asset represents the existence or freshness of data, not the task that produced it.

An asset answers:

  • “This dataset now exists or has been updated”

  • not “This task finished”


Why assets exist (problem they solve)

Traditional Airflow scheduling is time-based and task-centric:

Task A → Task B → Task C

But real data pipelines care about data availability, not just task order:

  • “Run this DAG when the table is updated”

  • “Trigger downstream jobs when a file arrives”

  • “React to data, not clocks”

Assets enable this.


What an asset looks like

Defining an asset

from airflow import Dataset  # renamed to Asset in newer versions

orders_asset = Dataset("s3://data/orders/")

(or, in newer Airflow versions:)

from airflow.sdk import Asset

orders_asset = Asset("s3://data/orders/")

The URI is just an identifier, not a connection.
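For instance, a quick sketch (the identifiers below are invented): very different kinds of data can all be declared the same way, because the string is never dereferenced.

from airflow.sdk import Asset  # Airflow 3 import; Airflow 2.x uses `from airflow import Dataset`

# None of these are opened, validated, or connected to by Airflow.
# They are only identifiers that producer and consumer code agree on.
raw_orders = Asset("s3://data/orders/")                      # object-store prefix
orders_table = Asset("postgres://warehouse/public/orders")   # warehouse table expressed as a URI
daily_report = Asset("daily_sales_report")                   # purely logical name (newer versions)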


Producing an asset

@task(outlets=[orders_asset])
def produce_orders():
    ...

This means:

“When this task succeeds, the orders_asset is updated.”
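The same declaration also works on classic operators, since inlets and outlets are ordinary operator arguments. A minimal sketch (the bash command is hypothetical; the BashOperator import path differs between Airflow 2.x and 3):

from airflow.operators.bash import BashOperator  # Airflow 2.x path; Airflow 3 ships it in the standard provider

produce_orders = BashOperator(
    task_id="produce_orders",
    bash_command="python export_orders.py",  # hypothetical export script
    outlets=[orders_asset],                  # asset marked updated when the command succeeds
)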


Consuming an asset

@dag(schedule=[orders_asset])
def downstream_dag():
    ...

This DAG:

  • Has no cron schedule

  • Runs when the asset is updated


Key concept: assets ≠ storage

An asset:

  • ❌ Does NOT store data

  • ❌ Does NOT move data

  • ❌ Does NOT validate data

  • ✅ Represents data state

Think of it as a signal, not the data itself.
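To make that concrete, here is a hedged sketch (bucket, key, and connection id are made up; requires the Amazon provider) in which the task itself writes the file via an S3 hook, while the outlet only emits the freshness signal:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.sdk import Asset, task

orders_asset = Asset("s3://data/orders/")

@task(outlets=[orders_asset])
def export_orders():
    # The task does the actual data movement...
    S3Hook(aws_conn_id="aws_default").load_string(
        "order_id,amount\n1,9.99\n",
        key="orders/latest.csv",
        bucket_name="data",
        replace=True,
    )
    # ...while outlets=[orders_asset] above only signals “orders_asset was updated”.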


Asset vs Task dependency

Task dependency (old model)

Task A → Task B

  • Tight coupling

  • Same DAG

  • Time-based thinking

Asset dependency (new model)

Asset X → DAG B

  • Loose coupling

  • Cross-DAG

  • Data-based thinking


Example end-to-end

Producer DAG

from airflow.sdk import Asset, dag, task

orders = Asset("warehouse.orders")

@dag(schedule="@daily")
def orders_etl():
    @task(outlets=[orders])
    def load_orders():
        ...

    load_orders()

orders_etl()

Consumer DAG

@dag(schedule=[orders])
def reporting():
    @task
    def build_report():
        ...

    build_report()

reporting()

Result:

  • reporting runs whenever orders updates

  • Even if producer DAG name or structure changes


Assets in the Airflow UI

In the Assets tab you’ll see:

  • Asset name / URI

  • Producing tasks

  • Consuming DAGs

  • Last update time

  • Lineage graph

This gives you data lineage, not just task lineage.


When you should use assets

✅ Decoupling producer and consumer DAGs
✅ Cross-DAG dependencies
✅ Event-driven pipelines
✅ Data freshness-based scheduling
✅ Clear data lineage


When you should NOT use assets

❌ Passing data between tasks
❌ Large or intermediate data exchange
❌ Tight step-by-step workflows within one DAG

(Use normal task dependencies or external storage instead.)
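For contrast, a minimal sketch of the in-DAG alternative (Airflow 3 imports assumed; on 2.x use airflow.decorators): small values flow between tasks through ordinary TaskFlow dependencies, XCom under the hood, with no assets involved.

from airflow.sdk import dag, task

@dag(schedule=None)
def inline_pipeline():
    @task
    def extract() -> dict:
        return {"rows": 42}        # small payload passed via XCom

    @task
    def load(payload: dict):
        print(payload["rows"])

    load(extract())                # plain task dependency within one DAG

inline_pipeline()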


Mental model

Think of assets like “data events”:

  • Task succeeds → asset updated

  • Asset updated → downstream DAGs triggered


One-sentence summary

An Airflow asset is a declarative signal that a specific piece of data has been created or updated, enabling data-driven scheduling and lineage across DAGs.

 

from __future__ import annotations

import pendulum

from airflow.sdk import Asset, asset, dag, task


@asset(uri='s3://bucket/asset1_producer', schedule=None)
def asset1_producer():
    pass


@asset(uri='s3://bucket/object', schedule=None)
def asset2_producer(self, context, asset1_producer):
    print(self)
    print(context['inlet_events'][asset1_producer])


@dag(
    schedule=Asset(uri='s3://bucket/asset1_producer', name='asset1_producer') | 
             Asset(uri='s3://bucket/object', name='asset2_producer'),
    start_date=pendulum.datetime(2021, 1, 1, tz='UTC'),
    catchup=False,
    tags=['consumes', 'asset-scheduled']
)
def consumes_asset_decorator():
    @task(outlets=[Asset(name='process_nothing')])
    def process_nothing():
        pass

    process_nothing()


consumes_asset_decorator()

What does the parameter outlets do?

 

In Airflow, the outlets parameter tells Airflow which assets a task produces (updates) when it successfully finishes.

Let’s break it down clearly using your example.


What outlets means

outlets declares the assets that this task emits (updates) on success.

In plain terms:

  • When process_nothing succeeds

  • Airflow marks the asset process_nothing as updated

  • Any DAGs that depend on this asset become eligible to run


Why it’s called outlets

Airflow uses data-flow terminology:

Term      Meaning
inlets    Assets a task consumes
outlets   Assets a task produces

So:

  • Data flows into tasks via inlets

  • Data flows out of tasks via outlets
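A plain task can declare inlets as well and, in recent Airflow versions, read the recorded events for those assets from the task context. A hedged sketch (the asset and task names are assumptions):

from airflow.sdk import Asset, task

orders = Asset("s3://data/orders/")

@task(inlets=[orders])
def summarize_orders(**context):
    # Events recorded each time a producer updated the declared inlet asset.
    events = context["inlet_events"][orders]
    print(f"orders asset has {len(events)} recorded update(s)")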


Important clarifications

1️⃣ outlets does NOT move data

Your task:

def process_nothing():
    pass

Still:

  • Writes no data

  • Returns nothing

  • Touches no storage

Yet the asset is marked as updated.

📌 Assets are signals, not data.


2️⃣ Assets update only on success

  • If the task fails → asset is not updated

  • If the task is skipped → asset is not updated

  • Success is the trigger


3️⃣ Assets are global, not DAG-scoped

The asset name:

Asset(name="process_nothing")

  • Is globally visible across Airflow

  • Can be consumed by any DAG

  • Creates cross-DAG dependencies


How this is used downstream

Consumer DAG example

@dag(schedule=[Asset(name="process_nothing")])
def consumer():
    @task
    def react():
        print("Asset updated!")

    react()

consumer()

This DAG runs:

  • When process_nothing succeeds

  • Even if it’s in a completely different DAG
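The consumer can also inspect which asset events caused the run via the triggering-events entry in the task context (triggering_asset_events in Airflow 3, triggering_dataset_events in 2.x). A hedged sketch:

from airflow.sdk import Asset, dag, task

@dag(schedule=[Asset(name="process_nothing")])
def consumer_with_events():
    @task
    def react(**context):
        # Maps each triggering asset to the list of events that caused this run.
        for key, events in context["triggering_asset_events"].items():
            print(key, "->", len(events), "event(s)")

    react()

consumer_with_events()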


What shows up in the UI

In the Assets view:

  • Asset: process_nothing

  • Producer: process_nothing task

  • Consumers: any DAG scheduled on it

  • Last updated timestamp


When outlets is useful

✅ Marking data availability
✅ Event-driven pipelines
✅ Decoupling producer/consumer DAGs
✅ Lineage & observability


When it’s misleading

⚠️ If the task doesn’t actually produce data
⚠️ If asset names don’t reflect real data

Best practice:

Only declare an outlet when the task truly updates the represented data.


One-line summary

outlets tells Airflow “this task updates these assets when it succeeds,” enabling data-driven scheduling and lineage.

 

How many DAGs will the above code generate? What are their dag_ids?

 

This code will generate 3 DAGs.

Let’s go through them one by one and explain why, because the @asset decorator behaves differently from @dag.


1️⃣ DAG from @asset: asset1_producer

@asset(uri="s3://bucket/asset1_producer", schedule=None)
def asset1_producer():
    pass

What this does

  • @asset creates an asset-producing DAG

  • Even though it looks like a function, it is actually turned into:

    • a DAG

    • with a single task

    • that updates the asset s3://bucket/asset1_producer

DAG ID

asset1_producer

2️⃣ DAG from @asset: asset2_producer

@asset(uri="s3://bucket/object", schedule=None)
def asset2_producer(self, context, asset1_producer):
    ...

What this does

  • This is another asset-defined DAG

  • It:

    • Produces asset s3://bucket/object

    • Consumes asset asset1_producer (via the function argument)

  • The parameters self, context, and asset1_producer are injected by Airflow’s asset system

DAG ID

asset2_producer

📌 Even though it consumes asset1_producer, it is still its own DAG.


3️⃣ DAG from @dag: consumes_asset_decorator

@dag(
    schedule=Asset(uri="s3://bucket/asset1_producer")
             | Asset(uri="s3://bucket/object"),
    ...
)
def consumes_asset_decorator():
    ...

What this does

  • This is a regular DAG

  • It is asset-scheduled

  • It runs when either:

    • s3://bucket/asset1_producer updates

    • OR s3://bucket/object updates

(The | operator means OR for assets.)

DAG ID

consumes_asset_decorator

Final answer

✅ Number of DAGs generated

3 DAGs

✅ DAG IDs

Source             DAG ID
@asset decorator   asset1_producer
@asset decorator   asset2_producer
@dag decorator     consumes_asset_decorator

Mental model (very important)

  • @asset always creates a DAG

  • Each @asset function = one DAG

  • @dag creates a DAG as usual

  • Assets connect DAGs by data, not by code structure


What this looks like in the UI

You will see:

  • 3 DAGs in the DAG list

  • An asset lineage graph:

    asset1_producer ──▶ asset2_producer
           │                  │
           └────────┬─────────┘
                    ▼
         consumes_asset_decorator

If you want, I can also explain:

  • how the injected context["inlet_events"] works

 
Yes, please.
 

Great, let’s go through this clearly and concretely, using your exact code as the anchor.


context["inlet_events"]: what it is and how it works

What problem it solves

When a DAG or asset runs because an asset was updated, you often want to know:

  • Which asset triggered this run?

  • When was it updated?

  • Which task produced it?

That information is exposed via inlet events.


What inlet_events contains

Inside an asset-decorated function:

context["inlet_events"]

is a dictionary:

{
    Asset(...) → [AssetEvent, AssetEvent, ...]
}

Each AssetEvent represents one update of that asset.


Your example

@asset(uri="s3://bucket/object", schedule=None)
def asset2_producer(self, context, asset1_producer):
    print(context["inlet_events"][asset1_producer])

Here:

  • asset1_producer (the function argument) is the Asset object

  • Airflow injects it because this asset depends on it

  • context["inlet_events"][asset1_producer] returns a list of events

Example content (simplified):

[
    AssetEvent(
        asset=Asset("s3://bucket/asset1_producer"),
        source_task_id="asset1_producer",
        source_dag_id="asset1_producer",
        timestamp=2025-01-01T00:00:00Z,
    )
]

Why it’s a list

  • Multiple upstream asset updates can trigger a run

  • Backfills can create multiple events

  • The list preserves full lineage


When this is useful

✅ Incremental processing
✅ Auditing & observability
✅ Data freshness logic
✅ Debugging asset-triggered runs
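For example, an asset-decorated function could use the most recent event’s timestamp as a simple watermark for incremental work. A hedged sketch mirroring your code (the derived asset’s URI is hypothetical):

from airflow.sdk import asset

@asset(uri="s3://bucket/derived_object", schedule=None)
def watermark_consumer(context, asset1_producer):
    events = context["inlet_events"][asset1_producer]
    if events:
        latest = events[-1]  # most recent update of the upstream asset
        print("only process data newer than", latest.timestamp)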

 

Asset schedule operators

Operator   Meaning                      Typical use
|          Any asset updated (OR)       React as soon as any one source refreshes
&          All assets updated (AND)     Joins, reports, aggregates
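A hedged sketch of both operators (asset names invented; Airflow 3 imports, with Dataset and airflow.decorators as the 2.x equivalents):

from airflow.sdk import Asset, dag, task

orders = Asset("orders")
customers = Asset("customers")

@dag(schedule=orders | customers)        # OR: run when either asset updates
def refresh_cache():
    @task
    def refresh():
        ...

    refresh()

@dag(schedule=orders & customers)        # AND: run only once both have updated
def build_join_report():
    @task
    def join_and_report():
        ...

    join_and_report()

refresh_cache()
build_join_report()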