ZhangZhihui's Blog  

🔹 1. Late Arriving Facts (a.k.a. Early Arriving Facts)

  • The fact record itself (transaction, order line, etc.) comes in earlier than some of its required dimension context.

  • Example:

    • You receive an Order Fact for Customer C789.

    • But Customer C789 doesn’t exist yet in the Customer Dimension.

  • Problem: Fact can’t find its foreign key in the dimension.

👉 Handling: Insert a dummy/placeholder dimension row so the fact can load, then update when the dimension arrives.


🔹 2. Late Arriving Dimensions

  • The dimension record itself arrives into the warehouse later than it should have — often after facts that reference it have already been loaded.

  • Example:

    • You already loaded Sales Facts for Customer C789 (using a placeholder row).

    • Days later, the real Customer C789 dimension record arrives.

👉 Handling: Update the placeholder row with the correct attributes when the late dimension arrives.


🔹 The Difference in Perspective

  • Late Arriving Fact → The fact row is “too early,” i.e., arrives before its dimensions are ready.

  • Late Arriving Dimension → The dimension row is “too late,” i.e., arrives after facts that reference it.

They describe the same timing mismatch, but from different angles:

  • If you’re looking at the fact ETL pipeline, you call it a late arriving fact problem.

  • If you’re looking at the dimension ETL pipeline, you call it a late arriving dimension problem.


✅ So:

  • They are related and often occur together.

  • But the naming emphasizes which side of the ETL flow you are focusing on (fact vs. dimension).

 

Here’s a step-by-step ETL workflow diagram for handling late arriving facts and late arriving dimensions with surrogate keys and placeholders.


🛠 Workflow for Late Arriving Facts & Dimensions

          ┌────────────────────┐
          │ Source Systems     │
          │  - CRM (Customers) │
          │  - POS (Sales)     │
          └─────────┬──────────┘
                    │
            ┌───────▼────────┐
            │ Staging Tables │
            │  - Stg_Customer│
            │  - Stg_Sales   │
            └───────┬────────┘
                    │
        ┌───────────▼────────────────────┐
        │ Load Dimensions First           │
        │ (Dim_Customer)                  │
        │   - If Customer exists → Update │
        │   - If new Customer → Insert    │
        │   - If Customer missing but     │
        │     fact arrives → Insert       │
        │     placeholder (Unknown)       │
        └───────────┬────────────────────┘
                    │
        ┌───────────▼────────────────────┐
        │ Load Facts (Fact_Sales)         │
        │   - Join on Business Key        │
        │   - Resolve to Surrogate Key    │
        │   - Always succeeds because     │
        │     placeholder ensures FK      │
        │     exists                      │
        └───────────┬────────────────────┘
                    │
        ┌───────────▼────────────────────┐
        │ Late Arriving Dimensions        │
        │   - When Customer arrives later │
        │   - Update placeholder row in   │
        │     Dim_Customer with real data │
        │   - Facts remain valid because  │
        │     surrogate key is unchanged  │
        └────────────────────────────────┘

 


🔑 Key Flow

  1. Staging Layer

    • Capture raw data from source systems (Sales, CRM).

  2. Dimension Load

    • Load or update dimension rows.

    • If dimension is missing but fact exists → insert placeholder row with surrogate key.

  3. Fact Load

    • Facts join to dimensions using business keys, then load surrogate keys.

    • Placeholders guarantee referential integrity.

  4. Late Arriving Dimension Handling

    • When the dimension record finally arrives, update the placeholder.

    • Surrogate key stays the same → no need to touch fact table.


✅ This is the standard Kimball-style handling for early arriving facts (late arriving facts) and late arriving dimensions.
✅ It ensures fact tables are always loadable, and later corrections flow only into dimension rows, without rewriting fact history.

 

 

Let’s build on the late arriving fact/dimension scenario we discussed and show real SQL insert examples for fact table loads from staging.


📂 Setup

Dimension Table

 
CREATE TABLE Dim_Customer ( customer_sk INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key customer_id VARCHAR(50) UNIQUE, -- business key from source customer_name VARCHAR(200), city VARCHAR(100), is_placeholder BIT DEFAULT 0 );

Fact Table

 
CREATE TABLE Fact_Sales ( sales_id INT PRIMARY KEY, customer_sk INT NOT NULL, -- FK to Dim_Customer sales_date DATE, amount DECIMAL(12,2) );

Staging Tables

 
CREATE TABLE Stg_Sales ( sales_id INT, customer_id VARCHAR(50), sales_date DATE, amount DECIMAL(12,2) ); CREATE TABLE Stg_Customer ( customer_id VARCHAR(50), customer_name VARCHAR(200), city VARCHAR(100) );

🟢 1. Load Dimension First (with placeholder logic)

 
-- Insert new customers into Dim_Customer INSERT INTO Dim_Customer (customer_id, customer_name, city, is_placeholder) SELECT s.customer_id, s.customer_name, s.city, 0 FROM Stg_Customer s WHERE NOT EXISTS ( SELECT 1 FROM Dim_Customer d WHERE d.customer_id = s.customer_id );

🟢 2. Handle Late Arriving Facts (fact comes before dimension)

Before inserting facts, ensure placeholder customers exist:

 
-- Insert placeholder for missing customers in staging facts INSERT INTO Dim_Customer (customer_id, customer_name, city, is_placeholder) SELECT DISTINCT s.customer_id, 'Unknown', 'Unknown', 1 FROM Stg_Sales s WHERE NOT EXISTS ( SELECT 1 FROM Dim_Customer d WHERE d.customer_id = s.customer_id );

🟢 3. Insert Facts (using surrogate keys)

Now we can safely insert facts, resolving business key → surrogate key:

 
INSERT INTO Fact_Sales (sales_id, customer_sk, sales_date, amount) SELECT s.sales_id, d.customer_sk, s.sales_date, s.amount FROM Stg_Sales s JOIN Dim_Customer d ON s.customer_id = d.customer_id;

✅ Every fact row will find a matching customer_sk because placeholders guarantee referential integrity.


🟢 4. Handle Late Arriving Dimensions (update placeholders later)

 
-- Update placeholders when real customer data arrives UPDATE d SET d.customer_name = s.customer_name, d.city = s.city, d.is_placeholder = 0 FROM Dim_Customer d JOIN Stg_Customer s ON d.customer_id = s.customer_id WHERE d.is_placeholder = 1;

🔎 Example in Action

  1. Sales fact arrives first (no customer dimension yet):

     
    INSERT INTO Stg_Sales VALUES (1001, 'C789', '2025-08-15', 500.00);
  2. ETL ensures placeholder:

     
    INSERT INTO Dim_Customer (customer_id, customer_name, city, is_placeholder) VALUES ('C789', 'Unknown', 'Unknown', 1);
  3. Fact is loaded with surrogate key for C789:

     
    INSERT INTO Fact_Sales (sales_id, customer_sk, sales_date, amount) VALUES (1001, 5, '2025-08-15', 500.00); -- assuming SK=5
  4. Later, real customer data arrives:

     
    INSERT INTO Stg_Customer VALUES ('C789', 'Alice Zhang', 'Shanghai');
  5. Update placeholder:

     
    UPDATE Dim_Customer SET customer_name = 'Alice Zhang', city = 'Shanghai', is_placeholder = 0 WHERE customer_id = 'C789';

✅ Fact row remains valid — it still points to customer_sk=5, which now has real attributes.

 

posted on 2025-08-18 11:35  ZhangZhihuiAAA  阅读(13)  评论(0)    收藏  举报