ZhangZhihui's Blog  

1. The Problem: Fact-to-Fact Joins

In a data warehouse, fact tables are very large (millions/billions of rows).

Sometimes analysts want to combine two different metrics from different fact tables. For example:

  • Sales_Fact with measures like sales_amount, units_sold

  • Inventory_Fact with measures like on_hand_quantity

If you write a query like:

SELECT 
   s.product_id,
   s.store_id,
   s.sales_amount,
   i.on_hand_quantity
FROM Sales_Fact s
JOIN Inventory_Fact i
  ON s.product_id = i.product_id
 AND s.store_id   = i.store_id
 AND s.date_id    = i.date_id;

 

👉 This is a fact-to-fact join.
It’s problematic because:

  • Both fact tables are huge → performance disaster.

  • Grain may not align (Sales is at transaction grain, Inventory is at daily snapshot grain).

  • Double-counting and incorrect aggregations can happen.


2. The Multipass SQL Solution

Instead of directly joining fact tables, we use multipass SQL:

  1. First pass: Aggregate each fact table independently to the right level of detail (usually using conformed dimensions).

  2. Second pass: Join the aggregated results (much smaller) together.

Example:

-- Pass 1: Aggregate sales by product, store, day
WITH sales AS (
  SELECT 
    product_id,
    store_id,
    date_id,
    SUM(sales_amount) AS total_sales
  FROM Sales_Fact
  GROUP BY product_id, store_id, date_id
),

-- Pass 2: Aggregate inventory by product, store, day
inventory AS (
  SELECT 
    product_id,
    store_id,
    date_id,
    SUM(on_hand_quantity) AS total_inventory
  FROM Inventory_Fact
  GROUP BY product_id, store_id, date_id
)

-- Pass 3: Combine the aggregated results
SELECT 
  s.product_id,
  s.store_id,
  s.date_id,
  s.total_sales,
  i.total_inventory
FROM sales s
LEFT JOIN inventory i
  ON s.product_id = i.product_id
 AND s.store_id   = i.store_id
 AND s.date_id    = i.date_id;

 

✅ Now the join is between small aggregated tables, not massive fact tables.
✅ Avoids mis-grain joins because aggregation happens first.
✅ More efficient and semantically correct.


3. When to Use Multipass SQL

  • When users want to combine metrics from different processes (sales, inventory, shipments, returns, etc.).

  • When fact tables are at different grains (transaction-level vs. snapshot-level).

  • When fact-to-fact join would otherwise cause performance or correctness problems.


4. Related Design Principle

  • This is part of the "separate fact tables, conformed dimensions" strategy in dimensional modeling.

  • OLAP tools (like MicroStrategy, BusinessObjects, Cognos) often generate multipass SQL automatically behind the scenes to support cross-fact analysis.


👉 In short:
Multipass SQL means aggregating each fact table separately, then combining the smaller result sets, instead of directly joining giant fact tables.
It’s all about performance and avoiding mis-grain mistakes.

 

posted on 2025-08-18 09:56  ZhangZhihuiAAA  阅读(7)  评论(0)    收藏  举报