ZhangZhihui's Blog  

Temporary Tables

A table that has been created as a temporary table will only be visible to the current session. Data will be stored in the user’s scratch directory, and deleted at the end of the session.

If a temporary table is created with a database/table name of a permanent table which already exists in the database, then within that session any references to that table will resolve to the temporary table, rather than to the permanent table. The user will not be able to access the original table within that session without either dropping the temporary table, or renaming it to a non-conflicting name.

Temporary tables have the following limitations:

  • Partition columns are not supported.
  • No support for creation of indexes.

 

CREATE TEMPORARY TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING);

 

While CTEs (Common Table Expressions) are generally cleaner and easier to read, there are several technical scenarios where a Temporary Table is objectively better.

The choice usually comes down to reusability and optimization.


1. Multiple References (Performance)

A CTE is just a "named subquery." If you reference a CTE three times in your main query, Hive might execute the logic of that CTE three separate times (depending on the optimizer's version and complexity).

A Temporary Table is materialized. Hive calculates the result once, saves it to the scratch directory, and then simply reads the file for every subsequent reference.

  • Use a Temporary Table if you are going to join or reference that specific result set multiple times in a long script.

2. Multi-Step Logic

CTEs are limited to a single query block. If your data pipeline requires 10 steps of logic, a CTE becomes a massive, unreadable "spaghetti query."

Temporary tables allow you to break the logic into discrete steps:2

  1. Filter the data (Store in tmp1).

  2. Aggregate the data (Store in tmp2).

  3. Perform the final Join.

3. Optimizer Statistics (Join Performance)

This is the most critical technical reason.

  • CTEs: Since they aren't materialized, the Hive optimizer has to "guess" how many rows the CTE will produce. If it guesses wrong, it might choose a slow Shuffle Join instead of a fast Map Join.

  • Temporary Tables: After creating a temporary table, you can run ANALYZE TABLE ... COMPUTE STATISTICS. This tells Hive exactly how big the table is, allowing the engine to choose the most efficient execution plan (like a Map-side join).


4. Summary Comparison

Feature CTE (WITH clause) Temporary Table
Scope One single query Entire Beeline session
Storage Virtual (Logic only) Physical (Materialized in /tmp)
Optimization Optimizer makes estimates Can run ANALYZE for perfect stats
Debugging Hard (Must run whole query) Easy (Can SELECT * from it)
Syntax WITH name AS (...) CREATE TEMPORARY TABLE ...

When to use which?

  • Use a CTE for simple, one-off logic where you just want to avoid nesting subqueries. It’s better for readability and doesn't leave "footprints" in the file system.

  • Use a Temporary Table if the logic is heavy (takes minutes to run), if you need to reuse the results across multiple SQL statements, or if you need to force a specific join optimization by providing the optimizer with accurate statistics.

 

posted on 2026-01-13 17:41  ZhangZhihuiAAA  阅读(0)  评论(0)    收藏  举报